How SNAP does K-Means clustering so fast?

mehpistogoth · September 15, 2024, 1:58pm

I have read SNAP source code for some time and rewrite the code of K-Means clustering in C#. The code structure is almost the same but my code is almost ten times slower than SNAP. For an image of 7 bands and 2000*1500 pixels, SNAP can done the K-Means clustering of 8 clusters and 15 iterations in almost 3 seconds, my code needs 15-20 seconds. I have to use multi-threading in my code to achieve 5 seconds. The question is how SNAP done it so fast since I didn’t find any multi-threading or SIMD in the SNAP K-Means clustering source code. Thanks!

mengdahl · September 15, 2024, 6:58pm

I think the Java Virtual Machine (JVM) is taking care of multithreading in the background. Some Java specialist would know much better than myself. @TomBlock

mehpistogoth · September 16, 2024, 2:45am

The thing is you can not do multihreading all the time in k-means clustering, only some part of it. At least you can not do multithreading with the code in KMeansClusterer.java, it will surely cause synchronization error at code line 75-78.

TomBlock · September 16, 2024, 7:33am

Good morning mehpistogoth,

I assume that you run the “KMeansClusterOp” as operator. The SNAP Operator/GPF framework realises the parellelism in the background, without the user/software developers requiring to think about parallelism (most of the times).

The Operator framework distributes the loads to all available cores on the target system and computes on a per-tile basis (say 512 x 512 px) in parallel.

The KMeansClusterer does not need to synchronize, as the operator creates a separate instance of the class for each tile, hence each clusterer runs in a single thread.

Hope this helps,

Tom

mehpistogoth · September 16, 2024, 5:34pm

Thank you for your replying.
I assume that GPF will divide the source images into tiles then run every tile in parallel.
You can create a clusterer for each tile to do the clustering work(claculating distances) independently，without knowing any information about other tiles. But in the end of every iteration, after the clustering work is done, the center(means) and member_count of all clusters will be recalculated, and the calculation needs the data(sums and member_count) of all the clusters(tiles). That’s the synchronize(reduce) job, so it’s wrong to say that the KMeansClusterer does not need to synchronize, or the result would be ridiculous if you set every tile to 1 pixel instead of 512 * 512. The question is I didn’t find any code to do the synchronize job in the source code of KMeansCluster operator. I am pretty sure SNAP did it somewhere somehow, but where and how?