GPT performance better on 6-core i7 vs 2x18-core Xeons

I’ve got a lot of S1a scenes that I’m processing with the SNAP toolbox v3.0, using the following steps (a rough sketch of the corresponding graph follows the list):

  1. Radiometric calibration
  2. Terrain flattening
  3. Terrain correction
  4. Linear to dB conversion
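
In graph form, the chain looks roughly like the sketch below. This is a stripped-down, hand-written example, assuming the standard SNAP operator names (Calibration, Terrain-Flattening, Terrain-Correction, LinearToFromdB); the node parameters and the INPUT_SCENE/OUTPUT_SCENE tokens are placeholders rather than my actual settings:

# Write a minimal graph template chaining the four steps.
# Operator names assumed from the SNAP operator list; parameters illustrative.
cat > graph_template.xml <<'EOF'
<graph id="S1a_chain">
  <version>1.0</version>
  <node id="Read">
    <operator>Read</operator>
    <sources/>
    <parameters><file>INPUT_SCENE</file></parameters>
  </node>
  <node id="Cal">
    <operator>Calibration</operator>
    <sources><sourceProduct refid="Read"/></sources>
    <!-- terrain flattening wants beta0 as input -->
    <parameters><outputBetaBand>true</outputBetaBand></parameters>
  </node>
  <node id="TF">
    <operator>Terrain-Flattening</operator>
    <sources><sourceProduct refid="Cal"/></sources>
  </node>
  <node id="TC">
    <operator>Terrain-Correction</operator>
    <sources><sourceProduct refid="TF"/></sources>
  </node>
  <node id="dB">
    <operator>LinearToFromdB</operator>
    <sources><sourceProduct refid="TC"/></sources>
  </node>
  <node id="Write">
    <operator>Write</operator>
    <sources><sourceProduct refid="dB"/></sources>
    <parameters><file>OUTPUT_SCENE</file><formatName>BEAM-DIMAP</formatName></parameters>
  </node>
</graph>
EOF

Each node’s sourceProduct reference chains it to the previous step; gpt resolves the chain and processes it tile by tile.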

Both terrain flattening (TF) and terrain correction (TC) use a subset of a 200 m DEM covering the region of interest, and the TC step also reprojects the data to a polar stereographic projection. The results look good on initial inspection.

For the processing scheme, I’ve created a graph XML file using the SNAP gui. I then use some shell scripting to update this XML file for each of the S1a scenes I have in a specific folder. Then I run GPT for each of the XML files, one at a time.
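
The scripting itself is nothing fancy; a minimal sketch of the idea, with placeholder paths rather than my actual layout:

# For each scene, substitute the input/output paths into the template,
# then run gpt on the resulting graph, one scene at a time.
for scene in /data/s1a/*.zip; do
  out="/data/out/$(basename "${scene%.zip}")_TC_dB.dim"
  sed -e "s|INPUT_SCENE|${scene}|" \
      -e "s|OUTPUT_SCENE|${out}|" graph_template.xml > current_graph.xml
  gpt current_graph.xml
done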

I have tested this on two systems:

  1. Mac OS X with a 6-core Intel i7 processor, 32 GB RAM, and fast RAID storage.
  2. Ubuntu Linux 16.04 with two 18-core Intel Xeon E5 processors (36 cores total), 64 GB RAM, and a fast NVMe SSD.

The Mac is far quicker, despite having weaker specs. The only advantage the Mac has is a higher clock speed.

The problem appears to be in how the CPUs are being utilized. On the Mac, all 12 threads show high CPU usage. On the Linux server, only 36 of the available 72 threads are used: in htop, threads 1-36 all show high usage, while threads 37-72 show little to no usage.
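
One experiment I may try is pinning gpt to a single socket to see whether NUMA is a factor (the mapping of htop thread numbers to sockets and hyperthread siblings varies by machine, so this is just a guess; the graph name below is a placeholder):

# Show the NUMA node / CPU layout, then run gpt bound to node 0 only.
numactl --hardware
numactl --cpunodebind=0 --membind=0 gpt my_graph.xml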

Mac htop output: [screenshot]

Linux htop output: [screenshot]

Any idea what’s going on here? I would expect that even if only one CPU were being stressed on the Linux machine, it would still be faster than a 6-core/12-thread machine.

What about disk I/O? Maybe that’s the bottleneck?

I’d also point out the following post, which seems connected to this one:

It would be nice if you could do a quick test to confirm this.

Two hypotheses come to mind:

  1. The CPUs are waiting on block-by-block reads (i.e., small blocks being read from disk one at a time). In this case the I/O bottleneck is the answer, though this would not explain why using the GUI is faster.
  2. The blocks to be processed are held in RAM, but they are too small and are processed very quickly (in this case the overhead is larger than the processing time).

Then, together with the developers, it would be useful to dig into this problem and eventually fix it by understanding how to properly set gpt.vmoptions.
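
For reference, these are the knobs I have in mind; the file locations and property names are from my reading of the SNAP configuration (so worth double-checking), and the values are purely illustrative:

# Raise the gpt JVM heap (gpt.vmoptions) and adjust the JAI tile cache
# and parallelism (snap.properties). Values are illustrative only.
echo "-Xmx32G" >> "$SNAP_INSTALL_DIR/bin/gpt.vmoptions"
cat >> "$SNAP_INSTALL_DIR/etc/snap.properties" <<'EOF'
snap.jai.tileCacheSize=16384
snap.parallelism=18
EOF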

Thanks

I’ve tested performance on the two systems below. I don’t think I am seeing the same problem noted above. I’m seeing pretty comparable performance in the GUI and command line.

Mac OS X (6-cores):

  • SNAP 3.0 GUI: 3m 56s
  • gpt on XML: 3m 23s

Linux (36-cores):

  • SNAP 3.0 GUI: 8m 54s
  • gpt on XML: 9m 30s

Apparently the LinearTodB operator is not available in the Linux SNAP, only LinearToFromdB (both are available in OS X). This leads me to believe there’s a real difference between the Mac and Linux builds of SNAP, despite both being version 3.0.

In terms of disk performance, the machines should be comparable. The Linux drive I’m using is a Samsung 950 Pro NVMe SSD, rated at roughly 2000 MB/s for sequential read/write. I could do some disk benchmarking though…
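
Something simple like dd would do for a first pass (paths illustrative; the direct flags bypass the page cache so the numbers reflect the drive, not RAM):

# Sequential write then read of a 4 GiB test file.
dd if=/dev/zero of=/data/testfile bs=1M count=4096 oflag=direct
dd if=/data/testfile of=/dev/null bs=1M iflag=direct
rm /data/testfile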

edit: the Linux drive is far faster than the RAID I’m running on the Mac.

The default tiling will process one tile per processor. It could be that with so many cores, the bottleneck is access to some common resource, such as all threads trying to read different parts of the same file. Try forcing the parallelization to be the same on both systems, using the -q option in gpt.

In the gpt -h help it says the default is 8. I wonder if, with this default, it won’t use any more than 8 cores.

I forced gpt to use 18 threads on my Linux machine and the processing time dropped to 7 minutes. This appears to be the optimum number; anything more or less produces slower results. I don’t know what’s going on, but 18 cores should still be faster than 6.

By default, gpt appears to try and use all threads available (12 on my Mac, 72 on the Linux server). So the times in my previous post are “using” all available threads.
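
For what it’s worth, a quick sweep like the one below is how I’d look for the optimum on a given machine (the graph name is a placeholder):

# Time gpt at several -q settings; %E prints elapsed wall-clock time.
for q in 6 12 18 24 36 72; do
  /usr/bin/time -f "q=$q: %E" gpt my_graph.xml -q "$q"
done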

Hi Polar,

I am wondering how you forced gpt to use 18 threads. I tried adding ‘-Dsnap.parallelism=1’ to the script that calls gpt, and adding “snap.parallelism = 1” to the file ${SNAP_INSTALL_DIR}/etc/snap.properties, but neither works. I am on a shared Linux server, so I want to limit the number of threads gpt uses.

Thanks

I was using the gpt command line utility with the -q flag; in the case above, -q 18.

  -q <parallelism>   Sets the maximum parallelism used for the computation,
                     i.e. the maximum number of parallel (native) threads.
                     The default parallelism is ‘12’.
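
So on a shared server, something like this should keep gpt to a few threads; the -c tile-cache size and the graph name are illustrative (check gpt -h on your install for the exact option syntax):

# Cap gpt at 4 parallel threads and (illustratively) a 2 GB tile cache.
gpt my_graph.xml -q 4 -c 2048M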