Hello! I’m trying to debug a toolbox workflow for Sentinel-3 products that most of the time succeeds quickly but occasionally hangs until we time the job out and kill it to reclaim resources.
We have an XML graph with four steps:
<graph id="Graph">
<version>1.0</version>
<node id="Read">
<operator>Read</operator>
<sources/>
<parameters class="com.bc.ceres.binding.dom.XppDomElement">
<file>MANIFEST_PATH</file>
<formatName>FORMAT_NAME</formatName>
</parameters>
</node>
<node id="Reproject">
<operator>Reproject</operator>
<sources>
<sourceProduct refid="Read"/>
</sources>
<parameters class="com.bc.ceres.binding.dom.XppDomElement">
<wktFile/>
<crs>CRS</crs>
<resampling>RESAMPLING</resampling>
<orthorectify>false</orthorectify>
<elevationModelName/>
<noDataValue>NaN</noDataValue>
<includeTiePointGrids>true</includeTiePointGrids>
<addDeltaBands>false</addDeltaBands>
</parameters>
</node>
<node id="Convert-Datatype">
<operator>Convert-Datatype</operator>
<sources>
<sourceProduct refid="Reproject"/>
</sources>
<parameters class="com.bc.ceres.binding.dom.XppDomElement">
<sourceBands>SOURCE_BANDS</sourceBands>
<targetDataType>DATA_TYPE</targetDataType>
<targetScalingStr>Truncate</targetScalingStr>
<targetNoDataValue>NO_DATA_VALUE</targetNoDataValue>
</parameters>
</node>
<node id="Write">
<operator>Write</operator>
<sources>
<sourceProduct refid="Convert-Datatype"/>
</sources>
<parameters class="com.bc.ceres.binding.dom.XppDomElement">
<file>OUTPUT_FILE</file>
<formatName>GeoTIFF-BigTIFF</formatName>
<writeEntireTileRows>true</writeEntireTileRows>
<clearCacheAfterRowWrite>true</clearCacheAfterRowWrite>
</parameters>
</node>
...
</graph>
The handful of all-caps values are replaced at runtime with values calculated from the input, such as its coordinates and bands. In this example, we’re trying to process S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003
.
I’m running SNAP 9.0.0 on a Unix virtual machine with 16 cores and 64 GB of RAM. Options I’m passing to the gpt command include:
- -c 35G : We have -Xmx55G in the vmoptions file, so while this is lower than recommended for the cache as a percentage, --diag indicates that we’re only seeing 48.9G available, so 35G of that for the cache makes more sense. Setting -c higher results in “Cannot construct Databuffer” after just a few minutes.
- -q 16 : I’ve noticed our toolbox job only taking ~15-20% of available CPU even with this value set as high as 70, so maybe it can take more. Or maybe it doesn’t matter because we’re trying to do something that doesn’t parallelize well, or at all.
- No changes to options like -J-Dsnap.jai.defaultTileSize or -J-Dsnap.dataio.bigtiff.tiling.*, as it wasn’t clear to me whether the problem is not enough parallelization (so smaller tiles could help) or something else.
I tried to isolate the issue by stringing together the graph’s operations into a series of explicit gpt calls, and it gets stuck on step 2:
Log file created at: 2023/07/26 13:19:13
0726 13:20:28.843435] Run GPT for band: /tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1.tif (1/3)
0726 13:20:28.848320] Executing stage 1: Read
0726 13:20:28.848555] Running /bin/sh /tmp/tmpnzt11qom_input/snap/bin/gpt Read -Pfile=/tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003.SEN3/xfdumanifest.xml -PformatName=Sen3 -Duser.home=/tmp/tmpnzt11qom_input -e -x -c 35G -q 16 -Dsnap.log.level=ALL -t /tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1_Read0.dim
0726 13:20:47.821486] WARNING: org.esa.snap.core.util.VersionChecker: Not able to check for new SNAP version. Remote version could not be retrieved.
0726 13:20:57.940085] Executing operator...
0726 13:21:00.103345] 20%....30%....40%....50%....60%....70%....80%....90%.... done.
0726 13:21:00.112079] INFO: org.esa.snap.core.gpf.common.WriteOp: Start writing product S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003.SEN3 to /tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1_Read0.dim
0726 13:21:00.112443] Writing...
0726 13:21:00.134045] INFO: java.util.prefs.FileSystemPreferences$1: Created user preferences directory.
0726 13:21:00.764693] INFO: org.hsqldb.persist.Logger: dataFileCache open start
0726 13:21:50.080088] ....10%....20%....30%....40%....50%....60%....70%....80%....90%....100% done.
0726 13:21:50.080255] INFO: org.esa.snap.core.gpf.common.WriteOp: End writing product S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1_Read0 to /tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1_Read0.dim
0726 13:21:50.082269] INFO: org.esa.snap.core.gpf.common.WriteOp: Time: 49.975 s total, 12.216 ms per line, 0.002511 ms per pixel
0726 13:21:50.276671] Executing stage 2: Reproject
0726 13:21:50.276962] Running /bin/sh /tmp/tmpnzt11qom_input/snap/bin/gpt Reproject -Pcrs=AUTO:42001 -Presampling=Bicubic -Porthorectify=false -PnoDataValue=NaN -PincludeTiePointGrids=true -PaddDeltaBands=false -Duser.home=/tmp/tmpnzt11qom_input -e -x -c 35G -q 16 -Dsnap.log.level=ALL -t /tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1_Reproject1.dim /tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1_Read0.dim
0726 13:21:51.812028] INFO: org.esa.snap.core.gpf.operators.tooladapter.ToolAdapterIO: Initializing external tool adapters
0726 13:21:52.703690] INFO: org.esa.snap.core.util.EngineVersionCheckActivator: Please check regularly for new updates for the best SNAP experience.
0726 13:21:56.387224] INFO: org.hsqldb.persist.Logger: dataFileCache open start
0726 13:21:56.908043] Executing operator...
0726 15:50:33.122397] INFO: org.hsqldb.persist.Logger: Database closed
<no other messages from gpt before the machine rebooted on 0727 around 11 AM>
I’m already asking for logging at all granularities so I’m not sure there’s anything else I can figure out by just running it and seeing how long it takes until it knocks my machine over. What else might cause reprojection to fail like this?