GPT reproject operation churns forever

dpencosk · July 27, 2023, 7:58pm

Hello! I’m trying to debug a toolbox workflow for Sentinel-3 products that most of the time succeeds quickly but occasionally hangs until we time the job out and kill it to reclaim resources.

We have an XML graph with four steps:

<graph id="Graph">
  <version>1.0</version>
  <node id="Read">
    <operator>Read</operator>
    <sources/>
    <parameters class="com.bc.ceres.binding.dom.XppDomElement">
      <file>MANIFEST_PATH</file>
      <formatName>FORMAT_NAME</formatName>
    </parameters>
  </node>
  <node id="Reproject">
    <operator>Reproject</operator>
    <sources>
      <sourceProduct refid="Read"/>
    </sources>
    <parameters class="com.bc.ceres.binding.dom.XppDomElement">
      <wktFile/>
      <crs>CRS</crs>
      <resampling>RESAMPLING</resampling>
      <orthorectify>false</orthorectify>
      <elevationModelName/>
      <noDataValue>NaN</noDataValue>
      <includeTiePointGrids>true</includeTiePointGrids>
      <addDeltaBands>false</addDeltaBands>
    </parameters>
  </node>
  <node id="Convert-Datatype">
    <operator>Convert-Datatype</operator>
    <sources>
      <sourceProduct refid="Reproject"/>
    </sources>
    <parameters class="com.bc.ceres.binding.dom.XppDomElement">
      <sourceBands>SOURCE_BANDS</sourceBands>
      <targetDataType>DATA_TYPE</targetDataType>
      <targetScalingStr>Truncate</targetScalingStr>
      <targetNoDataValue>NO_DATA_VALUE</targetNoDataValue>
    </parameters>
  </node>
  <node id="Write">
    <operator>Write</operator>
    <sources>
      <sourceProduct refid="Convert-Datatype"/>
    </sources>
    <parameters class="com.bc.ceres.binding.dom.XppDomElement">
      <file>OUTPUT_FILE</file>
      <formatName>GeoTIFF-BigTIFF</formatName>
      <writeEntireTileRows>true</writeEntireTileRows>
      <clearCacheAfterRowWrite>true</clearCacheAfterRowWrite>
    </parameters>
  </node>
 ...
</graph>

The handful of all-caps values are replaced at runtime with values calculated from the input, such as its coordinates and bands. In this example, we’re trying to process S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003.

I’m running SNAP 9.0.0 on a Unix virtual machine with 16 cores and 64 GB of RAM. Options I’m passing to the gpt command include:

-c 35G : We have -Xmx55G in the vmoptions file, so while this is lower than recommended for the cache as a percentage, --diag indicates that we’re only seeing 48.9G available, so 35G of that for the cache makes more sense. Setting -c higher results in “Cannot construct Databuffer” after just a few minutes.
-q 16 : I’ve noticed our toolbox job only taking ~15-20% of available CPU even with this value set as high as 70, so maybe it can take more. Or maybe it doesn’t matter because we’re trying to do something that doesn’t parallelize well, or at all.
No changes to options like -J-Dsnap.jai.defaultTileSize or -J-Dsnap.dataio.bigtiff.tiling.*, as it wasn’t clear to me whether the problem is not enough parallelization (so smaller tiles could help) or something else.

I tried to isolate the issue by stringing together the graph’s operations into a series of explicit gpt calls, and it gets stuck on step 2:

Log file created at: 2023/07/26 13:19:13
0726 13:20:28.843435] Run GPT for band: /tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1.tif (1/3)
0726 13:20:28.848320] Executing stage 1: Read
0726 13:20:28.848555] Running /bin/sh /tmp/tmpnzt11qom_input/snap/bin/gpt Read -Pfile=/tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003.SEN3/xfdumanifest.xml -PformatName=Sen3 -Duser.home=/tmp/tmpnzt11qom_input -e -x -c 35G -q 16 -Dsnap.log.level=ALL -t /tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1_Read0.dim
0726 13:20:47.821486] WARNING: org.esa.snap.core.util.VersionChecker: Not able to check for new SNAP version. Remote version could not be retrieved.
0726 13:20:57.940085] Executing operator...
0726 13:21:00.103345] 20%....30%....40%....50%....60%....70%....80%....90%.... done.
0726 13:21:00.112079] INFO: org.esa.snap.core.gpf.common.WriteOp: Start writing product S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003.SEN3 to /tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1_Read0.dim
0726 13:21:00.112443] Writing...
0726 13:21:00.134045] INFO: java.util.prefs.FileSystemPreferences$1: Created user preferences directory.
0726 13:21:00.764693] INFO: org.hsqldb.persist.Logger: dataFileCache open start
0726 13:21:50.080088] ....10%....20%....30%....40%....50%....60%....70%....80%....90%....100% done.
0726 13:21:50.080255] INFO: org.esa.snap.core.gpf.common.WriteOp: End writing product S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1_Read0 to /tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1_Read0.dim
0726 13:21:50.082269] INFO: org.esa.snap.core.gpf.common.WriteOp: Time: 49.975 s total, 12.216 ms per line, 0.002511 ms per pixel
0726 13:21:50.276671] Executing stage 2: Reproject
0726 13:21:50.276962] Running /bin/sh /tmp/tmpnzt11qom_input/snap/bin/gpt Reproject -Pcrs=AUTO:42001 -Presampling=Bicubic -Porthorectify=false -PnoDataValue=NaN -PincludeTiePointGrids=true -PaddDeltaBands=false -Duser.home=/tmp/tmpnzt11qom_input -e -x -c 35G -q 16 -Dsnap.log.level=ALL -t /tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1_Reproject1.dim /tmp/tmpoo_5cjn8_input/S3A_OL_1_EFR____20230708T220131_20230708T220431_20230709T225332_0179_101_015_1800_PS1_O_NT_003_RADIANCE_1_Read0.dim
0726 13:21:51.812028] INFO: org.esa.snap.core.gpf.operators.tooladapter.ToolAdapterIO: Initializing external tool adapters
0726 13:21:52.703690] INFO: org.esa.snap.core.util.EngineVersionCheckActivator: Please check regularly for new updates for the best SNAP experience.
0726 13:21:56.387224] INFO: org.hsqldb.persist.Logger: dataFileCache open start
0726 13:21:56.908043] Executing operator...
0726 15:50:33.122397] INFO: org.hsqldb.persist.Logger: Database closed
<no other messages from gpt before the machine rebooted on 0727 around 11 AM>

I’m already asking for logging at all granularities so I’m not sure there’s anything else I can figure out by just running it and seeing how long it takes until it knocks my machine over. What else might cause reprojection to fail like this?

Marco_EOM · July 28, 2023, 6:57am

The product crosses the anti-meridian. I can imagine that this causes the problem.
Depending on what you finally want to achieve it could be an option to subset the product first. The ROI would start in the West at -179.99 and spans to the East as you need it.

dpencosk · July 28, 2023, 3:13pm

Hmm… ogr.Geometry says the corners are at (-179.474, 59.9649) and (179.982, 73.3552).

I’ve definitely seen some sentinel5p products dip over 180 before but this one doesn’t appear to.

Marco_EOM · July 29, 2023, 10:06am

One longitude is positive the other is negative.
It crosses the dateline, or it spans almost the whole globe.

On the map of the Open Access Hub it crosses the anti-meridian.

The same in SNAP.

dpencosk · July 29, 2023, 1:01pm

Ooof, that’d do it.

I guess next I need to figure out how why ogr is telling me something different. Even knowing that the “left corner” is the positive lng and the “right corner” should be the negative value is one thing, but the values themselves seem wrong… no way is that polygon is less than a degree across.

I haven’t been using UI tools, so thanks for the visual!

dpencosk · August 2, 2023, 1:41pm

Okay, so my guess is that the ogr library is just looking at the points in the polygon and picking the min and max of what it sees, so when the antimeridian is involved, the “leftmost” longitude is actually going to be the smallest positive value and the “rightmost” is going to be the largest negative value. (If a scene crosses both meridians, I’m doomed, but this works on this particular one.)

<node id="Subset-1">
    <operator>Subset</operator>
    <sources>
      <sourceProduct refid="Read"/>
    </sources>
    <parameters class="com.bc.ceres.binding.dom.XppDomElement">
      <geoRegion>POLYGON((176.473 73.3552, 180 73.3552, 180 59.9649, 176.473 59.9649, 176.473 73.3552))</geoRegion>
    </parameters>
  </node>
  <node id="Subset-2">
    <operator>Subset</operator>
    <sources>
      <sourceProduct refid="Read"/>
    </sources>
    <parameters class="com.bc.ceres.binding.dom.XppDomElement">
      <!-- -180 results in a null tile, so we have to sneak up on it. --> 
      <geoRegion>POLYGON((-179.999999 73.3552, -146.792 73.3552, -146.792 59.9649, -179.999999 59.9649, -179.999999 73.3552))</geoRegion>
    </parameters>
  </node>

Basically the same graph as before, except now with -1 and -2 tracks for two outputs. However, this doesn’t seem to have fixed the issue, as it’s been running for almost a day at 95% CPU without finishing. Surely it can’t take that long to reproject a scene of this size.

Marco_EOM · August 3, 2023, 12:20pm

Are you sure that you region are correct?

The image shows the two geometries.

Also keep in mind that for subsetting the bounding box of a polygon is used. Even if it is not crossing the anti-meridian at one point, it might cross at another point.

However, with your regions it should not take so long. Maybe there is some problem with the reprojection or the geocoding of the OLCI data.
@diana_harosa Maybe the devs can have a look.

dpencosk · August 3, 2023, 1:51pm

Hmm… so I think my polygons make sense in the target WGS84 projection, but I can see how that could happen in the source projection.

But wouldn’t that mean that it’s impossible to Reproject without dropping parts of the tile on the floor, and how would I even know ahead of time where the cutoff is if I’m doing these in a batch and have to calculate polygons on the fly?

It seems like there has to be a way to do this, since you’re able to show me the whole tile.

Marco_EOM · August 4, 2023, 8:49am

What I show is the unprojected scene. So, I didn’t manage to reproject it either.
To make it work I think the devs need to update the reprojection codebase.
[SNAP-1319] Reprojecting a product crossing the anti-meridian leads to global product - JIRA (atlassian.net)

You can detect those products by looking at the polygon coordinates and as soon as you have one value >170 and one value <-170 longitude you cross the 180°-meridian at least for the OLCI FR data.

dpencosk · August 4, 2023, 1:47pm

Thanks for your patience with me on this one. I’ve updated our batch scripts to ignore these ones for now.