Java heap errors when BackGeoCoding certain specific S1B products via GPT

Dear developers,

about three weeks ago we reported an issue that occurred during the processing of certain S1B products (see also: earlier reported issue). Unfortunately, apart from a suggestion by @lveci that it could be related to “the product timings or orbits”, we did not get any additional feedback on this issue.

In the meantime we have investigated further ourselves and found that the issue can be traced back to a “Java heap space error” occurring in the BackGeoCoding operator (see the GPT error output log below). The steps to reproduce the error are very simple, using just the following nodes (see also the attached graph files):

  1. ProductSet-Reader
    with subswath #1 of each of the following two products as input:
    ** S1B_IW_SLC__1SDV_20170215T172400_20170215T172427_004314_0077D8_D98B
    ** S1B_IW_SLC__1SDV_20170227T172400_20170227T172427_004489_007D03_E01A

  2. Back-Geocoding
    with simply the stack resulting from step 1 as input

  3. Subset
    requesting a small subset contained within both products in the stack, to avoid processing the entire subswath.

The above simple steps are contained in graph_fails.xml (1.6 KB)
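
For readers without the attachments, a minimal sketch of such a graph could look as follows. Note that this is my own reconstruction, not the attached file: the file paths and the subset region are placeholders, and in practice a Write node or the -t command-line option would provide the output target.

<graph id="Graph">
  <version>1.0</version>
  <node id="ProductSet-Reader">
    <operator>ProductSet-Reader</operator>
    <sources/>
    <parameters>
      <!-- placeholder paths: point these at the two S1B SLC products -->
      <fileList>/data/S1B_product_1.zip,/data/S1B_product_2.zip</fileList>
    </parameters>
  </node>
  <node id="Back-Geocoding">
    <operator>Back-Geocoding</operator>
    <sources>
      <sourceProduct refid="ProductSet-Reader"/>
    </sources>
    <parameters>
      <demName>SRTM 3Sec</demName>
    </parameters>
  </node>
  <node id="Subset">
    <operator>Subset</operator>
    <sources>
      <sourceProduct refid="Back-Geocoding"/>
    </sources>
    <parameters>
      <!-- placeholder pixel region (x,y,width,height) inside both products -->
      <region>0,0,2000,2000</region>
    </parameters>
  </node>
</graph>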

Just for completeness, applying exactly the same processing via graph_works.xml (1.6 KB) to subswaths of these two products:
** S1A_IW_SLC__1SDV_20170317T172447_20170317T172514_015735_019E70_700D
** S1B_IW_SLC__1SDV_20170323T172400_20170323T172427_004839_00873C_2684

the processing finishes without any issues as usual.

At this point we are running out of ideas as to what else we can do to solve this S1B processing issue.

Note: we have also reported this issue to the ESA Quality team, who tested the mentioned products and reported back that they could not find any issues, so they concluded the issue must be related to the applied processing.

Any help is very much appreciated!

Sven.

GPT error log/stack trace:

/opt/s1tbx-snap5.0/bin/gpt -e graph.xml -c 75G -q 9
WARNING: org.esa.snap.core.util.ServiceFinder: Can't search for SPIs, not a directory: /home/sarccd/.snap/snap-python
Java heap space
Java heap space
org.esa.snap.core.gpf.OperatorException: Java heap space
        at org.esa.snap.core.gpf.graph.GraphProcessor$GPFImagingListener.errorOccurred(GraphProcessor.java:373)
        at com.sun.media.jai.util.SunTileScheduler.sendExceptionToListener(SunTileScheduler.java:1646)
        at com.sun.media.jai.util.SunTileScheduler.scheduleTile(SunTileScheduler.java:921)
        at javax.media.jai.OpImage.getTile(OpImage.java:1129)
        at javax.media.jai.PlanarImage.cobbleFloat(PlanarImage.java:3254)
        at javax.media.jai.PlanarImage.getData(PlanarImage.java:2181)
        at com.bc.ceres.glevel.MultiLevelImage.getData(MultiLevelImage.java:64)
        at org.esa.snap.core.datamodel.Band.readRasterData(Band.java:311)
        at org.esa.snap.core.dataio.ProductSubsetBuilder.readBandRasterDataRegion(ProductSubsetBuilder.java:328)
        at org.esa.snap.core.dataio.ProductSubsetBuilder.readBandRasterDataImpl(ProductSubsetBuilder.java:297)
        at org.esa.snap.core.dataio.AbstractProductReader.readBandRasterData(AbstractProductReader.java:250)
        at org.esa.snap.core.gpf.common.SubsetOp.computeTile(SubsetOp.java:268)
        at org.esa.snap.core.gpf.internal.OperatorImage.computeRect(OperatorImage.java:80)
        at javax.media.jai.SourcelessOpImage.computeTile(SourcelessOpImage.java:137)
        at com.sun.media.jai.util.SunTileScheduler.scheduleTile(SunTileScheduler.java:904)
        at javax.media.jai.OpImage.getTile(OpImage.java:1129)
        at javax.media.jai.PlanarImage.getData(PlanarImage.java:2085)
        at com.bc.ceres.glevel.MultiLevelImage.getData(MultiLevelImage.java:64)
        at org.esa.snap.core.gpf.internal.OperatorContext.getSourceTile(OperatorContext.java:406)
        at org.esa.snap.core.gpf.internal.OperatorContext.getSourceTile(OperatorContext.java:392)
        at org.esa.snap.core.gpf.internal.OperatorImage.computeRect(OperatorImage.java:73)
        at javax.media.jai.SourcelessOpImage.computeTile(SourcelessOpImage.java:137)
        at com.sun.media.jai.util.SunTileScheduler.scheduleTile(SunTileScheduler.java:904)
        at javax.media.jai.OpImage.getTile(OpImage.java:1129)
        at com.sun.media.jai.util.RequestJob.compute(SunTileScheduler.java:247)
        at com.sun.media.jai.util.WorkerThread.run(SunTileScheduler.java:468)
Caused by: org.esa.snap.core.gpf.OperatorException: Java heap space
        at org.esa.snap.engine_utilities.gpf.OperatorUtils.catchOperatorException(OperatorUtils.java:389)
        at org.esa.s1tbx.sentinel1.gpf.BackGeocodingOp.computeTileStack(BackGeocodingOp.java:476)
        at org.esa.snap.core.gpf.internal.OperatorImageTileStack.computeRect(OperatorImageTileStack.java:116)
        at org.esa.snap.core.gpf.internal.OperatorImageTileStack.computeTile(OperatorImageTileStack.java:85)
        at com.sun.media.jai.util.SunTileScheduler.scheduleTile(SunTileScheduler.java:904)
        ... 23 more

Error: Java heap space

Is this an issue with a specific S-1B scene or a more general issue?

It is observed for multiple S1B products, but not for all of them. So far we have seen at least 10 affected products.

Thanks for looking into it.

Sven.

Did you check that the product downloads were correct and complete, with all metadata present? Are the products odd in any other way? We should see whether this is an issue with SNAP or with the products themselves @lveci @nuno.miranda

We always check that the products are correct by comparing the MD5 checksums. According to the ESA Quality Control team, the reported products have no issues, so they concluded that the problem must be related to the applied processing.

Thank you,
Joaquim

Dear Sven, Joaquim, Marcus and Luis,

I have no problem investigating this issue further. However, the problem starts from a SNAP error and needs to be handled as a SNAP error in the first place.

Therefore, I suggest that the SNAP team investigate this issue and rule out any potential error from the software point of view. Mentioning that the error could be due to SAR timing or orbits is a vague statement that contributes nothing to resolving this particular issue. Hopefully your in-depth analysis might reveal something wrong with the data.

Only at that point can I and the Mission Performance Center (behind the Copernicus User Service) have a deeper look at the data. At the moment there are no S-1B-specific issues (to our current knowledge) regarding the timing.

@svh and @joaquimrosa: it is not correct to request support from the Copernicus User Service for a problem regarding SNAP or any external software, as they will immediately send you back to the team supporting the development of that software.

Kind regards,
Nuno Miranda
S-1 Data Quality Manager


Hi Marcus,

as already noted by my colleague @joaquimrosa, we did report to ESA that we are having issues with these particular S1B products. They came back to us stating that they could not find any anomalies. It would be very helpful for us to know whether you can reproduce the issue and, if so, whether you can give some pointers on the root cause. Personally, I am in the dark as to whether it is a product/orbit-related issue or a SNAP bug.

Please let me know if you need any further information from us; if needed, I can compose a longer list of products for which we have observed the issue.

Best regards,
Sven.

Dear Nuno,

thanks for your reply. Please know that the reason for us to contact Copernicus support is that we suspect a possible issue with some S1B data products that ESA is not aware of. Of course, debugging third-party software is no business for Copernicus support; however, we included information on the failing processing step and the corresponding error in our report, as it might give some pointers to the root cause and/or the likely origin of the issue. So it is with the best intentions, having the overall interest of the remote sensing community in mind, that we made these reports, at this time not yet knowing with full certainty what the origin of the issue is.

Let me conclude with some background on why we might suspect a data issue:

We have been routinely processing S1 data for a commercial change monitoring service since 2015. After the rough “early days”, in which both ESA and the community worked hard together to get things to work, things stabilized in 2016 and our service based on S1A has been running without substantial issues ever since. When S1B became available, our system was also successfully validated for using S1B data and combining it with S1A (allowing higher update frequencies). However, it is only in the last couple of months that we have attracted new contracts involving the processing of large amounts of S1B data. It is in this phase that we started to observe issues while processing S1B products. Note that:

  • The issue is observed for some, but certainly not all, S1B products (for example, it is observed for products coming from ascending orbit 88 over the Netherlands).
  • No issues occur for any S1A data, including those from the same orbit 88.
  • No issue has been observed (yet) when processing S1B data from descending orbit 37.

Based on these observations, my suspicions point to a product (metadata) or orbit issue, but I certainly cannot rule out the possibility of a toolbox issue. Hopefully @mengdahl and the rest of the community can help with this.

Best regards.
Sven.

Using your products and running a graph with Apply-Orbit-File, TOPSAR-Split, Back-Geocoding and Subset works fine for me.
I think you just ran out of memory.

You should revisit your configuration of gpt. Why are you specifying a tile cache size of 75G on the command line? This seems overkill no matter how much RAM you may have.
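
For reference, a sketch assuming the default SNAP install layout: the -c flag only sizes the GPF/JAI tile cache, while the maximum Java heap itself is set via -Xmx in the gpt.vmoptions file next to the gpt executable. The cache should stay well below the heap, otherwise heap errors are to be expected:

# gpt.vmoptions (example value, adjust to your machine):
#   -Xmx32G
# and keep the tile cache a fraction of the heap, e.g.:
/opt/s1tbx-snap5.0/bin/gpt graph.xml -c 24G -q 8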

Hi Luis,

thanks for looking into this. Unfortunately, the mystery only becomes bigger for me at this stage. Could you share with me the graph file you used to execute your test? I want to cross-reference it exactly, once more, against the processing steps we apply. I really need to get to the bottom of this ASAP.

Regarding the 75G cache size, this was just a test suggested by @esteban.aguilera to see if we could work around any possible memory leaks and get some results that way. Our normal processing, which results in the same error, uses the default cache settings.

Best regards,
Sven.

This is the graph I used: graph_fails2.xml (4.5 KB)

Hi Luis,

thanks for the graph. I tried running it, after changing the inputs and outputs to point at the correct files on my system, and found that my system chokes on it because the required Precise Orbits are not available locally, and the corresponding zip file for download from http://step.esa.int/auxdata/orbits/Sentinel-1/POEORB/ does not exist (yet) either. Even after manually downloading the corresponding Precise Orbit files from https://qc.sentinel1.eo.esa.int/aux_poeorb/, the problem persists… and SNAP keeps complaining about not being able to download the 2017-2.zip.

Nevertheless, for our application we have to use RESORBs anyway, since the Precise Orbits become available too late. So what I did was change:

<orbitType>Sentinel Precise (Auto Download)</orbitType>

to

<orbitType>Sentinel Restituted (Auto Download)</orbitType>

in both “Apply-Orbit-File” operators.
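
For context, the relevant node in my graph then looks roughly as follows (a sketch; the other parameters are left at their defaults):

<node id="Apply-Orbit-File">
  <operator>Apply-Orbit-File</operator>
  <sources>
    <sourceProduct refid="TOPSAR-Split"/>
  </sources>
  <parameters>
    <orbitType>Sentinel Restituted (Auto Download)</orbitType>
    <polyDegree>3</polyDegree>
    <continueOnFail>false</continueOnFail>
  </parameters>
</node>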

After this, the orbit file download errors disappeared, only to be replaced by the earlier Java heap error that started this topic (see the error log below once again).

Therefore, may I ask you to run your own test graph once more, just replacing the Precise Orbits with the Restituted Orbits?

Just to make sure that all this is not just caused by some anomalies in Restituted Orbit files…

Many thanks,
Sven.

Error log:

Java heap space
Java heap space
org.esa.snap.core.gpf.OperatorException: Java heap space
        at org.esa.snap.core.gpf.graph.GraphProcessor$GPFImagingListener.errorOccurred(GraphProcessor.java:373)
        at com.sun.media.jai.util.SunTileScheduler.sendExceptionToListener(SunTileScheduler.java:1646)
        at com.sun.media.jai.util.SunTileScheduler.scheduleTile(SunTileScheduler.java:921)
        at javax.media.jai.OpImage.getTile(OpImage.java:1129)
        at javax.media.jai.PlanarImage.cobbleFloat(PlanarImage.java:3254)
        at javax.media.jai.PlanarImage.getData(PlanarImage.java:2181)
        at com.bc.ceres.glevel.MultiLevelImage.getData(MultiLevelImage.java:64)
        at org.esa.snap.core.datamodel.Band.readRasterData(Band.java:311)
        at org.esa.snap.core.dataio.ProductSubsetBuilder.readBandRasterDataRegion(ProductSubsetBuilder.java:328)
        at org.esa.snap.core.dataio.ProductSubsetBuilder.readBandRasterDataImpl(ProductSubsetBuilder.java:297)
        at org.esa.snap.core.dataio.AbstractProductReader.readBandRasterData(AbstractProductReader.java:250)
        at org.esa.snap.core.gpf.common.SubsetOp.computeTile(SubsetOp.java:268)
        at org.esa.snap.core.gpf.internal.OperatorImage.computeRect(OperatorImage.java:80)
        at javax.media.jai.SourcelessOpImage.computeTile(SourcelessOpImage.java:137)
        at com.sun.media.jai.util.SunTileScheduler.scheduleTile(SunTileScheduler.java:904)
        at javax.media.jai.OpImage.getTile(OpImage.java:1129)
        at javax.media.jai.PlanarImage.getData(PlanarImage.java:2085)
        at com.bc.ceres.glevel.MultiLevelImage.getData(MultiLevelImage.java:64)
        at org.esa.snap.core.gpf.internal.OperatorContext.getSourceTile(OperatorContext.java:406)
        at org.esa.snap.core.gpf.internal.OperatorContext.getSourceTile(OperatorContext.java:392)
        at org.esa.snap.core.gpf.internal.OperatorImage.computeRect(OperatorImage.java:73)
        at javax.media.jai.SourcelessOpImage.computeTile(SourcelessOpImage.java:137)
        at com.sun.media.jai.util.SunTileScheduler.scheduleTile(SunTileScheduler.java:904)
        at javax.media.jai.OpImage.getTile(OpImage.java:1129)
        at com.sun.media.jai.util.RequestJob.compute(SunTileScheduler.java:247)
        at com.sun.media.jai.util.WorkerThread.run(SunTileScheduler.java:468)
Caused by: org.esa.snap.core.gpf.OperatorException: Java heap space
        at org.esa.snap.engine_utilities.gpf.OperatorUtils.catchOperatorException(OperatorUtils.java:389)
        at org.esa.s1tbx.sentinel1.gpf.BackGeocodingOp.computeTileStack(BackGeocodingOp.java:476)
        at org.esa.snap.core.gpf.internal.OperatorImageTileStack.computeRect(OperatorImageTileStack.java:116)
        at org.esa.snap.core.gpf.internal.OperatorImageTileStack.computeTile(OperatorImageTileStack.java:85)
        at com.sun.media.jai.util.SunTileScheduler.scheduleTile(SunTileScheduler.java:904)
        ... 23 more

Error: Java heap space

Well, the orbit file problem is because you are using an older version. A while ago we started downloading the orbit files directly from the QC website, so they are as up to date as possible.
I re-ran it with the restituted orbits and it was still fine. I also ran it without the subset.

From my point of view, I believe there is no problem with the data.
We just ran an interferogram between those two products and obtained something nice. This excludes issues with the data timing.

Also, there is very little chance of a problem with the orbit file. An orbit file of poor quality will lead to imperfect geocoding or terrain correction, but not to an outright failure, unless it is completely wrong, which I exclude (in the first place).

Thanks to @lveci and @nuno.miranda for running those tests. I have been running the tests on my side using the 5.0.0 (downloadable) release version, but without the updates since that release. To rule out the possibility that this is the origin of my issues, I will rerun my tests on a clean VM with a fresh install of SNAP with all available updates…

I will report back here later.
Sven.

I have been extracting 6-day repeat coherence for exactly these S1A and S1B scene combinations (orbit 88 and also 37) since late September 2016 and have not had any processing failures, other than the odd SNAP issue at the terrain correction stage of not finding SRTM 1’’ HGT tiles that are already on disk (rerunning that step usually solves it).

Hi All,

after a nice day of troubleshooting, debugging and testing, I can report the following findings with respect to the reported issue. In fact it turned out to be two issues:

  1. The Java heap error reported above appears to originate from some legacy content left behind in the .snap folder by earlier SNAP versions and update cycles. Completely clearing out the .snap folder and starting with a fresh one (see the command sketch after this list) removed the Java heap error and enabled me to reproduce the results from @lveci’s graph file. What remains a mystery to me is why this only affected the processing of certain specific S1B products.

  2. With the Java heap error out of the way, it turned out that my original issue as reported here was still present. It now seems that intermediate products (BEAM-DIMAPs) generated earlier with older SNAP versions are no longer fully compatible with the current toolbox version (at least not for some S1B products?). I managed to get rid of the warning/error by reprocessing everything in the latest version… So at first I thought this solved my issue, but I just found that, instead of the toolbox throwing an error, I now do get output, but the output itself contains patches of NaNs…
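
For point 1, what I did was roughly the following (a sketch: I moved the old folder aside rather than deleting it, so that auxdata such as DEM tiles can be recovered if needed; the backup name is arbitrary):

mv ~/.snap ~/.snap.old

SNAP then recreates a fresh .snap folder on the next run.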

So progress today, but unfortunately no full solution yet :unamused:

Sven.

Hi @lveci,

as posted earlier today, although I no longer get error messages, I am now stuck with strange artifacts in my output data/images, rendering them useless for further processing (an example is shown below). Would you have any idea what could be going on here, causing patches of NaN to pop up across my output files?

Many thanks in advance!
Sven.

Dear @lveci,

although I still have not found a solution for my issue, I now have a better understanding of which particular settings and circumstances seem to generate it. I therefore hope you are available to run a few more tests with me, which should enable us to pinpoint the root cause of this issue far better. For this I would like to consider the following three test cases:

All processing was done with a toolbox instance checked out from the master branch today!

  1. Processing Graph Producing Patches of NaNs (Example_creating_patches_of_NaNs.xml (6.6 KB))
    In this case, we execute a simple processing sequence that illustrates the issue at hand. It is relevant to note here that the “disableReramp” parameter of the BackGeoCodingOp is set to “true” (see the node sketch after case 3). The resulting band Intensity_IW1_VH_slv1_27Feb2017 looks as follows:


    Note that additional output is generated by the graph just after the TOPSAR-Deburst operator, and it does not contain any NaNs (not shown here).

  2. Almost identical processing graph not producing NaNs (Example_NOT_creating_patches_of_NaNs.xml (6.6 KB))
    In this case, exactly the same processing as in case 1 is done, except that now the parameter “disableReramp” is set to “false”. Now the NaN patches disappear and the Intensity_IW1_VH_slv1_27Feb2017 band looks as follows:


    Note that, without the ramp removed, we do get a lot of interpolation artifacts from the Terrain-Correction step.

  3. Same processing steps as 1 and 2 but starting from preprocessed data (Example_preprocessed_data_disableReramp_no_NaNs.xml (4.9 KB))
    In this case, instead of starting from the downloaded SLC products, we take as input a stack of two pre-processed subswaths. These pre-processed subswaths were obtained from the processing steps Read -> TOPSAR-Split -> Apply-Orbit, executed with a SNAP 4.x.x installation. The execution of the attached graph is done with the latest toolbox version and, in addition, “disableReramp” is set to “true”. Maybe to some of your surprise, this example generates the following output band:


    Note that we observe neither NaN patches nor interpolation artifacts here. This is in fact the processing outcome that we fully rely on.
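
For reference, the only difference between the graphs in cases 1 and 2 is this single parameter in the Back-Geocoding node (a sketch, with the other parameters omitted; the sources refer to the nodes in the attached graphs):

<node id="Back-Geocoding">
  <operator>Back-Geocoding</operator>
  <sources>
    <!-- the two Apply-Orbit outputs, as in the attached graphs -->
  </sources>
  <parameters>
    <demName>SRTM 3Sec</demName>
    <disableReramp>true</disableReramp>  <!-- "false" in case 2 -->
  </parameters>
</node>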

So, some observations:

  • The appearance of NaN patches after Terrain-Correction seems to be related to the “disableReramp” parameter in the earlier BackGeoCoding step.
  • We require “disableReramp” set to “true”, because we need to be able to interpolate the complex data later on without interpolation artifacts.
  • Using input data that is pre-processed with an earlier version (SNAP 4.x.x), the issue cannot be reproduced. Using pre-processed data generated with SNAP 5.x.x versions does reproduce the issue.
  • The fact that SNAP 4.x.x pre-processed data does not reproduce the issue makes me conclude that changes introduced in SNAP 5.x.x in the TOPSAR-Split and/or Apply-Orbit operators are somehow related to the observed issue.

I hope that this extensive input will allow you to pinpoint and resolve the issue. Let me know if there is anything else I can do to help.

Best regards,
Sven.


Hi Sven,

I tried your graphs and was able to replicate the issue.

I hope it gets resolved soon.

Esteban