CalibrationOp causes WriteOp to hang

michael.raymond · May 6, 2020, 12:49pm

We have a graph that we were running successfully with gpt, and I am trying to replicate that pipeline using the Java library in order to embed it inside a JVM (Clojure) application. I’ve manually rewritten the XML into instances of the respective Op classes, but I run into an issue when I try to write out the final target product. I’m using version 8.0.0-SNAPSHOT of the libraries.

I’m running the processing graph against a Sentinel-1 scene (without cropping), and using gpt it works fine on my laptop, taking about 10 minutes. It uses about 4 cores and a few GiB of RAM, but doesn’t exhaust the system.

When I try to run my JAR file, it never returns, although it seems to keep 1 core occupied. I’ve tried to narrow down the problem. If I remove everything except the final WriteOp it will always return, although it runs out of memory if I don’t add a SubsetOp. If I add back in the CalibrationOp, it will only return if I crop it to a tiny subsection, such as 30 pixels by 30. If I try a larger subsection, it hangs forever.

Sometimes, if I cancel the process after a couple of hours, it will print out an exception:
java.lang.NullPointerException
at com.sun.media.jai.util.SunCachedTile.<init>(Unknown Source)
at com.sun.media.jai.util.SunTileCache.add(Unknown Source)
at javax.media.jai.OpImage.addTileToCache(Unknown Source)
at javax.media.jai.OpImage.getTile(Unknown Source)
at javax.media.jai.PlanarImage.cobbleShort(Unknown Source)
at javax.media.jai.PlanarImage.getData(Unknown Source)
at com.bc.ceres.glevel.MultiLevelImage.getData(MultiLevelImage.java:64)
at org.esa.snap.core.gpf.internal.OperatorContext.getSourceTile(OperatorContext.java:449)

I’ve not managed to diagnose any specific errors, but I get the vague impression that there is an uncaught exception and the writer isn’t closing. Possibly it is a caching issue, which would explain why it doesn’t happen for very small sections: there aren’t enough opportunities for it to fail.
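One way to check the uncaught-exception theory is to install a JVM-wide default handler before starting the write, so that any exception escaping a worker thread is at least logged. This is a minimal sketch using only the JDK; the class name `UncaughtExceptionLog` and the simulated failure are hypothetical. It is also only an assumption that the rendering threads let exceptions escape this way: exceptions thrown inside tasks submitted to an `ExecutorService` are stored in the `Future` and never reach this handler.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical helper: records and prints exceptions that kill a thread.
class UncaughtExceptionLog {
    // Thread-safe list, because the handler runs on whichever thread is dying.
    static final List<Throwable> SEEN = new CopyOnWriteArrayList<>();

    // Call once, before the processing graph is executed.
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            SEEN.add(error);
            System.err.println("Uncaught exception in thread \"" + thread.getName() + "\":");
            error.printStackTrace();
        });
    }

    public static void main(String[] args) throws InterruptedException {
        install();
        // Stand-in for a worker thread that fails part-way through rendering.
        Thread worker = new Thread(() -> {
            throw new IllegalStateException("simulated tile failure");
        }, "demo-worker");
        worker.start();
        worker.join();
        System.out.println("uncaught exceptions seen: " + SEEN.size());
    }
}
```

From Clojure the same handler can be set through interop before calling the write.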

Any help diagnosing the issue would be appreciated! I can also post my code, or translate it into Java first if that would be more helpful.

Thanks

michael.raymond · May 19, 2020, 11:17am

We’ve managed to get this working, most of the time. There were multiple issues, which made it difficult to determine exact problems.

The primary issue was that without parallelism, the image rendering is so slow that it’s difficult to know whether it’s still working. To get the speed comparable to gpt, we had to update JAI’s settings to use all of the available cores and give it more RAM for its cache.

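(import '(javax.media.jai JAI))

(defn init-renderer-settings!
  "Configures the default JAI instance to render tiles in parallel on all
  logical cores, and gives its tile cache 1 GiB of memory."
  []
  (let [jai           (JAI/getDefaultInstance)
        logical-cores (.availableProcessors (Runtime/getRuntime))
        cache-size    (* 1024 1024 1024)] ;; 1 GiB, in bytes
    (.setParallelism (.getTileScheduler jai) logical-cores)
    (JAI/enableDefaultTileCache)
    (.setMemoryCapacity (.getTileCache jai) cache-size)))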

Also, using GPF/writeProduct or WriteOp seems to be faster than ProductIO/writeProduct, but I haven’t tested that extensively.

Even after fixing the speed, I have still observed the write operation getting stuck and never returning. I didn’t get the same cache exception, though, so that might have been a red herring. It happens infrequently enough that it’s not blocking us, and it’s easier to spot now when it does happen, because the difference in CPU usage is so marked.
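For anyone who wants to catch the stuck write rather than spot it from CPU usage, one option is to run the write on its own thread with a time limit and dump all thread stacks when the limit passes. The following is a minimal sketch using only the JDK, not something from our pipeline: the class `WriteWatchdog`, its method names, and the timeout values are all hypothetical, and the `Runnable` stands in for whatever call performs the write.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a task with a time limit and reports where
// every thread is if the limit is exceeded.
class WriteWatchdog {

    // Returns true if the task finished in time, false if it timed out.
    static boolean runWithTimeout(Runnable task, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "watched-write");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            dumpThreads();       // show what each thread is doing or waiting on
            future.cancel(true); // interrupts the task; it may ignore the interrupt
            return false;
        } catch (ExecutionException e) {
            // The task threw: surface the real cause instead of losing it.
            throw new IllegalStateException("watched task failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            executor.shutdownNow();
        }
    }

    // Prints every live thread with held locks; ThreadInfo.toString()
    // shortens each stack trace to its top frames.
    static void dumpThreads() {
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
        for (ThreadInfo info : infos) {
            System.err.print(info);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a write that never returns.
        boolean finished = runWithTimeout(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 500, TimeUnit.MILLISECONDS);
        System.out.println("finished in time: " + finished);
    }
}
```

The thread dump is the useful part: if the write thread is parked waiting on a tile that a dead or blocked worker was meant to produce, that should show up in the stacks. Running `jstack` against the hung process gives the same information without any code changes.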