Snappy: Read, Reproject, Subset; How to translate GUI options to HashMap parameters

Hey guys,

I need to process a large amount of Sentinel-3 SYN Level-2 data and I'm trying to do it with snappy. I have some experience with snappy from processing Sentinel-1 data and wanted to reuse the functions I wrote back then. To give a clearer picture of what I want to do: I just want to georeference and export (a subset) in NetCDF format. In the GUI it is quite easy:

1. Open the 'xfdumanifest.xml'.
2. Reproject with Raster/Geometric/Reprojection: tick off 'save as', predefined CRS, bicubic, preserve resolution, tick off 'reproject tie-point grids'.
3. Select the product, export, select the spectral bands and flags, create a subset from geo coordinates, export as NetCDF. Done.

If I translate that simple workflow to snappy it goes as follows:

1. Read Product

from snappy import ProductIO

def read_sentinel_product(manifest_file_path):
    # Pick the reader explicitly by format name ('SENTINEL-1' works for S1 data).
    r = ProductIO.getProductReader('SENTINEL-1')
    p = r.readProductNodes(manifest_file_path, None)
    return p

Obviously Sentinel-3 needs another product reader, but the equally obvious candidate 'SENTINEL-3' does not work. So which reader am I supposed to use?
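One workaround I could imagine (not sure if it behaves identically) is to let SNAP auto-detect the format instead of naming a reader:

from snappy import ProductIO

def read_sentinel3_product(manifest_file_path):
    # ProductIO.readProduct() picks a suitable reader automatically, so no
    # format name has to be guessed. Assumption on my side: auto-detection
    # also works for the Sentinel-3 SYN L2 xfdumanifest.xml.
    return ProductIO.readProduct(manifest_file_path)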

2. Reproject
The basic skeleton of every snappy function is:

from snappy import GPF, jpy

HashMap = jpy.get_type('java.util.HashMap')

param = HashMap()
param.put(...)
...
GPF.createProduct(FuncNameAsString, param, InputDataSet)

If I iterate over the list of operators, the possible parameters for Reproject are:

wktFile or None
crs or None
resamplingName or resampling
referencePixelX or None
referencePixelY or None
easting or None
northing or None
orientation or None
pixelSizeX or None
pixelSizeY or None
width or None
height or None
tileSizeX or None
tileSizeY or None
orthorectify or None
elevationModelName or None
noDataValue or None
includeTiePointGrids or None
addDeltaBands or None

I couldn't find anything about how a CRS is correctly passed as a parameter. My guess would be:

param.put('crs', 'EPSG:4326')  # is that correct?

I wouldn't say that the rest of the list above is self-explanatory, but the settings relevant for my use case would be (I guess):

param.put('noDataValue', 'NaN')
param.put('includeTiePointGrids', False)
param.put('resamplingName', 'Bicubic')
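
Putting these guesses together, the whole reprojection step would presumably look like this (assuming 'crs' accepts a plain EPSG code string, which I haven't verified):

from snappy import GPF, jpy

HashMap = jpy.get_type('java.util.HashMap')

def reproject(input_data):
    # All parameter values below are guesses that mirror my GUI settings.
    param = HashMap()
    param.put('crs', 'EPSG:4326')              # assumption: EPSG code as string
    param.put('resamplingName', 'Bicubic')
    param.put('noDataValue', 'NaN')
    param.put('includeTiePointGrids', 'false')
    return GPF.createProduct('Reproject', param, input_data)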

3. Subset
My old function takes a WKT string. I would rather give lon/lat coordinates, but again I don't know the correct format. I guess the correct parameter is named 'geoRegion'. In that case, do I pass a list ['Lat_min', 'Lat_max', 'Lon_min', 'Lon_max'] or something else?

def create_subset(input_data, wkt):
    param = HashMap()
    param.put('geoRegion', wkt)  # I'd prefer passing coordinates
    param.put('subSamplingX', '1')
    param.put('subSamplingY', '1')
    param.put('fullSwath', False)
    param.put('tiePointGridNames', '')
    param.put('copyMetadata', True)

    subset = GPF.createProduct('Subset', param, input_data)
    print('Subset created.')
    return subset
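
If 'geoRegion' really expects a WKT polygon in lon/lat order (my assumption), a small helper could build it from the bounding coordinates:

def bounds_to_wkt(lon_min, lat_min, lon_max, lat_max):
    # Build a closed WKT polygon from a lon/lat bounding box.
    # Assumption: 'geoRegion' expects coordinates in "lon lat" order.
    return ('POLYGON(({x0} {y0}, {x1} {y0}, {x1} {y1}, {x0} {y1}, {x0} {y0}))'
            .format(x0=lon_min, y0=lat_min, x1=lon_max, y1=lat_max))

# e.g. create_subset(product, bounds_to_wkt(-5.0, 40.0, 5.0, 50.0))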

Furthermore I would like to add bandNames, as I don't want the whole data set, just the spectral bands and flags. In the GUI I simply tick or untick the bands. How can I select specific bands?
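
My guess for the band selection would be one more line inside create_subset, with the bands as a comma-separated string (both the parameter name and the band names are unverified placeholders):

# Guess: the parameter may be called 'sourceBands' (newer SNAP) or 'bandNames'
# (older versions); the band names below are placeholders, not real SYN names.
param.put('sourceBands', 'band_1,band_2,flag_band')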

4. Write Product
The last step is writing the product to disk. At least here I don't have any questions; I just added this for the sake of completeness.

import os

from snappy import ProductIO

def write_product(sat_data, save_dir, save_name, file_format='NetCDF4-CF'):
    print('Writing product...')
    target = os.path.join(save_dir, save_name)
    ProductIO.writeProduct(sat_data, target, file_format)
    print('Product written!')
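
For completeness, this is how I would chain the four steps, using the guessed pieces from above (so take it with a grain of salt):

def process_scene(manifest_file_path, save_dir, save_name, wkt):
    product = read_sentinel_product(manifest_file_path)  # step 1 (reader still unclear)
    reprojected = reproject(product)                     # step 2, guessed parameters
    subset = create_subset(reprojected, wkt)             # step 3
    write_product(subset, save_dir, save_name)           # step 4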

Thanks in advance, I hope somebody can help me with this.

Cheers

gpt is often very efficient for tasks where runtimes are long compared to the startup time (and Linux is designed to reduce startup times when the same program is run in a loop), so I'm not sure ESA SNAP's snappy adds much value here. snappy is known to have memory-management issues if you try to loop over a list of files, so to process lots of files you may want to use Python just to read the list of files and construct gpt command lines in a loop (see the sketch below). SNAP's Graph Builder tool allows you to create a graph that implements your GUI workflow, or you can use the snapista Python wrapper for gpt to create graphs in Python (avoiding XML) and then run gpt with the graph. If you have access to high-end hardware running Linux, GNU parallel is a handy way to process large numbers of files, but it takes a bit of work to determine the optimal configuration (gpt memory and CPUs versus the number of parallel jobs).
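
A minimal sketch of such a driver loop (all paths and the graph name are placeholders, and the graph is assumed to read its input from a ${sourceFile} parameter):

import glob
import subprocess

# Run one gpt process per product so each run starts with a fresh JVM
# and all memory is released between products.
for manifest in sorted(glob.glob('/path/to/data/S3*.SEN3/xfdumanifest.xml')):
    target = manifest.split('/')[-2].replace('.SEN3', '.nc')
    cmd = ['gpt', 'graph.xml',
           '-t', '/path/to/output/' + target,
           '-PsourceFile=' + manifest]
    subprocess.run(cmd, check=True)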

For Linux, Terradue's Anaconda repository provides snapista. It will install SNAP, so it is helpful if you are running your batch processing on a remote server with limited (e.g., ssh or Jupyter terminal) access.


Getting the parameters can be tricky, that's true.
You can, for example, use gpt from the command line and call

gpt <OperatorName> -h

This will give you a list of the parameters with a short description.

The generic core operations provided by SNAP are also briefly described in the help.

Click on one of the operators and a description is shown.

Another way is to configure the operator in the GUI and then
save the parameters to a file or just display them.

In your use case you actually don't need Python.
Your work can also be done with a shell script.

How to do this is described here: Bulk Processing with GPT - SNAP - Confluence (atlassian.net)

Other options @gnwiii already mentioned.

@marpet is there an easy way to get a list of all operators?

gpt -h lists the available operators from A to W, along with usage options common to all operators (you also get some INFO messages at the start, which some users find concerning):

PS D:\> & 'C:\Program Files\snap\bin\gpt.exe' -h
INFO: org.esa.s2tbx.dataio.gdal.GDALVersion: GDAL not found on system. Internal GDAL 3.0.0 from distribution will be used. (f1)
INFO: org.esa.s2tbx.dataio.gdal.GDALVersion: Internal GDAL 3.0.0 set to be used by SNAP.
INFO: org.esa.snap.core.gpf.operators.tooladapter.ToolAdapterIO: Initializing external tool adapters
INFO: org.esa.snap.core.util.EngineVersionCheckActivator: Please check regularly for new updates for the best SNAP experience.
INFO: org.esa.s2tbx.dataio.gdal.GDALVersion: Internal GDAL 3.0.0 set to be used by SNAP.
Usage:
  gpt <op>|<graph-file> [options] [<source-file-1> <source-file-2> ...]
[...]
Operators:
  Aatsr.SST                          Computes sea surface temperature (SST) from (A)ATSR products.
[...]
  Write                              Writes a data product to a file.

Yes, what @gnwiii wrote is the easiest way.

In Python one could do the following:

from snappy import GPF

# Iterate over all registered operator SPIs and print their aliases.
op_spi_it = GPF.getDefaultInstance().getOperatorSpiRegistry().getOperatorSpis().iterator()
while op_spi_it.hasNext():
    op_spi = op_spi_it.next()
    print("op_spi:", op_spi.getOperatorAlias())

I tried to adapt the provided batch script skeleton to my needs:

export PATH="$HOME/Applications/snap/bin:$PATH"
gptPath="gpt"
graphXmlPath="$1"
parameterFilePath="$2"
sourceDirectory="$3"
targetDirectory="$4"
targetFilePrefix="$5"

removeExtension() {
    file="$1"
    echo "$(echo "$file" | sed -r 's/\.[^\.]*$//')"
}

mkdir -p "${targetDirectory}"

for F in $(ls -1d "${sourceDirectory}"/S3*.SEN3); do
  sourceFile="$(realpath "$F")"
  targetFile="${targetDirectory}/${targetFilePrefix}_$(removeExtension "$(basename ${F})").nc"
  ${gptPath} ${graphXmlPath} -e -p ${parameterFilePath} -t ${targetFile} ${sourceFile}
done

And run it with:
$ ./processDataset.bash sen3_georeferencing.xml s3_georeference.properties /home/user/path/to/data /home/user/path/to/target s3_subset

But when I execute the batch script I get the following error:

Error: [NodeId: Read] Specified 'file' [sourceFile] does not exist.

I specified the source directory correctly, and all products that need to be processed are unzipped in the given directory. It seems to me that "${sourceDirectory}"/S3*.SEN3 can't find or access the xfdumanifest.xml. What did I do wrong?

Maybe you have forgotten to prefix the sourceFile with $ or to surround it with ${} in the graph XML file?

Thanks for your reply. I tried both, but neither of them works. My XML at the section in question looks like this:

  <node id="Read">
    <operator>Read</operator>
    <sources/>
    <parameters class="com.bc.ceres.binding.dom.XppDomElement">
      <file>${sourceFile}</file>
    </parameters>
  </node>

Or do I have to link something in the properties file?

Try changing

${gptPath} ${graphXmlPath} -e -p ${parameterFilePath} -t ${targetFile} ${sourceFile}

to

${gptPath} ${graphXmlPath} -e -p ${parameterFilePath} -t ${targetFile} -PsourceFile=${sourceFile}

The sourceFile is not handled as a product if you use the ReadOp directly. That's why it must be provided as a parameter.
If you remove the ReadOp and use ${sourceProduct} within the first node, then it can work without the '-P'.


I think I'm getting closer, but there are still some issues. When I execute the script, the process for every file except the last is killed with the following message:
.33%.44%.56%.67%.78%./processDataset.bash: Line 50: 5783 Killed ${gptPath} ${graphXmlPath} -e -p ${parameterFilePath} -t ${targetFile} -PsourceFile=${sourceFile}

Line 50 refers to:
for F in $(ls -1d "${sourceDirectory}"/S3*.SEN3); do

Another thing is that the written file isn't inside the target directory but in the directory where the batch script lies. It also has the name $(targetFile).nc instead of the product name with the given prefix.

The corresponding section of the XML file is:

<node id="Write">
    <operator>Write</operator>
    <sources>
      <sourceProduct refid="Subset"/>
    </sources>
    <parameters class="com.bc.ceres.binding.dom.XppDomElement">
      <file>${targetFile}</file>
      <formatName>${format}</formatName>
    </parameters>
  </node>

You may be encountering the dreaded OOM killer (Linux's out-of-memory killer). It kills user processes that use so much memory that they cause problems for system tasks. Linux has many tools that let you monitor CPU load and memory usage: top, htop, and bpytop in text mode, and also graphical process monitors.

It can be tricky to get complex environment-variable settings right. You can use the echo command, e.g.,

echo variable=$variable

in scripts to verify that a variable is being set correctly. A common and tricky-to-diagnose problem is caused by editors that "helpfully" convert ASCII quotes or double quotes to Unicode opening/closing pairs, or dashes to Unicode en dashes, em dashes, or minus signs.
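
A quick way to hunt for such characters is a short Python check like this (just a sketch):

# Report any non-ASCII characters in a script or parameter file; this
# catches "smart" quotes and Unicode dashes introduced by editors.
with open('processDataset.bash', encoding='utf-8', errors='replace') as f:
    for lineno, line in enumerate(f, start=1):
        for ch in line:
            if ord(ch) > 127:
                print('line {}: {!r} (U+{:04X})'.format(lineno, ch, ord(ch)))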


I've done it. As you suspected, some quotes were translated to Unicode. Furthermore, removing the Write part in the XML fixed the wrong directory and name issue. So when using the skeleton, you have to delete the Read and Write operators.