Killing graph in command line doesn't remove "empty" .dim/.data

Hello! This is a pretty specific, but I think simple, issue I'm trying to overcome when running a graph from the command line (not the SNAP GUI). Hopefully there are some bash users in here who might be able to give me some suggestions.

I’m testing out parallel processing in this bash script:

#!/bin/bash

# path to snap and gpt executable
export PATH=/home/user/esasnap/bin:$PATH
gptPath="gpt"

############################################
# Command line handling
############################################

# first parameter is a path to the graph xml
graphXmlPath="$1"

# second parameter is a path to a parameter file (.properties) for graph parameters
parameterFilePath="$2"

# use third parameter for path to source products
sourceDirectory="$3"

# use fourth parameter for path to target products
targetDirectory="$4"

# the fifth parameter is a file prefix for the target product name, typically indicating the type of processing
# Orb_NR_Cal_TC
targetFilePrefix="$5"

# set number of processes
numProcesses="$6"

############################################
# Helper functions
############################################
removeExtension() {
    # strip the final extension (e.g. ".zip") using parameter expansion
    echo "${1%.*}"
}

N=$numProcesses

for file in "${sourceDirectory}"/*.zip
do
    (
      # create target file name
      sourceFile="$(realpath "$file")"
      targetFile="${targetDirectory}/${targetFilePrefix}_$(removeExtension "$(basename ${file})").dim"
  
      # do not overwrite existing files
      if [ -f "$targetFile" ]
      then
        echo "FILE $targetFile ALREADY EXISTS"
      else
        echo "WORKING ON $sourceFile"
        # only report success if gpt actually exited cleanly
        "$gptPath" "$graphXmlPath" -e -p "$parameterFilePath" -t "$targetFile" "$sourceFile" \
          && echo "SUCCESS! EXPORTED TO $targetFile"
      fi
    ) &

    # allow up to $N jobs to run in parallel
    if [[ $(jobs -r -p | wc -l) -ge $N ]]; then
        # $N jobs are already running, so wait for any one of them to
        # finish before starting the next (wait -n requires bash >= 4.3)
        wait -n
    fi

done

# no more jobs to be started but wait for pending jobs
# (all need to be finished)
wait

echo "all done"

Context: This script works fine, but I’m tinkering with the number of processes I want to create. If I set N processes to 10, for example, the beefy graph uses more cores than I want. So, I’m just playing around with N to get the optimal number of processes to kick off without taking up too many cores (I am using a shared system).

The problem: While I'm testing and kill a run with ctrl-c, the processes die, but the .dim files and .data directories remain. When I re-run the script, it sees that the .dim files have been created and skips them, even though those .data directories don't have any data in them. I don't want to overwrite any actually populated .data directories, so I can't just delete or overwrite whatever exists. BUT, if I've just killed a process, I want the files/dirs it had just created to be deleted (or safely moved elsewhere).
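
One pattern that might help here (a sketch, not tested against your setup): gpt exits non-zero when the graph fails or the process is killed, so the worker subshell can remove its own partial output. The targetData line below assumes gpt writes the .data directory next to the -t target, which is standard for BEAM-DIMAP:

      # inside the ( ... ) & worker, replacing the bare gpt call:
      targetData="${targetFile%.dim}.data"
      if "$gptPath" "$graphXmlPath" -e -p "$parameterFilePath" -t "$targetFile" "$sourceFile"
      then
        echo "SUCCESS! EXPORTED TO $targetFile"
      else
        # non-zero exit covers both graph errors and death by signal;
        # remove the partial .dim/.data so the next run reprocesses this product
        echo "CLEANING UP PARTIAL $targetFile" >&2
        rm -rf "$targetFile" "$targetData"
      fi

Caveat: if ctrl-c kills the subshell itself before the cleanup line runs, the leftovers will still appear, so a startup check like the one sketched below is a useful complement.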

Easy Solution: My solution is to just look inside the .data directory for the bands I expect, and if they don't exist, delete the files. But I would like to know if there's a better way to do what I want: remove .dim/.data pairs that were just generated but that have no .img/.hdr files created yet.
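
For that startup check, you may not even need the expected band names: a .data directory left behind by a killed run typically contains no .img rasters at all. A sketch, assuming bash (compgen -G is a bash builtin glob test) and treating "has at least one .img" as "populated":

    # before the existing "do not overwrite" check:
    targetData="${targetFile%.dim}.data"
    if [ -f "$targetFile" ] && ! compgen -G "${targetData}/*.img" > /dev/null
    then
        # .dim exists but no band rasters were written: leftover from a killed run
        echo "REMOVING EMPTY LEFTOVER $targetFile" >&2
        rm -rf "$targetFile" "$targetData"
    fi

If you do want the stricter per-band version you describe, I believe the band list is in the .dim XML header (the Data_File entries), so you could compare that against what's actually on disk.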

Thanks!

Morgan

#snap #s1tbx

Another issue: I am maxing out my memory trying to run two processes. Any advice on optimizing the performance of my graph without blowing up my memory? Thanks.

Update: I'm now seeing that SNAP should be parallelizing on its own by splitting up the input image into tiles, so maybe my bash multiprocessing is redundant. I am new to managing memory and optimization in general, so any advice is also welcome. I see that I can increase the cache size; maybe that will help when running only one process.
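
For what it's worth, gpt exposes knobs for exactly this, so it may be easier to cap each process than to juggle N in bash. The numbers below are placeholders to tune for your machine, not recommendations:

# -q : threads gpt uses for tile computation (defaults to all cores)
# -c : tile cache size; keep it comfortably below the JVM heap
# -J-Xmx : max Java heap for this gpt process; running N of these in
#          parallel needs roughly N x this much RAM, which is likely
#          why two processes max out your memory
gpt graph.xml -e -q 4 -c 4G -J-Xmx6G \
    -p params.properties -t target.dim source.zip

I believe the default heap can also be raised globally in esasnap/bin/gpt.vmoptions if you'd rather not pass -J-Xmx on every run.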