Hello guys,
I am running sen2cor with the configuration set to use 8 processes. My setup is a bash script that loops through all the Sentinel-2 scenes I have and, for each scene, starts an instance of sen2cor to process it:
for scene in /data/sentinel/2/unzipped_scenes/*L2A_*; do
    echo "Doing scene: $scene"
    L2A_Process "$scene"
done
After going through a number of scenes, an instance of L2A_Process hung for many hours. The first time this happened I simply restarted the loop, but when it happened again I decided to investigate what the problem could be.
Using the Python debugger to step through the L2A_Process code:
python -m pdb L2A_Process
I was able to learn that L2A_Process parallelises its work using Python's multiprocessing module.
I turned to the debugger because, apart from sen2cor running much longer than usual for one scene, I had also noticed (by running top) that there were two L2A_Process instances running, each with 0% CPU usage.
I then used strace as follows:
strace -p 1234
That produced output similar to
wait4(…)
select(5678,…) = 0 (Timeout)
What I made of this strace output is that the L2A_Process with PID 1234 was repeatedly waiting, with a one-second timeout, on a second process with PID 5678. So I ran strace on that second L2A_Process (PID 5678) and got:
futex(…)
It turns out wait4, select and futex are Linux system calls, issued here by the Python multiprocessing module. The module spawned worker processes, the last of which kept waiting in a futex for an event that never happened, while the spawning process kept polling so it could eventually join() that last worker.
Because I have quite a lot of Sentinel-2 scenes to go through, I suspect sen2cor ran into this known Python multiprocessing pitfall:
http://sopython.com/canon/82/programs-using-multiprocessing-hang-deadlock-and-never-complete/
The issue was probably caused by the underlying system buffer filling up. An inspection of the L2A_Schedule.py and L2A_ProcessTile.py files shows that although L2A_ProcessTile.py calls queue.put(), L2A_Schedule.py does not always perform a matching queue.get() at the end of each of the first-spawned processes when a scene has more granules than there are processes. With more put() than get() calls, the buffer slowly fills up. Once it is full, a spawned process can no longer flush its queue and so cannot exit; its completion is never detected, the spawned and spawning processes wait for each other indefinitely, and the result is a deadlock.
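For anyone who wants to see this mechanism outside sen2cor, here is a minimal standalone sketch (not sen2cor code; the worker function, data size and timeout are made up). A child that put()s more data than the pipe buffer can hold cannot exit until the parent calls get() — exactly the "joining processes that use queues" pitfall described at the link above:

```python
import multiprocessing as mp

def worker(q):
    # Put ~1 MiB on the queue -- far more than a typical Linux pipe
    # buffer (64 KiB) -- so the child's queue feeder thread blocks
    # until some consumer calls get().
    q.put("x" * (1024 * 1024))

def demo():
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()

    # Joining before draining the queue deadlocks: the child cannot
    # exit while its feeder thread is blocked writing to the full pipe.
    p.join(timeout=2)
    print("alive after join(timeout=2):", p.is_alive())  # True -> stuck

    # Draining the queue unblocks the child; now the join completes.
    q.get()
    p.join()
    print("alive after get() + join():", p.is_alive())  # False

if __name__ == "__main__":
    demo()
```

Swap the timeout-join for a plain p.join() and the parent waits forever, which matches the wait4/select/futex pattern in the strace output above.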
As a temporary solution, I have added a line to L2A_Schedule.py that always performs a queue.get() at the end of every spawned process, so that the host system buffer is drained.
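To illustrate the shape of that fix (a hypothetical sketch only, not the actual L2A_Schedule.py code; the function and granule names are made up): the scheduler performs one get() per spawned process, and does so before any join():

```python
import multiprocessing as mp

def process_tile(result_queue, tile_id):
    # Stand-in for L2A_ProcessTile: do the work, then report back.
    result_queue.put((tile_id, "SUCCESS"))

def schedule(tiles):
    queue = mp.Queue()
    procs = [mp.Process(target=process_tile, args=(queue, t)) for t in tiles]
    for p in procs:
        p.start()
    # One get() per put(): drain the queue *before* joining, so no
    # child is left blocked writing to a full pipe buffer.
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(schedule(["granule_1", "granule_2", "granule_3", "granule_4"]))
```

The key invariant is simply that the number of get() calls matches the number of put() calls, regardless of how many granules a scene has.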
I am not sure what other consequences a full buffer might have. There is also the issue of sen2cor jumping from some intermediate progress percentage, usually around 60%, straight to 100% completion. It could be that the full buffer corrupts communication between the spawned and spawning processes so that progress is misreported, but of this I am not sure, as my investigation was only rudimentary.
-Prince