Hello guys,
I am running sen2cor with the configuration set to use 8 processes. My setup is a bash script that loops through all the Sentinel-2 scenes I have and, for each scene, starts an instance of sen2cor to process it:
for scene in /data/sentinel/2/unzipped_scenes/*L2A_*; do
    echo "Doing scene: $scene"
    L2A_Process "$scene"
done
After going through a number of scenes, an instance of L2A_Process hung for many hours. The first time this happened I simply restarted the loop, but when it happened again I decided to investigate what the problem could be.
Using the Python debugger to step through the L2A_Process code:
python -m pdb L2A_Process
I was able to learn that L2A_Process parallelises its work using Python's multiprocessing module.
I turned to the debugger because, apart from sen2cor running much longer than usual for one scene, I had also noticed (by running top) that there were two L2A_Process instances running, each with 0% CPU usage.
I then used strace as follows:
strace -p 1234
That produced output similar to
wait4(…)
select(5678,…) = 0 (Timeout)
What I made of this strace output is that the L2A_Process with PID 1234 was repeatedly waiting, with a one-second timeout, on a second process with PID 5678. So I ran strace on that second L2A_Process (PID 5678) and got:
futex(…)
It turns out wait4, select and futex are Linux system calls, issued here by the Python multiprocessing module. The module spawned worker processes, the last of which kept waiting in a futex for an event that never happened, while the spawning process kept polling so it could eventually join() that last worker.
Because I have quite a lot of Sentinel-2 scenes to go through, I suspect sen2cor ran into this known Python multiprocessing pitfall:
http://sopython.com/canon/82/programs-using-multiprocessing-hang-deadlock-and-never-complete/
The issue was probably caused by the underlying system buffer filling up. An inspection of the L2A_Schedule.py and L2A_ProcessTile.py files shows that although L2A_ProcessTile.py calls queue.put(), L2A_Schedule.py does not always perform a matching queue.get() at the end of each of the first-spawned processes when a scene has more granules than there are processes. With more put() than get() calls, the buffer slowly fills up. Once it is full, a spawned process can no longer flush its queue and so cannot exit; its completion is never detected, the spawned and spawning processes wait for each other indefinitely, and the result is a deadlock.
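For anyone who wants to see this mechanism outside sen2cor, here is a minimal standalone sketch (not sen2cor code; the worker function, data size and timeout are made up). A child that put()s more data than the pipe buffer can hold cannot exit until the parent calls get() — exactly the "joining processes that use queues" pitfall described at the link above:

```python
import multiprocessing as mp

def worker(q):
    # Put ~1 MiB on the queue -- far more than a typical Linux pipe
    # buffer (64 KiB) -- so the child's queue feeder thread blocks
    # until some consumer calls get().
    q.put("x" * (1024 * 1024))

def demo():
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()

    # Joining before draining the queue deadlocks: the child cannot
    # exit while its feeder thread is blocked writing to the full pipe.
    p.join(timeout=2)
    print("alive after join(timeout=2):", p.is_alive())  # True -> stuck

    # Draining the queue unblocks the child; now the join completes.
    q.get()
    p.join()
    print("alive after get() + join():", p.is_alive())  # False

if __name__ == "__main__":
    demo()
```

Swap the timeout-join for a plain p.join() and the parent waits forever, which matches the wait4/select/futex pattern in the strace output above.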
As a temporary solution, I have added a line to L2A_Schedule.py that always performs a queue.get() at the end of every spawned process, so that the host system buffer is drained.
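To illustrate the shape of that fix (a hypothetical sketch only, not the actual L2A_Schedule.py code; the function and granule names are made up): the scheduler performs one get() per spawned process, and does so before any join():

```python
import multiprocessing as mp

def process_tile(result_queue, tile_id):
    # Stand-in for L2A_ProcessTile: do the work, then report back.
    result_queue.put((tile_id, "SUCCESS"))

def schedule(tiles):
    queue = mp.Queue()
    procs = [mp.Process(target=process_tile, args=(queue, t)) for t in tiles]
    for p in procs:
        p.start()
    # One get() per put(): drain the queue *before* joining, so no
    # child is left blocked writing to a full pipe buffer.
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(schedule(["granule_1", "granule_2", "granule_3", "granule_4"]))
```

The key invariant is simply that the number of get() calls matches the number of put() calls, regardless of how many granules a scene has.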
I am not sure what other consequences a full buffer might have. There is also the issue of sen2cor jumping from some intermediate progress percentage, usually around 60%, straight to 100% completion. It could be that the full buffer corrupts communication between the spawned and spawning processes so that progress is misreported, but of this I am not sure, as my investigation was only rudimentary.
-Prince