Intermittent SEGV starting 2.4.0 on Linux

I have installed 2.4.0 under SNAP 6.0, and sometimes it crashes with a segmentation fault on startup. Other attempts with exactly the same parameters succeed.

I have a gdb backtrace — where should I report a problem like this?

Many thanks

You can post the gdb trace here and we can have a look.

Reading symbols from /home/djch/.snap/auxdata/Sen2Cor-2.4.0-Linux64/bin/python2.7...(no debugging symbols found)...done.
(gdb) run
Starting program: /home/djch/.snap/auxdata/Sen2Cor-2.4.0-Linux64/bin/python2.7 /home/djch/.snap/auxdata/Sen2Cor-2.4.0-Linux64/lib/python2.7/site-packages/sen2cor/ --resolution 10 /home/djch/household/E/Y2project/S2_zips_port/S2A_MSIL1C_20171002T112111_N0205_R037_T29SNC_20171002T113001.SAFE --GIP_L2A /tmp/L2A-GIPP-custom.xml
[New LWP 23689]
[New LWP 23690]
[New LWP 23691]
[New LWP 23692]

Sentinel-2 Level 2A Processor (Sen2Cor), 2.4.0, created: 2017.06.05 started …

Thread 1 "python2.7" received signal SIGSEGV, Segmentation fault.
0x00007fe9b0b5a088 in memcpy () from /home/djch/.snap/auxdata/Sen2Cor-2.4.0-Linux64/lib/
(gdb) bt
#0 0x00007fe9b0b5a088 in memcpy () from /home/djch/.snap/auxdata/Sen2Cor-2.4.0-Linux64/lib/
#1 0x00007fe9b0b2778f in __copy_tls () from /home/djch/.snap/auxdata/Sen2Cor-2.4.0-Linux64/lib/
#2 0x00007fe9b0afd448 in _PyThreadState_GetFrame () from /home/djch/.snap/auxdata/Sen2Cor-2.4.0-Linux64/lib/
#3 0x0000000000016000 in ?? ()
#4 0x00007fe9a8017000 in ?? ()
#5 0x0000000000001000 in ?? ()
#6 0x00007fe9a802c998 in ?? ()
#7 0x00007fe9b0b5bd04 in pthread_create () from /home/djch/.snap/auxdata/Sen2Cor-2.4.0-Linux64/lib/
#8 0x00007fffffffd620 in ?? ()
#9 0x00007fe9a8018000 in ?? ()
#10 0x00007fffffffd598 in ?? ()
#11 0x00007fffffffd618 in ?? ()
#12 0x00007fe9b0844210 in ?? () from /home/djch/.snap/auxdata/Sen2Cor-2.4.0-Linux64/lib/
#13 0x0000000100a0b6e0 in ?? ()
#14 0x0000000000000020 in ?? ()
#15 0x0000000000000000 in ?? ()

(gdb) x/10i $pc-20
0x7fe9b0b5a074 <memcpy+16>: or $0xa4,%al
0x7fe9b0b5a076 <memcpy+18>: dec %rdx
0x7fe9b0b5a079 <memcpy+21>: test $0x7,%edi
0x7fe9b0b5a07f <memcpy+27>: jne 0x7fe9b0b5a075 <memcpy+17>
0x7fe9b0b5a081 <memcpy+29>: mov %rdx,%rcx
0x7fe9b0b5a084 <memcpy+32>: shr $0x3,%rcx
=> 0x7fe9b0b5a088 <memcpy+36>: rep movsq %ds:(%rsi),%es:(%rdi)
0x7fe9b0b5a08b <memcpy+39>: and $0x7,%edx
0x7fe9b0b5a08e <memcpy+42>: je 0x7fe9b0b5a095 <memcpy+49>
0x7fe9b0b5a090 <memcpy+44>: movsb %ds:(%rsi),%es:(%rdi)

(gdb) i r
rax 0x84007885a80047b0 -8935009145258489936
rbx 0x7fe9a802cab0 140641522862768
rcx 0xfa2a260002f8000 1126641389999718400
rdx 0x7d151300017c0007 9013131119997747207
rsi 0x6400028300056a00 7205762165457119744
rdi 0x84007885a80047b0 -8935009145258489936
rbp 0x100dd0398 0x100dd0398
rsp 0x7fffffffd528 0x7fffffffd528
r8 0x7fe9a8018000 140641522778112
r9 0x0 0
r10 0x22 34
r11 0x206 518
r12 0x7fe9a802c998 140641522862488
r13 0x7fe9a802c9b8 140641522862520
r14 0x7fe9a802cc00 140641522863104
r15 0x7fe9b0d95b48 140641671142216
rip 0x7fe9b0b5a088 0x7fe9b0b5a088 <memcpy+36>
eflags 0x10207 [ CF PF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0

As I read it, this is the copy-8-bytes-at-a-time loop in memcpy(). The value in rcx is huge — the number of bytes requested is clearly wrong.
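As a sanity check on that reading (just arithmetic on the values printed above), the `mov %rdx,%rcx ; shr $0x3,%rcx` sequence in the disassembly means rcx should equal rdx shifted right by 3 — and the dumped registers do line up, so the bogus length was already in rdx (the byte-count argument) before memcpy entered its qword loop:

```python
# Register values copied from the `i r` dump above.
rdx = 0x7D151300017C0007   # byte count passed to memcpy (garbage)
rcx = 0x0FA2A260002F8000   # qword count used by `rep movsq`

# The disassembly does: mov %rdx,%rcx ; shr $0x3,%rcx
assert rdx >> 3 == rcx, "register dump is not self-consistent"

# How large a copy was requested?
print("requested copy: ~%d EiB" % ((rcx * 8) >> 60))  # → requested copy: ~7 EiB
```

A request to copy roughly 7 exbibytes is obviously corrupt, which fits the theory that the length came from already-trashed memory rather than a plausible miscalculation.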

This is intermittent - so I guess there’s a race condition between the threads as they start up.

Please let me know what extra information you need.


Tried a quick experiment — there's a suspicion that the default musl thread stack may be too small.

I replaced the python2.7 binary in .snap/auxdata/Sen2Cor-2.4.0-Linux64/bin (which uses musl) with the stock python2.7 binary on my machine (which uses glibc). Nothing else changed.

Sen2Cor now seems to run reliably, whether invoked from the command line or via Optical -> Thematic Land Processing -> Sen2Cor in SNAP 6.0. The only odd thing is that the SNAP Java binary runs at 100% CPU on one core the whole time Sen2Cor is running (itself at 100% on another core).

I suggest that something in the change from glibc to musl has caused the threading in python2.7 to become unreliable on startup.
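For anyone reproducing this, a quick way to check which libc a given Python build is running against is the standard library's `platform.libc_ver()`. One caveat (worth hedging): it only knows how to detect glibc by scanning the interpreter binary, so on a musl build it typically returns empty strings rather than naming musl explicitly:

```python
import platform

# On a glibc build this reports something like ('glibc', '2.27');
# on a musl-linked interpreter it returns ('', '') because
# platform.libc_ver() only recognises glibc signatures.
lib, ver = platform.libc_ver()
print(lib or "not glibc (possibly musl)", ver)
```

That makes it easy to confirm which interpreter (the bundled musl one or a system glibc one) actually ran in a given test.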

Many thanks


Thanks @g8sqh for the gdb trace and also for tracking down this issue.
I will check the thread stack size on the musl platform and let you know. If a patch is needed, I will push it to the sen2cor team.

EDIT: can you try the workaround?

I did try it — with 2M and 4M stack sizes. Sorry, it did not fix the problem. (I did see _fini in the gdb trace, so I know the module was being dynamically linked in.)
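For reference, if the workaround being tested was raising the thread stack size, there is also a Python-level knob for the same hypothesis. This is only a sketch — where sen2cor actually creates its worker threads is an assumption, so the call site below is hypothetical — but `threading.stack_size()` sets the stack size used for threads started afterwards:

```python
import threading

# Must be set before the threads are started; 4 MiB here,
# matching the larger of the two sizes tried above.
threading.stack_size(4 * 1024 * 1024)

t = threading.Thread(target=lambda: None)  # stand-in for a sen2cor worker
t.start()
t.join()

print(threading.stack_size())  # → 4194304
```

Note this only affects threads created by the interpreter after the call, so it would not help if the crash happens in a thread spawned by a C extension with its own pthread_attr settings.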