[strongSwan-dev] deadlock in strongswan 4.6.4

Ansis Atteka ansisatteka at gmail.com
Thu Mar 7 05:17:56 CET 2013


On 6 March 2013 17:03, Ansis Atteka <ansisatteka at gmail.com> wrote:
> Hello,
>
> The "ipsec down" command has been stuck for a while due to a deadlock:
>
> root     18135 17201 0 Mar06 ?        00:00:00 /bin/sh
> /usr/sbin/ipsec down remote-60.1.0.7[*]
>
>
>
> The problem is that Thread#3 has locked mutex 0x7f739457e8a0 in
> child_updown() and now wants to lock another mutex, 0x7f73945acaf0,
> in get_interface_name(). But that second mutex is already held by
> Thread#5. At the same time, Thread#5 is waiting for mutex
> 0x7f739457e8a0 (taken by Thread#3 in child_updown()) to become
> available. Hence the deadlock.
>
> I went over the surrounding code in Thread#5's stack trace, and it
> is not obvious to me where it locked the 0x7f73945acaf0 mutex. Any
> help would be appreciated!
>
>
>
> (gdb) thread 3
> [Switching to thread 3 (Thread 0x7f737fbab700 (LWP 17286))]
> #0  0x00007f7392d6f89c in __lll_lock_wait () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> (gdb) bt
> #0  0x00007f7392d6f89c in __lll_lock_wait () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #1  0x00007f7392d6b065 in _L_lock_858 () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #2  0x00007f7392d6aeba in pthread_mutex_lock () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #3  0x00007f73933fa2ca in lock (this=<optimized out>) at threading/mutex.c:105
> #4  0x00007f73933fa3e7 in lock_r (this=0x7f73945acaf0) at
> threading/mutex.c:140 <---------------Thread#5 is already holding this
> mutex, so we have to wait and can't release the other mutex.
> #5  0x00007f73894430be in get_interface_name (this=0x7f73945aca40,
> ip=0x7f73400025d0) at kernel_netlink_net.c:728
> #6  0x00007f7388a1bcff in child_updown (this=0x7f73945b2dc0,
> ike_sa=0x7f7340000b60, child_sa=0x7f734400afc0, up=true) at
> updown_listener.c:200
> #7  0x00007f7392f8580a in child_updown (this=0x7f739457e700,
> child_sa=0x7f734400afc0, up=true) at bus/bus.c:519
> <---------------------Here we locked mutex that thread #5 is waiting
> for
> #8  0x00007f7392fa994f in process_i (this=0x7f7340002950,
> message=<optimized out>) at sa/tasks/child_create.c:1173
> #9  0x00007f7392fa44be in process_response (message=0x7f7350001d90,
> this=0x7f7340001db0) at sa/task_manager.c:513
> #10 process_message (this=0x7f7340001db0, msg=0x7f7350001d90) at
> sa/task_manager.c:970
> #11 0x00007f7392f9de8d in process_message (this=0x7f7340000b60,
> message=0x7f7350001d90) at sa/ike_sa.c:1323
> #12 0x00007f7392f96980 in execute (this=0x7f73500010d0) at
> processing/jobs/process_message_job.c:74
> #13 0x00007f73933f7951 in process_jobs (this=0x7f739457d3e0) at
> processing/processor.c:188
> #14 0x00007f73933f9b20 in thread_main (this=0x7f73945bfd70) at
> threading/thread.c:305
> #15 0x00007f7392d68e9a in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #16 0x00007f7392a964bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #17 0x0000000000000000 in ?? ()
> (gdb) frame 4
> #4  0x00007f73933fa3e7 in lock_r (this=0x7f73945acaf0) at threading/mutex.c:140
> 140     in threading/mutex.c
> (gdb) print /x this->thread
> $22 = 0x7f7380bad700 <---------this mutex is already taken by thread 5
> (gdb) frame 7
> #7  0x00007f7392f8580a in child_updown (this=0x7f739457e700,
> child_sa=0x7f734400afc0, up=true) at bus/bus.c:519
> (gdb) print this->mutex
> $25 = (mutex_t *) 0x7f739457e8a0 <--- Thread #3 is waiting for this
> mutex to become available
>
>
>
> (gdb) thread 5
> [Switching to thread 5 (Thread 0x7f7380bad700 (LWP 17284))]
> #0  0x00007f7392d6f89c in __lll_lock_wait () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> (gdb) bt
> #0  0x00007f7392d6f89c in __lll_lock_wait () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #1  0x00007f7392d6b065 in _L_lock_858 () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #2  0x00007f7392d6aeba in pthread_mutex_lock () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #3  0x00007f73933fa2ca in lock (this=<optimized out>) at threading/mutex.c:105
> #4  0x00007f73933fa3e7 in lock_r (this=0x7f739457e8a0) at threading/mutex.c:140
> #5  0x00007f7392f8528e in vlog (this=0x7f739457e700, group=DBG_ENC,
> level=LEVEL_DIAG, format=0x7f7392fb9b20 "added payload of type %N to
> message", args=0x7f7380baca78) at bus/bus.c:280
> #6  0x00007f7392f8534d in log_ (this=<optimized out>, group=<optimized
> out>, level=<optimized out>, format=<optimized out>) at bus/bus.c:295
> #7  0x00007f7392f8c618 in add_payload (this=<optimized out>,
> payload=0x7f73945cd150) at encoding/message.c:724
> #8  0x00007f7392f8cb6d in add_notify (this=0x7f73945ca600,
> flush=<optimized out>, type=ADDITIONAL_IP6_ADDRESS, data=...) at
> encoding/message.c:745
> #9  0x00007f7392fb1458 in build_address_list (message=0x7f73945ca600,
> this=<optimized out>) at sa/tasks/ike_mobike.c:213
> #10 0x00007f7392fb1a25 in build_i (this=0x7f734c0080d0,
> message=0x7f73945ca600) at sa/tasks/ike_mobike.c:335
> #11 0x00007f7392fa3d96 in initiate (this=0x7f734c006820) at
> sa/task_manager.c:444
> #12 initiate (this=0x7f734c006820) at sa/task_manager.c:282
> #13 0x00007f7392fa4b5b in process_response (message=0x7f7350000d50,
> this=0x7f734c006820) at sa/task_manager.c:548
> #14 process_message (this=0x7f734c006820, msg=0x7f7350000d50) at
> sa/task_manager.c:970
> #15 0x00007f7392f9de8d in process_message (this=0x7f734c006b40,
> message=0x7f7350000d50) at sa/ike_sa.c:1323
> #16 0x00007f7392f96980 in execute (this=0x7f73500017d0) at
> processing/jobs/process_message_job.c:74
> #17 0x00007f73933f7951 in process_jobs (this=0x7f739457d3e0) at
> processing/processor.c:188
> #18 0x00007f73933f9b20 in thread_main (this=0x7f73945bf770) at
> threading/thread.c:305
> #19 0x00007f7392d68e9a in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #20 0x00007f7392a964bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #21 0x0000000000000000 in ?? ()
> (gdb) frame 4
> #4  0x00007f73933fa3e7 in lock_r (this=0x7f739457e8a0) at threading/mutex.c:140
> 140     in threading/mutex.c
> (gdb) print /x this->thread
> $24 = 0x7f737fbab700 <-----------this lock is already taken by thread 3
>
>
> Best regards,
> Ansis Atteka

Since I observed this bug on the 4.6.4 branch, I am wondering whether it
would make sense to backport the following commit from the master
branch. I still do not see where Thread#5 locked the first mutex,
though; any hints would be welcome. We currently plan to use strongSwan
4.6.4 for Open vSwitch IPsec tunneling, and this seems to be the last
remaining issue before we make the release publicly available.


commit f9f867899ae5f7198d2c80bf2519e542556095ba
Author: Tobias Brunner <tobias at strongswan.org>
Date:   Thu Dec 29 19:13:08 2011 +0100

    Use a separate list and mutex for loggers.

    This avoids deadlocks caused by extensive listener_t implementations
    which might want to acquire a lock which is currently held by another
    thread wanting to log messages. Since the latter requires that thread
    to acquire the same lock the initial thread currently holds this
    previously resulted in a deadlock.

    With this change logging messages does not require threads to acquire
    the main lock in bus_t and thus avoids the deadlock.



