[strongSwan-dev] charon deadlock involving three threads

Thu Dec 22 15:34:29 CET 2011

Hi all,

motion seconded :)

I have also seen strongSwan deadlock. Because I have not yet reproduced it
with a vanilla build I had not mentioned it so far, but now seems a good time
to chime in.

I've been initiating many unique connections from the same host to mimic a
busy server. To that end I've configured strongSwan (using ECC certificates
and IKEv2) with the following settings:

charon {
  reuse_ikesa = no
  block_threshold = 100
}

When many connections are initiated (using randist for exponentially
distributed inter-arrival times), the system deadlocks. I've seen this happen
at 40-50 initiated connections within a period of 20 seconds. A short-term
workaround is to increase the number of worker threads of strongSwan;
however, it then just deadlocks at a higher number of initiated connections.
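
For anyone who wants to build a similar load pattern, here is a minimal
sketch of how such an arrival schedule can be generated. It is not my actual
test harness (that one uses randist); the function name, the rate of 2.5
connections per second (roughly 50 connections in 20 seconds) and the seed
are illustrative assumptions.

```python
import random


def arrival_times(rate_per_second, count, seed=None):
    """Return `count` cumulative arrival times (in seconds) whose gaps are
    exponentially distributed with mean 1 / rate_per_second."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(count):
        # expovariate(lambd) draws one exponential inter-arrival gap.
        t += rng.expovariate(rate_per_second)
        times.append(t)
    return times


# Hypothetical parameters: about 50 initiations in about 20 seconds.
schedule = arrival_times(rate_per_second=2.5, count=50, seed=1)
```

A load generator would then sleep until each entry in `schedule` and start
one new connection at that moment.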

It might very well be that my modifications affect the chance of this
deadlock happening, as I modified (added something to) the handshake.
However, I've managed to reproduce this bug with the above settings on
strongSwan version 4.5.3, which does not contain my modifications to the
source code but is compiled with some configure options (most notably
+openssl -gmp), on two Ubuntu machines.

I intend(ed) to find out if this also occurs when no configuration options
are given, to make sure this is a strongSwan bug and not introduced by
something I did.

I'll be happy to provide more details where so desired.

Kind regards,

Jan Willem Beusink

2011/12/22 Thomas Egerer <thomas.egerer at secunet.com>

> Hello *,
>
> I recently discovered a deadlock which can occur in charon. It was
> caused by one of my own plugins but I managed to reproduce it using
> an almost vanilla upstream charon version 4.6.1. Two patches are
> attached to this mail:
> Patch 0001 is to create a backend with a peer config which does not
> have a child with the same name. I used stroke's backend for this
> purpose.
> Patch 0002 is simply to enlarge the window in which the deadlock can
> occur and make it reproducible.
>
> The deadlock occurs when the following happens (in the given order):
> a) an IKE_SA is built and a thread is processing the IKE_AUTH request,
>   which can take a bit longer when a smartcard is involved. This
>   causes the ike_sa_manager to lock a particular IKE_SA exclusively.
> b) an acquire is triggered, which causes the rwlock in the trap_manager
>   to be read-locked; the subsequent call to
>   ike_sa_manager->checkout_by_config has to wait until a) unlocks
>   its ike_sa.
> c) a child_cfg contained in the peer_cfg belonging to the ike_sa that
>   a) has locked is routed. This causes the child_cfgs contained
>   in the peer config to be locked by c), while the actual routing
>   code within trap_manager tries to write-lock its rwlock.
>
> That's about it. As soon as a) finishes authentication of the peer
> and tries to find a matching child sa, it will try to lock the child
> configs of the peer config, which is not possible since they have been
> locked by c).
>
> Thread | Resource locked                | Resource desired
> -------+--------------------------------+--------------------------------
>  (a)   | ike_sa in ike_sa_manager       | child_cfgs of peer_cfg
>        |                                |
>  (b)   | rwlock in trap_manager (read)  | ike_sa in ike_sa_manager
>        |                                |
>  (c)   | child_cfgs of peer_cfg         | rwlock in trap_manager (write)
>
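
The table above can be read as a wait-for graph. Below is a minimal sketch
that models it and searches for a cycle. It is a simplification and not
charon code: every resource is treated as having a single holder (so the
read-locked rwlock counts as held by thread b only), and all names are
labels I made up for illustration.

```python
def find_cycle(wait_for):
    """Return the threads forming a cycle in a wait-for graph, or None.

    `wait_for` maps each blocked thread to the one thread it waits on.
    """
    for start in wait_for:
        path = [start]
        seen = {start}
        node = wait_for.get(start)
        while node is not None:
            if node == start:
                return path
            if node in seen:
                break  # a cycle exists, but it does not include `start`
            path.append(node)
            seen.add(node)
            node = wait_for.get(node)
    return None


# Resource each thread holds and resource each thread wants (from the table).
holds = {"a": "ike_sa", "b": "trap_rwlock", "c": "child_cfgs"}
wants = {"a": "child_cfgs", "b": "ike_sa", "c": "trap_rwlock"}

owner = {resource: thread for thread, resource in holds.items()}
wait_for = {thread: owner[resource] for thread, resource in wants.items()}

cycle = find_cycle(wait_for)
print(cycle)
```

Thread a waits on c, c waits on b, and b waits on a, so none of the three can
make progress.
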
> Here are the configs used to reproduce the deadlock and the steps
> to perform to lock charon up. The setup involves two boxes, psk1 and psk2.
> ipsec.conf of psk1:
>
> config setup
>        charonstart=yes
>        plutostart=no
>
> conn %default
>      keyexchange=ikev2
>      authby=psk
>
> conn psk_del
>      left=192.168.178.1
>      right=192.168.178.2
>      auto=add
>      type=tunnel
>
> conn psk_keep
>      left=192.168.178.1
>      leftsubnet=192.168.178.1/32
>      right=192.168.178.2
>      rightsubnet=192.168.178.2/32
>      auto=add
>      type=tunnel
>
> conn acquire_conn
>      left=192.168.178.1
>      leftsubnet=192.168.178.1/32
>      right=192.168.178.3
>      type=tunnel
>      auto=route
>
> ipsec.conf of psk2:
>
> ################################################
> config setup
>        charonstart=yes
>        plutostart=no
>
> conn %default
>      keyexchange=ikev2
>      authby=psk
>
> conn psk
>      left=192.168.178.2
>      leftsubnet=192.168.178.2/32
>      right=192.168.178.1
>      rightsubnet=192.168.178.1/32
>      type=tunnel
>      auto=route
>
> ################################################
> Steps to perform on psk1/psk2:
> #psk1> ipsec stroke del psk_del
> #psk2> ipsec stroke up psk
> #psk1> ping 192.168.178.3
> #psk1> ipsec stroke route psk_del
> #psk1> touch /tmp/unlock_child_create
>
>
>
> Chances of triggering this deadlock are pretty slim given the
> tiny time window and the low probability that all three actions
> are performed in this order at the same time.
> Yet it's pretty easy to write code that routes/unroutes child_sas
> while enumerating a peer_cfg's children. So I think you should
> know.
> I can also provide the ISO-image in which I reproduced the
> deadlock and a core dump of the deadlocked charon instance.
>
> Cheers Thomas
