[strongSwan] Charon fails after recovering from a crash

Fri Jul 26 19:12:29 CEST 2013

I have multiple strongSwan machines, and I have noticed that all
tunnels on them have failed.  Running ipsec statusall shows everything
looking normal, but no SAs are built.  I have tried auto=start,
auto=route, and changes to some DPD settings, hoping to make them
recover, with no luck.

Looking back at the logs, charon seems to have died at some point
(received signal 11) and been restarted automatically.  The problem is
that after being restarted it reports errors and cannot build SAs.
Therefore, when the tunnels time out, they never come back.

I have found this post from 2011 that seems to describe exactly the
same problem.  Unfortunately, the only reply asked the poster to run
4.5.3 and attach GDB:

https://lists.strongswan.org/pipermail/users/2011-August/006521.html

As that post suggests, doing a kill -11 on the process replicates my
problem exactly.  The process dies and is restarted.  Once this
happens, no new SAs are created (even though the existing ones
continue to be used for the time being).  I have tried running ipsec
reload, and that appears to allow new SAs at first, but I have seen it
either not work or fail after a short time.  So far I have been doing
a full ipsec restart to rectify the situation, and I now have a
process watching for this error and restarting strongSwan.  That
guarantees a few dropped packets as the old tunnels are destroyed and
new ones are created.  This is really not what I want, but it is the
best I have right now.
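
For reference, the watcher is roughly along these lines.  This is only
a minimal sketch, not the exact script: the log pattern is an
assumption (the logs mention "received signal 11", but the precise
wording of the line may differ), and the function names are made up
for illustration.

```python
import re
import subprocess

# Assumption: a crash shows up as a log line mentioning "signal 11".
# Adjust the pattern to whatever your syslog actually contains.
CRASH_PATTERN = re.compile(r"signal 11\b")


def saw_crash(log_lines):
    """Return True if any line looks like a charon segfault report."""
    return any(CRASH_PATTERN.search(line) for line in log_lines)


def watch_once(log_lines, run=subprocess.run):
    """Do a full `ipsec restart` if a crash was logged.

    `run` is injectable so the logic can be exercised without
    actually touching the running daemon.
    """
    if saw_crash(log_lines):
        run(["ipsec", "restart"], check=False)
        return True
    return False
```

Feeding it the new lines from the log on each pass is enough; the
full restart is what causes the short outage described above.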

My version reported from ipsec is:
Linux strongSwan U4.5.2/K3.2.0-29-generic

I see that 5.0.4 is available, so I compiled that and gave it a try.
As I cannot test what was causing the segfault, I have tested a kill
-11 on the process to see what it would do in the event that a
segfault does occur.  This version behaves very similarly to the 4.5.2
I am running.  The tunnels stay up until the starter notices charon
isn't running.  It correctly starts the process again, but at this
point charon fails to insert into the SPD.  Looking at ip xfrm, I can
see that when it tries this insert it runs into the existing entries
that were still around because the process died unexpectedly.  It
seems that when it tries to insert a second time, it actually knocks
the existing (working) tunnels out of the table.  This causes the
tunnels to die.  If I issue the ipsec reload command, it tries one
more time, and this time it succeeds, allowing the tunnels to come
back up.  I can cross my fingers that this version doesn't segfault,
but it seems it should do something about the SPD errors (maybe retry
the insert?) so that it can recover if the process does die.
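
To illustrate what I mean by retrying, here is a minimal sketch of the
idea in Python.  It is not how charon works internally (charon is
written in C, and I have not looked at that code path); `operation`
stands in for a hypothetical "install this policy" step that returns
True on success.

```python
import time


def retry(operation, attempts=3, delay=1.0, sleep=time.sleep):
    """Call operation() until it returns True or attempts run out.

    Returns the number of the attempt that succeeded, or None if
    every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if operation():
            return attempt
        if attempt < attempts:
            # Pause before the next try, but not after the last one.
            sleep(delay)
    return None
```

In my tests the manual ipsec reload plays the role of that extra
attempt, which is why it seems a built-in retry might be enough.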

I could really use some direction here as to how this is expected to
work, or whether switching to 5.0.4 might prevent the root cause (the
segfault).