[strongSwan-dev] Fault Restart Issue with Key Sockets

Sat Jun 6 00:00:27 CEST 2015

> > With that in mind, would the fix for this problem be to handle EEXIST 
> > in add_policy_internal by replacing
>
> Yes, that should work for this particular case.  Not sure if there are situations where updating existing policies is unwanted (for the daemon this still looks like it added the policies, so it will eventually remove them).  Anyway, I pushed changes for the kernel-pfkey and kernel-netlink plugins to the policy-update-eexist branch [1].
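
For readers following the thread, the idea of handling EEXIST by falling back to an update can be sketched roughly as below. This is only an illustration under stated assumptions: the in-memory `spd` table and the names `kernel_policy_add`, `kernel_policy_update` and `add_policy_with_fallback` are hypothetical stand-ins, not the actual `add_policy_internal` code or the contents of the policy-update-eexist branch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's security policy database (SPD). */
#define MAX_POLICIES 8

typedef struct {
    int id;      /* stands in for traffic selectors + direction (assumption) */
    int reqid;   /* reqid associated with the policy */
    bool in_use;
} policy_t;

static policy_t spd[MAX_POLICIES];

/* Look up a policy by id; returns NULL if it is not installed. */
static policy_t *spd_find(int id)
{
    for (size_t i = 0; i < MAX_POLICIES; i++) {
        if (spd[i].in_use && spd[i].id == id) {
            return &spd[i];
        }
    }
    return NULL;
}

/* Mimics an "add" request: fails with EEXIST if the policy is present. */
static int kernel_policy_add(int id, int reqid)
{
    if (spd_find(id) != NULL) {
        return EEXIST;
    }
    for (size_t i = 0; i < MAX_POLICIES; i++) {
        if (!spd[i].in_use) {
            spd[i] = (policy_t){ .id = id, .reqid = reqid, .in_use = true };
            return 0;
        }
    }
    return ENOSPC;
}

/* Mimics an "update" request: fails with ENOENT if the policy is missing. */
static int kernel_policy_update(int id, int reqid)
{
    policy_t *p = spd_find(id);
    if (p == NULL) {
        return ENOENT;
    }
    p->reqid = reqid;
    return 0;
}

/* Try to add; if a policy left over from a previous daemon run is already
 * installed, replace it in place instead of reporting a failure. */
static int add_policy_with_fallback(int id, int reqid)
{
    assert(id >= 0); /* ids are non-negative in this sketch */
    int err = kernel_policy_add(id, reqid);
    if (err == EEXIST) {
        err = kernel_policy_update(id, reqid);
    }
    return err;
}
```

In this sketch the caller sees success in both cases, which matches the remark above that the daemon still believes it added the policy and will therefore remove it later.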

[Robinson, Herbie] This seems to be working for us, although we have more testing to do.  We are not running a stock FreeBSD kernel, although the IPsec code is from FreeBSD.  I noticed that strongSwan was updating SPs on a regular basis without making any significant changes, so I improved the FreeBSD code for updates that don't change anything.  I also fixed it so it doesn't move SPs to the end of the list when it updates them.  I will eventually offer up all of our changes for FreeBSD.  Management has approved spending time on it, but I have a lot of other chores, too :-).  This means that I don't have an easy way to test this change with a stock FreeBSD in the near term, so I can't evaluate how risky it is.  As you said, strongSwan normally knows when it's updating.  That would mean it never sees EEXIST unless it has been restarted, so the change wouldn't affect any other cases.

> By the way, if auto=route is used, the result might be a bit odd.  Since the previously installed SAs and policies are still there, no new SAs are established until the existing SAs expire.  Soft expires will be triggered, for which the daemon does not find any state, so no rekeying is done.  And only when the SAs have finally expired will the kernel send an acquire to the daemon (this assumes that the reqids are the same as they were before the restart, which should be the case since 5.3.0).
> Also, if the other peer has lower lifetimes it will try to rekey the CHILD_SA using an IKE_SA that does not exist on the restarted peer, so after a few retransmits it will eventually trigger its configured `dpdaction`.

[Robinson, Herbie] Obviously, this isn't perfect, but it does recover from the daemon faulting.  The alternative would be letting things grind to a halt until somebody fixes it manually.

[Robinson, Herbie] BTW, Bettina found the cause of the fault today.  Another developer had put the sha2 routines into our equivalent of libc (but without the sha384 update entry).  The callers in strongSwan were using the OpenSSL headers to define the size of the state, but calling code from another library for some of the functions.  Madness ensues...