[strongSwan-dev] HA resync issue
emeric.poupon at stormshield.eu
Thu Aug 28 11:00:40 CEST 2014
I'm switching from the user to the dev mailing list as it is more appropriate for this issue.
When a node is reintegrating an HA cluster, the active node first sends all the IKE SA related messages.
In my case, this is hundreds of messages, since there are hundreds of tunnels installed.
On the passive node (strongSwan 5.2.0 on FreeBSD 9.2), I can see a lot of UDP packet drops:
# netstat -s -p udp
575 dropped due to full socket buffers
These drops lead to the problem initially described (unnamed CONNECTING connections, errors in logs, etc.)
Therefore in order to be able to sync 300 tunnels, I have to significantly increase the net.inet.udp.recvspace sysctl parameter (at least from 42K to more than 2M!)
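The drop behaviour described above can be reproduced outside of strongSwan. The following is only a minimal sketch, not code from the HA plugin: it sends a burst of UDP datagrams over loopback to a socket that nobody reads from, then counts how many were actually queued. The function name, the burst size and the 4096-byte buffer are arbitrary illustration values, and the exact number of drops depends on the operating system.

```python
import socket


def burst_without_reading(n_datagrams, payload_size, rcvbuf_bytes):
    """Send a UDP burst to an idle receiver and count what was queued.

    Hypothetical helper for illustration; returns (sent, received).
    """
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Request a deliberately small receive buffer. The kernel may
        # round or clamp the value, so this is a hint, not a guarantee.
        receiver.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
        receiver.bind(("127.0.0.1", 0))
        address = receiver.getsockname()

        payload = b"x" * payload_size
        sent = 0
        for _ in range(n_datagrams):
            try:
                sender.sendto(payload, address)
                sent += 1
            except OSError:
                # Some systems may report a send-side error instead of
                # dropping silently; such datagrams are not counted as sent.
                pass

        # Drain whatever made it into the receive buffer.
        receiver.setblocking(False)
        received = 0
        while True:
            try:
                receiver.recv(payload_size)
                received += 1
            except BlockingIOError:
                break
        return sent, received
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    sent, received = burst_without_reading(2000, 1000, 4096)
    print(f"sent={sent} received={received} dropped={sent - received}")
```

The sender gets no indication that anything was lost, which matches what happens during a resync burst: the active node keeps sending while the passive node silently misses messages.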
As the HA plugin was initially developed for Linux, I am wondering whether it works fine on Linux, even with a large number of tunnels. Do Linux users have to deal with the UDP socket recvspace parameter too?
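For comparing platforms, the receive buffer a UDP socket really gets can be read back with SO_RCVBUF. On Linux the system-wide counterparts of net.inet.udp.recvspace are the net.core.rmem_default and net.core.rmem_max sysctls. The sketch below is a hypothetical helper (the name is made up for this example) that reports the default size, or the size obtained after requesting a specific one.

```python
import socket


def effective_udp_rcvbuf(requested=None):
    """Return the receive buffer size reported for a fresh UDP socket.

    If `requested` is given, ask the kernel for that size first.
    Hypothetical helper for illustration only.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        if requested is not None:
            # May raise OSError if the request exceeds a system limit
            # (for example kern.ipc.maxsockbuf on FreeBSD).
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    print("default:", effective_udp_rcvbuf())
    print("after requesting 65536:", effective_udp_rcvbuf(65536))
```

Note that Linux reports roughly twice the requested value (it accounts for bookkeeping overhead) and caps requests at net.core.rmem_max, so the numbers are not directly comparable between the two systems.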
I did not test with 1K or even 10K+ tunnels, but the UDP-based solution seems unable to provide the reliability needed for these cases.
I understand that switching to a TCP-based sync would require significant work, but it seems to be quite unavoidable.
What do you think?
----- Original Message -----
From: "Thomas Egerer" <hakke_007 at gmx.de>
To: users at lists.strongswan.org
Cc: "emeric poupon" <emeric.poupon at stormshield.eu>
Sent: Friday, 1 August 2014 22:41:46
Subject: Re: [strongSwan] HA resync issue
On 08/01/2014 06:22 PM, Emeric POUPON wrote:
> I'm running strongSwan 5.2.0 on FreeBSD security gateways.
> I set up an Active/Passive HA cluster.
> I successfully created 300 connections from another remote gateway using strongSwan's load-tester plugin.
> => the passive node has been correctly synchronized.
> I then decided to bring down the passive node and bring it up shortly after.
> The wiki says:
> "Synchronizing CHILD_SAs is not possible using the cache, as the messages do not contain sequence number information managed in the kernel. To reintegrate a node, the active node initiates rekeying on all CHILD_SAs. The new CHILD_SA will be synchronized, starting with fresh sequence numbers in the kernel. CHILD_SA rekeying is inexpensive, as it usually does not include a DH exchange."
> (BTW, why would the CHILD SA rekey not include a DH exchange?)
Because by default, PFS is not enabled for children by
> Indeed the active node rekeys the 300 CHILD SAs in a few seconds, but the passive node gets synchronized with only a few CHILD SAs (about 30).
> Aug 1 16:15:16 02[CFG] <sample-psk|9> installed HA passive IKE_SA 'sample-psk' 172.18.0.53[srv.strongswan.org]...172.18.0.54[c108-r1.strongswan.org]
> Aug 1 16:15:16 02[CFG] <sample-psk|10> installed HA passive IKE_SA 'sample-psk' 172.18.0.53[srv.strongswan.org]...172.18.0.54[c20-r1.strongswan.org]
> And then a lot of errors like these:
> Aug 1 16:15:16 02[CFG] passive HA IKE_SA to update not found
> Aug 1 16:15:16 02[CHD] IKE_SA for HA CHILD_SA not found
> Aug 1 16:15:16 02[CHD] <11> HA is missing nodes child configuration
> Any idea?
This can happen if the passive node is
- not able to check out the IKE_SA to be updated (case 1)
- not able to check out the IKE_SA it should add a child to (case 2)
- not able to find a configuration matching the one used in
the HA CHILD_SA update (case 3)
which to me looks like your passive node does not have all the
configurations required for the synchronization.
If a passive node comes up it requests an immediate resync
by the active node. This node pushes all established IKE_SAs
(from ha_cache) to the passive node. I've seen cases where
the sync failed because the configs were not identical.
Maybe there is a race condition where the resync is faster than
your backend loading the configs? In that case 'stroke statusall' should
list a lot of (unnamed) IKE_SAs, the ones that were not synced