[strongSwan] HA resync issue

Mon Aug 4 10:25:59 CEST 2014

Hello,

Thanks for your answer.
Here is the configuration on the responder (which is in HA mode):

-----
conn %default
	ikelifetime=360m
	keylife=60m
	rekeymargin=3m
	keyingtries=1
	keyexchange=ikev2
	authby=secret

conn sample-psk-3k
      left=172.18.0.53
      leftid=srv.strongswan.org
      leftsubnet=172.53.0.0/16
      right=%any
      auto=add
      esp=aes128-sha1-modp2048
      ike=aes128-sha1-modp2048
-----

On the passive node I can see some lines like these:
...
   (unnamed)[24]: CONNECTING, %any[%any]...%any[%any]
   (unnamed)[24]: IKEv2 SPIs: dce7d8aa449c06ea_i 312cbeb706504d9d_r*
   (unnamed)[24]: IKE proposal: AES_CBC_128/HMAC_SHA1_96/PRF_HMAC_SHA1
   (unnamed)[23]: CONNECTING, %any[%any]...%any[%any]
   (unnamed)[23]: IKEv2 SPIs: 090d4aa0884fd214_i 7ed0a8f6e8581328_r*
   (unnamed)[23]: IKE proposal: AES_CBC_128/HMAC_SHA1_96/PRF_HMAC_SHA1
...

But not that many: I have fewer than 40 connections in the command output.
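
A minimal sketch for counting these entries, assuming the status output has been saved as text and that every unsynced IKE_SA shows up with a state line in the "(unnamed)[N]: STATE, ..." shape seen above. The names count_unnamed and SAMPLE are made up for this illustration and are not part of strongSwan:

```python
import re

# Matches state lines such as "(unnamed)[24]: CONNECTING, %any[%any]...%any[%any]".
# The "IKEv2 SPIs" and "IKE proposal" lines do not match, so each SA is counted once.
UNNAMED_STATE = re.compile(r"^\s*\(unnamed\)\[(\d+)\]:\s+([A-Z_]+),")


def count_unnamed(status_text):
    """Return {unique_id: state} for every (unnamed) IKE_SA in the status text."""
    found = {}
    for line in status_text.splitlines():
        match = UNNAMED_STATE.match(line)
        if match:
            found[int(match.group(1))] = match.group(2)
    return found


# Sample taken from the excerpt above.
SAMPLE = """\
   (unnamed)[24]: CONNECTING, %any[%any]...%any[%any]
   (unnamed)[24]: IKEv2 SPIs: dce7d8aa449c06ea_i 312cbeb706504d9d_r*
   (unnamed)[24]: IKE proposal: AES_CBC_128/HMAC_SHA1_96/PRF_HMAC_SHA1
   (unnamed)[23]: CONNECTING, %any[%any]...%any[%any]
   (unnamed)[23]: IKEv2 SPIs: 090d4aa0884fd214_i 7ed0a8f6e8581328_r*
   (unnamed)[23]: IKE proposal: AES_CBC_128/HMAC_SHA1_96/PRF_HMAC_SHA1
"""

if __name__ == "__main__":
    sas = count_unnamed(SAMPLE)
    print(len(sas), "unnamed IKE_SAs with IDs", sorted(sas))
```

Fed with the full status output instead of SAMPLE, the count can be compared directly with the 300 connections expected.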

Would the race condition explain why there are still 260 connections missing?

Regards,

Emeric

----- Original Message -----
From: "Thomas Egerer" <hakke_007 at gmx.de>
To: users at lists.strongswan.org
Cc: "emeric poupon" <emeric.poupon at stormshield.eu>
Sent: Friday, 1 August 2014 22:41:46
Subject: Re: [strongSwan] HA resync issue

Hi Emeric,

On 08/01/2014 06:22 PM, Emeric POUPON wrote:
> Hello,
> 
> I'm running strongSwan 5.2.0 on FreeBSD security gateways.
> 
> I set up an Active/Passive HA cluster.
> I successfully created 300 connections thanks to another remote gateway using strongswan's load-tester plugin.
> => the passive node has been correctly synchronized.
> 
> I then decided to bring down the passive node and bring it up shortly after.
> 
> The wiki says:
> "Synchronizing CHILD_SAs is not possible using the cache, as the messages do not contain sequence number information managed in the kernel. To reintegrate a node, the active node initiates rekeying on all CHILD_SAs. The new CHILD_SA will be synchronized, starting with fresh sequence numbers in the kernel. CHILD_SA rekeying is inexpensive, as it usually does not include a DH exchange."
> 
> (BTW, why would the CHILD SA rekey not include a DH exchange?)
Because PFS is not enabled for children by default
(proposal_t::proposal_create_default).
> Indeed the active node rekeys the 300 CHILD_SAs in a few seconds, but the passive node gets synchronized with only a few CHILD_SAs (about 30).
> 
> Logs:
> ...
> Aug  1 16:15:16 02[CFG] <sample-psk|9> installed HA passive IKE_SA 'sample-psk' 172.18.0.53[srv.strongswan.org]...172.18.0.54[c108-r1.strongswan.org]
> Aug  1 16:15:16 02[CFG] <sample-psk|10> installed HA passive IKE_SA 'sample-psk' 172.18.0.53[srv.strongswan.org]...172.18.0.54[c20-r1.strongswan.org]
> 
> And then a lot of errors like these:
> ...
> Aug  1 16:15:16 02[CFG] passive HA IKE_SA to update not found
> ...
> Aug  1 16:15:16 02[CHD] IKE_SA for HA CHILD_SA not found
> ...
> Aug  1 16:15:16 02[CHD] <11> HA is missing nodes child configuration
> ...
> 
> Any idea?
This can happen if the passive node is
- not able to check out the IKE_SA to be updated (case 1)
- not able to check out the IKE_SA it should add a child to (case 2)
- not able to find a configuration matching the one used in
  the HA CHILD_SA update (case 3)

To me this looks like your passive node does not have all the
configurations required for the synchronization.
When a passive node comes up, it requests an immediate resync
from the active node, which then pushes all established IKE_SAs
(from ha_cache) to the passive node. I've seen the sync fail
when the configs were not identical.
Maybe there is a race condition where the resync is faster than
your backend loading the configs? In that case 'stroke statusall'
should list a lot of (unnamed) IKE_SAs, the ones that were not
synced properly.
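
To see which of the three cases dominates, the log can be tallied by message. This is a rough sketch, assuming the three error messages quoted above correspond, in order, to cases 1 to 3. The names CASES and tally are invented for this illustration:

```python
from collections import Counter

# Message substrings copied from the quoted log excerpt, mapped to the
# three cases listed above (the one-to-one mapping is an assumption).
CASES = {
    "passive HA IKE_SA to update not found": "case 1",
    "IKE_SA for HA CHILD_SA not found": "case 2",
    "HA is missing nodes child configuration": "case 3",
}


def tally(log_lines):
    """Count how many log lines fall into each of the three cases."""
    counts = Counter()
    for line in log_lines:
        for needle, case in CASES.items():
            if needle in line:
                counts[case] += 1
                break
    return counts


# Sample lines taken from the quoted log excerpt.
SAMPLE_LOG = [
    "Aug  1 16:15:16 02[CFG] passive HA IKE_SA to update not found",
    "Aug  1 16:15:16 02[CHD] IKE_SA for HA CHILD_SA not found",
    "Aug  1 16:15:16 02[CHD] <11> HA is missing nodes child configuration",
]

if __name__ == "__main__":
    for case, count in sorted(tally(SAMPLE_LOG).items()):
        print(case, count)
```

If case 3 dominates on the real log, that would point at missing or late-loaded configs rather than at IKE_SAs that could not be checked out.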

Cheers,

Thomas