[strongSwan] Broken CHILD_SA following IKE_SA re-auth with FortiGate remote

Fri Aug 5 11:56:41 CEST 2016

Hi,

We recently had an IKEv2-negotiated ESP site-to-site tunnel between
strongSwan 5.3.5 running on Ubuntu 16.04 and a Fortinet FortiGate
router break following the re-auth of the IKE_SA. Just one out of six
ESP CHILD_SAs broke.

I've uploaded config files, charon logs, and other debugging info to:

https://gist.github.com/toreanderson/effeaae2432abc965398f17c3d2161ef

Basically what seems to have happened is the following:

1) strongSwan decides it's time to re-auth the IKE_SA, and sends an
IKEv2 INFORMATIONAL with DELETE to the FortiGate. From the logs:

15[IKE] sending DELETE for IKE_SA tun1[6447]

2) Immediately afterwards, strongSwan and the FortiGate both send an
IKE_SA_INIT to establish a new IKE_SA.

09[IKE] initiating IKE_SA tun1[6471] to 172.16.0.1
12[IKE] 172.16.0.1 is initiating an IKE_SA

3) Both of these are successfully established along with a CHILD_SA:

10[IKE] IKE_SA tun1[6471] established between 10.0.0.1[10.0.0.1]...172.16.0.1[172.16.0.1]
10[IKE] CHILD_SA tun1{14549} established with SPIs cc122363_i 104b86c4_o and TS 10.1.1.0/24 === 172.16.1.0/24
06[IKE] IKE_SA tun1[6472] established between 10.0.0.1[10.0.0.1]...172.16.0.1[172.16.0.1]
06[IKE] CHILD_SA tun1{14551} established with SPIs c1f9cea7_i 104b86c3_o and TS 10.1.1.0/24 === 172.16.1.0/24

So now we have a redundant IKE_SA and a redundant ESP CHILD_SA.

4) The FortiGate detects the mid-air collision and requests to delete
one of the two IKE_SAs:

14[IKE] received DELETE for IKE_SA tun1[6472]

5) strongSwan acts on this IKE_SA DELETE by deleting not only the
IKE_SA, but also the c1f9cea7_i 104b86c3_o CHILD_SA - at least it no
longer appears in the output from "ipsec statusall". The FortiGate,
however, does NOT delete that CHILD_SA; indeed, it keeps on actively
using it. This can be seen from the ESP traffic resulting from a ping
packet sent from 10.1.1.0/24 to 172.16.1.0/24:

09:27:19.639105 IP 10.0.0.1 > 172.16.0.1: ESP(spi=0x104b86c4,seq=0xd0f), length 136
09:27:19.647375 IP 172.16.0.1 > 10.0.0.1: ESP(spi=0xc1f9cea7,seq=0x121d), length 136

SPI 0x104b86c4 still obviously exists on both the strongSwan (10.0.0.1)
and the FortiGate (172.16.0.1). However, strongSwan has already deleted
SPI 0xc1f9cea7 so it has no way of decrypting those packets. The result
is that all traffic from 172.16.1.0/24 to 10.1.1.0/24 was being
blackholed until the tunnel was manually brought down and up again by
NOC staff.
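
For anyone wanting to check for the same condition, here is a minimal
Python sketch. It assumes the charon and tcpdump line formats shown
above. The function names are hypothetical, and the set of installed
inbound SPIs has to be supplied by hand (e.g. read from "ipsec
statusall"). It is only an illustration, not a tested monitoring tool.

```python
import re
from collections import defaultdict

# Assumed charon format: "... CHILD_SA tun1{14549} established with
# SPIs cc122363_i 104b86c4_o and TS 10.1.1.0/24 === 172.16.1.0/24"
CHILD_RE = re.compile(
    r"CHILD_SA (\S+) established with SPIs "
    r"([0-9a-f]+)_i ([0-9a-f]+)_o and TS (.+)$"
)
# Assumed tcpdump format: "... IP 172.16.0.1 > 10.0.0.1:
# ESP(spi=0xc1f9cea7,seq=0x121d), length 136"
ESP_RE = re.compile(r"IP (\S+) > (\S+): ESP\(spi=0x([0-9a-f]+),")


def duplicate_child_sas(log_lines):
    """Return traffic selectors for which more than one CHILD_SA
    was established, mapped to (name, spi_in, spi_out) tuples."""
    by_ts = defaultdict(list)
    for line in log_lines:
        m = CHILD_RE.search(line.strip())
        if m:
            name, spi_in, spi_out, ts = m.groups()
            by_ts[ts].append((name, spi_in, spi_out))
    return {ts: sas for ts, sas in by_ts.items() if len(sas) > 1}


def unknown_inbound_spis(tcpdump_lines, local_ip, installed_inbound):
    """Return SPIs of ESP packets addressed to local_ip whose SPI is
    not in installed_inbound (lowercase hex strings without "0x")."""
    unknown = set()
    for line in tcpdump_lines:
        m = ESP_RE.search(line)
        if m and m.group(2) == local_ip:
            spi = m.group(3)
            if spi not in installed_inbound:
                unknown.add(spi)
    return unknown


if __name__ == "__main__":
    log = [
        "10[IKE] CHILD_SA tun1{14549} established with SPIs "
        "cc122363_i 104b86c4_o and TS 10.1.1.0/24 === 172.16.1.0/24",
        "06[IKE] CHILD_SA tun1{14551} established with SPIs "
        "c1f9cea7_i 104b86c3_o and TS 10.1.1.0/24 === 172.16.1.0/24",
    ]
    dump = [
        "09:27:19.647375 IP 172.16.0.1 > 10.0.0.1: "
        "ESP(spi=0xc1f9cea7,seq=0x121d), length 136",
    ]
    print(duplicate_child_sas(log))
    # Only cc122363 is assumed to remain installed after the DELETE.
    print(unknown_inbound_spis(dump, "10.0.0.1", {"cc122363"}))
```

An inbound SPI reported by the second function would correspond to the
blackholed direction described above.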

That concludes my analysis of the incident. I am however not sure if it
is strongSwan or the FortiGate that did something wrong here (or both),
and am hoping this list might have some insight. The way I see it, the
key questions are the following:

- Is the strongSwan behaving correctly when it is also deleting the ESP
  CHILD_SA when receiving the DELETE IKE_SA from the FortiGate, instead
  of "moving" it to the other active IKE_SA as it appears the FortiGate
  has done? RFC4306, section 2.4 says the following:

  «Closing the IKE_SA implicitly closes all associated CHILD_SAs.»

  ...but this doesn't mention the corner case where there are two
  parallel CHILD_SAs, as was the case for us.

- Why does the strongSwan rekey by first deleting the existing IKE_SA
  and then initiating a new one, instead of the other way around? This
  seems to me to be a violation of RFC4306, section 2.8, paragraph 4:

   «SAs SHOULD be rekeyed proactively, i.e., the new SA should be
   established before the old one expires and becomes unusable.  Enough
   time should elapse between the time the new SA is established and the
   old one becomes unusable so that traffic can be switched over to the
   new SA.»

It would appear to me that if this SHOULD had been followed, the
FortiGate would likely not have initiated an IKE_SA of its own
following the strongSwan's deletion of the old one, and the blackholing
of traffic would have been avoided.

Tore