[strongSwan] IPsec/IKEv2 tunnels scalability issue with load-tester plugin (using strongSwan 5.0.4)

Tue Aug 6 18:15:31 CEST 2013

Hi Martin/All,
Thanks a lot for your valuable and prompt response. 

I disabled dos_protection at both ends. The responder no longer generates
an IKE_AUTH response carrying AUTH_FAILED. I ran the scenario with 300 IPsec
tunnels (initiators = 10, iterations = 30 and delay = 100 in strongswan.conf
on the IKE initiator). It was able to bring up all the tunnels, i.e. establish
the IKE_SAs and ESP CHILD_SAs at both ends. Please note that I have disabled
rekeying at both ends. To confirm, I ran the "ip xfrm state count" command at
both ends from time to time. It always showed an SAD count of 600, as expected.

However, when I ran with 500 IPsec connections/tunnels, I found the following
issue: although all the IPsec tunnels are created at the responder end (the SAD
count there is 1000), the initiator cannot create all of them (its SAD count
stays below 400).
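
The counts quoted above follow from simple arithmetic: the load-tester sets up initiators x iterations tunnels, and each established tunnel contributes two ESP SAs (one inbound, one outbound) to the SAD. A minimal Python sketch of that cross-check follows; the function names are purely illustrative and not part of strongSwan, and it assumes SAs are always installed in pairs.

```python
def expected_tunnels(initiators: int, iterations: int) -> int:
    """Each load-tester initiator thread sets up `iterations` IKE_SAs."""
    return initiators * iterations


def expected_sad_entries(tunnels: int) -> int:
    """One inbound plus one outbound ESP SA per established tunnel."""
    return 2 * tunnels


def installed_tunnels(sad_count: int) -> int:
    """Tunnels with both SAs installed, inferred from `ip xfrm state count`.

    Assumption: SAs only ever appear in inbound/outbound pairs.
    """
    return sad_count // 2


if __name__ == "__main__":
    # 10 initiators x 30 iterations, as in the 300-tunnel run
    print(expected_sad_entries(expected_tunnels(10, 30)))
    # the 500-tunnel run
    print(expected_sad_entries(500))
```

Under the pairing assumption, an initiator-side SAD count below 400 would mean fewer than 200 of the 500 tunnels have both SAs installed there.
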

I then turned on logging at the IKE initiator and found the following error
messages in charon.log:
----------------------------------------------------------------------------------------------
Jan  1 07:14:28 05[KNL] creating delete job for ESP CHILD_SA with SPI cd8cac1c and reqid {159}
Jan  1 07:14:28 03[MGR] checkout IKE_SA by ID
Jan  1 07:14:28 03[JOB] CHILD_SA with reqid 159 not found for delete
Jan  1 07:14:28 05[KNL] received a XFRM_MSG_EXPIRE
Jan  1 07:14:28 05[KNL] creating delete job for ESP CHILD_SA with SPI c7b3110f and reqid {160}
Jan  1 07:14:28 02[MGR] checkout IKE_SA by ID
Jan  1 07:14:28 02[JOB] CHILD_SA with reqid 160 not found for delete
Jan  1 07:14:28 05[KNL] received a XFRM_MSG_EXPIRE
Jan  1 07:14:28 05[KNL] creating delete job for ESP CHILD_SA with SPI c922693f and reqid {161}
Jan  1 07:14:29 12[CHD]   SPI 0xcd8cac1c, src 30.30.30.2 dst 30.30.30.1
Jan  1 07:14:29 12[KNL] adding SAD entry with SPI cd8cac1c and reqid {159}  (mark 0/0x00000000)
Jan  1 07:14:29 12[KNL]   using encryption algorithm AES_CBC with key size 128
Jan  1 07:14:29 12[KNL]   using integrity algorithm HMAC_SHA1_96 with key size 160
Jan  1 07:14:29 12[KNL]   using replay window of 32 packets
Jan  1 07:14:29 12[KNL] unable to add SAD entry with SPI cd8cac1c
Jan  1 07:14:29 12[IKE] unable to install inbound IPsec SA (SAD) in kernel
Jan  1 07:14:29 12[IKE] failed to establish CHILD_SA, keeping IKE_SA
Jan  1 07:14:29 12[KNL] deleting SAD entry with SPI cd8cac1c  (mark 0/0x00000000)
----------------------------------------------------------------------------------------------

Here goes my analysis. Kindly go through it and point me in the right direction if I am wrong at any point.

As I understand it, each IPsec SA in the Linux kernel has a lifetime
consisting of a soft and a hard limit. Each time one of these limits is
reached, the kernel generates an XFRM_MSG_EXPIRE message, to which the charon
(IKEv2) keying daemon subscribes when creating the NETLINK_XFRM socket. But in
this case, since I have disabled rekeying, the kernel should not send
XFRM_MSG_EXPIRE events to the charon daemon. Please feel free to correct me if
my understanding is wrong. Since charon does receive XFRM_MSG_EXPIRE events, it
creates a delete job for the ESP CHILD_SA and is unable to install the CHILD_SA
in the kernel.

As per RFC 4306, if a response is not received within a timeout interval, the
requester needs to retransmit the request or tear down the connection. Let us
assume that the initiator does not receive the IKE_AUTH response; it should
then keep retransmitting the IKE_AUTH request. Since I have configured
retransmit_timeout=60 and retransmit_tries=30, it will send 30 IKE_AUTH
requests, one every 60 seconds. After 60*30 = 1800 seconds it should tear down
the connection (i.e. the child SA) by sending an INFORMATIONAL message with a
DELETE payload to the responder. In that case the SAD count at the responder
end should drop from 1000 (note that I ran with 500 IPsec
connections/tunnels). But it never drops from 1000. The SAD count at the
initiator end (320) does not drop either. Is this expected behavior, a bug in
strongSwan 5.0.4, or am I missing something?
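
The 60*30 = 1800 second estimate above assumes a fixed interval between retransmits. The strongSwan retransmission documentation describes an exponential backoff instead, where the wait before retransmit n is retransmit_timeout * retransmit_base^(n-1). The retransmit_base option does not appear in the configuration posted here, so the sketch below assumes its documented default of 1.8. Whether 5.0.4 behaves exactly this way is also an assumption to verify against the source. The function names are illustrative only.

```python
def linear_total(timeout: float, tries: int) -> float:
    """Total wait if every retransmit were sent after a fixed interval."""
    return timeout * tries


def backoff_waits(timeout: float, tries: int, base: float = 1.8) -> list[float]:
    """Waits under exponential backoff (assumed model, base 1.8 assumed).

    Entry n is the wait after the n-th transmission: index 0 follows the
    initial request, the last entry is the wait before giving up.
    """
    return [timeout * base ** n for n in range(tries + 1)]


def backoff_total(timeout: float, tries: int, base: float = 1.8) -> float:
    """Seconds from the initial request until the exchange is abandoned."""
    return sum(backoff_waits(timeout, tries, base))


if __name__ == "__main__":
    print("fixed interval, 60 s x 30:", linear_total(60, 30))
    print("backoff, defaults 4 s x 5:", round(backoff_total(4.0, 5), 2))
    print("backoff, 60 s x 30 (days):", backoff_total(60, 30) / 86400)
```

Under this assumed model the documented defaults (4 seconds, 5 tries) give up after roughly 165 seconds, whereas 60 seconds with 30 tries grows far beyond 1800 seconds. That could be one reason no teardown is observed within the expected window.
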

Please find our configuration for the initiator and the responder below.

Configurations at IKE initiator
strongswan.conf 
        # number of worker threads in charon
        threads = 16
        replay_window = 32
        dos_protection = no
        block_threshold = 300
        cookie_threshold = 300
        init_limit_half_open = 300
        retransmit_timeout = 60
        retransmit_tries = 30
        install_virtual_ip = no
        install_routes = no

        plugins {
                load-tester {
                        enable = yes
                        initiators = 20
                        iterations = 30
                        delay = 30
                        responder = 30.30.30.2
                        proposal = aes128-sha1-modp1024
                        initiator_auth = psk
                        responder_auth = psk
                        request_virtual_ip = no
                        ike_rekey = 0
                        child_rekey = 0
                        delete_after_established = no
                        shutdown_when_complete = no
                }
        }

ipsec.secrets
@srv.strongswan.org %any : PSK "strongSwan"

Configurations at IKE responder
ipsec.conf
conn %default
        ikelifetime=24h
        keylife=23h
        rekeymargin=5m
        keyingtries=1
        keyexchange=ikev2
        ike=aes128-sha1-modp1024!
        mobike=no

conn host-host
        left=30.30.30.2
        leftsubnet=30.30.30.2/8
        rightid=%any
        leftauth=psk
        leftfirewall=yes
        right=30.30.30.1
        rightsubnet=30.30.30.1/8
        leftid=@srv.strongswan.org
        rightauth=psk
        type=tunnel
        authby=secret
        rekey=no
        reauth=no
        auto=add
strongswan.conf
        # number of worker threads in charon
        threads = 16
        replay_window = 32
        block_threshold = 300
        cookie_threshold = 300
        init_limit_half_open = 300
        half_open_timeout = 300
        dos_protection = no

Please feel free to let me know if additional information is needed. Your help in this regard will be highly appreciated.

Regards,
Chinmaya 

________________________________
 From: Martin Willi <martin at strongswan.org>
To: Chinmaya Dwibedy <ckdwibedy at yahoo.com> 
Cc: "users at lists.strongswan.org" <users at lists.strongswan.org> 
Sent: Monday, August 5, 2013 5:02 PM
Subject: Re: [strongSwan] IPsec/IKEv2 tunnels scalability issue with load-tester plugin (using strongSwan 5.0.4)

Hi,

> Although we did not encounter with aforesaid message, it could
> not create all 300 IPsec tunnels. [...]

> [IKE] tried 1 shared key for 'srv.strongswan.org' - 'c13-r1.strongswan.org',
> but MAC mismatched

This problem arises when enabling IKEv2 DoS protection using COOKIEs.

Assume the following: The initiator sends an IKE_SA_INIT without a
COOKIE. The responder queues that message for processing. If it then
enables DoS protection, and the initiator retransmits the IKE_SA_INIT,
you'll encounter this issue. The initiator assumes the responder
processed the COOKIEd IKE_SA_INIT, but in fact it processed the first
IKE_SA_INIT without a COOKIE. The peers use different IKE_SA_INIT data
to authenticate during IKE_AUTH, and the MAC does not match (see also
[1]).

While there could be ways to improve the situation (i.e. respect a
second IKE_SA_INIT message having a COOKIE), I don't think that there is
a way to solve this issue completely. Under DoS the chance is high that
packets get lost elsewhere. The only option would be to change the IKEv2
protocol itself (for example to not include the COOKIE in MAC
calculation).

Further, I think your assumption is wrong that there is a guarantee that
a tunnel can be established given a Denial of Service situation. You
should reconsider what a DoS situation is, and configure the daemon
accordingly using the cookie_threshold and block_threshold options. Also
you might check the init_limit_job_load/init_limit_half_open options;
they can help in avoiding this issue. But once you hit DoS limits, the
above error will happen, and there is certainly no guarantee all
connection attempts will be successful.

Regards
Martin

[1]http://git.strongswan.org/?p=strongswan.git;a=commitdiff;h=1b7debcc