[strongSwan] Retransmission issue under high load
ckdwibedy at yahoo.com
Tue Mar 11 22:26:48 CET 2014
I am using strongSwan 5.0.4 with the load-tester plugin. On further debugging I found that with 200k IPsec tunnels, although all of the tunnels come up successfully (with an average setup rate of 180), there are lots of retransmissions by the IKE Initiator in charon.log. Under heavy load (200k) there is packet loss: the packets (IKE_SA_INIT/IKE_AUTH request messages) are received by the kernel but not by the Charon daemon (IKE Responder end).
1. In #netstat -su, I noticed lost packets (the "packet receive errors" counter in the Udp section), and the count kept increasing over time.
2. In #netstat -ua, I looked at the Recv-Q column of the isakmp connection and found that the values are high and do not drop to zero. If Recv-Q is 0, everything is OK; a persistently non-zero value means the process cannot keep up with the load.
3. In #cat /proc/net/udp, I read the rx_queue column and saw non-zero values there as well.
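Check 3 can be scripted so that the backlog is easy to sample repeatedly during a test run. The following is only a minimal sketch: it assumes the usual Linux /proc/net/udp layout (hexadecimal local port, a hexadecimal tx_queue:rx_queue pair, and a trailing decimal drops column on reasonably recent kernels), and the SAMPLE text is made-up data for illustration, not output from my setup. The function names are my own.

```python
def parse_proc_net_udp(text):
    """Parse the contents of /proc/net/udp into a list of dicts."""
    sockets = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a slot number such as "101:"; this skips
        # the header row and any blank lines.
        if len(fields) < 5 or not fields[0].endswith(":"):
            continue
        local_port = int(fields[1].split(":")[1], 16)
        tx_hex, rx_hex = fields[4].split(":")
        sockets.append({
            "port": local_port,
            "tx_queue": int(tx_hex, 16),
            "rx_queue": int(rx_hex, 16),
            # Assumption: the 13th column is "drops" (absent on old kernels).
            "drops": int(fields[12]) if len(fields) > 12 else None,
        })
    return sockets


def backlog(text, ports=(500, 4500)):
    """Return the IKE sockets (UDP 500/4500) with a non-empty receive queue."""
    return [s for s in parse_proc_net_udp(text)
            if s["port"] in ports and s["rx_queue"] > 0]


# Hypothetical sample in the /proc/net/udp format (not real measurements).
SAMPLE = """
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
  101: 00000000:01F4 00000000:0000 07 00000000:0001A400 00:00000000 00000000     0        0 11111 2 0000000000000000 352
  102: 00000000:1194 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 11112 2 0000000000000000 0
  103: 00000000:0035 00000000:0000 07 00000000:00000200 00:00000000 00000000     0        0 11113 2 0000000000000000 0
"""

if __name__ == "__main__":
    for sock in backlog(SAMPLE):
        print(sock)
```

On a live system the same function could be fed the real file, e.g. backlog(open("/proc/net/udp").read()), and run in a loop to see whether rx_queue ever drains.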
These statistics suggest that the Charon daemon is not reading the socket fast enough: the average arrival rate regularly causes a backlog in the receive queue. The maximum amount of queued receive data depends on /proc/sys/net/ipv4/udp_mem and /proc/sys/net/ipv4/udp_rmem_min.
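As a side check on how much a single UDP socket may queue, the per-socket receive buffer can be inspected through SO_RCVBUF. On Linux the default and the ceiling for that value come from net.core.rmem_default and net.core.rmem_max, in addition to the udp_* settings. The sketch below only illustrates the mechanism on a throwaway socket; it is not how Charon configures its own sockets, and the function name is hypothetical.

```python
import socket


def effective_rcvbuf(requested_bytes):
    """Request a receive buffer size on a scratch UDP socket.

    Returns (default, granted) as reported by the kernel. The granted
    value may be smaller than requested if the system-wide limit caps it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        default = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return default, granted


if __name__ == "__main__":
    default, granted = effective_rcvbuf(8 * 1024 * 1024)
    print("default:", default, "granted after request:", granted)
```

If the granted value stays far below the requested one, the system-wide limit is the constraint and would have to be raised before a larger per-socket buffer can take effect.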
Should I tweak these parameters to achieve zero packet loss? Also, how can I configure the Charon daemon to balance the load across multiple threads?
On Monday, March 10, 2014 4:18 PM, Chinmaya Dwibedy <ckdwibedy at yahoo.com> wrote:
I am running with 200k IPsec tunnels. Although all of those
tunnels come up successfully, I find that there are lots of retransmissions in
Jan 1 00:10:29 56[IKE] retransmit 1 of request with message
ID 0 (IKE Initiator)
Jan 1 00:10:45 49[IKE] received retransmit of request with
ID 0, retransmitting response (IKE Responder)
I know these are certainly considered to be bad. I checked the CPU usage of the Charon daemon at the IKE
responder end (through top -p <PID of Charon daemon>) and found it to be
less than 10% (mostly). Profiling shows that it spends most of its time
in pthread_mutex_lock(). Note that I
have set retransmit_timeout and retransmit_tries to 60 seconds and 30 times
respectively, which is quite big. Can anyone please guide/suggest what might be