[strongSwan] Performance issue with 25k IPsec tunnels (using 5.0.4 strongswan and load-tester plugin)

Chinmaya Dwibedy ckdwibedy at yahoo.com
Fri Nov 8 13:20:22 CET 2013



Hi Martin/All,

Thanks for your suggestion.

I had modified the strongSwan code (in the main() function of starter.c) so that the single starter/charon instance is restricted to run on the first core only (using sched_getaffinity() and ffsl()). My understanding is that the worker threads in charon (created and managed by strongSwan according to the settings in strongswan.conf) should then be scheduled and distributed among the 16 cores in order to process the large number of tasks (25k IPsec security associations). Please note that we are using two multi-core MIPS64 processors with 16 cnMIPS64 v2 cores each (one acts as the IKE initiator, the other as the IKE responder), and strongSwan runs on both systems. Both systems have 1 Gbps Ethernet cards connected to a 1 Gbps L2 switch. Wind River Linux runs on all 16 cores.

With 25k connections and no data traffic, I noticed that the kernel does not migrate tasks/threads (32 worker threads are configured at both ends) away from the busy first core to the other cores. I checked this via # ps -p <PID of charon daemon> -L -o pid,tid,psr and found that psr (the processor a thread is currently assigned to) was always zero.
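In case anyone wants to reproduce the check, this small snippet (my own helper, not part of strongSwan) prints the per-thread placement for any PID and counts how many distinct cores are actually in use; it defaults to the current shell's PID:

```shell
# Show which CPU each thread of a process last ran on (the psr column)
# and how many distinct CPUs the threads are spread across.
pid=${1:-$$}
ps -p "$pid" -L -o tid=,psr=
n=$(ps -p "$pid" -L -o psr= | sort -u | wc -l)
echo "distinct CPUs in use: $n"
```

Run against the charon PID, it reports a single distinct CPU in my setup.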

I also ran # perf top 2> /dev/null to monitor all CPUs at both user and kernel level and to see the functions where most of the time is spent. I found (from the call stack) that it reports r4k_wait at ~95%. This implies the kernel has no process to run and is executing the idle loop (r4k_wait is called from the idle loop). charon initially takes only ~4% (in libgmp.so.3.4.1) on both Linux systems (IKE initiator as well as IKE responder), which is expected.

Here is the overall CPU utilization, captured via the top command on both systems.

IKE Responder 
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu0  : 99.4%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8191428k total,   875784k used,  7315644k free,        0k buffers
Swap:        0k total,        0k used,        0k free,   546116k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 1180 root      20   0  945m 114m 2388 S 99.8  1.4   1:58.69 charon  


IKE Initiator 
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu0  : 96.1%us,  1.8%sy,  0.0%ni,  1.8%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8191428k total,   971792k used,  7219636k free,        0k buffers
Swap:        0k total,        0k used,        0k free,   546104k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 1147 root      20   0  639m 211m 2560 S 98.2  2.6   2:47.13 charon    


Should I modify the strongSwan code to use pthread_setaffinity_np() so that individual worker threads are pinned to different cores, in order to obtain a performance benefit? That way I hope to scale up to 70k-80k IPsec connections.
Thanks in advance for your feedback and suggestion.

Regards,
Chinmaya



--------------------------------------------
On Thu, 10/24/13, Martin Willi <martin at strongswan.org> wrote:

 Subject: Re: [strongSwan] Performance issue with 25k IPsec tunnels (using 5.0.4 strongswan and load-tester plugin)
 To: "Chinmaya Dwibedy" <ckdwibedy at yahoo.com>
 Cc: "users at lists.strongswan.org" <users at lists.strongswan.org>
 Date: Thursday, October 24, 2013, 2:06 PM
 
 Hi,
 
 > gmpn_addmul_1 function in libgmp.so.3.4.1 consumes most of the CPU
 > cycles on both the Linux systems
 
 Yes, this was to be expected; DH computation is the most expensive task.
 
 > Do I need to use the Libgcrypt instead of GMP library?
 
 Probably that won't help; GMP is likely the fastest DH backend you can
 use, see [1].
 
 > 3.72%    charon  libgmp.so.3.4.1  __gmpn_addmul_1
 
 The question is: why is it only eating ~4% of your CPU? Is it the same
 percentage on both systems?
 
 You'll have to find out what is limiting your throughput. What changes
 if you initiate more aggressively? What is your overall CPU utilization
 during testing?
 
 You might also try --enable-lock-profiler; during daemon shutdown it
 prints the cumulative time waited in each lock to stderr (run with
 --nofork). That might give some indication if something is not scaling
 as it should.
 
 Regards
 Martin
 
 [1] http://wiki.strongswan.org/projects/strongswan/wiki/PublicKeySpeed