[strongSwan] Performance issue with 25k IPsec tunnels (using 5.0.4 strongswan and load-tester plugin)

Chinmaya Dwibedy ckdwibedy at yahoo.com
Mon Nov 11 08:24:06 CET 2013


Hi,

Could anyone please go through the email below and respond? Thanks in advance for your help and support.

Regards,
Chinmaya 

--------------------------------------------
On Fri, 11/8/13, Chinmaya Dwibedy <ckdwibedy at yahoo.com> wrote:

 Subject: Re: [strongSwan] Performance issue with 25k IPsec tunnels (using 5.0.4 strongswan and load-tester plugin)
 To: "Martin Willi" <martin at strongswan.org>
 Cc: "users at lists.strongswan.org" <users at lists.strongswan.org>
 Date: Friday, November 8, 2013, 5:50 PM
 
 
 
 Hi Martin/All,
 
 Thanks for your suggestion.
 
 I modified the strongSwan code (in the main() function of
 starter.c) so that it restricts the single instance of
 starter/charon to run on the first core only (using
 sched_getaffinity() and ffsl()). My understanding is that
 the worker threads in charon (created and managed by
 strongSwan based on the settings in strongswan.conf), which
 process a large number of tasks (for 25k IPsec security
 associations), will be scheduled and distributed among the
 16 cores. Please note that we are using two multi-core
 MIPS64 processors with 16 cnMIPS64 v2 cores each (one acts
 as the IKE initiator and the other as the IKE responder),
 and strongSwan runs on both systems. Both systems have
 1 Gbps Ethernet cards connected to a 1 Gbps L2 switch, and
 Wind River Linux runs on all 16 cores.
 
 With 25k connections and no data traffic, I noticed that
 the kernel does not migrate tasks/threads (32 threads are
 configured at both ends) away from the busy core (i.e., the
 first core) to the other cores. I checked via
 # ps -p <PID of charon> -L -o pid,tid,psr
 and found that psr (the processor a thread is currently
 assigned to) was always zero.
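 For anyone wanting to reproduce the check, this is the command sequence I mean (looking up charon's PID via pidof, which assumes a single charon process):

```shell
# Show which core (psr) each charon thread is currently assigned to;
# re-run periodically to see whether the scheduler ever migrates them.
pid=$(pidof charon)
ps -p "$pid" -L -o pid,tid,psr,comm
```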
 
 I also ran # perf top 2> /dev/null to monitor all CPUs at
 both the user and kernel level and to see the functions in
 which most of the time is spent. I found (from the call
 stack) that it reports r4k_wait at ~95%, which implies the
 kernel has no process to run and so sits in the idle loop
 (r4k_wait is called from the idle loop). Charon takes only
 ~4% (in libgmp.so.3.4.1) on both Linux systems (IKE
 initiator as well as IKE responder) initially, which is
 expected.
 
 Here is the overall CPU utilization, captured via the top
 command on both systems.
 
 IKE Responder
 Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
 Cpu0  : 99.4%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
 Cpu1  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Mem:   8191428k total,   875784k used,  7315644k free,        0k buffers
 Swap:        0k total,        0k used,        0k free,   546116k cached
 
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  1180 root      20   0  945m 114m 2388 S 99.8  1.4   1:58.69 charon
 
 
 IKE Initiator
 Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
 Cpu0  : 96.1%us,  1.8%sy,  0.0%ni,  1.8%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
 Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu5  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Mem:   8191428k total,   971792k used,  7219636k free,        0k buffers
 Swap:        0k total,        0k used,        0k free,   546104k cached
 
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  1147 root      20   0  639m 211m 2560 S 98.2  2.6   2:47.13 charon
 
 
 Should I modify the strongSwan code to use the
 pthread_setaffinity_np() function to pin individual
 threads to different cores and obtain a performance
 benefit? That way I hope to scale up to 70k-80k
 IPsec connections.
 Thanks in advance for your feedback and suggestions.
 
 Regards,
 Chinmaya
 
 
 
 --------------------------------------------
 On Thu, 10/24/13, Martin Willi <martin at strongswan.org>
 wrote:
 
  Subject: Re: [strongSwan] Performance issue with 25k IPsec
 tunnels (using 5.0.4 strongswan and load-tester plugin)
  To: "Chinmaya Dwibedy" <ckdwibedy at yahoo.com>
  Cc: "users at lists.strongswan.org"
 <users at lists.strongswan.org>
  Date: Thursday, October 24, 2013, 2:06 PM
  
  Hi,
  
  > gmpn_addmul_1 function in  libgmp.so.3.4.1
  consumes most of the CPU
  > cycles on both the Linux systems 
  
  Yes, this was to be expected; DH computation is the most
  expensive task.
  
  > Do I need to use the Libgcrypt instead of GMP
 library?
  
  Probably that won't help; GMP is likely the fastest DH
  backend you can use, see [1].
  
  > 3.72%    charon  libgmp.so.3.4.1    __gmpn_addmul_1
  
  The question is: why is it only eating ~4% of your CPU?
  Is it the same percentage on both systems?
  
  You'll have to find out what is limiting your throughput.
  What changes if you initiate more aggressively? What is
  your overall CPU utilization during testing?
  
  You might also try --enable-lock-profiler; during daemon
  shutdown it prints the cumulative time waited in each lock
  to stderr (run with --nofork). That might give some
  indication if something is not scaling as it should.
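  For the archives, the rebuild-and-run step described above would look roughly like this (a sketch of a typical source build; the install prefix and the use of the ipsec wrapper are assumptions about the local setup):

```
./configure --enable-lock-profiler && make && make install
ipsec start --nofork 2> lock-profile.log   # lock wait times land here on shutdown
```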
  
  Regards
  Martin
  
  [1]http://wiki.strongswan.org/projects/strongswan/wiki/PublicKeySpeed
  
  
 
 _______________________________________________
 Users mailing list
 Users at lists.strongswan.org
 https://lists.strongswan.org/mailman/listinfo/users
 