[strongSwan] Performance issue with 25k IPsec tunnels (using 5.0.4 strongswan and load-tester plugin)
Chinmaya Dwibedy
ckdwibedy at yahoo.com
Mon Nov 11 08:24:06 CET 2013
Hi,
Can anyone please go through the email below and respond? Thanks in advance for your help and support.
Regards,
Chinmaya
--------------------------------------------
On Fri, 11/8/13, Chinmaya Dwibedy <ckdwibedy at yahoo.com> wrote:
Subject: Re: [strongSwan] Performance issue with 25k IPsec tunnels (using 5.0.4 strongswan and load-tester plugin)
To: "Martin Willi" <martin at strongswan.org>
Cc: "users at lists.strongswan.org" <users at lists.strongswan.org>
Date: Friday, November 8, 2013, 5:50 PM
Hi Martin/All,
Thanks for your suggestion.
I modified the strongSwan code (in the main() function of starter.c) so that it restricts the single starter/charon instance to run on the first core only (using sched_getaffinity() and ffsl()). My understanding is that the worker threads in charon (created and managed by strongSwan based on the configuration settings in strongswan.conf), in order to process a large number of tasks (here, 25k IPsec security associations), should be scheduled and distributed among the 16 cores. Please note that we are using two multi-core MIPS64 processors, each with 16 cnMIPS64 v2 cores (one acts as the IKE initiator and the other as the IKE responder), and we run strongSwan on both systems. Both systems have 1 Gbps Ethernet cards connected to a 1 Gbps L2 switch, and Wind River Linux runs on all 16 cores.
With 25k connections and no data traffic, I noticed that the kernel does not migrate tasks/threads (32 threads are configured at both ends) away from the busy core (i.e., the first core) to the other cores. I checked this via

# ps -p <PID of charon daemon> -L -o pid,tid,psr

and found that psr (the processor the thread is currently assigned to) was always zero.
I also ran # perf top 2> /dev/null to monitor all CPUs at both the user and kernel levels and see the functions where most of the time is spent. The call stack shows r4k_wait at ~95%, which implies the kernel has no process to run and is executing the idle loop (r4k_wait is called from the idle loop). Initially, charon takes only ~4% (in libgmp.so.3.4.1) on both Linux systems (the IKE initiator as well as the IKE responder), which is expected.
Here is the overall CPU utilization, captured via the top command on both systems.
IKE Responder
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu0  : 99.4%us, 0.3%sy, 0.0%ni,  0.0%id, 0.0%wa, 0.3%hi, 0.0%si, 0.0%st
Cpu1  :  0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  8191428k total, 875784k used, 7315644k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 546116k cached

 PID USER PR NI VIRT RES  SHR  S %CPU %MEM TIME+   COMMAND
1180 root 20  0 945m 114m 2388 S 99.8  1.4 1:58.69 charon
IKE Initiator
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu0  : 96.1%us, 1.8%sy, 0.0%ni,  1.8%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu1  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5  :  0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9  :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 :  0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  8191428k total, 971792k used, 7219636k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 546104k cached

 PID USER PR NI VIRT RES  SHR  S %CPU %MEM TIME+   COMMAND
1147 root 20  0 639m 211m 2560 S 98.2  2.6 2:47.13 charon
Should I modify the strongSwan code to use the pthread_setaffinity_np() function so that individual threads are pinned to different cores, in order to obtain a performance benefit? If so, I hope to scale up to 70k-80k IPsec connections.
Thanks in advance for your feedback and suggestion.
Regards,
Chinmaya
--------------------------------------------
On Thu, 10/24/13, Martin Willi <martin at strongswan.org> wrote:
Subject: Re: [strongSwan] Performance issue with 25k IPsec tunnels (using 5.0.4 strongswan and load-tester plugin)
To: "Chinmaya Dwibedy" <ckdwibedy at yahoo.com>
Cc: "users at lists.strongswan.org" <users at lists.strongswan.org>
Date: Thursday, October 24, 2013, 2:06 PM
Hi,
> gmpn_addmul_1 function in libgmp.so.3.4.1 consumes most of the CPU
> cycles on both the Linux systems
Yes, this was to be expected; DH computation is the most expensive task.
> Do I need to use the Libgcrypt instead of GMP library?
Probably that won't help; GMP is likely the fastest DH backend you can use, see [1].
> 3.72% charon libgmp.so.3.4.1 __gmpn_addmul_1
The question is: why is it eating only ~4% of your CPU? Is it the same percentage on both systems?
You'll have to find out what is limiting your throughput. What changes if you initiate more aggressively? What is your overall CPU utilization during testing?
You might also try --enable-lock-profiler; during daemon shutdown it prints the cumulative time waited in each lock to stderr (run with --nofork). That might give some indication if something is not scaling as it should.
Regards
Martin
[1] http://wiki.strongswan.org/projects/strongswan/wiki/PublicKeySpeed
_______________________________________________
Users mailing list
Users at lists.strongswan.org
https://lists.strongswan.org/mailman/listinfo/users