[strongSwan] Using multiple UDP sockets with SO_REUSEPORT option to increase high connection rate

Chinmaya Dwibedy ckdwibedy at yahoo.com
Wed Mar 26 13:01:29 CET 2014

On Wednesday, March 26, 2014 5:24 PM, Chinmaya Dwibedy <ckdwibedy at yahoo.com> wrote:


Thanks again for your valuable suggestion. Please review the details below and advise.

We are using two multi-core MIPS64 processors, each with 16 cnMIPS64
v2 cores (one acts as the IKE initiator and the other as the IKE responder).
Both systems run strongSwan and have 1 Gbps Ethernet cards connected to a
1 Gbps L2 switch. Wind River Linux (a kernel with SMP support) runs on all
16 cores.

I have restricted the single starter/charon instance to the first core,
but the pool of threads (created and managed by strongSwan according to
the settings in strongswan.conf) should be scheduled and distributed among
all 16 cores in order to process a large number of tasks. Being
multi-threaded, charon should scale well across multiple cores and take
full advantage of a multi-core system.

At peak load (250k IPsec sessions) I ran #perf top, which provides a
real-time aggregate of the functions where most of the time is spent
across all CPUs and processes. Here are the call stacks at both ends.


IKE Initiator

   PerfTop:  846707 irqs/sec  kernel:88.3% [1000Hz cpu-clock-msecs],  (all, 16 CPUs)

             samples  pcnt function                  DSO
             _______  ____ _________________________ ___________________________
                     89.7% r4k_wait                  [kernel.kallsyms]
           173541.00  1.7% dso__find_symbol          /usr/bin/perf
           101710.00  1.0% pthread_mutex_lock        /lib64/libpthread-2.11.1.so
            85056.00  0.8% event__preprocess_sample  /usr/bin/perf
            48367.00  0.5% vfprintf                  /lib64/libc-2.11.1.so
            44368.00  0.4% pthread_rwlock_rdlock     /lib64/libpthread-2.11.1.so
            41310.00  0.4% __libc_malloc             /lib64/libc-2.11.1.so
            40884.00  0.4% __pthread_rwlock_unlock   /lib64/libpthread-2.11.1.so
            34031.00  0.3% maps__find                /usr/bin/perf
            29452.00  0.3% cfree                     /lib64/libc-2.11.1.so
            26343.00  0.3% dump_printf               /usr/bin/perf
            20629.00  0.2% perf_session__findnew     /usr/bin/perf
            18454.00  0.2% map__find_symbol          /usr/bin/perf
            15415.00  0.1% finish_task_switch        [kernel.kallsyms]
            13397.00  0.1% _IO_default_xsputn        /lib64/libc-2.11.1.so

IKE Responder

   PerfTop:  869276 irqs/sec  kernel:88.8% [1000Hz cpu-clock-msecs],  (all, 16 CPUs)

             samples  pcnt function                    DSO
             _______  ____ ___________________________ ___________________________
                     89.5% r4k_wait                    [kernel.kallsyms]
           180167.00  1.6% dso__find_symbol            /usr/bin/perf
           109755.00  1.0% pthread_mutex_lock          libpthread-2.11.1.so
            91641.00  0.8% event__preprocess_sample    /usr/bin/perf
            86032.00  0.8% pthread_rwlock_rdlock       libpthread-2.11.1.so
            79349.00  0.7% __pthread_rwlock_unlock     libpthread-2.11.1.so
            41658.00  0.4% cfree                       /lib64/libc-2.11.1.so
            40051.00  0.4% __libc_malloc               /lib64/libc-2.11.1.so
            37084.00  0.3% maps__find                  /usr/bin/perf
            36666.00  0.3% vfprintf                    /lib64/libc-2.11.1.so
            28505.00  0.2% dump_printf                 /usr/bin/perf
            27656.00  0.2% finish_task_switch          [kernel.kallsyms]
            24894.00  0.2% SHA1Transform               libstrongswan-sha1.so
            20652.00  0.2% perf_session__findnew       /usr/bin/perf
            20321.00  0.2% map__find_symbol            /usr/bin/perf
            15616.00  0.1% _raw_spin_unlock_irqrestore [kernel.kallsyms]


The overall CPU utilization at both ends was below 10%. I profiled
(using the perf tool) some of the threads at both ends to find the
hotspots. The profile shows that most of the time is spent in
__pthread_rwlock_rdlock(). The #mpstat -P ALL command (at both ends)
shows that core 0 is fully busy while the other 15 cores are idle.
Since all of the threads are executing on the same CPU (core 0), charon
is not scaling well.
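One thing worth ruling out is an inherited CPU affinity mask: if the starter was pinned (e.g. with taskset) before forking charon, every worker thread inherits the single-core mask. The sketch below uses Python's wrappers around the same Linux sched_getaffinity/sched_setaffinity calls (in charon's C code the equivalents would be sched_setaffinity(2) or pthread_setaffinity_np(3)) to show how a one-core mask looks and how it is widened again; it is an illustration of the API, not charon code.

```python
import os

allowed = os.sched_getaffinity(0)     # CPUs this process may run on
print("allowed CPUs:", sorted(allowed))

# Pin to a single CPU: this reproduces the "everything on core 0"
# symptom that mpstat shows when a mask was inherited from taskset.
one_cpu = {min(allowed)}
os.sched_setaffinity(0, one_cpu)
assert os.sched_getaffinity(0) == one_cpu

# Restore the full mask so the scheduler can use every core again.
os.sched_setaffinity(0, allowed)
assert os.sched_getaffinity(0) == allowed
```

Checking /proc/<charon-pid>/status (the Cpus_allowed_list field) during the test would show whether the daemon's threads really are restricted to core 0.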


Here is the strongswan.conf configuration at both ends.

IKE Initiator
        threads = 32
        replay_window =
        dos_protection =
        = 73728
        ikesa_table_segments = 256
        reuse_ikesa = no
        load-tester {
                enable = yes
                initiators = 5
                iterations = 50000
                = 10
                responder =
                proposal = aes128-sha1-modp768
                initiator_auth = psk
                responder_auth = psk
                request_virtual_ip = yes
                ike_rekey = 0
                child_rekey = 0
                delete_after_established = no
                shutdown_when_complete = no
        }
        libstrongswan {
                dh_exponent_ansi_x9_42 = no
        }

IKE Responder
        threads = 32
        replay_window =
        dos_protection =
        = 73728
        ikesa_table_segments = 256
        reuse_ikesa = no
        libstrongswan {
                = no
        }
        processor {
                high = 1
                medium = 10
        }

I also compiled with --enable-lock-profiler and ran with --nofork under
the 250k IPsec session load, but during daemon shutdown it does not
print the cumulative time waited in each lock to stderr.




On Monday, March 24, 2014 3:11 PM, Chinmaya Dwibedy <ckdwibedy at yahoo.com> wrote:

Hi Martin,

Thanks a lot, Martin, for your prompt response and valuable suggestion.
I have configured the number of threads to 32 at both ends (IKE
responder and IKE initiator). At the IKE initiator end, if I increase
the sender threads (i.e., initiators in the load-tester section) from 5
to 10 (i.e., to put load on all cores), I see the following at the IKE
responder end:

	1. There is heavy packet loss (packet receive errors in the Udp section of #netstat -su).
	2. The Recv-Q column for the isakmp socket in #netstat -ua is high and does not drop to zero.

These statistics mean that the receiver thread (of the charon daemon at
the IKE responder end) is not reading the socket fast enough: the
average arrival rate regularly causes a backlog in the receive queue.
The maximum amount of queued receive data depends on
/proc/sys/net/ipv4/udp_mem and /proc/sys/net/ipv4/udp_rmem_min, so I
increased these values to 8 MB before starting the test, but still
faced the same issue. Note that if I keep initiators and iterations at
5 and 50000 respectively, there is no loss at all and the setup rate I
am getting is
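One detail that can hide behind those sysctls: an explicit per-socket SO_RCVBUF request is silently capped at net.core.rmem_max, so raising the udp_* values alone may not give a socket the buffer you expect. The sketch below (Python for brevity; charon would issue the same setsockopt(2) from C, this is not its actual code) shows how to compare what was requested with what the kernel actually granted.

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
requested = 8 * 1024 * 1024          # the 8 MB figure used in the test above

s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
granted = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

# Linux doubles the requested value to account for bookkeeping overhead
# and silently caps it at net.core.rmem_max, so 'granted' can be far
# smaller than 16 MB unless rmem_max was raised as well.
print(f"requested {requested} bytes, kernel granted {granted} bytes")
s.close()
```

If 'granted' comes back at the old rmem_max value, the backlog would persist regardless of the udp_mem/udp_rmem_min changes.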

Here is the output of "ipsec statusall" at the IKE responder end during
the test:

Status of IKE charon daemon (strongSwan 5.0.4, Linux,
  uptime: 2 minutes, since Jan 01 00:11:15 1970
  malloc: sbrk 500166656, mmap 5005312, used 500018224, free 148432
  worker threads: 24 of 32 idle, 6/2/0/0 working, job queue: 0/2/0/0, scheduled:
  loaded plugins: charon aes des sha1 sha2 md5 random nonce x509 revocation constraints pubkey pkcs1 pkcs8 pgp dnskey pem fips-prf gmp xcbc cmac hmac attr kernel-netlink resolve socket-default stroke updown
Virtual IP pools (size/online/offline):

The job queue load shows 2 jobs queued at CRITICAL priority. I also
profiled the charon daemon (IKE responder) and found that most of the
threads are blocked in pthread_cond_wait(), which means a thread
currently has no work to do and is waiting for a job. Any suggestion as
to what limits this connection rate, or any pointer on how to debug
this issue further?
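For readers following along: threads parked in pthread_cond_wait() are simply idle workers waiting on the job queue's condition variable, so seeing them dominate a profile is expected when the queue is mostly empty. The following is a minimal sketch of that worker-pool pattern (illustrative only, not charon's actual processor implementation):

```python
import threading
import time
from collections import deque

jobs = deque()          # pending jobs
results = []            # completed job results
cond = threading.Condition()
stop = False

def worker():
    while True:
        with cond:
            # Idle workers block here: the moral equivalent of the
            # pthread_cond_wait() that dominates the profile above.
            while not jobs and not stop:
                cond.wait()
            if not jobs and stop:
                return
            job = jobs.popleft()
        results.append(job())       # run the job without holding the lock

pool = [threading.Thread(target=worker) for _ in range(4)]
for t in pool:
    t.start()

with cond:
    for i in range(100):
        jobs.append(lambda i=i: i * 2)
    cond.notify_all()               # wake the parked workers

while len(results) < 100:           # wait for the queue to drain
    time.sleep(0.01)
with cond:
    stop = True
    cond.notify_all()
for t in pool:
    t.join()

print("processed", len(results), "jobs")
```

So an idle pool plus a persistently non-empty job queue would point at something upstream of the workers (e.g. the single receiver) rather than at the pool itself.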



On Friday, March 21, 2014 8:23 PM, Martin Willi <martin at strongswan.org> wrote:

> And the single receiver thread becomes bottleneck due to high
> connection rate/setup rate.

The receiver job is rather trivial, only the IKE header is parsed and
some rate limiting is enforced for DoS protection. Any further
processing is delegated to the thread pool using process_message_job().
So I have my doubts that this is the bottleneck you are actually
looking for.

> Is it possible to create separate UDP sockets for each thread? The
> SO_REUSEPORT socket option allows multiple UDP sockets to be bound to
> the same port. With SO_REUSEPORT, multiple threads could each use
> recvfrom() on their own socket to accept datagrams arriving on the port.

Theoretically that is possible, but I'm not really sure if that helps to
fix the issues you are seeing.
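For reference, the SO_REUSEPORT behaviour discussed above can be demonstrated in a few lines: the kernel hashes the source address/port to pick one member of the group, so a given peer's datagrams consistently land on the same socket. A minimal sketch (Python for brevity; the C equivalent is setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) before bind(), available since Linux 3.9):

```python
import select
import socket

def reuseport_udp_socket(port):
    """A UDP socket that may share its port with others (Linux >= 3.9)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    return s

# Two sockets bound to the very same UDP port -- the second bind()
# would fail with EADDRINUSE without SO_REUSEPORT on both.
sock_a = reuseport_udp_socket(0)
port = sock_a.getsockname()[1]
sock_b = reuseport_udp_socket(port)

# The kernel delivers each datagram to one group member, chosen by a
# hash of the sender's address and port.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"IKE_SA_INIT", ("127.0.0.1", port))

readable, _, _ = select.select([sock_a, sock_b], [], [], 2.0)
data, peer = readable[0].recvfrom(1500)
print("datagram delivered to", "sock_a" if readable[0] is sock_a else "sock_b")
```

Note the per-peer stickiness: one heavy peer still hits only one socket, which is one reason spreading sockets may not remove a receive-side bottleneck by itself.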

When running your tests, how does your job queue in "ipsec statusall"
look like? If you have many jobs queued that don't get processed,
something else prevents that charon scales properly on your platform.



Users mailing list
Users at lists.strongswan.org