[strongSwan] Using multiple UDP sockets with SO_REUSEPORT option to increase high connection rate
Chinmaya Dwibedy
ckdwibedy at yahoo.com
Thu Mar 27 08:33:56 CET 2014
Hi Martin,
If time permits, kindly go through the email below and share your suggestions.
Regards,
Chinmaya
On Wednesday, March 26, 2014 5:31 PM, Chinmaya Dwibedy <ckdwibedy at yahoo.com> wrote:
On Wednesday, March 26, 2014 5:24 PM, Chinmaya Dwibedy <ckdwibedy at yahoo.com> wrote:
Hi,
Thanks again for your valuable suggestion. Please go through the details below and guide me.
We are using two multi-core MIPS64 processors, each with 16 cnMIPS64 v2 cores
(one acts as the IKE initiator and the other as the IKE responder). We are
running strongSwan on both systems. Both systems have 1 Gbps Ethernet cards,
which are connected to a 1 Gbps L2 switch. Wind River Linux (a kernel with SMP
support) runs on all 16 cores.
I have restricted the single instance of starter/charon to run on the first
core, but the group of threads (created and managed by strongSwan based on the
settings in strongswan.conf) that processes the large number of tasks should
be scheduled and distributed among all 16 cores. Being multi-threaded, charon
should scale very well to multiple cores and take full advantage of a
multi-core system.
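For reference, a minimal sketch of one way such a restriction can be applied
(illustrative only; whether it is done via taskset or sched_setaffinity() is
an assumption here, not a description of our exact setup):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      cpu_set_t set;

      CPU_ZERO(&set);
      CPU_SET(0, &set);   /* allow core 0 only */

      /* pin the current process; threads created later inherit this mask
       * unless it is widened again for them */
      if (sched_setaffinity(getpid(), sizeof(set), &set) != 0)
      {
          perror("sched_setaffinity");
          return 1;
      }
      printf("pinned PID %d to core 0\n", (int)getpid());
      return 0;
  }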
At peak load (250k IPsec sessions) I ran #perf top, which provides a real-time
aggregate of the functions in which most of the time is spent, across all CPUs
and processes. Here is the output at both ends.
IKE Initiator
-------------------------------------------------------------------------------
   PerfTop: 846707 irqs/sec  kernel:88.3% [1000Hz cpu-clock-msecs], (all, 16 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ ___________________________

          9323523.00 89.7% r4k_wait                    [kernel.kallsyms]
           173541.00  1.7% dso__find_symbol            /usr/bin/perf
           101710.00  1.0% pthread_mutex_lock          /lib64/libpthread-2.11.1.so
            85056.00  0.8% event__preprocess_sample    /usr/bin/perf
            48367.00  0.5% vfprintf                    /lib64/libc-2.11.1.so
            44368.00  0.4% pthread_rwlock_rdlock       /lib64/libpthread-2.11.1.so
            41310.00  0.4% __libc_malloc               /lib64/libc-2.11.1.so
            40884.00  0.4% __pthread_rwlock_unlock     /lib64/libpthread-2.11.1.so
            34031.00  0.3% maps__find                  /usr/bin/perf
            29452.00  0.3% cfree                       /lib64/libc-2.11.1.so
            26343.00  0.3% dump_printf                 /usr/bin/perf
            20629.00  0.2% perf_session__findnew       /usr/bin/perf
            18454.00  0.2% map__find_symbol            /usr/bin/perf
            15415.00  0.1% finish_task_switch          [kernel.kallsyms]
            13397.00  0.1% _IO_default_xsputn          /lib64/libc-2.11.1.so
IKE Responder
-------------------------------------------------------------------------------
   PerfTop: 869276 irqs/sec  kernel:88.8% [1000Hz cpu-clock-msecs], (all, 16 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function                    DSO
             _______ _____ ___________________________ _____________________

         10201775.00 89.5% r4k_wait                    [kernel.kallsyms]
           180167.00  1.6% dso__find_symbol            /usr/bin/perf
           109755.00  1.0% pthread_mutex_lock          libpthread-2.11.1.so
            91641.00  0.8% event__preprocess_sample    /usr/bin/perf
            86032.00  0.8% pthread_rwlock_rdlock       libpthread-2.11.1.so
            79349.00  0.7% __pthread_rwlock_unlock     libpthread-2.11.1.so
            41658.00  0.4% cfree                       /lib64/libc-2.11.1.so
            40051.00  0.4% __libc_malloc               /lib64/libc-2.11.1.so
            37084.00  0.3% maps__find                  /usr/bin/perf
            36666.00  0.3% vfprintf                    /lib64/libc-2.11.1.so
            28505.00  0.2% dump_printf                 /usr/bin/perf
            27656.00  0.2% finish_task_switch          [kernel.kallsyms]
            24894.00  0.2% SHA1Transform               libstrongswan-sha1.so
            20652.00  0.2% perf_session__findnew       /usr/bin/perf
            20321.00  0.2% map__find_symbol            /usr/bin/perf
            15616.00  0.1% _raw_spin_unlock_irqrestore [kernel.kallsyms]
The overall CPU utilization at both ends was below 10%. I also profiled some
of the threads at both ends (using the perf tool) to find the hotspots; the
profiles show that most of the time is spent in __pthread_rwlock_rdlock().
The #mpstat -P ALL command (at both ends) shows that core 0 is fully busy
while the other 15 cores are essentially idle. Since all of the threads are
executing on the same CPU (core 0), charon is not scaling well.
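To verify which cores a thread is actually allowed to run on (for example,
whether the workers only inherited core 0 from the starter), a check along the
following lines could be used; this is only a sketch, the label is
hypothetical and it is not charon code:

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>
  #include <string.h>

  /* print the CPUs the calling thread may run on */
  static void print_affinity(const char *label)
  {
      cpu_set_t set;
      int err, cpu;

      err = pthread_getaffinity_np(pthread_self(), sizeof(set), &set);
      if (err != 0)
      {
          fprintf(stderr, "pthread_getaffinity_np: %s\n", strerror(err));
          return;
      }
      printf("%s allowed CPUs:", label);
      for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
      {
          if (CPU_ISSET(cpu, &set))
          {
              printf(" %d", cpu);
          }
      }
      printf("\n");
  }

  int main(void)
  {
      print_affinity("main thread");
      return 0;
  }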
Here is the strongswan.conf configuration at both ends.
IKE Initiator

threads = 32
replay_window = 32
dos_protection = no
block_threshold = 1000
cookie_threshold = 1000
init_limit_half_open = 25000
init_limit_job_load = 25000
retransmit_timeout = 30
retransmit_tries = 30
install_virtual_ip = no
install_routes = no
close_ike_on_child_failure = yes
ikesa_table_size = 73728
ikesa_table_segments = 256
reuse_ikesa = no
load-tester {
    enable = yes
    initiators = 5
    iterations = 50000
    delay = 10
    responder = 30.30.30.21
    proposal = aes128-sha1-modp768
    initiator_auth = psk
    responder_auth = psk
    request_virtual_ip = yes
    initiator_tsr = 40.0.0.0/8
    ike_rekey = 0
    child_rekey = 0
    delete_after_established = no
    shutdown_when_complete = no
}
libstrongswan {
    dh_exponent_ansi_x9_42 = no
}
IKE Responder

threads = 32
replay_window = 32
dos_protection = no
block_threshold = 1000
cookie_threshold = 1000
init_limit_half_open = 25000
init_limit_job_load = 25000
half_open_timeout = 1000
close_ike_on_child_failure = yes
ikesa_table_size = 73728
ikesa_table_segments = 256
reuse_ikesa = no
libstrongswan {
    dh_exponent_ansi_x9_42 = no
    processor {
        priority_threads {
            high = 1
            medium = 10
        }
    }
}
I also compiled with --enable-lock-profiler and ran with --nofork for the 250k
IPsec session test, but during daemon shutdown it does not print the
cumulative time waited in each lock to stderr.
Regards,
Chinmaya
On Monday, March 24, 2014 3:11 PM, Chinmaya Dwibedy <ckdwibedy at yahoo.com> wrote:
Hi Martin,
Thanks a lot for your prompt response and your valuable suggestion.
I have configured the number of threads to 32 at both ends (IKE responder and
IKE initiator). At the IKE initiator end, if I increase the sender threads
(i.e., initiators in the load-tester section) from 5 to 10 (i.e., to put load
on all cores), I find the following at the IKE responder end:
1. There is heavy packet loss (packet receive errors in the Udp section) in
   #netstat -su.
2. The Recv-Q column of the isakmp socket is high and doesn't drop to zero in
   #netstat -ua.
These statistics suggest that the receiver thread (of the charon daemon at the
IKE responder end) is not reading the socket fast enough; the average arrival
rate regularly causes a backlog in the receive queue. The maximum amount of
queued receive data depends on /proc/sys/net/ipv4/udp_mem and
/proc/sys/net/ipv4/udp_rmem_min, so I increased these values to 8 MB before
starting the test, but I still faced the same issue. Note that if I keep the
initiators and iterations at 5 and 50000 respectively, there is no loss at all
and I get a setup rate of 220-225.
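For completeness, apart from the sysctls above, the receive queue of a single
UDP socket can also be enlarged per socket with SO_RCVBUF; the sketch below
only illustrates that knob (it is not charon's socket code, the port is chosen
for illustration, and the requested value is capped by net.core.rmem_max
unless SO_RCVBUFFORCE is used with CAP_NET_ADMIN):

  #include <stdio.h>
  #include <string.h>
  #include <netinet/in.h>
  #include <sys/socket.h>

  int main(void)
  {
      int fd, rcvbuf = 8 * 1024 * 1024;   /* request an 8 MB receive buffer */
      socklen_t len = sizeof(rcvbuf);
      struct sockaddr_in addr;

      fd = socket(AF_INET, SOCK_DGRAM, 0);
      if (fd < 0)
      {
          perror("socket");
          return 1;
      }
      /* the kernel caps the request at net.core.rmem_max */
      if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) != 0)
      {
          perror("setsockopt(SO_RCVBUF)");
      }
      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_port = htons(4500);        /* IKE NAT-T port, for illustration */
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
      {
          perror("bind");
          return 1;
      }
      getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
      printf("effective SO_RCVBUF: %d bytes\n", rcvbuf);
      return 0;
  }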
Here is the output of "ipsec statusall" at the IKE responder end during the
test:
Status of IKE charon daemon (strongSwan 5.0.4, Linux 2.6.34.10-grsec-BenuOcteon, mips64):
  uptime: 2 minutes, since Jan 01 00:11:15 1970
  malloc: sbrk 500166656, mmap 5005312, used 500018224, free 148432
  worker threads: 24 of 32 idle, 6/2/0/0 working, job queue: 0/2/0/0, scheduled: 37639
  loaded plugins: charon aes des sha1 sha2 md5 random nonce x509 revocation
  constraints pubkey pkcs1 pkcs8 pgp dnskey pem fips-prf gmp xcbc cmac hmac
  attr kernel-netlink resolve socket-default stroke updown xauth-generic
Virtual IP pools (size/online/offline):
  10.0.0.0/8: 16777214/21757/0
The job queue load shows 2 jobs queued for CRITICAL priority. I also profiled
the charon daemon (IKE responder) and found that most of the threads are
blocked in pthread_cond_wait(), which means those threads currently have no
work to do and are waiting for a job. Any suggestion on what limits this
connection rate, or any pointer on how to debug this issue further?
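For context, threads parked in pthread_cond_wait() is exactly what an idle
worker pool looks like in a profile; here is a minimal, self-contained sketch
of that pattern (illustrative only, not charon's processor code; build with
-pthread):

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  /* a hypothetical single-slot "job queue": just an integer job id */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  job_added = PTHREAD_COND_INITIALIZER;
  static int pending_job = 0;          /* 0 means "queue empty" */

  static void *worker(void *arg)
  {
      (void)arg;
      pthread_mutex_lock(&lock);
      while (pending_job == 0)
      {
          /* nothing to do: park here; idle workers show up as threads
           * blocked in pthread_cond_wait() when profiled */
          pthread_cond_wait(&job_added, &lock);
      }
      printf("worker picked up job %d\n", pending_job);
      pending_job = 0;
      pthread_mutex_unlock(&lock);
      return NULL;
  }

  int main(void)
  {
      pthread_t tid;

      pthread_create(&tid, NULL, worker, NULL);
      sleep(1);                        /* let the worker block on the condvar */

      pthread_mutex_lock(&lock);
      pending_job = 42;                /* enqueue a job ... */
      pthread_cond_signal(&job_added); /* ... and wake one idle worker */
      pthread_mutex_unlock(&lock);

      pthread_join(tid, NULL);
      return 0;
  }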
Regards,
Chinmaya
On Friday, March 21, 2014 8:23 PM, Martin Willi <martin at strongswan.org> wrote:
> And the single receiver thread becomes bottleneck due to high
> connection rate/setup rate.
The receiver job is rather trivial, only the IKE header is parsed and some
rate limiting is enforced for DoS protection. Any further processing is
delegated to the thread pool using that process_message_job(). So I have my
doubts that this is the bottleneck you are actually looking for.
> Is it possible to create separate UDP sockets for each thread? The
> SO_REUSEPORT socket option allows multiple UDP sockets to be bound to
> the same port. With SO_REUSEPORT, each thread could use recvfrom()
> on its own socket to accept datagrams arriving on the port.
Theoretically that is possible, but I'm not really sure if that helps to
fix the issues you are seeing.
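As an illustration of the idea (a sketch only, not strongSwan code), each
worker thread could bind its own UDP socket with SO_REUSEPORT set before
bind(); note that kernel-side load distribution across such UDP sockets is
only available from Linux 3.9 onward:

  #include <stdio.h>
  #include <string.h>
  #include <netinet/in.h>
  #include <sys/socket.h>

  /* open a UDP socket bound to the given port with SO_REUSEPORT, so that
   * several threads can each own a socket bound to the same port */
  static int open_reuseport_socket(int port)
  {
      int fd, on = 1;
      struct sockaddr_in addr;

      fd = socket(AF_INET, SOCK_DGRAM, 0);
      if (fd < 0)
      {
          perror("socket");
          return -1;
      }
  #ifdef SO_REUSEPORT
      if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) != 0)
      {
          perror("setsockopt(SO_REUSEPORT)");
      }
  #endif
      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_port = htons(port);
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
      {
          perror("bind");
          return -1;
      }
      return fd;
  }

  int main(void)
  {
      /* two sockets on the same port only coexist when SO_REUSEPORT works;
       * each worker thread would then loop on recvfrom() over its own fd */
      int a = open_reuseport_socket(4500);
      int b = open_reuseport_socket(4500);

      printf("fds: %d %d\n", a, b);
      return 0;
  }

With a kernel that supports it, incoming datagrams are distributed across all
sockets in the group, so each thread effectively gets its own receive queue.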
When running your tests, how does your job queue in "ipsec statusall"
look like? If you have many jobs queued that don't get processed,
something else prevents that charon scales properly on your platform.
Regards
Martin
_______________________________________________
Users mailing list
Users at lists.strongswan.org
https://lists.strongswan.org/mailman/listinfo/users