[strongSwan] Throughput on high BDP networks
jsullivan at opensourcedevel.com
Thu Jun 4 17:28:09 CEST 2015
> On June 3, 2015 at 3:51 PM "John A. Sullivan III"
> <jsullivan at opensourcedevel.com> wrote:
>
>
> On Tue, 2015-06-02 at 22:23 -0400, jsullivan at opensourcedevel.com wrote:
> > > On June 1, 2015 at 11:48 AM Martin Willi <martin at strongswan.org> wrote:
> > >
> > >
> > >
> > > > Even at these rates, the CPU did not appear to be very busy. We had
> > > > one at 85% occupied, but that was the one running nuttcp.
> > >
> > > On the outgoing path, the Linux kernel usually accounts ESP encryption
> > > under the process that sends traffic using a socket send() call. So
> > > these 85% probably include AES-GCM.
> > >
> > > On the receiving or forwarding path, you'll have to look at the software
> > > interrupt usage (si in top).
> > >
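For anyone following along, per-CPU softirq usage can be watched with
something like the following (sysstat's mpstat assumed to be installed;
the NET_RX/NET_TX rows of /proc/softirqs are the interesting ones):
# per-CPU utilization breakdown including %soft, refreshed every second
mpstat -P ALL 1
# raw softirq counters per CPU, highlighting changes between refreshes
watch -d -n1 cat /proc/softirqs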
> > > > We have seen these boxes pass almost 20 Gbps with single-digit
> > > > utilization, so they have plenty of horsepower.
> > >
> > > That does not have to mean much. It's all about encryption, and that is
> > > rather expensive. If you have specialized hardware, this most likely
> > > means it is good at shuffling data over the network, but it might be
> > > underpowered when it has to do encryption in software.
> > >
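As a rough sanity check of raw software crypto throughput on a box like
this, OpenSSL's built-in benchmark can be used (openssl assumed to be
installed; results are per core):
# single-core AES-GCM throughput; AES-NI is used automatically if present
openssl speed -evp aes-128-gcm
# confirm the CPU exposes the AES flag at all
grep -m1 -o aes /proc/cpuinfo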
> > > > We are also running haveged on them to prevent entropy starvation
> > > > for the encryption.
> > >
> > > Only the key exchange needs entropy, raw AES-GCM does not.
> > >
> > > Regards
> > > Martin
> > >
> >
> > Hello, all. Still battling this problem. The system is a SuperMicro
> > server, not a specialized device. It looks like the problem may be
> > software interrupts, but on the sending side, so I am very curious about
> > the recommendation to check the receiving side. When sending only 100
> > Mbps or so, I see a single CPU pegged at 100% for si. I wondered if it
> > might be the number of ACK packets being returned from the other side:
> > we have a large window size, so we get a flood of ACK packets in reply,
> > but that really doesn't seem to make sense. It would, however, explain
> > why we see this more in TCP than UDP . . . or so we thought. I then sent
> > 200 Mbps of UDP traffic, so there was virtually no reply traffic, and
> > the sender was still at 100% si. What might be generating such a huge
> > number of software interrupts, and how can we reduce them or spread them
> > across multiple processors?
> >
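For the record, one way to spread the network softirq load is receive
packet steering (RPS) plus explicit IRQ affinity. A minimal sketch,
assuming the interface is eth5 and CPUs 0-3 should take the work (the
mask, queue index, and device name are only illustrative):
# steer receive softirq processing for eth5's first RX queue onto CPUs 0-3
echo f > /sys/class/net/eth5/queues/rx-0/rps_cpus
# find the NIC's hardware IRQ numbers
grep eth5 /proc/interrupts
# then pin each IRQ number N reported above, e.g.:
# echo f > /proc/irq/N/smp_affinity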
> <snip>
> We appear to be chasing a compound problem, perhaps also involving
> problems with GRE. As we try to isolate components, one issue we see is
> TCP window size. Even though net.core.rmem_max/wmem_max and the
> tcp_rmem/tcp_wmem maxima are all over 16M, we are not achieving a TCP
> window much larger than 4M once IPSec enters the mix. This is the case
> with IPSec alone, and also when we add a GRE tunnel (to make it a little
> easier to do a packet trace): with GRE only, the TCP window grows to the
> full 16M (but we then have a problem with packet drops), while with
> GRE/IPSec the packet drops magically go away (perhaps due to the lower
> throughput) but the TCP window stays stuck at that 4M level.
>
> What would cause this, and how do we get the full-sized TCP window
> inside an IPSec transport stream? Thanks - John
>
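For what it's worth, those numbers are roughly consistent with a
window-limited transfer: throughput is bounded by window / RTT, so with an
RTT on the order of 80 ms (purely illustrative; the actual RTT is not
quoted here), a 4 MB window caps out around 4 MB * 8 / 0.08 s, roughly 400
Mbps, while a full 16 MB window would allow roughly 1.6 Gbps. The window a
connection is actually using can be inspected mid-transfer with something
like:
# sender-side view of window/cwnd/rtt for connections to the receiver
# (<receiver-ip> is a placeholder)
ss -tmi dst <receiver-ip>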
> <snip>
I suppose this might imply that the receiving station cannot drain the queue
faster than 421 Mbps, but I do not see the bottleneck. There are no drops in
the NIC ring buffers:
root at lcppeppr-labc02:~# ethtool -S eth5 | grep drop
rx_dropped: 0
tx_dropped: 0
rx_fcoe_dropped: 0
root at lcppeppr-labc02:~# ethtool -S eth7 | grep drop
rx_dropped: 0
tx_dropped: 0
rx_fcoe_dropped: 0
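The ring sizes themselves can also be checked against their hardware
maximums; a sketch for eth5 (the -G values below are only examples):
# show current vs. pre-set maximum RX/TX ring sizes
ethtool -g eth5
# raise them toward the maximums reported above if they are small, e.g.:
# ethtool -G eth5 rx 4096 tx 4096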
There are no drops at the IP level:
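For reference, IP- and TCP-level drop/prune counters can be pulled with
something like:
# protocol counters; non-zero prune/collapse values indicate rmem pressure
netstat -s | egrep -i 'drop|prune|collapse'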
Plenty of receive buffer space:
net.core.netdev_max_backlog = 10000
net.core.rmem_max = 16782080
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 8960 89600 16782080
net.ipv4.tcp_wmem = 4096 65536 16777216
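Of these, the third field of tcp_rmem is what TCP receive autotuning uses
as its ceiling; net.core.rmem_max only caps explicitly requested SO_RCVBUF
sizes. The values can be applied and re-checked at runtime with, e.g.:
# set and verify the receive autotuning limit without a reboot
sysctl -w net.ipv4.tcp_rmem="8960 89600 16782080"
sysctl net.ipv4.tcp_rmem net.core.rmem_max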
The CPUs are not overloaded, nor are software interrupts excessive:
top - 11:27:02 up 16:58, 1 user, load average: 1.00, 0.56, 0.29
Tasks: 189 total, 3 running, 186 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.1 us, 3.4 sy, 0.0 ni, 96.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 2.0 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 4.1 sy, 0.0 ni, 95.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.1 us, 15.7 sy, 0.0 ni, 84.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 0.3 us, 12.6 sy, 0.0 ni, 87.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.0 us, 9.3 sy, 0.0 ni, 80.9 id, 0.0 wa, 0.0 hi, 9.9 si, 0.0 st
%Cpu6 : 0.0 us, 2.5 sy, 0.0 ni, 97.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 0.0 us, 2.6 sy, 0.0 ni, 97.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8 : 0.0 us, 29.2 sy, 0.0 ni, 70.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 0.0 us, 19.8 sy, 0.0 ni, 75.9 id, 0.0 wa, 0.1 hi, 4.2 si, 0.0 st
%Cpu10 : 0.0 us, 3.1 sy, 0.0 ni, 96.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 0.0 us, 3.7 sy, 0.0 ni, 96.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32985376 total, 404108 used, 32581268 free, 33604 buffers
KiB Swap: 7836604 total, 0 used, 7836604 free, 83868 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7545 root 20 0 6280 1492 1288 S 30 0.0 0:14.05 nuttcp
49 root 20 0 0 0 0 S 24 0.0 0:51.33 kworker/8:0
7523 root 20 0 0 0 0 S 22 0.0 0:37.08 kworker/9:0
7526 root 20 0 0 0 0 S 6 0.0 0:10.22 kworker/5:2
1441 root 20 0 0 0 0 S 4 0.0 0:13.14 kworker/11:2
7527 root 20 0 0 0 0 S 4 0.0 0:07.07 kworker/2:1
7458 root 20 0 0 0 0 S 4 0.0 0:06.89 kworker/8:2
33 root 20 0 0 0 0 S 4 0.0 1:12.46 ksoftirqd/5
7528 root 20 0 0 0 0 S 4 0.0 0:06.40 kworker/10:2
7524 root 20 0 0 0 0 S 3 0.0 0:05.57 kworker/1:2
7525 root 20 0 0 0 0 R 3 0.0 0:05.61 kworker/0:0
6131 root 20 0 0 0 0 S 3 0.0 0:40.66 kworker/7:2
7531 root 20 0 0 0 0 S 3 0.0 0:06.51 kworker/8:1
7519 root 20 0 0 0 0 R 3 0.0 0:01.50 kworker/6:2
89 root 20 0 0 0 0 S 3 0.0 0:16.06 kworker/3:1
1972 root 20 0 0 0 0 S 2 0.0 0:22.64 kworker/4:2
6828 root 20 0 0 0 0 S 2 0.0 0:03.84 kworker/3:2
6047 root 20 0 0 0 0 S 2 0.0 0:23.63 kworker/9:1
7123 root 20 0 0 0 0 S 2 0.0 0:03.58 kworker/9:2
7300 root 20 0 0 0 0 S 2 0.0 0:03.18 kworker/4:0
4632 root 0 -20 15900 4828 1492 S 0 0.0 2:33.04 conntrackd
7337 root 20 0 0 0 0 S 0 0.0 0:00.14 kworker/10:0
7529 root 20 0 0 0 0 S 0 0.0 0:05.59 kworker/11:0
91 root 20 0 0 0 0 S 0 0.0 0:39.70 kworker/5:1
4520 root 20 0 4740 1780 1088 S 0 0.0 0:14.17 haveged
7221 root 20 0 0 0 0 S 0 0.0 0:00.52 kworker/5:0
7112 root 20 0 0 0 0 S 0 0.0 0:00.09 kworker/11:1
7543 root 20 0 0 0 0 S 0 0.0 0:00.05 kworker/u24:2
1 root 20 0 10660 1704 1564 S 0 0.0 0:02.38 init
2 root 20 0 0 0 0 S 0 0.0 0:00.01 kthreadd
3 root 20 0 0 0 0 S 0 0.0 0:00.46 ksoftirqd/0
5 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/0:0H
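One more place drops can hide is the IPsec layer itself; for completeness,
per-SA and global XFRM error counters can be checked with something like
the following (the xfrm_stat file is only present when the kernel is built
with CONFIG_XFRM_STATISTICS):
# per-SA statistics, including replay-window and error counters
ip -s xfrm state
# global XFRM error counters, if compiled in
cat /proc/net/xfrm_stat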
Where would the bottleneck be? Thanks - John