[strongSwan] Throughput on high BDP networks
jsullivan at opensourcedevel.com
Thu Jun 4 17:28:09 CEST 2015
> On June 3, 2015 at 3:51 PM "John A. Sullivan III"
> <jsullivan at opensourcedevel.com> wrote:
>
>
> On Tue, 2015-06-02 at 22:23 -0400, jsullivan at opensourcedevel.com wrote:
> > > On June 1, 2015 at 11:48 AM Martin Willi <martin at strongswan.org> wrote:
> > >
> > >
> > >
> > > > Even at these rates, the CPU did not appear to be very busy. We had
> > > > one at 85% occupied, but that was the one running nuttcp.
> > >
> > > On the outgoing path, the Linux kernel usually accounts ESP encryption
> > > under the process that sends traffic using a socket send() call. So
> > > these 85% probably include AES-GCM.
> > >
> > > On the receiving or forwarding path, you'll have to look at the software
> > > interrupt usage (si in top).
> > >
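For anyone following along, per-CPU softirq usage can be watched with
something like the following (sysstat's mpstat assumed to be installed;
the NET_RX/NET_TX rows of /proc/softirqs are the interesting ones):
# per-CPU utilization breakdown including %soft, refreshed every second
mpstat -P ALL 1
# raw softirq counters per CPU, highlighting changes between refreshes
watch -d -n1 cat /proc/softirqs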
> > > > We have seen these boxes pass almost 20 Gbps with single-digit
> > > > utilization, so they have plenty of horsepower.
> > >
> > > That does not have to mean much. It's all about encryption, and that is
> > > rather expensive. If you have specialized hardware, this most likely
> > > means it is good at shuffling data over the network, but it might be
> > > underpowered when it has to do encryption in software.
> > >
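As a rough sanity check of raw software crypto throughput on a box like
this, OpenSSL's built-in benchmark can be used (openssl assumed to be
installed; results are per core):
# single-core AES-GCM throughput; AES-NI is used automatically if present
openssl speed -evp aes-128-gcm
# confirm the CPU exposes the AES flag at all
grep -m1 -o aes /proc/cpuinfo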
> > > > We are also running haveged on them to prevent entropy starvation
> > > > for the encryption.
> > >
> > > Only the key exchange needs entropy, raw AES-GCM does not.
> > >
> > > Regards
> > > Martin
> > >
> >
> > Hello, all. Still battling this problem. The system is a SuperMicro
> > server, not a specialized device. It looks like the problem may be
> > software interrupts, but on the sending side, so I am very curious about
> > the recommendation to check the receiving side. When sending only 100
> > Mbps or so, I see a single CPU pegged at 100% for si. I wondered if it
> > might be the number of ACK packets being returned from the other side:
> > we have a large window size, so we get a flood of ACK packets in reply,
> > but that really doesn't seem to make sense. It would, however, explain
> > why we see this more in TCP than UDP . . . or so we thought. I then sent
> > 200 Mbps of UDP traffic, so there was virtually no reply traffic, and
> > the sender was still at 100% si. What might be generating such a huge
> > number of software interrupts, and how can we reduce them or spread them
> > across multiple processors?
> >
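For the record, one way to spread the network softirq load is receive
packet steering (RPS) plus explicit IRQ affinity. A minimal sketch,
assuming the interface is eth5 and CPUs 0-3 should take the work (the
mask, queue index, and device name are only illustrative):
# steer receive softirq processing for eth5's first RX queue onto CPUs 0-3
echo f > /sys/class/net/eth5/queues/rx-0/rps_cpus
# find the NIC's hardware IRQ numbers
grep eth5 /proc/interrupts
# then pin each IRQ number N reported above, e.g.:
# echo f > /proc/irq/N/smp_affinity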
> <snip>
> We appear to be chasing a compound problem, perhaps also involving
> problems with GRE. As we try to isolate components, one issue we see is
> TCP window size. Even though net.core.rmem_max/wmem_max and the
> tcp_rmem/tcp_wmem maxima are all over 16M, we are not achieving a TCP
> window much larger than 4M once IPSec enters the mix. This is the case
> with IPSec alone, and also when we add a GRE tunnel (to make it a little
> easier to do a packet trace): with GRE only, the TCP window grows to the
> full 16M (but we then have a problem with packet drops), while with
> GRE/IPSec the packet drops magically go away (perhaps due to the lower
> throughput) but the TCP window stays stuck at that 4M level.
>
> What would cause this, and how do we get the full-sized TCP window
> inside an IPSec transport stream? Thanks - John
>
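For what it's worth, those numbers are roughly consistent with a
window-limited transfer: throughput is bounded by window / RTT, so with an
RTT on the order of 80 ms (purely illustrative; the actual RTT is not
quoted here), a 4 MB window caps out around 4 MB * 8 / 0.08 s, roughly 400
Mbps, while a full 16 MB window would allow roughly 1.6 Gbps. The window a
connection is actually using can be inspected mid-transfer with something
like:
# sender-side view of window/cwnd/rtt for connections to the receiver
# (<receiver-ip> is a placeholder)
ss -tmi dst <receiver-ip>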
> <snip>
I suppose this might imply that the receiving station cannot drain the queue
faster than 421 Mbps, but I do not see the bottleneck. There are no drops in
the NIC ring buffers:
root at lcppeppr-labc02:~# ethtool -S eth5 | grep drop
rx_dropped: 0
tx_dropped: 0
rx_fcoe_dropped: 0
root at lcppeppr-labc02:~# ethtool -S eth7 | grep drop
rx_dropped: 0
tx_dropped: 0
rx_fcoe_dropped: 0
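The ring sizes themselves can also be checked against their hardware
maximums; a sketch for eth5 (the -G values below are only examples):
# show current vs. pre-set maximum RX/TX ring sizes
ethtool -g eth5
# raise them toward the maximums reported above if they are small, e.g.:
# ethtool -G eth5 rx 4096 tx 4096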
There are no drops at the IP level:
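For reference, IP- and TCP-level drop/prune counters can be pulled with
something like:
# protocol counters; non-zero prune/collapse values indicate rmem pressure
netstat -s | egrep -i 'drop|prune|collapse'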
Plenty of receive buffer space:
net.core.netdev_max_backlog = 10000
net.core.rmem_max = 16782080
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 8960 89600 16782080
net.ipv4.tcp_wmem = 4096 65536 16777216
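Of these, the third field of tcp_rmem is what TCP receive autotuning uses
as its ceiling; net.core.rmem_max only caps explicitly requested SO_RCVBUF
sizes. The values can be applied and re-checked at runtime with, e.g.:
# set and verify the receive autotuning limit without a reboot
sysctl -w net.ipv4.tcp_rmem="8960 89600 16782080"
sysctl net.ipv4.tcp_rmem net.core.rmem_max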
The CPUs are not overloaded, nor are software interrupts excessive:
top - 11:27:02 up 16:58, 1 user, load average: 1.00, 0.56, 0.29
Tasks: 189 total, 3 running, 186 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.1 us, 3.4 sy, 0.0 ni, 96.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 2.0 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 4.1 sy, 0.0 ni, 95.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.1 us, 15.7 sy, 0.0 ni, 84.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 0.3 us, 12.6 sy, 0.0 ni, 87.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.0 us, 9.3 sy, 0.0 ni, 80.9 id, 0.0 wa, 0.0 hi, 9.9 si, 0.0 st
%Cpu6 : 0.0 us, 2.5 sy, 0.0 ni, 97.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 0.0 us, 2.6 sy, 0.0 ni, 97.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8 : 0.0 us, 29.2 sy, 0.0 ni, 70.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 0.0 us, 19.8 sy, 0.0 ni, 75.9 id, 0.0 wa, 0.1 hi, 4.2 si, 0.0 st
%Cpu10 : 0.0 us, 3.1 sy, 0.0 ni, 96.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 0.0 us, 3.7 sy, 0.0 ni, 96.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32985376 total, 404108 used, 32581268 free, 33604 buffers
KiB Swap: 7836604 total, 0 used, 7836604 free, 83868 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7545 root 20 0 6280 1492 1288 S 30 0.0 0:14.05 nuttcp
49 root 20 0 0 0 0 S 24 0.0 0:51.33 kworker/8:0
7523 root 20 0 0 0 0 S 22 0.0 0:37.08 kworker/9:0
7526 root 20 0 0 0 0 S 6 0.0 0:10.22 kworker/5:2
1441 root 20 0 0 0 0 S 4 0.0 0:13.14 kworker/11:2
7527 root 20 0 0 0 0 S 4 0.0 0:07.07 kworker/2:1
7458 root 20 0 0 0 0 S 4 0.0 0:06.89 kworker/8:2
33 root 20 0 0 0 0 S 4 0.0 1:12.46 ksoftirqd/5
7528 root 20 0 0 0 0 S 4 0.0 0:06.40 kworker/10:2
7524 root 20 0 0 0 0 S 3 0.0 0:05.57 kworker/1:2
7525 root 20 0 0 0 0 R 3 0.0 0:05.61 kworker/0:0
6131 root 20 0 0 0 0 S 3 0.0 0:40.66 kworker/7:2
7531 root 20 0 0 0 0 S 3 0.0 0:06.51 kworker/8:1
7519 root 20 0 0 0 0 R 3 0.0 0:01.50 kworker/6:2
89 root 20 0 0 0 0 S 3 0.0 0:16.06 kworker/3:1
1972 root 20 0 0 0 0 S 2 0.0 0:22.64 kworker/4:2
6828 root 20 0 0 0 0 S 2 0.0 0:03.84 kworker/3:2
6047 root 20 0 0 0 0 S 2 0.0 0:23.63 kworker/9:1
7123 root 20 0 0 0 0 S 2 0.0 0:03.58 kworker/9:2
7300 root 20 0 0 0 0 S 2 0.0 0:03.18 kworker/4:0
4632 root 0 -20 15900 4828 1492 S 0 0.0 2:33.04 conntrackd
7337 root 20 0 0 0 0 S 0 0.0 0:00.14 kworker/10:0
7529 root 20 0 0 0 0 S 0 0.0 0:05.59 kworker/11:0
91 root 20 0 0 0 0 S 0 0.0 0:39.70 kworker/5:1
4520 root 20 0 4740 1780 1088 S 0 0.0 0:14.17 haveged
7221 root 20 0 0 0 0 S 0 0.0 0:00.52 kworker/5:0
7112 root 20 0 0 0 0 S 0 0.0 0:00.09 kworker/11:1
7543 root 20 0 0 0 0 S 0 0.0 0:00.05 kworker/u24:2
1 root 20 0 10660 1704 1564 S 0 0.0 0:02.38 init
2 root 20 0 0 0 0 S 0 0.0 0:00.01 kthreadd
3 root 20 0 0 0 0 S 0 0.0 0:00.46 ksoftirqd/0
5 root 0 -20 0 0 0 S 0 0.0 0:00.00 kworker/0:0H
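One more place drops can hide is the IPsec layer itself; for completeness,
per-SA and global XFRM error counters can be checked with something like
the following (the xfrm_stat file is only present when the kernel is built
with CONFIG_XFRM_STATISTICS):
# per-SA statistics, including replay-window and error counters
ip -s xfrm state
# global XFRM error counters, if compiled in
cat /proc/net/xfrm_stat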
Where would the bottleneck be? Thanks - John