  Hi all,

  I've set up strongSwan (5.1.3 as packaged for Debian) for a client of
mine on a handful of hosts running VMs.  The setup works nicely, but
under some not-yet-fully-investigated conditions, we experience packet
loss of over 95% (as measured by ping) over the VPN.

  The setup is as follows: vm12 to vm15 (addresses in 10.128.1.0/24) are
running on physical host phys1, which has an internal IP in that same
network.  Ditto for vm22 to vm25 on host phys2, with
s/10.128.1/10.128.2/.  phys1 and phys2 run strongSwan, with
leftsubnet/rightsubnet appropriately set to 10.128.{1,2}.0/24 and
auto=start.
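
  To make that concrete, here is a sketch of the kind of conn section
this amounts to on phys1.  It is illustrative only, not copied from the
hosts: the conn name and the bracketed addresses are placeholders, and
authentication settings are left out.  phys2 would mirror it with the
left and right sides swapped.

```
# /etc/ipsec.conf on phys1 (illustrative sketch, not the real file)
conn phys1-phys2                      # hypothetical conn name
    left=<phys1 external address>     # placeholder
    leftsubnet=10.128.1.0/24          # VMs hosted on phys1
    right=<phys2 external address>    # placeholder
    rightsubnet=10.128.2.0/24         # VMs hosted on phys2
    auto=start                        # bring the SA up at daemon start
```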

  SAs come up nicely, light traffic goes through quite fine, VMs can
talk to each other and to the internal address of the physical hosts,
and so on.

  The problem occurs basically when the production processes are
switched on in the VMs.  These processes initiate a few hundred sockets
between VMs and generate some (reasonable) CPU load.  After some time
(seconds to minutes), many packets get lost somewhere between the
physical hosts.  Running "ping vm25" from vm15 (so the packets should go
vm15→phys1→phys2→vm25 and back, with phys1→phys2 over IPsec), tshark on
the physical hosts shows that the packets get dropped between phys1 and
phys2; initially only the ping replies get lost before reaching phys1,
but after a while even the ping requests get lost and never reach phys2.
I tried generating heavy traffic between phys* (by sending lots of data
from vm15 to vm25 and back, using netcat), but even with 50 MB/s going
across and back, I can't see any packet loss in the running pings
(latency increases a bit, but that's expected).  The SAs stay up (as
displayed by ipsec status), and as soon as the offending processes are
stopped, packet loss ceases too.  Nothing relevant seems to show up in
the strongSwan logs.

  So… does anyone have an idea about how to debug that?


