[strongSwan] Intermittent MTU issue
noel.kuntze+strongswan-users-ml at thermi.consulting
Wed May 30 14:09:52 CEST 2018
You can look at traffic in the tunnel by using the NFLOG target in iptables. Read the CorrectTrafficDump page on the wiki.
On 29.05.2018 18:05, Arzhel Younsi wrote:
> I started to troubleshoot intermittent but large spikes of ICMP "packet too big" messages on our servers running IPsec in transport mode with StrongSwan.
> We're tracking that issue "internally" on https://phabricator.wikimedia.org/T195365 with many digressions and real data, but here is a summarized version:
> hostA and hostB have IPsec configured such that all traffic between the two hosts is encrypted. Traffic is relatively steady.
> At (so far) random times, a packet capture on hostA's loopback shows large spikes of ICMP "packet too big" from and to hostA's interface IP.
> The payload (detailed in the phabricator task) says: hostA tried to send a 1516-byte packet to hostB while hostA's interface MTU is 1500.
> During that spike of ICMP, running:
> "ip -s route get hostB" on hostA shows "mtu 1500".
> This mtu mention is absent during "quiet time" (default value?).
> The ICMP spike stops before the end of the "cache" countdown. But if the ICMP spike happens again, the "cache" countdown gets re-initialized.
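To find out when the cached entry appears, the output of "ip -s route get hostB" could be polled and parsed. The following is only a minimal Python sketch: the function name is hypothetical, and the sample strings are illustrative guesses at the output format, not captures from the affected hosts.

```python
import re

def parse_route_cache(output: str) -> dict:
    """Pull the cached PMTU fields out of `ip -s route get` output.

    Assumption: the cache line looks roughly like
    'cache expires 412sec mtu 1500' or 'cache mtu lock 1400'.
    """
    mtu = re.search(r"\bmtu\s+(lock\s+)?(\d+)", output)
    expires = re.search(r"\bexpires\s+(\d+)sec", output)
    return {
        "mtu": int(mtu.group(2)) if mtu else None,        # None during "quiet time"
        "locked": bool(mtu and mtu.group(1)),             # True for "mtu lock N"
        "expires_sec": int(expires.group(1)) if expires else None,
    }

# Illustrative sample text only (documentation addresses, made-up timers).
quiet = "192.0.2.2 dev eth0 src 192.0.2.1 uid 0\n    cache"
spike = "192.0.2.2 dev eth0 src 192.0.2.1 uid 0\n    cache expires 412sec mtu 1500"

print(parse_route_cache(quiet))
print(parse_route_cache(spike))
```

Running this in a loop and logging a timestamp whenever "mtu" changes from None to a value would give a time to correlate with the packet captures.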
> Locking the MTU with:
> "ip route add hostB via xxx mtu lock 1400" seems to fix the issue.
> Our current guess is something along the lines of:
> 1/ An unknown event (e.g. congestion) triggers MTU probing from the kernel (we have tcp_mtu_probing set to 1)
> (As it's all in ipsec, we can't inspect the traffic and see what and how traffic is flowing)
> 2/ The kernel sets a temporary PMTU value based on the interface (and maybe hostB)
> without taking the ESP overhead into consideration
> 3/ Traffic is sent using that MTU of 1500, but cannot get past the interface after being encrypted because it is too big.
> But as this is still quite speculative, and for Occam's razor's sake, I'd expect a misconfiguration on our side rather than a bug in the kernel/StrongSwan :)
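The arithmetic behind step 3 can be sketched as follows. This is a rough Python model of ESP transport-mode overhead, not a statement about this setup: the thread does not say which cipher suite is in use, so the defaults (8-byte IV, 16-byte ICV, 4-byte alignment, as for AES-GCM with a 16-byte tag) and the 20-byte IPv4 header are assumptions. Pass ip_hdr=40 for IPv6.

```python
def esp_transport_packet_size(inner_len, ip_hdr=20, iv_len=8, icv_len=16, block=4):
    """On-wire size of a packet after ESP transport-mode encapsulation.

    All defaults are assumptions; adjust them to the negotiated proposal.
    """
    payload = inner_len - ip_hdr            # the part ESP encrypts
    trailer = 2                             # pad-length byte + next-header byte
    pad = (-(payload + trailer)) % block    # align payload + trailer to block size
    esp_hdr = 8                             # SPI (4) + sequence number (4)
    return ip_hdr + esp_hdr + iv_len + payload + pad + trailer + icv_len

def max_inner_packet(link_mtu, **params):
    """Largest pre-encryption packet that still fits in link_mtu."""
    size = link_mtu
    floor = params.get("ip_hdr", 20)
    while size > floor and esp_transport_packet_size(size, **params) > link_mtu:
        size -= 1
    return size

# A packet built against a PMTU of 1500 no longer fits once ESP is added.
print(esp_transport_packet_size(1500))
print(max_inner_packet(1500))
```

Under these assumed parameters, anything sized to a 1500-byte PMTU exceeds the interface MTU after encryption, which is the failure mode the guess describes.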
> How to figure out what creates that cache entry?
> Is our guess plausible?
> How to troubleshoot it more?
> Any help welcome.
> As we have many-to-many IPsec links, I would rather avoid deploying the mtu lock everywhere. It also doesn't help in understanding and nailing down the root cause of the issue.