[strongSwan] Overriding DF on XFRM interfaces

Mon Jan 17 17:35:25 CET 2022

Hi John,

Let's assume that your network deployment (for the IPsec tunnel) is as below:

Note: The values shown are the MTUs of the connected interfaces

[appliance1]1500mtu-----1500mtu[openwrt-router1]1500mtu-------[internet]-----1500mtu[openwrt-router2]1500mtu-----1500mtu[appliance2]1500mtu

In this case the xfrm-based IPsec tunnel is between router1 and router2.

Below are some points to consider in order to understand what's happening
and why the iptables mangle rule with the TCPMSS target is used for
"MSS clamping" in each direction of the TCP connection

1. You mentioned that the TCP traffic between the appliances flowing via
the IPsec tunnel uses large packet sizes.

a) This would mean that the MSS advertised by each of appliance1 and
appliance2 would be 1460 (1500 - 40 bytes = 1460 bytes)

- Note: The 40 bytes are the 20-byte TCP header plus the 20-byte IP header
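
The arithmetic in point 1 can be written down as a small shell helper. This
is only a sketch: the function name mss_from_mtu is made up for
illustration, and it assumes IPv4 and TCP headers without options (20 bytes
each), as in the note above.

```shell
# mss_from_mtu MTU: largest TCP payload that fits into one IPv4 packet
# of the given MTU, assuming 20-byte IP and 20-byte TCP headers.
mss_from_mtu() {
    echo $(( $1 - 20 - 20 ))
}

mss_from_mtu 1500   # prints 1460
```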

2. Another point to note is that hosts/gateways/routers normally have
PMTUD (path MTU discovery) enabled by default. This means that TCP/UDP
traffic originated by each of them will typically have the DF bit set

- To disable PMTUD on a Linux system, the setting
"/proc/sys/net/ipv4/ip_no_pmtu_disc" has to be set to 1:
"echo 1 > /proc/sys/net/ipv4/ip_no_pmtu_disc"
- this ensures that locally generated TCP/UDP packets are NOT sent with the
DF bit set
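
Before changing the setting it may help to look at its current value. The
helper below is a rough sketch; the name pmtud_mode and the optional path
argument are only there for illustration (the argument lets you try it on a
copy of the file). It assumes a Linux host.

```shell
# pmtud_mode [FILE]: report the PMTUD setting stored in FILE
# (defaults to the real sysctl file on Linux).
pmtud_mode() {
    file="${1:-/proc/sys/net/ipv4/ip_no_pmtu_disc}"
    value="$(cat "$file")" || return 1
    case "$value" in
        0) echo "enabled" ;;        # PMTUD on, DF set on local traffic
        1) echo "disabled" ;;       # PMTUD off, DF not set
        *) echo "other:$value" ;;   # other modes, see kernel ip-sysctl docs
    esac
}

# Typical use on the host itself (changing the value needs root):
#   pmtud_mode
#   sysctl -w net.ipv4.ip_no_pmtu_disc=1
```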

3. Regarding the IPsec tunnel (using the xfrm interfaces) established on
each router (router1/router2): once the tunnel is established, an MTU is
derived for the outbound IPsec SA. Its value depends on the MTU of the
outbound interface (the WAN interface, which is 1500 in this case) and on
the overhead of the algorithms used (say AES256 for example).

a) I am not sure where exactly the IPsec SA MTU of a tunnel (using a
specific algorithm) can be displayed, but from memory I would say that for
AES256 the IPsec SA MTU is approximately 1422 (1500 - <all the
encryption/ESP overhead applied>) for all outbound ESP packets

b) So if appliance1 were a host following the standard PMTUD behaviour and
sent a TCP/UDP packet (with the DF bit set) of size, say, 1500, that packet
would arrive at Peer-Router1, match the IPsec policy and need to be
forwarded through the tunnel to Peer-Router2. Before encryption it is
checked against the IPsec SA MTU of the tunnel, which would be 1422.

- In this case Peer-Router1 would send an ICMP destination-unreachable
message (type 3/code 4, "fragmentation needed and DF set") carrying the MTU
value of 1422 to appliance1

- If appliance1 followed the standards, that ICMP message would make it
lower its path MTU for this destination and, without renegotiating the
connection, send TCP segments of at most 1422-40 = 1382 bytes

- The same process is expected to happen at the other end, where the
appliance2 TCP host is connected

- This ensures that the TCP connection uses a max packet size of 1382 (TCP
payload) + 40 = 1422, which avoids fragmentation at the IPsec tunnel in the
outbound direction

Note: In the case of UDP, if appliance1 followed the standards, the ICMP
message would result in appliance1 itself fragmenting large datagrams into
fragments of no more than 1422 bytes each. This ensures that there is NO
fragmentation at the IPsec tunnel in the outbound direction
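
To show roughly where a value like 1422 can come from, here is a sketch of
the ESP tunnel-mode overhead arithmetic. The function name esp_inner_mtu
and all defaults are assumptions for illustration: IPv4, AES-CBC (16-byte
IV and block size), a 12-byte ICV (e.g. HMAC-SHA1-96) and NAT-T UDP
encapsulation (8 bytes). With other algorithms, or without NAT-T, the
result differs, so treat the output as an estimate only.

```shell
# esp_inner_mtu LINK_MTU [NATT] [IV] [BLOCK] [ICV]
# Estimate the largest inner packet that fits into one ESP tunnel-mode
# packet of LINK_MTU bytes. Defaults are assumptions, see text above.
esp_inner_mtu() {
    link_mtu=$1
    natt=${2:-8}     # UDP encapsulation header, 0 if NAT-T is not used
    iv=${3:-16}      # cipher IV length
    block=${4:-16}   # cipher block size
    icv=${5:-12}     # integrity check value length
    # outer IPv4 header (20) + ESP header (8: SPI + sequence number)
    room=$(( link_mtu - 20 - natt - 8 - iv - icv ))
    # the encrypted part must be a whole number of cipher blocks
    room=$(( room / block * block ))
    # pad-length and next-header bytes are part of the encrypted data
    echo $(( room - 2 ))
}

esp_inner_mtu 1500                      # prints 1422
echo $(( $(esp_inner_mtu 1500) - 40 ))  # prints 1382, the resulting MSS
esp_inner_mtu 1500 0                    # prints 1438 (no NAT-T)
```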

4. In your case both appliances are misbehaving: they do not follow the
standards, ignore the PMTU ICMP messages AND of course send traffic with
the DF bit set. So:

a) you have correctly applied one of the solutions to avoid fragmentation
for TCP connections: MSS clamping in both directions, applied to the SYN
packets during the TCP handshake

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o xfrm0 \
    -j TCPMSS --set-mss 1240
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -i xfrm0 \
    -j TCPMSS --set-mss 1240

b) I believe you have applied the above only on router1. You could also
apply the same on router2, if you can.

c) What the MSS clamping does is rewrite the MSS value in the outgoing TCP
SYN packet from appliance1 to appliance2 to 1240. This informs appliance2
that appliance1 can ONLY process TCP segments of at most 1240 bytes, so
appliance2 should only ever send TCP packets of at most 1240+40=1280 bytes
- the same happens in the other direction, so appliance1 should likewise
only send TCP packets with a max size of 1280 (1240+40)

Note: MSS clamping is often applied in POSTROUTING instead, but if the
above works in FORWARD, do continue with it. Only the outbound rule can be
moved as-is, because iptables does not accept the -i match in POSTROUTING;
the inbound (-i xfrm0) rule would have to stay in FORWARD

#iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -o xfrm0 \
#    -j TCPMSS --set-mss 1240

d) Some things you need to check after applying the above MSS clamping:

- Capture the TCP session packets flowing between appliance1 and the LAN
interface of Peer-Router1, and
- check whether the MSS value in the SYN packets is being set to 1240, and
whether both appliances honour it or keep sending segments sized for an MSS
of 1460, ignoring the MSS clamping

Note: With this clamping value, an MSS of 1240 means the IP/TCP packets
generated by appliance1/appliance2 will be 1240+40=1280 bytes in size. That
fits within both the 1500-byte WAN interface MTU and the IPsec SA MTU (if
it is set/used at all), which would be about 1422 if AES256 is used. This
should ideally result in NO fragmentation at all

e) BUT if you are saying that in spite of the MSS clamping the appliances
continue to send TCP packets sized for an MSS of 1460 AND with the DF bit
set, then there will be fragmentation, at least post-IPsec fragmentation of
the ESP packets in the outbound direction on each of the IPsec peer routers
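
For the check in 4 d), the MSS option can be pulled out of tcpdump's
one-line summary of a SYN packet. A minimal sketch: the helper name syn_mss
is made up, the sample line is invented for the example (it is not from
John's capture), and br-lan is only an assumed name for the LAN interface.

```shell
# syn_mss: read tcpdump summary lines on stdin and print the value of
# the TCP MSS option for every line that carries one.
syn_mss() {
    sed -n 's/.*mss \([0-9][0-9]*\).*/\1/p'
}

# Made-up sample line in tcpdump's usual output format:
echo '09:00:00.000000 IP 10.1.34.10.5060 > 10.1.61.20.25578: Flags [S], seq 1, win 64240, options [mss 1460,sackOK,nop,wscale 7], length 0' | syn_mss
# prints 1460

# Live use on the router (interface name is an assumption):
#   tcpdump -l -n -i br-lan 'tcp[tcpflags] & tcp-syn != 0' | syn_mss
```

If the clamping works, the SYN packets seen after the router (towards the
tunnel) should show 1240 here, while the ones arriving from the appliance
will still show its original value.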

5. Another question to consider is: what about the UDP traffic generated by
each of the appliances? Are they generating large, non-fragmented packets
of 1500 bytes each with the DF bit always set?
- in this case no clamping can be done; the only remaining option is the
final alternative that applies to both TCP and UDP traffic: clearing the DF
bit in all of the TCP/UDP packets generated by the appliances

a) This can most probably be done by disabling PMTU discovery on both
appliances as below:

"echo 1 > /proc/sys/net/ipv4/ip_no_pmtu_disc"
- this ensures that the TCP/UDP packets are NOT sent with the DF bit set

b) The above should be possible on Linux systems, if that's what the
appliances are using.
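
To answer the question in point 5 from a capture, the DF bit is 0x4000 in
the 16-bit flags/fragment-offset field of the IPv4 header (bytes 6-7), i.e.
0x40 in byte 6. The helper below is just an illustration with a made-up
name; the tcpdump filter in the comment uses the same bit, and eth0 is an
assumed interface name.

```shell
# df_is_set FIELD: succeed if the DF bit is set in the given 16-bit
# IPv4 flags/fragment-offset value (for example 0x4000).
df_is_set() {
    [ $(( $1 & 0x4000 )) -ne 0 ]
}

df_is_set 0x4000 && echo "DF set"       # prints "DF set"
df_is_set 0x2000 || echo "DF not set"   # prints "DF not set" (MF only)

# Show only UDP packets that have DF set (interface name is an assumption):
#   tcpdump -n -i eth0 -v 'udp and (ip[6] & 0x40 != 0)'
```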

6. Now to another important point you asked about: how to clear the DF bit
of the inbound plain TCP/UDP traffic before it is encrypted into the IPsec
tunnel

a) As far as I know you cannot do that on Peer-Router1, but you should be
able to on the 2 appliances as mentioned in point 5 above

b) But FYI, as per the IPsec RFCs (RFC 4301, section 8.1), every
implementation must be able to handle the DF bit in one of the ways below
during ESP tunnel-mode encapsulation

i) copy the DF bit (if set) from the inner IP header (of the plain TCP/UDP
packet) to the outer IP header of the ESP packet generated by the
router/gateway

ii) if the DF bit is set in the inner IP header of the plain TCP/UDP
packet, clear it in the outer IP header, provided the local IPsec engine on
the router/gateway implements this

iii) if the DF bit is NOT set in the inner IP header of the plain TCP/UDP
packet, set it in the outer IP header of the ESP packet

- Generally it is the "copy DF bit from inner IP header to outer IP header
of the ESP packet" option that is always implemented, as it is a MUST (as
per the RFC requirements)

- BUT none of these options clears the DF bit in the inner header of the
plain TCP/UDP packets coming from the appliances before encryption.

c) Since you are using XFRM interfaces with strongSwan IKEv2, and
specifically swanctl.conf, you could try the setting below to clear the DF
bit in the outer IP header of the ESP packets:

connections.<conn>.children.<child>.copy_df (since 5.7.0), default: yes

- Whether to copy the DF bit to the outer IPv4 header in tunnel mode.

Set this as:

connections.<conn>.children.<child>.copy_df = no
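
In swanctl.conf's nested syntax that dotted key would look roughly like the
fragment below. <conn> and <child> are placeholders for your own connection
and child names, and all your other existing settings are left out here.

```
connections {
    <conn> {
        children {
            <child> {
                # do not copy the inner DF bit to the outer IPv4 header
                copy_df = no
            }
        }
    }
}
```

After editing, the configuration has to be reloaded (for example with
"swanctl --load-conns"). I would expect the change to take effect only for
CHILD_SAs installed after the reload, so the tunnel probably needs to be
re-established.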

hope the above info helps somewhat

thanks & regards
Rajiv

On Wed, Dec 15, 2021 at 7:35 AM Noel Kuntze
<noel.kuntze+strongswan-users-ml at thermi.consulting> wrote:

> Hello John,
>
> I am not aware of if the kernel tracks the assigned TCP MSS of the
> connections it knows of.
> Conntrack does not have that information. So it's a good question why
> exactly that happens.
>
> Can you double check if there is not maybe something like a local proxy
> running that could
> be the cause of that? Also, what is the currently set MTU on the interface?
> Does it coincide with the MSS (taking the TCP overhead into account)?
>
> I agree that it is likely extremely fragile. A good way would be a
> userspace proxy, like squid.
> Squid knows about conntrack, so can transparently proxy connections, even
> without tproxy (speaking from memories).
>
> Kind regards
> Noel
>
>
> Am 03.12.21 um 15:35 schrieb John Marrett:
> > I am working on a VPN solution connecting some appliances on two
> > different networks. I’m using an x86 openwrt router with strongswan
> > 5.9.2 and kernel 5.4.154. The systems I am connecting exhibit
> > non-compliant TCP MSS behaviour. They are, for unknown reasons,
> > ignoring the MSS from their peers and sending oversized packets. They
> > also ignore ICMP unreachable messages indicating path MTU, I have
> > confirmed that the ICMP unreachable messages are not blocked and they
> > have been captured directly on the system sending the problematic
> > traffic. I do not have control over the appliances and need to solve
> > the issues at the network level.
> >
> > I'm using a modern IKEv2 / XFRM based configuration for this VPN. I
> > would like to ignore the DF bit and fragment traffic passing through
> > the VPN tunnel. This fragmentation could occur before or after
> > encapsulation, it's not significant to me.
> >
> > If I was using a GRE tunnel I could use the ignore-df configuration
> > [1], however there doesn't appear to be an equivalent with an xfrm
> > interface.
> >
> > I have managed to "solve" my problem, though I do not understand the
> > solution or how it works. If I create the following iptables rule to
> > adjust the MSS on traffic traversing the xfrm interface:
> >
> > iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -o xfrm0
> > -j TCPMSS --set-mss 1240
> > iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -i xfrm0
> > -j TCPMSS --set-mss 1240
> >
> > Then, in addition to the expected modification of the mss field, my
> > TCP traffic will be fragmented, ignoring the DF bit.
> >
> > Here's an excerpt of traffic in ingress to the router:
> >
> > 09:23:56.103022 IP 10.1.34.10.5060 > 10.1.61.20.25578: Flags [P.], seq
> > 883:1906, ack 1760, win 260, length 1023
> > 09:23:56.119864 IP 10.1.61.20.25578 > 10.1.34.10.5060: Flags [.], ack
> > 1906, win 501, length 0
> > 09:24:01.448960 IP 10.1.34.10.5060 > 10.1.61.20.25578: Flags [P.], seq
> > 1906:3271, ack 1760, win 260, length 1365
> > 09:24:01.467771 IP 10.1.61.20.25578 > 10.1.34.10.5060: Flags [.], ack
> > 3148, win 501, length 0
> > 09:24:01.467810 IP 10.1.61.20.25578 > 10.1.34.10.5060: Flags [.], ack
> > 3271, win 501, length 0
> >
> > And egress on the xfrm interface (In addition to being sent over a VPN
> > connect the traffic is also being NATed by the VPN router):
> >
> > 09:23:56.103150 IP 10.2.30.1.5060 > 10.2.2.6.25578: Flags [P.], seq
> > 881:1902, ack 1750, win 260, length 1021
> > 09:23:56.119828 IP 10.2.2.6.25578 > 10.2.30.1.5060: Flags [.], ack
> > 1902, win 501, length 0
> > 09:24:01.449067 IP 10.2.30.1.5060 > 10.2.2.6.25578: Flags [.], seq
> > 1902:3142, ack 1750, win 260, length 1240
> > 09:24:01.449135 IP 10.2.30.1.5060 > 10.2.2.6.25578: Flags [P.], seq
> > 3142:3265, ack 1750, win 260, length 123
> > 09:24:01.467724 IP 10.2.2.6.25578 > 10.2.30.1.5060: Flags [.], ack
> > 3142, win 501, length 0
> > 09:24:01.467725 IP 10.2.2.6.25578 > 10.2.30.1.5060: Flags [.], ack
> > 3265, win 501, length 0
> >
> > The packet with length 1365 has been split into a packet of 1240 bytes
> > and a second of 123.
> >
> > Without these rules I see the expected behaviour, the packets are
> > dropped and ICMP unreachable messages are sent indicating the path
> > MTU.
> >
> > Is anyone able to explain why, in addition to adjusting the MSS, this
> > mangle configuration is allowing fragmentation ignoring the DF bit?
> > While the solution is working as I need it to, I'm concerned that it
> > may be extremely fragile.
> >
> > Is there a better way to solve this problem?
> >
> > Thanks in advance for any help you can offer,
> >
> > -JohnF
> >
> > [1] https://man7.org/linux/man-pages/man8/ip-tunnel.8.html
>