[strongSwan] Best practices regarding monitoring

Wed Jun 14 09:43:15 CEST 2017

Hi,

On Fri, Jun 09, 2017 at 09:11:27PM +0200, Noel Kuntze wrote:
> Besides DPD, there's no standard that charon implements for that. I am
> also not aware of any that uses CHILD_SAs.

Alright, too bad. :-/

So, am I correct to assume that you guys usually evaluate the output of
`ipsec statusall` and maybe `ip xfrm {state,policy}` to implement
monitoring? Do you simply send pings to remote systems "behind" the VPN?

(If there is no DPD that uses CHILD_SAs, there might be nothing else
that you can do.)

> Huh? Check `ip xfrm state` and `ip xfrm policy`, they give you the SAD and SPD.
> Also check if you receive any ESP packets and what their SPIs are.

`ip xfrm state` shows the same SPIs as `ipsec statusall` does. Policies
look fine, too. With tcpdump, I can see outgoing encrypted traffic that
uses the correct SPIs (and we can decrypt that traffic using Wireshark
and the keys shown by `ip xfrm state`). No incoming ESP traffic, though.
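
A check for exactly this symptom (outgoing ESP, nothing coming back)
might look roughly like the sketch below. It assumes that per-direction
packet counters are sampled twice, for example from the "lifetime
current" figures printed by `ip -s xfrm state`; the collection step is
left out, and the function name, dict keys and result labels are made up
for illustration only.

```python
def tunnel_health(prev, cur):
    """Compare two counter samples for one tunnel.

    prev/cur are dicts with hypothetical keys 'out_packets' and
    'in_packets', taken at two points in time.
    """
    sent = cur["out_packets"] - prev["out_packets"]
    received = cur["in_packets"] - prev["in_packets"]
    if sent < 0 or received < 0:
        # Counters went backwards: the SA was most likely replaced (rekey).
        return "rekeyed"
    if sent > 0 and received == 0:
        # We encrypt and send, but nothing comes back from the peer.
        return "one-way"
    if sent == 0 and received == 0:
        return "idle"
    return "ok"
```

A "one-way" result over several consecutive intervals would correspond
to what we see with tcpdump; a single interval probably proves little,
since the remote side may simply have had nothing to send.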

> I think the much more plausible cases are the following:
> 1) Kernel does not send expiration messages to charon when an SA soft or hard expires
> 2) Something in between drops the ESP traffic. Maybe there's a problem with a stateful firewall? iptables rules?

As for #1: How can I check that? I assume that `ip xfrm state` would no
longer show the SAs while `ipsec statusall` still lists them, right?
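
If that assumption holds, the comparison could be automated along these
lines. This is only a sketch: it assumes that `ip xfrm state` prints
lines like "proto esp spi 0x..." and that `ipsec statusall` prints
"SPIs: <hex>_i <hex>_o" for each CHILD_SA, which matches what I see
here but may differ between versions. The function names are my own.

```python
import re

def kernel_spis(xfrm_state_text):
    """ESP SPIs (as integers) found in `ip xfrm state` output."""
    found = re.findall(r"proto esp spi 0x([0-9a-fA-F]+)", xfrm_state_text)
    return {int(spi, 16) for spi in found}

def charon_spis(statusall_text):
    """SPIs (as integers) that `ipsec statusall` reports for CHILD_SAs."""
    spis = set()
    for line in statusall_text.splitlines():
        if "SPIs:" in line:
            # Entries look like c3a5e5f5_i (inbound) and c8f5a0b1_o (outbound).
            found = re.findall(r"\b([0-9a-fA-F]{1,8})_[io]\b", line)
            spis.update(int(spi, 16) for spi in found)
    return spis

def missing_in_kernel(xfrm_state_text, statusall_text):
    """SPIs charon still lists but the kernel SAD no longer contains."""
    return charon_spis(statusall_text) - kernel_spis(xfrm_state_text)
```

Feeding it the captured output of both commands, a non-empty result
from `missing_in_kernel` would point towards case #1.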

As for #2: Totally possible. We always check our firewalls, but traffic
may still be dropped on the remote end.

Don't get me wrong, though. I only posted one example scenario that we
see with one of our IPsec peers. It illustrates nicely that our
strongSwan/charon/kernel setup looks like it's working fine, and yet we
get no response from the remote peer until we run
`service strongswan restart`. I understand that I may not have posted
all the information required to debug this particular issue, simply
because that's not what I'm after. :-)

At the end of the day, we have to work closely with the admins of our
remote peers to fix the individual issues. We're not able to reliably
*detect* them, though. Any suggestions are highly appreciated.

Thanks!
Peter