[strongSwan] Best practices regarding monitoring

Fri Jun 9 11:46:42 CEST 2017

Hi,

we're running various Ubuntu systems with strongSwan 5.1 or 5.3. Each
system connects to exactly one IPsec/IKE peer. We usually don't know
what kind of peer that is -- is it also running strongSwan, is it a
hardware firewall, does it run OpenBSD, ... ? No idea. No way of
retrieving log files. They're all black boxes to us. Okay.

Now, the big question is: How do we monitor IPsec connectivity?

It's easy to check if there are IKE SAs. It's also not a big deal to
check if there are CHILD SAs. We can do that. However, checking that is
not enough.
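
For reference, the kind of check we already have can be sketched roughly
like this in Python. It only counts ESTABLISHED IKE SAs and INSTALLED
CHILD SAs per connection in the text printed by "ipsec statusall". The
regular expressions are modelled on the sample output in this message
and the helper name count_sas is made up for illustration; other
strongSwan versions may format these lines differently.

```python
import re

# Patterns modelled on the "ipsec statusall" sample in this message
# (assumption: other versions may print these lines differently).
IKE_SA = re.compile(r"^\s*(\S+)\[\d+\]:\s+ESTABLISHED\b", re.MULTILINE)
CHILD_SA = re.compile(r"^\s*(\S+)\{\d+\}:\s+INSTALLED\b", re.MULTILINE)


def count_sas(status_text):
    """Return {connection: (ike_count, child_count)} (hypothetical helper)."""
    counts = {}
    for name in IKE_SA.findall(status_text):
        ike, child = counts.get(name, (0, 0))
        counts[name] = (ike + 1, child)
    for name in CHILD_SA.findall(status_text):
        ike, child = counts.get(name, (0, 0))
        counts[name] = (ike, child + 1)
    return counts


# Two lines in the style of the status output, shortened for the example.
sample = """\
      peer_1[79]: ESTABLISHED 82 minutes ago, ...
      peer_1{1}:  INSTALLED, TUNNEL, ESP SPIs: c112233_i c445566_o
"""
print(count_sas(sample))
```

A connection with one IKE SA and one CHILD SA passes this check even if
the peer silently drops our ESP packets, which is exactly the problem.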

Let me give you an example.

Here's some output of "ipsec statusall":

    Status of IKE charon daemon (strongSwan 5.1.2, Linux 3.13.0-67-generic, x86_64):
      uptime: 5 days, since Jun 02 11:51:14 2017
      malloc: sbrk 1511424, mmap 0, used 343856, free 1167568
      worker threads: 11 of 16 idle, 5/0/0/0 working, job queue: 0/0/0/0, scheduled: 84
      loaded plugins: charon test-vectors aes rc2 sha1 sha2 md4 md5 rdrand random nonce x509 revocation constraints pkcs1 pkcs7 pkcs8 pkcs12 pem openssl xcbc cmac hmac ctr ccm gcm attr kernel-netlink resolve socket-default stroke updown eap-identity addrblock
    Listening IP addresses:
      10.1.2.3
      $public_IP
    Connections:
          peer_1:  $public_IP...$peer_IP  IKEv2
          peer_1:   local:  [$public_IP] uses pre-shared key authentication
          peer_1:   remote: [$peer_IP] uses pre-shared key authentication
          peer_1:   child:  192.168.23.24/32 === 192.168.100.200/32 TUNNEL
    Routed Connections:
          peer_1{1}:  ROUTED, TUNNEL
          peer_1{1}:   192.168.23.24/32 === 192.168.100.200/32
    Security Associations (1 up, 0 connecting):
          peer_1[79]: ESTABLISHED 82 minutes ago, $public_IP[$public_IP]...$peer_IP[$peer_IP]
          peer_1[79]: IKEv2 SPIs: 1234567890_i abcdefghi_r*, rekeying disabled
          peer_1[79]: IKE proposal: AES_CBC_256/HMAC_SHA2_256_128/PRF_HMAC_SHA2_256/MODP_8192
          peer_1{1}:  INSTALLED, TUNNEL, ESP SPIs: c112233_i c445566_o
          peer_1{1}:  AES_CBC_256/HMAC_SHA2_256_128, 49208 bytes_i (239 pkts, 1145s ago), 59836 bytes_o (491 pkts, 14s ago), rekeying disabled
          peer_1{1}:   192.168.23.24/32 === 192.168.100.200/32

Looks fine, doesn't it? Except 192.168.100.200 does not respond.
tcpdump shows that we properly encrypt our traffic using those exact
SPIs and everything. On our end, everything looks fine. But our peer
simply ignores our encrypted traffic. It's as if our peer has
"forgotten" about those SPIs.

If you look closely, you can see that there's outgoing traffic, but no
incoming traffic:

    peer_1{1}: ... 49208 bytes_i (239 pkts, 1145s ago), 59836 bytes_o (491 pkts, 14s ago)
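
One could try to turn that observation into a heuristic. The following
Python sketch parses the counter line and flags a CHILD SA that has sent
traffic recently but has not received anything for a while. The function
name looks_one_way and both thresholds are arbitrary assumptions, the
pattern is modelled on the single line above, and treating a missing
"(N pkts, Ns ago)" part as "never received" is a guess as well.

```python
import re

# Pattern modelled on the counter line shown above (assumption: the
# "(N pkts, Ns ago)" part may be absent if no packet was ever seen).
COUNTERS = re.compile(
    r"(?P<bytes_i>\d+) bytes_i"
    r"(?: \((?P<pkts_i>\d+) pkts?, (?P<age_i>\d+)s ago\))?, "
    r"(?P<bytes_o>\d+) bytes_o"
    r"(?: \((?P<pkts_o>\d+) pkts?, (?P<age_o>\d+)s ago\))?"
)


def looks_one_way(line, max_out_age=60, min_in_age=300):
    """True if we sent recently but received nothing for a while.

    Hypothetical helper; both thresholds are in seconds and arbitrary.
    """
    m = COUNTERS.search(line)
    if m is None:
        return False  # no counters on this line, nothing to judge
    age_o = m.group("age_o")
    age_i = m.group("age_i")
    if age_o is None or int(age_o) > max_out_age:
        return False  # we are not sending, so silence proves nothing
    return age_i is None or int(age_i) >= min_in_age


stale = ("peer_1{1}:  AES_CBC_256/HMAC_SHA2_256_128, "
         "49208 bytes_i (239 pkts, 1145s ago), "
         "59836 bytes_o (491 pkts, 14s ago), rekeying disabled")
print(looks_one_way(stale))
```

This is only a hint, not proof: a peer that legitimately has nothing to
send back would trigger it too, so it is still indirect monitoring.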

Reinitiating the entire connection (essentially, doing "service
strongswan restart") fixes the problem and we can immediately reach
192.168.100.200.

(Yes, in this specific case, it might be worth trying to re-enable
rekeying on our end. Still, my question is not about fixing the problem
at hand. :-))

What do you guys do in such situations? What are best practices for
monitoring? How do you detect "dead" CHILD SAs? Is that even possible?

There's the obvious idea: Try to ping a system "behind" the VPN. In the
example above, we could issue pings to 192.168.100.200 and, if that
system does not respond, consider the IPsec connection to be "down". We
would like to avoid that, though. Ideally, we could find a way to
directly check whether all CHILD SAs are "healthy". Pinging
192.168.100.200 would be "indirect" monitoring: It's a different system
and *that* system could be down, not the IPsec connection.

In other words, is there something like DPD in IKEv2, but operating at
the level of CHILD SAs?

Thank you very much in advance!
Peter