[strongSwan] Measuring Strongswan key metrics for production environments

Wed Nov 2 19:34:19 CET 2011

Greetings,
  One of the necessary duties when running a production environment
operating hundreds or thousands of tunnels is to have visibility into the
state of the machines, tunnels, etc.  Naturally, this data often has to be
aggregated and interesting slices carved out of the aggregate data for
analysis.  We find that instrumentation is vital to visibility and hence a
successful operation.  Think of a pilot flying in inclement weather.
Without optical visibility, he must rely on the instrumentation.  Along
these lines, I would like to solicit anecdotal accounts of how others are
tackling this, or how you are not and why it is unnecessary.  Some
vendor-based IPsec solutions give access to diagnostic counters.  These are
great for capturing in aggregate and visualizing.  One can quickly spot
trends and stress points.  After going through some of the strongSwan wiki,
I am unable to find information on how to best tackle this.  I have some
questions around how to improve our visibility.

Under Linux we have access to robust snmpd functionality that binds scripts
to SNMP OIDs, such that polling returns the result produced by the script.
One could also write an agent that grabs diagnostic data and sends it
directly to something like Graphite.
This gets me part of the way there.  However, I am unable to access all of
the diagnostic data of interest.  I understand that parts of this can be
done with strongSwan commands and some may have to be extracted from the
kernel.  I am hoping the experts can provide some suggestions.  I have
below a sample list of the types of things I am interested in.  I'm not
opposed to cobbling things together, but I suspect that many of these
things are already referenced within the code somewhere, just not
instrumented to keep tally or expose those counters.  Of course, it would
be great to not have to cobble and have a nice RPC/XML API, SNMP or other
query system, but I also understand that this is not the core of the
development effort or intention.  This list is off the top of my head,
but hopefully good enough for illustrating the general direction.

Dropped packets (both cumulative and individual counters for anti-replay
drops and other drops).
Phase 1 packets sent/received
Phase 2 packets sent/received
Phase 1 proposals failure/success counter
Phase 2 proposals failure/success counter
Packets sent over tunnel (encrypted), ESPOutBytes, ESPInBytes, etc
Packets received over tunnel (decrypted)
ESP Deletes sent/received
CREATE_CHILD_SA requests sent/received
DPD sent
DPD success
DPD failure
New tunnels from DPD
Failure to establish tunnel
Failure due to unmatched endpoint in config
Failure in authentication, counters for the various methods (number of
invalid certs, etc)
Failures around CRLs
Failure due to unmatched routes
Per-peer relevant counters
Renegotiation success/failures
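
To make the agent idea above concrete, here is a minimal sketch of the
kernel-side half only.  It assumes a kernel built with
CONFIG_XFRM_STATISTICS, which exposes counters such as XfrmInStateSeqError
(sequence number outside the replay window) in /proc/net/xfrm_stat, and it
assumes a Graphite/Carbon plaintext listener is reachable.  The function
names, the metric prefix and the host name are hypothetical, and none of
this covers the IKE-level counters in the list.

```python
import socket


def parse_xfrm_stat(text):
    """Parse 'Name value' lines (the /proc/net/xfrm_stat layout) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name integer" pairs; skip anything else.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def to_graphite_lines(counters, prefix, timestamp):
    """Render counters in Graphite's plaintext format: 'path value timestamp'."""
    return [
        f"{prefix}.{name} {value} {int(timestamp)}"
        for name, value in sorted(counters.items())
    ]


def send_to_graphite(lines, host, port=2003):
    """Send metric lines to a Carbon plaintext listener (2003 is its usual port)."""
    payload = ("\n".join(lines) + "\n").encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

A cron job or small daemon could read /proc/net/xfrm_stat, pass the text to
parse_xfrm_stat(), and hand the output of to_graphite_lines(counters,
"vpn.gw1.xfrm", time.time()) to send_to_graphite() with the name of the
Graphite host.  The same parser could back a script bound into snmpd
instead.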

Thank you,
Robin.