[strongSwan-dev] issues with rekeying, child updown listener and ike reauth.

Thu Jul 31 22:20:36 CEST 2014

Hi,

I have been testing with a Juniper SRX talking to a strongswan gateway and
have noticed several problems when I have short rekey settings on the
juniper. I have tried to debug a few and put in some "fixes", but it just
seems like I hit one problem after another. I am trying IKEv1, with
strongswan 5.1.2. The rekey values I use are 4 minutes for phase2. The
reauth times for IKE is around 10 minutes. I have described some of the
problems below and what I have found from my debugging

1. Improper state changes and calling of updown script.
I noticed this problem in the method check_for_rekeyed_child in
src/libcharon/sa/ikev1/tasks/quick_mode.c . When a new quick mode
connection comes and a check is being made for existing connections, the
check is stopped when the first connection is found. This is because of the
following code in the function
enumerator = this->ike_sa->create_child_sa_enumerator(this->ike_sa);
while (this->reqid == 0 && enumerator->enumerate(enumerator, &child_sa))
{
The check stops when reqid is not 0. Part of what this function does is
changing the state of an existing connection to CHILD_REKEYING. However, if
there are two existing connections, because the first rekeyed child has not
expired, one connection is in the state CHILD_REKEYING and the other is
state CHILD_INSTALLED, the above loop finds the connection in state
CHILD_REKEYING, changes its state to CHILD_REKEYING (from CHILD_REKEYING)
and stops. The child in state CHILD_INSTALLED is not touched. When the new
child is established, there are now three child SAs - 2 in CHILD_INSTALLED
state and one in CHILD_REKEYING state.
A side effect of this condition is that when the connection that was
incorrectly in CHILD_INSTALLED state expires, the child updown script is
called with down-client. This should not have happened, because
technically, this child was being rekeyed.

The fix I made for the above was to remove the "this->reqid == 0" check in
the while loop and let the enumerator go through all the existing
connections.

2. Sometimes child updown script with "up-client" gets called for a rekeyed
child after IKE reauth
I have noticed this a few times. In the method "install" in
src/libcharon/sa/ikev1/tasks/quick_mode.c, where the old child SA is got
from the IKE SA, and if the old SA is NULL, the child_updown is called. The
following is the code snippet.

if (this->rekey)
{
old = this->ike_sa->get_child_sa(this->ike_sa,
this->proposal->get_protocol(this->proposal),
this->rekey, TRUE);
}
if (old)
{
charon->bus->child_rekey(charon->bus, old, this->child_sa);
}
else
{
charon->bus->child_updown(charon->bus, this->child_sa, TRUE);
}

I have noticed that sometimes old is NULL for a child that actually
rekeyed. This seems to happen when after the IKE SA has been reauth. I
think in other places we do not call up-client for child-sa that is
rekeyed. I moved the check for old inside the check for this->rekey and
this seemed to solve the problem.

3. Calling of updown script for child sa in REKEYING state when DPD
detected.
It seems that in some places the updown script is not called when the child
sa being deleted is in REKEYING state. However when DPD is detected and an
IKE connection is being deleted, all the child sa also are deleted. But
there is no check to see if the child is in CHILD_REKEYING  state. So the
updown script is called with "down-client" for all child sa, even if they
are in REKEYING state, which is different from other places where the child
updown script is not called for a rekeyed child_sa.
I saw this problem in the ike_updown method in src/libcharon/bus/bus.c

4. It seems that a child state is being set to CHILD_REKEYING even before
the new child is completely established. This happens in method
check_for_rekeyed_child in src/libcharon/sa/ikev1/tasks/quick_mode.c, which
is called when the first handshake of a new quick mode connection happens.
I have noticed a problem with juniper where, the new quick mode connection
fails to complete because of a bad hash value in the message 3 of the
handshake. This causes the infant connection to be deleted, but what this
also means is that the existing connection has already made a state change
to CHILD_REKEYING. A side effect of this is that when the existing
connection expires, the updown script is never called because the child_sa
is in state CHILD_REKEYING. One possible fix I tried is that when a child
sa in state CHILD_REKEYING is being deleted, I check if there is another
connection with the same name under the IKE_SA. If there are none, I call
the updown script with down-client.

5. During IKE_SA reauth, when a new IKE SA is being created and is adopting
child sa from previous IKE SA, only INSTALLED child sa are adopted. This is
because a child_sa is added to an ike_sa only after it is installed. So if
a new child sa is being created at around the same time as the new IKE SA
is being created, there is a chance that this child sa is associated with
the old IKE SA which is being torn down, causing this child sa to be
deleted when the older IKE SA is deleted.

6. I have seen a crash when enumerating through an empty array. For
example, in enumerating an empty policy_enumerator. The problem seemed to
be a bad count in the underlying array_t. It seems that in the method
array_create in src/libstrongswan/collections/array.c, the count and data
variables are not initialized to 0.

7. One of the more important debug logs I added to help me debug was to log
the state change of a child SA in method set_state in
src/libcharon/sa/child_sa.c . It would be very helpful if this log is added
in the code base.

Please let me know if my investigations and my possible fixes make any
sense.

regards,
sach
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.strongswan.org/pipermail/dev/attachments/20140731/1307b143/attachment.html>