Friday, April 21, 2017

RRM Neighbor Timeout Factor and Channel Utilization

If you work with Cisco wireless networks, I highly recommend that you read their Radio Resource Management white paper. If you want to understand how RRM works and how to tune it for your environment, this is the document you need to read.

The foundation of all RRM operations is the neighbor table. It lists each radio in the system, how other radios hear it, and how it hears other radios. The neighbor table is built using Neighbor Discovery Protocol (NDP) frames. NDP frames are sent at the maximum power and minimum data rate supported by the radio and channel. How often NDP frames are sent depends on the Neighbor Packet Frequency setting under 802.11a/b - RRM - General. The Neighbor Packet Frequency determines how often a radio goes off-channel (yes, I said off-channel) to transmit an NDP frame on each channel in the band. Which channels are used to transmit NDP frames is controlled by the Noise/Interference/Rogue/CleanAir Monitoring Channels setting, also in 802.11a/b - RRM - General.

The default setting is Country Channels, which means the radio will try to send NDP frames for every channel allowed in your defined regulatory domain for that band. (DFS channels are special; see the RRM white paper for more details).

Like all frames sent on a wireless medium, NDP frames must follow the rules of the DCF. The medium must be idle on the channel where the NDP frame is going to be sent. Unlike regular frames, NDP frames will not wait very long for the opportunity to transmit; remember, the radio is off-channel and can't serve clients. If the opportunity doesn't come, the radio simply reschedules the next NDP transmission for that channel at the defined Neighbor Packet Frequency.

To compensate for short-term problems with transmitting NDP frames, RRM operation uses a Neighbor Pruning Interval value. After a neighbor is discovered, it will stay in the Neighbor Table for a specific amount of time, even if NDP transmission fails due to high channel utilization.

Prior to 8.0, the Neighbor Pruning Interval was fixed at 1 hour. In 8.0, it is fixed at 15 minutes.

Let's look at the defaults for 7.6 code and do an exercise. The default Neighbor Packet Frequency in 7.6 is 60 seconds. With a 1-hour Neighbor Pruning Interval, NDP transmission would need to fail 60 times in a row for a radio to drop off the neighbor table. That's a lot of chances for an NDP frame to get through, which results in a very stable Neighbor Table.

In 8.0 and above, things change. The default Neighbor Packet Frequency increases to 3 minutes, and the Neighbor Pruning Interval shortens to 15 minutes. This means a neighbor could drop off the table if 5 NDP transmissions in a row fail.

Why shorten the Neighbor Pruning Interval? Since the neighbor table is used for both DCA and TPC, a neighbor dropping out of the table could result in radios near it increasing their power. A shorter Neighbor Pruning Interval results in faster adjustment to the loss of an AP.

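In 8.1 and above, Cisco introduced a new parameter called the Neighbor Timeout Factor, or NTF for short. The NTF allows the user to adjust the Neighbor Pruning Interval in the following way:

Neighbor Pruning Interval = NTF x Neighbor Packet Frequency

This matches the 8.0 values above: 5 x 3 minutes = 15 minutes.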

In order for a radio to drop out of the Neighbor Table in 8.1 and above, the NDP transmission would have to fail NTF times in a row.
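The release-by-release arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the NTF relationship (pruning interval = NTF x Neighbor Packet Frequency) is inferred from the numbers in this post rather than taken from Cisco documentation.

```python
def pruning_interval_s(ntf: int, packet_frequency_s: int) -> int:
    """Assumed relationship: pruning interval = NTF x Neighbor Packet Frequency."""
    return ntf * packet_frequency_s


def misses_before_prune(pruning_interval: int, packet_frequency_s: int) -> int:
    """Number of consecutive failed NDP cycles that fit in one pruning interval."""
    return pruning_interval // packet_frequency_s


if __name__ == "__main__":
    # 7.6 defaults from this post: 1-hour pruning interval, 60-second frequency
    print(misses_before_prune(60 * 60, 60))    # 60
    # 8.0 defaults from this post: 15-minute pruning interval, 3-minute frequency
    print(misses_before_prune(15 * 60, 180))   # 5
    # 8.1 and above, assuming NTF = 5 and a 3-minute frequency
    print(pruning_interval_s(5, 180))          # 900 seconds, i.e. 15 minutes
```

Plugging in your own NTF and Neighbor Packet Frequency shows how long a silent neighbor would linger in the table before being pruned.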

Now let's take a look at how this all ties in with channel utilization. Suppose there are two APs: AP "A" on channel 36 and AP "B" on channel 149. The two radios are close enough to one another to "hear" each other's NDP frames. Every 180 seconds, "A" goes off-channel to send an NDP frame on channel 149.

Now suppose that channel utilization on "B" is x%. It's a bit of a simplification, but this means that "A" has an x% chance of failing its NDP transmission on channel 149.

The chance that "A" will fail to transmit its NDP frame on channel 149 NTF times in a row is (x/100)^NTF.

Let's say that x is 50% and NTF is 5. The chance that "A" would fail to transmit an NDP frame on channel 149 five times in a row would be 0.5^5, which is about 0.03, or 3%. Conversely, the chance that NDP transmission would succeed on at least one of the 5 attempts would be 1 - 0.5^5, or about 97%.

Suppose you wanted the chance of at least one NDP frame to be transmitted out of 5 attempts to be 99%. What would the channel utilization have to be under for this to happen?

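Writing the utilization as a fraction u, solving 1 - u^5 = 0.99 gives u = 0.01^(1/5), or roughly 0.398. Channel utilization on "B" would need to stay under 40% for the neighbor entry to have a 99% chance of surviving a pruning interval. Keep in mind that channel utilization is not the only factor in NDP transmission; you also have to deal with scan defer settings for voice traffic.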
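The probabilities in this example are easy to script. The sketch below uses the same simplification as the text (each NDP attempt fails independently with probability equal to the channel utilization); the function names are my own and purely illustrative.

```python
import math


def p_all_fail(utilization: float, ntf: int) -> float:
    """Chance that NTF consecutive NDP attempts all fail (utilization as 0..1)."""
    return utilization ** ntf


def p_at_least_one(utilization: float, ntf: int) -> float:
    """Chance that at least one of NTF attempts gets through."""
    return 1.0 - p_all_fail(utilization, ntf)


def max_utilization(target_success: float, ntf: int) -> float:
    """Highest utilization that still gives target_success over NTF attempts."""
    return (1.0 - target_success) ** (1.0 / ntf)


def min_ntf(utilization: float, target_success: float) -> int:
    """Smallest NTF that reaches target_success at a given utilization (0 < u < 1)."""
    return math.ceil(math.log(1.0 - target_success) / math.log(utilization))


if __name__ == "__main__":
    print(round(p_all_fail(0.5, 5), 3))       # 0.031
    print(round(p_at_least_one(0.5, 5), 3))   # 0.969
    print(round(max_utilization(0.99, 5), 3)) # 0.398
    print(min_ntf(0.5, 0.99))                 # 7
```

The last call turns the question around: at 50% utilization, an NTF of 7 would be needed to reach a 99% chance under this simplified model.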

The takeaways:

  • The longer the Neighbor Pruning Interval (the higher the NTF), the more stable RRM will be. The tradeoff is a slower adjustment to the loss of APs.
  • Use the formulas above to calculate what channel utilization you need to stay under in order for NDP transmission to succeed for a given NTF. Plug that number into your trap thresholds. 
  • Dense environments with high channel utilization or voice clients will need higher NTF values. The default of 5 may not be enough. 
  • What works for 5 GHz may not work for 2.4 GHz. Consider using different NTF values for each band.








Sunday, April 16, 2017

Broadcast Key Rotation - Part 2

At the end of my last blog, I discussed what happens if a client misses the broadcast key rotation for the AP it is connected to. We know that a client that misses the key rotation will be disconnected, but how many retries are made before the client is removed?

Here are the default EAP settings for a Cisco controller-based wireless network:


The EAPOL-Key Timeout and EAPOL-Key Max Retries parameters should be the answer to the question. The default settings would mean there are three attempts at sending the broadcast key to a client, with the 2nd and 3rd attempts each following the previous one by 1 second. If a client can't get the new broadcast key in 2 seconds, it is disconnected.

I tested this with an AP set to minimum power on channel 165 and my Moto G4. After connecting, I moved my phone away from the AP and watched debug output on the controller console to see if the key rotation was successful. Eventually I got far enough away, and put enough attenuation between my phone and the AP, for the key rotation to fail.

First, let's have a look at the output of debug dot1x all:


The first seven lines show that the key rotation has started and that the first attempt is being made at transmitting the new key to my phone. Take note of the third line, where it states "message 5 - group." About 1 second later, the first retransmission happens, after the timeoutEvt message appears in the log. Note that at the end of that line it reads "message  = M5." M5 must mean the group key, based on line 3 in the debug. Another second goes by, and there is another timeoutEvt message. The key is transmitted one more time. Another second goes by, and at 15:04:33 the client is disconnected.
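The timeline seen in the debug can be sketched as a small function. It assumes a 1-second EAPOL-Key Timeout and 2 retries, and that the client is dropped one timeout after the last transmission, which is how the debug output reads. The function name and return shape are hypothetical.

```python
from typing import List, Tuple


def group_key_timeline(timeout_s: float = 1.0,
                       max_retries: int = 2) -> Tuple[List[float], float]:
    """Return (send offsets, disconnect offset) in seconds from the first send.

    Assumes one initial transmission plus max_retries retransmissions,
    each sent after the previous attempt times out.
    """
    sends = [i * timeout_s for i in range(max_retries + 1)]
    disconnect = sends[-1] + timeout_s
    return sends, disconnect


if __name__ == "__main__":
    sends, disconnect = group_key_timeline()
    print(sends)       # [0.0, 1.0, 2.0]
    print(disconnect)  # 3.0
```

With the first transmission at about 15:04:30, a disconnect offset of 3 seconds lands on 15:04:33, which lines up with the timestamp in the debug.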

It appears that the EAPOL-Key Timeout and EAPOL-Key Max Retries parameters do indeed control the behavior of broadcast key retransmissions. While I was logging the debug output, I was also capturing frames on channel 165 with an Aruba IAP. I fired up Wireshark, applied a filter for the key rotation frame (wlan.ta and wlan.sa both equal to the BSSID, wlan.ra equal to the Moto G4's MAC address), and scrolled to 15:04:30. And there's nothing there! The key rotation frames were not seen over the air.

The debug output clearly says that the key was sent three times, so what happened? To find out, I had to go back in the capture to 15:03:48, where I saw this



Less than a minute before the key rotation, my client sent a Null Data Packet (NDP) to the AP saying that it was going into a sleep mode. I wrote a Wireshark filter to find every Null Data Packet from my phone and the power management state each one signaled. The message at 15:03:48 was the last one sent by the phone on channel 165.

Since the AP had not received another NDP from my phone by the time the broadcast key rotation started, the AP believed the client was still asleep. An AP will not transmit a queued frame to an associated client that it thinks is asleep. It first needs to learn that the client is awake by receiving an NDP with the Power Management bit cleared to 0 (Wireshark decodes this as "STA will stay up"). This goes for key rotation frames too. Looking further in the capture, the AP did not attempt to transmit de-authentication frames to the client either.
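The AP's view of the client can be modeled as "whatever the last null data frame said." The sketch below is a toy model, not how any controller is implemented; the function names are invented, and the earlier frame in the example is made up. Only the 15:03:48 and 15:04:30 timestamps come from the capture.

```python
from typing import Iterable, Optional, Tuple


def hms(hours: int, minutes: int, seconds: int) -> int:
    """Convert a wall-clock time to seconds since midnight."""
    return hours * 3600 + minutes * 60 + seconds


def ap_believes_awake(null_frames: Iterable[Tuple[int, bool]],
                      at_time: int) -> Optional[bool]:
    """Return the power state the AP last heard at or before at_time.

    null_frames holds (timestamp, stays_awake) pairs, one per null data
    frame from the client. Returns None if nothing was heard in time.
    """
    latest_ts = None
    state = None
    for ts, stays_awake in null_frames:
        if ts <= at_time and (latest_ts is None or ts >= latest_ts):
            latest_ts, state = ts, stays_awake
    return state


if __name__ == "__main__":
    frames = [
        (hms(15, 3, 40), True),   # hypothetical earlier frame: client awake
        (hms(15, 3, 48), False),  # last frame in the capture: going to sleep
    ]
    # At the start of the key rotation the AP still thinks the client sleeps
    print(ap_believes_awake(frames, hms(15, 4, 30)))  # False
```

Because nothing arrives after 15:03:48, the model returns False for the whole retry window, which is why no key frames ever went over the air.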

What's the takeaway here? The combination of power save measures and key rotation can result in clients being disconnected from a WLAN without knowing they have been kicked off. Clients are expected to be awake for the DTIM beacon (the beacon where the DTIM count is zero), because that is when buffered broadcast and multicast traffic is delivered. It's known that some clients ignore the DTIM interval in beacons, preferring to save battery power over receiving broadcast traffic.

Personally, I recommend increasing the broadcast key rotation interval from the default of 1 hour to something longer, like 12 or 24 hours. If you have a WLAN that is not supporting voice, consider increasing the DTIM period to 3. This will allow clients that do honor the DTIM interval to conserve power, while avoiding problems with clients that don't honor it.




Friday, April 14, 2017

Broadcast Key Rotation in WPA2-Enterprise WLANs

Wireless networks using WPA2-Enterprise security with 802.1X authentication are a common sight in corporate environments. They provide a secure way for devices to communicate over the air.

While studying for the CWSP exam, I became familiar with the mechanisms of WPA2-Enterprise authentication. After a client has provided the correct credentials, the AP (and the DS behind it) performs what is known as the four-way handshake with the client.

(Image caption: ACKs not shown for brevity!)

The purpose of the 4-way handshake is to securely establish a pair of encryption keys. One of the two keys is the Pairwise Transient Key, or PTK for short. This key is used to encrypt/decrypt unicast traffic to/from the client. The other key is the Group Temporal Key, or GTK. This key is used to encrypt/decrypt broadcast and multicast traffic for all stations on the BSSID. Because of its purpose, the GTK is also referred to as the broadcast key.

Since anyone within earshot of a wireless network can see its traffic, and since all broadcast traffic is encrypted with the same GTK, there is a possibility that an eavesdropper could collect enough broadcast traffic to guess the key. For this reason the GTK is rotated, or changed, for all stations on the BSSID periodically. The new GTK needs to be delivered securely to each station on the BSSID, which means it needs to be sent via unicast to each station, and encrypted with each station's PTK. The lifetime of the GTK is often called the broadcast key rotation interval, and it specifies how often the GTK must be changed for all stations on a BSSID that uses WPA.

For Cisco lightweight-AP based networks, the default broadcast key rotation interval is 3,600 seconds, or 1 hour. You can see the defined interval by issuing the show advanced eap command.


To see broadcast key rotation in action, it helps to shorten the interval to something manageable. I don't know about you, but I'm not waiting an hour to watch the key rotate.

(Image caption: Presto!)
The change to the broadcast key interval takes effect after the next scheduled key rotation. If there are clients connected to an AP, you could end up waiting a while. My lab environment had no clients when I made the change.

There are two ways to "watch" the key rotation: via debug or over the air. To see the broadcast key rotation via debug, use debug dot1x all enable. Keep in mind that doing this in a production environment will likely produce a lot of output to the terminal. Here is what you will see when the broadcast key is rotated.

(Image caption: The Easy Way)

You can see that the AP sends the GTK to the client, and that the AP resets the timer for the next key rotation.

Seeing the key rotation over the air with a packet analyzer is a bit trickier. It's easy to tell when a client associates and completes the 4-way handshake, but what do you look for to see the broadcast key rotation? The key rotation does not decode in Wireshark as an EAPOL packet; the station is already authenticated and the 802.1X port is unblocked.

The client has to be in an awake state to receive the new GTK, so I decided the best way to find the key rotation was to watch power management null-data packets around the time that I expected to see the key rotation take place. I configured my trusty Moto G4 to connect to the WPA2-Enterprise WLAN on my lab AP, and just let it sit without moving a lot of data. Taking a look at my capture, I want to see when a beacon tells my client to wake up, and what happens after that.

To see that, I need the Association ID of my client on the AP, which you can see from the Association Response when the client first connects.

(Image caption: No Surprise, I'm the only client)
So my association ID is 1. Now I'm going to look for Beacons from the SSID I'm connected to that have the DTIM count value of 0 and are telling my client to wake up.

(Image caption: Wakey Wakey)
(Now, I know what you're thinking: If the client is asleep, how can it hear the beacon? This brings up an interesting discussion. The client is supposed to wake up for every DTIM period. The AP doesn't know for sure that the client woke up until it receives a null-data packet from the client indicating that it will stay awake).

(Image caption: I'm up, what do you want?)

Immediately after, I see a QoS Data frame sent from the AP to my client. What was interesting about this frame was that the Source Address field was the same as the Transmitter Address field. It was not a frame being delivered from an upstream source; it was coming directly from the AP.

(Image caption: You're going to have to trust me here.)

My client was connected for a while, at least long enough to see two or three broadcast key rotations at the 120-second interval. So I made a Wireshark filter to match frames where the Transmitter Address is the AP, the Source Address is the AP, and the frame type is data: wlan.ta == bssid && wlan.sa == bssid && wlan.fc.type == 2 (with bssid replaced by the AP's BSSID).

(Image caption: Well would you look at that.)
The time deltas between those frames are lining up perfectly with the broadcast key rotation interval of 120 seconds.
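If you export the timestamps of the filtered frames, the same check can be done in a few lines. This is a minimal sketch; the function names, the 1-second tolerance, and the sample timestamps are all made up for illustration and are not values from my capture.

```python
from typing import List, Sequence


def frame_deltas(timestamps: Sequence[float]) -> List[float]:
    """Time difference between each pair of consecutive frames."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def matches_interval(timestamps: Sequence[float],
                     interval_s: float,
                     tolerance_s: float = 1.0) -> bool:
    """True if every delta is within tolerance_s of the rotation interval."""
    deltas = frame_deltas(timestamps)
    if not deltas:
        return False  # need at least two frames to compare
    return all(abs(delta - interval_s) <= tolerance_s for delta in deltas)


if __name__ == "__main__":
    # Hypothetical capture-relative timestamps, in seconds
    sample = [10.2, 130.3, 250.1]
    print(matches_interval(sample, 120))  # True
    print(matches_interval(sample, 60))   # False
```

The tolerance is there because the client has to wake up before the AP can deliver the frame, so the deltas will not be exact.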

So, you may be asking yourself what happens if a client does not wake up from a sleep state to receive the new GTK. The answer: they are de-authenticated from the BSS. How long will the AP wait, and how many retries will happen before the client is disconnected?

(Image caption: Hello again)
Look at the EAPOL-Key Timeout and Max Retries parameters here. You could infer that the client is given 1 second and two retries before being disconnected. But is that what really determines it?

For another blog, perhaps. But I do know for certain that broadcast key rotation intervals that are too short will cause problems for clients that enter power-save states. This is especially true for clients that will not wake up for short DTIM values, like iPhones. It's best to extend the broadcast key interval out past the default of one hour, and make sure your WLANs have DTIM values greater than 2 where applicable (not recommended for voice networks).