Friday, April 21, 2017

RRM Neighbor Timeout Factor and Channel Utilization

If you work with Cisco wireless networks, I highly recommend that you read their Radio Resource Management white paper. If you want to understand how RRM works and how to tune it for your environment, this is the document you need to read.

The foundation of all RRM operations is the neighbor table. It is a list of each radio in the system, how other radios hear it, and how it hears other radios. The neighbor table is built using Neighbor Discovery Protocol, or NDP, frames. NDP frames are sent at the maximum power and minimum data rate supported by the radio and channel. How often NDP frames are sent depends on the Neighbor Packet Frequency setting under 802.11a/b - RRM - General. The Neighbor Packet Frequency determines how often a radio goes off-channel (yes, I said off-channel) to transmit an NDP frame for each channel in the band. Which channels are used to transmit NDP frames is controlled by the Noise/Interference/Rogue/CleanAir Monitoring Channels setting, also in 802.11a/b - RRM - General.

The default setting is Country Channels, which means the radio will try to send NDP frames for every channel allowed in your defined regulatory domain for that band. (DFS channels are special; see the RRM white paper for more details).

Just like all frames sent on a wireless medium, NDP frames must follow the rules of the DCF. The medium must be idle in order to transmit an NDP frame on the channel it is going to be sent on. Unlike regular frames, NDP frames will not wait very long for the opportunity to transmit; remember, the radio is off-channel and can't serve clients. If it misses its chance, the radio will simply re-schedule the next NDP transmission for that channel at the defined Neighbor Packet Frequency.

To compensate for short-term problems with transmitting NDP frames, RRM operation uses a Neighbor Pruning Interval value. After a neighbor is discovered, it will stay in the Neighbor Table for a specific amount of time, even if NDP transmission fails due to high channel utilization.

Prior to 8.0, the Neighbor Pruning Interval was fixed at 1 hour. In 8.0, it is fixed at 15 minutes.

Let's look at the defaults for 7.6 code and do an exercise. The Neighbor Packet Frequency is 60 seconds by default in 7.6. In order for a radio to drop off the neighbor table, the NDP would need to fail to transmit 60 times in a row. That's a lot of chances for an NDP to get through, which would result in a very stable Neighbor Table.

In 8.0 and above, things change. The default Neighbor Packet Frequency increases to 3 minutes, and the Neighbor Pruning Interval shortens to 15 minutes. This means a neighbor could drop off the table if 5 NDP transmissions in a row fail.

Why shorten the Neighbor Pruning Interval? Since the neighbor table is used for both DCA and TPC, a neighbor dropping out of the table could result in radios near it increasing their power. A shorter Neighbor Pruning Interval results in faster adjustment to the loss of an AP.

In 8.1 and above, Cisco introduced a new parameter called the Neighbor Timeout Factor, or NTF for short. The NTF allows the user to adjust the Neighbor Pruning Interval in the following way:

Neighbor Pruning Interval = NTF x Neighbor Packet Frequency

In order for a radio to drop out of the Neighbor Table in 8.1 and above, the NDP transmission would have to fail NTF times in a row.

Now let's take a look at how this all ties in with channel utilization. Suppose there are two APs: AP "A" on channel 36 and AP "B" on channel 149. The two radios are close enough to one another to "hear" each other's NDP frames. Every 180 seconds, "A" goes off-channel to send an NDP frame on channel 149.

Now suppose that channel utilization on "B" is x%. It's a bit of a simplification, but this means that "A" has an x% chance to fail its NDP transmission on channel 149.

The chance that "A" will fail to transmit its NDP frame on channel 149 NTF times in a row is x^NTF, with x written as a fraction.

Let's say that x is 50% and NTF is 5. The chance that "A" would fail to transmit an NDP frame on channel 149 five times in a row would be 0.5^5, which is about .03, or 3%. Conversely, the chance that NDP transmission would succeed on at least one of the 5 attempts would be 97%, or 1 - x^NTF.

Suppose you wanted the chance of at least one NDP frame to be transmitted out of 5 attempts to be 99%. What would the channel utilization have to be under for this to happen? Solving 1 - x^5 = 0.99 gives x = 0.01^(1/5), or about 0.398.

Channel utilization on "B" would need to be under 40% to guarantee 99% stability in the Neighbor Table. Keep in mind that channel utilization is not the only factor in NDP transmission; you also have to deal with scan defer settings for voice traffic.
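The arithmetic in this post can be checked with a few lines of Python. This is only a sketch of the simplified model described above, where channel utilization is treated as the chance that a single NDP transmission fails; the function names are my own and are not part of any Cisco tool.

```python
def failures_before_prune(pruning_interval_s: int, packet_frequency_s: int) -> int:
    """Consecutive failed NDP transmissions needed before a neighbor is pruned."""
    return pruning_interval_s // packet_frequency_s


def p_all_fail(utilization: float, ntf: int) -> float:
    """Chance that NTF consecutive NDP attempts all fail (simplified model)."""
    return utilization ** ntf


def max_utilization(target_success: float, ntf: int) -> float:
    """Highest utilization that still gives the target chance of one success."""
    return (1.0 - target_success) ** (1.0 / ntf)


print(failures_before_prune(3600, 60))     # 7.6 defaults: 1 hour / 60 seconds
print(failures_before_prune(900, 180))     # 8.0 defaults: 15 minutes / 3 minutes
print(round(p_all_fail(0.50, 5), 3))       # x = 50%, NTF = 5
print(round(max_utilization(0.99, 5), 3))  # utilization ceiling for 99% success
```

Changing the second argument of max_utilization shows how quickly the utilization ceiling rises as NTF increases.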

The takeaways:

  • The longer the Neighbor Pruning Interval (the higher the NTF), the more stable RRM will be. The tradeoff is not adjusting to loss of APs as quickly. 
  • Use the formulas above to calculate what channel utilization you need to stay under in order for NDP transmission to succeed for a given NTF. Plug that number into your trap thresholds. 
  • Dense environments with high channel utilization or voice clients will need higher NTF values. The default of 5 may not be enough. 
  • What works for 5 GHz may not work for 2.4 GHz. Consider using different NTF values for each band.








Sunday, April 16, 2017

Broadcast Key Rotation - Part 2

At the end of my last blog, I discussed what happens if a client misses the broadcast key rotation for the AP it is connected to. We know that a client that misses the key rotation will be disconnected, but how many retries are made before the client is removed?

Here are the default EAP settings for a Cisco controller-based wireless network:


The parameters EAPOL-Key Timeout and EAPOL-Key Max retries should be the answer to the question. The default settings would mean there are three attempts at sending the broadcast key to a client, with the 2nd and 3rd attempts being spaced apart by 1 second. If a client can't get the new broadcast key in 2 seconds, it is disconnected.

I tested this by setting an AP at minimum power, using channel 165, and my Moto G4. After connecting, I moved my phone away from the AP and watched debug output on the controller console to see if the key rotation was successful. Eventually I got far enough away and put enough attenuation between my phone and the APs for the key rotation to fail.

First, let's have a look at the output of debug dot1x all:


The first seven lines show that the key rotation has started and that the first attempt is being made at transmitting the new key to my phone. Take note of the third line, where it states "message 5 - group." About 1 second later, the first retransmission happens, after the timeoutEvt message appears in the log. Note that at the end of that line it reads "message  = M5." M5 must mean the group key, based on line 3 in the debug. Another second goes by, and there is another timeoutEvt message. The key is transmitted one more time. Another second goes by, and at 15:04:33 the client is disconnected.
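The timeline in the debug can be sketched as simple arithmetic. This assumes one initial transmission plus "Max Retries" retransmissions, each followed by a full "Timeout" wait, which is my reading of the timestamps above rather than documented controller behavior; the function name is made up for illustration.

```python
def key_retry_timeline(timeout_s: float, max_retries: int):
    """Return (send_times, disconnect_time) in seconds from the first attempt.

    Assumes one initial transmission plus max_retries retransmissions,
    with a full timeout elapsing after every attempt.
    """
    attempts = 1 + max_retries
    send_times = [i * timeout_s for i in range(attempts)]
    disconnect_time = attempts * timeout_s
    return send_times, disconnect_time


# Values taken from the text: 1 second timeout, 2 retries.
sends, dropped = key_retry_timeline(timeout_s=1.0, max_retries=2)
print("key sent at offsets:", sends)
print("client removed at offset:", dropped)
```

Under that assumption the client is removed three timeouts after the first attempt, which lines up with a first attempt around 15:04:30 and the disconnect at 15:04:33.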

It appears that the EAPOL-Key Timeout and EAPOL-Key Max Retries parameters do indeed control the behavior of broadcast key retransmissions. While I was logging the debug output, I was also capturing frames on channel 165 with an Aruba IAP. I fired up Wireshark, applied the filter that shows the key rotation frame (wlan.ta = wlan.sa = BSSID, wlan.ra = Moto G4), and scrolled to 15:04:30. And, there's nothing there! The key rotation frames were not seen over the air.

The debug output clearly says that the key was sent three times, so what happened? To find out, I had to go back in the capture to 15:03:48, where I saw this



Less than a minute before the key rotation, my client sent a Null Data Packet to the AP saying that it was going into a sleep mode. I wrote a Wireshark filter to look for all Null Data Packets from my phone and what power management message it was sending. The message at 15:03:48 was the last one sent by the phone on channel 165.

Since the AP had not received a Null Data Packet from my phone by the time the broadcast key rotation started, the AP believed the client was still in a sleep state. An AP will not transmit a queued frame to an associated client if it thinks the client is asleep. It needs to learn that the client is awake by receiving a Null Data Packet with the Power Management bit cleared (set to 0, which Wireshark decodes as "STA will stay up"). This goes for key rotation frames too. Looking further in the capture, the AP did not attempt to transmit de-authentication frames to the client either.

What's the takeaway here? The combination of power save measures and key rotation can result in clients being disconnected from a WLAN without knowing they have been kicked off. Clients are expected to be awake for the DTIM beacon (the beacon where the DTIM count is zero), because that is when buffered broadcast and multicast traffic is delivered. It's known that some clients ignore the DTIM interval, preferring to save battery power over receiving broadcast traffic.

Personally, I recommend increasing the broadcast key rotation interval from the default of 1 hour to something a bit longer, like 12 or 24 hours. If you have a WLAN that is not supporting voice, consider increasing the DTIM period to 3. This will allow clients that do honor the DTIM interval to conserve power, while avoiding problems with clients that don't honor it.




Friday, April 14, 2017

Broadcast Key Rotation in WPA2-Enterprise WLANs

Wireless networks using WPA2-Enterprise security with 802.1X authentication are a common sight in corporate environments. WPA2-Enterprise provides a secure way for devices to communicate over the air.

While studying for the CWSP exam, I became familiar with the mechanisms of WPA2-Enterprise authentication. After a client has provided the correct credentials, the AP (and the DS behind it) performs what is known as the four-way handshake with the client.

ACKs not shown for brevity!
The purpose of the 4-way handshake is to securely exchange a pair of encryption keys. One of the two keys is the Pairwise Transient Key, or PTK for short. This key is used to encrypt/decrypt unicast traffic to/from the client. The other key is the Group Temporal Key, or GTK. This key is used to encrypt/decrypt broadcast and multicast traffic for all stations on the BSSID. Because of its purpose, the GTK is also referred to as the broadcast key.

Since anyone within earshot of a wireless network can see its traffic, and since all broadcast traffic is encrypted with the same GTK, there is a possibility that an eavesdropper could collect enough broadcast traffic to guess the key. For this reason the GTK is rotated, or changed, for all stations on the BSSID periodically. The new GTK needs to be delivered securely to each station on the BSSID, which means it needs to be sent via unicast to each station, and encrypted with each station's PTK. The lifetime of the GTK is often called the broadcast key rotation interval, and it specifies how often the GTK must be changed for all stations on a BSSID that uses WPA.

For Cisco lightweight-AP based networks, the default broadcast key rotation interval is 3,600 seconds, or 1 hour. You can see the defined interval by issuing the show advanced eap command.


To see broadcast key rotation in action, it helps to shorten the interval to something manageable. I don't know about you, but I'm not waiting an hour to watch the key rotate.

Presto!
The change to the broadcast key interval takes effect after the next scheduled key rotation. If there are clients connected to an AP, you could end up waiting a while. My lab environment had no clients when I made the change.

There are two ways to "watch" the key rotation: via debug or over the air. To see the broadcast key rotation via debug, use debug dot1x all enable. Keep in mind that doing this in a production environment will likely produce a lot of output to the terminal. Here is what you will see when the broadcast key is rotated.

The Easy Way

You can see that the AP sends the GTK to the client, and that the AP resets the timer for the next key rotation.

Seeing the key rotation over the air with a packet analyzer is a bit trickier. It's easy to tell when a client associates and completes the 4-way handshake, but what do you look for to see the broadcast key rotation? The key rotation does not decode in Wireshark as an EAPOL packet; the station is already authenticated and the 802.1X port is unblocked.

The client has to be in an awake state to receive the new GTK, so I decided the best way to find the key rotation was to watch power management null-data packets around the time that I expected to see the key rotation take place. I configured my trusty Moto G4 to connect to the WPA2-Enterprise WLAN on my lab AP, and just let it sit without moving a lot of data. Taking a look at my capture, I want to see when a beacon tells my client to wake up, and what happens after that.

To see that, I need the Association ID of my client on the AP, which you can see from the Association Response when the client first connects.

No Surprise, I'm the only client
So my association ID is 1. Now I'm going to look for Beacons from the SSID I'm connected to that have the DTIM count value of 0 and are telling my client to wake up.

Wakey Wakey
(Now, I know what you're thinking: If the client is asleep, how can it hear the beacon? This brings up an interesting discussion. The client is supposed to wake up for every DTIM period. The AP doesn't know for sure that the client woke up until it receives a null-data packet from the client indicating that it will stay awake).

I'm up, what do you want? 

Immediately after, I see this QoS Data frame sent from the AP to my client. What was interesting about this frame was the Source Address field was the same as the Transmitter Address field. It was not a frame being delivered from an upstream source; it was coming directly from the AP.

You're going to have to trust me here.

My client was connected for a while, at least long enough to see two or three broadcast key rotations at the 120 second interval. So I made a Wireshark filter to match frames where the transmitter address is the AP, the Source Address is the AP, and the frame type is data (substitute the AP's BSSID for bssid): wlan.ta == bssid && wlan.sa == bssid && wlan.fc.type == 2 .

Well would you look at that.
The time deltas between those frames are lining up perfectly with the broadcast key rotation interval of 120 seconds.

So, you may be asking yourself what happens if a client does not wake up from a sleep state to receive the new GTK. The answer: they are de-authenticated from the BSS. How long will the AP wait, and how many retries will happen before the client is disconnected?

Hello again
Look at the EAPOL-Key Timeout and Max Retries parameters here. You could infer that the client is given 1 second and two retries before being disconnected. But is that what really determines it?

For another blog, perhaps. But I do know for certain that broadcast key rotation intervals that are too short will cause problems for clients that enter power-save states. This is especially true for clients that will not wake up for short DTIM values, like iPhones. It's best to extend the broadcast key interval out past the default of one hour, and make sure your WLANs have DTIM values greater than 2 where applicable (not recommended for voice networks).


Friday, January 27, 2017

The Importance of Soft Skills

In wireless networking, we tend to focus on technical details. Wi-Fi is complicated, and the strength of a Wi-Fi professional should be in their expert knowledge of how Wi-Fi works.

If you are looking to break into working in Wi-Fi, there is also another important thing to brush-up on: your soft skills. Information Technology workers often get so wrapped up in the "Technology" part of their job that they forget about the most important part: people. We work primarily with, and for, people. The solutions you create and problems you fix ultimately help other people.

What if your personal physician was a brilliant M.D. from Harvard who was well respected in their field for in-depth knowledge, but who was also rude, late to appointments, and could not communicate well? Would you keep that doctor?

Soft skills are defined as "personal attributes that enable someone to interact effectively and harmoniously with other people." In other words, behave in a way that doesn't make your co-workers want to stab you. Here are some of those skills:
  • Effective oral and written communication. Be able to clearly communicate the information that you want your audience to digest. 
  • Describe technical details to non-technical people. Be able to describe why something, for technical reasons, will/won't work to people not versed in the jargon. Use analogies and metaphors to get a point across. 
  • Don't scoff at people for their lack of knowledge of something you are knowledgeable in. Making someone feel stupid is a quick way to sour your relationship with them. Conversely, don't be intimidated by people that may be knowledgeable in other fields that may question your expertise. Be confident, but not cocky. 
  • Have integrity. Do what you say you will do. 
  • Be transparent. Don't hide your reasoning for choices you make. 
  • Be a team-player. Find value in your coworkers and encourage them to learn more. 
Developing these skills takes time and effort. One sure way to develop many of these skills is to teach. Hold seminars or workshops, or teach at a community college. I taught college classes for years before I started in I.T., and even a few years after. Teaching helped me hone my soft skills. 

Be an expert in your field, but don't neglect the soft side of Information Technology. 

Wednesday, January 11, 2017

Using Cisco APs in Sniffer Mode to Measure Attenuation

My previous blog entries have relied heavily on using Cisco lightweight APs in sniffer mode for packet analysis. This entry is no different. For a primer on using lightweight APs for packet capture, click here.

I had the idea of using a lightweight AP in sniffer mode to measure the attenuation of a wall in my office. I understand that my method here is not typical, doesn't translate well to pre-installation techniques, and doesn't replace AP-on-a-stick. This blog is more of a "what's a cool thing I can do with a sniffer-mode AP and Wireshark."

Measuring attenuation through an obstacle is more involved than one may think, and I learned a few things studying the standard methods before capturing any packets. The signal source should be at least 4 meters from the obstruction, and the measuring device should be at least 1 meter from the other side of the obstruction. Using these distances, instead of something closer, means the dB loss will be more linear with distance, as opposed to inverse-square. See Nigel Bowden's excellent blog on this subject at Ekahau's website.

A lightweight AP that was already installed on the ceiling was used as a signal source. The AP was broadcasting two SSIDs on both 2.4 (channel 11) and 5 GHz (channel 36) bands with a beacon interval of 100ms. If you plan on trying this yourself, you should map at least 2 SSIDs to each radio; I'll explain later.

I used a sniffer mode AP on a long patch cable so I could move around. I started the packet capture, held the sniffer AP line-of-sight to the signal source, then moved a few feet to my left to put the obstruction between the sniffer and the source.

Once I had the capture, I had to filter out any packets that were not from the AP I was measuring. The easiest way I found to do this was to look for beacon frames (wlan.fc.type_subtype == 0x8), and frames received with better than -65 dBm strength (wlan_radio.signal_dbm > -65). After reviewing to make sure my filter worked, I exported the packets (File ->Export Specified Packets), making sure to select "Displayed" for the export. This step will make generating the graphs later a bit easier.

Open the capture file created by the export, and select Statistics -> I/O graph. Uncheck the box next to the "All Packets" default graph; we don't need to see it. Click the plus sign to add a new graph. I changed the name to "Channel 11". In the display filter field, enter a filter in Wireshark display filter syntax to limit what packets will be considered for the graph. I only want to see packets on channel 11, so I enter the filter wlan_radio.channel == 11. For the Y axis, change the drop-down from "Packets" to "AVG(Y Field)". In the Y-field box, enter a Wireshark display filter of the thing you want to graph. In our case, we want to see signal strength in dBm, so I put in wlan_radio.signal_dbm. (If you didn't know, when you highlight an item in the Packet Details window that Wireshark has a decoder for, it will show you relevant filter syntax in the status bar.)


Here is what the graph looks like. At left is line-of-sight, then behind the obstruction for about ten seconds. After that, I put the sniffer AP on a table and walked back to my workstation to stop the capture.


Note that the interval value is set for 1 second. This tells Wireshark to get the average value of signal strength for all packets on channel 11 over each 1 second period. With a beacon interval of 100ms, this should give you 10 samples for each SSID mapped to the radio. This is why you want more than one SSID; it gives you more samples to average over. If the Interval was set to 100ms, there would be points on the graph where there were no packets received during the interval. Wireshark considers this a value of zero, which I guess would be fine if we weren't working with negative numbers.
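The averaging idea can be sketched outside of Wireshark as well. The snippet below is an illustration with made-up sample values, not data from my capture: it groups signal readings into fixed intervals, takes a plain arithmetic mean of the dBm values in each one, and skips empty intervals instead of reporting them as zero.

```python
from collections import defaultdict
from statistics import mean


def average_by_interval(samples, interval_s=1.0):
    """Average signal strength (dBm) per time bucket, skipping empty buckets."""
    buckets = defaultdict(list)
    for timestamp, dbm in samples:
        buckets[int(timestamp // interval_s)].append(dbm)
    return {
        index * interval_s: mean(values)
        for index, values in sorted(buckets.items())
    }


# Hypothetical beacon samples: (seconds since start of capture, signal in dBm).
samples = [(0.1, -40), (0.6, -42), (1.2, -41), (3.4, -52), (3.9, -50)]

for start, avg in average_by_interval(samples).items():
    print(f"{start:>4.1f}s  {avg} dBm")
```

There is no entry for the 2 second bucket because no beacons fell in it, which avoids the misleading dip to zero described above.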

Repeat the process for channel 36. Click on the "Duplicate this graph" button, and change the display filter to wlan_radio.channel == 36. To make things easy to read, change the color of the line so it is distinguishable from the first one.


Eyeballing the graph, it looks like channel 11 encountered about 4 dB of loss from the obstruction, but channel 36 had a whopping 10 - 12 dB of loss.

I know it's not going to change the way Wi-Fi pros measure attenuation, but this was a fun way to visualize RF loss from obstructions using tools that anyone with a Cisco lightweight infrastructure can replicate.

Thursday, December 8, 2016

Opportunistic Key Caching - Fast roaming with OKC

For devices (and wireless networks) that support Opportunistic Key Caching, this non-standard fast-roaming technique can make roaming times very fast.

In a WPA2-Enterprise Wi-Fi network, a Pairwise Master Key (PMK) is created during the process of EAP authentication between the wireless client and the AP it is connecting to. The PMK represents the Robust Security Network Security Association (RSN-SA) between the client and the AP. The PMK is also used to create the Pairwise Transient Key (PTK), which is used to encrypt frames between the client and AP.

The PMK generated after a full EAP authentication is only good between the client and the AP it initially connected to. If the client roams to a new AP, a new PMK must be generated through the EAP process. Part of the EAP process includes the 4-way handshake, which generates the PTK for encrypting data. The first frame of the 4-way handshake, which is from the AP to the client, includes an identifier for the PMK, called the PMKID. The PMKID is simply a 128-bit hash of the PMK, the client's MAC address, and the AP's MAC address. Below is an example of a PMKID seen in a wireless packet capture.

Figure 1: PMKID Captured During 4-Way Handshake
If wireless clients and wireless distribution systems cache PMKs between clients and APs, the PMKID can be used when a client roams "back" to an AP that it had been authenticated to previously. This would speed up roaming "back" to an old AP, since the full EAP authentication would not need to take place; the PMK already exists. Just the 4-way handshake would be necessary to generate the PTK. Think of the scenario shown below, where a client roams between two APs.

Figure 2: Roaming Back to an Old Friend
When the client roamed "back" to AP1, the PMKID could be sent in the re-association request. The client already has PMK1, and if the wireless distribution system cached PMK1, the authentication could proceed directly to the 4-way handshake without a full EAP authentication.

This certainly helps, but only if the client roamed back to an old AP. It still needs to complete a full EAP authentication when roaming to AP2, which usually takes at least 200ms. This is where Opportunistic Key Caching comes in. OKC is a method to calculate a new PMK between a client and an AP that it had never authenticated to before. As long as the client had authenticated to one AP in the distribution system, a new PMK could be calculated, by both the client and the distribution system, without having to do a full EAP authentication. All it requires is that both the client and distribution system use the same mathematical formula to calculate the new PMK.
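To make the "same formula on both sides" idea concrete, here is a minimal Python sketch of how a PMKID can be derived. It uses the PMKID definition from IEEE 802.11 as I understand it: the first 128 bits of HMAC-SHA1 over the string "PMK Name", the AP's MAC address, and the client's MAC address, keyed with the PMK. OKC itself is not standardized, so reusing the cached PMK with a new AP's MAC address is an assumption about how typical implementations behave. The key and addresses are invented for illustration.

```python
import hashlib
import hmac


def pmkid(pmk: bytes, ap_mac: bytes, client_mac: bytes) -> bytes:
    """First 128 bits of HMAC-SHA1(PMK, "PMK Name" || AP MAC || client MAC)."""
    message = b"PMK Name" + ap_mac + client_mac
    return hmac.new(pmk, message, hashlib.sha1).digest()[:16]


# Hypothetical values for illustration only.
pmk = bytes(range(32))                    # stand-in for a 256-bit PMK
client = bytes.fromhex("02aabbccdd01")
ap1 = bytes.fromhex("02aabbccdd11")       # AP the client authenticated to
ap2 = bytes.fromhex("02aabbccdd22")       # AP the client has never visited

print("PMKID for AP1:", pmkid(pmk, ap1, client).hex())
print("PMKID for AP2:", pmkid(pmk, ap2, client).hex())
```

Because both the client and the infrastructure can run the same calculation with the target AP's address, they can arrive at a matching PMKID for an AP the client has never authenticated to.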

A sure fire way to tell that a client supports OKC is to look at the reassociation request it sends when roaming to an AP it had not been previously authenticated to. It will include a PMKID in the reassociation request, even though it had not established a PMK with that AP previously.

Figure 3: PMKID in Re-association Request
Note that this is not the same PMKID that is shown in Figure 1. At this point, if the wireless distribution system the client is connected to does not support OKC, a full EAP authentication will start. If the distribution system does support OKC, the 4-way handshake will start after the re-association response.
Figure 4: OKC In Action
In this example, the use of OKC results in a roam time of 36ms.

OKC is supported by default in recent versions of controller-based Cisco wireless solutions. You can watch the magic happen by using the "debug client <macaddress>" command from the CLI. When the client roams using OKC, you will see this in the output:

Figure 5: Computing New PMKID
If your wireless clients support it, OKC can be handy for making clients roam faster. Unfortunately, not all clients do. Most notably, OKC is not supported by any Apple iOS devices. The standard for fast roaming, 802.11r, results in roam times that can be even faster than OKC.

Thursday, November 24, 2016

Cisco Optimized Roaming: Client behavior with 11v vs. without 11v

With or Without V

In my previous blog entry, I discussed what is necessary to get BSS Transition Management working with Cisco controller-based Wi-Fi networks. In this entry, I wanted to present a comparison of client behavior when BSS Transition Management is enabled to when it is not enabled.

For you to see 11v frames from your Cisco network, either aggressive load balancing or optimized roaming must be enabled. For this discussion, I will focus on optimized roaming. The optimized roaming engine will keep track of client statistics, such as RSSI and data rate, and disassociate clients that don't meet configurable thresholds. If BSS Transition Management is enabled (at the WLAN level) in combination with optimized roaming, the AP will send a transition request to a client before disassociating it, giving the client time to roam to a better AP. If BSS Transition Management is not enabled, or the client does not support it, the client will simply be disassociated.

Before I dive into the packet captures, we need to discuss a specific detail of optimized roaming: the engine only looks at RSSI of data packets, not management or action. To see optimized roaming work, the client must be moving data.

The test client was my Windows 10 Dell laptop with an integrated Intel 8260 dual-band wireless adapter. The advanced configuration for the adapter has a parameter called "roaming aggressiveness," which has 5 options, from lowest to highest. According to Intel's documentation, the setting "lowest" means the client will not roam unless it loses connectivity. I set roaming aggressiveness to "lowest" for the tests, so the optimized roaming engine would try to get my client to roam before it decided to itself.

The test WLAN had an SSID of Test, 5 GHz only. The WLAN was configured for WPA2-Enterprise with Fast Transition. BSS Transition Management was on for the first test, then turned off for the second.

The test setup was identical to my first post: two APs in local mode and two APs in sniffer mode, watching the same channels as the local mode APs nearest to them.

Figure 1: Testing setup
I would start off by associating to the "Back" AP, then moving towards the "Front" AP. Because I had set the roaming aggressiveness of my client to its lowest setting, I could get to line-of-sight of the "Front" AP and still be connected to the "Back" AP. I would verify what AP my client was connected to by issuing a netsh wlan show interface command from a command prompt. The output of this command will show what channel the client is on, so I could tell what AP it was connected to.

With 11v BSS Transition Management


Once the client reaches a point where the optimized roaming engine determines it should roam, a transition management request is sent to it.

Figure 2: Transition Management 

You can see in Figure 2 that the client sends probes prior to sending the transition management response. I guess it wanted to confirm that there was an AP on the channel indicated in the transition request frame. Looking at the time stamps, it took about 50ms for the client to re-associate to another AP.

To be honest here, it looks like the client was not happy with the SNR it was seeing from the new AP. There are probe requests/responses on channel 64, which was the channel of the "Back" AP it had been associated to. This could explain why the roam took 50ms.

Without 11v BSS Transition Management 


In this test, my client was line of sight to the "Front" AP when the optimized roaming engine sent a disassociate frame.

Figure 3: Abrupt Disassociation

You can see from Figure 3 that it took 800ms for the client to realize what had happened and send out probe requests, then another 90ms to get connected. Total time from disassociation to re-association response is nearly 900ms. Luckily, it was able to re-associate without having to do a complete EAP authentication cycle, otherwise the roam would have taken about a full second.

Conclusion


While 900ms may be a tolerable roam time for a data client, it is too long for voice applications. Even if you only have data clients, sticky clients ruin the party for other clients by using low data rates and consuming more air time. If you are going to use Optimized Roaming, 11v BSS Transition Management offers a way to gracefully move sticky clients to a better AP.

Comments? Suggestions? Please leave a comment below or reach me on Twitter @GiantsNerd.