Stam, Michel [FINT]
2016-07-06 12:32:02 UTC
Dear list,
I've been trying to tackle an annoying bug I've been having on a mesh of 2 units with AuthSAE enabled for the past two weeks, but I cannot seem to find what causes it.
It happens when I run an iperf test between the units. At or around the time the SAE lifetime expires, a rekey occurs, after which traffic between the units stops. Wireshark/tcpdump do seem to indicate incoming traffic when observing through a monitor interface.
Sometimes packets start arriving again about one key lifetime later. This does not give a very stable mesh though, as normally the rekey lifetime is 3600 seconds (which means the link is effectively down for an hour).
When the problem occurs, the "iw dev mesh0 station dump" command shows rapidly increasing counters for "rx drop misc" (NL80211_STA_INFO_RX_DROP_MISC). I can't be certain that this is related: the value also increases on a working link, although it seems to do so more slowly.
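To put a number on "quickly increasing", the counter can be sampled at a fixed interval and turned into a per-second rate, which makes it easier to line the jump up with the rekey time. The following is only a minimal sketch: it assumes the usual "Station <mac> (on <ifname>)" header and "rx drop misc:" field layout printed by iw, and the function names are made up for illustration.

```python
import re
import subprocess
import time

# Assumed layout of "iw dev <ifname> station dump" output.
STATION_RE = re.compile(r"^Station\s+([0-9a-fA-F:]{17})\s+\(on\s+(\S+)\)")
DROP_RE = re.compile(r"^\s*rx drop misc:\s*(\d+)")


def parse_rx_drop_misc(dump_text):
    """Map each peer MAC address to its 'rx drop misc' counter."""
    counters = {}
    current = None
    for line in dump_text.splitlines():
        m = STATION_RE.match(line)
        if m:
            current = m.group(1).lower()
            continue
        m = DROP_RE.match(line)
        if m and current is not None:
            counters[current] = int(m.group(1))
    return counters


def drop_rates(before, after, interval_s):
    """Drops per second for every peer present in both samples."""
    return {
        mac: (after[mac] - before[mac]) / interval_s
        for mac in after
        if mac in before
    }


def read_station_dump(ifname="mesh0"):
    """Run iw and return its raw text output."""
    return subprocess.run(
        ["iw", "dev", ifname, "station", "dump"],
        check=True, capture_output=True, text=True,
    ).stdout


def watch(ifname="mesh0", interval_s=5.0):
    """Print a timestamped drop rate per peer until interrupted."""
    previous = parse_rx_drop_misc(read_station_dump(ifname))
    while True:
        time.sleep(interval_s)
        current = parse_rx_drop_misc(read_station_dump(ifname))
        rates = drop_rates(previous, current, interval_s)
        for mac, rate in sorted(rates.items()):
            stamp = time.strftime("%H:%M:%S")
            print(f"{stamp} {mac} rx drop misc: {rate:.1f}/s")
        previous = current
```

Calling watch() on both units while iperf runs gives a timestamped log that can be compared against the rekey messages in the meshd-nl80211 output.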
If I leave the link idle (no iperf test, just some pings), then this problem does not seem to occur. This makes me believe it is a race of sorts.
Looking at the debug traces from meshd-nl80211, I can find no fault. I also looked at the key material sent down to the ath9k driver (printk's in the kernel driver), but reading those registers back returns exactly what was written, so I see no fault there either.
Both units use an ath9k Atheros card; one is an AzureWave AR5B95, the other is a Compex WLE200N2-23. I have also observed the problem on Compex WLE350NX cards, so I am guessing this is not hardware related.
I set up both units with this configuration: meshd.txt<https://github.com/cozybit/authsae/files/330064/meshd.txt>. I'm using the latest GIT from AuthSAE.
The kernel I use is 4.4.11, but I've seen the same problem with 3.10.49.
The compat-wireless 2016-01-10 driver set used by OpenWRT seems to have the same problem with the old 3.10.34 kernel I run on that system.
The iperf setup is (using 2.0.5):
* One system running iperf -s -u -p 6969 -i 5
* One system running iperf -c -u -p 6969 -i 5 -t 86400 -b 100M
I create the mesh interfaces by:
* iw phy phy0 interface add mesh0 type mp
* ifconfig mesh0 IP MASK up
* meshd-nl80211 -c meshd.txt -i mesh0
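For anyone who wants to repeat the bring-up on several units, the three steps above can be wrapped in a small helper. This is a sketch only: the function names are hypothetical, the address and netmask are left as parameters because the values are not given above, and it assumes the ifconfig form "ifconfig <ifname> <ip> netmask <mask> up".

```python
import shlex
import subprocess


def mesh_bringup_commands(ip, netmask, phy="phy0", ifname="mesh0",
                          config="meshd.txt"):
    """Return the three bring-up commands as argv lists."""
    return [
        ["iw", "phy", phy, "interface", "add", ifname, "type", "mp"],
        # Assumed ifconfig syntax; IP and MASK are placeholders in the text.
        ["ifconfig", ifname, ip, "netmask", netmask, "up"],
        ["meshd-nl80211", "-c", config, "-i", ifname],
    ]


def bring_up(ip, netmask, dry_run=True, **kwargs):
    """Print each command; execute it as well when dry_run is False."""
    for argv in mesh_bringup_commands(ip, netmask, **kwargs):
        print(" ".join(shlex.quote(arg) for arg in argv))
        if not dry_run:
            # If the daemon stays in the foreground, the last call blocks
            # for as long as it runs.
            subprocess.run(argv, check=True)
```

With the default dry_run=True the helper only prints the commands, so the sequence can be checked before it is run as root on a unit.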
Right now the key lifetime is at 60 seconds for problem reproduction, but I have seen the same problem on a link with a key lifetime of 3600 seconds; the link then dies at that time.
Loading ath9k with nohwcrypt=1 solves the problem, but costs more CPU cycles.
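When comparing runs with and without hardware crypto, it helps to confirm which mode the loaded module is actually in. The sketch below assumes the parameter is readable under /sys/module/ath9k/parameters/nohwcrypt, as Linux does for module parameters registered with read permission. The helper name and the sysfs_root argument are hypothetical and exist only to keep the function testable.

```python
from pathlib import Path


def ath9k_nohwcrypt(sysfs_root="/sys"):
    """Return the nohwcrypt value of the loaded ath9k module.

    Returns None if the module is not loaded or does not expose the
    parameter at the assumed sysfs path.
    """
    path = Path(sysfs_root) / "module" / "ath9k" / "parameters" / "nohwcrypt"
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        return None
```

A return value of 1 would mean the module was loaded with software crypto (for example via "modprobe ath9k nohwcrypt=1"), 0 means hardware crypto is in use.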
Now I've made a patch which calls ath9k_queue_reset every time the key is set. This seems to get rid of the link dying on me, at the cost of a lot of authentication traffic. This is a very heavy-handed approach, and I'm fairly certain this is not gonna work in a production environment. See here for the ugly hack: https://github.com/cozybit/authsae/files/347910/ath9k-install_key-buckshot.diff.txt.
This issue has also been posted as: https://github.com/cozybit/authsae/issues/42
Someone on the AuthSAE GitHub page mentioned that this is apparently a long-standing issue with the driver, which was reported before as https://www.mail-archive.com/ath9k-devel%40lists.ath9k.org/msg13595.html.
Is anyone able to assist me or give me a couple of pointers?
Regards,
Michel Stam