Discussion:
periodic hang of ath9k
(too old to reply)
Dave Taht
2014-07-13 19:18:46 UTC
Permalink
cc-ing ath9k-devel for this update on http://www.bufferbloat.net/issues/442

this bug, which some people (usually on macs with low signal strength)
can get to occur fairly rapidly, but I can't, is driving me 9 kinds of
crazy...

some new details below
Hi List, hi Dave,
I just had a case of devices on the 2.4GHz radio not connecting anymore (the 5GHz radio still worked).
\(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 13 stopped: 1
Yes, that is the one sure symptom of something like bug 442. As the
default in cerowrt is a queue depth of 12 for BE, this kind of implies
we have some sort of off-by-one or atomic update/race condition
problem somewhere.

Maybe setting it even lower (say, 4) and doing exaustive tests will
trigger the bug more often?

In the next release, I've given up on setting the queue depth
entirely, and have it set to the default of 123 (123, I think) - and I
HAVE seen it hang with it set that high too, and more recent
benchmarks show 48 as offering more throughput than 12, and the usual
terrible latency regardless.

(the right thing here is to rework queue handling entirely to work on
a per-sta basis, incorporate fq_codel like ideas down there, etc, but
we're
a long way from doing that as yet)
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
Note one of the devices was connected to the same radio before and got somehow forced to reconnect and failed to actually do so…
Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48 IEEE 802.11: authenticated
Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48 IEEE 802.11: associated (aid 2)
Sun Jul 13 19:15:08 2014 daemon.info hostapd: sw00: STA 10:68:3f:4b:0b:48 WPA: pairwise key handshake completed (RSN)
Sun Jul 13 19:15:08 2014 daemon.info dnsmasq-dhcp[2809]: DHCPREQUEST(sw00) 192.168.2.107 10:68:3f:4b:0b:48
Sun Jul 13 19:15:08 2014 daemon.info dnsmasq-dhcp[2809]: DHCPNAK(sw00) 192.168.2.107 10:68:3f:4b:0b:48 wrong address
Sun Jul 13 19:15:11 2014 daemon.info dnsmasq-dhcp[2809]: DHCPREQUEST(sw00) 192.168.2.107 10:68:3f:4b:0b:48
Sun Jul 13 19:15:11 2014 daemon.info dnsmasq-dhcp[2809]: DHCPNAK(sw00) 192.168.2.107 10:68:3f:4b:0b:48 wrong address
Sun Jul 13 19:15:13 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw00) 10:68:3f:4b:0b:48
Sun Jul 13 19:15:13 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) 172.30.42.90 10:68:3f:4b:0b:48
Sun Jul 13 19:15:17 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw00) 10:68:3f:4b:0b:48
Sun Jul 13 19:15:17 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) 172.30.42.90 10:68:3f:4b:0b:48
Sun Jul 13 19:15:26 2014 daemon.info dnsmasq-dhcp[2809]: DHCPDISCOVER(sw00) 10:68:3f:4b:0b:48
Sun Jul 13 19:15:26 2014 daemon.info dnsmasq-dhcp[2809]: DHCPOFFER(sw00) 172.30.42.90 10:68:3f:4b:0b:48
I reconnected the 2.4GHz radio from https://gw.home.lan:81/cgi-bin/luci/;stok=64f33ba722ed8a68b13ad5644e60629b/admin/network/network (by hitting the connect button for sw00)
(VO): qnum: 0 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(VI): qnum: 1 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BE): qnum: 2 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(BK): qnum: 3 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
(CAB): qnum: 8 qdepth: 0 ampdu-depth: 0 pending: 0 stopped: 0
So this seems to be “fixed” without having to reboot cerowrt. So we might consider working aroung this bug by checking /sys/kernel/debug/ieee80211/phy0/ath9k/queues repeatedly and reconnecting the 2.4GHz radio if the queue seems stopped?
I don't know what "reconnecting" actually triggers in the backend uci code.
(anyone? a network reload? what?)

Certainly if we could check for "stuckness" and then do the right
thing that might be a workaround.
Now I would like to know whether this actually is bug 442 (I seem to recall that other afflicted users needed to reboot cerowrt to get the radio back, or did they simply not try to just get sw00 unstuck in a less drastic manner?
No, most people just hit it with a large hammer and start over,
without letting me get a crack at it.
And/or maybe the re-connect only fixes some symptoms and the router will wedge layer on for good?) I attached the output of cerostats.sh just in case someone has an idea what to try next...
Damned if I know, but this is progress of a sort.
Best Regards
Sebastian
_______________________________________________
Cerowrt-devel mailing list
https://lists.bufferbloat.net/listinfo/cerowrt-devel
--
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
Ben Greear
2014-07-13 19:44:38 UTC
Permalink
Post by Dave Taht
cc-ing ath9k-devel for this update on http://www.bufferbloat.net/issues/442
this bug, which some people (usually on macs with low signal strength)
can get to occur fairly rapidly, but I can't, is driving me 9 kinds of
crazy...
I've mentioned this before..but I have a patch that works around this
problem:

I had merge issues, so now it's two patches until I have time to clean up
my tree:

http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.14.dev.y/.git;a=commit;h=579035d4b2b7110d67ff451e728709f5b2510ca0

http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.14.dev.y/.git;a=commit;h=ae6f95e8a35578e0668e47becb053b29009eec9a

I suspect this may cause memory leaks, as a test that causes these work-arounds to hit
often will run out of memory after a few days (200+ station vifs constantly doing traffic).
Could be lots of other code of course, and kmemleak shows no leaks, so I'm at a loss on
the mem leak issue.

If you can, try patching these in and see if that works around your tx hang problem.

I'd love to have a better fix, btw.

Thanks,
Ben
--
Ben Greear <***@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
Sujith Manoharan
2014-07-14 04:25:05 UTC
Permalink
Post by Dave Taht
cc-ing ath9k-devel for this update on http://www.bufferbloat.net/issues/442
this bug, which some people (usually on macs with low signal strength)
can get to occur fairly rapidly, but I can't, is driving me 9 kinds of
crazy...
Does stock OpenWrt also have this bug, or is this specific to Cerowrt ?

Sujith
Felix Fietkau
2014-07-14 10:48:30 UTC
Permalink
Post by Sujith Manoharan
Post by Dave Taht
cc-ing ath9k-devel for this update on http://www.bufferbloat.net/issues/442
this bug, which some people (usually on macs with low signal strength)
can get to occur fairly rapidly, but I can't, is driving me 9 kinds of
crazy...
Does stock OpenWrt also have this bug, or is this specific to Cerowrt ?
After receiving some useful debug output from Antonio Quartulli (who
was able to reproduce it easily), I believe that I have tracked down
this issue to some bugs in counting pending tx frames. When frames get
pushed through the U-APSD queue for PS responses and dropped there due
to retransmissions, the counter probably does not get decremented
properly.

I've come up with an untested patch that should fix this codepath
and make it easier to verify.

If you're affected by the bug, please test this patch:
---
--- a/drivers/net/wireless/ath/ath9k/ath9k.h
+++ b/drivers/net/wireless/ath/ath9k/ath9k.h
@@ -185,7 +185,8 @@ struct ath_atx_ac {

struct ath_frame_info {
struct ath_buf *bf;
- int framelen;
+ u16 framelen;
+ s8 txq;
enum ath9k_key_type keytype;
u8 keyix;
u8 rtscts_rate;
--- a/drivers/net/wireless/ath/ath9k/xmit.c
+++ b/drivers/net/wireless/ath/ath9k/xmit.c
@@ -147,15 +147,13 @@ static void ath_set_rates(struct ieee802
static void ath_txq_skb_done(struct ath_softc *sc, struct ath_txq *txq,
struct sk_buff *skb)
{
- int q;
-
- q = skb_get_queue_mapping(skb);
- if (txq == sc->tx.uapsdq)
- txq = sc->tx.txq_map[q];
+ struct ath_frame_info *fi = get_frame_info(skb);
+ int q = fi->txq;

- if (txq != sc->tx.txq_map[q])
+ if (q < 0)
return;

+ txq = sc->tx.txq_map[q];
if (WARN_ON(--txq->pending_frames < 0))
txq->pending_frames = 0;

@@ -1999,6 +1997,7 @@ static void setup_frame_info(struct ieee
an = (struct ath_node *) sta->drv_priv;

memset(fi, 0, sizeof(*fi));
+ fi->txq = -1;
if (hw_key)
fi->keyix = hw_key->hw_key_idx;
else if (an && ieee80211_is_data(hdr->frame_control) && an->ps_key > 0)
@@ -2150,6 +2149,7 @@ int ath_tx_start(struct ieee80211_hw *hw
struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
struct ieee80211_sta *sta = txctl->sta;
struct ieee80211_vif *vif = info->control.vif;
+ struct ath_frame_info *fi = get_frame_info(skb);
struct ath_softc *sc = hw->priv;
struct ath_txq *txq = txctl->txq;
struct ath_atx_tid *tid = NULL;
@@ -2170,11 +2170,13 @@ int ath_tx_start(struct ieee80211_hw *hw
q = skb_get_queue_mapping(skb);

ath_txq_lock(sc, txq);
- if (txq == sc->tx.txq_map[q] &&
- ++txq->pending_frames > sc->tx.txq_max_pending[q] &&
- !txq->stopped) {
- ieee80211_stop_queue(sc->hw, q);
- txq->stopped = true;
+ if (txq == sc->tx.txq_map[q]) {
+ fi->txq = q;
+ if (++txq->pending_frames > sc->tx.txq_max_pending[q] &&
+ !txq->stopped) {
+ ieee80211_stop_queue(sc->hw, q);
+ txq->stopped = true;
+ }
}

if (txctl->an && ieee80211_is_data_present(hdr->frame_control))
Stephen Hemminger
2014-07-14 18:51:12 UTC
Permalink
I think the stock netgear firmware has similar issues, 2G wireless is
flaky when the Mac's are using it.
Dave Taht
2014-07-14 23:10:55 UTC
Permalink
I have little doubt that we have been coping with more than one bug.
Hello David & list,
Do you think we could get a build with CONFIG_PACKAGE_ATH_DEBUG
enabled? I've been following OpenWRT ticket #15320 and it looks like
this would be the way to go in order to get field logs of WiFi issues
from end-users.
In the meantime, here's the latest WiFi failures on my end (running
[14378.023437] ath: phy0: Failed to stop TX DMA, queues=0x004!
[15130.140625] ath: phy0: Failed to stop TX DMA, queues=0x004!
[15349.164062] ath: phy0: Failed to stop TX DMA, queues=0x004!
[15349.179687] ath: phy0: DMA failed to stop in 10 ms AR_CR=0x00000024
AR_DIAG_SW=0x42000020 DMADBG_7=0x000084c0
[15349.191406] ath: phy0: Could not stop RX, we could be confusing the
DMA engine when we start RX up
[16886.886718] ath: phy0: Failed to stop TX DMA, queues=0x004!
[19839.468750] ath: phy0: Failed to stop TX DMA, queues=0x005!
[20286.019531] ath: phy0: Failed to stop TX DMA, queues=0x004!
[20825.996093] ath: phy0: Failed to stop TX DMA, queues=0x005!
[48749.316406] ath: phy0: Failed to stop TX DMA, queues=0x004!
[48749.433593] ath: phy0: Failed to stop TX DMA, queues=0x004!
-------------
Baseband Hang: 0
Baseband Watchdog: 0
Fatal HW Error: 0
TX HW error: 0
Transmit timeout: 0
TX Path Hang: 1
PLL RX Hang: 0
MAC Hang: 13
Stuck Beacon: 7
MCI Reset: 0
Calibration error: 1
-------------
Reply from 74.125.225.112: bytes=32 time=39ms TTL=50
Reply from 74.125.225.112: bytes=32 time=114ms TTL=50
Reply from 74.125.225.112: bytes=32 time=42ms TTL=50
Reply from 74.125.225.112: bytes=32 time=37ms TTL=50
Reply from 74.125.225.112: bytes=32 time=40ms TTL=50
Reply from 74.125.225.112: bytes=32 time=37ms TTL=50
Reply from 74.125.225.112: bytes=32 time=1323ms TTL=50
Reply from 74.125.225.112: bytes=32 time=37ms TTL=50
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from 74.125.225.112: bytes=32 time=258ms TTL=50
Request timed out.
Reply from 74.125.225.112: bytes=32 time=1043ms TTL=50
Request timed out.
Reply from 74.125.225.112: bytes=32 time=206ms TTL=50
Reply from 74.125.225.112: bytes=32 time=91ms TTL=50
Reply from 74.125.225.112: bytes=32 time=38ms TTL=50
Reply from 74.125.225.112: bytes=32 time=38ms TTL=50
Reply from 74.125.225.112: bytes=32 time=40ms TTL=50
On Mon, Jul 14, 2014 at 2:51 PM, Stephen Hemminger
Post by Stephen Hemminger
I think the stock netgear firmware has similar issues, 2G wireless is
flaky when the Mac's are using it.
_______________________________________________
Cerowrt-devel mailing list
https://lists.bufferbloat.net/listinfo/cerowrt-devel
--
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
Continue reading on narkive:
Loading...