[ath9k-devel] Sparklan WPEA-121N AR9382 168c:abcd

Discussion:

Steffen Dettmer

2013-03-27 18:34:26 UTC

Hi,

some time ago there was a thread "Sparklan WPEA-121N AR9382 168c:abcd" about this device being erroneously reported with device ID 0xabcd. EEPROM issues were suspected, and BIOS issues were reported that could cause this effect (by resetting the PCI bus at system power-on). The proposed workaround, adding the wrong ID to the driver so that it loads, was reported to work.

I'm facing the same situation with embedded devices (where BIOS updates are probably very difficult, at the least) and a WPEA-127N. I would like to know whether there are new findings, and to share mine in case they are of interest.

Is there any news on that?

Is the proposed workaround of adding 0xabcd to the driver still the best way of handling this?

On my board it happens /from time to time/ that the device reports 0xabcd - but not always.

I ran 20 tests and saw 4 such failures. All of them appeared after a cold boot, none after a warm boot. After a cold boot, sometimes one of the two installed devices appeared with the wrong device ID and the other correctly, and at other times both were working. Of course the number of tests is insufficient to draw conclusions; I mention it just in case it rings a bell.

It is an Intel Atom board running Linux (for example, Debian 7). Can I provide information that could help (and if so, how do I get it)?

Best regards,
Steffen

Some test results:

***@nomad:~# lspci|grep -i ath
01:00.0 Ethernet controller: Atheros Communications Inc. Device abcd (rev 01)
07:00.0 Ethernet controller: Atheros Communications Inc. Device abcd (rev 01)
***@nomad:~# grep '' /sys/devices/pci0000:00/0000:00:1c.*/0000:0{1,7}:00.0/{vendor,device}
/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/vendor:0x168c
/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/device:0xabcd
/sys/devices/pci0000:00/0000:00:1c.2/0000:07:00.0/vendor:0x168c
/sys/devices/pci0000:00/0000:00:1c.2/0000:07:00.0/device:0xabcd
***@nomad:~# reboot
[ssh...]
***@nomad:~# lspci|grep -i ath
01:00.0 Network controller: Atheros Communications Inc. AR9300 Wireless LAN adaptor (rev 01)
07:00.0 Network controller: Atheros Communications Inc. AR9300 Wireless LAN adaptor (rev 01)
***@nomad:~# grep '' /sys/devices/pci0000:00/0000:00:1c.*/0000:0{1,7}:00.0/{vendor,device}
/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/vendor:0x168c
/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/device:0x0030
/sys/devices/pci0000:00/0000:00:1c.2/0000:07:00.0/vendor:0x168c
/sys/devices/pci0000:00/0000:00:1c.2/0000:07:00.0/device:0x0030

***@nomad:/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0# grep '' *
broken_parity_status:0
class:0x020000
Binary file config matches
consistent_dma_mask_bits:32
device:0xabcd
dma_mask_bits:32
enable:0
grep: firmware_node: Is a directory
irq:10
local_cpulist:0-31
local_cpus:ffffffff
modalias:pci:v0000168Cd0000ABCDsv00000000sd00000000bc02sc00i00
grep: power: Is a directory
grep: remove: Permission denied
grep: rescan: Permission denied
grep: reset: Permission denied
resource:0x00000000fdfe0000 0x00000000fdffffff 0x0000000000140204
resource:0x0000000000000000 0x0000000000000000 0x0000000000000000
resource:0x0000000000000000 0x0000000000000000 0x0000000000000000
resource:0x0000000000000000 0x0000000000000000 0x0000000000000000
resource:0x0000000000000000 0x0000000000000000 0x0000000000000000
resource:0x0000000000000000 0x0000000000000000 0x0000000000000000
resource:0x00000000fdfd0000 0x00000000fdfdffff 0x000000000004e200
resource:0x0000000000000000 0x0000000000000000 0x0000000000000000
resource:0x0000000000000000 0x0000000000000000 0x0000000000000000
resource:0x0000000000000000 0x0000000000000000 0x0000000000000000
resource:0x0000000000000000 0x0000000000000000 0x0000000000000000
resource:0x0000000000000000 0x0000000000000000 0x0000000000000000
resource:0x0000000000000000 0x0000000000000000 0x0000000000000000
grep: resource0: Input/output error
grep: rom: Invalid argument
grep: subsystem: Is a directory
subsystem_device:0x0000
subsystem_vendor:0x0000
uevent:PCI_CLASS=20000
uevent:PCI_ID=168C:ABCD
uevent:PCI_SUBSYS_ID=0000:0000
uevent:PCI_SLOT_NAME=0000:01:00.0
uevent:MODALIAS=pci:v0000168Cd0000ABCDsv00000000sd00000000bc02sc00i00
vendor:0x168c
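
The check done by hand above can be scripted. The following is a minimal sketch, not part of the original message: it walks a sysfs PCI device directory and prints every function that reports vendor 0x168c together with the bogus device ID 0xabcd. The function name `find_bogus_atheros` and the optional root argument are assumptions made for illustration.

```shell
#!/bin/sh
# List PCI functions that report vendor 0x168c together with the bogus
# device ID 0xabcd. The first argument is the sysfs device directory;
# it defaults to /sys/bus/pci/devices and can point at a test tree.
# (Function name and argument are illustrative, not from the thread.)
find_bogus_atheros() {
    root=${1:-/sys/bus/pci/devices}
    for dev in "$root"/*; do
        # Skip entries without readable vendor/device attributes.
        [ -r "$dev/vendor" ] && [ -r "$dev/device" ] || continue
        vendor=$(cat "$dev/vendor")
        device=$(cat "$dev/device")
        if [ "$vendor" = "0x168c" ] && [ "$device" = "0xabcd" ]; then
            # Print the slot name, e.g. 0000:01:00.0
            basename "$dev"
        fi
    done
}
```

Called without arguments after boot, `find_bogus_atheros` prints one slot name per affected function and nothing when all IDs look sane, so it can be used as a condition in a boot-time check.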

Adrian Chadd

2013-03-27 21:31:27 UTC

Hi,

The general consensus at work is - BIOSes are buggy and don't
necessarily reset the PCI bus correctly.

So either you can do your own PCI bus reset post-boot (and
re-enumerate all the PCI devices, including initialising their BARs)
or smack your vendor to fix their BIOSes. I can't really make any
further suggestions besides that.

Adrian

_______________________________________________
ath9k-devel mailing list
https://lists.ath9k.org/mailman/listinfo/ath9k-devel

Peter Stuge

2013-03-27 21:57:11 UTC

Post by Adrian Chadd
The general consensus at work is - BIOSes are buggy

That is very true..

Post by Adrian Chadd
and don't necessarily reset the PCI bus correctly.

..but this doesn't make any sense at all.

Post by Adrian Chadd
So either you can do your own PCI bus reset post-boot

What *exactly* is meant by "PCI bus reset" here? I ask because those
words don't map to any of the tasks of PC firmware.

Post by Adrian Chadd
I can't really make any further suggestions besides that.

I'm afraid "PCI bus reset" is no suggestion, because it means
nothing. Please go into (much) more detail about what the hardware
requires.

Thanks

//Peter

Adrian Chadd

2013-03-27 22:33:58 UTC

Sure, here's what's going on:

* There's a PCI bus reset. It's a pin. On the PCI bus.
* The BIOS can yank that down to reset all the devices.
* There are timing requirements for how long that pin is held down
to reset, and for its release.
* After the PCI bus is reset, the Atheros MAC initialises the PCI
space by reading a bunch of values from EEPROM/OTP and writing them
into the register space. Most of these are PCI space registers but
there can be others.
* Some vendors do daft things, like multiple quick PCI bus resets
back-to-back rather than doing a reset and waiting for whatever the
standard requires or the best practice is; or just asserting reset
briefly rather than holding it down for the required time;
* .. and this can interrupt / confuse the MAC during this whole
register initialisation path.

So, the "quick" fix is to re-reset the PCI slot or the PCI bus. But I
think that requires you to take care of PCI device resource allocation
and enumeration; which the Linux kernel may or may not do. For
cardbus/expresscard devices there's some resource allocation going on,
but not necessarily for always-attached cards.

The real fix is to smack the heck out of BIOS writers who do strange
and wonderful crap in their BIOS when resetting and enumerating PCI
devices.

Adrian
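
For reference, Linux does expose per-device `reset` and `remove` files and a bus-wide `rescan` file in sysfs (they show up in the directory listing earlier in the thread). The following is a minimal sketch of driving them for one function, under assumptions: the function name `pci_reset_and_rescan` and the overridable root argument are made up for illustration, and nothing in this thread confirms that such a reset makes the card reload its IDs from EEPROM/OTP.

```shell
#!/bin/sh
# Ask the kernel to reset one PCI function, drop it, and rescan the bus.
# $1: slot name such as 0000:01:00.0
# $2: PCI sysfs root, defaults to /sys/bus/pci (overridable for dry runs)
# (Function name and arguments are illustrative, not from the thread.)
pci_reset_and_rescan() {
    slot=$1
    pci_root=${2:-/sys/bus/pci}
    dev="$pci_root/devices/$slot"
    [ -d "$dev" ] || { echo "no such device: $slot" >&2; return 1; }
    # 'reset' is only present when the kernel knows a reset method
    # for this function; a failed reset is reported but not fatal.
    if [ -e "$dev/reset" ]; then
        echo 1 > "$dev/reset" || echo "reset failed, continuing" >&2
    fi
    # Detach the function, then let the kernel enumerate the bus again.
    echo 1 > "$dev/remove" || return 1
    echo 1 > "$pci_root/rescan"
}
```

On a real system this needs root, for example `pci_reset_and_rescan 0000:01:00.0`. Pointing the second argument at a scratch directory makes a harmless dry run that only creates the three files.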

Michael Schwingen

2013-03-27 22:46:40 UTC

Post by Adrian Chadd
* There's a PCI bus reset. It's a pin. On the PCI bus.
* The BIOS can yank that down to reset all the devices.
* There are timing requirements for how long that pin is held down
to reset, and for its release.
* After the PCI bus is reset, the Atheros MAC initialises the PCI
space by reading a bunch of values from EEPROM/OTP and writing them
into the register space. Most of these are PCI space registers but
there can be others.
* Some vendors do daft things, like multiple quick PCI bus resets
back-to-back rather than doing a reset and waiting for whatever the
standard requires or the best practice is; or just asserting reset
briefly rather than holding it down for the required time;
* .. and this can interrupt / confuse the MAC during this whole
register initialisation path.

I have had this on an embedded design - IIRC with an AR5414, back when
Atheros switched from 3-wire EEPROMs to I2C EEPROMs on the MiniPCI modules.

If you do a PCI reset just at the time when the MAC is doing an I2C
read, the I2C EEPROM will hang in the middle of a bus cycle, with no
possibility to reset it when the MAC does the next read access, so at
least the first read will get corrupt data.

AFAIK, the PCI standard does not forbid this (there are only minimum
times for *assertion* of the reset signal), so technically, the card
violates the PCI spec if it can't cope with two PCI resets in direct succession.

However, I would consider this really bad practice.

In our case, inserting a minimum delay between the point where the
hardware de-asserts reset and the point where the code re-asserts it
(because it might be a warm boot) fixed the problem reliably.

cu
Michael

Peter Stuge

2013-03-27 23:11:53 UTC

Post by Michael Schwingen
If you do a PCI reset just at the time when the MAC is doing an I2C
read, the I2C EEPROM will hang in the middle of a bus cycle, with no
possibility to reset it when the MAC does the next read access, so at
least the first read will get corrupt data.

Yes, that makes sense.

Post by Michael Schwingen
AFAIK, the PCI standard does not forbid this (there are only minimum
times for *assertion* of the reset signal), so technically, the card
violates the PCI spec if it can't cope with two PCI resets in direct succession.

I agree.

Post by Michael Schwingen
However, I would consider this really bad practice.

I agree that violating the spec is bad practice. I don't agree that
permitted reset patterns are bad practice. Especially I do not agree
that doing anything quicker than normal, while staying compliant, is
bad practice.

Post by Michael Schwingen
In our case, inserting a minimum delay between the point where the
hardware de-asserts reset and the point where the code re-asserts it
(because it might be a warm boot) fixed the problem reliably.

Thanks for the detailed information!

It makes perfect sense that the I²C transaction would be interrupted.

It would be very simple to investigate that on problematic hardware
with something quite low cost such as the $50 Openbench Logic Sniffer
or even a Logic Shrimp.

//Peter

Peter Stuge

2013-03-27 23:04:02 UTC

Post by Adrian Chadd
the "quick" fix is to re-reset the PCI slot or the PCI bus.

Read the quote from Daniel's email again. It explains how that caused
the problem.

In his bad case there was a reset at time 1 and another reset at time 2.

Removing the reset at time 1 and keeping an unchanged reset at time 2
made the problem disappear.

It seems that the hardware could not handle the reset at time 1,
presumably because the reset was incorrect per the specification.

The reset at time 2 is presumably correct, since things work with it.

If something (the reset at time 1) is able to screw up hardware so
badly that even a correct reset (time 2) does not *actually* reset
the hardware then I would consider that a very serious bus IP
problem in the hardware.

It would be interesting to know if this is *really* the problem. I'm
not at all sure.

Post by Adrian Chadd
PCI device resource allocation and enumeration; which the Linux
kernel may or may not do.

It does not.

Post by Adrian Chadd
The real fix is to smack the heck out of BIOS writers who do
strange and wonderful crap in their BIOS when resetting

Enumeration comes much later. The only two possibilities are a. the
reset at time 1 violated the specification and b. the hardware
doesn't handle multiple resets reliably.

It would be good to get more detail about what exactly makes the
hardware fail to initialize registers from the EEPROM.

I don't think there's such a thing as "best practise" on a bus.
Either the spec is followed (by everyone) or it isn't. :)

//Peter

Adrian Chadd

2013-03-28 01:02:06 UTC

Post by Peter Stuge
If something (the reset at time 1) is able to screw up hardware so
badly that even a correct reset (time 2) does not *actually* reset
the hardware then I would consider that a very serious bus IP
problem in the hardware.

Hey, I'm just a programmer. :-) I'm just saying what I've seen and read.

Post by Peter Stuge
Enumeration comes much later. The only two possibilities are a. the
reset at time 1 violated the specification and b. the hardware
doesn't handle multiple resets reliably.

It's not handling the back-to-back resets well.

Post by Peter Stuge
It would be good to get more detail about what exactly makes the
hardware fail to initialize registers from the EEPROM.

Well, for failing units I would like to know if they have EEPROM or
not. There's a small amount of one-time programmable (OTP) PROM that
we can store configuration bits in. I'd be interested to know if the
failing setup occurs with units that _don't_ have EEPROM and are just
using the OTP. And yes, the OTP controller AFAIK is also connected via
some kind of shifting interface; it's not directly memory mapped. So I
guess it's plausible that it's also failing.

And thanks for the more thorough description. I'll poke the hardware
people and see what the story is. I'm kinda surprised that this hasn't
been fixed in subsequent chips but chances are it boils down to "no one
expects such unpredictable system hardware." :-)

adrian

Daniel Smith

2013-04-01 13:05:50 UTC

Post by Peter Stuge

Post by Adrian Chadd
the "quick" fix is to re-reset the PCI slot or the PCI bus.

It wasn't that it could not handle the reset at time 1, but that the reset
at time 2 was causing the issue Michael explained, hanging the EEPROM. So
it is not that either reset was more appropriate than the other, but that
for this BIOS implementation it was better to remove the reset at time 1
and keep the one at time 2, since one reset really was needed.

Post by Peter Stuge
The reset at time 2 is presumably correct, since things work with it.
If something (the reset at time 1) is able to screw up hardware so
badly that even a correct reset (time 2) does not *actually* reset
the hardware then I would consider that a very serious bus IP
problem in the hardware.
It would be interesting to know if this is *really* the problem. I'm
not at all sure.

Post by Adrian Chadd
PCI device resource allocation and enumeration; which the Linux
kernel may or may not do.

It does not.

Post by Adrian Chadd
The real fix is to smack the heck out of BIOS writers who do
strange and wonderful crap in their BIOS when resetting

Enumeration comes much later. The only two possibilities are a. the
reset at time 1 violated the specification and b. the hardware
doesn't handle multiple resets reliably.

The vendor never mentioned whether this was out of spec or if the card was
not compliant, but I can say that this was not the first issue we had run
into with a BIOS. Another instance was the assumption that no one would
ever have more than 20 PCIe devices connected to the bus. This was an
artificial limit imposed by the BIOS writer that did technically violate
the spec.

Post by Peter Stuge
It would be good to get more detail about what exactly makes the
hardware fail to initialize registers from the EEPROM.
I don't think there's such a thing as "best practise" on a bus.
Either the spec is followed (by everyone) or it isn't. :)
//Peter

Steffen Dettmer

2013-03-28 15:04:36 UTC

Hi all,

thanks for all your replies. Let me share my findings just in case
they help.

Post by Adrian Chadd
The general consensus at work is - BIOSes are buggy and don't
necessarily reset the PCI bus correctly.
So either you can do your own PCI bus reset post-boot (and
re-enumerate all the PCI devices, including initialising their
BARs) or smack your vendor to fix their BIOSes. I can't really
make any further suggestions besides that.

I talked with an expert of my unit about "resetting PCI express
cards". The units have a special controller (I^2C) able to power
off and power on the card slots. I was told that this does not
handle the PCI reset line correctly ("leaves it open"), but makes
"a hard cut to the 3 volts" (I hope I repeat it correctly).

For my WPEA-121N cards, such a power cycle in my tests so far
worked around the issue.

I tested ~30 main unit power cycles where I had 4 occurrences of
the issue. No extra steps were needed (Linux 3.2 automatically
detected the device correctly after each slot power on, hundreds of
slot power cycles tested). So fine for me.

***@nomad:~# lspci|grep -i ath
01:00.0 Ethernet controller: Atheros Communications Inc. Device abcd (rev 01)
***@nomad:~# i2cset -y 14 0x20 0x0
***@nomad:~# sleep 1
***@nomad:~# i2cset -y 14 0x20 0x1f
***@nomad:~# lspci|grep -i ath
01:00.0 Network controller: Atheros Communications Inc. AR9300 Wireless LAN adaptor (rev 01)

I also tried:

***@nomad:~# echo "0" > /sys/bus/pci/slots/1/power
***@nomad:~# echo "1" > /sys/bus/pci/slots/1/power

This makes the device disappear temporarily but does not have
the desired effect of "fixing" the device ID.

Happy Easter!

Regards
Steffen
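
The manual workaround in this message can be wrapped in a retry loop. The following is a minimal sketch under stated assumptions: the `i2cset` invocation (bus 14, address 0x20, values 0x0 and 0x1f) is copied from the transcript above and is specific to that board, while the function names, the retry count and the `SETTLE_SECONDS` delay are made up for illustration.

```shell
#!/bin/sh
# Board-specific slot power cycle, copied from the transcript above.
# Bus 14, address 0x20 and the values 0x0 / 0x1f belong to that one
# board; they are not a general interface.
slot_power_cycle() {
    i2cset -y 14 0x20 0x0
    sleep 1
    i2cset -y 14 0x20 0x1f
}

# True when the device directory exists and its ID is not the bogus 0xabcd.
device_id_ok() {
    id=$(cat "$1/device" 2>/dev/null) || return 1
    [ "$id" != "0xabcd" ]
}

# Power cycle the slot until the device ID looks sane.
# $1: sysfs device directory, $2: maximum number of power cycles (default 3).
# SETTLE_SECONDS is an assumed re-detection delay, not a value from the thread.
recover_device_id() {
    dev=$1
    tries=${2:-3}
    while ! device_id_ok "$dev"; do
        [ "$tries" -gt 0 ] || return 1
        slot_power_cycle
        sleep "${SETTLE_SECONDS:-2}"
        tries=$((tries - 1))
    done
    return 0
}
```

A boot script could call `recover_device_id /sys/bus/pci/devices/0000:01:00.0 3` and log a warning when it returns non-zero. On other boards `slot_power_cycle` would have to be replaced with whatever controls slot power there.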

Michael Schwingen

2013-03-28 18:44:51 UTC

Post by Steffen Dettmer
I talked with an expert of my unit about "resetting PCI express
cards". The units have a special controller (I^2C) able to power
off and power on the card slots. I was told that this does not
handle the PCI reset line correctly ("leaves it open"), but makes
"a hard cut to the 3 volts" (I hope I repeat it correctly).
For my WPEA-121N cards, such a power cycle in my tests so far
worked around the issue.

Even if it works, it is not PCI(e) compliant. When physically powering
up a PCI device, you *have* to assert the reset signal to the slot.

cu
Michael