I’m seeing interrupt storm errors. I’ve also noticed that my USB 3.0 GoFlex 2TB drive (which I have to brag I bought for $67 back in 2009 or so) seems to be intermittently absent from my ZFS pool.
Here’s a vmstat:
root@server ~ >vmstat -i
interrupt total rate
irq1: atkbd0 3163 0
irq16: ehci0 xhci1 4657985391 7471
irq18: xhci0 305433357 489
irq19: atapci0++ 37971283 60
irq23: ehci1 1281657 2
cpu0:timer 3193627253 5122
irq264: ahci0 9319576 14
irq265: hdac0 100 0
irq266: em0:rx 0 165298952 265
irq267: em0:tx 0 29987834 48
irq268: em0:link 9 0
cpu1:timer 483764624 775
cpu7:timer 546385426 876
cpu5:timer 533264630 855
cpu4:timer 577926662 927
cpu3:timer 663537088 1064
cpu6:timer 534384348 857
cpu2:timer 487291455 781
Total 12227462808 19613
Notice the crazy interrupt rate of more than 7,000 per second on irq16. One downside of vmstat -i is that it reports the rate averaged over the whole uptime of the computer; that is, it does no near-term windowing of the data.
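To get a near-term number, one option is to take two vmstat -i samples a few seconds apart and divide the difference in totals by the interval. Here is a minimal Python sketch of that idea. The function names (parse_vmstat_i, windowed_rates, watch) and the 10-second window are my own choices for illustration. The parser assumes the last two columns of each line are always the total and the rate, as in the output above. For scale, dividing the irq16 total by its rate (4657985391 / 7471) gives roughly 623,000 seconds, so the averages above cover about a week of uptime.

```python
import subprocess
import time


def parse_vmstat_i(text):
    """Map each interrupt source name to its cumulative count."""
    totals = {}
    for line in text.splitlines():
        # The source name can contain spaces, so split from the right:
        # the last two fields are assumed to be "total" and "rate".
        parts = line.rsplit(None, 2)
        if len(parts) != 3:
            continue
        name, total, _rate = parts
        # Skip the header line (non-numeric) and the "Total" summary line.
        if total.isdigit() and name != "Total":
            totals[name.strip()] = int(total)
    return totals


def windowed_rates(before, after, seconds):
    """Interrupts per second between two samples taken `seconds` apart."""
    return {
        name: (after[name] - before[name]) / seconds
        for name in after
        if name in before
    }


def watch(window=10):
    """Sample vmstat -i twice and print the busiest sources first."""
    def sample():
        result = subprocess.run(
            ["vmstat", "-i"], capture_output=True, text=True, check=True
        )
        return parse_vmstat_i(result.stdout)

    first = sample()
    time.sleep(window)
    second = sample()
    rates = windowed_rates(first, second, window)
    for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
        print(f"{name:30s} {rate:10.1f}/s")
```

Calling watch() on the server prints the current per-second rate for each source, which should make it easier to tell whether irq16 is storming right now or just did at some point in the past week.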
So, I decided to switch the port it’s on just to see if it was an issue with the controller not liking the drive, or if it was the drive itself. I was surprised that this drive was connected to the built-in Asmedia controller, and not the add-on NEC/Renesas controller. (I would have thought that the built-in controller would have supported MSI rather than falling back to a legacy shared IRQ.)
I’ll check on this tomorrow and see how it’s doing. Probably won’t be scientific, since I don’t know exactly what usage pattern might exercise the interrupts.
Update 1
Oh, it looks like xhci0 is indeed the add-on card:
xhci0: <NEC uPD720200 USB 3.0 controller> mem 0xf7a00000-0xf7a01fff irq 18 at device 0.0 on pci3
xhci0: 32 byte context size.
usbus1 on xhci0
pcib4: <ACPI PCI-PCI bridge> irq 16 at device 28.4 on pci0
pci4: <ACPI PCI bus> on pcib4
xhci1: <XHCI (generic) USB 3.0 controller> mem 0xf7900000-0xf7907fff irq 16 at device 0.0 on pci4
xhci1: 32 byte context size.
usbus2 on xhci1
So, I may have moved the wrong hard drive (it was the GoFlex that I moved rather than the Backup Plus). However, it is the GoFlex that’s giving the checksum errors reported by the pool:
server# zpool status tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub repaired 0 in 0h13m with 0 errors on Fri Oct 25 22:31:32 2013
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
gpt/STBPDESK3TB ONLINE 0 0 0
gpt/WD20EARS ONLINE 0 0 0
gpt/GOFLEX2TB ONLINE 0 0 3
mirror-2 ONLINE 0 0 0
gpt/WD15EARS ONLINE 0 0 0
gpt/ST1500 ONLINE 0 0 0
cache
gpt/tank_cache0 ONLINE 0 0 0
errors: No known data errors
I’ve also been testing a Mushkin USB 3.0 flash drive connected to an xhci0 port, which might explain the high (average) interrupt rate on xhci0: I’ve been banging away at it for days.
Update 2
Oh: it looks like irq16 belongs to xhci1, which is the built-in controller, and that is where the GoFlex 2TB drive was attached. So, this has nothing to do with testing the Mushkin flash drive, and it does point to a possible failure of the GoFlex drive.
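Since I got the controller-to-IRQ mapping wrong on the first pass, here is a small Python sketch that pulls it straight out of the boot messages instead of eyeballing them. The function name devices_by_irq and the regular expression are my own. They assume probe lines shaped like the ones quoted in Update 1 (device name, description in angle brackets, then "irq N"), which may not cover every driver.

```python
import re
from collections import defaultdict

# Matches probe lines such as:
#   xhci1: <XHCI (generic) USB 3.0 controller> mem ... irq 16 at device 0.0 on pci4
IRQ_LINE = re.compile(r"^(\w+): <([^>]*)>.*\birq (\d+)\b")


def devices_by_irq(dmesg_text):
    """Group (device, description) pairs by the IRQ number they report."""
    shared = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = IRQ_LINE.match(line)
        if match:
            device, description, irq = match.groups()
            shared[int(irq)].append((device, description))
    return dict(shared)
```

Fed the lines from Update 1, this groups pcib4 and xhci1 under IRQ 16 and xhci0 under IRQ 18, which matches what vmstat -i shows for the xhci controllers.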