[Tfug] HDD failing

John Gruenenfelder jetpackjohn at gmail.com
Fri Apr 4 01:09:02 MST 2014


Hello again,

I think I know the answer already, but I wanted to run it by TFUG first just
to be safe.  I have just recently begun getting email from smartd about, no
surprise, SMART errors.  Specifically, there are two emails.

The first is labeled as ErrorCount and has been steadily increasing for just
over a week.  Right now I think it stands at 20.  The second is labeled
CurrentPendingSector and has occasionally changed from 1 to 2 to 1 again.
Checking the syslog, this error indicates pending unreadable sectors for which
reallocation has failed.  Right now the count is at one.

From the syslog, here is one of the recent read errors:

[662537.816875] ata8.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[662537.816881] ata8.01: failed command: READ DMA EXT
[662537.816887] ata8.01: cmd 25/00:80:57:f6:af/00:00:1e:00:00/f0 tag 0 dma 65536 in
[662537.816887]          res 51/40:00:c8:f6:af/40:00:1e:00:00/f0 Emask 0x9 (media error)
[662537.816890] ata8.01: status: { DRDY ERR }
[662537.816892] ata8.01: error: { UNC }
[662537.836300] ata8.00: configured for UDMA/133
[662537.856813] ata8.01: configured for UDMA/133
[662537.856873] sd 7:0:1:0: [sdd] Unhandled sense code
[662537.856875] sd 7:0:1:0: [sdd]  
[662537.856877] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[662537.856878] sd 7:0:1:0: [sdd]  
[662537.856880] Sense Key : Medium Error [current] [descriptor]
[662537.856882] Descriptor sense data with sense descriptors (in hex):
[662537.856884]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
[662537.856889]         1e af f6 c8 
[662537.856892] sd 7:0:1:0: [sdd]  
[662537.856894] Add. Sense: Unrecovered read error - auto reallocate failed
[662537.856896] sd 7:0:1:0: [sdd] CDB: 
[662537.856897] Read(10): 28 00 1e af f6 57 00 00 80 00
[662537.856902] end_request: I/O error, dev sdd, sector 514848456
[662537.856914] ata8: EH complete
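
As a sanity check that I'm reading the log correctly, the failing sector can be
recovered from the raw bytes.  This is only a quick sketch in Python: the byte
strings are copied from the sense data and the Read(10) CDB above, and the
helper name be_int is my own invention.

```python
# Bytes copied from the sense data dump and the Read(10) CDB in the log above.
SENSE_LBA_BYTES = "1e af f6 c8"
CDB = "28 00 1e af f6 57 00 00 80 00"


def be_int(hex_bytes: str) -> int:
    """Interpret space-separated hex bytes as one big-endian integer."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "big")


cdb = bytes.fromhex(CDB)
start_lba = int.from_bytes(cdb[2:6], "big")  # Read(10): bytes 2-5 hold the LBA
length = int.from_bytes(cdb[7:9], "big")     # bytes 7-8 hold the transfer length
failed_lba = be_int(SENSE_LBA_BYTES)

print(failed_lba)              # 514848456, same as the end_request line
print(start_lba, length)       # 514848343 128
print(failed_lba - start_lba)  # 113, i.e. inside the 128-sector request
```

So the request was for 128 sectors starting at 514848343, and the drive gave up
on the sector 113 sectors into it.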


And, hoping that it was just one spot on the HDD, I searched through the
syslog for all of these errors and their respective sectors, and, at the time,
came up with a list of 14.  :(

514848456
521096966
521183948
521186444
521189013
522076296
628184820
630696098
630764295
631528611
631528668
631628971
631633517
631634728
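
For anyone who wants to repeat the search, this is roughly how the list can be
pulled out of the log.  It is a minimal sketch: the regular expression only
matches the "end_request" line format shown above, the function name is made
up, and the size estimate assumes 512-byte sectors.

```python
import re

# Matches lines like "end_request: I/O error, dev sdd, sector 514848456".
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def failed_sectors(lines, dev="sdd"):
    """Return the sorted, de-duplicated sector numbers reported for dev."""
    found = set()
    for line in lines:
        m = IO_ERROR.search(line)
        if m and m.group(1) == dev:
            found.add(int(m.group(2)))
    return sorted(found)


sample = [
    "[662537.856902] end_request: I/O error, dev sdd, sector 514848456",
    "[662537.856914] ata8: EH complete",
]
print(failed_sectors(sample))  # [514848456]

# Distance between the lowest and highest sector in the list above,
# assuming 512-byte sectors.
span_sectors = 631634728 - 514848456
print(span_sectors, round(span_sectors * 512 / 1e9, 1))  # 116786272 59.8
```

The bad sectors are spread over roughly 60 GB of the disk, so it is clearly
not just one spot.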


BTW, the drive in question is the fourth drive in my RAID-5 array.  It's a
Western Digital WD5000AAKS-75TMA0 500 GB drive.  The other three are all 500
GB Samsung drives that are either the same or one minor model number off, but
in any case they have been behaving fine.

It would seem apparent that the drive is in the process of failing, but I do
find the numbers a little confusing.  The fact that the pending sector count
keeps changing and has never gone above 2 seems to indicate that the auto
reallocation failures are a temporary issue.  I didn't think this was
possible; reallocation fails when the pool of spare sectors is used up,
correct?

I just noticed now while looking at the most recent error in the syslog that
after the read error is a single line, about 0.5 seconds after the error
handling code is finished:

Apr  3 22:01:52 Bebop kernel: [912452.740062] md/raid:md0: read error corrected (8 sectors at 662707240 on sdd2)

I had assumed that RAID correction must come into play at some point, but it
wasn't until now that I finally noticed it in the syslog.
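
My understanding of the general idea behind that correction, as a toy sketch
and not what the md driver actually does internally: in RAID-5 the parity
chunk is the XOR of the data chunks in a stripe, so an unreadable chunk can be
rebuilt from the remaining ones.  The data values and the function name here
are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


# Three made-up data chunks in one stripe of a four-drive array;
# the fourth drive holds the parity for this stripe.
d0, d1, d2 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"
parity = xor_blocks([d0, d1, d2])

# Pretend d1 is the unreadable chunk: rebuild it from the survivors plus parity.
rebuilt = xor_blocks([d0, d2, parity])
print(rebuilt == d1)  # True
```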


So, how worried should I be?  I've been wanting to upgrade the array for a
while, and this is as good an excuse as any, I suppose.  I would like to
understand these errors a little more, though.  And maybe get an idea of how
quickly I need to act.  Having the extra layer of "protection" provided by the
RAID-5 code might extend that timeframe slightly.

Thanks for any input.


-- 
--John Gruenenfelder    Systems Manager, MKS Imaging Technology, LLC.
Try Weasel Reader for PalmOS  --  http://weaselreader.org
"This is the most fun I've had without being drenched in the blood
of my enemies!"
        --Sam of Sam & Max


