[Tfug] HDD failing (post-mortem)

John Gruenenfelder jetpackjohn at gmail.com
Tue May 27 23:11:46 MST 2014


On Fri, Apr 04, 2014 at 02:02:37AM -0700, Adrian wrote:
>On Friday 04 April 2014 01:09:02 John Gruenenfelder wrote:
>...
>> The first is labeled as ErrorCount and has been steadily increasing for
>>  just over a week.  Right now I think it stands at 20.  The second is
>>  labeled CurrentPendingSector and has occasionally changed from 1 to 2 to 1
>>  again. 
>
>Dun dun dunnnn... Death is coming. My experience has been that the drive is 
>nearly always in its death throes by the time SMART starts showing sector 
>errors. Out of about a dozen drives I've had die on me, only in one case did 
>SMART alert that the drive was in decline. In all the others the drive 
>would be hanging constantly while SMART continued to say it was good.


Hey TFUG,

Here's an update on my HDD failing issues.  It's a post-mortem so, yes, the
drive is now really most sincerely dead.

The SMART emails kept arriving daily and the pending unfixable sector count
eventually reached something like 640.  At that point Linux decided the drive
was unusable and took it offline.  The RAID5 subsystem immediately kicked in,
dropped the drive from the array, and has kept the array running in degraded
mode on the remaining drives ever since.  By pure chance, I happened to be in
the room and heard a metallic-sounding screech/scratch noise coming from the
server in the closet.  I believe this was the drive's death knell.

If I had the money, I wouldn't have let it get this far and would have
replaced the drive ahead of time, but that wasn't possible.

It's a rather extreme test, but I am very gratified that, when the time came,
the system behaved exactly as it was supposed to.  The drive failed, the
machine never went down, and no data was lost.  Very nice.  The only lasting
effect is that I/O now has noticeably higher overhead, since the RAID5 code
has to reconstruct the missing drive's blocks from parity on the fly.  In
practice, though, the overhead isn't bad, and the machine is still entirely
usable.
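
For anyone curious why the array can keep serving data with a drive missing,
here is a minimal sketch of the XOR parity idea behind RAID5.  It is only an
illustration, not how the Linux md driver is actually written: the 3-drive
stripe, the block contents, and the xor_blocks() helper are all made up for
the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 3-drive stripe: two data blocks plus one parity block.
data_a = b"TFUG"
data_b = b"RAID"
parity = xor_blocks([data_a, data_b])

# Pretend the drive holding data_b died: rebuild its block from the survivors.
rebuilt_b = xor_blocks([data_a, parity])
```

Every read that lands on the dead drive needs this extra step, which is
roughly where the added overhead on a degraded array comes from.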

Of course, with a degraded array I need to get it fixed ASAP, since the array
has done its job and has no redundancy left.  A second drive failure would
take the data with it.
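
If you want to keep an eye on this sort of thing yourself, a rough sketch
like the one below can spot a degraded array from the text of /proc/mdstat.
The degraded_arrays() helper and the SAMPLE text are hypothetical and written
for illustration (SAMPLE is not output from my server); the only thing relied
on is that mdstat marks a missing member with "_" in the "[UU_]" status field.

```python
import re


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status field shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # Array header line, e.g. "md0 : active raid5 ..."
        # (assumes simple names like md0 or md127).
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status field, e.g. "[3/2] [UU_]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status and "_" in status.group(3):
            degraded.append(current)
    return degraded


# Illustrative sample only, shaped like /proc/mdstat for a 3-drive RAID5
# that has lost its third member.
SAMPLE = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[1] sda1[0]
      1953262592 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""
```

On a real machine you would feed it open("/proc/mdstat").read() instead of
SAMPLE, for example from a cron job that mails you when the list is not empty.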

Huzzah for Linux and software RAID.  :)


-- 
--John Gruenenfelder    Systems Manager, MKS Imaging Technology, LLC.
Try Weasel Reader for PalmOS  --  http://weaselreader.org
"This is the most fun I've had without being drenched in the blood
of my enemies!"
        --Sam of Sam & Max