[Tfug] HDD failing

Fri Apr 4 02:02:37 MST 2014

On Friday 04 April 2014 01:09:02 John Gruenenfelder wrote:
...
> The first is labeled as ErrorCount and has been steadily increasing for
>  just over a week.  Right now I think it stands at 20.  The second is
>  labeled CurrentPendingSector and has occasionally changed from 1 to 2 to 1
>  again. 

Dun dun dunnnn... Death is coming. My experience has been that the drive is 
nearly always in its death throes by the time SMART starts showing sector 
errors. Out of about a dozen drives I've had die on me, only in one case did 
SMART warn that the drive was in decline. In all the others the drive 
would hang constantly while SMART continued to say it was good.

> And, hoping that it was just one spot on the HDD, I searched through the
> syslog for all of these errors and their respective sectors, and, at the
>  time, came up with a list of 14.  :(
...

If you notice, the sectors are all pretty much in groups. If you could 
physically see them, I would bet they would be in the same area on the 
physical platter, on adjacent tracks. You have a spot on the disk going bad. 
If you ran badblocks on the disk (forcing the drive to read the entire 
surface), it would probably turn up more.
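
A rough sketch of how you could check that grouping yourself, assuming you 
have already pulled the failing sector numbers out of syslog into a list. The 
sample numbers and the max_gap threshold below are made up for illustration, 
and nearby LBAs are only a hint of physical closeness, not proof.

```python
def cluster_sectors(sectors, max_gap=2048):
    """Group sector numbers into runs whose neighbours are within max_gap."""
    clusters = []
    for s in sorted(set(sectors)):
        if clusters and s - clusters[-1][-1] <= max_gap:
            # Close enough to the previous sector: same cluster.
            clusters[-1].append(s)
        else:
            # Too far away (or first sector): start a new cluster.
            clusters.append([s])
    return clusters


# Hypothetical sector numbers, NOT taken from the original syslog.
sample = [662707240, 662707248, 662707900, 701000000, 701000008]

for group in cluster_sectors(sample):
    print("first=%d last=%d count=%d" % (group[0], group[-1], len(group)))
```

A handful of tight clusters would fit the "one bad spot" picture; errors 
scattered evenly across the whole disk would not.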

> It would seem apparent that the drive is in the process of failing, but I
>  do find the numbers a little confusing.  The fact that the read error
>  count keeps changing and has never gone above 2 seems to indicate that the
>  auto reallocation failures are a temporary issue.  I didn't think this was
>  possible; reallocation fails when the pool of spare sectors is used up,
>  correct?

Pending sectors are ones the drive firmware is still trying to recover. 
The drive will periodically try to re-read the failing sector(s) in the 
background and, if it succeeds, move the data to a spare sector, which 
would explain the pending count dropping from 2 back to 1. 
You probably have a #5 Reallocated_Sector_Ct attribute as well, which is the 
count of sectors successfully moved somewhere else so far.
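
If you want to watch those two counters over time, here is a minimal sketch 
that parses the attribute table printed by "smartctl -A". It assumes the usual 
ten-column layout with the raw value last; the sample text is a made-up 
excerpt, not output from the drive in question.

```python
def parse_smart_attributes(text):
    """Return {id: (name, raw_value)} from smartctl -A style table text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            attrs[int(fields[0])] = (fields[1], int(raw) if raw.isdigit() else raw)
    return attrs


# Hypothetical excerpt in the assumed column layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
"""

attrs = parse_smart_attributes(SAMPLE)
for attr_id in (5, 197):
    name, raw = attrs[attr_id]
    print(attr_id, name, raw)
```

A pending count that falls while the reallocated count rises would mean the 
drive is moving sectors to spares; pending falling with no change in #5 
suggests the re-read simply succeeded.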

I find #193 Load_Cycle_Count to be a far better indicator of remaining life on 
laptop/desktop drives. Once it gets to a couple million cycles the drive has 
likely seen heavy use. Server installations can be a bit more 
difficult to judge as the heads tend not to park as often. 

> I just noticed now while looking at the most recent error in the syslog
>  that after the read error is a single line, about 0.5 seconds after the
>  error handling code is finished:
> 
> Apr  3 22:01:52 Bebop kernel: [912452.740062] md/raid:md0: read error
>  corrected (8 sectors at 662707240 on sdd2)
> 

The RAID has identified a problem and worked around it before the drive has. 
The drive firmware will still be trying to re-read that sector in the 
background and reallocate it somewhere else; the RAID just masks the read 
failure from the user in the meantime (by reconstructing the data from the 
parity/redundant data stored elsewhere).
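
To collect those corrections from the log, something like the sketch below 
should do. It assumes each kernel message sits on one line in syslog (the 
quoted copy above is wrapped by the mail client) and that the wording matches 
the message you posted; the function name is just for illustration.

```python
import re

# Pattern modelled on the "read error corrected" message quoted above.
MD_CORRECTED = re.compile(
    r"md/raid:(?P<array>\w+): read error corrected "
    r"\((?P<count>\d+) sectors at (?P<sector>\d+) on (?P<device>\w+)\)"
)


def parse_md_correction(line):
    """Return the array, sector count, start sector and device, or None."""
    m = MD_CORRECTED.search(line)
    if m is None:
        return None
    return {
        "array": m.group("array"),
        "count": int(m.group("count")),
        "sector": int(m.group("sector")),
        "device": m.group("device"),
    }


line = ("Apr  3 22:01:52 Bebop kernel: [912452.740062] md/raid:md0: "
        "read error corrected (8 sectors at 662707240 on sdd2)")
print(parse_md_correction(line))
```

Running every syslog line through that gives you a list of start sectors per 
device, which you can compare against the sectors from the read errors.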

Adrian