[Tfug] USB drives

Bexley Hall bexley401 at yahoo.com
Wed May 20 22:16:27 MST 2009


> > Cooling is ALWAYS important.  10C = 50% lifespan (as a rule
> > of thumb)
> 
> Not as much as it used to be:
> 
> http://labs.google.com/papers/disk_failures.pdf
> 
> See figures 4 and 5.  If you keep drives cold (below 25C/77F), they
> fail at a much higher rate.
> 
> Between 35-40C results in maximum drive life.

I don't think you can draw that conclusion from the data
that they present!  Nor do I think *they* draw that conclusion.

As with most "papers", the authors leave out much that
can be used to explain or discredit their claims.  I
wonder if they have made the raw data available for analysis?

Some things just don't make sense -- either because they
haven't been explained adequately or they haven't been
completely thought out.

The temperature data reflects a range of temperatures
from ~17C (62F) to ~51C (123F).  Presumably, these are 
S.M.A.R.T. data -- though the authors do not indicate 
which actual datum is tracked.  As such, they reflect
*drive* temperatures (possibly even *platter* temperatures
depending on which SMART parameter is used).

They make no mention of what the *ambient* temperature 
is in the data center.  Google claims to be able to operate
at ambients of up to ~27C (81F).  So, the data for values
above that clearly indicate the warmer operating environment
within the drive assembly itself.

But, it's hard to imagine an internal drive temperature of
17C unless the drive is quiescent (dead) and ambient happens
to *be* 17C!

Instead, it seems logical to assume that these low temperatures
are more representative of the ambient temperature in the container.
Perhaps a cool winter day in The Dalles?  :>  (recall, the
data were collected August to December; data center containers
aren't known to be well insulated enclosures!  :> )

This raises the question:  why would a reasonably quiescent
drive fail?  Perhaps it just wasn't *doing* anything at the
time the failure was recorded?

This might be true of a drive that fails to spin up.  However,
the authors claim that they had *no* instances of spin retries!

OK, so that suggests the drive *had* been operational prior
to the failure being logged.  At face value, this suggests
that a highly efficient drive (i.e., one that generates very
little "waste" -- heat! -- and is able to operate at very close
to -- possibly even BELOW? -- ambient) suddenly decides to die.

So, how is death defined:
   "a drive is considered to have failed if it was replaced
   as part of a repairs procedure"
And, *when* is the death recorded:
   "we consider the time of failure to be when the drive 
   was replaced, which can sometimes be a few days after 
   the observed failure event"

Ahh... So, if the drive decided to spin down and sat there
*quiescent*, its internal temperature would fall -- to track
ambient.  This would probably happen in less than an hour.
Almost definitely within "a few days after the observed
failure event".
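
To put rough numbers on that cooling argument, here is a minimal
sketch using Newton's law of cooling.  Every value in it (the 40C
starting temperature, the 17C ambient, and especially the 15 minute
time constant) is an assumption picked for illustration; none of
them come from the paper or from any drive datasheet.

```python
import math

def drive_temp(minutes, start_c=40.0, ambient_c=17.0, tau_min=15.0):
    """Newton's law of cooling: exponential decay toward ambient.

    start_c, ambient_c and tau_min are all assumed values.
    """
    return ambient_c + (start_c - ambient_c) * math.exp(-minutes / tau_min)

# Temperature of a spun-down drive at a few points after it goes quiet.
for minutes in (0, 15, 30, 60, 24 * 60):
    print(f"{minutes:>5} min: {drive_temp(minutes):.1f}C")
```

With any time constant in that neighborhood, the drive is within a
degree of ambient inside the hour, and indistinguishable from ambient
long before "a few days" have passed.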

OTOH, the temperature data is claimed to represent "average 
temperatures" prior to failure.  Is this a rolling average
or a lifetime average?  The latter would be meaningless as,
over a three year lifespan, the "normal" temperatures of a
correctly operating drive would heavily overweight the
(possibly?) increased temperatures immediately prior to
failure.
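
A quick sketch of why a lifetime average would hide this.  The
numbers are hypothetical (a steady 35C for three years, then a final
week at 50C) and are not taken from the paper; the point is only how
little a short excursion moves a long average.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical daily readings: three years at 35C, then one week at 50C.
normal_days = 3 * 365
readings = [35.0] * normal_days + [50.0] * 7

lifetime_avg = mean(readings)        # averaged over the drive's whole life
rolling_avg = mean(readings[-7:])    # 7-day window ending at the failure

print(f"lifetime average:  {lifetime_avg:.2f}C")
print(f"last-week average: {rolling_avg:.2f}C")
```

The lifetime figure moves by about a tenth of a degree, while the
one-week window shows the full 15C rise.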

Nor is the lifetime of the failed drives correlated with these
temperatures.  For example, the article mentions the continued
presence of the bathtub curve in longevity factors -- were
these "cold" drives ones that had failed early on in their
operating life?

And, there is nothing indicating the *weight* of each of
these data points.  I.e., did exactly *one* drive fail at 17C?
Note the size of the confidence interval suggests the sample
size for these low temperature failures is probably not that
large -- at least not as large as those failing in the center
of the graph.  (Keep in mind, the paper states that "More
than one hundred thousand disk drives were used for all the
results presented here".)
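
To illustrate how much the interval width depends on sample size,
here is a sketch using the Wilson score interval for a proportion.
The bucket sizes and the 5% failure rate are made up for the example;
the paper does not give per-bucket counts, and I don't know which
interval method the authors actually used.

```python
import math

def wilson_interval(failures, drives, z=1.96):
    """Approximate 95% Wilson score interval for a failure proportion."""
    p = failures / drives
    denom = 1 + z * z / drives
    centre = (p + z * z / (2 * drives)) / denom
    half = z * math.sqrt(p * (1 - p) / drives
                         + z * z / (4 * drives * drives)) / denom
    return centre - half, centre + half

# Hypothetical temperature buckets, all with the same observed 5% rate.
for drives in (100, 1000, 20000):
    failures = drives * 5 // 100
    low, high = wilson_interval(failures, drives)
    print(f"{drives:>6} drives: {low:.3f} .. {high:.3f}")
```

Same observed rate, very different certainty: the small bucket's
interval is many times wider than the large one's.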

As my final point, if temperature is NOT a critical factor
in drive reliability, then the environmentally conscious folks
at Google will, no doubt, quickly turn the thermostats UP
on the cooling systems in those containers to save energy,
money and "The Planet"!  :>

<shrug>  Watch their stock.  If profits go up by an amount
that reflects REDUCED energy consumption, you'll know this
was the case!  :>  Meanwhile, I'll keep acting as if 
+10C = -50% MTBF  :-/
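
For anyone who wants the rule of thumb as arithmetic, a minimal
sketch.  It assumes the halving compounds for every 10C (so +20C
means one quarter), which is the usual reading of the rule and not
something the Google paper supports or refutes.

```python
def relative_life(delta_c):
    """Life relative to baseline under the halving-per-10C rule of thumb."""
    return 0.5 ** (delta_c / 10.0)

# Temperature change relative to whatever baseline you normally run at.
for delta in (-10, 0, 5, 10, 20):
    print(f"{delta:+3d}C -> {relative_life(delta):.2f}x baseline")
```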


      



