[Tfug] ECC (was Re: Using a Laptop as a server)

Bexley Hall bexley401 at yahoo.com
Thu Mar 14 19:56:50 MST 2013


Hi Zack,

On 3/14/2013 5:12 PM, Zack Williams wrote:
> On Thu, Mar 14, 2013 at 4:55 PM, Bexley Hall<bexley401 at yahoo.com>  wrote:
>> How would you "set policy" so a "flunky" would know when the error
>> rate is "too much" and could take corrective action?  (i.e., how
>> would you have the *machine* decide when its integrity is
>> compromised -- or, soon to be?)
>
> That's a value judgement so everyone's would be different - something
> like any repeated, identical correctable error that happens more than
> once every 3 months would be my criterion for replacement.   Multiple

So, you're looking at repeatability as a criterion -- not necessarily
"failure rate" per se.

My point was, your "one every 2-3 months/64GB" is *way* lower
than my guesstimate (for 64GB, I would guesstimate an error every
~10 *minutes*!).  Similarly, Louis's estimate would have you seeing
an error every ~6 hours.  OTOH, Google cites error rates as high
as "20000-75000 FIT"... i.e., 65 to 250 times as often as the
rate Louis cites!  As such, an error every 1.5 to 6 *minutes*...
about 5 times *my* conservative guesstimate!  (But, I am suspicious
of many of Google's findings so... <shrug>)

[apologies if my math is off... I'm estimating in my head.  And, of
course, this crude analysis assumes errors are uniformly distributed
in time]
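
For anyone who wants to check the arithmetic, here is a minimal sketch.
It assumes the FIT figures are quoted *per Mbit* (failures per 10^9
hours per Mbit); the figures above don't state the unit, so treat that
as my assumption.  The function names are made up for illustration.

```python
# Rough conversion from a FIT rate to "minutes between errors".
# ASSUMPTION: FIT is quoted per Mbit, i.e. failures per 1e9 hours per Mbit.
FIT_HOURS = 1e9


def errors_per_hour(fit_per_mbit: float, capacity_gb: float) -> float:
    """Expected errors per hour for a memory of the given size (binary GB)."""
    mbit = capacity_gb * 1024 * 8  # GB -> MB -> Mbit
    return fit_per_mbit * mbit / FIT_HOURS


def minutes_between_errors(fit_per_mbit: float, capacity_gb: float) -> float:
    """Mean time between errors, assuming errors are uniform in time."""
    return 60.0 / errors_per_hour(fit_per_mbit, capacity_gb)


for fit in (20_000, 75_000):
    print(f"{fit} FIT/Mbit, 64GB: one error every "
          f"{minutes_between_errors(fit, 64):.1f} minutes")
```

Under that assumption, 64GB works out to roughly one error every 5.7
minutes at 20000 FIT and every 1.5 minutes at 75000 FIT, which is where
the "1.5 to 6 minutes" range comes from.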

What would you do if you started seeing four per day (Louis's figure)?
Or, 250 per day (Google's *measured* rate)?  Is your memory now "bad"?
Or, has it just degraded to what is "normal" for other folks??

> non-correctable errors in the same unit, even if they aren't identical
> would probably prompt for replacement as well.  This could be written
> up as policy, assuming that said flunky could interpret the log
> information correctly.
>
> It's pointless to fix something that might not be at fault.  Given the

Of course!  But, you never *know* if it's a problem until you can
define a criterion and policy to address it!  (Hence my query)
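
To make that concrete, here is one way the rule you describe could be
written down so the *machine* applies it.  This is only a sketch: the
event fields (unit, address) are hypothetical and don't correspond to
any real ECC log format, "identical" is taken to mean "same unit and
same address", and "3 months" is taken as 90 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class EccEvent:
    when: datetime
    unit: str          # hypothetical label for the DIMM/CPU reporting the error
    address: int       # hypothetical key for deciding two errors are "identical"
    correctable: bool


WINDOW = timedelta(days=90)  # "3 months", taken as 90 days


def units_to_replace(events):
    """Flag units per the stated rule:
    - an identical correctable error repeating within WINDOW, or
    - two or more non-correctable errors in the same unit.
    """
    flagged = set()
    last_seen = {}        # (unit, address) -> time of previous correctable error
    uncorrectable = {}    # unit -> count of non-correctable errors
    for ev in sorted(events, key=lambda e: e.when):
        if ev.correctable:
            key = (ev.unit, ev.address)
            prev = last_seen.get(key)
            if prev is not None and ev.when - prev <= WINDOW:
                flagged.add(ev.unit)
            last_seen[key] = ev.when
        else:
            uncorrectable[ev.unit] = uncorrectable.get(ev.unit, 0) + 1
            if uncorrectable[ev.unit] >= 2:
                flagged.add(ev.unit)
    return flagged
```

Note that a rule like this never looks at the overall error *rate*, only
at repeats, which is the distinction I was drawing above.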

> Sun example, all the swapping of parts never fixed the root problem
> until they isolated it.  In the same way, swapping out perfectly fine
> memory/CPU's that happened to hit the cosmic rays/radon bit flip
> jackpot once in a while isn't going to solve a problem.
