[Tfug] ECC (was Re: Using a Laptop as a server)

Bexley Hall bexley401 at yahoo.com
Thu Mar 14 16:55:15 MST 2013


Hi Zack,

On 3/14/2013 2:39 PM, Zack Williams wrote:
> On Thu, Mar 14, 2013 at 1:35 PM, Bexley Hall<bexley401 at yahoo.com>  wrote:
>>
>> So, what does this tell you in terms of the quality/reliability of your
>> system?  When do you start getting nervous?  Statistically, a device
>> that throws an error is more likely to throw *more* errors in the
>> future.  [Unless the source of the errors is the memory infrastructure
>> and not the memory (device) itself.]
>
> I write them up as an environmental hazard, caused by cosmic rays
> (btw, I've always wanted to build a cloud chamber after seeing one at
> a science museum), radon, etc.

Start thinking about what it must be like in outer space with all
those high energy particles flying around!  I wonder what the
fertility rate is for space station occupants after their
deployments??

> Unless there's some systematic,
> repetitive error that I see 2 or more times, I don't view it as a
> hardware flaw.   Those are the kind of errors I'm seeing.

OK.

Now, imagine you start seeing them *twice* "every 2-3 months per 64GB
of active memory"... or, *thrice*...  When do you start getting
nervous?  How much does the error rate have to "change" before you
consider it noteworthy?  Three times?  Five times?  100 times??
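
(As an aside: one way to make "noteworthy" precise is to treat the
errors as arriving at random -- roughly a Poisson process -- and ask
how *surprising* a given count is at your accepted baseline rate.
A minimal sketch in Python; the ~1-per-window baseline is the figure
from this thread, everything else is my assumption:)

    from math import exp, factorial

    def poisson_tail(k, lam):
        """P(X >= k) when X ~ Poisson(lam): the chance of seeing
        k or more errors in a window whose baseline mean is lam."""
        return max(0.0, 1.0 - sum(exp(-lam) * lam**i / factorial(i)
                                  for i in range(k)))

    # Baseline: ~1 correctable error per 2-3 month window (per 64GB).
    baseline = 1.0
    for seen in (2, 3, 5, 100):
        print("%3d errors: P = %.4g" % (seen, poisson_tail(seen, baseline)))

At a baseline of ~1 per window, seeing 2 is unremarkable (~26%
chance), 5 is rare (~0.4%), and 100 is effectively impossible --
so a jump to the "100x" level *screams* that something changed.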

Now, spin the clock back and repeat the experiment.  Imagine you
had *initially* been seeing them at that "100x" level (i.e., you
had no experience with them failing at the "probably once every
2-3 months per 64GB of active memory" rate).  Is that 100x level
one that you would be comfortable with *initially*?

(Do you see the point I'm trying to make?  You've decided that
one error every 2-3 months is something you can write off.  But
you could just as easily have had an experience where 100 every
2-3 months was what you considered "normal".  I.e., a level that
you might *now* consider "alarmingly high" -- based on your
one-every-2-3-months experience -- could, in fact, be just as
acceptable!)

It's sort of like retailers establishing an (arbitrary!) "price point"
for a product -- then "discounting" it to make it look more appealing!

("If you act now, we'll include not one but *two* Skeetle9000's -- just
pay separate shipping and handling...")

You accept some level of performance and call it "normal".  Yet,
it may be *better* than normal... or *worse*!

How would you "set policy" so a "flunky" would know when the error
rate is "too much" and could take corrective action?  (I.e., how
would you have the *machine* decide when its integrity is
compromised -- or soon to be?)
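
One plausible answer (a minimal sketch only -- the counter path is
the one Linux's EDAC subsystem typically exposes for memory
controller 0, and the baseline and alarm factor are knobs I made
up): record the rate you've decided to accept, and have the machine
alarm when the observed rate exceeds it by some factor:

    import time

    # Hypothetical policy daemon; thresholds are arbitrary knobs.
    CE_COUNT = "/sys/devices/system/edac/mc/mc0/ce_count"
    BASELINE_PER_DAY = 1.0 / 75.0    # ~1 CE per 2.5 months per 64GB
    ALARM_FACTOR = 10.0              # get "nervous" at 10x baseline
    WINDOW = 24 * 3600               # sample once a day

    def read_ce():
        """Read the cumulative correctable-error count."""
        with open(CE_COUNT) as f:
            return int(f.read())

    last = read_ce()
    while True:
        time.sleep(WINDOW)
        now = read_ce()
        per_day = (now - last) * 86400.0 / WINDOW
        if per_day > ALARM_FACTOR * BASELINE_PER_DAY:
            print("ECC CE rate %.4f/day exceeds policy!" % per_day)
        last = now

Of course, the interesting policy question is still where
ALARM_FACTOR comes from -- which is exactly the "what's normal?"
problem above.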

> I've also had cases where I did need to replace memory that was
> throwing ECC errors on a daily basis - that's where it's doing its
> job: functioning properly until scheduled replacement can happen (see
> also: RAID).
>
> One interesting story that is tangentially related - in the early
> 2000's Sun released a bunch of processors that had radioactive casings
> on the cache chips, which caused these sorts of errors.
>
> http://www.sparcproductdirectory.com/artic-2001-dec-1.html
> http://nighthacks.com/roller/jag/entry/at_the_mercy_of_suppliers

Yes, cache (often made from *static* memory) is increasingly
prone to errors -- especially as the sizes have grown and speeds
increased.

In the 70's, soft errors were a common phenomenon -- because of
impurities in the materials used to fabricate the devices.
The thickness of the package material (back then, everything
commercially available was in a DIP) provided protection against
low-energy alpha particles from *outside* the device.
But, impurities in the device itself had a clear shot at the die!

[Back then, 4Kb (lower case b) in a single device was AMAZING!
A 4K plane of *core* was the size of an extra large pizza!]



