[Tfug] ECC (was Re: Using a Laptop as a server)

Bexley Hall bexley401 at yahoo.com
Thu Mar 14 11:22:49 MST 2013


Hi Louis,

On 3/14/2013 7:44 AM, Louis Taber wrote:
> Using the data from the 2004 paper *Soft Errors in Electronic Memory – A
> White Paper* at
> http://www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf and my
> system with 16GBytes of RAM up for 176 days and making two seriously
> invalid assumptions I computed an estimate of 187 soft errors.
> 2^(10+4)*9*(10^-9)*(176*24)*300

Using my "error rate estimate", that would be more like *8,000* errors
(16000MB*(1/MB/Yr)*0.5yr) in the same time period!  Unfortunately,
memory reliability studies in *embedded* systems are nonexistent!
Hence the SWAG...  :-(

> The "bad" assumptions were 1) I am using all of the RAM and 2) It is being
> used a full speed.

Ditto.

> The other assumption was FIT of 300 (FIT, Failures In Time: errors per
> billion (10^9) hours of use.  Usually reported as FIT per Mbit.)
> The text suggested FIT rates of few hundred to few thousand.

I've seen error rates reported as high as 70,000/BHr/Mb!  So, who
*do* you believe??  :<

> Are you willing to:
>
> Use the wrong data in a calculation.

You can *examine* data to increase your confidence in its veracity.
This is something that *every* piece of code should do as a matter
of course.  If the data came from the user IN ANY WAY, then you
have no guarantee that it "makes sense".  Even if *you* created
the data (i.e., some function/subroutine), it doesn't hurt to
apply sanity checks to it.  This doesn't guarantee you will catch
all corruptions -- just like ECC doesn't guarantee you will
catch all memory errors!
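
For instance, a minimal sketch in C (the function name and the limits
are invented for illustration):

    #define TEMP_MIN  (-40)
    #define TEMP_MAX  125

    int
    apply_setpoint(int temp)
    {
        /* refuse data that can't possibly be valid; the caller
           decides how to recover */
        if ((temp < TEMP_MIN) || (temp > TEMP_MAX)) {
            return -1;
        }
        /* ...temp is safe to use, now... */
        return 0;
    }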

> Execute an incorrect instruction,

I don't execute code from RAM.  Why waste something as precious as
RAM storing a *second* copy of an executable??  :>

> Use an invalid address or pointer?

This is where I am currently most vulnerable.  As a hardware person,
I am just as comfortable with pointers as with the data they reference.
So, I make heavy use of pointers in my code (e.g., pointers to data
as well as pointers to functions).

But, to be impacted by bad memory, one of those pointers would have
to be corrupted.  And, in a way that was not obvious *AND* had
negative consequences!  E.g., if a pointer references the middle
of a buffer instead of the *start* of that buffer, then it could
reduce what I can *store* in that buffer (yet still function
properly).  Or, it can cause problems by overwriting something
that other code expects to be unaltered by this operation, etc.
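
One cheap defense -- sketched here against a made-up buffer -- is to
confirm a pointer actually lies within the object it is supposed to
reference, before dereferencing it:

    #include <stdint.h>

    static char buffer[1024];

    int
    points_into_buffer(const char *p)
    {
        uintptr_t a  = (uintptr_t) p;
        uintptr_t lo = (uintptr_t) buffer;

        /* a pointer that wandered outside buffer[] fails this
           test *before* it gets a chance to clobber anything */
        return (a >= lo) && (a < lo + sizeof buffer);
    }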

The problem with errors is they aren't all created equally!  :-/

First, if an error strikes a memory location that you aren't
*using*, <shrug>  (The "tree falls in the woods..." scenario)

Second, an error can work for you, against you or be indifferent!
E.g., if you are controlling a process, an error can cause this
*iteration* of the control loop to be "incorrect" or "less than
ideal" -- yet, the next iteration will *fix* it (assuming an
overdamped or critically damped system).

For example, when commanded to open the garage door, I send a
fixed, open-loop signal to the door opener (COTS device) telling it
to engage (the equivalent of "pressing the button on the remote").
Then, I wait to sense motion.  After some interval (i.e., a data value
specified in RAM), I repeat the process until I *do* see motion.

What happens if that "data value" is corrupted?  Instead of waiting
2 seconds for motion, perhaps it tells me to wait 102 seconds!
But, most times, I will *see* motion and that timer will have no
effect on the control algorithm.

OTOH, the value might be corrupted to read as 0.002 seconds (i.e.,
if the value was stored as a floating point number and a bit in the
exponent had been twiddled).  Here, the algorithm wouldn't see
motion (the relay contacts in the garage door opener won't even
have had time to close in those 2 milliseconds!) so the algorithm
could "press the button again".  Depending on the opener's
implementation, this could be ignored *or* recognized and cause
the action to "toggle" (i.e., first press means open... next press
means close... next press means open...).  Alternatively, it might
just give up and wait for the user to reissue the command (how many
times have you pushed the button on your remote, waited, seen no
action and then pushed it again??)

Eventually, the timer that watches for the door to *be* (fully) open
will complain that "it's been 15 seconds and the door still isn't
open!  something is wrong!!".

Of course, obviously bad data is caught before it is used.  It
makes no sense to allow a 2 millisecond timeout.  So, this would
have been caught in an assertion before it was used!  The "job"
can then be killed and restarted (which could result in it
being loaded into a different portion of physical memory while
the region it had previously occupied is scrubbed for errors).
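
A sketch of that assertion (the limits and the restart hook are
invented):

    #define RETRY_MIN_MS    500     /* relay needs time to close */
    #define RETRY_MAX_MS  30000     /* anything longer is absurd */

    extern void die_and_restart(void);   /* hypothetical restart hook */

    unsigned
    checked_retry_interval(unsigned ms)
    {
        /* a 2ms (or 102s!) wait means the datum is corrupt --
           kill/restart the job rather than act on it */
        if ((ms < RETRY_MIN_MS) || (ms > RETRY_MAX_MS))
            die_and_restart();
        return ms;
    }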

Programming style can have a dramatic impact on how tolerant of
"hardware errors" your application will be.  For example, why
pick values for an enumeration like:
    enum weekdays {sun, mon, tue, wed, thu, fri, sat};
This allows a single bit error to turn:
    sun into mon (and vice versa)
    tue into wed (      "       )
    thu into fri (      "       )
    sun into tue (      "       )
    mon into wed
    thu into sat
    sun into thu
    mon into fri
    tue into sat
    any into something_completely_bogus (e.g., 0x80)
OTOH, a "less lazy" approach increases the hamming distance between
these constants:
    enum weekdays {
        sun=1 << 0,
        mon=1 << 1,
        tue=1 << 2,
        wed=1 << 3,
        thu=1 << 4,
        fri=1 << 5,
        sat=1 << 6,
    }
[Of course, other encodings with smaller and larger Hamming distances
are also possible to accomodate more than '8' distinct values while
still protecting against a single bit error]
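
A nice side effect of the one-hot encoding: every valid code has
exactly one bit set, so corruption is cheap to detect.  A sketch:

    int
    valid_day(unsigned dow)
    {
        /* true iff exactly one bit is set: any single bit error
           turns a valid code into zero or a two-bit value */
        return (dow != 0) && ((dow & (dow - 1)) == 0);
    }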

And, most importantly, using those values like:
     switch (dow) {
     default:
        something_bad_has_happened();
        break;
     case sun:
        ...
     case mon:
        ...
     case tue:
        ...
     case wed:
        ...
     case thu:
        ...
     case fri:
        ...
     case sat:
        ...
     }

I.e., the "tradition" of using "zero" and "non-zero" as function return
values (FAIL/PASS) is completely bogus -- on so *many* levels!  If you
*mean* "PASS", then *say* "PASS"!  Ditto "FAIL".  So, encountering
"FRIBBLE" stands out as obviously indicative of something very wrong
in the code!!!  :>
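
E.g., something like this (the particular values are just
illustrative):

    enum status {
        PASS = 0x5A,    /* 01011010                           */
        FAIL = 0xA5     /* 10100101 -- Hamming distance 8 away */
    };

Anything else -- a FRIBBLE -- is provably corrupt, and lands in the
switch's default: handler above.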

But, you have to think a bit harder about how you want to deal with
all these conditions -- instead of hoping some exception handler
will clean up after your mess!  :>

[Somewhere I have archived an article that attempted to analyze
various systems' "tolerance"/robustness wrt memory errors.  I.e.,
how well they fare in the presence of flakey memory...]

Soft errors add even more spice to the mix!  Is the soft error
considered transient?  Or, will it repeat in the absence of an
intervening "write/update" cycle?

In the days of core, all reads were followed by "restores"
(*write* the data read *back* into the core because the read
was a destructive operation).  So, ECC could "fix" an error
and replace the "bad" data with corrected data.  A "bad read"
was then truly a transient phenomenon -- a second read of the same
location would yield "correct" data (even if the ECC subsystem was
turned off).

If, OTOH, all you are doing is correcting the data presented to the
processor, then an error can persist *in* the memory.

And, if you *do* have ECC but aren't doing anything with the data
that *it* reports (i.e., number of errors, types of errors, *where*
they occur, etc.) then you've just given yourself the *illusion*
of reliability!
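
A scrubber addresses both.  A sketch, where the ECC access/reporting
interface is entirely invented (real memory controllers expose this
differently):

    #include <stddef.h>
    #include <stdint.h>

    /* hypothetical interface: ecc_read() returns nonzero when a
       correctable error was fixed on the way out */
    extern int  ecc_read(size_t addr, uint32_t *val);
    extern void ecc_write(size_t addr, uint32_t val);
    extern void log_ecc_event(size_t addr);

    void
    scrub(size_t base, size_t nwords)
    {
        uint32_t v;
        size_t   i;

        for (i = 0; i < nwords; i++) {
            if (ecc_read(base + i, &v)) {
                ecc_write(base + i, v);    /* put corrected data *back* */
                log_ecc_event(base + i);   /* ...and report it! */
            }
        }
    }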

> The "cost" to prevent this by using ECC seems to include:
>
> Slower execution (You need to write the entire ECC word at one time, no 8,
> 16, or 32 bit writes)

Much of that cost can be hidden in the memory interface unit.
However, it also impacts how you locate *I/O* in the memory
space (since byte addresses are no longer unique).
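
Roughly, what the interface unit does under the hood for a byte store
(compute_ecc() is a stand-in for the real check-bit logic):

    #include <stdint.h>

    extern uint8_t compute_ecc(uint32_t word);   /* hypothetical */

    void
    store_byte(uint32_t *word, uint8_t *check, int lane, uint8_t b)
    {
        uint32_t w = *word;                    /* 1: read the whole word */

        w &= ~((uint32_t)0xFF << (8 * lane));  /* 2: merge the new byte  */
        w |= (uint32_t)b << (8 * lane);
        *word  = w;                            /* 3: write it all back   */
        *check = compute_ecc(w);               /*    ...with fresh ECC   */
    }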

(And, you *still* have to deal with the possibility of uncorrectable
errors!  You've just kicked the can down the road...)

> More expensive memory (It typically is an extra bit per byte)

That's only the case for long words.  E.g., for a 32b memory system,
you're talking ~20% overhead (at least 6 check bits for single-error
correction, 7 with double-error detection, per 32b word).  In reality,
this means an extra byte per word (i.e., 25%).
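
The arithmetic, as a sketch: single-error correction over k data bits
needs the smallest r with 2^r >= k + r + 1; add one more bit to also
*detect* double errors (SECDED):

    #include <stdio.h>

    /* smallest r satisfying 2^r >= k + r + 1 */
    static int
    sec_bits(int k)
    {
        int r = 1;

        while ((1 << r) < k + r + 1)
            r++;
        return r;
    }

    int
    main(void)
    {
        /* prints "4 6 7"; add 1 each for SECDED, then round up
           to whole devices -- hence the extra byte per word */
        printf("%d %d %d\n", sec_bits(8), sec_bits(32), sec_bits(64));
        return 0;
    }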

> More expensive processors and system (it takes circuitry to implement the
> ECC)

Exactly.  And power, space, cooling, etc.  When you head down that
path, you severely limit the choices available to the implementor
("PC's+servers" are less than 1 percent of the computers in use).

If you insist on ECC, then you're no longer going to be able to look
for high integration solutions (AFAIK, there are no SoCs with
internal ECC).

> A general increase in the quality of hardware construction (If you are
> going to sell a system that supports ECC you need to appeal to customers
> who will buy it)

The more significant issue, IMO (and the essence of my question to
Harry) is:  what do you *do* with the information made available
by such a memory subsystem?  I.e., do you just let it blindly
keep fixing errors until multiple bit errors manifest?  In which
case, you're no better than having *no* memory protection!  Do
you report the errors to the user?  What do you expect the *user*
to do about it?  Light the "CHECK ENGINE" indicator?  "SERVICE
REQUIRED"?

E.g., I have ECC installed and configured in each of my machines.
None of them have ever *complained* to me in a way that suggests
to me that I have memory problems or an increased chance of
impending failure!

> Most PC ECC systems I looked at around year 2000 would also catch all
> double bit errors and all errors in a single nibble.  I would rather have a
> process or system stop if an error is encountered.

Exactly.  Instead of "don't care" or "/* CAN'T HAPPEN */", actively
respond to something that really *shouldn't* happen!  With appliances
and control systems that are *intended* to run forever, it's usually
easy to find a way to design things so you can restart operations
that choked.  You deal with the "process" as a series of discrete
subprocesses instead of lumping everything into one giant, brittle
process (having lots of assumptions)!  Then, you can treat these
subprocesses as transactions -- any that fail to complete normally
can be restarted in the hope that they *will* complete... inching the
overall process forward incrementally.
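
In code, the shape is something like this (step_fn and the steps
themselves are placeholders):

    typedef int (*step_fn)(void);    /* returns 0 when the transaction
                                        completed normally */

    void
    run_process(const step_fn steps[], int nsteps)
    {
        int i;

        for (i = 0; i < nsteps; i++) {
            while (steps[i]() != 0)
                ;   /* restart the choked subprocess; the overall
                       process still inches forward */
        }
    }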

> If they are computing
> the syndrome over 128+ bits it could be even better.

That assumes it is inexpensive to *access* 128b!  :<

> IBM mainframe systems
> at the time corrected all double bit errors.

Newer memory technology crams multiple bits into each device.  So,
it is *more* likely that you will encounter multiple bit
errors.  E.g., Chipkill tries to address this by spreading the
syndrome around to different devices.  But, it now needs to be
aware of the actual topology of the memory devices being used.
(Everything comes full circle  :>  In the infancy of DRAM, you
studied the topology of the actual devices that you were using
in order to understand how failures would manifest -- so you could
craft tests to check for cross-cell disturbances, row line
failures, etc.)

> Does Linux, by default, log ECC errors?  If so, where?  If not, how can
> logging be turned on?

Dunno as I don't run Linux.  I've never heard a peep from any of my
NetBSD/Solaris systems so I've just *assumed* they are "happy".
(And I wouldn't expect Windows to *ever* tell me anything useful!)
But, they don't run long processes, either!  No idea how some
application would fare if left running for hundreds of days...
As I said earlier, not all memory errors are equal.  If a web server
hiccups and fails to serve a request properly, will the user
reissue the request?  And, if the web server succeeds on the second
attempt, is the user harmed?

Sort of like getting genetic testing done... "OK, so now you
*know* (<whatever>).  Whatcha gonna *do* about it??"



