[Tfug] ECC (was Re: Using a Laptop as a server)

Louis Taber ltaber at gmail.com
Thu Mar 14 07:44:43 MST 2013


Hi All,

Using the data from the 2004 paper "Soft Errors in Electronic Memory – A
White Paper" at
http://www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf, and my
system with 16 GBytes of RAM up for 176 days, and making two seriously
invalid assumptions, I computed an estimate of 187 soft errors:
2^(10+4)*9*(10^-9)*(176*24)*300

The "bad" assumptions were 1) that I am using all of the RAM and 2) that it
is being used at full speed.
The other assumption was a FIT rate of 300 (FIT, Failures In Time: errors
per billion (10^9) hours of use, usually reported per Mbit).
The paper suggests FIT rates of a few hundred to a few thousand.
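
For reference, here is the arithmetic above spelled out as a small Python
sketch (not part of the original estimate; the factor of 9 bits per byte is
simply taken from the expression as written):

# Reproducing the estimate: 16 GB expressed as 2^(10+4) MB, 9 bits per
# byte as in the expression above, 176 days of uptime, and an assumed
# FIT rate of 300 failures per Mbit per 10^9 hours.
megabytes     = 2 ** (10 + 4)   # 16 GB of RAM, in MB
bits_per_byte = 9               # as used in the expression above
hours_up      = 176 * 24        # uptime in hours
fit_per_mbit  = 300             # errors per Mbit per 10^9 hours

megabits    = megabytes * bits_per_byte
soft_errors = megabits * fit_per_mbit * hours_up / 1e9
print(round(soft_errors))       # -> 187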

Are you willing to:

Use the wrong data in a calculation,
Execute an incorrect instruction, or
Use an invalid address or pointer?


The "cost" of preventing this by using ECC seems to include:

Slower execution (you have to write an entire ECC word at a time; there are
no independent 8-, 16-, or 32-bit writes; see the read-modify-write sketch
after this list)
More expensive memory (typically an extra bit per byte)
More expensive processors and systems (it takes extra circuitry to
implement the ECC)
A general increase in the quality of hardware construction (if you are
going to sell a system that supports ECC, you need to appeal to the
customers who will buy it)
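
To illustrate the first point, here is a conceptual Python sketch (my own,
not modeled on any particular controller) of why a one-byte store to
ECC-protected memory turns into a read-modify-write of the whole protected
word; check_bits() below is only a stand-in for the real SECDED code a
memory controller would compute:

def check_bits(word64):
    # Placeholder for the controller's SECDED code over 64 data bits:
    # here just an XOR of the 8 bytes, purely for illustration.
    ecc = 0
    for byte in word64.to_bytes(8, "little"):
        ecc ^= byte
    return ecc

def write_byte(memory, addr, value):
    # Store one byte into ECC-protected memory: read-modify-write.
    word_addr = addr & ~0x7                # align to the 64-bit ECC word
    word, _old_ecc = memory[word_addr]     # 1. read the whole word and its check bits
    shift = (addr & 0x7) * 8
    word = (word & ~(0xFF << shift)) | ((value & 0xFF) << shift)  # 2. merge the byte
    memory[word_addr] = (word, check_bits(word))  # 3. recompute ECC, write it all back

memory = {0: (0x1122334455667788, check_bits(0x1122334455667788))}
write_byte(memory, 3, 0xAB)
print(hex(memory[0][0]))                   # 0x11223344ab667788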


Most PC ECC systems I looked at around the year 2000 would also catch all
double-bit errors and all errors within a single nibble.  I would rather
have a process or system stop if an error is encountered.  If they are
computing the syndrome over 128+ bits it could be even better.  IBM
mainframe systems at the time corrected all double-bit errors.
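
As a toy illustration of how the syndrome distinguishes these cases, here
is a Python sketch of SECDED (single-error correction, double-error
detection) on 4 data bits, using a Hamming(7,4) code plus one overall
parity bit.  Real DIMM ECC does the same thing over 64- or 128-bit words;
the details here are only meant to show the idea:

def encode(d):                  # d = [d1, d2, d3, d4], four data bits
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    code = [p1, p2, d1, p3, d2, d3, d4]   # Hamming positions 1..7
    p0 = sum(code) % 2                    # overall parity over those 7 bits
    return [p0] + code                    # index 0 holds p0, index i holds position i

def decode(c):
    # Syndrome bits: parity over the positions each check bit covers.
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s3 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 + 2 * s2 + 4 * s3
    overall_ok = sum(c) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "no error"
    elif syndrome != 0 and not overall_ok:
        c[syndrome] ^= 1                  # single-bit error: correct it
        status = "corrected bit %d" % syndrome
    elif syndrome != 0 and overall_ok:
        status = "double-bit error: detected but NOT correctable"
    else:
        status = "error in the overall parity bit"
    return [c[3], c[5], c[6], c[7]], status   # recovered data bits, diagnosis

word = encode([1, 0, 1, 1])
word[5] ^= 1                              # flip one bit
print(decode(word))                       # ([1, 0, 1, 1], 'corrected bit 5')

word = encode([1, 0, 1, 1])
word[2] ^= 1
word[6] ^= 1                              # flip two bits
print(decode(word))                       # flagged as a double-bit error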

Does Linux, by default, log ECC errors?  If so, where?  If not, how can
logging be turned on?
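
(One place I would start looking, assuming the kernel's EDAC (Error
Detection and Correction) drivers are loaded for the memory controller, is
the counters under /sys/devices/system/edac/mc/; the exact paths vary with
kernel and chipset, so take this only as a sketch:)

import glob

# Corrected (ce_count) and uncorrected (ue_count) error counts, per
# memory controller, as exposed by EDAC through sysfs.
for counter in sorted(glob.glob("/sys/devices/system/edac/mc/mc*/[cu]e_count")):
    with open(counter) as f:
        print(counter, f.read().strip())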

  - Louis

On Thu, Mar 14, 2013 at 12:42 AM, Bexley Hall <bexley401 at yahoo.com> wrote:

> Hi Harry,
>
> On 3/13/2013 5:42 PM, Harry McGregor wrote:
>
>> I would have two issues with a laptop as a server (and yes, I have done
>> it in the past myself).
>>
>> Lack of ECC memory. <--- Memory errors scare me enough that I try and
>> use ECC even on desktop/workstation level systems
>>
>
> What sorts of error rates do you encounter (vs time vs array size)?
> And, more importantly, what *policies* do you have in place to
> deal with the appearance of correctable and uncorrectable errors?
>
> I've never deployed ECC in an embedded system.  Primarily, because
> RAM requirements have never been high and because RAM == DATA
> (not TEXT!  Think:  XIP).  I.e., if you assume your code is
> fetched correctly (if the memory error is a consequence of a
> failure of the actual memory *interface*, then all bets are
> off, regardless of ECC!), then *it* can validate the data on
> which it is operating.
>
> The automation/multimedia system uses the most "writable" memory
> of any system I've designed, to date.  And, dynamically loads
> code so now RAM is not *just* DATA but TEXT as well!  (this is
> slightly inaccurate but not important enough to clarify).
>
> I've been planning on ~1 FIT / MB / year as a rough goal.  So,
> an error or two per day is a conservative upper bound.  [Hard to
> get data on the types of devices I use so this is just a SWAG]
>
> I assume errors are *hard*/repeatable.  So, "correcting" the error
> doesn't really buy anything -- it means that "location" now has
> no error *correction* capability (since that bit is already requiring
> correction so any *other* bits exhibiting failures *can't* be
> corrected!  Unless I used a larger syndrome)
>
> As such, I favor parity over ECC (especially as ECC severely limits
> the implementation choices available to me -- parity can be "bolted
> on"... sometimes  :> ).
>
> I count on invariants sprinkled literally through my code to identify
> likely "bad data".  But, since most of my applications are effectively
> periodic tasks, I can restart them when such a problem manifests
> (and hope for the best).
>
> Runtime diagnostics (e.g., in this case, memory scrubbers) try to
> identify persistent failures so I can mark that section of memory
> as "not to be used" and take it out of the pool.
>
> I figure the bottom line to the *user* is to complain when
> self-diagnosed reliability falls to a point where I have little
> faith in being able to continue operating as intended.  Effectively
> failing a POST after-the-fact.  At which point, the device in
> question will need to be replaced (cheaper than supporting
> replaceable memory).
>
> I imagine the same sort of approach is used in server farms?
> I.e., when reported corrected errors exceed some threshold, the
> device in question (a DIMM, in that case) is replaced?  And, the
> server "noticed" as potentially prone to future problems?
> (e.g., if the memory errors are related to something in the
> DIMM's "operating environment").
>
> Or, is this done on a more ad hoc basis?  Is there ever a post
> mortem performed on the suspected failed devices?  Or, are they
> just considered as "consumables"?
>
> Thx!
> --don
>
> P.S.  The sandwich place has proven to be a big hit with the
> folks to whom I've suggested it!  Thanks!
>
> _______________________________________________
> Tucson Free Unix Group - tfug at tfug.org
> Subscription Options:
> http://www.tfug.org/mailman/listinfo/tfug_tfug.org
>

