[Tfug] ECC (was Re: Using a Laptop as a server)

Bexley Hall bexley401 at yahoo.com
Thu Mar 14 00:42:10 MST 2013


Hi Harry,

On 3/13/2013 5:42 PM, Harry McGregor wrote:
> I would have two issues with a laptop as a server (and yes, I have done
> it in the past myself).
>
> Lack of ECC memory. <--- Memory errors scare me enough that I try and
> use ECC even on desktop/workstation level systems

What sorts of error rates do you encounter (vs time vs array size)?
And, more importantly, what *policies* do you have in place to
deal with the appearance of correctable and uncorrectable errors?

I've never deployed ECC in an embedded system, primarily because
RAM requirements have never been high and because RAM == DATA
(not TEXT!  Think:  XIP).  I.e., if you assume your code is
fetched correctly (if the memory error is a consequence of a
failure of the actual memory *interface*, then all bets are
off, regardless of ECC!), then *it* can validate the data on
which it is operating.

The automation/multimedia system uses the most "writable" memory
of any system I've designed, to date.  And, dynamically loads
code so now RAM is not *just* DATA but TEXT as well!  (this is
slightly inaccurate but not important enough to clarify).

I've been planning on ~1 FIT / MB / year as a rough goal.  So,
an error or two per day is a conservative upper bound.  [Hard to
get data on the types of devices I use so this is just a SWAG]
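For what it's worth, the unit conversion is easy to sanity-check
(FIT is failures per 10^9 device-hours by definition; the array
size below is just an assumed example, not any particular hardware):

```python
# FIT = failures per 1e9 device-hours (standard definition).
# The rate and array size here are assumed examples, not measurements.
fit_per_mb = 1.0                   # the ~1 FIT / MB planning figure above
mem_mb = 256                       # hypothetical memory array size
hours_per_year = 24 * 365

errors_per_year = fit_per_mb * mem_mb * hours_per_year / 1e9
print(errors_per_year)             # ~0.0022 -- so "an error or two per day"
                                   # really is a very conservative bound
```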

I assume errors are *hard*/repeatable.  So, "correcting" the error
doesn't really buy anything -- it means that "location" now has
no error *correction* capability left: with one bit already
requiring correction, any *other* bit that fails there *can't*
be corrected (unless I used a larger syndrome).
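To make that concrete, here's a toy SECDED sketch -- Hamming(7,4)
plus an overall parity bit -- written purely for illustration (not
code from any of the systems discussed): one flipped bit is
corrected, but a second flip on top of a "stuck" bit is only
*detected*, never fixed:

```python
# Toy SECDED = Hamming(7,4) + an overall parity bit (illustration only).
def encode(d):                     # d: four data bits
    p1 = d[0] ^ d[1] ^ d[3]        # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]        # covers codeword positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]        # covers codeword positions 4,5,6,7
    word = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    return word + [sum(word) % 2]  # overall parity tells 1 from 2 errors

def decode(word8):
    word = list(word8[:7])
    s = 0                          # syndrome: XOR of 1-based set positions
    for i, b in enumerate(word, start=1):
        if b:
            s ^= i
    parity_ok = sum(word8) % 2 == 0
    if s == 0:
        return ("ok", word)
    if not parity_ok:              # odd number of flips: assume one, fix it
        word[s - 1] ^= 1
        return ("corrected", word)
    return ("uncorrectable", word)  # two flips: detected, not fixable

cw = encode([1, 0, 1, 1])
stuck = list(cw); stuck[2] ^= 1    # one hard-failed bit: still correctable
assert decode(stuck)[0] == "corrected"
flipped = list(stuck); flipped[5] ^= 1  # a second (soft) error on top of it
assert decode(flipped)[0] == "uncorrectable"
```

The stuck bit consumes the entire correction budget for that word;
from then on the code can only wave a flag.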

As such, I favor parity over ECC (especially as ECC severely limits
the implementation choices available to me -- parity can be "bolted
on"... sometimes  :> ).

I count on invariants sprinkled liberally throughout my code to identify
likely "bad data".  But, since most of my applications are effectively
periodic tasks, I can restart them when such a problem manifests
(and hope for the best).

Runtime diagnostics (e.g., in this case, memory scrubbers) try to
identify persistent failures so I can mark that section of memory
as "not to be used" and take it out of the pool.
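The bookkeeping for that is simple enough; a sketch of the
retire-on-hard-error idea (the names and the fake "march test"
stand-in are invented for illustration):

```python
# Sketch of page retirement by a memory scrubber (names are invented).
free_pool = set(range(64))         # page numbers the allocator may hand out
retired = set()

def page_has_hard_error(page):
    # Stand-in for a real march test (write/read-back of stress patterns).
    return page in (7, 42)         # pretend these two pages have stuck bits

def scrub_pass():
    for page in sorted(free_pool):
        if page_has_hard_error(page):
            free_pool.discard(page)
            retired.add(page)      # marked "not to be used" from now on

scrub_pass()
assert retired == {7, 42} and not retired & free_pool
```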

I figure the bottom line to the *user* is to complain when
self-diagnosed reliability falls to a point where I have little
faith in being able to continue operating as intended.  Effectively
failing a POST after-the-fact.  At which point, the device in
question will need to be replaced (cheaper than supporting
replaceable memory).
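The "complain" decision could be as dumb as a retirement-rate
threshold (the 5% figure below is an invented example, not a
number from anything above):

```python
# Fail POST "after the fact": give up once too much memory is retired.
TOTAL_PAGES = 64
RETIRED_LIMIT = 0.05               # invented threshold: >5% retired = complain

def still_trustworthy(retired_pages):
    return len(retired_pages) / TOTAL_PAGES <= RETIRED_LIMIT

assert still_trustworthy({7, 42})             # 2/64 ~ 3%: keep running
assert not still_trustworthy(set(range(5)))   # 5/64 ~ 8%: time to replace
```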

I imagine the same sort of approach is used in server farms?
I.e., when reported corrected errors exceed some threshold, the
device in question (a DIMM, in that case) is replaced?  And, the
server "noted" as potentially prone to future problems?
(e.g., if the memory errors are related to something in the
DIMM's "operating environment").

Or, is this done on a more ad hoc basis?  Is there ever a post
mortem performed on the suspected failed devices?  Or, are they
just considered "consumables"?

Thx!
--don

P.S.  The sandwich place has proven to be a big hit with the
folks to whom I've suggested it!  Thanks!



