[Tfug] Hardware reliability

Bexley Hall bexley401 at yahoo.com
Sun Apr 19 15:01:23 MST 2009


--- On Sun, 4/19/09, Zack Williams <zdwzdw at gmail.com> wrote:

> > Oh, I've *seen* lots of dead drives but never *experienced*
> > one, myself.  I attribute this to having many redundant
> > backups  :-/  (i.e., if you have NO backup, you can rest
> > assured ALL of your disks will die, etc.)
> 
> Just as a curiosity, what are you doing for backup?

The bulk of my "backup" needs tend to be largely archival.
I.e., 30 years of hardware designs, specifications, software,
etc. that I have pretty much moved entirely to electronic
media.  A typical "project" may fill a "Bankers Box" with
paper; needless to say, this quickly eats up a LOT of
room.  And, when you have to *move*, it breaks backs!  :-/

For each hardware design, I have specifications that need
to be preserved (along with the "business" aspect of the
project -- contracts, invoices, check stubs, receipts, etc.).
Plus, the actual schematic diagrams, PCB artwork, parts lists
(bill of materials), order forms for prototype parts, etc.
Then, copies of the datasheets for each of the components
I selected in the design (since you might not be able to find
documentation for those parts a few years down the road...
it's annoying trying to modify some code to talk to a
peripheral when you don't have documentation for what that
peripheral's interface is!).

The software is considerably easier to archive -- since most
of this stuff starts out in the electronic domain!  But, I
also have to archive all the regression tests and their results,
document the tools used to run the tests, etc.

Plus, archive any tools -- including operating systems (!) -- that
were used to build/design the hardware/software.

[this was one of the primary reasons I moved away from commercial
software for my development toolchain... tools that would only
run on certain types of OS's, or on certain types of host
machines, etc.  <frown> ]

Most of my "archives" are preserved on CD-ROM, MO cartridges,
DLT tape and "on-line RAID5" (though the disks are seldom
actually spinning).  I haven't moved any archives to DVD-ROM
as it's just another big time sink (and, CD's are nice because
I can often fit an entire project on one or two CD's...
easier -- psychologically -- to shuffle through a deck of
CD's looking for *this* particular project).  I've never been
happy with any sort of hierarchy that I put together for
storing projects on-line... it always seems easier to just
flip through a pile of CD's in search of whatever I need
(though that pile is pretty big).

For what you would consider more *typical* backup, I usually
push snapshots of the relevant portion of my filesystem
onto some other machine (e.g., rtar(1) it to another host).
I do regular tape backups (full, not incremental... one less
detail to keep track of) to cover me for something like a
lightning strike taking out two machines at the same time
(though I try not to keep any machines "up" for very long
just so I have a steady "mix" of repositories for current
snapshots).
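
For the curious, the snapshot push boils down to something like
this (a rough Python sketch of the idea, *not* the actual rtar(1)
invocation -- the host name and paths below are just placeholders):

    #!/usr/bin/env python3
    # Rough sketch: tar up a subtree and stream it to another host over ssh.
    # Assumes passwordless ssh; SRC/DEST_HOST/DEST_FILE are placeholders.
    import subprocess
    from datetime import date

    SRC = "/home/don/projects"          # subtree to snapshot (example)
    DEST_HOST = "otherbox"              # whichever machine happens to be "up"
    DEST_FILE = f"/backup/snap-{date.today():%Y%m%d}.tar.gz"

    # Full (not incremental) snapshot: tar locally, pipe across ssh,
    # write the tarball on the remote end.
    tar = subprocess.Popen(["tar", "czf", "-", SRC], stdout=subprocess.PIPE)
    subprocess.run(["ssh", DEST_HOST, f"cat > {DEST_FILE}"],
                   stdin=tar.stdout, check=True)
    tar.stdout.close()
    tar.wait()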

I've not considered what the electromagnetic effects of a
direct strike might be on magnetic media (I have first hand
experience of a strike "magnetizing" a TV enough that it
took months of power cycles for the degaussing coil to
completely restore color purity).

<shrug>  If the house burns down, I'm SOL.  So, you just
have to figure out what you want to do and how safe it makes
you feel.

To date, I have avoided any *big* disks (e.g., 1TB+).  I
figure it would be too seductive:  just put *everything*
on that drive... save space, etc.  It's just too easy for
me to imagine a single failure taking out the entire
archive that way:  manufacturing defect, operating system
defect (the only time I "lost" a disk was due to an OS bug),
operator error ("rm -r"), physically *dropping* the drive,
etc.  Buying *two* drives is just another layer of false
security -- the same OS bug that caused me to lose the
first drive reliably trashed the backup copy of that drive
when I installed *it* (not realizing that the problem was
software related and not a hardware failure).

<shrug>  Hence my reliance on different types of media.
Hard to imagine all of them being compromised.  Or, *me*
not being cautious enough to install the write protect
strap on the disk drive before connecting (OTOH, it is
"second nature" for me to flip the write protect tab on
an MO drive; and, I'd have to deliberately try to
overwrite a CD-R in order to trash it, etc.)

> I've got disk to disk, disk to disk to (removable) disk,
> and disk to disk to tape running on a  few systems, plus 
> using version control like subversion or git to replicate
> data that is used on multiple machines.

In my case, I am only working/developing on a single machine
(well, maybe writing code on one machine, designing hardware
on another and working on mechanical packaging on a third;
but, each task really only exists in one place).  So, CVS
works fine if I want/need to figure out how I got to where
I am currently.  If I do something stupid and blow away
part of my file hierarchy, then I just power up machines
until I find whichever has the most recent tarball and drag
it back over.  (This doesn't happen that often.  If it is
inconvenient enough for you to recover from a screwup, you
tend to make fewer screwups!  :-/ )

The only things that are truly shared among machines are
things that affect my local network.  E.g., /etc contents.
And, it's usually easier to reconstruct this from *one*
surviving machine than it would be to remember to take
snapshots of each and archive them.
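
(And if I ever *did* want to automate that reconstruction, it
would only be a few lines -- something like this hypothetical
sketch that pulls the handful of network-wide files back from
whichever box survived; the host name and file list are just
examples:)

    #!/usr/bin/env python3
    # Hypothetical sketch: copy the shared /etc files back from a surviving
    # machine into a staging directory for review.  Host name and file list
    # are examples only.
    import os
    import subprocess

    SURVIVOR = "otherbox"                # whichever box is still alive
    SHARED = ["/etc/hosts", "/etc/resolv.conf", "/etc/exports"]
    STAGING = "/tmp/etc-recovered"

    os.makedirs(STAGING, exist_ok=True)
    for path in SHARED:
        # scp each file from the survivor; review before copying into place
        subprocess.run(["scp", f"{SURVIVOR}:{path}", STAGING], check=True)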

> Additionally, I've got ZFS snapshots on the opensolaris
> machines I run.
> 
> > I've been tempted to pull apart some of the larger disks
> > I've come across just to see what has failed on the controllers
> > (e.g., those that don't spin up).  But, it doesn't make
> > sense economically -- and I've far too little "free time"
> > to satisfy that bit of curiosity.  :-/
> 
> Especially with a new 1TB drive costing around $100 these
> days.

Exactly.  And, in my case, if I am avoiding big media to begin
with, I have far less incentive to fix a bad disk.

My curiosity comes from watching 500G drives get tossed in the
metal recycling bin at WorldCare; it seems a shame if there is
something simple I could do to breathe new life into them  :<

> I do have a pile of older 80GB and 120GB IDE drives I've DBAN'ed
> and have passed the manufacturer diagnostic in long mode to put in
> machines that can't support LBA48.
> 
> At Macworld last January, I talked to the guys at DriveSavers, and
> they said the order of failure on non-physically damaged
> drives was pretty much:

Note the conditional there -- "non-physically damaged".
I saw some proprietary documentation from an (unnamed) disk
manufacturer showing that some *huge* percentage of the
drives returned to them as "defective" were actually *not*
defective (something like 60% or more!).  Rather, the
"problems" were OS bugs or folks not knowing what to expect
when "drive errors" crept in.

>  1. Component failure on the controller board
>  2. Failure of the head solenoid
>  3. Failure of the motor that spins the platter
> 
> Supposedly it's fairly common that they can take a good
> controller board off a drive with the same firmware, hook 
> it up and the drive works like new.

Yes.  Unfortunately, I don't do the "swap boards" type of repairs
(you would need a large supply of the same model number, etc.).
Rather, I troubleshoot at component level -- VOM and oscilloscope.
So, replacing a "bad FET" would be a viable option for me
whereas "swapping controller boards" would mean holding onto
this bad drive in the hope that an identical bad drive would
come in with a *different* failure (e.g., in the mechanism
instead of the controller).

<shrug>  I guess that's something to look forward to for
"retirement"!  :>  :<

--don
