[Tfug] Machine crashing woes

Bexley Hall bexley401 at yahoo.com
Tue Oct 28 19:49:13 MST 2008


> I recently upgraded my desktop machine which made my old
> desktop my new fileserver/Mythbox.
> 
> The problem is that in its new role it has become rather
> unstable and I'm having a really hard time figuring out why.
> 
> Here are some symptoms:
> 
> 1) Two years ago when I upgraded the CPU from an Athlon64
> to an Athlon64 X2 I
> lost the ability to reboot.  Machine must be shut down to
> restart it. Annoying, but not critical.

Is this the *only* symptom that was immediately associated
with the processor swap?  I.e., are all of those below
"more recent"?

> 2) At boot, when scanning the SATA bus, sometimes it seems
> that the BIOS
> cannot find the two connected drives and keeps rebooting
> (this reboot does work for some reason) until it succeeds.
> 
> 3) During the Linux kernel boot at approx. +5 seconds, the
> kernel's SATA driver
> scans the bus and sometimes it too has trouble.  Shutting
> off/on can fix
> this.  Letting it continue to try, it eventually succeeded. 
> It skipped sda,
> found sdb, and then a few seconds later found sda, though
> this caused the RAID-1 array to need a resync.
> 
> 
> Those are the only odd things.  I never see anything in
> the kernel logs
> indicating hardware problems.  It will occasionally just
> lock up hard and even Magic SysRq won't work.
> 
> If it was a faulty drive, can't the kernel semi-recover
> from this?  At the
> very least, shouldn't I see some log messages?  And
> even if a drive dies, it
> will hose the system but should not cause a hard lock,
> right?

It depends on what is "hanging".  Designing hardened drivers
is a real art form.  Most folks writing software aren't
intimately familiar with hardware and the vagaries that
can happen when it's *not* working (properly).  As a result,
their code can *look* bulletproof but only if everything
behaves as they *assume* it will.

> Temperatures do not seem overly high.  About 36C for the
> drives, 37-42C for the CPU, system/case at ~44C.
> 
> It has been suggested that power draw might be an issue,
> but I'm not sure.
> The desktop configuration had one HDD, one audio card, and
> a decent video
> card.  Mythbox config has two HDDs, one MPEG card, and the
> same video card.
> The box is a small Shuttle case and those don't
> typically have powerful PSUs,
> but I'm not using even 1/20 the capability of the video
> card so it seems that
> power usage shouldn't be a problem.
> 
> Without the kernel to give me some hints I'm at
> something of a loss as to what
> the problem is.  If it *is* a drive, I need to find out
> soon so I can RMA it
> (both drives are new).  If it's the MB... I don't
> know.  Not sure I can afford to replace it just yet.
> 
> My next course of action is to swap sda and sdb.  Then I
> can maybe see if the
> kernel boot SATA stall occurs on sdb instead of sda.  I
> hate hardware issues...

<grin>  No one *likes* them! ;-)

Can you *temporarily* replace the disk(s) with something
else?  PATA, SCSI, even USB?  You don't have to rebuild
an entire system image... just enough to change the 
conditions it is operating under currently.

I would also examine all of the electrolytic capacitors on
the motherboard -- especially those proximate to the processor!
Modern processors draw *huge* peak currents -- hence the need
for so much bulk decoupling around the processor.

Again, looking for easy things to alter the current operating
configuration, you might try swapping the memory DIMMs (you
don't have to replace them, just reshuffle them).  This can
address any intermittent failure that might be present in a
DIMM as well as causing you to reseat the devices (if you have
compressed air handy, give the sockets a blast prior to
reinstalling them).

You can also *carefully* remove the CPU and verify that it has
no bent contacts, etc.

I would caution you re: ESD but suspect your humidity is
considerably higher than ours, today!  (though you should still
observe precautions).

If there are activity LEDs on the disks, etc., you might note
if they blink vs. "latch on", etc.  E.g., if you don't see the
drive being accessed when you think it *should* be, that's a
clue.  Likewise, if you see the indicator come on solid for
20 seconds (and you know the drive has already spun up), you
might wonder what the hell is going on...

If you think the power supply may be stressed, disconnect
any "unnecessary" peripherals (temporarily) just to see how that
changes the symptoms.  Note you can even pull the video card
and telnet to the box, etc.

<shrug>  Sorry, can't be more specific without seeing the
machine.  Just try changing the configuration in simple ways
(so you don't waste a sh*tload of time) and see if *anything*
changes.
