[Tfug] Machine crashing woes

John Gruenenfelder johng at as.arizona.edu
Wed Nov 5 19:54:55 MST 2008


On Wed, Oct 29, 2008 at 08:39:46AM -0700, Dean Jones wrote:
>Hi,
>
>I'll offer a few suggestions if you don't mind.
>
>I would think that the problem is with the SATA chipset or something
>motherboard related and not with the drive.
>
>I would run the full 'long version' of SMART tests on your drives to see
>what they say, using another machine if possible.
>
>Hard drive errors will show up in the SMART statistics.  But usually
>with software RAID problems you won't see a full system lock-up;
>instead your logs will be filled with errors while the system carries on
>using the other drive.  This usually does cause long delays as the
>system is busy thrashing the bad and good drives.

I think you're right about the mobo chipset (nForce3 in this case), especially
with all the clicking/blinking/rebooting the machine does until the drives are
finally recognized.
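
(For anyone who wants to run the tests Dean suggested, smartctl from the
smartmontools package does it; the device name below is just an example:

  smartctl -t long /dev/sda      # start the long self-test in the background
  smartctl -l selftest /dev/sda  # check the result once it finishes
  smartctl -A /dev/sda           # dump the SMART attribute table

The long test can take an hour or more on a big drive, but it runs inside the
drive firmware so the machine stays usable.)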

I dug out the little pamphlet that came with the drives and it mentioned a
Samsung utility to test and configure drives.  Fortunately they were good
enough to provide a bootable ISO image and not just a floppy image.  The
motherboard is about 4-5 years old and supports only SATA-150 whereas the
drives can do SATA-300.  Normally this shouldn't be an issue, but...

I used the utility to force the drives to only do SATA-150 and voila!  They
are detected right away and the machine doesn't do a reboot dance at
boot-time.  And the kernel doesn't seem to have any issues either.
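
(If you want to confirm what link speed actually got negotiated without
booting the vendor tool, the kernel logs it at boot.  The sample line below
is roughly what libata prints, from memory:

  dmesg | grep -i 'SATA link'
  # e.g.  ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

1.5 Gbps being SATA-150 and 3.0 Gbps being SATA-300.)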

Now, whether or not this was the cause of the random hard lockups I'll know in
a few more days.  I'm definitely hoping...

>> I recently upgraded my desktop machine which made my old desktop my new
>> fileserver/Mythbox.
>> 
>> The problem is that in its new role it has become rather unstable and I'm
>> having a really hard time figuring out why.
>> 
>> Here are some symptoms:
>> 
>> 1) Two years ago when I upgraded the CPU from an Athlon64 to an Athlon64 X2
>> I lost the ability to reboot.  Machine must be shutdown to restart it.
>> Annoying, but not critical.

In answer to Bexley's question, I should have been clearer.  This
particular problem began two years ago, but the SATA/drive issues and the hard
locks are a new phenomenon.  After frequent boots and reboots using the Debian
installer in rescue mode from a USB stick, I found that the hang wasn't
happening anymore.  But when I let it boot into the normal system with all its
regular modules and daemons, I *cannot* do a full reboot.  It just hangs after
the reset.  I'm thinking maybe the video card?  The rescue image doesn't
load any modules for the video card and the machine behaves properly.  But I
don't have a spare card to test that theory... oh well.
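
A cheaper test I may try in the meantime: blacklist the video driver for one
normal boot and see if the reboot hang disappears.  Something like the
following, where "nvidia" is only a guess at the module name; substitute
whatever lsmod actually shows for the card (and the blacklist file's exact
path varies a bit between Debian releases):

  lsmod                                         # find the loaded video module
  echo "blacklist nvidia" >> /etc/modprobe.d/blacklist
  update-initramfs -u                           # apply it to the initrd too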


I should also mention a bizarre RAID/LVM issue that occurred while I was
trying to troubleshoot these problems.  In an effort to trace the problem to a
particular drive, I unplugged sda and made sdb the new sda.  They're mirrors,
so I figured it would be okay.  It did seem to boot fine, but then I noticed
that only md0 (the /boot RAID-1 array) was detected and running.  The md1
array, which holds an LVM PV which in turn contains root/home/etc., was *not*
detected, yet the kernel had managed to find the volume group on md1's
constituent partition and use it.
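
(For the record, checking what did and didn't get assembled is easy; the
partition name here is an example:

  cat /proc/mdstat            # arrays the kernel currently has running
  mdadm --detail /dev/md0     # members, state, and UUID of a running array
  mdadm --examine /dev/sda2   # read the RAID superblock straight off a partition

the last one being useful on whichever partition should belong to md1.)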

I then put the drives back in their original order and plugs and rebooted.
Still no md1; somehow its superblock had gotten clobbered.  After a lot of
Googling, I found I could safely use "mdadm --create /dev/md1 dev dev" to
recreate the array without destroying the data.  Super.  I do this, adding
only one drive to the array for safety, then adding the spare when all seems
okay.  90 minutes later it's done syncing data and I reboot.
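
For anyone attempting the same recovery, the full commands look something
like this (partition names are examples, and --level/--raid-devices must
match the original array exactly or the data really will be destroyed):

  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 missing
  mdadm --add /dev/md1 /dev/sdb2   # attach the other half once all looks okay
  watch cat /proc/mdstat           # follow the resync

The literal word "missing" is what leaves the second slot empty for safety.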

Now the standard Debian initrd cannot find the pieces of md1 and therefore
cannot find the root FS, halting the boot.  After plenty more digging I
eventually find that recreating md1 gave it a new UUID.  The generated Debian
initrd contains an mdadm.conf file which lists the arrays to bring
up at boot-time and what their UUIDs are.  It gets this data not from probing
(as I initially thought), but from the regular /etc/mdadm/mdadm.conf on the
real root filesystem.  I used "dpkg-reconfigure mdadm" since I couldn't tell
what tool had generated that file, but that didn't fix it.  So I deleted the
existing mdadm.conf and ran the dpkg command again, and that fixed it: a new
config file and a new initrd using the new UUIDs, and the machine boots again.
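
In case it saves somebody else the digging, my understanding is that the
whole fix boils down to regenerating the config from the live arrays and then
rebuilding the initrd:

  mdadm --detail --scan             # prints ARRAY lines with the new UUIDs
  rm /etc/mdadm/mdadm.conf
  dpkg-reconfigure mdadm            # writes a fresh config and a new initrd
  update-initramfs -u               # or rebuild the initrd by hand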

What a mess...  and I still don't know what made the md1 superblock magically
get clobbered without somehow damaging the LVM metadata or my actual
filesystem data.

Just a helpful FYI for anybody else using root-on-LVM-on-RAID with Debian (and
probably Ubuntu).  The installer makes it really easy to set this up, but it
was a huge pain to fix when it broke.


-- 
--John Gruenenfelder    Systems Manager, MKS Imaging Technology, LLC.
Try Weasel Reader for PalmOS  --  http://weaselreader.org
"This is the most fun I've had without being drenched in the blood
of my enemies!"
        --Sam of Sam & Max



