[Tfug] Persistent Linux and X Crashes. How to track down?

Adrian choprboy at dakotacom.net
Wed Aug 2 11:23:23 MST 2006


On Wednesday 02 August 2006 10:42, Chad Woolley wrote:
> It was crashed again this morning.
> 
> You know, I do think it might be heat.  That has some correlation with
> the crashes.  My "office" is the storeroom out back, so I keep it cool
> during the day (window AC), but it gets hot at night (with 3 computers
> on 24/7).  All fans work.  I have a huge CPU fan, multiple case fans,
> and was very careful with my thermal paste, but it could be another
> component besides the CPU.

Well... as others said, a full set of logs might help track down the problem. 
Particularly look for anything that signifies a device error (e.g. hard disk) 
or possibly an APIC/bus error. However, even though you said you swapped RAM, 
my first thought would be bad RAM or a bad memory bus... Random crashes are 
not something a normal Linux box does. Many of my boxes (running DB/web/mail) 
run for months at a time with continuous multi-user use, only dying when 
the power goes out (which unfortunately seems to happen several times a 
year). Out of dozens of boxes I have administered, I have only had 2 that 
gave me problems: 1) an old devel box with a haxored SCSI bus and drivers 
that I used for extracting data off broken disks, 2) a laptop with bad memory 
that would occasionally, but not consistently or predictably, flip a bit 
without cause.
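
If you want a quick way to skim a big log for that kind of thing, here is a 
rough Python sketch. The keyword list is only my guess at what device and 
APIC/bus errors tend to look like; it is not exhaustive, and the function name 
suspicious_lines is just something I made up for this example.

```python
import re

# ASSUMPTION: these keywords are illustrative guesses, not a complete list.
# Adjust them to whatever your kernel and drivers actually print.
SUSPICIOUS = re.compile(
    r"i/o error|machine check|apic error|seekcomplete error|"
    r"kernel panic|oops|segfault",
    re.IGNORECASE,
)


def suspicious_lines(lines, pattern=SUSPICIOUS):
    """Return the log lines (without trailing newline) that match pattern."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]
```

Feed it an open log file (the location depends on your distro, 
/var/log/messages is one common place) and print whatever comes back. Pay 
most attention to the last hits before each crash.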

I would suggest that you grab a copy of Memtest86 and run it on the machine. 
Just grab the bootable ISO and burn a CD. Plop it in, and when it starts, 
change the configuration to "All tests"... then let it run for a couple of 
hours. If it is a memory/bus error, Memtest86 should find it.

[snip]
> Do any of you know if there's a relatively cheap product that has
> temparature sensors to capture data, and plugs into a usb/serial port?
>  Then I can do trend analysis on the actual temparatures of various
> components, and see if high temparatures correlate with the crashes.
> I'm sure I could build something from scratch, but I don't want to
> take that much time.
> 

Better than that... you can probably use the machine itself to tell you its 
temperature. You should have a "sensors" or "lm_sensors" package (probably 
already installed, but not configured) that can read the temperature sensors 
built into the motherboard at various components. After configuring it 
(run sensors-detect), a quick script could periodically dump the temperature 
to a file that you could review later, looking for trends.
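
Something along these lines would do as the quick script. It is a minimal 
Python sketch, assuming the "sensors" command is on your PATH and prints 
lines that look roughly like "CPU Temp:  +48.0°C"; the exact labels and layout 
vary by chip and lm_sensors version, so the regex may need tweaking. The 
function names and the temperature.log file name are my own inventions.

```python
import os
import re
import subprocess
import time
from datetime import datetime

# ASSUMPTION: temperature lines look like "Label:   +48.0°C ..." or "+48 C".
# Fan (RPM) and voltage (V) lines do not end in C and are skipped.
TEMP_LINE = re.compile(
    r"^([^:\n]+):[ \t]*\+?(-?\d+(?:\.\d+)?)[ \t]*°?C", re.MULTILINE
)


def parse_temperatures(text):
    """Return {label: degrees C} for each temperature line found in text."""
    return {label.strip(): float(value) for label, value in TEMP_LINE.findall(text)}


def format_line(temps, now):
    """Build one tab-separated log line: timestamp, then label=value pairs."""
    fields = "\t".join(f"{name}={value:.1f}" for name, value in sorted(temps.items()))
    return f"{now.isoformat(timespec='seconds')}\t{fields}\n"


def read_sensors():
    """Run the `sensors` command and return its output as text."""
    result = subprocess.run(["sensors"], capture_output=True, text=True, check=True)
    return result.stdout


def log_readings(path="temperature.log", interval_s=60, count=None, read=read_sensors):
    """Append a reading to path every interval_s seconds (forever if count is None)."""
    taken = 0
    while count is None or taken < count:
        line = format_line(parse_temperatures(read()), datetime.now())
        with open(path, "a") as log:
            log.write(line)
            log.flush()
            os.fsync(log.fileno())  # push to disk so the last line survives a crash
        taken += 1
        if count is None or taken < count:
            time.sleep(interval_s)
```

Call log_readings() and leave it running overnight. After the next crash, 
the last few lines of the file show what the temperatures were doing just 
before the box went down.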


Adrian



