[Tfug] Check file system and restore array

Sat Feb 9 13:19:22 MST 2013

Looks like I might have some time in the next few days to work on this,
the reason I found and joined this list. Sorry for the long post, a lot
of info to explain where it's at. I'm sure there is detail that isn't
needed and stuff needed that I left out. Always seems to be.

I have a Linux computer running an older version of Debian. Its main
function is a DVR using vdr, but it is also used for some storage, a
TeamSpeak server, and a basic web server for file exchange when something
needs to go to more than one person. It has four 500 GB Seagate drives
set up as 2 pairs.

The first pair, sda/sdb, has 3 partitions:

md0 is the boot partition with the OS and other programs.
md1 is a small swap partition.
md2 is for data storage, recordings, etc.

The second pair, sdc/sdd, has a single partition (md3) for more storage
space.

On the 24th I had recorded a couple of shows, but when I went to watch
the second, its playback was bad: lots of screen freezing and problems
that can be either a bad signal or, sometimes, vdr or xine or xorg
getting a memory leak so that the computer needs to be rebooted. Just
restarting vdr isn't always enough. So I rebooted. It had been 200+ days
since the last file system check, so it forced a check, which started
giving lots of errors wanting to move sectors and stuff. It was far
enough along in the boot that it was logged. Once it got past about 50%
it went fine from there. When I looked at kern.log, there were entries
from the 20th about SATA problems. I don't know why the drive wasn't
failed then and an email sent like in the past.

I built the system, but working with the drives/array/file system, after
having spent so much time getting it going, stresses me out. It's too
easy to mess up and lose it all, and I have to go back through notes and
ask lots of questions just about every time I work on some part of the
system because I can't remember from one time to the next. So I thought
I would just take it into a shop and let them sort it out. I saved some
of the logs to a stick and provided that with the computer.

kern.log: http://pastebin.com/Tb7f3jS5
dmesg: http://pastebin.com/rP5hTRHX

I have had drives fail at least 4 times in the past. I've always had
problems with Seagate drives, so I assumed it was a Seagate thing. Most
failures happened after a power down. But when the shop started on the
computer, they said only the fans would power up. The power supply had
gone bad. They replaced that and then found the SATA cable for sda was
flaky. I thought they knew Linux, but it turns out they didn't know that
much. The tech said he disabled the floppy in cmos because it was giving
errors during boot. Well, yeah, grub lets you know if the floppy has no
disk before finishing booting from the hard drives. They didn't find any
other problems, but when I got it back and did a cat /proc/mdstat, I
found 3 arrays were degraded. I also started getting emails confirming
it. (I also found they had turned off AMD cool & quiet and the fan temp
control for both case and cpu fans in cmos, turned the boot logo screen
back on, and turned off memory ECC. Maybe they did a cmos reset.)

-------------------------------------------------------------------
The /proc/mdstat file currently contains the following:

Personalities : [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid1 sdb2[1]
       4891712 blocks [2/1] [_U]

md2 : active raid1 sdb3[1]
       459073344 blocks [2/1] [_U]

md3 : active raid1 sdd1[1] sdc1[0]
       488383936 blocks [2/2] [UU]

md0 : active raid1 sdb1[1]
       24418688 blocks [2/1] [_U]

unused devices: <none>
-------------------------------------------------------------------
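
To pick the degraded arrays out of that output without reading it by
eye, a small shell function can look for an underscore in the
member-status field. This is only a sketch: the name degraded_arrays is
made up for illustration, and it assumes the header-line-then-blocks-line
layout shown above.

```shell
#!/bin/sh
# List md arrays whose member-status field (e.g. [_U]) shows a missing member.
# Reads /proc/mdstat by default; pass a file name to check saved output instead.
# degraded_arrays is a hypothetical helper name, not part of mdadm.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the array named on the header line
        /blocks/ && $NF ~ /_/ { print name }   # an underscore in the last field marks a missing member
    ' "${1:-/proc/mdstat}"
}
```

Called with no argument it reads /proc/mdstat. Against the listing above
it would print md1, md2 and md0, one per line, and skip md3.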

But there may be more :(. I was looking at the logs and noticed that
even though I rebooted and turned ECC back on, the logs still seemed to
show that ECC wasn't supported in cmos, and that while the time stamp on
kern.log and others was updating, nothing new was being added. I access
the computer through WinSCP and PuTTY, and I know dmesg and others often
don't show the latest entry, as they seem to get cached in RAM for a
while before being written to the log. But the other logs always seemed
to get updates right away in the past, which has me wondering if there
are other problems now besides the degraded arrays.

So I need a fairly safe way to check that it is working correctly, and
then I need to figure out, again, how to add the drive back in.

For the array part, from my notes I have this:
https://wiki.ubuntu.com/Grub2
https://help.ubuntu.com/community/Grub2

       # fail the disk (if it is already marked (F), you may skip this
       # step for an already degraded array)

sudo mdadm --manage /dev/md0 --fail /dev/sda1
sudo mdadm --manage /dev/md1 --fail /dev/sda2
sudo mdadm --manage /dev/md2 --fail /dev/sda3

       # remove failed disk (must always be done)

sudo mdadm --manage /dev/md0 --remove /dev/sda1
sudo mdadm --manage /dev/md1 --remove /dev/sda2
sudo mdadm --manage /dev/md2 --remove /dev/sda3

There is no "(F)" showing in mdstat, so I'm guessing that even though it
has degraded the array, it hasn't failed sda? Or removed it from the
array yet?
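
One way to approach that question from the output itself is to list
which members each array still reports. In the listing above md0 shows
only sdb1[1], which suggests the sda partitions are no longer attached
at all rather than attached and marked faulty. That reading is an
assumption and is worth confirming with mdadm --detail /dev/md0 before
running any --fail or --remove. A minimal sketch, with array_members as
a hypothetical helper name:

```shell
#!/bin/sh
# Print the member entries that mdstat lists for one array, for example
# "sdb1[1]", or "sda1[0](F)" for a member that is attached but marked failed.
# array_members is a hypothetical helper name, not part of mdadm.
# Usage: array_members md0 [file]   (file defaults to /proc/mdstat)
array_members() {
    awk -v md="$1" '
        $1 == md && $2 == ":" {
            for (i = 3; i <= NF; i++)
                if ($i ~ /\[[0-9]+\]/) print $i   # device[slot], any (F) flag is kept
        }
    ' "${2:-/proc/mdstat}"
}
```

If array_members md0 prints no sda1 entry, there is nothing left in that
array to fail or remove.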

I also have a note from a web page about needing to zero the superblock
on sda before it can be added back in:

sudo mdadm --zero-superblock /dev/sda3

Assuming I need to do this, does it need to be done for sda1, sda2, and 
sda3?

I think this note is from the last time I replaced a drive, since I log
in as a user, not root, and then often have to use sudo:

       # add disk to raid array
sudo mdadm --manage /dev/md0 --add /dev/sda1
sudo mdadm --manage /dev/md1 --add /dev/sda2
sudo mdadm --manage /dev/md2 --add /dev/sda3
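
Since the same add has to be repeated for three array/partition pairs, a
dry-run wrapper can print the exact commands before anything is changed.
This is a sketch only: readd_members and the RUN variable are
hypothetical names, and the md0/md1/md2 to partition 1/2/3 mapping is
taken from the notes above.

```shell
#!/bin/sh
# Print the mdadm add command for each array/partition pair on one disk.
# With RUN=1 in the environment the commands are executed instead (needs root).
# readd_members and RUN are hypothetical names used only for this sketch.
# Usage: readd_members sda
readd_members() {
    disk=$1
    n=1
    for md in md0 md1 md2; do
        cmd="mdadm --manage /dev/$md --add /dev/${disk}${n}"
        if [ "${RUN:-0}" = 1 ]; then
            $cmd            # run the command as printed
        else
            echo "$cmd"     # dry run: show what would be run
        fi
        n=$((n + 1))
    done
}
```

Running readd_members sda prints the three commands for review. Once
they have actually been run, cat /proc/mdstat shows the rebuild progress
for each array.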

Then I have this note from an IRC chat for getting it back to bootable:

[14:04] <Jordan_U> Vorg: Ok, then run "grub-install /dev/sda && 
grub-install /dev/sdb" (where sda and sdb are the members of the array)

Think I have to use "su" first to switch to root user though.

---------------------------------------------------------------

Then on a side note, when this happened, it was recommended in an IRC
chat to disable something called ncq. From googling it, it has something
to do with ide to sata or sata to ide, not sure, and it can cause drives
to drop from an array and slow down raid. I've had it set up this way
for a few years, but that doesn't mean it's set up right. I used to
rebuild the kernel and update linux every so often, but the new stuff in
the kernel got more and more complex, and it got harder to figure out
what was needed, what wasn't, and what shouldn't be used. At one point I
was told that a certain module wasn't needed any more, and now the dvd
doesn't work because it was. I have the raid stuff built into the kernel
btw, so that I don't have to mess with a ram disk and init or whatever,
booting from a non-raided partition and switching over to raid. I just
boot from the raid 1 partition.

[22:26] <@sj> i read you should disable ncq

[22:41] <@sj> most of what i read in the last few mins all said either 
you have a bad/partially connected sata cable, or you need to disable 
ncq.. although a failing drive is always a possibility

[22:45] <@sj> http://ubuntuforums.org/showthread.php?t=1640909&page=2 
<-- last post, the guy fixed that problem by updating the microcode 
  (note, only applies to Intel, I have AMD)

[22:46] <@sj> http://lists.debian.org/debian-user/2009/07/msg02209.html 
  <-- how to disable ncq
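
For reference, NCQ is Native Command Queuing, a SATA feature that lets a
drive reorder queued requests. On Linux its state can be read per drive
from the queue_depth file in sysfs, and writing 1 to that file as root
switches it off for that drive until the next reboot. Below is a sketch
for checking the current setting. ncq_status and the SYSFS override are
hypothetical names, and the override exists only so the function can be
tried against a scratch directory.

```shell
#!/bin/sh
# Report the queue depth of each listed disk. A depth of 1 means NCQ is
# effectively off for that drive; a larger value means it is in use.
# ncq_status and SYSFS are hypothetical names used only for this sketch.
# Usage: ncq_status sda sdb sdc sdd
ncq_status() {
    for disk in "$@"; do
        f="${SYSFS:-/sys}/block/$disk/device/queue_depth"
        if [ -r "$f" ]; then
            echo "$disk queue_depth=$(cat "$f")"
        else
            echo "$disk queue_depth=unknown"   # no such disk, or no queue_depth file
        fi
    done
}
```

To turn NCQ off for one drive, run echo 1 > /sys/block/sda/device/queue_depth
from a root shell. The setting does not survive a reboot.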