Monday, January 4, 2010

drobo disaster: backup drobo drives as disk image dumps to files

was copying files to the drobo; came back to find it and the finder windows had crashed.

scanning a disk in another system, the data can still be read.

task: back up raw disk images to files. if the disks become corrupted during recovery, I can restore from the files back onto the disks, resetting everything to the state just after the crash.

luckily, have a shiny new 14TB opensolaris NAS with space for at least 3 of the drobo drive images.

first, list all the devices recognized by the opensolaris system before connecting a drobo drive:

media:1:~#zpool status
  pool: mediapool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        mediapool     ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            c2t2d0    ONLINE       0     0     0
            c2t3d0    ONLINE       0     0     0
            c3t0d0    ONLINE       0     0     0
            c3t1d0    ONLINE       0     0     0
            c3t2d0    ONLINE       0     0     0
            c3t3d0    ONLINE       0     0     0
            c3t4d0    ONLINE       0     0     0

errors: No known data errors


now shut down

# init 0

and add drive 1 from the drobo drive set. power on. zpool status shows the same output.

media:1:~#format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
0. c0t0d0
/pci@0,0/pci8086,3a46@1c,3/pci1458,b000@0/disk@0,0
1. c0t1d0
/pci@0,0/pci8086,3a46@1c,3/pci1458,b000@0/disk@1,0
2. c1d0
/pci@0,0/pci8086,244e@1e/pci-ide@0/ide@0/cmdk@0,0
3. c2t1d0
/pci@0,0/pci8086,244e@1e/pci1095,7124@1/disk@1,0
4. c2t2d0
/pci@0,0/pci8086,244e@1e/pci1095,7124@1/disk@2,0
5. c2t3d0
/pci@0,0/pci8086,244e@1e/pci1095,7124@1/disk@3,0
6. c3t0d0
/pci@0,0/pci1458,b005@1f,2/disk@0,0
7. c3t1d0
/pci@0,0/pci1458,b005@1f,2/disk@1,0
8. c3t2d0
/pci@0,0/pci1458,b005@1f,2/disk@2,0
9. c3t3d0
/pci@0,0/pci1458,b005@1f,2/disk@3,0
10. c3t4d0
/pci@0,0/pci1458,b005@1f,2/disk@4,0
Specify disk (enter its number): ^C

we see two drives which are not part of the pool. the first is our boot drive:

2. c1d0
/pci@0,0/pci8086,244e@1e/pci-ide@0/ide@0/cmdk@0,0

the other is the drive from drobo:

3. c2t1d0
/pci@0,0/pci8086,244e@1e/pci1095,7124@1/disk@1,0

from http://initialprogramload.blogspot.com/2008/07/how-solaris-disk-device-names-work.html

"The p0 device, eg c1t0d0p0, indicates the whole disk as seen by the BIOS"

first, try scanning the drive to verify it is the one from the drobo:

media:18:~#cat /dev/rdsk/c2t1d0p0 | more
���� ^L�-���2� @� - Drobo disk packing available fo �� � c;qt ��� x#խ��� ����NOT EXPUNGEDvailable fo

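a tidier way to take the same peek, if cat-ing raw binary at the terminal feels risky, is to pull just the first sector through od - something like this (od -c should be in the base install):

dd if=/dev/rdsk/c2t1d0p0 bs=512 count=1 2>/dev/null | od -c | head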

it seems to be - the "Drobo disk packing" string is right at the front. next, inspired by http://docs.sun.com/app/docs/doc/805-7228/6j6q7uf21?a=view, dump the whole raw device to a file:

media:19:~#dd if=/dev/rdsk/c2t1d0p0 of=/mediapool/media/Backups/drobo_drive_first_slot.dump bs=512k


underway!

still not entirely clear to me: is p0 better than s2 (which is generally said to represent the "whole disk")? if p0 really is the entire disk as the BIOS sees it, then it seems we couldn't do any better. still, need to read up.
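
for the record, the restore path - the whole reason for these dumps - should just be the same dd with if and of swapped; untested, and only sensible onto the same drive:

dd if=/mediapool/media/Backups/drobo_drive_first_slot.dump of=/dev/rdsk/c2t1d0p0 bs=512k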


checking the start of each disk image:


media:4:~>cat /mediapool/media/Backups/drobo_drive_first_slot.dump | more
���� ^L�-���2� @� - Drobo disk packing available fo �� � c;qt ��� x#խ��� ����NOT EXPUNGEDvailable fo
media:5:~>cat /mediapool/media/Backups/drobo_drive_slot_2.dump | more
���� ^L�-���2� @� - Drobo disk packing available fo d� � pt ���<͉����� ����NOT EXPUNGED
media:6:~>cat /mediapool/media/Backups/drobo_drive_slot_3.dump | more
���� ^L�-���2� @� - Drobo disk packing available fo ��� � pt ���&I������ ����NOT EXPUNGED�z�,��� 0�hE�


at least all the header bytes are consistent, that's a positive sign.
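
eyeballing with more only shows the first line or so; to compare the fronts of the images more rigorously, something like this checksums the first 20MB of each dump (digest(1) is in the base install; md5sum works too if the GNU tools are on the path):

# 20MB = 40 x 512k blocks; if the headers differ, the hashes will too
for f in /mediapool/media/Backups/drobo_drive_*.dump; do
  printf '%s  ' "$f"
  dd if="$f" bs=512k count=40 2>/dev/null | digest -a md5
done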


after trying an RMA'd unit and read-only firmware: same results.

questions:

where does Drobo's own data reside on the drives? eg the first 20MB? the purpose is to compare drives to see whether the firmware version was updated / a read-only flag set (see the sketch after this list)

more detail about the read failures in the logs:
* does it appear to be hardware level?
* is it the first drive that fails? all drives? one drive?
* the "all 4 drive status lights red" state - does this mean each of the 4 drives was tried and failed? or is there logic which sets a "total error state" signified by 4xred drives?
* it was said that the failed read was of "catalogue"/"layout" or similar data. is there other data - eg the firmware version - successfully read before this failure occurs?
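
one cheap way to chip at the first question without knowing the on-disk layout: run strings over the front of each dump and diff the results - if the firmware version or a read-only flag is stored in cleartext near the start, it should show up as a difference. a sketch, assuming the dump paths above:

# pull printable strings from the first 20MB of each dump...
for f in /mediapool/media/Backups/drobo_drive_*.dump; do
  dd if="$f" bs=512k count=40 2>/dev/null | strings > "/tmp/`basename $f`.strings"
done
# ...then diff any pair; per-drive flags or version strings near the
# front would show up here
diff /tmp/drobo_drive_first_slot.dump.strings /tmp/drobo_drive_slot_2.dump.strings | head -40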

next steps:

* is there a firmware with higher debug level?

* post the first n MB of each disk somewhere, so engineers can look into why they fail to load?

eventually:
* send the drives in for data recovery. this was done for other users, and their data was successfully recovered. it is this level of customer service - proving that you really do care about our data - that makes the difference.

it's in their interest to get to the bottom of the problem:
* better logging can go in; in future they won't have to RMA a unit that doesn't actually have a hardware problem
* it might not be a common problem, but it did occur; this could be a rare chance to have a test case to work against in solving it
* it could spare the next person to get hit by this

high-profile users get plenty of love:
http://thestoragearchitect.com/2009/10/19/personal-computing-drobo-weirdness/

and users report sending drives in to datarobotics and data being recovered:
http://blog.theavclub.tv/post/drobo-any-good

"unable to write anything to disk"
-> "failure to write to zone 0"
-> "unrecoverable write error"
-> "read error"
"LBA location" - not a particular area of drive, it is all mapped on the fly

"zone 40693 - double read error"
