Thursday, January 28, 2010

DROBO WARNING re: 2TB DRIVE SUPPORT

If you plan to use 2TB drives with a 4-bay Drobo, *BE SURE TO UPDATE TO LATEST FIRMWARE*. You'll probably want a full backup before updating, as some people have reported data loss during the firmware update itself. To minimize risk: shut down safely, unplug the power and data cables, remove your drive set, reconnect the empty Drobo and update its firmware, shut down again, and reinsert the drive set.

[actually, the problem is probably triggered by total storage space, not by 2TB drives in particular; however in practice this will only impact 4-bay Drobo users, as 5- and 8-bay units come preinstalled with new firmware... and 4x1TB drives will not hit the limit which triggers the bug]

Apparently earlier firmware versions are known to be *NOT SAFE* for use with 2TB drives. In particular, they allocate too little space for keeping track of the block layout. The drives will work for a while, but once you fill them with data past a certain point, the Drobo starts overwriting the data and/or block layout info at the beginning of the drive set. Once block zero is overwritten, YOUR DRIVE SET IS LOST. You will get 4 solid red drive status lights and the message "Drobo does not detect any hard drives. Please insert a hard drive immediately". This is due to low-level corruption of the drive set, and there is currently no fix for it.

In addition, the power supply which came with your Drobo may not be powerful enough to handle 2TB drives; you need at least 7.5 amps. Check your power supply, and if in doubt call Data Robotics; they will ship you a new one.

Sadly, for a company which promises to protect your data, there is no mention of either issue on their website. The 2TB drive support knowledge base entry simply says "Drobo and DroboPro support 2TB drives" with no mention of firmware or power supply requirements. Both should be indicated in screaming red text on the 2TB support page. Firmware fixes re: data safety should be prominently mentioned on the front page, the support page, and anywhere else that firmware is mentioned. Instead, the Updates page says users should update firmware to "take advantage of the latest features". The latest firmware release notes [PDF] mention mundane issues like OSX compatibility and miscalculation of free space when almost full, but don't mention that firmware updates provide vital data safety fixes for large drives. Nor do they include a cumulative changelog, and earlier firmware release notes are not readily available online. Given that earlier firmware will function with 2TB drives but acts as a time bomb waiting to brick your drive set, a press release / email blast would perhaps be appropriate. I'm sure I'm not the only person who avoids apparently needless firmware updates on the "if it ain't broke, don't mess with it" principle.

To their credit, the engineers are working to find a way to recover some/all data from drive sets that have been corrupted. Like many others, I have been impressed by the customer support and they are taking my issue seriously.

However, there does seem to be a left hand / right hand disconnect of sorts. The engineers were aware of the data corruption problem and implemented a fix, and tech support is aware of the power supply problem, yet the web site is just a sales pitch about the greatness of Drobo and how many invisible safety checks it has to protect your data. This is particularly obnoxious because their forums went private, then were shut down. Even if you are a model customer and do your research, there is no way to know about these two known issues, either one of which could cause total data loss.

...

Of course, you should always have a full backup. An offsite backup for that matter. Without best practices your data will always be at greater risk. That said, a redundant multi-disk device is supposed to protect your data from a single drive failure, not act as a way to lose N drives of data in one fell swoop. Drobo may have a solid tech dept and excellent customer service, but with no warnings about this data loss, there is a disaster waiting to happen for anyone who takes them at their word that 2TB drives are supported.

I don't want to badmouth Data Robotics, particularly as they are attempting a data recovery solution for me (and particularly as I haven't gotten that solution yet, and want to stay on their good side ;) However, the risk of widespread data loss is too severe to stay quiet about, particularly for early adopters who had loaded up their Drobos with data before firmware with 2TB fixes had been released. Therefore I feel it is my duty to warn current Drobo customers of this danger.

Monday, January 4, 2010

drobo disaster: backup drobo drives as disk image dumps to files

copying files to drobo, came back to find it and windows finder crashed.

scanning disk in another system, can read the data.

task: backup raw disk images to files, in case disks become corrupted during recovery, I could restore from the files back onto the disks, thus resetting the state of things to just after crash.

luckily, have a shiny new 14TB opensolaris NAS with space for at least 3 of the drobo drive images.

first, list all the devices recognized by opensolaris system before connecting drobo drive:

media:1:~#zpool status
  pool: mediapool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mediapool   ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c3t4d0  ONLINE       0     0     0

errors: No known data errors


now shut down

# init 0

and add drive 1 from drobo drive set. power on. zpool status: same output.

media:1:~#format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t0d0
          /pci@0,0/pci8086,3a46@1c,3/pci1458,b000@0/disk@0,0
       1. c0t1d0
          /pci@0,0/pci8086,3a46@1c,3/pci1458,b000@0/disk@1,0
       2. c1d0
          /pci@0,0/pci8086,244e@1e/pci-ide@0/ide@0/cmdk@0,0
       3. c2t1d0
          /pci@0,0/pci8086,244e@1e/pci1095,7124@1/disk@1,0
       4. c2t2d0
          /pci@0,0/pci8086,244e@1e/pci1095,7124@1/disk@2,0
       5. c2t3d0
          /pci@0,0/pci8086,244e@1e/pci1095,7124@1/disk@3,0
       6. c3t0d0
          /pci@0,0/pci1458,b005@1f,2/disk@0,0
       7. c3t1d0
          /pci@0,0/pci1458,b005@1f,2/disk@1,0
       8. c3t2d0
          /pci@0,0/pci1458,b005@1f,2/disk@2,0
       9. c3t3d0
          /pci@0,0/pci1458,b005@1f,2/disk@3,0
      10. c3t4d0
          /pci@0,0/pci1458,b005@1f,2/disk@4,0
Specify disk (enter its number): ^C

we see two drives which are not part of the pool. the first is our boot drive:

2. c1d0
/pci@0,0/pci8086,244e@1e/pci-ide@0/ide@0/cmdk@0,0

the other is the drive from drobo:

3. c2t1d0
/pci@0,0/pci8086,244e@1e/pci1095,7124@1/disk@1,0

from http://initialprogramload.blogspot.com/2008/07/how-solaris-disk-device-names-work.html

"The p0 device, eg c1t0d0p0, indicates the whole disk as seen by the BIOS"

first try scanning the drive to verify it is from drobo:

media:18:~#cat /dev/rdsk/c2t1d0p0 | more
���� ^L�-���2� @� - Drobo disk packing available fo �� � c;qt ��� x#խ��� ����NOT EXPUNGEDvailable fo


seems to be. next, inspired by http://docs.sun.com/app/docs/doc/805-7228/6j6q7uf21?a=view

media:19:~#dd if=/dev/rdsk/c2t1d0p0 of=/mediapool/media/Backups/drobo_drive_first_slot.dump bs=512k


underway!
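
once the dd finishes, it seems worth confirming the image really matches the disk before touching the disk again. a minimal sketch, assuming a POSIX shell and cmp; the function name verify_image is made up for illustration and is not part of any tool used above:

```shell
#!/bin/sh
# Hypothetical helper (not from the original notes): compare a source
# device or file against its image, byte for byte.
# cmp -s prints nothing and exits 0 only if both inputs are identical.
verify_image() {
    src=$1
    img=$2
    if cmp -s "$src" "$img"; then
        echo "OK: $img matches $src"
    else
        echo "MISMATCH: $img differs from $src" >&2
        return 1
    fi
}

# Example, using the paths from the dd command above:
# verify_image /dev/rdsk/c2t1d0p0 /mediapool/media/Backups/drobo_drive_first_slot.dump
```

this reads the whole disk a second time, so expect it to take about as long as the dump itself. restoring later would be the same dd with if= and of= swapped.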

still not entirely clear to me: is p0 better than s2 (which is said to generally represent "whole disk")? if it is true that p0 represents entire disk as it appears to BIOS, then it seems we couldn't possibly do any better. still, need to read up.


checking the start of each disk image:


media:4:~>cat /mediapool/media/Backups/drobo_drive_first_slot.dump | more
���� ^L�-���2� @� - Drobo disk packing available fo �� � c;qt ��� x#խ��� ����NOT EXPUNGEDvailable fo
media:5:~>cat /mediapool/media/Backups/drobo_drive_slot_2.dump | more
���� ^L�-���2� @� - Drobo disk packing available fo d� � pt ���<͉����� ����NOT EXPUNGED
media:6:~>cat /mediapool/media/Backups/drobo_drive_slot_3.dump | more
���� ^L�-���2� @� - Drobo disk packing available fo ��� � pt ���&I������ ����NOT EXPUNGED�z�,��� 0�hE�


at least all the header bytes are consistent, that's a positive sign.
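
paging raw bytes through more tends to scramble the terminal. a gentler way to peek at the same header is to read only the first few sectors and replace every non-printable byte with a dot. this is only a sketch; peek_header is a made-up name, and the default of 8 sectors is an arbitrary assumption, not a known Drobo header size:

```shell
#!/bin/sh
# Hypothetical helper: show the first sectors of an image or device as
# text, with every non-printable byte replaced by '.'.
# $1 = image or device path
# $2 = number of 512-byte sectors to read (default 8, an arbitrary choice)
peek_header() {
    dd if="$1" bs=512 count="${2:-8}" 2>/dev/null \
        | LC_ALL=C tr -c '[:print:]' '.' \
        | fold -w 64
    echo    # finish with a newline so the prompt starts on a clean line
}

# Example:
# peek_header /mediapool/media/Backups/drobo_drive_slot_2.dump
```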


after trying an RMA'd replacement unit and the read-only firmware: same results.

questions:

where does Drobo data reside on drives? eg first 20MB? purpose is to compare drives to see if firmware version was updated / read-only flag set
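
one rough way to make that comparison is to count how many bytes differ between the first N MB of two images. the sketch below assumes mktemp is available and enough temp space for two N MB files; diff_head_bytes is a hypothetical name, and the 20MB default just echoes the guess in the question above:

```shell
#!/bin/sh
# Hypothetical helper: count bytes that differ between the first N MB
# of two images.
# $1, $2 = image paths, $3 = size in MB (default 20, a guess)
diff_head_bytes() {
    n=${3:-20}
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    dd if="$1" of="$a" bs=1048576 count="$n" 2>/dev/null
    dd if="$2" of="$b" bs=1048576 count="$n" 2>/dev/null
    # cmp -l prints one line per differing byte (offset plus both values);
    # it exits non-zero when the files differ, hence the "|| true".
    { cmp -l "$a" "$b" 2>/dev/null || true; } | wc -l | tr -d ' '
    rm -f "$a" "$b"
}

# Example:
# diff_head_bytes drobo_drive_first_slot.dump drobo_drive_slot_2.dump 20
```

a small count would suggest mostly shared metadata with a few per-drive fields; running cmp -l directly on the two slices also shows the offsets where they diverge.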

more detail about read fails in logs:
* does it appear to be hardware level?
* is it the first drive that fails? all drives? one drive?
* the "all 4 drive status lights red" state - does this mean each of the 4 drives was tried and failed? or is there logic which sets a "total error state" signified by 4xred drives?
* it was said that failure to read was "catalogue" "layout" or similar data. is there other data - eg firmware version - successfully read before this failure occurs?

next steps:

* is there a firmware with higher debug level?

* post first nMB of each disk somewhere, so engineers can look into why they fail to load?
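
for the "post first nMB" idea, something like the following could cut a slice off the front of each image and compress it for upload. just a sketch: head_slice is a made-up name, and how many MB the engineers would actually need is unknown:

```shell
#!/bin/sh
# Hypothetical helper: save the first N MB of an image as a gzip file.
# $1 = image path, $2 = output .gz path, $3 = size in MB (default 20)
head_slice() {
    dd if="$1" bs=1048576 count="${3:-20}" 2>/dev/null | gzip -c > "$2"
}

# Example:
# head_slice /mediapool/media/Backups/drobo_drive_slot_3.dump slot_3_head.gz 20
```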

eventually:
* send drives for data recovery. was done for other users, and data successfully recovered. it is this level of customer service - proving that you really do care about our data -

in their interest to get to the bottom of the problem:
* can put better logging in; in future won't have to RMA a unit that doesn't actually have a hardware problem
* it might not be a common problem, but it did occur; this could be a rare chance to have a test case to work against in solving it
* the next person to get hit

high-profile users get plenty of love:
http://thestoragearchitect.com/2009/10/19/personal-computing-drobo-weirdness/

and users report sending drives in to datarobotics and data being recovered:
http://blog.theavclub.tv/post/drobo-any-good

"unable to write anything to disk"
-> "failure to write to zone 0"
-> "unrecoverable write error"
-> "read error"
"LBA location" - not a particular area of drive, it is all mapped on the fly

"zone 40693 - double read error"