Saturday, November 28, 2009

OpenSolaris as windows storage appliance: bulletproofing

Running OpenSolaris snv_124 via EON: created a ZFS pool and dataset, set sharesmb=guestok=true, browsed to the CIFS share from a Windows XP box, and ran some tests.

Copy two files to the share by drag-and-drop in Explorer, and yank power to the server before the second file finishes copying. After powering the server back up, both files are there with the correct (full) file size. The only hint that the second file might not be complete/correct is the timestamp, which was left at the time of copy rather than the time of the original file. Of course, checksums differ between the original copy on the Windows box and the unfinished copy on the server.

I understand that this behaviour is not OpenSolaris' fault. However, it is undesirable: it is non-transactional, and a form of data corruption. If the connection to the server is lost during a bulk transfer over CIFS, the user has to hunt down which file(s) might be incomplete and manually check/delete them. A backup solution which compares file size and date might fail; it would need to checksum all files or have "date inversion -> recopy" logic (and I do not know whether the timestamp is a 100% reliable indicator of an incomplete write). If a user is prompted "do you wish to overwrite file size 123456 with older version size 123456" they will likely decline, leaving the copy on the server incorrect.

I also understand that this situation can be avoided: only write files using a utility which writes to a tmp file and then renames, for instance. However, it would increase the value of OpenSolaris as a storage appliance to be as bulletproof as possible against *any* use by Windows users.
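For illustration, the tmp-then-rename pattern in shell (names made up; the point is that a rename within a filesystem is atomic, so a crash mid-copy leaves only an obviously-temporary file behind):

cp bigfile.dat /share/bigfile.dat.tmp && mv /share/bigfile.dat.tmp /share/bigfile.dat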

Ideally, incomplete files should never show their full file size. This would allow any dumb or manual backup process to detect the incompleteness. For instance, right click -> Properties on a parent folder would show a difference in size vs the original, indicating that the data within is not complete. Basic file-size comparison would trigger a recopy without relying on more esoteric properties such as the timestamp.

Is there any currently available method to mitigate this from the OpenSolaris side? E.g.:

1. Making sure that file sizes always reflect the amount of data actually written
2. A CIFS option to implement file writes as tmp->rename... this way incomplete files appear to be missing (although without 1., some methods, e.g. checking the size of a parent folder, would still mislead users)
3. Some method to remove a file completely if its write did not complete, perhaps relying on the ZIL mechanism so it can be applied at startup after a server crash.
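Until something like the above exists, the only dependable check I know of is brute-force checksumming on both sides after a copy, e.g. with GNU md5sum (paths hypothetical):

md5sum *.dat > manifest.md5   # on the source machine, before copying
md5sum -c manifest.md5        # on the server, after copying the files and the manifest over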

Failing that, is there any one-time configuration for a Windows machine that changes its behaviour to increase robustness?

I am not familiar with the details of CIFS or how the various versions of Windows handle file copy/move, so it is likely that the suggestions above are impossible or don't make sense. However, finding a way to make sure that incomplete file/batch copies are immediately detectable would make OpenSolaris more robust as a storage appliance.

Pointers from those in the know about what can and can't currently be done, what a future solution might look like, or why the above is totally wrong-headed are appreciated.

Thursday, November 26, 2009

EON NAS

A LiveCD version of OpenSolaris meant for NAS use. Unlike FreeNAS, it does not offer a slick web GUI for management (although work is being done to get a basic web GUI in place). However, because it is built off OpenSolaris with releases tracking the latest dev builds, you get the most current ZFS implementation possible without having to sysadmin and upgrade a full OpenSolaris system. Because ZFS is self-contained on a set of drives, the setup is pretty resilient even if your hardware dies or you lose your config when upgrading... just run the latest EON build, import your old ZFS pool, reconfigure network and shares, good to go.
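That recovery flow is only a couple of commands (pool name assumed):

zpool import          # scan attached disks and list importable pools
zpool import -f tank  # bring the listed pool online under the new build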



EON:

http://sites.google.com/site/eonstorage/
http://eonstorage.blogspot.com/

REFERENCES:

http://docs.sun.com/app/docs/doc/819-2240/zfs-1m
http://docs.sun.com/app/docs/doc/819-2240/share-1m generic options for sharenfs (and sharesmb?)
http://docs.sun.com/app/docs/doc/819-2240/share-nfs-1m nfs-specific options for sharenfs
http://dlc.sun.com/osol/docs/content/SSMBAG/troubleshooting.html CIFS troubleshooting (OS doc)
http://wiki.genunix.org/wiki/index.php/Solaris_CIFS_Service_Troubleshooting CIFS troubleshooting (genunix wiki)
http://blogs.sun.com/dougm/entry/sharemgr_and_zfs two ways to set nfs options


Download eon-0.594-124-64-cifs.iso from http://www.genunix.org/ ... not entirely clear on SMB vs CIFS, but it seems CIFS is the later and greater of the two.

http://wiki.genunix.org/wiki/index.php/Getting_Started_With_the_Solaris_CIFS_Service
http://docs.sun.com/app/docs/doc/819-2240/zpool-1m?a=view
http://docs.sun.com/app/docs/doc/819-2240/share-nfs-1m?a=view

Burn to a CD - I would like to put it directly on a USB drive, but it isn't offered in that format yet, so the install path is ISO -> CD -> boot -> install to USB from the booted system.

Boot from CD; everything comes up to the "eon console login" prompt without problem.

Following the guide at the bottom of the EON homepage, log in with user: root pass: eonsolaris and run /usr/bin/setup. I have 2 NICs, so I configured one as part of my network (10.0.1.250) and the other for a potential direct connection (192.168.1.250). Then /usr/bin/install.sh to install to a bootable USB drive... I ran this once before plugging in my USB drive, ctrl-C'd to cancel, plugged in the USB drive and ran it again; this way I could be sure that the new device which appeared on the second run was my USB drive. install.sh prompted me a couple of times to confirm writing data to the thumb drive, then hung for a while, apparently trying to access some disk that didn't exist. I waited patiently, and eventually the thumb drive started flashing again and install.sh finished.

I then ran /usr/bin/updimg.sh as suggested by the guide, although it wasn't entirely clear to me why... I guess the image written to the thumb drive was the pre-install version, and now I needed to immediately update it? Regardless, follow the instructions. Unfortunately, updimg.sh requires the path of the image to be written. The suggested path is /mnt/eon0/boot/x86.eon, but /mnt/eon0, eon1 and eon2 were all mounted read-only. I replugged my USB drive; it flashed, and dmesg showed it "online", but it did not automount. I started install.sh again just to grab the device number of my USB disk (c1t0d0), then:

mkdir /mnt/newroot
mount /dev/dsk/c1t0d0s0 /mnt/newroot

ls /mnt/newroot/boot shows that there is already an x86.eon there, so I doubt this step is necessary... but regardless:

/usr/bin/updimg.sh /mnt/newroot/boot/x86.eon

It took a while and finished with no errors; I removed the CD and rebooted.

On reboot, the BIOS seemed broken and wouldn't get past the POST screen. Hard power off, wait for a while, power on. Still no go. Unplug all devices and drives except for the PS2 keyboard, reboot... get past POST to DISK BOOT FAILURE. Plug in the USB drive and reboot: hangs in POST... so it definitely seems to be the newly formatted thumb drive that is breaking boot. Can't find a way around it, let's try again...

Boot from CD, skip setup (which was only NICs anyway), install.sh to a hard drive partition this time. install.sh mentions that it creates boot/x86.eon, so don't bother with the troublesome updimg step. Reboot w/o CD. Boots from HD ok.

Really would prefer USB, so repeat the streamlined install to a USB key. Still no dice... it leaves the USB key in a state that hangs POST. Just to be sure, flash the same drive with a FreeNAS boot image using the m0n0wall method and boot from that - no problem. So it seems to be an issue with the way EON's install.sh preps the drive.

WELL, I was unable to get the official OpenSolaris dev .usb images to boot either (they had the same problem), and since OpenSolaris seems to be the only full-featured / robust ZFS implementation, I'm stuck with this limitation for now. As the main OpenSolaris branch lacks CIFS and includes a GUI, I'm back to messing with EON.

Boot from the installed HD image. 'mkfile 5000M /mnt/eon0/file1' and likewise file2, then

zpool create testpool /mnt/eon0/file1
zfs create testpool/testfs
df -k [shows testpool/testfs with 1% used]

Now that zfs is up, share it over CIFS... I expected to need 'zfs set sharesmb', but http://sites.google.com/site/eonstorage/getting-started only has us setting permissions... because net share is the default in EON? Let's try it out.

groupadd -g 600 media
useradd -u 502 -g 600 -s /bin/bash mediausr
passwd mediausr
cd /testpool
chown -R mediausr:media testfs
chmod 755 testfs

/usr/bin/setup to initialize interfaces
/usr/bin/updimg.sh /mnt/eon0/boot/x86.eon

plug in the network cable. On the console, get NOTICEs from smbsrv of "share not found" for . and zfstestpool

windows: start -> run -> \\10.0.1.250\ results in connection, but resulting window shows no shares

"zfs get all" confirms that sharesmb is not enabled, so

zfs set sharesmb=on testpool/testfs

back to windows and try again: now testpool_testfs shows up! enter username but passwd not accepted

zfs set sharesmb=guestok=true testpool/testfs

windows: re-open \\...250\ and am now able to open testpool_testfs\; however, write access is denied... set guestok back to false and troubleshoot write access... get a password prompt, but mediausr/media is not accepted.

test mediausr account by ssh'ing in with mediausr/media - it works. so user account is OK, it is just CIFS authentication that doesn't link up to it.

chmod A-everyone@:full_set:fd:allow testfs

windows: still keeps re-prompting me for my password. Set guestok=true again.

Can now write files! At some point I'll need it password-protected for write, but for now I can move on to tests.

copy two files, yank power halfway through second file

power up, restore testpool:
zpool import -d /mnt/eon0/ testpool
zpool status
ls -l /testpool/testfs

Looks good; I see only the first file. The partial second file does not appear, even though it had been listed with full size as the copy began... so unlike my FreeNAS tests, this does not leave invalid but apparently-correct files (which would require checksumming to detect). However, trying to checksum the first file on the NAS silently fails, and trying to copy it back to the desktop to checksum locally fails with an "I/O error" message. If I copy file 2 over to EON I can copy it back, but file 1 still gives the I/O error.

zpool status -v shows the error with the file. HMMM, is this a case of a bad disk, the ZIL not flushing, or what? On the plus side the error was caught, but on the negative side there was corruption. Then again, the EON install root is not on ZFS, so I can't expect it to be bulletproof... really need a separate disk to test with.
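To distinguish a flaky disk from a one-off, a scrub re-reads and verifies every block; with no redundancy in this pool it can only detect errors, not repair them:

zpool scrub testpool
zpool status -v testpool   # scrub progress, plus any files with permanent errors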

START OVER with two clean, tested raw drives

Boot from CD, setup to configure interfaces, install to master HD. Replace CD-ROM with second HD (slave) and boot from installed system.

zpool create testpool c0d1
zfs create testpool/testfs
/usr/bin/updimg.sh /mnt/eon0/boot/x86.eon

reboot to test: "Bad or corrupt data while decompressing file" ouch!

Swap back to the CD-ROM and reinstall to the HD. Skip setup and immediately updimg. Reboot: worked. Run setup, updimg again. Reboot: worked. "zpool import" lists the testpool I previously created; "zpool import -f testpool" brings it back online. updimg again to test ZFS automounting on reboot. The reboot works, but the zfs pool is not mounted?? I need the workaround from http://eonstorage.blogspot.com/2009/10/eon-zfs-storage-0594-based-on-snv124.html - "zpool import -a" in .exec.
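The workaround is just making the import happen at boot and persisting it in the image:

# uncomment (or add) in /mnt/eon0/.exec:
zpool import -a
# then persist the change into the boot image:
/usr/bin/updimg.sh /mnt/eon0/boot/x86.eon

Moving on: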

zfs set sharesmb=on testpool/testfs
zfs set sharesmb=guestok=true testpool/testfs

reboot

zfs get all | grep sharesmb -> confirmed working
connect from windows: folder opens but no write access.
chmod A-everyone@:full_set:fd:allow testfs -> "chmod: ERROR: ACL entry doesn't exist"
chmod 777 testfs

That fixed it! So eventually I need a coherent account/permissions strategy: I would like the share to be listed when the network is browsed, with open read-only access, so that guests can browse and play media without needing any special instructions or login. However, write access should be password-protected, with only my Windows box set up with the password, so neighbors can't delete my media.

for now, on to tests.

copy a couple files: OK
copy them back: OK, but read is MUCH slower than write!?! perhaps limited by the write speed of the local disk?
md5sum a file on the share: completes in about 1 min, so read is fast; the previous test must have been limited by the write to the local disk
close browser window, reboot NAS: browse and file read works

copy 2 files, yank power towards the end of the 2nd file. Windows beefs a bit. After bootup, reconnect: files 1 and 2 both appear complete, with full file size! The only difference is that the date on the new file has not been set. This is a transactional violation and a form of silent data corruption... the files appear to a novice user to be complete; a backup solution would have to checksum all files or have special "date inversion -> recopy" logic to know the file needs to be updated; a user who loses the connection during a bulk copy needs to search through and figure out exactly where the copy was when it died... it would increase the value as a storage appliance if there were a way to mitigate this effect... a configuration for CIFS so that file writes use a tmp name/path until they are complete? an option in ZFS that anything pending in the ZIL which doesn't include file close/completion should be unwound?


Important: when creating a ZFS file system to be used for SMB file sharing, set the casesensitivity option to mixed to permit a combination of case-sensitive and case-insensitive matching. Also, set the nbmand option to enforce mandatory cross-protocol share reservations and byte-range locking. Via http://wiki.genunix.org/wiki/index.php/Getting_Started_With_the_Solaris_CIFS_Service


zfs create -o casesensitivity=mixed -o nbmand=on -o sharesmb=guestok=true testpool/testfs2
chmod 777 /testpool/testfs2

test results are the same. moving on, let's configure ACL

chmod 775 /testpool/testfs2 # Undo the global write permissions we used during tests

groupadd -g 600 media
useradd -u 502 -g 600 mediausr
passwd mediausr
chown -R mediausr:media /testpool/testfs2

For now, not getting crazy; in fact not setting ACLs explicitly, just using chmod 775 to take away write access from non-authenticated users. Now the only trick is to get login working for normal user accounts over CIFS... something about CIFS passwords not being compatible with /etc/passwd, so you have to enable CIFS password shadowing and then set the password again for each user that needs to connect via CIFS.
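As I understand the guide, the mechanism is a PAM module that stores an SMB-style password hash whenever a password is set, roughly (from the Solaris CIFS docs; untested here):

# append to /etc/pam.conf:
other password required pam_smb_passwd.so.1 nowarn
# then re-set the password for each user that will connect via CIFS:
passwd mediausr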

... well according to http://wiki.genunix.org/wiki/index.php/Getting_Started_With_the_Solaris_CIFS_Service and examining the state of EON, it looks like everything is already set up. the EON instructions at http://sites.google.com/site/eonstorage/getting-started support the idea that no special steps need be taken.

Run -> \\10.0.1.250\ -> browse: full read permission but no write permission. Go up to the top-level list of shares, right click on testfs2 and "map network drive", select "connect as other user" and enter mediausr/media, click OK... it reprompts for my password. Enter it again and check the "remember password" box. Voilà, Z: now appears under "My Computer", and from there I have write permission.

Now the last piece of the puzzle... it would be nice to see the shares listed under "browse network" so that visitors would not have to enter IP addresses manually. When working with FreeNAS I had the experience that my Windows box would not list the share, but OSX would. Come to think of it, my old ReadyNAS behaves the same way... OSX can browse to it, but it was invisible to WinXP unless the IP address was given or the Raidar GUI was used to connect. Testing OSX -> Finder -> Go -> Network... I see an FFEM folder with the readynas server inside, and an empty WORKGROUP folder. Hmmm, is that WORKGROUP folder from EON? It is the default workgroup name... try Go -> Connect to server and type in the IP address, but then it only uses the afp protocol, and there is no obvious way to change the protocol used.

smbadm list -> confirms that primary domain is WORKGROUP
smbadm join -w TESTGROUP -> try a different name that will be obvious when encountered

Back to OSX, no change. Log the user out, log in a different user, disconnect and reconnect to the wireless network. Go to Finder -> Network: no change. Still see the empty WORKGROUP, no sign of TESTGROUP. Try Go -> Connect to server -> smb://10.0.1.250/ - get a password prompt! However, an empty user/pass gives an error. Username 'guest' with an empty pass gives an error. I recall something about a bug with empty fields during logon... oh yeah, this: http://defect.opensolaris.org/bz/show_bug.cgi?id=11890 hasn't been fixed yet, so for now we'll make an explicit guest account:

useradd -u 503 guest
passwd guest # hit enter twice to confirm empty password

Try OSX "Connect to server" -> smb://10.0.1.250 -> log in with guest/[empty] -> still fails with an error. Okay, set the password to guest and try again... still fails with error -50. Hrmm. Try again with our poweruser account mediausr/media... still fails. This same login works from Windows, so the problem lies elsewhere. While we're at it, test guest/guest from the Windows "map network drive"... works as expected, read-only access. Hey, perhaps the OSX problem is related to a bug referenced here: http://opensolaris.org/jive/thread.jspa?messageID=438356&tstart=0 "empty domain prevents guest login". I'm not using the test account, but I did set an empty domain... rerun /usr/bin/setup and enter eon.com for the domain. Rejoin TESTGROUP. No dice.

I notice that my connection from OSX to the ReadyNAS is actually using nfs. Try an explicit connect to the ReadyNAS at smb://ip_address and it works with any username/password. So OSX is capable of connecting to SMB/CIFS. Still, this suggests a workaround for EON: sharing via NFS.

zfs set sharenfs=ro testpool/testfs2
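If read-only access should later be limited to the local subnet, the share_nfs access-list syntax should allow something like the following (per the share_nfs man page; untested here):

zfs set sharenfs=ro=@10.0.1.0/24 testpool/testfs2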

OSX -> Go -> Connect to server -> nfs://10.0.1.250 -> connecting...

SO, back to the other pending issue... how to make sure shares are visible under "My Network Places" (checked again: can't see TESTGROUP, and WORKGROUP is empty)

http://www.mail-archive.com/cifs-discuss@opensolaris.org/msg01386.html
http://www.addonics.com/support/faqs/faq-sunsupport.asp
http://blogs.sun.com/afshinsa/entry/how_to_enable_guest_access


START OVER with EON-NAS 0.59.5

before starting, boot from partedmagic and make sure disks are wiped, otherwise they might be mounted and prevent install to HD

boot from EON-NAS 0.59.5 CD, log in with root/eonsolaris
type "format" and ctrl-c out to get a list of drives... can clearly see that my small drive (will be root) is c0d0 while big drives are c1d0/c1d1/c2d0
/usr/bin/install.sh [select c0d0 to install on]
[hard reboot, remove EON CD... i've noticed that net interfaces don't come up until after reboot]


login with root/eonsolaris
/usr/bin/setup
and configure net interfaces - enter eon.com for domain because there is some bug related to "empty domain"
zpool create mediapool raidz c1d0 c1d1 c2d0
zfs create -o casesensitivity=mixed -o nbmand=on -o sharesmb=guestok=true -o sharenfs=ro mediapool/media
groupadd -g 600 media
useradd -u 502 -g 600 media
passwd media
chown -R media:media /mediapool/media
chmod 775 /mediapool/media
/usr/bin/updimg.sh /mnt/eon0/boot/x86.eon
reboot

login with root/eonsolaris
ping 10.0.1.1 to make sure network came up
df -k to make sure zfs automounted
if not, 'zpool import -a' to test import, then uncomment 'zpool import -a' line in /mnt/eon0/.exec

in windows, start -> run -> \\10.0.1.250\ -> anon read-only access
right click -> map network drive -> connect as other user -> media/media -> rw access
OSX -> browse network -> fail
OSX -> connect to server -> nfs://10.0.1.250/ -> connecting.................
OSX -> connect to server -> smb://10.0.1.250/ -> auth error -50 for any account info

SO: back where we were previously: Windows works (except for browsing the network); can't access at all from OSX

further tests:

linux CIFS/SMB:

From fedora, Places -> Connect to server... -> Windows share -> 10.0.1.250 -> share list -> click on mediapool_media -> pass prompt -> can't connect w/o password -> any password works, connect, can read files

linux nfs:
no go


TROUBLESHOOTING:

check zfs options:
zfs get all mediapool/media >& /tmp/getout; more /tmp/getout

check zfs:cifs options:
zfs get sharesmb mediapool/media

check zfs:nfs options:
zfs get sharenfs mediapool/media


smb service seems to be running:

media:89:~#svcs '*smb*'
STATE STIME FMRI
online 13:11:17 svc:/network/smb/client:default
online 13:11:22 svc:/network/smb/server:default

however nfs does not seem to exist in either enabled or disabled state:
media:91:~#svcs '*nfs*'
svcs: Pattern '*nfs*' doesn't match any instances
STATE STIME FMRI

furthermore:
media:93:~#ps aux | grep smb
root 607 0.0 0.2 15844 7620 ? S 13:11:18 0:05 /usr/lib/smbsrv/sm
root 1331 0.0 0.1 1632 960 pts/2 S 20:53:08 0:00 grep smb

media:94:~#ps aux | grep nfs
root 1333 0.0 0.1 1632 960 pts/2 S 20:53:33 0:00 grep nfs

no sign of nfs at all in running processes/services... screw it, start nfs manually and see if we get changed results...

media:98:~#/usr/lib/nfs/nfsd
media:99:~#ps aux | grep nfs
daemon 1341 0.0 0.1 2912 1848 ? S 20:55:26 0:00 /usr/lib/nfs/nfsd
root 1343 0.0 0.1 1632 960 pts/2 S 20:55:31 0:00 grep nfs

RPC: Rpcbind failure - RPC: Unable to receive

still no go...


follow the instructions here:

http://sites.google.com/site/eonstorage/faq -> How do I start NFS server services

cd /var/svc/manifest/network
svccfg -v import rpc/bind.xml
svccfg -v import nfs/status.xml
svccfg -v import nfs/nlockmgr.xml
svccfg -v import nfs/server.xml
svcadm enable rpc/bind
svcadm enable nfs/status
svcadm enable nfs/nlockmgr
svcadm enable nfs/server

now we see evidence of nfs running:

media:22:/var/svc/manifest/network#svcs | grep nfs
online 21:35:52 svc:/network/nfs/status:default
online 21:36:04 svc:/network/nfs/nlockmgr:default
online 21:36:21 svc:/network/nfs/server:default
media:23:/var/svc/manifest/network#ps aux | grep nfs
root 1526 0.0 0.1 3324 2104 ? S 21:36:20 0:00 /usr/lib/nfs/mount
daemon 1159 0.0 0.1 3272 2160 ? S 21:35:52 0:00 /usr/lib/nfs/statd
daemon 1331 0.0 0.1 2900 1968 ? S 21:36:04 0:00 /usr/lib/nfs/lockd
daemon 1528 0.0 0.1 2908 1844 ? S 21:36:20 0:00 /usr/lib/nfs/nfsd
root 1578 0.0 0.1 1632 960 pts/1 S 21:36:43 0:00 grep nfs


from linux:
[root@localhost]~# mount -v 10.0.1.250:mediapool/media /mnt/nfs
mount: no type was given - I'll assume nfs because of the colon
mount: trying 10.0.1.250 prog 100003 vers 3 prot tcp port 2049
mount: trying 10.0.1.250 prog 100005 vers 3 prot udp port 64243
[root@localhost]~# ls /mnt/nfs
TESTFILE.dat

works!

in order for changes to persist after reboot, need to
/usr/bin/updimg.sh /mnt/eon0/boot/x86.eon

from mac:

nfs://10.0.1.250
nfs://10.0.1.250:/mediapool/media
nfs://10.0.1.250:mediapool/media
nfs://10.0.1.250/mediapool/media


the last one works! http://mygeekproject.com/?cat=8 seems a tad off on the syntax, at least for OSX 10.4


now for some testing: hard power off, swap drive cables around so they are out of order:

result: perhaps due to the "zpool import -a" in .exec, it boots up fine!


BULK UP:

Plug 8 2TB drives into ports 1-6 and 9-10 (7-8 are IDE, which is controller 4). However, I find that regardless of drive or cable, we have a problem if a drive is plugged into port 9 = controller 5 master disk. Can't help but recognize the possibility that there is a conflict between the IDE and SATA channel 4/5 masters... especially since there is a "SATA/IDE control mode" option in the BIOS.

BIOS: change "SATA RAID/AHCI Mode" from 'Disabled' to 'AHCI'
result: bootup is notably different, with different timing/language around disk detection. boot still stalls with 'early timeout' and 'ata_disk_start: select failed' errors.

BIOS: change 'Onboard SATA/IDE Ctrl mode' from IDE to AHCI
results: the last two drives come up on a separate screen of bootup, and in blue. However, bootup stalls because (apparently) the IDE disk can no longer be found. Interestingly, it shows the IDE master as "HDD0" and GSATA_0 as HDD1 - apparently no conflict? Is it possible that the BIOS boot order is messed up now? Check BIOS settings: note that it goes through both screens of disk detection before going to the setup pages. Once there, YES, our IDE root disk is buried in the boot order menu. Move it to first: boot succeeds. Great, we now have it booting with the same physical setup that previously failed with an apparent conflict. Weird thing is, the master and disk0 on the controller show up as HDD1 and HDD2 in the BIOS boot process... they should be either 1 and 3, counting both controllers in order, or both HDD1 because both are the first drive on their controller and conflicting. Is it possible to add an 8th SATA drive and break the 8-drive barrier??? Plug the topmost SATA into GSATA_1... the BIOS steps now show HDD0, HDD2, and HDD2!! Boot succeeds! Try making an 8-disc raidz set:

zfs destroy mediapool/media
zpool destroy mediapool
zpool create mediapool raidz c0t0d0 c0t1d0 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0
zfs create -o casesensitivity=mixed -o nbmand=on -o sharesmb=guestok=true -o sharenfs=ro mediapool/media

works: df -k shows we have a 13.3TB filesystem. OSX mounts it via NFS, no write permission
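A quick sanity check of the layout is worthwhile here; note that zpool list counts raw space including parity, while df and zfs list show usable space:

zpool status mediapool   # shows the pool's vdev tree and member disks
zpool list mediapool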

Actually, there is a problem: files show up with crazy-ass permissions and can't be accessed from linux or from OSX. Things to try: undoing the max_version, setting ACL-style permissions (aren't they already set?) for that fs, looking into UID_NOBODY, etc.

Examine the share on the OS box:

media:2:~#ls -ldv /mediapool/media/
drwxrwxr-x 2 media media 4 Dec 6 00:30 /mediapool/media/
0:owner@::deny
1:owner@:list_directory/read_data/add_file/write_data/add_subdirectory
/append_data/write_xattr/execute/write_attributes/write_acl
/write_owner:allow
2:group@::deny
3:group@:list_directory/read_data/add_file/write_data/add_subdirectory
/append_data/execute:allow
4:everyone@:add_file/write_data/add_subdirectory/append_data/write_xattr
/write_attributes/write_acl/write_owner:deny
5:everyone@:list_directory/read_data/read_xattr/execute/read_attributes
/read_acl/synchronize:allow

media:4:~#sharemgr show -vp
default nfs=()
zfs
zfs/mediapool/media nfs=() smb=(guestok="true") nfs:sys=(ro="*")
mediapool_media=/mediapool/media

here is a file written from windows via authorized CIFS share:

media:5:~#ls -ldv /mediapool/media/snowfall2.MP4
----------+ 1 media media 63966846 Dec 8 2006 /mediapool/media/snowfall2.MP4
0:user:media:read_data/write_data/append_data/read_xattr/write_xattr
/execute/delete_child/read_attributes/write_attributes/delete
/read_acl/write_acl/write_owner/synchronize:allow
1:group:2147483648:read_data/write_data/append_data/read_xattr
/write_xattr/execute/delete_child/read_attributes/write_attributes
/delete/read_acl/write_acl/write_owner/synchronize:allow


SO, it seems the problem is not NFS per se, but the fact that files written from Windows over CIFS have weird permissions. Test by writing a file from OpenSolaris itself and see if it can be read by NFS clients:

media:6:~#vi /mediapool/media/test_file_written_by_open_solaris
media:7:~#
media:7:~#ls -lv !$
ls -lv /mediapool/media/test_file_written_by_open_solaris
-rw-r--r-- 1 root root 48 Dec 6 08:35 /mediapool/media/test_file_written_by_open_solaris
0:owner@:execute:deny
1:owner@:read_data/write_data/append_data/write_xattr/write_attributes
/write_acl/write_owner:allow
2:group@:write_data/append_data/execute:deny
3:group@:read_data:allow
4:everyone@:write_data/append_data/write_xattr/execute/write_attributes
/write_acl/write_owner:deny
5:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize
:allow


from OSX:

lancebook-2% ls -l /Volumes/10.0.1.250-4/test_file_written_by_open_solaris
-rw-r--r-- 1 root wheel 48 Dec 6 11:35 /Volumes/10.0.1.250-4/test_file_written_by_open_solaris
lancebook-2% cat /Volumes/10.0.1.250-4/test_file_written_by_open_solaris
this file was written by open solaris root user



I recall something about setting the ACL inherit property on the toplevel share, so all new files would inherit readability...

zfs get all mediapool/media
mediapool/media aclmode groupmask default
mediapool/media aclinherit restricted default

ah yes, http://breden.org.uk/2009/05/10/home-fileserver-zfs-file-systems/ is an example. try:

media:14:~#zfs set aclinherit=passthrough mediapool/media
media:16:~#zfs set aclmode=passthrough mediapool/media
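To confirm that both properties took, zfs get accepts a comma-separated property list:

zfs get aclinherit,aclmode mediapool/media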

now copy a file from windows box via CIFS and check it:

media:20:~#ls -l /mediapool/media/YogaNidraHQ.mp3
----------+ 1 media media 59463068 May 22 2007 /mediapool/media/YogaNidraHQ.mp3
media:21:~#ls -lV /mediapool/media/YogaNidraHQ.mp3
----------+ 1 media media 59463068 May 22 2007 /mediapool/media/YogaNidraHQ.mp3
user:media:rwxpdDaARWcCos:-------:allow
group:2147483648:rwxpdDaARWcCos:-------:allow

still looks funky, sigh... and note there are no permissions for everyone?


chmod A=owner@:rwxpdDaARWcCos:fd-----:allow,group@:rwxpdDaARWcCos:fd-----:allow,everyone@:rwxpdDaARWcCos:fd-----:deny /mediapool/media


oops, that removed read permissions for everyone...

media:26:~#ls -ldV /mediapool/media
drwxrwx---+ 2 media media 7 Dec 6 09:41 /mediapool/media
owner@:rwxpdDaARWcCos:fd-----:allow
group@:rwxpdDaARWcCos:fd-----:allow
everyone@:rwxpdDaARWcCos:fd-----:deny
media:27:~#
media:27:~#chmod 755 !$
chmod 755 /mediapool/media
media:28:~#ls -ldV /mediapool/media
drwxr-xr-x+ 2 media media 7 Dec 6 09:41 /mediapool/media
owner@:rwxpdDaARWcCos:fdi----:allow
owner@:----dDaARWcCos:-------:allow
group@:rwxpdDaARWcCos:fdi----:allow
group@:----dDaARWcCos:-------:allow
everyone@:rwxpdDaARWcCos:fdi----:deny
everyone@:----dDaARWcCos:-------:deny
owner@:--------------:-------:deny
owner@:rwxp---A-W-Co-:-------:allow
group@:-w-p----------:-------:deny
group@:r-x-----------:-------:allow
everyone@:-w-p---A-W-Co-:-------:deny
everyone@:r-x---a-R-c--s:-------:allow
media:29:~#chmod 775 /mediapool/media
media:30:~#ls -ldV /mediapool/media
drwxrwxr-x+ 2 media media 7 Dec 6 09:41 /mediapool/media
owner@:rwxpdDaARWcCos:fdi----:allow
owner@:----dDaARWcCos:-------:allow
group@:rwxpdDaARWcCos:fdi----:allow
group@:----dDaARWcCos:-------:allow
everyone@:rwxpdDaARWcCos:fdi----:deny
everyone@:----dDaARWcCos:-------:deny
owner@:--------------:-------:deny
owner@:rwxp---A-W-Co-:-------:allow
group@:--------------:-------:deny
group@:rwxp----------:-------:allow
everyone@:-w-p---A-W-Co-:-------:deny
everyone@:r-x---a-R-c--s:-------:allow


okay, those are weird and stacked permissions, but they seem to be what I want...

make another test file over CIFS and check result:

media:31:~#ls -l /mediapool/media/testfile2.dat
-rwxrwx---+ 1 media media 2117760 May 31 2008 /mediapool/media/testfile2.dat

Well, that is a step in the right direction... it has recognizable simple permissions. The only problem is a lack of read permission for everyone... I wish I could reset the ACL... but note that the desirable ACL states don't have any inheritance... construct ideal complex ACL settings using the ACL permission/inheritance codes listed at http://docs.huihoo.com/opensolaris/solaris-zfs-administration-guide/html/ch07.html

chmod A=owner@:rwxpdDaARWcCos:fd-----:allow,group@:rwxpdDaARWcCos:fd-----:allow,everyone@:wpdDAWc:fd-----:deny,everyone@:rxaRCos:fd-----:allow /mediapool/media

now apply it:

media:32:~#chmod A=owner@:rwxpdDaARWcCos:fd-----:allow,group@:rwxpdDaARWcCos:fd-----:allow,everyone@:wpdDAWc:fd-----:deny,everyone@:rxaRCos:fd-----:allow /mediapool/media
media:33:~#
media:33:~#
media:33:~#ls -ldV /mediapool/media
drwxrwxr-x+ 2 media media 8 Dec 6 09:48 /mediapool/media
owner@:rwxpdDaARWcCos:fd-----:allow
group@:rwxpdDaARWcCos:fd-----:allow
everyone@:-w-pdD-A-Wc---:fd-----:deny
everyone@:r-x---a-R--Cos:fd-----:allow

copy a test file over, check it:


media:34:~#ls -lV /mediapool/media/testfile3.dat
-rwxrwxr-x+ 1 media media 3641472 May 31 2008 /mediapool/media/testfile3.dat
owner@:rwxpdDaARWcCos:------I:allow
group@:rwxpdDaARWcCos:------I:allow
everyone@:-w-pdD-A-Wc---:------I:deny
everyone@:r-x---a-R--Cos:------I:allow

looks good! check from OSX nfs mount:

lancebook-2% ls -l /Volumes/10.0.1.250-4/
-rwxrwxr-x 1 fresh 600 3641472 May 31 2008 testfile3.dat

Looks good, except that write is allowed by uid/gid... in this case the OSX local user "fresh" has the same uid, so fresh would be allowed to delete the file. Need to remount NFS read-only... wait, it is already set to ro in the zfs attributes. Log in as "fresh": can't modify files. Seems solid!

OpenSolaris NAS

OpenSolaris LiveCD: boots to console login repeatedly. It seems that I have to select the "VESA" option from the grub menu to get a desktop. However, even then bootup pauses for a long time at the console login and I have to wait patiently for the desktop to start.

After install,

upgrade to snv_118 or later
http://opensolaris.org/jive/thread.jspa?messageID=404103

to do so, Package Manager -> Repository -> Add with URL http://pkg.opensolaris.org/dev

use package manager to add SUNWsmbs, then use Services GUI to activate "smbd daemon"


create ZFS filesystem to share:
http://developers.sun.com/openstorage/articles/opensolaris_storage_server.html
zfs create -o casesensitivity=mixed mypool/myfs2

zfs set sharesmb=on mypool/myfs2

http://opensolaris.org/jive/thread.jspa?messageID=438181&tstart=0



WELL, updating to a newer snv turned out to be a fairly involved sysadmin task, which is not what I was looking for. So, try the latest build via a precompiled image at http://genunix.org/ ... in fact, while we're at it, let's build an embedded flash system using the .usb format and the instructions from http://chonan-en.blog.pid0.org/2009/11/how-to-create-opensolaris-live-usb.html which in my case resulted in: cat 1gb_headimg.dat osol-dev-127-x86.usb | dd of=/dev/sda bs=16k ... the Solaris live image is really big, just shy of 1GB, but they did keep it small enough to fit on a 1GB flash drive.

results: boot hangs during POST, before the memory scan, just like the EON-NAS USB boot did... so there seems to be something about the OpenSolaris USB boot images that is incompatible with my hardware, even though the FreeBSD boot image (also using grub) works just fine, and a random non-bootable USB drive doesn't interfere with normal boot. Grrr.

Friday, November 20, 2009

FreeNAS, ZFS

Need an upgraded storage solution. The old ReadyNAS is still alive, but it sometimes requires a reboot by unplugging, and I can't handle the emotional trauma. The Drobo is working, but previously the first two drives had episodes of spontaneously dropping and reappearing, causing disk thrashing as data was redistributed each time. WinXP is limited to a 2TB max partition size, so all my data is ghettoized and I spend a lot of time copying things between partitions. Not going to invest in their overpriced 8-bay unit.

Next up: FreeNAS with ZFS support. ZFS is self-checking, self-healing, and its copy-on-write architecture means that data won't be lost if power is lost mid-write (every write either succeeds or fails in its entirety). It is pretty much the last word in filesystems data-wise, although it is not yet flexible enough to meet the demands of the home user (it cannot freely scale up/down or maximize use of mismatched drives without jumping through hoops).

Any hardware should be compatible with OpenSolaris (ZFS's native home) and FreeBSD, with a bonus for explicit FreeNAS support.
OpenSolaris
FreeBSD
FreeNAS

Research:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
http://harryd71.blogspot.com/2008/10/tuning-freenas-zfs.html
http://wiki.freebsd.org/ZFSTuningGuide
http://wiki.freebsd.org/ZFS
http://techpad.co.uk/content.php?sid=60 (is it true that healing happens only in *mirrored* zfs?)
http://forums.smallnetbuilder.com/showthread.php?t=1953
http://pegolon.wordpress.com/2009/01/13/build-your-own-drobo-replacement-based-on-zfs/
http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z
http://www.mouldy.org/what-i-learned-from-setting-up-zfs-on-my-fileserver
http://www.learnfreenas.com/blog/2009/04/12/ramblings-on-freenas-zfs-expandability-and-raid-5/
http://nowhereman999.wordpress.com/2009/04/19/zfs-freenas-a-poor-and-very-geeky-man%E2%80%99s-drobo-setup/
http://rskjetlein.blogspot.com/2009/08/expanding-zfs-pool.html
http://wiki.mattrude.com/index.php?title=Freenas/ZFS_and_FreeNAS_expansion
http://ask.metafilter.com/125509/FreeNAS-Hardware-Specs

Parts:

Need: 8 bays minimum for hard drives
http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=2010090007+1054808291+1309321151&QksAutoSuggestion=&ShowDeactivatedMark=False&Configurator=&Subcategory=7&description=&Ntk=&CFG=&SpeTabStoreType=&srchInDesc=

possibly convert 5.25 into 3.5 bays
Thermaltake A2309 iCage 3-in-3 with fan $20
Silverstone 4-in-3 passive $30
Cooler Master 4-in-3
Chenbro 5-in-3 hotswap
http://www.enhance-tech.com/products/multidrive/New_StorPack.htm

4 new 2tb drives to get it started
http://www.newegg.com/Product/Product.aspx?Item=N82E16822145276

motherboard: don't need a lot of power; the lowest-end 64-bit system available would be fine. 2GB RAM, as many SATA ports as possible. Need 1 or 2 old PCI slots to handle the SATA card below.
http://www.newegg.com/Product/Product.aspx?Item=N82E16813130240R

SATA controller card
SUPERMICRO AOC-SAT2-MV8 64-bit PCI-X 133MHz SATA controller card; compatible with OpenSolaris according to http://ask.metafilter.com/125509/FreeNAS-Hardware-Specs and https://opensolaris.org/jive/thread.jspa?messageID=435458, and also with FreeNAS

figure out where to run OS and keep zlog (and research complaints of lost data when zlog is lost)


DATA

what             raw     used
Drobo            8TB     5TB
ReadyNAS         4TB     3TB
PC               2TB     2TB
Spare            2TB     0TB
USB              1TB     0TB
total            17TB    10TB

what             raw     used
FreeNAS Primary  16TB    8/12TB
PC               2TB     2TB
Spare            6TB     0TB
USB              1TB     0TB

ZFS important info:

The ZFS toplevel unit is the pool. Pools are built on vdevs. A vdev is one or more drives / partitions / files.

You can grow a pool by adding a new vdev or by increasing the size of a vdev. The size of a redundant (mirrored or raidz'ed) vdev can be grown by swapping out one disk at a time, giving zfs a chance to rebuild the contents of each replacement drive (known as "resilvering") and re-establish redundancy before the next drive is swapped out. Unless the vdev is double- or triple-parity (known as raidz2 and raidz3 (pending)), your data is at risk during the resilver process should one of the other drives die (TODO: could the removed drive be swapped back in should that happen?).
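The swap-one-at-a-time flow, sketched with assumed device names:

zpool replace mediapool c2t0d0   # after physically swapping a larger disk into the same slot
zpool status mediapool           # watch the resilver; wait for completion before swapping the next drive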

ZFS redundancy against hardware failures is accomplished at the vdev level, by making each vdev redundant via mirroring or varying degrees of parity. ZFS will warn you if you attempt to mix different types of vdevs in a pool, because it is not natural to span data across different levels of redundancy. Because a pool stripes data across the vdevs it comprises, if one vdev fails the entire pool's data is lost. Therefore a pool made of vdevs of varying types is only as reliable as the least reliable vdev. Adding a nonredundant vdev, e.g. a single drive, to a pool makes the entire pool subject to data loss should that drive die.

If you have 8 drives of size P, there are various strategies for organizing your filesystem (each is sketched as a zpool invocation after this list):

4 vdevs of 2 mirrored drives each, all 4 vdevs in a single pool. This gives 4*P of space. The filesystem can survive the failure of any one drive, and of up to 4 drives if one drive of each mirrored pair dies. However, if 2 drives of the same pair die, all 4*P of data could be lost; the odds that a second failure lands on the partner of the first failed drive are 1/7. Space efficiency is 50%.

If it is acceptable for the storage to be divided into chunks, 4 vdevs, each made of a 2-drive mirror, each making up its own pool, will result in 4 P-sized pools for a total of 4*P available space. The maximum damage from a 2-drive failure would be the loss of the data in the one pool which had both drives die; data in the other 3 pools would be safe and protected. Note that it is the administrator's job to make sure data in any one pool never exceeds P, and moving data between pools is slow. Also note that performance is slightly lower; data reads are striped across 2 disks as opposed to all 8. However, if you have a bottleneck elsewhere (network) this is irrelevant.

2 vdevs, each of 4 drives in raidz, with the pool made of the 2 vdevs. Can survive the loss of any one drive. If 2 drives in the same vdev die, then all pool data is lost; the odds of that for a second failure are 3/7. Space efficiency is 6/8, reliability is 1.43. Performance is lower than mirrored mode, because parity must be calculated (XXX if your processor is fast enough, this might not be a problem?)

1 pool made of one vdev of 8 drives in raidz. Can survive the loss of only one drive; if any second drive is lost, all data is lost. Space efficiency is 7/8, reliability is 1.0. Read speeds could be 8x, but writes require parity calculation.

1 pool made of one vdev of 8 drives in raidz2. Can survive the loss of any two drives; loss of a third drive means all data is lost. Space efficiency is 6/8, reliability is 2.0.
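For concreteness, the layouts above as zpool invocations (disk names are placeholders; each line is a separate scenario, not a sequence):

zpool create tank mirror c0d0 c0d1 mirror c1d0 c1d1 mirror c2d0 c2d1 mirror c3d0 c3d1   # 4 mirrored vdevs, one pool
zpool create tank1 mirror c0d0 c0d1   # one pool per mirrored pair; repeat for tank2..tank4
zpool create tank raidz c0d0 c0d1 c1d0 c1d1 raidz c2d0 c2d1 c3d0 c3d1                   # 2 raidz vdevs, one pool
zpool create tank raidz c0d0 c0d1 c1d0 c1d1 c2d0 c2d1 c3d0 c3d1                         # single 8-drive raidz
zpool create tank raidz2 c0d0 c0d1 c1d0 c1d1 c2d0 c2d1 c3d0 c3d1                        # single 8-drive raidz2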



ZFS TESTS:

single drive pool, pull plug while writing, examine state

steps: via webgui, add disk, format as zpool device, make a vdev of single disk, make pool of that vdev, share via cifs, copy files 1-8 and yank power during file 2; repeat with 2-8 etc

results: never loses files which have been completely written. However, sometimes a new file (one which was in the process of being written at the time of power loss) ends up listed in ZFS with the correct file size but a different checksum. This seems inconsistent with the copy-on-write design, and means that the user would have to checksum the most recent files after a crash to determine whether they are perfect copies or not.

pool with single disk, separate zlog device, remove zlog device during reboot

steps: zpool create ZLogTest ad8 log da1
cannot use '/dev/log': must be a GEOM provider

results: zpool in the FreeBSD underlying FreeNAS doesn't seem to support log devices!!! The command formats in the zpool help output confirm this. According to http://forums.freebsd.org/archive/index.php/t-4641.html it is supported by ZFSv13 in FreeBSD 7-STABLE or 8-CURRENT; FreeNAS is using 7-RELEASE. Slashdot says "ZFS13 would break 7.2 ABI, so wait for 8".
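For reference, once a ZFSv13-capable zpool is available, the syntax already tried above should work at creation time, and a log vdev can also be attached to an existing pool:

zpool create ZLogTest ad8 log da1   # at pool creation
zpool add ZLogTest log da1          # or add to an existing pool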

pool with separate zlog device, remove zlog device while running

pool with separate zlog device, pull zlog device during write

multi-vdev pool, pull plug while writing, examine state
TODO: webgui doesn't seem to support using partitions to build pool, need more disks to test this

multi-drive vdev, change drive order and power on, without export/import

multi-drive vdev, change drive order and power on, with export/import