#%include "default.mgp"
%tab 1 size 4, vgap 40, prefix " ", icon box "green" 50
%tab 2 size 4, vgap 40, prefix " ", icon arc "yellow" 50
%tab 3 size 4, vgap 40, prefix " ", icon delta3 "white" 40
%%%%%%%%%%%%%
%page
%tfont "/usr/share/fonts/default/TrueType/timr____.ttf"
%fore "white"
%center
Reliable NAS from Dirt Cheap Commodity Hardware
%fore "blue"
%size 6
Benjamin LaHaise
%size 4
bcrl@kvack.org
%left
%size 6
%%%%%%%%%%%%
%page
%fore "white"
Overview
%fore "skyblue"
	why?
	what is it?
	design criteria
	possible configurations
	implementation
		network protocol
		crc trick
		kernel extensions
		kernel driver
		userland bits
	current state
	future goals
#
%%%%%%%%%%%%
%page
%fore "white"
Why?
%fore "skyblue"
	- high end hardware is expensive
	- low end hardware has better density
		- 300GB ATA vs 180GB SCSI
		- 12 disks / 2U -> 3.6TB
	- failure modes
		- RAID controllers are complex
		- network cards are simpler to verify
		- failures still happen
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
What is it?
%fore "skyblue"
	- kernel driver called netmd provides block device access over UDP
	- similar to nbd, drbd in scope
	- pushes complexity to userspace
	- reliable
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Design Criteria
%fore "skyblue"
%fore "red"
	- Never lose or corrupt data!
%fore "white"
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Actual Design Criteria
%fore "skyblue"
	- big. really big.
	- easy to set up and use
	- detect data errors (single bit, wrong sector / fractured blocks)
	- automatic failover for damaged or lost nodes
	- efficient writes (unlike md)
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Possible Configurations
%fore "skyblue"
	- front end system, back end
		- spend money on the front end
			- ECC matters
		- back ends run userspace server
			- no ECC needed
	- networks, switches
		- should use at least a dual network setup
		- switches should support multicast
	- more mirrors
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Implementation - Protocol
%fore "skyblue"
	- UDP -- packet based
	- expected sector number, CRC in reads/writes
	- writes are multicast
		- back ends filter out uninteresting requests
	- reads are multicast
		- back ends may answer from cache
	- puts complexity for failover in back ends
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Implementation - CRC trick
%fore "skyblue"
	- standard CRC32
    u32 crc32 = 0xffffffff;
    u8 *data;    /* points at the 4 input bytes */
    int i;
    for (i=0; i<4; i++) {
        crc32 ^= *data++;
        crc32 = crc_table[crc32 & 0xff] ^ (crc32 >> 8);
    }
    crc32 ^= 0xffffffff;
	- inverse CRC32
    u32 crc = old_crc;    /* CRC register before the 4 patch bytes */
    u8 *data = (u8 *)&wanted_crc;
    int i;
    wanted_crc ^= 0xffffffff;
    for (i=0; i<4; i++) {
        u8 idx = rev_crc_table[data[3 - i]];
        u32 entry = crc_table[idx];
        wanted_crc ^= entry >> (i * 8);
        crc ^= (entry << 8) << (24 - i * 8);
        crc ^= (u32)idx << (24 - i * 8);
    }
    wanted_crc = crc;    /* the 4 patch bytes */
	- yeah, it's endian specific, will be fixed ;-)
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Implementation - kernel driver
%fore "skyblue"
	- 600 lines, expect 200-300 more (netmd.c)
	- true block device, but registers itself as scsi
	- small, stupid, fast
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Implementation - kernel extensions
%fore "skyblue"
	- ethernet drivers provide crc
		- control method in device ops
		- flag in rx skb
	- ip layer non-blocking sendmsg
		- allows block layer packet submits
	- scsi layer disk passthrough
		- easy to manage
		- compatible
		- no scsi command parsing
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Implementation - userland bits
%fore "skyblue"
	- non-aio daemon for testing
	- aio daemon for performance
	- journalling write daemon for speed
%%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Current State
%fore "skyblue"
	- kernel side runs, still buggy
	- slow testing daemons running
	- should be posted to lkml shortly
%%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Future Goals
%fore "skyblue"
	- write documentation
	- improved userland daemons
	- remote write journal gateway
	- raid5 support
	- detect access patterns and relocate data
	- hardware?
	- who knows!
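%%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Appendix - CRC trick, endian-neutral sketch
%fore "skyblue"
A minimal sketch of the inverse CRC32 idea from the CRC trick slide, not the netmd code itself. It assumes the reflected CRC32 polynomial 0xEDB88320 and computes four bytes that, fed in after a given CRC register state, force the final CRC32 to a chosen value. The function names (crc_tables_init, crc_feed, crc32_buf, crc_patch) are hypothetical. crc_table and rev_crc_table follow the slide, with rev_crc_table assumed to map the top byte of a table entry back to its index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table[256];     /* forward table, reflected poly 0xEDB88320 */
static uint8_t  rev_crc_table[256]; /* top byte of crc_table[i] -> i */

/* Build both tables; call once before using the functions below. */
static void crc_tables_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[i] = c;
        rev_crc_table[c >> 24] = (uint8_t)i; /* top bytes are all distinct */
    }
}

/* Feed len bytes into a raw CRC register (no initial or final inversion). */
static uint32_t crc_feed(uint32_t reg, const uint8_t *p, size_t len)
{
    while (len--)
        reg = crc_table[(reg ^ *p++) & 0xff] ^ (reg >> 8);
    return reg;
}

/* Standard CRC32 of a whole buffer. */
static uint32_t crc32_buf(const uint8_t *p, size_t len)
{
    return crc_feed(0xffffffffu, p, len) ^ 0xffffffffu;
}

/*
 * Compute 4 bytes that, fed after register state reg, make the final
 * (inverted) CRC32 equal to wanted.
 */
static void crc_patch(uint32_t reg, uint32_t wanted, uint8_t out[4])
{
    uint32_t target = wanted ^ 0xffffffffu; /* register state to reach */
    uint8_t idx[4];

    assert(crc_table[1] != 0); /* crc_tables_init() must have run */

    /* Backward pass: the top byte of the state names the last table index. */
    for (int i = 3; i >= 0; i--) {
        idx[i] = rev_crc_table[target >> 24];
        target = (target ^ crc_table[idx[i]]) << 8;
    }
    /* Forward pass: pick each byte so the lookup hits the recovered index. */
    for (int i = 0; i < 4; i++) {
        out[i] = (uint8_t)((reg ^ idx[i]) & 0xff);
        reg = crc_table[idx[i]] ^ (reg >> 8);
    }
}
```

As a usage example: after crc_tables_init(), run the 9 bytes "123456789" through crc_feed starting from 0xffffffff, then pass the resulting register to crc_patch with wanted = 0xdeadbeef. Appending the 4 output bytes makes crc32_buf over all 13 bytes return 0xdeadbeef. Working with shifts instead of a byte pointer into wanted_crc keeps the sketch independent of host byte order.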