Tuesday, July 21, 2009

My first run-in with DotHill storage

So I'm having my first run-in with a DotHill array (rebranded as an HP product). The HP model number is MSA2012fc, but the DotHill number would be a 2730. It's your typical-looking 3U 12-disk array. It has two controllers and two 4Gb uplinks per controller. Not too bad. It only does whole-disk RAID sets, however, so it'd be a little silly to plug it into a SAN switch, but it can be done.

The HP web interface is pretty straightforward and has simple concepts: you create a vdisk, which is a RAID set of drives, and then you carve off chunks of that vdisk to present as LUNs to the hosts. Similar to the old-school Clariions, it has the notion of assigning a LUN to a particular controller, so you have to manually/mentally balance your workload. It does provide some fairly comprehensive performance stats, which can help in that regard, and there's a command line interface, w00t! So far my results with these devices haven't been good at all.

First off, I should point out we have two of these. One works, one doesn't. On the one that works, when I run a simple Bonnie++ test I get results like this:

Version 1.95, Concurrency 1, OS RHEL 5.3, size 118G

Test                          Rate            %CPU   Latency
Sequential Output, Per Chr    749 K/sec       99     11964us
Sequential Output, Block      172335 K/sec    42     25145ms
Sequential Output, Rewrite    66940 K/sec     15     3660ms
Sequential Input, Per Chr     1518 K/sec      59     346ms
Sequential Input, Block       199006 K/sec    18     571ms
Random Seeks                  160.0 /sec      24     45881us

Test                          Rate            %CPU   Latency
Sequential Create, Create     28611 /sec      66     13800us
Sequential Create, Read       +++++           +++    108us
Sequential Create, Delete     +++++           +++    1157us
Random Create, Create         +++++           +++    1328us
Random Create, Read           +++++           +++    13us
Random Create, Delete         +++++           +++    1145us


Not very good at all, especially considering it has 4GB of cache and this is the only test running. I mean, 25145ms (about 25 seconds) of latency for the block output? Yikes!
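To put those throughput numbers in perspective, here's a quick back-of-the-envelope conversion. It's only a sketch: it assumes Bonnie++'s K/sec means KiB per second, and that a single 4Gb Fibre Channel link carries roughly 400 MB/s of payload in one direction, which is the usual nominal figure and not something measured on this array.

```python
# Rough conversion of the Bonnie++ block results above.
# Assumptions (not measured): K/sec means KiB/s, and one 4Gb FC link
# carries about 400 MB/s of payload in one direction.

FC_4GB_MB_PER_SEC = 400.0  # nominal figure, assumed

results_k_per_sec = {
    "block write": 172335,
    "rewrite": 66940,
    "block read": 199006,
}


def to_mb_per_sec(k_per_sec):
    """Convert KiB/s to decimal MB/s."""
    return k_per_sec * 1024 / 1_000_000


def link_share(k_per_sec, link_mb_per_sec=FC_4GB_MB_PER_SEC):
    """Fraction of one link's nominal throughput."""
    return to_mb_per_sec(k_per_sec) / link_mb_per_sec


if __name__ == "__main__":
    for name, rate in results_k_per_sec.items():
        print(f"{name:12s} {to_mb_per_sec(rate):7.1f} MB/s "
              f"({link_share(rate):.0%} of one 4Gb link)")
```

Under those assumptions, even the best number here (block reads) works out to about half of what one 4Gb link should manage on paper, and rewrites to well under a fifth.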

The second one seems to work OK at first. I can do a "dd if=/dev/random of=somefile" of arbitrary size and it happily chugs along. But if you run Bonnie++ against it, you get a lot of errors:

Jul 21 10:09:36 oracle-dev-02 kernel: : exe="?" (sauid=81, hostname=?, addr=?, terminal=?)'
Jul 21 11:00:50 oracle-dev-02 kernel: sd 0:0:0:1: SCSI error: return code = 0x08000002
Jul 21 11:00:50 oracle-dev-02 kernel: sda: Current: sense key: Aborted Command
Jul 21 11:00:50 oracle-dev-02 kernel:     Add. Sense: Scsi parity error
Jul 21 11:00:50 oracle-dev-02 kernel:
Jul 21 11:00:50 oracle-dev-02 kernel: end_request: I/O error, dev sda, sector 934774511
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846806
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846807
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846808
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846809
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846810
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846811
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846812
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846813
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846814
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846815
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Aborting journal on device sda1.
Jul 21 11:00:50 oracle-dev-02 kernel: __journal_remove_journal_head: freeing b_committed_data
Jul 21 11:00:50 oracle-dev-02 last message repeated 5 times
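
The numbers in that log can be picked apart a bit. The sketch below assumes the 2.6-era Linux layout for the 32-bit SCSI return code (driver byte, host byte, message byte, status byte, from high to low). It also guesses that sda1 uses 4 KiB filesystem blocks and starts at sector 63, the old DOS partitioning default. Neither guess was read off the box; block_size and part_start are purely illustrative parameters.

```python
# Decode the SCSI result word from the log, assuming the Linux 2.6 layout:
# bits 31-24 driver byte, 23-16 host byte, 15-8 message byte, 7-0 status byte.

DRIVER_SENSE = 0x08      # driver byte: sense data is available
CHECK_CONDITION = 0x02   # SCSI status byte: CHECK CONDITION


def decode_result(result):
    """Split a 32-bit SCSI result into its four byte fields."""
    return {
        "driver": (result >> 24) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "status": result & 0xFF,
    }


def block_to_sector(logical_block, block_size=4096, part_start=63,
                    sector_size=512):
    """Map a filesystem block on a partition to an absolute disk sector.

    block_size and part_start are assumptions, not values read from
    the system.
    """
    return part_start + logical_block * (block_size // sector_size)


if __name__ == "__main__":
    fields = decode_result(0x08000002)
    print(fields)
    print("sense available:", fields["driver"] == DRIVER_SENSE)
    print("check condition:", fields["status"] == CHECK_CONDITION)
    print("first failing block maps to sector", block_to_sector(116846806))
```

Read that way, the host byte is 0, so the host side isn't flagging a transport failure of its own; the array is returning CHECK CONDITION with sense data, which fits the "Aborted Command" and "Scsi parity error" lines. And with the assumed 4 KiB blocks and sector-63 start, logical block 116846806 lands on sector 934774511, the same sector the end_request line complains about.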

Whee, fun! Even better, it corrupts the filesystem on the disk beyond recognition. You can't even fsck it in a reasonable amount of time; it's faster to reformat. We've had HP on site and opened a case or two on this, but so far have gotten nowhere. For me the biggest issue is I don't have enough time to sit on hold with the call center while they find someone who knows how to use Linux with an MSA.

So far I've tried swapping HBAs, cables, ports and fibers, and updating all the firmware and drivers (the MSA, the HBA, the OS, etc.). I've tried both RHEL 5.3 and 5.2. I've tried using only the HP-supplied drivers, firmware and utils. All with the same results. Later this week, I'm going to give in and install win2k3 or win2k8 and run something like Bst5 or iozone and hope I can reproduce the error. Under low loads it doesn't error out; it performs poorly, but it doesn't error out. I hate these kinds of problems. There's obviously something wrong with the array, since one works and the other doesn't, but it passes all the diags. At some point this thing's going to end up like the printer in Office Space.
