First off, I should point out that we have two of these. One works, one doesn't. On the one that works, when I run a simple Bonnie++ test I get results like this:
Version 1.95 | Concurrency 1 | OS RHEL 5.3 | size 118G

Test | K/sec | %CPU | Latency
---|---|---|---
Sequential Output, Per Chr | 749 | 99 | 11964us
Sequential Output, Block | 172335 | 42 | 25145ms
Sequential Output, Rewrite | 66940 | 15 | 3660ms
Sequential Input, Per Chr | 1518 | 59 | 346ms
Sequential Input, Block | 199006 | 18 | 571ms
Random Seeks | 160.0 (/sec) | 24 | 45881us

Test | /sec | %CPU | Latency
---|---|---|---
Sequential Create, Create | 28611 | 66 | 13800us
Sequential Create, Read | +++++ | +++ | 108us
Sequential Create, Delete | +++++ | +++ | 1157us
Random Create, Create | +++++ | +++ | 1328us
Random Create, Read | +++++ | +++ | 13us
Random Create, Delete | +++++ | +++ | 1145us
Not very good at all, especially considering it has 4GB of cache and this is the only test running. I mean 25145ms worth of latency for the block output? Yikes!
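To put those latency figures on one scale, here is a minimal Python sketch that converts the Bonnie++ latency strings to milliseconds and picks out the worst one. The dictionary keys and the helper names `to_ms` and `worst` are made up for illustration; only the numbers come from the run above.

```python
import re

# Latency figures copied from the Bonnie++ run above.
# The key names are illustrative labels, not Bonnie++ output.
LATENCIES = {
    "seq output, per chr": "11964us",
    "seq output, block": "25145ms",
    "seq output, rewrite": "3660ms",
    "seq input, per chr": "346ms",
    "seq input, block": "571ms",
    "random seeks": "45881us",
}

# Multipliers to turn each unit suffix into milliseconds.
UNIT_TO_MS = {"us": 0.001, "ms": 1.0, "s": 1000.0}


def to_ms(text):
    """Convert a latency string such as '11964us' to milliseconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(us|ms|s)", text)
    if match is None:
        raise ValueError(f"unrecognized latency: {text!r}")
    value, unit = match.groups()
    return float(value) * UNIT_TO_MS[unit]


def worst(latencies):
    """Return (test name, latency in ms) for the slowest test."""
    name = max(latencies, key=lambda key: to_ms(latencies[key]))
    return name, to_ms(latencies[name])


if __name__ == "__main__":
    for name, raw in LATENCIES.items():
        print(f"{name:22s} {to_ms(raw):10.3f} ms")
    print("worst:", worst(LATENCIES))
```

Run as a script, it lists every test in milliseconds, which makes it plain that the block output latency is around 25 seconds while the per-character output latency is about 12 milliseconds.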
On the second one, it seems to work ok. I can do some "dd if=/dev/random of=somefile" of arbitrary size and it happily chugs along. But if you run Bonnie++ against it you get a lot of errors:
Jul 21 10:09:36 oracle-dev-02 kernel: : exe="?" (sauid=81, hostname=?, addr=?, terminal=?)'
Jul 21 11:00:50 oracle-dev-02 kernel: sd 0:0:0:1: SCSI error: return code = 0x08000002
Jul 21 11:00:50 oracle-dev-02 kernel: sda: Current: sense key: Aborted Command
Jul 21 11:00:50 oracle-dev-02 kernel: Add. Sense: Scsi parity error
Jul 21 11:00:50 oracle-dev-02 kernel:
Jul 21 11:00:50 oracle-dev-02 kernel: end_request: I/O error, dev sda, sector 934774511
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846806
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846807
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846808
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846809
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846810
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846811
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846812
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846813
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846814
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846815
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Aborting journal on device sda1.
Jul 21 11:00:50 oracle-dev-02 kernel: __journal_remove_journal_head: freeing b_committed_data
Jul 21 11:00:50 oracle-dev-02 last message repeated 5 times
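When a log like this runs to hundreds of lines, it helps to boil it down to which sectors and blocks actually failed. Below is a rough Python sketch that does so. The regular expressions are written against the message formats shown above, and the names `SAMPLE_LOG`, `failed_sectors`, `failed_blocks` and `contiguous_runs` are made up for this example. Other kernels may word these messages differently.

```python
import re

# A few entries copied from the kernel log above.
SAMPLE_LOG = """\
Jul 21 11:00:50 oracle-dev-02 kernel: sd 0:0:0:1: SCSI error: return code = 0x08000002
Jul 21 11:00:50 oracle-dev-02 kernel: end_request: I/O error, dev sda, sector 934774511
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846806
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846807
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846808
"""

# Patterns assume the exact wording seen in the log above.
SECTOR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
BLOCK_RE = re.compile(r"Buffer I/O error on device (\w+), logical block (\d+)")


def failed_sectors(log_text):
    """Return a list of (device, sector) pairs from end_request errors."""
    return [(dev, int(sector)) for dev, sector in SECTOR_RE.findall(log_text)]


def failed_blocks(log_text):
    """Return {device: sorted list of failed logical block numbers}."""
    blocks = {}
    for dev, block in BLOCK_RE.findall(log_text):
        blocks.setdefault(dev, set()).add(int(block))
    return {dev: sorted(found) for dev, found in blocks.items()}


def contiguous_runs(numbers):
    """Collapse sorted integers into a list of (first, last) runs."""
    runs = []
    for number in numbers:
        if runs and number == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], number)
        else:
            runs.append((number, number))
    return runs


if __name__ == "__main__":
    print("sectors:", failed_sectors(SAMPLE_LOG))
    for dev, blocks in failed_blocks(SAMPLE_LOG).items():
        print(dev, contiguous_runs(blocks))
```

Fed the full log, this would report a single run of ten consecutive blocks on sda1, 116846806 through 116846815. The failing sector also equals the first failed block times 8, plus 63 (116846806 × 8 + 63 = 934774511). That fits a 4 KiB filesystem block size and a partition starting at sector 63, though both are guesses rather than anything the log states.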
Wee, fun! Even better, it corrupts the disk beyond recognition. You can't even fsck it in a reasonable amount of time. It's faster to reformat. We've had HP on site and opened a case or two on this, but so far have gotten nowhere. For me the biggest issue is that I don't have enough time to sit on hold with the call center while they find someone who knows how to use Linux with an MSA.
So far I've tried swapping HBAs, cables, ports, and fibers, and updating all the firmware and drivers (the MSA, the HBA, OS, etc). I've tried different versions of RHEL (5.3 and 5.2). I've tried using only the HP-supplied drivers, firmware and utils. All with the same results. Later this week, I'm going to give in and install win2k3 or win2k8 and run something like Bst5 or iozone and hope I can reproduce the error. Under low loads it doesn't error out. It performs poorly but doesn't error out. I hate these kinds of problems. There's obviously something wrong with the array, since one works and the other doesn't, but it passes all the diags. At some point this thing's going to end up like the printer in Office Space.