Tuesday, July 21, 2009

My first run-in with DotHill storage

So I'm having my first run-in with a DotHill array (rebranded as an HP product). The HP model number is MSA2012fc; the DotHill number would be 2730. It's your typical-looking 3U, 12-disk array. It has two controllers and two 4Gb uplinks per controller. Not too bad. It only does whole-disk RAID sets, however, so it'd be a little silly to plug it into a SAN switch, but it can be done. The HP web interface is pretty straightforward and built on simple concepts: you create a vdisk, which is a RAID set of drives, and then you carve off chunks of that vdisk to present as LUNs to the hosts. Similar to the old-school CLARiiONs, it has the notion of assigning a LUN to a particular controller, so you have to manually/mentally balance your workload. It does provide some fairly comprehensive performance stats, which help in that regard, and there's a command line interface (w00t!). So far my results with these devices haven't been good at all.
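For the curious, the CLI workflow mirrors the web UI. The lines below are a rough sketch from memory, not verbatim commands: the exact syntax, disk numbering, and names (vd01, vol01, the disk list) are placeholders, and the real flags come from the MSA2000-series CLI reference.

# Approximate MSA/DotHill CLI workflow; syntax is paraphrased, check the CLI guide.
show vdisks                                    # list existing RAID sets
create vdisk level raid5 disks 1.1-1.11 vd01   # build a vdisk from whole disks (placeholder disk list)
create volume vdisk vd01 size 500GB vol01      # carve a chunk of the vdisk into a volume
map volume vol01 lun 1                         # present it to the hosts as LUN 1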

First off, I should point out that we have two of these. One works, one doesn't. On the one that works, when I run a simple Bonnie++ test I get results like this:

Version 1.95            ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1         -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine            Size K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
RHEL 5.3           118G   749   99 172335  42  66940  15  1518  59 199006  18  160.0  24
Latency                 11964us   25145ms    3660ms     346ms     571ms   45881us
Version 1.95            ------Sequential Create------ --------Random Create--------
                        -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                         /sec %CPU  /sec %CPU  /sec %CPU  /sec %CPU  /sec %CPU  /sec %CPU
                        28611  66 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Latency                 13800us     108us    1157us    1328us      13us    1145us


Not very good at all, especially considering it has 4GB of cache and this is the only test running. I mean 25145ms worth of latency for the block output? Yikes!
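For context, a Bonnie++ invocation along the following lines produces a table like the one above. The mount point and user are placeholders, and -s is set well past the host's RAM so that neither the page cache nor the array's 4GB of controller cache can hide the disks:

# Placeholder mount point and user; size -s well beyond RAM so caching can't mask the spindles.
bonnie++ -d /mnt/msa -s 118g -u root -m "RHEL 5.3"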

On the second one, things seem to work OK at first: I can run a "dd if=/dev/random of=somefile" of arbitrary size and it happily chugs along. But try a Bonnie++ run against it and you get a lot of errors:

Jul 21 11:00:50 oracle-dev-02 kernel: sd 0:0:0:1: SCSI error: return code = 0x08000002
Jul 21 11:00:50 oracle-dev-02 kernel: sda: Current: sense key: Aborted Command
Jul 21 11:00:50 oracle-dev-02 kernel:     Add. Sense: Scsi parity error
Jul 21 11:00:50 oracle-dev-02 kernel:
Jul 21 11:00:50 oracle-dev-02 kernel: end_request: I/O error, dev sda, sector 934774511
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846806
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846807
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846808
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846809
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846810
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846811
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846812
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846813
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846814
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Buffer I/O error on device sda1, logical block 116846815
Jul 21 11:00:50 oracle-dev-02 kernel: lost page write due to I/O error on sda1
Jul 21 11:00:50 oracle-dev-02 kernel: Aborting journal on device sda1.
Jul 21 11:00:50 oracle-dev-02 kernel: __journal_remove_journal_head: freeing b_committed_data
Jul 21 11:00:50 oracle-dev-02 last message repeated 5 times

Whee, fun! Even better, it corrupts the filesystem beyond recognition. You can't even fsck it in a reasonable amount of time; it's faster to reformat. We've had HP on site and opened a case or two on this, but so far have gotten nowhere. For me the biggest issue is that I don't have enough time to sit on hold with the call center while they find someone who knows how to use Linux with an MSA.

So far I've tried swapping HBAs, cables, ports, and fibers; updating all the firmware and drivers (the MSA, the HBA, the OS, etc.); different versions of RHEL (5.3 and 5.2); and using only the HP-supplied drivers, firmware, and utilities. All with the same results. Later this week I'm going to give in, install Win2k3 or Win2k8, and run something like Bst5 or IOzone and hope I can reproduce the error. Under low load it doesn't error out; it performs poorly, but it doesn't error out. I hate these kinds of problems. There's obviously something wrong with the array, since one works and the other doesn't, but it passes all the diags. At some point this thing's going to end up like the printer in Office Space.
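In the meantime, if you want sustained write pressure on the suspect LUN without Bonnie++ in the picture, a crude loop like the one below is usually enough to show whether the SCSI aborts are purely load-related. The mount point and sizes are made up for illustration:

# Hypothetical mount point and sizes; adjust to taste.
for i in $(seq 1 20); do
    dd if=/dev/zero of=/mnt/msa/stress.$i bs=1M count=4096 oflag=direct
done
# Watch for aborts and I/O errors while it runs:
tail -f /var/log/messages | grep -i 'scsi error\|i/o error'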

Friday, July 10, 2009

Getting back to Linux...

Back in my university days I was all about Linux. My first 'machine' was a 386SX, probably 16MHz or so, and it booted off a 5.25" floppy. You had to compile the kernel every time you wanted to make any kind of change and then 'rawrite' it out to floppies. And forget about package management (well, until Slackware for me...). My first 'workstation/server' that I seriously used, scuba.uwsuper.edu, was a 486DX-50 with a Cyrix CPU, around 1992. I think it might have had 256MB of RAM and an 80GB Seagate drive (3.5" form factor, no less!). I think some archive.org listings for the web pages I used to host on it are still around, although only from near the end of my use of it: http://web.archive.org/web/*/scuba.uwsuper.edu Good times...


Then I moved to the DC area and started working with Sun and AIX hardware, and Linux became a novelty/side item for me. Red Hat, back when you could run it and not pay for it if you didn't want to. I'd have a second PC in my office, mostly to act as my X server for working with the Sun boxes. At Convergys and the Red Cross we had Linux, a fair bit of it too, but in most cases it was never the 'core' of the product/platform offerings.


Well, my current employer, StreamSage, is primarily a Linux shop, in particular a Red Hat shop (Comcast, the corporate parent, is a large Red Hat customer). So it's been an interesting time getting back into the swing of things. On the one hand, I really like getting back into the Linux state of mind. On the other hand, I've really come to appreciate the work that has gone into AIX and Solaris in terms of hardware management, diagnostics, and configuration. There are Linux equivalents in a lot of cases, and a lot of it is an artifact of the hardware and software being built by the same people, but boy, do I miss the AIX and Solaris troubleshooting tools.


The unpredictable future and buying hardware...

At my previous client's site, they have some old Sun servers that they're upgrading to M5000s. The hardware they evaluated (and they did very rigorous testing of it) was a T5240, a T5440, and an M5000. The M5000 was configured with 4 processors, not the full 8 that are possible, which is the subject of this post. When choosing the M-series server they decided to go with the M5000 because it would have 2 free slots, allowing them to add memory or processors later. In other words, they're trying to protect themselves against a CPU utilization problem down the road by having slots to put additional capacity in. I've been down this road a few times myself. I bought 880s and 890s with 4 procs just in case we needed the other 4 down the road. Unfortunately, most of the time I never needed those slots and wasted the rack space, power, and cooling. In my current client's case, they should probably go with the M4000 instead.


On a list-price comparison (over a five-year life) you get:


M4000 $66,380 M5000 $81,880


Maintenance (numbers are swags for platinum pre-paid for 3 years):


M4000 $17,000 M5000 $22,000


I'm typically a Veritas user, so that adds complexity. Last time I looked (a year and a half ago) the M4000 was a tier E and the M5000 was a tier H, so Storage Foundation (SF) for Oracle for both would be:


M4000 $4,000 M5000 $9,000


Veritas Maintenance would be somewhere around (swag, 3 years):


M4000 $2,400 M5000 $2,700


The rest is a bit of a wash, which brings us to totals of:


M4000 $89,780 M5000 $115,580

(That's 66,380 + 17,000 + 4,000 + 2,400 for the M4000 and 81,880 + 22,000 + 9,000 + 2,700 for the M5000.)


So the M5000, which has two advantages (4 internal drives, of no value unless you're partitioning, and 2 expansion slots, of potential future value), carries a roughly 29% price premium over the M4000.

The reason engineers like myself make choices like this is the unpredictable future. Often, when I'm asked to spec out hardware for an application, I'm given initial requirements like 12,000 total users, 300 users concurrently, and, if I'm lucky, some information about the resource utilization associated with each user session. Most of the time it's a shot in the dark, though, and I have to dig around for similar usage profiles via Google and work them into my sizing model. But that part is relatively straightforward. There's some art and finesse to it, but at the end of the day it usually comes down to some derivative of X sessions * Y MB per session + overhead + wiggle room = Z GB of memory (a toy version of that arithmetic is sketched below). The same kind of thing goes for CPU and I/O.

Where it gets hard is when you have to forecast the life of the machine. You're forced to pick a machine that will meet the needs of not only year one but years two through four or five as well. When we ask the customer what their growth rate is, they'll usually shrug and give a non-answer, or they'll give an answer based directly on other unknowable facts like "our user base will increase at the same percentage as our market share." Great. Thanks for that. It's very tempting to just go out and buy the top-of-the-line server to ensure we never hit a resource problem: buy a tour bus when all we need is a passenger van. But when they see the sticker price of that tour bus, we're usually back to the drawing board. That's what makes machines like the M5000, or the 890 it replaced, so appealing. There's room for an extra row of seats in case the number of passengers increases drastically. Unfortunately, you have to pay extra fuel costs to haul that empty space around (maintenance), and there's the up-front acquisition cost as well.
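To make that sizing formula concrete, here's a toy version of the memory half of it. Every number is made up purely for illustration; the point is the shape of the calculation, not the values:

# Hypothetical inputs; none of these numbers come from a real requirements doc.
SESSIONS=300          # expected concurrent user sessions
MB_PER_SESSION=40     # assumed footprint per session
OVERHEAD_MB=4096      # OS, app server, monitoring, etc.
HEADROOM_PCT=130      # ~30% wiggle room

TOTAL_MB=$(( (SESSIONS * MB_PER_SESSION + OVERHEAD_MB) * HEADROOM_PCT / 100 ))
echo "Rough memory target: $(( TOTAL_MB / 1024 )) GB"   # ~20 GB with these numbers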


It's all about going back to the well. The reason we over-build our infrastructure this way is the difficulty of going back to the well for additional funding. In my work in the non-profit space there's a real risk of that well being dry, too. For example, I could buy the M4000 and then, if I have problems in year 2 or 3, do a forklift upgrade to the M5000 (swap the boot disks and away I go). Easy stuff, except I have to actually buy that M5000, which comes with lots of questions: Why didn't you buy an M5000 from the start? Why were your forecasts wrong? Where do you think we can come up with that kind of money? Collective amnesia will shift all the blame to the people who spec'ed out the system. Blame rolls downhill. It picks up mass and speed as it rolls, and engineers are usually at the bottom of the hill with the operations folks (often one and the same). So we buy machines that have that 'extra reserve' built in. In high-end servers you can usually turn on the additional capacity by purchasing a license key, but in the mid-to-low-end range we're only offered machines with expandability. So if our forecast is off or the conditions change, we're able to bring a lower incremental cost to the table to gain additional performance and capacity. Unfortunately for me, however, I have rarely needed that expanded capacity. I can only remember two examples: one success, where we added two boards (4 CPUs plus memory) to an 890, and one case where they no longer made the version of the board we had in the server, which meant we would have had to replace all the boards at a cost of almost as much as replacing the server outright.


I used to be a 'keep something in reserve' kind of engineer: be able to pull the rabbit out of the hat to meet the increased demand that we didn't know was coming, basically pull off a Montgomery Scott to save the day. By doing so, however, I have enabled the behavior that got me here in the first place: by not purchasing the equipment the requirements suggest and instead adding some reserve "just in case," the cycle repeats itself. Now, I'm not going to purchase the bare minimum needed to meet the requirements given to me (however flawed they may be), but I am going to start putting the decision back on the requesters and have them make the choice. In writing. With as much concurrence as can be achieved from the project team as a whole. So if I were to travel back in time to before the aforementioned M5000s were purchased, I would offer the M4000s instead. I would tell them: you save X dollars up front; your downside risk is that you may have to replace this server if your usage or growth models are wrong. And, perhaps most importantly, I would get documented concurrence from the stakeholders.


This has turned into a much longer post than I had originally intended... phew. Now on to my next client/project.