Wednesday, 28 July 2010

Dell EqualLogic Multipathing Extension Module Performance Testing

Now that vSphere v4.1 is out, Dell EqualLogic have finally released their much-awaited Multipathing Extension Module (MEM) for their PS Series SANs. With storage performance being key to the performance of any virtualisation infrastructure, I was keen to find out just how much more could be squeezed out of a Dell EqualLogic SAN.

All my deployments typically use what I shall refer to from here on in as ‘Classic MPIO’ - that is, multiple VMkernel ports on a vSwitch, each VMkernel port tied to a specific physical uplink, the VMkernel ports bound to the software iSCSI initiator, and VMware’s round-robin path selection policy set on each EqualLogic volume.
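
For reference, here is a rough sketch of the plumbing behind that ‘Classic MPIO’ setup, assuming two iSCSI VMkernel ports on vSwitch1 and the software iSCSI adapter at vmhba33 (the port group names, IPs and device ID below are illustrative, not taken from my lab):

    # Jumbo frames on the vSwitch, then one VMkernel port per physical uplink
    # (the active/unused uplink override per port group is set in the vSphere Client)
    esxcfg-vswitch -m 9000 vSwitch1
    esxcfg-vswitch -A iSCSI1 vSwitch1
    esxcfg-vswitch -A iSCSI2 vSwitch1
    esxcfg-vmknic -a -i 10.0.10.11 -n 255.255.255.0 -m 9000 iSCSI1
    esxcfg-vmknic -a -i 10.0.10.12 -n 255.255.255.0 -m 9000 iSCSI2

    # Bind each VMkernel port to the software iSCSI initiator
    esxcli swiscsi nic add -n vmk1 -d vmhba33
    esxcli swiscsi nic add -n vmk2 -d vmhba33

    # Set the round-robin path selection policy on each EqualLogic volume
    esxcli nmp device setpolicy --device naa.6090a038xxxxxxxx --psp VMW_PSP_RR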

So before I hand over the results of my testing, I should cover the test setup hardware and configuration:

  • Dell EqualLogic PS5000XV SAN
    • 16x 146.8GB 15Krpm SAS Hard Disks
    • Firmware v5.0.1
    • RAID 50
  • Dell PowerEdge M610 Blade
    • 2x Xeon X5570 CPUs
    • 24GB RAM
    • 2x 1Gb Mezzanine NICs dedicated to iSCSI
    • Ethernet pass-through modules
  • Nortel 5510-48T Switch Stack
    • 2 switches in the stack
    • SAN & server connectivity distributed across switches
    • Flow control enabled
    • Jumbo frames enabled (see the quick sanity check after this list)
  • VMware vSphere v4.1
    • ESXi build 260247
    • EqualLogic Multipathing Extension Module for VMware vSphere v1.0.0
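
Since jumbo frames and flow control only help if they actually work end-to-end, a quick sanity check before benchmarking is worthwhile; something along these lines does the job (the SAN group IP here is illustrative):

    # Confirm the VMkernel ports report MTU 9000
    esxcfg-vmknic -l

    # Send an 8972-byte payload with 'don't fragment' set to the SAN group IP -
    # if this fails, jumbo frames aren't clean end-to-end
    vmkping -d -s 8972 10.0.10.50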

I’m not going to go through the setup of MEM here as it has already been covered in Jon’s excellent write-up at Virtualization Buster, and I see no reason to reinvent the wheel! I will recommend the Remote CLI installation he does though, as I couldn’t get the package to deploy to my hosts using Update Manager - the package installs fine, but it fails to show as an applicable update no matter how many times you scan the hosts.
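
If you go the Remote CLI route, pushing the offline bundle with vihostupdate from the vSphere CLI looks something like this, with the host in maintenance mode (the bundle filename is illustrative - use the name of the package you download from EqualLogic):

    # Install the MEM offline bundle on an ESXi 4.1 host
    vihostupdate.pl --server esxi01.example.local --username root --install --bundle dell-eql-mem-esx4-1.0.0.zip

    # Confirm it shows up in the list of installed bulletins
    vihostupdate.pl --server esxi01.example.local --username root --query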

To perform the testing I used Iometer v2006.07.27 against an unpartitioned RDM accessed by a virtual machine running on the test host. Iometer was configured as follows:

  • Single worker
  • Unpartitioned RDM selected as disk target
  • Maximum Disk Size set to 8192 sectors
    • Limiting the test area to 8192 sectors (8192 × 512 bytes = 4MB) means nearly all I/Os are served out of the array’s cache after the first pass. The array’s cache memory receives all incoming write data; if that data is read back immediately, the array serves it from the cache and it is never written to the drives, so the array turns I/Os around as quickly as they come in.
  • # of Outstanding I/Os set to 64 per target
    • This is in line with documented Dell EqualLogic recommendations for SAN performance testing

I performed a number of tests: first with the ‘Classic MPIO’ configuration, and then the same tests again after the host had been configured with MEM. All tests were run for 60 seconds and the results are averages over that time. I settled on a variety of test regimes to see what could be achieved in both sequential and random I/O environments with read, write and mixed read/write loads, as defined below:

  • 32K 100% Sequential Read
  • 32K 100% Random Read
  • 32K 75% Sequential Read (25% Sequential Write)
  • 32K 75% Random Read (25% Random Write)
  • 32K 100% Sequential Write
  • 32K 100% Random Write

The test probably most applicable to virtualisation environments would be the 32K 75% Random Read (25% Random Write) test, as it’s the most taxing test on the list.

First up we have a look at IOPS:

[Graph: IOPS - Classic MPIO vs MEM]

It’s pretty clear from the IOPS results that MEM has made significant improvements to how hard you can drive your SAN - that’s a 30-40% performance improvement on almost all tests with the exception of the random write test!

I put the smaller increase in write performance purely down to the RAID 50 configuration of the test SAN - those of you with RAID 10 SANs should see even better figures. Peaks from the sequential read/write test averaged out at over 9,300 IOPS! Absolutely stunning!

IOPS are all well and good, but how many megabytes are we actually throwing around here? Allow me to answer that with another graph:

[Graph: MB/sec - Classic MPIO vs MEM]

Total bandwidth on the sequential read/write test peaked at over 290MB/sec. I have never seen a single gigabit connection exceed 125MB/sec, so to get 40MB/sec more over two connections is quite simply an amazing feat.

The wow factor didn’t stop there. Next up came the latency results:

[Graph: latency - Classic MPIO vs MEM]

Not only has raw bandwidth been improved, it comes at a lower latency when saturated too! In my opinion this is more important than the raw performance figures. I’ve already waxed lyrical about this in my Dell EqualLogic SAN review, but suffice to say that any random I/O workload (such as a SQL database) will benefit significantly from this improvement.

Finally, I wanted to see how much extra CPU time a virtual machine would use when dealing with the increased workload capacity that MEM provides, so here’s a graph to illustrate my findings:

[Graph: CPU utilisation - Classic MPIO vs MEM]

Yes, there is an increase, but it’s not much. It’s also broadly in line with the other graphs, with only the write tests differing. Again, I would put this down primarily to the RAID 50 write performance of the SAN - the VM isn’t going to be using much CPU if it’s just waiting for the SAN to write data!

In conclusion, Dell EqualLogic have pulled yet another ace from their sleeve with their Multipathing Extension Module for VMware vSphere - more raw performance at a lower latency for every customer hosting a VMware vSphere v4.1 infrastructure and at no cost! Bargain!

My only question is: Why aren’t you using it yet? :)

I hope you find this article helpful - now where did I put that ‘I love EqualLogic’ badge. ;)

30 comments:

  1. Have you seen any Svmotion errors/timeouts with your testing?

    I have EQL 6510 running FW 5.0.1 and ESX 4.1 and I see errors now.

  2. No errors here - sVMotion & template deployment are noticeably faster too!

    What switches are you using?

  3. Total bandwidth on the sequential read/write test peaked at over 290MB/sec. I have never seen a single gigabit connection exceed 125MB/sec, so to get 40MB/sec more over two connections is quite simply an amazing feat.

    Quite amazing indeed - in fact it must be using caching on the client end, as 1Gbit/s equates to 125MByte/s, which is all it can transfer.

    On top of that you lose a little more to header overhead, so perhaps these numbers need some sanity checking.

  4. Thank you for your comment.

    I thought about this more after writing the article and realised I'd missed an obvious fact - the gigabit link is full duplex and the test is both reading AND writing, which is why my total MB/sec is higher than the 125MB/sec limit.

    I guess in theory I should be able to max out a 1Gb full duplex connection at 125MB/sec in both directions simultaneously, which would equal 250MB/sec of total bandwidth used.

    I therefore stand by my results, just not all of my words. :)

  5. Hi,

    Thanks for this great post.
    Would it be possible for you to save the Iometer config to a file and leave it here?
    Also maybe the CSV file, so I would only have to copy in my results?
    I would love to compare things with the different EQL models I have.

    Regards

    Hans de Jongh
    my email is:

    hans (add) itcreation (dot) nl

  6. Thanks for the post, Graham. I recently went through a futile endeavor to reduce my write latency, and did a bunch of testing with "classic" round-robin, and didn't really get anywhere. I believe that latency is the performance killer for VMs, more so than IOPS. This leaves some questions open...for a SQL VM, should I now use RDMs for data and log volumes? Virtual disks? OR continue to use guest (pass-through) iSCSI as EQ support seems to favor? Hopefully we'll get some clarification from EQ support in the form of a best practices or tech report.

  7. Hi RobVM,

    Some good questions here!

    Yes, latency is king for performance. Doesn't matter how fat the pipe is, if latency sucks then so will the responsiveness of any VM placed on it. Think satellite internet - 155Mbit, but 2000+ ms latency = speedy downloads but sucky web browsing.

    Ultimately there is no 'wrong' way to achieve connectivity, so it's going to be a compromise between complexity, performance and features.

    Using iSCSI inside a VM is advantageous from an EqualLogic perspective as you can use ASM to get quiesced application aware volume snapshots of SQL/Exchange/etc. The negatives for this sort of solution would be increased complexity and management of the setup.

    The performance difference between RDMs and VMFS-hosted volumes is negligible these days. Going with VMFS-hosted virtual hard disks also improves the portability (sVMotion) of the virtual machine's drives, so I'd probably recommend VMFS unless you have a specific requirement for RDMs.

    I need to read up on the EqualLogic integration with VAAI to see if we can leverage the application snapshot integration features of ASM without the whole rigmarole of iSCSI setup within VMs.

    I'm not sure if the latency/bandwidth improvements of MEM would be applicable with a virtual machine port group. In theory, if you have MEM set up and a VMPG using vmxnet3 (10Gb) NICs, it *should* benefit, but without testing I'm unable to confirm.

    Hope this helps!

  8. I think your I/O numbers are a result of a lot of caching.

    16 disks, with 2 as spares, leaves 14 disks for I/O. That is more than 600 I/Os per disk, and a typical 15K SAS disk is around 180-200 I/Os. Over a longer time period it would be difficult to get these numbers.

    cheers
    andreas

  9. Hi Andreas,

    Indeed you are correct - we are ultimately testing performance of the controller here.

    I did explain this in my article when describing the Iometer configuration:

    Limiting the test area to 8192 sectors (8192 × 512 bytes = 4MB) means nearly all I/Os are served out of the array’s cache after the first pass. The array’s cache memory receives all incoming write data; if that data is read back immediately, the array serves it from the cache and it is never written to the drives, so the array turns I/Os around as quickly as they come in.

  10. Hi,

    There's some talk about needing to disable Jumbo Frames due to it not being supported on the iSCSI HBAs; did you use 1500 MTU on your configuration for these results?

    Phil

  11. Hi Phil,

    This testing was performed on regular NICs, not iSCSI HBAs - I was using the software iSCSI initiator.

    Jumbo frames (MTU 9000) were enabled throughout and flow control was enabled on the switch.

    Hope this answers your question!

    Graham

  12. I did some tests today on ESXi 4.1 with MEM on a PS6500E connected to M610 blades.
    I compared the Broadcom iSCSI hardware initiator with the ESXi software initiator.
    The results show much better performance on the software initiator compared to the Broadcom 5709 NICs.
    512KB block reading (around 30000 vs 40000 IOs)
    32KB reading (225Mb/s with the software initiator; with Broadcom the numbers were up and down a lot, and not good).

    Thanks for good post. Magnus

  13. Thanks for your comment Magnus.

    Interesting to see how those BCM5709 adapters performed - not very well by the looks of it!

    A quick Google shows lots of people having issues using that 'hardware' implementation, with some indicating that jumbo frames aren't supported - which would account for some degradation in performance.

    I also assume that using MEM isn't possible with hardware adapters? Oddly, it is an option when you run the MEM configure/install script. If not, this would further explain the lack of path optimisation.

    I still want to do some performance testing using the Microsoft iSCSI initiator inside a VM so that ASM can be used - I'm hoping that a single 10Gb vmxnet3 virtual NIC will still be able to saturate the MEM setup without having to fuss around creating a MEM setup in the VM.

    Watch this space!

    Graham

  14. I have both EqualLogic and Dell MD3000i iSCSI endpoints using the same vSwitch. Path selection for both device types is currently using Round Robin.

    My question is: can MEM be installed and the EqualLogic paths changed to use the new multipath selection, leaving the MD3000i still set to Round Robin, with everything still working properly?

    Thanks!

    Scott

  15. If you use MEM, all EqualLogic paths will change to the DELL_PSP_EQL_ROUTED method and the remaining non-EqualLogic paths will continue to use the round-robin method.

    I have this deployed on mixed systems and it works as described above.
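
    If you want to double-check which policy each device has ended up with, something along these lines on the host (or via the vCLI) will show it per device:

        # Each device listed includes a 'Path Selection Policy' line - EqualLogic volumes
        # should report DELL_PSP_EQL_ROUTED, while other arrays keep whatever PSP they had
        esxcli nmp device list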

    Hope this helps!

    Graham

  16. That is exactly what I was looking for! Thanks Graham!!

  17. Hi Graham,

    During the last few weeks, I installed my PS4000XV.

    I was running three free VMware ESXi 4.0 hosts, but reinstalled with the trial-based ESXi 4.1 and installed the MEM. Performance has increased quite a bit (using the configuration script, I also set up jumbo frames).

    Problem is that I'm not going to spend > 20.000 USD on vSphere Enterprise licences...

    When the trial period ends and I enter the free licence key (or SMB Essentials/Essentials Plus), I assume I'll automatically downgrade to VMware native routing.

    Are there other parts of the automatic MEM configuration that will fall apart?

    best regards,

    Nicolas

  18. Hi Nicolas,

    Good question - I have no idea what would happen once the licensed feature disappears.

    I suggest you test this - please let me know :)

    Graham

  19. Dear Graham,


    It keeps on working fine ;)


    best regards,

    Nicolas

  20. Thanks for the update Nicolas!

  21. Any chance you could run the tests for longer to test the out-of-cache results? I am impressed with the controller numbers, but I'm having problems finding RAID 50 tests using those same benchmarks. I am looking at a setup with one PS6010XV and one PS6000E to run a vSphere 4.1 architecture.


    Matt

  22. Hi Matt,

    Yeah I'll get something together in the new year - time permitting.

    G

  23. Great Article Graham,

    We have a PS6010XV (RAID 50, 16 x 600GB 15K SAS) connected to 2 x Dell 8024F 10Gb switches, and 4 x Dell 710 ESXi 4.1 hosts running dual-port Broadcom 57711 NICs. Over the last month or so we experienced a bit of pain due to the drivers for the NICs. This has been resolved and the system is stable, but I am not getting the Iometer numbers I would expect. I am a bit reluctant to use MEM until I am convinced there are no issues with the system or the network.

    Currently the ESXi servers are running RR. Using Iometer 2006 we are getting approx. 190MB/sec for 100% sequential read at 32K blocks, and around 300 for 50% read/write. We are achieving around 9,000 IOs and latency is around 4-5ms.

    We have had various people look at the system but no real answers at the moment.

    It would be great to see if you have any results with the PS6010XV or similar, with and without MEM.

  24. Hi Joseph,

    What are you expecting to get?

    Your solution seems to be performing as I'd expect. 10Gb doesn't really make much difference to IOPS, as the controllers ultimately have a finite capability.

    I'd recommend the usuals - make sure flow control etc. is enabled - and don't hold off deploying MEM; you won't be disappointed!

  25. I have compared it to other SANs I have installed at client sites. The IOPS and the latency are fine, but the throughput is where I would expect more.

    E.g. an IBM 3512 split into 6-HDD and 5-HDD RAID 5 sets over 8Gb Fibre - I was getting around 600MB for the same Iometer test. The 50% read/write sequential increase wasn't as dramatic, around 400Mb compared to 300Mb. I have similar results with EVA 4100 and 4400 arrays.

    I know that for the EVA's RR implementation the best practice was to change the RR IOPS setting:
    esxcli nmp psp setconfig --device --config "policy=iops;iops=1". There may be a similar setting for the 6010; I have seen unofficial mention of changing the IOPS setting to 9, but nothing official yet.

  26. Can't say I've ever messed with RR IOPS - in your situation I would probably do some comparative performance testing.

    I still think there's significant mileage in deploying MEM. I'm confident it will make your results more satisfactory.

  27. Hi Graham,

    how did you generate graphs from iometer output?

    Regards,
    Nav

    Replies
    1. That would be Microsoft Excel graphs ;)

    2. Hi Graham, thanks for the response. Do I need to configure the cycling options under Test Setup? Under the default config, the results.csv file only generates one line of test results when running at a frequency of 60 sec.
      http://sdrv.ms/1b7WcWs

      Regards,
      Nav

    3. Perform 3 tests and take the average.
