Thursday, 19 April 2012

Dell MD3200i + VMware ESX hanging issues

The Dell MD3200i SAN is a cheap(ish), effective storage device certified for VMware ESX.
I had no issues using it with a single ESX host, but with more than one host attached I saw the ESX servers hang, hosts lose connectivity to VirtualCenter, and strange log entries saying that a host couldn't communicate with the LUN:


Mar 24 19:56:11 virt08 vobd: Mar 24 19:56:11.295: 797333253337us: [esx.problem.vmfs.heartbeat.timedout] 4eb93a45-ea456b83-4c0a-0010189da888 disk03-1.
Mar 24 19:57:36 virt08 vobd: Mar 24 19:57:36.699: 797418651679us: [esx.problem.vmfs.heartbeat.recovered] 4eb93a45-ea456b83-4c0a-0010189da888 disk03-1.
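The two log lines above bracket one outage, so subtracting their microsecond counters gives a rough idea of how long the datastore heartbeat was lost. This is only a sketch: it assumes the `...us:` field is a monotonically increasing counter, and the function name `counter_us` is made up for illustration.

```shell
# The two vobd lines from the log above, stored verbatim.
timedout='Mar 24 19:56:11 virt08 vobd: Mar 24 19:56:11.295: 797333253337us: [esx.problem.vmfs.heartbeat.timedout] 4eb93a45-ea456b83-4c0a-0010189da888 disk03-1.'
recovered='Mar 24 19:57:36 virt08 vobd: Mar 24 19:57:36.699: 797418651679us: [esx.problem.vmfs.heartbeat.recovered] 4eb93a45-ea456b83-4c0a-0010189da888 disk03-1.'

# Extract the "<digits>us:" field from a vobd line (hypothetical helper).
counter_us() {
    printf '%s\n' "$1" | sed -n 's/.* \([0-9][0-9]*\)us: .*/\1/p'
}

start_us=$(counter_us "$timedout")
end_us=$(counter_us "$recovered")

# Gap between "timedout" and "recovered", in microseconds and whole seconds.
gap_us=$(( end_us - start_us ))
gap_s=$(( gap_us / 1000000 ))

echo "heartbeat lost for about ${gap_s}s (${gap_us}us)"
```

For these two lines the gap comes out at about 85 seconds, which matches the wall-clock timestamps (19:56:11 to 19:57:36).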



After much going back and forth with Dell to make sure that I had the network configured according to their latest recommendations (they keep changing them), they took a look at the stats on the network card of the MD3200i:


========================================================================
                          TCP PORT STATISTICS - PORT 0                
========================================================================
  TCP Received Segments          0xf6f28bd
  TCP Tx Segments                0xd773310
  TCP Rx Segments In Error       0x0
  TCP Tx Byte Count              0x1a167dd174
  TCP Rx Byte Count              0x2683bbc3
  TCP reTx Timer Expired Count   0x584
  TCP Rx Dup ACK Count           0xa7890
  TCP RX ACK Count               0xee2915
  TCP Rx Delayed Ack Count       0x5a4256
  TCP Tx Ack Count               0x65098dc
  TCP Rx Seg Out Of Order Count  0x3242c
  TCP Rx Window Probe Count      0x0
  TCP Rx Window Update Count     0x49c



Here they saw high numbers of ACK issues (approximately 10% of all network transactions).
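The counters are printed in hex, which makes them hard to compare by eye. The sketch below converts the ACK-related ones to decimal and expresses each as a share of received ACKs. Dell did not say which counters they combined to reach their figure of roughly 10%, so the choice of `TCP RX ACK Count` as the denominator is my assumption, and the helper name `permille` is made up.

```shell
# Counter values copied from the PORT 0 statistics above.
rx_ack=$(( 0xee2915 ))           # TCP RX ACK Count
rx_dup_ack=$(( 0xa7890 ))        # TCP Rx Dup ACK Count
rx_delayed_ack=$(( 0x5a4256 ))   # TCP Rx Delayed Ack Count
rx_out_of_order=$(( 0x3242c ))   # TCP Rx Seg Out Of Order Count

# Integer share of $1 relative to $2, in tenths of a percent (hypothetical helper).
permille() {
    echo $(( $1 * 1000 / $2 ))
}

echo "Rx ACKs:           $rx_ack"
echo "Rx duplicate ACKs: $rx_dup_ack ($(permille "$rx_dup_ack" "$rx_ack") per 1000 ACKs)"
echo "Rx delayed ACKs:   $rx_delayed_ack ($(permille "$rx_delayed_ack" "$rx_ack") per 1000 ACKs)"
echo "Rx out of order:   $rx_out_of_order ($(permille "$rx_out_of_order" "$rx_ack") per 1000 ACKs)"
```

Running the same conversion before and after a change is a quick way to see whether the duplicate and delayed ACK counters are still climbing.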

This prompted them to ask me to change the IO queue depth setting and to apply the delayed ACK workaround from VMware.

This immediately fixed my issue!

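To change IO queue depth (strictly speaking, this sets the Round Robin path policy's IOPS limit, so that the host switches paths after every I/O)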

esxcli nmp roundrobin setconfig --type "iops" --iops 1 --device <device UID>
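The command above has to be repeated for every LUN, and it uses the ESX/ESXi 4.x syntax. On ESXi 5.x the same setting lives under the `esxcli storage nmp psp roundrobin deviceconfig set` namespace. The sketch below only prints the command for each device so that it can be reviewed before being pasted into a host shell. The function name `rr_iops_cmd` and the `naa.EXAMPLE...` device IDs are placeholders, not real values, and it assumes the devices already use the Round Robin path policy.

```shell
# Print (do not run) the command that sets the Round Robin IOPS limit to 1.
#   $1 = device UID, $2 = ESX major version (4 or 5, default 4)
rr_iops_cmd() {
    device="$1"
    version="${2:-4}"
    case "$version" in
        4) echo "esxcli nmp roundrobin setconfig --type iops --iops 1 --device $device" ;;
        5) echo "esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=1 --device=$device" ;;
        *) echo "unsupported ESX version: $version" >&2; return 1 ;;
    esac
}

# Placeholder device UIDs; substitute the real ones from your host.
for dev in naa.EXAMPLE0001 naa.EXAMPLE0002; do
    rr_iops_cmd "$dev" 4
done
```

Printing first rather than executing is a deliberate choice here: a wrong device UID on a production host is easier to catch in a list of commands than after the fact.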

Delayed ACK Workaround


Disabling Delayed ACK in ESX/ESXi 4.x and ESXi 5.0.x
  1. Log in to the vSphere Client and select the host.
  2. Navigate to the Configuration tab.
  3. Select Storage Adapters.
  4. Select the iSCSI vmhba to be modified.
  5. Click Properties.
  6. Modify the delayed ACK setting, using the option that best matches your site's needs:
    • Modify the delayed ACK setting on a discovery address (recommended):
      1. On a discovery address, select the Dynamic Discovery tab.
      2. Select the Server Address tab.
      3. Click Settings.
      4. Click Advanced.
    • Modify the delayed ACK setting on a specific target:
      1. Select the Static Discovery tab.
      2. Select the target.
      3. Click Settings.
      4. Click Advanced.
    • Modify the delayed ACK setting globally:
      1. Select the General tab.
      2. Click Advanced.
  7. In the Advanced Settings dialog box, scroll down to the delayed ACK setting.
  8. Uncheck Inherit From parent.
  9. Uncheck DelayedAck.
  10. Reboot the host.

This change should be safe to apply to all MD3xxx SANs, so you could just do the global change.

You should now have a fully working system.

I hope this page helps you fix your problems. If it does, please click on an advert to show your appreciation.