Thursday, 27 September 2012

Running Intel NICs in your ESXi 5.x hosts? Watch out for interface errors causing service disruption!

Some weeks ago I noticed that some of my virtual machines, residing on Dell PowerEdge R810s/R910s running ESXi 5.0 Update 1, were misbehaving on the network. Symptoms were random ICMP ping drops, occasional TCP connection drops, and vMotions that would sometimes fail. The service disruptions got worse the longer the ESXi host had been up.

Investigating further, I found several errors on some Intel NICs. Here's a sample output of ethtool:

~ # ethtool -S vmnic7 | grep err
[...]
     rx_fifo_errors: 2305
[...]
     rx_queue_0_csum_err: 9
~ # ethtool -S vmnic6 | grep err
[...]
     rx_fifo_errors: 9848
[...]
     rx_queue_0_csum_err: 0
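
If you collect this output from several hosts, reading it by eye gets tedious. Here is a minimal sketch of how the counters could be parsed once the text of `ethtool -S` has been saved somewhere. The function names are my own invention, and the sample strings simply reuse the vmnic7 and vmnic6 values shown above, including the elided `[...]` lines.

```python
import re

# Counter lines look like "     rx_fifo_errors: 2305"; headers and "[...]" are skipped.
COUNTER_LINE = re.compile(r"^\s*([\w.-]+):\s*(-?\d+)\s*$")


def parse_ethtool_stats(text):
    """Turn the text printed by `ethtool -S <nic>` into {counter_name: value}."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def nonzero_errors(counters):
    """Keep only counters whose name contains 'err' and whose value is above zero."""
    return {name: value for name, value in counters.items()
            if "err" in name and value > 0}


VMNIC7 = """[...]
     rx_fifo_errors: 2305
[...]
     rx_queue_0_csum_err: 9
"""

VMNIC6 = """[...]
     rx_fifo_errors: 9848
[...]
     rx_queue_0_csum_err: 0
"""

if __name__ == "__main__":
    print("vmnic7", nonzero_errors(parse_ethtool_stats(VMNIC7)))
    print("vmnic6", nonzero_errors(parse_ethtool_stats(VMNIC6)))
```

The name filter mirrors the `grep err` used above, so it catches both `*_errors` and `*_err` counters.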

Counters would sometimes rise dramatically, by several tens of thousands, within just a few minutes of heavy network load (e.g. multiple vMotions over multiple 1 GbE interfaces).

Broadcom interfaces, and other hosts equipped solely with Broadcom interfaces (and therefore not using the igb/e1000 drivers), did not show any issues. Also, the rx_fifo_errors and rx_queue_0_csum_err counts would move from one interface to another after a reboot, making it impossible to isolate potentially bad interfaces or adapters!

On the Cisco switch side, there were no further indications besides some forgivable out-discards. Upgrading to the latest IOS release did not help, nor did a promising VMware e1000 fix (see http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2020668).

Working together with Dell ProSupport gave us the opportunity to test Broadcom NICs in the servers. We replaced four Intel dual-port NICs with four Broadcom dual-port NICs, aaaand... gotcha! Errors gone, production restored.

With the release of VMware ESXi 5.1, the situation is still the same. I have already had the opportunity to test this in my lab and at a customer's site (running IBM servers, ESX 5 and a bunch of Intel 82580/82571-based cards). The problems really seem to be related to either Intel hardware or software.

For the time being, my advice would be to monitor your ESXi hypervisors' NICs more closely when running on Intel (using ethtool), or to opt for Broadcoms.
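
Since the absolute counter values matter less than how fast they climb under load, one way to approach the monitoring is to compare two snapshots of the error counters and look at the rate. The following is only a sketch under assumptions: the helper names, the 100 errors per minute threshold and the second snapshot's values are hypothetical and not taken from any real host. Only the first snapshot reuses the vmnic7 numbers from above.

```python
def error_rates(previous, current, interval_seconds):
    """Per-minute increase of every 'err' counter between two snapshots."""
    rates = {}
    for name, value in current.items():
        if "err" not in name:
            continue
        delta = value - previous.get(name, 0)
        if delta < 0:
            # The counter went backwards, e.g. it was reset by a host reboot,
            # so treat the current value as the increase since the reset.
            delta = value
        rates[name] = delta * 60.0 / interval_seconds
    return rates


def alerts(rates, threshold_per_minute=100.0):
    """Names of counters rising faster than the (arbitrary) threshold."""
    return sorted(name for name, rate in rates.items()
                  if rate > threshold_per_minute)


if __name__ == "__main__":
    # First snapshot: vmnic7 values from the ethtool output above.
    before = {"rx_fifo_errors": 2305, "rx_queue_0_csum_err": 9}
    # Second snapshot: hypothetical values, five minutes later under heavy load.
    after = {"rx_fifo_errors": 32305, "rx_queue_0_csum_err": 9}

    rates = error_rates(before, after, interval_seconds=300)
    print(rates)
    print("alert on:", alerts(rates))
```

Working with rates rather than raw totals also copes with the counters starting from zero again after a reboot, which matters here because the errors tended to show up on a different interface each time.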

Also, if anyone out there has fought this through with VMware, a hardware vendor or Intel, I'd be happy to know the outcome.
