QoS for vMotion and Multimedia
How do you mix vMotion and multimedia on the same VM hardware?
vMotion is a capability within VMware that "allows you to move an entire running virtual machine from one physical server to another, without downtime." At Netcraftsmen, we are having a discussion on how to handle vMotion on a blade server that supports multimedia (such as a virtualized MCU). How should QoS be configured to keep the vMotion traffic from impacting the multimedia traffic and vice versa, particularly where converged network adapters are also servicing other VMs on the same blade chassis?
Let's say that one of the VMs is a virtualized MCU (Multipoint Control Unit) and that the other VMs on the same hardware run non-multimedia applications. vMotion may be used on the host to move various VMs onto and off of the host platform. How do we keep the vMotion traffic from impacting the multimedia traffic and vice versa?
Both vMotion and the multimedia application are important. How should the priorities be set? Should policing or shaping be used on either type of traffic?
According to the vSphere 5.1 documentation, vMotion should get its own Gigabit Ethernet interface. With converged network adapters, however, this isn't possible. A reasonable substitute is to put vMotion in its own VLAN and configure traffic shaping to limit it to 1Gbps, which is the rate it would get from a dedicated 1Gbps network adapter.
Ideally, a vMotion should finish as quickly as possible, so the best QoS configuration would treat 1Gbps as a guaranteed minimum rather than a hard cap and allow vMotion to use any available bandwidth above 1Gbps. A brief description of how vMotion works (vSphere 4.1) is in Kyle Gleed's blog, "vMotion: what's going on under the covers", and in a complementary blog, "Under the covers with storage motion".
According to a blog on VMware Communication Ports, vMotion data copies are performed using TCP, so the data-copying operation can at least be expected to follow basic TCP flow control. This means that when vMotion throughput exceeds the bandwidth provided, packet loss will cause TCP to slow down. This is preferable to creating congestion and contention with the multimedia operations that are occurring on the same hardware platform. Because TCP is used, shaping can be used effectively to make maximal use of the available bandwidth, especially when other VMs are sending and receiving bursts of traffic.
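To make the shaping idea concrete, here is a minimal sketch of a token-bucket shaper, which delays packets that exceed the configured rate instead of dropping them. It is an illustration only, not how any particular switch or vSphere component implements shaping; the function name `shape`, the 18,000-byte burst allowance, and the ten-frame burst are all assumptions chosen for the example.

```python
def shape(arrivals, rate_bps, burst_bytes):
    """Return the departure time (seconds) of each packet through a token-bucket shaper.

    arrivals: list of (arrival_time_s, size_bytes), sorted by arrival time.
    Packets are delayed, never dropped, and leave in FIFO order.
    """
    tokens = float(burst_bytes)   # bucket starts full
    last = 0.0                    # time of the most recent departure
    departures = []
    for t, size in arrivals:
        if size > burst_bytes:
            raise ValueError("packet larger than the burst allowance")
        start = max(t, last)      # cannot leave before the previous packet
        # Refill tokens for the time elapsed since the last departure.
        tokens = min(burst_bytes, tokens + (start - last) * rate_bps / 8)
        if tokens < size:
            # Wait just long enough to earn the missing tokens.
            start += (size - tokens) * 8 / rate_bps
            tokens = size
        tokens -= size
        last = start
        departures.append(start)
    return departures


# Hypothetical burst: ten 9,000-byte frames arriving at once, shaped to 1 Gbps
# with an 18,000-byte (two-frame) burst allowance.
burst = [(0.0, 9000)] * 10
times = shape(burst, rate_bps=1e9, burst_bytes=18000)
print([round(t * 1e6) for t in times])  # departure times in microseconds
```

The first two frames leave immediately and the rest are spaced at the shaped rate, so nothing is lost and TCP sees added delay rather than drops.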
The vMotion documentation also recommends the use of jumbo frames (roughly 9,000 bytes per packet). At 1Gbps, each such frame takes about 0.072ms to transmit. For a single packet, this won't have much impact on multimedia traffic. However, a long burst of packets could fill the buffers in a QoS queue (default of 64 buffers per queue in Cisco devices), so it is best to keep the multimedia traffic separate from the vMotion traffic. If they are in the same queue, vMotion may consume enough of the buffers that it causes problems for the multimedia traffic.
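The per-packet delay quoted above is simply frame size divided by link speed. A small sketch of that arithmetic, assuming a 9,000-byte frame and ignoring Ethernet preamble and inter-frame gap (the function name is made up for this example):

```python
def serialization_delay_ms(frame_bytes: int, link_bps: float) -> float:
    """Time to clock one frame onto the wire, in milliseconds."""
    return frame_bytes * 8 / link_bps * 1000


jumbo = serialization_delay_ms(9000, 1e9)     # 9,000-byte jumbo frame at 1 Gbps
standard = serialization_delay_ms(1500, 1e9)  # 1,500-byte frame for comparison
print(f"jumbo: {jumbo:.3f} ms, standard: {standard:.3f} ms")
# prints "jumbo: 0.072 ms, standard: 0.012 ms"
```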
The number of buffers in some of the queues may need to be increased for optimum throughput. I'm thinking that the vMotion queue may benefit from 128 or 256 buffers (about 9.2ms and 18.4ms of jumbo frames at 1Gbps, respectively). Of course, all this presumes that the interface is congested and that packets are being queued. Check the interface statistics: if there are no output drops, the queues are not overflowing and there is nothing to tune.
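The buffer figures above can be checked the same way: a full queue of jumbo frames takes queue depth times frame time to drain. A rough sketch, assuming one 9,000-byte frame per buffer and a 1 Gbps link (the function name is hypothetical):

```python
def queue_drain_ms(buffers: int, frame_bytes: int = 9000, link_bps: float = 1e9) -> float:
    """Worst-case time to drain a full queue of equal-size frames, in milliseconds."""
    return buffers * frame_bytes * 8 / link_bps * 1000


for depth in (64, 128, 256):
    print(f"{depth:3d} buffers -> {queue_drain_ms(depth):.1f} ms of queued jumbo frames")
```

This is also the worst-case queuing delay for the last packet in that queue, which is one more reason to keep real-time media out of the vMotion queue.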
Still, a change in VM assignments may cause changes in congestion, so just because you checked once doesn't mean that congestion won't happen in the future. Monitoring the interfaces over time using a network management system will be valuable in identifying congestion.
Real-time multimedia traffic, on the other hand, typically uses UDP for its transport, so there is no flow control. However, the data rate is not bursty, making it easier to provision the necessary bandwidth. Bandwidth is typically allocated for some number of simultaneous multimedia sessions, with call admission control used to make sure that no more than the provisioned number of sessions is admitted.
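Call admission control boils down to a simple bookkeeping rule: admit a new session only if it still fits inside the bandwidth provisioned for multimedia. The sketch below illustrates that rule; the class name, the 10 Mbps allocation, and the 2 Mbps per-session rate are assumptions for the example, not values from any product.

```python
class CallAdmissionControl:
    """Admit sessions only while they fit inside the provisioned bandwidth."""

    def __init__(self, provisioned_kbps: int) -> None:
        self.provisioned_kbps = provisioned_kbps
        self.in_use_kbps = 0

    def admit(self, session_kbps: int) -> bool:
        """Reserve bandwidth for a new session, or refuse it if it would not fit."""
        if self.in_use_kbps + session_kbps > self.provisioned_kbps:
            return False
        self.in_use_kbps += session_kbps
        return True

    def release(self, session_kbps: int) -> None:
        """Return a finished session's bandwidth to the pool."""
        self.in_use_kbps = max(0, self.in_use_kbps - session_kbps)


# Hypothetical numbers: 10 Mbps provisioned, 2 Mbps per video session.
cac = CallAdmissionControl(provisioned_kbps=10_000)
results = [cac.admit(2_000) for _ in range(6)]
print(results)  # five sessions fit, the sixth is refused
```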
Note that streaming multimedia typically uses TCP and will therefore have flow control capabilities. These applications tend to buffer a lot of data and have far fewer problems than real-time multimedia when competing with vMotion for network bandwidth.
My recommendation is to place real-time multimedia flows in a higher-priority queue than vMotion and to limit the bandwidth they consume by using call admission control. I prefer to put the audio stream in the EF queue and the real-time video in the AF41 queue. Streaming video would go into AF31. vMotion could be assigned to AF31, along with streaming video, since both are TCP-based. Alternatively, vMotion could be put into AF21, which is normally used for low-latency data (important business apps). Provision at least 1Gbps of bandwidth for vMotion, with the ability to use any remaining available bandwidth. Other applications would fall into their respective priority classes, with most going into QoS class CS0 (Best Effort).
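For reference, the class names above correspond to standard DSCP codepoints. The sketch below restates the recommendation as a lookup table and derives the numeric values; the table name and traffic labels are made up for this example, and vMotion is shown in AF21 (AF31 is the other option mentioned above).

```python
def af(cls: int, drop: int) -> int:
    """DSCP value for Assured Forwarding class AF<cls><drop> (RFC 2597)."""
    return 8 * cls + 2 * drop


EF = 46   # Expedited Forwarding (RFC 3246)
CS0 = 0   # default / best effort

# Traffic type -> (class name, DSCP value)
MARKING = {
    "real-time audio": ("EF", EF),
    "real-time video": ("AF41", af(4, 1)),
    "streaming video": ("AF31", af(3, 1)),
    "vMotion": ("AF21", af(2, 1)),      # AF31 is the alternative
    "everything else": ("CS0", CS0),
}


def tos_byte(dscp: int) -> int:
    """DSCP occupies the upper six bits of the IP ToS / Traffic Class byte."""
    return dscp << 2


for traffic, (name, dscp) in MARKING.items():
    print(f"{traffic:16} {name:5} DSCP {dscp:2}  ToS 0x{tos_byte(dscp):02X}")
```

These are only the markings; the queue and bandwidth assignments still have to be configured on each device along the path.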
I think the key factor when operating a virtual multimedia server on a hardware platform that also services other VMs is to limit the amount of multimedia traffic to a level that allows vMotion as well as the other applications to function properly. You have to take steps to optimize a hardware platform's operation. Just throwing various VMs together on one platform won't work well, and you'll spend a fair amount of time troubleshooting the application response problems that result from network congestion.