A new client called us in to investigate why high-definition (HD) video sometimes wasn't playing smoothly on client workstations. Here's what we learned.
The Case Background
The network was actually pretty simple, comprising multiple remote sites, each with good bandwidth to the main data center. The company ran Citrix virtual desktops on thin clients, as well as a number of regular PCs using Citrix Receiver. Videos regularly displayed at 720p resolution, and users reported problems with videos not displaying properly or stopping entirely. The videos came from a variety of websites, such as YouTube, Vimeo, and NBCLearn. These sites use streaming technology based on TCP for data transport, and any network glitches will cause the video stream to pause as TCP retransmits the lost data.
Citrix has built HD video handling into its virtual desktop products, so there should be no problem on that end. Client location didn't matter: users at every site experienced the same problems. The problems occurred at different times, and we saw no pattern that pointed the troubleshooting toward a specific cause.
The customer asked us to validate that the network wasn't the source of the video problems, and to try to determine what was causing the trouble. One of the first thoughts was that denial-of-service attacks were limiting firewall or NetScaler load balancer performance. This proved not to be the case.
The Network
We started by looking at the network. Identifying a network problem should be easy, because we'd see packet drops, due either to errors or to congestion. Internet access bandwidth wasn't the problem, because a video played directly on a modern computer displayed properly. One network interface in the data path showed a high number of dropped packets. A quick calculation showed that this interface had dropped about 0.01% of its packets. While that may seem like a small number, such a drop rate has a big detrimental effect on TCP performance. The graph below shows the throughput of a 1-Gbps link at different levels of packet loss, as calculated using the Mathis equation. Packet loss of 0.01% on a path with 10ms round-trip latency results in less than 100 Mbps of throughput.
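For readers who want to reproduce this kind of estimate, here is a minimal sketch of the Mathis approximation, throughput <= (MSS / RTT) * (C / sqrt(p)). The 1460-byte MSS and the constant C = 1 are assumptions chosen for illustration and are not taken from the graph, so the exact figures depend on those choices and may not match the graph's values exactly. The function name is hypothetical.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=1.0, link_bps=None):
    """Approximate upper bound on steady-state TCP throughput (bits/s).

    Mathis approximation: rate <= (MSS / RTT) * (C / sqrt(p)).
    mss_bytes and c are assumed values, not measurements from the case.
    """
    if loss_rate <= 0 or rtt_s <= 0:
        raise ValueError("loss_rate and rtt_s must be positive")
    bound = (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))
    # A flow can never exceed the physical link rate.
    if link_bps is not None:
        bound = min(bound, link_bps)
    return bound


# Sweep a few loss rates on an assumed 1-Gbps link with 10 ms RTT.
for loss_pct in (0.001, 0.01, 0.1, 1.0):
    bps = mathis_throughput_bps(1460, 0.010, loss_pct / 100, link_bps=1e9)
    print(f"{loss_pct}% loss -> {bps / 1e6:.1f} Mbps")
```

With these assumed values, 0.01% loss yields a bound on the order of 100 Mbps, roughly a tenth of the link rate, which is the point of the paragraph above: a seemingly tiny loss rate caps a single TCP flow far below line speed.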
We initially thought that we'd found the problem, but after watching the drop counter for a few hours and seeing that it didn't budge, we determined that it wasn't causing the video pixelation.
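The distinction here is between a lifetime drop counter and the drop rate over the interval being observed. A minimal sketch of that check follows; the function name and the sample numbers are hypothetical and are not the counters from this case.

```python
def interval_drop_rate(prev, curr):
    """Drop rate between two samples of an interface's counters.

    prev and curr are (packets_total, drops_total) tuples read at two
    different times. Only the difference between samples says whether
    drops are happening now.
    """
    packets = curr[0] - prev[0]
    drops = curr[1] - prev[1]
    if packets < 0 or drops < 0:
        raise ValueError("counter reset or wrapped between samples")
    return drops / packets if packets else 0.0


# Hypothetical samples taken a few hours apart: traffic kept flowing,
# but the drop counter did not move.
earlier = (1_000_000, 100)
later = (5_000_000, 100)
print(interval_drop_rate(earlier, later))  # drop rate during the interval
print(later[1] / later[0])                 # misleading lifetime average
```

A counter that stays flat while packets keep flowing means the drops happened at some point in the past, so they cannot explain a problem users are seeing today.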
The Load Balancer or Firewall?
The thin clients initially connect to the NetScaler load balancer, which finds an available CPU from a pool of virtual desktop instances and starts the virtual desktop agent (VDA) on that server. The load balancer is configurable in either of two modes, and we found it configured to broker the connection. This means that once the load balancer brokered the connection between the thin client and VDA, it was no longer in the data path. Checking its interfaces showed a light load and no packet loss, so that wasn't the problem.
We also checked the system firewalls, because they were in several data paths between the thin clients and the VDA pool. They were operating in primary/standby mode, so one firewall platform had to handle the full traffic load. Indeed, it showed the full traffic volume, but no packet loss. There was plenty of CPU and interface bandwidth left, so that eliminated this element.
What's Left?
There wasn't much left to examine, so we started looking at the video performance itself. In true network fashion, the performance looked the best that the client had seen in some time. We hadn't changed anything, which was curious. But then we started to notice some degradation in video quality, in particular as we watched a Vimeo video occasionally featuring a lot of screen action.
Using the Citrix monitoring system, we observed the round trip time (RTT) for Citrix's Independent Computing Architecture (ICA), a protocol for passing data between servers and clients. The ICA RTT increased from 20ms to more than 300ms, and then to 1000ms. It went back down just when the video action subsided. Pausing the video also caused the ICA RTT to drop. There wasn't enough latency, competing traffic, or network buffering to create the delay we observed. Video activity correlated with ICA RTT values.
How were the CPUs doing? We used VMware's vCenter server management software to look at the CPU load of the VDA servers. When running videos, CPU utilization jumped to between 75% and 85%. The increase in ICA RTT then started to make sense.
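A rough sanity check shows why network buffering is an unlikely source of a one-second delay. The sketch below computes how much data would have to sit in a queue to add a given delay at a given link rate. The link rates are assumptions for illustration; the article does not state the actual site bandwidths.

```python
def bytes_queued_for_delay(delay_s, link_bps):
    """Bytes that must be waiting in a queue to add delay_s of latency
    on a link draining at link_bps bits per second."""
    return delay_s * link_bps / 8


# Assumed link rates (not from the case): 100 Mbps and 1 Gbps.
for link_bps in (100e6, 1e9):
    queued = bytes_queued_for_delay(1.0, link_bps)
    print(f"{link_bps / 1e6:.0f} Mbps link: {queued / 1e6:.1f} MB queued for 1 s of delay")
```

Under these assumptions, a full second of added delay would require megabytes of sustained backlog on a lightly loaded path, and that much queuing would normally also show up as congestion drops on the interfaces.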
Our theory was that periods of high video activity generate high CPU loads on the thin client and the VDA. When the CPUs can't keep up with the desired video frame rate, the thin client drops frames that the codec needs to play back uninterrupted video. The extent of pixelation depends on the specific frames dropped (see video compression picture types). Dropping I-frames causes more pixelation than dropping P-frames or B-frames. The extent of pixelation correlated with high ICA RTT values.
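To illustrate why the type of dropped frame matters, here is a simplified sketch of frame dependencies within one group of pictures. It assumes a closed group with the pattern IBBPBBPBB, where each P-frame references the previous I- or P-frame and each B-frame references the nearest I- or P-frame on either side. This is a teaching model, not a description of the codec Citrix uses, and the function name is hypothetical.

```python
def damaged_frames(gop, dropped):
    """Return sorted indices of frames that cannot be decoded cleanly.

    gop is a string of frame types in display order, e.g. "IBBPBBPBB".
    dropped is a set of frame indices that were lost.
    """
    anchors = [i for i, t in enumerate(gop) if t in "IP"]
    damaged = set()

    # Anchor frames: an I-frame is damaged only if dropped; a P-frame is
    # also damaged when the anchor it references is missing or damaged.
    prev_anchor = None
    for i in anchors:
        broken_ref = gop[i] == "P" and (prev_anchor is None or prev_anchor in damaged)
        if i in dropped or broken_ref:
            damaged.add(i)
        prev_anchor = i

    # B-frames reference the nearest anchor before and after them.
    for i, t in enumerate(gop):
        if t != "B":
            continue
        refs = [a for a in anchors if a < i][-1:] + [a for a in anchors if a > i][:1]
        if i in dropped or any(r in damaged for r in refs):
            damaged.add(i)

    return sorted(damaged)


gop = "IBBPBBPBB"
for label, index in (("I-frame", 0), ("last P-frame", 6), ("B-frame", 4)):
    print(f"drop {label}: {len(damaged_frames(gop, {index}))} of {len(gop)} frames damaged")
```

In this model, losing the I-frame damages every frame in the group, losing the last P-frame damages five of the nine, and losing a B-frame damages only that frame, which matches the ordering described above.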
Another interesting data point was that the customer had previously discovered high CPU utilization on the VDA servers, and an upgrade of some of the servers had substantially reduced the problem. We also noted that a more capable PC running the Citrix Receiver client had fewer video problems than the less capable all-in-one thin clients. All of the above factors pointed to CPU utilization as the cause of the pixelation. The customer had been thinking that since the CPU wasn't pegged at 100%, the problem lay elsewhere. However, modern operating systems are complex. It was highly likely that the OS was doing things that weren't apparent in the performance graphs, like waiting on I/O.
Join Us at Enterprise Connect
We'll have an Ask the Expert panel at Enterprise Connect 2018 on Wednesday, March 14, at 1:30 p.m. Join us to ask questions like the above. Our panel of experts will strive to answer your most challenging problems, offer ideas on how you can proceed, and share contact information for follow-up. See you there!