
Scalable Video and Heterogeneous Endpoints

In my last two posts (here and here) I wrote about how Scalable Video Coding (SVC) works and why it behaves well when there is network packet loss. In this post I want to discuss how it supports heterogeneous endpoints, some with higher or lower computing resources or higher or lower bandwidth connections to the network.

Remember that an SVC encoder creates multiple streams of information: one carrying the base layer, which contains the full video conference but at a relatively low resolution and frame rate, and others that "enhance" this image to give it additional resolution, quality or frames per second. Each of these streams consumes bandwidth, and each additional stream requires more compute power on the receiving codec to decode the additional information and recreate the video conferencing image.
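To make the layering concrete, here is a rough sketch (in Python) of how you might model an SVC encoder's output. The layer names, resolutions and bit rates are invented for illustration and do not come from any particular codec or product.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str     # "base" or an enhancement layer
    width: int    # resolution this layer brings the image up to
    height: int
    fps: int      # frame rate this layer brings the stream up to
    kbps: int     # additional bandwidth this layer consumes

# Hypothetical three-layer stream: a base layer plus two enhancement layers.
svc_stream = [
    Layer("base",      320,  180, 15, 150),   # low resolution, low frame rate
    Layer("enhance-1", 640,  360, 30, 350),   # adds resolution and frame rate
    Layer("enhance-2", 1280, 720, 30, 800),   # adds further resolution
]

# Bandwidth (and decode work) grows with every layer a receiver takes.
for n in range(1, len(svc_stream) + 1):
    top = svc_stream[n - 1]
    total_kbps = sum(layer.kbps for layer in svc_stream[:n])
    print(f"{n} layer(s): {top.width}x{top.height} @ {top.fps} fps, ~{total_kbps} kbps")
```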

Now posit a video switch in the core of the network that has a signaling connection to the endpoint. This video switch receives a high resolution, high frame rate source and then decides, for each destination, how many of the SVC streams to forward. The receiver can send signaling information back to the switch indicating its capabilities. If the receiver can handle the bandwidth and the CPU requirements for the full quality image, it signals the video switch and all the streams are forwarded. If, however, the receiver knows it has limited bandwidth or limited CPU resources, it can signal the switch to send some subset of the full information. The base layer will always be sent, followed by as many of the available enhancement layers as fit within the constraints of the receiving endpoint.
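Here is a minimal sketch of that forwarding decision. The "decode units" number standing in for receiver CPU cost, and the budget figures themselves, are assumptions for illustration rather than real signaling fields; the point is simply that the switch walks up the layer stack, always keeps the base layer, and stops when the receiver's budgets run out.

```python
def layers_to_forward(layers, max_kbps, max_decode_units):
    """Pick the base layer plus as many enhancement layers as fit the
    receiver's signaled budgets. `layers` is ordered base-first; each
    entry is a (name, kbps, decode_units) tuple."""
    chosen, used_kbps, used_decode = [], 0, 0
    for i, (name, kbps, decode_units) in enumerate(layers):
        fits = (used_kbps + kbps <= max_kbps
                and used_decode + decode_units <= max_decode_units)
        if i == 0 or fits:   # the base layer is always forwarded
            chosen.append(name)
            used_kbps += kbps
            used_decode += decode_units
        else:
            break            # layers build on each other, so stop at the first miss
    return chosen

layers = [("base", 150, 1), ("enhance-1", 350, 2), ("enhance-2", 800, 4)]

print(layers_to_forward(layers, max_kbps=2000, max_decode_units=10))  # all three layers
print(layers_to_forward(layers, max_kbps=600,  max_decode_units=10))  # base + enhance-1
print(layers_to_forward(layers, max_kbps=100,  max_decode_units=1))   # base layer only
```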

Note also that the receiver may limit the image based on how it is currently displaying the video. If the video is shown in a small window in the corner of the screen while the two communicating parties collaborate on a spreadsheet or other document, less resolution is needed. My laptop has a 15" screen running at 1680 x 1050, or about 130 pixels per inch. If I display a 16x9 video image in a window 3.5" wide, it occupies roughly 455 x 256 pixels on my screen, which is less than "Standard" resolution and is available from the base layer. Additional transmitted resolution will be lost to the bit bucket (see my posting on bits per face).
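For the curious, the window arithmetic works out like this (a quick check using the figures from my laptop; your screen will give different numbers):

```python
import math

# Pixel density of a 15" panel running 1680 x 1050.
ppi = math.hypot(1680, 1050) / 15          # ≈ 132, call it about 130

# A 3.5"-wide window showing 16:9 video.
win_w = 3.5 * 130                          # ≈ 455 pixels wide
win_h = win_w * 9 / 16                     # ≈ 256 pixels tall

print(f"{ppi:.0f} ppi; window is about {win_w:.0f} x {win_h:.0f} pixels")
```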

We did make the assumption here that the source of the video is able to create a full quality image. If the source of the video is the old laptop on a low-speed DSL line, then the lower quality image it creates will be seen by all parties receiving that image. The network cannot create the enhancement layers if they are not available from the source.

This ability for the video switch in the core of the network to selectively forward layers of the original image is only useful if there is more than one receiver. Obviously if there is only one receiver, the source can be signaled to reduce the resolution it is sending and save everyone a lot of work. But if there are multiple receivers, it is valuable to be able to provide a high quality image to those who can receive it, and a lower quality image to those who cannot.

Readers who have traditional video conferencing deployments are now muttering, "We do that now with our MCU; what is the big deal?"

The big deal is the difference in CPU power required to solve this mismatch problem. Traditional MCUs in use today tackle this job by transcoding. This is a much more compute-intensive problem than the simple switching job needed for scalable coded video. When we deploy thousands of video endpoints onto desktops, which is starting to happen in some enterprises, the cost, reliability and size of the equipment that supports multipoint will become critical factors.

Lastly, note that we have made another assumption, one that is starting to appear in desktop video conferencing solutions but is not common in traditional room-based systems. We assume in the above discussion that the receiving codec (desktop or appliance) is capable of handling multiple streams in a multipoint environment, and that the user wants to manipulate those windows and arrange them as they please. In today's deployments the MCU makes the decision about the screen format and delivers a single composite image. The purveyors of the multi-stream approach tout it as a feature (of course), but it may not be an advantage in every situation.