One of the first questions I ask during a video conferencing consult is the bandwidth requirement for each video conferencing call. Often users have not figured this one out yet, and so it needs some discussion. Video conferencing quality is directly related to the bandwidth consumed, and the two key factors are the image resolution and the frame rate. Most of the video conferencing discussion today is around High Definition (HD) resolution, and everyone seems to think that is what they need. But personal video conferencing delivers many more bits per face than either Telepresence or room-based video. Let's take a look at how this works.
In this article I am going to discuss face to face meetings. If your video conferencing application is about seeing something else (equipment, prototypes, schematics, artwork, etc.) then you need a different approach. But many of us are using video conferencing for face to face meetings so we can get the additional communication information provided by a visual image of the folks with whom we are meeting.
Most of this non-verbal communication comes from faces. As humans, we are wired to be astute observers of faces, and we derive much information from the faces of our counterparts, both when they are speaking and when they are listening. Body language comes from other parts of the body as well: hands, position of the shoulders and so forth. But these are big items and don't require a lot of video resolution. The small details of the face are the ones that require high resolution for us to perceive.
So let's look at how many bits of visual resolution are dedicated to the faces of the people on the other side of the glass. We have three different models of video conferencing, and each devotes a different share of the screen real estate to faces.
Traditional room-based video conferencing systems put a single camera and codec into a conference room. These systems are often placed at the end of a long boardroom-style conference table, so the camera views the participants from the very end of the table. Users may be facing the camera, but in many situations they are facing across the table and may even be blocked by people sitting beside them. I think this is the worst scenario for getting good video images. For the sake of estimating here, I have assumed there are 4 to 6 people in view in a room-based video environment. By measuring some group photos, I estimate that a face consumes no more than 1% of the video real estate in these images and thus can use only 1% of the available resolution. In many room-based environments it is less than this.
Telepresence systems have carefully-managed environments and do much better. In most Telepresence suites there are two meeting room attendees directly facing each video screen. By my calculation from measuring a few pictures, faces consume about 3% of the visual image. Some Telepresence suites allow for additional rows of people behind the front row, and their faces will be smaller, but let's focus here on the primary meeting attendees who are sitting in the front row. Their faces will certainly get the best resolution that Telepresence can deliver.
Personal video conferencing systems are either an appliance dedicated to a single individual, such as an executive desktop system, or a software-based codec running on a desktop system with a web-cam mounted on the screen. These systems provide a much closer image of a single individual meeting participant. By my calculations the face of the user consumes about 20% of the screen, and in many cases more. We will go with the 20% number for these calculations to be conservative.
So the table below lists these assumptions for my further calculations. The first row shows the number of participants in a call, and the second row shows the percentage of the screen that is dedicated to each face.
So how does this help us determine the resolution we need for our video conferencing calls?
In Table 2 below I calculate the bits per face for each type of video conferencing (Telepresence, room-based and personal) and for each major resolution size (CIF, 4CIF, HD720, HD1080). I use the percentage of the screen per face from Table 1 above, and multiply it by the available bits for each resolution level. So for example, HD720 is 1280 x 720 pixels. For a room-based video environment this means 1280 x 720 = 921,600 bits per screen (strictly speaking these are pixels, which I am loosely calling bits) and 1% of that, or 9,216 bits per face.
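The arithmetic described above can be reproduced in a few lines of Python. This is a minimal sketch, not the spreadsheet behind the original table: the face percentages are the estimates given in this article, the CIF (352 x 288) and 4CIF (704 x 576) frame sizes are the standard values, which the article does not spell out, and the names RESOLUTIONS, FACE_PERCENT, pixels_per_face and build_table are made up for illustration. What the article calls bits per face is computed here as a count of pixels.

```python
# Frame sizes in pixels (width, height). The CIF and 4CIF sizes are the
# standard values; the article itself only spells out HD720.
RESOLUTIONS = {
    "CIF": (352, 288),
    "4CIF": (704, 576),
    "HD720": (1280, 720),
    "HD1080": (1920, 1080),
}

# Share of the screen covered by one face, in percent (the article's estimates).
FACE_PERCENT = {
    "room-based": 1,
    "telepresence": 3,
    "personal": 20,
}


def pixels_per_face(resolution: str, system: str) -> int:
    """Approximate number of pixels covering a single face."""
    width, height = RESOLUTIONS[resolution]
    return round(width * height * FACE_PERCENT[system] / 100)


def build_table() -> dict:
    """Pixels per face for every system type and resolution."""
    return {
        system: {res: pixels_per_face(res, system) for res in RESOLUTIONS}
        for system in FACE_PERCENT
    }


if __name__ == "__main__":
    header = f"{'system':<14}" + "".join(f"{res:>10}" for res in RESOLUTIONS)
    print(header)
    for system, row in build_table().items():
        print(f"{system:<14}" + "".join(f"{row[res]:>10,}" for res in RESOLUTIONS))
```

Keeping the face share as a whole-number percentage and dividing at the end avoids small floating-point surprises, so the room-based HD720 entry comes out as exactly 9,216.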
These results are pretty interesting. If we take Telepresence at HD720 as our standard for a high quality visual interaction, we see that HD720 Telepresence is delivering about 20K bits per face. A room-based system needs to run at HD1080 to get equivalent results, but a personal system can deliver the same quality with CIF-level resolution. Choosing middle-of-the-road bandwidths to support these resolutions, the Telepresence suite is running at 2.5 Mbps network bandwidth (HD720), the room-based system is running at 3.6 Mbps network bandwidth, and the personal system is running at 460 Kbps network bandwidth (a 384K call). Many personal video conferencing codecs can deliver a 4CIF image at 615 Kbps network bandwidth (a 512K call). At 4CIF the personal video system is delivering 81K bits per face, which cannot be matched by any of the other approaches.
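The comparison above amounts to asking which resolution each kind of system needs in order to reach a given number of pixels per face. The following Python sketch works that out under stated assumptions: the target is the round figure of 20K used in this article, the face percentages are the article's estimates, the CIF and 4CIF frame sizes are the standard 352 x 288 and 704 x 576, and the names TARGET_PER_FACE and smallest_resolution are hypothetical.

```python
from typing import Optional

# Resolutions ordered from smallest to largest frame: (name, width, height).
RESOLUTIONS = [
    ("CIF", 352, 288),
    ("4CIF", 704, 576),
    ("HD720", 1280, 720),
    ("HD1080", 1920, 1080),
]

# Share of the screen covered by one face, in percent (the article's estimates).
FACE_PERCENT = {
    "room-based": 1,
    "telepresence": 3,
    "personal": 20,
}

# Assumed quality target: the article's round figure of about 20K per face.
TARGET_PER_FACE = 20_000


def smallest_resolution(system: str, target: int = TARGET_PER_FACE) -> Optional[str]:
    """Return the smallest listed resolution that reaches the target, or None."""
    for name, width, height in RESOLUTIONS:
        # Compare in integers: pixels * percent >= target * 100.
        if width * height * FACE_PERCENT[system] >= target * 100:
            return name
    return None


if __name__ == "__main__":
    for system in FACE_PERCENT:
        print(f"{system}: {smallest_resolution(system)}")
```

With these inputs the sketch selects HD1080 for room-based systems, HD720 for Telepresence and CIF for personal systems. Note that with the 3% face estimate, Telepresence at HD720 works out to 27,648, somewhat above the round 20K figure, so the target is best read as a rough threshold rather than an exact match.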
Now I am not arguing for the demise of room-based or Telepresence video conferencing, but I am arguing that personal video conferencing can be operated at CIF or 4CIF resolution levels. It will deliver very high quality body language from faces and consume substantially less bandwidth than HD video conferencing requires.