The Sound Science of Audio Codecs
What is sound and how do we take the noise that comes from our mouths and turn it into something that can be transported across an IP network?
I have never been happy with the answer "because." No matter what the subject or question, I am not satisfied until I am told the whys, wherefores, and possible exceptions. While I can't claim to fully understand every explanation I'm provided (I still don't completely fathom relativity), I want the opportunity to try. I won't know my limits until they've been stretched.
This year for the International Avaya Users Group's annual conference, Converge2015, one of the organizers asked me to speak about audio codecs. My first reaction was, "Is there anything I can say about codecs that hasn't already been said?" After all, G.711 has been around since 1972. How can anyone with a few years of communications under his or her belt not know about a codec that was invented before cell phones, the World Wide Web, and PCs?
After mulling it over for a few days, it suddenly hit me. Instead of simply running through the different codecs, I should explain why they exist in the first place. In other words, if G.711 has been around since 1972 and it has been doing a pretty good job all these years, why do we also have G.726, G.729, G.722, etc.?
This led me to the root question that all audio codecs share: What is sound and how do we take the noise that comes from our mouths and turn it into something that can be transported across an IP network?
Let's find out.

I'm Picking Up Good Vibrations
Simply put, sound is vibration on our eardrums. These vibrations can be as simple as the hum of a fan or as complicated as a symphony orchestra.
Vibrations have amplitude and frequency. Amplitude is a gauge of pressure change and is measured in decibels (dB). Frequency (oscillations per second) is denoted in Hertz (Hz).
Ultimately, these vibrations produce pressure changes in air molecules, and we measure these pressure changes in decibel sound pressure level (dB SPL). The lowest level of pressure change is known as the threshold of hearing (0 dB SPL). The highest level is called the threshold of pain (120 dB SPL). (Of course, Spinal Tap was able to push that to 121 dB SPL.)
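Because the decibel scale is logarithmic, each step of 20 dB represents a tenfold increase in pressure. A quick sketch of the conversion, using the standard 20 micropascal reference pressure for 0 dB SPL:

```python
import math

P_REF = 20e-6  # reference pressure in pascals: 20 micropascals = 0 dB SPL

def db_spl(pressure_pa: float) -> float:
    """Convert a sound pressure (in pascals) to decibels SPL."""
    return 20.0 * math.log10(pressure_pa / P_REF)

# The threshold of hearing sits at the reference pressure itself,
# and the threshold of pain at roughly one million times that pressure.
print(db_spl(20e-6))  # 0.0 dB SPL
print(db_spl(20.0))   # 120.0 dB SPL
```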
At best, we humans can hear frequencies between 20 and 20,000 Hz. However, our ears are most sensitive to sounds that fall below 4000 Hz. This fact will become very important as I take you from analog sound to its digital representation.

From Analog to Digital
We live in an analog world of frequency and amplitude, but digital technology uses electrical energy to represent sound. To move from one to the other, we require devices known as transducers. The transducer used to convert mechanical pressure to electrical energy is known as a microphone, and the transducer that is used to convert electrical energy to mechanical pressure is called a speaker.
In the midst of all that, we have codecs. A codec (coder-decoder) defines both how analog sound is represented digitally and how that digital data is packaged for transmission. Think of it as the digital language that enables us to transmit sound from one device to another (e.g. from an IP phone to a conference server).
All codecs share these same characteristics:
- DACs/ADCs (Digital to Analog and Analog to Digital converters) are used to turn analog waveforms into ones and zeros. These devices sample the waveforms at regular time intervals. The more samples that are taken, the better the representation of the waveform. This is known as sampling frequency.
- Quantization is the number of bits used to represent each sample. An 8-bit sample gives us 256 different values, and a 16-bit sample gives us 65,536 values. More values equate to a better representation of the sound. Only whole numbers can be used, so a sample value of 6.3455 must be rounded to 6. This rounding distorts the signal, giving us quantization noise.
- Since human ears are most sensitive to quiet sounds, greater emphasis is placed on encoding the "quiet zone" and less on the "loud zone."
- Frame size describes the length of the audio segment that a codec processes. Typical frame sizes are 10, 20, and 30 milliseconds.
- The IP, UDP, and Real-time Transport Protocol (RTP) headers add 40 bytes of overhead to every packet of digitally encoded audio (20 bytes for IP, 8 for UDP, and 12 for RTP).
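Putting these characteristics together, a quick back-of-the-envelope calculation (a sketch, assuming G.711 with a 20 millisecond frame size) shows how sampling rate, quantization, frame size, and header overhead combine into on-the-wire bandwidth:

```python
SAMPLE_RATE = 8000     # samples per second (G.711)
BITS_PER_SAMPLE = 8    # 8-bit quantization
FRAME_MS = 20          # each packet carries 20 ms of audio
HEADER_BYTES = 40      # IP (20) + UDP (8) + RTP (12)

samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000        # 160 samples
payload_bytes = samples_per_frame * BITS_PER_SAMPLE // 8  # 160 bytes of audio
packet_bytes = payload_bytes + HEADER_BYTES               # 200 bytes per packet
packets_per_second = 1000 // FRAME_MS                     # 50 packets per second
wire_bits_per_second = packet_bytes * 8 * packets_per_second

print(wire_bits_per_second)  # 80000: 80 Kbps on the wire for 64 Kbps of raw audio
```

Note how the fixed 40-byte header tax makes small frame sizes expensive: more packets per second means more overhead for the same amount of audio.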
Years ago, two gentlemen by the names of Harry Nyquist and Claude Shannon determined that perfect waveform reconstruction (i.e. no signal loss) happens when a signal is sampled at least twice its highest frequency. This is known as the Nyquist Theorem (Shannon got the short end of the naming stick). For the human ear, a sampling rate of 40 kHz is sufficient to capture our hearing range of 0 to 20 kHz.
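A short sketch of what happens when the theorem is violated: a 5 kHz tone sampled at only 8 kHz produces exactly the same sample values as a 3 kHz tone, so the decoder has no way to tell them apart (the higher frequency "aliases" down into the lower range).

```python
import math

FS = 8000  # sampling rate in Hz; Nyquist limit is FS / 2 = 4000 Hz

def sample_tone(freq_hz, n_samples, fs=FS):
    """Sample a cosine tone of the given frequency at the given rate."""
    return [math.cos(2 * math.pi * freq_hz * n / fs) for n in range(n_samples)]

above_nyquist = sample_tone(5000, 16)  # 5 kHz is above the 4 kHz limit
alias = sample_tone(3000, 16)          # 8000 - 5000 = 3000 Hz

# The two sample sequences are indistinguishable.
print(all(abs(a - b) < 1e-9 for a, b in zip(above_nyquist, alias)))  # True
```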
The Nyquist Theorem tells us that 8 kHz is an adequate sampling frequency to capture the 0 to 4 kHz range that most speech falls into. Anything less than that creates a poor audio experience, and anything beyond 8 kHz is gravy (or as we will see in a moment, wideband audio).

Encoding / Decoding
There are essentially two ways of converting an analog waveform to its digital equivalent. One of the earliest methods was called waveform encoding. This technique attempts to efficiently encode a waveform for transmission and decode it for playback. The goal is that the decoded waveform looks as close as possible to the original. As I said earlier, quantization and its inherent rounding will add noise and distortion, but a properly encoded waveform will create an acceptable user experience.
Pulse Code Modulation (PCM) is a well-known form of waveform encoding. In the world of communications, there are two common forms of PCM: Mu-Law is used in North America and Japan, and A-Law is used just about everywhere else. The difference between them is the logarithmic scale used for sizing the distance between sampling steps.
G.711 is how we typically refer to PCMU (Pulse Code Modulation Mu-Law) and PCMA (Pulse Code Modulation A-Law). Both forms use 8 bits to represent each sample -- 8 bits/sample * 8000 samples/second = 64K bits/second.
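The logarithmic companding can be sketched with the continuous Mu-Law curve (mu = 255). Note how a small change in a quiet signal moves the encoded output far more than the same change in a loud signal, which is exactly the "quiet zone" emphasis described earlier:

```python
import math

MU = 255  # Mu-Law companding constant (North America and Japan)

def mu_law_compress(x: float) -> float:
    """Continuous Mu-Law curve for a normalized sample in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

# A step of 0.01 near silence...
quiet_gain = mu_law_compress(0.02) - mu_law_compress(0.01)
# ...versus the same 0.01 step near full scale.
loud_gain = mu_law_compress(1.00) - mu_law_compress(0.99)

print(quiet_gain > 10 * loud_gain)  # True: quiet sounds get far finer resolution
```

This is the continuous curve only; the actual G.711 standard quantizes it into 8-bit codewords using a segmented approximation.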
The next broad technique for encoding a waveform is known as differential coding. While PCM does a good job of representing an analog waveform, the ultimate size of its packets makes it an inefficient use of network bandwidth.
Instead of trying to make a digital copy of the waveform, differential coding predicts the next sample based on the previous sample. To compress the size of the IP packet, differential coding only stores the differences between the predicted sample and the actual sample. Differential coding is also referred to as predictive coding and is the basis of popular codecs such as G.729 and G.726.
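A toy illustration of the idea (not the actual G.726 or G.729 algorithms, which use far more sophisticated predictors): predict each sample as equal to the previous one, and transmit only the quantized difference.

```python
def dpcm_encode(samples, step=4):
    """Toy differential encoder: send quantized differences, not samples."""
    encoded, prediction = [], 0
    for s in samples:
        diff = round((s - prediction) / step)  # small numbers -> fewer bits
        encoded.append(diff)
        prediction += diff * step              # track what the decoder will see
    return encoded

def dpcm_decode(encoded, step=4):
    """Rebuild an approximation of the waveform from the differences."""
    decoded, prediction = [], 0
    for diff in encoded:
        prediction += diff * step
        decoded.append(prediction)
    return decoded

original = [0, 10, 25, 30, 28, 20, 5]
diffs = dpcm_encode(original)
print(diffs)               # small difference values, cheap to transmit
print(dpcm_decode(diffs))  # close to the original waveform
```

Because adjacent speech samples are highly correlated, the differences stay small and can be encoded in fewer bits than the samples themselves.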
This concept of predicting the waveform allows us to significantly reduce the number of bits transmitted between IP endpoints. For instance, G.729 requires less than half the number of bits as G.711.

Vocoding
Vocoding (voice encoding) is used to more efficiently encode human speech by understanding exactly how speech is created. Vocoders act upon things such as vocal cord vibration, unvoiced sounds, plosive sounds (air pressure behind a closure in the vocal tract and then suddenly released), and many other physiological aspects of human speech. Vocoders create a mathematical model of the human larynx and reproduce sounds by emulating how the human body works.
Vocoders are great when the goal is to encode, transmit, and decode spoken conversations. However, they are nearly worthless when it comes to non-speech sounds. For instance, you can't effectively use a vocoder to transmit music since musical instruments don't produce sound in the same way as the human body.
The same can be said for Dual Tone Multi-Frequency, or DTMF. Touch tones do not sound like people, no matter how high-pitched and squeaky a voice might be.
A common example of a vocoder in communications is G.729. It has been optimized to encode and decode spoken conversation at the expense of sounds that fall outside the realm of words and sentences. That is why you need a protocol such as the one described by RFC 4733/2833 to transmit DTMF and other communications-related tones.
Since G.711 does not utilize vocoder technology, it can be used for DTMF transmission. Additionally, G.711 is an acceptable candidate for fax pass-through.

Caution, Wide Load
While 0 to 4000 Hz is fine for what we have come to call toll quality audio, our ears are capable of hearing more than that. These extra sounds that lie outside the "sweet spot" frequency range can make a good IP telephone call sound amazing.
This is where wideband audio codecs come into play. Traditional narrowband codecs (e.g. G.711 and G.729) focus on the frequency range of 300 Hz to 3.4 kHz, while wideband codecs (e.g. G.722) encode and decode frequencies from as low as 50 Hz up to about 7 kHz (or sometimes even higher). These additional frequencies fill out the sound of a conversation for a more satisfying user experience.
Since the compression techniques used by the newer wideband codecs produce a bit rate very similar to G.711, enterprises have started to use them whenever possible.

Mischief Managed
I hope you stuck with me because this is important stuff to know. If you are like me, you want to know why an 8 kHz sampling rate is so commonly used and why G.711 can be deployed for fax pass-through and G.729 cannot. While this knowledge may not come up at cocktail parties, it may be useful as you decide which codecs will be applied to particular devices, network regions, and use cases.
Now, if only someone can explain Einstein's Theory of Relativity in a way that even my feeble brain can comprehend...
Andrew Prokop writes about all things unified communications on his popular blog, SIP Adventures.