Simple Spatial Audio Systems

The simplest spatial audio systems are limited to localizing in azimuth only. There are three basic classes of such systems:

  • Two-channel systems (stereo)
  • Multichannel systems (surround sound)
  • Binaural recordings

To go beyond the limited capabilities of these approaches, we need to know more about head-related transfer functions (HRTFs).

Two-Channel Systems (Stereo)

In the entertainment industry, stereo was the first commercially successful product involving spatial sound. The basic idea is simple: to place a sound on the left, send its signal to the left loudspeaker; to place it on the right, send its signal to the right loudspeaker.

If the same signal is sent to both speakers, a “phantom source” will appear to originate from a point midway between the two loudspeakers, provided that the speakers are wired “in phase”, the listener is more or less midway between the speakers, and the room is not too acoustically irregular. By “crossfading” the signal from one speaker to the other, one can create the impression of the source moving continuously between the two loudspeaker positions. However, simple crossfading will never create the impression of a source outside of the line segment between the two speakers. As we shall see, that can be done with crosstalk-cancelled stereo.
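As a rough illustration of crossfading, the sketch below splits a mono signal between the two loudspeakers. The text does not say which gain curve to use; the constant-power (sine/cosine) law here is an assumption chosen for illustration, and the function names `crossfade_gains` and `pan_mono` are hypothetical.

```python
import math

def crossfade_gains(position):
    """Return (left_gain, right_gain) for a pan position in [0, 1].

    position = 0 puts the source at the left speaker, 1 at the right
    speaker, and 0.5 yields the phantom source midway between them.
    Assumption: a constant-power law, so left**2 + right**2 == 1.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie between 0 (left) and 1 (right)")
    angle = position * math.pi / 2
    return math.cos(angle), math.sin(angle)

def pan_mono(samples, position):
    """Turn a list of mono samples into a list of (left, right) pairs."""
    left_gain, right_gain = crossfade_gains(position)
    return [(s * left_gain, s * right_gain) for s in samples]

# Sweep a constant signal from the left speaker to the right one.
for step in range(5):
    position = step / 4
    left, right = crossfade_gains(position)
    print(f"position={position:.2f}  left={left:.3f}  right={right:.3f}")
```

Because both gains stay between 0 and 1, the apparent source never leaves the segment between the two speakers, which is the limitation described above.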

In fact, one can also shift the location of the phantom source by exploiting the precedence effect. If the sound on, say, the left is delayed by 10 or 15 ms relative to the sound on the right, the listener will localize the sound on the right side, even if the sound that comes on the left is as loud or somewhat louder. Of course, with too much delay, the listener will eventually become aware of the sound on the left as an annoying echo.
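A minimal sketch of this delay-based placement follows. It assumes the signal is a plain list of samples at a known sample rate; the function name `precedence_pair` and its parameters are hypothetical, and the sketch only builds the two channels without playing them.

```python
def precedence_pair(mono, delay_ms, sample_rate):
    """Return (left, right) channels with the left channel delayed.

    Both channels carry the same signal at the same level; only the
    arrival time differs. With a delay of roughly 10 to 15 ms the text
    predicts that the listener localizes the sound on the right.
    """
    if delay_ms < 0:
        raise ValueError("delay_ms must be non-negative")
    delay = round(delay_ms * sample_rate / 1000)  # delay in samples
    left = [0.0] * delay + list(mono)    # silence first, then the signal
    right = list(mono) + [0.0] * delay   # padded so both lengths match
    return left, right

# At 44.1 kHz, a 12 ms delay corresponds to 529 samples.
left, right = precedence_pair([1.0, 0.5, 0.25], delay_ms=12, sample_rate=44100)
print(len(left), len(right))
```

Raising `delay_ms` well beyond this range is where the delayed channel would start to be heard as a separate echo.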

Multichannel Systems (Surround Sound)

Another obvious way to localize sounds is to have a separate channel for every desired direction, including above and below, if wanted. This is basically what is done with theater systems, such as Dolby Pro Logic Surround Sound. In typically reverberant environments, one can take advantage of the Franssen effect and use small loudspeakers everywhere, except for one large speaker (the “subwoofer”) that provides the nondirectional, low-frequency content.

Although they produce impressive spatial effects, multichannel systems are obviously expensive and inconvenient, and they are unlikely to play a major role in HCI.

Binaural Recordings

It has long been known that it is not necessary to have multiple channels to create convincing 3-D sound — two channels are sufficient. The trick is to recreate the sound pressures at the right and left ear drums that would exist if the listener were actually present.

A conceptually simple approach is to put two microphones in the ear canals of an acoustic manikin (or even just hold two microphones close to your own ears) and to record what they pick up. When the left and right signals are fed to the left and right headphone units, it is as if the listener were present in the original sound field. In particular, if the manikin and the listener have heads with the same size and shape, the same ITD and ILD information will be present; similarly, if the manikin and the listener have pinnae with the same sizes and shapes, the same elevation cues will be present. Recordings made this way are called binaural recordings, and they can produce quite vivid 3-D sound. In particular, it is possible to use binaural recordings in HCI to produce 3-D sounds for such things as standard system messages.

Despite their economy and effectiveness, binaural recordings suffer from several disadvantages:

  • They require the use of headphones (but see crosstalk-cancelled stereo)
  • They are not interactive, but must be prerecorded
  • If the listener moves, so do the sounds
  • Sources that are directly in front usually seem to be much too close*
  • Because pinna shapes differ from person to person, elevation effects are not reliable

Improving on binaural recordings requires an understanding of head-related transfer functions.


* It is not clear why the source in binaural recordings so often seems to be too close when it is located directly ahead. Here are three possible explanations:

  1. When a source is really directly ahead, small head motions will introduce significant ITD and ILD cues. However, in binaural recordings, nothing changes when the listener’s head moves. If you ask where a sound source might be located if there are no ITD or ILD changes when you move your head, the answer is “inside your head”. Thus, the auditory system decides that the source must be very close.
  2. When a source is either directly ahead or directly in back, there are no interaural differences. If a source is directly ahead of us, we expect to be able to see it. In the absence of confirming visual information, the auditory system prefers to locate the source toward the back.
  3. The sense of range is strongly influenced by the ratio of direct-to-reverberant sound. The reverberant sound energy tends to be the same in both ears no matter where the source is located. When the source is in the front, the direct energy is also the same in the two ears, and it is harder to discriminate the direct from the reverberant sound. As a result, more sound energy is attributed to direct sound, the ratio of direct-to-reverberant sound increases, and the source is perceived as being closer.

Each explanation seems plausible, but the question has yet to be resolved.
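To make the quantity in the third explanation concrete, the sketch below computes a direct-to-reverberant ratio in decibels. It assumes the direct and reverberant parts are already available as separate sample lists, which is rarely true of a real recording, and the function name `direct_to_reverberant_db` is hypothetical.

```python
import math

def direct_to_reverberant_db(direct, reverberant):
    """Return 10 * log10(E_direct / E_reverberant) in decibels.

    Energy is taken as the sum of squared samples. A larger value means
    relatively more direct sound, which the third explanation associates
    with a source that is perceived as closer.
    """
    e_direct = sum(s * s for s in direct)
    e_reverb = sum(s * s for s in reverberant)
    if e_direct <= 0 or e_reverb <= 0:
        raise ValueError("both signals must contain nonzero energy")
    return 10 * math.log10(e_direct / e_reverb)

# Attributing more of the same total energy to the direct sound raises the ratio.
print(direct_to_reverberant_db([1.0, 1.0], [0.5, 0.5]))
print(direct_to_reverberant_db([1.0, 1.0], [0.1, 0.1]))
```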
