Systems based on HRTF's are able to produce elevation and range effects as well as azimuth effects. This means that, in principle, they can create the impression of a sound being at any desired 3-D location: right or left, up or down, near or far. In practice, because of person-to-person differences and computational limitations, it is much easier to control azimuth than elevation or range. However, HRTF-based systems are fast becoming the standard for advanced 3-D audio interfaces.
We begin by briefly describing the Convolvotron(tm), which is a well-known and effective spatial audio system. We then examine some of the issues that the Convolvotron raises, such as whether or not one must use headphones for high-quality spatial audio, and what to do about person-to-person differences. We conclude by describing a sequence of progressively more elaborate HRTF models that will be useful for this class. Thus, we address the following topics:
- The Convolvotron
- Headphones versus loudspeakers
- The need for head tracking
- Measured versus modeled HRTF’s
- Models for HRTF’s
The Convolvotron, which was developed for NASA and is manufactured by Crystal River Engineering, provides a conceptually simple way to use HRTF’s for spatial audio. Stripped to its essentials, it consists of two “convolution engines,” each of which can convolve the same audio input stream with a finite segment of a head-related impulse response (HRIR) retrieved from a table of measured values. The outputs of the convolvers go through amplifiers to headphones worn by the listener. If the HRIR’s for the listener are sufficiently close to the HRIR’s used by the convolvers, the sound delivered to the listener’s ears will contain all the proper spatial cues, and the sound image will be properly localized. The location will be determined by the particular azimuth, elevation, and (in principle) range used to index the stored tables.
This basic idea can be elaborated in several ways:
- Multiple sources can be accommodated by replicating the hardware and summing the outputs to each ear.
- Head motion can be accounted for by combining the absolute location of the source with the outputs of a head tracker to select the appropriate HRIR's.
- The tables can be indexed by azimuth and elevation only, with the distance from the source to each ear being used to introduce range/amplitude effects.
- The number of HRIR's stored in the tables can be reduced by using coarse spatial sampling and appropriately interpolating between nearby points.
- Echoes and room reverberation can be added by including a room simulation model.
- The system can be customized for a particular individual by measuring and using that person’s HRIR’s
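The interpolation mentioned above can be sketched in a few lines. This is a minimal sample-wise version; the function name and the plain linear scheme are illustrative assumptions, not a description of any particular system:

```python
import numpy as np

def interp_hrir(hrir_a, hrir_b, frac):
    """Linear interpolation between two stored HRIRs.

    frac in [0, 1] moves from hrir_a to hrir_b. Sample-wise
    interpolation is the simplest scheme; better results come from
    interpolating delay and magnitude separately, but that
    refinement is beyond this sketch.
    """
    return (1.0 - frac) * hrir_a + frac * hrir_b

# A source a quarter of the way between two measured directions:
h = interp_hrir(np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.25)
```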
No matter how the system is extended, the basic concept remains that of creating the proper left-ear and right-ear signals by real-time convolution of the monaural input with tabulated impulse responses.
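The core operation can be sketched as follows. The HRIRs and function name here are illustrative placeholders, not the Convolvotron's actual tables:

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a monaural signal with a left/right HRIR pair.

    This mirrors the basic concept: two convolution engines sharing
    one input stream. A real system would retrieve the HRIRs from a
    table indexed by azimuth and elevation.
    """
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return left, right

# Toy example: a click through two dummy "HRIRs" that differ only
# in delay and gain (crudely mimicking ITD and ILD cues).
mono = np.zeros(64)
mono[0] = 1.0
hrir_l = np.array([0.0, 0.0, 0.9])                  # near ear: earlier, louder
hrir_r = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.5])  # far ear: later, softer
left, right = render_binaural(mono, hrir_l, hrir_r)
```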
Headphones certainly simplify the problem of delivering one sound to one ear and another sound to another ear. However, headphones are not without their problems. For example:
- Many people do not like to wear headphones. Even lightweight cordless headphones are cumbersome. The headphones that are best acoustically can be uncomfortable to wear for long periods of time. They also attenuate external sounds and socially isolate the user.
- Headphones can have notches and peaks in their frequency responses that resemble pinna responses. If uncompensated headphones are used, elevation effects can be severely compromised.
- Sounds heard over headphones often seem to be too close. Indeed, the physical source actually is very close, and the compensation needed to eliminate the acoustic cues to its location is sensitive to headphone position.
Loudspeakers circumvent most of these problems, but it is not obvious how one can use loudspeakers to deliver binaural sound. One solution is a technique called crosstalk-cancelled stereo (or transaural stereo).
The idea is simply expressed in the frequency domain. In the arrangement shown above, signals S1 and S2 drive the loudspeakers. The signal Y1 reaching the left ear is a mixture of S1 and the “crosstalk” from S2. To be more precise, Y1 = H11 S1 + H12 S2, where H11 is the HRTF between the left speaker and the left ear and H12 is the HRTF between the right speaker and the left ear. Similarly, Y2 = H21 S1 + H22 S2. If we were allowed to use headphones, we presumably would know the desired signals Y1 and Y2 at the ears. The problem is to find the proper signals S1 and S2 to create these desired results. Mathematically, this merely requires inverting the equations:
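The matrix equation that belongs here follows directly from the two relations just given:

```latex
\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}
=
\begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix}
\begin{bmatrix} S_1 \\ S_2 \end{bmatrix}
\quad\Longrightarrow\quad
\begin{bmatrix} S_1 \\ S_2 \end{bmatrix}
=
\frac{1}{H_{11}H_{22}-H_{12}H_{21}}
\begin{bmatrix} H_{22} & -H_{12} \\ -H_{21} & H_{11} \end{bmatrix}
\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}.
```

Note the determinant in the denominator: wherever it approaches zero, the required speaker signals blow up.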
In practice, inverting the matrix is not trivial.
- At very low frequencies, all of the transfer functions are identical (why?), and thus the matrix is singular. (Fortunately, in reverberant environments low-frequency information is not very important for localization.)
- An exact solution tends to produce very long impulse responses. This problem becomes more and more severe the farther the desired source direction is from the line between the two loudspeakers.
- The result will depend on where the listener is relative to the speakers. (Proper effects are obtained only near the so-called “sweet spot,” the assumed listener location used when the equations are inverted.)
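A minimal frequency-domain sketch of the inversion, bin by bin, with an ad hoc regularization term to tame the near-singular frequencies (the function and its eps parameter are assumptions for illustration, not a production canceller):

```python
import numpy as np

def crosstalk_cancel(Y1, Y2, H11, H12, H21, H22, eps=1e-3):
    """Solve for speaker signals S1, S2 given desired ear signals.

    All arguments are complex spectra (one value per frequency bin).
    H11, H12, H21, H22 are the speaker-to-ear transfer functions.
    The small eps keeps the inversion bounded near frequencies where
    the matrix is nearly singular (e.g. very low frequencies, where
    all four transfer functions coincide).
    """
    det = H11 * H22 - H12 * H21
    det = det + eps * np.exp(1j * np.angle(det))  # crude regularization
    S1 = (H22 * Y1 - H12 * Y2) / det
    S2 = (-H21 * Y1 + H11 * Y2) / det
    return S1, S2
```

For example, with symmetric transfer functions and a signal wanted at the left ear only, the canceller drives the right speaker in antiphase to cancel the crosstalk at the left ear.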
Done carefully, crosstalk-cancelled stereo can be quite effective, producing elevation as well as azimuth effects. The phantom source can be placed significantly outside the line segment between the two loudspeakers. However, since crosstalk-cancelled stereo still requires binaural signals, we shall confine our remaining observations to headphone systems.
When headphones are used, if the listener moves his or her head and if the signals sent to the ears are not modified, the configuration of sources appears to move also. This is intolerable for virtual reality applications. In addition, some of the spatial effects can be weakened or even destroyed. This seems to be particularly troublesome for sources that are supposed to be directly ahead or directly behind, since the rate of change of binaural cues is greatest in those directions. A typical result is that sources that are supposed to be directly ahead seem to be much too close, even appearing to originate inside the head.
A standard solution is to use a device called a head tracker to measure the location and orientation of the head, and periodically to recalculate the relative position of each source, modifying the HRIR’s accordingly. In addition to the usual concerns for cost, reliability, and accuracy, two other engineering concerns arise:
- Allowable latency. Latency is the time between when a motion is made and when the corrected HRIR is applied. Experience shows that it should definitely be less than 50 ms, or the lag will be perceived.
- Unwanted transients. If one merely switches from one HRIR to another, audible clicks may result. Some kind of "crossfading" between the two states is usually desirable.
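A crossfade of the sort just mentioned might be sketched like this. Linear ramps are assumed for simplicity (equal-power ramps are also common); the function is illustrative, not any particular system's method:

```python
import numpy as np

def crossfade_update(x, hrir_old, hrir_new, ramp_len):
    """Switch from one HRIR to another without an audible click.

    The same input block is convolved with both the old and the new
    HRIR, and the two outputs are mixed with complementary linear
    ramps over ramp_len samples (a few milliseconds is typically
    enough).
    """
    y_old = np.convolve(x, hrir_old)[:len(x)]
    y_new = np.convolve(x, hrir_new)[:len(x)]
    fade = np.clip(np.arange(len(x)) / ramp_len, 0.0, 1.0)
    return (1.0 - fade) * y_old + fade * y_new

out = crossfade_update(np.ones(8), np.array([1.0]), np.array([0.0]), 4)
```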
Because HRTF’s are so complex, many spatial audio systems have depended on using experimentally measured data, such as the KEMAR data that we showed earlier. However, the primary reason for using HRTF’s is to capture elevation as well as azimuth effects, and elevation cues are particularly sensitive to individual differences. Four different approaches have emerged:
- Use a compromise, standard HRTF. This will give rather poor elevation results for some percentage of the population, but it is all that is practical for inexpensive systems. To date, neither the IEEE, the ACM nor the AES has defined a standard HRTF, but it looks like a company such as Microsoft or Intel will create a de facto standard.
- Use one of a set of standard HRTF’s. This requires measuring the HRTF’s for a small number of people who represent distinctly different population modes, and providing a simple way for a user to select the one that fits best. Although this has been proposed, no such set of standard HRTF’s currently exists.
- Use an individualized HRTF. This requires measuring the listener’s HRTF, which is an inconvenient and time-consuming procedure. However, it produces excellent results.
- Use a model HRTF containing parameters that can be adapted to each individual. This is the option that we explore next.
Several approaches to modeling HRTF's have been tried:
- Rational-function or pole/zero models. Here the modeling problem is viewed as one of system identification, which has several classical solutions. Unfortunately, the coefficients are usually such complicated functions of azimuth and elevation that they have to be tabulated, which destroys the usefulness of the model.
- Series expansions. Fourier-series expansions or Karhunen-Loeve expansions (also known as principal component analysis or PCA) let one represent the HRTF as a weighted sum of simpler basis functions. While this is useful for inspecting the data, the run-time complexity of such models severely limits their usefulness.
- Structural models. Here one attempts to craft transfer functions that account for the physical mechanisms — head shadow, shoulder reflections, etc. This approach, which was developed in depth by Genuit, has some unsolved problems but holds considerable promise.
In the remainder of these notes, we examine a sequence of structural models of increasing sophistication:
One of the simplest effective HRTF models is the ITD model shown above. It can easily be implemented as an FIR filter. It moves the source in azimuth by introducing an azimuth-dependent time delay that is different for the two ears, which are assumed to be diagonally opposite across the head. Using the same geometrical argument that was employed to derive the ITD, we find that the time-delay function is given by
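The formula itself is missing here; a plausible reconstruction, consistent with the Woodworth-style ray-tracing argument referred to above, is the standard relation

```latex
\Delta T(\theta) \;=\; \frac{a}{c}\,\bigl(\theta + \sin\theta\bigr),
\qquad -\tfrac{\pi}{2} \le \theta \le \tfrac{\pi}{2},
```

with the azimuth θ in radians (this is offered as a stand-in for the lost equation, not a quotation of it),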
where a is the head radius and c is the speed of sound.
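A sketch of the ITD model using integer-sample delays. The Woodworth delay formula and the default head radius are assumptions for illustration; a real implementation would use fractional-delay FIR filters rather than rounding:

```python
import numpy as np

def itd_model(x, azimuth_deg, fs, a=0.0875, c=343.0):
    """ITD-only spatialization: delay one ear relative to the other.

    a = head radius (m), c = speed of sound (m/s). Positive azimuth
    is taken to mean the source is to the right, so the right ear
    leads and the left ear is delayed. Returns (left, right).
    """
    theta = np.deg2rad(azimuth_deg)
    itd = (a / c) * (theta + np.sin(theta))   # Woodworth-style ITD (s)
    lag = int(round(abs(itd) * fs))           # interaural lag in samples
    delayed = np.concatenate([np.zeros(lag), x])[:len(x)]
    if itd >= 0:
        return delayed, x.copy()
    return x.copy(), delayed

imp = np.zeros(64)
imp[0] = 1.0
left, right = itd_model(imp, 90.0, 44100)     # source hard right
```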
As one would expect, a model as simple as this is rather limited. It produces no sense of externalization and no front/back discrimination. However, it does produce a sound image that moves smoothly from the left ear through the head to the right ear as the azimuth goes from -90° to +90°, with none of the oppressive sense that one gets when all of the sound energy goes to only one ear.
With some wide-band signals, some people get the impression of two sound images, one displaced and one at the center of the head. The reason is that while the ITD cue is telling the brain that the source is displaced, the energy at the two ears is the same, and the ILD cue is telling the brain that the source is in the center. This problem can be rectified by adding head shadow.
As we mentioned earlier, Lord Rayleigh obtained an analytical solution for the ILD for a rigid sphere. While this solution is in the form of an infinite series, it turns out that its magnitude response can be fairly well approximated by the one-pole, one-zero transfer function
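The transfer function itself did not survive here; one published form of this approximation (patterned on Brown and Duda's structural model, and assumed rather than quoted) is

```latex
H(\omega,\theta) \;=\;
\frac{1 + j\,\alpha(\theta)\,\dfrac{\omega}{2\omega_0}}
     {1 + j\,\dfrac{\omega}{2\omega_0}},
\qquad \omega_0 = \frac{c}{a},
\qquad \alpha(\theta) = 1 + \cos\theta .
```

With this choice the high-frequency gain is α(θ): about 2 (a 6-dB boost) at θ = 0° and near zero at θ = 180°, while the DC gain is always unity.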
This transfer function boosts the high frequencies when the azimuth is 0°, and cuts them when the azimuth is 180°, thereby simulating the effects of head shadow. By offsetting the azimuth to the ear positions, we obtain the following simple ILD model:
This model can easily be implemented as an IIR filter. Like the ITD model, the ILD model produces no sense of externalization and no front/back discrimination. However, one does experience a smooth motion of the sound image from the left ear to the right ear as the azimuth parameter is changed.
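A direct-form implementation of such a filter, obtained by applying the bilinear transform to a one-pole, one-zero prototype. The particular choice alpha(theta) = 1 + cos(theta) is an assumption patterned on published structural models, not necessarily the exact function these notes used:

```python
import numpy as np

def head_shadow(x, azimuth_deg, fs, a=0.0875, c=343.0):
    """One-pole, one-zero head-shadow filter (IIR).

    alpha ~ 2 on the near side (high-frequency boost), ~ 0 on the
    far side (high-frequency cut); the DC gain is 1 at all azimuths.
    """
    theta = np.deg2rad(azimuth_deg)
    w0 = c / a                     # characteristic frequency (rad/s)
    alpha = 1.0 + np.cos(theta)
    beta = 2.0 * w0
    k = 2.0 * fs                   # bilinear-transform constant 2/T
    b0 = (beta + alpha * k) / (beta + k)
    b1 = (beta - alpha * k) / (beta + k)
    a1 = (beta - k) / (beta + k)
    y = np.zeros(len(x))
    for n in range(len(x)):        # y[n] = b0 x[n] + b1 x[n-1] - a1 y[n-1]
        y[n] = b0 * x[n] + (b1 * x[n - 1] if n else 0.0) \
                         - (a1 * y[n - 1] if n else 0.0)
    return y
```

The first impulse-response sample equals the high-frequency gain, so it exceeds 1 on the near side and falls well below 1 on the far side.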
Although there is a significant interaural group delay at low frequencies, the group delay becomes negligible at high frequencies. This again leads to a “split image problem,” since the ILD and ITD cues are conflicting. The way to fix this problem is to combine the ITD and the ILD models.
By merely cascading the ITD model and the ILD model, we obtain an approximate but very useful spherical-head model. While there is still no sense of externalization or elevation, it eliminates the “split image” problem and produces a very “tightly focused” phantom image.
Another simple modification of this model is to add a simulated room echo to produce some externalization and get an "out-of-head" sensation. The diagram shown below illustrates a method suggested by Phillip Brown at SJSU. Here the "echo" is the same in each ear, regardless of the position of the source. The gain K_echo should be between zero and one (not too large), and the delay T_echo should be between 10 and 30 ms. This very simple room model is more characteristic of the "reverberant tail" than of the early reflections, and (for obvious reasons, if you think about it) it fails to produce externalization when the azimuth is near 0°. However, it does get the sound out of the head at other azimuths.
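A sketch of this echo stage, with defaults chosen from the ranges given above (the function itself is illustrative):

```python
import numpy as np

def add_echo(left, right, fs, k_echo=0.4, t_echo=0.02):
    """Add a single fixed echo, identical in both ears.

    k_echo: echo gain, between 0 and 1 (not too large);
    t_echo: echo delay in seconds, typically 10-30 ms.
    Because the echo carries zero ITD and zero ILD, it acts as a
    fixed reference that helps externalize lateral sources.
    """
    d = int(round(t_echo * fs))
    out_l = np.concatenate([left, np.zeros(d)])
    out_r = np.concatenate([right, np.zeros(d)])
    out_l[d:] += k_echo * left
    out_r[d:] += k_echo * right
    return out_l, out_r

click = np.zeros(100)
click[0] = 1.0
out_l, out_r = add_echo(click, click.copy(), fs=1000)
```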
Batteau, Watkins and other researchers have suggested modeling the effect of the pinna in terms of one or more “pinna echoes” (see Blauert for a discussion and critique of this approach). A typical model has the multipath structure shown below:
The problem is to determine how the gains K and time delays T vary with azimuth and elevation. In a recent thesis, Brown showed that quite good elevation effects could be produced with as few as six paths, and that the values of the time delays were much more critical than the values of the gains. However, we still do not have simple procedures for estimating parameter values for a particular individual.
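The multipath structure can be sketched as a direct path plus a handful of delayed, scaled copies. The gain and delay values in any particular call are illustrative only; in the model discussed above they are functions of azimuth and elevation, with the delays being the perceptually critical part:

```python
import numpy as np

def pinna_echoes(x, fs, gains, delays_s):
    """Multipath pinna model: direct sound plus N 'pinna echoes'.

    Each path contributes a scaled, delayed copy of the input.
    Integer-sample delays are used for simplicity; sub-sample
    delays matter in practice but are beyond this sketch.
    """
    d_max = int(round(max(delays_s) * fs))
    y = np.concatenate([x, np.zeros(d_max)])   # direct path
    for g, t in zip(gains, delays_s):
        d = int(round(t * fs))
        y[d:d + len(x)] += g * x               # one echo path
    return y

click = np.zeros(10)
click[0] = 1.0
y = pinna_echoes(click, fs=1000, gains=[0.5, 0.25], delays_s=[0.002, 0.005])
```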
By combining the head and pinna models (and adding torso diffraction models, shoulder reflection models, ear-canal resonance models, room models, … ) we can obtain successively better approximations to the actual HRTF. Although more research needs to be done to reduce model development to a routine procedure, this structural approach is physically well grounded, computationally efficient, and provides considerable flexibility for generating spatial audio systems for HCI.