The Physics of Sound
From the standpoint of physics, sound is an acoustic wave that results when a vibrating source (such as human vocal cords) disturbs an elastic medium (such as air). When a sound wave reaches a listener’s eardrum, the vibrations are transmitted to the inner ear (or cochlea), where mechanical displacements are converted to neural pulses that are sent to the brain and result in the sensation of sound.
- Basic Concepts: propagation, reflection, refraction
- Sine Waves: amplitude, frequency, and wavelength
- Fourier Analysis
- Linear Systems: impulse response and transfer functions
- Logarithmic Scales: dB and octaves
In some ways, sound waves are much like electromagnetic or light waves. In a homogeneous medium, they travel at a constant speed, c. (The speed of sound in air is 343 m/s, or about 1 foot per millisecond, which is a handy number to remember.) A uniform point source radiates spherical waves whose amplitudes fall off inversely with distance. These waves are reflected by smooth surfaces and scattered by rough surfaces. A surface is “smooth” if the size of irregularities is small relative to the wavelength, and “rough” otherwise.
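These two facts — constant propagation speed and inverse-distance amplitude falloff — are easy to put into a few lines of code. A minimal sketch (the constant and function name are ours, for illustration only):

```python
# Sketch: propagation delay (d / c) and 1/r amplitude falloff
# for a spherical wave from a uniform point source.
C_SOUND = 343.0  # speed of sound in air, m/s

def arrival(distance_m, source_amp=1.0):
    """Delay and relative amplitude of a spherical wave at a given distance."""
    delay_s = distance_m / C_SOUND       # about 1 ms per foot (0.343 m)
    amplitude = source_amp / distance_m  # amplitude falls off inversely with distance
    return delay_s, amplitude

delay, amp = arrival(34.3)  # 34.3 m away: 100 ms delay
```

Doubling the distance halves the amplitude, which is why a point source sounds markedly quieter as a listener backs away.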
Sound waves are diffracted around intervening objects. If the object is small relative to the wavelength, it has very little effect — the wave just passes around the object undisturbed. If the object is large, a “sound shadow” appears behind the object and a significant amount of energy is reflected back towards the source. If the object is about the same size as a wavelength, things are complicated, and interesting “diffraction patterns” appear.
Mathematically, sound waves satisfy the wave equation. Because the wave equation is linear with constant coefficients, sine waves are eigenfunctions, and are thus enormously important. In music and acoustics, sine waves are often called pure tones. Physically, if the source is a steady-state sine wave with frequency f, then the response at any other point in space is also a sine wave of frequency f; only the amplitude and phase change as one moves around. This is not true of any other function.*
For spatial sine waves, we specify the wavelength \(\lambda\), which is the distance for one cycle. For temporal sine waves, it is common to specify the frequency f (in Hertz or cycles per second), the angular frequency \(\omega\) (in radians per second), or the period T (in seconds). These quantities are linked to the speed of sound c through the following basic equations:

\[ \lambda = \frac{c}{f} = cT, \qquad \omega = 2\pi f, \qquad T = \frac{1}{f}. \]
It is useful to remember that a 1-kHz tone has a period of 1 ms and a wavelength of about 1 foot.
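These relations are trivial to compute; a small sketch (the function name is ours):

```python
# Sketch: the basic sine-wave relations T = 1/f, omega = 2*pi*f, lambda = c/f.
import math

C_SOUND = 343.0  # speed of sound in air, m/s

def wave_params(f_hz):
    """Return (period, angular frequency, wavelength) for a pure tone."""
    return 1.0 / f_hz, 2.0 * math.pi * f_hz, C_SOUND / f_hz

T, omega, lam = wave_params(1000.0)  # 1 kHz: T = 1 ms, lambda ~ 0.343 m
```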
Well, strictly speaking, the eigenfunctions for the wave equation are actually complex exponentials, functions of the form \(e^{st}\). Thus, an exponential input will also produce an exponential forced response. However, the only bounded steady-state solutions occur when s is purely imaginary (\(s = j\omega\)), which leads again to sine waves.
Most natural sounds are not sine waves. In particular, because they have a very narrow-band spectrum and they set up standing-wave patterns in rooms, sine waves are notoriously difficult to localize. In some ways they are the most inappropriate sounds imaginable for 3-D audio. However, other waveforms can be represented as a superposition of sine waves. In particular, a periodic signal x(t) with a fundamental frequency \(f_0\) can be represented as a complex Fourier series

\[ x(t) = \sum_{k=-\infty}^{\infty} X_k \, e^{j 2\pi k f_0 t}, \]
and a finite-energy signal x(t) can be represented as a Fourier integral

\[ x(t) = \int_{-\infty}^{\infty} X(f)\, e^{j 2\pi f t}\, df, \]
where X(f) is called the Fourier transform of x(t). In general, the Fourier transform is complex, having both a magnitude |X| and a phase \(\angle X\). The squared magnitude of X gives the power for a periodic signal and the energy density for a finite-energy signal. This lets us speak about the power or energy of a signal in different frequency bands.
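For sampled signals, the Fourier transform is computed with an FFT. A sketch using NumPy (assumed available) that recovers the magnitude, phase, and dominant frequency of a pure tone:

```python
# Sketch: magnitude and phase spectrum of a sampled pure tone via the FFT.
import numpy as np

fs = 8000                        # sampling rate, Hz
t = np.arange(fs) / fs           # one second of time samples
x = np.sin(2 * np.pi * 440 * t)  # a 440-Hz pure tone

X = np.fft.rfft(x)                       # spectrum of the real signal
freqs = np.fft.rfftfreq(len(x), d=1/fs)  # frequency axis in Hz
magnitude = np.abs(X)                    # |X|
phase = np.angle(X)                      # angle of X

peak_hz = freqs[np.argmax(magnitude)]    # strongest component: 440 Hz
```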
It is common to refer to X as the spectrum of x. Physically, this makes more sense for periodic signals than for aperiodic signals. For aperiodic signals such as speech, the usual practice is to snip out a short time segment by multiplying x(t) by a window function w(t), and to call the Fourier transform of w(t) x(t) the short-term spectrum.* When w(t) x(t) is sampled and an FFT is used to compute the short-term spectrum, the tacit assumption is that this segment is being periodically repeated. One should always keep this in mind when using FFTs for spectral analysis, where a poor choice of window function can produce results that are quite different from what is desired.
A number of different window functions have been proposed, and one frequently encounters rectangular, triangular, Gaussian, Hamming, Hanning and Kaiser windows, among others. In some applications, the choice of a window function can be critical, but in audio work the most important characteristic is the duration or “width” of the window. Long windows provide good frequency resolution but poor time resolution, while short windows provide good time resolution but poor frequency resolution.
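The tradeoff is easy to see by comparing a rectangular and a Hanning window on the same short segment. A NumPy sketch (the tone frequency is deliberately placed between FFT bins so that leakage occurs):

```python
# Sketch: short-term spectra w(t) x(t) with rectangular vs. Hanning windows.
import numpy as np

fs, n = 8000, 256                 # 256 samples = a 32-ms analysis window
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 1010 * t)  # 1010 Hz falls between FFT bins

rect = np.ones(n)                 # rectangular window
hann = np.hanning(n)              # Hanning window

# Short-term spectra: FFT of the windowed segment.
X_rect = np.abs(np.fft.rfft(rect * x))
X_hann = np.abs(np.fft.rfft(hann * x))
# Far from the 1010-Hz peak, the Hanning window leaks much less energy,
# at the cost of a wider main lobe (poorer frequency resolution).
```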
In recent years, it has been observed that one really wants long windows for the low-frequency components of a signal, and short windows for the high-frequency components. This has led to another kind of time/frequency analysis known as wavelet theory. There is ample evidence that the ear employs wider bandwidths for high frequencies than for low frequencies, and wavelet analysis may be more relevant to human hearing than Fourier analysis. However, Fourier analysis remains at the core of understanding linear systems.
It turns out that the Fourier transform of the product of the window function w(t) and the signal x(t) is the (complex) convolution of the Fourier transforms of w(t) and x(t). Since typical window functions look like the impulse responses of low-pass filters, this suggests that windowing in the time domain results in smoothing in the frequency domain. There are circumstances in which windowing does indeed smooth the spectrum. However, windowing, which is an easy-to-understand, local operation in the time domain, is usually a hard-to-understand, global operation in the frequency domain. It is probably best just to think of windowing as a device for extracting a stationary segment from a signal such that the central portion of the segment is more important than the beginning or the end.
At normal sound pressure levels, air is a linear medium and the principle of superposition applies. This has three important consequences:
- It allows us to find the response to multiple sound sources by considering them one at a time and adding the separate responses.
- It allows us to determine the response to an arbitrary signal from knowing the response to an impulse.
- It allows us to use the Convolution Theorem to interpret behavior in the frequency domain.
In particular, suppose that x(t) is the sound pressure level of a source at one location, and y(t) is the resulting response at another location. Then, if the impulse response h(t) of the acoustic channel is known, we can find y(t) from the convolution integral

\[ y(t) = \int_{-\infty}^{\infty} h(\tau)\, x(t - \tau)\, d\tau. \]
The Fourier transform of the impulse response h(t) is called the transfer function H(f). If X(f) is the Fourier transform of the input x(t), and if Y(f) is the Fourier transform of the output, then the Convolution Theorem yields the important result
Y(f) = H(f) X(f).
Thus, the spectrum of the received signal is a simple product of the spectrum of the source and the spectrum of the channel.* The channel alters the quality of the sound by increasing (and time shifting) some components of the spectrum of the source, and reducing (and time shifting) others. The magnitude of H(f) reveals the degree of change, while the phase of H(f) provides information about the time shift. To be more specific, if \(\varphi(f)\) is the phase angle of H(f), then it is well known that the group delay introduced by the transfer function is given by

\[ \tau(f) = -\frac{1}{2\pi}\, \frac{d\varphi(f)}{df}. \]
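A numerical sketch of the group-delay idea (NumPy assumed): for a channel that is a pure delay of D samples, the phase of H is linear in frequency, and the derivative formula recovers D at every frequency.

```python
# Sketch: group delay tau = -(1/2pi) d(phase)/df for a pure D-sample delay.
import numpy as np

D, n = 5, 512
h = np.zeros(n)
h[D] = 1.0                      # impulse response: a delta delayed by D samples

H = np.fft.rfft(h)              # transfer function
phase = np.unwrap(np.angle(H))  # phase angle, unwrapped to remove 2*pi jumps
f = np.fft.rfftfreq(n)          # frequency in cycles/sample

tau = -np.gradient(phase, f) / (2 * np.pi)  # group delay, ~D everywhere
```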
The relation Y(f)=H(f) X(f) applies only when X(f) is the Fourier transform of x(t) over the entire infinite time interval. It does not follow that the short-term spectrum of the output is the product of the transfer function and the short-term spectrum of the input. However, this is frequently assumed to be the case to get an approximate answer.
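The Convolution Theorem itself is easy to check numerically. A NumPy sketch with arbitrary random signals (zero-padding the transforms so the full-interval relation holds exactly):

```python
# Sketch: verifying Y(f) = H(f) X(f) for a time-domain convolution.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)   # arbitrary "source" signal
h = rng.standard_normal(16)   # arbitrary "channel" impulse response

y = np.convolve(x, h)         # time-domain convolution

# Zero-pad so every transform has the length of the full convolution.
n = len(y)                    # 64 + 16 - 1
Y = np.fft.fft(y)
HX = np.fft.fft(h, n) * np.fft.fft(x, n)
# Y and HX agree to machine precision.
```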
The magnitude of X(f) measured in decibels is defined by

\[ |X(f)|_{\mathrm{dB}} = 20 \log_{10} |X(f)|. \]

A 6-dB increase corresponds (approximately) to doubling the magnitude, and a 20-dB increase corresponds to a factor of 10 increase. If \(\angle X(f)\) denotes the angle of X(f), it follows from Y(f) = H(f) X(f) that

\[ |Y(f)|_{\mathrm{dB}} = |H(f)|_{\mathrm{dB}} + |X(f)|_{\mathrm{dB}}, \qquad \angle Y(f) = \angle H(f) + \angle X(f). \]
Thus, by measuring magnitudes in dB, we can account for the convolution of x and h by merely adding the dB values and adding the phase values. In addition, graphs that show how the magnitudes and phase angles change with frequency are usually plotted on a logarithmic frequency scale. Then a doubling of frequency (called an octave, from musical terminology) has the same horizontal displacement for all frequencies.
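In code, the additivity of dB values is a one-liner (a sketch; the function name is ours):

```python
# Sketch: magnitudes multiply in linear units but add in dB.
import math

def to_db(mag):
    """Magnitude in decibels: 20 log10 |X|."""
    return 20.0 * math.log10(mag)

x_mag, h_mag = 0.5, 4.0
y_mag = h_mag * x_mag               # multiply magnitudes in linear units...
y_db = to_db(h_mag) + to_db(x_mag)  # ...or simply add their dB values
# to_db(y_mag) equals y_db; doubling a magnitude adds about 6.02 dB
```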
This is not only convenient mathematically, but also fits pretty well with the facts of human hearing. Human hearing is more or less logarithmic, responding to ratios rather than differences. Roughly speaking, 1 dB is about the smallest perceptible change in loudness, no matter what the starting intensity level, and a one-octave frequency change sounds like the same musical interval, no matter what the starting frequency.