Audio Restoration from a Psychoacoustic Standpoint

This is the first in a series of blog posts that will read more like essays. While I do love posting jovially about what I’m working on and listening to, I like to indulge my more technical and academic side as well. Enjoy!

………………………………………………

Human beings have been able to record and archive sound for well over one hundred years. Around 1856, the French inventor Édouard-Léon Scott de Martinville demonstrated the phonautograph, the first device used to record audio. Its mechanism included a diaphragm sensitive enough to react to strong sound waves, connected to a stylus that pressed against a moving glass cylinder and inscribed the sound waves onto it.

Bearing that in mind, the need has arisen to restore the audio captured by these archaic recording devices, all the way up to more modern devices that have captured sound in less than ideal conditions. The field of audio restoration has two main focuses. From an archival viewpoint, the goal is to present the listener with the most unbiased and accurate reproduction of the original sound that can be obtained, whereas a more modern viewpoint allows the audio engineer to incorporate their own creative decisions into the restoration process, such as creating an ensemble sound that was not originally arranged in microphone placement, or updating the recording so that it is palatable to modern listeners.

Several anomalies can affect audio, through either physical damage or poor recording practices, and they are removed by a chain of signal processes. The proper order of this signal processing chain is de-click, de-crackle, de-buzz, and then de-hiss. It is important to maintain this hierarchy because each process relies on its predecessor being complete (where that process is needed at all). For example, when clicks penetrate the signal, they must be removed first so that the de-crackle process identifies only the crackles, without interference from material that may be interpreted as clicks but is actually the original signal. Similarly, when clicks are presented to a de-hissing process, they may get processed and add unmusical side effects.
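The ordering described above can be sketched as a fixed pipeline. This is only an illustration: the stage function is a hypothetical placeholder that returns the signal unchanged, not a real restoration algorithm, and the names `RESTORATION_CHAIN` and `restore` are made up for this sketch. What it shows is that stages always run in chain order, and that a stage which is not needed is simply skipped.

```python
from typing import Callable, Iterable, List, Tuple

Signal = List[float]


def passthrough(signal: Signal) -> Signal:
    """Hypothetical placeholder: a real stage would repair the signal here."""
    return list(signal)


# Fixed order taken from the text: de-click, de-crackle, de-buzz, de-hiss.
RESTORATION_CHAIN: List[Tuple[str, Callable[[Signal], Signal]]] = [
    ("de-click", passthrough),
    ("de-crackle", passthrough),
    ("de-buzz", passthrough),
    ("de-hiss", passthrough),
]


def restore(signal: Signal, needed: Iterable[str]) -> Tuple[Signal, List[str]]:
    """Run only the needed stages, always in chain order.

    Returns the processed signal and the names of the stages applied.
    """
    wanted = set(needed)
    applied = []
    for name, stage in RESTORATION_CHAIN:
        if name in wanted:
            signal = stage(signal)  # each stage sees its predecessor's output
            applied.append(name)
    return signal, applied
```

Asking for de-hiss and de-click in any order still runs de-click first, which is the point of keeping the chain as an ordered list rather than letting the caller choose the sequence.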

A fair portion of these digital audio restoration processes are based on the psychoacoustic model as well as other phenomena. The goal of each restoration project differs, as do the human hearing and perception models on which they are based. This essay looks at three ways in which psychoacoustic models and principles inform audio restoration: measuring perceived audio quality with the perceptual audio quality measure, which builds on the masked threshold concept; processing the audio signal in separate frequency sub-bands; and restoring the fidelity of a musical recording through perceptual optimality criteria.

First, one needs to look at how we perceive the quality of sound, based on the model of the perceptual audio quality measure. “This method essentially tracks the output of an audio device onto a psychophysical representation, and calculates the quality of a signal based on the internal representation of a predefined reference. The initial perceptual models used only strive to predict sound characteristics such as pitch, loudness, and masked threshold. These aspects in their own isolation cannot predict the overall quality of a signal’s input and output, and form a comparison between the two”.

Since the initial approach did not evaluate the signal as a whole, a new approach needed to be formulated. The solution was to present a signal to subjects, who were then asked whether they could detect small distortions in it. Instead of taking into account each individual aspect of sound perception, the study sought to compare the input and output signals as a whole.

The goal of this experiment was to see which part of the distortion was being masked by the signal and which part was audible. Of course, this varied with each individual’s interpretation of the signal: different parts of the distortion could be masked for different subjects, and each subject could perceive distortion at different frequencies and levels.

Masking experiments based on steady-state signals have concluded that our auditory system performs a spectral analysis that can be modeled by a bank of band-pass filters with a bandwidth of 100 Hz. These filters model the filtering that takes place in the cochlea, which is the first major stage in the human interpretation of auditory signals. There, along the basilar membrane, the signal is transduced before it reaches the neural level. Though these processes are important for our perception and cognition of audio, the experiments highlight that the main phenomenon used for interpreting signal quality is masking.
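In common psychoacoustic models, a bandwidth of roughly 100 Hz applies to the auditory filters at low frequencies, and the filters widen as the center frequency rises. As a rough illustration, the sketch below uses the Zwicker and Terhardt approximation of critical bandwidth. That formula is an assumption brought in here for illustration; it is not taken from the experiments described above, and the function name is hypothetical.

```python
def critical_bandwidth_hz(center_hz: float) -> float:
    """Approximate critical bandwidth in Hz at a given center frequency.

    Assumed model (Zwicker and Terhardt approximation):
        CB = 25 + 75 * (1 + 1.4 * f_kHz**2) ** 0.69
    """
    f_khz = center_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69


if __name__ == "__main__":
    # Print the approximate filter width at a few center frequencies.
    for f in (100, 500, 1000, 4000):
        print(f"{f:>5} Hz center -> about {critical_bandwidth_hz(f):.0f} Hz wide")
```

Under this approximation the bandwidth stays close to 100 Hz at the bottom of the spectrum and grows to several hundred hertz in the upper midrange, which is why a masking model cannot treat all frequency regions alike.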

Knowing and understanding how the human auditory system processes a signal is important when considering how a system should restore audio. Since there is no way to perceive a signal unadulterated by these internal processes, it is imperative to take into account how masking plays a part in our interpretation of an incoming signal.

Secondly, one can examine the use of sub-bands to solve common problems in audio restoration. The most common problems that audio restoration engineers encounter include clicks, crackle, pops, buzzes, hums and hiss [Cedar Audio Ltd]. Masking also plays a part in this technique, but mostly around the critical band: a fixed arrangement of these critical bands is put together to form a sub-band configuration.

As with any restoration technique, the goal is to avoid any extraneous signal processing. The sub-band technique addresses this by giving more “attention” to the most corrupted bands. This technique is particularly favorable for the eradication of clicks.

In this process, the audio is split into octaves, each with its own filter bank, and two sets of data are used: one with actual real-life data and one with hypothetical data serving as a point of comparison. Each filter bank is then calculated using the same equation repeatedly. The experiment concluded that the hypothetical and the real-life data sets had the same frequency response, with the only limitation being the parameter estimator itself. The operation was performed at both the sub-band and the full-band level, and the sub-band technique was found to be quicker at eradicating unwanted or corrupted signal.

In this context, a few factors may affect the computational complexity: how wide each band is, whether a band encompasses parts of the signal that do not need processing, and whether simplified processing is applied in sub-bands that contain little of the corrupted signal. Essentially, this technique is particularly effective for removing clicks, as long as the correct sub-bands of frequencies are being processed.
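To make the sub-band idea concrete, here is a minimal sketch of splitting a signal into octave-wide bands. It is not the ARMA-based method from the cited work: it uses ideal (brick-wall) FFT masks as a stand-in for a proper filter bank, and the function names and the default of four bands are assumptions made for illustration.

```python
import numpy as np


def octave_band_edges(sample_rate, n_bands):
    """Ascending band edges in Hz.

    The top band spans the highest octave below Nyquist, each band below it
    is one octave lower, and the lowest band extends down to 0 Hz so that
    the bands together cover the whole spectrum.
    """
    nyquist = sample_rate / 2.0
    edges = [nyquist / 2 ** k for k in range(n_bands)]  # nyquist, nyquist/2, ...
    edges.append(0.0)
    return edges[::-1]


def split_into_octave_bands(signal, sample_rate, n_bands=4):
    """Return a list of time-domain band signals that sum back to `signal`."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    edges = octave_band_edges(sample_rate, n_bands)
    bands = []
    for i in range(n_bands):
        lo, hi = edges[i], edges[i + 1]
        if i == n_bands - 1:
            mask = (freqs >= lo) & (freqs <= hi)  # top band includes Nyquist
        else:
            mask = (freqs >= lo) & (freqs < hi)
        # Zero every bin outside the band, then return to the time domain.
        bands.append(np.fft.irfft(spectrum * mask, n=len(signal)))
    return bands
```

Because a click is very short, its energy spreads across all bands, while most musical energy sits in the lower ones. In a toy case such as a low sine tone with one added click, the top band contains almost nothing but the click, which hints at why per-band processing can focus effort where the corruption is easiest to isolate.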

Finally, one can examine the restoration of fidelity to a musical recording through perceptual optimality criteria. In this technique, masking also plays a crucial role in the noise reduction process. Since the masking thresholds of the original signal are unknown, they must be estimated in the calculation. Once these masking parameters have been estimated, they are adjusted to accommodate the rest of the filtering operation.

The audible noise spectrum is defined as the difference between the audible spectra of the noisy and the clean signal. Next, a criterion is formulated that attempts to constrain the audible noise spectrum to be less than or equal to zero at all frequencies. Since we are dealing with musical signals, a linear filter has been suggested in which the power spectra are replaced with their respective psychoacoustic representations based on a masking model.
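The essay does not spell out the formula behind the audible noise spectrum, so the sketch below rests on an assumption: a common way to formalise an "audible spectrum" is to take, at each frequency, the larger of the power spectrum and the masking threshold. The function names and the toy numbers in the usage note are illustrative only.

```python
import numpy as np


def audible_spectrum(power, masking_threshold):
    """Per-frequency audible level (assumed definition).

    Components above the masking threshold are heard at their own power;
    anything below it is represented by the threshold itself.
    """
    return np.maximum(np.asarray(power, dtype=float),
                      np.asarray(masking_threshold, dtype=float))


def audible_noise_spectrum(noisy_power, clean_power, masking_threshold):
    """Audible spectrum of the noisy signal minus that of the clean signal."""
    return (audible_spectrum(noisy_power, masking_threshold)
            - audible_spectrum(clean_power, masking_threshold))
```

With a clean power spectrum of `[10, 1, 0.1]`, a noisy one of `[10.5, 1.5, 0.6]` and a flat threshold of `2`, only the first bin gives a non-zero result, because in the other two bins both signals sit below the threshold. The optimality criterion in the text amounts to choosing a filter that pushes this quantity to zero or below in every bin.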

The spectral subtraction is combined with a filter that simulates the way auditory bandwidth increases with frequency. This technique uses a suppression rule evaluated as a function of a known signal-to-noise ratio, which is determined after the signal has been filtered with the psychoacoustic model.
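For reference, a suppression rule is simply a per-frequency gain computed from a signal-to-noise ratio. The sketch below shows plain power spectral subtraction, a standard textbook rule, and not the perceptually modified rule described above; the psychoacoustic filtering step is left out entirely. The function name and the choice of expressing the ratio as noisy power over noise power are assumptions.

```python
import numpy as np


def spectral_subtraction_gain(snr):
    """Gain per frequency bin for power spectral subtraction.

    `snr` is the ratio of noisy-signal power to noise power in each bin
    (a linear ratio, not decibels). The gain is sqrt(max(1 - 1/snr, 0)):
    bins dominated by noise are muted, and bins far above the noise are
    left almost untouched.
    """
    snr = np.maximum(np.asarray(snr, dtype=float), 1e-12)  # avoid divide by zero
    return np.sqrt(np.maximum(1.0 - 1.0 / snr, 0.0))
```

Multiplying the noisy spectrum's magnitude by this gain gives the restored magnitude. The hard switch to zero gain in noise-dominated bins is one reason basic rules of this kind tend to leave isolated spectral peaks behind.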

This system was tested using both high-fidelity 16-bit, 44.1 kHz clean recordings to which white noise had been added to yield signal-to-noise ratios of 0 to 30 dB, and authentic real-world recordings that had been degraded by broadband white noise. The experiments were done solely on stationary noise (noise whose statistical properties do not change over time), although it is known that non-stationary noise may be found in old recordings as well.
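The synthetic half of that test setup, adding white noise to a clean recording at a chosen signal-to-noise ratio, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the noise is Gaussian white noise, and 16-bit quantisation is not modelled.

```python
import numpy as np


def add_white_noise(clean, snr_db, rng=None):
    """Return `clean` plus white Gaussian noise at the requested SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    clean = np.asarray(clean, dtype=float)
    noise = rng.standard_normal(clean.shape)
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(signal_power / scaled_noise_power)
    # equals snr_db.
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Sweeping `snr_db` from 0 to 30 produces test material ranging from noise as loud as the music to noise that is barely noticeable, which is the range the experiments covered.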

This method of noise suppression has proven superior to basic suppression rules, which tend to introduce new musical noise and cannot cope with low signal-to-noise ratios. However, even with this method, some artifacts are still audible at low signal-to-noise ratios. These artifacts are currently being researched, but the best available hypothesis is that they are low-level components of the original signal, possibly reverberation, that are no longer attenuated once the masking threshold information is added.

It has been suggested that this method can be easily incorporated into existing audio restoration techniques. It avoids duplicated effort by using the same method to estimate both the restored signal and the signal that is used to calculate the masking threshold in the formula. Also, since the estimate is contingent on knowing the original signal’s masking threshold, there is room for improvement in the process, as more advanced masking models may lead the way to improved restoration quality.

This approach has effectively produced a method for broadband noise reduction that takes into account the models psychoacoustics has to offer. While it is always difficult to balance noise reduction against degradation of the original signal, these methods pave the way for future improvements in audio restoration.

In conclusion, audio restoration has been able to continually develop and transform thanks to the psychoacoustic model. It has enabled a deeper understanding of how we perceive the quality of sound: rather than processing isolated characteristics of audio perception such as loudness and pitch, the signal is processed as a whole in order to gauge the level of distortion in a complete audio signal. It also led the way to an audio restoration technique that processes sub-bands of audio, similar to the way that we hear, to reduce the amount of signal processing and therefore ensure that as much of the original signal as possible is retained. The psychoacoustic masking model is the foundation for the final technique discussed. Masking thresholds are estimated for the clean and noisy signals, and the two are combined to attenuate the noise while leaving as much of the original signal intact as possible.

Psychoacoustics continues to play a vital role in the development of audio and musical restoration.

Sources: CEDAR Audio, Digital Audio Restoration, Recording History, A Perceptual Audio Quality Measure Based on a Psychoacoustic Sound Representation, ARMA Processes in Sub-Bands with Application to Audio Restoration
