Sound Modeling Using Loris
Sound Models
The Reassigned Bandwidth-Enhanced Additive Sound Model is a high-fidelity representation that allows manipulations and transformations to be applied to a great variety of sounds, including noisy and non-harmonic sounds. Sound modeling and rendering algorithms based on this model, and many kinds of manipulations, including sound morphing, are implemented in an Open Source software package called Loris.
Sound modeling goes beyond analysis or transformation of the sample data to construct something not present in the original waveform. In sound modeling, we attempt to extract a complete set of features to compose a sufficient description of all perceptually-relevant characteristics of a sound. We further strive to give structure to those features such that the combined features and structure (the model) form a sufficient description of a family of perceptually similar or related sounds.
Since it asserts a structure to the data that is not evident in the waveform, a sound model can be said to represent more information than the original recording. Constructing a sound model from a single recorded sound is analogous to composing an architectural plan from elevation and interior sketches of a building. New information is added in the modeling process, and the resulting model represents a generalization of the original sound.
On the other hand, some information is also lost in the modeling process. It is generally not possible to recover the original sound samples from the model data. Still, we intend that our models be sufficient in detail and fidelity that we can construct a perceptual equivalent of the original sound based solely on the model. Furthermore, deformations of the model should be sufficient to construct sounds differing from the original only in predictable ways. Examples of such deformations are pitch shifting and time dilation.
Additive Sound Models
The most common kind of sound model is the so-called additive sound model, in which a sound is represented by a collection of sinusoids having amplitudes and frequencies that vary over time. The model is called "additive" because the individual components (sinusoids) are independent of each other, and can be synthesized individually and simply added together.
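To make the idea concrete, here is a minimal sketch of additive synthesis by phase accumulation (illustrative C, not part of Loris): each component is a sinusoidal oscillator with its own time-varying amplitude and frequency envelopes, and the components are simply summed into the output buffer.

#include <math.h>

/* Add one sinusoidal component into an output buffer by phase accumulation.
   amp[] and freq[] give the component's amplitude and frequency (in Hz) at
   each sample. Calling this once per component and summing the results is
   exactly the "additive" part of the model. */
void add_component( double * out, unsigned int nsamps, double srate,
                    const double * amp, const double * freq )
{
    const double twopi = 6.283185307179586;
    double phase = 0.0;
    unsigned int i;
    for ( i = 0; i < nsamps; ++i )
    {
        out[i] += amp[i] * sin( phase );
        phase += twopi * freq[i] / srate;   /* advance the phase by one sample */
    }
}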
The parameters (amplitude and frequency) of the sinusoids in an additive sound model can be estimated from a time-frequency distribution like the spectrogram by following the "ridges" on the spectral surface. Each of these ridges represents a single sinusoidal component. The ridges in the spectrum of the word "open" are emphasized in this plot.
When we construct an additive model, we retain only the parameters of the sinusoids in the model.
Sound Modeling Parameters
You will use an Open Source sound modeling and processing software package called Loris to experiment with additive sound modeling and manipulation in the model domain. Loris is a C++ class library, but much of its functionality is available through a procedural interface written in C, and this is the interface you will use.
All the functions in the procedural interface are declared in the header file loris.h, which you should #include in your program sources.
To arrive at good analysis parameters for a sound that you wish to model, you should be prepared to spend some time experimenting with different analysis parameters, just analyzing and reconstructing your sound. Depending on the sound, and on how many sounds you have modeled before, this may or may not take several hours. Often, it takes only a few good educated guesses to obtain very good analysis results, but occasionally you will encounter a sound for which good analysis parameters are quite elusive.
The parameters of the Loris analyzer are few and orthogonal. In many cases, a good representation can be obtained by configuring only one or two parameters, and only rarely is it necessary to set more than three. Interaction between the various parameters is minimal, so it is often easy to converge quickly on an optimal parameter set for a given sound.
The first step in a Loris analysis is to configure the analyzer. The Loris analyzer can be configured according to two parameters: the minimum instantaneous frequency separation between partials (the frequency resolution), measured in Hz, and the shape of the short-time analysis window, specified by the symmetrical main lobe width in Hz.
Frequency Resolution
The frequency resolution parameter controls the frequency density of partials in the model data. Two partials will, at any instant, differ in frequency by no less than the specified frequency resolution. The frequency resolution should be slightly less than the anticipated difference in frequency between any two adjacent partials.
For quasi-harmonic sounds (sounds having energy concentrated very near integer multiples of a fundamental frequency), the anticipated frequency difference between adjacent partials is equal to the harmonic spacing, or the fundamental frequency, and the frequency resolution is typically set to 70% to 85% of the fundamental frequency.
For non-harmonic sounds, some experimentation may be necessary, and intuition can often be gained by using a spectrogram tool.
Analysis Window Width
The shape of the short-time analysis window governs the time-frequency resolution of the reassigned spectral surface, from which bandwidth-enhanced partials are derived. An analysis window that is short in time, and therefore wide in frequency, yields improved temporal resolution at the expense of frequency resolution. Spectral components that are near in frequency are difficult to resolve, and low-frequency components are poorly represented, having too few periods in each window to yield stable and reliable estimates of frequency and amplitude. A longer analysis window compromises temporal resolution, but yields greater frequency resolution. Spectral components that are near in frequency are more easily resolved, and low-frequency components are more accurately represented, but short-duration events may suffer temporal smearing, and short-duration events that are near in time may not be resolved.
The use of time-frequency reassignment improves the time and frequency resolution of the reassigned bandwidth-enhanced model relative to traditional short-time analysis methods. Specifically, it allows us to use long (narrow in frequency) analysis windows to obtain good frequency resolution, without smearing short-duration events. However, multiple short-duration events occurring within a single analysis window still cannot be resolved. Fortunately, the improved frequency resolution due to time-frequency reassignment also allows us to use short-duration analysis windows to analyze sounds having a high density of transient events, without greatly sacrificing frequency resolution.
The choice of analysis window width depends on the anticipated partial frequency density. The window width is specified as the width of the main lobe of the window's Fourier transform, measured between zeros in the magnitude spectrum. (The Loris analyzer employs a Kaiser analysis window, a parameterized window function that allows independent specification of main lobe width and sidelobe rejection.)
Generally, the window width is set to somewhat less than twice the anticipated minimum instantaneous frequency difference between any two partials, that is, somewhat less than twice the value of the frequency resolution parameter. For quasi-harmonic sounds, it is rarely necessary to use windows wider than 500 Hz, although good results have been obtained using windows as wide as 800 Hz to analyze a fast bongo roll. At the other extreme, for very low-frequency quasi-harmonic sounds, best results are often obtained using windows as narrow as 120 Hz.
In the procedural interface, the frequency resolution and analysis window main lobe width are the arguments to the function analyzer_configure, invoked as

analyzer_configure( resolution, width );
For best results, it is usually necessary that the width of the main lobe of the window function be no more than twice the desired frequency resolution. If the main lobe is too wide, then the analyzer may be unable to resolve nearby frequency components, and the amplitudes and frequencies of nearby components that are resolved may be corrupted by main lobe interference.
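For example, for a quasi-harmonic sound with a known fundamental of 415 Hz (the clarinet analyzed later on this page), a reasonable starting configuration sets the resolution to 80% of the fundamental and the window width equal to the fundamental, satisfying the main lobe constraint above. This is only a sketch of a starting point, not a definitive recipe.

const double fundamental = 415.0;               /* Hz, assumed known */
const double resolution  = 0.8 * fundamental;   /* 70-85% of the fundamental */
const double width       = fundamental;         /* no more than twice the resolution */

analyzer_configure( resolution, width );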
All other parameters of the Loris analyzer are configured automatically from the specification of the frequency resolution and analysis window width parameters, and it is often unnecessary to deviate from their default configuration. All of the analyzer parameters are independently accessible and assignable, however, in the event that fine tuning is necessary.
Frequency Drift
The frequency drift parameter governs the amount by which the frequency of a partial can change between two consecutive data points extracted from the reassigned spectral surface. This parameter is generally set equal to half the frequency resolution, but in some cases, for example, in quasi-harmonic sounds having strong noise content, the frequency of some low-energy partials may tend to occasionally "wander" away from the harmonic frequency, resulting in poor harmonic tracking. In these cases, reducing the frequency drift to, say, one-fifth the fundamental frequency may greatly improve harmonic partial tracking, which is important for manipulations such as sound morphing.
In the procedural interface, the frequency drift parameter can be set using the analyzer_setFreqDrift function.
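For example (an illustrative sketch), for a noisy quasi-harmonic tone with a 415 Hz fundamental, the drift might be reduced to one-fifth of the fundamental after configuring the analyzer:

/* tighten the frequency drift to improve harmonic partial tracking */
analyzer_setFreqDrift( 0.2 * 415.0 );   /* one-fifth of the fundamental, approx. 83 Hz */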
Hop Time
The hop time parameter specifies the time difference between successive short-time analysis window centers used to construct the reassigned spectral surface. Data is generally obtained from each analysis window for all partials active at the time corresponding to the center of that window, so the hop time controls, to some degree, the temporal density of the analysis data (though, thanks to the use of time-frequency reassignment, it controls the temporal resolution of the data to a much lesser degree).
The hop time used by the reassigned bandwidth-enhanced analyzer in Loris is normally derived heuristically from the analysis window width. In some cases, it is possible to increase the hop time by a factor of two, thereby reducing the volume of data, without compromising the quality of the representation. In other cases, it may be desirable to decrease the hop size, though we have never encountered such a situation.
In the procedural interface, the hop time parameter can be set using the analyzer_setHopTime function.
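For example (the particular value below is only illustrative), the hop time is specified in seconds:

/* hop analysis windows every 10 milliseconds */
analyzer_setHopTime( 0.010 );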
Crop Time
For ordinary choices of hop time, there is some redundancy in the spectral analysis data. But not all of the data is of equal reliability. In a short-time segment, analysis data may represent events that are near the tapered ends of the analysis window, rather than near the center of the window. (The reassigned spectral analysis data generated by the Loris analyzer have a time coefficient that is generally different from the temporal center of the analysis window.) Due to the taper of the window, this data is more likely to suffer from numerical and other estimation errors than data representing events near the temporal center of the window.
Fortunately, there is generally considerable overlap of consecutive analysis windows (the hop time is much less than the window duration), so any event that is represented by unreliable data in one analysis segment is likely to be represented by much more reliable data in a nearby segment.
The crop time parameter specifies the temporal distance from the center of the analysis window beyond which spectral data is considered unreliable, and is not retained.
Generally, the crop time is set equal to the hop time, so that only data whose time coefficient is beyond the temporal center of an adjacent analysis window is considered unreliable. We have never found it necessary to choose different values for the hop time and crop time, so if you change one of them, you should probably change both.
In the procedural interface, the crop time parameter can be set using the analyzer_setCropTime function.
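Following that advice, a sketch of changing both together might look like this (the value is illustrative):

/* keep the crop time equal to the hop time (both in seconds) */
analyzer_setHopTime( 0.010 );
analyzer_setCropTime( 0.010 );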
Frequency and Amplitude Floor
For quasi-harmonic sounds, it is appropriate to ignore any spectral energy below the fundamental frequency, so by default, the Loris analyzer does not consider spectral components having frequencies less than the specified frequency resolution when constructing partials.
In some instances, it is convenient to set the minimum instantaneous partial frequency independently of the frequency resolution. This can be accomplished by setting the frequency floor parameter, which is otherwise (by default) set equal to the frequency resolution.
Similarly, it is occasionally useful to raise the amplitude floor parameter from its default value of -90 dB. This parameter represents an amplitude threshold, relative to a full amplitude sinusoid, below which reassigned spectral components are considered insignificant, and are not used to form partials.
In the procedural interface, the frequency floor parameter can be set using the analyzer_setFreqFloor function, and the amplitude floor parameter can be set using the analyzer_setAmpFloor function.
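For example (illustrative values only), the frequency floor could be set independently of the frequency resolution, and the amplitude floor raised from its -90 dB default:

/* retain components down to 50 Hz, but discard anything quieter than -75 dB */
analyzer_setFreqFloor( 50.0 );
analyzer_setAmpFloor( -75.0 );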
Sidelobe Attenuation
The Loris analyzer employs a Kaiser analysis window for short-time spectral analysis. The Kaiser window is parameterized to allow specification of not only the length of the window, the primary determinant of the main lobe width, but also the shape of the window, which determines the height of the sidelobes in the window spectrum.
Considering the window function to be a lowpass filter, lower sidelobes imply better stopband rejection, so there is less corruption of spectral parameter estimates due to leakage from energy in distant parts of the frequency spectrum. However, windows with lower sidelobes need to be longer (in time) to achieve the same main lobe width as windows having higher sidelobes.
Sidelobe attenuation is specified in positive dB. The default sidelobe level in Loris is 90 dB. The sidelobe level and amplitude floor parameters may interact, since higher sidelobes might be misinterpreted as significant spectral components if the amplitude floor is not set high enough to reject them.
In the procedural interface, the sidelobe level parameter can be set using the analyzer_setSidelobeLevel function.
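For example (a sketch; 80 dB is only an illustrative choice), reducing the attenuation slightly from the 90 dB default yields a somewhat shorter window for the same main lobe width:

/* sidelobe attenuation is specified in positive dB */
analyzer_setSidelobeLevel( 80.0 );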
Bandwidth-Enhancement
The partials constructed in the Loris analysis process are not strictly sinusoidal. Bandwidth enhancement is a technique for combining sinusoidal energy and noise energy into a single partial having time-varying frequency, amplitude, and noisiness (or bandwidth) parameters. The bandwidth envelope allows both sinusoidal and noisy parts of sound to be manipulated in an intuitive way using a single type of component. The encoding of noise associated with a bandwidth-enhanced partial is robust under time-dilation and other model-domain transformations, and is independent of other partials in the representation, making the sound model very flexible and manipulable.
The bandwidth enhancement region width is the frequency range over which excess spectral energy is averaged together and distributed as noise energy among nearby partials. The default value is 2000 Hz, and values between 1000 and 4000 Hz typically give the best results.
The method by which Loris associates noise with partials during an analysis works well for many pitched musical instrument tones, and other sounds whose noisiness does not change very rapidly. Other kinds of sounds, like speech, require a different approach. For these sounds, it is sometimes difficult to obtain a good bandwidth-enhanced representation, and sometimes the purely-sinusoidal (without bandwidth enhancement) representation gives better-sounding syntheses with fewer artifacts.
It is probably best to disable bandwidth association to start with, by setting the region width to 0. If your sound is very noisy, or you intend to perform time dilation, you may want to try enabling bandwidth enhancement after you have obtained good values for the other analyzer parameters.
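The sketch below assumes the region width setter is named analyzer_setBwRegionWidth, as it appears in recent versions of loris.h; check your header if the name differs.

/* assumed setter name: analyzer_setBwRegionWidth */
analyzer_setBwRegionWidth( 0 );      /* disable bandwidth association while tuning */
/* ... later, once the other parameters are settled ... */
analyzer_setBwRegionWidth( 2000 );   /* re-enable with the default 2 kHz region width */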
Performing the Analysis
The analyze function in the Loris procedural interface performs an analysis according to the current configuration of analysis parameters. It is an error to call analyze before calling analyzer_configure.

partials = createPartialList();
analyze( buffer, bufsize, srate, partials );
The first two arguments specify the array of samples (doubles) to analyze and the number of samples in the array. The third argument is the sample rate. The final argument is a PartialList that is to store the partials constructed in the analysis. The PartialList is not created by the analyzer; it must be created before calling analyze using the createPartialList function.
The analyzer performs thousands of discrete Fourier transforms, so it may take a few seconds. Be patient.
Loris Analysis Example
It is a simple matter to write a small program that uses the Loris procedural interface to analyze a sound and store the partials, or synthesize a new sound from the partials and create a new samples file.
This C program demonstrates the use of the Loris procedural interface to analyze a clarinet tone having a fundamental frequency of approximately 415 Hz. The partials are exported to a Sound Description Interchange Format (SDIF) data file. The clarinet sound is then reconstructed from the partials, and the synthesized samples are exported to a samples file in the AIFF format.
/*
 * analysis_example.c
 *
 * Use the Loris procedural interface to perform a
 * Reassigned Bandwidth-Enhanced analysis of a
 * clarinet tone, having fundamental frequency of
 * approximately 415 Hz.
 */
#include <stdio.h>
#include <string.h>
#include <loris.h>

int main( void )
{
    const double FUNDAMENTAL = 415.0;        /* G#4 */
    const unsigned long BUFSZ = 44100 * 4;   /* approx. 4 seconds */

    PartialList * partials = NULL;
    double samples[ BUFSZ ];
    double sr;
    unsigned int nsamps;

    /* import the clarinet samples */
    nsamps = importAiff( "clarinet.aiff", samples, BUFSZ, &sr );

    /* configure the Loris analyzer: use frequency resolution equal
       to 80% of the fundamental frequency, main lobe width equal to
       the fundamental frequency, and frequency drift equal to 20%
       of the fundamental frequency */
    analyzer_configure( .8 * FUNDAMENTAL, FUNDAMENTAL );
    analyzer_setFreqDrift( .2 * FUNDAMENTAL );   /* approx. 83 Hz */

    /* analyze and store partials */
    partials = createPartialList();
    analyze( samples, nsamps, sr, partials );

    /* export to SDIF file */
    exportSdif( "clarinet.sdif", partials );

    /* synthesize */
    memset( samples, 0, BUFSZ * sizeof(double) );
    nsamps = synthesize( partials, samples, BUFSZ, sr );

    /* export samples to AIFF file */
    exportAiff( "synth.aiff", samples, nsamps, sr, 16 );

    /* cleanup */
    destroyPartialList( partials );
    return 0;
}
In this analysis, the analyzer is configured to have frequency resolution equal to 80% of the fundamental frequency (a good rule of thumb), and analysis window width equal to the fundamental frequency. The only other analyzer parameter that is configured is the frequency drift, which is set to 20% of the fundamental frequency to prevent partials from having large and rapid variations in their frequency envelopes. The other parameters of the analyzer are configured automatically from the frequency resolution and window width.
By the way, in a Unix-like environment, the same analysis can be performed using the loris_analyze command line utility.
% loris_analyze 332 415 clarinet.aiff -drift 83 \
    -o clarinet.sdif -render synth.aiff
Simple Manipulations
A variety of simple model-domain transformations are implemented in Loris. You can apply these transformations to the partials obtained in an analysis, and then render a new sound from the modified partials.
The dilate function is used to warp the time scale of a sound, that is, to change the duration of parts of the sound. Arguments to dilate are the partials to transform and two sets of time points: the times before the transformation and the times after the transformation.
For example, to stretch the middle part of the cat's meow (which is 1.2 seconds long), you might make the following declarations:
/* time points for dilation */
double itimes[] = { 0, 0.25, 0.75, 1 };
double ttimes[] = { 0, .25, 2.75, 3 };
const int ntimes = 4;
and then invoke dilate this way:

dilate( partials, itimes, ttimes, ntimes );
The final argument is the number of time points in each array (both arrays must, of course, be the same size).
Another fun manipulation is pitch shifting. The shiftPitch function in Loris changes the partial frequencies according to a pitch bend function, having units of cents (1/100 of a semitone). Since the pitch bend may vary over time, it is described by a line segment envelope, like the ones we discussed earlier in the semester. The following statements construct an envelope (called a LinearEnvelope in Loris) that starts at zero, goes up to 300 cents (up three semitones, a minor third) between 0.25 and 1.25 seconds, down to -300 cents between 1.25 and 2.25 seconds, and then back to zero between 2.25 and 2.75 seconds.
env = createLinearEnvelope();
linearEnvelope_insertBreakpoint( env, 0, 0 );
linearEnvelope_insertBreakpoint( env, .25, 0 );
linearEnvelope_insertBreakpoint( env, 1.25, 300 );
linearEnvelope_insertBreakpoint( env, 2.25, -300 );
linearEnvelope_insertBreakpoint( env, 2.75, 0 );
linearEnvelope_insertBreakpoint( env, 3.0, 0 );
This envelope, which shifts the pitch during the dilated part of the cat's meow, is the second argument to shiftPitch (the first is the partials to transform).

shiftPitch( partials, env );
When I am done with it, I can free the memory associated with the LinearEnvelope by calling destroyLinearEnvelope.

destroyLinearEnvelope( env );
The end result sounds rather nice, I think.
There are other transformations available in Loris, including a simple multiplicative frequency scaler, called scaleFrequency.
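As a final sketch, here is one way scaleFrequency might be used to transpose the whole sound up a perfect fifth (frequency ratio 3:2). This assumes that scaleFrequency, like shiftPitch, accepts a LinearEnvelope describing the (possibly time-varying) scale factor; some versions of loris.h take a constant double instead, in which case you would pass 1.5 directly.

/* transpose every partial up a perfect fifth using a constant scale envelope */
LinearEnvelope * scale = createLinearEnvelope();
linearEnvelope_insertBreakpoint( scale, 0, 1.5 );
scaleFrequency( partials, scale );
destroyLinearEnvelope( scale );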