SDIF Standard Frame Types

Organization of This Document

The SDIF standard includes an extensible collection of standard frame and matrix types, listed in this document.

Each standard matrix type exists independently of the standard frame types that must include it; any matrix may appear in a frame of any type. However, for clarity, this document describes each matrix type in the context of the frame type for which it was invented, with a special section at the end for matrix types invented to be a part of any frame.

SDIF Standard Frame Types

The following frame types have been defined as part of the SDIF standard. Each of these frame types has one or more corresponding matrix types. To give a sense of what kind of data is in each frame type, this table also lists the columns of the main matrix type for each frame type. Each frame type is described in detail in its own section below.

Frame Type ID  Frame Type                                     Columns of Main Matrix
1FQ0           Fundamental Frequency Estimates                Fundamental frequency, confidence
1STF           Discrete Short-Term Fourier Transform          Real & imaginary bin values
1PIC           Picked Spectral Peaks                          Freq, amp, phase, confidence
1TRC           Sinusoidal Tracks                              Index, freq, amp, phase
1HRM           Pseudo-harmonic Sinusoidal Tracks              Harmonic partial #, freq, amp, phase
1RES           Resonances / Exponentially Decaying Sinusoids  Freq, amp, decay rate, phase
1TDS           Time Domain Samples                            Channels of sample data

Frame Types to be Standardized

The following sound descriptions should eventually have standard SDIF frame types. We have decided to delay the definition of these types until the base SDIF standard has been accepted by the community. We welcome any ideas or proposals about how to represent this data in SDIF frames.

  • Spectral envelopes (sampled and parametric)
  • Cepstral coefficients
  • LPC coefficients
  • Formants
  • Wavelets
  • Diphones
  • "Note lists"

Conventions Followed By SDIF's Standard Frame and Matrix types

  • Amplitude is always linear, never in dB or any other scale.
  • When a matrix has both frequency and amplitude columns, frequency always comes first.
  • The "main" matrix required by a frame type will have a MatrixTypeID equal to the FrameTypeID.
  • Some frame types consist of a main matrix of data plus a few extra fields in a secondary 1D matrix, e.g., time-domain-sample frames must include the sampling rate as well as the actual sample values. In these cases the naming convention is for the info matrix's MatrixTypeID to begin with the character "I" (for "info"), and have the same 3 final characters as the FrameTypeID.
  • In general, we try to encode information without reference to a particular sampling rate, and even to allow for non-isochronous sampling methods.
  • In general, we try to define the semantics of frames to be as stateless as possible: it should be possible to interpret the contents of a frame without reference to any other frames. When this is not possible, e.g., when the frames contain data for a custom synthesis method that needs to be configured, the second best alternative is to put all "initialization" information in a single frame that allows all the data frames to be interpreted.
  • SDIF frames should describe "what they are" rather than "what they came from."
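
As an illustration only, the info-matrix naming convention above can be sketched in Python. The function name is hypothetical and not part of the SDIF standard; the sketch assumes four-character type IDs like those in the table of standard frame types.

```python
def info_matrix_type_id(frame_type_id):
    """Derive the info matrix ID from a frame type ID, e.g. "1TDS" -> "ITDS".

    Hypothetical helper: replaces the first character with "I" and keeps
    the final three characters, following the naming convention above.
    """
    if len(frame_type_id) != 4:
        raise ValueError("this sketch expects a four-character type ID")
    return "I" + frame_type_id[1:]
```

For example, info_matrix_type_id("1STF") gives "ISTF", the name of the info matrix used by 1STF frames.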

Time-Domain Samples

Time-domain samples are the typical representation for digitally sampled sound, used by common sound file formats such as WAV and AIFF. The goal of SDIF's time-domain samples frame type is to provide a uniform representation and the convenience of having time-domain samples in the same SDIF file or stream as other sound descriptions, not to codify every ingenious scheme for representing audio in the minimum number of bits. Therefore we restrict this type to linearly quantized samples with no compression.

1TDS frames must contain a 1TDS matrix to hold the samples and an ITDS "time domain info" matrix that says how to interpret the samples:

1TDS matrix:

  • Matrix Type: "1TDS"
  • Rows: Sample frames
  • Allowed MatrixDataTypes: float32, float64, int32, int64
  • Columns: Amplitudes in each channel. Linear. All but the first are optional.

ITDS matrix:

  • Matrix type: "ITDS"
  • Rows: Always exactly one row
  • Allowed MatrixDataTypes: float64
  • Columns:
    • The sampling rate. Required.

More columns may be added to the ITDS matrix in the future, including the following:

  • The nominal number of bits of precision of the A/D converter.
  • The nominal noise floor of the converter, in dB
  • The noise floor of the converter as computed/estimated by examining some digitized "open channel" signal.
  • The DC offset of the sample amplitudes (i.e., the average sample amplitude)
  • The location and magnitude of the most positive sample
  • The location and magnitude of the most negative sample

Unlike most other SDIF frame types, a frame of 1TDS data represents an interval of time (equal to the number of rows in the 1TDS matrix divided by the sampling rate) rather than an instant of time. The time tag of a 1TDS frame represents the beginning of this interval.

Most SDIF streams containing 1TDS data will consist of a single large frame at time zero with all of the samples for the stream in a single matrix. The same data could be represented equivalently in a series of shorter frames, for example, a series of frames containing one-second intervals of sample data at times 0, 1, 2, 3, 4, etc., or unequal-sized frames, e.g., 1.5 seconds at time 0, 2 seconds at time 1.5, 0.7 seconds at time 3.5, 1 second at time 4.2, etc. Note that at a 96 kHz sampling rate, the limit of 2^32 rows in a matrix imposes a limit of about 12.4 hours of sound in a single frame.
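
The arithmetic behind the frame duration and the 12.4-hour figure can be checked with a few lines of Python. This is only a sketch; the function names are ours and not part of the standard.

```python
MAX_ROWS = 2 ** 32  # row limit of a single matrix, as stated above


def frame_duration_seconds(num_rows, sampling_rate):
    """Interval of time spanned by a 1TDS frame: rows divided by sampling rate."""
    return num_rows / sampling_rate


def max_frame_hours(sampling_rate):
    """Longest sound, in hours, that fits in one frame at this sampling rate."""
    return frame_duration_seconds(MAX_ROWS, sampling_rate) / 3600.0


print(round(max_frame_hours(96000.0), 1))  # prints 12.4
```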

There is also the possibility of "gaps" in the time axis, for example, one second of sound in a frame at time 0 followed by more sound in a frame at time 10. In these cases, the stream implicitly contains zero-valued samples in any intervals of time not spanned by sample data in frames. So, in this example, there would be one second of sound, followed by 9 seconds of silence, followed by more sound.

There is also the possibility of 1TDS frames that overlap in the time axis, for example, a frame at time zero with 2 seconds of samples, followed by a frame at time 1 with more samples. In these cases, the semantics are that the sample values are added together.
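
The gap and overlap semantics described above can be sketched as follows for a single channel. This is a minimal illustration, not a reference implementation: the function name and the (time tag, samples) representation are hypothetical, and the sketch assumes that each time tag is rounded to the nearest sample position.

```python
def render_1tds_frames(frames, sampling_rate):
    """Mix single-channel 1TDS frames into one list of samples.

    frames: list of (time_tag_seconds, samples) pairs (hypothetical layout).
    Gaps between frames stay at zero; overlapping samples are added together.
    """
    if not frames:
        return []
    # Assumption of this sketch: time tags are rounded to the nearest sample.
    starts = [round(time_tag * sampling_rate) for time_tag, _ in frames]
    length = max(start + len(samples)
                 for start, (_, samples) in zip(starts, frames))
    out = [0.0] * length
    for start, (_, samples) in zip(starts, frames):
        for offset, value in enumerate(samples):
            out[start + offset] += value
    return out
```

With a sampling rate of 4 Hz, a one-second frame at time 0, an overlapping one-second frame at time 0.5, and a later frame at time 2 produce summed samples in the overlap and zeros in the gap.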

Separate Matrix Type for Annotating Multi-Channel Data

Rather than define some fixed interpretation of multi-channel data like "1 is front left, 2 is front right, 3 is rear left, 4 is rear right", we propose to invent an SDIF matrix type specifically for describing multi-channel data. This would allow simple textual labels like those above, but also precise geometric measurements about exact microphone placement, speaker placement, etc. It would also support textual annotations about the content of each channel, e.g., the name of an instrument on a particular channel of a multi-track recording.

This matrix type would be optional in 1TDS frames, or any other frame type with multi-channel data.

Fundamental Frequency Estimates

Not all sounds have a definite fundamental frequency; some have multiple possible fundamental frequencies. Note that we use the term "fundamental frequency" or "f0" rather than "pitch"; this is because pitch is a perceptual phenomenon while fundamental frequency is a signal processing quantity. We might invent a new SDIF frame and matrix type for pitch to represent the result of a true pitch estimator that applied a model based on human perception.

1FQ0 frames consist of a single 1FQ0 matrix:

  • Matrix Type: "1FQ0"
  • Allowed MatrixDataTypes: float32, float64
  • Rows: Candidate fundamental frequencies suggested by the estimator.
  • Columns:
    • Fundamental frequency (Hertz). Required.
    • Confidence (0 = none, 1 = completely sure). Optional, default is 1.

Note that this format accommodates estimators that vote amongst fundamental frequency candidates. Each row of the matrix is an estimated fundamental frequency.
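
As one possible way of consuming such a matrix, the sketch below selects the candidate with the highest confidence, applying the default confidence of 1 when the column is absent. Picking the most confident row is our assumption about how a reader might use the data; the standard does not prescribe it, and the function name is hypothetical.

```python
def best_f0_candidate(rows):
    """Return (frequency, confidence) of the most confident 1FQ0 row.

    rows: list of rows, each [frequency] or [frequency, confidence].
    Returns None for an empty matrix (no candidate at all).
    """
    best = None
    for row in rows:
        frequency = row[0]
        confidence = row[1] if len(row) > 1 else 1.0  # default is 1
        if best is None or confidence > best[1]:
            best = (frequency, confidence)
    return best
```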

Note that this format does not support the notion of "tracking" various fundamental frequency estimates over time. In this respect it is more like the 1PIC frame type than the 1TRC frame type. We are considering adding another frame type for "tracked fundamental frequency estimates" that would include an index for each fundamental frequency.

Discrete Short-Term Fourier Transform/Phase Vocoder

1STF frames represent the data that come out of a discrete short-term time-domain to frequency-domain transform such as an FFT.

Here is a precise mathematical definition of this frame type:

  • Let s(i) be a discrete signal with sampling rate SR Hertz
  • Let w(m) be a window defined with the support [0, M-1], i.e., w(m)=0 for m<0 and m>=M
  • Let N be the size of the transform

We define the input to the transform, x(n), as follows. Note that the windowed signal is 'put' at the beginning of the vector x(n).

Let x(n) =   s(i+n) * w(n)  for  0 <= n <= M-1
    x(n) =   0              for  M <= n <= N-1

(This is slightly redundant, since we define w(m)=0 when m>=M.)

The 1STF matrix data is the Discrete Fourier Transform (DFT) of size N, i.e. the X(k) as follows.

The DFT is a length N vector X, with these elements:

              N-1
       X(k) = sum  x(n) * exp(-j * 2 * pi * k * n/N)
              n=0

       0 <= k <=N-1

The time tag in a 1STF frame is the time of the center of the window, i.e., (i + M/2)/SR, not the beginning.

Notes:

  • This definition corresponds to the output of Matlab's (and UDI's) FFT function.
  • The real and imaginary parts come directly from this formula. Therefore, if you compute a phase as atan2(imaginary, real), it is the phase of the corresponding COSINUSOID (and not the sinusoid we are used to in additive synthesis) at time i/SR.
  • Note that the windowed signal is 'put' at the beginning of the vector x(n) (then zero padding follows) and this is crucial for the phase definition.
  • Because of aliasing and foldover above the Nyquist frequency (and below the negative Nyquist frequency), the output of the DFT can be thought of as a periodic function of frequency over the range -infinity to infinity. The period of this function is the range from the negative Nyquist frequency to the positive Nyquist frequency, in other words, the sampling rate of the input signal.
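
The definition above can be transcribed almost literally into Python. The sketch below is for illustration only: it evaluates the DFT sum directly (cost proportional to N^2) rather than using an FFT, and the function and parameter names are ours. It assumes N >= M.

```python
import cmath
import math


def stf_frame(signal, i, window, n_fft, sampling_rate):
    """Compute the time tag and (real, imaginary) bins of one 1STF frame.

    signal: the samples s, window: the M window values w(0..M-1),
    i: index of the first sample under the window, n_fft: transform size N.
    """
    m = len(window)
    if n_fft < m:
        raise ValueError("this sketch assumes N >= M")
    # The windowed signal is put at the beginning of x; zero padding follows.
    x = [signal[i + n] * window[n] for n in range(m)] + [0.0] * (n_fft - m)
    bins = []
    for k in range(n_fft):
        acc = sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n in range(n_fft))
        bins.append((acc.real, acc.imag))
    # The time tag is the center of the window, not its beginning.
    time_tag = (i + m / 2) / sampling_rate
    return time_tag, bins
```

For a cosine of exactly one cycle over a rectangular window, bin 1 comes out purely real, consistent with the cosine phase convention in the notes above.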

1STF frames consist of an ISTF "info" matrix to record overall information about the transform, plus a 1STF matrix that contains the actual bin data.

STFT info matrix:

  • Matrix type: "ISTF"
  • Allowed MatrixDataTypes: float32, float64
  • Rows: always exactly one row
  • Columns (all required)
    • Period of the DFT (i.e., SR): Hertz
    • Frame size (i.e., size of the windowed signal) M/SR: seconds
    • Size of the transform N (The data matrix does not necessarily represent all N bins)

Each 1STF frame must also contain a 1WIN matrix specifying the window function.

STFT data matrix:

  • Matrix type: "1STF"
  • Allowed MatrixDataTypes: float32, float64, int32, int64
  • Rows: Frequency bins output by the transform
  • Columns (both required)
    • Real part (unitless, as it comes out of the STFT)
    • Imaginary part (unitless, as it comes out of the STFT)

You can convert these complex numbers into polar form to get magnitude and phase.
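
A minimal sketch of that conversion for one row of the 1STF matrix is shown below. The function name is hypothetical; note that math.atan2 returns a phase between -pi and pi, so a consumer wanting the 0 to 2*pi range used by other frame types would need to wrap the result.

```python
import math


def bin_to_polar(real, imag):
    """Convert one 1STF bin to (magnitude, phase in radians)."""
    magnitude = math.hypot(real, imag)
    phase = math.atan2(imag, real)  # cosine phase, between -pi and pi
    return magnitude, phase
```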

Picked Spectral Peaks

Picked spectral peaks represent peaks (local maxima) in a spectrum at a given time. Peak pickers typically fit some kind of curve to 1STF data, providing frequency, amplitude, and phase estimates that are more accurate than the bins themselves.

1PIC frames consist of a single 1PIC matrix:

  • Matrix type: "1PIC"
  • Allowed MatrixDataTypes: float32, float64
  • Rows: Spectral peaks
  • Columns
    • Frequency (Hertz). Required.
    • Amplitude (linear). Optional; default is 1.0.
    • Phase (Radians: from 0 to 2*pi). Optional, no default.
    • Confidence (1.0 = 100%, 0.0 = 0%). Optional, default is 1.0.

The confidence factor might be used to indicate how much of the energy around this peak was from a sinusoid or how well the energy around this peak matches a sinusoid.
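
A reader of 1PIC data has to fill in the optional columns listed above. The following sketch shows one way to do so, assuming that optional columns are only ever omitted from the end of a row; that assumption, the function name, and the use of None for "no default" are ours.

```python
# Defaults for the columns after the required frequency:
# amplitude 1.0, phase has no default (None here), confidence 1.0.
PIC_DEFAULTS = (1.0, None, 1.0)


def expand_1pic_row(row):
    """Return (frequency, amplitude, phase, confidence) for a 1PIC row."""
    if not 1 <= len(row) <= 4:
        raise ValueError("this sketch expects 1 to 4 columns")
    values = list(row)
    # Append the defaults for whichever trailing columns are missing.
    values.extend(PIC_DEFAULTS[len(values) - 1:])
    return tuple(values)
```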

Sinusoidal Tracks

Sinusoidal tracks represent sinusoids that maintain their continuity over time as their frequencies, amplitudes, and phases evolve. Sinusoidal tracks are the standard data format used as the input to classical additive synthesis.

1TRC frames consist of a single 1TRC matrix:

  • Matrix type: "1TRC"
  • Allowed MatrixDataTypes: float32, float64
  • Rows: Sinusoidal tracks
  • Columns
    • Index (a unique integer >= 1 identifying this track and allowing it to be matched with 1TRC data in other frames. This is similar in concept to a Stream ID.) Required.
    • Frequency (Hertz). Required.
    • Amplitude (linear). Optional; default is 1.0.
    • Phase (Radians: must be between 0 and 2*pi). Optional, no default.

Synthesizers of 1TRC frames are expected to match the data for each sinusoid from frame to frame using the index numbers. Values for amplitude and frequency should somehow be interpolated so that they change smoothly between each frame.
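
The standard leaves the interpolation method open. As a hedged example, the sketch below matches tracks by index and interpolates frequency and amplitude linearly between two frames; linear interpolation, the function name, and the dictionary layout are all assumptions of this sketch.

```python
def interpolate_tracks(frame_a, frame_b, t):
    """Linearly interpolate matched tracks between two 1TRC frames.

    frame_a, frame_b: (time, {index: (frequency, amplitude)}) pairs,
    with frame_a earlier than frame_b and t between the two times.
    Only indices present in both frames are interpolated.
    """
    time_a, tracks_a = frame_a
    time_b, tracks_b = frame_b
    u = (t - time_a) / (time_b - time_a)  # 0 at frame_a, 1 at frame_b
    result = {}
    for index in tracks_a.keys() & tracks_b.keys():
        freq_a, amp_a = tracks_a[index]
        freq_b, amp_b = tracks_b[index]
        result[index] = (freq_a + u * (freq_b - freq_a),
                         amp_a + u * (amp_b - amp_a))
    return result
```

Tracks that appear in only one of the two frames are skipped here; how to treat them is a separate question for the synthesizer.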

As phase is the integral of the instantaneous frequency over time, the phase values in each frame may not necessarily match a synthesizer's concept of what the phase should be based on the previous phase and the frequency trajectory since the previous phase. Some synthesizers will ignore the phase field or use it only for the initial phase. Others will take the phases into account when interpolating frequencies from frame to frame. Others will "cheat" the desired frequencies to produce the desired phases.

We imagine SDIF utilities that would check the "reasonableness" of phase values based on the frequencies.

There is no guarantee that a partial appearing in one frame will also appear in the next frame. The situation where a partial appears in one frame but not the next is called a "death", and when a partial does not appear in one frame but does appear in the next frame it's called a "birth". These cases are challenging when writing a synthesizer. It's recommended that partials appearing for the first or last time in a series of frames should have amplitudes of zero, so that the semantics of fading in and out are explicitly in the SDIF data rather than needing to be added on by the synthesizer.
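
A utility could check whether a series of frames follows this recommendation. The sketch below reports every partial whose first or last appearance has a nonzero amplitude. The function name and data layout are hypothetical, and the sketch treats the first and last frames of the series as boundaries, so partials present there count as births and deaths.

```python
def abrupt_births_and_deaths(frames):
    """List (frame_number, index, kind) for births/deaths with nonzero amplitude.

    frames: list of {index: (frequency, amplitude)} dicts in time order.
    """
    problems = []
    last = len(frames) - 1
    for k, frame in enumerate(frames):
        for index, (_, amplitude) in sorted(frame.items()):
            is_birth = k == 0 or index not in frames[k - 1]
            is_death = k == last or index not in frames[k + 1]
            if is_birth and amplitude != 0.0:
                problems.append((k, index, "birth"))
            if is_death and amplitude != 0.0:
                problems.append((k, index, "death"))
    return problems
```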

Pseudo-harmonic Sinusoidal Tracks

Pseudo-harmonic sinusoidal track frames are exactly like sinusoidal track frames except that the partials are understood to lie on or close to a harmonic series. Thus, the index column of the 1HRM matrix represents harmonic partial number rather than an arbitrary index. Partial numbers start from 1, so the frequency of each pseudo-harmonic sinusoid should be close to the partial number times the fundamental frequency.
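
To make "close to the partial number times the fundamental frequency" concrete, the sketch below computes the relative deviation of each partial from its ideal harmonic frequency. The function name is ours, the fundamental frequency is supplied by the caller, and the standard does not define how close is close enough.

```python
def harmonic_deviations(rows, f0):
    """Map each partial number to its relative deviation from partial * f0.

    rows: 1HRM rows whose first two columns are partial number and frequency.
    """
    deviations = {}
    for row in rows:
        partial, frequency = int(row[0]), row[1]
        expected = partial * f0
        deviations[partial] = (frequency - expected) / expected
    return deviations
```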

Exponentially Decaying Sinusoids/Resonances

Resonance data can describe the characteristics of a resonant system like a group of tuned filter banks, or can specify parameters for a model of sinusoids with fixed frequencies and exponentially decaying amplitudes. (If you put an impulse into such a group of filter banks, the output should be a sum of sinusoids with fixed frequencies and exponentially decaying amplitudes, so these two situations are in a certain sense the same.)

1RES frames consist of a single 1RES matrix:

  • Matrix type: "1RES"
  • Allowed MatrixDataTypes: float32, float64
  • Rows: resonances
  • Columns:
    • Frequency (Hertz). Required.
    • Amplitude (linear). Optional, default is 1.0.
    • Decay Rate (Hertz). Optional, no default.
    • Phase (Radians: from 0 to 2*pi). Optional, no default.

The decay curve of a resonance should be the same as that of a two-pole filter with bandwidth equal to decay rate divided by pi. This formula gives the amplitude of each sinusoid over time:

	amp(t) = initial_amp * e ^ (- decay_rate * t)
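
The decay formula and the bandwidth relation can be evaluated as in the following sketch. The function names are hypothetical; time_to_decay_db is our own addition, obtained by solving the formula above for t.

```python
import math


def resonance_amplitude(initial_amp, decay_rate, t):
    """amp(t) = initial_amp * e ^ (-decay_rate * t), with t in seconds."""
    return initial_amp * math.exp(-decay_rate * t)


def resonance_bandwidth(decay_rate):
    """Bandwidth of the equivalent two-pole filter: decay rate divided by pi."""
    return decay_rate / math.pi


def time_to_decay_db(decay_rate, db):
    """Seconds until the amplitude has dropped by db decibels.

    Derived from 10 ^ (-db / 20) = e ^ (-decay_rate * t).
    """
    return db * math.log(10.0) / (20.0 * decay_rate)
```

For example, with a decay rate of 2 the amplitude falls to one thousandth of its initial value (a 60 dB drop) after time_to_decay_db(2.0, 60.0) seconds.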

The phase of a resonance specifies the initial phase of each decaying sinusoid. (Thanks to Jean Laroche for suggesting that we include phase in this frame type.)

The original SDIF spec included some interesting extra columns for resonances.