I. Introduction

Software running on PCs has largely replaced the electronic and
electro-mechanical equipment previously used in audio and video
production.

For example, Apple's Final Cut Pro is a software video editor that has
largely replaced hardware video editors for movie trailer and
television news production. In audio recording studios, analog tape
recorders and mixing consoles have largely been replaced by software
applications known as digital audio workstations (DAWs). DAWs may
also include virtual musical instruments: software emulations of the
sound production of acoustic and electronic musical instruments.

This type of software is challenging to design, because the two main
performance requirements -- soft real-time media I/O and low-latency
user interaction -- are difficult to achieve simultaneously.

For audio, software needs to run in real-time to ensure that audio
output samples are generated at a steady rate. For the common audio
sample rate of 44.1 kHz, a digital sample must be sent to the
digital-to-analog converter every 22.7 microseconds. If a sample is
late, an artifact (a click or dropout) results.
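
To make these deadlines concrete, the small sketch below (in C, purely
illustrative) prints the per-sample deadline implied by a few common
sample rates; the numbers, not the code, are the point.

    #include <stdio.h>

    /* Per-sample output deadline implied by common audio sample rates. */
    int main(void)
    {
        const double rates_hz[] = { 44100.0, 48000.0, 96000.0 };
        for (int i = 0; i < 3; i++) {
            double period_us = 1.0e6 / rates_hz[i];  /* microseconds per sample */
            printf("%7.0f Hz -> one sample every %5.2f microseconds\n",
                   rates_hz[i], period_us);
        }
        /* Prints 22.68 us at 44.1 kHz, 20.83 us at 48 kHz, 10.42 us at 96 kHz. */
        return 0;
    }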

For broadcast-quality video, the real-time requirement follows from
the need to generate new image frames at a constant rate. For example,
the highest frame rate for HD video is 60 frames/second.

The "soft" qualifier of soft real-time indicates that infrequent
artifacts are acceptable. However, the definition of "infrequent" is
a moving target driven by user expectations. For example, in today's
market, a DAW whose audio output clicked once every ten minutes would
be considered "broken" by its users.

If latency were not an issue, the real-time requirement could be
satisfied by the use of large media output buffers. What makes DAW
and video editor design so challenging is the additional requirement
of low-latency user interaction.

For a DAW, we define latency as the time interval between a user
action (turning a knob on a hardware controller to control a mixing
console, or pressing a piano key on a MIDI keyboard to control a
virtual instrument) and the user hearing the audio change. Today's
DAWs offer minimum latencies of 4-12 ms. For most users, this
latency range is irritating but tolerable; they would be happier if
1-2 ms latencies were possible.

For video applications, the latency requirement is driven by the
desire for frame-level response from the transport controls (play,
stop, etc.) and other editing controls. For HDTV, the maximum frame
rate is 60 frames/second, implying a latency budget of about 16 ms.

Another way to think about the low-latency requirement is to note that
it is impossible for the application to "compute ahead" very far in
time, because the computed output data would not reflect user-input
actions that might occur during the "compute ahead" time.

The low-latency requirement in these applications makes it impossible
to meet the real-time requirement via large output buffers. Instead,
audio or video data must pass through the application with true
millisecond-level latency, while only "missing" the real-time deadline
a few times per day.

In this proposal, we describe the state of the art for implementing
low-latency real-time applications (Section II). In Section III, we
make the case that the current methods will not scale to a many-core
world, and we sketch out a new design approach for DAWs and video
editors. In Section IV, we describe how our project will bring this
new approach to viability, and in Section V we explain why we are the
right team to tackle this problem.

II. Low-Latency Soft Real-Time Systems: The State of the Art

It is remarkable that low-latency real-time programs can be made to
run well on general-purpose operating systems. The core services of
an operating system like Unix -- process scheduling, inter-process
communication, resource locking, filesystem I/O, virtual memory --
were not designed to let multi-threaded user processes meet 3 ms
deadlines consistently. However, where there is a will there is a
way, and techniques have been developed to adapt these operating
system core services for real-time work.

To introduce these techniques, we take a concrete approach, and
sketch out the architecture for a simple DAW running on Mac OS X.
Similar methods are used for video editing applications, and for
real-time applications running on modern Windows platforms.

When an audio application starts under OS X, the application sets up
the audio sources it wishes to listen to and the audio sinks it wishes
to drive with sound output. The application also specifies the size
of audio sample buffers it wishes to process; the buffer size puts a
lower limit on the processing latency of the application. For
example, an application may wish to listen to the built-in microphone
on a laptop, send audio output to the built-in speakers on the laptop,
and process the microphone audio input in 128-sample buffers at 44.1
kHz, to yield 2.9 ms buffers.
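
Under Core Audio, this setup step amounts to opening an output audio
unit and registering the processing callback described below. The
fragment that follows is a minimal sketch with error handling omitted;
the 128-sample buffer size and 44.1 kHz rate are assumptions carried
over from our example, and negotiating them with the hardware involves
additional property calls not shown here.

    #include <AudioUnit/AudioUnit.h>
    #include <string.h>
    #include <unistd.h>

    /* Placeholder render callback: fills the output buffers with silence.
       A real DAW does its signal processing here (see the next sketch). */
    static OSStatus silence_callback(void *inRefCon,
                                     AudioUnitRenderActionFlags *ioActionFlags,
                                     const AudioTimeStamp *inTimeStamp,
                                     UInt32 inBusNumber,
                                     UInt32 inNumberFrames,
                                     AudioBufferList *ioData)
    {
        for (UInt32 b = 0; b < ioData->mNumberBuffers; b++)
            memset(ioData->mBuffers[b].mData, 0, ioData->mBuffers[b].mDataByteSize);
        return noErr;
    }

    int main(void)
    {
        /* Find and open the default audio output unit. */
        AudioComponentDescription desc = {
            .componentType         = kAudioUnitType_Output,
            .componentSubType      = kAudioUnitSubType_DefaultOutput,
            .componentManufacturer = kAudioUnitManufacturer_Apple
        };
        AudioComponent comp = AudioComponentFindNext(NULL, &desc);
        AudioUnit unit;
        AudioComponentInstanceNew(comp, &unit);

        /* Register the callback that the OS will invoke once per buffer. */
        AURenderCallbackStruct cb = { .inputProc = silence_callback,
                                      .inputProcRefCon = NULL };
        AudioUnitSetProperty(unit, kAudioUnitProperty_SetRenderCallback,
                             kAudioUnitScope_Input, 0, &cb, sizeof(cb));

        AudioUnitInitialize(unit);
        AudioOutputUnitStart(unit);  /* callbacks now fire once per buffer period;
                                        128 samples at 44.1 kHz gives 2.9 ms */
        sleep(10);
        return 0;
    }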

The actual audio processing is done in a callback function defined by
the application. In our example, this function is called by the
operating system every 2.9 ms. A 128-sample buffer of microphone
audio samples is passed into the function. The callback function
generates a 128-sample buffer of output audio samples intended for
playback through the laptop speakers, and passes the buffer to the
operating system when the function returns.
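
A processing callback for our example might look like the sketch
below. The function signature is Core Audio's render-callback
signature, and the function would be registered exactly as in the
previous sketch; the DawState structure, its gain field, and the
pre-filled microphone buffer are illustrative assumptions standing in
for a real DAW's signal-processing graph.

    #include <AudioUnit/AudioUnit.h>

    typedef struct {        /* hypothetical application state, shared with other threads */
        float gain;         /* set by the user-interface thread */
        float input[128];   /* most recent 128-sample microphone buffer */
    } DawState;

    /* Called by the OS once per buffer period (every 2.9 ms in our example).
       Everything here must finish well inside that period: no locks, no
       memory allocation, no file or network I/O. */
    static OSStatus render_callback(void *inRefCon,
                                    AudioUnitRenderActionFlags *ioActionFlags,
                                    const AudioTimeStamp *inTimeStamp,
                                    UInt32 inBusNumber,
                                    UInt32 inNumberFrames,    /* 128 in our example     */
                                    AudioBufferList *ioData)  /* output buffers to fill */
    {
        DawState *state = (DawState *)inRefCon;
        float *out = (float *)ioData->mBuffers[0].mData;

        for (UInt32 i = 0; i < inNumberFrames && i < 128; i++)
            out[i] = state->gain * state->input[i];   /* stand-in for real DSP */

        return noErr;
    }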

For audio output to be glitch-free, the callback function must
complete processing in less than 2.9 ms of wall-clock time. In this
example, where audio input and audio output are both occurring, the
minimum user-perceived latency is 5.8 ms (the operating system
overhead for audio I/O will add to this latency).

Given this background, we can now analyze the ways that the
application may fail to meet its performance goals.

First, we note that if the user configures the DAW to do signal
processing operations that typically require more than 2.9 ms of
wall-clock CPU time to compute a buffer, real-time operation is
impossible. In practice, DAW user interfaces include a "CPU meter" to
display the CPU headroom left in a session, and also offer techniques
for users to trade off CPU resources for flexibility (such as
pre-computing audio signal processing operations off-line).

Next, we note that the callback function runs as a thread in the
application's process, and its execution is under the control of the
operating system process scheduler. If the scheduler pre-empts the
execution of the callback function for a significant period of time,
the real-time deadline is likely to be missed.

To address this scheduling problem, OS X offers a time-constraint
scheduling policy option. Roughly speaking, this policy lets a thread
reserve a portion of a machine's CPU cycles on an ongoing basis, by
specifying the periodicity of the process (in our example, 2.9 ms),
the number of CPU cycles needed during the 2.9 ms, and to what degree
the CPU cycles need to be contiguous. Reservation requests which
would unduly tax the CPU are denied. A time-constraint thread is
throttled by the scheduler if it consistently uses more CPU cycles
than requested. This technique ensures that badly-behaved
applications do not monopolize the machine.
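
As an illustration, the sketch below shows how a thread might request
this policy through the Mach thread_policy_set() call; the period
matches our 2.9 ms buffer, while the computation and constraint values
are illustrative rather than tuned.

    #include <mach/mach.h>
    #include <mach/mach_time.h>
    #include <mach/thread_policy.h>
    #include <stdint.h>

    /* Convert nanoseconds to the Mach absolute-time units the policy expects. */
    static uint64_t ns_to_abs(uint64_t ns)
    {
        mach_timebase_info_data_t tb;
        mach_timebase_info(&tb);
        return ns * tb.denom / tb.numer;
    }

    /* Ask the scheduler for time-constraint scheduling matched to a 2.9 ms
       audio buffer period; the computation/constraint values are illustrative. */
    static void request_time_constraint(void)
    {
        thread_time_constraint_policy_data_t policy;
        policy.period      = (uint32_t)ns_to_abs(2900000);  /* 2.9 ms buffer period        */
        policy.computation = (uint32_t)ns_to_abs(500000);   /* CPU time needed per period  */
        policy.constraint  = (uint32_t)ns_to_abs(2000000);  /* deadline for that CPU time  */
        policy.preemptible = 1;                             /* cycles need not be contiguous */

        kern_return_t kr = thread_policy_set(mach_thread_self(),
                                             THREAD_TIME_CONSTRAINT_POLICY,
                                             (thread_policy_t)&policy,
                                             THREAD_TIME_CONSTRAINT_POLICY_COUNT);
        if (kr != KERN_SUCCESS) {
            /* Request denied; the thread falls back to ordinary scheduling. */
        }
    }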

For this style of real-time programming to work, a callback function
must avoid invoking operating system services that may on occasion
take an excessive amount of time (in our example, excessive relative
to 2.9 ms). Below, we list common pitfalls related to this issue, and
the mechanisms that are used to avoid them:

* Virtual memory. A page fault in the callback routine will cause a
deadline miss. A background thread that keeps code and data pages
"warm" is the preferred mechanism to address this issue under OS X.
Some intelligence is required in choosing the pages to warm, given the
large memory footprint of a modern DAW.

* Memory synchronization. The callback function typically uses shared
memory to communicate with user-input threads (which manage hardware
control surfaces and the computer keyboard and mouse) and
screen-painting threads. If the memory synchronization between these
threads results in the callback function taking a slow lock, there
will be a deadline miss. Scheduling the user-input and
screen-painting threads must also be done with care, to deliver a
low-latency experience to the user. In our example, for a DAW user
moving a hardware fader to experience the low latency that the 2.9 ms
audio buffer size should make possible, the path from the fader move
to the callback thread's reception of the new fader value must be
well under 2.9 ms.

* Disk I/O. DAWs emulate playing back an analog tape by reading
stored audio samples from a disk. Given the mechanical latency of
disks, doing disk reads directly from the callback function will cause
deadline misses. DAWs typically use helper threads to pre-buffer the
first 100 ms or so of audio tracks that the user is able to activate
for immediate playback. Similar pre-buffering strategies are used for
sample-based virtual musical instruments. Lock-free, low-latency
communication between helper threads and the callback thread is
essential for this scheme to work; a sketch of one such communication
channel follows this list.
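
To illustrate the lock-free communication that the last two items
depend on, the sketch below shows a single-producer/single-consumer
ring buffer built on C11 atomics. A helper thread pushes data (new
fader values, pre-buffered disk audio) and the callback thread pops
it, and neither side ever blocks. This is one standard construction
among several, shown here only to make the idea concrete.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 1024   /* must be a power of two */

    typedef struct {         /* zero-initialize before use */
        float          data[RING_SIZE];
        _Atomic size_t write_pos;   /* advanced only by the producer (helper thread)   */
        _Atomic size_t read_pos;    /* advanced only by the consumer (callback thread) */
    } ring_t;

    /* Producer side: called from a helper thread. Never blocks; returns
       false when the ring is full. */
    static bool ring_push(ring_t *r, float value)
    {
        size_t w  = atomic_load_explicit(&r->write_pos, memory_order_relaxed);
        size_t rd = atomic_load_explicit(&r->read_pos, memory_order_acquire);
        if (w - rd == RING_SIZE)
            return false;                           /* full */
        r->data[w & (RING_SIZE - 1)] = value;
        atomic_store_explicit(&r->write_pos, w + 1, memory_order_release);
        return true;
    }

    /* Consumer side: called from the audio callback. Never blocks; returns
       false when the ring is empty. */
    static bool ring_pop(ring_t *r, float *value)
    {
        size_t rd = atomic_load_explicit(&r->read_pos, memory_order_relaxed);
        size_t w  = atomic_load_explicit(&r->write_pos, memory_order_acquire);
        if (rd == w)
            return false;                           /* empty */
        *value = r->data[rd & (RING_SIZE - 1)];
        atomic_store_explicit(&r->read_pos, rd + 1, memory_order_release);
        return true;
    }

In a real DAW this pattern carries whole audio buffers and control
messages rather than single floats, but the construction is the same.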

With careful design and performance tuning, the techniques above can
be used to build DAWs with excellent real-time performance on
single-processor machines. Simple ports of DAWs to multi-core
machines devote one core to the callback thread and the other core to
non-audio threads. For better performance, more sophisticated ports
divide the work done in the callback function among the cores, using
the low-level tools described above to maintain real-time performance.

The main thesis of our proposal is that the application of these
low-level tools will not scale to audio and video applications that
require dozens or hundreds of cores to compute their callback functions.
In the next section, we describe why this method breaks down, and we
sketch out the approach we intend to pursue in our project.

III. Low-Latency Soft Real-Time Systems on Many-Core Platforms

The programming methods described in the last section share a common
approach. Each method tries to improve real-time performance by
letting the application participate in the delivery of a service that
a general-purpose program would leave entirely to the operating
system.

This approach is workable, if tedious, for programs with a small
number of threads. However, consider a DAW application running on a
128-core machine, running a session that needs all of the machine's
CPU resources to meet its real-time deadlines.

To use so many cores successfully, the application needs to split the
work done in the callback thread over one hundred or more worker
threads. Luckily, application-level parallelism is easy to find in a
DAW. For example, an audio mixing console may have dozens of active
channels, each of which could run on its own core. The signal
processing chain within each channel (tone controls, automatic gain
control, reverberation) may also be split across several cores.

However, to ensure that this collection of threads works together to
compute the DAW's output in real time, the application needs to manage
the time-constraint scheduling of these threads so that no thread is
starved for audio data. To do this well, the application needs to
track how the scheduler maps threads to particular processors, and how
the memory hierarchy of the machine impacts the cost of communicating
between threads. The secondary tasks described in the last section
(virtual-memory warming, preventing blocking during memory
synchronization between threads, low-latency file I/O buffering) also
become more complex when an application is managing hundreds of
worker threads.

We claim that many-core real-time programming needs to head in the
opposite direction. For an application like a DAW to run well on a
many-core machine, the programming model must change so that
applications can return operating-system duties to the operating
system, and act like "normal" applications again.

We now sketch out our main ideas for this programming model, using a
DAW running on a UNIX-like operating system as the sample application.
Although some details would change, the core architecture defined
below is also suitable for video applications.

In our programming model, the audio processing work done by a DAW is
split up into a set of "media modules". Each module is a UNIX process
(not a thread) with its own virtual address space.

Modules communicate audio and control data not by using shared memory,
but by reading and writing specialized types of UNIX sockets (audio is
sent and received over "audio sockets", control information is sent
over "control sockets", etc). We refer to the set of specialized
sockets defined by the API as "media sockets".

To start a DAW session, the master DAW process spawns a set of module
processes, and interconnects the audio and control sockets of these
modules to achieve the desired data flow.

For a simple module, such as a single low-pass filter, execution
proceeds as follows. The main() function of the process is normally
an infinite loop that continuously reads from the audio-in socket,
checks the control sockets to see if the user has changed the
filter parameters, and then computes the next filter output sample
and writes it to the audio-out socket. The process does blocking
reads and writes on audio sockets, and non-blocking reads on control
sockets.
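
To make the module structure concrete, here is a sketch of what such a
low-pass filter process might look like. The media-socket calls
(ms_open_audio_in(), ms_read_audio(), and so on) are hypothetical
names standing in for the API we propose to design; they do not exist
today.

    #include <stdbool.h>

    /* Hypothetical media-socket API; these declarations are placeholders
       for the interface we propose to design, not an existing library. */
    typedef struct ms_socket ms_socket;
    ms_socket *ms_open_audio_in(const char *name);
    ms_socket *ms_open_audio_out(const char *name);
    ms_socket *ms_open_control_in(const char *name);
    bool ms_read_audio(ms_socket *s, float *sample);       /* blocking     */
    void ms_write_audio(ms_socket *s, float sample);       /* blocking     */
    bool ms_poll_control(ms_socket *s, float *new_value);  /* non-blocking */

    /* A one-pole low-pass filter module: one process, one infinite loop. */
    int main(void)
    {
        ms_socket *in  = ms_open_audio_in("audio-in");
        ms_socket *out = ms_open_audio_out("audio-out");
        ms_socket *ctl = ms_open_control_in("cutoff");

        float coeff = 0.1f;   /* filter coefficient, updated by control messages */
        float state = 0.0f;   /* filter memory */
        float x;

        for (;;) {
            /* Blocking read: the OS sleeps this process until the next input
               sample is due, which is what drives real-time scheduling. */
            if (!ms_read_audio(in, &x))
                break;                        /* upstream module went away */

            float c;
            if (ms_poll_control(ctl, &c))     /* non-blocking control check */
                coeff = c;

            state += coeff * (x - state);     /* y[n] = y[n-1] + c*(x[n] - y[n-1]) */
            ms_write_audio(out, state);       /* blocking write downstream */
        }
        return 0;
    }

Because the process blocks inside ms_read_audio(), the operating
system, not the application, decides when the module runs; that is the
heart of the model.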

In this programming model, the operating system understands that media
modules are real-time entities, and that the flows over audio and
control sockets need to meet soft real-time constraints for the
application to work well. Given this understanding of the media type,
the operating system can provide the usual set of process services
(scheduling, virtual-memory paging, inter-process communication,
mapping of processes to CPUs) in a way that is consistent with meeting
the real-time goal.

In a well-behaved interconnection of modules, the operating system
never needs to pre-empt a module as it is computing an audio sample.
Instead, the operating system puts a module to sleep when it does a
blocking audio-socket read for the next sample, and reawakens it when
real time has advanced to the point where it is time to compute that
sample. In this way, the operating system uses the producer-consumer
dynamics of audio signals flowing through a graph of modules connected
by sockets to drive real-time process management.

In the programming model, modules that originate an audio flow (such
as a module that reads audio samples from a microphone, from a disk,
or from a UDP Internet port) and modules that terminate an audio flow
(such as a module that writes audio samples to a speaker, to disk, or
to a UDP Internet port) are treated in a consistent way. General API
tools are provided to handle the heavy
lifting of implementing the real-time "impedance match" between
external sources and media sockets. As in normal media modules, the
operating system takes responsibility for scheduling and paging the
modules in a sensible way.

By mapping modules to processes, and by exclusively using media-typed
sockets for inter-module communication, our programming model links
media semantics to traditional operating system primitives. This link
lets the operating system do a better job of making the application
run well (i.e. meet its real-time deadlines more often).

For example:

* If a module is clearly missing its deadlines, the operating system
can make a best effort to minimize audio artifacts on the DAW outputs,
as it knows the signal semantics of audio sockets. The definition of
a module could include a specification of the optimal stream-repair
approach for the module's audio-out sockets.

* For modules whose outputs are a deterministic function of their
history of audio and control inputs, the operating system can run
several processes in parallel for a module, on different cores, and
use the audio sample from the process that writes its output socket
first. This strategy decreases the odds that an unlucky series of
cache misses will cause a real-time deadline miss.

* If a module crashes, the operating system can kill it, spawn a new
process, and wire the new process's sockets into the signal path.

For the purposes of this proposal, the brief description above should
suffice to sketch out our technical direction for the project. In the
final parts of our proposal, we give a timeline of what we plan to do,
and we describe why the participants in the project are the right team
to execute on these ideas.

IV. Project Goals and Scheduling

[todo: turn these bullets into text]

* define the API for audio in conjunction with the operating
systems team and the audio application team.

* implement the API in the OS

* implement a simple version of Max/MSP using the API, and
port the hearing-aid application to run on this many-core
version of Max/MSP

* performance measurement

* extension of the API to video -- an aggressive proposal would
promise the bullets above for video too

V. Investigators

[todo]