The audio GetPosition dilemma (long)

Fri Feb 10 10:14:02 CST 2012

Hi,

mmdevapi's GetPosition returns "the stream position of the sample that
is currently playing through the speakers", says MSDN.
This is exactly what apps that want to synchronize audio & video
(lip synchronisation) ought to use.

MSDN's wording about winmm's GetPosition is not exactly the same: "the
current playback position of the given waveform-audio output device."
However, there's no other winmm function that could be used for lip-sync.

Tests in winmm/tests/wave.c tend to show that WHDR_DONE notifications
are only received after a position corresponding to the written number
of bytes has been reached -- IOW the buffer has been played.

Now look at winmm's PlaySound and mciwave code:
    /* make it so that 3 buffers per second are needed */
Both then proceed to play ping pong with two 1/3 second buffers.
Every time WHDR_DONE is received, one of the two is empty, refilled
and resubmitted using waveOutWrite.
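
To put numbers on that scheme, here is a minimal sketch of the buffer
arithmetic.  It is not copied from PlaySound or mciwave; the function
names are made up for illustration, and the two parameters are assumed
to mirror the WAVEFORMATEX fields nAvgBytesPerSec and nBlockAlign.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of one buffer holding about 1/3 second of audio,
 * rounded up to a whole number of sample blocks.  The parameters are
 * assumed to mirror WAVEFORMATEX nAvgBytesPerSec and nBlockAlign. */
uint32_t third_second_buffer_bytes(uint32_t avg_bytes_per_sec,
                                   uint32_t block_align)
{
    assert(block_align > 0);
    uint32_t raw = avg_bytes_per_sec / 3;
    /* round up to the next multiple of block_align */
    return (raw + block_align - 1) / block_align * block_align;
}

/* Ping-pong between two buffers: after buffer 0 comes buffer 1,
 * then buffer 0 again. */
int next_buffer(int current)
{
    return current ^ 1;
}
```

With 44100 Hz 16-bit stereo (176400 bytes per second, block align 4)
each buffer holds 58800 bytes, so the app never has more than about
2/3 second of audio queued.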

Now consider a system such as PulseAudio that wants to buffer 2
seconds of an audio stream internally.  That scheme will fail completely.

- If WHDR_DONE is based on GetPosition, notifications will only be
  sent after 2 seconds or an underrun.  Neither PlaySound nor MCIWAVE
  buffer 2 seconds.  They continuously hit an underrun as PA signals
  that it drained the tiny 40ms ALSA buffer we supplied.
  Is that PlaySound's fault?

- If WHDR_DONE is instead based on buffer usage, it could be sent
  following snd_pcm_write and let PA buffer as much as it wants.

  But then our wave tests fail if GetPosition returns something like
  the position currently playing through the speakers, which is 2s late.
  Should we have GetPosition lie?  And suffer loss of lip sync?
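
The two candidate rules for WHDR_DONE can be sketched as follows.
This is only an illustration of the distinction, not driver code: the
device_state struct and both function names are hypothetical, and all
counts are frames since the stream started.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of a playback device, in frames since start. */
typedef struct {
    uint64_t consumed;  /* frames the back-end has accepted from us */
    uint64_t played;    /* frames that have reached the speakers */
} device_state;

/* Rule 1: a buffer is done once the speaker position has passed
 * its last frame. */
bool done_by_position(const device_state *dev, uint64_t buffer_end)
{
    assert(dev->played <= dev->consumed);
    return dev->played >= buffer_end;
}

/* Rule 2: a buffer is done once the back-end has taken all of its
 * frames, no matter when they become audible. */
bool done_by_usage(const device_state *dev, uint64_t buffer_end)
{
    assert(dev->played <= dev->consumed);
    return dev->consumed >= buffer_end;
}
```

Take two 16000-frame buffers at 48000 Hz that a high-latency back-end
has swallowed whole while nothing is audible yet (consumed = 32000,
played = 0).  Rule 1 keeps both buffers pending, so the app has
nothing left to submit; rule 2 returns them at once.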

DSound's GetCurrentPosition suffers from the exact same issue.  For
instance, playing an audio CD-ROM with mcicda in recent Wine causes
constant underruns from DSound when PulseAudio is involved.

What's the root cause of the issue?

15 years ago, the audio HW would likely receive a pointer to the winmm
header data and play that.  Once played, the buffer was no longer needed
and returned to the app.  Also, there was nothing sitting behind the
audio HW causing significant latency (I mean DAC -> electronic ->
speaker -> acoustic had latency negligible to the human brain).
Equating GetPosition with the sample seen by the DAC was and is
reasonable -- ALSA does exactly this with hw:0.

DSound's model is that of a circular buffer from which the DAC is fed.
No different.

These days with mmdevapi, Windows 7 users report a latency of 30-40ms
introduced by the native mixer.  Native's winmm might account for that
in its GetPosition reports.  This limit is below the threshold that my
mmdevapi tests would notice.  It is well below the approximately 100ms
that matter to human perception when correlating visible and audible
events.

Note that native's mixer is the last in an audio graph before the HW.

What happens in Linux with Wine?

ALSA's dmix is known to introduce some latency as it operates in
period-size chunks.  However ALSA's periods are typically very small,
e.g. 21.333ms and ALSA's dmix is typically connected to hw:0,
immediately playing audio to the speakers.  There's hardly a
difference between audible GetPosition and a buffer position derived
from snd_pcm_write (and old Wine mixed the two in the past).
The picture looks a lot like with native's mmdevapi mixer.

PulseAudio is entirely different.  With PA, the Wine mixer is no longer
near the end of the audio graph; it sits in front of an element that
introduces 2 seconds of latency.  A position derived from snd_pcm_write
is 2s ahead of the true speaker position.
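
The arithmetic behind these figures, as a small sketch.  The helper
names are hypothetical, and the 1024-frame period at 48000 Hz is my
assumption of one configuration that yields the 21.333ms quoted above.
In ALSA the delay would come from something like snd_pcm_delay; here
it is just a parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of a number of frames in milliseconds. */
double frames_to_ms(uint64_t frames, uint32_t rate)
{
    assert(rate > 0);
    return 1000.0 * (double)frames / (double)rate;
}

/* Audible position: frames written so far minus whatever is still
 * queued between us and the speakers.  Saturates at 0 while the
 * pipeline is still filling. */
uint64_t speaker_position(uint64_t written, uint64_t delay)
{
    return written > delay ? written - delay : 0;
}
```

A 1024-frame period at 48000 Hz lasts about 21.333ms.  A 2s latency at
the same rate is 96000 frames, so after writing 144000 frames (3s) the
listener has heard only the first 48000.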

How did we react?  winealsa tried:

1. Using a small ALSA buffer so as to signal PA that large latencies
   are not ok.  PA seems to ignore that hint. It still appears to
   buffer a lot somewhere internally (I've no experience with PA 1.0
   or 1.1, please check).

   Even worse, small ALSA buffers like 40ms increased the risk of
   underruns and resulted in a worse audio experience for all users of
   winealsa, whereas old Wine would typically use a 100ms buffer.

   Small buffers are no issue to native's mixer.  It's using "Pro
   Audio" priority, which is the highest priority on a native system
   AFAIK.  By contrast, Wine is running at normal user priority.

2. Rate limiting.  We currently believe we are doing it because we
   write no more than 3 periods at a time.  But we're not.  Nothing
   can prevent PA or any other device from filling its 2s buffer over
   time.  It'll eventually do it, just slowly.

   We can't use our own clock to limit our writes because that would
   introduce clock skew issues with the audio HW clock.  What if it
   runs faster than the system's interpretation of 48000fps?  We can't
   second-guess the audio HW clock.  IOW, we can't win a rate-limiting
   game.  If the back-end signals via snd_pcm_avail that it has space,
   then we must feed it.
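
A rough illustration of the clock skew argument, with made-up numbers:
suppose the hardware really consumes 48048 frames per second (0.1%
fast) while we pace our writes at a nominal 48000 using the system
clock.  The function names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Frames per second by which we under-supply the device when we pace
 * writes at nominal_rate but the hardware consumes at actual_rate.
 * Negative if the hardware is the slower of the two. */
int64_t shortfall_per_second(uint32_t nominal_rate, uint32_t actual_rate)
{
    return (int64_t)actual_rate - (int64_t)nominal_rate;
}

/* Whole seconds until a full buffer of buffer_frames runs dry.
 * Returns -1 if the hardware is not faster than our pacing; in that
 * case the queue grows instead of draining. */
int64_t seconds_until_underrun(uint32_t buffer_frames,
                               uint32_t nominal_rate,
                               uint32_t actual_rate)
{
    assert(nominal_rate > 0 && actual_rate > 0);
    int64_t shortfall = shortfall_per_second(nominal_rate, actual_rate);
    if (shortfall <= 0)
        return -1;
    return (int64_t)buffer_frames / shortfall;
}
```

With a 40ms buffer (1920 frames at 48000 Hz) and a shortfall of 48
frames per second, the buffer is empty after 40 seconds, however
carefully we time our writes.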

Now we know the reason for our troubles.  What can we do?

0. Ignore the issue.
   Equate GetPosition with Released_frames - GetCurrentPadding.

1. Consider lip-sync important and strive to support
   audio & video applications.

2. mmdevapi appears to have a reasonable separation of concerns
   between buffering and speaker position.  Leave as is.  (It may
   happen though that apps built upon the assumption that they are no
   more than 40ms apart break when latencies like PA's enter the
   loop.  Here it would be interesting to see whether native has some
   similar situation, like a remote desktop & audio environment).

3. Protect DSound from too large a delta between padding and position.
   The maximum is given by the DSound primary buffer size, since
   DSound's circular GetCurrentPosition abstraction breaks down
   completely should they be further apart.

   IOW, don't rate limit DSound writes, but lie about the position if
   the wineXYZ device says it's too far behind.

   Having this right may fix some of the bugs currently in bugzilla.

   That way, we'll get lip-sync with back-ends that don't introduce an
   unbearable latency.  And we won't enter an underrun because neither
   the driver nor the app ever waits too long for the position to
   increase.

4. Protect winmm from too large a delta between padding and position.
   As there's no buffer limit to guide us, we must introduce an
   arbitrary one, around 100-200ms.  I don't know whether a dynamic
   limit based on the average size of supplied WHDR buffers would work.

   Regarding WHDR_DONE, I don't know whether we should then relax the
   tests.  That feels unsafe, because an app may well wait for
   WHDR_DONE before calling waveOutReset or Close.  WHDR_DONE should
   not come too early, or we may lose trailing sounds.

   Perhaps we should not delay the position when playing the last
   buffer in the list.  Our winmm/tests/wave.c tests prove nothing
   more than that: at the end, the position corresponds to the sum of
   written samples.  Perhaps a similar reflection on MS' side explains
   why in my tests native's mmdevapi GetPosition may stay 17 samples
   below the sum of written samples with some USB headsets (or is it
   just a bug in their driver?).

5. Observe the behaviour of libraries that target mmdevapi, winmm and
   DSound, e.g. OpenAL, XAudio2, bink, smack, SDL, FMOD and learn.
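
The "lie about the position" idea for DSound and winmm could look
roughly like the sketch below.  All names are hypothetical and nothing
here is existing Wine code.  max_lag would be the primary buffer size
for DSound, or an arbitrary 100-200ms converted to frames for winmm.
naive_position corresponds to option 0.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a latency limit in milliseconds to frames. */
uint64_t ms_to_frames(uint32_t ms, uint32_t rate)
{
    return (uint64_t)ms * rate / 1000;
}

/* Option 0: position = frames released by the app minus frames still
 * sitting in our own buffer, ignoring any latency further down. */
uint64_t naive_position(uint64_t released, uint64_t padding)
{
    assert(padding <= released);
    return released - padding;
}

/* Report the device's position, but never let it fall more than
 * max_lag frames behind what has been written. */
uint64_t clamped_position(uint64_t written, uint64_t device_pos,
                          uint64_t max_lag)
{
    if (device_pos >= written)
        return written;
    if (written - device_pos > max_lag)
        return written - max_lag;   /* too far behind: lie */
    return device_pos;              /* close enough: tell the truth */
}
```

With a 200ms limit at 48000 Hz (9600 frames), a back-end that is a
full 2s behind gets reported as only 200ms behind, while a back-end
lagging 6000 frames is reported truthfully and lip-sync is preserved.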

Actually, thinking about what may happen to trailing samples in a
system that introduces 2s latencies still gives headaches.  If we
issue snd_pcm_reset prior to snd_pcm_close, sound will be killed for
sure.  It looks like the device should remain open for some time.

This applies to all wineXYZ devices, not just ALSA with PA.  Some
future OSS device may also introduce unforeseen latencies.

Now I'll go and fix winmm as suggested above.

Thank you for reading this far,
      Jörg Höhle