
Changing the Audio Format

To change the audio codec, the container format, and the sample rate, add the following lines to .env-local-docker, which is located at the root of the repository.

CV_TRANSCODE_CODEC='pcm_s16le'
CV_TRANSCODE_FORMAT='wav'
CV_TRANSCODE_SAMPLE_RATE='44100'

or

CV_TRANSCODE_CODEC='pcm_s24le'
CV_TRANSCODE_FORMAT='wav'
CV_TRANSCODE_SAMPLE_RATE='96000'
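
For reference, the server snippets further down read these values as config.TRANSCODE.*; a minimal sketch of that mapping (the actual config loader isn't shown here):

// Sketch, assuming the env-to-config mapping (values shown are examples):
const TRANSCODE = {
  AUDIO_CODEC: process.env.CV_TRANSCODE_CODEC, // e.g. 'pcm_s16le'
  FORMAT: process.env.CV_TRANSCODE_FORMAT, // e.g. 'wav'
  SAMPLE_RATE: Number(process.env.CV_TRANSCODE_SAMPLE_RATE), // e.g. 44100
};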

Available Lossless Audio Codecs

ffmpeg -codecs |& grep ".EA..S"

or

docker container exec -it web ffmpeg -codecs |& grep ".EA..S"
docker container exec -it web ffmpeg -codecs \
|& grep ".EA..S" \
| sed -e 's| \([^ ]\+\) \([^ ]\+\) \+\(.\+\)|\1\t\2\t\3|' \
| tabulate --sep $'\t' --format pipe
| Attributes | Name             | Description                                                    |
|------------|------------------|----------------------------------------------------------------|
| DEA..S     | alac             | ALAC (Apple Lossless Audio Codec)                              |
| DEA.LS     | dts              | DCA (DTS Coherent Acoustics) (decoders: dca ) (encoders: dca ) |
| DEA..S     | flac             | FLAC (Free Lossless Audio Codec)                               |
| DEA..S     | mlp              | MLP (Meridian Lossless Packing)                                |
| DEA..S     | pcm_f32be        | PCM 32-bit floating point big-endian                           |
| DEA..S     | pcm_f32le        | PCM 32-bit floating point little-endian                        |
| DEA..S     | pcm_f64be        | PCM 64-bit floating point big-endian                           |
| DEA..S     | pcm_f64le        | PCM 64-bit floating point little-endian                        |
| DEA..S     | pcm_s16be        | PCM signed 16-bit big-endian                                   |
| DEA..S     | pcm_s16be_planar | PCM signed 16-bit big-endian planar                            |
| DEA..S     | pcm_s16le        | PCM signed 16-bit little-endian                                |
| DEA..S     | pcm_s16le_planar | PCM signed 16-bit little-endian planar                         |
| DEA..S     | pcm_s24be        | PCM signed 24-bit big-endian                                   |
| DEA..S     | pcm_s24daud      | PCM D-Cinema audio signed 24-bit                               |
| DEA..S     | pcm_s24le        | PCM signed 24-bit little-endian                                |
| DEA..S     | pcm_s24le_planar | PCM signed 24-bit little-endian planar                         |
| DEA..S     | pcm_s32be        | PCM signed 32-bit big-endian                                   |
| DEA..S     | pcm_s32le        | PCM signed 32-bit little-endian                                |
| DEA..S     | pcm_s32le_planar | PCM signed 32-bit little-endian planar                         |
| DEA..S     | pcm_s64be        | PCM signed 64-bit big-endian                                   |
| DEA..S     | pcm_s64le        | PCM signed 64-bit little-endian                                |
| DEA..S     | pcm_s8           | PCM signed 8-bit                                               |
| DEA..S     | pcm_s8_planar    | PCM signed 8-bit planar                                        |
| DEA..S     | pcm_u16be        | PCM unsigned 16-bit big-endian                                 |
| DEA..S     | pcm_u16le        | PCM unsigned 16-bit little-endian                              |
| DEA..S     | pcm_u24be        | PCM unsigned 24-bit big-endian                                 |
| DEA..S     | pcm_u24le        | PCM unsigned 24-bit little-endian                              |
| DEA..S     | pcm_u32be        | PCM unsigned 32-bit big-endian                                 |
| DEA..S     | pcm_u32le        | PCM unsigned 32-bit little-endian                              |
| DEA..S     | pcm_u8           | PCM unsigned 8-bit                                             |
| DEA..S     | s302m            | SMPTE 302M                                                     |
| DEA..S     | truehd           | TrueHD                                                         |
| DEA..S     | tta              | TTA (True Audio)                                               |
| DEA.LS     | wavpack          | WavPack (encoders: wavpack libwavpack )                        |

Available Formats

ffmpeg -formats

or

docker container exec -it web ffmpeg -formats

S3 File Name

server/src/lib/clip.ts:saveClip() saves the audio content to S3.

const { client_id, headers } = request;
const sentenceId = headers.sentence_id as string;

const folder = client_id + '/';
const filePrefix = sentenceId;
const clipFileName = folder + filePrefix + '.' + config.TRANSCODE.FORMAT;

await this.s3
  .upload({
    Bucket: config.CLIP_BUCKET_NAME,
    Key: clipFileName,
    Body: audioOutput,
  })
  .promise();

Utterances appear to be identified by a client_id + sentenceId pair. The recordings are saved in a bucket and their metadata in a database; existence is checked with this.model.db.clipExists(client_id, sentenceId) (see the sketch after the next snippet).

await this.model.saveClip({
  client_id: client_id,
  localeId: sentence.locale_id,
  original_sentence_id: sentenceId,
  path: clipFileName,
  sentence: sentence.text,
});
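
A hypothetical sketch of how that existence check could guard the save (the response handling here is an assumption, not the actual server code):

// Hypothetical guard, assuming Express-style response handling: skip the
// upload when a clip already exists for this (client_id, sentenceId) pair.
if (await this.model.db.clipExists(client_id, sentenceId)) {
  response.statusCode = 204; // already recorded, nothing new to save
  return response.end();
}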

Audio Recording in the Browser

What audio format, bit depth, and sample rate does the browser record?

Default Sample Rate?

https://developer.mozilla.org/en-US/docs/Web/API/AudioContext/AudioContext

sampleRate Optional

Indicates the sample rate to use for the new context. The value must be a floating-point value indicating the sample rate, in samples per second, for which to configure the new context; additionally, the value must be one which is supported by AudioBuffer.sampleRate. The value will typically be between 8,000 Hz and 96,000 Hz; the default will vary depending on the output device, but the sample rate 44,100 Hz is the most common. If the sampleRate property is not included in the options, or the options are not specified when creating the audio context, the new context's output device's preferred sample rate is used by default.
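
A quick way to see both behaviours from a browser console (this is plain Web Audio API, not Common Voice code):

// Default: the new context inherits the output device's preferred rate.
const defaultCtx = new AudioContext();
console.log(defaultCtx.sampleRate); // device-dependent, commonly 44100 or 48000

// Explicit: request 96 kHz; the context resamples if the hardware differs.
const hiResCtx = new AudioContext({ sampleRate: 96000 });
console.log(hiResCtx.sampleRate); // 96000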

Common Voice's AudioWeb Class

The Common Voice web UI has an AudioWeb object: web/src/components/pages/contribution/speak/audio-web.ts

Resources

How Common Voice Sets up the Audio Analyzer

web/src/components/pages/contribution/speak/audio-web.ts

AudioWeb::init()

// Set up the analyzer node, and allocate an array for its data
// FFT size 64 gives us 32 bins. But those bins hold frequencies up to
// 22kHz or more, and we only care about lower frequencies which is where
// most human voice lies, so we use fewer bins.
analyzerNode.fftSize = 128;
analyzerNode.smoothingTimeConstant = 0.96;
this.frequencyBins = new Uint8Array(analyzerNode.frequencyBinCount);
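
For context, an AnalyserNode's frequencyBinCount is always fftSize / 2, so fftSize = 128 above yields 64 bins. A minimal sketch of how such an analyser is typically polled (the polling loop below is illustrative, not Common Voice code):

// Sketch: poll the analyser each animation frame and read per-bin magnitudes.
const bins = new Uint8Array(analyzerNode.frequencyBinCount); // 64 bins for fftSize 128
function poll() {
  analyzerNode.getByteFrequencyData(bins); // each entry is a 0-255 magnitude
  requestAnimationFrame(poll);
}
requestAnimationFrame(poll);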

Figuring out the Hacky Way

diff --git a/web/src/components/pages/contribution/speak/speak.tsx b/web/src/components/pages/contribution/speak/speak.tsx
index 4d023c1cf..3a9357bcc 100755
--- a/web/src/components/pages/contribution/speak/speak.tsx
+++ b/web/src/components/pages/contribution/speak/speak.tsx
@@ -360,6 +360,7 @@ class SpeakPage extends React.Component<Props, State> {

private startRecording = async () => {
try {
+ console.log(`startRecording: ${JSON.stringify(this.audio)}`);
await this.audio.start();
this.maxVolume = -1; // Initialize to -1 in case updateVolume is never called.
this.recordingStartTime = Date.now();

Start recording an utterance, then look at your browser's console log.

AudioContext {baseLatency: 0.01, outputLatency: 0, destination: AudioDestinationNode, currentTime: 0, sampleRate: 48000, …}
  audioWorklet: AudioWorklet {}
  baseLatency: 0.01
  currentTime: 6.84
  destination: AudioDestinationNode {maxChannelCount: 2, context: AudioContext, numberOfInputs: 1, numberOfOutputs: 0, channelCount: 2, …}
  listener: AudioListener {positionX: AudioParam, positionY: AudioParam, positionZ: AudioParam, forwardX: AudioParam, forwardY: AudioParam, …}
  onstatechange: null
  outputLatency: 0.053
  sampleRate: 48000
  state: "running"
  [[Prototype]]: AudioContext

Payload Type

Capturing traffic between the browser and the server with Brave's built-in developer tools, we can see that the payload sent to http://localhost:9000/api/v1/git/clips contains the string OPUS, which suggests that the audio is transferred as Opus.

EߣBB÷BòBóBwebmBBSgÿÿÿÿÿÿÿI©f*×±B@MChromeWAChromeT®k¿®½×sÅÝ¡"¸0¡A_OPUSc¢OpusHead»áµG;bd C¶uÿÿÿÿÿÿÿç£A0{8_äÁ6øE]êQ%Õ³ÝDúEÁzC\Ù¯p'@¤_ÈØÐÌR:®kv03ï<¸yÐKMåÛwÝ ÙjÓe/`ÿò5PªR0ÀøxùÝ×A?_^æ5!*ÓDÜc0Á@Ч_íã÷Ü\#,v³:Xèår6×võ:­Xn
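
The leading bytes are the EBML magic number 0x1A 0x45 0xDF 0xA3 (the 'Eߣ' prefix is how it renders as text; the 0x1A byte is invisible), i.e. a WebM/Matroska container. A sketch of checking that programmatically on the recorded blob (the helper name is mine):

// Sketch: identify a WebM/Matroska container by its EBML magic (0x1A45DFA3).
async function isWebm(blob: Blob): Promise<boolean> {
  const b = new Uint8Array(await blob.slice(0, 4).arrayBuffer());
  return b[0] === 0x1a && b[1] === 0x45 && b[2] === 0xdf && b[3] === 0xa3;
}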

Getting 96 kHz @ 24 Bits

From what I understand, the sample rate is specified when creating the AudioContext, while the bit depth is determined when creating the MediaRecorder. Both are handled in the AudioWeb class under web/src/components/pages/contribution/speak/audio-web.ts. Another required change is to make sure the browser's audio buffer is in raw mode: web/src/utility.ts::getAudioFormat() picks audio/ogg; codecs=opus as the buffer's internal encoding by default, so let's change the default to audio/wav.

diff --git a/web/src/components/pages/contribution/speak/audio-web.ts b/web/src/components/pages/contribution/speak/audio-web.ts
index 7a6a5cd4b..8931abcf3 100644
--- a/web/src/components/pages/contribution/speak/audio-web.ts
+++ b/web/src/components/pages/contribution/speak/audio-web.ts
@@ -112,7 +116,7 @@ export default class AudioWeb {
* the page is reloaded if the user decides to do so.
*
*/
- async init() {
+ async init(sampleRate: number=96000) {
if (this.isReady()) {
return;
}
@@ -121,7 +125,7 @@ export default class AudioWeb {

this.microphone = microphone;
const audioContext = new (window.AudioContext ||
- window.webkitAudioContext)();
+ window.webkitAudioContext)({sampleRate: sampleRate});
const sourceNode = audioContext.createMediaStreamSource(microphone);
const volumeNode = audioContext.createGain();
const analyzerNode = audioContext.createAnalyser();
@@ -139,7 +143,12 @@ export default class AudioWeb {
analyzerNode.connect(outputNode);

// and set up the recorder.
- this.recorder = new window.MediaRecorder(outputNode.stream);
+ this.recorder = new window.MediaRecorder(
+ outputNode.stream,
+ {
+ mimeType: "audio/webm;codecs=PCM",
+ },
+ );

// Set up the analyzer node, and allocate an array for its data
// FFT size 64 gives us 32 bins. But those bins hold frequencies up to
diff --git a/web/src/utility.ts b/web/src/utility.ts
index a8e0cfd5c..dccc8480a 100644
--- a/web/src/utility.ts
+++ b/web/src/utility.ts
@@ -90,7 +90,8 @@ export function getManageSubscriptionURL(account: UserClient) {
}

export const getAudioFormat = (() => {
- const preferredFormat = 'audio/ogg; codecs=opus';
+ //const preferredFormat = 'audio/ogg; codecs=opus';
+ const preferredFormat = "audio/wav";
const audio = document.createElement('audio');
const format = audio.canPlayType(preferredFormat)
? preferredFormat
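
Note that audio/webm;codecs=PCM is Chrome-specific; Firefox (see the list further down) only offers Opus. Guarding the mime type with MediaRecorder.isTypeSupported, as sketched here, would avoid a hard failure on unsupported browsers:

// Sketch: fall back to Opus where PCM-in-WebM is unsupported (e.g. Firefox).
const pcmType = 'audio/webm;codecs=PCM';
const mimeType = MediaRecorder.isTypeSupported(pcmType)
  ? pcmType
  : 'audio/webm;codecs=opus';
this.recorder = new window.MediaRecorder(outputNode.stream, { mimeType });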

Payload

The client now sends a payload that appears to be PCM (note the A_PCM/FLOAT/IEEE codec ID below).

I©f™*×±ƒsŇ6Å4I©f™*×±ƒ
–
~ƒ†A_PCM/FLOAT/IEEE፵„G»€Ÿ–bd C¶u–ÿÿÿÿÿÿÿ灣O€£O€Dâ–2^V@1I„ @°²Lá:2

Investigation Notes

Where is the default audio format hardcoded?

~/git/Common-Voice/web/src/utility.ts

export const getAudioFormat = (() => {
  const preferredFormat = 'audio/ogg; codecs=opus';
  const audio = document.createElement('audio');
  const format = audio.canPlayType(preferredFormat)
    ? preferredFormat
    : 'audio/wav';
  return function getAudioFormat() {
    return format;
  };
})();

This is where we create the audio blob, which is encoded according to getAudioFormat():

~/git/Common-Voice/web/src/components/pages/contribution/speak/audio-web.ts

stop(): Promise<AudioInfo> {
  if (!this.isReady()) {
    console.error('Cannot stop audio before microphone is ready.');
    return Promise.reject();
  }

  return new Promise((res: Function, rej: Function) => {
    this.jsNode.onaudioprocess = undefined;
    this.recorder.removeEventListener('stop', this.recorderListeners.stop);
    this.recorderListeners.stop = (e: Event) => {
      let blob = new Blob(this.chunks, { type: getAudioFormat() });
      res({
        url: URL.createObjectURL(blob),
        blob: blob,
      });
    };
    this.recorder.addEventListener('stop', this.recorderListeners.stop);
    this.recorder.stop();
  });
}
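
For context (not shown in the excerpt above): this.chunks is presumably filled by a dataavailable listener while recording, roughly:

// Assumed from the MediaRecorder API, not copied from audio-web.ts:
// collect the encoded chunks as the recorder emits them.
this.recorder.addEventListener('dataavailable', (e: BlobEvent) => {
  this.chunks.push(e.data);
});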

I think we are better off not asking for a specific sample rate when transcoding, because we might otherwise upsample the recording.

server/src/lib/clip.ts

const audioOutput = new Transcoder(audioInput)
  .audioCodec(config.TRANSCODE.AUDIO_CODEC)
  .format(config.TRANSCODE.FORMAT)
  .channels(1)
  //.sampleRate(config.TRANSCODE.SAMPLE_RATE)
  .on('error', (error: string) => {
    this.clipSaveError(
      headers,
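
If a ceiling is still wanted rather than dropping the option entirely, a hypothetical helper (not in the codebase) could pass a rate only when it would downsample:

// Hypothetical helper: never upsample. Returns undefined when the source
// rate is already at or below the target, so no .sampleRate() call is made.
// (Detecting the source rate, e.g. by probing the input, is left out here.)
function targetSampleRate(sourceRate: number, maxRate: number): number | undefined {
  return sourceRate > maxRate ? maxRate : undefined;
}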

Supported Codecs in Browser

Audio Formats

Chrome records natively only to .webm files; Firefox, only to .ogg. One workaround is manually encoding to WAV in the client, as sketched below.
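
A minimal sketch of such a manual encoder: wrap raw Float32 samples (e.g. copied out of an AudioBuffer) in a 44-byte RIFF/WAVE header as 16-bit mono PCM.

// Sketch: manual WAV encoding of Float32 samples as 16-bit signed mono PCM.
function encodeWav(samples: Float32Array, sampleRate: number): Blob {
  const bytesPerSample = 2;
  const buffer = new ArrayBuffer(44 + samples.length * bytesPerSample);
  const view = new DataView(buffer);
  const writeString = (offset: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  writeString(0, 'RIFF');
  view.setUint32(4, 36 + samples.length * bytesPerSample, true); // RIFF chunk size
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // audio format: PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeString(36, 'data');
  view.setUint32(40, samples.length * bytesPerSample, true); // data size
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to [-1, 1]
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Blob([buffer], { type: 'audio/wav' });
}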

Script that Lists Supported Codecs

<!DOCTYPE html>
<html lang="en-US">
  <head>
    <meta charset="UTF-8" />
    <title>createMediaStreamDestination() demo</title>
  </head>
  <body>
    <h1>createMediaStreamDestination() demo</h1>

    <p>Encoding a pure sine wave to an Opus file</p>
    <button>Make sine wave</button>
    <audio controls></audio>
    <div id="codecs"></div>

    <script>
      //const mimeType = "audio/ogg; codecs=opus";
      const mimeType = "audio/wav";
      //const mimeType = "audio/pcm;rate=96000";
      //const mimeType = "audio/x-ogg-pcm";
      //const mimeType = "audio/x-ogg-flac";
      //const mimeType = "audio/webm;codecs=PCM";

      const b = document.querySelector("button");
      let clicked = false;
      const chunks = [];
      const ac = new AudioContext({sampleRate: 96000});
      const osc = ac.createOscillator();
      const dest = ac.createMediaStreamDestination();
      const mediaRecorder = new MediaRecorder(
        dest.stream,
        {
          mimeType: "audio/webm;codecs=PCM",
        },
      );
      osc.connect(dest);

      b.addEventListener("click", (e) => {
        if (!clicked) {
          mediaRecorder.start();
          osc.start(0);
          e.target.textContent = "Stop recording";
          clicked = true;
        } else {
          mediaRecorder.stop();
          osc.stop(0);
          e.target.disabled = true;
        }
      });

      mediaRecorder.ondataavailable = (evt) => {
        // Push each chunk (blob) into an array
        console.log(evt.data.type);
        chunks.push(evt.data);
      };

      mediaRecorder.onstop = (evt) => {
        // https://www.digipres.org/formats/mime-types/
        // Make a blob out of our blobs, and open it.
        const blob = new Blob(chunks, { type: mimeType });
        document.querySelector("audio").src = URL.createObjectURL(blob);
      };
    </script>

    <script>
      // https://stackoverflow.com/questions/41739837/all-mime-types-supported-by-mediarecorder-in-firefox-and-chrome/42307926#42307926
      function getSupportedMimeTypes(media, types, codecs) {
        const isSupported = MediaRecorder.isTypeSupported;
        const supported = [];
        types.forEach((type) => {
          const mimeType = `${media}/${type}`;
          codecs.forEach((codec) => [
            `${mimeType};codecs=${codec}`,
            `${mimeType};codecs=${codec.toUpperCase()}`,
            // /!\ false positive /!\
            // `${mimeType};codecs:${codec}`,
            // `${mimeType};codecs:${codec.toUpperCase()}`
          ].forEach(variation => {
            if (isSupported(variation))
              supported.push(variation);
          }));
          if (isSupported(mimeType))
            supported.push(mimeType);
        });
        return supported;
      }

      function displayCodecs(codecs, type) {
        const el = document.getElementById("codecs");

        let node = document.createElement("div");
        el.appendChild(node);

        const title = document.createElement("H1");
        title.innerText = type;
        node.appendChild(title);

        codecs.forEach((codec, id) => {
          const p = document.createElement("p");
          p.innerText = codec;
          if (id === 0) {
            p.setAttribute("style", "color:red");
          }
          node.appendChild(p);
        });
      }

      // Usage ------------------

      const videoTypes = ["webm", "ogg", "mp4", "x-matroska"];
      const audioTypes = ["webm", "ogg", "mp3", "x-matroska"];
      const codecs = ["should-not-be-supported", "vp9", "vp9.0", "vp8", "vp8.0", "avc1", "av1", "h265", "h.265", "h264", "h.264", "opus", "pcm", "aac", "mpeg", "mp4a"];

      const supportedVideos = getSupportedMimeTypes("video", videoTypes, codecs);
      const supportedAudios = getSupportedMimeTypes("audio", audioTypes, codecs);

      displayCodecs(supportedAudios, "Audio");
      displayCodecs(supportedVideos, "Video");
    </script>
  </body>
</html>

Brave

Audio

  • audio/webm;codecs=opus
  • audio/webm;codecs=OPUS
  • audio/webm;codecs=pcm
  • audio/webm;codecs=PCM
  • audio/webm

Video

  • video/webm;codecs=vp9
  • video/webm;codecs=VP9
  • video/webm;codecs=vp9.0
  • video/webm;codecs=VP9.0
  • video/webm;codecs=vp8
  • video/webm;codecs=VP8
  • video/webm;codecs=vp8.0
  • video/webm;codecs=VP8.0
  • video/webm;codecs=avc1
  • video/webm;codecs=AVC1
  • video/webm;codecs=h264
  • video/webm;codecs=H264
  • video/webm;codecs=opus
  • video/webm;codecs=OPUS
  • video/webm;codecs=pcm
  • video/webm;codecs=PCM
  • video/webm
  • video/x-matroska;codecs=vp9
  • video/x-matroska;codecs=VP9
  • video/x-matroska;codecs=vp9.0
  • video/x-matroska;codecs=VP9.0
  • video/x-matroska;codecs=vp8
  • video/x-matroska;codecs=VP8
  • video/x-matroska;codecs=vp8.0
  • video/x-matroska;codecs=VP8.0
  • video/x-matroska;codecs=avc1
  • video/x-matroska;codecs=AVC1
  • video/x-matroska;codecs=h264
  • video/x-matroska;codecs=H264
  • video/x-matroska;codecs=opus
  • video/x-matroska;codecs=OPUS
  • video/x-matroska;codecs=pcm
  • video/x-matroska;codecs=PCM
  • video/x-matroska

Firefox

Audio

  • audio/webm;codecs=opus
  • audio/webm
  • audio/ogg;codecs=opus
  • audio/ogg

Video

  • video/webm;codecs=vp8
  • video/webm;codecs=vp8.0
  • video/webm;codecs=opus
  • video/webm

Audio File without Duration

Marc has noticed that our audio files don't have a valid header: the duration field reads Duration : 06:12:49.62 = 2147483647 samples for every file, regardless of its actual length. That number is 2^31 − 1 (INT32_MAX), i.e. a placeholder written when the real sample count was unknown.

Input File     : '5061f5c3-3bf9-42c6-a268-435c146efaf6/feca9ee51da4347a10d293ca0c4afb1a5fc43120e16d568d987e5839adee4e9f.wav'
Channels       : 1
Sample Rate    : 96000
Precision      : 16-bit
Duration       : 06:12:49.62 = 2147483647 samples ~ 1.67772e+06 CDDA sectors
File Size      : 1.07M
Bit Rate       : 384
Sample Encoding: 16-bit Signed Integer PCM

Note that we've changed how the web client records audio because we wanted a higher sample rate and lossless audio. The web client sends the raw recording, and the server then uses stream-transcoder.js, a wrapper on top of ffmpeg, to unwrap the raw audio into a WAV file.

Someone on Stack Overflow asked FFmpeg converting from video to audio missing duration, which seems related to our current issue; the answer points to it being a problem with streamed files and ffmpeg.

WAV File Format

We are going to check whether writing the audio stream to a temporary file first, and then converting that file, results in a properly populated duration field in the WAV header.
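
A hypothetical sketch of that experiment, assuming stream-transcoder accepts a file path as source (names and error handling elided). If the duration is still wrong afterwards, the same limitation likely applies on the output side, since ffmpeg cannot seek back to patch the header when writing to a pipe either.

import { createWriteStream } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';
import Transcoder from 'stream-transcoder';

// Hypothetical experiment: buffer the incoming audio to a temporary file so
// ffmpeg gets a complete, seekable input before producing the WAV.
async function transcodeViaTempFile(
  audioInput: NodeJS.ReadableStream
): Promise<NodeJS.ReadableStream> {
  const tmpPath = join(tmpdir(), `clip-${Date.now()}.webm`); // assumed container
  await new Promise<void>((resolve, reject) =>
    audioInput
      .pipe(createWriteStream(tmpPath))
      .on('finish', resolve)
      .on('error', reject)
  );
  return new Transcoder(tmpPath)
    .audioCodec('pcm_s16le')
    .format('wav')
    .channels(1)
    .stream();
}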