Supported Audio Formats
The following audio containers and their associated codecs are supported by the Corti API:

| Container | Supported Encodings | Comments |
|---|---|---|
| Ogg | Opus, Vorbis | Excellent quality at low bandwidth |
| WebM | Opus, Vorbis | Excellent quality at low bandwidth |
| MP4/M4A | AAC, MP3 | Compression may degrade transcription quality |
| MP3 | MP3 | Compression may degrade transcription quality |
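When capturing audio in a browser, one common approach is to probe the recorder for the first container/codec combination it supports, preferring the Opus-based formats from the table above. The sketch below is an illustration, not part of the Corti API: `pickSupportedMimeType` is a hypothetical helper, written with an injected predicate so the selection logic is testable outside a browser (in a browser you would pass `MediaRecorder.isTypeSupported`).

```typescript
// Hypothetical helper: return the first candidate the given predicate
// accepts, or undefined if none match.
function pickSupportedMimeType(
  candidates: string[],
  isSupported: (mimeType: string) => boolean,
): string | undefined {
  return candidates.find(isSupported);
}

// Preference order mirrors the table above: Opus first, then AAC/MP3.
const preferredMimeTypes = [
  "audio/ogg;codecs=opus",
  "audio/webm;codecs=opus",
  "audio/mp4",
  "audio/mpeg",
];

// Browser usage (sketch):
//   const mimeType = pickSupportedMimeType(preferredMimeTypes, MediaRecorder.isTypeSupported);
//   const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
```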
In addition to the formats defined above, WAV files are supported for upload to the /recordings endpoint. We recommend a sample rate of 16 kHz, which captures the full range of human speech frequencies; higher rates offer negligible recognition benefit while increasing computational cost. Raw audio streams are not supported at this time.

Microphone Configuration
Dictation
| Setting | Recommendation | Rationale |
|---|---|---|
| echoCancellation | Off | Ensure clear, unfiltered audio from near-field recording. |
| autoGainControl | Off | Manual calibration of microphone gain level provides optimal support for consistent dictation patterns (i.e., microphone placement and speaking pattern). Recalibrate when dictation environments change (e.g., moving from a quiet to noisy environment). Recommend setting input gain with average loudness around –12 dBFS RMS (peaks near –3 dBFS) to prevent audio clipping. |
| noiseSuppression | Mild (-15dB) | Removes background noise (e.g., HVAC); adjust as needed to optimize for your environment. |
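The dictation recommendations above map onto browser capture constraints roughly as sketched below. This is an assumption-laden illustration, not a Corti SDK snippet: the standard constraint API only exposes on/off toggles, so the -15 dB noise-suppression strength and the manual gain calibration must be handled elsewhere in your capture pipeline. `openDictationMic` is a hypothetical helper, guarded so the sketch also loads outside a browser.

```typescript
// Audio constraints matching the dictation table above (sketch).
const dictationAudioConstraints = {
  echoCancellation: false, // off: keep near-field audio unfiltered
  autoGainControl: false,  // off: input gain is calibrated manually
  noiseSuppression: true,  // on: browser-level suppression (strength is not tunable here)
  channelCount: 1,         // mono capture for dictation
  sampleRate: 16000,       // 16 kHz, per the recommendation above
};

// Hypothetical browser usage, guarded for non-browser environments.
async function openDictationMic() {
  const nav = (globalThis as any).navigator;
  if (!nav?.mediaDevices) return undefined;
  return nav.mediaDevices.getUserMedia({ audio: dictationAudioConstraints });
}
```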
Ambient Conversation
| Setting | Recommendation | Rationale |
|---|---|---|
| echoCancellation | On | Suppresses “echo” audio that is being played by your device speaker, e.g. remote call participant’s voice + system alert sounds. |
| autoGainControl | On | Adaptive correction of input gain to support varying loudness and speaking patterns of conversational audio. |
| noiseSuppression | Mild (-15dB) | Removes background noise (e.g., HVAC); adjust as needed to optimize for your environment. |
Maintain average loudness around –12 dBFS RMS with peaks near –3 dBFS for optimal speech-to-text normalization.
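To check captured audio against the -12 dBFS RMS / -3 dBFS peak targets above, you can measure loudness directly on a block of float samples. The helpers below are a minimal sketch (hypothetical function names, assuming samples normalized to [-1, 1]):

```typescript
// Average loudness of a sample block in dBFS (RMS).
function rmsDbfs(samples: Float32Array): number {
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  const rms = Math.sqrt(sumSquares / samples.length);
  return 20 * Math.log10(rms);
}

// Peak level of a sample block in dBFS.
function peakDbfs(samples: Float32Array): number {
  let peak = 0;
  for (const s of samples) peak = Math.max(peak, Math.abs(s));
  return 20 * Math.log10(peak);
}
```

For example, a block whose samples all sit at 0.25 of full scale measures roughly -12 dBFS RMS, i.e. on target; readings well above that suggest the input gain should be reduced to avoid clipping.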
Channel Configuration
Choosing the right channel configuration ensures accurate transcription, speaker separation, and diarization across different use cases.

| Configuration | Use case | Rationale |
|---|---|---|
| Mono | Dictation or in-room doctor/patient conversation | Speech-to-text models expect a single coherent input source. Using one channel avoids phase cancellation and ensures consistent amplitude. Mono also reduces bandwidth and file size without affecting accuracy. |
| Multichannel (stereo or dual mono) | Telehealth or remote doctor/patient conversations | Assigns each participant to a dedicated channel, allowing the speech-to-text system to perform accurate diarization (speaker attribution). Provides better control over noise suppression and improves transcription accuracy when voices overlap. |
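For the multichannel case, remote-call audio often arrives as a single interleaved stereo buffer (L, R, L, R, …). A sketch of splitting it into one mono buffer per participant, so each speaker occupies a dedicated channel (`deinterleaveStereo` is a hypothetical helper, not part of the Corti API):

```typescript
// Split interleaved stereo PCM into two mono channels,
// one per call participant.
function deinterleaveStereo(interleaved: Float32Array): [Float32Array, Float32Array] {
  const frames = interleaved.length / 2;
  const left = new Float32Array(frames);
  const right = new Float32Array(frames);
  for (let i = 0; i < frames; i++) {
    left[i] = interleaved[2 * i];      // channel 0: e.g. local clinician mic
    right[i] = interleaved[2 * i + 1]; // channel 1: e.g. remote patient feed
  }
  return [left, right];
}
```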
Multichannel configuration
Additional Notes
- Mono audio will produce transcripts with one channel (-1), whereas dual-mono transcripts will have two channels (0, 1).
- Diarization is most reliable with multichannel audio; with mono audio it may be inconsistent, depending on the audio quality and conversation participants.
- For multichannel audio, each channel should capture only one speaker’s microphone feed in order to avoid cross-talk or echo between channels.
- Keep all channels aligned in time; do not trim or delay audio streams independently.
- Use mono capture within each channel (16-bit / 16 kHz PCM) to prevent transcript duplication.
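Browser capture typically yields float samples, so producing the 16-bit PCM noted above requires a conversion step. A minimal sketch, assuming input samples normalized to [-1, 1] (`floatTo16BitPcm` is a hypothetical helper name):

```typescript
// Convert float samples in [-1, 1] to 16-bit signed PCM,
// clamping out-of-range values to avoid integer wraparound.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Scale asymmetrically: Int16 ranges from -32768 to 32767.
    out[i] = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
  }
  return out;
}
```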
Please contact us if you need more information about supported audio formats or are having issues processing an audio file.