Audio files must be encoded and packaged in formats that balance quality, size, and compatibility. Consistent encoding parameters ensure accurate recognition and low latency across both synchronous and asynchronous workflows.
Please ensure your audio files conform to the specifications listed below. Let us know if you need help with audio formatting or API request configuration.

Format Recommendations

| Workflow | Codec | Rationale |
| --- | --- | --- |
| Real-time audio streaming | Opus (16–32 kbps) in a WebM or Ogg container | Excellent quality at low bandwidth. |
| Asynchronous audio file processing | WAV or FLAC, 16-bit PCM (512–1024 kbps) | Lossless fidelity with deterministic decoding. |

Encoding Recommendations

| Parameter | Recommendation | Rationale |
| --- | --- | --- |
| Bit depth | 16-bit PCM | Delivers sufficient dynamic range with low quantization noise, balancing quality with processing efficiency for speech recognition. |
| Sample rate | 16 kHz | Captures the full range of human speech frequencies (up to 8 kHz); higher rates offer negligible recognition benefit while increasing computational cost. |
| Bitrate | Constant bitrate (e.g., Opus at 16–32 kbps for streaming) | Maintains clinical speech intelligibility and transcription precision while minimizing bandwidth and compute load. |

Corti ASR supports file transcoding; however, we recommend 16-bit / 16 kHz audio encoded as Opus (16–32 kbps) for streaming, or WAV/FLAC for offline uploads.
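The encoding targets above can be produced with Python's standard-library `wave` module. A minimal sketch that writes a one-second 16-bit / 16 kHz mono test tone (the file name and function are illustrative):

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # recommended sample rate for speech

def write_test_wav(path, seconds=1):
    """Write a 440 Hz test tone as 16-bit / 16 kHz mono PCM WAV."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)             # mono
        wf.setsampwidth(2)             # 16-bit samples
        wf.setframerate(SAMPLE_RATE)   # 16 kHz
        frames = b"".join(
            struct.pack("<h", int(8191 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)))
            for n in range(SAMPLE_RATE * seconds)
        )
        wf.writeframes(frames)

write_test_wav("tone_16k.wav")
```

Writing a known tone like this is also a convenient way to verify an upload pipeline end to end before sending real dictation audio.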

Other Audio Formats

| Format | Support |
| --- | --- |
| MP3 or M4A | Supported, but not recommended; lossy compression may degrade transcription quality. |
| RAW audio | Not supported at this time. |
| Audio without speech | Recordings of silence or background noise, with no spoken dialogue or dictation, may return a 400 error. |
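Before uploading, you can sanity-check that a file is a PCM WAV with the recommended parameters. A minimal sketch using Python's standard-library `wave` module (the function name is illustrative, and this only covers WAV, not Opus or FLAC):

```python
import wave

def check_wav(path):
    """Return True if the file is a PCM WAV at 16-bit / 16 kHz, mono or dual mono."""
    try:
        with wave.open(path, "rb") as wf:
            return (
                wf.getsampwidth() == 2           # 16-bit
                and wf.getframerate() == 16000   # 16 kHz
                and wf.getnchannels() in (1, 2)  # mono or dual mono
            )
    except (wave.Error, EOFError):
        return False  # not a readable PCM WAV (e.g., RAW or compressed audio)
```

Rejecting non-conforming files client-side avoids avoidable transcoding or 400 errors server-side.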

Microphone Configuration

Dictation

| Setting | Recommendation | Rationale |
| --- | --- | --- |
| echoCancellation | Off | Prevents distortion from near-field recording. |
| autoGainControl | Off | Manual calibration of microphone gain provides the most consistent dictation input (stable microphone placement and speaking pattern). Recalibrate when the dictation environment changes (e.g., moving from a quiet to a noisy room). Set input gain so average loudness is around −12 dBFS RMS (peaks near −3 dBFS) to prevent clipping. |
| noiseSuppression | Mild (−15 dB) | Removes background noise (e.g., HVAC); adjust as needed for your environment. |
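In browser-based capture, these settings map onto `getUserMedia` audio constraints. A sketch of a dictation constraints object (note that browsers expose `noiseSuppression` only as on/off, so the −15 dB target applies where your audio stack offers tunable suppression, and the `sampleRate` constraint is a request, not a guarantee):

```json
{
  "audio": {
    "echoCancellation": false,
    "autoGainControl": false,
    "noiseSuppression": true,
    "channelCount": 1,
    "sampleRate": 16000
  }
}
```

For doctor/patient conversation capture, set `echoCancellation` and `autoGainControl` to `true`, per the recommendations in the next section.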

Doctor/Patient Conversation

| Setting | Recommendation | Rationale |
| --- | --- | --- |
| echoCancellation | On | Suppresses reverberation and improves far-field recording. |
| autoGainControl | On | Adaptively corrects input gain for the varying loudness and speaking patterns of conversational audio. |
| noiseSuppression | Mild (−15 dB) | Removes background noise (e.g., HVAC); adjust as needed for your environment. |

Maintain average loudness around −12 dBFS RMS with peaks near −3 dBFS for optimal ASR normalization.
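The −12 dBFS RMS / −3 dBFS peak targets can be checked programmatically during calibration. A minimal sketch for little-endian 16-bit mono PCM (the function name is illustrative):

```python
import math
import struct

def loudness_dbfs(pcm_bytes):
    """Return (rms_dbfs, peak_dbfs) for little-endian 16-bit mono PCM."""
    n = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % n, pcm_bytes[: n * 2])
    full_scale = 32768.0
    rms = max(math.sqrt(sum(s * s for s in samples) / n), 1.0)  # guard against log(0) on silence
    peak = max(max(abs(s) for s in samples), 1)
    return (20 * math.log10(rms / full_scale),
            20 * math.log10(peak / full_scale))
```

During microphone calibration, record a few seconds of representative speech and adjust input gain until the reported RMS is near −12 dBFS and peaks stay below −3 dBFS.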

Recording Best Practices

Environment

| Factor | Recommendation | Rationale |
| --- | --- | --- |
| Ambient noise | Keep background noise below 40 dBA (quiet office). | Prevents unintended audio from being picked up by speech recognition. |
| Reverberation | Use rooms with carpeted, non-reflective surfaces when possible. | Prevents echo and reverberation that can harm accuracy or diarization. |
| Microphone type | Use directional microphones for dictation and beamforming array microphones for conversations. | Focuses on the primary speaker and suppresses background noise. |
| Microphone placement | Keep the microphone to the side of your mouth so you do not breathe directly into it: 10–20 cm for dictation, or within 1 m for doctor/patient conversation. | Balances clarity and comfort. |
| Laptop microphones | Avoid when possible; prefer external USB, desktop, or wearable/headset mics. | Built-in mics capture keyboard and fan noise. |

Mobile Devices for Audio Recording

iPhones and iPads

Modern iOS devices have high-quality MEMS microphone arrays and can deliver professional ASR results if configured correctly:
  • Use the Voice Memos app or any third-party app (like Corti Assistant) that exports uncompressed WAV, FLAC, or Opus
  • Record in 16-bit / 16 kHz mono PCM
  • Use the microphone on the bottom of the device as the primary microphone (talk towards where you would speak for a phone call, not the screen or top/side array mic)
  • Disable Voice Isolation and Wide Spectrum, as these apply aggressive filters that can distort audio quality
  • Leave system gain fixed (do not rely on iOS loudness compensation) to prevent dynamic gain shifts that disrupt ASR input consistency
  • If possible, use wired or MFi-certified microphones for the best capture quality
Be mindful of cases that may obstruct the microphone!

Android Devices

Android microphone hardware varies widely, but most of the iPhone guidelines above still apply. Prefer external USB or Bluetooth headsets that record 16-bit / 16 kHz mono PCM. If Android is required, please contact us for help selecting the best mobile microphone option.

Channel Configuration

Choosing the right channel configuration ensures accurate transcription, speaker separation, and diarization across different use cases.
| Audio type | Workflow | Rationale |
| --- | --- | --- |
| Mono | Dictation or in-room doctor/patient conversation | Speech recognition models expect a single coherent input source. Using one channel avoids phase cancellation and ensures consistent amplitude. Mono also reduces bandwidth and file size without affecting accuracy. |
| Multichannel (stereo or dual mono) | Telehealth or remote doctor/patient conversations | Assigns each participant to a dedicated channel, allowing the ASR system to perform accurate diarization (speaker attribution). Provides better control over noise suppression and improves transcription accuracy when voices overlap. |
Mono input supports transcription with diarization; however, speaker separation may be unreliable when the dialogue lacks clear turn-taking. Multichannel input (e.g., two audio channels, one per participant, in a telehealth workflow) allows defined speaker labeling per channel. For example:
{
  "multichannel": true,
  "diarize": true,
  "participants": [
    { "channel": 0, "role": "doctor" },
    { "channel": 1, "role": "patient" }
  ]
}

Additional Notes

  • Each channel should capture only one speaker’s microphone feed to avoid cross-talk or echo between channels. Diarization is most reliable with multichannel audio and may be inconsistent with mono audio.
  • Mono audio will show one channel (-1), whereas dual mono will show two channels (0, 1).
  • Keep all channels aligned in time; do not trim or delay audio streams independently.
  • Use mono capture per channel (16-bit / 16 kHz PCM) even when using multichannel containers (e.g., stereo WAV, WebM, or Ogg).
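When a multichannel recording arrives as interleaved stereo PCM, it can be de-interleaved into one mono stream per participant before per-channel processing. A minimal sketch for little-endian 16-bit samples (the function name is illustrative, and well-formed stereo input is assumed):

```python
import struct

def split_channels(stereo_pcm):
    """De-interleave 16-bit interleaved stereo PCM into two mono byte strings.

    Returns (channel 0, channel 1), e.g., (doctor, patient) in a telehealth setup.
    """
    n = len(stereo_pcm) // 2
    samples = struct.unpack("<%dh" % n, stereo_pcm[: n * 2])
    half = n // 2
    ch0 = struct.pack("<%dh" % half, *samples[0::2])  # even sample indices
    ch1 = struct.pack("<%dh" % half, *samples[1::2])  # odd sample indices
    return ch0, ch1
```

Because the two channels come from the same interleaved frames, this preserves time alignment automatically, consistent with the note above about not trimming or delaying streams independently.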

Please contact us if you need more information about supported audio formats or are having issues processing an audio file.