Audio files must be encoded and packaged in formats that balance quality, size, and compatibility. Consistent encoding parameters ensure accurate recognition and low latency across both synchronous and asynchronous workflows.
Please ensure your audio files conform to the specifications listed below. Let us know if you need help with audio formatting or API request configuration.

Supported Audio Formats

The following audio containers and their associated codecs are supported by the Corti API:
| Container | Supported Encodings | Comments |
| --- | --- | --- |
| Ogg | Opus, Vorbis | Excellent quality at low bandwidth |
| WebM | Opus, Vorbis | Excellent quality at low bandwidth |
| MP4/M4A | AAC, MP3 | Compression may degrade transcription quality |
| MP3 | MP3 | Compression may degrade transcription quality |
WAV files are supported for upload to the /recordings endpoint.
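Before uploading or streaming, it can help to verify that a file actually uses one of the containers above. A minimal client-side sketch, assuming files read from disk (this helper is illustrative and not part of the Corti API):

```python
def sniff_container(data: bytes) -> str:
    """Best-effort detection of a supported audio container from its leading bytes."""
    if data[:4] == b"OggS":
        return "ogg"
    if data[:4] == b"\x1a\x45\xdf\xa3":  # EBML header, used by WebM
        return "webm"
    if data[4:8] == b"ftyp":  # ISO BMFF box header (MP4/M4A)
        return "mp4"
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # MP3: either an ID3 tag or an MPEG frame sync (11 set bits)
    if data[:3] == b"ID3" or (len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0):
        return "mp3"
    return "unknown"
```

A file that sniffs as `unknown` (for example, headerless raw PCM) will not be accepted for streaming or upload.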

Audio streaming recommendations

Sample rate of 16 kHz

A 16 kHz sample rate captures the full range of human speech frequencies; higher rates offer negligible recognition benefit while increasing computational cost.

Audio chunk size of 250 milliseconds

250 ms chunks are optimal for both dictation and AI scribing workflows; sending smaller, more frequent chunks can degrade recognition accuracy without improving latency.

Stream at real-time speed

Audio should be streamed at or near real-time speed. Streaming audio faster than real time is not recommended and may cause buffering issues, degraded results, or stream termination. Pace audio chunks according to their actual audio duration.
Raw audio streams are not supported for audio streaming or file upload.
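The chunking and pacing recommendations above can be sketched as follows, assuming 16-bit / 16 kHz mono PCM capture (chunks must still be encoded into a supported container before sending, and `send_chunk` is a hypothetical placeholder for your streaming transport, e.g. a WebSocket send):

```python
import time

SAMPLE_RATE = 16_000   # 16 kHz, per the recommendation above
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHUNK_SECONDS = 0.25   # 250 ms chunks

# One 250 ms mono chunk at 16 kHz / 16-bit is 8,000 bytes.
CHUNK_BYTES = int(SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS)

def stream_at_realtime(pcm: bytes, send_chunk) -> int:
    """Send 250 ms chunks of PCM, pacing each chunk to its actual audio duration."""
    sent = 0
    for offset in range(0, len(pcm), CHUNK_BYTES):
        chunk = pcm[offset:offset + CHUNK_BYTES]
        send_chunk(chunk)  # hypothetical transport call
        # Sleep for the chunk's real audio duration so the stream never runs ahead.
        time.sleep(len(chunk) / (SAMPLE_RATE * BYTES_PER_SAMPLE))
        sent += 1
    return sent
```

Pacing by the chunk's own duration (rather than a fixed interval) keeps the final, possibly shorter chunk from ending the stream early.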

Microphone Configuration

Dictation

| Setting | Recommendation | Rationale |
| --- | --- | --- |
| echoCancellation | Off | Ensures clear, unfiltered audio from near-field recording. |
| autoGainControl | Off | Manual calibration of microphone gain provides optimal support for consistent dictation patterns (i.e., microphone placement and speaking pattern). Recalibrate when dictation environments change (e.g., moving from a quiet to a noisy environment). Set input gain so average loudness is around –12 dBFS RMS (peaks near –3 dBFS) to prevent audio clipping. |
| noiseSuppression | Mild (-15 dB) | Removes background noise (e.g., HVAC); adjust as needed to optimize for your environment. |
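When calibrating gain manually, the –12 dBFS RMS target can be checked directly on captured samples. A minimal sketch, assuming float PCM samples normalized to ±1.0 (the function names and tolerance are illustrative assumptions):

```python
import math

def rms_dbfs(samples: list[float]) -> float:
    """RMS level in dBFS for float samples normalized to [-1.0, 1.0]."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -math.inf if rms == 0 else 20 * math.log10(rms)

def gain_in_range(samples: list[float], target: float = -12.0, tolerance: float = 3.0) -> bool:
    """True when average loudness sits near the recommended -12 dBFS RMS."""
    return abs(rms_dbfs(samples) - target) <= tolerance
```

A sine wave of peak amplitude `a` has an RMS level of `20*log10(a) - 3.01` dBFS, so a –12 dBFS RMS target corresponds to peaks around –9 dBFS for pure tones; real speech is peakier, which is why peaks near –3 dBFS are still expected.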

Ambient Conversation

| Setting | Recommendation | Rationale |
| --- | --- | --- |
| echoCancellation | On | Suppresses “echo” audio being played by your device speaker, e.g., a remote call participant’s voice or system alert sounds. |
| autoGainControl | On | Adaptively corrects input gain to support the varying loudness and speaking patterns of conversational audio. |
| noiseSuppression | Mild (-15 dB) | Removes background noise (e.g., HVAC); adjust as needed to optimize for your environment. |
Maintain average loudness around –12 dBFS RMS with peaks near –3 dBFS for optimal speech-to-text normalization.

Channel Configuration

Choosing the right channel configuration ensures accurate transcription, speaker separation, and diarization across different use cases.
| Audio type | Workflow | Rationale |
| --- | --- | --- |
| Mono | Dictation or in-room doctor/patient conversation | Speech-to-text models expect a single coherent input source. Using one channel avoids phase cancellation and ensures consistent amplitude. Mono also reduces bandwidth and file size without affecting accuracy. |
| Multichannel (stereo or dual mono) | Telehealth or remote doctor/patient conversations | Assigns each participant to a dedicated channel, allowing the speech-to-text system to perform accurate speaker attribution. Provides better control over noise suppression and improves transcription accuracy when voices overlap. |
Mono input supports transcription with diarization; however, speaker separation may be unreliable when the dialogue lacks clear turn-taking. Multichannel input (two audio channels, one per participant, in a telehealth workflow) enables improved speaker separation and labeling.

{
  "multichannel": true,
  "diarize": true,
  "participants": [
    {
      "channel": 0,
      "role": "doctor"
    },
    {
      "channel": 1,
      "role": "patient"
    }
  ]
}
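The request body above can also be built programmatically. A minimal sketch using the field names shown in the example (the helper itself is illustrative, not part of the Corti API):

```python
import json

def multichannel_config(roles_by_channel: dict[int, str]) -> str:
    """Build the multichannel request body shown above as a JSON string."""
    body = {
        "multichannel": True,
        "diarize": True,
        "participants": [
            {"channel": channel, "role": role}
            for channel, role in sorted(roles_by_channel.items())
        ],
    }
    return json.dumps(body, indent=2)
```

For example, `multichannel_config({0: "doctor", 1: "patient"})` reproduces the configuration above.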

Additional Notes

  • Mono audio produces transcripts with a single channel (-1), whereas dual-mono transcripts have two channels (0, 1).
  • Diarization is typically only required on mono audio.
  • For multichannel audio, each channel should capture only one speaker’s microphone feed in order to avoid cross-talk or echo between channels.
  • Keep all channels aligned in time; do not trim or delay audio streams independently.
  • Use mono capture within each channel (16-bit / 16 kHz PCM) to prevent transcript duplication.
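When a capture device delivers interleaved stereo, it can be split into two time-aligned mono streams (one per speaker) before streaming. A minimal sketch for 16-bit little-endian PCM (an illustrative helper, not a Corti API call):

```python
import struct

def deinterleave_stereo(pcm: bytes) -> tuple[bytes, bytes]:
    """Split interleaved 16-bit stereo PCM (L R L R ...) into two mono streams."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)  # little-endian 16-bit samples
    left = struct.pack(f"<{len(samples) // 2}h", *samples[0::2])
    right = struct.pack(f"<{len(samples) // 2}h", *samples[1::2])
    return left, right
```

Because both outputs come from the same interleaved frames, they stay sample-aligned in time, satisfying the alignment note above.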

Please contact us if you need more information about supported audio formats or are having issues processing an audio file.