# Audio Configuration

> Learn about file types and codecs supported by the Corti API

Audio files must be encoded and packaged in formats that balance quality, size, and compatibility. Consistent encoding parameters ensure accurate recognition and low latency across both synchronous and asynchronous workflows.

<Tip>
  Please ensure your audio files conform to the specifications listed below. Let us know if you need [help](https://help.corti.app) with audio formatting or API request configuration.
</Tip>

***

## Supported Audio Formats

The following audio containers and their associated codecs are supported by the Corti API:

| Container | Supported Encodings | Comments                                      |
| :-------- | :------------------ | :-------------------------------------------- |
| Ogg       | Opus, Vorbis        | Excellent quality at low bandwidth            |
| WebM      | Opus, Vorbis        | Excellent quality at low bandwidth            |
| MP4/M4A   | AAC, MP3            | Compression may degrade transcription quality |
| MP3       | MP3                 | Compression may degrade transcription quality |

<Accordion title="Allowable MIME types for configuration">
  Set the `audioFormat` parameter in `transcribe` and `streams` configuration to declare the audio format the speech-to-text system should expect in the incoming audio stream.

  | Format    | Accepted MIME types                      |
  | --------- | ---------------------------------------- |
  | Ogg       | `audio/ogg`                              |
  | WebM      | `audio/webm`                             |
  | Opus      | `audio/opus`                             |
  | Vorbis    | `audio/vorbis`                           |
  | MP3       | `audio/mpeg`, `audio/mp3`, `audio/mpeg3` |
  | FLAC      | `audio/flac`                             |
  | M4A / AAC | `audio/mp4`, `audio/m4a`                 |

  <Note>This parameter is optional but recommended.</Note>

  * For container formats (`audio/ogg`, `audio/webm`), you can optionally specify a `codecs` parameter. Allowed codecs are `flac`, `opus`, and `vorbis`.

  For example:

  ```
  audio/ogg; codecs=opus
  audio/webm; codecs=opus
  audio/ogg; codecs=vorbis
  ```
</Accordion>
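
The MIME type rules above can be sketched as a small validation helper. This is an illustrative sketch, not part of the Corti API: `build_audio_format` is a hypothetical function name, and the choice to reject a codec on non-container types is an assumption made here to stay within what the table documents.

```python
# Accepted values are copied from the tables above; the helper itself is hypothetical.
ACCEPTED_MIME_TYPES = {
    "audio/ogg", "audio/webm", "audio/opus", "audio/vorbis",
    "audio/mpeg", "audio/mp3", "audio/mpeg3",
    "audio/flac", "audio/mp4", "audio/m4a",
}
CONTAINER_MIME_TYPES = {"audio/ogg", "audio/webm"}
ALLOWED_CODECS = {"flac", "opus", "vorbis"}


def build_audio_format(mime_type, codec=None):
    """Return an audioFormat string such as 'audio/ogg; codecs=opus'."""
    mime_type = mime_type.strip().lower()
    if mime_type not in ACCEPTED_MIME_TYPES:
        raise ValueError(f"unsupported MIME type: {mime_type}")
    if codec is None:
        return mime_type
    codec = codec.strip().lower()
    # Assumption: only the container types documented above take a codecs parameter.
    if mime_type not in CONTAINER_MIME_TYPES:
        raise ValueError(f"{mime_type} does not take a codecs parameter")
    if codec not in ALLOWED_CODECS:
        raise ValueError(f"unsupported codec: {codec}")
    return f"{mime_type}; codecs={codec}"
```

Validating the string on the client before opening a stream surfaces a typo such as `audio/wav` immediately instead of as a failed request.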

<Callout icon="circle-check" color="green">
  In addition to the above, WAV files are supported for upload to the `/recordings` endpoint.
</Callout>

### Audio Streaming Recommendations

<Card title="Sample rate of 16 kHz">Captures the full range of human speech frequencies; **higher rates offer negligible recognition benefit** but increase computational cost.</Card>

<Card title="Audio chunk size of 250 milliseconds">Optimal chunk size to support both dictation and AI scribing workflows; **chunking at faster rates can degrade recognition accuracy** without improving latency.</Card>

<Card title="Stream at real-time speed">Audio should be streamed at or near real-time speed. **Streaming audio faster than real time is not recommended** and may cause buffering issues, degraded results, or stream termination. Pace audio chunks according to their actual audio duration.</Card>
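
Real-time pacing can be sketched as a loop that schedules each chunk against the stream start time. The sketch below is an assumption-laden illustration, not Corti client code: `send_paced` and its `send` callable (for example, a WebSocket send method) are hypothetical, and each chunk is assumed to hold 250 ms of already-encoded audio.

```python
import time


def send_paced(chunks, send, chunk_seconds=0.25,
               clock=time.monotonic, sleep=time.sleep):
    """Send each chunk no earlier than its position in the audio timeline.

    `send` is a hypothetical callable that transmits one encoded chunk.
    Assumes every chunk represents `chunk_seconds` of audio.
    """
    start = clock()
    sent = 0
    for index, chunk in enumerate(chunks):
        # Time at which this chunk would occur if the audio were captured live.
        due = start + index * chunk_seconds
        delay = due - clock()
        if delay > 0:
            sleep(delay)
        send(chunk)
        sent += 1
    return sent
```

Scheduling against the start time, rather than sleeping a fixed 250 ms after every send, keeps the stream from drifting when an individual send call is slow.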

<Callout icon="circle-x" color="red">
  Raw audio streams are not supported for audio streaming or file upload.
</Callout>

***

## Microphone Configuration

### Dictation

| Setting          | Recommendation | Rationale                                                                                                                                                                                                                                                                                                                                                                         |
| :--------------- | :------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| echoCancellation | Off            | Ensure clear, unfiltered audio from near-field recording.                                                                                                                                                                                                                                                                                                                         |
| autoGainControl  | Off            | Manual calibration of microphone gain level provides optimal support for consistent dictation patterns (i.e., microphone placement and speaking pattern). Recalibrate when dictation environments change (e.g., moving from a quiet to noisy environment). Recommend setting input gain with average loudness around –12 dBFS RMS (peaks near –3 dBFS) to prevent audio clipping. |
| noiseSuppression | Mild (-15dB)   | Removes background noise (e.g., HVAC); adjust as needed to optimize for your environment.                                                                                                                                                                                                                                                                                         |

### Ambient Conversation

| Setting          | Recommendation | Rationale                                                                                                                        |
| :--------------- | :------------- | :------------------------------------------------------------------------------------------------------------------------------- |
| echoCancellation | On             | Suppresses "echo" audio played by your device speaker (e.g., a remote call participant's voice or system alert sounds).          |
| autoGainControl  | On             | Adaptive correction of input gain to support varying loudness and speaking patterns of conversational audio.                     |
| noiseSuppression | Mild (-15dB)   | Removes background noise (e.g., HVAC); adjust as needed to optimize for your environment.                                        |

<Check>Maintain average loudness around –12 dBFS RMS with peaks near –3 dBFS for optimal speech-to-text normalization.</Check>
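
The loudness targets above can be checked offline on a short test recording. The sketch below is not part of the Corti API; `rms_dbfs` and `peak_dbfs` are hypothetical helper names. It assumes signed 16-bit PCM samples and the convention in which a full-scale square wave reads 0 dBFS RMS, so a full-scale sine reads about –3 dBFS. Tools that use the sine-referenced convention report values about 3 dB higher.

```python
import math

FULL_SCALE = 32768.0  # magnitude of the most negative signed 16-bit sample


def rms_dbfs(samples):
    """RMS level of signed 16-bit samples in dBFS (square-wave reference)."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square / FULL_SCALE ** 2)


def peak_dbfs(samples):
    """Level of the largest absolute sample in dBFS."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")
    return 20.0 * math.log10(peak / FULL_SCALE)
```

The difference between the –12 dBFS target and the measured RMS value gives the approximate gain change, in dB, to apply at the microphone input.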

***

## Channel Configuration

Choosing the right channel configuration ensures accurate transcription, speaker separation, and diarization across different use cases.

| Audio type                         | Workflow                                          | Rationale                                                                                                                                                                                                                            |
| :--------------------------------- | :------------------------------------------------ | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Mono                               | Dictation or in-room doctor/patient conversation  | Speech-to-text models expect a single coherent input source. Using one channel avoids phase cancellation and ensures consistent amplitude. Mono also reduces bandwidth and file size without affecting accuracy.                     |
| Multichannel (stereo or dual mono) | Telehealth or remote doctor/patient conversations | Assigns each participant to a dedicated channel, allowing the speech-to-text system to perform accurate speaker attribution. Provides better control over noise suppression and improves transcription accuracy when voices overlap. |

<Tip>
  Mono input supports transcription with diarization; however, speaker separation may be unreliable when there is no clear turn-taking in the dialogue.

  Multichannel input (two audio channels, one per participant, in a telehealth workflow) provides an opportunity for improved speaker separation and labeling.
</Tip>

<br />

### Streams Endpoint

<AccordionGroup>
  <Accordion title="Mono Audio Stream with Diarization">
    ```json highlight={6-10} theme={null}
    {
      "type": "config",
      "configuration": {
        "transcription": {
          "primaryLanguage": "en",
          "isDiarization": true,
          "isMultichannel": false,
          "participants": [
            {"channel": 0, "role": "multiple"}
          ]
        },
        "mode": {
          "type": "facts",
          "outputLocale": "en"
        }
      }
    }
    ```
  </Accordion>

  <Accordion title="Multichannel Audio Stream with Participants">
    ```json highlight={6-11} theme={null}
    {
      "type": "config",
      "configuration": {
        "transcription": {
          "primaryLanguage": "en",
          "isDiarization": false,
          "isMultichannel": true,
          "participants": [
            {"channel": 0, "role": "doctor"},
            {"channel": 1, "role": "patient"}
          ]
        },
        "mode": {
          "type": "facts",
          "outputLocale": "en"
        }
      }
    }
    ```
  </Accordion>

  <Accordion title="Diarization Disabled">
    ```json highlight={6-8} theme={null}
    {
      "type": "config",
      "configuration": {
        "transcription": {
          "primaryLanguage": "en",
          "isDiarization": false,
          "isMultichannel": false,
          "participants": []
        },
        "mode": {
          "type": "facts",
          "outputLocale": "en"
        }
      }
    }
    ```
  </Accordion>
</AccordionGroup>
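
The three configurations above differ only in the diarization flag, the multichannel flag, and the participant list. A minimal sketch of a builder that derives them from a list of per-channel roles follows. `build_stream_config` is a hypothetical helper, not a Corti SDK function; the field names are taken from the examples above, and reusing the language value for `outputLocale` is an assumption made for brevity.

```python
def build_stream_config(roles, diarize=False, language="en"):
    """Build a `config` message shaped like the examples above.

    roles: one role per audio channel, e.g. ["doctor", "patient"].
    An empty list yields the "diarization disabled" shape.
    """
    # More than one role means one participant per channel.
    multichannel = len(roles) > 1
    participants = [
        {"channel": index, "role": role} for index, role in enumerate(roles)
    ]
    return {
        "type": "config",
        "configuration": {
            "transcription": {
                "primaryLanguage": language,
                "isDiarization": diarize,
                "isMultichannel": multichannel,
                "participants": participants,
            },
            # Assumption: output locale mirrors the primary language.
            "mode": {"type": "facts", "outputLocale": language},
        },
    }
```

For example, `build_stream_config(["multiple"], diarize=True)` reproduces the mono configuration with diarization, and `build_stream_config(["doctor", "patient"])` reproduces the multichannel one.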

### Transcripts Endpoint

<AccordionGroup>
  <Accordion title="Mono Audio File with Diarization">
    ```json highlight={5-9} theme={null}
    {
      "recordingId": "uuid",
      "primaryLanguage": "en",
      "isDictation": true,
      "isMultichannel": false,
      "diarize": true,
      "participants": [
        {"channel": 0, "role": "multiple"}
      ]
    }
    ```
  </Accordion>

  <Accordion title="Multichannel Audio File with Participants">
    ```json highlight={5-10} theme={null}
    {
      "recordingId": "uuid",
      "primaryLanguage": "en",
      "isDictation": true,
      "isMultichannel": true,
      "diarize": false,
      "participants": [
        {"channel": 0, "role": "doctor"},
        {"channel": 1, "role": "patient"}
      ]
    }
    ```
  </Accordion>

  <Accordion title="Audio File Diarization Disabled">
    ```json highlight={5-7} theme={null}
    {
      "recordingId": "uuid",
      "primaryLanguage": "en",
      "isDictation": true,
      "isMultichannel": false,
      "diarize": false,
      "participants": []
    }
    ```
  </Accordion>
</AccordionGroup>

### Additional Notes

* Enabling diarization is typically required only for mono audio.
* Mono audio with diarization disabled produces transcripts with a single channel (`-1`), whereas diarized mono transcripts have two channels (`0`, `1`).
* For multichannel audio, each channel should capture only one speaker’s microphone feed in order to avoid cross-talk or echo between channels.
* Keep all channels aligned in time; do not trim or delay audio streams independently.
* Use mono capture within each channel (16-bit / 16 kHz PCM) to prevent transcript duplication.
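
As an illustration of the multichannel notes above, the sketch below interleaves two mono 16-bit / 16 kHz captures into one two-channel WAV file, the format accepted for upload to `/recordings`. It uses only the Python standard library. `write_dual_mono_wav` is a hypothetical helper, and padding the shorter capture with trailing silence is an assumption chosen so that neither channel is trimmed or shifted.

```python
import io
import struct
import wave


def write_dual_mono_wav(doctor, patient, target, sample_rate=16000):
    """Interleave two mono 16-bit sample sequences into one two-channel WAV.

    The shorter sequence is padded with trailing silence so both channels
    keep the same start time and length. Returns the number of frames written.
    """
    length = max(len(doctor), len(patient))
    left = list(doctor) + [0] * (length - len(doctor))
    right = list(patient) + [0] * (length - len(patient))
    frames = bytearray()
    for left_sample, right_sample in zip(left, right):
        # Little-endian signed 16-bit: channel 0 first, then channel 1.
        frames += struct.pack("<hh", left_sample, right_sample)
    with wave.open(target, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return length
```

`target` can be a file path or a writable binary file object such as `io.BytesIO()`. Both inputs are assumed to start at the same instant; if they do not, align them before interleaving rather than delaying one channel afterwards.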

<br />

<Note>
  Please [contact us](https://help.corti.app/en/articles/10860711-how-to-get-support-at-corti) if you need more information about supported audio formats or are having issues processing an audio file.

  Additional references and resources:

  * [Wikipedia - Audio file format](https://en.wikipedia.org/wiki/Audio_file_format)
  * [Wikipedia - Audio bit depth](https://en.wikipedia.org/wiki/Audio_bit_depth)
  * [Hugging Face - Introduction to audio data](https://huggingface.co/learn/audio-course/chapter1/audio_data)
</Note>
