# Audio Configuration

> Learn about file types and codecs supported by the Corti API

Audio files must be encoded and packaged in formats that balance quality, size, and compatibility. Consistent encoding parameters ensure accurate recognition and low latency across both synchronous and asynchronous workflows. The API supports containerized audio formats (such as Ogg and WebM) as well as raw PCM audio streams.

<Tip>
  Please ensure your audio files conform to the specifications listed below. Let us know if you need [help](mailto:help@corti.ai) with audio formatting or API request configuration.
</Tip>

***

## Supported Audio Formats

### Container-based Audio

The following audio containers and their associated codecs are supported by the Corti API:

| Container | Supported Encodings | Comments                                      |
| :-------- | :------------------ | :-------------------------------------------- |
| Ogg       | Opus, Vorbis        | Excellent quality at low bandwidth            |
| WebM      | Opus, Vorbis        | Excellent quality at low bandwidth            |
| MP4/M4A   | AAC, MP3            | Compression may degrade transcription quality |
| MP3       | MP3                 | Compression may degrade transcription quality |

<Accordion title="Allowable MIME types for streamed audio">
  <Note>This parameter is optional but recommended.</Note>

  The `audioFormat` parameter can be set in the `transcribe` and `streams` configuration to declare the audio format that the speech-to-text system should expect in the incoming audio stream.

  | Format    | Accepted MIME types                      |
  | --------- | ---------------------------------------- |
  | Ogg       | `audio/ogg`                              |
  | WebM      | `audio/webm`                             |
  | Opus      | `audio/opus`                             |
  | Vorbis    | `audio/vorbis`                           |
  | MP3       | `audio/mpeg`, `audio/mp3`, `audio/mpeg3` |
  | FLAC      | `audio/flac`                             |
  | M4A / AAC | `audio/mp4`, `audio/m4a`                 |

  For container formats (`audio/ogg`, `audio/webm`), you can optionally specify a codec parameter. Allowed codecs are `opus` and `vorbis`.

  Examples:

  ```
  audio/ogg; codecs=opus
  audio/webm; codecs=opus
  audio/ogg; codecs=vorbis
  ```

  <Callout icon="circle-check" color="green">
    WAV files are supported for upload to the `/recordings` endpoint, but raw PCM audio should follow the approach outlined below.
  </Callout>
</Accordion>
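
The MIME type rules above can be sketched as a small helper. This is a minimal illustration, not part of any Corti SDK: the function name `container_audio_format` is hypothetical, and the allowed values are copied from the table and codec list in the accordion above.

```python
from typing import Optional

# Values taken from this page; the helper itself is a hypothetical sketch.
CONTAINER_MIME_TYPES = {"audio/ogg", "audio/webm"}
ALLOWED_CODECS = {"opus", "vorbis"}


def container_audio_format(mime_type: str, codec: Optional[str] = None) -> str:
    """Build an audioFormat string such as 'audio/ogg; codecs=opus'."""
    if mime_type not in CONTAINER_MIME_TYPES:
        raise ValueError(f"codec parameter applies only to {sorted(CONTAINER_MIME_TYPES)}")
    if codec is None:
        # The codec parameter is optional, so the bare MIME type is valid.
        return mime_type
    if codec not in ALLOWED_CODECS:
        raise ValueError(f"codec must be one of {sorted(ALLOWED_CODECS)}")
    return f"{mime_type}; codecs={codec}"
```

The returned string is what you would pass as the `audioFormat` value in your configuration.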

### Raw Audio

Raw pulse-code modulation (PCM) audio is supported when the `rate`, `channels`, and `bits` parameters are defined in the configuration.

<Accordion title="Allowable MIME types for raw audio configuration">
  <Note>This parameter is required for use with raw PCM audio.</Note>

  The `audioFormat` parameter can be set in the `transcribe` and `streams` configuration to declare the audio format that the speech-to-text system should expect in the incoming audio stream.

  | Format  | Accepted MIME types |
  | ------- | ------------------- |
  | Raw PCM | `audio/pcm`         |

  For raw audio (`audio/pcm`), the parameters `rate`, `channels`, and `bits` must be defined.

  | Parameter | Type | Required   | Possible Values          |
  | --------- | ---- | ---------- | ------------------------ |
  | rate      | int  | `required` | `8000-48000`             |
  | channels  | int  | `required` | `1-2`                    |
  | bits      | int  | `required` | `8`, `16`, `24`, or `32` |
  | endian    | str  | `optional` | `little`, `big`          |
  | encoding  | str  | `optional` | `sint`, `uint`           |

  Examples:

  ```
  audio/pcm; rate=16000; channels=1; bits=16
  audio/pcm; rate=44100; channels=2; bits=32
  audio/pcm; rate=8000; channels=1; bits=8
  audio/pcm; rate=48000; channels=2; bits=24
  audio/pcm; rate=16000; channels=1; bits=16; endian=little; encoding=sint
  ```

  <Callout icon="circle-check" color="green">
    When using raw PCM audio, 16-bit little-endian mono at 16 kHz is recommended.
  </Callout>
</Accordion>
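
As a sketch of how the raw PCM parameters combine into one `audioFormat` string, the helper below validates each value against the table above before formatting it. The function name `pcm_audio_format` is hypothetical, and the checks reflect only the ranges listed on this page; the API may apply further validation.

```python
from typing import Optional

ALLOWED_BITS = (8, 16, 24, 32)
ALLOWED_ENDIAN = ("little", "big")
ALLOWED_ENCODING = ("sint", "uint")


def pcm_audio_format(rate: int, channels: int, bits: int,
                     endian: Optional[str] = None,
                     encoding: Optional[str] = None) -> str:
    """Build an audioFormat string for raw PCM audio (hypothetical helper)."""
    if not 8000 <= rate <= 48000:
        raise ValueError("rate must be between 8000 and 48000")
    if channels not in (1, 2):
        raise ValueError("channels must be 1 or 2")
    if bits not in ALLOWED_BITS:
        raise ValueError(f"bits must be one of {ALLOWED_BITS}")

    parts = ["audio/pcm", f"rate={rate}", f"channels={channels}", f"bits={bits}"]
    # endian and encoding are optional and only appended when given.
    if endian is not None:
        if endian not in ALLOWED_ENDIAN:
            raise ValueError(f"endian must be one of {ALLOWED_ENDIAN}")
        parts.append(f"endian={endian}")
    if encoding is not None:
        if encoding not in ALLOWED_ENCODING:
            raise ValueError(f"encoding must be one of {ALLOWED_ENCODING}")
        parts.append(f"encoding={encoding}")
    return "; ".join(parts)
```

For the recommended format, `pcm_audio_format(16000, 1, 16, endian="little", encoding="sint")` reproduces the last example string in the accordion above.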

### Audio streaming recommendations

<Card title="Sample rate of 16 kHz">Captures the full range of human speech frequencies, with **higher rates offering negligible recognition benefit** but increasing computational cost.</Card>

<Card title="Audio chunk size of 250 milliseconds">Balances responsiveness for both dictation and AI scribing workflows. **Sending much smaller chunks more frequently can degrade recognition accuracy** without improving latency.</Card>

<Card title="Stream at real-time speed">Audio should be streamed at or near real-time speed. **Streaming audio faster than real time is not recommended** and may cause buffering issues, degraded results, or stream termination. Pace audio chunks according to their actual audio duration.</Card>
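
To make the chunk size and pacing recommendations concrete, here is a minimal sketch for raw PCM input. The function names are hypothetical and the actual send call is left to your client; the only logic shown is the byte arithmetic (250 ms of 16 kHz, 16-bit mono audio is 8000 bytes) and a wait equal to each chunk's audio duration.

```python
import time
from typing import Callable, Iterator


def chunk_size_bytes(rate: int, channels: int, bits: int, chunk_ms: int = 250) -> int:
    """Number of bytes holding chunk_ms milliseconds of raw PCM audio."""
    bytes_per_frame = channels * bits // 8
    frames_per_chunk = rate * chunk_ms // 1000
    return frames_per_chunk * bytes_per_frame


def paced_chunks(pcm: bytes, rate: int = 16000, channels: int = 1, bits: int = 16,
                 chunk_ms: int = 250,
                 sleep: Callable[[float], None] = time.sleep) -> Iterator[bytes]:
    """Yield chunks of pcm, pausing so output runs at roughly real-time speed."""
    size = chunk_size_bytes(rate, channels, bits, chunk_ms)
    bytes_per_second = rate * channels * bits // 8
    for start in range(0, len(pcm), size):
        chunk = pcm[start:start + size]
        yield chunk
        # Wait for this chunk's own audio duration before producing the next one.
        sleep(len(chunk) / bytes_per_second)
```

A caller would iterate `for chunk in paced_chunks(audio): ...` and send each chunk as it arrives. The `sleep` argument is injectable so the pacing can be tested without waiting; a production client would also want to compensate for time spent sending.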

***

## Microphone Configuration

### Dictation

| Setting          | Recommendation | Rationale                                                                                                                                                                                                                                                                                                                                                                         |
| :--------------- | :------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| echoCancellation | Off            | Ensure clear, unfiltered audio from near-field recording.                                                                                                                                                                                                                                                                                                                         |
| autoGainControl  | Off            | Manual calibration of microphone gain level provides optimal support for consistent dictation patterns (i.e., microphone placement and speaking pattern). Recalibrate when dictation environments change (e.g., moving from a quiet to noisy environment). Recommend setting input gain with average loudness around –12 dBFS RMS (peaks near –3 dBFS) to prevent audio clipping. |
| noiseSuppression | Mild (-15dB)   | Removes background noise (e.g., HVAC); adjust as needed to optimize for your environment.                                                                                                                                                                                                                                                                                         |

### Ambient Conversation

| Setting          | Recommendation | Rationale                                                                                                                        |
| :--------------- | :------------- | :------------------------------------------------------------------------------------------------------------------------------- |
| echoCancellation | On             | Suppresses echo from audio played through your device speaker, e.g., a remote call participant's voice or system alert sounds.   |
| autoGainControl  | On             | Adaptive correction of input gain to support varying loudness and speaking patterns of conversational audio.                     |
| noiseSuppression | Mild (-15dB)   | Removes background noise (e.g., HVAC); adjust as needed to optimize for your environment.                                        |

<Check>Maintain average loudness around –12 dBFS RMS with peaks near –3 dBFS for optimal speech-to-text normalization.</Check>
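
One way to check a recording against the loudness targets above is to measure RMS and peak level directly from the samples. The sketch below assumes signed 16-bit PCM samples and uses the convention in which a full-scale square wave measures 0 dBFS RMS (some tools instead reference a full-scale sine, which shifts RMS readings by about 3 dB). The function name `levels_dbfs` is hypothetical.

```python
import math
from typing import Sequence, Tuple

FULL_SCALE_16BIT = 32768.0  # magnitude of the most negative 16-bit sample


def _to_db(ratio: float) -> float:
    """Convert an amplitude ratio to decibels; silence maps to -inf."""
    return 20 * math.log10(ratio) if ratio > 0 else float("-inf")


def levels_dbfs(samples: Sequence[int]) -> Tuple[float, float]:
    """Return (rms_dbfs, peak_dbfs) for signed 16-bit PCM samples."""
    if not samples:
        raise ValueError("no samples to measure")
    peak = max(abs(s) for s in samples) / FULL_SCALE_16BIT
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / FULL_SCALE_16BIT
    return _to_db(rms), _to_db(peak)
```

If the RMS value sits well below –12 dBFS, or the peak approaches 0 dBFS, the input gain likely needs adjusting.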

***

## Channel Configuration

Choosing the right channel configuration ensures accurate transcription, speaker separation, and diarization across different use cases.

| Audio type               | Workflow                                          | Rationale                                                                                                                                                                                                                            |
| :----------------------- | :------------------------------------------------ | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Mono                     | Dictation or in-room doctor/patient conversation  | Speech-to-text models expect a single coherent input source. Mono also reduces bandwidth and file size without affecting accuracy.                                                                                                   |
| Multichannel (dual mono) | Telehealth or remote doctor/patient conversations | Assigns each participant to a dedicated channel, allowing the speech-to-text system to perform accurate speaker attribution. Provides better control over noise suppression and improves transcription accuracy when voices overlap. |

<Tip>
  Mono input supports transcription with diarization; however, speaker separation may be unreliable when the dialogue lacks clear turn-taking.

  Multichannel input (two audio channels, one per participant, in a telehealth workflow) provides an opportunity for improved speaker separation and labeling.
</Tip>

<br />

### Streams Endpoint

<AccordionGroup>
  <Accordion title="Mono Audio Stream with Diarization">
    ```json highlight={6-10} theme={null}
    {
      "type": "config",
      "configuration": {
        "transcription": {
          "primaryLanguage": "en",
          "diarize": true,
          "isMultichannel": false,
          "participants": [
            {"channel": 0, "role": "multiple"}
          ]
        },
        "mode": {
          "type": "facts",
          "outputLocale": "en"
        }
      }
    }
    ```
  </Accordion>

  <Accordion title="Multichannel Audio Stream with Participants">
    ```json highlight={6-11} theme={null}
    {
      "type": "config",
      "configuration": {
        "transcription": {
          "primaryLanguage": "en",
          "diarize": false,
          "isMultichannel": true,
          "participants": [
            {"channel": 0, "role": "doctor"},
            {"channel": 1, "role": "patient"}
          ]
        },
        "mode": {
          "type": "facts",
          "outputLocale": "en"
        }
      }
    }
    ```
  </Accordion>

  <Accordion title="Diarization Disabled">
    ```json highlight={6-8} theme={null}
    {
      "type": "config",
      "configuration": {
        "transcription": {
          "primaryLanguage": "en",
          "diarize": false,
          "isMultichannel": false,
          "participants": []
        },
        "mode": {
          "type": "facts",
          "outputLocale": "en"
        }
      }
    }
    ```
  </Accordion>
</AccordionGroup>
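
If you build the configuration message in code rather than by hand, the three variants above differ only in `diarize`, `isMultichannel`, and `participants`. The sketch below captures that; the preset names and the `stream_config` function are hypothetical, and the field values are copied from the examples on this page.

```python
import copy
import json
from typing import Any, Dict

# Preset names are hypothetical; values mirror the three examples above.
_PRESETS: Dict[str, Dict[str, Any]] = {
    "mono_diarized": {
        "diarize": True,
        "isMultichannel": False,
        "participants": [{"channel": 0, "role": "multiple"}],
    },
    "multichannel": {
        "diarize": False,
        "isMultichannel": True,
        "participants": [
            {"channel": 0, "role": "doctor"},
            {"channel": 1, "role": "patient"},
        ],
    },
    "mono_plain": {"diarize": False, "isMultichannel": False, "participants": []},
}


def stream_config(preset: str, language: str = "en",
                  output_locale: str = "en") -> Dict[str, Any]:
    """Return a config message for one of the channel setups shown above."""
    transcription: Dict[str, Any] = {"primaryLanguage": language}
    # Deep copy so callers can edit the result without changing the presets.
    transcription.update(copy.deepcopy(_PRESETS[preset]))
    return {
        "type": "config",
        "configuration": {
            "transcription": transcription,
            "mode": {"type": "facts", "outputLocale": output_locale},
        },
    }


# Serialize to the JSON text that would be sent as the config message.
message = json.dumps(stream_config("multichannel"))
```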

### Transcripts Endpoint

<AccordionGroup>
  <Accordion title="Mono Audio File with Diarization">
    ```json highlight={5-9} theme={null}
    {
      "recordingId": "uuid",
      "primaryLanguage": "en",
      "spokenPunctuation": true,
      "isMultichannel": false,
      "diarize": true,
      "participants": [
        {"channel": 0, "role": "multiple"}
      ]
    }
    ```
  </Accordion>

  <Accordion title="Multichannel Audio File with Participants">
    ```json highlight={5-10} theme={null}
    {
      "recordingId": "uuid",
      "primaryLanguage": "en",
      "spokenPunctuation": true,
      "isMultichannel": true,
      "diarize": false,
      "participants": [
        {"channel": 0, "role": "doctor"},
        {"channel": 1, "role": "patient"}
      ]
    }
    ```
  </Accordion>

  <Accordion title="Audio File Diarization Disabled">
    ```json highlight={5-7} theme={null}
    {
      "recordingId": "uuid",
      "primaryLanguage": "en",
      "spokenPunctuation": true,
      "isMultichannel": false,
      "diarize": false,
      "participants": []
    }
    ```
  </Accordion>
</AccordionGroup>
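
Before submitting a request body like those above, a quick client-side sanity check can catch mismatched channel settings. The rules below are inferred from the examples on this page and are not the API's own validation; the function name `channel_config_warnings` is hypothetical.

```python
from typing import Any, Dict, List


def channel_config_warnings(body: Dict[str, Any]) -> List[str]:
    """Return human-readable warnings for a transcript request body."""
    warnings: List[str] = []
    multichannel = bool(body.get("isMultichannel"))
    diarize = bool(body.get("diarize"))
    channels = [p["channel"] for p in body.get("participants", [])]

    if len(channels) != len(set(channels)):
        warnings.append("a channel is assigned to more than one participant")
    if not multichannel and any(c != 0 for c in channels):
        warnings.append("mono audio only has channel 0")
    if multichannel and len(channels) < 2:
        warnings.append("multichannel audio should list one participant per channel")
    if multichannel and diarize:
        warnings.append("diarization is typically only needed for mono audio")
    return warnings
```

An empty list means the body is consistent with the patterns shown here; it does not guarantee the request will be accepted.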

### Additional Notes

* Enabling diarization is typically required only for mono audio.
* Mono audio with diarization disabled produces transcripts with one channel (-1), whereas diarized mono transcripts have two channels (0, 1).
* For multichannel audio, each channel should capture only one participant’s microphone feed to avoid cross-talk or echo between channels and duplicated transcript content.
* Keep all channels aligned in time; do not trim or delay audio streams independently.
* The recommended capture format is 16-bit, 16 kHz PCM.
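
If each participant is captured as a separate 16-bit mono stream, the two can be combined into one dual-mono stream while keeping the channels time-aligned. The sketch below assumes an interleaved sample layout (channel 0 sample, then channel 1 sample, per frame), which is the conventional arrangement for multichannel PCM but is not stated on this page. The function name `interleave_dual_mono` is hypothetical.

```python
from array import array


def interleave_dual_mono(channel0: bytes, channel1: bytes) -> bytes:
    """Interleave two 16-bit mono PCM streams into one 2-channel stream."""
    first = array("h")
    first.frombytes(channel0)
    second = array("h")
    second.frombytes(channel1)

    # Pad the shorter stream with silence so both channels stay time-aligned
    # instead of trimming either one.
    frames = max(len(first), len(second))
    first.extend([0] * (frames - len(first)))
    second.extend([0] * (frames - len(second)))

    # Even positions carry channel 0, odd positions carry channel 1.
    interleaved = array("h", bytes(4 * frames))
    interleaved[0::2] = first
    interleaved[1::2] = second
    return interleaved.tobytes()
```

The result would be described with `channels=2` in the `audioFormat` string. Because 16-bit samples are moved as whole units, the byte order of the inputs is preserved in the output.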

<br />

<Note>
  Please [contact us](mailto:help@corti.ai) if you need more information about supported audio formats or are having issues processing an audio file.

  Additional references and resources:

  * [Wikipedia - Audio file format](https://en.wikipedia.org/wiki/Audio_file_format)
  * [Wikipedia - Audio bit depth](https://en.wikipedia.org/wiki/Audio_bit_depth)
  * [Hugging Face - Introduction to audio data](https://huggingface.co/learn/audio-course/chapter1/audio_data)
</Note>