Overview of Corti speech recognition
Learn about the speech recognition endpoints in the Corti API
Introduction
Corti speech recognition is designed specifically for use in the healthcare domain. The API endpoints detailed below each expose different speech recognition functionality, so it is important to select the endpoint that matches your needs and use case.
Corti speech recognition endpoints
| Endpoint | Description |
|---|---|
| Transcribe | Stateless, real-time speech-to-text and commands used to power dictation workflows |
| Stream | Real-time transcript generation and fact extraction to power intelligent ambient documentation and decision support workflows |
| Transcripts | Speech-to-text via batch audio file processing, supporting dictation or conversational transcripts |
Endpoint functionality
| Feature | Transcribe | Stream | Transcripts |
|---|---|---|---|
| Connection | WSS | WSS | REST |
| Data processing | Synchronous | Synchronous | Asynchronous |
| Architecture | Stateless | Stateful | Stateful |
| Speech-to-text | Verbatim | Conversational transcript | Verbatim or transcript |
| Diarization | No | Optional | Optional |
| Multichannel | No | Optional | Optional |
| Custom command definition | Yes | No | No |
| Automatic punctuation | Optional | Yes | Optional |
| Spoken punctuation | Optional | No | Optional |
| Smart formatting | Coming soon | Coming soon | Coming soon |
| Custom dictionary | Coming soon | Coming soon | Coming soon |
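To make the connection styles concrete, here is a minimal sketch contrasting the synchronous WSS endpoints with the asynchronous REST endpoint. The URLs, query-parameter token auth, and message formats below are assumptions for illustration only, not the documented Corti interface; consult the API reference for the real values.

```python
import json

import requests    # REST client for the Transcripts endpoint
import websockets  # WSS client for the Transcribe / Stream endpoints

API_KEY = "YOUR_API_KEY"  # placeholder credential

async def transcribe_realtime(audio_chunks):
    """Send audio frames over a WebSocket and print each result as it arrives."""
    # Hypothetical URL and auth scheme; the real endpoint and message
    # format are defined in the API reference.
    uri = f"wss://api.corti.example/transcribe?token={API_KEY}"
    async with websockets.connect(uri) as ws:
        for chunk in audio_chunks:
            await ws.send(chunk)                # binary audio frame
            print(json.loads(await ws.recv()))  # synchronous: result per frame

def submit_batch_transcript(path):
    """Upload a recorded audio file for asynchronous batch processing."""
    # Hypothetical URL; the asynchronous REST model returns a job that
    # is polled until the transcript is ready.
    url = "https://api.corti.example/transcripts"
    with open(path, "rb") as f:
        resp = requests.post(
            url,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
        )
    return resp.json()

# Run the real-time sketch with: asyncio.run(transcribe_realtime(frames)),
# where frames is any iterable of raw audio byte chunks.
```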
Model training and refinement
Architecture
The Corti speech recognition pipeline is based on a combination of model architectures. The main workhorse is an encoder-decoder architecture with byte pair encoding (BPE), similar to Whisper. There is also a connectionist temporal classification (CTC)-based architecture that works in tandem with the encoder model.
Combining these two modeling paradigms provides the following benefits:
- Reduced potential for hallucinations
- Faster predictions
- Configurable latency to balance response time against accuracy
- Easier system integration
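As an illustration of the encoder-plus-CTC pairing described above, the sketch below attaches both a frame-synchronous CTC head and an autoregressive BPE decoder to a shared encoder. All module choices and sizes are invented for the example and do not reflect Corti's production models.

```python
import torch
import torch.nn as nn

VOCAB = 1000   # assumed BPE vocabulary size
D_MODEL = 256  # assumed model width

class HybridASR(nn.Module):
    def __init__(self):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # CTC path: one projection per encoder frame (+1 for the CTC blank)
        self.ctc_head = nn.Linear(D_MODEL, VOCAB + 1)
        # Attention-decoder path: autoregressive over BPE tokens, as in Whisper
        dec_layer = nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.out = nn.Linear(D_MODEL, VOCAB)

    def forward(self, audio_feats, prev_tokens):
        memory = self.encoder(audio_feats)       # (batch, frames, D_MODEL)
        ctc_logits = self.ctc_head(memory)       # fast, frame-level predictions
        mask = nn.Transformer.generate_square_subsequent_mask(prev_tokens.size(1))
        dec = self.decoder(self.embed(prev_tokens), memory, tgt_mask=mask)
        dec_logits = self.out(dec)               # token-by-token predictions
        return ctc_logits, dec_logits

feats = torch.randn(1, 50, D_MODEL)       # 50 frames of acoustic features
tokens = torch.randint(0, VOCAB, (1, 8))  # 8 previously decoded BPE tokens
ctc_logits, dec_logits = HybridASR()(feats, tokens)
print(ctc_logits.shape, dec_logits.shape)  # (1, 50, 1001) and (1, 8, 1000)
```

The CTC head emits a prediction per encoder frame without waiting for the decoder, which is what enables faster, lower-latency output, while the attention decoder refines the token sequence.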
Fine-tune and evaluate
Model and architecture fine-tuning happens in two ways:
- Decoder training: Language modeling/terminology learning
- Encoder training: Auditory model learning
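A minimal sketch of what these two modes can look like in practice, using toy stand-in modules (the real architecture differs): decoder training freezes the acoustic encoder and updates the language side, and encoder training does the reverse.

```python
import torch.nn as nn

# Toy stand-ins for the two halves of the pipeline; illustrative only.
encoder = nn.GRU(80, 256, batch_first=True)  # auditory model
decoder = nn.Linear(256, 1000)               # language/terminology model

def freeze(module):
    """Exclude a module's parameters from gradient updates."""
    for p in module.parameters():
        p.requires_grad = False

# Decoder training: refine terminology while the acoustics stay fixed.
freeze(encoder)
# Encoder training would instead freeze(decoder) and update the auditory model.
```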
As a starting point, open-source models supply the baseline parameters, which are then refined using a variety of methodologies. The data sets used for fine-tuning and training are kept separate from those used for validation. Assessment methodologies include, but are not limited to, the following:
- Word error rate (WER)
- Character error rate (CER)
- Medical term accuracy rate
- Levenshtein distance
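For illustration, the sketch below computes Levenshtein distance over word sequences and derives WER from it; CER is the same computation over characters instead of words. This is the generic textbook formulation, not Corti's evaluation tooling.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance normalised by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / len(ref)

# One substituted word out of four reference words -> WER of 0.25
print(wer("patient denies chest pain", "patient denies chess pain"))
```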
See some of our supporting research here: