Introduction

Corti speech recognition is designed specifically for the healthcare domain. The API endpoints detailed below provide access to different speech recognition functionality; select the endpoint that best matches your needs and use case.

Please review the languages page to learn more about the languages supported per endpoint, the functionality available per language tier, and the language codes to use in API requests.

Corti speech recognition endpoints

| Endpoint    | Description                                                                                                                    |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------ |
| Transcribe  | Stateless, real-time speech-to-text and commands used to power dictation workflows                                              |
| Stream      | Real-time transcript generation and fact extraction to power intelligent ambient documentation and decision support workflows   |
| Transcripts | Speech-to-text via batch audio file processing, supporting dictation or conversational transcripts                              |
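
For example, a batch request to the Transcripts endpoint might look like the sketch below. The base URL, path, and request fields shown are illustrative assumptions, not the documented contract; consult the API reference for the actual request format.

```python
# Hypothetical sketch only: the base URL, path, and field names are
# assumptions for illustration, not the documented API contract.
import requests

BASE_URL = "https://api.example.invalid"  # assumption: your environment's base URL
headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}

with open("encounter.wav", "rb") as audio:
    response = requests.post(
        f"{BASE_URL}/transcripts",        # assumed path for the Transcripts endpoint
        headers=headers,
        files={"file": audio},
        data={"languageCode": "en"},      # see the languages page for valid codes
    )

response.raise_for_status()
# Transcripts is asynchronous, so the response would reference a job to
# poll rather than the finished transcript itself.
print(response.json())
```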

Endpoint functionality

| Feature                   | Transcribe  | Stream                    | Transcripts            |
| ------------------------- | ----------- | ------------------------- | ---------------------- |
| Connection                | WSS         | WSS                       | REST                   |
| Data processing           | Synchronous | Synchronous               | Asynchronous           |
| Architecture              | Stateless   | Stateful                  | Stateful               |
| Speech-to-text            | Verbatim    | Conversational transcript | Verbatim or transcript |
| Diarization               | No          | Optional                  | Optional               |
| Multichannel              | No          | Optional                  | Optional               |
| Custom command definition | Yes         | No                        | No                     |
| Automatic punctuation     | Optional    | Yes                       | Optional               |
| Spoken punctuation        | Optional    | No                        | Optional               |
| Smart formatting          | Coming soon | Coming soon               | Coming soon            |
| Custom dictionary         | Coming soon | Coming soon               | Coming soon            |
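
As the table shows, Transcribe and Stream use WebSocket (WSS) connections. A minimal sketch of opening a Transcribe session follows; the URL, configuration message, and response shape are assumptions for illustration only.

```python
# Hypothetical sketch only: the URL, config message, and response shape
# are assumptions for illustration, not the documented API contract.
import asyncio
import json
import websockets  # pip install websockets (>= 13 for additional_headers)

async def dictate(chunks):
    url = "wss://api.example.invalid/transcribe"  # assumed URL
    headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(json.dumps({"languageCode": "en"}))  # assumed config message
        for chunk in chunks:
            await ws.send(chunk)  # raw audio bytes
        print(json.loads(await ws.recv()))  # first transcript message

# A production client would send audio and receive results concurrently.
asyncio.run(dictate([b"\x00" * 3200]))  # placeholder audio frame
```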

Please contact us if you are interested in features that are not listed here, need help determining the best speech recognition endpoint for your needs, or have questions about how to configure your API requests.

Model training and refinement

Architecture

The Corti speech recognition pipeline is based on a combination of model architectures. The main workhorse is an encoder-decoder architecture with byte pair encoding (BPE), similar to Whisper. There is also a connectionist temporal classification (CTC)-based architecture that works in tandem with the encoder model.

The combination of these two modeling paradigms provides the following benefits (one common way to combine them is sketched after this list):

  • Reduced potential for hallucinations
  • Faster predictions
  • Configurable latency to balance response time and accuracy
  • Simpler integration with existing systems
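
One widely used way to pair the two models, offered here as an illustration rather than a description of Corti's exact decoder, is joint CTC/attention scoring during beam search: each hypothesis is ranked by an interpolation of the two log-probabilities, so the alignment-constrained CTC score can veto hypotheses the attention decoder would otherwise hallucinate.

```python
# Illustrative sketch of joint CTC/attention scoring; the interpolation
# weight and probabilities are assumptions chosen for the example.
import math

def joint_score(log_p_ctc, log_p_att, ctc_weight=0.3):
    """Interpolated log-probability of a beam-search hypothesis."""
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att

# The CTC term penalizes hypotheses that align poorly with the audio:
hyp_a = joint_score(math.log(0.20), math.log(0.60))
hyp_b = joint_score(math.log(0.01), math.log(0.70))
print(hyp_a > hyp_b)  # True: hyp_a wins despite a lower attention score
```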

Fine-tune and evaluate

Model and architecture fine-tuning happens in two ways:

  1. Decoder training: Language modeling/terminology learning
  2. Encoder training: Auditory model learning
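
As a rough illustration of the first mode, the PyTorch sketch below freezes a toy encoder and updates only the decoder, so gradient steps adjust language and terminology behavior while the auditory model stays fixed. The model here is a stand-in, not Corti's architecture.

```python
# Toy sketch of decoder-only fine-tuning; the model is an illustrative
# stand-in, not the production architecture.
import torch
import torch.nn as nn

class TinySpeechModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=1000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, vocab)  # stand-in for a real decoder

    def forward(self, feats):
        enc_out, _ = self.encoder(feats)
        return self.decoder(enc_out)

model = TinySpeechModel()

# Freeze the encoder: only decoder weights receive gradients, which is
# the language-modeling/terminology side of fine-tuning.
for p in model.encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

feats = torch.randn(2, 50, 80)              # dummy batch of audio features
targets = torch.randint(0, 1000, (2, 50))   # dummy token targets
loss = nn.functional.cross_entropy(
    model(feats).reshape(-1, 1000), targets.reshape(-1)
)
loss.backward()
optimizer.step()
```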

As a starting point, open-source models supply the baseline parameters, which are then refined through a variety of training methodologies. The data sets used for fine-tuning and training are kept separate from those used for validation. Assessment methodologies include, but are not limited to, the following:

  • Word error rate (WER)
  • Character error rate (CER)
  • Medical term accuracy rate
  • Levenshtein distance
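
As a minimal sketch of two of these metrics: the word-level Levenshtein distance below also yields WER, and running the same computation over characters yields CER.

```python
# Minimal sketch: Levenshtein distance over words, and WER derived from it.
def levenshtein(ref, hyp):
    """Edit distance between two token sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

# One substitution out of five reference words -> WER of 0.2:
print(wer("the patient denies chest pain", "the patient denies chess pain"))
```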