Architecture

The Corti speech recognition pipeline combines multiple model architectures. The main workhorse is an encoder-decoder architecture with byte pair encoding (BPE), similar to Whisper. A connectionist temporal classification (CTC)-based architecture works in tandem with the encoder. Combining these two modeling paradigms provides the following benefits (a minimal sketch of how the two heads can be combined follows the list):
  • Reduced potential for hallucinations
  • Faster predictions
  • Configurable latency to balance response time and accuracy
  • Easier integration with the system
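As an illustration only (not Corti's actual implementation), the PyTorch sketch below shows one common way such a hybrid is built: a shared encoder feeds both a frame-level CTC head and an autoregressive attention decoder, and the two losses are interpolated during training. The module names, layer sizes, and the ctc_weight value are assumptions for demonstration purposes.

  # Hypothetical sketch: an encoder shared by a CTC head and an attention decoder,
  # trained with an interpolated loss. Shapes and hyperparameters are illustrative.
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class HybridASRModel(nn.Module):
      def __init__(self, n_mels=80, d_model=256, vocab_size=1000, blank_id=0):
          super().__init__()
          self.blank_id = blank_id
          # Shared acoustic encoder (stand-in for a full Transformer encoder stack).
          self.encoder = nn.Sequential(
              nn.Linear(n_mels, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
          )
          # CTC head: frame-level predictions over the BPE vocabulary (incl. blank).
          self.ctc_head = nn.Linear(d_model, vocab_size)
          # Attention decoder: autoregressive predictions conditioned on the encoder.
          layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
          self.decoder = nn.TransformerDecoder(layer, num_layers=2)
          self.embed = nn.Embedding(vocab_size, d_model)
          self.out = nn.Linear(d_model, vocab_size)

      def forward(self, feats, feat_lens, tokens, token_lens, ctc_weight=0.3):
          enc = self.encoder(feats)                        # (B, T, d_model)
          # CTC branch: per-frame log-probs; F.ctc_loss expects (T, B, V).
          ctc_logp = F.log_softmax(self.ctc_head(enc), dim=-1).transpose(0, 1)
          ctc_loss = F.ctc_loss(ctc_logp, tokens, feat_lens, token_lens,
                                blank=self.blank_id, zero_infinity=True)
          # Attention branch: teacher-forced cross-entropy over BPE tokens
          # (causal masking and BOS/EOS shifting omitted for brevity).
          dec = self.decoder(self.embed(tokens), enc)
          att_loss = F.cross_entropy(self.out(dec).transpose(1, 2), tokens)
          # Interpolate the two objectives; the CTC term constrains alignment,
          # which is one way hybrid systems curb hallucinated output.
          return ctc_weight * ctc_loss + (1 - ctc_weight) * att_loss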

Finetune

Model and architecture finetuning happens in two ways:
  1. Decoder training: Language modeling/terminology learning
  2. Encoder training: Auditory model learning
As a starting point, the base model's parameters are used as the baseline upon which various methodologies are employed to refine the parameters of other models.
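As a hedged illustration of these two modes, the sketch below assumes a model that exposes encoder and decoder submodules (such as the hypothetical HybridASRModel above) and freezes one half while the other is updated. The function name, optimizer, and learning rate are assumptions, not a documented Corti API.

  # Hypothetical sketch of the two finetuning modes described above.
  import torch

  def set_finetune_mode(model, mode="decoder"):
      """Freeze one half of the model so only the other half is updated.

      mode="decoder": adapt language modeling / terminology (e.g. domain vocabulary)
                      while keeping the acoustic encoder fixed.
      mode="encoder": adapt the auditory model (e.g. new acoustic conditions)
                      while keeping the decoder fixed.
      """
      for p in model.parameters():
          p.requires_grad = False
      trainable = model.decoder if mode == "decoder" else model.encoder
      for p in trainable.parameters():
          p.requires_grad = True
      # Optimize only the unfrozen parameters.
      return torch.optim.AdamW(
          (p for p in model.parameters() if p.requires_grad), lr=1e-5
      )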

Evaluate

Validation uses different data sets from those used for model finetuning and training. Assessment methodologies include, but are not limited to, the following (see the sketch after the list):
  • Word error rate (WER)
  • Character error rate (CER)
  • Medical term accuracy rate
  • Levenshtein distance
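The string-distance metrics above are standard, so a minimal, self-contained Python sketch of the Levenshtein distance and the WER and CER derived from it is shown below. Text normalization and medical-term matching, which a production evaluation would typically apply, are omitted.

  # Illustrative metric implementations; not the Corti evaluation pipeline.
  def levenshtein(ref, hyp):
      """Minimum number of substitutions, insertions, and deletions."""
      d = list(range(len(hyp) + 1))
      for i, r in enumerate(ref, 1):
          prev, d[0] = d[0], i
          for j, h in enumerate(hyp, 1):
              prev, d[j] = d[j], min(d[j] + 1,         # insertion
                                     d[j - 1] + 1,     # deletion
                                     prev + (r != h))  # substitution
      return d[-1]

  def wer(reference, hypothesis):
      """Word error rate: word-level edit distance over reference length."""
      ref_words = reference.split()
      return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)

  def cer(reference, hypothesis):
      """Character error rate: character-level edit distance over reference length."""
      return levenshtein(list(reference), list(hypothesis)) / max(len(reference), 1)

  # Example: one substituted word out of four reference words -> WER of 0.25.
  print(wer("patient denies chest pain", "patient denies chest pains"))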