Architecture
The Corti speech recognition pipeline combines two model architectures. The main workhorse is an encoder-decoder architecture with byte pair encoding (BPE), similar to Whisper. A connectionist temporal classification (CTC)-based architecture works in tandem with the encoder model. Combining the two modeling paradigms provides the following benefits (a sketch of the combined scoring follows this list):
- Reduced risk of hallucinations
- Faster predictions
- Configurable latency, balancing response time against accuracy
- Straightforward integration with the system
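The exact combination logic is not described here. As one illustration of how a CTC score can temper an encoder-decoder's output, the sketch below interpolates the two scores when ranking hypotheses, so text with weak frame-level acoustic support is penalised. The hypotheses, scores, and the ctc_weight value are invented for the example and are not taken from the production pipeline.

```python
# Minimal sketch (not Corti's implementation): combining scores from an
# attention-based encoder-decoder and a CTC head to rank hypotheses.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    decoder_logprob: float  # log-probability from the encoder-decoder (BPE tokens)
    ctc_logprob: float      # log-probability of the same text under the CTC head

def joint_score(hyp: Hypothesis, ctc_weight: float = 0.3) -> float:
    """Interpolate the two model scores; the CTC term penalises text that is
    not supported frame-by-frame by the audio, which helps suppress
    hallucinated output."""
    return (1.0 - ctc_weight) * hyp.decoder_logprob + ctc_weight * hyp.ctc_logprob

# Illustrative hypotheses with made-up scores.
hypotheses = [
    Hypothesis("the patient reports chest pain", -4.2, -5.1),
    Hypothesis("the patient reports chess pain", -4.0, -9.8),  # weak CTC support
]

best = max(hypotheses, key=joint_score)
print(best.text)  # -> "the patient reports chest pain"
```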
Finetune
Model and architecture finetuning happens in two ways (a sketch of this split follows the list):
- Decoder training: Language modeling/terminology learning
- Encoder training: Auditory model learning
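As a generic illustration of this split, the sketch below freezes an encoder-decoder model and unfreezes only the part being finetuned: the decoder for language and terminology adaptation, or the encoder for auditory adaptation. The toy PyTorch model, its layer sizes, and the set_finetune_mode helper are assumptions for illustration, not the production training code.

```python
import torch.nn as nn

class ToySpeechModel(nn.Module):
    """Stand-in for an encoder-decoder ASR model; sizes are arbitrary."""
    def __init__(self, n_mels: int = 80, d_model: int = 256, vocab: int = 8000):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_mels, d_model), nn.ReLU(),
                                     nn.Linear(d_model, d_model))
        self.decoder = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                     nn.Linear(d_model, vocab))

def set_finetune_mode(model: ToySpeechModel, mode: str) -> None:
    """Freeze everything, then unfreeze only the part being finetuned."""
    for p in model.parameters():
        p.requires_grad = False
    target = model.decoder if mode == "decoder" else model.encoder
    for p in target.parameters():
        p.requires_grad = True

model = ToySpeechModel()
set_finetune_mode(model, "decoder")   # terminology / language adaptation
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only decoder.* parameters remain trainable
```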
Evaluate
The data sets used for model finetuning and training are separate from those used for validation. Assessment methodologies include, but are not limited to, the following (a worked example of the distance-based metrics follows the list):
- Word error rate (WER)
- Character error rate (CER)
- Medical term accuracy rate
- Levenshtein distance
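WER and CER can both be computed from the Levenshtein (edit) distance between a reference transcript and a hypothesis, over words and characters respectively. The sketch below is a generic illustration of those definitions, not the evaluation harness itself; the transcripts are invented.

```python
def levenshtein(ref: list, hyp: list) -> int:
    """Edit distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference length."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return levenshtein(list(reference), list(hypothesis)) / len(reference)

# Invented example: one substituted medical term.
ref = "patient denies dyspnea"
hyp = "patient denies dyspepsia"
print(f"WER: {wer(ref, hyp):.2f}, CER: {cer(ref, hyp):.2f}")
```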
See some of our supporting research here: