The Corti speech recognition pipeline is based on a combination of model architectures. The main workhorse is an encoder-decoder architecture with byte pair encoding (BPE), similar to Whisper. There is also a connectionist temporal classification (CTC)-based architecture that works in tandem with the encoder model. The combination of these two modeling paradigms provides the following benefits:
Reduce potential hallucinations
Improve speed of predictions
Configure latency to balance response time and accuracy
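The source does not describe how the two models are combined. As a minimal sketch of one common way to pair an encoder-decoder model with a CTC model, the example below reranks candidate transcripts using a weighted sum of the two models' log-probabilities. The `Hypothesis` class, the `ctc_weight` parameter, and all scores are hypothetical and purely illustrative; they are not Corti's implementation or API.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    decoder_logprob: float  # log-probability from the encoder-decoder model (made-up value)
    ctc_logprob: float      # log-probability of the same text under the CTC model (made-up value)


def combined_score(hyp: Hypothesis, ctc_weight: float = 0.3) -> float:
    """Interpolate the two log-probabilities; ctc_weight is an assumed tuning knob."""
    return (1.0 - ctc_weight) * hyp.decoder_logprob + ctc_weight * hyp.ctc_logprob


def rerank(hyps: list[Hypothesis], ctc_weight: float = 0.3) -> list[Hypothesis]:
    """Return hypotheses sorted from best to worst combined score."""
    return sorted(hyps, key=lambda h: combined_score(h, ctc_weight), reverse=True)


# Illustrative candidates: the second one contains a repetition loop that the
# decoder alone scores slightly higher, but that the CTC model penalises heavily.
candidates = [
    Hypothesis("patient has chest pain", decoder_logprob=-4.0, ctc_logprob=-5.0),
    Hypothesis("patient has chest pain and and and and", decoder_logprob=-3.5, ctc_logprob=-20.0),
]

best = rerank(candidates)[0]
print(best.text)
```

In this sketch, the CTC score acts as a check tied to the audio frames: a fluent but unsupported continuation from the decoder is down-weighted, which is one way a combination like this could reduce hallucinations. Setting `ctc_weight` to 0 falls back to the decoder score alone.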
The data sets used for model training and fine-tuning are different from those used for validation. Assessment methodologies include, but are not limited to, the following: