Architecture

The Corti speech recognition pipeline combines multiple model architectures. The main workhorse is an encoder-decoder architecture with byte pair encoding (BPE), similar to Whisper. A connectionist temporal classification (CTC)-based architecture works in tandem with the encoder. Combining these two modeling paradigms provides the following benefits (a minimal sketch of how the two heads can be combined follows the list):
  • Reduced potential for hallucinations
  • Faster predictions
  • Configurable latency to balance response time and accuracy
  • Easier integration with the system
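As an illustration only (not Corti's actual implementation), the PyTorch sketch below shows one common way such a hybrid is built: a shared encoder feeds both a frame-level CTC head and an autoregressive attention decoder, and the two losses are interpolated during training. The module names, layer sizes, and the ctc_weight value are assumptions for demonstration purposes.

  # Hypothetical sketch: an encoder shared by a CTC head and an attention decoder,
  # trained with an interpolated loss. Shapes and hyperparameters are illustrative.
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class HybridASRModel(nn.Module):
      def __init__(self, n_mels=80, d_model=256, vocab_size=1000, blank_id=0):
          super().__init__()
          self.blank_id = blank_id
          # Shared acoustic encoder (stand-in for a full Transformer encoder stack).
          self.encoder = nn.Sequential(
              nn.Linear(n_mels, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
          )
          # CTC head: frame-level predictions over the BPE vocabulary (incl. blank).
          self.ctc_head = nn.Linear(d_model, vocab_size)
          # Attention decoder: autoregressive predictions conditioned on the encoder.
          layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
          self.decoder = nn.TransformerDecoder(layer, num_layers=2)
          self.embed = nn.Embedding(vocab_size, d_model)
          self.out = nn.Linear(d_model, vocab_size)

      def forward(self, feats, feat_lens, tokens, token_lens, ctc_weight=0.3):
          enc = self.encoder(feats)                        # (B, T, d_model)
          # CTC branch: per-frame log-probs; F.ctc_loss expects (T, B, V).
          ctc_logp = F.log_softmax(self.ctc_head(enc), dim=-1).transpose(0, 1)
          ctc_loss = F.ctc_loss(ctc_logp, tokens, feat_lens, token_lens,
                                blank=self.blank_id, zero_infinity=True)
          # Attention branch: teacher-forced cross-entropy over BPE tokens
          # (causal masking and BOS/EOS shifting omitted for brevity).
          dec = self.decoder(self.embed(tokens), enc)
          att_loss = F.cross_entropy(self.out(dec).transpose(1, 2), tokens)
          # Interpolate the two objectives; the CTC term constrains alignment,
          # which is one way hybrid systems curb hallucinated output.
          return ctc_weight * ctc_loss + (1 - ctc_weight) * att_loss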

Finetune

Model and architecture finetuning happens in two ways:
  1. Decoder training: Language modeling/terminology learning
  2. Encoder training: Auditory model learning
As a starting point, the base model's parameters are used as the baseline upon which various methodologies are employed to refine the parameters of other models.
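As a hedged illustration of these two modes, the sketch below assumes a model that exposes encoder and decoder submodules (such as the hypothetical HybridASRModel above) and freezes one half while the other is updated. The function name, optimizer, and learning rate are assumptions, not a documented Corti API.

  # Hypothetical sketch of the two finetuning modes described above.
  import torch

  def set_finetune_mode(model, mode="decoder"):
      """Freeze one half of the model so only the other half is updated.

      mode="decoder": adapt language modeling / terminology (e.g. domain vocabulary)
                      while keeping the acoustic encoder fixed.
      mode="encoder": adapt the auditory model (e.g. new acoustic conditions)
                      while keeping the decoder fixed.
      """
      for p in model.parameters():
          p.requires_grad = False
      trainable = model.decoder if mode == "decoder" else model.encoder
      for p in trainable.parameters():
          p.requires_grad = True
      # Optimize only the unfrozen parameters.
      return torch.optim.AdamW(
          (p for p in model.parameters() if p.requires_grad), lr=1e-5
      )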

Evaluate

Validation uses different data sets from those used for model finetuning and training. Assessment methodologies include, but are not limited to, the following (see the sketch after the list):
  • Word error rate (WER)
  • Character error rate (CER)
  • Medical term accuracy rate
  • Levenshtein distance
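The string-distance metrics above are standard, so a minimal, self-contained Python sketch of the Levenshtein distance and the WER and CER derived from it is shown below. Text normalization and medical-term matching, which a production evaluation would typically apply, are omitted.

  # Illustrative metric implementations; not the Corti evaluation pipeline.
  def levenshtein(ref, hyp):
      """Minimum number of substitutions, insertions, and deletions."""
      d = list(range(len(hyp) + 1))
      for i, r in enumerate(ref, 1):
          prev, d[0] = d[0], i
          for j, h in enumerate(hyp, 1):
              prev, d[j] = d[j], min(d[j] + 1,         # insertion
                                     d[j - 1] + 1,     # deletion
                                     prev + (r != h))  # substitution
      return d[-1]

  def wer(reference, hypothesis):
      """Word error rate: word-level edit distance over reference length."""
      ref_words = reference.split()
      return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)

  def cer(reference, hypothesis):
      """Character error rate: character-level edit distance over reference length."""
      return levenshtein(list(reference), list(hypothesis)) / max(len(reference), 1)

  # Example: one substituted word out of four reference words -> WER of 0.25.
  print(wer("patient denies chest pain", "patient denies chest pains"))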