Bringing Real-Time Offline Speech Recognition to OpenVoiceOS - ONNX, New Plugins, and the Road Here
JarbasAl
OVOS Contributor
Speech recognition — turning spoken words into text — has dramatically improved over the past few years. When OpenVoiceOS (OVOS) began, offline automatic speech recognition (ASR) was elusive. Today, with new ONNX-powered runtimes and models, offline STT is practical and performant even on modest hardware.
In this post we will:
- Explain the evolution of offline STT on OVOS
- Describe why ONNX matters
- Introduce the new OVOS plugins that make modern ASR work locally
- Point you to resources for both casual users and developers
Where Offline STT on OVOS Started
OVOS has historically supported many STT backends, but most were designed for cloud or desktop environments rather than small devices.
Here’s a quick historical rundown:
Full Framework Backends
Many early OVOS STT plugins wrapped models and code that required:
- PyTorch, TensorFlow, or
- Full toolkits like NVIDIA NeMo
This worked on desktops and servers but was difficult on small single-board computers (SBCs) like the Raspberry Pi because:
- Framework installs are massive (hundreds of MBs)
- Native builds of CUDA, CuDNN, etc., are complex
- Some models require dedicated GPUs to run at usable speeds
As a result, for most OVOS users, these backends were theoretical offline options. Practical use often required self-hosting distinct servers or complex custom builds.
Early Lightweight Offline Options
Some OVOS plugins attempted to address dependency bloat or offline operation early on:
✔️ ovos-stt-plugin-vosk
🔗 https://github.com/OpenVoiceOS/ovos-stt-plugin-vosk
- Based on the Vosk speech recognition library (Kaldi-based)
- Designed for local offline STT
- Limitation: While fast, existing Vosk models generally offer lower accuracy compared to modern end-to-end neural ASR.
✔️ ovos-stt-plugin-citrinet
🔗 https://github.com/OpenVoiceOS/ovos-stt-plugin-citrinet
- Citrinet models exported to ONNX
- Limitation: Models were small and fast, but few pre-trained models were available, and those that existed struggled with accuracy.
These plugins were important steps: they removed heavy dependencies. In practice, however, their accuracy was rarely competitive with cloud solutions or large models.
The Practical Offline Choice Before ONNX
Whisper models from OpenAI were (and remain) a major force in open ASR.
The FasterWhisper backend provided a lightweight way to run Whisper models locally, using CTranslate2 for efficient C++ inference, and has historically been the default go-to plugin.
- 🔗 https://github.com/OpenVoiceOS/ovos-stt-plugin-fasterwhisper
- 🔗 https://github.com/OpenVoiceOS/ovos-stt-plugin-whispercpp
- 🔗 https://github.com/TigreGotico/ovos-stt-plugin-whisper
Practically speaking, this was the only widely usable offline solution for OVOS on SBCs — provided:
- You used very small Whisper models (tiny/base), or
- You had a GPU to accelerate inference.
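To make this concrete, here is a sketch of what selecting the FasterWhisper plugin and pinning a small model could look like in mycroft.conf. The per-plugin `model` key shown here is an assumption based on the plugin's typical configuration style; check the plugin README for the exact keys your installed version supports:

```json
{
  "stt": {
    "module": "ovos-stt-plugin-fasterwhisper",
    "ovos-stt-plugin-fasterwhisper": {
      "model": "tiny"
    }
  }
}
```

OVOS merges this user configuration over its defaults, so only the `stt` section needs to change to swap backends.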
Without a GPU, even small Whisper models were often too slow on ARM CPUs for a snappy voice assistant experience.
So offline STT existed, but was effectively limited unless you had higher-end hardware.
Why ONNX Changes Everything
ONNX (Open Neural Network Exchange) isn’t a model; it’s a portable format plus optimized runtimes that let you run models efficiently on many platforms.
Official ONNX site: https://onnx.ai/
Key Properties
- Portable inference: Export a model once and run it anywhere that supports ONNX Runtime.
- Low dependency overhead: You don’t need PyTorch, TensorFlow, or large ML stacks.
- Hardware agnostic: ONNX Runtime can target CPUs, GPUs, NPUs, and platform accelerators:
  - CPU: optimized kernels
  - GPU: CUDA, TensorRT
  - Apple Silicon: CoreML
  - Windows: DirectML
- Quantization support: Reduces model size and improves inference speed on resource-limited hardware (e.g., using int8 models).
The result: real-time or near-real-time speech recognition on devices that previously struggled to run anything but tiny models.
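The size win from quantization is easy to see with back-of-the-envelope arithmetic. The sketch below assumes roughly 39 million parameters (about the size of Whisper tiny, used here purely as an illustrative figure):

```python
# Illustrative model-size arithmetic for quantization.
# ~39M parameters is roughly the size of Whisper "tiny".
params = 39_000_000

fp32_bytes = params * 4   # 32-bit floats: 4 bytes per weight
int8_bytes = params * 1   # int8 quantization: 1 byte per weight

print(f"fp32: {fp32_bytes / 1e6:.0f} MB")       # → fp32: 156 MB
print(f"int8: {int8_bytes / 1e6:.0f} MB")       # → int8: 39 MB
print(f"reduction: {fp32_bytes // int8_bytes}x")  # → reduction: 4x
```

A 4x smaller model means less RAM, less disk, and fewer cache misses during inference, which is exactly what low-power SBCs need.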
Architectures in Modern ASR (Quick Reference)
Modern ASR is built on several architectural paradigms. All of the following can now be exported to ONNX and run with optimized inference in OVOS:
| Architecture | Description | Resource |
|---|---|---|
| CTC | Aligns audio to text without pre-segmented transcripts | Wikipedia |
| RNN-T (Transducer) | Sequence-to-sequence models without attention, well suited to streaming recognition | Arxiv |
| Transformer | Attention-based networks used in many ASR models | Wikipedia |
| Conformer | Combines convolution and attention for speech | Arxiv |
| Whisper | Large general-purpose ASR model family from OpenAI | GitHub |
| Paraformer | Non-autoregressive, speed-optimized model | Arxiv |
| Zipformer | Faster, more memory-efficient, and better-performing transformer | Arxiv |
Two New ONNX-Powered OVOS STT Plugins
1) ovos-stt-plugin-sherpa-onnx
📦 https://github.com/TigreGotico/ovos-stt-plugin-sherpa-onnx
This plugin connects OVOS with the Sherpa-ONNX ecosystem, a performant, multi-model, ONNX-centric ASR framework.
2) ovos-stt-plugin-onnx-asr
📦 https://github.com/TigreGotico/ovos-stt-plugin-onnx-asr
This plugin integrates the onnx-asr Python library directly into OVOS, giving you a simple API to run ONNX models.
We have even converted some Basque/Spanish/Catalan models for use with the onnx-asr plugin; you can find them in this Hugging Face collection.
Features of the new plugins:
- Modern model families: Transducer/Zipformer, Paraformer, Parakeet, Canary, Whisper, GigaAM, Moonshine
- Auto-download of models
- Quantized models for low-power devices
- Hardware acceleration: works well on both CPU and GPU (if available)
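Switching to one of the new plugins is again just a mycroft.conf change. The sketch below selects the onnx-asr plugin with its defaults; plugin-specific options such as model names vary between the two plugins, so consult each README for the keys it accepts:

```json
{
  "stt": {
    "module": "ovos-stt-plugin-onnx-asr"
  }
}
```

Replace the module name with `ovos-stt-plugin-sherpa-onnx` to try the Sherpa-ONNX backend instead; no other code changes are needed.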
What This Means for OVOS Users
For Casual Users
- Real offline STT — no internet required.
- Better accuracy than previous lightweight plugins.
- Easier installation processes.
For Developers
- Full access to modern ASR architectures.
- Benchmark and swap models without rewriting code.
- Experiment with ONNX Runtime acceleration backends.
In Summary
The landscape of offline speech recognition on OpenVoiceOS has matured:
| Stage | Typical Requirements | Practical Result |
|---|---|---|
| Early Full Framework Models | PyTorch / NeMo | Works on desktop, not SBC |
| Vosk / Citrinet ONNX | Lightweight native libraries | Usable, but limited accuracy |
| Whisper / FasterWhisper | CTranslate2; small models or a GPU | Best offline option until now |
| ONNX STT (Sherpa + onnx-asr) | Minimal deps, efficient | Fast, real-time, offline, portable |
With these new ONNX plugins, OVOS users get the most capable offline STT stack yet — faster, more accurate, and deployable in more environments than ever before.
Help Us Build Voice for Everyone
OpenVoiceOS is more than software; it’s a mission. If you believe voice assistants should be open, inclusive, and user-controlled, here’s how you can help:
- 💸 Donate: Help us fund development, infrastructure, and legal protection.
- 📣 Contribute Open Data: Share voice samples and transcriptions under open licenses.
- 🌍 Translate: Help make OVOS accessible in every language.
We're not building this for profit. We're building it for people. With your support, we can keep voice tech transparent, private, and community-owned.