Bringing Real-Time Offline Speech Recognition to OpenVoiceOS: ONNX, New Plugins, and the Road Here

JarbasAl

OVOS Contributor


Speech recognition — turning spoken words into text — has dramatically improved over the past few years. When OpenVoiceOS (OVOS) began, offline automatic speech recognition (ASR) was elusive. Today, with new ONNX-powered runtimes and models, offline STT is practical and performant even on modest hardware.

In this post we will:

  • Explain the evolution of offline STT on OVOS
  • Describe why ONNX matters
  • Introduce the new OVOS plugins that make modern ASR work locally
  • Point you to resources for both casual users and developers

Where Offline STT on OVOS Started

OVOS has historically supported many STT backends, but most were designed for cloud or desktop environments rather than small devices.

Here’s a quick historical rundown:

Full Framework Backends

Many early OVOS STT plugins wrapped models and code that required:

  • PyTorch, TensorFlow, or
  • Full toolkits like NVIDIA NeMo

This worked on desktops and servers but was difficult on small single-board computers (SBCs) like the Raspberry Pi because:

  • Framework installs are massive (hundreds of MBs)
  • Native builds of CUDA, CuDNN, etc., are complex
  • Some models require dedicated GPUs to run at usable speeds

As a result, for most OVOS users, these backends were theoretical offline options. Practical use often required self-hosting distinct servers or complex custom builds.


Early Lightweight Offline Options

Some OVOS plugins attempted to address dependency bloat or offline operation early on:

✔️ ovos-stt-plugin-vosk

🔗 https://github.com/OpenVoiceOS/ovos-stt-plugin-vosk

  • Based on the Vosk speech recognition library (Kaldi-based)
  • Designed for local offline STT
  • Limitation: While fast, existing Vosk models generally offer lower accuracy compared to modern end-to-end neural ASR.

✔️ ovos-stt-plugin-citrinet

🔗 https://github.com/OpenVoiceOS/ovos-stt-plugin-citrinet

  • Citrinet models exported to ONNX
  • Limitation: Models were small and fast, but few pre-trained models were available, and those that existed struggled with accuracy.

These plugins were important steps: they removed heavy dependencies. In practice, however, their accuracy was rarely competitive with cloud solutions or large models.


The Practical Offline Choice Before ONNX

Whisper models from OpenAI were (and remain) a major force in open ASR.

The faster-whisper backend provided a lightweight way to run Whisper models locally, using CTranslate2 for optimized C++ inference, and has historically been the default go-to plugin.

Practically speaking, this was the only widely usable offline solution for OVOS on SBCs — provided:

  • You used very small Whisper models (tiny/base), or
  • You had a GPU to accelerate inference.

Without a GPU, even small Whisper models were often too slow on ARM CPUs for a snappy voice assistant experience.

So offline STT existed, but was effectively limited unless you had higher-end hardware.


Why ONNX Changes Everything

ONNX (Open Neural Network Exchange) isn’t a model; it’s a portable format plus optimized runtimes that let you run models efficiently on many platforms.

Official ONNX site: https://onnx.ai/

Key Properties

  • Portable inference: Export a model once and run it anywhere that supports ONNX Runtime.
  • Low dependency overhead: You don’t need PyTorch, TensorFlow, or large ML stacks.
  • Hardware Agnostic: ONNX Runtime can use CPU, GPU, NPUs, and platform accelerators seamlessly:
    • CPU: Optimized kernels
    • GPU: CUDA, TensorRT
    • Apple Silicon: CoreML
    • Windows ML: DirectML
  • Quantization support: Reduces model size and improves inference speed on resource-limited hardware (e.g., using int8 models).

The result: real-time or near-real-time speech recognition on devices that previously struggled to run anything but tiny models.
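To make the quantization point above concrete: int8 quantization typically uses an affine mapping with a scale and zero point, so each 4-byte float weight is stored as a single byte. The following is a minimal illustrative sketch of that mapping, not code from any ONNX tooling (the function names are ours):

```python
def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float to the int8 range using affine quantization."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # clamp to the int8 range

def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover an approximation of the original float."""
    return (q - zero_point) * scale

# A weight of 1.23 with scale 0.1 survives the round trip with small error:
w = 1.23
q = quantize(w, scale=0.1, zero_point=0)
approx = dequantize(q, scale=0.1, zero_point=0)
```

Storing `q` (1 byte) instead of `w` (4 bytes) is where the roughly 4x size reduction comes from; the small rounding error is the accuracy trade-off quantized models accept.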


Architectures in Modern ASR (Quick Reference)

Modern ASR is built on several architectural paradigms. All of the following can now be exported to ONNX and run with optimized inference in OVOS:

  • CTC: aligns audio to text without pre-segmented transcripts (Wikipedia)
  • RNN-T (Transducer): sequence-to-sequence models that do not rely on attention mechanisms (arXiv)
  • Transformer: attention-based networks used in many ASR models (Wikipedia)
  • Conformer: combines convolution and attention for speech (arXiv)
  • Whisper: large general-purpose ASR model family from OpenAI (GitHub)
  • Paraformer: non-autoregressive, speed-optimized model (arXiv)
  • Zipformer: a faster, more memory-efficient, better-performing transformer variant (arXiv)
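CTC's alignment trick can be made concrete with a tiny example: the model emits one token per audio frame (including a special "blank" token), and decoding collapses consecutive repeats and then drops blanks. A minimal sketch of that collapse step (the names are illustrative, not taken from any OVOS plugin):

```python
def ctc_collapse(frames: list[str], blank: str = "_") -> str:
    """Collapse a per-frame CTC token sequence into text:
    merge consecutive duplicates, then drop blank tokens."""
    out = []
    prev = None
    for token in frames:
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return "".join(out)

# Twelve audio frames collapse to a five-letter word:
print(ctc_collapse(list("hh_eel_ll_oo")))  # hello
```

Note how the blank between the two "l" frames is what lets a genuine double letter survive the duplicate-merging step.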

Two New ONNX-Powered OVOS STT Plugins

1) ovos-stt-plugin-sherpa-onnx

📦 https://github.com/TigreGotico/ovos-stt-plugin-sherpa-onnx

This plugin connects OVOS with the Sherpa-ONNX ecosystem, a performant, multi-model, ONNX-centric ASR framework.

2) ovos-stt-plugin-onnx-asr

📦 https://github.com/TigreGotico/ovos-stt-plugin-onnx-asr

This plugin integrates the onnx-asr Python library directly into OVOS, giving you a simple API to run ONNX models.

We even converted some Basque, Spanish, and Catalan models for use with the onnx-asr plugin; you can find them in this Hugging Face collection.

Features of the new plugins:

  • Modern model families: Transducer/Zipformer, Paraformer, Parakeet, Canary, Whisper, GigaAM, Moonshine
  • Auto-download of models
  • Quantized models for low-power devices
  • Hardware Acceleration: Works well on both CPU and GPU (if available)
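Like other OVOS STT plugins, these are selected through the `stt` section of the OVOS configuration file. A hypothetical minimal snippet is shown below; the module name follows the plugin's package name, and per-plugin options (such as which model to load) vary, so check each plugin's README for the exact keys:

```json
{
  "stt": {
    "module": "ovos-stt-plugin-onnx-asr"
  }
}
```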

What This Means for OVOS Users

For Casual Users

  • Real offline STT — no internet required.
  • Better accuracy than previous lightweight plugins.
  • Easier installation processes.

For Developers

  • Full access to modern ASR architectures.
  • Benchmark and swap models without rewriting code.
  • Experiment with ONNX Runtime acceleration backends.
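For the benchmarking point above, a model-agnostic harness can be as small as this pure-Python sketch, where `transcribe` is any callable wrapping a loaded STT backend (the names here are stand-ins for illustration, not an OVOS API):

```python
import time

def benchmark(transcribe, audio, runs: int = 3) -> float:
    """Return average wall-clock seconds per transcription call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Swap in any backend without changing the harness:
fake_stt = lambda audio: "hello world"
avg = benchmark(fake_stt, b"\x00" * 16000)
```

Because every backend is reached through the same callable interface, comparing two models is a one-line swap rather than a rewrite.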

In Summary

The landscape of offline speech recognition on OpenVoiceOS has matured:

  • Early full-framework models (PyTorch / NeMo): worked on desktops and servers, not SBCs
  • Vosk / Citrinet ONNX (lightweight, low accuracy): usable, but limited accuracy
  • Whisper / FasterWhisper (better accuracy, still heavy): the best offline option until now
  • ONNX STT (Sherpa + onnx-asr; minimal dependencies, efficient): fast, real-time, offline, portable

With these new ONNX plugins, OVOS users get the most capable offline STT stack yet — faster, more accurate, and deployable in more environments than ever before.


Help Us Build Voice for Everyone

OpenVoiceOS is more than software; it's a mission. If you believe voice assistants should be open, inclusive, and user-controlled, here's how you can help:

  • 💸 Donate: Help us fund development, infrastructure, and legal protection.
  • 📣 Contribute Open Data: Share voice samples and transcriptions under open licenses.
  • 🌍 Translate: Help make OVOS accessible in every language.

We're not building this for profit. We're building it for people. With your support, we can keep voice tech transparent, private, and community-owned.

👉 Support the project here
