Bringing Real-Time Offline Speech Recognition to OpenVoiceOS: ONNX, New Plugins, and the Road Here

JarbasAl

OVOS Contributor


Speech recognition — turning spoken words into text — has dramatically improved over the past few years. When OpenVoiceOS (OVOS) began, offline automatic speech recognition (ASR) was elusive. Today, with new ONNX-powered runtimes and models, offline STT is practical and performant even on modest hardware.

In this post we will:

  • Explain the evolution of offline STT on OVOS
  • Describe why ONNX matters
  • Introduce the new OVOS plugins that make modern ASR work locally
  • Point you to resources for both casual users and developers

Where Offline STT on OVOS Started

OVOS has historically supported many STT backends, but most were designed for cloud or desktop environments rather than small devices.

Here’s a quick historical rundown:

Full Framework Backends

Many early OVOS STT plugins wrapped models and code that required:

  • PyTorch, TensorFlow, or
  • Full toolkits like NVIDIA NeMo

This worked on desktops and servers but was difficult on small single-board computers (SBCs) like the Raspberry Pi because:

  • Framework installs are massive (hundreds of MBs)
  • Native builds of CUDA, CuDNN, etc., are complex
  • Some models require dedicated GPUs to run at usable speeds

As a result, for most OVOS users, these backends were theoretical offline options. Practical use often required self-hosting distinct servers or complex custom builds.


Early Lightweight Offline Options

Some OVOS plugins attempted to address dependency bloat or offline operation early on:

✔️ ovos-stt-plugin-vosk

🔗 https://github.com/OpenVoiceOS/ovos-stt-plugin-vosk

  • Based on the Vosk speech recognition library (Kaldi-based)
  • Designed for local offline STT
  • Limitation: While fast, existing Vosk models generally offer lower accuracy compared to modern end-to-end neural ASR.

✔️ ovos-stt-plugin-citrinet

🔗 https://github.com/OpenVoiceOS/ovos-stt-plugin-citrinet

  • Citrinet models exported to ONNX
  • Limitation: Models were small and fast, but few pre-trained models were available, and those that existed struggled with accuracy.

These plugins were important steps: they removed heavy dependencies. In practice, however, their accuracy was rarely competitive with cloud solutions or large models.


The Practical Offline Choice Before ONNX

Whisper models from OpenAI were (and remain) a major force in open ASR.

The faster-whisper backend provided a lightweight way to run Whisper models locally, using CTranslate2 for optimized C++ inference, and has historically been the default go-to plugin.

Practically speaking, this was the only widely usable offline solution for OVOS on SBCs — provided:

  • You used very small Whisper models (tiny/base), or
  • You had a GPU to accelerate inference.

Without a GPU, even small Whisper models were often too slow on ARM CPUs for a snappy voice assistant experience.

So offline STT existed, but was effectively limited unless you had higher-end hardware.


Why ONNX Changes Everything

ONNX (Open Neural Network Exchange) isn’t a model; it’s a portable format plus optimized runtimes that let you run models efficiently on many platforms.

Official ONNX site: https://onnx.ai/

Key Properties

  • Portable inference: Export a model once and run it anywhere that supports ONNX Runtime.
  • Low dependency overhead: You don’t need PyTorch, TensorFlow, or large ML stacks.
  • Hardware Agnostic: ONNX Runtime can use CPU, GPU, NPUs, and platform accelerators seamlessly:
    • CPU: Optimized kernels
    • GPU: CUDA, TensorRT
    • Apple Silicon: CoreML
    • Windows ML: DirectML
  • Quantization support: Reduces model size and improves inference speed on resource-limited hardware (e.g., using int8 models).

The result: real-time or near-real-time speech recognition on devices that previously struggled to run anything but tiny models.
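To make the quantization point above concrete: int8 quantization typically uses an affine mapping with a scale and zero point, so each 4-byte float weight is stored as a single byte. The following is a minimal illustrative sketch of that mapping, not code from any ONNX tooling (the function names are ours):

```python
def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float to the int8 range using affine quantization."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # clamp to the int8 range

def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover an approximation of the original float."""
    return (q - zero_point) * scale

# A weight of 1.23 with scale 0.1 survives the round trip with small error:
w = 1.23
q = quantize(w, scale=0.1, zero_point=0)
approx = dequantize(q, scale=0.1, zero_point=0)
```

Storing `q` (1 byte) instead of `w` (4 bytes) is where the roughly 4x size reduction comes from; the small rounding error is the accuracy trade-off quantized models accept.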


Architectures in Modern ASR (Quick Reference)

Modern ASR is built on several architectural paradigms. All of the following can now be exported to ONNX and run with optimized inference in OVOS:

  • CTC: aligns audio to text without pre-segmented transcripts (Wikipedia)
  • RNN-T (Transducer): sequence-to-sequence models that do not rely on attention mechanisms (arXiv)
  • Transformer: attention-based networks used in many ASR models (Wikipedia)
  • Conformer: combines convolution and attention for speech (arXiv)
  • Whisper: large general-purpose ASR model family from OpenAI (GitHub)
  • Paraformer: non-autoregressive, speed-optimized model (arXiv)
  • Zipformer: a faster, more memory-efficient, better-performing transformer variant (arXiv)
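CTC's alignment trick can be made concrete with a tiny example: the model emits one token per audio frame (including a special "blank" token), and decoding collapses consecutive repeats and then drops blanks. A minimal sketch of that collapse step (the names are illustrative, not taken from any OVOS plugin):

```python
def ctc_collapse(frames: list[str], blank: str = "_") -> str:
    """Collapse a per-frame CTC token sequence into text:
    merge consecutive duplicates, then drop blank tokens."""
    out = []
    prev = None
    for token in frames:
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return "".join(out)

# Twelve audio frames collapse to a five-letter word:
print(ctc_collapse(list("hh_eel_ll_oo")))  # hello
```

Note how the blank between the two "l" frames is what lets a genuine double letter survive the duplicate-merging step.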

Two New ONNX-Powered OVOS STT Plugins

1) ovos-stt-plugin-sherpa-onnx

📦 https://github.com/TigreGotico/ovos-stt-plugin-sherpa-onnx

This plugin connects OVOS with the Sherpa-ONNX ecosystem, a performant, multi-model, ONNX-centric ASR framework.

2) ovos-stt-plugin-onnx-asr

📦 https://github.com/TigreGotico/ovos-stt-plugin-onnx-asr

This plugin integrates the onnx-asr Python library directly into OVOS, giving you a simple API to run ONNX models.

We even converted some Basque, Spanish, and Catalan models for use with the onnx-asr plugin; you can find them in this Hugging Face collection.

Features of the new plugins:

  • Modern model families: Transducer/Zipformer, Paraformer, Parakeet, Canary, Whisper, GigaAM, Moonshine
  • Auto-download of models
  • Quantized models for low-power devices
  • Hardware Acceleration: Works well on both CPU and GPU (if available)
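Like other OVOS STT plugins, these are selected through the `stt` section of the OVOS configuration file. A hypothetical minimal snippet is shown below; the module name follows the plugin's package name, and per-plugin options (such as which model to load) vary, so check each plugin's README for the exact keys:

```json
{
  "stt": {
    "module": "ovos-stt-plugin-onnx-asr"
  }
}
```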

What This Means for OVOS Users

For Casual Users

  • Real offline STT — no internet required.
  • Better accuracy than previous lightweight plugins.
  • Easier installation processes.

For Developers

  • Full access to modern ASR architectures.
  • Benchmark and swap models without rewriting code.
  • Experiment with ONNX Runtime acceleration backends.
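For the benchmarking point above, a model-agnostic harness can be as small as this pure-Python sketch, where `transcribe` is any callable wrapping a loaded STT backend (the names here are stand-ins for illustration, not an OVOS API):

```python
import time

def benchmark(transcribe, audio, runs: int = 3) -> float:
    """Return average wall-clock seconds per transcription call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Swap in any backend without changing the harness:
fake_stt = lambda audio: "hello world"
avg = benchmark(fake_stt, b"\x00" * 16000)
```

Because every backend is reached through the same callable interface, comparing two models is a one-line swap rather than a rewrite.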

In Summary

The landscape of offline speech recognition on OpenVoiceOS has matured:

  • Early full-framework models (PyTorch / NeMo): worked on desktops and servers, not SBCs
  • Vosk / Citrinet ONNX (lightweight, low accuracy): usable, but limited accuracy
  • Whisper / FasterWhisper (better accuracy, still heavy): the best offline option until now
  • ONNX STT (Sherpa + onnx-asr; minimal dependencies, efficient): fast, real-time, offline, portable

With these new ONNX plugins, OVOS users get the most capable offline STT stack yet — faster, more accurate, and deployable in more environments than ever before.


Help Us Build Voice for Everyone

OpenVoiceOS is more than software; it's a mission. If you believe voice assistants should be open, inclusive, and user-controlled, here's how you can help:

  • 💸 Donate: Help us fund development, infrastructure, and legal protection.
  • 📣 Contribute Open Data: Share voice samples and transcriptions under open licenses.
  • 🌍 Translate: Help make OVOS accessible in every language.

We're not building this for profit. We're building it for people. With your support, we can keep voice tech transparent, private, and community-owned.

👉 Support the project here
