Cloning Voices for Endangered Languages: Building a Text-to-Speech Model for Asturian and Aragonese
JarbasAl
OVOS Contributor
Have you ever wanted to hear a computer speak in an accent you love, or in a language that's rarely supported by big tech?
Today we’re releasing new experimental Text-to-Speech (TTS) models for Asturian (ast) and Aragonese (an), two beautiful minority Romance languages spoken by communities that almost never get access to modern speech technology.
These models represent a small but meaningful step toward a larger mission: helping any under-resourced language build openly available TTS voices using only community data and ethical voice-cloning techniques.
Why This Work Matters - A Short Reality Check
Most languages still lack even a single usable open TTS voice. Not because the technology doesn’t exist, but because:
- high-quality monolingual datasets are rare
- speakers often can’t safely provide the many hours required
- dialect diversity makes a single “official” voice unrealistic
- basic tools (phonemizers, lexicons, G2P) often don’t exist
Meanwhile, these communities still need TTS: for education, accessibility, media creation, cultural preservation, and linguistic pride.
Large tech companies rarely prioritise minority languages. Their incentives are simple: supporting a language only brings value if it brings new users or data.
For speakers of Asturian or Aragonese, who can also use Spanish, this lack of support nudges people away from their own languages. Over time, that invisibility contributes to language shift and erosion.
As a non-profit, we have different priorities: we want to empower all users, preserve linguistic diversity, and treat language as accessibility. This project is part of that mission.
So, how did we do it? We used a clever, hybrid approach that combines existing resources with cutting-edge voice cloning technology.
The "Low-Resource" Challenge
Imagine you want a computer to speak with a very specific voice, perhaps your own, or that of a beloved family member. Now imagine you only have a few seconds of that person speaking. That's our "low-resource donor voice."
At the same time, we have access to large Automatic Speech Recognition (ASR) datasets, like Mozilla Common Voice, which contain recordings of many different people. The problem is, it's not a single, consistent voice.
Our goal was to "transfer" the specific sound of our donor voice onto the vast amount of data available in these multi-speaker ASR datasets.
Our Hybrid Solution: A Step-by-Step Journey
Here's a simplified look at the process we followed (for a more detailed, technical explanation, check out our Whitepaper on Hybrid TTS Dataset Synthesis):
- Gathering Our Raw Materials:
- We started with text and audio from two datasets: Common Voice Scripted Speech 23.0 - Asturian and Common Voice Scripted Speech 23.0 - Aragonese. These provided us with many text transcripts and their corresponding multi-speaker audio.
- We also had a short recording of our "donor voice" – the target voice we wanted the TTS model to learn.
- Audio Quality Filtering and Preparation:
- We converted all audio to a standard format and ensured the volume was consistent across all recordings (normalization).
- We trimmed silence from the beginning and end of each recording.
- We filtered out recordings where people spoke too fast or too slow (outliers based on Words-Per-Minute), keeping only the most natural and consistent segments. This focused our dataset on the best quality transcripts.
- The Magic of Voice Cloning (Zero-Shot Revoicing):
- This is where modern AI comes in! Instead of training a complex model from scratch, we used an off-the-shelf zero-shot voice cloning solution.
- This system was given a short reference clip of the donor voice. It uses this clip to learn the unique qualities of the voice.
- We then fed our filtered ASR dataset into this cloning system. The original multi-speaker audio was discarded; the cloning tool simply generates new audio in our target donor voice. The result? A new dataset of Asturian/Aragonese audio, all spoken in a single, consistent voice!
- Training the Final TTS Model:
- With our brand-new, high-quality, single-speaker datasets, we could finally train our TTS models.
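To make the preparation steps above concrete, here is a minimal sketch of the kind of filtering involved. This is illustrative code, not our actual pipeline: it assumes mono float audio arrays and clip dictionaries with `text` and `duration` fields, and the thresholds are arbitrary placeholders.

```python
import numpy as np

def normalize(audio: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale the clip so its loudest sample sits at `peak` (simple peak normalization)."""
    m = np.max(np.abs(audio))
    return audio if m == 0 else audio * (peak / m)

def trim_silence(audio: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Drop leading and trailing samples below an amplitude threshold."""
    voiced = np.where(np.abs(audio) > threshold)[0]
    if len(voiced) == 0:
        return audio
    return audio[voiced[0]:voiced[-1] + 1]

def wpm(transcript: str, duration_s: float) -> float:
    """Words-per-minute of a clip, used to spot unnaturally fast or slow speech."""
    return len(transcript.split()) / (duration_s / 60.0)

def filter_by_wpm(clips, low: float = 80.0, high: float = 220.0):
    """Keep only clips whose speaking rate falls inside a plausible range."""
    return [c for c in clips if low <= wpm(c["text"], c["duration"]) <= high]
```

A real pipeline would also resample everything to one sample rate and use loudness normalization rather than a bare peak scale, but the shape of the filtering is the same.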
About Pronunciation & Phonemizers
These models were trained directly on graphemes; we did not use a phonemizer. Good G2P (grapheme-to-phoneme) tools for Asturian and Aragonese are scarce. A phonemizer usually improves pronunciation and, importantly, allows IPA input at runtime to force pronunciations when needed.
If you know of an existing phonemizer, have lexicons/pronunciation data, or want to help train one, please get in touch. This would materially improve future releases.
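As a rough illustration of what "trained on graphemes" means (this is not the actual phoonnx frontend), a grapheme-based model maps each character of the training text directly to a symbol ID, with no intermediate phoneme step:

```python
def build_symbol_table(corpus):
    """Collect the character inventory from training text and assign each an ID."""
    symbols = sorted(set("".join(corpus)))
    return {ch: i for i, ch in enumerate(symbols)}

def text_to_ids(text, table):
    """Convert a sentence to the ID sequence the acoustic model consumes.
    Characters unseen during training are skipped here; a real frontend
    would normalize or substitute them instead."""
    return [table[ch] for ch in text if ch in table]
```

With a phonemizer, `text_to_ids` would operate on phoneme symbols instead, so pronunciation rules live outside the model and IPA could be fed in directly at runtime.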
The Results:
The results are not perfect; our goal was mainly to validate that the approach works.
We used phoonnx to train VITS models. VITS is very performant and can run basically anywhere, and it is easy to train without massive GPUs, making it perfect for experimentation.
There are many better architectures we can explore in the future to train truly natural sounding voices!
Sample text (Asturian): "L'arcu la vieya ye un fenómenu ópticu y meteorolóxicu que produz l'apaición d'un espectru de lluz continu nel cielu cuando los rayos del sol trespasen pequeñes partícules de mugor conteníes n'atmósfera terrestre. La forma ye la d'un arcu multicolor col roxo hacia la parte esterior y el viola hacia la interior. El arco iris duble, ye menos avezau a vese, y tien los colores invertíos, esto ye, el roxo hacia dientro y el viola hacia l'esterior."
Download: Asturian — dii (female) and Asturian — miro (male)
Sample text (Aragonese): "L'arco de sant Chuan ye un fenomeno optico y meteorolochico que produce l'aparición d'un espectro de luz contino en o cielo cuan os rayos d'o sol trescruzan chicotas particlas d'humidat situatas por l'atmosfera terrestre. A forma suya ye a d'un arco multicolor con o royo en a parti exterior y o morau en a interior. No ye tan cutiano l'arco de sant Chuan dople, que incluye un segundo arco mas tenue con as colors chiratas, ye dicir o royo en l'interior y o morau en l'exterior."
Download: Aragonese — dii (female) and Aragonese — miro (male)
These models are a significant step forward for Asturian and Aragonese language technology. They demonstrate how modern AI, combined with careful data preparation, can empower underserved languages and bring them into the digital age. We're excited to see what developers and enthusiasts will build with them!
Help Us Build Voice for Everyone
OpenVoiceOS is more than software, it’s a mission. If you believe voice assistants should be open, inclusive, and user-controlled, here’s how you can help:
- 💸 Donate: Help us fund development, infrastructure, and legal protection.
- 📣 Contribute Open Data: Share voice samples and transcriptions under open licenses.
- 🌍 Translate: Help make OVOS accessible in every language.
We're not building this for profit. We're building it for people. With your support, we can keep voice tech transparent, private, and community-owned.