It’s been a while since I wrote up a post, but I figured I just had to do one about the project I’ve most recently been tinkering with: voice cloning with AI. I was actually going to do one earlier in January about running an uncensored LLM locally, but Fireship’s got that covered (plus many other folks, with the AI craze going on).
Garnering interest
One of my close friends in tech texted me a few nights ago to see if I had any interest in collaborating on a project to run an AI model for text-to-speech (TTS) locally. He wanted to listen to articles using an AI model and not the built-in TTS support for phones since they sound terrible. I was immediately interested since I did something similar a couple years back to turn a small ebook into an audiobook via the free tier for Google Cloud’s TTS with their WaveNet voices. I was curious to see what we could accomplish nowadays for open-source TTS, especially with all the hype around running AI locally. He sent over some links from preliminary research and I agreed to take a look the following day.
Coincidentally, the very next day, another one of my friends asked in our group chat for help developing a TTS system with a custom voice. He had a slightly different use case: messing with his friend at a party. At this point, the three of us teamed up and formed our own group chat for the project, with me leading the investigation and POC.
Initial investigation
Aside from checking out the links my first friend sent, I searched GitHub for the most-starred repos tagged with “tts”. The first result was a little old, but it referenced Coqui AI’s TTS project, which was the second result and the first link my friend had included. It seemed promising enough, and then my brain finally connected the dots when I saw the frog emoji: “coqui” was the same unforgettable frog my husband and I encountered during our time on the Big Island in Hawaii for our honeymoon. They’re tiny (about the size of a quarter) and LOUD. At that point I was like, “yep, I’m giving this one a shot first and seeing how it goes”.
Setting it up
I wanted to run this on the powerful gaming PC I built last spring to take advantage of my 4090. It’s been fantastic for playing Beat Saber and running inference for LLMs via ollama. I figured, like with ollama, I’d run this in WSL2 with CUDA support. Thankfully, this was super simple to do via a pip install, no Docker needed. Models would even download automatically when specified via --model_name in a tts command.
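For reference, the whole setup boiled down to something like this (a rough sketch, assuming a CUDA-ready WSL2 environment; TTS is Coqui’s package on PyPI):

# Install Coqui TTS (Python package) inside WSL2
pip install TTS

# First run downloads the model automatically, then synthesizes speech to out.wav
tts --model_name tts_models/en/ljspeech/tacotron2-DDC \
    --text "This is a test." \
    --use_cuda True \
    --out_path out.wav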
However, there were a couple gotchas:
- I had to temporarily disable blocking on my Pi-Hole so the models could download.
- I had to delete the model directory if a download failed partway so the download could be reattempted (example below).
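For that second gotcha, downloaded models land under ~/.local/share/tts by default (the path here is from memory, so treat it as an assumption), and clearing a botched download looked something like:

# Remove a partially-downloaded model so the tts CLI re-fetches it
# (adjust the path if your cache location differs)
rm -rf ~/.local/share/tts/tts_models--en--ljspeech--tacotron2-DDC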
Overall though, it was pretty straightforward to try out each of the models with some funny messages, the first test sparking some joy with its almost genuine-sounding frustration:
🔊 "This is a test: I f%#$ing hate Windows!"
Comparing the voice models
With the number of English models available (plus different speakers for the multi-speaker models), I wrote up a quick bash script (sketched after the audio samples below) to generate some samples using a paragraph I took from a Pocket article I had open: https://github.com/AshleyDumaine/coqui-tts-test?tab=readme-ov-file. You can see (or rather, hear) that a good number of them struggle to say less common words like "kaleidoscopic".
🔊 "But the meme’s assertion that the vivid kaleidoscopic patterns and fractals he drew reflected the progression of his mental illness, now thought to be schizophrenia, doesn’t hold up under scrutiny. Similarly, there’s no proof the famous cat lover was afflicted with Toxoplasma gondii, a parasite some cats carry that, despite popular assumptions, has not been proven to alter human behavior."
Tacotron2-DDC:
Tortoise v2:
Jenny:
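The script itself boiled down to a loop like this (a trimmed sketch with the model list abridged; the full version lives in the repo linked above):

#!/usr/bin/env bash
# Generate one sample wav per model so they can be compared side by side
TEXT="$(cat sample.txt)"
MODELS=(
  tts_models/en/ljspeech/tacotron2-DDC
  tts_models/en/jenny/jenny
)
for MODEL in "${MODELS[@]}"; do
  OUT="$(echo "$MODEL" | tr '/' '-').wav"  # e.g. tts_models-en-jenny-jenny.wav
  tts --model_name "$MODEL" \
      --text "$TEXT" \
      --use_cuda True \
      --out_path "$OUT"
done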
The clear standout for quality based on the given paragraph alone was the jenny model, though it seemed to speak a little too fast. The tortoise model was supposed to be good and was the second link my friend had sent me, but as the README mentioned, it was SUPER slow to generate the audio sample (keeping true to its name). It was so slow, in fact, that it took 645 seconds to generate the audio sample for 89 words despite setting --use_cuda True.
Overall, none of these were clear winners.
Then I remembered I didn’t actually NEED to use any of these voices; we still needed a custom voice for the party joke anyway. If I could figure out how to use one custom voice, it would be straightforward to use another for reading articles and ebooks.
Enter voice cloning
Here’s where things started to get fun (but scary). All you need to clone a voice is a short (6 seconds or longer) wav file of someone talking. No transcript of the sample audio is needed.
Here’s the magic command:
tts \
--model_name tts_models/multilingual/multi-dataset/xtts_v2 \
--speaker_wav ${VOICE_DIR}/narrator-ramble.wav \
--language_idx "en" \
--use_cuda True \
--out_path cloned-voice-output.wav \
--text "$(cat sample.txt)”
This particular model is great for voice cloning. I first thought I needed to do voice conversion via a different model, but that didn’t produce results that sounded much like the target voice.
I made the audio clip below with a 37-minute-long (382 MB) sample stitched together from audio files in one of my favorite games:
🔊 "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent."
Longer clips (e.g. over an hour long, or about 815 MB as a wav file) did make the voice match better, but removed some of the variation in intonation.
🔊 "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent."
A better voice match, but sounds a bit flatter. (Can you tell what iconic character this is?)
This means you can take your favorite voice actor or narrator and produce an audio clip with that voice reading whatever text you give it. I was able to use audio assets ripped from a game via AssetStudio and MP3s from audiobooks converted into wav files to achieve this with an eerie similarity. It was simultaneously entertaining and terrifying. This readily available technology is so new that the legality around it is still uncertain[1,2] (so until then, use your best judgement).
Okay, but how fast and good is it?
This depends on how large a sample voice file you're using. With the large (815 MB) sample voice file, it took about 649 seconds to generate an audio file for 500 words; the resulting file was 200 seconds long. With a smaller (2.09 MB) sample voice file, the same text took 285 seconds to generate, resulting in a file 195 seconds long. Both cases used CUDA for the device. That's nowhere near as slow as tortoise-tts, but not exactly real-time. However, it’s not unreasonable to simply leave it running to generate an audiobook (maybe chunking it by chapter in case it fails partway through; see the sketch below). For reference, an average novel is about 80,000 words[1]. Going off my smaller voice sample test, that’s about 1.75 words per second, or 12.67 hours to convert a typical ebook to an audiobook. Overall, based on what I've generated so far, I would say that using a larger sample voice file isn't necessarily better: the voice may sound a bit more accurate, but it won't sound as natural.
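To sanity-check that estimate, and to sketch the chunk-by-chapter idea (assuming a hypothetical layout with one text file per chapter under chapters/):

# Back-of-the-envelope: scale my 500-word / 285-second test to an 80,000-word novel
awk 'BEGIN { wps = 500 / 285; printf "%.2f words/sec -> %.2f hours\n", wps, 80000 / wps / 3600 }'

# Chunk an ebook by chapter so a partial failure doesn't lose everything
for f in chapters/*.txt; do
  tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
      --speaker_wav "${VOICE_DIR}/narrator-ramble.wav" \
      --language_idx en --use_cuda True \
      --text "$(cat "$f")" \
      --out_path "${f%.txt}.wav"
done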
I should also note that variation in intonation depends on the sample speaker wav. If, for example, you wanted to include a line with some sarcasm in a paragraph, that line would need to be generated with a different speaker wav that uses an exclusively sarcastic tone (sketched below). A bit tedious, but the extra effort can go a long way toward making it more convincing (though not really reasonable for something as long as an ebook).
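Here’s roughly what that might look like (hypothetical filenames; neutral-tone.wav and sarcastic-tone.wav would be two reference clips of the same speaker in different moods):

# Render the neutral part and the sarcastic line with different reference wavs
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --speaker_wav neutral-tone.wav --language_idx en --use_cuda True \
    --text "$(cat paragraph.txt)" --out_path part1.wav
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --speaker_wav sarcastic-tone.wav --language_idx en --use_cuda True \
    --text "Oh sure, that was a GREAT idea." --out_path part2.wav

# Stitch the pieces back together
sox part1.wav part2.wav combined.wav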
Helpful tools
While trying to create speaker wav files for voice cloning, I found these helpful:
- sox (sudo apt-get install sox) to stitch together wav files into a longer audio sample
- mp3wrap (sudo apt-get install mp3wrap) to stitch together mp3 files into a longer audio sample
- ffmpeg (sudo apt-get install ffmpeg) to convert from mp3 to wav
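In practice (with hypothetical filenames), the invocations looked something like this:

# Concatenate wav clips into one longer reference sample (last argument is the output)
sox clip1.wav clip2.wav clip3.wav narrator-ramble.wav

# Concatenate mp3s (mp3wrap inserts _MP3WRAP into the output filename)
mp3wrap combined.mp3 chapter1.mp3 chapter2.mp3

# Convert the wrapped mp3 to wav for use as a --speaker_wav
ffmpeg -i combined_MP3WRAP.mp3 combined.wav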
Wrapping up
Overall, this was a fun project and not at all difficult to set up and run. It’s super cool that we can now generate our own AI voices to read our favorite ebooks, articles, and arbitrary text. As usual, though, there’s a whole host of ethical and legal questions raised by this exciting new advancement in technology.