<aside> ⚠️ This guide describes the steps as performed on Windows. The steps on macOS and Linux might vary slightly.
</aside>
The objective of this exercise is to produce a TTS (Text-to-Speech) voice model: you input a string of text, and the model outputs audio of that text spoken in your target speaker’s voice. We are not doing STS (Speech-to-Speech) in this exercise.
There are 3 important steps in successfully cloning a voice.
We will use a different tool for each step. If you face issues that are not covered in this guide, please refer to the tool’s community forum for solutions.
<aside> 💭 Alternatively, ask away in #contributor-chat on Discord!
</aside>
The first step is collecting a voice dataset of the target speaker. Our objective is to obtain as much clean voice audio of the target speaker as possible. Some characters, such as those in Genshin Impact, already have a library of clean datasets provided by the community; refer here.
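While collecting clips, it can help to keep track of how much audio we have gathered so far. The snippet below is a minimal sketch, assuming the clips are saved as uncompressed WAV files in a single folder. The function name `total_wav_seconds` is hypothetical and not part of any tool used in this guide; it relies only on Python’s standard library.

```python
import wave
from pathlib import Path


def total_wav_seconds(folder):
    """Return the combined duration, in seconds, of all .wav files in `folder`.

    Assumption: every clip is an uncompressed WAV file that the standard
    `wave` module can read. Subfolders are not scanned.
    """
    total = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as clip:
            # Duration of one clip = number of frames / frames per second.
            total += clip.getnframes() / clip.getframerate()
    return total
```

For example, `total_wav_seconds("dataset") / 60` gives the number of minutes collected, where `dataset` is a placeholder for your own folder of clips.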
Select a video from YouTube
If you can’t find a ready-made dataset, we will have to curate the dataset ourselves. We can always start from YouTube to find the speaker’s podcasts, interviews, talk shows, reality TV shows, speeches, etc. Here are some tips for selecting a video from YouTube.