<aside> ⚠️ This guide describes the steps as they occur on Windows. macOS and Linux may differ slightly.

</aside>

Let’s get started!

The objective of this exercise is to produce a TTS (Text-to-Speech) voice model: you input a string of text, and the model outputs audio of that text spoken in your target speaker’s voice. We are not doing STS (Speech-to-Speech) in this exercise.

There are 3 important steps in successfully cloning a voice.

  1. Collect the voice dataset
  2. Process the voice dataset
  3. Finetune a voice model with your dataset
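
The three steps above can be pictured as a simple pipeline in which each stage hands its output folder to the next. The sketch below is only an illustration: the function names and the `raw`, `processed`, and `model` folder names are hypothetical placeholders, not part of any tool used in this guide, and each function body stands in for the real work of that step.

```python
from pathlib import Path


def collect_dataset(workdir: Path) -> Path:
    """Step 1: collect raw recordings of the target speaker (placeholder)."""
    return workdir / "raw"  # hypothetical folder name


def process_dataset(raw_dir: Path) -> Path:
    """Step 2: process the collected recordings (placeholder)."""
    return raw_dir.parent / "processed"  # hypothetical folder name


def finetune_model(processed_dir: Path) -> Path:
    """Step 3: finetune a voice model on the processed dataset (placeholder)."""
    return processed_dir.parent / "model"  # hypothetical folder name


def run_pipeline(workdir: Path) -> Path:
    """Chain the three steps in order; each one feeds the next."""
    raw_dir = collect_dataset(workdir)
    processed_dir = process_dataset(raw_dir)
    return finetune_model(processed_dir)


if __name__ == "__main__":
    print(run_pipeline(Path("voice_project")))
```

The point of the layout is that a problem in a later step, such as a poor-sounding model, can usually be traced back to the folder produced by an earlier one.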

We will use a different tool for each step. If you run into issues that this guide does not cover, please refer to that tool’s community forum for solutions.

<aside> 💭 Alternatively, ask away in #contributor-chat on Discord!

</aside>

Step 1 Data collection

The first step is collecting a voice dataset of the target speaker. Our objective is to obtain as many clean voice recordings of the target speaker as possible. Some characters, such as those in Genshin Impact, already have a library of clean datasets provided by the community; refer here.

Select videos from YouTube

If you can’t find a ready-made dataset, we will have to curate one ourselves. YouTube is a good place to start: look for the speaker’s podcasts, interviews, talk shows, reality TV shows, speeches, etc. Here are some tips for selecting videos from YouTube.

  1. Avoid famous late-night shows, such as those hosted by Jimmy Fallon, Norton, Ellen, etc. These have heavy background noise and heavily edited sound effects that mask the speaker’s voice.
  2. Choose videos of long-form podcasts or solo interviews in a studio, where the host is less likely to interrupt the speaker.