Development and research of software for modeling a singing voice based on SoftVC VITS technology

M.A. Kireenko, E.N. Antonyants, E.E. Istratova

Abstract


The article presents the results of developing and studying software for singing-voice modeling, built on an integrated approach that combines neural networks with retrieval-based and differentiable digital signal processing (DDSP) voice-modeling technologies. A comparative analysis of modern voice-modeling solutions informed the choice of technology stack for the software design. The resulting system comprises four interconnected modules: separating audio content with Spleeter, generating a singing-voice model with the SoftVC VITS architecture, training the generated model, and combining the audio files into the final result. As part of the software study, the model-training process was evaluated on a specially prepared dataset of 80 audio recordings of 9 seconds each. Discriminator loss, the Kullback-Leibler divergence, and the feature-matching (FM) and mel-cepstral loss values served as evaluation metrics. The quantitative results obtained over 60,000 training iterations confirmed stable model convergence. The study showed that the model preserves the characteristic timbral features of the voice while delivering high-quality synthesized singing, which is of significant practical value for audio-content processing applications. The results thus demonstrate the high efficiency of the proposed integrated approach to singing-voice modeling.
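Two of the evaluation metrics named above can be made concrete with a short numpy sketch. These are illustrative textbook formulas (the function names are hypothetical and this is not the authors' implementation): the Kullback-Leibler divergence between diagonal Gaussians, the form used for prior/posterior matching in VITS-style models, and frame-averaged mel-cepstral distortion between aligned cepstral sequences.

```python
import numpy as np

def gaussian_kl(mu_q, logs_q, mu_p, logs_p):
    """KL(q || p) for diagonal Gaussians parameterized by mean and
    log-standard-deviation, summed over all dimensions (the form of the
    prior/posterior matching term in VITS-style objectives)."""
    kl = (logs_p - logs_q) \
        + 0.5 * (np.exp(2.0 * logs_q) + (mu_q - mu_p) ** 2) * np.exp(-2.0 * logs_p) \
        - 0.5
    return kl.sum()

def mel_cepstral_distortion(c_ref, c_syn):
    """Frame-averaged mel-cepstral distortion in dB between two aligned
    sequences of shape (frames, coeffs); the 0th (energy) coefficient
    is excluded, as is conventional."""
    diff = c_ref[:, 1:] - c_syn[:, 1:]
    return (10.0 / np.log(10.0)) * np.mean(np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))
```

Both quantities are zero only when the compared distributions (or cepstral sequences) coincide, so decreasing values over training iterations indicate the convergence behavior reported in the study.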


Full Text:

PDF (Russian)

References


Ren Y. et al. FastSpeech 2: Fast and high-quality end-to-end text to speech // arXiv preprint arXiv:2006.04558. – 2020.

Shen K. et al. NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers // arXiv preprint arXiv:2304.09116. – 2023.

Qian K. et al. AutoVC: Zero-shot voice style transfer with only autoencoder loss // International Conference on Machine Learning. – PMLR, 2019. – P. 5210–5219.

Gu Y. et al. ByteSing: A Chinese singing voice synthesis system using duration allocated encoder-decoder acoustic models and WaveRNN vocoders // 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). – IEEE, 2021. – P. 1–5.

Cui J. et al. SiFiSinger: A high-fidelity end-to-end singing voice synthesizer based on source-filter model // ICASSP 2024 – 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). – IEEE, 2024. – P. 11126–11130.

Hayes B. et al. A review of differentiable digital signal processing for music and speech synthesis // Frontiers in Signal Processing. – 2024. – Vol. 3. – P. 1284100.

Sun L. et al. Phonetic posteriorgrams for many-to-one voice conversion without parallel data training // 2016 IEEE International Conference on Multimedia and Expo (ICME). – IEEE, 2016. – P. 1–6.

Polyak A. et al. Unsupervised cross-domain singing voice conversion // arXiv preprint arXiv:2008.02830. – 2020.

Liu S. et al. FastSVC: Fast cross-domain singing voice conversion with feature-wise linear modulation // 2021 IEEE International Conference on Multimedia and Expo (ICME). – IEEE, 2021. – P. 1–6.

Liu S. et al. DiffSVC: A diffusion probabilistic model for singing voice conversion // 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). – IEEE, 2021. – P. 741–748.

Li Z. et al. PPG-based singing voice conversion with adversarial representation learning // ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). – IEEE, 2021. – P. 7073–7077.

Jayashankar T. et al. Self-supervised representations for singing voice conversion // ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). – IEEE, 2023. – P. 1–5.

Zhou Y. et al. VITS-based singing voice conversion system with DSPGAN post-processing for SVCC2023 // 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). – IEEE, 2023. – P. 1–8.

Hennequin R. et al. Spleeter: a fast and efficient music source separation tool with pre-trained models // Journal of Open Source Software. – 2020. – Vol. 5. – No. 50. – P. 2154.

Delgado-Gutiérrez G. et al. Acoustic environment identification by Kullback–Leibler divergence // Forensic Science International. – 2017. – Vol. 281. – P. 134–140.

Abdul Z. K., Al-Talabani A. K. Mel frequency cepstral coefficient and its applications: A review // IEEE Access. – 2022. – Vol. 10. – P. 122136–122158.



ISSN: 2307-8162