ADAPTATION OF THE WORLD FRAMEWORK FOR FRAME-BY-FRAME REAL-TIME SPEECH ANALYSIS
DOI: https://doi.org/10.34185/1562-9945-5-148-2023-03

Keywords: speech analysis, speech synthesis, real-time signal processing, spectral envelope, F0 estimation

Abstract
WORLD is a vocoder-based speech synthesis system developed by M. Morise et al. and implemented in C++. It has been shown to outperform comparable algorithms in both performance and accuracy. However, it performs poorly in certain scenarios, in particular when the framework is applied to very short waveforms on a frame-by-frame basis. This paper reviews the issues in the C++ implementation of WORLD and proposes modified versions of its constituent algorithms that mitigate those issues. The resulting framework is tested on both synthetic signals and real recorded speech.
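The frame-by-frame scenario the abstract describes can be sketched as follows. This is a hypothetical illustration, not code from the paper: a real-time analyser receives the waveform in short fixed-size frames rather than as one complete recording, and frame lengths of a few milliseconds can be shorter than a single pitch period, which is the regime in which batch-oriented F0 estimators degrade. All function and parameter names below are illustrative assumptions.

```python
# Hypothetical sketch of frame-by-frame input (not the paper's code).
def split_into_frames(samples, frame_len, hop):
    """Return successive frames of `frame_len` samples, advancing by `hop`."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

fs = 16000                   # sample rate, Hz (assumed)
frame_len = fs * 5 // 1000   # a 5 ms frame: 80 samples
hop = frame_len              # non-overlapping frames
signal = [0.0] * fs          # one second of (silent) placeholder audio
frames = split_into_frames(signal, frame_len, hop)

# A 75 Hz voice has a ~213-sample period at 16 kHz, so an 80-sample
# frame cannot contain even one full pitch period -- the short-waveform
# situation in which the original WORLD algorithms struggle.
```

In an offline setting the estimator sees the whole signal at once; in the real-time setting each call sees only one such frame, which is why the constituent algorithms need modification.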
References
Arık, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., . . . Shoeybi, M. (2017, 06–11 Aug). Deep voice: Real-time neural text-to-speech. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th international conference on machine learning (Vol. 70, pp. 195–204). PMLR. Retrieved from https://proceedings.mlr.press/v70/arik17a.html
Daido, R., & Hisaminato, Y. (2016). A fast and accurate fundamental frequency estimator using recursive moving average filters. In Interspeech (pp. 2160–2164).
Dave, N. (2013). Feature extraction methods LPC, PLP and MFCC in speech recognition. International Journal for Advance Research in Engineering and Technology, 1 (6), 1–4.
De La Cuadra, P., Master, A. S., & Sapp, C. (2001). Efficient pitch detection techniques for interactive music. In ICMC.
Hy, N. Q., Lee, S. W., Tian, X., Dong, M., & Chng, E. (2015). High quality voice conversion using prosodic and high-resolution spectral features. CoRR, abs/1512.01809 . Retrieved from http://arxiv.org/abs/1512.01809
Jouvet, D., & Laprie, Y. (2017). Performance analysis of several pitch detection algorithms on simulated and real noisy speech data. In 2017 25th European Signal Processing Conference (EUSIPCO) (pp. 1614–1618). doi: 10.23919/EUSIPCO.2017.8081482
Kornblith, S., Lynch, G., Holters, M., Santos, J. F., Russell, S., Kickliter, J., . . . Smith, J. (2022, December). JuliaDSP/DSP.jl: v0.7.8. Zenodo. Retrieved from https://doi.org/10.5281/zenodo.7406426 doi: 10.5281/zenodo.7406426
Morise, M. (2012). PLATINUM: A method to extract excitation signals for voice synthesis system. Acoustical Science and Technology, 33 (2), 123–125. doi: 10.1250/ast.33.123
Morise, M. (2015). CheapTrick, a spectral envelope estimator for high-quality speech synthesis. Speech Communication, 67, 1–7. Retrieved from https://www.sciencedirect.com/science/article/pii/S0167639314000697 doi: 10.1016/j.specom.2014.09.003
Morise, M., Kawahara, H., & Katayose, H. (2009). Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech.
Morise, M., Yokomori, F., & Ozawa, K. (2016, 07). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, E99.D, 1877–1884. doi: 10.1587/transinf.2015EDP7457
Newmarch, J. (2017). Jack. In Linux sound programming (pp. 143–177). Berkeley, CA: Apress. Retrieved from https://doi.org/10.1007/978-1-4842-2496-0_7 doi: 10.1007/978-1-4842-2496-0_7
Peng, K., Ping, W., Song, Z., & Zhao, K. (2020, 13–18 Jul). Non-autoregressive neural text-to-speech. In H. D. III & A. Singh (Eds.), Proceedings of the 37th international conference on machine learning (Vol. 119, pp. 7586–7598). PMLR. Retrieved from https://proceedings.mlr.press/v119/peng20a.html
Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., . . . Miller, J. (2017). Deep voice 3: 2000-speaker neural text-to-speech. CoRR, abs/1710.07654 . Retrieved from http://arxiv.org/abs/1710.07654
Schwarz, D. (1998). Spectral Envelopes in Sound Analysis and Synthesis (Diplomarbeit Nr. 1622). Universität Stuttgart, Fakultät Informatik.
Sun, L., Kang, S., Li, K., & Meng, H. (2015). Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4869–4873). doi: 10.1109/ICASSP.2015.7178896
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., . . . Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. CoRR, abs/1609.03499. Retrieved from http://arxiv.org/abs/1609.03499
License
Copyright (c) 2024 System technologies
This work is licensed under a Creative Commons Attribution 4.0 International License.