ADAPTATION OF THE WORLD FRAMEWORK FOR FRAME-BY-FRAME REAL-TIME SPEECH ANALYSIS

Authors

  • Eugene Koshel

DOI:

https://doi.org/10.34185/1562-9945-5-148-2023-03

Keywords:

speech analysis, speech synthesis, real-time signal processing, spectral envelope, F0 estimation

Abstract

WORLD is a vocoder-based speech synthesis system developed by M. Morise et al. and implemented in C++. It has been shown to offer better performance and accuracy than other algorithms. However, it does not perform well in certain scenarios, particularly when the framework is applied to very short waveforms on a frame-by-frame basis. This paper reviews the issues of the C++ implementation of WORLD and proposes modified versions of its constituent algorithms that attempt to mitigate those issues. The resulting framework is tested on both synthetic signals and real recorded speech.

References

Arık, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., . . . Shoeybi, M. (2017, 06–11 Aug). Deep voice: Real-time neural text-to-speech. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th international conference on machine learning (Vol. 70, pp. 195–204). PMLR. Retrieved from https://proceedings.mlr.press/v70/arik17a.html

Daido, R., & Hisaminato, Y. (2016). A fast and accurate fundamental frequency estimator using recursive moving average filters. In Interspeech (pp. 2160–2164).

Dave, N. (2013). Feature extraction methods LPC, PLP and MFCC in speech recognition. International journal for advance research in engineering and technology, 1(6), 1–4.

De La Cuadra, P., Master, A. S., & Sapp, C. (2001). Efficient pitch detection techniques for interactive music. In ICMC.

Hy, N. Q., Lee, S. W., Tian, X., Dong, M., & Chng, E. (2015). High quality voice conversion using prosodic and high-resolution spectral features. CoRR, abs/1512.01809. Retrieved from http://arxiv.org/abs/1512.01809

Jouvet, D., & Laprie, Y. (2017). Performance analysis of several pitch detection algorithms on simulated and real noisy speech data. In 2017 25th European signal processing conference (EUSIPCO) (pp. 1614–1618). doi: 10.23919/EUSIPCO.2017.8081482

Kornblith, S., Lynch, G., Holters, M., Santos, J. F., Russell, S., Kickliter, J., . . . Smith, J. (2022, December). JuliaDSP/DSP.jl: v0.7.8. Zenodo. doi: 10.5281/zenodo.7406426

Morise, M. (2012). Platinum: A method to extract excitation signals for voice synthesis system. Acoustical Science and Technology, 33(2), 123–125. doi: 10.1250/ast.33.123

Morise, M. (2015). CheapTrick, a spectral envelope estimator for high-quality speech synthesis. Speech Communication, 67, 1–7. Retrieved from https://www.sciencedirect.com/science/article/pii/S0167639314000697 doi: 10.1016/j.specom.2014.09.003

Morise, M., Kawahara, H., & Katayose, H. (2009). Fast and reliable f0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech.

Morise, M., Yokomori, F., & Ozawa, K. (2016, 07). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, E99.D, 1877–1884. doi: 10.1587/transinf.2015EDP7457

Newmarch, J. (2017). Jack. In Linux sound programming (pp. 143–177). Berkeley, CA: Apress. Retrieved from https://doi.org/10.1007/978-1-4842-2496-0_7 doi: 10.1007/978-1-4842-2496-0_7

Peng, K., Ping, W., Song, Z., & Zhao, K. (2020, 13–18 Jul). Non-autoregressive neural text-to-speech. In H. D. III & A. Singh (Eds.), Proceedings of the 37th international conference on machine learning (Vol. 119, pp. 7586–7598). PMLR. Retrieved from https://proceedings.mlr.press/v119/peng20a.html

Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., . . . Miller, J. (2017). Deep voice 3: 2000-speaker neural text-to-speech. CoRR, abs/1710.07654. Retrieved from http://arxiv.org/abs/1710.07654

Schwarz, D. (1998). Spectral Envelopes in Sound Analysis and Synthesis (Diplomarbeit Nr. 1622). Universität Stuttgart, Fakultät Informatik.

Sun, L., Kang, S., Li, K., & Meng, H. (2015). Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4869–4873). doi: 10.1109/ICASSP.2015.7178896

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., . . . Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. CoRR, abs/1609.03499. Retrieved from http://arxiv.org/abs/1609.03499

Published

2023-12-19