Eugene Koshel



speech analysis, speech synthesis, real-time signal processing, spectral envelope, F0 estimation


WORLD is a vocoder-based speech synthesis system developed by M. Morise et al. and implemented in C++. It was reported to offer better performance and accuracy than other algorithms. However, it performs poorly in certain scenarios, particularly when the framework is applied to very short waveforms on a frame-by-frame basis. This paper reviews the issues in the C++ implementation of WORLD and proposes modified versions of its constituent algorithms that attempt to mitigate them. The resulting framework is tested on both synthetic signals and real recorded speech.


Arık, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., . . . Shoeybi, M. (2017, 06–11 Aug). Deep voice: Real-time neural text-to-speech. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th international conference on machine learning (Vol. 70, pp. 195–204). PMLR.

Daido, R., & Hisaminato, Y. (2016). A fast and accurate fundamental frequency estimator using recursive moving average filters. In Interspeech (pp. 2160–2164).

Dave, N. (2013). Feature extraction methods LPC, PLP and MFCC in speech recognition. International journal for advance research in engineering and technology, 1(6), 1–4.

De La Cuadra, P., Master, A. S., & Sapp, C. (2001). Efficient pitch detection techniques for interactive music. In ICMC.

Hy, N. Q., Lee, S. W., Tian, X., Dong, M., & Chng, E. (2015). High quality voice conversion using prosodic and high-resolution spectral features. CoRR, abs/1512.01809.

Jouvet, D., & Laprie, Y. (2017). Performance analysis of several pitch detection algorithms on simulated and real noisy speech data. In 2017 25th European signal processing conference (EUSIPCO) (pp. 1614–1618). doi: 10.23919/EUSIPCO.2017.8081482

Kornblith, S., Lynch, G., Holters, M., Santos, J. F., Russell, S., Kickliter, J., . . . Smith, J. (2022, December). JuliaDSP/DSP.jl: v0.7.8. Zenodo. doi: 10.5281/zenodo.7406426

Morise, M. (2012). Platinum: A method to extract excitation signals for voice synthesis system. Acoustical Science and Technology, 33(2), 123–125. doi: 10.1250/ast.33.123

Morise, M. (2015). CheapTrick, a spectral envelope estimator for high-quality speech synthesis. Speech Communication, 67, 1–7.

Morise, M., Kawahara, H., & Katayose, H. (2009). Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech.

Morise, M., Yokomori, F., & Ozawa, K. (2016, 07). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, E99.D, 1877–1884. doi: 10.1587/transinf.2015EDP7457

Newmarch, J. (2017). Jack. In Linux sound programming (pp. 143–177). Berkeley, CA: Apress. doi: 10.1007/978-1-4842-2496-0_7

Peng, K., Ping, W., Song, Z., & Zhao, K. (2020, 13–18 Jul). Non-autoregressive neural text-to-speech. In H. D. III & A. Singh (Eds.), Proceedings of the 37th international conference on machine learning (Vol. 119, pp. 7586–7598). PMLR.

Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., . . . Miller, J. (2017). Deep voice 3: 2000-speaker neural text-to-speech. CoRR, abs/1710.07654.

Schwarz, D. (1998). Spectral Envelopes in Sound Analysis and Synthesis (Diplomarbeit Nr. 1622). Universität Stuttgart, Fakultät Informatik.

Sun, L., Kang, S., Li, K., & Meng, H. (2015). Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4869–4873). doi: 10.1109/ICASSP.2015.7178896

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., . . . Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. CoRR, abs/1609.03499.