ADAPTATION OF THE WORLD FRAMEWORK FOR FRAME-BY-FRAME REAL-TIME SPEECH ANALYSIS

Authors

  • Eugene Koshel

DOI:

https://doi.org/10.34185/1562-9945-5-148-2023-03

Keywords:

speech analysis, speech synthesis, real-time signal processing, spectral envelope, F0 estimation

Abstract

WORLD is a vocoder-based speech analysis and synthesis system developed by M. Morise et al. and implemented in C++. It has been demonstrated to achieve better performance and accuracy than competing algorithms. However, it does not perform well in certain scenarios, particularly when the framework is applied to very short waveforms on a frame-by-frame basis. This paper reviews the issues of the C++ implementation of WORLD and proposes modified versions of its constituent algorithms that attempt to mitigate those issues. The resulting framework is tested both on synthetic signals and on real recorded speech.
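To make the frame-by-frame setting concrete, the sketch below runs a pitch estimator over short overlapping frames of a signal, one frame at a time, as a real-time analyzer would. It deliberately uses a generic normalized-autocorrelation estimator, not WORLD's DIO/Harvest algorithm; the function name, parameters, and the 40 ms / 10 ms framing are illustrative assumptions, though the 10 ms hop does match WORLD's default frame period.

```python
import math

def estimate_f0_autocorr(frame, fs, f0_min=70.0, f0_max=500.0):
    """Estimate F0 of one short frame via normalized autocorrelation.

    Generic illustration of per-frame pitch estimation; this is NOT
    WORLD's algorithm, and the name/parameters are hypothetical.
    """
    n = len(frame)
    # Search only lags corresponding to the plausible pitch range.
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), n - 1)
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return 0.0  # silent frame: report unvoiced
    best_lag, best_r = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    return fs / best_lag if best_lag > 0 else 0.0

# Frame-by-frame loop over a synthetic 200 Hz sine at 16 kHz.
fs = 16000
signal = [math.sin(2 * math.pi * 200.0 * i / fs) for i in range(fs // 10)]
frame_len = 640  # 40 ms frame: long enough to hold several pitch periods
hop = 160        # 10 ms hop, matching WORLD's default frame period
f0_track = [estimate_f0_autocorr(signal[i:i + frame_len], fs)
            for i in range(0, len(signal) - frame_len, hop)]
print(round(f0_track[0]))  # recovers the 200 Hz fundamental
```

The difficulty the paper addresses appears when `frame_len` shrinks toward one or two pitch periods: with so little context, estimators of this kind (and WORLD's batch-oriented ones) lose reliability, which is what motivates the modified per-frame algorithms.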

References

Arık, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., . . . Shoeybi, M. (2017, 06–11 Aug). Deep voice: Real-time neural text-to-speech. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th international conference on machine learning (Vol. 70, pp. 195–204). PMLR. Retrieved from https://proceedings.mlr.press/v70/arik17a.html

Daido, R., & Hisaminato, Y. (2016). A fast and accurate fundamental frequency estimator using recursive moving average filters. In Interspeech (pp. 2160–2164).

Dave, N. (2013). Feature extraction methods LPC, PLP and MFCC in speech recognition. International journal for advance research in engineering and technology, 1 (6), 1–4.

De La Cuadra, P., Master, A. S., & Sapp, C. (2001). Efficient pitch detection techniques for interactive music. In Icmc.

Hy, N. Q., Lee, S. W., Tian, X., Dong, M., & Chng, E. (2015). High quality voice conversion using prosodic and high-resolution spectral features. CoRR, abs/1512.01809 . Retrieved from http://arxiv.org/abs/1512.01809

Jouvet, D., & Laprie, Y. (2017). Performance analysis of several pitch detection algorithms on simulated and real noisy speech data. In 2017 25th european signal processing conference (eusipco) (p. 1614-1618). doi: 10.23919/EUSIPCO.2017.8081482

Kornblith, S., Lynch, G., Holters, M., Santos, J. F., Russell, S., Kickliter, J., . . . Smith, J. (2022, December). Juliadsp/dsp.jl: v0.7.8. Zenodo. Retrieved from https://doi.org/10.5281/zenodo.7406426 doi: 10.5281/zenodo.7406426

Morise, M. (2012). Platinum: A method to extract excitation signals for voice synthesis system. Acoustical Science and Technology, 33 (2), 123-125. doi: 10.1250/ast.33.123

Morise, M. (2015). Cheaptrick, a spectral envelope estimator for high-quality speech synthesis. Speech Communication, 67, 1-7. Retrieved from https://www.sciencedirect.com/science/article/pii/S0167639314000697 doi: 10.1016/j.specom.2014.09.003

Morise, M., Kawahara, H., & Katayose, H. (2009). Fast and reliable f0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech.

Morise, M., Yokomori, F., & Ozawa, K. (2016, 07). World: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, E99.D, 1877-1884. doi: 10.1587/transinf.2015EDP7457

Newmarch, J. (2017). Jack. In Linux sound programming (pp. 143–177). Berkeley, CA: Apress. doi: 10.1007/978-1-4842-2496-0_7

Peng, K., Ping, W., Song, Z., & Zhao, K. (2020, 13–18 Jul). Non-autoregressive neural text-to-speech. In H. D. III & A. Singh (Eds.), Proceedings of the 37th international conference on machine learning (Vol. 119, pp. 7586–7598). PMLR. Retrieved from https://proceedings.mlr.press/v119/peng20a.html

Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., . . . Miller, J. (2017). Deep voice 3: 2000-speaker neural text-to-speech. CoRR, abs/1710.07654 . Retrieved from http://arxiv.org/abs/1710.07654

Schwarz, D. (1998). Spectral Envelopes in Sound Analysis and Synthesis (Diplomarbeit Nr. 1622). Universität Stuttgart, Fakultät Informatik.

Sun, L., Kang, S., Li, K., & Meng, H. (2015). Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In 2015 ieee international conference on acoustics, speech and signal processing (icassp) (p. 4869-4873). doi: 10.1109/ICASSP.2015.7178896

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., . . . Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. CoRR, abs/1609.03499 . Retrieved from http://arxiv.org/abs/1609.03499

Published

2024-03-20