Analysis of emotions using facial expressions and voice features

Authors

  • I.S. Dmytriieva
  • D.V. Bimalov

DOI:

https://doi.org/10.34185/1562-9945-3-158-2025-03

Keywords:

emotion recognition, audio emotion recognition, facial emotion recognition, machine learning, SVM, computer vision, deep learning, CNN.

Abstract

This article presents methods for recognizing human emotions. Emotion recognition is a rapidly developing field of artificial intelligence that is essential for improving human-computer interaction. However, most current systems rely on a single data source, either voice characteristics or facial expressions, which reduces recognition accuracy and robustness in complex, real-world conditions, since human emotions are multifaceted and variable.

This article therefore considers emotion recognition over two main channels, vocal and visual, and analyzes existing approaches in each. For emotion analysis, ResNet architectures are considered for images and speech emotion recognition (SER) models for sound. Particular attention is paid to the extraction and processing of acoustic characteristics such as intonation, volume, speech rate, and pause duration, as well as to computer vision methods for detecting facial expressions such as a smile, pursed lips, or furrowed brows.

Emotion recognition from facial expressions and emotion recognition from voice are two distinct technologies, each of which uses a different type of data to analyze and interpret emotions. Recognition from facial expressions is a challenging task in computer vision and deep learning, with numerous applications across industries. Recognition from voice is based on a complex analysis of many acoustic features, such as frequency, volume, speech rate, and intonation; these features can be classified using mathematical and statistical models, such as machine learning methods and neural networks. By considering both channels together, this research advances the field by providing a better understanding of human emotional states.
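The acoustic characteristics discussed above (volume, speech rate, pause duration) can be illustrated with a minimal feature-extraction sketch. This is not the authors' SER pipeline; the frame sizes, silence threshold, and synthetic test signal are illustrative assumptions, with root-mean-square energy standing in for volume, zero-crossing rate as a rough proxy related to speech dynamics, and the fraction of near-silent frames approximating pause duration:

```python
import numpy as np

SR = 16_000  # assumed sample rate, Hz

def acoustic_features(signal, sr=SR, frame_len=400, hop=160, silence_db=-40.0):
    """Compute simple frame-level acoustic features from a mono waveform:
    mean RMS energy (volume), mean zero-crossing rate, and the fraction
    of frames quieter than `silence_db` relative to the loudest frame
    (a crude pause-ratio estimate). Frame/hop sizes are illustrative."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    rms = np.sqrt((frames ** 2).mean(axis=1))
    # Level of each frame in dB relative to the loudest frame.
    db = 20.0 * np.log10(np.maximum(rms, 1e-10) / (rms.max() + 1e-10))
    # Fraction of sample-to-sample sign changes within each frame.
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    return {
        "mean_volume": float(rms.mean()),
        "mean_zcr": float(zcr.mean()),
        "pause_ratio": float((db < silence_db).mean()),
    }

# Synthetic clip: 0.5 s of a 220 Hz tone followed by 0.5 s of silence,
# so roughly half of the frames should be classified as pauses.
t = np.arange(SR // 2) / SR
clip = np.concatenate([0.5 * np.sin(2 * np.pi * 220.0 * t),
                       np.zeros(SR // 2)])
feats = acoustic_features(clip)
```

In a full system, feature vectors of this kind (typically extended with pitch and spectral descriptors such as MFCCs) would be fed to a classifier such as an SVM or a CNN to predict the emotion label.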

References

B. Mandal, A. Okeukwu, Y. Theis. Masked face recognition using ResNet-50 (2021). DOI: 10.48550/ARXIV.2104.08997

E. Boitel, A. Mohasseb, E. Haig. MIST: Multimodal emotion recognition using DeBERTa for text, Semi-CNN for speech, ResNet-50 for facial, and 3D-CNN for motion analysis. Expert Systems with Applications, vol. 270, 25 April 2025, 126236. DOI: 10.1016/j.eswa.2024.126236

Z. Huang, M. Dong, Q. Mao, Y. Zhan. Speech emotion recognition using CNN. Proceedings of the 22nd ACM International Conference on Multimedia, Association for Computing Machinery, New York, NY, USA (2014), pp. 801-804. DOI: 10.1145/2647868.2654984

E. Lakomkin, C. Weber, S. Magg, S. Wermter. Reusing neural speech representations for auditory emotion recognition (2018). DOI: 10.48550/ARXIV.1803.11508

Published

2025-04-23