Using deep artificial neural networks for multimodal data classification


  • Oleksandr Penia
  • Yevgeniya Sulema



multimodal data, classification, parallel computing, artificial neural networks.


Multimodal data analysis is gaining attention in recent research. Liang et al. (2022) provide a comprehensive overview of multimodal machine learning, highlighting its foundations, challenges and achievements in recent years. More problem-oriented works propose new methods and applications for multimodal ML: Ngiam et al. (2011) use joint audio and video data to improve speech recognition accuracy; Sun, Wang and Li (2018) apply multimodal classification to breast cancer prognosis prediction; Mao et al. (2014) propose a multimodal recurrent network architecture that generates text descriptions of images. However, such works usually focus on the task itself and the methods used to solve it, not on integrating multimodal data processing into other software systems. The goal of this research is to propose a way to conduct multimodal data processing specifically as part of a digital twin system, where efficiency and near-real-time operation are required. The paper presents an approach to parallel multimodal data classification that adapts to the available computing power. The method is modular and scalable and is intended for use in digital twin applications as part of their analysis and modeling tools. A detailed example of such a software module is then discussed. It uses multimodal data from open datasets to detect and classify the behavior of pets using deep learning models. Videos are processed using two artificial neural networks: the YOLOv3 object detection network processes individual video frames, and a relatively simple convolutional network classifies sounds based on their frequency spectra. The constructed module uses a producer-consumer parallel processing pattern and processes 5 frames of video per second on the available hardware; this rate can be significantly improved by using GPU acceleration or more parallel processing threads.
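The producer-consumer pattern mentioned above can be illustrated with a minimal Python sketch. It is not the module described in the paper: the names `run_pipeline`, `classify_frame`, `n_consumers` and `max_queue` are hypothetical, and `classify_frame` is a placeholder standing in for a real model call such as an object detector. The sketch only shows how one producer can feed a bounded queue while several consumer threads classify items in parallel, with the number of consumers defaulting to the CPU count as one possible way to adapt to available computing power.

```python
import os
import queue
import threading

_STOP = object()  # sentinel telling a consumer that no more work will arrive


def classify_frame(frame):
    """Hypothetical placeholder for a real model call (e.g. an object detector)."""
    return "even" if frame % 2 == 0 else "odd"


def _consume(work_queue, results, lock, classify):
    """Consumer loop: take tasks off the queue until the sentinel arrives."""
    while True:
        task = work_queue.get()
        if task is _STOP:
            break
        index, item = task
        label = classify(item)  # the expensive step, executed by each consumer
        with lock:
            results[index] = label


def run_pipeline(items, classify, n_consumers=None, max_queue=8):
    """Classify items with several consumer threads; return labels in input order."""
    if n_consumers is None:
        # Assumption: one consumer per CPU core is a reasonable default.
        n_consumers = os.cpu_count() or 1

    work_queue = queue.Queue(maxsize=max_queue)  # bounded, so the producer cannot run far ahead
    results = {}
    lock = threading.Lock()

    workers = [
        threading.Thread(target=_consume, args=(work_queue, results, lock, classify))
        for _ in range(n_consumers)
    ]
    for worker in workers:
        worker.start()

    # Producer: enqueue each item together with its position in the stream.
    for index, item in enumerate(items):
        work_queue.put((index, item))

    # One sentinel per consumer so that every thread terminates.
    for _ in workers:
        work_queue.put(_STOP)
    for worker in workers:
        worker.join()

    return [results[i] for i in range(len(results))]
```

Calling `run_pipeline(range(10), classify_frame, n_consumers=3)` returns one label per input item, in the original order. In CPython, threads mainly help when the classifier spends its time in native code that releases the interpreter lock; a process pool is a possible alternative for pure-Python workloads.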


Liang P. P., Zadeh A., Morency L. P. Foundations and recent trends in multimodal machine learning: Principles, challenges, and open questions //arXiv preprint arXiv:2209.03430. – 2022.

Ngiam J. et al. Multimodal deep learning //Proceedings of the 28th International Conference on Machine Learning (ICML-11). – 2011. – P. 689-696.

Sun D., Wang M., Li A. A multimodal deep neural network for human breast cancer prognosis prediction by integrating multi-dimensional data //IEEE/ACM Transactions on Computational Biology and Bioinformatics. – 2018. – Vol. 16. – No. 3. – P. 841-850.

Mao J. et al. Explain images with multimodal recurrent neural networks //arXiv preprint arXiv:1410.1090. – 2014.

Chen H. et al. VGGSound: A large-scale audio-visual dataset //ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). – IEEE, 2020. – P. 721-725.

Dogs vs. Cats Dataset. URL:

Redmon J., Farhadi A. YOLOv3: An incremental improvement //arXiv preprint arXiv:1804.02767. – 2018.

Bjorck N. et al. Understanding batch normalization //Advances in Neural Information Processing Systems. – 2018. – Vol. 31.

You Y., Gitman I., Ginsburg B. Large batch training of convolutional networks //arXiv preprint arXiv:1708.03888. – 2017.