A method and software tool for video object recognition and analysis
DOI:
https://doi.org/10.34185/1562-9945-3-164-2026-14
Keywords:
computer vision, video analytics, object detection, object tracking, multidimensional model, OLAP, YOLO, VLM, Moondream2, semantic enrichment
Abstract
The automation of video analytics is an active area of research. Two-stage detectors based on convolutional neural networks (CNNs), such as Faster R-CNN, demonstrate high accuracy but process frames too slowly for real-time streams. In contrast, single-stage detectors in the YOLO family achieve real-time throughput. Research on object tracking (tracking-by-detection) highlights DeepSORT and ByteTrack as the most effective algorithms for associating detections across frames. Multimodal vision-language models (VLMs) open up new possibilities for semantic scene description, although their computational cost hinders deployment in high-load systems. Multidimensional data analysis (OLAP) technologies are also considered, specifically the SurvCube and VideoCube models, which integrate video processing results into structured cubes; however, they often offer limited flexibility for creating new semantic hierarchies.
The objective of the research is to increase the analysis speed of ultra-large video datasets by developing a method and software tool for the automated extraction of structured data and its subsequent integration into a multidimensional OLAP model that supports parallel processing and semantic information enrichment.
A method for cascaded video data processing is proposed, combining rapid object detection with the YOLOv26m architecture and dynamic semantic scene enrichment using the Moondream2 multimodal model. During the preprocessing stage, metadata (coordinates, time) are extracted and the load is reduced by downsampling the frame rate to 2–5 frames per second. The YOLOv26m model is used for object detection and classification, achieving an mAP50 of 0.814 at 81.03 FPS. Tracking assigns each object a unique object_id, which reduces redundant records for the same object. Semantic context enrichment (landscape type, events) is carried out by the multimodal Moondream2 model using text prompts, allowing the system to adapt to new scenarios without retraining the network. The data are integrated into a multidimensional OLAP model using a star schema whose dimensions are time, space, object type, environment, and event type. The software uses a microservices architecture with a message broker (RabbitMQ/AMQP) for asynchronous communication between the detection and semantic analysis clusters.
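As an illustrative sketch only, the star schema described above could be laid out as follows with Python's built-in sqlite3 module. The table and column names (dim_time, fact_observation, and so on), the sample rows, and the choice of SQLite are assumptions made for this example; the abstract specifies only the five dimensions and the object_id key.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative,
# not taken from the paper. One fact table references five dimensions.
SCHEMA = """
CREATE TABLE dim_time        (time_id INTEGER PRIMARY KEY, ts_seconds REAL);
CREATE TABLE dim_space       (space_id INTEGER PRIMARY KEY, camera TEXT);
CREATE TABLE dim_object_type (object_type_id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE dim_environment (environment_id INTEGER PRIMARY KEY, landscape TEXT);
CREATE TABLE dim_event_type  (event_type_id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE fact_observation (
    object_id      INTEGER NOT NULL,  -- track identifier from the tracker
    time_id        INTEGER NOT NULL REFERENCES dim_time(time_id),
    space_id       INTEGER NOT NULL REFERENCES dim_space(space_id),
    object_type_id INTEGER NOT NULL REFERENCES dim_object_type(object_type_id),
    environment_id INTEGER NOT NULL REFERENCES dim_environment(environment_id),
    event_type_id  INTEGER NOT NULL REFERENCES dim_event_type(event_type_id),
    confidence     REAL
);
"""


def build_cube(conn):
    """Create the tables and load a few made-up rows for demonstration."""
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO dim_time VALUES (?, ?)", [(1, 0.0), (2, 0.5)])
    conn.execute("INSERT INTO dim_space VALUES (1, 'cam-1')")
    conn.executemany("INSERT INTO dim_object_type VALUES (?, ?)",
                     [(1, "car"), (2, "person")])
    conn.execute("INSERT INTO dim_environment VALUES (1, 'urban')")
    conn.executemany("INSERT INTO dim_event_type VALUES (?, ?)",
                     [(1, "none"), (2, "crossing")])
    conn.executemany(
        "INSERT INTO fact_observation VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (101, 1, 1, 1, 1, 1, 0.91),  # car 101, first sampled frame
            (101, 2, 1, 1, 1, 1, 0.89),  # same car, next sampled frame
            (102, 2, 1, 2, 1, 2, 0.77),  # person 102, crossing event
        ],
    )


def objects_by_type_and_landscape(conn):
    """Roll up the facts: distinct tracked objects per object type and landscape."""
    return conn.execute(
        """
        SELECT o.label, e.landscape, COUNT(DISTINCT f.object_id)
        FROM fact_observation AS f
        JOIN dim_object_type AS o ON o.object_type_id = f.object_type_id
        JOIN dim_environment AS e ON e.environment_id = f.environment_id
        GROUP BY o.label, e.landscape
        ORDER BY o.label
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_cube(conn)
rows = objects_by_type_and_landscape(conn)
print(rows)
```

Counting distinct object_id values rather than rows means an object seen in several sampled frames is counted once, which is one plausible way the persistent identifier could keep aggregates free of per-frame duplication.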
A scalable software system was developed that transforms unstructured video streams into analytical cubes. It was experimentally confirmed that the Moondream2 model provides the best balance between description quality (METEOR: 25.73) and processing speed (0.422 s/frame) among the models tested. Scalability testing demonstrated near-linear speedup: increasing the number of parallel workers to 10 reduced the total video dataset processing time by a factor of 8.95. The proposed solution is effective for high-load monitoring systems and rapid decision-making based on deep analytics.
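The reported scaling figure can be read as a parallel efficiency, that is, speedup divided by worker count. The short calculation below is a worked illustration using only the numbers quoted above; the function name is hypothetical, and the per-worker throughput assumes a worker handles one frame at a time.

```python
def parallel_efficiency(speedup, workers):
    """Return speedup per worker; 1.0 would be ideal linear scaling."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


# Figures quoted in the abstract.
efficiency = parallel_efficiency(8.95, 10)

# Implied single-worker throughput of the semantic stage at 0.422 s/frame.
frames_per_second = 1 / 0.422

print(f"efficiency: {efficiency:.3f}")
print(f"frames per second per worker: {frames_per_second:.2f}")
```

An efficiency of about 0.895 indicates that roughly a tenth of the ideal tenfold gain is lost to overhead. The implied throughput of about 2.37 frames per second per worker sits within the 2–5 frames per second range used for downsampling.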
References
Mahmud, A., & Setiawan, A. A. A. (2021). A survey of convolutional neural networks in object detection. Int. J. Adv. Comput. Sci. Appl.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Abouelyazid, M. (2023). Comparative evaluation of SORT, DeepSORT, and ByteTrack for multiple object tracking in highway videos. International Journal of Sustainable Infrastructure for Cities and Societies, 8(11), 42–52.
Din, M. U., Akram, W., Bakht, A. B., & Hussain, I. (2026). LLM-VLM fusion framework for autonomous maritime port inspection using a heterogeneous UAV-USV system. arXiv. https://doi.org/10.48550/arXiv.2601.13096
Lee, H., Park, S., & Yoo, J.-H. (2013). A data cube model for surveillance video indexing and retrieval. In SIGMAP 2013: International Conference on Signal Processing and Multimedia Applications. https://www.scitepress.org/Papers/2013/46121/46121.pdf
Wu, Y., Zhang, C., Lu, Y., Su, Y., Jiang, X., Xiang, Z., & Li, Z. (2025). VideoARD: An analysis-ready multi-level data model for remote sensing video. Remote Sens., 17(22), Art. 3746. https://doi.org/10.3390/rs17223746
License
Copyright (c) 2026 System technologies

This work is licensed under a Creative Commons Attribution 4.0 International License.









