An approach to recognizing GUI elements as images based on machine learning

Authors

DOI:

https://doi.org/10.34185/1562-9945-3-164-2026-24

Keywords:

recognition, image, GUI, model, deep learning, testing

Abstract

Analysis of Recent Research and Publications. In recent years, computer vision tasks, in particular object detection and image classification, have advanced significantly thanks to deep learning methods. The most widely used models include YOLO, RetinaNet, EfficientDet, Cascade Mask R-CNN, and Swin Transformer, which demonstrate high efficiency in object recognition tasks of varying complexity. At the same time, the application of these models to recognizing graphical user interface (GUI) elements in automated testing of web applications remains insufficiently covered. Traditional approaches to UI testing rely on analysis of the DOM structure, which makes tests vulnerable to layout changes and complicates their maintenance.

Research Objective. The purpose of this work is to analyze modern methods of object recognition in images and to justify the choice of an effective machine learning model for recognizing graphical user interface elements.

Presentation of the Main Research Material. The paper proposes an approach to recognizing interface elements by analyzing web page screenshots as digital images. The proposed process consists of capturing a screenshot with Selenium, saving the image to storage, and passing it to a machine learning model for processing. The model determines the classes of the detected objects, their coordinates, and their sizes, and produces output in JSON format. The resulting coordinates are then used to interact with interface elements in an automated testing environment (PyTest).
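
The screenshot-to-JSON process described above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Detection` fields, the function names, and the `detect` callable (a stand-in for whatever trained model is used) are all assumptions made for the example. The only Selenium call relied on is `WebDriver.save_screenshot`, which returns a boolean.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Detection:
    """One recognized GUI element (hypothetical schema, not from the paper)."""
    label: str         # predicted element class, e.g. "button"
    x: float           # left edge in screenshot pixels
    y: float           # top edge in screenshot pixels
    width: float       # box width in screenshot pixels
    height: float      # box height in screenshot pixels
    confidence: float  # model score in [0, 1]


def capture_screenshot(driver, path: Path) -> Path:
    """Save the current page as an image using a Selenium-style driver."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Selenium's save_screenshot returns False if the file could not be written.
    if not driver.save_screenshot(str(path)):
        raise RuntimeError(f"could not save screenshot to {path}")
    return path


def detections_to_json(detections: List[Detection]) -> str:
    """Serialize detections to a JSON array of objects."""
    return json.dumps([asdict(d) for d in detections], indent=2)


def recognize_page(
    driver,
    detect: Callable[[Path], List[Detection]],
    path: Path,
) -> str:
    """Capture a screenshot, run the detector on it, and return JSON."""
    screenshot = capture_screenshot(driver, path)
    return detections_to_json(detect(screenshot))
```

In a PyTest suite, `driver` would typically be supplied by a fixture, and `detect` would wrap the chosen model's inference call; a test would then parse the returned JSON and act on the coordinates.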

A comparative analysis of modern object detection models is conducted, covering Swin-L, EfficientDet-D7, Cascade Mask R-CNN, RetinaNet, and YOLO. Particular attention is paid to the YOLO family, which provides high image processing speed while maintaining sufficient accuracy. This matters for interface element recognition because it allows the model to be integrated into the automated testing process without significantly increasing test execution time. In addition, YOLO produces output in a convenient form (bounding boxes) that can be used directly to interact with interface elements.
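
To illustrate how a bounding box can drive interaction, the sketch below converts a box into a click point. It assumes the box is given in the normalized center format `(x_center, y_center, width, height)` commonly used with YOLO, with values in [0, 1]; the abstract does not state which box format is used. The function names and the `device_pixel_ratio` parameter (for screenshots whose pixel size differs from the browser's CSS pixel size) are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_center, y_center, width, height), normalized


def yolo_box_to_pixels(box: Box, image_size: Tuple[int, int]) -> Tuple[int, int, int, int]:
    """Convert a normalized center-format box to (left, top, width, height) in pixels."""
    x_c, y_c, w, h = box
    if not all(0.0 <= v <= 1.0 for v in box):
        raise ValueError(f"expected normalized values in [0, 1], got {box}")
    img_w, img_h = image_size
    width_px = w * img_w
    height_px = h * img_h
    left = x_c * img_w - width_px / 2
    top = y_c * img_h - height_px / 2
    return round(left), round(top), round(width_px), round(height_px)


def yolo_box_to_click_point(
    box: Box,
    image_size: Tuple[int, int],
    device_pixel_ratio: float = 1.0,
) -> Tuple[int, int]:
    """Return the box center in viewport (CSS) pixels, suitable for a pointer click."""
    if device_pixel_ratio <= 0:
        raise ValueError("device_pixel_ratio must be positive")
    x_c, y_c, _, _ = box
    if not (0.0 <= x_c <= 1.0 and 0.0 <= y_c <= 1.0):
        raise ValueError(f"expected a normalized center in [0, 1], got {(x_c, y_c)}")
    img_w, img_h = image_size
    # Screenshot pixels divided by the pixel ratio give CSS pixels.
    return round(x_c * img_w / device_pixel_ratio), round(y_c * img_h / device_pixel_ratio)
```

The resulting point could then be handed to a pointer action in the test, for example through Selenium's `ActionBuilder` pointer actions. Clicking the box center rather than a corner is a simple design choice that tolerates small localization errors.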

Conclusions. The research found that computer vision methods are a promising direction for increasing the efficiency of automated UI testing. The proposed approach removes the dependence on the DOM structure, which makes tests more resilient to interface changes and reduces their maintenance costs. Among the models considered, YOLO is the most appropriate for practical application, since it offers the best balance between speed, accuracy, and ease of integration. The results can be used in the further development of intelligent web application testing systems.

Published

2026-04-30