Data distillation in machine learning: mathematical model and optimization methods

Authors

  • D. Prokopovych-Tkachenko
  • V. Zvieriev
  • V. Bushkov
  • B. Khrushkov

DOI:

https://doi.org/10.34185/1562-9945-4-159-2025-18

Keywords:

data distillation, machine learning, dataset optimization, generative models, gradient and entropy-based approaches, computational efficiency.

Abstract

The article examines data distillation in machine learning, an approach for constructing compact datasets that preserve model performance. Large datasets are central to modern deep learning, but processing them requires substantial computational resources. Data distillation seeks to reduce dataset size by selecting the most informative samples, thereby streamlining training, removing redundant information, and improving model generalization. The proposed mathematical model formalizes data distillation as an optimization problem: select a subset of the original data that minimizes information loss. The evaluation criteria include a gradient-based approach, which measures the influence of individual samples on training through changes in the gradient of the loss function; an entropy-based approach, which measures the model's uncertainty about specific samples; and a representative subset method, which minimizes the distance between the original and distilled datasets. The study examines key distillation methods: generative models (GANs, diffusion models), active learning (sample selection based on entropy), and clustering methods (K-means, DBSCAN) for identifying representative samples. Experimental analysis shows that a distilled dataset can reduce data volume by a factor of ten while lowering model accuracy by only about 2%, and that training time falls by a factor of eight. These results support the effectiveness of data distillation as a way to balance model performance against computational cost. The authors also note open challenges, including the choice of an optimal distillation strategy and the risk of losing critical information when an inappropriate subset is chosen.
Data distillation is therefore a promising research direction for developing more efficient, resource-saving models. It opens new possibilities for applying deep neural networks in practice, particularly in resource-constrained environments. Integrating data distillation with modern deep learning architectures could extend these benefits by improving transfer learning, accelerating convergence, and reducing dependence on large-scale labeled datasets.
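
The entropy-based and gradient-based criteria named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy probabilities, and the subset size k are assumptions chosen for illustration. The gradient score uses the norm of the cross-entropy gradient with respect to the logits (predicted probabilities minus the one-hot label) as one possible proxy for a sample's influence on training.

```python
import numpy as np

def entropy_scores(probs, eps=1e-12):
    """Shannon entropy of each row of predicted class probabilities."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -np.sum(probs * np.log(probs), axis=1)

def gradient_norm_scores(probs, labels):
    """Norm of the cross-entropy gradient w.r.t. the logits: ||p - onehot(y)||."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.linalg.norm(probs - onehot, axis=1)

def select_top_k(scores, k):
    """Indices of the k highest-scoring samples, highest first."""
    order = np.argsort(-scores, kind="stable")
    return order[:k]

# Toy model outputs for four samples over three classes (assumed values).
probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # near-uniform, highly uncertain
    [0.70, 0.20, 0.10],
    [0.05, 0.90, 0.05],  # confident, but wrong if the label is 0
])
labels = np.array([0, 2, 0, 0])

entropy_idx = select_top_k(entropy_scores(probs), k=2)
grad_idx = select_top_k(gradient_norm_scores(probs, labels), k=2)
print("entropy-based selection:", entropy_idx)
print("gradient-based selection:", grad_idx)
```

The two criteria can disagree: the entropy score favors samples the model is unsure about, while the gradient score also favors samples the model classifies confidently but incorrectly, which is one reason the choice of distillation strategy matters.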

Published

2025-05-29