Algorithms for data imputation based on entropy
DOI:
https://doi.org/10.34185/1562-9945-6-155-2024-12
Keywords:
data imputation, data gaps, conditional entropy, information theory, data processing algorithms, uncertainty minimization, classification, qualitative and quantitative features, iterative method, entropy approach, machine learning, missing data processing, mutual information, software engineering, data mining, activity diagram.
Abstract
Recent advancements in data imputation have focused on various machine learning techniques, ranging from simple mean, median, and mode imputation to more complex approaches such as k-nearest neighbors (KNN) and multiple imputation by chained equations (MICE). Research into entropy-based methods offers a promising direction: these methods minimize uncertainty by selecting imputation values that reduce the overall entropy of the dataset.
The goal of this work is to develop an algorithm that imputes missing data by minimizing conditional entropy, ensuring that the missing values are filled in a way that preserves the relationships between the variables. The method is designed for both qualitative and quantitative data, including discrete and continuous variables, and aims to reduce uncertainty in classification tasks and enhance the performance of machine learning models.
The proposed algorithm is based on conditional entropy minimization, using entropy as a measure of uncertainty in the data. For each incomplete row, the algorithm computes the conditional entropy for each possible imputation value. The value that minimizes conditional entropy is selected, as it reduces uncertainty in the target variable. This process is repeated for each missing value until all missing data are imputed.
Three types of tests were performed on two datasets. The analysis showed that the proposed algorithms are quite slow compared to other methods and can be improved, for example, by multiprocessing, as described in our work [15]. The type 1 test showed that the proposed algorithms do not give a gain on the RMS deviation metric, but they significantly reduce entropy (type 2 test). At the same time, these methods show an improvement in classification performance over the baseline models (type 3 test).
Thus, the proposed entropy-based imputation methods have shown good results and can be considered by researchers as an additional tool to improve the accuracy of decision making, but further computational optimization studies are needed to improve the performance of these methods. The algorithm shows promise in improving classification accuracy by selecting imputation values that minimize conditional entropy. Future research will focus on optimizing the method for large datasets and expanding its application to various domains.
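As a minimal illustration of the selection rule described in the abstract (a sketch, not the authors' implementation), the following Python code imputes missing values in a categorical feature column by trying each observed candidate value and keeping the one that minimizes the conditional entropy H(target | feature). All function and variable names here are hypothetical.

```python
# Sketch of conditional-entropy-minimizing imputation for a categorical
# feature column. Names are illustrative, not from the paper's software.
import math
from collections import Counter

def conditional_entropy(feature, target):
    """H(target | feature) in bits, estimated from paired observations."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    feat_counts = Counter(feature)
    h = 0.0
    for (f, t), c in joint.items():
        p_joint = c / n               # P(feature=f, target=t)
        p_cond = c / feat_counts[f]   # P(target=t | feature=f)
        h -= p_joint * math.log2(p_cond)
    return h

def impute_min_entropy(feature, target):
    """Fill each None in `feature` with the observed candidate value that
    minimizes H(target | feature) on the currently completed column."""
    candidates = sorted({v for v in feature if v is not None})
    filled = list(feature)
    for i, v in enumerate(filled):
        if v is not None:
            continue
        best_val, best_h = None, float("inf")
        for cand in candidates:
            filled[i] = cand
            # Evaluate entropy only over pairs whose feature is known so far.
            pairs = [(f, t) for f, t in zip(filled, target) if f is not None]
            h = conditional_entropy([p[0] for p in pairs],
                                    [p[1] for p in pairs])
            if h < best_h:
                best_val, best_h = cand, h
        filled[i] = best_val
    return filled
```

For example, with `feature = ["a", "a", "b", None, "b"]` and `target = [0, 0, 1, 1, 1]`, filling the gap with `"b"` makes the feature fully predictive of the target (conditional entropy 0), so `"b"` is chosen. Continuous variables would first be discretized or handled with a density-based entropy estimate, as the abstract's mention of quantitative data implies.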
References
1. Little R. J. A., Rubin D. B. Statistical Analysis with Missing Data, 3rd Edition. Wiley, 2019. 464 p. ISBN: 978-0-470-52679-8.
2. Zemlianyi O.D., Baibuz O.H. Methods for imputing gaps in data on coronary heart disease // System Technologies. Regional Interuniversity Collection of Scientific Papers. Issue 2(151). Dnipro, 2024. P. 33-49. DOI: https://doi.org/10.34185/1562-9945-2-151-2024-04
3. Yoon J., Jordon J., Schaar M.V. (2018). GAIN: Missing Data Imputation using Generative Adversarial Nets. ArXiv, abs/1806.02920. DOI: https://doi.org/10.48550/arXiv.1806.02920
4. Gondara L., Wang K. (2017). Multiple Imputation Using Deep Denoising Autoencoders. ArXiv, abs/1705.02737. DOI: https://doi.org/10.48550/arXiv.1705.02737
5. Stekhoven D. J., Bühlmann P. (2012). MissForest — non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. DOI: https://doi.org/10.1093/bioinformatics/btr597
6. Rusdah D.A., Murfi H. XGBoost in handling missing values for life insurance risk prediction. SN Appl. Sci. 2, 1336 (2020). DOI: https://doi.org/10.1007/s42452-020-3128-y
7. Deng Y., Lumley T. (2023). Multiple Imputation Through XGBoost. Journal of Computational and Graphical Statistics, 33(2), 352-363. DOI: https://doi.org/10.1080/10618600.2023.2252501
8. Delavallade T., Dang T. (2007). Using Entropy to Impute Missing Data in a Classification Task. Proceedings of the IEEE International Conference on Fuzzy Systems, FUZZ-IEEE'07, London, UK. P. 1-6. DOI: 10.1109/FUZZY.2007.4295430
9. Janosi A., Steinbrunn W., Pfisterer M., Detrano R. (1988). Heart Disease. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X
10. UCI Heart Disease Data. Heart Disease Data Set from UCI data repository. [Online]. Available at: https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data
11. Framingham Heart Study-Cohort (FHS-Cohort). [Online]. Available at: https://biolincc.nhlbi.nih.gov/studies/framcohort/
12. Framingham heart study dataset. [Online]. Available at: https://www.kaggle.com/datasets/aasheesh200/framingham-heart-study-dataset
13. Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P. "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, 321-357, 2002.
14. Imputation of missing values in scikit-learn. [Online]. Available at: https://scikit-learn.org/stable/modules/impute.html#impute
15. Zemlianyi O.D., Baibuz O.H. Comparison of multiprocessing and multithreading implementations of the entropy approach for imputing data gaps in the Python programming language // Challenges and Problems of Modern Science [Online]: collection of scientific papers. Dnipro, 2024. Vol. 2. P. 300-304. Available at: https://cims.fti.dp.ua/j/article/view/131/159
License
Copyright (c) 2025 System technologies

This work is licensed under a Creative Commons Attribution 4.0 International License.