Methods for imputing missing data on coronary heart disease

Authors

  • Zemlianyi O.
  • Baibuz O.

DOI:

https://doi.org/10.34185/1562-9945-2-151-2024-04

Keywords:

data imputation, iterative multiple data imputation, mixed data processing, regression, binary classification, conversion of qualitative features into quantitative ones, Python programming language.

Abstract

Preliminary analysis is an important stage of data analysis. A significant problem at this stage is detecting and handling missing values, and the main difficulty is that no universal algorithm resolves it. For each specific task, known methods, their combinations, modifications, or completely new approaches have to be selected. Most machine learning models cannot handle missing values, so gaps in the data cannot simply be ignored and must be addressed during pre-processing. The simplest solution is to delete every observation that contains missing values. This solution is available in well-known Python libraries such as NumPy or Pandas. However, this approach is extreme because it discards the observed values in the deleted rows, which may be important for the analysis. There are several main strategies for imputing missing data: replacing missing values with the mean, median, or mode (the most frequent value) or with a constant; imputation using the kNN algorithm; multiple imputation of missing data (the MICE algorithm); and imputation using deep learning. We propose several modifications of algorithms for iterative multiple imputation of mixed data represented by quantitative and qualitative features. To convert qualitative features into numerical ones, we propose our own algorithms that work with missing data and allow conversion back to qualitative features. Two well-known datasets of coronary heart disease observations are considered. The data imputation algorithms are briefly described as follows. The fillna_k_columns method performs data imputation based on k complete columns. It uses a regressor or a classifier depending on the column type. The fillna_k_sorted_columns method traverses the columns in the order determined by their number of missing values. It uses a regressor or a classifier depending on the column type.
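The idea of traversing columns by their number of missing values and choosing a regressor or a classifier by column type can be illustrated with a minimal sketch. This is not the authors' fillna_k_sorted_columns implementation: the function name impute_sorted_columns, the choice of random forest models, the ascending traversal order, and the synthetic data are all assumptions made for illustration, and qualitative columns are assumed to be already encoded as numbers.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def impute_sorted_columns(df, categorical):
    """Hypothetical sketch: fill columns in order of increasing missing count.

    Each incomplete column is predicted from the columns that are complete
    at that moment (originally complete or already filled). A classifier is
    used for qualitative columns and a regressor for quantitative ones.
    """
    out = df.copy()
    missing = out.isna().sum()
    order = missing[missing > 0].sort_values().index
    for col in order:
        predictors = [c for c in out.columns
                      if c != col and out[c].notna().all()]
        if not predictors:
            raise ValueError("at least one complete column is required")
        known = out[col].notna()
        if col in categorical:
            model = RandomForestClassifier(n_estimators=50, random_state=0)
        else:
            model = RandomForestRegressor(n_estimators=50, random_state=0)
        # Train on rows where the column is observed, predict the gaps.
        model.fit(out.loc[known, predictors], out.loc[known, col])
        out.loc[~known, col] = model.predict(out.loc[~known, predictors])
    return out


# Synthetic stand-in for a heart disease table (not real data).
rng = np.random.default_rng(0)
n = 200
age = rng.normal(55, 9, n)
chol = 150 + 1.5 * age + rng.normal(0, 10, n)
sex = rng.integers(0, 2, n).astype(float)
target = (age + rng.normal(0, 5, n) > 55).astype(float)
demo = pd.DataFrame({"age": age, "sex": sex, "chol": chol, "target": target})
demo.loc[rng.choice(n, 20, replace=False), "chol"] = np.nan
demo.loc[rng.choice(n, 10, replace=False), "target"] = np.nan

filled = impute_sorted_columns(demo, categorical={"target"})
print(filled.isna().sum())
```

In this sketch the column with fewer gaps is filled first, so it can then serve as a predictor for the columns with more gaps. Using a classifier for the qualitative column keeps the imputed values within the set of observed categories.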
The fillna_2steps_rg_class method is executed in two steps: first over complete rows, then over complete columns. It uses a regressor or a classifier depending on the column type. The fillna_2steps_rg method is also executed in two steps: first over complete rows, then over complete columns. It uses only a regressor, with the values in qualitative columns adjusted according to two criteria. Two types of tests are used to analyse the approaches. In the first test, gaps are artificially introduced into a dataset at random positions, the gaps are imputed using different methods, and the mean squared error and execution time of the algorithms are estimated. In the second test, binary classification models are trained on datasets imputed with different methods, and the classification accuracy is compared. The analysis showed a time advantage for the fillna_2steps_rg method and improved classification accuracy when the frequency-aware encoding method was combined with the fillna_2steps_rg_class imputation method. Thus, the proposed methods have shown promising results; they can serve as alternatives to existing methods and give researchers additional tools to enhance decision-making accuracy. In future work, we plan to implement the proposed methods within the scikit-learn library architecture for unified use by researchers.
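The first type of test described above (introduce gaps at random positions, impute, then measure mean squared error and execution time) can be sketched as follows. This is an illustrative sketch only: the helper evaluate_imputers, the 10% missing rate, and the synthetic data are assumptions, and the standard scikit-learn imputers stand in for the proposed methods, whose code is not given here.

```python
import time

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer


def evaluate_imputers(X, imputers, missing_rate=0.1, seed=0):
    """Hypothetical helper: mask random cells, impute, report (MSE, seconds)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < missing_rate  # cells to hide
    X_gaps = X.copy()
    X_gaps[mask] = np.nan
    results = {}
    for name, imputer in imputers.items():
        start = time.perf_counter()
        X_filled = imputer.fit_transform(X_gaps)
        elapsed = time.perf_counter() - start
        # The error is measured only on the artificially hidden cells.
        mse = float(np.mean((X_filled[mask] - X[mask]) ** 2))
        results[name] = (mse, elapsed)
    return results


# Synthetic correlated features (not real heart disease data).
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(300, 1)) for _ in range(4)])

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}
results = evaluate_imputers(X, imputers)
for name, (mse, seconds) in results.items():
    print(f"{name}: MSE={mse:.4f}, time={seconds:.3f}s")
```

Because the same mask is reused for every imputer, the error values are comparable across methods. The timings depend on the machine and should be read only relative to each other.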

Published

2024-04-17