Intelligent classification system based on ensemble methods
Keywords: classification, ensemble models, stacking, boosting, bagging, two-level architecture, performance indicators
This paper investigates the solution of the classification task with machine learning methods using a two-level structure of model ensembles. To improve prediction quality, an ensemble approach was applied: several base models were trained on the same problem, and their outputs were then aggregated and refined. An architecture for an intelligent classification system is proposed, consisting of the following components: a data preprocessing and analysis subsystem, a data partitioning subsystem, a subsystem for building base models, and a subsystem for building and evaluating model ensembles. The two-level ensemble structure is used to find a compromise between the bias and variance inherent in machine learning models. At the first level, a stacking ensemble is implemented with a logistic regression model as the metamodel; the predictions generated by the base models serve as the training inputs of this first layer. The base models of the first layer are decision trees (DecisionTree), the naive Bayes classifier (NB), quadratic discriminant analysis (QDA), logistic regression (LR), the support vector machine (SVM), and the random forest model (RF). In the second layer, the bagging method based on the Bagged CART algorithm is used: the algorithm builds N trees from M bootstrap training sets and averages the resulting predictions. The base models of the second layer are the first-level model (Stacking LR), an artificial neural network (ANN), linear discriminant analysis (LDA), and the k-nearest neighbors (KNN) model. A study was conducted of the base classification models and of the ensemble models based on stacking and bagging, together with metrics for evaluating the effectiveness of the base classifiers and of the first- and second-level models.
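The two-level scheme described above can be sketched with scikit-learn. This is a minimal illustration, not the authors' implementation: the dataset is synthetic, all hyperparameters are assumptions, and since scikit-learn's `BaggingClassifier` bags a single estimator type, it is shown here only as an analogue of the Bagged CART step, without the heterogeneous second-layer models (Stacking LR, ANN, LDA, KNN) used in the paper.

```python
# Sketch of the two-level ensemble: stacking (level 1) and Bagged CART (level 2).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Illustrative synthetic data in place of the paper's dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# First layer: the six base models named in the abstract.
base_models = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
    ("qda", QuadraticDiscriminantAnalysis()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("svm", SVC(probability=True, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# Stacking LR: base-model predictions become features for a logistic-regression metamodel.
stack_lr = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack_lr.fit(X_tr, y_tr)
acc = stack_lr.score(X_te, y_te)

# Second layer analogue: Bagged CART = bagging over decision (CART) trees,
# each tree trained on its own bootstrap sample, predictions aggregated by voting.
bagged_cart = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=25, random_state=0
)
bagged_cart.fit(X_tr, y_tr)
```

The `cv=5` argument makes the metamodel train on out-of-fold base-model predictions, which is what prevents the stacking layer from simply memorizing the base models' training-set fit.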
The following performance indicators were determined for all the methods in the work: prediction accuracy and error rate, the Kappa statistic, sensitivity and specificity, precision and recall, the F-measure, and the area under the ROC curve. The advantages and effectiveness of the ensemble of models in comparison with each base model are demonstrated.
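The listed indicators can all be computed from a confusion matrix and predicted scores; a minimal sketch with scikit-learn and hypothetical toy labels (not the paper's data) follows.

```python
# Computing the evaluation metrics listed above from toy predictions.
from sklearn.metrics import (
    accuracy_score, cohen_kappa_score, confusion_matrix,
    precision_score, recall_score, f1_score, roc_auc_score,
)

# Hypothetical labels, hard predictions, and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3, 0.95, 0.85]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "error_rate": 1 - accuracy_score(y_true, y_pred),
    "kappa": cohen_kappa_score(y_true, y_pred),
    "sensitivity": tp / (tp + fn),   # identical to recall for the positive class
    "specificity": tn / (tn + fp),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f_measure": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),  # needs scores, not hard labels
}
```

Note that ROC AUC is computed from the predicted probabilities rather than the thresholded class labels, while the remaining indicators derive from the confusion-matrix counts.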
This work is licensed under a Creative Commons Attribution 4.0 International License.