QUESTIONS OF DETERMINING THE MINIMUM SUFFICIENT TRAINING SAMPLE SIZE FOR MACHINE LEARNING MODELS

B.H. Kiselov; A.O. Senko; A.I. Kupin; D.K. Balyk

doi:10.34185/1991-7848.itmm.2026.01.079

Authors

B. Kiselov https://orcid.org/0009-0007-9338-1031
A. Senko https://orcid.org/0000-0002-4104-8372
A. Kupin https://orcid.org/0000-0001-7569-1721
D. Balyk https://orcid.org/0009-0000-4768-8576

Keywords:

machine learning, learning curve, minimum sample size, power-law approximation, Gaussian process, ore sorting, cross-validation, HistGradientBoosting, Neural Scaling Laws, extrapolation

Abstract

This paper addresses the problem of empirically determining the minimum sufficient training sample size for machine learning regression models in sensor-based ore sorting systems. The proposed methodology is a hierarchy of three approaches: an empirical learning curve, a parametric power-law extrapolation of that curve, and a Gaussian process (GP) model of the learning curve. The study uses a real sensor dataset of 699 observations. HistGradientBoostingRegressor was selected as the primary model (R² = 0.93 under 10-fold GroupKFold cross-validation). Power-law extrapolation gave point estimates of the minimum sample size for a given RMSE threshold, while the GP-based approach gave probabilistic estimates that account for parameter uncertainty. To reach RMSE ≤ 12 with 95% confidence, 810 observations are required. Practical recommendations are formulated for datasets of a similar type.
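The power-law extrapolation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the simplest two-parameter form RMSE(n) = a · n^(−b), fitted by ordinary least squares in log-log space; the abstract does not state the exact parametric form the authors used (for example, whether it includes an irreducible-error term), so this form is an assumption. The function names `fit_power_law` and `min_sample_size` are hypothetical, and the learning-curve points are synthetic and noise-free, not the paper's data.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error = a * n**(-b) by least squares on log(error) vs log(n).

    Returns (a, b). Assumed two-parameter form; no irreducible-error floor.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # equals -b in the power law
    intercept = y_mean - slope * x_mean  # equals log(a)
    return math.exp(intercept), -slope


def min_sample_size(a, b, threshold):
    """Solve a * n**(-b) = threshold for n (continuous point estimate)."""
    return (a / threshold) ** (1.0 / b)


# Synthetic learning curve generated from a known power law
# (illustrative values only, NOT results from the paper).
sizes = [50, 100, 200, 400]
errors = [50.0 * n ** -0.5 for n in sizes]

a, b = fit_power_law(sizes, errors)
n_min = min_sample_size(a, b, threshold=5.0)
print(f"a = {a:.3f}, b = {b:.3f}, estimated minimum n = {n_min:.1f}")
```

In practice the estimate would be rounded up to a whole number of observations. A point estimate of this kind carries no uncertainty, which is the gap the GP-based approach in the abstract is meant to address.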

© 2025 Information technologies in metallurgy and machine building. All Rights Reserved.