QUESTIONS OF DETERMINING THE MINIMUM SUFFICIENT TRAINING SAMPLE SIZE FOR MACHINE LEARNING MODELS

Authors

DOI:

https://doi.org/10.34185/1991-7848.itmm.2026.01.079

Keywords:

machine learning, learning curve, minimum sample size, power-law approximation, Gaussian process, ore sorting, cross-validation, HistGradientBoosting, Neural Scaling Laws, extrapolation

Abstract

This paper addresses the empirical determination of the minimum sufficient training sample size for machine learning regression models in sensor-based ore sorting systems. The proposed methodology is a hierarchy of three approaches: an empirical learning curve, parametric power-law extrapolation of that curve, and a Gaussian process (GP) based learning curve. The study uses a real sensor dataset of 699 observations. HistGradientBoostingRegressor was selected as the primary model (R² = 0.93 under 10-fold GroupKFold cross-validation). Power-law extrapolation provided point estimates of the minimum sample size for a given RMSE threshold, while the GP-based approach yielded probabilistic estimates that account for parameter uncertainty. For RMSE ≤ 12 with 95% confidence, 810 observations are required. Practical recommendations are formulated for datasets of a similar type.
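The power-law extrapolation step can be illustrated with a minimal sketch. It assumes the common three-parameter form rmse(n) ≈ a·n^(−b) + c, where c is an irreducible error floor; the abstract does not state the exact parametrization, so this form, the function names `fit_power_law` and `min_sample_size`, and the learning-curve points below are all illustrative assumptions. The points are synthetic and are not the paper's measurements, so the resulting number should not be compared with the 810 observations reported above.

```python
import math

def fit_power_law(sizes, rmses, n_grid=200):
    """Fit rmse(n) ~ a * n**(-b) + c (assumed form).

    Scans candidate floors c in [0, min(rmses)); for each c, a and b come
    from ordinary least squares on log(rmse - c) versus log(n). The
    candidate with the smallest squared error on the original scale wins.
    """
    xs = [math.log(n) for n in sizes]
    mx = sum(xs) / len(xs)
    sxx = sum((x - mx) ** 2 for x in xs)
    best = None
    for i in range(n_grid):
        c = min(rmses) * i / n_grid          # stays strictly below min(rmses)
        ys = [math.log(r - c) for r in rmses]
        my = sum(ys) / len(ys)
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        a, b = math.exp(my - slope * mx), -slope
        sse = sum((a * n ** (-b) + c - r) ** 2 for n, r in zip(sizes, rmses))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]

def min_sample_size(a, b, c, target_rmse):
    """Smallest n with a * n**(-b) + c <= target_rmse, or None if unreachable."""
    if target_rmse <= c or b <= 0:
        return None                          # target lies at or below the error floor
    return math.ceil((a / (target_rmse - c)) ** (1.0 / b))

# Synthetic learning-curve points (illustrative only), generated from
# an assumed curve 120 * n**(-0.5) + 8.
sizes = [50, 100, 200, 400, 600]
rmses = [120 * n ** (-0.5) + 8 for n in sizes]

a, b, c = fit_power_law(sizes, rmses)
n_required = min_sample_size(a, b, c, target_rmse=12.0)
print(f"a={a:.2f}, b={b:.3f}, c={c:.2f}, n_required={n_required}")
```

A point estimate of this kind carries no uncertainty: small changes in the fitted floor c can shift the required n considerably when the target lies close to c. This is the gap that the probabilistic GP-based estimate in the abstract is meant to cover.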

Published

2026-04-26

Issue

Section

Theses