Empirical determination of the minimum sufficient training sample size for machine learning models at a given error level
DOI:
https://doi.org/10.34185/1562-9945-4-165-2026-01
Keywords:
machine learning, learning curve, minimum sample size, power-law approximation, Gaussian process, ore sorting, cross-validation, HistGradientBoosting, Neural Scaling Laws, extrapolation
Abstract
This article addresses the empirical determination of the minimum sufficient training sample size for machine learning regression models in sensor-based ore sorting systems. The topic is relevant because creating representative datasets in the mining industry is costly. The problem is the need to move from a qualitative reading of the learning curve to a quantitative estimate of the sample size that is sufficient for a given error level. The aim of the study is to evaluate a hierarchy of approaches: the empirical learning curve, parametric power-law extrapolation, and a Gaussian process (GP) based learning-type curve. Methods: 10-fold GroupKFold cross-validation, Bayesian hyperparameter tuning (Optuna), nonlinear regression, and Gaussian processes. Results: R² = 0.93 on test folds; the minimum sufficient sample size for RMSE ≤ 12 was estimated at 559–810 observations. Key conclusion: the proposed method gives a well-founded estimate of the threshold beyond which further expansion of the sample ceases to yield practical benefits.
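The power-law extrapolation step named in the abstract can be illustrated with a minimal sketch. It assumes a learning curve of the form RMSE(n) = a·n^(−b) + c and inverts the fitted curve for a target error. The functional form, the function names `fit_power_law` and `min_sample_size`, the grid scan over the asymptote `c` (a simple stand-in for a general nonlinear regression), and the synthetic learning-curve points are all assumptions made for illustration; they are not the authors' code or data. Only the target RMSE ≤ 12 is taken from the abstract.

```python
import math

def fit_power_law(sizes, errors, n_grid=200):
    """Fit errors ~ a * n**(-b) + c.

    Scans candidate asymptotes c below the smallest observed error and,
    for each, solves a linear least-squares problem in log-log space.
    The candidate with the smallest squared error in the original space wins.
    """
    best = None
    c_max = min(errors)  # the asymptote must lie below every observed error
    xs = [math.log(n) for n in sizes]
    x_mean = sum(xs) / len(xs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    for k in range(n_grid):
        c = c_max * k / n_grid  # candidate asymptote in [0, c_max)
        ys = [math.log(e - c) for e in errors]
        y_mean = sum(ys) / len(ys)
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        slope = sxy / sxx
        a = math.exp(y_mean - slope * x_mean)
        b = -slope
        sse = sum((a * n ** (-b) + c - e) ** 2 for n, e in zip(sizes, errors))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c

def min_sample_size(a, b, c, target):
    """Smallest n with a * n**(-b) + c <= target, or None if unreachable."""
    if target <= c or a <= 0 or b <= 0:
        return None  # target lies at or below the irreducible error
    return math.ceil((a / (target - c)) ** (1.0 / b))

# Synthetic, noise-free learning curve (hypothetical values, not the paper's data).
sizes = [50, 100, 200, 400, 800]
errors = [60.0 * n ** -0.5 + 10.0 for n in sizes]

a, b, c = fit_power_law(sizes, errors)
n_min = min_sample_size(a, b, c, target=12.0)  # RMSE threshold from the abstract
print(f"a={a:.2f}, b={b:.3f}, c={c:.2f}, n_min={n_min}")
```

The fitted asymptote `c` plays the role of an irreducible error floor: if the target error is at or below it, no sample size satisfies the requirement and the sketch returns `None`. On real cross-validated curves the points are noisy, so an estimate of this kind is better reported as a range than as a single number.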
Copyright (c) 2026 System technologies

This work is licensed under a Creative Commons Attribution 4.0 International License.









