EMPIRICAL DETERMINATION OF THE MINIMUM SUFFICIENT TRAINING SAMPLE SIZE FOR MACHINE LEARNING MODELS AT A GIVEN ERROR LEVEL

B.H. Kiselov; A.O. Senko; A.I. Kupin; D.K. Balyk

doi:10.34185/1562-9945-4-165-2026-01

Authors

B.H. Kiselov https://orcid.org/0009-0007-9338-1031
A.O. Senko https://orcid.org/0000-0002-4104-8372
A.I. Kupin https://orcid.org/0000-0001-7569-1721
D.K. Balyk https://orcid.org/0009-0000-4768-8576

Keywords:

machine learning, learning curve, minimum sample size, power-law approximation, Gaussian process, ore sorting, cross-validation, HistGradientBoosting, Neural Scaling Laws, extrapolation

Abstract

This article addresses the empirical determination of the minimum sufficient training sample size for machine learning regression models in sensor-based ore sorting systems. The topic is relevant because building representative datasets in the mining industry is costly. The core problem is moving from a qualitative reading of the learning curve to a quantitative estimate of the sample size that is sufficient for a given error level. The aim of the study is to evaluate a hierarchy of approaches: learning curve – parametric power-law extrapolation – GP-based learning-type curve. Methods: 10-fold GroupKFold cross-validation, Bayesian hyperparameter tuning (Optuna), nonlinear regression, and Gaussian processes. Results: R² = 0.93 on test folds; the minimum sufficient sample size for RMSE ≤ 12 was estimated at 559–810 observations. Key conclusion: the proposed method gives a well-founded estimate of the threshold beyond which further expansion of the sample ceases to yield practical benefits.
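The power-law extrapolation step named above can be illustrated with a minimal sketch. It assumes a learning curve of the form RMSE(n) ≈ a·n^(−b) + c, which is a common parametrisation but is not confirmed by the abstract as the authors' exact model. Inverting that form for a target error gives n ≥ (a / (target − c))^(1/b). The function names, the grid-search fitting procedure, and the data points are all hypothetical: the points are synthetic and noiseless, not the study's measurements, so the printed estimate is unrelated to the 559–810 range reported in the article.

```python
import math

def fit_power_law(sizes, errors, n_grid=200):
    """Fit err(n) ~ a * n**(-b) + c (assumed form, not the authors' code).

    The floor c is found by grid search over [0, min(errors)); for each
    candidate c, a and b come from least squares on log(err - c) vs log(n).
    """
    xs = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    best = None
    for k in range(n_grid):
        c = min(errors) * k / n_grid  # always strictly below every error
        ys = [math.log(e - c) for e in errors]
        mean_y = sum(ys) / len(ys)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        a = math.exp(mean_y - slope * mean_x)
        b = -slope
        # Score each candidate by squared error on the original scale.
        sse = sum((a * n ** (-b) + c - e) ** 2 for n, e in zip(sizes, errors))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]

def min_sample_size(a, b, c, target):
    """Smallest n with a * n**(-b) + c <= target, or None if unreachable."""
    if target <= c or b <= 0:
        return None  # target lies at or below the fitted error floor
    return math.ceil((a / (target - c)) ** (1.0 / b))

# Synthetic, noiseless learning-curve points (illustrative only).
sizes = [50, 100, 200, 400, 800]
errors = [200.0 * n ** -0.5 + 5.0 for n in sizes]

a, b, c = fit_power_law(sizes, errors)
n_min = min_sample_size(a, b, c, target=12.0)
print(f"a={a:.2f}, b={b:.3f}, c={c:.2f}, estimated n_min={n_min}")
```

With real cross-validated errors the fit would be noisy, so the estimate is better reported as an interval (for example by refitting on each fold's curve) than as a single number, which is consistent with the article reporting a range.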

© 2025 System technologies. All Rights Reserved.