EMPIRICAL DETERMINATION OF THE MINIMUM SUFFICIENT TRAINING SAMPLE SIZE FOR MACHINE LEARNING MODELS AT A GIVEN ERROR LEVEL

B.H. Kiselov; A.O. Senko; A.I. Kupin; D.K. Balyk

doi:10.34185/1562-9945-4-165-2026-01

Author(s)

B.H. Kiselov https://orcid.org/0009-0007-9338-1031
A.O. Senko https://orcid.org/0000-0002-4104-8372
A.I. Kupin https://orcid.org/0000-0001-7569-1721
D.K. Balyk https://orcid.org/0009-0000-4768-8576

Keywords:

machine learning, learning curve, minimum sample size, power-law approximation, Gaussian process, ore sorting, cross-validation, HistGradientBoosting, Neural Scaling Laws, extrapolation

Abstract

The article addresses the problem of empirically determining the minimum sufficient training sample size for machine learning regression models in sensor-based ore sorting systems. The topic is relevant because building representative datasets in the mining industry is costly. The problem is the need to move from a qualitative analysis of the learning curve to a quantitative estimate of the sufficient sample size at a given error level. The aim of the study is to test a hierarchy of approaches: from the learning curve, through parametric power-law extrapolation, to a GP-based learning-type curve. Methods: 10-fold GroupKFold cross-validation, Bayesian hyperparameter tuning (Optuna), nonlinear regression, Gaussian processes. Results: R² = 0.93 on the test folds; the minimum sufficient sample size for RMSE ≤ 12 is estimated at 559–810 observations. Key conclusion: the proposed methodology makes it possible to determine, on a sound basis, the threshold beyond which further enlargement of the sample no longer yields a practical benefit.
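To make the power-law extrapolation step concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a three-parameter curve RMSE(n) = a·n^(−b) + c, which the abstract does not spell out, and it runs on synthetic learning-curve points generated for illustration only. The function names `fit_power_law` and `min_sample_size` are hypothetical. The asymptote c is scanned over a grid, and a and b are obtained from a log-log least-squares fit; in practice a nonlinear solver such as `scipy.optimize.curve_fit` would be a common alternative.

```python
import math


def fit_power_law(sizes, errors, n_grid=1000):
    """Fit errors ~ a * n**(-b) + c.

    The asymptote c is scanned over [0, min(errors)); for each candidate c,
    a and b come from ordinary least squares on log(error - c) vs log(n).
    The candidate with the smallest squared error in the original scale wins.
    """
    xs = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c_upper = min(errors)
    best = None
    for i in range(n_grid):
        c = c_upper * i / n_grid  # candidate irreducible error, always < min(errors)
        ys = [math.log(e - c) for e in errors]
        mean_y = sum(ys) / len(ys)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        a = math.exp(mean_y - slope * mean_x)
        b = -slope
        sse = sum((a * n ** (-b) + c - e) ** 2 for n, e in zip(sizes, errors))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]


def min_sample_size(a, b, c, target):
    """Smallest n with a * n**(-b) + c <= target, or None if unreachable."""
    if target <= c or b <= 0:
        return None  # target lies at or below the fitted error floor
    return math.ceil((a / (target - c)) ** (1.0 / b))


# Synthetic learning-curve points (assumption: NOT data from the article).
sizes = [50, 100, 200, 400]
rmse = [120.0 * n ** -0.5 + 8.0 for n in sizes]

a, b, c = fit_power_law(sizes, rmse)
n_min = min_sample_size(a, b, c, target=12.0)
print(f"a={a:.2f}, b={b:.3f}, c={c:.2f}, n_min={n_min}")
```

Returning `None` when the target is at or below the fitted floor c reflects the abstract's key conclusion: past a certain point, adding observations cannot lower the error further, so no finite sample size satisfies the target.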
