Statistical text analysis and study of the dynamics of classification accuracy

Authors

  • K.Iu. Ostrovska
  • T.M. Fenenko
  • O.O. Hlushchenko

DOI:

https://doi.org/10.34185/1562-9945-5-142-2022-06

Keywords:

machine learning, statistical text analysis, authorship determination, data analysis, natural language processing

Abstract

This work addresses statistical text analysis and the dynamics of classification accuracy. It covers the selection of statistical text features, the classification of texts written by different authors, and the study of how classification accuracy depends on the length of the text fragments. The following methods were used: natural language processing, statistical characteristics of texts, machine learning, and dimensionality reduction for visualization. From the observed change in classification accuracy with fragment length, conclusions were drawn about the optimal length of the texts used for training and testing the models. The work was carried out in Jupyter Notebook from the Anaconda distribution, which installs Python together with the necessary libraries.
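The workflow described in the abstract (cut texts into fragments, extract statistical features, classify by author, track accuracy against fragment length) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the feature set (mean word length, type-token ratio, comma rate), the nearest-centroid classifier, the fragment lengths, and the synthetic "authors" generated by the hypothetical `toy_author` helper are all stand-ins chosen to keep the example self-contained.

```python
import random
import statistics
import string


def split_into_fragments(words, fragment_len):
    """Cut a word list into consecutive non-overlapping fragments of fragment_len words."""
    count = len(words) // fragment_len
    return [words[i * fragment_len:(i + 1) * fragment_len] for i in range(count)]


def text_features(words):
    """Assumed feature set: mean word length, type-token ratio, commas per word."""
    clean = [w.strip(string.punctuation) for w in words]
    clean = [w for w in clean if w]
    mean_len = statistics.fmean(len(w) for w in clean) if clean else 0.0
    ttr = len({w.lower() for w in clean}) / len(clean) if clean else 0.0
    comma_rate = sum(w.count(",") for w in words) / len(words)
    return [mean_len, ttr, comma_rate]


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [statistics.fmean(col) for col in zip(*rows)]


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance).

    Features are left unscaled here for brevity; a real study would normalize them.
    """
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], x)),
    )


def accuracy_for_length(texts_by_author, fragment_len, train_share=0.5):
    """Train on the first share of each author's fragments, test on the rest."""
    centroids, test = {}, []
    for author, words in texts_by_author.items():
        rows = [text_features(f) for f in split_into_fragments(words, fragment_len)]
        cut = max(1, int(len(rows) * train_share))
        centroids[author] = centroid(rows[:cut])
        test += [(author, x) for x in rows[cut:]]
    if not test:
        return float("nan")
    hits = sum(predict(centroids, x) == author for author, x in test)
    return hits / len(test)


def toy_author(seed, word_lengths, comma_prob, n_words=4000):
    """Hypothetical generator of synthetic text; a placeholder for a real corpus."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        length = rng.choice(word_lengths)
        word = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if rng.random() < comma_prob:
            word += ","
        words.append(word)
    return words


# Two synthetic "authors" that differ in word length and comma usage.
texts = {
    "author_a": toy_author(1, [2, 3, 4, 5], 0.05),
    "author_b": toy_author(2, [3, 4, 5, 6, 7], 0.12),
}

# Accuracy as a function of fragment length (in words).
accuracy_by_length = {n: accuracy_for_length(texts, n) for n in (10, 50, 200)}
for n, acc in accuracy_by_length.items():
    print(f"fragment length {n:>3} words: accuracy {acc:.2f}")
```

With a real corpus, `texts` would map each author to a tokenized text, and the resulting accuracy values could be plotted against fragment length to judge which length is sufficient for training and testing.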

Published

2022-10-28