Review of methods for semantic text classification
DOI: https://doi.org/10.34185/1562-9945-5-154-2024-13

Keywords: Text classification, Naive Bayes, Logistic Regression, Support Vector Machine (SVM), Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Transformers, Tone Analysis, Natural Language Processing

Abstract
Recent advancements in text classification have focused on the application of machine learning and deep learning techniques. Traditional methods such as Naive Bayes, Logistic Regression, and Support Vector Machines (SVM) have been widely utilized due to their efficiency and simplicity. However, the advent of deep learning has introduced more complex models like Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN), which can automatically extract features and detect intricate patterns in textual data. Additionally, transformer-based models such as BERT have set new benchmarks in text classification tasks. Despite their high accuracy, these models require substantial computational resources and are not always practical for every application. The ongoing research aims to balance accuracy and computational efficiency.

Purpose of Research. The primary objective of this study is to review and compare various methods for automated text classification based on sentiment analysis. This research aims to evaluate the prediction accuracy of different models, including traditional machine learning algorithms and modern deep learning approaches, and to provide insights into their practical applications and limitations.

Presentation of the Main Research Material. This study utilizes the "IMDB Dataset of 50K Movie Reviews" to train and test various text classification models. The dataset comprises movie reviews and their associated sentiment labels, either positive or negative. The research employs several preprocessing steps. For feature extraction, methods such as Bag-of-Words (BoW), TF-IDF (Term Frequency-Inverse Document Frequency), and Word2Vec are used. These features are then fed into various classifiers: Naive Bayes, Support Vector Machines (SVM), Logistic Regression, and deep learning models (illustrative code sketches are given below).

Conclusions. The comparative analysis reveals that while traditional machine learning methods like Naive Bayes, SVM, and Logistic Regression are efficient and easy to implement, deep learning models offer superior accuracy by capturing more complex patterns in the data. However, the computational demands of deep learning models, particularly transformers, limit their applicability in resource-constrained environments. Future research should focus on optimizing these models to balance accuracy and computational efficiency, making advanced text classification accessible for a broader range of applications.
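To make the classical pipeline concrete, the following sketch trains the three traditional baselines on TF-IDF features with scikit-learn (the library the references point to). It assumes the Kaggle CSV has been saved locally as "IMDB Dataset.csv" with "review" and "sentiment" columns; the file name, column names, and hyperparameters are illustrative and not taken from the article's published source code.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset (assumed layout: "review" text, "sentiment" label).
df = pd.read_csv("IMDB Dataset.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["review"], df["sentiment"], test_size=0.2, random_state=42)

# TF-IDF maps each review to a sparse vector of weighted term frequencies;
# swapping in CountVectorizer gives the plain Bag-of-Words variant.
vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Fit and score each classical baseline on the held-out split.
for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("Linear SVM", LinearSVC()),
                  ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    clf.fit(X_train_vec, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test_vec)))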
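For the Word2Vec features, one common approach, assumed here since the article links only the original Google implementation, is to train embeddings with the gensim library and represent each review as the average of its word vectors:

import numpy as np
import pandas as pd
from gensim.models import Word2Vec

df = pd.read_csv("IMDB Dataset.csv")  # same assumed CSV as above
tokenized = [review.lower().split() for review in df["review"]]

# Train 100-dimensional embeddings on the review corpus (illustrative settings).
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5,
               min_count=2, workers=4)

def review_vector(tokens, model):
    # Average the vectors of in-vocabulary tokens; zero vector if none remain.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

features = np.vstack([review_vector(t, w2v) for t in tokenized])

The resulting dense feature matrix can then be passed to the same classifiers as above in place of the sparse TF-IDF vectors.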
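The transformer-based approach can be sketched with the HuggingFace Transformers pipeline API cited in the references; the DistilBERT checkpoint named below is a publicly available sentiment model used purely as an example, not necessarily the one evaluated in the article:

from transformers import pipeline

# Load a pre-trained sentiment classifier (downloaded on first use).
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("One of the best films I have seen in years."))
# Expected output shape: [{'label': 'POSITIVE', 'score': ...}]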
References
Source Code for the Article. URL: https://github.com/w3t4nu5/NLP-Article
IMDB Dataset of 50K Movie Reviews. URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
HuggingFace: Transformers. URL: https://huggingface.co/docs/transformers/index
Stopwords [NLP, Python]. URL: https://medium.com/@yashj302/stopwords-nlp-python-4aa57dc492af
Pavliuk, D. I., Baibuz, O. H., and Honcharova, Y. S. "Text Preparation for Natural Language Processing." XIX International Scientific and Practical Conference "Creative Business Management and Implementation of New Ideas", 14-17 May 2024, Tallinn, Estonia, pp. 223-225.
Feature extraction. URL: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
Mikolov T., Chen K., Corrado G., Dean J. Efficient Estimation of Word Representations in Vector Space. 2013. URL: https://arxiv.org/pdf/1301.3781
MultinomialNB. URL: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
Support Vector Machines. URL: https://scikit-learn.org/stable/modules/svm.html
LogisticRegression. URL: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Elastic Net Regression — Combined Features of L1 and L2 regularization. URL: https://medium.com/@abhishekjainindore24/elastic-net-regression-combined-features-of-l1-and-l2-regularization-6181a660c3a5
Google Code: word2vec. URL: https://code.google.com/archive/p/word2vec/
Natural Language Processing in TensorFlow. URL: https://www.coursera.org/learn/natural-language-processing-tensorflow/home/week/1
License
Copyright (c) 2024 System technologies

This work is licensed under a Creative Commons Attribution 4.0 International License.