Development of an automated system for clustering text documents

Authors

  • I. Ponomarev

DOI:

https://doi.org/10.34185/1562-9945-1-138-2022-10

Keywords:

clustering, text mining, TF-IDF, HDBSCAN, tokenization, lemmatization, stop words, Python

Abstract

Grouping texts by similarity of content is a common task in many fields. Text document clustering is used to categorize documents automatically, filter emails, group web pages in search engines, and so on. Automating this process can significantly reduce the time spent on such tasks.
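
As a rough illustration of the preprocessing named in the keywords (tokenization, stop-word removal, TF-IDF weighting), the following minimal sketch uses only the Python standard library. It is not the system described in the paper: the stop-word list, the sample documents, and the function names are hypothetical, lemmatization is omitted, and the weighting is the plain tf * log(N / df) variant. A density-based clustering algorithm such as HDBSCAN would then be applied to vectors of this kind; here only pairwise cosine similarity is computed.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "is", "are", "to", "for"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents: two about invoices, one about sport.
docs = [
    "The invoice for the printer order is attached",
    "Attached is the invoice for the laptop order",
    "The football match ended in a draw",
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

In this toy corpus the two invoice messages share several weighted terms, so `sim_related` is positive, while the sports text shares none with the first message, so `sim_unrelated` is zero. A clustering step would rely on exactly this kind of contrast to separate groups.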

Published

2022-03-30