Development of an automated system for clustering text documents
DOI:
https://doi.org/10.34185/1562-9945-1-138-2022-10
Keywords:
clustering, text mining, TF-IDF, HDBSCAN, tokenization, lemmatization, stop words, Python
Abstract
Grouping texts by similarity of content is a common task in many fields. Text document clustering is used to automatically categorize documents, filter email, group web pages in search engines, and so on. Automating this process can significantly reduce the time spent on the task.
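As a rough illustration of the preprocessing and vectorization steps named in the keywords (tokenization, stop-word removal, TF-IDF), the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the stop-word list, the helper names (`tokenize`, `tfidf_vectors`, `cosine_similarity`), and the sample documents are assumptions made for illustration, lemmatization is omitted, and the density-based clustering step (HDBSCAN) is not shown. It only demonstrates that documents with similar content end up with similar TF-IDF vectors, which is what a clustering algorithm would then exploit.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on", "for"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents: two about invoices, one about sports.
docs = [
    "The invoice for the server hardware is attached",
    "Attached is the invoice for new hardware",
    "Football match results and league standings",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(f"doc0 vs doc1: {sim_01:.3f}, doc0 vs doc2: {sim_02:.3f}")
```

In this sketch the two invoice-related documents share weighted terms and so have a positive similarity, while the sports document shares none with them. A full pipeline would pass such vectors, or a distance matrix derived from them, to a clustering algorithm.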
Published
2022-03-30
Issue
Section
Articles