Text stream data anomalies detection approach

Authors

  • Yuriy Oliynyk
  • Elena Afanasyeva
  • Georgy Arshakyan

DOI:

https://doi.org/10.34185/1562-9945-2-127-2020-10

Keywords:

isolation forest, text mining

Abstract

The growth of data streams demands the development of new intelligent tools and methods for Big Data processing. Existing anomaly detection approaches are based on distance, density and ranking, but they do not take the features of data streams into account. In the context of streaming data, these methods have drawbacks and cannot be applied directly: poor adaptability and extensibility, inability to detect novel anomalies, high model updating cost, slow updating speed, and so on. Besides, existing methods are based on numeric or categorical data and do not support Ukrainian text data.
Purpose of the research: to create a new approach to anomaly detection in stream data that supports Ukrainian and Russian text.
Currently, several software libraries and resources support Ukrainian text: Pymorphy2, OpenCorpora, LanguageTool, and the ВЕСУМ dictionary. Data preprocessing (normalization, tokenization and noise reduction) and text abstracting are proposed for anomaly detection. The abstracting method is based on a combination of the LSA and TextRank methods, and cosine similarity is used for abstracting. For method evaluation, a Ukrainian dataset was prepared from the news portal http://korrespondent.net/. The average cosine similarity between the original dataset and the dataset with 40% of the volume is 0.9033, and for 20% of the volume it is 0.7812. Anomaly detection on the “20%” dataset is 1.7 times faster than on the original dataset, and 1.47 times faster on the “40%” dataset.
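The abstracting and similarity steps can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only a TextRank-style sentence ranking (the LSA part is omitted), uses raw term counts instead of a tuned weighting, and the function names, the `keep_ratio` parameter and the sample sentences are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (Latin or Cyrillic)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def textrank_abstract(sentences, keep_ratio=0.4, damping=0.85, iterations=50):
    """Keep the top-ranked share of sentences, preserving their original order."""
    n = len(sentences)
    if n == 0:
        return []
    vectors = [Counter(tokenize(s)) for s in sentences]
    # Edge weights: cosine similarity between every pair of distinct sentences.
    weights = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # PageRank-style update over the weighted sentence graph.
        scores = [
            (1 - damping) / n
            + damping * sum(weights[j][i] / out_sums[j] * scores[j]
                            for j in range(n) if out_sums[j])
            for i in range(n)
        ]
    keep = max(1, round(keep_ratio * n))
    top = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep])
    return [sentences[i] for i in top]


# Hypothetical sample message split into sentences.
sentences = [
    "The city council approved the new transport budget.",
    "The transport budget funds new tram lines in the city.",
    "Weather was sunny during the council meeting.",
    "New tram lines will connect the city centre with the suburbs.",
    "The council will review the budget next year.",
]
abstract = textrank_abstract(sentences, keep_ratio=0.4)
similarity = cosine(Counter(tokenize(" ".join(sentences))),
                    Counter(tokenize(" ".join(abstract))))
print(abstract, round(similarity, 4))
```

Averaging `similarity` over all messages gives a figure comparable in kind to the average cosine similarity reported above, although the exact values depend on the weighting and on the preprocessing used.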
The anomaly detection method is based on the Isolation Forest method. The input data are TF-IDF metrics and message lengths. The TF-IDF metrics of the original messages and of the “20%” / “40%” messages are similar.
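A simplified sketch of isolation-based scoring on such inputs is given below. It follows the general idea of Isolation Forest (random splits, shorter average path means more anomalous) but is not the authors' code. The feature choice (message length plus mean TF-IDF weight per message), all function names and the sample messages are assumptions, and no stream or sliding-window handling is included.

```python
import math
import random
from collections import Counter


def c_factor(n):
    """Average path length of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Split rows on a random feature at a random threshold until isolated."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    low = min(r[feature] for r in rows)
    high = max(r[feature] for r in rows)
    if low == high:  # simplification: stop if the chosen feature is constant
        return {"size": len(rows)}
    split = rng.uniform(low, high)
    return {
        "feature": feature,
        "split": split,
        "left": build_tree([r for r in rows if r[feature] < split], depth + 1, max_depth, rng),
        "right": build_tree([r for r in rows if r[feature] >= split], depth + 1, max_depth, rng),
    }


def path_length(row, node, depth=0):
    """Depth at which the row lands, adjusted for the unsplit leaf size."""
    if "split" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if row[node["feature"]] < node["split"] else node["right"]
    return path_length(row, child, depth + 1)


def anomaly_scores(rows, n_trees=100, sample_size=256, seed=0):
    """Score each row in (0, 1); higher means more anomalous. Needs at least 2 rows."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(max(sample_size, 2)))
    trees = [build_tree(rng.sample(rows, sample_size), 0, max_depth, rng) for _ in range(n_trees)]
    norm = c_factor(sample_size)
    return [2 ** (-sum(path_length(r, t) for t in trees) / (n_trees * norm)) for r in rows]


def features(messages):
    """Assumed features: token count and mean TF-IDF weight (non-empty messages only)."""
    docs = [m.lower().split() for m in messages]
    df = Counter(t for d in docs for t in set(d))
    rows = []
    for d in docs:
        tf = Counter(d)
        tfidf = [tf[t] / len(d) * math.log(len(docs) / df[t]) for t in tf]
        rows.append([float(len(d)), sum(tfidf) / len(tfidf)])
    return rows


# Hypothetical messages: 20 routine short ones and one long unusual one.
messages = [f"market news update day {i}" if i % 2 == 0 else f"market news update for day {i}"
            for i in range(20)]
messages.append(" ".join(f"token{j}" for j in range(30)))
scores = anomaly_scores(features(messages))
print(max(range(len(scores)), key=lambda i: scores[i]))
```

In this toy data the long message is isolated after one or two splits, so it receives the highest score; real message streams would need a richer TF-IDF representation than a single averaged value.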
An approach to anomaly detection in text stream data is presented. The method includes preprocessing and abstracting stages. The abstracting method is based on a combination of the LSA and TextRank methods. The anomaly detection method is based on the Isolation Forest method and a data stream model. Ukrainian and Russian text processing is supported. The processing speeds of the original and abstracted data streams are compared.

References

Mehrotra, K. G., Mohan, C. K., & Huang, H. (2017). Anomaly detection principles and algorithms (p. 217). New York, NY, USA: Springer International Publishing.

Afanasieva O.Ie. Vyiavlennia anomalii v potokakh tekstovykh danykh / Afanasieva O.Ie., Oliinyk Yu.O. // Vseukrainska naukovo-praktychna konferentsiia molodykh vchenykh ta studentiv «Informatsiini systemy ta tekhnolohii upravlinnia – ISTU-2019». Sektsiia kafedry avtomatyzovanykh system obrobky informatsii i upravlinnia. m. Kyiv: NTUU «KPI im. Ihoria Sikorskoho», 26 lystopada 2019 r,– S. 88-92

Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008, December). Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining (pp. 413-422). IEEE.

Ding, Z., & Fei, M. (2013). An anomaly detection approach based on isolation forest algorithm for streaming data using sliding window. IFAC Proceedings Volumes, 46(20), 12-17.

WordNet - A Lexical Database for English [Electronic Resource] – Mode of access: World Wide Web: wordnet.princeton.edu – Title from the screen

Yu. Oliynik. Review and analysis of algorithms TEXT MINING / O. Gavrilenko, Yu. Oliynik, H. Hanko. // Project management, systems analysis and logistics. – K.: NTU, 2017. – Vol., pp. 32-41

pymorphy2 – Mode of access: World Wide Web: https://pymorphy2.readthedocs.io/ – Title from the screen

Open Corpora [Electronic Resource] – Mode of access: World Wide Web: http://opencorpora.org/ (viewed on September 20, 2019). – Title from the screen.

MIT Information Extraction [Electronic Resource] – Mode of access: World Wide Web: https://github.com/mit-nlp/MITIE/ – Title from the screen

Dlib toolkit [Electronic Resource] – Mode of access: World Wide Web: http://dlib.net/ – Title from the screen

ВЕСУМ dictionary – Mode of access: World Wide Web: https://github.com/brown-uk/dict_uk – Title from the screen

Arshakian H.D. Ohliad pidkhodiv ta metodiv avtomatychnoho referuvannia tekstu / Arshakian H.D. Oliinyk Yu.O. // Vseukrainska naukovo-praktychna konferentsiia molodykh vchenykh ta studentiv «Informatsiini systemy ta tekhnolohii upravlinnia – ISTU-2018». Sektsiia kafedry avtomatyzovanykh system obrobky informatsii i upravlinnia. m. Kyiv: NTUU «KPI im. Ihoria Sikorskoho», 26 lystopada 2019 r,– S. 194-198

Dataset for data analysis – Mode of access: World Wide Web: https://drive.google.com/open?id=1-aImiiTqKJfIWxmifnI4GZSMbVzfnfvi – Title from the screen

Tomashevskii, V. M., Oliynik, Y. O., Yaskov, V. V., Romanchuk, V. M. (2018). Realtime text stream anomalies analysis system. Вісник Херсонського національного технічного університету, (3 (1)), 361-365.

Published

2020-02-24