COMPARATIVE ANALYSIS OF LEXICAL AND SEMANTIC SEARCH IN MULTILINGUAL MEDIA AGGREGATION WEB SERVICES
DOI:
https://doi.org/10.34185/1991-7848.itmm.2026.01.094
Keywords:
web service, search optimization, semantic search, Elasticsearch, vector database, BGE-M3, multilingual search, BM25, Qdrant
Abstract
This paper presents the results of a comparative analysis of two search approaches in multilingual media content aggregation web services: lexical search based on the BM25 algorithm (Elasticsearch) and semantic search based on BGE-M3 vector embeddings (Qdrant). The MediaAggregator platform was developed to index 70,000 news articles in seven languages and provide a unified API for comparing search quality, latency, and response size. Experiments were conducted locally on a laptop with an AMD Ryzen 9 8945HS processor (Zen 4, 8 cores, 5.2 GHz boost) and 32 GB LPDDR5x RAM. Experimental results demonstrate that lexical search provides 2.7 times lower latency on average, while semantic search enables cross-lingual retrieval of relevant content regardless of query language, which is critical for multilingual web services.
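The contrast between the two approaches can be illustrated with a minimal, self-contained sketch. It is not the MediaAggregator implementation: the three sample documents, the function names, and the 3-dimensional vectors are all hypothetical, and the vectors are hand-made stand-ins rather than real BGE-M3 embeddings. The BM25 variant uses the Lucene-style IDF with k1 = 1.2 and b = 0.75, which are assumed here as typical defaults.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; real engines apply language-specific analyzers.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a common BM25 variant."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


docs = [
    "election results announced in the capital",  # English
    "результати виборів оголошено у столиці",      # Ukrainian, same topic
    "new smartphone model released this week",    # English, unrelated topic
]
query = "election results"

# ASSUMPTION: hand-made toy vectors, NOT actual BGE-M3 output. They only mimic
# the property that texts with the same meaning lie close together.
toy_doc_vectors = [[0.90, 0.10, 0.00], [0.85, 0.15, 0.05], [0.05, 0.10, 0.95]]
toy_query_vector = [0.88, 0.12, 0.02]

lexical = bm25_scores(query, docs)
semantic = [cosine(toy_query_vector, v) for v in toy_doc_vectors]

for doc, lex, sem in zip(docs, lexical, semantic):
    print(f"lexical={lex:.3f}  semantic={sem:.3f}  {doc}")
```

In this toy setup the Ukrainian document shares no tokens with the English query, so its lexical score is zero, while its stand-in vector remains close to the query vector. This mirrors the cross-lingual behaviour described in the abstract, under the stated assumptions only.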




