A comprehensive survey of modern frameworks and evaluation metrics for RAG systems
DOI:
https://doi.org/10.34185/1562-9945-3-164-2026-10
Keywords:
computer systems, information technologies, data mining, artificial intelligence, RAG, generative language models, machine-based benchmarking
Abstract
The relevance of the study is driven by the rapid proliferation of RAG systems in search and generative tasks, where response quality depends on both the relevance of the retrieved context and how correctly the generative language model uses it. The objective of the research is to review modern metrics and frameworks for evaluating RAG systems and to verify experimentally the impact of retrieval quality on generation metrics. The study analyzes scientific publications, compares evaluation frameworks, and conducts a computational experiment using a vector search system followed by response generation. To evaluate the impact of filtering on retrieval quality and context formation, standard vector search is compared with pre-filtered search. The results confirm that RAG system evaluation must account for both retrieval and generation metrics, as increasing context size without reducing noise does not guarantee improved response quality.
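The comparison between standard and pre-filtered vector search can be illustrated with a minimal sketch. It is not the authors' experimental setup: the toy two-dimensional vectors, the `topic` metadata field, and the helper names `search` and `precision_at_k` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query, docs, k, topic=None):
    """Return the ids of the top-k documents by cosine similarity.

    If `topic` is given, documents are filtered on that (hypothetical)
    metadata field before ranking, i.e. pre-filtered search.
    """
    candidates = [d for d in docs if topic is None or d["topic"] == topic]
    ranked = sorted(candidates, key=lambda d: cosine(query, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:k]]


def precision_at_k(retrieved, relevant):
    """Fraction of retrieved ids that are in the relevant set."""
    if not retrieved:
        return 0.0
    return sum(1 for r in retrieved if r in relevant) / len(retrieved)


# Toy corpus (illustrative values only): document "b" is off-topic noise
# that happens to lie close to the query in embedding space.
docs = [
    {"id": "a", "vec": (1.0, 0.0), "topic": "rag"},
    {"id": "b", "vec": (0.9, 0.1), "topic": "other"},
    {"id": "c", "vec": (0.8, 0.2), "topic": "rag"},
    {"id": "d", "vec": (0.0, 1.0), "topic": "rag"},
]
query = (1.0, 0.0)
relevant = {"a", "c"}

standard = search(query, docs, k=2)
filtered = search(query, docs, k=2, topic="rag")

print("standard:", standard, precision_at_k(standard, relevant))
print("filtered:", filtered, precision_at_k(filtered, relevant))
```

In this toy case the unfiltered search admits the noisy document into the context, while the pre-filtered search keeps only on-topic candidates, so the same context size yields a cleaner context. Real systems would take the embeddings and the filter from a vector database rather than from in-memory lists.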
References
Lewis P., Perez E., Piktus A., Petroni F., Karpukhin V., Goyal N., Küttler H., Lewis M., Yih W., Rocktäschel T., Riedel S., Kiela D. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems. – 2020. Vol. 33. P. 9459-9474. DOI: 10.48550/arXiv.2005.11401
Es S., James J., Espinosa-Anke L., Schockaert S. RAGAS: Automated Evaluation of Retrieval Augmented Generation. Computer Science. Computation and Language. – 2023. DOI: 10.48550/arXiv.2309.15217
Saad-Falcon J., Khattab O., Potts C., Zaharia M. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). – 2024. P. 3464-3483. DOI: 10.48550/arXiv.2311.09476
Park Chanhee, Moon H., Park Chanjun, Lim H. MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation. Computer Science. Computation and Language. – 2025. DOI: 10.48550/arXiv.2504.17137
Friel R., Belyi M., Sanyal A. RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems. Computer Science. Computation and Language. – 2024. DOI: 10.48550/arXiv.2407.11005
Yu Z., Gan Z., Zhang Y., Tong X., Liu H., Liu Q. Evaluation of Retrieval-Augmented Generation: A Survey. Computer Science. Computation and Language. – 2024. DOI: 10.48550/arXiv.2405.07437
Gan A., Yu H., Zhang K., Liu Q., Yan W., Huang Z., Tong S., Hu G. Retrieval Augmented Generation Evaluation in the Era of Large Language Models: A Comprehensive Survey. Computer Science. Computation and Language. – 2025. DOI: 10.48550/arXiv.2504.14891
Rau D., Déjean H., Chirkova N., Formal T., Wang S., Nikoulina V., Clinchant S. BERGEN: A Benchmarking Library for Retrieval-Augmented Generation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), Findings. – 2024. P. 5897-5913. DOI: 10.48550/arXiv.2407.01102
Niu C., Wu Y., Zhu J., Xu S., Shum K., Zhong R., Song J., Zhang T. RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). – 2024. P. 10794-10817. DOI: 10.48550/arXiv.2401.00396
Ding T., Banerjee A., Mombaerts L., Li Y., Borogovac T., Weinstein J. P. VERA: Validation and Evaluation of Retrieval-Augmented Systems. Computer Science. Information Retrieval. – 2024. DOI: 10.48550/arXiv.2409.03759
Ming Y., Purushwalkam S., Pandit S., Ke Z., Nguyen X., Xiong C., Joty S. FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows". Computer Science. Computation and Language. – 2024. DOI: 10.48550/arXiv.2410.03727
Sorodoc I.-T., Ribeiro L., Blloshmi R., Davis C., de Gispert A. GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation. Computer Science. Computation and Language. – 2025. DOI: 10.48550/arXiv.2506.07671
Laban P., Fabbri A. R., Xiong C., Wu C.-S. Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). – 2024. DOI: 10.48550/arXiv.2407.01370
Krumdick M., Lovering C., Reddy V., Ebner S., Tanner C. No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding. Computer Science. Computation and Language. – 2025. DOI: 10.48550/arXiv.2503.05061
Ju J.-H., Verberne S., de Rijke M., Yates A. Controlled Retrieval-augmented Context Evaluation for Long-form RAG. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). – 2025. DOI: 10.48550/arXiv.2506.20051
Casella G., Berger R. L. Statistical Inference. 2nd ed. Pacific Grove: Duxbury, 2002.
License
Copyright (c) 2026 System technologies

This work is licensed under a Creative Commons Attribution 4.0 International License.









