A comprehensive survey of modern frameworks and evaluation metrics for RAG systems

Authors

DOI:

https://doi.org/10.34185/1562-9945-3-164-2026-10

Keywords:

computer systems, information technologies, data mining, artificial intelligence, RAG, generative language models, machine-based benchmarking

Abstract

The study is motivated by the rapid proliferation of retrieval-augmented generation (RAG) systems in search and generative tasks, where response quality depends both on the relevance of the retrieved context and on how correctly the generative language model uses it. The objective is to review modern metrics and frameworks for evaluating RAG systems and to verify experimentally how retrieval quality affects generation metrics. The study analyzes scientific publications, compares evaluation frameworks, and conducts a computational experiment in which a vector search system is followed by response generation. To evaluate the impact of filtering on retrieval quality and context formation, standard vector search is compared with pre-filtered search. The results confirm that RAG system evaluation must account for both retrieval and generation metrics, because increasing context size without reducing noise does not guarantee improved response quality.
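The comparison of standard and pre-filtered vector search can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the `topic` metadata field, the relevance labels, and the helper names `vector_search` and `precision_at_k` are all hypothetical, and cosine similarity with precision@k is assumed as one plausible choice of ranking function and retrieval metric.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec, docs, k, allowed_topic=None):
    """Return ids of the top-k documents by cosine similarity.

    If allowed_topic is given, documents are filtered by metadata
    BEFORE ranking (pre-filtered search); otherwise all are ranked.
    """
    candidates = [
        d for d in docs if allowed_topic is None or d["topic"] == allowed_topic
    ]
    ranked = sorted(
        candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True
    )
    return [d["id"] for d in ranked[:k]]


def precision_at_k(retrieved, relevant):
    """Fraction of retrieved documents that are labeled relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc_id in retrieved if doc_id in relevant) / len(retrieved)


# Hypothetical toy corpus: document "c" is off-topic but close to the query
# in embedding space, so it acts as noise in the unfiltered context.
docs = [
    {"id": "a", "topic": "rag", "vec": [1.0, 0.0]},
    {"id": "b", "topic": "rag", "vec": [0.8, 0.6]},
    {"id": "c", "topic": "other", "vec": [0.9, 0.1]},
    {"id": "d", "topic": "other", "vec": [0.7, 0.7]},
]
query = [1.0, 0.0]
relevant = {"a", "b"}  # assumed ground-truth labels

plain = vector_search(query, docs, k=2)
filtered = vector_search(query, docs, k=2, allowed_topic="rag")

print("standard:    ", plain, precision_at_k(plain, relevant))
print("pre-filtered:", filtered, precision_at_k(filtered, relevant))
```

In this toy setting the unfiltered search admits the off-topic document into the context, while the pre-filtered search does not, which is the kind of noise the retrieval metrics are meant to expose before generation metrics are computed.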

Published

2026-04-30