METHOD FOR CONSTRUCTING A CRISIS-CONTEXT DATASET FOR ADAPTIVE IRM VERIFICATION

Authors

DOI:

https://doi.org/10.34185/1991-7848.2026.01.21

Keywords:

crisis-context dataset, large language models, hidden contextual adaptation, Adaptive IRM, question generation, crisis informatics, HumAID, dataset validation

Abstract

Recent research in crisis informatics focuses largely on the automatic processing of social media messages during emergency situations. Existing crisis corpora, including HumAID, CrisisBench, AIDR, TREC-IS, and CrisisMMD, provide an important foundation for message classification, informativeness detection, humanitarian categorization, prioritization, and multimodal annotation. At the same time, most of these resources are oriented toward analyzing individual messages or identifying their class, rather than toward verifying whether a large language model can reconstruct hidden crisis context from an abstract query. With the development of large language models, there is a growing need for specialized datasets that evaluate not only a model's general linguistic competence but also its ability to adapt a response to a context that is not explicitly stated in the user's question.

The purpose of this work is to develop a method for constructing a crisis-context dataset for the subsequent verification of Adaptive IRM in tasks of hidden contextual adaptation of large language model responses. To this end, crisis messages from the HumAID corpus are transformed into pairs of the form “abstract query — crisis-dependent answer”, where the question contains no direct markers of the disaster type yet preserves a semantic connection with the original message.
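For concreteness, one record of such a dataset could be represented as in the minimal sketch below; the field names (question, answer, event_type, event_id, used_fallback) are illustrative assumptions, not a schema published by the authors.

    from dataclasses import dataclass

    @dataclass
    class CrisisQAPair:
        """One question-answer record of the crisis-context dataset (illustrative schema)."""
        question: str        # abstract WH-question with no direct disaster markers
        answer: str          # original HumAID tweet that directly answers the question
        event_type: str      # hidden context, e.g. "earthquake"; never stated in the question
        event_id: str        # source HumAID event, useful for an event-disjoint split
        used_fallback: bool  # whether the fallback question was applied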

The paper proposes a generative dataset construction pipeline that includes primary generation, retry generation, and a fallback mechanism. For each crisis message, a locally deployed large language model generates a short WH-question to which the original tweet should provide a direct answer. The generated question then undergoes automatic validation against formal criteria: the presence of an interrogative structure, the absence of a yes/no form, a terminating question mark, compliance with the length limit, the absence of undesirable template-like formulations, and the absence of direct crisis markers such as earthquake, hurricane, flood, disaster, emergency, and others. If the initial question does not meet these requirements, a retry generation step is performed with a stricter instruction. If the retry also fails, a fallback question is applied, which preserves the completeness of the corpus. As a result, a dataset of 41,152 records was formed across five categories of crisis events: hurricanes, earthquakes, cyclones, wildfires, and floods. The fallback mechanism was used in 1,432 cases, or 3.48% of the corpus.
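To make the pipeline concrete, the sketch below implements the formal validation criteria as a single predicate and wires it into the generation, retry, and fallback loop. This is a minimal sketch: the word lists, the 20-word length limit, the fallback wording, and the generate callable are assumptions for illustration, not values fixed by the paper.

    import re

    WH_WORDS = ("what", "who", "whom", "whose", "where", "when", "why", "how", "which")
    YES_NO_STARTS = ("is", "are", "was", "were", "do", "does", "did",
                     "can", "could", "will", "would", "has", "have", "had", "should")
    CRISIS_MARKERS = ("earthquake", "hurricane", "flood", "cyclone",
                      "wildfire", "disaster", "emergency")
    TEMPLATE_PHRASES = ("according to the tweet", "in this message", "the text says")
    MAX_WORDS = 20  # assumed length limit

    def is_valid_question(q: str) -> bool:
        text = q.strip().lower()
        words = re.findall(r"[a-z']+", text)
        if not words:
            return False
        return (
            q.strip().endswith("?")                           # terminating question mark
            and any(w in words[:3] for w in WH_WORDS)         # interrogative WH structure
            and words[0] not in YES_NO_STARTS                 # not a yes/no form
            and len(words) <= MAX_WORDS                       # length limit
            and not any(p in text for p in TEMPLATE_PHRASES)  # no template-like phrasing
            and not any(m in text for m in CRISIS_MARKERS)    # no direct crisis markers
        )

    def build_pair(tweet: str, generate, max_retries: int = 1) -> dict:
        # `generate(tweet, strict)` stands in for the locally deployed LLM call;
        # prompt construction is outside the scope of this sketch.
        question = generate(tweet, strict=False)               # primary generation
        retries = 0
        while not is_valid_question(question) and retries < max_retries:
            question = generate(tweet, strict=True)            # retry, stricter instruction
            retries += 1
        if not is_valid_question(question):
            question = "What happened in this situation?"      # fallback (assumed wording)
        return {"question": question, "answer": tweet}

Note that the substring match on CRISIS_MARKERS intentionally also rejects inflected forms such as "flooded".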

The main result of the study is a method for transforming crisis messages into “abstract query — crisis-dependent answer” pairs, together with the dataset constructed with it for future verification of hidden contextual adaptation in LLMs. The proposed approach differs from classical crisis datasets in that it models a situation in which the question does not directly indicate the type of disaster, while the correct answer requires accounting for the hidden context of the event. Future work includes formalized manual validation of the corpus, automatic retrieval-style verification of semantic consistency, construction of an event-disjoint split, implementation of Adaptive IRM, and comparison with LLM-baseline, RAG, and PEFT-baseline approaches.
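Of the listed future work items, the event-disjoint split is mechanical enough to sketch: every record from a given source event is assigned to exactly one side of the split, so event-specific wording cannot leak between training and evaluation. The test fraction, random seed, and event_id attribute below are illustrative assumptions.

    import random
    from collections import defaultdict

    def event_disjoint_split(records, test_fraction: float = 0.2, seed: int = 13):
        # Group records by their source event, then assign whole events
        # to the test side until the requested fraction of events is reached.
        by_event = defaultdict(list)
        for r in records:
            by_event[r.event_id].append(r)
        events = sorted(by_event)
        random.Random(seed).shuffle(events)          # deterministic shuffle
        n_test = max(1, int(len(events) * test_fraction))
        test_events = set(events[:n_test])
        train = [r for e in events if e not in test_events for r in by_event[e]]
        test = [r for e in test_events for r in by_event[e]]
        return train, test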

References

Reuter C., Hughes A. L., Kaufhold M.-A. Social media in crisis management: An evaluation and analysis of crisis informatics research. International Journal of Human–Computer Interaction. 2018. Vol. 34, No. 4. P. 280–294. DOI: 10.1080/10447318.2018.1427832.

Alam F., Qazi U., Imran M., Ofli F. HumAID: Human-Annotated Disaster Incidents Data from Twitter with Deep Learning Benchmarks. Proceedings of the International AAAI Conference on Web and Social Media. 2021. Vol. 15, No. 1. P. 933–942. DOI: 10.1609/icwsm.v15i1.18116.

Alam F., Sajjad H., Imran M., Ofli F. CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing. Proceedings of the International AAAI Conference on Web and Social Media. 2021. Vol. 15, No. 1. P. 923–932. DOI: 10.1609/icwsm.v15i1.18115.

Imran M., Castillo C., Lucas J., Meier P., Vieweg S. AIDR: Artificial Intelligence for Disaster Response. WWW '14 Companion: Proceedings of the 23rd International Conference on World Wide Web. New York : ACM, 2014. P. 159–162. DOI: 10.1145/2567948.2577034.

McCreadie R., Buntain C., Soboroff I. TREC Incident Streams: Finding Actionable Information on Social Media. Proceedings of the 16th International Conference on Information Systems for Crisis Response and Management (ISCRAM 2019). Valencia, Spain : ISCRAM Association, 2019. P. 691–705.

Alam F., Ofli F., Imran M. CrisisMMD: Multimodal Twitter Datasets from Natural Disasters. Proceedings of the International AAAI Conference on Web and Social Media. 2018. Vol. 12, No. 1. DOI: 10.1609/icwsm.v12i1.14983.

Lei Z., Dong Y., Li W., Ding R., Wang Q. R., Li J. Harnessing Large Language Models for Disaster Management: A Survey. Findings of the Association for Computational Linguistics: ACL 2025. Vienna, Austria : Association for Computational Linguistics, 2025. P. 14528–14551. DOI: 10.18653/v1/2025.findings-acl.750.

Lewis P., Perez E., Piktus A., Petroni F., Karpukhin V., Goyal N., Küttler H., Lewis M., Yih W.-t., Rocktäschel T., Riedel S., Kiela D. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems. 2020. Vol. 33. P. 9459–9474. DOI: 10.5555/3495724.3496517.

Han Z., Gao C., Liu J., Zhang J., Zhang S. Q. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey. Transactions on Machine Learning Research. 2024. URL: https://openreview.net/forum?id=lIsCS8b6zj (accessed: 29.04.2026).

Afzal A., Chalumattu R., Matthes F., Mascarell L. AdaptEval: Evaluating Large Language Models on Domain Adaptation for Text Summarization. Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U). Miami, Florida, USA : Association for Computational Linguistics, 2024. P. 76–85. DOI: 10.18653/v1/2024.customnlp4u-1.8.

Fu W., Wei B., Hu J., Cai Z., Liu J. QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA : Association for Computational Linguistics, 2024. P. 11783–11803. DOI: 10.18653/v1/2024.emnlp-main.658.

Zhang T., Kishore V., Wu F., Weinberger K. Q., Artzi Y. BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675. 2019. DOI: 10.48550/arXiv.1904.09675.

Sellam T., Das D., Parikh A. P. BLEURT: Learning Robust Metrics for Text Generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online : Association for Computational Linguistics, 2020. P. 7881–7892. DOI: 10.18653/v1/2020.acl-main.704.

Wang Z., Funakoshi K., Okumura M. Automatic Answerability Evaluation for Question Generation. arXiv preprint arXiv:2309.12546. 2023. DOI: 10.48550/arXiv.2309.12546.

Mohammadshahi A., Scialom T., Yazdani M., Yanki P., Fan A., Henderson J., Saeidi M. RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question. Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada : Association for Computational Linguistics, 2023. P. 6845–6867. DOI: 10.18653/v1/2023.findings-acl.428.

Houlsby N., Giurgiu A., Jastrzebski S., Morrone B., de Laroussilhe Q., Gesmundo A., Attariyan M., Gelly S. Parameter-Efficient Transfer Learning for NLP. Proceedings of the 36th International Conference on Machine Learning. 2019. Vol. 97. P. 2790–2799. URL: https://proceedings.mlr.press/v97/houlsby19a.html (accessed: 29.04.2026).

Li X. L., Liang P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Online : Association for Computational Linguistics, 2021. P. 4582–4597. DOI: 10.18653/v1/2021.acl-long.353.

Hu E. J., Shen Y., Wallis P., Allen-Zhu Z., Li Y., Wang S., Wang L., Chen W. LoRA: Low-Rank Adaptation of Large Language Models. Proceedings of the International Conference on Learning Representations. 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9 (accessed: 29.04.2026).

Liu H., Tam D., Muqeeth M., Mohta J., Huang T., Bansal M., Raffel C. Few-Shot Parameter-Efficient Fine-Tuning Is Better and Cheaper than In-Context Learning. Advances in Neural Information Processing Systems. 2022. Vol. 35. P. 1950–1965. URL: https://proceedings.neurips.cc/paper_files/paper/2022/hash/0cde695b83bd186c1fd456302888454c-Abstract-Conference.html (accessed: 29.04.2026).

Guo S., Liao L., Li C., Chua T.-S. A Survey on Neural Question Generation: Methods, Applications, and Prospects. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. 2024. P. 8038–8047. DOI: 10.24963/ijcai.2024/889.

Nguyen B., Yu M., Huang Y., Jiang M. Reference-based Metrics Disprove Themselves in Question Generation. Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA : Association for Computational Linguistics, 2024. P. 13651–13666. DOI: 10.18653/v1/2024.findings-emnlp.798.

Published

2026-04-30

How to Cite

[1] 2026. METHOD FOR CONSTRUCTING A CRISIS-CONTEXT DATASET FOR ADAPTIVE IRM VERIFICATION. Modern Problems of Metallurgy. 29 (Apr. 2026), 319–336. DOI: https://doi.org/10.34185/1991-7848.2026.01.21.