Prompt engineering for zero-shot land cover classification using multimodal language models on SENTINEL-2 imagery
DOI: https://doi.org/10.34185/1562-9945-4-165-2026-13

Keywords: prompt engineering, zero-shot classification, VLM, model, image, remote sensing, Sentinel-2

Abstract
Multimodal vision-language models (VLMs) enable land cover classification from satellite imagery without labeled training data. This paper, extending previous work [8], analyzes prompt engineering approaches for land cover classification on Sentinel-2 imagery within the ESA WorldCover 2021 taxonomy. The color leakage phenomenon is identified and described, in which the model bases its predictions on segmentation mask colors rather than on image content. A four-invariant prompt protocol is proposed (TCI-first ordering, grayscale mask conversion, elimination of color descriptions, and a fixed JSON output format) that removes this effect and raises the format compliance rate (FCR) from ≈60% to 97%. Two inference strategies are compared on 10 Sentinel-2 tiles: Variant A (multi-cluster, mIoU ≈ 7.1%) and Variant B (single-cluster, mIoU ≈ 13.2%). In Variant B, each segment is processed independently using a binary mask, which simplifies spatial interpretation and reduces inter-segment interference. The highest result (mIoU = 46.2%) is achieved with the UNet-encoder + GPT-4.1 + Variant B configuration, although this corresponds to a single case.
Problem Statement. Land cover mapping from satellite imagery is widely used in ecological monitoring, urban planning, and agronomy. Traditional semantic segmentation approaches require large labeled datasets and significant computational resources, especially when adapting to new regions. Recent multimodal language models, including GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.5 Pro, enable zero-shot classification without task-specific training. However, such pipelines introduce specific failure modes, notably the color leakage effect, where predictions depend on segmentation mask colors instead of actual image content.
Recent Studies and Publications Analysis. VLMs are increasingly used in remote sensing owing to their capacity for open-vocabulary reasoning over satellite imagery. Yao et al. introduced Falcon, a remote sensing vision-language foundation model; Mall et al. developed RSVLM for satellite image understanding; Li et al. presented RS-CLIP for zero-shot scene classification. Liu et al. proposed RSHBench, a detailed benchmark for diagnosing hallucinations in multimodal LLMs applied to remote sensing. For zero-shot learning, Saha et al. demonstrated improved classification by adapting VLMs with attribute descriptions; Barzilai et al. analyzed recipes for improving VLM zero-shot accuracy in remote sensing. In prompt engineering, Wei et al. established chain-of-thought prompting and White et al. catalogued reusable prompt patterns. Geirhos et al. documented shortcut learning in deep networks, providing theoretical grounding for the color leakage phenomenon. Despite these advances, prompt design for eliminating color artifacts in VLM-based land cover classification has not yet been systematically studied.
Research Objective. The objective of this study is to improve classification accuracy (mIoU) and structured output correctness (FCR) in zero-shot land cover classification on Sentinel-2 imagery by developing a prompt engineering protocol for multimodal language models that eliminates the color leakage effect and enforces a fixed structure of inputs and outputs.
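The format compliance rate (FCR) used as a target metric here can be operationalized as the share of model responses that parse as a single JSON object with the expected keys. A minimal sketch, assuming a hypothetical schema with `segment_id` and `class_name` fields (the paper does not publish the exact schema):

```python
import json

# Assumed output schema fields; the actual schema in the paper may differ.
REQUIRED_KEYS = {"segment_id", "class_name"}

def is_format_compliant(raw_response: str) -> bool:
    """True if the response is one JSON object containing the expected keys."""
    try:
        obj = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

def format_compliance_rate(responses: list[str]) -> float:
    """FCR: fraction of responses satisfying the structured-output contract."""
    if not responses:
        return 0.0
    return sum(map(is_format_compliant, responses)) / len(responses)
```

Free-text answers and malformed JSON both count as non-compliant, which is what lets a strict output constraint move FCR from ≈60% toward 97%.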
Main Body of Research. A two-stage processing pipeline is used, combining unsupervised segmentation with VLM-based classification under a four-invariant protocol: TCI-first ordering, grayscale mask, no color descriptions, and structured JSON output. Variant A performs classification of all segments in a single request, while Variant B processes each segment independently using a binary mask. This change in formulation improves mIoU from 7.1% to 13.2%. Ablation analysis (n = 5 tiles) shows that the JSON output constraint has the largest impact on FCR, while grayscale mask conversion most effectively reduces color leakage. Per-class analysis indicates that the improvement is primarily driven by the Cropland class (23.4% → 46.9%), whereas spectrally similar vegetation classes degrade.
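The four invariants and the Variant B formulation can be sketched as a request builder. This is an illustration only: the message-part dictionaries, helper name, and instruction wording are assumptions, not the paper's actual implementation, and the class list follows the public ESA WorldCover 2021 taxonomy.

```python
from typing import Any

# ESA WorldCover 2021 classes (public taxonomy).
WORLDCOVER_CLASSES = [
    "Tree cover", "Shrubland", "Grassland", "Cropland", "Built-up",
    "Bare / sparse vegetation", "Snow and ice", "Permanent water bodies",
    "Herbaceous wetland", "Mangroves", "Moss and lichen",
]

def build_single_segment_request(tci_b64: str, mask_b64: str) -> list[dict[str, Any]]:
    """Assemble one Variant-B request under the four invariants:
    (1) TCI image first, (2) grayscale binary mask second,
    (3) no color words in the instruction, (4) fixed JSON output schema."""
    instruction = (
        "Classify the land cover of the highlighted segment. "
        "The first image is the true-color Sentinel-2 tile; the second is a "
        "binary mask marking exactly one segment. Answer with exactly one "
        'JSON object: {"class_name": <one of ' + ", ".join(WORLDCOVER_CLASSES) + ">}"
    )
    return [
        {"type": "image", "data": tci_b64},   # invariant 1: TCI first
        {"type": "image", "data": mask_b64},  # invariant 2: grayscale mask
        {"type": "text", "text": instruction},  # invariants 3 and 4
    ]
```

Because each request carries a single binary mask, every response is a single JSON object, which is what removes inter-segment interference in Variant B.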
Conclusions. The study addresses the problem of improving classification accuracy (mIoU) and structured output correctness (FCR) in zero-shot land cover classification on Sentinel-2 satellite imagery through the development of a prompt engineering protocol for multimodal language models. The proposed protocol, consisting of four mandatory rules, eliminates the color leakage effect and increases FCR from ≈60% to 97%.
It is shown that the use of the single-cluster processing strategy (Variant B), in which each segment is processed independently using a binary mask, improves classification accuracy from 7.1% to 13.2% compared to the multi-cluster strategy (Variant A). This approach eliminates inter-segment context contamination, simplifies segment interpretation for the model, and improves structured output correctness, as each request produces a single JSON object. The highest result (mIoU = 46.2%) is achieved with the UNet-encoder + GPT-4.1 + Variant B configuration; however, this corresponds to a single configuration and is not representative of overall performance across models and segmentation methods.
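For reference, the mIoU figures quoted throughout can be computed as the mean of per-class intersection-over-union between the predicted and reference (WorldCover) maps. A minimal sketch, with classes absent from both maps skipped so they do not distort the mean:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, class_ids: list[int]) -> float:
    """mIoU over the given class IDs; classes absent from both maps are skipped."""
    ious = []
    for c in class_ids:
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class not present in prediction or ground truth
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```

This is the standard metric definition; the paper's exact evaluation code (tiling, class mapping) is not published here.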
References
Heipke, C., & Rottensteiner, F. (2020). Deep learning for geometric and semantic tasks in photogrammetry and remote sensing. ISPRS Journal of Photogrammetry and Remote Sensing, 166, 28–30. https://doi.org/10.1080/10095020.2020.1718003
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015) (Vol. 9351, pp. 234–241). https://doi.org/10.1007/978-3-319-24574-4_28
Hnatushenko, V., & Honcharov, O. (2024). Land cover mapping with Sentinel-2 imagery using deep learning semantic segmentation models. In Proceedings of the 11th International Scientific Conference "Information Technology and Implementation" (IT&I-2024) (CEUR Workshop Proceedings, Vol. 3909, pp. 1–18). https://ceur-ws.org/Vol-3909/Paper_1.pdf
Achiam, J., Adler, S., Agarwal, S., et al. (2023). GPT-4 technical report. arXiv preprint. https://doi.org/10.48550/arXiv.2303.08774
Comanici, G., Bieber, E., Schaekermann, M., et al. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint. https://doi.org/10.48550/arXiv.2507.06261
Mall, U., Phoo, C. P., Liu, M. K., Vondrick, C., Hariharan, B., & Bala, K. (2024). Remote sensing vision-language foundation models without annotations via ground remote alignment. In International Conference on Learning Representations (ICLR 2024). https://doi.org/10.48550/arXiv.2312.06960
Li, X., Wen, C., Hu, Y., & Zhou, N. (2023). RS-CLIP: Zero-shot remote sensing scene classification via contrastive vision-language supervision. International Journal of Applied Earth Observation and Geoinformation, 124, 103497. https://doi.org/10.1016/j.jag.2023.103497
Hnatushenko, V., Honcharov, O., & Heipke, C. (2026). Zero-shot land-cover recognition via unsupervised classification and VLM inference on Sentinel-2 imagery. In Proceedings of the 46th Annual Conference of the DGPF, Darmstadt. Publikationen der DGPF, Band 34.
Yao, K., Xu, N., Yang, R., et al. (2025). Falcon: A remote sensing vision-language foundation model (technical report). arXiv preprint. https://doi.org/10.48550/arXiv.2503.11070
Sosa, J., Rukhovich, D., Kacem, A., & Aouada, D. (2026). Enabling training-free text-based remote sensing segmentation. arXiv preprint. https://doi.org/10.48550/arXiv.2602.17799
Liu, Y., Zhang, J., Wang, D., et al. (2026). Seeing clearly without training: Mitigating hallucinations in multimodal LLMs for remote sensing. arXiv preprint. https://doi.org/10.48550/arXiv.2603.02754
Romera-Paredes, B., & Torr, P. (2015). An embarrassingly simple approach to zero-shot learning. In Proceedings of the 32nd International Conference on Machine Learning (Vol. 37, pp. 2152–2161). PMLR. https://proceedings.mlr.press/v37/romera-paredes15.html
Saha, O., Van Horn, G., & Maji, S. (2024). Improved zero-shot classification by adapting VLMs with text descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024) (pp. 17542–17552). https://doi.org/10.48550/arXiv.2401.02460
Barzilai, A., Gigi, Y., Helmy, A., et al. (2025). A recipe for improving remote sensing VLM zero-shot generalization. In International Conference on Learning Representations (ICLR 2025). https://doi.org/10.48550/arXiv.2503.08722
White, J., Fu, Q., Hays, S., et al. (2023). A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint. https://doi.org/10.48550/arXiv.2302.11382
Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (Vol. 35). https://doi.org/10.48550/arXiv.2201.11903
Hnatushenko, V., Kundenko, P., Tsaryk, V., & Dmytriieva, I. (2025). Comparative analysis of activation functions in U-Net for binary water segmentation using Sentinel-2 imagery. In Proceedings of CoLInS-2025 (CEUR Workshop Proceedings, Vol. 3983, Paper 11). https://ceur-ws.org/Vol-3983/paper11.pdf
Hnatushenko, V., Zhurba, A., Zimoglyad, A., & Ostrovska, K. (2025). Research on environmental changes based on fractal characteristics of satellite images. In Proceedings of MoDaST 2025 (CEUR Workshop Proceedings, Vol. 4005, pp. 62–71). https://ceur-ws.org/Vol-4005/paper5.pdf
Zanaga, D., Van De Kerchove, R., Daems, D., et al. (2022). ESA WorldCover 10m 2021 v200. Zenodo. https://doi.org/10.5281/zenodo.7254221
Alayrac, J. B., Donahue, J., Luc, P., et al. (2022). Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35, 23716–23736. https://doi.org/10.48550/arXiv.2204.14198
Geirhos, R., Jacobsen, J. H., Michaelis, C., et al. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence, 2, 665–673. https://doi.org/10.1038/s42256-020-00257-z
License
Copyright (c) 2026 System technologies

This work is licensed under a Creative Commons Attribution 4.0 International License.