AUTOMATIC SEMANTIC SEGMENTATION OF SENTINEL-2 IMAGES: INTEGRATION OF CLUSTERING AND LARGE MULTIMODAL MODELS FOR CLUSTER INTERPRETATION
DOI:
https://doi.org/10.34185/1991-7848.itmm.2025.01.095
Keywords:
semantic segmentation, clustering, Sentinel-2, land cover, multimodal models
Abstract
Semantic segmentation of satellite imagery, particularly Sentinel-2 data, is crucial for environmental monitoring and land cover mapping. This paper presents an unsupervised method for land cover classification that requires no pixel-level annotations. The approach combines clustering (K-Means, DBSCAN, and autoencoder-based feature learning) with automated cluster labeling by large vision-language models (e.g., GPT-4, Claude, Gemini 2.0). The clusters are visualized, and the models interpret each one from its color and spatial context. Because different models may name the same class differently, their answers are normalized to a common terminology and combined by majority voting, which improves consistency across model outputs. The method achieves a segmentation accuracy of 85–90%, comparable to supervised methods, while remaining interpretable and scalable. Validation is performed against ESA WorldCover maps. The approach is promising for rapid land cover mapping in resource-constrained or emergency situations.
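The terminology normalization and majority voting steps mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The synonym table, the function names `normalize` and `majority_label`, the `min_share` threshold, and the sample answers are all assumptions made for illustration. The canonical class names loosely follow the ESA WorldCover legend.

```python
from collections import Counter

# Hypothetical synonym table (an assumption, not taken from the paper):
# maps free-text answers from the models to one canonical class name.
SYNONYMS = {
    "forest": "tree cover",
    "woodland": "tree cover",
    "tree cover": "tree cover",
    "farmland": "cropland",
    "agricultural field": "cropland",
    "cropland": "cropland",
    "urban": "built-up",
    "buildings": "built-up",
    "built-up": "built-up",
    "water": "permanent water bodies",
    "river": "permanent water bodies",
    "lake": "permanent water bodies",
}


def normalize(label):
    """Lowercase, collapse whitespace, and map a raw answer to a canonical class."""
    key = " ".join(label.lower().split())
    return SYNONYMS.get(key, "unknown")


def majority_label(answers, min_share=0.5):
    """Return the class chosen by a strict majority of the models.

    Answers that cannot be normalized do not vote but still count toward
    the total, so a cluster without clear agreement stays "unknown".
    """
    votes = Counter(normalize(a) for a in answers)
    votes.pop("unknown", None)
    if not votes:
        return "unknown"
    label, count = votes.most_common(1)[0]
    return label if count / len(answers) > min_share else "unknown"


# Illustrative answers from three models for three clusters (made-up data).
cluster_answers = {
    0: ["Forest", "woodland", "Farmland"],
    1: ["River", "Lake", "water"],
    2: ["urban", "cropland", "glacier"],
}
cluster_labels = {cid: majority_label(a) for cid, a in cluster_answers.items()}
print(cluster_labels)
```

Requiring a strict majority means a tie or a three-way split leaves the cluster unlabeled instead of forcing an arbitrary class, which is one possible way to keep inconsistent model outputs from reaching the final map.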