EXPERIMENTAL STUDY OF DISTANCE MAP DECODER ARCHITECTURAL CONFIGURATIONS FOR INSTANCE SEGMENTATION BY TEXT QUERY
DOI: https://doi.org/10.34185/1562-9945-2-163-2026-01

Keywords: instance segmentation, distance decoder, PixelShuffle, coordinate convolution, feature fusion, CLIP, open-vocabulary segmentation, InstanceCLIPSeg

Abstract
This article presents an experimental study of architectural configurations of the distance map decoder in the InstanceCLIPSeg model for instance segmentation by text query. We investigate the influence of various mechanisms for restoring spatial resolution (bilinear interpolation, PixelShuffle), the use of coordinate convolutions (CoordConv), and multi-level feature fusion strategies. Based on the evaluation of nine configurations on the LVIS and PhraseCut datasets, we find that a hybrid architecture with PixelShuffle and single-stage fusion of features from transformer layers achieves the best result (mean Dice 0.2374), outperforming the baseline approaches. We also show that coordinate channels become redundant in the presence of multi-level fusion.
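The two spatial mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. This is a generic illustration of PixelShuffle (sub-pixel channel-to-space rearrangement, as in Shi et al., 2016) and of CoordConv coordinate channels (Liu et al., 2018), not the InstanceCLIPSeg implementation; the function names are ours.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) tensor into (C, H*r, W*r).

    Follows the standard sub-pixel convention:
    out[c, h*r + i, w*r + j] == x[c*r*r + i*r + j, h, w].
    """
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)          # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)        # interleave: (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

def add_coord_channels(x):
    """Append normalized y/x coordinate channels in [-1, 1] (CoordConv)."""
    c, h, w = x.shape
    ys = np.broadcast_to(np.linspace(-1.0, 1.0, h)[:, None], (h, w))
    xs = np.broadcast_to(np.linspace(-1.0, 1.0, w)[None, :], (h, w))
    return np.concatenate([x, ys[None], xs[None]], axis=0)

# Example: one hypothetical decoder stage that upsamples 2x and
# then tags each position with its coordinates before convolution.
feat = np.random.rand(8, 4, 4)            # (C*r^2, H, W) with r = 2
up = pixel_shuffle(feat, 2)               # (2, 8, 8)
with_coords = add_coord_channels(up)      # (4, 8, 8)
```

Because PixelShuffle only rearranges learned channel values into a finer grid, it avoids the checkerboard artifacts of transposed convolution (Odena et al., 2016) while remaining cheaper than learned upsampling at full resolution.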
References
1. Mashtalir, S. V., & Kovtunenko, A. R. (2025). Improved segmentation model to identify object instances based on textual prompts. Вісник сучасних інформаційних технологій, 8(1), 54-66.
2. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.
3. Lüddecke, T., & Ecker, A. (2022). Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7086-7096).
4. Cheng, B., Collins, M. D., Zhu, Y., Liu, T., Huang, T. S., Adam, H., & Chen, L. C. (2020). Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12475-12485).
5. Sun, B., Kuen, J., Lin, Z., Mordohai, P., & Chen, S. (2023). PRN: Panoptic refinement network. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 3963-3973).
6. Odena, A., Dumoulin, V., & Olah, C. (2016). Deconvolution and checkerboard artifacts. Distill, 1(10), e3.
7. Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., ... & Wang, Z. (2016). Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1874-1883).
8. Aitken, A., Ledig, C., Theis, L., Caballero, J., Wang, Z., & Shi, W. (2017). Checkerboard artifact free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize. arXiv preprint arXiv:1707.02937.
9. Liu, R., Lehman, J., Molino, P., Petroski Such, F., Frank, E., Sergeev, A., & Yosinski, J. (2018). An intriguing failing of convolutional neural networks and the CoordConv solution. Advances in Neural Information Processing Systems, 31.
10. Ronneberger, O., Fischer, P., & Brox, T. (2015, October). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 234-241). Cham: Springer International Publishing.
11. Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117-2125).
12. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., & Luo, P. (2021). SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34, 12077-12090.
13. Gupta, A., Dollar, P., & Girshick, R. (2019). LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5356-5364).
14. Wu, C., Lin, Z., Cohen, S., Bui, T., & Maji, S. (2020). PhraseCut: Language-based image segmentation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10216-10225).
15. Zhang, L., Guo, X., Sun, H., Wang, W., & Yao, L. (2025). Alternate encoder and dual decoder CNN-Transformer networks for medical image segmentation. Scientific Reports, 15(1), 8883.
16. Chen, J., Liang, Z., & Lu, X. (2025). A dual attention and cross layer fusion network with a hybrid CNN and transformer architecture for medical image segmentation. Scientific Reports, 15(1), 35707.
Copyright (c) 2026 System technologies

This work is licensed under a Creative Commons Attribution 4.0 International License.