Abstract:
Semantic segmentation of remote sensing images is a task of great theoretical importance and practical value. Remote sensing images contain rich feature information, and pixels at object boundaries are hard to classify, which makes segmentation difficult. Based on the Mask2Former architecture, two improved structures are proposed: a multi-scale feature interaction structure, Mask2Former-MS, and a boundary optimization structure, Mask2Former-BR. Mask2Former-MS uses bilinear interpolation for up-sampling and down-sampling to fuse features across scales, and introduces a channel attention mechanism to reduce the effect of redundant information. Mask2Former-BR uses an atrous spatial pyramid pooling (ASPP) module for feature extraction: each parallel atrous convolution branch of ASPP uses a different atrous rate to capture feature information at a different scale, and ReLU and batch normalization (BN) layers are used for activation and normalization, respectively, to suppress gradient vanishing and make the predictions for boundary pixels more accurate. Experimental results on the Gaofen image dataset (GID), in comparison with the U-Net network and the original Mask2Former, show that the best precision and accuracy of Mask2Former-MS are 88.82% and 85.90%, and those of Mask2Former-BR are 89.56% and 87.46%, indicating that both improved structures achieve better segmentation results.
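To make the ASPP idea concrete, here is a minimal NumPy sketch of parallel 3x3 atrous convolution branches, each followed by BN and ReLU, with the branch outputs concatenated along the channel axis. It is an illustration under stated assumptions, not the authors' Mask2Former-BR code: the atrous rates (1, 2, 4), the branch width of 8 channels and the random weights are hypothetical, BN uses batch statistics with gamma = 1 and beta = 0, and the 1x1 branch, image-level pooling branch and 1x1 projection found in the standard DeepLab ASPP are omitted.

```python
import numpy as np


def atrous_conv2d(x, w, rate):
    """3x3 atrous (dilated) convolution with zero 'same' padding.

    x: (N, C_in, H, W) feature map; w: (C_out, C_in, 3, 3) kernel; rate: dilation.
    """
    _, _, kh, kw = w.shape
    n, _, h, wd = x.shape
    pad = rate * (kh // 2)  # keeps H and W unchanged for a 3x3 kernel
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((n, w.shape[0], h, wd))
    for i in range(kh):
        for j in range(kw):
            # Kernel tap (i, j) reads the input shifted by multiples of `rate`,
            # so a 3x3 kernel covers a (2*rate + 1) x (2*rate + 1) window.
            patch = xp[:, :, i * rate:i * rate + h, j * rate:j * rate + wd]
            out += np.einsum("oc,nchw->nohw", w[:, :, i, j], patch)
    return out


def batch_norm(x, eps=1e-5):
    # Training-mode BN with gamma=1, beta=0 (assumption): normalize each
    # channel using its mean and variance over the batch and spatial axes.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def relu(x):
    return np.maximum(x, 0.0)


def aspp(x, weights, rates):
    # One branch per atrous rate: conv -> BN -> ReLU, then concatenate channels.
    branches = [relu(batch_norm(atrous_conv2d(x, w, r)))
                for w, r in zip(weights, rates)]
    return np.concatenate(branches, axis=1)


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 16, 16))  # batch of 2, 4 channels, 16x16 map
rates = (1, 2, 4)  # hypothetical atrous rates for a small feature map
weights = [rng.standard_normal((8, 4, 3, 3)) * 0.1 for _ in rates]
y = aspp(x, weights, rates)
print(y.shape)  # (2, 24, 16, 16)
```

Each branch keeps the spatial size, so larger rates widen the receptive field without shrinking the map; concatenating the branches leaves the multi-scale responses side by side for a later fusion layer.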