Semantic Segmentation Model Comparison on the Inria Aerial Image Labeling Dataset

Authors

Connor Lewis & Caleb Hoffman

Introduction

Semantic segmentation is a supervised machine learning technique that involves classifying each pixel in an image into specific categories or classes. Unlike traditional image classification, which assigns a single label to an entire image, semantic segmentation provides a more granular understanding by labeling individual pixels (Guo, Liu, Georgiou, et al. 2018). This technique is used in many fields, such as medical imaging, autonomous driving, and remote sensing, where precise identification of objects or regions within images is needed. The advancements in deep learning have led to significant improvements in segmentation techniques, enabling accurate analysis of complex imagery. Different segmentation models, ranging from simpler architectures to more complex deep neural networks, each present unique benefits in terms of accuracy, computational efficiency, and generalizability across diverse image types.

The dataset used in this research is the Inria Aerial Image Labeling dataset, which accompanies an online competition for evaluating semantic segmentation models on aerial imagery. The dataset consists of 180 high-resolution color image tiles, each with a corresponding reference mask (Images 1 and 2). Every tile measures 5000×5000 pixels and covers a 1500 m × 1500 m area at a resolution of 30 cm per pixel. The dataset includes 36 image tiles from each of five geographic regions: Austin, Chicago, Kitsap County, Vienna, and Western Tyrol. These regions were specifically chosen to represent diverse urban and rural landscapes, which offer a variety of challenges for segmentation models. The dataset is part of a broader research effort aimed at assessing the ability of machine learning models to perform pixel-wise labeling on aerial imagery, a key task in many geospatial applications, with a focus on a model’s ability to generalize across different cities and environments (Maggiori, Tarabalka, Charpiat, & Alliez, 2017). By using this dataset, this research aims to explore how well various segmentation models handle diverse urban structures and geographical features.

Image 1: Chicago Satellite Image from Inria dataset

Image 2: Chicago Reference Image from Inria dataset

The goal of this study is to conduct a comparison of three different base architectures for semantic segmentation, each representing a distinct approach to pixel wise classification. These architectures are: the traditional UNet, DeepLabv3+, and UNet++.

Background

The UNet architecture, originally developed for biomedical image segmentation, is known for its encoder-decoder structure and its ability to capture fine-grained spatial features through skip connections. DeepLabv3+, on the other hand, incorporates advanced techniques such as atrous convolution and multi-scale processing to enhance feature extraction, particularly for capturing contextual information across varying object sizes. UNet++, an extension of the original UNet, introduces nested skip pathways to improve feature propagation and refine segmentation accuracy. Each of these models offers different strengths in terms of computational complexity, accuracy, and ability to generalize to new environments.

This research will use a set of performance metrics, including pixel accuracy, mean intersection over union (mIoU), and others, to evaluate the effectiveness of each segmentation model. By comparing these models’ results on the Inria dataset, the study will identify which architecture offers the best trade-off between segmentation accuracy and computational efficiency, providing valuable insights into the optimal models for future deployment in aerial image analysis.

Architectures

The first architecture in the study is UNet, a convolutional neural network (CNN) for semantic segmentation that serves as the basis for many later models. UNet uses an encoder-decoder design: the encoder downsamples the input image, producing feature maps of progressively lower resolution, as seen in Figure 1. The decoder then upsamples these feature maps back to the original size through convolutions, producing a segmentation result. This architecture has served as the foundation for multiple derived models.

A 2022 study at the North China University of Technology (Sun et al., 2022) used UNet as the basis for a multi-attention model that addresses segmentation of multi-category targets. The model applies multi-head attention to the lowest-level features to reconstruct the feature map, which allows for finer-grained segmentation. This implementation saw a 4.27% increase in mean intersection over union. Other models relate to UNet as well: UNet++ extends it directly, and DeepLabv3+ adopts a comparable encoder-decoder structure.

Figure 1: UNet Architecture: geeksforgeeks.org

DeepLab, specifically DeepLabv3+, is a CNN that addresses the loss of spatial resolution that occurs as feature maps shrink through the convolutional stages. DeepLab uses atrous convolution and Atrous Spatial Pyramid Pooling (ASPP) to reduce the fuzzy boundaries that occur in regular CNNs. Atrous (dilated) convolution inserts gaps between the kernel weights, which enlarges the field of view without adding parameters or computation. Figure 2 shows the architecture of the DeepLabv3+ model.
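As a minimal sketch of the atrous idea, the snippet below compares a standard 3x3 convolution with a dilated one in PyTorch. The layer sizes, the dilation rate of 2, and the helper names `count_params` and `effective_kernel` are illustrative assumptions, not values taken from DeepLab itself.

```python
import torch
import torch.nn as nn

# Two 3x3 convolutions that differ only in their dilation rate.
# Padding is chosen so that both preserve the spatial size of the input.
standard = nn.Conv2d(1, 1, kernel_size=3, padding=1, dilation=1, bias=False)
atrous = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2, bias=False)

def count_params(layer):
    """Number of learnable weights in a layer."""
    return sum(p.numel() for p in layer.parameters())

def effective_kernel(kernel_size, dilation):
    """Span (in pixels) covered by a dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1

x = torch.randn(1, 1, 64, 64)
out_standard = standard(x)
out_atrous = atrous(x)

print("parameters:", count_params(standard), count_params(atrous))
print("field of view:", effective_kernel(3, 1), effective_kernel(3, 2))
print("output shapes:", tuple(out_standard.shape), tuple(out_atrous.shape))
```

Both layers hold the same nine weights, but the dilated kernel spans a 5x5 window instead of 3x3, which is the sense in which the field of view grows without extra parameters.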

Figure 2: DeepLabv3+ Architecture: developers.arcgis.com

In October 2020, Du et al. published a paper on incorporating DeepLabv3+ into the analysis of remote sensing images. Combining DeepLabv3+ with object-based image analysis and random forest classifiers, the researchers reported accuracies of 90% and 85% on two different datasets.

In January 2020, a study conducted by Peng et al. used the DeepLabv3+ model to detect litchi branches, which are prone to damage during harvest season. Combining the DeepLab model with transfer learning and data augmentation, the model achieved a mean intersection over union of 0.765, an improvement over the DeepLabv3 model.

UNet++ is a CNN built on UNet that addresses the semantic gap between the encoder and decoder features. This semantic gap causes the decoder to struggle with fine-grained details, leading to inaccurate segmentation. UNet++ addresses this problem with nested skip pathways. Through these pathways, the decoder is able to combine high-level and low-level features, improving its understanding of the image. Figure 3 shows the architecture of UNet++ and how the nested skip pathways alter the original UNet design.

Figure 3: UNet++ Architecture: geeksforgeeks.org

To demonstrate these improvements, Zhou et al. published research in 2018 showing that UNet++ has uses beyond geospatial analysis. Performing nodule segmentation in low-dose CT scans, nuclei segmentation in microscopic images, liver segmentation in abdominal CT scans, and polyp segmentation in colonoscopy videos, the researchers found that UNet++ had an advantage over the base UNet.

Zhou continued this research on UNet++ with a 2019 publication showing not only that segmentation quality increases with UNet++, but also that a pruned version increases efficiency with minimal performance degradation. For this continuation, new medical data was used to compare UNet++ to a standard UNet model. The new UNet++ model was shown to consistently outperform UNet.

It’s important to mention that throughout the studies, the models are constantly being built upon to improve them, through the addition of other mathematical methods. The process of altering the models helps get results that are specific to the subject area in which each study is conducted. For our study, we wanted to compare the baseline of each model, in order to see which model would be the best to focus improvements on in future research.

Methods and Model Architectures

To assess the performance of each model, we evaluated them on three key metrics: validation accuracy, Intersection over Union (IoU), and training time, reported in seconds per epoch. These metrics provide insights into the segmentation quality and computational efficiency of each model.
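The two quality metrics can be sketched as follows for a binary building mask. This is an illustration rather than the exact evaluation code used in the study: the function name `binary_metrics`, the 0.5 probability threshold, and the choice to compute IoU for the foreground class only are assumptions.

```python
import torch

def binary_metrics(logits, target, threshold=0.5, eps=1e-7):
    """Return (pixel accuracy, foreground IoU) for one batch of logits."""
    pred = torch.sigmoid(logits) > threshold       # boolean prediction mask
    target = target.bool()
    accuracy = (pred == target).float().mean().item()
    intersection = (pred & target).sum().item()    # pixels both call foreground
    union = (pred | target).sum().item()           # pixels either calls foreground
    iou = intersection / (union + eps)             # eps avoids division by zero
    return accuracy, iou

# Tiny 2x2 example: the left column is predicted foreground,
# while the top row is the true foreground.
logits = torch.tensor([[4.0, -4.0], [4.0, -4.0]])
target = torch.tensor([[1, 1], [0, 0]])
acc, iou = binary_metrics(logits, target)
print(f"accuracy={acc:.3f}, IoU={iou:.3f}")
```

In this toy case two of four pixels match, while only one of the three pixels in the union is shared, so accuracy and IoU differ even on a tiny mask.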

Given the high resolution of the dataset images (5000 x 5000 pixels), we resized the images to 512 x 512 to make training computationally feasible. This resizing allowed us to fit the data into memory while maintaining some detail for segmentation tasks. The smaller resolution also reduced training time and enabled faster experimentation.
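One way to perform such a resize is sketched below. The interpolation modes are an assumption, since the report does not state which were used: bilinear for the image and nearest-neighbor for the mask, so that mask values stay binary. The helper name `resize_pair` is hypothetical, and a 1000 x 1000 random tensor stands in for a full tile to keep the example light.

```python
import torch
import torch.nn.functional as F

def resize_pair(image, mask, size=(512, 512)):
    """Resize a (3, H, W) image and its (1, H, W) mask to `size`."""
    image = F.interpolate(image.unsqueeze(0), size=size,
                          mode="bilinear", align_corners=False)
    # Nearest-neighbor keeps the mask strictly 0/1 (no blended labels).
    mask = F.interpolate(mask.unsqueeze(0).float(), size=size, mode="nearest")
    return image.squeeze(0), mask.squeeze(0)

# Stand-in data; a real tile would be 5000 x 5000.
image = torch.rand(3, 1000, 1000)
mask = (torch.rand(1, 1000, 1000) > 0.8).float()
small_image, small_mask = resize_pair(image, mask)

# Ground footprint of one pixel after resizing a 1500 m tile to 512 pixels.
meters_per_pixel = 1500 / 512
print(tuple(small_image.shape), tuple(small_mask.shape))
print(f"approx. {meters_per_pixel:.2f} m per pixel")
```

At 512 x 512 each pixel covers roughly 2.9 m of ground instead of 0.3 m, which gives a sense of how much building detail is lost by the resize.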

This framework ensured that we could systematically compare the models while accounting for the challenges introduced by resizing the images. In the next sections, we describe the architectures of the models.

UNet

The architecture for the UNet used in this study is outlined below. The network follows an encoder-decoder architecture with skip connections that help preserve spatial information lost during down-sampling. It processes input images with three channels and generates segmentation outputs with a single channel for binary segmentation.

The encoder path compresses input images through a series of convolutional and pooling operations. Each stage of the encoder consists of two convolutional layers with a 3x3 kernel, each followed by batch normalization and a ReLU activation function. Max pooling layers with a stride of 2 progressively reduce the spatial dimensions of the feature maps. The encoder is composed of four stages with feature channels increasing in size: 16, 32, 64, and 128. At the bottom of the network, the bottleneck layer applies a double 3x3 convolution with batch normalization and ReLU activation, producing a compressed representation with 256 channels.

The decoder path reconstructs the segmentation map by up-sampling the feature maps through transpose convolutions with a 2x2 kernel and stride of 2. At each up-sampling step, the resulting feature maps are concatenated with corresponding encoder outputs via skip connections, which help retain spatial details. Each decoder stage then processes the concatenated feature maps through a double 3x3 convolution, batch normalization, and ReLU activation. The decoder progressively reduces the number of channels to 128, 64, 32, and 16 as it reconstructs the image.

The final output layer applies a 1x1 convolution, producing a segmentation map with logits that can be thresholded for binary segmentation. The model was trained using Binary Cross-Entropy with Logits Loss (BCEWithLogitsLoss) and optimized with the Adam optimizer at a learning rate of 0.001 for 25 epochs. The implementation was performed in PyTorch, and training was conducted on GPU to leverage hardware acceleration. The combination of skip connections, convolutional up-sampling, and efficient feature extraction enables the U-Net to achieve accurate segmentation by integrating both coarse and fine-grained image features.

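import torch
import torch.nn as nn

# Assumed device setup: use the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

EPOCHS = 25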

def double_conv(inChannels, outChannels):
    return nn.Sequential(
        nn.Conv2d(inChannels, outChannels, kernel_size=(3, 3), stride=1, padding=1),
        nn.BatchNorm2d(outChannels),
        nn.ReLU(inplace=True),
        nn.Conv2d(outChannels, outChannels, kernel_size=(3, 3), stride=1, padding=1),
        nn.BatchNorm2d(outChannels),
        nn.ReLU(inplace=True)
    )

def up_conv(inChannels, outChannels):
    return nn.Sequential(
        nn.ConvTranspose2d(inChannels, outChannels, kernel_size=(2, 2), stride=2),
        nn.BatchNorm2d(outChannels),
        nn.ReLU(inplace=True)
    )

class UNet(nn.Module):
    def __init__(self, encoderChn, decoderChn, inChn, botChn, nCls):
        super().__init__()
        self.encoder1 = double_conv(inChn, encoderChn[0])

        self.encoder2 = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            double_conv(encoderChn[0], encoderChn[1])
        )

        self.encoder3 = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            double_conv(encoderChn[1], encoderChn[2])
        )

        self.encoder4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            double_conv(encoderChn[2], encoderChn[3])
        )

        self.bottleneck = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            double_conv(encoderChn[3], botChn)
        )

        self.decoder1up = up_conv(botChn, botChn)
        self.decoder1 = double_conv(encoderChn[3] + botChn, decoderChn[0])

        self.decoder2up = up_conv(decoderChn[0], decoderChn[0])
        self.decoder2 = double_conv(encoderChn[2] + decoderChn[0], decoderChn[1])

        self.decoder3up = up_conv(decoderChn[1], decoderChn[1])
        self.decoder3 = double_conv(encoderChn[1] + decoderChn[1], decoderChn[2])

        self.decoder4up = up_conv(decoderChn[2], decoderChn[2])
        self.decoder4 = double_conv(encoderChn[0] + decoderChn[2], decoderChn[3])

        self.classifier = nn.Conv2d(decoderChn[3], nCls, kernel_size=(1, 1))

    def forward(self, x):
        # Encoder
        encoder1 = self.encoder1(x)
        encoder2 = self.encoder2(encoder1)
        encoder3 = self.encoder3(encoder2)
        encoder4 = self.encoder4(encoder3)

        # Bottleneck
        x = self.bottleneck(encoder4)

        # Decoder
        x = self.decoder1up(x)
        x = torch.cat([x, encoder4], dim=1)
        x = self.decoder1(x)

        x = self.decoder2up(x)
        x = torch.cat([x, encoder3], dim=1)
        x = self.decoder2(x)

        x = self.decoder3up(x)
        x = torch.cat([x, encoder2], dim=1)
        x = self.decoder3(x)

        x = self.decoder4up(x)
        x = torch.cat([x, encoder1], dim=1)
        x = self.decoder4(x)

        # Classifier head
        x = self.classifier(x)

        return x
      
modelUnet = UNet(
    encoderChn=[16, 32, 64, 128],
    decoderChn=[128, 64, 32, 16],
    inChn=3,  
    botChn=256,  
    nCls=1 
).to(device)
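The training configuration described above (BCEWithLogitsLoss, Adam, learning rate 0.001) could be wired up roughly as follows. This is a sketch under stated assumptions: a tiny two-layer network stands in for the real segmentation model, random tensors stand in for image batches, and the `train_step` helper is hypothetical.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the segmentation network (3 channels in, 1 logit map out).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(8, 1, kernel_size=1),
).to(device)

criterion = nn.BCEWithLogitsLoss()               # expects raw logits and float 0/1 targets
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train_step(images, masks):
    """Run one optimization step and return the scalar loss."""
    model.train()
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, masks)
    loss.backward()
    optimizer.step()
    return loss.item()

# Random stand-in batch: two 64x64 RGB images with binary masks.
images = torch.rand(2, 3, 64, 64, device=device)
masks = (torch.rand(2, 1, 64, 64, device=device) > 0.5).float()
losses = [train_step(images, masks) for _ in range(5)]
print(losses)
```

Because the loss applies the sigmoid internally, the model's final layer outputs raw logits, matching the 1x1 classifier head described in the text.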

UNet++

The U-Net++ model used in this study is an advanced variation of the U-Net architecture, designed to enhance segmentation performance through its nested and dense skip connections. U-Net++ aims to address limitations in traditional U-Net by incorporating intermediate feature maps at different scales, improving the accuracy of predictions and feature propagation across the network.

The architecture consists of an encoder-decoder framework with multiple convolutional blocks and interconnecting paths. The encoder path progressively compresses the input images through three levels of convolutional operations, where each level consists of two convolutional layers with a 3x3 kernel and ReLU activation function. Max pooling layers with a stride of 2 reduce the spatial dimensions between encoding stages. The feature channels increase hierarchically from 16 to 32 and then to 64 as the input image progresses through the encoder.

In the decoder path, the segmentation map is reconstructed using nested skip connections that integrate feature maps from multiple encoding and decoding stages. Bilinear up-sampling is applied to align feature maps spatially before concatenation. This nested design allows for improved gradient flow and refinement of segmentation features across different resolutions. Each decoder block consists of convolutional operations that process the concatenated feature maps from the encoder and earlier decoder outputs.

The network can produce outputs at multiple stages of the decoder, allowing for optional deep supervision. In the current implementation, deep supervision is disabled, and the final segmentation map is generated from the last nested decoder node (x03), at full input resolution, using a 1x1 convolution. This output provides the logits necessary for binary segmentation, making the model compatible with Binary Cross-Entropy with Logits Loss (BCEWithLogitsLoss).

The U-Net++ model was trained for 25 epochs using the Adam optimizer with a learning rate of 0.001. The model implementation was performed in PyTorch, and training was conducted on GPU. The nested skip connections, coupled with the dense feature fusion strategy, are intended to improve segmentation accuracy through multi-scale feature representations.

import torch
import torch.nn as nn
import torch.nn.functional as F

EPOCHS = 25

class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(ConvBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        return x

class UnetPlusPlus(nn.Module):
    def __init__(self, input_channels=3, num_classes=1, deep_supervision=False):
        super(UnetPlusPlus, self).__init__()
        self.deep_supervision = deep_supervision

        # Encoding path
        self.conv00 = ConvBlock(input_channels, 16)
        self.conv10 = ConvBlock(16, 32)
        self.conv20 = ConvBlock(32, 64)

        # Decoding path
        self.conv01 = ConvBlock(16 + 32, 16)  # Concatenation of x00 and interpolated x10
        self.conv11 = ConvBlock(32 + 64, 32)  # Concatenation of x10 and interpolated x20

        self.conv02 = ConvBlock(16 + 16 + 32, 16)  # Concatenation of x00, x01, and interpolated x11
        self.conv12 = ConvBlock(32 + 32 + 64, 32)  # Concatenation of x10, x11, and interpolated x20 (this three-level encoder has no x21)

        self.conv03 = ConvBlock(16 + 16 + 16 + 32, 16)  # Concatenation of x00, x01, x02, and interpolated x12

        # Final outputs
        self.final01 = nn.Conv2d(16, num_classes, kernel_size=1)
        self.final02 = nn.Conv2d(16, num_classes, kernel_size=1)
        self.final03 = nn.Conv2d(16, num_classes, kernel_size=1)

    def forward(self, x):
        # Encoding
        x00 = self.conv00(x)
        x10 = self.conv10(F.max_pool2d(x00, kernel_size=2))
        x20 = self.conv20(F.max_pool2d(x10, kernel_size=2))

        # Decoding
        x01 = self.conv01(torch.cat([x00, F.interpolate(x10, size=x00.shape[2:], mode='bilinear', align_corners=True)], dim=1))
        x11 = self.conv11(torch.cat([x10, F.interpolate(x20, size=x10.shape[2:], mode='bilinear', align_corners=True)], dim=1))

        x02 = self.conv02(torch.cat([x00, x01, F.interpolate(x11, size=x00.shape[2:], mode='bilinear', align_corners=True)], dim=1))
        x12 = self.conv12(torch.cat([x10, x11, F.interpolate(x20, size=x10.shape[2:], mode='bilinear', align_corners=True)], dim=1))

        x03 = self.conv03(torch.cat([x00, x01, x02, F.interpolate(x12, size=x00.shape[2:], mode='bilinear', align_corners=True)], dim=1))

        # Deep Supervision (if enabled)
        if self.deep_supervision:
            output1 = self.final01(x01)
            output2 = self.final02(x02)
            output3 = self.final03(x03)
            return [output1, output2, output3]
        else:
            return self.final03(x03)
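Deep supervision was disabled in this study, but if it were enabled, the list of outputs returned by the model would need a loss that handles several logit maps. One common option, shown here only as an assumption about how it might be done, is to average the per-output losses. The `supervised_loss` helper is hypothetical.

```python
import math
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

def supervised_loss(outputs, target):
    """Average BCE-with-logits loss over one or several output maps."""
    if isinstance(outputs, (list, tuple)):
        # Deep supervision: every intermediate output is compared with the same target.
        return sum(criterion(o, target) for o in outputs) / len(outputs)
    # Single output: plain loss, as in the configuration used in the study.
    return criterion(outputs, target)

# Zero logits correspond to a predicted probability of 0.5 everywhere,
# so the binary cross-entropy equals ln(2) regardless of the target.
target = (torch.rand(1, 1, 8, 8) > 0.5).float()
zero_logits = torch.zeros(1, 1, 8, 8)
deep_loss = supervised_loss([zero_logits, zero_logits, zero_logits], target).item()
single_loss = supervised_loss(zero_logits, target).item()
print(deep_loss, single_loss, math.log(2))
```

The same function works with and without deep supervision, so the training loop would not need to change when the flag is toggled.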

DeepLabv3+

The DeepLab model used in this study is a segmentation architecture that uses atrous convolutions to extract multi-scale contextual information. The model was initialized with a ResNet-50 backbone pretrained on ImageNet, which serves as a feature extractor. The pretrained backbone enables the model to benefit from transfer learning, speeding up convergence and improving performance on aerial image segmentation tasks. Note that the implementation relies on the `deeplabv3_resnet50` constructor; assuming this is the torchvision function of that name, it provides DeepLabv3, which does not include the additional decoder module that distinguishes DeepLabv3+.

DeepLabV3 incorporates an Atrous Spatial Pyramid Pooling (ASPP) module that captures features at multiple receptive fields using parallel atrous convolutions with different dilation rates. This design allows the model to retain detailed spatial information while achieving a larger effective receptive field without increasing the number of kernel weights. For binary segmentation, the original classifier of the model is adjusted to output a single-channel prediction using a 1x1 convolutional layer, suitable for generating logits for binary masks.
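The parallel-branch idea behind ASPP can be sketched with a simplified module. This is not the torchvision implementation: the class name `MiniASPP`, the channel counts, and the dilation rates (1, 6, 12) are illustrative assumptions, and the image-level pooling branch and batch normalization of the full design are omitted.

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Simplified ASPP: parallel dilated 3x3 convolutions fused by a 1x1 convolution."""

    def __init__(self, in_channels, out_channels, rates=(1, 6, 12)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for a 3x3 kernel.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      padding=rate, dilation=rate, bias=False)
            for rate in rates
        ])
        # 1x1 convolution that merges the concatenated multi-scale features.
        self.project = nn.Conv2d(out_channels * len(rates), out_channels, kernel_size=1)

    def forward(self, x):
        features = [branch(x) for branch in self.branches]
        return self.project(torch.cat(features, dim=1))

aspp = MiniASPP(in_channels=32, out_channels=16)
aspp_out = aspp(torch.randn(1, 32, 40, 40))
print(tuple(aspp_out.shape))
```

Each branch sees the same feature map at a different effective scale, and the projection layer learns how to weight those scales for every output channel.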

The model was trained for 10 epochs using Binary Cross-Entropy with Logits Loss (BCEWithLogitsLoss), and optimization was performed using the Adam optimizer with a learning rate of 0.001.

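import torch.nn as nn
import torch.optim as optim
from torchvision.models.segmentation import deeplabv3_resnet50

class DeepLabV3Binary(nn.Module):

  def __init__(self, num_classes=1):
    super(DeepLabV3Binary, self).__init__()
    # Load pretrained DeepLabV3 (newer torchvision releases prefer the
    # `weights=` argument over `pretrained=True`)
    self.model = deeplabv3_resnet50(pretrained=True)
    # Replace the final 1x1 convolution of the classifier head so that it
    # outputs `num_classes` channels (one logit map for binary segmentation)
    self.model.classifier[4] = nn.Conv2d(256, num_classes, kernel_size=(1, 1))

  def forward(self, x):
    # The torchvision model returns a dict; 'out' holds the main logits
    return self.model(x)['out']

EPOCHS = 10
modelDeepLab = DeepLabV3Binary(num_classes=1).to(device)  # same torch.device as used for the UNet
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(modelDeepLab.parameters(), lr=0.001)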

Results

The results of this study showed that the performance of the models varied significantly across validation accuracy, Intersection over Union (IoU), and total training time, highlighting the trade-offs between segmentation quality and computational efficiency.

Traditional U-Net

The U-Net architecture achieved the highest validation accuracy at 0.9348, indicating strong overall performance in pixel classification. Its validation IoU was 0.6097, the highest of the three models, reflecting the best overlap between predicted and ground truth segmentation masks. The model’s training time was 86.6 seconds per epoch, the shortest of the three architectures.

U-Net++

The U-Net++ architecture, with its dense skip connections, required a longer training time than U-Net at 117.95 seconds per epoch, due to its more complex structure. However, its performance metrics declined compared to U-Net. It achieved a validation accuracy of 0.9159 and a validation IoU of 0.4739. The drop in IoU, which is much larger than the drop in accuracy, suggests that while U-Net++ classified most pixels correctly, it struggled to accurately segment finer structures and boundaries.

DeepLabv3+

DeepLabv3+ was trained for only 10 epochs, as its larger size made it computationally more intensive. It achieved a validation accuracy of 0.887, the lowest among the models, and its validation IoU of 0.3139 further indicated challenges in producing precise segmentation masks. Its training time was also the longest at 138.54 seconds per epoch, making it the least efficient per epoch in terms of computation.

Summary of Results

The traditional U-Net demonstrated the best balance of segmentation quality and training efficiency, achieving the highest validation accuracy and IoU with the shortest training time per epoch. U-Net++, while offering architectural enhancements, required more time to train but did not surpass U-Net in accuracy or IoU. DeepLabv3+ faced challenges in producing high-quality segmentation results within the limited number of training epochs, suggesting that it needs longer training.
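Because the models were trained for different numbers of epochs, per-epoch time and total time rank them differently. The short calculation below multiplies the reported seconds per epoch by the reported epoch counts (25, 25, and 10); it assumes the per-epoch time stayed constant over training, which the report does not state.

```python
# (seconds per epoch, epochs trained), taken from the figures reported in the text.
runs = {
    "U-Net": (86.6, 25),
    "U-Net++": (117.95, 25),
    "DeepLabv3+": (138.54, 10),
}

# Estimated total training time, assuming a constant time per epoch.
totals = {name: round(sec * epochs, 2) for name, (sec, epochs) in runs.items()}

for name, total in totals.items():
    print(f"{name}: about {total / 60:.0f} minutes in total")

slowest_per_epoch = max(runs, key=lambda name: runs[name][0])
shortest_total = min(totals, key=totals.get)
print(slowest_per_epoch, shortest_total)
```

Under that assumption the totals come to about 2165 s, 2949 s, and 1385 s, so DeepLabv3+ is the slowest per epoch yet had the shortest overall run because it was trained for fewer epochs.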

These results highlight the importance of selecting the right model architecture based on task-specific requirements, such as precision, resource availability, and time constraints.

Conclusion and Discussion

This study showed that the traditional UNet performed best on the semantic segmentation task. The study was limited in both computational resources and training data. Because the training images were large, one limitation on the models’ performance was that images were compressed from their original size of 5000 x 5000 to 512 x 512. Future work could benefit from creating distinct tiles from the larger training images. Not only would this preserve clarity in the training images, it would also provide significantly more training images. All models in this research appear to be overfit on the training data, which could be due to too small a learning rate, not enough epochs, or not enough training data.
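The tiling idea suggested for future work could look like the sketch below, which cuts a tile into non-overlapping 512 x 512 crops at native resolution. The helper name `tile_image` and the decision to drop the ragged border (rather than pad or overlap) are assumptions made for illustration.

```python
import numpy as np

def tile_image(image, tile=512):
    """Split an (H, W, C) array into non-overlapping tile x tile crops.

    Rows and columns that do not fill a complete tile are dropped.
    """
    height, width = image.shape[:2]
    rows, cols = height // tile, width // tile
    crops = [
        image[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
        for r in range(rows)
        for c in range(cols)
    ]
    return np.stack(crops)

# Blank stand-in for one 5000 x 5000 RGB tile from the dataset.
image = np.zeros((5000, 5000, 3), dtype=np.uint8)
tiles = tile_image(image)
crops_per_tile = tiles.shape[0]
print(tiles.shape, crops_per_tile * 180)
```

Each 5000 x 5000 tile yields 81 full-resolution crops, so the 180 tiles in the dataset would provide 14,580 training images while keeping the original 30 cm pixel size. The same function would be applied to the masks so that image and label crops stay aligned.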

There is reason to believe the models were either overfit or not trained enough, leading to lackluster predictions. When testing different model configurations, we found a UNet configuration that, after only 2 epochs, was creating fairly decent prediction masks. Upon further testing with more epochs, we found that the models were starting to reach better accuracies by creating blank masks, potentially due to overfitting or to blurry images caused by compression. We think this may be due to a poorly chosen learning rate or optimizer, which made training prone to getting stuck in a local minimum and unable to “climb” out.

One reason for such high accuracy in all models is this overfit. When the models were performing predictions, they predicted the background for each pixel, which can lead to high accuracy on images that are mostly background pixels. Accuracy alone is therefore misleading here, and IoU is the more informative metric. With future adjustments to the models, this overfitting can be corrected, giving us more accurate results.
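The effect can be reproduced with a small synthetic example. The 15% building fraction below is a hypothetical value chosen for illustration, not a statistic measured on the Inria dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ground truth: roughly 15% of pixels are "building" (assumed fraction).
target = rng.random((512, 512)) < 0.15

# A degenerate model that predicts background for every pixel.
pred = np.zeros_like(target)

accuracy = float((pred == target).mean())
intersection = int(np.logical_and(pred, target).sum())
union = int(np.logical_or(pred, target).sum())
iou = intersection / union

print(f"accuracy={accuracy:.3f}, IoU={iou:.3f}")
```

The blank mask scores an accuracy close to the background fraction while its building IoU is zero, which is why a high accuracy paired with a low IoU points toward masks that under-predict the foreground class.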

References

[1] Du, S., Du, S., Liu, B., & Zhang, X. (2020). Incorporating DeepLabv3+ and object-based image analysis for semantic segmentation of very high resolution remote sensing images. International Journal of Digital Earth, 14(3), 357–378. https://doi.org/10.1080/17538947.2020.1831087

[2] Esri. (n.d.). How DeepLabV3 works. ArcGIS API for Python. https://developers.arcgis.com/python/latest/guide/how-deeplabv3-works/

[3] Guo, Y., Liu, Y., Georgiou, T., et al. (2018). A review of semantic segmentation using deep neural networks. International Journal of Multimedia Information Retrieval, 7(1), 87–93. https://doi.org/10.1007/s13735-017-0141-z

[4] Peng, H., et al. (2020). Semantic segmentation of litchi branches using DeepLabV3+ model. IEEE Access, 8, 164546–164555. https://doi.org/10.1109/ACCESS.2020.3021739

[5] Maggiori, E., Tarabalka, Y., Charpiat, G., & Alliez, P. (2017). Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark. IEEE International Geoscience and Remote Sensing Symposium (IGARSS).

[6] Sun, Y., Bi, F., Gao, Y., Chen, L., & Feng, S. (2022). A multi-attention UNet for semantic segmentation in remote sensing images. Symmetry, 14(5), 906. https://doi.org/10.3390/sym14050906

[7] Zhou, Z., Siddiquee, M. M. R., Tajbakhsh, N., & Liang, J. (2018). UNet++: A nested U-Net architecture for medical image segmentation.