Introduction

Have you ever wondered what Earth looks like from space? In the NASA picture below, the city lights resemble the stars themselves. Would you like to know how much of Europe is affected by light pollution? This paper shows how to choose a suitable number of cluster centers when segmenting an image with the K-Means clustering method, and then how to calculate the percentage of pixels in the cluster that represents light pollution. It is remarkable how simple this calculation is when all you have is a good-quality image.

In this project, I use K-Means clustering to analyze a satellite image of land use patterns. The dataset consists of a satellite image of a region of Europe in which different areas have different land uses; more precisely, there are areas with and without light pollution. The goal is to segment the image into clusters based on the similarity of pixel values.

Data

The dataset consists of a satellite image in raster format. The image is 894 x 602 pixels, and each pixel has three color channels: red, green, and blue. The image was downloaded from this link, which comes from NASA's website on night lights maps.
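To make the data layout concrete, here is a minimal sketch of how an RGB image becomes the table of pixels that a clustering algorithm works on. The tiny `img` array and its values are made up for illustration; only the 894 x 602 size comes from the text above.

```r
# Hypothetical stand-in for the real image: a tiny 2 x 3 pixel RGB array
# (rows x columns x channels) filled with made-up values in [0, 1].
img <- array(seq(0, 1, length.out = 2 * 3 * 3), dim = c(2, 3, 3))

# One row per pixel, one column per color channel.
pixels <- matrix(as.vector(img), ncol = 3,
                 dimnames = list(NULL, c("red", "green", "blue")))
dim(pixels)

# Number of pixels in an 894 x 602 image.
n_full <- 894 * 602
n_full  # 538188
```

Each of those 538,188 rows is one observation with three features, which is the form K-Means expects.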

If you would like to carry out a similar analysis on a larger scale, I recommend repeating it on the night lights map of the whole Earth, from the link. I would be delighted to see links to such analyses in the comments.

K-means Clustering methodology

I used the K-Means clustering algorithm to segment the satellite image into different clusters. The algorithm assigns each pixel in the image to one of the K clusters based on the similarity of its color values to the cluster centers. I experimented with different values of K and evaluated the results using visual inspection.

According to Dhanachandra et al. (2015), image segmentation is a process of dividing an image into multiple segments or regions with similar characteristics such as color, texture, or intensity. K-means clustering is a popular unsupervised machine learning algorithm that can be used for image segmentation.

In k-means clustering (Ray & Turi, 2000), the algorithm partitions the input data into k clusters, where k is a user-defined parameter. Each cluster is represented by its centroid, which is the mean of all the data points assigned to the cluster. In the context of image segmentation, the input data points are the pixels of the image, and the algorithm partitions the pixels into k clusters based on their color or intensity values.

The k-means clustering algorithm starts by randomly initializing the centroids, then iteratively assigns each pixel to the nearest centroid based on the Euclidean distance between the pixel’s color values and the centroid’s color values. After all pixels have been assigned, each centroid is updated to be the mean of all the pixels assigned to it. This process is repeated until convergence is reached, meaning that the centroids no longer change.
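The assign-and-update loop described above can be sketched in a few lines of base R. This is an illustrative version of the textbook procedure (Lloyd's algorithm), not the code used in the analysis: R's built-in kmeans() defaults to the Hartigan-Wong variant. The helper name `simple_kmeans` and the toy pixels are assumptions made for this example.

```r
# `simple_kmeans` is a hypothetical helper written for illustration only.
simple_kmeans <- function(pixels, k, max_iter = 100) {
  # Random initialization: pick k distinct pixels as starting centroids.
  centroids <- pixels[sample(nrow(pixels), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from every pixel
    # to every centroid, then the index of the nearest centroid.
    d <- sapply(seq_len(k), function(j) {
      rowSums(sweep(pixels, 2, centroids[j, ])^2)
    })
    cluster <- max.col(-d, ties.method = "first")
    # Update step: each centroid becomes the mean of its assigned pixels.
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- pixels[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
    }
    # Convergence: stop once the centroids no longer change.
    if (all(abs(new_centroids - centroids) < 1e-12)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}

# Toy data: three dark and three bright made-up RGB pixels.
set.seed(1)
toy <- rbind(matrix(runif(9, 0.0, 0.1), ncol = 3),
             matrix(runif(9, 0.9, 1.0), ncol = 3))
fit <- simple_kmeans(toy, k = 2)
fit$cluster
```

On this toy data the dark pixels end up in one cluster and the bright pixels in the other; which of the two gets label 1 depends on the random start.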

The final output of the k-means clustering algorithm is a set of k clusters, where each pixel in the image is assigned to one of the k clusters based on the nearest centroid. This segmentation can be used for various applications, such as object detection, image compression, and image editing.

In my project, I used k-means clustering for image segmentation to extract the regions of interest, the light pollution, from the image. I applied k-means clustering to the pixel values, first partitioning the pixels into two clusters, background and foreground, and then increasing the number of clusters. This segmentation allowed me to isolate the lit areas from the rest of the image and calculate their share of the image.

Analysis

Loading packages, data input and clustering with 2 centers

The analysis needs the following packages. After loading them into the R session, the image is read into R with the stack() function from the raster package.

We can then plot the image using the plotRGB function. The light pollution areas are clearly visible, concentrated around Europe’s largest cities, but also around smaller metropolitan areas.

The set.seed function makes the results reproducible: K-Means starts from randomly chosen centers, so without a fixed seed the cluster numbering could differ between runs. With the seed set, the plots look exactly the same each time the code is run. This is useful, for instance, for assigning a fixed set of colors to particular clusters to get a better visualization of the results.
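A quick way to see the effect of the seed is to run the same clustering twice. The sketch below uses made-up random data in place of the image pixels; the seed value 1234 is the one used in this analysis.

```r
# Made-up data standing in for the pixel matrix: 200 rows, 3 color channels.
set.seed(42)
pixels <- matrix(runif(600), ncol = 3)

# Same seed before each call, so the same random starting centers
# and therefore the same cluster labels.
set.seed(1234)
first <- kmeans(pixels, centers = 3)$cluster
set.seed(1234)
second <- kmeans(pixels, centers = 3)$cluster

identical(first, second)  # TRUE
```

Without the second set.seed call, the same groups may still be found, but their numbers (and therefore their plot colors) can be swapped.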

library(jpeg)
library(ggplot2)
library(raster)

## Loading required package: sp

image <- stack("/Users/annaczarnocka/Desktop/UL1/Europe_2016.jpeg")

## Warning: [rast] unknown extent

plotRGB(image)

## Warning: [rast] unknown extent

## Warning: [rast] unknown extent

## Warning: [rast] unknown extent

# Set seed value
set.seed(1234)

# Run K-Means clustering
kMeansResult <- kmeans(image[], centers=2)

## Warning: [rast] unknown extent

## Warning: [rast] unknown extent

## Warning: [rast] unknown extent

# Convert image data to raster object
result <- raster(image[[1]])

# Set cluster values
result <- setValues(result, kMeansResult$cluster)

The raster() function creates a raster object using the first layer of the image data. A raster object is a grid of cells that can be used to represent the image data.

The setValues() function assigns the cluster values from the k-means analysis to the cells of the raster object.
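As far as I understand the raster package, cells are numbered row by row starting from the top-left corner, so the vector of cluster labels is laid onto the grid in that order. A base-R analogue of this step, using a made-up 2 x 3 grid of labels, could look like this:

```r
# Hypothetical cluster labels for a 2-row x 3-column image, listed in
# cell order (left to right, top row first).
labels <- c(1, 1, 2,
            2, 2, 1)

# byrow = TRUE reproduces the row-by-row cell order in a plain matrix.
grid <- matrix(labels, nrow = 2, ncol = 3, byrow = TRUE)
grid
```

The raster object adds the spatial metadata (extent, resolution) on top of such a grid, which is what makes plot() draw it as an image.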

Plotting results for 2 cluster centers

Finally, let’s create a plot of the resulting raster object, showing the image data with the cluster assignments represented as different colors.

plot(result)

For a better comparison with the source image, we can add suitable colors to the plot above.

plot(result, col=c("lightyellow", "black"))

The result is similar to the initial image, but still not fully satisfactory. Two clusters may not provide enough granularity in the segmentation, so we should repeat the analysis with higher values of k.

Calculating the share of pixels in cluster 1

length(kMeansResult$cluster[kMeansResult$cluster==1])/length(kMeansResult$cluster)*100

## [1] 2.616077

This means that, according to the K-Means clustering model with k=2, 2.62% of NASA’s image is covered by light pollution.
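The same share can be computed a little more compactly, because the mean of a logical vector is the fraction of TRUE values. The helper `cluster_share` and the ten labels below are hypothetical and serve only to illustrate the calculation.

```r
# `cluster_share` is a hypothetical helper, not part of any package.
# It returns the percentage of pixels whose label is in `ids`.
cluster_share <- function(cluster, ids) {
  mean(cluster %in% ids) * 100
}

# Made-up labels for ten pixels: two of them fall in cluster 1.
labels <- c(2, 2, 1, 2, 2, 2, 1, 2, 2, 2)
cluster_share(labels, 1)  # 20
```

With the real result, the call would be cluster_share(kMeansResult$cluster, 1).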

Clustering with 3 centers

Following the same procedure, we now run K-Means clustering with k=3, convert the image to a raster object, and set the cluster values.

# Set seed value
set.seed(1234)

# Run K-Means clustering
kMeansResult <- kmeans(image[], centers=3)

## Warning: [rast] unknown extent

## Warning: [rast] unknown extent

## Warning: [rast] unknown extent

# Convert image data to raster object
result <- raster(image[[1]])

# Set cluster values
result <- setValues(result, kMeansResult$cluster)

Plotting results for 3 cluster centers

This plot closely resembles the initial image; therefore, k=3 could be an optimal number of clusters.

plot(result)

For a better comparison, suitable colors should be applied to the plot.

plot(result, col=c("lightyellow", "darkblue", "black"))

Calculating the percentage of pixels in cluster 1

We need to calculate the percentage share of pixels in cluster 1 because, as the plot above shows, this is the cluster that represents the area of interest, that is, the lights.
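Reading the cluster of interest off the plot works, but cluster numbers are arbitrary and can change with the seed. One possible alternative, sketched here on made-up pixels rather than the real image, is to pick the cluster whose center is brightest. The `nstart` argument is added only to keep this toy example stable; it is not used in the analysis itself.

```r
# Made-up pixels: 95 dark and 5 bright RGB values in [0, 1].
set.seed(1234)
pixels <- rbind(matrix(runif(95 * 3, 0.0, 0.2), ncol = 3),
                matrix(runif(5 * 3, 0.8, 1.0), ncol = 3))

# Several random starts, so the toy result does not depend on one start.
fit <- kmeans(pixels, centers = 2, nstart = 10)

# Pick the cluster whose center has the highest mean channel value,
# i.e. the brightest one.
bright <- which.max(rowMeans(fit$centers))

# Share of pixels in the brightest cluster, in percent.
mean(fit$cluster == bright) * 100
```

Here the brightest cluster contains exactly the five bright pixels, whichever number K-Means happened to give it.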

percentCluster1 <- length(kMeansResult$cluster[kMeansResult$cluster==1])/length(kMeansResult$cluster)*100
print(paste0("Percentage of pixels in cluster 1: ", round(percentCluster1, 2), "%"))

## [1] "Percentage of pixels in cluster 1: 2.14%"

Clustering with 4 centers

The results for k=3 are satisfactory, but it may also be useful to check the results for k=4, to compare the difference.

# Set seed value
set.seed(1234)  

# Run K-Means clustering
kMeansResult <- kmeans(image[], centers=4)

## Warning: [rast] unknown extent

## Warning: [rast] unknown extent

## Warning: [rast] unknown extent

# Convert image data to raster object
result <- raster(image[[1]])

# Set cluster values
result <- setValues(result, kMeansResult$cluster)

Plotting results for 4 cluster centers

plot(result)

plot(result, col=c("lightyellow", "darkblue", "black", "yellow"))

This is an interesting outcome, as two clusters, 1 and 4, were assigned to the areas of light. Cluster 4 shows the areas of the strongest light pollution, and cluster 1 the areas of moderate light pollution.

Calculating the percentage of pixels in cluster 4

percentCluster4 <- length(kMeansResult$cluster[kMeansResult$cluster==4])/length(kMeansResult$cluster)*100
print(paste0("Percentage of pixels in cluster 4: ", round(percentCluster4, 2), "%"))

## [1] "Percentage of pixels in cluster 4: 1.34%"

Calculating the percentage of pixels in cluster 1

percentCluster1 <- length(kMeansResult$cluster[kMeansResult$cluster==1])/length(kMeansResult$cluster)*100
print(paste0("Percentage of pixels in cluster 1: ", round(percentCluster1, 2), "%"))

## [1] "Percentage of pixels in cluster 1: 1.94%"

Calculating the percentage of pixels in clusters 1 and 4

percentClusters1and4 <- sum(kMeansResult$cluster %in% c(1, 4)) / length(kMeansResult$cluster) * 100
print(paste0("Percentage of pixels in clusters 1 and 4: ", round(percentClusters1and4, 2), "%"))

## [1] "Percentage of pixels in clusters 1 and 4: 3.28%"

The outcome is that cluster 4 (strongest pollution) is 1.34% of the total area, cluster 1 (moderate pollution) is 1.94%, and together 3.28%.
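Instead of computing each share separately, all of them can be obtained in one step with table() and prop.table(). The labels below are made up for illustration; treating clusters 1 and 4 as the light clusters mirrors the k=4 result above.

```r
# Made-up labels for 100 pixels from a hypothetical k = 4 run.
labels <- rep(c(1, 2, 3, 4), times = c(2, 60, 37, 1))

# Percentage of pixels in every cluster, in one call.
shares <- prop.table(table(labels)) * 100
shares

# Combined share of the two light clusters (assumed to be 1 and 4).
sum(shares[c("1", "4")])  # 3
```

Applied to kMeansResult$cluster, the same two lines would reproduce the three percentages reported above.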

In my view, k=4 is the best number of clusters for the image segmentation in this paper; however, k=3 is only slightly worse, so either would be a reasonable choice.

Finally, I also performed clustering with 5 centers, to see the difference in the result.

Clustering with 5 centers

# set the seed value
set.seed(1234)

# Run K-Means clustering
kMeansResult <- kmeans(image[], centers=5)

## Warning: [rast] unknown extent

## Warning: [rast] unknown extent

## Warning: [rast] unknown extent

# Convert image data to raster object
result <- raster(image[[1]])

# Set cluster values
result <- setValues(result, kMeansResult$cluster)

Plotting results for 5 cluster centers

plot(result)

plot(result, col=c("darkgrey",  "darkblue", "black", "white", "yellow"))

It is clear that using more than 4 clusters results in over-segmentation and noise, and it is no longer useful for my research question.

Calculating the percentage of pixels in cluster 5

percentCluster5 <- length(kMeansResult$cluster[kMeansResult$cluster==5])/length(kMeansResult$cluster)*100
print(paste0("Percentage of pixels in cluster 5: ", round(percentCluster5, 2), "%"))

## [1] "Percentage of pixels in cluster 5: 1.26%"

Calculating the percentage of pixels in clusters 4 and 5

percentClusters4and5 <- sum(kMeansResult$cluster %in% c(4, 5)) / length(kMeansResult$cluster) * 100
print(paste0("Percentage of pixels in clusters 4 and 5: ", round(percentClusters4and5, 2), "%"))

## [1] "Percentage of pixels in clusters 4 and 5: 2.97%"

The brightest area (cluster 5) is 1.26% of the image, and the whole area of light pollution (clusters 4 and 5) is 2.97% here.

Conclusions

K-means clustering can be used for image segmentation to divide an image into distinct clusters or regions.
The optimal number of cluster centers for image segmentation may vary depending on the image and the application, and can be determined through various methods such as elbow plot analysis or visual inspection; this paper relied on visual inspection.
For NASA’s night map of Europe, the optimal number of cluster centers was found to be four, which successfully segmented the image into distinct regions representing the lights (high and moderate intensity), sea, and land areas.
The outcome is that cluster 4 (strongest light pollution) is 1.34% of the total area, cluster 1 (moderate light pollution) is 1.94%, and together they are 3.28%. This may be a small share; however, as visible on the image and plots, the impact of this pollution on the natural environment is clearly large. It affects the humans, animals, and plants of Europe.
The resulting segmented image can be used for various purposes, such as identifying and analyzing the properties and behavior of the different regions in the image.
The effectiveness of image segmentation using K-means clustering can be affected by factors such as image resolution, lighting conditions, and noise levels, and it may require pre-processing steps to improve the quality of the image segmentation.
The segmentation results can also be improved by using more advanced clustering algorithms or incorporating additional information such as spatial information or color intensity values.
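The elbow plot mentioned among the conclusions was not produced in this paper. A minimal sketch of how it could be drawn is shown below, on made-up data with three well-separated groups standing in for the pixel matrix; the `nstart` value is an assumption chosen to keep the curve stable.

```r
# Made-up data standing in for the pixel matrix: three separated groups.
set.seed(1234)
pixels <- rbind(matrix(rnorm(150, mean = 0.1, sd = 0.02), ncol = 3),
                matrix(rnorm(150, mean = 0.5, sd = 0.02), ncol = 3),
                matrix(rnorm(150, mean = 0.9, sd = 0.02), ncol = 3))

# Total within-cluster sum of squares for k = 1..6.
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(pixels, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For this toy data the curve drops sharply up to k=3 and is nearly flat afterwards; on the real image the same loop would run over image[] instead of the made-up matrix.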

References

Dhanachandra, N., Manglem, K., & Chanu, Y. J. (2015). Image Segmentation Using K-means Clustering Algorithm and Subtractive Clustering Algorithm. Procedia Computer Science, 54, 764–771. https://doi.org/10.1016/j.procs.2015.06.090

Ray, S., & Turi, R. H. (2000). Determination of Number of Clusters in K-Means Clustering and Application in Colour Image Segmentation. International Conference on Advances in Pattern Recognition, 137–143. https://www.csse.monash.edu.au/~roset/papers/cal99.pdf

RPubs - Using K-Means to Cluster Wildfires on Australia. (n.d.). https://rpubs.com/ozgur/usl-01-kmeans

K-means clustering on the NASA image of the Europe’s light pollution

Anna Czarnocka

2023-02-23
