Introduction

In this article I will use cluster methods and focus on image clustering. For this purpose I chose the night map of United States. The aim of the study is to calculate percentage night light area of USA.

Data presenatation and modification

Firstly, I am loading useful libraries and my picture.

library('cluster')
library('jpeg')
library('ggplot2')

image <- readJPEG("light2.jpg")

Now I can print my original image.

plot(1, type="n", main='USA night map')
rasterImage(image, 0.6, 0.6, 1.4, 1.4)

Let’s check the dimension first.

dm1 <- dim(image)
dm1[1:2]
## [1] 630 985

Size of my image is equal to 630x985.

In the next step I will change my image type from jpg to rgb. RGB is color model based on assumption that every colour is a mix of three basic colours: red, green and blue with diffrent intensity.

rgbImage1<-data.frame(x=rep(1:dm1[2], 
                      each=dm1[1]),  
                      y=rep(dm1[1]:1, dm1[2]), 
                      r.value=as.vector(image[,,1]),  
                      g.value=as.vector(image[,,2]), 
                      b.value=as.vector(image[,,3]))

We get data frame that have information about every pixel of our picture.

Now I can display the same picture but in RGB.

plot(y~x, data=rgbImage1, main="USA night map", 
     col=rgb(rgbImage1[c("r.value", "g.value", "b.value")]), asp=1, pch=".")

Clustering

My data is quite large so I will use CLARA to reduce computing time. CLARA is extension to k-medoids method but it is created for large dataset.

n1 <- c()
for (i in 1:10) {
  cl <- clara(rgbImage1[, c("r.value", "g.value", "b.value")], i)
  n1[i] <- cl$silinfo$avg.width
}

plot(n1, type = 'l',
     main = "Optimal number of clusters for my Image",
     xlab = "Number of clusters",
     ylab = "Average silhouette",
     col = "blue")

Recomended number of clusters for my image is 2. Let’s print it now.

image = rgbImage1[, c("r.value", "g.value", "b.value")]
clara <- clara(image, 2)
plot(silhouette(clara))

colours<-rgb(clara$medoids[clara$clustering, ])

plot(rgbImage1$y~rgbImage1$x, col=colours, pch=".", cex=2, asp=1, main="2 colours")

As we can see on the picture, 2 clusters can tell us only about land and ocean area. We can’t see the “light places”. That’s why I will use (not recomended) 3 clusters.

image = rgbImage1[, c("r.value", "g.value", "b.value")]
clara <- clara(image, 3)
plot(silhouette(clara))

colours<-rgb(clara$medoids[clara$clustering, ])

plot(rgbImage1$y~rgbImage1$x, col=colours, pch=".", cex=2, asp=1, main="3 colours")

3 clusters help us to reach our goal. Now we can see 3 colours which represent: light places of USA, the USA land and oceans.

dominantColours <- as.data.frame(table(colours))

max_col  <- max(dominantColours$Freq)/sum(dominantColours$Freq)
min_col  <- min(dominantColours$Freq)/sum(dominantColours$Freq)
medium_col <- 1-max_col - min_col

dominantColours$distribution <- round((c(max_col, medium_col, min_col) * 100), 2)
dominantColours
##   colours   Freq distribution
## 1 #05050F 409206        65.94
## 2 #131222 201308        32.44
## 3 #B9A89E  10036         1.62
dominantColours$colours <- as.character(dominantColours$colours)
pie(dominantColours$Freq, labels = dominantColours$distribution,
    col = dominantColours$colours,
    xlab = "Colours",
    ylab = "Frequency")

plot(image)

USAarea <- sum(dominantColours$Freq[dominantColours$colours=='#131222' | dominantColours$colours=='#B9A89E'  ])
USAlightarea <- dominantColours$Freq[dominantColours$colours=='#B9A89E']

USAlightarea/USAarea
## [1] 0.04748656

Conclusion

In this article, I used clustering method on image to calculate percentage of light area of USA during night. I used 3 clusters to isolate 3 areas: light, land and ocean. I calculate that 4,7% of USA is light area.