Unsupervised Learning

Preparing the data

The jpeg library imports JPG files into R. dplyr and ggplot2 are well-known libraries for data manipulation and plotting. Finally, the cluster and ClusterR packages give us tools to cluster the data. One of the problems with images is the size of the dataset. Each pixel is described by three values, R, G and B, indicating red, green and blue intensity respectively. For example, an image of size 1280x1080 is described after transformation by 1280x1080x3 = 4,147,200 elements. The clustering tools therefore have to be efficient enough to handle data of this size in a reasonable time.
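
As a rough check of the size argument, the arithmetic can be reproduced in a few lines. This is only a sketch: the dimensions are the ones from the example above, and the memory estimate assumes that every value is stored as an 8-byte double.

```r
# Image dimensions from the example in the text
width    <- 1280
height   <- 1080
channels <- 3            # R, G, B

n_pixels   <- width * height        # rows of the pixel data frame
n_elements <- n_pixels * channels   # individual numeric values

# Approximate memory footprint, assuming 8 bytes per value
size_mb <- n_elements * 8 / 1024^2

print(n_elements)
print(round(size_mb, 1))
```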

library(jpeg)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(cluster)
library(ClusterR)

## Loading required package: gtools

The first step is preparing the data. We will use a screenshot from Google Maps in JPG format. This particular screenshot shows the northern part of Australia.

#Only to display image
# Import Image
australia <- readJPEG("australia.JPG",native=TRUE)

#Display Image
plot(0:1,0:1,type="n",ann=FALSE,axes=FALSE)
rasterImage(australia,0,0,1,1)

The code below reads the image and transforms it into a convenient data frame that is easier to analyze. Notice that I import the image once again, but this time without the native=TRUE argument. This time readJPEG returns a numeric array (height x width x channels), which is easier to transform than the native raster representation.

# importing picture
img<-readJPEG('australia.JPG')

imgRGB <- data.frame(
  R = as.vector(img[,,1]),
  G = as.vector(img[,,2]),
  B = as.vector(img[,,3])
)
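
To see what this reshaping does, here is a minimal sketch on a tiny synthetic array instead of the real image. The array `toy` and its values are made up for illustration. Each channel is flattened in column-major order, so every row of the data frame is one pixel, and the original array can be rebuilt from the three columns.

```r
# Synthetic 2 x 3 "image" with 3 channels; the values are arbitrary
toy <- array(seq(0, 1, length.out = 18), dim = c(2, 3, 3))

toyRGB <- data.frame(
  R = as.vector(toy[, , 1]),
  G = as.vector(toy[, , 2]),
  B = as.vector(toy[, , 3])
)

# One row per pixel
print(nrow(toyRGB) == prod(dim(toy)[1:2]))

# Putting the columns back together recovers the original array
rebuilt <- array(c(toyRGB$R, toyRGB$G, toyRGB$B), dim = dim(toy))
print(identical(rebuilt, toy))
```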

Each pixel's color is represented by three values, R, G and B. readJPEG scales them to the range from 0 to 1 (the usual 0 to 255 range divided by 255). The idea is to treat these values as points in three dimensions and apply a clustering algorithm. The coordinates of the cluster centers will be the general colors out of which the image will be redrawn.
The first important problem is to decide the number of clusters. In a typical situation we refer to a statistic such as the silhouette width and choose the number of clusters with the highest average silhouette. In this case the silhouette value is only a hint, because our goal is to obtain an image where the ocean is displayed in a different color than the land. The ideal outcome is the whole ocean in a single color, but if that is impossible the number of clusters may be higher.

for (i in 2:10) {
  cl <- clara(imgRGB, i)
  print(paste('Average silhouette for',i,'clusters:',cl$silinfo$avg.width)) 
}

## [1] "Average silhouette for 2 clusters: 0.529950971009155"
## [1] "Average silhouette for 3 clusters: 0.631769833048838"
## [1] "Average silhouette for 4 clusters: 0.514712959545021"
## [1] "Average silhouette for 5 clusters: 0.628577399110254"
## [1] "Average silhouette for 6 clusters: 0.573714214742675"
## [1] "Average silhouette for 7 clusters: 0.481083036566289"
## [1] "Average silhouette for 8 clusters: 0.450775235735685"
## [1] "Average silhouette for 9 clusters: 0.435124515551823"
## [1] "Average silhouette for 10 clusters: 0.43930042008243"
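
If we wanted to pick the number of clusters automatically, the silhouette values could be collected in a vector and the maximum selected. The sketch below does this on synthetic data, not on the screenshot: `toy` is a made-up stand-in for imgRGB with three well-separated color groups.

```r
library(cluster)

set.seed(1)
# Synthetic stand-in for imgRGB: three tight, well-separated colour groups
toy <- rbind(
  matrix(rnorm(300, mean = 0.1, sd = 0.02), ncol = 3),
  matrix(rnorm(300, mean = 0.5, sd = 0.02), ncol = 3),
  matrix(rnorm(300, mean = 0.9, sd = 0.02), ncol = 3)
)
colnames(toy) <- c("R", "G", "B")

# Average silhouette width for each candidate number of clusters
ks  <- 2:6
sil <- sapply(ks, function(k) clara(toy, k)$silinfo$avg.width)
names(sil) <- ks

# Number of clusters with the highest average silhouette
best_k <- ks[which.max(sil)]
print(sil)
print(best_k)
```

As noted above, for the map the silhouette is only a hint, so a selection like this would be a starting point, not the final choice.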

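We use the clustering algorithm clara because it is suitable for big datasets. Although three clusters give the highest average silhouette above, we fit two clusters, since the goal is to separate the ocean from the land.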

cl_dt <- clara(imgRGB, 2)

The graph below is the silhouette plot, which shows how well the data is clustered.

plot(silhouette(cl_dt))

The next step is to check what image we obtained after applying clara. We want to replace the R, G, B values of each row with the medoid of its cluster. After this transformation the data must be converted back to the format used by the jpeg package and displayed.

# which cluster for each pixel(row)
dt_clust <- cbind(imgRGB, cl_dt$clustering)

k<-2 #number of clusters

# function to swap a cluster number for its medoid

medoits <-function(x){
  for (i in 1:k){
    if(x==i) return(cl_dt$medoids[i,])
  }
}

# apply the above function
a<-t(apply(as.data.frame(dt_clust$`cl_dt$clustering`),1,medoits))

#change dimension from single columns to matrix
r<-a[,1]
g<-a[,2]
b<-a[,3]
dim(r) <- dim(img)[1:2]
dim(g) <- dim(img)[1:2]
dim(b) <- dim(img)[1:2]

rgb2 <- list(r,g,b)

# transformation of data to jpeg package format
rgb2 <- sapply(rgb2, function(j) {compressed.img <- j}, simplify = 'array')
writeJPEG(rgb2, "aaaaaa.jpg")

#displaying the clustered image
plot(0:1,0:1,type="n",ann=FALSE,axes=FALSE)
rasterImage(rgb2,0,0,1,1)
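
The row-by-row lookup above can be slow on a full image. A possible alternative, shown here only as a sketch on made-up data, is to index the medoid matrix directly by the vector of cluster labels. The names `medoids`, `clustering` and `img_dim` are hypothetical stand-ins for cl_dt$medoids, cl_dt$clustering and dim(img)[1:2].

```r
# Hypothetical medoid matrix (one row per cluster) and per-pixel cluster labels
medoids <- rbind(c(0.2, 0.4, 0.8),
                 c(0.9, 0.9, 0.7))
colnames(medoids) <- c("R", "G", "B")
clustering <- c(1, 2, 2, 1, 1, 2)   # six pixels of a 2 x 3 image
img_dim    <- c(2, 3)

# Indexing by the label vector returns one medoid row per pixel
recolored <- medoids[clustering, ]

# Stack the three channel columns into a height x width x 3 array
out <- array(recolored, dim = c(img_dim, 3))
print(dim(out))
print(out[, , 3])   # blue channel of the recolored image
```

An array of this shape with values between 0 and 1 can be passed to rasterImage or writeJPEG in the same way as rgb2.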

The result of clustering is ideal for our purpose: we obtained an image in only two colors, with the whole ocean marked by a single color. Now it is enough to calculate the share of rows assigned to the cluster that corresponds to the ocean. The problem is that for every image we would have to check manually which cluster is assigned to the ocean, so my idea is to calculate the Euclidean distance between the R, G, B values of each cluster medoid and the point that represents pure blue. The cluster covering the ocean should have the color most similar to blue, so we choose the cluster with the lowest distance. This makes the whole procedure more automated.

#R,G,B values were scaled to (0,1) so we transform them to the original 0-255 scale
z<-cl_dt$medoids
z<-z*255


#calculation of euclidean distance to pure blue
distance <- function(x){
  blue <- c(0,0,255)
  sqrt(sum((x-blue)^2))
}

#add distances to data frame
x<- as.data.frame(cbind(z, apply(z, 1, distance)))
# choose the cluster with the lowest distance
wt_cluster<-which(x$V4==min(x$V4))
#assigning each row to cluster
dt_clust <- cbind(imgRGB, cl_dt$clustering)

#calculation of ratio
dt_clust %>% filter(dt_clust$`cl_dt$clustering` == wt_cluster) %>% count() / dt_clust %>% count()

##          n
## 1 0.503587

The area of the image covered by the ocean is about 50.4%.
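
The selection of the ocean cluster and the area ratio can also be sketched in a compact, vectorized form. The medoid colors and the label vector below are invented for illustration and are not taken from the screenshot. The sketch assumes the medoids are already on the 0-255 scale.

```r
# Hypothetical medoids on the 0-255 scale: a bluish cluster and a sandy cluster
z <- rbind(c(170, 218, 255),
           c(242, 239, 233))
blue <- c(0, 0, 255)

# Euclidean distance from every medoid (row) to the reference colour
d <- sqrt(rowSums(sweep(z, 2, blue)^2))

# Index of the cluster whose medoid is closest to blue
ocean_cluster <- which.min(d)

# Share of pixels in that cluster, given a hypothetical label vector
labels <- c(1, 1, 2, 1, 2, 2, 1, 2)
ocean_share <- mean(labels == ocean_cluster)

print(d)
print(ocean_cluster)
print(ocean_share)
```

Taking the mean of a logical vector gives the fraction of TRUE values, so no explicit counting is needed.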
