The main purpose of the project is to present and implement unsupervised learning methods for clustering an RGB image. In my analysis, I will use an air quality map of Europe to show the areas with good, medium, and bad air quality. The methodology concentrates on: transforming the image into a numeric array and then into a table of RGB values; assessing the optimal number of colors (k clusters) based on the k-medoids PAM algorithm; using the Clara algorithm to assign the data to clusters. The analysis is supported by the necessary visualisations. The outcome of the project is the percentage of the area taken up by safe and dangerous regions in Europe (in terms of air quality).
First, the required libraries are loaded: jpeg, rasterImage, dplyr, ggplot2, cluster, gtools, ClusterR, factoextra. jpeg imports jpg files into R. rasterImage makes raster plots. dplyr and ggplot2 are used for data manipulation and plotting. cluster and ClusterR provide methods and tools for cluster analysis. gtools includes various R programming tools. factoextra provides functions to extract and visualize the output of multivariate data analyses.
library(jpeg)
library(rasterImage)
## Loading required package: plotrix
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(cluster)
library(gtools)
library(ClusterR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Now the jpg file can be imported into R as an array. It is a screenshot of an air quality map of Europe (ref. IQAir). The image can be displayed after importing.
europe<-readJPEG("europe.jpg")
plot(0:1,0:1,type="n",ann=FALSE,axes=FALSE, asp=1)
rasterImage(europe,0,0,1,1)
The image shows a map of Europe with country borders and the air quality in all regions. Air quality is measured by the Air Quality Index (AQI) on a scale of 0-300 points. The colors are interpreted as follows: dark green corresponds to regions with AQI 0-50, where air quality is satisfactory and air pollution poses little or no risk. Light green corresponds to AQI 51-100, which means that air quality is acceptable, although there may be a risk for some people. Yellow corresponds to AQI 101-150, which means that the air quality level is unhealthy for sensitive groups. Red, purple and maroon mark very unhealthy regions, which are a minority in Europe, so I will not focus my research on them.
The image has to be represented as a matrix of numbers in order to process it in further analysis. The most convenient format is RGB. It denotes the amount of red, green, and blue in a given pixel of an image, so the image can be stored as a 3-column matrix (the first column holds the amount of red, the second the amount of green and the third the amount of blue). Additionally, the dimensions of the imported image are checked, which shows that it is a three-dimensional array: 704 rows, 773 columns and 3 color channels.
dm1<-dim(europe)
dm1
## [1] 704 773 3
The first step of the analysis is to create a data frame: "x" and "y" are the coordinates of each pixel of the image, while the RGB layers are stored as 3 columns of color values.
rgbImage1 <- data.frame(x = rep(1:dm1[2], each = dm1[1]),
                        y = rep(dm1[1]:1, dm1[2]),
                        r.value = as.vector(europe[,,1]),
                        g.value = as.vector(europe[,,2]),
                        b.value = as.vector(europe[,,3]))
head(rgbImage1)
## x y r.value g.value b.value
## 1 1 704 0.09019608 0.2588235 0.2862745
## 2 1 703 0.09019608 0.2588235 0.2862745
## 3 1 702 0.09019608 0.2588235 0.2862745
## 4 1 701 0.09019608 0.2588235 0.2862745
## 5 1 700 0.09019608 0.2588235 0.2862745
## 6 1 699 0.09019608 0.2588235 0.2862745
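The way the coordinates line up with the color values can be checked on a toy array. The sketch below is only an illustration: the 2 x 3 image `toy` and its values are made up and are not taken from the map. It relies on the fact that as.vector() reads an array column by column.

```r
# Toy 2 x 3 image with 3 channels (hypothetical values, not from the map)
toy <- array(seq(0, 1, length.out = 18), dim = c(2, 3, 3))
d <- dim(toy)

# as.vector() walks each channel column by column, so x repeats every
# column index once per row and y counts the rows from top to bottom.
toyDf <- data.frame(
  x = rep(1:d[2], each = d[1]),
  y = rep(d[1]:1, d[2]),
  r.value = as.vector(toy[, , 1]),
  g.value = as.vector(toy[, , 2]),
  b.value = as.vector(toy[, , 3])
)

# The pixel in image row 1, column 3 should end up at x = 3, y = 2 (top row)
px <- toyDf[toyDf$x == 3 & toyDf$y == d[1], ]
px$r.value == toy[1, 3, 1]
```

Reversing y in this way keeps the first image row at the top of the plot, because plot() draws larger y values higher up.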
Based on this data frame, the air quality map of Europe is drawn as a plot, which will be the point of reference in the next steps of the project.
# plotting pixel by pixel in colors from RGB
plot(y ~ x, data = rgbImage1, main = "Air quality map of Europe",
     col = rgb(rgbImage1[c("r.value", "g.value", "b.value")]),
     asp = 1, pch = ".")
In order to find the safe and dangerous regions on the map, the observations in the data frame have to be allocated to clusters. Grouping pixels with similar colors makes it possible to redraw the map in a small set of colors that is suitable for the analysis.
Clustering consists of two parts. First, the optimal number of clusters (colors) has to be found. For that purpose, the Clara clustering algorithm from the cluster package is used. Clara applies the k-medoids PAM algorithm to subsamples of the data, which makes it practical for large data sets such as the pixels of an image. A vector n1 stores the average silhouette width for each candidate number of clusters, with a maximum of 8 clusters taken into consideration. Second, the silhouette plots of Clara for the candidate numbers of clusters are displayed to examine how the data are distributed across clusters. The next step assigns to each pixel the medoid of its cluster, that is, the RGB values of the most representative pixel of that cluster, and converts these RGB values into colors. Finally, the pixels are plotted in the new colors to produce a clustered image of Europe's air quality map.
# Average silhouette width for k = 2..8 clusters.
# The silhouette is not defined for a single cluster, so n1[1] stays NA.
n1 <- rep(NA_real_, 8)
for (i in 2:8) {
  cl <- clara(rgbImage1[, c("r.value", "g.value", "b.value")], i)
  n1[i] <- cl$silinfo$avg.width
}
plot(n1, type = 'l',
     main = "Optimal number of clusters for air quality map of Europe",
     xlab = "Number of clusters",
     ylab = "Average silhouette",
     col = "navyblue")
The results of the Clara clustering algorithm suggest that the optimal number of clusters (colors) for the "Air quality map of Europe" image is 3 or 7. In the next phase of clustering (silhouette plots) I will try both versions of the clustered image to evaluate which one suits the air quality analysis best.
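How the average silhouette width points to a number of clusters can be illustrated on synthetic data. The following is a minimal sketch and not part of the analysis of the map: the matrix `toy`, its three group centers and the noise level are made up, and the range of k is chosen only for the illustration.

```r
library(cluster)

set.seed(1)
# Hypothetical data: three tight groups of RGB-like values
centers <- rbind(c(0.1, 0.3, 0.3), c(0.5, 0.8, 0.9), c(0.7, 0.7, 0.3))
toy <- do.call(rbind, lapply(1:3, function(j) {
  # 100 points per group; each column is centered on one channel value
  matrix(rnorm(300, mean = rep(centers[j, ], each = 100), sd = 0.02), ncol = 3)
}))

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avgSil <- sapply(ks, function(k) clara(toy, k)$silinfo$avg.width)

# The candidate with the widest average silhouette is taken as the best k
bestK <- ks[which.max(avgSil)]
bestK
```

Because the three groups are far apart compared with their spread, the average silhouette width should peak at the true number of groups. On real images the curve is usually flatter, which is why two candidates are compared here.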
# RGB columns only, kept in a separate object so that the image array `europe` is not overwritten
rgbValues <- rgbImage1[, c("r.value", "g.value", "b.value")]
# Silhouette plot of clara for k=3 clusters
clara1 <- clara(rgbValues, 3)
plot(silhouette(clara1))
# Silhouette plot of clara for k=7 clusters
clara2 <- clara(rgbValues, 7)
plot(silhouette(clara2))
# k=3: color every pixel with the medoid of its cluster
colors <- rgb(clara1$medoids[clara1$clustering, ])
plot(y ~ x, data = rgbImage1, main = "Air quality map of Europe",
     col = colors,
     asp = 1, pch = ".")
# k=7: color every pixel with the medoid of its cluster
colors2 <- rgb(clara2$medoids[clara2$clustering, ])
plot(y ~ x, data = rgbImage1, main = "Air quality map of Europe",
     col = colors2,
     asp = 1, pch = ".")
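The step that turns medoids into pixel colors can be seen on a small example. This is only a sketch with invented values: `medoids` and `clustering` stand in for the components of the same names returned by clara().

```r
# Hypothetical medoids: one RGB row per cluster, values in [0, 1]
medoids <- rbind(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1))
# Hypothetical cluster ids for five pixels
clustering <- c(1, 3, 3, 2, 1)

# Indexing the rows by cluster id repeats each medoid once per pixel;
# rgb() then converts every row of the matrix to a hex color string.
pixelColors <- rgb(medoids[clustering, ])
pixelColors
```

Every pixel therefore receives exactly one of k colors, which is what reduces the map to a small palette.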
Based on the silhouette plots for 3 and 7 clusters, as well as the clustered images of the air quality map of Europe, further research will be conducted on 3 clusters. Three colors are enough to separate the regions with good and bad air quality without additional detail.
The purpose of this project is to distinguish safe and dangerous regions in Europe in terms of air quality and to calculate the percentage of these areas. So far, the image has been clustered using the RGB matrix transformation, the k-medoids PAM algorithm and the Clara algorithm. Given that the chosen number of clusters (colors) is k=3, I will count the frequency of each color on the map to examine which one dominates the image.
dominant <- as.data.frame(table(colors))
# Share of each color in percent, computed row by row so that it does not
# depend on the order in which the colors appear in the table
dominant$distribution <- round(100 * dominant$Freq / sum(dominant$Freq), 2)
dominant
## colors Freq distribution
## 1 #174249 125141 23.0
## 2 #83CAE0 259576 47.7
## 3 #ADAB44 159475 29.3
dominant$colors <- as.character(dominant$colors)
pie(dominant$Freq, labels = dominant$distribution,
    col = dominant$colors,
    xlab = "colors",
    ylab = "frequency")
In summary, the pie chart shows blue, light green and dark green. Blue indicates the surface of the oceans, seas, and other bodies of water, so it is not relevant for the air quality analysis. Light green (AQI 51-100, acceptable) covers 29.3% of the map, whereas dark green (AQI 0-50, satisfactory) covers 23%. Although these shares do not differ much, the main conclusion is that Europe has more regions where air quality is less than satisfactory and poses some risk (especially in the center of the continent) than regions with very good air conditions.
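Since blue is treated as water, the shares can also be expressed relative to land only. The sketch below reuses the pixel counts from the frequency table above; dropping the blue cluster is an assumption that follows the interpretation in the summary, and the names `freq`, `land` and `landShare` are introduced only for this illustration.

```r
# Pixel counts copied from the frequency table above
freq <- c("#174249" = 125141, "#83CAE0" = 259576, "#ADAB44" = 159475)

# Assumption: the blue cluster (#83CAE0) is water and is left out
land <- freq[names(freq) != "#83CAE0"]

# Share of each remaining color in the land area, in percent
landShare <- round(100 * land / sum(land), 2)
landShare
```

Under this assumption the dark green cluster takes up about 44% of the land pixels and the light green cluster about 56%, which points to the same conclusion as the shares computed on the whole image.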