The goal of this project was to determine whether Principal Component Analysis (PCA) has an impact on the performance of clustering with the K-means and PAM algorithms. I examine and compare the results of clustering before and after performing PCA on a dataset that contains weather data from all over Australia. This allows me to check whether reducing dimensionality improves the quality of clustering with the two aforementioned methods. The evaluation is based on the silhouette score and on visualizations of the resulting clusters.
library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)
library(plotly)
library(corrplot)
library(NbClust)
library(clusterSim)
library(GGally)
library(factoextra)
library(FactoMineR)
library(caret)
library(wesanderson)
library(cclust)
library(flexclust)
library(hopkins)
setwd("C:/Users/Lukasz/Desktop/unsupervised learning/dane")
getwd()
#Loading initial data
aus <- read.csv("dane/weatherAUS.csv")
head(aus)
## Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir
## 1 2008-12-01 Albury 13.4 22.9 0.6 NA NA W
## 2 2008-12-02 Albury 7.4 25.1 0.0 NA NA WNW
## 3 2008-12-03 Albury 12.9 25.7 0.0 NA NA WSW
## 4 2008-12-04 Albury 9.2 28.0 0.0 NA NA NE
## 5 2008-12-05 Albury 17.5 32.3 1.0 NA NA W
## 6 2008-12-06 Albury 14.6 29.7 0.2 NA NA WNW
## WindGustSpeed WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am
## 1 44 W WNW 20 24 71
## 2 44 NNW WSW 4 22 44
## 3 46 W WSW 19 26 38
## 4 24 SE E 11 9 45
## 5 41 ENE NW 7 20 82
## 6 56 W W 19 24 55
## Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
## 1 22 1007.7 1007.1 8 NA 16.9 21.8
## 2 25 1010.6 1007.8 NA NA 17.2 24.3
## 3 30 1007.6 1008.7 NA 2 21.0 23.2
## 4 16 1017.6 1012.8 NA NA 18.1 26.5
## 5 33 1010.8 1006.0 7 8 17.8 29.7
## 6 23 1009.2 1005.4 NA NA 20.6 28.9
## RainToday RainTomorrow
## 1 No No
## 2 No No
## 3 No No
## 4 No No
## 5 No No
## 6 No No
This dataset contains weather observations collected from various locations in Australia. Below is a summary:
MinTemp: Minimum temperature of the day.
MaxTemp: Maximum temperature of the day.
Temp9am & Temp3pm: Temperature recorded at 9 AM and 3 PM.
Rainfall: Amount of rain recorded.
Evaporation: Evaporation level.
Sunshine: Number of sunshine hours.
WindGustDir & WindGustSpeed: Direction and speed of the strongest wind gust.
WindDir9am & WindDir3pm: Wind direction at 9 AM and 3 PM.
WindSpeed9am & WindSpeed3pm: Wind speed at 9 AM and 3 PM.
Humidity9am & Humidity3pm: Percentage of humidity at 9 AM and 3 PM.
Pressure9am & Pressure3pm: Atmospheric pressure at 9 AM and 3 PM.
Cloud9am & Cloud3pm: Overcast levels at 9 AM and 3 PM.
RainToday: Whether it rained today ("Yes"/"No").
RainTomorrow: Whether it rained the next day ("Yes"/"No"); this is the target variable of the original prediction task.
#counting missing values
count_na <- colSums(is.na(aus))
count_na
## Date Location MinTemp MaxTemp Rainfall
## 0 0 1485 1261 3261
## Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am
## 62790 69835 10326 10263 10566
## WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm
## 4228 1767 3062 2654 4507
## Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am
## 15065 15028 55888 59358 1767
## Temp3pm RainToday RainTomorrow
## 3609 3261 3267
Clearly there are columns with a large number of missing values. I set a threshold at about one third of the observations missing and deleted the columns that exceeded it.
#keeping only columns with at most 50000 missing values (about one third of the rows)
cols_del <- aus[, colSums(is.na(aus)) <= 50000]
count_na <- colSums(is.na(cols_del))
count_na
## Date Location MinTemp MaxTemp Rainfall
## 0 0 1485 1261 3261
## WindGustDir WindGustSpeed WindDir9am WindDir3pm WindSpeed9am
## 10326 10263 10566 4228 1767
## WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm
## 3062 2654 4507 15065 15028
## Temp9am Temp3pm RainToday RainTomorrow
## 1767 3609 3261 3267
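For reference, the 50000 cut-off corresponds roughly to one third of the dataset's rows; a small sketch that derives the threshold from the data instead of hard-coding it (na_threshold and cols_del_alt are names introduced here purely for illustration):
#sketch: derive the missing-value threshold (about one third of the rows) instead of hard-coding 50000
na_threshold <- nrow(aus) / 3
cols_del_alt <- aus[, colSums(is.na(aus)) <= na_threshold]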
There were still columns with missing values; however, their number was acceptable, so I decided to omit the rows that still contained missing values.
aus_clean <- na.omit(cols_del)
colSums(is.na(aus_clean))
## Date Location MinTemp MaxTemp Rainfall
## 0 0 0 0 0
## WindGustDir WindGustSpeed WindDir9am WindDir3pm WindSpeed9am
## 0 0 0 0 0
## WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm
## 0 0 0 0 0
## Temp9am Temp3pm RainToday RainTomorrow
## 0 0 0 0
Now that there are no more missing values, I proceeded to the next part of preparing the dataset for clustering, which was selecting only the columns that contain numerical data. Since the dataset is very large (over 140000 observations), I also decided to keep only the first 10000 rows so that the cluster analysis does not take extremely long to run.
aus_numeric <- aus_clean[,sapply(aus_clean, is.numeric)]
aus_numeric <- head(aus_numeric, 10000)
It is also vital to check the correlation between variables before performing clustering. This is especially true for a dataset like this one, where the variables are not only all weather-related but also include the same measurements taken at different times of the day, so we can expect some of them to be heavily correlated.
aus_cor<-cor(aus_numeric, method="pearson")
corrplot(aus_cor)
As expected, measurements taken on the same day, such as humidity and temperature, are heavily correlated and should be removed from the dataset. I set the correlation cut-off at 0.7 and removed the variables flagged above that value. After this operation we are left with a much cleaner dataset with no instances of very high correlation.
highly_correlated <- findCorrelation(aus_cor, cutoff = 0.7)
aus_numeric <- aus_numeric[, -highly_correlated]
aus_cor<-cor(aus_numeric, method="pearson")
corrplot(aus_cor)
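For transparency, the removed variables can also be listed by name. Because aus_cor was recomputed on the reduced data above, the sketch below rebuilds the pre-removal correlation matrix first (full_cor is a name introduced here; findCorrelation() accepts names = TRUE to return column names instead of indices):
#sketch: list which columns the 0.7 cut-off removed
full_cor <- cor(head(aus_clean[, sapply(aus_clean, is.numeric)], 10000), method = "pearson")
findCorrelation(full_cor, cutoff = 0.7, names = TRUE)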
Now that all aspects of data pre-processing are handled, the dataset is ready for clustering.
Before any clustering is performed it is important to check whether the dataset shows a clustering tendency. To do that I used the Hopkins statistic. As its value is very high, at 0.9999985, it is safe to say that this dataset is a good candidate for clustering.
hopkins_statistic <- hopkins::hopkins(aus_numeric)
cat("Hopkins statistic:", hopkins_statistic)
## Hopkins statistic: 0.9999985
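For intuition, the Hopkins statistic compares nearest-neighbour distances of uniformly generated points (u_i) with those of sampled real observations (w_i); values close to 1 indicate a clustering tendency, while values around 0.5 suggest uniformly spread data. Below is a rough manual sketch of the idea; the hopkins package's implementation handles the sampling details more carefully, and manual_hopkins is a name introduced here only for illustration.
#rough manual sketch of the Hopkins statistic (illustration only)
manual_hopkins <- function(X, m = 100) {
  X <- as.matrix(X)
  n <- nrow(X); d <- ncol(X)
  idx <- sample(n, m)                                              #m real observations
  mins <- apply(X, 2, min); maxs <- apply(X, 2, max)
  U <- sapply(seq_len(d), function(j) runif(m, mins[j], maxs[j]))  #m uniform points in the bounding box
  nn_dist <- function(p, ref) min(sqrt(colSums((t(ref) - p)^2)))   #nearest-neighbour distance
  w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))  #real point to nearest other real point
  u <- apply(U, 1, function(p) nn_dist(p, X))                           #uniform point to nearest real point
  sum(u^d) / (sum(u^d) + sum(w^d))                                 #close to 1 => clustered structure
}
manual_hopkins(aus_numeric)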
To determine how many clusters should be created I used the elbow method as well as the silhouette method:
show_elbow<-fviz_nbclust(aus_numeric, kmeans, method = "wss") +ggtitle("Elbow K-means")
show_elbow
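For reference, the elbow plot above is built from the total within-cluster sum of squares; the sketch below shows the underlying computation (wss is a name introduced here, and the seed is only for reproducibility).
#sketch: total within-cluster sum of squares for k = 1..10, the quantity behind the elbow plot
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(aus_numeric, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")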
Looking at the elbow plot I concluded that either 2 or 3 clusters are optimal for this dataset; anything above 3 clusters is clearly too many. However, in order to make a final decision on the number of clusters, it is important to check the plot of the silhouette method:
show_silhouette<- fviz_nbclust(aus_numeric, kmeans, method = "silhouette") +ggtitle("Silhouette K-means")
show_silhouette
By analyzing the silhouette plot we can conclude that the optimal number of clusters is 2, so I ran the K-means clustering algorithm with that number of clusters.
kmeans_method<-eclust(aus_numeric, "kmeans", hc_metric="euclidean",k=2)
As visible on the plot, the observations are divided into 2 clusters that seem only somewhat separated: some observations clearly belong to one cluster, but others lie in the middle with no clear sense of belonging to either. To check the exact performance of the K-means algorithm it is vital to look at the silhouette score presented below.
sil<-silhouette(kmeans_method$cluster, dist(aus_numeric))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 4357 0.37
## 2 2 5643 0.28
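The headline score discussed below is the mean silhouette width over all observations, which can also be read directly from the silhouette object:
#overall average silhouette width (a size-weighted average of the per-cluster means above)
mean(sil[, "sil_width"])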
The average silhouette score is 0.32, which confirms the suspicion that the clusters are not particularly well defined and that many observations are not properly separated. The silhouette score is a bit higher for one of the clusters (the blue one), which is consistent with the clustering plot, where that cluster appears better separated.
The same steps are repeated for the PAM algorithm. For this algorithm I decided to shrink the dataset even further, as it took too long to run on 10000 observations; I ended up using 2000 observations.
aus_numeric <- head(aus_numeric, 2000)
Defining the optimal number of clusters using the silhouette and elbow methods:
show_elbow<-fviz_nbclust(aus_numeric, pam, method = "wss") +ggtitle("Elbow PAM")
show_elbow
show_silhouette<- fviz_nbclust(aus_numeric, pam, method = "silhouette") +ggtitle("Silhouette PAM")
show_silhouette
Looking at the elbow plot I came to the same conclusion as for K-means: either 2 or 3 clusters are optimal. However, the silhouette method again pointed towards 2 as the optimal number of clusters, which is why I selected this number for the analysis. With the number of clusters chosen, we can move on to the clustering itself.
pam_method<-eclust(aus_numeric, "pam", hc_metric="euclidean",k=2)
The visual representation of the clusters is similar to the one for the K-means algorithm. Some observations are evidently separated and clearly belong to one cluster, but others lie in an area where the clusters overlap. To see which method performs better, K-means or PAM, we can compare the silhouette scores:
sil<-silhouette(pam_method$cluster, dist(aus_numeric))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 1072 0.39
## 2 2 928 0.30
As we can see, the silhouette score for the PAM algorithm is slightly higher, at 0.35. It is also worth noting that the gap between the scores of the two clusters is smaller than it was for K-means, so it is reasonable to conclude that PAM is the better-performing method in this scenario.
Now that the results for clustering before Principal Component Analysis are obtained, it is time to conduct the PCA and check the results after dimensionality reduction. While performing PCA I also set the parameter scale. = TRUE, since the analyzed variables were measured on different scales, which could otherwise seriously skew the results.
pca_data <- prcomp(aus_numeric, center = TRUE, scale. = TRUE)
summary(pca_data)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7519 1.3772 1.0010 0.80030 0.7616 0.57627 0.55697
## Proportion of Variance 0.3836 0.2371 0.1252 0.08006 0.0725 0.04151 0.03878
## Cumulative Proportion 0.3836 0.6207 0.7460 0.82603 0.8985 0.94004 0.97881
## PC8
## Standard deviation 0.41170
## Proportion of Variance 0.02119
## Cumulative Proportion 1.00000
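As a quick cross-check of what scale. = TRUE does: with centering and scaling, prcomp() is equivalent to an eigendecomposition of the correlation matrix. A minimal sketch (eig and scores are names introduced here):
#with scaling, component standard deviations are the square roots of the correlation matrix eigenvalues
eig <- eigen(cor(aus_numeric))
sqrt(eig$values)                              #compare with the standard deviations in summary(pca_data)
scores <- scale(aus_numeric) %*% eig$vectors  #equals pca_data$x up to the signs of the columns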
The next step was to check which variables contribute the most to the first three components. As visible on the scree plot below, the first three components explain the highest percentage of variance, so I focused on them.
fviz_screeplot(pca_data, addlabels = TRUE, barfill = "darkgreen", linecolor = "black")
p_1 <- fviz_contrib(pca_data, choice = "var", axes = 1)
p_2 <- fviz_contrib(pca_data, choice = "var", axes = 2)
p_3 <- fviz_contrib(pca_data, choice = "var", axes = 3)
p_1
Looking at the contributions of variables to the first principal component, WindGustSpeed is the biggest contributor along with Pressure9am; MinTemp, WindSpeed3pm and Humidity9am also contribute a fair amount.
p_2
For the second principal component, Humidity3pm, Humidity9am and Rainfall are by far the biggest contributors.
p_3
For the third principal component, Rainfall, MinTemp and WindSpeed3pm are the biggest contributors.
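As a cross-check, the contribution percentages plotted above can be computed directly from the loadings: for a prcomp object the contribution of a variable to a component is 100 times its squared loading, since the squared loadings of each component sum to one. A sketch (contrib is a name introduced here):
#variable contributions (%) to the first three principal components, computed from the loadings
contrib <- 100 * pca_data$rotation^2
round(contrib[, 1:3], 1)   #compare with the fviz_contrib() bar plots above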
The contributions of the variables, as well as of individual observations, can also be visualized on the plots below.
fviz_pca_var(pca_data,
col.var= "contrib",
gradient.cols = c("#40e0d0", "#E9A122", "#850025"),
ggtheme = theme_minimal())
fviz_pca_ind(pca_data, col.ind = "cos2",
gradient.cols = c("#40e0d0", "#E9A122", "#850025"), select.ind = list(contrib = 100))
Next, we can look at the eigenvalues to determine how many components to keep. The highest eigenvalues belong to the first three components, which together capture almost 75% of the variance; beyond the third component the gain in explained variance drops off noticeably. Based on this, as well as the eigenvalue plot, I decided to limit the data to 3 dimensions.
fviz_eig(pca_data, choice='eigenvalue', addlabels = TRUE, barfill = "orange", barcolor = "black")
eig.val<-get_eigenvalue(pca_data)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.0691462 38.364327 38.36433
## Dim.2 1.8966241 23.707801 62.07213
## Dim.3 1.0019917 12.524897 74.59703
## Dim.4 0.6404764 8.005955 82.60298
## Dim.5 0.5799656 7.249570 89.85255
## Dim.6 0.3320896 4.151120 94.00367
## Dim.7 0.3102127 3.877658 97.88133
## Dim.8 0.1694938 2.118672 100.00000
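As an additional check of this choice, the common rule of thumb of keeping components with an eigenvalue above 1 (the Kaiser criterion) also points to three components here:
#Kaiser criterion: number of components with an eigenvalue greater than 1
sum(eig.val[, "eigenvalue"] > 1)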
pca_df <- as.data.frame(pca_data$x)
pca_final <- pca_df[,1:3]
head(pca_final)
## PC1 PC2 PC3
## 1 -2.4321785 0.16300998 -0.6067536
## 2 -1.3642660 -1.14308073 -0.5164047
## 3 -3.1006462 -0.50576657 -0.7289755
## 4 -0.1521473 -2.06654029 0.3253939
## 5 -1.1253636 -0.07725937 0.4435210
## 6 -3.0794221 -0.22070572 -0.7220130
Now let us move on to clustering after PCA and compare the results with those for the original dataset.
hopkins_statistic <- hopkins::hopkins(pca_final)
cat("Hopkins statistic:", hopkins_statistic)
## Hopkins statistic: 0.999177
The Hopkins statistic has a value very similar to the one obtained before performing PCA.
show_elbow<-fviz_nbclust(pca_final, kmeans, method = "wss") +ggtitle("Elbow K-means")
show_elbow
By looking at the graph of the elbow method it seems that this time 3 might be the optimal number of clusters. Let us see if this will be confirmed by the silhouette method.
show_silhouette<- fviz_nbclust(pca_final, kmeans, method = "silhouette") +ggtitle("Silhouette K-Means")
show_silhouette
The silhouette method seems to agree, so 3 clusters will be used for K-means clustering.
k_means_after_pca<-eclust(pca_final, "kmeans", hc_metric="euclidean",k=3)
sil<-silhouette(k_means_after_pca$cluster, dist(pca_final))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 786 0.40
## 2 2 812 0.40
## 3 3 402 0.17
Looking at the silhouette score for the K-means algorithm, it has actually improved from 0.32 before PCA to 0.35 now (note that the post-PCA clustering was run on the 2000-observation subset also used for PAM, while the earlier K-means run used 10000 observations). That means that in this case reducing the dimensionality of the data helped the K-means clustering algorithm obtain better results.
Now let us check whether the score has also improved for the PAM algorithm.
show_elbow<-fviz_nbclust(pca_final, pam, method = "wss") +ggtitle("Elbow PAM")
show_elbow
show_silhouette<- fviz_nbclust(pca_final, pam, method = "silhouette") +ggtitle("Silhouette PAM")
show_silhouette
By analyzing the elbow and silhouette plots we can conclude that, just as for K-means after PCA, 3 is the optimal number of clusters. With the number of clusters defined, we can run the PAM algorithm.
pam_method<-eclust(pca_final, "pam", hc_metric="euclidean",k=3)
sil<-silhouette(pam_method$cluster, dist(pca_final))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 535 0.11
## 2 2 780 0.42
## 3 3 685 0.43
For PAM the silhouette score decreased slightly, from 0.35 to 0.34. However, considering that PCA significantly reduced the dimensionality of the dataset, I believe that in the grand scheme of things the post-PCA dataset should still be preferred for PAM clustering.
As presented above, PCA had a positive impact on K-means, as the silhouette score for this method improved from 0.32 to 0.35 after the reduction of dimensions. This operation evidently helped with redundant and noisy variables and allowed a better result to be achieved. The situation is slightly different for the PAM method: its silhouette score decreased slightly, from 0.35 to 0.34, after applying PCA to the dataset. Judged purely on the silhouette score, PAM did not improve. It is, however, worth noting that the complexity of the data was reduced thanks to PCA, which leaves us with a decision to make: what do we value more, a very slight difference in cluster quality, or the fact that we obtain very similar results with a much lower-dimensional dataset?
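For convenience, the average silhouette widths discussed above can be gathered into one small summary (the values are transcribed from the results reported earlier, not recomputed):
#summary of the average silhouette widths reported above
summary_results <- data.frame(
  method = c("K-means", "K-means", "PAM", "PAM"),
  data = c("original", "after PCA", "original", "after PCA"),
  avg_silhouette = c(0.32, 0.35, 0.35, 0.34)
)
summary_results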