1. Introduction to the project and the main idea

The goal of this project was to explore and determine whether Principal Component Analysis (PCA) has an impact on the performance of clustering with the K-means and PAM algorithms. I examine and compare the clustering results before and after performing PCA on a dataset that contains weather data from all over Australia. This allows me to check whether reducing dimensionality improves the quality of clustering with the two aforementioned methods, which I evaluate using the silhouette score and visualizations of the clusters.

2. Loading required libraries. Loading, pre-processing and describing the dataset

library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)
library(plotly)
library(corrplot)
library(NbClust)
library(clusterSim)
library(GGally)
library(factoextra)
library(FactoMineR)
library(caret)
library(wesanderson)
library(cclust)
library(flexclust)
library(hopkins)
setwd("C:/Users/Lukasz/Desktop/unsupervised learning/dane")
getwd()
#Loading initial data
aus <- read.csv("dane/weatherAUS.csv")
head(aus)
##         Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir
## 1 2008-12-01   Albury    13.4    22.9      0.6          NA       NA           W
## 2 2008-12-02   Albury     7.4    25.1      0.0          NA       NA         WNW
## 3 2008-12-03   Albury    12.9    25.7      0.0          NA       NA         WSW
## 4 2008-12-04   Albury     9.2    28.0      0.0          NA       NA          NE
## 5 2008-12-05   Albury    17.5    32.3      1.0          NA       NA           W
## 6 2008-12-06   Albury    14.6    29.7      0.2          NA       NA         WNW
##   WindGustSpeed WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am
## 1            44          W        WNW           20           24          71
## 2            44        NNW        WSW            4           22          44
## 3            46          W        WSW           19           26          38
## 4            24         SE          E           11            9          45
## 5            41        ENE         NW            7           20          82
## 6            56          W          W           19           24          55
##   Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
## 1          22      1007.7      1007.1        8       NA    16.9    21.8
## 2          25      1010.6      1007.8       NA       NA    17.2    24.3
## 3          30      1007.6      1008.7       NA        2    21.0    23.2
## 4          16      1017.6      1012.8       NA       NA    18.1    26.5
## 5          33      1010.8      1006.0        7        8    17.8    29.7
## 6          23      1009.2      1005.4       NA       NA    20.6    28.9
##   RainToday RainTomorrow
## 1        No           No
## 2        No           No
## 3        No           No
## 4        No           No
## 5        No           No
## 6        No           No

2.1 Dataset Description

This dataset contains weather observations collected from various locations in Australia. Below is a summary:

  • MinTemp: Minimum temperature of the day.

  • MaxTemp: Maximum temperature of the day.

  • Temp9am & Temp3pm: Temperature recorded at 9 AM and 3 PM.

  • Rainfall: Amount of rain recorded.

  • Evaporation: Evaporation recorded during the day.

  • Sunshine: Number of sunshine hours.

  • WindGustDir & WindGustSpeed: Direction and speed of the strongest wind gust.

  • WindDir9am & WindDir3pm: Wind direction at 9 AM and 3 PM.

  • WindSpeed9am & WindSpeed3pm: Wind speed at 9 AM and 3 PM.

  • Humidity9am & Humidity3pm: Percentage of humidity at 9 AM and 3 PM.

  • Pressure9am & Pressure3pm: Atmospheric pressure at 9 AM and 3 PM.

  • Cloud9am & Cloud3pm: Cloud cover at 9 AM and 3 PM.

  • RainToday: Whether it rained today ("Yes"/"No").

  • RainTomorrow: Whether it will rain the next day ("Yes"/"No").
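
The structure and types of these variables can be checked directly before any cleaning; a minimal sketch using base R (purely illustrative):

#quick structural check of the raw data: variable types and example values
str(aus)
#number of observations and variables
dim(aus)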

#counting missing values
count_na <- colSums(is.na(aus))
count_na
##          Date      Location       MinTemp       MaxTemp      Rainfall 
##             0             0          1485          1261          3261 
##   Evaporation      Sunshine   WindGustDir WindGustSpeed    WindDir9am 
##         62790         69835         10326         10263         10566 
##    WindDir3pm  WindSpeed9am  WindSpeed3pm   Humidity9am   Humidity3pm 
##          4228          1767          3062          2654          4507 
##   Pressure9am   Pressure3pm      Cloud9am      Cloud3pm       Temp9am 
##         15065         15028         55888         59358          1767 
##       Temp3pm     RainToday  RainTomorrow 
##          3609          3261          3267

Clearly, there are columns with a lot of missing values. I set the threshold at roughly 1/3 of the observations missing and removed the columns that had more missing data than that.

cols_del <- aus[, colSums(is.na(aus)) <= 50000]
count_na <- colSums(is.na(cols_del))
count_na
##          Date      Location       MinTemp       MaxTemp      Rainfall 
##             0             0          1485          1261          3261 
##   WindGustDir WindGustSpeed    WindDir9am    WindDir3pm  WindSpeed9am 
##         10326         10263         10566          4228          1767 
##  WindSpeed3pm   Humidity9am   Humidity3pm   Pressure9am   Pressure3pm 
##          3062          2654          4507         15065         15028 
##       Temp9am       Temp3pm     RainToday  RainTomorrow 
##          1767          3609          3261          3267
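
The 50000 cut-off used above corresponds roughly to the 1/3 rule mentioned earlier; a minimal sketch that derives the threshold from the number of rows instead of hard-coding it (an equivalent alternative here, as it removes the same four columns):

#deriving the missingness threshold as one third of the number of rows
na_threshold <- nrow(aus) / 3
cols_del_alt <- aus[, colSums(is.na(aus)) <= na_threshold]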

There were still several columns with missing values; however, the amounts were acceptable, so I decided to drop the rows that still contained missing values.

aus_clean <- na.omit(cols_del)
colSums(is.na(aus_clean))
##          Date      Location       MinTemp       MaxTemp      Rainfall 
##             0             0             0             0             0 
##   WindGustDir WindGustSpeed    WindDir9am    WindDir3pm  WindSpeed9am 
##             0             0             0             0             0 
##  WindSpeed3pm   Humidity9am   Humidity3pm   Pressure9am   Pressure3pm 
##             0             0             0             0             0 
##       Temp9am       Temp3pm     RainToday  RainTomorrow 
##             0             0             0             0

Now that there are no missing values left, I proceeded to the next part of preparing the dataset for clustering, which was selecting only the columns that contain numerical data. Since the dataset was very large (over 140000 observations before cleaning), I also kept only the first 10000 rows so that the cluster analysis would not take extremely long to run.

aus_numeric <- aus_clean[,sapply(aus_clean, is.numeric)]
aus_numeric <- head(aus_numeric, 10000)
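
Note that head() keeps only the first 10000 records, so the subset is not random. A more representative (hypothetical) alternative would be a random sample of the same size; a minimal sketch, not used in the remainder of the analysis:

#hypothetical alternative: a reproducible random sample of 10000 rows
set.seed(123)
aus_sample <- aus_clean[, sapply(aus_clean, is.numeric)]
aus_sample <- aus_sample[sample(nrow(aus_sample), 10000), ]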

2.2 Testing correlation

It is also vital to check the correlation between the variables before performing clustering, especially in a dataset like this one, where related weather quantities are measured at different times of the same day. We can therefore anticipate that some variables will be heavily correlated.

aus_cor<-cor(aus_numeric, method="pearson")
corrplot(aus_cor)

As expected, measurements such as humidity and temperature taken on the same day are heavily correlated, and the redundant variables should be removed from the dataset. I set the correlation cut-off at 0.7 and removed the variables flagged as exceeding it. After this operation we are left with a much cleaner dataset with no instances of very high correlation.

highly_correlated <- findCorrelation(aus_cor, cutoff = 0.7) 
aus_numeric <- aus_numeric[, -highly_correlated]
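#for reference (not part of the original script): names of the variables
#flagged by findCorrelation() and removed above
colnames(aus_cor)[highly_correlated]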

aus_cor<-cor(aus_numeric, method="pearson")
corrplot(aus_cor)

Now that all aspects of data pre-processing are handled, the dataset is ready for clustering.

3. Clustering

3.1 K-means algorithm

Before any sort of clustering is performed it is important to check whether a dataset shows clustering tendencies. To do that I used the Hopkins statistic. As its value is very high at 0.9999985, it is safe to say that this dataset is a good candidate for clustering.

hopkins_statistic <- hopkins::hopkins(aus_numeric)
cat("Hopkins statistic:", hopkins_statistic)
## Hopkins statistic: 0.9999985

To determine how many clusters should be created I used the elbow method as well as the silhouette method:

show_elbow<-fviz_nbclust(aus_numeric, kmeans, method = "wss") +ggtitle("Elbow K-means")
show_elbow

By looking at the elbow plot I concluded that either 2 or 3 clusters are optimal for this dataset; anything above 3 clusters would clearly be too many. However, in order to make a final decision on the number of clusters, it is important to check the plot of the silhouette method:

show_silhouette<- fviz_nbclust(aus_numeric, kmeans, method = "silhouette") +ggtitle("Silhouette K-means")
show_silhouette

By analyzing the plot of the silhouette method we can conclude that the optimal number of clusters is 2, so I ran the K-means clustering algorithm with that number of clusters.

kmeans_method<-eclust(aus_numeric, "kmeans", hc_metric="euclidean",k=2)

As visible on the plot, the observations are divided into 2 clusters that are only somewhat separated: some observations clearly belong to one cluster, while others lie in the middle without a clear assignment. To check the exact performance of the K-means algorithm, it is vital to look at the silhouette score presented below.

sil<-silhouette(kmeans_method$cluster, dist(aus_numeric))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1 4357          0.37
## 2       2 5643          0.28

The average silhouette width is about 0.32, which confirms the suspicion that the clusters are not particularly well defined and many observations are not properly separated. The silhouette width is noticeably higher for one of the clusters (0.37 vs 0.28), which is consistent with the cluster plot, where that cluster appears better separated.
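
The overall value quoted above is the average silhouette width across all observations; it can be computed directly from the silhouette object, as in this minimal sketch:

#overall average silhouette width for the K-means solution
mean(sil[, "sil_width"])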

3.2 PAM algorithm

The same steps are repeated before running the PAM algorithm. For this algorithm I decided to shrink the dataset even further, as it was taking too long to execute on a dataset with 10000 observations, so I limited the number of observations to 2000.

aus_numeric <- head(aus_numeric, 2000)

Defining the optimal number of clusters using the silhouette and elbow methods:

show_elbow<-fviz_nbclust(aus_numeric, pam, method = "wss") +ggtitle("Elbow PAM")
show_elbow

show_silhouette<- fviz_nbclust(aus_numeric, pam, method = "silhouette") +ggtitle("Silhouette PAM")
show_silhouette

By looking at the elbow method I came to the same conclusion as in the case of the K-means algorithm: either 2 or 3 clusters are optimal. However, the silhouette method again pointed towards 2 as the optimal number of clusters, which is why I selected that number for the analysis.

pam_method<-eclust(aus_numeric, "pam", hc_metric="euclidean",k=2)

The visual representation of the clusters is similar to the one for the K-means algorithm. Some observations are evidently separated and clearly belong to one cluster, but there are also observations lying in the area where the clusters overlap. To see which method performs better, K-means or PAM, we can compare the silhouette scores:

sil<-silhouette(pam_method$cluster, dist(aus_numeric))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1 1072          0.39
## 2       2  928          0.30

As we can see, the silhouette score for the PAM algorithm is slightly higher at 0.35. It is also worth noting that the difference between the scores of the two clusters is smaller than it was for the K-means algorithm, so PAM appears to be the better-performing method in this scenario (bearing in mind that it was run on a smaller subset of the data).

4. Principal Component Analysis

Now that the results for clustering before Principal Component Analysis are obtained, it is time to conduct the PCA and check the results after dimensionality reduction. While performing PCA I also set the parameter scale. = TRUE, since the analyzed variables are measured on different scales and that could seriously skew the results.

pca_data <- prcomp(aus_numeric, center = TRUE, scale. = TRUE)
summary(pca_data)
## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6     PC7
## Standard deviation     1.7519 1.3772 1.0010 0.80030 0.7616 0.57627 0.55697
## Proportion of Variance 0.3836 0.2371 0.1252 0.08006 0.0725 0.04151 0.03878
## Cumulative Proportion  0.3836 0.6207 0.7460 0.82603 0.8985 0.94004 0.97881
##                            PC8
## Standard deviation     0.41170
## Proportion of Variance 0.02119
## Cumulative Proportion  1.00000
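
Setting scale. = TRUE is equivalent to standardizing the variables beforehand; a minimal illustrative check (not part of the main analysis) is sketched below:

#standardize manually, then run PCA without additional scaling
aus_scaled <- scale(aus_numeric)
pca_check <- prcomp(aus_scaled)
#the standard deviations should match those of pca_data above
head(pca_check$sdev)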

The next step was to check which variables contribute the most to the first 3 components. As visible on the scree plot below, the first three components explain the highest percentage of variance, so I focused on them.

fviz_screeplot(pca_data, addlabels = TRUE, barfill = "darkgreen", linecolor = "black")

p_1 <- fviz_contrib(pca_data, choice = "var", axes = 1)
p_2 <- fviz_contrib(pca_data, choice = "var", axes = 2)
p_3 <- fviz_contrib(pca_data, choice = "var", axes = 3)
p_1

By looking at the contributions of the variables to the first principal component we can see that WindGustSpeed is the biggest contributor along with Pressure9am. MinTemp, WindSpeed3pm and Humidity9am also contribute noticeably.

p_2

For the second principal component, Humidity3pm, Humidity9am and Rainfall are by far the biggest contributors.

p_3

For the third principal component, Rainfall, MinTemp and WindSpeed3pm are the biggest contributors.
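
The same information can be read off numerically from the loadings (rotation) matrix; a minimal sketch showing the loadings of the variables on the first three components:

#variable loadings on the first three principal components
round(pca_data$rotation[, 1:3], 2)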

The contributions of the variables, as well as of individual observations, can also be visualized on the plots below.

fviz_pca_var(pca_data,
             col.var= "contrib",
             gradient.cols = c("#40e0d0", "#E9A122", "#850025"),
             ggtheme = theme_minimal())

fviz_pca_ind(pca_data, col.ind = "cos2", 
            gradient.cols = c("#40e0d0", "#E9A122", "#850025"), select.ind = list(contrib = 100))

Next, we can look at the eigenvalues to determine how many components to keep. The highest eigenvalues belong to the first three components, which together capture almost 75% of the variance; after the third component the rate at which additional variance is captured clearly drops. Based on this, and on the eigenvalue plot, I decided to limit the data to 3 dimensions.

fviz_eig(pca_data, choice='eigenvalue', addlabels = TRUE, barfill = "orange", barcolor = "black")   

eig.val<-get_eigenvalue(pca_data)
eig.val
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  3.0691462        38.364327                    38.36433
## Dim.2  1.8966241        23.707801                    62.07213
## Dim.3  1.0019917        12.524897                    74.59703
## Dim.4  0.6404764         8.005955                    82.60298
## Dim.5  0.5799656         7.249570                    89.85255
## Dim.6  0.3320896         4.151120                    94.00367
## Dim.7  0.3102127         3.877658                    97.88133
## Dim.8  0.1694938         2.118672                   100.00000
pca_df <- as.data.frame(pca_data$x)
pca_final <- pca_df[,1:3]
head(pca_final)
##          PC1         PC2        PC3
## 1 -2.4321785  0.16300998 -0.6067536
## 2 -1.3642660 -1.14308073 -0.5164047
## 3 -3.1006462 -0.50576657 -0.7289755
## 4 -0.1521473 -2.06654029  0.3253939
## 5 -1.1253636 -0.07725937  0.4435210
## 6 -3.0794221 -0.22070572 -0.7220130

5. Clustering after PCA

5.1 K-means

Now let us move on to clustering after performing PCA and compare the results with those obtained on the original dataset.

hopkins_statistic <- hopkins::hopkins(pca_final)
cat("Hopkins statistic:", hopkins_statistic)
## Hopkins statistic: 0.999177

The Hopkins statistic has a very similar value to the one obtained before performing PCA.

show_elbow<-fviz_nbclust(pca_final, kmeans, method = "wss") +ggtitle("Elbow K-means")
show_elbow

By looking at the elbow plot it seems that this time 3 might be the optimal number of clusters. Let us see whether this is confirmed by the silhouette method.

show_silhouette<- fviz_nbclust(pca_final, kmeans, method = "silhouette") +ggtitle("Silhouette K-Means")
show_silhouette

The silhouette method seems to agree, therefore 3 clusters will be used for the K-means clustering.

k_means_after_pca<-eclust(pca_final, "kmeans", hc_metric="euclidean",k=3)

sil<-silhouette(k_means_after_pca$cluster, dist(pca_final))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1  786          0.40
## 2       2  812          0.40
## 3       3  402          0.17

Looking at the silhouette score for the K-means algorithm, we can see that it has actually improved, from 0.32 before performing PCA to 0.35 now. That means that in this case reducing the dimensionality of the data actually helped the K-means clustering algorithm obtain better results.

5.2 PAM

Now let us check whether the score has also improved for the PAM algorithm.

show_elbow<-fviz_nbclust(pca_final, pam, method = "wss") +ggtitle("Elbow PAM")
show_elbow

show_silhouette<- fviz_nbclust(pca_final, pam, method = "silhouette") +ggtitle("Silhouette PAM")
show_silhouette

By analyzing the elbow and silhouette methods we can conclude that, just like for K-means after PCA, 3 is the optimal number of clusters. Now that the number of clusters is defined, we can run the PAM algorithm.

pam_method<-eclust(pca_final, "pam", hc_metric="euclidean",k=3)

sil<-silhouette(pam_method$cluster, dist(pca_final))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1  535          0.11
## 2       2  780          0.42
## 3       3  685          0.43

In the case of PAM the silhouette score decreased slightly, from 0.35 to 0.34. However, if we consider that PCA significantly reduced the dimensionality of the dataset, I believe that in the grand scheme of things the dataset after PCA should still be preferred for PAM clustering.

6. Conclusions and final thoughts

As presented above, PCA had a positive impact on K-means, as the silhouette score for this method improved from 0.32 to 0.35 after the reduction of dimensions. It is clear that this operation helped remove some redundant and noisy information and allowed a better result to be achieved. The situation is slightly different for the PAM method: its silhouette score decreased slightly, from 0.35 to 0.34, after applying PCA to the dataset. Judging by the silhouette score alone, PAM did not improve. It is, however, worth noting that the dimensionality of the data was substantially reduced thanks to PCA. That leaves us with a decision to make: what do we value more, a very slight difference in cluster quality, or the fact that we obtain very similar results with a much lower-dimensional dataset?
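
For reference, the average silhouette widths reported in this project can be gathered into a small summary table; a minimal sketch using the values quoted above:

#summary of the average silhouette widths reported in sections 3 and 5
results <- data.frame(
  method     = c("K-means", "PAM"),
  before_PCA = c(0.32, 0.35),
  after_PCA  = c(0.35, 0.34)
)
results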