# General
library(dplyr)
library(tidyr)
library(ggplot2)

# Clustering
library(caret)
library(corrplot) # Step 1c: visualising corrplot
library(clValid) # cluster validation
library(cluster) # Step 4: clusGap()
library(NbClust) # Step 4: NbClust()
library(factoextra)
library(clustertend) # for Hopkins stats 
library(fpc) #plotcluster function 
library(broom) #tidy() in understanding features in clusters
library(flexclust) #as.kcca(), to combine clusters numbers with original df

# Variable partitioning methods 
library(rpart)
library(rpart.plot)
library(partykit)

1 Data pre-processing & scaling

Based on the feature selection for the 80 EVs in part (1/3) of the analysis, this clustering analysis uses the features shown to have an impact on kWh/100km in the previous regression analysis: outdoor temperature (a_temp_start) and average speed in km/h (kmh). The data is also scaled to normalise the values.
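
The preparation step itself is not shown; a minimal sketch of what it could look like, assuming BMW_clus_EV_id holds a vehicle_id column plus the selected features:

# Assumed preparation: drop the identifier, then z-score scale the features
BMW_clus_EV_noid <- BMW_clus_EV_id %>% select(-vehicle_id)
BMW_clus_EV_noid_scale <- as.data.frame(scale(BMW_clus_EV_noid))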

2 Assessing clustering tendency

Before clustering, it is necessary to assess whether the data contains any grouping structure. Why is this necessary? Because k-means and hierarchical clustering will generate clusters even when no real groups exist in the data.
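
As a quick illustration of this point, k-means will happily partition pure noise (a hypothetical toy example, not part of this analysis):

set.seed(8)
noise <- matrix(runif(200), ncol=2) # 100 random 2-d points with no structure
kmeans(noise, centers=3)$size # still returns 3 non-empty clusters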

2.1 Statistical method (Hopkins statistic)

The Hopkins statistic assesses the clustering tendency of the data set by measuring the probability that the data set was generated by a uniform distribution. This is done by testing the spatial randomness of the data:
- H0: the dataset is uniformly distributed (i.e. no meaningful clusters)
- H1: the dataset is not uniformly distributed (i.e. contains meaningful clusters)

If the value of the Hopkins statistic (H) is close to zero (below 0.5), H0 is rejected: the data is not uniformly distributed (i.e. contains meaningful clusters), and it can be concluded that the data set is significantly clusterable.

set.seed(8)
hopkins(BMW_clus_EV_noid_scale, n=nrow(BMW_clus_EV_noid_scale)-1)
## $H
## [1] 0.3013379

The Hopkins statistic has a value of 0.30, which is below 0.5. The null hypothesis is rejected, meaning that the data has clusterable properties and is suitable for clustering analysis.

2.2 Visual method (Dissimilarity plot)

Another way to assess the suitability of applying clustering techniques to the data is the Visual Assessment of cluster Tendency (VAT) algorithm. The resulting dissimilarity plot uses distance measures to give an overview of the clustering tendency of the data.

# Distance matrix between rows using pearson method
dist_pear_EV <- get_dist(BMW_clus_EV_noid_scale, stand = TRUE, method="pearson") 

# View a subset of the distance matrices
round(as.matrix(dist_pear_EV)[1:6, 1:6],2)
##      1    2    3    4    5    6
## 1 0.00 1.26 2.00 1.52 1.29 0.36
## 2 1.26 0.00 0.84 0.04 0.00 1.91
## 3 2.00 0.84 0.00 0.56 0.80 1.56
## 4 1.52 0.04 0.56 0.00 0.03 1.99
## 5 1.29 0.00 0.80 0.03 0.00 1.92
## 6 0.36 1.91 1.56 1.99 1.92 0.00
# visualise the distance matrix
fviz_dist(dist_pear_EV, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

The ordered dissimilarity plot suggests possible clusters, i.e. the data has some clustering tendency, so the analysis continues with clustering.

3 Selecting the clustering method

Before applying clustering algorithms to the data, it is necessary to understand which type of clustering algorithm is suitable. This analysis compares three clustering methods (hierarchical, k-means and PAM) to find the most suitable one for the data.

clmethods <- c("hierarchical", "kmeans", "pam")

valid_clus_type_EV <- clValid(BMW_clus_EV_noid_scale, nClust = 3:6, clMethods = clmethods, validation = "internal", maxitems = 80)
summary(valid_clus_type_EV)
## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  3 4 5 6 
## 
## Validation Measures:
##                                  3       4       5       6
##                                                           
## hierarchical Connectivity  14.4151 16.5734 17.5734 27.0694
##              Dunn           0.1301  0.1301  0.1301  0.1331
##              Silhouette     0.3303  0.2926  0.2242  0.2807
## kmeans       Connectivity  25.4897 44.4329 47.4929 49.5635
##              Dunn           0.0650  0.0468  0.0255  0.1338
##              Silhouette     0.3265  0.2538  0.2596  0.2642
## pam          Connectivity  28.6036 30.8226 36.0016 57.2262
##              Dunn           0.0651  0.0666  0.0666  0.0418
##              Silhouette     0.3251  0.2637  0.2710  0.2407
## 
## Optimal Scores:
## 
##              Score   Method       Clusters
## Connectivity 14.4151 hierarchical 3       
## Dunn          0.1338 kmeans       6       
## Silhouette    0.3303 hierarchical 3
plot(valid_clus_type_EV)

The internal validation measures include Connectivity, Silhouette Width, and Dunn Index. Briefly:

  • “Connectivity” has a value between \(0\) and \(\infty\) and should be minimised
  • The “Dunn Index” is the ratio of the smallest distance between observations in different clusters to the largest distance within a cluster (see the formula after this list). It has a value between \(0\) and \(\infty\) and should be maximised
  • “Silhouette width” measures how similar an object is to the other objects in its own cluster vs. those in the nearest neighbouring cluster. It ranges from \(-1\) to \(1\): a value close to \(1\) indicates an observation is well clustered (similar to the other observations in its group), while a value close to \(-1\) indicates it is poorly clustered
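
For reference, the standard definition of the Dunn index for a clustering \(C_1, \dots, C_k\) is

\[
D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{1 \le l \le k} \operatorname{diam}(C_l)}
\]

where \(d(C_i, C_j)\) is the smallest distance between observations in clusters \(C_i\) and \(C_j\), and \(\operatorname{diam}(C_l)\) is the largest distance between two observations within cluster \(C_l\).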

Based on the analysis using the clValid function, two of the three measures (Connectivity and Silhouette) favour hierarchical clustering with 3 clusters, so this is the method selected for clustering the EV observations.

4 Visualisation of clusters

4.1 Hierarchical clustering

Using the optimal clustering method identified in the previous section, hierarchical clustering is applied to the EV data, with Ward’s minimum variance method (ward.D) as the linkage method and the tree cut into 3 clusters.

BMW_hc_EV_nice <- eclust(BMW_clus_EV_noid_scale, "hclust", k=3, graph=FALSE, hc_method="ward.D")
fviz_dend(BMW_hc_EV_nice, rect=TRUE, cex=0.5, k_colors = c("#00AFBB", "#E7B800", "#D95F02"), show_labels = TRUE, main="Clustering of 80 EVs in 3 clusters", xlab="EVs ID") 

4.2 Validate clusters with silhouette plot

Looking deeper into one of the validation measures, the clusters have an average “Silhouette width” of approximately \(0.31\) (the weighted mean of the per-cluster widths below).

fviz_silhouette(BMW_hc_EV_nice)
##   cluster size ave.sil.width
## 1       1   11          0.22
## 2       2   41          0.32
## 3       3   28          0.34

5 Understanding the features of the EV clusters

With the 3 clusters identified, the analysis continues by assigning a cluster number to each of the 80 EVs and unscaling the data back to its original values.
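
This step is not shown in the code; a minimal sketch of one way to do it, assuming the rows of BMW_clus_EV_id are in the same order as the clustering input:

# Assumed: eclust() stores the cluster labels in $cluster, in input row order
BMW_clus_EV_id_scale <- BMW_clus_EV_id %>%
  mutate(grp_EV = BMW_hc_EV_nice$cluster) # attach cluster number to unscaled data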

BMW_EV_c1 <- subset(BMW_clus_EV_id_scale, grp_EV==1)
summary(BMW_EV_c1)
##    vehicle_id     a_temp_start        kmh          kwh_100km     
##  Min.   : 1.00   Min.   :12.33   Min.   :35.62   Min.   : 9.611  
##  1st Qu.:24.00   1st Qu.:13.97   1st Qu.:37.60   1st Qu.:13.501  
##  Median :50.00   Median :14.35   Median :42.42   Median :14.252  
##  Mean   :43.36   Mean   :14.39   Mean   :43.08   Mean   :13.990  
##  3rd Qu.:62.50   3rd Qu.:15.03   3rd Qu.:48.28   3rd Qu.:14.938  
##  Max.   :78.00   Max.   :16.20   Max.   :54.17   Max.   :16.032
BMW_EV_c2 <- subset(BMW_clus_EV_id_scale, grp_EV==2)
summary(BMW_EV_c2)
##    vehicle_id     a_temp_start        kmh          kwh_100km    
##  Min.   : 2.00   Min.   :10.13   Min.   :19.39   Min.   :13.15  
##  1st Qu.:17.00   1st Qu.:12.43   1st Qu.:23.28   1st Qu.:14.97  
##  Median :28.00   Median :13.15   Median :25.33   Median :15.48  
##  Mean   :30.29   Mean   :12.98   Mean   :27.01   Mean   :15.55  
##  3rd Qu.:39.00   3rd Qu.:13.76   3rd Qu.:29.34   3rd Qu.:16.05  
##  Max.   :80.00   Max.   :14.71   Max.   :43.44   Max.   :17.82
BMW_EV_c3 <- subset(BMW_clus_EV_id_scale, grp_EV==3)
summary(BMW_EV_c3)
##    vehicle_id     a_temp_start        kmh          kwh_100km    
##  Min.   : 3.00   Min.   :14.58   Min.   :18.77   Min.   :11.63  
##  1st Qu.:48.75   1st Qu.:15.65   1st Qu.:22.33   1st Qu.:13.69  
##  Median :59.50   Median :16.02   Median :25.06   Median :14.41  
##  Mean   :54.32   Mean   :16.23   Mean   :25.48   Mean   :14.50  
##  3rd Qu.:68.25   3rd Qu.:16.49   3rd Qu.:28.47   3rd Qu.:15.34  
##  Max.   :79.00   Max.   :19.64   Max.   :33.21   Max.   :16.41
BMW_EV_c1$cluster <- 1
BMW_EV_c2$cluster <- 2
BMW_EV_c3$cluster <- 3
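
The plot below uses a long-format data frame t_EV_long that is not constructed in the shown code; a plausible construction from the per-cluster subsets above (assumed, using tidyr):

# Assumed construction of the long-format plotting data
t_EV <- rbind(BMW_EV_c1, BMW_EV_c2, BMW_EV_c3)
t_EV_long <- t_EV %>%
  select(cluster, a_temp_start, kmh) %>% # keep the cluster label and the two features
  pivot_longer(c(a_temp_start, kmh), names_to = "variables", values_to = "score")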

Visualising the features a_temp_start and kmh among the 3 clusters:

ggplot(t_EV_long, aes(x=variables, y=score, fill=factor(cluster))) +
  geom_boxplot(position=position_dodge(0.8)) +
  facet_wrap(~variables, scales = "free") +
  scale_fill_discrete(name="Cluster") +
  ggtitle("Variables role within each Clusters (Drive EVs)") +
  theme(axis.text=element_text(size=12))

5.1 Summary of cluster characteristics:

  • Cluster 1 characteristics: average temperature (median 14°C), high speed (median 42 km/h)
  • Cluster 2 characteristics: low temperature (median 13°C), low speed (median 25 km/h)
  • Cluster 3 characteristics: high temperature (median 16°C), low speed (median 25 km/h)

6 Cluster-and-predict

The analysis applies a cluster-and-predict approach to the 3 clusters in order to understand how the different clusters impact energy consumption.

# Cluster 1
BMW_EV_c1_all <- subset(BMW_clus_EV_id, BMW_clus_EV_id$vehicle_id %in% unique(BMW_EV_c1$vehicle_id))

# Cluster 2
BMW_EV_c2_all <- subset(BMW_clus_EV_id, BMW_clus_EV_id$vehicle_id %in% unique(BMW_EV_c2$vehicle_id))

# Cluster 3
BMW_EV_c3_all <- subset(BMW_clus_EV_id, BMW_clus_EV_id$vehicle_id %in% unique(BMW_EV_c3$vehicle_id))

BMW_EV_c1_all$cluster <- "1"
BMW_EV_c2_all$cluster <- "2"
BMW_EV_c3_all$cluster <- "3"

BMW_EV_allc <- rbind(BMW_EV_c1_all, BMW_EV_c2_all, BMW_EV_c3_all)
BMW_EV_allc <- BMW_EV_allc[,c(5,2:4)] # reorder: cluster column first, then the features (drops vehicle_id)

6.1 Controllable feature - km/h

Most of the features are not controllable by the EV drivers, apart from average speed (km/h). The analysis therefore uses km/h to portray its relationship with kWh/100km.

ggplot(BMW_EV_allc, aes(x=kmh, y=kwh_100km, group=factor(cluster))) +
  geom_point(aes(colour=factor(cluster)), alpha=0.1) +
  geom_smooth(aes(group=factor(cluster), colour=factor(cluster)), method="gam", formula = y~s(x), size=2.0, se=FALSE) +
  geom_ribbon(stat='smooth', method = "gam", formula = y~s(x), se=TRUE, alpha=0.1, aes(color = NULL, group = factor(cluster))) + # to control the transparency of the SE 
  ylab("Electricity Consumption (kWh/100km)") + 
  xlab("km/hour") +
  ggtitle("Clustered - EVs Electricity Consumption vs. km/h") +
  theme(axis.text=element_text(size=12))

6.2 Regression Tree on lowest energy consumption cluster

From the above regression lines, it appears that cluster 3 tends to have the lowest overall energy consumption. Using a regression tree, the features a_temp_start and kmh for cluster 3 are analysed.

BMW_EV_c3_rpart <- rpart(kwh_100km ~ a_temp_start + kmh, data=BMW_EV_c3_all)
BMW_EV_c3_party <- as.party(BMW_EV_c3_rpart)
plot(BMW_EV_c3_party, main="Cluster 3 EVs: kWh/100km vs. km/h & outdoor temperature")

6.3 Energy efficiency recommendation to achieve lower kWh/100km:

Based on clustering with outdoor temperature and average driving speed, the energy consumption among the three clusters is fairly similar, with cluster 3 generally having the lowest energy consumption. Referring to the regression tree for cluster 3, in order to achieve a lower kWh/100km, the optimum outdoor temperature is above 11.75°C, while the average speed should be not less than 22 km/h.
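
These thresholds can also be read off the fitted tree as text rules, for example with rpart.plot’s rule printer (a convenience check, not part of the original output):

rpart.rules(BMW_EV_c3_rpart) # one row per leaf, listing its split conditions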