Abstract
This final part (3/3) of the series highlighting the study “Data-driven approach using machine learning techniques to analyse energy consumption & CO2 emission of corporate electric vehicles for the up-scaling of low carbon mobility Berlin” describes machine learning methods for grouping EVs with similar driving behaviour. Using outdoor temperature and average driving speed, 80 EVs are grouped with hierarchical clustering. The analysis then recommends the feature values that lead to higher energy efficiency when driving EVs.
Part 1: Driving data analytics of 80 EVs
Part 2: EV carbon intensity & Smart Charging Strategies
# General
library(dplyr)
library(tidyr)
library(ggplot2)
# Clustering
library(caret)
library(corrplot) # Step 1c: visualising corrplot
library(clValid) # cluster validation
library(cluster) # Step 4: clusGap()
library(NbClust) # Step 4: NbClust()
library(factoextra)
library(clustertend) # for Hopkins statistic
library(fpc) # plotcluster() function
library(broom) # tidy() for understanding features in clusters
library(flexclust) # as.kcca(), to combine cluster numbers with the original df
# Variable partitioning methods
library(rpart)
library(rpart.plot)
library(partykit)
Based on the feature selection for the 80 EVs in part (1/3) of the analysis, this clustering analysis uses the features that have an impact on kWh/100km, as identified in the earlier regression analysis: outdoor temperature (a_temp_start) and average driving speed (kmh). The data is also scaled to normalise the values.
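The construction of the scaled input used for clustering is not shown in this post. Below is a minimal sketch of how it could look, assuming a per-EV data frame named BMW_clus_EV_id (the unscaled data used again later) with the columns vehicle_id, a_temp_start, kmh and kwh_100km:
# Sketch (assumption, not from the original post): select the two clustering
# features and scale them to zero mean and unit variance
BMW_clus_EV_noid <- BMW_clus_EV_id %>%
  select(a_temp_start, kmh)
BMW_clus_EV_noid_scale <- as.data.frame(scale(BMW_clus_EV_noid))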
Before clustering, it is necessary to assess whether the data contains any grouping structure. Why is this necessary? Because k-means and hierarchical clustering will happily generate clusters even when the data is random and contains no real groups.
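As a quick illustration of this point (hypothetical data, not part of the original analysis), kmeans() still reports three “clusters” when run on purely uniform noise:
# Illustration only: k-means returns k clusters even on structureless noise
set.seed(8)
random_df <- data.frame(x = runif(80), y = runif(80))
kmeans(random_df, centers = 3, nstart = 25)$size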
The Hopkins statistic assesses the clustering tendency of a data set by measuring the probability that the data set was generated by a uniform distribution. This is done by testing the spatial randomness of the data:
- H0: the data set is uniformly distributed (i.e. no meaningful clusters)
- H1: the data set is not uniformly distributed (i.e. contains meaningful clusters)
If the value of the Hopkins statistic (H) is close to zero (below 0.5), H0 is rejected: the data is not uniformly distributed (i.e. it contains meaningful clusters), and it can be concluded that the data set is significantly clusterable.
set.seed(8)
hopkins(BMW_clus_EV_noid_scale, n=nrow(BMW_clus_EV_noid_scale)-1)
## $H
## [1] 0.3013379
The Hopkins statistic has a score of 0.30, which is below 0.5. The null hypothesis is therefore rejected, meaning that the data is clusterable and suitable for clustering analysis.
Another way to justify the suitability of applying a clustering technique to the data is the Visual Assessment of cluster Tendency (VAT) algorithm. The dissimilarity plot uses distance measures to give an overview of the clustering structure in the data.
# Distance matrix between rows using pearson method
dist_pear_EV <- get_dist(BMW_clus_EV_noid_scale, stand = TRUE, method="pearson")
# View a subset of the distance matrices
round(as.matrix(dist_pear_EV)[1:6, 1:6],2)
## 1 2 3 4 5 6
## 1 0.00 1.26 2.00 1.52 1.29 0.36
## 2 1.26 0.00 0.84 0.04 0.00 1.91
## 3 2.00 0.84 0.00 0.56 0.80 1.56
## 4 1.52 0.04 0.56 0.00 0.03 1.99
## 5 1.29 0.00 0.80 0.03 0.00 1.92
## 6 0.36 1.91 1.56 1.99 1.92 0.00
# visualise the distance matrix
fviz_dist(dist_pear_EV, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
The ordered dissimilarity plot shows possible clusters, confirming that the data has some clustering structure, so the analysis continues with clustering.
Before applying a clustering algorithm to the data, it is necessary to understand which type of algorithm suits the data. This analysis compares three clustering methods, hierarchical, k-means and PAM, to find the most suitable one.
clmethods <- c("hierarchical", "kmeans", "pam")
valid_clus_type_EV <- clValid(BMW_clus_EV_noid_scale, nClust = 3:6, clMethods = clmethods, validation = "internal", maxitems = 80)
summary(valid_clus_type_EV)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 3 4 5 6
##
## Validation Measures:
## 3 4 5 6
##
## hierarchical Connectivity 14.4151 16.5734 17.5734 27.0694
## Dunn 0.1301 0.1301 0.1301 0.1331
## Silhouette 0.3303 0.2926 0.2242 0.2807
## kmeans Connectivity 25.4897 44.4329 47.4929 49.5635
## Dunn 0.0650 0.0468 0.0255 0.1338
## Silhouette 0.3265 0.2538 0.2596 0.2642
## pam Connectivity 28.6036 30.8226 36.0016 57.2262
## Dunn 0.0651 0.0666 0.0666 0.0418
## Silhouette 0.3251 0.2637 0.2710 0.2407
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 14.4151 hierarchical 3
## Dunn 0.1338 kmeans 6
## Silhouette 0.3303 hierarchical 3
plot(valid_clus_type_EV)
The internal validation measures include Connectivity, Silhouette width and the Dunn index. Briefly: Connectivity measures how often observations are placed in the same cluster as their nearest neighbours (to be minimised); the Silhouette width measures how well each observation fits its own cluster compared with the next closest cluster (ranging from -1 to 1, to be maximised); and the Dunn index is the ratio of the smallest between-cluster distance to the largest within-cluster distance (to be maximised).
Based on the analysis with the clValid function, hierarchical clustering is the favoured method for clustering the EV observations, with 3 clusters.
Using the optimal method identified in the previous section, hierarchical clustering is deployed to cluster the EV data, with Ward’s minimum variance method (ward.D) as the linkage method and the observations grouped into 3 clusters.
BMW_hc_EV_nice <- eclust(BMW_clus_EV_noid_scale, "hclust", k=3, graph=FALSE, hc_method="ward.D")
fviz_dend(BMW_hc_EV_nice, rect=TRUE, cex=0.5, k_colors = c("#00AFBB", "#E7B800", "#D95F02"), show_labels = TRUE, main="Clustering of 80 EVs in 3 clusters", xlab="EVs ID")
Looking deeper into one of the validation measures, the clusters have an average silhouette width of \(0.32\).
fviz_silhouette(BMW_hc_EV_nice)
## cluster size ave.sil.width
## 1 1 11 0.22
## 2 2 41 0.32
## 3 3 28 0.34
With the 3 clusters, the analysis continues by assigning the cluster number to each of the 80 EVs and unscaling the data back to its original values.
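The step that attaches the cluster label to each EV is not shown. A minimal sketch is given below, assuming the assignments are taken from the cluster component of the eclust object and joined to the unscaled, ID-carrying data frame BMW_clus_EV_id (the _scale suffix is kept only to match the subsetting code that follows, even though the values are in their original units):
# Sketch (assumption): attach the hierarchical cluster label to each EV,
# assuming the row order matches the data used for clustering
BMW_clus_EV_id_scale <- BMW_clus_EV_id %>%
  mutate(grp_EV = BMW_hc_EV_nice$cluster)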
BMW_EV_c1 <- subset(BMW_clus_EV_id_scale, grp_EV==1)
summary(BMW_EV_c1)
## vehicle_id a_temp_start kmh kwh_100km
## Min. : 1.00 Min. :12.33 Min. :35.62 Min. : 9.611
## 1st Qu.:24.00 1st Qu.:13.97 1st Qu.:37.60 1st Qu.:13.501
## Median :50.00 Median :14.35 Median :42.42 Median :14.252
## Mean :43.36 Mean :14.39 Mean :43.08 Mean :13.990
## 3rd Qu.:62.50 3rd Qu.:15.03 3rd Qu.:48.28 3rd Qu.:14.938
## Max. :78.00 Max. :16.20 Max. :54.17 Max. :16.032
BMW_EV_c2 <- subset(BMW_clus_EV_id_scale, grp_EV==2)
summary(BMW_EV_c2)
## vehicle_id a_temp_start kmh kwh_100km
## Min. : 2.00 Min. :10.13 Min. :19.39 Min. :13.15
## 1st Qu.:17.00 1st Qu.:12.43 1st Qu.:23.28 1st Qu.:14.97
## Median :28.00 Median :13.15 Median :25.33 Median :15.48
## Mean :30.29 Mean :12.98 Mean :27.01 Mean :15.55
## 3rd Qu.:39.00 3rd Qu.:13.76 3rd Qu.:29.34 3rd Qu.:16.05
## Max. :80.00 Max. :14.71 Max. :43.44 Max. :17.82
BMW_EV_c3 <- subset(BMW_clus_EV_id_scale, grp_EV==3)
summary(BMW_EV_c3)
## vehicle_id a_temp_start kmh kwh_100km
## Min. : 3.00 Min. :14.58 Min. :18.77 Min. :11.63
## 1st Qu.:48.75 1st Qu.:15.65 1st Qu.:22.33 1st Qu.:13.69
## Median :59.50 Median :16.02 Median :25.06 Median :14.41
## Mean :54.32 Mean :16.23 Mean :25.48 Mean :14.50
## 3rd Qu.:68.25 3rd Qu.:16.49 3rd Qu.:28.47 3rd Qu.:15.34
## Max. :79.00 Max. :19.64 Max. :33.21 Max. :16.41
BMW_EV_c1$cluster <- 1
BMW_EV_c2$cluster <- 2
BMW_EV_c3$cluster <- 3
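The long-format data frame t_EV_long used in the next plot is not built in this excerpt. A minimal sketch follows, assuming it stacks the three cluster subsets and pivots the two features into a variables/score pair (the original post may have used the scaled scores rather than the raw values):
# Sketch (assumption): stack the cluster subsets and pivot the features into
# long format so each row holds one variable/score pair per EV
t_EV_long <- rbind(BMW_EV_c1, BMW_EV_c2, BMW_EV_c3) %>%
  select(cluster, a_temp_start, kmh) %>%
  pivot_longer(cols = c(a_temp_start, kmh),
               names_to = "variables", values_to = "score")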
Visualising the features of a_temp_start and kmh among the 3 clusters:
ggplot(t_EV_long, aes(x=variables, y=score, fill=factor(cluster))) +
geom_boxplot(position=position_dodge(0.8)) +
facet_wrap(~variables, scales = "free") +
scale_fill_discrete(name="Cluster") +
ggtitle("Variables role within each Clusters (Drive EVs)") +
theme(axis.text=element_text(size=12))
The analysis applies a cluster-and-predict approach to the 3 clusters, in order to understand how the different clusters impact energy consumption.
# Cluster 1
BMW_EV_c1_all <- subset(BMW_clus_EV_id, BMW_clus_EV_id$vehicle_id %in% unique(BMW_EV_c1$vehicle_id))
# Cluster 2
BMW_EV_c2_all <- subset(BMW_clus_EV_id, BMW_clus_EV_id$vehicle_id %in% unique(BMW_EV_c2$vehicle_id))
# Cluster 3
BMW_EV_c3_all <- subset(BMW_clus_EV_id, BMW_clus_EV_id$vehicle_id %in% unique(BMW_EV_c3$vehicle_id))
BMW_EV_c1_all$cluster <- "1"
BMW_EV_c2_all$cluster <- "2"
BMW_EV_c3_all$cluster <- "3"
BMW_EV_allc <- rbind(BMW_EV_c1_all, BMW_EV_c2_all, BMW_EV_c3_all)
BMW_EV_allc <- BMW_EV_allc[,c(5,2:4)]
Most of the features are not controllable by the EV drivers, apart from km/h. The analysis therefore uses km/h to portray the relationship with kWh/100km.
ggplot(BMW_EV_allc, aes(x=kmh, y=kwh_100km, group=factor(cluster))) +
geom_point(aes(colour=factor(cluster)), alpha=0.1) +
geom_smooth(aes(group=factor(cluster), colour=factor(cluster)), method="gam", formula = y~s(x), size=2.0, se=FALSE) +
geom_ribbon(stat='smooth', method = "gam", formula = y~s(x), se=TRUE, alpha=0.1, aes(color = NULL, group = factor(cluster))) + # to control the transparency of the SE
ylab("Electricity Consumption (kWh/100km)") +
xlab("km/hour") +
ggtitle("Clustered - EVs Electricity Consumption vs. km/h") +
theme(axis.text=element_text(size=12))
From the above regression, it seems that cluster 3 tends to have the lowest overall energy consumption. Using a regression tree, the features a_temp_start and kmh for cluster 3 are analysed.
BMW_EV_c3_rpart <- rpart(kwh_100km ~ a_temp_start + kmh, data=BMW_EV_c3_all)
BMW_EV_c3_party <- as.party(BMW_EV_c3_rpart)
plot(BMW_EV_c3_party, main="Cluster 3 EVs: kWh/100km vs. km/h & outdoor temperature")
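As an optional check (not part of the original post), the fitted splits can also be printed as plain-text rules with rpart.plot’s rpart.rules(), which makes the thresholds quoted in the conclusion easier to verify:
# Optional: print the regression tree splits for cluster 3 as text rules
rpart.rules(BMW_EV_c3_rpart)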
kWh/100km: Based on clustering with outdoor temperature and average driving speed, the energy consumption among the three clusters is fairly similar, with cluster 3 generally having the lowest energy consumption. Referring to the regression tree for cluster 3, in order to achieve a lower kWh/100km, the outdoor temperature should be above 11.75°C and the average speed should be no less than 22 km/h.