Abstract
This data processing & descriptive analytics work is part 1/3 of a series of highlights for the “Data-driven approach using machine learning techniques to analyse energy consumption & CO2 emission of corporate electric vehicles for the up-scaling of low carbon mobility Berlin”. Using 2 years driving data of 80 Electric Vehicles (EVs), this page describes the analytical highlights through data cleaning, manipulation & processing and feature selection on important features that impacts the energy consumptions of EVs. The analysis also highlights some descriptive analytics and deploys model building using Variable Partitioning method via Regression Tree. The Key Performance Indicator (KPI) used for this part of the analytics is kWh/100km. The analytics are conducted within the framework of German Aerospace Agency-DLR “Initiative Berlin-Brandenburg” project, which intends to stimulate the adoption of EVs among corporate fleets.
Part 2: EV carbon intensity & Smart Charging Strategies
Part 3: Machine Learning on 80 EVs driving behaviour
# General, data manipulation
library(dplyr)
library(lubridate)
library(ggplot2)
library(tidyr)
library(sp)
library(caTools)
library(Hmisc) # to extract no. of days in each month
# feature selection
library(caret)
library(MASS)
library(leaps) # regsubset()
library(Boruta)
# regression
library(corrplot)
library(reshape2)
library(car) # check multicollinearity
# regression Tree
library(partykit)
library(rpart)
library(rpart.plot)
library(randomForest)
Data preparation includes cleaning, manipulating and processing the raw data and set the stage for further analysis.
This analysis undertook the following data pre-processing steps, among others:
dpylr packages works with POSIXct)id for all the driving and charging eventsvin for the 80 EVsyear, month, day, hour, yeardaydistance, state of EV zustand (fahrt, laden, ruhe)Create new features for BMW_drive data such as duration, average speed kmh, temperature difference temp_diff, State-of-Charge Energy soc_energy (amount of kWh energy in the battery). soc_energyis generated from the battery State-of-Charge (100% is a fully charged battery, while 0% is an empty battery) multiplied with the full capacity of the battery (18.8kWh usable) of this electric car model.
This step also included removing implausible data through plausibility checking.
Using the calculated soc_energy feature to create the Key Performance Indicator (KPI) of kWh/100km for the EVs. Outliers (1.5 times of the top and bottom inter quartile range) for the drive event data are removed.
summary(BMW_drive$kwh_100km)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.17 11.62 14.12 15.08 18.75 29.38
Not all variables influence the KPI of kWh/100km. This section implements various feature selection techniques in order to identify the features that affects kWh/100km.
Firstly, in order to understand the variables, their correlation with the the KPI of kWh/100km is being plotted in the following diagram.
library(corrplot)
M <- cor(BMW_drive_reg_sig)
corrplot(M, method="circle")
Multicollinearity among the features is where two of more features are highly correlated in the regression model. It is necessary to check for multicollinearity among the features using variance inflation factor (VIF) for the features of the linear model. VIF estimate the increase of variance of a particular coefficient that is collinear with other features. It treats that feature as the output of linear regression model, with other features as input features. It then computes VIF using the formula \(VIF = 1/(1-R^{2})\). Using the package car, the square-root VIF of more than 4 is a suspect for collinearity.
sqrt(vif(BMW_drive_lm_noout))
## a_temp_start distance soc_diff duration kmh
## 1.068670 3.382226 629.967626 1.735048 1.477750
## temp_diff soc_energy
## 1.005526 629.929076
It seems that soc_diff and soc_energy have high collinearity, the analysis proceeds by removing these variables, as the state-of-charge of the battery directly indicates how much energy is consumed.
Based on the above undertaken steps, the features that would influence energy consumption are a_temp_start, distance, duration, kmh, temp_diff and soc_energy.
Start with empty model with no features selected, then perform linear regressions and pick the one with the highest \(R^{2}\) or \(AIC\). Keep adding features until all features are evaluated in the model and stop.
BMW_drive_lm_null_sig <- lm(kwh_100km ~ 1, data=BMW_drive_reg_sig)
BMW_drive_lm_full_sig <- lm(kwh_100km ~., data=BMW_drive_reg_sig)
step(BMW_drive_lm_null_sig, scope = list(lower=BMW_drive_lm_null_sig, upper=BMW_drive_lm_full_sig), data=BMW_drive_reg_sig, direction = "forward")
## Start: AIC=144776.6
## kwh_100km ~ 1
##
## Df Sum of Sq RSS AIC
## + a_temp_start 1 219416 931536 135364
## + kmh 1 33911 1117041 143447
## + distance 1 15417 1135535 144178
## + temp_diff 1 6329 1144623 144533
## + duration 1 4616 1146337 144600
## <none> 1150953 144777
##
## Step: AIC=135364.3
## kwh_100km ~ a_temp_start
##
## Df Sum of Sq RSS AIC
## + kmh 1 37141 894396 133555
## + distance 1 15772 915765 134606
## + duration 1 4538 926999 135149
## + temp_diff 1 3660 927876 135191
## <none> 931536 135364
##
## Step: AIC=133555.3
## kwh_100km ~ a_temp_start + kmh
##
## Df Sum of Sq RSS AIC
## + temp_diff 1 4278.9 890117 133344
## + duration 1 3620.7 890775 133377
## + distance 1 567.4 893828 133529
## <none> 894396 133555
##
## Step: AIC=133343.9
## kwh_100km ~ a_temp_start + kmh + temp_diff
##
## Df Sum of Sq RSS AIC
## + duration 1 3388.1 886729 133176
## + distance 1 482.7 889634 133322
## <none> 890117 133344
##
## Step: AIC=133176.1
## kwh_100km ~ a_temp_start + kmh + temp_diff + duration
##
## Df Sum of Sq RSS AIC
## + distance 1 1942.2 884786 133081
## <none> 886729 133176
##
## Step: AIC=133080.5
## kwh_100km ~ a_temp_start + kmh + temp_diff + duration + distance
##
## Call:
## lm(formula = kwh_100km ~ a_temp_start + kmh + temp_diff + duration +
## distance, data = BMW_drive_reg_sig)
##
## Coefficients:
## (Intercept) a_temp_start kmh temp_diff duration
## 21.40245 -0.28165 -0.07497 -0.13865 -0.02709
## distance
## 0.03460
Similar to the forward selection method, the backward selection method performs the evaluation backwards.
step(BMW_drive_lm_full_sig, data=BMW_drive_reg_sig, direction = "backward")
## Start: AIC=133080.5
## kwh_100km ~ a_temp_start + distance + duration + kmh + temp_diff
##
## Df Sum of Sq RSS AIC
## <none> 884786 133081
## - distance 1 1942 886729 133176
## - temp_diff 1 4081 888867 133283
## - duration 1 4848 889634 133322
## - kmh 1 26495 911281 134392
## - a_temp_start 1 220209 1104995 142971
##
## Call:
## lm(formula = kwh_100km ~ a_temp_start + distance + duration +
## kmh + temp_diff, data = BMW_drive_reg_sig)
##
## Coefficients:
## (Intercept) a_temp_start distance duration kmh
## 21.40245 -0.28165 0.03460 -0.02709 -0.07497
## temp_diff
## -0.13865
Using the stepAIC function for backward selection from the package MASS.
step = stepAIC(BMW_drive_lm_full_sig, direction="backward")
## Start: AIC=133080.5
## kwh_100km ~ a_temp_start + distance + duration + kmh + temp_diff
##
## Df Sum of Sq RSS AIC
## <none> 884786 133081
## - distance 1 1942 886729 133176
## - temp_diff 1 4081 888867 133283
## - duration 1 4848 889634 133322
## - kmh 1 26495 911281 134392
## - a_temp_start 1 220209 1104995 142971
summary(step)
##
## Call:
## lm(formula = kwh_100km ~ a_temp_start + distance + duration +
## kmh + temp_diff, data = BMW_drive_reg_sig)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9830 -2.6703 -0.2537 2.4085 21.4637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.402451 0.073304 291.970 <2e-16 ***
## a_temp_start -0.281646 0.002676 -105.244 <2e-16 ***
## distance 0.034597 0.003500 9.884 <2e-16 ***
## duration -0.027090 0.001735 -15.615 <2e-16 ***
## kmh -0.074966 0.002054 -36.506 <2e-16 ***
## temp_diff -0.138649 0.009678 -14.327 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.459 on 44504 degrees of freedom
## Multiple R-squared: 0.2313, Adjusted R-squared: 0.2312
## F-statistic: 2678 on 5 and 44504 DF, p-value: < 2.2e-16
From both the forward and backward selection stepwise regression method, seems like the features selected in the model is already optimum.
By using the bothways method (forward and backward together), it also shows that the features in the model is already optimum.
regsubsets methodThe regsubsets method from the package leaps performs an exhaustive search for the best subsets of the features in predicting the kWh/100km using linear regression.
library(leaps)
leaps <- regsubsets(kwh_100km ~ ., data=BMW_drive_reg_sig)
par(mfrow=c(1,2))
plot(leaps, scale="adjr2")
plot(leaps, scale="bic")
The regsubsets diagnostics:
Boruta is a package that compares attribute importance randomly, using a wrapper algorithm based on random forest.
set.seed(8)
boruta.train <- Boruta(kwh_100km~., data = BMW_drive_reg_sig[1:1000,], doTrace = 2)
# limit to 1000 rows for time performance purpose, if run on whole data set it took 4+ mins
# credits: https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/
print(boruta.train)
## Boruta performed 9 iterations in 4.853066 secs.
## 5 attributes confirmed important: a_temp_start, distance,
## duration, kmh, temp_diff;
## No attributes deemed unimportant.
plot(boruta.train, xlab="", xaxt = "n", main="Features Importances for Drive data")
lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
names(lz) <- colnames(boruta.train$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels), at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.8)
boruta.df <- attStats(boruta.train)
boruta.df
## meanImp medianImp minImp maxImp normHits decision
## a_temp_start 65.223388 64.996743 62.375443 67.320232 1 Confirmed
## distance 20.672224 20.807480 18.605221 22.297023 1 Confirmed
## duration 13.962697 13.960566 12.580729 16.953697 1 Confirmed
## kmh 32.761046 32.627911 31.389030 35.023310 1 Confirmed
## temp_diff 5.133179 4.702952 3.860697 7.080101 1 Confirmed
The features selection section concludes by deeming the a_temp_start, distance, soc_diff, duration, kmh and temp_diff variables as important to energy consumption kwh/100km of the EVs.
ggplot(BMW_drive, aes(x=time_start, y=kwh_100km)) +
geom_point(aes(colour=factor(REX_BEV)), alpha=0.3) +
stat_smooth(method = "gam", formula = y~s(x), colour = "darkblue", size=1.0) +
ylab("Electricity Consumption (kWh/100km)") +
xlab("Date") +
ggtitle("EVs Electricity Consumption") +
scale_fill_discrete(name="EV type:") +
theme(legend.position="top") +
theme(axis.text=element_text(size=12))
Energy consumption of EVs seems to be influenced by seasonality: Colder months have higher
kWh/100km while warmer months have lower kWh/100km. Also, REX are more part of the data set towards 2016 of the project time frame.
Diving deeper into the seasonality (i.e outdoor temperature) influence on energy consumption. There seems to be a range of outdoor temperature that energy consumption is lower, approx. 17-25°C.
BMW_drive_avg_cons_month <- BMW_drive %>%
group_by(month) %>%
summarise(avg_cons = round(mean(kwh_100km),2))
ggplot(BMW_drive_avg_cons_month, aes(x=factor(month), y=avg_cons, fill=avg_cons)) +
geom_bar(stat="identity") +
geom_text(aes(label=avg_cons), vjust=-0.25) +
theme(legend.position="none") +
ggtitle("All EVs average electricity consumption in months (kWh/100km)") +
scale_x_discrete("Month") +
scale_y_continuous("Average Consupmtion (kWh/100km)") +
theme(legend.position="none") +
theme(axis.text=element_text(size=12))
Using the average evergy consumption of the 80 EVs, the monthly consumption follows the similar seasonal pattern where the milder and warmer months have lower
kWh/100km, compared to the colder months. Heating during the colder months plays a significant role in energy consumption of EVs?
Analysis to understand whether average driving speed influences kWh/100km.
ggplot(BMW_drive, aes(x=kmh, y=kwh_100km)) +
#geom_jitter(alpha=0.1, color="#880011") +
geom_point(aes(colour=factor(REX_BEV)), alpha=0.3) +
stat_smooth(method = "gam", formula = y~s(x), colour = "darkblue", size=1.0) +
ylab("Electricity Consumption (kWh/100km)") +
xlab("km/hour") +
ggtitle("EVs Electricity Consumption vs. km/h") +
scale_fill_discrete(name="EV type:") +
theme(legend.position="top") +
theme(axis.text=element_text(size=12))
There seems to be a range of average driving speed, i.e. 30-65 km/h that have relatively lower energy consumption.
##
## Regression tree:
## rpart(formula = kwh_100km ~ ., data = BMW_drive_reg_sig, method = "anova")
##
## Variables actually used in tree construction:
## [1] a_temp_start kmh
##
## Root node error: 1150953/44510 = 25.858
##
## n= 44510
##
## CP nsplit rel error xerror xstd
## 1 0.180878 0 1.00000 1.00005 0.0070464
## 2 0.031326 1 0.81912 0.81917 0.0065714
## 3 0.029312 2 0.78780 0.78788 0.0061579
## 4 0.019505 3 0.75848 0.75902 0.0060355
## 5 0.010000 4 0.73898 0.74129 0.0060254
The R-square is highest and the xerror is lowest when the number of splits are 4. No pruning used for the regression tree in this case.
tree.reg.party <- as.party(tree.reg)
plot(tree.reg.party, main="Regression Tree on all Significant Features on kWh/100km")
The regression tree shows that
kWh/100km is lowest when a_temp_start is more than 11.75°C and the kmh is more than 15km/h.