Electric Vehicles Analytics - part (1/3) - Driving data analytics of 80 EVs

Data Processing, Feature Selection, Descriptive & Regression Tree on EV driving data

Jian Lee

13/04/2017

Abstract

This data processing & descriptive analytics work is part 1/3 of a series of highlights for the “Data-driven approach using machine learning techniques to analyse energy consumption & CO2 emission of corporate electric vehicles for the up-scaling of low carbon mobility Berlin”. Using 2 years driving data of 80 Electric Vehicles (EVs), this page describes the analytical highlights through data cleaning, manipulation & processing and feature selection on important features that impacts the energy consumptions of EVs. The analysis also highlights some descriptive analytics and deploys model building using Variable Partitioning method via Regression Tree. The Key Performance Indicator (KPI) used for this part of the analytics is kWh/100km. The analytics are conducted within the framework of German Aerospace Agency-DLR “Initiative Berlin-Brandenburg” project, which intends to stimulate the adoption of EVs among corporate fleets.

Part 2: EV carbon intensity & Smart Charging Strategies
Part 3: Machine Learning on 80 EVs driving behaviour

1 Notes and assumptions

Data set consists of 80 unique vehicles
Charging time calculated using 3.3kW (standard chargers), 10kW (fast chargers)
Electricity consumption and production for 2016 uses the similar data for 2015
BMW i3 comes in 2 different models in this data set - Range extender (REX) and Battery electric vehicle (BEV). The analysis seperates REX and BEV for more precise results
In order to differentiate standard and fast charger, the categorization of charger types are distinguished by using the “charge rate”" of the charge events, i.e. charge events that are > 3.3 kW are designated as using fast charger
Note that not all drive and charge events were recorded. The analysis tried best to consider and minimise these lost of data. Nevertheless, the results of the analysis have to consider the point that some data are not recorded, which would have an influence to the result of the analysis

# General, data manipulation
library(dplyr)
library(lubridate)
library(ggplot2)
library(tidyr)
library(sp)
library(caTools)
library(Hmisc) # to extract no. of days in each month

# feature selection
library(caret) 
library(MASS) 
library(leaps) # regsubset()
library(Boruta)

# regression
library(corrplot)
library(reshape2)
library(car) # check multicollinearity

# regression Tree
library(partykit)
library(rpart)
library(rpart.plot)
library(randomForest)

2 Data Preparation

Data preparation includes cleaning, manipulating and processing the raw data and set the stage for further analysis.

2.1 Data cleaning & pre-processing

This analysis undertook the following data pre-processing steps, among others:

Convert time stamp data from POSIXlt to POSIXct format (dpylr packages works with POSIXct)
Create sequential/unique event id for all the driving and charging events
Assign vehicles labelling id vin for the 80 EVs
Create time-related features such as year, month, day, hour, yearday
Create features such as distance, state of EV zustand (fahrt, laden, ruhe)
Remove implausible event data

2.2 Feature Engineering

Create new features for BMW_drive data such as duration, average speed kmh, temperature difference temp_diff, State-of-Charge Energy soc_energy (amount of kWh energy in the battery). soc_energyis generated from the battery State-of-Charge (100% is a fully charged battery, while 0% is an empty battery) multiplied with the full capacity of the battery (18.8kWh usable) of this electric car model.

This step also included removing implausible data through plausibility checking.

2.3 Create KPI kWh/100km

Using the calculated soc_energy feature to create the Key Performance Indicator (KPI) of kWh/100km for the EVs. Outliers (1.5 times of the top and bottom inter quartile range) for the drive event data are removed.

summary(BMW_drive$kwh_100km)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.17   11.62   14.12   15.08   18.75   29.38

3 Feature/variable selection

Not all variables influence the KPI of kWh/100km. This section implements various feature selection techniques in order to identify the features that affects kWh/100km.

3.1 Visualise correlation of variables

Firstly, in order to understand the variables, their correlation with the the KPI of kWh/100km is being plotted in the following diagram.

library(corrplot)
M <- cor(BMW_drive_reg_sig)
corrplot(M, method="circle")

3.2 Multicollinearity of features

Multicollinearity among the features is where two of more features are highly correlated in the regression model. It is necessary to check for multicollinearity among the features using variance inflation factor (VIF) for the features of the linear model. VIF estimate the increase of variance of a particular coefficient that is collinear with other features. It treats that feature as the output of linear regression model, with other features as input features. It then computes VIF using the formula \(VIF = 1/(1-R^{2})\). Using the package car, the square-root VIF of more than 4 is a suspect for collinearity.

sqrt(vif(BMW_drive_lm_noout))

## a_temp_start     distance     soc_diff     duration          kmh 
##     1.068670     3.382226   629.967626     1.735048     1.477750 
##    temp_diff   soc_energy 
##     1.005526   629.929076

It seems that soc_diff and soc_energy have high collinearity, the analysis proceeds by removing these variables, as the state-of-charge of the battery directly indicates how much energy is consumed.

3.3 Feature selection methods

3.3.1 Step 1: Use domain knowledge to remove variables that do not influence kWh/100

Based on the above undertaken steps, the features that would influence energy consumption are a_temp_start, distance, duration, kmh, temp_diff and soc_energy.

3.3.2 Step 2: apply Stepwise regression methods

3.3.2.1 Forward selection method

Start with empty model with no features selected, then perform linear regressions and pick the one with the highest \(R^{2}\) or \(AIC\). Keep adding features until all features are evaluated in the model and stop.

BMW_drive_lm_null_sig <- lm(kwh_100km ~ 1, data=BMW_drive_reg_sig)
BMW_drive_lm_full_sig <- lm(kwh_100km ~., data=BMW_drive_reg_sig)
step(BMW_drive_lm_null_sig, scope = list(lower=BMW_drive_lm_null_sig, upper=BMW_drive_lm_full_sig), data=BMW_drive_reg_sig, direction = "forward")

## Start:  AIC=144776.6
## kwh_100km ~ 1
## 
##                Df Sum of Sq     RSS    AIC
## + a_temp_start  1    219416  931536 135364
## + kmh           1     33911 1117041 143447
## + distance      1     15417 1135535 144178
## + temp_diff     1      6329 1144623 144533
## + duration      1      4616 1146337 144600
## <none>                      1150953 144777
## 
## Step:  AIC=135364.3
## kwh_100km ~ a_temp_start
## 
##             Df Sum of Sq    RSS    AIC
## + kmh        1     37141 894396 133555
## + distance   1     15772 915765 134606
## + duration   1      4538 926999 135149
## + temp_diff  1      3660 927876 135191
## <none>                   931536 135364
## 
## Step:  AIC=133555.3
## kwh_100km ~ a_temp_start + kmh
## 
##             Df Sum of Sq    RSS    AIC
## + temp_diff  1    4278.9 890117 133344
## + duration   1    3620.7 890775 133377
## + distance   1     567.4 893828 133529
## <none>                   894396 133555
## 
## Step:  AIC=133343.9
## kwh_100km ~ a_temp_start + kmh + temp_diff
## 
##            Df Sum of Sq    RSS    AIC
## + duration  1    3388.1 886729 133176
## + distance  1     482.7 889634 133322
## <none>                  890117 133344
## 
## Step:  AIC=133176.1
## kwh_100km ~ a_temp_start + kmh + temp_diff + duration
## 
##            Df Sum of Sq    RSS    AIC
## + distance  1    1942.2 884786 133081
## <none>                  886729 133176
## 
## Step:  AIC=133080.5
## kwh_100km ~ a_temp_start + kmh + temp_diff + duration + distance

## 
## Call:
## lm(formula = kwh_100km ~ a_temp_start + kmh + temp_diff + duration + 
##     distance, data = BMW_drive_reg_sig)
## 
## Coefficients:
##  (Intercept)  a_temp_start           kmh     temp_diff      duration  
##     21.40245      -0.28165      -0.07497      -0.13865      -0.02709  
##     distance  
##      0.03460

3.3.2.2 Backward selection method (a)

Similar to the forward selection method, the backward selection method performs the evaluation backwards.

step(BMW_drive_lm_full_sig, data=BMW_drive_reg_sig, direction = "backward")

## Start:  AIC=133080.5
## kwh_100km ~ a_temp_start + distance + duration + kmh + temp_diff
## 
##                Df Sum of Sq     RSS    AIC
## <none>                       884786 133081
## - distance      1      1942  886729 133176
## - temp_diff     1      4081  888867 133283
## - duration      1      4848  889634 133322
## - kmh           1     26495  911281 134392
## - a_temp_start  1    220209 1104995 142971

## 
## Call:
## lm(formula = kwh_100km ~ a_temp_start + distance + duration + 
##     kmh + temp_diff, data = BMW_drive_reg_sig)
## 
## Coefficients:
##  (Intercept)  a_temp_start      distance      duration           kmh  
##     21.40245      -0.28165       0.03460      -0.02709      -0.07497  
##    temp_diff  
##     -0.13865

3.3.2.3 Backward selection method (b)

Using the stepAIC function for backward selection from the package MASS.

step = stepAIC(BMW_drive_lm_full_sig, direction="backward")

## Start:  AIC=133080.5
## kwh_100km ~ a_temp_start + distance + duration + kmh + temp_diff
## 
##                Df Sum of Sq     RSS    AIC
## <none>                       884786 133081
## - distance      1      1942  886729 133176
## - temp_diff     1      4081  888867 133283
## - duration      1      4848  889634 133322
## - kmh           1     26495  911281 134392
## - a_temp_start  1    220209 1104995 142971

summary(step)

## 
## Call:
## lm(formula = kwh_100km ~ a_temp_start + distance + duration + 
##     kmh + temp_diff, data = BMW_drive_reg_sig)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.9830  -2.6703  -0.2537   2.4085  21.4637 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  21.402451   0.073304  291.970   <2e-16 ***
## a_temp_start -0.281646   0.002676 -105.244   <2e-16 ***
## distance      0.034597   0.003500    9.884   <2e-16 ***
## duration     -0.027090   0.001735  -15.615   <2e-16 ***
## kmh          -0.074966   0.002054  -36.506   <2e-16 ***
## temp_diff    -0.138649   0.009678  -14.327   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.459 on 44504 degrees of freedom
## Multiple R-squared:  0.2313, Adjusted R-squared:  0.2312 
## F-statistic:  2678 on 5 and 44504 DF,  p-value: < 2.2e-16

From both the forward and backward selection stepwise regression method, seems like the features selected in the model is already optimum.

3.3.2.4 Both ways

By using the bothways method (forward and backward together), it also shows that the features in the model is already optimum.

3.3.2.5 `regsubsets` method

The regsubsets method from the package leaps performs an exhaustive search for the best subsets of the features in predicting the kWh/100km using linear regression.

library(leaps)
leaps <- regsubsets(kwh_100km ~ ., data=BMW_drive_reg_sig)
par(mfrow=c(1,2)) 
plot(leaps, scale="adjr2")
plot(leaps, scale="bic")

The regsubsets diagnostics:

black square indicates that a variable is included in model, white indicates not
model left shows R2, the more means more variance is explained by model
model right shows BIC (Bayesian Information Criterion), smaller BIC the variables gives the model a better fit
looking at y-axis, it seems the top 3 or 4 models have roughly same R2 and BIC-> possible discrepency in result

3.3.3 Step 3: Use feature selection wrapper algorithm

Boruta is a package that compares attribute importance randomly, using a wrapper algorithm based on random forest.

set.seed(8)
boruta.train <- Boruta(kwh_100km~., data = BMW_drive_reg_sig[1:1000,], doTrace = 2) 
# limit to 1000 rows for time performance purpose, if run on whole data set it took 4+ mins
# credits: https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/
print(boruta.train)

## Boruta performed 9 iterations in 4.853066 secs.
##  5 attributes confirmed important: a_temp_start, distance,
## duration, kmh, temp_diff;
##  No attributes deemed unimportant.

plot(boruta.train, xlab="", xaxt = "n", main="Features Importances for Drive data")
lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
names(lz) <- colnames(boruta.train$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels), at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.8)

boruta.df <- attStats(boruta.train)
boruta.df

##                meanImp medianImp    minImp    maxImp normHits  decision
## a_temp_start 65.223388 64.996743 62.375443 67.320232        1 Confirmed
## distance     20.672224 20.807480 18.605221 22.297023        1 Confirmed
## duration     13.962697 13.960566 12.580729 16.953697        1 Confirmed
## kmh          32.761046 32.627911 31.389030 35.023310        1 Confirmed
## temp_diff     5.133179  4.702952  3.860697  7.080101        1 Confirmed

The features selection section concludes by deeming the a_temp_start, distance, soc_diff, duration, kmh and temp_diff variables as important to energy consumption kwh/100km of the EVs.

4 Drive data - Descriptive Analysis

4.1 Time Series analysis for kWh/100km vs. Date

ggplot(BMW_drive, aes(x=time_start, y=kwh_100km)) +
  geom_point(aes(colour=factor(REX_BEV)), alpha=0.3) +
  stat_smooth(method = "gam", formula = y~s(x), colour = "darkblue", size=1.0) +
  ylab("Electricity Consumption (kWh/100km)") + 
  xlab("Date") +
  ggtitle("EVs Electricity Consumption") +
  scale_fill_discrete(name="EV type:") +
  theme(legend.position="top") +
  theme(axis.text=element_text(size=12))

Energy consumption of EVs seems to be influenced by seasonality: Colder months have higher kWh/100km while warmer months have lower kWh/100km. Also, REX are more part of the data set towards 2016 of the project time frame.

4.2 kWh/100km vs. outdoor temperature

Diving deeper into the seasonality (i.e outdoor temperature) influence on energy consumption. There seems to be a range of outdoor temperature that energy consumption is lower, approx. 17-25°C.

4.3 Aggregated monthly average electricity consumption for all EVs

BMW_drive_avg_cons_month <- BMW_drive %>% 
  group_by(month) %>% 
  summarise(avg_cons = round(mean(kwh_100km),2))
  
ggplot(BMW_drive_avg_cons_month, aes(x=factor(month), y=avg_cons, fill=avg_cons)) +
  geom_bar(stat="identity") +
  geom_text(aes(label=avg_cons), vjust=-0.25) +
  theme(legend.position="none") +
  ggtitle("All EVs average electricity consumption in months (kWh/100km)") +
  scale_x_discrete("Month") +
  scale_y_continuous("Average Consupmtion (kWh/100km)") +
  theme(legend.position="none") +
  theme(axis.text=element_text(size=12))

Using the average evergy consumption of the 80 EVs, the monthly consumption follows the similar seasonal pattern where the milder and warmer months have lower kWh/100km, compared to the colder months. Heating during the colder months plays a significant role in energy consumption of EVs?

4.4 Speed(kmh) vs kWh/100km

Analysis to understand whether average driving speed influences kWh/100km.

ggplot(BMW_drive, aes(x=kmh, y=kwh_100km)) +
  #geom_jitter(alpha=0.1, color="#880011") +
  geom_point(aes(colour=factor(REX_BEV)), alpha=0.3) +
  stat_smooth(method = "gam", formula = y~s(x), colour = "darkblue", size=1.0) +
  ylab("Electricity Consumption (kWh/100km)") + 
  xlab("km/hour") +
  ggtitle("EVs Electricity Consumption vs. km/h") +
  scale_fill_discrete(name="EV type:") +
  theme(legend.position="top") +
  theme(axis.text=element_text(size=12))

There seems to be a range of average driving speed, i.e. 30-65 km/h that have relatively lower energy consumption.

5 Drive data - Regression Tree

## 
## Regression tree:
## rpart(formula = kwh_100km ~ ., data = BMW_drive_reg_sig, method = "anova")
## 
## Variables actually used in tree construction:
## [1] a_temp_start kmh         
## 
## Root node error: 1150953/44510 = 25.858
## 
## n= 44510 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.180878      0   1.00000 1.00005 0.0070464
## 2 0.031326      1   0.81912 0.81917 0.0065714
## 3 0.029312      2   0.78780 0.78788 0.0061579
## 4 0.019505      3   0.75848 0.75902 0.0060355
## 5 0.010000      4   0.73898 0.74129 0.0060254

The R-square is highest and the xerror is lowest when the number of splits are 4. No pruning used for the regression tree in this case.

tree.reg.party <- as.party(tree.reg)
plot(tree.reg.party, main="Regression Tree on all Significant Features on kWh/100km")

The regression tree shows that kWh/100km is lowest when a_temp_start is more than 11.75°C and the kmh is more than 15km/h.