This document applies a variety of Data Science techniques to gain insight into patterns of electricity consumption, and other customer data, that may indicate ownership of electric vehicles. Market research has identified 70,000 Electric Vehicle (EV) customers; however, the client company is responsible for only 10% of those customers. Only 2,400 EV customers have been identified within the client’s database, because they are on an EV tariff. The remaining 67,600 customers may own an EV but are not on an EV tariff. The task of this document is to find the rest of the suspected EV customers. The metrics for success are the R^2, RMSE, and MAE accuracy of the EV ownership Machine Learning predictions.
In this section, the Working Directory of RStudio is set to the directory with the source files.
# The "Set Working Directory" R language function is used to set the working directory to the directory with the source files.
# setwd("C:/Users/johna/Dropbox/Programming/Science")
# The "Get Working Directory" function is used to verify that the working directory has been set to the right location.
# getwd()
The datasets used for this document are “Data_1.xlsx” and “Data_2.xlsx”, which are combined into “Data_Combined”.
# Import the datasets for analysis
# read_excel() from the readxl package reads Excel spreadsheets
library(readxl)
Data_1 <- read_excel("Data_1.xlsx", sheet = 2)
Data_2 <- read_excel("Data_2.xlsx", sheet = 2)
# Combines the two imported datasets
Data_Combined <- rbind(Data_1, Data_2)
# Convert the "EV" variable to numeric, then convert values
# value from "1-2" to "0-1"
Data_Combined$EV <- as.numeric(as.factor(Data_Combined$EV))
Data_Combined$EV[Data_Combined$EV==1] <- 0
Data_Combined$EV[Data_Combined$EV==2] <- 1
# Creates datasets of separate "EV Yes" and "EV Unknown" data
EV_Yes <- Data_Combined[Data_Combined$EV==1,]
EV_Unknown <- Data_Combined[Data_Combined$EV==0,]
Explanation of dataset variables:

bill_1_2021 – Usage data in kWh for the first bill (Jan + Feb) of 2021; similarly, bill_2_2021 is the second bill of 2021, bill_1_2023 is the first bill of 2023, and so on. (If a customer joined mid-2021, that customer will not have the first few billing entries for 2021.)

In total there are 13 columns of billing data: six from 2021, six from 2022, and one from 2023.

EV_New_or_Old – Whether the customer was previously an X-Company customer who switched to the EV tariff at some point, or a new customer who opted for the EV tariff.

StartDate – Start date of the customer.

agedBand – Age band of the customer.

signedUpGroup – Channel through which the customer was acquired, such as Field Sales, Telesales, Web, etc.

title – Mr/Mrs/Miss, etc.

mosaicType – Customer segmentation data.

EV – Whether the customer is an EV customer or not (Y/N).

ContractStartDateEV – Start date of the EV tariff contract; populated only for EV customers, and helps to identify when the customer switched to the EV tariff.

contractStartDate – Start date of the contract in general.

contractEndDate – End date of the contract in general.
Rows | Columns
---|---
186558 | 25
## Table 3. Structure of the Dataset
## tibble [186,558 × 25] (S3: tbl_df/tbl/data.frame)
## $ accountID : num [1:186558] 0 1 2 3 4 5 6 7 8 9 ...
## $ StartDate : POSIXct[1:186558], format: "2022-05-03" "2018-03-09" ...
## $ ContractStartDateEV: POSIXct[1:186558], format: NA NA ...
## $ contractStartDate : POSIXct[1:186558], format: "2022-05-03" "2023-03-14" ...
## $ contractEndDate : POSIXct[1:186558], format: "2023-05-03" "2024-03-14" ...
## $ saStatus : chr [1:186558] "Active" "Active" "Active" "Active" ...
## $ agedBand : chr [1:186558] "41 to 50" "Unclassified" "Unclassified" "41 to 50" ...
## $ signedUpGroup : chr [1:186558] "PCW" "Telesales" "Other" "Other" ...
## $ title : chr [1:186558] "Mr" "Ms" "Mr" "Mrs" ...
## $ mosaicType : chr [1:186558] "N/A" "F17" "G18" "K31" ...
## $ EV : num [1:186558] 0 0 0 0 0 0 0 0 0 0 ...
## $ EV_New_or_Old : chr [1:186558] "Non-EV Customers" "Non-EV Customers" "Non-EV Customers" "Non-EV Customers" ...
## $ bill_1_2021 : num [1:186558] 0 1278 0 0 670 ...
## $ bill_2_2021 : num [1:186558] 0 1008 0 0 736 ...
## $ bill_3_2021 : num [1:186558] 0 1278 0 428 452 ...
## $ bill_4_2021 : num [1:186558] 0 926 0 348 663 ...
## $ bill_5_2021 : num [1:186558] 0 1131 0 343 265 ...
## $ bill_6_2021 : num [1:186558] 0 1081 0 456 313 ...
## $ bill_1_2022 : num [1:186558] 0 1200 381 470 497 0 0 375 0 883 ...
## $ bill_2_2022 : num [1:186558] 0 1179 706 381 183 ...
## $ bill_3_2022 : num [1:186558] 581 1182 629 364 250 ...
## $ bill_4_2022 : num [1:186558] 542 985 681 333 234 0 770 227 0 538 ...
## $ bill_5_2022 : num [1:186558] 696 917 848 397 381 90 921 219 0 544 ...
## $ bill_6_2022 : num [1:186558] 0 983 1455 407 358 ...
## $ bill_1_2023 : num [1:186558] 2813 1130 0 526 233 ...
The following table displays the standard deviation, mean, minimum, and maximum values of the dataset’s explanatory billing variables, in kWh. The objective is to determine whether normalization of the variable ranges is required.
Variable | Standard Deviation | Mean | Min | Max
---|---|---|---|---
bill_1_2021 | 718.0940 | 466.9171 | 0 | 8914 |
bill_2_2021 | 808.2985 | 562.1151 | 0 | 9599 |
bill_3_2021 | 667.4849 | 513.7302 | 0 | 9191 |
bill_4_2021 | 562.3975 | 466.3544 | 0 | 9654 |
bill_5_2021 | 534.6123 | 463.4131 | 0 | 9499 |
bill_6_2021 | 635.0939 | 577.3035 | 0 | 8376 |
bill_1_2022 | 786.4653 | 725.4220 | 0 | 9361 |
bill_2_2022 | 741.1136 | 677.3660 | 0 | 9551 |
bill_3_2022 | 640.5935 | 651.4517 | 0 | 9840 |
bill_4_2022 | 519.7540 | 564.7194 | 0 | 8843 |
bill_5_2022 | 485.2380 | 568.5173 | 0 | 8643 |
bill_6_2022 | 557.3149 | 688.3538 | 0 | 9117 |
bill_1_2023 | 758.6266 | 920.3199 | 0 | 9333 |
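Such a table can be produced directly from the data; a minimal sketch, assuming the billing columns are selected by their "bill_" name prefix:

# Select the 13 billing columns by name prefix
bill_cols <- grep("^bill_", names(Data_Combined), value = TRUE)
# Compute the four summary statistics for each billing column
data.frame(
  Standard_Deviation = sapply(Data_Combined[bill_cols], sd, na.rm = TRUE),
  Mean = sapply(Data_Combined[bill_cols], mean, na.rm = TRUE),
  Min  = sapply(Data_Combined[bill_cols], min, na.rm = TRUE),
  Max  = sapply(Data_Combined[bill_cols], max, na.rm = TRUE)
)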
Bar chart of the dependent variable, “EV”, used to determine a regression strategy in relation to the independent variables.
Bar chart of the frequencies of Signed Up Group membership.
Bar chart of the frequencies of agedBand categories.
Bar chart of the frequencies of titles.
Bar chart of the frequencies of Mosaic Type (Segmentation Group).
Bar chart of the frequencies of EV_New_or_Old.
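These charts can be generated with base R graphics; a minimal sketch for the “EV” chart, using the 0/1 coding established above:

# Frequency bar chart of the dependent variable
barplot(table(Data_Combined$EV),
        names.arg = c("EV Unknown", "EV Yes"),
        ylab = "Frequency", main = "EV Ownership")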
The relationship between the dependent variable, “EV”, and the independent variable, “agedBand”, can be surmised from the following table.
18 & Under | 19 to 24 | 25 to 40 | 41 to 50 | 51 to 60 | 61 to 70 | 70+ | Unclassified | |
---|---|---|---|---|---|---|---|---|
0 | 8 | 647 | 19417 | 19394 | 15110 | 10747 | 8224 | 110489 |
1 | 0 | 2 | 215 | 314 | 231 | 131 | 61 | 1562 |
The relationship between the independent variables “signedUpGroup” and “agedBand” can be surmised from the following table.
## Table 6. signedUpGroup Per agedBand
##
## 18 & Under 19 to 24 25 to 40 41 to 50 51 to 60 61 to 70 70+
## Field Sales 1 63 2437 3623 3259 2592 2486
## OBS 0 6 435 850 872 654 379
## Other 2 223 5699 4334 3220 2244 1684
## PCW 0 44 2650 3717 2909 1841 1189
## Telesales 3 157 4369 3896 3188 2439 1888
## Web 2 156 4042 3288 1893 1108 659
##
## Unclassified
## Field Sales 21457
## OBS 4879
## Other 26117
## PCW 19140
## Telesales 23741
## Web 16717
The following plots display the time series of billing data.
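A minimal sketch of such a plot, assuming the 13 billing columns are taken in chronological order:

# Select the billing columns and plot each customer's series as a line
bill_cols <- grep("^bill_", names(Data_Combined), value = TRUE)
matplot(t(Data_Combined[1:5, bill_cols]), type = "l",
        xlab = "Billing Period", ylab = "Usage (kWh)",
        main = "Billing Time Series of the First 5 Customers")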
The following dendrogram clusters the first 100 customers into groups of similar time series, with the objective of finding the clusters with a high propensity for buying EVs.
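A minimal sketch of the clustering behind the dendrogram, assuming Euclidean distance between the billing series and complete-linkage hierarchical clustering:

# Cluster the first 100 customers by similarity of their billing series
bill_cols <- grep("^bill_", names(Data_Combined), value = TRUE)
hc <- hclust(dist(Data_Combined[1:100, bill_cols]), method = "complete")
plot(hc, main = "Dendrogram of the First 100 Customers")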
The Shapiro-Wilk test is a way to tell whether a random sample comes from a Normal distribution. The test gives a W statistic; values near 1 are consistent with normality, while small values indicate the sample is not normally distributed. Ho: the data are from a Normal distribution. Small p-values reject Ho.
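The call behind Table 7 takes the following form (R’s shapiro.test() accepts at most 5,000 observations, hence the subset):

# Shapiro-Wilk test on the first 5,000 values of bill_1_2021
shapiro.test(Data_Combined$bill_1_2021[1:5000])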
## Table 7. Shapiro-Wilk Test of Distribution Normality
## Ho: Data is Normally Distributed.
##
## Shapiro-Wilk normality test
##
## data: Data_Combined$bill_1_2021[1:5000]
## W = 0.68375, p-value < 2.2e-16
## The Shapiro-Wilk p-value for the bill_1_2021 distribution is < 0.05: not normally distributed
Skewness is a measure of asymmetry. Negative skewness indicates that the mean is less than the median (a left-skewed distribution); positive skewness indicates that the mean is larger than the median (a right-skewed distribution). As a rule of thumb, the skew is substantial when its absolute value exceeds 1.0.
## The second moment of bill_1_2021 = 515656.3
## The third moment of bill_1_2021 = 812844985
## The skewness of bill_1_2021 = 6.3438
## Skew of the bill_1_2021 distribution = 2.195 is skewed to the right.
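A minimal sketch of the moment-based skewness computation reported above, using base R (skewness = m3 / m2^(3/2)):

x <- Data_Combined$bill_1_2021
m2 <- mean((x - mean(x))^2)  # second central moment
m3 <- mean((x - mean(x))^3)  # third central moment
m3 / m2^1.5                  # moment-based skewness, approx. 2.195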
The quantile-quantile plot is a graphical tool to help assess if data plausibly came from a theoretical distribution such as Normal or Exponential.
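A minimal sketch of a normal Q-Q plot for one of the billing variables:

# Compare the sample quantiles of bill_1_2021 to Normal quantiles
qqnorm(Data_Combined$bill_1_2021)
qqline(Data_Combined$bill_1_2021, col = "red")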
Under homoskedasticity, the residuals are equally scattered across the range of fitted values, while under heteroskedasticity they are not. If there is no heteroskedasticity at all, the plot shows a random, even distribution of points throughout the range of the x-axis and a flat red line.
## Figures 12 - 15. Linear Model to Test for Heteroskedasticity
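Such a diagnostic can be reproduced with the residuals-versus-fitted plot of a linear model; a minimal sketch, where the choice of a single billing predictor is a hypothetical simplification:

# Residuals vs fitted values, with a red smoothing line
fit <- lm(EV ~ bill_1_2021, data = Data_Combined)
plot(fit, which = 1)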
The following code section converts the date columns to numeric, and transforms the categorical columns from the character class to the factor class and then to numeric, for statistical regression analysis.
Data_Combined$StartDate <- as.numeric(Data_Combined$StartDate)
Data_Combined$ContractStartDateEV <- as.numeric(Data_Combined$ContractStartDateEV)
Data_Combined$contractStartDate <- as.numeric(Data_Combined$contractStartDate)
Data_Combined$contractEndDate <- as.numeric(Data_Combined$contractEndDate)
Data_Combined$agedBand <- as.numeric(as.factor(Data_Combined$agedBand))
Data_Combined$signedUpGroup <- as.numeric(as.factor(Data_Combined$signedUpGroup))
Data_Combined$title <- as.numeric(as.factor(Data_Combined$title))
Data_Combined$mosaicType <- as.numeric(as.factor(Data_Combined$mosaicType))
Data_Combined$EV_New_or_Old <- as.numeric(as.factor(Data_Combined$EV_New_or_Old))
# Remove saStatus column that only has one value "Active"
Data_Combined <- Data_Combined[-6]
This section examines the dependent variable for outliers, which are ultimately retained, and identifies and imputes the dataset’s missing values.
# Find missing values in Data_Combined
cat('It is ', any(is.na(Data_Combined)),
' that Data_Combined has Missing Values', '.', sep='')
## It is TRUE that Data_Combined has Missing Values.
# Find columns with missing values
NAcol <- which(colSums(is.na(Data_Combined)) > 0)
# Print names of columns with missing values
cat("Data_Combined observations with missing values.")
## Data_Combined observations with missing values.
## [1] "ContractStartDateEV" "contractStartDate" "contractEndDate"
## [4] "agedBand" "title"
# Impute NAs with column means; na.aggregate() is from the zoo package
library(zoo)
Data_Impute <- na.aggregate(Data_Combined)
# Find missing values in Data_Impute
cat('It is ', any(is.na(Data_Impute)),
' that Data_Impute has Missing Values', '.', sep='')
## It is FALSE that Data_Impute has Missing Values.
Feature Selection is the process of selecting the most significant features from a given dataset. The most significant features are independent (or “explanatory”) variables that have a high correlation with the dependent (or “response”) variable.
## The Independent Variable, contractStartDate, has a negative correlation with the Independent Variable, bill_1_2021, of -0.1334527
## The Dependent Variable, EV, has a positive correlation with the Independent Variable, bill_1_2023, of 0.1068634
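A minimal sketch of how such correlations can be screened across all columns, assuming the fully numeric, imputed dataset:

# Correlation of every variable with the dependent variable EV
cor_EV <- cor(Data_Impute)[, "EV"]
sort(cor_EV, decreasing = TRUE)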
Cross-Validation divides the dataset into data for the Machine Learning algorithm to learn from (training data) and data that the algorithm is required to predict (testing data). A model that generalizes well produces consistent results when presented with new cases.
# Setting a seed number to generate a reproducible random sampling
set.seed(123)
# Create a randomly permuted 70/30 train/test split
# (130,590 training rows and 55,968 test rows)
v <- as.vector(c(rep(TRUE,130590),rep(FALSE,55968)))
ind <- sample(v)
# Select the training and test rows
train <- Data_Impute[ind,]
test <- Data_Impute[!ind,]
## It is TRUE that the CrossValidated Train and Test datasets
## are an accurate partition of the source dataset.
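The partition check reported above can be sketched as:

# Verify that the train/test split partitions the source dataset (TRUE)
nrow(train) + nrow(test) == nrow(Data_Impute)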
# Set the seed for reproducibility of results: set.seed() initializes
# the random number generator, assuring reproducible regression results
set.seed(5678)
# Create a linear regression model of EV on all remaining variables
model_LN <- lm(EV ~ ., data = train)
Random Forests average the predictions of an ensemble of decision tree regressions of the independent variables against the dependent variable.
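The Random Forest fitting call is not shown in this section; a minimal sketch using the randomForest package, where the hyperparameters are illustrative assumptions:

library(randomForest)
# Regression forest of EV on all other variables; ntree is an assumption
model_RF <- randomForest(EV ~ ., data = train, ntree = 500)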
Gradient Boosting sequentially fits an ensemble of weak learners, typically shallow decision trees, with each new tree correcting the residual errors of the ensemble so far. Gradient Boosting can achieve higher predictive accuracy than other regression techniques.
model_GBM <- gbm(EV~., data = train, distribution = "tdist", n.trees = 1000, interaction.depth = 11, shrinkage = 0.02)
This section contains evaluations of regression model assumptions. The Anderson-Darling Goodness-of-Fit Test measures how well the data fit a specified distribution.
Statistic | Value | P-Value
---|---|---
A | 18304.14 | 3.7e-24

Statistic | Value | P-Value
---|---|---
A | 49677.9 | 3.7e-24

Statistic | Value | P-Value
---|---|---
A | 49749.63 | 3.7e-24
Predictive Analytics is a group of algorithmic statistical methods that create predictions of future data values from regressions of historical data.
pred_LN <- predict(model_LN, newdata = test)
pred_RF <- predict(model_RF, newdata = test)
pred_GBM <- predict(model_GBM, newdata = test)
The Performance Measurement Metrics applied are R-Squared, Root Mean Square Error, and Mean Absolute Error.
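A minimal sketch of the three metrics, computed directly from their standard definitions (the document’s exact metric functions are not shown):

# R-Squared, Root Mean Square Error, and Mean Absolute Error
metrics <- function(actual, pred) {
  data.frame(
    R2   = 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2),
    RMSE = sqrt(mean((actual - pred)^2)),
    MAE  = mean(abs(actual - pred))
  )
}
metrics(test$EV, pred_LN)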
Linear Model:

## R2 RMSE MAE
## 1 0.10474 0.9462617 0.2314958

Random Forest:

## R2 RMSE MAE
## 1 0.9994026 0.0262715 0.003995913

Gradient Boosting Machine:

## R2 RMSE MAE
## 1 1 0.01363724 0.00315309
This document utilizes a Machine Learning pipeline to predict EV ownership from a dataset containing the electricity consumption, and other attributes, of 186,558 customers. Twenty-five variables were assembled for each customer. The objective is to determine the most effective method of predicting EV ownership from the available Machine Learning techniques.
When exploring the dataset, it was found that 13 of the variables are sparse billing time series that are treatable as independent observations for the purpose of regression analysis. Five of the 25 variables have missing values, which are imputed with column means. The dataset contains many outlier values, which are retained because their patterns may assist the Machine Learning models in identifying EV customers. For regression modeling, the character columns are converted to factors and then to numeric values, and normalized via min/max scaling. The dataset contains right-skewed independent variables that are not normally distributed, corresponding with the high number of outliers. One of the strongest predictors of EV in the dataset is bill_1_2023.
After cross-validation of the dataset with train/test sets, three regression models were applied: Linear Regression, Random Forest, and Gradient Boosting Machines. RF and GBM demonstrate the ability to predict the values in the test set almost perfectly, whereas basic linear regression achieved an R^2 of only about 0.10.