• Section 1 - Answering the Question
    • Section 1.1 - Project Summary
  • Section 2 - Data Importation
    • Section 2.1 - Working Directory
    • Section 2.2 - Data Importing
  • Section 3 - Data Discovery
    • Section 3.1 - Exploratory Data Analysis
    • Section 3.2 - Statistical Analysis of the Dataset
  • Section 4 - Feature Engineering
    • Section 4.1 - Outliers and Missing Values
    • Section 4.2 - Feature Selection
    • Section 4.3 - Cross-Validation
    • Section 4.4 - Feature Scaling (Data Normalization)
  • Section 5 - Model Development
    • Regression Modeling
    • Section 5.1 - Evaluation of Model Assumptions
    • Section 5.2 - Predictive Analytics
  • Section 6 - Model Validation
    • Section 6.1 - Accuracy of the Predictions
  • Section 7 - Insights & Inferences
    • Conclusions
  • Section 8 - References






Section 1 - Answering the Question

Section 1.1 - Project Summary

This document applies a wide variety of Data Science techniques to gain insight into patterns of electricity consumption, and other customer data, that may indicate ownership of an electric vehicle. Market research has identified 70,000 Electric Vehicle (EV) customers; however, the client company is responsible for only 10% of those customers. Only 2,400 EV customers have been identified within the client’s database, because they are on the EV Tariff. The remaining 67,600 customers may own an EV but are not on the EV Tariff. The task of this document is to find the rest of the suspected EV customers. The metrics for success are the R^2, RMSE, and MAE accuracy of the EV ownership Machine Learning predictions.



Section 2 - Data Importation

Section 2.1 - Working Directory

In this section, the Working Directory of RStudio is set to the directory with the source files.

# The "Set Working Directory" R language function is used to set the working directory to the directory with the source files.
# setwd("C:/Users/johna/Dropbox/Programming/Science")

# The "Get Working Directory" function is used to verify that the working directory has been set to the right location.
# getwd()


Section 2.2 - Data Importing

The datasets used for this document are “Data_1.xlsx” and “Data_2.xlsx”, which are then combined into “Data_Combined”.

# Imports the datasets for analysis.
# The read_excel() function from the readxl package reads Excel spreadsheets
# (here, the second sheet of each workbook).
library(readxl)

Data_1 <- read_excel("Data_1.xlsx", 2)
Data_2 <- read_excel("Data_2.xlsx", 2)

# Combines the two imported datasets
Data_Combined <- rbind(Data_1, Data_2)

# Convert the "EV" variable from factor to numeric, then recode its values
# from 1/2 to 0/1
Data_Combined$EV <- as.numeric(as.factor(Data_Combined$EV))
Data_Combined$EV[Data_Combined$EV==1] <- 0
Data_Combined$EV[Data_Combined$EV==2] <- 1

# Creates datasets of separate "EV Yes" and "EV Unknown" data
EV_Yes <- Data_Combined[Data_Combined$EV==1,]
EV_Unknown <- Data_Combined[Data_Combined$EV==0,]



Section 3 - Data Discovery

Section 3.1 - Exploratory Data Analysis

Explanation of dataset variables:

bill_1_2021 – Usage data in kWh for the first bill (Jan + Feb) of 2021; similarly, bill_2_2021 is the second bill of 2021, bill_1_2023 is the first bill of 2023, and so on. (If a customer joined mid-2021, that customer will not have the first few bills for 2021.)

In total there are 13 columns of billing data: six from 2021, six from 2022, and one from 2023.

EV_New_or_Old – Whether the customer was previously an X-Company customer who switched to the EV tariff at some point, or a new customer who opted for the EV Tariff
StartDate – Start date of the customer
agedBand – Age band of the customer
signedUpGroup – Channel through which the customer was acquired, such as Field Sales, Telesales, Web, etc.
title – Mr/Mrs/Miss, etc.
mosaicType – Customer segmentation data set
EV – Whether the customer is an EV customer or not (Y/N)
ContractStartDateEV – Start date of the EV tariff contract (only populated for EV customers; helps identify when the customer switched to EV)
contractStartDate – Start date of the contract in general
contractEndDate – End date of the contract in general


The Row and Column Dimensions of the Dataset

Table 2. The Row and Column Dimensions of the Dataset
Rows 186558
Columns 25
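
A minimal sketch of how these dimensions can be inspected:

# Report the row and column dimensions of the combined dataset
dim(Data_Combined)   # c(rows, columns)
nrow(Data_Combined)  # 186558
ncol(Data_Combined)  # 25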


The Structure of the Data_Combined Dataset

## Table 3. Structure of the Dataset
## tibble [186,558 × 25] (S3: tbl_df/tbl/data.frame)
##  $ accountID          : num [1:186558] 0 1 2 3 4 5 6 7 8 9 ...
##  $ StartDate          : POSIXct[1:186558], format: "2022-05-03" "2018-03-09" ...
##  $ ContractStartDateEV: POSIXct[1:186558], format: NA NA ...
##  $ contractStartDate  : POSIXct[1:186558], format: "2022-05-03" "2023-03-14" ...
##  $ contractEndDate    : POSIXct[1:186558], format: "2023-05-03" "2024-03-14" ...
##  $ saStatus           : chr [1:186558] "Active" "Active" "Active" "Active" ...
##  $ agedBand           : chr [1:186558] "41 to 50" "Unclassified" "Unclassified" "41 to 50" ...
##  $ signedUpGroup      : chr [1:186558] "PCW" "Telesales" "Other" "Other" ...
##  $ title              : chr [1:186558] "Mr" "Ms" "Mr" "Mrs" ...
##  $ mosaicType         : chr [1:186558] "N/A" "F17" "G18" "K31" ...
##  $ EV                 : num [1:186558] 0 0 0 0 0 0 0 0 0 0 ...
##  $ EV_New_or_Old      : chr [1:186558] "Non-EV Customers" "Non-EV Customers" "Non-EV Customers" "Non-EV Customers" ...
##  $ bill_1_2021        : num [1:186558] 0 1278 0 0 670 ...
##  $ bill_2_2021        : num [1:186558] 0 1008 0 0 736 ...
##  $ bill_3_2021        : num [1:186558] 0 1278 0 428 452 ...
##  $ bill_4_2021        : num [1:186558] 0 926 0 348 663 ...
##  $ bill_5_2021        : num [1:186558] 0 1131 0 343 265 ...
##  $ bill_6_2021        : num [1:186558] 0 1081 0 456 313 ...
##  $ bill_1_2022        : num [1:186558] 0 1200 381 470 497 0 0 375 0 883 ...
##  $ bill_2_2022        : num [1:186558] 0 1179 706 381 183 ...
##  $ bill_3_2022        : num [1:186558] 581 1182 629 364 250 ...
##  $ bill_4_2022        : num [1:186558] 542 985 681 333 234 0 770 227 0 538 ...
##  $ bill_5_2022        : num [1:186558] 696 917 848 397 381 90 921 219 0 544 ...
##  $ bill_6_2022        : num [1:186558] 0 983 1455 407 358 ...
##  $ bill_1_2023        : num [1:186558] 2813 1130 0 526 233 ...


Section 3.2 - Statistical Analysis of the Dataset

The following table displays the standard deviation, mean value, minimum value, and maximum value of the dataset’s Explanatory Variables. The objective is to determine if normalization of the variable ranges is required.

Table 4. Summary Statistics of the Explanatory Variables
Variable    Standard Deviation    Mean    Min    Max
bill_1_2021 718.0940 466.9171 0 8914
bill_2_2021 808.2985 562.1151 0 9599
bill_3_2021 667.4849 513.7302 0 9191
bill_4_2021 562.3975 466.3544 0 9654
bill_5_2021 534.6123 463.4131 0 9499
bill_6_2021 635.0939 577.3035 0 8376
bill_1_2022 786.4653 725.4220 0 9361
bill_2_2022 741.1136 677.3660 0 9551
bill_3_2022 640.5935 651.4517 0 9840
bill_4_2022 519.7540 564.7194 0 8843
bill_5_2022 485.2380 568.5173 0 8643
bill_6_2022 557.3149 688.3538 0 9117
bill_1_2023 758.6266 920.3199 0 9333
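
A sketch of how such a summary table can be computed, assuming the billing columns all begin with "bill_":

# Compute SD, mean, min, and max for each billing column
bill_cols <- grep("^bill_", names(Data_Combined), value = TRUE)
stats <- t(sapply(Data_Combined[bill_cols], function(x)
  c(SD = sd(x), Mean = mean(x), Min = min(x), Max = max(x))))
round(stats, 4)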

Figure 1 - EV Distribution

Bar chart of the dependent variable, “EV”, used to determine a strategy for regressing it on the independent variables.
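
A sketch of how this chart could be drawn with ggplot2 (the same pattern applies to Figures 2 through 6):

library(ggplot2)

# Bar chart of the counts of the EV response variable
ggplot(Data_Combined, aes(x = factor(EV))) +
  geom_bar() +
  labs(title = "EV Distribution", x = "EV (0 = Unknown, 1 = Yes)", y = "Count")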

Figure 2 - Quantity Per SignedUp Group

Bar chart of the frequencies of Signed Up Group membership.


Figure 3 - Quantity Per Aged Band

Bar chart of the frequencies of agedBand categories.

Figure 4 - Quantity Per Title

Bar chart of the frequencies of titles.


Figure 5 - Quantity Per Mosaic Type

Bar chart of the frequencies of Mosaic Type (Segmentation Group).

Figure 6 - Quantity Per EV_New_or_Old

Bar chart of the frequencies of EV_New_or_Old.


Table of EV per agedBand

The relationship between the dependent variable, “EV”, and the independent variable, “agedBand”, can be surmised from the following table.

Table 5. EV Per agedBand
18 & Under 19 to 24 25 to 40 41 to 50 51 to 60 61 to 70 70+ Unclassified
0 8 647 19417 19394 15110 10747 8224 110489
1 0 2 215 314 231 131 61 1562


Table of signedUpGroup per agedBand

The relationship between the independent variables “signedUpGroup” and “agedBand” can be surmised from the following table.

## Table 6. signedUpGroup Per agedBand
##              
##               18 & Under 19 to 24 25 to 40 41 to 50 51 to 60 61 to 70   70+
##   Field Sales          1       63     2437     3623     3259     2592  2486
##   OBS                  0        6      435      850      872      654   379
##   Other                2      223     5699     4334     3220     2244  1684
##   PCW                  0       44     2650     3717     2909     1841  1189
##   Telesales            3      157     4369     3896     3188     2439  1888
##   Web                  2      156     4042     3288     1893     1108   659
##              
##               Unclassified
##   Field Sales        21457
##   OBS                 4879
##   Other              26117
##   PCW                19140
##   Telesales          23741
##   Web                16717
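
Tables 5 and 6 are contingency tables; a sketch of how they can be produced with base R (before the columns are converted to numeric in Section 4):

# Cross-tabulate EV ownership against age band (Table 5)
table(Data_Combined$EV, Data_Combined$agedBand)

# Cross-tabulate acquisition channel against age band (Table 6)
table(Data_Combined$signedUpGroup, Data_Combined$agedBand)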


Time Series Plots

The following plots display the time series of billing data.
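
A sketch of one way the billing series can be plotted for a sample of customers:

# Overlay the 13 billing periods for the first 10 customers
bill_cols <- grep("^bill_", names(Data_Combined), value = TRUE)
matplot(t(Data_Combined[1:10, bill_cols]), type = "l", lty = 1,
        xlab = "Billing period", ylab = "Usage (kWh)",
        main = "Billing Time Series of the First 10 Customers")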


Time Series Analysis

The following dendrogram groups the first 100 customers into groups of similar time series, with the objective of finding the clusters with a high propensity for buying EVs.
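
A sketch of the clustering step, assuming Euclidean distance between billing series and complete linkage:

# Hierarchical clustering of the first 100 customers' billing series
bill_100 <- Data_Combined[1:100, grep("^bill_", names(Data_Combined))]
hc <- hclust(dist(bill_100), method = "complete")
plot(hc, main = "Dendrogram of Billing Time Series", xlab = "Customer")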


Test for Normal Distribution of Data

The Shapiro-Wilk test is a way to tell whether a random sample comes from a Normal distribution. The test gives a W value; small values indicate that the sample is not normally distributed. Ho = the data are from a Normal distribution. Small p-values reject Ho.

## Table 7. Shapiro-Wilk Test of Distribution Normality
## Ho: Data is Normally Distributed.
## 
##  Shapiro-Wilk normality test
## 
## data:  Data_Combined$bill_1_2021[1:5000]
## W = 0.68375, p-value < 2.2e-16
## Shapiro test results for bill_1_2021 distribution < 0.05 = not normally distributed
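
The call that produces this output (shapiro.test() accepts at most 5,000 observations, hence the subset):

shapiro.test(Data_Combined$bill_1_2021[1:5000])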


Skew of the Distributions

Skewness is a measure of asymmetry. Negative skewness indicates that the mean is less than the median and the distribution is left-skewed. Positive skewness indicates that the mean is greater than the median and the distribution is right-skewed. As a rule of thumb, a skewness with absolute value greater than 1.0 indicates a substantially skewed distribution.

## The second moment of bill_1_2021 =   515656.3
## The third moment of bill_1_2021 =    812844985
## The skewness of bill_1_2021 =  6.3438

## Skew of the bill_1_2021 distribution =  2.195 is skewed to the right.
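
A sketch of the moment-based computation behind these figures (skewness is the third central moment divided by the second central moment raised to the power 3/2):

# Moment-based skewness of bill_1_2021
x  <- Data_Combined$bill_1_2021
m2 <- mean((x - mean(x))^2)  # second central moment
m3 <- mean((x - mean(x))^3)  # third central moment
m3 / m2^1.5                  # skewness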


QQ-Plots of Normality

The quantile-quantile plot is a graphical tool to help assess if data plausibly came from a theoretical distribution such as Normal or Exponential.
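
A sketch of a Normal QQ-plot for one of the billing variables:

# QQ-plot of bill_1_2021 against the Normal distribution
qqnorm(Data_Combined$bill_1_2021, main = "QQ-Plot of bill_1_2021")
qqline(Data_Combined$bill_1_2021, col = "red")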


Heteroskedasticity

Under homoskedasticity the data points are equally scattered, while under heteroskedasticity they are not. If there is no heteroskedasticity, you should see a completely random, equal distribution of points throughout the range of the X axis and a flat red line.

## Figures 12 - 15. Linear Models to Test for Heteroskedasticity
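
A sketch of how such a diagnostic plot can be produced; the model formula here is illustrative, not the one used in the report:

# Fit a simple linear model and plot its residuals against fitted values
lm_het <- lm(bill_1_2023 ~ bill_1_2022, data = Data_Combined)
plot(lm_het, which = 1)  # a flat red line suggests homoskedasticity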



Section 4 - Feature Engineering

Data Conversion

The following coding section converts the date columns to numeric, and transforms the categorical columns from the character class to the factor class, then from factor to numeric, for statistical regression analysis.

Data_Combined$StartDate <- as.numeric(Data_Combined$StartDate)
Data_Combined$ContractStartDateEV <- as.numeric(Data_Combined$ContractStartDateEV)
Data_Combined$contractStartDate <- as.numeric(Data_Combined$contractStartDate)
Data_Combined$contractEndDate <- as.numeric(Data_Combined$contractEndDate)
Data_Combined$agedBand <- as.numeric(as.factor(Data_Combined$agedBand))
Data_Combined$signedUpGroup <- as.numeric(as.factor(Data_Combined$signedUpGroup))
Data_Combined$title <- as.numeric(as.factor(Data_Combined$title))
Data_Combined$mosaicType <- as.numeric(as.factor(Data_Combined$mosaicType))
Data_Combined$EV_New_or_Old <- as.numeric(as.factor(Data_Combined$EV_New_or_Old))

# Remove the saStatus column (column 6), which only has one value, "Active"
Data_Combined <- Data_Combined[-6]



Section 4.1 - Outliers and Missing Values

This section examines the dataset for outliers and missing values, then identifies and imputes the observations that have missing values.

Find and impute Data_Combined observations that have Missing Values

# Find missing values in Data_Combined
cat('It is ', any(is.na(Data_Combined)),
    ' that Data_Combined has Missing Values', '.', sep='')
## It is TRUE that Data_Combined has Missing Values.
# Find columns with missing values
NAcol <- which(colSums(is.na(Data_Combined)) > 0)

# Print names of columns with missing values
cat("Data_Combined observations with missing values.")
## Data_Combined observations with missing values.
names(which(colSums(is.na(Data_Combined)) > 0))
## [1] "ContractStartDateEV" "contractStartDate"   "contractEndDate"    
## [4] "agedBand"            "title"
# Impute NAs with column means. na.aggregate() is from the zoo package.
library(zoo)

Data_Impute <- na.aggregate(Data_Combined)

# Find missing values in Data_Impute
cat('It is ', any(is.na(Data_Impute)),
    ' that Data_Impute has Missing Values', '.', sep='')
## It is FALSE that Data_Impute has Missing Values.



Section 4.2 - Feature Selection

Feature Selection is the process of selecting the most significant features from a given dataset. The most significant features are independent (or “explanatory”) variables that have a high correlation with the dependent (or “response”) variable.



Negative Correlation / Positive Correlation Plots

## The Independent Variable, contractStartDate, has a negative correlation with the Independent Variable, bill_1_2021, of  -0.1334527
## The Dependent Variable, EV, has a positive correlation with the Independent Variable, bill_1_2023, of  0.1068634
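
A sketch of how these correlations can be computed, assuming they were taken from the imputed dataset:

# Correlation between two independent variables
cor(Data_Impute$contractStartDate, Data_Impute$bill_1_2021)

# Correlation between the dependent variable and an independent variable
cor(Data_Impute$EV, Data_Impute$bill_1_2023)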



Section 4.3 - Cross-Validation

Cross-Validation divides the dataset into data for the Machine Learning algorithm to learn (training data) and data that the Machine Learning Algorithm is required to predict (testing data). A model that generalizes well produces consistent results when presented with new cases.

# Setting a seed number to generate a reproducible random sampling  
set.seed(123)

# Randomly permute a fixed logical vector to create a 70%/30% train/test split.
v <- as.vector(c(rep(TRUE,130590),rep(FALSE,55968)))
ind <- sample(v)

# Select the training and test rows
train <- Data_Impute[ind,]
test <- Data_Impute[!ind,]
## It is TRUE that the CrossValidated Train and Test datasets 
##     are an accurate partition of the source dataset.
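
The verification printed above can be reproduced with a check like this:

# Verify that train and test together partition the source dataset
cat('It is ', nrow(train) + nrow(test) == nrow(Data_Impute),
    ' that the CrossValidated Train and Test datasets\n',
    '    are an accurate partition of the source dataset.', sep='')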


Section 4.4 - Feature Scaling (Data Normalization)

Feature Scaling adjusts the scales of the features to a standard scale of measure (i.e., to fit in the same range). Here the scale() function standardizes each column to zero mean and unit variance (z-scores).

# Standardize each column of the train and test sets
train <- data.frame(scale(train))
test <- data.frame(scale(test))



Section 5 - Model Development

Regression Modeling

Fit the Linear Regression Model

# Set the seed for reproducibility of results. set.seed() sets an initial seed
# for the random number generators, thereby assuring reproducibility of the
# regression algorithm results.
set.seed(5678)

# Fit a linear regression of EV on all of the other variables
model_LN <- lm(EV ~ ., data = train)


Fit the Random Forest Model

A Random Forest predicts the mean of an ensemble of decision-tree regressions of the dependent variable on the independent variables.

# Fit a Random Forest with 100 trees. randomForest() is from the randomForest package.
library(randomForest)

model_RF <- randomForest(EV ~ ., data = train, ntree = 100)


Fit the Gradient Boosting Model

Gradient Boosting fits an ensemble of weak learners sequentially, with each new learner correcting the residual errors of the current ensemble. Gradient Boosting often achieves higher predictive accuracy than other regression techniques.

# Fit a Gradient Boosting Machine. gbm() is from the gbm package.
library(gbm)

model_GBM <- gbm(EV~., data = train, distribution = "tdist", n.trees = 1000, interaction.depth = 11, shrinkage = 0.02)


Section 5.1 - Evaluation of Model Assumptions

This section contains evaluations of regression model assumptions. The Anderson-Darling Goodness of Fit Test is a measure of how well your data fits a specified distribution.

Table 8. Basic Linear Model AD-Test
    Statistic    P-Value
A    18304.14    3.7e-24

Table 9. Random Forest Model AD-Test
    Statistic    P-Value
A     49677.9    3.7e-24

Table 10. Gradient Boosting Model AD-Test
    Statistic    P-Value
A    49749.63    3.7e-24
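
A sketch of how such statistics could be produced with ad.test() from the nortest package, applied to the model residuals (the report does not show the call, so this is an assumption):

library(nortest)

# Anderson-Darling test on the residuals of each fitted model
ad.test(residuals(model_LN))
ad.test(train$EV - predict(model_RF, newdata = train))
ad.test(train$EV - predict(model_GBM, newdata = train, n.trees = 1000))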


Section 5.2 - Predictive Analytics

Predictive Analytics is a group of algorithmic statistical methods that create predictions of future data values from regressions of historical data.

# Generate predictions on the test set from each fitted model
pred_LN <- predict(model_LN, newdata = test)

pred_RF <- predict(model_RF, newdata = test)

# predict() for gbm models needs the number of trees to use
pred_GBM <- predict(model_GBM, newdata = test, n.trees = 1000)



Section 6 - Model Validation

Section 6.1 - Accuracy of the Predictions

The Performance Measurement Metrics applied are R-Squared, Root Mean Square Error, and Mean Absolute Error.
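
A sketch of how these metrics can be computed from a prediction vector and the observed values (the helper name is hypothetical):

# Hypothetical helper: R-Squared, RMSE, and MAE for a set of predictions
accuracy_metrics <- function(pred, obs) {
  data.frame(R2   = cor(pred, obs)^2,
             RMSE = sqrt(mean((pred - obs)^2)),
             MAE  = mean(abs(pred - obs)))
}

accuracy_metrics(pred_LN, test$EV)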

Table 11. Accuracy of the Basic Linear Modeling Predictions

##        R2      RMSE       MAE
## 1 0.10474 0.9462617 0.2314958


Table 12. Accuracy of the Random Forest Predictions

##          R2      RMSE         MAE
## 1 0.9994026 0.0262715 0.003995913


Table 13. Accuracy of the Gradient Boosting Predictions

##   R2       RMSE        MAE
## 1  1 0.01363724 0.00315309


Chart of Predictive Model Accuracy



Section 7 - Insights & Inferences

Conclusions

This document utilizes a Machine Learning pipeline to predict EV ownership from a dataset containing the electricity consumption of 186,558 customers. Twenty-five variables were assembled for each customer in the dataset. The objective is to determine the most effective method of predicting EV ownership from the available Machine Learning techniques.

When exploring the dataset, it is found that 13 of the columns are sparse billing time series that are treatable as independent observations for the purpose of regression analysis. Five of the 25 columns have missing values, which are imputed with column means. The dataset contains many outlier values, which are retained on the grounds that they may carry patterns that assist with the Machine Learning identification of EV customers. For regression modeling, the character columns are converted to factors, then to numeric values, and standardized with z-score scaling. The dataset contains right-skewed independent variables that are not normally distributed, consistent with the high level of outliers. One of the strongest predictors of EV in the dataset is bill_1_2023.

After cross-validation of the dataset into train/test sets, three regression models were applied: Linear Modeling, Random Forest, and Gradient Boosting Machines. RF and GBM demonstrate the ability to predict the values in the test set almost perfectly (R² of 0.999 and 1.0, respectively). Basic linear regression, however, achieves an R² of only about 0.10.