Description

This report provide Electric Vehicle price prediction using Regression Algorithm Method.The dataset using in this report for modeling is electric vehicle data all over the world. The Purposes of this report is :

To Car Manufacturer

Determining the selling price of the car based on the model from the reference we provide.
Determine which types of cars are selling well in the market, and compare them with consumables in Indonesia.
Decide to the car maker which aspects the most contribute & the worst contribute to Electric Vehicle price.

To Prospective Car Buyers

Give consult regarding the most suitable EV for the buyer.
Recommendation Features of Car to buyer.

The dataset is hosted in kaggle and can be downloaded here

The report is structured as follows :
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation

1. Data Extraction

Import necessary libraries.

rm(list = ls())

library(ggplot2)
library(corrgram)
library(gridExtra)

Read house dataset dan see its structure.

# read data

electric_df <- read.csv("data/ElectricCarData_Clean.csv")

# structure of dataframe
str(electric_df)

## 'data.frame':    103 obs. of  14 variables:
##  $ Brand          : chr  "Tesla " "Volkswagen " "Polestar " "BMW " ...
##  $ Model          : chr  "Model 3 Long Range Dual Motor" "ID.3 Pure" "2" "iX3 " ...
##  $ AccelSec       : num  4.6 10 4.7 6.8 9.5 2.8 9.6 8.1 5.6 6.3 ...
##  $ TopSpeed_KmH   : int  233 160 210 180 145 250 150 150 225 180 ...
##  $ Range_Km       : int  450 270 400 360 170 610 190 275 310 400 ...
##  $ Efficiency_WhKm: int  161 167 181 206 168 180 168 164 153 193 ...
##  $ FastCharge_KmH : chr  "940" "250" "620" "560" ...
##  $ RapidCharge    : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ PowerTrain     : chr  "AWD" "RWD" "AWD" "RWD" ...
##  $ PlugType       : chr  "Type 2 CCS" "Type 2 CCS" "Type 2 CCS" "Type 2 CCS" ...
##  $ BodyStyle      : chr  "Sedan" "Hatchback" "Liftback" "SUV" ...
##  $ Segment        : chr  "D" "C" "D" "D" ...
##  $ Seats          : int  5 5 5 5 4 5 5 5 5 5 ...
##  $ PriceEuro      : int  55480 30000 56440 68040 32997 105000 31900 29682 46380 55000 ...

The dataset has 103 observations and 14 variables. The target variable is price and the remaining variables are features.

Extract statistical summary of each variables.

# statistical summary

summary(electric_df)

##     Brand              Model              AccelSec       TopSpeed_KmH  
##  Length:103         Length:103         Min.   : 2.100   Min.   :123.0  
##  Class :character   Class :character   1st Qu.: 5.100   1st Qu.:150.0  
##  Mode  :character   Mode  :character   Median : 7.300   Median :160.0  
##                                        Mean   : 7.396   Mean   :179.2  
##                                        3rd Qu.: 9.000   3rd Qu.:200.0  
##                                        Max.   :22.400   Max.   :410.0  
##     Range_Km     Efficiency_WhKm FastCharge_KmH     RapidCharge       
##  Min.   : 95.0   Min.   :104.0   Length:103         Length:103        
##  1st Qu.:250.0   1st Qu.:168.0   Class :character   Class :character  
##  Median :340.0   Median :180.0   Mode  :character   Mode  :character  
##  Mean   :338.8   Mean   :189.2                                        
##  3rd Qu.:400.0   3rd Qu.:203.0                                        
##  Max.   :970.0   Max.   :273.0                                        
##   PowerTrain          PlugType          BodyStyle           Segment         
##  Length:103         Length:103         Length:103         Length:103        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      Seats         PriceEuro     
##  Min.   :2.000   Min.   : 20129  
##  1st Qu.:5.000   1st Qu.: 34430  
##  Median :5.000   Median : 45000  
##  Mean   :4.883   Mean   : 55812  
##  3rd Qu.:5.000   3rd Qu.: 65000  
##  Max.   :7.000   Max.   :215000

We can see minimun, median, mean, and maximum values of each numeric variables. But in variables FastCharge_KmH have type data of “character” and would be converted into numerical variable

We can also notice that the maximum values of price is statistically far away from median and third quantile. This could be an outliers.

2. Exploratory Data Analysis

2.1 Univariate Analysis

Plot distribution of price using histogram.

## boxplot

ggplot(data = electric_df, aes(x = PriceEuro)) +
  geom_histogram(bins = 10)

Based on histogram above, the most car prices in the range 35k – 55k (Euro).

Plot distribution of price using boxplot.

ggplot(data = electric_df, aes(y = PriceEuro)) +
  geom_boxplot()

Based on boxplot above, we can see that there are outliers in price.

2.2 Bivariate Analysis

Plot electric car price based on top speed.

ggplot(data = electric_df, aes(y = PriceEuro, x = TopSpeed_KmH)) +
  geom_point()

Based on price by top speed, we can see in general, the higher top speed the higher the price.

2.3 Multivariete Analysis

Compute correlation coefficient (R) among all numerical and interger features. Visualize Correlation Coefficient in a diagram.

electric_df$FastCharge_KmH <- as.numeric(electric_df$FastCharge_KmH)
electric_df_num <- electric_df[ ,c(3,4,5,6,7,13,14)]
r <- cor(electric_df_num)
corrgram(electric_df_num, order = TRUE,
         upper.panel = panel.pie)

summary (electric_df_num)

##     AccelSec       TopSpeed_KmH      Range_Km     Efficiency_WhKm
##  Min.   : 2.100   Min.   :123.0   Min.   : 95.0   Min.   :104.0  
##  1st Qu.: 5.100   1st Qu.:150.0   1st Qu.:250.0   1st Qu.:168.0  
##  Median : 7.300   Median :160.0   Median :340.0   Median :180.0  
##  Mean   : 7.396   Mean   :179.2   Mean   :338.8   Mean   :189.2  
##  3rd Qu.: 9.000   3rd Qu.:200.0   3rd Qu.:400.0   3rd Qu.:203.0  
##  Max.   :22.400   Max.   :410.0   Max.   :970.0   Max.   :273.0  
##                                                                  
##  FastCharge_KmH      Seats         PriceEuro     
##  Min.   :170.0   Min.   :2.000   Min.   : 20129  
##  1st Qu.:275.0   1st Qu.:5.000   1st Qu.: 34430  
##  Median :440.0   Median :5.000   Median : 45000  
##  Mean   :456.7   Mean   :4.883   Mean   : 55812  
##  3rd Qu.:560.0   3rd Qu.:5.000   3rd Qu.: 65000  
##  Max.   :940.0   Max.   :7.000   Max.   :215000  
##  NA's   :5

Several variables are highly correlated. For example, PriceEuro and TopSpeed_KmH.

For target variable (price), the variables with high correlation in order are FastCharge_KmH, TopSpeed_KmH and Range_Km.

Insight From Exploratory Data Analysis

Extreme outliers on top speed feature
There are co-linear variables (AccelSec & TopSpeed_KmH)
Based on Pearson’s correlation coefficient (r), the variables with highest correlation with target (price) are FastCharge_KmH, TopSpeed_KmH and Range_Km.
In general, the higher TopSpeed_KmH, the higher the price.

3. Data Preparation

3.1. Data Cleaning

Remove rows with incorrect values (NA) in feature FastCharge

# get indices FastCharge = NA
idx <- which(electric_df_num$FastCharge_KmH %in% c(NA))


# remove those rows
electric_df_num <- electric_df_num[-idx, ]

summary(electric_df_num$FastCharge_KmH)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   170.0   275.0   440.0   456.7   560.0   940.0

The minimum value on FastCharge_KmH is now not NA but 170.

3.2 Feature Extraction

One Hot Encoding for Brand Features (using Brand column)

# Add Brand features 
# OHE on Brand

# remove rows as in electric_df_num
electric_df <- electric_df[ rownames(electric_df_num), ]

# 1a. Create dataframe for Brand (available on electric_df)
Brand <- electric_df$Brand
Brand_df <- data.frame(Brand)
colnames(Brand_df) <- c("Brand.")

# 2a. OHE on Brand dataframe
library(caret)
df1 <- dummyVars("~.", data = Brand_df)
df2 <- data.frame(predict(df1, newdata = Brand_df))

# 3. Combine to house_df_num dataframe
## columns
electric_df_num <- cbind(electric_df_num, df2)
dim(electric_df_num)

## [1] 98 39

Number of features (columns) is now 39. It means we added 32 new features for brand information using OHE in Brand.

3.3. Training and Testing Division

Randomly devided the dataset into training and testing with ratio = 70:30

## ratio train:test = 70 : 30
#data dimension
d <- dim(electric_df_num)
m <- d[1] # m = number of row
n <- d[2] # m = number of column

set.seed(2021)
train_idx <- sample(m, 0.7 * m)
train_idx[1:3]

## [1]  7 38 46

train_data <- electric_df_num[train_idx , ]
test_data <- electric_df_num[-train_idx,  ]

dim(train_data)

## [1] 68 39

dim(test_data)

## [1] 30 39

The train data has 68 rows and test data has 30 rows.

4. Modeling

Create Regression Model.

4.1 Simple Linear Regression

Predict the target Price with variable TopSpeed_KmH using Simple Linear Regression.

# PriceEuro ~ TopSpeed_KmH
slr <- lm(formula = PriceEuro ~ TopSpeed_KmH,
          data = train_data)
summary(slr)

## 
## Call:
## lm(formula = PriceEuro ~ TopSpeed_KmH, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45759  -7653  -1928   2709 112316 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -58661.68   12859.17  -4.562 2.26e-05 ***
## TopSpeed_KmH    635.64      70.66   8.995 4.44e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20800 on 66 degrees of freedom
## Multiple R-squared:  0.5508, Adjusted R-squared:  0.5439 
## F-statistic: 80.91 on 1 and 66 DF,  p-value: 4.436e-13

Predict the target Price with variable Range_KmH using Simple Linear Regression.

# PriceEuro ~ Range_Km
slr2 <- lm(formula = PriceEuro ~ Range_Km,
          data = train_data)
summary(slr2)

## 
## Call:
## lm(formula = PriceEuro ~ Range_Km, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47518 -14600  -5260   4858 120492 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1939.60   11154.06  -0.174    0.862    
## Range_Km      165.94      31.31   5.299 1.44e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25990 on 66 degrees of freedom
## Multiple R-squared:  0.2985, Adjusted R-squared:  0.2879 
## F-statistic: 28.08 on 1 and 66 DF,  p-value: 1.439e-06

Predict the target Price with variable Efficiency_KmH using Simple Linear Regression.

# PriceEuro ~ Efficiency_WhKm
slr3 <- lm(formula = PriceEuro ~ Efficiency_WhKm,
          data = train_data)
summary(slr3)

## 
## Call:
## lm(formula = PriceEuro ~ Efficiency_WhKm, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34567 -14865  -8740   4472 125278 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     -14487.2    22200.8  -0.653  0.51631   
## Efficiency_WhKm    367.4      116.3   3.159  0.00239 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28920 on 66 degrees of freedom
## Multiple R-squared:  0.1313, Adjusted R-squared:  0.1182 
## F-statistic: 9.979 on 1 and 66 DF,  p-value: 0.00239

4.2 Multivariate Linear Regression

Predict the target Price with variable TopSpeed, Range_KmH and Efficiency_KmH using Multivariate Linear Regression.

mlr <- lm(formula = PriceEuro ~ TopSpeed_KmH + Range_Km + Efficiency_WhKm, 
          data = train_data)
summary(mlr)

## 
## Call:
## lm(formula = PriceEuro ~ TopSpeed_KmH + Range_Km + Efficiency_WhKm, 
##     data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42164  -8988  -1863   4211  99617 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -65023.34   17065.58  -3.810 0.000314 ***
## TopSpeed_KmH       539.55      88.97   6.065 7.85e-08 ***
## Range_Km            53.05      30.73   1.726 0.089148 .  
## Efficiency_WhKm     28.54      92.96   0.307 0.759839    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20590 on 64 degrees of freedom
## Multiple R-squared:  0.5731, Adjusted R-squared:  0.5531 
## F-statistic: 28.64 on 3 and 64 DF,  p-value: 7.301e-12

4.3. Polynomial Regression

Predict the target Price with variable TopSpeed, Range_KmH and Efficiency_KmH using Polynomial Regression.

poly <- lm(formula = PriceEuro ~ TopSpeed_KmH + Range_Km + Efficiency_WhKm + 
             I(TopSpeed_KmH^2) + I(Range_Km^2), 
           data = train_data)
summary(poly)

## 
## Call:
## lm(formula = PriceEuro ~ TopSpeed_KmH + Range_Km + Efficiency_WhKm + 
##     I(TopSpeed_KmH^2) + I(Range_Km^2), data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50795  -7333  -4351   4352  98149 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        1.139e+05  7.254e+04   1.570   0.1216  
## TopSpeed_KmH      -1.702e+03  8.940e+02  -1.903   0.0616 .
## Range_Km           1.438e+02  1.228e+02   1.171   0.2460  
## Efficiency_WhKm    1.265e+02  9.799e+01   1.291   0.2016  
## I(TopSpeed_KmH^2)  5.497e+00  2.174e+00   2.529   0.0140 *
## I(Range_Km^2)     -8.250e-02  1.413e-01  -0.584   0.5615  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19900 on 62 degrees of freedom
## Multiple R-squared:  0.6136, Adjusted R-squared:  0.5824 
## F-statistic: 19.69 on 5 and 62 DF,  p-value: 1.079e-11

4.4. MLR with Interaction

Predict the target Price with variable TopSpeed, Range_KmH and Efficiency_KmH using MLR with Interaction.

mli <- lm(formula = PriceEuro ~ TopSpeed_KmH : Range_Km + Efficiency_WhKm,
          data = train_data)
summary(mli)

## 
## Call:
## lm(formula = PriceEuro ~ TopSpeed_KmH:Range_Km + Efficiency_WhKm, 
##     data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54034 -11008  -3898   5676  98497 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -4602.7144 17400.0085  -0.265    0.792    
## Efficiency_WhKm          72.0703   101.3111   0.711    0.479    
## TopSpeed_KmH:Range_Km     0.7263     0.1104   6.577 9.65e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22580 on 65 degrees of freedom
## Multiple R-squared:  0.4784, Adjusted R-squared:  0.4624 
## F-statistic: 29.81 on 2 and 65 DF,  p-value: 6.506e-10

4.5 Our New Model

Predict the target Price with all variable numeric and integer using Our Model Regression.

ourmodel <- lm(formula = PriceEuro ~. +
                TopSpeed_KmH : Range_Km,
                data = train_data)
summary(ourmodel)

## 
## Call:
## lm(formula = PriceEuro ~ . + TopSpeed_KmH:Range_Km, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25251  -2795      0   3311  36953 
## 
## Coefficients: (6 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -3.350e+04  7.311e+04  -0.458   0.6497    
## AccelSec              -2.827e+03  2.157e+03  -1.311   0.1987    
## TopSpeed_KmH           2.051e+02  3.333e+02   0.615   0.5425    
## Range_Km              -1.501e+02  1.568e+02  -0.957   0.3453    
## Efficiency_WhKm        2.753e+02  8.392e+01   3.281   0.0024 ** 
## FastCharge_KmH        -1.842e+01  2.168e+01  -0.849   0.4016    
## Seats                  6.356e+03  3.681e+03   1.727   0.0933 .  
## Brand.Aiways.         -1.979e+03  1.150e+04  -0.172   0.8644    
## Brand.Audi.            3.446e+02  9.276e+03   0.037   0.9706    
## Brand.BMW.             4.818e+03  8.352e+03   0.577   0.5678    
## Brand.Byton.          -1.597e+04  1.324e+04  -1.207   0.2359    
## Brand.Citroen.         3.940e+03  1.163e+04   0.339   0.7368    
## Brand.CUPRA.           9.771e+03  1.197e+04   0.816   0.4201    
## Brand.DS.                     NA         NA      NA       NA    
## Brand.Fiat.            8.600e+03  1.184e+04   0.727   0.4725    
## Brand.Ford.           -7.408e+03  7.727e+03  -0.959   0.3445    
## Brand.Honda.           1.627e+03  1.275e+04   0.128   0.8992    
## Brand.Hyundai.        -1.056e+03  9.410e+03  -0.112   0.9113    
## Brand.Jaguar.         -1.045e+04  1.360e+04  -0.768   0.4476    
## Brand.Kia.            -5.729e+03  7.712e+03  -0.743   0.4626    
## Brand.Lexus.          -2.878e+03  1.242e+04  -0.232   0.8181    
## Brand.Lightyear.       1.482e+05  1.934e+04   7.663 6.58e-09 ***
## Brand.Lucid.                  NA         NA      NA       NA    
## Brand.Mazda.                  NA         NA      NA       NA    
## Brand.Mercedes.       -1.042e+04  9.952e+03  -1.047   0.3024    
## Brand.MG.             -1.329e+04  1.318e+04  -1.008   0.3204    
## Brand.Mini.           -2.521e+03  1.373e+04  -0.184   0.8554    
## Brand.Nissan.         -1.117e+04  7.228e+03  -1.546   0.1314    
## Brand.Opel.           -4.245e+03  9.242e+03  -0.459   0.6489    
## Brand.Peugeot.        -5.020e+03  9.273e+03  -0.541   0.5917    
## Brand.Polestar.       -1.586e+04  1.301e+04  -1.220   0.2310    
## Brand.Porsche.         3.930e+04  1.482e+04   2.653   0.0120 *  
## Brand.Renault.         8.437e+03  9.365e+03   0.901   0.3740    
## Brand.SEAT.                   NA         NA      NA       NA    
## Brand.Skoda.          -4.287e+03  6.783e+03  -0.632   0.5317    
## Brand.Sono.           -4.955e+03  1.234e+04  -0.402   0.6905    
## Brand.Tesla.          -2.617e+04  1.159e+04  -2.259   0.0304 *  
## Brand.Volkswagen.             NA         NA      NA       NA    
## Brand.Volvo.                  NA         NA      NA       NA    
## TopSpeed_KmH:Range_Km  7.840e-01  7.449e-01   1.052   0.3000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10640 on 34 degrees of freedom
## Multiple R-squared:  0.9394, Adjusted R-squared:  0.8806 
## F-statistic: 15.97 on 33 and 34 DF,  p-value: 6.572e-13

Make some regression models to see the best performance.

5. Evaluation

Predict Price

## actual price
actual <- test_data$PriceEuro

## predicted price
pred.slr <- predict(slr, test_data)
pred.slr2 <- predict(slr2, test_data)
pred.slr3 <- predict(slr3, test_data)
pred.mlr <- predict(mlr, test_data)
pred.poly <- predict(poly, test_data)
pred.mli <- predict(mli, test_data)
pred.ourmodel <- predict(ourmodel, test_data)

Plot Actual vs Predicted Price with SLR (TopSpeed_KmH)

price_df <- data.frame(actual, pred.slr, pred.slr2, pred.slr3, pred.mlr, 
                       pred.poly, pred.mli)

ggplot(data = price_df, aes(x = actual, y = pred.slr)) +
  geom_point() + labs(title = "Prediction Simple Linear Regression vs Actual (Top Speed)", 
                      x = "Actual", 
                      y = "Prediction") +
  geom_smooth(method = "lm") +
  scale_x_continuous(limits = c(0,250000)) +
  scale_y_continuous(limits = c(0,250000))

## `geom_smooth()` using formula 'y ~ x'

Mostly the points are located in the diagonal. It means the predicted value are close to the actual value. However there are still some points that are relatively far for diagonal.

Plot Actual vs Predicted Price with SLR (Range_KmH)

ggplot(data = price_df, aes(x = actual, y = pred.slr2)) +
  geom_point() + labs(title = "Prediction Simple Linear Regression vs Actual (Range)", 
                      x = "Actual", 
                      y = "Prediction") +
  geom_smooth(method = "lm") +
  scale_x_continuous(limits = c(0,250000)) +
  scale_y_continuous(limits = c(0,250000))

## `geom_smooth()` using formula 'y ~ x'

Plot Actual vs Predicted Price with SLR (Efficiency_WhKm)

ggplot(data = price_df, aes(x = actual, y = pred.slr3)) +
  geom_point() + labs(title = "Prediction Simple Linear Regression vs Actual (Efficiency)", 
                      x = "Actual", 
                      y = "Prediction") +
  geom_smooth(method = "lm") +
  scale_x_continuous(limits = c(0,250000)) +
  scale_y_continuous(limits = c(0,250000))

## `geom_smooth()` using formula 'y ~ x'

Mostly the points are located in the horizontal pattern. It means the predicted value are far to the actual value.

Plot Actual vs Predicted Price with MLR

ggplot(data = price_df, aes(x = actual, y = pred.mlr)) +
  geom_point() + labs(title = "Prediction Multivariate Linear Regression vs Actual", 
                      x = "Actual", 
                      y = "Prediction") +
  geom_smooth(method = "lm") +
  scale_x_continuous(limits = c(0,250000)) +
  scale_y_continuous(limits = c(0,250000))

## `geom_smooth()` using formula 'y ~ x'