This report provide Electric Vehicle price prediction using Regression Algorithm Method.The dataset using in this report for modeling is electric vehicle data all over the world. The Purposes of this report is :
The dataset is hosted in kaggle and can be downloaded here
The report is structured as follows :
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
Import necessary libraries.
rm(list = ls())
library(ggplot2)
library(corrgram)
library(gridExtra)
Read house dataset dan see its structure.
# read data
electric_df <- read.csv("data/ElectricCarData_Clean.csv")
# structure of dataframe
str(electric_df)
## 'data.frame': 103 obs. of 14 variables:
## $ Brand : chr "Tesla " "Volkswagen " "Polestar " "BMW " ...
## $ Model : chr "Model 3 Long Range Dual Motor" "ID.3 Pure" "2" "iX3 " ...
## $ AccelSec : num 4.6 10 4.7 6.8 9.5 2.8 9.6 8.1 5.6 6.3 ...
## $ TopSpeed_KmH : int 233 160 210 180 145 250 150 150 225 180 ...
## $ Range_Km : int 450 270 400 360 170 610 190 275 310 400 ...
## $ Efficiency_WhKm: int 161 167 181 206 168 180 168 164 153 193 ...
## $ FastCharge_KmH : chr "940" "250" "620" "560" ...
## $ RapidCharge : chr "Yes" "Yes" "Yes" "Yes" ...
## $ PowerTrain : chr "AWD" "RWD" "AWD" "RWD" ...
## $ PlugType : chr "Type 2 CCS" "Type 2 CCS" "Type 2 CCS" "Type 2 CCS" ...
## $ BodyStyle : chr "Sedan" "Hatchback" "Liftback" "SUV" ...
## $ Segment : chr "D" "C" "D" "D" ...
## $ Seats : int 5 5 5 5 4 5 5 5 5 5 ...
## $ PriceEuro : int 55480 30000 56440 68040 32997 105000 31900 29682 46380 55000 ...
The dataset has 103 observations and 14 variables. The target variable is price and the remaining variables are features.
Extract statistical summary of each variables.
# statistical summary
summary(electric_df)
## Brand Model AccelSec TopSpeed_KmH
## Length:103 Length:103 Min. : 2.100 Min. :123.0
## Class :character Class :character 1st Qu.: 5.100 1st Qu.:150.0
## Mode :character Mode :character Median : 7.300 Median :160.0
## Mean : 7.396 Mean :179.2
## 3rd Qu.: 9.000 3rd Qu.:200.0
## Max. :22.400 Max. :410.0
## Range_Km Efficiency_WhKm FastCharge_KmH RapidCharge
## Min. : 95.0 Min. :104.0 Length:103 Length:103
## 1st Qu.:250.0 1st Qu.:168.0 Class :character Class :character
## Median :340.0 Median :180.0 Mode :character Mode :character
## Mean :338.8 Mean :189.2
## 3rd Qu.:400.0 3rd Qu.:203.0
## Max. :970.0 Max. :273.0
## PowerTrain PlugType BodyStyle Segment
## Length:103 Length:103 Length:103 Length:103
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Seats PriceEuro
## Min. :2.000 Min. : 20129
## 1st Qu.:5.000 1st Qu.: 34430
## Median :5.000 Median : 45000
## Mean :4.883 Mean : 55812
## 3rd Qu.:5.000 3rd Qu.: 65000
## Max. :7.000 Max. :215000
We can see minimun, median, mean, and maximum values of each numeric variables. But in variables FastCharge_KmH have type data of “character” and would be converted into numerical variable
We can also notice that the maximum values of price is statistically far away from median and third quantile. This could be an outliers.
Plot distribution of price using histogram.
## boxplot
ggplot(data = electric_df, aes(x = PriceEuro)) +
geom_histogram(bins = 10)
Based on histogram above, the most car prices in the range 35k – 55k (Euro).
Plot distribution of price using boxplot.
ggplot(data = electric_df, aes(y = PriceEuro)) +
geom_boxplot()
Based on boxplot above, we can see that there are outliers in price.
Plot electric car price based on top speed.
ggplot(data = electric_df, aes(y = PriceEuro, x = TopSpeed_KmH)) +
geom_point()
Based on price by top speed, we can see in general, the higher top speed the higher the price.
Compute correlation coefficient (R) among all numerical and interger features. Visualize Correlation Coefficient in a diagram.
electric_df$FastCharge_KmH <- as.numeric(electric_df$FastCharge_KmH)
electric_df_num <- electric_df[ ,c(3,4,5,6,7,13,14)]
r <- cor(electric_df_num)
corrgram(electric_df_num, order = TRUE,
upper.panel = panel.pie)
summary (electric_df_num)
## AccelSec TopSpeed_KmH Range_Km Efficiency_WhKm
## Min. : 2.100 Min. :123.0 Min. : 95.0 Min. :104.0
## 1st Qu.: 5.100 1st Qu.:150.0 1st Qu.:250.0 1st Qu.:168.0
## Median : 7.300 Median :160.0 Median :340.0 Median :180.0
## Mean : 7.396 Mean :179.2 Mean :338.8 Mean :189.2
## 3rd Qu.: 9.000 3rd Qu.:200.0 3rd Qu.:400.0 3rd Qu.:203.0
## Max. :22.400 Max. :410.0 Max. :970.0 Max. :273.0
##
## FastCharge_KmH Seats PriceEuro
## Min. :170.0 Min. :2.000 Min. : 20129
## 1st Qu.:275.0 1st Qu.:5.000 1st Qu.: 34430
## Median :440.0 Median :5.000 Median : 45000
## Mean :456.7 Mean :4.883 Mean : 55812
## 3rd Qu.:560.0 3rd Qu.:5.000 3rd Qu.: 65000
## Max. :940.0 Max. :7.000 Max. :215000
## NA's :5
Several variables are highly correlated. For example, PriceEuro and TopSpeed_KmH.
For target variable (price), the variables with high correlation in order are FastCharge_KmH, TopSpeed_KmH and Range_Km.
Remove rows with incorrect values (NA) in feature FastCharge
# get indices FastCharge = NA
idx <- which(electric_df_num$FastCharge_KmH %in% c(NA))
# remove those rows
electric_df_num <- electric_df_num[-idx, ]
summary(electric_df_num$FastCharge_KmH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 170.0 275.0 440.0 456.7 560.0 940.0
The minimum value on FastCharge_KmH is now not NA but 170.
One Hot Encoding for Brand Features (using Brand column)
# Add Brand features
# OHE on Brand
# remove rows as in electric_df_num
electric_df <- electric_df[ rownames(electric_df_num), ]
# 1a. Create dataframe for Brand (available on electric_df)
Brand <- electric_df$Brand
Brand_df <- data.frame(Brand)
colnames(Brand_df) <- c("Brand.")
# 2a. OHE on Brand dataframe
library(caret)
df1 <- dummyVars("~.", data = Brand_df)
df2 <- data.frame(predict(df1, newdata = Brand_df))
# 3. Combine to house_df_num dataframe
## columns
electric_df_num <- cbind(electric_df_num, df2)
dim(electric_df_num)
## [1] 98 39
Number of features (columns) is now 39. It means we added 32 new features for brand information using OHE in Brand.
Randomly devided the dataset into training and testing with ratio = 70:30
## ratio train:test = 70 : 30
#data dimension
d <- dim(electric_df_num)
m <- d[1] # m = number of row
n <- d[2] # m = number of column
set.seed(2021)
train_idx <- sample(m, 0.7 * m)
train_idx[1:3]
## [1] 7 38 46
train_data <- electric_df_num[train_idx , ]
test_data <- electric_df_num[-train_idx, ]
dim(train_data)
## [1] 68 39
dim(test_data)
## [1] 30 39
The train data has 68 rows and test data has 30 rows.
Create Regression Model.
Predict the target Price with variable TopSpeed_KmH using Simple Linear Regression.
# PriceEuro ~ TopSpeed_KmH
slr <- lm(formula = PriceEuro ~ TopSpeed_KmH,
data = train_data)
summary(slr)
##
## Call:
## lm(formula = PriceEuro ~ TopSpeed_KmH, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45759 -7653 -1928 2709 112316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58661.68 12859.17 -4.562 2.26e-05 ***
## TopSpeed_KmH 635.64 70.66 8.995 4.44e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20800 on 66 degrees of freedom
## Multiple R-squared: 0.5508, Adjusted R-squared: 0.5439
## F-statistic: 80.91 on 1 and 66 DF, p-value: 4.436e-13
Predict the target Price with variable Range_KmH using Simple Linear Regression.
# PriceEuro ~ Range_Km
slr2 <- lm(formula = PriceEuro ~ Range_Km,
data = train_data)
summary(slr2)
##
## Call:
## lm(formula = PriceEuro ~ Range_Km, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47518 -14600 -5260 4858 120492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1939.60 11154.06 -0.174 0.862
## Range_Km 165.94 31.31 5.299 1.44e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25990 on 66 degrees of freedom
## Multiple R-squared: 0.2985, Adjusted R-squared: 0.2879
## F-statistic: 28.08 on 1 and 66 DF, p-value: 1.439e-06
Predict the target Price with variable Efficiency_KmH using Simple Linear Regression.
# PriceEuro ~ Efficiency_WhKm
slr3 <- lm(formula = PriceEuro ~ Efficiency_WhKm,
data = train_data)
summary(slr3)
##
## Call:
## lm(formula = PriceEuro ~ Efficiency_WhKm, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34567 -14865 -8740 4472 125278
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14487.2 22200.8 -0.653 0.51631
## Efficiency_WhKm 367.4 116.3 3.159 0.00239 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28920 on 66 degrees of freedom
## Multiple R-squared: 0.1313, Adjusted R-squared: 0.1182
## F-statistic: 9.979 on 1 and 66 DF, p-value: 0.00239
Predict the target Price with variable TopSpeed, Range_KmH and Efficiency_KmH using Multivariate Linear Regression.
mlr <- lm(formula = PriceEuro ~ TopSpeed_KmH + Range_Km + Efficiency_WhKm,
data = train_data)
summary(mlr)
##
## Call:
## lm(formula = PriceEuro ~ TopSpeed_KmH + Range_Km + Efficiency_WhKm,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42164 -8988 -1863 4211 99617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -65023.34 17065.58 -3.810 0.000314 ***
## TopSpeed_KmH 539.55 88.97 6.065 7.85e-08 ***
## Range_Km 53.05 30.73 1.726 0.089148 .
## Efficiency_WhKm 28.54 92.96 0.307 0.759839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20590 on 64 degrees of freedom
## Multiple R-squared: 0.5731, Adjusted R-squared: 0.5531
## F-statistic: 28.64 on 3 and 64 DF, p-value: 7.301e-12
Predict the target Price with variable TopSpeed, Range_KmH and Efficiency_KmH using Polynomial Regression.
poly <- lm(formula = PriceEuro ~ TopSpeed_KmH + Range_Km + Efficiency_WhKm +
I(TopSpeed_KmH^2) + I(Range_Km^2),
data = train_data)
summary(poly)
##
## Call:
## lm(formula = PriceEuro ~ TopSpeed_KmH + Range_Km + Efficiency_WhKm +
## I(TopSpeed_KmH^2) + I(Range_Km^2), data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50795 -7333 -4351 4352 98149
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.139e+05 7.254e+04 1.570 0.1216
## TopSpeed_KmH -1.702e+03 8.940e+02 -1.903 0.0616 .
## Range_Km 1.438e+02 1.228e+02 1.171 0.2460
## Efficiency_WhKm 1.265e+02 9.799e+01 1.291 0.2016
## I(TopSpeed_KmH^2) 5.497e+00 2.174e+00 2.529 0.0140 *
## I(Range_Km^2) -8.250e-02 1.413e-01 -0.584 0.5615
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19900 on 62 degrees of freedom
## Multiple R-squared: 0.6136, Adjusted R-squared: 0.5824
## F-statistic: 19.69 on 5 and 62 DF, p-value: 1.079e-11
Predict the target Price with variable TopSpeed, Range_KmH and Efficiency_KmH using MLR with Interaction.
mli <- lm(formula = PriceEuro ~ TopSpeed_KmH : Range_Km + Efficiency_WhKm,
data = train_data)
summary(mli)
##
## Call:
## lm(formula = PriceEuro ~ TopSpeed_KmH:Range_Km + Efficiency_WhKm,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54034 -11008 -3898 5676 98497
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4602.7144 17400.0085 -0.265 0.792
## Efficiency_WhKm 72.0703 101.3111 0.711 0.479
## TopSpeed_KmH:Range_Km 0.7263 0.1104 6.577 9.65e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22580 on 65 degrees of freedom
## Multiple R-squared: 0.4784, Adjusted R-squared: 0.4624
## F-statistic: 29.81 on 2 and 65 DF, p-value: 6.506e-10
Predict the target Price with all variable numeric and integer using Our Model Regression.
ourmodel <- lm(formula = PriceEuro ~. +
TopSpeed_KmH : Range_Km,
data = train_data)
summary(ourmodel)
##
## Call:
## lm(formula = PriceEuro ~ . + TopSpeed_KmH:Range_Km, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25251 -2795 0 3311 36953
##
## Coefficients: (6 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.350e+04 7.311e+04 -0.458 0.6497
## AccelSec -2.827e+03 2.157e+03 -1.311 0.1987
## TopSpeed_KmH 2.051e+02 3.333e+02 0.615 0.5425
## Range_Km -1.501e+02 1.568e+02 -0.957 0.3453
## Efficiency_WhKm 2.753e+02 8.392e+01 3.281 0.0024 **
## FastCharge_KmH -1.842e+01 2.168e+01 -0.849 0.4016
## Seats 6.356e+03 3.681e+03 1.727 0.0933 .
## Brand.Aiways. -1.979e+03 1.150e+04 -0.172 0.8644
## Brand.Audi. 3.446e+02 9.276e+03 0.037 0.9706
## Brand.BMW. 4.818e+03 8.352e+03 0.577 0.5678
## Brand.Byton. -1.597e+04 1.324e+04 -1.207 0.2359
## Brand.Citroen. 3.940e+03 1.163e+04 0.339 0.7368
## Brand.CUPRA. 9.771e+03 1.197e+04 0.816 0.4201
## Brand.DS. NA NA NA NA
## Brand.Fiat. 8.600e+03 1.184e+04 0.727 0.4725
## Brand.Ford. -7.408e+03 7.727e+03 -0.959 0.3445
## Brand.Honda. 1.627e+03 1.275e+04 0.128 0.8992
## Brand.Hyundai. -1.056e+03 9.410e+03 -0.112 0.9113
## Brand.Jaguar. -1.045e+04 1.360e+04 -0.768 0.4476
## Brand.Kia. -5.729e+03 7.712e+03 -0.743 0.4626
## Brand.Lexus. -2.878e+03 1.242e+04 -0.232 0.8181
## Brand.Lightyear. 1.482e+05 1.934e+04 7.663 6.58e-09 ***
## Brand.Lucid. NA NA NA NA
## Brand.Mazda. NA NA NA NA
## Brand.Mercedes. -1.042e+04 9.952e+03 -1.047 0.3024
## Brand.MG. -1.329e+04 1.318e+04 -1.008 0.3204
## Brand.Mini. -2.521e+03 1.373e+04 -0.184 0.8554
## Brand.Nissan. -1.117e+04 7.228e+03 -1.546 0.1314
## Brand.Opel. -4.245e+03 9.242e+03 -0.459 0.6489
## Brand.Peugeot. -5.020e+03 9.273e+03 -0.541 0.5917
## Brand.Polestar. -1.586e+04 1.301e+04 -1.220 0.2310
## Brand.Porsche. 3.930e+04 1.482e+04 2.653 0.0120 *
## Brand.Renault. 8.437e+03 9.365e+03 0.901 0.3740
## Brand.SEAT. NA NA NA NA
## Brand.Skoda. -4.287e+03 6.783e+03 -0.632 0.5317
## Brand.Sono. -4.955e+03 1.234e+04 -0.402 0.6905
## Brand.Tesla. -2.617e+04 1.159e+04 -2.259 0.0304 *
## Brand.Volkswagen. NA NA NA NA
## Brand.Volvo. NA NA NA NA
## TopSpeed_KmH:Range_Km 7.840e-01 7.449e-01 1.052 0.3000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10640 on 34 degrees of freedom
## Multiple R-squared: 0.9394, Adjusted R-squared: 0.8806
## F-statistic: 15.97 on 33 and 34 DF, p-value: 6.572e-13
Make some regression models to see the best performance.
Predict Price
## actual price
actual <- test_data$PriceEuro
## predicted price
pred.slr <- predict(slr, test_data)
pred.slr2 <- predict(slr2, test_data)
pred.slr3 <- predict(slr3, test_data)
pred.mlr <- predict(mlr, test_data)
pred.poly <- predict(poly, test_data)
pred.mli <- predict(mli, test_data)
pred.ourmodel <- predict(ourmodel, test_data)
price_df <- data.frame(actual, pred.slr, pred.slr2, pred.slr3, pred.mlr,
pred.poly, pred.mli)
ggplot(data = price_df, aes(x = actual, y = pred.slr)) +
geom_point() + labs(title = "Prediction Simple Linear Regression vs Actual (Top Speed)",
x = "Actual",
y = "Prediction") +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(0,250000)) +
scale_y_continuous(limits = c(0,250000))
## `geom_smooth()` using formula 'y ~ x'
Mostly the points are located in the diagonal. It means the predicted value are close to the actual value. However there are still some points that are relatively far for diagonal.
ggplot(data = price_df, aes(x = actual, y = pred.slr2)) +
geom_point() + labs(title = "Prediction Simple Linear Regression vs Actual (Range)",
x = "Actual",
y = "Prediction") +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(0,250000)) +
scale_y_continuous(limits = c(0,250000))
## `geom_smooth()` using formula 'y ~ x'
ggplot(data = price_df, aes(x = actual, y = pred.slr3)) +
geom_point() + labs(title = "Prediction Simple Linear Regression vs Actual (Efficiency)",
x = "Actual",
y = "Prediction") +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(0,250000)) +
scale_y_continuous(limits = c(0,250000))
## `geom_smooth()` using formula 'y ~ x'
Mostly the points are located in the horizontal pattern. It means the predicted value are far to the actual value.
ggplot(data = price_df, aes(x = actual, y = pred.mlr)) +
geom_point() + labs(title = "Prediction Multivariate Linear Regression vs Actual",
x = "Actual",
y = "Prediction") +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(0,250000)) +
scale_y_continuous(limits = c(0,250000))
## `geom_smooth()` using formula 'y ~ x'
Mostly the points are located in the diagonal. It means the predicted value are close to the actual value. However there are still some points that are relatively far for diagonal.
ggplot(data = price_df, aes(x = actual, y = pred.poly)) +
geom_point() + labs(title = "Prediction Polynomial Regression vs Actual",
x = "Actual",
y = "Prediction") +
geom_smooth() +
scale_x_continuous(limits = c(0,250000)) +
scale_y_continuous(limits = c(0,250000))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
This model predicted value are close to the actual value. However there are still some points that are relatively far for diagonal.
ggplot(data = price_df, aes(x = actual, y = pred.mli)) +
geom_point() + labs(title = "Prediction MLR with Interaction vs Actual",
x = "Actual",
y = "Prediction") +
geom_smooth() +
scale_x_continuous(limits = c(0,250000)) +
scale_y_continuous(limits = c(0,250000))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
This model predicted value are close to the actual value. However there are still some points that are relatively far for diagonal.
ggplot(data = price_df, aes(x = actual, y = pred.ourmodel)) +
geom_point() + labs(title = "Prediction our Model vs Actual",
x = "Actual",
y = "Prediction") +
geom_smooth() +
scale_x_continuous(limits = c(0,250000)) +
scale_y_continuous(limits = c(0,250000))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Mostly the points are located in the diagonal. It means the predicted value are close to the actual value.
performance <- function(prediction, actual, method){
e <- prediction - actual # error
se <- e^2 # squared error
sse <- sum(se) # sum of squared error
mse <- mean(se) # mean squared error
rmse <- sqrt(mse) # root mean squared error
r <- cor(prediction, actual) # correlation coefficient
result <- paste("*** Method: ", method,
"\nRMSE = ", round(rmse,3),
"\nR = ", round(r,3))
cat(result)
}
performance(pred.slr, actual, "Simple Linear Regression (Top Speed)")
## *** Method: Simple Linear Regression (Top Speed)
## RMSE = 16803.729
## R = 0.911
performance(pred.slr2, actual, "Simple Linear Regression (Range)")
## *** Method: Simple Linear Regression (Range)
## RMSE = 26087.07
## R = 0.792
performance(pred.slr3, actual, "Simple Linear Regression (Efficiency)")
## *** Method: Simple Linear Regression (Efficiency)
## RMSE = 37795.428
## R = 0.423
performance(pred.mlr, actual, "Multivariate Linear Regression")
## *** Method: Multivariate Linear Regression
## RMSE = 16836.499
## R = 0.91
performance(pred.poly, actual, "Polynomial Regression")
## *** Method: Polynomial Regression
## RMSE = 41826.252
## R = 0.861
performance(pred.mli, actual, "MLR Regression with Interaction")
## *** Method: MLR Regression with Interaction
## RMSE = 24430.61
## R = 0.866
performance(pred.ourmodel, actual, "Our Model")
## *** Method: Our Model
## RMSE = 11800.581
## R = 0.965
The Best performance is Our Model with has RMSE = 11800 and R = 96.5%. The Correlation Coefficient is very high. The Error in terms of RMSE is significantly reduced to below 12000. This means we made huge improvement in modeling.