The purpose of this project is to generate a model that will be able to predict the price of an electric vehicle based on the features that vehicle comes with.
An electric car is a car that does not run on a gasoline engine but rather runs on an electric battery. Some benefits to owning an electric car are that the car is quieter, no exhaust emissions, and low maintenance costs. Electric cars are becoming increasingly popular, due to the rise of gas prices, and the advancement in technology. The problem with electric cars before was that they were too expensive, they couldn’t travel far, and there weren’t many places to charge your vehicle. Most electric cars do not suffer from these problems anymore. Therefore, there is more reason to purchase an electric car.
embed_youtube("FDTf2d47P4A")
This model can be helpful by helping customers choose the features they want in their car and getting an estimated price based on what they chose. Another way they can use the model is if they have a specific budget for their electric car, then they can see what options are available to them.
electric_cars <- read.csv("Data/ElectricCarData_Clean.csv")
head(electric_cars)
## Brand Model AccelSec TopSpeed_KmH Range_Km
## 1 Tesla Model 3 Long Range Dual Motor 4.6 233 450
## 2 Volkswagen ID.3 Pure 10.0 160 270
## 3 Polestar 2 4.7 210 400
## 4 BMW iX3 6.8 180 360
## 5 Honda e 9.5 145 170
## 6 Lucid Air 2.8 250 610
## Efficiency_WhKm FastCharge_KmH RapidCharge PowerTrain PlugType BodyStyle
## 1 161 940 Yes AWD Type 2 CCS Sedan
## 2 167 250 Yes RWD Type 2 CCS Hatchback
## 3 181 620 Yes AWD Type 2 CCS Liftback
## 4 206 560 Yes RWD Type 2 CCS SUV
## 5 168 190 Yes RWD Type 2 CCS Hatchback
## 6 180 620 Yes AWD Type 2 CCS Sedan
## Segment Seats PriceEuro
## 1 D 5 55480
## 2 C 5 30000
## 3 D 5 56440
## 4 D 5 68040
## 5 B 4 32997
## 6 F 5 105000
is.numeric(electric_cars$FastCharge_KmH)
## [1] FALSE
electric_cars$FastCharge_KmH <- as.numeric(electric_cars$FastCharge_KmH)
electric_cars <- electric_cars %>%
clean_names()
head(electric_cars)
## brand model accel_sec top_speed_km_h range_km
## 1 Tesla Model 3 Long Range Dual Motor 4.6 233 450
## 2 Volkswagen ID.3 Pure 10.0 160 270
## 3 Polestar 2 4.7 210 400
## 4 BMW iX3 6.8 180 360
## 5 Honda e 9.5 145 170
## 6 Lucid Air 2.8 250 610
## efficiency_wh_km fast_charge_km_h rapid_charge power_train plug_type
## 1 161 940 Yes AWD Type 2 CCS
## 2 167 250 Yes RWD Type 2 CCS
## 3 181 620 Yes AWD Type 2 CCS
## 4 206 560 Yes RWD Type 2 CCS
## 5 168 190 Yes RWD Type 2 CCS
## 6 180 620 Yes AWD Type 2 CCS
## body_style segment seats price_euro
## 1 Sedan D 5 55480
## 2 Hatchback C 5 30000
## 3 Liftback D 5 56440
## 4 SUV D 5 68040
## 5 Hatchback B 4 32997
## 6 Sedan F 5 105000
is.numeric(electric_cars$fast_charge_km_h)
## [1] TRUE
I have cleaned the names and also made the variable “FastCharge_KmH” numerical instead of categorical because the values represent the charging rate of the charger from the vehicle. Therefore I believe those values should be numerical, to help improve the model’s calculations.
The dataset I am using is from the website Kaggle. This dataset contains some variables that are important to a customer looking to buy an electric vehicle. (Can also be found in codebook)
The variables in this dataset are as follows:
Brand
: The brand of the vehicle
Model
: The model of the vehicle
AccelSec
: The acceleration of the vehicle from 0-100
(km/h)
TopSpeed_KmH
: The top speed of the vehicle
(km/h)
Range_Km
: The range of the vehicle on a full charge
(km)
Efficiency_WhKm
: The efficiency of the electric
vehicle in watts per hour per kilometer (wh/km)
FastCharge_KmH
: The charging rate of an electric car
in kilometer per hour (km/h)
RapidCharge
: If the car has the option of
rapidcharge. Rapidcharge is a faster charge rate than the typical
charging rate
PowerTrain
: If the car has front, rear, or all-wheel
drive
PlugType:
The type of plug the electric vehicle
uses
BodyStyle
: The type of body style or car style the
vehicle is
Segment
: The market segment the car is in
Seats
: The amount of seats the vehicle
holds
PriceEuro
: The price of the vehicle before any tax
incentives
Factors that could result in a person's purchase
As a person who’s family currently owns an electric car, there are some key aspects that are important to a customer when finding the right electric car for them. Depending on a person’s environment, needs, and wants, will help decide on what electric car would be perfect for them.
The acceleration of a vehicle could be important to customers who live in an environment with many hills and inclines. Another reason why a customer may be interested in a higher acceleration car is if they want a more “exciting” drive. A car with high acceleration will give the vehicle a more sportier feel.
The top speed of a vehicle could be important to a customer if they have a place where they could take advantage of the car’s top speed. A place could be the Autobahn. The Autobahn is a freeway in Germany does not have any speed limits in some area. So a high top speed vehicle would benefit here.
Fig 1. Autobahn, map of freeway in Germany
Top Speed and Acceleration usually seem to have some type of correlation with each other in gas powered vehicles. Usually if a car has a high top speed, they usually have a lower 0-100kmh time.
plot(electric_cars$top_speed_km_h,electric_cars$accel_sec, pch = 19, col = "lightblue", xlab="Top Speed (Kmh)", ylab = "Acceleration (0-100 kmh)", main = "Correlation between Acceleration and Top Speed", xlim = c(0,450), ylim = c(0, 25),)
# Regression line
abline(lm(electric_cars$accel_sec ~ electric_cars$top_speed_km_h), col = "red", lwd = 3)
# Pearson correlation
text(paste("Correlation:", round(cor(electric_cars$top_speed_km_h, electric_cars$accel_sec), 2)), x = 350, y = 20)
Based on the graph we can see that acceleration and top speed have a relatively strong negative correlation. Acceleration and Top Speed have a correlation of -0.79. From this graph we can say that the higher a top speed an electric car has, the lower 0-100kmh time.
embed_youtube("xnYPQjAHrQ8")
Out of all the variables of the electric car, I believe this is to be the most important variable of them all. Range tells how far the vehicle can travel before needing to be recharged. Having a shorter range on car can significantly decrease the amount of places the car can travel and how long it will take to get to each destination (due to having to recharge more often). At this period of time (2022), there are not as many charging stations as gas stations. Therefore, it is very important to know where all the nearest charging stations are when traveling, or else it will result in the car running out of electricity.
Going off charging stations, a faster charger would benefit. The benefit of gas cars over electric cars are that for a gas powered car, it takes a few minutes to refill. This is compared to electric cars that may take up to an hour or more to be completely charged. However, some cars have fast charging or even better rapid charging. This can get the electric cars to 80% charge in just 10-15 minutes.
ggplot(electric_cars, aes(x=range_km, y=fast_charge_km_h, colour = rapid_charge)) + geom_point() + facet_wrap( ~ rapid_charge)
sum(is.na(electric_cars$fast_charge_km_h))
## [1] 5
electric_cars %>%
ggplot(aes(x = rapid_charge, colour = rapid_charge)) + geom_bar()
Based on our graphs we can see that there are no observations in this
dataset that do not have a rapid charge. However, when I check if there
are any NA’s in the column for fast_charge_km_h
, it says
there are 5. This is because the vehicles with no rapid charging are
labeled as NA in the fast_charge_km_h
column. We can
conclude that if the vehicle does not have rapid charging, then the
vehicle cannot fast charge. These cars will most likely not sell as well
as the vehicles with rapid charge.
electric_cars %>%
ggplot(aes(x = rapid_charge, y = range_km, colour = rapid_charge)) + geom_boxplot()
We can see that the cars with no rapid charge do not have a high range. The cars that do have rapid charge have a longer range and a faster charge rate. There also seems to be a positive correlation between range and fast charging. We can conclude that vehicles without rapid charging do not have a far range. We can also conclude that the faster the charger you have, the farther range your electric vehicle will have.
Of course price is truly important to customers. An electric car can contain all these cool features, but if the price is completely overpriced then it would not be worth it. Price is a key factor for everything we purchase. We buy things that we feel is worth the price. No matter how good an object it is, we always end up buying that item if the price is reasonable. This is the reason why I decided to make the price my response variable in this project.
ggplot(electric_cars, aes(price_euro)) +
geom_histogram(bins = 70, color = "red") +
labs(title = "Range of Prices")
Based on the plot we can say that the price of the electric vehicles from this dataset, mainly ranges from 0 to 80,000. The vehicles that are above this range are most likely luxurious electric vehicles or high-end sport electric sport cars. The cars that are within this range are most likely efficiently friendly vehicles.
A brand may not seem as important when looking for the perfect electric vehicle for you, but to others it may be something well considering. A few reasons why a customer may want to go for a specific brand are: the brand has a great reputation, the buyer may have bought vehicles from the brand before, or the brand gives it’s buyer a type of prestige. A brand that is currently leading the electric vehicle market is Tesla. This is simply because Tesla has a premium-like quality, with the addition of multiple high-tech features such as “self-driving”.
Fig 2. Brand: Tesla, Model: Model S
electric_cars %>%
ggplot(aes(reorder(brand, price_euro), price_euro)) + geom_boxplot(varwidth = TRUE) + coord_flip() + labs(subtitle = "brands",x = NULL)
Based on our plot we can see that the top 3 brands are Lightyear, Porsche, and Lucid. Lucid and Lightyear only make electric cars so they are not as well known. Porsche is a well-known car brand for making gasoline powered vehicles. There vehicles cost more than the average vehicle so it makes sense that Porsche is third on our plot. Tesla arrives at fifth which also makes sense since their vehicles are closer to the more luxurious side. The bottom 3 are Sono, Smart, and SEAT. Sono and SEAT are not as well known in the automotive world. However, Smart makes small eco-friendly vehicles that are cheaper than the average vehicle therefore it makes sense that they are one of the lowest on the list.
Fig 3. Brand: Smart, Model:fortwo Electric Drive
We have found some correlations already between some variables. However, let us graph a correlation plot with the numeric predictors to see if they are any more correlations.
cor_ec <- electric_cars %>%
select_if(is.numeric) %>%
correlate()
rplot(cor_ec)
cor_ec %>%
stretch() %>%
ggplot(aes(x,y,fill = r)) + geom_tile() + geom_text(aes(label = as.character(fashion(r))))
Based on the correlation plot we can see that accel_sec has strong
negative correlations with the variables: top_speed_km_h
,
range_km
, fast_charge_km_h
. The other relative
relation noting is top_speed_km_h has strong positive correlations with
the variables: range_km
,
price_euro
,fast_charge_km_h
.
The following models conducted were done in this order and procedure: 1. Building the model 2. Running the model 3. Making predictions using the model
electric_cars <- electric_cars %>%
mutate(brand = factor(brand)) %>%
mutate(rapid_charge = factor(rapid_charge)) %>%
mutate(power_train = factor(power_train)) %>%
mutate(plug_type = factor(plug_type)) %>%
mutate(body_style = factor(body_style)) %>%
mutate(segment = factor(segment))
head(electric_cars)
## brand model accel_sec top_speed_km_h range_km
## 1 Tesla Model 3 Long Range Dual Motor 4.6 233 450
## 2 Volkswagen ID.3 Pure 10.0 160 270
## 3 Polestar 2 4.7 210 400
## 4 BMW iX3 6.8 180 360
## 5 Honda e 9.5 145 170
## 6 Lucid Air 2.8 250 610
## efficiency_wh_km fast_charge_km_h rapid_charge power_train plug_type
## 1 161 940 Yes AWD Type 2 CCS
## 2 167 250 Yes RWD Type 2 CCS
## 3 181 620 Yes AWD Type 2 CCS
## 4 206 560 Yes RWD Type 2 CCS
## 5 168 190 Yes RWD Type 2 CCS
## 6 180 620 Yes AWD Type 2 CCS
## body_style segment seats price_euro
## 1 Sedan D 5 55480
## 2 Hatchback C 5 30000
## 3 Liftback D 5 56440
## 4 SUV D 5 68040
## 5 Hatchback B 4 32997
## 6 Sedan F 5 105000
I mutated the variables by factoring the numerical predictors.
set.seed(12)
ec_split <- electric_cars %>%
initial_split(prop = 0.75, strata = "price_euro")
ec_train <- training(ec_split)
dim(ec_train) #75
## [1] 75 14
ec_test <- testing(ec_split)
dim(ec_test) #28
## [1] 28 14
I decided to split the data with a .75 proportion and I made the strata price_euro because we are trying to predict the price of a vehicle based on the other variables. After splitting the data we get 75 observations for the training set and 28 left over for the testing set.
ec_recipe <- recipe(price_euro ~ accel_sec + top_speed_km_h + range_km + efficiency_wh_km + fast_charge_km_h + rapid_charge + power_train + plug_type + body_style + segment + seats, data = ec_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors()) %>%
step_novel(all_nominal_predictors()) %>%
step_zv(all_nominal_predictors())
ec_folds <- vfold_cv(ec_train, strata = price_euro, v = 10, repeats = 5)
We made a recipe using the training set. The predictor variables we
left out of the recipe are brand and model. I decided to leave them out
because I wanted the prediction to be completely based on the features
of the vehicle. I step_dummy()
all nominal predictors to
encode them as categorical predictors. I also
step_normalize()
to center and scale all the predictors. I
step_novel
and step_zv
all nominal predictors
so it would assign any previously unseen factor level to a new value and
to remove any variables that contain only a single value.
The first model I decided to create was a Ridge Regression Model. Ridge regression is one of the alternative approaches to modeling. Ridge is one of the main types of the Regularization approach. The goal of the Regularization approach is to shrink the coefficient estimates toward zero, similar to least squares. Ridge minimizes the sum of squared residuals and \(\lambda * slope^2\).
ridge_spec <- linear_reg(penalty = tune(), mixture = 0) %>%
set_mode("regression") %>%
set_engine("glmnet")
ridge_workflow <- workflow() %>%
add_recipe(ec_recipe) %>%
add_model(ridge_spec)
set.seed(12)
penalty_grid <- grid_regular(penalty(range = c(1, 11)), levels = 50)
penalty_grid
## # A tibble: 50 × 1
## penalty
## <dbl>
## 1 10
## 2 16.0
## 3 25.6
## 4 40.9
## 5 65.5
## 6 105.
## 7 168.
## 8 268.
## 9 429.
## 10 687.
## # … with 40 more rows
## # ℹ Use `print(n = ...)` to see more rows
tune_res <- tune_grid(
ridge_workflow,
resamples = ec_folds,
grid = penalty_grid
)
tune_res
## # Tuning results
## # 10-fold cross-validation repeated 5 times using stratification
## # A tibble: 50 × 5
## splits id id2 .metrics .notes
## <list> <chr> <chr> <list> <list>
## 1 <split [66/9]> Repeat1 Fold01 <tibble [100 × 5]> <tibble [2 × 3]>
## 2 <split [66/9]> Repeat1 Fold02 <tibble [100 × 5]> <tibble [2 × 3]>
## 3 <split [66/9]> Repeat1 Fold03 <tibble [100 × 5]> <tibble [1 × 3]>
## 4 <split [66/9]> Repeat1 Fold04 <tibble [100 × 5]> <tibble [1 × 3]>
## 5 <split [66/9]> Repeat1 Fold05 <tibble [100 × 5]> <tibble [2 × 3]>
## 6 <split [69/6]> Repeat1 Fold06 <tibble [100 × 5]> <tibble [1 × 3]>
## 7 <split [69/6]> Repeat1 Fold07 <tibble [100 × 5]> <tibble [1 × 3]>
## 8 <split [69/6]> Repeat1 Fold08 <tibble [100 × 5]> <tibble [1 × 3]>
## 9 <split [69/6]> Repeat1 Fold09 <tibble [100 × 5]> <tibble [1 × 3]>
## 10 <split [69/6]> Repeat1 Fold10 <tibble [100 × 5]> <tibble [1 × 3]>
## # … with 40 more rows
## # ℹ Use `print(n = ...)` to see more rows
##
## There were issues with some computations:
##
## - Warning(s) x50: A correlation computation is required, but `estimate` is constant... - Warning(s) x2: A correlation computation is required, but `estimate` is constant... - Warning(s) x3: A correlation computation is required, but `estimate` is constant... - Warning(s) x1: A correlation computation is required, but `estimate` is constant... - Warning(s) x1: A correlation computation is required, but `estimate` is constant... - Warning(s) x3: A correlation computation is required, but `estimate` is constant... - Warning(s) x5: A correlation computation is required, but `estimate` is constant... - Warning(s) x2: A correlation computation is required, but `estimate` is constant... - Warning(s) x3: A correlation computation is required, but `estimate` is constant...
##
## Run `show_notes(.Last.tune.result)` for more information.
autoplot(tune_res)
In this step we are adding the recipe to the ridge model. We are also
making the workflow and grid for the tune_grid
. We use the
folds we did earlier for the tune_grid
also.
Ridge_RMSE <- collect_metrics(tune_res) %>%
dplyr::select(.metric, mean, std_err) %>%
head()
We collect the metrics of our regression tune and look at the mean and standard error.
best_penalty <- select_best(tune_res, metric = "rsq")
best_penalty
## # A tibble: 1 × 2
## penalty .config
## <dbl> <chr>
## 1 75431. Preprocessor1_Model20
ridge_final <- finalize_workflow(ridge_workflow, best_penalty)
ridge_final_fit <- fit(ridge_final, data = ec_train)
Ridge_Prediction <- predict(ridge_final_fit, new_data = ec_test %>% dplyr::select(-price_euro))
Ridge_Prediction <- bind_cols(Ridge_Prediction, ec_test %>% dplyr::select(price_euro))
Ridge_Graph <- Ridge_Prediction %>%
ggplot(aes(x=.pred, y=price_euro)) + geom_point(alpha = 1) + geom_abline(lty = 2) + theme_bw() + coord_obs_pred()
Ridge_Accuracy <- augment(ridge_final_fit, new_data = ec_test) %>%
rsq(truth = price_euro, estimate = .pred)
Here we prepare the predictions, graphs, and plots, for comparison at the end.
The second model I decided to create was a Lasso Regression Model. Lasso regression is also one of the alternative approaches to modeling. Like Ridge, Lasso is one of the main types of the Regularization approach.The difference is Lasso minimizes the sum of squared residuals and \(\lambda * |slope|\).
lasso_spec <-
linear_reg(penalty = tune(), mixture = 1) %>%
set_mode("regression") %>%
set_engine("glmnet")
lasso_workflow <- workflow() %>%
add_recipe(ec_recipe) %>%
add_model(lasso_spec)
set.seed(12)
tune_res_lasso <- tune_grid(
lasso_workflow,
resamples = ec_folds,
grid = penalty_grid
)
tune_res_lasso
## # Tuning results
## # 10-fold cross-validation repeated 5 times using stratification
## # A tibble: 50 × 5
## splits id id2 .metrics .notes
## <list> <chr> <chr> <list> <list>
## 1 <split [66/9]> Repeat1 Fold01 <tibble [100 × 5]> <tibble [2 × 3]>
## 2 <split [66/9]> Repeat1 Fold02 <tibble [100 × 5]> <tibble [2 × 3]>
## 3 <split [66/9]> Repeat1 Fold03 <tibble [100 × 5]> <tibble [1 × 3]>
## 4 <split [66/9]> Repeat1 Fold04 <tibble [100 × 5]> <tibble [1 × 3]>
## 5 <split [66/9]> Repeat1 Fold05 <tibble [100 × 5]> <tibble [2 × 3]>
## 6 <split [69/6]> Repeat1 Fold06 <tibble [100 × 5]> <tibble [1 × 3]>
## 7 <split [69/6]> Repeat1 Fold07 <tibble [100 × 5]> <tibble [1 × 3]>
## 8 <split [69/6]> Repeat1 Fold08 <tibble [100 × 5]> <tibble [1 × 3]>
## 9 <split [69/6]> Repeat1 Fold09 <tibble [100 × 5]> <tibble [1 × 3]>
## 10 <split [69/6]> Repeat1 Fold10 <tibble [100 × 5]> <tibble [1 × 3]>
## # … with 40 more rows
## # ℹ Use `print(n = ...)` to see more rows
##
## There were issues with some computations:
##
## - Warning(s) x50: A correlation computation is required, but `estimate` is constant... - Warning(s) x2: A correlation computation is required, but `estimate` is constant... - Warning(s) x3: A correlation computation is required, but `estimate` is constant... - Warning(s) x1: A correlation computation is required, but `estimate` is constant... - Warning(s) x1: A correlation computation is required, but `estimate` is constant... - Warning(s) x3: A correlation computation is required, but `estimate` is constant... - Warning(s) x5: A correlation computation is required, but `estimate` is constant... - Warning(s) x2: A correlation computation is required, but `estimate` is constant... - Warning(s) x3: A correlation computation is required, but `estimate` is constant...
##
## Run `show_notes(.Last.tune.result)` for more information.
autoplot(tune_res_lasso)
In this step we are adding the recipe to the lasso model. We are also
making the workflow and grid for the tune_grid
. We use the
folds we did earlier for the tune_grid
also. The plots does
not seem to be as smooth as the ridge model.
Lasso_RMSE <- collect_metrics(tune_res_lasso) %>%
dplyr::select(.metric, mean, std_err) %>%
head()
We collect the metrics of our regression tune and look at the mean and standard error.
best_penalty_lasso <- select_best(tune_res_lasso, metric = "rsq")
lasso_final <- finalize_workflow(lasso_workflow, best_penalty_lasso)
lasso_final_fit <- fit(lasso_final, data = ec_train)
Lasso_Prediction <- predict(lasso_final_fit, new_data = ec_test %>% dplyr::select(-price_euro))
Lasso_Prediction <- bind_cols(Lasso_Prediction, ec_test %>% dplyr::select(price_euro))
Lasso_Graph <- Lasso_Prediction %>%
ggplot(aes(x=.pred, y=price_euro)) + geom_point(alpha=1) + geom_abline(lty = 2) + theme_bw() + coord_obs_pred()
Lasso_Accuracy <- augment(lasso_final_fit, new_data = ec_test) %>%
rsq(truth = price_euro, estimate = .pred)
Here we prepare the predictions, graphs, and plots, for comparison at the end.
The third model I created was a boosted tree model. A boosted model builds a weak decision tree that has low predictive accuracy. Then the model goes through the process of sequentially improving previous decision trees. Doing this, slowly reduces the bias at each step without drastically increasing the variance.
boost_spec <- boost_tree() %>%
set_engine("xgboost") %>%
set_mode("regression")
boost_wf <- workflow() %>%
add_model(boost_spec %>%
set_args(trees = tune())) %>%
add_recipe(ec_recipe)
set.seed(12)
boost_grid <- grid_regular(trees(range = c(10, 2000)), levels = 50)
boost_tune_res <- tune_grid(
boost_wf,
resamples = ec_folds,
grid = boost_grid,
)
autoplot(boost_tune_res)
In this step we are adding the recipe to the Boost model. We are also
making the workflow and grid for the tune_grid
. We use the
folds we did earlier for the tune_grid
also. The tree plots
here seem to go straight up and then flatten out, never changing it’s
slope again.
Boost_RMSE <- collect_metrics(boost_tune_res) %>%
dplyr::select(.metric, mean, std_err) %>%
head()
We collect the metrics of our regression tune and look at the mean and standard error.
best_boost_final <- select_best(boost_tune_res)
best_boost_final_model <- finalize_workflow(boost_wf, best_boost_final)
best_boost_final_model_fit <- fit(best_boost_final_model, data = ec_train)
Boost_Prediction <- predict(best_boost_final_model_fit, new_data = ec_test %>% dplyr::select(-price_euro))
Boost_Prediction <- bind_cols(Boost_Prediction, ec_test %>% dplyr::select(price_euro))
Boost_Graph <- Boost_Prediction %>%
ggplot(aes(x=.pred, y=price_euro)) + geom_point(alpha=1) + geom_abline(lty = 2) + theme_bw() + coord_obs_pred()
Boost_Accuracy <- augment(best_boost_final_model_fit, new_data = ec_test) %>%
rsq(truth = price_euro, estimate = .pred)
Here we prepare the predictions, graphs, and plots, for comparison at the end.
The fourth and final model I decided to make is a decision tree model. A decision tree model puts the data into classified chunks. Then based on the data from those chunks, the model does it’s best to predict the outcome.
tree_spec <-decision_tree() %>%
set_engine("rpart")
class_tree_spec <- tree_spec %>%
set_mode("regression")
class_tree_wf <- workflow() %>%
add_model(class_tree_spec %>% set_args(cost_complexity = tune())) %>%
add_recipe(ec_recipe)
set.seed(12)
param_grid <- grid_regular(cost_complexity(range = c(-5, 5)), levels = 50)
tune_res_tree <- tune_grid(
class_tree_wf,
resamples = ec_folds,
grid = param_grid,
)
autoplot(tune_res_tree)
In this step we are adding the recipe to the Tree model. We are also
making the workflow and grid for the tune_grid
. We use the
folds we did earlier for the tune_grid
also. The
cost_complexity
parameter seems to have more similarities
to the lasso plots. The reasoning is the plots are not smooth, and seem
to have sudden changes of slope.
Tree_RMSE <- collect_metrics(tune_res_tree) %>%
dplyr::select(.metric, mean, std_err) %>%
head()
We collect the metrics of our regression tune and look at the mean and standard error.
best_complexity <- select_best(tune_res_tree)
class_tree_final <- finalize_workflow(class_tree_wf, best_complexity)
class_tree_final_fit <- fit(class_tree_final, data = ec_train)
class_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot()
The tree plot asks specific questions. These questions can only be answered yes or no. Looking at the tree plot we can see that the first question was about the accel_sec. The second line asked about the range_km. The third line asked about the accel_sec again or it asked about segment_c. Then it finally asked again about accel_sec. Going through this tree gave predictions of the car price. Looking at the plot, the model predicts: - 9% \(\approx\) 25,000 - 28% \(\approx\) 34,000 - 15% \(\approx\) 45,000 - 9% \(\approx\) 52,000 - 19% \(\approx\) 64,000 - 20% \(\approx\) 108,000
Tree_Prediction <- predict(class_tree_final_fit, new_data = ec_test %>% dplyr::select(-price_euro))
Tree_Prediction <- bind_cols(Tree_Prediction, ec_test %>% dplyr::select(price_euro))
Tree_Graph <- Tree_Prediction %>%
ggplot(aes(x=.pred, y=price_euro)) + geom_point(alpha=1) + geom_abline(lty = 2) + theme_bw() + coord_obs_pred()
Tree_Accuracy <- augment(class_tree_final_fit, new_data = ec_test) %>%
rsq(truth = price_euro, estimate = .pred)
Here we prepare the predictions, graphs, and plots, for comparison at the end.
Comparison of the four different models: We will compare the four different models in this by these factors: - Prediction Graphs - RMSE & RSQ (R-Squared) from Training Set - RSQ from Testing Set
figure <- ggarrange(Ridge_Graph, Lasso_Graph, Boost_Graph,Tree_Graph,
labels = c("Ridge", "Lasso", "Boost","Tree"),
ncol = 2, nrow = 2)
figure
In the plots the dotted line represents where the points would be if the actual price of the vehicle was the same number as the prediction. Looking at the plots I would say that the Boost has the points closest to the dotted line meaning they most likely have the highest accuracy between the four models.
Ridge
head(Ridge_RMSE)
## # A tibble: 6 × 3
## .metric mean std_err
## <chr> <dbl> <dbl>
## 1 rmse 18922. 2013.
## 2 rsq 0.799 0.0325
## 3 rmse 18922. 2013.
## 4 rsq 0.799 0.0325
## 5 rmse 18922. 2013.
## 6 rsq 0.799 0.0325
Looking at the mean and standard we get that Ridge has the following
values: RMSE
: mean = 18,922 & standard error = 2013
RSQ
: mean = 0.7994 & standard error = 0.03249
Lasso
head(Lasso_RMSE)
## # A tibble: 6 × 3
## .metric mean std_err
## <chr> <dbl> <dbl>
## 1 rmse 20886. 2544.
## 2 rsq 0.715 0.0481
## 3 rmse 20762. 2521.
## 4 rsq 0.717 0.0479
## 5 rmse 20573. 2488.
## 6 rsq 0.720 0.0476
Looking at the mean and standard we get that Lasso has the following
values: RMSE
: mean = 20,886 & standard error = 2544
RSQ
: mean = 0.715 & standard error = 0.04805
Boost
head(Boost_RMSE)
## # A tibble: 6 × 3
## .metric mean std_err
## <chr> <dbl> <dbl>
## 1 rmse 19527. 2335.
## 2 rsq 0.763 0.0397
## 3 rmse 19919. 2283.
## 4 rsq 0.768 0.0392
## 5 rmse 19919. 2283.
## 6 rsq 0.768 0.0392
Looking at the mean and standard we get that Boost has the following
values: RMSE
: mean = 23,443 & standard error = 2127
RSQ
: mean = 0.8232 & standard error = 0.04482
Tree
head(Tree_RMSE)
## # A tibble: 6 × 3
## .metric mean std_err
## <chr> <dbl> <dbl>
## 1 rmse 25233. 2102.
## 2 rsq 0.584 0.0390
## 3 rmse 25233. 2102.
## 4 rsq 0.584 0.0390
## 5 rmse 25233. 2102.
## 6 rsq 0.584 0.0390
Looking at the mean and standard we get that Tree has the following
values: RMSE
: mean = 25,233 & standard error = 2102
RSQ
: mean = 0.584 & standard error = 0.0399
Looking at all the model’s rmse and rsq, Boost would be the best model to test on the Testing Set. Boost has the lowest RMSE and highest RSQ.
rsq_comparisons <- bind_rows(Ridge_Accuracy, Lasso_Accuracy, Boost_Accuracy, Tree_Accuracy) %>%
tibble() %>% mutate(model = c("Ridge", "Lasso", "Boost", "Tree")) %>%
dplyr::select(model, .estimate) %>%
arrange(.estimate)
rsq_comparisons
## # A tibble: 4 × 2
## model .estimate
## <chr> <dbl>
## 1 Tree 0.565
## 2 Boost 0.681
## 3 Ridge 0.815
## 4 Lasso 0.862
Looking at the R-Squared of the four different models we see that the Lasso model had the highest R-Squared and the tree model had the lowest R-Squared. Ridge was after Lasso then Boost, then finally Tree.
Going through the dataset we found a couple of correlations between the variables. After we decided to produce four different models. These models were: Ridge, Lasso, Boost, and Tree.
Analyzing the results of the rmse and rsq from the training set, I
would of thought that the boost would be the best model to predict
price_euro
. However, when using each model for the testing
data set, we got that boost had a lower rsq value. Lasso had the best
model prediction. I think the reason why boost did not do as well is
because the model is not as flexible and it over fitted. I believe the
reason why the tree model did poorly was because I think the tree model
is more successful with categorical data. The data we used mainly
consisted of numerical data. Lasso and Ridge came out on top, because
Lasso and Ridge are more flexible than the other two models I
fitted.
This research shows Lasso would be the best model. This model is the best at predicting the price of an electric vehicle, based on the features the vehicle has.