As someone very passion about cars, one very interesting problem is trying to find the maximum performance car models with the lowest price. The goal of optimizing peformance and trying to find the best deal is the goal of this dataset. This data set is from Kaggle (https://www.kaggle.com/datasets/rkiattisak/sports-car-prices-dataset) and contains sports car data from several luxury car makes and models. All the variables we will be working with are numerical and the response variable is categorical with two possible values. In this project we will use the year of the car, its engine size in litres, its horsepower, its torque in lb/ft, and its zero to sixty time in seconds to predict whether a car’s price is above or below 100,000 usd. A value of 0 for target indicates that the price is below 100k and a value of 1 indicates that the price is above 100k. We have 1008 observations and will be using 5 predictor variables. The goal is to see if we can predict the price tier of a car given several data points of its performance.
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(corrplot)
library(ggthemes)
library(ISLR)
library(ISLR2)
library(dplyr)
library(discrim)
library(poissonreg)
library(corrr)
library(klaR)
library(pROC)
library(yardstick)
library(glmnet)
library(modeldata)
library(janitor)
library(naniar)
library(xgboost)
library(ranger)
library(vip)
tidymodels_prefer()
library(visdat)
library(magrittr)
library(ggplot2)
library(kknn)
# read in data
carData <- read.csv(file = "~/Downloads/Sportcarprice.csv")
# Prints top rows of data set
head(carData)
## Car.Make Car.Model Year Engine.Size..L. Horsepower Torque..lb.ft.
## 1 Porsche 911 2022 3 379 331
## 2 Lamborghini Huracan 2021 5.2 630 443
## 3 Ferrari 488 GTB 2022 3.9 661 561
## 4 Audi R8 2022 5.2 562 406
## 5 McLaren 720S 2021 4 710 568
## 6 BMW M8 2022 4.4 617 553
## X0.60.MPH.Time..seconds. Price..in.USD. Target
## 1 4 101200 1
## 2 2.8 274390 1
## 3 3 333750 1
## 4 3.2 142700 1
## 5 2.7 298000 1
## 6 3.1 130000 1
# Prints bottom rows of data set
tail(carData)
## Car.Make Car.Model Year Engine.Size..L. Horsepower Torque..lb.ft.
## 1002 Bugatti Chiron 2021 8 1479 1180
## 1003 Koenigsegg Jesko 2022 5 1280 1106
## 1004 Lotus Evija 2021 Electric Motor 1972 1254
## 1005 McLaren Senna 2021 4 789 590
## 1006 Pagani Huayra 2021 6 764 738
## 1007 Rimac Nevera 2021 Electric Motor 1888 1696
## X0.60.MPH.Time..seconds. Price..in.USD. Target
## 1002 2.4 3,000,000 1
## 1003 2.5 3,000,000 1
## 1004 2 2,000,000 1
## 1005 2.7 1,000,000 1
## 1006 3 2,600,000 1
## 1007 1.85 2,400,000 1
# Dimensions of data set
dim(carData)
## [1] 1007 9
# Look at missing data points
vis_miss(carData)
As can be seen there is no missing data in our dataset. The goal of this project is to see if we can accurately predict the price tier of a car given 5 predictor values relating to its performance. These results could be very helpful in allowing car customers to find the most performance for price and give key indications into which vehicles are overpriced and underpriced. Further research could put handling and luxury predictor variables and provide a whole indication of a car’s worth at a given price. Let’s begin the EDA below.
# Getting rid of commas in the price column
carData$Price..in.USD. <- gsub(",", "", carData$Price..in.USD.)
# Converts our dataframe columns into numeric and integer values
carData$Year <- as.integer(carData$Year)
carData$Engine.Size..L. <- as.numeric(carData$Engine.Size..L.)
carData$Horsepower<- as.integer(carData$Horsepower)
carData$Torque..lb.ft. <- as.integer(carData$Torque..lb.ft.)
carData$X0.60.MPH.Time..seconds. <- as.numeric(carData$X0.60.MPH.Time..seconds.)
carData$Price..in.USD. <- as.integer(carData$Price..in.USD.)
# Stores columns of data set
colnames(carData)
## [1] "Car.Make" "Car.Model"
## [3] "Year" "Engine.Size..L."
## [5] "Horsepower" "Torque..lb.ft."
## [7] "X0.60.MPH.Time..seconds." "Price..in.USD."
## [9] "Target"
make <- carData[1]
model <- carData[2]
year <- carData[3]
engine <- carData[4]
horsepower <- carData[5]
torque <- carData[6]
zerosixty <- carData[7]
price <- carData[8]
# Gets rid of missing values, sets missing values to the mean of all values in the column
carData$Horsepower[is.na(carData$Horsepower)] <- mean(carData$Horsepower, na.rm = TRUE)
carData$Engine.Size..L.[is.na(carData$Engine.Size..L.)] <- mean(carData$Engine.Size..L., na.rm = TRUE)
carData$Torque..lb.ft.[is.na(carData$Horsepower)] <- mean(carData$Torque..lb.ft., na.rm = TRUE)
carData$X0.60.MPH.Time..seconds.[is.na(carData$X0.60.MPH.Time..seconds.)] <- mean(carData$X0.60.MPH.Time..seconds., na.rm = TRUE)
carData$Torque..lb.ft.[is.na(carData$Torque..lb.ft.)] <- mean(carData$Torque..lb.ft., na.rm = TRUE)
# Checks if there are any missing values left
sum(is.na(carData$Year))
## [1] 0
sum(is.na(carData$Engine.Size..L.))
## [1] 0
sum(is.na(carData$X0.60.MPH.Time..seconds.))
## [1] 0
sum(is.na(carData$Horsepower))
## [1] 0
sum(is.na(carData$Torque..lb.ft.))
## [1] 0
This displays a histogram which gives us a range of horsepower values in our dataset. We expect horsepower to be a significant predictor variable of price because higher engine power is usually an indication that a car will have higher performance and hence a higher price.
carData %>%
ggplot2::ggplot(aes(x = carData$Horsepower)) + geom_histogram(bins = 60) + theme_bw()
This scatterplot plots 0-60 time on the x-axis and torque on the y axis. We can see a negative correlation so as 0-60 time increases, torque decreases. This makes intuitive sense because a slower car will have lower torque and engine power.
carData %>%
ggplot(aes(x = carData$X0.60.MPH.Time..seconds., y = carData$Torque..lb.ft.)) +
geom_point() +
labs(x = "0-60 Time", y = "Torque") +
ggtitle("Scatter Plot")
This scatter-plot horsepower on the x-axis and torque on the y axis. We can see a positive correlation so as horsepower increases, torque increases as well.. This makes intuitive sense because horsepower is mechnaically linked to torque and would logically be correlated.
carData %>%
ggplot(aes(x = carData$Horsepower, y = carData$Torque..lb.ft.)) +
geom_point() +
labs(x = "Horsepower", y = "Torque)") +
ggtitle("Scatter Plot")
This shows the price tier on the x-axis and engine size on the y axis. This provides an indication for how well engine size can predict the price tier of the car.
carData %>%
ggplot(aes(x = factor(Target), y = Engine.Size..L.)) +
geom_boxplot() +
geom_jitter(alpha = 2) +
labs(title = "Price Tier by Engine Size (litres)",
x = "Price Tier",
y = "Engine Size")
numeric_columns <- carData[sapply(carData, is.numeric)]
carData %>%
select(is.numeric) %>%
cor() %>%
corrplot::corrplot(type = 'lower',diag = TRUE,
method = 'color')
yrFactor <- factor(carData$Year)
This correlation plot reveals a lot of useful information about our dataset. The strongest positive correlations exist for all of our predictor variables except for 0-60 time. Engine size, torque, and horsepower all have positive correlations when compared to each other indicating that as one variable increases the other one will as well. 0-60 time has a negative correlation with all of our other predictor variables which makes sense because as a car becomes more expensive, 0-60 should be lower because the car should have higher performance. The result from the correlation plot gives us a good indication that the expected correlations between the predictor variables hold true. The expectation the horsepower, torque, and engine size will be linked is also proved to be true. The classification models that follow will better quantify the predictive power of our variables.
First we will split our data into training and testing. We will first create a recipe for the data
set.seed(3436)
carData <- carData %>%
mutate(Year = factor(Year)) %>%
mutate(Engine.Size..L. = factor(Engine.Size..L.)) %>%
mutate(Horsepower = factor(Horsepower)) %>%
mutate(X0.60.MPH.Time..seconds. = factor(X0.60.MPH.Time..seconds.)) %>%
mutate(Torque..lb.ft. = factor(Torque..lb.ft.)) %>%
mutate(Price..in.USD. = factor(Price..in.USD.)) %>%
mutate(Target = factor(Target))
carData$Engine.Size..L. <- as.numeric(as.character(carData$Engine.Size..L.))
carData$Horsepower <- as.numeric(as.character(carData$Horsepower))
carData$Torque..lb.ft. <- as.numeric(as.character(carData$Torque..lb.ft.))
carData$Year <- as.numeric(as.character(carData$Year))
carData$X0.60.MPH.Time..seconds. <- as.numeric(as.character(carData$X0.60.MPH.Time..seconds.))
carData$Price..in.USD. <- as.numeric(as.character(carData$Price..in.USD.))
carData_split <- initial_split(carData, prop = 0.7, strata = NULL)
carData_train <- training(carData_split)
carData_test <- testing(carData_split)
carData_folds <- vfold_cv(carData_train, v=5, repeats=3)
Let’s take a look between the correlations of the continuous variables in our data set. We can also examine the corresponding correlation matrix.
continuous_vars <- select_if(carData_train, is.numeric)
correlation_matrix <- cor(continuous_vars)
corrplot(correlation_matrix, method = "color", type = "upper", tl.col = "black")
print(correlation_matrix)
## Year Engine.Size..L. Horsepower
## Year 1.0000000000 -0.2182691 -0.0007119097
## Engine.Size..L. -0.2182690966 1.0000000 0.4366507040
## Horsepower -0.0007119097 0.4366507 1.0000000000
## Torque..lb.ft. 0.0183172892 0.4303241 0.9365013918
## X0.60.MPH.Time..seconds. 0.0304448216 -0.3258643 -0.7151725543
## Price..in.USD. -0.0508095298 0.3729523 0.7802859437
## Torque..lb.ft. X0.60.MPH.Time..seconds. Price..in.USD.
## Year 0.01831729 0.03044482 -0.05080953
## Engine.Size..L. 0.43032406 -0.32586430 0.37295230
## Horsepower 0.93650139 -0.71517255 0.78028594
## Torque..lb.ft. 1.00000000 -0.66391707 0.72364782
## X0.60.MPH.Time..seconds. -0.66391707 1.00000000 -0.48993932
## Price..in.USD. 0.72364782 -0.48993932 1.00000000
carData$Target <- factor(carData$Target)
carData_train$Target <- factor(carData_train$Target)
carData_test$Target <- factor(carData_test$Target)
# Let's use our training data to create a recipe prediction for our response variable
car_recipe <- recipe(Target ~ Year + Engine.Size..L. + Horsepower + Torque..lb.ft. + X0.60.MPH.Time..seconds., data=carData)
We have our recipe and have done stratified sampling. We can now create different models for classification using different engines.
# This is a logistic regression model
logistic_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
car_workflow <- workflow() %>%
add_model(logistic_model) %>%
add_recipe(car_recipe)
car_fit <- fit(car_workflow, carData_train)
car_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) Year Engine.Size..L.
## 1.032e+03 -5.092e-01 -3.473e-01
## Horsepower Torque..lb.ft. X0.60.MPH.Time..seconds.
## 1.283e-02 -6.419e-03 -1.193e+00
##
## Degrees of Freedom: 703 Total (i.e. Null); 698 Residual
## Null Deviance: 917.2
## Residual Deviance: 600.3 AIC: 612.3
# This is a linear discriminant analysis model for classification.
lda_model <- discrim_linear() %>%
set_engine("MASS") %>%
set_mode("classification")
car_workflow_lda <- workflow() %>%
add_recipe(car_recipe) %>%
add_model(lda_model)
car_fit_lda <- fit(car_workflow_lda, data = carData_train)
car_fit_lda
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: discrim_linear()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Call:
## lda(..y ~ ., data = data)
##
## Prior probabilities of groups:
## 0 1
## 0.3565341 0.6434659
##
## Group means:
## Year Engine.Size..L. Horsepower Torque..lb.ft. X0.60.MPH.Time..seconds.
## 0 2021.430 4.014707 463.9920 427.2390 4.101195
## 1 2021.172 4.583301 717.6945 605.4072 3.177892
##
## Coefficients of linear discriminants:
## LD1
## Year -0.240313940
## Engine.Size..L. -0.006991305
## Horsepower 0.001485402
## Torque..lb.ft. -0.001841505
## X0.60.MPH.Time..seconds. -1.506258790
# This is a quadratic discriminant analysis model for classification.
qda_model <- discrim_quad() %>%
set_mode("classification") %>%
set_engine("MASS")
car_workflow_qda <- workflow() %>% add_recipe(car_recipe) %>% add_model(qda_model)
car_fit_qda <- fit(car_workflow_qda, data = carData_train)
car_fit_qda
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: discrim_quad()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Call:
## qda(..y ~ ., data = data)
##
## Prior probabilities of groups:
## 0 1
## 0.3565341 0.6434659
##
## Group means:
## Year Engine.Size..L. Horsepower Torque..lb.ft. X0.60.MPH.Time..seconds.
## 0 2021.430 4.014707 463.9920 427.2390 4.101195
## 1 2021.172 4.583301 717.6945 605.4072 3.177892
# Finally this is a KNN Mode ( k-nearest neighbors)
knn_model <- nearest_neighbor(neighbors = 7) %>%
set_mode("classification") %>%
set_engine("kknn")
car_workflow_knn <- workflow() %>%
add_recipe(car_recipe) %>%
add_model(knn_model)
car_fit_knn <- fit(car_workflow_knn, data = carData_train)
car_fit_knn
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call:
## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(7, data, 5))
##
## Type of response variable: nominal
## Minimal misclassification: 0.05823864
## Best kernel: optimal
## Best k: 7
# We can use the metric of area under the ROC Curve to measure each model's performance using the training data
logistic_preds <- predict(car_fit, new_data = carData_train, type = "prob") %>% select(.pred_0)
lda_preds <- predict(car_fit_lda, new_data = carData_train, type = "prob") %>% select(.pred_0)
qda_preds <- predict(car_fit_qda, new_data = carData_train, type = "prob") %>% select(.pred_0)
knn_preds <- predict(car_fit_knn, new_data = carData_train, type = "prob") %>% select(.pred_0)
four_models <- bind_cols(logistic_preds, lda_preds, qda_preds, knn_preds)
four_models
## # A tibble: 704 × 4
## .pred_0...1 .pred_0...2 .pred_0...3 .pred_0...4
## <dbl> <dbl> <dbl> <dbl>
## 1 0.428 0.376 5.57e- 1 0.182
## 2 0.291 0.274 7.94e- 1 0
## 3 0.310 0.319 7.09e- 1 0
## 4 0.0569 0.126 1.76e- 2 0
## 5 0.0339 0.0643 2.74e- 4 0
## 6 0.892 0.923 9.98e- 1 1
## 7 0.235 0.243 6.53e- 2 0
## 8 0.877 0.922 9.98e- 1 0.985
## 9 0.000725 0.0189 7.76e-16 0
## 10 0.198 0.291 7.86e- 1 1
## # ℹ 694 more rows
roc_logistic <- roc(carData_train$Target, logistic_preds$.pred_0)
roc_lda <- roc(carData_train$Target, lda_preds$.pred_0)
roc_qda <- roc(carData_train$Target, qda_preds$.pred_0)
roc_knn <- roc(carData_train$Target, knn_preds$.pred_0)
cat("Logistic Regression AUC (Training):", auc(roc_logistic), "\n")
## Logistic Regression AUC (Training): 0.8720966
cat("Linear Discriminant Analysis AUC (Training):", auc(roc_lda), "\n")
## Linear Discriminant Analysis AUC (Training): 0.8612174
cat("Quadratic Discriminant Analysis AUC (Training):", auc(roc_qda), "\n")
## Quadratic Discriminant Analysis AUC (Training): 0.900759
cat("K-Nearest Neighbors AUC (Training):", auc(roc_knn), "\n")
## K-Nearest Neighbors AUC (Training): 0.992797
# As can be seen the k nearest neighbors is the top performer, followed by the quadratic, logistic
# and finally the logistic model
#This illustrates a confusion matrix of this model
augment(car_fit_knn, new_data = carData_train) %>%
conf_mat(truth = Target, estimate = .pred_class) %>%
autoplot(type="heatmap")
# Now let's run the models with our test data
logistic_test_preds <- predict(car_fit, new_data = carData_test, type = "prob") %>%
select(.pred_0)
lda_test_preds <- predict(car_fit_lda, new_data = carData_test, type = "prob") %>%
select(.pred_0)
qda_test_preds <- predict(car_fit_qda, new_data = carData_test, type = "prob") %>%
select(.pred_0)
knn_test_preds <- predict(car_fit_knn, new_data = carData_test, type = "prob") %>%
select(.pred_0)
roc_logistic_test <- roc(carData_test$Target, logistic_test_preds$.pred_0)
roc_lda_test <- roc(carData_test$Target, lda_test_preds$.pred_0)
roc_qda_test <- roc(carData_test$Target, qda_test_preds$.pred_0)
roc_knn_test <- roc(carData_test$Target, knn_test_preds$.pred_0)
cat("Logistic Regression AUC (Testing):", auc(roc_logistic_test), "\n")
## Logistic Regression AUC (Testing): 0.8769131
cat("Linear Discriminant Analysis AUC (Testing):", auc(roc_lda_test), "\n")
## Linear Discriminant Analysis AUC (Testing): 0.8617002
cat("Quadratic Discriminant Analysis AUC (Testing):", auc(roc_qda_test), "\n")
## Quadratic Discriminant Analysis AUC (Testing): 0.9119952
cat("K-Nearest Neighbors AUC (Testing):", auc(roc_knn_test), "\n")
## K-Nearest Neighbors AUC (Testing): 0.9744837
Again the KNN model is the best performer followed by the quadratic model. The logistic model is next followed by the linear model. All models performed pretty well with all values above 85%.
rf_class_spec <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_class_wf <- workflow() %>%
add_model(rf_class_spec) %>%
add_recipe(car_recipe)
rf_grid <- grid_regular(mtry(range = c(1, 6)), trees(range = c(200, 600)),
min_n(range = c(10, 20)), levels = 5)
rf_grid
## # A tibble: 125 × 3
## mtry trees min_n
## <int> <int> <int>
## 1 1 200 10
## 2 2 200 10
## 3 3 200 10
## 4 4 200 10
## 5 6 200 10
## 6 1 300 10
## 7 2 300 10
## 8 3 300 10
## 9 4 300 10
## 10 6 300 10
## # ℹ 115 more rows
tune_class <- tune_grid( rf_class_wf, resamples= carData_folds, grid = rf_grid)
## → A | warning: 6 columns were requested but there were 5 predictors in the data. 5 will be used.
## There were issues with some computations A: x1
## There were issues with some computations A: x4
## There were issues with some computations A: x8
## There were issues with some computations A: x13
## There were issues with some computations A: x17
## There were issues with some computations A: x21
## There were issues with some computations A: x25
## There were issues with some computations A: x25
##
## → B | error: Cannot find current progress bar for `<environment: 0x7fd92c1411b8>`
autoplot(tune_class) + theme_minimal()
show_best(tune_class, n = 1)
## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 6 200 10 roc_auc binary 0.992 1 NA Preprocessor1_Model0…
best_rf_class <- select_best(tune_class)
bt_class_spec <- boost_tree(mtry = tune(), trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
bt_class_wf <- workflow() %>%
add_model(bt_class_spec) %>%
add_recipe(car_recipe)
bt_grid <- grid_regular(mtry(range = c(1, 6)), trees(range = c(200, 600)),learn_rate(range = c(-10, -1)), levels = 5)
bt_grid
## # A tibble: 125 × 3
## mtry trees learn_rate
## <int> <int> <dbl>
## 1 1 200 0.0000000001
## 2 2 200 0.0000000001
## 3 3 200 0.0000000001
## 4 4 200 0.0000000001
## 5 6 200 0.0000000001
## 6 1 300 0.0000000001
## 7 2 300 0.0000000001
## 8 3 300 0.0000000001
## 9 4 300 0.0000000001
## 10 6 300 0.0000000001
## # ℹ 115 more rows
tune_bt_class <- tune_grid( bt_class_wf,
resamples = carData_folds, grid = bt_grid)
autoplot(tune_bt_class) + theme_minimal()
show_best(tune_bt_class, n = 1)
## # A tibble: 1 × 9
## mtry trees learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 200 0.1 roc_auc binary 0.980 15 0.00349 Preprocessor1_M…
best_bt_class <- select_best(tune_bt_class)
final_bt_model <- finalize_workflow(bt_class_wf, best_bt_class)
final_bt_model <- fit(final_bt_model, carData_test)
final_bt_model %>%
extract_fit_parsnip() %>% vip() +
theme_minimal()
final_bt_model_test <- augment(final_bt_model, carData_test) %>%
select(Target, starts_with(".pred"))
roc_auc(final_bt_model_test, truth = Target, .pred_0)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.998
roc_curve(final_bt_model_test, truth = Target, .pred_0) %>%
autoplot()
# The ROC curve is 0.97 which is pretty high meaning that the predictor variables did a good job predicting the outcome.
#The ROC curve displays these results
conf_mat(final_bt_model_test, truth = Target, .pred_class) %>%
autoplot(type = "heatmap")
This confusion matrix indicates that our model performed very well as the amount of false positives and negatives were very low. For almost all data entries in the testing set, the classification model performed mostly accurately.
To further tune our models and come to the conclusion of the best model between KNN and logistic, we will fit them to our folded data using grids and workflows, and then collect their ROC AUC metrics:
car_grid <- grid_regular(neighbors(range = c(1, 10)), levels = 10)
knn_mod <- nearest_neighbor(neighbors = tune()) %>%
set_mode("classification") %>%
set_engine("kknn")
wflow_kknn <- workflow() %>%
add_model(knn_mod) %>% add_recipe(car_recipe)
log_mod <- logistic_reg() %>%
set_mode("classification") %>%
set_engine("glm")
log_workflow <- workflow() %>%
add_model(log_mod) %>%
add_recipe(car_recipe)
tune_knn_car <- tune_grid( wflow_kknn,
resamples = carData_folds,
grid = car_grid)
tune_log_car <- tune_grid( log_workflow,
resamples = carData_folds)
collect_metrics(tune_knn_car)
## # A tibble: 20 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 accuracy binary 0.931 15 0.00506 Preprocessor1_Model01
## 2 1 roc_auc binary 0.928 15 0.00547 Preprocessor1_Model01
## 3 2 accuracy binary 0.934 15 0.00532 Preprocessor1_Model02
## 4 2 roc_auc binary 0.950 15 0.00500 Preprocessor1_Model02
## 5 3 accuracy binary 0.930 15 0.00590 Preprocessor1_Model03
## 6 3 roc_auc binary 0.952 15 0.00540 Preprocessor1_Model03
## 7 4 accuracy binary 0.938 15 0.00585 Preprocessor1_Model04
## 8 4 roc_auc binary 0.960 15 0.00529 Preprocessor1_Model04
## 9 5 accuracy binary 0.938 15 0.00488 Preprocessor1_Model05
## 10 5 roc_auc binary 0.965 15 0.00493 Preprocessor1_Model05
## 11 6 accuracy binary 0.938 15 0.00488 Preprocessor1_Model06
## 12 6 roc_auc binary 0.965 15 0.00484 Preprocessor1_Model06
## 13 7 accuracy binary 0.937 15 0.00490 Preprocessor1_Model07
## 14 7 roc_auc binary 0.966 15 0.00487 Preprocessor1_Model07
## 15 8 accuracy binary 0.936 15 0.00526 Preprocessor1_Model08
## 16 8 roc_auc binary 0.966 15 0.00471 Preprocessor1_Model08
## 17 9 accuracy binary 0.932 15 0.00587 Preprocessor1_Model09
## 18 9 roc_auc binary 0.967 15 0.00471 Preprocessor1_Model09
## 19 10 accuracy binary 0.929 15 0.00635 Preprocessor1_Model10
## 20 10 roc_auc binary 0.966 15 0.00461 Preprocessor1_Model10
collect_metrics(tune_log_car)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.778 15 0.00649 Preprocessor1_Model1
## 2 roc_auc binary 0.868 15 0.00694 Preprocessor1_Model1
The logistic model perfomed the because the roc is the highest on average We can fit the model to the entire training set and assess its performance with the testing set
final_wf_car <- finalize_workflow(log_workflow, tune_log_car$.metrics)
final_wf_car
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
final_fit_car <- fit(final_wf_car, carData_train)
final_fit_car
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) Year Engine.Size..L.
## 1.032e+03 -5.092e-01 -3.473e-01
## Horsepower Torque..lb.ft. X0.60.MPH.Time..seconds.
## 1.283e-02 -6.419e-03 -1.193e+00
##
## Degrees of Freedom: 703 Total (i.e. Null); 698 Residual
## Null Deviance: 917.2
## Residual Deviance: 600.3 AIC: 612.3
augment(final_fit_car, new_data = carData_test) %>%
roc_auc(Target, .pred_0)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.877
As you can see the testing ROC AUC is higher than average across the folds.
At the start of the project we aimed to see whether we could get accurate price classifications from our predictor variables. We first did some exploratory data analysis to examine potential correlations between our predictor variables. The KNN model turned out to be the most useful with the logistic model also useful.
This model could be expanded to include other scores like luxury scores, handling, and aesthetics to provide an accurate score revealing how good of a deal a car’s price is.
Horsepower is the best predictor with engine size next. 0-60 is also a good predictor with torque being the wrost only half as good as horsepower. Year did not turn out to be useful.
The overall model score were quite high, above 90%, indicating that the correlations and predicitve quality turned out to be useful.