Predicting Sports Car Price based on Performance Metrics

Anjan Chakravarti 3/16/2024

Introduction

As someone very passionate about cars, I find one especially interesting problem to be identifying the highest-performance car models at the lowest price. Optimizing performance while finding the best deal is the motivation behind this dataset. The data set is from Kaggle (https://www.kaggle.com/datasets/rkiattisak/sports-car-prices-dataset) and contains sports car data from several luxury car makes and models. All of the predictor variables we will be working with are numerical, and the response variable is categorical with two possible values. In this project we will use the year of the car, its engine size in litres, its horsepower, its torque in lb-ft, and its zero-to-sixty time in seconds to predict whether a car’s price is above or below 100,000 USD. A value of 0 for Target indicates that the price is below 100k and a value of 1 indicates that the price is above 100k. We have 1,007 observations and will be using 5 predictor variables. The goal is to see whether we can predict the price tier of a car given several data points about its performance.

library(tidyverse) 
library(tidymodels) 
library(ggplot2) 
library(corrplot) 
library(ggthemes) 
library(ISLR) 
library(ISLR2) 
library(dplyr) 
library(discrim) 
library(poissonreg) 
library(corrr) 
library(klaR) 
library(pROC) 
library(yardstick) 
library(glmnet) 
library(modeldata) 
library(janitor)
library(naniar)
library(xgboost)
library(ranger)
library(vip)
tidymodels_prefer()
library(visdat)
library(magrittr)
library(kknn)

# read in data
carData <- read.csv(file = "~/Downloads/Sportcarprice.csv")
# Prints top rows of data set
head(carData)
##      Car.Make Car.Model Year Engine.Size..L. Horsepower Torque..lb.ft.
## 1     Porsche       911 2022               3        379            331
## 2 Lamborghini   Huracan 2021             5.2        630            443
## 3     Ferrari   488 GTB 2022             3.9        661            561
## 4        Audi        R8 2022             5.2        562            406
## 5     McLaren      720S 2021               4        710            568
## 6         BMW        M8 2022             4.4        617            553
##   X0.60.MPH.Time..seconds. Price..in.USD. Target
## 1                        4         101200      1
## 2                      2.8         274390      1
## 3                        3         333750      1
## 4                      3.2         142700      1
## 5                      2.7         298000      1
## 6                      3.1         130000      1
# Prints bottom rows of data set
tail(carData)
##        Car.Make Car.Model Year Engine.Size..L. Horsepower Torque..lb.ft.
## 1002    Bugatti    Chiron 2021               8       1479           1180
## 1003 Koenigsegg     Jesko 2022               5       1280           1106
## 1004      Lotus     Evija 2021  Electric Motor       1972           1254
## 1005    McLaren     Senna 2021               4        789            590
## 1006     Pagani    Huayra 2021               6        764            738
## 1007      Rimac    Nevera 2021  Electric Motor       1888           1696
##      X0.60.MPH.Time..seconds. Price..in.USD. Target
## 1002                      2.4      3,000,000      1
## 1003                      2.5      3,000,000      1
## 1004                        2      2,000,000      1
## 1005                      2.7      1,000,000      1
## 1006                        3      2,600,000      1
## 1007                     1.85      2,400,000      1
# Dimensions of data set
dim(carData)
## [1] 1007    9
# Look at missing data points
vis_miss(carData)

As can be seen, there is no missing data in our dataset at this stage (the missing values handled below are introduced when text entries such as “Electric Motor” are coerced to numeric). The goal of this project is to see if we can accurately predict the price tier of a car given 5 predictor values relating to its performance. These results could be very helpful in allowing car buyers to find the most performance for the price, and could give key indications of which vehicles are overpriced or underpriced. Further research could add handling and luxury predictor variables to provide a fuller indication of a car’s worth at a given price. Let’s begin the EDA below.

Exploratory Data Analysis

# Getting rid of commas in the price column
carData$Price..in.USD. <- gsub(",", "", carData$Price..in.USD.)

# Converts our dataframe columns into numeric and integer values; non-numeric
# entries (such as "Electric Motor" in the engine size column) are coerced to
# NA and imputed below
carData$Year <- as.integer(carData$Year)
carData$Engine.Size..L. <- as.numeric(carData$Engine.Size..L.)
carData$Horsepower <- as.integer(carData$Horsepower)
carData$Torque..lb.ft. <- as.integer(carData$Torque..lb.ft.)
carData$X0.60.MPH.Time..seconds. <- as.numeric(carData$X0.60.MPH.Time..seconds.)
carData$Price..in.USD. <- as.integer(carData$Price..in.USD.)
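
As a quick sanity check, we can recompute the tier from the cleaned price column and compare it against the provided Target column. This assumes the tier boundary is 100,000 USD, with prices at or above it coded as 1 (output not shown):

# Sanity check: recompute the price tier from the cleaned price column and
# cross-tabulate it against the provided Target (assumes a 100,000 USD boundary)
recomputed_tier <- as.integer(carData$Price..in.USD. >= 100000)
table(recomputed = recomputed_tier, provided = carData$Target)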

# Stores columns of data set
colnames(carData)
## [1] "Car.Make"                 "Car.Model"               
## [3] "Year"                     "Engine.Size..L."         
## [5] "Horsepower"               "Torque..lb.ft."          
## [7] "X0.60.MPH.Time..seconds." "Price..in.USD."          
## [9] "Target"
make <- carData[1]
model <- carData[2]
year <- carData[3]
engine <- carData[4]
horsepower <- carData[5]
torque <- carData[6]
zerosixty <- carData[7]
price <- carData[8]

# Imputes the missing values introduced by the coercion above, replacing each
# NA with the mean of the non-missing values in its column
carData$Horsepower[is.na(carData$Horsepower)] <- mean(carData$Horsepower, na.rm = TRUE)
carData$Engine.Size..L.[is.na(carData$Engine.Size..L.)] <- mean(carData$Engine.Size..L., na.rm = TRUE)
carData$Torque..lb.ft.[is.na(carData$Torque..lb.ft.)] <- mean(carData$Torque..lb.ft., na.rm = TRUE)
carData$X0.60.MPH.Time..seconds.[is.na(carData$X0.60.MPH.Time..seconds.)] <- mean(carData$X0.60.MPH.Time..seconds., na.rm = TRUE)

# Checks if there are any missing values left
sum(is.na(carData$Year))
## [1] 0
sum(is.na(carData$Engine.Size..L.))
## [1] 0
sum(is.na(carData$X0.60.MPH.Time..seconds.))
## [1] 0
sum(is.na(carData$Horsepower))
## [1] 0
sum(is.na(carData$Torque..lb.ft.))
## [1] 0

This displays a histogram showing the range of horsepower values in our dataset. We expect horsepower to be a significant predictor of price because higher engine power usually indicates higher performance and hence a higher price.

carData %>%
  ggplot(aes(x = Horsepower)) +
  geom_histogram(bins = 60) +
  theme_bw()

This scatterplot puts 0-60 time on the x-axis and torque on the y-axis. We can see a negative correlation: as 0-60 time increases, torque decreases. This makes intuitive sense because a slower car will have lower torque and engine power.

carData %>%
  ggplot(aes(x = X0.60.MPH.Time..seconds., y = Torque..lb.ft.)) +
  geom_point() +
  labs(x = "0-60 Time", y = "Torque") +
  ggtitle("Scatter Plot")

This scatterplot puts horsepower on the x-axis and torque on the y-axis. We can see a positive correlation: as horsepower increases, torque increases as well. This makes intuitive sense because horsepower is mechanically linked to torque, so the two would logically be correlated.

carData %>%
  ggplot(aes(x = Horsepower, y = Torque..lb.ft.)) +
  geom_point() +
  labs(x = "Horsepower", y = "Torque") +
  ggtitle("Scatter Plot")

This boxplot shows the price tier on the x-axis and engine size on the y-axis, giving an indication of how well engine size can separate the two price tiers.

carData %>%
  ggplot(aes(x = factor(Target), y = Engine.Size..L.)) +
  geom_boxplot() +
  # alpha must be in [0, 1]; a low value reduces overplotting of the points
  geom_jitter(alpha = 0.2) +
  labs(title = "Price Tier by Engine Size (litres)",
       x = "Price Tier",
       y = "Engine Size")

numeric_columns <- carData %>% select(where(is.numeric))

numeric_columns %>%
  cor() %>%
  corrplot::corrplot(type = 'lower', diag = TRUE,
                     method = 'color')


This correlation plot reveals a lot of useful information about our dataset. Engine size, torque, and horsepower all have positive correlations with one another, indicating that as one increases the others tend to increase as well. 0-60 time has a negative correlation with every other predictor variable, which makes sense: a more expensive, higher-performance car should reach sixty faster. The plot confirms that the expected relationships between the predictor variables hold, including the expectation that horsepower, torque, and engine size are linked. The classification models that follow will better quantify the predictive power of our variables.
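
As a complementary view, the corrr package (already loaded) can list the same pairwise correlations in a long, sorted format; a minimal sketch (output not shown):

# Lists the pairwise correlations in long form, strongest relationships first
numeric_columns %>%
  corrr::correlate() %>%
  corrr::stretch() %>%
  arrange(desc(abs(r)))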

Data Splitting

First we will split our data into training and testing sets, and then we will create a recipe for the data.

set.seed(3436)

# The response must be a factor for classification; the predictor columns and
# price are already numeric after the cleaning above
carData <- carData %>%
  mutate(Target = factor(Target))


# Note: strata = NULL means this split is not stratified by the response
carData_split <- initial_split(carData, prop = 0.7, strata = NULL)
carData_train <- training(carData_split)
carData_test <- testing(carData_split)
carData_folds <- vfold_cv(carData_train, v=5, repeats=3)
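
Since the split above is not stratified, it is worth verifying that the training and testing sets carry a similar class balance. A quick check (output not shown):

# Proportion of each price tier in the training and testing splits
prop.table(table(carData_train$Target))
prop.table(table(carData_test$Target))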

Let’s take a look at the correlations between the continuous variables in our data set by examining the corresponding correlation plot and matrix.

continuous_vars <- carData_train %>% select(where(is.numeric))
correlation_matrix <- cor(continuous_vars)
corrplot(correlation_matrix, method = "color", type = "upper", tl.col = "black")

print(correlation_matrix)
##                                   Year Engine.Size..L.    Horsepower
## Year                      1.0000000000      -0.2182691 -0.0007119097
## Engine.Size..L.          -0.2182690966       1.0000000  0.4366507040
## Horsepower               -0.0007119097       0.4366507  1.0000000000
## Torque..lb.ft.            0.0183172892       0.4303241  0.9365013918
## X0.60.MPH.Time..seconds.  0.0304448216      -0.3258643 -0.7151725543
## Price..in.USD.           -0.0508095298       0.3729523  0.7802859437
##                          Torque..lb.ft. X0.60.MPH.Time..seconds. Price..in.USD.
## Year                         0.01831729               0.03044482    -0.05080953
## Engine.Size..L.              0.43032406              -0.32586430     0.37295230
## Horsepower                   0.93650139              -0.71517255     0.78028594
## Torque..lb.ft.               1.00000000              -0.66391707     0.72364782
## X0.60.MPH.Time..seconds.    -0.66391707               1.00000000    -0.48993932
## Price..in.USD.               0.72364782              -0.48993932     1.00000000
carData$Target <- factor(carData$Target)
carData_train$Target <- factor(carData_train$Target)
carData_test$Target <- factor(carData_test$Target)
# Let's use our training data to create a recipe for predicting our response variable
car_recipe <- recipe(Target ~ Year + Engine.Size..L. + Horsepower + Torque..lb.ft. + X0.60.MPH.Time..seconds., data = carData_train)
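
Note that this recipe has no preprocessing steps. For distance-based models such as KNN, normalizing the predictors usually helps; a hedged variant (car_recipe_norm is an illustrative name, not used in the fits below) could look like:

# A normalized variant of the recipe; step_normalize centers and scales all
# numeric predictors, which KNN in particular tends to benefit from
car_recipe_norm <- recipe(Target ~ Year + Engine.Size..L. + Horsepower +
                            Torque..lb.ft. + X0.60.MPH.Time..seconds.,
                          data = carData_train) %>%
  step_normalize(all_numeric_predictors())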

We have our recipe and have split the data (note again that the split was not stratified; stratifying on Target would be a reasonable refinement). We can now create different classification models using different engines.

# This is a logistic regression model
logistic_model <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")
car_workflow <- workflow() %>% 
  add_model(logistic_model) %>% 
  add_recipe(car_recipe)


car_fit <- fit(car_workflow, carData_train) 
car_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##              (Intercept)                      Year           Engine.Size..L.  
##                1.032e+03                -5.092e-01                -3.473e-01  
##               Horsepower            Torque..lb.ft.  X0.60.MPH.Time..seconds.  
##                1.283e-02                -6.419e-03                -1.193e+00  
## 
## Degrees of Freedom: 703 Total (i.e. Null);  698 Residual
## Null Deviance:       917.2 
## Residual Deviance: 600.3     AIC: 612.3
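
For a more readable view of the fitted coefficients, the parsnip fit can be tidied with broom (output not shown):

# Extracts the underlying glm fit and summarizes its coefficients,
# standard errors, and p-values in a tibble
car_fit %>%
  extract_fit_parsnip() %>%
  tidy()
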
# This is a linear discriminant analysis model for classification.
lda_model <- discrim_linear() %>% 
  set_engine("MASS") %>% 
  set_mode("classification")

car_workflow_lda <- workflow() %>% 
  add_recipe(car_recipe) %>% 
  add_model(lda_model)
car_fit_lda <- fit(car_workflow_lda, data = carData_train) 
car_fit_lda
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: discrim_linear()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Call:
## lda(..y ~ ., data = data)
## 
## Prior probabilities of groups:
##         0         1 
## 0.3565341 0.6434659 
## 
## Group means:
##       Year Engine.Size..L. Horsepower Torque..lb.ft. X0.60.MPH.Time..seconds.
## 0 2021.430        4.014707   463.9920       427.2390                 4.101195
## 1 2021.172        4.583301   717.6945       605.4072                 3.177892
## 
## Coefficients of linear discriminants:
##                                   LD1
## Year                     -0.240313940
## Engine.Size..L.          -0.006991305
## Horsepower                0.001485402
## Torque..lb.ft.           -0.001841505
## X0.60.MPH.Time..seconds. -1.506258790
# This is a quadratic discriminant analysis model for classification.
qda_model <- discrim_quad() %>% 
  set_mode("classification") %>% 
  set_engine("MASS")
car_workflow_qda <- workflow() %>% add_recipe(car_recipe) %>% add_model(qda_model)
car_fit_qda <- fit(car_workflow_qda, data = carData_train) 
car_fit_qda
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: discrim_quad()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Call:
## qda(..y ~ ., data = data)
## 
## Prior probabilities of groups:
##         0         1 
## 0.3565341 0.6434659 
## 
## Group means:
##       Year Engine.Size..L. Horsepower Torque..lb.ft. X0.60.MPH.Time..seconds.
## 0 2021.430        4.014707   463.9920       427.2390                 4.101195
## 1 2021.172        4.583301   717.6945       605.4072                 3.177892
# Finally, this is a KNN model (k-nearest neighbors) with k fixed at 7; k is tuned later on
knn_model <- nearest_neighbor(neighbors = 7) %>% 
  set_mode("classification") %>% 
  set_engine("kknn")
car_workflow_knn <- workflow() %>% 
  add_recipe(car_recipe) %>% 
  add_model(knn_model)
car_fit_knn <- fit(car_workflow_knn, data = carData_train) 
car_fit_knn
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(7,     data, 5))
## 
## Type of response variable: nominal
## Minimal misclassification: 0.05823864
## Best kernel: optimal
## Best k: 7
# We can use area under the ROC curve to measure each model's performance on the training data
logistic_preds <- predict(car_fit, new_data = carData_train, type = "prob") %>% select(.pred_0)
lda_preds <- predict(car_fit_lda, new_data = carData_train, type = "prob") %>% select(.pred_0)
qda_preds <- predict(car_fit_qda, new_data = carData_train, type = "prob") %>% select(.pred_0)
knn_preds <- predict(car_fit_knn, new_data = carData_train, type = "prob") %>% select(.pred_0)
four_models <- bind_cols(logistic_preds, lda_preds, qda_preds, knn_preds) 
four_models
## # A tibble: 704 × 4
##    .pred_0...1 .pred_0...2 .pred_0...3 .pred_0...4
##          <dbl>       <dbl>       <dbl>       <dbl>
##  1    0.428         0.376     5.57e- 1       0.182
##  2    0.291         0.274     7.94e- 1       0    
##  3    0.310         0.319     7.09e- 1       0    
##  4    0.0569        0.126     1.76e- 2       0    
##  5    0.0339        0.0643    2.74e- 4       0    
##  6    0.892         0.923     9.98e- 1       1    
##  7    0.235         0.243     6.53e- 2       0    
##  8    0.877         0.922     9.98e- 1       0.985
##  9    0.000725      0.0189    7.76e-16       0    
## 10    0.198         0.291     7.86e- 1       1    
## # ℹ 694 more rows
roc_logistic <- roc(carData_train$Target, logistic_preds$.pred_0) 
roc_lda <- roc(carData_train$Target, lda_preds$.pred_0)
roc_qda <- roc(carData_train$Target, qda_preds$.pred_0)
roc_knn <- roc(carData_train$Target, knn_preds$.pred_0)

cat("Logistic Regression AUC (Training):", auc(roc_logistic), "\n")
## Logistic Regression AUC (Training): 0.8720966
cat("Linear Discriminant Analysis AUC (Training):", auc(roc_lda), "\n")
## Linear Discriminant Analysis AUC (Training): 0.8612174
cat("Quadratic Discriminant Analysis AUC (Training):", auc(roc_qda), "\n")
## Quadratic Discriminant Analysis AUC (Training): 0.900759
cat("K-Nearest Neighbors AUC (Training):", auc(roc_knn), "\n")
## K-Nearest Neighbors AUC (Training): 0.992797
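
The four training ROC curves can also be overlaid for a visual comparison using pROC's ggroc (plot not shown):

# Overlays the four ROC curves computed above on a single plot
ggroc(list(Logistic = roc_logistic, LDA = roc_lda,
           QDA = roc_qda, KNN = roc_knn)) +
  theme_bw()
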
# As can be seen, k-nearest neighbors is the top performer, followed by the
# quadratic model, the logistic model, and finally the linear discriminant model

# This illustrates a confusion matrix for the KNN model on the training data
augment(car_fit_knn, new_data = carData_train) %>% 
  conf_mat(truth = Target, estimate = .pred_class) %>% 
  autoplot(type="heatmap")

# Now let's run the models with our test data
logistic_test_preds <- predict(car_fit, new_data = carData_test, type = "prob") %>% 
  select(.pred_0)
lda_test_preds <- predict(car_fit_lda, new_data = carData_test, type = "prob") %>% 
  select(.pred_0)
qda_test_preds <- predict(car_fit_qda, new_data = carData_test, type = "prob") %>% 
  select(.pred_0)
knn_test_preds <- predict(car_fit_knn, new_data = carData_test, type = "prob") %>% 
  select(.pred_0)
roc_logistic_test <- roc(carData_test$Target, logistic_test_preds$.pred_0) 
roc_lda_test <- roc(carData_test$Target, lda_test_preds$.pred_0) 
roc_qda_test <- roc(carData_test$Target, qda_test_preds$.pred_0) 
roc_knn_test <- roc(carData_test$Target, knn_test_preds$.pred_0)

cat("Logistic Regression AUC (Testing):", auc(roc_logistic_test), "\n")
## Logistic Regression AUC (Testing): 0.8769131
cat("Linear Discriminant Analysis AUC (Testing):", auc(roc_lda_test), "\n")
## Linear Discriminant Analysis AUC (Testing): 0.8617002
cat("Quadratic Discriminant Analysis AUC (Testing):", auc(roc_qda_test), "\n")
## Quadratic Discriminant Analysis AUC (Testing): 0.9119952
cat("K-Nearest Neighbors AUC (Testing):", auc(roc_knn_test), "\n")
## K-Nearest Neighbors AUC (Testing): 0.9744837

Again the KNN model is the best performer, followed by the quadratic model, then the logistic model, and finally the linear model. All models performed quite well, with every test AUC above 0.86.
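
For reference, the test AUCs above can be gathered into a single table, using the roc objects already computed (output not shown):

# Collects the four test AUCs into one tibble for side-by-side comparison
tibble(model = c("Logistic", "LDA", "QDA", "KNN"),
       test_auc = c(auc(roc_logistic_test), auc(roc_lda_test),
                    auc(roc_qda_test), auc(roc_knn_test)))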

rf_class_spec <- rand_forest(mtry = tune(), 
                             trees = tune(),
                             min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_class_wf <- workflow() %>% 
  add_model(rf_class_spec) %>% 
  add_recipe(car_recipe)

rf_grid <- grid_regular(mtry(range = c(1, 6)), trees(range = c(200, 600)),
                        min_n(range = c(10, 20)), levels = 5)
rf_grid
## # A tibble: 125 × 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     1   200    10
##  2     2   200    10
##  3     3   200    10
##  4     4   200    10
##  5     6   200    10
##  6     1   300    10
##  7     2   300    10
##  8     3   300    10
##  9     4   300    10
## 10     6   300    10
## # ℹ 115 more rows
tune_class <- tune_grid( rf_class_wf, resamples= carData_folds, grid = rf_grid)
## → A | warning: 6 columns were requested but there were 5 predictors in the data. 5 will be used.
## There were issues with some computations   A: x1
## There were issues with some computations   A: x4
## There were issues with some computations   A: x8
## There were issues with some computations   A: x13
## There were issues with some computations   A: x17
## There were issues with some computations   A: x21
## There were issues with some computations   A: x25
## There were issues with some computations   A: x25
## 
## → B | error:   Cannot find current progress bar for `<environment: 0x7fd92c1411b8>`
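
The warning above occurs because the mtry grid extends to 6 while the data only has 5 predictors, so ranger falls back to 5. A hedged fix (rf_grid_fixed is an illustrative name, not rerun here) is to cap the grid at the predictor count:

# Capping mtry at the number of predictors (5) avoids the ranger warning
rf_grid_fixed <- grid_regular(mtry(range = c(1, 5)),
                              trees(range = c(200, 600)),
                              min_n(range = c(10, 20)), levels = 5)
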
autoplot(tune_class) + theme_minimal()

show_best(tune_class, n = 1)
## # A tibble: 1 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     6   200    10 roc_auc binary     0.992     1      NA Preprocessor1_Model0…
best_rf_class <- select_best(tune_class)
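
For completeness, the tuned random forest could be finalized and fit on the training data the same way as the boosted tree below; a sketch (final_rf_model is an illustrative name, not run here):

# Finalizes the random forest workflow with the best parameters and fits it
final_rf_model <- finalize_workflow(rf_class_wf, best_rf_class)
final_rf_model <- fit(final_rf_model, carData_train)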


bt_class_spec <- boost_tree(mtry = tune(), trees = tune(),
                            learn_rate = tune()) %>% 
  set_engine("xgboost") %>%
  set_mode("classification")

bt_class_wf <- workflow() %>% 
  add_model(bt_class_spec) %>% 
  add_recipe(car_recipe)


bt_grid <- grid_regular(mtry(range = c(1, 6)), trees(range = c(200, 600)),learn_rate(range = c(-10, -1)), levels = 5)
bt_grid
## # A tibble: 125 × 3
##     mtry trees   learn_rate
##    <int> <int>        <dbl>
##  1     1   200 0.0000000001
##  2     2   200 0.0000000001
##  3     3   200 0.0000000001
##  4     4   200 0.0000000001
##  5     6   200 0.0000000001
##  6     1   300 0.0000000001
##  7     2   300 0.0000000001
##  8     3   300 0.0000000001
##  9     4   300 0.0000000001
## 10     6   300 0.0000000001
## # ℹ 115 more rows
tune_bt_class <- tune_grid( bt_class_wf,
                            resamples = carData_folds, grid = bt_grid)

autoplot(tune_bt_class) + theme_minimal()

show_best(tune_bt_class, n = 1)
## # A tibble: 1 × 9
##    mtry trees learn_rate .metric .estimator  mean     n std_err .config         
##   <int> <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
## 1     2   200        0.1 roc_auc binary     0.980    15 0.00349 Preprocessor1_M…
best_bt_class <- select_best(tune_bt_class)


final_bt_model <- finalize_workflow(bt_class_wf, best_bt_class)
# Fit the finalized model on the training data; fitting on the test set would
# leak information into the evaluation below
final_bt_model <- fit(final_bt_model, carData_train)


final_bt_model %>% 
  extract_fit_parsnip() %>% vip() +
  theme_minimal()

final_bt_model_test <- augment(final_bt_model, carData_test) %>%
  select(Target, starts_with(".pred")) 
roc_auc(final_bt_model_test, truth = Target, .pred_0)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.998
roc_curve(final_bt_model_test, truth = Target, .pred_0) %>% 
  autoplot()

# The test ROC AUC shown above is close to 1, which is quite high, meaning that
# the predictor variables did a good job predicting the outcome.
# The ROC curve displays these results
conf_mat(final_bt_model_test, truth = Target, .pred_class) %>%
  autoplot(type = "heatmap")

This confusion matrix indicates that our model performed very well, as the numbers of false positives and false negatives were very low. For almost all entries in the testing set, the classification model predicted the correct tier.
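
Beyond the confusion matrix, yardstick can report class-level metrics from the same predictions; a quick sketch (output not shown):

# Accuracy, sensitivity, and specificity for the boosted tree test predictions
final_bt_model_test %>% accuracy(truth = Target, estimate = .pred_class)
final_bt_model_test %>% sens(truth = Target, estimate = .pred_class)
final_bt_model_test %>% spec(truth = Target, estimate = .pred_class)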

Model Fitting

To further tune our models and decide between the KNN and logistic models, we will fit them to our folded data using grids and workflows, and then collect their ROC AUC metrics:

car_grid <- grid_regular(neighbors(range = c(1, 10)), levels = 10) 
knn_mod <- nearest_neighbor(neighbors = tune()) %>%
  set_mode("classification") %>%
  set_engine("kknn") 
wflow_kknn <- workflow() %>%
  add_model(knn_mod) %>% add_recipe(car_recipe)

log_mod <- logistic_reg() %>% 
  set_mode("classification") %>% 
  set_engine("glm")
log_workflow <- workflow() %>% 
  add_model(log_mod) %>% 
  add_recipe(car_recipe)

tune_knn_car <- tune_grid( wflow_kknn,
                           resamples = carData_folds,
                           grid = car_grid)
tune_log_car <- tune_grid( log_workflow,
                           resamples = carData_folds)

collect_metrics(tune_knn_car)
## # A tibble: 20 × 7
##    neighbors .metric  .estimator  mean     n std_err .config              
##        <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
##  1         1 accuracy binary     0.931    15 0.00506 Preprocessor1_Model01
##  2         1 roc_auc  binary     0.928    15 0.00547 Preprocessor1_Model01
##  3         2 accuracy binary     0.934    15 0.00532 Preprocessor1_Model02
##  4         2 roc_auc  binary     0.950    15 0.00500 Preprocessor1_Model02
##  5         3 accuracy binary     0.930    15 0.00590 Preprocessor1_Model03
##  6         3 roc_auc  binary     0.952    15 0.00540 Preprocessor1_Model03
##  7         4 accuracy binary     0.938    15 0.00585 Preprocessor1_Model04
##  8         4 roc_auc  binary     0.960    15 0.00529 Preprocessor1_Model04
##  9         5 accuracy binary     0.938    15 0.00488 Preprocessor1_Model05
## 10         5 roc_auc  binary     0.965    15 0.00493 Preprocessor1_Model05
## 11         6 accuracy binary     0.938    15 0.00488 Preprocessor1_Model06
## 12         6 roc_auc  binary     0.965    15 0.00484 Preprocessor1_Model06
## 13         7 accuracy binary     0.937    15 0.00490 Preprocessor1_Model07
## 14         7 roc_auc  binary     0.966    15 0.00487 Preprocessor1_Model07
## 15         8 accuracy binary     0.936    15 0.00526 Preprocessor1_Model08
## 16         8 roc_auc  binary     0.966    15 0.00471 Preprocessor1_Model08
## 17         9 accuracy binary     0.932    15 0.00587 Preprocessor1_Model09
## 18         9 roc_auc  binary     0.967    15 0.00471 Preprocessor1_Model09
## 19        10 accuracy binary     0.929    15 0.00635 Preprocessor1_Model10
## 20        10 roc_auc  binary     0.966    15 0.00461 Preprocessor1_Model10
collect_metrics(tune_log_car)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.778    15 0.00649 Preprocessor1_Model1
## 2 roc_auc  binary     0.868    15 0.00694 Preprocessor1_Model1
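
The best neighbor count can also be pulled out directly (output not shown):

# Shows the top KNN configuration by ROC AUC across the folds
show_best(tune_knn_car, metric = "roc_auc", n = 1)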

Across the folds, the KNN model actually achieves the higher ROC AUC on average (roughly 0.97 at its best k), while the logistic model sits at about 0.87. The logistic model is far simpler, however, so we will fit it to the entire training set and assess its performance on the testing set.

final_wf_car <- finalize_workflow(log_workflow, select_best(tune_log_car))
final_wf_car
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
final_fit_car <- fit(final_wf_car, carData_train) 
final_fit_car
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##              (Intercept)                      Year           Engine.Size..L.  
##                1.032e+03                -5.092e-01                -3.473e-01  
##               Horsepower            Torque..lb.ft.  X0.60.MPH.Time..seconds.  
##                1.283e-02                -6.419e-03                -1.193e+00  
## 
## Degrees of Freedom: 703 Total (i.e. Null);  698 Residual
## Null Deviance:       917.2 
## Residual Deviance: 600.3     AIC: 612.3
augment(final_fit_car, new_data = carData_test) %>% 
  roc_auc(Target, .pred_0)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.877

As you can see, the testing ROC AUC (0.877) is slightly higher than the average ROC AUC across the folds (0.868).

Conclusion

At the start of the project we aimed to see whether we could produce accurate price-tier classifications from our predictor variables. We first did some exploratory data analysis to examine potential correlations between the predictors. The KNN model turned out to be the most useful, with the logistic model also performing reasonably well.

This model could be expanded to include other factors such as luxury, handling, and aesthetics scores to provide a fuller indication of how good a deal a car’s price is.

Horsepower is the best predictor, with engine size next. 0-60 time is also a good predictor, while torque is the worst, only about half as good as horsepower. Year did not turn out to be useful.

The overall model scores were quite high, above 90%, indicating that the correlations and predictive quality of our variables turned out to be useful.