0.1 Introduction

For my final homework assignment I decided to take a look at a dataset that contains information about cars and the various characteristics that help determine their selling price. I will use the tidymodels package to help streamline my workflow of feature engineering and model creation.

0.1.1 Importing The Data

First let’s load all our relevant libraries and import the data. The cars dataset contains 8128 observations and 13 variables. Our target variable is the selling price, with all of the other variables serving as predictors.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
cars <- read.csv("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 622\\Homework4\\Car details v3.csv")

head(cars)
##                            name year selling_price km_driven   fuel seller_type
## 1        Maruti Swift Dzire VDI 2014        450000    145500 Diesel  Individual
## 2  Skoda Rapid 1.5 TDI Ambition 2014        370000    120000 Diesel  Individual
## 3      Honda City 2017-2020 EXi 2006        158000    140000 Petrol  Individual
## 4     Hyundai i20 Sportz Diesel 2010        225000    127000 Diesel  Individual
## 5        Maruti Swift VXI BSIII 2007        130000    120000 Petrol  Individual
## 6 Hyundai Xcent 1.2 VTVT E Plus 2017        440000     45000 Petrol  Individual
##   transmission        owner    mileage  engine  max_power
## 1       Manual  First Owner  23.4 kmpl 1248 CC     74 bhp
## 2       Manual Second Owner 21.14 kmpl 1498 CC 103.52 bhp
## 3       Manual  Third Owner  17.7 kmpl 1497 CC     78 bhp
## 4       Manual  First Owner  23.0 kmpl 1396 CC     90 bhp
## 5       Manual  First Owner  16.1 kmpl 1298 CC   88.2 bhp
## 6       Manual  First Owner 20.14 kmpl 1197 CC  81.86 bhp
##                     torque seats
## 1           190Nm@ 2000rpm     5
## 2      250Nm@ 1500-2500rpm     5
## 3    12.7@ 2,700(kgm@ rpm)     5
## 4 22.4 kgm at 1750-2750rpm     5
## 5    11.5@ 4,500(kgm@ rpm)     5
## 6        113.75nm@ 4000rpm     5

0.1.2 Data Exploration

We can see that there are a few missing values within our dataset.

library(visdat)
vis_dat(cars)
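To put numbers on this, a quick base-R check (a sketch; output omitted) can count missing or blank entries per column. Note that in this CSV the character columns arrive with blanks rather than true NAs:

## Count NA or blank entries per column (blanks behave as missing here)
colSums(is.na(cars) | cars == "")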

The name column contains a very large number of distinct car names, essentially a per-listing label, so it is unlikely to help predict car prices; we may delete this column.

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
## Caution: unique(length(...)) just returns the row count (8128);
## length(unique(cars$name)) would count the distinct names.
unique(length(cars$name))
## [1] 8128
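Rather than dropping name outright, one alternative (a sketch, not used in this workflow; brand is a hypothetical new column) is to keep just the make, the first word of the name, as a lower-cardinality feature:

## Sketch (not run): extract the make as a coarser categorical feature
## cars$brand <- word(cars$name, 1)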

Most cars in this dataset were made in the 2010s; the distribution of model years is left-skewed, with prominent peaks in the most recent years.

## Visualize the distribution of model years

ggplot(cars,aes(x = year)) +
  geom_histogram() + 
  ggtitle("Distribution of Cars made within a certain year") +
  xlab("Year") + 
  ylab("Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## The subtitle argument adds text under the title
ggplot(cars,aes(x = transmission)) + 
  geom_bar(fill = "darkblue") + 
  labs(x = "Car Transmission", y = "Count", subtitle = "A greater proportion of Manual Transmission than Auto") + 
  ggtitle("Proportion of Cars Transmission") +
  theme(
    plot.title = element_text(size = 15)
  )

We can look at the fuel type as well.

ggplot(cars,aes(x = fuel)) + 
  geom_bar(fill = "purple") +
  labs(x = "Fuel Type", y = "Count") +
  ggtitle("Fuel Type For Cars",subtitle = "Greater proportion of Diesel and Petrol Cars Than any others") +
  theme(
    plot.title = element_text(size = 15)
  )

0.2 Data Cleaning

In order to view the distributions of the numeric predictors in our dataset, we first need to perform some data cleaning.

## Let's clean the characters out of some of these columns and convert them to numeric.

## Mileage: reduce values like "23.4 kmpl" to whole numbers with no decimal points
cars$mileage <- str_extract(cars$mileage, "(.*)\\.\\d+")  ## keep the numeric part, e.g. "23.4"
cars$mileage <- str_extract(cars$mileage,"(.*)\\.")       ## keep everything up to the decimal point
cars$mileage <- str_replace(cars$mileage,"\\.","")        ## drop the trailing "."
cars$mileage <- as.numeric(cars$mileage)


## Engine: keep the number before the space, e.g. "1248 CC" -> 1248
cars$engine <- str_extract(cars$engine, "(.*)\\s")
cars$engine <- as.numeric(cars$engine)


## Max power: strip the "bhp" unit first
cars$max_power <- str_extract(cars$max_power,"(.*)\\s")

## Direct conversion errors on malformed entries, so keep just the 2-3 digit whole-number part
cars$max_power <- str_extract(cars$max_power,"\\d{2,3}")
cars$max_power <- as.numeric(cars$max_power)
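As an aside, these regex steps truncate the decimal parts; a more compact alternative that keeps them is readr::parse_number(), which pulls the leading number out of strings like "23.4 kmpl". This is a sketch that would replace the three blocks above, so it is not run here:

## Sketch (not run): one-step numeric parsing that keeps the decimals
## cars$mileage   <- parse_number(cars$mileage)
## cars$engine    <- parse_number(cars$engine)
## cars$max_power <- parse_number(cars$max_power)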

Torque is a measurement of your car’s ability to do work; the more torque, the greater the amount of power an engine can produce.

Ideally we would split the torque/rpm string into two separate columns, but its formats are inconsistent, so we drop the column instead.

## Remove columns we won't model on: name, owner, and torque (which was a bit difficult to parse)
Cars <- cars %>%
  select(-c(torque,name,owner)) %>%
  na.omit()
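For reference, here is a rough sketch of how that split could look, assuming the common "190Nm@ 2000rpm" pattern seen above (kgm entries and rpm ranges would need extra handling and unit conversion; torque_val and torque_rpm are illustrative names):

## Sketch: pull a numeric torque value and the rpm figure out of the raw strings
torque_val <- as.numeric(str_extract(cars$torque, "^[0-9.]+"))
torque_rpm <- as.numeric(str_remove_all(str_extract(cars$torque, "[0-9,]+(?=\\s*rpm)"), ","))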

0.2.1 Correlation Plot

library(corrplot)
## corrplot 0.92 loaded
ggp1 <- Cars %>%
  select_if(is.numeric)


## Note the strong collinearity (e.g. engine vs. max_power); this does not bode well for a linear model.
cor(ggp1)
##                       year selling_price  km_driven    mileage     engine
## year           1.000000000    0.41230156 -0.4285485  0.3306907  0.0182631
## selling_price  0.412301558    1.00000000 -0.2221585 -0.1255353  0.4556818
## km_driven     -0.428548483   -0.22215848  1.0000000 -0.1749108  0.2060307
## mileage        0.330690729   -0.12553530 -0.1749108  1.0000000 -0.5775525
## engine         0.018263100    0.45568180  0.2060307 -0.5775525  1.0000000
## max_power      0.224465283    0.74997391 -0.0371659 -0.3679525  0.7048962
## seats         -0.007923033    0.04161669  0.2272594 -0.4571174  0.6111034
##                max_power        seats
## year           0.2244653 -0.007923033
## selling_price  0.7499739  0.041616694
## km_driven     -0.0371659  0.227259388
## mileage       -0.3679525 -0.457117443
## engine         0.7048962  0.611103386
## max_power      1.0000000  0.192978714
## seats          0.1929787  1.000000000
corrplot(cor(ggp1),method = "number")

0.2.2 Plotting The Distributions

ggp1 <- ggp1 %>%
  pivot_longer(everything()) %>%   ## stack every numeric column into name/value pairs
  as.data.frame()

We can see that all of the numeric predictors are either skewed to the left or to the right.

ggplot(ggp1,aes(x = value)) +
  geom_histogram() +
  facet_wrap(~name, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
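Given this skew, one common remedy (not applied in this workflow) would be to log-transform the target inside a recipe; a minimal sketch, using the recipes package loaded in the next section (log_rec is an illustrative name):

## Sketch (not run here): model log10(selling_price) instead of the raw, skewed price
## log_rec <- recipe(selling_price ~ ., data = Cars) %>%
##   step_log(selling_price, base = 10)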

0.3 Machine Learning

Time to apply some machine-learning algorithms from the weeks of the curriculum. Using tidymodels, I will fit a linear regression, a random forest, and a boosted-tree (XGBoost) model, compare them with cross-validation, and evaluate them on the testing data; no hyperparameter tuning is attempted here.

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──
## ✔ broom        1.0.5     ✔ rsample      1.1.1
## ✔ dials        1.2.0     ✔ tune         1.1.1
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.1.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.0     ✔ yardstick    1.2.0
## ✔ recipes      1.0.6
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
set.seed(125)
price_split <- Cars %>%
  initial_split(prop = 0.75,strata = selling_price)

price_train <- training(price_split)
price_test <- testing(price_split)


## Create 10 cross-validation folds, stratified by selling price
set.seed(125)
price_folds <- vfold_cv(price_train,strata = selling_price)
price_folds
## #  10-fold cross-validation using stratification 
## # A tibble: 10 × 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [5332/596]> Fold01
##  2 <split [5333/595]> Fold02
##  3 <split [5335/593]> Fold03
##  4 <split [5336/592]> Fold04
##  5 <split [5336/592]> Fold05
##  6 <split [5336/592]> Fold06
##  7 <split [5336/592]> Fold07
##  8 <split [5336/592]> Fold08
##  9 <split [5336/592]> Fold09
## 10 <split [5336/592]> Fold10
## Utilize the recipe steps to further preprocess the data. 
recipe_rec <- recipe(selling_price ~ km_driven + fuel + mileage + engine + max_power + seats + transmission, data = price_train) %>%
  step_dummy(all_nominal_predictors()) %>%  ## dummy-encode fuel and transmission
  step_zv(all_predictors())                 ## drop any zero-variance predictors
head(recipe_rec %>%
  prep() %>%
  bake(new_data = NULL))
## # A tibble: 6 × 10
##   km_driven mileage engine max_power seats selling_price fuel_Diesel fuel_LPG
##       <int>   <dbl>  <dbl>     <dbl> <int>         <int>       <dbl>    <dbl>
## 1    140000      17   1497        78     5        158000           0        0
## 2    127000      23   1396        90     5        225000           1        0
## 3    175000      17   1061        57     5         96000           0        1
## 4      5000      16    796        37     4         45000           0        0
## 5    100000      17    993        60     5         92000           0        0
## 6     90000      18   1061        67     5        180000           0        0
## # ℹ 2 more variables: fuel_Petrol <dbl>, transmission_Manual <dbl>
## Here we define the model specifications we will fit..


rf_spec <- rand_forest(trees = 100) %>%
  set_mode("regression") %>%
  set_engine("ranger")

xgbos <- boost_tree(trees = 100) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

lm_spec <- linear_reg()  ## defaults: "lm" engine, regression mode
## We use the workflow set to bundle our recipe and our models together.. 
price_set <- workflow_set(list(recipe_rec),list(rf_spec,lm_spec,xgbos),cross = TRUE)

price_set
## # A workflow set/tibble: 3 × 4
##   wflow_id           info             option    result    
##   <chr>              <list>           <list>    <list>    
## 1 recipe_rand_forest <tibble [1 × 4]> <opts[0]> <list [0]>
## 2 recipe_linear_reg  <tibble [1 × 4]> <opts[0]> <list [0]>
## 3 recipe_boost_tree  <tibble [1 × 4]> <opts[0]> <list [0]>
doParallel::registerDoParallel()
set.seed(124)

price_rs <- workflow_map(price_set,"fit_resamples",resamples = price_folds)

price_rs
## # A workflow set/tibble: 3 × 4
##   wflow_id           info             option    result   
##   <chr>              <list>           <list>    <list>   
## 1 recipe_rand_forest <tibble [1 × 4]> <opts[1]> <rsmp[+]>
## 2 recipe_linear_reg  <tibble [1 × 4]> <opts[1]> <rsmp[+]>
## 3 recipe_boost_tree  <tibble [1 × 4]> <opts[1]> <rsmp[+]>

0.3.1 Evaluating the Workflow Set

autoplot(price_rs)

collect_metrics(price_rs)
## # A tibble: 6 × 9
##   wflow_id        .config preproc model .metric .estimator    mean     n std_err
##   <chr>           <chr>   <chr>   <chr> <chr>   <chr>        <dbl> <int>   <dbl>
## 1 recipe_rand_fo… Prepro… recipe  rand… rmse    standard   1.97e+5    10 1.03e+4
## 2 recipe_rand_fo… Prepro… recipe  rand… rsq     standard   9.39e-1    10 5.58e-3
## 3 recipe_linear_… Prepro… recipe  line… rmse    standard   4.74e+5    10 1.08e+4
## 4 recipe_linear_… Prepro… recipe  line… rsq     standard   6.53e-1    10 1.30e-2
## 5 recipe_boost_t… Prepro… recipe  boos… rmse    standard   1.95e+5    10 1.17e+4
## 6 recipe_boost_t… Prepro… recipe  boos… rsq     standard   9.40e-1    10 6.71e-3
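For a quick ordering by a single metric, workflowsets also provides rank_results(); shown here as a sketch, with output omitted:

## Rank the three workflows by cross-validated RMSE (output omitted)
rank_results(price_rs, rank_metric = "rmse")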

Next I extract each individual workflow from the set, fit it on the training data, and evaluate it on the testing data.

0.3.2 Linear Regression Diagnostics

final_fit <- extract_workflow(price_rs,"recipe_linear_reg") %>%
  fit(price_train)

final_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##         (Intercept)            km_driven              mileage  
##          -3.774e+05           -2.656e+00            2.285e+04  
##              engine            max_power                seats  
##           4.370e+01            1.414e+04           -4.567e+03  
##         fuel_Diesel             fuel_LPG          fuel_Petrol  
##          -3.598e+04            8.835e+04           -1.373e+05  
## transmission_Manual  
##          -5.454e+05
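If standard errors and p-values are wanted, the underlying parsnip fit can be pulled out of the trained workflow and tidied with broom; a sketch, with output omitted:

## Tidy coefficient table for the fitted linear regression (output omitted)
final_fit %>% extract_fit_parsnip() %>% tidy()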
set.seed(125)
final_fitt <- last_fit(final_fit,price_split)
collect_metrics(final_fitt)
## # A tibble: 2 × 4
##   .metric .estimator  .estimate .config             
##   <chr>   <chr>           <dbl> <chr>               
## 1 rmse    standard   507150.    Preprocessor1_Model1
## 2 rsq     standard        0.632 Preprocessor1_Model1
head(collect_predictions(final_fitt))
## # A tibble: 6 × 5
##   id                 .pred  .row selling_price .config             
##   <chr>              <dbl> <int>         <int> <chr>               
## 1 train/test split 702030.     2        370000 Preprocessor1_Model1
## 2 train/test split 265519.     5        130000 Preprocessor1_Model1
## 3 train/test split  49649.    10        200000 Preprocessor1_Model1
## 4 train/test split 561367.    19        680000 Preprocessor1_Model1
## 5 train/test split 503828.    25        575000 Preprocessor1_Model1
## 6 train/test split 321842.    27        300000 Preprocessor1_Model1
collect_predictions(final_fitt) %>%
  ggplot(aes(selling_price, .pred)) + 
  geom_abline(lty = 2, col = "blue", linewidth = 1.5) +
  geom_point(alpha = 0.5) +
  coord_obs_pred()

0.3.3 Random Forest Diagnostics

final_fitt2 <- extract_workflow(price_rs,"recipe_rand_forest") %>%
  fit(price_train)

final_fitt2
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100,      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  100 
## Sample size:                      5928 
## Number of independent variables:  9 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       37361667704 
## R squared (OOB):                  0.9424963
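The printout shows the variable importance mode is "none"; to inspect importances, the ranger engine would need it enabled at fit time. A sketch, not run here (rf_spec2 is an illustrative name):

## Sketch (not run): refit with permutation importance enabled
## rf_spec2 <- rand_forest(trees = 100) %>%
##   set_mode("regression") %>%
##   set_engine("ranger", importance = "permutation")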
set.seed(125)
final_fittt <- last_fit(final_fitt2,price_split)
## Warning: package 'ranger' was built under R version 4.3.2
collect_metrics(final_fittt)
## # A tibble: 2 × 4
##   .metric .estimator  .estimate .config             
##   <chr>   <chr>           <dbl> <chr>               
## 1 rmse    standard   235332.    Preprocessor1_Model1
## 2 rsq     standard        0.923 Preprocessor1_Model1
head(collect_predictions(final_fittt))
## # A tibble: 6 × 5
##   id                 .pred  .row selling_price .config             
##   <chr>              <dbl> <int>         <int> <chr>               
## 1 train/test split 536470.     2        370000 Preprocessor1_Model1
## 2 train/test split 230294.     5        130000 Preprocessor1_Model1
## 3 train/test split 253735.    10        200000 Preprocessor1_Model1
## 4 train/test split 536541.    19        680000 Preprocessor1_Model1
## 5 train/test split 522310.    25        575000 Preprocessor1_Model1
## 6 train/test split 435621.    27        300000 Preprocessor1_Model1
collect_predictions(final_fittt) %>%
  ggplot(aes(selling_price, .pred)) + 
  geom_abline(lty = 2, col = "red", size = 1.5) +
  geom_point(alpha = 0.5) +
  coord_obs_pred()

0.3.4 Boosted Tree Diagnostics

Final_fit2 <- extract_workflow(price_rs,"recipe_boost_tree")

Final_fit2
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: boost_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_dummy()
## • step_zv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (regression)
## 
## Main Arguments:
##   trees = 100
## 
## Computational engine: xgboost
set.seed(125)
Final_Fit <- last_fit(Final_fit2,price_split)
## Warning: package 'xgboost' was built under R version 4.3.2
collect_metrics(Final_Fit)
## # A tibble: 2 × 4
##   .metric .estimator  .estimate .config             
##   <chr>   <chr>           <dbl> <chr>               
## 1 rmse    standard   208431.    Preprocessor1_Model1
## 2 rsq     standard        0.938 Preprocessor1_Model1
head(collect_predictions(Final_Fit))
## # A tibble: 6 × 5
##   id                 .pred  .row selling_price .config             
##   <chr>              <dbl> <int>         <int> <chr>               
## 1 train/test split 442959.     2        370000 Preprocessor1_Model1
## 2 train/test split 218439.     5        130000 Preprocessor1_Model1
## 3 train/test split 255136.    10        200000 Preprocessor1_Model1
## 4 train/test split 568410.    19        680000 Preprocessor1_Model1
## 5 train/test split 547065.    25        575000 Preprocessor1_Model1
## 6 train/test split 432518     27        300000 Preprocessor1_Model1
collect_predictions(Final_Fit) %>%
  ggplot(aes(selling_price, .pred)) + 
  geom_abline(lty = 2, col = "deeppink4", size = 1.5) +
  geom_point(alpha = 0.5) +
  coord_obs_pred()

Looking at the metrics from the model diagnostics, the random forest (test RMSE ~235,332; R-squared ~0.923) and the boosted trees (test RMSE ~208,431; R-squared ~0.938) performed very well on the testing dataset, far better than the linear regression (test RMSE ~507,150; R-squared ~0.632), and this occurred without tuning any hyperparameters.
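A natural next step would be hyperparameter tuning. A minimal sketch with the tune package, reusing the folds and recipe above (xgb_tune_spec, xgb_wf, and xgb_res are illustrative names; not run in this report):

## Sketch: grid-tune the boosted tree on the existing cross-validation folds
xgb_tune_spec <- boost_tree(trees = tune(), learn_rate = tune()) %>%
  set_mode("regression") %>%
  set_engine("xgboost")
xgb_wf <- workflow(recipe_rec, xgb_tune_spec)
xgb_res <- tune_grid(xgb_wf, resamples = price_folds, grid = 10)
select_best(xgb_res, metric = "rmse")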