Import Libraries
# import packages
library(tidyverse)
library(dplyr)
library(tidyr)
library(rpart)
library(rpart.plot)
library(lubridate)
library(skimr)
library(stringr)
library(corrplot)
library(ggplot2)
library(fpp3)
library(caret)
library(readr)
library(GGally)
library(tidymodels)
library(randomForest)
Load and Describe Data
# Load Datasets
cars <- "/Users/alecmccabe/Downloads/honda_sell_data.csv"
data <- read_csv(cars)
Cars.com is now the second-largest automotive classified site, with a large collection of vehicles for sale. Oftentimes, people feel uncomfortable navigating the intricacies of buying and selling a vehicle. There are plenty of scammers out there, so knowing what a fair price for a vehicle looks like is critical.
The dataset that I am choosing to analyze for homework 2 contains information for around 5,000 cars listed on cars.com.
Exploratory Data Analysis
head(data)
dim(data)
## [1] 4999 25
This dataset contains 25 features for each of its 5,000 listings. These features include:
- Year
- Make
- Model
- Condition
- Price (our target)
- Consumer Rating
- Consumer Reviews
- Color
- Drive Train
- MPG
- Fuel Type
- Transmission
- Engine Type
- VIN
- Mileage
- State of Sale
- Cars.com Ratings
This is a lot of information, and it gives a decision tree or random forest algorithm plenty of material for constructing a powerful predictive model. As stated previously, we will attempt to predict the price of a listing from its other features. In this way, sellers will be able to price their own car appropriately, and buyers will be able to spot good deals… or bad deals… when searching for a new vehicle.
Preprocessing Data
Columns Mistyped
skim(data)
| Name | data |
| Number of rows | 4999 |
| Number of columns | 25 |
| _______________________ | |
| Column type frequency: | |
| character | 16 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Make | 0 | 1.00 | 5 | 5 | 0 | 1 | 0 |
| Model | 0 | 1.00 | 3 | 50 | 0 | 146 | 0 |
| Condition | 0 | 1.00 | 3 | 15 | 0 | 3 | 0 |
| Price | 0 | 1.00 | 6 | 10 | 0 | 2333 | 0 |
| Exterior_Color | 11 | 1.00 | 1 | 50 | 0 | 172 | 0 |
| Interior_Color | 11 | 1.00 | 1 | 28 | 0 | 62 | 0 |
| Drivetrain | 11 | 1.00 | 1 | 17 | 0 | 7 | 0 |
| MPG | 1485 | 0.70 | 3 | 7 | 0 | 125 | 0 |
| Fuel_Type | 11 | 1.00 | 3 | 22 | 0 | 7 | 0 |
| Transmission | 11 | 1.00 | 1 | 45 | 0 | 58 | 0 |
| Engine | 11 | 1.00 | 4 | 64 | 0 | 75 | 0 |
| VIN | 11 | 1.00 | 10 | 17 | 0 | 4988 | 0 |
| Stock_# | 11 | 1.00 | 2 | 65 | 0 | 4966 | 0 |
| Mileage | 11 | 1.00 | 1 | 12 | 0 | 2416 | 0 |
| State | 73 | 0.99 | 2 | 5 | 0 | 53 | 0 |
| Seller_Type | 73 | 0.99 | 6 | 10 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year | 0 | 1.00 | 2020.51 | 3.71 | 1981.0 | 2019.0 | 2022.0 | 2023.0 | 2023 | ▁▁▁▁▇ |
| Consumer_Rating | 0 | 1.00 | 4.57 | 0.54 | 1.2 | 4.4 | 4.7 | 4.9 | 5 | ▁▁▁▁▇ |
| Consumer_Review_# | 0 | 1.00 | 1288.37 | 1970.00 | 0.0 | 126.0 | 697.0 | 1695.5 | 29258 | ▇▁▁▁▁ |
| Comfort_Rating | 552 | 0.89 | 4.82 | 0.26 | 3.8 | 4.7 | 4.9 | 5.0 | 5 | ▁▁▁▂▇ |
| Interior_Design_Rating | 552 | 0.89 | 4.71 | 0.48 | 3.0 | 4.7 | 4.8 | 5.0 | 5 | ▁▁▁▂▇ |
| Performance_Rating | 552 | 0.89 | 4.66 | 0.35 | 3.6 | 4.5 | 4.8 | 5.0 | 5 | ▁▂▁▃▇ |
| Value_For_Money_Rating | 552 | 0.89 | 4.58 | 0.37 | 3.6 | 4.2 | 4.6 | 5.0 | 5 | ▁▃▂▆▇ |
| Exterior_Styling_Rating | 552 | 0.89 | 4.73 | 0.31 | 3.9 | 4.6 | 4.8 | 5.0 | 5 | ▂▁▁▃▇ |
| Reliability_Rating | 552 | 0.89 | 4.87 | 0.19 | 4.0 | 4.8 | 5.0 | 5.0 | 5 | ▁▁▁▃▇ |
The skim() output tells us a lot about our data. One of its simple but powerful features is that it quickly shows the type of each column. Looking above, we can see some issues:
- Price should be numeric
- Mileage should be numeric
# checking price
data %>%
count(Price)
Price is currently stored as a character field, due to the presence of dollar signs ($) and commas (,) in its entries. This should be relatively straightforward to fix.
Using the parse_number method from the readr package, we can easily convert prices (with $ and commas) to numeric values. It also identifies any failures and shows the original values that failed parsing. In our case, we can see that there are many values listed as “Not Priced”. Because we will be building models that predict price, there is no reason to impute these, so we will drop all rows whose Price is “Not Priced”.
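As a quick illustration with made-up values (not rows from the dataset), parse_number() strips currency symbols and grouping commas, and returns NA with a parsing warning for strings it cannot parse:
# hypothetical example values, not taken from the dataset
parse_number(c("$23,500", "$7,995"))   # returns 23500 and 7995
parse_number("Not Priced")             # returns NA with a parsing failure warning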
data <- data[data$Price != 'Not Priced', ]
data$Price <- parse_number(data$Price)
# checking mileage
data %>%
count(Mileage)
Mileage should also be numeric. The issue here is that some listings have no mileage entered and default to ‘-’. We will need to replace those with NA, convert the column to numeric, compute the median, and then replace the NAs with that median.
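The next chunk implements this step by step. As an aside, the same cleanup could be written as a single mutate() pipeline; this is only a sketch and assumes ‘-’ is the placeholder string:
# compact alternative (not the approach used below): convert, then median-impute
data <- data %>%
mutate(Mileage = as.numeric(na_if(Mileage, "-")),
Mileage = replace_na(Mileage, median(Mileage, na.rm = TRUE)))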
# replace '-' (the placeholder string, grabbed from a row that contains it) with NA
repl_char <- data$Mileage[2]
data$Mileage <- as.numeric(na_if(data$Mileage, repl_char))
## Warning: NAs introduced by coercion
# calculate median and replace NAs with it
repl_median <- median(data$Mileage, na.rm=TRUE)
data$Mileage <- replace(data$Mileage, is.na(data$Mileage), repl_median)
Dropping Columns
Additionally, because we want to build a robust model that can accurately predict the value of a car without external assistance, we will be dropping the cars.com ratings features. Using the same logic, we will be dropping the consumer reviews features as well.
# drop rating features
data <- data %>%
select(-c('Comfort_Rating','Interior_Design_Rating','Performance_Rating','Value_For_Money_Rating','Exterior_Styling_Rating','Reliability_Rating','Consumer_Rating','Consumer_Review_#','VIN','Stock_#','Seller_Type'))
Missing Values
Looking above still at the output from skim(), we can see that there are a few problematic features with respect to missing values.
These include:
- MPG (70% complete)
- Exterior Color (99% complete)
- Interior Color (99% complete)
- Drivetrain (99% complete)
- Fuel Type (99% complete)
- Transmission (99% complete)
- Engine (99% complete)
- State (98% complete)
Of these features, MPG is the only one we will impute, as it has a significant missing rate. One approach would be to impute using the overall mean; a smarter approach is to borrow the MPG from other cars with the same make and model.
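The sketch below illustrates the grouped-imputation idea with a dplyr pipeline; it assumes MPG has already been converted to a numeric column, which is what the next steps accomplish:
# hedged sketch: fill missing MPG with the mean MPG of the same Make/Model group
data %>%
group_by(Make, Model) %>%
mutate(MPG = coalesce(MPG, mean(MPG, na.rm = TRUE))) %>%
ungroup()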
# examine the unique values and counts in MPG
data %>%
count(MPG)
The MPG column stores its values as ranges. We can create a helper function that computes the midpoint of each range and replaces the value accordingly.
# extract special character 'hyphen'
test <- data$MPG[4]
hyphen <- substr(test,3,3)   # the range separator is taken from a sample value rather than hard-coded
# create regex patterns
pattern1 <- paste("\\d+",hyphen,sep='')
pattern2 <- paste(hyphen,"\\d+",sep='')
# create function for calculating mean mpg given string
mean_mpg <- function(x){
if (is.na(x)) {
return(NA)
}
#print(x)
left <- str_extract(x, regex(pattern1))
left <- as.numeric(str_replace(left, hyphen,""))
#print(left)
right <- str_extract(x, regex(pattern2))
right <- as.numeric(str_replace(right,hyphen,""))
#print(right)
mean_val <- (left + right) / 2
return(mean_val)
}
# apply function to dataset
data$MPG <- unlist(lapply(data$MPG, mean_mpg))
# Check updated counts
data %>%
count(MPG)
# create groupby for later merge
groupby <- data %>%
group_by(Make, Model) %>%
summarise(mean_mpg = mean(MPG, rm.na=TRUE))   # note: the argument is misspelled; mean() expects na.rm=TRUE, so NAs are NOT removed here
## `summarise()` has grouped output by 'Make'. You can override using the
## `.groups` argument.
# merge grouped data onto original data
data <- merge(data, groupby, by=c('Make','Model'))
# create new column that uses the group-level mean to fill missing MPG values
data <- transform(data, final_mpg= ifelse(is.na(MPG), mean_mpg, MPG))
data <- data %>%
select(-c('MPG','mean_mpg'))
skim(data)
| Name | data |
| Number of rows | 4960 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Make | 0 | 1.00 | 5 | 5 | 0 | 1 | 0 |
| Model | 0 | 1.00 | 3 | 50 | 0 | 146 | 0 |
| Condition | 0 | 1.00 | 3 | 15 | 0 | 3 | 0 |
| Exterior_Color | 11 | 1.00 | 1 | 50 | 0 | 172 | 0 |
| Interior_Color | 11 | 1.00 | 1 | 28 | 0 | 62 | 0 |
| Drivetrain | 11 | 1.00 | 1 | 17 | 0 | 7 | 0 |
| Fuel_Type | 11 | 1.00 | 3 | 22 | 0 | 7 | 0 |
| Transmission | 11 | 1.00 | 1 | 45 | 0 | 58 | 0 |
| Engine | 11 | 1.00 | 4 | 64 | 0 | 74 | 0 |
| State | 73 | 0.99 | 2 | 5 | 0 | 53 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year | 0 | 1.00 | 2020.49 | 3.72 | 1981 | 2019.0 | 2022.0 | 2023 | 2023 | ▁▁▁▁▇ |
| Price | 0 | 1.00 | 33713.46 | 10321.61 | 1995 | 26998.0 | 33949.5 | 41799 | 69980 | ▁▅▇▃▁ |
| Mileage | 0 | 1.00 | 23363.88 | 36981.28 | 0 | 5.0 | 704.5 | 36246 | 259029 | ▇▁▁▁▁ |
| final_mpg | 1462 | 0.71 | 27.81 | 7.20 | 0 | 22.5 | 23.5 | 32 | 52 | ▁▁▇▃▁ |
While our attempt was valiant, it did not correct the original missing values. A likely culprit is the misspelled na.rm argument above (written rm.na), which means NAs were never dropped when averaging, so the group means are themselves NA; in addition, many make/model combinations contain only listings with missing MPG.
We now have to decide how to move forward. We will choose the most straightforward approach, replacing missing values with medians.
# replace with median
repl_median <- median(data$final_mpg, na.rm=TRUE)
data$final_mpg <- replace(data$final_mpg, is.na(data$final_mpg), repl_median)
There are additional features with missing values, but those have completeness ratios of roughly 98% or higher. rpart can handle missing predictors through surrogate splits, while the randomForest package cannot, so we can either leave these values alone or drop the affected rows.
sum(!complete.cases(data))
## [1] 73
As only 73 rows are affected by missing values (now that MPG has been corrected), we can drop these rows entirely.
data <- data %>% drop_na()
Adjusting Categorical Features
Several features have a large number of levels:
- Model has 146 types
- Exterior Color and Interior Color have 172 and 62 levels respectively
- Transmission and Engine have 58 and 73 levels respectively
- State has 53 levels (more than 50, which suggests territories or messy entries)
We can reduce the number of levels by grouping similar values together.
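As an aside, a more automatic (if less interpretable) alternative to the hand-written groupings below would be to lump rare levels together with forcats, which loads with the tidyverse; the column name here is purely illustrative:
# sketch only: keep the 10 most common exterior colors, lump the rest into "Other"
data %>%
mutate(exterior_color_lumped = fct_lump_n(as.factor(Exterior_Color), n = 10)) %>%
count(exterior_color_lumped, sort = TRUE)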
simplify_colors <- function(x) {
x <- tolower(x)
if (str_detect(x, 'black')) {return('black')}
if (str_detect(x, 'white')) {return('white')}
if (str_detect(x, 'gold')) {return('gold')}
if (str_detect(x, 'silver')) {return('silver')}
if (str_detect(x, 'red')) {return('red')}
if (str_detect(x, 'blue')) {return('blue')}
if (str_detect(x, 'green')) {return('green')}
if (str_detect(x, 'tan')) {return('tan')}
if (str_detect(x, 'yellow')) {return('yellow')}
if (str_detect(x, 'orange')) {return('orange')}
if (str_detect(x, 'purple')) {return('purple')}
if (str_detect(x, 'cyan')) {return('cyan')}
if (str_detect(x, 'pearl')) {return('pearl')}
if (str_detect(x, 'platinum')) {return('platinum')}
if (str_detect(x, 'magenta')) {return('magenta')}
if (str_detect(x, 'grey')) {return('grey')}
if (str_detect(x, 'gray')) {return('gray')}
if (str_detect(x, 'metallic')) {return('metallic')}
if (str_detect(x, 'maroon')) {return('maroon')}
return('other')
}
data$exterior_color_2 <- sapply(data$Exterior_Color, simplify_colors)
data$interior_color_2 <- sapply(data$Interior_Color, simplify_colors)
data %>%
count(exterior_color_2)
Exterior_Color has been simplified into exterior_color_2, which has far fewer levels. This will make subsequent interpretation much easier.
data %>%
count(interior_color_2)
Interior_Color has been simplified into interior_color_2, which has far fewer levels. This will make subsequent interpretation much easier.
simplify_model <- function(x) {
x <- tolower(x)
if (str_detect(x, 'accord')) {return('accord')}
if (str_detect(x, 'civic')) {return('civic')}
if (str_detect(x, 'clarity')) {return('clarity')}
if (str_detect(x, 'cr-v')) {return('cr-v')}
if (str_detect(x, 'crosstour')) {return('crosstour')}
if (str_detect(x, 'del sol')) {return('del sol')}
if (str_detect(x, 'element')) {return('element')}
if (str_detect(x, 'fit')) {return('fit')}
if (str_detect(x, 'hr-v')) {return('hr-v')}
if (str_detect(x, 'insight')) {return('insight')}
if (str_detect(x, 'odyssey')) {return('odyssey')}
if (str_detect(x, 'passport')) {return('passport')}
if (str_detect(x, 'pilot')) {return('pilot')}
if (str_detect(x, 'prelude')) {return('prelude')}
if (str_detect(x, 'ridgeline')) {return('ridgeline')}
if (str_detect(x, 's2000')) {return('s2000')}
return('other')
}
data$model_2 <- sapply(data$Model, simplify_model)
data %>%
count(model_2)
Model has been simplified into model_2, which has far fewer levels. This will make subsequent interpretation much easier.
simplify_transmission <- function(x) {
x <- tolower(x)
if (str_detect(x, 'automatic')) {return('automatic')}
if (str_detect(x, 'cvt')) {return('cvt')}
if (str_detect(x, 'continuous')) {return('cvt')}
if (str_detect(x, 'manual')) {return('manual')}
return('other')
}
data$transmission_2 <- sapply(data$Transmission, simplify_transmission)
extract_liters <- function(x) {
liters <- str_extract(x, regex("\\d\\.\\dL"))
if (is.na(liters)) {
return('other')
}
else {
return(liters)
}
}
data$engine_liters <- sapply(data$Engine, extract_liters)
# note: this function is a copy of extract_liters and still pulls the liter figure,
# not a cylinder count, so engine_cylinders duplicates engine_liters; it is not
# included in the final feature set below
extract_cylinders <- function(x) {
liters <- str_extract(x, regex("\\d\\.\\dL"))
if (is.na(liters)) {
return('other')
}
else {
return(liters)
}
}
data$engine_cylinders <- sapply(data$Engine, extract_cylinders)
Selecting final features
skim(data)
| Name | data |
| Number of rows | 4887 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 16 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Make | 0 | 1 | 5 | 5 | 0 | 1 | 0 |
| Model | 0 | 1 | 3 | 50 | 0 | 146 | 0 |
| Condition | 0 | 1 | 3 | 15 | 0 | 3 | 0 |
| Exterior_Color | 0 | 1 | 1 | 50 | 0 | 172 | 0 |
| Interior_Color | 0 | 1 | 1 | 28 | 0 | 62 | 0 |
| Drivetrain | 0 | 1 | 1 | 17 | 0 | 7 | 0 |
| Fuel_Type | 0 | 1 | 3 | 22 | 0 | 7 | 0 |
| Transmission | 0 | 1 | 1 | 45 | 0 | 58 | 0 |
| Engine | 0 | 1 | 4 | 64 | 0 | 73 | 0 |
| State | 0 | 1 | 2 | 5 | 0 | 53 | 0 |
| exterior_color_2 | 0 | 1 | 3 | 8 | 0 | 17 | 0 |
| interior_color_2 | 0 | 1 | 3 | 6 | 0 | 8 | 0 |
| model_2 | 0 | 1 | 3 | 9 | 0 | 17 | 0 |
| transmission_2 | 0 | 1 | 3 | 9 | 0 | 4 | 0 |
| engine_liters | 0 | 1 | 4 | 5 | 0 | 12 | 0 |
| engine_cylinders | 0 | 1 | 4 | 5 | 0 | 12 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year | 0 | 1 | 2020.56 | 3.67 | 1981 | 2020.0 | 2022.0 | 2023.0 | 2023 | ▁▁▁▁▇ |
| Price | 0 | 1 | 33861.70 | 10277.49 | 1995 | 27200.0 | 34041.0 | 41905.0 | 69980 | ▁▅▇▃▁ |
| Mileage | 0 | 1 | 22937.20 | 36823.00 | 0 | 5.0 | 704.5 | 35070.5 | 259029 | ▇▁▁▁▁ |
| final_mpg | 0 | 1 | 26.53 | 6.34 | 0 | 22.5 | 23.5 | 30.0 | 52 | ▁▁▇▂▁ |
data <- data %>%
select(model_2, Condition, exterior_color_2, interior_color_2, Drivetrain, Fuel_Type, transmission_2, engine_liters, Year, Mileage, final_mpg, Price)
EDA
Numerical Distributions
hist(data$Year)
Our dataset consists mostly of newer vehicles, with Year values concentrated in the 2020s.
hist(data$Mileage)
Unlike the Year feature, Mileage is heavily right-skewed.
hist(data$final_mpg)
Most cars being sold have around 20-25 miles per gallon, which matches my own external research.
hist(data$Price)
Our target variable, price, is nicely distributed. It follows a roughly normal shape, with no major skews.
Numerical Correlation
ggpairs(data %>% select(c("Year","Mileage","final_mpg","Price")), progress=FALSE)
From the above, we can see that the numerical features most highly correlated with our target variable, Price, are Mileage and Year. MPG does not appear to be strongly correlated with Price.
It is interesting to note that Mileage and MPG are both negatively correlated with Price. This makes sense for Mileage, but it is surprising for MPG.
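Since corrplot is already loaded, a correlation heatmap of the numeric columns is another way to view the same relationships (a minimal sketch; num_cols is just an illustrative name):
# correlation matrix of the numeric features, drawn with corrplot
num_cols <- data %>% select(Year, Mileage, final_mpg, Price)
corrplot(cor(num_cols), method = "color", addCoef.col = "black")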
Reviewing Categorical Features
data %>%
ggplot(aes(x=model_2, y=Price)) +
geom_boxplot() +
coord_flip()
While there are a lot of lower-end outliers, we can see from the above that the model of the car certainly has an impact on the final asking price. Top-of-the-line models like the Ridgeline fetch a larger asking price (on average) than the Crosstour or Element.
skim(data)
| Name | data |
| Number of rows | 4887 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| model_2 | 0 | 1 | 3 | 9 | 0 | 17 | 0 |
| Condition | 0 | 1 | 3 | 15 | 0 | 3 | 0 |
| exterior_color_2 | 0 | 1 | 3 | 8 | 0 | 17 | 0 |
| interior_color_2 | 0 | 1 | 3 | 6 | 0 | 8 | 0 |
| Drivetrain | 0 | 1 | 1 | 17 | 0 | 7 | 0 |
| Fuel_Type | 0 | 1 | 3 | 22 | 0 | 7 | 0 |
| transmission_2 | 0 | 1 | 3 | 9 | 0 | 4 | 0 |
| engine_liters | 0 | 1 | 4 | 5 | 0 | 12 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year | 0 | 1 | 2020.56 | 3.67 | 1981 | 2020.0 | 2022.0 | 2023.0 | 2023 | ▁▁▁▁▇ |
| Mileage | 0 | 1 | 22937.20 | 36823.00 | 0 | 5.0 | 704.5 | 35070.5 | 259029 | ▇▁▁▁▁ |
| final_mpg | 0 | 1 | 26.53 | 6.34 | 0 | 22.5 | 23.5 | 30.0 | 52 | ▁▁▇▂▁ |
| Price | 0 | 1 | 33861.70 | 10277.49 | 1995 | 27200.0 | 34041.0 | 41905.0 | 69980 | ▁▅▇▃▁ |
data %>%
ggplot(aes(x=Condition, y=Price)) +
geom_boxplot() +
coord_flip()
Similarly, we can expect higher asking prices when the condition of the car is listed as “New”. This is not surprising.
data %>%
ggplot(aes(x=exterior_color_2, y=Price)) +
geom_boxplot() +
coord_flip()
There even seems to be a preference for color among car buyers! While I did not expect this, it appears that yellow cars tend to sell for more. However, that could be due to only a certain (more expensive) type of car being sold more often in yellow.
data %>%
ggplot(aes(x=transmission_2, y=Price)) +
geom_boxplot() +
coord_flip()
There is less of a price difference across transmission types. Listings with automatic transmissions are priced marginally higher than the other options, but it is close.
data %>%
ggplot(aes(x=engine_liters, y=Price)) +
geom_boxplot() +
coord_flip()
It is interesting to note that average price is not linearly related to engine displacement. I would have expected more capacity to equate to a higher average price, but the trend is uneven. The highest-priced engine capacities are 3.5L (the largest), 2.0L, and 1.5L.
Initial Models
# Partition data - train (80%) & test (20%)
set.seed(42)
ind <- sample(2, nrow(data), replace = T, prob = c(0.80, 0.20))
train <- data[ind==1,]
test <- data[ind==2,]
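Note that sampling with prob = c(0.80, 0.20) gives an approximately 80/20 split. If an exact split stratified on the outcome were preferred, caret (already loaded) provides a helper; this is a sketch with illustrative object names, not the partition used below:
# alternative: exact 80% split, stratified on binned Price
set.seed(42)
idx <- createDataPartition(data$Price, p = 0.80, list = FALSE)
train_alt <- data[idx, ]
test_alt <- data[-idx, ]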
Decision Tree #1
The first model we will try will be a decision tree with only one feature: model_2. I believe the model of a car is one of the most predictive features when estimating a potential sale or purchase price.
# test decision tree with just model_2 as feature
model_1 <- rpart(Price ~ `model_2`,
method="anova", train)summary(model_1)## Call:
## rpart(formula = Price ~ model_2, data = train, method = "anova")
## n= 3880
##
## CP nsplit rel error xerror xstd
## 1 0.31440689 0 1.0000000 1.0004007 0.02129112
## 2 0.04755939 1 0.6855931 0.6872072 0.01930869
## 3 0.02083648 2 0.6380337 0.6418787 0.01898669
## 4 0.01265733 3 0.6171972 0.6228764 0.01895735
## 5 0.01000000 4 0.6045399 0.6111974 0.01869524
##
## Variable importance
## model_2
## 100
##
## Node number 1: 3880 observations, complexity param=0.3144069
## mean=33841.93, MSE=1.055206e+08
## left son=2 (1382 obs) right son=3 (2498 obs)
## Primary splits:
## model_2 splits as LLLRLLLLLLRLRRLRL, improve=0.3144069, (0 missing)
##
## Node number 2: 1382 observations, complexity param=0.02083648
## mean=26098.08, MSE=5.512994e+07
## left son=4 (67 obs) right son=5 (1315 obs)
## Primary splits:
## model_2 splits as RRR-LLLLRR-L--L-R, improve=0.111969, (0 missing)
##
## Node number 3: 2498 observations, complexity param=0.04755939
## mean=38126.16, MSE=8.186787e+07
## left son=6 (1348 obs) right son=7 (1150 obs)
## Primary splits:
## model_2 splits as ---L------L-LR-R-, improve=0.09521369, (0 missing)
##
## Node number 4: 67 observations
## mean=15091.1, MSE=2.789397e+07
##
## Node number 5: 1315 observations
## mean=26658.9, MSE=5.003027e+07
##
## Node number 6: 1348 observations, complexity param=0.01265733
## mean=35547.4, MSE=7.520351e+07
## left son=12 (642 obs) right son=13 (706 obs)
## Primary splits:
## model_2 splits as ---L------R-R----, improve=0.05111911, (0 missing)
##
## Node number 7: 1150 observations
## mean=41148.91, MSE=7.274769e+07
##
## Node number 12: 642 observations
## mean=33491.3, MSE=5.669908e+07
##
## Node number 13: 706 observations
## mean=37417.12, MSE=8.46903e+07
model_1 %>% rpart.plot(box.palette="RdBu", shadow.col="gray", nn=TRUE)
This first decision tree provides a very interesting way to map our models to expected prices. We can see that at the start, all models are included. As we go down the tree, the models get split into different groupings. It looks like the Odyssey, Passport, and CR-V are among the more expensive vehicles offered, while the Crosstour and Element are among the lower-end models (which validates our initial EDA).
# make predictions
p <- predict(model_1, test)
# Root Mean Square Error
sqrt(mean((test$Price - p)^2))
## [1] 8009.782
Using a simple model like this is not very effective though, as the RMSE is around $8,010. We know that the prices in our data are near-normally distributed and range from roughly $2,000 to $70,000, so while this is ‘close’, it could be better.
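Since we will compute this metric for every model below, a small helper function would keep the code tidy; this is just a sketch (the later chunks keep the explicit formula):
# hypothetical helper: root mean squared error of predictions vs. actuals
rmse <- function(actual, predicted) {
sqrt(mean((actual - predicted)^2))
}
rmse(test$Price, p)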
Decision Tree #2
The next model we will try will use two new features: Condition and Mileage.
# decision tree with Condition and Mileage as features
model_2 <- rpart(Price ~ Condition + Mileage,
method="anova", train)
summary(model_2)
## Call:
## rpart(formula = Price ~ Condition + Mileage, data = train, method = "anova")
## n= 3880
##
## CP nsplit rel error xerror xstd
## 1 0.41689305 0 1.0000000 1.0002949 0.021293335
## 2 0.08367721 1 0.5831070 0.5991179 0.011845764
## 3 0.04112544 2 0.4994297 0.5056072 0.010043889
## 4 0.01659143 3 0.4583043 0.4744812 0.009534910
## 5 0.01000000 4 0.4417129 0.4518086 0.008953413
##
## Variable importance
## Mileage Condition
## 67 33
##
## Node number 1: 3880 observations, complexity param=0.416893
## mean=33841.93, MSE=1.055206e+08
## left son=2 (1234 obs) right son=3 (2646 obs)
## Primary splits:
## Mileage < 24710 to the right, improve=0.4168930, (0 missing)
## Condition splits as LRL, improve=0.3787503, (0 missing)
## Surrogate splits:
## Condition splits as RRL, agree=0.861, adj=0.562, (0 split)
##
## Node number 2: 1234 observations, complexity param=0.08367721
## mean=24129.72, MSE=6.847446e+07
## left son=4 (451 obs) right son=5 (783 obs)
## Primary splits:
## Mileage < 67398.5 to the right, improve=0.40544550, (0 missing)
## Condition splits as R-L, improve=0.05161342, (0 missing)
##
## Node number 3: 2646 observations, complexity param=0.04112544
## mean=38371.36, MSE=5.829106e+07
## left son=6 (543 obs) right son=7 (2103 obs)
## Primary splits:
## Condition splits as LRL, improve=0.1091661, (0 missing)
## Mileage < 1349.5 to the right, improve=0.1016215, (0 missing)
## Surrogate splits:
## Mileage < 720.5 to the right, agree=0.98, adj=0.901, (0 split)
##
## Node number 4: 451 observations, complexity param=0.01659143
## mean=17187.1, MSE=4.58321e+07
## left son=8 (183 obs) right son=9 (268 obs)
## Primary splits:
## Mileage < 105080.5 to the right, improve=0.32862940, (0 missing)
## Condition splits as R-L, improve=0.03483997, (0 missing)
##
## Node number 5: 783 observations
## mean=28128.59, MSE=3.776255e+07
##
## Node number 6: 543 observations
## mean=33406.99, MSE=4.47934e+07
##
## Node number 7: 2103 observations
## mean=39653.18, MSE=5.376974e+07
##
## Node number 8: 183 observations
## mean=12490.54, MSE=3.111674e+07
##
## Node number 9: 268 observations
## mean=20394.08, MSE=3.053378e+07
model_2 %>% rpart.plot(box.palette="RdBu", shadow.col="gray", nn=TRUE)
Based on the model summary as well as the tree graph, we can see that Mileage plays a significantly more important role than Condition when pricing a car. Essentially, Condition only matters if the mileage is below roughly 25,000. In those cases, having a Honda Certified vehicle leads to higher selling prices.
# make predictions
p <- predict(model_2, test)
# Root Mean Square Error
sqrt(mean((test$Price - p)^2))
## [1] 6873.186
This new model, built on two variables we had not used before, performs better than our first model: we have reduced the RMSE by over $1,000.
Decision Tree #3
The last decision tree we will try uses all of the features. To do this, I needed to preprocess the data a bit more to avoid problems with factor levels that appear in the test set but not in the training set. To achieve this, we one-hot encode all categorical features.
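The chunk below does this with caret’s dummyVars. As a hedged alternative, the recipes package (loaded with tidymodels) can produce the same kind of design matrix; rec and onehot_alt are illustrative names:
# sketch: one-hot encode all nominal predictors with recipes
rec <- recipe(Price ~ ., data = data) %>%
step_dummy(all_nominal(), -all_outcomes())
onehot_alt <- bake(prep(rec), new_data = NULL)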
## One hot encode all categorical features
dmy <- dummyVars(" ~ .", data = data)
onehot <- data.frame(predict(dmy, newdata = data))
head(onehot)
# Partition data - train (80%) & test (20%)
set.seed(42)
ind <- sample(2, nrow(onehot), replace = T, prob = c(0.80, 0.20))
dummy_train <- onehot[ind==1,]
dummy_test <- onehot[ind==2,]
# decision tree with all features
model_3 <- rpart(Price ~ .,
method="anova", dummy_train)model_3 %>% rpart.plot(box.palette="RdBu", shadow.col="gray", nn=TRUE)# make predictions
p <- predict(model_3, dummy_test)
# Root Mean Square Error
sqrt(mean((dummy_test$Price - p)^2))
## [1] 4289.253
Using a decision tree with all predictors resulted in the lowest RMSE yet. However, the improvement in predictive power comes at the cost of significantly more features to consider. Ultimately, the most important features in this model are (see the sketch after this list for how to pull the importances from the fitted tree):
- Year
- Engine Liters
- Mileage
- MPG
- Model
- Condition
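rpart stores the importance scores behind this ranking, so they can be inspected directly from the fitted object (a quick sketch; output not shown):
# variable importance accumulated over primary and surrogate splits
sort(model_3$variable.importance, decreasing = TRUE)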
Random Forest #1
Our first random forest model will include only four features, three of which we have tested previously with decision trees. The new feature we will add is exterior_color_2.
rf_1 <- randomForest(Price ~ model_2 + Mileage + Condition + exterior_color_2, importance = TRUE, na.action = na.omit, train)
plot(rf_1)
# make predictions
p3 <- predict(rf_1, test)
# Root Mean Square Error
sqrt(mean((test$Price - p3)^2))
## [1] 4529.521
Using a random forest with just four features produces an RMSE of roughly 4,530, which is only slightly worse than the decision tree that used all features. Let’s build a random forest with all features as well to compare.
Random Forest #2
This random forest will take all features into consideration, similar to how our last decision tree model was constructed.
rf_2 <- randomForest(Price ~ ., importance = TRUE, na.action = na.omit, dummy_train)
plot(rf_2)
# make predictions
p3 <- predict(rf_2, dummy_test)
# Root Mean Square Error
sqrt(mean((dummy_test$Price - p3)^2))
## [1] 2731.559
We see a significant boost when using all features! The training time for this model was significantly longer than all of our previous attempts, but the final results are very strong.
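Because importance = TRUE was set when fitting, the forest’s own importance measures can be examined to see which features drive these predictions (a sketch; output not shown):
# permutation (%IncMSE) and node-purity importance from the fitted forest
importance(rf_2)
varImpPlot(rf_2)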
Conclusion
After exploring different models using both decision tree and random forest algorithms with various selected features, I conclude that random forests typically outperform decision trees with respect to predictive power. In every iteration, a random forest built on similar features produced better results than the corresponding decision tree.
That being said, I also noticed that the decision tree models took significantly less time to run. If we ever needed a quick-and-dirty model to better understand the data, a decision tree would be the tool to reach for; identifying the most important features is one such use case.