Import Libraries

# import packages
library(tidyverse)
library(dplyr)
library(tidyr)
library(rpart)
library(rpart.plot)
library(lubridate)
library(skimr)
library(stringr)
library(corrplot)
library(ggplot2)
library(fpp3)
library(caret)
library(readr)
library(GGally)
library(tidymodels)
library(randomForest)

Load and Describe Data

# Load Datasets
cars <- "/Users/alecmccabe/Downloads/honda_sell_data.csv"
data <- read_csv(cars)

Cars.com is now the second-largest automotive classifieds site, with a large collection of vehicles for sale. People often feel uncomfortable navigating the intricacies of buying and selling a vehicle, and with plenty of scammers out there, knowing what a good price for a vehicle looks like is critical.

The dataset that I am choosing to analyze for homework 2 contains information on around 5,000 Honda vehicles listed on cars.com.

Exploratory Data Analysis

head(data)
dim(data) 
## [1] 4999   25

This dataset contains 25 features for each of its roughly 5,000 listings; the head() call above gives a preview of them.

This is a lot of information, but it certainly provides plenty of material for a decision tree or random forest algorithm to build a powerful predictive model. As stated previously, we will attempt to predict the price of a listing from the other features. In this way, users will be able to price their own car appropriately when selling, and to spot good deals (or bad ones) when searching for a new vehicle.

Preprocessing Data

Mistyped Columns

skim(data)
Data summary
Name data
Number of rows 4999
Number of columns 25
_______________________
Column type frequency:
character 16
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Make 0 1.00 5 5 0 1 0
Model 0 1.00 3 50 0 146 0
Condition 0 1.00 3 15 0 3 0
Price 0 1.00 6 10 0 2333 0
Exterior_Color 11 1.00 1 50 0 172 0
Interior_Color 11 1.00 1 28 0 62 0
Drivetrain 11 1.00 1 17 0 7 0
MPG 1485 0.70 3 7 0 125 0
Fuel_Type 11 1.00 3 22 0 7 0
Transmission 11 1.00 1 45 0 58 0
Engine 11 1.00 4 64 0 75 0
VIN 11 1.00 10 17 0 4988 0
Stock_# 11 1.00 2 65 0 4966 0
Mileage 11 1.00 1 12 0 2416 0
State 73 0.99 2 5 0 53 0
Seller_Type 73 0.99 6 10 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1.00 2020.51 3.71 1981.0 2019.0 2022.0 2023.0 2023 ▁▁▁▁▇
Consumer_Rating 0 1.00 4.57 0.54 1.2 4.4 4.7 4.9 5 ▁▁▁▁▇
Consumer_Review_# 0 1.00 1288.37 1970.00 0.0 126.0 697.0 1695.5 29258 ▇▁▁▁▁
Comfort_Rating 552 0.89 4.82 0.26 3.8 4.7 4.9 5.0 5 ▁▁▁▂▇
Interior_Design_Rating 552 0.89 4.71 0.48 3.0 4.7 4.8 5.0 5 ▁▁▁▂▇
Performance_Rating 552 0.89 4.66 0.35 3.6 4.5 4.8 5.0 5 ▁▂▁▃▇
Value_For_Money_Rating 552 0.89 4.58 0.37 3.6 4.2 4.6 5.0 5 ▁▃▂▆▇
Exterior_Styling_Rating 552 0.89 4.73 0.31 3.9 4.6 4.8 5.0 5 ▂▁▁▃▇
Reliability_Rating 552 0.89 4.87 0.19 4.0 4.8 5.0 5.0 5 ▁▁▁▃▇

The skim() function tells us a lot about our data. One of its simple but powerful features is that it quickly shows the type of each column. Looking above, we can see some issues:

  • Price should be numeric
  • Mileage should be numeric
# checking price
data %>%
  count(Price)

Price is currently stored as a character field, due to the presence of the dollar sign ($) and commas (,) in its entries. This should be relatively straightforward to fix.

Using the parse_number() function from the readr package, we can easily convert prices (with $ and commas) to numeric values. It also identifies any parsing failures and shows the original values that could not be parsed. In our case, we can see that many values are listed as “Not Priced”. Because we will be building models that predict price, there is no reason to impute the target, so we drop all rows where Price is “Not Priced”.
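As a quick illustration of parse_number() on a made-up vector (the values below are hypothetical, not taken from the data):

# parse_number() strips "$" and "," and flags entries it cannot parse;
# this hypothetical vector yields 5500, 33949, NA plus a parsing-failure warning
parse_number(c("$5,500", "$33,949", "Not Priced"))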

data <- data[data$Price != 'Not Priced', ]
data$Price <- parse_number(data$Price)
# checking mileage
data %>%
  count(Mileage)

Mileage should also be numeric. The issue here is that some listings have no mileage entered and default to ‘-’. To address this properly, we will replace those placeholders with NA, convert the column to numeric, compute the median, and then replace the NAs with that median.

# pull the '-' placeholder from a row that uses it (row 2), then replace it with NA
repl_char <- data$Mileage[2]
data$Mileage <- as.numeric(na_if(data$Mileage, repl_char))
## Warning: NAs introduced by coercion
# calculate median and replace NAs with it
repl_median <- median(data$Mileage, na.rm=TRUE)
data$Mileage <- replace(data$Mileage, is.na(data$Mileage), repl_median)
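For reference, the same conversion and median imputation could be collapsed into one dplyr/tidyr pipeline (an equivalent sketch of the two steps above, not an additional step):

# sketch: convert the placeholder to NA, coerce to numeric, then fill NAs with the median
data <- data %>%
  mutate(Mileage = as.numeric(na_if(Mileage, repl_char)),
         Mileage = replace_na(Mileage, median(Mileage, na.rm = TRUE)))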

Dropping Columns

Additionally, because we want a robust model that can predict the value of a car without relying on cars.com’s own assessments, we will drop the cars.com rating features and, by the same logic, the consumer review count. We also drop the VIN and Stock_# identifiers along with Seller_Type.

# drop rating features

data <- data %>%
  select(-c('Comfort_Rating', 'Interior_Design_Rating', 'Performance_Rating',
            'Value_For_Money_Rating', 'Exterior_Styling_Rating', 'Reliability_Rating',
            'Consumer_Rating', 'Consumer_Review_#', 'VIN', 'Stock_#', 'Seller_Type'))

Missing Values

Looking above still at the output from skim(), we can see that there are a few problematic features with respect to missing values.

These include:

  • MPG (70% complete)
  • Exterior Color (99% complete)
  • Interior Color (99% complete)
  • Drivetrain (99% complete)
  • Fuel Type (99% complete)
  • Transmission (99% complete)
  • Engine (99% complete)
  • State (98% complete)

Of these features, MPG is the only one we will impute, as its missing rate is substantial. One approach would be to impute with the overall mean; a smarter approach is to borrow the MPG from other cars of the same make and model.

# examine the unique values and counts in MPG
data %>%
  count(MPG)

The MPG column stores values as ranges (e.g. “22-28” style strings). We can create a helper function that computes the mean of the two endpoints and replaces each entry accordingly.

# extract the range separator from an actual entry (it may not be a plain ASCII hyphen)
test <- data$MPG[4]
hyphen <- substr(test, 3, 3)
# build regex patterns for the numbers on either side of the separator
pattern1 <- paste("\\d+", hyphen, sep='')
pattern2 <- paste(hyphen, "\\d+", sep='')
# compute the midpoint of an MPG range string; returns NA for missing entries
mean_mpg <- function(x){
  
  if (is.na(x)) {
    return(NA)
  }
  
  # number to the left of the separator
  left <- str_extract(x, regex(pattern1))
  left <- as.numeric(str_replace(left, hyphen, ""))
  
  # number to the right of the separator
  right <- str_extract(x, regex(pattern2))
  right <- as.numeric(str_replace(right, hyphen, ""))
  
  mean_val <- (left + right) / 2
  
  return(mean_val)
  
}
# apply function to dataset
data$MPG <- unlist(lapply(data$MPG, mean_mpg))
# Check updated counts
data %>%
  count(MPG)
# create groupby for later merge
groupby <- data %>%
  group_by(Make, Model) %>%
  summarise(mean_mpg = mean(MPG, rm.na=TRUE))
## `summarise()` has grouped output by 'Make'. You can override using the
## `.groups` argument.
# merge grouped data onto original data
data <- merge(data, groupby, by=c('Make','Model'))
# keep the original MPG where present; otherwise fall back to the make/model group mean
data <- transform(data, final_mpg = ifelse(is.na(MPG), mean_mpg, MPG))

data <- data %>%
  select(-c('MPG','mean_mpg'))
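The merge/transform/ifelse pattern above can also be written with dplyr::coalesce(), which returns the first non-missing value across its arguments; a compact equivalent of the last two steps would be:

# sketch: take MPG where present, otherwise the group mean, then drop the helper columns
data <- data %>%
  mutate(final_mpg = coalesce(MPG, mean_mpg)) %>%
  select(-MPG, -mean_mpg)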

skim(data)
Data summary
Name data
Number of rows 4960
Number of columns 14
_______________________
Column type frequency:
character 10
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Make 0 1.00 5 5 0 1 0
Model 0 1.00 3 50 0 146 0
Condition 0 1.00 3 15 0 3 0
Exterior_Color 11 1.00 1 50 0 172 0
Interior_Color 11 1.00 1 28 0 62 0
Drivetrain 11 1.00 1 17 0 7 0
Fuel_Type 11 1.00 3 22 0 7 0
Transmission 11 1.00 1 45 0 58 0
Engine 11 1.00 4 64 0 74 0
State 73 0.99 2 5 0 53 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1.00 2020.49 3.72 1981 2019.0 2022.0 2023 2023 ▁▁▁▁▇
Price 0 1.00 33713.46 10321.61 1995 26998.0 33949.5 41799 69980 ▁▅▇▃▁
Mileage 0 1.00 23363.88 36981.28 0 5.0 704.5 36246 259029 ▇▁▁▁▁
final_mpg 1462 0.71 27.81 7.20 0 22.5 23.5 32 52 ▁▁▇▃▁

While our attempts were valiant, the group means did not fill in the original missing values. The likely culprit is the summarise() call above: mean(MPG, rm.na = TRUE) misspells the na.rm argument, which mean() silently ignores, so any make/model group containing a missing MPG returns NA for its group mean (and groups whose MPG is entirely missing would come back empty regardless).
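For reference, the earlier group summary would need to read as follows for the group means to ignore missing MPG values:

# corrected group summary: na.rm (not rm.na) so missing MPG values are ignored within each group
groupby <- data %>%
  group_by(Make, Model) %>%
  summarise(mean_mpg = mean(MPG, na.rm = TRUE), .groups = "drop")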

We now have to decide how to move forward. We will choose the most straightforward approach, replacing missing values with medians.

# replace with median

repl_median <- median(data$final_mpg, na.rm=TRUE)
data$final_mpg <- replace(data$final_mpg, is.na(data$final_mpg), repl_median)

There are additional features with missing values, but each of those is roughly 98% complete or better. Decision trees fit with rpart can handle missing predictor values through surrogate splits, but randomForest cannot, so we can either leave these rows alone for the tree models or drop them.

sum(!complete.cases(data))
## [1] 73

As only 73 rows are affected by missing values (now that MPG has been imputed), we can simply drop those rows.

data <- data %>% drop_na()

Adjusting Categorical Features

Several features have a large number of levels:

  • Model has 146 distinct values
  • Exterior Color and Interior Color have 172 and 62 respectively
  • Transmission and Engine have 58 and 74 respectively
  • State has 53

We can reduce the number of levels by grouping similar values together.

simplify_colors <- function(x) {
  x <- tolower(x)
  
  if (str_detect(x, 'black')) {return('black')}
  if (str_detect(x, 'white')) {return('white')}
  if (str_detect(x, 'gold')) {return('gold')}
  if (str_detect(x, 'silver')) {return('silver')}
  if (str_detect(x, 'red')) {return('red')}
  if (str_detect(x, 'blue')) {return('blue')}
  if (str_detect(x, 'green')) {return('green')}
  if (str_detect(x, 'tan')) {return('tan')}
  if (str_detect(x, 'yellow')) {return('yellow')}
  if (str_detect(x, 'orange')) {return('orange')}
  if (str_detect(x, 'purple')) {return('purple')}
  if (str_detect(x, 'cyan')) {return('cyan')}
  if (str_detect(x, 'pearl')) {return('pearl')}
  if (str_detect(x, 'platinum')) {return('platinum')}
  if (str_detect(x, 'magenta')) {return('magenta')}
  if (str_detect(x, 'grey')) {return('grey')}
  if (str_detect(x, 'gray')) {return('gray')}
  if (str_detect(x, 'metallic')) {return('metallic')}
  if (str_detect(x, 'maroon')) {return('maroon')}
  
  
  return('other')

}
data$exterior_color_2 <- sapply(data$Exterior_Color, simplify_colors)
data$interior_color_2 <- sapply(data$Interior_Color, simplify_colors)
data %>%
  count(exterior_color_2)

Exterior_Color has been simplified into exterior_color_2, which has significantly fewer levels. This will make subsequent interpretation much easier.

data %>%
  count(interior_color_2)

Interior_Color has been simplified into interior_color_2, which has significantly fewer levels. This will make subsequent interpretation much easier.
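As an aside, these mappers could also be written with dplyr::case_when(), which is vectorized (so no sapply is needed). Below is an abbreviated sketch for the exterior colors, using a hypothetical simplify_colors_cw() and covering only a few of the categories above:

# vectorized sketch of the same idea; only a few colors shown, the rest follow the same pattern
simplify_colors_cw <- function(x) {
  x <- tolower(x)
  case_when(
    str_detect(x, "black")  ~ "black",
    str_detect(x, "white")  ~ "white",
    str_detect(x, "silver") ~ "silver",
    TRUE                    ~ "other"
  )
}
# e.g. data$exterior_color_2 <- simplify_colors_cw(data$Exterior_Color)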

simplify_model <- function(x) {
  x <- tolower(x)
  
  if (str_detect(x, 'accord')) {return('accord')}
  if (str_detect(x, 'civic')) {return('civic')}
  if (str_detect(x, 'clarity')) {return('clarity')}
  if (str_detect(x, 'cr-v')) {return('cr-v')}
  if (str_detect(x, 'crosstour')) {return('crosstour')}
  if (str_detect(x, 'del sol')) {return('del sol')}
  if (str_detect(x, 'element')) {return('element')}
  if (str_detect(x, 'fit')) {return('fit')}
  if (str_detect(x, 'hr-v')) {return('hr-v')}
  if (str_detect(x, 'insight')) {return('insight')}
  if (str_detect(x, 'odyssey')) {return('odyssey')}
  if (str_detect(x, 'passport')) {return('passport')}
  if (str_detect(x, 'pilot')) {return('pilot')}
  if (str_detect(x, 'prelude')) {return('prelude')}
  if (str_detect(x, 'ridgeline')) {return('ridgeline')}
  if (str_detect(x, 's2000')) {return('s2000')}
  
  
  return('other')

}
data$model_2 <- sapply(data$Model, simplify_model)
data %>%
  count(model_2)

Model has been simplified into model_2, which has significantly fewer levels. This will make subsequent interpretation much easier.

simplify_transmission <- function(x) {
  x <- tolower(x)
  
  if (str_detect(x, 'automatic')) {return('automatic')}
  if (str_detect(x, 'cvt')) {return('cvt')}
  if (str_detect(x, 'continuous')) {return('cvt')}
  if (str_detect(x, 'manual')) {return('manual')}
  
  return('other')
}
data$transmission_2 <- sapply(data$Transmission, simplify_transmission)
extract_liters <- function(x) {
  liters <- str_extract(x, regex("\\d\\.\\dL"))
  
  if (is.na(liters)) {
    return('other')
  }
  else {
    return(liters)
  }
}
data$engine_liters <- sapply(data$Engine, extract_liters)
extract_cylinders <- function(x) {
  liters <- str_extract(x, regex("\\d\\.\\dL"))
  
  if (is.na(liters)) {
    return('other')
  }
  else {
    return(liters)
  }
}
data$engine_cylinders <- sapply(data$Engine, extract_cylinders)
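Note that extract_cylinders above is a copy of extract_liters, so engine_cylinders ends up holding the same displacement strings; it is dropped from the final feature set below anyway. If actual cylinder counts were wanted, a sketch along these lines could work, assuming the Engine strings contain layout tokens such as “I4” or “V6” (an assumption about the cars.com format, not verified here):

# hypothetical cylinder extractor: pull the digit that follows an "I" or "V" engine-layout token
extract_cylinders_v2 <- function(x) {
  cyl <- str_extract(toupper(x), "(?<=[IV])\\d")
  if (is.na(cyl)) {
    return('other')
  }
  return(cyl)
}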

Selecting final features

skim(data)
Data summary
Name data
Number of rows 4887
Number of columns 20
_______________________
Column type frequency:
character 16
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Make 0 1 5 5 0 1 0
Model 0 1 3 50 0 146 0
Condition 0 1 3 15 0 3 0
Exterior_Color 0 1 1 50 0 172 0
Interior_Color 0 1 1 28 0 62 0
Drivetrain 0 1 1 17 0 7 0
Fuel_Type 0 1 3 22 0 7 0
Transmission 0 1 1 45 0 58 0
Engine 0 1 4 64 0 73 0
State 0 1 2 5 0 53 0
exterior_color_2 0 1 3 8 0 17 0
interior_color_2 0 1 3 6 0 8 0
model_2 0 1 3 9 0 17 0
transmission_2 0 1 3 9 0 4 0
engine_liters 0 1 4 5 0 12 0
engine_cylinders 0 1 4 5 0 12 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2020.56 3.67 1981 2020.0 2022.0 2023.0 2023 ▁▁▁▁▇
Price 0 1 33861.70 10277.49 1995 27200.0 34041.0 41905.0 69980 ▁▅▇▃▁
Mileage 0 1 22937.20 36823.00 0 5.0 704.5 35070.5 259029 ▇▁▁▁▁
final_mpg 0 1 26.53 6.34 0 22.5 23.5 30.0 52 ▁▁▇▂▁
data <- data %>%
  select(model_2, Condition, exterior_color_2, interior_color_2, Drivetrain, Fuel_Type, transmission_2, engine_liters, Year, Mileage, final_mpg, Price)

EDA

Numerical Distributions

hist(data$Year)

Our dataset is composed mostly of newer vehicles, with Year values concentrated in the 2020s.

hist(data$Mileage)

Unlike the Year feature, Mileage is heavily right-skewed.

hist(data$final_mpg)

Most cars being sold get around 20-25 miles per gallon, which matches my own external research.

hist(data$Price)

Our target variable, price, is nicely distributed. It follows a roughly normal shape, with no major skews.

Numerical Correlation

ggpairs(data %>% select(c("Year","Mileage","final_mpg","Price")), progress=FALSE)

From the pairs plot above, we can see that the numerical features most highly correlated with our target variable, Price, are Mileage and Year. MPG does not appear to be strongly correlated.

It is interesting to note that Mileage and MPG are both negatively correlated with Price. This makes sense for Mileage, but it is surprising for MPG.
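Since corrplot was loaded at the top, the same pairwise correlations can also be visualized as a compact correlation matrix:

# correlation matrix of the numeric features, drawn with corrplot
num_cors <- cor(data %>% select(Year, Mileage, final_mpg, Price))
corrplot(num_cors, method = "number")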

Reviewing Categorical Features

data %>%
  ggplot(aes(x=model_2, y=Price)) + 
  geom_boxplot() +
  coord_flip()

While there are a lot of low-end outliers, we can see from the above that the model of the car certainly has an impact on the final asking price. Top-of-the-line models like the Ridgeline fetch a higher asking price (on average) than the Crosstour or Element.

skim(data)
Data summary
Name data
Number of rows 4887
Number of columns 12
_______________________
Column type frequency:
character 8
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
model_2 0 1 3 9 0 17 0
Condition 0 1 3 15 0 3 0
exterior_color_2 0 1 3 8 0 17 0
interior_color_2 0 1 3 6 0 8 0
Drivetrain 0 1 1 17 0 7 0
Fuel_Type 0 1 3 22 0 7 0
transmission_2 0 1 3 9 0 4 0
engine_liters 0 1 4 5 0 12 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2020.56 3.67 1981 2020.0 2022.0 2023.0 2023 ▁▁▁▁▇
Mileage 0 1 22937.20 36823.00 0 5.0 704.5 35070.5 259029 ▇▁▁▁▁
final_mpg 0 1 26.53 6.34 0 22.5 23.5 30.0 52 ▁▁▇▂▁
Price 0 1 33861.70 10277.49 1995 27200.0 34041.0 41905.0 69980 ▁▅▇▃▁
data %>%
  ggplot(aes(x=Condition, y=Price)) + 
  geom_boxplot() +
  coord_flip()

Similarly, we can expect higher asking prices when the condition of the car is listed as “New”. This is not surprising.

data %>%
  ggplot(aes(x=exterior_color_2, y=Price)) + 
  geom_boxplot() +
  coord_flip()

There even seems to be a preference for color among car buyers! While I did not expect this, it appears that yellow cars tend to be listed at higher prices. However, that could simply be because certain (more expensive) models are more often sold in yellow.

data %>%
  ggplot(aes(x=transmission_2, y=Price)) + 
  geom_boxplot() +
  coord_flip()

There is less of a difference across transmission types. Automatics command marginally higher asking prices than the other categories, but it is close.

data %>%
  ggplot(aes(x=engine_liters, y=Price)) + 
  geom_boxplot() +
  coord_flip()

It is interesting to note that the typical price is not monotonically related to engine displacement. I would have expected more displacement to equate to a higher average price, but the trend is mixed: the highest-priced displacements are 3.5L (the largest), 2.0L, and 1.5L.

Initial Models

# Partition data - train (80%) & test (20%)
set.seed(42)
ind <- sample(2, nrow(data), replace = T, prob = c(0.80, 0.20))
train <- data[ind==1,]
test <- data[ind==2,]
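Alternatively, since caret is loaded, createDataPartition() produces a split that is stratified on the outcome (it samples within quantile groups of Price), which keeps the train and test price distributions similar. A sketch, using hypothetical train_alt/test_alt names so the split above is left untouched:

# stratified 80/20 split on Price using caret
set.seed(42)
idx <- createDataPartition(data$Price, p = 0.8, list = FALSE)
train_alt <- data[idx, ]
test_alt  <- data[-idx, ]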

Decision Tree #1

The first model we will build is a decision tree with only one feature: model_2. I believe the model of a car is one of the most predictive features of its sale price.

# test decision tree with just model_2 as feature
model_1 <- rpart(Price ~ model_2,
   data = train, method = "anova")
summary(model_1)
## Call:
## rpart(formula = Price ~ model_2, data = train, method = "anova")
##   n= 3880 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.31440689      0 1.0000000 1.0004007 0.02129112
## 2 0.04755939      1 0.6855931 0.6872072 0.01930869
## 3 0.02083648      2 0.6380337 0.6418787 0.01898669
## 4 0.01265733      3 0.6171972 0.6228764 0.01895735
## 5 0.01000000      4 0.6045399 0.6111974 0.01869524
## 
## Variable importance
## model_2 
##     100 
## 
## Node number 1: 3880 observations,    complexity param=0.3144069
##   mean=33841.93, MSE=1.055206e+08 
##   left son=2 (1382 obs) right son=3 (2498 obs)
##   Primary splits:
##       model_2 splits as  LLLRLLLLLLRLRRLRL, improve=0.3144069, (0 missing)
## 
## Node number 2: 1382 observations,    complexity param=0.02083648
##   mean=26098.08, MSE=5.512994e+07 
##   left son=4 (67 obs) right son=5 (1315 obs)
##   Primary splits:
##       model_2 splits as  RRR-LLLLRR-L--L-R, improve=0.111969, (0 missing)
## 
## Node number 3: 2498 observations,    complexity param=0.04755939
##   mean=38126.16, MSE=8.186787e+07 
##   left son=6 (1348 obs) right son=7 (1150 obs)
##   Primary splits:
##       model_2 splits as  ---L------L-LR-R-, improve=0.09521369, (0 missing)
## 
## Node number 4: 67 observations
##   mean=15091.1, MSE=2.789397e+07 
## 
## Node number 5: 1315 observations
##   mean=26658.9, MSE=5.003027e+07 
## 
## Node number 6: 1348 observations,    complexity param=0.01265733
##   mean=35547.4, MSE=7.520351e+07 
##   left son=12 (642 obs) right son=13 (706 obs)
##   Primary splits:
##       model_2 splits as  ---L------R-R----, improve=0.05111911, (0 missing)
## 
## Node number 7: 1150 observations
##   mean=41148.91, MSE=7.274769e+07 
## 
## Node number 12: 642 observations
##   mean=33491.3, MSE=5.669908e+07 
## 
## Node number 13: 706 observations
##   mean=37417.12, MSE=8.46903e+07
model_1 %>% rpart.plot(box.palette="RdBu", shadow.col="gray", nn=TRUE)

This first decision tree provides a very interesting way to map models to expected prices. At the root, all models are included; as we go down the tree, the models are split into different groupings. It looks like the Odyssey, Passport, and CR-V are some of the more expensive vehicles offered, while the Crosstour and Element are some of the lower-end models (which validates our initial EDA).

# make predictions
p <- predict(model_1, test)

# Root Mean Square Error
sqrt(mean((test$Price - p)^2))
## [1] 8009.782

Using a simple model like this is not very effective, though: the RMSE is around $8,010. We know that prices are near-normally distributed and range from roughly $2,000 to $70,000, so while this is ‘close’, it could be better.
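Since we will compute this error for every model below, a small hypothetical rmse() helper could reduce the repetition:

# root mean square error between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}
# usage: rmse(test$Price, p)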

Decision Tree #2

The next model we will try will use two new features: Condition and Mileage.

# decision tree with Condition and Mileage as features
model_2 <- rpart(Price ~ Condition + Mileage,
   data = train, method = "anova")
summary(model_2)
## Call:
## rpart(formula = Price ~ Condition + Mileage, data = train, method = "anova")
##   n= 3880 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.41689305      0 1.0000000 1.0002949 0.021293335
## 2 0.08367721      1 0.5831070 0.5991179 0.011845764
## 3 0.04112544      2 0.4994297 0.5056072 0.010043889
## 4 0.01659143      3 0.4583043 0.4744812 0.009534910
## 5 0.01000000      4 0.4417129 0.4518086 0.008953413
## 
## Variable importance
##   Mileage Condition 
##        67        33 
## 
## Node number 1: 3880 observations,    complexity param=0.416893
##   mean=33841.93, MSE=1.055206e+08 
##   left son=2 (1234 obs) right son=3 (2646 obs)
##   Primary splits:
##       Mileage   < 24710    to the right, improve=0.4168930, (0 missing)
##       Condition splits as  LRL,          improve=0.3787503, (0 missing)
##   Surrogate splits:
##       Condition splits as  RRL, agree=0.861, adj=0.562, (0 split)
## 
## Node number 2: 1234 observations,    complexity param=0.08367721
##   mean=24129.72, MSE=6.847446e+07 
##   left son=4 (451 obs) right son=5 (783 obs)
##   Primary splits:
##       Mileage   < 67398.5  to the right, improve=0.40544550, (0 missing)
##       Condition splits as  R-L,          improve=0.05161342, (0 missing)
## 
## Node number 3: 2646 observations,    complexity param=0.04112544
##   mean=38371.36, MSE=5.829106e+07 
##   left son=6 (543 obs) right son=7 (2103 obs)
##   Primary splits:
##       Condition splits as  LRL,          improve=0.1091661, (0 missing)
##       Mileage   < 1349.5   to the right, improve=0.1016215, (0 missing)
##   Surrogate splits:
##       Mileage < 720.5    to the right, agree=0.98, adj=0.901, (0 split)
## 
## Node number 4: 451 observations,    complexity param=0.01659143
##   mean=17187.1, MSE=4.58321e+07 
##   left son=8 (183 obs) right son=9 (268 obs)
##   Primary splits:
##       Mileage   < 105080.5 to the right, improve=0.32862940, (0 missing)
##       Condition splits as  R-L,          improve=0.03483997, (0 missing)
## 
## Node number 5: 783 observations
##   mean=28128.59, MSE=3.776255e+07 
## 
## Node number 6: 543 observations
##   mean=33406.99, MSE=4.47934e+07 
## 
## Node number 7: 2103 observations
##   mean=39653.18, MSE=5.376974e+07 
## 
## Node number 8: 183 observations
##   mean=12490.54, MSE=3.111674e+07 
## 
## Node number 9: 268 observations
##   mean=20394.08, MSE=3.053378e+07
model_2 %>% rpart.plot(box.palette="RdBu", shadow.col="gray", nn=TRUE)

Based on the model summary as well as the tree plot, we can see that Mileage plays a significantly more important role than Condition when pricing a car. Essentially, Condition only matters when the mileage is below roughly 25,000 (the first split is at 24,710 miles). In those cases, a Honda Certified vehicle commands a higher sale price.

# make predictions
p <- predict(model_2, test)

# Root Mean Square Error
sqrt(mean((test$Price - p)^2))
## [1] 6873.186

This new model, using two previously unused variables, performs noticeably better than our first model: the RMSE drops by over $1,000.

Decision Tree #3

The last decision tree we will try uses all features. To do this, I needed some additional preprocessing to avoid issues with unseen factor levels at prediction time, so we one-hot encode all categorical features.

## One hot encode all categorical features

dmy <- dummyVars(" ~ .", data = data)
onehot <- data.frame(predict(dmy, newdata = data))
head(onehot)
# Partition data - train (80%) & test (20%)
set.seed(42)
ind <- sample(2, nrow(onehot), replace = T, prob = c(0.80, 0.20))
dummy_train <- onehot[ind==1,]
dummy_test <- onehot[ind==2,]
# decision tree with all features
model_3 <- rpart(Price ~ .,
   data = dummy_train, method = "anova")
model_3 %>% rpart.plot(box.palette="RdBu", shadow.col="gray", nn=TRUE)

# make predictions
p <- predict(model_3, dummy_test)

# Root Mean Square Error
sqrt(mean((dummy_test$Price - p)^2))
## [1] 4289.253

Using a decision tree with all predictors resulted in the lowest RMSE yet. However, the improvement in predictive power comes at the cost of significantly more features to consider. Ultimately, the most important features in this model are (see the quick check below):

  • Year
  • Engine liters
  • Mileage
  • MPG (final_mpg)
  • Model
  • Condition
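These importances can be read directly off the fitted rpart object rather than eyeballed from the plot (the names will be the one-hot encoded dummy columns):

# scaled variable importance from the fitted tree, largest first
sort(model_3$variable.importance, decreasing = TRUE)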

Random Forest #1

Our first random forest model will include only four features, three of which we have already tested with decision trees. The new feature is exterior_color_2.

rf_1 <- randomForest(Price ~ model_2 + Mileage + Condition + exterior_color_2,
                     data = train, importance = TRUE, na.action = na.omit)
plot(rf_1)

# make predictions
p3 <- predict(rf_1, test)
# Root Mean Square Error
sqrt(mean((test$Price - p3)^2))
## [1] 4529.521

Using a random forest with just four features produces an RMSE of about 4,530, which is only slightly worse than the decision tree with all features. Let’s build a random forest with all features as well to compare.

Random Forest #2

This random forest will take all features into consideration, similar to how our last decision tree model was constructed.

rf_2 <- randomForest(Price ~ ., data = dummy_train, importance = TRUE, na.action = na.omit)
plot(rf_2)

# make predictions
p3 <- predict(rf_2, dummy_test)
# Root Mean Square Error
sqrt(mean((dummy_test$Price - p3)^2))
## [1] 2731.559

We see a significant boost when using all features! The training time for this model was significantly longer than all of our previous attempts, but the final results are very powerful.
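Because the forest was fit with importance = TRUE, we can also inspect which features it leans on most:

# permutation (%IncMSE) and node-impurity (IncNodePurity) importance for each predictor
importance(rf_2)
varImpPlot(rf_2)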

Conclusion

After exploring models built with both decision tree and random forest algorithms on various feature sets, I conclude that random forests typically outperform decision trees in predictive power. In every iteration, a random forest with a similar feature set produced better results than the corresponding decision tree.

That being said, I also noticed that the decision tree models took significantly less time to train. If we ever needed a quick-and-dirty model to better understand the data, a decision tree would be a reasonable choice, and I can also imagine using decision trees as a fast way to identify the most important features.