Textbook Problem 9.3

Read in and subset the Toyota Corolla dataset to obtain a data frame with our one response (Price) and 15 predictor variables.

library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)
library(forecast)
ToyotaCorolla <- read.csv("/Users/Simbo/Desktop/School/STAT415/DMBA-R-datasets/ToyotaCorolla.csv")

ToyotaCorolla <- ToyotaCorolla %>% 
                    select(Price, Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, 
                           Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, 
                           Automatic_airco, CD_Player, Powered_Windows, Sport_Model, Tow_Bar)

Partition data

Split the data into a training (60%) and validation (40%) set.

set.seed(100)
# Randomly sample 60% of the row indices for training; the remaining 40% form the validation set
train.index <- sample(1:nrow(ToyotaCorolla), nrow(ToyotaCorolla) * 0.6)
train.df <- ToyotaCorolla[train.index, ]
valid.df <- ToyotaCorolla[-train.index, ]

Regression tree with few restrictions on growth:

tree.1 <- rpart(Price ~ ., data = train.df,
                method = "anova",   # regression tree (continuous response)
                control = rpart.control(
                  cp = 0.001,       # tiny complexity penalty, so the tree grows large
                  minbucket = 1,    # allow terminal nodes containing a single record
                  maxdepth = 30     # rpart's maximum allowed depth
                ))

prp(tree.1, type = 1, extra = 1, varlen = 0, under = TRUE)

From this unpruned tree, we see that age, mileage (KM), AC, horsepower, and quarterly tax are all important predictors of price. The most expensive cars (middle right of the plot) are predicted at about $31,000; these are cars that have AC, are less than 32 months old, and have a quarterly tax below 259.
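As a quick cross-check on this reading of the plot, rpart stores variable importance scores on the fitted object; printing them should show the same handful of predictors near the top (a sketch, output not shown here):

# Importance scores accumulated over all splits in the unpruned tree
tree.1$variable.importance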

RMSE and boxplots of errors:

tree.1.train <- predict(tree.1, train.df)
train.errors <- data.frame(Training = tree.1.train - train.df$Price)

accuracy(tree.1.train, train.df$Price)
##                    ME     RMSE      MAE        MPE     MAPE
## Test set -1.97771e-13 961.6308 755.1341 -0.9820825 7.678457
ggplot(train.errors) + geom_boxplot(aes(x = Training))

tree.1.valid <- predict(tree.1, valid.df)
valid.errors <- data.frame(Validation = tree.1.valid - valid.df$Price)

accuracy(tree.1.valid, valid.df$Price)
##                ME     RMSE      MAE      MPE     MAPE
## Test set 206.7865 1357.144 967.1083 0.730866 9.116813
ggplot(valid.errors) + geom_boxplot(aes(x = Validation))

From the boxplots, we can see the training set has far fewer extreme outliers. In the validation set, there are quite a few predictions that were off by more than $5,000. When we grow a tree with this many splits, we expect some overfitting, which produces larger prediction errors on unseen validation data. For the same reason, the RMSE and MAE are higher on the validation set.
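A side-by-side boxplot of the two error distributions makes this comparison easier to see; a small sketch using the error data frames built above:

# Stack the training and validation errors into one long data frame for a combined plot
all.errors <- bind_rows(
  data.frame(Set = "Training", Error = train.errors$Training),
  data.frame(Set = "Validation", Error = valid.errors$Validation))
ggplot(all.errors) + geom_boxplot(aes(x = Set, y = Error))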

If we want predictions that are not tied to the actual prices in the data, we could instead create classes (price ranges) and build a tree that predicts a specific class rather than an exact price.

Tree pruned using the cross-validation error:

pruned.tree.1 <- prune(tree.1, cp = tree.1$cptable[which.min(tree.1$cptable[, "xerror"]), "CP"])
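The CP value plugged in above is the one that minimizes the cross-validated error (xerror) in the tree's CP table; that table can be inspected directly with printcp(), or plotted with plotcp() (output not shown):

# Cross-validation results for each candidate complexity parameter
printcp(tree.1)
plotcp(tree.1)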

prp(pruned.tree.1, type = 1, extra = 1, varlen = 0, under = TRUE)

pruned.tree.1.train <- predict(pruned.tree.1, train.df)
pruned.train.errors <- data.frame(Training = pruned.tree.1.train - train.df$Price)

accuracy(pruned.tree.1.train, train.df$Price)
##                    ME     RMSE      MAE        MPE     MAPE
## Test set -1.76371e-13 968.3715 759.6538 -0.9947015 7.723231
ggplot(pruned.train.errors) + geom_boxplot(aes(x = Training))

pruned.tree.1.valid <- predict(pruned.tree.1, valid.df)
pruned.valid.errors <- data.frame(Validation = pruned.tree.1.valid - valid.df$Price)

accuracy(pruned.tree.1.valid, valid.df$Price)
##                ME     RMSE      MAE       MPE     MAPE
## Test set 204.4832 1358.178 968.6415 0.7012751 9.133298
ggplot(pruned.valid.errors) + geom_boxplot(aes(x = Validation))

After pruning the tree (from 99 nodes to 77 nodes), performance on the training set is slightly worse: the training RMSE rises from about 962 to 968. On the validation set the pruned tree performs essentially the same (RMSE 1357 vs. 1358, with a slightly smaller mean error), and some large outliers remain. So pruning gives us a noticeably simpler tree at essentially no cost in predictive accuracy.
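A compact way to verify these comparisons is to collect the node counts and the four RMSE values into one small table, built from the objects above:

# Number of nodes before and after pruning
c(unpruned = nrow(tree.1$frame), pruned = nrow(pruned.tree.1$frame))

# RMSE for each tree on each partition
data.frame(Model = c("Unpruned, training", "Unpruned, validation",
                     "Pruned, training", "Pruned, validation"),
           RMSE = c(accuracy(tree.1.train, train.df$Price)[, "RMSE"],
                    accuracy(tree.1.valid, valid.df$Price)[, "RMSE"],
                    accuracy(pruned.tree.1.train, train.df$Price)[, "RMSE"],
                    accuracy(pruned.tree.1.valid, valid.df$Price)[, "RMSE"]))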

Now let’s transform Price into a categorical variable and create a new training/validation set:

# Split Price into 20 equal-width bins, labelled 1 through 20
Binned_Price <- cut(ToyotaCorolla$Price, breaks = 20, labels = c(1:20))

ToyotaCorolla.2 <- ToyotaCorolla %>% select(-Price)
ToyotaCorolla.2$Binned_Price <- as.factor(Binned_Price)

train.df2 <- ToyotaCorolla.2[train.index, ]
valid.df2 <- ToyotaCorolla.2[-train.index, ]
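Before fitting, it is worth a quick look at how the training cases spread across the 20 price bins, since the upper bins are likely to be sparse (a quick check, output not shown):

# Distribution of training cases across the 20 price bins
table(train.df2$Binned_Price)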

Using the training set, create a classification tree and prune it using the cross-validation error:

tree.2 <- rpart(Binned_Price ~ ., data = train.df2,
                method = "class", 
                control = rpart.control(
                  cp = 0.001, 
                  minbucket = 1, 
                  maxdepth = 30
                  ))

pruned.tree.2 <- prune(tree.2, cp = tree.2$cptable[which.min(tree.2$cptable[, "xerror"]), "CP"])

prp(pruned.tree.2, type = 1, varlen = 0)

After creating another tree with price split into 20 bins, we see a very similar tree structure, and four of the five variables from before (age, mileage, HP, and quarterly tax) again play the biggest role in predicting price. In this tree, the CD player variable contributes more to predicting the price bin than it did in the regression tree.
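As before, the stored variable importance scores offer a quick way to confirm this impression for the pruned classification tree (output not shown):

# Importance scores for the pruned classification tree
pruned.tree.2$variable.importance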

Now to predict the price of a specific car:

particular_car <- data.frame(Age_08_04 = 77, KM = 117000, Fuel_Type = "Petrol", HP = 110, Automatic = 0, Doors = 5, 
                              Quarterly_Tax = 100, Mfr_Guarantee = 0, Guarantee_Period = 3, Airco = 1, 
                              Automatic_airco = 0, CD_Player = 0, Powered_Windows = 0, Sport_Model = 0, Tow_Bar = 1)

predicted.price <- predict(pruned.tree.1, particular_car)
predicted.price
##        1 
## 7538.952
predicted.price2 <- predict(pruned.tree.2, particular_car, type = "class")
predicted.price2
## 1 
## 3 
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
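To translate the predicted bin back into a dollar range, we can recover the interval boundaries by repeating the cut() call without the numeric labels and looking up the predicted level; a sketch assuming the same 20 equal-width breaks used above:

# Default cut() labels are the interval boundaries of each bin
price.bins <- cut(ToyotaCorolla$Price, breaks = 20)
levels(price.bins)[as.integer(predicted.price2)]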

Summary

Our regression tree from before predicts the price of this car to be $7,538.95. The classification tree just above predicts this car to fall in the bin of roughly $7,160-$8,570, the third-lowest of the 20 groups. Both trees place this car on the cheaper end of Toyota Corollas. The regression tree is advantageous because it gives a more precise numeric prediction, while the classification tree is advantageous when a price range or class is all that is needed. In this case, the class predictions could be made more informative for new cars by splitting price into more bins, say 30. In either case, we can see that a Corolla matching these specifications would be toward the cheaper end.