Textbook Problem 9.3

Read in and subset the Toyota Corolla dataset to obtain a data frame with our one response (Price) and 15 predictor variables.

library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)
library(forecast)
ToyotaCorolla <- read.csv("/Users/Simbo/Desktop/School/STAT415/DMBA-R-datasets/ToyotaCorolla.csv")

ToyotaCorolla <- ToyotaCorolla %>% 
                    select(Price, Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, 
                           Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, 
                           Automatic_airco, CD_Player, Powered_Windows, Sport_Model, Tow_Bar)

Partition data

Split the data into a training (60%) and validation (40%) set.

set.seed(100)
# Randomly sample 60% of the row indices for training; the remaining 40% form the validation set
train.index <- sample(1:nrow(ToyotaCorolla), nrow(ToyotaCorolla) * 0.6)
train.df <- ToyotaCorolla[train.index, ]
valid.df <- ToyotaCorolla[-train.index, ]

Regression tree with few restrictions on growth:

tree.1 <- rpart(Price ~ ., data = train.df,
                method = "anova",   # regression tree (continuous response)
                control = rpart.control(
                  cp = 0.001,       # tiny complexity penalty, so the tree grows large
                  minbucket = 1,    # allow terminal nodes containing a single record
                  maxdepth = 30     # rpart's maximum allowed depth
                ))

prp(tree.1, type = 1, extra = 1, varlen = 0, under = TRUE)

From this unpruned tree, we see that age, mileage (KM), AC, horsepower, and quarterly tax are all important predictors of price. The most expensive cars (middle right of the plot) are predicted at about $31,000; these are cars that have AC, are less than 32 months old, and have a quarterly tax below 259.
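As a quick cross-check on this reading of the plot, rpart stores variable importance scores on the fitted object; printing them should show the same handful of predictors near the top (a sketch, output not shown here):

# Importance scores accumulated over all splits in the unpruned tree
tree.1$variable.importance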

RMSE and boxplots of errors:

tree.1.train <- predict(tree.1, train.df)
train.errors <- data.frame(Training = tree.1.train - train.df$Price)

accuracy(tree.1.train, train.df$Price)
##                    ME     RMSE      MAE        MPE     MAPE
## Test set -1.97771e-13 961.6308 755.1341 -0.9820825 7.678457
ggplot(train.errors) + geom_boxplot(aes(x = Training))

tree.1.valid <- predict(tree.1, valid.df)
valid.errors <- data.frame(Validation = tree.1.valid - valid.df$Price)

accuracy(tree.1.valid, valid.df$Price)
##                ME     RMSE      MAE      MPE     MAPE
## Test set 206.7865 1357.144 967.1083 0.730866 9.116813
ggplot(valid.errors) + geom_boxplot(aes(x = Validation))

From the boxplots, we can see the training set has far fewer extreme outliers. In the validation set, there are quite a few predictions that were off by more than $5,000. When we grow a tree with this many splits, we expect some overfitting, which produces larger prediction errors on unseen validation data. For the same reason, the RMSE and MAE are higher on the validation set.
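A side-by-side boxplot of the two error distributions makes this comparison easier to see; a small sketch using the error data frames built above:

# Stack the training and validation errors into one long data frame for a combined plot
all.errors <- bind_rows(
  data.frame(Set = "Training", Error = train.errors$Training),
  data.frame(Set = "Validation", Error = valid.errors$Validation))
ggplot(all.errors) + geom_boxplot(aes(x = Set, y = Error))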

If we want predictions that are not tied to the actual prices in the data, we could instead create classes (price ranges) and build a tree that predicts a specific class rather than an exact price.

Tree pruned using the cross-validation error:

pruned.tree.1 <- prune(tree.1, cp = tree.1$cptable[which.min(tree.1$cptable[, "xerror"]), "CP"])
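The CP value plugged in above is the one that minimizes the cross-validated error (xerror) in the tree's CP table; that table can be inspected directly with printcp(), or plotted with plotcp() (output not shown):

# Cross-validation results for each candidate complexity parameter
printcp(tree.1)
plotcp(tree.1)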

prp(pruned.tree.1, type = 1, extra = 1, varlen = 0, under = TRUE)

pruned.tree.1.train <- predict(pruned.tree.1, train.df)
pruned.train.errors <- data.frame(Training = pruned.tree.1.train - train.df$Price)

accuracy(pruned.tree.1.train, train.df$Price)
##                    ME     RMSE      MAE        MPE     MAPE
## Test set -1.76371e-13 968.3715 759.6538 -0.9947015 7.723231
ggplot(pruned.train.errors) + geom_boxplot(aes(x = Training))

pruned.tree.1.valid <- predict(pruned.tree.1, valid.df)
pruned.valid.errors <- data.frame(Validation = pruned.tree.1.valid - valid.df$Price)

accuracy(pruned.tree.1.valid, valid.df$Price)
##                ME     RMSE      MAE       MPE     MAPE
## Test set 204.4832 1358.178 968.6415 0.7012751 9.133298
ggplot(pruned.valid.errors) + geom_boxplot(aes(x = Validation))

After pruning the tree (from 99 nodes to 77 nodes), performance on the training set is slightly worse: the training RMSE rises from about 962 to 968. On the validation set the pruned tree performs essentially the same (RMSE 1357 vs. 1358, with a slightly smaller mean error), and some large outliers remain. So pruning gives us a noticeably simpler tree at essentially no cost in predictive accuracy.
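A compact way to verify these comparisons is to collect the node counts and the four RMSE values into one small table, built from the objects above:

# Number of nodes before and after pruning
c(unpruned = nrow(tree.1$frame), pruned = nrow(pruned.tree.1$frame))

# RMSE for each tree on each partition
data.frame(Model = c("Unpruned, training", "Unpruned, validation",
                     "Pruned, training", "Pruned, validation"),
           RMSE = c(accuracy(tree.1.train, train.df$Price)[, "RMSE"],
                    accuracy(tree.1.valid, valid.df$Price)[, "RMSE"],
                    accuracy(pruned.tree.1.train, train.df$Price)[, "RMSE"],
                    accuracy(pruned.tree.1.valid, valid.df$Price)[, "RMSE"]))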

Now let’s transform Price into a categorical variable and create a new training/validation set:

# Split Price into 20 equal-width bins, labelled 1 through 20
Binned_Price <- cut(ToyotaCorolla$Price, breaks = 20, labels = c(1:20))

ToyotaCorolla.2 <- ToyotaCorolla %>% select(-Price)
ToyotaCorolla.2$Binned_Price <- as.factor(Binned_Price)

train.df2 <- ToyotaCorolla.2[train.index, ]
valid.df2 <- ToyotaCorolla.2[-train.index, ]
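Before fitting, it is worth a quick look at how the training cases spread across the 20 price bins, since the upper bins are likely to be sparse (a quick check, output not shown):

# Distribution of training cases across the 20 price bins
table(train.df2$Binned_Price)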

Using the training set, create a classification tree and prune it using the cross-validation error:

tree.2 <- rpart(Binned_Price ~ ., data = train.df2,
                method = "class", 
                control = rpart.control(
                  cp = 0.001, 
                  minbucket = 1, 
                  maxdepth = 30
                  ))

pruned.tree.2 <- prune(tree.2, cp = tree.2$cptable[which.min(tree.2$cptable[, "xerror"]), "CP"])

prp(pruned.tree.2, type = 1, varlen = 0)

After creating another tree with price split into 20 bins, we see a very similar tree structure, and four of the five variables from before (age, mileage, HP, and quarterly tax) again play the biggest role in predicting price. In this tree, the CD player variable contributes more to predicting the price bin than it did in the regression tree.
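As before, the stored variable importance scores offer a quick way to confirm this impression for the pruned classification tree (output not shown):

# Importance scores for the pruned classification tree
pruned.tree.2$variable.importance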

Now to predict the price of a specific car:

particular_car <- data.frame(Age_08_04 = 77, KM = 117000, Fuel_Type = "Petrol", HP = 110, Automatic = 0, Doors = 5, 
                              Quarterly_Tax = 100, Mfr_Guarantee = 0, Guarantee_Period = 3, Airco = 1, 
                              Automatic_airco = 0, CD_Player = 0, Powered_Windows = 0, Sport_Model = 0, Tow_Bar = 1)

predicted.price <- predict(pruned.tree.1, particular_car)
predicted.price
##        1 
## 7538.952
predicted.price2 <- predict(pruned.tree.2, particular_car, type = "class")
predicted.price2
## 1 
## 3 
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
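To translate the predicted bin back into a dollar range, we can recover the interval boundaries by repeating the cut() call without the numeric labels and looking up the predicted level; a sketch assuming the same 20 equal-width breaks used above:

# Default cut() labels are the interval boundaries of each bin
price.bins <- cut(ToyotaCorolla$Price, breaks = 20)
levels(price.bins)[as.integer(predicted.price2)]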

Summary

Our regression tree from before predicts the price of this car to be $7,538.95. The classification tree just above predicts this car to fall in the bin of roughly $7,160-$8,570, the third-lowest of the 20 groups. Both trees place this car on the cheaper end of Toyota Corollas. The regression tree is advantageous because it gives a more precise numeric prediction, while the classification tree is advantageous when a price range or class is all that is needed. In this case, the class predictions could be made more informative for new cars by splitting price into more bins, say 30. In either case, we can see that a Corolla matching these specifications would be toward the cheaper end.