The file ToyotaCorolla.csv contains information on used cars (Toyota Corolla) on sale during late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications. (The example in Section 9.7 is a subset of this dataset).
Load the ToyotaCorolla.csv and install/load any required packages. Split the data into training (60%), and validation (40%) datasets.
car.df <- read.csv("D:/MSBA/3-Winter 2020/560/data/ToyotaCorolla.csv")
# partition
set.seed(22)
train.index <- sample(c(1:dim(car.df)[1]), dim(car.df)[1]*0.6)
train.df <- car.df[train.index, ]
valid.df <- car.df[-train.index, ]
Run a regression tree (RT) with outcome variable Price and predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfg_Guarantee, Guarantee_Period, Airco, Automatic_Airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Keep the minimum number of records in a terminal node to 1, maximum number of tree levels to 100, and cp = 0:001, to make the run least restrictive.
tr <- rpart(Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic + Doors + Quarterly_Tax +
Mfr_Guarantee + Guarantee_Period + Airco + Automatic_airco + CD_Player +
Powered_Windows + Sport_Model + Tow_Bar,
data = train.df,
method = "anova", minbucket = 1, maxdepth = 30, cp = 0.001)
prp(tr)
Which appear to be the three or four most important car specifications for predicting the car’s price?
#we could use str we to identify the components of the tree
##str(tr)
#or we can transpose the "vector" to create a column vector with the variable labels to get the variable importance
t(t(tr$variable.importance))
## [,1]
## Age_08_04 8883389342
## Automatic_airco 2919505875
## KM 2401906416
## Quarterly_Tax 1607048729
## HP 1050164969
## CD_Player 343306312
## Fuel_Type 218721574
## Guarantee_Period 208534738
## Airco 122851533
## Powered_Windows 64981981
## Doors 59287239
## Mfr_Guarantee 43522929
## Automatic 5272476
Based on our decision tree vector; Age, Auto AC, KM, and Quarterly Tax seem to be the top predictors of car price.
NOTE: Your results may vary based on your random seed selection.
Compare the prediction errors of the training and validation sets by examining their RMS error and by plotting the two boxplots. What is happening with the training set predictions? How does the predictive performance of the validation set compare to the training set? Why does this occur?
#calculate training and validation accuracy
accuracy(predict(tr, train.df), train.df$Price)
## ME RMSE MAE MPE MAPE
## Test set 6.534122e-14 993.2493 780.6424 -1.094599 8.044445
accuracy(predict(tr, valid.df), valid.df$Price)
## ME RMSE MAE MPE MAPE
## Test set -6.90479 1319.292 926.2175 -1.213805 8.824279
#to calculate the errors you need to subtract the prediction from the actual
train.err <-predict(tr, train.df) -train.df$Price
valid.err <-predict(tr, valid.df) -valid.df$Price
#create a data fram from the training and validation errors in order to plot
err <-data.frame(Error =c(train.err, valid.err),
Set =c(rep("Training", length(train.err)),
rep("Validation", length(valid.err))))
#create your error box plots
boxplot(Error~Set, data=err, main="RMS Errors",
xlab = "Set", ylab = "Error",
col="blueviolet",medcol="darkgoldenrod1",boxlty=0,border="black",
whisklty=1,staplelwd=4,outpch=13,outcex=1,outcol="darkslateblue")
The training data has fewer error outliers and when compared to the validation data it appears to be performing better, but still within a similar range as our validation data. This is a good thing because it indicates that we have split our training data into a large enough sample size. If the validation had been more accurate we would need to consider the possibility that we had underfit the data.
How can we achieve predictions for the training set that are not equal to the actual prices?
If we wanted to get predictions in a training set that are not equal to the actual prices we would want to experiment with making the training sample size smaller.
Prune the full tree using the cross-validation error. Compared to the full tree, what is the predictive performance for the validation set?
#prune the tree
tr.shallow <- rpart(Price ~ Age_08_04 + KM + Fuel_Type +
HP + Automatic + Doors + Quarterly_Tax +
Mfr_Guarantee + Guarantee_Period + Airco +
Automatic_airco + CD_Player + Powered_Windows +
Sport_Model + Tow_Bar, data = train.df)
prp(tr.shallow)
#determine the training accuracy of the prunned tree's training and validation sets
accuracy(predict(tr.shallow, train.df), train.df$Price)
## ME RMSE MAE MPE MAPE
## Test set 3.422132e-13 1342.992 1006.305 -1.64915 10.05497
accuracy(predict(tr.shallow, valid.df), valid.df$Price)
## ME RMSE MAE MPE MAPE
## Test set 65.81223 1440.945 1043.952 -0.8139846 9.827839
As expected, compared to the full tree, our pruned tree performs worse on the training set (RMSE=1343 compared to 993 for the full tree). The validation set also performed worse better(RMSE=1441 compared to 1319), but the pruned validation set performed better than the pruned training set. This indicates that we have underfit our model.
Let us see the effect of turning the price variable into a categorical variable. First, create a new variable that categorizes price into 20 bins. Now repartition the data keeping Binned_Price instead of Price. Run a classification tree with the same set of input variables as in the RT, and with Binned_Price as the output variable. Keep the minimum number of records in a terminal node to 1.
#create bins based on the car prices as categories
#determine the number of bins based on the min and max prices
bins <- seq(min(car.df$Price),
max(car.df$Price),
(max(car.df$Price) - min(car.df$Price))/20)
bins
## [1] 4350.0 5757.5 7165.0 8572.5 9980.0 11387.5 12795.0 14202.5 15610.0
## [10] 17017.5 18425.0 19832.5 21240.0 22647.5 24055.0 25462.5 26870.0 28277.5
## [19] 29685.0 31092.5 32500.0
#use bincode to determine the bin assignments
#NOTE: if you vie Binned_Price you can see the assignments
Binned_Price <- .bincode(car.df$Price,
bins,
include.lowest = TRUE)
#convert the Binned_Price to factors for classification
Binned_Price <- as.factor(Binned_Price)
Binned_Price
## [1] 7 7 7 8 7 7 9 11 13 7 12 12 11 13 13 13 14 10 9 9 9 9 9 9
## [25] 9 9 10 9 9 10 7 9 9 8 8 9 9 8 9 8 7 9 7 9 9 11 10 9
## [49] 10 13 10 9 12 13 8 7 8 8 11 9 8 9 11 10 10 9 11 8 13 9 9 7
## [73] 11 9 12 9 11 11 9 8 11 10 8 10 9 10 8 10 9 13 9 13 12 9 11 12
## [97] 9 9 11 10 11 9 11 11 11 9 11 10 10 20 19 20 15 15 14 15 13 10 11 13
## [121] 11 12 9 11 9 13 9 9 10 9 9 9 9 9 9 9 11 9 14 12 9 14 12 11
## [145] 11 9 12 15 11 12 10 12 11 11 13 9 11 11 11 11 11 12 11 11 10 12 12 12
## [169] 12 10 10 14 11 11 13 12 11 12 13 13 11 11 12 13 10 10 2 4 6 3 6 1
## [193] 1 6 7 6 6 8 4 6 6 5 5 5 7 6 6 5 6 6 7 8 6 6 7 5
## [217] 7 5 5 7 6 6 6 8 6 7 6 6 6 6 6 7 6 7 6 6 5 7 7 6
## [241] 5 6 6 7 6 7 6 7 7 6 6 5 6 8 4 7 7 6 6 7 6 6 7 6
## [265] 6 6 6 6 8 5 7 7 7 7 7 6 7 6 6 8 7 7 7 7 6 7 6 4
## [289] 6 7 6 7 5 6 7 5 7 7 7 7 6 6 7 6 7 6 4 7 6 6 7 7
## [313] 6 6 4 7 7 5 4 6 6 5 7 5 7 6 5 7 7 6 5 7 6 6 6 6
## [337] 7 6 6 6 6 6 8 6 7 8 7 7 7 6 6 4 6 6 8 7 6 8 6 8
## [361] 7 6 6 7 7 5 5 6 6 7 5 7 6 7 7 6 6 7 2 2 2 3 4 3
## [385] 4 4 5 4 3 4 3 3 4 1 4 4 4 6 5 5 4 5 1 5 4 4 5 6
## [409] 4 6 3 5 4 6 5 4 4 5 4 4 5 4 4 6 4 4 6 6 5 7 6 5
## [433] 5 5 5 5 6 4 5 6 6 5 6 6 6 5 6 5 6 4 5 6 6 6 6 4
## [457] 5 5 4 5 4 6 5 4 4 6 4 6 7 5 5 4 4 6 5 4 5 4 5 6
## [481] 6 6 6 4 4 5 5 4 6 4 5 5 4 6 6 5 6 5 5 4 4 6 4 5
## [505] 4 6 6 6 5 5 6 6 7 5 5 5 6 5 5 6 4 6 4 11 6 5 6 4
## [529] 5 7 4 5 5 6 7 6 5 4 5 6 5 6 5 5 7 5 6 4 5 6 5 5
## [553] 7 5 6 5 6 7 5 7 5 5 4 7 4 5 5 5 5 7 7 6 5 6 4 6
## [577] 6 6 6 6 6 5 4 5 5 7 4 7 4 4 5 5 4 5 5 5 5 5 5 7
## [601] 5 3 4 2 3 2 3 3 2 1 2 3 3 3 3 2 4 2 3 3 4 2 4 4
## [625] 3 4 4 4 3 3 3 4 4 4 4 4 5 3 5 4 4 4 3 5 4 4 3 2
## [649] 3 4 4 3 4 4 2 3 4 3 4 5 3 4 4 4 4 3 4 4 4 4 2 3
## [673] 3 4 2 4 4 4 4 4 3 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4
## [697] 6 4 5 4 3 4 3 5 3 4 5 3 4 4 4 3 4 4 3 3 4 4 4 3
## [721] 3 3 4 3 2 3 5 4 4 4 6 5 5 4 5 5 4 4 4 4 3 5 4 4
## [745] 3 3 3 5 4 4 5 5 3 4 4 4 4 4 3 5 3 3 4 4 5 5 4 4
## [769] 5 3 3 3 4 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3
## [793] 3 4 6 4 6 4 4 3 4 5 4 5 4 4 4 3 3 4 3 4 4 5 4 4
## [817] 3 4 4 5 4 3 4 5 2 4 2 4 4 4 4 4 5 5 4 4 5 4 4 5
## [841] 5 4 5 4 3 4 5 5 4 4 3 4 3 4 4 4 4 3 3 4 4 5 4 3
## [865] 4 4 5 4 4 5 4 4 5 4 4 4 4 4 3 4 5 4 4 3 4 4 5 4
## [889] 5 4 3 6 4 3 5 4 3 4 4 4 3 4 4 4 4 4 4 4 3 3 4 4
## [913] 4 7 4 5 3 5 4 4 5 4 4 3 4 4 5 4 4 5 4 5 5 4 4 5
## [937] 5 4 4 5 4 4 5 3 5 5 6 4 3 4 3 4 3 4 4 4 5 4 4 4
## [961] 4 4 4 4 5 4 4 4 4 5 4 5 4 3 5 4 4 4 4 4 5 4 4 3
## [985] 4 4 3 4 5 4 5 3 4 4 3 4 4 4 4 5 4 4 3 5 4 4 5 5
## [1009] 4 4 4 4 4 3 5 5 4 4 5 4 6 3 5 5 5 4 5 4 5 5 4 5
## [1033] 5 5 5 6 4 5 4 5 4 5 5 4 2 2 2 1 1 2 3 2 2 1 4 2
## [1057] 2 2 5 3 3 2 2 2 1 2 4 2 3 3 3 3 2 3 2 1 2 2 3 4
## [1081] 3 4 4 2 3 3 2 3 2 3 4 3 3 1 3 2 3 3 3 3 3 2 2 3
## [1105] 3 3 3 3 3 4 3 3 3 1 2 2 2 3 4 3 3 3 3 4 3 2 2 4
## [1129] 3 3 3 4 2 4 3 2 2 2 4 3 2 3 3 4 3 2 2 3 2 3 4 3
## [1153] 3 3 2 3 2 4 2 4 3 3 3 4 4 4 3 2 3 4 2 2 3 2 3 4
## [1177] 4 3 3 4 3 2 4 3 4 2 3 3 3 3 2 3 2 3 3 4 4 4 3 4
## [1201] 4 3 2 3 3 2 3 3 3 3 3 3 3 2 4 4 3 3 4 3 3 3 3 3
## [1225] 4 3 2 3 3 4 4 2 3 3 4 3 3 2 3 2 4 4 3 3 2 3 3 2
## [1249] 3 3 4 3 3 2 3 3 3 3 3 4 3 4 4 3 3 4 2 3 4 4 3 2
## [1273] 3 2 4 3 3 4 3 3 3 3 3 4 4 3 3 3 4 3 3 3 3 3 2 3
## [1297] 3 2 3 4 3 2 3 3 3 4 3 4 3 4 4 4 4 2 3 4 3 3 3 3
## [1321] 4 3 4 4 3 2 3 4 2 3 4 2 3 5 2 4 3 4 3 4 4 2 3 3
## [1345] 4 3 3 3 4 2 3 2 3 3 4 2 3 3 3 4 4 2 3 2 3 3 3 4
## [1369] 4 3 4 4 2 3 4 3 3 4 4 3 4 3 2 5 4 3 4 3 4 4 3 4
## [1393] 3 3 3 4 4 3 4 4 3 4 5 2 3 3 4 3 4 3 3 3 4 4 3 2
## [1417] 4 4 3 3 3 3 3 3 3 3 4 4 3 4 3 3 5 3 3 2
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 19 20
#add the binned price to the data frame based on the indexes
#this will allow us to identify the training and validation data frames
train.df$Binned_Price <- Binned_Price[train.index]
valid.df$Binned_Price <- Binned_Price[-train.index]
Compare the tree generated by the CT with the one generated by the RT. Are they different? (Look at structure, the top predictors, size of tree, etc.) Why?
tr.binned <- rpart(Binned_Price ~ Age_08_04 + KM + Fuel_Type +
HP + Automatic + Doors + Quarterly_Tax +
Mfr_Guarantee + Guarantee_Period + Airco +
Automatic_airco + CD_Player + Powered_Windows +
Sport_Model + Tow_Bar, data = train.df)
prp(tr.binned)
t(t(tr.binned$variable.importance))
## [,1]
## Age_08_04 140.372139
## KM 72.439706
## CD_Player 25.003200
## Automatic_airco 19.178101
## Airco 18.794593
## Quarterly_Tax 17.627902
## Sport_Model 15.479258
## HP 8.338291
## Powered_Windows 7.547571
## Fuel_Type 4.051562
## Mfr_Guarantee 3.774845
## Doors 3.273578
## Guarantee_Period 1.002471
When creating bins the intent is to decrease the number of variables. This unsurpisingly results in the binned tree being significantly smaller than the original full tree. One interesting thing to note is the in our binned tree, CD Player replaces Quarterly Tax as one of the top four specifications in indicating price. This could be because tax variations are less significant within bins.
#first create your new record
new.record <- data.frame(Age_08_04 = 77,
KM = 117000,
Fuel_Type = "Petrol",
HP = 110,
Automatic = 0,
Doors = 5,
Quarterly_Tax = 100,
Mfr_Guarantee = 0,
Guarantee_Period = 3,
Airco = 1,
Automatic_airco = 0,
CD_Player = 0,
Powered_Windows = 0,
Sport_Model = 0,
Tow_Bar = 1)
#set up your regression and classification trees
price.tr <- predict(tr, newdata = new.record)
#remember that we have bins for our CT
price.tr.bin <- bins[predict(tr.binned, newdata = new.record, type = "class")]
cat(paste("Regression Price Estimate: ",scales::dollar(price.tr,0.01)),
paste("Classification Price Estimate: ",scales::dollar(price.tr.bin,0.01)),
sep='\n')
## Regression Price Estimate: $7,132.22
## Classification Price Estimate: $7,165.00
Compare the predictions in terms of the predictors that were used, the magnitude of the difference between the two predictions, and the advantages and disadvantages of the two methods.
Our predictions for the two models were very simmilar. A difference of $32.78 (less than 1% of the total price of the car) is statistically insignificant in this case. Our binned model returned a whole number while the full model returned a more “accurate” price, but ultimately it is a wash. Both models had comparable accuracy, but the full regression seemed to be better trained. If we wanted to use the binned model I would suggest creating smaller bin ranges to prevent underfitting the model. However, when considering the the overall accuracy range and the car sale market both models would be considered good enough for most used car sales markets.