Prediction Trees

Example: California Real Estate

The basic idea is very simple. We want to predict a response or class Y from inputs X1, X2, . . . , Xp. We do this by growing a binary tree: at each internal node we apply a test to one of the inputs, go to the left or right sub-branch depending on the outcome, and, on reaching a leaf, make the prediction stored there.
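To make this concrete, here is a toy sketch (not part of the analysis below; the node structure and threshold values are invented for illustration) of how a fitted regression tree turns an input into a prediction:

# A toy fitted tree as nested lists: each internal node tests one input
# variable against a threshold; each leaf stores a predicted value
toy.tree <- list(var="Latitude", cut=38.5,
                 left=list(var="Longitude", cut=-121,
                           left=list(pred=12.5),    # leaf: mean log price here
                           right=list(pred=11.8)),
                 right=list(pred=11.9))
# Follow the splits from the root down to a leaf, return its prediction
toy.predict <- function(node, x) {
  if (!is.null(node$pred)) return(node$pred)        # reached a leaf
  if (x[[node$var]] < node$cut) toy.predict(node$left, x)
  else toy.predict(node$right, x)
}
toy.predict(toy.tree, list(Latitude=37.9, Longitude=-122.2))  # gives 12.5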

The following code fits a regression tree of the log median house price on longitude and latitude:

require(tree)
## Loading required package: tree
calif <- read.table(file="https://raw.githubusercontent.com/jbryer/CompStats/master/Data/cadata.dat", header=TRUE)
head(calif)
##   MedianHouseValue MedianIncome MedianHouseAge TotalRooms TotalBedrooms
## 1           452600       8.3252             41        880           129
## 2           358500       8.3014             21       7099          1106
## 3           352100       7.2574             52       1467           190
## 4           341300       5.6431             52       1274           235
## 5           342200       3.8462             52       1627           280
## 6           269700       4.0368             52        919           213
##   Population Households Latitude Longitude
## 1        322        126    37.88   -122.23
## 2       2401       1138    37.86   -122.22
## 3        496        177    37.85   -122.24
## 4        558        219    37.85   -122.25
## 5        565        259    37.85   -122.25
## 6        413        193    37.85   -122.25
treefit = tree(log(MedianHouseValue) ~ Longitude+Latitude, data=calif)
plot(treefit)            # dendrogram of the fitted tree
text(treefit, cex=0.75)  # label each split with its rule, each leaf with its prediction

price.deciles = quantile(calif$MedianHouseValue, 0:10/10)
cut.prices = cut(calif$MedianHouseValue, price.deciles, include.lowest=TRUE)
# Map of the districts, shaded by price decile (darker = more expensive)
plot(calif$Longitude, calif$Latitude, col=grey(10:2/11)[cut.prices], pch=20,
     xlab="Longitude", ylab="Latitude")
# Overlay the tree's rectangular partition of the longitude-latitude plane
partition.tree(treefit, ordvars=c("Longitude","Latitude"), add=TRUE)
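Given the fitted tree, predict() returns the value stored at the leaf each new point falls into. For example, at a made-up location (the coordinates here are just an illustration):

# Predicted log median house value at one hypothetical location
predict(treefit, newdata=data.frame(Longitude=-122.2, Latitude=37.9))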

summary(treefit)
## 
## Regression tree:
## tree(formula = log(MedianHouseValue) ~ Longitude + Latitude, 
##     data = calif)
## Number of terminal nodes:  12 
## Residual mean deviance:  0.1662 = 3429 / 20630 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.75900 -0.26080 -0.01359  0.00000  0.26310  1.84100

Here “deviance” is just the sum of squared errors, so the residual mean deviance is essentially the mean squared error. Taking its square root gives an RMS error of about 0.41 on the log scale. This is higher than the models in the last handout achieved, but not shocking, since we are using only two input variables and have only twelve leaves.
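As a check (a supplementary calculation, not in the original output), the RMS error can be computed directly from the residuals of the fitted tree:

sqrt(mean(residuals(treefit)^2))  # in-sample RMS error on the log scale, about 0.41

Including all the input variables, not just longitude and latitude, should do better: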

treefit3 <- tree(log(MedianHouseValue) ~ ., data=calif)
plot(treefit3)
text(treefit3, cex=0.75)

summary(treefit3)
## 
## Regression tree:
## tree(formula = log(MedianHouseValue) ~ ., data = calif)
## Variables actually used in tree construction:
## [1] "MedianIncome"   "Latitude"       "Longitude"      "MedianHouseAge"
## Number of terminal nodes:  15 
## Residual mean deviance:  0.1321 = 2724 / 20620 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.86000 -0.22650 -0.01475  0.00000  0.20740  2.03900
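By the same calculation as before (again a supplementary check), the residual mean deviance of 0.1321 corresponds to an in-sample RMS error of about 0.36:

sqrt(mean(residuals(treefit3)^2))  # about 0.36, an improvement on the two-variable tree

We can also map the new tree's predictions, shading by predicted rather than actual price decile: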
# Bin the fitted values by the (logged) price deciles used earlier
cut.predictions = cut(predict(treefit3), log(price.deciles), include.lowest=TRUE)
plot(calif$Longitude, calif$Latitude, col=grey(10:2/11)[cut.predictions], pch=20,
     xlab="Longitude", ylab="Latitude")

Trees use only one feature (input variable) at each split. If multiple features are equally good, which one is chosen is a matter of chance, or of arbitrary programming decisions.
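One way to see this (a supplementary experiment, not from the handout): give the tree an exact duplicate of a feature, so that every split on the original is tied with a split on the copy. Which name shows up in the fitted tree then reflects internal ordering, not predictive merit. Longitude2 and treefit.tie are names invented for this illustration.

calif$Longitude2 <- calif$Longitude   # exact copy: ties Longitude at every split
treefit.tie <- tree(log(MedianHouseValue) ~ Longitude + Longitude2 + Latitude,
                    data=calif)
summary(treefit.tie)  # whether "Longitude" or "Longitude2" is listed as used
                      # is arbitrary, since the two are indistinguishable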