Example: California Real Estate

The basic idea is very simple. We want to predict a response or class Y from inputs X1, X2, ..., Xp. We do this by growing a binary tree.
This fits a regression tree of the log price on longitude and latitude:
require(tree)
## Loading required package: tree
calif <- read.table(file="https://raw.githubusercontent.com/jbryer/CompStats/master/Data/cadata.dat",head=TRUE)
head(calif)
## MedianHouseValue MedianIncome MedianHouseAge TotalRooms TotalBedrooms
## 1 452600 8.3252 41 880 129
## 2 358500 8.3014 21 7099 1106
## 3 352100 7.2574 52 1467 190
## 4 341300 5.6431 52 1274 235
## 5 342200 3.8462 52 1627 280
## 6 269700 4.0368 52 919 213
## Population Households Latitude Longitude
## 1 322 126 37.88 -122.23
## 2 2401 1138 37.86 -122.22
## 3 496 177 37.85 -122.24
## 4 558 219 37.85 -122.25
## 5 565 259 37.85 -122.25
## 6 413 193 37.85 -122.25
treefit = tree(log(MedianHouseValue) ~ Longitude+Latitude,data=calif)
plot(treefit)
text(treefit,cex=0.75)
price.deciles = quantile(calif$MedianHouseValue,0:10/10)
cut.prices = cut(calif$MedianHouseValue,price.deciles,include.lowest=TRUE)
plot(calif$Longitude,calif$Latitude,col=grey(10:2/11)[cut.prices],pch=20,
xlab="Longitude",ylab="Latitude")
partition.tree(treefit,ordvars=c("Longitude","Latitude"),add=TRUE)
summary(treefit)
##
## Regression tree:
## tree(formula = log(MedianHouseValue) ~ Longitude + Latitude,
## data = calif)
## Number of terminal nodes: 12
## Residual mean deviance: 0.1662 = 3429 / 20630
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.75900 -0.26080 -0.01359 0.00000 0.26310 1.84100
Here “deviance” is just the sum of squared errors, so the residual mean deviance is the mean squared error; taking its square root gives an RMS error of about 0.41. That is higher than the models in the last handout, but not shocking, since we are using only two variables and have only twelve terminal nodes.
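The 0.41 can be checked directly from the fitted values. A quick sketch, assuming the calif data frame and treefit object from above (predict() with no newdata returns the in-sample fitted values; summary() divides the deviance by the residual degrees of freedom rather than n, so the two numbers differ only negligibly):
# RMS error of the two-variable tree on the training data;
# should come out near sqrt(0.1662), i.e. about 0.41
sqrt(mean((log(calif$MedianHouseValue) - predict(treefit))^2))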
treefit3 <- tree(log(MedianHouseValue) ~., data=calif)
plot(treefit3)
text(treefit3,cex=0.75)
summary(treefit3)
##
## Regression tree:
## tree(formula = log(MedianHouseValue) ~ ., data = calif)
## Variables actually used in tree construction:
## [1] "MedianIncome" "Latitude" "Longitude" "MedianHouseAge"
## Number of terminal nodes: 15
## Residual mean deviance: 0.1321 = 2724 / 20620
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.86000 -0.22650 -0.01475 0.00000 0.20740 2.03900
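By the same calculation, the residual mean deviance of 0.1321 corresponds to an RMS error of about 0.36, a noticeable improvement over the two-variable tree. A quick check, again assuming treefit3 from above:
# RMS error of the full-variable tree on the training data;
# should be near sqrt(0.1321), i.e. about 0.36
sqrt(mean((log(calif$MedianHouseValue) - predict(treefit3))^2))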
cut.predictions = cut(predict(treefit3),log(price.deciles),include.lowest=TRUE)
plot(calif$Longitude,calif$Latitude,col=grey(10:2/11)[cut.predictions],pch=20,
xlab="Longitude",ylab="Latitude")
Trees use only one feature (input variable) at each step. If multiple features are equally good, which one is chosen is a matter of chance, or of arbitrary programming decisions.
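This one-variable-at-a-time structure can be seen directly in the fitted object: each row of the tree's frame records the single variable tested at that node (terminal nodes are marked "<leaf>"). A short sketch, assuming treefit3 from above:
# The frame component lists, node by node, which variable is split on
head(treefit3$frame[, c("var", "n", "dev")])
# How often each variable (and "<leaf>") appears
table(treefit3$frame$var)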