Here we demonstrate the sensitivity of regression trees to small changes in the training dataset, using the Meuse heavy metals dataset.
Libraries used in this exercise: sp for spatial objects,
rpart for regression trees,
rpart.plot for nicer plots of trees.
library(sp)
library(rpart)
library(rpart.plot)
The sample dataset comes with sp; make log10-transformed
copies of the four metal concentrations.
data(meuse)
str(meuse)
## 'data.frame': 155 obs. of 14 variables:
## $ x : num 181072 181025 181165 181298 181307 ...
## $ y : num 333611 333558 333537 333484 333330 ...
## $ cadmium: num 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
## $ copper : num 85 81 68 81 48 61 31 29 37 24 ...
## $ lead : num 299 277 199 116 117 137 132 150 133 80 ...
## $ zinc : num 1022 1141 640 257 269 ...
## $ elev : num 7.91 6.98 7.8 7.66 7.48 ...
## $ dist : num 0.00136 0.01222 0.10303 0.19009 0.27709 ...
## $ om : num 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
## $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## $ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
## $ dist.m : num 50 30 150 270 380 470 240 120 240 420 ...
meuse$logZn <- log10(meuse$zinc)
meuse$logCu <- log10(meuse$copper)
meuse$logPb <- log10(meuse$lead)
meuse$logCd <- log10(meuse$cadmium)
Prediction grid:
data(meuse.grid)
coordinates(meuse.grid) <- ~ x + y
gridded(meuse.grid) <- TRUE
Use all 155 observations to build and prune a regression tree:
m.lzn <- rpart(logZn ~ ffreq + dist + soil + x + y,
data=meuse,
minsplit=2,
cp=0.005)
cp.table <- m.lzn[["cptable"]]
cp.min <- cp.table[which.min(cp.table[,"xerror"]),"CP"]
m.lzn <- prune(m.lzn, cp=cp.min)
rpart.plot(m.lzn)
grid.map <- predict(m.lzn, newdata=meuse.grid)
meuse.grid$pred <- grid.map
print(spplot(meuse.grid, zcol="pred"))
x <- m.lzn$variable.importance
data.frame(variableImportance = 100 * x / sum(x))
## variableImportance
## dist 45.36660
## x 17.65927
## soil 15.02946
## y 11.35591
## ffreq 10.58877
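The importances are easier to compare visually; a quick sketch, using the x vector computed above:
v.imp.pct <- sort(100 * x / sum(x))   # percent of total, sorted ascending
dotchart(v.imp.pct, xlab="Variable importance (%)", pch=16)   # largest at top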
Randomly sample a specified number of the 155 points and use them to build and prune a regression tree; use the remaining points as an out-of-bag evaluation set. The model uses all the variables in the prediction grid: coordinates, relative distance from the river, flooding frequency class, and soil type.
We write this as a function, so we can call it any number of times. It has one argument: the number of points to include in the calibration sample.
It returns information about the tree: the complexity parameter at the minimum cross-validation error, the out-of-bag RMSE, the variable importance, and the prediction over the grid. It also displays the tree and the prediction map.
build.rpart <- function(n.cal) {
meuse.cal.ix <- sample(1:dim(meuse)[1], size=n.cal, replace=FALSE)
meuse.cal <- meuse[meuse.cal.ix,]
meuse.val <- meuse[-meuse.cal.ix,]
m.lzn.cal <- rpart(logZn ~ ffreq + dist + soil + x + y,
data=meuse.cal,
minsplit=2,
cp=0.005)
cp.table <- m.lzn.cal[["cptable"]]
cp.min <- cp.table[which.min(cp.table[,"xerror"]),"CP"]
m.lzn.cal <- prune(m.lzn.cal, cp=cp.min)
rpart.plot(m.lzn.cal)
tmp <- m.lzn.cal$variable.importance
v.imp <- data.frame(variableImportance = 100 * tmp / sum(tmp))
grid.map <- predict(m.lzn.cal, newdata=meuse.grid)
meuse.grid$pred <- grid.map
val <- predict(m.lzn.cal, newdata=meuse.val)
# out-of-bag RMSE: square the residuals, then sum, then average, then square root
rmse <- sqrt(sum((meuse.val$logZn - val)^2)/length(meuse.val$logZn))
print(spplot(meuse.grid, zcol="pred"))
# return a list
return(list(cp = cp.min, rmse = rmse, v.imp = v.imp, pred = grid.map))
}
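A single call shows one realization of the tree and its map; the seed here is arbitrary, set only so the example is reproducible:
set.seed(123)               # arbitrary; a different seed gives a different calibration sample
one.run <- build.rpart(140)
one.run$cp; one.run$rmse    # complexity parameter and out-of-bag RMSE for this run
one.run$v.imp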
Set up an object to collect the fitted parameters, and a field in the prediction grid to accumulate the average prediction:
runs <- 64
fits.rf <- list(cp=vector(mode="numeric",length=runs),
rmse=vector(mode="numeric",length=runs),
v.imp=as.data.frame(matrix(data=0, nrow=runs, ncol=5))
)
names(fits.rf[["v.imp"]]) <- c("ffreq", "dist", "soil","x","y")
meuse.grid$logZn.avg <- 0 # will accumulate predictions and at the end divide by # of runs
Call the function once per run and collect the fitted parameters. The graphics from each run can be saved to a PDF for later display.
for (run in 1:runs) {
tmp <- build.rpart(140)
fits.rf[["cp"]][run] <- tmp[[1]]
fits.rf[["rmse"]][run] <- tmp[[2]]
# assign by name: rpart sorts variable.importance in decreasing order,
# so the column order differs between runs
fits.rf[["v.imp"]][run, rownames(tmp[[3]])] <- tmp[[3]]$variableImportance
# add to average prediction
meuse.grid$logZn.avg <- meuse.grid$logZn.avg + tmp[[4]]
}
Show the distribution of the complexity parameters, RMSE, and variable importance:
summary(fits.rf[["cp"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.005000 0.008515 0.010092 0.014902 0.020104 0.066046
hist(fits.rf[["cp"]], xlab="Complexity parameter", main="")
rug(fits.rf[["cp"]])
summary(fits.rf[["rmse"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001446 0.077043 0.146688 0.158756 0.218311 0.455252
hist(fits.rf[["rmse"]], xlab="Out-of-bag RMSE", main="")
rug(fits.rf[["rmse"]])
summary(fits.rf[["v.imp"]])
## ffreq dist soil x
## Min. :37.28 Min. : 8.834 Min. : 7.003 Min. : 5.977
## 1st Qu.:39.51 1st Qu.:16.665 1st Qu.:13.692 1st Qu.:11.822
## Median :42.97 Median :17.417 Median :15.032 Median :12.944
## Mean :44.22 Mean :17.509 Mean :14.488 Mean :12.700
## 3rd Qu.:45.84 3rd Qu.:18.231 3rd Qu.:15.634 3rd Qu.:14.277
## Max. :77.33 Max. :21.157 Max. :17.221 Max. :15.403
## y
## Min. : 0.4981
## 1st Qu.: 9.7307
## Median :11.7368
## Mean :11.0851
## 3rd Qu.:13.2427
## Max. :14.6200
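The run-to-run spread in importance is easier to judge as side-by-side boxplots; a sketch from the collected fits.rf object:
boxplot(fits.rf[["v.imp"]], ylab="Variable importance (%)",
        xlab="Predictor")   # one box per predictor, across all runs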
Show the average prediction:
meuse.grid$logZn.avg <- meuse.grid$logZn.avg/runs
summary(meuse.grid$logZn.avg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.213 2.231 2.360 2.463 2.683 3.012
print(spplot(meuse.grid, zcol="logZn.avg"))
Still not very smooth, but better than any single tree.
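To see this directly, plot the full-dataset single-tree map next to the averaged map; a sketch, assuming meuse.grid$pred still holds the prediction from the all-155-points tree fitted earlier:
# two panels on a common scale: blocky single tree vs. smoother average
print(spplot(meuse.grid, zcol=c("pred", "logZn.avg")))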
Compare with the variability of linear models:
Function to build a linear model with a subset. Because some predictors are correlated, the model may be over-fitted, so apply backwards stepwise elimination before summarizing the model and computing the RMSE.
The distance predictor will always be retained in the model; return its coefficient to see how much it varies between models. Other predictors may be eliminated.
build.lm <- function(n.cal) {
meuse.cal.ix <- sample(1:dim(meuse)[1], size=n.cal, replace=FALSE)
meuse.cal <- meuse[meuse.cal.ix,]
meuse.val <- meuse[-meuse.cal.ix,]
m.lzn.cal <- lm(logZn ~ ffreq + dist + soil + x + y, data=meuse.cal)
m.lzn.cal <- step(m.lzn.cal, trace=0)
meuse.grid$predictedLog10Zn.cal <- predict(m.lzn.cal, newdata=meuse.grid)
val <- predict(m.lzn.cal, newdata=meuse.val)
# out-of-bag RMSE, computed as for the regression trees
rmse <- sqrt(sum((meuse.val$logZn - val)^2)/length(meuse.val$logZn))
print(spplot(meuse.grid, zcol="predictedLog10Zn.cal"))
return(list(rmse = rmse, coef.dist=coefficients(m.lzn.cal)["dist"]))
}
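As before, a single call shows one realization:
one.lm <- build.lm(140)
one.lm$rmse; one.lm$coef.dist   # out-of-bag RMSE and the fitted distance coefficient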
Set up an object to collect the fitted parameters:
# runs should be consistent with regression trees
# runs <- 64
fits.lm <- list(rmse=vector(mode="numeric",length=runs),
coef.dist=vector(mode="numeric",length=runs))
Call the function once per run and collect the fitted parameters. The graphics can again be saved to a PDF for later display.
for (run in 1:runs) {
tmp <- build.lm(140)
fits.lm[["rmse"]][run] <- tmp[[1]]
fits.lm[["coef.dist"]][run] <- tmp[[2]]
}
These maps are much more similar to each other than the single regression-tree maps.
Show the distribution of the RMSE and the linear model distance coefficient; compare the RMSE to the regression-tree RMSE:
summary(fits.rf[["rmse"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001446 0.077043 0.146688 0.158756 0.218311 0.455252
summary(fits.lm[["rmse"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0002244 0.0568280 0.1314007 0.1557448 0.2334420 0.6308646
hist(fits.lm[["rmse"]], xlab="Out-of-bag RMSE", main="")
rug(fits.lm[["rmse"]])
summary(fits.lm[["coef.dist"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.9371 -0.8667 -0.8300 -0.8342 -0.8050 -0.7067
hist(fits.lm[["coef.dist"]], xlab="Linear model coefficient: relative distance", main="")
rug(fits.lm[["coef.dist"]])
The reduced linear model gives a somewhat lower RMSE on average than the single trees, and the distance coefficient is fairly stable across calibration samples.
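A side-by-side comparison of the two out-of-bag RMSE distributions, as a final sketch:
boxplot(list(tree=fits.rf[["rmse"]], lm=fits.lm[["rmse"]]),
        ylab="Out-of-bag RMSE")   # single regression trees vs. reduced linear models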