Machine Learning vs Poisson Regression

Random Forest

A regression tree is a machine learning model that makes predictions through a series of decisions, each decision leading either to a value of the response or to another decision.

For example, in hurricane climate studies, if the value of the NAO is less than -1 s.d., then the response might be two hurricanes; if it is greater, the tree asks whether the SOI is greater than 0.5 s.d., and so on.
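In code, a single path through such a tree looks like nested decisions. Here is a minimal sketch; the split values and leaf counts are hypothetical, chosen only to illustrate the structure.

predictH = function(NAO, SOI) {
    if (NAO < -1) {
        2                       # leaf: predicted hurricane count
    } else if (SOI > 0.5) {
        8                       # leaf reached via the second split
    } else {
        5                       # leaf for the remaining cases
    }
}
predictH(NAO = -1.5, SOI = 0)
## [1] 2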

Import the annual hurricane data and subset on years since 1950. Then create a data frame containing the basin-wide hurricane count (H) as the response and SOI and SST as the two predictors.

annual = read.table("http://myweb.fsu.edu/jelsner/data/AnnualData.txt", header = TRUE)
dat = subset(annual, Year >= 1950)
df = data.frame(H = dat$B.1, SOI = dat$soi, SST = dat$sst)
head(df)
##    H     SOI        SST
## 1 11  3.4333 -0.0007373
## 2  8 -3.4667  0.2932627
## 3  6 -0.5667  0.3989293
## 4  6 -3.6000  0.2565960
## 5  8  1.0667  0.0112627
## 6  9  4.2333  0.2522627
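The subset runs from 1950 onward, and the last year corresponds to row 61 (this row index is used below when a year is left out):

nrow(df)
## [1] 61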

Then using the tree() function, type

require(tree)
## Loading required package: tree
tr = tree(H ~ SOI + SST, data = df)
plot(tr)
text(tr)

[Figure: the fitted regression tree, with splits on SOI and SST]
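The same structure can be read as text. Typing the object's name prints each split together with the mean count at the nodes.

tr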

To determine the mean number of hurricanes when SOI is -2 s.d. and SST is 0.2C, you use the predict() method and type

predict(tr, data.frame(SOI = -2, SST = 0.2))
##     1 
## 7.348

The predicted value depends on the tree. The tree depends on what years are used to grow it. For example, regrow the tree by leaving the last year out and make a prediction using the same two predictor values.

tr2 = tree(H ~ SOI + SST, data = df[-61, ])
predict(tr2, newdata = data.frame(SOI = -2, SST = 0.2))
##     1 
## 5.714

The results differ. Which prediction do you rely on? Forecast sensitivity occurs with all statistical models, but it is more acute in models with a large number of parameters. Each branch in a regression tree is a parameter, so with your two predictors the model has five parameters.
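To see how big a fitted tree actually is, summarize it; for a tree grown with the tree package, the summary includes the number of terminal nodes (leaves), and each additional split adds a parameter.

summary(tr)  # reports the number of terminal nodes, among other statistics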

A random forest algorithm sidesteps the question of which tree to trust by averaging predictions over many trees. It draws a bootstrap sample from the set of all years and grows a tree using data only from the sampled years. It then repeats the sampling and grows another tree. Each tree gives a prediction, and the mean over all trees is taken. The function randomForest() in the randomForest package provides a random forest algorithm. For example, type

require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
rf = randomForest(H ~ SOI + SST, data = df)

The algorithm grows 500 trees by default. To make a prediction, type

predict(rf, newdata = data.frame(SOI = -2, SST = 0.2))
##     1 
## 4.831
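Because each tree is grown from a random sample, repeated fits give slightly different predictions. Fixing the seed makes a fit reproducible, and the ntree argument grows more trees, which stabilizes the average. A usage sketch (the seed value is arbitrary):

set.seed(3042)  # arbitrary seed, for reproducibility
rf2 = randomForest(H ~ SOI + SST, data = df, ntree = 2000)
predict(rf2, newdata = data.frame(SOI = -2, SST = 0.2))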

The random forest is less sensitive to the removal of single observations. As before, remove the last observation.

rf = randomForest(H ~ SOI + SST, data = df[-61, ])
predict(rf, newdata = data.frame(SOI = -2, SST = 0.2))
##     1 
## 4.804
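This stability comes from the averaging. Here is a minimal hand-rolled sketch of the resampling idea using the tree package; it is an illustration only, since randomForest() additionally samples a subset of predictors at each split.

set.seed(3042)                              # arbitrary seed
B = 200                                     # number of bootstrap trees
preds = numeric(B)
for (b in 1:B) {
    yrs = sample(nrow(df), replace = TRUE)  # bootstrap sample of years
    fit = tree(H ~ SOI + SST, data = df[yrs, ])
    preds[b] = predict(fit, newdata = data.frame(SOI = -2, SST = 0.2))
}
mean(preds)                                 # bagged prediction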

Regression trees and random forest algorithms tend to overfit your data, especially when you search over a large set of potential predictors. Overfitting results in small in-sample error but large out-of-sample error.
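One quick way to see the gap is to compare the single tree's in-sample error with the forest's out-of-bag error; calling predict() on a randomForest object without a newdata argument returns out-of-bag predictions, which behave like cross-validated ones. A sketch (the exact numbers vary with the random seed):

rfFull = randomForest(H ~ SOI + SST, data = df)
mean((df$H - predict(tr, newdata = df))^2)  # in-sample MSE of the single tree
mean((df$H - predict(rfFull))^2)            # out-of-bag MSE of the forest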

Cross Validation

Cross validation is needed to compare predictive skill. It removes the noise specific to each data point and estimates how well the random forest algorithm finds prediction rules when this coincident information is unavailable.

For example, does the random forest algorithm provide better prediction skill than a Poisson regression? To answer this question you arrange a cross validation as follows.

n = length(df$H)
rfx = numeric(n)
prx = numeric(n)
for (i in 1:n) {
    # leave out year i and fit both models on the remaining years
    rfm = randomForest(H ~ SOI + SST, data = df[-i, ])
    prm = glm(H ~ SOI + SST, data = df[-i, ], family = "poisson")
    # predict the held-out year with each model
    new = df[i, ]
    rfx[i] = predict(rfm, newdata = new)
    prx[i] = predict(prm, newdata = new, type = "response")
}

The out-of-sample mean-squared prediction error is computed by typing

mean((df$H - prx)^2)
## [1] 5.07
mean((df$H - rfx)^2)
## [1] 5.327

Results show that the Poisson regression performs slightly better than the random forest algorithm in this case, although the difference is not statistically significant. The correlation between the actual and predicted values is 0.539 for the Poisson model and 0.505 for the random forest algorithm.
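These correlations come directly from the cross-validated predictions:

cor(df$H, prx)  # Poisson regression
cor(df$H, rfx)  # random forest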

You can see the influence of the two explanatory variables on hurricanes using the following code chunk.

newdat = expand.grid(SST = seq(-0.5, 0.7, 0.01), SOI = seq(-5, 4, 0.1))
z1 = predict(rf, newdata = newdat)
prm = glm(H ~ SOI + SST, data = df, family = "poisson")
z2 = predict(prm, newdata = newdat, type = "response")
newdat$Hrf = z1
newdat$Hpr = z2
require(lattice)
## Loading required package: lattice
require(colorRamps)
## Loading required package: colorRamps
require(grid)
## Loading required package: grid
cr = blue2green(20)
p1 = levelplot(Hrf ~ SST + SOI, data = newdat, at = seq(2, 12, 2), scales = list(tck = 0.5, 
    alternating = 2, cex = 0.7), xlab.top = textGrob("SST [C]", gp = gpar(cex = 0.8)), 
    ylab.right = textGrob("SOI [s.d.]", gp = gpar(cex = 0.8), rot = 90), xlab = "", 
    ylab = "", col.regions = cr, colorkey = list(space = "bottom", cex = 0.7), 
    sub = list("Hurricane count", cex = 0.9, font = 1))
p2 = levelplot(Hpr ~ SST + SOI, data = newdat, at = seq(2, 12, 2), scales = list(tck = 0.5, 
    alternating = 2, cex = 0.7), xlab.top = textGrob("SST [C]", gp = gpar(cex = 0.8)), 
    ylab.right = textGrob("SOI [s.d.]", gp = gpar(cex = 0.8), rot = 90), xlab = "", 
    ylab = "", col.regions = cr, colorkey = list(space = "bottom", cex = 0.7), 
    sub = list("Hurricane count", cex = 0.9, font = 1))
p1 = update(p1, main = textGrob("a", x = unit(0.05, "npc")))
p2 = update(p2, main = textGrob("b", x = unit(0.05, "npc")))
print(p1, split = c(1, 1, 2, 1), more = TRUE)
print(p2, split = c(2, 1, 2, 1), more = FALSE)

[Figure: bivariate influence of SST and SOI on hurricane counts; (a) random forest, (b) Poisson regression]

The figure shows the bivariate influence of SST and SOI on hurricane counts using (a) the random forest algorithm and (b) the Poisson regression. Hurricane counts increase with both SST and SOI, but for high values of SOI the influence of SST is stronger.

Similarly, for high values of SST the influence of the SOI is more pronounced. The random forest is able to capture non-linearities and thresholds, but at the expense of interpreting some noise as signal, as seen in the relatively high count for SOI values near -3 s.d. and SST values near -0.1C.
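For a single-variable view of the same influences, the randomForest package provides the partialPlot() function. A usage sketch, refitting the forest on all years:

rfAll = randomForest(H ~ SOI + SST, data = df)
partialPlot(rfAll, pred.data = df, x.var = "SST")  # partial dependence on SST
partialPlot(rfAll, pred.data = df, x.var = "SOI")  # partial dependence on SOI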