Introduction

In our VI paper, there is an interesting “problem”: on average, the site-level models perform worse than the national-level models, but when the predicted vs. observed values are pooled across all of the site-level models and all of the national-level models, it gives the appearance that the site-level models are better. Whenever I’m faced with a weird problem like this, I find it useful to simulate some data and try to replicate the issue, so that’s what I’ll do here.

Step 1. Create correlated data for three different “sites”

First, I’ll create three datasets, each of which will have a moderately correlated set of x and y variables. Importantly, these three datasets will have different ranges, with site 1’s response variable (y1) being centered at about 0, site 2’s (y2) at 5, and site 3’s (y3) at 10.

# site 1
x1 <- rnorm(1000, mean = 0)
y1 <- x1 + rnorm(1000)

# site 2
x2 <- rnorm(1000, mean = 5)
y2 <- x2 + rnorm(1000)

# site 3
x3 <- rnorm(1000, mean = 10)
y3 <- x3 + rnorm(1000)

# plot them out
par(mfrow = c(1,3), las = 1, mar = c(5,5,1,1))
plot(y1 ~ x1)
plot(y2 ~ x2)
plot(y3 ~ x3)

Step 2. Split into training and test data

I’m going to be building linear regression models for each of these three sites, and comparing predicted to observed values. To do that, I’ll split the data from each site into training and test subsets.

# site 1
x1.train <- x1[1:800]
x1.test <- x1[801:1000]
y1.train <- y1[1:800]
y1.test <- y1[801:1000]

# site 2
x2.train <- x2[1:800]
x2.test <- x2[801:1000]
y2.train <- y2[1:800]
y2.test <- y2[801:1000]

# site 3
x3.train <- x3[1:800]
x3.test <- x3[801:1000]
y3.train <- y3[1:800]
y3.test <- y3[801:1000]

Step 3. Build regression models

Using the training data, I’ll now build a regression model for each of the three sites.

# build regression models
mod.1 <- lm(y1.train ~ x1.train)
mod.2 <- lm(y2.train ~ x2.train)
mod.3 <- lm(y3.train ~ x3.train)

Summary stats for training models:
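
These can be pulled straight from the fitted lm objects; a minimal sketch (the train.mods list is just for convenience here, and the exact values depend on the random draws above):

# coefficients and adjusted R2 for each training fit
train.mods <- list(site1 = mod.1, site2 = mod.2, site3 = mod.3)
t(sapply(train.mods, function(m) {
  c(intercept = unname(coef(m)[1]),
    slope     = unname(coef(m)[2]),
    adj.r2    = summary(m)$adj.r.squared)
}))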

Step 4. Make predictions on test data and assess model performance

I’ll now use these models to make predictions on the test data and compare predicted to observed values to gain a sense of individual, site-level model performance.

# make predictions
pred.1 <- predict(mod.1, newdata = data.frame(x1.train = x1.test))
pred.2 <- predict(mod.2, newdata = data.frame(x2.train = x2.test))
pred.3 <- predict(mod.3, newdata = data.frame(x3.train = x3.test))

# set up plot
par(mfrow = c(1,3), las = 1, mar = c(5,5,1,1))

# plot site 1
plot(pred.1 ~ y1.test, 
     xlim = c(min(c(pred.1, y1.test)),
              max(c(pred.1, y1.test))),
     ylim = c(min(c(pred.1, y1.test)),
              max(c(pred.1, y1.test))))
lines(c(-100,100), c(-100,100), lty = 2)
test.mod.1 <- lm(pred.1 ~ y1.test)
abline(test.mod.1, lwd = 2, col = "red")
r2 <- paste0("r2=", round(summary(test.mod.1)$adj.r.squared, 3))
rrmse <- paste0("rrmse=", round(sqrt(mean((pred.1-y1.test)^2))/sd(y1.test),3))
legend("topleft", legend = c(r2, rrmse), text.col = "red", x.intersp = 0, bty = "n")
     
# plot site 2
plot(pred.2 ~ y2.test, 
     xlim = c(min(c(pred.2, y2.test)),
              max(c(pred.2, y2.test))),
     ylim = c(min(c(pred.2, y2.test)),
              max(c(pred.2, y2.test))))
lines(c(-100,100), c(-100,100), lty = 2)
test.mod.2 <- lm(pred.2 ~ y2.test)
abline(test.mod.2, lwd = 2, col = "red")
r2 <- paste0("r2=", round(summary(test.mod.2)$adj.r.squared, 3))
rrmse <- paste0("rrmse=", round(sqrt(mean((pred.2-y2.test)^2))/sd(y2.test),3))
legend("topleft", legend = c(r2, rrmse), text.col = "red", x.intersp = 0, bty = "n")
     
# plot site 3
plot(pred.3 ~ y3.test, 
     xlim = c(min(c(pred.3, y3.test)),
              max(c(pred.3, y3.test))),
     ylim = c(min(c(pred.3, y3.test)),
              max(c(pred.3, y3.test))))
lines(c(-100,100), c(-100,100), lty = 2)
test.mod.3 <- lm(pred.3 ~ y3.test)
abline(test.mod.3, lwd = 2, col = "red")
r2 <- paste0("r2=", round(summary(test.mod.3)$adj.r.squared, 3))
rrmse <- paste0("rrmse=", round(sqrt(mean((pred.3-y3.test)^2))/sd(y3.test),3))
legend("topleft", legend = c(r2, rrmse), text.col = "red", x.intersp = 0, bty = "n")

Summary stats for test models:

So, the mean R2 among all three models is 0.478.
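
That number is just the average of the adjusted R2 values from the three test regressions above, e.g. (test.r2 is an illustrative name):

# mean adjusted R2 across the three site-level test regressions
test.r2 <- sapply(list(test.mod.1, test.mod.2, test.mod.3),
                  function(m) summary(m)$adj.r.squared)
mean(test.r2)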

Step 5. Aggregate all predicted vs. observed together

Individually, the models are doing OK. But let’s see how they appear to perform when combined.

# combine predicted and observed data together
pred.agg <- c(pred.1, pred.2, pred.3)
y.test.agg <- c(y1.test, y2.test, y3.test)

# plot it out
par(mar = c(5,5,1,1), las = 1)
pt.cols <- c(rep("blue", 200), rep("green", 200), rep("red", 200))
plot(pred.agg ~ y.test.agg, col = pt.cols,
     xlim = c(min(c(pred.agg, y.test.agg)),
              max(c(pred.agg, y.test.agg))),
     ylim = c(min(c(pred.agg, y.test.agg)),
              max(c(pred.agg, y.test.agg))))
lines(c(-100,100),c(-100,100), lty = 2)
abline(test.mod.1, lwd = 2, col = "blue")
abline(test.mod.2, lwd = 2, col = "green")
abline(test.mod.3, lwd = 2, col = "red")
test.mod.agg <- lm(pred.agg ~ y.test.agg)
abline(test.mod.agg, lwd = 2)
r2 <- paste0("r2=", round(summary(test.mod.agg)$adj.r.squared, 3))
rrmse <- paste0("rrmse=", round(sqrt(mean((pred.agg-y.test.agg)^2))/sd(y.test.agg),3))
legend("topleft", legend = c(r2, rrmse), x.intersp = 0, bty = "n")

Conclusion

So, this is a bit of an extreme example, but I think it explains the somewhat unexpected results we’re seeing:

By aggregating the site-level predicted vs. observed results together, the apparent combined model performance is inflated simply due to the differences in value ranges among individual sites.

In fact, even if the three models had zero explanatory power and each one could only estimate the mean of its site’s population (i.e., no significant relationship, R2 = 0, you only have a y-intercept), the combined results would still appear good: once the site-level results are pooled, the between-site differences in the observed VI (x) explain most of the variance in the predicted VI (y).
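
To make that concrete, here is a minimal sketch of that worst case using the simulated data (the null.pred objects are just illustrative names): replace each site’s predictions with its training mean, pool them, and the aggregated regression still looks strong.

# intercept-only models: predict each site's training mean for every test point
null.pred.1 <- rep(mean(y1.train), length(y1.test))
null.pred.2 <- rep(mean(y2.train), length(y2.test))
null.pred.3 <- rep(mean(y3.train), length(y3.test))

# within each site the predictions carry no information about the observations,
# but the pooled predicted vs. observed regression still has a high adjusted R2,
# because the between-site differences in the observed values do all the work
null.pred.agg <- c(null.pred.1, null.pred.2, null.pred.3)
summary(lm(null.pred.agg ~ y.test.agg))$adj.r.squared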

So, this raises the question:

Is this only an artifact of the site-level aggregation or does this also occur with the national-level analysis?

I would argue that the national-level results also suffer from this phenomenon, but in a much more justifiable manner. The national-level results are based on leave-one-site-out cross-validation, which means that for each test site, data from the 23 other sites are used to train the model. Those 23 sites capture a MUCH wider range of VI (and predictor variables) and are not specifically tailored to the landscape conditions of the test site; in fact, omitting the test site from training entirely guarantees that. Given the wide range of VI in the training data, it is entirely possible for predictions to fall outside the range of VI in the test data. At the site level, however, this would not be possible: the predictions are constrained to the range of values within the site, because random forests don’t extrapolate. So, compiling the predicted vs. observed results from a national-level cross-validation like this still gives us a robust sense of how well the model will perform when applied anywhere, even in areas unseen by the training data.
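
As a rough analogue using the simulated data, here is what that leave-one-site-out setup looks like mechanically. This is only a sketch of the procedure (the real analysis uses random forests and 23 training sites per fold; a simple lm on simulated data will not reproduce the range-constraint behavior described above):

# leave-one-site-out analogue: hold out site 1, pool the other sites for training
x.pool <- c(x2.train, x3.train)
y.pool <- c(y2.train, y3.train)
mod.pool <- lm(y.pool ~ x.pool)

# predict the held-out site with the pooled model and compare value ranges
pred.loso.1 <- predict(mod.pool, newdata = data.frame(x.pool = x1.test))
range(pred.loso.1)
range(y1.test)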

That being said(!), I think the best course of action, for the sake of the paper’s simplicity, is to remove the training/validation scatterplots altogether. We still retain the scatterplot from the true test area (Monroe Mountain), which provides a graphical depiction of relative model performance, but we avoid the complexity of having to explain why mean/median performance among sites differs from aggregate performance. The median R2 and RRMSE among models give a robust sense of how well site-level and national-level models perform across a diverse range of landscapes, and give poorer-performing sites (and sites with narrower ranges of VI) a more justifiable influence on the perception of combined model performance, so let’s make those the focus of the training/validation results.