First things first – I’m going to create two plotting functions, one for basic x-y scatterplots, and one for predicted vs. observed scatterplots, to minimize repetitive plotting code throughout this document. Please reserve all questions about my continued use of base R and my resistance against the Tidyverse cult until the end… Of time.

# create a plotting function for basic x-y plots
plot.fun.basic <- function(x, y){
  par(mar = c(5,5,1,1))                 # tighten the top/right margins
  par(las = 1)                          # horizontal axis tick labels
  plot(y ~ x)
  grid()
  mod <- lm(y ~ x)                      # fit the OLS regression line
  abline(mod, lwd = 2, col = "red")
  r2 <- round(summary(mod)$adj.r.squared, 5)   # adjusted R-squared
  r2.txt <- paste0("R2 = ", r2)
  a <- round(coef(mod)[2], 5)           # slope
  b <- round(coef(mod)[1], 5)           # intercept
  if (b < 0){                           # format the equation with the right sign
    eq.txt <- paste0("y = ", a, "x - ", abs(b))
  } else {
    eq.txt <- paste0("y = ", a, "x + ", abs(b))
  }
  all.txt <- paste0(eq.txt, "\n", r2.txt)
  # place the annotation in the top-left corner of the plot region
  text(x = par("usr")[1] + 0.05 * (par("usr")[2]-par("usr")[1]),
       y = par("usr")[3] + 0.95 * (par("usr")[4]-par("usr")[3]),
       labels = all.txt,
       adj = c(0,1),
       col = "red")
}

# create a plotting function for pred vs. obs plots
plot.fun.pred.obs <- function(x, y, x.type = "obs"){
  par(mar = c(5,5,1,1))                 # tighten the top/right margins
  par(las = 1)                          # horizontal axis tick labels
  # force both axes to share the same range, so the 1:1 line is the diagonal
  ax.min <- min(c(x, y))
  ax.max <- max(c(x, y))
  if (x.type == "obs"){                 # label the axes by what's on them
    xlab <- "y.test"
    ylab <- "y.pred"
  } else {
    xlab <- "y.pred"
    ylab <- "y.test"
  }
  plot(y ~ x, xlim = c(ax.min, ax.max), ylim = c(ax.min, ax.max), 
       xlab = xlab, ylab = ylab)
  grid()
  abline(a = 0, b = 1)                  # the 1:1 line
  mod <- lm(y ~ x)                      # the fitted regression line
  abline(mod, lwd = 2, col = "red")
  r2.line <- round(summary(mod)$adj.r.squared, 5)
  r2.line.txt <- paste0("R2line = ", r2.line)
  a <- round(coef(mod)[2], 5)           # slope
  b <- round(coef(mod)[1], 5)           # intercept
  if (b < 0){                           # format the equation with the right sign
    eq.txt <- paste0("y = ", a, "x - ", abs(b))
  } else {
    eq.txt <- paste0("y = ", a, "x + ", abs(b))
  }
  rmse <- round(sqrt(mean((y - x)^2)), 5)
  # R2calc measures fit against the 1:1 line; in either orientation the total
  # sum of squares is taken around the mean of the *observed* values
  if (x.type == "obs"){
    r2.calc <- round(1 - sum((x - y)^2)/sum((x - mean(x))^2), 5)
  } else {
    r2.calc <- round(1 - sum((y - x)^2)/sum((y - mean(y))^2), 5)
  }
  rmse.txt <- paste0("RMSE = ", rmse)
  r2.calc.txt <- paste0("R2calc = ", r2.calc)
  all.txt <- paste0(eq.txt, "\n", r2.line.txt, "\n", r2.calc.txt, "\n", rmse.txt)
  # place the annotation in the top-left corner of the plot region
  text(x = par("usr")[1] + 0.05 * (par("usr")[2]-par("usr")[1]),
       y = par("usr")[3] + 0.95 * (par("usr")[4]-par("usr")[3]),
       labels = all.txt,
       adj = c(0,1),
       col = "red")
}

Next, I’ll simulate some data. x will be a random, normally-distributed variable and will act as the independent variable. y will be linearly correlated with x, with some normally-distributed noise/error thrown in for good measure, and will act as the dependent variable.

# create two linearly-correlated variables with some normally-distributed error
x <- rnorm(1000)
y <- 2 * x + rnorm(1000)

I’ll split the data 50/50 into training and test, build a model with the training data, and plot out the OLS linear regression between x and y.

# split into training and test data
samp <- sample(length(x), 500)
x.train <- x[samp]
x.test <- x[-samp]
y.train <- y[samp]
y.test <- y[-samp]

# build a model with the training data for predicting y as a function of x
mod <- lm(y.train ~ x.train)

# plot the data out
plot.fun.basic(x.train, y.train)

Next, I’ll use that model to make predictions on the test data, and generate two plots: (1) predicted (y) vs. observed (x); and (2) observed (y) vs. predicted (x). In each case, the plot will include the linear OLS regression equation between the two variables (strictly, y as a function of x), in addition to the adjusted R-squared of that regression line.

# make predictions on the test data (the newdata list must use the name
# x.train, since that's the predictor name in the model formula)
y.pred <- predict(mod, list(x.train = x.test))

# open two-panel plot
par(mfrow = c(1,2))

# plot predicted (y) vs. observed (x)
plot.fun.pred.obs(y.test, y.pred, x.type = "obs")

# plot observed (y) vs. predicted (x)
plot.fun.pred.obs(y.pred, y.test, x.type = "pred")

So, in summary:

In isolation, on a single dataset, I’m not sure I have enough information yet to really understand the implications of using these different approaches and metrics. So, for my own understanding, I’m going to repeat this process on four simulated datasets, each progressively noisier, and thus each producing a progressively worse predictive model.

# define the noise levels (the sd of the normally-distributed error term)
noises <- c(1,2,3,4)

# open an eight-panel plot
par(mfrow = c(4,2))

# loop through the noise levels, simulate data, do the training/test split,
# build a model using the training data, make predictions on the test data,
# and assess predictive model performance
for (noise in noises){
  set.seed(999)  # same seed each pass, so the same x and train/test split are reused and only the noise sd changes
  x <- rnorm(1000)
  y <- 2 * x + rnorm(1000, sd = noise)
  samp <- sample(length(x), 500)
  x.train <- x[samp]
  x.test <- x[-samp]
  y.train <- y[samp]
  y.test <- y[-samp]
  mod <- lm(y.train ~ x.train)
  y.pred <- predict(mod, list(x.train = x.test))
  plot.fun.pred.obs(y.test, y.pred, x.type = "obs")
  plot.fun.pred.obs(y.pred, y.test, x.type = "pred")
}
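
Before moving on to observations: squinting at eight panels makes the trends hard to compare, so here’s a minimal sketch that reruns the same simulation and collects the three metrics into one table. This reuses the exact setup from the loop above; res is just a name I made up for the results data frame.

# collect R2line, R2calc, and RMSE for each noise level into a data frame
res <- data.frame(noise = noises, r2.line = NA, r2.calc = NA, rmse = NA)
for (i in seq_along(noises)){
  set.seed(999)
  x <- rnorm(1000)
  y <- 2 * x + rnorm(1000, sd = noises[i])
  samp <- sample(length(x), 500)
  x.train <- x[samp]
  x.test <- x[-samp]
  y.train <- y[samp]
  y.test <- y[-samp]
  mod <- lm(y.train ~ x.train)
  y.pred <- predict(mod, list(x.train = x.test))
  res$r2.line[i] <- summary(lm(y.pred ~ y.test))$adj.r.squared
  res$r2.calc[i] <- 1 - sum((y.test - y.pred)^2)/sum((y.test - mean(y.test))^2)
  res$rmse[i] <- sqrt(mean((y.pred - y.test)^2))
}
round(res, 5)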

No point in digging too deep into the details for each of these models, but here are my general observations (with a quick numeric check after the list):

  1. As noise increases, model performance decreases. This makes sense.
  2. As model performance decreases, the tendency for low values to be overestimated and high values to be underestimated increases. This too makes sense: with increasing noise, the relationship between x and y gets muddier and muddier, and the model gets closer and closer to only being able to estimate the mean.
  3. However, if you use the lm(y.test ~ y.pred) approach, where the observed data (y-axis) are modeled as a function of the predicted data (x-axis), the regression line still gives the (false) appearance that the data have a relatively strong 1:1-ish relationship. That appearance is more or less guaranteed: y.pred is just a linear rescaling of x, so regressing y.test on y.pred recovers a slope near 1 no matter how noisy the underlying relationship is. I don’t think this is helpful or correct, at least visually.
  4. Regardless of which variable is on which axis, the R2 and RMSE calculations remain the same.
  5. The R2calc is always slightly worse than the R2line. In hindsight, this makes sense: the OLS line is, by definition, the line that minimizes the sum of squared residuals, so the fixed 1:1 line that R2calc measures against can never fit better. Either way, I trust Dan’s recommendation to use R2calc instead of R2line.
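
And for anyone who wants to verify observations 2 through 5 numerically rather than visually, here’s a quick sketch using the y.test and y.pred left over from the last (noisiest) iteration above:

# observations 2 and 3: the pred ~ obs slope flattens toward 0 as noise grows,
# while the obs ~ pred slope stays near 1 (y.pred is a linear rescaling of x)
coef(lm(y.pred ~ y.test))[2]
coef(lm(y.test ~ y.pred))[2]

# observation 4: RMSE and R2calc don't care which variable is on which axis
sqrt(mean((y.pred - y.test)^2))
sqrt(mean((y.test - y.pred)^2))
1 - sum((y.test - y.pred)^2)/sum((y.test - mean(y.test))^2)

# observation 4 again: R2line is also direction-invariant, since the R2 of
# lm(a ~ b) and of lm(b ~ a) are both the squared correlation of a and b;
# and per observation 5, both values beat the R2calc computed just above
summary(lm(y.pred ~ y.test))$adj.r.squared
summary(lm(y.test ~ y.pred))$adj.r.squared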