First things first – I’m going to create two plotting functions, one for basic x-y scatterplots, and one for predicted vs. observed scatterplots, to minimize repetitive plotting code throughout this document. Please reserve all questions about my continued use of base R and my resistance against the Tidyverse cult until the end… Of time.

# create a plotting function for basic x-y plots
plot.fun.basic <- function(x, y){
  par(mar = c(5,5,1,1))                 # tighten the top/right margins
  par(las = 1)                          # horizontal axis tick labels
  plot(y ~ x)
  grid()
  mod <- lm(y ~ x)                      # fit the OLS regression line
  abline(mod, lwd = 2, col = "red")
  r2 <- round(summary(mod)$adj.r.squared, 5)   # adjusted R-squared
  r2.txt <- paste0("R2 = ", r2)
  a <- round(coef(mod)[2], 5)           # slope
  b <- round(coef(mod)[1], 5)           # intercept
  if (b < 0){                           # format the equation with the right sign
    eq.txt <- paste0("y = ", a, "x - ", abs(b))
  } else {
    eq.txt <- paste0("y = ", a, "x + ", abs(b))
  }
  all.txt <- paste0(eq.txt, "\n", r2.txt)
  # place the annotation in the top-left corner of the plot region
  text(x = par("usr")[1] + 0.05 * (par("usr")[2]-par("usr")[1]),
       y = par("usr")[3] + 0.95 * (par("usr")[4]-par("usr")[3]),
       labels = all.txt,
       adj = c(0,1),
       col = "red")
}

# create a plotting function for pred vs. obs plots
plot.fun.pred.obs <- function(x, y, x.type = "obs"){
  par(mar = c(5,5,1,1))                 # tighten the top/right margins
  par(las = 1)                          # horizontal axis tick labels
  # force both axes to share the same range, so the 1:1 line is the diagonal
  ax.min <- min(c(x, y))
  ax.max <- max(c(x, y))
  if (x.type == "obs"){                 # label the axes by what's on them
    xlab <- "y.test"
    ylab <- "y.pred"
  } else {
    xlab <- "y.pred"
    ylab <- "y.test"
  }
  plot(y ~ x, xlim = c(ax.min, ax.max), ylim = c(ax.min, ax.max), 
       xlab = xlab, ylab = ylab)
  grid()
  abline(a = 0, b = 1)                  # the 1:1 line
  mod <- lm(y ~ x)                      # the fitted regression line
  abline(mod, lwd = 2, col = "red")
  r2.line <- round(summary(mod)$adj.r.squared, 5)
  r2.line.txt <- paste0("R2line = ", r2.line)
  a <- round(coef(mod)[2], 5)           # slope
  b <- round(coef(mod)[1], 5)           # intercept
  if (b < 0){                           # format the equation with the right sign
    eq.txt <- paste0("y = ", a, "x - ", abs(b))
  } else {
    eq.txt <- paste0("y = ", a, "x + ", abs(b))
  }
  rmse <- round(sqrt(mean((y - x)^2)), 5)
  # R2calc measures fit against the 1:1 line; in either orientation the total
  # sum of squares is taken around the mean of the *observed* values
  if (x.type == "obs"){
    r2.calc <- round(1 - sum((x - y)^2)/sum((x - mean(x))^2), 5)
  } else {
    r2.calc <- round(1 - sum((y - x)^2)/sum((y - mean(y))^2), 5)
  }
  rmse.txt <- paste0("RMSE = ", rmse)
  r2.calc.txt <- paste0("R2calc = ", r2.calc)
  all.txt <- paste0(eq.txt, "\n", r2.line.txt, "\n", r2.calc.txt, "\n", rmse.txt)
  # place the annotation in the top-left corner of the plot region
  text(x = par("usr")[1] + 0.05 * (par("usr")[2]-par("usr")[1]),
       y = par("usr")[3] + 0.95 * (par("usr")[4]-par("usr")[3]),
       labels = all.txt,
       adj = c(0,1),
       col = "red")
}

Next, I’ll simulate some data. x will be a random, normally-distributed variable and will act as the independent variable. y will be linearly correlated with x, with some normally-distributed noise/error thrown in for good measure, and will act as the dependent variable.

# create two linearly-correlated variables with some normally-distributed error
x <- rnorm(1000)
y <- 2 * x + rnorm(1000)

I’ll split the data 50/50 into training and test, build a model with the training data, and plot out the OLS linear regression between x and y.

# split into training and test data
samp <- sample(length(x), 500)
x.train <- x[samp]
x.test <- x[-samp]
y.train <- y[samp]
y.test <- y[-samp]

# build a model with the training data for predicting y as a function of x
mod <- lm(y.train ~ x.train)

# plot the data out
plot.fun.basic(x.train, y.train)

Next, I’ll use that model to make predictions on the test data, and generate two plots: (1) predicted (y) vs. observed (x); and (2) observed (y) vs. predicted (x). In each case, the plot will include the linear OLS regression equation between the two variables (strictly, y as a function of x), in addition to the adjusted R-squared of that regression line.

# make predictions on the test data (the newdata list must use the name
# x.train, since that's the predictor name in the model formula)
y.pred <- predict(mod, list(x.train = x.test))

# open two-panel plot
par(mfrow = c(1,2))

# plot predicted (y) vs. observed (x)
plot.fun.pred.obs(y.test, y.pred, x.type = "obs")

# plot observed (y) vs. predicted (x)
plot.fun.pred.obs(y.pred, y.test, x.type = "pred")

So, in summary:

In isolation, on a single dataset, I’m not sure I have enough information yet to really understand the implications of using these different approaches and metrics. So, for my own understanding, I’m going to repeat this process on four simulated datasets, each progressively noisier, and thus each producing a progressively worse predictive model.

# define the noise levels (the sd of the normally-distributed error term)
noises <- c(1,2,3,4)

# open an eight-panel plot
par(mfrow = c(4,2))

# loop through the noise levels, simulate data, do the training/test split,
# build a model using the training data, make predictions on the test data,
# and assess predictive model performance
for (noise in noises){
  set.seed(999)  # same seed each pass, so the same x and train/test split are reused and only the noise sd changes
  x <- rnorm(1000)
  y <- 2 * x + rnorm(1000, sd = noise)
  samp <- sample(length(x), 500)
  x.train <- x[samp]
  x.test <- x[-samp]
  y.train <- y[samp]
  y.test <- y[-samp]
  mod <- lm(y.train ~ x.train)
  y.pred <- predict(mod, list(x.train = x.test))
  plot.fun.pred.obs(y.test, y.pred, x.type = "obs")
  plot.fun.pred.obs(y.pred, y.test, x.type = "pred")
}
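
Before moving on to observations: squinting at eight panels makes the trends hard to compare, so here’s a minimal sketch that reruns the same simulation and collects the three metrics into one table. This reuses the exact setup from the loop above; res is just a name I made up for the results data frame.

# collect R2line, R2calc, and RMSE for each noise level into a data frame
res <- data.frame(noise = noises, r2.line = NA, r2.calc = NA, rmse = NA)
for (i in seq_along(noises)){
  set.seed(999)
  x <- rnorm(1000)
  y <- 2 * x + rnorm(1000, sd = noises[i])
  samp <- sample(length(x), 500)
  x.train <- x[samp]
  x.test <- x[-samp]
  y.train <- y[samp]
  y.test <- y[-samp]
  mod <- lm(y.train ~ x.train)
  y.pred <- predict(mod, list(x.train = x.test))
  res$r2.line[i] <- summary(lm(y.pred ~ y.test))$adj.r.squared
  res$r2.calc[i] <- 1 - sum((y.test - y.pred)^2)/sum((y.test - mean(y.test))^2)
  res$rmse[i] <- sqrt(mean((y.pred - y.test)^2))
}
round(res, 5)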

No point in digging too deep into the details for each of these models, but here are my general observations (with a quick numeric check after the list):

  1. As noise increases, model performance decreases. This makes sense.
  2. As model performance decreases, the tendency for low values to be overestimated and high values to be underestimated increases. This too makes sense: with increasing noise, the relationship between x and y gets muddier and muddier, and the model gets closer and closer to only being able to estimate the mean.
  3. However, if you use the lm(y.test ~ y.pred) approach, where the observed data (y-axis) are modeled as a function of the predicted data (x-axis), the regression line still gives the (false) appearance that the data have a relatively strong 1:1-ish relationship. That appearance is more or less guaranteed: y.pred is just a linear rescaling of x, so regressing y.test on y.pred recovers a slope near 1 no matter how noisy the underlying relationship is. I don’t think this is helpful or correct, at least visually.
  4. Regardless of which variable is on which axis, the R2 and RMSE calculations remain the same.
  5. The R2calc is always slightly worse than the R2line. In hindsight, this makes sense: the OLS line is, by definition, the line that minimizes the sum of squared residuals, so the fixed 1:1 line that R2calc measures against can never fit better. Either way, I trust Dan’s recommendation to use R2calc instead of R2line.
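
And for anyone who wants to verify observations 2 through 5 numerically rather than visually, here’s a quick sketch using the y.test and y.pred left over from the last (noisiest) iteration above:

# observations 2 and 3: the pred ~ obs slope flattens toward 0 as noise grows,
# while the obs ~ pred slope stays near 1 (y.pred is a linear rescaling of x)
coef(lm(y.pred ~ y.test))[2]
coef(lm(y.test ~ y.pred))[2]

# observation 4: RMSE and R2calc don't care which variable is on which axis
sqrt(mean((y.pred - y.test)^2))
sqrt(mean((y.test - y.pred)^2))
1 - sum((y.test - y.pred)^2)/sum((y.test - mean(y.test))^2)

# observation 4 again: R2line is also direction-invariant, since the R2 of
# lm(a ~ b) and of lm(b ~ a) are both the squared correlation of a and b;
# and per observation 5, both values beat the R2calc computed just above
summary(lm(y.pred ~ y.test))$adj.r.squared
summary(lm(y.test ~ y.pred))$adj.r.squared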