First things first – I’m going to create two plotting functions, one for basic x-y scatterplots, and one for predicted vs. observed scatterplots, to minimize repetitive plotting code throughout this document. Please reserve all questions about my continued use of base R and my resistance against the Tidyverse cult until the end… Of time.
# create a plotting function for basic x-y plots
plot.fun.basic <- function(x, y){
  par(mar = c(5, 5, 1, 1))
  par(las = 1)
  plot(y ~ x)
  grid()
  # fit and draw the OLS regression line
  mod <- lm(y ~ x)
  abline(mod, lwd = 2, col = "red")
  r2 <- round(summary(mod)$adj.r.squared, 5)
  r2.txt <- paste0("R2 = ", r2)
  # build the equation label, handling the sign of the intercept
  a <- round(coef(mod)[2], 5)  # slope
  b <- round(coef(mod)[1], 5)  # intercept
  if (b < 0){
    eq.txt <- paste0("y = ", a, "x - ", abs(b))
  } else {
    eq.txt <- paste0("y = ", a, "x + ", abs(b))
  }
  all.txt <- paste0(eq.txt, "\n", r2.txt)
  # place the label in the upper-left corner of the plot region
  text(x = par("usr")[1] + 0.05 * (par("usr")[2] - par("usr")[1]),
       y = par("usr")[3] + 0.95 * (par("usr")[4] - par("usr")[3]),
       labels = all.txt,
       adj = c(0, 1),
       col = "red")
}
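Just to show what this produces, here's a quick smoke test on throwaway data (x.tmp and the numbers are arbitrary, purely for illustration):
# quick smoke test of plot.fun.basic: a noisy line with slope ~1
x.tmp <- 1:20
plot.fun.basic(x.tmp, x.tmp + rnorm(20))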
# create a plotting function for pred vs. obs plots
plot.fun.pred.obs <- function(x, y, x.type = "obs"){
  par(mar = c(5, 5, 1, 1))
  par(las = 1)
  # force a square, common scale so the 1:1 line runs corner to corner
  ax.min <- min(c(x, y))
  ax.max <- max(c(x, y))
  if (x.type == "obs"){
    xlab <- "y.test"
    ylab <- "y.pred"
  } else {
    xlab <- "y.pred"
    ylab <- "y.test"
  }
  plot(y ~ x, xlim = c(ax.min, ax.max), ylim = c(ax.min, ax.max),
       xlab = xlab, ylab = ylab)
  grid()
  # 1:1 line (abline is safer than a hard-coded segment if data exceed +/-100)
  abline(a = 0, b = 1)
  # fit and draw the OLS regression line
  mod <- lm(y ~ x)
  abline(mod, lwd = 2, col = "red")
  r2.line <- round(summary(mod)$adj.r.squared, 5)
  r2.line.txt <- paste0("R2line = ", r2.line)
  a <- round(coef(mod)[2], 5)  # slope
  b <- round(coef(mod)[1], 5)  # intercept
  if (b < 0){
    eq.txt <- paste0("y = ", a, "x - ", abs(b))
  } else {
    eq.txt <- paste0("y = ", a, "x + ", abs(b))
  }
  rmse <- round(sqrt(mean((y - x)^2)), 5)
  # R2calc: coefficient of determination about the 1:1 line, with the total
  # sum of squares always taken around the mean of the observed values
  if (x.type == "obs"){
    r2.calc <- round(1 - sum((x - y)^2) / sum((x - mean(x))^2), 5)
  } else {
    r2.calc <- round(1 - sum((y - x)^2) / sum((y - mean(y))^2), 5)
  }
  rmse.txt <- paste0("RMSE = ", rmse)
  r2.calc.txt <- paste0("R2calc = ", r2.calc)
  all.txt <- paste0(eq.txt, "\n", r2.line.txt, "\n", r2.calc.txt, "\n", rmse.txt)
  # place the label in the upper-left corner of the plot region
  text(x = par("usr")[1] + 0.05 * (par("usr")[2] - par("usr")[1]),
       y = par("usr")[3] + 0.95 * (par("usr")[4] - par("usr")[3]),
       labels = all.txt,
       adj = c(0, 1),
       col = "red")
}
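To make the two R² labels concrete: R2line is the adjusted R² of the fitted red regression line, while R2calc compares the points against the 1:1 line, with the total sum of squares taken around the observed mean. A minimal sketch with made-up numbers (obs and pred here are just toy vectors):
# toy contrast of R2line (fit of the regression line) vs. R2calc (agreement
# with the 1:1 line); formulas match the ones used in plot.fun.pred.obs
obs  <- c(1, 2, 3, 4, 5)
pred <- c(1.4, 1.8, 3.5, 3.7, 5.6)
summary(lm(pred ~ obs))$adj.r.squared               # R2line
1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)  # R2calc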
Next, I’ll simulate some data. x will be a random normally-distributed variable and will act as the independent variable. y will be linearly correlated with x, with some normally-distributed noise/error thrown in for good measure, and will act as the dependent variable.
# create two linearly-correlated variables with some normally-distributed error
x <- rnorm(1000)
y <- 2 * x + rnorm(1000)
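As a quick sanity check (my own aside): with a true slope of 2 and unit-variance noise, the population r-squared should be 4/(4 + 1) = 0.8, so the squared sample correlation should land nearby:
# population R2 for this setup is 2^2 / (2^2 + 1) = 0.8
cor(x, y)^2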
I’ll split the data 50/50 into training and test, build a model with the training data, and plot out the OLS linear regression between x and y.
# split into training and test data
samp <- sample(length(x), 500)
x.train <- x[samp]
x.test <- x[-samp]
y.train <- y[samp]
y.test <- y[-samp]
# build a model with the training data for predicting y as a function of x
mod <- lm(y.train ~ x.train)
# plot the data out
plot.fun.basic(x.train, y.train)
Next, I’ll use that model to make predictions on the test data, and generate two plots: (1) predicted (y) vs. observed (x); and (2) observed (y) vs. predicted (x). In each case, the plots will include the linear OLS regression equation between x and y (or, I guess y as a function of x), in addition to the r-squared value of that regression line.
# make predictions on the test data
y.pred <- predict(mod, list(x.train = x.test))  # newdata named to match the model's term
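The list(x.train = x.test) dance is needed because the model was fit on bare vectors, so predict() looks for a variable literally named x.train. A sketch of the pattern I'd otherwise reach for (df.train, mod2, and y.pred2 are names I'm making up here) fits on a data frame so the newdata argument reads more naturally:
# alternative sketch: fit on a data frame so predict() takes tidy newdata
df.train <- data.frame(x = x.train, y = y.train)
mod2 <- lm(y ~ x, data = df.train)
y.pred2 <- predict(mod2, newdata = data.frame(x = x.test))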
# open two-panel plot
par(mfrow = c(1,2))
# plot predicted (y) vs. observed (x)
plot.fun.pred.obs(y.test, y.pred, x.type = "obs")
# plot observed (y) vs. predicted (x)
plot.fun.pred.obs(y.pred, y.test, x.type = "pred")
So, in summary:
- y.pred as a function of y.test (lm(y.pred ~ y.test)): slope = 0.79053
- y.test as a function of y.pred (lm(y.test ~ y.pred)): slope = 0.97253
In isolation, on a single dataset, I’m not sure I have enough information yet to really understand the implications of using these different approaches and metrics. So, for my own understanding, I’m going to now repeat this process on four different simulated datasets, each one progressively noisier, and thus producing worse predictive models.
# define noise vector (the sd in a random normal distribution)
noises <- c(1,2,3,4)
# open an eight-panel plot
par(mfrow = c(4,2))
# loop through the noise levels: simulate data, split into training/test,
# build a model on the training data, make predictions on the test data,
# and assess predictive model performance
for (noise in noises){
  set.seed(999)  # same seed each pass, so only the noise level changes
  x <- rnorm(1000)
  y <- 2 * x + rnorm(1000, sd = noise)
  samp <- sample(length(x), 500)
  x.train <- x[samp]
  x.test <- x[-samp]
  y.train <- y[samp]
  y.test <- y[-samp]
  mod <- lm(y.train ~ x.train)
  y.pred <- predict(mod, list(x.train = x.test))
  plot.fun.pred.obs(y.test, y.pred, x.type = "obs")
  plot.fun.pred.obs(y.pred, y.test, x.type = "pred")
}
No point in digging too deep into the details for each of these models, but here are my general observations:
- In the lm(y.test ~ y.pred) approach, where the observed data (y-axis) are modeled as a function of the predicted data (x-axis), the linear regression still gives the (false) appearance that the data have a relatively strong, roughly 1:1 relationship. I don’t think this is helpful or correct, at least visually.
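There's a quick way to see why that happens (my own aside): for an OLS model, the in-sample predictions satisfy cov(obs, pred) = var(pred), and approximately so on held-out data, which means the slope of lm(y.test ~ y.pred) sits near 1 no matter how noisy the model is, while the slope of lm(y.pred ~ y.test) shrinks toward the model's R². A check using the vectors left over from the last loop iteration:
# slope of lm(y.test ~ y.pred): cov/var(pred), stays near 1 for OLS predictions
cov(y.test, y.pred) / var(y.pred)
# slope of lm(y.pred ~ y.test): cov/var(obs), shrinks with the model's R2
cov(y.test, y.pred) / var(y.test)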