This document uses simulated data to show what normally and non-normally distributed residuals can look like. The aim is to show what you might look for in a residuals plot when doing a regression analysis.
First I’m going to simulate some data for a simple univariate analysis. I will randomly generate some \(X\) values. Then I will generate some \(Y\) values as a function of the \(X\) values plus some normally distributed error. The formula I will use to describe the relationship between the \(X\) and \(Y\) values is:
\(Y_i = 0.5 \times X_i + e_i\)
Where \(e_i\) is normally distributed with mean = 0 and sd = 1. I have plotted these below.
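# Load the packages used below: ggplot2 for qplot()/ggplot(), cowplot for plot_grid()
library(ggplot2)
library(cowplot)
# Set seed for reproducibility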
set.seed(250419)
# Generate random x values
x <- rnorm(n = 500,
mean = 5,
sd = 2)
# Generate y values y = 0.5x + e
y <- 0.5*x + rnorm(500,
mean = 0,
sd = 1)
d <- data.frame(x, y)
# Make a scatterplot
g1 <- qplot(x, y,
data = d,
geom = "point") +
labs(x = "X",
y = "Y",
title = "Scatterplot of simulated data")
g1
Now we can fit a linear regression model and plot the regression line (plot a below). We can see from this plot that the points are spread roughly evenly above and below the regression line. However, the slope of the line makes it harder to judge by eye whether the residuals are normally distributed. It is easier to assess this by plotting the residuals themselves: against the fitted values (plot b), as a histogram (plot c), and in a quantile-quantile plot (plot d).
# Function for doing residual plots
residplot <- function(d){
# Fit a linear model
m <- lm(y ~ x, data = d)
# Add residuals to data
d$residual <- m$residuals
# Make a scatterplot
g1 <- qplot(x, y,
data = d,
geom = "point") +
geom_abline(intercept = m$coefficients[1],
slope = m$coefficients[2],
colour = "red") +
labs(x = "X",
y = "Y",
title = "Scatterplot of simulated data")
# Make a scatter plot of residuals against fitted values
g2 <- qplot(m$fitted.values, m$residuals,
geom = "point") +
geom_abline(intercept = 0,
slope = 0,
colour = "red") +
labs(title = "Plot of residuals vs fitted values",
x = "fitted value",
y = "residual")
# Make a histogram of the residuals
g3 <- qplot(m$residuals,
geom = "histogram",
bins = 10) +
labs(title = "Histogram of residuals",
x = "residual")
# Make a quantile-quantile plot
g4 <- ggplot(data = d,
aes(sample = residual)) +
geom_qq() +
geom_qq_line(colour = "red") +
labs(title = "Quantile plot of residuals")
# Plot the plots
g <- plot_grid(g1, g2, g3, g4, ncol = 2, labels = "auto")
# Return the plots
return(g)
}
residplot(d)
These plots suggest that, as expected, the residuals are normally distributed. They’re evenly spread above and below the regression line, and this doesn’t change depending on the fitted value (or the \(X\) value).
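As a rough numerical companion to the plots, the sketch below re-simulates the first data set and checks that the fitted model recovers the values used in the simulation: a slope of 0.5, an intercept of 0, and an error standard deviation of 1. The object names `x_sim`, `d_check` and `m_check` are my own, chosen so that nothing defined above is overwritten.

```r
# Re-simulate the first example (same seed, same parameters as above)
set.seed(250419)
x_sim <- rnorm(n = 500, mean = 5, sd = 2)
d_check <- data.frame(x = x_sim,
                      y = 0.5 * x_sim + rnorm(n = 500, mean = 0, sd = 1))

# Fit the same linear model that residplot() fits
m_check <- lm(y ~ x, data = d_check)

# Estimated intercept and slope (simulated with 0 and 0.5)
coef(m_check)

# Residual standard error (simulated error sd is 1)
sigma(m_check)
```

The estimates will not match the simulation values exactly, but with 500 points they should be close.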
Our points don’t need to be neatly clustered along the x-axis. We can see what happens if we introduce 50 outlying \(X\) values, but keep the same relationship between \(X\) and \(Y\).
# Set seed for reproducibility
set.seed(250419)
# Generate random x values
x <- rnorm(n = 500,
mean = 5,
sd = 2)
# Adding outliers
x2 <- rnorm(n = 50,
mean = 15,
sd = 1.5)
x <- c(x, x2)
# calculate y values
y <- 0.5*x + rnorm(n = 550,
mean = 0,
sd = 1)
d <- data.frame(x, y)
# Re-do the plots
residplot(d)
So now we have a cluster of data points that are way off to one side. But if we do the scatterplot again (plot a above), although those points cluster at one end of the graph, they are still evenly spread above and below the regression line. The residual plot (b) shows the same thing with the points evenly spread above and below the zero line. The histogram of residuals is still roughly normally distributed (c), and the quantile-quantile plot is still nice and straight (d).
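The same point can be checked numerically. The sketch below re-simulates the data with the 50 outlying \(X\) values, labels each point by the cluster it came from, and compares the residual spread in the two clusters. It also prints the average leverage (hat value) per cluster, which is a concept the text above does not use: it measures how far a point's \(X\) value sits from the rest. The `group` column and the object names are hypothetical additions of mine.

```r
# Re-simulate the data with 50 outlying x values (same seed and parameters)
set.seed(250419)
x_main <- rnorm(n = 500, mean = 5, sd = 2)
x_out <- rnorm(n = 50, mean = 15, sd = 1.5)
x_all <- c(x_main, x_out)
d_out <- data.frame(x = x_all,
                    y = 0.5 * x_all + rnorm(n = 550, mean = 0, sd = 1),
                    group = rep(c("main", "outlier"), times = c(500, 50)))

m_out <- lm(y ~ x, data = d_out)

# The slope should still be close to the simulated value of 0.5
coef(m_out)

# Standard deviation of the residuals within each cluster
resid_sd <- tapply(residuals(m_out), d_out$group, sd)
resid_sd

# Average leverage within each cluster
leverage <- tapply(hatvalues(m_out), d_out$group, mean)
leverage
```

If the residuals behave the same way in both clusters, the two standard deviations should be similar even though the outlying cluster has much higher leverage.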
Another thing to note is that our underlying data is not normally distributed anymore, but that this doesn’t matter as much as the fact that our residuals are still normally distributed. We can see from the histograms below of our \(X\) and \(Y\) variables that both are starting to look skewed:
# Make histograms of x and y
g5 <- qplot(x = x,
data = d,
geom = "histogram",
bins = 20) +
ggtitle("Histogram of X")
g6 <- qplot(x = y,
data = d,
geom = "histogram",
bins = 20) +
ggtitle("Histogram of Y")
plot_grid(g5, g6, ncol = 2, labels = "auto")
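To put a rough number on "skewed", the sketch below computes a simple sample skewness for \(X\), \(Y\) and the residuals of the model fitted to the data with outlying \(X\) values. The helper `skew()` is my own illustrative function (third central moment divided by the cubed standard deviation), not something from a package; values near 0 indicate a roughly symmetric distribution.

```r
# Hypothetical helper: a simple moment-based sample skewness
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Re-simulate the data with 50 outlying x values (same seed and parameters)
set.seed(250419)
x_all <- c(rnorm(n = 500, mean = 5, sd = 2),
           rnorm(n = 50, mean = 15, sd = 1.5))
d_skew <- data.frame(x = x_all,
                     y = 0.5 * x_all + rnorm(n = 550, mean = 0, sd = 1))
m_skew <- lm(y ~ x, data = d_skew)

# Skewness of the two variables and of the residuals
skews <- sapply(list(x = d_skew$x,
                     y = d_skew$y,
                     residual = residuals(m_skew)),
                skew)
skews
```

The expectation is clearly positive skewness for \(X\) and \(Y\), and a value much closer to zero for the residuals.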
There are a number of ways the normality of residuals assumption can be violated. This could happen if the relationship between our \(X\) and \(Y\) isn’t linear. Below I’ll illustrate what this does to the residual plots, again using simulated data. Here I will simulate the data so that \(Y\) depends on \(X^2\):
\(Y_i = 0.5 \times X_i^2 + e_i\)
Where \(e_i\) is normally distributed with mean = 0 and sd = 1.
# Set seed
set.seed(250419)
# Simulate the data: y ~ x^2
x <- rnorm(n = 500,
mean = 6,
sd = 1)
y <- 0.5*(x^2) + rnorm(n = 500,
mean = 0,
sd = 1)
d <- data.frame(x, y)
# Re-do the plots
residplot(d)
Here we can see that the points no longer scatter neatly around the zero line in plot b for all fitted values, the histogram of residuals is starting to look skewed, and the quantile-quantile plot isn’t looking as straight.
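One way to confirm that the problem comes from the missing \(X^2\) term is to fit a second model that uses \(X^2\) as the predictor and compare the two. This is a sketch of one possible check rather than part of the analysis above, and the object names are my own. It compares the residual standard errors and measures how strongly each set of residuals still tracks the squared, centred \(X\) values, which is a crude indicator of leftover curvature.

```r
# Re-simulate the quadratic data (same seed and parameters as above)
set.seed(250419)
x_sim <- rnorm(n = 500, mean = 6, sd = 1)
d_quad <- data.frame(x = x_sim,
                     y = 0.5 * x_sim^2 + rnorm(n = 500, mean = 0, sd = 1))

# Misspecified model (straight line) and a model using x^2 as the predictor
m_lin <- lm(y ~ x, data = d_quad)
m_sq <- lm(y ~ I(x^2), data = d_quad)

# Residual standard errors of the two models
c(linear = sigma(m_lin), squared = sigma(m_sq))

# Crude curvature check: correlation of residuals with squared centred x
x_centred_sq <- (d_quad$x - mean(d_quad$x))^2
curv_lin <- cor(residuals(m_lin), x_centred_sq)
curv_sq <- cor(residuals(m_sq), x_centred_sq)
c(linear = curv_lin, squared = curv_sq)
```

If the straight-line model is the problem, its residuals should correlate noticeably with the squared term, while those of the second model should not.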
Another thing to look for in the residual plots is whether the spread of the residuals is constant across all the fitted values. The assumption is violated when the variance of the data is not constant at all values of the predictor variables (heteroscedasticity). We can illustrate this below:
\(Y_i = \log(X_i + e_i)\)
# Set the seed
set.seed(250419)
# Make the data
x <- rnorm(n = 500,
mean = 6,
sd = 1)
y <- log(x + rnorm(n = 500, mean = 0, sd = 1))
d <- data.frame(x, y)
# Re-do the plots
residplot(d)
Here we can see that the points form a funnel or fan shape around the regression line (plot a) and the residuals are fanned around 0 (b). The residual histogram is skewed and the quantile-quantile plot isn’t straight either.
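The fan shape can also be summarised numerically. The sketch below re-simulates the data using the same expression as the code above, splits the fitted values into three equally sized bands, and compares the standard deviation of the residuals in each band. The band labels and object names are my own, and this is only a rough check, not a formal test of constant variance.

```r
# Re-simulate the data (same seed and expression as the code above)
set.seed(250419)
x_sim <- rnorm(n = 500, mean = 6, sd = 1)
# x + error is almost always positive here; a non-positive value would
# give NaN from log() and that row would be dropped by lm()
d_log <- data.frame(x = x_sim,
                    y = log(x_sim + rnorm(n = 500, mean = 0, sd = 1)))
m_log <- lm(y ~ x, data = d_log)

# Split the fitted values into three bands of roughly equal size
fit <- fitted(m_log)
band <- cut(fit,
            breaks = quantile(fit, probs = c(0, 1/3, 2/3, 1)),
            include.lowest = TRUE,
            labels = c("low", "mid", "high"))

# Standard deviation of the residuals within each band
spread <- tapply(residuals(m_log), band, sd)
spread
```

With constant variance the three values would be similar; here the residuals should be more spread out at low fitted values than at high ones.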
This has just been a very quick example of what normally and non-normally distributed residuals can look like. It’s important to note that there are lots of ways that the assumption of normally distributed residuals can be violated.