Linear Regression: Residual Analysis

# Clear the workspace
  rm(list = ls()) # Clear all files from your environment
# gc()            # Clear unused memory
  cat("\f")       # Clear the console

  graphics.off()  # Clear all graphs

Linear Regression Specification

# Loads specified data sets, or list the available data sets.
data() 

# Store the data mtcars as a dataframe  
df <- as.data.frame(mtcars)

# Create a scatter plot 
plot(x = mtcars$wt, 
     y = mtcars$mpg,
     xlab = "Weight",         
     ylab = "Miles Per Gallon", 
     main = "Cars Dataset"
     )

# Fit a linear regression model
model <- lm(mpg ~ wt, 
            data = mtcars
            )

summary(model)


Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

# fitted and residual

Residual Analysis

When you run the plot() function on a linear regression model, it produces a four-panel diagnostic plot that includes a histogram of the residuals, a Normal QQ plot of the residuals, a plot of the residuals vs. fitted values, and a plot of the residuals vs. leverage.

# Set up the plotting region / graphical parameters to have four subplots
par(mfrow = c(2, 2))

# Create a residual plot
plot(model)

Plot of Fitted vs Residual

You can also create individual residual plots by specifying the which argument. Here’s an example:

# Create a residual plot of residuals vs. fitted values

plot(model, 
     which = 1
     )

This will produce a plot of the residuals against the fitted values.

In a residual vs. fitted plot after running the lm() command in R, the red line represents the fitted line through the residuals. This line is often called the “zero line” or the “residuals equals zero line”, and represents the line of perfect prediction. The closer the residuals are to the zero line, the better the model fits the data.

If there is a pattern or trend in the residual vs. fitted plot, such as a curve or a funnel shape, it suggests that the model is not adequately capturing the variability in the data. In such cases, you may need to consider -

adding additional predictors to the model,
transforming the response or predictor variables, or
using a different model altogether.

It’s important to examine the residual vs. fitted plot along with other diagnostic plots, such as the QQ plot of the residuals and the scale-location plot, to assess the adequacy of the linear regression model. See next sections.

Creating Manually

Alternatively, you can use the ggplot2 package to create customized residual plots. Here’s an example:

library(ggplot2)

# Create a data frame of fitted values and residuals
df <- data.frame(fitted   = model$fitted.values, 
                 residual = model$residuals
                 )

# Create a residual plot using ggplot2
ggplot(data = df, 
       mapping = aes(x = fitted, 
                     y = residual)
       ) +
  geom_point() +
  geom_smooth(method = "loess") + 
  xlab("Fitted Values") +
  ylab("Residuals") +
  ggtitle("Residual Plot")

`geom_smooth()` using formula = 'y ~ x'

In this example, we first create a data frame of fitted values and residuals from the linear regression model using the $ operator. We then use ggplot() and geom_point() to create a scatterplot of the residuals against the fitted values. We also add a smoothed line to the plot using geom_smooth() with the method = "loess" argument, and add axis labels and a title using xlab(), ylab(), and ggtitle().

Normal QQ plot of the residuals

# Create a QQ plot of the residuals

plot(model, 
     which = 2
     )

When you run the plot() function on a linear regression model with which = 2, it produces a Normal QQ plot of the residuals.

A Q-Q (quantile-quantile) plot is a graphical technique that allows us to compare the distribution of a sample with a theoretical distribution. In a Q-Q plot, the quantiles of the sample data are plotted against the corresponding quantiles of a theoretical distribution, such as the normal distribution. Here are some key takeaways from a Q-Q plot:

Linearity: If the Q-Q plot forms a straight line, then the sample data is approximately normally distributed. If the line deviates from a straight line, then the sample data is not normally distributed¹.
Slope: The slope of the line in the Q-Q plot indicates the relative variance of the sample data compared to the theoretical distribution. If the slope is steeper than the line for the theoretical distribution, then the sample data has a higher variance. If the slope is shallower than the line for the theoretical distribution, then the sample data has a lower variance.
Outliers: Outliers in the sample data may be visible in a Q-Q plot as points that deviate from the straight line. These outliers may indicate a departure from the normal distribution, or they may be due to sampling error or measurement error.
Skewness: If the line in the Q-Q plot is not centered on the diagonal line, then the sample data may be skewed.

Overall, a Q-Q plot can provide valuable insights into the distribution of a sample data set, and can help us to identify departures from a theoretical distribution, such as the normal distribution.

Creating Manually

You can also use the ggplot2 package to create customized QQ plots of the residuals. Here’s an example:

library(ggplot2)

# Create a data frame of the residuals
df <- data.frame(residuals = model$residuals)

# Create a QQ plot using ggplot2
ggplot(data = df, 
       mapping = aes(sample = residuals)
       ) +
  stat_qq() +
  stat_qq_line() +
  xlab("Theoretical Quantiles") +
  ylab("Sample Quantiles") +
  ggtitle("Normal QQ Plot of Residuals")

In this example, we first create a data frame of the residuals from the linear regression model using the $ operator. We then use ggplot() with aes() to specify that we want to create a QQ plot of the residuals, and add a stat_qq() layer to create the QQ plot itself. We also add a stat_qq_line() layer to add a reference line to the plot, and add axis labels and a title using xlab(), ylab(), and ggtitle().

Scale Location

# Create a scale location plot

plot(model, 
     which = 3
     )

A “scale-location” or “spread-location” chart is a graphical tool used to assess the assumptions of homoscedasticity (constant variance of residuals) in a linear regression model. In this type of chart, you typically plot the square root of the absolute standardized residuals against the fitted values of the model.

We standardize the residuals by dividing them by their standard deviation. This helps in comparing residuals with different scales.

Standardized residuals of a linear regression model

Standardized residuals are the residuals (the differences between observed and predicted values) of the model divided by their estimated standard deviation. These standardized residuals are often referred to as “Studentized residuals.”

\[Standardized \ Residual_i = \dfrac{Residual_i}{Estimated \ Standard \ Deviation \ of \ Residuals}\]

When you call rstandard(model), it calculates the standardized residuals for all observations in the model and returns a vector of these standardized residuals. These standardized residuals are useful for assessing the goodness of fit, identifying outliers, and checking the assumptions of the linear regression model.

residuals(model)

          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
         -2.2826106          -0.9197704          -2.0859521           1.2973499 
  Hornet Sportabout             Valiant          Duster 360           Merc 240D 
         -0.2001440          -0.6932545          -3.9053627           4.1637381 
           Merc 230            Merc 280           Merc 280C          Merc 450SE 
          2.3499593           0.2998560          -1.1001440           0.8668731 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
         -0.0502472          -1.8830236           1.1733496           2.1032876 
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
          5.9810744           6.8727113           1.7461954           6.4219792 
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
         -2.6110037          -2.9725862          -3.7268663          -3.4623553 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
          2.4643670           0.3564263           0.1520430           1.2010593 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
         -4.5431513          -2.7809399          -3.2053627          -1.0274952

?rstandard
rstandard(model)

          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
        -0.76616765         -0.30743051         -0.70575249          0.43275114 
  Hornet Sportabout             Valiant          Duster 360           Merc 240D 
        -0.06681879         -0.23148309         -1.30552216          1.38889709 
           Merc 230            Merc 280           Merc 280C          Merc 450SE 
         0.78392687          0.10010803         -0.36728706          0.29288651 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
        -0.01683789         -0.63159969          0.42296071          0.76979873 
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
         2.17353314          2.33490215          0.61035691          2.21708271 
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
        -0.87964013         -0.99313634         -1.24418015         -1.16279098 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
         0.82771968          0.12244407          0.05177187          0.42254270 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
        -1.51549710         -0.93086929         -1.07151943         -0.34388215

require(psych)

Loading required package: psych


Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha

describe(residuals(model))

   vars  n mean sd median trimmed  mad   min  max range skew kurtosis   se
X1    1 32    0  3  -0.13   -0.27 3.05 -4.54 6.87 11.42 0.64     -0.3 0.53

describe(rstandard(model))

   vars  n mean   sd median trimmed  mad   min  max range skew kurtosis   se
X1    1 32 0.01 1.02  -0.04   -0.09 1.03 -1.52 2.33  3.85 0.68    -0.24 0.18

You might use standardized residuals to create diagnostic plots, such as a residuals versus leverage plot or a Q-Q plot, to evaluate the model’s assumptions and identify influential observations or outliers.

To interpret the scale-location plot, look at -

Vertical Spread: Examine the spread of the points along the y-axis (square root of absolute standardized residuals). Ideally, you want the points to be evenly spread out with no discernible funnel shape. If the spread remains roughly consistent as you move along the x-axis, it indicates that the variance of residuals is roughly constant across the range of fitted values.
- Horizontal Spread: Look at the spread of the points along the x-axis (fitted values) as well. If the points are relatively evenly distributed with no clear pattern as the fitted values increase, it suggests that the assumption of constant variance (homoscedasticity) is met.
Issues: If you observe a fan-shaped pattern or a clear trend (increasing or decreasing spread) as the fitted values increase, it suggests heteroscedasticity, which violates the assumption of constant variance.

In summary, a scale-location plot helps you assess whether the assumption of homoscedasticity is met in your linear regression model. If the plot shows no clear pattern and the spread of residuals remains relatively constant as fitted values increase, the assumption is likely met. If you observe a discernible pattern or increasing/decreasing spread, it may indicate a violation of the assumption and the need to consider alternative modeling techniques or transformations.

Creating Manually

This can help ensure you understand what exactly is R plotting above.

# pull out the x and y variable that you want to plot
scale_location_data <- data.frame(
  Fitted_Values = fitted(model),
  Sqrt_Abs_Std_Residuals = sqrt(abs(rstandard(model)))
)


# Create a scale-location (spread-location) plot with ggplot2 package

scale_location_plot <- ggplot(data = scale_location_data, 
                              mapping = aes(x = Fitted_Values, 
                                            y = Sqrt_Abs_Std_Residuals)
                              ) +
  geom_point() +
  geom_smooth(method = "lm", 
              se = FALSE, 
              color = "blue") +  # Add a fitted line
  labs(x = "Fitted Values", 
       y = "Square Root of Absolute Standardized Residuals") +
  ggtitle("Scale-Location (Spread-Location) Plot")

scale_location_plot

`geom_smooth()` using formula = 'y ~ x'

Residual versus leverage chart

# Create a residual versus leverage chart
plot(model, 
     which = 4
     )

# Add Cook's distance values to the plot
abline(h = 0.5, 
       col = "red", 
       lty = 1)  # Add a horizontal line at the threshold

A “residual versus leverage” plot, often referred to as a “residual plot” or “residuals vs. leverage plot,” is a graphical tool used to assess the influence of individual data points on the fit of a linear regression model. This plot helps you identify outliers and influential observations.

Interpret the residuals vs. leverage plot:

Residuals: The vertical axis represents the residuals, which are the differences between the observed values and the predicted values from your linear regression model.
Leverage: The horizontal axis represents the leverage values for each data point. Leverage measures how much influence a data point has on the regression coefficients. High leverage values indicate that a data point has the potential to influence the model significantly.
Identification of Outliers: Look for data points that have large absolute residuals (far from zero) and high leverage values (far from the center). These points are potential outliers and influential observations. Outliers have large residuals, while influential observations have high leverage. SO POINTS IN THE TOP RIGHT AND BOTTOM LEFT in particular.
Influence: The influence of a data point is determined not only by its leverage but also by how much it affects the model parameters (coefficients). Points that are both high in leverage and have large residuals exert a strong influence on the regression model.
Patterns: Examine any patterns in the plot. For example, if you see a data point with a large positive residual and high leverage, it may be an influential observation that is driving the regression line away from the majority of the data.

In summary, a residuals vs. leverage plot helps you identify potential outliers and influential observations in your linear regression model. Points that deviate significantly from the center of the plot (high leverage) and have large residuals can have a substantial impact on the model’s coefficients. Careful consideration of these points may be necessary to assess their impact on the validity of your regression analysis.

Creating Manually

Again, this can help ensure you understand what exactly is R plotting above.

# Extract residuals and leverage values from the model
residuals_vs_leverage_data <- data.frame(
  Leverage = hatvalues(model),
  Residuals = residuals(model)
)


# Create a residual versus leverage plot
residuals_vs_leverage_plot <- ggplot(residuals_vs_leverage_data, aes(x = Leverage, y = Residuals)) +
  geom_point() +
  labs(x = "Leverage", y = "Residuals") +
  ggtitle("Residuals vs. Leverage Plot")

# Show the plot
print(residuals_vs_leverage_plot)

Lets try to bring the chart closer to what R originally generated. Try adding the contour lines if you prefer, or I just print out the names of observations that are potentailly troublesome (Cook’s distance > 0.05)

# Calculate Cook's distance values
cooksd <- cooks.distance(model)

# Create a data frame with residuals, leverage, and Cook's distance
residuals_vs_leverage_data <- data.frame(
  Leverage = hatvalues(model),
  Residuals = residuals(model),
  Cooks_Distance = cooksd
)

# Create a residual versus leverage plot
residuals_vs_leverage_plot <- ggplot(residuals_vs_leverage_data, 
                                     aes(x = Leverage, y = Residuals)) +
  geom_point(aes(size = Cooks_Distance), alpha = 0.7) +  # Size of points indicates Cook's distance
  labs(x = "Leverage", 
       y = "Residuals") +
  ggtitle("Residuals vs. Leverage Plot with Cook's Distance") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  # Add a horizontal reference line at y = 0
  scale_size_continuous(range = c(2, 10), breaks = c(0.05, 0.1, 0.2), labels = c("0.05", "0.1", "0.2")) +  # Adjust point size and labels
  theme_minimal()

# Label observations with Cook's distance > 0.05
label_threshold <- 0.05
residuals_vs_leverage_plot +
  geom_text(data = subset(residuals_vs_leverage_data, Cooks_Distance > label_threshold), 
            aes(label = rownames(subset(residuals_vs_leverage_data, Cooks_Distance > label_threshold))),
            position = position_nudge(y = 0.15), col = "red", size = 3)

Footnotes

When we say that a sample data is not normally distributed, we mean that the distribution of the data does not follow a normal or Gaussian distribution.

In a normal distribution, the data is symmetric around the mean, with the majority of the data clustered around the mean and fewer observations at the tails of the distribution. The shape of the distribution is bell-shaped, with the tails tapering off gradually on either side of the mean.

When the sample data is not normally distributed, it may exhibit a variety of shapes and patterns. For example, the distribution may be skewed, with a longer tail on one side of the distribution than the other. Alternatively, the distribution may be bimodal, with two distinct peaks in the data. Other types of non-normal distributions include uniform distributions, which have equal frequencies across the range of values, and exponential distributions, which have a right-skewed shape.

When we perform statistical analyses on non-normal data, the results may not be reliable or accurate. Therefore, it is important to assess the normality of our data before conducting any statistical tests or modeling. We can use graphical techniques such as histograms, box plots, and Q-Q plots to visualize the distribution of our data and identify any departures from normality.↩︎