Linear regression is a simple supervised learning approach used to predict a response variable y from one or more explanatory variables X, taking the form below:
\(y_{i} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{n}x_{n} + \epsilon_{i}\)
Where:
- \(y_{i}\) = the predicted response or dependent variable
- \(\beta_{0}\) = the intercept or constant term
- \(x_{n}\) = the explanatory or independent variables
- \(\beta_{n}\) = the coefficient for explanatory variable \(x_{n}\)
- \(\epsilon_{i}\) = the error term, capturing other factors that influence the response but are not included in the model
The four key assumptions we will consider in this vignette are:
- Linearity of the relationship between y and its explanatory variables
- Independence of variables where explanatory variables are not highly correlated with each other
- Normal distribution of residuals
- Homoscedasticity or equal variance of residuals
Below we will examine tools within R that allow us to visually test these assumptions.
The packages we will use are:
- ggplot2 - to provide advanced plotting capabilities
- corrplot - to visualise the correlation matrix between variables
- car - to provide the example data set that will be used
- magrittr - to provide the pipe operator (%>%) for chaining function calls
library(ggplot2)
library(corrplot)
library(car)
library(magrittr)
We will not go into the detail of building the linear regressions; this is done in the code below. The Prestige data set has been used to create two linear regressions:
- lm1 - predicts the variable prestige from education and income, two variables with which it has already been determined to have a linear relationship
- lm2 - predicts the variable prestige from the non-predictive variable census
lm1 <- lm(prestige ~ education + income, data = Prestige)
lm2 <- lm(prestige ~ census, data = Prestige)
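As a quick optional check (an aside, not one of the four assumption tests), the fitted coefficients and overall fit can be inspected with summary():
summary(lm1) # coefficient estimates, R-squared and residual summary for lm1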
The first test to apply is to determine whether the response variable y has a linear relationship with the explanatory variables x. The simple visual test for this is to assess whether the residuals form an even spread around the horizontal line without distinct patterns.
Plot 1: Residuals to Fitted Line lm1
plot(lm1,1)
Plot 2: Residuals to Fitted Line lm2
plot(lm2,1)
Plot 1 shows what appears to be a linear relationship: the residuals form a mostly horizontal band around the zero line across the fitted values. Plot 2 is less consistent with a linear relationship, with a definite bend indicating that census is either not an appropriate predictor or that more predictors may be required.
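Since ggplot2 is loaded, the same residuals-versus-fitted view can also be built by hand. The sketch below is one possible equivalent of plot(lm1, 1), assuming only the base fitted() and residuals() accessors:
# Residuals vs fitted values for lm1, rebuilt with ggplot2
ggplot(data.frame(fit = fitted(lm1), res = residuals(lm1)), aes(x = fit, y = res)) +
  geom_point() +
  geom_smooth(se = FALSE) + # loess trend line, analogous to the red line in plot(lm1, 1)
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")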
The next test looks at whether the explanatory variables are independent of one another, that is, whether there is collinearity between the variables. We use a correlation matrix, produced by the cor() function below, to assess the correlation between each pair of variables.
cor(Prestige[,c(1,2,4)])
## education income prestige
## education 1.0000000 0.5775802 0.8501769
## income 0.5775802 1.0000000 0.7149057
## prestige 0.8501769 0.7149057 1.0000000
Converting this to a visual representation, we use the corrplot() function, which translates the values in the matrix into a visual representation of the strength of each correlation.
Plot 3: Correlation Matrix of Variables in lm1
corrplot(cor(Prestige[,c(1,2,4)]),method='circle')
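With magrittr loaded, the same call can be written as a pipe that reads left to right:
Prestige[, c(1, 2, 4)] %>% cor() %>% corrplot(method = 'circle')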
In Plot 3 we see, as expected, the high correlation of prestige with the predictor variables education and income used in our linear regression. Comparing education and income with one another, there is still some positive correlation, but only a minor relationship that is not strong enough to suggest the same level of collinearity. If education and income were as strongly related to each other as each is to prestige, the assumption of independence between variables would not hold.
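The car package we have already loaded also offers a numeric counterpart to this visual check: the variance inflation factor. As a rough rule of thumb, VIF values well below 5 suggest collinearity is not a concern:
vif(lm1) # variance inflation factors for education and income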
The next test determines whether the residual error terms meet the assumption of normal distribution. This ensures that there are no other significant relationships explaining the variance that have not been taken into account in the linear regression. To test this assumption we use the plot() function, which provides summary plots of different aspects of the linear regression we have created. The Normal Q-Q plot is used to visually determine whether the standardised residuals are normally distributed.
Plot 4: Normal Q-Q for lm1
plot(lm1,2)
Plot 5: Normal Q-Q for lm2
plot(lm2,2)
Plot 4 shows the residuals for lm1 tending to follow the straight line, so we can determine that for lm1 the residuals appear in line with the assumption of a normal distribution for the error terms. While Plot 5 also broadly follows this line, some substantial deviation suggests that the residuals of lm2 are not as close to a normal distribution.
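As a complementary formal check, base R's Shapiro-Wilk test can be applied to the residuals of each model; a small p-value would indicate a departure from normality:
shapiro.test(residuals(lm1)) # formal normality test on the lm1 residuals
shapiro.test(residuals(lm2)) # and on the lm2 residuals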
The final test to apply is to determine whether the variance of the error terms is the same across all values of the independent variables. Here we are looking for a constant spread of the residuals.
Plot 6: Scale-Location lm1
plot(lm1,3)
Plot 7: Scale-Location lm2
plot(lm2,3)
Plot 6 appears to show relatively equal variance of residuals across the range of fitted values: the residuals are evenly distributed and do not spread or narrow at any point. Plot 7 is less clear-cut, with the residuals also appearing largely equal in variance but with a slight increase towards the right of the plot.
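For a numeric counterpart to the Scale-Location plots, the car package provides ncvTest(), a score test for non-constant error variance; a small p-value would indicate heteroscedasticity:
ncvTest(lm1) # non-constant variance score test for lm1
ncvTest(lm2) # and for lm2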