Linear regression is a simple supervised learning approach used to predict a response variable y from one or more explanatory variables X, taking the form below:
\(y_{i} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{n}x_{n} + \epsilon_{i}\)
Where:
- \(y_{i}\) = the predicted response or dependent variable
- \(\beta_{0}\) = the intercept or constant term
- \(x_{n}\) = the explanatory or independent variables
- \(\beta_{n}\) = the coefficient for explanatory variable \(x_{n}\)
- \(\epsilon_{i}\) = the error term, capturing other factors that influence the response but are not included in the model
The four key assumptions we will consider in this vignette are:
- Linearity of the relationship between y and its explanatory variables
- Independence of variables where explanatory variables are not highly correlated with each other
- Normal distribution of residuals
- Homoscedasticity or equal variance of residuals
Below we will examine tools within R that allow us to visually test these assumptions.
The packages we will use are:
- ggplot2 - to provide advanced plotting capabilities
- corrplot - to visualise the correlation matrix between variables
- car - to provide the example data set that will be used
- magrittr - to provide the pipe operator (%>%) for chaining function calls
library(ggplot2)
library(corrplot)
library(car)
library(magrittr)
We will not go into the detail of building the linear regressions; this is done in the code below. The Prestige data set has been used to create two linear regressions:
- lm1 - predicts the variable prestige from education and income, two variables with which it has already been determined to have a linear relationship
- lm2 - predicts the variable prestige from the non-predictive variable census
lm1 <- lm(prestige ~ education + income, data = Prestige)
lm2 <- lm(prestige ~ census, data = Prestige)
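As a quick optional check (an aside, not one of the four assumption tests), the fitted coefficients and overall fit can be inspected with summary():
summary(lm1) # coefficient estimates, R-squared and residual summary for lm1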
The first test to apply is to determine whether the response variable y has a linear relationship with the explanatory variables x. The simple visual test for this is to assess whether the residuals form an even spread around the horizontal line without distinct patterns.
Plot 1: Residuals to Fitted Line lm1
plot(lm1,1)
Plot 2: Residuals to Fitted Line lm2
plot(lm2,1)
Plot 1 shows what appears to be a linear relationship: the residuals form a mostly horizontal band around the zero line across the fitted values. Plot 2 is less consistent with a linear relationship, with a definite bend indicating that census is either not an appropriate predictor or that more predictors may be required.
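Since ggplot2 is loaded, the same residuals-versus-fitted view can also be built by hand. The sketch below is one possible equivalent of plot(lm1, 1), assuming only the base fitted() and residuals() accessors:
# Residuals vs fitted values for lm1, rebuilt with ggplot2
ggplot(data.frame(fit = fitted(lm1), res = residuals(lm1)), aes(x = fit, y = res)) +
  geom_point() +
  geom_smooth(se = FALSE) + # loess trend line, analogous to the red line in plot(lm1, 1)
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")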
The next test looks at whether the explanatory variables are independent of one another, that is, whether there is collinearity between the variables. We use a correlation matrix, produced by the cor() function below, to assess the correlation between each pair of variables.
cor(Prestige[,c(1,2,4)])
## education income prestige
## education 1.0000000 0.5775802 0.8501769
## income 0.5775802 1.0000000 0.7149057
## prestige 0.8501769 0.7149057 1.0000000
Converting this to a visual representation, we use the corrplot() function, which translates the values in the matrix into a visual representation of the strength of each correlation.
Plot 3: Correlation Matrix of Variables in lm1
corrplot(cor(Prestige[,c(1,2,4)]),method='circle')
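With magrittr loaded, the same call can be written as a pipe that reads left to right:
Prestige[, c(1, 2, 4)] %>% cor() %>% corrplot(method = 'circle')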
In Plot 3 we see, as expected, the high correlation of prestige with the predictor variables education and income used in our linear regression. Comparing education and income with one another, there is still some positive correlation, but only a minor relationship that is not strong enough to suggest the same level of collinearity. If education and income were as strongly related to each other as each is to prestige, the assumption of independence between variables would not hold.
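The car package we have already loaded also offers a numeric counterpart to this visual check: the variance inflation factor. As a rough rule of thumb, VIF values well below 5 suggest collinearity is not a concern:
vif(lm1) # variance inflation factors for education and income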
The next test determines whether the residual error terms meet the assumption of normal distribution. This ensures that there are no other significant relationships explaining the variance that have not been taken into account in the linear regression. To test this assumption we use the plot() function, which provides summary plots of different aspects of the linear regression we have created. The Normal Q-Q plot is used to visually determine whether the standardised residuals are normally distributed.
Plot 4: Normal Q-Q for lm1
plot(lm1,2)
Plot 5: Normal Q-Q for lm2
plot(lm2,2)
Plot 4 shows the residuals for lm1 tending to follow the straight line, so we can determine that for lm1 the residuals appear in line with the assumption of a normal distribution for the error terms. While Plot 5 also broadly follows this line, some substantial deviation suggests that the residuals of lm2 are not as close to a normal distribution.
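As a complementary formal check, base R's Shapiro-Wilk test can be applied to the residuals of each model; a small p-value would indicate a departure from normality:
shapiro.test(residuals(lm1)) # formal normality test on the lm1 residuals
shapiro.test(residuals(lm2)) # and on the lm2 residuals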
The final test to apply is to determine whether the variance of the error terms is the same across all values of the independent variables. Here we are looking for a constant spread of the residuals.
Plot 6: Scale-Location lm1
plot(lm1,3)
Plot 7: Scale-Location lm2
plot(lm2,3)
Plot 6 appears to show relatively equal variance of residuals across the range of fitted values: the residuals are evenly distributed and do not spread or narrow at any point. Plot 7 is less clear-cut, with the residuals also appearing largely equal in variance but with a slight increase towards the right of the plot.
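For a numeric counterpart to the Scale-Location plots, the car package provides ncvTest(), a score test for non-constant error variance; a small p-value would indicate heteroscedasticity:
ncvTest(lm1) # non-constant variance score test for lm1
ncvTest(lm2) # and for lm2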