1 Introduction

Like almost all statistical models, regression models are developed on the basis of certain pre-specified assumptions. These assumptions describe the statistical characteristics of the population from which the observations are supposed to be generated and, in some cases, characteristics of the observations themselves.

For statistical inference to be valid, we have to check carefully whether the fitted model abides by these assumptions. Violation of any of them raises questions about the validity, reliability, and robustness of the model. Knowledge of these assumptions is therefore very important before developing any model.

The linear regression models also place assumptions on the data set that we should verify before proceeding with any statistical inference.

The following are some important assumptions of regression modelling:

1. The relationship between the response variable and the predictor(s) is linear.
2. The error terms are normally distributed.
3. The error terms have a constant (homogeneous) variance.
4. The predictors are not too highly correlated with one another (no serious collinearity).

Each of these assumptions is examined in the subsections that follow, together with a check for unusually influential observations (outliers).

In the following section we briefly discuss regression diagnostics, a term that refers to the art of checking whether the assumptions of the fitted regression model have been met and, if possible, figuring out how to fix the model when they are violated. Model checking is an “art” with good reason: it is not easy, and although there are many fairly standardised tools for diagnosing, and perhaps even curing, the problems that trouble a model, we still need to exercise a certain amount of judgement when using them. It is easy to get lost in the details of checking this thing or that thing, and exhausting to try to remember what all the different checks are, which leaves many people frustrated when learning the tools. Nevertheless, model checking is highly recommended. In what follows we learn to diagnose linear regression models graphically, using the plot() function, together with the corresponding empirical diagnostic methods (hypothesis tests).

1.1 Assessing the normality of the residuals

The residuals are used to assess whether the error terms in the model are normally distributed. Although a histogram can be used to investigate normality, the normal quantile-quantile (Q-Q) plot is better at visualizing departures from normality: larger deviations of the points from the straight line indicate non-normality. Normal Q-Q plots can be created in R with the functions qqnorm() and qqline(). We shall also discuss some functions beyond the built-in base R ones.

A standard empirical test of normality is the Shapiro-Wilk test, which can be performed in R with the function shapiro.test(). The null and alternative hypotheses of the test are stated as follows:

The Null Hypothesis (H0): The data are normally distributed.

Against,

The Alternative Hypothesis (H1): The data are not normally distributed.

Below we discuss how to assess the normality of the residuals of a fitted linear regression model. For this, let us recall the simple linear regression model of Petal.Length on Sepal.Length, which is stored in the lm object model.1. The next step is to obtain the residuals of the model, which can be done with either resid() or residuals(); we store them in the object Residual.
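If model.1 is no longer in the workspace, it can be refit first. A minimal sketch, assuming (as in the earlier section) a simple regression of Petal.Length on Sepal.Length using the built-in iris data:

model.1 = lm(Petal.Length ~ Sepal.Length, data = iris) # refit the simple regression model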

Residual = resid(model.1) # `residuals(model.1)` gives the same result.

If we want to view the values of residuals, we can use the function head() or tail() for the first few or last few entries respectively.
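For example:

head(Residual) # first six residuals
tail(Residual) # last six residuals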

Now, let us try investigating the normality of these residuals, stored in the object Residual. We can begin this investigation starting from a histogram followed by normal QQ plot and Shapiro Wilk test.

hist(Residual)
**Histogram of the residuals, which appear fairly normally distributed.**

Looking at the histogram (fig ….) of the residuals, we can say that the residuals are fairly normally distributed. Still, a histogram is not a very reliable way to assess normality, because different bin widths can tell quite different stories. We can use it to get a tentative idea about the distribution.
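To see how much this impression depends on the binning, we can draw the same residuals with different bin settings (note that the breaks argument is only a suggestion to hist(), so the exact number of bins may differ); an illustrative sketch:

par(mfrow = c(1, 2))                            # two histograms side by side
hist(Residual, breaks = 5, main = "Fewer bins")
hist(Residual, breaks = 30, main = "More bins")
par(mfrow = c(1, 1))                            # reset the layout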

A more reliable graphical approach to assess normality is the normal Q-Q plot, in which we carefully check how far the points deviate from the straight line. The following lines of code produce a normal Q-Q plot in R.

qqnorm(Residual)
qqline(Residual)
**Normal QQ plot of the residuals**.

The normal Q-Q plot also indicates that the residuals of this model are fairly normally distributed, as almost all the points hover close to the straight line.

Alternatively, this diagnostic plot can be produced from the standardized residuals. A standardized residual is an ordinary residual divided by an estimate of its standard deviation (which takes the leverage of the observation into account), so the standardized residuals behave roughly like z-scores, with mean 0 and standard deviation close to 1. Standardized residuals in R can be obtained with the function rstandard(). The following lines of code should produce essentially the same normal Q-Q plot as in the figure……

stand.residual = rstandard(model.1)
qqnorm(stand.residual)
qqline(stand.residual)
**Normal QQ plot using standardized residuals**.

The normal Q-Q plot can also be created with the built-in R graphics function plot(). When its main argument x is a linear model object, plot() can draw six different diagnostic plots, each of which is quite useful for regression model checking; we select one with the which argument (a number between 1 and 6). The following table lists the plot corresponding to each value of which.

| Argument  | Plot                          |
|-----------|-------------------------------|
| which = 1 | "Residuals vs Fitted"         |
| which = 2 | "Normal Q-Q plot"             |
| which = 3 | "Scale-Location"              |
| which = 4 | "Cook's distance"             |
| which = 5 | "Residuals vs Leverage"       |
| which = 6 | "Cook's distance vs Leverage" |

By default, plot() applied to an lm object produces the first three plots and plot 5. We can draw the normal Q-Q plot with plot() by providing the argument which = 2. Remember that our linear model object is model.1.

plot(model.1, which = 2) 
**Normal QQ plot can also be created by using the base R function `plot()` with the additional argument `which = 2`**.
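The four default plots (numbers 1, 2, 3 and 5 in the table above) can also be viewed together on one page by setting the plotting layout first; a small sketch:

par(mfrow = c(2, 2)) # arrange the plots in a 2 x 2 grid
plot(model.1)        # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1)) # reset the layout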

Finally, we can perform the Shapiro-Wilk test to check whether its result agrees with the histogram and the normal Q-Q plot.

shapiro.test(Residual)

    Shapiro-Wilk normality test

data:  Residual
W = 0.99437, p-value = 0.831

The Shapiro-Wilk test returns a high p-value of 0.831, so we cannot reject the null hypothesis of normality. In other words, there is no evidence against normality of the residuals, and the normality assumption is not in any danger here.

In addition to normality, the model also assumes that the error terms have a common variance. A residual plot can show whether this is the case: when it is, the residuals scatter evenly about a horizontal line. In many data sets, the variance increases for larger values of the predictor.

1.2 Assessing the linearity of the relationship

A scatter plot of the data with the regression line can quickly show whether a linear model seems appropriate. If the general trend is not linear, either a transformation of the variables (such as a log, polynomial, or Box-Cox transformation) or a different model altogether is recommended.

plot(iris$Petal.Length ~ iris$Sepal.Length,
       xlab = "Sepal Length",
       ylab = "Petal Length")
abline(model.1)

To check whether the assumption of a linear relationship between the response variable (\(Y\)) and the predictor variable (\(X\)) is violated, we can plot the fitted values (\(\hat Y\)) against the observed values (\(Y\)) of the response variable. To draw this plot we can use the function fitted.values() to extract the fitted values (\(\hat Y\)).

The following line of commands could be used in this case.

y.hat = fitted.values(model.1)
plot(x = y.hat, y = iris$Petal.Length,
     xlab = "Fitted values",
     ylab = "Observed values")
**Plot of fitted values against the observed values of the response variable (\(Y\)). A straight line is what we hope to see in the plot, meaning that there is no violation of the linearity assumption.**

If this plot looks fairly linear, then we do not have to worry about the linearity assumption. However, if we observe a large departure from linearity, then we have to take it seriously.
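As a small aid (an illustrative addition, not part of the plot above), a dashed reference line with intercept 0 and slope 1 can be added so that departures from linearity are easier to spot:

plot(x = y.hat, y = iris$Petal.Length,
     xlab = "Fitted values",
     ylab = "Observed values")
abline(a = 0, b = 1, lty = "dashed") # reference line: observed = fitted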

When there is more than one predictor variable, a scatter plot is not as useful. A residual plot can also show whether the linear model is appropriate, can be made with more than one predictor (the multiple regression case), and can detect small deviations from the model that may not show up in a scatter plot. In fact, if we wish to get a more detailed picture, it is often more informative to look at the relationship between the fitted values and the residuals, even in a simple regression model. We can plot this with the following line of code.

plot(fitted(model.1), resid(model.1)) # we could also reuse the object `Residual` created earlier.

Alternatively, we can use the following command to draw a scatter plot of fitted values against the residuals.

plot(model.1, which = 1)

As you can see, not only does it draw the scatter plot of the fitted values against the residuals, it also adds a smoothed line through the data showing the relationship between the two. Ideally, this should be a straight, perfectly horizontal line. There is some hint of curvature here, but it is not clear whether we should be concerned. Therefore we need to check whether this curvature really matters in this case.

For this we are going to use a more advanced version of the same plot, produced by the function residualPlots() from the car package. The residualPlots() function also reports the results of a curvature test. For a given predictor variable \(X\) in the regression model, this test is equivalent to adding a new predictor corresponding to \(X^2\) to the model and running a t-test on the \(\beta\) coefficient associated with this new predictor. If it comes up significant, it implies that there is some non-linear relationship between the variable and the residuals. First we need to install the car package on the system.

# install.packages("car") ## uncomment and run if the package is not already installed.
library(car)
Loading required package: carData
residualPlots(model.1)

                  Test stat Pr(>|Test stat|)    
iris$Sepal.Length   -3.3934        0.0008873 ***
Tukey test          -3.3934        0.0006903 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This function not only draws a plot comparing the fitted values to the residuals; in a multiple regression it also does so for each individual predictor.

The curvature test turns out to be significant (p-value < 0.01), hinting that the curvature in the residual plot (figure….) is a real issue in this case. This is a clear indication that the linearity assumption is violated, and the linear regression model we developed may not work well as it stands. We can often resolve this kind of problem by a suitable transformation of the variable. One simple and widely used choice is the Box-Cox transformation, given by:

\[f(x,\lambda) = \frac{x^\lambda - 1}{\lambda}\] for all values of \(\lambda\) except \(\lambda = 0\); when \(\lambda = 0\) we just take the natural logarithm (i.e., \(\ln(x)\)). We can explore suitable values of \(\lambda\) using the boxCox() function in the car package in R.
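A minimal sketch of how this can be used for our model (car is already loaded above; the value of \(\lambda\) near the peak of the plotted curve, and within its confidence interval, is a candidate for the transformation):

boxCox(model.1) # plots the profile log-likelihood over a grid of lambda values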

1.3 Checking the homogeneity of variance

The regression models we have been talking about, and indeed regression models in general, make a homogeneity of variance assumption: the variance of the residuals is assumed to be constant. If homogeneity of variance is violated, the standard error estimates (\(S_{b_i}\)) associated with the regression coefficients (\(\beta_i\)) are no longer entirely reliable, and that makes the t-tests for the coefficients not quite right either.

We can assess this assumption with a scatter plot of the fitted values (\(\hat{Y_i}\)) against the square root of the absolute values of the standardized residuals (\(\sqrt{|\epsilon_i|}\)), also known as a Scale-Location plot. We can create this plot in R with the base R function plot() by setting the argument which = 3.

plot(model.1, which = 3)

This plot is used to diagnose violations of homogeneity of variance. If the variance is really constant, then the line through the middle should be horizontal and flat.

The empirical method to assess the homogeneity of the residual variance is the Non-Constant Variance (NCV) score test, whose null hypothesis is that the error variance is constant. The car package provides the function ncvTest() to carry out this test.

ncvTest(model.1)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 2.830956, Df = 1, p = 0.092463

Since the p-value is > 0.05, we cannot reject the null hypothesis of constant variance, so there is no evidence of a violation of the homogeneity of variance assumption.

1.4 Checking for collinearity

The last regression diagnostic that we discuss in this subsection is the use of Variance Inflation Factors (VIFs), which are useful for determining whether or not the predictors in the regression model are too highly correlated with each other. Obviously, when we talk about several predictors we are dealing with a multiple regression model. There is a variance inflation factor associated with each predictor \(X_j\) in the model, and the formula for the \(j^{th}\) VIF is:

\[ VIF_j = \frac{1}{1- R^2_{(j)}}\]

where \(R^2_{(j)}\) refers to the \(R^2\) value obtained when we run a regression using \(X_j\) as the outcome variable and all the other \(X\) variables as the predictors.

The square root of the variance inflation factor (VIF) indicates how much larger the standard error of a coefficient is compared with what it would be if that variable had \(0\) correlation with the other predictor variables in the model.

If the variance inflation factor of a predictor variable were \(5.27\) (\(\sqrt{5.27} \approx 2.3\)), this would mean that the standard error for the coefficient of that predictor variable is \(2.3\) times larger than if that predictor variable had \(0\) correlation with the other predictor variables.

A VIF value of more than 10 represents a high degree of collinearity among the predictors. In general practice a cut-off point of 5 is also commonly used.

The vif() function from the car package is used to find the VIF values. Since our regression model is a simple linear regression model with a single predictor, this diagnostic is skipped here. We shall discuss it in detail in the context of multiple regression modelling.
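Just to illustrate the call, here is a minimal sketch on a hypothetical multiple regression for the iris data (this model is purely illustrative and is not the model.1 used elsewhere in this section):

model.mult = lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width, data = iris) # illustrative multiple regression
vif(model.mult) # one VIF value per predictor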

1.5 Assessing bad outliers

As we have already mentioned, we need to be cautious when dealing with outliers. An easy way of detecting outliers is the function boxplot(). A more precise method for detecting influential observations is Cook's distance: an observation that is an outlier and also has high leverage will have a large Cook's distance. The Cook's distance for each observation can be obtained with the function cooks.distance().

As a rough guide, a Cook's distance greater than 1 is often considered large, though some authors suggest \(4/N\) (where \(N\) is the number of observations) as an alternative rule of thumb.

cooks.distance(model.1)
##            1            2            3            4            5            6 
## 7.833423e-03 3.852755e-03 1.491927e-03 4.192630e-05 5.814447e-03 8.837308e-03 
##            7            8            9           10           11           12 
## 3.390561e-05 4.436773e-03 1.996234e-03 2.684163e-03 1.193380e-02 5.713400e-04 
##           13           14           15           16           17           18 
## 2.091069e-03 9.344896e-04 2.760975e-02 1.833713e-02 1.549450e-02 7.833423e-03 
##           19           20           21           22           23           24 
## 1.483920e-02 6.311280e-03 8.837308e-03 6.311280e-03 3.026592e-03 3.759830e-03 
##           25           26           27           28           29           30 
## 7.806625e-05 3.245091e-03 3.245091e-03 8.206462e-03 9.824048e-03 1.480528e-05 
##           31           32           33           34           35           36 
## 5.713400e-04 1.193380e-02 8.206462e-03 1.560240e-02 2.684163e-03 9.127765e-03 
##           37           38           39           40           41           42 
## 1.746944e-02 3.852755e-03 9.550403e-04 6.311280e-03 7.378111e-03 2.514455e-05 
##           43           44           45           46           47           48 
## 9.550403e-04 3.245091e-03 1.865492e-03 2.091069e-03 4.953416e-03 3.390561e-05 
##           49           50           51           52           53           54 
## 1.007725e-02 5.814447e-03 1.991176e-02 5.619186e-04 8.173369e-03 4.085003e-03 
##           55           56           57           58           59           60 
## 1.057739e-03 4.700681e-03 5.123093e-05 1.766269e-02 2.658238e-03 1.300830e-02 
##           61           62           63           64           65           66 
## 1.594136e-02 5.109309e-04 1.123430e-05 1.065134e-03 4.225561e-04 8.534336e-03 
##           67           68           69           70           71           72 
## 6.961640e-03 8.030930e-04 3.344176e-05 1.723600e-03 3.954535e-03 2.720332e-04 
##           73           74           75           76           77           78 
## 5.061672e-04 1.065134e-03 1.592945e-03 4.876813e-03 5.797494e-03 1.158641e-03 
##           79           80           81           82           83           84 
## 9.450936e-04 3.242849e-07 2.439287e-03 1.774658e-03 2.227572e-04 5.134487e-03 
##           85           86           87           88           89           90 
## 1.422833e-02 9.450936e-04 3.995522e-03 2.513269e-04 3.079102e-03 4.085003e-03 
##           91           92           93           94           95           96 
## 8.642271e-03 6.562740e-04 4.679423e-04 1.144307e-02 3.903295e-03 2.319761e-03 
##           97           98           99          100          101          102 
## 2.319761e-03 7.793255e-05 3.192508e-03 1.711038e-03 1.142149e-02 9.102707e-03 
##          103          104          105          106          107          108 
## 5.747188e-04 5.804946e-03 4.987634e-03 4.713918e-03 6.555699e-02 5.250143e-04 
##          109          110          111          112          113          114 
## 1.914184e-03 5.536425e-04 1.092993e-04 1.691080e-03 1.379787e-05 1.051805e-02 
##          115          116          117          118          119          120 
## 9.102707e-03 1.691080e-03 2.010326e-03 7.532970e-03 2.772581e-03 4.203773e-03 
##          121          122          123          124          125          126 
## 5.722870e-06 1.240623e-02 7.532970e-03 5.061672e-04 1.157879e-03 1.343556e-03 
##          127          128          129          130          131          132 
## 7.672415e-04 2.178414e-03 4.281488e-03 3.956966e-03 6.513435e-03 4.905146e-02 
##          133          134          135          136          137          138 
## 4.281488e-03 1.431770e-03 9.178273e-03 3.579847e-02 5.804946e-03 3.286687e-03 
##          139          140          141          142          143          144 
## 2.621323e-03 1.252995e-03 5.906766e-04 4.678981e-03 9.102707e-03 1.419191e-03 
##          145          146          147          148          149          150 
## 1.157879e-03 2.129050e-04 9.101354e-04 3.629077e-04 5.116770e-03 6.893276e-03

Cook’s distances for each of the observations can also be visualized with the plot() function; we just need to set the argument which = 4.

plot(model.1, which = 4)
**Cook's distance for each observation.**

An obvious question to ask next is: if we do have large values of Cook’s distance, what should we do? As always, there are no hard and fast rules. Probably the first thing to do is to re-run the regression with that point excluded and see what happens to the model performance and to the regression coefficients. If they really are substantially different, it is time to start digging into the data set and into the notes you no doubt scribbled as you ran your study, and try to figure out why that point is so different. If you become convinced that this one data point is badly distorting your results, you might consider excluding it, but that is less than ideal unless you have a solid explanation for why this particular case is qualitatively different from the others and therefore deserves to be handled separately.
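A minimal sketch of that first step, refitting the model with the most influential points removed (assuming model.1 was fitted as Petal.Length on Sepal.Length from the iris data, and using the \(4/N\) rule of thumb purely for illustration):

cd = cooks.distance(model.1)
influential = which(cd > 4 / length(cd)) # observations flagged by the 4/N rule of thumb
model.1.reduced = lm(Petal.Length ~ Sepal.Length, data = iris[-influential, ]) # refit without them
cbind(full = coef(model.1), reduced = coef(model.1.reduced)) # compare the coefficients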


  1. (i) Residuals are the differences between the observed values of the response variable \(Y_i\) and the fitted values of the response variable \(\hat{Y_i}\) under the model. (ii) The residuals can be viewed as estimates of the realized values of the random error terms \(\epsilon_i\).↩︎