Normality of Residuals

Sameer Mathur

Example: Detect and Rectify Non-normality of Residuals using mtcars dataset

Regression Diagnostics

---

Normality of Residuals

The residual error terms are assumed to be normally distributed with mean zero and constant variance, i.e. \( \epsilon \sim N(0, \sigma^2) \).

If the error terms are not normally distributed, confidence intervals may become too wide or too narrow, and the t- and F-tests on the coefficients become unreliable, especially in small samples. The least-squares estimates of the coefficients can still be computed, but inference based on them is affected. Non-normal residuals often suggest that there are a few unusual data points, which must be studied closely to build a better model.

Normality Plot

mtcars

Motor Trend Car Road Tests

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

Source: mtcars data

Data Description

  1. mpg Miles/(US) gallon
  2. cyl Number of cylinders
  3. disp Displacement (cu.in.)
  4. hp Gross horsepower
  5. drat Rear axle ratio
  6. wt Weight (1000 lbs)
  7. qsec ¼ mile time
  8. vs Engine (0 = V-shaped, 1 = straight)
  9. am Transmission (0 = automatic, 1 = manual)
  10. gear Number of forward gears
  11. carb Number of carburetors

Importing data

# importing data
data(mtcars)
# attaching data columns
attach(mtcars)
# data rows and columns
dim(mtcars)
[1] 32 11

First few rows of the mtcars dataset

# first few rows
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Descriptive statistics

# descriptive statistics
library(psych)
describe(mtcars)[, c(1:5, 8:9)]
     vars  n   mean     sd median   min    max
mpg     1 32  20.09   6.03  19.20 10.40  33.90
cyl     2 32   6.19   1.79   6.00  4.00   8.00
disp    3 32 230.72 123.94 196.30 71.10 472.00
hp      4 32 146.69  68.56 123.00 52.00 335.00
drat    5 32   3.60   0.53   3.70  2.76   4.93
wt      6 32   3.22   0.98   3.33  1.51   5.42
qsec    7 32  17.85   1.79  17.71 14.50  22.90
vs      8 32   0.44   0.50   0.00  0.00   1.00
am      9 32   0.41   0.50   0.00  0.00   1.00
gear   10 32   3.69   0.74   4.00  3.00   5.00
carb   11 32   2.81   1.62   2.00  1.00   8.00

Regression Model

Multiple linear regression

# fitting a multiple linear regression model
fitmtcarsModel <- lm(mpg ~ am + wt + hp + disp + cyl, data = mtcars)
# summary of the fitted model
summary(fitmtcarsModel)

Call:
lm(formula = mpg ~ am + wt + hp + disp + cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5952 -1.5864 -0.7157  1.2821  5.5725 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 38.20280    3.66910  10.412 9.08e-11 ***
am           1.55649    1.44054   1.080  0.28984    
wt          -3.30262    1.13364  -2.913  0.00726 ** 
hp          -0.02796    0.01392  -2.008  0.05510 .  
disp         0.01226    0.01171   1.047  0.30472    
cyl         -1.10638    0.67636  -1.636  0.11393    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.505 on 26 degrees of freedom
Multiple R-squared:  0.8551,    Adjusted R-squared:  0.8273 
F-statistic:  30.7 on 5 and 26 DF,  p-value: 4.029e-10

Normality of residuals

The Q-Q plot (normal probability plot) of the residuals can be used to visually check the normality assumption: the points should approximately follow a straight line. It is the second of the diagnostic plots that plot() produces for a fitted lm model.

# normal probability plot of residuals
plot(fitmtcarsModel, 2)

[Figure: normal Q-Q plot of the residuals of fitmtcarsModel]

In our example, not all of the points fall approximately along the reference line, so we cannot assume normality.

Alternative ways to draw a normality plot

  1. Quantile-Quantile (Q-Q) plot using the qqnorm() and qqline() functions.

  2. Normality plot using the qqPlot() function in the car package.

1. Using the qqnorm() and qqline() functions

# Q-Q plot of the response variable mpg with a reference line
qqnorm(mtcars$mpg)
qqline(mtcars$mpg, col = "red")

[Figure: normal Q-Q plot of mpg with reference line, from qqnorm() and qqline()]
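The two calls above draw the Q-Q plot for the raw response mpg. Since the assumption concerns the residuals, a minimal sketch of the same plot for the residuals of the fitted model follows. It refits the model from this section so that it runs on its own. Using rstandard() to mirror what plot(fitmtcarsModel, 2) displays is a choice made here, and the names stdRes and qqRes are illustrative.

```r
# refit the model from this section so the sketch is self-contained
fitmtcarsModel <- lm(mpg ~ am + wt + hp + disp + cyl, data = mtcars)

# standardized residuals (the quantity shown by plot(fitmtcarsModel, 2))
stdRes <- rstandard(fitmtcarsModel)

# Q-Q plot of the standardized residuals with a reference line;
# qqnorm() invisibly returns the plotted coordinates
qqRes <- qqnorm(stdRes, main = "Normal Q-Q plot of standardized residuals")
qqline(stdRes, col = "red")
```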

2. Using qqPlot() in car package

# normality plot using qqPlot() in car package
library("car")
qqPlot(mtcars$mpg)

[Figure: Q-Q plot of mpg from qqPlot()]

[1] 20 18

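The values 20 and 18 in the output are the row indices of the two most extreme observations, which qqPlot() labels on the plot. They are candidate outliers in mpg and should be inspected more closely.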
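The numbers printed by qqPlot() are row indices of mtcars. A small sketch to look up which cars they refer to, using base R only (the names flagged and flaggedCars are illustrative):

```r
# row indices reported by qqPlot(mtcars$mpg) in the output above
flagged <- c(20, 18)

# car names (row names) and mpg values for those rows
flaggedCars <- mtcars[flagged, "mpg", drop = FALSE]
print(flaggedCars)
#                 mpg
# Toyota Corolla 33.9
# Fiat 128       32.4
```

These are the two cars with the highest fuel economy in the dataset.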

Normality Test

Visual inspection, described in the previous section, is subjective and can be unreliable. A significance test that compares the sample distribution to a normal one can be used to assess whether or not the data show a serious deviation from normality.

There are several normality tests, such as:

  1. Anderson-Darling test

  2. Shapiro-Wilk's test

  3. Kolmogorov-Smirnov (K-S) test

1. Anderson-Darling normality test

The Anderson-Darling test (AD test, for short) is one of the most commonly used normality tests. It can be run with the ad.test() function in the nortest package.

# Anderson-Darling normality test
library(nortest)
ad.test(mtcars$mpg)

    Anderson-Darling normality test

data:  mtcars$mpg
A = 0.57968, p-value = 0.1207

Source: Wikipedia, Anderson-Darling test

2. Shapiro-Wilk's method to test normality

The Shapiro-Wilk test is widely recommended for testing normality and generally has better power than the K-S test. It is based on the correlation between the data and the corresponding normal scores.

# Shapiro-Wilk's normality test
shapiro.test(mtcars$mpg)

    Shapiro-Wilk normality test

data:  mtcars$mpg
W = 0.94756, p-value = 0.1229

From the output, the p-value is greater than 0.05, so the null hypothesis of normality is not rejected: the distribution of mpg is not significantly different from a normal distribution. In other words, we can assume normality for mpg.
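The test above is applied to the response mpg. Because the assumption being checked is about the error terms, the same test can also be run on the residuals of the fitted model. A minimal sketch follows; it refits the model so that it runs on its own, and no output is shown because this call was not part of the original run. ad.test() from nortest can be applied to the same vector in the same way.

```r
# refit the model from this section so the sketch is self-contained
fitmtcarsModel <- lm(mpg ~ am + wt + hp + disp + cyl, data = mtcars)

# raw residuals of the fitted model
res <- residuals(fitmtcarsModel)

# Shapiro-Wilk test on the residuals; a p-value below 0.05 would
# indicate a significant departure from normality
swRes <- shapiro.test(res)
print(swRes)
```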

Note:

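Normality tests are sensitive to sample size. Small samples often pass normality tests because the tests have little power to detect departures from normality when there are few observations. Therefore, it is important to combine visual inspection with a significance test in order to reach the right decision.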
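The effect of sample size can be illustrated with a small simulation. The sketch below is an assumption made for illustration, not part of the original analysis: it draws clearly non-normal (exponential) samples of two sizes and records how often the Shapiro-Wilk test rejects normality at the 5% level. The function name rejectionRate and the choices of 10, 200 and 500 replications are arbitrary.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# share of simulated exponential samples of size n for which the
# Shapiro-Wilk test rejects normality at the 5% level
rejectionRate <- function(n, reps = 500) {
  pvals <- replicate(reps, shapiro.test(rexp(n))$p.value)
  mean(pvals < 0.05)
}

rateSmall <- rejectionRate(10)   # small samples
rateLarge <- rejectionRate(200)  # larger samples
print(c(n10 = rateSmall, n200 = rateLarge))
```

The rejection rate for the small samples should come out well below that for the larger ones, even though both are drawn from the same non-normal distribution.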

Source: Wikipedia, Shapiro-Wilk test
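The Kolmogorov-Smirnov test is listed among the methods but has no example here. A minimal sketch using ks.test() from base R, applied to the model residuals, is given below. Note that estimating the standard deviation from the same data makes the standard K-S p-value only approximate (it tends to be too large); lillie.test() in the nortest package implements the Lilliefors correction for that case. No output is shown because this call was not part of the original run.

```r
# refit the model from this section so the sketch is self-contained
fitmtcarsModel <- lm(mpg ~ am + wt + hp + disp + cyl, data = mtcars)
res <- residuals(fitmtcarsModel)

# one-sample K-S test against a normal distribution with mean 0
# and standard deviation estimated from the residuals
ksRes <- ks.test(res, "pnorm", mean = 0, sd = sd(res))
print(ksRes)
```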