Linear Regression -- Model Selection and Diagnostics

Sameer Mathur

Selecting and Diagnosing a Regression Model Using mtcars

---

PART 1: READING AND DESCRIBING DATA

Motor Trend Car Road Tests

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

Source: the mtcars dataset in R's datasets package

Data Description

  1. mpg Miles/(US) gallon
  2. cyl Number of cylinders
  3. disp Displacement (cu.in.)
  4. hp Gross horsepower
  5. drat Rear axle ratio
  6. wt Weight (1000 lbs)
  7. qsec ¼ mile time
  8. vs Engine (0 = V-shaped, 1 = straight)
  9. am Transmission (0 = automatic, 1 = manual)
  10. gear Number of forward gears
  11. carb Number of carburetors

Importing data

# importing data
data(mtcars)
# data rows and columns
dim(mtcars)
[1] 32 11

First few rows of the mtcars dataset

# first few rows
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Data Types

# data structure
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Data Conversion

# convert 'am' to a factor variable
mtcars$am <- as.factor(mtcars$am)
# convert 'cyl' to a factor variable
mtcars$cyl <- as.factor(mtcars$cyl)

# verify
str(mtcars$am)
 Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
str(mtcars$cyl)
 Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...

Descriptive statistics

# attaching data columns
attach(mtcars)
# descriptive statistics
library(psych)
describe(mtcars)[, c(1:5, 8:9)]
     vars  n   mean     sd median   min    max
mpg     1 32  20.09   6.03  19.20 10.40  33.90
cyl*    2 32   2.09   0.89   2.00  1.00   3.00
disp    3 32 230.72 123.94 196.30 71.10 472.00
hp      4 32 146.69  68.56 123.00 52.00 335.00
drat    5 32   3.60   0.53   3.70  2.76   4.93
wt      6 32   3.22   0.98   3.33  1.51   5.42
qsec    7 32  17.85   1.79  17.71 14.50  22.90
vs      8 32   0.44   0.50   0.00  0.00   1.00
am*     9 32   1.41   0.50   1.00  1.00   2.00
gear   10 32   3.69   0.74   4.00  3.00   5.00
carb   11 32   2.81   1.62   2.00  1.00   8.00
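
The psych package also provides describeBy() for descriptive statistics split by a grouping variable. A minimal sketch (not in the original; the choice of mpg grouped by am is illustrative):

# descriptive statistics for mpg, split by transmission type
describeBy(mtcars$mpg, group = mtcars$am)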

PART 2: MODEL VARIABLE SELECTION

Here we use automatic model selection with step().

The R function step() performs stepwise variable selection, repeatedly adding and/or dropping terms to minimize the AIC.
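
step() also accepts a direction argument ("both", "backward", or "forward") and a penalty k (k = 2 gives AIC; k = log(n) gives BIC). A minimal sketch with illustrative argument values and the hypothetical name backwardBICModel:

# illustrative only: backward elimination with the BIC penalty k = log(n)
backwardBICModel <- step(lm(mpg ~ ., data = mtcars),
                         direction = "backward",
                         k = log(nrow(mtcars)), trace = 0)
summary(backwardBICModel)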

Automatic model selection

Selecting a linear regression model (without interaction terms) using step().

# automatic model selection
stepLinOLSModel <- step(lm(mpg ~ ., data=mtcars),
                  trace=0, steps=10000)
summary(stepLinOLSModel)

Call:
lm(formula = mpg ~ wt + qsec + am, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4811 -1.5555 -0.7257  1.4110  4.6610 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6178     6.9596   1.382 0.177915    
wt           -3.9165     0.7112  -5.507 6.95e-06 ***
qsec          1.2259     0.2887   4.247 0.000216 ***
am1           2.9358     1.4109   2.081 0.046716 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.459 on 28 degrees of freedom
Multiple R-squared:  0.8497,    Adjusted R-squared:  0.8336 
F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

PART 3: REGRESSION DIAGNOSTICS

There are four major linear regression assumptions:

  1. Linearity
  2. Normality of residuals
  3. Homogeneity of variance (homoscedasticity)
  4. Absence of multicollinearity

Diagnostic Plots

# residual plots of OLS model
par(mfrow=c(2,2))
plot(stepLinOLSModel)

[Figure: 2x2 grid of diagnostic plots for stepLinOLSModel]

Assumption 1: Linearity

Linearity

The linearity assumption can be checked by inspecting the Residuals vs Fitted plot (plot on the top-left) from the Diagnostic Plots.

Linearity of the data (Residual vs. Fitted Plot)

# residual vs. fitted plot
plot(stepLinOLSModel, 1)

[Figure: Residuals vs Fitted plot for stepLinOLSModel]

In this plot, the red line is not parallel to the dotted horizontal line. Hence, we cannot assume linearity.

Rectify Linearity

We will use the Box-Cox transformation to rectify linearity.
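
For reference, the Box-Cox family of power transformations for a positive response y is

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log y, & \lambda = 0
\end{cases}
$$

so an estimated lambda of 0, as found below, corresponds to a log transformation.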

Box-Cox Transformation

library(caret)
mpgTrans <- BoxCoxTrans(mtcars$mpg)
mpgTrans
Box-Cox Transformation

32 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 

Largest/Smallest: 3.26 
Sample Skewness: 0.611 

Estimated Lambda: 0 
With fudge factor, Lambda = 0 will be used for transformations
# append the transformed variable to mtcars
mtcars <- cbind(mtcars, mpgNew = predict(mpgTrans, mtcars$mpg))
# first few rows of the dataset
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
                    mpgNew
Mazda RX4         3.044522
Mazda RX4 Wag     3.044522
Datsun 710        3.126761
Hornet 4 Drive    3.063391
Hornet Sportabout 2.928524
Valiant           2.895912
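
Since the estimated lambda is 0, the transformation reduces to the natural log; a quick sanity check (not in the original):

# with lambda = 0, the Box-Cox transform is log(y)
all.equal(mtcars$mpgNew, log(mtcars$mpg))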

Refitting the model after the Box-Cox transformation

Refitting the linear regression model selected earlier by step(), now using the transformed response mpgNew.

# refit the selected model with the transformed response
fitLinOLSTransModel <- lm(mpgNew ~ wt + qsec + am, data = mtcars)
# summary of the model
summary(fitLinOLSTransModel)

Call:
lm(formula = mpgNew ~ wt + qsec + am, data = mtcars)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.13879 -0.08114 -0.03466  0.07030  0.26575 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.69410    0.31326   8.600 2.40e-09 ***
wt          -0.22456    0.03201  -7.015 1.25e-07 ***
qsec         0.05329    0.01299   4.101  0.00032 ***
am1          0.08558    0.06351   1.347  0.18863    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1107 on 28 degrees of freedom
Multiple R-squared:  0.8752,    Adjusted R-squared:  0.8619 
F-statistic: 65.47 on 3 and 28 DF,  p-value: 9.036e-13
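
Note from the output above that am1 is no longer significant at the 5% level (p = 0.189) after the transformation. A hedged sketch of how one might refit without it via update(); whether to drop the term is a judgment call, and fitReducedModel is a hypothetical name:

# illustrative: drop the transmission term and refit
fitReducedModel <- update(fitLinOLSTransModel, . ~ . - am)
summary(fitReducedModel)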

Comparing Residual versus Fitted plots before and after Box-Cox transformation

Before Box-Cox transformation

# residual vs. fitted plot
plot(stepLinOLSModel, 1)

[Figure: Residuals vs Fitted plot, before Box-Cox transformation]

After Box-Cox transformation

# residual vs. fitted plot
plot(fitLinOLSTransModel, 1)

[Figure: Residuals vs Fitted plot, after Box-Cox transformation]

Inference

After Box-Cox transformation, we find that the red line is closer to being flat.

Hence, we are more comfortable assuming linearity.

Assumption 2: Normality

Normality

The Q-Q plot of residuals can be used to visually check the normality assumption. The normal probability plot of residuals should approximately follow a straight line (2nd plot).

Normality of residuals

# normal probability plot of residuals
plot(fitLinOLSTransModel, 2)

[Figure: Normal Q-Q plot of residuals for fitLinOLSTransModel]

In our mtcars model, the points fall approximately along the reference line.

Therefore, we could assume normality.

Next, we confirm our visual inspection using statistical tests.

Assumption 2: Normality

Normality

There are several statistical tests for normality, such as:

  1. Anderson-Darling test

  2. Shapiro-Wilk's test

  3. Kolmogorov-Smirnov (K-S) test

1. Anderson-Darling normality test

The Anderson-Darling test (AD test, for short) is one of the most commonly used normality tests, and can be run using the ad.test() function from the nortest package.

# Anderson-Darling normality test
library(nortest)
ad.test(mtcars$mpgNew)

    Anderson-Darling normality test

data:  mtcars$mpgNew
A = 0.22479, p-value = 0.8055

2. Shapiro-Wilk's normality test

Shapiro-Wilk's method is widely recommended for normality testing and provides better power than the K-S test. It is based on the correlation between the data and the corresponding normal scores.

# Shapiro-Wilk's normality test
shapiro.test(mtcars$mpgNew)

    Shapiro-Wilk normality test

data:  mtcars$mpgNew
W = 0.97668, p-value = 0.699
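
3. Kolmogorov-Smirnov (K-S) test

The K-S test listed earlier is not run in the original; a minimal sketch of how one could apply it. Note that ks.test() assumes fully specified parameters, so the p-value is only approximate when the mean and sd are estimated from the same data:

# K-S test against a normal with estimated mean and sd (approximate p-value)
ks.test(mtcars$mpgNew, "pnorm",
        mean = mean(mtcars$mpgNew), sd = sd(mtcars$mpgNew))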

Statistical Test Inference

Both of the tests run above give p-values > 0.05, implying that the distribution of the data is not significantly different from a normal distribution. In other words, we can assume normality. This supports our visual interpretation.

Note:

Normality tests are sensitive to sample size. Small samples most often pass normality tests. Therefore, it is important to combine visual inspection and significance tests to reach the right decision.

Assumption 3: Heteroskedasticity

Heteroskedasticity

The plots we are interested in are the top-left and bottom-left panels of the diagnostic plots. The top-left panel plots residuals vs fitted values; the bottom-left (scale-location) panel plots the square root of the standardized residuals on the y-axis.

If there is no heteroskedasticity, you should see a completely random, even spread of points across the range of the x-axis and a flat red line.

This assumption can be checked by examining the scale-location plot, also known as the spread-location plot (3rd plot), from the Diagnostic Plots.

Homogeneity of variance

# residual vs. fitted plot
plot(fitLinOLSTransModel, 1)

[Figure: Residuals vs Fitted plot for fitLinOLSTransModel]

# scale-location plot
plot(fitLinOLSTransModel, 3)

[Figure: Scale-Location plot for fitLinOLSTransModel]

We can use statistical tests to check homogeneity of variance.

Statistical Tests

Heteroscedasticity

There are a few tests that come in handy to establish the presence or absence of heteroscedasticity:

  1. The non-constant variance (NCV) test

  2. The Breusch-Pagan test

1. Non-Constant Variance (NCV) test

# non-constant error variance test
library(car)
ncvTest(fitLinOLSTransModel)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 2.820686, Df = 1, p = 0.093057

2. Breusch-Pagan test

library(lmtest)
# Breusch-Pagan test
bptest(fitLinOLSTransModel)

    studentized Breusch-Pagan test

data:  fitLinOLSTransModel
BP = 6.07, df = 3, p-value = 0.1083

Inference

Both statistical tests give p-values > 0.05:

  • We fail to reject the null hypothesis that the residuals have constant variance, i.e., we can assume homoscedasticity.
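
Assumption 4: Multicollinearity

The fourth assumption listed earlier, absence of multicollinearity, is not checked above. A minimal sketch using vif() from the car package (already loaded); by a common rule of thumb, values well above 5-10 indicate problematic collinearity:

# variance inflation factors for the transformed model
vif(fitLinOLSTransModel)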