Regression Models Peer Assessment

EXECUTIVE SUMMARY

Let’s imagine we work for ‘Motor Trend’, a magazine about the automobile industry. Looking at a data set of a collection of cars, we are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). We are particularly interested in the following two questions:

“Is an automatic or manual transmission better for MPG”
“Quantifying how different is the MPG between automatic and manual transmissions?”

Data exploration

Let’s start with some exploratory works, to learn more about the dataset

## load the data
data(mtcars)
## look at the variables and their formats
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

## Investigate the data descriptions as headers are not relally explicit
?mtcars

As the dataset comes with good descriptions, we can have a better understandings of each variable in the dataset.

    [, 1]   mpg Miles/(US) gallon
    [, 2]   cyl Number of cylinders
    [, 3]   disp    Displacement (cu.in.)
    [, 4]   hp  Gross horsepower
    [, 5]   drat    Rear axle ratio
    [, 6]   wt  Weight (1000 lbs)
    [, 7]   qsec    1/4 mile time
    [, 8]   vs  V/S
    [, 9]   am  Transmission (0 = automatic, 1 = manual)
    [,10]   gear    Number of forward gears
    [,11]   carb    Number of carburetors

Now we can run a quick ‘pairs’ chart to have an idea of which variable might be correlated with ‘mpg’

## Quick linear regression of mpg versus all variables
pairs(mpg ~ ., data = mtcars)

From this chart, we see that most variables are more or less correlated with ‘mpg’ and we will need to consider this in our regression analysis, as variance might be inflated. We also directly see that ‘cyl’, ‘vs’, ‘am’, ‘gear’ and ‘carb’ variables are not continuous and should therefore be factorized. Then, we can run a summary.

mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
mtcars$am <- factor(mtcars$am,labels=c('Auto','Manual')) ## Let'use 'Auto' and 'Manual' instead of 0 and 1
summary(mtcars)

##       mpg        cyl         disp             hp             drat      
##  Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
##  Median :19.20   8:14   Median :196.3   Median :123.0   Median :3.695  
##  Mean   :20.09          Mean   :230.7   Mean   :146.7   Mean   :3.597  
##  3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :33.90          Max.   :472.0   Max.   :335.0   Max.   :4.930  
##        wt             qsec       vs          am     gear   carb  
##  Min.   :1.513   Min.   :14.50   0:18   Auto  :19   3:15   1: 7  
##  1st Qu.:2.581   1st Qu.:16.89   1:14   Manual:13   4:12   2:10  
##  Median :3.325   Median :17.71                      5: 5   3: 3  
##  Mean   :3.217   Mean   :17.85                             4:10  
##  3rd Qu.:3.610   3rd Qu.:18.90                             6: 1  
##  Max.   :5.424   Max.   :22.90                             8: 1

Hypothesis test

To see if gasoline consumption is different for automatic and manual transmissions, we can make a T-test, the null hypothesis being that there is no difference between the two different models.

testResult <- t.test(mtcars$mpg~mtcars$am)
PValue <- round(testResult$p.value,4)
alpha <- 0.05 ## let set alpha to 5% confidence to reject or not the null hypothesis
if(PValue<alpha){
        print(paste("Our ",PValue," test P-value is smaller than our 0.05 alpha value, so we successfully rejected the null hypothesis"))
}else{
        print(paste("Our ",PValue," test P-value is greater than our 0.05 alpha value, so we failed to reject the null hypothesis"))
}

## [1] "Our  0.0014  test P-value is smaller than our 0.05 alpha value, so we successfully rejected the null hypothesis"

Our T-test finally suggests that manual and automatic transmissions show differences in gasoline consumption. This can be seen on a boxplot graph, as below:

boxplot(mpg ~ am, data = mtcars, main = "Gasoline Consumption, by Transmission Type", 
    xlab = "Transmission ", ylab = "Gasoline Consumption (mpg)")

The boxplot shows that gasoline consumption is higher for manual than for automatic transmissions. Let’s extract the boxplot values:

AutoFilter <- mtcars[mtcars$am=="Auto",c("am","mpg")]
summary(AutoFilter)

##       am          mpg       
##  Auto  :19   Min.   :10.40  
##  Manual: 0   1st Qu.:14.95  
##              Median :17.30  
##              Mean   :17.15  
##              3rd Qu.:19.20  
##              Max.   :24.40

ManualFilter <- mtcars[mtcars$am=="Manual",c("am","mpg")]
summary(ManualFilter)

##       am          mpg       
##  Auto  : 0   Min.   :15.00  
##  Manual:13   1st Qu.:21.00  
##              Median :22.80  
##              Mean   :24.39  
##              3rd Qu.:30.40  
##              Max.   :33.90

Regression Models

Let’s investigate more with regression models. As we saw that most variables are strongly correlated to ‘mpg’, we might use the ‘step’ function to find the optimal regression model with optimal variables, following AIC algorithm rules.

modelFull <- lm(mpg ~ ., data = mtcars)
modelBestFit<- step(modelFull, direction = "both")

## Start:  AIC=76.4
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  5   13.5989 134.00 69.828
## - gear  2    3.9729 124.38 73.442
## - am    1    1.1420 121.55 74.705
## - qsec  1    1.2413 121.64 74.732
## - drat  1    1.8208 122.22 74.884
## - cyl   2   10.9314 131.33 75.184
## - vs    1    3.6299 124.03 75.354
## <none>              120.40 76.403
## - disp  1    9.9672 130.37 76.948
## - wt    1   25.5541 145.96 80.562
## - hp    1   25.6715 146.07 80.588
## 
## Step:  AIC=69.83
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  2    5.0215 139.02 67.005
## - disp  1    0.9934 135.00 68.064
## - drat  1    1.1854 135.19 68.110
## - vs    1    3.6763 137.68 68.694
## - cyl   2   12.5642 146.57 68.696
## - qsec  1    5.2634 139.26 69.061
## <none>              134.00 69.828
## - am    1   11.9255 145.93 70.556
## - wt    1   19.7963 153.80 72.237
## - hp    1   22.7935 156.79 72.855
## + carb  5   13.5989 120.40 76.403
## 
## Step:  AIC=67
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1    0.9672 139.99 65.227
## - cyl   2   10.4247 149.45 65.319
## - disp  1    1.5483 140.57 65.359
## - vs    1    2.1829 141.21 65.503
## - qsec  1    3.6324 142.66 65.830
## <none>              139.02 67.005
## - am    1   16.5665 155.59 68.608
## - hp    1   18.1768 157.20 68.937
## + gear  2    5.0215 134.00 69.828
## - wt    1   31.1896 170.21 71.482
## + carb  5   14.6475 124.38 73.442
## 
## Step:  AIC=65.23
## mpg ~ cyl + disp + hp + wt + qsec + vs + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1    1.2474 141.24 63.511
## - vs    1    2.3403 142.33 63.757
## - cyl   2   12.3267 152.32 63.927
## - qsec  1    3.1000 143.09 63.928
## <none>              139.99 65.227
## + drat  1    0.9672 139.02 67.005
## - hp    1   17.7382 157.73 67.044
## - am    1   19.4660 159.46 67.393
## + gear  2    4.8033 135.19 68.110
## - wt    1   30.7151 170.71 69.574
## + carb  5   13.0509 126.94 72.095
## 
## Step:  AIC=63.51
## mpg ~ cyl + hp + wt + qsec + vs + am
## 
##        Df Sum of Sq    RSS    AIC
## - qsec  1     2.442 143.68 62.059
## - vs    1     2.744 143.98 62.126
## - cyl   2    18.580 159.82 63.466
## <none>              141.24 63.511
## + disp  1     1.247 139.99 65.227
## + drat  1     0.666 140.57 65.359
## - hp    1    18.184 159.42 65.386
## - am    1    18.885 160.12 65.527
## + gear  2     4.684 136.55 66.431
## - wt    1    39.645 180.88 69.428
## + carb  5     2.331 138.91 72.978
## 
## Step:  AIC=62.06
## mpg ~ cyl + hp + wt + vs + am
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1     7.346 151.03 61.655
## <none>              143.68 62.059
## - cyl   2    25.284 168.96 63.246
## + qsec  1     2.442 141.24 63.511
## - am    1    16.443 160.12 63.527
## + disp  1     0.589 143.09 63.928
## + drat  1     0.330 143.35 63.986
## + gear  2     3.437 140.24 65.284
## - hp    1    36.344 180.02 67.275
## - wt    1    41.088 184.77 68.108
## + carb  5     3.480 140.20 71.275
## 
## Step:  AIC=61.65
## mpg ~ cyl + hp + wt + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              151.03 61.655
## - am    1     9.752 160.78 61.657
## + vs    1     7.346 143.68 62.059
## + qsec  1     7.044 143.98 62.126
## - cyl   2    29.265 180.29 63.323
## + disp  1     0.617 150.41 63.524
## + drat  1     0.220 150.81 63.608
## + gear  2     1.361 149.66 65.365
## - hp    1    31.943 182.97 65.794
## - wt    1    46.173 197.20 68.191
## + carb  5     5.633 145.39 70.438

summary(modelBestFit)

## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## amManual     1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

When comparing models fitted by maximum likelihood to the same data, the smaller the AIC, the better the fit. AIC passed from 76.4 (all variables) to 61.5 with ‘cyl’, ‘hp’, ‘wt’, ‘am’ as the most relevant variables. Adjusted r-squared value finally reached 0.8401, explaining 84% of the variance.

Let’s go on with the ‘anova’ function, comparing our best-fit model with a model taking into account only the ‘am’ variable (transmission type). Our null hypothesis will be that model with variables selected for our best-fit model (‘cyl’, ‘hp’, ‘wt’, ‘am’) does not differ from a simple model only related to the ‘am’ vairiable.

modelAmVar <- lm(mpg ~ am, data = mtcars)
anova(modelBestFit, modelAmVar)

## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + hp + wt + am
## Model 2: mpg ~ am
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     26 151.03                                  
## 2     30 720.90 -4   -569.87 24.527 1.688e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

PValue <- anova(modelBestFit, modelAmVar)$"Pr(>F)"[2]
alpha <- 0.05 ## let set alpha to 5% confidence to reject or not the null hypothesis
if(PValue<alpha){
        print(paste("Our ",PValue," test P-value is smaller than our 0.05 alpha value, so we successfully rejected the null hypothesis"))
}else{
        print(paste("Our ",PValue," test P-value is greater than our 0.05 alpha value, so we failed to reject the null hypothesis"))
}

## [1] "Our  1.68843485084987e-08  test P-value is smaller than our 0.05 alpha value, so we successfully rejected the null hypothesis"

Residual Analysis

We will focus now on our best-fit model and have a look to the residual values. We would like to be sure that residuals do not present any hidden pattern that might weaken our regression model. We will also the check the presence of outliers and evaulate its effect on the regression model.

par(mfrow = c(2, 2))
plot(modelBestFit, which = 1)
plot(modelBestFit, which = 2)
plot(modelBestFit, which = 3)
plot(modelBestFit, which = 5)

What can we conclude from residual analysis?

Residual vs Fitted plot does not show any particular hidden pattern, which means residuals are not interdependent.
Normal Q-Q plot shows that standardized residuals are fitting very well the normal distribution slope.
Scale-Location plot shows that standardized residual variance is constant.
Residual vs Leverage shows the effect of potential outliers (values which pull the regression line from the 0 horitzontal line)

Regression Models Peer Assessment

François Rosselet

13/02/2017

EXECUTIVE SUMMARY

Data exploration

Hypothesis test

Regression Models

Residual Analysis

Conclusions

1. Is an automatic or manual transmission better for MPG?

2. Quantifying how different is the MPG between automatic and manual transmissions?”