Excutive Summary

This project will focus on a data collection of cars, exploring the relationship between a set of variables and miles per gallon (MPG) as outcome variable. Interested questions for this project are

  1. Data Description A data frame with 32 observations on 11 (numeric) variables.

Exploratory Analysis

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
  1. Change data types
mtcars[,c(2,9,10)] <- sapply(mtcars[,c(2,9,10)] , factor)

Since there are no complete overlaps in the boxplot below, Transmission type appears to have better relationship with mpg

with(mtcars, boxplot(mpg~am))

Model

  1. We will fit a linear linear model with only cylinder as the predictor varible
    Transmission (am) as factor - (0 = automatic, 1 = manual)
fit.cyl <- lm(mpg~am, data= mtcars)
summary(fit.cyl)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am1            7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Transimission type is a significant predictor of MPG with based on the p-values. Automatic transmission appears to consume fewer MPG with Manual transmission expected consume and 7.2 more gallons per 1 mile. However, the low R-Squared shows that approximately 36% of the variation in the model is explained by the Transmission variable and our co-efficients may not be reliable.

  1. Let’s compare this to a full model
fit.full <- lm(mpg~., data= mtcars)
summary(fit.full)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2015 -1.2319  0.1033  1.1953  4.3085 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 15.09262   17.13627   0.881   0.3895  
## cyl6        -1.19940    2.38736  -0.502   0.6212  
## cyl8         3.05492    4.82987   0.633   0.5346  
## disp         0.01257    0.01774   0.708   0.4873  
## hp          -0.05712    0.03175  -1.799   0.0879 .
## drat         0.73577    1.98461   0.371   0.7149  
## wt          -3.54512    1.90895  -1.857   0.0789 .
## qsec         0.76801    0.75222   1.021   0.3201  
## vs           2.48849    2.54015   0.980   0.3396  
## am1          3.34736    2.28948   1.462   0.1601  
## gear4       -0.99922    2.94658  -0.339   0.7382  
## gear5        1.06455    3.02730   0.352   0.7290  
## carb         0.78703    1.03599   0.760   0.4568  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.616 on 19 degrees of freedom
## Multiple R-squared:  0.8845, Adjusted R-squared:  0.8116 
## F-statistic: 12.13 on 12 and 19 DF,  p-value: 1.764e-06

The full model gives a relatively high Adjusted R-squared 88.5% but has less significant variables, implying that there could be overfitting.

  1. Selecting the best model All numerical variables have at least moderate correlation with mpg
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
df_cor <- na.omit(mtcars[, c(1,3:8,11)])
ggcorr(df_cor, label = T)

We can use the step fubction to select the best model, we will use backward stepwise regression, removing one variable a time untill we reach the best model (when AIC stops decreasing)

# To see output, specify trace = True
fit.best <- step(fit.full, direction = "backward", trace = F)
summary(fit.best)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am1           2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Based on the AIC, the best model has predictors: wt, am, qsec Now we can fit a parsimonious model. This model now gives a Adjusted Rsq of 83.3 with a low p-value indicatiing the model significance.

  1. Hypothesis test for model selection This test also suggest that the best.fit model is significant than the full model
anova(fit.cyl, fit.best, fit.full)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     28 169.29  2    551.61 40.2941 1.463e-07 ***
## 3     19 130.05  9     39.23  0.6369    0.7524    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Diagnoistics

  • The model diagnostics shows that the best fit model has linear pattern, indicating the linearity was captured by the mode
  • residuals appear normaly distributed as shows on the Normal-Q-Q plot as well
  • The Scale-Location shows that the best fit model has approximately constant variance
  • Residual vs Leverage captures influetial points, in particular, there is no outliers in our model

You can read more here

par(mfrow = c(2,2))
plot(fit.best)

Sum of residual is also zero, all assumptions for a linear regression are adequately satisfied

sum(fit.best$residuals)
## [1] 1.665335e-15

Conclusion