Synopsis

Motor Trend is a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

  1. Is an automatic or manual transmission better for MPG?
  2. How large is the MPG difference between automatic and manual transmissions?

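Considering a model that includes weight (wt), quarter-mile time (qsec, a proxy for acceleration) and transmission type (am), we estimate that manual cars achieve 2.94 miles per gallon (MPG) more than automatic cars, holding weight and quarter-mile time constant.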

Loading the dataset

  data(mtcars)
  head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Check the summary and structure of the dataset

  summary(mtcars)
##       mpg            cyl            disp             hp       
##  Min.   :10.4   Min.   :4.00   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.4   1st Qu.:4.00   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.2   Median :6.00   Median :196.3   Median :123.0  
##  Mean   :20.1   Mean   :6.19   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.8   3rd Qu.:8.00   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.9   Max.   :8.00   Max.   :472.0   Max.   :335.0  
##       drat            wt            qsec            vs       
##  Min.   :2.76   Min.   :1.51   Min.   :14.5   Min.   :0.000  
##  1st Qu.:3.08   1st Qu.:2.58   1st Qu.:16.9   1st Qu.:0.000  
##  Median :3.69   Median :3.33   Median :17.7   Median :0.000  
##  Mean   :3.60   Mean   :3.22   Mean   :17.8   Mean   :0.438  
##  3rd Qu.:3.92   3rd Qu.:3.61   3rd Qu.:18.9   3rd Qu.:1.000  
##  Max.   :4.93   Max.   :5.42   Max.   :22.9   Max.   :1.000  
##        am             gear           carb     
##  Min.   :0.000   Min.   :3.00   Min.   :1.00  
##  1st Qu.:0.000   1st Qu.:3.00   1st Qu.:2.00  
##  Median :0.000   Median :4.00   Median :2.00  
##  Mean   :0.406   Mean   :3.69   Mean   :2.81  
##  3rd Qu.:1.000   3rd Qu.:4.00   3rd Qu.:4.00  
##  Max.   :1.000   Max.   :5.00   Max.   :8.00
  str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Check for missing data

  colSums(is.na(mtcars))
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0

No missing values were found.

Exploratory Data Analysis

Summarize the distributions of MPG for the two transmission types

  mtcars$am <- factor(mtcars$am,
                      levels = c(0, 1),
                      labels = c("automatic", "manual"))
  require(knitr)
## Loading required package: knitr
  opts_chunk$set(fig.align = 'center')
  par(bty = "n")
  boxplot(mpg ~ am, data = mtcars, col = c("lightblue", "salmon"),
          xlab = "transmission type", ylab = "miles per gallon")

[Figure: box plot of miles per gallon by transmission type]

From the above plot, we notice that cars with a manual transmission tend to have a higher MPG than cars with an automatic transmission. To test the significance and quantify the difference, we regress mpg on transmission type, treated as a factor variable:

  fit <- lm(mpg ~ am, mtcars)
  summary(fit)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.392 -3.092 -0.297  3.244  9.508 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    17.15       1.12   15.25  1.1e-15 ***
## ammanual        7.24       1.76    4.11  0.00029 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.9 on 30 degrees of freedom
## Multiple R-squared:  0.36,   Adjusted R-squared:  0.338 
## F-statistic: 16.9 on 1 and 30 DF,  p-value: 0.000285

The result indicates that the average MPG of manual cars is 7.24 higher (the ammanual coefficient) than that of automatic cars (17.15, the intercept), and that the difference is significant at the 1% level (p = 0.00029). Nevertheless, this estimate may be severely biased, since the model omits many variables that are more plausibly related to MPG.
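As a cross-check on this coefficient: when a two-level factor is the only predictor, the slope is simply the difference between the two group means. The sketch below works on a local copy named `d` (a name introduced here for illustration, so the `mtcars` object used elsewhere is left untouched). It confirms the equality and also runs Welch's two-sample t-test, which does not assume equal variances in the two groups.

```r
# Local copy with am recoded as a labelled factor.
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Mean MPG per transmission type.
group_means <- tapply(d$mpg, d$am, mean)

# Difference in means (manual minus automatic).
mean_diff <- unname(group_means["manual"] - group_means["automatic"])

# Slope from the simple regression; it should match mean_diff.
slope <- unname(coef(lm(mpg ~ am, data = d))["ammanual"])

# Welch's t-test (unequal variances) on the same comparison.
welch <- t.test(mpg ~ am, data = d)

print(group_means)
print(c(mean_diff = mean_diff, slope = slope))
print(welch$p.value)
```

The t-test addresses the same unadjusted comparison, so it is subject to the same omitted-variable concern as the simple regression.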

To check this suspicion, we draw scatter plots of all variable pairs in the mtcars dataset:

  pairs(mtcars)

[Figure: scatter plot matrix of all mtcars variables]

Looking at the first row of the scatter plot matrix, we can see that number of cylinders (cyl), displacement (disp), horsepower (hp), rear axle ratio (drat) and weight (wt) are strongly correlated with mpg. We list the first row of the correlation matrix for these variables.

  sub_mtcars <- subset(mtcars, select = -c(qsec,vs,am,gear,carb))
  cor(sub_mtcars)[1,]
##     mpg     cyl    disp      hp    drat      wt 
##  1.0000 -0.8522 -0.8476 -0.7762  0.6812 -0.8677

Comparing the correlations of these car characteristics with MPG, we notice that three variables are highly correlated with MPG (|r| > 0.8): wt, cyl and disp. All three correlations are negative.
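The same screening can be applied to every column rather than a hand-picked subset. This is a minimal sketch, assuming the raw (all-numeric) version of the data. The names `d`, `r`, `ranked` and `strong` are introduced here for illustration, and the 0.8 cut-off is the threshold used in the text, not a general rule.

```r
# Raw dataset: every column is numeric, so cor() works on all of them.
d <- datasets::mtcars

# Correlation of each variable with mpg, dropping mpg itself.
r <- cor(d)["mpg", ]
r <- r[names(r) != "mpg"]

# Rank by absolute correlation, strongest first.
ranked <- r[order(abs(r), decreasing = TRUE)]

# Variables whose absolute correlation with mpg exceeds 0.8.
strong <- names(ranked)[abs(ranked) > 0.8]

print(round(ranked, 3))
print(strong)
```

Ranking on the absolute value matters here because the strongest relationships with mpg are negative.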

Model selection

We first fit a simple linear regression with transmission type as the only predictor (the same model as fit above).

  fit.simple <- lm(mpg ~ am, mtcars)
  summary(fit.simple)$adj.r.squared
## [1] 0.3385

The adjusted \(R^2\) value indicates that the model explains only about 34% of the variation in MPG, which is low.
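For reference, the adjusted \(R^2\) can be recomputed by hand from the ordinary \(R^2\), the sample size n and the number of predictors p, as 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below uses a local copy `d` and illustrative variable names that are not part of the original analysis.

```r
# Local copy with am recoded as a labelled factor.
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit <- lm(mpg ~ am, data = d)
s <- summary(fit)

n <- nrow(d)                 # number of observations
p <- length(coef(fit)) - 1   # number of predictors, excluding the intercept

# Adjusted R^2 penalises R^2 for the number of predictors.
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

print(c(r.squared = s$r.squared,
        adj_manual = adj_manual,
        adj_reported = s$adj.r.squared))
```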

  summary(fit.simple)$coefficients
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.134e-15
## ammanual       7.245      1.764   4.106 2.850e-04

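This model tells us that manual cars average 7.245 MPG more than automatic cars. Because no other variables are controlled for, this is an association and should not be read as a causal effect of the transmission.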

Next, we use the stepwise algorithm (step-by-step selection) to select a better model, forcing the am variable to stay in the model:

  data(mtcars)
  mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))

  fit.step <- step(lm(mpg ~ ., mtcars), trace = 0, scope = list(lower = ~am), direction = "both")
  summary(fit.step)$call
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)

The best model proposed by the stepwise algorithm includes the weight (wt) and the “1/4 mile time” (qsec) of the cars, in addition to transmission (am), to explain fuel economy (MPG).
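By default `step()` ranks candidate models by AIC, so one way to see why this model was chosen is to compare AIC values directly. The sketch below refits three models on a local copy `d`; the names `fit_simple`, `fit_full` and `fit_step` are illustrative. `AIC()` and the criterion `step()` uses internally should differ only by an additive constant for linear models fitted to the same data, so the ordering is expected to agree.

```r
# Local copy with am recoded as a labelled factor.
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit_simple <- lm(mpg ~ am, data = d)              # transmission only
fit_full   <- lm(mpg ~ ., data = d)               # all predictors
fit_step   <- lm(mpg ~ wt + qsec + am, data = d)  # model chosen by step()

# Lower AIC indicates a better trade-off between fit and complexity.
aic <- AIC(fit_simple, fit_full, fit_step)
print(aic)
```

Stepwise selection is a heuristic search, so the selected model is the best among those visited, not necessarily among all possible subsets.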

  summary(fit.step)$adj.r.squared
## [1] 0.8336

The adjusted \(R^2\) is 0.8336 which means that the model explains 83% of the variation.

Model Comparison and Residual Analysis

We then compare the model proposed by Stepwise with our first model using ANOVA.

  anova(fit.simple, fit.step)[2,6] #p-value
## [1] 1.55e-09

The p-value is very low, so we reject the null hypothesis that the additional terms (wt and qsec) have zero coefficients, i.e. that the two models fit equally well. We conclude that the model proposed by the stepwise algorithm fits significantly better than our first simple model.
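The ANOVA comparison of two nested linear models is an F-test on the reduction in the residual sum of squares. The sketch below reproduces it by hand so the p-value is less of a black box. It works on a local copy `d`, and all variable names are illustrative.

```r
# Local copy with am recoded as a labelled factor.
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit_simple <- lm(mpg ~ am, data = d)
fit_step   <- lm(mpg ~ wt + qsec + am, data = d)

# Residual sums of squares and residual degrees of freedom.
rss_simple <- sum(residuals(fit_simple)^2)
rss_step   <- sum(residuals(fit_step)^2)
df_simple  <- df.residual(fit_simple)
df_step    <- df.residual(fit_step)

# F statistic: drop in RSS per added parameter, relative to the
# residual variance of the larger model.
f_stat <- ((rss_simple - rss_step) / (df_simple - df_step)) /
  (rss_step / df_step)
p_value <- pf(f_stat, df_simple - df_step, df_step, lower.tail = FALSE)

# The same test as computed by anova().
tab <- anova(fit_simple, fit_step)
print(c(f_stat = f_stat, p_value = p_value))
print(tab)
```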

  par(mfrow = c(2, 2))
  plot(fit.step)

[Figure: diagnostic plots for fit.step (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage)]

The above figure shows the diagnostic plots of the selected model. The residuals seem to be uncorrelated with the fitted values, independent and (almost) identically distributed with mean zero.
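A visual reading of the diagnostics can be supplemented with a numerical check. The sketch below is not part of the original analysis: it refits the selected model on a local copy `d` and applies the Shapiro-Wilk normality test to the residuals. With only 32 observations the test has limited power, so a non-significant result should be read as "no strong evidence against normality" and not as proof of it.

```r
# Local copy with am recoded as a labelled factor.
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit_step <- lm(mpg ~ wt + qsec + am, data = d)
res <- residuals(fit_step)

# With an intercept in the model, least-squares residuals average to
# zero by construction (up to floating-point error).
print(mean(res))

# Shapiro-Wilk test: the null hypothesis is that the residuals are
# normally distributed.
sw <- shapiro.test(res)
print(sw)
```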

Results

  summary(fit.step)$coefficients
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    9.618     6.9596   1.382 1.779e-01
## wt            -3.917     0.7112  -5.507 6.953e-06
## qsec           1.226     0.2887   4.247 2.162e-04
## ammanual       2.936     1.4109   2.081 4.672e-02

Given the coefficients of our model, we can say that manual cars have better fuel economy than automatic cars: on average they achieve 2.94 miles per gallon (MPG) more than automatic cars. This estimate holds the weight (wt) and the “1/4 mile time” (qsec) of the cars constant, and it is significant at the 5% level (p = 0.047), though only marginally so.
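To express the uncertainty around this estimate, a confidence interval for the transmission coefficient is more informative than the point estimate alone. The sketch below refits the selected model on a local copy `d` (illustrative names throughout) and uses `confint()` at the conventional 95% level.

```r
# Local copy with am recoded as a labelled factor.
d <- datasets::mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit_step <- lm(mpg ~ wt + qsec + am, data = d)

# Point estimate of the manual-vs-automatic difference, adjusted for
# weight and quarter-mile time.
estimate <- unname(coef(fit_step)["ammanual"])

# 95% confidence interval for that coefficient (a 1 x 2 matrix).
ci <- confint(fit_step, "ammanual", level = 0.95)

print(estimate)
print(ci)
```

Since the p-value is just under 0.05, the lower bound of the interval sits only slightly above zero, so the size of the advantage for manual cars is estimated quite imprecisely.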