Executive Summary

Motor Trend, a magazine about the automobile industry, is interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. In this study we analyze the mtcars dataset, extracted from the 1974 Motor Trend US magazine, to answer two questions: whether automatic or manual transmission is better for MPG, and how large the MPG difference between the two is.

Our dataset has 11 variables and 32 observations.

Processing

The mtcars dataset was used for the analysis. It comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-1974 models).

# Load the dataset
data(mtcars)
# Convert the variables that are not continuous to factors
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
# Relabel the levels of "am" (0 = automatic, 1 = manual)
levels(mtcars$am) <- c("automatic", "manual")
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

Exploratory analysis

library(ggplot2)
ggplot(mtcars, aes(x=am, y=mpg, fill=am)) +
    geom_boxplot()

The boxplot suggests a clear difference in fuel consumption between automatic and manual transmission cars. Below is a model that explains the MPG variability with transmission type only.

fit1 <- lm(mpg ~ am, data=mtcars)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## ammanual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Both the intercept and the transmission coefficient are significant: automatic cars average 17.1 mpg, and manual cars average about 7.2 mpg more. However, the model using only transmission type explains only about 36.0% of the MPG variation (R-squared = 0.3598).
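With a single two-level factor as the predictor, the intercept is the mean mpg of the reference group (automatic) and the ammanual coefficient is the difference between the two group means. The following is a minimal sketch of that check, not part of the original analysis. It works on a fresh copy of the data, and the names cars, group_means, fit_am, mean_diff, r_squared and welch are illustrative. The Welch t-test is added here as an assumption-light cross-check on the difference.

```r
# Work on a fresh copy so the mtcars object prepared above is left untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Mean mpg per transmission type
group_means <- tapply(cars$mpg, cars$am, mean)

# Same one-predictor model as fit1
fit_am <- lm(mpg ~ am, data = cars)

# The slope should equal the difference between the two group means
mean_diff <- unname(group_means["manual"] - group_means["automatic"])
r_squared <- summary(fit_am)$r.squared

# Welch two-sample t-test (does not assume equal variances)
welch <- t.test(mpg ~ am, data = cars)

print(round(group_means, 3))
print(round(c(slope = unname(coef(fit_am)["ammanual"]),
              mean_diff = mean_diff,
              r_squared = r_squared), 4))
print(welch$p.value)
```

Reading the R-squared from summary(fit_am)$r.squared and rounding it for the text avoids reporting spurious decimal places.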

Before drawing any conclusions about the effect of transmission type on fuel efficiency, we look at the pairwise relationships between the variables in the dataset.

pairs(mtcars, panel=function(x,y) {
    points(x, y)
    abline(lm(y ~ x), col="red")
})

Based on the pairs plot above, several variables appear to be strongly correlated with mpg. Hence, we fit an initial model using all variables and select the best subset of predictors by stepwise selection in both directions (backward elimination and forward selection), which step() performs using AIC.
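The visual impression from the pairs plot can also be checked numerically. The sketch below is an illustration rather than part of the original analysis: it ranks the other variables by the absolute value of their Pearson correlation with mpg. It uses the original numeric coding of the data (before the factor conversion), so treating cyl, vs, am, gear and carb as numeric is an assumption made here for the ranking only. The names cars and mpg_cor are illustrative.

```r
# Fresh numeric copy of the data (no factors), so cor() can be applied directly
cars <- datasets::mtcars

# Correlation of every variable with mpg, dropping mpg itself
mpg_cor <- cor(cars)["mpg", ]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Sort by strength of association, strongest first
mpg_cor <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(mpg_cor, 2))
```

In this ranking wt comes first, and several variables are more strongly correlated with mpg than am is, which is consistent with the decision to look beyond a transmission-only model.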

Model selection

Since mpg appears to be related to several other variables more strongly than to ‘am’, a model predicting mpg from transmission type alone is unlikely to be the most accurate. Let’s check this.

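# Start from the full model with all predictors
full_model <- lm(mpg ~ ., data=mtcars)
# Stepwise selection in both directions (step() uses AIC by default)
bestfit_model <- step(full_model, direction="both", trace=0)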
summary(bestfit_model)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## ammanual     1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10
# Show the four diagnostic plots in a 2x2 grid
par(mfrow = c(2,2))
plot(bestfit_model)

The final model contains four predictors: cyl (number of cylinders), hp (horsepower), wt (weight) and am (transmission type). It explains about 86.6% of the MPG variation (R-squared = 0.8659). Weight, horsepower and the 6-cylinder level of cyl are significant at \(\alpha=0.05\), while the 8-cylinder level is not (p = 0.35). After adjusting for these variables, the estimated advantage of manual transmission shrinks to 1.81 mpg and is not statistically significant (p = 0.21), so the data do not show a clear effect of transmission type on fuel consumption. The residual plots suggest that the residuals are approximately normally distributed and do not depend on the fitted values.
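Two further checks can support this reading; they are a sketch added for illustration, not part of the original analysis. Because the transmission-only model is nested in the selected model, an F-test via anova() indicates whether the extra predictors improve the fit. A 95% confidence interval for the ammanual coefficient shows whether zero is a plausible value for the adjusted transmission effect. The code refits both models on a fresh copy of the data, and the names cars, fit_am, fit_best, cmp and am_ci are illustrative.

```r
# Fresh copy with the two factors used by the selected model
cars <- within(datasets::mtcars, {
  cyl <- factor(cyl)
  am  <- factor(am, labels = c("automatic", "manual"))
})

# Transmission-only model and the model chosen by stepwise selection
fit_am   <- lm(mpg ~ am, data = cars)
fit_best <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# F-test for the nested models: do cyl, hp and wt add explanatory power?
cmp <- anova(fit_am, fit_best)
print(cmp)

# 95% confidence interval for the adjusted manual-vs-automatic difference
am_ci <- confint(fit_best, "ammanual", level = 0.95)
print(am_ci)
```

A small p-value in the anova table favors the larger model, and an interval for ammanual that contains zero agrees with the non-significant coefficient reported in the summary.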

Conclusion

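We have determined that there is a difference in mpg between transmission types and have quantified it. Manual cars average about 7.2 mpg more than automatic cars when transmission type is the only predictor. Once weight, horsepower and number of cylinders are taken into account, the difference falls to about 1.8 mpg and is no longer statistically significant. Transmission type is therefore not a very good explanatory variable for mpg on its own; weight, horsepower and number of cylinders explain more of the variation.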