Overview

This is the course project for the Regression Models course, part of the Data Science Specialization offered by the Johns Hopkins Bloomberg School of Public Health on Coursera.

In this work we look at a data set describing a collection of cars and explore the relationship between a set of variables and fuel consumption, measured in miles per gallon (MPG).

In this analysis we’ll try to address the following two questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions

Loading dataset

For this analysis we’ll be using the standard R dataset mtcars. According to the R documentation, the data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models), for a total of 11 variables:

data(mtcars) # loading the dataset
str(mtcars) # inspect the data structure
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

To perform the analysis, it’s necessary to convert some of the ‘num’ variables into factors:

mtcars$am <- factor(mtcars$am,labels=c('Automatic','Manual'))
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
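
Since factor() assigns labels to the sorted levels (0, then 1), it can be worth confirming that 0 maps to Automatic and 1 to Manual, as the mtcars documentation defines am. The following is a minimal sketch of such a check; the names d, am_factor and counts are illustrative only, and the code works on a local copy so the objects used in the analysis are not modified.

```r
# Local copy of the dataset (hypothetical name `d`), leaving `mtcars` untouched.
d <- datasets::mtcars

# Same labelling as in the conversion above: level 0 -> Automatic, level 1 -> Manual.
am_factor <- factor(d$am, labels = c("Automatic", "Manual"))

# Cross-tabulate the original 0/1 codes against the new labels.
# All cars should fall on the diagonal (0/Automatic and 1/Manual).
counts <- table(original = d$am, labelled = am_factor)
print(counts)
```

If any count appeared off the diagonal, the labels would have been assigned in the wrong order.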

Data analysis

First, is there a significant overall difference in MPG between automatic and manual transmissions?

t.test(mpg~am,data=mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

There does appear to be a statistically significant difference (p-value = 0.001374) in fuel consumption between the two types of transmission, favoring the manual type (24.39231 mpg on average) over the automatic type (17.14737 mpg on average).
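
To put a number on this unadjusted difference, the group means and the confidence interval can be read directly from the object returned by t.test(). This is a sketch under the same setup as above; the names d, tt, mpg_gain and ci_gain are illustrative and not part of the original analysis.

```r
# Local copy so the main analysis objects are left untouched.
d <- datasets::mtcars
d$am <- factor(d$am, labels = c("Automatic", "Manual"))

tt <- t.test(mpg ~ am, data = d)

# Group means in factor-level order: Automatic first, Manual second.
group_means <- unname(tt$estimate)

# Manual minus Automatic, in miles per gallon (about 7.24, from the means above).
mpg_gain <- group_means[2] - group_means[1]

# t.test() reports the interval for Automatic minus Manual, so negate it and
# reverse the order to express it as Manual minus Automatic.
ci_gain <- rev(-as.numeric(tt$conf.int))

round(c(gain = mpg_gain, lower = ci_gain[1], upper = ci_gain[2]), 2)
```

This difference ignores every other variable in the dataset, so it should be read as a raw comparison rather than the effect of the transmission alone.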

Fit Model

So, let’s find the model that best describes fuel consumption using the mtcars variables. We’ll first fit a model that includes all variables and then use the stepwise algorithm to check which parameters are the most relevant to keep in the model.

allvars <- lm(mpg ~ ., data = mtcars) # initial model with all variables
best <- step(allvars, trace = 0) # stepwise selection by AIC, without printing each step
summary(best)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## amManual     1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10
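
One way to check whether the extra predictors are worth keeping is to compare a transmission-only model with the selected model using a nested-model F-test. The sketch below refits both models on a local copy of the data; the names d, am_only, selected and cmp are illustrative, and the selected formula is copied from the Call shown in the summary above.

```r
# Local copy with the same factor conversions used by the selected model.
d <- datasets::mtcars
d$am <- factor(d$am, labels = c("Automatic", "Manual"))
d$cyl <- factor(d$cyl)

# Transmission-only model versus the model chosen by the stepwise search.
am_only <- lm(mpg ~ am, data = d)
selected <- lm(mpg ~ cyl + hp + wt + am, data = d)

# F-test for nested models: do cyl, hp and wt jointly improve the fit?
cmp <- anova(am_only, selected)
print(cmp)

# p-value of the comparison (second row of the anova table).
p_value <- cmp[["Pr(>F)"]][2]
```

A small p-value here would indicate that cyl, hp and wt explain variation in mpg that the transmission type alone does not.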

Conclusions

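As shown above, the best fitting model (R-squared = 0.87) involves, besides the transmission type (am), the variables cyl (number of cylinders, as the cyl6 and cyl8 terms), hp (horsepower) and wt (weight). Holding cyl, hp and wt fixed, changing from automatic to manual transmission increases the expected mpg by about 1.8. Note, however, that this coefficient is not statistically significant in the adjusted model (p-value = 0.20646), so the large difference seen in the t-test is mostly accounted for by the other variables.

You can see the residual analysis of this model in the appendices.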
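
The uncertainty around the adjusted transmission effect can be expressed as a confidence interval with confint(). This is a minimal sketch that refits the selected model on a local copy of the data; the names d, selected and ci_am are illustrative.

```r
# Local copy with the factor conversions needed by the selected model.
d <- datasets::mtcars
d$am <- factor(d$am, labels = c("Automatic", "Manual"))
d$cyl <- factor(d$cyl)

selected <- lm(mpg ~ cyl + hp + wt + am, data = d)

# 95% confidence interval for the Manual-versus-Automatic coefficient,
# holding cyl, hp and wt fixed.
ci_am <- confint(selected, level = 0.95)["amManual", ]
print(ci_am)
```

Because the p-value of amManual in the summary above is larger than 0.05, this 95% interval includes zero.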

Appendices

Appendix A - Correlation Matrix

library(corrplot)
data(mtcars) # reload the dataset so the factors revert to numeric columns
cor_mat <- cor(mtcars)
ord <- corrMatOrder(cor_mat, order="AOE")
corrplot.mixed(cor_mat[ord,ord])

Appendix B - Residual Analysis

par(mfrow=c(2,2))
plot(best)

In the Residuals vs. Fitted chart the residuals are randomly scattered with no obvious pattern, which supports the independence assumption, and in the Normal Q-Q chart the points follow the line closely, which suggests the residuals are approximately normally distributed.
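
As an optional numeric complement to the Q-Q chart, a Shapiro-Wilk test can be run on the residuals. This test is not part of the original analysis; the sketch below refits the selected model on a local copy, and the names d, selected and sw are illustrative.

```r
# Local copy with the factor conversions needed by the selected model.
d <- datasets::mtcars
d$am <- factor(d$am, labels = c("Automatic", "Manual"))
d$cyl <- factor(d$cyl)

selected <- lm(mpg ~ cyl + hp + wt + am, data = d)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a large p-value means no evidence against normality.
sw <- shapiro.test(residuals(selected))
print(sw)
```

With only 32 observations the test has limited power, so it is best read together with the plot rather than in place of it.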