Executive Summary

This report is a course project for the Regression Models course in the Data Science Specialization offered by Johns Hopkins University on Coursera.

We examine the mtcars data set and explore how miles per gallon (MPG) is affected by other variables. In particular, we address the following two questions:

  1. “Is an automatic or manual transmission better for MPG”
  2. “Quantify the MPG difference between automatic and manual transmissions”

Data Description

We analyze the mtcars data set through exploratory analysis and regression modelling to show how transmission type, automatic (am = 0) or manual (am = 1), relates to MPG.

The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). The data set is a data frame with 32 observations (nrow) and 11 variables (ncol).

Exploratory Analysis

We load the data set and inspect its structure before plotting MPG by transmission type:

library(ggplot2)
data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

We convert the categorical variables into factors and draw a boxplot of MPG by transmission type:

# Treat the discrete variables as categorical
mtcars$cyl  <- factor(mtcars$cyl)
mtcars$vs   <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am   <- factor(mtcars$am, labels = c("Automatic", "Manual"))

# Compare the MPG distribution of the two transmission types
boxplot(mpg ~ am, data = mtcars, col = c("red", "blue"),
        ylab = "Miles Per Gallon", xlab = "Transmission Type")

The boxplot shows that cars with manual transmission tend to have higher MPG than cars with automatic transmission.

Regression Analysis

We can also calculate mean MPG values for cars with Automatic and Manual transmission as follows:

aggregate(mpg~am, data = mtcars, mean)

Manual transmission again yields about 7 MPG more on average than automatic. Let's now test this difference with a simple linear regression:

T_simple <- lm(mpg ~ factor(am), data=mtcars); summary(T_simple)
## 
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        17.147      1.125  15.247 1.13e-15 ***
## factor(am)Manual    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The p-value for the transmission coefficient is less than 0.0003, so we reject the null hypothesis that transmission type has no effect on MPG. However, the R-squared value for this model is approximately 0.36, meaning that only about a third of the variance in MPG can be attributed to transmission type alone.
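
Because am is the only predictor, the slope of this model is simply the difference between the two group means, and its confidence interval quantifies the uncertainty of that difference. The following is a minimal sketch of that check; the names cars, fit, group_means, mean_diff, slope, and ci are illustrative and not part of the original analysis.

```r
# Work on a fresh copy so the sketch runs on its own
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = cars)

# Group means and their difference (Manual minus Automatic)
group_means <- tapply(cars$mpg, cars$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

# The regression slope should match the difference in means
slope <- unname(coef(fit)["amManual"])

# 95% confidence interval for the difference
ci <- confint(fit, "amManual", level = 0.95)

print(mean_diff)
print(slope)
print(ci)
```

An interval that excludes zero agrees with the small p-value reported in the summary above.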

We explore the other variables to see how each correlates with mpg:

pairs(mpg ~ ., data = mtcars)

From this we see that cyl, disp, hp, and wt have the strongest correlation with mpg. We build a new model using these variables and compare it to the initial model with the anova function.
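
The impression from the scatterplot matrix can be cross-checked numerically. The sketch below ranks the variables by the absolute value of their Pearson correlation with mpg; the names raw, predictors, cors, and ranked are illustrative, and treating cyl and the other discrete variables as numeric is a simplification made only for this ranking.

```r
# Use the untouched data set, since cyl, vs, am, gear, and carb
# were converted to factors earlier in the analysis
raw <- datasets::mtcars

# Pearson correlation of every other variable with mpg
predictors <- setdiff(names(raw), "mpg")
cors <- sapply(raw[predictors], function(x) cor(raw$mpg, x))

# Rank by absolute strength, strongest first
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

The expanded model itself keeps cyl as a factor and is fitted as follows.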

NewFit <- lm(mpg~am + cyl + disp + hp + wt, data = mtcars)
anova(T_simple, NewFit)

This results in a p-value of 8.637e-08, so we reject the hypothesis that the added variables contribute nothing and conclude that the NewFit model fits significantly better than our T_simple model.
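
The anova comparison is a partial F-test on nested models: it asks whether the drop in residual sum of squares justifies the extra parameters. As a sketch of the arithmetic, with illustrative names (cars, small, big, f_stat, p_val, tab) that are not part of the original analysis:

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))

small <- lm(mpg ~ am, data = cars)
big   <- lm(mpg ~ am + cyl + disp + hp + wt, data = cars)

# Residual sums of squares and residual degrees of freedom
rss_small <- sum(residuals(small)^2)
rss_big   <- sum(residuals(big)^2)
df_small  <- df.residual(small)
df_big    <- df.residual(big)

# F = (drop in RSS per added parameter) / (residual variance of larger model)
f_stat <- ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
p_val  <- pf(f_stat, df_small - df_big, df_big, lower.tail = FALSE)

# These should agree with the second row of the anova table
tab <- anova(small, big)
print(c(F = f_stat, p = p_val))
print(tab)
```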

summary(NewFit)
## 
## Call:
## lm(formula = mpg ~ am + cyl + disp + hp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9374 -1.3347 -0.3903  1.1910  5.0757 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.864276   2.695416  12.564 2.67e-12 ***
## amManual     1.806099   1.421079   1.271   0.2155    
## cyl6        -3.136067   1.469090  -2.135   0.0428 *  
## cyl8        -2.717781   2.898149  -0.938   0.3573    
## disp         0.004088   0.012767   0.320   0.7515    
## hp          -0.032480   0.013983  -2.323   0.0286 *  
## wt          -2.738695   1.175978  -2.329   0.0282 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.453 on 25 degrees of freedom
## Multiple R-squared:  0.8664, Adjusted R-squared:  0.8344 
## F-statistic: 27.03 on 6 and 25 DF,  p-value: 8.861e-10

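The model explains 86.64% of the variance (adjusted R-squared 0.8344). Adjusting for cyl, disp, hp, and wt shrinks the estimated difference between manual and automatic transmissions from 7.25 MPG to 1.81 MPG, and this adjusted difference is not statistically significant (p = 0.2155). Thus, once these variables are held fixed, the data do not show a clear MPG advantage for either transmission type.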
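
A point estimate alone does not convey how precisely the adjusted difference is known. The following is a minimal sketch of a 95% confidence interval for the am coefficient, using illustrative names (cars, adjusted, est, ci, spans_zero) that are not in the original analysis:

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))

adjusted <- lm(mpg ~ am + cyl + disp + hp + wt, data = cars)

# Adjusted Manual-minus-Automatic difference and its 95% interval
est <- unname(coef(adjusted)["amManual"])
ci  <- confint(adjusted, "amManual", level = 0.95)

# TRUE when the interval contains zero
spans_zero <- ci[1, 1] < 0 && ci[1, 2] > 0

print(est)
print(ci)
print(spans_zero)
```

Given the standard error of about 1.42 reported in the summary above, the interval should span zero, so the 1.81 MPG figure is best read as an estimate with considerable uncertainty.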

Residual Plot and Analysis

par(mfrow = c(2, 2))
plot(NewFit)

The Residuals vs Fitted plot shows no clear pattern, which is consistent with homoscedastic residuals. The Normal Q-Q plot suggests that the residuals are approximately normally distributed.
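
The visual assessment can be complemented by a formal test. As a sketch, the Shapiro-Wilk test from base R checks the normality of the residuals; with only 32 observations it has limited power, so it supplements the plots rather than replacing them. The names cars, adjusted, res, and sw are illustrative.

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))

adjusted <- lm(mpg ~ am + cyl + disp + hp + wt, data = cars)
res <- residuals(adjusted)

# Null hypothesis: the residuals come from a normal distribution.
# A large p-value means the test finds no evidence against normality.
sw <- shapiro.test(res)
print(sw)
```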