Executive Summary

This report is the course project for the Regression Models course on Coursera. The goal is to explore the relationship between a set of variables and miles per gallon (mpg), using the mtcars dataset from R. The analysis addresses the following questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions

The t-test shows that cars with manual transmissions achieve better miles per gallon than cars with automatic transmissions. On average, the difference is about 7 miles per gallon when transmission type is the only predictor in the model. After adjusting for weight and horsepower, manual transmission is associated with an average increase of only 2.08 miles per gallon over automatic transmission, holding those variables constant, and this adjusted difference is not statistically significant (p = 0.14).

Exploratory Data Analysis

We first load the data and view its top rows:

library(datasets)
data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We transform some variables into factors and see the structure of the data

mtcars$cyl  <- factor(mtcars$cyl)
mtcars$vs   <- factor(mtcars$vs)
mtcars$am   <- factor(mtcars$am,labels = c("Automatic","Manual"))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

Here we see the statistical summary of miles per gallon (mpg)

summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.42   19.20   20.09   22.80   33.90

We compare automatic and manual transmissions by grouping the cars by transmission type and computing the average miles per gallon for each group

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mtcars %>%
    group_by(am) %>%
    summarize(mean = mean(mpg))
## # A tibble: 2 x 2
##          am     mean
##      <fctr>    <dbl>
## 1 Automatic 17.14737
## 2    Manual 24.39231

The result shows that manual transmission cars average 24.392 mpg while automatic transmission cars average 17.147 mpg, a difference of approximately 7 mpg.

Inference Test

We split the data into two subsets, one containing only automatic and the other only manual transmission cars, and perform a t-test on their mpg values

automatic <- mtcars[mtcars$am == "Automatic", ]
manual <- mtcars[mtcars$am == "Manual", ]
t.test(automatic$mpg, manual$mpg)
## 
##  Welch Two Sample t-test
## 
## data:  automatic$mpg and manual$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

The t-test gives a p-value of 0.001374, well below 0.05, so we conclude that the difference in mean mpg between the two transmission types is statistically significant.
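The same Welch test can also be run without creating the two subsets, using the formula interface of `t.test`. The sketch below is only an illustration: the names `cars` and `welch` are not part of the original analysis, and the data are reloaded so the block runs on its own.

```r
# Reload the data so this block is self-contained (cars is an illustrative name)
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Welch two-sample t-test of mpg by transmission type (unequal variances by default)
welch <- t.test(mpg ~ am, data = cars)

welch$p.value    # p-value of the test
welch$conf.int   # 95% confidence interval for mean(Automatic) - mean(Manual)
welch$estimate   # the two group means
```

Because "Automatic" is the first factor level, the interval is for automatic minus manual, which is why both bounds are negative.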

Regression Analysis

We fit a simple linear regression with mpg as the dependent variable and transmission type (am) as the independent variable

modelFit <- lm(mpg ~ am, data = mtcars)
summary(modelFit)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The model has an R^2 of 0.3598, meaning that transmission type alone explains only about 36% of the variability in mpg. This is too low to draw firm conclusions, so we examine other variables using multiple linear regression.
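To make the R^2 interpretation concrete, it can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This is a minimal sketch; the names `cars`, `fit`, `rss`, `tss` and `r_squared` are illustrative and do not appear in the original analysis.

```r
# Reload the data so this block is self-contained
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = cars)

rss <- sum(residuals(fit)^2)               # variation left unexplained by the model
tss <- sum((cars$mpg - mean(cars$mpg))^2)  # total variation of mpg around its mean
r_squared <- 1 - rss / tss                 # share of variation explained

r_squared
summary(fit)$r.squared                     # value reported by summary(), for comparison
```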

We select the best model by fitting a sequence of nested models with progressively more variables and comparing them with ANOVA

modelFit1 <- lm(mpg ~ am , data = mtcars) 
modelFit2 <- lm(mpg ~ am + wt, data = mtcars) 
modelFit3 <- lm(mpg ~ am + wt + hp , data = mtcars) 
modelFit4 <- lm(mpg ~ am + wt + hp + cyl, data = mtcars) 
modelFit5 <- lm(mpg ~ am + wt + hp + cyl + disp, data = mtcars) 
modelFit6 <- lm(mpg ~ ., data = mtcars) 

anova(modelFit1, modelFit2, modelFit3, modelFit4, modelFit5, modelFit6)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + hp
## Model 4: mpg ~ am + wt + hp + cyl
## Model 5: mpg ~ am + wt + hp + cyl + disp
## Model 6: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 278.32  1    442.58 55.1371 2.129e-06 ***
## 3     28 180.29  1     98.03 12.2126  0.003259 ** 
## 4     26 151.03  2     29.27  1.8230  0.195569    
## 5     25 150.41  1      0.62  0.0768  0.785409    
## 6     15 120.40 10     30.01  0.3738  0.939655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

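The ANOVA result shows that adding weight (Model 2) and then horsepower (Model 3) each significantly improves the fit (p-values below 0.05), while the further additions in Models 4 to 6 do not (p-values of 0.20 or higher). We therefore select modelFit3, which includes transmission type (am), weight (wt) and horsepower (hp) as the independent variables.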
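As an illustration of where the Model 3 row of the table comes from, the F statistic for adding horsepower can be rebuilt by hand. When several models are passed to `anova` at once, the drop in residual sum of squares is scaled by the residual mean square of the largest model (Model 6 here), so a direct `anova` of only the two smaller models would report a different F value. The names `cars`, `fit2`, `fit3`, `fit_full`, `f_hp` and `p_hp` are illustrative only.

```r
# Reload the data and redo the factor conversions so this block is self-contained
cars <- datasets::mtcars
for (v in c("cyl", "vs", "gear", "carb")) cars[[v]] <- factor(cars[[v]])
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit2     <- lm(mpg ~ am + wt, data = cars)
fit3     <- lm(mpg ~ am + wt + hp, data = cars)
fit_full <- lm(mpg ~ ., data = cars)

# Drop in residual sum of squares from adding hp (1 extra parameter)
ss_hp <- deviance(fit2) - deviance(fit3)

# Residual mean square of the largest model, used as the common scale
mse_full <- deviance(fit_full) / df.residual(fit_full)

f_hp <- (ss_hp / 1) / mse_full
p_hp <- pf(f_hp, df1 = 1, df2 = df.residual(fit_full), lower.tail = FALSE)

c(F = f_hp, p = p_hp)
```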

The summary of the selected model is shown here

summary(modelFit3)
## 
## Call:
## lm(formula = mpg ~ am + wt + hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4221 -1.7924 -0.3788  1.2249  5.5317 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.002875   2.642659  12.867 2.82e-13 ***
## amManual     2.083710   1.376420   1.514 0.141268    
## wt          -2.878575   0.904971  -3.181 0.003574 ** 
## hp          -0.037479   0.009605  -3.902 0.000546 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared:  0.8399, Adjusted R-squared:  0.8227 
## F-statistic: 48.96 on 3 and 28 DF,  p-value: 2.908e-11

The model has an R^2 of 0.8399, meaning that it explains about 84% of the variability in mpg. Both weight and horsepower have p-values below 0.05, indicating a statistically significant relationship with mpg. With these variables included, manual transmission is associated with an average increase of only 2.08 miles per gallon over automatic transmission, holding weight and horsepower constant. This coefficient has a p-value of 0.141, so the adjusted difference is not statistically significant at the 0.05 level.
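One way to quantify the uncertainty around the 2.08 mpg estimate is a confidence interval for the transmission coefficient. The following sketch uses `confint` on a refit of the same model; `cars`, `fit3` and `ci_am` are illustrative names, not objects from the original analysis.

```r
# Reload the data so this block is self-contained
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit3 <- lm(mpg ~ am + wt + hp, data = cars)

# 95% confidence interval for the manual-vs-automatic difference,
# adjusted for weight and horsepower
ci_am <- confint(fit3, "amManual", level = 0.95)
ci_am
```

Since the coefficient's p-value in the summary above is larger than 0.05, this interval includes zero.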

Residual plot and Diagnostics

par(mfrow = c(2,2))
plot(modelFit3)

Appendix

Here is a boxplot of miles per gallon by transmission type

boxplot(mpg ~ am, data = mtcars, xlab = "Transmission Type", ylab = "Miles per Gallon")

The plot also shows that manual transmission cars tend to have higher mpg than automatic transmission cars.

Here is a pairs plot showing the pairwise relationships between mpg and all the other variables in the dataset

pairs(mpg ~ ., data = mtcars)