This analysis focuses on the difference in MPG between automatic and manual transmissions. The paper first sets up the environment, loads the data, and performs a short exploratory analysis, which suggests a difference in MPG between the two groups. It then fits a simple linear model with ‘am’ as the only regressor. The result shows a clear difference in MPG between the two groups, but the model does not fit very well. The paper then selects two more regressors (‘wt’ and ‘qsec’) based on the correlations and fits a multivariable linear model. The result again shows that manual transmission gives better MPG than automatic, and this model fits well. Finally, the paper performs a residual analysis.
First, set up the working environment and load the data:
setwd("C:/Study/Coursera/1 Data-Science/2 RStudio/7 Class 7/Coursera_DataScience_Class7Regression_FinalProject")
library(UsingR)
library(ggplot2)
library(dplyr)
library(grid)
library(gridExtra)
data(mtcars)
Then take a quick look at the data:
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
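For a quick structural check (an optional sketch, not part of the original write-up), str() confirms that all 11 columns, including am, are stored as numeric:
str(mtcars)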
To see the difference in MPG between the two transmission types, draw a boxplot:
g <- ggplot(mtcars, aes(x = factor(am), y = mpg, fill = factor(am))) +
    geom_boxplot() +
    xlab('Transmission (0=automatic, 1=manual)') +
    ylab('Miles Per Gallon (MPG)')
g
The x-axis is the transmission type (0 stands for automatic, 1 for manual) and the y-axis is Miles Per Gallon (MPG). The boxplot shows a clear difference in MPG between the two transmission types.
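To put numbers on the gap seen in the boxplot, here is a quick per-group summary (a minimal sketch using dplyr, which is already loaded; the column names are just illustrative):
mtcars %>%
    group_by(am) %>%
    summarize(count = n(), mean.mpg = mean(mpg), sd.mpg = sd(mpg))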
To quantify the relationship between MPG and transmission, the most straightforward approach is to fit a simple linear regression of mpg on am and look at the coefficients and \(R^2\).
fitAm <- lm(mpg ~ am, data = mtcars)
summary(fitAm)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am 7.244939 1.764422 4.106127 2.850207e-04
summary(fitAm)$adj.r.squared
## [1] 0.3384589
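Because am is the only regressor and is coded 0/1, the fitted model simply reproduces the two group means. A minimal check (a supplementary sketch, not part of the original write-up):
coef(fitAm)[1]                        # fitted MPG for automatic cars (am = 0)
sum(coef(fitAm))                      # fitted MPG for manual cars (am = 1)
with(mtcars, tapply(mpg, am, mean))   # raw group means for comparison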
The intercept coefficient is the mean MPG of automatic cars (am = 0). The am coefficient is the increase in mean MPG of manual cars over automatic cars. The adjusted \(R^2\) is only about 0.34, so this simple model does not explain the data well. Next, compute a confidence interval for the am coefficient:
alpha <- 0.05
n <- nrow(mtcars)                        # number of observations (not length(), which counts columns)
pe <- coef(summary(fitAm))["am", "Estimate"]
se <- coef(summary(fitAm))["am", "Std. Error"]
tvalue <- qt(1 - alpha/2, n - 2)         # residual df: n minus intercept and one regressor
round(pe + c(-1, 1) * (se * tvalue), 4)
## [1]  3.6415 10.8484
The confidence interval above does not include 0, so we can reject the null hypothesis in favor of the alternative: there is a significant difference in MPG between the two transmission groups at alpha = 0.05.
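As a cross-check (a supplementary sketch, not part of the original analysis), a two-sample t-test of mpg by transmission tests the same null hypothesis of equal group means; t.test() applies the Welch correction by default, so its statistic differs slightly from the regression t value:
t.test(mpg ~ am, data = mtcars)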
Next, we move to a multivariable linear model. First, fit a model that uses all the other variables as regressors:
fitAll <- lm(mpg ~ ., data = mtcars)
summary(fitAll)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337416 18.71788443 0.6573058 0.51812440
## cyl -0.11144048 1.04502336 -0.1066392 0.91608738
## disp 0.01333524 0.01785750 0.7467585 0.46348865
## hp -0.02148212 0.02176858 -0.9868407 0.33495531
## drat 0.78711097 1.63537307 0.4813036 0.63527790
## wt -3.71530393 1.89441430 -1.9611887 0.06325215
## qsec 0.82104075 0.73084480 1.1234133 0.27394127
## vs 0.31776281 2.10450861 0.1509915 0.88142347
## am 2.52022689 2.05665055 1.2254035 0.23398971
## gear 0.65541302 1.49325996 0.4389142 0.66520643
## carb -0.19941925 0.82875250 -0.2406258 0.81217871
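None of the individual coefficients in this full model is significant at the 5% level, which suggests the regressors overlap heavily. As an optional cross-check of the manual selection below (a sketch, not part of the original analysis), AIC-based stepwise selection with step() from base R can also be used to pick a smaller model:
fitStep <- step(fitAll, direction = "both", trace = 0)   # stepwise selection by AIC
summary(fitStep)$coef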
Then compute each variable's correlation with mpg and with wt:
corCars <- cor(mtcars)
data.frame(Cor.mpg = corCars[,which(names(mtcars)=='mpg')], Cor.wt = corCars[,which(names(mtcars)=='wt')])
## Cor.mpg Cor.wt
## mpg 1.0000000 -0.8676594
## cyl -0.8521620 0.7824958
## disp -0.8475514 0.8879799
## hp -0.7761684 0.6587479
## drat 0.6811719 -0.7124406
## wt -0.8676594 1.0000000
## qsec 0.4186840 -0.1747159
## vs 0.6640389 -0.5549157
## am 0.5998324 -0.6924953
## gear 0.4802848 -0.5832870
## carb -0.5509251 0.4276059
The regressors are selected according to the following rules: keep ‘am’ since it is the variable of interest; add variables strongly correlated with mpg; and avoid regressors that are strongly correlated with each other (for example, ‘cyl’, ‘disp’ and ‘hp’ are all highly correlated with ‘wt’, so only ‘wt’ is kept, while ‘qsec’ has a low correlation with ‘wt’). So we get our regressors: ‘wt’, ‘qsec’ and ‘am’. Then fit the multivariable linear regression model with the selected variables:
fitNew <- lm(mpg ~ am + qsec + wt, data = mtcars)
summary(fitNew)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## am 2.935837 1.4109045 2.080819 4.671551e-02
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
summary(fitNew)$adj.r.squared
## [1] 0.8335561
As shown above, the adjusted \(R^2\) is about 0.83, so this model fits much better. Use nested models to test whether each added regressor is worthwhile:
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- update(fit1, mpg ~ wt + qsec)
fit3 <- update(fit1, mpg ~ wt + qsec + am)
anova(fit1,fit2,fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + qsec
## Model 3: mpg ~ wt + qsec + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 278.32
## 2 29 195.46 1 82.858 13.7048 0.0009286 ***
## 3 28 169.29 1 26.178 4.3298 0.0467155 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the ANOVA table shows, each added regressor gives a significant improvement. Calculate the confidence interval for the am coefficient again:
pe <- coef(summary(fitNew))["am", "Estimate"]
se <- coef(summary(fitNew))["am", "Std. Error"]
tvalue <- qt(1 - alpha/2, n - 4)         # residual df: n minus intercept and three regressors
round(pe + c(-1, 1) * (se * tvalue), 4)
## [1] 0.0457 5.8259
The confidence interval still does not contain 0, so the previous conclusion holds. Finally, perform a residual analysis:
par(mfrow = c(2,2))
plot(fitNew)
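Beyond the four standard diagnostic plots, leverage and influence can be checked numerically as well (a supplementary sketch using the influence measures in base R):
# cars with the highest leverage in the chosen model
head(sort(hatvalues(fitNew), decreasing = TRUE), 3)
# cars whose removal would change the am coefficient the most
head(sort(abs(dfbetas(fitNew)[, "am"]), decreasing = TRUE), 3)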