This article is part of the final project for the "Regression Models" course on Coursera.
The aim is to analyse the “mtcars” dataset and answer two questions:
1. Is an automatic or manual transmission better for MPG?
2. What is the quantified MPG difference between automatic and manual
transmissions?
Executive summary
An analysis of the “mtcars” dataset found that the manual transmission
group has significantly better mpg than the automatic group, with an average
difference of about 7.24 miles/gallon.
Setting up the environment for data analysis
data("mtcars") ## Dataset used in this analysis.
head(mtcars,n=3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
According to ?mtcars, the "am" column is a binary variable where 0 means automatic transmission
and 1 means manual transmission.
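The 0/1 coding is easy to misread in later output, so one option is to keep a labelled copy of the column. The following is a minimal sketch, not part of the original analysis; the names `cars` and `transmission` are hypothetical and introduced only for illustration.

```r
data("mtcars")

# Work on a copy so the original mtcars data frame stays untouched.
cars <- mtcars

# Hypothetical helper column: am recoded as a labelled factor
# (0 = automatic, 1 = manual, as documented in ?mtcars).
cars$transmission <- factor(cars$am,
                            levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Number of cars in each transmission group.
table(cars$transmission)
```

The rest of the analysis keeps using the numeric `am` column, so this copy is purely a readability aid.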
Step 1 Exploratory analysis
coef(lm(mpg ~ ., data = mtcars))
## (Intercept) cyl disp hp drat wt
## 12.30337416 -0.11144048 0.01333524 -0.02148212 0.78711097 -3.71530393
## qsec vs am gear carb
## 0.82104075 0.31776281 2.52022689 0.65541302 -0.19941925
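Coefficient interpretation: The result above shows
that six predictors (disp, drat, qsec, vs, am and gear)
have positive coefficients for mpg, and four
predictors (cyl, hp, wt and carb) have negative coefficients.
Each coefficient is the estimated change in mpg per unit increase in that
predictor with the other predictors held constant, so it is not the same as
a simple correlation with mpg.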
This overview also shows the size of each predictor's coefficient.
Among the positive coefficients, the largest is “am” with an estimate of
2.52, followed by qsec, drat, gear, vs and disp; the largest coefficient
in absolute value is wt at -3.72. Coefficient sizes depend on each
predictor's units, so they do not rank the predictors by importance. The
“am” coefficient of 2.52 suggests, at first glance, that manual transmissions
have better mpg than automatic transmissions. Let's
explore further by comparing the average mpg of the two groups, manual vs
automatic.
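Raw coefficients are measured per unit of each predictor, which makes them hard to compare directly. One way to put them on a common footing is to standardise the predictors before fitting. This is only a sketch under that assumption and was not part of the original analysis; `scaled` and `std_coef` are hypothetical names, and standardising binary columns such as vs and am is a simplification.

```r
data("mtcars")

predictors <- setdiff(names(mtcars), "mpg")

# Copy of the data with every predictor centred and scaled to unit variance.
scaled <- mtcars
scaled[predictors] <- scale(mtcars[predictors])

# Each coefficient is now the change in mpg per one standard deviation
# of the predictor (the intercept is dropped).
std_coef <- coef(lm(mpg ~ ., data = scaled))[-1]

# Predictors ordered by the absolute size of the standardised coefficient.
sort(abs(std_coef), decreasing = TRUE)
```

Each standardised coefficient equals the raw coefficient multiplied by the standard deviation of its predictor, so the signs are unchanged and only the ordering by magnitude can differ.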
Exploratory plot
Interestingly, disp has a positive coefficient in the full model, but plotting
MPG against disp shows a negative slope. What is the coefficient of
disp when it is the only predictor?
real <- coef(lm(mpg ~ disp, data = mtcars))[2]  ## coefficient of disp as the only predictor
The coefficient of disp on its own is -0.0412151. The sign changes in the
full model because disp is correlated with other predictors (for example cyl
and wt), so adjusting for at least one of them reverses the sign of the disp
coefficient from negative to positive.
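The sign change can be checked side by side. The sketch below compares the disp coefficient from the single-predictor model with the one from the full model, looks at how strongly disp is correlated with a few other predictors, and redraws the scatterplot with its fitted line. The choice of cyl, hp and wt for the correlation check is an assumption made for illustration, and the names `marginal` and `adjusted` are hypothetical.

```r
data("mtcars")

# disp as the only predictor versus disp adjusted for all other predictors.
marginal <- unname(coef(lm(mpg ~ disp, data = mtcars))["disp"])
adjusted <- unname(coef(lm(mpg ~ ., data = mtcars))["disp"])
c(marginal = marginal, adjusted = adjusted)

# Correlation of disp with a few other predictors (illustrative selection).
round(cor(mtcars$disp, mtcars[, c("cyl", "hp", "wt")]), 2)

# Scatterplot of mpg against disp with the single-predictor regression line.
plot(mpg ~ disp, data = mtcars)
abline(lm(mpg ~ disp, data = mtcars), lwd = 2)
```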
What is the mean MPG by am group?
library(dplyr)
mean_mpg_am <- mtcars %>%
  select(mpg, am) %>%
  group_by(am) %>%
  summarise(mpg_mean = mean(mpg))
mean_mpg_am
## # A tibble: 2 x 2
## am mpg_mean
## <dbl> <dbl>
## 1 0 17.1
## 2 1 24.4
The manual transmission group averages 24.4 miles/gallon, while the automatic
group averages 17.1 miles/gallon.
The average difference in mpg between the two groups is about 7.24
miles/gallon.
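The same difference can be read from a regression with am as the only predictor: the intercept is the automatic-group mean and the slope is the manual-minus-automatic difference. This is a small sketch to illustrate that link, with the hypothetical names `fit_am`, `group_means` and `mean_diff`.

```r
data("mtcars")

# Simple linear regression of mpg on the 0/1 transmission indicator.
fit_am <- lm(mpg ~ am, data = mtcars)
coef(fit_am)

# Group means computed directly, for comparison with the coefficients.
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
mean_diff <- unname(group_means["1"] - group_means["0"])
mean_diff
```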
Is the difference between these means significant? Let's compare them with a Student's t-test.
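## am is numeric, so compare with 0 and 1 rather than with strings,
## and pull mpg out as a plain numeric vector for t.test().
automatic_mpg <- mtcars %>%
  filter(am == 0) %>%
  pull(mpg)
manual_mpg <- mtcars %>%
  filter(am == 1) %>%
  pull(mpg)
t.test(automatic_mpg, manual_mpg)$p.value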
## [1] 0.001373638
The t-test (R's t.test() uses the Welch variant by default) gives a p-value
below 0.05, so we conclude that there is a significant difference in average
mpg between automatic and manual transmissions.
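The test can also be written with the formula interface, which avoids building the two vectors by hand and gives a confidence interval for the difference. This is a sketch of an equivalent call rather than the original code; `welch` is a hypothetical name.

```r
data("mtcars")

# Welch two-sample t-test of mpg by transmission group (am = 0 vs am = 1).
welch <- t.test(mpg ~ am, data = mtcars)

welch$p.value    # same test as the two-vector call
welch$conf.int   # 95% interval for mean(am = 0) minus mean(am = 1)
```

Because the interval is for automatic minus manual, an interval lying entirely below zero points the same way as the small p-value: manual cars have the higher mean mpg.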
Conclusion: In this unadjusted comparison, the manual transmission group has
significantly better mpg than the automatic group, with an average difference
in mpg of about 7.24 miles/gallon.
Step 2 Create model
mpgmodel <- lm(mpg ~ ., data = mtcars)  ## Multivariable model with all predictors.
summary(mpgmodel)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
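Confidence intervals make the uncertainty in the summary table easier to see than p-values alone. The sketch below refits the same full model and extracts the 95% interval for the am coefficient; `fit_full` and `ci_am` are hypothetical names used only here.

```r
data("mtcars")

# Same full model as above, refitted so this block runs on its own.
fit_full <- lm(mpg ~ ., data = mtcars)

# 95% confidence interval for the adjusted manual-vs-automatic difference.
ci_am <- confint(fit_full, "am", level = 0.95)
ci_am
```

An interval that includes zero corresponds to the p-value above 0.05 reported for am in the table.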
Residual plot and diagnostic of model
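x <- mtcars$mpg
e <- resid(mpgmodel)
plot(e ~ mpg, data = mtcars)            ## residuals against observed mpg
abline(h = 0, col = "black", lwd = 3)   ## zero-residual reference line
## Draw a red segment from each residual to the zero line.
for (i in 1:nrow(mtcars)) {
  lines(c(x[i], x[i]), c(e[i], 0), col = "red", lwd = 2)
}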
The residuals look balanced around zero. Next, find the largest residual.
e[which.max(e)]
## Fiat 128
## 4.627094
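Residuals are more commonly plotted against the fitted values than against the observed response, often alongside a normal Q-Q plot. The following is a sketch of those two checks as an alternative view, not the original diagnostic; `fit` is a hypothetical name for a refit of the same model.

```r
data("mtcars")

fit <- lm(mpg ~ ., data = mtcars)

# Two panels side by side; restore the previous layout afterwards.
op <- par(mfrow = c(1, 2))

# Residuals against fitted values: look for curvature or a funnel shape.
plot(fitted(fit), resid(fit),
     xlab = "Fitted mpg", ylab = "Residual",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the line suggest roughly normal residuals.
qqnorm(resid(fit))
qqline(resid(fit))

par(op)

# Car with the largest positive residual.
names(which.max(resid(fit)))
```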
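Discussion: This is a multivariable model that aims
to predict mpg from all 10 predictors. Its weakness is that
the p-value of every individual predictor is greater than 0.05,
so we fail to reject the null hypothesis that each predictor, given the
others, has no effect on mpg. The overall F-test is nevertheless highly
significant (p-value 3.793e-07), which suggests the predictors are jointly
informative but too correlated with one another for their individual effects
to be separated.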
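One way to check whether correlated predictors are inflating the standard errors is the variance inflation factor (VIF), computed for each predictor as 1 / (1 - R²) from a regression of that predictor on all the others. This is a minimal base-R sketch of that idea and not part of the original analysis; `vif` is a hypothetical name, and reading a value above about 10 as a warning sign is a rule of thumb rather than a formal test.

```r
data("mtcars")

predictors <- setdiff(names(mtcars), "mpg")

# For each predictor, regress it on the remaining predictors and
# convert the resulting R-squared into a variance inflation factor.
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})

round(sort(vif, decreasing = TRUE), 2)
```

Large values would support fitting a smaller model with fewer, less correlated predictors before interpreting the am coefficient.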