This project explores the relationship between miles per gallon (MPG) and other variables in the mtcars data set. The analysis attempts to answer the following questions:
a) Is an automatic or manual transmission better for MPG?
b) Can we quantify the MPG difference between automatic and manual transmissions?
data("mtcars")
library(car)
dim(mtcars) # no of rows and columns
## [1] 32 11
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
From this we can see that the data set consists of 32 rows (observations) and 11 variables.
All variables are stored as numeric, which is not appropriate for the categorical ones, so the next step is to convert those variables to factors.
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
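As a quick optional check (not part of the original output), one could confirm the conversions by inspecting the column classes:
sapply(mtcars, class)  # cyl, vs, am, gear and carb should now be factors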
Let's check a basic summary of the data; a corresponding plot can be found in the Appendix below.
table(mtcars$am) # Number of automatic and manual transmission vehicles
##
## Automatic Manual
## 19 13
aggregate(mpg ~ am, data = mtcars, mean) # Mean MPG per transmission type
## am mpg
## 1 Automatic 17.14737
## 2 Manual 24.39231
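To gauge the spread around these means, one could also compute the per-group standard deviations (an optional check, not shown in the original output):
aggregate(mpg ~ am, data = mtcars, sd)  # MPG spread per transmission type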
From this we can say that automatic-transmission cars travel fewer miles per gallon on average than manual-transmission cars, which is also confirmed visually by the boxplot in the Appendix.
Below we test whether the difference is statistically significant.
Null hypothesis: there is no difference in mean MPG between the two transmission types.
Auto <- mtcars[mtcars$am == "Automatic",]$mpg
NonAuto <- mtcars[mtcars$am == "Manual",]$mpg
t.test(Auto, NonAuto)
##
## Welch Two Sample t-test
##
## data: Auto and NonAuto
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
The t-test gives a p-value of 0.001374, so we reject the null hypothesis at the 5% significance level and conclude that the difference is significant.
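The same test can be run through the formula interface, which also makes it easy to pull out the p-value and confidence interval programmatically (a minimal equivalent sketch, not part of the original output):
res <- t.test(mpg ~ am, data = mtcars)
res$p.value   # should match the p-value reported above
res$conf.int  # 95% CI for the automatic-minus-manual MPG difference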
We start with a very simple model: mpg regressed on am (transmission type).
fitone <- lm(mpg ~ am, data = mtcars)
summary(fitone)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
This shows that the average MPG for automatic transmissions is 17.1, while manual transmissions average 7.2 MPG higher. The adjusted R-squared is 33.85%, so the model explains only a small share of the variance.
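To attach an interval to that 7.2 MPG estimate, one could inspect the coefficient confidence intervals (an optional check, not shown in the original output):
confint(fitone)  # 95% CIs for the intercept and the amManual coefficient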
Because the simple model explains so little variance, we examine other variables that might be relevant and build a multivariable linear regression.
fitall <- lm(mpg ~ ., data = mtcars)
vif(fitall)
## GVIF Df GVIF^(1/(2*Df))
## cyl 128.120962 2 3.364380
## disp 60.365687 1 7.769536
## hp 28.219577 1 5.312210
## drat 6.809663 1 2.609533
## wt 23.830830 1 4.881683
## qsec 10.790189 1 3.284842
## vs 8.088166 1 2.843970
## am 9.930495 1 3.151269
## gear 50.852311 2 2.670408
## carb 503.211851 5 1.862838
summary(fitall)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vs1 1.93085 2.87126 0.672 0.5115
## amManual 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
Including all the variables improves the model: it now explains 77.9% of the variance (adjusted R-squared).
Further, from the variance inflation factors and the model summary we can see that some regressors are highly correlated, so we can eliminate them to reduce the standard errors (an automated cross-check is sketched after this list). The variables dropped are:
* disp - Displacement
* vs - Engine shape (V-shaped or straight)
* gear - Number of forward gears
* drat - Rear axle ratio
* carb - Number of carburetors
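As a cross-check on this manual variable elimination, one could let an AIC-based stepwise search choose the regressors automatically (a sketch only, not part of the original analysis; it assumes the factor conversions above have already been applied):
stepfit <- step(lm(mpg ~ ., data = mtcars), direction = "both", trace = FALSE)
formula(stepfit)  # regressors retained by the AIC search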
bestfit <- lm(mpg ~ cyl + hp + wt + qsec + am, data = mtcars)
summary(bestfit)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9511 -1.4244 -0.1767 1.3666 4.2187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.57617 11.27271 1.914 0.0671 .
## cyl6 -1.90950 1.72992 -1.104 0.2802
## cyl8 -0.22716 2.87047 -0.079 0.9376
## hp -0.02481 0.01515 -1.637 0.1141
## wt -2.96274 0.97728 -3.032 0.0056 **
## qsec 0.61917 0.55987 1.106 0.2793
## amManual 2.83270 1.67020 1.696 0.1023
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.4 on 25 degrees of freedom
## Multiple R-squared: 0.8721, Adjusted R-squared: 0.8414
## F-statistic: 28.42 on 6 and 25 DF, p-value: 5.196e-10
As expected, the model improves markedly: the adjusted R-squared rises to 84.14% and the residual standard error drops from 2.8 to 2.4.
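To directly quantify the transmission effect under the adjusted model, and to confirm that dropping the correlated regressors reduced collinearity, one could add these optional checks (not shown in the original output):
confint(bestfit)["amManual", ]  # 95% CI for the adjusted manual-vs-automatic MPG difference
vif(bestfit)                    # GVIFs should be much lower than for fitall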
Let's run an ANOVA comparing the three models to see whether they differ significantly from one another.
anova(fitone, bestfit, fitall)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + qsec + am
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 25 143.98 5 576.91 14.3746 2.886e-05 ***
## 3 15 120.40 10 23.58 0.2938 0.972
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This gives a p-value of 2.886e-05, so we can claim that the bestfit model is significantly better than the simple fitone model. Checking the residuals (Appendix - Diagnostic plots), we see they are approximately normally distributed and homoskedastic.
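Beyond the visual check, formal tests could back up the normality and homoskedasticity claims (a sketch of optional checks; ncvTest comes from the already-loaded car package):
shapiro.test(residuals(bestfit))  # Shapiro-Wilk test for residual normality
ncvTest(bestfit)                  # score test for non-constant variance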
This section contains supporting plots for our analysis.
boxplot(mpg ~ am, data = mtcars, col = c("red", "orange"),
        main = "MPG vs Transmission Type", ylab = "Miles Per Gallon", xlab = "Transmission Type")
The boxplot shows a clear difference, consistent with the t-test carried out above.
Let's produce some diagnostic plots for the best model according to our analysis.
par(mfrow = c(2, 2))
plot(bestfit)
These plots indicate that the linear regression assumptions are reasonably met.