Executive Summary

This analysis examines the relationship between fuel efficiency (mpg) and transmission mode, using the mtcars dataset. The main findings are:

  • Vehicles with manual transmission have a higher mean mpg than vehicles with automatic transmission.
  • The best multivariate linear regression model for predicting mpg uses transmission, weight (wt) and quarter-mile time (qsec) as regressors.

require(dplyr)
require(ggplot2)
require(GGally)
data("mtcars")

Data Cleaning and exploratory data analysis

mtcars$am <- factor(mtcars$am, 
                    levels = c(0,1), 
                    labels = c("Automatic", "Manual"))
g <- ggplot(mtcars, aes(x = am, y = mpg))
g <- g + geom_boxplot(colour = "steelblue", aes(fill=am))
g + labs(x = "Transmission Mode", y = "Miles per Gallon (mpg)", title = "MPG Comparison")

  • Mean Comparison
mean(mtcars$mpg)
## [1] 20.09062
aggdata <- aggregate(mtcars$mpg, 
                     by=list(Transmission = mtcars$am), 
                     FUN= mean)
names(aggdata) = c("Transmission", "Mean")
aggdata
##   Transmission     Mean
## 1    Automatic 17.14737
## 2       Manual 24.39231

Finding the best model fit

fit1 <- lm(mpg ~ am, data = mtcars)
summary(fit1)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## amManual     7.244939   1.764422  4.106127 2.850207e-04
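  • The amManual coefficient shows that cars with manual transmission average about 7.24 mpg more than cars with automatic transmission (17.15 mpg), and the difference is statistically significant (p ≈ 0.0003). Taken on its own, manual transmission is associated with better mpg.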
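With a single two-level factor as the only regressor, the slope is simply the difference between the two group means. A minimal sketch in base R that checks this, working on a local copy named cars (a name chosen here for illustration, so that the mtcars object used in the analysis stays untouched):

```r
# Work on a copy so the analysis data is not modified (cars is an illustrative name)
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Difference between the group means of mpg
group_means <- tapply(cars$mpg, cars$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

# Slope of the single-regressor model
slope <- unname(coef(lm(mpg ~ am, data = cars))["amManual"])

# TRUE when both quantities agree up to numerical tolerance
all.equal(mean_diff, slope)
```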

Explanation Power of this model

summary(fit1)$r.squared
## [1] 0.3597989
summary(fit1)$adj.r.squared
## [1] 0.3384589
  • Overall, the model that regresses mpg on transmission mode alone is not a strong one: it explains only about 36% of the variation in miles per gallon (adjusted R-squared ≈ 0.34). The next section therefore explores a more effective model for mpg.
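The adjusted R-squared penalises R-squared for the number of regressors. As a rough sketch, it can be recomputed from r.squared with the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1); the names cars, n, p and adj_r2 below are illustrative and not part of the analysis:

```r
# Local copy; am is left as 0/1, which gives the same fit as the factor version
cars <- datasets::mtcars
fit <- lm(mpg ~ am, data = cars)

r2 <- summary(fit)$r.squared
n <- nrow(cars)   # number of observations
p <- 1            # number of regressors, excluding the intercept

# Adjusted R-squared computed by hand
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# TRUE when the hand computation matches the value reported by summary()
all.equal(adj_r2, summary(fit)$adj.r.squared)
```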

Multivariate regression model building

fit2 <- update(fit1, mpg ~ am + wt)
fit3 <- update(fit2, mpg ~ am + wt + qsec)
fit4 <- update(fit3, mpg ~ am + wt + qsec + gear)
fit5 <- update(fit4, mpg ~ am + wt + qsec + gear + carb)
# Collect the adjusted R-squared of each model in a one-row matrix
comptab <- matrix(sapply(list(fit1, fit2, fit3, fit4, fit5),
                         function(fit) summary(fit)$adj.r.squared),
                  nrow = 1)
colnames(comptab) <- c("Fit1", "Fit2", "Fit3", "Fit4", "Fit5")
rownames(comptab) <- "Adj R Sqr"
comptab
##                Fit1      Fit2      Fit3      Fit4      Fit5
## Adj R Sqr 0.3384589 0.7357889 0.8335561 0.8275166 0.8323171
  • The comparison of adjusted R-squared values shows that model 3, which uses transmission mode, weight and qsec as independent variables, is the best fit: it is the simplest model that reaches the highest explanatory power for predicting mpg.
  • The additional variables were selected using the correlations shown in the appendix, to limit the problem of multicollinearity.
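One common way to quantify multicollinearity among the chosen regressors is the variance inflation factor (VIF). The analysis itself relies on the correlation matrix, so the following is only a sketch of a complementary check, computed by hand in base R on a local copy (cars, predictors and vif are illustrative names):

```r
# Local copy; am stays numeric (0/1) here
cars <- datasets::mtcars
predictors <- c("am", "wt", "qsec")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the others
vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = cars))$r.squared
  1 / (1 - r2)
})
vif
```

Values close to 1 indicate little shared variance with the other regressors; larger values indicate more overlap.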

Interpretation of Best fit model i.e. (mpg ~ am + wt + qsec)

summary(fit3)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## amManual     2.935837  1.4109045  2.080819 4.671551e-02
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
# Store the coefficient table for the confidence intervals in the appendix
sumCoef <- summary(fit3)$coefficients
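  • Interpretation of the multivariate regression coefficients:
  • Holding weight and qsec constant, mean mpg is about 2.94 higher for manual transmission than for automatic transmission (p ≈ 0.047).
  • Holding the other variables constant, mpg decreases by about 3.92 for every unit increase in weight (wt is measured in 1000 lbs).
  • Holding the other variables constant, mpg increases by about 1.23 for every additional second of quarter-mile time (qsec).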
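The "keeping other variables constant" reading can be illustrated with predict(). The sketch below refits the same formula on a local copy and compares two hypothetical cars that differ only in transmission; new_cars and its values (the sample means of wt and qsec) are assumptions made for illustration:

```r
# Local copy with am as a labelled factor, as in the analysis
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ am + wt + qsec, data = cars)

# Two hypothetical cars, identical except for the transmission
new_cars <- data.frame(
  am   = factor(c("Automatic", "Manual"), levels = levels(cars$am)),
  wt   = mean(cars$wt),
  qsec = mean(cars$qsec)
)
pred <- predict(fit, newdata = new_cars)

# The gap between the two predictions equals the amManual coefficient
unname(diff(pred))
```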

Appendix

Analysis of Variance Test for various models

anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + qsec
## Model 4: mpg ~ am + wt + qsec + gear
## Model 5: mpg ~ am + wt + qsec + gear + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 278.32  1    442.58 72.6616 5.273e-09 ***
## 3     28 169.29  1    109.03 17.9010 0.0002554 ***
## 4     27 169.16  1      0.12  0.0201 0.8882207    
## 5     26 158.36  1     10.80  1.7730 0.1945735    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The p-values in the ANOVA table test whether the coefficient of each newly added variable is zero (i.e. whether or not the variable is necessary).
  • A p-value below 0.05 is evidence that the newly added coefficient is not zero, so the variable improves the model.
  • Adding wt (model 2) and qsec (model 3) gives significant improvements, whereas adding gear (model 4) and carb (model 5) does not. Hence model 3 is the best fit.
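The F-test behind each row can be reproduced by hand for a pair of nested models. This is a sketch for the step from mpg ~ am + wt to mpg ~ am + wt + qsec, with illustrative names (small, large, f_stat, p_value). Note that a pairwise comparison uses the residual variance of the larger of the two models, whereas the five-model table above scales by the residual variance of the largest model, so the F values differ slightly:

```r
# Local copy; am as 0/1 gives the same residual sums of squares as the factor version
cars <- datasets::mtcars
small <- lm(mpg ~ am + wt, data = cars)
large <- lm(mpg ~ am + wt + qsec, data = cars)

rss_small <- sum(resid(small)^2)
rss_large <- sum(resid(large)^2)
df_diff <- small$df.residual - large$df.residual   # number of added parameters

# Drop in RSS per added parameter, relative to the residual variance of the larger model
f_stat <- ((rss_small - rss_large) / df_diff) / (rss_large / large$df.residual)
p_value <- pf(f_stat, df_diff, large$df.residual, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```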

Identifying Correlation

# cor() needs numeric columns; the factor becomes 1 = Automatic, 2 = Manual
mtcars$am <- as.numeric(mtcars$am)
cor(mtcars)
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
  • The following variables are highly correlated in the dataset:
  • Cylinders and weight (r ≈ 0.78)
  • Cylinders and horsepower (r ≈ 0.83)
  • Cylinders and displacement (r ≈ 0.90)
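Rather than reading the matrix by eye, the strongly correlated pairs can be extracted programmatically. A minimal sketch, assuming a cut-off of 0.75 on the absolute correlation (the threshold and the names cars, cm and high_pairs are illustrative, not taken from the analysis):

```r
# Local copy of the raw data (all columns numeric)
cars <- datasets::mtcars
cm <- cor(cars)
threshold <- 0.75   # illustrative cut-off

# Keep each pair once by looking only at the upper triangle
idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Strongest correlations first
high_pairs[order(-abs(high_pairs$r)), ]
```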
#g <- ggpairs(mtcars, lower = list(continuous = "smooth"), params = c(method = "loess"))

Plotting residuals

# Residuals of the best-fit model plotted against weight
resid_df <- data.frame(x = mtcars$wt, y = resid(fit3))
g <- ggplot(resid_df, aes(x = x, y = y))
g <- g + geom_hline(yintercept = 0, linewidth = 2)  # use size = 2 on ggplot2 < 3.4.0
g <- g + geom_point(size = 7, colour = "black", alpha = 0.4)
g <- g + geom_point(size = 5, colour = "red", alpha = 0.4)
g <- g + xlab("Weight (1000 lbs)") + ylab("Residuals")
g

# Standard diagnostic plots for the best-fit model, arranged in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(fit3)

Confidence Intervals for Best fit model

sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit3$df.residual) * sumCoef[1, 2]
## [1] -4.638299 23.873860
sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit3$df.residual) * sumCoef[2, 2]
## [1] 0.04573031 5.82594408
sumCoef[3,1] + c(-1, 1) * qt(.975, df = fit3$df.residual) * sumCoef[3, 2]
## [1] -5.373334 -2.459673
sumCoef[4,1] + c(-1, 1) * qt(.975, df = fit3$df.residual) * sumCoef[4, 2]
## [1] 0.6345732 1.8171987
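The same 95% intervals can also be obtained in one call with confint(). The sketch below refits the model on a local copy (cars, coefs, t_crit and manual_ci are illustrative names) and checks that the estimate ± t × standard error computation agrees with it:

```r
# Local copy with am as a labelled factor, as in the analysis
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ am + wt + qsec, data = cars)

# Manual intervals: estimate -/+ t quantile * standard error
coefs <- summary(fit)$coefficients
t_crit <- qt(0.975, df = fit$df.residual)
manual_ci <- cbind(coefs[, "Estimate"] - t_crit * coefs[, "Std. Error"],
                   coefs[, "Estimate"] + t_crit * coefs[, "Std. Error"])

# TRUE when the manual intervals match confint() at the default 95% level
all.equal(unname(manual_ci), unname(confint(fit)))
```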