Executive Summary

This work explores the relationship between a set of variables and miles per gallon (MPG) in the mtcars automobile data set. It tries to answer the following two questions: 1. “Is an automatic or manual transmission better for MPG?” 2. “What is the MPG difference between automatic and manual transmissions?”

The analysis used a simple linear regression (mpg ~ am) and a multivariable regression (mpg ~ wt + qsec + am). Both models indicate that manual transmissions have higher MPG than automatic ones. In the simple model, the average MPG for automatic cars is 17.147, and manual cars average 7.245 MPG more (beta1). After adjusting for weight (wt) and quarter-mile time (qsec), the estimated difference shrinks to 2.936 MPG. Therefore, in this data set, manual transmission is better for MPG than automatic.

Exploratory data analysis is shown in the Appendix.

# Analysis of linear regression considering only mpg and am

data(mtcars)
mtcars2 <- mtcars
fit1 <- lm(mpg ~ am, data = mtcars2)
summary(fit1)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04
# inferential statistics for beta1. H0 hypothesis: beta1 = 0, H1: beta1 != 0
beta1_e <- summary(fit1)$coef[2,1]
beta1_sd <- summary(fit1)$coef[2,2]
n <- length(mtcars2$mpg)
pt(beta1_e/beta1_sd, df = n - 2, lower.tail = FALSE) # one-sided (upper-tail) probability of the t statistic
## [1] 0.0001425104
# Doubling this one-sided probability gives the two-sided p-value, 2.85e-04, which matches Pr(>|t|) above.
# It is below 0.05, so we reject H0 in favor of H1: the MPG difference between transmissions is significant at alpha = 0.05.

The intercept, 17.147, is the average MPG for automatic transmissions (am = 0). The slope beta1, 7.245, is the average increase in MPG for manual transmissions. The expression 17.147 + 7.245*am, evaluated at am = 1, gives the average MPG for manual transmissions, 24.392.
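The interpretation above can be checked directly. The following is a minimal sketch, not part of the original analysis: it assumes the built-in mtcars data, and the names `group_means`, `auto_mean`, `manual_mean`, and `tt` are introduced here only for illustration. With a single 0/1 regressor, the fitted coefficients reproduce the two group means, and the slope test is the same as a pooled-variance two-sample t-test.

```r
data(mtcars)
fit1 <- lm(mpg ~ am, data = mtcars)

# Group means of mpg: "0" = automatic, "1" = manual
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

b <- coef(fit1)
auto_mean   <- unname(b["(Intercept)"])            # mean MPG when am = 0
manual_mean <- unname(b["(Intercept)"] + b["am"])  # mean MPG when am = 1

# Pooled-variance two-sample t-test; same test as the slope test in fit1
tt <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# 95% confidence interval for the manual-vs-automatic difference
confint(fit1)["am", ]
```

The sign of the t statistic from `t.test` is reversed relative to the regression slope because it computes automatic minus manual; the magnitude and the two-sided p-value are the same.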

Multivariable regression: using step() from the stats package to select a model

library(stats)
fitall <- lm(mpg ~ ., data = mtcars2)
bestfit <- step(fitall, direction = "both", trace = FALSE)
summary(bestfit)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
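
Written out as an equation, the coefficient table above corresponds to the following fitted model (coefficients rounded to three decimals; wt is in units of 1000 lbs and qsec in seconds, per the mtcars documentation):

$$
\widehat{mpg} = 9.618 - 3.917\,wt + 1.226\,qsec + 2.936\,am
$$

Holding wt and qsec fixed, a manual transmission (am = 1) is associated with about 2.94 more MPG. As a rough worked interval, using the 0.975 quantile of the t distribution with 28 degrees of freedom (about 2.048):

$$
2.936 \pm 2.048 \times 1.411 \approx (0.05,\ 5.83)
$$

The interval barely excludes zero, which is consistent with the p-value of 0.0467 for am.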

Stepwise selection yields the model mpg ~ wt + qsec + am, with adjusted R2 = 0.8336. Next we compare bestfit with fit1 to check whether the added variables significantly improve the fit.
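By default, `step()` adds and drops terms so as to lower the AIC, so "best" here means best by that criterion among the models it visited. The sketch below is illustrative only: it refits the models on the built-in mtcars data and compares the AIC values that `step()` works with. The names `aic_full` and `aic_best` are introduced here and are not part of the original analysis.

```r
data(mtcars)
fitall  <- lm(mpg ~ ., data = mtcars)
bestfit <- step(fitall, direction = "both", trace = FALSE)

# extractAIC() returns c(equivalent df, AIC); step() compares the second element
aic_full <- extractAIC(fitall)[2]
aic_best <- extractAIC(bestfit)[2]

# Terms retained by the stepwise search
selected_terms <- attr(terms(bestfit), "term.labels")
selected_terms
```

Stepwise selection is a heuristic search, so the selected model is worth confirming with a nested-model test and residual diagnostics.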

anova(fit1, bestfit)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     28 169.29  2    551.61 45.618 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F test is highly significant (p = 1.55e-09), so adding wt and qsec together significantly improves on the model with am alone.
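For reference, the F statistic in the table can be reproduced by hand from the residual sums of squares and degrees of freedom of the two nested models (model 1: mpg ~ am, model 2: mpg ~ wt + qsec + am):

$$
F = \frac{(RSS_1 - RSS_2)/(df_1 - df_2)}{RSS_2/df_2} = \frac{(720.90 - 169.29)/(30 - 28)}{169.29/28} = \frac{275.805}{6.046} \approx 45.62
$$

This agrees with the reported value of 45.618 up to rounding of the inputs.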

Residual Plots and Diagnostics

library(ggfortify) #install.packages("ggfortify")
autoplot(bestfit, label.size = 2)

Based on the diagnostic plots:

1. “Residuals vs Fitted” shows randomly scattered values with no obvious pattern, which suggests that the residuals do not depend systematically on the fitted values.
2. “Normal Q-Q” has most points close to the line, which suggests that the residuals are approximately normally distributed.
3. “Scale-Location” has points scattered around the line, which suggests roughly constant variance.
4. “Residuals vs Leverage” shows only a few potential outliers and high-leverage points, toward the top and right of the plot.
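The visual reading above can be complemented with a few numeric checks. This is a hedged sketch rather than part of the original report: the cutoffs `2 * p / n` for leverage and `4 / n` for Cook's distance are common rules of thumb assumed here, and the names `sw`, `high_leverage`, and `influential` are illustrative.

```r
data(mtcars)
bestfit <- lm(mpg ~ wt + qsec + am, data = mtcars)

res <- residuals(bestfit)

# Shapiro-Wilk test of normality of the residuals (H0: residuals are normal)
sw <- shapiro.test(res)

# Leverage (hat values); they sum to the number of estimated coefficients
h <- hatvalues(bestfit)
p <- length(coef(bestfit))
n <- nrow(mtcars)
high_leverage <- names(h)[h > 2 * p / n]   # assumed rule-of-thumb cutoff

# Cook's distance flags observations that strongly influence the fit
cd <- cooks.distance(bestfit)
influential <- names(cd)[cd > 4 / n]       # assumed rule-of-thumb cutoff

sw$p.value
high_leverage
influential
```

Cars flagged by either cutoff are candidates for a closer look, not automatic exclusions.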

Appendix: exploratory data analysis

Basic exploration

?mtcars # check the variables (0 = automatic, 1 = manual)
str(mtcars2)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# check the mean and sd
mean_sd <- rbind(tapply(mtcars2$mpg, mtcars2$am, mean), tapply(mtcars2$mpg, mtcars2$am, sd))
colnames(mean_sd) <- c("automatic", "manual")
rownames(mean_sd) <- c("mean", "sd")
mean_sd
##      automatic    manual
## mean 17.147368 24.392308
## sd    3.833966  6.166504
# boxplot of mpg by transmission type
library(ggplot2)
mtcars2 <- mtcars
ggplot(data = mtcars2, aes(x = factor(am), y = mpg)) +
    geom_boxplot() +
    scale_x_discrete(labels = c("automatic", "manual"))

### Explore the correlations

cor(mtcars2)
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
# Every variable shows at least a moderate correlation with mpg (|r| ranges from 0.42 for qsec to 0.87 for wt)
library(GGally)
ggpairs(mtcars2)
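
The full correlation matrix is hard to scan, so a ranked view of the correlations with mpg can help. The following is a small sketch on the built-in mtcars data; the names `r_mpg` and `ranked` are introduced here for illustration. It also prints the correlations among the three predictors kept by the stepwise model, since strongly correlated regressors make individual coefficients harder to interpret.

```r
data(mtcars)

# Correlation of every other variable with mpg
r_mpg <- cor(mtcars)[, "mpg"]
r_mpg <- r_mpg[names(r_mpg) != "mpg"]

# Rank by absolute strength, strongest first
ranked <- r_mpg[order(abs(r_mpg), decreasing = TRUE)]
round(ranked, 3)

# Correlations among the predictors in mpg ~ wt + qsec + am
round(cor(mtcars[, c("wt", "qsec", "am")]), 3)
```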