This work explores the relationship between a set of variables and miles per gallon (MPG) in the automobile industry. It addresses two questions: 1. “Is an automatic or manual transmission better for MPG?” 2. “Quantify the MPG difference between automatic and manual transmissions.”
The analysis uses a simple regression model (mpg ~ am) and a multivariable regression model (mpg ~ wt + qsec + am). Both models indicate that manual transmissions have higher MPG than automatic ones. In the simple model, the average MPG for automatic transmissions is 17.147, and manual transmissions average 7.245 MPG more (beta1). After adjusting for weight (wt) and quarter-mile time (qsec), the estimated difference shrinks to 2.936 MPG. Therefore, in this data set, the manual transmission is better for MPG than the automatic one.
Exploratory data analysis is shown in the Appendix.
# Analysis of linear regression considering only mpg and am
data(mtcars)
mtcars2 <- mtcars
fit1 <- lm(mpg ~ am, data = mtcars2)
summary(fit1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am 7.244939 1.764422 4.106127 2.850207e-04
# Inferential statistics for beta1. H0: beta1 = 0, H1: beta1 != 0
beta1_e <- summary(fit1)$coef[2,1]   # estimate of beta1
beta1_sd <- summary(fit1)$coef[2,2]  # standard error of beta1
n <- length(mtcars2$mpg)
# two-sided p-value from the t distribution with n - 2 degrees of freedom
2 * pt(abs(beta1_e/beta1_sd), df = n - 2, lower.tail = FALSE)
## [1] 0.0002850207
# The p-value is below 0.05, so we reject H0 in favor of H1.
# The MPG difference captured by beta1 is significant at alpha = 0.05.
The intercept, 17.147, is the average MPG for automatic transmissions. The slope beta1, 7.245, is the difference in average MPG between manual and automatic transmissions. The sum 17.147 + 7.245 = 24.392 is therefore the average MPG for manual transmissions.
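As a quick check of this interpretation, the sketch below recomputes the two group means directly and compares them with the fitted coefficients. It also runs a pooled-variance two-sample t-test, which should give the same p-value as the slope test in the simple regression. The names `group_means`, `auto_from_fit`, `manual_from_fit`, `pooled_test`, and `slope_p` are illustrative and not part of the original analysis.

```r
data(mtcars)
fit1 <- lm(mpg ~ am, data = mtcars)

# Group means of mpg by transmission (0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Intercept = automatic mean; intercept + slope = manual mean
auto_from_fit   <- unname(coef(fit1)["(Intercept)"])
manual_from_fit <- unname(coef(fit1)["(Intercept)"] + coef(fit1)["am"])

all.equal(auto_from_fit,   unname(group_means["0"]))
all.equal(manual_from_fit, unname(group_means["1"]))

# Pooled-variance t-test: equivalent to the slope test in mpg ~ am
pooled_test <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
slope_p <- summary(fit1)$coefficients["am", "Pr(>|t|)"]
all.equal(pooled_test$p.value, slope_p)
```

Note that `t.test` uses the Welch (unequal-variance) test by default, so `var.equal = TRUE` is needed for the comparison with the regression to hold.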
library(stats)
fitall <- lm(mpg ~ ., data = mtcars2)
bestfit <- step(fitall, direction = "both", trace = FALSE)
summary(bestfit)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
Stepwise selection gives the model mpg ~ wt + qsec + am, with adjusted R2 = 0.8336. Next we compare bestfit with fit1 to check whether the added variables improve the fit.
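By default, `step()` ranks candidate models by AIC. The sketch below compares the AIC of the three models involved, to make the selection criterion visible. It fits the selected formula directly rather than rerunning `step()`, and the name `aic_table` is illustrative. `AIC()` and the values `step()` reports differ by an additive constant for a given data set, so the ordering should agree even though the numbers do not match.

```r
data(mtcars)
fit1    <- lm(mpg ~ am, data = mtcars)              # simple model
fitall  <- lm(mpg ~ ., data = mtcars)               # all ten predictors
bestfit <- lm(mpg ~ wt + qsec + am, data = mtcars)  # model chosen by step()

# Lower AIC is better: it trades goodness of fit against model size
aic_table <- AIC(fit1, fitall, bestfit)
aic_table
```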
anova(fit1, bestfit)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 169.29 2 551.61 45.618 1.55e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
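The F-test is highly significant (p = 1.55e-09), so wt and qsec jointly improve the model over mpg ~ am alone. In bestfit, holding wt and qsec fixed, a manual transmission is associated with 2.936 more MPG than an automatic one (p = 0.047).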
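To quantify the uncertainty around the adjusted transmission effect, one option is a 95% confidence interval for the am coefficient. The sketch below assumes the selected model mpg ~ wt + qsec + am and uses `confint()`. The names `am_est` and `am_ci` are illustrative.

```r
data(mtcars)
bestfit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Point estimate of the manual-vs-automatic difference, adjusted for wt and qsec
am_est <- unname(coef(bestfit)["am"])

# 95% confidence interval for the am coefficient (a 1 x 2 matrix)
am_ci <- confint(bestfit, "am", level = 0.95)

am_est
am_ci
```

Because the p-value for am is just below 0.05, the lower bound of this interval should sit only slightly above zero, which is worth keeping in mind when stating the size of the effect.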
library(ggfortify) #install.packages("ggfortify")
autoplot(bestfit, label.size = 2)
Based on the diagnostic plots:
1. “Residuals vs Fitted” shows randomly scattered points, which suggests that no systematic pattern is left unexplained by the model.
2. “Normal Q-Q” has most points close to the line, which is consistent with approximately normally distributed residuals.
3. “Scale-Location” has points scattered evenly around the line, which suggests roughly constant variance.
4. “Residuals vs Leverage” shows only a few outlying or high-leverage points, toward the top and right of the plot.
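The visual reading of the diagnostic plots can be complemented with a few numbers. The sketch below is one possible check, not part of the original analysis: a Shapiro-Wilk test on the residuals, plus leverage and Cook's distance for each car. The cutoff of twice the average leverage is a common rule of thumb and is an assumption here. The names `normality`, `lev`, `cook`, and `high_leverage` are illustrative.

```r
data(mtcars)
bestfit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Shapiro-Wilk test on the residuals (H0: residuals are normally distributed)
normality <- shapiro.test(residuals(bestfit))

# Leverage (hat values) and influence (Cook's distance) for each car
lev  <- hatvalues(bestfit)
cook <- cooks.distance(bestfit)

# Rule-of-thumb flag: leverage above twice the average leverage
high_leverage <- names(lev)[lev > 2 * mean(lev)]

normality$p.value
high_leverage
head(sort(cook, decreasing = TRUE), 3)  # three most influential cars
```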
# Appendix: exploratory data analysis
?mtcars # check the variables (0 = automatic, 1 = manual)
str(mtcars2)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# check the mean and sd
mean_sd <- rbind(tapply(mtcars2$mpg, mtcars2$am, mean), tapply(mtcars2$mpg, mtcars2$am, sd))
colnames(mean_sd) <- c("automatic", "manual")
rownames(mean_sd) <- c("mean", "sd")
mean_sd
## automatic manual
## mean 17.147368 24.392308
## sd 3.833966 6.166504
# boxplot of mpg by transmission type
library(ggplot2)
mtcars2 <- mtcars
ggplot(data = mtcars2, aes(x = factor(am), y = mpg)) +
  geom_boxplot() +
  scale_x_discrete(labels = c("automatic", "manual"))
### Explore the correlations
cor(mtcars2)
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
# Every variable is at least moderately correlated with mpg (|r| ranges from 0.42 for qsec to 0.87 for wt)
library(GGally)
ggpairs(mtcars2)
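The full correlation matrix is hard to scan. As a sketch, the mpg column can be pulled out and ranked by absolute correlation, which shows more directly which variables are most strongly associated with mpg. The names `cor_mpg` and `ranked` are illustrative.

```r
data(mtcars)

# Correlation of every variable with mpg, dropping mpg's correlation with itself
cor_mpg <- cor(mtcars)[, "mpg"]
cor_mpg <- cor_mpg[names(cor_mpg) != "mpg"]

# Rank by absolute value so strong negative correlations are not hidden
ranked <- cor_mpg[order(abs(cor_mpg), decreasing = TRUE)]
round(ranked, 3)
```

Ranking by absolute value matters here because several of the strongest relationships with mpg (wt, cyl, disp) are negative.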