Objective

This analysis uses the mtcars data set, a collection of cars, to explore the relationship between a set of variables and miles per gallon (MPG), the outcome. It focuses on two questions: 1. Is an automatic or manual transmission better for MPG? 2. How large is the MPG difference between automatic and manual transmissions?

Load the data set

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

From this preview, we can see that “am” records whether a car has an automatic or manual transmission (“0” for automatic and “1” for manual), and that our target variable is “mpg”.
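As a quick check of this encoding, the sketch below relabels “am” and counts the cars of each transmission type. The name `transmission` is an illustrative helper, not part of the original analysis.

```r
data(mtcars)

# Relabel the 0/1 indicator so the output is self-describing
# (`transmission` is a hypothetical helper name).
transmission <- factor(mtcars$am, levels = c(0, 1), labels = c("Auto", "Manual"))

# Number of cars of each transmission type
table(transmission)
```

The two groups are of unequal size, which is one reason to prefer the Welch form of the t-test when comparing them.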

Exploratory data analyses

x<-mtcars$am
y<-mtcars$mpg
boxplot(y~x) #Take a view of the data

means<- c(mean(mtcars$mpg[mtcars$am==0]), mean(mtcars$mpg[mtcars$am==1]))
names(means)<-c("Auto","Manual")
print(means) #Looking at means of mpg for each type
##     Auto   Manual 
## 17.14737 24.39231
t.test(mtcars$mpg[mtcars$am==0],mtcars$mpg[mtcars$am==1])#Mean test
## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg[mtcars$am == 0] and mtcars$mpg[mtcars$am == 1]
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

The sample means are 17.15 mpg for automatic cars and 24.39 mpg for manual cars. The p-value of 0.001374 is below 0.05, so the null hypothesis that the two means are equal is rejected.
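The same comparison can also be written with the formula interface, which avoids subsetting by hand. This is a minimal sketch; `group_means` and `tt` are illustrative names, and the 0.05 threshold is the one used in the text above.

```r
data(mtcars)

# Mean mpg per transmission type (0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Welch two-sample t-test via the formula interface
tt <- t.test(mpg ~ am, data = mtcars)

group_means
tt$p.value    # compare against the 0.05 threshold
tt$conf.int   # 95% interval for mean(automatic) - mean(manual)
```

Because the interval is for automatic minus manual, an interval lying entirely below zero means manual cars have the higher mean mpg.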

Modeling

Simple regression

fit0<-lm(y~x)
summary(fit0)$adj.r.squared
## [1] 0.3384589
summary(fit0)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## x            7.244939   1.764422  4.106127 2.850207e-04
#With "am" as the only independent variable, the mean mpg for automatic cars is 17.15, and changing from automatic to manual increases it by 7.24.

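This is not satisfactory: with an adjusted R-squared of 0.34, transmission type alone explains only about a third of the variance in mpg.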
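With a single 0/1 predictor, the regression coefficients simply restate the group means. The sketch below checks this, assuming the model is refit directly on the mtcars columns; `fit_am` is an illustrative name.

```r
data(mtcars)

# am-only model, refit on the data frame (hypothetical name `fit_am`)
fit_am <- lm(mpg ~ am, data = mtcars)
b <- coef(fit_am)

auto_mean   <- mean(mtcars$mpg[mtcars$am == 0])
manual_mean <- mean(mtcars$mpg[mtcars$am == 1])

# Intercept equals the automatic mean; intercept + slope equals the manual mean
all.equal(unname(b["(Intercept)"]), auto_mean)
all.equal(unname(b["(Intercept)"] + b["am"]), manual_mean)

# Share of mpg variance explained by transmission type alone
summary(fit_am)$adj.r.squared
```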

Consider other variables

First, take a look at the correlations among the variables.

cor_matrix<-cor(mtcars) #full matrix in Appendix; named so it does not mask cor()
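To read the matrix with the outcome in mind, one option is to pull out the “mpg” column and rank the other variables by the strength of their correlation with it. This is only a sketch; `cor_with_mpg` is an illustrative name.

```r
data(mtcars)

# Correlation of every variable with mpg
cor_with_mpg <- cor(mtcars)[, "mpg"]

# Drop the trivial self-correlation
cor_with_mpg <- cor_with_mpg[names(cor_with_mpg) != "mpg"]

# Rank by absolute strength, strongest first
cor_with_mpg[order(abs(cor_with_mpg), decreasing = TRUE)]
```

Several predictors are also strongly correlated with each other (see the full matrix in the Appendix), which is why the choice of variables matters for interpreting the “am” coefficient.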

Using the step function to select variables

fit1<-step(lm(mpg~.,data=mtcars), trace=0)
coefs<-coef(fit1) #named vector of coefficient estimates; avoids masking coef()
coefs
## (Intercept)          wt        qsec          am 
##    9.617781   -3.916504    1.225886    2.935837

Stepwise linear regression selected three variables to describe “mpg”: “wt”, “qsec” and “am”, with an adjusted R-squared of 83.36%.
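The selected formula and its adjusted R-squared can be read straight from the fitted object, and confidence intervals show how precisely each coefficient is estimated. A minimal sketch, with `fit_step` as an illustrative name:

```r
data(mtcars)

# Stepwise selection starting from the full model (AIC-based by default)
fit_step <- step(lm(mpg ~ ., data = mtcars), trace = 0)

formula(fit_step)                 # variables kept by the selection
summary(fit_step)$adj.r.squared   # adjusted R-squared of the selected model
confint(fit_step)                 # 95% confidence intervals for the coefficients
```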

Conclusion and interpretation of coefficients

“mpg” decreases by 3.92 for each one-unit increase in “wt” (1,000 lb), increases by 1.23 for each one-unit increase in “qsec” (one second), and increases by 2.94 when “am” changes from 0 to 1 (automatic to manual), in each case holding the other variables constant. To answer the two questions from the beginning: 1. A manual transmission is better for MPG than an automatic, because manual cars have the higher MPG. 2. The MPG difference between automatic and manual transmissions: comparing group means alone, manual cars average 7.24 mpg more; after adjusting for “wt” and “qsec”, the difference is 2.94 mpg, from the model mpg = 9.62 - 3.92 wt + 1.23 qsec + 2.94 am.
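One way to see what "holding others constant" means is to predict mpg for two cars that differ only in transmission type. The sketch below assumes the three-variable model is refit directly; `fit_sel` and `new_cars` are illustrative names, and the values wt = 3 and qsec = 18 are arbitrary.

```r
data(mtcars)

# Model with the three selected variables (hypothetical name `fit_sel`)
fit_sel <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Two hypothetical cars, identical except for transmission type
new_cars <- data.frame(wt = c(3, 3), qsec = c(18, 18), am = c(0, 1))
pred <- predict(fit_sel, newdata = new_cars)

pred
unname(pred[2] - pred[1])   # equals the "am" coefficient
```

Whatever values are chosen for wt and qsec, the gap between the two predictions stays the same, because the model has no interaction terms.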

Appendix

Correlation

cor(mtcars) #matrix in Appendix
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000

ANOVA

anova(fit0,fit1)
## Warning in anova.lmlist(object, ...): models with response '"mpg"' removed
## because response differs from model 1
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## x          1 405.15  405.15   16.86 0.000285 ***
## Residuals 30 720.90   24.03                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
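The warning above means the comparison was not carried out: fit0 was fit as y~x while fit1 uses mpg as its response, so anova() dropped fit1 and printed only the table for fit0. A minimal sketch of a comparison that should run, assuming both models are refit on mtcars with the same response (`fit_am`, `fit_sel` and `cmp` are illustrative names, and the larger formula is the one selected by step):

```r
data(mtcars)

# Nested models with the same response, so anova() can compare them
fit_am  <- lm(mpg ~ am, data = mtcars)
fit_sel <- lm(mpg ~ wt + qsec + am, data = mtcars)

# F-test of whether adding wt and qsec improves on the am-only model
cmp <- anova(fit_am, fit_sel)
cmp
```

A small p-value in the second row would indicate that “wt” and “qsec” add explanatory power beyond transmission type alone.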

Plot analysis

par(mfrow=c(2,2))
plot(fit0)

plot(fit1)