Intro

Using a data set of a collection of cars, we explore the relationship between a set of variables and miles per gallon (MPG), which is the outcome. We are particularly interested in the following two questions:

1. Is an automatic or manual transmission better for MPG?

2. How large is the MPG difference between automatic and manual transmissions?

Load Data and make the appropriate transformations

Load and Explore Data

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Create factor variables (see Appendix)

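library(dplyr)  # provides rename()

mtcars$cyl  <- factor(mtcars$cyl)
mtcars$vs   <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
# am is coded 0 = automatic, 1 = manual, so the labels follow that order
mtcars$am   <- factor(mtcars$am, labels = c("Automatic", "Manual"))
# Mark the factor columns with a ".factor" suffix
mtcars <- rename(mtcars, cyl.factor = cyl, vs.factor = vs, gear.factor = gear,
                 carb.factor = carb, am.factor = am)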
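Because factor() assigns labels in level order, it may be worth confirming that 0 maps to Automatic and 1 to Manual before relying on the labels. The following is a minimal sketch on a fresh copy of the data; the names cars and am_factor are illustrative and not part of the analysis above.

```r
# Work on a fresh copy so the check does not depend on earlier transformations
cars <- datasets::mtcars

# Labels are matched to the sorted levels: 0 -> "Automatic", 1 -> "Manual"
am_factor <- factor(cars$am, labels = c("Automatic", "Manual"))

# Cross-tabulate the raw coding against the labelled factor
print(table(raw = cars$am, labelled = am_factor))
```

All cars with am equal to 0 should fall in the Automatic column and all cars with am equal to 1 in the Manual column.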

Statistics and best Model Fit

Explore MPG variance across type of Transmission

tapply(mtcars$mpg,mtcars$am.factor,var)
## Automatic    Manual 
##  14.69930  38.02577

The MPG variance for manual cars (38.0) is more than twice that for automatic cars (14.7). We therefore do not assume equal variances and use Welch's version of the T-Test below.
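One way to check the variance difference more formally is an F-test with var.test(). This is only a sketch and not part of the original analysis: the F-test assumes normally distributed data, and with groups this small its result should be read with caution. The names cars and vt are illustrative.

```r
# Fresh copy of the data with a labelled transmission factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# F-test for equality of the two group variances
vt <- var.test(mpg ~ am, data = cars)

# Ratio of the sample variances (Automatic / Manual) and the test's p-value
print(vt$estimate)
print(vt$p.value)
```

Whatever the outcome of this test, the Welch T-Test remains a safe choice because it does not require equal variances.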

Running T-Test

t.test(mtcars$mpg ~ mtcars$am.factor, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg by mtcars$am.factor
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

T-Test Results

There is a significant difference in MPG between Automatic and Manual Transmission (p = 0.0014). The average MPG for automatic is 17.1 MPG, while manual is 7.2 MPG higher, at 24.4 MPG. On this evidence alone, manual transmission appears better for MPG, but this conclusion holds only if we do not adjust the relationship between MPG and Transmission for other variables.
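The Welch statistic and its degrees of freedom can be reproduced by hand from the group means, variances and sizes. The sketch below uses the standard Welch formulas; the variable names are illustrative.

```r
# Split MPG by transmission type (am: 0 = automatic, 1 = manual)
cars   <- datasets::mtcars
auto   <- cars$mpg[cars$am == 0]
manual <- cars$mpg[cars$am == 1]

# Per-group variance of the mean
v_auto   <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over the unpooled standard error
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

print(c(t = t_stat, df = df_welch))
```

These values should agree with the t and df reported in the T-Test output above.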

First Model Fit (the unadjusted estimate MPG~Transmission)

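The estimated difference (7.245 MPG) equals the difference in group means from the T-Test. The t value and p-value differ slightly because lm assumes equal variances in both groups, whereas the Welch T-Test does not.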

basicmodel <- lm(mpg ~ am.factor, data = mtcars)
summary(basicmodel)
## 
## Call:
## lm(formula = mpg ~ am.factor, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       17.147      1.125  15.247 1.13e-15 ***
## am.factorManual    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The model explains only 36% of the variance in MPG, indicating that we need a better model with more variables (multivariable linear regression).
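To see how the unadjusted model relates to the T-Test, the sketch below compares the lm slope with a pooled-variance T-Test (var.equal = TRUE). It runs on a fresh copy of the data, and the names cars, fit and pooled are illustrative.

```r
# Fresh copy of the data with a labelled transmission factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Unadjusted regression and the equal-variance (pooled) t-test
fit    <- lm(mpg ~ am, data = cars)
pooled <- t.test(mpg ~ am, data = cars, var.equal = TRUE)

# Slope = mean(Manual) - mean(Automatic)
slope <- unname(coef(fit)["amManual"])

# The two t statistics have the same magnitude; the sign differs only
# because t.test() computes Automatic minus Manual
t_lm     <- summary(fit)$coefficients["amManual", "t value"]
t_pooled <- unname(pooled$statistic)

print(c(slope = slope, t_lm = t_lm, t_pooled = t_pooled))
```

The match in magnitude shows that a regression on a single two-level factor is equivalent to the pooled-variance T-Test, not to the Welch version used earlier.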

Find the Best Model

# Stepwise selection by AIC, starting from the model with all variables
bestmodel <- step(lm(mpg ~ am.factor + ., data = mtcars), trace = 0)
summary(bestmodel)
## 
## Call:
## lm(formula = mpg ~ am.factor + cyl.factor + hp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     33.70832    2.60489  12.940 7.73e-13 ***
## am.factorManual  1.80921    1.39630   1.296  0.20646    
## cyl.factor6     -3.03134    1.40728  -2.154  0.04068 *  
## cyl.factor8     -2.16368    2.28425  -0.947  0.35225    
## hp              -0.03211    0.01369  -2.345  0.02693 *  
## wt              -2.49683    0.88559  -2.819  0.00908 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

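The model explains 87% of the variance, and cyl, hp and wt affect the relationship between mpg and am. After adjusting for these variables, the estimated difference between manual and automatic transmissions is 1.81 MPG, and it is no longer statistically significant (p = 0.206).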
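To quantify the adjusted difference with its uncertainty, a confidence interval for the transmission coefficient can be computed with confint(). The sketch below refits the selected formula on a fresh copy of the data, so the column names differ from the renamed ones used above; cars, adjusted and ci are illustrative names.

```r
# Fresh copy of the data with the factors the selected model uses
cars <- datasets::mtcars
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

# Same terms as the model chosen by step()
adjusted <- lm(mpg ~ am + cyl + hp + wt, data = cars)

# 95% confidence interval for the Manual vs Automatic difference
ci <- confint(adjusted, "amManual", level = 0.95)
print(ci)
```

The interval includes zero, consistent with the p-value of 0.206 reported for am.factorManual.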

Compare Basic Model against Best Model - Anova test

anova(basicmodel,bestmodel)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am.factor
## Model 2: mpg ~ am.factor + cyl.factor + hp + wt
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     26 151.03  4    569.87 24.527 1.688e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
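The F statistic in the table can be reproduced from the residual sums of squares and degrees of freedom of the two models. The sketch below uses the rounded values printed above, so the result may differ from the table in the last digit; the variable names are illustrative.

```r
# Values copied from the anova table above (rounded as printed)
rss_basic <- 720.90; df_basic <- 30
rss_best  <- 151.03; df_best  <- 26

# Drop in RSS per added parameter, relative to the residual variance
# of the larger model
f_stat <- ((rss_basic - rss_best) / (df_basic - df_best)) / (rss_best / df_best)

# Upper-tail probability of the F distribution on (4, 26) degrees of freedom
p_value <- pf(f_stat, df_basic - df_best, df_best, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

The very small p-value suggests that adding cyl, hp and wt improves the fit significantly over the transmission-only model.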

Conclusion

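When we examine MPG against type of Transmission alone, there is a large difference: manual cars average about 7.2 MPG more than automatic cars. This difference shrinks to about 1.8 MPG, and is no longer statistically significant, once we adjust for the effect that number of cylinders, horsepower and weight have on the relationship between MPG and type of Transmission.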

Appendix

We split the mtcars variables into factor and continuous variables, then produced some exploratory plots.

MPG against factor variables

library(ggplot2)    # ggplot(), geom_boxplot(), facet_grid()
library(grid)       # textGrob(), gpar()
library(gridExtra)  # grid.arrange()

g1 <- ggplot(mtcars, aes(x = am.factor, y = mpg)) +
  geom_boxplot(aes(fill = am.factor)) +
  scale_fill_brewer(palette = "Set1") + labs(y = "", x = "")
title1 <- textGrob("MPG against Transmission", gp = gpar(fontface = "bold", fontsize = 12))
grid.arrange(g1, top = title1)

# Same boxplot, faceted by number of cylinders
g3 <- g1 + facet_grid(~cyl.factor)
title3 <- textGrob("MPG per Cylinder and Transmission", gp = gpar(fontface = "bold", fontsize = 12))
grid.arrange(g3, top = title3)

# Same boxplot, faceted by vs
g4 <- g1 + facet_grid(~vs.factor)
title4 <- textGrob("MPG per VS and Transmission", gp = gpar(fontface = "bold", fontsize = 12))
grid.arrange(g4, top = title4)

MPG against continuous variables

library(dplyr)   # select(), ends_with()
library(GGally)  # ggpairs()
q2 <- ggpairs(select(mtcars, -ends_with(".factor")), lower = list(continuous = "smooth"))
q2

Best Model Residuals Plots

par(mfrow = c(2,2))
plot(bestmodel)