Abstract

We use the mtcars data set to analyze the effect of transmission type (automatic vs. manual) on miles per gallon (MPG).

Summary

1. “Is an automatic or manual transmission better for MPG?”

We approach the first question through exploratory data analysis. Summarizing MPG by transmission type, we see that manual transmissions (24.4 avg. mpg) are considerably more efficient than automatic ones (17.1 avg. mpg) when no other variable is taken into account.

However, when the number of cylinders is taken into account, the difference shrinks as the cylinder count goes from 4 (28.1 vs. 22.9) to 6 (20.6 vs. 19.1) and all but disappears at 8 (15.4 vs. 15.1).

Finally, we perform a hypothesis test on the difference in average mpg between the two transmission groups (null value 0, left-tailed alternative). This calls for a t-test, which yields a p-value of 0.0013. At the 5% significance level, the data support the hypothesis that the average mpg of automatic transmissions is lower than that of manual ones.

2. “Quantify the MPG difference between automatic and manual transmissions”

We use a linear model to quantify the MPG difference. Graphing the correlations between all numerical variables, we see that mpg is positively correlated with drat, qsec, vs and gear. Moreover, cyl is strongly correlated with the rest, so we add these 5 variables to the model one at a time, using anova and vif to find the best fit.

The results show only two good models: mpg ~ am + cyl and mpg ~ all variables. However, the second model is too noisy (several of its variance inflation factors exceed 10) and takes focus away from our main objective, so we use the first.

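Model: MPG = 34.522 + 2.567*(am:manual) - 2.501*(cyl)

Thus, ceteris paribus, a manual transmission is expected to deliver 2.567 more MPG than an automatic on average. Note that this coefficient is only marginally significant (p = 0.056), so once cylinder count is controlled for, the evidence for a difference is weak.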

We conclude the report by looking at the model diagnostics (the residual plots at the end of the appendix).

APPENDIX

# Packages used in the appendix (statsr provides inference(), GGally provides ggcorr())
lib_need <- c('tidyverse','car','statsr','GGally')
invisible(lapply(lib_need,library,character.only=TRUE))
data(mtcars)

Exploratory Data Analysis

A data frame with 32 observations on 11 variables.

ggcorr(mtcars)
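
To complement the correlation plot, the sketch below lists the correlation between mpg and every other variable as plain numbers, sorted from most negative to most positive. It uses base R only; the names `cars` and `mpg_cor` are illustrative and not part of the original analysis.

```r
# Work on a fresh copy so the objects used elsewhere in the report are untouched
cars <- datasets::mtcars

# Pearson correlation of every variable with mpg, dropping mpg itself
mpg_cor <- cor(cars)["mpg", ]
mpg_cor <- sort(mpg_cor[names(mpg_cor) != "mpg"])

print(round(mpg_cor, 2))
```

Negative values mark variables that move against mpg, positive values those that move with it.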

mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c('automatic','manual')
ggplot(mtcars,aes(x=am,y=mpg))+geom_boxplot()

mtcars %>%
        group_by(am) %>%
        summarize(average_mpg = mean(mpg)) %>%
        print()
## # A tibble: 2 x 2
##          am average_mpg
##      <fctr>       <dbl>
## 1 automatic    17.14737
## 2    manual    24.39231
mtcars %>%
        group_by(am, cyl) %>%
        summarize(average_mpg = mean(mpg))
## # A tibble: 6 x 3
## # Groups:   am [?]
##          am   cyl average_mpg
##      <fctr> <dbl>       <dbl>
## 1 automatic     4    22.90000
## 2 automatic     6    19.12500
## 3 automatic     8    15.05000
## 4    manual     4    28.07500
## 5    manual     6    20.56667
## 6    manual     8    15.40000
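
The shrinking gap can also be computed directly instead of being read off the table. The following is a minimal base R sketch; `cars`, `mean_mpg` and `mpg_gap` are illustrative names, and it uses the raw 0/1 coding of am rather than the relabeled factor.

```r
cars <- datasets::mtcars

# Mean mpg in a cylinders-by-transmission table (am: 0 = automatic, 1 = manual)
mean_mpg <- with(cars, tapply(mpg, list(cyl = cyl, am = am), mean))

# Manual minus automatic, one value per cylinder count
mpg_gap <- mean_mpg[, "1"] - mean_mpg[, "0"]
print(round(mpg_gap, 2))
```

Each element of `mpg_gap` is the manual-minus-automatic difference in average mpg for one cylinder count.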
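dt <- dplyr::select(mtcars,cyl,am,mpg)
# One-sided test of H0: mu_automatic = mu_manual against HA: mu_automatic < mu_manual
inference(y = mpg, x = am, data = dt,
          type = "ht", statistic = "mean", method = "theoretical",
          null = 0, alternative = "less")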
## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_automatic = 19, y_bar_automatic = 17.1474, s_automatic = 3.834
## n_manual = 13, y_bar_manual = 24.3923, s_manual = 6.1665
## H0: mu_automatic =  mu_manual
## HA: mu_automatic < mu_manual
## t = -3.7671, df = 12
## p_value = 0.0013
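
As a cross-check that does not depend on statsr, the same one-sided comparison can be sketched with base R's `t.test`. This is an assumption-labeled alternative, not the method used above: `t.test` applies the Welch approximation for the degrees of freedom instead of the df = 12 reported by `inference`, so the p-value will not be exactly 0.0013 even though the t statistic is built from the same group means and standard deviations. The names `cars` and `welch` are illustrative.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Welch two-sample t-test; "less" tests mean(automatic) - mean(manual) < 0
welch <- t.test(mpg ~ am, data = cars, alternative = "less")
print(welch)
```

The first factor level (automatic) is the group whose mean is tested as being smaller, which matches the left-tailed alternative in the report.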

Model

fit1 <- lm(mpg ~ am, mtcars)
fit2 <- lm(mpg ~ am + cyl, mtcars)
fit3 <- lm(mpg ~ am + cyl + drat, mtcars)
fit4 <- lm(mpg ~ am + cyl + drat + qsec, mtcars)
fit5 <- lm(mpg ~ am + cyl + drat + qsec + vs, mtcars)
fit6 <- lm(mpg ~ am + cyl + drat + qsec + vs + gear, mtcars)
fit7 <- lm(mpg ~ ., mtcars)
anova(fit1,fit2,fit3,fit4,fit5,fit6,fit7)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl
## Model 3: mpg ~ am + cyl + drat
## Model 4: mpg ~ am + cyl + drat + qsec
## Model 5: mpg ~ am + cyl + drat + qsec + vs
## Model 6: mpg ~ am + cyl + drat + qsec + vs + gear
## Model 7: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 271.36  1    449.53 64.0039 8.231e-08 ***
## 3     28 270.97  1      0.39  0.0561   0.81506    
## 4     27 266.30  1      4.67  0.6652   0.42390    
## 5     26 264.76  1      1.53  0.2185   0.64497    
## 6     25 256.40  1      8.36  1.1906   0.28757    
## 7     21 147.49  4    108.90  3.8764   0.01643 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lapply(list(fit2,fit3,fit4,fit5,fit6,fit7), vif)
## [[1]]
##       am      cyl 
## 1.375739 1.375739 
## 
## [[2]]
##       am      cyl     drat 
## 2.036922 1.964868 2.902648 
## 
## [[3]]
##       am      cyl     drat     qsec 
## 3.811592 6.712992 3.053960 4.192136 
## 
## [[4]]
##       am      cyl     drat     qsec       vs 
## 3.896939 8.657026 3.062424 4.621028 4.360928 
## 
## [[5]]
##       am      cyl     drat     qsec       vs     gear 
## 4.374909 8.954911 3.213257 5.156080 4.397754 3.352122 
## 
## [[6]]
##       cyl      disp        hp      drat        wt      qsec        vs 
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873 
##        am      gear      carb 
##  4.648487  5.357452  7.908747
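
The sequential table above tests each model against the one before it, so its last row compares the full model with fit6 rather than with mpg ~ am + cyl. One way to compare the two candidate models head to head is a single nested F-test, sketched below. The names `cars`, `small`, `full` and `direct` are illustrative; the formulas are the same as fit2 and fit7.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

small <- lm(mpg ~ am + cyl, data = cars)  # same formula as fit2
full  <- lm(mpg ~ ., data = cars)         # same formula as fit7

# Single F-test: do the 8 extra predictors jointly improve on am + cyl?
direct <- anova(small, full)
print(direct)
```

A p-value above 0.05 in the second row would support keeping the smaller model.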
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ am + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6856 -1.7172 -0.2657  1.8838  6.8144 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.5224     2.6032  13.262 7.69e-14 ***
## ammanual      2.5670     1.2914   1.988   0.0564 .  
## cyl          -2.5010     0.3608  -6.931 1.28e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.059 on 29 degrees of freedom
## Multiple R-squared:  0.759,  Adjusted R-squared:  0.7424 
## F-statistic: 45.67 on 2 and 29 DF,  p-value: 1.094e-09
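
To put the uncertainty around the transmission effect into numbers, a confidence interval for the ammanual coefficient can be sketched with base R's `confint`. The names `cars`, `fit` and `am_ci` are illustrative; the model formula is the same as fit2.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit <- lm(mpg ~ am + cyl, data = cars)  # same formula as fit2

# 95% confidence interval for the manual-vs-automatic coefficient
am_ci <- confint(fit, "ammanual", level = 0.95)
print(am_ci)
```

Because the coefficient's p-value (0.0564) is above 0.05, this interval includes zero.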
par(mfrow=c(1,2))
plot(fit2$residuals, ylab = "residuals")
hist(fit2$residuals, main= "Residuals", xlab = "residuals")

par(mfrow=c(2,2))
plot(fit2)
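
The plots above are judged by eye. As an optional numeric complement, which is an addition and not part of the original analysis, a Shapiro-Wilk test on the residuals can be sketched as follows. The names `cars`, `fit` and `sw` are illustrative.

```r
cars <- datasets::mtcars
fit <- lm(mpg ~ factor(am) + cyl, data = cars)  # same model as fit2

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit))
print(sw)
```

With only 32 observations the test has limited power, so it is best read alongside the Q-Q plot rather than in place of it.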