We use the mtcars data set to analyze the effect of transmission type (automatic vs. manual) on miles per gallon (MPG).
We approach the first question through some exploratory data analysis. Summarizing MPG for each transmission type, we see that manual transmission (24.4 avg. mpg) is considerably more efficient than automatic (17.1 avg. mpg) when no other variable is taken into account.
However, when the number of cylinders is taken into account, the difference (manual vs. automatic) shrinks as the cylinder count goes from 4 (28.1 vs. 22.9) to 6 (20.6 vs. 19.1) and nearly vanishes at 8 (15.4 vs. 15.1).
Finally, we perform a hypothesis test on the difference in average mpg between the two transmission groups (null value 0, left-tailed alternative). This calls for a t-test, which yields a p-value of 0.0013. At the 5% significance level, the data support the hypothesis that the average mpg of automatic transmission is lower than that of manual.
We will use a linear model to quantify the MPG difference. By graphing the correlations between all numerical variables, we see that mpg is positively correlated with drat, qsec, vs and gear. Moreover, cyl is strongly correlated with the rest, and thus we will fit these 5 variables, using anova and vif to find the best fit.
The results show only two good models: mpg ~ am + cyl and mpg ~ . (all variables). However, the second model, which includes every variable, is too noisy (several of its VIFs exceed 15) and takes focus away from our main objective, so we will use the first.
Model: MPG = 34.522 + 2.567*(am:manual) - 2.501*(cyl)
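Thus, ceteris paribus, a manual transmission is expected to yield 2.567 more MPG than an automatic on average. Note, however, that this coefficient has a p-value of 0.0564, so it is not significant at the 5% level once cyl is included.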
We conclude the report by looking at the model diagnostics:
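# Load the packages used below; invisible() suppresses the printed list of attached packages
lib_need <- c('tidyverse', 'car', 'statsr', 'GGally')
invisible(lapply(lib_need, library, character.only = TRUE))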
data(mtcars)
A data frame with 32 observations on 11 variables.
ggcorr(mtcars)
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c('automatic','manual')
ggplot(mtcars,aes(x=am,y=mpg))+geom_boxplot()
mtcars %>%
group_by(am) %>%
summarize(average_mpg = mean(mpg)) %>%
print()
## # A tibble: 2 x 2
## am average_mpg
## <fctr> <dbl>
## 1 automatic 17.14737
## 2 manual 24.39231
mtcars %>%
group_by(am, cyl) %>%
summarize(average_mpg = mean(mpg))
## # A tibble: 6 x 3
## # Groups: am [?]
## am cyl average_mpg
## <fctr> <dbl> <dbl>
## 1 automatic 4 22.90000
## 2 automatic 6 19.12500
## 3 automatic 8 15.05000
## 4 manual 4 28.07500
## 5 manual 6 20.56667
## 6 manual 8 15.40000
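The shrinking gap between transmissions can be made explicit by subtracting the automatic mean from the manual mean within each cylinder count. The following is a minimal base R sketch on a fresh copy of mtcars; the names `cars`, `cell_means` and `mpg_gap` are illustrative and not part of the original analysis.

```r
# Work on a fresh copy so the sketch does not depend on earlier transformations.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Mean mpg in each transmission x cylinder cell (rows: am, columns: cyl).
cell_means <- tapply(cars$mpg, list(cars$am, cars$cyl), mean)

# Manual-minus-automatic gap in average mpg for each cylinder count.
mpg_gap <- cell_means["manual", ] - cell_means["automatic", ]
print(round(mpg_gap, 2))
```

Based on the table above, the gap is roughly 5.2 mpg for 4 cylinders, 1.4 mpg for 6, and 0.35 mpg for 8.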
dt <- select(mtcars,cyl,am,mpg)
inference(y=mpg,x = am,data = dt,type = "ht",statistic = "mean",method = "theoretical",null = 0,alternative = "less")
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_automatic = 19, y_bar_automatic = 17.1474, s_automatic = 3.834
## n_manual = 13, y_bar_manual = 24.3923, s_manual = 6.1665
## H0: mu_automatic = mu_manual
## HA: mu_automatic < mu_manual
## t = -3.7671, df = 12
## p_value = 0.0013
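As a cross-check that does not require statsr, the same one-sided comparison can be sketched with base R's `t.test`, which runs a Welch two-sample test by default. This is an alternative to the `inference` call above, not a reproduction of it: Welch's method uses a different degrees-of-freedom approximation than the df = 12 reported above, so the p-value will not match exactly. The names `cars` and `welch` are illustrative.

```r
# Fresh copy with am as a labelled factor (automatic is the first level).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Welch two-sample t-test; "less" tests mean(automatic) - mean(manual) < 0.
welch <- t.test(mpg ~ am, data = cars, alternative = "less")

print(welch$statistic)  # t statistic
print(welch$parameter)  # Welch-Satterthwaite degrees of freedom
print(welch$p.value)    # one-sided p-value
```

The t statistic should agree with the value reported by `inference`, since both use the same unpooled standard error; only the reference distribution differs.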
fit1 <- lm(mpg ~ am, mtcars)
fit2 <- lm(mpg ~ am + cyl, mtcars)
fit3 <- lm(mpg ~ am + cyl + drat, mtcars)
fit4 <- lm(mpg ~ am + cyl + drat + qsec, mtcars)
fit5 <- lm(mpg ~ am + cyl + drat + qsec + vs, mtcars)
fit6 <- lm(mpg ~ am + cyl + drat + qsec + vs + gear, mtcars)
fit7 <- lm(mpg ~ ., mtcars)
anova(fit1,fit2,fit3,fit4,fit5,fit6,fit7)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl
## Model 3: mpg ~ am + cyl + drat
## Model 4: mpg ~ am + cyl + drat + qsec
## Model 5: mpg ~ am + cyl + drat + qsec + vs
## Model 6: mpg ~ am + cyl + drat + qsec + vs + gear
## Model 7: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 271.36 1 449.53 64.0039 8.231e-08 ***
## 3 28 270.97 1 0.39 0.0561 0.81506
## 4 27 266.30 1 4.67 0.6652 0.42390
## 5 26 264.76 1 1.53 0.2185 0.64497
## 6 25 256.40 1 8.36 1.1906 0.28757
## 7 21 147.49 4 108.90 3.8764 0.01643 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lapply(list(fit2,fit3,fit4,fit5,fit6,fit7), vif)
## [[1]]
## am cyl
## 1.375739 1.375739
##
## [[2]]
## am cyl drat
## 2.036922 1.964868 2.902648
##
## [[3]]
## am cyl drat qsec
## 3.811592 6.712992 3.053960 4.192136
##
## [[4]]
## am cyl drat qsec vs
## 3.896939 8.657026 3.062424 4.621028 4.360928
##
## [[5]]
## am cyl drat qsec vs gear
## 4.374909 8.954911 3.213257 5.156080 4.397754 3.352122
##
## [[6]]
## cyl disp hp drat wt qsec vs
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873
## am gear carb
## 4.648487 5.357452 7.908747
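To see what the VIF measures, it can be computed by hand for the two-predictor model: regress one predictor on the other and take 1 / (1 - R^2). The sketch below uses base R only and keeps am in its original numeric 0/1 coding; `cars`, `r2_cyl` and `vif_cyl` are illustrative names.

```r
# Fresh copy; am stays numeric (0 = automatic, 1 = manual) here.
cars <- datasets::mtcars

# R^2 from regressing cyl on the other predictor in mpg ~ am + cyl.
r2_cyl <- summary(lm(cyl ~ am, data = cars))$r.squared

# Variance inflation factor for cyl: 1 / (1 - R^2).
vif_cyl <- 1 / (1 - r2_cyl)
print(vif_cyl)
```

With only two predictors the VIF is the same for both, which is why am and cyl share the value 1.375739 in the first element of the list above.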
summary(fit2)
##
## Call:
## lm(formula = mpg ~ am + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6856 -1.7172 -0.2657 1.8838 6.8144
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.5224 2.6032 13.262 7.69e-14 ***
## ammanual 2.5670 1.2914 1.988 0.0564 .
## cyl -2.5010 0.3608 -6.931 1.28e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.059 on 29 degrees of freedom
## Multiple R-squared: 0.759, Adjusted R-squared: 0.7424
## F-statistic: 45.67 on 2 and 29 DF, p-value: 1.094e-09
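To put the am coefficient in context, the following minimal sketch refits the same mpg ~ am + cyl model on a fresh copy of mtcars, then computes a 95% confidence interval for the coefficient and the predicted mpg of a hypothetical 6-cylinder car under each transmission. It uses base R only; `cars`, `fit`, `ci`, `new_cars` and `pred` are illustrative names.

```r
# Fresh copy with am as a labelled factor, matching the coding used above.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit <- lm(mpg ~ am + cyl, data = cars)

# 95% confidence interval for the manual-vs-automatic coefficient.
ci <- confint(fit, level = 0.95)
print(ci["ammanual", ])

# Predicted mpg for a hypothetical 6-cylinder car under each transmission.
new_cars <- data.frame(
  am  = factor(c("automatic", "manual"), levels = levels(cars$am)),
  cyl = 6
)
pred <- predict(fit, newdata = new_cars)
print(pred)
```

The interval includes zero, consistent with the p-value of 0.0564 in the summary above, and the two predictions differ by exactly the ammanual coefficient because the model has no interaction term.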
par(mfrow=c(1,2))
plot(fit2$residuals, ylab = "residuals")
hist(fit2$residuals, main= "Residuals", xlab = "residuals")
par(mfrow=c(2,2))
plot(fit2)