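In this report, we focus on two questions using the "mtcars" dataset that ships with R, which contains 32 observations and 11 variables: whether a manual or an automatic transmission is better for MPG, and how large the MPG difference between the two transmissions is.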
According to the results of this report, cars with a manual transmission achieve higher MPG than cars with an automatic transmission. The report also selects a predictive model by backward selection using AIC. In that model, holding the other predictors constant, a car with a manual transmission is expected to have an MPG 1.81 higher than a car with an automatic transmission. However, the diagnostic plots suggest that some of the model assumptions may not hold strongly.
data("mtcars")
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
?mtcars
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
We first visualize whether MPG differs between the two transmission types.
library(ggplot2)
mtcars$amf = factor(mtcars$am,labels = c("Auto","Manual"))   # 0 = automatic, 1 = manual
ggplot(data = mtcars,aes(x = amf, y = mpg,fill = amf))+
  geom_boxplot()+
  labs(title = "Boxplot of MPG in Auto or Manual",
       x = "Transmission",
       y = "MPG")+
  theme_bw()
The boxplot shows a clear difference between the two transmission types: cars with a manual transmission tend to have higher MPG than cars with an automatic transmission.
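The visual impression from the boxplot can be backed with numbers. The following is a minimal sketch using base R only; the names `trans` and `group_summary` are chosen for this illustration and do not appear elsewhere in the report.

```r
# Summarise mpg by transmission type (in mtcars, am: 0 = automatic, 1 = manual).
data("mtcars")
trans <- factor(mtcars$am, labels = c("Auto", "Manual"))  # illustrative name

group_summary <- data.frame(
  n      = as.vector(table(trans)),
  mean   = as.vector(tapply(mtcars$mpg, trans, mean)),
  median = as.vector(tapply(mtcars$mpg, trans, median)),
  sd     = as.vector(tapply(mtcars$mpg, trans, sd)),
  row.names = levels(trans)
)

print(round(group_summary, 2))
```

The group sizes and means should agree with the values reported by the hypothesis test output (19 automatic cars and 13 manual cars).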
Next, we set up a hypothesis test to determine whether the manual transmission is better than the automatic transmission. The null and alternative hypotheses are:
Null hypothesis H0: There is no difference in mean MPG between the two transmissions.
Alternative hypothesis H1: The mean MPG of manual cars is higher than the mean MPG of automatic cars.
We set the significance level to alpha = 0.05.
library(statsr)
inference(y = mpg, x = amf, data = mtcars,type = "ht",statistic = "mean",
method = "theoretical",null = 0, alternative = "less" )
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_Auto = 19, y_bar_Auto = 17.1474, s_Auto = 3.834
## n_Manual = 13, y_bar_Manual = 24.3923, s_Manual = 6.1665
## H0: mu_Auto = mu_Manual
## HA: mu_Auto < mu_Manual
## t = -3.7671, df = 12
## p_value = 0.0013
The output shows a p-value of 0.0013, which is below the significance level of 0.05. We therefore have enough evidence to reject the null hypothesis and conclude that manual cars have a higher mean MPG than automatic cars.
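As a cross-check that does not depend on the statsr package, the test statistic can be reproduced by hand and compared with base R's `t.test`. This is a sketch under one assumption: the df = 12 in the output above appears to correspond to the conservative choice min(n1, n2) - 1, whereas `t.test` uses the Welch approximation, so the statistic should agree while the degrees of freedom and p-value may differ slightly. The variable names are illustrative.

```r
data("mtcars")
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Standard error of the difference in means, without assuming equal variances.
se_diff  <- sqrt(var(auto) / length(auto) + var(manual) / length(manual))
t_manual <- (mean(auto) - mean(manual)) / se_diff

# Welch one-sided test; H1 is mean(auto) - mean(manual) < 0.
welch <- t.test(auto, manual, alternative = "less")

# One-sided p-value with the conservative df = min(n1, n2) - 1 (assumed to be
# what produced df = 12 in the statsr output).
df_conservative <- min(length(auto), length(manual)) - 1
p_conservative  <- pt(t_manual, df = df_conservative)

cat(sprintf("t by hand: %.4f\n", t_manual))
cat(sprintf("Welch: t = %.4f, df = %.2f, p = %.5f\n",
            welch$statistic, welch$parameter, welch$p.value))
cat(sprintf("Conservative: df = %d, p = %.4f\n", df_conservative, p_conservative))
```

Either choice of degrees of freedom leads to the same decision at alpha = 0.05, because the conservative version can only make the p-value larger.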
First, we prepare the dataset for modelling. Some of the variables should be converted to factors.
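train = mtcars[,-9]   # drop the numeric am column (column 9); the factor amf is kept
train$cyl = factor(train$cyl)
train$vs = factor(train$vs,labels = c("V-shaped","straight"))   # vs is the engine shape, not the transmission
train$gear = factor(train$gear)
train$carb = factor(train$carb)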
Now, we run the model selection:
step_model <- step(lm(mpg ~ ., data = train),trace = 0)
summary(step_model)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + amf, data = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9387 -1.2560 -0.4013  1.1253  5.0513
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *
## cyl8        -2.16368    2.28425  -0.947  0.35225
## hp          -0.03211    0.01369  -2.345  0.02693 *
## wt          -2.49683    0.88559  -2.819  0.00908 **
## amfManual    1.80921    1.39630   1.296  0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
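To make the selection step more transparent, the sketch below rebuilds the training data from scratch and compares the AIC of the full model with that of the selected model. It assumes the same preprocessing as in this report (am replaced by a factor named `amf`; cyl, vs, gear, and carb converted to factors). `direction = "backward"` is spelled out here, although it is also what `step` does by default when no scope is given. The object names `full_model` and `aic_table` are illustrative.

```r
data("mtcars")
train <- mtcars
train$amf <- factor(train$am, labels = c("Auto", "Manual"))
train$am  <- NULL  # keep only the factor version of the transmission variable
for (v in c("cyl", "vs", "gear", "carb")) {
  train[[v]] <- factor(train[[v]])
}

full_model <- lm(mpg ~ ., data = train)
step_model <- step(full_model, direction = "backward", trace = 0)

# extractAIC() returns the equivalent degrees of freedom and the AIC value
# on the scale that step() uses internally.
aic_table <- rbind(full = extractAIC(full_model),
                   selected = extractAIC(step_model))
colnames(aic_table) <- c("edf", "AIC")

print(aic_table)
print(formula(step_model))
```

Setting `trace = 1` instead would print every elimination step, which is useful for seeing which terms were dropped and in what order.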
After backward selection based on AIC, the final model includes cyl + hp + wt + amf. It has a multiple R-squared of 0.87 (adjusted 0.84), which suggests that the model explains about 87% of the variation in MPG. The terms cyl6, hp, and wt are significant at the 0.05 level, whereas cyl8 and amfManual are not. The coefficients can also be interpreted directly.
coef(step_model)
## (Intercept)        cyl6        cyl8          hp          wt   amfManual
## 33.70832390 -3.03134449 -2.16367532 -0.03210943 -2.49682942  1.80921138
We can interpret the coefficient of cyl6 as follows: holding the other variables constant, a car with 6 cylinders is expected to have an MPG 3.03 lower than a car with 4 cylinders. The other coefficients can be interpreted in the same way.
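The interpretation of the coefficients can be illustrated with a worked prediction. The sketch below refits the selected model directly (an assumption made so the example is self-contained) and predicts MPG for a hypothetical car with 6 cylinders, 110 hp, a weight of 2.62 (in 1000 lbs), and a manual transmission, once by hand from the coefficients and once with `predict`.

```r
data("mtcars")
d <- mtcars
d$cyl <- factor(d$cyl)
d$amf <- factor(d$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + amf, data = d)

# Hypothetical car used only for illustration.
new_car <- data.frame(
  cyl = factor("6", levels = levels(d$cyl)),
  hp  = 110,
  wt  = 2.62,
  amf = factor("Manual", levels = levels(d$amf))
)

# Prediction by hand: intercept + cyl6 effect + hp and wt slopes + manual effect.
b <- coef(fit)
by_hand <- b["(Intercept)"] + b["cyl6"] + b["hp"] * 110 +
  b["wt"] * 2.62 + b["amfManual"]

pred <- predict(fit, newdata = new_car)

cat(sprintf("By hand: %.2f mpg, predict(): %.2f mpg\n", by_hand, pred))
```

With the coefficients shown above, both approaches give roughly 22.4 mpg. Switching `amf` to "Auto" lowers the prediction by exactly the amfManual coefficient, which is what "holding the other variables constant" means in practice.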
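Moreover, we can quantify the MPG difference between automatic and manual transmissions: holding the other variables constant, a car with a manual transmission is expected to have an MPG 1.81 higher than a car with an automatic transmission. Note that this coefficient has a p-value of 0.21 in the model summary, so the adjusted difference is not statistically significant at the 0.05 level.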
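A confidence interval conveys the uncertainty around the 1.81 estimate more directly than the point estimate alone. The following sketch refits the selected model (an assumption, so that the block runs on its own) and uses `confint` to obtain a 95% interval for the amfManual coefficient.

```r
data("mtcars")
d <- mtcars
d$cyl <- factor(d$cyl)
d$amf <- factor(d$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + amf, data = d)

# 95% confidence interval for the adjusted manual-vs-automatic difference.
ci <- confint(fit, "amfManual", level = 0.95)
print(round(ci, 2))

# The same interval computed by hand: estimate +/- t quantile * standard error.
est <- coef(summary(fit))["amfManual", "Estimate"]
se  <- coef(summary(fit))["amfManual", "Std. Error"]
half_width <- qt(0.975, df = df.residual(fit)) * se
cat(sprintf("By hand: [%.2f, %.2f]\n", est - half_width, est + half_width))
```

Because the p-value of this coefficient is above 0.05, the interval should include zero, meaning that once cylinders, horsepower, and weight are accounted for, the data are compatible with no transmission effect.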
par(mfrow = c(2,2))
plot(step_model)
In the Residuals vs Fitted plot, the residuals are spread fairly evenly on both sides of zero; however, a few points have noticeably larger residuals than the others.
Moreover, in the normal Q-Q plot, not all points lie close to the reference line, which suggests that the residuals are not exactly normally distributed.
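The visual reading of the Q-Q plot can be complemented by a formal normality test. The sketch below applies the Shapiro-Wilk test to the residuals of the selected model, which is refitted here so the block is self-contained. This is an additional check not used in the original analysis, and with only 32 observations the test has limited power, so its result should be read alongside the plot rather than instead of it.

```r
data("mtcars")
d <- mtcars
d$cyl <- factor(d$cyl)
d$amf <- factor(d$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + amf, data = d)

res <- residuals(fit)

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution.
sw <- shapiro.test(res)
print(sw)

# Standalone Q-Q plot of the residuals with a reference line.
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)
```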
The plots also flag three outliers: Toyota Corolla, Chrysler Imperial, and Toyota Corona. All three have low leverage but are nonetheless influential. The point with the highest leverage is Maserati Bora, which has little influence in this case.
which.max(hatvalues(step_model))
## Maserati Bora
## 31
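Leverage and influence can also be tabulated instead of read off the plots. The sketch below refits the selected model and lists hat values, Cook's distances, and standardized residuals. The cutoff of 2p/n for high leverage is a common rule of thumb and is an assumption of this example, not something used in the original analysis; `diag_table` is an illustrative name.

```r
data("mtcars")
d <- mtcars
d$cyl <- factor(d$cyl)
d$amf <- factor(d$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + amf, data = d)

diag_table <- data.frame(
  leverage  = hatvalues(fit),
  cooks_d   = cooks.distance(fit),
  std_resid = rstandard(fit)
)

# Rule-of-thumb leverage cutoff: twice the average hat value (2 * p / n).
p <- length(coef(fit))
n <- nrow(d)
diag_table$high_leverage <- diag_table$leverage > 2 * p / n

# Five observations with the largest Cook's distance.
print(round(diag_table[order(-diag_table$cooks_d), ][1:5, 1:3], 3))

# Observation with the highest leverage.
print(names(which.max(diag_table$leverage)))
```

Comparing the two rankings shows whether the points with large residuals are the same ones that have high leverage, which is the distinction drawn in the discussion of the diagnostic plots.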