Executive Summary

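This report uses the “mtcars” data set that ships with R, which contains 32 observations of 11 variables, to answer two questions:

  • Is an automatic or manual transmission better for MPG?

  • How large is the MPG difference between automatic and manual transmissions?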

We find that cars with a manual transmission achieve higher MPG than cars with an automatic transmission. We also select a predictive model for MPG by backward model selection based on AIC. In that model, holding the other predictors constant, a car with a manual transmission is estimated to get 1.81 more MPG than a car with an automatic transmission, although this coefficient is not statistically significant at the 0.05 level (p = 0.21). The diagnostic plots further suggest that some of the model assumptions may not hold strongly.

Data Input

data("mtcars")
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
?mtcars
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Exploratory Data Analysis

We first visualize whether MPG differs between the two transmission types.

library(ggplot2)
# am is coded 0 = automatic, 1 = manual
mtcars$amf = factor(mtcars$am, labels = c("Auto", "Manual"))
ggplot(data = mtcars, aes(x = amf, y = mpg, fill = amf)) +
        geom_boxplot() +
        labs(title = "Boxplot of MPG by Transmission Type",
             x = "Transmission",
             y = "MPG") +
        theme_bw()

The boxplot shows a clear difference between the two groups: cars with a manual transmission tend to have higher MPG than cars with an automatic transmission.
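To put numbers on what the boxplot shows, the group means and medians can be computed directly. This is a minimal sketch using only base R; the factor `amf` is rebuilt here so the snippet runs on its own, and the object names `group_means` and `group_medians` are chosen for illustration.

```r
data("mtcars")

# Same coding as in the report: 0 = Auto, 1 = Manual
amf <- factor(mtcars$am, labels = c("Auto", "Manual"))

# Mean and median MPG within each transmission group
group_means   <- tapply(mtcars$mpg, amf, mean)
group_medians <- tapply(mtcars$mpg, amf, median)

round(rbind(mean = group_means, median = group_medians), 2)
```

The means describe the raw difference only; they do not adjust for other variables such as weight or horsepower.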

Is an automatic or manual transmission better for MPG?

Next, we set up a hypothesis test to determine whether manual transmissions give higher MPG than automatic transmissions. The null and alternative hypotheses are:

  • Null hypothesis H0: Mean MPG is the same for both transmission types.

  • Alternative hypothesis H1: Mean MPG is higher for manual transmissions than for automatic transmissions.

We set the significance level to alpha = 0.05.

library(statsr)
inference(y = mpg, x = amf, data = mtcars,type = "ht",statistic = "mean",
          method = "theoretical",null = 0, alternative = "less" )
## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_Auto = 19, y_bar_Auto = 17.1474, s_Auto = 3.834
## n_Manual = 13, y_bar_Manual = 24.3923, s_Manual = 6.1665
## H0: mu_Auto =  mu_Manual
## HA: mu_Auto < mu_Manual
## t = -3.7671, df = 12
## p_value = 0.0013

The p-value is 0.0013, which is below the significance level of 0.05. We therefore reject the null hypothesis and conclude that cars with a manual transmission have a higher mean MPG than cars with an automatic transmission.
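As a cross-check that does not depend on the statsr package, the same one-sided comparison can be sketched with base R's `t.test()`. By default this runs a Welch test with Satterthwaite degrees of freedom, whereas the output above reports df = 12 (the smaller group size minus one), so the t statistic should agree while the p-value may differ somewhat. The object name `welch` is just for illustration.

```r
data("mtcars")

# am is coded 0 = automatic, 1 = manual, so the tested difference is
# mean(automatic) - mean(manual); "less" matches HA: mu_Auto < mu_Manual
welch <- t.test(mpg ~ am, data = mtcars, alternative = "less")

welch$statistic   # t statistic
welch$parameter   # Welch-Satterthwaite degrees of freedom
welch$p.value     # one-sided p-value
```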

Quantify the MPG difference between automatic and manual transmissions

First, we prepare the data set. Several variables are categorical and should be converted to factors.

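# Drop the numeric am column (column 9); its factor version amf is kept
train = mtcars[,-9]
train$cyl = factor(train$cyl)
# vs is the engine shape (0 = V-shaped, 1 = straight), not the transmission
train$vs = factor(train$vs,labels = c("V-shaped","straight"))
train$gear = factor(train$gear)
train$carb = factor(train$carb)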

Now we carry out the model selection:

step_model <- step(lm(mpg ~ ., data = train),trace = 0)
summary(step_model)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + amf, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## amfManual    1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

Backward selection based on AIC yields a final model with the predictors cyl + hp + wt + amf. The model has a multiple R-squared of 0.87 (adjusted 0.84), so it explains about 87% of the variation in MPG. The coefficients for cyl6, hp, and wt are significant at the 0.05 level, whereas those for cyl8 and amfManual are not. The coefficient estimates themselves are also informative:
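The selection step can be made more explicit by naming the search direction and comparing the AIC of the full and reduced models. The following is a sketch under the same preprocessing as in the report; `full` and `reduced` are illustrative names, and `extractAIC()` is used because it is the criterion that `step()` itself minimizes.

```r
data("mtcars")

# Rebuild the modelling data: factor version of am, categorical predictors
train <- mtcars
train$amf  <- factor(train$am, labels = c("Auto", "Manual"))
train$am   <- NULL
train$cyl  <- factor(train$cyl)
train$vs   <- factor(train$vs)
train$gear <- factor(train$gear)
train$carb <- factor(train$carb)

# Start from the full model and drop terms while AIC keeps decreasing
full    <- lm(mpg ~ ., data = train)
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)
c(full = extractAIC(full)[2], reduced = extractAIC(reduced)[2])
```

AIC-based selection optimizes a trade-off between fit and model size, so the retained terms are not guaranteed to be individually significant.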

coef(step_model)
## (Intercept)        cyl6        cyl8          hp          wt   amfManual 
## 33.70832390 -3.03134449 -2.16367532 -0.03210943 -2.49682942  1.80921138

Holding the other variables constant, the coefficient of cyl6 means that a car with 6 cylinders is expected to get 3.03 fewer MPG than a car with 4 cylinders. The other coefficients can be interpreted in the same way.
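The "holding other variables constant" reading can be illustrated with `predict()`: two cars that differ only in cylinder count should differ in predicted MPG by exactly the cyl6 coefficient. The values hp = 110 and wt = 2.6 below are hypothetical and chosen only for illustration, as are the names `cars`, `new_cars`, and `cyl_gap`.

```r
data("mtcars")

# Refit the selected model so the snippet is self-contained
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$amf <- factor(cars$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + amf, data = cars)

# Two hypothetical cars that differ only in the number of cylinders
new_cars <- data.frame(
  cyl = factor(c("4", "6"), levels = levels(cars$cyl)),
  hp  = 110,
  wt  = 2.6,
  amf = factor("Auto", levels = levels(cars$amf))
)

pred <- predict(fit, newdata = new_cars)

# Predicted MPG of the 6-cylinder car minus that of the 4-cylinder car
cyl_gap <- unname(pred[2] - pred[1])
cyl_gap
```

Because the model is additive, the gap is the same whatever values are chosen for hp, wt, and amf.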

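We can also quantify the MPG difference between automatic and manual transmissions: holding the other variables constant, a car with a manual transmission is expected to get 1.81 more MPG than a car with an automatic transmission. Note that this estimate has a p-value of 0.21 in the model summary, so it is not statistically distinguishable from zero at the 0.05 level.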
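A confidence interval conveys the uncertainty around the transmission effect more directly than the point estimate alone. The sketch below refits the selected model and calls `confint()` on the amfManual coefficient; the names `cars`, `fit`, and `ci` are illustrative.

```r
data("mtcars")

# Refit the selected model so the snippet is self-contained
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$amf <- factor(cars$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + amf, data = cars)

# 95% confidence interval for the adjusted Manual vs Auto difference
ci <- confint(fit, parm = "amfManual", level = 0.95)
ci
```

An interval that includes zero is consistent with the non-significant p-value for this coefficient in the model summary.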

Diagnostic Plot

par(mfrow = c(2,2))
plot(step_model)

In the Residuals vs Fitted plot, the residuals are spread roughly evenly on both sides of zero, although a few points have noticeably larger residuals than the rest.

In the normal Q-Q plot, not all points lie close to the reference line, which suggests that the residuals may not be normally distributed.

The plots also label three potential outliers: Toyota Corolla, Chrysler Imperial, and Toyota Corona. All three have low leverage but are relatively influential. The point with the highest leverage is Maserati Bora, which has low influence in this case.

which.max(hatvalues(step_model))
## Maserati Bora 
##            31
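The visual diagnostics can be complemented with numeric ones. The sketch below lists the largest leverage values and Cook's distances and runs a Shapiro-Wilk test on the residuals; this test is an addition for illustration and is not part of the original analysis, and the names `lev` and `cook` are arbitrary.

```r
data("mtcars")

# Refit the selected model so the snippet is self-contained
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$amf <- factor(cars$am, labels = c("Auto", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + amf, data = cars)

lev  <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)   # influence of each observation on the fit

# The three most extreme cars on each measure
head(sort(lev,  decreasing = TRUE), 3)
head(sort(cook, decreasing = TRUE), 3)

# Formal check of residual normality, to accompany the Q-Q plot
shapiro.test(residuals(fit))
```

With only 32 observations, a normality test has limited power, so it is best read together with the Q-Q plot rather than in place of it.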