Executive Summary

The 1974 Motor Trend US magazine dataset (mtcars) is used to evaluate the effect of transmission design on mpg (miles per gallon) in automobiles. Simply put, we ask whether transmission type (automatic or manual) affects mpg and, if so, by how much.

Dataset Description

The dataset consists of a data frame with 32 observations (nrow) and 11 variables (ncol).
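The dimensions quoted above can be confirmed directly. This is a minimal sketch; `d` is an illustrative local copy of the built-in dataset, not a name used elsewhere in the report.

```r
# Confirm the size of the built-in dataset.
d <- datasets::mtcars   # 'd' is an illustrative local copy

dim(d)     # number of rows, then number of columns
nrow(d)    # observations (one per car model)
ncol(d)    # variables
names(d)   # variable names, including mpg and am
```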

Loading & Processing & Exploring

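# load
library(car)      # vif()
library(dplyr)    # %>%, arrange(), select()
library(GGally)   # ggpairs(), wrap()
library(ggplot2)  # labs()
data("mtcars")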

# transform
mtcars$cyl = factor(mtcars$cyl)
mtcars$vs = factor(mtcars$vs)
mtcars$gear = factor(mtcars$gear)
mtcars$carb = factor(mtcars$carb)
mtcars$am = factor(mtcars$am, labels = c("Automatic", "Manual"))

# print
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

In this section, we deep dive into our data and explore various relationships between variables of interest.

Initially, we plot the relationships between all the variables of the dataset (Figure.1 in the appendix). From the plot, we notice that most of the variables appear to be correlated with mpg, so we use linear models to identify and quantify those relationships.

Since we are interested in the effect of transmission type on mpg, we draw a boxplot of mpg for automatic and manual cars (Figure.2 in the appendix). The plot shows higher mpg when the transmission is manual.
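The pattern seen in the boxplot can also be summarized numerically. The sketch below computes the mean and median mpg per transmission group; `d`, `grp_mean` and `grp_median` are illustrative names, and the data is re-prepared locally so the snippet runs on its own.

```r
# Numeric counterpart of the boxplot: mpg by transmission type.
d <- datasets::mtcars
d$am <- factor(d$am, labels = c("Automatic", "Manual"))  # 0 = automatic, 1 = manual

grp_mean   <- tapply(d$mpg, d$am, mean)
grp_median <- tapply(d$mpg, d$am, median)

# One column per group, one row per statistic.
round(rbind(mean = grp_mean, median = grp_median), 1)
```

The group means are the same values that the t-test later in the report lists as sample estimates.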

Regression Analysis & Inference

# fit.best
init.mod = lm(mpg ~ ., data = mtcars)
best.mod = step(init.mod, direction = "both", trace = FALSE)

# print
summary(best.mod)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.939 -1.256 -0.401  1.125  5.051 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.7083     2.6049   12.94  7.7e-13 ***
## cyl6         -3.0313     1.4073   -2.15   0.0407 *  
## cyl8         -2.1637     2.2843   -0.95   0.3523    
## hp           -0.0321     0.0137   -2.35   0.0269 *  
## wt           -2.4968     0.8856   -2.82   0.0091 ** 
## amManual      1.8092     1.3963    1.30   0.2065    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.866,  Adjusted R-squared:  0.84 
## F-statistic: 33.6 on 5 and 26 DF,  p-value: 1.51e-10

The best model selected by the stepwise procedure above contains the variables cyl, hp, wt and am. Its adjusted R-squared is 0.84, so the model explains about 84% of the variability in mpg.
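To make the adjusted R-squared figure less of a black box, the sketch below recomputes it from the multiple R-squared using the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of slope terms. The names `d`, `fit`, `s` and `adj_manual` are illustrative, and the model formula is copied from the summary output above.

```r
# Recompute adjusted R-squared by hand for the selected model.
d <- datasets::mtcars
d$cyl <- factor(d$cyl)
d$am  <- factor(d$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ cyl + hp + wt + am, data = d)
s   <- summary(fit)

n <- nrow(d)                  # 32 observations
p <- length(coef(fit)) - 1    # 5 slope terms (intercept excluded)

adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

round(c(r.squared = s$r.squared, adjusted = adj_manual), 3)
all.equal(adj_manual, s$adj.r.squared)   # agrees with summary()
```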

# fit.base
base.mod = lm(mpg ~ am, data = mtcars)

# print
summary(base.mod)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.392 -3.092 -0.297  3.244  9.508 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    17.15       1.12   15.25  1.1e-15 ***
## amManual        7.24       1.76    4.11  0.00029 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.9 on 30 degrees of freedom
## Multiple R-squared:  0.36,   Adjusted R-squared:  0.338 
## F-statistic: 16.9 on 1 and 30 DF,  p-value: 0.000285

Using only the variable of interest (am) to predict mpg, the adjusted R-squared is 0.338, so the base model explains only about 34% of the variability in mpg.
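With a single two-level factor as predictor, the base model is just a comparison of group means: the intercept is the mean mpg of automatic cars and the amManual coefficient is the difference between the two group means. The following sketch checks this; `d`, `fit0` and `m` are illustrative names.

```r
# The base model's coefficients expressed as group means.
d <- datasets::mtcars
d$am <- factor(d$am, labels = c("Automatic", "Manual"))

fit0 <- lm(mpg ~ am, data = d)
m    <- tapply(d$mpg, d$am, mean)

coef(fit0)
c(intercept = m[["Automatic"]],
  slope     = m[["Manual"]] - m[["Automatic"]])
```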

anova(base.mod, best.mod)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
##   Res.Df RSS Df Sum of Sq    F  Pr(>F)    
## 1     30 721                              
## 2     26 151  4       570 24.5 1.7e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value (1.7e-08) is below 0.05, we reject H0 that the additional terms have zero coefficients and conclude that the two models are not equivalent: the variables cyl, hp and wt do contribute to the accuracy of the model.
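The F statistic in the ANOVA table can be reproduced from the residual sums of squares of the two nested models. This is a sketch under the assumption that both models are refit locally; `fit0`, `fit1`, `f_stat` and `p_val` are illustrative names.

```r
# Nested-model F test computed by hand.
d <- datasets::mtcars
d$cyl <- factor(d$cyl)
d$am  <- factor(d$am, labels = c("Automatic", "Manual"))

fit0 <- lm(mpg ~ am, data = d)                   # base model
fit1 <- lm(mpg ~ cyl + hp + wt + am, data = d)   # selected model

rss0 <- sum(resid(fit0)^2); df0 <- df.residual(fit0)
rss1 <- sum(resid(fit1)^2); df1 <- df.residual(fit1)

# Drop in RSS per added parameter, relative to the residual variance
# of the larger model.
f_stat <- ((rss0 - rss1) / (df0 - df1)) / (rss1 / df1)
p_val  <- pf(f_stat, df0 - df1, df1, lower.tail = FALSE)

c(F = f_stat, p = p_val)
```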

t.test(mpg ~ am, data = mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -4, df = 18, p-value = 0.001
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.28  -3.21
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                    17.1                    24.4

We also perform a Welch two-sample t-test, assuming that mpg is approximately normally distributed within each transmission group. The mean mpg of manual and automatic cars differs significantly (p-value = 0.001 < 0.05).
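For readers who want to see what the Welch test computes, the sketch below rebuilds the t statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `t.test()`. The names `x`, `y`, `se2`, `t_stat`, `df_w` and `tt` are illustrative; the raw 0/1 coding of am is used directly.

```r
# Welch two-sample t-test by hand (unequal variances).
d <- datasets::mtcars
x <- d$mpg[d$am == 0]   # automatic
y <- d$mpg[d$am == 1]   # manual

vx <- var(x) / length(x)
vy <- var(y) / length(y)
se2 <- vx + vy                        # squared standard error of the difference

t_stat <- (mean(x) - mean(y)) / sqrt(se2)
df_w   <- se2^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
p_val  <- 2 * pt(-abs(t_stat), df_w)  # two-sided p-value

tt <- t.test(mpg ~ am, data = d)      # Welch is the default
c(t = t_stat, df = df_w, p = p_val)
```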

Assumption Checking

par(mfrow = c(2, 2))
plot(best.mod)

The four diagnostic plots produced above (Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage) let us check the assumptions that need to hold for a regression model.

influence = dfbetas(best.mod)
tail(sort(influence[, 6]), 3)
## Chrysler Imperial          Fiat 128     Toyota Corona 
##             0.351             0.429             0.731
# sum((abs(dfbetas(best.mod)))>1) # commonly accepted cutoff for an influential point

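An influential point is an observation whose removal would noticeably change the fitted coefficients; it typically combines an unusual value of Y with enough leverage to move the line. dfbetas measures this for each coefficient separately (column 6 above is amManual), while Cook's distance summarizes it across all coefficients.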
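Since Cook's distance is mentioned but not computed, here is a minimal sketch of how it could be obtained for the selected model. The cutoff of 4/n is a common rule of thumb and is an assumption here, not something the report uses; `d`, `fit`, `cd` and `cutoff` are illustrative names.

```r
# Cook's distance for the selected model.
d <- datasets::mtcars
d$cyl <- factor(d$cyl)
d$am  <- factor(d$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + am, data = d)

cd <- cooks.distance(fit)
tail(sort(cd), 3)          # the three most influential cars overall

cutoff <- 4 / nrow(d)      # rule-of-thumb threshold (assumption)
names(cd)[cd > cutoff]     # cars flagged by that threshold
```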

leverage = hatvalues(best.mod)
tail(sort(leverage), 3)
##       Toyota Corona Lincoln Continental       Maserati Bora 
##               0.278               0.294               0.471

A leverage point has extreme values of X, so it has greater potential to move the fitted line. Whether it actually does so (that is, whether it is influential) depends on how far its response lies from the overall pattern.
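One way to judge whether a hat value is large is to compare it with the average leverage p/n, where p is the number of estimated coefficients. The sketch below uses the common 2p/n rule of thumb, which is an assumption and not part of the original analysis; `d`, `fit`, `lev` and `cutoff` are illustrative names.

```r
# Flag high-leverage cars using the 2p/n rule of thumb.
d <- datasets::mtcars
d$cyl <- factor(d$cyl)
d$am  <- factor(d$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + am, data = d)

lev <- hatvalues(fit)
p   <- length(coef(fit))   # 6 coefficients, intercept included
n   <- nrow(d)             # 32 cars

sum(lev)                   # hat values always sum to p
cutoff <- 2 * p / n        # 0.375 (assumed threshold)
names(lev)[lev > cutoff]
```

With the hat values reported above, only the Maserati Bora (0.471) exceeds this threshold.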

data.frame(vif(best.mod)) %>% arrange(GVIF) %>% select(GVIF) %>% t()
##        am   wt  hp  cyl
## GVIF 2.59 4.01 4.7 5.82

The generalized variance inflation factor (GVIF) is a measure of collinearity: the larger the value, the more strongly a term is linearly related to the other predictors in the model. am has the lowest GVIF (2.59), so it is comparatively the most independent predictor in the model.
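For a term with a single coefficient, the GVIF coincides with the ordinary variance inflation factor 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below does this for wt in base R, without the car package; `d`, `aux` and `vif_wt` are illustrative names.

```r
# Variance inflation factor for wt, computed from an auxiliary regression.
d <- datasets::mtcars
d$cyl <- factor(d$cyl)
d$am  <- factor(d$am, labels = c("Automatic", "Manual"))

# Regress wt on the remaining predictors of the selected model.
aux <- lm(wt ~ cyl + hp + am, data = d)

vif_wt <- 1 / (1 - summary(aux)$r.squared)
round(vif_wt, 2)   # comparable to the GVIF reported for wt
```

The closer the auxiliary R-squared is to 1, the more of wt is already explained by the other predictors and the larger the inflation of its coefficient's variance.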

Conclusion

Based on the analysis results, we can conclude the following:

  1. Manual transmission is associated with better fuel economy than automatic transmission: about 1.8 more miles per gallon with cyl, hp and wt held fixed. This adjusted difference is not statistically significant (p = 0.21), whereas the unadjusted difference of 7.24 mpg is.

  2. Additionally, type of transmission is the most independent predictor in the model. However, wt, hp and cyl appear to be more statistically significant than am when determining mpg.
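The uncertainty around the adjusted transmission effect of about 1.8 mpg can be made explicit with a confidence interval for the amManual coefficient. This is a sketch assuming the selected model is refit locally; `d`, `fit` and `ci` are illustrative names.

```r
# 95% confidence interval for the adjusted manual-vs-automatic difference.
d <- datasets::mtcars
d$cyl <- factor(d$cyl)
d$am  <- factor(d$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + am, data = d)

ci <- confint(fit, "amManual", level = 0.95)
round(ci, 2)

# Same interval by hand: estimate +/- t quantile * standard error.
est <- coef(summary(fit))["amManual", "Estimate"]
se  <- coef(summary(fit))["amManual", "Std. Error"]
est + c(-1, 1) * qt(0.975, df.residual(fit)) * se
```

Because the coefficient's p-value (0.2065) is above 0.05, the interval includes zero, which is why the adjusted effect should be read as suggestive and not conclusive.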

Appendix

Figure.1 Overview of Dataset
g = ggpairs(mtcars,
            lower = list(continuous = wrap("smooth", method = "lm"))) +
    labs(caption = "Figure.1")
g

Figure.2 Boxplot of MPG vs Transmission
g[1, 9] + 
    labs(title = "Boxplot of MPG vs Transmission",
         x = "Transmission", 
         y = "Miles per Gallon",
         caption = "Figure.2")

Figure.3 Boxplot of Mileage by Cylinder
g[1, 2] + 
    labs(title = "Boxplot of Mileage by Cylinder",
         x = "Number of Cylinders", 
         y = "Miles per Gallon",
         caption = "Figure.3")

Figure.4 Regression Plot of Mileage by Gross Horsepower
g[4, 1] + 
    labs(title = "Regression Plot of Mileage by Gross Horsepower",
         y = "Gross Horsepower", 
         x = "Miles per Gallon",
         caption = "Figure.4")

Figure.5 Regression Plot of Mileage by Weight
g[6, 1] + 
    labs(title = "Regression Plot of Mileage by Weight",
         y = "Weight (lb / 1000)", 
         x = "Miles per Gallon",
         caption = "Figure.5")