Summary

I explored the “mtcars” data set to investigate the influence of a car’s transmission type (am) on its fuel consumption. Two questions were of interest:

Is an automatic or manual transmission better for MPG (Miles Per Gallon)?

If the car’s weight (wt) is not considered (or is formally set to wt = 0), manual transmission is better for MPG. Remark: The two groups barely overlap in weight. There are no cars with automatic transmission below roughly 2,500 lbs and no cars with manual transmission above 3,800 lbs.

Quantify the MPG difference between automatic and manual transmissions.

The MPG difference depends strongly on weight (wt), horsepower (hp), and the car’s transmission type. The table below gives the expected MPG for given values of wt and hp (first quartile, centre, third quartile) and for each transmission type; (1) marks manual and (0) marks automatic transmission. Manual transmission is better for cars with lower (Q25%) wt and hp, whereas automatic transmission is better for cars with higher (Q75%) wt and hp. Note that the columns labelled Q50% use the sample means of wt and hp (3.22 and 146.69), not the medians.

Model: fit <- lm(mpg ~ wt * factor(am) + hp, data = mtcars)

##        Q25%(1) Q25%(0) Q50%(1) Q50%(0) Q75%(1) Q75%(0) Dif25 Dif50 Dif75
## wt        2.58    2.58    3.22    3.22    3.61    3.61  0.00  0.00  0.00
## hp       96.50   96.50  146.69  146.69  180.00  180.00  0.00  0.00  0.00
## am        1.00    0.00    1.00    0.00    1.00    0.00  1.00  1.00  1.00
## E[MPG]   24.17   21.85   18.94   18.90   15.65   17.02  2.32  0.04 -1.36
## lwr      19.09   16.65   13.48   13.92    9.78   12.07  2.45 -0.44 -2.29
## upr      29.25   27.06   24.41   23.88   21.52   21.96  2.19  0.53 -0.44

Quantification of the difference = manual - automatic

  • +2.32 MPG for Q25: wt = 2.58, hp = 96.5 [manual better than automatic]
  • +0.04 MPG for Q50: wt = 3.22, hp = 146.69 [practically no difference; manual marginally ahead]
  • -1.36 MPG for Q75: wt = 3.61, hp = 180 [automatic better than manual]

Data exploration

Looking at two basic models.

Model 1: lm(mpg ~ am, data = mtcars)

summary(lm(mpg ~ am, data = mtcars))$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04

Conclusion: The model (only “am” considered) estimates a statistically significant increase of 7.245 mpg when switching from automatic (0) to manual (1) transmission. Cars with automatic transmission average 17.15 mpg (the intercept), and cars with manual transmission average 24.39 mpg (17.15 + 7.24).
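Because am is the only predictor, the fitted values of Model 1 are simply the two group means. The following is a minimal sketch to check this. It assumes an unmodified copy of the built-in data; the names `cars`, `fit_am`, `group_means`, and `model_means` are introduced here for illustration only.

```r
# Work on a fresh copy so that any recoding of mtcars$am does not interfere
cars <- datasets::mtcars

fit_am <- lm(mpg ~ am, data = cars)

# Group means of mpg: am = 0 (automatic), am = 1 (manual)
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept = automatic mean; intercept + slope = manual mean
b <- unname(coef(fit_am))
model_means <- c(b[1], b[1] + b[2])

print(round(group_means, 2))
print(round(model_means, 2))
```

Both print statements should show 17.15 for automatic and 24.39 for manual cars, in line with the coefficient table above.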

Model 2: lm(mpg ~ ., data = mtcars)

summary(lm(mpg ~., data = mtcars))$coefficients
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## vs           0.31776281  2.10450861  0.1509915 0.88142347
## am           2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871

Conclusion: The model (all variables considered) estimates an increase of 2.52 mpg when switching from automatic (0) to manual (1) transmission while holding all other variables constant. With a p-value of 0.234 this estimate is not statistically significant; in fact, no coefficient in this model is significant at the 5% level. The signs indicate that weight (wt), cylinders (cyl), horsepower (hp), and carburetors (carb) have a negative effect on MPG, whereas the transmission type (shifting from automatic to manual) and the number of gears have a positive effect.

Model 3

Identifying the needed variables for model 3 via nested model testing
fit1 <- lm(mpg ~ am, data = mtcars)
fit3 <- update(fit1, mpg ~ am + wt)
fit5 <- update(fit3, mpg ~ am + wt + hp)
fit7 <- update(fit5, mpg ~ am + wt + hp + factor(cyl))
fit9 <- update(fit7, mpg ~ am + wt + hp + factor(cyl) + factor(gear))
fit11 <- update(fit9, mpg ~ am + wt + hp + factor(cyl) + factor(gear)+ disp + factor(vs) + qsec + factor(carb))
anova(fit1, fit3, fit5, fit7, fit9, fit11 )
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + hp
## Model 4: mpg ~ am + wt + hp + factor(cyl)
## Model 5: mpg ~ am + wt + hp + factor(cyl) + factor(gear)
## Model 6: mpg ~ am + wt + hp + factor(cyl) + factor(gear) + disp + factor(vs) + 
##     qsec + factor(carb)
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 278.32  1    442.58 57.9367 1.051e-06 ***
## 3     28 180.29  1     98.03 12.8327  0.002491 ** 
## 4     26 151.03  2     29.27  1.9155  0.179552    
## 5     24 149.67  2      1.36  0.0891  0.915249    
## 6     16 122.22  8     27.44  0.4490  0.873734    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion: Based on the variance table, which is the outcome of the nested model testing, I chose model 3 with lm(mpg ~ am + wt + hp). Remark: Each p-value in the variance table belongs to the hypothesis test of whether the coefficients of the newly added variables are all zero (i.e. whether or not the variables are necessary). It remains unclear whether there are interactions between the variables.
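The F statistics in the variance table can be reproduced by hand. When several models are passed to anova(), each comparison is scaled by the residual variance of the largest model in the sequence, which agrees with the numbers above: 442.58 / (122.22 / 16) ≈ 57.9. The sketch below checks the first comparison; `cars`, `rss`, `sigma2_full`, and `f_manual` are illustrative names, not part of the original analysis.

```r
cars <- datasets::mtcars

fit1  <- lm(mpg ~ am, data = cars)
fit3  <- lm(mpg ~ am + wt, data = cars)
fit11 <- lm(mpg ~ am + wt + hp + factor(cyl) + factor(gear) + disp +
              factor(vs) + qsec + factor(carb), data = cars)

# Residual sum of squares of a fitted model
rss <- function(model) sum(resid(model)^2)

# Noise variance estimated from the largest model
sigma2_full <- rss(fit11) / df.residual(fit11)

# F statistic for adding wt to the am-only model
df_added <- df.residual(fit1) - df.residual(fit3)
f_manual <- ((rss(fit1) - rss(fit3)) / df_added) / sigma2_full

tab <- anova(fit1, fit3, fit11)
print(c(manual = f_manual, anova = tab$F[2]))
```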

Model 3*: fitint<- lm(mpg ~ am + wt + hp, data = mtcars)

Deriving Model 3: Considering correlations between wt, am, and hp

Exploration

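library(plotly)  # provides plot_ly(), add_markers(), layout() and the %>% pipe
# Recode am (0/1) as a labelled factor; this changes mtcars for the rest of the session
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c('Automatic', 'Manual'))
# 3D scatter plot of mpg against weight and horsepower, coloured by transmission type
p <- plot_ly(mtcars, x = ~wt, y = ~hp, z = ~mpg, color = ~am, colors = c('salmon', 'lightblue')) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Weight'),
                     yaxis = list(title = 'Gross horsepower'),
                     zaxis = list(title = 'Miles Per Gallon')))
p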

The scatterplot matrix below illustrates the association between MPG, weight, horsepower, and transmission type.

library(ggplot2)
library(GGally)  # provides ggpairs()
# Lower-panel function: scatter plot with a smoother (loess by default)
my_fn <- function(data, mapping, method = "loess", ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point() +
    geom_smooth(method = method, ...)
}
g <- ggpairs(mtcars[, c("mpg", "wt", "hp", "am")], lower = list(continuous = my_fn))
g
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Conclusions: The plots show strong correlations between wt, hp, and am. The next step investigates whether these variables interact and whether the model needs to be adjusted.
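The associations can also be checked numerically. The following sketch computes the Pearson correlation matrix for the four variables on a fresh copy of the data, where am is still coded 0/1 (so its correlations are point-biserial). The names `cars` and `cor_mat` are illustrative.

```r
cars <- datasets::mtcars

# Pearson correlations between mpg, weight, horsepower, and transmission (0/1)
cor_mat <- cor(cars[, c("mpg", "wt", "hp", "am")])
print(round(cor_mat, 2))
```

A negative entry for wt and am means that manual cars (am = 1) tend to be lighter, which is the pattern examined in the next section.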

Data Exploration II of Model 3:

Distribution of manual & automatic transmission as a function of wt

The plot shows a dependency between weight and transmission type. Green dots indicate manual and red dots indicate automatic transmission; hp is not considered.

require(ggplot2)
fitint<-lm(mpg ~ wt * factor(am), data = mtcars)
summary(fitint)$coef
##                      Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)         31.416055  3.0201093 10.402291 4.001043e-11
## wt                  -3.785908  0.7856478 -4.818836 4.551182e-05
## factor(am)Manual    14.878423  4.2640422  3.489277 1.621034e-03
## wt:factor(am)Manual -5.298360  1.4446993 -3.667449 1.017148e-03
# Automatic cars (reference level): intercept i11, slope s11
s11 <- coef(fitint)[2]; i11 <- coef(fitint)[1]
# Manual cars: add the am main effect and the wt:am interaction
s12 <- s11 + coef(fitint)[4]; i12 <- i11 + coef(fitint)[3]
g <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am)))
g <- g + geom_point(size = 3, colour = "black") + geom_point(size = 4)
g <- g + xlab("Weight") + ylab("Miles per Gallon")
g <- g + geom_abline(slope = s11, intercept = i11, colour = "salmon")
g <- g + geom_abline(slope = s12, intercept = i12, colour = "lightblue")
g

Finding: Cars with automatic and manual transmissions fall into two distinct weight groups. Lighter cars are mostly equipped with manual transmission and heavier cars with automatic transmission. The plot indicates an interaction between weight and transmission: the influence of weight on MPG changes with the transmission type. The effect is described by two different slopes and intercepts.
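With a full interaction between wt and am, the two lines implied by the model are the same as two separate simple regressions, one per transmission type. The sketch below checks this on a fresh copy of the data (am coded 0/1); the names `cars`, `fit_int`, `line_auto`, `line_manual`, `sep_auto`, and `sep_manual` are illustrative.

```r
cars <- datasets::mtcars

fit_int <- lm(mpg ~ wt * factor(am), data = cars)
b <- unname(coef(fit_int))  # order: intercept, wt, am main effect, wt:am

# Lines implied by the interaction model
line_auto   <- c(b[1], b[2])
line_manual <- c(b[1] + b[3], b[2] + b[4])

# Separate regressions within each transmission group
sep_auto   <- unname(coef(lm(mpg ~ wt, data = cars, subset = am == 0)))
sep_manual <- unname(coef(lm(mpg ~ wt, data = cars, subset = am == 1)))

print(rbind(line_auto, sep_auto, line_manual, sep_manual))
```

The rows agree pairwise: an intercept of about 31.42 and a slope of about -3.79 for automatic cars, and an intercept of about 46.29 and a slope of about -9.08 for manual cars.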

Model adjustment

\[ \begin{aligned} E[Y_{i}|x_{1},x_{2},x_{3}] &= \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\beta_{3}x_{3} \\ E[Y_{i}|x_{1}=hp,x_{2}=wt,x_{3}=am] &= \beta_{0}+\beta_{1}hp+\beta_{2}wt+\beta_{3}am \end{aligned} \]

Interaction between wt and am

\[ \begin{aligned} E[Y_{i}|x_{1}=hp,x_{2}=wt,x_{3}=am] &= \beta_{0}+\beta_{1}hp+\beta_{2}wt+\beta_{3}am+\beta_{4}wt\times am \\ E[Y_{i}|x_{1}=hp,x_{2}=wt,x_{3}=am=1] &= \beta_{0}+\beta_{1}hp+\beta_{3}am+(\beta_{2}+\beta_{4})\times wt \\ E[Y_{i}|x_{1}=hp,x_{2}=wt,x_{3}=am=0] &= \beta_{0}+\beta_{1}hp+\beta_{2}wt \end{aligned} \]

Distribution of manual & automatic transmission as a function of hp; wt is not considered.

require(ggplot2)
fitint<-lm(mpg ~ hp * factor(am), data = mtcars)
summary(fitint)$coef
##                          Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)         26.6248478696 2.18294320 12.19676624 1.014017e-12
## hp                  -0.0591369818 0.01294486 -4.56837583 9.018508e-05
## factor(am)Manual     5.2176533777 2.66509311  1.95777527 6.028998e-02
## hp:factor(am)Manual  0.0004028907 0.01646022  0.02447662 9.806460e-01
# Automatic cars (reference level): intercept i11, slope s11
s11 <- coef(fitint)[2]; i11 <- coef(fitint)[1]
# Manual cars: add the am main effect and the hp:am interaction
s12 <- s11 + coef(fitint)[4]; i12 <- i11 + coef(fitint)[3]
g <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(am)))
g <- g + geom_point(size = 3, colour = "black") + geom_point(size = 4)
g <- g + xlab("Horse Power") + ylab("Miles per Gallon")
g <- g + geom_abline(slope = s11, intercept = i11, colour = "salmon")
# Dashed horizontal lines mark the group means of mpg. am was recoded to
# 'Automatic'/'Manual' above, so the groups must be selected by label
# (am == 0 or am == 1 matches no rows and yields NaN intercepts)
g <- g + geom_abline(slope = 0, intercept = mean(mtcars$mpg[mtcars$am == "Automatic"]), colour = "salmon", lty = 2)
g <- g + geom_abline(slope = s12, intercept = i12, colour = "lightblue")
g <- g + geom_abline(slope = 0, intercept = mean(mtcars$mpg[mtcars$am == "Manual"]), colour = "lightblue", lty = 2)
g

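Finding: The difference between MPG[am=1] and MPG[am=0] is practically independent of hp: the two regression lines are almost parallel, and the interaction coefficient is close to zero (0.0004, p = 0.98). There is no evidence of an interaction between hp and am.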
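The same question can be put as a nested model test, as was done for the variable selection: does adding the hp:am interaction improve on the additive model? This is a minimal sketch on a fresh copy of the data; `cars`, `fit_add`, `fit_hpint`, and `p_interaction` are illustrative names.

```r
cars <- datasets::mtcars

fit_add   <- lm(mpg ~ hp + factor(am), data = cars)  # parallel lines
fit_hpint <- lm(mpg ~ hp * factor(am), data = cars)  # separate slopes

# F test for the single added interaction term
cmp <- anova(fit_add, fit_hpint)
p_interaction <- cmp[["Pr(>F)"]][2]

print(cmp)
print(p_interaction)
```

Because only one coefficient is added, this F test is equivalent to the t test of the interaction term in the coefficient table above, so the p-value is the same (about 0.98).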

Model 3: lm(mpg ~ wt * factor(am) + hp, data = mtcars)

Based on the results derived above, the model is not adjusted any further. \[ \begin{aligned} E[Y_{i}| am=1] &= \beta_{0}+\beta_{1}hp+\beta_{3}am+(\beta_{2}+\beta_{4})\times wt \\ E[Y_{i}| am=0] &= \beta_{0}+\beta_{1}hp+\beta_{2}wt \end{aligned} \]

Check Model 3: lm(mpg ~ wt * factor(am) + hp, data = mtcars)

fit<-lm(mpg ~ wt * factor(am) + hp, data = mtcars)
summary(fit)  
## 
## Call:
## lm(formula = mpg ~ wt * factor(am) + hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0639 -1.3315 -0.9347  1.2180  5.0822 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         30.947333   2.723411  11.363 8.55e-12 ***
## wt                  -2.515586   0.844497  -2.979  0.00605 ** 
## factor(am)Manual    11.554813   4.023277   2.872  0.00784 ** 
## hp                  -0.026949   0.009796  -2.751  0.01048 *  
## wt:factor(am)Manual -3.577910   1.442796  -2.480  0.01968 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.332 on 27 degrees of freedom
## Multiple R-squared:  0.8696, Adjusted R-squared:  0.8503 
## F-statistic: 45.01 on 4 and 27 DF,  p-value: 1.451e-11

Conclusion: For am = 0 (automatic), wt = 0, and hp = 0 the intercept is 30.95 MPG. The slope indicates a significant decrease of 2.52 MPG per 1000 lbs (p = 0.006), holding all other variables constant. For am = 1 (manual), wt = 0, and hp = 0 the intercept is 42.50 MPG (30.95 + 11.55). The slope indicates a decrease of 6.09 MPG per 1000 lbs (-2.52 - 3.58), holding all other variables constant; the additional decline relative to automatic cars is significant (p = 0.020).
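From the coefficient table, the expected difference manual - automatic at a given weight is the am main effect plus the interaction coefficient times wt, i.e. about 11.55 - 3.58 × wt; the hp term cancels. Setting this to zero gives the weight at which both transmission types are expected to perform equally. The sketch below computes that break-even weight on a fresh copy of the data (am coded 0/1, so the coefficient names end in "1"); `cars`, `fit3`, `diff_at`, and `wt_break_even` are illustrative names.

```r
cars <- datasets::mtcars

fit3 <- lm(mpg ~ wt * factor(am) + hp, data = cars)
b <- coef(fit3)

# Expected MPG difference (manual - automatic) at weight wt, in 1000 lbs
diff_at <- function(wt) b[["factor(am)1"]] + b[["wt:factor(am)1"]] * wt

# Weight at which the expected difference is zero
wt_break_even <- -b[["factor(am)1"]] / b[["wt:factor(am)1"]]

print(round(wt_break_even, 2))
print(round(diff_at(c(2.58, 3.61)), 2))
```

The break-even weight is about 3.23, i.e. roughly 3,230 lbs. Below it the model favours manual transmission, above it automatic transmission.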

Assess the Model 3 fit by checking the residuals

data(mtcars); par(mfrow = c(2, 2))
fitint<-lm(mpg ~ wt*factor(am) + hp, data = mtcars); plot(fitint)

  • Residuals vs. fitted plot: No pattern is identifiable, which indicates a proper model fit.

  • Normal QQ plot: The residuals appear approximately normally distributed, which indicates a proper model fit.

  • Scale-location plot (standardized residuals): The standardized residuals show no pattern, which indicates a proper model fit.

  • Residuals vs. leverage plot, to see whether specific points (cars) distort the overall model results: No outliers are identified that would exert undue leverage and influence on the MPG estimates.
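The visual checks can be complemented with a few numbers. The sketch below is an addition to the original analysis, not part of it: it runs a Shapiro-Wilk normality test on the residuals and lists the cars with the largest leverage (hat values) and Cook's distances. The names `cars`, `fit3`, `sw`, `lev`, and `cook` are illustrative.

```r
cars <- datasets::mtcars

fit3 <- lm(mpg ~ wt * factor(am) + hp, data = cars)

# Normality of the residuals (complements the QQ plot)
sw <- shapiro.test(resid(fit3))

# Leverage and influence per car (complements the residuals vs. leverage plot)
lev  <- hatvalues(fit3)
cook <- cooks.distance(fit3)

print(sw)
print(head(sort(lev, decreasing = TRUE), 3))
print(head(sort(cook, decreasing = TRUE), 3))
```

The hat values sum to the number of model coefficients (here 5), so the average leverage is 5/32; cars far above that average deserve a closer look.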

Quantify the differences between manual and automatic transmission

For comparing the mpg values of automatic and manual transmission I calculated MPG for a set of different weight and horsepower values.

The table gives an overview of the difference in E[MPG] (expected value, with lower and upper interval bounds lwr and upr) for a given weight and horsepower (quantiles) and manual or automatic transmission; see the columns Dif25, Dif50, and Dif75. Above a certain weight, E[MPG] is higher for an automatic transmission; in this model the difference does not depend on hp.

##        Q25%(1) Q25%(0) Q50%(1) Q50%(0) Q75%(1) Q75%(0) Dif25 Dif50 Dif75
## wt        2.58    2.58    3.22    3.22    3.61    3.61  0.00  0.00  0.00
## hp       96.50   96.50  146.69  146.69  180.00  180.00  0.00  0.00  0.00
## am        1.00    0.00    1.00    0.00    1.00    0.00  1.00  1.00  1.00
## E[MPG]   24.17   21.85   18.94   18.90   15.65   17.02  2.32  0.04 -1.36
## lwr      19.09   16.65   13.48   13.92    9.78   12.07  2.45 -0.44 -2.29
## upr      29.25   27.06   24.41   23.88   21.52   21.96  2.19  0.53 -0.44

Quantification of the difference = manual - automatic

  • +2.32 MPG for Q25: wt = 2.58, hp = 96.5 [manual better than automatic]
  • +0.04 MPG for Q50: wt = 3.22, hp = 146.69 [practically no difference; manual marginally ahead]
  • -1.36 MPG for Q75: wt = 3.61, hp = 180 [automatic better than manual]
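The code that produced the table is not shown, so the following is only a sketch of how the E[MPG] row and the differences could be reproduced. It rests on two assumptions: the Q50 columns use the sample means of wt and hp (3.22 and 146.69 are the means, not the medians), and lwr/upr come from predict() with interval = "prediction" (the original may have used a different interval type). The names `cars`, `fit3`, `pts`, `predict_for`, and `result` are illustrative.

```r
cars <- datasets::mtcars

fit3 <- lm(mpg ~ wt * factor(am) + hp, data = cars)

# Evaluation points: first quartile, mean (labelled Q50 in the table), third quartile
wt_q <- quantile(cars$wt, c(0.25, 0.75), names = FALSE)
hp_q <- quantile(cars$hp, c(0.25, 0.75), names = FALSE)
pts <- data.frame(
  label = c("Q25", "Q50", "Q75"),
  wt = c(wt_q[1], mean(cars$wt), wt_q[2]),
  hp = c(hp_q[1], mean(cars$hp), hp_q[2])
)

# Predicted MPG with interval bounds for one transmission type (0 or 1)
predict_for <- function(am_value) {
  newdata <- data.frame(wt = pts$wt, hp = pts$hp, am = am_value)
  predict(fit3, newdata = newdata, interval = "prediction")  # assumed interval type
}

manual    <- predict_for(1)
automatic <- predict_for(0)

result <- data.frame(
  label     = pts$label,
  manual    = manual[, "fit"],
  automatic = automatic[, "fit"],
  diff      = manual[, "fit"] - automatic[, "fit"]
)
print(result, digits = 4)
```

The diff column reproduces the three quantified differences (+2.32, +0.04, -1.36). The matrices `manual` and `automatic` also carry the lwr and upr columns for comparison with the table.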