library(datasets) 
data("mtcars") 
dim(mtcars) 
## [1] 32 11
variables <- names(mtcars) ; variables 
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
carnames <- row.names(mtcars) ; carnames 
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"
str(mtcars) 
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
head(mtcars, 5) 
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
tail(mtcars, 5) 
##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
anyNA(mtcars) # if FALSE no NAs in the data 
## [1] FALSE
summary.mtcars <- summary(mtcars) 
# Variables List 
## mpg = miles per US gallon 
## cyl = n° of cylinders (4, 6, or 8) 
## disp = engine displacement, in cubic inches 
## hp = gross horsepower 
## drat = rear axle ratio 
## wt = weight (in 1000 lbs) 
## qsec = 1/4 mile time 
## vs = type of engine (0 = V-shaped, 1 = straight) 
## am = transmission type (0 = automatic, 1 = manual) 
## gear = n° of forward gears 

1. Executive Summary1

This project concludes the Regression Models class on Coursera, part of the Data Science specialization offered by Johns Hopkins University. We analyze the mtcars dataset, exploring the relationship between miles per gallon (MPG) and type of transmission.

We will answer the following questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Can we quantify the difference in MPG between transmission types?

Key results:

  1. Manual transmissions cars achieve a higher fuel efficiency, measured in MPG, even after adjusting for other variables.
  2. A manual car reaches 14 MPG more than automatic ones, all else equal.

2. Introduction

The data2 was extracted from the 1974 issue of Motor Trend, a US automobile magazine. It comprises data on fuel consumption efficiency (measured as mpg, miles per gallon) and 103 other major characteristics, including transmission type4 (automatic or manual), of automobile design and performance for 325 different cars.

3. Analysis

3.1 Data Processing

Basic data processing was performed, including proper re-coding of categorical variables as factors, including the main variable of interest, transmission. The are no missing data to deal with.

mtcars$cyl <- factor(mtcars$cyl, levels = c(4, 6, 8)) 
mtcars$vs <- factor(mtcars$vs, labels = c("V", "S")) 
mtcars$gear <- factor(mtcars$gear, levels = c(3, 4, 5)) 
mtcars$carb <- factor(mtcars$carb, levels = c(1, 2, 3, 4, 6, 8)) 
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual")) 
str(mtcars) 
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "V","S": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
vars.num.index <- c(1, 3, 4, 5, 6, 7) 
vars.cat.index <- c(2, 8, 9, 10, 11) 
colnames(mtcars)[9] <- "Transmission" # gives 'am' variable a clear name 

3.2 Exploratory Data Analysis

A full array of descriptive statistics for the numeric variables, as well as counts for the factor variables, are reported in the Appendix. Here I report a basic summary of the outcome variable of interest, mpg, by transmission6. Of the 32 total observations, 19 are “auto” and the other 13 are “manual”. A preliminary inspection suggests there might be a systematic difference in mpg for automatic versus manual cars, worth exploring in greater detail through linear regression7. The overall difference in means is 7.24. Note that manual cars show greater variability, as measured by the standard deviation (Sd).

summary(mtcars$Transmission) 
## Automatic    Manual 
##        19        13
summ.mpg.by.am <- tapply(mtcars$mpg, mtcars$Transmission, summary) 
summ.mpg.by.am 
## $Automatic
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   14.95   17.30   17.15   19.20   24.40 
## 
## $Manual
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   21.00   22.80   24.39   30.40   33.90
tapply(mtcars$mpg, mtcars$Transmission, sd) 
## Automatic    Manual 
##  3.833966  6.166504
Min 1st Q. Median Mean 3rd Q. IQR Max Sd
Automatic 10.40 14.95 17.30 17.15 19.20 4.25 24.40 3.83
Manual 15.00 21.00 22.80 24.39 30.40 9.40 33.90 6.17

A boxplot (Figure 1 in the Appendix) helps visualize the information above.

Since we anticipate that other variables might be relevant in explaining fuel efficiency and in particular its relationship with transmission type, Figure 2 reports a complex ensemble of information8: scatterplots (with interpolated regression lines), distributions, boxplots, and correlations.

3.3 Regression analysis

3.3.1 Baseline model

The baseline is a simple (univariate) regression model which formally characterizes the effect of transmission type on mpg. Since the former is a binary variable, I take “Automatic” as the reference category, so that the regression model computes the difference in means.

The model output, reported in the Appendix, shows that manual-transmission cars are about 7.25 mpg more efficient, on average, than automatic ones. Given the small p-value reported9, the estimated coefficient is significant at all usual significance levels10.

baseline <- lm(mpg ~ Transmission, data = mtcars) 
baseline.summ <- summary(baseline) 

3.3.2 Model Choice

To account for the effect of possible confounders and avoid endogeneity11, I consider multivariate regression next.

A naïve approach is to regress mpg on all other covariates. The results, shown in the Appendix, are that no coefficient is statistically significant at 5% by itself, even though the F statistic suggest that the model does capture enough variability in the outcome to reject the Null Hypothesis of all zero coefficients (simultaneously). Coupled with the high R-squared, these facts suggest that multicollinearity is eating away much of the precision of the OLS estimator12.

full.model <- lm(mpg ~ . , data = mtcars) 
full.model.summ <- summary(full.model) 

Applying the stepwise algorithm allows to select the “best” model in terms of AIC/BIC13 for different procedures14.

A decision rule often applied in practice is to choose the model with lowest adjusted R^2 among the selected ones15; however, we “prefer” (see footnote!) the model selected by the forward procedures16.

# The k parameter represents the n° of degrees of freedom used for the penalty 
# For k = 2, the original AIC is employed 
# For k = ln(n) = log(32), BIC is computed (see help page ?step) 
# Specify trace = 0 to suppress printing of info during running of 'step' 
# Link in footnote gives brief explanation about the difference between "forward" 
# and "backward" options. In this case, we start from the baseline model (since 
# we need to include the 'Transmission' variable) and
for.aic.mdl  <- step(lm(mpg ~ Transmission, data = mtcars), direction = "forward", 
                     scope = formula(full.model), k = 2, trace = 0) 
                     # forward AIC 
back.aic.mdl <- step(full.model, direction = "backward", 
                     k = 2, trace = 0) 
                     # backward AIC 
for.bic.mdl  <- step(lm(mpg ~ Transmission, data = mtcars), direction = "forward", 
                     scope = formula(full.model), k = log(32), trace = 0) 
                     # forward BIC 
back.bic.mdl <- step(full.model, direction = "backward", 
                     k = log(32), trace = 0) 
                     # backward BIC 
# The forward procedures (based on AIC/BIC) select the same model. 
# Here 'final' refers to the model selected by the stepwise algorithm 
final <- lm(mpg ~ Transmission + hp + wt + qsec, data = mtcars) 
final.summ <- summary(final) 

However, a look at the default plots from the model reveal that there is both some non-linearity and heteroskedasticity that the model fails to capture completely17. I thus consider possible non-linearities and/or interaction effects that might lead us to a better estimation.

The final choice is thus the model: \(mpg = \beta_0 + \beta_1 Transmission + \beta_2 wt + \beta_3 hp + \beta_4 hp + \beta_5 (Transmission*wt)\)

final.int.Twt <- lm(mpg ~ Transmission * wt + hp + qsec, data = mtcars) 
final.int.twt.summ <- summary(final.int.Twt) 

In the Appendix we report the output of the final model selected, which includes all the covariates selected by the forward AIC plus an interaction term between transmission type and weight, with an adjusted \(R^2\) of 0.8759, higher than any other model tried.

3.3.3 Model Diagnostics

A few relevant tests are reported in this section. Figure 3 in the Appendix shows the default output for a linear model. The plots are discussed below in turn18.

i. Residuals vs Fitted Plot

The topleft panel in Figure 3 shows that the residuals have a reasonably “well-behaved” distribution about the zero line, all in the range (-4, 4).

ii. Normality of Residuals: Shapiro-Wilk test

The topright panel shows that there are some points away from the straight line, especially on the tails. However, a formal statistical test tells us that we fail to reject the Null Hypothesis of normality (since we get a p-value greater than the usual \(\alpha\) level of .05 and indeed greater of .1); we thus consider the residuals to be approximately normal.

shapiro.test(final.int.Twt$res) 
## 
##  Shapiro-Wilk normality test
## 
## data:  final.int.Twt$res
## W = 0.94485, p-value = 0.1028

iii. Scale-Location

The bottomleft panel plot seems regular enough, with residuals randomly spread across the graph; this gives some confidence in the homoskedasticity (constant variance) assumption.

iv. High Leverage Points

The bottomright panel suggests that while there are a few leverage points with hatvalues greater than the conservative threshold of \(2(k+1)/n\) (equal to \(2(5+1)/32=0.375\)), only one is above the more generous threshold of \(3(k+1)/n = 0.5625\). Here:

round(hatvalues(final.int.Twt)[hatvalues(final.int.Twt) > 2*(5+1)/32], 2) 
##      Merc 230  Lotus Europa Maserati Bora 
##          0.46          0.38          0.57
# round(hatvalues(final.int.Twt)[hatvalues(final.int.Twt) > 3*(5+1)/32], 2) 

3.3.4 Interpretation of Coefficients & Uncertainty

The results show that manual cars have a higher fuel efficiency of almost 14 mpg compared with automatic ones, keeping all the other included covariates fixed. The coefficient is statistically significant at 1% level. As weight increases, fuel efficiency decreases, and the drop is greater (in absolute value) for manual cars: 1000 more lbs are associated with a reduction of about 7 mpg versus 2.9. Both coefficients (\(\beta_2\) on weight alone and \(\beta_5\) on the interaction term) are also significant at 1%. The \(1/4\) mile time (qsec) is significant at 5%: greater acceleration is associated with less fuel efficiency. Finally, the coefficient on gross horsepower (hp) is basically zero, which is indeed contained in the 95% confidence interval: with a very high p-value of .93, we fail to reject \(H_0: \beta_3=0\).

4. Conclusions

The analysis performed allows to answer the questions motivating this project. Cars with manual transmission are better for MPG than automatic ones. To quantify this claim, we first look at the overall “raw” difference in mean mpg by transmission type, which is about 7.25. We then further investigate this relationship by accounting for potential confounders and other sources of endogeneity with a multivariate linear regression approach. The final model suggests a difference of almost 14 mpg, holding the other covariates fixed.

5. Appendix

5.1 Miscellanea

Descriptive statistics of numeric variables

library(pastecs) 
descriptive <- stat.desc(mtcars[, vars.num.index], norm = TRUE) 
descriptive <- round(descriptive, 2) 
descriptive <- descriptive[-c(1,2,3,10,11,14,16,18), ] 
Q1 <- sapply(mtcars[, vars.num.index], quantile, 0.25) 
Q3 <- sapply(mtcars[, vars.num.index], quantile, 0.75) 
descriptive <- rbind(descriptive, Q1, Q3) 
row.names(descriptive)[c(13, 14)] <- c("1st Q.", "3rd Q.") 
zeroes <- rep(0, 6) 
descriptive <- rbind(descriptive, zeroes) 
row.names(descriptive)[15] <- "IQR" 
descriptive <- descriptive[-c(11, 12), ] 
descriptive["IQR", ] <- descriptive["3rd Q.", ] - descriptive["1st Q.", ] 
descriptive2 <- rbind(
                     descriptive[1:4, ], descriptive[11, ], descriptive[5, ], 
                     descriptive[c(12, 13), ], descriptive[6:10, ] 
                     ) 
descriptive2 <- descriptive2[-4, ] 
row.names(descriptive2)[c(9, 10)] <- c("variance", "std dev") 
round(descriptive2, 2) 
##            mpg     disp      hp  drat    wt  qsec
## min      10.40    71.10   52.00  2.76  1.51 14.50
## max      33.90   472.00  335.00  4.93  5.42 22.90
## range    23.50   400.90  283.00  2.17  3.91  8.40
## 1st Q.   15.43   120.83   96.50  3.08  2.58 16.89
## median   19.20   196.30  123.00  3.70  3.33 17.71
## 3rd Q.   22.80   326.00  180.00  3.92  3.61 18.90
## IQR       7.38   205.18   83.50  0.84  1.03  2.01
## mean     20.09   230.72  146.69  3.60  3.22 17.85
## variance 36.32 15360.80 4700.87  0.29  0.96  3.19
## std dev   6.03   123.94   68.56  0.53  0.98  1.79
## skewness  0.61     0.38    0.73  0.27  0.42  0.37
## kurtosis -0.37    -1.21   -0.14 -0.71 -0.02  0.34

Counts (group size) for categorical variables

summary(mtcars[, vars.cat.index]) 
##  cyl    vs        Transmission gear   carb  
##  4:11   V:18   Automatic:19    3:15   1: 7  
##  6: 7   S:14   Manual   :13    4:12   2:10  
##  8:14                          5: 5   3: 3  
##                                       4:10  
##                                       6: 1  
##                                       8: 1

Welsch t-test for difference in means, unequal variances

auto.mpg <- mtcars[mtcars$Transmission == "Automatic", ]$mpg 
manual.mpg <- mtcars[mtcars$Transmission == "Manual", ]$mpg 
t.test(auto.mpg, manual.mpg, alternative = "two.sided", var.equal = FALSE) 
## 
##  Welch Two Sample t-test
## 
## data:  auto.mpg and manual.mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

Baseline regression model output

baseline.summ 
## 
## Call:
## lm(formula = mpg ~ Transmission, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          17.147      1.125  15.247 1.13e-15 ***
## TransmissionManual    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Naïve multivariate regression model output (all covariates)

full.model.summ 
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        23.87913   20.06582   1.190   0.2525  
## cyl6               -2.64870    3.04089  -0.871   0.3975  
## cyl8               -0.33616    7.15954  -0.047   0.9632  
## disp                0.03555    0.03190   1.114   0.2827  
## hp                 -0.07051    0.03943  -1.788   0.0939 .
## drat                1.18283    2.48348   0.476   0.6407  
## wt                 -4.52978    2.53875  -1.784   0.0946 .
## qsec                0.36784    0.93540   0.393   0.6997  
## vsS                 1.93085    2.87126   0.672   0.5115  
## TransmissionManual  1.21212    3.21355   0.377   0.7113  
## gear4               1.11435    3.79952   0.293   0.7733  
## gear5               2.52840    3.73636   0.677   0.5089  
## carb2              -0.97935    2.31797  -0.423   0.6787  
## carb3               2.99964    4.29355   0.699   0.4955  
## carb4               1.09142    4.44962   0.245   0.8096  
## carb6               4.47757    6.38406   0.701   0.4938  
## carb8               7.25041    8.36057   0.867   0.3995  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.779 
## F-statistic:  7.83 on 16 and 15 DF,  p-value: 0.000124
# round(full.model.summ$coefficients, 2) 

Summaries of models selected by stepwise AIC/BIC algorithm

# Just run this block without the include=FALSE option 
summary(for.aic.mdl) 
## 
## Call:
## lm(formula = mpg ~ Transmission + hp + wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4975 -1.5902 -0.1122  1.1795  4.5404 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)        17.44019    9.31887   1.871  0.07215 . 
## TransmissionManual  2.92550    1.39715   2.094  0.04579 * 
## hp                 -0.01765    0.01415  -1.247  0.22309   
## wt                 -3.23810    0.88990  -3.639  0.00114 **
## qsec                0.81060    0.43887   1.847  0.07573 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.435 on 27 degrees of freedom
## Multiple R-squared:  0.8579, Adjusted R-squared:  0.8368 
## F-statistic: 40.74 on 4 and 27 DF,  p-value: 4.589e-11
summary(back.aic.mdl) 
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + Transmission, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        33.70832    2.60489  12.940 7.73e-13 ***
## cyl6               -3.03134    1.40728  -2.154  0.04068 *  
## cyl8               -2.16368    2.28425  -0.947  0.35225    
## hp                 -0.03211    0.01369  -2.345  0.02693 *  
## wt                 -2.49683    0.88559  -2.819  0.00908 ** 
## TransmissionManual  1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10
summary(for.bic.mdl) 
## 
## Call:
## lm(formula = mpg ~ Transmission + hp + wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4975 -1.5902 -0.1122  1.1795  4.5404 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)        17.44019    9.31887   1.871  0.07215 . 
## TransmissionManual  2.92550    1.39715   2.094  0.04579 * 
## hp                 -0.01765    0.01415  -1.247  0.22309   
## wt                 -3.23810    0.88990  -3.639  0.00114 **
## qsec                0.81060    0.43887   1.847  0.07573 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.435 on 27 degrees of freedom
## Multiple R-squared:  0.8579, Adjusted R-squared:  0.8368 
## F-statistic: 40.74 on 4 and 27 DF,  p-value: 4.589e-11
summary(back.bic.mdl) 
## 
## Call:
## lm(formula = mpg ~ wt + qsec + Transmission, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          9.6178     6.9596   1.382 0.177915    
## wt                  -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec                 1.2259     0.2887   4.247 0.000216 ***
## TransmissionManual   2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Final regression model output (forward AIC + interaction term)

final.int.twt.summ 
## 
## Call:
## lm(formula = mpg ~ Transmission * wt + hp + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5084 -1.3683 -0.5383  1.0393  4.3674 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)           10.230790   8.457424   1.210  0.23729   
## TransmissionManual    13.957285   3.781546   3.691  0.00104 **
## wt                    -2.903080   0.783699  -3.704  0.00101 **
## hp                    -0.001148   0.013453  -0.085  0.93265   
## qsec                   0.992235   0.387270   2.562  0.01655 * 
## TransmissionManual:wt -4.096233   1.329241  -3.082  0.00482 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.124 on 26 degrees of freedom
## Multiple R-squared:  0.8959, Adjusted R-squared:  0.8759 
## F-statistic: 44.74 on 5 and 26 DF,  p-value: 5.87e-12

5.2 Figures

Figure 1

library(ggplot2) 
CBfriendly <- c("#E69F00", "#56B4E9") # We should all try to make colorblindness friendliness the default 
bp <- ggplot(data = mtcars, 
             aes(x = Transmission, y = mpg, fill = Transmission)) + 
      scale_fill_manual(values = CBfriendly) + 
      geom_boxplot() + 
      xlab("Automatic vs Manual") + 
      ylab("Miles per gallon") + 
      ggtitle("Fuel Efficiency by Transmission Type") + 
      theme_light() + 
      theme(plot.title = element_text(size = 16, hjust = 0.5)) + 
      theme(axis.text = element_text(size = 14), 
            axis.title = element_text(size = 14)) 
bp 

Figure 2

library(ggplot2) 
my_fn <- function(data, mapping, ...){ 
    p <- ggplot(data = data, mapping = mapping) + 
         geom_point() + 
         geom_smooth(method = lm, fill = "#E69F00", color = "blue", ...) 
    p 
} 
library(ggplot2) 
library(GGally) 
library(dplyr) 
ggpairs.all <- ggpairs(mtcars) # Print it for overview of all variables. 
# Included in .pdf: ggpairs.final 
final.vars <- mtcars %>% select(c(mpg, Transmission, hp, wt, qsec)) 
ggpairs.final <- ggpairs(final.vars, 
                         lower = list(continuous = my_fn), 
                         ) 
ggpairs.final 

Figure 3

par(mfrow = c(2, 2)) 
plot(final.int.Twt) 

5.3 Extra

Figure 4

par(mfrow = c(2, 2)) 
plot(full.model) 
## Warning: not plotting observations with leverage one:
##   30, 31

Figure 5

par(mfrow = c(2, 2)) 
plot(final) 


  1. Please note that blank space and division in sections were used to make the document more readable; overall, the page lengths requirements are (more or less) met!↩︎

  2. See help page in R (?mtcars) and references within.↩︎

  3. In statistical jargon, this means we have measurements on 11 variables, as such.↩︎

  4. Henceforth just transmission↩︎

  5. Number of observations, or sample size, of our dataset.↩︎

  6. To see where these figures come from, please refer to https://rpubs.com/francescoaldo60/752421↩︎

  7. In the Appendix I also report an unequal variances Welsch test for the differences in means↩︎

  8. Only variables in the final regression model are shown. For a full picture, albeit one difficult to read, just require(GGally) on top of ggplot2 and do a ggpairs() call with the entire ‘mtcars’ dataset as argument.↩︎

  9. Recall that any lm() call of a model \(Y = \alpha + \sum_{j=1}^k \beta_j X_j\) automatically tests the Null Hypothesis \(H_0\): \(\beta_j = 0\) \(\forall j=1,\dots,k\).↩︎

  10. i.e. for \(\alpha\) equal to 10%, 5%, 1%, or 0.1%↩︎

  11. In the sense of Stock, J. H. & Watson, M. W. (2014 updated 3rd ed.), Introduction to Econometrics, Pearson, Global Edition, endogeneity is a catch-word for any potential threat to the linear regression-least squares assumptions, such as omitted variable(s).↩︎

  12. In the statistical learning language, we are overfitting.↩︎

  13. Akaike / Bayesian Information Criterion, respectively. See https://rpubs.com/francescoaldo60/752421 for details: code and summaries.↩︎

  14. For an accessible explanation of “forward” vs “backward” stepwise regression, see https://quantifyinghealth.com/stepwise-selection/↩︎

  15. This would leave us with the model selected by backward AIC, in which the coefficient on ‘Transmission’ is not statistically significant different from zero at any conventional significance level (meaning there is no statistically appreciable difference between automatic and manual cars in mpg, once other relevant variables have been accounted for), should that be our final choice↩︎

  16. Let me apply a threefold reasoning to show this is not arbitrary. On one hand, the BIC for a (relatively) small sample size n, such as 32, might penalize more complex models a bit too much, especially if the sample is not as representative of the population as one would like (recall: data from one specific magazine). Moreover, the fact that forward AIC and forward BIC produced the same selection gives some hope of a more “stable” criterion for this particular application. Finally, it could be argued that building a regression model “from the ground up”, adding one covariate at a time, is a more sensible approach to model building. Overall, given that the gains in adjusted R^2 are very small, the final selection seemed the most sensible one –to me at least.↩︎

  17. The plots can be found in the online version, available at https://rpubs.com/francescoaldo60/752421↩︎

  18. For the corresponding plots of the other model discussed here, please refer to https://rpubs.com/francescoaldo60/752421↩︎