This report is the concluding course project of the Regression Models course in the Data Science Specialization by Johns Hopkins University on Coursera.org.
It examines the relationship between a car's transmission type (manual or automatic) and its fuel consumption (miles per gallon) for the car models listed in the mtcars dataset, while also considering the possible effects of other variables on fuel consumption.

  1. Project Instructions
  2. Raw Data
  3. Data Processing
  4. Exploratory Data Analysis
  5. Model Selection
  6. Results
  7. Appendix

Project Instructions

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmission.

Raw Data

The data was extracted from the 1974 Motor Trend US magazine. It comprises the following aspects of automobile design and performance for 32 automobile models.

Variable  Class    Range          Description
mpg       numeric  10.4 - 33.9    Fuel Consumption in Miles per Gallon
cyl       numeric  4, 6, 8        Number of Cylinders
disp      numeric  71.1 - 472     Displacement in Cubic Inches
hp        numeric  52 - 335       Gross Horsepower
drat      numeric  2.76 - 4.93    Rear Axle Ratio
wt        numeric  1.513 - 5.424  Weight in 1000 lbs
qsec      numeric  14.5 - 22.9    1/4 Mile Time in Seconds
vs        numeric  0, 1           Engine Type: 0 = V-shaped, 1 = straight
am        numeric  0, 1           Transmission: 0 = automatic, 1 = manual
gear      numeric  3, 4, 5        Number of Gears
carb      numeric  1 - 8          Number of Carburetors

Data Processing

To analyze the data effectively, the raw data was processed into a tidy form (code in Appendix A: Data Processing).

The tidy dataset has the following form:

##                mpg cyl disp  hp drat    wt  qsec vs     am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0 Manual    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0 Manual    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1 Manual    4    1

Exploratory Data Analysis

When comparing manual to automatic transmission without taking any other factor into account, cars with manual transmission appear to achieve more miles per gallon than cars with automatic transmission.

When testing the relationship between transmission type am and mpg alone, there is a significant difference between automatic and manual transmission (p < 0.001). Manual cars achieve approx. 7.2 miles per gallon more (95% CI: [3.64, 10.85]) than automatic cars.

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## amManual     7.244939   1.764422  4.106127 2.850207e-04
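
The coefficient table above can be reproduced with a short sketch like the following. It is not necessarily the code used for this report: the object names dat and fit_am are illustrative, and the recoding of am into a factor with the labels "Automatic" and "Manual" is an assumption inferred from the coefficient name amManual.

```r
# Load the built-in dataset
data(mtcars)

# Assumption: am is recoded as a labeled factor (0 = Automatic, 1 = Manual)
dat <- transform(mtcars,
                 am = factor(am, levels = c(0, 1),
                             labels = c("Automatic", "Manual")))

# Simple linear regression of mpg on transmission type only
fit_am <- lm(mpg ~ am, data = dat)

# Coefficient table (estimate, standard error, t value, p value)
summary(fit_am)$coefficients

# 95% confidence interval for the manual vs. automatic difference
confint(fit_am)["amManual", ]
```

With a single binary predictor, the intercept is the mean mpg of the automatic cars and the amManual coefficient is the difference between the two group means.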

However, this simple model does not take into account any of the other variables, which may explain away the difference just observed. When all variables in the dataset are included, for example, the difference between automatic and manual transmission is no longer significant (p = 0.23).

##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## vs1          0.31776281  2.10450861  0.1509915 0.88142347
## amManual     2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871
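
As a hedged sketch, the full model can be fitted as shown here. The names dat, fit_full and coefs are illustrative, and treating vs and am as factors is an assumption inferred from the coefficient names vs1 and amManual in the table above.

```r
data(mtcars)

# Assumption: vs and am are factors, all other variables stay numeric
dat <- transform(mtcars,
                 vs = factor(vs),
                 am = factor(am, levels = c(0, 1),
                             labels = c("Automatic", "Manual")))

# Regress mpg on all remaining variables in the dataset
fit_full <- lm(mpg ~ ., data = dat)
coefs <- summary(fit_full)$coefficients

# Estimate, standard error, t value and p value for the transmission effect
coefs["amManual", ]
```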

To decide which variables to include in a comprehensive model, the following section examines the effects of multiple variables on fuel consumption as well as possible collinearity among these variables.

Model Selection

The model selection process consisted of the following consecutive steps (see Appendix B: Model Selection for more detailed information):

  1. Akaike's Information Criterion (AIC) was examined with the step() function applied to the full model in order to determine the set of variables that produces the best AIC value, i.e. that forms the best model by this criterion. The resulting model includes the variables wt, qsec and am, i.e. the weight, acceleration (1/4 mile time) and transmission type of a car.

  2. To verify that no other variable adds information to the model containing these 3 variables, an ANOVA was applied to a sequence of nested models, each adding one further variable to the previous one. None of the additional variables significantly improves the model identified in step 1 (all p > 0.2).

  3. To check for collinearity among wt, qsec and am, variance inflation factors (VIF) were computed. With VIF values between 1.36 and 2.54, collinearity does not appear to be a concern.

  4. To ensure that all other assumptions are met (Linearity, Homoscedasticity and Normality), residual plots were examined (Appendix C: Residual Plots). All assumptions seem to be met.

That leaves us with the following model:

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amManual      2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

##                   2.5 %    97.5 %
## (Intercept) -4.63829946 23.873860
## wt          -5.37333423 -2.459673
## qsec         0.63457320  1.817199
## amManual     0.04573031  5.825944

Results

The model selected in the Model Selection section explains about 83% of the variance in the data (adjusted R^2 = 0.83). The overall model fit is highly significant (F-test, p < 0.001), which suggests that an appropriate model has been found. It further shows that wt, qsec and am each have an effect on mpg, which is either highly significant (wt and qsec: p < 0.001) or significant (am: p < 0.05):

  1. wt has a decreasing effect on mpg. For every additional 1000 lbs of weight, mpg decreases by 2.46 to 5.37 miles per gallon (95% confidence interval), holding the other variables constant.

  2. qsec has an increasing effect on mpg. The slower the car accelerates, the higher its mpg: for every additional second that a car needs for 1/4 mile, mpg increases by 0.63 to 1.82 miles per gallon (95% confidence interval), holding the other variables constant.

  3. The effect of the transmission type is also detectable and leads to the following answers to the initial questions:

Is an automatic or manual transmission better for MPG?

A manual transmission is better for MPG: with weight and 1/4 mile time held constant, cars with manual transmission achieve more miles per gallon than cars with automatic transmission.

Quantify the MPG difference between automatic and manual transmission.

With weight and 1/4 mile time held constant, cars with manual transmission achieve an estimated 2.94 miles per gallon more than cars with automatic transmission (95% confidence interval: 0.05 to 5.83 miles per gallon).
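
One way to make this difference concrete is to predict mpg for two otherwise identical cars that differ only in transmission type. The following is a minimal sketch under stated assumptions: the names dat, fit, new_cars and pred are illustrative, the factor labels for am are inferred from the coefficient name amManual, and evaluating at the average weight and 1/4 mile time is an arbitrary choice.

```r
data(mtcars)

# Assumption: am is a labeled factor (0 = Automatic, 1 = Manual)
dat <- transform(mtcars,
                 am = factor(am, levels = c(0, 1),
                             labels = c("Automatic", "Manual")))

# Selected model from the Model Selection section
fit <- lm(mpg ~ wt + qsec + am, data = dat)

# Two hypothetical cars with identical weight and 1/4 mile time
new_cars <- data.frame(
  wt   = mean(dat$wt),
  qsec = mean(dat$qsec),
  am   = factor(c("Automatic", "Manual"), levels = levels(dat$am))
)

# Predicted mpg with confidence intervals for the mean prediction
pred <- predict(fit, newdata = new_cars, interval = "confidence")
pred

# The gap between the two predictions equals the amManual coefficient
pred[2, "fit"] - pred[1, "fit"]
coef(fit)[["amManual"]]
```

Because the model contains no interaction terms, the predicted gap is the same at any other combination of wt and qsec.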

Appendix

Appendix A: Data Processing

### Load Packages 
if(!require(ggplot2)){
    install.packages("ggplot2")
}
library(ggplot2)

if(!require(car)){
    install.packages("car")
}
library(car)

### Load Dataset
data(mtcars)
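
The loading code above leaves am and vs as 0/1 numeric columns, whereas the outputs in this report show names such as amManual and vs1. A recoding step consistent with those outputs could look like the sketch below. The original step is not shown, so the choice of columns and the label "Automatic" are assumptions.

```r
data(mtcars)

# Assumption: vs becomes a factor with levels "0" and "1"
mtcars$vs <- factor(mtcars$vs)

# Assumption: am becomes a labeled factor (0 = Automatic, 1 = Manual)
mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("Automatic", "Manual"))

# Inspect the first rows of the processed dataset
head(mtcars, 3)
```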

Appendix B: Model Selection

AIC

step(lm(mpg~., data = mtcars), direction = "both", trace = 0)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt         qsec     amManual  
##       9.618       -3.917        1.226        2.936

ANOVA

fit1 <- lm(mpg ~ wt + qsec + am, mtcars)
fit2 <- lm(mpg ~ wt + qsec + am + cyl, mtcars)
fit3 <- lm(mpg ~ wt + qsec + am + cyl + disp, mtcars)
fit4 <- lm(mpg ~ wt + qsec + am + cyl + disp + hp, mtcars)
fit5 <- lm(mpg ~ wt + qsec + am + cyl + disp + hp + drat, mtcars)
fit6 <- lm(mpg ~ wt + qsec + am + cyl + disp + hp + drat + vs, mtcars)
fit7 <- lm(mpg ~ wt + qsec + am + cyl + disp + hp + drat + vs + gear, mtcars)
fit8 <- lm(mpg ~ wt + qsec + am + cyl + disp + hp + drat + vs + gear + carb, mtcars)
anova(fit1, fit2, fit3, fit4, fit5, fit6, fit7, fit8)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + qsec + am
## Model 2: mpg ~ wt + qsec + am + cyl
## Model 3: mpg ~ wt + qsec + am + cyl + disp
## Model 4: mpg ~ wt + qsec + am + cyl + disp + hp
## Model 5: mpg ~ wt + qsec + am + cyl + disp + hp + drat
## Model 6: mpg ~ wt + qsec + am + cyl + disp + hp + drat + vs
## Model 7: mpg ~ wt + qsec + am + cyl + disp + hp + drat + vs + gear
## Model 8: mpg ~ wt + qsec + am + cyl + disp + hp + drat + vs + gear + carb
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     28 169.29                           
## 2     27 167.78  1    1.5011 0.2137 0.6486
## 3     26 161.41  1    6.3709 0.9071 0.3517
## 4     25 150.99  1   10.4228 1.4840 0.2367
## 5     24 149.09  1    1.9013 0.2707 0.6083
## 6     23 148.87  1    0.2170 0.0309 0.8621
## 7     22 147.90  1    0.9717 0.1384 0.7137
## 8     21 147.49  1    0.4067 0.0579 0.8122

VIF

##       wt     qsec       am 
## 2.482952 1.364339 2.541437
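
The code that produced the VIF values is not shown. Since the car package is loaded in Appendix A, they were presumably computed with its vif() function, but that is an assumption. As a package-free sketch, the same quantity can be computed in base R from the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R^2 obtained by regressing predictor j on the other predictors. The names predictors and vif_manual are illustrative.

```r
data(mtcars)

# Predictors of the selected model; am is kept as a 0/1 numeric column here
predictors <- mtcars[, c("wt", "qsec", "am")]

# Regress each predictor on the remaining ones and convert R^2 into a VIF
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v),
                   data = predictors))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```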

Appendix C: Residual Plots
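
The residual plots themselves are not reproduced here. A minimal sketch for generating the standard diagnostic plots of the selected model with base R graphics follows. Whether the original plots were drawn this way or with ggplot2 is not known, and the names fit and op are illustrative.

```r
data(mtcars)

# Selected model; am is converted to a factor inline
fit <- lm(mpg ~ wt + qsec + factor(am), data = mtcars)

# Arrange the four standard diagnostic plots in a 2 x 2 grid:
# residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)  # restore the previous graphics settings
```

The residuals vs. fitted and scale-location panels relate to linearity and homoscedasticity, and the normal Q-Q panel relates to normality of the residuals.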