We work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, the magazine is interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. It is particularly interested in the following two questions:
* “Is an automatic or manual transmission better for MPG?”
* “Quantify the MPG difference between automatic and manual transmissions.”
According to the data obtained by Motor Trend US, cars with manual transmission perform better (higher mpg) than cars with automatic transmission when they weigh less than about 2,800 lb. For cars heavier than this, automatic transmission offers better mpg figures.
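The report does not show where the 2,800 lb threshold comes from. One model that would produce such a crossover, offered here only as an assumption and not as the method actually used, is a regression of mpg on weight with a weight-by-transmission interaction. The names `fit_int`, `crossover_wt` and `crossover_lb` are introduced purely for illustration.

```r
# Assumption: an interaction model mpg ~ wt * am (not shown in this report).
data(mtcars)
fit_int <- lm(mpg ~ wt * am, data = mtcars)
b <- coef(fit_int)

# Under this model, the manual-minus-automatic difference at weight w
# (in 1000 lb) is b["am"] + b["wt:am"] * w. It is zero at the crossover.
crossover_wt <- -b[["am"]] / b[["wt:am"]]
crossover_lb <- crossover_wt * 1000
print(round(crossover_lb))
```

Below the crossover weight the estimated difference favors manual cars, and above it the sign flips, which matches the pattern described in the summary.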
The data set has 32 cars and eleven variables. The data used is the mtcars dataset available in R:
* mpg: Miles/(US) gallon
* cyl: Number of cylinders
* disp: Displacement (cu.in.)
* hp: Gross horsepower
* drat: Rear axle ratio
* wt: Weight (1000 lb)
* qsec: 1/4 mile time
* vs: Engine shape (0 = V-shaped, 1 = straight)
* am: Transmission (0 = automatic, 1 = manual)
* gear: Number of forward gears
* carb: Number of carburetors

## Let’s take a look at the first rows of our data set.
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
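A minimal sketch of how the table above can be reproduced, assuming only the built-in `mtcars` data set; `n_cars` and `n_vars` are illustrative names.

```r
# mtcars ships with base R (package 'datasets'), so no download is needed.
data(mtcars)

n_cars <- nrow(mtcars)   # number of observations (cars)
n_vars <- ncol(mtcars)   # number of variables
print(c(cars = n_cars, variables = n_vars))

# First six rows, as shown in the table above.
print(head(mtcars))
```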
To answer this project’s questions, we will use linear models to quantify the relationship between MPG and the remaining variables, and we will look for the best model fit in the regression analysis section that follows.
Looking at how the MPG variable of interest is distributed (Figure 1 in the appendix), we can see that it roughly follows a normal distribution. The imprecision of this comparison may be due to the small sample size of only 32 observations. In any case, multiple linear regression assumes that the residuals, rather than the outcome itself, are normally distributed, so we will check this assumption once we have the best model fit for this project.
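The visual impression from Figure 1 could be complemented with a formal check. The sketch below uses the Shapiro-Wilk test, which is an addition for illustration and not part of the original analysis; `sw_mpg` is a hypothetical name.

```r
data(mtcars)

# Shapiro-Wilk test: the null hypothesis is that mpg comes from a normal distribution.
sw_mpg <- shapiro.test(mtcars$mpg)
print(sw_mpg)

# A p-value above 0.05 means normality is not rejected at the 5% level.
# With only 32 observations the test has limited power, so this is weak evidence.
normality_not_rejected <- sw_mpg$p.value > 0.05
print(normality_not_rejected)
```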
To see which transmission type is better for fuel consumption, we first draw boxplots of MPG for automatic and manual cars (see Figure 2 in the appendix). MPG is clearly higher when the transmission is manual.
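The comparison behind the boxplots can also be summarized numerically. This is a minimal sketch; `am_label` and `mpg_means` are illustrative names, and the plotting call is left as a comment because the exact figure code is not shown in the report.

```r
data(mtcars)

# Label the transmission codes (0 = automatic, 1 = manual) for readability.
am_label <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean mpg per transmission type.
mpg_means <- tapply(mtcars$mpg, am_label, mean)
print(round(mpg_means, 2))

# In an interactive session, a plot similar to Figure 2 could be drawn with:
# boxplot(mtcars$mpg ~ am_label, xlab = "Transmission", ylab = "MPG")
```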
We can look into our data and explore the relationships between all the variables (Figure 3 in the appendix). Our outcome variable, miles per gallon, seems to be related to several other variables, such as cyl, disp, hp, drat, wt, vs and am.
These relationships are easier to see in the correlogram (Figure 4 in the appendix). A correlation is drawn in a darker tone of blue the closer it is to 1, and in a darker tone of red the closer it is to -1; in both cases a darker tone means a stronger relationship.
Besides our interest in the outcome variable, the correlations also give an idea of whether certain variables should be combined in a linear model. One variable could be explaining another one, e.g. weight (wt) and displacement (disp) share a correlation of 0.89. A plausible explanation is that a heavy vehicle simply needs more power than a lighter one to produce comparable acceleration and load-hauling capacity, and that is usually achieved with a larger-displacement engine.
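The correlations discussed above can be read directly from the correlation matrix. A minimal sketch, with `cor_wt_disp` and `cor_with_mpg` as illustrative names:

```r
data(mtcars)

# Correlation between weight and displacement (reported above as 0.89).
cor_wt_disp <- cor(mtcars$wt, mtcars$disp)
print(round(cor_wt_disp, 2))

# Correlation of every other variable with mpg, sorted from most negative
# to most positive. The first column (mpg with itself) is dropped.
cor_with_mpg <- sort(cor(mtcars)["mpg", -1])
print(round(cor_with_mpg, 2))
```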
In this section, we will try to find the best model fit to explain the relationship between the set of variables and miles per gallon (MPG) through a variable selection procedure called stepwise regression. This method consists of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or removal from the set of explanatory variables based on some prespecified criterion. Usually, this takes the form of a sequence of F-tests or t-tests, but other criteria are possible, such as adjusted R-squared, the Akaike information criterion, the Bayesian information criterion, Mallows’s Cp, PRESS, or the false discovery rate.
Stepwise regression and similar selection methods might not be the best for regression analysis [1], [2], but for the sake of this academic project, we will run this procedure.

# Model Selection
We will start our selection using a component of the stepwise regression technique called backward elimination, where each variable is considered for removal from the set of explanatory variables, based on the Akaike information criterion as implemented in the R function step.
Let’s start by fitting a model with all the variables.
mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
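# Fit the model with all variables (matches the Call shown in the output)
Fit_Base <- lm(mpg ~ ., data = mtcars)
# Summary
summary(Fit_Base)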
Call:
lm(formula = mpg ~ ., data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.4506 -1.6044 -0.1196 1.2193 4.6271
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337 18.71788 0.657 0.5181
cyl -0.11144 1.04502 -0.107 0.9161
disp 0.01334 0.01786 0.747 0.4635
hp -0.02148 0.02177 -0.987 0.3350
drat 0.78711 1.63537 0.481 0.6353
wt -3.71530 1.89441 -1.961 0.0633 .
qsec 0.82104 0.73084 1.123 0.2739
vs 0.31776 2.10451 0.151 0.8814
am 2.52023 2.05665 1.225 0.2340
gear 0.65541 1.49326 0.439 0.6652
carb -0.19942 0.82875 -0.241 0.8122
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
With this first model, none of the individual coefficients is significant at the 5% level, even though the model as a whole is (F-test p-value of 3.793e-07). This can be due to multicollinearity (as in the weight (wt) vs displacement (disp) example in the Exploratory Analysis section) and to the lack of simplicity of the model (parsimony principle).
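One way to quantify the multicollinearity suspected above is the variance inflation factor (VIF). This is an illustrative addition rather than part of the original analysis. The sketch computes it by hand as 1 / (1 - R²), where R² comes from regressing each predictor on all the others; `predictors` and `vif` are hypothetical names.

```r
data(mtcars)

predictors <- setdiff(names(mtcars), "mpg")

# For each predictor, regress it on the remaining predictors and
# convert the resulting R-squared into a variance inflation factor.
vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})

# Values well above 10 are commonly read as a sign of strong multicollinearity.
print(round(sort(vif, decreasing = TRUE), 1))
```

Large factors for variables such as disp and wt would be consistent with inflated standard errors, and hence with individually non-significant coefficients in the full model.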
We will use the R function step() to look for the best model, starting from our Fit_Base model, which already includes all the variables. As just mentioned, this function uses the Akaike information criterion (AIC) to select variables for removal: the lower the AIC, the better the trade-off between fit and model size. Variables are removed one at a time.
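A minimal sketch of the call that produces a trace like the one that follows. `Fit_Base` and `Fit_Best` are the names used in this report; `selected` is an illustrative helper.

```r
data(mtcars)

# Full model with every variable as a regressor.
Fit_Base <- lm(mpg ~ ., data = mtcars)

# Backward elimination by AIC. With the default trace, step() prints
# the candidate removals at every step.
Fit_Best <- step(Fit_Base, direction = "backward")

# Regressors kept in the final model.
selected <- attr(terms(Fit_Best), "term.labels")
print(selected)
```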
Start: AIC=70.9
mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
Df Sum of Sq RSS AIC
- cyl 1 0.0799 147.57 68.915
- vs 1 0.1601 147.66 68.932
- carb 1 0.4067 147.90 68.986
- gear 1 1.3531 148.85 69.190
- drat 1 1.6270 149.12 69.249
- disp 1 3.9167 151.41 69.736
- hp 1 6.8399 154.33 70.348
- qsec 1 8.8641 156.36 70.765
<none> 147.49 70.898
- am 1 10.5467 158.04 71.108
- wt 1 27.0144 174.51 74.280
Step: AIC=68.92
mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
Df Sum of Sq RSS AIC
- vs 1 0.2685 147.84 66.973
- carb 1 0.5201 148.09 67.028
- gear 1 1.8211 149.40 67.308
- drat 1 1.9826 149.56 67.342
- disp 1 3.9009 151.47 67.750
- hp 1 7.3632 154.94 68.473
<none> 147.57 68.915
- qsec 1 10.0933 157.67 69.032
- am 1 11.8359 159.41 69.384
- wt 1 27.0280 174.60 72.297
Step: AIC=66.97
mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
Df Sum of Sq RSS AIC
- carb 1 0.6855 148.53 65.121
- gear 1 2.1437 149.99 65.434
- drat 1 2.2139 150.06 65.449
- disp 1 3.6467 151.49 65.753
- hp 1 7.1060 154.95 66.475
<none> 147.84 66.973
- am 1 11.5694 159.41 67.384
- qsec 1 15.6830 163.53 68.200
- wt 1 27.3799 175.22 70.410
Step: AIC=65.12
mpg ~ disp + hp + drat + wt + qsec + am + gear
Df Sum of Sq RSS AIC
- gear 1 1.565 150.09 63.457
- drat 1 1.932 150.46 63.535
<none> 148.53 65.121
- disp 1 10.110 158.64 65.229
- am 1 12.323 160.85 65.672
- hp 1 14.826 163.35 66.166
- qsec 1 26.408 174.94 68.358
- wt 1 69.127 217.66 75.350
Step: AIC=63.46
mpg ~ disp + hp + drat + wt + qsec + am
Df Sum of Sq RSS AIC
- drat 1 3.345 153.44 62.162
- disp 1 8.545 158.64 63.229
<none> 150.09 63.457
- hp 1 13.285 163.38 64.171
- am 1 20.036 170.13 65.466
- qsec 1 25.574 175.67 66.491
- wt 1 67.572 217.66 73.351
Step: AIC=62.16
mpg ~ disp + hp + wt + qsec + am
Df Sum of Sq RSS AIC
- disp 1 6.629 160.07 61.515
<none> 153.44 62.162
- hp 1 12.572 166.01 62.682
- qsec 1 26.470 179.91 65.255
- am 1 32.198 185.63 66.258
- wt 1 69.043 222.48 72.051
Step: AIC=61.52
mpg ~ hp + wt + qsec + am
Df Sum of Sq RSS AIC
- hp 1 9.219 169.29 61.307
<none> 160.07 61.515
- qsec 1 20.225 180.29 63.323
- am 1 25.993 186.06 64.331
- wt 1 78.494 238.56 72.284
Step: AIC=61.31
mpg ~ wt + qsec + am
Df Sum of Sq RSS AIC
<none> 169.29 61.307
- am 1 26.178 195.46 63.908
- qsec 1 109.034 278.32 75.217
- wt 1 183.347 352.63 82.790
As we can see, the AIC of the model decreases each time the suggested variable is removed. Backward elimination stops when removing any of the remaining variables would result in a higher AIC than keeping the current model (AIC = 61.307).
This is what our best model fit looks like:
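The AIC values in the trace can be reproduced by hand. For a linear model, step() relies on extractAIC(), which (up to an additive constant) computes n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. The sketch below checks this for the final model; `rss`, `k`, `aic_manual` and `aic_step` are illustrative names.

```r
data(mtcars)
Fit_Best <- lm(mpg ~ wt + qsec + am, data = mtcars)

n   <- nrow(mtcars)                 # 32 observations
rss <- sum(residuals(Fit_Best)^2)   # residual sum of squares (169.29 in the trace)
k   <- length(coef(Fit_Best))       # intercept + 3 regressors

# AIC as used by step() for lm objects.
aic_manual <- n * log(rss / n) + 2 * k
aic_step   <- extractAIC(Fit_Best)[2]

print(round(c(manual = aic_manual, extractAIC = aic_step), 3))
```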
# Model Summary
summary(Fit_Best)
Call:
lm(formula = mpg ~ wt + qsec + am, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.4811 -1.5555 -0.7257 1.4110 4.6610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.6178 6.9596 1.382 0.177915
wt -3.9165 0.7112 -5.507 6.95e-06 ***
qsec 1.2259 0.2887 4.247 0.000216 ***
am 2.9358 1.4109 2.081 0.046716 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.459 on 28 degrees of freedom
Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
# Confidence Intervals
confint(Fit_Best)
2.5 % 97.5 %
(Intercept) -4.63829946 23.873860
wt -5.37333423 -2.459673
qsec 0.63457320 1.817199
am 0.04573031 5.825944
According to the R-squared value, our best model fit explains 85% of the variability in the MPG outcome with the regressors weight wt, 1/4 mile time qsec, and transmission am.
The adjusted R-squared is 83%, which is close to the R-squared value. Adjusted R-squared penalizes each additional regressor, so a substantial gap between the two would suggest unnecessary variables in the model; here the gap is small.
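The relation between the two statistics can be checked with the usual formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of regressors. A minimal sketch, with `r2` and `adj_r2_manual` as illustrative names:

```r
data(mtcars)
Fit_Best <- lm(mpg ~ wt + qsec + am, data = mtcars)
s <- summary(Fit_Best)

n  <- nrow(mtcars)   # 32 observations
p  <- 3              # regressors: wt, qsec, am
r2 <- s$r.squared

# Adjusted R-squared penalizes R-squared for the number of regressors.
adj_r2_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(c(r.squared = r2, adjusted = adj_r2_manual), 4))
```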
All the p-values of the regressors' coefficients are below the 0.05 significance level, which means that all the regressors are statistically significant; only the intercept is not.
This is reinforced by the confidence intervals: none of the intervals for the regressors contains 0 (although the lower bound for am, 0.046, is close to it), whereas the interval for the intercept does. Hence, these variables have a statistically significant effect on the outcome.
The model as a whole is significant in explaining the MPG outcome, as the p-value of the overall F-test is 1.21e-11, which is smaller than the 0.05 significance level.
When the transmission variable am is 0 (automatic), the intercept of the model is 9.6178 (the coefficient of the reference level). When am is equal to 1 (manual), the intercept of the model is 9.6178 + 2.9358 = 12.5536 (the reference intercept plus the coefficient of the manual transmission variable).
The t-test for the am coefficient suggests that manual cars (am = 1) differ significantly from automatic cars (am = 0). Holding weight wt and 1/4 mile time qsec constant, manual transmission gives on average 2.9358 more miles per gallon than automatic transmission (95% confidence interval: 0.05 to 5.83).
Holding 1/4 mile time qsec and transmission am constant, as the weight of the car increases by 1 unit (1000 lb), miles per gallon decreases on average by 3.9165.
Holding weight wt and transmission am constant, as the 1/4 mile time increases by 1 unit (1 second), miles per gallon increases on average by 1.2259.
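The "holding the other variables constant" interpretation can be illustrated by predicting mpg for two cars that differ only in transmission. The weight and 1/4 mile time used below (3,000 lb and 18 seconds) are arbitrary values chosen for the example, and `new_cars`, `pred` and `am_effect` are hypothetical names.

```r
data(mtcars)
Fit_Best <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Two hypothetical cars, identical except for the transmission type.
new_cars <- data.frame(wt = c(3, 3), qsec = c(18, 18), am = c(0, 1))
pred <- predict(Fit_Best, newdata = new_cars)

# Manual minus automatic: equals the am coefficient regardless of
# the wt and qsec values chosen, because the model has no interactions.
am_effect <- unname(pred[2] - pred[1])
print(round(am_effect, 4))
```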
Assuming that miles per gallon for automatic and manual cars are normally distributed, we can test the hypothesis that the two populations of cars have equal means.
H0: M1 - M2 = 0. There is no difference in miles per gallon between automatic and manual transmission.
H1: M1 - M2 > 0. Miles per gallon are higher when the cars have manual transmission.
Where:
M1 is the average miles per gallon for manual cars.
M2 is the average miles per gallon for automatic cars.
Welch Two Sample t-test
data: mpg by am
t = -3.7671, df = 18.332, p-value = 0.9993
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-10.57662 Inf
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231
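R orders the groups as 0 (automatic) and then 1 (manual), so the test shown above evaluates whether the automatic mean minus the manual mean is greater than 0. That is the opposite direction to H1, which explains the p-value of 0.9993. For H1 as stated (M1 - M2 > 0), the one-sided p-value is the complement, 1 - 0.9993 = 0.0007. At the 5% significance level, there is therefore sufficient evidence to reject the null hypothesis (p-value < significance level): for cars with manual transmission, the miles per gallon are higher.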
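A sketch of the test set up in the direction of H1. With the formula interface, t.test() computes the mean of group 0 minus the mean of group 1, so "manual is higher" corresponds to `alternative = "less"`. The names `tt` and `p_manual_higher` are illustrative.

```r
data(mtcars)

# Welch two-sample t-test. The difference tested is
# mean(mpg | am = 0) - mean(mpg | am = 1), i.e. automatic minus manual.
# H1 "manual is higher" means that difference is less than 0.
tt <- t.test(mpg ~ am, data = mtcars, alternative = "less")
print(tt)

p_manual_higher <- tt$p.value
reject_h0 <- p_manual_higher < 0.05
print(reject_h0)
```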
Consult Figure 5 in the appendix for this analysis.
* The points in the Residuals vs. Fitted plot seem to be randomly scattered, which is consistent with the assumption of homoscedastic errors.
* The points in the Normal Q-Q plot mostly fall on the line, indicating that the residuals are approximately normally distributed.
* The Scale-Location plot supports the constant variance assumption, as the points are randomly distributed.
* All Cook’s distances are less than 1 (D < 1), which means that no single observation is highly influential.
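The influence check in the last bullet can be reproduced numerically. This is a minimal sketch; `cook` and `max_cook` are illustrative names, and the plotting call is left as a comment because the code behind Figure 5 is not shown in the report.

```r
data(mtcars)
Fit_Best <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Cook's distance for every observation; values of 1 or more are
# commonly treated as highly influential.
cook <- cooks.distance(Fit_Best)
max_cook <- max(cook)
print(round(sort(cook, decreasing = TRUE)[1:3], 3))

# In an interactive session, the four standard diagnostic plots
# could be drawn with:
# par(mfrow = c(2, 2)); plot(Fit_Best)
```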
Based on the observations from our best model fit, we can conclude the following:
* Our best model fit explains 85% of the variability in the MPG outcome, with the regressors weight wt, 1/4 mile time qsec, and transmission am.
* Holding weight wt and 1/4 mile time qsec constant, manual transmission gives on average 2.9358 more miles per gallon than automatic transmission.
* Holding 1/4 mile time qsec and transmission am constant, as the weight of the car increases by 1 unit (1000 lb), miles per gallon decreases on average by 3.9165.
* Holding weight wt and transmission am constant, as the 1/4 mile time increases by 1 unit (1 second), miles per gallon increases on average by 1.2259.
## mtcars dataset Correlogram