We work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, the magazine is interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. It is particularly interested in the following two questions:

* “Is an automatic or manual transmission better for MPG?”
* “Quantify the MPG difference between automatic and manual transmissions.”

According to the data obtained by Motor Trend US, cars with manual transmission perform better (higher mpg) than cars with automatic transmission when they weigh less than about 2,800 lb. For cars heavier than this, automatic transmission offers better mpg figures.
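The sections shown here do not derive the 2,800 lb threshold. One way such a crossover could arise is a model with a weight-by-transmission interaction. The sketch below is an assumption for illustration, not necessarily the model the authors used; `fit_int` and `crossover_lb` are hypothetical names.

```r
# Hypothetical model (not part of the analysis shown below):
# mpg explained by weight, transmission, and their interaction.
fit_int <- lm(mpg ~ wt * am, data = mtcars)
b <- coef(fit_int)

# At a given weight, the fitted manual-minus-automatic difference is
#   b["am"] + b["wt:am"] * wt
# It changes sign where wt = -b["am"] / b["wt:am"] (wt is in 1000 lb).
crossover_wt <- unname(-b["am"] / b["wt:am"])
crossover_lb <- crossover_wt * 1000
print(crossover_lb)
```

Below the printed weight the fitted difference favours manual transmission, and above it the difference favours automatic.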

The data used is the mtcars dataset available in R. It has 32 cars and eleven variables:

* mpg: Miles/(US) gallon
* cyl: Number of cylinders
* disp: Displacement (cu.in.)
* hp: Gross horsepower
* drat: Rear axle ratio
* wt: Weight (1000 lbs)
* qsec: 1/4 mile time
* vs: Engine shape (0 = V-shaped, 1 = straight)
* am: Transmission (0 = automatic, 1 = manual)
* gear: Number of forward gears
* carb: Number of carburetors

Let’s take a look at the first rows of our data set.

                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
  Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
  Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
  Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
  Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
  Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
  Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
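The listing above can be reproduced with base R alone, since mtcars ships with the datasets package. A minimal sketch:

```r
# mtcars is part of the built-in 'datasets' package.
data(mtcars)

# Number of cars (rows) and variables (columns).
print(dim(mtcars))

# First six rows, as in the listing above.
print(head(mtcars))
```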

Exploratory Data Analysis

Regression Analysis

In this section, we will try to find the best model fit to explain the relationship between the set of variables and miles per gallon (MPG) through a variable selection procedure called stepwise regression. This method consists of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or removal from the set of explanatory variables based on some prespecified criterion. Usually, this takes the form of a sequence of F-tests or t-tests, but other techniques are possible, such as adjusted R², Akaike information criterion, Bayesian information criterion, Mallows’s Cp, PRESS, or false discovery rate.

Stepwise regression and similar selection methods might not be the best choice for regression analysis [1], [2], but for the sake of this academic project, we will run this procedure.

Model Selection

We will start our selection using a component of the stepwise regression technique called Backward Elimination, in which each variable is considered for removal from the set of explanatory variables based on the Akaike information criterion, as implemented in the R function step.

Base Model

Let’s start by fitting a model with all the variables.
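The call that creates Fit_Base is not shown, but the Call line in the summary output indicates it was fitted as below. A minimal sketch:

```r
# "." expands to every column of mtcars other than the outcome mpg.
Fit_Base <- lm(mpg ~ ., data = mtcars)

# Show the expanded model formula and the coefficient table.
print(formula(Fit_Base))
print(summary(Fit_Base))
```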

  mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
# Summary
summary(Fit_Base)

Call:
lm(formula = mpg ~ ., data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4506 -1.6044 -0.1196  1.2193  4.6271 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 12.30337   18.71788   0.657   0.5181  
cyl         -0.11144    1.04502  -0.107   0.9161  
disp         0.01334    0.01786   0.747   0.4635  
hp          -0.02148    0.02177  -0.987   0.3350  
drat         0.78711    1.63537   0.481   0.6353  
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739  
vs           0.31776    2.10451   0.151   0.8814  
am           2.52023    2.05665   1.225   0.2340  
gear         0.65541    1.49326   0.439   0.6652  
carb        -0.19942    0.82875  -0.241   0.8122  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared:  0.869, Adjusted R-squared:  0.8066 
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

With this first model, none of the individual coefficients is significant at the 5% level, even though the overall F-test is highly significant (p-value = 3.793e-07). This can be due to multicollinearity (as mentioned in the Weight (wt) vs Displacement (disp) example in the Exploratory Analysis section) and to the model including more regressors than it needs (parsimony principle).
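One common way to quantify multicollinearity is the variance inflation factor (VIF), which the original analysis does not use; it is added here as an illustration. The sketch computes it in base R from the inverse of the predictors' correlation matrix. The names `X` and `vif` are hypothetical.

```r
# Predictors only (drop the outcome mpg).
X <- mtcars[, setdiff(names(mtcars), "mpg")]

# Pairwise correlation between weight and displacement.
cor_wt_disp <- cor(mtcars$wt, mtcars$disp)
print(cor_wt_disp)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on all the other predictors. This equals the j-th diagonal element of
# the inverse of the predictors' correlation matrix.
vif <- setNames(diag(solve(cor(X))), names(X))
print(round(sort(vif, decreasing = TRUE), 2))
```

Large values (a common rule of thumb is above 10) suggest that a coefficient's standard error is inflated by correlation with the other regressors.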

Backward Elimination Procedure

We will use the R function step() to look for the best model, starting from our Fit_Base model, which already includes all the variables. As we just mentioned, this function uses the Akaike information criterion (AIC) to select variables for removal. A lower AIC indicates a better trade-off between goodness of fit and model complexity. At each step, the variable whose removal gives the lowest AIC is dropped, so variables are removed one by one.

  Start:  AIC=70.9
  mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
  
         Df Sum of Sq    RSS    AIC
  - cyl   1    0.0799 147.57 68.915
  - vs    1    0.1601 147.66 68.932
  - carb  1    0.4067 147.90 68.986
  - gear  1    1.3531 148.85 69.190
  - drat  1    1.6270 149.12 69.249
  - disp  1    3.9167 151.41 69.736
  - hp    1    6.8399 154.33 70.348
  - qsec  1    8.8641 156.36 70.765
  <none>              147.49 70.898
  - am    1   10.5467 158.04 71.108
  - wt    1   27.0144 174.51 74.280
  
  Step:  AIC=68.92
  mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
  
         Df Sum of Sq    RSS    AIC
  - vs    1    0.2685 147.84 66.973
  - carb  1    0.5201 148.09 67.028
  - gear  1    1.8211 149.40 67.308
  - drat  1    1.9826 149.56 67.342
  - disp  1    3.9009 151.47 67.750
  - hp    1    7.3632 154.94 68.473
  <none>              147.57 68.915
  - qsec  1   10.0933 157.67 69.032
  - am    1   11.8359 159.41 69.384
  - wt    1   27.0280 174.60 72.297
  
  Step:  AIC=66.97
  mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
  
         Df Sum of Sq    RSS    AIC
  - carb  1    0.6855 148.53 65.121
  - gear  1    2.1437 149.99 65.434
  - drat  1    2.2139 150.06 65.449
  - disp  1    3.6467 151.49 65.753
  - hp    1    7.1060 154.95 66.475
  <none>              147.84 66.973
  - am    1   11.5694 159.41 67.384
  - qsec  1   15.6830 163.53 68.200
  - wt    1   27.3799 175.22 70.410
  
  Step:  AIC=65.12
  mpg ~ disp + hp + drat + wt + qsec + am + gear
  
         Df Sum of Sq    RSS    AIC
  - gear  1     1.565 150.09 63.457
  - drat  1     1.932 150.46 63.535
  <none>              148.53 65.121
  - disp  1    10.110 158.64 65.229
  - am    1    12.323 160.85 65.672
  - hp    1    14.826 163.35 66.166
  - qsec  1    26.408 174.94 68.358
  - wt    1    69.127 217.66 75.350
  
  Step:  AIC=63.46
  mpg ~ disp + hp + drat + wt + qsec + am
  
         Df Sum of Sq    RSS    AIC
  - drat  1     3.345 153.44 62.162
  - disp  1     8.545 158.64 63.229
  <none>              150.09 63.457
  - hp    1    13.285 163.38 64.171
  - am    1    20.036 170.13 65.466
  - qsec  1    25.574 175.67 66.491
  - wt    1    67.572 217.66 73.351
  
  Step:  AIC=62.16
  mpg ~ disp + hp + wt + qsec + am
  
         Df Sum of Sq    RSS    AIC
  - disp  1     6.629 160.07 61.515
  <none>              153.44 62.162
  - hp    1    12.572 166.01 62.682
  - qsec  1    26.470 179.91 65.255
  - am    1    32.198 185.63 66.258
  - wt    1    69.043 222.48 72.051
  
  Step:  AIC=61.52
  mpg ~ hp + wt + qsec + am
  
         Df Sum of Sq    RSS    AIC
  - hp    1     9.219 169.29 61.307
  <none>              160.07 61.515
  - qsec  1    20.225 180.29 63.323
  - am    1    25.993 186.06 64.331
  - wt    1    78.494 238.56 72.284
  
  Step:  AIC=61.31
  mpg ~ wt + qsec + am
  
         Df Sum of Sq    RSS    AIC
  <none>              169.29 61.307
  - am    1    26.178 195.46 63.908
  - qsec  1   109.034 278.32 75.217
  - wt    1   183.347 352.63 82.790

As we can see, the AIC of the model decreases every time we remove the suggested variable. The Backward Elimination Procedure stops at AIC = 61.307, when removing any of the remaining variables would result in a higher AIC.
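The procedure above can be sketched as follows. The arguments direction = "backward" and trace = 0 are choices made here for brevity (the log above was produced with tracing on). The AIC values in the log are those of extractAIC(), which for a linear model is n * log(RSS / n) + 2 * p.

```r
Fit_Base <- lm(mpg ~ ., data = mtcars)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
Fit_Best <- step(Fit_Base, direction = "backward", trace = 0)
print(formula(Fit_Best))

# Recompute the AIC used by step() for the selected model by hand.
n   <- nrow(mtcars)                 # number of observations
rss <- sum(residuals(Fit_Best)^2)   # residual sum of squares
p   <- length(coef(Fit_Best))       # estimated coefficients, incl. intercept
aic_manual <- n * log(rss / n) + 2 * p
aic_step   <- extractAIC(Fit_Best)[2]
print(c(manual = aic_manual, extractAIC = aic_step))
```

Note that this value differs from the one returned by AIC() by an additive constant, so only differences between models fitted to the same data are meaningful.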

Best Model Fit

This is what our best model fit looks like:

# Model Summary
summary(Fit_Best)
  
  Call:
  lm(formula = mpg ~ wt + qsec + am, data = mtcars)
  
  Residuals:
      Min      1Q  Median      3Q     Max 
  -3.4811 -1.5555 -0.7257  1.4110  4.6610 
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
  (Intercept)   9.6178     6.9596   1.382 0.177915    
  wt           -3.9165     0.7112  -5.507 6.95e-06 ***
  qsec          1.2259     0.2887   4.247 0.000216 ***
  am            2.9358     1.4109   2.081 0.046716 *  
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 2.459 on 28 degrees of freedom
  Multiple R-squared:  0.8497,  Adjusted R-squared:  0.8336 
  F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
# Confidence Intervals
confint(Fit_Best)
                    2.5 %    97.5 %
  (Intercept) -4.63829946 23.873860
  wt          -5.37333423 -2.459673
  qsec         0.63457320  1.817199
  am           0.04573031  5.825944
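As a worked check of the interval for am, the limits reported by confint() are the estimate plus or minus the 97.5% t quantile (28 residual degrees of freedom) times the standard error. A minimal sketch, with `ci_manual` as a hypothetical name:

```r
Fit_Best <- lm(mpg ~ wt + qsec + am, data = mtcars)
coefs <- coef(summary(Fit_Best))

est <- coefs["am", "Estimate"]
se  <- coefs["am", "Std. Error"]

# Two-sided 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df = df.residual(Fit_Best))
ci_manual <- c(est - t_crit * se, est + t_crit * se)

print(ci_manual)
print(confint(Fit_Best)["am", ])
```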

Best Model Analysis:

Testing the hypothesis that two populations have equal means.

Assuming that the miles per gallon data for automatic and manual cars are normally distributed, we can test the hypothesis that the two populations of cars have equal means.

H0: M1 - M2 = 0. There is no difference in miles per gallon between automatic and manual transmission.

H1: M1 - M2 > 0. Miles per gallon are higher when the cars have manual transmission.

Where:

M1 is the average miles per gallon for manual cars.

M2 is the average miles per gallon for automatic cars.

  
    Welch Two Sample t-test
  
  data:  mpg by am
  t = -3.7671, df = 18.332, p-value = 0.9993
  alternative hypothesis: true difference in means is greater than 0
  95 percent confidence interval:
   -10.57662       Inf
  sample estimates:
  mean in group 0 mean in group 1 
         17.14737        24.39231

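R orders the groups with am = 0 (automatic) first, so the difference tested in the output above is automatic minus manual, that is, M2 - M1. With the alternative "greater", the reported p-value of 0.9993 therefore refers to the direction opposite to H1. For H1 (M1 - M2 > 0, equivalently M2 - M1 < 0), the one-sided p-value is 1 - 0.9993 = 0.0007. At the 5% significance level, there is sufficient evidence to reject the null hypothesis (p-value < significance level): for cars with manual transmission, the average miles per gallon are higher (24.39 versus 17.15).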
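The direction of the test can be checked directly. The sketch below runs the Welch test with both one-sided alternatives; `tt_greater` and `tt_less` are hypothetical names.

```r
# With a formula, t.test() computes mean(group 0) - mean(group 1),
# i.e. automatic minus manual, because am = 0 sorts first.

# Alternative used in the output above: automatic - manual > 0.
tt_greater <- t.test(mpg ~ am, data = mtcars, alternative = "greater")

# Alternative matching H1 (manual higher): automatic - manual < 0.
tt_less <- t.test(mpg ~ am, data = mtcars, alternative = "less")

print(tt_greater$p.value)
print(tt_less$p.value)
```

The two one-sided p-values sum to 1, so only the second one supports rejecting the null hypothesis in favour of H1.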

Residual Analysis

Consult Figure 5 in the appendix for this analysis.

* The points in the Residuals vs. Fitted plot seem to be randomly scattered, which is consistent with the assumption of homoscedasticity of the errors.
* The points in the Normal Q-Q plot mostly fall on the line, indicating that the residuals are approximately normally distributed.
* The Scale-Location plot supports the constant variance assumption, as the points are randomly distributed.
* All Cook’s distances are less than 1 (D < 1), which means that no single observation is highly influential.
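The diagnostics above can be reproduced roughly as follows. The code behind Figure 5 is not shown, so the 2 x 2 layout and the interactive() guard are assumptions made for this sketch.

```r
Fit_Best <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Cook's distance for every car; the largest value flags the most
# influential observation.
cooks <- cooks.distance(Fit_Best)
max_cook <- max(cooks)
print(names(which.max(cooks)))
print(max_cook)

# The four default lm diagnostic plots: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage (with Cook's distance contours).
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  plot(Fit_Best)
  par(op)
}
```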

Conclusion

Based on the observations from our best model fit, we can conclude the following:

* Our best model fit explains about 85% of the variability in MPG (R-squared = 0.8497, adjusted R-squared = 0.8336), with the regressors weight wt, 1/4 mile time qsec, and transmission am.
* Holding weight wt and 1/4 mile time qsec constant, cars with manual transmission get on average 2.9358 more miles per gallon than cars with automatic transmission (95% confidence interval: 0.05 to 5.83).
* Holding 1/4 mile time qsec and transmission am constant, as the weight of the car increases by 1 unit (1000 lbs), the miles per gallon decrease on average by 3.9165.
* Holding weight wt and transmission am constant, as the 1/4 mile time increases by 1 unit (1 second), the miles per gallon increase on average by 1.2259.

Appendix

Scatterplot Matrix and Correlations for mtcars dataset

  

mtcars dataset Correlogram
