Executive Summary

On behalf of Motor Trend magazine, we examine the mtcars dataset (a collection of cars). We are particularly interested in the following two issues:

In this report, we analyze the mtcars dataset and explore the relationship between miles per gallon (mpg) and transmission type (am), with miles per gallon as the outcome.

The key inference from our analysis is:

Data processing and transformation

data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am <- factor(mtcars$am, labels = c('Automatic', 'Manual'))
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

Regression Analysis

In this section, we build linear regression models on different sets of variables, look for the best-fitting model, and compare it with a base model using anova. After model selection, we analyze the residuals.

Multiple models fitting and strategy for model selection

Since several variables are highly correlated with mpg, we first fit an initial linear model with all the variables as predictors. We then perform stepwise model selection to choose the predictors for the final model. The step function fits a sequence of regression models, adding and dropping one term at a time (forward selection and backward elimination), and keeps the model with the lowest AIC.

initialModel <- lm(mpg ~ ., data = mtcars)
finalModel <- step(initialModel, direction = "both")
## Start:  AIC=76.4
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq RSS  AIC
## - carb  5     13.60 134 69.8
## - gear  2      3.97 124 73.4
## - am    1      1.14 122 74.7
## - qsec  1      1.24 122 74.7
## - drat  1      1.82 122 74.9
## - cyl   2     10.93 131 75.2
## - vs    1      3.63 124 75.4
## <none>              120 76.4
## - disp  1      9.97 130 76.9
## - wt    1     25.55 146 80.6
## - hp    1     25.67 146 80.6
## 
## Step:  AIC=69.83
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear
## 
##        Df Sum of Sq RSS  AIC
## - gear  2      5.02 139 67.0
## - disp  1      0.99 135 68.1
## - drat  1      1.19 135 68.1
## - vs    1      3.68 138 68.7
## - cyl   2     12.56 147 68.7
## - qsec  1      5.26 139 69.1
## <none>              134 69.8
## - am    1     11.93 146 70.6
## - wt    1     19.80 154 72.2
## - hp    1     22.79 157 72.9
## + carb  5     13.60 120 76.4
## 
## Step:  AIC=67
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am
## 
##        Df Sum of Sq RSS  AIC
## - drat  1      0.97 140 65.2
## - cyl   2     10.42 149 65.3
## - disp  1      1.55 141 65.4
## - vs    1      2.18 141 65.5
## - qsec  1      3.63 143 65.8
## <none>              139 67.0
## - am    1     16.57 156 68.6
## - hp    1     18.18 157 68.9
## + gear  2      5.02 134 69.8
## - wt    1     31.19 170 71.5
## + carb  5     14.65 124 73.4
## 
## Step:  AIC=65.23
## mpg ~ cyl + disp + hp + wt + qsec + vs + am
## 
##        Df Sum of Sq RSS  AIC
## - disp  1      1.25 141 63.5
## - vs    1      2.34 142 63.8
## - cyl   2     12.33 152 63.9
## - qsec  1      3.10 143 63.9
## <none>              140 65.2
## + drat  1      0.97 139 67.0
## - hp    1     17.74 158 67.0
## - am    1     19.47 160 67.4
## + gear  2      4.80 135 68.1
## - wt    1     30.72 171 69.6
## + carb  5     13.05 127 72.1
## 
## Step:  AIC=63.51
## mpg ~ cyl + hp + wt + qsec + vs + am
## 
##        Df Sum of Sq RSS  AIC
## - qsec  1       2.4 144 62.1
## - vs    1       2.7 144 62.1
## - cyl   2      18.6 160 63.5
## <none>              141 63.5
## + disp  1       1.2 140 65.2
## + drat  1       0.7 141 65.4
## - hp    1      18.2 159 65.4
## - am    1      18.9 160 65.5
## + gear  2       4.7 137 66.4
## - wt    1      39.6 181 69.4
## + carb  5       2.3 139 73.0
## 
## Step:  AIC=62.06
## mpg ~ cyl + hp + wt + vs + am
## 
##        Df Sum of Sq RSS  AIC
## - vs    1       7.3 151 61.7
## <none>              144 62.1
## - cyl   2      25.3 169 63.2
## + qsec  1       2.4 141 63.5
## - am    1      16.4 160 63.5
## + disp  1       0.6 143 63.9
## + drat  1       0.3 143 64.0
## + gear  2       3.4 140 65.3
## - hp    1      36.3 180 67.3
## - wt    1      41.1 185 68.1
## + carb  5       3.5 140 71.3
## 
## Step:  AIC=61.65
## mpg ~ cyl + hp + wt + am
## 
##        Df Sum of Sq RSS  AIC
## <none>              151 61.7
## - am    1       9.8 161 61.7
## + vs    1       7.3 144 62.1
## + qsec  1       7.0 144 62.1
## - cyl   2      29.3 180 63.3
## + disp  1       0.6 150 63.5
## + drat  1       0.2 151 63.6
## + gear  2       1.4 150 65.4
## - hp    1      31.9 183 65.8
## - wt    1      46.2 197 68.2
## + carb  5       5.6 145 70.4

The final model keeps number of cylinders (cyl), gross horsepower (hp) and weight (wt) as confounders, with transmission type (am) as the predictor of interest.
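As a rough check on the selection, the AIC values that step reports can be recomputed directly. The sketch below assumes the same factor conversions as in the data-processing section; the data-frame name mt and the variables aic_initial and aic_final are illustrative names, not part of the original analysis.

```r
# Sketch: recompute the AIC values that step() reports for the full and
# the selected model. `mt` is an illustrative copy of mtcars.
mt <- datasets::mtcars
for (v in c("cyl", "vs", "gear", "carb")) mt[[v]] <- factor(mt[[v]])
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

initialModel <- lm(mpg ~ ., data = mt)
# trace = 0 suppresses the step-by-step log
finalModel <- step(initialModel, direction = "both", trace = 0)

# extractAIC() returns c(equivalent degrees of freedom, AIC); this is the
# criterion step() minimises (it differs from AIC() by an additive constant)
aic_initial <- extractAIC(initialModel)[2]
aic_final   <- extractAIC(finalModel)[2]

print(formula(finalModel))
print(round(c(initial = aic_initial, final = aic_final), 2))
```

The two printed values should correspond to the "Start" and last "Step" AIC lines in the log above.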

summary(finalModel)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.939 -1.256 -0.401  1.125  5.051 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.7083     2.6049   12.94  7.7e-13 ***
## cyl6         -3.0313     1.4073   -2.15   0.0407 *  
## cyl8         -2.1637     2.2843   -0.95   0.3523    
## hp           -0.0321     0.0137   -2.35   0.0269 *  
## wt           -2.4968     0.8856   -2.82   0.0091 ** 
## amManual      1.8092     1.3963    1.30   0.2065    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.866,  Adjusted R-squared:  0.84 
## F-statistic: 33.6 on 5 and 26 DF,  p-value: 1.51e-10

From the above summary we can formulate the following linear equation, where cyl6, cyl8 and amManual are 0/1 indicator variables:

mpg = 33.7083 - 3.0313 * cyl6 - 2.1637 * cyl8 - 0.0321 * hp - 2.4968 * wt + 1.8092 * amManual

From this equation we can interpret the transmission coefficient: with all else held constant, the model predicts that a manual car gets 1.8092 miles per gallon more than an automatic car, on average. Note, however, that this coefficient is not statistically significant at the 0.05 level (p = 0.2065).
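To put an uncertainty range on the transmission effect, one option is a confidence interval for the amManual coefficient. This is a minimal sketch that refits the selected model on an illustrative copy of the data (mt and fit are assumed names); confint is the standard stats function for lm objects.

```r
# Sketch: 95% confidence interval for the transmission coefficient.
mt <- datasets::mtcars
mt$cyl <- factor(mt$cyl)
mt$am  <- factor(mt$am, labels = c("Automatic", "Manual"))

# Same formula as the model selected by step()
fit <- lm(mpg ~ cyl + hp + wt + am, data = mt)

# confint() returns a 1 x 2 matrix: lower and upper bound
ci <- confint(fit, "amManual", level = 0.95)
print(ci)

# TRUE when the interval contains zero, i.e. the adjusted difference
# between Manual and Automatic is not distinguishable from no difference
spans_zero <- ci[1, 1] < 0 && ci[1, 2] > 0
print(spans_zero)
```

An interval that contains zero is consistent with the p-value of 0.2065 in the summary table.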

Now we compare the finalModel with another model (baseModel) that has only the transmission variable am as a predictor.

baseModel <- lm(mpg ~ am, data = mtcars)
anova(baseModel, finalModel)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
##   Res.Df RSS Df Sum of Sq    F  Pr(>F)    
## 1     30 721                              
## 2     26 151  4       570 24.5 1.7e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value of the F test (1.7e-08) is highly significant, so we reject the null hypothesis that the confounder variables cyl, hp and wt don’t contribute to the accuracy of the model.
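The F statistic in the anova table can be reproduced by hand from the residual sums of squares of the two nested models. The sketch below is one way to do this; mt, f_stat and p_value are illustrative names.

```r
# Sketch: the nested-model F test computed from residual sums of squares.
mt <- datasets::mtcars
mt$cyl <- factor(mt$cyl)
mt$am  <- factor(mt$am, labels = c("Automatic", "Manual"))

baseModel  <- lm(mpg ~ am, data = mt)
finalModel <- lm(mpg ~ cyl + hp + wt + am, data = mt)

rss_base  <- deviance(baseModel)     # residual sum of squares, model 1
rss_final <- deviance(finalModel)    # residual sum of squares, model 2
df_base   <- df.residual(baseModel)
df_final  <- df.residual(finalModel)

# F = (drop in RSS per added parameter) / (residual variance of larger model)
df_diff <- df_base - df_final
f_stat  <- ((rss_base - rss_final) / df_diff) / (rss_final / df_final)
p_value <- pf(f_stat, df_diff, df_final, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```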

Exploratory Data Analysis

Since we are interested in the effect of transmission type on mpg, we draw boxplots of mpg for Automatic and Manual cars.

plot(mpg ~ am, data = mtcars)

[Figure: boxplots of mpg by transmission type (am)]

The plot shows that cars with a Manual transmission tend to have higher mileage (mpg).
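The visual difference can also be put into numbers. The following sketch computes the group means and runs a Welch two-sample t-test; this is an unadjusted comparison that ignores the other variables, and the names mt, group_means and welch are illustrative.

```r
# Sketch: unadjusted comparison of mpg between the two transmission types.
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

# Mean mpg within each transmission group
group_means <- tapply(mt$mpg, mt$am, mean)
print(round(group_means, 2))

# Welch two-sample t-test (t.test() does not assume equal variances by default)
welch <- t.test(mpg ~ am, data = mt)
print(welch$p.value)
```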

Residual Plots and Diagnostics

In this section, we analyse the residual plots of our finalModel and compute some regression diagnostics to identify high-leverage and influential points in the data set.

par(mfrow = c(2, 2))
plot(finalModel)

[Figure: residual diagnostic plots for finalModel]

From the above plots, we can make the following observations:

  • The points in the Residuals vs. Fitted plot appear randomly scattered, which is consistent with the independence condition.

  • In the Normal Q-Q plot, most points fall on the line, indicating that the residuals are approximately normally distributed.

  • In the Scale-Location plot, the points are scattered in a band of roughly constant width, indicating constant variance.

  • There are some distinct points of interest (outliers or leverage points) in the top right of the plots.
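The visual reading of the Q-Q plot can be supplemented with a formal test. The sketch below uses the Shapiro-Wilk test, which is not part of the original analysis and is offered only as one possible check; mt, fit and sw are illustrative names.

```r
# Sketch: Shapiro-Wilk normality test on the residuals of the selected model.
mt <- datasets::mtcars
mt$cyl <- factor(mt$cyl)
mt$am  <- factor(mt$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + am, data = mt)

# Null hypothesis: the residuals come from a normal distribution.
# A large p-value means the test finds no evidence against normality.
sw <- shapiro.test(residuals(fit))
print(sw)
```

With only 32 observations the test has limited power, so it complements the Q-Q plot rather than replacing it.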

We now compute regression diagnostics for our model to identify these points. For each influence measure, we list the three largest values.

leverage <- hatvalues(finalModel)
tail(sort(leverage), 3)
##       Toyota Corona Lincoln Continental       Maserati Bora 
##              0.2778              0.2937              0.4714
influential <- dfbetas(finalModel)
# column 6 of the dfbetas matrix corresponds to the amManual coefficient
tail(sort(influential[, 6]), 3)
## Chrysler Imperial          Fiat 128     Toyota Corona 
##            0.3507            0.4292            0.7305
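Rather than inspecting only the top three values, the points can be flagged against cutoffs. The sketch below uses two common rules of thumb, 2p/n for hat values and 2/sqrt(n) for dfbetas; these thresholds are assumptions introduced here for illustration and are not part of the original analysis, as are the names mt, fit, high_leverage and influential_am.

```r
# Sketch: flag points using conventional rule-of-thumb cutoffs.
mt <- datasets::mtcars
mt$cyl <- factor(mt$cyl)
mt$am  <- factor(mt$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + am, data = mt)

n <- nrow(mt)            # number of observations
p <- length(coef(fit))   # number of estimated coefficients

# Leverage: hat values sum to p, so the average is p / n;
# the assumed cutoff is twice that average
lev <- hatvalues(fit)
high_leverage <- names(lev)[lev > 2 * p / n]
print(high_leverage)

# Influence on the transmission coefficient: assumed cutoff 2 / sqrt(n)
dfb_am <- dfbetas(fit)[, "amManual"]
influential_am <- names(dfb_am)[abs(dfb_am) > 2 / sqrt(n)]
print(influential_am)
```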

Conclusion

Based on the observations from our finalModel, we can conclude the following: