Main Report

The purpose of this peer assignment is to quantitatively determine how the dependent (output) variable, MPG, varies with transmission type (automatic versus manual). The choice of explanatory variables in the linear model heavily affects the output: the estimated impact of transmission type on MPG depends on which other input variables are used to fit the model. Several models will be tested to explore this relationship.

As a reminder, the ultimate goal of this assignment is to answer the following 2 questions:
(1) Is an automatic or manual transmission better for MPG?
(2) Quantify the MPG difference between automatic and manual transmissions.

First, let’s explore the “mtcars” dataset.

data (mtcars)
head (mtcars, 3)

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

str (mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

From exploring the data, it is clear the transmission type variable should be converted into a factor variable before we perform any data analysis.

library (dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mtcars_ds <- mutate (mtcars, Transmission_Type = factor (am, levels = c (0, 1),
  labels = c ("Automatic", "Manual")))
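As a quick sanity check on the conversion, the sketch below rebuilds the factor with base R only (using `transform` instead of dplyr, which is an assumption made to keep the example dependency-free) and cross-tabulates the original 0/1 code against the new labels. The name `check_tab` is illustrative.

```r
# Base-R sketch (no dplyr): rebuild the factor and confirm it mirrors `am`.
data(mtcars)
mtcars_ds <- transform(
  mtcars,
  Transmission_Type = factor(am, levels = c(0, 1),
                             labels = c("Automatic", "Manual"))
)

# Cross-tabulate the original 0/1 code against the new labels;
# every 0 should land in "Automatic" and every 1 in "Manual".
check_tab <- table(am = mtcars_ds$am, type = mtcars_ds$Transmission_Type)
print(check_tab)
```

The off-diagonal cells of the table should be zero if the labels were assigned in the intended order.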

Now, let’s test the relationship between our output (MPG) and different input variables (transmission type, vehicle weight, etc.). We will look at the following 6 things for each model: exploratory data analysis/plots, estimated coefficients (intercept, slope, etc.), coefficients p-value, uncertainty, R-squared value, and residual plots & diagnostics.

Model 1 - MPG vs. Transmission Type

Reasoning: Transmission type heavily affects the MPG efficiency of a vehicle.

Exploratory Data Analysis/Plots

lm_1 <- lm (mpg ~ Transmission_Type, mtcars_ds)
summary (lm_1)

## 
## Call:
## lm(formula = mpg ~ Transmission_Type, data = mtcars_ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               17.147      1.125  15.247 1.13e-15 ***
## Transmission_TypeManual    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

plot (mtcars_ds$Transmission_Type, mtcars_ds$mpg, col = "lightblue", main =
  "MPG Variation With Transmission Type", xlab = 
  "Transmission Type", ylab = "MPG")

Interpret Coefficients with Plot
Based on the results above, the linear model is represented as: MPG = 17.147 + 7.245*Transmission_Type, where Transmission_Type takes the value 0 for automatic and 1 for manual. The intercept (17.147) is the mean MPG of automatic vehicles, and the slope is the difference between the two group means: on average, a manual transmission vehicle gets 7.245 MPG more than an automatic one when transmission type is the only predictor in the model.
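With a single two-level predictor, the regression coefficients can be recovered from the group means directly. The following sketch illustrates this under the assumption that `am` is used as the 0/1 grouping code; the variable names are illustrative.

```r
data(mtcars)

# Mean MPG in each transmission group (0 = automatic, 1 = manual).
mean_auto   <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])
mean_diff   <- mean_manual - mean_auto

# Fit the same one-predictor model and compare.
fit1 <- lm(mpg ~ factor(am), data = mtcars)
print(c(intercept = unname(coef(fit1)[1]), mean_auto = mean_auto))
print(c(slope = unname(coef(fit1)[2]), mean_diff = mean_diff))
```

The intercept should match the automatic group mean and the slope should match the difference in means.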
Interpret Coefficients P-value
Let’s look at the p-values for both coefficients. Both are far below 0.05. The null hypothesis for each coefficient states that it is equal to zero, i.e. it has no effect on the dependent variable. The p-value is the probability of observing a t-statistic at least as extreme as ours if that null hypothesis were true. Since both p-values are less than 0.05, we reject the null hypothesis and conclude that both coefficients are statistically significant.
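To make the link between the t value and the p-value concrete, the sketch below recomputes the two-sided p-value for the transmission coefficient from the t distribution. It is an illustration only; `p_manual` and the other names are not from the original report.

```r
data(mtcars)
fit1  <- lm(mpg ~ factor(am), data = mtcars)
coefs <- summary(fit1)$coefficients

# Two-sided p-value: probability of a |t| at least this large
# under the null hypothesis, with the model's residual degrees of freedom.
t_val    <- coefs[2, "t value"]
df_resid <- df.residual(fit1)
p_manual <- 2 * pt(-abs(t_val), df = df_resid)

print(c(by_hand = p_manual, from_summary = coefs[2, "Pr(>|t|)"]))
```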
Uncertainty of Coefficients
To get the 95% confidence intervals for both coefficients (intercept and Transmission_Type):

confint (lm_1)

##                            2.5 %   97.5 %
## (Intercept)             14.85062 19.44411
## Transmission_TypeManual  3.64151 10.84837

We are 95% confident that the true intercept and transmission type (Transmission_Type) coefficients fall within the intervals stated above. The interval for Transmission_Type excludes zero, which agrees with its small p-value.
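The intervals reported by `confint` can be reproduced as estimate ± t-quantile × standard error. The sketch below shows that calculation; the object names are illustrative.

```r
data(mtcars)
fit1  <- lm(mpg ~ factor(am), data = mtcars)
coefs <- summary(fit1)$coefficients

est    <- coefs[, "Estimate"]
se     <- coefs[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit1))  # 97.5th percentile of t

# 95% interval: estimate plus/minus critical value times standard error.
ci_manual <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci_manual)
print(confint(fit1))
```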

Interpret R-squared
The R-squared value is 0.3598, which means about 36% of the variation in MPG is explained by the model. R-squared gives us an idea of the ‘goodness-of-fit’ of the model: it measures how close the observed values are to the fitted regression line.
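R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, with illustrative names:

```r
data(mtcars)
fit1 <- lm(mpg ~ factor(am), data = mtcars)

rss <- sum(residuals(fit1)^2)                    # unexplained variation
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total variation in MPG
r2  <- 1 - rss / tss

print(c(by_hand = r2, from_summary = summary(fit1)$r.squared))
```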
Residual Plots & Diagnostics
Using residual plots, we can test our assumptions for linear regression. Our linear regression assumptions state that:

The Y-values can be expressed as a linear function of the X variable.
Homoscedasticity - The variation of observations around the regression line is constant for all values of X.
For a given value of X, the Y values (or the error) are normally distributed.

par (mfrow = c(2,2))
plot (lm_1)

We get 4 plots, as shown above. The top-left ‘Residuals vs. Fitted’ plot tells us whether our linearity assumption is met and whether the variation of observations around the regression line is constant for all values of X. The straight red line in the plot suggests the first assumption holds, i.e. the Y-values can be expressed as a linear function of X. That said, this does not mean much here: our explanatory variable is categorical with only 2 levels, so the model has only 2 distinct fitted values (the two group means).

Let’s now test our data for homoscedasticity. A quick glance at our ‘Residuals vs. Fitted’ and ‘Scale-Location’ (bottom-left) plots suggests that the variation of observations around the regression line is not constant for all values of X: the spread of the residuals differs between the two transmission groups. This violates one of our key assumptions for linear regression. Adding further explanatory variables in a multivariable linear regression may help, which we explore in the following models.
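The visual impression of unequal spread can be backed up numerically. The sketch below compares the residual variance in each group and runs an F test of equal variances with `var.test`; this formal test is an addition for illustration and was not part of the original analysis.

```r
data(mtcars)
fit1 <- lm(mpg ~ factor(am), data = mtcars)

# Residual variance within each transmission group (0 = automatic, 1 = manual).
resid_var <- tapply(residuals(fit1), mtcars$am, var)
print(resid_var)

# F test of the null hypothesis that the two group variances are equal.
vt <- var.test(mpg ~ am, data = mtcars)
print(vt$p.value)
```

A large gap between the two variances, or a small p-value, would support the reading of the diagnostic plots.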

Lastly, the ‘Normal Q-Q’ plot gives us a visual idea of whether our errors are normally distributed. It plots the quantiles of the standardized residuals against the theoretical quantiles of a normal distribution. A quick visual inspection shows the data points falling close to the diagonal reference line, which supports our assumption that the errors are normally distributed.
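A Q-Q plot is a visual check; a Shapiro-Wilk test on the standardized residuals offers a complementary numeric one. This test is an assumption added for illustration, not something the original analysis ran.

```r
data(mtcars)
fit1 <- lm(mpg ~ factor(am), data = mtcars)

# Standardized residuals are what the 'Normal Q-Q' diagnostic plot displays.
std_resid <- rstandard(fit1)

# Null hypothesis: the residuals come from a normal distribution.
sw <- shapiro.test(std_resid)
print(sw)
```

A p-value above 0.05 would mean the test finds no evidence against normality, which is consistent with, but weaker than, proving the assumption.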

That concludes our analysis for Model 1. Now, let’s incorporate other variables into our regression model and explore how the model changes with respect to the 6 topics we just discussed - exploratory data analysis/plots, estimated coefficients (intercept, slope, etc.), coefficients p-value, uncertainty, R-squared value, and residual plots & diagnostics. We will also discuss a 7th item, the F-statistic, which allows us to compare various nested models.

Model 2 - MPG vs. Transmission Type & Vehicle Weight

Reasoning: Vehicle weight also affects the MPG efficiency of a vehicle, in addition to transmission type. Weight will be added to the model as a potential confounding variable.

Exploratory Data Analysis/Plots

lm_2 <- lm (mpg ~ Transmission_Type + wt, mtcars_ds)
summary (lm_2)

## 
## Call:
## lm(formula = mpg ~ Transmission_Type + wt, data = mtcars_ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5295 -2.3619 -0.1317  1.4025  6.8782 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             37.32155    3.05464  12.218 5.84e-13 ***
## Transmission_TypeManual -0.02362    1.54565  -0.015    0.988    
## wt                      -5.35281    0.78824  -6.791 1.87e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.098 on 29 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7358 
## F-statistic: 44.17 on 2 and 29 DF,  p-value: 1.579e-09

library (ggplot2)
g2 <- ggplot (mtcars_ds, aes (wt, mpg))
g2 <- g2 + geom_point(aes (color = Transmission_Type), alpha = 0.75, size = 4) + 
  labs (title = "MPG Vs. Vehicle Weight & Transmission Type") + 
  labs (x = "Vehicle Weight [1000 lb]", y = "Efficiency [MPG]")
g2 <- g2 + geom_abline (intercept = coef(lm_2)[1] + coef(lm_2)[2] * 0, slope = coef(lm_2)[3],
  size = 2, col = "lightcoral")
g2 <- g2 + geom_abline (intercept = coef(lm_2)[1] + coef(lm_2)[2] * 1, slope = coef(lm_2)[3],
  size = 1, col = "turquoise3")
g2

Interpret Coefficients with Plot
Based on the results above, the linear model is represented as: MPG = 37.322 - 0.024*AM - 5.353*wt, where AM takes the value 0 for automatic and 1 for manual. As expected, with the model adjusted for weight (meaning weight is included in the model), the effect of transmission type on MPG is attenuated. In fact, the effect is so sharply reduced that the AM coefficient changes sign, from 7.245 in the first model to -0.024 in this second model. Taken at face value, this means that holding weight constant, automatic vehicles have a very slightly higher MPG than manual vehicles, although the difference is practically zero. For both transmission types, MPG drops by 5.353 for every 1000 lb increase in weight. We will look at the p-values for all 3 coefficients next.
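"Holding weight constant" can be illustrated by predicting MPG for two hypothetical cars of identical weight that differ only in transmission type. The 3000 lb weight and the names `new_cars` and `gap` are assumptions chosen for this sketch; the data frame is rebuilt with base R to keep the example self-contained.

```r
data(mtcars)
mtcars_ds <- transform(
  mtcars,
  Transmission_Type = factor(am, levels = c(0, 1),
                             labels = c("Automatic", "Manual"))
)
fit2 <- lm(mpg ~ Transmission_Type + wt, data = mtcars_ds)

# Two hypothetical cars at wt = 3 (3000 lb), one per transmission type.
new_cars <- data.frame(
  Transmission_Type = factor(c("Automatic", "Manual"),
                             levels = levels(mtcars_ds$Transmission_Type)),
  wt = 3
)
pred <- predict(fit2, newdata = new_cars)

# The manual-minus-automatic gap equals the AM coefficient at any fixed weight.
gap <- unname(pred[2] - pred[1])
print(pred)
print(gap)
```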
Interpret Coefficients P-value
Let’s now look at the p-values for all 3 coefficients (intercept, AM, weight). For the intercept and weight coefficients, the p-values are both extremely small (<0.05). Hence, we reject the null hypothesis that these coefficients are equal to zero, and we conclude that they are statistically significant. In other words, changes in the predictor values are related to changes in the response variable. The p-value for our AM coefficient (0.988), however, is much larger than 0.05. Thus, we fail to reject the null hypothesis: once weight is accounted for, there is no evidence that the AM variable affects MPG, the dependent variable. In practical terms, there is a strong case for leaving the AM variable out of this model.
Uncertainty of Coefficients
To get the 95% confidence intervals for the coefficients:

confint (lm_2)

##                             2.5 %    97.5 %
## (Intercept)             31.074114 43.568989
## Transmission_TypeManual -3.184815  3.137584
## wt                      -6.964951 -3.740672

We are 95% confident that the true values of these 3 coefficients fall within the intervals stated above. Note that the interval for Transmission_Type contains zero.

Interpret R-squared
The adjusted R-squared value is 0.7358, which means about 74% of the variation in MPG is explained by the model, after adjusting for the number of predictors.
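Adjusted R-squared penalizes R-squared for the number of predictors: 1 - (1 - R²)(n - 1)/(n - p - 1), with n observations and p predictors. The sketch below applies that formula to model 2; the object names are illustrative.

```r
data(mtcars)
mtcars_ds <- transform(
  mtcars,
  Transmission_Type = factor(am, levels = c(0, 1),
                             labels = c("Automatic", "Manual"))
)
fit2 <- lm(mpg ~ Transmission_Type + wt, data = mtcars_ds)

n  <- nrow(mtcars_ds)            # number of observations
p  <- length(coef(fit2)) - 1     # predictors, excluding the intercept
r2 <- summary(fit2)$r.squared

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(c(by_hand = adj_r2, from_summary = summary(fit2)$adj.r.squared))
```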
Residual Plots & Diagnostics
Using residual plots, let’s test our assumptions for linear regression.

par (mfrow = c(2,2))
plot (lm_2)

We get 4 plots, as shown above. The top-left ‘Residuals vs. Fitted’ plot tells us whether our linearity assumption is met and whether the variation of observations around the regression line is constant for all values of X. The red line in the plot has a strong curvature, suggesting a non-linear relationship between the dependent and independent variables. Thus, our first assumption of linearity is violated.

With regard to homoscedasticity, a quick glance at the ‘Residuals vs. Fitted’ plot shows the following pattern: at small and large fitted values, residuals have only positive values; in the middle (15-25 MPG) range, however, residuals take both positive and negative values. This suggests the variation of observations is not constant for all values of X, which would violate our linear regression assumption of homoscedasticity.

Lastly, with regard to our assumption that the residuals are normally distributed, the ‘Normal Q-Q’ plot shows the residual points lining up along the diagonal line. Hence, we can conclude that our errors are approximately normally distributed.

ANOVA/F-statistic
Using the ANOVA method, we can compare different linear models to determine which model is ‘better’ and decide which variable to keep or leave out.

anova (lm_1, lm_2)

## Analysis of Variance Table
## 
## Model 1: mpg ~ Transmission_Type
## Model 2: mpg ~ Transmission_Type + wt
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     29 278.32  1    442.58 46.115 1.867e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Recall that the first linear model regresses MPG on transmission type alone, whereas the second model regresses MPG on transmission type and weight. According to our ANOVA output, the first model has an RSS (residual sum of squares) of 720.90 compared to 278.32 for the second model. The null hypothesis states that the extra term does not reduce the RSS by more than chance would, i.e. the models do not significantly differ. The alternative hypothesis states that the reduction in RSS is statistically significant, so that the ‘full’ model (the ‘bigger’ model with more parameters) fits significantly better than the ‘reduced’ model. Since the p-value is extremely small (1.867e-07) and less than 0.05, we reject the null hypothesis at the 5% significance level: model 2 is significantly better than model 1, and the weight term should be included in the model.
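The F-statistic in the ANOVA table is the drop in RSS per added parameter, divided by the residual variance of the full model. The sketch below recomputes it from the two fits; the object names are illustrative.

```r
data(mtcars)
mtcars_ds <- transform(
  mtcars,
  Transmission_Type = factor(am, levels = c(0, 1),
                             labels = c("Automatic", "Manual"))
)
fit1 <- lm(mpg ~ Transmission_Type, data = mtcars_ds)
fit2 <- lm(mpg ~ Transmission_Type + wt, data = mtcars_ds)

rss1 <- deviance(fit1); df1 <- df.residual(fit1)  # reduced model
rss2 <- deviance(fit2); df2 <- df.residual(fit2)  # full model

# F = (reduction in RSS per extra parameter) / (full-model residual variance)
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_val  <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```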

Model 3 - MPG vs. Transmission Type & Vehicle Weight (as an interaction term)

Reasoning: In this model, weight and transmission type will be modeled with an interaction term. Interaction means that the effects of weight and transmission type on MPG are not independent of each other, i.e. the effect of weight on MPG depends on the value of transmission type, and vice versa.

Exploratory Data Analysis/Plots

lm_3 <- lm (mpg ~ Transmission_Type * wt, mtcars_ds)
summary (lm_3)

## 
## Call:
## lm(formula = mpg ~ Transmission_Type * wt, data = mtcars_ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6004 -1.5446 -0.5325  0.9012  6.0909 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 31.4161     3.0201  10.402 4.00e-11 ***
## Transmission_TypeManual     14.8784     4.2640   3.489  0.00162 ** 
## wt                          -3.7859     0.7856  -4.819 4.55e-05 ***
## Transmission_TypeManual:wt  -5.2984     1.4447  -3.667  0.00102 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.591 on 28 degrees of freedom
## Multiple R-squared:  0.833,  Adjusted R-squared:  0.8151 
## F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11

library (ggplot2)
g3 <- ggplot (mtcars_ds, aes (wt, mpg))
g3 <- g3 + geom_point(aes (color = Transmission_Type), alpha = 0.75, size = 4) + 
  labs (title = "MPG Vs. Vehicle Weight & Transmission Type") + 
  labs (x = "Vehicle Weight [1000 lb]", y = "Efficiency [MPG]")
g3 <- g3 + geom_abline (intercept = coef(lm_3)[1] + coef(lm_3)[2] * 0, slope = coef(lm_3)[3] + 
  coef(lm_3)[4] * 0, size = 2, col = "lightcoral")
g3 <- g3 + geom_abline (intercept = coef(lm_3)[1] + coef(lm_3)[2] * 1, slope = coef(lm_3)[3] + 
  coef(lm_3)[4] * 1, size = 2, col = "turquoise3")
g3

Interpret Coefficients with Plot
Based on the results above, the linear model is represented as: MPG = 31.416 + 14.878*AM - 3.786*wt - 5.298*AM*wt, where AM takes the value 0 for automatic and 1 for manual.

Looking at the linear model equation and graph above, a vehicle with a manual transmission starts off 14.878 MPG better than an automatic vehicle at a weight of zero pounds. In real-life applications, however, that interpretation is meaningless since no car weighs zero pounds. The interesting thing to discuss is the 2 slopes. An automatic vehicle loses 3.786 MPG for every 1000 lb increase in car weight, compared to 9.084 MPG (3.786 + 5.298) for a manual one. The two fitted lines cross where 14.878 - 5.298*wt = 0, i.e. at wt ≈ 2.81, or roughly 2,800 lb. By weight = 3000 lb, an automatic vehicle is already predicted to be more efficient than a manual vehicle of the same weight. One can conclude that a manual vehicle has a better MPG at lower weights (below roughly 2,800 lb), while the reverse is true for heavier cars.
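The weight at which the two fitted lines cross follows from setting the manual-minus-automatic difference, b_AM + b_AM:wt × wt, to zero. A small sketch of that calculation, with illustrative names:

```r
data(mtcars)
mtcars_ds <- transform(
  mtcars,
  Transmission_Type = factor(am, levels = c(0, 1),
                             labels = c("Automatic", "Manual"))
)
fit3 <- lm(mpg ~ Transmission_Type * wt, data = mtcars_ds)
b <- coef(fit3)

# Manual minus automatic at a given weight: b_AM + b_AM:wt * wt.
# Setting this to zero gives the crossover weight (in 1000 lb).
crossover_wt <- -b[["Transmission_TypeManual"]] / b[["Transmission_TypeManual:wt"]]

# Predicted manual advantage at 3000 lb (negative means automatic is ahead).
gap_at_3 <- b[["Transmission_TypeManual"]] + b[["Transmission_TypeManual:wt"]] * 3

print(c(crossover_wt = crossover_wt, gap_at_3 = gap_at_3))
```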

Another interesting observation is the MPG data we have available for both transmission types. A quick glance at the plot shows that the data points for manual transmissions are concentrated at the lower weights, roughly 1.5-3.5 [1000] pounds. Contrast that with the automatic type, where weights range from roughly 2.5 to 5.5 [1000] pounds. Because the two groups overlap only over a narrow band of weights, comparisons outside that band rely on extrapolation.
Interpret Coefficients P-value
Let’s now look at the p-values for all 4 coefficients. All the p-values are extremely small (<0.05). Hence, we reject the null hypothesis that states our coefficients are equal to zero, and we conclude that these coefficients are statistically significant and should be included in our linear model because they impact the dependent variable.
Uncertainty of Coefficients
To get the 95% confidence intervals for the coefficients:

confint (lm_3)

##                                2.5 %    97.5 %
## (Intercept)                25.229642 37.602469
## Transmission_TypeManual     6.143928 23.612917
## wt                         -5.395234 -2.176581
## Transmission_TypeManual:wt -8.257693 -2.339028

We are 95% confident that the true values of the 4 coefficients fall within the intervals stated above. None of the intervals contains zero.

Interpret R-squared
The adjusted R-squared value is 0.8151, which means about 82% of the variation in MPG is explained by the model, after adjusting for the number of predictors.
Residual Plots & Diagnostics
Using residual plots, let’s test our assumptions for linear regression.

par (mfrow = c(2,2))
plot (lm_3)

We get 4 plots, as shown above. The top-left ‘Residuals vs. Fitted’ plot tells us whether our linearity assumption is met and whether the variation of observations around the regression line is constant for all values of X. The red line in the plot is mostly flat and does not curve as strongly as in model 2. With regard to homoscedasticity, the variation looks roughly constant across the range of fitted values. Lastly, the ‘Normal Q-Q’ plot shows the residuals falling along the diagonal line for the most part.

ANOVA/F-statistic
Using the ANOVA method, we can compare different linear models to determine which model is superior and decide which variable(s) to keep or leave out. We will compare our 2nd and 3rd linear models.

anova (lm_2, lm_3)

## Analysis of Variance Table
## 
## Model 1: mpg ~ Transmission_Type + wt
## Model 2: mpg ~ Transmission_Type * wt
##   Res.Df    RSS Df Sum of Sq     F   Pr(>F)   
## 1     29 278.32                               
## 2     28 188.01  1    90.312 13.45 0.001017 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on our ANOVA output, the third model has an RSS of 188.01 compared to 278.32 for the second model. The null hypothesis states that the interaction term does not reduce the RSS by more than chance would, i.e. the models do not significantly differ. The alternative hypothesis states that the reduction in RSS is statistically significant. The p-value obtained is small (0.001) and less than 0.05, so we reject the null hypothesis and conclude that model 3 (which takes interaction into account) is superior to model 2.

Model 4 - MPG vs. Transmission Type, displacement, weight, & number of cylinders

Reasoning: In addition to transmission type and weight, MPG is also affected by displacement and the number of cylinders a vehicle has. Let’s create a more complicated model and compare it with model 3. For simplicity, I will only analyze the following for model 4: model coefficients and their p-values, R-squared value, and a comparison between models 3 and 4.

First, let’s create our linear model and look at its summary.

lm_4 <- lm (mpg ~ Transmission_Type + disp + wt + factor (cyl), mtcars_ds)
summary (lm_4)

## 
## Call:
## lm(formula = mpg ~ Transmission_Type + disp + wt + factor(cyl), 
##     data = mtcars_ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5029 -1.2829 -0.4825  1.4954  5.7889 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             33.816067   2.914272  11.604 8.79e-12 ***
## Transmission_TypeManual  0.141212   1.326751   0.106  0.91605    
## disp                     0.001632   0.013757   0.119  0.90647    
## wt                      -3.249176   1.249098  -2.601  0.01513 *  
## factor(cyl)6            -4.304782   1.492355  -2.885  0.00777 ** 
## factor(cyl)8            -6.318406   2.647658  -2.386  0.02458 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.652 on 26 degrees of freedom
## Multiple R-squared:  0.8376, Adjusted R-squared:  0.8064 
## F-statistic: 26.82 on 5 and 26 DF,  p-value: 1.73e-09

Based on the results above, the linear model is represented as: MPG = 33.816 + 0.141*AM + 0.002*disp - 3.249*wt - 4.305*cyl(6) - 6.318*cyl(8), where cyl(6) and cyl(8) are indicator variables for 6- and 8-cylinder engines (4 cylinders is the reference level). An automatic vehicle with 4 cylinders will have an intercept of 33.816; an automatic vehicle with 6 cylinders will have an intercept of 29.511 (33.816 - 4.305); a manual vehicle with 8 cylinders will have an intercept of 27.639 (33.816 + 0.141 - 6.318); and so on. In this model, the coefficients for transmission type and displacement are not statistically significant (p-values of 0.916 and 0.906). Instead, weight and the number of cylinders play a much bigger role.
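The group-specific intercepts quoted above are sums of the relevant coefficients. The sketch below computes three of them from the fitted model; the labels such as `auto_6cyl` are illustrative.

```r
data(mtcars)
mtcars_ds <- transform(
  mtcars,
  Transmission_Type = factor(am, levels = c(0, 1),
                             labels = c("Automatic", "Manual"))
)
fit4 <- lm(mpg ~ Transmission_Type + disp + wt + factor(cyl), data = mtcars_ds)
b <- coef(fit4)

# Baseline is automatic with 4 cylinders; add the offsets that apply.
intercepts <- c(
  auto_4cyl   = b[["(Intercept)"]],
  auto_6cyl   = b[["(Intercept)"]] + b[["factor(cyl)6"]],
  manual_8cyl = b[["(Intercept)"]] + b[["Transmission_TypeManual"]] +
                b[["factor(cyl)8"]]
)
print(round(intercepts, 3))
```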

Another interesting thing to note is the adjusted R-squared value of 0.8064, compared to 0.8151 for the much less complicated model 3. However, a high R-squared can be misleading, and one needs to study a linear model’s diagnostic/residual plots to determine how well the fitted line really predicts the data. Let’s look at the ‘Residuals vs Fitted’ plot.

par (mfrow = c(2,2))
plot (lm_4)

The top-left ‘Residuals vs. Fitted’ plot shows a constant variance for the entire range of X. Lastly, let’s compare models 3 and 4.

anova (lm_3, lm_4)

## Analysis of Variance Table
## 
## Model 1: mpg ~ Transmission_Type * wt
## Model 2: mpg ~ Transmission_Type + disp + wt + factor(cyl)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     28 188.01                           
## 2     26 182.87  2    5.1383 0.3653 0.6975

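The more complicated model 4 has a slightly smaller RSS (182.87 vs. 188.01 for model 3); however, the p-value of 0.6975 is much bigger than our set alpha of 0.05. Hence, we fail to reject the null hypothesis and conclude that model 4 is not superior to model 3. Note that models 3 and 4 are not nested (model 4 lacks the interaction term), so this F-test is only a rough guide; the lower adjusted R-squared of model 4 (0.8064 vs. 0.8151) points to the same conclusion.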
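Models 3 and 4 are not nested, since model 4 does not contain the interaction term, so a criterion that does not require nesting is a useful cross-check. The sketch below compares the two fits by AIC and adjusted R-squared; using AIC here is an assumption added for illustration and was not part of the original analysis.

```r
data(mtcars)
mtcars_ds <- transform(
  mtcars,
  Transmission_Type = factor(am, levels = c(0, 1),
                             labels = c("Automatic", "Manual"))
)
fit3 <- lm(mpg ~ Transmission_Type * wt, data = mtcars_ds)
fit4 <- lm(mpg ~ Transmission_Type + disp + wt + factor(cyl), data = mtcars_ds)

# Lower AIC is better; AIC penalizes each extra parameter.
aic_tab <- AIC(fit3, fit4)
print(aic_tab)

# Higher adjusted R-squared is better.
adj_r2 <- c(model_3 = summary(fit3)$adj.r.squared,
            model_4 = summary(fit4)$adj.r.squared)
print(round(adj_r2, 4))
```

Both criteria should favor the simpler model 3, in line with the conclusion above.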

Linear Regression Model Final Report

Stephen Lee

May 28, 2016
