Linear Regression-Course Project

Looking at a data set of a collection of cars, we are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). We are particularly interested in the following two questions:

Is an automatic or manual transmission better for MPG?
Quantifying how different is the MPG between automatic and manual transmissions?

Loading the data:

data(mtcars)
names(mtcars)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

As we can see we have 11 variables. If we want “mpg” as an outcome we would have 10 predictors. We should decide which predictors are statistically significant and which are not. Lets first see the model if we have all 10 variables as regressors :

model <- lm(mpg ~ . , data = mtcars)
summary(model)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3.45  -1.60  -0.12   1.22   4.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  12.3034    18.7179    0.66    0.518  
## cyl          -0.1114     1.0450   -0.11    0.916  
## disp          0.0133     0.0179    0.75    0.463  
## hp           -0.0215     0.0218   -0.99    0.335  
## drat          0.7871     1.6354    0.48    0.635  
## wt           -3.7153     1.8944   -1.96    0.063 .
## qsec          0.8210     0.7308    1.12    0.274  
## vs            0.3178     2.1045    0.15    0.881  
## am            2.5202     2.0567    1.23    0.234  
## gear          0.6554     1.4933    0.44    0.665  
## carb         -0.1994     0.8288   -0.24    0.812  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.807 
## F-statistic: 13.9 on 10 and 21 DF,  p-value: 3.79e-07

As seen the R-squared is 0.869 which means that this model explains the 87% of the variance. Not too bad! For the second guess I would include only those regrossors with smalle P-values, lets say less than 0.5:

model1 <- lm(mpg ~ as.factor(am)+wt+disp+hp+qsec , data = mtcars)
summary(model1)

## 
## Call:
## lm(formula = mpg ~ as.factor(am) + wt + disp + hp + qsec, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3.54  -1.74  -0.32   1.17   4.55 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     14.3619     9.7408    1.47   0.1524   
## as.factor(am)1   3.4705     1.4858    2.34   0.0275 * 
## wt              -4.0843     1.1941   -3.42   0.0021 **
## disp             0.0112     0.0106    1.06   0.2990   
## hp              -0.0212     0.0145   -1.46   0.1564   
## qsec             1.0069     0.4754    2.12   0.0439 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.43 on 26 degrees of freedom
## Multiple R-squared:  0.864,  Adjusted R-squared:  0.838 
## F-statistic:   33 on 5 and 26 DF,  p-value: 1.84e-10

The R-squared has not changed much, but the P-values got smaller. It means that the second set of regressors does a better job predicting the outcome. Now lets see if our reasoning was correct to omit the regressors with larger P-value from the first model by adding to of them to the second model and compare them:

model2<- lm(mpg ~ as.factor(am)+wt+disp+hp+qsec+drat+as.factor(cyl) , data = mtcars)
summary(model2)

## 
## Call:
## lm(formula = mpg ~ as.factor(am) + wt + disp + hp + qsec + drat + 
##     as.factor(cyl), data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.736 -1.397 -0.351  1.431  4.227 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     17.47585   14.02777    1.25    0.225  
## as.factor(am)1   2.74860    1.77715    1.55    0.136  
## wt              -3.42677    1.32375   -2.59    0.016 *
## disp             0.00757    0.01324    0.57    0.573  
## hp              -0.02535    0.01570   -1.61    0.120  
## qsec             0.71810    0.59660    1.20    0.241  
## drat             0.63329    1.47962    0.43    0.673  
## as.factor(cyl)6 -1.69705    1.91330   -0.89    0.384  
## as.factor(cyl)8 -0.52256    3.46430   -0.15    0.881  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.48 on 23 degrees of freedom
## Multiple R-squared:  0.875,  Adjusted R-squared:  0.831 
## F-statistic: 20.1 on 8 and 23 DF,  p-value: 1.21e-08

Although the R-squared has increased in the newer model, the P-values for each regressor and the overall P-value have increased too. Now lets do the ANOVA Analysis for these two models to compare them :

anova(model1,model2)

## Analysis of Variance Table
## 
## Model 1: mpg ~ as.factor(am) + wt + disp + hp + qsec
## Model 2: mpg ~ as.factor(am) + wt + disp + hp + qsec + drat + as.factor(cyl)
##   Res.Df RSS Df Sum of Sq    F Pr(>F)
## 1     26 153                         
## 2     23 141  3      12.2 0.66   0.58

We would like to continue with the “model1” with much smaller P-values which indicate that the the regressors are statistically more significant. We will keep the three regressors with P-Values less than 0.1:

model3 <- lm(mpg ~ as.factor(am)+wt+qsec , data = mtcars)
summary(model3)

## 
## Call:
## lm(formula = mpg ~ as.factor(am) + wt + qsec, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.481 -1.556 -0.726  1.411  4.661 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.618      6.960    1.38  0.17792    
## as.factor(am)1    2.936      1.411    2.08  0.04672 *  
## wt               -3.917      0.711   -5.51    7e-06 ***
## qsec              1.226      0.289    4.25  0.00022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.46 on 28 degrees of freedom
## Multiple R-squared:  0.85,   Adjusted R-squared:  0.834 
## F-statistic: 52.7 on 3 and 28 DF,  p-value: 1.21e-11

Again as we can see eventhough the R-squared has slightly decreased, the P-values have significantly decreased too. Now lets do ANOVA Analysis:

anova(model2,model3)

## Analysis of Variance Table
## 
## Model 1: mpg ~ as.factor(am) + wt + disp + hp + qsec + drat + as.factor(cyl)
## Model 2: mpg ~ as.factor(am) + wt + qsec
##   Res.Df RSS Df Sum of Sq    F Pr(>F)
## 1     23 141                         
## 2     28 169 -5     -28.1 0.91   0.49

anova(model,model1,model2,model3)

## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ as.factor(am) + wt + disp + hp + qsec
## Model 3: mpg ~ as.factor(am) + wt + disp + hp + qsec + drat + as.factor(cyl)
## Model 4: mpg ~ as.factor(am) + wt + qsec
##   Res.Df RSS Df Sum of Sq    F Pr(>F)
## 1     21 148                         
## 2     26 153 -5     -5.94 0.17   0.97
## 3     23 141  3     12.23 0.58   0.63
## 4     28 169 -5    -28.08 0.80   0.56

We would stay with “model3” that has the smallest pr-value and the largest F statistic in ANOVA Analysis.

Summary

As we have seen in the above-mentioned analysis based on “backward selection” the p-values and R-squared values from each regression and the ANOVA analysis led us to choose the “model3” as the best model which inludes 3 predictors: “as.factor(am) & wt & qsec”. The question is whether the diagnostic plots also confirm our model. If we take a look at the plots in the Appendix and compare them for four different models, we would see that none of them is better than others:

Residuals vs Fitted Plot; almost in all 4 cases we observe the followings:

The residuals “bounce randomly” around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.
The residuals roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.
No one residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers.

The second plot is Normal Q-Q plot which shows whether the residuals are normally distributed. In all 4 cases, the plots reveal that the highest value in the data is higher than would be expected for the highest value in a sample of this size from a normal distribution. Nonetheless,none of the distributions deviate greatly from normality.
The third plot is the scale-location plot in which the raw residuals are standardized by their estimated standard deviations. The scale-location plot is used to detect if the spread of the residuals is constant over the range of fitted values. Again, we see a trend to the data such that high values exhibit more variance.
The fourth plot shows the standardized residuals leveraged against each datum (i.e. data point). The plot shows which data points are exerting the most influence. If the Cooks distance line encompasses a data point, it suggests that the analysis may be very sensitive to that point and it may be prudent to repeat the analysis with those data excluded. We only observe such a point in the “model” which has all the variables as predictor. Other models look good! We should not forget that interpreting these plots is subjective and should avoid over-interpreting these plots, especially want to be careful about putting too much weight on them based on small data sets. Therefore, I would stick to my “model3” regression model: \[ mpg = 2.936(am) -3.917(wt) +1.226(qseq) +9.618 \]

In order to answer the questions mentioned at the beginning of this report we do not necessarily need a regression model. I would suggest a boxplot which shows mpg data for two different categories “Automatic” and “Manual” cars. We can conclude that “Manual” cars have a better mpg in general. This conclusion is also supported by all of our regression models where we always had the positive coefficient for factor(am) which confirm a positive correlation between mpg and “Manual” with factor 1. Additionally, in order to quantify the difference between these two, we can use the means for two categories from the box-plot.
Also, please see the scatter plots in the appendix: the regression line is based on only one predictor “wt”, however we can see that “wt” is also correlated by “am” meaning that heavier car which have lower mpg are usually “automatic”.

Appendix of Figures

boxplot(list(Automatic=mtcars$mpg[mtcars$am==0],Manual=mtcars$mpg[mtcars$am==1]))

plot of chunk unnamed-chunk-8

par(mfrow=c(2,2))
plot(model)

plot of chunk unnamed-chunk-9

par(mfrow=c(2,2))
plot(model1)

plot of chunk unnamed-chunk-10

par(mfrow=c(2,2))
plot(model2)

plot of chunk unnamed-chunk-11

par(mfrow=c(2,2))
plot(model3)

plot of chunk unnamed-chunk-12

plot(mtcars$wt,mtcars$mpg)
points(mtcars$wt[mtcars$am==0], mtcars$mpg[mtcars$am==0], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(mtcars$wt[mtcars$am==1], mtcars$mpg[mtcars$am==1], pch = 21, col = "black", bg = "salmon", cex = 2)
abline(lm(mtcars$mpg~mtcars$wt),lwd=2)
legend("topright", pch=c(19,19), legend=c("Automatic","Manual"), col=c("lightblue","salmon"))

plot of chunk unnamed-chunk-13

plot(mtcars$wt,mtcars$mpg)
points(mtcars$wt[mtcars$cyl==4], mtcars$mpg[mtcars$cyl==4], pch = 21, col = "black", bg = "darkgreen", cex = 2)
points(mtcars$wt[mtcars$cyl==6], mtcars$mpg[mtcars$cyl==6], pch = 21, col = "black", bg = "darkblue", cex = 2)
points(mtcars$wt[mtcars$cyl==8], mtcars$mpg[mtcars$cyl==8], pch = 21, col = "black", bg = "darkred", cex = 2)
abline(lm(mtcars$mpg~mtcars$wt),lwd=2)
legend("topright", pch=c(19,19,19), legend=c("4 Cyl","6 Cyl","8 Cyl"), col=c("darkgreen","darkblue","darkred"))

plot of chunk unnamed-chunk-14