Homework 8 - Multiple Linear Regression

1. The datasets package in R contains a small data set called mtcars with n = 32 observations of the characteristics of different automobiles. Create a new data frame from part of this data set using this command: myCars <- data.frame(mtcars[,1:6]).

myCars <- data.frame(mtcars[,1:6])
summary(myCars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt       
##  Min.   :2.760   Min.   :1.513  
##  1st Qu.:3.080   1st Qu.:2.581  
##  Median :3.695   Median :3.325  
##  Mean   :3.597   Mean   :3.217  
##  3rd Qu.:3.920   3rd Qu.:3.610  
##  Max.   :4.930   Max.   :5.424

2. Create and interpret a bivariate correlation matrix using cor(myCars), keeping in mind that you will be trying to predict the mpg variable. Which other variable might be the single best predictor of mpg?

cor(myCars)
##             mpg        cyl       disp         hp       drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.6999381  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.7102139  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.4487591  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.0000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.7124406  1.0000000

Analysis

Based on the correlation matrix, the weight variable (wt) has the strongest correlation with mpg, an inverse correlation of -.87. This indicates that as the weight of a car increases, its miles per gallon decrease, making wt the single best candidate predictor of mpg.
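As a quick check on this reading, the mpg column of the correlation matrix can be pulled out and sorted (an illustrative snippet, not part of the assignment output):

# Extract each variable's correlation with mpg and sort from most negative to most positive
sort(cor(myCars)[, "mpg"])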

3. Run a multiple regression analysis on the myCars data with lm(), using mpg as the dependent variable and wt (weight) and hp (horsepower) as the predictors. Make sure to say whether or not the overall R-squared was significant. If it was significant, report the value and say in your own words whether it seems like a strong result or not. Review the significance tests on the coefficients (B-weights). For each one that was significant, report its value and say in your own words whether it seems like a strong result or not.

lmOut <- lm(mpg~wt+hp, data = myCars)
summary(lmOut)
## 
## Call:
## lm(formula = mpg ~ wt + hp, data = myCars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
plot(lmOut)

Analysis

R-squared measures how well a linear model fits a data set and ranges from 0 to 1. For the model above, the overall result is significant (F(2, 29) = 69.21, p = 9.1e-12), with a multiple R-squared of .83 (adjusted .81), which indicates that the model is a strong predictor of mpg. The coefficients, or B-weights, for wt and hp are -3.88 and -.032, respectively. The p-values indicate that both are significant predictors (p = 1.12e-06 and p = .00145), with weight being the stronger of the two.
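The p-values can be supplemented with 95% confidence intervals for the coefficients; intervals that exclude zero correspond to the significant B-weights above. A minimal sketch using base R's confint():

# 95% confidence intervals for the intercept, wt, and hp coefficients
confint(lmOut, level = 0.95)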

4. Using the results of the analysis from Exercise 3, construct a prediction equation for mpg using all three of the coefficients from the analysis (the intercept along with the two B-weights). Pretend that an automobile designer has asked you to predict the mpg for a car with 110 horsepower and a weight of 3 tons. Show your calculation and the resulting value of mpg.

coeffs <- coefficients(lmOut)
coeffs
## (Intercept)          wt          hp 
## 37.22727012 -3.87783074 -0.03177295
weight = 3
hp = 110

newmpg <- coeffs[1] + (coeffs[2] * weight) + (coeffs[3]  * hp)
newmpg
## (Intercept) 
##    22.09875
newData <- data.frame(wt = 3, hp = 110)
predict(lmOut,newData)
##        1 
## 22.09875

Analysis

Based on the linear model, a car with 110 horsepower and a weight of 3 has a projected fuel economy of about 22.1 mpg. (Note that wt in mtcars is recorded in units of 1,000 lbs, so wt = 3 corresponds to a 3,000 lb car.)
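The point prediction can also be wrapped in a prediction interval, which conveys the uncertainty around a single new car; an illustrative sketch reusing the same newData:

# 95% prediction interval for a new car with wt = 3 and hp = 110
predict(lmOut, newData, interval = "prediction", level = 0.95)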

5. Run a multiple regression analysis on the myCars data with lmBF(), using mpg as the dependent variable and wt (weight) and hp (horsepower) as the predictors. Interpret the resulting Bayes factor in terms of the odds in favor of the alternative hypothesis. If you did Exercise 3, do these results strengthen or weaken your conclusions?

library(BayesFactor)
## Loading required package: coda
## Loading required package: Matrix
## ************
## Welcome to BayesFactor 0.9.12-4.2. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
## 
## Type BFManual() to open the manual.
## ************
lmMCMCout <- lmBF(mpg~wt+hp, data = myCars)
summary(lmMCMCout)
## Bayes factor analysis
## --------------
## [1] wt + hp : 788547604 ±0%
## 
## Against denominator:
##   Intercept only 
## ---
## Bayes factor type: BFlinearModel, JZS

Analysis

The Bayes factor for this linear model reinforces the findings of the previous analysis. A Bayes factor of roughly 788,547,604 (±0%) represents the odds in favor of the alternative hypothesis: a model containing weight and horsepower is overwhelmingly more likely to make accurate predictions of mpg than a model with the intercept alone.
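If the raw number is needed for reporting, the BayesFactor package's extractBF() accessor returns it in a data frame; a minimal sketch:

# Pull the numeric Bayes factor and its proportional error out of the result object
extractBF(lmMCMCout)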

6. Run lmBF() with the same model as for Exercise 5, but with the options posterior=TRUE and iterations=10000. Interpret the resulting information about the coefficients.

library(BayesFactor)
lmMCMCout <- lmBF(mpg~wt+hp, data = myCars, posterior = TRUE, iterations = 10000)
summary(lmMCMCout)
## 
## Iterations = 1:10000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##          Mean        SD  Naive SE Time-series SE
## mu   20.09832  0.488325 4.883e-03      4.883e-03
## wt   -3.78619  0.662894 6.629e-03      6.629e-03
## hp   -0.03112  0.009517 9.517e-05      9.517e-05
## sig2  7.49438  2.179193 2.179e-02      2.582e-02
## g     4.51209 44.765138 4.477e-01      4.477e-01
## 
## 2. Quantiles for each variable:
## 
##          2.5%     25%      50%     75%    97.5%
## mu   19.14658 19.7718 20.09232 20.4262 21.06711
## wt   -5.08626 -4.2221 -3.79052 -3.3488 -2.48221
## hp   -0.04957 -0.0374 -0.03128 -0.0249 -0.01236
## sig2  4.38050  5.9867  7.14558  8.5850 12.68266
## g     0.36516  0.9391  1.71800  3.4465 20.57092

Analysis

We can evaluate the coefficients by examining their 95% credible intervals, bounded by the 2.5% and 97.5% quantiles. Because neither the wt interval (-5.09 to -2.48) nor the hp interval (-.050 to -.012) overlaps 0, both are credible predictors of mpg. It should also be noted that the posterior means and medians (50% quantiles) are approximately the same as the coefficient estimates from the frequentist analysis.
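Those credible intervals can be recomputed directly from the posterior draws; a minimal sketch, relying on the returned chain behaving like a coda mcmc matrix indexed by coefficient name:

# 95% credible intervals for wt and hp from the 10,000 MCMC samples
quantile(lmMCMCout[, "wt"], probs = c(0.025, 0.975))
quantile(lmMCMCout[, "hp"], probs = c(0.025, 0.975))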

7. Run install.packages() and library() for the “car” package. The car package is a “companion to applied regression” rather than more data about automobiles. Read the help file for the vif() procedure and then look up more information online about how to interpret the results. Then write down in your own words a “rule of thumb” for interpreting VIF.

#install.packages("car")
library(car)
## Loading required package: carData

Analysis

Multicollinearity is high intercorrelation among the independent variables in a model, which can lead to skewed or unreliable coefficient estimates. The variance inflation factor (VIF) measures the amount of multicollinearity among the predictors in a model. A common rule of thumb is that a VIF above 5 suggests problematic multicollinearity and a VIF above 10 indicates a serious problem; screening predictors this way avoids inflated statistical significance among redundant, related predictors.
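Concretely, the VIF for a predictor is 1/(1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. A sketch of the calculation by hand for wt in the two-predictor model:

# VIF for wt by hand: regress wt on the remaining predictor (hp), then invert 1 - R^2
rsq_wt <- summary(lm(wt ~ hp, data = myCars))$r.squared
1 / (1 - rsq_wt)  # should match the vif(lmOut) value of ~1.77 reported below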

8. Run vif() on the results of the model from Exercise 3. Interpret the results. Then run a model that predicts mpg from all five of the predictors in myCars. Run vif() on those results and interpret what you find.

vif(lmOut)
##       wt       hp 
## 1.766625 1.766625
vif(lm(mpg~cyl+disp+hp+drat+wt, data = myCars))
##       cyl      disp        hp      drat        wt 
##  7.869010 10.463957  3.990380  2.662298  5.168795

Analysis

To build a model with a low level of multicollinearity, each VIF should be no more than about 5 (or 10, under a more lenient threshold). In the first model, the VIF for both wt and hp is 1.77, indicating that those predictors do not exhibit high multicollinearity. Conversely, the model using all five predictors has several problematic values, notably disp (10.46) and cyl (7.87). It would be beneficial to remove or consolidate these predictors to improve the model.
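One illustrative follow-up (not part of the assignment output) is to drop the worst offender, disp, refit, and check whether the remaining VIFs fall to acceptable levels:

# Refit without disp (the highest VIF) and re-check multicollinearity
vif(lm(mpg ~ cyl + hp + drat + wt, data = myCars))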