8

This question involves the use of simple linear regression on the Auto data set.

(a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example: i. Is there a relationship between the predictor and the response? ii. How strong is the relationship between the predictor and the response? iii. Is the relationship between the predictor and the response positive or negative? iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.

(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

a

library(ISLR2)  # provides the Auto and Carseats data sets (the older ISLR package also works)

lm_fit <- lm(mpg ~ horsepower, data = Auto)

summary(lm_fit)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
  i. Yes. The p-value for the horsepower coefficient is essentially zero, so there is strong evidence of a relationship between horsepower and mpg.

  ii. Fairly strong: R-squared is 0.606, so horsepower explains about 60.6% of the variation in mpg.

  iii. Negative. The regression coefficient for horsepower is -0.158, so predicted mpg decreases as horsepower increases.

  iv. The predicted mpg for a horsepower of 98 is about 24.47, with a 95% confidence interval of (23.97, 24.96) and a 95% prediction interval of (14.81, 34.12), computed below. The prediction interval is wider because it also accounts for the variability of an individual car around the regression line.

predict(lm_fit, newdata = data.frame(horsepower = 98), interval = "confidence")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
predict(lm_fit, newdata = data.frame(horsepower = 98), interval = "prediction")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476

b

plot(Auto$horsepower, Auto$mpg, col = "blue", pch = 16, xlab = "Horsepower", ylab = "MPG", main = "MPG vs Horsepower")
abline(lm_fit, col = "red", lwd = 2)

c

par(mfrow = c(2, 2))
plot(lm_fit)
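
The residuals-vs-fitted panel shows a pronounced U-shape, indicating that the relationship between mpg and horsepower is non-linear; the scale-location plot suggests the residual variance is not constant, and a few observations have relatively high leverage. One common follow-up (a sketch, not required by the exercise) is to add a quadratic term and re-check the diagnostics:

# Illustrative only: check whether a quadratic term captures the curvature
# seen in the residuals-vs-fitted plot.
lm_fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(lm_fit_quad)$r.squared  # noticeably higher than the linear fit's 0.606
par(mfrow = c(2, 2))
plot(lm_fit_quad)               # the curvature in the residuals largely disappears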

10

This question should be answered using the Carseats data set.

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

(b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

(d) For which of the predictors can you reject the null hypothesis H0: βj = 0?

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

(f) How well do the models in (a) and (e) fit the data?

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

data("Carseats")
head(Carseats)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes

a

lm_full <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm_full)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

b

    • Intercept: predicted sales of about 13.04 thousand units for a non-urban, non-US store at a price of zero; not meaningful on its own, but it anchors the fitted line.

    • Price: holding Urban and US fixed, a one-dollar increase in price is associated with a decrease in sales of about 0.054 thousand units (roughly 54 car seats). This effect is highly significant.

    • UrbanYes: stores in urban locations sell about 0.022 thousand fewer units than non-urban stores, all else equal; this effect is not statistically significant (p = 0.936).

    • USYes: stores in the US sell about 1.20 thousand more units than non-US stores, all else equal; this effect is highly significant.

c

    • Sales = 13.043469 − 0.054459 × Price − 0.021916 × 1(Urban = Yes) + 1.200573 × 1(US = Yes)
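
The equation assumes R's default dummy coding, with "No" as the baseline level and the dummy equal to 1 for "Yes" (as the coefficient names UrbanYes and USYes indicate). This can be verified directly:

contrasts(Carseats$Urban)  # column "Yes": 1 for urban stores, 0 otherwise
contrasts(Carseats$US)     # column "Yes": 1 for US stores, 0 otherwise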

d

    • Price and US

e

lm_reduced <- lm(Sales ~ Price + US, data = Carseats)
summary(lm_reduced)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

f

cat("Full model R-squared:", summary(lm_fit)$r.squared, "\n")
## Full model R-squared: 0.6059483
cat("Reduced model R-squared:", summary(lm_reduced)$r.squared, "\n")
## Reduced model R-squared: 0.2392629

Both models fit the data about equally: each explains roughly 24% of the variance in Sales (R-squared of about 0.239). Dropping Urban costs essentially nothing, and the reduced model has a slightly higher adjusted R-squared (0.2354 vs 0.2335), so it is the preferable fit. Neither model explains most of the variation in Sales.
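
The two nested models can also be compared formally with an F-test; a quick sketch, not part of the original write-up:

# A large p-value indicates that Urban adds no useful information
# beyond Price and US.
anova(lm_reduced, lm_full)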

g

confint(lm_reduced, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

h

par(mfrow = c(2, 2))
plot(lm_reduced)

summary(influence.measures(lm_reduced))
## Potentially influential observations of
##   lm(formula = Sales ~ Price + US, data = Carseats) :
## 
##     dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
## 26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
## 29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
## 43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
## 50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
## 51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
## 58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
## 69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
## 126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
## 160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
## 166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
## 172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
## 175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
## 210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
## 270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
## 298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
## 314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
## 353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
## 357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
## 368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
## 377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
## 384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
## 387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
## 396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00

There is evidence of several high-leverage observations: the average leverage is (p + 1)/n = 3/400 = 0.0075, and a number of the points flagged above have hat values around 0.03-0.04, several times that average. The studentized residuals appear to stay within roughly ±3, so there is no strong evidence of outliers, although a few observations have fairly large residuals.
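
A direct numerical check, using the usual rules of thumb (|studentized residual| > 3 for outliers, leverage above about three times the average (p + 1)/n for high leverage); a sketch:

n <- nrow(Carseats)
avg_leverage <- (2 + 1) / n  # two predictors in the reduced model
which(hatvalues(lm_reduced) > 3 * avg_leverage)  # high-leverage observations
which(abs(rstudent(lm_reduced)) > 3)             # potential outliers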

14

This problem focuses on the collinearity problem.

  (a) Perform the following commands in R:

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?

  (b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.

  (c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are β̂0, β̂1, and β̂2? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0: β1 = 0? How about the null hypothesis H0: β2 = 0?

  (d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?

  (e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?

  (f) Do the results obtained in (c)-(e) contradict each other? Explain your answer.

  (g) Now suppose we obtain one additional observation, which was unfortunately mismeasured:

x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)

Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.

a

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

The model is Y = 2 + 2X1 + 0.3X2 + ε, with ε ~ N(0, 1). The true regression coefficients are β0 = 2, β1 = 2, and β2 = 0.3.

b

cor(x1, x2)
## [1] 0.8351212
plot(x1, x2, main="Scatterplot of x1 vs x2", xlab="x1", ylab="x2")

c

lm_fit <- lm(y ~ x1 + x2)
summary(lm_fit)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

β̂0 = 2.1305, β̂1 = 1.4396, and β̂2 = 1.0097. The intercept estimate is close to the true β0 = 2, but β̂1 understates the true β1 = 2 and β̂2 greatly overstates the true β2 = 0.3, a consequence of the collinearity between x1 and x2. We can reject H0: β1 = 0 at the 5% level (p = 0.0487), but we cannot reject H0: β2 = 0 (p = 0.375).

d

lm_x1 <- lm(y ~ x1)
summary(lm_x1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

We can reject H0: β1 = 0: on its own, x1 is a highly significant predictor of y (p = 2.66e-06), and its coefficient (1.976) is close to the true value of 2.

e

lm_x2 <- lm(y ~ x2)
summary(lm_x2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

Again, we can reject H0: β1 = 0: on its own, x2 is also a highly significant predictor of y (p = 1.37e-05), even though its coefficient was not significant in the joint model.

f

    • The results do not contradict each other. The apparent contradictions occur because collinearity between predictors x1 and x2 inflates the standard errors of their coefficients, making it difficult to assess their individual importance accurately. This causes us to fail to reject the null hypothesis for certain predictors when they are considered together, even though they may appear significant individually. Essentially, the presence of collinearity masks the true effect of each predictor.
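
The degree of collinearity can be quantified with the variance inflation factor; a minimal base-R sketch (with only two predictors, the VIF of x1 equals 1 / (1 - R^2) from regressing x1 on x2, about 3.3 here given cor(x1, x2) = 0.835):

r2_x1_on_x2 <- summary(lm(x1 ~ x2))$r.squared
1 / (1 - r2_x1_on_x2)  # VIF; equivalently car::vif(lm(y ~ x1 + x2)) if the car package is installed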

g

x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
lm_fit_new <- lm(y ~ x1 + x2)
summary(lm_fit_new)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
lm_x1_new <- lm(y ~ x1)
summary(lm_x1_new)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1            1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
lm_x2_new <- lm(y ~ x2)
summary(lm_x2_new)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2            3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06

With the mismeasured observation added, the roles reverse: in the combined model x1 is no longer significant while x2 now is, whereas in the single-predictor models both x1 and x2 remain significant.

par(mfrow=c(2,2))
plot(lm_fit_new)

plot(lm_x1_new)

plot(lm_x2_new)

The new observation (x1 = 0.1, x2 = 0.8, y = 6) is a high-leverage point in the x1 + x2 model and in the x2-only model, because x2 = 0.8 lies well outside the range of the other x2 values. It is not a high-leverage point in the x1-only model, where x1 = 0.1 is well within the range of x1; there it instead acts as an outlier, producing the large maximum residual (3.57) in the summary of lm_x1_new. In the other two models its residual is unremarkable (maximum residuals of 2.31 and 2.38), so it is not an obvious outlier there.
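
These claims can be checked numerically; a short sketch using the standard diagnostics (the added point is observation 101, and the average leverage is (p + 1)/n, i.e. 3/101 for the two-predictor model and 2/101 for the others):

# Leverage and studentized residual of the added observation in each refit model.
sapply(list(both = lm_fit_new, x1_only = lm_x1_new, x2_only = lm_x2_new),
       function(fit) c(leverage = unname(hatvalues(fit)[101]),
                       rstudent = unname(rstudent(fit)[101])))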