Question 8

Here we use simple linear regression on the Auto data set. We use the lm() function to fit a simple linear regression with mpg as the response and horsepower as the predictor, and the summary() function to print the results.

# first we load the dataset
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
df_auto <- Auto
head(df_auto)
#fit the simple linear regression model
model <- lm(mpg ~ horsepower, data=df_auto)

#print the summary of the model
summary(model)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = df_auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
  1. Is there a relationship between the predictor and the response?

    Yes, there is a relationship between the predictor horsepower and the response mpg. We can see this from the p-value of the horsepower coefficient in the regression summary, which is less than \(2 \times 10^{-16}\). Since this p-value is extremely small, the relationship is statistically significant.
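    As a quick numerical check, the same p-value can be extracted from the coefficient table (a minimal sketch using base R's coef() on the summary object):

    #p-value of the t-test for the horsepower coefficient
    coef(summary(model))["horsepower", "Pr(>|t|)"]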

  2. How strong is the relationship between the predictor and the response?

    To assess the strength of the relationship we can look at \(R^2\), which is 0.6059 (adjusted \(R^2\) = 0.6049). This means that roughly 60.6% of the variation in mpg is explained by the model using horsepower, a moderately strong relationship.
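    The same quantities can be pulled out programmatically (a minimal sketch; r.squared and adj.r.squared are standard components of the summary object):

    #R-squared and adjusted R-squared of the fit
    summary(model)$r.squared
    summary(model)$adj.r.squared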

  3. Is the relationship between the predictor and the response positive or negative?

    The relationship between the predictor and the response is negative. We can see this from the sign of the horsepower coefficient, which is -0.1578, meaning that higher horsepower is associated with lower mpg. This makes sense practically: cars with more powerful engines tend to be less fuel-efficient, so they achieve fewer miles per gallon.

  4. What is the predicted mpg associated with a horsepower of 98?

    Using the fitted regression equation below, we can predict mpg for a horsepower value of 98.

    \[\begin{align} mpg &= 39.94 - 0.1578 \times \text{horsepower} \\[4pt] &= 39.94 - (0.1578 \times 98) \\[4pt] &= 24.48 \text{(mpg)} \end{align}\]

    Hence the predicted value is approximately 24.48 mpg.

  5. What are the associated 95% confidence and prediction intervals?

#95% confidence intervals for the model coefficients
confint(model, level = 0.95)
##                 2.5 %     97.5 %
## (Intercept) 38.525212 41.3465103
## horsepower  -0.170517 -0.1451725
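The confint() output above gives 95% confidence intervals for the model coefficients. The question, however, asks for the confidence and prediction intervals associated with the predicted mpg at horsepower = 98; a minimal sketch of how to obtain these with base R's predict() (not run above) is:

#95% confidence interval for the average mpg of cars with horsepower = 98
predict(model, newdata = data.frame(horsepower = 98), interval = "confidence")

#95% prediction interval for an individual car with horsepower = 98
predict(model, newdata = data.frame(horsepower = 98), interval = "prediction")

Both intervals are centred at the fitted value computed above; the prediction interval is wider because it also accounts for the irreducible error of a single observation.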
  6. Plot the response and the predictor. Use the abline() function to display the least squares regression line.

    plot( 
      df_auto$horsepower,
      df_auto$mpg, 
      xlab="horsepower", 
      ylab="mpg",
      pch = 16,
      col="blue")
    
    #adding a regression line
    abline(model, col="red", lwd=2)

  7. Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

    par(mfrow=c(2,2))
    plot(model)
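    A brief comment on the fit (based on these diagnostics and the discussion of this example in ISLR): the residuals-versus-fitted panel shows a clear U-shaped pattern, which suggests that the relationship between mpg and horsepower is non-linear and that a straight-line fit is not fully adequate.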

Question 10

This question should be answered using the Carseats data set. (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

#view description of the dataset
?Carseats
## starting httpd help server ... done
head(Carseats)
#next we fit our model using the provided predictor variables.
model <- lm(Sales ~ Price + Urban + US, data=Carseats)
  (b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
#the model coefficients
summary(model)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
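Interpreting the coefficients (Sales is recorded in thousands of units, and Urban and US are qualitative predictors coded via the dummy variables UrbanYes and USYes): holding the other predictors fixed, a one-dollar increase in Price is associated with a decrease in sales of about 0.054 thousand units (roughly 54 car seats). Stores in urban locations are estimated to sell about 0.022 thousand fewer units than stores in rural locations, but this effect is not statistically significant (p = 0.936). Stores in the US are estimated to sell about 1.2 thousand more units than stores outside the US, holding the other predictors fixed.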
  (c) Write out the model in equation form, being careful to handle the qualitative variables properly.

\[\widehat{\text{Sales}} = 13.04 - 0.0545 \times \text{Price} - 0.0219 \times \text{UrbanYes} + 1.2006 \times \text{USYes}\]

where UrbanYes = 1 if the store is in an urban location and 0 otherwise, and USYes = 1 if the store is in the US and 0 otherwise.

  (d) For which of the predictors can you reject the null hypothesis H0 : \(\beta_j = 0\)?

We can reject the null hypothesis for Price and US, whose p-values (< 2e-16 and 4.86e-06, respectively) are extremely small. We cannot reject it for Urban, whose p-value is 0.936.

  (e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
#the new updated model without the urban variable.
model <- lm(Sales ~ Price + US, data=Carseats)

#get the model summary
summary(model)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
  (f) How well do the models in (a) and (e) fit the data?

Both models fit the data equally (and only moderately) well: each has a multiple \(R^2\) of about 0.2393, so they explain roughly 24% of the variation in Sales. Dropping Urban leaves the fit essentially unchanged while slightly improving the adjusted \(R^2\) (from 0.2335 to 0.2354), so the smaller model from (e) is preferable.

  (g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(model, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

The confidence intervals are as follows: Intercept CI: (11.79, 14.27); Price CI: (-0.0648, -0.0442); USYes CI: (0.6915, 1.7078). For both predictor variables the confidence interval excludes zero, which agrees with our initial analysis in which we rejected the null hypothesis for Price and US.

  (h) Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(2, 2))
plot(model)
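Beyond the diagnostic plots, the same question can be checked numerically (a sketch using base R's rstudent() and hatvalues(); the cut-offs below are common rules of thumb, not values reported above):

#observations whose studentized residual exceeds 3 in absolute value (possible outliers)
which(abs(rstudent(model)) > 3)

#observations with leverage more than twice the average (p + 1)/n (possible high-leverage points)
n <- nrow(Carseats)
p <- length(coef(model)) - 1
which(hatvalues(model) > 2 * (p + 1) / n)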

Question 14

This problem focuses on the collinearity problem. (a) Perform the following commands in R:

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100) 

The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
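The model has the form

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon = 2 + 2x_1 + 0.3x_2 + \epsilon,\]

so the true regression coefficients are \(\beta_0 = 2\), \(\beta_1 = 2\), and \(\beta_2 = 0.3\), with \(\epsilon \sim N(0, 1)\).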

  (b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
#scatter plot between x1 and x2
plot(
  x1, 
  x2,
  main = "Scatter plot of x1 versus x2",
  pch = 19,
  col = "blue"
  )
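The numeric correlation is not computed above; it can be obtained with cor(). By construction x2 is 0.5 * x1 plus a small amount of noise, so the two variables are strongly positively correlated:

#correlation between x1 and x2
cor(x1, x2)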

  (c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are \(\hat\beta_0, \;\hat\beta_1,\;\hat\beta_2\)? How do these relate to the true \(\beta_0, \; \beta_1, \; \text{and} \;\beta_2\)? Can you reject the null hypothesis H0 : \(\beta_1 = 0\)? How about the null hypothesis H0 : \(\beta_2 = 0\)?
#fitting the linear regression model
lm1 = lm(y ~ x1 + x2)
#getting the summary of the model
summary(lm1)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05
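From this summary, \(\hat\beta_0 = 2.13\), \(\hat\beta_1 = 1.44\), and \(\hat\beta_2 = 1.01\). The intercept is close to the true \(\beta_0 = 2\), but \(\hat\beta_1\) understates the true \(\beta_1 = 2\) and \(\hat\beta_2\) overstates the true \(\beta_2 = 0.3\), and both slope estimates have large standard errors. At the 5% level we can only just reject H0 : \(\beta_1 = 0\) (p = 0.0487), and we cannot reject H0 : \(\beta_2 = 0\) (p = 0.3754).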
  (d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0 : \(\beta_1 = 0\)?
#fitting the regression model now using only x1
lm2 = lm(y ~ x1)
#getting summary of the model
summary(lm2)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

From the above summary we reject the null hypothesis H0 : \(\beta_1 = 0\), because the p-value associated with x1 is extremely small (\(2.66 \times 10^{-6}\)). This implies that x1 on its own is statistically significant in predicting the response variable y.

  (e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0 : \(\beta_1 = 0\)?
lm3 = lm(y~x2)
summary(lm3)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

Yes, we can reject the null hypothesis H0 : \(\beta_1 = 0\) for this model. The p-value associated with x2 is extremely small (\(1.37 \times 10^{-5}\)), so x2 on its own is statistically significant in predicting y.

  (f) Do the results obtained in (c)–(e) contradict each other? Explain your answer.

No, they do not contradict each other. In lm1 (with both x1 and x2) we failed to reject the null hypothesis for x2, yet when x2 is the only predictor (lm3) it appears highly significant. This is not a contradiction: x1 and x2 are strongly correlated, so when both are in the model it is hard to separate their individual effects, the standard errors of their coefficients are inflated, and neither predictor looks as significant as it does on its own. In other words, we are simply seeing the effect of collinearity.

  (g) Now suppose we obtain one additional observation, which was unfortunately mismeasured. Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both?
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
#with both x1 and x2
lm4 = lm(y ~ x1 + x2)
summary(lm4)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
#with only x1
lm5 = lm(y ~ x1)
summary(lm5)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1            1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
#finally with just x2
lm6 = lm(y ~ x2)
summary(lm6)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2            3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06
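Comparing these fits with lm1, lm2, and lm3, the new observation reverses the conclusions of the two-predictor model: in lm4, x2 is now significant (p = 0.006) while x1 is not (p = 0.365), the opposite of what we saw before. In the single-predictor models the slopes remain highly significant. To judge whether observation 101 (the added point) is an outlier and/or a high-leverage point in each model, one could inspect its studentized residual and leverage; a sketch using base R helpers (the usual cut-offs are |studentized residual| > 3 for an outlier and leverage well above (p + 1)/n for high leverage):

#studentized residual and leverage of the added observation (index 101)
rstudent(lm4)[101]; hatvalues(lm4)[101]  #y ~ x1 + x2
rstudent(lm5)[101]; hatvalues(lm5)[101]  #y ~ x1
rstudent(lm6)[101]; hatvalues(lm6)[101]  #y ~ x2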