Here we are using simple linear regression on the Auto data set. We use the lm() function to fit a simple linear regression with mpg as the response and horsepower as the predictor, and the summary() function to print the results.
# first we load the dataset
library(ISLR2)
df_auto <- Auto
head(df_auto)
#fit the simple linear regression model
model <- lm(mpg ~ horsepower, data=df_auto)
#print the summary of the model
summary(model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = df_auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Is there a relationship between the predictor and the response?
Yes, there exists a relationship between the predictor horsepower and the response mpg. We know this from the p-value of the coefficient associated with horsepower in the regression summary, which is less than \(2 \times 10^{-16}\). Since this p-value is extremely small, there is a statistically significant relationship between the two.
How strong is the relationship between the predictor and the response?
To determine the strength of the relationship between the predictor and the response, we use \(R^2\), which has a value of 0.6059, meaning that approximately 60.6% of the variation in mpg is explained by the model using horsepower.
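These summary quantities can also be pulled directly from the fitted model object; a small sketch using base R accessors (the variable names below are just illustrative):
#extract R-squared, adjusted R-squared, and the residual standard error
r2     <- summary(model)$r.squared      #multiple R-squared (0.6059 here)
adj_r2 <- summary(model)$adj.r.squared  #adjusted R-squared (0.6049 here)
rse    <- sigma(model)                  #residual standard error (4.906 here)
c(r2 = r2, adj_r2 = adj_r2, rse = rse)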
Is the relationship between the predictor and the response positive or negative?
The relationship between the predictor and the response is negative. We know this from the sign of the coefficient for the horsepower variable. The coefficient is -0.1578, meaning that higher horsepower is associated with lower mpg. This makes sense practically because cars with more powerful engines are less fuel-efficient and therefore achieve fewer miles per gallon.
What is the predicted mpg associated with a horsepower of 98?
Using the linear regression equation from our model, shown below, we can predict mpg for a horsepower value of 98.
\[\begin{align} mpg &= 39.94 - 0.1578 \times \text{horsepower} \\[4pt] &= 39.94 - (0.1578 \times 98) \\[4pt] &= 24.48 \text{(mpg)} \end{align}\]
Hence the predicted value is approximately 24.48 mpg.
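The same number can be obtained directly from the fitted model with predict(); a quick check of the hand calculation:
#verify the hand calculation with predict()
predict(model, newdata = data.frame(horsepower = 98))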
What are the associated 95% confidence and prediction intervals?
#compute the confidence and prediction intervals
confint(model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 38.525212 41.3465103
## horsepower -0.170517 -0.1451725
For the intercept, we have a confidence interval of (38.53, 41.35), meaning that the mean mpg when horsepower is zero lies between these values.
For horsepower, the confidence interval is (-0.1705, -0.1452), meaning that for each additional unit of horsepower, mpg decreases by an amount within this range.
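Note that confint() above gives intervals for the coefficients. If the question is read as asking for the 95% confidence and prediction intervals of mpg at horsepower = 98, these can be obtained with predict(); a minimal sketch (new_obs is just a helper name):
new_obs <- data.frame(horsepower = 98)
#95% confidence interval for the mean mpg at horsepower = 98
predict(model, newdata = new_obs, interval = "confidence", level = 0.95)
#95% prediction interval for an individual car with horsepower = 98
predict(model, newdata = new_obs, interval = "prediction", level = 0.95)
The prediction interval is wider than the confidence interval because it also accounts for the irreducible error of an individual observation.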
Plot the response and the predictor. Use the abline() function to display the least squares regression line.
plot(
df_auto$horsepower,
df_auto$mpg,
xlab="horsepower",
ylab="mpg",
pch = 16,
col="blue")
#adding a regression line
abline(model, col="red", lwd=2)
The plot above displays mpg against horsepower together with the fitted least squares line. Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
par(mfrow=c(2,2))
plot(model)
From the Residuals vs. Fitted plot, we can see that the residuals follow a clear pattern, as seen from the curved red line, rather than being randomly scattered. This suggests a problem with our linear model: the curved pattern indicates that the relationship between horsepower and mpg is not perfectly linear. This is not a surprise, since \(R^2\) already showed that the model with horsepower alone explains only about 60% of the variability in the response variable.
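Although not required by the exercise, one common way to address this curvature is to add a quadratic horsepower term; a minimal sketch (model_quad is just an illustrative name):
#fit a model with a quadratic horsepower term to capture the curvature
model_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = df_auto)
summary(model_quad)$r.squared  #compare with the 0.6059 of the purely linear fit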
Using the Q-Q diagnostic plot, we can assess whether the residuals of our model are normally distributed, which is one of the key assumptions of the linear regression model. The central part of the plot aligns with the reference line, suggesting that the majority of the residuals follow a normal distribution. However, the tails deviate from the reference line, implying the possible presence of outliers.
Using the Scale-Location plot, we can check for homoscedasticity, i.e. whether the variance of the residuals remains constant across the range of fitted values. The red line, which represents the trend in the standardized residuals, is slightly curved but generally increasing. This suggests that the variance is not constant and instead increases for larger fitted values.
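As a complement to this visual check, a Breusch-Pagan test can formalize the assessment; a sketch assuming the lmtest package (not used elsewhere in this document) is installed:
#Breusch-Pagan test for non-constant error variance (requires the lmtest package)
library(lmtest)
bptest(model)  #a small p-value would support the impression of heteroscedasticity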
Finally, we look at the Residuals vs. Leverage plot, with its Cook's distance contours, which helps us identify influential data points that have a large impact on the model. From the plot, we see that some of the labelled observations combine high leverage with large residuals, meaning that they influence the regression fit substantially. If we were tuning the model, we could consider investigating these points, possibly removing them, and refitting for better performance.
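The flagged points can also be inspected numerically with cooks.distance(); a minimal sketch using the common 4/n rule of thumb:
#identify observations with unusually large Cook's distance
cd <- cooks.distance(model)
influential <- which(cd > 4 / nrow(df_auto))  #4/n is a common rule of thumb
head(sort(cd[influential], decreasing = TRUE))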
This question should be answered using the Carseats data set. (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
#view description of the dataset
?Carseats
head(Carseats)
#next we fit our model using the provided predictor variables.
model <- lm(Sales ~ Price + Urban + US, data=Carseats)
#the model coefficients
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
From the summary of the model above, the intercept \(\beta_0\) is 13.04, which represents the predicted Sales when all the predictor variables are at their reference levels, i.e. \(\text{Price} = 0\), Urban = No, and US = No. Realistically this does not make much sense, because it implies that the baseline average Sales for stores that are neither in the US nor in an urban area is 13.04 when the company charges nothing for the car seats at those sites. That would mean the company was giving the car seats away, perhaps as part of some special promotion campaign; in any case, this scenario lies outside the range of the data the model was fit on.
The coefficient associated with Price is -0.0545, implying that for every one-unit increase in Price, Sales of car seats decrease by this amount, holding the other two predictor variables constant. This agrees with the real-world expectation that sales fall when the price of a commodity increases.
For the Urban variable, the coefficient is -0.0219. Since this is a categorical variable, it implies that sales at stores in urban areas tend to be lower by 0.0219 units than at stores not in urban areas, holding the other predictor variables constant.
Finally, for US, the coefficient is 1.2006, implying that, on average, stores located in the US sell 1.2006 more units than stores outside the US, holding the other predictor variables constant.
The fitted model in equation form is:
\[ \hat{Sales} = 13.04 - 0.0545 \times \text{Price} - 0.0219 \times \text{UrbanYes} + 1.2006 \times \text{USYes}\]
(d) For which of the predictors can you reject the null hypothesis H0 : \(\beta_j = 0\)?
The p-values for Price and US are extremely small, implying that these predictors are statistically significant in predicting the Sales response variable; for these two we reject the null hypothesis that their \(\beta_j = 0\). However, for the Urban variable the p-value is very large, implying that this predictor is not statistically significant in predicting the response. Thus, for Urban, we fail to reject the null hypothesis that \(\beta_\text{Urban} = 0\). We therefore refit the model without the Urban variable as one of the predictors.
#the new updated model without the urban variable
model <- lm(Sales ~ Price + US, data=Carseats)
#get the model summary
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
To evaluate how well the models from (a) and (e) fit, we can start by looking at the \(R^2\) metric. Both models have the same \(R^2\) of 0.2393, meaning that each explains only about 24% of the variance in the Sales response variable. This is expected, since in (e) we removed only a predictor variable that was not statistically significant.
When it comes to the residual standard error, the two models again have approximately the same value of about 2.47, so they perform similarly on this measure as well.
Lastly, when we compare the models on their F-statistics, the model in (e) has a noticeably larger value, indicating stronger overall model significance relative to its number of predictors; a formal comparison of the two nested fits is sketched below.
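A formal way to compare the two nested models is a partial F-test with anova(); a sketch, assuming both fits are kept under separate names (model_full and model_reduced are names chosen here for illustration, since the chunks above reuse model):
#compare the full model from (a) with the reduced model from (e)
model_full    <- lm(Sales ~ Price + Urban + US, data = Carseats)
model_reduced <- lm(Sales ~ Price + US, data = Carseats)
anova(model_reduced, model_full)  #a large p-value means dropping Urban costs nothing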
confint(model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
The confidence intervals are as follows: Intercept: (11.79, 14.27); Price: (-0.0648, -0.0442); USYes: (0.6915, 1.7078). For the two predictor variables, the respective confidence intervals exclude zero, which agrees with our initial analysis of rejecting the null hypothesis.
par(mfrow = c(2, 2))
plot(model)
From the Residuals vs. Fitted plot, we can identify some possible outliers, such as the 51st, 69th and 377th observations. These three have residuals that deviate substantially from the rest of the points.
From the Residuals vs. Leverage plot (Cook's distance), we can identify the 26th, 50th and 368th data points as high-leverage points; these points may unduly influence our model.
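These visual impressions can be checked numerically with studentized residuals and leverage values; a minimal sketch:
#studentized residuals beyond +/- 3 are commonly treated as outliers
which(abs(rstudent(model)) > 3)
#leverage well above the average (p + 1)/n flags high-leverage points
lev <- hatvalues(model)
head(sort(lev[lev > 2 * mean(lev)], decreasing = TRUE))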
This question focuses on the problem of collinearity. (a) Perform the following commands in R:
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The last line creates a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
The form of the linear model is shown below:
\[y = 2 + 2x_1 + 0.3x_2 + \epsilon\]
with \(\beta_0 = 2\), \(\beta_1 = 2\), and \(\beta_2 = 0.3\) representing the intercept, the x1 coefficient, and the x2 coefficient respectively, and \(\epsilon \sim N(0, 1)\) the random error term.
What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
#scatter plot between x1 and x2
plot(
x1,
x2,
main = "Scatter plot of x1 versus x2",
pch = 19,
col = "blue"
)
The scatter plot above shows a strong positive correlation between the two variables: x2 tends to increase as x1 increases. To confirm the actual correlation between these two variables, we can run the code below:
cor(x1, x2)
## [1] 0.8351212
The correlation between the two variables is approximately 0.84, confirming the strong positive correlation we identified earlier from the plot. This is expected, because we know x2 was generated from x1 plus a small amount of noise, hence the two are nearly linearly dependent.
Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are \(\hat\beta_0, \;\hat\beta_1,\;\hat\beta_2\)? How do these relate to the true \(\beta_0, \; \beta_1, \; \text{and} \;\beta_2\)? Can you reject the null hypothesis H0 : \(\beta_1 = 0\)? How about the null hypothesis H0 : \(\beta_2 = 0\)?
#fitting the linear regression model
lm1 = lm(y ~ x1 + x2)
#getting the summary of the model
summary(lm1)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
From the summary above, we observe the estimated coefficients \(\hat{\beta_0} = 2.1305\), \(\hat{\beta_1} = 1.4396\), and \(\hat{\beta_2} = 1.0097\). Ideally, the estimated coefficients should be close to the true coefficients of the predictor variables, but here the estimates for the two predictors differ substantially from the true values. This is largely due to the multicollinearity between x1 and x2, which inflates the standard errors and makes the individual coefficient estimates unstable.
For x1, we reject the null hypothesis H0: \(\beta_1 = 0\), since the p-value (0.0487) is below the 0.05 threshold. This means that x1 is statistically significant in predicting y.
For x2, we fail to reject the null hypothesis H0: \(\beta_2 = 0\), since the p-value (0.3754) is above the 0.05 threshold. This means that x2 is NOT statistically significant in predicting y in this model.
Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0 : \(\beta_1 = 0\)?
#fitting the regression model now using only x1
lm2 = lm(y ~ x1)
#getting summary of the model
summary(lm2)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
From the summary above, we reject the null hypothesis H0: \(\beta_1 = 0\), because the p-value associated with x1 is extremely small (\(2.66 \times 10^{-6}\)). This implies that x1 is statistically significant in predicting the response variable y.
Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0 : \(\beta_1 = 0\)?
lm3 = lm(y ~ x2)
summary(lm3)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
Yes, we reject the null hypothesis H0: \(\beta_1 = 0\). The p-value associated with x2 is extremely small (\(1.37 \times 10^{-5}\)), indicating that x2 is statistically significant in predicting y when used on its own.
No, the results do not contradict each other. In lm1 (where we used both x1 and x2) we did not find enough evidence to reject the null hypothesis for x2. However, when we fit only x2 in lm3, the p-value associated with x2 shows that x2 is statistically significant, contrary to our earlier conclusion. This is not a contradiction: because of the collinearity between x1 and x2, the two predictors carry largely the same information, so when both are included it is hard to attribute the effect to either one individually and the standard errors of their coefficients are inflated.
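One way to quantify this collinearity is the variance inflation factor (VIF); a minimal base-R sketch (avoiding extra packages):
#VIF for x1 in the model y ~ x1 + x2, computed from the R-squared of x1 on x2
r2_x1 <- summary(lm(x1 ~ x2))$r.squared
1 / (1 - r2_x1)  #values well above 5 (or 10) are usually considered problematic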
We now add one additional observation and refit the three models:
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
#with both x1 and x2
lm4 = lm(y ~ x1 + x2)
summary(lm4)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
#with only x1
lm5 = lm(y ~ x1)
summary(lm5)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
#finally with just x2
lm6 = lm(y ~ x2)
summary(lm6)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
From the results of the three models above, we notice changes in the coefficient estimates. The estimates have shifted in all of the models, with the largest changes for x1 and x2 in the multiple regression: the coefficient for x1 dropped from 1.4396 to 0.5394 and is no longer significant, while the coefficient for x2 rose from 1.0097 to 2.5146 and became significant. This suggests that the new observation substantially affects the fitted relationship.
We can make a scatter plot of x1 vs. x2 to check whether the new point is an outlier:
plot(x1, x2, main = "Scatter Plot: x1 vs. x2", col="blue", pch=19)
From the scatter plot above, we can clearly see that the new observation is an outlier in the scatter plot of x1 vs. x2.
Finally, we check whether the observation is a high-leverage point:
#for the multiple regression model
which.max(hatvalues(lm4))
## 101
## 101
x2[101]
## [1] 0.8
x1[101]
## [1] 0.1
#here we see the new observation is a high leverage point
#for the regression model using only x1
which.max(hatvalues(lm5))
## 27
## 27
x2[27]
## [1] -0.03763402
x1[27]
## [1] 0.01339033
#here it doesn't seem like the new observation is a high leverage point.
#finally we check for the regression model using x2
which.max(hatvalues(lm6))
## 101
## 101
x2[101]
## [1] 0.8
x1[101]
## [1] 0.1
#again, here we see that the new observation is a high leverage point.
From the results above, the new observation is a high-leverage point in the two models that include x2 (lm4 and lm6), but not in the model that uses only x1 (lm5).
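As a final sanity check, the leverage of the added observation can be compared with the average leverage \((p + 1)/n\) in each model; a minimal sketch:
#compare the leverage of the added observation (index 101) with the
#average leverage (p + 1)/n in each of the three refitted models
sapply(list(lm4 = lm4, lm5 = lm5, lm6 = lm6), function(m) {
  h <- hatvalues(m)
  c(h_101 = unname(h[101]), average = mean(h))
})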