# Load necessary libraries
library(ISLR)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
a. Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example:
data(Auto)
# Simple linear regression
lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Is there a relationship between the predictor and the response?
Yes. The p-value for horsepower is < 2e-16, far below 0.05, which indicates a statistically significant relationship between horsepower and mpg.
How strong is the relationship?
The R-squared value is 0.606, indicating that 60.6% of the variation in mpg is accounted for by horsepower. This represents a moderately strong relationship.
Is the relationship positive or negative?
The coefficient for horsepower is -0.1578, indicating that mpg decreases as horsepower increases, confirming a negative relationship.
What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
predict(lm_fit, data.frame(horsepower = 98), interval = "confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(lm_fit, data.frame(horsepower = 98), interval = "prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
Explanation:
The predict() function calculates the predicted value of mpg for a given horsepower (98 here).
Confidence intervals give a range for the mean response at horsepower = 98, while prediction intervals give a range for an individual observation; the prediction interval is wider because it also accounts for the irreducible error of a single car.
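As an optional illustration (not part of the original question), both interval types can be drawn over the data; hp_grid below is a helper object introduced only for this sketch.
# Sketch: visualize confidence vs. prediction bands for the Auto fit
hp_grid <- data.frame(horsepower = seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 100))
ci <- predict(lm_fit, hp_grid, interval = "confidence")
pi <- predict(lm_fit, hp_grid, interval = "prediction")
plot(Auto$horsepower, Auto$mpg, pch = 19, xlab = "Horsepower", ylab = "MPG")
lines(hp_grid$horsepower, ci[, "fit"], col = "red", lwd = 2)        # fitted line
lines(hp_grid$horsepower, ci[, "lwr"], col = "blue", lty = 2)       # 95% confidence band
lines(hp_grid$horsepower, ci[, "upr"], col = "blue", lty = 2)
lines(hp_grid$horsepower, pi[, "lwr"], col = "darkgreen", lty = 3)  # 95% prediction band
lines(hp_grid$horsepower, pi[, "upr"], col = "darkgreen", lty = 3)
The prediction band is visibly wider than the confidence band at every horsepower value.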
Plot the response and the predictor. Use the abline() function to display the least squares regression line.
# Scatterplot with regression line
plot(Auto$horsepower, Auto$mpg, main = "MPG vs Horsepower", xlab = "Horsepower", ylab = "MPG", pch = 19)
abline(lm_fit, col = "red", lwd = 2)
Explanation:
This scatterplot visualizes the relationship between horsepower and mpg.
The red line represents the fitted regression line.
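Since the tidyverse is already loaded, an equivalent ggplot2 version is sketched below (purely optional; geom_smooth with method = "lm" draws the same least squares line).
# ggplot2 alternative to the base-R scatterplot
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "red") +
  labs(title = "MPG vs Horsepower", x = "Horsepower", y = "MPG")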
Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
# Diagnostic plots
par(mfrow = c(2, 2))
plot(lm_fit)
Explanation:
The diagnostic plots help assess assumptions of linear regression:
Residuals vs. Fitted: Check for non-linearity.
Normal Q-Q: Check if residuals are normally distributed.
Scale-Location: Check homoscedasticity.
Residuals vs. Leverage: Identify outliers or high-leverage points.
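As a rough numeric complement to these plots (an optional sketch using standard rules of thumb, not part of the original answer):
# Count observations flagged by common rules of thumb
rstud <- rstudent(lm_fit)            # studentized residuals
lev   <- hatvalues(lm_fit)           # leverage values
sum(abs(rstud) > 2)                  # potential outliers (|residual| > 2)
sum(lev > 2 * mean(lev))             # potential high-leverage points (> 2 * average leverage)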
This question uses the Carseats data set. (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
# Multiple regression
lm_fit_carseats <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm_fit_carseats)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Explanation:
Price and US are highly significant predictors of Sales, whereas Urban is not; the detailed interpretation of each coefficient follows in part (b).
(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
Intercept = 13.0435
The expected Sales when Price is 0 for a store that is neither urban nor in the US (both dummy variables equal to 0); it serves mainly as a baseline.
Price = -0.0545
The negative coefficient indicates that, holding the other predictors fixed, each $1 increase in price is associated with a decrease in Sales of about 0.0545 units. This effect is highly significant (p < 2e-16).
UrbanYes = -0.0219
The coefficient is very small and statistically insignificant (p = 0.936), suggesting that whether a store is in an urban area has no meaningful effect on Sales.
USYes = 1.2006
Stores located in the US sell, on average, about 1.2 more units than stores outside the US, holding Price and Urban fixed. This effect is highly significant (p ≈ 4.9e-06).
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
Sales = 13.0435 − 0.0545 × Price − 0.0219 × Urban + 1.2006 × US,
where Urban = 1 if the store is in an urban location (0 otherwise) and US = 1 if the store is in the US (0 otherwise).
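To see how R builds these dummy variables, the design matrix can be inspected directly (an optional sketch; UrbanYes and USYes are the 0/1 columns referred to above).
# Inspect the 0/1 coding of the qualitative predictors
contrasts(Carseats$Urban)
contrasts(Carseats$US)
head(model.matrix(Sales ~ Price + Urban + US, data = Carseats))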
(d) For which of the predictors can you reject the null hypothesis H0: βj = 0?
# p-values for the Carseats model
summary(lm_fit_carseats)$coefficients
The relevant p-values are those shown in the coefficient table of the summary above.
Reject the Null Hypothesis:
Price (p < 0.001): The effect of price on sales is statistically significant.
US (p < 0.001): The difference in sales between US and non-US stores is statistically significant.
Fail to Reject the Null Hypothesis:
Urban (p = 0.936): There is no evidence that urban location is associated with sales.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
# Smaller model with significant predictors
lm_fit_reduced <- lm(Sales ~ Price + US, data = Carseats)
summary(lm_fit_reduced)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Explanation:
Both Price and US remain highly significant in the reduced model, and their coefficient estimates are nearly unchanged from the full model.
(f) How well do the models in (a) and (e) fit the data?
# Compare models using adjusted R-squared or AIC
summary(lm_fit_carseats)$adj.r.squared
## [1] 0.2335123
summary(lm_fit_reduced)$adj.r.squared
## [1] 0.2354305
Explanation:
Full model adjusted R² = 0.2335; reduced model (without Urban) adjusted R² = 0.2354.
Both models explain only about 24% of the variation in Sales, so the fit is modest. Since dropping Urban does not reduce explanatory power (the adjusted R² actually increases slightly), the reduced model is preferred.
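The comparison can also be made formally; a minimal sketch using a partial F-test and AIC (both available in base R) is:
# Does dropping Urban significantly worsen the fit?
anova(lm_fit_reduced, lm_fit_carseats)   # partial F-test; a large p-value favours the smaller model
AIC(lm_fit_carseats, lm_fit_reduced)     # the lower AIC identifies the preferred model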
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
# Confidence intervals
confint(lm_fit_reduced, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
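For reference, the same intervals can be reproduced by hand as estimate ± t × standard error (a sketch of what confint() does internally):
# 95% confidence intervals computed manually
cf    <- coef(summary(lm_fit_reduced))
tcrit <- qt(0.975, df = df.residual(lm_fit_reduced))
cbind(lower = cf[, "Estimate"] - tcrit * cf[, "Std. Error"],
      upper = cf[, "Estimate"] + tcrit * cf[, "Std. Error"])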
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
# Diagnostic plots for reduced model
par(mfrow = c(2, 2))
plot(lm_fit_reduced)
Insights from the Plots:
Outliers (Standardized Residuals beyond ±2 or ±3):
Several data points fall outside the ±2 range on the y-axis, suggesting the presence of potential outliers.
A few extreme values, particularly below -2.5 and above +2.5, warrant further investigation to determine whether they are valid observations or possible data entry errors.
High Leverage Points (far-right points on the x-axis):
A few observations have leverage noticeably above the average value (p+1)/n = 3/400 ≈ 0.0075, but none of them combine high leverage with a large residual or an extreme Cook's distance.
Conclusion:
There is clear evidence of some outliers, with standardized residuals exceeding ±2.
However, there is no indication of high-leverage points that materially affect the regression results.
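These impressions can be checked numerically; a short sketch using studentized residuals, leverage, and Cook's distance:
# Numeric check of outliers and leverage for the reduced model
rstud <- rstudent(lm_fit_reduced)
lev   <- hatvalues(lm_fit_reduced)
sum(abs(rstud) > 2)                       # observations flagged as potential outliers
sum(lev > 2 * (2 + 1) / nrow(Carseats))   # leverage above the 2(p+1)/n guideline
max(cooks.distance(lm_fit_reduced))       # values near 1 would indicate a dominating point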
14. This problem focuses on the collinearity problem.
(a) Perform the following commands in R:
> set.seed(1)
> x1 <- runif(100)
> x2 <- 0.5 * x1 + rnorm(100) / 10
> y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
# Linear model equation:
# y = β0 + β1 * x1 + β2 * x2 + ε
Explanation:
The model has the form y = β0 + β1·x1 + β2·x2 + ε. From the data-generating code, the true regression coefficients are β0 = 2, β1 = 2, and β2 = 0.3.
(b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2, main = "Scatterplot of x1 vs x2", xlab = "x1", ylab = "x2")
Explanation:
x1 and x2 are strongly positively correlated (r ≈ 0.835), which appears as a clear linear trend in the scatterplot; this collinearity matters for the regressions that follow.
(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are β̂0, β̂1, and β̂2? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0: β1 = 0? How about the null hypothesis H0: β2 = 0?
lm_fit_full <- lm(y ~ x1 + x2)
summary(lm_fit_full)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
Relationship to True Coefficients (β0, β1, β2):
The true value of β0 is 2, and the estimate β̂0 = 2.1305 is close to it.
For β1, the true value is 2, but the estimate β̂1 = 1.4396 is noticeably underestimated and has a large standard error (0.7212).
The true value of β2 is 0.3, while the estimate β̂2 = 1.0097 overestimates it.
The estimates deviate from the true coefficients because of random noise in the data and, more importantly, the collinearity between x1 and x2.
Hypothesis Testing Results:
For H₀: β1 = 0 (coefficient of x₁):
t-statistic = 1.996, p-value = 0.0487 (< 0.05)
The null hypothesis for β₁ can be rejected, but only barely; the evidence that x₁ affects y is weak in this joint model.
For H₀: β2 = 0 (coefficient of x₂):
t-statistic = 0.891, p-value = 0.3754 (> 0.05)
The null hypothesis for β₂ cannot be rejected; there is insufficient evidence that x₂ has a significant effect on y in this model.
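The collinearity behind these weak individual tests can be quantified with a variance inflation factor; a minimal sketch (the first line assumes the car package is installed, the second uses only base R and the correlation reported above):
# Variance inflation factor for the two-predictor model
# library(car); vif(lm_fit_full)            # requires the car package
1 / (1 - summary(lm(x1 ~ x2))$r.squared)    # VIF by hand: 1/(1 - 0.835^2) ≈ 3.3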
(d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0 : β1 =0?
lm_fit_x1 <- lm(y ~ x1)
summary(lm_fit_x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
Analysis of Results:
Significance of Coefficients:
The null hypothesis H₀: β₁ = 0 can be strongly rejected, as the p-value for x₁ is extremely small (2.66e-06). x₁ is a highly significant predictor of y in this model.
Accuracy of Coefficient Estimates:
The estimated coefficient for x₁ (1.9759) is very close to the true value β₁ = 2 used to generate the data, so the simple model captures the relationship between x₁ and y accurately.
Model Fit:
The R-squared value of 0.202 shows that approximately 20.2% of the variability in y is explained by x₁ alone. This is only slightly lower than the R-squared of 0.209 from the model that includes both x₁ and x₂, suggesting that adding x₂ contributes little additional explanatory power.
Overall Model Significance:
The F-statistic of 24.86, with a very small p-value (2.66e-06), confirms that the model is highly significant.
Comparison with Previous Model:
In this model, the coefficient for x₁ (1.9759) is much closer to its true value (2) than the estimate from the model that included both x₁ and x₂ (1.4396). This suggests that collinearity between x₁ and x₂ in the earlier model distorted the coefficient estimate and inflated its standard error.
Conclusion:
The null hypothesis H₀: β₁ = 0 can be confidently rejected: x₁ has a significant effect on y. The model with only x₁ captures that relationship accurately and avoids the collinearity problems seen when both predictors are included.
(e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0 : β1 =0?
lm_fit_x2 <- lm(y ~ x2)
summary(lm_fit_x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
Analysis of Results:
Significance of Coefficients:
The null hypothesis that the coefficient of x₂ is zero can be strongly rejected: the p-value is extremely small (1.37e-05), so x₂ is a highly significant predictor of y when used on its own.
Coefficient Estimate:
The estimated coefficient for x₂ (2.8996) is far larger than the true value β₂ = 0.3 used to generate the data. This overestimation is a consequence of the collinearity between x₁ and x₂: x₂ picks up much of the effect that actually belongs to x₁.
Model Fit:
The R-squared value of 0.176 indicates that about 17.6% of the variation in y is explained by x₂ alone, which is lower than the R-squared for the model with only x₁ (0.202) and for the model with both predictors (0.209).
Overall Model Significance:
The F-statistic of 20.98, with a very small p-value (1.37e-05), confirms that the model is statistically significant.
Comparison with Other Models:
Although the true coefficient of x₂ in the data-generating process is small (0.3), x₂ appears to be a strong predictor when analyzed in isolation. This is because it is highly correlated with x₁, which has the stronger true effect on y, so x₂ absorbs part of that effect.
Conclusion:
The null hypothesis can be confidently rejected, showing that x₂ is significantly associated with y in this simple model. However, the large coefficient and apparent significance of x₂ are largely driven by its correlation with x₁ rather than by a direct effect on y, which illustrates how collinearity can distort results when correlated predictors are evaluated separately.
(f) Do the results obtained in (c)–(e) contradict each other? Explain your answer.
Explanation:
No, the results do not contradict each other. When x1 and x2 are included together, they are highly collinear (r ≈ 0.835), so it is difficult to separate their individual effects: the standard errors of both coefficients are inflated and neither appears strongly significant. When each predictor is used alone, it absorbs the shared variation and appears highly significant. All three sets of results are consistent with a single underlying model in which y depends mainly on x1 and x2 adds little beyond its correlation with x1.
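This is easy to see in the standard errors reported above (0.72 for x1 and 1.13 for x2 in the joint model versus 0.40 and 0.63 in the simple regressions); a short sketch pulling them out of the three fitted models:
# Standard errors of the slope estimates in each model
coef(summary(lm_fit_full))[, "Std. Error"]   # x1 and x2 together: inflated by collinearity
coef(summary(lm_fit_x1))[, "Std. Error"]     # x1 alone
coef(summary(lm_fit_x2))[, "Std. Error"]     # x2 alone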
(g) Now suppose we obtain one additional observation, which was unfortunately mismeasured.
> x1 <-c(x1, 0.1)
> x2 <-c(x2, 0.8)
> y <-c(y, 6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on the each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
lm_fit_full_new <- lm(y ~ x1 + x2)
lm_fit_x1_new <- lm(y ~ x1)
lm_fit_x2_new <- lm(y ~ x2)
par(mfrow = c(2, 2))
plot(lm_fit_full_new)
par(mfrow = c(2, 2))
plot(lm_fit_x1_new)
par(mfrow = c(2, 2))
plot(lm_fit_x2_new)
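The leverage and studentized residual of the added point (observation 101) can also be read off directly; a sketch using the 2(p+1)/n and |residual| > 2 guidelines referred to below:
# Diagnostics for the new observation in each re-fitted model
sapply(list(full = lm_fit_full_new, x1_only = lm_fit_x1_new, x2_only = lm_fit_x2_new),
       function(m) c(leverage = unname(hatvalues(m)[101]),
                     rstudent = unname(rstudent(m)[101])))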
Comparison of Models:
1. Model with both x₁ and x₂:
Impact on Coefficients: including the new point shifts the balance between the two predictors: the estimated coefficient for x₂ increases substantially while the coefficient for x₁ shrinks from 1.4396 toward zero, so x₂ now appears to be the stronger predictor.
Model Fit: the R-squared value changes only slightly from its previous 0.209.
Diagnostics: the new observation's leverage is far above the usual threshold of 2(p+1)/n = 2(2+1)/101 ≈ 0.0594, and its studentized residual exceeds the ±2 guideline.
Conclusion: in this model the observation is both a high-leverage point and an outlier, exerting substantial influence by altering the apparent contributions of x₁ and x₂.
2. Model with only x₁:
Impact on Coefficients: the estimated coefficient for x₁ decreases somewhat from its previous value of 1.9759.
Model Fit: the R-squared value drops below the previous 0.202, indicating reduced explanatory power once this observation is included.
Diagnostics: the point's leverage is below the threshold of 2(p+1)/n = 2(1+1)/101 ≈ 0.0396, so it is not a high-leverage point, but its studentized residual is well above 2, identifying it as a clear outlier.
Conclusion: in this model the observation is an outlier but not a high-leverage point, and it worsens the model's fit.
3. Model with only x₂:
Impact on Coefficients: the estimated coefficient for x₂ increases slightly from its previous value of 2.8996.
Model Fit: the R-squared value improves relative to the 0.176 obtained without the new point, so the observation actually helps this model explain more of the variance in y.
Diagnostics: the point's leverage exceeds the threshold of 2(p+1)/n ≈ 0.0396, marking it as a high-leverage point, but its studentized residual is below 2, so it is not an outlier in this model.
Conclusion: in this model the observation has high leverage but is not an outlier, and its inclusion improves the fit.
Summary of Effects Across Models:
The new observation has varying impacts depending on the model:
In the full model (x₁ and x₂), it acts as both a high-leverage point and an outlier, significantly shifting the relative contributions of the predictors (x₁ decreases while x₂ increases).
In the model with only x₁, it is a strong outlier but does not have high leverage, reducing overall model performance.
In the model with only x₂, it has high leverage but is not an outlier, improving the model’s fit for this predictor.
This analysis highlights how a single unusual data point can affect regression models differently depending on their structure and variables included, underscoring the importance of thorough diagnostics and careful interpretation in regression analysis.
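To make the coefficient shifts concrete, the before/after estimates can be placed side by side (an optional sketch; both sets of fitted models are already in the workspace):
# Coefficients before and after adding the mismeasured observation
round(cbind(before = coef(lm_fit_full), after = coef(lm_fit_full_new)), 4)   # y ~ x1 + x2
round(cbind(before = coef(lm_fit_x1),   after = coef(lm_fit_x1_new)), 4)     # y ~ x1
round(cbind(before = coef(lm_fit_x2),   after = coef(lm_fit_x2_new)), 4)     # y ~ x2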