# Load necessary library
library(ISLR)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
8. This question involves the use of simple linear regression on the Auto data set.

a. Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example:

data(Auto)

# Simple linear regression
lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
  1. Is there a relationship between the predictor and the response?

    Yes. The p-value for horsepower is extremely small (< 2e-16, far below 0.05) and its t-statistic is -24.49, which signifies a strong, statistically significant relationship between horsepower and mpg.

  2. How strong is the relationship?

    The R-squared value is 0.606, indicating that 60.6% of the variation in mpg is accounted for by horsepower. This represents a moderately strong relationship.

  3. Is the relationship positive or negative?

    The coefficient for horsepower is -0.1578, indicating that mpg decreases as horsepower increases, confirming a negative relationship.

  4. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

    predict(lm_fit, data.frame(horsepower = 98), interval = "confidence")
    ##        fit      lwr      upr
    ## 1 24.46708 23.97308 24.96108
    predict(lm_fit, data.frame(horsepower = 98), interval = "prediction")
    ##        fit     lwr      upr
    ## 1 24.46708 14.8094 34.12476

Explanation:

The predicted mpg for a car with 98 horsepower is about 24.47. The 95% confidence interval (23.97, 24.96) is for the average mpg of all cars with horsepower = 98, whereas the 95% prediction interval (14.81, 34.12) is for a single car; the prediction interval is much wider because it also includes the irreducible error of an individual observation.

c. Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

    # Diagnostic plots
    par(mfrow = c(2, 2))
    plot(lm_fit)

Explanation:

The residuals-vs-fitted plot shows a clear U-shaped pattern, indicating that the relationship between mpg and horsepower is non-linear, so a straight-line fit is inadequate. The scale-location plot also suggests somewhat non-constant residual variance, and the residuals-vs-leverage plot shows a few high-horsepower observations with leverage noticeably above the rest, although none appear grossly influential.
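The curvature seen in the residuals-vs-fitted plot can be followed up with a quick comparison against a quadratic fit; this is an optional sketch (the object name lm_fit_quad is introduced here only for illustration):

# Compare the straight-line fit with a quadratic fit in horsepower
lm_fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
anova(lm_fit, lm_fit_quad)  # a small p-value here would confirm the non-linearity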

10. This question should be answered using the Carseats data set.

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

    # Multiple regression
    lm_fit_carseats <- lm(Sales ~ Price + Urban + US, data = Carseats)
    summary(lm_fit_carseats)
    ## 
    ## Call:
    ## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
    ## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
    ## UrbanYes    -0.021916   0.271650  -0.081    0.936    
    ## USYes        1.200573   0.259042   4.635 4.86e-06 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 2.472 on 396 degrees of freedom
    ## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
    ## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Explanation:

Intercept = 13.0435
This is the expected sales (Sales is measured in thousands of units) for a store with Price = 0 that is located outside the US in a non-urban area; it serves mainly as a baseline rather than a practically meaningful prediction.

Price = -0.0545
The negative coefficient indicates that, holding Urban and US fixed, each $1 increase in price is associated with a decrease in sales of about 0.0545 thousand units (roughly 54 car seats). This relationship is highly significant (p < 2e-16).

Urban = -0.0219
The coefficient is very small and statistically non-significant (p-value = 0.936), suggesting that being in an urban area has no meaningful association with sales once Price and US are accounted for.

US = 1.2006
The positive coefficient indicates that, at a given price and location type, stores in the US sell on average about 1.2 thousand more units than stores outside the US. This effect is highly significant (p ≈ 4.9e-06).

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

Sales = 13.0435 - 0.0545 × Price - 0.0219 × Urban + 1.2006 × US,

where Urban = 1 if the store is in an urban location and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.
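For reference, the dummy coding R applies to the qualitative predictors can be inspected directly; this is a small illustrative check (Urban and US are stored as factors in the ISLR Carseats data):

# How R encodes the qualitative predictors as 0/1 dummy variables
contrasts(Carseats$Urban)
contrasts(Carseats$US)
head(model.matrix(lm_fit_carseats))  # the design matrix actually used by lm()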

(d) For which of the predictors can you reject the null hypothesis H0: βj = 0?

# Coefficient table for the Carseats model fitted in (a)
summary(lm_fit_carseats)$coefficients

Reject the Null Hypothesis: Price and US. Their p-values (< 2e-16 and 4.86e-06, respectively) are far below 0.05.

Fail to Reject the Null Hypothesis: Urban. Its p-value (0.936) provides no evidence of an association with Sales once Price and US are in the model.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

# Smaller model with significant predictors
lm_fit_reduced <- lm(Sales ~ Price + US, data = Carseats)
summary(lm_fit_reduced)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Explanation:

Price and US remain highly significant in the reduced model (p < 2e-16 and 4.71e-06), and their coefficient estimates (-0.05448 and 1.19964) are essentially unchanged from the full model, confirming that Urban contributed almost nothing.

(f) How well do the models in (a) and (e) fit the data?

# Compare models using adjusted R-squared or AIC
summary(lm_fit_carseats)$adj.r.squared
## [1] 0.2335123
summary(lm_fit_reduced)$adj.r.squared
## [1] 0.2354305

Explanation:

Both models explain only about 24% of the variation in Sales (multiple R² = 0.2393 in each case), so neither fits the data especially well. The reduced model has a slightly higher adjusted R² (0.2354 vs 0.2335) and a marginally smaller residual standard error (2.469 vs 2.472), so dropping Urban costs no explanatory power.
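The comparison can also be made with a partial F-test and AIC; this is an optional check using the two model objects fitted above:

# Does dropping Urban change the fit?
anova(lm_fit_reduced, lm_fit_carseats)  # partial F-test for the Urban term
AIC(lm_fit_carseats, lm_fit_reduced)    # the model with the lower AIC is preferred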

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

# Confidence intervals
confint(lm_fit_reduced, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

# Diagnostic plots for reduced model
par(mfrow = c(2, 2))
plot(lm_fit_reduced)

Insights from the Plots:

Outliers (Standardized Residuals beyond ±2 or ±3): a handful of observations have standardized residuals slightly beyond ±2, but none clearly exceed ±3, so there is no strong evidence of serious outliers.

High Leverage Points (Far-right points on the x-axis): the residuals-vs-leverage plot shows a few observations with leverage well above the average value (p + 1)/n = 3/400 ≈ 0.0075, so some high-leverage points are present, although their residuals are small.

Conclusion: there is no clear outlier, but there are a few high-leverage observations; none of them appears influential enough (in terms of Cook's distance) to materially change the fit. A quick numeric check follows below.
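To back up the visual impression, the studentized residuals and hat values can also be tabulated directly; this is an optional numeric check on the reduced model fitted above:

# Numeric follow-up to the diagnostic plots
rstud <- rstudent(lm_fit_reduced)
lev <- hatvalues(lm_fit_reduced)
sum(abs(rstud) > 3)       # observations that would qualify as clear outliers
sum(lev > 2 * mean(lev))  # observations with leverage well above the average (p + 1)/n
max(lev)                  # the most extreme leverage value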

14. This problem focuses on the collinearity problem.
(a) Perform the following commands in R:


> set.seed(1)
> x1 <- runif(100)
> x2 <- 0.5 * x1 + rnorm(100) / 10
> y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

# Linear model equation:
# y = β0 + β1 * x1 + β2 * x2 + ε

Explanation:

The model is y = β0 + β1·x1 + β2·x2 + ε with true coefficients β0 = 2, β1 = 2, and β2 = 0.3, and with ε drawn from a standard normal distribution.

(b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.

cor(x1, x2)
## [1] 0.8351212
plot(x1, x2, main = "Scatterplot of x1 vs x2", xlab = "x1", ylab = "x2")

Explanation:

The correlation between x1 and x2 is about 0.835, so the two predictors are highly collinear. This is by construction: x2 was generated as 0.5·x1 plus a small amount of noise, and the scatterplot shows the points clustered tightly around a straight line.

(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are β̂0, β̂1, and β̂2? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0: β1 = 0? How about the null hypothesis H0: β2 = 0?

lm_fit_full <- lm(y ~ x1 + x2)
summary(lm_fit_full)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

Relationship to True Coefficients (β0, β1, β2):

The fitted values are β̂0 = 2.13, β̂1 = 1.44, and β̂2 = 1.01. The intercept is close to its true value of 2, but β̂1 noticeably underestimates the true β1 = 2, and β̂2 greatly overestimates the true β2 = 0.3. These deviations arise from random noise in the data and, above all, from the strong collinearity between x1 and x2, which makes it difficult to separate their individual effects (quantified in the check below).

Hypothesis Testing Results:

  1. For H₀: β1 = 0 (coefficient of x₁):

    • t-statistic = 1.996

    • p-value = 0.0487 (just below 0.05)
      The null hypothesis for β₁ can be rejected, but only marginally; the evidence for an effect of x₁ is much weaker here than its true role in generating the data would suggest.

  2. For H₀: β2 = 0 (coefficient of x₂):

    • t-statistic = 0.891

    • p-value = 0.3754 (> 0.05)
      The null hypothesis for β₂ cannot be rejected, indicating insufficient evidence to conclude that x₂ has a significant effect on y in this model.
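To see how severe the collinearity is, the variance inflation factor can be computed by hand; this is an illustrative check (with only two predictors the VIF is the same for both):

# VIF from regressing one predictor on the other
r2_x2_on_x1 <- summary(lm(x2 ~ x1))$r.squared
1 / (1 - r2_x2_on_x1)  # values well above 1 indicate inflated coefficient variances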

(d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0 : β1 =0?

lm_fit_x1 <- lm(y ~ x1)
summary(lm_fit_x1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

Analysis of Results:

Significance of Coefficients:
The null hypothesis H₀: β₁ = 0 can be strongly rejected, as the p-value for x₁ is extremely small (p < 0.001). This indicates that x₁ is a highly significant predictor of the dependent variable y.

Accuracy of Coefficient Estimates:
The estimated coefficient for x₁ (1.9759) is very close to the true value of β₁ (2), which was used to generate the data. This shows that, without the collinear x₂ in the model, the relationship between x₁ and y is estimated accurately.

Model Fit:
The R-squared value of 0.2024 shows that approximately 20.2% of the variability in y is explained by x₁ alone. This is only slightly lower than the R-squared value (0.2088) from the model including both x₁ and x₂, suggesting that adding x₂ does not meaningfully enhance the model's explanatory power.

Overall Model Significance:
The F-statistic of 24.86, paired with a very small p-value (2.66e-06), confirms that the overall model is highly significant.

Comparison with Previous Model:
In this model, the coefficient for x₁ (1.9759) is much closer to its true value (2) than the estimate from the model that included both x₁ and x₂ (1.4396). This suggests that collinearity between x₁ and x₂ in the earlier model was degrading the accuracy of the coefficient estimate.

Conclusion:
The null hypothesis H₀: β₁ = 0 can be confidently rejected, confirming that x₁ has a significant impact on y. The model with only x₁ performs well, accurately capturing its relationship with y, explaining a substantial portion of its variance, and avoiding the collinearity issues observed when both predictors are included.

(e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0 : β1 =0?

lm_fit_x2 <- lm(y ~ x2)
summary(lm_fit_x2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

Analysis of Results:

Significance of Coefficients:
The null hypothesis H₀: β₂ = 0 for x₂ can be strongly rejected, as the p-value is extremely small (p < 0.001). This indicates that x₂ is a highly significant predictor of y when analyzed on its own.

Coefficient Estimate:
The estimated coefficient for x₂ (2.8996) is far larger than the true value of β₂ (0.3) used to generate the data. This overestimation is a result of collinearity between x₁ and x₂: with x₁ omitted, x₂ absorbs much of the effect that should be attributed to x₁.

Model Fit:
The R-squared value of 0.1763 indicates that approximately 17.6% of the variation in y is explained by x₂ alone. This is lower than the R-squared values for the model with only x₁ (0.2024) and the model including both predictors (0.2088).

Overall Model Significance:
The F-statistic of 20.98, along with a very small p-value (1.37e-05), confirms that the overall model is statistically significant.

Comparison with Other Models:
Although the true coefficient for x₂ in the data generation process is relatively small (0.3), it appears as a significant predictor when analyzed in isolation. This is likely due to its correlation with x₁, which has a stronger true effect on y, causing x₂ to absorb some of that effect.

Conclusion:
The null hypothesis H₀: β₂ = 0 for x₂ can be confidently rejected, showing that x₂ has a significant relationship with y in this model. However, these results should be interpreted carefully. The large coefficient and apparent significance of x₂ are likely driven by its correlation with x₁, rather than its direct influence on y. This highlights how collinearity can distort results when variables are evaluated independently.

(f) Do the results obtained in (c)–(e) contradict each other? Explain your answer.

Explanation:

No, the results do not contradict each other. Because x1 and x2 are strongly collinear, the model in (c) cannot attribute their shared variation to either predictor individually, so the standard errors of β̂1 and β̂2 are inflated and x2 (and only marginally x1) appears non-significant. When each predictor is used on its own, as in (d) and (e), it acts as a proxy for the other, so each appears strongly significant. All three fits are consistent with the same underlying relationship; collinearity simply makes the individual coefficients in the joint model hard to estimate.
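One way to see this consistency numerically is to compare the confidence intervals from the three fits; the intervals from the joint model are wide enough to be compatible with the single-predictor estimates (optional check):

# Confidence intervals widen sharply when both collinear predictors are included
confint(lm_fit_full)
confint(lm_fit_x1)
confint(lm_fit_x2)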

(g) Now suppose we obtain one additional observation, which was unfortunately mismeasured.

> x1 <- c(x1, 0.1)
> x2 <- c(x2, 0.8)
> y <- c(y, 6)

Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on the each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.

x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)

lm_fit_full_new <- lm(y ~ x1 + x2)
lm_fit_x1_new <- lm(y ~ x1)
lm_fit_x2_new <- lm(y ~ x2)

par(mfrow = c(2, 2))
plot(lm_fit_full_new)

par(mfrow = c(2, 2))
plot(lm_fit_x1_new)

par(mfrow = c(2, 2))
plot(lm_fit_x2_new)
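As a supplementary check (not part of the original output), the leverage and studentized residual of the added observation (row 101) can be extracted from the refitted models:

# Leverage and studentized residual of the added observation in each refit
data.frame(
  model    = c("y ~ x1 + x2", "y ~ x1", "y ~ x2"),
  leverage = c(hatvalues(lm_fit_full_new)[101],
               hatvalues(lm_fit_x1_new)[101],
               hatvalues(lm_fit_x2_new)[101]),
  rstudent = c(rstudent(lm_fit_full_new)[101],
               rstudent(lm_fit_x1_new)[101],
               rstudent(lm_fit_x2_new)[101]))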

Leverage and Residuals for the New Observation:
The added point (x₁ = 0.1, x₂ = 0.8, y = 6) breaks the strong positive relationship between x₁ and x₂ present in the original data, so its leverage and residual differ sharply across the three refitted models (see the numeric check above).

Comparison of Models:

1. Model with both x₁ and x₂: the new observation has very high leverage, because the pair (0.1, 0.8) lies far from the joint distribution of the two predictors. Its studentized residual is only moderate, so it is primarily a high-leverage, influential point rather than a clear outlier. After refitting, x₂ appears significant while x₁ no longer does, reversing the conclusions from (c).

2. Model with only x₁: x₁ = 0.1 is well within the range of the existing x₁ values, so the point has low leverage; however, y = 6 is far from the fitted line at that value of x₁, so the observation shows up as an outlier (large residual).

3. Model with only x₂: x₂ = 0.8 lies at the upper edge of the observed x₂ values, giving the point high leverage, but its residual is not unusually large, so it is not a clear outlier.

Summary of Effects Across Models:
The new observation has varying impacts depending on the model: it is a high-leverage, influential point in the joint model, an outlier (but not a high-leverage point) in the x₁-only model, and a high-leverage point (but not an outlier) in the x₂-only model.

This analysis highlights how a single unusual data point can affect regression models differently depending on their structure and variables included, underscoring the importance of thorough diagnostics and careful interpretation in regression analysis.