# Load necessary libraries
library(ISLR)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
a. Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example:
data(Auto)
# Simple linear regression
lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Is there a relationship between the predictor and the response?
Yes. The p-value for horsepower is < 2e-16, far below 0.05, which indicates a statistically significant relationship between horsepower and mpg.
How strong is the relationship?
The R-squared value is 0.606, indicating that 60.6% of the variation in mpg is accounted for by horsepower. This represents a moderately strong relationship.
Is the relationship positive or negative?
The coefficient for horsepower is -0.1578, indicating that mpg decreases as horsepower increases, confirming a negative relationship.
What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
predict(lm_fit, data.frame(horsepower = 98), interval = "confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(lm_fit, data.frame(horsepower = 98), interval = "prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
Explanation:
The predict() function calculates the predicted value of mpg for a given horsepower (98 here).
Confidence intervals give a range for the mean response at horsepower = 98, while prediction intervals give a range for an individual observation; the prediction interval is wider because it also accounts for the irreducible error of a single car.
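As an optional illustration (not part of the original question), both interval types can be drawn over the data; hp_grid below is a helper object introduced only for this sketch.
# Sketch: visualize confidence vs. prediction bands for the Auto fit
hp_grid <- data.frame(horsepower = seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 100))
ci <- predict(lm_fit, hp_grid, interval = "confidence")
pi <- predict(lm_fit, hp_grid, interval = "prediction")
plot(Auto$horsepower, Auto$mpg, pch = 19, xlab = "Horsepower", ylab = "MPG")
lines(hp_grid$horsepower, ci[, "fit"], col = "red", lwd = 2)        # fitted line
lines(hp_grid$horsepower, ci[, "lwr"], col = "blue", lty = 2)       # 95% confidence band
lines(hp_grid$horsepower, ci[, "upr"], col = "blue", lty = 2)
lines(hp_grid$horsepower, pi[, "lwr"], col = "darkgreen", lty = 3)  # 95% prediction band
lines(hp_grid$horsepower, pi[, "upr"], col = "darkgreen", lty = 3)
The prediction band is visibly wider than the confidence band at every horsepower value.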
Plot the response and the predictor. Use the abline() function to display the least squares regression line.
# Scatterplot with regression line
plot(Auto$horsepower, Auto$mpg, main = "MPG vs Horsepower", xlab = "Horsepower", ylab = "MPG", pch = 19)
abline(lm_fit, col = "red", lwd = 2)
Explanation:
This scatterplot visualizes the relationship between horsepower and mpg.
The red line represents the fitted regression line.
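Since the tidyverse is already loaded, an equivalent ggplot2 version is sketched below (purely optional; geom_smooth with method = "lm" draws the same least squares line).
# ggplot2 alternative to the base-R scatterplot
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "red") +
  labs(title = "MPG vs Horsepower", x = "Horsepower", y = "MPG")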
Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
# Diagnostic plots
par(mfrow = c(2, 2))
plot(lm_fit)
Explanation:
The diagnostic plots help assess assumptions of linear regression:
Residuals vs. Fitted: Check for non-linearity.
Normal Q-Q: Check if residuals are normally distributed.
Scale-Location: Check homoscedasticity.
Residuals vs. Leverage: Identify outliers or high-leverage points.
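As a rough numeric complement to these plots (an optional sketch using standard rules of thumb, not part of the original answer):
# Count observations flagged by common rules of thumb
rstud <- rstudent(lm_fit)            # studentized residuals
lev   <- hatvalues(lm_fit)           # leverage values
sum(abs(rstud) > 2)                  # potential outliers (|residual| > 2)
sum(lev > 2 * mean(lev))             # potential high-leverage points (> 2 * average leverage)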
This question uses the Carseats data set. (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
# Multiple regression
lm_fit_carseats <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm_fit_carseats)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Explanation:
Price and US are highly significant predictors of Sales, whereas Urban is not; the detailed interpretation of each coefficient follows in part (b).
(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
Intercept = 13.0435
The expected Sales when Price is 0 for a store that is neither urban nor in the US (both dummy variables equal to 0); it serves mainly as a baseline.
Price = -0.0545
The negative coefficient indicates that, holding the other predictors fixed, each $1 increase in price is associated with a decrease in Sales of about 0.0545 units. This effect is highly significant (p < 2e-16).
UrbanYes = -0.0219
The coefficient is very small and statistically insignificant (p = 0.936), suggesting that whether a store is in an urban area has no meaningful effect on Sales.
USYes = 1.2006
Stores located in the US sell, on average, about 1.2 more units than stores outside the US, holding Price and Urban fixed. This effect is highly significant (p ≈ 4.9e-06).
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
Sales = 13.0435 − 0.0545 × Price − 0.0219 × Urban + 1.2006 × US,
where Urban = 1 if the store is in an urban location (0 otherwise) and US = 1 if the store is in the US (0 otherwise).
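To see how R builds these dummy variables, the design matrix can be inspected directly (an optional sketch; UrbanYes and USYes are the 0/1 columns referred to above).
# Inspect the 0/1 coding of the qualitative predictors
contrasts(Carseats$Urban)
contrasts(Carseats$US)
head(model.matrix(Sales ~ Price + Urban + US, data = Carseats))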
(d) For which of the predictors can you reject the null hypothesis H0: βj = 0?
# p-values for the Carseats model
summary(lm_fit_carseats)$coefficients
The relevant p-values are those shown in the coefficient table of the summary above.
Reject the Null Hypothesis:
Price (p < 0.001): The effect of price on sales is statistically significant.
US (p < 0.001): The difference in sales between US and non-US stores is statistically significant.
Fail to Reject the Null Hypothesis:
Urban (p = 0.936): There is no evidence that urban location is associated with sales.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
# Smaller model with significant predictors
lm_fit_reduced <- lm(Sales ~ Price + US, data = Carseats)
summary(lm_fit_reduced)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Explanation:
Both Price and US remain highly significant in the reduced model, and their coefficient estimates are nearly unchanged from the full model.
(f) How well do the models in (a) and (e) fit the data?
# Compare models using adjusted R-squared or AIC
summary(lm_fit_carseats)$adj.r.squared
## [1] 0.2335123
summary(lm_fit_reduced)$adj.r.squared
## [1] 0.2354305
Explanation:
Full model adjusted R² = 0.2335; reduced model (without Urban) adjusted R² = 0.2354.
Both models explain only about 24% of the variation in Sales, so the fit is modest. Since dropping Urban does not reduce explanatory power (the adjusted R² actually increases slightly), the reduced model is preferred.
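The comparison can also be made formally; a minimal sketch using a partial F-test and AIC (both available in base R) is:
# Does dropping Urban significantly worsen the fit?
anova(lm_fit_reduced, lm_fit_carseats)   # partial F-test; a large p-value favours the smaller model
AIC(lm_fit_carseats, lm_fit_reduced)     # the lower AIC identifies the preferred model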
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
# Confidence intervals
confint(lm_fit_reduced, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
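For reference, the same intervals can be reproduced by hand as estimate ± t × standard error (a sketch of what confint() does internally):
# 95% confidence intervals computed manually
cf    <- coef(summary(lm_fit_reduced))
tcrit <- qt(0.975, df = df.residual(lm_fit_reduced))
cbind(lower = cf[, "Estimate"] - tcrit * cf[, "Std. Error"],
      upper = cf[, "Estimate"] + tcrit * cf[, "Std. Error"])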
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
# Diagnostic plots for reduced model
par(mfrow = c(2, 2))
plot(lm_fit_reduced)
Insights from the Plots:
Outliers (Standardized Residuals beyond ±2 or ±3):
Several data points fall outside the ±2 range on the y-axis, suggesting the presence of potential outliers.
A few extreme values, particularly below -2.5 and above +2.5, warrant further investigation to determine whether they are valid observations or possible data entry errors.
High Leverage Points (far-right points on the x-axis):
A few observations have leverage noticeably above the average value (p+1)/n = 3/400 ≈ 0.0075, but none of them combine high leverage with a large residual or an extreme Cook's distance.
Conclusion:
There is clear evidence of some outliers, with standardized residuals exceeding ±2.
However, there is no indication of high-leverage points that materially affect the regression results.
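These impressions can be checked numerically; a short sketch using studentized residuals, leverage, and Cook's distance:
# Numeric check of outliers and leverage for the reduced model
rstud <- rstudent(lm_fit_reduced)
lev   <- hatvalues(lm_fit_reduced)
sum(abs(rstud) > 2)                       # observations flagged as potential outliers
sum(lev > 2 * (2 + 1) / nrow(Carseats))   # leverage above the 2(p+1)/n guideline
max(cooks.distance(lm_fit_reduced))       # values near 1 would indicate a dominating point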
14. This problem focuses on the collinearity problem.
(a) Perform the following commands in R:
> set.seed(1)
> x1 <- runif(100)
> x2 <- 0.5 * x1 + rnorm(100) / 10
> y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
# Linear model equation:
# y = β0 + β1 * x1 + β2 * x2 + ε
Explanation:
The model has the form y = β0 + β1·x1 + β2·x2 + ε. From the data-generating code, the true regression coefficients are β0 = 2, β1 = 2, and β2 = 0.3.
(b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2, main = "Scatterplot of x1 vs x2", xlab = "x1", ylab = "x2")
Explanation:
x1 and x2 are strongly positively correlated (r ≈ 0.835), which appears as a clear linear trend in the scatterplot; this collinearity matters for the regressions that follow.
(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are β̂0, β̂1, and β̂2? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0: β1 = 0? How about the null hypothesis H0: β2 = 0?
lm_fit_full <- lm(y ~ x1 + x2)
summary(lm_fit_full)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
Relationship to True Coefficients (β0, β1, β2):
The true value of β0 is 2, and the estimate β̂0 = 2.1305 is close to it.
For β1, the true value is 2, but the estimate β̂1 = 1.4396 is noticeably underestimated and has a large standard error (0.7212).
The true value of β2 is 0.3, while the estimate β̂2 = 1.0097 overestimates it.
The estimates deviate from the true coefficients because of random noise in the data and, more importantly, the collinearity between x1 and x2.
Hypothesis Testing Results:
For H₀: β1 = 0 (coefficient of x₁):
t-statistic = 1.996, p-value = 0.0487 (< 0.05)
The null hypothesis for β₁ can be rejected, but only barely; the evidence that x₁ affects y is weak in this joint model.
For H₀: β2 = 0 (coefficient of x₂):
t-statistic = 0.891, p-value = 0.3754 (> 0.05)
The null hypothesis for β₂ cannot be rejected; there is insufficient evidence that x₂ has a significant effect on y in this model.
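The collinearity behind these weak individual tests can be quantified with a variance inflation factor; a minimal sketch (the first line assumes the car package is installed, the second uses only base R and the correlation reported above):
# Variance inflation factor for the two-predictor model
# library(car); vif(lm_fit_full)            # requires the car package
1 / (1 - summary(lm(x1 ~ x2))$r.squared)    # VIF by hand: 1/(1 - 0.835^2) ≈ 3.3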
(d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0 : β1 =0?
lm_fit_x1 <- lm(y ~ x1)
summary(lm_fit_x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
Analysis of Results:
Significance of Coefficients:
The null hypothesis H₀: β₁ = 0 can be strongly rejected, as the p-value for x₁ is extremely small (2.66e-06). x₁ is a highly significant predictor of y in this model.
Accuracy of Coefficient Estimates:
The estimated coefficient for x₁ (1.9759) is very close to the true value β₁ = 2 used to generate the data, so the simple model captures the relationship between x₁ and y accurately.
Model Fit:
The R-squared value of 0.202 shows that approximately 20.2% of the variability in y is explained by x₁ alone. This is only slightly lower than the R-squared of 0.209 from the model that includes both x₁ and x₂, suggesting that adding x₂ contributes little additional explanatory power.
Overall Model Significance:
The F-statistic of 24.86, with a very small p-value (2.66e-06), confirms that the model is highly significant.
Comparison with Previous Model:
In this model, the coefficient for x₁ (1.9759) is much closer to its true value (2) than the estimate from the model that included both x₁ and x₂ (1.4396). This suggests that collinearity between x₁ and x₂ in the earlier model distorted the coefficient estimate and inflated its standard error.
Conclusion:
The null hypothesis H₀: β₁ = 0 can be confidently rejected: x₁ has a significant effect on y. The model with only x₁ captures that relationship accurately and avoids the collinearity problems seen when both predictors are included.
(e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0 : β1 =0?
lm_fit_x2 <- lm(y ~ x2)
summary(lm_fit_x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
Analysis of Results:
Significance of Coefficients:
The null hypothesis that the coefficient of x₂ is zero can be strongly rejected: the p-value is extremely small (1.37e-05), so x₂ is a highly significant predictor of y when used on its own.
Coefficient Estimate:
The estimated coefficient for x₂ (2.8996) is far larger than the true value β₂ = 0.3 used to generate the data. This overestimation is a consequence of the collinearity between x₁ and x₂: x₂ picks up much of the effect that actually belongs to x₁.
Model Fit:
The R-squared value of 0.176 indicates that about 17.6% of the variation in y is explained by x₂ alone, which is lower than the R-squared for the model with only x₁ (0.202) and for the model with both predictors (0.209).
Overall Model Significance:
The F-statistic of 20.98, with a very small p-value (1.37e-05), confirms that the model is statistically significant.
Comparison with Other Models:
Although the true coefficient of x₂ in the data-generating process is small (0.3), x₂ appears to be a strong predictor when analyzed in isolation. This is because it is highly correlated with x₁, which has the stronger true effect on y, so x₂ absorbs part of that effect.
Conclusion:
The null hypothesis can be confidently rejected, showing that x₂ is significantly associated with y in this simple model. However, the large coefficient and apparent significance of x₂ are largely driven by its correlation with x₁ rather than by a direct effect on y, which illustrates how collinearity can distort results when correlated predictors are evaluated separately.
(f) Do the results obtained in (c)–(e) contradict each other? Explain your answer.
Explanation:
No, the results do not contradict each other. When x1 and x2 are included together, they are highly collinear (r ≈ 0.835), so it is difficult to separate their individual effects: the standard errors of both coefficients are inflated and neither appears strongly significant. When each predictor is used alone, it absorbs the shared variation and appears highly significant. All three sets of results are consistent with a single underlying model in which y depends mainly on x1 and x2 adds little beyond its correlation with x1.
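This is easy to see in the standard errors reported above (0.72 for x1 and 1.13 for x2 in the joint model versus 0.40 and 0.63 in the simple regressions); a short sketch pulling them out of the three fitted models:
# Standard errors of the slope estimates in each model
coef(summary(lm_fit_full))[, "Std. Error"]   # x1 and x2 together: inflated by collinearity
coef(summary(lm_fit_x1))[, "Std. Error"]     # x1 alone
coef(summary(lm_fit_x2))[, "Std. Error"]     # x2 alone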
(g) Now suppose we obtain one additional observation, which was unfortunately mismeasured.
> x1 <-c(x1, 0.1)
> x2 <-c(x2, 0.8)
> y <-c(y, 6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on the each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
lm_fit_full_new <- lm(y ~ x1 + x2)
lm_fit_x1_new <- lm(y ~ x1)
lm_fit_x2_new <- lm(y ~ x2)
par(mfrow = c(2, 2))
plot(lm_fit_full_new)
par(mfrow = c(2, 2))
plot(lm_fit_x1_new)
par(mfrow = c(2, 2))
plot(lm_fit_x2_new)
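The leverage and studentized residual of the added point (observation 101) can also be read off directly; a sketch using the 2(p+1)/n and |residual| > 2 guidelines referred to below:
# Diagnostics for the new observation in each re-fitted model
sapply(list(full = lm_fit_full_new, x1_only = lm_fit_x1_new, x2_only = lm_fit_x2_new),
       function(m) c(leverage = unname(hatvalues(m)[101]),
                     rstudent = unname(rstudent(m)[101])))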
Comparison of Models:
1. Model with both x₁ and x₂:
Impact on Coefficients: including the new point shifts the balance between the two predictors: the estimated coefficient for x₂ increases substantially while the coefficient for x₁ shrinks from 1.4396 toward zero, so x₂ now appears to be the stronger predictor.
Model Fit: the R-squared value changes only slightly from its previous 0.209.
Diagnostics: the new observation's leverage is far above the usual threshold of 2(p+1)/n = 2(2+1)/101 ≈ 0.0594, and its studentized residual exceeds the ±2 guideline.
Conclusion: in this model the observation is both a high-leverage point and an outlier, exerting substantial influence by altering the apparent contributions of x₁ and x₂.
2. Model with only x₁:
Impact on Coefficients: the estimated coefficient for x₁ decreases somewhat from its previous value of 1.9759.
Model Fit: the R-squared value drops below the previous 0.202, indicating reduced explanatory power once this observation is included.
Diagnostics: the point's leverage is below the threshold of 2(p+1)/n = 2(1+1)/101 ≈ 0.0396, so it is not a high-leverage point, but its studentized residual is well above 2, identifying it as a clear outlier.
Conclusion: in this model the observation is an outlier but not a high-leverage point, and it worsens the model's fit.
3. Model with only x₂:
Impact on Coefficients: the estimated coefficient for x₂ increases slightly from its previous value of 2.8996.
Model Fit: the R-squared value improves relative to the 0.176 obtained without the new point, so the observation actually helps this model explain more of the variance in y.
Diagnostics: the point's leverage exceeds the threshold of 2(p+1)/n ≈ 0.0396, marking it as a high-leverage point, but its studentized residual is below 2, so it is not an outlier in this model.
Conclusion: in this model the observation has high leverage but is not an outlier, and its inclusion improves the fit.
Summary of Effects Across Models:
The new observation has varying impacts depending on the model:
In the full model (x₁ and x₂), it acts as both a high-leverage point and an outlier, significantly shifting the relative contributions of the predictors (x₁ decreases while x₂ increases).
In the model with only x₁, it is a strong outlier but does not have high leverage, reducing overall model performance.
In the model with only x₂, it has high leverage but is not an outlier, improving the model’s fit for this predictor.
This analysis highlights how a single unusual data point can affect regression models differently depending on their structure and variables included, underscoring the importance of thorough diagnostics and careful interpretation in regression analysis.
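To make the coefficient shifts concrete, the before/after estimates can be placed side by side (an optional sketch; both sets of fitted models are already in the workspace):
# Coefficients before and after adding the mismeasured observation
round(cbind(before = coef(lm_fit_full), after = coef(lm_fit_full_new)), 4)   # y ~ x1 + x2
round(cbind(before = coef(lm_fit_x1),   after = coef(lm_fit_x1_new)), 4)     # y ~ x1
round(cbind(before = coef(lm_fit_x2),   after = coef(lm_fit_x2_new)), 4)     # y ~ x2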