This question involves the use of simple linear regression on the Auto data set.
a. Use the sm.OLS() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summarize() function to print the results. Comment on the output. For example:
i. Is there a relationship between the predictor and the response?
ii. How strong is the relationship between the predictor and the response?
iii. Is the relationship between the predictor and the response positive or negative?
iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
b. Plot the response and the predictor in a new set of axes ax. Use the ax.axline() method or the abline() function defined in the lab to display the least squares regression line.
c. Produce some of the diagnostic plots of the least squares regression fit as described in the lab. Comment on any problems you see with the fit.
### (a) Perform simple linear regression
# Load necessary libraries
library(ISLR) # Auto dataset is part of ISLR package
## Warning: package 'ISLR' was built under R version 4.4.2
# Load the dataset
data(Auto)
lm_fit <- lm(mpg ~ horsepower, data = Auto)
# Print the summary of the regression model
summary(lm_fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
i. Yes. The p-value for horsepower is essentially zero (< 2e-16), so there is a strong, statistically significant relationship between horsepower and mpg: higher horsepower is associated with lower fuel efficiency.
ii. The R-squared value is 0.606, meaning that 60.6% of the variability in mpg is explained by horsepower. This indicates a moderately strong relationship.
iii. The coefficient for horsepower is -0.1578, so mpg decreases as horsepower increases. The relationship is negative.
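As a quick check on the strength of the relationship (a sketch, assuming the Auto data frame is loaded as above), the correlation between mpg and horsepower should equal the negative square root of the R-squared from the simple regression:
# Correlation check: expected to be about -sqrt(0.6059), i.e. roughly -0.78
cor(Auto$mpg, Auto$horsepower)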
# iv. Predict mpg for horsepower = 98 with confidence and prediction intervals
hp_98 <- data.frame(horsepower = 98)
predict(lm_fit, newdata = hp_98, interval = "confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(lm_fit, newdata = hp_98, interval = "prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
confint(lm_fit)
## 2.5 % 97.5 %
## (Intercept) 38.525212 41.3465103
## horsepower -0.170517 -0.1451725
A car with 98 horsepower is predicted to have 24.47 mpg, with a 95% confidence interval of (23.97, 24.96) and a 95% prediction interval of (14.81, 34.12). The prediction interval is much wider than the confidence interval because it accounts for the variability of individual cars around the regression line, not just the uncertainty in the estimated mean response.
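The point prediction can also be reproduced by hand from the fitted coefficients (a minimal check using the lm_fit object above):
# 39.935861 - 0.157845 * 98 is approximately 24.467, matching predict()
coef(lm_fit)[1] + coef(lm_fit)[2] * 98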
plot(Auto$horsepower, Auto$mpg, main = "MPG vs Horsepower",
xlab = "Horsepower", ylab = "MPG", pch = 16, col = "blue")
abline(lm_fit, col = "red", lwd = 2)
par(mfrow = c(2, 2)) # 2x2 layout for plots
plot(lm_fit) # Produces residual plots, Q-Q plot, scale-location plot, and leverage plot
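For part (c): the residuals-vs-fitted panel shows a clear U-shaped pattern, indicating that the relationship between mpg and horsepower is non-linear, and a few observations have relatively high leverage. As an illustrative (not required) remedy, a quadratic term captures much of the curvature:
# Illustrative sketch only: a quadratic fit addresses the curvature seen in the residuals
lm_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(lm_quad)$r.squared # noticeably higher than the 0.606 of the linear fit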
This question should be answered using the Carseats data set.
a. Fit a multiple regression model to predict Sales using Price, Urban, and US.
b. Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!
c. Write out the model in equation form, being careful to handle the qualitative variables properly.
d. For which of the predictors can you reject the null hypothesis H0 : βj = 0?
e. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
f. How well do the models in (a) and (e) fit the data?
g. Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
h. Is there evidence of outliers or high leverage observations in the model from (e)?
# Load necessary library
library(ISLR)
# Load the Carseats dataset
data(Carseats)
lm_fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm_fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Price has a significant negative effect on Sales, stores in the US have significantly higher Sales, the Urban indicator is not significant, and the model explains about 23.9% of the variability in Sales (adjusted R-squared 23.4%).
Interpretation:
- Price: holding the other predictors fixed, each one-dollar increase in Price is associated with a decrease of about 0.0545 in Sales (Sales is recorded in thousands of units).
- UrbanYes (Urban is qualitative): stores in urban locations sell about 0.022 less than non-urban stores, an effect that is not statistically significant.
- USYes (US is qualitative): stores in the US sell about 1.20 more than stores outside the US, a significant effect.
\[\text{Sales} = 13.0435 - 0.0545 \times \text{Price} - 0.0219 \times \text{Urban}_{\text{Yes}} + 1.2006 \times \text{US}_{\text{Yes}}\]
where Urban_Yes = 1 if the store is in an urban location (0 otherwise) and US_Yes = 1 if the store is in the US (0 otherwise).
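To see how R encodes the two qualitative predictors as 0/1 indicator columns, the design matrix of the fitted model can be inspected (a quick illustrative check):
# UrbanYes and USYes are dummy columns created automatically by lm()
head(model.matrix(lm_fit))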
summary(lm_fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.04346894 0.651012245 20.03567373 3.626602e-62
## Price -0.05445885 0.005241855 -10.38923205 1.609917e-22
## UrbanYes -0.02191615 0.271650277 -0.08067781 9.357389e-01
## USYes 1.20057270 0.259041508 4.63467306 4.860245e-06
Significant predictors:
Based on the p-values above, we can reject H0: βj = 0 for Price (p < 2e-16) and US (p = 4.86e-06), but not for Urban (p = 0.936).
lm_reduced <- lm(Sales ~ Price + US, data = Carseats)
summary(lm_reduced)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
cat("Full model R-squared:", summary(lm_fit)$r.squared, "\n")
## Full model R-squared: 0.2392754
cat("Reduced model R-squared:", summary(lm_reduced)$r.squared, "\n")
## Reduced model R-squared: 0.2392629
The full and reduced models have nearly identical R-squared values (0.23928 vs. 0.23926), and the reduced model has a slightly higher adjusted R-squared (0.2354 vs. 0.2335), so removing Urban has a negligible effect on fit and the simpler reduced model is preferable.
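The two nested models can also be compared with a formal F-test (a sketch using anova(); the result should mirror the non-significant t-test for Urban):
# Nested-model F-test: does dropping Urban significantly worsen the fit?
anova(lm_reduced, lm_fit)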
confint(lm_reduced)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow = c(2, 2))
plot(lm_reduced) # Diagnostic plots: residuals, leverage, etc.
## Insights from the Diagnostic Plots
The standardized residuals stay roughly within ±3 in the residual plots, so there is no strong evidence of outliers. The residuals-vs-leverage plot, however, shows several observations with leverage well above the average value (p + 1)/n = 3/400 ≈ 0.0075, so there is some evidence of high-leverage points.
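This visual impression can be backed up numerically (a sketch using common rules of thumb: |studentized residual| > 3 for outliers and leverage above 2(p + 1)/n for high leverage):
# Count potential outliers and high-leverage observations in the reduced model
sum(abs(rstudent(lm_reduced)) > 3)
sum(hatvalues(lm_reduced) > 2 * length(coef(lm_reduced)) / nrow(Carseats))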
This problem focuses on the collinearity problem.
a. Perform the following commands in R:
> set.seed(1)
> x1 <- runif(100)
> x2 <- 0.5 * x1 + rnorm(100) / 10
> y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
b. What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
c. Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are β̂0, β̂1, and β̂2? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0 : β1 = 0? How about the null hypothesis H0 : β2 = 0?
d. Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0 : β1 = 0?
e. Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0 : β1 = 0?
f. Do the results obtained in (c)–(e) contradict each other? Explain your answer.
g. Suppose we obtain one additional observation, which was unfortunately mismeasured. We use the function np.concatenate() to add this additional observation to each of x1, x2 and y.
x1 = np.concatenate([x1, [0.1]])
x2 = np.concatenate([x2, [0.8]])
y = np.concatenate([y, [6]])
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
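For part (a), the data-generating model written out explicitly is
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon, \qquad \beta_0 = 2,\ \beta_1 = 2,\ \beta_2 = 0.3. \]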
cor(x1, x2) # Compute correlation
## [1] 0.8351212
plot(x1, x2, main="Scatterplot of x1 vs x2", xlab="x1", ylab="x2", pch=16) # Scatterplot
lm_fit <- lm(y ~ x1 + x2)
summary(lm_fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
# Extract estimated coefficients
beta_hat <- coef(lm_fit)
cat("Estimated coefficients: β0 =", beta_hat[1], "β1 =", beta_hat[2], "β2 =", beta_hat[3], "\n")
## Estimated coefficients: β0 = 2.1305 β1 = 1.439555 β2 = 1.009674
# Check significance of β1 and β2
# If p-values < 0.05, we reject the null hypothesis
The true coefficients are β0 = 2, β1 = 2, and β2 = 0.3. The estimates are β̂0 = 2.13, β̂1 = 1.44, and β̂2 = 1.01: the intercept is close to the truth, but β̂1 and β̂2 are pulled away from their true values by random noise and by the collinearity between x1 and x2. At the 5% level we can (just) reject H0: β1 = 0 (p = 0.0487), but we cannot reject H0: β2 = 0 (p = 0.375).
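The degree of collinearity can be quantified with a variance inflation factor, computed here by hand so no extra package is needed (an illustrative sketch; x1 and x2 are the simulated vectors above):
# VIF for x1 (and, by symmetry, for x2); values well above 1 signal collinearity
1 / (1 - summary(lm(x1 ~ x2))$r.squared)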
lm_x1 <- lm(y ~ x1)
summary(lm_x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
Yes. The p-value for x1 is extremely small (2.66e-06), so we reject the null hypothesis H0: β1 = 0; x1 is a significant predictor of y in this model, although it explains only about 20.2% of the variance.
lm_x2 <- lm(y ~ x2)
summary(lm_x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
Yes. Since the p-value for x2 is very small (1.37e-05), we reject the null hypothesis that its slope is zero (H0: β1 = 0 in this model): on its own, x2 is a significant predictor of y, although it explains slightly less variance (R-squared = 0.1763) than x1 alone.
These results do not contradict the multiple regression in (c). There, x2 is not significant (t = 0.891, p = 0.375) because x1 and x2 are highly correlated (r = 0.835): given x1, x2 carries little additional information, and the collinearity inflates the standard errors of both coefficients. In the simple regressions each predictor absorbs the shared signal and appears significant.
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
# Re-fit the models
lm_fit_new <- lm(y ~ x1 + x2)
lm_x1_new <- lm(y ~ x1)
lm_x2_new <- lm(y ~ x2)
# Compare new summaries
summary(lm_fit_new)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
summary(lm_x1_new)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
summary(lm_x2_new)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
# Plot diagnostics for outliers and leverage
par(mfrow = c(2, 2))
plot(lm_fit_new)
plot(lm_x1_new)
plot(lm_x2_new)
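Beyond the plots, the added observation (index 101) can be checked numerically in each refit (a sketch using leverage values and studentized residuals):
# Leverage (hat value) and studentized residual of the mismeasured point
c(hat = hatvalues(lm_fit_new)[101], resid = rstudent(lm_fit_new)[101])
c(hat = hatvalues(lm_x1_new)[101], resid = rstudent(lm_x1_new)[101])
c(hat = hatvalues(lm_x2_new)[101], resid = rstudent(lm_x2_new)[101])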
### Effect of the New Observation
In the full model (y ~ x1 + x2), the new observation changes the conclusions substantially: x2 becomes significant while x1 no longer is. Its unusual predictor combination (x1 = 0.1 paired with x2 = 0.8) gives it high leverage, while its studentized residual is not extreme, so it is a high-leverage point rather than an outlier.
In the model y ~ x1, the fitted line changes relatively little, but the point has a large residual: it is an outlier, though not a high-leverage point, since x1 = 0.1 lies well within the range of the other x1 values.
In the model y ~ x2, x2 is significant by itself, but its large coefficient (about 2.9 before and 3.1 after adding the point, versus the true 0.3) largely reflects its correlation with x1. The new point has high leverage because x2 = 0.8 lies beyond the rest of the x2 values, but it is not an outlier.
Overall, a single mismeasured observation affects each model differently, which highlights the importance of diagnostic checks and of accounting for multicollinearity in regression analysis.