Question 8

This question involves the use of simple linear regression on the Auto data set.

a. Use the sm.OLS() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summarize() function to print the results. Comment on the output. For example:

  i. Is there a relationship between the predictor and the response?

  ii. How strong is the relationship between the predictor and the response?

  iii. Is the relationship between the predictor and the response positive or negative?

  iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

b. Plot the response and the predictor in a new set of axes ax. Use the ax.axline() method or the abline() function defined in the lab to display the least squares regression line.

c. Produce some of the diagnostic plots of the least squares regression fit as described in the lab. Comment on any problems you see with the fit.

(a) Perform simple linear regression

# Load necessary libraries
library(ISLR)  # Auto dataset is part of ISLR package
## Warning: package 'ISLR' was built under R version 4.4.2
# Load the dataset
data(Auto)


lm_fit <- lm(mpg ~ horsepower, data = Auto)

# Print the summary of the regression model
summary(lm_fit)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

i. Yes. The p-value for horsepower is essentially zero (< 2e-16), so there is a statistically significant relationship between horsepower and mpg.

ii. The R-squared value is 0.606, meaning that about 60.6% of the variability in mpg is explained by horsepower; this is a moderately strong relationship.

iii. The coefficient for horsepower is -0.1578, so mpg decreases as horsepower increases: the relationship is negative.
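
Since this is a simple linear regression, the R-squared is just the squared correlation between mpg and horsepower; a quick optional check:

# R-squared in simple linear regression equals the squared correlation
cor(Auto$mpg, Auto$horsepower)      # about -0.78, a fairly strong negative association
cor(Auto$mpg, Auto$horsepower)^2    # matches the Multiple R-squared of 0.6059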

# iv. Predict mpg for horsepower = 98 with confidence and prediction intervals
hp_98 <- data.frame(horsepower = 98)
predict(lm_fit, newdata = hp_98, interval = "confidence")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
predict(lm_fit, newdata = hp_98, interval = "prediction")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476
confint(lm_fit)
##                 2.5 %     97.5 %
## (Intercept) 38.525212 41.3465103
## horsepower  -0.170517 -0.1451725

A car with 98 horsepower is predicted to have 24.47 mpg, with a 95% confidence interval of (23.97, 24.96) and a 95% prediction interval of (14.81, 34.12). The prediction interval is much wider because it accounts for the variability of an individual car around the regression line, not just the uncertainty in the estimated mean response.
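
The point prediction can also be reproduced directly from the fitted coefficients; a one-line sanity check:

# Reproduce the fitted value at horsepower = 98 by hand: 39.9359 - 0.1578 * 98
coef(lm_fit)[1] + coef(lm_fit)[2] * 98   # approximately 24.47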

(b) Scatter plot with regression line

plot(Auto$horsepower, Auto$mpg, main = "MPG vs Horsepower",
     xlab = "Horsepower", ylab = "MPG", pch = 16, col = "blue")
abline(lm_fit, col = "red", lwd = 2)

(c) Diagnostic plots for regression

par(mfrow = c(2, 2))  # 2x2 layout for plots
plot(lm_fit)  # Produces residual plots, Q-Q plot, scale-location plot, and leverage plot

Understanding Diagnostic Plots

1. Residuals vs. Fitted Plot (Top Left)

  • If the residuals show a clear pattern instead of being randomly scattered, the model may be missing a nonlinear term.
  • If residuals spread out as fitted values increase, this indicates heteroscedasticity (non-constant variance), which can impact the reliability of statistical inferences.

2. Normal Q-Q Plot (Top Right)

  • This plot checks if residuals follow a normal distribution.
  • Strong deviations from the 45-degree line, especially in the tails, suggest non-normality in residuals.
  • A log transformation or a nonlinear model may improve normality.

3. Scale-Location Plot (Bottom Left)

  • If residual spread increases with fitted values, it confirms heteroscedasticity, meaning variability is not uniform across predictions.
  • Consider using weighted least squares regression or transforming the response variable.

4. Residuals vs. Leverage Plot (Bottom Right)

  • This plot helps identify influential data points.
  • Observations 115, 6, 83, and 94 have high leverage, meaning they have unusual predictor values that could impact the model.
  • Points with high Cook’s distance strongly influence the regression and should be examined further; a numeric check is sketched after this list.
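
For a numeric complement to the leverage plot, a minimal sketch that flags high-leverage and high-influence points in the Auto fit; the 2(p + 1)/n and 4/n cutoffs are common conventions rather than hard rules:

# Flag potentially influential observations numerically
lev  <- hatvalues(lm_fit)          # leverage of each observation
cook <- cooks.distance(lm_fit)     # Cook's distance of each observation

n <- nrow(Auto)
which(lev > 2 * 2 / n)             # high leverage: above 2(p + 1)/n with p = 1 predictor
which(cook > 4 / n)                # large influence by the common 4/n rule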

Summary of Key Issues & Fixes

  • Non-linearity → Consider adding a polynomial or interaction term (illustrated in the sketch after this list).
  • Non-normal residuals → Try a log or square root transformation.
  • Heteroscedasticity → Use weighted regression or robust standard errors.
  • Influential points → Check high-leverage points to determine if they are valid or should be removed.
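
As a concrete illustration of the first two fixes above, a minimal sketch (the quadratic and log-response models are illustrative choices, not the only reasonable ones):

# Two candidate fixes for the curvature in the residuals-vs-fitted plot
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)  # add a quadratic term
fit_log  <- lm(log(mpg) ~ horsepower, data = Auto)               # transform the response

summary(fit_quad)$r.squared   # compare with 0.6059 for the straight-line fit
par(mfrow = c(2, 2))
plot(fit_quad)                # the residual plots should show much less curvature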

Question 10

This question should be answered using the Carseats data set.

a. Fit a multiple regression model to predict Sales using Price, Urban, and US.

b. Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

c. Write out the model in equation form, being careful to handle the qualitative variables properly.

d. For which of the predictors can you reject the null hypothesis H0 : βj = 0?

e. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

f. How well do the models in (a) and (e) fit the data?

g. Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

h. Is there evidence of outliers or high leverage observations in the model from (e)?

# Load necessary library
library(ISLR)

# Load the Carseats dataset
data(Carseats)

(a) Fit a multiple regression model with Sales as response

lm_fit <- lm(Sales ~ Price + Urban + US, data = Carseats)

(b) Summary of the model

summary(lm_fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Price is negatively associated with Sales, stores in the US have substantially higher Sales, the Urban indicator is not significant, and the model explains roughly 24% of the variability in Sales (R-squared = 0.239, adjusted R-squared = 0.234).

Interpretation: the coefficient of Price gives the change in Sales for a one-unit increase in Price, holding the other predictors fixed; UrbanYes (Urban is categorical) gives the effect of an urban location relative to a rural one; USYes (US is categorical) gives the effect of a US store relative to a non-US store.

Interpretation

  • Intercept (13.0435):
    • The estimated Sales (in thousands of units) for a store with Price = 0 that is neither urban nor in the US (both dummy variables equal 0); on its own it has little practical meaning.
  • Price (-0.0545):
    • A one-unit increase in Price is associated with a decrease in Sales of about 0.0545 thousand units (roughly 54 units), holding the other variables constant.
  • UrbanYes (-0.0219):
    • A store in an urban area (UrbanYes = 1) is estimated to sell about 0.0219 thousand units less than a comparable rural store (UrbanYes = 0).
    • This effect is tiny and not statistically significant, so urban vs. rural location is not a useful predictor of Sales here.
  • USYes (1.2006):
    • A store located in the US (USYes = 1) is estimated to sell about 1.2006 thousand units (roughly 1,200 units) more than a comparable store outside the US.
    • Being in the US is associated with substantially higher Sales.

(c) Model equation (Urban and US are categorical, so they have Yes/No levels)

Sales = β0 + β1(Price) + β2(UrbanYes) + β3(USYes) + ε

Regression Equation

\[\widehat{\text{Sales}} = 13.0435 - 0.0545 \times \text{Price} - 0.0219 \times \text{UrbanYes} + 1.2006 \times \text{USYes}\]

where UrbanYes = 1 if the store is in an urban area (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).
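
The UrbanYes and USYes terms come from R's default treatment coding of the factors; a quick optional check of which level is coded as 1:

# Show the 0/1 dummy coding R uses for the two qualitative predictors
contrasts(Carseats$Urban)   # Yes = 1, No = baseline (0)
contrasts(Carseats$US)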

(d) Check significant predictors (p-value < 0.05)

summary(lm_fit)$coefficients
##                Estimate  Std. Error      t value     Pr(>|t|)
## (Intercept) 13.04346894 0.651012245  20.03567373 3.626602e-62
## Price       -0.05445885 0.005241855 -10.38923205 1.609917e-22
## UrbanYes    -0.02191615 0.271650277  -0.08067781 9.357389e-01
## USYes        1.20057270 0.259041508   4.63467306 4.860245e-06

Significant Predictors:

Price and US have p-values far below 0.05, so we can reject H0 : βj = 0 for both. Urban has a p-value of about 0.94, so we cannot reject the null hypothesis for Urban.

(e) Fit a reduced model using only significant predictors

lm_reduced <- lm(Sales ~ Price + US, data = Carseats)
summary(lm_reduced)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) Compare model fit using R-squared and Adjusted R-squared

cat("Full model R-squared:", summary(lm_fit)$r.squared, "\n")
## Full model R-squared: 0.2392754
cat("Reduced model R-squared:", summary(lm_reduced)$r.squared, "\n")
## Reduced model R-squared: 0.2392629

The full and reduced models have nearly identical R-squared values (0.2393 vs. 0.2393), and the reduced model has a slightly higher adjusted R-squared (0.2354 vs. 0.2335), so removing Urban has a negligible effect on fit and the simpler model is preferable. Both models still explain only about 24% of the variability in Sales. A formal F-test comparison is sketched below.
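
The same comparison can be made formally with a nested-model F-test; a short sketch using base R only:

# Nested-model F-test: does Urban add explanatory power beyond Price and US?
anova(lm_reduced, lm_fit)   # a large p-value means Urban can safely be dropped

# Adjusted R-squared penalizes the extra predictor
c(full = summary(lm_fit)$adj.r.squared,
  reduced = summary(lm_reduced)$adj.r.squared)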

(g) Obtain 95% confidence intervals for the reduced model coefficients

confint(lm_reduced)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Check for outliers and high leverage points

par(mfrow = c(2, 2))
plot(lm_reduced)  # Diagnostic plots: residuals, leverage, etc.

Insights from the Diagnostic Plots

Identifying Outliers (Standardized Residuals Beyond ±2 or ±3)

  • Several data points exceed ±2 on the y-axis, indicating potential outliers.
  • A few extreme values fall below -2.5 or above +2.5, suggesting the need for further examination.
  • These outliers should be reviewed to determine whether they reflect genuine data variability or possible entry errors.

Detecting High Leverage Points

  • No observations show exceptionally high leverage (above 0.04–0.05).
  • Since leverage measures how far an observation’s predictor values lie from the rest of the data (and hence its potential influence on the fit), this suggests that no single point dominates the regression; a numeric version of both checks is sketched below.
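
A numeric version of these two checks, using studentized residuals and hat values for the reduced model (the cutoffs below are common conventions rather than strict rules):

# Outliers via studentized residuals; high leverage via hat values
rs  <- rstudent(lm_reduced)
lev <- hatvalues(lm_reduced)

which(abs(rs) > 3)                      # candidate outliers
which(lev > 2 * 3 / nrow(Carseats))     # leverage above 2(p + 1)/n with p = 2 predictors
max(lev)                                # compare with the average leverage of 3/400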

Question 14

This problem focuses on the collinearity problem.

a. Perform the following commands in R:

> set.seed(1)
> x1 <- runif(100)
> x2 <- 0.5 * x1 + rnorm(100) / 10
> y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?

b. What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.

c. Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are βˆ0, βˆ1, and βˆ2? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0 : β1 = 0? How about the null hypothesis H0 : β2 = 0?

d. Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0 : β1 = 0?

e. Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0 : β1 = 0?

f. Do the results obtained in (c)–(e) contradict each other? Explain your answer.

g. Suppose we obtain one additional observation, which was unfortunately mismeasured. We use the function np.concatenate() to add this additional observation to each of x1, x2 and y.

x1 = np.concatenate([x1, [0.1]])

x2 = np.concatenate([x2, [0.8]])

y = np.concatenate([y, [6]])

Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.

(a) Generate the data

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

The true linear model is y = 2 + 2·x1 + 0.3·x2 + ε with ε ~ N(0, 1), so the regression coefficients are β0 = 2, β1 = 2, and β2 = 0.3.

(b) Compute correlation and plot scatterplot

cor(x1, x2)  # Compute correlation
## [1] 0.8351212
plot(x1, x2, main="Scatterplot of x1 vs x2", xlab="x1", ylab="x2", pch=16)  # Scatterplot

(c) Fit multiple regression model

lm_fit <- lm(y ~ x1 + x2)
summary(lm_fit)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05
# Extract estimated coefficients
beta_hat <- coef(lm_fit)
cat("Estimated coefficients: β0 =", beta_hat[1], "β1 =", beta_hat[2], "β2 =", beta_hat[3], "\n")
## Estimated coefficients: β0 = 2.1305 β1 = 1.439555 β2 = 1.009674
# Check significance of β1 and β2
# If p-values < 0.05, we reject the null hypothesis

Multiple regression results: the p-value for x1 is 0.0487, so we can (just barely) reject H0: β1 = 0 at the 5% level, while the p-value for x2 is 0.3754, so we cannot reject H0: β2 = 0.

Relationship to True Coefficients

  • True β0 = 2, Estimated βˆ0 ≈ 2.13 (close)
  • True β1 = 2, Estimated βˆ1 ≈ 1.44 (underestimated)
  • True β2 = 0.3, Estimated βˆ2 ≈ 1.01 (overestimated)

βˆ0 is close to its true value, but βˆ1 is pulled down and βˆ2 is pulled up relative to theirs: because x1 and x2 are highly correlated, the regression has difficulty separating their individual effects, and the coefficient estimates have large standard errors. The variance inflation factors sketched below quantify this collinearity.
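
The collinearity can be quantified with variance inflation factors; a minimal sketch, assuming the car package is installed (it is not used elsewhere in this document):

# Variance inflation factors: values well above 1 signal collinearity
library(car)        # install.packages("car") if it is not already available
vif(lm_fit)         # with two predictors, x1 and x2 share the same VIF, 1 / (1 - cor(x1, x2)^2)

cor(x1, x2)         # the pairwise correlation of about 0.84 tells the same story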

(d) Fit a model using only x1

lm_x1 <- lm(y ~ x1)
summary(lm_x1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

Yes. The p-value for x1 (2.66e-06) is very small, so we reject the null hypothesis H0: β1 = 0: on its own, x1 is a significant predictor of y, although the model explains only about 20.2% of the variance.

Hypothesis Test

For H0: β1 = 0 (x1 coefficient)

  • t-statistic = 4.986, p-value = 2.66e-06 (< 0.05)
  • Conclusion: Reject H0; x1 significantly affects y, and its estimated slope (1.976) is close to the true value of 2.

(e) Fit a model using only x2

lm_x2 <- lm(y ~ x2)
summary(lm_x2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

Yes. Since the p-value for x2 is very small (1.37e-05), we reject the null hypothesis that its coefficient is zero: on its own, x2 is a significant predictor of y, although it explains slightly less variance (R² = 0.1763) than x1 alone. Note the contrast with part (c): when x1 and x2 are included together, x2 is not significant (t = 0.891, p = 0.3754), a direct consequence of the collinearity between the two predictors.

(f) Compare the results:

Comments

  • No, the results do not contradict each other. Each of x1 and x2 is significantly related to y when used alone, but because the two predictors are highly correlated (r ≈ 0.84), their individual contributions cannot be separated when both are in the model, so x2 (and, to a lesser extent, x1) loses significance there.
  • Significance: in the single-predictor fits we confidently reject H0 for x1 (p = 2.66e-06) and for x2 (p = 1.37e-05); in the joint fit only x1 is (marginally) significant.
  • Coefficients: the x1-only fit recovers a slope close to the true value (1.976 vs. 2); in the joint fit the estimates (1.44 for x1, 1.01 for x2) are distorted by collinearity.
  • Model fit: R-squared is 0.2024 for x1 alone vs. 0.2088 for x1 and x2 together, so adding x2 barely improves the explanatory power. A side-by-side comparison of the three fits is sketched below.
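
A small side-by-side summary of the slope estimates and p-values from the three fits (base R only; the model objects are the ones created above):

# Collect slope estimates and p-values from the fits in (c)-(e)
models <- list(both = lm_fit, x1_only = lm_x1, x2_only = lm_x2)
lapply(models, function(m) round(coef(summary(m))[-1, , drop = FALSE], 4))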

(g) Add a mismeasured observation

x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)

# Re-fit the models
lm_fit_new <- lm(y ~ x1 + x2)
lm_x1_new <- lm(y ~ x1)
lm_x2_new <- lm(y ~ x2)

# Compare new summaries
summary(lm_fit_new)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
summary(lm_x1_new)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1            1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
summary(lm_x2_new)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2            3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06
# Plot diagnostics for outliers and leverage
par(mfrow = c(2, 2))
plot(lm_fit_new)

plot(lm_x1_new)

plot(lm_x2_new)

Model with x2 Alone

  • Significance: x2 is significant by itself, but its large coefficient (2.90 before and 3.12 after adding the new point, vs. the true 0.3) mostly reflects its correlation with x1 rather than an effect of that size.
  • Collinearity: multicollinearity distorts the estimates when the two predictors are used together.

Leverage and Residuals for the New Observation

  • Full model (x1 & x2): Leverage = 0.3477, Residual = 2.6926 (high leverage and a borderline outlier)
  • x1-only model: Leverage = 0.0303, Residual = 3.8698 (an outlier, but not high leverage)
  • x2-only model: Leverage = 0.1102, Residual = 1.3398 (high leverage, but not an outlier); the sketch below shows how such values can be computed.
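
A minimal sketch of how leverage and studentized-residual values like those above can be computed for the added observation (assuming the quoted numbers were obtained this way or with a similar convention):

# Leverage and studentized residual of the 101st (added) observation in each refit
i <- 101
sapply(list(full = lm_fit_new, x1_only = lm_x1_new, x2_only = lm_x2_new),
       function(m) c(leverage = unname(hatvalues(m)[i]),
                     rstudent = unname(rstudent(m)[i])))
# Average leverage is (p + 1)/n, so values several times larger count as high leverage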

Model Comparisons

  • Full Model (x1 & x2): The new observation significantly influences the regression (high leverage, outlier).
  • x1 Model: Outlier with high residual but not high leverage.
  • x2 Model: High leverage but not an outlier; the point actually raises the apparent fit of the x2-only model (R-squared increases from 0.176 to 0.212).

Conclusion

The new observation shows varying effects depending on the model used. This highlights the importance of performing careful diagnostic checks and considering multicollinearity in regression analysis to ensure reliable results.