i. Is there a relationship between the predictor and the response?
ii. How strong is the relationship between the predictor and the response?
iii. Is the relationship between the predictor and the response positive or negative?
iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
Plot the response and the predictor in a new set of axes ax. Use the ax.axline() method or the abline() function defined in the lab to display the least squares regression line.
Produce some of the diagnostic plots of the least squares regression fit as described in the lab. Comment on any problems you see with the fit.
# Load necessary library
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.2
# Load the Auto dataset
data(Auto)
lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
(a) Analysis of Regression Output
i. Is there a relationship between horsepower and mpg?
Yes. The p-value for horsepower (Pr(>|t|) < 2e-16) is extremely small (well below 0.05), indicating a statistically significant relationship between horsepower and mpg.
ii. How strong is the relationship?
The R-squared value = 0.606, meaning that 60.6% of the variability in mpg is explained by horsepower. This indicates a moderately strong relationship.
iii. Is the relationship positive or negative?
The coefficient for horsepower is -0.1578, meaning that as horsepower increases, mpg decreases. This confirms a negative relationship (a quick numerical check of the strength and direction follows below).
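As an optional check (not part of the original output), the correlation between mpg and horsepower summarizes both the strength and the direction of the relationship; in simple linear regression it equals the signed square root of R-squared.
# Optional check: correlation between mpg and horsepower.
# For a simple linear regression, cor(y, x)^2 equals the model's R-squared.
cor(Auto$mpg, Auto$horsepower)   # about -0.78: a fairly strong, negative association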
# Predict mpg for horsepower = 98 with confidence and prediction intervals
new_data <- data.frame(horsepower = 98)
predict(lm_fit, newdata = new_data, interval = "confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(lm_fit, newdata = new_data, interval = "prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
This means that a car with 98 horsepower is predicted to have an mpg of approximately 24.47.
Confidence Interval (23.97 to 24.96): We are 95% confident that the average mpg for cars with horsepower = 98 falls in this range.
Prediction Interval (14.81 to 34.12): If we pick a random car with horsepower = 98, its mpg is expected to fall within this range 95% of the time. The wider interval accounts for individual car variability.
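For completeness, the predict() results can be reproduced by hand from the fitted coefficients and the unscaled covariance matrix; this is only an optional sketch of the underlying formulas, not part of the original solution.
# Optional sketch: reproduce the fit and intervals for horsepower = 98 by hand.
x0 <- c(1, 98)                                       # intercept term and horsepower value
fit_hat <- sum(coef(lm_fit) * x0)                    # point prediction, about 24.47
s2 <- summary(lm_fit)$sigma^2                        # estimated residual variance
XtX_inv <- summary(lm_fit)$cov.unscaled              # (X'X)^{-1}
se_mean <- sqrt(s2 * t(x0) %*% XtX_inv %*% x0)       # SE of the mean response
se_pred <- sqrt(s2 * (1 + t(x0) %*% XtX_inv %*% x0)) # SE of a new observation
tcrit <- qt(0.975, df = lm_fit$df.residual)
c(fit_hat - tcrit * se_mean, fit_hat + tcrit * se_mean)  # about (23.97, 24.96)
c(fit_hat - tcrit * se_pred, fit_hat + tcrit * se_pred)  # about (14.81, 34.12)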
b. Response and the Predictor plot with Regression line
plot(Auto$horsepower, Auto$mpg,
main = "MPG vs Horsepower",
xlab = "Horsepower",
ylab = "MPG",
pch = 16, col = "blue")
# Add regression line
abline(lm_fit, col = "red", lwd = 2)
c. Diagnostic plots
par(mfrow = c(2, 2))
plot(lm_fit)
Interpretation of Diagnostic Plots
1. Residuals vs. Fitted Plot (Top Left)
Shows non-random patterns, indicating a potential non-linearity issue.
Residuals fan out, suggesting heteroscedasticity (variance is not constant). A curved pattern suggests the model might be missing a nonlinear term.
2. Normal Q-Q Plot (Top Right)
Residuals deviate strongly from the 45-degree line, especially at the tails.
Suggests non-normality of residuals.
Indicates that a log transformation or nonlinear modeling may be needed.
3. Scale-Location Plot (Bottom Left)
Increasing spread of standardized residuals suggests heteroscedasticity.
The variance of residuals increases with fitted values, indicating that higher predicted mpg values have higher variance.
Potential solution: Use weighted least squares regression or transform the response variable.
4. Residuals vs. Leverage Plot (Bottom Right)
Identifies high-leverage points (points with large leverage values).
Observations 115, 6, 83, 94 seem to have high leverage, meaning they have an unusual predictor combination.
Some points have high Cook’s distance, indicating they strongly influence the regression.
Summary of Issues:
- Non-linearity: a polynomial or interaction term may be needed (a sketch of one such refit follows below).
- Non-normal residuals: a transformation (log or square root) could help.
- Heteroscedasticity: consider weighted regression or robust standard errors.
- Influential points: further investigation is needed to determine whether these points are true outliers.
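As a sketch of one possible remedy (not part of the original solution), a quadratic term in horsepower can be added and compared with the linear fit:
# Sketch of a possible remedy for the non-linearity: add a quadratic term.
quad_fit <- lm(mpg ~ poly(horsepower, 2), data = Auto)
summary(quad_fit)$r.squared      # should exceed the linear fit's 0.606
anova(lm_fit, quad_fit)          # F-test for the added quadratic term
par(mfrow = c(2, 2))
plot(quad_fit)                   # the curvature in the residuals should largely disappear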
Fit a multiple regression model to predict Sales using Price, Urban, and US.
Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
Write out the model in equation form, being careful to handle the qualitative variables properly.
For which of the predictors can you reject the null hypothesis H0 : βj = 0?
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
How well do the models in (a) and (e) fit the data?
Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
Is there evidence of outliers or high leverage observations in the model from (e)?
# Load the Carseats dataset
data(Carseats)
# (a) Fit a multiple regression model
lm_fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm_fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
b. Interpretation of Coefficients
Intercept = 13.0435
This represents the expected Sales when Price = 0, and both Urban and US are 0 (non-urban, non-US stores).
Price = -0.0545
A negative coefficient indicates that for each $1 increase in Price, Sales decrease by about 0.0545 (in thousands of units sold, roughly 54 units), holding Urban and US fixed.
Highly significant (p-value < 0.0001).
Urban = -0.0219
The coefficient is very small and not statistically significant (p-value = 0.936).
This suggests that being in an urban area does not significantly impact sales.
US = 1.2006
A positive coefficient suggests that US stores sell about 1.2 (thousand) more units on average than otherwise comparable non-US stores.
Highly significant (p-value < 0.0001).
c. Regression Equation
Sales = 13.0435 - 0.0545*Price - 0.0219*UrbanYes + 1.2006*USYes
where UrbanYes = 1 if the store is in an urban location (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).
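As an optional check of the dummy coding (not required by the exercise), R's model matrix and factor contrasts show exactly how Urban and US enter the model as 0/1 indicators:
# Optional check: how R encodes the qualitative predictors as 0/1 dummies.
head(model.matrix(Sales ~ Price + Urban + US, data = Carseats))
contrasts(Carseats$Urban)   # UrbanYes: 1 for "Yes", 0 for the baseline "No"
contrasts(Carseats$US)      # USYes: 1 for "Yes", 0 for the baseline "No"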
(d) Hypothesis Testing: Significant Predictors
summary(lm_fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.04346894 0.651012245 20.03567373 3.626602e-62
## Price -0.05445885 0.005241855 -10.38923205 1.609917e-22
## UrbanYes -0.02191615 0.271650277 -0.08067781 9.357389e-01
## USYes 1.20057270 0.259041508 4.63467306 4.860245e-06
Reject the null hypothesis for:
- Price (p < 0.001)
- US (p < 0.001)
Fail to reject the null hypothesis for:
- Urban (p = 0.936)
e. Fit reduced model with only significant predictors
lm_fit_reduced <- lm(Sales ~ Price + US, data = Carseats)
summary(lm_fit_reduced)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
f. How well do the models in (a) and (e) fit the data?
#Compare model fits
anova(lm_fit, lm_fit_reduced)
## Analysis of Variance Table
##
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 396 2420.8
## 2 397 2420.9 -1 -0.03979 0.0065 0.9357
Full model: R² = 0.2393 (adjusted R² = 0.2335)
Reduced model (without Urban): R² = 0.2393 (adjusted R² = 0.2354)
Since removing Urban does not reduce the explanatory power (the anova F-test gives p = 0.9357), the reduced model is preferred: it is simpler and has a slightly higher adjusted R².
g. 95% confidence intervals
# (g) Obtain 95% confidence intervals for the coefficients in the reduced model
confint(lm_fit_reduced)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
(h) Detecting Outliers & High Leverage Points
par(mfrow = c(2, 2))
plot(lm_fit_reduced)
Observations from the Plots:
Outliers (Standardized Residuals beyond ±2 or ±3)
Some points exceed ±2 on the y-axis, indicating potential outliers.
A few extreme points below -2.5 and above +2.5 may need closer inspection.
Extreme outliers could be investigated to see if they are valid data points or data entry errors.
High Leverage Points (Far-right points on the x-axis)
No points appear to have very high leverage (none rise much above the 0.04–0.05 range).
Since leverage measures how much an observation influences the model, this suggests that no single point strongly dominates the regression fit.
Conclusion:
- There is evidence of some outliers, as seen from standardized residuals exceeding ±2 (a quick numerical check follows below).
- No single observation appears to dominate the fit through high leverage.
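A short numerical check of this visual impression (an optional sketch, using the usual rule-of-thumb cutoffs):
# Optional numerical check of the diagnostic plots for the reduced model.
stud_res <- rstudent(lm_fit_reduced)       # studentized residuals
lev <- hatvalues(lm_fit_reduced)           # leverage values
sum(abs(stud_res) > 2)                     # count of potential outliers
avg_lev <- (2 + 1) / nrow(Carseats)        # average leverage (p + 1)/n with p = 2
which(lev > 2 * avg_lev)                   # observations above twice the average leverage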
Perform the following commands in R:
> set.seed(1)
> x1 <- runif(100)
> x2 <- 0.5 * x1 + rnorm(100) / 10
> y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are βˆ0, βˆ1, and βˆ2? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0 : β1 = 0? How about the null hypothesis H0 : β2 = 0?
Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0 : β1 = 0?
Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0 : β1 = 0?
Do the results obtained in (c)–(e) contradict each other? Explain your answer.
Suppose we obtain one additional observation, which was unfortunately mismeasured. We use the function np.concatenate() to add this additional observation to each of x1, x2 and y.
x1 = np.concatenate([x1, [0.1]])
x2 = np.concatenate([x2, [0.8]])
y = np.concatenate([y, [6]])
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
library(ggplot2)
library(MASS)
set.seed(10)
x1 <- runif(100, 0, 1)
x2 <- 0.5 * x1 + rnorm(100, sd = 0.1)
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
(a) The linear model is:
y = 2 + 2*x1 + 0.3*x2 + ε
True regression coefficients: β0 = 2, β1 = 2, β2 = 0.3
data <- data.frame(x1, x2, y)
(b) Correlation and scatterplot
# (b) Compute correlation between x1 and x2
cor(x1, x2)
## [1] 0.8258202
# Scatter plot
ggplot(data, aes(x = x1, y = x2)) +
geom_point() +
labs(title = "Scatterplot of x1 vs x2")
(c) Fit least squares regression with both x1 and x2
double_lm <- lm(y ~ x1 + x2, data = data)
summary(double_lm)
##
## Call:
## lm(formula = y ~ x1 + x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.02507 -0.64102 -0.04811 0.64382 2.09265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7651 0.1876 9.409 2.54e-15 ***
## x1 3.1604 0.6444 4.904 3.77e-06 ***
## x2 -1.3863 0.9787 -1.417 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9377 on 97 degrees of freedom
## Multiple R-squared: 0.321, Adjusted R-squared: 0.307
## F-statistic: 22.93 on 2 and 97 DF, p-value: 7.002e-09
Relationship to true coefficients (β0, β1, β2):
True β0 = 2, Estimated βˆ0 ≈ 1.7651 (reasonably close)
True β1 = 2, Estimated βˆ1 ≈ 3.1604 (overestimated)
True β2 = 0.3, Estimated βˆ2 ≈ -1.3863 (far from the true value, with the wrong sign)
The intercept is recovered reasonably well, but the slope estimates are pulled away from their true values by the random noise in the data and, above all, by the strong collinearity between x1 and x2 (correlation ≈ 0.83).
Hypothesis tests:
For H0: β1 = 0 (x1 coefficient):
t-statistic = 4.904
p-value = 3.77e-06 (< 0.05). We can reject the null hypothesis for β1. There is strong evidence that x1 has a significant effect on y.
For H0: β2 = 0 (x2 coefficient):
t-statistic = -1.417
p-value = 0.16 (> 0.05)
We cannot reject the null hypothesis for β2. There is not enough evidence to conclude that x2 has a significant effect on y in this model; the confidence intervals computed below make this concrete.
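An optional check (not shown in the original output): the coefficient confidence intervals make the effect of collinearity concrete, since the interval for x2 is very wide.
# Optional check: 95% confidence intervals for the coefficients of the full model.
confint(double_lm)   # the x2 interval is wide and covers both 0 and the true value 0.3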
(d) Fit least squares regression with only x1
lm_x1 <- lm(y ~ x1, data = data)
summary(lm_x1)
##
## Call:
## lm(formula = y ~ x1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14104 -0.68538 -0.07084 0.71946 2.09537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7858 0.1880 9.499 1.48e-15 ***
## x1 2.4065 0.3653 6.588 2.23e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9425 on 98 degrees of freedom
## Multiple R-squared: 0.307, Adjusted R-squared: 0.2999
## F-statistic: 43.41 on 1 and 98 DF, p-value: 2.235e-09
Comments on the results:
Coefficient significance: We can strongly reject the null hypothesis H0: β1 = 0. The p-value for x1 is extremely small (p < 0.001), indicating that x1 is a highly significant predictor of y.
Coefficient estimate: The estimated coefficient for x1 (2.4065) is fairly close to the true value of β1 (2) used to generate the data. This suggests that the model captures the relationship between x1 and y reasonably well.
Model fit: The R-squared value of 0.307 indicates that about 30.7% of the variance in y is explained by x1 alone. This is only slightly lower than the R-squared (0.321) of the model with both x1 and x2, suggesting that adding x2 to the model doesn't substantially improve its explanatory power.
Overall model significance: The F-statistic (43.41) with a very low p-value (2.235e-09) indicates that the overall model is highly significant.
Comparison with the previous model: The coefficient for x1 in this model (2.4065) is closer to the true value (2) than in the model with both x1 and x2 (3.1604). This suggests that the collinearity between x1 and x2 in the previous model was affecting the estimation of the coefficients.
In conclusion, we can confidently reject the null hypothesis H0: β1 = 0. The model with only x1 performs well in predicting y, capturing the relationship accurately and explaining a significant portion of the variance in y. This model also avoids the collinearity issues present when both x1 and x2 are included.
(e) Fit least squares regression with only x2
lm_x2 <- lm(y ~ x2, data = data)
summary(lm_x2)
##
## Call:
## lm(formula = y ~ x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5776 -0.7731 0.1079 0.7634 2.4292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2717 0.1741 13.051 < 2e-16 ***
## x2 2.5773 0.6134 4.202 5.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.042 on 98 degrees of freedom
## Multiple R-squared: 0.1527, Adjusted R-squared: 0.144
## F-statistic: 17.65 on 1 and 98 DF, p-value: 5.847e-05
Comments on the results:
Coefficient significance: We can strongly reject the null hypothesis H0: β1 = 0 for x2. The p-value is extremely small (p < 0.001), indicating that x2 is a highly significant predictor of y when considered alone.
Coefficient estimate: The estimated coefficient for x2 (2.5773) is much larger than the true value of β2 (0.3) used to generate the data. This overestimation is likely due to the collinearity between x1 and x2, with x2 capturing some of the effect that should be attributed to x1.
Model fit: The R-squared value of 0.153 indicates that about 15.3% of the variance in y is explained by x2 alone. This is lower than the R-squared values for the models with x1 alone (0.307) and with both x1 and x2 (0.321).
Overall model significance: The F-statistic (17.65) with a very low p-value (5.847e-05) indicates that the overall model is highly significant.
Comparison with other models: Despite x2 having a smaller true coefficient (0.3) in the data generation process, it appears as a significant predictor when used alone. This is likely due to its correlation with x1, which has a stronger true effect on y.
In conclusion, we can confidently reject the null hypothesis H0: β1 = 0 for x2. However, the results of this model should be interpreted cautiously. The significance of x2 and its large coefficient are likely due to its correlation with x1 rather than its direct effect on y. This demonstrates how collinearity can lead to misleading results when variables are analyzed in isolation.
(f)
The above results don’t contradict each other but rather demonstrate the effects of multicollinearity:
When correlated predictors are used together, their individual significance can be masked.
When used separately, correlated predictors can appear more significant than they truly are, as they proxy for the omitted correlated variable.
The overall model fit barely changes when x2 is added to a model that already contains x1 (R-squared 0.307 vs. 0.321), indicating that the predictors contain largely overlapping information.
This scenario highlights the importance of considering multicollinearity in regression analysis and the potential pitfalls of relying completely on individual significance tests in the presence of correlated predictors.
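One standard way to quantify this collinearity is the variance inflation factor. A short sketch using the car package (which the original code does not load):
# Sketch: variance inflation factors for the model with both predictors.
# Requires the car package, which is not loaded in the original code.
library(car)
vif(double_lm)   # with two predictors, VIF = 1/(1 - cor(x1, x2)^2), about 3.1 here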
(g) Re-fit models with new data
# (g) Introduce outlier and high leverage point
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
data_new <- data.frame(x1, x2, y)
# Re-fit models with new data
double_lm_new <- lm(y ~ x1 + x2, data = data_new)
lm_x1_new <- lm(y ~ x1, data = data_new)
lm_x2_new <- lm(y ~ x2, data = data_new)
# Compare models
summary(double_lm_new)
##
## Call:
## lm(formula = y ~ x1 + x2, data = data_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3001 -0.8156 -0.0130 0.7543 3.0327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9029 0.1995 9.540 1.21e-15 ***
## x1 1.6371 0.5730 2.857 0.00523 **
## x2 1.1258 0.8324 1.353 0.17932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.013 on 98 degrees of freedom
## Multiple R-squared: 0.2574, Adjusted R-squared: 0.2423
## F-statistic: 16.99 on 2 and 98 DF, p-value: 4.636e-07
summary(lm_x1_new)
##
## Call:
## lm(formula = y ~ x1, data = data_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2238 -0.7532 -0.0511 0.6951 3.8657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9137 0.2001 9.562 9.88e-16 ***
## x1 2.2061 0.3907 5.646 1.57e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.017 on 99 degrees of freedom
## Multiple R-squared: 0.2436, Adjusted R-squared: 0.2359
## F-statistic: 31.88 on 1 and 99 DF, p-value: 1.572e-07
summary(lm_x2_new)
##
## Call:
## lm(formula = y ~ x2, data = data_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.58188 -0.81333 0.07991 0.74748 2.37107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2197 0.1717 12.928 < 2e-16 ***
## x2 2.8716 0.5853 4.906 3.65e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.049 on 99 degrees of freedom
## Multiple R-squared: 0.1956, Adjusted R-squared: 0.1875
## F-statistic: 24.07 on 1 and 99 DF, p-value: 3.65e-06
# Diagnostic plots for new model
par(mfrow = c(2, 2))
plot(double_lm_new)
Leverage and studentized residuals for the new observation (observation 101):
- Both x1 and x2: leverage = 0.3477, studentized residual = 2.6926
- Only x1: leverage = 0.0303, studentized residual = 3.8698
- Only x2: leverage = 0.1102, studentized residual = 1.3398
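A sketch of how these leverage and studentized-residual values can be computed for the added point (treat the exact numbers quoted above as coming from the author's run):
# Sketch: leverage and studentized residual of the added point (observation 101).
sapply(list(both = double_lm_new, x1_only = lm_x1_new, x2_only = lm_x2_new),
       function(m) c(leverage = unname(hatvalues(m)[101]),
                     stud_resid = unname(rstudent(m)[101])))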
Models comparison:
Model with both x1 and x2:
Effect: The coefficient for x2 increased and changed sign (from -1.3863 to 1.1258), while x1's coefficient dropped sharply (from 3.1604 to 1.6371). The R-squared decreased from 0.321 to 0.257.
Leverage = 0.3477, which is much higher than 2(p+1)/n = 2(3)/101 ≈ 0.0594
Studentized residual = 2.6926, which is greater than 2
Conclusion: This point is both a high-leverage point and an outlier.
Model with only x1:
Effect: The coefficient for x1 decreased slightly (from 2.4065 to 2.2061). The R-squared decreased from 0.307 to 0.244.
Leverage = 0.0303, which is lower than 2(p+1)/n = 2(2)/101 ≈ 0.0396
Studentized residual = 3.8698, which is much greater than 2
Conclusion: This point is an outlier but not a high-leverage point.
Model with only x2:
Effect: The coefficient for x2 increased (from 2.5773 to 2.8716). The R-squared increased from 0.153 to 0.196.
Leverage = 0.1102, which is higher than 2(p+1)/n = 2(2)/101 ≈ 0.0396
Studentized residual = 1.3398, which is less than 2
Conclusion: This point is a high-leverage point but not an outlier.
Overall, the new observation has different effects on each model:
In the full model (x1 and x2), it’s both influential and an outlier, significantly changing the balance between x1 and x2’s effects.
In the x1-only model, it’s a strong outlier but not influential in terms of leverage, reducing the overall model fit.
In the x2-only model, it’s influential due to high leverage but not an outlier, improving the model fit for x2.
These results demonstrate how a single unusual observation can have complex and varying effects on different regression models, highlighting the importance of careful data examination and model diagnostics in regression analysis.
————————————— End ——————————–