Lab

Chapter 3

Question 8

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.4.2

(a).

str(Auto)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

# Fit simple linear regression model
lm_model <- lm(mpg ~ horsepower, data = Auto)

# Display regression summary
summary(lm_model)

## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Interpretation of Results:

Existence of Relationship
- The p-value for horsepower is below 2e-16 (t = -24.49), so there is strong evidence that horsepower is associated with mpg.
- The F-statistic (599.7 on 1 and 390 DF) gives the same conclusion for the model as a whole.
Strength of Relationship
- R-squared = 0.6059: horsepower explains about 60.6% of the variation in mpg.
- The residual standard error is 4.906 mpg, which is the typical size of the deviation from the fitted line.
Nature of Relationship (Positive/Negative)
- The horsepower coefficient is -0.157845: each additional unit of horsepower is associated with a drop of about 0.158 mpg on average.
- The relationship is therefore negative.
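
The quantities discussed above can also be pulled out of the fitted model directly rather than read off the printout. This is a minimal sketch, assuming the ISLR package is installed; the variable names (fit, s, slope and so on) are illustrative and not from the original lab. The last value expresses the residual standard error as a percentage of mean mpg, one common way to judge its size.

```r
library(ISLR)  # provides the Auto data set

fit <- lm(mpg ~ horsepower, data = Auto)
s   <- summary(fit)

# Row "horsepower" of the coefficient table holds estimate, SE, t value, p-value
slope   <- s$coefficients["horsepower", "Estimate"]
p_value <- s$coefficients["horsepower", "Pr(>|t|)"]

r_squared <- s$r.squared   # proportion of variance explained
rse       <- s$sigma       # residual standard error

# Residual standard error relative to the mean response, in percent
rse_pct <- 100 * rse / mean(Auto$mpg)

print(c(slope = slope, p_value = p_value, r_squared = r_squared,
        rse = rse, rse_pct = rse_pct))
```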
Prediction for Horsepower = 98

predict(lm_model, newdata = data.frame(horsepower = 98), interval = "confidence")

##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108

predict(lm_model, newdata = data.frame(horsepower = 98), interval = "prediction")

##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476

Predicted mpg for horsepower = 98: 24.47

95% Confidence Interval (CI): (23.97, 24.96)
- This is the range of plausible values for the average mpg of all cars with 98 HP.
95% Prediction Interval (PI): (14.81, 34.12)
- This is the range in which the mpg of a single car with 98 HP is expected to fall.
Key Takeaway:
- The CI is narrow (about ±0.49 mpg), so the mean mpg at 98 HP is estimated precisely.
- The PI is much wider (about ±9.66 mpg) because it also accounts for car-to-car variation around that mean.
- Individual cars with the same horsepower can therefore differ widely in mpg.
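
The difference in width follows from the standard formulas for the two intervals in simple linear regression. The notation below is introduced here and does not come from the lab output: x_0 = 98 is the new horsepower value, \hat{\sigma} = 4.906 is the residual standard error, n = 392, and t is the 97.5% quantile of the t distribution with n - 2 = 390 degrees of freedom.

\[ \text{CI: } \hat{y}_0 \pm t \, \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \]

\[ \text{PI: } \hat{y}_0 \pm t \, \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \]

The extra 1 under the square root is the contribution of a single car's error term. It does not shrink as n grows, which is why the prediction interval stays far wider than the confidence interval.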

(b).

plot(Auto$horsepower, Auto$mpg, main = "MPG vs. Horsepower",
     xlab = "Horsepower", ylab = "MPG", col = "red", pch = 19)

# Add regression line
abline(lm_model, col = "blue", lwd = 2)

Scatter Plot Analysis:
- Shows a negative relationship between horsepower and mpg.
- As horsepower increases, mpg decreases.
Regression Line (Blue Line):
- Represents the best-fit linear model.
- Confirms the downward trend, indicating that cars with higher horsepower tend to be less fuel-efficient.

Key Takeaway: The plot agrees with the regression results: mpg falls as horsepower rises. A scatter plot cannot establish significance on its own; that comes from the p-value in (a).

(c).

par(mfrow = c(2, 2))  
plot(lm_model)

1. Residuals vs. Fitted Plot:

A curved pattern suggests non-linearity in the relationship.
Residuals are not randomly scattered, indicating that a linear model may not be the best fit.

2. Q-Q Plot:

Residuals mostly follow the normal line, but some deviations in the tails indicate potential non-normality.
This suggests possible outliers affecting the model.

3. Scale-Location Plot:

Increasing spread indicates heteroscedasticity (non-constant variance).
This means prediction errors vary across different fitted values.

4. Residuals vs. Leverage Plot:

A few high-leverage points are present, meaning some observations strongly influence the model.
These should be investigated to check for outliers or influential data points.

Key Takeaway: The model shows signs of non-linearity, heteroscedasticity, and potential outliers, suggesting that a transformation or a more flexible model (e.g., polynomial regression) may improve the fit.
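
One way to follow up on the curvature is to add a squared horsepower term and test whether it improves on the straight-line fit. This is a sketch of that idea and not part of the original answer; it assumes the ISLR package is installed, and fit_lin, fit_quad and cmp are illustrative names.

```r
library(ISLR)

fit_lin  <- lm(mpg ~ horsepower, data = Auto)
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)

# F-test of the nested models: a small p-value favours keeping the squared term
cmp <- anova(fit_lin, fit_quad)
print(cmp)

# R-squared before and after adding the squared term
print(c(linear    = summary(fit_lin)$r.squared,
        quadratic = summary(fit_quad)$r.squared))

# Re-check the four diagnostic plots for the more flexible model
par(mfrow = c(2, 2))
plot(fit_quad)
```

If the residuals vs. fitted plot for fit_quad no longer shows a clear curve, the squared term has absorbed most of the non-linearity.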

Question 10

(a).

data(Carseats)

# Fit regression model
lm_full <- lm(Sales ~ Price + Urban + US, data = Carseats)

# View summary
summary(lm_full)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Regression Model: Sales ~ Price + Urban + US
Key Coefficients & Significance:
- Price: -0.054459 (p < 0.001) → Significant negative association with Sales. Holding Urban and US fixed, a one-unit increase in Price is associated with a decrease of about 0.054 in Sales.
- Urban (Yes/No): -0.021916 (p = 0.936) → Not significant: there is no evidence that Sales differ between urban and rural stores.
- US (Yes/No): 1.200573 (p < 0.001) → Significant positive association. At the same Price, US stores have Sales about 1.2 higher on average than non-US stores.
Model Performance:
- R-squared = 0.2393 → The model explains ~23.9% of the variation in Sales, so most of the variation is left unexplained.
- F-statistic = 41.52 (p < 2.2e-16) → The overall model is statistically significant.

Key Takeaway:

Price is significantly and negatively associated with Sales.
US stores have higher Sales on average, while there is no evidence of an urban/rural difference.
The model explains only about a quarter of the variance in Sales.

(b).

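Price: Each one-unit increase in Price is associated with a change of -0.054459 in Sales, holding Urban and US fixed. Sales is recorded in thousands of units in the ISLR Carseats data, so this is roughly 54 fewer units sold.
Urban: Urban stores differ from rural stores by -0.021916 on average at the same Price and US status, a difference that is not distinguishable from zero (p = 0.936).
US: US stores have Sales 1.200573 higher on average than non-US stores at the same Price and Urban status (roughly 1,201 more units).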

(c).

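\[ Sales = \beta_0 + \beta_1 Price + \beta_2 UrbanYes + \beta_3 USYes + \varepsilon \]

With the estimates from (a), the fitted equation is:

\[ \widehat{Sales} = 13.043 - 0.054 \, Price - 0.022 \, UrbanYes + 1.201 \, USYes \]

UrbanYes = 1 (Urban), 0 (Rural).
USYes = 1 (US), 0 (Non-US).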
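
To see how R turns the Urban and US factors into the 0/1 indicators used in this equation, the dummy coding and the design matrix can be inspected. This is a small sketch, assuming the ISLR package is installed; X is an illustrative name.

```r
library(ISLR)

# Dummy coding R uses for each factor: the baseline level is coded 0
print(contrasts(Carseats$Urban))
print(contrasts(Carseats$US))

# Design matrix of the model: one column per coefficient
X <- model.matrix(Sales ~ Price + Urban + US, data = Carseats)
print(head(X))
```

The column names of X match the coefficient names in the summary output, which is where the labels UrbanYes and USYes come from.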

(d).

summary(lm_full)$coefficients

##                Estimate  Std. Error      t value     Pr(>|t|)
## (Intercept) 13.04346894 0.651012245  20.03567373 3.626602e-62
## Price       -0.05445885 0.005241855 -10.38923205 1.609917e-22
## UrbanYes    -0.02191615 0.271650277  -0.08067781 9.357389e-01
## USYes        1.20057270 0.259041508   4.63467306 4.860245e-06

Significant Predictors (p-value < 0.05):
- Price (p < 0.001) → Significant negative effect on Sales. As Price increases, Sales decrease.
- US (Yes/No) (p < 0.001) → Significant positive effect. Stores in the US have higher Sales than non-US stores.
Non-Significant Predictor:
- Urban (Yes/No) (p = 0.936) → Not significant: the data give no evidence that Sales differ between urban and rural stores.

Key Takeaway:

Price and US location are significantly associated with Sales.
There is no evidence that Urban is associated with Sales, so it is the one candidate for removal from the model.

(e).

lm_reduced <- lm(Sales ~ Price, data = Carseats)
summary(lm_reduced)

## 
## Call:
## lm(formula = Sales ~ Price, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5224 -1.8442 -0.1459  1.6503  7.5108 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.641915   0.632812  21.558   <2e-16 ***
## Price       -0.053073   0.005354  -9.912   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.532 on 398 degrees of freedom
## Multiple R-squared:  0.198,  Adjusted R-squared:  0.196 
## F-statistic: 98.25 on 1 and 398 DF,  p-value: < 2.2e-16

Regression Model: Sales ~ Price (Reduced Model)
Key Findings:
- Price (p < 0.001) → Highly significant with a negative effect on Sales.
- As Price increases, Sales decrease (β = -0.053073).
Model Performance:
- R-squared = 0.198 → 19.8% of the variation in Sales is explained by Price alone.
- F-statistic = 98.25 (p < 2.2e-16) → The model is highly significant overall.

Key Takeaway:

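Price remains a strong predictor of Sales on its own.
Dropping Urban and US lowers R-squared from 0.2393 to 0.198 and raises the residual standard error from 2.472 to 2.532. Because US was significant in the full model, most of that loss is likely due to removing US rather than Urban.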

(f).

anova(lm_full, lm_reduced)

## Analysis of Variance Table
## 
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    396 2420.8                                  
## 2    398 2552.2 -2   -131.41 10.748 2.848e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing Models: Sales ~ Price + Urban + US vs. Sales ~ Price
Key Findings from ANOVA Test:
- p-value = 2.848e-05 (< 0.05) → The full model fits significantly better than the reduced model.
- F-statistic = 10.748 → Urban and US jointly add explanatory power beyond Price.
- The Residual Sum of Squares (RSS) rises from 2420.8 in the full model to 2552.2 in the reduced model, indicating a worse fit.

Key Takeaway:

The full model (Price + Urban + US) is statistically better than using Price alone.
This test drops Urban and US together. Since Urban was not significant on its own (p = 0.936) while US was (p < 0.001), the improvement is most plausibly driven by US.
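
The comparison above drops Urban and US at the same time, so it cannot say which of the two matters. The sketch below shows how they could be separated with two further nested comparisons, assuming the ISLR package is installed; lm_price_us is a hypothetical intermediate model that is not fitted in the original answer.

```r
library(ISLR)

lm_full     <- lm(Sales ~ Price + Urban + US, data = Carseats)
lm_price_us <- lm(Sales ~ Price + US, data = Carseats)   # drops Urban only
lm_reduced  <- lm(Sales ~ Price, data = Carseats)        # drops Urban and US

# Does Urban add anything once Price and US are in the model?
urban_test <- anova(lm_price_us, lm_full)
print(urban_test)

# Does US add anything beyond Price alone?
us_test <- anova(lm_reduced, lm_price_us)
print(us_test)
```

For a single added term, the F-test in anova() is equivalent to the t-test for that term in the larger model, so the first comparison reproduces the Urban p-value from (d).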

(g).

confint(lm_reduced)

##                  2.5 %      97.5 %
## (Intercept) 12.3978438 14.88598655
## Price       -0.0635995 -0.04254653

95% Confidence Intervals for Coefficients (Sales ~ Price):
- Intercept (12.40, 14.89) → When Price = 0, expected Sales falls between 12.40 and 14.89 with 95% confidence.
- Price (-0.0636, -0.0425) → With 95% confidence, each one-unit increase in Price is associated with a decrease in Sales of between 0.0425 and 0.0636.

Key Takeaway:

The confidence interval for Price does not include zero, indicating that Price significantly affects Sales.
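
The interval for Price can be reproduced by hand from the estimate and standard error printed in (e), which shows what confint() computes. This is a minimal check using only the numbers reported above; the variable names are illustrative.

```r
# Values copied from the summary(lm_reduced) output
estimate  <- -0.053073
std_error <- 0.005354
df_resid  <- 398

# 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
print(round(ci, 4))
```

Up to rounding of the copied inputs, the result should agree with the Price row of the confint() output.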

(h).

par(mfrow = c(2, 2))
plot(lm_reduced)

Residuals vs. Fitted Plot:
- Residuals appear randomly scattered, suggesting no major non-linearity in the model.
Q-Q Plot:
- Residuals mostly follow the normal line, but slight deviations in the tails indicate minor non-normality.
Scale-Location Plot:
- No strong pattern, suggesting homoscedasticity (constant variance) is mostly satisfied.
Residuals vs. Leverage Plot:
- A few high-leverage points exist, but no extreme outliers are significantly influencing the model.

Key Takeaway:

The model assumptions are mostly met, with minor deviations in normality and leverage.
No strong evidence of outliers or major issues in model fit.

Question 14

(a).

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

lm_model <- lm(y ~ x1 + x2)
summary(lm_model)

## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

Regression Model: y ~ x1 + x2
Key Findings:
- x1 coefficient = 1.4396, p = 0.0487 → Significant at the 5% level, but only narrowly.
- x2 coefficient = 1.0097, p = 0.3754 → Not significant: H0: β2 = 0 cannot be rejected.
- Intercept = 2.1305 → Represents the expected value of y when x1 = 0 and x2 = 0.
Model Performance:
- R-squared = 0.2088 → The model explains ~20.88% of the variation in y, indicating a moderate fit.
- F-statistic = 12.8, p-value = 1.164e-05 → The overall model is statistically significant.

Key Takeaway:

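At the 5% level, x1 is (barely) significant and x2 is not.
The data were generated with true coefficients 2 (intercept), 2 (x1) and 0.3 (x2). The estimates 2.1305, 1.4396 and 1.0097 are noticeably off for x1 and x2 and come with large standard errors, which points to collinearity between the two predictors.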
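
Because the data are simulated, the estimates can be set side by side with the coefficients used to generate y (2, 2 and 0.3). The sketch below re-creates the data with the same seed and adds 95% confidence intervals; fit, truth and comparison are illustrative names.

```r
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y  <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

fit <- lm(y ~ x1 + x2)

# True coefficients from the line that generates y
truth <- c("(Intercept)" = 2, x1 = 2, x2 = 0.3)

# Truth, estimate and 95% interval in one table
comparison <- cbind(truth = truth, estimate = coef(fit), confint(fit))
print(round(comparison, 3))
```

Wide intervals for x1 and x2 are the practical sign that the model cannot pin down the two effects separately.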

(b).

cor(x1, x2)

## [1] 0.8351212

plot(x1, x2, main = "Correlation Between x1 and x2", col = "green", pch = 19)

Correlation Between x1 and x2
- cor(x1, x2) = 0.835, a strong positive linear relationship between x1 and x2.
- This is by construction: x2 was generated as 0.5 * x1 plus a small amount of noise, so including both in a regression invites multicollinearity.
Scatter Plot Analysis:
- The points follow an upward trend, reinforcing the positive correlation.
- A correlation of this size means the points stay fairly close to that trend, consistent with strong collinearity.

Key Takeaway:

Since x1 and x2 are highly correlated, including both in a regression model may cause multicollinearity, making it difficult to determine their individual effects on y.

(c).

lm_model <- lm(y ~ x1 + x2)
summary(lm_model)

## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

Regression Model: y ~ x1 + x2
Key Findings:
- x1 (p = 0.0487) → H0: β1 = 0 can be rejected at the 5% level, but only narrowly.
- x2 (p = 0.3754) → H0: β2 = 0 cannot be rejected.
Model Performance:
- R-squared = 0.2088 → About 20.88% of the variation in y is explained by x1 and x2.
- F-statistic (12.8, p < 0.001) → The overall model is statistically significant, despite x2 being insignificant.

Key Takeaway:

x1 shows some predictive power, while x2 adds nothing detectable once x1 is in the model.
Collinearity is the likely reason: because x1 and x2 carry much of the same information, their standard errors are inflated (0.7212 and 1.1337) and the individual effects are hard to separate.

(d).

lm_x1 <- lm(y ~ x1)
summary(lm_x1)

## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

Regression Model: y ~ x1 (Only x1 as a predictor)
Key Findings:
- x1 coefficient = 1.9759, p < 0.001 → Highly significant, meaning x1 has a strong positive effect on y.
- Intercept = 2.1124 → The expected value of y when x1 = 0.
Model Performance:
- R-squared = 0.2024 → x1 alone explains ~20.24% of the variation in y, slightly lower than the multiple regression model.
- F-statistic (24.86, p < 0.001) → The model is statistically significant.

Key Takeaway:

x1 is a strong predictor of y when used alone; its standard error drops from 0.7212 in the two-predictor model to 0.3963.
Removing x2 barely changes the fit (R-squared 0.2024 vs. 0.2088), suggesting x2 was largely redundant given x1 (likely due to multicollinearity).

(e).

lm_x2 <- lm(y ~ x2)
summary(lm_x2)

## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

Regression Model: y ~ x2 (Only x2 as a predictor)
Key Findings:
- x2 coefficient = 2.8996, p < 0.001 → Highly significant, meaning x2 has a strong positive effect on y.
- Intercept = 2.3899 → The expected value of y when x2 = 0.
Model Performance:
- R-squared = 0.1763 → x2 alone explains ~17.63% of the variation in y, slightly lower than using x1.
- F-statistic (20.98, p < 0.001) → The model is statistically significant.

Key Takeaway:

x2 is a significant predictor of y when used alone, unlike in the multiple regression model where it was not.
This suggests multicollinearity between x1 and x2, which caused x2 to appear insignificant in the combined model.

(f).

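The results in (c) to (e) do not contradict each other. x1 and x2 are each significant when used alone, but because they are highly correlated (0.835) the model that contains both cannot separate their individual effects, so the standard errors grow and x2 loses significance.
Solution: Use the VIF (Variance Inflation Factor) to quantify the collinearity.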
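
With only two predictors the VIF can be computed in base R without extra packages: regress one predictor on the other and take 1 / (1 - R^2). The sketch below does this, re-creating x1 and x2 with the seed from (a); r2_x1 and vif_x1 are illustrative names.

```r
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10

# R-squared from regressing x1 on the other predictor
r2_x1 <- summary(lm(x1 ~ x2))$r.squared

# Variance inflation factor for x1 (the same for x2 when there are two predictors)
vif_x1 <- 1 / (1 - r2_x1)
print(vif_x1)

# With a single other predictor, R-squared equals the squared correlation
print(all.equal(r2_x1, cor(x1, x2)^2))
```

With cor(x1, x2) = 0.835 this gives a VIF of roughly 3.3, so each standard error is about 1.8 times larger than it would be with uncorrelated predictors. That is in line with the x1 standard error rising from 0.3963 in (d) to 0.7212 in (c).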

(g).

x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)

lm_new <- lm(y ~ x1 + x2)
summary(lm_new)

## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06

# Check influential points
plot(lm_new, which = 4)

Effect of Adding the New Observation (x1 = 0.1, x2 = 0.8, y = 6)
- The Cook’s Distance plot identifies observation 101 as a highly influential point.
- With x1 = 0.1, the rule x2 = 0.5 * x1 + noise would put x2 near 0.05, so x2 = 0.8 lies far from the other predictor values. This makes the point a high-leverage candidate.
Key Takeaways:
- The new point changes the coefficients substantially: the x1 estimate falls from 1.4396 (p = 0.0487) to 0.5394 (p = 0.3646), while the x2 estimate rises from 1.0097 (p = 0.3754) to 2.5146 (p = 0.0061).
- Which predictor appears significant flips on a single observation, which is typical when predictors are collinear.
- Possible responses: check for data entry errors, consider robust regression, or remove the point if that is justified.

Final Note: The new observation impacts model stability, and caution is needed when interpreting results.
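
The influence of observation 101 can be quantified with base R diagnostics: leverage, studentized residual and Cook's distance, plus a refit without the point. The sketch below re-creates the data from (a) and appends the new observation; fit_new, fit_without and diag_101 are illustrative names.

```r
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y  <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

# Append the new observation as row 101
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y  <- c(y, 6)

fit_new <- lm(y ~ x1 + x2)

# Diagnostics for the added point
diag_101 <- c(leverage = unname(hatvalues(fit_new)[101]),
              rstudent = unname(rstudent(fit_new)[101]),
              cooks_d  = unname(cooks.distance(fit_new)[101]))
print(round(diag_101, 3))

# Average leverage is (p + 1) / n; values far above it indicate high leverage
print(length(coef(fit_new)) / length(y))

# Refit without row 101 to see how much the coefficients move
fit_without <- lm(y ~ x1 + x2, subset = -101)
print(rbind(with_101 = coef(fit_new), without_101 = coef(fit_without)))
```

A leverage far above the average together with a large Cook's distance would support treating the point as influential rather than merely unusual in y.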

2025-02-26
