Chapter 3

Question 8

library(ISLR)

(a).

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
# Fit simple linear regression model
lm_model <- lm(mpg ~ horsepower, data = Auto)

# Display regression summary
summary(lm_model)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Interpretation of Results:

  1. Existence of Relationship

    • The p-value for horsepower in summary(lm_model) is < 2e-16 (t = -24.49), so the null hypothesis of no relationship is rejected.

    • A small p-value is strong evidence that a relationship exists; it does not by itself measure how strong that relationship is.

  2. Strength of Relationship

    • R-squared = 0.6059, so horsepower explains about 60.6% of the variation in mpg.

    • The residual standard error is 4.906 mpg on 390 degrees of freedom.

  3. Nature of Relationship (Positive/Negative)

    • The horsepower coefficient is -0.157845: each additional unit of horsepower is associated with a decrease of about 0.158 mpg on average.

    • The negative sign indicates an inverse relationship.

  4. Prediction for Horsepower = 98

predict(lm_model, newdata = data.frame(horsepower = 98), interval = "confidence")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
predict(lm_model, newdata = data.frame(horsepower = 98), interval = "prediction")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476
  • Predicted mpg for horsepower = 98: 24.47
  • 95% Confidence Interval (CI): (23.97, 24.96)

    • This represents the expected average mpg for all cars with 98 HP.
  • 95% Prediction Interval (PI): (14.81, 34.12)

    • This represents the expected mpg range for an individual car with 98 HP.
  • Key Takeaway:

    • The CI is narrow because it only reflects uncertainty in the estimated mean mpg.

    • The PI is wider because it also includes the variation of individual cars around that mean.

    • Individual cars may have high variability in mpg even with the same horsepower.
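
The two intervals above share the same centre and differ only in their standard error. As a minimal sketch (assuming the ISLR package is installed; the names fit, x0, se_mean, se_pred, conf_int, and pred_int are illustrative and not part of the original analysis), both can be reproduced by hand from the simple-regression formulas:

```r
library(ISLR)  # provides the Auto data set

fit <- lm(mpg ~ horsepower, data = Auto)
x0  <- 98                                  # horsepower value to predict at

n     <- nrow(Auto)
xbar  <- mean(Auto$horsepower)
sxx   <- sum((Auto$horsepower - xbar)^2)
sigma <- summary(fit)$sigma                # residual standard error
tcrit <- qt(0.975, df = n - 2)             # two-sided 95% critical value

fit0 <- unname(coef(fit)[1] + coef(fit)[2] * x0)

# Standard error of the estimated mean response at x0
se_mean <- sigma * sqrt(1 / n + (x0 - xbar)^2 / sxx)
# A single new observation adds the irreducible noise term (the leading 1)
se_pred <- sigma * sqrt(1 + 1 / n + (x0 - xbar)^2 / sxx)

conf_int <- c(fit = fit0, lwr = fit0 - tcrit * se_mean, upr = fit0 + tcrit * se_mean)
pred_int <- c(fit = fit0, lwr = fit0 - tcrit * se_pred, upr = fit0 + tcrit * se_pred)

conf_int
pred_int
```

The extra 1 under the square root in se_pred is what makes the prediction interval so much wider than the confidence interval.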

(b).

plot(Auto$horsepower, Auto$mpg, main = "MPG vs. Horsepower",
     xlab = "Horsepower", ylab = "MPG", col = "red", pch = 19)

# Add regression line
abline(lm_model, col = "blue", lwd = 2)

  • Scatter Plot Analysis:

    • Shows a negative relationship between horsepower and mpg.

    • As horsepower increases, mpg decreases.

  • Regression Line (Blue Line):

    • Represents the best-fit linear model.

    • Confirms the downward trend, indicating that cars with higher horsepower tend to be less fuel-efficient.

Key Takeaway: The plot visually supports the regression results, showing that horsepower is a significant predictor of mpg, with a negative correlation.

(c).

par(mfrow = c(2, 2))  
plot(lm_model)

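1. Residuals vs. Fitted Plot:

  • The residuals follow a curved (U-shaped) pattern rather than scattering randomly around zero, which indicates non-linearity in the relationship between horsepower and mpg.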

2. Q-Q Plot:

  • Residuals mostly follow the normal line, but some deviations in the tails indicate potential non-normality.

  • This suggests possible outliers affecting the model.

3. Scale-Location Plot:

  • Increasing spread indicates heteroscedasticity (non-constant variance).

  • This means prediction errors vary across different fitted values.

4. Residuals vs. Leverage Plot:

  • A few high-leverage points are present, meaning some observations strongly influence the model.

  • These should be investigated to check for outliers or influential data points.

Key Takeaway: The model shows signs of non-linearity, heteroscedasticity, and potential outliers, suggesting that a transformation or a more flexible model (e.g., polynomial regression) may improve the fit.
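
One way to follow up on the takeaway above is to add a squared horsepower term and test whether it improves on the straight line. This is only a sketch, assuming the ISLR package is installed; the names fit_lin, fit_quad, and cmp are illustrative, and a quadratic is just one of several possible remedies:

```r
library(ISLR)

fit_lin  <- lm(mpg ~ horsepower, data = Auto)
# I() protects ^ so that it means arithmetic squaring inside the formula
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)

# Nested-model F-test: does the squared term reduce the RSS significantly?
cmp <- anova(fit_lin, fit_quad)
print(cmp)

# Compare the explained variance of the two fits
c(linear = summary(fit_lin)$r.squared, quadratic = summary(fit_quad)$r.squared)

# Re-check the diagnostic plots for the quadratic fit
par(mfrow = c(2, 2))
plot(fit_quad)
```

If the curvature in the Residuals vs. Fitted panel flattens out for fit_quad, that supports the non-linearity diagnosis.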

Question 10

(a).

data(Carseats)
# Fit regression model
lm_full <- lm(Sales ~ Price + Urban + US, data = Carseats)

# View summary
summary(lm_full)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
  • Regression Model: Sales ~ Price + Urban + US

  • Key Coefficients & Significance:

    • Price: -0.054459 (p < 0.001) → Significant negative impact on Sales. As Price increases, Sales decrease.

    • Urban (Yes/No): -0.021916 (p = 0.936) → Not significant, meaning whether a store is in an urban or rural area does not significantly affect Sales.

    • US (Yes/No): 1.200573 (p < 0.001) → Significant positive impact. Stores in the US have higher Sales compared to non-US stores.

  • Model Performance:

    • R-squared = 0.2393 → The model explains ~23.9% of the variation in Sales, suggesting moderate predictive power.

    • F-statistic = 41.52 (p < 2.2e-16) → The overall model is statistically significant.

Key Takeaway:

  • Price significantly affects Sales (negative impact).

  • Being in the US increases Sales, but urban/rural location does not matter.

  • The model captures a moderate amount of variance in Sales.

(b).

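  • Price: Holding Urban and US fixed, a one-unit increase in Price is associated with a decrease of about 0.0545 in Sales. Since Sales is recorded in thousands of units, that is roughly 54 fewer units.

  • Urban: Holding the other predictors fixed, urban stores sell about 0.022 (22 units) less than rural stores, a difference that is not statistically significant (p = 0.936).

  • US: Holding the other predictors fixed, US stores sell about 1.2 (roughly 1,200 units) more than non-US stores.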

(c).

\[ Sales = \beta_0 + \beta_1 Price + \beta_2 UrbanYes + \beta_3 USYes + \varepsilon \]

  • UrbanYes = 1 (Urban), 0 (Rural).

  • USYes = 1 (US), 0 (Non-US).
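
The 0/1 coding described above can be inspected directly in R. The following is a small sketch, assuming the ISLR package is installed; the name X is illustrative:

```r
library(ISLR)

# Dummy coding R uses for each two-level factor (the reference level is "No")
contrasts(Carseats$Urban)
contrasts(Carseats$US)

# First rows of the design matrix that lm() builds from the formula
X <- model.matrix(Sales ~ Price + Urban + US, data = Carseats)
head(X)
```

The UrbanYes and USYes columns of X are the indicator variables that appear in the model equation.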

(d).

summary(lm_full)$coefficients
##                Estimate  Std. Error      t value     Pr(>|t|)
## (Intercept) 13.04346894 0.651012245  20.03567373 3.626602e-62
## Price       -0.05445885 0.005241855 -10.38923205 1.609917e-22
## UrbanYes    -0.02191615 0.271650277  -0.08067781 9.357389e-01
## USYes        1.20057270 0.259041508   4.63467306 4.860245e-06
  • Significant Predictors (p-value < 0.05):

    • Price (p < 0.001) → Significant negative effect on Sales. As Price increases, Sales decrease.

    • US (Yes/No) (p < 0.001) → Significant positive effect. Stores in the US have higher Sales than non-US stores.

  • Non-Significant Predictor:

    • Urban (Yes/No) (p = 0.936) → Not significant, meaning whether a store is in an urban or rural area has no meaningful impact on Sales.

Key Takeaway:

  • Price and US location significantly impact Sales.

  • There is no evidence that Urban location affects Sales, so Urban can be dropped from the model.

(e).

lm_reduced <- lm(Sales ~ Price, data = Carseats)
summary(lm_reduced)
## 
## Call:
## lm(formula = Sales ~ Price, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5224 -1.8442 -0.1459  1.6503  7.5108 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.641915   0.632812  21.558   <2e-16 ***
## Price       -0.053073   0.005354  -9.912   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.532 on 398 degrees of freedom
## Multiple R-squared:  0.198,  Adjusted R-squared:  0.196 
## F-statistic: 98.25 on 1 and 398 DF,  p-value: < 2.2e-16
  • Regression Model: Sales ~ Price (Reduced Model)

  • Key Findings:

    • Price (p < 0.001) → Highly significant with a negative effect on Sales.

    • As Price increases, Sales decrease (β = -0.053073).

  • Model Performance:

    • R-squared = 0.198 → 19.8% of the variation in Sales is explained by Price alone.

    • F-statistic = 98.25 (p < 2.2e-16) → The model is highly significant overall.

Key Takeaway:

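  • Price is a strong predictor of Sales.

  • This reduced model drops US as well as Urban, even though US was significant in (d). R-squared falls from 0.2393 to 0.198 and the residual standard error rises from 2.472 to 2.532, so the loss of fit is noticeable.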

(f).

anova(lm_full, lm_reduced)
## Analysis of Variance Table
## 
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    396 2420.8                                  
## 2    398 2552.2 -2   -131.41 10.748 2.848e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Comparing Models: Sales ~ Price + Urban + US vs. Sales ~ Price

  • Key Findings from ANOVA Test:

    • p-value = 2.848e-05 (< 0.05) → The full model significantly improves the fit compared to the reduced model.

    • F-statistic = 10.748 → The additional predictors (Urban and US) contribute meaningfully to explaining Sales.

    • Residual Sum of Squares (RSS) increases in the reduced model, indicating a worse fit.

Key Takeaway:

  • The full model (Price + Urban + US) is statistically better than using Price alone.

  • Because Urban was not significant in (d) (p = 0.936), the improvement over the Price-only model is driven mainly by US.
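
The F-statistic in the ANOVA table can be checked by hand from the two residual sums of squares. Here q = 2 is the number of coefficients dropped (Urban and US), n = 400 is the number of stores, and p = 3 is the number of predictors in the full model; the subscripts are labels introduced for this check:

\[ F = \frac{(RSS_{reduced} - RSS_{full}) / q}{RSS_{full} / (n - p - 1)} = \frac{131.41 / 2}{2420.8 / 396} \approx \frac{65.71}{6.113} \approx 10.75 \]

This agrees with the reported value of 10.748.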

(g).

confint(lm_reduced)
##                  2.5 %      97.5 %
## (Intercept) 12.3978438 14.88598655
## Price       -0.0635995 -0.04254653
  • 95% Confidence Intervals for Coefficients (Sales ~ Price):

    • Intercept (12.40, 14.89) → When Price = 0, expected Sales falls between 12.40 and 14.89 with 95% confidence.

    • Price (-0.0636, -0.0425) → The coefficient for Price is negative, confirming that as Price increases, Sales decrease.

Key Takeaway:

  • The confidence interval for Price does not include zero, indicating that Price significantly affects Sales.
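
The intervals reported by confint() are simply estimate ± t-quantile × standard error. A minimal sketch of that calculation, assuming the ISLR package is installed (the names fit, est, se, tcrit, and manual_ci are illustrative):

```r
library(ISLR)

fit   <- lm(Sales ~ Price, data = Carseats)
est   <- coef(summary(fit))[, "Estimate"]
se    <- coef(summary(fit))[, "Std. Error"]
tcrit <- qt(0.975, df = df.residual(fit))   # 398 residual degrees of freedom

# 95% interval: estimate plus or minus the critical value times the SE
manual_ci <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
manual_ci
```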

(h).

par(mfrow = c(2, 2))
plot(lm_reduced)

  • Residuals vs. Fitted Plot:

    • Residuals appear randomly scattered, suggesting no major non-linearity in the model.
  • Q-Q Plot:

    • Residuals mostly follow the normal line, but slight deviations in the tails indicate minor non-normality.
  • Scale-Location Plot:

    • No strong pattern, suggesting homoscedasticity (constant variance) is mostly satisfied.
  • Residuals vs. Leverage Plot:

    • A few high-leverage points exist, but no extreme outliers are significantly influencing the model.

Key Takeaway:

  • The model assumptions are mostly met, with minor deviations in normality and leverage.

  • No strong evidence of outliers or major issues in model fit.
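
The visual judgement above can be backed up numerically with leverage values and studentized residuals. This is a sketch under stated assumptions: the ISLR package is installed, and the cutoffs (three times the average leverage, absolute studentized residual above 3) are common rules of thumb rather than fixed thresholds. The variable names are illustrative.

```r
library(ISLR)

fit <- lm(Sales ~ Price, data = Carseats)

lev  <- hatvalues(fit)      # leverage of each observation
rstu <- rstudent(fit)       # studentized residuals

p <- length(coef(fit)) - 1  # number of predictors
n <- nrow(Carseats)
avg_lev <- (p + 1) / n      # the average leverage always equals (p + 1) / n

# Rule-of-thumb flags (assumed cutoffs, adjust as needed)
high_leverage <- which(lev > 3 * avg_lev)
outliers      <- which(abs(rstu) > 3)

high_leverage
outliers
```

Observations that appear in both lists deserve the closest look, since they combine an unusual Price with an unusual Sales value.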


Question 14

(a).

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

lm_model <- lm(y ~ x1 + x2)
summary(lm_model)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05
  • Regression Model: y ~ x1 + x2

  • Key Findings:

    • x1 coefficient = 1.4396 (true value 2), p = 0.0487 → Significant at the 5% level, but only narrowly.

    • x2 coefficient = 1.0097 (true value 0.3), p = 0.3754 → Not significant; the null hypothesis that the x2 coefficient is zero cannot be rejected.

    • Intercept = 2.1305 (true value 2) → Represents the expected value of y when x1 = 0 and x2 = 0.

  • Model Performance:

    • R-squared = 0.2088 → The model explains ~20.88% of the variation in y, indicating a moderate fit.

    • F-statistic = 12.8, p-value = 1.164e-05 → The overall model is statistically significant.

Key Takeaway:

  • The model suggests x1 has a significant effect on y, but x2 does not.

  • Collinearity between x1 and x2 inflates the standard errors (0.7212 for x1 and 1.1337 for x2), which makes it hard to detect the true non-zero effect of x2.
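
Written out, the simulation code at the start of this part generates y from the following linear model, where x1 is uniform on [0, 1], x2 is 0.5 times x1 plus normal noise with standard deviation 0.1, and the error term is standard normal:

\[ y = 2 + 2 x_1 + 0.3 x_2 + \varepsilon, \quad \varepsilon \sim N(0, 1) \]

The true regression coefficients are therefore \( \beta_0 = 2 \), \( \beta_1 = 2 \), and \( \beta_2 = 0.3 \).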

(b).

cor(x1, x2)
## [1] 0.8351212
plot(x1, x2, main = "Correlation Between x1 and x2", col = "green", pch = 19)

  • Correlation Between x1 and x2

    • The correlation cor(x1, x2) = 0.835 is large and positive, indicating a strong linear relationship between x1 and x2.

    • This is expected, since x2 was generated as 0.5 * x1 plus noise, and it leads to collinearity in the regression.

  • Scatter Plot Analysis:

    • The points follow an upward trend, reinforcing the positive correlation.

    • The spread suggests moderate to strong collinearity.

Key Takeaway:

  • Since x1 and x2 are highly correlated, including both in a regression model may cause multicollinearity, making it difficult to determine their individual effects on y.

(c).

lm_model <- lm(y ~ x1 + x2)
summary(lm_model)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05
  • Regression Model: y ~ x1 + x2
  • Key Findings:
    • x1 (p = 0.0487) → The null hypothesis that the x1 coefficient is zero is rejected at the 5% level, though only narrowly.

    • x2 (p = 0.3754) → The null hypothesis that the x2 coefficient is zero cannot be rejected.

  • Model Performance:
    • R-squared = 0.2088 → About 20.88% of the variation in y is explained by x1 and x2.

    • F-statistic (12.8, p < 0.001) → The overall model is statistically significant, despite x2 being insignificant.

Key Takeaway:

  • x1 has some predictive power, but x2 does not significantly contribute to the model.

  • Potential multicollinearity might be affecting x2’s significance, reducing its impact when both predictors are included.

(d).

lm_x1 <- lm(y ~ x1)
summary(lm_x1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06
  • Regression Model: y ~ x1 (Only x1 as a predictor)
  • Key Findings:
    • x1 coefficient = 1.9759, p < 0.001 → Highly significant, meaning x1 has a strong positive effect on y.

    • Intercept = 2.1124 → The expected value of y when x1 = 0.

  • Model Performance:
    • R-squared = 0.2024 → x1 alone explains ~20.24% of the variation in y, slightly lower than the multiple regression model.

    • F-statistic (24.86, p < 0.001) → The model is statistically significant.

Key Takeaway:

  • x1 is a strong predictor of y when used alone.

  • Compared to the multiple regression model, removing x2 does not significantly impact predictive power, suggesting x2 was redundant (likely due to multicollinearity).

(e).

lm_x2 <- lm(y ~ x2)
summary(lm_x2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05
  • Regression Model: y ~ x2 (Only x2 as a predictor)

  • Key Findings:

    • x2 coefficient = 2.8996, p < 0.001 → Highly significant, meaning x2 has a strong positive effect on y.

    • Intercept = 2.3899 → The expected value of y when x2 = 0.

  • Model Performance:

    • R-squared = 0.1763 → x2 alone explains ~17.63% of the variation in y, slightly lower than using x1.

    • F-statistic (20.98, p < 0.001) → The model is statistically significant.

Key Takeaway:

  • x2 is a significant predictor of y when used alone, unlike in the multiple regression model where it was not.

  • This suggests multicollinearity between x1 and x2, which caused x2 to appear insignificant in the combined model.

(f).

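  • The results in (c) to (e) do not contradict each other. x1 and x2 are each significant when used alone, but because they are highly correlated (0.835), their individual effects are hard to separate when both are in the model, and the standard errors grow (for x1, from 0.3963 to 0.7212).

  • Diagnostic: the VIF (Variance Inflation Factor) quantifies this collinearity.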
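
The VIF can be computed without any extra package by regressing one predictor on the other. The sketch below regenerates the same simulated data so that it runs on its own; the names r2_x1, vif_x1, fit_both, fit_x1, se_both, and se_alone are illustrative:

```r
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y  <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

# VIF for x1: regress x1 on the other predictor and use 1 / (1 - R^2)
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# With only two predictors, R^2 equals the squared correlation,
# so x1 and x2 share the same VIF
vif_x1

# Standard error of the x1 coefficient with and without x2 in the model
fit_both <- lm(y ~ x1 + x2)
fit_x1   <- lm(y ~ x1)
se_both  <- coef(summary(fit_both))["x1", "Std. Error"]
se_alone <- coef(summary(fit_x1))["x1", "Std. Error"]
se_both / se_alone   # close to sqrt(vif_x1)
```

The ratio of standard errors is close to the square root of the VIF, which is how collinearity translates into less precise coefficient estimates.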

(g).

x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)

lm_new <- lm(y ~ x1 + x2)
summary(lm_new)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
# Check influential points
plot(lm_new, which = 4)  

  • Effect of Adding the New Observation (x1 = 0.1, x2 = 0.8, y = 6)

    • The Cook’s Distance plot identifies observation 101 as a highly influential point.

    • This observation has a disproportionate effect on the model, suggesting it may be an outlier or high-leverage point.

  • Key Takeaways:

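    • The new data point changes the coefficients substantially: the x1 estimate drops from 1.4396 to 0.5394 and is no longer significant (p = 0.365), while the x2 estimate rises from 1.0097 to 2.5146 and becomes significant (p = 0.006).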

    • It may distort the model’s predictions and should be carefully examined.

    • Possible solution: Check for data entry errors, consider robust regression or remove the influential point if justified.

Final Note: The new observation impacts model stability, and caution is needed when interpreting results.
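
To put numbers on the influence of observation 101, the fits with and without it can be compared directly. This is a sketch that regenerates the simulated data so it runs on its own; the names fit_old, fit_new, x1_new, x2_new, y_new, and lev are illustrative:

```r
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y  <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

fit_old <- lm(y ~ x1 + x2)                 # original 100 observations

# Append the extra observation from part (g)
x1_new <- c(x1, 0.1)
x2_new <- c(x2, 0.8)
y_new  <- c(y, 6)
fit_new <- lm(y_new ~ x1_new + x2_new)

# Coefficients side by side (row names come from the first fit)
cbind(without_101 = coef(fit_old), with_101 = coef(fit_new))

# Leverage, studentized residual, and Cook's distance for observation 101
lev <- hatvalues(fit_new)
c(leverage          = unname(lev[101]),
  average_leverage  = mean(lev),
  studentized_resid = unname(rstudent(fit_new)[101]),
  cooks_distance    = unname(cooks.distance(fit_new)[101]))
```

Comparing the leverage of observation 101 with the average leverage, and its studentized residual with a cutoff such as 3, helps separate "high leverage" from "outlier", which the Cook's distance plot alone does not do.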