library(ISLR)
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
# Fit simple linear regression model
lm_model <- lm(mpg ~ horsepower, data = Auto)
# Display regression summary
summary(lm_model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Interpretation of Results:
Existence of Relationship
The p-value of horsepower in summary(lm_model) tests whether horsepower has a statistically significant association with mpg.
Here p < 2e-16, far below 0.05, so there is strong evidence of a relationship. The p-value measures evidence, not strength.
Strength of Relationship
The R-squared value indicates the proportion of mpg variation explained by horsepower: 0.6059, or about 61%.
Closer to 1 → stronger model fit.
Nature of Relationship (Positive/Negative)
The coefficient of horsepower shows whether mpg increases or decreases as horsepower rises.
The coefficient is -0.1578, so each additional unit of horsepower is associated with about 0.16 lower mpg: an inverse relationship.
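The three quantities discussed above can be pulled out of the summary object directly instead of being read off the printed table. The following is a minimal sketch; the names fit, fit_summary, and coef_table are illustrative and not part of the original analysis.

```r
library(ISLR)  # provides the Auto data set used above

fit <- lm(mpg ~ horsepower, data = Auto)
fit_summary <- summary(fit)

# Coefficient table: one row per term, with columns
# Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- fit_summary$coefficients

slope   <- coef_table["horsepower", "Estimate"]  # direction and size of the effect
p_value <- coef_table["horsepower", "Pr(>|t|)"]  # evidence that the slope is nonzero
r2      <- fit_summary$r.squared                 # share of mpg variance explained

cat("slope:", round(slope, 4), "\n")
cat("p-value below 0.05:", p_value < 0.05, "\n")
cat("R-squared:", round(r2, 4), "\n")
```

Indexing the table by row and column name keeps the code readable and avoids relying on the position of a term in the model.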
Prediction for Horsepower = 98
predict(lm_model, newdata = data.frame(horsepower = 98), interval = "confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(lm_model, newdata = data.frame(horsepower = 98), interval = "prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
Predicted mpg for horsepower = 98: 24.47
95% Confidence Interval (CI): (23.97, 24.96)
95% Prediction Interval (PI): (14.81, 34.12)
Key Takeaway:
The CI is narrow: the mean mpg of all cars with horsepower = 98 is estimated precisely.
The PI is much wider because it also accounts for the variation of individual cars around that mean.
Individual cars can differ widely in mpg even with the same horsepower.
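The difference between the two intervals can be made explicit by rebuilding them from the standard error of the fitted mean and the residual standard error. This is a sketch under the usual linear-model assumptions; fit, new_car, conf_int, and pred_int are illustrative names.

```r
library(ISLR)  # provides the Auto data set used above

fit <- lm(mpg ~ horsepower, data = Auto)
new_car <- data.frame(horsepower = 98)

# se.fit = TRUE also returns the standard error of the fitted mean,
# the residual degrees of freedom, and the residual standard error
pred <- predict(fit, newdata = new_car, se.fit = TRUE)
t_crit <- qt(0.975, df = pred$df)  # two-sided 95% critical value

# Confidence interval: uncertainty in the estimated mean only
conf_int <- unname(pred$fit) + c(-1, 1) * t_crit * pred$se.fit

# Prediction interval: mean uncertainty plus the residual variance of one car
pred_int <- unname(pred$fit) +
  c(-1, 1) * t_crit * sqrt(pred$se.fit^2 + pred$residual.scale^2)

print(round(conf_int, 2))
print(round(pred_int, 2))
```

The only difference between the two formulas is the extra residual-variance term under the square root, which is why the prediction interval is so much wider.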
plot(Auto$horsepower, Auto$mpg, main = "MPG vs. Horsepower",
xlab = "Horsepower", ylab = "MPG", col = "red", pch = 19)
# Add regression line
abline(lm_model, col = "blue", lwd = 2)
Scatter Plot Analysis:
Shows a negative relationship between horsepower and mpg.
As horsepower increases, mpg decreases.
Regression Line (Blue Line):
Represents the best-fit linear model.
Confirms the downward trend, indicating that cars with higher horsepower tend to be less fuel-efficient.
Key Takeaway: The plot visually supports the regression results, showing that horsepower is a significant predictor of mpg, with a negative correlation.
par(mfrow = c(2, 2))
plot(lm_model)
Residuals vs. Fitted Plot:
A curved pattern suggests non-linearity in the relationship.
Residuals are not randomly scattered, indicating that a linear model may not be the best fit.
Q-Q Plot:
Residuals mostly follow the normal line, but deviations in the tails indicate potential non-normality.
This suggests possible outliers affecting the model.
Scale-Location Plot:
Increasing spread indicates heteroscedasticity (non-constant variance).
This means the size of the prediction errors varies across fitted values.
Residuals vs. Leverage Plot:
A few high-leverage points are present, meaning some observations could strongly influence the model.
These should be investigated to check for outliers or influential data points.
Key Takeaway: The model shows signs of non-linearity, heteroscedasticity, and potential outliers, suggesting that a transformation or a more flexible model (e.g., polynomial regression) may improve the fit.
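One way to follow up on the polynomial suggestion is to add a squared horsepower term and test it against the linear fit. The sketch below is an illustration, not part of the original analysis; fit_linear, fit_quadratic, and comparison are assumed names, and a quadratic is only one of several possible remedies.

```r
library(ISLR)  # provides the Auto data set used above

fit_linear <- lm(mpg ~ horsepower, data = Auto)
# I() protects the arithmetic so that ^2 means "squared" inside the formula
fit_quadratic <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)

# R-squared of each fit; the quadratic cannot be lower because the models are nested
r2_linear    <- summary(fit_linear)$r.squared
r2_quadratic <- summary(fit_quadratic)$r.squared
print(c(linear = r2_linear, quadratic = r2_quadratic))

# F-test of the nested models: does the squared term reduce the RSS significantly?
comparison <- anova(fit_linear, fit_quadratic)
print(comparison)
```

If the squared term is significant, rerunning plot() on fit_quadratic shows whether the curvature in the residuals has gone away.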
Question 10
data(Carseats)
# Fit regression model
lm_full <- lm(Sales ~ Price + Urban + US, data = Carseats)
# View summary
summary(lm_full)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Regression Model: Sales ~ Price + Urban + US
Key Coefficients & Significance:
Price: -0.054459 (p < 0.001) → Significant negative impact on Sales. As Price increases, Sales decrease.
Urban (Yes/No): -0.021916 (p = 0.936) → Not significant: there is no evidence that Sales differ between urban and rural stores.
US (Yes/No): 1.200573 (p < 0.001) → Significant positive impact. Stores in the US have higher Sales than non-US stores.
Model Performance:
R-squared = 0.2393 → The model explains ~23.9% of the variation in Sales, so its predictive power is modest.
F-statistic = 41.52 (p < 2.2e-16) → The overall model is statistically significant.
Key Takeaway:
Price significantly affects Sales (negative impact).
Being in the US is associated with higher Sales, but urban/rural location shows no significant effect.
The model captures only about a quarter of the variance in Sales.
Price: Change in Sales per one-unit increase in Price, holding Urban and US fixed.
Urban: Difference in Sales between urban and rural stores at the same Price and US status.
US: Difference in Sales between US and non-US stores at the same Price and Urban status.
\[ Sales = \beta_0 + \beta_1 Price + \beta_2 UrbanYes + \beta_3 USYes + \varepsilon \]
UrbanYes = 1 (Urban), 0 (Rural).
USYes = 1 (US), 0 (Non-US).
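The 0/1 coding described above is what R builds internally from the two factors. As a hedged illustration, contrasts() and model.matrix() make that coding visible; design, urban_matches, and us_matches are names introduced here for the example.

```r
library(ISLR)  # provides the Carseats data set used above

# contrasts() shows the 0/1 coding R assigns to each factor level
print(contrasts(Carseats$Urban))
print(contrasts(Carseats$US))

# model.matrix() builds the design matrix that lm() actually fits
design <- model.matrix(Sales ~ Price + Urban + US, data = Carseats)
print(head(design, 3))

# Each dummy column is 1 exactly where the factor equals "Yes"
urban_matches <- all(design[, "UrbanYes"] == (Carseats$Urban == "Yes"))
us_matches    <- all(design[, "USYes"] == (Carseats$US == "Yes"))
print(c(urban_matches, us_matches))
```

The column names UrbanYes and USYes are the factor name followed by the non-baseline level, which is why they appear under those names in the coefficient table.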
summary(lm_full)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.04346894 0.651012245 20.03567373 3.626602e-62
## Price -0.05445885 0.005241855 -10.38923205 1.609917e-22
## UrbanYes -0.02191615 0.271650277 -0.08067781 9.357389e-01
## USYes 1.20057270 0.259041508 4.63467306 4.860245e-06
Significant Predictors (p-value < 0.05):
Price (p < 0.001) → Significant negative effect on Sales. As Price increases, Sales decrease.
US (Yes/No) (p < 0.001) → Significant positive effect. Stores in the US have higher Sales than non-US stores.
Non-Significant Predictor:
Urban (Yes/No) (p = 0.936) → Not significant: there is no evidence that urban or rural location is related to Sales.
Key Takeaway:
Price and US location are significantly associated with Sales.
Urban shows no significant effect on Sales and can be removed from the model.
lm_reduced <- lm(Sales ~ Price, data = Carseats)
summary(lm_reduced)
##
## Call:
## lm(formula = Sales ~ Price, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5224 -1.8442 -0.1459 1.6503 7.5108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.641915 0.632812 21.558 <2e-16 ***
## Price -0.053073 0.005354 -9.912 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.532 on 398 degrees of freedom
## Multiple R-squared: 0.198, Adjusted R-squared: 0.196
## F-statistic: 98.25 on 1 and 398 DF, p-value: < 2.2e-16
Regression Model: Sales ~ Price (Reduced Model)
Key Findings:
Price (p < 0.001) → Highly significant with a negative effect on Sales.
As Price increases, Sales decrease (β = -0.053073).
Model Performance:
R-squared = 0.198 → 19.8% of the variation in Sales is explained by Price alone.
F-statistic = 98.25 (p < 2.2e-16) → The model is highly significant overall.
Key Takeaway:
Price is a highly significant predictor of Sales, although it explains only about a fifth of the variation.
This reduced model drops US as well as Urban, even though US was significant in the full model; R-squared falls from 0.2393 to 0.198.
anova(lm_full, lm_reduced)
## Analysis of Variance Table
##
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 396 2420.8
## 2 398 2552.2 -2 -131.41 10.748 2.848e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing Models: Sales ~ Price + Urban + US vs. Sales ~ Price
Key Findings from ANOVA Test:
p-value = 2.848e-05 (< 0.05) → The full model fits significantly better than the reduced model.
F-statistic = 10.748 → Taken together, the additional predictors (Urban and US) contribute meaningfully to explaining Sales.
Residual Sum of Squares (RSS) rises from 2420.8 in the full model to 2552.2 in the reduced model, indicating a worse fit.
Key Takeaway:
The full model (Price + Urban + US) is statistically better than using Price alone.
Since Urban was not significant individually (p = 0.936), the improvement most likely comes from US, which was highly significant in the full model.
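The F-statistic in the ANOVA table can be reproduced from the two residual sums of squares, which shows what the test measures. The sketch below also fits an intermediate model, Sales ~ Price + US, that does not appear in the original analysis; it is included only as a hypothetical check of how much Urban contributes on its own. All object names are illustrative.

```r
library(ISLR)  # provides the Carseats data set used above

fit_full    <- lm(Sales ~ Price + Urban + US, data = Carseats)
fit_reduced <- lm(Sales ~ Price, data = Carseats)

rss_full    <- sum(residuals(fit_full)^2)
rss_reduced <- sum(residuals(fit_reduced)^2)

q       <- fit_reduced$df.residual - fit_full$df.residual  # number of dropped terms
df_full <- fit_full$df.residual                            # residual df of the full model

# Partial F statistic: RSS increase per dropped term,
# relative to the full model's error variance
f_stat  <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
p_value <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)
print(c(F = f_stat, p = p_value))

# Hypothetical intermediate model: drop only Urban and keep US
fit_price_us <- lm(Sales ~ Price + US, data = Carseats)
urban_test <- anova(fit_price_us, fit_full)
print(urban_test)
```

Because only one term separates the intermediate and full models, the p-value of the second comparison equals the p-value of the UrbanYes t-test in the full model's summary.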
confint(lm_reduced)
## 2.5 % 97.5 %
## (Intercept) 12.3978438 14.88598655
## Price -0.0635995 -0.04254653
95% Confidence Intervals for Coefficients (Sales ~ Price):
Intercept (12.40, 14.89) → When Price = 0, expected Sales falls between 12.40 and 14.89 with 95% confidence.
Price (-0.0636, -0.0425) → The whole interval is below zero, confirming that as Price increases, Sales decrease.
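These intervals follow the usual recipe of estimate plus or minus a t critical value times the standard error. A minimal sketch of that calculation, with manual_ci and matches as illustrative names:

```r
library(ISLR)  # provides the Carseats data set used above

fit_reduced <- lm(Sales ~ Price, data = Carseats)
coef_table  <- summary(fit_reduced)$coefficients

estimate  <- coef_table[, "Estimate"]
std_error <- coef_table[, "Std. Error"]
t_crit    <- qt(0.975, df = fit_reduced$df.residual)  # residual degrees of freedom

# 95% interval: estimate plus or minus the critical value times the standard error
manual_ci <- cbind(lower = estimate - t_crit * std_error,
                   upper = estimate + t_crit * std_error)
print(manual_ci)

# The hand-built intervals should agree with the built-in helper
matches <- isTRUE(all.equal(unname(manual_ci), unname(confint(fit_reduced))))
print(matches)
```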
Key Takeaway:
Price significantly affects Sales.
par(mfrow = c(2, 2))
plot(lm_reduced)
Residuals vs. Fitted Plot:
Q-Q Plot:
Scale-Location Plot:
Residuals vs. Leverage Plot:
Key Takeaway:
The model assumptions are mostly met, with minor deviations in normality and leverage.
No strong evidence of outliers or major issues in model fit.
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
lm_model <- lm(y ~ x1 + x2)
summary(lm_model)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
Regression Model: y ~ x1 + x2
Key Findings:
x1 coefficient = 1.4396, p = 0.0487 → Marginally significant at the 0.05 level. The estimate is below the true value of 2 used to simulate y.
x2 coefficient = 1.0097, p = 0.3754 → Not significant, so the null hypothesis that x2 has no effect given x1 cannot be rejected. The true value is 0.3.
Intercept = 2.1305 → Represents the expected value of y when x1 = 0 and x2 = 0 (true value 2).
Model Performance:
R-squared = 0.2088 → The model explains ~20.88% of the variation in y, indicating a modest fit.
F-statistic = 12.8, p-value = 1.164e-05 → The overall model is statistically significant.
Key Takeaway:
The model suggests x1 has a marginally significant effect on y, but x2 does not.
Multicollinearity may be affecting the results: x2 is built from x1, and y truly depends on both.
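Because the data are simulated, the estimates can be checked against the coefficients used to generate y (2, 2, and 0.3). The sketch below regenerates the same data and tests whether each 95% confidence interval covers its true value; fit, true_beta, and comparison are illustrative names.

```r
set.seed(1)  # same seed and data-generating code as above
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y  <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

fit <- lm(y ~ x1 + x2)

true_beta <- c("(Intercept)" = 2, x1 = 2, x2 = 0.3)  # values used to simulate y
ci <- confint(fit)                                   # 95% intervals by default

# One row per coefficient: estimate, interval, true value, and coverage flag
comparison <- data.frame(
  estimate = coef(fit),
  lower    = ci[, 1],
  upper    = ci[, 2],
  truth    = true_beta,
  covered  = true_beta >= ci[, 1] & true_beta <= ci[, 2]
)
print(comparison)
```

Wide intervals that still cover the truth are the typical signature of collinearity: the estimates are not biased, only imprecise.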
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2, main = "Correlation Between x1 and x2", col = "green", pch = 19)
Correlation Between x1 and x2
The correlation (cor(x1, x2)) is 0.835, indicating a strong positive linear relationship between x1 and x2.
This reflects that x2 is constructed partly from x1, which may lead to multicollinearity in regression.
Scatter Plot Analysis:
The points follow an upward trend, reinforcing the positive correlation.
The spread suggests moderate to strong collinearity.
Key Takeaway:
x1 and x2 are highly correlated, so including both in a regression model may cause multicollinearity, making it difficult to determine their individual effects on y.
lm_model <- lm(y ~ x1 + x2)
summary(lm_model)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
Regression Model: y ~ x1 + x2
x1 (p = 0.0487) → Marginally significant, so the null hypothesis that its coefficient is zero can be rejected at the 0.05 level, but only barely.
x2 (p = 0.3754) → Not significant, so the null hypothesis that its coefficient is zero cannot be rejected.
R-squared = 0.2088 → About 20.88% of the variation in y is explained by x1 and x2.
F-statistic (12.8, p < 0.001) → The overall model is statistically significant, despite x2 being insignificant.
Key Takeaway:
x1 has some predictive power, but x2 does not significantly contribute once x1 is in the model.
Multicollinearity is likely inflating the standard errors, which lowers the apparent significance of x2 when both predictors are included.
lm_x1 <- lm(y ~ x1)
summary(lm_x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
Regression Model: y ~ x1 (Only x1 as a predictor)
Key Findings:
x1 coefficient = 1.9759, p < 0.001 → Highly significant, meaning x1 has a clear positive effect on y.
Intercept = 2.1124 → The expected value of y when x1 = 0.
Model Performance:
R-squared = 0.2024 → x1 alone explains ~20.24% of the variation in y, slightly lower than the multiple regression model.
F-statistic (24.86, p < 0.001) → The model is statistically significant.
Key Takeaway:
x1 is a highly significant predictor of y when used alone.
Compared to the multiple regression model, removing x2 barely changes R-squared (0.2088 vs. 0.2024), suggesting x2 was largely redundant (likely due to multicollinearity).
lm_x2 <- lm(y ~ x2)
summary(lm_x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
Regression Model: y ~ x2 (Only x2 as a predictor)
Key Findings:
x2 coefficient = 2.8996, p < 0.001 → Highly significant, meaning x2 has a clear positive association with y.
Intercept = 2.3899 → The expected value of y when x2 = 0.
Model Performance:
R-squared = 0.1763 → x2 alone explains ~17.63% of the variation in y, slightly lower than using x1.
F-statistic (20.98, p < 0.001) → The model is statistically significant.
Key Takeaway:
x2 is a significant predictor of y when used alone, unlike in the multiple regression model where it was not.
This points to multicollinearity between x1 and x2, which caused x2 to appear insignificant in the combined model.
These results do not contradict each other: x1 and x2 are each highly significant alone, but together x1 is only marginally significant and x2 is not, because collinearity makes their separate effects hard to distinguish.
Solution: Use VIF (Variance Inflation Factor) to check collinearity.
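The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below computes it by hand in base R on the same simulated data and compares the standard error of x1 with and without x2 in the model; the variable names are illustrative.

```r
set.seed(1)  # same seed and data-generating code as above
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y  <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

# VIF for x1: regress it on the other predictor and use 1 / (1 - R^2)
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# With exactly two predictors, R^2 is the squared correlation,
# so x1 and x2 share the same VIF
vif_from_cor <- 1 / (1 - cor(x1, x2)^2)
print(c(vif_x1 = vif_x1, vif_from_cor = vif_from_cor))

# Standard error of the x1 coefficient, with and without x2 in the model
se_joint <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
se_alone <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
print(c(se_joint = se_joint, se_alone = se_alone, ratio = se_joint / se_alone))
```

For models with more predictors, the vif() function in the car package reports the same quantity for a fitted lm object, assuming that package is installed.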
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
lm_new <- lm(y ~ x1 + x2)
summary(lm_new)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
# Check influential points
plot(lm_new, which = 4)
Effect of Adding the New Observation (x1 = 0.1, x2 = 0.8, y = 6)
The Cook’s Distance plot identifies observation 101 as a highly influential point.
This observation has a disproportionate effect on the model, suggesting it may be an outlier or high-leverage point.
Key Takeaways:
The new data point substantially changes the regression coefficients: x1 drops from 1.4396 (p = 0.0487) to 0.5394 (p = 0.365) and becomes non-significant, while x2 rises from 1.0097 (p = 0.3754) to 2.5146 (p = 0.006) and becomes significant.
It may distort the model’s predictions and should be carefully examined.
Possible solution: Check for data entry errors, consider robust regression, or remove the influential point if justified.
Final Note: A single observation reverses which predictor appears significant, so the model is unstable and caution is needed when interpreting results.
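Whether observation 101 is an outlier, a high-leverage point, or both can be checked numerically. The sketch below uses base R diagnostics on the same simulated data; the cutoff of twice the average leverage is a common rule of thumb, not a value from the original analysis, and the object names are illustrative.

```r
set.seed(1)  # same seed and data-generating code as above
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y  <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

fit_without <- lm(y ~ x1 + x2)  # original 100 observations

# Append the extra observation as row 101
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y  <- c(y, 6)
fit_with <- lm(y ~ x1 + x2)

# Per-observation diagnostics for the new point
diagnostics <- c(
  leverage    = unname(hatvalues(fit_with)[101]),      # unusual predictor combination?
  studentized = unname(rstudent(fit_with)[101]),       # unusual response given the fit?
  cooks_d     = unname(cooks.distance(fit_with)[101])  # combined influence
)
print(round(diagnostics, 3))

# Rule of thumb (an assumption here): flag leverage above twice the average, 2 * p / n
leverage_cutoff <- 2 * length(coef(fit_with)) / length(y)
print(diagnostics["leverage"] > leverage_cutoff)

# Side-by-side coefficients show how much one point moves the fit
print(rbind(without_101 = coef(fit_without), with_101 = coef(fit_with)))
```

Leverage depends only on the predictors, while the studentized residual depends on the response, so the two values together separate "unusual x" from "unusual y".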