We begin by fitting a simple linear regression model to investigate the relationship between miles per gallon (mpg) and horsepower in the Auto dataset. We use lm() to fit the model and summary() to examine the results.
library(ISLR2)  # provides the Auto and Carseats data sets (the ISLR package also works)
lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Explanation:
Relationship between the predictor and response: The p-value for the horsepower coefficient (and for the overall F-statistic) is < 2.2e-16, so we reject the null hypothesis; horsepower is a statistically significant predictor of mpg.
Strength of the relationship: The R² of 0.6059 indicates that horsepower explains about 61% of the variance in mpg, a moderately strong relationship.
Direction of the relationship: The coefficient on horsepower is negative (-0.1578), so mpg decreases as horsepower increases.
Predicted mpg for horsepower of 98: We calculate the predicted mpg for a car with 98 horsepower, along with confidence and prediction intervals.
predict(lm_fit, newdata = data.frame(horsepower = 98), interval = "confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(lm_fit, newdata = data.frame(horsepower = 98), interval = "prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
Predicted mpg for 98 horsepower: 24.47
95% confidence interval: (23.97, 24.96)
95% prediction interval: (14.81, 34.12)
The prediction interval is much wider than the confidence interval because it accounts for the variability of individual observations around the mean response, not just the uncertainty in the estimated mean itself.
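The quantities cited in the explanation above, along with the coefficient confidence intervals, can also be extracted directly from the fitted model object; a minimal sketch:
coef(summary(lm_fit))      # coefficient estimates, standard errors, t values, p-values
summary(lm_fit)$r.squared  # proportion of the variance in mpg explained by horsepower
confint(lm_fit)            # 95% confidence intervals for the coefficients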
We create a scatter plot of mpg against horsepower, along with the fitted regression line.
plot(Auto$horsepower, Auto$mpg,
main = "MPG vs Horsepower",
xlab = "Horsepower",
ylab = "MPG",
pch = 17,
col = "blue")
abline(lm_fit, col = "red", lwd = 1.5)
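If desired, the confidence and prediction bands computed above can be overlaid on the same scatter plot; a sketch (the colours and line types below are arbitrary illustrative choices, not part of the original figure):
# Evaluate both intervals over a grid of horsepower values
hp_grid <- data.frame(horsepower = seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 100))
conf_band <- predict(lm_fit, newdata = hp_grid, interval = "confidence")
pred_band <- predict(lm_fit, newdata = hp_grid, interval = "prediction")
lines(hp_grid$horsepower, conf_band[, "lwr"], col = "red", lty = 2)        # confidence band
lines(hp_grid$horsepower, conf_band[, "upr"], col = "red", lty = 2)
lines(hp_grid$horsepower, pred_band[, "lwr"], col = "darkgreen", lty = 3)  # prediction band
lines(hp_grid$horsepower, pred_band[, "upr"], col = "darkgreen", lty = 3)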
We examine diagnostic plots to evaluate the regression model fit.
par(mfrow = c(2, 2))
plot(lm_fit)
Residuals vs Fitted: The residuals exhibit a curved pattern, suggesting a non-linear relationship between horsepower and mpg.
Normal Q-Q Plot: Residuals align with the diagonal line, though there are slight deviations at the extremes, indicating mild non-normality.
Scale-Location Plot: The spread of residuals increases with fitted values, indicating heteroscedasticity.
Residuals vs Leverage: Some high-leverage points exist, but none seem overly influential.
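The curvature in the Residuals vs Fitted plot suggests that a non-linear term might fit better. As an illustrative check (a sketch only, not part of the original analysis), a quadratic term can be added and compared with the linear fit:
lm_fit2 <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)  # quadratic fit
anova(lm_fit, lm_fit2)  # partial F-test comparing the linear and quadratic models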
We fit a multiple regression model to predict Sales in the Carseats dataset using Price, Urban, and US as predictors.
model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Intercept (13.04): This represents the expected sales when Price = 0, Urban = “No”, and US = “No”. It is a baseline estimate; since Price = 0 lies outside the observed price range, it has no direct practical interpretation.
Price (-0.0545): For each unit increase in price, sales decrease by approximately 0.0545 units, holding other variables constant. This aligns with the expectation that higher prices reduce sales.
UrbanYes (-0.0219): Urban location does not significantly affect sales (p-value = 0.936).
USYes (1.2006): Being located in the US increases sales by approximately 1.2006 units compared to non-US locations, with high statistical significance (p-value < 0.0001).
The regression model is as follows:
Sales = 13.0435 - 0.0545 × Price - 0.0219 × UrbanYes + 1.2006 × USYes
Where:
UrbanYes = 1 for urban stores, 0 otherwise.
USYes = 1 for US stores, 0 otherwise.
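To make the dummy coding concrete, the fitted equation can be evaluated with predict(); the store characteristics below are hypothetical values chosen purely for illustration:
# Predicted Sales for a hypothetical urban US store with Price = 120
predict(model, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))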
We reject the null hypothesis (βj = 0) for predictors with p-values less than 0.05:
Price (p-value < 2e-16): Significant predictor, reject H0.
USYes (p-value < 0.0001): Significant predictor, reject H0.
UrbanYes (p-value = 0.936): Not significant, fail to reject H0.
We now fit a reduced model excluding Urban.
smaller_model <- lm(Sales ~ Price + US, data = Carseats)
summary(smaller_model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both the full and reduced models explain essentially the same proportion of variance (R² = 0.2393). However, the reduced model is slightly preferable: it uses one fewer predictor, has a marginally higher adjusted R² (0.2354 vs. 0.2335), and a slightly lower residual standard error (2.469 vs. 2.472).
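Because the reduced model is nested within the full model, the two can also be compared formally; a minimal sketch using the model objects defined above:
anova(smaller_model, model)  # partial F-test for dropping Urban
AIC(model, smaller_model)    # lower AIC indicates the preferable model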
We calculate 95% confidence intervals for the coefficients of the reduced model.
confint(smaller_model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
We produce diagnostic plots for the reduced model to check for outliers or high leverage points.
par(mfrow = c(2, 2))
plot(smaller_model)
While there are a few outliers, none are overly influential. The model performs well, but checking Cook’s distance might help identify influential points more clearly.
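The Cook’s distance check mentioned above could be carried out along the following lines (a sketch; the cut-offs are common rules of thumb rather than values from the exercise):
cd <- cooks.distance(smaller_model)
which(cd > 4 / nrow(Carseats))  # observations with comparatively large influence
lev <- hatvalues(smaller_model)
which(lev > 2 * mean(lev))      # observations with high leverage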
set.seed(1)
x1 <- runif(100)                          # predictor 1: uniform on [0, 1]
x2 <- 0.5 * x1 + rnorm(100) / 10          # predictor 2: a noisy linear function of x1 (collinear)
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)   # response generated from the true model
The last line creates a linear model in which y is a function of x1 and x2. The form of the linear model is as follows:
y = 2 + 2x₁ + 0.3x₂ + ε
where ε represents random noise. The true regression coefficients are:
Intercept (β₀) = 2
Coefficient of x₁ (β₁) = 2
Coefficient of x₂ (β₂) = 0.3
cor(x1, x2)
## [1] 0.8351212
Since x₂ is constructed as 0.5x₁ plus a small amount of random noise, we expect a strong positive correlation; the observed value is approximately 0.84.
plot(x1, x2, main = "Scatterplot of x1 vs x2", xlab = "x1", ylab = "x2")
model <- lm(y ~ x1 + x2)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The estimated coefficients are:
β₀ = 2.13 (true value 2)
β₁ = 1.44 (true value 2)
β₂ = 1.01 (true value 0.3)
These deviate from the true values because of the collinearity between x₁ and x₂.
H₀: β₁ = 0 can be rejected at p = 0.0487, just below the 0.05 threshold.
H₀: β₂ = 0 cannot be rejected (p = 0.3754).
lm_x1 <- lm(y ~ x1)
summary(lm_x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
We can reject H₀: β₁ = 0. The p-value for β₁ (2.66e-06) is much smaller in this model than when x₂ was included.
lm_x2 <- lm(y ~ x2)
summary(lm_x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
We can reject H₀: β₂ = 0. The p-value for β₂ (1.37e-05) is much smaller in this model than when x₁ was included.
The results do not contradict each other; instead, they highlight the issue of collinearity. Both x₁ and x₂ seem significant when examined individually, but their significance diminishes when they are analyzed together, due to the inflated standard errors caused by their correlation.
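The inflated standard errors can be quantified with variance inflation factors; a minimal sketch, assuming the car package is installed:
library(car)          # assumed to be available; provides vif()
vif(lm(y ~ x1 + x2))  # values well above 1 indicate collinearity between x1 and x2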
x1 <- c(x1, 0.1)  # append a new observation with an unusual (x1, x2) combination
x2 <- c(x2, 0.8)
y <- c(y, 6)      # mismeasured response value for the new observation
We now re-fit the three linear models from above using this augmented data to examine what effect the new observation has on each model, and whether it behaves as an outlier, a high-leverage point, or both.
summary(lm(y ~ x1 + x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
summary(lm(y ~ x1))
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
summary(lm(y ~ x2))
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
par(mfrow = c(2, 2))
plot(lm(y ~ x1 + x2), cex = 0.2)
par(mfrow = c(2, 2))
plot(lm(y ~ x1), cex = 0.2)
par(mfrow = c(2, 2))
plot(lm(y ~ x2), cex = 0.2)
The mismeasured observation affects the models as follows:
Model with both x₁ and x₂: the point has high leverage because its (x₁, x₂) combination lies far from the typical joint distribution of the predictors, but it does not strongly affect the residuals.
Model with only x₁: the point is an outlier because its y-value deviates substantially from the trend, while its x₁ value is unremarkable.
Model with only x₂: the point has high leverage because its x₂ value lies far from the main cluster of x₂ values, but it does not drastically affect the residuals.
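These claims can be checked numerically; the sketch below inspects the added observation (the 101st point after the update above) using common rules of thumb for leverage and studentized residuals:
fit_both <- lm(y ~ x1 + x2)
fit_x1 <- lm(y ~ x1)
fit_x2 <- lm(y ~ x2)
# Leverage of observation 101 in each model (compare with the average leverage (p + 1)/n)
sapply(list(both = fit_both, x1_only = fit_x1, x2_only = fit_x2), function(f) hatvalues(f)[101])
# Studentized residual of observation 101 (absolute values near 3 or more suggest an outlier)
sapply(list(both = fit_both, x1_only = fit_x1, x2_only = fit_x2), function(f) rstudent(f)[101])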
This concludes the analysis and highlights the different behaviors of the mismeasured observation across models.