8. Simple Linear Regression Analysis using the Auto Dataset

(a) Fitting a Simple Linear Regression Model

We begin by fitting a simple linear regression model to investigate the relationship between miles per gallon (mpg) and horsepower from the Auto dataset. We use lm() to fit the model and summary() to examine the results.

lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Explanation:

  • Relationship between the predictor and response:

    • Yes, there is a significant relationship between horsepower and mpg. The p-value for horsepower is extremely small (< 2.2e-16), so we reject the null hypothesis that its coefficient is zero; horsepower is a statistically significant predictor of mpg.
  • Strength of the relationship:

    • The R-squared value is 0.606, meaning approximately 60.6% of the variation in mpg is explained by horsepower alone, which suggests a moderately strong relationship.
  • Direction of the relationship:

    • The coefficient of horsepower is -0.1578, meaning that as horsepower increases by 1 unit, mpg decreases by 0.1578 on average, indicating a negative relationship.
  • Predicted mpg for horsepower of 98: We calculate the predicted mpg for a car with 98 horsepower, along with confidence and prediction intervals.

predict(lm_fit, newdata = data.frame(horsepower = 98), interval = "confidence")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
predict(lm_fit, newdata = data.frame(horsepower = 98), interval = "prediction")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476
  • Predicted mpg for 98 horsepower: 24.47

  • 95% confidence interval: (23.97, 24.96)

  • 95% prediction interval: (14.81, 34.12), which is wider than the confidence interval because it accounts for the variability of individual observations, not just the uncertainty in the mean response.

(b) Plotting the Response and Predictor

We create a scatter plot of mpg against horsepower, along with the fitted regression line.

plot(Auto$horsepower, Auto$mpg, 
     main = "MPG vs Horsepower", 
     xlab = "Horsepower", 
     ylab = "MPG", 
     pch = 17, 
     col = "blue")
abline(lm_fit, col = "red", lwd = 1.5)

(c) Diagnostic Plots for the Regression Fit

We examine diagnostic plots to evaluate the regression model fit.

par(mfrow = c(2, 2))
plot(lm_fit)

Observations:

  • Residuals vs Fitted: The residuals exhibit a clear curved pattern, suggesting a non-linear relationship between horsepower and mpg; a possible remedy is sketched after this list.

  • Normal Q-Q Plot: Residuals align with the diagonal line, though there are slight deviations at the extremes, indicating mild non-normality.

  • Scale-Location Plot: The spread of residuals increases with fitted values, indicating heteroscedasticity.

  • Residuals vs Leverage: Some high-leverage points exist, but none seem overly influential.
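
The curvature noted above suggests that a polynomial term may improve the fit. The following is a minimal sketch, not part of the original analysis, of how one could check this by adding a quadratic term in horsepower:

quad_fit <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)  # add a squared term
summary(quad_fit)        # the quadratic term addresses the curvature seen in the residuals
anova(lm_fit, quad_fit)  # F-test comparing the linear fit against the quadratic fit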


10. Multiple Linear Regression Analysis using the Carseats Dataset

(a) Fitting a Multiple Linear Regression Model

We fit a multiple regression model to predict Sales using Price, Urban, and US as predictors.

model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Interpreting the Coefficients:

  • Intercept (13.04): This represents the expected sales when Price = 0, Urban = “No”, and US = “No”. It serves as a baseline rather than a practically meaningful prediction, since Price = 0 lies well outside the range of observed prices.

  • Price (-0.0545): For each unit increase in price, sales decrease by approximately 0.0545 units, holding other variables constant. This aligns with the expectation that higher prices reduce sales.

  • UrbanYes (-0.0219): Urban location does not significantly affect sales (p-value = 0.936).

  • USYes (1.2006): Being located in the US increases sales by approximately 1.2006 units compared to non-US locations, with high statistical significance (p-value < 0.0001).

(c) Writing the Model Equation:

The regression model is as follows:

Sales = 13.0435 - 0.0545 × Price - 0.0219 × UrbanYes + 1.2006 × USYes

Where:

  • UrbanYes = 1 for urban stores, 0 otherwise.

  • USYes = 1 for US stores, 0 otherwise.
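
To illustrate how the dummy variables enter this equation, the sketch below evaluates it for a hypothetical urban US store with Price = 120 (values chosen purely for illustration) and compares the result with predict():

# Hand computation from the fitted equation for a hypothetical store
# with Price = 120, Urban = "Yes", US = "Yes"
13.0435 - 0.0545 * 120 - 0.0219 * 1 + 1.2006 * 1   # roughly 7.68

# The same prediction obtained directly from the fitted model
predict(model, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))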

(d) Hypothesis Testing:

We reject the null hypothesis (βj = 0) for predictors with p-values less than 0.05:

  • Price (p-value < 2e-16): Significant predictor, reject H0.

  • USYes (p-value < 0.0001): Significant predictor, reject H0.

  • UrbanYes (p-value = 0.936): Not significant, fail to reject H0.

(e) Fitting a Smaller Model:

We now fit a reduced model excluding Urban.

smaller_model <- lm(Sales ~ Price + US, data = Carseats)
summary(smaller_model)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) Comparing Model Fits:

Both the full and reduced models explain approximately the same amount of variance (R² = 0.2393). However, the reduced model is slightly preferable: it achieves a marginally higher adjusted R² (0.2354 vs. 0.2335) and a marginally lower residual standard error (2.469 vs. 2.472) while using one fewer predictor.
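
Because the reduced model is nested within the full model, the comparison can also be made with a formal F-test; a minimal sketch:

anova(smaller_model, model)  # tests whether Urban adds explanatory power beyond Price and US
# A large p-value here supports dropping Urban from the model.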

(g) Confidence Intervals for the Reduced Model:

We calculate 95% confidence intervals for the coefficients of the reduced model.

confint(smaller_model, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Diagnostic Plots for the Reduced Model:

We produce diagnostic plots for the reduced model to check for outliers or high leverage points.

par(mfrow = c(2, 2))
plot(smaller_model)

The diagnostic plots show a few moderately large residuals and a handful of points with elevated leverage, but none appear strongly influential. Computing Cook’s distance directly would make this assessment more precise.
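
A minimal sketch of that check, using the standard influence measures available for lm objects:

cd  <- cooks.distance(smaller_model)   # Cook's distance for each observation
lev <- hatvalues(smaller_model)        # leverage (hat) values

which(cd > 4 / nrow(Carseats))         # common rule-of-thumb cutoff for Cook's distance
which(lev > 2 * 3 / nrow(Carseats))    # leverage above twice the average (p + 1)/n, with p = 2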


14. This problem focuses on the collinearity problem.

(a) Perform the following commands in R:

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

The last line corresponds to creating a linear model where y is a function of x1 and x2. The form of the linear model is as follows:

y = 2 + 2 × x1 + 0.3 × x2 + ε

where ε represents random noise. The true regression coefficients are:

  • Intercept (β0) = 2

  • Coefficient of x1 (β1) = 2

  • Coefficient of x2 (β2) = 0.3

(b) What is the correlation between x1 and x2? Create a scatterplot to display their relationship.

cor(x1, x2)
## [1] 0.8351212

Since x2 = 0.5 × x1 + random noise, we expect a strong positive correlation; the observed value is about 0.84, close to the theoretical value implied by the simulation design (computed in the sketch below).
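
A minimal sketch of that theoretical value, using the variances implied by how x1 and x2 were generated:

var_x1    <- 1 / 12                 # variance of a Uniform(0, 1) variable
var_noise <- 0.1^2                  # variance of rnorm(100) / 10
var_x2    <- 0.25 * var_x1 + var_noise
cov_x1_x2 <- 0.5 * var_x1
cov_x1_x2 / sqrt(var_x1 * var_x2)   # approximately 0.82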

plot(x1, x2, main = "Scatterplot of x1 vs x2", xlab = "x1", ylab = "x2")

(c) Using this data, fit a least squares regression to predict y using x1 and x2. What are the estimated coefficients β̂0, β̂1, and β̂2? Compare them with the true values. Can you reject the null hypothesis H0: β1 = 0? How about H0: β2 = 0?

model <- lm(y ~ x1 + x2)
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

The estimates obtained are:

  • β̂0 = 2.13

  • β̂1 = 1.44

  • β̂2 = 1.01

These deviate from the true values due to collinearity between x1 and x2.

  • H0: β1 = 0 can be rejected at p = 0.0487 (just below 0.05).

  • H0: β2 = 0 cannot be rejected (p = 0.3754, well above 0.05).

(d) Now fit a least squares regression to predict y using only x1. Comment on the results. Can you reject the null hypothesis H0: β1 = 0?

lm_x1 <- lm(y ~ x1)
summary(lm_x1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

We can reject H0: β1 = 0. The p-value for β1 (2.66e-06) is far smaller in this model than it was when x2 was included.

(e) Now fit a least squares regression to predict y using only x2. Comment on the results. Can you reject the null hypothesis H0: β2 = 0?

lm_x2 <- lm(y ~ x2)
summary(lm_x2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

We can reject H0: β2 = 0. The p-value for β2 (1.37e-05) is far smaller in this model than it was when x1 was included.

(f) Do the results obtained in (c)–(e) contradict each other? Explain.

The results do not contradict each other; instead, they highlight the issue of collinearity. Both x1 and x2 appear significant when examined individually, but each loses significance when they are included together, because their correlation inflates the standard errors of the coefficient estimates.
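
Collinearity of this kind can be quantified with variance inflation factors. A minimal sketch, assuming the car package is available (it is not used elsewhere in this document); the same quantity can also be computed by hand from the R² of regressing one predictor on the other:

library(car)                    # assumed to be installed; provides vif()
vif(lm(y ~ x1 + x2))            # VIFs for both predictors

1 / (1 - summary(lm(x2 ~ x1))$r.squared)   # equivalent hand computation: 1 / (1 - R^2)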

(g) Now, suppose we obtain one additional observation, which was unfortunately mismeasured:

x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)

Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each model? Is this observation an outlier or a high-leverage point? Explain.

summary(lm(y ~ x1 + x2))
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
summary(lm(y ~ x1))
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1            1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
summary(lm(y ~ x2))
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2            3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06
par(mfrow = c(2, 2))
plot(lm(y ~ x1 + x2), cex = 0.2)

par(mfrow = c(2, 2))
plot(lm(y ~ x1), cex = 0.2)

par(mfrow = c(2, 2))
plot(lm(y ~ x2), cex = 0.2)

The mismeasured observation affects the models as follows:

  • Model with both x1 and x2: The point has high leverage because it lies far from the bulk of the observed (x1, x2) pairs, and it noticeably shifts the coefficient estimates (x2 now appears significant while x1 does not), even though its residual is not extreme.

  • Model with only x1: The point is an outlier because its y-value deviates markedly from the fitted trend, while its x1-value (0.1) is unremarkable, so its leverage is low.

  • Model with only x2: The point has high leverage because its x2-value (0.8) lies far from the main cluster of x2 values, but it does not produce an unusually large residual.

These classifications can be checked numerically, as in the sketch below.
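
A minimal sketch of that numerical check, reporting the leverage and studentized residual of the added observation (row 101) in each refitted model:

fits <- list(both = lm(y ~ x1 + x2), x1_only = lm(y ~ x1), x2_only = lm(y ~ x2))
sapply(fits, function(f) c(leverage      = unname(hatvalues(f)[101]),
                           student_resid = unname(rstudent(f)[101])))
# High leverage shows up as a hat value well above the average (p + 1)/n;
# an outlier as a studentized residual of roughly 2-3 or more in absolute value.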

This concludes the analysis and highlights the different behaviors of the mismeasured observation across models.