Q14(a) This problem focuses on the collinearity problem. Perform the following commands in R.
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100)/10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The last line corresponds to creating a linear model in which “y” is a function of “x1” and “x2”. Write out the form of the linear model. What are the regression coefficients?
The model has the form \[Y = 2 + 2X_1 + 0.3X_2 + \varepsilon,\] where \(\varepsilon \sim N(0,1)\). The true regression coefficients are \(\beta_0 = 2\), \(\beta_1 = 2\) and \(\beta_2 = 0.3\).
(b) What is the correlation between “x1” and “x2”? Create a scatterplot displaying the relationship between the variables.
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2)
With a correlation of about 0.835, the two variables are highly correlated, and the scatterplot shows a clear linear relationship between them.
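As an optional check (not part of the original output), the least squares line of “x2” on “x1” can be overlaid on the scatterplot to make the strong linear association visible.
# Optional sketch: redraw the scatterplot and overlay the fitted line of x2 on x1.
plot(x1, x2)
abline(lm(x2 ~ x1), col = "red")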
(c) Using this data, fit a least squares regression to predict “y” using “x1” and “x2”. Describe the results obtained. What are \(\hat{\beta}_0\), \(\hat{\beta}_1\) and \(\hat{\beta}_2\)? How do these relate to the true \(\beta_0\), \(\beta_1\) and \(\beta_2\)? Can you reject the null hypothesis \(H_0 : \beta_1 = 0\)? How about the null hypothesis \(H_0 : \beta_2 = 0\)?
lm.fit <- lm(y ~ x1 + x2)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The estimates \(\hat{\beta}_0\), \(\hat{\beta}_1\) and \(\hat{\beta}_2\) are respectively about 2.13, 1.44 and 1.01. Only \(\hat{\beta}_0\) is close to its true value (\(\beta_0 = 2\)); \(\hat{\beta}_1\) underestimates \(\beta_1 = 2\) and \(\hat{\beta}_2\) greatly overestimates \(\beta_2 = 0.3\). Since the p-value for “x1” (0.0487) is just below 0.05, we can reject \(H_0 : \beta_1 = 0\), but only barely; we cannot reject \(H_0 : \beta_2 = 0\), as its p-value (0.375) is well above 0.05.
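As a complementary check (a sketch based on the fit above, not part of the original output), the 95% confidence intervals tell the same story: given the standard errors reported above, the interval for “x2” should comfortably contain 0, while the interval for “x1” should only just exclude it.
# Sketch: 95% confidence intervals for the coefficients of lm.fit.
# Given the standard errors above, the x2 interval should contain 0,
# consistent with failing to reject H0: beta_2 = 0.
confint(lm.fit)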
(d) Now fit a least squares regression to predict “y” using only “x1”. Comment on your results. Can you reject the null hypothesis \(H_0 : \beta_1 = 0\)?
lm.fit2 <- lm(y ~ x1)
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
The coefficient estimate for “x1” (about 1.98) is very different from the estimate obtained with both “x1” and “x2” as predictors, and it is now close to the true value of 2. In this model “x1” is highly significant (p-value \(\approx 2.7 \times 10^{-6}\)), so we can reject \(H_0 : \beta_1 = 0\).
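A short derivation from the data-generating equations above explains why this estimate lands near the true \(\beta_1 = 2\): substituting \(X_2 = 0.5X_1 + \delta\), with \(\delta \sim N(0, 0.1^2)\), into the model for \(Y\) gives \[Y = 2 + 2X_1 + 0.3(0.5X_1 + \delta) + \varepsilon = 2 + 2.15X_1 + 0.3\delta + \varepsilon,\] so the regression of “y” on “x1” alone targets a slope of about 2.15, consistent with the estimate of 1.98 (standard error 0.40).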
(e) Now fit a least squares regression to predict “y” using only “x2”. Comment on your results. Can you reject the null hypothesis \(H_0 : \beta_1 = 0\)?
lm.fit3 <- lm(y ~ x2)
summary(lm.fit3)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
The coefficient estimate for “x2” (about 2.90) is also very different from the estimate obtained with both predictors, and it is far from the true \(\beta_2 = 0.3\). Nevertheless, “x2” is highly significant here (p-value \(\approx 1.4 \times 10^{-5}\)), so we can again reject \(H_0\).
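Again as a sketch (not in the original output), a confidence interval makes the bias explicit: with an estimate of about 2.90 and a standard error of about 0.63, the 95% interval for the “x2” slope should lie far above the true \(\beta_2 = 0.3\); the fit is significant but biased, because “x2” acts as a proxy for “x1”.
# Sketch: 95% confidence interval for the x2 slope in lm.fit3.
# Based on the estimate and standard error above, it should exclude
# the true beta_2 = 0.3 by a wide margin.
confint(lm.fit3)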
(f) Do the results obtained in (c)-(e) contradict each other? Explain your answer.
No, the results do not contradict each other. Because “x1” and “x2” are highly correlated, we are in the presence of collinearity, and it is difficult to separate how each predictor is individually associated with the response. Collinearity reduces the accuracy of the coefficient estimates and inflates their standard errors: in the two-predictor model the standard errors are about 0.721 for “x1” and 1.134 for “x2”, compared with only 0.396 and 0.633 in the single-predictor models. Larger standard errors mean smaller t-statistics, so we may fail to reject \(H_0\) even when a predictor is truly associated with the response; here, the importance of “x2” has been masked by collinearity.
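One common way to quantify this collinearity (a sketch, assuming the car package is installed) is the variance inflation factor. With only two predictors, the VIF reduces to \(1/(1 - \mathrm{cor}(x_1, x_2)^2)\), so here we expect roughly \(1/(1 - 0.835^2) \approx 3.3\) for both predictors.
# Sketch: variance inflation factors for the two-predictor model from (c).
# Assumes the "car" package is installed; the manual formula below gives
# the same value because there are only two predictors.
library(car)
vif(lm.fit)
1 / (1 - cor(x1, x2)^2)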
(g) Now suppose we obtain one additional observation, which was unfortunately mismeasured.
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Explain your answers.
lm.fit4 <- lm(y ~ x1 + x2)
lm.fit5 <- lm(y ~ x1)
lm.fit6 <- lm(y ~ x2)
summary(lm.fit4)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
summary(lm.fit5)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
summary(lm.fit6)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
In the two-predictor model, the roles are reversed compared with (c): “x1” is no longer statistically significant, while “x2” now is. In the single-predictor models, “x1” (second model) and “x2” (third model) both remain highly significant.
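A quick side-by-side comparison (a sketch; lm.fit was fit on the original 100 observations, lm.fit4 on the augmented data) shows how much the two-predictor coefficients move because of this single observation.
# Sketch: two-predictor coefficients before (100 obs) and after (101 obs)
# adding the mismeasured observation.
rbind(before = coef(lm.fit), after = coef(lm.fit4))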
par(mfrow=c(2,2))
plot(lm.fit4)
plot(lm.fit5)
plot(lm.fit6)
In the first model (x1 & x2) and the third model (x2 only), the last point is a high-leverage point: its x2 value (0.8) lies outside the range of the other x2 values. In the second model (x1 only) the last point is not a high-leverage point, since x1 = 0.1 falls well within the range of the other x1 values.
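These visual impressions can be checked numerically (a sketch, not part of the original output) by comparing the leverage of the 101st observation with the average leverage \((p+1)/n\) in each model; a leverage several times the average flags a high-leverage point.
# Sketch: leverage of the new observation versus the average leverage.
n <- length(y)                  # 101 observations after the addition
hatvalues(lm.fit4)[101]; 3 / n  # two predictors: average leverage 3/101
hatvalues(lm.fit5)[101]; 2 / n  # one predictor: average leverage 2/101
hatvalues(lm.fit6)[101]; 2 / n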
plot(predict(lm.fit4), rstudent(lm.fit4))
plot(predict(lm.fit5), rstudent(lm.fit5))
plot(predict(lm.fit6), rstudent(lm.fit6))
In the first model (x1 & x2) and the third model (x2 only), the last point is not an outlier: its studentized residual stays within the usual cutoff of 3 in absolute value. In the second model (x1 only) the last point is an outlier, as its studentized residual falls outside that cutoff.
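The same check can be made numerically (again a sketch): the studentized residual of the 101st observation should exceed 3 in absolute value only in the “x1”-only model.
# Sketch: studentized residual of the new observation in each model.
# |rstudent| greater than about 3 is the usual rule of thumb for an outlier.
rstudent(lm.fit4)[101]
rstudent(lm.fit5)[101]
rstudent(lm.fit6)[101]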