This work is part of my effort to become a well-versed data analyst. At this point, and for the immediate future, I will undoubtedly be a novice at using R and at solving the problem sets from this book, so my solutions will at times reflect my limited abilities. With more practice, the quality and depth of my work will improve (that is the whole point!). I welcome you to comment on and critique my work to help me improve.
This exercise focuses on the collinearity problem.
Perform the following commands in R. The last line creates a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
set.seed(1)
x1 = runif(100)
x2 = 0.5 * x1 + rnorm(100) / 10
y = 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The model has the form Y = β0 + β1X1 + β2X2 + ε, where ε is the N(0, 1) noise added by rnorm(100). The regression coefficients are β0 = 2, β1 = 2, and β2 = 0.3.
What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
cor(x1, x2); plot(x1, x2)
## [1] 0.8351212
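Since x2 is constructed directly from x1 plus a little noise, the strong correlation of about 0.84 is expected. As a supplementary look, not part of the original write-up, a scatterplot matrix shows all pairwise relationships at once:

pairs(cbind(x1, x2, y))   # pairwise scatterplots of the two predictors and the response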
Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are β̂0, β̂1, and β̂2? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0: β1 = 0? How about the null hypothesis H0: β2 = 0?
lm.fit = lm(y ~ x1 + x2)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
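The very high correlation between x1 and x2 inflates the standard errors of both slope estimates in this fit. As a supplementary check, not part of the original write-up and assuming the car package is installed, the variance inflation factors make this explicit:

library(car)       # provides vif(); assumed to be installed
vif(lm.fit)        # values well above 1 indicate inflated coefficient variances
confint(lm.fit)    # the wide intervals for x1 and x2 reflect the inflated standard errors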
With both predictors in the model, β̂0 ≈ 2.13, β̂1 ≈ 1.44, and β̂2 ≈ 1.01. The intercept is close to its true value of 2, but β̂1 underestimates β1 = 2, β̂2 overestimates β2 = 0.3, and both slope estimates carry large standard errors. We can reject H0: β1 = 0 only marginally (p ≈ 0.049), and we cannot reject H0: β2 = 0 (p ≈ 0.375).
Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?
lm.fit2 = lm(y ~ x1)
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
We can reject the null hypothesis H0: β1 = 0 because of the very low associated p-value. In addition, per the R-squared value, the predictor x1 can, on its own, explain about 20% of the variance in the response variable y.
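As a supplementary cross-check, a nested-model comparison with base R's anova() asks whether adding x2 to a model that already contains x1 improves the fit; because only one term is added, its p-value matches the one reported for x2 in the joint fit above (about 0.375):

anova(lm.fit2, lm.fit)   # F-test for adding x2 on top of x1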
Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0: β2 = 0?
lm.fit3 = lm(y ~ x2)
summary(lm.fit3)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
Similarly, we can reject the null hypothesis H0: β2 = 0 because of the very low associated p-value. Per the R-squared value, the predictor x2 can, on its own, explain about 17% of the variance in the response variable y.
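The symmetric check, again supplementary, asks whether adding x1 on top of x2 helps; its p-value matches the marginal value reported for x1 in the joint fit (about 0.049):

anova(lm.fit3, lm.fit)   # F-test for adding x1 on top of x2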
Do the results obtained in (c)–(e) contradict each other? Explain your answer.
In part (c), the multiple linear regression suggested that neither predictor was clearly significant: x1 was only marginally so and x2 not at all. In parts (d) and (e), separate simple regressions show that x1 and x2 are each highly significant on their own, explaining about 20% and 17% of the variance in y respectively. By the gods, this looks like a contradiction! It is not one, though: because x1 and x2 are highly correlated (r ≈ 0.84), the joint model cannot separate their individual effects, so the standard errors are inflated and each predictor looks unimportant once the other is included.
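One way to see the collinearity at work, as a supplementary check rather than part of the original output, is to compare the standard errors of the same coefficients across the three fits; each roughly doubles when both predictors enter the model together:

coef(summary(lm.fit))[, "Std. Error"]    # joint fit:  SE(x1) ≈ 0.72, SE(x2) ≈ 1.13
coef(summary(lm.fit2))[, "Std. Error"]   # x1 alone:   SE(x1) ≈ 0.40
coef(summary(lm.fit3))[, "Std. Error"]   # x2 alone:   SE(x2) ≈ 0.63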
Now suppose we obtain one additional observation, which was unfortunately mismeasured. Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
x1 = c(x1, 0.1)
x2 = c(x2, 0.8)
y = c(y, 6)
lm.fit = lm(y ~ x1 + x2)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
plot(lm.fit)   # diagnostic plots, including residuals vs. leverage with Cook's distance contours
With the new observation included, x2 is now the statistically significant predictor while x1 no longer is, which is the reverse of the original fit.
For this new model, judging from Cook's distance in the residuals-vs-leverage plot, the new point is a high-leverage point but not an outlier.
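This claim can also be checked numerically rather than only from the plot; as a supplementary check, observation 101 is the added point, hatvalues() gives its leverage, and rstudent() its studentized residual:

hatvalues(lm.fit)[101]   # leverage; values far above the average (2 + 1) / 101 ≈ 0.03 indicate high leverage
rstudent(lm.fit)[101]    # studentized residual; magnitudes beyond about 3 would indicate an outlier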
lm.fit2 = lm(y ~ x1)
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
plot(lm.fit2)   # diagnostic plots for the x1-only refit
x1 remains a statistically significant predictor of y in both fits, with and without the new observation. However, the R-squared drops noticeably, from about 0.20 to 0.16, so this refit model is worse than the original one.
For this model, the added point is an outlier, though not an extreme one: the largest residual grows from about 2.46 to 3.57 once it is included. It is not a high-leverage point, since x1 = 0.1 sits well inside the range of the other x1 values.
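The same two diagnostics, again supplementary, for the x1-only refit:

hatvalues(lm.fit2)[101]   # x1 = 0.1 lies well inside the range of the other x1 values, so leverage should be low
rstudent(lm.fit2)[101]    # a large studentized residual here would back up the outlier claim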
lm.fit3 = lm(y ~ x2)
summary(lm.fit3)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
plot(lm.fit3)   # diagnostic plots for the x2-only refit
x2 remains a statistically significant predictor of y in both fits, with and without the new observation. Here the R-squared rises, from about 0.18 to 0.21, so this refit model is somewhat better than the original one.
For this model, using cook’s distance as a reference, the added data point is not an outlier or leverage point.