13(a) Using the rnorm() function, create a vector, “x”, containing 100 observations drawn from a \(N(0,1)\) distribution. This represents a feature, \(X\). Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.
set.seed(1)
x <- rnorm(100)
(b) Using the rnorm() function, create a vector, “eps”, containing 100 observations drawn from a \(N(0, 0.25)\) distribution, i.e. a normal distribution with mean zero and variance 0.25 (so the standard deviation is \(\sqrt{0.25} = 0.5\)).
eps <- rnorm(100, sd = sqrt(0.25))
(c) Using “x” and “eps”, generate a vector “y” according to the model \[Y = -1 + 0.5X + \varepsilon.\] What is the length of the vector “y” ? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model ?
y <- -1 + 0.5 * x + eps
length(y)
## [1] 100
The values of \(\beta_0\) and \(\beta_1\) are \(-1\) and \(0.5\) respectively.
(d) Create a scatterplot displaying the relationship between “x” and “y”. Comment on what you observe.
plot(x, y)
The relationship between “x” and “y” looks roughly linear, with scatter around the line introduced by the noise term “eps”.
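As a quick numerical complement to the plot (not part of the original answer), the sample correlation between “x” and “y” summarizes how tight the linear relationship is:
# Sample correlation between x and y; strong, but below 1 because of the noise in eps.
cor(x, y)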
(e) Fit a least squares linear model to predict “y” using “x”. Comment on the model obtained. How do \(\hat{\beta}_0\) and \(\hat{\beta}_1\) compare to \(\beta_0\) and \(\beta_1\) ?
lm.fit <- lm(y ~ x)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93842 -0.30688 -0.06975 0.26970 1.17309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.01885 0.04849 -21.010 < 2e-16 ***
## x 0.49947 0.05386 9.273 4.58e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared: 0.4674, Adjusted R-squared: 0.4619
## F-statistic: 85.99 on 1 and 98 DF, p-value: 4.583e-15
The estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are very close to the true values \(\beta_0 = -1\) and \(\beta_1 = 0.5\); each lies within about one standard error of the truth. The model has a large F-statistic with a near-zero p-value, so the null hypothesis \(H_0 : \beta_1 = 0\) can be rejected.
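A compact way to make the comparison (a small sketch, not part of the original output) is to put the fitted coefficients next to the true values:
# Fitted coefficients alongside the true values beta_0 = -1 and beta_1 = 0.5.
cbind(estimate = coef(lm.fit), truth = c(-1, 0.5))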
(f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() function to create an appropriate legend.
plot(x, y)
abline(lm.fit, col = "red")
abline(-1, 0.5, col = "blue")
legend("topleft", c("lm.fit Least square", "Regression"), col = c("red", "blue"), lty = c(1, 1))
(g) Now fit a polynomial regression model that predicts “y” using “x” and “x^2”. Is there evidence that the quadratic term improves the model fit ? Explain your answer.
lm.fit2 <- lm(y ~ x + I(x^2))
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98252 -0.31270 -0.06441 0.29014 1.13500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.97164 0.05883 -16.517 < 2e-16 ***
## x 0.50858 0.05399 9.420 2.4e-15 ***
## I(x^2) -0.05946 0.04238 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 97 degrees of freedom
## Multiple R-squared: 0.4779, Adjusted R-squared: 0.4672
## F-statistic: 44.4 on 2 and 97 DF, p-value: 2.038e-14
The coefficient for “x^2” is not statistically significant (its p-value of 0.164 is higher than 0.05), so there is not sufficient evidence that the quadratic term improves the model fit, even though \(R^2\) is slightly higher and the \(RSE\) slightly lower than for the linear model (as expected whenever an extra term is added).
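Equivalently (not shown in the original output), the quadratic term can be assessed with a nested-model F-test via anova(); with a single added term, its p-value matches the t-test p-value for I(x^2) above.
# Nested-model comparison: does adding I(x^2) significantly reduce the residual sum of squares?
anova(lm.fit, lm.fit2)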
(h) Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The model should remain the same; you can do this by decreasing the variance of the normal distribution used to generate the error term \(\varepsilon\) in (b). Describe your results.
set.seed(1)
eps <- rnorm(100, sd = 0.125)
x <- rnorm(100)
y <- -1 + 0.5 * x + eps
plot(x, y)
lm.fit3 <- lm(y ~ x)
summary(lm.fit3)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29052 -0.07545 0.00067 0.07288 0.28664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.98639 0.01129 -87.34 <2e-16 ***
## x 0.49988 0.01184 42.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1128 on 98 degrees of freedom
## Multiple R-squared: 0.9479, Adjusted R-squared: 0.9474
## F-statistic: 1782 on 1 and 98 DF, p-value: < 2.2e-16
abline(lm.fit3, col = "red")
abline(-1, 0.5, col = "blue")
legend("topleft", c("lm.fit3 Least square", "Regression"), col = c("red", "blue"), lty = c(1, 1))
We reduced the noise by decreasing the standard deviation of the normal distribution used to generate the error term \(\varepsilon\) (from 0.5 to 0.125). The coefficient estimates are again very close to the true values, but because the points now lie much closer to the line, \(R^2\) is much higher and the \(RSE\) much lower than before. The least squares line and the population regression line almost overlap, since there is very little noise.
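As a quick check (not part of the original output), the residual standard error can be extracted from the summary object; since the true error standard deviation was set to 0.125 here, the estimate should be in that neighborhood:
# Extract sigma-hat (the residual standard error) from the low-noise fit.
summary(lm.fit3)$sigma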
(i) Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The model should remain the same; you can do this by increasing the variance of the normal distribution used to generate the error term \(\varepsilon\) in (b). Describe your results.
set.seed(1)
eps <- rnorm(100, sd = 0.5)
x <- rnorm(100)
y <- -1 + 0.5 * x + eps
plot(x, y)
lm.fit4 <- lm(y ~ x)
summary(lm.fit4)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.16208 -0.30181 0.00268 0.29152 1.14658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.94557 0.04517 -20.93 <2e-16 ***
## x 0.49953 0.04736 10.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4514 on 98 degrees of freedom
## Multiple R-squared: 0.5317, Adjusted R-squared: 0.5269
## F-statistic: 111.2 on 1 and 98 DF, p-value: < 2.2e-16
abline(lm.fit4, col = "red")
abline(-1, 0.5, col = "blue")
legend("topleft", c("lm.fit4 Least square", "Regression"), col = c("red", "blue"), lty = c(1, 1))
Here the error term was generated with standard deviation 0.5, so the data are noisier than in part (h); note, however, that this is the same standard deviation as in the original part (b) (variance 0.25), so the noise level actually matches the original data set rather than exceeding it. The coefficient estimates are again very close to the true values, but relative to part (h) the points scatter much more widely around the line, giving a much lower \(R^2\) and much higher \(RSE\). The least squares line and the population regression line are farther apart than in (h), but still close to each other, since the data set is fairly large.
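For a side-by-side view (a small sketch, not part of the original output; the list names are chosen here for readability), the \(R^2\) and \(RSE\) of the three fits can be tabulated together:
# Compare R^2 and residual standard error across the three fits:
# lm.fit3 (error sd 0.125), lm.fit (error sd 0.5), lm.fit4 (error sd 0.5).
fits <- list(low_noise = lm.fit3, original = lm.fit, part_i = lm.fit4)
sapply(fits, function(m) c(R2 = summary(m)$r.squared, RSE = summary(m)$sigma))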
(j) What are the confidence intervals for \(\beta_0\) and \(\beta_1\) based on the original data set, the noisier data set, and the less noisy data set ? Comment on your results.
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) -1.1150804 -0.9226122
## x 0.3925794 0.6063602
confint(lm.fit3)
## 2.5 % 97.5 %
## (Intercept) -1.008805 -0.9639819
## x 0.476387 0.5233799
confint(lm.fit4)
## 2.5 % 97.5 %
## (Intercept) -1.0352203 -0.8559276
## x 0.4055479 0.5935197
The intervals for \(\beta_1\) are all centered near 0.5 (and those for \(\beta_0\) near \(-1\)). The intervals for the low-noise data set are by far the narrowest: with less noise, there is more predictability in the data and the coefficients are estimated more precisely. The intervals for the original fit and for lm.fit4 have similar widths, which is consistent with their error terms having the same standard deviation (0.5); in general, as the noise increases, the confidence intervals widen.
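The widening can also be quantified directly (a sketch, not part of the original answer) by computing the width of each 95% confidence interval for the slope:
# Width of the confidence interval for the slope in each fit;
# narrower widths correspond to lower-noise data.
sapply(list(low_noise = lm.fit3, original = lm.fit, part_i = lm.fit4),
       function(m) diff(confint(m)["x", ]))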