13(a) Using the rnorm() function, create a vector, “x”, containing 100 observations drawn from a \(N(0,1)\) distribution. This represents a feature, \(X\). Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.

set.seed(1)
x <- rnorm(100)

(b) Using the rnorm() function, create a vector, “eps”, containing 100 observations drawn from a \(N(0, 0.25)\) distribution.

eps <- rnorm(100, sd = sqrt(0.25))
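Note that rnorm() is parameterised by the standard deviation, not the variance, hence sd = sqrt(0.25) = 0.5. As a quick sanity check (a small addition, not required by the question), the sample moments of “eps” can be inspected:

# the sample standard deviation should be roughly 0.5 and the variance roughly 0.25
sd(eps)
var(eps)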

(c) Using “x” and “eps”, generate a vector “y” according to the model \[Y = -1 + 0.5X + \varepsilon.\] What is the length of the vector “y”? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?

y <- -1 + 0.5 * x + eps
length(y)
## [1] 100

The values of \(\beta_0\) and \(\beta_1\) are \(-1\) and \(0.5\) respectively.

(d) Create a scatterplot displaying the relationship between “x” and “y”. Comment on what you observe.

plot(x, y)

The relationship between “x” and “y” looks linear with some noise introduced by the “eps” variable.
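To put a rough number on the strength of that linear relationship (a small addition beyond what the question asks for), the sample correlation can be computed:

# sample correlation between x and y; with error sd = 0.5 it is well below 1
cor(x, y)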

(e) Fit a least squares linear model to predict “y” using “x”. Comment on the model obtained. How do \(\hat{\beta}_0\) and \(\hat{\beta}_1\) compare to \(\beta_0\) and \(\beta_1\)?

lm.fit <- lm(y ~ x)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93842 -0.30688 -0.06975  0.26970  1.17309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.01885    0.04849 -21.010  < 2e-16 ***
## x            0.49947    0.05386   9.273 4.58e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared:  0.4674, Adjusted R-squared:  0.4619 
## F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15

The estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are quite close to the true \(\beta_0\) and \(\beta_1\). The F-statistic is large and its p-value is essentially zero, so the null hypothesis \(H_0: \beta_1 = 0\) can be rejected.
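A compact way to see this (a small sketch, not part of the original output) is to place the estimates next to the true coefficients:

# estimated coefficients alongside the true values beta_0 = -1, beta_1 = 0.5
rbind(estimate = coef(lm.fit), truth = c(-1, 0.5))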

(f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() function to create an appropriate legend.

plot(x, y)
abline(lm.fit, col = "red")
abline(-1, 0.5, col = "blue")
legend("topleft", c("lm.fit Least square", "Regression"), col = c("red", "blue"), lty = c(1, 1))

(g) Now fit a polynomial regression model that predicts “y” using “x” and “x^2”. Is there evidence that the quadratic term improves the model fit? Explain your answer.

lm.fit2 <- lm(y ~ x + I(x^2))
summary(lm.fit2)
## 
## Call:
## lm(formula = y ~ x + I(x^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98252 -0.31270 -0.06441  0.29014  1.13500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.97164    0.05883 -16.517  < 2e-16 ***
## x            0.50858    0.05399   9.420  2.4e-15 ***
## I(x^2)      -0.05946    0.04238  -1.403    0.164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.479 on 97 degrees of freedom
## Multiple R-squared:  0.4779, Adjusted R-squared:  0.4672 
## F-statistic:  44.4 on 2 and 97 DF,  p-value: 2.038e-14

The coefficient of “x^2” is not statistically significant (p-value 0.164 > 0.05), so there is not sufficient evidence that the quadratic term improves the model fit, even though \(R^2\) is slightly higher and the \(RSE\) slightly lower than for the linear model.
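Equivalently, a nested-model F-test compares the two fits directly; for a single added term its p-value coincides with the t-test on the quadratic coefficient shown above:

# F-test comparing the linear model to the model with the quadratic term
anova(lm.fit, lm.fit2)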

(h) Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The initial model should remain the same. Describe your results.

set.seed(1)
eps <- rnorm(100, sd = 0.125)
x <- rnorm(100)
y <- -1 + 0.5 * x + eps
plot(x, y)
lm.fit3 <- lm(y ~ x)
summary(lm.fit3)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29052 -0.07545  0.00067  0.07288  0.28664 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.98639    0.01129  -87.34   <2e-16 ***
## x            0.49988    0.01184   42.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1128 on 98 degrees of freedom
## Multiple R-squared:  0.9479, Adjusted R-squared:  0.9474 
## F-statistic:  1782 on 1 and 98 DF,  p-value: < 2.2e-16
abline(lm.fit3, col = "red")
abline(-1, 0.5, col = "blue")
legend("topleft", c("lm.fit3 Least square", "Regression"), col = c("red", "blue"), lty = c(1, 1))

We reduced the noise by shrinking the standard deviation of the error term \(\varepsilon\) from 0.5 to 0.125. The coefficient estimates are again very close to the true values, but with so little noise \(R^2\) is much higher and the \(RSE\) much lower than for the original fit. The least squares line and the population regression line essentially overlap.
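To make the comparison explicit (a small addition beyond the original output), the \(R^2\) and residual standard error of the two fits can be extracted directly:

# R-squared and RSE: original fit vs. low-noise fit
c(original = summary(lm.fit)$r.squared, low_noise = summary(lm.fit3)$r.squared)
c(original = summary(lm.fit)$sigma, low_noise = summary(lm.fit3)$sigma)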

(i) Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The initial model should remain the same. Describe your results.

set.seed(1)
eps <- rnorm(100, sd = 0.5)
x <- rnorm(100)
y <- -1 + 0.5 * x + eps
plot(x, y)
lm.fit4 <- lm(y ~ x)
summary(lm.fit4)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.16208 -0.30181  0.00268  0.29152  1.14658 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.94557    0.04517  -20.93   <2e-16 ***
## x            0.49953    0.04736   10.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4514 on 98 degrees of freedom
## Multiple R-squared:  0.5317, Adjusted R-squared:  0.5269 
## F-statistic: 111.2 on 1 and 98 DF,  p-value: < 2.2e-16
abline(lm.fit4, col = "red")
abline(-1, 0.5, col = "blue")
legend("topleft", c("lm.fit4 Least square", "Regression"), col = c("red", "blue"), lty = c(1, 1))

Here the error standard deviation is 0.5, the same level as in the original data from (b), so the noise is much larger than in part (h). The coefficient estimates remain close to the true values, but \(R^2\) is much lower and the \(RSE\) much higher than for the low-noise fit in (h). The least squares and population regression lines are farther apart than in (h), yet still quite close to each other, since the sample is fairly large.
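One way to quantify how far apart the two lines are (an illustrative sketch, not part of the original answer) is the largest vertical gap between them over the observed range of “x”:

# maximum vertical distance between the fitted line and the population regression line
fitted_line <- coef(lm.fit4)[1] + coef(lm.fit4)[2] * x
population_line <- -1 + 0.5 * x
max(abs(fitted_line - population_line))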

(j) What are the confidence intervals for \(\beta_0\) and \(\beta_1\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.

confint(lm.fit)
##                  2.5 %     97.5 %
## (Intercept) -1.1150804 -0.9226122
## x            0.3925794  0.6063602
confint(lm.fit3)
##                 2.5 %     97.5 %
## (Intercept) -1.008805 -0.9639819
## x            0.476387  0.5233799
confint(lm.fit4)
##                  2.5 %     97.5 %
## (Intercept) -1.0352203 -0.8559276
## x            0.4055479  0.5935197

The slope intervals are all centered approximately on 0.5 and the intercept intervals on \(-1\). The intervals for the low-noise data in (h) are much narrower than the other two, which are similar in width since both data sets were generated with an error standard deviation of 0.5. In general, the noisier the data, the wider the confidence intervals and the less precise the estimates.
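The difference in precision can also be read off directly from the interval widths (a small addition to the output above):

# widths of the 95% confidence intervals for each fit
apply(confint(lm.fit), 1, diff)   # original data
apply(confint(lm.fit3), 1, diff)  # less noisy data from (h)
apply(confint(lm.fit4), 1, diff)  # data from (i)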