PART II: Practice Problems

Problem 1: Investigating the T-stat

In this problem we will investigate the t-statistic for the null hypothesis \(H_0 : \beta = 0\) in simple linear regression without an intercept. To begin, we generate a predictor \(x\) and a response \(y\) as follows.

set.seed(1)
x <- rnorm(100)
y <- 2*x + rnorm(100)

a) Perform a simple linear regression of \(y\) onto \(x\), without an intercept. Report the coefficient estimate \(\hat{\beta}\), the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis \(H_0 : \beta = 0\). Comment on these results. Hint: You can perform regression without an intercept using the command lm(y~x+0).

betaHat <- lm(y~x+0)
# summary() reports the estimate, standard error, t-statistic, and p-value
summary(betaHat)
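As a cross-check on part (a), the estimate and its standard error can also be computed by hand. The sketch below assumes the standard least squares formulas for regression without an intercept, \(\hat{\beta} = \sum x_i y_i / \sum x_i^2\) and \(SE(\hat{\beta}) = \sqrt{\sum (y_i - x_i\hat{\beta})^2 / ((n-1)\sum x_i^2)}\). The names fit_a, beta_manual, se_manual, t_manual, and p_manual are introduced here for illustration only.

```r
# Regenerate the data from the problem statement
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
n <- length(x)

# Fit with lm() for comparison
fit_a <- lm(y ~ x + 0)

# Hand computation for regression through the origin
beta_manual <- sum(x * y) / sum(x^2)
se_manual <- sqrt(sum((y - x * beta_manual)^2) / ((n - 1) * sum(x^2)))
t_manual <- beta_manual / se_manual
# Two-sided p-value on n - 1 degrees of freedom (no intercept is estimated)
p_manual <- 2 * pt(abs(t_manual), df = n - 1, lower.tail = FALSE)

# The two rows of numbers should agree
coef(summary(fit_a))
c(estimate = beta_manual, std_error = se_manual,
  t_value = t_manual, p_value = p_manual)
```

The residual degrees of freedom are \(n - 1 = 99\) rather than \(n - 2\) because only one coefficient is estimated.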

b) Now perform a simple linear regression of \(x\) onto \(y\) without an intercept, and report the coefficient estimate, its standard error, and the corresponding t-statistic and p-value associated with the null hypothesis \(H_0 : \beta = 0\). Comment on the results.

SLR <-  lm(x~y + 0)
summary(SLR)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8699 -0.2368  0.1030  0.2858  0.8938 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.39111    0.02089   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

c) What is the relationship between the results obtained in (a) and (b)?

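Swapping the roles of \(x\) and \(y\) changes the fitted line: the coefficient estimate and its standard error differ between (a) and (b), since (b) estimates the slope of \(x\) on \(y\) (0.39111) rather than the slope of \(y\) on \(x\). The t-statistic (18.73) and its p-value are identical in both regressions, however, because the t-statistic for a regression without an intercept is symmetric in \(x\) and \(y\).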
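The symmetry can be checked numerically. For regression without an intercept, the t-statistic can be written as \(t = \sqrt{n-1}\,\sum x_i y_i \big/ \sqrt{\sum x_i^2 \sum y_i^2 - (\sum x_i y_i)^2}\), an expression that is unchanged when \(x\) and \(y\) are swapped. The sketch below compares this formula with the values reported by lm(); t_formula, t_yx, and t_xy are illustrative names, not part of the original solution.

```r
# Regenerate the data from the problem statement
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
n <- length(x)

# Closed-form t-statistic; note that x and y enter symmetrically
t_formula <- sqrt(n - 1) * sum(x * y) /
  sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)

# t-statistics reported by lm() for each direction, without an intercept
t_yx <- coef(summary(lm(y ~ x + 0)))["x", "t value"]
t_xy <- coef(summary(lm(x ~ y + 0)))["y", "t value"]

c(formula = t_formula, y_onto_x = t_yx, x_onto_y = t_xy)
```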

d) In R, show that when regression is performed with an intercept, the t-statistic for \(H_0 : \beta_1 = 0\) is the same for the regression of \(y\) onto \(x\) as it is for the regression of \(x\) onto \(y\).

The t-statistic for the slope is 18.56 in both the regression of \(y\) onto \(x\) and the regression of \(x\) onto \(y\).

reg1 <- lm(y~x)
reg2 <- lm(x~y)

summary(reg1)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8768 -0.6138 -0.1395  0.5394  2.3462 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03769    0.09699  -0.389    0.698    
## x            1.99894    0.10773  18.556   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
summary(reg2)
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90848 -0.28101  0.06274  0.24570  0.85736 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03880    0.04266    0.91    0.365    
## y            0.38942    0.02099   18.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
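
One way to see why the t-statistics in (d) agree: with an intercept, the slope t-statistic depends on the data only through the sample correlation \(r\), via \(t = r\sqrt{n-2}/\sqrt{1-r^2}\), and \(r\) is unchanged when \(x\) and \(y\) are swapped. A minimal check, with illustrative variable names (t_reg1, t_reg2, t_from_r):

```r
# Regenerate the data from the problem statement
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
n <- length(x)

# Slope t-statistics from the two regressions with an intercept
t_reg1 <- coef(summary(lm(y ~ x)))["x", "t value"]
t_reg2 <- coef(summary(lm(x ~ y)))["y", "t value"]

# The same statistic computed from the sample correlation alone
r <- cor(x, y)
t_from_r <- r * sqrt(n - 2) / sqrt(1 - r^2)

c(y_onto_x = t_reg1, x_onto_y = t_reg2, from_correlation = t_from_r)
```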

Problem 2: SLR Estimation

In this exercise you will create some simulated data and will fit a simple linear regression model to it. Make sure to use set.seed(1) prior to starting (a) to ensure consistent results.

a) Using the rnorm() function, create a vector, \(x\), containing 100 observations drawn from a Normal(0, 1) distribution. This represents an explanatory variable (aka a feature), X.

b) Using the rnorm() function, create a vector, \(eps\), containing 100 observations drawn from a Normal(0, 0.25) distribution, i.e. a normal distribution with mean zero and variance 0.25. This represents the error (or noise).

set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
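# Note: rnorm() takes a standard deviation, so sd = 0.25 gives variance 0.0625.
# Variance 0.25, as stated in (b), corresponds to sd = sqrt(0.25) = 0.5.
# The results reported below were obtained with sd = 0.25.
eps <- rnorm(100, mean = 0, sd = 0.25)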

c) Using \(x\) and \(eps\), generate a vector \(y\) according to the model \(Y= -1+0.5X+\epsilon\). What is the length of the vector \(y\)? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?

Length of \(y\): 100. \(\beta_0 = -1\) and \(\beta_1 = 0.5\).

y <- -1 + (0.5*x) + eps

d) Create a scatterplot displaying the relationship between \(x\) and \(y\). Comment on what you observe.

The scatterplot shows a positive, approximately linear relationship between \(x\) and \(y\), with some scatter around the line due to the noise term.

# Response y on the vertical axis, predictor x on the horizontal axis
plot(y~x)

e) Fit a least squares linear model to predict \(y\) using \(x\). Comment on the model obtained. How do \(\hat{\beta_0}\) and \(\hat{\beta_1}\) compare to \(\beta_0\) and \(\beta_1\)?

\(\hat{\beta_0}\): -1.0094 \(\hat{\beta_1}\): 0.4997

The estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are both very close to the true values \(\beta_0 = -1\) and \(\beta_1 = 0.5\).

lsq <- lm(y~x)
lsq
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     -1.0094       0.4997

f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Try to create a legend indicating the two different lines.

The blue line is the least squares fit, and the red line is the population regression line \(Y = -1 + 0.5X\).

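plot(y~x)
abline(lsq, col = "blue")
abline(a = -1, b = 0.5, col = "red")
# Legend identifying the two lines, as requested in (f)
legend("topleft", legend = c("Least squares line", "Population regression line"),
       col = c("blue", "red"), lty = 1)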

g) Repeat (a)-(f) after modifying the data generating process in such a way that there is less noise in the data. The population model from part (c) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term \(eps\) in (b). Describe your results. How does this differ from the model in (e)?

The standard deviation of the error has been reduced from 0.25 to 0.1. The blue line is the least squares fit, and the red line is the population regression line.

\(\beta_0\): -1 \(\beta_1\): 0.5

\(\hat{\beta_0}\): -1.0038 \(\hat{\beta_1}\): 0.4999

With less noise, the estimates are even closer to the true coefficients than in (e).

set.seed(1)
x2 <- rnorm(100, mean = 0, sd = 1)
eps2 <- rnorm(100, mean = 0, sd = 0.1)

y2 <- -1 + (0.5*x2) + eps2

lsq2 <- lm(y2~x2)
lsq2
## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Coefficients:
## (Intercept)           x2  
##     -1.0038       0.4999
plot(y2~x2)
abline(lsq2, col = "blue")
abline(a = -1, b = 0.5, col = "red")

h) Repeat (a)-(f) after modifying the data generating process in such a way that there is more noise in the data. The population model from part (c) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term \(eps\) in (b). Describe your results. How does this differ from the model in (e)?

The standard deviation of the error has been increased from 0.25 to 1. The blue line is the least squares fit, and the red line is the population regression line.

\(\beta_0\): -1 \(\beta_1\): 0.5

\(\hat{\beta_0}\): -1.0377 \(\hat{\beta_1}\): 0.4989

With more noise, the estimates are farther from the true coefficients than in the previous models.

set.seed(1)
x3 <- rnorm(100, mean = 0, sd = 1)
eps3 <- rnorm(100, mean = 0, sd = 1)

y3 <- -1 + (0.5*x3) + eps3

lsq3 <- lm(y3~x3)
lsq3
## 
## Call:
## lm(formula = y3 ~ x3)
## 
## Coefficients:
## (Intercept)           x3  
##     -1.0377       0.4989
plot(y3~x3)
abline(lsq3, col = "blue")
abline(a = -1, b = 0.5, col = "red")
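
The three fits from (e), (g), and (h) can be compared side by side. The sketch below wraps the data generation and fit in a helper, fit_with_noise(), which is a hypothetical name introduced here and not part of the original solution. It reuses set.seed(1) for each noise level, as the code above does, and reports the residual standard error and \(R^2\) alongside the coefficient estimates.

```r
# Hypothetical helper: simulate Y = -1 + 0.5 X + eps for a given noise
# standard deviation and return key quantities from the least squares fit.
fit_with_noise <- function(noise_sd) {
  set.seed(1)  # same seed as in (e), (g), and (h)
  x <- rnorm(100, mean = 0, sd = 1)
  eps <- rnorm(100, mean = 0, sd = noise_sd)
  y <- -1 + 0.5 * x + eps
  fit <- lm(y ~ x)
  s <- summary(fit)
  c(noise_sd = noise_sd,
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2]),
    sigma = s$sigma,          # residual standard error
    r_squared = s$r.squared)
}

# One row per noise level: less noisy (g), original (e), noisier (h)
results <- t(sapply(c(0.1, 0.25, 1), fit_with_noise))
results
```

As the noise standard deviation grows, the residual standard error rises and \(R^2\) falls, even though the population model is the same in all three cases.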

i) What are the confidence intervals for \(\beta_0\) and \(\beta_1\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.

The model fit to the least noisy data (x2, error standard deviation 0.1) has the narrowest confidence intervals, and the model fit to the noisiest data (x3, error standard deviation 1) has the widest. This makes sense: the standard errors of the coefficient estimates are proportional to the residual standard error, so less scatter around the regression line gives tighter intervals. All three sets of intervals contain the true values \(\beta_0 = -1\) and \(\beta_1 = 0.5\).

conf1 <- confint(lsq)
conf2 <- confint(lsq2)
conf3 <- confint(lsq3)

conf1
##                  2.5 %     97.5 %
## (Intercept) -1.0575402 -0.9613061
## x            0.4462897  0.5531801
conf2
##                  2.5 %     97.5 %
## (Intercept) -1.0230161 -0.9845224
## x2           0.4785159  0.5212720
conf3
##                  2.5 %     97.5 %
## (Intercept) -1.2301607 -0.8452245
## x3           0.2851588  0.7127204
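
The intervals returned by confint() can be reproduced by hand, which makes the dependence on the standard error explicit. The sketch below assumes the usual 95% t-interval, estimate \(\pm\ t_{0.975,\,n-2} \times\) standard error, and applies it to the original data set. The names fit, manual_ci, and t_crit are illustrative.

```r
# Regenerate the original data set from (a)-(c)
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
eps <- rnorm(100, mean = 0, sd = 0.25)
y <- -1 + 0.5 * x + eps

fit <- lm(y ~ x)
coef_table <- coef(summary(fit))
est <- coef_table[, "Estimate"]
se <- coef_table[, "Std. Error"]

# Critical value of the t distribution on the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

# Manual 95% intervals, to compare with confint()
manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
manual_ci
confint(fit)
```

Because the interval width is \(2 \times t_{0.975,\,n-2} \times\) standard error, scaling the noise standard deviation scales the width of each interval by the same factor.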