library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)

Problem 1: Investigating the t-statistic.

In this problem we will investigate the t-statistic for the null hypothesis H0: beta = 0 in simple linear regression without an intercept. To begin, we generate a predictor x and a response variable y as follows.

set.seed(1)
x<-rnorm(100)
y<-2*x+rnorm(100)
  (a) Perform a simple linear regression of y onto x, without an intercept. Report the coefficient estimate “beta-hat”, the standard error, and the corresponding t-statistic and p-value associated with the null hypothesis. Comment on the results.
m1<-lm(y~x+0)
summary(m1)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9939     0.1065   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

The coefficient estimate beta-hat is 1.9939, the standard error is 0.1065, the t-statistic is 18.73, and the p-value is < 2e-16. The estimate is very close to the true slope of 2 used to generate the data, its standard error is small, and the tiny p-value means we reject the null hypothesis that beta = 0.
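
As a quick check (a sketch, not required by the problem), the estimate, standard error, and t-statistic can be recomputed directly from the no-intercept least squares formulas:
beta_hat <- sum(x * y) / sum(x^2)                     # slope estimate with no intercept
rss      <- sum((y - beta_hat * x)^2)                 # residual sum of squares
se_hat   <- sqrt(rss / ((length(x) - 1) * sum(x^2)))  # standard error of beta_hat
c(estimate = beta_hat, std_error = se_hat, t_value = beta_hat / se_hat)  # should match summary(m1)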

  (b) Now perform a simple linear regression of x onto y without an intercept, and report the coefficient estimate, its standard error, and the corresponding t-statistic and p-value associated with the null hypothesis. Comment on the results.
m2<-lm(x~y+0)
summary(m2)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8699 -0.2368  0.1030  0.2858  0.8938 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.39111    0.02089   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

The coefficient estimate is 0.39111, the standard error is 0.02089, the t-statistic is again 18.73, and the p-value is the same (< 2e-16). So the regression of x onto y also shows a highly significant linear relationship, and we again reject the null hypothesis that beta = 0.

  (c) What is the relationship between the results obtained in (a) and (b)?

The two regressions produce exactly the same t-statistic (18.73) and p-value, because they test the same linear relationship between x and y, just from opposite directions; only the coefficient estimates (1.9939 vs. 0.3911) differ.
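
One way to see why (a sketch using the x and y generated above, not part of the required answer): the no-intercept t-statistic can be rewritten in a form that is symmetric in x and y, so swapping the roles of x and y cannot change it.
n <- length(x)
# t-statistic for the no-intercept regression, written symmetrically in x and y;
# this should reproduce the 18.73 reported by both summary(m1) and summary(m2).
sqrt(n - 1) * sum(x * y) / sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)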

  (d) In R, show that when the regression is performed with an intercept, the t-statistic for the null hypothesis beta1 = 0 is the same for the regression of “y” onto “x” as it is for the regression of “x” onto “y”.
m3<-lm(y~x)
m4<-lm(x~y)
summary(m3)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8768 -0.6138 -0.1395  0.5394  2.3462 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03769    0.09699  -0.389    0.698    
## x            1.99894    0.10773  18.556   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
summary(m4)
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90848 -0.28101  0.06274  0.24570  0.85736 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03880    0.04266    0.91    0.365    
## y            0.38942    0.02099   18.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16

For both m3 and m4, the t-statistic for the slope coefficient is the same (18.56).
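
The reason (a quick sketch using the same x and y): with an intercept, the slope t-statistic in either direction equals r * sqrt(n - 2) / sqrt(1 - r^2), where r = cor(x, y), and that expression is symmetric in x and y.
r <- cor(x, y)
# Slope t-statistic expressed through the correlation; should print about 18.56,
# matching both summary(m3) and summary(m4).
r * sqrt(length(x) - 2) / sqrt(1 - r^2)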

Problem 2

  (a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from a Normal(0, 1) distribution, i.e. a normal distribution with mean zero and variance 1. This represents an explanatory variable X.
set.seed(1)
x<-rnorm(100, sd = 1)
  (b) Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a Normal(0, 0.25) distribution. This represents the error (or noise).
eps<-rnorm(100, sd=.25)
  (c) Using x and eps, generate a vector “y” according to the model y = -1 + 0.5X + eps.

What is the length of the vector “y”? What are the values of beta0 and beta1 in this linear model?

y<-(-1)+.5*x+eps
length(y)
## [1] 100

The vector y has the same length as x and eps, so its length is 100. The intercept beta0 is -1, and the slope beta1 is 0.5.

  (d) Create a scatterplot displaying the relationship between “x” and “y”. Comment on what you observe.
df_x0y0<-data.frame(x, y)
ggplot(df_x0y0, aes(x = x, y = y))+
  geom_jitter()

The plot shows a positive, roughly linear relationship between x and y of moderate strength, with no obvious outliers.
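
To put a number on that strength (a small check, not asked for): in simple linear regression the squared correlation equals the model R-squared, so cor(x, y) previews the fit in part (e).
# Correlation between x and y; its square should match the R-squared (about 0.78)
# reported by the model fit in part (e).
cor(x, y)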

  (e) Fit a least squares linear model to predict “y” using “x”. Comment on the model obtained. How do beta0-hat and beta1-hat compare to beta0 and beta1?
model<-lm(y~x)
summary(model)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46921 -0.15344 -0.03487  0.13485  0.58654 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.00942    0.02425  -41.63   <2e-16 ***
## x            0.49973    0.02693   18.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16

The fitted intercept beta0-hat is about -1.01 and the fitted slope beta1-hat is about 0.50, both very close to the true values beta0 = -1 and beta1 = 0.5 used to generate the data.
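
A compact way to line the estimates up against the true parameters (a sketch, not required):
# Estimated coefficients from the fit next to the true values used to simulate y.
rbind(estimated = coef(model), true = c("(Intercept)" = -1, x = 0.5))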

  (f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot in a different color.
df_xy<-data.frame(x, y)
names(model)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
model$coefficients
## (Intercept)           x 
##  -1.0094232   0.4997349
ggplot(df_xy, aes(x = x, y = y))+
  geom_jitter()+
  geom_abline(intercept = model$coefficients[1], slope = model$coefficients[2], col="forestgreen")+
  geom_abline(intercept = -1, slope = .5, col="cyan")

The cyan line is the population regression line; the forest green line is the least squares fit.
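
If a legend distinguishing the two lines is wanted, one option (a sketch; the labels are my own) is to map the line colours inside aes() and set the colour values with scale_colour_manual():
ggplot(df_xy, aes(x = x, y = y)) +
  geom_point() +
  # mapping colour inside aes() makes ggplot2 add a legend entry for each line
  geom_abline(aes(intercept = coef(model)[1], slope = coef(model)[2],
                  colour = "Least squares fit")) +
  geom_abline(aes(intercept = -1, slope = 0.5, colour = "Population line")) +
  scale_colour_manual(name = NULL,
                      values = c("Least squares fit" = "forestgreen",
                                 "Population line" = "cyan"))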

  (g) Repeat (a)-(f) after modifying the data generating process in such a way that there is less noise in the data. The population model from part (c) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error terms eps in (b). Describe the results. How does this differ from the model in (e)?
# Part (a)
x2<-rnorm(100, sd=1)

# Part (b)
eps2<-rnorm(100, sd=.1)

# Part (c)
y2<-(-1)+0.5*x2+eps2

# - Length is 100, beta0 is -1, and beta1 is 0.5.

# Part (d)
df_x2y2<-data.frame(x2, y2)
ggplot(df_x2y2, aes(x = x2, y = y2))+
  geom_jitter()

# Part (e)
model2<-lm(y2~x2)
summary(model2)
## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.251014 -0.060549  0.002065  0.070483  0.208980 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.004745   0.009676 -103.83   <2e-16 ***
## x2           0.492505   0.008310   59.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09671 on 98 degrees of freedom
## Multiple R-squared:  0.9729, Adjusted R-squared:  0.9726 
## F-statistic:  3512 on 1 and 98 DF,  p-value: < 2.2e-16
names(model2)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
model2$coefficients
## (Intercept)          x2 
##  -1.0047452   0.4925051
ggplot(df_x2y2, aes(x = x2, y = y2))+
  geom_jitter()+
  geom_abline(intercept = model2$coefficients[1], slope = model2$coefficients[2], col="red")+
  geom_abline(intercept = -1, slope = 0.5, col="yellow")

# - In this plot, the population regression line is yellow and the least squares line is red.

For Part (e), the new beta0-hat (about -1.005) and beta1-hat (about 0.493) are both very close to the true values of -1 and 0.5, and their standard errors are much smaller than in the original fit.

In the new plot the least squares line and the population regression line nearly coincide, only barely crossing each other. With the lower error variance the fitted line tracks the true population line more closely, and the linear relationship is visibly stronger (R-squared rises from about 0.78 to about 0.97).
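
To quantify the gain in precision (a sketch, not required), compare the slope standard errors of the two fits:
# Slope standard errors: the low-noise fit estimates the slope far more precisely.
c(original  = summary(model)$coefficients["x", "Std. Error"],
  low_noise = summary(model2)$coefficients["x2", "Std. Error"])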

  (h) Repeat the same steps as in (g), but this time use a larger error variance.
# Part (a)
x3<-rnorm(100, sd=1)

# Part (b)
eps3<-rnorm(100, sd=.5)

# Part (c)
y3<-(-1)+0.5*x3+eps3

# - Length is 100, beta0 is -1, and beta1 is 0.5.

# Part (d)
df_x3y3<-data.frame(x3, y3)
ggplot(df_x3y3, aes(x = x3, y = y3))+
  geom_jitter()

# Part (e)
model3<-lm(y3~x3)
summary(model3)
## 
## Call:
## lm(formula = y3 ~ x3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.45071 -0.31817  0.01701  0.34848  1.13426 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.99894    0.05195  -19.23   <2e-16 ***
## x3           0.59663    0.04797   12.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5194 on 98 degrees of freedom
## Multiple R-squared:  0.6122, Adjusted R-squared:  0.6082 
## F-statistic: 154.7 on 1 and 98 DF,  p-value: < 2.2e-16
names(model3)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
model3$coefficients
## (Intercept)          x3 
##  -0.9989412   0.5966254
ggplot(df_x3y3, aes(x = x3, y = y3))+
  geom_jitter()+
  geom_abline(intercept = model3$coefficients[1], slope = model3$coefficients[2], col="blue")+
  geom_abline(intercept = -1, slope = 0.5, col="purple")

# - In this plot, the population regression line is purple and the least squares line is blue.

For Part (e), the intercept estimate (about -1.00) is still very close to the true value of -1, but the slope estimate (about 0.60) is noticeably farther from the true slope of 0.5 than in the less noisy data sets.

In this plot the fitted line clearly diverges from the population line, and the points are much more scattered, so the linear relationship is weaker (R-squared falls to about 0.61).
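
One way to summarize how the fit degrades as the error variance grows (a sketch, not required) is to compare the R-squared values of the three fits:
# R-squared for the original, low-noise, and high-noise fits.
c(original   = summary(model)$r.squared,
  low_noise  = summary(model2)$r.squared,
  high_noise = summary(model3)$r.squared)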

  (i) What are the confidence intervals for beta0 and beta1 based on the original data set, the noisier data set, and the less noisy data set? Comment on the results.
# Confidence intervals for the original, less noisy, and noisier data sets, respectively.
confint(model)
##                  2.5 %     97.5 %
## (Intercept) -1.0575402 -0.9613061
## x            0.4462897  0.5531801
confint(model2)
##                  2.5 %     97.5 %
## (Intercept) -1.0239477 -0.9855428
## x2           0.4760139  0.5089963
confint(model3)
##                  2.5 %     97.5 %
## (Intercept) -1.1020276 -0.8958549
## x3           0.5014353  0.6918155

For the original data set, the beta0 confidence interval is (-1.06, -0.96) and the beta1 confidence interval is (0.45, 0.55).

Less Noisy Data Set: Beta0: (-1.02, -0.99) Beta1: (0.48, 0.51)

Noisier Data Set: Beta0: (-1.10, -0.90) Beta1: (0.50, 0.69)

In every case the intervals are centered at the corresponding estimates; they are narrowest for the less noisy data and widest for the noisier data, reflecting how the precision of the coefficient estimates falls as the error variance grows.
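
The widening can be made explicit (a quick sketch, not part of the required answer) by computing the width of each slope interval:
# Width (upper minus lower) of the slope confidence interval for each fit.
sapply(list(original = model, less_noisy = model2, noisier = model3),
       function(m) diff(confint(m)[2, ]))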