library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
In this problem we will investigate the t-statistic for the null hypothesis H0: beta = 0 in simple linear regression without an intercept. To begin, we generate a predictor x and a response variable y as follows.
set.seed(1)
x<-rnorm(100)
y<-2*x+rnorm(100)
m1<-lm(y~x+0)
summary(m1)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
The coefficient estimate beta-hat is 1.9939, the standard error is 0.1065, the t-statistic is 18.73, and the p-value is < 2e-16. The standard error is small relative to the estimate and the t-statistic is large, so we reject H0: beta = 0; the estimate is also close to the true slope of 2 used to generate y.
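As a check, the no-intercept estimate, standard error, and t-statistic can be reproduced directly from the formulas beta-hat = sum(x*y)/sum(x^2) and SE(beta-hat) = sqrt(sum((y - x*beta-hat)^2) / ((n - 1)*sum(x^2))). A minimal sketch, reusing the x and y generated above (the helper names beta_hat, se_beta, and t_stat are just for illustration):
# Manual check of the no-intercept fit; should match summary(m1)
beta_hat <- sum(x * y) / sum(x^2)                # coefficient estimate
se_beta  <- sqrt(sum((y - x * beta_hat)^2) /
                 ((length(x) - 1) * sum(x^2)))   # standard error
t_stat   <- beta_hat / se_beta                   # t-statistic
c(beta_hat = beta_hat, se = se_beta, t = t_stat)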
m2<-lm(x~y+0)
summary(m2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
The coefficient estimate is 0.39111, the standard error is 0.02089, the t-statistic is again 18.73, and the p-value is the same (< 2e-16), so we again reject H0: beta = 0.
The relationship between the two fits is that the t-statistic (and therefore the p-value) is identical whether we regress y on x or x on y without an intercept. The slope estimates themselves differ (1.9939 vs. 0.39111) because each regression minimizes a different set of residuals, but both reflect the same underlying linear association between x and y.
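The reason the t-statistics match is that the statistic can be rewritten as sqrt(n - 1) * sum(x*y) / sqrt(sum(x^2)*sum(y^2) - (sum(x*y))^2), which is symmetric in x and y. A short sketch that evaluates this form using the x and y from above:
# t-statistic in a form that is symmetric in x and y;
# should reproduce the 18.73 reported by both summaries
n <- length(x)
t_sym <- sqrt(n - 1) * sum(x * y) /
         sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)
t_sym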
m3<-lm(y~x)
m4<-lm(x~y)
summary(m3)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8768 -0.6138 -0.1395 0.5394 2.3462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03769 0.09699 -0.389 0.698
## x 1.99894 0.10773 18.556 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
summary(m4)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90848 -0.28101 0.06274 0.24570 0.85736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03880 0.04266 0.91 0.365
## y 0.38942 0.02099 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
For both m3 and m4 (now fit with an intercept), the t-statistic for the slope coefficient is again identical (18.56), so the same result holds when an intercept is included.
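Rather than reading the values off the printed summaries, the two slope t-statistics can be pulled out programmatically; a sketch using the m3 and m4 fits from above:
# Extract the slope t-statistics from both fits; the two values should agree
c(y_on_x = summary(m3)$coefficients["x", "t value"],
  x_on_y = summary(m4)$coefficients["y", "t value"])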
set.seed(1)
x<-rnorm(100, sd = 1)
eps<-rnorm(100, sd=.25)
What is the length of the vector y? What are the values of beta0 and beta1 in this linear model?
y<-(-1)+.5*x+eps
length(y)
## [1] 100
The vector y has the same length as x and eps, so its length is 100. In this model the intercept beta0 is -1 and the slope beta1 is 0.5.
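A quick sanity check on the lengths (a sketch, reusing x, eps, and y from above):
# y is an element-wise combination of x and eps, so all three have length 100
c(length(x), length(eps), length(y))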
df_x0y0<-data.frame(x, y)
ggplot(df_x0y0, aes(x = x, y = y))+
geom_jitter()
The scatterplot shows a positive, approximately linear relationship between x and y of moderate strength, with no obvious outliers.
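The strength of the relationship can be quantified with the sample correlation, whose square matches the R-squared reported in the fit below; a one-line sketch using the data frame just created:
# Sample correlation between x and y
cor(df_x0y0$x, df_x0y0$y)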
model<-lm(y~x)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46921 -0.15344 -0.03487 0.13485 0.58654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.00942 0.02425 -41.63 <2e-16 ***
## x 0.49973 0.02693 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
The summary tells me that beta0-hat, the estimated intercept, is about -1.01, and that beta1-hat, the estimated slope, is about 0.50. Both are very close to the true values beta0 = -1 and beta1 = 0.5.
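A compact way to see this comparison (a sketch, reusing the model object from above; the row labels are just for illustration):
# Fitted coefficients next to the true parameters used to simulate the data
rbind(estimate = coef(model), truth = c(-1, 0.5))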
df_xy<-data.frame(x, y)
names(model)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
model$coefficients
## (Intercept) x
## -1.0094232 0.4997349
ggplot(df_xy, aes(x = x, y = y))+
geom_jitter()+
geom_abline(intercept = model$coefficients[1], slope = model$coefficients[2], col="forestgreen")+
geom_abline(intercept = -1, slope = .5, col="cyan")
The cyan line is the population regression line and the forest green line is the least squares fit; the two lines lie very close to each other.
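With the colours hard-coded inside geom_abline() the plot has no legend. One way to get one is to put the two lines in a small data frame and map their labels to colour; a sketch reusing df_xy and model from above (lines_df is a new name used only for this illustration):
# Same plot, but with the two lines drawn from a data frame so a legend appears
lines_df <- data.frame(line = c("Least squares", "Population"),
                       intercept = c(coef(model)[1], -1),
                       slope = c(coef(model)[2], 0.5))
ggplot(df_xy, aes(x = x, y = y)) +
  geom_point() +
  geom_abline(data = lines_df,
              aes(intercept = intercept, slope = slope, colour = line)) +
  scale_colour_manual(values = c("Least squares" = "forestgreen",
                                 "Population" = "cyan"))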
m5<-lm(y~x)
summary(m5)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46921 -0.15344 -0.03487 0.13485 0.58654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.00942 0.02425 -41.63 <2e-16 ***
## x 0.49973 0.02693 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
# Part (a)
x2<-rnorm(100, sd=1)
# Part (b)
eps2<-rnorm(100, sd=.1)
# Part (c)
y2<-(-1)+0.5*x2+eps2
# - y2 has length 100; beta0 is -1 and beta1 is 0.5, as before.
# Part (d)
df_x2y2<-data.frame(x2, y2)
ggplot(df_x2y2, aes(x = x2, y = y2))+
geom_jitter()
# Part (e)
model2<-lm(y2~x2)
summary(model2)
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.251014 -0.060549 0.002065 0.070483 0.208980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.004745 0.009676 -103.83 <2e-16 ***
## x2 0.492505 0.008310 59.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09671 on 98 degrees of freedom
## Multiple R-squared: 0.9729, Adjusted R-squared: 0.9726
## F-statistic: 3512 on 1 and 98 DF, p-value: < 2.2e-16
names(model2)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
model2$coefficients
## (Intercept) x2
## -1.0047452 0.4925051
ggplot(df_x2y2, aes(x = x2, y = y2))+
geom_jitter()+
geom_abline(intercept = model2$coefficients[1], slope = model2$coefficients[2], col="red")+
geom_abline(intercept = -1, slope = 0.5, col="yellow")
# - In this plot the population regression line is yellow and the least squares fit is red.
For Part (e), the new beta0-hat (-1.005) is about the same as before and remains very close to the true intercept of -1; the same holds for beta1-hat (0.493 versus the true 0.5).
In the new plot the least squares line and the population line nearly coincide, just barely crossing each other. With the smaller error variance (sd = 0.1) the points cluster much more tightly around the line, the fit is stronger (R-squared of about 0.97 versus 0.78 before), and the estimated line stays very close to the true population regression line.
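A side-by-side comparison of the two fits makes this concrete; a sketch reusing model and model2 from above (the row labels are just for illustration):
# Coefficients of the original (eps sd = 0.25) and low-noise (sd = 0.1) fits
rbind(original  = coef(model),
      low_noise = coef(model2),
      truth     = c(-1, 0.5))
# R-squared for each fit
c(r2_original = summary(model)$r.squared,
  r2_low_noise = summary(model2)$r.squared)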
# Part (a)
x3<-rnorm(100, sd=1)
# Part (b)
eps3<-rnorm(100, sd=.5)
# Part (c)
y3<-(-1)+0.5*x3+eps3
# - y3 has length 100; beta0 is -1 and beta1 is 0.5, as before.
# Part (d)
df_x3y3<-data.frame(x3, y3)
ggplot(df_x3y3, aes(x = x3, y = y3))+
geom_jitter()
# Part (e)
model3<-lm(y3~x3)
summary(model3)
##
## Call:
## lm(formula = y3 ~ x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.45071 -0.31817 0.01701 0.34848 1.13426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.99894 0.05195 -19.23 <2e-16 ***
## x3 0.59663 0.04797 12.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5194 on 98 degrees of freedom
## Multiple R-squared: 0.6122, Adjusted R-squared: 0.6082
## F-statistic: 154.7 on 1 and 98 DF, p-value: < 2.2e-16
names(model3)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
model3$coefficients
## (Intercept) x3
## -0.9989412 0.5966254
ggplot(df_x3y3, aes(x = x3, y = y3))+
geom_jitter()+
geom_abline(intercept = model3$coefficients[1], slope = model3$coefficients[2], col="blue")+
geom_abline(intercept = -1, slope = 0.5, col="purple")
# - In this plot the population regression line is purple and the least squares fit is blue.
For Part (e), the least squares estimate of the intercept (-0.999) is still very close to the true value of -1, but the estimated slope (0.597) is noticeably higher than the true 0.5: with the larger error variance (sd = 0.5) the estimates are less precise.
In this plot the two lines clearly cross, and the points are much more scattered around the line, so the relationship looks weaker (R-squared of about 0.61).
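To compare all three noise levels at once, the residual standard error and R-squared can be tabulated from the three fits; a sketch reusing model, model2, and model3 from above:
# Fit quality across the three simulated noise levels
data.frame(noise_sd = c(0.25, 0.1, 0.5),
           sigma    = sapply(list(model, model2, model3),
                             function(m) summary(m)$sigma),
           r2       = sapply(list(model, model2, model3),
                             function(m) summary(m)$r.squared))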
# These are the confidence intervals for the original, less noisy, and noisier data sets, respectively.
confint(model)
## 2.5 % 97.5 %
## (Intercept) -1.0575402 -0.9613061
## x 0.4462897 0.5531801
confint(model2)
## 2.5 % 97.5 %
## (Intercept) -1.0239477 -0.9855428
## x2 0.4760139 0.5089963
confint(model3)
## 2.5 % 97.5 %
## (Intercept) -1.1020276 -0.8958549
## x3 0.5014353 0.6918155
For the original data set (eps sd = 0.25), the 95% confidence interval for beta0 is (-1.06, -0.96) and for beta1 is (0.45, 0.55).
Less noisy data set (sd = 0.1): beta0: (-1.02, -0.99); beta1: (0.48, 0.51).
Noisier data set (sd = 0.5): beta0: (-1.10, -0.90); beta1: (0.50, 0.69).
The intervals are narrowest for the least noisy data and widest for the noisiest, as expected.
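The interval widths can be computed directly to confirm how they track the noise level; a sketch reusing the three fitted models:
# Width (upper minus lower) of each 95% confidence interval, by model
sapply(list(original = model, less_noisy = model2, noisier = model3),
       function(m) apply(confint(m), 1, diff))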