Problem 1: Investigating the T-stat

In this problem we will investigate the t-statistic for the null hypothesis \(H_0 : \beta = 0\) in simple linear regression without an intercept. To begin, we generate a predictor x and a response y as follows.
set.seed(1)
x <- rnorm(100)
y <- 2*x+rnorm(100)
mod0 <- lm(y~x+0)
mod0
##
## Call:
## lm(formula = y ~ x + 0)
##
## Coefficients:
## x
## 1.994
summary(mod0)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
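For reference, the same no-intercept t-statistic can be computed by hand from the closed-form formulas for regression through the origin; this short sketch should reproduce the t value of 18.73 reported above.
n <- length(x)
beta_hat <- sum(x * y) / sum(x^2)                                   # least squares slope, no intercept
se_beta <- sqrt(sum((y - beta_hat * x)^2) / ((n - 1) * sum(x^2)))   # standard error of the slope
beta_hat / se_beta                                                  # t-statistic, matches 18.73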
mod <- lm(y~x)
mod
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## -0.03769 1.99894
summary(mod)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8768 -0.6138 -0.1395 0.5394 2.3462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03769 0.09699 -0.389 0.698
## x 1.99894 0.10773 18.556 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
\(H_0 : \beta = 0\) should be rejected given the reported p-value < 2.2e-16.
With or without an intercept, the t-statistic for the slope is almost the same (18.73 vs. 18.56); the only difference is that the model with an intercept additionally reports an estimate, standard error, and t-statistic for the intercept term.
mod <- lm(x~y)
mod
##
## Call:
## lm(formula = x ~ y)
##
## Coefficients:
## (Intercept) y
## 0.0388 0.3894
summary(mod)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90848 -0.28101 0.06274 0.24570 0.85736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03880 0.04266 0.91 0.365
## y 0.38942 0.02099 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
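Note that the slope t-statistic is 18.56 in both the regression of y onto x and the regression of x onto y. This is expected: with an intercept included, the t-statistic depends only on the sample correlation, which is symmetric in x and y. A quick check of that identity:
r <- cor(x, y)
r * sqrt(length(x) - 2) / sqrt(1 - r^2)   # equals the slope t value (about 18.56) for both fits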
Problem 2: Simulated Data

In this exercise you will create some simulated data and fit a simple linear regression model to it. Make sure to use set.seed(1) prior to starting (a) to ensure consistent results.
set.seed(1)
X <- rnorm(100, mean = 0, sd = 1)
eps <- rnorm(100, mean = 0, sd = 0.25)
Y <- 1+0.5*X+eps
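The simulated model is \(Y = 1 + 0.5X + \epsilon\), so the true intercept is \(\beta_0 = 1\), the true slope is \(\beta_1 = 0.5\), and Y inherits its length from X.
length(Y)   # 100, same as X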
library(ggplot2)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ✓ purrr 0.3.3
## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
plot(X, Y)
We can see a moderately strong, positive linear relationship between X and Y.
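To attach a number to this, the sample correlation can be computed directly; given the R-squared of about 0.78 reported below, it should come out near 0.88.
cor(X, Y)   # roughly 0.88, consistent with the R-squared of the fit below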
mod1 <- lm(Y~X)
mod1
##
## Call:
## lm(formula = Y ~ X)
##
## Coefficients:
## (Intercept) X
## 0.9906 0.4997
summary(mod1)
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46921 -0.15344 -0.03487 0.13485 0.58654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99058 0.02425 40.85 <2e-16 ***
## X 0.49973 0.02693 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
ggplot(NULL, aes(X, Y))+
geom_point()+
geom_smooth(se = FALSE)+
geom_smooth(method = "lm", color = "orange")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The scatter plot along with the smoothing line above suggests a linearly increasing relationship between X and Y.
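As a variation on the plot above, the population regression line \(Y = 1 + 0.5X\) can be overlaid on the least squares line for comparison; at this noise level the two lines should be nearly indistinguishable. A sketch:
ggplot(data.frame(X, Y), aes(X, Y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "orange") +   # least squares fit
  geom_abline(intercept = 1, slope = 0.5, color = "blue")      # population line Y = 1 + 0.5*X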
X <- rnorm(100, mean = 0, sd = 1)
eps2 <- rnorm(100, mean = 0, sd = 0.015)
Y2 <- 1+0.5*X+eps2
mod2 <- lm(Y2~X)
mod2
##
## Call:
## lm(formula = Y2 ~ X)
##
## Coefficients:
## (Intercept) X
## 1.0007 0.5016
summary(mod2)
##
## Call:
## lm(formula = Y2 ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.041127 -0.008421 -0.000262 0.010196 0.027726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000727 0.001486 673.2 <2e-16 ***
## X 0.501593 0.001444 347.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01486 on 98 degrees of freedom
## Multiple R-squared: 0.9992, Adjusted R-squared: 0.9992
## F-statistic: 1.207e+05 on 1 and 98 DF, p-value: < 2.2e-16
ggplot(NULL, aes(X, Y2))+
geom_point()+
geom_smooth(se = FALSE)+
geom_smooth(method = "lm", color = "orange")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Multiple R-squared and adjusted R-squared have improved markedly in this second model, which has much less noise in its normally distributed errors.
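This matches what the simulation implies: with Var(X) = 1 and a slope of 0.5, the signal variance is 0.25, so the population R-squared is roughly 0.25 / (0.25 + sigma^2). A quick check for the two noise levels used so far:
pop_r2 <- function(sigma) 0.25 / (0.25 + sigma^2)
pop_r2(c(0.25, 0.015))   # about 0.80 and 0.999, close to the fitted R-squared values above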
X <- rnorm(100, mean = 0, sd = 1)
eps3 <- rnorm(100, mean = 0, sd = 2)
Y3 <- 1+0.5*X+eps3
mod3 <- lm(Y3~X)
mod3
##
## Call:
## lm(formula = Y3 ~ X)
##
## Coefficients:
## (Intercept) X
## 0.9051 0.3501
summary(mod3)
##
## Call:
## lm(formula = Y3 ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0203 -1.2110 0.0413 1.4097 4.1796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9051 0.1935 4.677 9.33e-06 ***
## X 0.3501 0.1662 2.106 0.0377 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.934 on 98 degrees of freedom
## Multiple R-squared: 0.04332, Adjusted R-squared: 0.03355
## F-statistic: 4.437 on 1 and 98 DF, p-value: 0.03772
ggplot(NULL, aes(X, Y3))+
geom_point()+
geom_smooth(se = FALSE)+
geom_smooth(method = "lm", color = "orange")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
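To summarize the effect of the noise level, the slope estimates and standard errors from the three fits can be placed side by side (a small sketch, assuming mod1, mod2 and mod3 are still in the workspace):
rbind(sd_0.015 = summary(mod2)$coefficients["X", 1:2],
      sd_0.25  = summary(mod1)$coefficients["X", 1:2],
      sd_2     = summary(mod3)$coefficients["X", 1:2])   # Estimate and Std. Error for each fit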
t.test(X, Y)
##
## Welch Two Sample t-test
##
## data: X and Y
## t = -8.4998, df = 135.17, p-value = 3.095e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3363711 -0.8318807
## sample estimates:
## mean of x mean of y
## -0.03913424 1.04499166
(-0.8318807)-(-1.3363711)  # width of the 95% confidence interval
## [1] 0.5044904
t.test(X, Y2)
##
## Welch Two Sample t-test
##
## data: X and Y2
## t = -8.2429, df = 136.53, p-value = 1.237e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3077825 -0.8017076
## sample estimates:
## mean of x mean of y
## -0.03913424 1.01561080
(-0.8017076)-(-1.3077825)  # width of the 95% confidence interval
## [1] 0.5060749
t.test(X, Y3)
##
## Welch Two Sample t-test
##
## data: X and Y3
## t = -4.0654, df = 161.2, p-value = 7.478e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3825330 -0.4785238
## sample estimates:
## mean of x mean of y
## -0.03913424 0.89139417
(-0.4785238)-(-1.3825330)  # width of the 95% confidence interval
## [1] 0.9040092
When there is less noise in the simulated data the confidence interval is narrower, whereas when there is more noise the confidence interval is wider.
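The same pattern is easier to see from confidence intervals for the slope of each fitted model (a sketch using confint(), again assuming mod1, mod2 and mod3 are still available): the interval is narrowest for the low-noise fit and widest for the high-noise fit.
confint(mod1)   # original noise (sd = 0.25)
confint(mod2)   # low noise (sd = 0.015)
confint(mod3)   # high noise (sd = 2)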