Problem 1: Investigating the T-stat

In this problem we will investigate the t-statistic for the null hypothesis \(H_0: \beta = 0\) in simple linear regression without an intercept. To begin, we generate a predictor x and a response y as follows.
set.seed(1)
x <- rnorm(100)
y <- 2*x+rnorm(100)
mod0 <- lm(y~x+0)
mod0
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Coefficients:
##     x  
## 1.994
summary(mod0)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9939     0.1065   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
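The t-statistic for the no-intercept fit can also be computed directly from its closed form; a minimal sketch, assuming the x and y generated above:
# t-statistic for H0: beta = 0 in regression through the origin:
# t = sqrt(n - 1) * sum(x * y) / sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)
n <- length(x)
sqrt(n - 1) * sum(x * y) / sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)
# ~18.73, matching the t value reported by summary(mod0)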
mod <- lm(y~x)
mod 
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##    -0.03769      1.99894
summary(mod)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8768 -0.6138 -0.1395  0.5394  2.3462 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03769    0.09699  -0.389    0.698    
## x            1.99894    0.10773  18.556   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
\(H_0: \beta = 0\) should be rejected given the p-value of < 2.2e-16.
With or without an intercept, the t-statistic for the slope is almost the same (18.73 vs. 18.56); the obvious difference is that the fit with an intercept also reports an estimate and t-statistic for the intercept term.
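As a quick check, the slope t-values can be extracted straight from both summaries; a sketch, assuming mod0 and mod as fitted above:
# Compare the slope t-statistics of the no-intercept and with-intercept fits.
c(no_intercept   = summary(mod0)$coefficients["x", "t value"],
  with_intercept = summary(mod)$coefficients["x", "t value"])
# ~18.73 and ~18.56, respectively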
mod <- lm(x~y)
mod 
## 
## Call:
## lm(formula = x ~ y)
## 
## Coefficients:
## (Intercept)            y  
##      0.0388       0.3894
summary(mod)
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90848 -0.28101  0.06274  0.24570  0.85736 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03880    0.04266    0.91    0.365    
## y            0.38942    0.02099   18.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
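Note that the slope t-statistic is identical (18.556) whether we regress y on x or x on y. With an intercept, the t-statistic can be written as t = r * sqrt(n - 2) / sqrt(1 - r^2), where r = cor(x, y), and r is symmetric in the two variables; a sketch verifying this:
# Slope t-statistic computed from the correlation alone; symmetric in x and y.
r <- cor(x, y)
n <- length(x)
r * sqrt(n - 2) / sqrt(1 - r^2)
# ~18.556, matching both summary tables above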
Problem 2: Simulated Data

In this exercise you will create some simulated data and fit a simple linear regression model to it. Make sure to use set.seed(1) prior to starting (a) to ensure consistent results.
set.seed(1)
X <- rnorm(100, mean = 0, sd = 1)
eps <- rnorm(100, mean = 0, sd = 0.25)
Y <- 1+0.5*X+eps
library(ggplot2)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ✓ purrr   0.3.3
## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
plot(X, Y)
We can see a moderately strong positive linear relationship between X and Y.
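A quick numerical check of that strength, as a sketch:
cor(X, Y)
# ~0.88, the square root of the R-squared reported in the summary below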
mod1 <- lm(Y~X)
mod1
## 
## Call:
## lm(formula = Y ~ X)
## 
## Coefficients:
## (Intercept)            X  
##      0.9906       0.4997
summary(mod1)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46921 -0.15344 -0.03487  0.13485  0.58654 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.99058    0.02425   40.85   <2e-16 ***
## X            0.49973    0.02693   18.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
ggplot(NULL, aes(X, Y))+
  geom_point()+
  geom_smooth(se = FALSE)+
  geom_smooth(method = "lm", color = "orange")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The scatter plot, along with the loess and least squares lines above, suggests a linearly increasing relationship between X and Y.
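The fitted line is also very close to the population regression line Y = 1 + 0.5 X (read off from the data-generating code above); a sketch overlaying the two for comparison:
# Least squares line (orange) vs. the true population line (blue).
ggplot(NULL, aes(X, Y)) +
  geom_point() +
  geom_abline(intercept = 1, slope = 0.5, color = "blue") +
  geom_abline(intercept = coef(mod1)[1], slope = coef(mod1)[2], color = "orange")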
X <- rnorm(100, mean = 0, sd = 1)
eps2 <- rnorm(100, mean = 0, sd = 0.015)
Y2 <- 1+0.5*X+eps2
mod2 <- lm(Y2~X)
mod2
## 
## Call:
## lm(formula = Y2 ~ X)
## 
## Coefficients:
## (Intercept)            X  
##      1.0007       0.5016
summary(mod2)
## 
## Call:
## lm(formula = Y2 ~ X)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.041127 -0.008421 -0.000262  0.010196  0.027726 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.000727   0.001486   673.2   <2e-16 ***
## X           0.501593   0.001444   347.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01486 on 98 degrees of freedom
## Multiple R-squared:  0.9992, Adjusted R-squared:  0.9992 
## F-statistic: 1.207e+05 on 1 and 98 DF,  p-value: < 2.2e-16
ggplot(NULL, aes(X, Y2))+
  geom_point()+
  geom_smooth(se = FALSE)+
  geom_smooth(method = "lm", color = "orange")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Multiple R-squared and adjusted R-squared have improved greatly in this second model (0.9992 vs. 0.7784) because there is less noise in the normally distributed errors.
X <- rnorm(100, mean = 0, sd = 1)
eps3 <- rnorm(100, mean = 0, sd = 2)
Y3 <- 1+0.5*X+eps3
mod3 <- lm(Y3~X)
mod3
## 
## Call:
## lm(formula = Y3 ~ X)
## 
## Coefficients:
## (Intercept)            X  
##      0.9051       0.3501
summary(mod3)
## 
## Call:
## lm(formula = Y3 ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0203 -1.2110  0.0413  1.4097  4.1796 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9051     0.1935   4.677 9.33e-06 ***
## X             0.3501     0.1662   2.106   0.0377 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.934 on 98 degrees of freedom
## Multiple R-squared:  0.04332,    Adjusted R-squared:  0.03355 
## F-statistic: 4.437 on 1 and 98 DF,  p-value: 0.03772
ggplot(NULL, aes(X, Y3))+
  geom_point()+
  geom_smooth(se = FALSE)+
  geom_smooth(method = "lm", color = "orange")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
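These R-squared values track the population R-squared implied by the data-generating process: with Var(X) = 1, R^2 = Var(0.5 X) / (Var(0.5 X) + Var(eps)) = 0.25 / (0.25 + sigma^2). A sketch, using the three noise levels above:
# Population R^2 for each noise level used above.
sigma <- c(original = 0.25, less_noise = 0.015, more_noise = 2)
0.25 / (0.25 + sigma^2)
# ~0.80, ~0.999, ~0.059 -- in line with the 0.7784, 0.9992, and 0.0433 reported above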
t.test(X, Y)
## 
##  Welch Two Sample t-test
## 
## data:  X and Y
## t = -8.4998, df = 135.17, p-value = 3.095e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3363711 -0.8318807
## sample estimates:
##   mean of x   mean of y 
## -0.03913424  1.04499166
(-0.8318807)-(-1.3363711)
## [1] 0.5044904
t.test(X, Y2)
## 
##  Welch Two Sample t-test
## 
## data:  X and Y2
## t = -8.2429, df = 136.53, p-value = 1.237e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3077825 -0.8017076
## sample estimates:
##   mean of x   mean of y 
## -0.03913424  1.01561080
(-0.8017076)-(-1.3077825)
## [1] 0.5060749
t.test(X, Y3)
## 
##  Welch Two Sample t-test
## 
## data:  X and Y3
## t = -4.0654, df = 161.2, p-value = 7.478e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3825330 -0.4785238
## sample estimates:
##   mean of x   mean of y 
## -0.03913424  0.89139417
(-0.4785238)-(-1.3825330)
## [1] 0.9040092
Comparing the widths of these difference-in-means intervals: the noisiest data (Y3) give a clearly wider confidence interval (0.90 versus roughly 0.50), while the original and low-noise settings are nearly identical, because the width here is dominated by the variability of X rather than by the noise added to Y. (Note that X is overwritten twice along the way, so the first two t-tests compare the latest draw of X against the earlier responses.)
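The noise effect shows up much more sharply in the confidence intervals for the slope itself. A sketch using confint() on the three fits above (ci_width is a hypothetical helper introduced here for illustration):
# Width of the 95% confidence interval for the slope in each model.
ci_width <- function(fit) diff(confint(fit)["X", ])
c(original = ci_width(mod1), less_noise = ci_width(mod2), more_noise = ci_width(mod3))
# roughly 0.107, 0.006, and 0.660, given the standard errors reported above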