Part I: Project Proposal (submitted)

Part II

Problem 1: Investigating the t-statistic

In this problem we will investigate the t-statistic for the null hypothesis \(H_0 : \beta = 0\) in simple linear regression without an intercept. To begin, we generate a predictor x and a response y as follows.

set.seed(1)
x <- rnorm(100)
y <- 2*x+rnorm(100)
  (a) Perform a simple linear regression of y onto x, without an intercept. Report the coefficient estimate \(\hat{\beta}\), the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis \(H_0 : \beta = 0\). Comment on these results.
  ○ Hint: You can perform regression without an intercept using the command
mod0 <- lm(y~x+0)
mod0
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Coefficients:
##     x  
## 1.994
summary(mod0)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9939     0.1065   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

The estimated coefficient is \(\hat{\beta} = 1.9939\) with standard error 0.1065; the t-statistic is 18.73 with p-value < 2e-16, so we reject \(H_0 : \beta = 0\). The estimate is close to the true slope of 2 used to generate y.

  (b) Now perform a simple linear regression of x onto y without an intercept, and report the coefficient estimate, its standard error, and the corresponding t-statistic and p-value associated with the null hypothesis \(H_0 : \beta = 0\). Comment on the results.
mod <- lm(y~x)
mod 
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##    -0.03769      1.99894
summary(mod)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8768 -0.6138 -0.1395  0.5394  2.3462 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03769    0.09699  -0.389    0.698    
## x            1.99894    0.10773  18.556   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16

Given the reported p-value (< 2.2e-16), \(H_0 : \beta = 0\) is rejected.
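Note that the model above was fit with an intercept; a minimal sketch of the no-intercept regression of x onto y that the prompt asks for (reusing the x and y generated above) would be:

# Regression of x onto y, without an intercept
mod0_rev <- lm(x ~ y + 0)
summary(mod0_rev)

Because the no-intercept slope t-statistic is algebraically symmetric in x and y, this fit produces the same t-statistic as lm(y ~ x + 0).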

  (c) What is the relationship between the results obtained in (a) and (b)?

With or without an intercept, the t-statistic for the slope is essentially the same; the only difference is that the model with an intercept also reports an estimate and t-statistic for the intercept term.

  (d) In R, show that when regression is performed with an intercept, the t-statistic for \(H_0\) : \(\beta_1\) = 0 is the same for the regression of y onto x as it is for the regression of x onto y.
mod <- lm(x~y)
mod 
## 
## Call:
## lm(formula = x ~ y)
## 
## Coefficients:
## (Intercept)            y  
##      0.0388       0.3894
summary(mod)
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90848 -0.28101  0.06274  0.24570  0.85736 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03880    0.04266    0.91    0.365    
## y            0.38942    0.02099   18.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
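The slope t-statistic is 18.56 in both regressions, as required. A quick way to extract the two values for comparison (a sketch using the fits above):

# Pull the slope t-statistics from the coefficient tables of both fits
coef(summary(lm(y ~ x)))["x", "t value"]
coef(summary(lm(x ~ y)))["y", "t value"]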

Problem 2: SLR Estimation

In this exercise you will create some simulated data and will fit a simple linear regression model to it. Make sure to use set.seed(1) prior to starting (a) to ensure consistent results.

set.seed(1)
  (a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from a Normal(0, 1) distribution. This represents an explanatory variable (aka a feature), X.
X <- rnorm(100, mean = 0, sd = 1)
  (b) Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a Normal(0, 0.25) distribution, i.e. a normal distribution with mean zero and variance 0.25. This represents the error (or noise).
eps <- rnorm(100, mean = 0, sd = 0.25)  # note: rnorm() takes the standard deviation, so variance 0.25 corresponds to sd = 0.5; sd = 0.25 gives variance 0.0625
  (c) Using x and eps, generate a vector y according to the model \(Y = 1 + 0.5X + \epsilon\). What is the length of the vector y? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?
Y <- 1+0.5*X+eps
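The vector Y has length 100, the same as X and eps; in this model \(\beta_0 = 1\) and \(\beta_1 = 0.5\).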
  (d) Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
library(ggplot2)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ✓ purrr   0.3.3
## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
plot(X, Y)

We can see a clear positive, roughly linear relationship between X and Y.

  (e) Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat{\beta}_0\) and \(\hat{\beta}_1\) compare to \(\beta_0\) and \(\beta_1\)?
mod1 <- lm(Y~X)
mod1
## 
## Call:
## lm(formula = Y ~ X)
## 
## Coefficients:
## (Intercept)            X  
##      0.9906       0.4997
summary(mod1)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46921 -0.15344 -0.03487  0.13485  0.58654 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.99058    0.02425   40.85   <2e-16 ***
## X            0.49973    0.02693   18.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
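The fitted coefficients \(\hat{\beta}_0 = 0.991\) and \(\hat{\beta}_1 = 0.500\) are very close to the true values \(\beta_0 = 1\) and \(\beta_1 = 0.5\), and both are highly significant.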
  (f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color.
  ○ Try to create a legend indicating the two different lines.
ggplot(NULL, aes(X, Y))+
  geom_point()+
  geom_smooth(se = FALSE)+
  geom_smooth(method = "lm", color = "orange")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The scatterplot, along with the smoothed line above, suggests a linearly increasing relationship between X and Y.
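The plot above shows a loess smooth and the least squares line, but not the population line or a legend. A sketch of one way to add both (assuming ggplot2 and the known population line \(y = 1 + 0.5x\)):

# Scatterplot with the least squares line, the population line, and a legend
ggplot(data.frame(X, Y), aes(X, Y)) +
  geom_point() +
  geom_smooth(aes(color = "Least squares"), method = "lm", se = FALSE) +
  geom_abline(aes(intercept = 1, slope = 0.5, color = "Population"), linetype = "dashed") +
  scale_color_manual(name = "Line", values = c("Least squares" = "orange", "Population" = "blue"))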

  (g) Repeat (a)-(f) after modifying the data generating process in such a way that there is less noise in the data. The population model from part (c) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term eps in (b). Describe your results. How does this differ from the model in (e)?
X <- rnorm(100, mean = 0, sd = 1)
eps2 <- rnorm(100, mean = 0, sd = 0.015)
Y2 <- 1+0.5*X+eps2

mod2 <- lm(Y2~X)
mod2
## 
## Call:
## lm(formula = Y2 ~ X)
## 
## Coefficients:
## (Intercept)            X  
##      1.0007       0.5016
summary(mod2)
## 
## Call:
## lm(formula = Y2 ~ X)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.041127 -0.008421 -0.000262  0.010196  0.027726 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.000727   0.001486   673.2   <2e-16 ***
## X           0.501593   0.001444   347.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01486 on 98 degrees of freedom
## Multiple R-squared:  0.9992, Adjusted R-squared:  0.9992 
## F-statistic: 1.207e+05 on 1 and 98 DF,  p-value: < 2.2e-16
ggplot(NULL, aes(X, Y2))+
  geom_point()+
  geom_smooth(se = FALSE)+
  geom_smooth(method = "lm", color = "orange")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Multiple R-squared and adjusted R-squared have improved markedly in this lower-noise model (0.9992 versus 0.7784 for the original fit), and the coefficient estimates remain close to the true values.

  (h) Repeat (a)-(f) after modifying the data generating process in such a way that there is more noise in the data. The population model from part (c) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term eps in (b). Describe your results. How does this differ from the model in (e)?
X <- rnorm(100, mean = 0, sd = 1)
eps3 <- rnorm(100, mean = 0, sd = 2)
Y3 <- 1+0.5*X+eps3

mod3 <- lm(Y3~X)
mod3
## 
## Call:
## lm(formula = Y3 ~ X)
## 
## Coefficients:
## (Intercept)            X  
##      0.9051       0.3501
summary(mod3)
## 
## Call:
## lm(formula = Y3 ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0203 -1.2110  0.0413  1.4097  4.1796 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9051     0.1935   4.677 9.33e-06 ***
## X             0.3501     0.1662   2.106   0.0377 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.934 on 98 degrees of freedom
## Multiple R-squared:  0.04332,    Adjusted R-squared:  0.03355 
## F-statistic: 4.437 on 1 and 98 DF,  p-value: 0.03772
ggplot(NULL, aes(X, Y3))+
  geom_point()+
  geom_smooth(se = FALSE)+
  geom_smooth(method = "lm", color = "orange")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
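With the noisier data, the estimates \(\hat{\beta}_0 = 0.9051\) and \(\hat{\beta}_1 = 0.3501\) fall farther from the true values, Multiple R-squared drops to 0.043, and the slope is only marginally significant (p = 0.038): the points scatter widely around the fitted line.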

  (i) What are the confidence intervals for \(\beta_0\) and \(\beta_1\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.
t.test(X, Y)
## 
##  Welch Two Sample t-test
## 
## data:  X and Y
## t = -8.4998, df = 135.17, p-value = 3.095e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3363711 -0.8318807
## sample estimates:
##   mean of x   mean of y 
## -0.03913424  1.04499166
(-0.8318807)-(-1.3363711)  # width of the 95% CI printed above
## [1] 0.5044904
t.test(X, Y2)
## 
##  Welch Two Sample t-test
## 
## data:  X and Y2
## t = -8.2429, df = 136.53, p-value = 1.237e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3077825 -0.8017076
## sample estimates:
##   mean of x   mean of y 
## -0.03913424  1.01561080
(-0.8017076)-(-1.3077825)  # width of the 95% CI printed above
## [1] 0.5060749
t.test(X, Y3)
## 
##  Welch Two Sample t-test
## 
## data:  X and Y3
## t = -4.0654, df = 161.2, p-value = 7.478e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3825330 -0.4785238
## sample estimates:
##   mean of x   mean of y 
## -0.03913424  0.89139417
(-0.4785238)-(-1.3825330)  # width of the 95% CI printed above
## [1] 0.9040092

The noisier data give a clearly wider interval, while the original and low-noise data sets give similar widths here. The effect is more visible in the coefficient standard errors: with less noise the confidence intervals for \(\beta_0\) and \(\beta_1\) become narrower, and with more noise they become wider.
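Note that t.test() compares the means of X and Y rather than producing intervals for the regression coefficients. A minimal sketch of the direct approach, applying confint() to the three fitted models:

# 95% confidence intervals for the intercept and slope of each fit
confint(mod1)  # original data
confint(mod2)  # less noise: narrowest intervals
confint(mod3)  # more noise: widest intervals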