1a.

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
summary(lm(y~x + 0))
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9939     0.1065   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

The coefficient estimate is 1.9939, the standard error is 0.1065, the t-value is 18.73, and the p-value is essentially 0. These results indicate a strong fit: the estimate is close to the true slope of 2, falling short of an exact match only because of the noise term added to y. The tiny p-value means it is extremely unlikely we would see an estimate this large relative to its standard error if there were no true relationship between x and y, so we reject \(H_0: \beta = 0\).
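
As a sanity check on where these numbers come from, here is a short added sketch that recomputes the estimate and its standard error by hand, assuming the usual formulas for least squares regression through the origin:

beta_hat <- sum(x * y) / sum(x^2)                   # slope estimate for y ~ x + 0
rss      <- sum((y - beta_hat * x)^2)               # residual sum of squares
se_hat   <- sqrt(rss / (length(x) - 1) / sum(x^2))  # only one parameter, so n - 1 df
c(estimate = beta_hat, std.error = se_hat)          # should match the summary above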


  1. Regressing x onto y without an intercept gives a different coefficient estimate than regressing y onto x, but the two fits share exactly the same t-statistic, R-squared, and F-statistic. This makes sense: the t-statistic measures how strongly x and y are related, and that strength does not depend on which variable is treated as the predictor, as the formula below shows.
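
To see why, here is an added sketch of the algebra (using the standard formulas for regression without an intercept). The no-intercept fit of y onto x gives

\[
\hat\beta = \frac{\sum_i x_i y_i}{\sum_i x_i^2},
\qquad
t = \frac{\hat\beta}{\mathrm{SE}(\hat\beta)}
  = \frac{\sqrt{n-1}\,\sum_i x_i y_i}{\sqrt{\bigl(\sum_i x_i^2\bigr)\bigl(\sum_i y_i^2\bigr)-\bigl(\sum_i x_i y_i\bigr)^2}},
\]

and the final expression is symmetric in x and y, so swapping the roles of the two variables leaves the t-statistic (and hence R-squared and the F-statistic) unchanged.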

x <- rnorm(100)
y <- 2 * x + rnorm(100)
summary(lm(y~x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.74179 -0.56139 -0.01749  0.67973  1.84843 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.04845    0.09910   0.489    0.626    
## x            2.10622    0.09626  21.881   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9906 on 98 degrees of freedom
## Multiple R-squared:  0.8301, Adjusted R-squared:  0.8284 
## F-statistic: 478.8 on 1 and 98 DF,  p-value: < 2.2e-16
summary(lm(x~y))
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12182 -0.33896 -0.01481  0.22338  1.23712 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01405    0.04290  -0.328    0.744    
## y            0.39411    0.01801  21.881   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4285 on 98 degrees of freedom
## Multiple R-squared:  0.8301, Adjusted R-squared:  0.8284 
## F-statistic: 478.8 on 1 and 98 DF,  p-value: < 2.2e-16

Both regressions have a t-value of 21.881 for the slope, so even when an intercept is included, the regression of y onto x and the regression of x onto y share the same t-statistic.
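
An added note on why this happens: for simple linear regression with an intercept, the slope t-statistic can be written in terms of the sample correlation \(r\) between x and y,

\[
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}},
\]

and since \(r\) is symmetric in x and y, so is \(t\), even though the two slope estimates themselves differ.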

2a-c.

set.seed(1)
x <- rnorm(100)
eps <- rnorm(100,0,0.25)
y <- -1 + 0.5 * x + eps

The vector y has length 100, the same as x and eps. In this model, \(\beta_0 = -1\) and \(\beta_1 = 0.5\).
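
Written out, the simulated model is

\[
Y = -1 + 0.5X + \varepsilon, \qquad \varepsilon \sim N(0, 0.25^2),
\]

since the third argument of rnorm is the standard deviation, not the variance.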

plot(x,y)

The data appear to be fairly strongly linearly related, although there is a noticeable amount of noise around the line. There are no obvious outliers.
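
To back the visual impression with a number, the sample correlation could also be computed (a small addition; output not shown):

cor(x, y)  # should be fairly high, given sd(eps) = 0.25 relative to the slope of 0.5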

e-f. 

normal_noise_regression <- lm(y~x)
plot(x,y)
abline(lm(y~x),col = 'red')
abline(-1,0.5, col = 'green')
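
To tell the two lines apart, a legend could be added to the plot; the labels and position below are my own choices, not part of the original chunk:

legend("topleft", legend = c("least squares fit", "population line"),
       col = c("red", "green"), lty = 1)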

summary(lm(y~x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46921 -0.15344 -0.03487  0.13485  0.58654 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.00942    0.02425  -41.63   <2e-16 ***
## x            0.49973    0.02693   18.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
x <- rnorm(100)
eps <- rnorm(100,0,0.1)
y <- -1 + 0.5 * x + eps
plot(x,y)
summary(lm(y~x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.274179 -0.056139 -0.001749  0.067973  0.184843 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.995155   0.009910 -100.42   <2e-16 ***
## x            0.510622   0.009626   53.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09906 on 98 degrees of freedom
## Multiple R-squared:  0.9663, Adjusted R-squared:  0.966 
## F-statistic:  2814 on 1 and 98 DF,  p-value: < 2.2e-16
abline(lm(y~x),col = 'red')
abline(-1,0.5, col = 'green')

less_noise_regression <- lm(y~x)

The two models are very similar. The main difference is that with less noise the coefficient standard errors shrink and R-squared rises, which makes sense, since the plot shows the points lying much closer to the fitted line.
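
This matches the usual standard-error formula for the slope in simple linear regression (an added note; \(\sigma\) is the standard deviation of the noise term):

\[
\mathrm{SE}(\hat\beta_1) = \frac{\sigma}{\sqrt{\sum_i (x_i - \bar{x})^2}},
\]

so for a fixed set of x values, shrinking the noise standard deviation shrinks the standard error proportionally.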

x <- rnorm(100)
eps <- rnorm(100,0,0.5)
y <- -1 + 0.5 * x + eps
plot(x,y)
summary(lm(y~x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25507 -0.30275  0.01032  0.35241  1.04490 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.02373    0.04838  -21.16   <2e-16 ***
## x            0.46253    0.04155   11.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4835 on 98 degrees of freedom
## Multiple R-squared:  0.5584, Adjusted R-squared:  0.5539 
## F-statistic: 123.9 on 1 and 98 DF,  p-value: < 2.2e-16
abline(lm(y~x),col = 'red')
abline(-1,0.5, col = 'green')

more_noise_regression <- lm(y~x)

When the noise is increased, the least-squares line and the true population line begin to differ; they are no longer visually indistinguishable. The coefficient standard errors are also larger than in (e), and R-squared drops.

  1. Here are the 95% confidence intervals for the coefficients of the three regressions:
confint(normal_noise_regression)
##                  2.5 %     97.5 %
## (Intercept) -1.0575402 -0.9613061
## x            0.4462897  0.5531801
confint(less_noise_regression)
##                  2.5 %     97.5 %
## (Intercept) -1.0148210 -0.9754890
## x            0.4915195  0.5297242
confint(more_noise_regression)
##                  2.5 %     97.5 %
## (Intercept) -1.1197386 -0.9277138
## x            0.3800695  0.5449816

The more noise there is, the wider the confidence intervals for both \(\beta_0\) and \(\beta_1\): the low-noise fit gives the narrowest intervals and the high-noise fit the widest. (Comparing individual endpoints can be misleading, since the coefficient estimates themselves also shift a little from fit to fit; the interval width is the fairer comparison.) This makes sense: when the data are more spread out around the line, we cannot estimate the coefficients as precisely.
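
As a quick check on this, the interval widths can be compared directly by reusing the three stored fits (an added sketch; output omitted):

sapply(list(less   = less_noise_regression,
            normal = normal_noise_regression,
            more   = more_noise_regression),
       function(fit) apply(confint(fit), 1, diff))  # width = upper bound minus lower bound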