1a.
set.seed(1)
x<-rnorm(100)
y<-2*x + rnorm(100)
summary(lm(y~x + 0))
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
The coefficient estimate is 1.9939, with a standard error of 0.1065, a t-statistic of 18.73, and a p-value that is essentially zero. The estimate is close to the true slope of 2 used to generate the data (a perfect recovery would give exactly 2 rather than 1.994), and the tiny p-value means a relationship this strong would be extremely unlikely to appear by chance if the true slope were zero, so we reject \(H_0: \beta = 0\).
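As a quick sanity check, these numbers can be read straight off the fitted model's coefficient table (a minimal sketch using the same x and y as above; the object name fit_a is just for illustration):
# store the no-intercept fit and pull the slope row of its coefficient table:
# Estimate, Std. Error, t value, Pr(>|t|)
fit_a <- lm(y ~ x + 0)
coef(summary(fit_a))["x", ]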
b.
summary(lm(x ~ y + 0))
Regressing x onto y without an intercept does not reproduce the coefficient estimate from (a) (the estimated slope of x onto y is not simply the reciprocal of the slope of y onto x), but it gives exactly the same t-statistic, 18.73, and the same essentially-zero p-value. This makes sense: for regression through the origin, the t-statistic for the slope can be written in a form that is symmetric in x and y, so it cannot depend on which variable is treated as the response.
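A small numerical check of that symmetry (a sketch, assuming x and y are still the simulated vectors from above; the closed-form expression is the no-intercept t-statistic written symmetrically in x and y):
# closed-form t-statistic for the slope in a regression through the origin;
# the expression is symmetric in x and y, so swapping them cannot change it
n <- length(x)
t_closed_form <- sqrt(n - 1) * sum(x * y) /
  sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)

# both lm() fits should report (numerically) the same t value
c(closed_form = t_closed_form,
  y_on_x = coef(summary(lm(y ~ x + 0)))["x", "t value"],
  x_on_y = coef(summary(lm(x ~ y + 0)))["y", "t value"])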
x<-rnorm(100)
y<-2*x + rnorm(100)
summary(lm(y~x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.74179 -0.56139 -0.01749 0.67973 1.84843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04845 0.09910 0.489 0.626
## x 2.10622 0.09626 21.881 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9906 on 98 degrees of freedom
## Multiple R-squared: 0.8301, Adjusted R-squared: 0.8284
## F-statistic: 478.8 on 1 and 98 DF, p-value: < 2.2e-16
summary(lm(x~y))
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12182 -0.33896 -0.01481 0.22338 1.23712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01405 0.04290 -0.328 0.744
## y 0.39411 0.01801 21.881 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4285 on 98 degrees of freedom
## Multiple R-squared: 0.8301, Adjusted R-squared: 0.8284
## F-statistic: 478.8 on 1 and 98 DF, p-value: < 2.2e-16
Both regressions give a slope t-statistic of 21.881, so even when an intercept is included, regressing y onto x and regressing x onto y produce the same t-statistic (and hence the same p-value) for the slope.
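A quick check of that claim, pulling just the slope t values from the two fits above (a sketch, assuming x and y are the vectors simulated for this part):
# slope t-statistics from y ~ x and x ~ y; they should match
c(y_on_x = coef(summary(lm(y ~ x)))["x", "t value"],
  x_on_y = coef(summary(lm(x ~ y)))["y", "t value"])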
2a-c.
set.seed(1)
x <- rnorm(100)
eps <- rnorm(100,0,0.25)
y <- -1 + 0.5 * x + eps
The vector y has length 100, the same as x and eps. In this model, \(\beta_0 = -1\) and \(\beta_1 = 0.5\).
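A one-line check of the length claim (using the vectors just created):
# y inherits its length from x and eps, so this should print 100
length(y)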
plot(x,y)
The scatterplot shows a reasonably clear linear relationship between x and y, although with a noticeable amount of noise around it. I do not see any obvious outliers.
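The strength of that linear relationship can be quantified with the sample correlation (a quick check, assuming the x and y simulated above):
# sample correlation between x and y; the noise sd of 0.25 keeps it well below 1
cor(x, y)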
e-f.
normal_noise_regression <- lm(y~x)
plot(x,y)
abline(lm(y~x),col = 'red')
abline(-1,0.5, col = 'green')
summary(lm(y~x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46921 -0.15344 -0.03487 0.13485 0.58654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.00942 0.02425 -41.63 <2e-16 ***
## x 0.49973 0.02693 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
x <- rnorm(100)
eps <- rnorm(100,0,0.1)
y <- -1 + 0.5 * x + eps
plot(x,y)
summary(lm(y~x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.274179 -0.056139 -0.001749 0.067973 0.184843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.995155 0.009910 -100.42 <2e-16 ***
## x 0.510622 0.009626 53.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09906 on 98 degrees of freedom
## Multiple R-squared: 0.9663, Adjusted R-squared: 0.966
## F-statistic: 2814 on 1 and 98 DF, p-value: < 2.2e-16
abline(lm(y~x),col = 'red')
abline(-1,0.5, col = 'green')
less_noise_regression <- lm(y~x)
The fitted model is very similar to the one in (e). The main difference is that with less noise in eps, the standard errors of the coefficient estimates are smaller and \(R^2\) is higher, which makes sense: the plot shows the points clustered much more tightly around the line.
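The two fits can also be compared directly through their residual standard errors and \(R^2\) values (a sketch using the model objects stored above):
# less noise in eps -> smaller residual standard error and higher R^2
c(original = summary(normal_noise_regression)$sigma,
  less     = summary(less_noise_regression)$sigma)
c(original = summary(normal_noise_regression)$r.squared,
  less     = summary(less_noise_regression)$r.squared)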
x <- rnorm(100)
eps <- rnorm(100,0,0.5)
y <- -1 + 0.5 * x + eps
plot(x,y)
summary(lm(y~x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.25507 -0.30275 0.01032 0.35241 1.04490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.02373 0.04838 -21.16 <2e-16 ***
## x 0.46253 0.04155 11.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4835 on 98 degrees of freedom
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.5539
## F-statistic: 123.9 on 1 and 98 DF, p-value: < 2.2e-16
abline(lm(y~x),col = 'red')
abline(-1,0.5, col = 'green')
more_noise_regression <- lm(y~x)
When the noise is increased, the least squares line and the population regression line begin to differ; they are no longer visually indistinguishable. The standard errors are also larger than in (e), and \(R^2\) drops to about 0.56.
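Stacking the three sets of coefficient estimates shows how they drift from the true values \(\beta_0 = -1\) and \(\beta_1 = 0.5\) as the noise grows (a sketch using the stored fits):
# fitted intercept and slope for each noise level (true values: -1 and 0.5)
rbind(less_noise     = coef(less_noise_regression),
      original_noise = coef(normal_noise_regression),
      more_noise     = coef(more_noise_regression))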
confint(normal_noise_regression)
## 2.5 % 97.5 %
## (Intercept) -1.0575402 -0.9613061
## x 0.4462897 0.5531801
confint(less_noise_regression)
## 2.5 % 97.5 %
## (Intercept) -1.0148210 -0.9754890
## x 0.4915195 0.5297242
confint(more_noise_regression)
## 2.5 % 97.5 %
## (Intercept) -1.1197386 -0.9277138
## x 0.3800695 0.5449816
The more noise there is, the wider the confidence intervals for both \(\beta_0\) and \(\beta_1\). The individual endpoints do not move perfectly monotonically (for example, the upper bound for the slope in the noisiest fit is slightly below the upper bound for the slope in the original fit), but the interval widths clearly grow with the noise. This makes sense: when the data are more spread out around the line, we cannot estimate the coefficients as precisely.
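The trend is easiest to see in the interval widths rather than the endpoints (a sketch computed from the three confint() calls above):
# width of each 95% confidence interval (upper bound minus lower bound),
# one column per fit, one row per coefficient
sapply(list(less     = less_noise_regression,
            original = normal_noise_regression,
            more     = more_noise_regression),
       function(fit) apply(confint(fit), 1, diff))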