Part II: Practice Problems
Problem 1: Investigating the T-stat
a)
set.seed(1)
x<-rnorm(100)
y<-2*x+rnorm(100)
slr<-lm(y~x+0)
summary(slr)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
coefficient estimate: 1.9939
standard error of the coefficient estimate: 0.1065
t-statistic: 18.73
p-value: <2e-16
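The t-statistic reported above is the coefficient estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution with 99 degrees of freedom. A minimal sketch of that arithmetic, using the rounded values from the summary (so the result matches the printed 18.73 only up to rounding); the variable names are illustrative:

```r
# Rounded values copied from the summary output above
est <- 1.9939   # coefficient estimate
se  <- 0.1065   # standard error of the estimate
df  <- 99       # residual degrees of freedom (100 observations, 1 parameter)

# t-statistic: estimate divided by its standard error
t_stat <- est / se

# Two-sided p-value for H0: beta = 0
p_val <- 2 * pt(-abs(t_stat), df = df)

print(t_stat)
print(p_val)
```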
The p-value is far below α = 0.05, so we reject the null hypothesis H0: β = 0; the coefficient on x is statistically significant.
b)
set.seed(1)
x<-rnorm(100)
y<-2*x+rnorm(100)
slr2<-lm(x~y+0)
summary(slr2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
coefficient estimate: 0.39111
standard error of the coefficient estimate: 0.02089
t-statistic: 18.73
p-value: <2e-16
Again the p-value is far below α = 0.05, so we reject the null hypothesis H0: β = 0; the coefficient on y is statistically significant.
The results obtained in a) and b) have the same t-statistic and p-value. Both regressions are fit to the same data, generated from y = 2x + ε; a) regresses y onto x and b) regresses x onto y. The slope estimates differ (1.9939 versus 0.39111), but the test of association is identical.
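One way to see why the two t-statistics coincide is to write the t-statistic for regression through the origin in closed form. The sketch below uses the standard no-intercept formulas for the slope and its standard error (with n − 1 residual degrees of freedom); the summation notation is introduced here for illustration.

```latex
\begin{aligned}
\hat\beta &= \frac{\sum_i x_i y_i}{\sum_i x_i^2},
\qquad
\mathrm{SE}(\hat\beta) = \sqrt{\frac{\sum_i \left(y_i - x_i\hat\beta\right)^2}{(n-1)\sum_i x_i^2}} \\[4pt]
\sum_i \left(y_i - x_i\hat\beta\right)^2 &= \sum_i y_i^2 - \frac{\left(\sum_i x_i y_i\right)^2}{\sum_i x_i^2} \\[4pt]
t = \frac{\hat\beta}{\mathrm{SE}(\hat\beta)} &= \frac{\sqrt{n-1}\,\sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \sum_i y_i^2 - \left(\sum_i x_i y_i\right)^2}}
\end{aligned}
```

The final expression is unchanged when x and y are swapped, so regressing y onto x and x onto y must give the same t-statistic.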
slr3<- lm(y~x)
summary(slr3)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8768 -0.6138 -0.1395 0.5394 2.3462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03769 0.09699 -0.389 0.698
## x 1.99894 0.10773 18.556 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
The test statistic for the coefficient estimate is 18.556 when the regression of y onto x is performed with an intercept.
slr4<-lm(x~y)
summary(slr4)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90848 -0.28101 0.06274 0.24570 0.85736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03880 0.04266 0.91 0.365
## y 0.38942 0.02099 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
The t-statistic for the coefficient on y is 18.56 when the regression of x onto y is performed with an intercept, the same value (up to rounding) as in the regression of y onto x (18.556).
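The symmetry can also be checked numerically. The sketch below regenerates the data with the same seed and compares the t-statistics from `lm` against two standard closed forms: for regression through the origin, t = sqrt(n − 1) Σxy / sqrt(Σx² Σy² − (Σxy)²), and with an intercept, t = r sqrt(n − 2) / sqrt(1 − r²), where r is the sample correlation. Both expressions are symmetric in x and y. Variable names such as `t_origin` are illustrative.

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
n <- length(x)

# No intercept: closed form that is symmetric in x and y
t_origin <- sqrt(n - 1) * sum(x * y) /
  sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)

# With intercept: the slope t-statistic depends only on the sample correlation
r <- cor(x, y)
t_intercept <- r * sqrt(n - 2) / sqrt(1 - r^2)

# t-statistics reported by lm() for each of the four fits
t_yx0 <- coef(summary(lm(y ~ x + 0)))["x", "t value"]
t_xy0 <- coef(summary(lm(x ~ y + 0)))["y", "t value"]
t_yx  <- coef(summary(lm(y ~ x)))["x", "t value"]
t_xy  <- coef(summary(lm(x ~ y)))["y", "t value"]

print(c(formula = t_origin, y_on_x = t_yx0, x_on_y = t_xy0))
print(c(formula = t_intercept, y_on_x = t_yx, x_on_y = t_xy))
```

Within each printed vector the three numbers should agree up to floating-point error.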
Problem 2: SLR Estimation
a)
set.seed(1)
x<- rnorm(100, mean=0, sd=1)
eps <- rnorm(100, mean = 0, sd = 0.25)
y <- -1+0.5*x+ eps
length(y)
## [1] 100
The length of the vector y is 100.
The value of β0 is -1.
The value of β1 is 0.5.
plot(x,y)
The scatterplot shows a positive direction, linear form, a moderately strong relationship between x and y, and there appear to be no outliers.
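The visual impression of strength can be backed by a number. A minimal sketch, assuming the same seed and data-generating code as above: it computes the sample correlation between x and y, whose square equals the R-squared of the simple linear regression of y onto x.

```r
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
eps <- rnorm(100, mean = 0, sd = 0.25)
y <- -1 + 0.5 * x + eps

# Sample correlation: sign gives the direction, magnitude gives the strength
r <- cor(x, y)
print(r)

# Squared correlation: share of the variance in y linearly associated with x
print(r^2)
```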
model <- lm(y~x)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46921 -0.15344 -0.03487 0.13485 0.58654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.00942 0.02425 -41.63 <2e-16 ***
## x 0.49973 0.02693 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
βhat0 is -1.00942 and βhat1 is 0.49973. These estimates are extremely close to the values in the population model (β0 = -1, β1 = 0.5).
The model has an R-squared value of 0.7784, so x explains about 78% of the variance in y and the model fits the data reasonably well.
The p-value for the slope is very small (< 2e-16), so the null hypothesis H0: β1 = 0 can be rejected.
plot(x,y)
abline(model, col="blue")
abline(-1, 0.5, col="yellow")
legend("topleft", c("Least squares", "Population regression"), col= c("blue","yellow"), lty=c(1,1))
set.seed(1)
eps <-rnorm(100, sd = 0.10)
x <- rnorm(100)
y <- -1 + 0.5*x + eps
model_less <- lm(y~x)
summary(model_less)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.232416 -0.060361 0.000536 0.058305 0.229316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.989115 0.009035 -109.48 <2e-16 ***
## x 0.499907 0.009472 52.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09028 on 98 degrees of freedom
## Multiple R-squared: 0.966, Adjusted R-squared: 0.9657
## F-statistic: 2785 on 1 and 98 DF, p-value: < 2.2e-16
plot(x,y)
abline(model_less, col="blue")
abline(-1, 0.5, col = "yellow")
legend("topleft", c("Least squares", "Population regression"), col= c("blue","yellow"), lty=c(1,1))
The noise was decreased by lowering the standard deviation of eps from 0.25 to 0.10. The relationship is strongly linear, with an R^2 value of 0.966 and a residual standard error of 0.09028. The least squares line and the population regression line nearly overlap for this model. Compared to the original model (R^2 = 0.7784, residual standard error = 0.2407), the R^2 value is greater and the residual standard error is lower.
set.seed(1)
eps <-rnorm(100, sd = 0.5)
x <- rnorm(100)
y <- -1 + 0.5*x + eps
model_more <- lm(y~x)
summary(model_more)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.16208 -0.30181 0.00268 0.29152 1.14658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.94557 0.04517 -20.93 <2e-16 ***
## x 0.49953 0.04736 10.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4514 on 98 degrees of freedom
## Multiple R-squared: 0.5317, Adjusted R-squared: 0.5269
## F-statistic: 111.2 on 1 and 98 DF, p-value: < 2.2e-16
plot(x,y)
abline(model_more, col="blue")
abline(-1, 0.5, col = "yellow")
legend("topleft", c("Least squares", "Population regression"), col= c("blue","yellow"), lty=c(1,1))
The noise was increased by raising the standard deviation of eps from 0.25 to 0.5. The relationship is weaker but still linear, with an R^2 value of 0.5317, which is lower than both the original model (0.7784) and the model created by decreasing the noise (0.966). This model also has a residual standard error of 0.4514, which is greater than both the original model (0.2407) and the model created by decreasing the noise (0.09028).
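The pattern in R^2 across the three fits follows from the population model. As a rough sketch, assuming x has variance 1 (as in the simulation) and is independent of the noise, the share of the variance of y explained by x is:

```latex
R^2_{\text{pop}} = \frac{\beta_1^2 \operatorname{Var}(x)}{\beta_1^2 \operatorname{Var}(x) + \sigma^2},
\qquad \beta_1 = 0.5,\ \operatorname{Var}(x) = 1:
\quad
\sigma = 0.10 \Rightarrow \frac{0.25}{0.26} \approx 0.96, \quad
\sigma = 0.25 \Rightarrow \frac{0.25}{0.3125} = 0.80, \quad
\sigma = 0.50 \Rightarrow \frac{0.25}{0.50} = 0.50
```

These population values are close to the sample values reported above (0.966, 0.7784, and 0.5317); the differences reflect sampling variation in a single draw of 100 observations.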
confint(model)
## 2.5 % 97.5 %
## (Intercept) -1.0575402 -0.9613061
## x 0.4462897 0.5531801
Less Noise Confidence Interval
confint(model_less)
## 2.5 % 97.5 %
## (Intercept) -1.0070441 -0.9711855
## x 0.4811096 0.5187039
More Noise Confidence Interval
confint(model_more)
## 2.5 % 97.5 %
## (Intercept) -1.0352203 -0.8559276
## x 0.4055479 0.5935197
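All three sets of confidence intervals are centered near the true values and contain them (β0 = -1, β1 = 0.5), but the intervals widen as the noise increases. For β1 the interval width is about 0.038 with less noise, 0.107 for the original data, and 0.188 with more noise. When there is less noise in the data, the coefficients are estimated more precisely.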
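The widening can be traced to the standard error: a 95% interval is the estimate plus or minus a t critical value times the standard error, and the standard error grows with the noise. The sketch below makes one simplifying assumption that differs from the code above: it reuses the same x and the same standard normal draws `z` at every noise level and only rescales them, so that the noise level is the only thing that changes. The names `z` and `ci_width` are illustrative.

```r
set.seed(1)
x <- rnorm(100)
z <- rnorm(100)  # standard normal noise, rescaled by each sd below

noise_sd <- c(less = 0.10, original = 0.25, more = 0.50)

ci_width <- sapply(noise_sd, function(s) {
  y <- -1 + 0.5 * x + s * z
  fit <- lm(y ~ x)

  # Manual 95% interval width for the slope: 2 * t critical value * SE
  se <- coef(summary(fit))["x", "Std. Error"]
  tcrit <- qt(0.975, df = df.residual(fit))
  manual <- 2 * tcrit * se

  # Check the manual width against confint()
  ci <- confint(fit)["x", ]
  stopifnot(isTRUE(all.equal(manual, unname(ci[2] - ci[1]))))
  manual
})

print(ci_width)
```

Because only the scale of the noise changes in this setup, the slope interval width is directly proportional to the noise standard deviation.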