a. Generate data with X∼N(μ=2,σ=0.1), sample size n=100 and error term ϵ∼N(μ=0,σ=0.5).

set.seed(7052)

n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)

sigma <- 0.5
epsilon <- rnorm(n, mean = 0, sd = sigma)

Y <- 10 + 5 * X + epsilon

model <- lm(Y ~ X)

summary(model)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2073 -0.3029  0.0093  0.3033  1.3545 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.0218     0.8336   10.82   <2e-16 ***
## X             5.5652     0.4155   13.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared:  0.6468, Adjusted R-squared:  0.6432 
## F-statistic: 179.4 on 1 and 98 DF,  p-value: < 2.2e-16
mse <- mean(model$residuals^2)
mse
## [1] 0.1992276
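As a quick, optional check of the simulation (not required by the prompt), the simulated points can be plotted together with the true line Y = 10 + 5X and the fitted least-squares line:

plot(X, Y, pch = 16, col = "grey50")   # simulated data from part a
abline(a = 10, b = 5, lty = 2)         # true line: Y = 10 + 5X
abline(model, col = "red")             # fitted OLS line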

b. Fit a simple linear regression to the simulated data from part a. What is the estimated prediction equation? Report the estimated coefficients and their standard errors. Are they significant? Clearly write out the null and alternative hypotheses, observed t-statistic(s), p-value(s), and interpret the estimates and test results. What is the fitted model’s MSE?

The estimated prediction equation is \(\hat{Y} = 9.0218 + 5.5652X\).

Estimated coefficients: \(\hat{\beta}_0 = 9.0218\) (intercept) and \(\hat{\beta}_1 = 5.5652\) (slope).

Standard errors: 0.8336 (intercept) and 0.4155 (slope).

Null hypothesis: \(H_0: \beta_1 = 0\), i.e., the slope is 0 and there is no linear relationship between X and Y (the analogous test for the intercept uses \(H_0: \beta_0 = 0\)).

Alternative hypothesis: \(H_a: \beta_1 \neq 0\), i.e., the slope is not 0 and there is a linear relationship between X and Y.

Observed t-statistics: 10.82 (intercept) and 13.39 (slope).

p-values: < 2e-16 for both coefficients (below R's reporting threshold).

Since both p-values are far below \(\alpha = 0.05\), we reject the null hypothesis for each coefficient. In particular, the slope is significantly different from 0, indicating a statistically significant linear relationship between X and Y.
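The t-statistics and p-values above can be reproduced directly from the coefficient table; a minimal sketch using the model object fitted above (the object names t_slope and p_slope are just illustrative):

coefs <- summary(model)$coefficients
t_slope <- coefs["X", "Estimate"] / coefs["X", "Std. Error"]     # observed t = estimate / SE
p_slope <- 2 * pt(abs(t_slope), df = n - 2, lower.tail = FALSE)  # two-sided p-value on n - 2 df
c(t = t_slope, p = p_slope)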

The fitted model's MSE is 0.1992276. The MSE is the average squared difference between the observed and fitted values, so a smaller MSE indicates a better fit. Here the MSE is close to the true error variance \(\sigma^2 = 0.25\) used to simulate the data, which is about as well as a correctly specified model can do.

Since both the intercept and slope coefficients are highly significant (p-values < 2e-16), we conclude that:

•   The intercept (\(\beta_0\)) is significantly different from zero, indicating a meaningful baseline value of Y when X = 0.
•   The slope (\(\beta_1\)) is significantly different from zero, indicating a strong relationship between X and Y.

This significance implies that X is a meaningful predictor of Y in this model.
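As an illustration of the prediction equation in use, predict() applies the fitted coefficients to new X values; the value X = 2 below is an arbitrary point near the centre of the simulated X's, not part of the assignment:

coef(model)                                  # estimated intercept and slope
predict(model, newdata = data.frame(X = 2))  # roughly 9.0218 + 5.5652 * 2 ≈ 20.15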

c. Repeat part b), but re-simulate the data and change the error term to ϵ∼N(0,σ=1).

set.seed(7052)

n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)

sigma <- 1
epsilon <- rnorm(n, mean = 0, sd = sigma)

Y <- 10 + 5 * X + epsilon

model <- lm(Y ~ X)

summary(model)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4146 -0.6058  0.0186  0.6066  2.7090 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.0436     1.6673   4.824 5.16e-06 ***
## X             6.1303     0.8309   7.378 5.25e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9018 on 98 degrees of freedom
## Multiple R-squared:  0.3571, Adjusted R-squared:  0.3505 
## F-statistic: 54.43 on 1 and 98 DF,  p-value: 5.253e-11
mse <- mean(model$residuals^2)
mse
## [1] 0.7969102
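Compared with part b, the estimated prediction equation is now \(\hat{Y} = 8.0436 + 6.1303X\). Both coefficients remain significant (p-values 5.16e-06 for the intercept and 5.25e-11 for the slope), but doubling \(\sigma\) roughly doubles the standard errors (1.6673 and 0.8309), lowers \(R^2\) to 0.3571, and raises the MSE to 0.7969, close to the new error variance \(\sigma^2 = 1\).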

d. Repeat parts a)–c) using n=400. What do you conclude? What is the effect on the model parameter estimates when error variance gets smaller? What is the effect when sample size gets bigger?

Changing the value of n for part (a).

set.seed(7052)

n <- 400
X <- rnorm(n, mean = 2, sd = 0.1)

sigma <- 0.5
epsilon <- rnorm(n, mean = 0, sd = sigma)

Y <- 10 + 5 * X + epsilon

model <- lm(Y ~ X)

summary(model)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76214 -0.33740  0.03615  0.32077  1.63021 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.7466     0.5015   19.44   <2e-16 ***
## X             5.1177     0.2490   20.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4887 on 398 degrees of freedom
## Multiple R-squared:  0.5149, Adjusted R-squared:  0.5137 
## F-statistic: 422.5 on 1 and 398 DF,  p-value: < 2.2e-16
mse <- mean(model$residuals^2)
mse
## [1] 0.2376328

Changing the value of n for part (c).

#d) ii)
set.seed(7052)

n <- 400
X <- rnorm(n, mean = 2, sd = 0.1)

sigma <- 1
epsilon <- rnorm(n, mean = 0, sd = sigma)

Y <- 10 + 5 * X + epsilon

model <- lm(Y ~ X)

summary(model)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5243 -0.6748  0.0723  0.6415  3.2604 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.4933     1.0029   9.466   <2e-16 ***
## X             5.2355     0.4979  10.514   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9774 on 398 degrees of freedom
## Multiple R-squared:  0.2174, Adjusted R-squared:  0.2154 
## F-statistic: 110.5 on 1 and 398 DF,  p-value: < 2.2e-16
mse <- mean(model$residuals^2)
mse
## [1] 0.9505312
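Before drawing conclusions, the four fits can be lined up side by side. The helper below is not part of the assignment code, just a convenience that re-runs each scenario with the same seed and collects the slope estimate, its standard error, \(R^2\), and the MSE:

sim_fit <- function(n, sigma) {
  set.seed(7052)
  X <- rnorm(n, mean = 2, sd = 0.1)
  epsilon <- rnorm(n, mean = 0, sd = sigma)
  Y <- 10 + 5 * X + epsilon
  fit <- lm(Y ~ X)
  s <- summary(fit)
  c(n = n, sigma = sigma,
    slope = s$coefficients["X", "Estimate"],
    se_slope = s$coefficients["X", "Std. Error"],
    r_squared = s$r.squared,
    mse = mean(fit$residuals^2))
}

scenarios <- rbind(c(100, 0.5), c(100, 1), c(400, 0.5), c(400, 1))
round(t(apply(scenarios, 1, function(p) sim_fit(p[1], p[2]))), 4)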

Analysis and Conclusions

  1. Effect of Error Variance on Model Estimates and Fit

• Precision of Estimates: Reducing the error standard deviation from \(\sigma = 1\) to \(\sigma = 0.5\) (and hence the error variance) improves the precision of the parameter estimates at both sample sizes. For example, with n = 100 the standard error of the slope decreases from 0.8309 to 0.4155; with n = 400 it decreases from 0.4979 to 0.2490.

• Model Fit: Lower error variance results in a better model fit, as reflected in higher \(R^2\) values (e.g., \(R^2 = 0.6468\) with n = 100 and \(\sigma = 0.5\) vs. \(R^2 = 0.3571\) with \(\sigma = 1\)). Lower error variance also leads to a smaller MSE, indicating that predictions are closer to the observed values.

  2. Effect of Sample Size on Stability and Reliability of Estimates

• Stability of Estimates: Increasing the sample size from n = 100 to n = 400 improves the stability and reliability of the parameter estimates, as seen in the smaller standard errors. For instance, at \(\sigma = 0.5\) the standard error of the slope decreases from 0.4155 with n = 100 to 0.2490 with n = 400, so larger samples yield more precise parameter estimates.

• Consistency in Fit: With a larger sample, the fit metrics (\(R^2\) and MSE) are estimated more stably and sit closer to their population values. However, increased error variance still leads to a lower \(R^2\) and a higher MSE, so a larger sample improves stability but does not offset the effect of a noisier error term.

  3. Comparison of MSE Across Models

• Impact of Error Variance on MSE: Higher error variance consistently results in a higher MSE at both sample sizes. With n = 100, the MSE increases from 0.1992 at \(\sigma = 0.5\) to 0.7969 at \(\sigma = 1\); with n = 400, it increases from 0.2376 to 0.9505. Doubling \(\sigma\) quadruples the error variance (from 0.25 to 1), and the MSE rises roughly in proportion, since the MSE estimates \(\sigma^2\).

• Effect of Sample Size on MSE: A larger sample does not shrink the MSE; it makes the MSE a more reliable estimate of the true error variance. With \(\sigma = 1\), the MSE moves from 0.7969 at n = 100 to 0.9505 at n = 400, closer to the true \(\sigma^2 = 1\); likewise, with \(\sigma = 0.5\), it moves from 0.1992 to 0.2376, closer to \(\sigma^2 = 0.25\).

Mathematical Explanation

  1. Effect of Error Variance on Standard Errors:

• The standard error of the slope \(\hat{\beta}_1\) in simple linear regression is given by:

\(\text{SE}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{\sum (X_i - \bar{X})^2}}\)

• As \(\sigma\) (the error standard deviation) increases, the standard error of the slope also increases, reducing the precision of \(\hat{\beta}_1\). This relationship explains why a noisier error term leads to less precise parameter estimates and lower model fit quality.

  2. Effect of Sample Size on Standard Errors:

• In OLS regression, the standard error of the slope is inversely proportional to the square root of n, because \(\sum (X_i - \bar{X})^2\) grows roughly in proportion to n when the spread of X is fixed:

\(\text{SE}(\hat{\beta}_1) \propto \frac{1}{\sqrt{n}}\)

• Larger sample sizes reduce the standard error, improving the precision and stability of the estimates. This results in higher t-values, which support the statistical significance of the predictor.

  3. MSE as an Indicator of Model Fit:

• The MSE reported above, computed with mean(model$residuals^2), is the average squared residual:

MSE = \(\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\)

(The closely related unbiased estimate of the error variance divides by \(n - 2\) instead; its square root is the residual standard error reported by summary().)

• Lower error variance reduces the MSE, while a larger sample mainly makes the MSE a more reliable estimate of the true error variance \(\sigma^2\). When the error variance increases, the MSE rises because the residuals (prediction errors) are larger, indicating a decrease in model accuracy.
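These formulas can be checked numerically against the most recent fit (the n = 400, \(\sigma = 1\) model from part d); a minimal sketch, assuming model, X, and n are still the objects from that last chunk, and noting that in practice the unknown \(\sigma\) is replaced by its estimate \(\hat{\sigma}\):

Sxx <- sum((X - mean(X))^2)                          # spread of X about its mean
sigma_hat <- sqrt(sum(model$residuals^2) / (n - 2))  # residual standard error (estimate of sigma)
sigma_hat / sqrt(Sxx)                                # manual SE of the slope; matches summary(model)
mean(model$residuals^2)                              # MSE as computed above (divides by n, not n - 2)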

Summary

• Lower error variance improves the precision of parameter estimates, reflected in lower standard errors, a higher \(R^2\), and a lower MSE.

• Larger sample sizes enhance the stability and reliability of the estimates, reducing standard errors and resulting in more consistent and statistically significant parameter estimates.

In conclusion, minimizing error variance and increasing sample size together yield a more accurate and robust model, allowing it to better capture the true relationship between X and Y while reducing prediction errors.