set.seed(7052)
n <- 100                                    # sample size
X <- rnorm(n, mean = 2, sd = 0.1)           # predictor values
sigma <- 0.5                                # standard deviation of the error term
epsilon <- rnorm(n, mean = 0, sd = sigma)   # random error
Y <- 10 + 5 * X + epsilon                   # true model: Y = 10 + 5X + error
model <- lm(Y ~ X)                          # fit the simple linear regression of Y on X
summary(model)
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.2073  -0.3029   0.0093   0.3033   1.3545 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.0218     0.8336   10.82   <2e-16 ***
## X             5.5652     0.4155   13.39   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared: 0.6468, Adjusted R-squared: 0.6432
## F-statistic: 179.4 on 1 and 98 DF, p-value: < 2.2e-16
mse <- mean(model$residuals^2)
mse
## [1] 0.1992276
The estimated prediction equation is \(\hat{Y} = 9.0218 + 5.5652X\).
Coefficient estimates: \(\hat{\beta}_0\) = 9.0218 (intercept), \(\hat{\beta}_1\) = 5.5652 (slope)
Standard errors: 0.8336 (intercept), 0.4155 (slope)
Null hypothesis: \(H_0: \beta_1 = 0\), i.e., there is no linear relationship between X and Y.
Alternative hypothesis: \(H_a: \beta_1 \neq 0\), i.e., there is a linear relationship between X and Y.
t-statistics: 10.82 (intercept), 13.39 (slope)
p-values: < 2e-16 for both coefficients
Since the p-value is far below the significance level \(\alpha\) = 0.05, we reject the null hypothesis. The slope is significantly different from zero, indicating a statistically significant linear relationship between X and Y.
The MSE of the fitted model is 0.1992, the average squared difference between the observed and fitted values of Y. A lower MSE indicates a better fit; here the MSE is close to the true error variance \(\sigma^2 = 0.25\) used to generate the data, so the model fits about as well as the noise level allows.
Since both the intercept and slope coefficients are highly significant (p-values < 2e-16), we conclude that:
• The intercept (\(\beta_0\)) is significantly different from zero, indicating a meaningful baseline value of Y when X = 0 (although X = 0 lies well outside the simulated range of X, so this is an extrapolation).
• The slope (\(\beta_1\)) is significantly different from zero, indicating a strong relationship between X and Y.
This significance implies that X is a meaningful predictor of Y in this model; the sketch below illustrates how the fitted model can be used for inference and prediction.
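As a quick illustration, the fitted model can be queried directly for confidence intervals and predictions, and the MSE can be compared with the true error variance \(\sigma^2 = 0.5^2 = 0.25\) used to generate the data. This is only a sketch reusing the model object fitted above; the new X value of 2.1 is a hypothetical choice.
coef(summary(model))                          # coefficient table: estimates, SEs, t values, p-values
confint(model, level = 0.95)                  # 95% confidence intervals for beta_0 and beta_1
new_x <- data.frame(X = 2.1)                  # hypothetical new observation
predict(model, newdata = new_x, interval = "prediction")   # point prediction with 95% prediction interval
c(mse = mean(model$residuals^2), true_error_variance = 0.5^2)   # MSE vs. true sigma^2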
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
sigma <- 1                                  # error standard deviation doubled (n = 100 as before)
epsilon <- rnorm(n, mean = 0, sd = sigma)
Y <- 10 + 5 * X + epsilon
model <- lm(Y ~ X)
summary(model)
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4146 -0.6058  0.0186  0.6066  2.7090 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.0436     1.6673   4.824 5.16e-06 ***
## X             6.1303     0.8309   7.378 5.25e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9018 on 98 degrees of freedom
## Multiple R-squared: 0.3571, Adjusted R-squared: 0.3505
## F-statistic: 54.43 on 1 and 98 DF, p-value: 5.253e-11
mse <- mean(model$residuals^2)
mse
## [1] 0.7969102
set.seed(7052)
n <- 400                                    # sample size increased from 100 to 400
X <- rnorm(n, mean = 2, sd = 0.1)
sigma <- 0.5
epsilon <- rnorm(n, mean = 0, sd = sigma)
Y <- 10 + 5 * X + epsilon
model <- lm(Y ~ X)
summary(model)
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76214 -0.33740  0.03615  0.32077  1.63021 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.7466     0.5015   19.44   <2e-16 ***
## X             5.1177     0.2490   20.55   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4887 on 398 degrees of freedom
## Multiple R-squared: 0.5149, Adjusted R-squared: 0.5137
## F-statistic: 422.5 on 1 and 398 DF, p-value: < 2.2e-16
mse <- mean(model$residuals^2)
mse
## [1] 0.2376328
#d) ii)
set.seed(7052)
n <- 400                                    # larger sample
X <- rnorm(n, mean = 2, sd = 0.1)
sigma <- 1                                  # larger error standard deviation
epsilon <- rnorm(n, mean = 0, sd = sigma)
Y <- 10 + 5 * X + epsilon
model <- lm(Y ~ X)
summary(model)
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5243 -0.6748  0.0723  0.6415  3.2604 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.4933     1.0029   9.466   <2e-16 ***
## X             5.2355     0.4979  10.514   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9774 on 398 degrees of freedom
## Multiple R-squared: 0.2174, Adjusted R-squared: 0.2154
## F-statistic: 110.5 on 1 and 398 DF, p-value: < 2.2e-16
mse <- mean(model$residuals^2)
mse
## [1] 0.9505312
Analysis and Conclusions
• Precision of Estimates: Reducing the error standard deviation from \(\sigma\) = 1 to \(\sigma\) = 0.5 improves the precision of the parameter estimates at both sample sizes. For example, with n = 100 the standard error of the slope falls from 0.8309 to 0.4155; halving \(\sigma\) halves the standard error, exactly as the formula below predicts. The same gain is seen at n = 400 (0.4979 vs. 0.2490).
• Model Fit: Lower error variance generally results in a better model fit, as reflected in higher \(R^2\) values (e.g., \(R^2\) = 0.6468 with n = 100 and \(\sigma\) = 0.5 vs. \(R^2\) = 0.3571 with \(\sigma\) = 1). Lower error variance also leads to a smaller MSE, indicating that predictions are closer to the observed values.
• Stability of Estimates: Increasing the sample size from n = 100 to n = 400 generally improves the stability and reliability of parameter estimates, as seen in the smaller standard errors. For instance, the standard error of the slope at \(\sigma\) = 0.5 decreases from 0.4155 with n = 100 to 0.2490 with n = 400 . This indicates that larger samples yield more precise parameter estimates.
• Consistency in Fit: A larger sample makes the fit metrics (\(R^2\) and MSE) more reliable estimates of the underlying population quantities. However, even with n = 400, increased error variance still leads to a lower \(R^2\) and a higher MSE, showing that sample size improves stability but does not offset the effect of a noisier error term.
• Impact of Error Variance on MSE: Higher error variance consistently results in a higher MSE at both sample sizes. With n = 100, the MSE increases from 0.1992 at \(\sigma\) = 0.5 to 0.7969 at \(\sigma\) = 1; with n = 400, it increases from 0.2376 to 0.9505. Doubling \(\sigma\) quadruples the error variance \(\sigma^2\), and the MSE rises roughly fourfold in both cases, because the training MSE tracks \(\sigma^2\). Error variance therefore directly limits the accuracy of predictions.
• Effect of Sample Size on MSE: Because the training MSE estimates the error variance \(\sigma^2\), it does not shrink as the sample grows; rather, a larger sample makes it a more stable estimate of \(\sigma^2\). With \(\sigma\) = 1, for example, the MSE is 0.7969 at n = 100 and 0.9505 at n = 400, the larger sample landing closer to the true \(\sigma^2\) = 1. The four scenarios are consolidated numerically in the sketch after this list.
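The four scenarios discussed above can be reproduced and compared side by side with a small helper function. This is a sketch only; sim_fit is a hypothetical name, and each call resets the seed exactly as in the code blocks above.
sim_fit <- function(n, sigma) {
  set.seed(7052)
  X <- rnorm(n, mean = 2, sd = 0.1)
  epsilon <- rnorm(n, mean = 0, sd = sigma)
  Y <- 10 + 5 * X + epsilon
  fit <- lm(Y ~ X)
  s <- summary(fit)
  c(n = n, sigma = sigma,
    slope_se = s$coefficients["X", "Std. Error"],   # precision of the slope estimate
    r_squared = s$r.squared,                        # model fit
    mse = mean(fit$residuals^2))                    # training MSE
}
do.call(rbind, list(sim_fit(100, 0.5), sim_fit(100, 1),
                    sim_fit(400, 0.5), sim_fit(400, 1)))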
Mathematical Explanation
• The standard error of the slope \(\hat{\beta}_1\) in simple linear regression is given by:
\(\text{SE}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{\sum (X_i - \bar{X})^2}}\)
• As \(\sigma\) (the standard deviation of the error term) increases, the standard error of the slope also increases, reducing the precision of \(\hat{\beta}_1\). This relationship explains why higher error variance leads to less precise parameter estimates and lower model fit quality.
• In OLS regression, the standard error of the slope is approximately inversely proportional to the square root of n, because \(\sum (X_i - \bar{X})^2\) grows roughly in proportion to n:
\(\text{SE}(\hat{\beta}_1) \propto \frac{1}{\sqrt{n}}\)
• Larger sample sizes reduce the standard error, improving the precision and stability of the estimates. This results in higher t-values, which support the statistical significance of the predictor.
• MSE was computed above as the mean of the squared residuals:
MSE = \(\frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\)
The closely related unbiased estimator of \(\sigma^2\) (the square of the residual standard error reported by summary()) uses an n - 2 denominator instead: \(\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\).
• Lower error variance reduces the MSE, while a larger sample mainly makes the MSE a more reliable estimate of \(\sigma^2\). When the error variance increases, the MSE rises because the residuals (prediction errors) are larger, indicating a decrease in model accuracy. Both formulas above are checked numerically in the sketch after this list.
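Both formulas can be checked numerically against the most recently fitted model (n = 400, \(\sigma\) = 1). This is a verification sketch, not part of the assignment output; sigma(model) returns the residual standard error \(\hat{\sigma}\), which uses the n - 2 denominator.
se_manual <- sigma(model) / sqrt(sum((X - mean(X))^2))    # SE of the slope from the formula
se_manual                                                 # matches the 0.4979 reported by summary(model)
c(mse_n = mean(model$residuals^2),                        # 1/n denominator, as computed above
  sigma2_hat = sum(model$residuals^2) / (length(Y) - 2))  # unbiased 1/(n-2) estimator of sigma^2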
Summary
• Lower error variance improves the precision of parameter estimates, reflected in lower standard errors, higher \(R^2\) , and lower MSE.
• Larger sample sizes enhance the stability and reliability of the estimates, reducing standard errors and resulting in more consistent and statistically significant parameter estimates.
In conclusion, minimizing error variance and increasing sample size together yield a more accurate and robust model, allowing it to better capture the true relationship between X and Y while reducing prediction errors.