# Question 1a
set.seed(7052)
# Generate data: X ~ N(2, 0.1), eps ~ N(0, 0.5), n = 100
n <- 100
mu_x <- 2
sd_x <- 0.1
beta0 <- 10
beta1 <- 5
sd_eps <- 0.5
X <- rnorm(n, mean = mu_x, sd = sd_x)
eps <- rnorm(n, mean = 0, sd = sd_eps)
Y <- beta0 + beta1 * X + eps
data_a <- data.frame(X, Y)
head(data_a)
## X Y
## 1 1.907630 20.46643
## 2 1.949162 20.08051
## 3 2.029797 20.32815
## 4 2.101782 20.39166
## 5 2.000072 20.16090
## 6 1.926412 19.51731
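A quick sanity check (a sketch, not required by the question) confirms that the simulated sample matches the specified distributions:
# Sample moments should be close to mu_x = 2, sd_x = 0.1, and sd_eps = 0.5
c(mean_X = mean(X), sd_X = sd(X), sd_eps_hat = sd(eps))
# Scatter plot of the simulated data with the true line Y = 10 + 5X overlaid
plot(X, Y, main = "Simulated data, n = 100")
abline(beta0, beta1, col = "red")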
# Question 1b
model_b <- lm(Y ~ X, data = data_a)
summary(model_b)
##
## Call:
## lm(formula = Y ~ X, data = data_a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2073 -0.3029 0.0093 0.3033 1.3545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.0218 0.8336 10.82 <2e-16 ***
## X 5.5652 0.4155 13.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared: 0.6468, Adjusted R-squared: 0.6432
## F-statistic: 179.4 on 1 and 98 DF, p-value: < 2.2e-16
# Model MSE (residual variance estimate)
residuals_b <- resid(model_b)
MSE_b <- sum(residuals_b^2) / (n - 2)
MSE_b
## [1] 0.2032934
Estimated regression equation:
Y_hat = 9.0218 + 5.5652 X
This means that for each one-unit increase in X, the predicted Y increases by approximately 5.57 units.
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | 9.0218 | 0.8336 | 10.82 | < 2e-16 |
| X | 5.5652 | 0.4155 | 13.39 | < 2e-16 |
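The fitted equation and the coefficient table above can be pulled directly from the model object; the lines below are a sketch using standard extractors, and the predict() call simply illustrates what the slope means.
coef(model_b)            # intercept and slope of the fitted line
coef(summary(model_b))   # estimates, standard errors, t values, p-values
# Predictions at X = 2 and X = 3 differ by the slope estimate (about 5.57)
diff(predict(model_b, newdata = data.frame(X = c(2, 3))))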
Null and alternative hypotheses
Null hypothesis: beta_0 = 0. Alternative hypothesis: beta_0 != 0.
Since the p-value (< 2e-16) is far below the 0.05 significance level, we reject the null hypothesis: the intercept is statistically significant. The same test applied to the slope of X (p < 2e-16) shows it is also statistically significant.
Model’s MSE = 0.2033
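Two quick checks (a sketch re-expressing the output already shown): the MSE computed above is exactly the squared residual standard error from summary(), and it should sit near the true error variance 0.5^2 = 0.25; confidence intervals give an equivalent view of the significance tests.
summary(model_b)$sigma^2   # 0.4509^2 = 0.2033, identical to MSE_b
confint(model_b)           # 95% CIs; neither interval contains 0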
# Question 1c
set.seed(7052)
sd_eps_c <- 1
X_c <- rnorm(n, mean = mu_x, sd = sd_x)
eps_c <- rnorm(n, mean = 0, sd = sd_eps_c)
Y_c <- beta0 + beta1 * X_c + eps_c
model_c <- lm(Y_c ~ X_c)
summary(model_c)
##
## Call:
## lm(formula = Y_c ~ X_c)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4146 -0.6058 0.0186 0.6066 2.7090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.0436 1.6673 4.824 5.16e-06 ***
## X_c 6.1303 0.8309 7.378 5.25e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9018 on 98 degrees of freedom
## Multiple R-squared: 0.3571, Adjusted R-squared: 0.3505
## F-statistic: 54.43 on 1 and 98 DF, p-value: 5.253e-11
residuals_c <- resid(model_c)
MSE_c <- sum(residuals_c^2) / (n - 2)
MSE_c
## [1] 0.8131737
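A compact side-by-side view of the two n = 100 fits (a sketch built from the models already fitted above): doubling the error standard deviation roughly doubles the slope's standard error and roughly quadruples the MSE.
data.frame(
  error_sd  = c(0.5, 1),
  slope_SE  = c(coef(summary(model_b))[2, "Std. Error"],
                coef(summary(model_c))[2, "Std. Error"]),
  MSE       = c(MSE_b, MSE_c),
  R_squared = c(summary(model_b)$r.squared, summary(model_c)$r.squared)
)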
# Question 1d
set.seed(7052)
# Generate data: X ~ N(2, 0.1), eps ~ N(0, 0.5), n = 400
n_1 <- 400
mu_x_1 <- 2
sd_x_1 <- 0.1
beta0_1 <- 10
beta1_1 <- 5
sd_eps_1 <- 0.5
X_1 <- rnorm(n_1, mean = mu_x_1, sd = sd_x_1)
eps_1 <- rnorm(n_1, mean = 0, sd = sd_eps_1)
Y_1 <- beta0_1 + beta1_1 * X_1 + eps_1
data_a_1 <- data.frame(X_1, Y_1)
head(data_a_1)
## X_1 Y_1
## 1 1.907630 18.94756
## 2 1.949162 19.78539
## 3 2.029797 20.37686
## 4 2.101782 21.28714
## 5 2.000072 19.61935
## 6 1.926412 19.60554
# Fit the linear model on the n = 400 data (low error variance)
model_b_1 <- lm(Y_1 ~ X_1, data = data_a_1)
summary(model_b_1)
##
## Call:
## lm(formula = Y ~ X, data = data_a_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2073 -0.3029 0.0093 0.3033 1.3545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.0218 0.8336 10.82 <2e-16 ***
## X 5.5652 0.4155 13.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared: 0.6468, Adjusted R-squared: 0.6432
## F-statistic: 179.4 on 1 and 98 DF, p-value: < 2.2e-16
# Model MSE (residual variance estimate)
residuals_b_1 <- resid(model_b_1)
MSE_d <- sum(residuals_b_1^2) / (n_1 - 2)
MSE_d
## [1] 0.05005717
set.seed(7052)
sd_eps_c_1 <- 1
X_c_1 <- rnorm(n_1, mean = mu_x_1, sd = sd_x_1)
eps_c_1 <- rnorm(n_1, mean = 0, sd = sd_eps_c_1)
Y_c_1 <- beta0_1 + beta1_1 * X_c_1 + eps_c_1
model_c_1 <- lm(Y_c_1 ~ X_c_1)
summary(model_c_1)
##
## Call:
## lm(formula = Y_c_1 ~ X_c_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5243 -0.6748 0.0723 0.6415 3.2604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4933 1.0029 9.466 <2e-16 ***
## X_c_1 5.2355 0.4979 10.514 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9774 on 398 degrees of freedom
## Multiple R-squared: 0.2174, Adjusted R-squared: 0.2154
## F-statistic: 110.5 on 1 and 398 DF, p-value: < 2.2e-16
residuals_c_1 <- resid(model_c_1)
MSE_e <- sum(residuals_c_1^2) / (n_1 - 2)
MSE_e
## [1] 0.9553077
When the error variance gets smaller, the data points are closer to the true regression line. This makes it easier for the model to find the correct slope.
Comparison (n=100): Looking only at the n=100 models, the “Low” error model (Y ~ X) has a Std. Error for X of 0.4155, while the “High” error model (Y_c ~ X_c) has a Std. Error of 0.8309.
Comparison (n=400): Looking only at the n=400 models, the “Low” error model has a Std. Error for X of 0.2490, while the “High” error model has a Std. Error of 0.4979.
In both sample sizes, smaller error variance leads to a smaller standard error for the parameter estimates. This means the estimates (like the slope for X) are much more precise and reliable. It also leads to a higher R-squared value, as the model can explain more of the variance.
When the sample size increases, the model has more information to learn from. This helps to “average out” the random noise.
Comparison (Low Error Models): When n increases from 100 to 400 (comparing Y ~ X in both sets), the Std. Error for X decreases from 0.4155 to 0.2490.
Comparison (High Error Models): When n increases from 100 to 400 (comparing Y_c ~ X_c in both sets), the Std. Error for X decreases from 0.8309 to 0.4979.
In both error scenarios, a larger sample size leads to a smaller standard error for the parameter estimates. This makes the estimates more precise, even when the underlying data is noisy.
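These comparisons follow the usual formula SE(beta1_hat) = sigma / sqrt(Sxx), where Sxx = sum((x - mean(x))^2) grows roughly in proportion to n. The sketch below plugs the true sigma into this formula for each simulated X vector, so the results track (but do not exactly match) the fitted standard errors, which use the estimated sigma instead.
se_theory <- function(x, sigma) sigma / sqrt(sum((x - mean(x))^2))
se_theory(X,     0.5)  # n = 100, sigma = 0.5; compare with the fitted SE 0.4155
se_theory(X_c,   1.0)  # n = 100, sigma = 1;   compare with 0.8309
se_theory(X_1,   0.5)  # n = 400, sigma = 0.5; compare with 0.2490
se_theory(X_c_1, 1.0)  # n = 400, sigma = 1;   compare with 0.4979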
# Question 1e — Summary table
summary_df <- data.frame(
Scenario = c("n=100, σ=0.5", "n=100, σ=1", "n=400, σ=0.5", "n=400, σ=1"),
MSE = c(MSE_b, MSE_c, MSE_d, MSE_e)
)
summary_df
## Scenario MSE
## 1 n=100, σ=0.5 0.20329342
## 2 n=100, σ=1 0.81317370
## 3 n=400, σ=0.5 0.05005717
## 4 n=400, σ=1 0.95530770
Increasing the sample size reduces standard errors and narrows confidence intervals, while the MSE stays close to the true error variance (σ²) rather than shrinking with n.
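A small Monte Carlo sketch (extra illustration, not part of the graded output; mse_once is a helper name introduced here) makes that last point concrete: averaged over many re-simulations, the MSE centres on the true error variance σ² regardless of the sample size.
mse_once <- function(n, sd_eps) {
  x <- rnorm(n, mean = 2, sd = 0.1)
  y <- 10 + 5 * x + rnorm(n, mean = 0, sd = sd_eps)
  sum(resid(lm(y ~ x))^2) / (n - 2)
}
set.seed(1)  # arbitrary seed for this illustration
mean(replicate(500, mse_once(100, 0.5)))  # close to 0.25
mean(replicate(500, mse_once(400, 1)))    # close to 1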
The findings clearly show how both error variance and sample size influence a regression model’s accuracy and reliability.
Smaller Error Variance
With low error variance, the data points cluster more tightly around the true regression line, so the model can estimate the relationship between X and Y more precisely.
At n = 100: Decreasing the error variance halved the standard error of the slope from 0.8309 to 0.4155, giving a sharper estimate. Likewise, the model's R² increased from 0.3571 to 0.6468, meaning the model explained much more of the variation in Y.
At n = 400: The pattern persists. Lower error variance reduced the SE of the slope from 0.4979 to 0.2490 and increased R² from 0.2174 to 0.5149.
In other words, smaller error variance yields cleaner, more precise parameter estimates and improves the model's fit to the data.
Larger Sample Size
With a larger sample, the model has more information to work with, so the random noise tends to average out. This makes the estimates more stable and closer to the true population values.
Low error: The standard error of the slope decreased from 0.4155 to 0.2490 when the sample size increased from 100 to 400, clearly improving the precision.
High error: Even under noisier conditions, a larger sample made a marked difference, with the standard error declining from 0.8309 to 0.4979. In general, increasing the sample size improves the precision of the regression estimates regardless of the noise level in the data.
| Scenario | Sample Size (n) | Error Variance | Residual Std. Error (s) | Slope Estimate (b₁) | Std. Error of Slope (s_b₁) | R-squared |
|---|---|---|---|---|---|---|
| 1 | 100 | High | 0.9018 | 6.1303 | 0.8309 | 0.3571 |
| 2 | 100 | Low | 0.4509 | 5.5652 | 0.4155 | 0.6468 |
| 3 | 400 | High | 0.9774 | 5.2355 | 0.4979 | 0.2174 |
| 4 | 400 | Low | 0.4887 | 5.1177 | 0.2490 | 0.5149 |
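The table above can also be assembled directly from the four fitted model objects; extract_fit() is a small helper written here for illustration only.
extract_fit <- function(fit) {
  s <- summary(fit)
  c(resid_SE = s$sigma,
    slope    = coef(s)[2, "Estimate"],
    slope_SE = coef(s)[2, "Std. Error"],
    R_sq     = s$r.squared)
}
rbind(
  "n=100, high error" = extract_fit(model_c),
  "n=100, low error"  = extract_fit(model_b),
  "n=400, high error" = extract_fit(model_c_1),
  "n=400, low error"  = extract_fit(model_b_1)
)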
The reliability and precision of a linear regression model are directly tied to both the quality and the quantity of the data.
Low error variance is important for a good model fit: it makes the relationship between the variables clearer and leads to more precise coefficient estimates.
A large sample size is important for statistical confidence: it reduces the uncertainty in the parameter estimates, making them more reliable.
The best conditions (low error variance and a large sample size) produced the most accurate and reliable model (Scenario 4), while the worst conditions (high error variance and a small sample size) produced the least reliable model (Scenario 1).