Question 1 (a): Simulate the data (n = 100, σ = 0.5)

# Question 1a
set.seed(7052)

# Generate data: X ~ N(mean = 2, sd = 0.1), eps ~ N(mean = 0, sd = 0.5), n = 100
n <- 100
mu_x <- 2
sd_x <- 0.1
beta0 <- 10
beta1 <- 5
sd_eps <- 0.5

X <- rnorm(n, mean = mu_x, sd = sd_x)
eps <- rnorm(n, mean = 0, sd = sd_eps)
Y <- beta0 + beta1 * X + eps

data_a <- data.frame(X, Y)
head(data_a)
##          X        Y
## 1 1.907630 20.46643
## 2 1.949162 20.08051
## 3 2.029797 20.32815
## 4 2.101782 20.39166
## 5 2.000072 20.16090
## 6 1.926412 19.51731
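
As a quick sanity check (a sketch, not part of the question), the simulated values can be compared against the target parameters: mean 2 and sd 0.1 for X, and sd 0.5 for the errors.

# Sanity check: compare simulated moments to the target parameters
c(mean_X = mean(X), sd_X = sd(X), sd_eps = sd(eps))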

Question 1 (b): Fit the simple linear regression and compute MSE

# Question 1b
model_b <- lm(Y ~ X, data = data_a)
summary(model_b)
## 
## Call:
## lm(formula = Y ~ X, data = data_a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2073 -0.3029  0.0093  0.3033  1.3545 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.0218     0.8336   10.82   <2e-16 ***
## X             5.5652     0.4155   13.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared:  0.6468, Adjusted R-squared:  0.6432 
## F-statistic: 179.4 on 1 and 98 DF,  p-value: < 2.2e-16
# Model MSE (residual variance estimate)
residuals_b <- resid(model_b)
MSE_b <- sum(residuals_b^2) / (n - 2)
MSE_b
## [1] 0.2032934
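
As a cross-check, the same estimate can be read off the fitted model object: sigma() returns the residual standard error reported above (0.4509), so squaring it reproduces the MSE computed from the residuals.

# Cross-check: residual standard error from the fit, squared, equals the MSE
sigma(model_b)^2
summary(model_b)$sigma^2   # equivalent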

Estimated regression equation:

Ŷ = 9.0218 + 5.5652 X

This means that for each one-unit increase in X, the predicted Y increases by approximately 5.57 units.
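
For example, at X = 2 the fitted line gives Ŷ = 9.0218 + 5.5652 × 2 ≈ 20.15; the same prediction can be obtained with predict():

# Predicted mean response at X = 2 from the fitted model
predict(model_b, newdata = data.frame(X = 2))
# By hand: 9.0218 + 5.5652 * 2 = 20.1522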

Term          Estimate   Std. Error   t value   p-value
(Intercept)   9.0218     0.8336       10.82     < 2e-16
X             5.5652     0.4155       13.39     < 2e-16

Null and alternative hypotheses

Null hypothesis: β₀ = 0    Alternative hypothesis: β₀ ≠ 0

Since the p-value for the intercept (< 2e-16) is far below the 0.05 significance level, we reject the null hypothesis and conclude that the intercept is statistically significant. The same test for the slope (p < 2e-16) shows that X is a significant predictor of Y.
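
The same conclusion follows from 95% confidence intervals for the coefficients; a quick check:

# 95% confidence intervals; neither interval contains 0,
# consistent with rejecting H0 at the 5% level
confint(model_b, level = 0.95)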

Model’s MSE = 0.2033

Question 1 (c): Repeat (b) with larger error variance σ = 1

# Question 1c
set.seed(7052)

sd_eps_c <- 1
X_c <- rnorm(n, mean = mu_x, sd = sd_x)
eps_c <- rnorm(n, mean = 0, sd = sd_eps_c)
Y_c <- beta0 + beta1 * X_c + eps_c

model_c <- lm(Y_c ~ X_c)
summary(model_c)
## 
## Call:
## lm(formula = Y_c ~ X_c)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4146 -0.6058  0.0186  0.6066  2.7090 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.0436     1.6673   4.824 5.16e-06 ***
## X_c           6.1303     0.8309   7.378 5.25e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9018 on 98 degrees of freedom
## Multiple R-squared:  0.3571, Adjusted R-squared:  0.3505 
## F-statistic: 54.43 on 1 and 98 DF,  p-value: 5.253e-11
residuals_c <- resid(model_c)
MSE_c <- sum(residuals_c^2) / (n - 2)
MSE_c
## [1] 0.8131737
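
To see the effect of the larger error variance, the two simulated datasets and their fitted lines can be plotted side by side (a sketch using base R graphics, not required by the question):

# Side-by-side comparison of the sigma = 0.5 and sigma = 1 simulations
par(mfrow = c(1, 2))
plot(X, Y, pch = 19, main = "sigma = 0.5"); abline(model_b, col = "red", lwd = 2)
plot(X_c, Y_c, pch = 19, main = "sigma = 1"); abline(model_c, col = "red", lwd = 2)
par(mfrow = c(1, 1))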

Question 1 (d): Repeat (a–c) with larger n = 400

set.seed(7052)

# Generate data: X ~ N(mean = 2, sd = 0.1), eps ~ N(mean = 0, sd = 0.5), n = 400
n_1 <- 400
mu_x_1 <- 2
sd_x_1 <- 0.1
beta0_1 <- 10
beta1_1 <- 5
sd_eps_1 <- 0.5

X_1 <- rnorm(n_1, mean = mu_x_1, sd = sd_x_1)
eps_1 <- rnorm(n_1, mean = 0, sd = sd_eps_1)
Y_1 <- beta0_1 + beta1_1 * X_1 + eps_1

data_a_1 <- data.frame(X_1, Y_1)
head(data_a_1)
##        X_1      Y_1
## 1 1.907630 18.94756
## 2 1.949162 19.78539
## 3 2.029797 20.37686
## 4 2.101782 21.28714
## 5 2.000072 19.61935
## 6 1.926412 19.60554
# Question 1d: refit the sigma = 0.5 model with n = 400
model_b_1 <- lm(Y_1 ~ X_1, data = data_a_1)
summary(model_b_1)
## 
## Call:
## lm(formula = Y ~ X, data = data_a_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2073 -0.3029  0.0093  0.3033  1.3545 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.0218     0.8336   10.82   <2e-16 ***
## X             5.5652     0.4155   13.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared:  0.6468, Adjusted R-squared:  0.6432 
## F-statistic: 179.4 on 1 and 98 DF,  p-value: < 2.2e-16
# Model MSE (residual variance estimate)
residuals_b_1 <- resid(model_b_1)
MSE_d <- sum(residuals_b_1^2) / (n_1 - 2)
MSE_d
## [1] 0.05005717
set.seed(7052)

sd_eps_c_1 <- 1
X_c_1 <- rnorm(n_1, mean = mu_x_1, sd = sd_x_1)
eps_c_1 <- rnorm(n_1, mean = 0, sd = sd_eps_c_1)
Y_c_1 <- beta0_1 + beta1_1 * X_c_1 + eps_c_1

model_c_1 <- lm(Y_c_1 ~ X_c_1)
summary(model_c_1)
## 
## Call:
## lm(formula = Y_c_1 ~ X_c_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5243 -0.6748  0.0723  0.6415  3.2604 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.4933     1.0029   9.466   <2e-16 ***
## X_c_1         5.2355     0.4979  10.514   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9774 on 398 degrees of freedom
## Multiple R-squared:  0.2174, Adjusted R-squared:  0.2154 
## F-statistic: 110.5 on 1 and 398 DF,  p-value: < 2.2e-16
residuals_c_1 <- resid(model_c_1)
MSE_e <- sum(residuals_c_1^2) / (n_1 - 2)
MSE_e
## [1] 0.9553077
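
Before comparing the scenarios, the four slope estimates can be collected in one place and set against the true β₁ = 5 (a small sketch using the fitted objects above):

# Slope estimates from all four fits (true beta1 = 5)
c(n100_sd0.5 = unname(coef(model_b)[2]),
  n100_sd1   = unname(coef(model_c)[2]),
  n400_sd0.5 = unname(coef(model_b_1)[2]),
  n400_sd1   = unname(coef(model_c_1)[2]))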

What is the effect on the model parameter estimates when error variance gets smaller?

When the error variance gets smaller, the data points are closer to the true regression line. This makes it easier for the model to find the correct slope.

Comparison (n=100): Looking only at the n=100 models, the “Low” error model (Y ~ X) has a Std. Error for X of 0.4155, while the “High” error model (Y_c ~ X_c) has a Std. Error of 0.8309.

Comparison (n=400): Looking only at the n=400 models, the “Low” error model has a Std. Error for X of 0.2490, while the “High” error model has a Std. Error of 0.4979.

In both sample sizes, smaller error variance leads to a smaller standard error for the parameter estimates. This means the estimates (like the slope for X) are much more precise and reliable. It also leads to a higher R-squared value, as the model can explain more of the variance.

What is the effect on the model parameter estimates when sample size gets bigger?

When the sample size increases, the model has more information to learn from. This helps to “average out” the random noise.

Comparison (Low Error Models): When n increases from 100 to 400 (comparing Y ~ X in both sets), the Std. Error for X decreases from 0.4155 to 0.2490.

Comparison (High Error Models): When n increases from 100 to 400 (comparing Y_c ~ X_c in both sets), the Std. Error for X decreases from 0.8309 to 0.4979.

In both error scenarios, a larger sample size leads to a smaller standard error for the parameter estimates. This makes the estimates more precise, even when the underlying data are noisy.
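
These patterns match the formula for the standard error of the slope, SE(β̂₁) = σ / √(Σ(xᵢ − x̄)²): halving σ halves the standard error, and quadrupling n roughly halves it as well, since Σ(xᵢ − x̄)² grows in proportion to n. A quick numerical sketch using the true σ values (the summaries use the estimated σ̂ instead, so the numbers differ slightly):

# Theoretical SE of the slope using the true error sd
se_slope <- function(x, sigma) sigma / sqrt(sum((x - mean(x))^2))
se_slope(X,     0.5)   # n = 100, sigma = 0.5
se_slope(X_c,   1.0)   # n = 100, sigma = 1
se_slope(X_1,   0.5)   # n = 400, sigma = 0.5
se_slope(X_c_1, 1.0)   # n = 400, sigma = 1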

Question 1 (e): Compare MSEs and summarize

# Question 1e — Summary table
summary_df <- data.frame(
  Scenario = c("n=100, σ=0.5", "n=100, σ=1", "n=400, σ=0.5", "n=400, σ=1"),
  MSE = c(MSE_b, MSE_c, MSE_d, MSE_e)
)
summary_df
##       Scenario        MSE
## 1 n=100, σ=0.5 0.20329342
## 2   n=100, σ=1 0.81317370
## 3 n=400, σ=0.5 0.05005717
## 4   n=400, σ=1 0.95530770

Increasing the sample size reduces the standard errors of the coefficient estimates and narrows their confidence intervals; the MSE itself does not shrink with n, because it estimates the error variance and therefore stays close to the true σ².
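
A direct comparison makes this concrete; a sketch placing each MSE next to the error variance it estimates:

# Compare each estimated MSE with the true error variance sigma^2
cbind(summary_df, True_sigma2 = c(0.25, 1, 0.25, 1))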

Write a brief report (no more than two pages). A table may be helpful to summarize your results.

Results Summary

The findings clearly show how both error variance and sample size influence a regression model’s accuracy and reliability.

Smaller Error Variance

With low error variance, data points cluster more tightly around the true regression line. It therefore becomes easier for the model to estimate the relationship between X and Y with better precision.

  • At n = 100: Decreasing the error variance halved the standard error of the slope from 0.8309 to 0.4155, giving a sharper estimate. Similarly, R² increased from 0.3571 to 0.6468, indicating that the model explained much more of the variation in Y.

  • At n = 400: The pattern persists. Lower error variance reduced the SE of the slope from 0.4979 to 0.2490 and increased R² from 0.2174 to 0.5149.

In other words, smaller error variance yields cleaner, more precise parameter estimates and improves the model's fit to the data.

Larger Sample Size

With a larger sample, the model has more to work with, so random noise has a chance of “averaging out.” In this way, estimates become much more stable and closer to the true population values.

  • With Low Error: The standard error of the slope decreased from 0.4155 to 0.2490 when sample size increased from 100 to 400, which clearly improved the precision.

  • With High Error: Even under noisier conditions, a larger sample made a marked difference, with the standard error declining from 0.8309 to 0.4979. In general, the precision of the regression estimates increases with sample size regardless of the noise level in the data.

Table: Simple Linear Regression Simulation Results

Scenario   Sample Size (n)   Error Variance   Residual Std. Error (s)   Slope Estimate (β̂₁)   Std. Error of X (s_b₁)   R-squared
1          100               High             0.9018                    6.1303                 0.8309                   0.3571
2          100               Low              0.4509                    5.5652                 0.4155                   0.6468
3          400               High             0.9774                    5.2355                 0.4979                   0.2174
4          400               Low              0.4887                    5.1177                 0.2490                   0.5149
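
For reproducibility, the same table can be assembled programmatically; a sketch that hard-codes the values reported in the four model summaries above (applying broom::tidy() to each model would be an alternative):

# Report table built from the values in the model summaries above
report_tbl <- data.frame(
  Scenario      = 1:4,
  n             = c(100, 100, 400, 400),
  ErrorVariance = c("High", "Low", "High", "Low"),
  Resid_SE      = c(0.9018, 0.4509, 0.9774, 0.4887),
  Slope         = c(6.1303, 5.5652, 5.2355, 5.1177),
  SE_slope      = c(0.8309, 0.4155, 0.4979, 0.2490),
  R_squared     = c(0.3571, 0.6468, 0.2174, 0.5149)
)
report_tbl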

The reliability and precision of a linear regression model are fundamentally linked to the quality and quantity of the data.

Low error variance is important for good model fit: it makes the relationship between the variables clearer and leads to more precise coefficient estimates.

A large sample size is important for statistical confidence: it reduces the uncertainty in the parameter estimates, making them more reliable.

The best conditions, low error variance combined with a large sample size, produced the most accurate and reliable model (Scenario 4). The worst conditions, high error variance combined with a small sample size, led to the least reliable model (Scenario 1).