Question 1. (5 points) Simulation Study (Multiple Linear Regression). Assume mean function E(Y|X) = 10 + 5X1 − 10X2

a. Generate data with X1∼N(μ=2,σ=0.1), X2∼N(μ=0,σ=0.4), sample size n=100, and error term ϵ∼N(μ=0,σ=0.5).

set.seed(7052)  # fix the RNG seed for reproducibility

# Predictors and error term as specified in part a
X1 <- rnorm(100, mean = 2, sd = 0.1)
X2 <- rnorm(100, mean = 0, sd = 0.4)
epsilon <- rnorm(100, mean = 0, sd = 0.5)

# Response from the mean function plus noise
Y <- 10 + 5 * X1 - 10 * X2 + epsilon

b. Fit a multiple linear regression to the simulated data from part a. What is the estimated prediction equation? Report the estimated coefficients and their standard errors. Are they significant? Clearly write out the null and alternative hypotheses, observed t-statistic(s), p-value(s), and interpret the estimates and test results. What is the fitted model's MSE?

model_1 <- lm(Y ~ X1 + X2)
summary(model_1)
## 
## Call:
## lm(formula = Y ~ X1 + X2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.89171 -0.37736  0.02165  0.31169  1.20370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.8320     0.8603  13.753  < 2e-16 ***
## X1            4.1160     0.4298   9.577  1.1e-15 ***
## X2           -9.8873     0.1294 -76.405  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4621 on 97 degrees of freedom
## Multiple R-squared:  0.9837, Adjusted R-squared:  0.9833 
## F-statistic:  2919 on 2 and 97 DF,  p-value: < 2.2e-16

Null hypothesis: H0: βj = 0, i.e., the coefficient is zero and the corresponding term has no linear effect on Y.

Alternative hypothesis: Ha: βj ≠ 0, i.e., the coefficient is nonzero and the term is significant.

The estimated regression equation is:

Ŷ = 11.8320 + 4.1160 * X1 − 9.8873 * X2
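
As a quick illustration (not required by the question), the fitted equation can be evaluated with predict(); the input point X1 = 2, X2 = 0 is an arbitrary choice for the example.

predict(model_1, newdata = data.frame(X1 = 2, X2 = 0))
# equivalently, by hand: 11.8320 + 4.1160 * 2 - 9.8873 * 0  (about 20.06)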

Estimated coefficients with their standard errors:

Intercept: 11.8320 with a standard error of 0.8603

X1: 4.1160 with a standard error of 0.4298

X2: −9.8873 with a standard error of 0.1294

t-statistics and p-values:

Intercept: t-value = 13.753, p-value < 2e-16

X1: t-value = 9.577, p-value = 1.1e-15

X2: t-value = −76.405, p-value < 2e-16

Since every p-value is below the 0.05 significance level, we reject the null hypothesis for each coefficient and conclude that all three are statistically significant.

The model fits the data well, with an adjusted R-squared of 0.9833.
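
As a sanity check (a sketch, not part of the required answer), each t-statistic is simply the estimate divided by its standard error, and each p-value is the two-sided tail area of a t-distribution with n − 3 = 97 degrees of freedom; the values below should reproduce the summary() table.

# Recompute the t-statistics and p-values by hand from the coefficient table
coefs <- coef(summary(model_1))  # columns: Estimate, Std. Error, t value, Pr(>|t|)
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
p_manual <- 2 * pt(-abs(t_manual), df = df.residual(model_1))
cbind(t_manual, p_manual)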

mse_1 <- mean(model_1$residuals^2)  # MSE computed as the mean of squared residuals (SSE/n)
mse_1
## [1] 0.2071171

The MSE of the fitted model is 0.2071.
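
One caveat on terminology: mean(residuals^2) is SSE/n, while the "MSE" in many regression texts is SSE/(n − p), the unbiased estimate of the error variance (the square of the residual standard error reported by summary()). A short sketch showing both conventions for model_1:

sse <- sum(residuals(model_1)^2)
sse / length(residuals(model_1))  # SSE/n = 0.2071, matches mse_1
sse / df.residual(model_1)        # SSE/(n - 3) = sigma(model_1)^2, about 0.4621^2 = 0.2135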

c. Repeat part b), but re-simulate the data and change the error term to ϵ∼N(0, σ=1).

set.seed(7052)

# Same predictor distributions as part a; error sd increased from 0.5 to 1
X1 <- rnorm(100, mean = 2, sd = 0.1)
X2 <- rnorm(100, mean = 0, sd = 0.4)
epsilon <- rnorm(100, mean = 0, sd = 1)

Y <- 10 + 5 * X1 - 10 * X2 + epsilon

model_2 <- lm(Y ~ X1 + X2)
summary(model_2)
## 
## Call:
## lm(formula = Y ~ X1 + X2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.78342 -0.75472  0.04329  0.62338  2.40739 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.6640     1.7207   7.941 3.58e-12 ***
## X1            3.2320     0.8596   3.760  0.00029 ***
## X2           -9.7746     0.2588 -37.767  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9242 on 97 degrees of freedom
## Multiple R-squared:  0.9364, Adjusted R-squared:  0.9351 
## F-statistic: 714.1 on 2 and 97 DF,  p-value: < 2.2e-16

Null hypothesis: H0: βj = 0, i.e., the coefficient is zero and the corresponding term has no linear effect on Y.

Alternative hypothesis: Ha: βj ≠ 0, i.e., the coefficient is nonzero and the term is significant.

The estimated regression equation is:

Ŷ = 13.6640 + 3.2320 * X1 − 9.7746 * X2

Estimated coefficients with their standard errors:

Intercept: 13.6640 with a standard error of 1.7207

X1: 3.2320 with a standard error of 0.8596

X2: -9.7746 with a standard error of 0.2588

t-statistics and p-values:

Intercept: t-statistic = 7.941, p-value = 3.58e-12

X1: t-statistic = 3.760, p-value = 0.00029

X2: t-statistic = −37.767, p-value < 2e-16

Since the p-values for all coefficients are below the 0.05 significance level, we reject the null hypothesis and conclude that every coefficient is significant.

The adjusted R-squared is 0.9351, lower than in part b because the noisier errors leave more of the variation in Y unexplained.

mse_2 <- mean(model_2$residuals^2)  # MSE as SSE/n, as in part b
mse_2
## [1] 0.8284685

The MSE for the fitted model is 0.8285, about four times the MSE from part b.
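
That factor of four is expected: doubling σ from 0.5 to 1 quadruples the error variance, and because the same seed reuses the same underlying normal draws, the residuals double exactly. A quick check (assuming mse_1 from part b is still in the workspace):

mse_2 / mse_1  # exactly 4 here: 0.8284685 / 0.2071171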

d. Repeat parts a)–c) using n=400. What do you conclude? What is the effect to the model parameter estimates when the error variance gets smaller? What is the effect when the sample size gets bigger?

set.seed(7052)

# Repeat part a with n = 400 (error sd = 0.5)
X1 <- rnorm(400, mean = 2, sd = 0.1)
X2 <- rnorm(400, mean = 0, sd = 0.4)
epsilon <- rnorm(400, mean = 0, sd = 0.5)

Y <- 10 + 5 * X1 - 10 * X2 + epsilon

model_3 <- lm(Y ~ X1 + X2)
summary(model_3)
## 
## Call:
## lm(formula = Y ~ X1 + X2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.85547 -0.36374  0.00169  0.32801  1.31769 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.69826    0.52106   20.53   <2e-16 ***
## X1           4.66233    0.25869   18.02   <2e-16 ***
## X2          -9.95779    0.06508 -153.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5076 on 397 degrees of freedom
## Multiple R-squared:  0.9835, Adjusted R-squared:  0.9834 
## F-statistic: 1.181e+04 on 2 and 397 DF,  p-value: < 2.2e-16

set.seed(7052)

# Repeat part c with n = 400 (error sd = 1)
X1 <- rnorm(400, mean = 2, sd = 0.1)
X2 <- rnorm(400, mean = 0, sd = 0.4)
epsilon <- rnorm(400, mean = 0, sd = 1)

Y <- 10 + 5 * X1 - 10 * X2 + epsilon

model_4 <- lm(Y ~ X1 + X2)
summary(model_4)
## 
## Call:
## lm(formula = Y ~ X1 + X2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7109 -0.7275  0.0034  0.6560  2.6354 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3965     1.0421  10.936  < 2e-16 ***
## X1            4.3247     0.5174   8.359 1.08e-15 ***
## X2           -9.9156     0.1302 -76.175  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.015 on 397 degrees of freedom
## Multiple R-squared:  0.9364, Adjusted R-squared:  0.9361 
## F-statistic:  2923 on 2 and 397 DF,  p-value: < 2.2e-16

As the error variance gets smaller, the coefficient estimates become more precise: their standard errors shrink, their t-statistics grow, and the tests become more significant.

Likewise, a larger sample size reduces the standard errors and improves the precision of the estimates, which enhances the ability to detect significant predictors; the standard errors shrink roughly like 1/√n, as the sketch below illustrates.
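
A small Monte Carlo sketch (not part of the assignment) makes that shrinkage visible; se_for_n is a hypothetical helper that averages the standard error of the X1 coefficient over repeated simulations.

# Average standard error of the X1 coefficient over 'reps' simulated
# datasets of size n, with the error sd fixed at 'sigma'
se_for_n <- function(n, reps = 200, sigma = 0.5) {
  ses <- replicate(reps, {
    X1 <- rnorm(n, mean = 2, sd = 0.1)
    X2 <- rnorm(n, mean = 0, sd = 0.4)
    Y <- 10 + 5 * X1 - 10 * X2 + rnorm(n, mean = 0, sd = sigma)
    coef(summary(lm(Y ~ X1 + X2)))["X1", "Std. Error"]
  })
  mean(ses)
}

se_for_n(100)  # roughly twice se_for_n(400), since SE shrinks like 1/sqrt(n)
se_for_n(400)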

e. What about the MSE from each model?

# MSE for model_3 (n = 400, sigma = 0.5)
mse_3 <- mean(model_3$residuals^2)
mse_3
## [1] 0.2557567
# MSE for model_4 (n = 400, sigma = 1)
mse_4 <- mean(model_4$residuals^2)
mse_4
## [1] 1.023027

The MSE for model_3 (n = 400, σ = 0.5) is 0.2558. This indicates a relatively better fit, since the errors are less dispersed around the predicted values.

The MSE for model_4 (n = 400, σ = 1) is 1.0230. This reflects the increased variability in the error term, which makes accurate prediction harder.

For both models, the coefficients for X1 and X2 were highly statistically significant, with p-values close to 0. The t-statistics were large, suggesting that both predictors contribute meaningfully to the response variable Y.

Smaller error variance (σ = 0.5) produced smaller standard errors and therefore more precise coefficient estimates, a lower MSE, and larger t-statistics, which increased the statistical significance of the model. In short, the model fits the data better when the error variability is smaller.

Larger error variance (σ = 1) led to larger standard errors, less precise coefficient estimates, and a higher MSE, indicating a poorer fit and less reliable predictions, although the p-values remained small because of the large sample size.

A larger sample size (n = 400) improved the precision of the coefficient estimates, reducing standard errors and increasing t-statistics, which made it easier to reject the null hypothesis that the coefficients are zero. The larger sample helped detect statistical significance even with the larger error variance (σ = 1), a direct consequence of the increased power of the tests: larger samples reduce the variance of the coefficient estimates. The summary table sketched below collects these quantities for all four models.
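
To collect the comparisons in one place (a sketch, assuming model_1 through model_4 are still in the workspace):

# n, error sd, standard error of the X1 coefficient, and in-sample MSE
models <- list(model_1, model_2, model_3, model_4)
data.frame(
  n     = sapply(models, function(m) length(residuals(m))),
  sigma = c(0.5, 1, 0.5, 1),
  se_X1 = sapply(models, function(m) coef(summary(m))["X1", "Std. Error"]),
  mse   = sapply(models, function(m) mean(residuals(m)^2))
)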