set.seed(7052)
X1 <- rnorm(100, mean = 2, sd = 0.1)
X2 <- rnorm(100, mean = 0, sd = 0.4)
epsilon <- rnorm(100, mean = 0, sd = 0.5)
Y <- 10 + 5 * X1 - 10*X2 + epsilon
model_1 <- lm(Y ~ X1 + X2)
summary(model_1)
##
## Call:
## lm(formula = Y ~ X1 + X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89171 -0.37736 0.02165 0.31169 1.20370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.8320 0.8603 13.753 < 2e-16 ***
## X1 4.1160 0.4298 9.577 1.1e-15 ***
## X2 -9.8873 0.1294 -76.405 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4621 on 97 degrees of freedom
## Multiple R-squared: 0.9837, Adjusted R-squared: 0.9833
## F-statistic: 2919 on 2 and 97 DF, p-value: < 2.2e-16
Null Hypothesis: the coefficient is equal to zero (the predictor has no effect on Y).
Alternative Hypothesis: the coefficient is not equal to zero (the predictor has a statistically significant effect on Y).
The estimated regression equation is:
Y = 11.8320 + 4.1160 * X1 - 9.8873 * X2
Estimated coefficients with their standard errors:
Intercept: 11.8320 with a standard error of 0.8603
X1: 4.1160 with a standard error of 0.4298
X2: −9.8873 with a standard error of 0.1294
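As a quick illustration (not part of the original analysis), the fitted equation can be evaluated with predict(); the predictor values below are arbitrary example inputs.
# Illustrative only: plug an arbitrary example point into the fitted model.
new_point <- data.frame(X1 = 2, X2 = 0)
predict(model_1, newdata = new_point)
# Roughly 11.83 + 4.12 * 2 - 9.89 * 0, i.e. about 20.1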
Since each p-value is below the significance level of 0.05, we reject the null hypothesis for every coefficient and conclude that each is statistically significant.
Intercept: t-value = 13.753, p-value < 2e-16
X1: t-value = 9.577, p-value = 1.1e-15
X2: t-value = -76.405, p-value < 2e-16
With an adjusted R-squared of 0.9833, the model fits the data very well.
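For reference, these quantities can also be read programmatically from the summary object; a minimal sketch (illustrative, not part of the original output):
# The coefficient table holds the estimates, standard errors, t values and p values.
coef_table <- summary(model_1)$coefficients
coef_table
# Each t value is the estimate divided by its standard error,
# e.g. for X1: 4.1160 / 0.4298 is roughly 9.58.
coef_table[, "Estimate"] / coef_table[, "Std. Error"]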
mse_1 <- mean(model_1$residuals^2)
mse_1
## [1] 0.2071171
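The MSE for this fitted model is 0.2071171. As an illustrative side note (not part of the original output), this in-sample MSE divides the residual sum of squares by n = 100, while the residual standard error reported above divides it by the 97 residual degrees of freedom, so the two are directly related:
# Illustrative check: MSE = RSE^2 * (residual df) / n
sigma(model_1)^2 * df.residual(model_1) / length(residuals(model_1))
# 0.4621^2 * 97 / 100 is roughly 0.2071, matching mse_1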
set.seed(7052)
X1 <- rnorm(100, mean = 2, sd = 0.1)
X2 <- rnorm(100, mean = 0, sd = 0.4)
epsilon <- rnorm(100, mean = 0, sd = 1)
Y <- 10 + 5 * X1 - 10*X2 + epsilon
model_2 <- lm(Y ~ X1 + X2)
summary(model_2)
##
## Call:
## lm(formula = Y ~ X1 + X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.78342 -0.75472 0.04329 0.62338 2.40739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.6640 1.7207 7.941 3.58e-12 ***
## X1 3.2320 0.8596 3.760 0.00029 ***
## X2 -9.7746 0.2588 -37.767 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9242 on 97 degrees of freedom
## Multiple R-squared: 0.9364, Adjusted R-squared: 0.9351
## F-statistic: 714.1 on 2 and 97 DF, p-value: < 2.2e-16
Null Hypothesis: the coefficient is equal to zero (the predictor has no effect on Y).
Alternative Hypothesis: the coefficient is not equal to zero (the predictor has a statistically significant effect on Y).
The estimated regression equation is:
Y = 13.6640 + 3.2320 * X1 - 9.7746 * X2
Estimated coefficients with their standard errors:
Intercept: 13.6640 with a standard error of 1.7207
X1: 3.2320 with a standard error of 0.8596
X2: -9.7746 with a standard error of 0.2588
t-statistics and p-values:
Intercept: t-statistic = 7.941, p-value = 3.58e-12
X1: t-statistic = 3.760, p-value = 0.00029
X2: t-statistic = -37.767, p-value < 2e-16
Since the p-values for all coefficients are less than the significance level of 0.05, we reject the null hypothesis and conclude that each coefficient is statistically significant.
The adjusted R-squared is 0.9351, lower than the 0.9833 from the first model because the error standard deviation is larger.
mse_2 <- mean(model_2$residuals^2)
mse_2
## [1] 0.8284685
The MSE for this fitted model is 0.8284685, about four times the MSE of the first model, which is consistent with doubling the error standard deviation from 0.5 to 1.
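A quick illustrative check (not part of the original output): because the two simulations share the same seed and predictors and only the error standard deviation changes, the coefficient standard errors double between the first and second fits:
# Illustrative comparison: with the same seed and predictors, doubling the
# error sd from 0.5 to 1 doubles every coefficient standard error.
cbind(se_model_1 = summary(model_1)$coefficients[, "Std. Error"],
      se_model_2 = summary(model_2)$coefficients[, "Std. Error"])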
set.seed(7052)
X1 <- rnorm(400, mean = 2, sd = 0.1)
X2 <- rnorm(400, mean = 0, sd = 0.4)
epsilon <- rnorm(400, mean = 0, sd = 0.5)
Y <- 10 + 5 * X1 - 10*X2 + epsilon
model_3 <- lm(Y ~ X1 + X2)
summary(model_3)
##
## Call:
## lm(formula = Y ~ X1 + X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85547 -0.36374 0.00169 0.32801 1.31769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.69826 0.52106 20.53 <2e-16 ***
## X1 4.66233 0.25869 18.02 <2e-16 ***
## X2 -9.95779 0.06508 -153.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5076 on 397 degrees of freedom
## Multiple R-squared: 0.9835, Adjusted R-squared: 0.9834
## F-statistic: 1.181e+04 on 2 and 397 DF, p-value: < 2.2e-16
set.seed(7052)
X1 <- rnorm(400, mean = 2, sd = 0.1)
X2 <- rnorm(400, mean = 0, sd = 0.4)
epsilon <- rnorm(400, mean = 0, sd = 1)
Y <- 10 + 5 * X1 - 10*X2 + epsilon
model_4 <- lm(Y ~ X1 + X2)
summary(model_4)
##
## Call:
## lm(formula = Y ~ X1 + X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7109 -0.7275 0.0034 0.6560 2.6354
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3965 1.0421 10.936 < 2e-16 ***
## X1 4.3247 0.5174 8.359 1.08e-15 ***
## X2 -9.9156 0.1302 -76.175 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.015 on 397 degrees of freedom
## Multiple R-squared: 0.9364, Adjusted R-squared: 0.9361
## F-statistic: 2923 on 2 and 397 DF, p-value: < 2.2e-16
As the error variance gets smaller, the coefficient estimates become more precise and more strongly significant.
Larger sample sizes also reduce the standard errors and improve the precision of the estimates, which enhances the ability to detect significant predictors.
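One way to see both effects at once is to put the coefficient standard errors from the four fitted models side by side; a minimal sketch (illustrative, not part of the original output):
# Illustrative: coefficient standard errors from each fit.
# model_1/model_3 use error sd 0.5; model_2/model_4 use error sd 1;
# models 3 and 4 use n = 400 instead of n = 100.
se_of <- function(m) summary(m)$coefficients[, "Std. Error"]
sapply(list(model_1 = model_1, model_2 = model_2,
            model_3 = model_3, model_4 = model_4), se_of)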
# MSE for the third model (n = 400, error sd = 0.5)
mse_3 <- mean(model_3$residuals^2)
mse_3
## [1] 0.2557567
# MSE for the fourth model (n = 400, error sd = 1)
mse_4 <- mean(model_4$residuals^2)
mse_4
## [1] 1.023027
The MSE for the third model (n = 400, error sd = 0.5) is 0.2557567. This indicates a relatively better fit, since the errors are less dispersed around the predicted values.
The MSE for the fourth model (n = 400, error sd = 1) is 1.023027. This reflects the increased variability in the error term, which makes it harder for the model to predict accurately.
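For a compact comparison (not in the original output), the four in-sample MSEs can be listed together; each roughly tracks the true error variance used in the corresponding simulation:
# Illustrative side-by-side of the four in-sample MSEs.
# mse_1/mse_3 were generated with error sd 0.5 (variance 0.25),
# mse_2/mse_4 with error sd 1 (variance 1).
c(mse_1 = mse_1, mse_2 = mse_2, mse_3 = mse_3, mse_4 = mse_4)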
In all of the fitted models, the coefficients for X1 and X2 were highly statistically significant, with p-values close to zero. The t-statistics were large, suggesting that both predictors contribute meaningfully to the response variable Y.
A smaller error standard deviation (σ = 0.5) produced smaller standard errors, more precise coefficient estimates, a lower MSE, and larger t-statistics, all of which increased the statistical significance of the coefficients. In short, the model fits the data better when the variability in the errors is smaller.
A larger error standard deviation (σ = 1) led to larger standard errors for the coefficients, reducing the precision of the estimates, although the p-values remained small because of the large sample size. It also produced a higher MSE, indicating a poorer fit; the model becomes less reliable for predicting the outcome variable.
A larger sample size (n = 400) improved the precision of the coefficient estimates, reducing the standard errors and increasing the t-statistics, which made it easier to reject the null hypothesis that a coefficient equals zero. The larger sample helped detect statistical significance even with the larger error standard deviation (σ = 1). This reflects the increased power of the hypothesis tests, since larger samples reduce the variance of the coefficient estimates.