BANA7052 Fall 2025 Homework 2
Eli Bales, Kazuhide Watanabe
November 2, 2025
Tools & Packages
# Load packages
library(dplyr)
library(ggplot2)
library(kableExtra)
library(patchwork)
Problem 1
Simulation Study (Simple Linear Regression). Assume mean function \(E\left(Y|X\right) = 10 + 5 * X\). For this exercise, use set.seed(7052) to ensure reproducibility.
(a) Generate data with \(X \sim N\left(\mu = 2, \sigma = 0.1\right)\), sample size \(n = 100\), and error term \(\epsilon \sim N\left(\mu = 0, \sigma = 0.5\right)\).
# Part (a): simulate n = 100 observations with error SD = 0.5
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
error <- rnorm(n, mean = 0, sd = 0.5)
Y <- 10 + 5*X + error
df <- data.frame(X, Y)
We generated the data by simulating the predictor variable \(X\) and the random error term using the
rnorm() function. The response variable \(Y\) was then constructed according to the
simple linear regression model:
\[ Y = 10 + 5X + \varepsilon, \]
where \(\varepsilon \sim N(\mu = 0, \sigma = 0.5)\).
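As a quick sanity check (illustrative, not required by the prompt), the sample moments of the simulated draws can be compared against the target parameters; they should be close to 2, 0.1, 0, and 0.5, up to sampling noise.
# Sanity check: sample moments of the simulated draws should be close to
# the targets (mu_X = 2, sd_X = 0.1, mu_eps = 0, sd_eps = 0.5)
round(c(mean_X = mean(X), sd_X = sd(X),
        mean_error = mean(error), sd_error = sd(error)), 3)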
(b) Fit a simple linear regression to the simulated data from
part a. What is the estimated prediction equation? Report the estimated
coefficients and their standard errors. Are they significant? Clearly
write out the null and alternative hypotheses, observed t-statistic(s),
p-value(s), and interpret the estimates and test results. What is the fitted model’s MSE?
# Fit a simple linear regression of Y on X
model <- lm(Y ~ X, data = df)
summary_model <- summary(model)
The estimated simple linear regression model is:
\[\hat{Y} = 9.0218 + 5.5652X\]
coef_table <- as.data.frame(summary_model$coefficients)
names(coef_table) <- c("Estimate", "Std Error", "t Value", "p Value")
coef_table$`p Value` <- format(coef_table$`p Value`, scientific = TRUE, digits = 2)
coef_table %>%
kbl(
caption = "Table 1: Estimated Coefficients for Model: Y ~ X (n = 100; Error SD = 0.5)",
align = "c",
digits = 3
) %>%
kable_classic(full_width = F, html_font = "Cambria") %>%
column_spec(2, width = "3cm") %>%
column_spec(3, width = "3cm") %>%
column_spec(4, width = "3cm") %>%
column_spec(5, width = "3cm")
|             | Estimate | Std Error | t Value | p Value |
|---|---|---|---|---|
| (Intercept) | 9.022 | 0.834 | 10.822 | 2.0e-18 |
| X | 5.565 | 0.415 | 13.395 | 7.1e-24 |
The slope coefficient is statistically significant, since its p-value is below 0.05:
\[ X: \; 7.1 \times 10^{-24} < 0.05 \]
For each coefficient \((\beta_i)\), the hypotheses are defined as:
\[ H_0: \beta_i = 0 \quad \text{vs.} \quad H_1: \beta_i \neq 0 \]
(a) \(X\)
\[ t = 13.395, \quad p = 7.1 \times 10^{-24} \]
Reject \(H_0\) → \(X\) is a significant positive predictor of \(Y\).
cat("Mean Squared Error:",(mse <- mean(model$residuals^2)))
## Mean Squared Error: 0.1992276
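Note that mean(model$residuals^2) computes \(SSE/n\); the degrees-of-freedom-adjusted estimate \(SSE/(n - 2)\) used in the discussion below is slightly larger. A minimal comparison from the same fitted model object (illustrative, not part of the prompt):
# Two closely related estimates of sigma^2 from the same fit
sse <- sum(residuals(model)^2)
c(SSE_over_n   = sse / n,         # what is reported above as "MSE"
  SSE_over_n_2 = sse / (n - 2),   # degrees-of-freedom-adjusted MSE
  sigma_sq_hat = sigma(model)^2)  # residual standard error from summary(), squared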
(c) Repeat part b), but re-simulate the data and change the error term to \(\epsilon \sim N\left(\mu = 0, \sigma = 1\right)\).
# Re-simulate with the error standard deviation set to 1
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
error2 <- rnorm(n, mean = 0, sd = 1)
Y2 <- 10 + 5*X + error2
df2 <- data.frame(Y2, X)
# Fit a simple linear regression of Y2 on X
model2 <- lm(Y2 ~ X, data = df2)
summary_model2 <- summary(model2)
The estimated simple linear regression model is:
\[\hat{Y} = 8.0436 + 6.1303X\]
coef_table2 <- as.data.frame(summary_model2$coefficients)
names(coef_table2) <- c("Estimate", "Std Error", "t Value", "p Value")
coef_table2$`p Value` <- format(coef_table2$`p Value`, scientific = TRUE, digits = 2)
coef_table2 %>%
kbl(
caption = "Table 2: Estimated Coefficients for Model: Y ~ X (n = 100; Error SD = 1)",
align = "c",
digits = 3
) %>%
kable_classic(full_width = F, html_font = "Cambria") %>%
column_spec(2, width = "3cm") %>%
column_spec(3, width = "3cm") %>%
column_spec(4, width = "3cm") %>%
column_spec(5, width = "3cm")
|             | Estimate | Std Error | t Value | p Value |
|---|---|---|---|---|
| (Intercept) | 8.044 | 1.667 | 4.824 | 5.2e-06 |
| X | 6.130 | 0.831 | 7.378 | 5.3e-11 |
The slope coefficient is statistically significant, since its p-value is below 0.05:
\[ X: \; 5.3 \times 10^{-11} < 0.05 \]
For each coefficient \((\beta_i)\), the hypotheses are defined as:
\[ H_0: \beta_i = 0 \quad \text{vs.} \quad H_1: \beta_i \neq 0 \]
(a) \(X\)
\[ t = 7.378, \quad p = 5.3 \times 10^{-11} \]
Reject \(H_0\) → \(X\) is a significant positive predictor of \(Y\).
cat("Mean Squared Error:",(mse <- mean(model2$residuals^2)))
## Mean Squared Error: 0.7969102
(d) Repeat parts a)–c) using \(n = 400\). What do you conclude? What is the effect on the model parameter estimates when error variance gets smaller? What is the effect when sample size gets bigger?
# Repeat part (a) with n = 400 and error SD = 0.5
set.seed(7052)
n2 <- 400
X_n <- rnorm(n2, mean = 2, sd = 0.1)
error_n <- rnorm(n2, mean = 0, sd = 0.5)
Y_n <- 10 + 5*X_n + error_n
df_n <- data.frame(Y_n, X_n)
# Fit a simple linear regression of Y_n on X_n
model_n <- lm(Y_n ~ X_n, data = df_n)
summary_model_n <- summary(model_n)
The estimated simple linear regression model is:
\[\hat{Y} = 9.7466 + 5.1177X\]
coef_table_n <- as.data.frame(summary_model_n$coefficients)
names(coef_table_n) <- c("Estimate", "Std Error", "t Value", "p Value")
coef_table_n$`p Value` <- format(coef_table_n$`p Value`, scientific = TRUE, digits = 2)
coef_table_n %>%
kbl(
caption = "Table 3: Estimated Coefficients for Model: Y ~ X (n = 400; Error SD = 0.5)",
align = "c",
digits = 3
) %>%
kable_classic(full_width = F, html_font = "Cambria") %>%
column_spec(2, width = "3cm") %>%
column_spec(3, width = "3cm") %>%
column_spec(4, width = "3cm") %>%
column_spec(5, width = "3cm")
|             | Estimate | Std Error | t Value | p Value |
|---|---|---|---|---|
| (Intercept) | 9.747 | 0.501 | 19.436 | 1.2e-59 |
| X_n | 5.118 | 0.249 | 20.555 | 1.6e-64 |
The slope coefficient is statistically significant, since its p-value is below 0.05:
\[ X: \; 1.6 \times 10^{-64} < 0.05 \]
For each coefficient \((\beta_i)\), the hypotheses are defined as:
\[ H_0: \beta_i = 0 \quad \text{vs.} \quad H_1: \beta_i \neq 0 \]
(a) \(X\)
\[ t = 20.555, \quad p = 1.6 \times 10^{-64} \]
Reject \(H_0\) → \(X\) is a significant positive predictor of \(Y\).
cat("Mean Squared Error:",(mse <- mean(model_n$residuals^2)))
## Mean Squared Error: 0.2376328
# Repeat part (c) with n = 400 and error SD = 1
set.seed(7052)
n2 <- 400
X_n <- rnorm(n2, mean = 2, sd = 0.1)
error2_n <- rnorm(n2, mean = 0, sd = 1)
Y2_n <- 10 + 5*X_n + error2_n
df2_n <- data.frame(Y2_n, X_n)
# Fit a simple linear regression of Y2_n on X_n
model2_n <- lm(Y2_n ~ X_n, data = df2_n)
summary_model2_n <- summary(model2_n)
The estimated simple linear regression model is:
\[\hat{Y} = 9.4933 + 5.2355X\]
coef_table2_n <- as.data.frame(summary_model2_n$coefficients)
names(coef_table2_n) <- c("Estimate", "Std Error", "t Value", "p Value")
coef_table2_n$`p Value` <- format(coef_table2_n$`p Value`, scientific = TRUE, digits = 2)
coef_table2_n %>%
kbl(
caption = "Table 4: Estimated Coefficients for Model: Y ~ X (n = 400; Error SD = 1)",
align = "c",
digits = 3
) %>%
kable_classic(full_width = F, html_font = "Cambria") %>%
column_spec(2, width = "3cm") %>%
column_spec(3, width = "3cm") %>%
column_spec(4, width = "3cm") %>%
column_spec(5, width = "3cm")
|             | Estimate | Std Error | t Value | p Value |
|---|---|---|---|---|
| (Intercept) | 9.493 | 1.003 | 9.466 | 2.6e-19 |
| X_n | 5.235 | 0.498 | 10.514 | 5.6e-23 |
The slope coefficient is statistically significant, since its p-value is below 0.05:
\[ X: \; 5.6 \times 10^{-23} < 0.05 \]
For each coefficient \((\beta_i)\), the hypotheses are defined as:
\[ H_0: \beta_i = 0 \quad \text{vs.} \quad H_1: \beta_i \neq 0 \]
(a) \(X\)
\[ t = 10.514, \quad p = 5.6 \times 10^{-23} \]
Reject \(H_0\) → \(X\) is a significant positive predictor of \(Y\).
cat("Mean Squared Error:",(mse <- mean(model2_n$residuals^2)))
## Mean Squared Error: 0.9505312
As the sample size \(n\) increases, the Mean Squared Error (MSE) values in our results change only slightly (for example, from about 0.199 to 0.238 when the error SD is 0.5), but this does not mean the model is performing worse.
The MSE in this context represents the model’s estimate of the true variance of the residuals:
\[ \widehat{\sigma}^2_{MSE} = \frac{SSE}{n - 2} \]
This value estimates the underlying error variance \(\sigma^2\), not predictive accuracy. (The values printed above use \(SSE/n\), which differs from \(SSE/(n - 2)\) only by the factor \((n-2)/n\), so the comparison is unaffected.) Because \(E[\widehat{\sigma}^2_{MSE}] = \sigma^2\), the expected MSE should remain approximately constant across different values of \(n\).
As \(n\) increases, the regression model has more information with which to estimate both the coefficients and the residual variance. The coefficient estimates become more precise, and the MSE becomes a more consistent estimator of the true error variance.
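A small Monte Carlo sketch (illustrative, beyond the assignment prompt; the replication count of 1000 is an arbitrary choice) makes this concrete: with \(\sigma = 0.5\), the average of \(SSE/(n - 2)\) stays near \(\sigma^2 = 0.25\) at both \(n = 100\) and \(n = 400\), while its run-to-run spread shrinks at the larger sample size.
# Illustrative Monte Carlo: MSE = SSE/(n - 2) centers on sigma^2 = 0.25
# at both sample sizes, but varies less from run to run when n is larger
set.seed(7052)
mse_sim <- function(n_obs, err_sd, reps = 1000) {
  replicate(reps, {
    x <- rnorm(n_obs, mean = 2, sd = 0.1)
    y <- 10 + 5 * x + rnorm(n_obs, mean = 0, sd = err_sd)
    sum(residuals(lm(y ~ x))^2) / (n_obs - 2)
  })
}
mse_100 <- mse_sim(100, 0.5)
mse_400 <- mse_sim(400, 0.5)
round(c(mean_n100 = mean(mse_100), sd_n100 = sd(mse_100),
        mean_n400 = mean(mse_400), sd_n400 = sd(mse_400)), 4)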
The standard deviation of the errors, \(\sigma\), measures the amount of random
noise around the regression line.
Since the MSE estimates \(\sigma^2\),
any change in \(\sigma\) directly
affects the expected MSE.
When \(\sigma\) decreases, the expected MSE decreases as well. This pattern aligns with our observed results: decreasing the error standard deviation from 1.0 to 0.5 lowered the MSE from approximately 0.80–0.95 to around 0.20–0.24.
Conceptually, a smaller \(\sigma\)
means less random noise, so the regression line fits the data more
tightly.
The smaller the \(\sigma\), the smaller
the average squared residuals (MSE), and the more accurate and
precise the model becomes.
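The same kind of sketch (again illustrative, with an arbitrary replication count) shows the scaling with \(\sigma\): doubling \(\sigma\) from 0.5 to 1 roughly quadruples the average MSE, consistent with the jump from about 0.20 to about 0.80 in our single-run results.
# Illustrative: with n fixed at 100, the average MSE scales with sigma^2,
# so doubling sigma should roughly quadruple the MSE
set.seed(7052)
avg_mse <- sapply(c(0.5, 1), function(err_sd) {
  mean(replicate(1000, {
    x <- rnorm(100, mean = 2, sd = 0.1)
    y <- 10 + 5 * x + rnorm(100, mean = 0, sd = err_sd)
    sum(residuals(lm(y ~ x))^2) / (100 - 2)
  }))
})
round(c(sd_0.5 = avg_mse[1], sd_1 = avg_mse[2], ratio = avg_mse[2] / avg_mse[1]), 3)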
(e) What about the MSE from each model?
cat(
" Mean Squared Error (n = 100; Error SD = 0.5):", mean(model$residuals^2), "\n",
"Mean Squared Error (n = 100; Error SD = 1 ):", mean(model2$residuals^2), "\n",
"Mean Squared Error (n = 400; Error SD = 0.5):", mean(model_n$residuals^2), "\n",
"Mean Squared Error (n = 400; Error SD = 1 ):", mean(model2_n$residuals^2), "\n"
)
## Mean Squared Error (n = 100; Error SD = 0.5): 0.1992276
## Mean Squared Error (n = 100; Error SD = 1 ): 0.7969102
## Mean Squared Error (n = 400; Error SD = 0.5): 0.2376328
## Mean Squared Error (n = 400; Error SD = 1 ): 0.9505312
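To make the comparison across parts (b)–(d) easier to scan, the four fits can also be collected into a single table (an optional consolidation built from the model objects created above).
# Optional side-by-side summary of the four simulated fits
data.frame(
  Scenario  = c("n = 100, SD = 0.5", "n = 100, SD = 1",
                "n = 400, SD = 0.5", "n = 400, SD = 1"),
  Intercept = c(coef(model)[[1]], coef(model2)[[1]], coef(model_n)[[1]], coef(model2_n)[[1]]),
  Slope     = c(coef(model)[[2]], coef(model2)[[2]], coef(model_n)[[2]], coef(model2_n)[[2]]),
  Slope_SE  = c(summary_model$coefficients[2, 2],   summary_model2$coefficients[2, 2],
                summary_model_n$coefficients[2, 2], summary_model2_n$coefficients[2, 2]),
  MSE       = c(mean(model$residuals^2),   mean(model2$residuals^2),
                mean(model_n$residuals^2), mean(model2_n$residuals^2))
) %>%
  kbl(caption = "Table 5: Comparison of the four simulated fits", digits = 3, align = "c") %>%
  kable_classic(full_width = F, html_font = "Cambria")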
Problem 2
Bias and Variance of Parameter Estimates.
(a) What are the bias and variance of the OLS estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\)?
By the Gauss–Markov theorem, the Ordinary Least Squares (OLS) method provides the best linear unbiased estimates of \(\beta_0\) and \(\beta_1\).
They are unbiased, meaning that
\[ E[\widehat{\beta}_0] = \beta_0, \qquad E[\widehat{\beta}_1] = \beta_1 \]
Therefore, the bias for both estimators is
\[ \text{Bias}(\widehat{\beta}_0) = 0, \qquad \text{Bias}(\widehat{\beta}_1) = 0 \]
The standard error (SE) of an estimator is the square root of its variance:
\[\text{SE}(\widehat{\beta}_i) = \sqrt{\operatorname{Var}(\widehat{\beta}_i)}\]
Thus, squaring the SE expression yields the variance of that estimator.
For the intercept estimator, we are given the SE formula:
\[ \text{SE}(\widehat{\beta}_0) = \left[\, s^2 \left( \frac{1}{n} + \frac{\bar{X}^{2}}{S_{xx}} \right) \right]^{1/2} \]
Squaring both sides gives:
\[\operatorname{Var}(\widehat{\beta}_0) = s^2\left(\frac{1}{n} + \frac{\bar{X}^2}{S_{xx}}\right)\]
In theoretical derivations we replace the sample estimate \(s^2\) with the true error variance \(\sigma^2\):
\[\operatorname{Var}(\widehat{\beta}_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{X}^2}{S_{xx}}\right)\]
where \(S_{xx} = \sum_{i=1}^n (X_i - \bar{X})^2\) is the corrected sum of squares of \(X\).
The variance of the slope is:
\[\operatorname{Var}(\widehat{\beta}_1) = \frac{\sigma^2}{S_{xx}}\]
Collecting results, we have:
\[ \boxed{ \begin{aligned} \text{Bias}(\widehat{\beta}_0) &= 0, & \operatorname{Var}(\widehat{\beta}_0) &= \sigma^2\left(\frac{1}{n} + \frac{\bar{X}^2}{S_{xx}}\right), \\ \text{Bias}(\widehat{\beta}_1) &= 0, & \operatorname{Var}(\widehat{\beta}_1) &= \frac{\sigma^2}{S_{xx}} \end{aligned} } \]
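These properties can be verified numerically. The sketch below (an illustration, not required by the prompt; the replication count of 2000 is arbitrary) holds the predictor values fixed across replications so that \(S_{xx}\) stays constant, then compares the empirical means of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) with the true values 10 and 5, and their empirical variances with the formulas above.
# Illustrative Monte Carlo check of the bias and variance formulas
# (X is held fixed across replications, matching the fixed-X derivation)
set.seed(7052)
n_chk <- 100
x_fix <- rnorm(n_chk, mean = 2, sd = 0.1)
sig   <- 0.5
Sxx   <- sum((x_fix - mean(x_fix))^2)

est <- replicate(2000, {
  y_sim <- 10 + 5 * x_fix + rnorm(n_chk, mean = 0, sd = sig)
  coef(lm(y_sim ~ x_fix))
})

round(rbind(
  empirical_mean  = rowMeans(est),                               # should be near (10, 5)
  empirical_var   = apply(est, 1, var),                          # should match the formulas
  theoretical_var = c(sig^2 * (1 / n_chk + mean(x_fix)^2 / Sxx), # Var(beta0_hat)
                      sig^2 / Sxx)                               # Var(beta1_hat)
), 4)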
(b) What do you expect to happen to the variances of the OLS estimates of \(\beta_0\) and \(\beta_1\) when the sample size \(n\) increases? What do you expect when the error variance \(\sigma^2\) increases?
As \(n\) increases, the variance of both estimators \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) decreases.
Mathematically, this follows because a larger sample size yields a larger \(S_{xx}\).
Recall that \[
S_{xx} = \sum_{i=1}^n (X_i - \bar{X})^2
\] and both variance formulas include \(S_{xx}\) in the
denominator: \[
\operatorname{Var}(\widehat{\beta}_1) = \frac{\sigma^2}{S_{xx}}, \qquad
\operatorname{Var}(\widehat{\beta}_0) = \sigma^2\!\left(\frac{1}{n} +
\frac{\bar{X}^{\,2}}{S_{xx}}\right)
\]
As \(n\) increases, \(S_{xx}\) grows (more squared deviations are added to the sum) and the \(1/n\) term shrinks. Because both of these changes reduce the variances above, the estimates become more precise.
Conceptually, this makes sense: as the sample size increases, the regression line is estimated from more data points, so there is less uncertainty and smaller variance in both \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\).
As the error variance \(\sigma^2\) increases, so does the variance of both estimators.
Mathematically, this is because both variance formulas are directly proportional to \(\sigma^2\): \[
\operatorname{Var}(\widehat{\beta}_1) = \frac{\sigma^2}{S_{xx}}, \qquad
\operatorname{Var}(\widehat{\beta}_0) = \sigma^2\!\left(\frac{1}{n} +
\frac{\bar{X}^{\,2}}{S_{xx}}\right)
\]
Hence, when the error variance \(\sigma^2\) increases, the variability of the estimates also increases proportionally.
Conceptually, if the noise of the error residuals increases, it means that there is more randomness or “unexplained variation” in the model. As a result, the estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) become less precise, leading to higher variance.
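A brief numerical illustration of both effects, using the slope formula \(\operatorname{Var}(\widehat{\beta}_1) = \sigma^2 / S_{xx}\) with simulated predictor values on the same scale as Problem 1 (the specific draws here are an assumption for illustration only): going from \(n = 100\) to \(n = 400\) cuts the variance to roughly a quarter, while doubling \(\sigma\) quadruples it.
# Illustrative: theoretical slope variance sigma^2 / Sxx for the four
# (n, sigma) combinations used in Problem 1
set.seed(7052)
var_slope <- function(n_obs, err_sd) {
  x <- rnorm(n_obs, mean = 2, sd = 0.1)   # predictor on the Problem 1 scale
  err_sd^2 / sum((x - mean(x))^2)
}
round(c(`n=100, sd=0.5` = var_slope(100, 0.5),
        `n=100, sd=1`   = var_slope(100, 1),
        `n=400, sd=0.5` = var_slope(400, 0.5),
        `n=400, sd=1`   = var_slope(400, 1)), 4)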
(c) What is the bias of the model’s MSE? What about the ML estimate of \(\sigma^2\)? What is the difference between these two estimates of \(\sigma^2\)? Why do we use MSE instead of the ML estimate?
The ML estimate of the variance is given by:
\[ \widehat{\sigma}^2_{ML} = \frac{SSE}{n} \]
where \(SSE = \sum_{i=1}^n (Y_i - \widehat{Y}_i)^2\).
Because it divides by \(n\) (the
total number of observations) instead of adjusting for the parameters
estimated in the model, it underestimates the true
variance \(\sigma^2\).
This makes \(\widehat{\sigma}^2_{ML}\)
a biased estimator.
Mathematically, the bias is downward: \(E[\widehat{\sigma}^2_{ML}] = \frac{n-2}{n}\,\sigma^2\). The ML estimate tends to give a smaller value than the true variance because it does not account for the two degrees of freedom “used up” when estimating \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\):
\[ E[\widehat{\sigma}^2_{ML}] < \sigma^2 \]
To correct this bias, we divide the sum of squared errors by \(n - 2\) instead of \(n\), since two parameters (\(\beta_0\) and \(\beta_1\)) are estimated in simple linear regression:
\[ \widehat{\sigma}^2_{MSE} = \frac{SSE}{n - 2} \]
This adjustment produces an unbiased estimator of the population variance:
\[ E[\widehat{\sigma}^2_{MSE}] = \sigma^2 \]
The MSE estimate incorporates the “used up” degrees of freedom when calculating \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\), which is why it correctly accounts for the loss of flexibility in the model.
We use the MSE estimate rather than the ML estimate because dividing by the residual degrees of freedom \(n - 2\) makes it an unbiased estimator of \(\sigma^2\), whereas the ML estimate is biased downward; the only difference between the two is the divisor, \(n\) versus \(n - 2\). Therefore, the MSE estimator is preferred for inference in regression analysis.
Final summary:
\[ \boxed{ \begin{aligned} \widehat{\sigma}^2_{ML} &= \frac{SSE}{n} \quad &\text{(biased; underestimates } \sigma^2\text{)} \\[6pt] \widehat{\sigma}^2_{MSE} &= \frac{SSE}{n - 2} \quad &\text{(unbiased; preferred for inference)} \end{aligned} } \]
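The downward bias of the ML estimate can also be seen numerically. The sketch below (illustrative; settings borrowed from Problem 1a, replication count arbitrary) shows that the average of \(SSE/n\) over repeated samples falls below the true \(\sigma^2 = 0.25\) by roughly the factor \((n-2)/n\), while the average of \(SSE/(n - 2)\) sits close to 0.25.
# Illustrative comparison of the two variance estimators over repeated samples
set.seed(7052)
est_var <- replicate(2000, {
  x   <- rnorm(100, mean = 2, sd = 0.1)
  y   <- 10 + 5 * x + rnorm(100, mean = 0, sd = 0.5)
  sse <- sum(residuals(lm(y ~ x))^2)
  c(ML = sse / 100, MSE = sse / (100 - 2))
})
round(rowMeans(est_var), 4)   # ML average falls below 0.25; MSE average is near 0.25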