BANA7052 Fall 2025 Homework 2
Eli Bales, Kazuhide Watanabe
November 2, 2025
Tools & Packages
# Load packages
library(dplyr)
library(ggplot2)
library(kableExtra)
library(patchwork)
Problem 1
Simulation Study (Simple Linear Regression). Assume mean function \(E\left(Y|X\right) = 10 + 5 * X\). For this exercise, use set.seed(7052) to ensure reproducibility.
(a) Generate data with \(X \sim N\left(\mu = 2, \sigma = 0.1\right)\), sample size \(n = 100\), and error term \(\epsilon \sim N\left(\mu = 0, \sigma = 0.5\right)\).
# Part (a): simulate n = 100 observations with error SD = 0.5
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
error <- rnorm(n, mean = 0, sd = 0.5)
Y <- 10 + 5*X + error
df <- data.frame(X, Y)
We generated the data by simulating the predictor variable \(X\) and the random error term using the
rnorm() function. The response variable \(Y\) was then constructed according to the
simple linear regression model:
\[ Y = 10 + 5X + \varepsilon, \]
where \(\varepsilon \sim N(\mu = 0, \sigma = 0.5)\).
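As a quick sanity check (illustrative, not required by the prompt), the sample moments of the simulated draws can be compared against the target parameters; they should be close to 2, 0.1, 0, and 0.5, up to sampling noise.
# Sanity check: sample moments of the simulated draws should be close to
# the targets (mu_X = 2, sd_X = 0.1, mu_eps = 0, sd_eps = 0.5)
round(c(mean_X = mean(X), sd_X = sd(X),
        mean_error = mean(error), sd_error = sd(error)), 3)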
(b) Fit a simple linear regression to the simulated data from
part a. What is the estimated prediction equation? Report the estimated
coefficients and their standard errors. Are they significant? Clearly
write out the null and alternative hypotheses, observed t-statistic(s),
p-value(s), and interpret the estimates and test results. What is the fitted model’s MSE?
# Fit a simple linear regression of Y on X
model <- lm(Y ~ X, data = df)
summary_model <- summary(model)
The estimated simple linear regression model is:
\[\hat{Y} = 9.0218 + 5.5652X\]
coef_table <- as.data.frame(summary_model$coefficients)
names(coef_table) <- c("Estimate", "Std Error", "t Value", "p Value")
coef_table$`p Value` <- format(coef_table$`p Value`, scientific = TRUE, digits = 2)
coef_table %>%
kbl(
caption = "Table 1: Estimated Coefficients for Model: Y ~ X (n = 100; Error SD = 0.5)",
align = "c",
digits = 3
) %>%
kable_classic(full_width = F, html_font = "Cambria") %>%
column_spec(2, width = "3cm") %>%
column_spec(3, width = "3cm") %>%
column_spec(4, width = "3cm") %>%
column_spec(5, width = "3cm")
|             | Estimate | Std Error | t Value | p Value |
|---|---|---|---|---|
| (Intercept) | 9.022 | 0.834 | 10.822 | 2.0e-18 |
| X | 5.565 | 0.415 | 13.395 | 7.1e-24 |
The slope coefficient is statistically significant, since its p-value is below 0.05:
\[ X: \; 7.1 \times 10^{-24} < 0.05 \]
For each coefficient \((\beta_i)\), the hypotheses are defined as:
\[ H_0: \beta_i = 0 \quad \text{vs.} \quad H_1: \beta_i \neq 0 \]
(a) \(X\)
\[ t = 13.395, \quad p = 7.1 \times 10^{-24} \]
Reject \(H_0\) → \(X\) is a significant positive predictor of \(Y\).
cat("Mean Squared Error:",(mse <- mean(model$residuals^2)))
## Mean Squared Error: 0.1992276
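Note that mean(model$residuals^2) computes \(SSE/n\); the degrees-of-freedom-adjusted estimate \(SSE/(n - 2)\) used in the discussion below is slightly larger. A minimal comparison from the same fitted model object (illustrative, not part of the prompt):
# Two closely related estimates of sigma^2 from the same fit
sse <- sum(residuals(model)^2)
c(SSE_over_n   = sse / n,         # what is reported above as "MSE"
  SSE_over_n_2 = sse / (n - 2),   # degrees-of-freedom-adjusted MSE
  sigma_sq_hat = sigma(model)^2)  # residual standard error from summary(), squared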
(c) Repeat part b), but re-simulate the data and change the error term to \(\epsilon \sim N\left(\mu = 0, \sigma = 1\right)\).
# Re-simulate with the error standard deviation set to 1
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
error2 <- rnorm(n, mean = 0, sd = 1)
Y2 <- 10 + 5*X + error2
df2 <- data.frame(Y2, X)
# Fit a simple linear regression of Y2 on X
model2 <- lm(Y2 ~ X, data = df2)
summary_model2 <- summary(model2)
The estimated simple linear regression model is:
\[\hat{Y} = 8.0436 + 6.1303X\]
coef_table2 <- as.data.frame(summary_model2$coefficients)
names(coef_table2) <- c("Estimate", "Std Error", "t Value", "p Value")
coef_table2$`p Value` <- format(coef_table2$`p Value`, scientific = TRUE, digits = 2)
coef_table2 %>%
kbl(
caption = "Table 2: Estimated Coefficients for Model: Y ~ X (n = 100; Error SD = 1)",
align = "c",
digits = 3
) %>%
kable_classic(full_width = F, html_font = "Cambria") %>%
column_spec(2, width = "3cm") %>%
column_spec(3, width = "3cm") %>%
column_spec(4, width = "3cm") %>%
column_spec(5, width = "3cm")
|             | Estimate | Std Error | t Value | p Value |
|---|---|---|---|---|
| (Intercept) | 8.044 | 1.667 | 4.824 | 5.2e-06 |
| X | 6.130 | 0.831 | 7.378 | 5.3e-11 |
The slope coefficient is statistically significant, since its p-value is below 0.05:
\[ X: \; 5.3 \times 10^{-11} < 0.05 \]
For each coefficient \((\beta_i)\), the hypotheses are defined as:
\[ H_0: \beta_i = 0 \quad \text{vs.} \quad H_1: \beta_i \neq 0 \]
(a) \(X\)
\[ t = 7.378, \quad p = 5.3 \times 10^{-11} \]
Reject \(H_0\) → \(X\) is a significant positive predictor of \(Y\).
cat("Mean Squared Error:",(mse <- mean(model2$residuals^2)))
## Mean Squared Error: 0.7969102
(d) Repeat parts a)–c) using \(n = 400\). What do you conclude? What is the effect on the model parameter estimates when error variance gets smaller? What is the effect when sample size gets bigger?
# Repeat part (a) with n = 400 and error SD = 0.5
set.seed(7052)
n2 <- 400
X_n <- rnorm(n2, mean = 2, sd = 0.1)
error_n <- rnorm(n2, mean = 0, sd = 0.5)
Y_n <- 10 + 5*X_n + error_n
df_n <- data.frame(Y_n, X_n)
# Fit a simple linear regression of Y_n on X_n
model_n <- lm(Y_n ~ X_n, data = df_n)
summary_model_n <- summary(model_n)
The estimated simple linear regression model is:
\[\hat{Y} = 9.7466 + 5.1177X\]
coef_table_n <- as.data.frame(summary_model_n$coefficients)
names(coef_table_n) <- c("Estimate", "Std Error", "t Value", "p Value")
coef_table_n$`p Value` <- format(coef_table_n$`p Value`, scientific = TRUE, digits = 2)
coef_table_n %>%
kbl(
caption = "Table 3: Estimated Coefficients for Model: Y ~ X (n = 400; Error SD = 0.5)",
align = "c",
digits = 3
) %>%
kable_classic(full_width = F, html_font = "Cambria") %>%
column_spec(2, width = "3cm") %>%
column_spec(3, width = "3cm") %>%
column_spec(4, width = "3cm") %>%
column_spec(5, width = "3cm")
|             | Estimate | Std Error | t Value | p Value |
|---|---|---|---|---|
| (Intercept) | 9.747 | 0.501 | 19.436 | 1.2e-59 |
| X_n | 5.118 | 0.249 | 20.555 | 1.6e-64 |
The slope coefficient is statistically significant, since its p-value is below 0.05:
\[ X: \; 1.6 \times 10^{-64} < 0.05 \]
For each coefficient \((\beta_i)\), the hypotheses are defined as:
\[ H_0: \beta_i = 0 \quad \text{vs.} \quad H_1: \beta_i \neq 0 \]
(a) \(X\)
\[ t = 20.555, \quad p = 1.6 \times 10^{-64} \]
Reject \(H_0\) → \(X\) is a significant positive predictor of \(Y\).
cat("Mean Squared Error:",(mse <- mean(model_n$residuals^2)))
## Mean Squared Error: 0.2376328
# Repeat part (c) with n = 400 and error SD = 1
set.seed(7052)
n2 <- 400
X_n <- rnorm(n2, mean = 2, sd = 0.1)
error2_n <- rnorm(n2, mean = 0, sd = 1)
Y2_n <- 10 + 5*X_n + error2_n
df2_n <- data.frame(Y2_n, X_n)
# Fit a simple linear regression of Y2_n on X_n
model2_n <- lm(Y2_n ~ X_n, data = df2_n)
summary_model2_n <- summary(model2_n)
The estimated simple linear regression model is:
\[\hat{Y} = 9.4933 + 5.2355X\]
coef_table2_n <- as.data.frame(summary_model2_n$coefficients)
names(coef_table2_n) <- c("Estimate", "Std Error", "t Value", "p Value")
coef_table2_n$`p Value` <- format(coef_table2_n$`p Value`, scientific = TRUE, digits = 2)
coef_table2_n %>%
kbl(
caption = "Table 4: Estimated Coefficients for Model: Y ~ X (n = 400; Error SD = 1)",
align = "c",
digits = 3
) %>%
kable_classic(full_width = F, html_font = "Cambria") %>%
column_spec(2, width = "3cm") %>%
column_spec(3, width = "3cm") %>%
column_spec(4, width = "3cm") %>%
column_spec(5, width = "3cm")
|             | Estimate | Std Error | t Value | p Value |
|---|---|---|---|---|
| (Intercept) | 9.493 | 1.003 | 9.466 | 2.6e-19 |
| X_n | 5.235 | 0.498 | 10.514 | 5.6e-23 |
The slope coefficient is statistically significant, since its p-value is below 0.05:
\[ X: \; 5.6 \times 10^{-23} < 0.05 \]
For each coefficient \((\beta_i)\), the hypotheses are defined as:
\[ H_0: \beta_i = 0 \quad \text{vs.} \quad H_1: \beta_i \neq 0 \]
(a) \(X\)
\[ t = 10.514, \quad p = 5.6 \times 10^{-23} \]
Reject \(H_0\) → \(X\) is a significant positive predictor of \(Y\).
cat("Mean Squared Error:",(mse <- mean(model2_n$residuals^2)))
## Mean Squared Error: 0.9505312
As the sample size \(n\) increases, the Mean Squared Error (MSE) values in our results change only slightly (for example, from about 0.199 to 0.238 when the error SD is 0.5), but this does not mean the model is performing worse.
The MSE in this context represents the model’s estimate of the true variance of the residuals:
\[ \widehat{\sigma}^2_{MSE} = \frac{SSE}{n - 2} \]
This value estimates the underlying error variance \(\sigma^2\), not predictive accuracy. (The values printed above use \(SSE/n\), which differs from \(SSE/(n - 2)\) only by the factor \((n-2)/n\), so the comparison is unaffected.) Because \(E[\widehat{\sigma}^2_{MSE}] = \sigma^2\), the expected MSE should remain approximately constant across different values of \(n\).
As \(n\) increases, the regression model has more information with which to estimate both the coefficients and the residual variance. The coefficient estimates become more precise, and the MSE becomes a more consistent estimator of the true error variance.
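A small Monte Carlo sketch (illustrative, beyond the assignment prompt; the replication count of 1000 is an arbitrary choice) makes this concrete: with \(\sigma = 0.5\), the average of \(SSE/(n - 2)\) stays near \(\sigma^2 = 0.25\) at both \(n = 100\) and \(n = 400\), while its run-to-run spread shrinks at the larger sample size.
# Illustrative Monte Carlo: MSE = SSE/(n - 2) centers on sigma^2 = 0.25
# at both sample sizes, but varies less from run to run when n is larger
set.seed(7052)
mse_sim <- function(n_obs, err_sd, reps = 1000) {
  replicate(reps, {
    x <- rnorm(n_obs, mean = 2, sd = 0.1)
    y <- 10 + 5 * x + rnorm(n_obs, mean = 0, sd = err_sd)
    sum(residuals(lm(y ~ x))^2) / (n_obs - 2)
  })
}
mse_100 <- mse_sim(100, 0.5)
mse_400 <- mse_sim(400, 0.5)
round(c(mean_n100 = mean(mse_100), sd_n100 = sd(mse_100),
        mean_n400 = mean(mse_400), sd_n400 = sd(mse_400)), 4)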
The standard deviation of the errors, \(\sigma\), measures the amount of random
noise around the regression line.
Since the MSE estimates \(\sigma^2\),
any change in \(\sigma\) directly
affects the expected MSE.
When \(\sigma\) decreases, the expected MSE decreases as well. This pattern aligns with our observed results: decreasing the error standard deviation from 1.0 to 0.5 lowered the MSE from approximately 0.80–0.95 to around 0.20–0.24.
Conceptually, a smaller \(\sigma\)
means less random noise, so the regression line fits the data more
tightly.
The smaller the \(\sigma\), the smaller
the average squared residuals (MSE), and the more accurate and
precise the model becomes.
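The same kind of sketch (again illustrative, with an arbitrary replication count) shows the scaling with \(\sigma\): doubling \(\sigma\) from 0.5 to 1 roughly quadruples the average MSE, consistent with the jump from about 0.20 to about 0.80 in our single-run results.
# Illustrative: with n fixed at 100, the average MSE scales with sigma^2,
# so doubling sigma should roughly quadruple the MSE
set.seed(7052)
avg_mse <- sapply(c(0.5, 1), function(err_sd) {
  mean(replicate(1000, {
    x <- rnorm(100, mean = 2, sd = 0.1)
    y <- 10 + 5 * x + rnorm(100, mean = 0, sd = err_sd)
    sum(residuals(lm(y ~ x))^2) / (100 - 2)
  }))
})
round(c(sd_0.5 = avg_mse[1], sd_1 = avg_mse[2], ratio = avg_mse[2] / avg_mse[1]), 3)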
(e) What about the MSE from each model?
cat(
" Mean Squared Error (n = 100; Error SD = 0.5):", mean(model$residuals^2), "\n",
"Mean Squared Error (n = 100; Error SD = 1 ):", mean(model2$residuals^2), "\n",
"Mean Squared Error (n = 400; Error SD = 0.5):", mean(model_n$residuals^2), "\n",
"Mean Squared Error (n = 400; Error SD = 1 ):", mean(model2_n$residuals^2), "\n"
)
## Mean Squared Error (n = 100; Error SD = 0.5): 0.1992276
## Mean Squared Error (n = 100; Error SD = 1 ): 0.7969102
## Mean Squared Error (n = 400; Error SD = 0.5): 0.2376328
## Mean Squared Error (n = 400; Error SD = 1 ): 0.9505312
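To make the comparison across parts (b)–(d) easier to scan, the four fits can also be collected into a single table (an optional consolidation built from the model objects created above).
# Optional side-by-side summary of the four simulated fits
data.frame(
  Scenario  = c("n = 100, SD = 0.5", "n = 100, SD = 1",
                "n = 400, SD = 0.5", "n = 400, SD = 1"),
  Intercept = c(coef(model)[[1]], coef(model2)[[1]], coef(model_n)[[1]], coef(model2_n)[[1]]),
  Slope     = c(coef(model)[[2]], coef(model2)[[2]], coef(model_n)[[2]], coef(model2_n)[[2]]),
  Slope_SE  = c(summary_model$coefficients[2, 2],   summary_model2$coefficients[2, 2],
                summary_model_n$coefficients[2, 2], summary_model2_n$coefficients[2, 2]),
  MSE       = c(mean(model$residuals^2),   mean(model2$residuals^2),
                mean(model_n$residuals^2), mean(model2_n$residuals^2))
) %>%
  kbl(caption = "Table 5: Comparison of the four simulated fits", digits = 3, align = "c") %>%
  kable_classic(full_width = F, html_font = "Cambria")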
Problem 2
Bias and Variance of Parameter Estimates.
(a) What are the bias and variance of the OLS estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\)?
By the Gauss–Markov theorem, the Ordinary Least Squares (OLS) method provides the best linear unbiased estimates of \(\beta_0\) and \(\beta_1\).
They are unbiased, meaning that
\[ E[\widehat{\beta}_0] = \beta_0, \qquad E[\widehat{\beta}_1] = \beta_1 \]
Therefore, the bias for both estimators is
\[ \text{Bias}(\widehat{\beta}_0) = 0, \qquad \text{Bias}(\widehat{\beta}_1) = 0 \]
The standard error (SE) of an estimator is the square root of its variance:
\[\text{SE}(\widehat{\beta}_i) = \sqrt{\operatorname{Var}(\widehat{\beta}_i)}\]
Thus, squaring the SE expression yields the variance of that estimator.
For the intercept estimator, we are given the SE formula:
\[ \text{SE}(\widehat{\beta}_0) = \left[\, s^2 \left( \frac{1}{n} + \frac{\bar{X}^{2}}{S_{xx}} \right) \right]^{1/2} \]
Squaring both sides gives:
\[\operatorname{Var}(\widehat{\beta}_0) = s^2\left(\frac{1}{n} + \frac{\bar{X}^2}{S_{xx}}\right)\]
In theoretical derivations we replace the sample estimate \(s^2\) with the true error variance \(\sigma^2\):
\[\operatorname{Var}(\widehat{\beta}_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{X}^2}{S_{xx}}\right)\]
where \(S_{xx} = \sum_{i=1}^n (X_i - \bar{X})^2\) is the corrected sum of squares of \(X\).
The variance of the slope is:
\[\operatorname{Var}(\widehat{\beta}_1) = \frac{\sigma^2}{S_{xx}}\]
Collecting results, we have:
\[ \boxed{ \begin{aligned} \text{Bias}(\widehat{\beta}_0) &= 0, & \operatorname{Var}(\widehat{\beta}_0) &= \sigma^2\left(\frac{1}{n} + \frac{\bar{X}^2}{S_{xx}}\right), \\ \text{Bias}(\widehat{\beta}_1) &= 0, & \operatorname{Var}(\widehat{\beta}_1) &= \frac{\sigma^2}{S_{xx}} \end{aligned} } \]
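These properties can be verified numerically. The sketch below (an illustration, not required by the prompt; the replication count of 2000 is arbitrary) holds the predictor values fixed across replications so that \(S_{xx}\) stays constant, then compares the empirical means of \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) with the true values 10 and 5, and their empirical variances with the formulas above.
# Illustrative Monte Carlo check of the bias and variance formulas
# (X is held fixed across replications, matching the fixed-X derivation)
set.seed(7052)
n_chk <- 100
x_fix <- rnorm(n_chk, mean = 2, sd = 0.1)
sig   <- 0.5
Sxx   <- sum((x_fix - mean(x_fix))^2)

est <- replicate(2000, {
  y_sim <- 10 + 5 * x_fix + rnorm(n_chk, mean = 0, sd = sig)
  coef(lm(y_sim ~ x_fix))
})

round(rbind(
  empirical_mean  = rowMeans(est),                               # should be near (10, 5)
  empirical_var   = apply(est, 1, var),                          # should match the formulas
  theoretical_var = c(sig^2 * (1 / n_chk + mean(x_fix)^2 / Sxx), # Var(beta0_hat)
                      sig^2 / Sxx)                               # Var(beta1_hat)
), 4)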
(b) What do you expect to happen to the variances of the OLS estimates of \(\beta_0\) and \(\beta_1\) when the sample size \(n\) increases? What do you expect when the error variance \(\sigma^2\) increases?
As \(n\) increases, the variance of both estimators \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) decreases.
Mathematically, this follows because a larger sample size yields a larger \(S_{xx}\).
Recall that \[
S_{xx} = \sum_{i=1}^n (X_i - \bar{X})^2
\] and both variance formulas include \(S_{xx}\) in the
denominator: \[
\operatorname{Var}(\widehat{\beta}_1) = \frac{\sigma^2}{S_{xx}}, \qquad
\operatorname{Var}(\widehat{\beta}_0) = \sigma^2\!\left(\frac{1}{n} +
\frac{\bar{X}^{\,2}}{S_{xx}}\right)
\]
As \(n\) increases, \(S_{xx}\) grows (more squared deviations are added to the sum) and the \(1/n\) term shrinks. Because both of these changes reduce the variances above, the estimates become more precise.
Conceptually, this makes sense: as the sample size increases, the regression line is estimated from more data points, so there is less uncertainty and smaller variance in both \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\).
As the error variance \(\sigma^2\) increases, so does the variance of both estimators.
Mathematically, this is because both variance formulas are directly proportional to \(\sigma^2\): \[
\operatorname{Var}(\widehat{\beta}_1) = \frac{\sigma^2}{S_{xx}}, \qquad
\operatorname{Var}(\widehat{\beta}_0) = \sigma^2\!\left(\frac{1}{n} +
\frac{\bar{X}^{\,2}}{S_{xx}}\right)
\]
Hence, when the error variance \(\sigma^2\) increases, the variability of the estimates also increases proportionally.
Conceptually, if the noise of the error residuals increases, it means that there is more randomness or “unexplained variation” in the model. As a result, the estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) become less precise, leading to higher variance.
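A brief numerical illustration of both effects, using the slope formula \(\operatorname{Var}(\widehat{\beta}_1) = \sigma^2 / S_{xx}\) with simulated predictor values on the same scale as Problem 1 (the specific draws here are an assumption for illustration only): going from \(n = 100\) to \(n = 400\) cuts the variance to roughly a quarter, while doubling \(\sigma\) quadruples it.
# Illustrative: theoretical slope variance sigma^2 / Sxx for the four
# (n, sigma) combinations used in Problem 1
set.seed(7052)
var_slope <- function(n_obs, err_sd) {
  x <- rnorm(n_obs, mean = 2, sd = 0.1)   # predictor on the Problem 1 scale
  err_sd^2 / sum((x - mean(x))^2)
}
round(c(`n=100, sd=0.5` = var_slope(100, 0.5),
        `n=100, sd=1`   = var_slope(100, 1),
        `n=400, sd=0.5` = var_slope(400, 0.5),
        `n=400, sd=1`   = var_slope(400, 1)), 4)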
(c) What is the bias of the model’s MSE? What about the ML estimate of \(\sigma^2\)? What is the difference between these two estimates of \(\sigma^2\)? Why do we use MSE instead of the ML estimate?
The ML estimate of the variance is given by:
\[ \widehat{\sigma}^2_{ML} = \frac{SSE}{n} \]
where \(SSE = \sum_{i=1}^n (Y_i - \widehat{Y}_i)^2\).
Because it divides by \(n\) (the
total number of observations) instead of adjusting for the parameters
estimated in the model, it underestimates the true
variance \(\sigma^2\).
This makes \(\widehat{\sigma}^2_{ML}\)
a biased estimator.
Mathematically, the bias is downward: \(E[\widehat{\sigma}^2_{ML}] = \frac{n-2}{n}\,\sigma^2\). The ML estimate tends to give a smaller value than the true variance because it does not account for the two degrees of freedom “used up” when estimating \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\):
\[ E[\widehat{\sigma}^2_{ML}] < \sigma^2 \]
To correct this bias, we divide the sum of squared errors by \(n - 2\) instead of \(n\), since two parameters (\(\beta_0\) and \(\beta_1\)) are estimated in simple linear regression:
\[ \widehat{\sigma}^2_{MSE} = \frac{SSE}{n - 2} \]
This adjustment produces an unbiased estimator of the population variance:
\[ E[\widehat{\sigma}^2_{MSE}] = \sigma^2 \]
The MSE estimate incorporates the “used up” degrees of freedom when calculating \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\), which is why it correctly accounts for the loss of flexibility in the model.
We use the MSE estimate rather than the ML estimate because dividing by the residual degrees of freedom \(n - 2\) makes it an unbiased estimator of \(\sigma^2\), whereas the ML estimate is biased downward; the only difference between the two is the divisor, \(n\) versus \(n - 2\). Therefore, the MSE estimator is preferred for inference in regression analysis.
Final summary:
\[ \boxed{ \begin{aligned} \widehat{\sigma}^2_{ML} &= \frac{SSE}{n} \quad &\text{(biased; underestimates } \sigma^2\text{)} \\[6pt] \widehat{\sigma}^2_{MSE} &= \frac{SSE}{n - 2} \quad &\text{(unbiased; preferred for inference)} \end{aligned} } \]
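The downward bias of the ML estimate can also be seen numerically. The sketch below (illustrative; settings borrowed from Problem 1a, replication count arbitrary) shows that the average of \(SSE/n\) over repeated samples falls below the true \(\sigma^2 = 0.25\) by roughly the factor \((n-2)/n\), while the average of \(SSE/(n - 2)\) sits close to 0.25.
# Illustrative comparison of the two variance estimators over repeated samples
set.seed(7052)
est_var <- replicate(2000, {
  x   <- rnorm(100, mean = 2, sd = 0.1)
  y   <- 10 + 5 * x + rnorm(100, mean = 0, sd = 0.5)
  sse <- sum(residuals(lm(y ~ x))^2)
  c(ML = sse / 100, MSE = sse / (100 - 2))
})
round(rowMeans(est_var), 4)   # ML average falls below 0.25; MSE average is near 0.25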