BANA7052 Fall 2025 Homework 1
Eli Bales Vivian Comer Andrew McCurrach Devin Walker Kazuhide Watanabe
October 28, 2025
Tools & Packages
# Load the dataset and packages
library(dplyr)
library(ggplot2)
library(kableExtra)
library(patchwork)
Problem 1
A Simulation Study (Simple Linear Regression). Assuming the mean response is \(E\left(Y|X\right)= 10 + 5X\):
(a) Generate data with \(X \sim N\left(\mu = 2, \sigma = 0.1\right)\), sample size \(n = 100\), and error term \(\epsilon \sim N\left(\mu = 0, \sigma = 0.5\right)\).
Hint: You can use rnorm(n = 50, mean = 5, sd = 3) to simulate \(n = 50\) observations from a \(N\left(\mu = 5, \sigma = 3\right)\) distribution, but note that rnorm() specifies the standard deviation (\(\sigma\)), rather than the variance (\(\sigma^2\)), of the normal distribution. It is also good practice to specify the random seed via set.seed() whenever generating random data. For this exercise, use set.seed(7052) to ensure reproducibility.
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
error <- rnorm(n, mean = 0, sd = 0.5)
Y <- 10 + 5*X + error
We generated the data by simulating the predictor variable \(X\) and the random error term using the
rnorm() function. The response variable \(Y\) was then constructed according to the
simple linear regression model:
\[ Y = 10 + 5X + \varepsilon, \]
where \(\varepsilon \sim N(0, 0.5^2)\).
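As a quick optional check on the simulation (reusing the objects created above), the sample moments of the simulated draws should be close to the parameters specified in the problem:
# Sample moments of the simulated predictor and error term; they should be
# near mu_X = 2, sigma_X = 0.1, mu_eps = 0, and sigma_eps = 0.5
c(mean_X = mean(X), sd_X = sd(X), mean_eps = mean(error), sd_eps = sd(error))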
(b) Show summary statistics of the response variable and
predictor variable. Are there outliers? What is the correlation
coefficient? Draw a scatter plot.
df <- data.frame(X = X, Y = Y)
# Creating a table of summary statistics
summary_stats <- data.frame(
Min = sapply(df, min, na.rm = TRUE),
Q1 = sapply(df, function(x) quantile(x, 0.25, na.rm = TRUE)),
Median = sapply(df, median, na.rm = TRUE),
Mean = sapply(df, mean, na.rm = TRUE),
Q3 = sapply(df, function(x) quantile(x, 0.75, na.rm = TRUE)),
Max = sapply(df, max, na.rm = TRUE),
SD = sapply(df, sd, na.rm = TRUE)
)
# Transposing data
rotated_summary <- round(as.data.frame(t(summary_stats[,-1])), 3)
rotated_summary %>%
kbl(caption = "Table 1: Summary Statistics: X and Y", align = "c") %>%
kable_classic(full_width = F, html_font = "Cambria") %>%
column_spec(2, width = "6em") %>%
column_spec(3, width = "6em") %>%
row_spec(0, align = "c")
Table 1: Summary Statistics: X and Y
| | X | Y |
|---|---|---|
| Q1 | 1.923 | 19.665 |
| Median | 2.001 | 20.110 |
| Mean | 2.004 | 20.173 |
| Q3 | 2.070 | 20.701 |
| Max | 2.243 | 21.795 |
| SD | 0.109 | 0.755 |
# Boxplot X
p1 <- ggplot(df, aes(y = X)) +
geom_boxplot(fill = "#B0B0B0") +
labs(
title = "Plot X",
x = "",
y = ""
) +
theme_bw(base_family = "serif") +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5)
)
# Boxplot Y
p2 <- ggplot(df, aes(y = Y)) +
geom_boxplot(fill = "#B0B0B0") +
labs(
title = "Plot Y",
x = "",
y = ""
) +
theme_bw(base_family = "serif") +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5)
)
# Put boxplots next to each other
p1 + p2 +
plot_annotation(
title = "Figure 1: X vs. Y Boxplots",
theme = theme(
plot.title = element_text(
family = "Times New Roman",
size = 15,
)
)
)
ggplot(df, aes(x = X, y = Y)) +
geom_point(color = "#36454F") +
labs(
x = "X",
y = "Y"
) +
theme_bw(base_family = "serif") +
theme(
axis.title.y = element_text(angle = 0, hjust = 0.5, vjust = 0.5,
margin = margin(r = 8))) +
plot_annotation(
title = "Figure 2: X vs. Y Scatterplot",
theme = theme(
plot.title = element_text(
family = "Times New Roman",
size = 15,
)
)
)
cat("Correlation Coefficient:", cor(df$X, df$Y))
## Correlation Coefficient: 0.8042198
The correlation coefficient of about 0.804 indicates a strong positive linear relationship between X and Y.
Examining both variables in Figure 1, the mean and median values for X (mean = 2.004, median = 2.001) and Y (mean = 20.173, median = 20.110) are closely aligned, and neither boxplot shows points outside the whiskers, indicating roughly symmetric distributions with no apparent outliers or extreme skewness. Figure 2 further supports this observation: the points follow a clear upward linear trend with no obvious outliers.
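This correlation also determines the slope fitted in part (c) below: for simple linear regression, \(\widehat{\beta_1} = r \cdot s_Y / s_X\). A one-line check using the quantities above (an aside, not required by the problem):
# r * sd(Y) / sd(X) reproduces the OLS slope (approximately 5.565)
cor(df$X, df$Y) * sd(df$Y) / sd(df$X)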
(c) Fit a simple linear regression. What is the estimated model?
Report the estimated coefficients. What is the model mean squared error
(MSE)?
# Fit the simple linear regression model
model <- lm(Y ~ X, data = df)
# Produce summary statistics
summary_model <- summary(model)
The estimated simple linear regression model is:
\[\hat{Y} = 9.022 + 5.565X\]
coef_table <- as.data.frame(summary_model$coefficients)
names(coef_table) <- c("Estimate", "Std Error", "t Value", "p Value")
coef_table$`p Value` <- format(coef_table$`p Value`, scientific = TRUE, digits = 2)
coef_table %>%
kbl(
caption = "Table 2: Estimated Coefficients for Model: Y ~ X",
align = "c",
digits = 3
) %>%
kable_classic(full_width = F, html_font = "Cambria")
Table 2: Estimated Coefficients for Model: Y ~ X
| | Estimate | Std Error | t Value | p Value |
|---|---|---|---|---|
| (Intercept) | 9.022 | 0.834 | 10.822 | 2.0e-18 |
| X | 5.565 | 0.415 | 13.395 | 7.1e-24 |
cat("Mean Squared Error:",(mse <- mean(model$residuals^2)))
## Mean Squared Error: 0.1992276
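Note that the value above divides the residual sum of squares by \(n\). Many textbooks instead define the model MSE as \(SSE/(n-2)\), which is the estimate of \(\sigma^2\) underlying lm()'s residual standard error; if that definition is intended, it can be computed as follows (a brief aside, not part of the original solution):
# SSE / (n - 2), i.e., the square of the residual standard error
sum(residuals(model)^2) / model$df.residual   # equals summary(model)$sigma^2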
(d) What is the sample mean of both \(X\) and \(Y\)? Plot the fitted regression line and
the point \(\left(\bar{X},
\bar{Y}\right)\). What do you find?
# Calculate means
x_mean <- mean(df$X)
y_mean <- mean(df$Y)
ggplot(df, aes(x = X, y = Y)) +
geom_point(color = "gray40", size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "darkorange", linewidth = 1) +
geom_point(aes(x = x_mean, y = y_mean), color = "blue", size = 3) +
labs(
x = "X",
y = "Y"
) +
theme_bw(base_family = "serif") +
theme(
axis.title.y = element_text(angle = 0, hjust = 0.5, vjust = 0.5,
margin = margin(r = 8))) +
plot_annotation(
title = "Figure 3: Mean of X and Y with Regression Line",
theme = theme(
plot.title = element_text(
family = "Times New Roman",
size = 15,
)
)
)
The sample means are \(\bar{X} \approx 2.004\) and \(\bar{Y} \approx 20.173\) (see Table 1). Looking at Figure 3, we see that the fitted regression line passes directly through the blue point \(\left(\bar{X}, \bar{Y}\right)\), which lies at the center of the data cloud. This is no coincidence: as shown in Problem 3(a), the OLS line always passes through the point of sample means.
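As a quick numerical confirmation (reusing the objects defined above), evaluating the fitted line at \(\bar{X}\) recovers \(\bar{Y}\):
# Fitted value at the mean of X versus the mean of Y; the two agree up to
# floating-point error, as proved in Problem 3(a)
unname(coef(model)[1] + coef(model)[2] * x_mean)
y_mean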
Problem 2
Ordinary least squares (OLS) is typically used to estimate the regression coefficients \(\beta_0\) and \(\beta_1\) in the simple linear regression model by minimizing the residual sum of squares (RSS)
\[ RSS\left(\beta_0, \beta_1\right) =
\sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right)^2 = \sum_{i=1}^n
\epsilon_i^2 \]
(a) How about minimizing \(\sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1
X_i\right) = \sum_{i=1}^n \epsilon_i\), compared to minimizing
RSS?
If we look to minimize \(\sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right) = \sum_{i=1}^n \epsilon_i\), we should first note that this criterion is linear in \(\beta_0\) and \(\beta_1\). Expanding the sum gives:
\[\sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right) = \sum_{i=1}^n Y_i - n\beta_0 - \beta_1\sum_{i=1}^n X_i\]
This is a linear function of \(\beta_0\) and \(\beta_1\); there are no higher-order terms to introduce curvature, so in 3D its graph is a (tilted) plane. To minimize a function we need some sort of "valley", a finite minimum at which to set \(\beta_0\) and \(\beta_1\). The sum of the residuals has no such minimum: by making \(\beta_0\) (or \(\beta_1\)) arbitrarily large we can drive the criterion toward \(-\infty\), so it is unbounded below and cannot be minimized. A small numerical illustration follows.
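The sketch below (a minimal illustration reusing the simulated X and Y from Problem 1; the trial values of \(\beta_0\) are arbitrary) shows the criterion decreasing without bound as \(\beta_0\) grows:
# Sum of raw residuals at the true slope (b1 = 5) and increasingly large
# intercepts: the criterion keeps decreasing, so it has no minimum
sum_resid <- function(b0, b1) sum(Y - b0 - b1 * X)
sapply(c(10, 100, 1000, 10000), sum_resid, b1 = 5)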
(b) How about minimizing \(\sum_{i=1}^n\left|Y_i - \beta_0 - \beta_1
X_i\right| = \sum_{i=1}^n \left|\epsilon_i\right|\), compared to
minimizing RSS?
Now we want to minimize the sum of the absolute values of the residuals:
\[\sum_{i=1}^n\left|Y_i - \beta_0 - \beta_1 X_i\right| = \sum_{i=1}^n \left|\epsilon_i\right|\]
This criterion is a piecewise-linear, convex function of \(\beta_0\) and \(\beta_1\). Unlike the raw sum in part (a), it is bounded below, so it can be minimized; doing so is known as least absolute deviations (LAD) regression. The difficulty, compared with minimizing the RSS, is that the surface is made of flat faces meeting at "kinks" (wherever some residual equals zero), so the criterion is not differentiable everywhere. We cannot simply set derivatives to zero to obtain closed-form estimates, the minimizer need not be unique, and the solution must be found numerically (for example by linear programming). The approach is less sensitive to outliers than OLS, but analytically and computationally less convenient.
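As a brief numerical aside (not part of the original solution), the absolute-value criterion can indeed be minimized, just not in closed form. Here we use base R's optim() with the derivative-free Nelder-Mead default on the Problem 1 data and compare the result with the OLS fit:
# Minimize the sum of absolute residuals numerically (least absolute
# deviations) and compare with the OLS coefficients from Problem 1(c)
lad_fit <- optim(par = c(0, 0),
                 fn = function(b) sum(abs(Y - b[1] - b[2] * X)))
lad_fit$par   # LAD estimates of (beta_0, beta_1)
coef(model)   # OLS estimates for comparison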
(c) Why is OLS a popular choice for estimating \(\beta_0\) and \(\beta_1\)?
Ordinary least squares minimizes the residual sum of squares. The squaring is what matters: it turns the criterion into a quadratic function of \(\beta_0\) and \(\beta_1\). Expanding the function:
\[\sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right)^2 = \sum_{i=1}^n\left(Y_i^2 - 2\beta_0 Y_i - 2\beta_1 X_i Y_i + \beta_0^2 + 2\beta_0 \beta_1 X_i + \beta_1^2 X_i^2\right)\]
The squared terms \(\beta_0^2\) and \(\beta_1^2 X_i^2\) (together with the cross term \(2\beta_0\beta_1 X_i\)) make the RSS a convex quadratic in \(\beta_0\) and \(\beta_1\); in 3D its graph is a paraboloid with a single global minimum, provided the \(X_i\) are not all equal. Because the criterion is smooth, setting its two partial derivatives to zero yields closed-form estimates (the normal equations). That is why OLS is so popular: finding \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) is both possible and computationally easy, as illustrated below.
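As a check of this claim (a short sketch reusing the Problem 1 data; the closed-form expressions are the ones stated in Problem 3(f) below), the explicit OLS solutions reproduce the coefficients reported by lm():
# Closed-form OLS estimates versus lm()
b1_hat <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b0_hat <- mean(Y) - b1_hat * mean(X)
c(b0_hat, b1_hat)
coef(model)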
Problem 3
Establish the following relationships for the simple linear regression model. (Some are trivial to show.)
(a) The fitted line passes through the point \(\left(\bar{X}, \bar{Y}\right)\).
We want to show that our regression line passes through the point \(\left(\bar{X}, \bar{Y}\right)\), representing the means of X and Y.
From the ordinary least squares (OLS) normal equations, we know:
\[\widehat{\beta_0} = \bar{Y} - \widehat{\beta_1} \bar{X}\]
The fitted regression line can be written as:
\[\widehat{Y}(X) = \widehat{\beta_0} + \widehat{\beta_1} X\]
Substituting \(X = \bar{X}\) and \(\widehat{\beta_0} = \bar{Y} - \widehat{\beta_1} \bar{X}\):
\[\widehat{Y}(\bar{X}) = \left(\bar{Y} - \widehat{\beta_1} \bar{X}\right) + \widehat{\beta_1} \bar{X} = \bar{Y}\]
Therefore, the fitted regression line passes exactly through the mean point \(\left(\bar{X}, \bar{Y}\right)\).
(b) \(\sum_{i=1}^n e_i =
0\)
We want to show that the sum of the residuals equals zero: \[\sum_{i=1}^n e_i = 0\]
The residuals, \(e_i\), are defined as the vertical differences between the observed and fitted values:
\[e_i = Y_i - \widehat{Y_i}\]
Recall that \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) are the values that minimize
\[RSS\left(\beta_0, \beta_1\right) = \sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right)^2 = \sum_{i=1}^n \epsilon_i^2\]
Taking the partial derivative of RSS with respect to \(\beta_0\) and evaluating it at the OLS estimates gives:
\[\frac{\partial RSS}{\partial \beta_0} = -2\sum_{i=1}^n\left(Y_i - \widehat{\beta_0} - \widehat{\beta_1} X_i\right) = -2\sum_{i=1}^n e_i\]
At the minimum, this partial derivative is equal to zero:
\[-2\sum_{i=1}^n e_i = 0\]
Therefore, we conclude that:
\[\sum_{i=1}^n e_i = 0\]
(c) \(\sum_{i=1}^n Y_i = \sum_{i=1}^n
\widehat{Y}_i\)
We want to show that the sum of the observed values equals the sum of the fitted values:
\[\sum_{i=1}^n Y_i = \sum_{i=1}^n \widehat{Y}_i\]
Recall that the residuals are defined as:
\[\ e_i = Y_i -
\widehat{Y_i}\]
From part (b), we know that the sum of the residuals is zero:
\[\sum_{i=1}^n e_i = 0\]
Substituting \(\ e_i = Y_i - \widehat{Y_i}\) into this expression gives:
\[\sum_{i=1}^n \left(Y_i - \widehat{Y_i}\right) = 0 \]
Simplifying, we get:
\[\sum_{i=1}^n Y_i = \sum_{i=1}^n \widehat{Y_i} \]
Therefore, the total of the observed values equals the total of the fitted values.
(d) \(\sum_{i=1}^n X_ie_i =
0\); that is, the sum of the weighted residuals is zero when the
residual of the i-th observation is weighted by the predictor value of
the i-th observation.
We want to show that the sum of the weighted residuals equals zero:
\[\sum_{i=1}^n X_ie_i = 0\]
Starting from the definition of the residuals, \(e_i = Y_i - \widehat{Y_i}\), and substituting \(\widehat{Y_i} = \widehat{\beta_0} + \widehat{\beta_1} X_i\), we can write:
\[\sum_{i=1}^n X_i e_i = \sum_{i=1}^n X_i \left(Y_i - \widehat{\beta_0} - \widehat{\beta_1} X_i\right)\]
Expanding this expression gives:
\[\sum_{i=1}^n X_i e_i = \sum_{i=1}^n X_i Y_i - \widehat{\beta_0} \sum_{i=1}^n X_i - \widehat{\beta_1} \sum_{i=1}^n X_i^2\]
From our second normal equation, we know that:
\[ \frac{\partial RSS}{\partial \beta_1} = -2\left(\sum_{i=1}^n X_i Y_i - \widehat{\beta_0}\sum_{i=1}^n X_i - \widehat{\beta_1}\sum_{i=1}^n X_i^2\right) = 0 \]
Therefore, we conclude that:
\[\sum_{i=1}^n X_i e_i = 0\]
(e) \(\sum_{i=1}^n \widehat{Y}_ie_i =
0\); that is, the sum of the weighted residuals is zero when the
residual of the i-th observation is weighted by the fitted value of the
i-th observation.
We want to show that the sum of the weighted residuals is zero when each residual is weighted by its corresponding fitted value:
\[ \sum_{i=1}^n \widehat{Y_i} e_i = 0 \]
Substituting \(\widehat{Y_i} = \widehat{\beta_0} + \widehat{\beta_1} X_i\), we can write:
\[ \sum_{i=1}^n \widehat{Y_i} e_i = \sum_{i=1}^n (\widehat{\beta_0} + \widehat{\beta_1} X_i) e_i = \widehat{\beta_0} \sum_{i=1}^n e_i + \widehat{\beta_1} \sum_{i=1}^n X_i e_i \]
From part (b), we know that:
\[ \sum_{i=1}^n e_i = 0 \]
And from part (d), we know that:
\[ \sum_{i=1}^n X_i e_i = 0 \]
Substituting these results, we obtain:
\[ \sum_{i=1}^n \widehat{Y_i} e_i = \widehat{\beta_0}(0) + \widehat{\beta_1}(0) = 0 \]
Therefore, the sum of the fitted values weighted by their corresponding residuals is zero.
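As an optional numerical sanity check of parts (b) through (e), we can reuse the model fitted in Problem 1(c); each of the quantities below should be zero up to floating-point rounding error:
# Numerical check of identities (b)-(e) using the Problem 1 fit
e <- residuals(model)
c(sum_e      = sum(e),                          # part (b)
  total_diff = sum(df$Y) - sum(fitted(model)),  # part (c)
  sum_Xe     = sum(df$X * e),                   # part (d)
  sum_Yhat_e = sum(fitted(model) * e))          # part (e)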
(f) \(\sum_{i=1}^n e_i^2\) is
minimized
We want to show that the residual sum of squares, \(\sum_{i=1}^n e_i^2\), is minimized.
The least squares method minimizes the sum of squared residuals with respect to \(\beta_0\) and \(\beta_1\):
\[ RSS(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2 \]
where the residuals are defined as:
\[ e_i = Y_i - \widehat{Y_i} = Y_i - (\beta_0 + \beta_1 X_i) \]
The OLS estimates of \(\beta_0\) and \(\beta_1\) are defined as the values that minimize \(RSS(\beta_0, \beta_1)\). Because every term in the sum is a square, the RSS is nonnegative and bounded below, so a minimum exists.
To find the minimum, we take partial derivatives of the RSS function with respect to both parameters and set them equal to zero:
\[ \frac{\partial RSS}{\partial \beta_0} = -2\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i) = 0 \]
\[ \frac{\partial RSS}{\partial \beta_1} = -2\sum_{i=1}^n X_i (Y_i - \beta_0 - \beta_1 X_i) = 0 \]
Solving these two equations simultaneously gives the OLS estimates:
\[ \widehat{\beta_1} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} \]
\[ \widehat{\beta_0} = \bar{Y} - \widehat{\beta_1}\bar{X} \]
Because the residual sum of squares \(RSS\) is a sum of squared terms (and thus always greater than or equal to 0), and because both first derivatives equal zero at \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\), these parameter values form a critical point.
Since the matrix of second derivatives (the Hessian) of \(RSS\) is positive definite whenever the \(X_i\) are not all equal, this critical point is a minimum of the convex function \(RSS\).
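To make the second-order condition explicit, the following short supporting calculation (added here, assuming the \(X_i\) are not all identical) shows why the Hessian is positive definite:
\[\frac{\partial^2 RSS}{\partial \beta_0^2} = 2n, \qquad \frac{\partial^2 RSS}{\partial \beta_1^2} = 2\sum_{i=1}^n X_i^2, \qquad \frac{\partial^2 RSS}{\partial \beta_0 \, \partial \beta_1} = 2\sum_{i=1}^n X_i\]
\[\det(H) = 4\left(n\sum_{i=1}^n X_i^2 - \Big(\sum_{i=1}^n X_i\Big)^2\right) = 4n\sum_{i=1}^n \left(X_i - \bar{X}\right)^2 > 0\]
Both \(2n\) and \(\det(H)\) are positive, so the Hessian is positive definite and the critical point is the global minimum.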
Therefore, the OLS estimates \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) minimize the residual sum of squares, confirming that \(\sum_{i=1}^n e_i^2\) is minimized.