BANA7052 Fall 2025 Homework 1

Eli Bales     Vivian Comer     Andrew McCurrach     Devin Walker     Kazuhide Watanabe

October 28, 2025


Tools & Packages

# Load the required packages
library(dplyr)       # data manipulation
library(ggplot2)     # plotting
library(kableExtra)  # formatted tables
library(patchwork)   # combining ggplots


Problem 1

A Simulation Study (Simple Linear Regression). Assuming the mean response is \(E\left(Y|X\right)= 10 + 5X\):


(a) Generate data with \(X \sim N\left(\mu = 2, \sigma = 0.1\right)\), sample size \(n = 100\), and error term \(\epsilon \sim N\left(\mu = 0, \sigma = 0.5\right)\).
Hint: You can use rnorm(n = 50, mean = 5, sd = 3) to simulate \(n = 50\) observations from a \(N\left(\mu = 5, \sigma = 3\right)\) distribution, but note that rnorm() specifies the standard deviation (\(\sigma\)), rather than the variance (\(\sigma^2\)), of the normal distribution. It is also good practice to specify the random seed via set.seed() whenever generating random data. For this exercise, use set.seed(7052) to ensure reproducibility.


set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
error <- rnorm(n, mean = 0, sd = 0.5)
Y <- 10 + 5*X + error

We generated the data by simulating the predictor variable \(X\) and the random error term using the rnorm() function. The response variable \(Y\) was then constructed according to the simple linear regression model:

\[ Y = 10 + 5X + \varepsilon, \]

where \(\varepsilon \sim N(0, 0.5^2)\).



(b) Show summary statistics of the response variable and predictor variable. Are there outliers? What is the correlation coefficient? Draw a scatter plot.

df <- data.frame(X = X, Y = Y)

# Creating a table of summary statistics
summary_stats <- data.frame(
  Min = sapply(df, min, na.rm = TRUE),
  Q1 = sapply(df, function(x) quantile(x, 0.25, na.rm = TRUE)),
  Median = sapply(df, median, na.rm = TRUE),
  Mean = sapply(df, mean, na.rm = TRUE),
  Q3 = sapply(df, function(x) quantile(x, 0.75, na.rm = TRUE)),
  Max = sapply(df, max, na.rm = TRUE),
  SD = sapply(df, sd, na.rm = TRUE)
)

# Transpose so each statistic becomes a row; note that [,-1] drops the Min column
rotated_summary <- round(as.data.frame(t(summary_stats[,-1])), 3)

rotated_summary %>%
  kbl(caption = "Table 1: Summary Statistics: X and Y", align = "c") %>%
  kable_classic(full_width = F, html_font = "Cambria") %>%
  column_spec(2, width = "6em") %>%
  column_spec(3, width = "6em") %>%
  row_spec(0, align = "c")
Table 1: Summary Statistics: X and Y

              X         Y
  Q1       1.923    19.665
  Median   2.001    20.110
  Mean     2.004    20.173
  Q3       2.070    20.701
  Max      2.243    21.795
  SD       0.109     0.755

# Boxplot X
p1 <- ggplot(df, aes(y = X)) +
  geom_boxplot(fill = "#B0B0B0") +
  labs(
    title = "Plot X",
    x = "",
    y = ""
  ) + 
  theme_bw(base_family = "serif") +
  theme(
    axis.text.x = element_blank(),       
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5) 
  )

# Boxplot Y
p2 <- ggplot(df, aes(y = Y)) +
  geom_boxplot(fill = "#B0B0B0") +
  labs(
    title = "Plot Y",
    x = "",
    y = ""
  ) + 
  theme_bw(base_family = "serif") +
  theme(
    axis.text.x = element_blank(),      
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5) 
  )

# Put boxplots next to each other
p1 + p2 +
    plot_annotation(
    title = "Figure 1: X vs. Y Boxplots",
    theme = theme(
      plot.title = element_text(
        family = "Times New Roman", 
        size = 15
      )
    )
  )

ggplot(df, aes(x = X, y = Y)) +
  geom_point(color = "#36454F") +
  labs(
    x = "X",
    y = "Y"
  ) +
  theme_bw(base_family = "serif") +
  theme(
    axis.title.y = element_text(angle = 0, hjust = 0.5, vjust = 0.5,
                                margin = margin(r = 8))) +
  plot_annotation(
  title = "Figure 2: X vs. Y Scatterplot",
  theme = theme(
    plot.title = element_text(
      family = "Times New Roman", 
      size = 15
      )
    )
  )

Correlation Coefficient

cat("Correlation Coefficient:", cor(df$X, df$Y))
## Correlation Coefficient: 0.8042198

The correlation coefficient of approximately 0.80 indicates a strong positive linear relationship between X and Y.


Outliers

Examining both variables, the mean and median values for X (mean = 2.004, median = 2.001) and Y (mean = 20.173, median = 20.110) are closely aligned, indicating roughly symmetric distributions with no apparent skewness, and the boxplots in Figure 1 show no extreme values. Figure 2 further supports this observation: the scatter plot shows no potential outliers and only modest dispersion around the linear trend.



(c) Fit a simple linear regression. What is the estimated model? Report the estimated coefficients. What is the model mean squared error (MSE)?

# Fit the simple linear regression model
model <- lm(Y ~ X, data = df)

# Produce summary statistics
summary_model <- summary(model)

Estimated Model

The estimated simple linear regression model is:

\[\hat{Y} = 9.022 + 5.565X\]


Estimated Coefficients

coef_table <- as.data.frame(summary_model$coefficients)

names(coef_table) <- c("Estimate", "Std Error", "t Value", "p Value")

coef_table$`p Value` <- format(coef_table$`p Value`, scientific = TRUE, digits = 2)

coef_table %>%
  kbl(
    caption = "Table 2: Estimated Coefficients for Model: Y ~ X",
    align = "c",
    digits = 3
  ) %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 2: Estimated Coefficients for Model: Y ~ X

                Estimate   Std Error   t Value   p Value
  (Intercept)      9.022       0.834    10.822   2.0e-18
  X                5.565       0.415    13.395   7.1e-24

Model Performance

cat("Mean Squared Error:",(mse <- mean(model$residuals^2)))
## Mean Squared Error: 0.1992276
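
Note that mean(model$residuals^2) divides the residual sum of squares by \(n\). Many regression texts instead define the model MSE as the unbiased estimate of the error variance, which divides by the residual degrees of freedom \(n - 2\). The short sketch below (our own check, not part of the required output) computes both versions from the fitted model object.

# Residual sum of squares from the fitted model
ss_res <- sum(residuals(model)^2)

c(mse_n         = ss_res / length(residuals(model)),  # divides by n (value reported above)
  mse_n_minus_2 = ss_res / df.residual(model),        # divides by n - 2 (estimate of sigma^2)
  sigma_hat     = sigma(model))                       # residual standard error = sqrt(RSS / (n - 2))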


(d) What is the sample mean of both \(X\) and \(Y\)? Plot the fitted regression line and the point \(\left(\bar{X}, \bar{Y}\right)\). What do you find?

# Calculate means
x_mean <- mean(df$X)
y_mean <- mean(df$Y)

ggplot(df, aes(x = X, y = Y)) +
  geom_point(color = "gray40", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "darkorange", linewidth = 1) +
  geom_point(aes(x = x_mean, y = y_mean), color = "blue", size = 3) +
  labs(
    x = "X",
    y = "Y"
  ) +
  theme_bw(base_family = "serif") +
  theme(
    axis.title.y = element_text(angle = 0, hjust = 0.5, vjust = 0.5,
                                margin = margin(r = 8))) +
  plot_annotation(
  title = "Figure 3: Mean of X and Y with Regression Line",
  theme = theme(
    plot.title = element_text(
      family = "Times New Roman", 
      size = 15
      )
    )
  )

Sample Means

The sample means are \(\bar{X} \approx 2.004\) and \(\bar{Y} \approx 20.173\), matching the values reported in Table 1.


Findings

Looking at Figure 3, we see that the fitted regression line passes directly through the blue point \(\left(\bar{X}, \bar{Y}\right)\): the mean point of the data lies exactly on the OLS line, which is the property derived in Problem 3(a). A quick numerical check of this follows below.
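
As a small sanity check (our own illustration, reusing the coefficients and sample means computed above), the fitted value of the regression line evaluated at \(\bar{X}\) reproduces \(\bar{Y}\):

# Fitted value of the regression line at the sample mean of X
yhat_at_xbar <- coef(model)[1] + coef(model)[2] * x_mean

c(fitted_at_xbar = unname(yhat_at_xbar),  # should equal the mean of Y
  mean_of_Y      = y_mean)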



Problem 2

Ordinary least squares (OLS) is typically used to estimate the regression coefficients \(\beta_0\) and \(\beta_1\) in the simple linear regression model by minimizing the residual sum of squares (RSS)

\[ RSS\left(\beta_0, \beta_1\right) = \sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right)^2 = \sum_{i=1}^n \epsilon_i^2 \]
(a) How about minimizing \(\sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right) = \sum_{i=1}^n \epsilon_i\), compared to minimizing RSS?

If we try to minimize \(\sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right) = \sum_{i=1}^n \epsilon_i\), we should first recognize that the objective is a linear function of \(\beta_0\) and \(\beta_1\). Expanding the sum gives:

\[\sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right) = \sum_{i=1}^n Y_i - n\beta_0 - \beta_1\sum_{i=1}^n X_i\]

This is a linear function of \(\beta_0\) and \(\beta_1\), with no higher-order terms to introduce curvature; in 3D space it is a tilted plane. To minimize a function we need some sort of "valley", a point where the objective stops decreasing, but a tilted plane has none. By letting \(\beta_0\) grow without bound (or choosing \(\beta_1\) appropriately), the sum of residuals can be driven toward \(-\infty\), so no minimizer exists. In addition, large positive and negative residuals cancel each other out, so even a sum near zero would not indicate a good fit. For both reasons, minimizing the raw sum of residuals is not a usable criterion, whereas the RSS is bounded below and has a well-defined minimum.
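
As a quick numerical illustration (our own sketch, reusing the simulated df from Problem 1 with an arbitrary fixed slope of 5), the sum of raw residuals keeps decreasing as the candidate intercept grows, so there is nothing to minimize:

# Sum of raw (unsquared) residuals for a candidate intercept b0 and slope b1
sum_resid <- function(b0, b1, x = df$X, y = df$Y) sum(y - b0 - b1 * x)

# The sums decrease without bound as the intercept increases
sapply(c(10, 100, 1000, 10000), function(b0) sum_resid(b0, b1 = 5))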



(b) How about minimizing \(\sum_{i=1}^n\left|Y_i - \beta_0 - \beta_1 X_i\right| = \sum_{i=1}^n \left|\epsilon_i\right|\), compared to minimizing RSS?

Now we want to minimize the sum of the absolute values of the residuals:

\[\sum_{i=1}^n\left|Y_i - \beta_0 - \beta_1 X_i\right| = \sum_{i=1}^n \left|\epsilon_i\right|\]

We now have a piecewise-linear, convex function of \(\beta_0\) and \(\beta_1\). Unlike part (a), this objective is bounded below by zero, so a minimum does exist; this criterion is known as least absolute deviations (LAD) regression. The difficulty is that, visualized as a surface, it is not a smooth bowl but a collection of flat facets meeting at sharp, v-shaped edges, so the function is not differentiable wherever a residual equals zero. As a result there is no closed-form solution, the minimizer need not be unique, and the estimates must be found numerically (for example, by linear programming). Compared with minimizing the RSS, this criterion is more cumbersome to work with, although it is less sensitive to outliers.
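
As an illustration (our own sketch, not part of the assignment), the LAD criterion can still be minimized numerically, here with a general-purpose optimizer applied to the simulated data from Problem 1 and compared with the OLS fit:

# Sum of absolute residuals as a function of the coefficient vector beta = (b0, b1)
lad_obj <- function(beta, x = df$X, y = df$Y) sum(abs(y - beta[1] - beta[2] * x))

ols_coef <- coef(lm(Y ~ X, data = df))          # OLS estimates (closed form via lm)
lad_fit  <- optim(par = ols_coef, fn = lad_obj) # numerical LAD fit, started at the OLS estimates

rbind(OLS = ols_coef, LAD = lad_fit$par)        # compare the two sets of estimates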



(c) Why is OLS a popular choice for estimating \(\beta_0\) and \(\beta_1\)?

Ordinary least squares minimizes the residual sum of squares. The "squares" portion is important here, because squaring the residuals makes the objective a quadratic function of \(\beta_0\) and \(\beta_1\). Expanding the function:

\[\sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right)^2 = \sum_{i=1}^n\left(Y_i^2 - 2Y_i\beta_0 - 2Y_i \beta_1 X_i + \beta_0^2 + 2\beta_0 \beta_1 X_i + \beta_1^2 X_i^2\right)\]

The \(\beta_0^2\), \(\beta_1^2\), and \(\beta_0\beta_1\) terms show that the RSS is a quadratic, convex function of \(\beta_0\) and \(\beta_1\); in 3D space its graph is an upward-opening paraboloid with a single global minimum. That is why OLS is so popular: the objective is smooth and bounded below, so setting the two partial derivatives to zero yields simple closed-form estimates of \(\beta_0\) and \(\beta_1\), making the minimization both possible and computationally easy.
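
A quick numerical illustration (our own sketch, reusing the simulated df from Problem 1): the RSS evaluated at the OLS estimates is smaller than at nearby coefficient values, consistent with a single global minimum of a convex paraboloid.

# RSS as a function of candidate coefficients
rss <- function(b0, b1) sum((df$Y - b0 - b1 * df$X)^2)

b <- coef(lm(Y ~ X, data = df))   # OLS estimates
c(at_ols_estimates  = rss(b[1], b[2]),
  intercept_shifted = rss(b[1] + 0.5, b[2]),
  slope_shifted     = rss(b[1], b[2] + 0.5))   # both shifted values are larger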



Problem 3

Establish the following relationships for the simple linear regression model. (Some are trivial to show.)


(a) The fitted line passes through the point \(\left(\bar{X}, \bar{Y}\right)\).

We want to show that our regression line passes through the point \(\left(\bar{X}, \bar{Y}\right)\), representing the means of X and Y.

From the ordinary least squares (OLS) normal equations, we know:


\[\widehat{\beta_0} = \bar{Y} - \widehat{\beta_1} \bar{X}\]

The fitted regression line can be written as:


\[\widehat{Y}(X) = \widehat{\beta_0} + \widehat{\beta_1} X \]

Substituting \(X = \bar{X}\) and \(\widehat{\beta_0} = \bar{Y} - \widehat{\beta_1} \bar{X}\) gives:


\[\widehat{Y}(\bar{X}) = (\bar{Y} - \widehat{\beta_1} \bar{X}) + \widehat{\beta_1} \bar{X} = \bar{Y}\]

Therefore, the fitted regression line passes exactly through the mean point \(\left(\bar{X}, \bar{Y}\right)\).



(b) \(\sum_{i=1}^n e_i = 0\)

We want to show that the sum of the residuals equals zero: \[\sum_{i=1}^n e_i = 0\]

The residuals, \(e_i\), are defined as the vertical difference between the observed and fitted values:


\[\ e_i = Y_i - \widehat{Y_i}\]

The OLS estimates \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) are obtained by minimizing:

\[ RSS\left(\beta_0, \beta_1\right) = \sum_{i=1}^n\left(Y_i - \beta_0 - \beta_1 X_i\right)^2 = \sum_{i=1}^n \epsilon_i^2\]

Taking the partial derivative of RSS with respect to \(\beta_0\) and evaluating it at the estimates gives:

\[\frac{\partial RSS}{\partial \beta_0} = -2\sum_{i=1}^n\left(Y_i - \widehat{\beta_0} - \widehat{\beta_1} X_i\right) = -2\sum_{i=1}^n e_i\]

Because \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) minimize RSS, this partial derivative must equal zero:

\[-2\sum_{i=1}^n e_i = 0\]

Therefore, we conclude that:

\[\sum_{i=1}^n e_i = 0\]
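
As a quick numerical check (our own illustration, using the model fitted in Problem 1(c)), the residuals do sum to zero up to floating-point rounding:

# Sum of the residuals from the fitted model; should be zero up to rounding error
sum(residuals(model))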



(c) \(\sum_{i=1}^n Y_i = \sum_{i=1}^n \widehat{Y}_i\)

We want to show that the sum of the observed values equals the sum of the fitted values:

\[\sum_{i=1}^n Y_i = \sum_{i=1}^n \widehat{Y}_i\]

Recall that the residuals are defined as:


\[\ e_i = Y_i - \widehat{Y_i}\]

From part (b), we know that the sum of the residuals is zero:

\[\sum_{i=1}^n e_i = 0\]

Substituting \(\ e_i = Y_i - \widehat{Y_i}\) into this expression gives:

\[\sum_{i=1}^n (Y_i - \widehat{Y_i}) = 0 \]

Simplifying, we get:

\[\sum_{i=1}^n Y_i = \sum_{i=1}^n \widehat{Y_i} \]

Therefore, the total of the observed values equals the total of the fitted values.



(d) \(\sum_{i=1}^n X_ie_i = 0\); that is, the sum of the weighted residuals is zero when the residual of the i-th observation is weighted by the predictor value of the i-th observation.

We want to show that the sum of the weighted residuals equals zero:

\[\sum_{i=1}^n X_ie_i = 0\]

Starting from the definition of the residuals, \(e_i = Y_i - \widehat{Y_i}\), and substituting \(\widehat{Y_i} = \widehat{\beta_0} + \widehat{\beta_1} X_i\), we can write:

\[\sum_{i=1}^n X_i e_i = \sum_{i=1}^n X_i (Y_i - \widehat{\beta_0} - \widehat{\beta_1} X_i)\]

Expanding this expression gives:

\[\sum_{i=1}^n X_i Y_i - \widehat{\beta_0} \sum_{i=1}^n X_i - \widehat{\beta_1} \sum_{i=1}^n X_i^2\]

From our second normal equation, we know that:

\[ \frac{\partial RSS}{\partial \beta_1} = -2\left(\sum_{i=1}^n X_i Y_i - \widehat{\beta_0}\sum_{i=1}^n X_i - \widehat{\beta_1}\sum_{i=1}^n X_i^2\right) = 0 \]

Therefore, we conclude that:

\[\sum_{i=1}^n X_i e_i = 0\]



(e) \(\sum_{i=1}^n \widehat{Y}_ie_i = 0\); that is, the sum of the weighted residuals is zero when the residual of the i-th observation is weighted by the fitted value of the i-th observation.

We want to show that the sum of the weighted residuals is zero when each residual is weighted by its corresponding fitted value:

\[ \sum_{i=1}^n \widehat{Y_i} e_i = 0 \]

Substituting \(\widehat{Y_i} = \widehat{\beta_0} + \widehat{\beta_1} X_i\), we can write:

\[ \sum_{i=1}^n \widehat{Y_i} e_i = \sum_{i=1}^n (\widehat{\beta_0} + \widehat{\beta_1} X_i) e_i = \widehat{\beta_0} \sum_{i=1}^n e_i + \widehat{\beta_1} \sum_{i=1}^n X_i e_i \]

From part (b), we know that:

\[ \sum_{i=1}^n e_i = 0 \]

And from part (d), we know that:

\[ \sum_{i=1}^n X_i e_i = 0 \]

Substituting these results, we obtain:

\[ \sum_{i=1}^n \widehat{Y_i} e_i = \widehat{\beta_0}(0) + \widehat{\beta_1}(0) = 0 \]

Therefore, the sum of the fitted values weighted by their corresponding residuals is zero.
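
The identities in parts (c), (d), and (e) can also be confirmed numerically (our own illustration, using the model fitted in Problem 1(c)); each quantity below should be zero up to floating-point rounding:

# Numerical checks of parts (c), (d), and (e) for the fitted model
c(part_c = sum(df$Y) - sum(fitted(model)),          # sum of Y minus sum of fitted values
  part_d = sum(df$X * residuals(model)),            # residuals weighted by X
  part_e = sum(fitted(model) * residuals(model)))   # residuals weighted by fitted values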



(f) \(\sum_{i=1}^n e_i^2\) is minimized

We want to show that the residual sum of squares, \(\sum_{i=1}^n e_i^2\), is minimized.

The least squares method minimizes the sum of squared residuals with respect to \(\beta_0\) and \(\beta_1\):

\[ RSS(\beta_0, \beta_1) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2 \]

where the residuals are defined as:

\[ e_i = Y_i - \widehat{Y_i} = Y_i - (\widehat{\beta_0} + \widehat{\beta_1} X_i) \]

The OLS estimates of \(\beta_0\) and \(\beta_1\) are, by definition, the values that minimize \(RSS(\beta_0, \beta_1)\).
Since each squared term is greater than or equal to zero, the RSS is bounded below by zero, so a minimum exists.

To find the minimum, we take partial derivatives of the RSS function with respect to both parameters and set them equal to zero:

\[ \frac{\partial RSS}{\partial \beta_0} = -2\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i) = 0 \]

\[ \frac{\partial RSS}{\partial \beta_1} = -2\sum_{i=1}^n X_i (Y_i - \beta_0 - \beta_1 X_i) = 0 \]

Solving these two equations simultaneously gives the OLS estimates:

\[ \widehat{\beta_1} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} \]

\[ \widehat{\beta_0} = \bar{Y} - \widehat{\beta_1}\bar{X} \]

Because the residual sum of squares \(RSS\) is a sum of squared terms (and thus always greater than or equal to 0), and because both first derivatives equal zero at \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\), these parameter values form a critical point.
Since \(RSS\) is a convex quadratic in \(\left(\beta_0, \beta_1\right)\) (its matrix of second derivatives is positive semi-definite, and positive definite whenever the \(X_i\) are not all equal), this critical point is a global minimum.

Therefore, the OLS estimates \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) minimize the residual sum of squares, confirming that \(\sum_{i=1}^n e_i^2\) is minimized.
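
As a final numerical check (our own sketch, reusing the simulated data and fitted model from Problem 1), the closed-form estimates above reproduce the coefficients returned by lm():

# Closed-form OLS estimates computed directly from the formulas above
b1_hat <- sum((df$X - mean(df$X)) * (df$Y - mean(df$Y))) /
          sum((df$X - mean(df$X))^2)
b0_hat <- mean(df$Y) - b1_hat * mean(df$X)

rbind(formula = c(b0_hat, b1_hat),     # estimates from the closed-form expressions
      lm_fit  = unname(coef(model)))   # estimates from lm() in Problem 1(c)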