library(ggplot2)
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
str(alumni) # print structure of the alumni data frame
## 'data.frame':    48 obs. of  5 variables:
##  $ school                     : chr  "Boston College" "Brandeis University " "Brown University" "California Institute of Technology" ...
##  $ percent_of_classes_under_20: int  39 68 60 65 67 52 45 69 72 61 ...
##  $ student_faculty_ratio      : int  13 8 8 3 10 8 12 7 13 10 ...
##  $ alumni_giving_rate         : int  25 33 40 46 28 31 27 31 35 53 ...
##  $ private                    : int  1 1 1 1 1 1 1 1 1 1 ...

Question 1: Alumni Donation Data (Simple Linear Regression). Continuing with the alumni donation data, fit a simple linear regression model where the alumni giving rate (alumni_giving_rate) is the response variable (Y) of interest and the percentage of classes with fewer than 20 students (percent_of_classes_under_20) is the single predictor variable (X).

  a. What is the estimated slope? Is it significant at the α = 0.05 level? Clearly write out the null and alternative hypotheses, the observed t-statistic, and the p-value, and interpret the estimate and test results.

- The null hypothesis is H0: β1 = 0 (no linear relationship between the alumni giving rate and the percentage of classes with fewer than 20 students).
- The alternative hypothesis is Ha: β1 ≠ 0 (the alumni giving rate does have a linear relationship with the percentage of classes with fewer than 20 students).
- The estimated slope is 0.6578.

ggplot(alumni, aes(x = percent_of_classes_under_20, y = alumni_giving_rate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
       x = "% of Classes Under 20",
       y = "Alumni Giving Rate")
## `geom_smooth()` using formula = 'y ~ x'

linear <- lm(formula = alumni_giving_rate ~ percent_of_classes_under_20, 
   data = alumni)
summary(linear)
## 
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20, 
##     data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.053  -7.158  -1.660   6.734  29.658 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -7.3861     6.5655  -1.125    0.266    
## percent_of_classes_under_20   0.6578     0.1147   5.734 7.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared:  0.4169, Adjusted R-squared:  0.4042 
## F-statistic: 32.88 on 1 and 46 DF,  p-value: 7.228e-07

- The estimated slope is 0.6578.
- The observed t-statistic is 5.734.
- The p-value is 7.23e-07, which is far below α = 0.05, so we reject H0: each additional percentage point of classes with fewer than 20 students is associated with an estimated 0.6578-point increase in the alumni giving rate.
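
These numbers can also be pulled directly from the coefficient table rather than read off the printed summary; a minimal sketch, reusing the fitted `linear` object from above:

coef(summary(linear))["percent_of_classes_under_20", ]  # estimate, SE, t value, p-value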

  b. Repeat part a. above using the equivalent F-test.
linear <- lm(formula = alumni_giving_rate ~ percent_of_classes_under_20, 
   data = alumni)
f.test <- anova(linear)
f.statistic <- f.test$`F value`[1]  # extract the observed F-statistic
p.value <- f.test$`Pr(>F)`[1]       # extract the corresponding p-value
summary(linear)
## 
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20, 
##     data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.053  -7.158  -1.660   6.734  29.658 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -7.3861     6.5655  -1.125    0.266    
## percent_of_classes_under_20   0.6578     0.1147   5.734 7.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared:  0.4169, Adjusted R-squared:  0.4042 
## F-statistic: 32.88 on 1 and 46 DF,  p-value: 7.228e-07

- The estimated slope is unchanged at 0.6578 (the F-test does not re-estimate the model).
- The observed F-statistic is 32.88.
- The p-value is 7.228e-07.

- The F-test shows the slope is significant at the α = 0.05 level; with a single predictor the F-test is equivalent to the t-test, giving an identical p-value.
- We reject the null hypothesis and conclude that the alumni giving rate and the percentage of small classes are linearly related.
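
A quick numerical check of the F = t^2 identity in simple linear regression, reusing the fitted model:

t.stat <- coef(summary(linear))["percent_of_classes_under_20", "t value"]
t.stat^2                    # 5.734^2 is ~32.88
anova(linear)$`F value`[1]  # the same value from the ANOVA table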

  c. What is the value of R²? Please interpret.

- The value of R² is 0.4169, meaning about 41.7% of the variability in the alumni giving rate is explained by the percentage of classes with fewer than 20 students; the remaining 58.3% is due to other factors not captured by this model.

  d. What is the correlation coefficient r between percent_of_classes_under_20 and alumni_giving_rate? What is the relationship between r and R²?
r.correlation_coefficient <- cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate)
cat("Correlation Coefficient (r):", r.correlation_coefficient, "\n")
## Correlation Coefficient (r): 0.6456504

In simple linear regression, R² equals the square of the correlation coefficient: r² = 0.6457² ≈ 0.4169. R² (the coefficient of determination) measures the proportion of variance explained, on a 0-to-1 scale, while r additionally carries a sign that tells us whether the linear relationship is positive or negative.
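
This identity is easy to verify numerically, reusing the objects defined above:

r.correlation_coefficient^2  # ~0.4169
summary(linear)$r.squared    # matches Multiple R-squared from the model summary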

  e. Plot the training data (the data used to fit the model) with the fitted regression line and include a 95% (pointwise) confidence band for the mean response; the investr package makes this trivial. What do you observe about the confidence band at the point (X̄, Ȳ)? Is it narrower or wider compared to the rest?
library(investr)
library(ggplot2)
 
linear <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
ggplot(alumni, aes(x = percent_of_classes_under_20, y = alumni_giving_rate)) +
  geom_point(color = "gray40") +
  geom_smooth(method = "lm", se = TRUE, color = "blue", fill = "lightblue") +
  labs(
    title = "Regression Line with 95% Confidence Band",
    x = "% of Classes Under 20",
    y = "Alumni Giving Rate"
  )
## `geom_smooth()` using formula = 'y ~ x'
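
The confidence band is narrowest at the point (X̄, Ȳ): the fitted line always passes through it, and the standard error of the estimated mean response grows with (x − x̄)², so the band widens toward the extremes of X. Since the question points to the investr package, a minimal sketch of the same plot in base graphics (assuming investr is installed):

# Fitted line plus a shaded 95% pointwise confidence band for the mean response
investr::plotFit(linear, interval = "confidence", level = 0.95, shade = TRUE,
                 col.conf = "lightblue", pch = 19)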

Question 2. Simulation Study (Simple Linear Regression). Assume the mean function E(Y|X) = 10 + 5X. For this exercise, use set.seed(7052) to ensure reproducibility.

  a. Generate data with X ∼ N(μ = 2, σ = 0.1), sample size n = 100, and error term ϵ ∼ N(μ = 0, σ = 0.5).
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
error <- rnorm(n, mean = 0, sd = 0.5)  # Change to sd = 1 or n = 400 for other parts
Y <- 10 + 5 * X + error

# Fit linear model
model <- lm(Y ~ X)

# Calculate MSE (computed here as the mean squared residual, SSE/n)
mse <- mean(residuals(model)^2)
print(mse)
## [1] 0.1992276
  b. Fit a simple linear regression to the simulated data from part a. What is the estimated prediction equation? Report the estimated coefficients and their standard errors. Are they significant? Clearly write out the null and alternative hypotheses, observed t-statistics, and p-values, and interpret the estimates and test results. What is the fitted model's MSE?
set.seed(7052)

# Parameters (same as part a)
n <- 100
mu_x <- 2
sd_x <- 0.1
sigma_error <- 0.5
beta0 <- 10
beta1 <- 5

# Simulate X and error
X <- rnorm(n, mean = mu_x, sd = sd_x)
epsilon <- rnorm(n, mean = 0, sd = sigma_error)

# Generate Y
Y <- beta0 + beta1 * X + epsilon

# Fit linear regression model
model <- lm(Y ~ X)

# Summarize the model
summary_output <- summary(model)

# Extract estimated coefficients and standard errors
intercept_est <- summary_output$coefficients["(Intercept)", "Estimate"]
slope_est <- summary_output$coefficients["X", "Estimate"]
intercept_se <- summary_output$coefficients["(Intercept)", "Std. Error"]
slope_se <- summary_output$coefficients["X", "Std. Error"]
t_stat <- summary_output$coefficients["X", "t value"]
p_val <- summary_output$coefficients["X", "Pr(>|t|)"]

# Calculate MSE
mse <- mean(residuals(model)^2)

# Output
cat("\n--- Question 2(b) Answer ---\n")
## 
## --- Question 2(b) Answer ---
cat("Estimated Prediction Equation:\n")
## Estimated Prediction Equation:
cat("Ŷ =", round(intercept_est, 3), "+", round(slope_est, 3), "* X\n\n")
## Ŷ = 9.022 + 5.565 * X
cat("Estimated Coefficients:\n")
## Estimated Coefficients:
cat("  Intercept:", round(intercept_est, 4), "  |  SE:", round(intercept_se, 4), "\n")
##   Intercept: 9.0218   |  SE: 0.8336
cat("  Slope    :", round(slope_est, 4), "  |  SE:", round(slope_se, 4), "\n\n")
##   Slope    : 5.5652   |  SE: 0.4155
cat("Hypothesis Test for Slope:\n")
## Hypothesis Test for Slope:
cat("  H0: β1 = 0\n")
##   H0: β1 = 0
cat("  Ha: β1 ≠ 0\n")
##   Ha: β1 ≠ 0
cat("  t-statistic =", round(t_stat, 4), "\n")
##   t-statistic = 13.3955
cat("  p-value     =", signif(p_val, 4), "\n")
##   p-value     = 7.118e-24
# Significance interpretation
if (p_val < 0.05) {
  cat("  Conclusion: Reject H0 — the slope is statistically significant.\n")
} else {
  cat("  Conclusion: Fail to reject H0 — the slope is not statistically significant.\n")
}
##   Conclusion: Reject H0 — the slope is statistically significant.
cat("\nFitted Model's MSE:", round(mse, 4), "\n")
## 
## Fitted Model's MSE: 0.1992
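
As a quick sanity check (not required by the question), confint() should give a 95% interval for the slope that covers the true value β1 = 5 used to simulate the data:

confint(model)  # the row for X should contain the true slope of 5
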
  c. Repeat part b but re-simulate the data and change the error term to ϵ ∼ N(0, σ = 1).
# Set seed again for reproducibility
set.seed(7052)

# Parameters (same as before but larger error)
n <- 100
mu_x <- 2
sd_x <- 0.1
sigma_error <- 1   # <-- changed from 0.5 to 1
beta0 <- 10
beta1 <- 5

# Simulate new data
X <- rnorm(n, mean = mu_x, sd = sd_x)
epsilon <- rnorm(n, mean = 0, sd = sigma_error)
Y <- beta0 + beta1 * X + epsilon

# Fit linear regression model
model <- lm(Y ~ X)

# Get summary
summary_output <- summary(model)

# Extract slope and intercept info
intercept_est <- summary_output$coefficients["(Intercept)", "Estimate"]
slope_est <- summary_output$coefficients["X", "Estimate"]
intercept_se <- summary_output$coefficients["(Intercept)", "Std. Error"]
slope_se <- summary_output$coefficients["X", "Std. Error"]
t_stat <- summary_output$coefficients["X", "t value"]
p_val <- summary_output$coefficients["X", "Pr(>|t|)"]

# Calculate MSE
mse <- mean(residuals(model)^2)

# Output results
cat("\n--- Question 2(c) Answer ---\n")
## 
## --- Question 2(c) Answer ---
cat("Estimated Prediction Equation:\n")
## Estimated Prediction Equation:
cat("Ŷ =", round(intercept_est, 3), "+", round(slope_est, 3), "* X\n\n")
## Ŷ = 8.044 + 6.13 * X
cat("Estimated Coefficients:\n")
## Estimated Coefficients:
cat("  Intercept:", round(intercept_est, 4), "  |  SE:", round(intercept_se, 4), "\n")
##   Intercept: 8.0436   |  SE: 1.6673
cat("  Slope    :", round(slope_est, 4), "  |  SE:", round(slope_se, 4), "\n\n")
##   Slope    : 6.1303   |  SE: 0.8309
cat("Hypothesis Test for Slope:\n")
## Hypothesis Test for Slope:
cat("  H0: β1 = 0\n")
##   H0: β1 = 0
cat("  Ha: β1 ≠ 0\n")
##   Ha: β1 ≠ 0
cat("  t-statistic =", round(t_stat, 4), "\n")
##   t-statistic = 7.3779
cat("  p-value     =", signif(p_val, 4), "\n")
##   p-value     = 5.253e-11
if (p_val < 0.05) {
  cat("  Conclusion: Reject H0 — the slope is statistically significant.\n")
} else {
  cat("  Conclusion: Fail to reject H0 — the slope is not statistically significant.\n")
}
##   Conclusion: Reject H0 — the slope is statistically significant.
cat("\nFitted Model's MSE:", round(mse, 4), "\n")
## 
## Fitted Model's MSE: 0.7969
  d. Repeat parts a-c using n = 400. What do you conclude? What is the effect on the model parameter estimates when the error variance gets smaller? What is the effect when the sample size gets bigger?
set.seed(7052)

# Helper function to simulate and fit model
simulate_regression <- function(n, sigma_error) {
  X <- rnorm(n, mean = 2, sd = 0.1)
  epsilon <- rnorm(n, mean = 0, sd = sigma_error)
  Y <- 10 + 5 * X + epsilon
  model <- lm(Y ~ X)
  summary_output <- summary(model)
  
  # Extract info
  intercept_est <- summary_output$coefficients["(Intercept)", "Estimate"]
  slope_est <- summary_output$coefficients["X", "Estimate"]
  slope_se <- summary_output$coefficients["X", "Std. Error"]
  t_stat <- summary_output$coefficients["X", "t value"]
  p_val <- summary_output$coefficients["X", "Pr(>|t|)"]
  mse <- mean(residuals(model)^2)
  
  return(list(
    n = n,
    sigma_error = sigma_error,
    intercept = intercept_est,
    slope = slope_est,
    slope_se = slope_se,
    t_stat = t_stat,
    p_val = p_val,
    mse = mse
  ))
}

# Run simulations for the two n = 400 cases (the n = 100 cases appear in parts b and c)
result_400_low_var <- simulate_regression(n = 400, sigma_error = 0.5)
result_400_high_var <- simulate_regression(n = 400, sigma_error = 1.0)

# Print results
cat("\n--- Question 2(d): Results with n = 400 ---\n")
## 
## --- Question 2(d): Results with n = 400 ---
for (res in list(result_400_low_var, result_400_high_var)) {
  cat("\nSample Size:", res$n, " | Error SD:", res$sigma_error, "\n")
  cat("Estimated Prediction Equation: Ŷ =", round(res$intercept, 3), "+", round(res$slope, 3), "* X\n")
  cat("Slope SE:", round(res$slope_se, 4), "\n")
  cat("t-Statistic:", round(res$t_stat, 4), "\n")
  cat("p-Value:", signif(res$p_val, 4), "\n")
  cat("MSE:", round(res$mse, 4), "\n")
  cat(ifelse(res$p_val < 0.05, "Conclusion: Slope is significant.\n", "Conclusion: Slope is not significant.\n"))
}
## 
## Sample Size: 400  | Error SD: 0.5 
## Estimated Prediction Equation: Ŷ = 9.747 + 5.118 * X
## Slope SE: 0.249 
## t-Statistic: 20.5554 
## p-Value: 1.649e-64 
## MSE: 0.2376 
## Conclusion: Slope is significant.
## 
## Sample Size: 400  | Error SD: 1 
## Estimated Prediction Equation: Ŷ = 10.059 + 4.951 * X
## Slope SE: 0.4938 
## t-Statistic: 10.0245 
## p-Value: 3.078e-21 
## MSE: 0.9982 
## Conclusion: Slope is significant.
- Conclusion for part d: with n = 400 the estimates are much closer to the true values β0 = 10 and β1 = 5, and the slope SE shrinks (from 0.83 to 0.49 at σ = 1). Smaller error variance also shrinks the standard errors (0.25 versus 0.49 at n = 400), so precision improves both when σ^2 decreases and when n increases.

  e. What about the MSE from each model?

- The MSE estimates the error variance σ^2, so it increases with σ^2: about 0.20 versus 0.80 at n = 100, and 0.24 versus 1.00 at n = 400, for σ = 0.5 versus σ = 1.
- Increasing the sample size does not shrink the MSE (it converges to the true σ^2), but it does stabilize the estimate and improves the accuracy of the coefficient estimates even when the error variance is high.

Question 3. Bias and Variance of Parameter Estimates.

  a. What are the bias and variance of the OLS estimates β̂0 and β̂1?

- Both β̂0 and β̂1 are unbiased: E[β̂0] = β0 and E[β̂1] = β1. Their variances are proportional to the error variance σ^2: Var(β̂1) = σ^2 / Σ(x_i − x̄)^2 and Var(β̂0) = σ^2 [1/n + x̄^2 / Σ(x_i − x̄)^2].
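
A small Monte Carlo sketch of both claims, reusing the Question 2 design (n = 100, X ∼ N(2, 0.1), σ = 0.5) as an assumed setup:

set.seed(7052)
slopes <- replicate(5000, {
  x <- rnorm(100, mean = 2, sd = 0.1)
  y <- 10 + 5 * x + rnorm(100, mean = 0, sd = 0.5)
  coef(lm(y ~ x))[["x"]]  # keep only the estimated slope
})
mean(slopes)  # close to the true slope of 5 (unbiased)
var(slopes)   # close to sigma^2 / sum((x - xbar)^2), roughly 0.25 / 0.99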

  b. What do you expect to happen to the variances of the OLS estimates of β̂0 and β̂1 when the sample size n increases? What do you expect when the error variance σ^2 increases?

- As n increases, Σ(x_i − x̄)^2 grows, so both variances shrink toward zero; as σ^2 increases, both variances increase proportionally. This matches the simulation in Question 2, where the slope SE fell from 0.83 to 0.49 when n went from 100 to 400 and doubled when σ went from 0.5 to 1.

  c. What is the bias of the model's MSE? What about the ML estimate of σ^2? What is the difference between these two estimates of σ^2? Why do we use MSE instead of the ML estimate?

The MSE = SSE/(n − 2) is an unbiased estimator of the true error variance σ^2 because it divides by the residual degrees of freedom, accounting for the two estimated parameters. The ML estimate of σ^2 is SSE/n, which ignores this adjustment and therefore underestimates the variance; its bias is −2σ^2/n, which vanishes only as n grows. We use the MSE instead of the ML estimate because it gives an unbiased, and hence more accurate, reflection of the data's variability in finite samples.
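
A short simulation sketch of this bias, under an assumed small-sample setup (n = 10, σ = 0.5, so σ^2 = 0.25) where the difference is easy to see:

set.seed(7052)
est <- replicate(5000, {
  x <- rnorm(10, mean = 2, sd = 0.1)
  y <- 10 + 5 * x + rnorm(10, mean = 0, sd = 0.5)
  sse <- sum(residuals(lm(y ~ x))^2)
  c(mse = sse / (10 - 2), ml = sse / 10)  # unbiased vs. ML estimate of sigma^2
})
rowMeans(est)  # MSE averages near 0.25; the ML estimate averages near 0.20, biased low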