library(ggplot2)
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
str(alumni) # print structure of the alumni data frame
## 'data.frame': 48 obs. of 5 variables:
## $ school : chr "Boston College" "Brandeis University " "Brown University" "California Institute of Technology" ...
## $ percent_of_classes_under_20: int 39 68 60 65 67 52 45 69 72 61 ...
## $ student_faculty_ratio : int 13 8 8 3 10 8 12 7 13 10 ...
## $ alumni_giving_rate : int 25 33 40 46 28 31 27 31 35 53 ...
## $ private : int 1 1 1 1 1 1 1 1 1 1 ...
Question 1: Alumni Donation Data (Simple Linear Regression). Continuing with the alumni donation data, fit a simple linear regression model to the data, where the alumni giving rate (alumni_giving_rate) is the response variable (Y) of interest and the percentage of classes with fewer than 20 students (percent_of_classes_under_20) is the single predictor variable (X).
-The null hypothesis is that the slope is zero (H0: β1 = 0), i.e., the alumni giving rate has no linear relationship with the percentage of classes with fewer than 20 students. -The alternative hypothesis is that the slope is nonzero (Ha: β1 ≠ 0), i.e., such a relationship exists. -The estimated slope is 0.6578.
ggplot(alumni, aes(x = percent_of_classes_under_20, y = alumni_giving_rate)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(
x = "% of Classes Under 20",
y = "Alumni Giving Rate")
## `geom_smooth()` using formula = 'y ~ x'
linear <- lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
data = alumni)
summary(linear)
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
## data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.053 -7.158 -1.660 6.734 29.658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.3861 6.5655 -1.125 0.266
## percent_of_classes_under_20 0.6578 0.1147 5.734 7.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared: 0.4169, Adjusted R-squared: 0.4042
## F-statistic: 32.88 on 1 and 46 DF, p-value: 7.228e-07
-The estimated slope is 0.6578 -The t-statistic is 5.734 -The p-value is 7.228e-07, so we reject H0 at the α = 0.05 level.
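As a check, these values can be pulled straight from the fitted model rather than read off the printout; a small sketch using the coefficient matrix that summary() returns:
coefs <- coef(summary(linear)) # matrix of estimates, SEs, t values, and p-values
coefs["percent_of_classes_under_20", c("Estimate", "t value", "Pr(>|t|)")]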
linear <- lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
data = alumni)
f.test <- anova(linear)
f.statistic <- f.test$`F value`[1] # extract the F statistic for the slope
p.value <- f.test$`Pr(>F)`[1] # extract the corresponding p-value
summary(linear)
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
## data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.053 -7.158 -1.660 6.734 29.658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.3861 6.5655 -1.125 0.266
## percent_of_classes_under_20 0.6578 0.1147 5.734 7.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared: 0.4169, Adjusted R-squared: 0.4042
## F-statistic: 32.88 on 1 and 46 DF, p-value: 7.228e-07
-The estimated slope is 0.6578 -The F statistic is 32.88389 -The p-value is 7.228121e-07
-The F-test shows the slope is significant at the α = 0.05 level. -We reject the null hypothesis and conclude that the alumni giving rate and class size are linearly related. In simple linear regression the F-test is equivalent to the t-test for the slope (F = t²), as the sketch below confirms.
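A quick numeric check of the F = t² equivalence, reusing the f.statistic extracted above (a sketch, not part of the required answer):
# The overall F statistic equals the squared t statistic for the slope
# (5.734^2 ≈ 32.88)
t.stat <- coef(summary(linear))["percent_of_classes_under_20", "t value"]
c(t_squared = t.stat^2, F_statistic = f.statistic)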
-The R-squared value of 0.4169 means that about 41.69% of the variation in alumni giving rate is explained by the percentage of classes with fewer than 20 students; the remaining 58.31% of the variation is left unexplained by this predictor.
r.correlation_coefficient <- cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate)
cat("Correlation Coefficient (r):", r.correlation_coefficient, "\n")
## Correlation Coefficient (r): 0.6456504
In simple linear regression, R2 is the square of the correlation coefficient r (here 0.6456504² ≈ 0.4169). R2 is the coefficient of determination and measures, on a 0-to-1 scale, how much of the variance in Y is explained by X, while the sign of r gives the direction (positive or negative) of the linear relationship.
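This can be verified numerically with the objects already computed above:
# In simple linear regression, r^2 equals the model's R-squared
c(r_squared = r.correlation_coefficient^2,
  model_R2 = summary(linear)$r.squared) # both ≈ 0.4169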
library(investr)
library(ggplot2)
linear <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
summary(linear)
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
## data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.053 -7.158 -1.660 6.734 29.658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.3861 6.5655 -1.125 0.266
## percent_of_classes_under_20 0.6578 0.1147 5.734 7.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared: 0.4169, Adjusted R-squared: 0.4042
## F-statistic: 32.88 on 1 and 46 DF, p-value: 7.228e-07
ggplot(alumni, aes(x = percent_of_classes_under_20, y = alumni_giving_rate)) +
geom_point(color = "gray40") +
geom_smooth(method = "lm", se = TRUE, color = "blue", fill = "lightblue") +
labs(
title = "Regression Line with 95% Confidence Band",
x = "% of Classes Under 20",
y = "Alumni Giving Rate"
)
## `geom_smooth()` using formula = 'y ~ x'
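Since investr is loaded above but not otherwise used, its plotFit() function offers a base-graphics alternative; a sketch (argument names assumed from the package documentation) that also shades the wider 95% prediction band:
# Confidence band (inner) and prediction band (outer) around the fitted line
plotFit(linear, interval = "both", level = 0.95, shade = TRUE,
        col.conf = "lightblue", col.pred = "lightgray")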
Question 2. Simulation Study (Simple Linear Regression). Assume the mean function E(Y|X) = 10 + 5*X. For this exercise, use set.seed(7052) to ensure reproducibility.
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
error <- rnorm(n, mean = 0, sd = 0.5) # Change to sd = 1 or n = 400 for other parts
Y <- 10 + 5 * X + error
# Fit linear model
model <- lm(Y ~ X)
# Calculate MSE
mse <- mean(residuals(model)^2)
print(mse)
## [1] 0.1992276
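One caveat worth flagging here (it returns in Question 3): mean(residuals(model)^2) divides the SSE by n, which is the ML estimate of the error variance; the unbiased regression MSE divides by the residual degrees of freedom n − 2. A quick comparison sketch, where sigma(model)^2 gives SSE/(n − 2) directly:
# SSE/n (as computed above) vs. the unbiased MSE = SSE/(n - 2)
sse <- sum(residuals(model)^2)
c(sse_over_n = sse / length(Y),
  unbiased_mse = sse / df.residual(model),
  sigma_sq_hat = sigma(model)^2) # the last two agree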
set.seed(7052)
# Parameters (same as part a)
n <- 100
mu_x <- 2
sd_x <- 0.1
sigma_error <- 0.5
beta0 <- 10
beta1 <- 5
# Simulate X and error
X <- rnorm(n, mean = mu_x, sd = sd_x)
epsilon <- rnorm(n, mean = 0, sd = sigma_error)
# Generate Y
Y <- beta0 + beta1 * X + epsilon
# Fit linear regression model
model <- lm(Y ~ X)
# Summarize the model
summary_output <- summary(model)
# Extract estimated coefficients and standard errors
intercept_est <- summary_output$coefficients["(Intercept)", "Estimate"]
slope_est <- summary_output$coefficients["X", "Estimate"]
intercept_se <- summary_output$coefficients["(Intercept)", "Std. Error"]
slope_se <- summary_output$coefficients["X", "Std. Error"]
t_stat <- summary_output$coefficients["X", "t value"]
p_val <- summary_output$coefficients["X", "Pr(>|t|)"]
# Calculate MSE
mse <- mean(residuals(model)^2)
# Output
cat("\n--- Question 2(b) Answer ---\n")
##
## --- Question 2(b) Answer ---
cat("Estimated Prediction Equation:\n")
## Estimated Prediction Equation:
cat("Ŷ =", round(intercept_est, 3), "+", round(slope_est, 3), "* X\n\n")
## Ŷ = 9.022 + 5.565 * X
cat("Estimated Coefficients:\n")
## Estimated Coefficients:
cat(" Intercept:", round(intercept_est, 4), " | SE:", round(intercept_se, 4), "\n")
## Intercept: 9.0218 | SE: 0.8336
cat(" Slope :", round(slope_est, 4), " | SE:", round(slope_se, 4), "\n\n")
## Slope : 5.5652 | SE: 0.4155
cat("Hypothesis Test for Slope:\n")
## Hypothesis Test for Slope:
cat(" H0: β1 = 0\n")
## H0: β1 = 0
cat(" Ha: β1 ≠ 0\n")
## Ha: β1 ≠ 0
cat(" t-statistic =", round(t_stat, 4), "\n")
## t-statistic = 13.3955
cat(" p-value =", signif(p_val, 4), "\n")
## p-value = 7.118e-24
# Significance interpretation
if (p_val < 0.05) {
cat(" Conclusion: Reject H0 — the slope is statistically significant.\n")
} else {
cat(" Conclusion: Fail to reject H0 — the slope is not statistically significant.\n")
}
## Conclusion: Reject H0 — the slope is statistically significant.
cat("\nFitted Model's MSE:", round(mse, 4), "\n")
##
## Fitted Model's MSE: 0.1992
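As an optional sanity check, the 95% confidence interval for the slope should plausibly cover the true value β1 = 5 used to generate the data:
# 95% CI for the estimated slope from part (b)
confint(model, parm = "X", level = 0.95)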
# Set seed again for reproducibility
set.seed(7052)
# Parameters (same as before but larger error)
n <- 100
mu_x <- 2
sd_x <- 0.1
sigma_error <- 1 # <-- changed from 0.5 to 1
beta0 <- 10
beta1 <- 5
# Simulate new data
X <- rnorm(n, mean = mu_x, sd = sd_x)
epsilon <- rnorm(n, mean = 0, sd = sigma_error)
Y <- beta0 + beta1 * X + epsilon
# Fit linear regression model
model <- lm(Y ~ X)
# Get summary
summary_output <- summary(model)
# Extract slope and intercept info
intercept_est <- summary_output$coefficients["(Intercept)", "Estimate"]
slope_est <- summary_output$coefficients["X", "Estimate"]
intercept_se <- summary_output$coefficients["(Intercept)", "Std. Error"]
slope_se <- summary_output$coefficients["X", "Std. Error"]
t_stat <- summary_output$coefficients["X", "t value"]
p_val <- summary_output$coefficients["X", "Pr(>|t|)"]
# Calculate MSE
mse <- mean(residuals(model)^2)
# Output results
cat("\n--- Question 2(c) Answer ---\n")
##
## --- Question 2(c) Answer ---
cat("Estimated Prediction Equation:\n")
## Estimated Prediction Equation:
cat("Ŷ =", round(intercept_est, 3), "+", round(slope_est, 3), "* X\n\n")
## Ŷ = 8.044 + 6.13 * X
cat("Estimated Coefficients:\n")
## Estimated Coefficients:
cat(" Intercept:", round(intercept_est, 4), " | SE:", round(intercept_se, 4), "\n")
## Intercept: 8.0436 | SE: 1.6673
cat(" Slope :", round(slope_est, 4), " | SE:", round(slope_se, 4), "\n\n")
## Slope : 6.1303 | SE: 0.8309
cat("Hypothesis Test for Slope:\n")
## Hypothesis Test for Slope:
cat(" H0: β1 = 0\n")
## H0: β1 = 0
cat(" Ha: β1 ≠ 0\n")
## Ha: β1 ≠ 0
cat(" t-statistic =", round(t_stat, 4), "\n")
## t-statistic = 7.3779
cat(" p-value =", signif(p_val, 4), "\n")
## p-value = 5.253e-11
if (p_val < 0.05) {
cat(" Conclusion: Reject H0 — the slope is statistically significant.\n")
} else {
cat(" Conclusion: Fail to reject H0 — the slope is not statistically significant.\n")
}
## Conclusion: Reject H0 — the slope is statistically significant.
cat("\nFitted Model's MSE:", round(mse, 4), "\n")
##
## Fitted Model's MSE: 0.7969
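-Compared with part (b), doubling the error SD from 0.5 to 1 roughly doubles the slope's standard error (0.4155 → 0.8309) and roughly quadruples the MSE (0.1992 → 0.7969), consistent with MSE estimating σ^2.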
set.seed(7052)
# Helper function to simulate and fit model
simulate_regression <- function(n, sigma_error) {
X <- rnorm(n, mean = 2, sd = 0.1)
epsilon <- rnorm(n, mean = 0, sd = sigma_error)
Y <- 10 + 5 * X + epsilon
model <- lm(Y ~ X)
summary_output <- summary(model)
# Extract info
intercept_est <- summary_output$coefficients["(Intercept)", "Estimate"]
slope_est <- summary_output$coefficients["X", "Estimate"]
slope_se <- summary_output$coefficients["X", "Std. Error"]
t_stat <- summary_output$coefficients["X", "t value"]
p_val <- summary_output$coefficients["X", "Pr(>|t|)"]
mse <- mean(residuals(model)^2)
return(list(
n = n,
sigma_error = sigma_error,
intercept = intercept_est,
slope = slope_est,
slope_se = slope_se,
t_stat = t_stat,
p_val = p_val,
mse = mse
))
}
# Run simulations for all 4 cases
result_400_low_var <- simulate_regression(n = 400, sigma_error = 0.5)
result_400_high_var <- simulate_regression(n = 400, sigma_error = 1.0)
# Print results
cat("\n--- Question 2(d): Results with n = 400 ---\n")
##
## --- Question 2(d): Results with n = 400 ---
for (res in list(result_400_low_var, result_400_high_var)) {
cat("\nSample Size:", res$n, " | Error SD:", res$sigma_error, "\n")
cat("Estimated Prediction Equation: Ŷ =", round(res$intercept, 3), "+", round(res$slope, 3), "* X\n")
cat("Slope SE:", round(res$slope_se, 4), "\n")
cat("t-Statistic:", round(res$t_stat, 4), "\n")
cat("p-Value:", signif(res$p_val, 4), "\n")
cat("MSE:", round(res$mse, 4), "\n")
cat(ifelse(res$p_val < 0.05, "Conclusion: Slope is significant.\n", "Conclusion: Slope is not significant.\n"))
}
##
## Sample Size: 400 | Error SD: 0.5
## Estimated Prediction Equation: Ŷ = 9.747 + 5.118 * X
## Slope SE: 0.249
## t-Statistic: 20.5554
## p-Value: 1.649e-64
## MSE: 0.2376
## Conclusion: Slope is significant.
##
## Sample Size: 400 | Error SD: 1
## Estimated Prediction Equation: Ŷ = 10.059 + 4.951 * X
## Slope SE: 0.4938
## t-Statistic: 10.0245
## p-Value: 3.078e-21
## MSE: 0.9982
## Conclusion: Slope is significant.
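To compare all four (n, error SD) settings side by side, the helper above can be rerun and collected into a small data frame (a sketch; exact values depend on the RNG state when each call runs, so rows may differ slightly from the runs printed above):
# Collect slope, slope SE, and MSE across the four settings
settings <- expand.grid(n = c(100, 400), sigma_error = c(0.5, 1.0))
set.seed(7052)
results <- lapply(seq_len(nrow(settings)), function(i) {
  res <- simulate_regression(settings$n[i], settings$sigma_error[i])
  data.frame(n = res$n, sigma = res$sigma_error,
             slope = round(res$slope, 3),
             slope_se = round(res$slope_se, 4),
             mse = round(res$mse, 4))
})
do.call(rbind, results)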
Question 3. Bias and Variance of Parameter Estimates.
-The least squares estimators β̂_0 and β̂_1 are unbiased, and their variances depend on the error variance σ^2 (e.g., Var(β̂_1) = σ^2 / Σ(x_i − x̄)^2).
MSE = SSE/(n − 2) is an unbiased estimator of the true error variance σ^2 because it adjusts for the two degrees of freedom spent estimating β̂_0 and β̂_1. The ML estimate σ̂² = SSE/n ignores this adjustment and therefore systematically underestimates σ^2 (its bias is −2σ^2/n, which shrinks as n grows). For an accurate reflection of the data's variability, the MSE is preferred over the ML estimate.
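A small simulation (an assumed setup reusing the Question 2 model, not part of the required answer) illustrates both claims: the average slope estimate lands near the true β1 = 5, the average MSE lands near the true σ^2 = 0.25, and the average ML estimate falls below it.
set.seed(7052)
n <- 30
sigma <- 0.5 # true error SD, so true sigma^2 = 0.25
sims <- replicate(5000, {
  X <- rnorm(n, mean = 2, sd = 0.1)
  Y <- 10 + 5 * X + rnorm(n, mean = 0, sd = sigma)
  fit <- lm(Y ~ X)
  sse <- sum(residuals(fit)^2)
  c(slope = coef(fit)[["X"]], mse = sse / (n - 2), ml = sse / n)
})
rowMeans(sims) # slope ≈ 5; mse ≈ 0.25; ml systematically below 0.25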