The fitted model estimates that, holding the student–faculty ratio fixed, a one-percentage-point increase in the percentage of small classes (fewer than 20 students) is associated with a 0.17 percentage-point increase in the alumni giving rate. Likewise, holding the percentage of small classes fixed, each additional unit of the student–faculty ratio is associated with a 1.70 percentage-point decrease in the alumni giving rate. The intercept, 39.66, is the predicted giving rate when both the student–faculty ratio and the percentage of small classes are zero. Both values lie well outside the observed data (minimums of 3 and 29%), so the intercept has little interpretive value here.
Using the fitted model, for a university with 50% of classes under 20 students and a student–faculty ratio of 10, the predicted alumni giving rate is y_hat = 30.94. The 95% prediction interval is [12.18, 49.71]: we are 95% confident that the alumni giving rate of a single institution with those characteristics falls between approximately 12% and 50%.
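The point prediction can be checked by hand from the coefficients printed by `coef(model_q1)`. The sketch below hard-codes those printed values instead of refitting the model; the names `b` and `y_hat` are illustrative only.

```r
# Coefficients copied from the coef(model_q1) output in this report
b <- c(intercept = 39.6555835,
       pct_small = 0.1661686,
       sf_ratio  = -1.7021103)

# Prediction for 50% small classes and a student-faculty ratio of 10
y_hat <- unname(b["intercept"] + b["pct_small"] * 50 + b["sf_ratio"] * 10)
round(y_hat, 2)
```

The result should agree with the `fit` column returned by `predict()`. The interval bounds cannot be rebuilt this way because they also depend on the leverage of the new observation.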
Interpretation: Taken one at a time, alumni giving rate tends to increase with the percentage of small classes (r = 0.65) and to decrease as the student–faculty ratio rises (r = -0.74). The squared multiple correlation (R² = 0.5613) is only slightly larger than the squared correlation with the student–faculty ratio alone (about 0.551), so most of the combined explanatory power comes from that predictor.
Each regression coefficient was tested at a significance level of alpha = 0.05 using the hypotheses:
\[ H_0: \beta_j = 0 \quad \text{versus} \quad H_1: \beta_j \neq 0 \]
Conclusion: The student–faculty ratio is a statistically significant predictor of alumni giving rate (t = -3.85, p = 0.0004), while the percent of small classes is not significant at the 5% level (t = 1.02, p = 0.312) when both variables are in the model.
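The t-statistics and p-values behind this conclusion can be recomputed from the estimates and standard errors in the coefficient table. This is a minimal sketch using the rounded values printed by `summary(model_q1)`, so the results match the table only up to rounding; the variable names are illustrative.

```r
# Estimates and standard errors copied from summary(model_q1)
est <- c(pct_small = 0.1662, sf_ratio = -1.7021)
se  <- c(pct_small = 0.1626, sf_ratio = 0.4421)
df_resid <- 45   # residual degrees of freedom: 48 observations - 3 coefficients

t_stat <- est / se                              # t = estimate / standard error
p_val  <- 2 * pt(-abs(t_stat), df = df_resid)   # two-sided p-value

round(cbind(t_stat, p_val), 4)

# Decision at alpha = 0.05: TRUE means reject H0 for that coefficient
reject_h0 <- p_val < 0.05
reject_h0
```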
The F-statistic is 28.79 with 2 and 45 degrees of freedom, and the associated p-value is below 0.001 (8.87e-09). Since the p-value is much smaller than 0.05, we reject the null hypothesis that both slope coefficients are zero. Taken together, the predictors explain a statistically significant proportion of the variability in the alumni giving rate.
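For a model with an intercept, the overall F-statistic can be recovered from R² alone. The sketch below uses the rounded R² from the summary output, so it reproduces the reported value only approximately; the variable names are illustrative.

```r
r2_reported <- 0.5613   # Multiple R-squared from summary(model_q1)
n <- 48                 # number of institutions
k <- 2                  # number of predictors

# F = (explained variation per predictor) / (unexplained variation per residual df)
f_stat <- (r2_reported / k) / ((1 - r2_reported) / (n - k - 1))
p_f    <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

round(f_stat, 2)
signif(p_f, 3)
```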
The R² value is 0.5613, and the adjusted R² is 0.5418. This means that approximately 56.1% of the variability in alumni giving rate across institutions is explained by differences in the percent of small classes and the student–faculty ratio. The adjusted R², which penalizes additional predictors, is only slightly lower at 54.2%, indicating a moderately good model fit.
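The adjustment itself is a short calculation. As a sketch, using the rounded R² from the output and illustrative variable names:

```r
r2_reported <- 0.5613   # Multiple R-squared from summary(model_q1)
n <- 48                 # number of institutions
k <- 2                  # number of predictors

# Adjusted R-squared rescales the unexplained share by (n - 1) / (n - k - 1)
adj_r2 <- 1 - (1 - r2_reported) * (n - 1) / (n - k - 1)
round(adj_r2, 4)
```

With only two predictors and 48 observations the penalty factor is 47/45, which is why the two values differ by about two percentage points.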
The model yields strong evidence that alumni giving is associated with institutional characteristics related to instructional quality. Universities with smaller student–faculty ratios tend to have higher alumni giving rates. The proportion of small classes is positively related to giving but is not statistically significant once the student–faculty ratio is controlled. The model explains over half the variation in giving rates and is statistically significant overall.
alumni <- read.csv("C:/Users/Gokul/Downloads/alumni (1).csv")
head(alumni,5)
## school percent_of_classes_under_20
## 1 Boston College 39
## 2 Brandeis University 68
## 3 Brown University 60
## 4 California Institute of Technology 65
## 5 Carnegie Mellon University 67
## student_faculty_ratio alumni_giving_rate private
## 1 13 25 1
## 2 8 33 1
## 3 8 40 1
## 4 3 46 1
## 5 10 28 1
names(alumni)
## [1] "school" "percent_of_classes_under_20"
## [3] "student_faculty_ratio" "alumni_giving_rate"
## [5] "private"
summary(alumni)
## school percent_of_classes_under_20 student_faculty_ratio
## Length:48 Min. :29.00 Min. : 3.00
## Class :character 1st Qu.:44.75 1st Qu.: 8.00
## Mode :character Median :59.50 Median :10.50
## Mean :55.73 Mean :11.54
## 3rd Qu.:66.25 3rd Qu.:13.50
## Max. :77.00 Max. :23.00
## alumni_giving_rate private
## Min. : 7.00 Min. :0.0000
## 1st Qu.:18.75 1st Qu.:0.0000
## Median :29.00 Median :1.0000
## Mean :29.27 Mean :0.6875
## 3rd Qu.:38.50 3rd Qu.:1.0000
## Max. :67.00 Max. :1.0000
# 3. Fit multiple linear regression
model_q1 <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio,
data = alumni)
summary_q1 <- summary(model_q1)
summary_q1
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20 +
## student_faculty_ratio, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.00 -6.57 -1.95 4.42 24.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6556 13.5076 2.936 0.005225 **
## percent_of_classes_under_20 0.1662 0.1626 1.022 0.312128
## student_faculty_ratio -1.7021 0.4421 -3.850 0.000371 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.098 on 45 degrees of freedom
## Multiple R-squared: 0.5613, Adjusted R-squared: 0.5418
## F-statistic: 28.79 on 2 and 45 DF, p-value: 8.869e-09
coef(model_q1)
## (Intercept) percent_of_classes_under_20
## 39.6555835 0.1661686
## student_faculty_ratio
## -1.7021103
new_obs <- data.frame(
percent_of_classes_under_20 = 50,
student_faculty_ratio = 10
)
predict(model_q1, newdata = new_obs, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 30.94291 12.17513 49.71069
cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate, use = "complete.obs")
## [1] 0.6456504
cor(alumni$student_faculty_ratio, alumni$alumni_giving_rate, use = "complete.obs")
## [1] -0.7423975
r2 <- summary_q1$r.squared
r2a <- summary_q1$adj.r.squared
fstat <- summary_q1$fstatistic
pval_F <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
# %.3e keeps a very small p-value readable instead of rounding it to 0.000000
cat(sprintf("\nF-statistic: %.3f on %d and %d df, p-value = %.3e\n",
fstat[1], fstat[2], fstat[3], pval_F))
##
## F-statistic: 28.793 on 2 and 45 df, p-value = 8.869e-09
cat(sprintf("R² = %.4f | Adjusted R² = %.4f\n", r2, r2a))
## R² = 0.5613 | Adjusted R² = 0.5418
confint(model_q1, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 12.449926 66.8612409
## percent_of_classes_under_20 -0.161228 0.4935653
## student_faculty_ratio -2.592600 -0.8116205
set.seed(341) # for reproducibility
N <- 1000 # number of observations
beta0 <- 10
beta1 <- 5
beta2 <- -2
# Generate X1, X2, and epsilon
x1 <- rnorm(N, mean = 2, sd = 0.1)
x2 <- rnorm(N, mean = 0, sd = 0.4)
epsilon <- rnorm(N, mean = 0, sd = 0.5)
# Generate Y
y <- beta0 + beta1 * x1 + beta2 * x2 + epsilon
# Combine into a data frame
sim_data <- data.frame(y, x1, x2)
# Preview first few rows
head(sim_data)
## y x1 x2
## 1 18.56398 1.894038 0.73312410
## 2 19.25623 1.969219 0.12198032
## 3 19.60292 2.086630 0.17212008
## 4 18.66492 1.900722 0.44454290
## 5 20.53890 2.083360 0.24527267
## 6 20.21512 2.047346 0.06045223
model <- lm(y ~ x1 + x2, data = sim_data)
# Display summary (coefficients, SE, t, p)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2, data = sim_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63697 -0.32608 0.00367 0.34163 1.57338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.07405 0.30720 32.79 <2e-16 ***
## x1 4.95980 0.15342 32.33 <2e-16 ***
## x2 -1.97800 0.03954 -50.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4994 on 997 degrees of freedom
## Multiple R-squared: 0.7868, Adjusted R-squared: 0.7864
## F-statistic: 1840 on 2 and 997 DF, p-value: < 2.2e-16
# Calculate fitted model's Mean Squared Error (MSE)
mse <- mean(residuals(model)^2)
mse
## [1] 0.2486094
Estimated Prediction Equation:
\[ \hat{y} = 10.074 + 4.960\,x_1 - 1.978\,x_2 \]
Hypotheses for each coefficient:
\[ H_0: \beta_j = 0 \]
\[ H_A: \beta_j \neq 0 \]
Decision rule:
\[ \text{If } p < 0.05, \text{ reject } H_0; \text{ the coefficient is statistically significant.} \]
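Interpretation: All three coefficients have p-values below 2e-16, so \(H_0\) is rejected for each at the 5% level. The estimates (10.07, 4.96, -1.98) are close to the true values used in the simulation (10, 5, -2).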
set.seed(456) # new seed for re-simulation
# Truth
beta0 <- 10; beta1 <- 5; beta2 <- -2
N <- 1000
# Re-simulate with larger error SD = 1
x1 <- rnorm(N, mean = 2, sd = 0.1)
x2 <- rnorm(N, mean = 0, sd = 0.4)
epsilon <- rnorm(N, mean = 0, sd = 1)
y <- beta0 + beta1*x1 + beta2*x2 + epsilon
sim_data2 <- data.frame(y, x1, x2)
# Fit regression
model2 <- lm(y ~ x1 + x2, data = sim_data2)
summary(model2)
##
## Call:
## lm(formula = y ~ x1 + x2, data = sim_data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3391 -0.6638 0.0360 0.6365 3.8591
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.63515 0.62449 15.43 <2e-16 ***
## x1 5.16154 0.31100 16.60 <2e-16 ***
## x2 -1.92455 0.07598 -25.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9641 on 997 degrees of freedom
## Multiple R-squared: 0.478, Adjusted R-squared: 0.477
## F-statistic: 456.5 on 2 and 997 DF, p-value: < 2.2e-16
# Fitted (training) MSE
mse2 <- mean(residuals(model2)^2)
mse2
## [1] 0.9267077
# Part a: N = 400, errors drawn from N(mean = 0, sd = 0.5)
set.seed(341)
N_new <- 400
# Simulate with error SD = 0.5
x1 <- rnorm(N_new, mean = 2, sd = 0.1)
x2 <- rnorm(N_new, mean = 0, sd = 0.4)
eps <- rnorm(N_new, mean = 0, sd = 0.5)
y <- beta0 + beta1*x1 + beta2*x2 + eps
dat_400_10 <- data.frame(y, x1, x2)
fit_400_10 <- lm(y ~ x1 + x2, data = dat_400_10)
summary(fit_400_10)
##
## Call:
## lm(formula = y ~ x1 + x2, data = dat_400_10)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20807 -0.36793 -0.00781 0.38143 1.49232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.74504 0.47744 20.41 <2e-16 ***
## x1 5.14127 0.23801 21.60 <2e-16 ***
## x2 -2.05940 0.06222 -33.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4998 on 397 degrees of freedom
## Multiple R-squared: 0.7837, Adjusted R-squared: 0.7826
## F-statistic: 719.1 on 2 and 397 DF, p-value: < 2.2e-16
mse_400_10 <- mean(residuals(fit_400_10)^2)
mse_400_10
## [1] 0.2479042
#part b
model <- lm(y ~ x1 + x2, data = dat_400_10)
# Display summary (coefficients, SE, t, p)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2, data = dat_400_10)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20807 -0.36793 -0.00781 0.38143 1.49232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.74504 0.47744 20.41 <2e-16 ***
## x1 5.14127 0.23801 21.60 <2e-16 ***
## x2 -2.05940 0.06222 -33.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4998 on 397 degrees of freedom
## Multiple R-squared: 0.7837, Adjusted R-squared: 0.7826
## F-statistic: 719.1 on 2 and 397 DF, p-value: < 2.2e-16
# Calculate fitted model's Mean Squared Error (MSE)
mse2 <- mean(residuals(model)^2)
mse2
## [1] 0.2479042
# Part c: N = 400, errors drawn from N(mean = 0, sd = 1)
set.seed(341)
N_new <- 400
# Simulate with larger error SD = 1
x1 <- rnorm(N_new, mean = 2, sd = 0.1)
x2 <- rnorm(N_new, mean = 0, sd = 0.4)
eps <- rnorm(N_new, mean = 0, sd = 1)
y <- beta0 + beta1*x1 + beta2*x2 + eps
dat_400_10 <- data.frame(y, x1, x2)
fit_400_10 <- lm(y ~ x1 + x2, data = dat_400_10)
summary(fit_400_10)
##
## Call:
## lm(formula = y ~ x1 + x2, data = dat_400_10)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.41613 -0.73587 -0.01561 0.76286 2.98464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4901 0.9549 9.939 <2e-16 ***
## x1 5.2825 0.4760 11.097 <2e-16 ***
## x2 -2.1188 0.1244 -17.028 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9996 on 397 degrees of freedom
## Multiple R-squared: 0.4893, Adjusted R-squared: 0.4867
## F-statistic: 190.1 on 2 and 397 DF, p-value: < 2.2e-16
mse_400_10 <- mean(residuals(fit_400_10)^2)
mse_400_10
## [1] 0.9916167
Effect of error variance (holding \(N\) fixed):
When \(\sigma_\varepsilon\) increases (0.5 → 1), the standard errors of \(\hat\beta_j\) roughly double, so t-statistics are roughly halved and p-values rise (in these runs they still stay below 2e-16). The MSE of the fitted model increases from about 0.25 to between 0.93 and 0.99 because the residuals are noisier. Point estimates remain close to the true values (the estimator is unbiased) but are less precise.
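The scaling of the standard errors with \(\sigma_\varepsilon\) can be checked directly. The sketch below wraps the simulation from this report in a helper, `se_for_sigma()`, which is a hypothetical name introduced here for illustration. Because both calls reuse the same seed, the error draws for \(\sigma_\varepsilon = 1\) are the \(\sigma_\varepsilon = 0.5\) draws scaled by two, so the ratio comes out at 2; with independent seeds it would only be close to 2.

```r
beta <- c(10, 5, -2)   # true coefficients used throughout the simulations

# Hypothetical helper: simulate one data set and return the coefficient SEs
se_for_sigma <- function(sigma, n = 400, seed = 341) {
  set.seed(seed)
  x1  <- rnorm(n, mean = 2, sd = 0.1)
  x2  <- rnorm(n, mean = 0, sd = 0.4)
  eps <- rnorm(n, mean = 0, sd = sigma)
  y   <- beta[1] + beta[2] * x1 + beta[3] * x2 + eps
  summary(lm(y ~ x1 + x2))$coefficients[, "Std. Error"]
}

se_low  <- se_for_sigma(0.5)
se_high <- se_for_sigma(1)

# Ratio of standard errors, sigma = 1 versus sigma = 0.5
se_high / se_low
```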
Effect of sample size (holding \(\sigma_\varepsilon\) fixed):
When \(N\) increases (400 → 1000), standard errors shrink (roughly at a \(1/\sqrt{N}\) rate), so estimates are more precise, t-statistics grow, and p-values decrease. The training MSE does not shrink with \(N\): it estimates \(\sigma_\varepsilon^2\) and stays close to it (0.2479 at \(N = 400\) versus 0.2486 at \(N = 1000\) for \(\sigma_\varepsilon = 0.5\)).
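The \(1/\sqrt{N}\) rate can be compared against the standard errors reported in the two \(\sigma_\varepsilon = 0.5\) runs. This is a rough check on single simulated data sets, so the observed ratios are expected to be near the theoretical value without matching it exactly; the vector names are illustrative.

```r
# Standard errors copied from the sigma = 0.5 summaries in this report
se_n400  <- c(intercept = 0.47744, x1 = 0.23801, x2 = 0.06222)
se_n1000 <- c(intercept = 0.30720, x1 = 0.15342, x2 = 0.03954)

observed_ratio <- se_n400 / se_n1000
expected_ratio <- sqrt(1000 / 400)   # ratio implied by 1/sqrt(N) scaling

round(observed_ratio, 3)
round(expected_ratio, 3)
```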
Overall takeaway:
Smaller error variance and larger sample size both improve precision. Changing \(\sigma_\varepsilon\) mainly affects the noise level (SEs and MSE), while changing \(N\) mainly affects precision (SEs/t-stats). Across all runs, \(\hat{\beta}_1\) should be near \(5\) and \(\hat{\beta}_2\) near \(-2\), with significance maintained given the strong true effects.
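The training MSE reported in each run is tied to the residual standard error printed by `summary()`: `mean(residuals^2)` divides the residual sum of squares by \(N\), while the squared residual standard error divides it by the residual degrees of freedom. The sketch below collects the printed values from the four runs into a data frame (the name `runs` and its column names are illustrative) and rebuilds each MSE up to rounding.

```r
# Values copied from the four simulation summaries in this report
runs <- data.frame(
  n            = c(1000, 1000, 400, 400),
  sigma        = c(0.5, 1, 0.5, 1),
  rse          = c(0.4994, 0.9641, 0.4998, 0.9996),   # residual standard error
  df_resid     = c(997, 997, 397, 397),
  mse_reported = c(0.2486094, 0.9267077, 0.2479042, 0.9916167)
)

# mean(residuals^2) = RSS / N, while rse^2 = RSS / df_resid
runs$mse_from_rse <- runs$rse^2 * runs$df_resid / runs$n
runs
```

In every row the MSE sits just below \(\sigma_\varepsilon^2\) (0.25 or 1), which is what drives the difference between the low-noise and high-noise runs.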
Interpretation