Question 1

Response variable: Alumni Giving Rate

Predictors: Percent of Classes Under 20, Student–Faculty Ratio

a. Final Estimated Model

Holding the student–faculty ratio fixed, a one-percentage-point increase in the share of classes with fewer than 20 students is associated with an estimated 0.17-point increase in the alumni giving rate. Conversely, holding the percentage of small classes fixed, each one-unit increase in the student–faculty ratio is associated with an estimated 1.70-point decrease in the alumni giving rate. The intercept, 39.66, is the predicted giving rate when both predictors are zero; it has little interpretive value here, since no institution in the data has values near zero.
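
Written out with the coefficient estimates from the R output later in this report, the fitted model is:

\[ \widehat{\text{alumni giving rate}} = 39.656 + 0.166\,(\text{percent of classes under 20}) - 1.702\,(\text{student–faculty ratio}) \]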

b. Predicted Alumni Giving Rate for Given Values

Using the fitted model, for a university with 50% of classes under 20 students and a student–faculty ratio of 10, the predicted alumni giving rate is \(\hat{y} = 30.94\). The 95% prediction interval is [12.18, 49.71]: for a single institution with these characteristics, the actual giving rate is expected, with 95% confidence, to fall between roughly 12% and 50%.
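
As a check, substituting the given values into the fitted equation reproduces the point prediction:

\[ \hat{y} = 39.656 + 0.166(50) - 1.702(10) = 39.656 + 8.308 - 17.021 \approx 30.94 \]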

c. Significance Tests for Regression Coefficients (t-tests)

Each regression coefficient was tested at a significance level of alpha = 0.05 using the hypotheses:

\[ H_0: \beta_j = 0 \qquad \text{vs.} \qquad H_1: \beta_j \neq 0 \]

Conclusion: The student–faculty ratio is a statistically significant predictor of alumni giving rate (t = −3.85, p < 0.001), while the percent of small classes is not significant at the 5% level when both variables are in the model together (t = 1.02, p = 0.312).
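
Each t-statistic is the coefficient estimate divided by its standard error; using the values from the regression output shown later in this report:

\[ t_{\text{classes}} = \frac{0.1662}{0.1626} \approx 1.02 \;(p = 0.312), \qquad t_{\text{ratio}} = \frac{-1.7021}{0.4421} \approx -3.85 \;(p < 0.001) \]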

d. Overall Model Significance (F-test)

The F-statistic is 28.79 on 2 and 45 degrees of freedom, with an associated p-value of 8.87e-09 (p < 0.001). Since the p-value is far below 0.05, we reject the null hypothesis that both slope coefficients are zero: taken together, the predictors explain a statistically significant share of the variability in alumni giving rate.
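
The F-statistic can also be recovered from \(R^2\) with \(n = 48\) observations and \(p = 2\) predictors, a useful consistency check:

\[ F = \frac{R^2/p}{(1 - R^2)/(n - p - 1)} = \frac{0.5613/2}{0.4387/45} \approx 28.79 \]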

e. Coefficient of Determination (R²)

The R² value is 0.5613 and the adjusted R² is 0.5418: approximately 56.1% of the variability in alumni giving rate across institutions is explained by the percent of small classes and the student–faculty ratio together. The adjusted R², which penalizes model complexity, stays close at 54.2%, indicating a moderately good fit that is not driven by extra predictors.
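
The adjusted value follows directly from \(R^2\) with \(n = 48\) and \(p = 2\):

\[ R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - 0.4387 \cdot \frac{47}{45} \approx 0.5418 \]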

f. Correlation Coefficients (r₁, r₂) and Relationship with R²

Interpretation: The simple correlations, computed in the R output below, are \(r_1 = 0.646\) between the percent of small classes and the giving rate and \(r_2 = -0.742\) between the student–faculty ratio and the giving rate. Alumni giving thus tends to rise with the share of small classes and fall as the student–faculty ratio increases. The multiple \(R^2 = 0.5613\) reflects the combined explanatory power of both predictors rather than the sum of their individual contributions.
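
Squaring the simple correlations makes the relationship with \(R^2\) concrete:

\[ r_1^2 = (0.6457)^2 \approx 0.417, \qquad r_2^2 = (-0.7424)^2 \approx 0.551, \qquad R^2 = 0.5613 \]

\(R^2\) exceeds either squared simple correlation but falls well short of their sum (about 0.97), because the two predictors are correlated with each other and share explanatory power.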

Conclusion

The model provides strong evidence that alumni giving is associated with institutional characteristics related to instructional quality: universities with smaller student–faculty ratios tend to have higher alumni giving rates, while the proportion of small classes, although positively correlated with giving, is not statistically significant once the student–faculty ratio is controlled for. The model explains over half of the variation in giving rates and is statistically significant overall.

Load the data

# Read the alumni dataset from a local CSV file
alumni <- read.csv("C:/Users/Gokul/Downloads/alumni (1).csv")
head(alumni, 5)   # preview the first five rows
##                               school percent_of_classes_under_20
## 1                     Boston College                          39
## 2               Brandeis University                           68
## 3                   Brown University                          60
## 4 California Institute of Technology                          65
## 5         Carnegie Mellon University                          67
##   student_faculty_ratio alumni_giving_rate private
## 1                    13                 25       1
## 2                     8                 33       1
## 3                     8                 40       1
## 4                     3                 46       1
## 5                    10                 28       1

Columns in the dataset

names(alumni)
## [1] "school"                      "percent_of_classes_under_20"
## [3] "student_faculty_ratio"       "alumni_giving_rate"         
## [5] "private"

Summary of the dataset

summary(alumni)
##     school          percent_of_classes_under_20 student_faculty_ratio
##  Length:48          Min.   :29.00               Min.   : 3.00        
##  Class :character   1st Qu.:44.75               1st Qu.: 8.00        
##  Mode  :character   Median :59.50               Median :10.50        
##                     Mean   :55.73               Mean   :11.54        
##                     3rd Qu.:66.25               3rd Qu.:13.50        
##                     Max.   :77.00               Max.   :23.00        
##  alumni_giving_rate    private      
##  Min.   : 7.00      Min.   :0.0000  
##  1st Qu.:18.75      1st Qu.:0.0000  
##  Median :29.00      Median :1.0000  
##  Mean   :29.27      Mean   :0.6875  
##  3rd Qu.:38.50      3rd Qu.:1.0000  
##  Max.   :67.00      Max.   :1.0000

Fitting multiple linear regression model

# 3. Fit multiple linear regression
model_q1 <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio, 
               data = alumni)
summary_q1 <- summary(model_q1)
summary_q1
## 
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20 + 
##     student_faculty_ratio, data = alumni)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.00  -6.57  -1.95   4.42  24.56 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  39.6556    13.5076   2.936 0.005225 ** 
## percent_of_classes_under_20   0.1662     0.1626   1.022 0.312128    
## student_faculty_ratio        -1.7021     0.4421  -3.850 0.000371 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.098 on 45 degrees of freedom
## Multiple R-squared:  0.5613, Adjusted R-squared:  0.5418 
## F-statistic: 28.79 on 2 and 45 DF,  p-value: 8.869e-09

Coefficients of the model

coef(model_q1)
##                 (Intercept) percent_of_classes_under_20 
##                  39.6555835                   0.1661686 
##       student_faculty_ratio 
##                  -1.7021103

Predicted alumni giving rate for given values

new_obs <- data.frame(
  percent_of_classes_under_20 = 50,
  student_faculty_ratio = 10
)
predict(model_q1, newdata = new_obs, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 30.94291 12.17513 49.71069

Correlations between predictors and response

cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate, use = "complete.obs")
## [1] 0.6456504
cor(alumni$student_faculty_ratio, alumni$alumni_giving_rate, use = "complete.obs")
## [1] -0.7423975

Extract R², Adjusted R², and F-test

r2  <- summary_q1$r.squared
r2a <- summary_q1$adj.r.squared
fstat <- summary_q1$fstatistic
pval_F <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
cat(sprintf("\nF-statistic: %.3f on %d and %d df, p-value = %.3g\n",
            fstat[1], fstat[2], fstat[3], pval_F))
## 
## F-statistic: 28.793 on 2 and 45 df, p-value = 8.87e-09
cat(sprintf("R² = %.4f | Adjusted R² = %.4f\n", r2, r2a))
## R² = 0.5613 | Adjusted R² = 0.5418

Confidence intervals for coefficients

# Both 95% CIs agree with the t-tests: the interval for
# percent_of_classes_under_20 contains 0, while the interval for
# student_faculty_ratio lies entirely below 0.
confint(model_q1, level = 0.95)
##                                 2.5 %     97.5 %
## (Intercept)                 12.449926 66.8612409
## percent_of_classes_under_20 -0.161228  0.4935653
## student_faculty_ratio       -2.592600 -0.8116205

Question 2

Part a: Generate the simulated data
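
The data are generated from the linear model below, with parameter choices matching the code that follows:

\[ Y = 10 + 5X_1 - 2X_2 + \varepsilon, \qquad X_1 \sim N(2,\, 0.1^2), \quad X_2 \sim N(0,\, 0.4^2), \quad \varepsilon \sim N(0,\, 0.5^2) \]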

set.seed(341)   # for reproducibility

N <- 1000        # number of observations
beta0 <- 10
beta1 <- 5
beta2 <- -2

# Generate X1, X2, and epsilon
x1 <- rnorm(N, mean = 2, sd = 0.1)
x2 <- rnorm(N, mean = 0, sd = 0.4)
epsilon <- rnorm(N, mean = 0, sd = 0.5)

# Generate Y
y <- beta0 + beta1 * x1 + beta2 * x2 + epsilon

# Combine into a data frame
sim_data <- data.frame(y, x1, x2)

# Preview first few rows
head(sim_data)
##          y       x1         x2
## 1 18.56398 1.894038 0.73312410
## 2 19.25623 1.969219 0.12198032
## 3 19.60292 2.086630 0.17212008
## 4 18.66492 1.900722 0.44454290
## 5 20.53890 2.083360 0.24527267
## 6 20.21512 2.047346 0.06045223

Part b: Fit multiple linear regression to simulated data

model <- lm(y ~ x1 + x2, data = sim_data)

# Display summary (coefficients, SE, t, p)
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = sim_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63697 -0.32608  0.00367  0.34163  1.57338 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.07405    0.30720   32.79   <2e-16 ***
## x1           4.95980    0.15342   32.33   <2e-16 ***
## x2          -1.97800    0.03954  -50.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4994 on 997 degrees of freedom
## Multiple R-squared:  0.7868, Adjusted R-squared:  0.7864 
## F-statistic:  1840 on 2 and 997 DF,  p-value: < 2.2e-16
# Calculate fitted model's Mean Squared Error (MSE)
mse <- mean(residuals(model)^2)
mse
## [1] 0.2486094

Estimated Prediction Equation:

\(\hat{Y} = 10.074 + 4.960\,X_1 - 1.978\,X_2\)

Hypotheses for each coefficient:

\[ H_0: \beta_j = 0 \]

\[ H_A: \beta_j \neq 0 \]

Decision rule:

\[ \text{If } p < 0.05, \text{ reject } H_0; \text{ the coefficient is statistically significant.} \]

Interpretation: All three estimated coefficients are close to the true values (\(\hat\beta_0 = 10.074\), \(\hat\beta_1 = 4.960\), \(\hat\beta_2 = -1.978\) versus 10, 5, and −2), and all are highly significant (p < 2e-16), so \(H_0\) is rejected for each. The fitted MSE of 0.2486 is close to the true error variance \(\sigma_\varepsilon^2 = 0.5^2 = 0.25\), as expected.

Part c: Re-simulate the data with a larger error term

set.seed(456)     # new seed for re-simulation

# True parameter values
beta0 <- 10; beta1 <- 5; beta2 <- -2
N <- 1000

# Re-simulate with larger error SD = 1
x1 <- rnorm(N, mean = 2, sd = 0.1)
x2 <- rnorm(N, mean = 0, sd = 0.4)
epsilon <- rnorm(N, mean = 0, sd = 1)

y <- beta0 + beta1*x1 + beta2*x2 + epsilon
sim_data2 <- data.frame(y, x1, x2)

# Fit regression
model2 <- lm(y ~ x1 + x2, data = sim_data2)
summary(model2)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = sim_data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3391 -0.6638  0.0360  0.6365  3.8591 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.63515    0.62449   15.43   <2e-16 ***
## x1           5.16154    0.31100   16.60   <2e-16 ***
## x2          -1.92455    0.07598  -25.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9641 on 997 degrees of freedom
## Multiple R-squared:  0.478,  Adjusted R-squared:  0.477 
## F-statistic: 456.5 on 2 and 997 DF,  p-value: < 2.2e-16
# Fitted (training) MSE
mse2 <- mean(residuals(model2)^2)
mse2
## [1] 0.9267077
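
Consistent with the fact that the fitted MSE estimates the error variance, doubling \(\sigma_\varepsilon\) roughly quadruples the MSE relative to part b, and the coefficient standard errors roughly double (e.g., 0.153 to 0.311 for \(x_1\)):

\[ \text{MSE} \approx \sigma_\varepsilon^2: \qquad 0.249 \approx 0.5^2 = 0.25 \;\;\text{(part b)} \qquad \text{vs.} \qquad 0.927 \approx 1^2 = 1 \;\;\text{(part c)} \]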

Part d: Effect of sample size (re-running parts a–c with N = 400 instead of 1000)

# Repeat parts a-b (epsilon ~ N(0, sd = 0.5)) with a smaller sample, N = 400
set.seed(341)
N_new <- 400

# Simulate with error SD = 0.5, as in part a
x1 <- rnorm(N_new, mean = 2, sd = 0.1)
x2 <- rnorm(N_new, mean = 0, sd = 0.4)
eps <- rnorm(N_new, mean = 0, sd = 0.5)
y  <- beta0 + beta1*x1 + beta2*x2 + eps

dat_400_10 <- data.frame(y, x1, x2)
fit_400_10 <- lm(y ~ x1 + x2, data = dat_400_10)
summary(fit_400_10)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = dat_400_10)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20807 -0.36793 -0.00781  0.38143  1.49232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.74504    0.47744   20.41   <2e-16 ***
## x1           5.14127    0.23801   21.60   <2e-16 ***
## x2          -2.05940    0.06222  -33.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4998 on 397 degrees of freedom
## Multiple R-squared:  0.7837, Adjusted R-squared:  0.7826 
## F-statistic: 719.1 on 2 and 397 DF,  p-value: < 2.2e-16
mse_400_10 <- mean(residuals(fit_400_10)^2)
mse_400_10
## [1] 0.2479042
# Repeat part c (epsilon ~ N(0, sd = 1)) with N = 400; use distinct object
# names so the sd = 0.5 results above are not overwritten before part e
set.seed(341)
N_new <- 400

# Simulate with larger error SD = 1
x1 <- rnorm(N_new, mean = 2, sd = 0.1)
x2 <- rnorm(N_new, mean = 0, sd = 0.4)
eps <- rnorm(N_new, mean = 0, sd = 1)
y  <- beta0 + beta1*x1 + beta2*x2 + eps

dat_400_sd1 <- data.frame(y, x1, x2)
fit_400_sd1 <- lm(y ~ x1 + x2, data = dat_400_sd1)
summary(fit_400_sd1)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = dat_400_sd1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.41613 -0.73587 -0.01561  0.76286  2.98464 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.4901     0.9549   9.939   <2e-16 ***
## x1            5.2825     0.4760  11.097   <2e-16 ***
## x2           -2.1188     0.1244 -17.028   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9996 on 397 degrees of freedom
## Multiple R-squared:  0.4893, Adjusted R-squared:  0.4867 
## F-statistic: 190.1 on 2 and 397 DF,  p-value: < 2.2e-16
mse_400_sd1 <- mean(residuals(fit_400_sd1)^2)
mse_400_sd1
## [1] 0.9916167

Conclusions

  • Effect of error variance (holding \(N\) fixed):
    When \(\sigma_\varepsilon\) increases (0.5 → 1), the standard errors of \(\hat\beta_j\) increase, so t-statistics decrease and p-values rise. The MSE of the fitted model increases because residuals are noisier. Point estimates remain close to the true values (unbiased), but are less precise.

  • Effect of sample size (holding \(\sigma_\varepsilon\) fixed):
    When \(N\) increases (400 → 1000), standard errors shrink (roughly at a \(1/\sqrt{N}\) rate; see the check after this list), so estimates are more precise, t-statistics grow, and p-values decrease. The training MSE, however, stays close to \(\sigma_\varepsilon^2\) regardless of \(N\): it estimates the irreducible error variance, as the runs above confirm (0.2479 at N = 400 vs. 0.2486 at N = 1000 for \(\sigma_\varepsilon = 0.5\)).

  • Overall takeaway:
    Smaller error variance and larger sample size both improve precision. Changing \(\sigma_\varepsilon\) mainly affects the noise level (SEs and MSE), while changing \(N\) mainly affects precision (SEs/t-stats). Across all runs, \(\hat{\beta}_1\) should be near \(5\) and \(\hat{\beta}_2\) near \(-2\), with significance maintained given the strong true effects.
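
As a check on the \(1/\sqrt{N}\) rate, compare the standard errors of \(\hat\beta_1\) at \(\sigma_\varepsilon = 0.5\): 0.2380 at \(N = 400\) versus 0.1534 at \(N = 1000\). Their ratio is close to the theoretical factor:

\[ \frac{0.2380}{0.1534} \approx 1.55 \qquad \text{vs.} \qquad \sqrt{1000/400} \approx 1.58 \]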

Part e: Comparison of MSE across all models
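
A minimal sketch for the comparison, assuming the objects mse, mse2, mse_400_10, and mse_400_sd1 from the earlier chunks are still in the workspace:

# Collect the fitted MSEs from the four simulation runs above
mse_comparison <- data.frame(
  N        = c(1000, 1000, 400, 400),
  error_sd = c(0.5, 1.0, 0.5, 1.0),
  MSE      = c(mse, mse2, mse_400_10, mse_400_sd1)
)
mse_comparison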

Interpretation: The fitted MSE tracks the error variance, not the sample size. It is roughly 0.25 when \(\sigma_\varepsilon = 0.5\) (0.2486 at N = 1000, 0.2479 at N = 400) and roughly 1 when \(\sigma_\varepsilon = 1\) (0.9267 at N = 1000, 0.9916 at N = 400). This is expected, since the training MSE estimates \(\sigma_\varepsilon^2\): larger samples improve the precision of the coefficient estimates but cannot remove irreducible noise.