library(ggplot2)
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
str(alumni) # print structure of the alumni data frame
## 'data.frame':    48 obs. of  5 variables:
##  $ school                     : chr  "Boston College" "Brandeis University " "Brown University" "California Institute of Technology" ...
##  $ percent_of_classes_under_20: int  39 68 60 65 67 52 45 69 72 61 ...
##  $ student_faculty_ratio      : int  13 8 8 3 10 8 12 7 13 10 ...
##  $ alumni_giving_rate         : int  25 33 40 46 28 31 27 31 35 53 ...
##  $ private                    : int  1 1 1 1 1 1 1 1 1 1 ...

Question 1 Alumni Donation Data (Multiple Linear Regression). Continuing with the alumni donation data, fit a multiple linear regression model in which the alumni giving rate (alumni_giving_rate) is the response variable (Y) and the percentage of classes with fewer than 20 students (percent_of_classes_under_20) and the student/faculty ratio (student_faculty_ratio) are the predictors.

  1. What is your final estimated model?
multiple_model <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio, data = alumni)
multiple_summary <- summary(multiple_model)
print(multiple_summary)
## 
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20 + 
##     student_faculty_ratio, data = alumni)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.00  -6.57  -1.95   4.42  24.56 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  39.6556    13.5076   2.936 0.005225 ** 
## percent_of_classes_under_20   0.1662     0.1626   1.022 0.312128    
## student_faculty_ratio        -1.7021     0.4421  -3.850 0.000371 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.098 on 45 degrees of freedom
## Multiple R-squared:  0.5613, Adjusted R-squared:  0.5418 
## F-statistic: 28.79 on 2 and 45 DF,  p-value: 8.869e-09
# Build a display string for the full estimated equation from the fitted model
get_eq <- function(model) {
  b <- round(coef(model), 2)
  paste0("Y = ", b[1], " + ", b[2], " * X1 ", ifelse(b[3] < 0, "-", "+"), " ", abs(b[3]), " * X2")
}
students_faculty_alumni_ratio_rate <- ggplot(alumni, aes(x = percent_of_classes_under_20, y = alumni_giving_rate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, col = "blue") +
  labs(title = "Alumni giving rate vs. % of classes under 20",
       x = "% of Classes Under 20",
       y = "Alumni Giving Rate") +
  annotate("text", x = 40, y = 50, label = get_eq(multiple_model), color = "blue")
print(students_faculty_alumni_ratio_rate)
## `geom_smooth()` using formula = 'y ~ x'

students_faculty_alumni_ratio_rate <- ggplot(alumni, aes(x = student_faculty_ratio, y = alumni_giving_rate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, col = "blue") +
  labs(title = "Alumni giving rate vs. Student Faculty Ratio",
       x = "Student Faculty Ratio",
       y = "Alumni Giving Rate") +
  annotate("text", x = 10, y = 50, label = get_eq(multiple_model), color = "blue")
print(students_faculty_alumni_ratio_rate)
## `geom_smooth()` using formula = 'y ~ x'

# Stack the data so each predictor can be shown in its own facet; in the
# second block, the x column is overwritten with student_faculty_ratio
student_faculty_data <- rbind(
  transform(alumni, variable = "% of Classes Under 20"),
  transform(alumni, variable = "Student Faculty Ratio", percent_of_classes_under_20 = student_faculty_ratio)
)

students_faculty_alumni_ratio_rate <- ggplot(student_faculty_data, aes(x = percent_of_classes_under_20, y = alumni_giving_rate)) +
  geom_point(aes(color = variable)) +
  geom_smooth(method = "lm", se = FALSE, aes(group = variable, color = variable), show.legend = FALSE) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Alumni Giving Rate",
       x = "% of Classes Under 20",
       y = "Alumni Giving Rate")

print(students_faculty_alumni_ratio_rate)
## `geom_smooth()` using formula = 'y ~ x'

library(plotly)
## Warning: package 'plotly' was built under R version 4.5.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(
  data = alumni,
  x = ~percent_of_classes_under_20,
  y = ~student_faculty_ratio,
  z = ~alumni_giving_rate,
  type = "scatter3d",
  mode = "markers"
) %>%
  layout(
    scene = list(
      xaxis = list(title = "% of Classes < 20"),
      yaxis = list(title = "Student-Faculty Ratio"),
      zaxis = list(title = "Alumni Giving Rate")
    ))

The final estimated model is Y = 39.6556 + 0.1662X_1 - 1.7021X_2, where Y is the alumni giving rate, X_1 is the percentage of classes with fewer than 20 students, and X_2 is the student/faculty ratio. The model's R^2 is 0.5613, and the coefficient t-tests show that X_2 is statistically significant while X_1 is not.

  1. What is the predicted alumni giving rate for an observation with percent_of_classes_under_20 = 50 and student_faculty_ratio = 10?
combined_model <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio,
                     data = alumni)

cat(sprintf( "Predicted Alumni Giving Rate: %.2f\n", 
predict(combined_model, 
         newdata = data.frame(percent_of_classes_under_20 = 50,
                    student_faculty_ratio = 10))
))
## Predicted Alumni Giving Rate: 30.94

The predicted alumni giving rate is 30.94 when 50% of classes have fewer than 20 students and the student/faculty ratio is 10.
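
As a sanity check, the same prediction can be computed by hand from the estimated equation: 39.6556 + 0.1662 x 50 - 1.7021 x 10 ≈ 30.94. A minimal sketch using the fitted coefficients (assuming combined_model from above):

# Manual prediction from the coefficient vector; should match predict()
b <- coef(combined_model)
unname(b[1] + b[2] * 50 + b[3] * 10)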

  1. Test the statistical significance of the regression coefficients using t-tests; use α=0.05. Obtain the t-statistics and p-values, interpret the results, make a conclusion (i.e., reject or do not reject), and explain why. Note: please explain what the null hypothesis is.
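The table below lists the estimates, standard errors, t-statistics, and p-values; a minimal sketch of how it can be extracted from the fitted model (assuming multiple_model from above):

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
summary(multiple_model)$coefficients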
##                               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                 39.6555835 13.5075774  2.935803 0.0052247868
## percent_of_classes_under_20  0.1661686  0.1625520  1.022249 0.3121275033
## student_faculty_ratio       -1.7021103  0.4421271 -3.849821 0.0003709425

At α=0.05, the null hypothesis for each coefficient is H_0: β_j = 0 (the predictor has no linear association with the response, holding the other predictor fixed), against H_1: β_j ≠ 0. For percent_of_classes_under_20, t = 1.02 with p = 0.312 > 0.05, so we fail to reject the null hypothesis. For student_faculty_ratio, t = -3.85 with p = 0.00037 < 0.05, so we reject the null hypothesis. In conclusion, the student/faculty ratio shows a significant relationship with the alumni giving rate, while the percentage of classes under 20 does not, once the other predictor is accounted for.

  1. What is the F statistic? Is it significant? Clearly write out the null hypothesis, F-statistic, and p-value and interpret the test results.
f_statistic_multiple_model <- summary(multiple_model)$fstatistic
p_value_f_stat_multiple_model <- pf(f_statistic_multiple_model[1],
                                    f_statistic_multiple_model[2],
                                    f_statistic_multiple_model[3],
                                    lower.tail = FALSE)
f_statistic_multiple_model
##    value    numdf    dendf 
## 28.79264  2.00000 45.00000
p_value_f_stat_multiple_model
##        value 
## 8.868867e-09
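
The same overall test can be reproduced as a nested-model comparison against the intercept-only model; a minimal sketch (assuming multiple_model from above):

# Overall F-test via model comparison; the F and p-value should match above
anova(lm(alumni_giving_rate ~ 1, data = alumni), multiple_model)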

The null hypothesis of the overall F-test is H_0: β_1 = β_2 = 0, i.e., neither predictor has any linear effect on the alumni giving rate; the alternative is that at least one coefficient is nonzero. The observed F-statistic is 28.79 on 2 and 45 degrees of freedom, with p-value 8.87 x 10^-9. Since p < 0.05, we reject the null hypothesis: the regression is statistically significant as a whole.

  1. What is the value of the coefficient of determination? Please interpret
r_squared_multiple_model <- summary(multiple_model)$r.squared

r_squared_multiple_model
## [1] 0.5613406

The coefficient of determination is R^2 = 0.5613: about 56.13% of the variance in the alumni giving rate is explained by the percentage of classes under 20 and the student/faculty ratio together. The remaining 43.87% of the variance is left unexplained by the model. An R^2 of this size indicates a moderately good fit.
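R^2 can also be computed directly from its definition; a minimal sketch (assuming multiple_model and alumni from above):

# R^2 = 1 - SSE/SST; should match summary(multiple_model)$r.squared
sse <- sum(residuals(multiple_model)^2)
sst <- sum((alumni$alumni_giving_rate - mean(alumni$alumni_giving_rate))^2)
1 - sse / sst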

  1. What is the correlation coefficient r_1 between percent_of_classes_under_20 and alumni_giving_rate and the correlation coefficient r_2 between student_faculty_ratio and alumni_giving_rate? Do you see any relationship between r_1, r_2, and R^2?
cor_percent_agr_poc_20 <- cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate)
cor_sfr_agr <- cor(alumni$student_faculty_ratio, alumni$alumni_giving_rate)

cor_percent_agr_poc_20
## [1] 0.6456504
cor_sfr_agr
## [1] -0.7423975

The correlation coefficient between percent_of_classes_under_20 and alumni_giving_rate is r_1 = 0.6457, and between student_faculty_ratio and alumni_giving_rate it is r_2 = -0.7424. Recall that R^2 = 0.5613. Since r_1^2 = 0.417 and r_2^2 = 0.551, R^2 is larger than either squared simple correlation but much smaller than their sum (0.968): R^2 reflects the predictors' joint contribution and is not a simple function of the individual correlations, because the two predictors share explanatory power.
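
That shared explanatory power comes from the predictors being correlated with each other; a quick check (output not shown here):

# A strong correlation between the predictors explains why r_1^2 + r_2^2
# overstates the joint R^2
cor(alumni$percent_of_classes_under_20, alumni$student_faculty_ratio)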

Question 2 Simulation Study (Simple Linear Regression).

  1. Generate N=1000 observations from the model Y = 10 + 5X_1 - 2X_2 + ϵ, where X_1 ∼ N(μ=2, σ=0.1), X_2 ∼ N(μ=0, σ=0.4), and ϵ ∼ N(μ=0, σ=0.5).
set.seed(123)
N <- 1000
X1 <- rnorm(N, mean = 2, sd = 0.1)
X2 <- rnorm(N, mean = 0, sd = 0.4)
epsilon <- rnorm(N, mean = 0, sd = 0.5)
Y <- 10 + 5 * X1 - 2 * X2 + epsilon
sim_study_a <- lm(Y ~ X1 + X2)
sim_study_b <- summary(sim_study_a)
print(sim_study_a)
## 
## Call:
## lm(formula = Y ~ X1 + X2)
## 
## Coefficients:
## (Intercept)           X1           X2  
##      10.204        4.893       -1.966
  1. Fit a multiple linear regression to the simulated data from part a. What is the estimated prediction equation? Report the estimated coefficients and their standard errors. Are they significant? Clearly write out the null and alternative hypotheses, observed t-statistic(s), p-value(s), and interpret the estimates and test results. What is the fitted model's MSE?

The fitted model from part a) also answers part b). The estimated prediction equation is Y = 10.204 + 4.893X_1 - 1.966X_2, close to the true model Y = 10 + 5X_1 - 2X_2. For each coefficient the hypotheses are H_0: β_j = 0 versus H_1: β_j ≠ 0. The intercept has standard error 0.314, t = 32.50, and p < 2 x 10^-16; X_1 has standard error 0.156, t = 31.22, and p < 2 x 10^-16; X_2 has standard error 0.03848, t = -51.08, and p < 2 x 10^-16. We therefore reject the null hypothesis for every term: X_1 has a significant positive effect on Y and X_2 a significant negative effect, matching the signs in the true model. The residual standard error is 0.4894, so the fitted model's MSE is 0.4894^2 ≈ 0.2395.
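
As a sketch of how the MSE can be computed from the fitted object (assuming sim_study_a from above; depending on convention, MSE is SSE/n or SSE/(n - p)):

# Mean squared residual, SSE / n
mean(residuals(sim_study_a)^2)
# Squared residual standard error, SSE / (n - p)
sum(residuals(sim_study_a)^2) / df.residual(sim_study_a)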

  1. Repeat part b), but re-simulate the data and change the error term to ϵ∼N(μ=0,σ=1) (i.e., standard normal).
epsilon_new <- rnorm(N, mean = 0, sd = 1)
Y_new <- 10 + 5 * X1 - 2 * X2 + epsilon_new
sim_model_c <- lm(Y_new ~ X1 + X2)
summary(sim_model_c)
## 
## Call:
## lm(formula = Y_new ~ X1 + X2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.11699 -0.62664  0.00437  0.66561  2.90740 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.03929    0.63737   15.75   <2e-16 ***
## X1           4.97594    0.31810   15.64   <2e-16 ***
## X2          -2.01676    0.07811  -25.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9933 on 997 degrees of freedom
## Multiple R-squared:  0.4596, Adjusted R-squared:  0.4585 
## F-statistic: 423.9 on 2 and 997 DF,  p-value: < 2.2e-16
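
Since coefficient standard errors scale with the error standard deviation, doubling σ from 0.5 to 1 roughly doubles them; a quick side-by-side comparison (assuming sim_study_a and sim_model_c from above):

# Coefficient standard errors under sigma = 0.5 vs. sigma = 1
cbind(sd_0.5 = coef(summary(sim_study_a))[, "Std. Error"],
      sd_1.0 = coef(summary(sim_model_c))[, "Std. Error"])
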
  1. Repeat parts a)–c) using N=400. What do you conclude? What is the effect to the model parameter estimates when the error variance gets smaller? What is the effect when the sample size gets bigger?
N_new <- 400
X1_new <- rnorm(N_new, mean = 2, sd = 0.1)
X2_new <- rnorm(N_new, mean = 0, sd = 0.4)
epsilon_new <- rnorm(N_new, mean = 0, sd = 0.5)
Y_new <- 10 + 5 * X1_new - 2 * X2_new + epsilon_new
sim_model_d <- lm(Y_new ~ X1_new + X2_new)
summary(sim_model_d)
## 
## Call:
## lm(formula = Y_new ~ X1_new + X2_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52153 -0.31078  0.02841  0.32596  1.08097 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.5572     0.4768   20.04   <2e-16 ***
## X1_new        5.2195     0.2386   21.87   <2e-16 ***
## X2_new       -1.9189     0.0605  -31.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4772 on 397 degrees of freedom
## Multiple R-squared:  0.7931, Adjusted R-squared:  0.7921 
## F-statistic:   761 on 2 and 397 DF,  p-value: < 2.2e-16

The estimated model is Y = 9.5572 + 5.2195X_1 - 1.9189X_2, with all coefficients highly significant (p < 2 x 10^-16) and R^2 = 0.7931. When the error variance decreases, the coefficient standard errors shrink and the estimates become more precise, with a lower MSE (here 0.4772^2 ≈ 0.228). As the sample size grows, the parameter estimates approach their true values, standard errors decrease, and the model becomes more reliable.

  1. What about the MSE from each model? The larger the error variance, the higher the MSE and the less significant the coefficients become. Smaller samples also increase the uncertainty in the estimates. When the error variance is low and the sample size increases, the MSE decreases and the coefficients remain highly significant.
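
A compact way to compare the MSE across the three fitted models; a minimal sketch (assuming sim_study_a, sim_model_c, and sim_model_d from above):

# Mean squared residual for each simulated fit
sapply(list(sigma0.5_N1000 = sim_study_a,
            sigma1.0_N1000 = sim_model_c,
            sigma0.5_N400  = sim_model_d),
       function(m) mean(residuals(m)^2))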

Question 3

  1. Write out the multiple linear regression model with normal errors in matrix form
beta <- coefficients(combined_model)
x <- model.matrix(~percent_of_classes_under_20 + student_faculty_ratio, data = alumni)

message("Multiple Linear Regression Model with Normal Errors:")
## Multiple Linear Regression Model with Normal Errors:
print(beta)
##                 (Intercept) percent_of_classes_under_20 
##                  39.6555835                   0.1661686 
##       student_faculty_ratio 
##                  -1.7021103

In matrix form, the model is Y = Xβ + ϵ, where Y is the n x 1 vector of alumni giving rates, X is the n x p model matrix (a column of ones followed by the two predictor columns), β is the p x 1 vector of coefficients, and ϵ ∼ N(0, σ^2 I_n) is the vector of independent normal errors.

  1. State the model assumptions from part a).

The multiple linear regression model assumes: (1) the relationship between the predictors and the response is linear in the parameters; (2) the errors are independent; (3) the errors have constant variance σ^2 (homoscedasticity); and (4) the errors are normally distributed with mean zero. Together these give ϵ ∼ N(0, σ^2 I). In practice we also check that no influential outliers are distorting the overall fit.
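
These assumptions are commonly checked with the standard diagnostic plots for a fitted lm object; a minimal sketch (assuming combined_model from Question 1):

# Residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(combined_model)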

  1. Write out the model matrix X for the model in part a).
message("Model Matrix (X):")
## Model Matrix (X):
print(x)
##    (Intercept) percent_of_classes_under_20 student_faculty_ratio
## 1            1                          39                    13
## 2            1                          68                     8
## 3            1                          60                     8
## 4            1                          65                     3
## 5            1                          67                    10
## 6            1                          52                     8
## 7            1                          45                    12
## 8            1                          69                     7
## 9            1                          72                    13
## 10           1                          61                    10
## 11           1                          68                     8
## 12           1                          65                     7
## 13           1                          54                    10
## 14           1                          73                     8
## 15           1                          64                     9
## 16           1                          55                    11
## 17           1                          65                     6
## 18           1                          63                    13
## 19           1                          66                     8
## 20           1                          32                    19
## 21           1                          68                     5
## 22           1                          62                     8
## 23           1                          69                     7
## 24           1                          67                     9
## 25           1                          56                    12
## 26           1                          58                    17
## 27           1                          32                    19
## 28           1                          42                    20
## 29           1                          41                    18
## 30           1                          48                    19
## 31           1                          45                    20
## 32           1                          65                     4
## 33           1                          31                    23
## 34           1                          29                    15
## 35           1                          51                    15
## 36           1                          40                    16
## 37           1                          53                    13
## 38           1                          65                     7
## 39           1                          63                    10
## 40           1                          53                    13
## 41           1                          39                    21
## 42           1                          44                    13
## 43           1                          37                    12
## 44           1                          37                    13
## 45           1                          68                     9
## 46           1                          59                    11
## 47           1                          73                     7
## 48           1                          77                     7
## attr(,"assign")
## [1] 0 1 2
  1. What is the least squares estimate of β in part a)?
beta <- coef(combined_model)
print(beta)
##                 (Intercept) percent_of_classes_under_20 
##                  39.6555835                   0.1661686 
##       student_faculty_ratio 
##                  -1.7021103
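
The closed-form least squares solution can also be checked directly by solving the normal equations; a minimal sketch reusing the model matrix x from part a) (Y_alumni is a name introduced here for illustration):

# Solve (X'X) beta = X'Y; the result should match coef(combined_model)
Y_alumni <- alumni$alumni_giving_rate
solve(t(x) %*% x, t(x) %*% Y_alumni)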

The least squares estimate is β̂ = (X'X)^(-1) X'Y: an intercept of 39.656, a coefficient of 0.166 for the percentage of classes under 20, and a coefficient of -1.702 for the student/faculty ratio.

  1. What does it mean for the estimate in part d) to be unbiased?

An estimator is unbiased when its expected value equals the true population parameter, i.e., E[β̂] = β. This does not mean any single sample's estimate equals the true value; rather, over repeated samples from the same population, the estimates average out to the true coefficients, provided the model assumptions from part b) hold (in particular, that the errors have mean zero and the model is correctly specified). Under those assumptions, the least squares estimate in part d) is unbiased.
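
Unbiasedness can be illustrated by simulation: refit the model on many independent samples and check that the estimates average out to the truth. A minimal sketch with hypothetical true values (intercept 1, slope 2):

# Average slope estimate over 2000 simulated datasets; should be close to 2
set.seed(42)
slopes <- replicate(2000, {
  x_sim <- rnorm(100)
  y_sim <- 1 + 2 * x_sim + rnorm(100)
  coef(lm(y_sim ~ x_sim))[2]
})
mean(slopes)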