Introduction

Problem Statement

Alumni donations are an important source of revenue for colleges and universities. The effects of different variables on alumni’s donation could help predict a university’s revenue. Furthermore, understanding the factors that influence increases in the percentage of alumni who make a donation could help administrators implement policies that could lead to increased revenues.

As shown in research, students who are more satisfied with their contact with teachers are more likely to graduate, as well as contribute financially in the future. There are also other important factors that may influence alumni’s decisions such as: national ranking, location, and if the institution is public or private. Ivy leagues are a primary example of such factors having an impact on alumni’s decisions.

This report will analyze the effects of the aforementioned variables on alumni donation rates. Through exploratory data analysis and the development of a linear regression model, the report will quantify these relationships. Lastly we will use model diagnostics to identify and solve potential flaws.

Packages Used

The following packages are used:

library(readr)
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(investr)
library(car)
library(knitr)
library(ggpubr)
library(GGally)
library(broom)

Data Description

The alumni donation data set can be found here

This report analyzes donation data from 48 national universities, sourced from America’s Best Colleges, Year 2000 Edition. The dataset has been enhanced by including additional information, such as the state where each school is located and its ranking from U.S. News & World Report.

Variables Description
school Name of the school
percent_of_classes_under_20 Percentage of classes with fewer than 20 students
student_faculty_ratio Ratio of students to faculty
alumni_giving_rate Percentage of alumni donation
private Whether or not the school is private
state State location of the school
ranking US World News ranking for 2018

Modeling and Results

Initial Analysis

Univariate Analysis

The model’s response variable is the alumni_giving_rate. A summary of the predictor variables is provided below.

percent_of_classes_under_20

plot1 <- ggplot(alumni, aes(percent_of_classes_under_20)) + 
  geom_histogram(fill = "lightgreen", color = "darkgreen") + 
  ggtitle("Histogram distribution of the variable") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
plot2 <- ggplot(alumni, aes(y = percent_of_classes_under_20)) + 
  geom_boxplot() + 
  ggtitle("Boxplot to look for outliers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
plot3 <- ggplot(alumni, aes(percent_of_classes_under_20, alumni_giving_rate)) + 
  geom_point(shape = 1) + 
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Scatter plot relating the two variables") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
ggarrange(plot1, plot2, plot3, nrow = 1, ncol = 3)

lm(alumni_giving_rate ~ percent_of_classes_under_20, alumni) %>% 
  summary() %>% .$coefficients
##                               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                 -7.3860676  6.5654723 -1.124986 2.664307e-01
## percent_of_classes_under_20  0.6577687  0.1147048  5.734448 7.228121e-07

Observations:

  • The range is from 29% to 77% with an average of around 56%.
  • The median is 59%.
  • There are no missing values and the values are distributed with a Standard deviation of 13.19.
  • The donation rate increases by 0.66 points for every unit increase in the class percentage.

student_faculty_ratio

plot1 <- ggplot(alumni, aes(student_faculty_ratio)) + 
  geom_histogram(fill = "lightgreen", color = "darkgreen") + 
  ggtitle("Histogram of student vs faculty ratio") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
plot2 <- ggplot(alumni, aes(y = student_faculty_ratio)) + 
  geom_boxplot() + 
  ggtitle("Boxplot to look for outliers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
plot3 <- ggplot(alumni, aes(student_faculty_ratio, alumni_giving_rate)) + 
  geom_point(shape = 1) + 
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Scatter plot relating the two variables") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))
ggarrange(plot1, plot2, plot3, nrow = 1, ncol = 3)

lm(alumni_giving_rate ~ student_faculty_ratio, alumni) %>% 
  summary() %>% .$coefficients
##                        Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)           53.013827   3.421450 15.494548 7.058813e-20
## student_faculty_ratio -2.057155   0.273716 -7.515653 1.544232e-09

Observations:

  • The range of is from 3% to 23% with an average of 11.54%.
  • The median is 10.5%.
  • There are no missing values and the values are distributed with a Standard deviation of 4.85.
  • The donation rate decreases by 2.06 points for every unit increase in the faculty ratio.

Correlation

alumni %>%
select(alumni_giving_rate,percent_of_classes_under_20,student_faculty_ratio,private) %>%
ggpairs()

Observations:

  • Student-faculty ratio has a high negative correlation of -0.742 with the response variable. It can be seen from the plot that the Alumni giving rate decreases as it increases.

  • Percentage of classes under 20 has a high positive correlation of 0.646 with the response variable.The Alumni giving rate increases as it increases.

  • The type of school , whether private or not plays a significant role in determining the alumni giving rate.

  • All the predictor variables also have a strong correlation among them, suggesting that multicollinearity can be a potential issue.

Model Building

Model 1

model1 <- lm(alumni_giving_rate ~ student_faculty_ratio + 
               percent_of_classes_under_20 + private + ranking, 
              data = alumni)  
model1 %>% summary()
## 
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + percent_of_classes_under_20 + 
##     private + ranking, data = alumni)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7557  -5.5338  -0.5836   4.7865  22.3672 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  51.8087    14.5213   3.568   0.0009 ***
## student_faculty_ratio        -1.2424     0.4911  -2.530   0.0152 *  
## percent_of_classes_under_20  -0.1100     0.1881  -0.585   0.5618    
## private1                      6.7788     5.1073   1.327   0.1914    
## ranking                      -0.2619     0.1119  -2.341   0.0240 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.632 on 43 degrees of freedom
## Multiple R-squared:  0.6227, Adjusted R-squared:  0.5876 
## F-statistic: 17.74 on 4 and 43 DF,  p-value: 1.138e-08

Based on the p-values, we can see that percent_of_classes_under_20 and private variables do not seem that significantly influential on the model. Which was unexpected since based on the univariate analyses it seemed it was going to be influential. Multicollinearity between the two variables is probably the reason. The difference between the \(R^{2}\) and adjusted \(R^{2}\) also indicates the same reason. The \(MSE\) is observed to be 69.65.

vif(model1)
##       student_faculty_ratio percent_of_classes_under_20 
##                    3.580330                    3.886828 
##                     private                     ranking 
##                    3.610477                    1.976081

As we can see from the Variance Inflation Factor, there are no values above 10. Therefore, rejecting the multicollinearity theory. These two variables will be excluded for now.

Now that we have removed those two variables, this is the current model.

model2 <- lm(alumni_giving_rate ~ student_faculty_ratio + ranking,
             data = alumni)  
summary(model2)
## 
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + ranking, 
##     data = alumni)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9086  -6.1546   0.2597   4.4597  21.2957 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           53.74558    3.24880  16.543  < 2e-16 ***
## student_faculty_ratio -1.55767    0.32534  -4.788 1.86e-05 ***
## ranking               -0.25291    0.09978  -2.535   0.0148 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.609 on 45 degrees of freedom
## Multiple R-squared:  0.6072, Adjusted R-squared:  0.5898 
## F-statistic: 34.79 on 2 and 45 DF,  p-value: 7.38e-10

We only kept significant variables in the model, however, there does not seem to be any improvement. The adjusted \(R^{2}\) is still 0.59 and the \(MSE\) is still 72.5.

Residual Diagnostics

we can check the fit of the model based on the following residual analysis points.

model2Augment <- augment(model2) %>% mutate(row_num = 1:n())

p1 <- ggplot(model2Augment, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  xlab("Fitted value") +
  ylab("Standardized residuals") + 
  ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p2 <- ggplot(model2Augment, aes(x = ranking, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "lightblue", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX follows a slight exponential curve") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p3 <- ggplot(model2Augment, aes(x = student_faculty_ratio, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "lightblue", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX follows a curve") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p4 <- ggplot(model2Augment, aes(y = .std.resid)) +
  geom_boxplot() +
  ggtitle("Outlying observations test\nOne outlier") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p5 <- ggplot(model2Augment, aes(sample = .std.resid)) +
  geom_qq(alpha = 0.3) +
  geom_qq_line(linetype = "dashed", color = "red") +
  xlab("Theoretical quantile") +
  ylab("Sample quantile") +
  ggtitle("Non-normality test\nThe residuals almost follow a normal distribution") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p6 <- ggplot(model2Augment, aes(x = row_num, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Index") +
  ylab("Standardized residual") +
  ggtitle("Non-independance test\nNo residual pattern across row numbers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

ggarrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)

By applying a logarithmic transformation to the ranking variable we can improve the curve.

Log Transformation

This is the new model after the logarithmic transformation.

alumni$ranking2 <- log(alumni$ranking)
model4 <- lm(alumni_giving_rate ~ student_faculty_ratio + ranking2,
             data = alumni)  
summary(model4)
## 
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + ranking2, 
##     data = alumni)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3976  -5.7894   0.6323   4.2420  21.5566 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            61.1314     3.9543  15.459  < 2e-16 ***
## student_faculty_ratio  -1.4228     0.3136  -4.538 4.21e-05 ***
## ranking2               -5.2561     1.5878  -3.310  0.00184 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.253 on 45 degrees of freedom
## Multiple R-squared:  0.639,  Adjusted R-squared:  0.623 
## F-statistic: 39.83 on 2 and 45 DF,  p-value: 1.103e-10

We can see that the new adjusted \(R^{2}\) is 0.62, and the \(MSE\) is 66.63.

Residual Diagnostics 2

These are based on the new model.

model4Augment <- augment(model4) %>% mutate(row_num = 1:n())

p1 <- ggplot(model4Augment, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  xlab("Fitted value") +
  ylab("Standardized residuals") + 
  ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p2 <- ggplot(model4Augment, aes(x = ranking2, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "lightblue", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX not exponential anymore") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p3 <- ggplot(model4Augment, aes(x = student_faculty_ratio, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "lightblue", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX follows a curve") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p4 <- ggplot(model4Augment, aes(y = .std.resid)) +
  geom_boxplot() +
  ggtitle("Outlying observations test\nOne outlier") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p5 <- ggplot(model4Augment, aes(sample = .std.resid)) +
  geom_qq(alpha = 0.3) +
  geom_qq_line(linetype = "dashed", color = "red") +
  xlab("Theoretical quantile") +
  ylab("Sample quantile") +
  ggtitle("Non-normality test\nThe residuals almost follow a normal distribution") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p6 <- ggplot(model4Augment, aes(x = row_num, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  xlab("Index") +
  ylab("Standardized residual") +
  ggtitle("Non-independance test\nNo residual pattern across row numbers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

ggarrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)

Box-Cox Transformation

By applying a Box-Cox transformation to the response variable, we can further improve the model.

bc <- MASS::boxcox(model2, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

alumni$alumni_giving_rate2 <- (alumni$alumni_giving_rate ^ lambda - 1) / lambda
model3 <- lm(alumni_giving_rate2 ~ student_faculty_ratio + ranking2,
             data = alumni)  
summary(model3)
## 
## Call:
## lm(formula = alumni_giving_rate2 ~ student_faculty_ratio + ranking2, 
##     data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5503 -1.1945  0.1262  0.8061  3.9072 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           14.44137    0.75593  19.104  < 2e-16 ***
## student_faculty_ratio -0.31443    0.05994  -5.246 4.04e-06 ***
## ranking2              -0.78227    0.30354  -2.577   0.0133 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.578 on 45 degrees of freedom
## Multiple R-squared:  0.6426, Adjusted R-squared:  0.6267 
## F-statistic: 40.45 on 2 and 45 DF,  p-value: 8.83e-11

We can see that the new adjusted \(R^{2}\) is 0.63, and the \(MSE\) is 2.43.

Residual Diagnostics 3

These are based on the new model.

model3Augment <- augment(model3) %>% mutate(row_num = 1:n())

p1 <- ggplot(model3Augment, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  xlab("Fitted value") +
  ylab("Standardized residuals") + 
  ggtitle("Non-constant variance & non-linearity test\nFitted values - equally spread around 0") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p2 <- ggplot(model3Augment, aes(x = ranking2, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "lightblue", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX follows a slight exponential curve, but improved") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p3 <- ggplot(model3Augment, aes(x = student_faculty_ratio, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "lightblue", alpha = 0.1, se = FALSE) +
  ylab("Standardized residuals") + 
  ggtitle("Non-linearity test\nX still follows a curve, but improved") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p4 <- ggplot(model3Augment, aes(y = .std.resid)) +
  geom_boxplot() +
  ggtitle("Outlying observations test\nOne outlier") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p5 <- ggplot(model3Augment, aes(sample = .std.resid)) +
  geom_qq(alpha = 0.3) +
  geom_qq_line(linetype = "dashed", color = "red") +
  xlab("Theoretical quantile") +
  ylab("Sample quantile") +
  ggtitle("Non-normality test\nThe residuals follow normal distribution much better") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

p6 <- ggplot(model3Augment, aes(x = row_num, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Index") +
  ylab("Standardized residual") +
  ggtitle("Non-independance test\nNo residual pattern across row numbers") + 
    theme(plot.title = element_text(size = 8), axis.title = element_text(size = 8))

ggarrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)

Variable Transformation

Private Variable

We try reinserting the **private* variable into the new model, which will improve it.

model5 <- lm(alumni_giving_rate2 ~ student_faculty_ratio + ranking2 + private,
             data = alumni)  
summary(model5)
## 
## Call:
## lm(formula = alumni_giving_rate2 ~ student_faculty_ratio + ranking2 + 
##     private, data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2235 -1.2151  0.0255  0.8514  3.1533 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           11.68523    1.47088   7.944 4.85e-10 ***
## student_faculty_ratio -0.17379    0.08707  -1.996  0.05214 .  
## ranking2              -0.80677    0.29216  -2.761  0.00836 ** 
## private1               1.75265    0.81309   2.156  0.03663 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.517 on 44 degrees of freedom
## Multiple R-squared:  0.6767, Adjusted R-squared:  0.6547 
## F-statistic:  30.7 on 3 and 44 DF,  p-value: 7.259e-11

We can see that the new adjusted \(R^{2}\) is 0.65, and the \(MSE\) is 2.2.

However, we can see that the significance of the student_faculty_ratio variable dips significantly. The increase in both the adjusted \(R^{2}\) and \(MSE\) is not high enough to justify accepting the new model.

Percent_of_classes_under_20 Variable

This variable however does the opposite and decreases adjusted \(R^{2}\) and \(MSE\).

model6 <- lm(alumni_giving_rate2 ~ student_faculty_ratio + ranking2 +
               percent_of_classes_under_20,
             data = alumni)  
summary(model6)
## 
## Call:
## lm(formula = alumni_giving_rate2 ~ student_faculty_ratio + ranking2 + 
##     percent_of_classes_under_20, data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5751 -1.1999  0.1094  0.8352  3.9101 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 13.739852   2.666998   5.152 5.83e-06 ***
## student_faculty_ratio       -0.300137   0.079863  -3.758 0.000501 ***
## ranking2                    -0.755347   0.321999  -2.346 0.023557 *  
## percent_of_classes_under_20  0.008209   0.029902   0.275 0.784957    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.594 on 44 degrees of freedom
## Multiple R-squared:  0.6432, Adjusted R-squared:  0.6189 
## F-statistic: 26.44 on 3 and 44 DF,  p-value: 6.213e-10

Summary

From the last model we can see that two variables have a great influence on it. These variables are ranking and the student_faculty_ratio variables. The regression equation is as follows:

\(alumniGivingRate^{0.505} - 1.98 = 14.636 - (0.319 \times studentFacultyRatio) - (0.798 \times \log(ranking))\)

The model’s adjusted \(R^{2}\) is 62.67% which means that 62.67% if the variability is explained by its predictors. Furthermore, the model’s mean squared error is 2.43 which is very low, which also means that the model is very accurate.

Based on the negative coefficients of our predictors, it suggests an inverse relationship. Which is explained by the exploratory analysis we have done before, highlighting the negative linear patterns observed. This also mean that the alumni_giving_rate will decrease as the student_faculty_ratio increases or the ranking worsens.