Alumni donations are an important source of revenue for colleges and universities. If administrators could determine the factors that influence increases in the percentage of alumni who donate, they might be able to implement policies that could lead to increased revenues.
A study shows that students who have more access to the faculty are more likely to be satisfied. As a result, one might suspect that smaller class sizes and lower student-faculty ratios might lead to a higher percentage of satisfied graduates, which in turn might lead to an increase in the percentage of alumni who donate.
In this project, we will develop a linear regression model to study the factors affecting alumni donation in schools.
The alumni donation data set is available at https://bgreenwell.github.io/uc-bana7052/data/alumni.csv.
The data set comes from the 2006 ASA Data Expo and contains data for 48 national universities (America’s Best Colleges, Year 2000 Edition).
Variable | Description |
---|---|
school | The name of the university |
percent_of_classes_under_20 | The percentage of classes offered with fewer than 20 students |
student_faculty_ratio | The number of students enrolled divided by the total number of faculty |
alumni_giving_rate | The percentage of alumni that made a donation to the university |
private | Whether the university is private (1) or not (0) |
The following packages are used:
MASS : Support functions for applied statistics, including Box-Cox transformations
tidyverse : Data manipulation and plotting
ggplot2 : Data visualization
GGally : Extension of ggplot2 that reduces the complexity of combining plot objects
ggpubr : ggplot2-based publication-ready plots
EnvStats : Produces quantile-quantile (Q-Q) plots, also called probability plots
DT : Interface to the JavaScript library DataTables
library(MASS)
library(tidyverse)
library(ggplot2)
library(GGally)
library(ggpubr)
library(EnvStats)
library(DT)
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 18.75 29.00 29.27 38.50 67.00
Observations:
alumni_giving_rate ranges from a minimum of 7% to a maximum of 67%, with a median of 29%.
The mean giving rate is 29.27%, which is very close to the median.
There are no missing values, and the standard deviation of the giving rates is 13.44.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 44.75 59.50 55.73 66.25 77.00
Observations:
percent_of_classes_under_20 ranges from a minimum of 29% to a maximum of 77%, with a median of 59.5%.
The mean is 55.73%, a few points below the median.
There are no missing values, and the standard deviation is 13.19.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 8.00 10.50 11.54 13.50 23.00
Observations:
student_faculty_ratio ranges from a minimum of 3 to a maximum of 23, with a median of 10.5.
The mean is 11.54, slightly above the median.
There are no missing values, and the standard deviation is 4.85.
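The summaries above can be reproduced with a few base R calls. The following is a minimal sketch: `describe_var` is a hypothetical helper (not part of the original analysis), and the toy vector only stands in for a column such as `alumni$alumni_giving_rate`.

```r
# Hypothetical helper: location, spread, and missing-value count for one variable.
describe_var <- function(x) {
  c(
    Min     = min(x, na.rm = TRUE),
    Median  = median(x, na.rm = TRUE),
    Mean    = mean(x, na.rm = TRUE),
    Max     = max(x, na.rm = TRUE),
    SD      = sd(x, na.rm = TRUE),
    Missing = sum(is.na(x))
  )
}

# Toy values for illustration only; replace with a column of the alumni data.
toy <- c(7, 19, 29, 38, 67)
toy_summary <- describe_var(toy)
round(toy_summary, 2)
```

Applying the helper to each numeric column (for example with `sapply`) gives all three summaries in one table.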
The above plot suggests the following:
Student-faculty ratio has a strong negative correlation of -0.742 with the response variable: the alumni giving rate decreases as the student-faculty ratio increases.
Percentage of classes under 20 has a strong positive correlation of 0.646 with the response variable: the alumni giving rate increases as the share of small classes increases.
The type of school (private or not) appears to play an important role in determining the alumni giving rate.
The predictor variables are also strongly correlated with one another, suggesting that multicollinearity is a potential issue.
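The correlations quoted above can be checked numerically with `cor()`. The sketch below uses simulated data with the same column names as the alumni data so that it runs on its own. The simulated values and coefficients are assumptions for illustration and do not reproduce the reported figures.

```r
# Simulated stand-in for the alumni data (assumption: same column names, made-up values).
set.seed(42)
n <- 48
sim <- data.frame(student_faculty_ratio = runif(n, 3, 23))
sim$percent_of_classes_under_20 <- 80 - 2 * sim$student_faculty_ratio + rnorm(n, sd = 8)
sim$private <- as.integer(sim$student_faculty_ratio + rnorm(n, sd = 4) < 12)
sim$alumni_giving_rate <- 50 - 2 * sim$student_faculty_ratio + rnorm(n, sd = 8)

# Pairwise Pearson correlations between the response and all predictors.
cor_mat <- round(cor(sim), 3)
cor_mat

# Correlations among predictors only: large absolute values hint at multicollinearity.
predictors <- c("student_faculty_ratio", "percent_of_classes_under_20", "private")
cor_mat[predictors, predictors]
```

With the real data, replacing `sim` by the numeric columns of `alumni` should give the values shown in the plot.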
Based on the initial data analysis, all three predictor variables (student_faculty_ratio, percent_of_classes_under_20, and private) show some association with the alumni giving rate.
In order to find the best fit model, we will run a forward-selection algorithm to determine the best possible predictor variables based on \(R^2_{adj}\).
For model 1, we use variable selection to fit the model.
The steps to fit model 1 are as follows:
1. We start by regressing the response variable on the intercept alone. This is the simplest possible model.
2. With every iteration, the algorithm adds explanatory variables and picks the model with the maximum \(R^2_{adj}\).
3. Once a model is selected, we run a two-way stepwise algorithm on the model obtained in step 2. This again adds or removes variables and picks the model with the maximum \(R^2_{adj}\).
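The forward pass of this procedure can be sketched in base R as follows. This is an illustrative implementation, not necessarily the routine used to produce the results in this report. `forward_adj_r2` is a hypothetical helper, the data are simulated, and the subsequent two-way stepwise pass is not shown.

```r
# Hypothetical helper: greedy forward selection that maximizes adjusted R^2.
forward_adj_r2 <- function(data, response, candidates) {
  selected <- character(0)
  best_adj <- 0  # adjusted R^2 of the intercept-only model is 0
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # Adjusted R^2 of the current model extended by each remaining variable.
    adj <- sapply(remaining, function(v) {
      f <- reformulate(c(selected, v), response = response)
      summary(lm(f, data = data))$adj.r.squared
    })
    if (max(adj) <= best_adj) break  # no candidate improves the criterion
    best_adj <- max(adj)
    selected <- c(selected, names(which.max(adj)))
  }
  list(terms = selected, adj_r2 = best_adj)
}

# Simulated stand-in for the alumni data (made-up values, same column names).
set.seed(1)
n <- 48
sim <- data.frame(
  student_faculty_ratio = runif(n, 3, 23),
  percent_of_classes_under_20 = runif(n, 29, 77),
  private = rbinom(n, 1, 0.6)
)
sim$alumni_giving_rate <- 50 - 2 * sim$student_faculty_ratio +
  5 * sim$private + rnorm(n, sd = 6)

result <- forward_adj_r2(
  sim, "alumni_giving_rate",
  c("student_faculty_ratio", "percent_of_classes_under_20", "private")
)
result
```

The loop stops as soon as no remaining variable raises \(R^2_{adj}\), so the returned model may contain fewer variables than the candidate list.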
##
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.328 -5.692 -1.471 4.058 24.272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.0138 3.4215 15.495 < 2e-16 ***
## student_faculty_ratio -2.0572 0.2737 -7.516 1.54e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.103 on 46 degrees of freedom
## Multiple R-squared: 0.5512, Adjusted R-squared: 0.5414
## F-statistic: 56.49 on 1 and 46 DF, p-value: 1.544e-09
Without transforming any variables, the forward-selection algorithm gives the following best-fit model:
\[ \hat{Y} = 53.0138 - 2.0572 \times student\_faculty\_ratio \]
The best-fit model has only student_faculty_ratio as a predictor variable, with an \(R^2_{adj}\) of 0.5414.
However, the residuals vs. fitted values plot below suggests that this model violates the constant variance assumption: the variance increases as the fitted values increase.
To fix this increasing variance problem, we apply a Box-Cox transformation to the response variable (Y), where:
\[ Y(\lambda) = \begin{cases} \dfrac{Y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log(Y), & \lambda = 0 \end{cases} \]
Using the Box-Cox transformation, the value of λ is obtained as 0.42.
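One way to estimate λ is the profile likelihood computed by `boxcox()` from the MASS package. The sketch below assumes that this function is suitable here (the report does not state which routine was used) and runs on simulated data, so it will not return 0.42.

```r
library(MASS)

# Simulated positive response whose variance grows with its mean (made-up values).
set.seed(7)
n <- 48
sim <- data.frame(x = runif(n, 3, 23))
sim$y <- (6 - 0.15 * sim$x + rnorm(n, sd = 0.4))^2

# Profile log-likelihood over a grid of lambda values, without drawing the plot.
bc <- boxcox(y ~ x, data = sim, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# The estimate is the grid value that maximizes the log-likelihood.
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

For the alumni data, the corresponding call would use `alumni_giving_rate ~ student_faculty_ratio` with `data = alumni`. Fitting `alumni_giving_rate^0.42` rather than the full Box-Cox expression only shifts and rescales the response, so the significance tests and \(R^2_{adj}\) are unaffected.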
On running the forward selection algorithm with the transformed response variable, the final model selects student_faculty_ratio and private as the predictor variables, with an \(R^2_{adj}\) of 0.61.
##
## Call:
## lm(formula = alumni_giving_rate^0.42 ~ student_faculty_ratio +
## private, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1198 -0.3590 -0.1362 0.3982 1.0731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6694 0.4889 9.550 2.15e-12 ***
## student_faculty_ratio -0.0897 0.0271 -3.309 0.00185 **
## private 0.5532 0.2807 1.971 0.05489 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5242 on 45 degrees of freedom
## Multiple R-squared: 0.6258, Adjusted R-squared: 0.6091
## F-statistic: 37.62 on 2 and 45 DF, p-value: 2.488e-10
Using the Box-Cox transformed response variable, the model is obtained as:
\[ \widehat{Y^{0.42}} = 4.6694 - 0.0897 \times student\_faculty\_ratio + 0.5532 \times private \]
This model selects student_faculty_ratio and private as the predictor variables, with an \(R^2_{adj}\) of 0.6091.
For this model, we extend the previous model to include two-way interactions as well. An interaction effect exists when the effect of an independent variable on a dependent variable changes depending on the value(s) of one or more other independent variables.
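In R's formula syntax, all main effects and two-way interactions can be requested with the `^2` operator. The sketch below only expands the formula to show which candidate terms it generates. Using all three predictors as candidates is an assumption here.

```r
# Full candidate formula: main effects plus every two-way interaction.
full_formula <- alumni_giving_rate^0.42 ~
  (student_faculty_ratio + percent_of_classes_under_20 + private)^2

# List the terms that the formula expands to.
candidate_terms <- attr(terms(full_formula), "term.labels")
candidate_terms
```

A term such as `student_faculty_ratio:private` lets the slope of student_faculty_ratio differ between private and public universities.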
##
## Call:
## lm(formula = alumni_giving_rate^0.42 ~ student_faculty_ratio +
## private, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1198 -0.3590 -0.1362 0.3982 1.0731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6694 0.4889 9.550 2.15e-12 ***
## student_faculty_ratio -0.0897 0.0271 -3.309 0.00185 **
## private 0.5532 0.2807 1.971 0.05489 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5242 on 45 degrees of freedom
## Multiple R-squared: 0.6258, Adjusted R-squared: 0.6091
## F-statistic: 37.62 on 2 and 45 DF, p-value: 2.488e-10
Including two-way interactions in the second model and running the forward selection algorithm gives a model with an \(R^2_{adj}\) of 0.6091. In other words, no interaction term is selected, and the model is unchanged from model 2.
The final variation involves using a log transformation on the dependent variable. The final model selects student_faculty_ratio and private as the predictor variables and has an \(R^2_{adj}\) of 0.6222.
##
## Call:
## lm(formula = log(alumni_giving_rate) ~ student_faculty_ratio +
## private, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73017 -0.19276 -0.06152 0.24370 0.59670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.61808 0.30584 11.830 2.09e-15 ***
## student_faculty_ratio -0.05467 0.01696 -3.224 0.00235 **
## private 0.38773 0.17558 2.208 0.03236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3279 on 45 degrees of freedom
## Multiple R-squared: 0.6383, Adjusted R-squared: 0.6222
## F-statistic: 39.7 on 2 and 45 DF, p-value: 1.158e-10
\[ \widehat{\log(Y)} = 3.618 - 0.0547 \times student\_faculty\_ratio + 0.388 \times private \]
The p-value for each slope coefficient is less than 0.05. Hence, the slope coefficients are individually statistically significant at the 5% level of significance.
β0: On average, the estimated value of log(Y) is 3.62 when each Xi = 0.
β1: On average, the estimated log(Y) decreases by 0.055 units for a unit increase in student_faculty_ratio, keeping the other predictor constant.
β2: On average, the estimated log(Y) is 0.39 units higher for a private university (private = 1), keeping the other predictor constant.
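Because the response is on the log scale, the slopes can also be read as approximate percentage changes in the giving rate itself by exponentiating them. The short sketch below uses the coefficient estimates from the model output above. The back-transformation is an added interpretation aid and not part of the original analysis.

```r
# Slope estimates copied from the log-response model output.
coefs <- c(student_faculty_ratio = -0.05467, private = 0.38773)

# Multiplicative effect on the giving rate, expressed as a percentage change.
pct_change <- 100 * (exp(coefs) - 1)
round(pct_change, 1)
```

Under this reading, each additional student per faculty member is associated with a giving rate roughly 5% lower, and private universities have a giving rate roughly 47% higher, holding the other predictor fixed.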
Several criteria help to select an “optimal” model from the models produced by automatic search procedures, for example:
\(R^2_{adj}\) : Adjusted R-squared (larger is better)
RMSE : Root mean square error (smaller is better)
AIC : Akaike Information Criterion (smaller is better)
BIC : Bayesian information criterion (smaller is better)
PRESS : Prediction sum of squares (smaller is better)
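All five criteria can be computed from a fitted `lm` object with base R. The sketch below is one possible implementation: `model_criteria` is a hypothetical helper, RMSE is taken to be the residual standard error (which matches the values in the model summaries above), and the data are simulated.

```r
# Hypothetical helper: selection criteria for one fitted linear model.
model_criteria <- function(fit) {
  res <- residuals(fit)
  c(
    AIC    = AIC(fit),
    BIC    = BIC(fit),
    adjR2  = summary(fit)$adj.r.squared,
    RMSE   = summary(fit)$sigma,                      # residual standard error
    PRESS  = sum((res / (1 - hatvalues(fit)))^2),     # leave-one-out prediction errors
    nterms = length(coef(fit))                        # coefficients incl. intercept
  )
}

# Simulated data for illustration only.
set.seed(11)
n <- 48
sim <- data.frame(x = runif(n, 3, 23))
sim$y <- 50 - 2 * sim$x + rnorm(n, sd = 9)

fit <- lm(y ~ x, data = sim)
crit <- model_criteria(fit)
round(crit, 3)
```

Calling the helper on each candidate model and binding the results column-wise (for example with `sapply`) gives a comparison table with one column per model.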
Based on the above-mentioned criteria, the four models fitted in the previous section are compared as follows:
## model_1 model_2 model_3 model_4
## AIC 352.196 79.119 79.119 34.083
## BIC 357.810 86.603 86.603 41.567
## adjR2 0.541 0.609 0.609 0.622
## RMSE 9.103 0.524 0.524 0.328
## PRESS 4138.880 14.426 14.426 5.707
## nterms 2.000 3.000 3.000 3.000
Observations:
Model 4 shows the lowest AIC, BIC, RMSE, and PRESS of the four models.
Model 4 has the highest \(R^2_{adj}\).
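Hence, Model 4 is selected as the best of the four models. Because the models use different transformations of the response, AIC, BIC, RMSE, and PRESS are measured on different scales, so the comparison across transformations is indicative rather than exact.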
Most of the residuals fall close to the straight red line. Thus, the graph supports our assumption that the residuals (and hence the error terms) are approximately normally distributed.
Linearity, Constant Variance, and Potential Outliers
• All the studentized residuals lie within ±3, indicating no potential outliers in the response.
• The residuals are more or less randomly scattered around zero, with constant variance.
• The random scatter, with no clear pattern, suggests that the linear model is a valid assumption.
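These residual checks can be sketched with base R as follows. The data are simulated and the variable names `x` and `y` are placeholders. With the real data the model would be `lm(log(alumni_giving_rate) ~ student_faculty_ratio + private, data = alumni)`.

```r
# Simulated data standing in for the final log-response model (made-up values).
set.seed(3)
n <- 48
sim <- data.frame(x = runif(n, 3, 23), private = rbinom(n, 1, 0.6))
sim$y <- exp(3.6 - 0.05 * sim$x + 0.4 * sim$private + rnorm(n, sd = 0.3))

fit <- lm(log(y) ~ x + private, data = sim)

# Externally studentized residuals; values beyond +/- 3 are flagged as outliers.
r_stud <- rstudent(fit)
flagged <- which(abs(r_stud) > 3)
flagged

# Residuals vs. fitted values: look for funnels (non-constant variance) or curvature.
plot(fitted(fit), r_stud, xlab = "Fitted values", ylab = "Studentized residuals")
abline(h = c(-3, 0, 3), lty = 2)

# Normal Q-Q plot with a red reference line.
qqnorm(r_stud)
qqline(r_stud, col = "red")
```

An empty `flagged` vector corresponds to the first bullet above, and a patternless scatter around zero in the first plot corresponds to the other two.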
Multicollinearity
Multicollinearity is not a concern, as the model has only one quantitative predictor.
University of Florida and University of Washington are potential outliers in the predictor space (student_faculty_ratio), that is, high-leverage points that can strongly influence the model.
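Outliers in the predictor space are usually identified through leverage (hat) values. The sketch below flags observations whose leverage exceeds 2p/n, a common rule of thumb that is assumed here because the report does not state the cutoff it used. The data are simulated, with two deliberately extreme predictor values.

```r
# Simulated predictor with two unusually large values (made-up data).
set.seed(5)
n <- 48
sim <- data.frame(x = c(runif(n - 2, 3, 15), 22, 23),
                  private = rbinom(n, 1, 0.6))
sim$y <- exp(3.6 - 0.05 * sim$x + 0.4 * sim$private + rnorm(n, sd = 0.3))

fit <- lm(log(y) ~ x + private, data = sim)

# Leverage of each observation; the values sum to the number of coefficients p.
h <- hatvalues(fit)
p <- length(coef(fit))
cutoff <- 2 * p / n  # rule-of-thumb threshold (assumption)

high_leverage <- which(h > cutoff)
high_leverage

# Cook's distance combines leverage and residual size into one influence measure.
round(cooks.distance(fit)[high_leverage], 3)
```

With the real data, indexing the school names by `high_leverage` would list the universities concerned.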
Small sample size of 48 observations
Limited number of attributes
We can use external data to determine whether older, more established universities have more alumni who are inclined to donate to their alma mater.
Do universities located in key geographic centers (big cities, industrial hubs) have a nearby alumni base that is more interested in university affairs after graduation?
With demographic data for each graduating class, we could further explore which factors affect the alumni donation rate.
What the alumni are currently doing may also play a significant role in the alumni donation rate.