Introduction

Problem Statement

Alumni donations are an important source of revenue for colleges and universities. If administrators could determine the factors that influence increases in the percentage of alumni who donate, they might be able to implement policies that could lead to increased revenues.

A study shows that students who have more access to the faculty are more likely to be satisfied. As a result, one might suspect that smaller class sizes and lower student-faculty ratios might lead to a higher percentage of satisfied graduates, which in turn might lead to an increase in the percentage of alumni who donate.

In this project, we will develop a linear regression model to study the factors affecting alumni donation in schools.

Data

The alumni donation data set is available at https://bgreenwell.github.io/uc-bana7052/data/alumni.csv

The data set comes from the 2006 ASA Data Expo and contains data for 48 national universities (America’s Best Colleges, Year 2000 Edition).

Data Dictionary

Variable                      Description
school                        The name of the university (read in as ï..school if the file's byte-order mark is not stripped on import)
percent_of_classes_under_20   The percentage of classes offered with fewer than 20 students
student_faculty_ratio         The number of students enrolled divided by the total number of faculty
alumni_giving_rate            The percentage of alumni that made a donation to the university
private                       Whether the university is private or not

Setup

Loading the required packages

The following packages are used:

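  • MASS : Support functions for statistical modelling, including a boxcox() function

  • tidyverse : For data manipulation and plotting graphs

  • ggplot2 : For data visualizations (also attached by tidyverse)

  • GGally : Extension of ‘ggplot2’; reduces the complexity of combining plot objects

  • ggpubr : ‘ggplot2’-based helpers for arranging and annotating plots

  • EnvStats : Produces quantile-quantile (Q-Q) plots, also called probability plots

  • DT : Interface to the JavaScript library DataTables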

library(MASS)
library(tidyverse)
library(ggplot2)
library(GGally)
library(ggpubr)
library(EnvStats)
library(DT)

Reading the data

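# fileEncoding = "UTF-8-BOM" strips the byte-order mark that otherwise
# turns the first column name into "ï..school"
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url, fileEncoding = "UTF-8-BOM")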

Final Data

Initial Analysis

Univariate Analyses

alumni_giving_rate (response variable) :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   18.75   29.00   29.27   38.50   67.00

Observations:

  • alumni_giving_rate ranges from a minimum of 7% to a maximum of 67%, with a median of 29%.

  • The average giving rate is 29.27%, which is very close to the median.

  • There are no missing values, and the standard deviation of the giving rate is 13.44.

percent_of_classes_under_20 (predictor variable) :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   44.75   59.50   55.73   66.25   77.00

Observations:

  • percent_of_classes_under_20 ranges from a minimum of 29% to a maximum of 77%, with a median of 59.5%.

  • The average value is 55.73%, about 3.8 percentage points below the median, which suggests a slight left skew.

  • There are no missing values, and the standard deviation is 13.19.

student_faculty_ratio (predictor variable) :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    8.00   10.50   11.54   13.50   23.00

Observations:

  • student_faculty_ratio ranges from a minimum of 3 to a maximum of 23, with a median of 10.5.

  • The average is 11.54, about one unit above the median, which suggests a mild right skew.

  • There are no missing values, and the standard deviation is 4.85.
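The summaries above can be reproduced with a few base R calls. The helper below is a minimal sketch; `describe_variable` is a hypothetical name that does not appear in the original analysis, and the toy vector is illustrative only.

```r
# Hypothetical helper (not from the original report): location, spread,
# and missing-value count for one numeric vector.
describe_variable <- function(x) {
  observed <- x[!is.na(x)]          # drop missing values before summarising
  c(
    Min     = min(observed),
    Median  = median(observed),
    Mean    = mean(observed),
    Max     = max(observed),
    SD      = sd(observed),         # sample standard deviation
    Missing = sum(is.na(x))         # number of NA entries in the input
  )
}

# Toy values for illustration only; these are NOT the alumni data.
toy_values <- c(10, 20, 30, 40, NA)
toy_summary <- describe_variable(toy_values)
toy_summary
```

Applied to the report's data, the call would be `describe_variable(alumni$alumni_giving_rate)`, and likewise for the two predictors.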

Correlation

The pairwise correlation plot suggests the following:

  • Student-faculty ratio has a strong negative correlation of -0.742 with the response variable: the alumni giving rate decreases as the student-faculty ratio increases.

  • Percentage of classes under 20 has a strong positive correlation of 0.646 with the response variable: the alumni giving rate increases as the share of small classes increases.

  • The type of school (private or not) appears to play a substantial role in determining the alumni giving rate.

  • The predictor variables are also strongly correlated with one another, suggesting that multicollinearity is a potential issue.
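The correlations quoted above can be checked numerically. The following is a minimal sketch using base R's `cor()`; `numeric_correlations` is a hypothetical helper name, and the toy data frame stands in for the real data.

```r
# Hypothetical helper: Pearson correlation matrix of the numeric columns,
# rounded to three decimals as in the text above.
numeric_correlations <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))   # keep numeric columns only
  round(cor(df[is_num], use = "complete.obs"), 3)
}

# Toy data frame for illustration only; NOT the alumni data.
toy <- data.frame(
  name = c("a", "b", "c", "d"),
  x    = c(1, 2, 3, 4),
  y    = c(8, 6, 4, 2)
)
toy_cor <- numeric_correlations(toy)
toy_cor
```

On the report's data the call would be `numeric_correlations(alumni)`. `GGally::ggpairs()` on the same columns draws a scatterplot matrix with these correlations, which is presumably how the original plot was produced.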

Modelling and Results

Model Building

Based on the initial data analysis, the following predictor variables show some association with the alumni giving rate:

  • student_faculty_ratio
  • percent_of_classes_under_20
  • private

In order to find the best fit model, we will run a forward-selection algorithm to determine the best possible predictor variables based on \(R^2_{adj}\).

Model 1

For model 1, we use variable selection to fit the model.

The steps to fit model 1 are as follows:

  1. We start by regressing the response variable on the intercept alone. This is the simplest model possible.

  2. At each iteration, the algorithm adds the predictor variable that gives the largest \(R^2_{adj}\), and it stops when no addition improves \(R^2_{adj}\).

  3. Once a model is selected, we run a two-way stepwise algorithm on the model obtained in step 2. This again adds or removes predictors and picks the model with the maximum \(R^2_{adj}\).
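Steps 1 and 2 can be sketched as follows. The report does not show its selection code, so this is an assumed implementation: `forward_select_adj_r2` is a hypothetical name, and the two-way stepwise pass of step 3 is not included.

```r
# Sketch of forward selection on adjusted R-squared (assumed procedure;
# the original code is not shown in the report).
forward_select_adj_r2 <- function(data, response, candidates) {
  selected <- character(0)
  best_score <- 0   # the intercept-only model has adjusted R-squared 0
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # Adjusted R-squared of each model that adds one remaining term.
    scores <- vapply(remaining, function(term) {
      f <- reformulate(c(selected, term), response = response)
      summary(lm(f, data = data))$adj.r.squared
    }, numeric(1))
    if (max(scores) <= best_score) break   # no candidate improves the fit
    best_score <- max(scores)
    selected <- c(selected, remaining[which.max(scores)])
  }
  # Refit the chosen model; fall back to intercept-only if nothing was added.
  rhs <- if (length(selected) > 0) selected else "1"
  lm(reformulate(rhs, response = response), data = data)
}

# Toy data for illustration only; NOT the alumni data.
toy <- data.frame(x1 = 1:20, x2 = cos(3 * (1:20)))
toy$y <- 2 * toy$x1 + sin(1:20)
toy_fit <- forward_select_adj_r2(toy, "y", c("x1", "x2"))
formula(toy_fit)
```

For the report's data, the call would look like `forward_select_adj_r2(alumni, "alumni_giving_rate", c("student_faculty_ratio", "percent_of_classes_under_20", "private"))`. Note that base R's `step()` selects on AIC rather than \(R^2_{adj}\), which is why a small custom loop is used here.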

## 
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio, data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.328  -5.692  -1.471   4.058  24.272 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            53.0138     3.4215  15.495  < 2e-16 ***
## student_faculty_ratio  -2.0572     0.2737  -7.516 1.54e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.103 on 46 degrees of freedom
## Multiple R-squared:  0.5512, Adjusted R-squared:  0.5414 
## F-statistic: 56.49 on 1 and 46 DF,  p-value: 1.544e-09

Without transforming any variables, the forward-selection algorithm gives the following best-fit model:

\[ \hat{Y} = 53.0138 - 2.0572 \times \text{student\_faculty\_ratio} \]

The best-fit model has student_faculty_ratio as its only predictor variable, with an \(R^2_{adj}\) of 0.5414.

However, the residuals vs. fitted-values plot suggests that this model violates the constant-variance assumption: the variance of the residuals increases as the fitted values increase.

Model 2

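To fix this increasing-variance problem, we apply a Box-Cox transformation to the response variable (Y), where:

\[ Y^{(\lambda)} = \begin{cases} \dfrac{Y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log(Y), & \lambda = 0 \end{cases} \]

Using the Box-Cox procedure, the value of λ is estimated as 0.42. The model below is fitted with the simpler power form Y^0.42, which differs from the Box-Cox form only by a linear rescaling and therefore gives the same fit.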
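The estimate of λ can be obtained by profiling the Box-Cox log-likelihood over a grid. The sketch below assumes `MASS::boxcox()` was the function used (EnvStats also ships a function of the same name); `estimate_boxcox_lambda` is a hypothetical helper, and the toy data are illustrative only.

```r
# Hypothetical helper: profile the Box-Cox log-likelihood over a grid of
# lambda values and return the lambda that maximises it.
# Assumes MASS::boxcox(); the report does not say which boxcox() it used.
estimate_boxcox_lambda <- function(model_formula, data,
                                   grid = seq(-2, 2, by = 0.01)) {
  profile <- MASS::boxcox(model_formula, data = data,
                          lambda = grid, plotit = FALSE)
  profile$x[which.max(profile$y)]   # x: lambda grid, y: log-likelihood
}

# Toy data for illustration only; NOT the alumni data.
# y is exponential in x, so the estimate should be close to 0 (log).
toy <- data.frame(x = 1:30)
toy$y <- exp(0.1 * toy$x + 0.05 * sin(toy$x))
lambda_hat <- estimate_boxcox_lambda(y ~ x, toy)
lambda_hat
```

For Model 1 the call would be `estimate_boxcox_lambda(alumni_giving_rate ~ student_faculty_ratio, alumni)`.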

Running the forward-selection algorithm with the transformed response variable, the final model selects student_faculty_ratio and private as the predictor variables, with an \(R^2_{adj}\) of about 0.61.

## 
## Call:
## lm(formula = alumni_giving_rate^0.42 ~ student_faculty_ratio + 
##     private, data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1198 -0.3590 -0.1362  0.3982  1.0731 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.6694     0.4889   9.550 2.15e-12 ***
## student_faculty_ratio  -0.0897     0.0271  -3.309  0.00185 ** 
## private                 0.5532     0.2807   1.971  0.05489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5242 on 45 degrees of freedom
## Multiple R-squared:  0.6258, Adjusted R-squared:  0.6091 
## F-statistic: 37.62 on 2 and 45 DF,  p-value: 2.488e-10

Using the Box-Cox transformed response variable, the fitted model is:

\[ \widehat{Y^{0.42}} = 4.6694 - 0.0897 \times \text{student\_faculty\_ratio} + 0.5532 \times \text{private} \]

This model selects student_faculty_ratio and private as the predictor variables, with an \(R^2_{adj}\) of 0.6091.

Model 3

For this model, we extend the previous model to include two-way interactions as well. An interaction effect exists when the effect of an independent variable on the dependent variable changes depending on the value(s) of one or more other independent variables.

## 
## Call:
## lm(formula = alumni_giving_rate^0.42 ~ student_faculty_ratio + 
##     private, data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1198 -0.3590 -0.1362  0.3982  1.0731 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.6694     0.4889   9.550 2.15e-12 ***
## student_faculty_ratio  -0.0897     0.0271  -3.309  0.00185 ** 
## private                 0.5532     0.2807   1.971  0.05489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5242 on 45 degrees of freedom
## Multiple R-squared:  0.6258, Adjusted R-squared:  0.6091 
## F-statistic: 37.62 on 2 and 45 DF,  p-value: 2.488e-10

Allowing two-way interactions in the second model and rerunning the forward-selection algorithm gives a model with an \(R^2_{adj}\) of 0.6091. No interaction term is selected, so the result is identical to Model 2.
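To make the candidate set concrete, the snippet below lists the terms that become available once two-way interactions are allowed. It is a sketch of how such a scope is usually written in an R formula; the exact scope passed to the selection routine is not shown in the report.

```r
# The (a + b + c)^2 shorthand expands to the main effects plus every
# pairwise interaction. Only the formula is inspected; no data are needed.
interaction_scope <- ~ (student_faculty_ratio +
                          percent_of_classes_under_20 +
                          private)^2

# Extract the expanded term labels from the formula.
candidate_terms <- attr(terms(interaction_scope), "term.labels")
candidate_terms
```

This yields three main effects and three interaction terms, such as `student_faculty_ratio:private`, for the selection algorithm to consider.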

Model 4

The final variation involves using a log transformation on the dependent variable. The final model selects student_faculty_ratio and private as the predictor variables and has an \(R^2_{adj}\) of 0.6222.

## 
## Call:
## lm(formula = log(alumni_giving_rate) ~ student_faculty_ratio + 
##     private, data = alumni)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73017 -0.19276 -0.06152  0.24370  0.59670 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.61808    0.30584  11.830 2.09e-15 ***
## student_faculty_ratio -0.05467    0.01696  -3.224  0.00235 ** 
## private                0.38773    0.17558   2.208  0.03236 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3279 on 45 degrees of freedom
## Multiple R-squared:  0.6383, Adjusted R-squared:  0.6222 
## F-statistic:  39.7 on 2 and 45 DF,  p-value: 1.158e-10

\[ \widehat{\log(Y)} = 3.618 - 0.0547 \times \text{student\_faculty\_ratio} + 0.388 \times \text{private} \]

The p-value for each slope coefficient is below 0.05, so each slope coefficient is individually statistically significant at the 5% level of significance.

  • β0: On average, the estimated value of log(Y) is 3.62 when every Xi = 0.

  • β1: On average, the estimated log(Y) decreases by 0.055 units for a one-unit increase in student_faculty_ratio, keeping the other predictor constant.

  • β2: On average, the estimated log(Y) is 0.39 units higher for a private university (private = 1), keeping the other predictor constant.
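Because the response is on the log scale, the slopes are easier to read after back-transforming. The snippet below is a small worked calculation using the coefficients copied from the Model 4 output; reading the result as an approximate percent change in the giving rate is the standard interpretation for a log-response model, not something stated in the original report.

```r
# Slope coefficients copied from the Model 4 summary above.
log_scale_coefs <- c(student_faculty_ratio = -0.05467, private = 0.38773)

# On the original scale each coefficient acts multiplicatively:
# a one-unit increase multiplies the predicted giving rate by exp(coef).
percent_change <- (exp(log_scale_coefs) - 1) * 100
round(percent_change, 1)
```

This gives roughly a 5% lower giving rate per additional student per faculty member, and a giving rate roughly 47% higher (in relative terms) for private universities.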

Model Selection

A number of criteria help to select an “optimal” model from the models produced by automatic search procedures, for example:

  • \(R^2_{adj}\) : Adjusted R-squared (larger is better)

  • RMSE : Root mean square error (smaller is better)

  • AIC : Akaike Information Criterion (smaller is better)

  • BIC : Bayesian information criterion (smaller is better)

  • PRESS : Prediction sum of squares (smaller is better)

Based on these criteria, the four models fitted in the previous section are compared as follows:

##         model_1 model_2 model_3 model_4
## AIC     352.196  79.119  79.119  34.083
## BIC     357.810  86.603  86.603  41.567
## adjR2     0.541   0.609   0.609   0.622
## RMSE      9.103   0.524   0.524   0.328
## PRESS  4138.880  14.426  14.426   5.707
## nterms    2.000   3.000   3.000   3.000

Observations:

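  • Model 4 has the lowest AIC, BIC, RMSE, and PRESS of the four models.

  • Model 4 has the highest \(R^2_{adj}\).

Hence, Model 4 is selected as the best of the four models. One caution: AIC, BIC, RMSE, and PRESS are computed on the scale of each model's response, so their values are not directly comparable between models whose responses are transformed differently (Y, Y^0.42, and log(Y)). The comparison of \(R^2_{adj}\) and the residual diagnostics below therefore carry more weight.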
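A comparison table like the one above can be assembled from standard accessors on each fitted model. The helper below is a sketch; `selection_criteria` is a hypothetical name, RMSE is taken to be the residual standard error (which matches the values in the table), and the toy fit uses R's built-in `mtcars` data rather than the alumni models.

```r
# Hypothetical helper: selection criteria for one fitted lm object.
selection_criteria <- function(fit) {
  # PRESS: sum of squared leave-one-out prediction errors, computed from
  # the ordinary residuals and leverages without refitting the model.
  press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
  c(
    AIC    = AIC(fit),
    BIC    = BIC(fit),
    adjR2  = summary(fit)$adj.r.squared,
    RMSE   = sigma(fit),             # residual standard error
    PRESS  = press,
    nterms = length(coef(fit))       # coefficients, including the intercept
  )
}

# Toy fit on a built-in data set for illustration; NOT the alumni models.
toy_fit <- lm(mpg ~ wt, data = mtcars)
toy_criteria <- selection_criteria(toy_fit)
round(toy_criteria, 3)
```

With the four fitted models stored in a named list (names assumed here), `sapply(list(model_1 = m1, model_2 = m2, model_3 = m3, model_4 = m4), selection_criteria)` would produce one column per model.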

Model Diagnostics

  1. Normality Assumption

Most of the residuals fall close to the straight red line of the Q-Q plot. The plot therefore supports the assumption that the residuals (and hence the error terms) are approximately normally distributed.

  2. Linearity, Constant Variance, and Potential Outliers

    • All studentized residuals lie within ±3, indicating no potential outliers in the response.

    • The residuals are more or less randomly scattered around zero, with roughly constant variance.

    • The random scatter, with no clear pattern, suggests that a linear model is a valid assumption.

  3. Multicollinearity

Multicollinearity is not a concern, as the final model contains only one quantitative predictor.

  4. Outliers in Predictor Space

University of Florida and University of Washington are potential outliers in the predictor space (student_faculty_ratio), which means they may have a strong influence on the model.
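The outlier checks in items 2 and 4 can be sketched with base R diagnostics. The ±3 cutoff on studentized residuals comes from the text above; the leverage cutoff of 2p/n is a common rule of thumb and an assumption here, since the report does not state which rule it used. `flag_unusual_points` is a hypothetical helper, and the toy data are illustrative only.

```r
# Hypothetical helper: flag observations by two common rules of thumb.
flag_unusual_points <- function(fit) {
  n <- nobs(fit)
  p <- length(coef(fit))
  list(
    # |externally studentized residual| > 3: outlying in the response
    response_outliers = unname(which(abs(rstudent(fit)) > 3)),
    # leverage > 2p/n (assumed cutoff): outlying in the predictor space
    high_leverage = unname(which(hatvalues(fit) > 2 * p / n))
  )
}

# Toy data for illustration only; NOT the alumni data.
# The 11th point sits far from the others in x but follows the same trend.
toy <- data.frame(
  x = c(1:10, 40),
  y = c(2 * (1:10) + rep(c(0.5, -0.5), 5), 80)
)
flags <- flag_unusual_points(lm(y ~ x, data = toy))
flags$high_leverage
```

Run on the final model, the `high_leverage` element would give the row numbers to look up in the school column.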

Limitations and Next Steps

Limitations:

  • Small sample size of 48 observations

  • Limited number of attributes

Next Steps:

  • External data could show whether older, more established universities have a larger number of alumni who are inclined to donate to their alma mater.

  • Location could be examined: universities in key geographic centers (big cities, industrial hubs) may have an alumni base nearby that stays more interested in university affairs after graduation.

  • With demographic data for each graduating class, we could investigate further which factors affect the alumni donation rate.

  • What the alumni are currently doing may also play a significant role in the alumni donation rate.