Alumni Donation Case Study: Applied Linear Regression

Team - Sai Mounica Gudimella, Kushal Gupta, Nikhil Agarwal

Introduction

Alumni donations are an important source of revenue for colleges and universities. If a regression model can identify the factors associated with higher alumni donations, administrators can make informed decisions to increase donations and, in turn, the overall revenue from this source.

A study shows that students who have more access to the faculty are more likely to be satisfied. As a result, one might suspect that smaller class sizes and lower student-faculty ratios might lead to a higher percentage of satisfied graduates, which in turn might lead to an increase in the percentage of alumni who donate.

This report aims to do exactly that. We look at the factors associated with alumni donations and build a regression model with the alumni giving rate as the response variable and the percentage of classes under 20, the student/faculty ratio and private status as candidate predictor variables.

Data Description

Variable                 Definition
School                   Name of the school/university
% of Classes Under 20    Percentage of classes with fewer than 20 students
Student/Faculty Ratio    Number of students enrolled divided by the total number of faculty
Alumni Giving Rate       Percentage of alumni who made a donation to the university
Private                  Binary variable: ‘1’ if the university is private, ‘0’ otherwise

Packages used

library(readxl)
library(car)
library(tidyverse)
library(psych)
library(packHV)

EDA

We have changed the categorical variable ‘Private’ to factor form for further analysis
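The conversion can be sketched as follows. The data frame below is a small made-up stand-in (the values are not from the report's dataset), and only the column names are taken from the models used later. Once the column is a factor, lm() treats it as categorical, which is why the coefficient is labelled private1 in the regression output.

```r
# Hypothetical stand-in for the report's data; the values are made up.
alumni_data <- data.frame(
  student_faculty_ratio = c(3, 8, 10, 13, 19, 23),
  alumni_giving_rate    = c(46, 40, 31, 25, 13, 9),
  private               = c(1, 1, 1, 0, 0, 0)
)

# Convert the 0/1 indicator to a factor so that lm() treats it as categorical.
alumni_data$private <- factor(alumni_data$private, levels = c(0, 1))

levels(alumni_data$private)
table(alumni_data$private)
```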

PART A) EDA for Percentage of Classes Under 20:

##    vars  n  mean    sd median trimmed  mad min max range  skew kurtosis  se
## X1    1 48 55.73 13.19   59.5    56.4 12.6  29  77    48 -0.47    -1.07 1.9
  • Supplementing with a histogram plus boxplot for visual representation:

  • Observations
    • We see that there are no outliers in this data.
    • The distribution does not look normal.
    • From the describe metrics we see a mild negative skew (-0.47) and negative kurtosis (-1.07), i.e. the distribution is flatter than a normal curve.
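As a rough illustration of how the skew and kurtosis columns are obtained, the sketch below computes moment-based skewness and excess kurtosis in base R. The helper names and the sample vector are made up for illustration, and psych::describe() may apply a small-sample adjustment, so its numbers can differ slightly from these.

```r
# Moment-based skewness: third central moment scaled by variance^(3/2).
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment scaled by variance^2, minus 3
# so that a normal distribution scores roughly zero.
sample_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Made-up percentages, for illustration only
x <- c(29, 35, 42, 48, 55, 60, 61, 65, 68, 72, 77)
sample_skewness(x)
sample_excess_kurtosis(x)
```

A negative skewness indicates a longer left tail, and a negative excess kurtosis indicates a flatter, lighter-tailed shape than the normal curve.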

PART B) EDA for Student Faculty Ratio:

##    vars  n  mean   sd median trimmed  mad min max range skew kurtosis  se
## X1    1 48 11.54 4.85   10.5    11.3 3.71   3  23    20 0.55    -0.62 0.7
  • Supplementing with a histogram plus boxplot for visual representation:

  • Observations
    • We see that there are no outliers in this data.
    • The distribution does not look normal.
    • From the describe metrics we see a mild positive skew (0.55) and some negative kurtosis (-0.62).

PART C) EDA for alumni giving rate:

##    vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 48 29.27 13.44     29   28.75 15.57   7  67    60 0.35     -0.3 1.94
  • Supplementing with a histogram plus boxplot for visual representation:

  • Observations
    • We see that there are no outliers in this data.
    • The distribution does not look normal.
    • From the describe metrics we see a small positive skew (0.35) and slightly negative kurtosis (-0.3).

Correlation Analysis

Modelling

Model 1

Description of Algorithm

Stepwise regression is the step-by-step iterative construction of a regression model that involves the selection of independent variables to be used in a final model. It involves adding or removing potential explanatory variables in succession and testing for statistical significance after each iteration based on AIC values produced by the model.

#Stepwise Algorithm to find the best fit model

alumni_fitAll <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio + private)
alumni_fitboth <- lm(alumni_giving_rate ~ 1)
step(alumni_fitboth, direction="both", scope=formula(alumni_fitAll))
## Start:  AIC=250.43
## alumni_giving_rate ~ 1
## 
##                               Df Sum of Sq    RSS    AIC
## + student_faculty_ratio        1    4680.1 3811.4 213.98
## + private                      1    4038.0 4453.5 221.45
## + percent_of_classes_under_20  1    3539.8 4951.7 226.54
## <none>                                     8491.5 250.43
## 
## Step:  AIC=213.98
## alumni_giving_rate ~ student_faculty_ratio
## 
##                               Df Sum of Sq    RSS    AIC
## + private                      1     184.2 3627.2 213.60
## <none>                                     3811.4 213.98
## + percent_of_classes_under_20  1      86.5 3724.9 214.88
## - student_faculty_ratio        1    4680.1 8491.5 250.43
## 
## Step:  AIC=213.6
## alumni_giving_rate ~ student_faculty_ratio + private
## 
##                               Df Sum of Sq    RSS    AIC
## <none>                                     3627.2 213.60
## - private                      1    184.19 3811.4 213.98
## + percent_of_classes_under_20  1     15.34 3611.8 215.40
## - student_faculty_ratio        1    826.34 4453.5 221.45
## 
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + private)
## 
## Coefficients:
##           (Intercept)  student_faculty_ratio               private1  
##                41.429                 -1.486                  7.267
  • On running this algorithm, we find the AIC of the selected model to be 213.60, with the regression equation being \[Y = 41.429 - 1.486 \times \text{student\_faculty\_ratio} + 7.267 \times \text{private}\]

  • The summary shows that

    • R-squared = 57.28%
    • Adjusted R-squared = 55.39%
    • The p-value is below 0.05, so we reject the null hypothesis of no linear relationship: there is a statistically significant linear relationship between the response variable and the predictors student_faculty_ratio and private
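The figures above can be cross-checked by hand from the residual sums of squares printed in the stepwise output. The sketch below assumes the criterion that step() reports for a linear model, n * log(RSS / n) + 2 * p, where p is the number of estimated coefficients; the variable names are ours.

```r
n         <- 48       # number of schools
rss_null  <- 8491.5   # RSS of the intercept-only model (total sum of squares)
rss_model <- 3627.2   # RSS of student_faculty_ratio + private
p         <- 3        # estimated coefficients, including the intercept

r_squared     <- 1 - rss_model / rss_null
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p)

# AIC as reported by step() for lm fits
aic_model <- n * log(rss_model / n) + 2 * p
aic_null  <- n * log(rss_null / n) + 2 * 1

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4)
round(c(aic_null = aic_null, aic_model = aic_model), 2)
```

Note that this criterion differs from the value returned by AIC() by an additive constant, so only differences between models fitted to the same response are meaningful.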

  • Observations
    • We see no specific pattern in the Residuals vs. Student Faculty Ratio graph
    • The residual plot for Private shows two vertical bands, as expected, because this variable takes only two values, 0 and 1
    • The plot of residuals against the observed alumni giving rate shows a linear trend. Least-squares residuals are always positively correlated with the observed response, so this trend on its own does not indicate a violation of the residual assumptions
    • The residuals vs. fitted values scatter plot looks similar to the one for student faculty ratio; at first glance it suggests roughly normal residuals with constant variance, which the diagnostics below examine more closely

Plotting the histograms, QQ plots, Cook’s distance plot and performing the Shapiro test for Normality:

##    vars  n mean   sd median trimmed  mad    min   max range skew kurtosis   se
## X1    1 48    0 8.78  -2.19   -0.59 7.63 -16.37 25.74 42.11 0.74     0.36 1.27
## 
##  Shapiro-Wilk normality test
## 
## data:  alumni_model1$residuals
## W = 0.95092, p-value = 0.04352
  • From the above graphs we can observe the following:
    • Fitted vs. Residual - The spread of the residuals is not constant as the fitted values increase
    • QQ plot / Histogram - The model appears to violate the normality assumption. The p-value from the Shapiro-Wilk test is below 0.05, so we reject the null hypothesis and conclude that the residuals do not follow a normal distribution. The hypotheses of the Shapiro-Wilk test are:
    • H0 - The errors follow a normal distribution (Null Hypothesis)
    • H1 - The errors do not follow a normal distribution (Alternate Hypothesis)
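A minimal sketch of this decision rule is shown below. It uses simulated data rather than the report's dataset, so the fitted model and the resulting p-value are purely illustrative.

```r
# Simulated data, for illustration only
set.seed(1)
x <- runif(48, min = 3, max = 23)
y <- 45 - 1.5 * x + rnorm(48, sd = 9)
fit <- lm(y ~ x)

# Shapiro-Wilk test on the model residuals
sw <- shapiro.test(residuals(fit))
sw$statistic
sw$p.value

# Reject H0 (normal errors) at the 5% level only if the p-value is below 0.05
reject_normality <- sw$p.value < 0.05
reject_normality
```

For Model 1 the reported p-value of 0.04352 falls just below 0.05, so the same rule rejects normality.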

Model 1 Diagnostics

lmtest::bptest(alumni_model1) #Breusch-Pagan test
## 
##  studentized Breusch-Pagan test
## 
## data:  alumni_model1
## BP = 1.5742, df = 2, p-value = 0.4552
  • To check the constant variance assumption, we first look at the residuals versus fitted values plot
  • In the plot, the residuals appear to spread out as the fitted values increase, which suggests heteroskedasticity (non-constant error variance).
  • The Breusch-Pagan test, however, gives a p-value of 0.4552, which is greater than 0.05, so we cannot reject the null hypothesis of constant variance. The formal test therefore does not confirm the visual impression.
    • H0 - the variance of residuals is constant (Null Hypothesis)
    • H1 - the variance of residuals is not constant (Alternate Hypothesis)
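To make the test less of a black box, the sketch below computes the studentized Breusch-Pagan statistic by hand as n times the R-squared of an auxiliary regression of the squared residuals on the model's predictors. The helper bp_studentized() and the simulated data are ours, and we assume this matches the default studentized variant that bptest() reports.

```r
# Studentized Breusch-Pagan statistic computed by hand
bp_studentized <- function(fit) {
  e2  <- residuals(fit)^2
  X   <- model.matrix(fit)               # includes the intercept column
  aux <- lm.fit(X, e2)                   # auxiliary regression of e^2 on X
  r2  <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2
  df   <- ncol(X) - 1
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated data, for illustration only
set.seed(42)
sim <- data.frame(
  student_faculty_ratio = runif(48, 3, 23),
  private = factor(rep(c(0, 1), each = 24))
)
sim$alumni_giving_rate <- 40 - 1.5 * sim$student_faculty_ratio +
  7 * (sim$private == "1") + rnorm(48, sd = 9)

fit <- lm(alumni_giving_rate ~ student_faculty_ratio + private, data = sim)
bp  <- bp_studentized(fit)
bp

# p-value implied by the statistic reported for Model 1 (BP = 1.5742, df = 2)
p_model1 <- pchisq(1.5742, df = 2, lower.tail = FALSE)
p_model1
```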

VIF

## student_faculty_ratio               private 
##              2.956517              2.956517
  • The VIF values are below the commonly used threshold of 10, so we conclude that there is no serious multicollinearity in the suggested model
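For a model with two predictors, the VIF reduces to 1 / (1 - R²), where R² comes from regressing one predictor on the other, which is why both values above are identical. The sketch below illustrates this on simulated data (not the report's dataset) and also backs out the R² implied by the reported VIF.

```r
# Simulated predictors, for illustration only
set.seed(2)
private <- rep(c(0, 1), each = 24)
student_faculty_ratio <- 16 - 7 * private + rnorm(48, sd = 3)

# VIF of each predictor: 1 / (1 - R^2) from regressing it on the other one
r2_sfr     <- summary(lm(student_faculty_ratio ~ private))$r.squared
r2_private <- summary(lm(private ~ student_faculty_ratio))$r.squared
vif_sfr     <- 1 / (1 - r2_sfr)
vif_private <- 1 / (1 - r2_private)
c(vif_sfr = vif_sfr, vif_private = vif_private)

# R^2 between the two predictors implied by the reported VIF of 2.956517
implied_r2 <- 1 - 1 / 2.956517
implied_r2
```

An implied R² of roughly 0.66 means the two predictors are noticeably correlated even though the VIF stays below the threshold.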

Outliers

  • Outliers - Points 33 and 43 are the outliers.

Sum of Influential points

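#Count observations flagged on the 7th influence measure
#(for this three-coefficient model, column 7 of is.inf is the hat/leverage flag)
infl <- influence.measures(alumni_model1)
sum(infl$is.inf[,7])
## [1] 0
  • The count is zero, i.e. no observation is flagged as influential on this measure.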
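Because the column position depends on the number of coefficients, it can be safer to inspect the flag matrix by column name. The sketch below uses simulated data (not the report's dataset) to show the columns that influence.measures() returns for a model with an intercept and two predictors, and counts flags per measure.

```r
# Simulated data, for illustration only
set.seed(7)
sim <- data.frame(
  x = runif(48, 3, 23),
  g = factor(rep(c(0, 1), each = 24))
)
sim$y <- 40 - 1.5 * sim$x + 7 * (sim$g == "1") + rnorm(48, sd = 9)

fit  <- lm(y ~ x + g, data = sim)
infl <- influence.measures(fit)

# One DFBETAS column per coefficient, then dffit, cov.r, cook.d and hat
colnames(infl$is.inf)

# Number of flagged observations per measure
flag_counts <- colSums(infl$is.inf)
flag_counts

# Observations flagged on at least one measure
n_any <- sum(apply(infl$is.inf, 1, any))
n_any
```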

Remedial Measures and Transformation

Since the residual diagnostics for Model 1 point to non-constant variance (fitted vs. residual plot) and non-normal residuals (Shapiro-Wilk test), we apply a Box-Cox transformation to the response and determine the value of lambda.

#Box-Cox transformation
bc <- MASS::boxcox(alumni_giving_rate ~ student_faculty_ratio + private)
lambda <- bc$x[which.max(bc$y)] #lambda that maximises the profile log-likelihood

alumni_data$alumni_giving_rate1 <- (alumni_giving_rate ^ lambda - 1) / lambda
  • lambda = 0.343; the response variable is transformed with this value and stored in the data frame
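Beyond the point estimate, it is often useful to look at an approximate 95% interval for lambda, to see whether simpler transformations such as the log (lambda = 0) or square root (lambda = 0.5) are also plausible. The sketch below does this on simulated data (not the report's dataset) using the profile log-likelihood returned by MASS::boxcox(); the variable names are ours.

```r
# Simulated positive response, for illustration only
set.seed(3)
x <- runif(48, 3, 23)
y <- (6 - 0.15 * x + rnorm(48, sd = 0.4))^3

# Profile log-likelihood over a fine grid of lambda values (no plot)
bc <- MASS::boxcox(y ~ x, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Point estimate: lambda with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
lambda_ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2])

lambda_hat
lambda_ci
```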

Transformed Model Diagnostics

With the transformed response variable, we re-run the step-wise algorithm to determine the predictors that form the best-fit model.

alumni_fitAll2 <- lm(alumni_giving_rate1 ~ percent_of_classes_under_20 + student_faculty_ratio + private)
alumni_fitboth2 <- lm(alumni_data$alumni_giving_rate1 ~ 1)
step(alumni_fitboth2, direction="both", scope=formula(alumni_fitAll2))
## Start:  AIC=43.88
## alumni_data$alumni_giving_rate1 ~ 1
## 
##                               Df Sum of Sq     RSS    AIC
## + student_faculty_ratio        1    68.481  46.389  2.361
## + private                      1    62.024  52.846  8.617
## + percent_of_classes_under_20  1    50.399  64.472 18.161
## <none>                                     114.870 43.885
## 
## Step:  AIC=2.36
## alumni_data$alumni_giving_rate1 ~ student_faculty_ratio
## 
##                               Df Sum of Sq     RSS    AIC
## + private                      1     3.867  42.522  0.184
## <none>                                      46.389  2.361
## + percent_of_classes_under_20  1     0.935  45.454  3.384
## - student_faculty_ratio        1    68.481 114.870 43.885
## 
## Step:  AIC=0.18
## alumni_data$alumni_giving_rate1 ~ student_faculty_ratio + private
## 
##                               Df Sum of Sq    RSS    AIC
## <none>                                     42.522 0.1836
## + percent_of_classes_under_20  1    0.0217 42.500 2.1591
## - private                      1    3.8668 46.389 2.3613
## - student_faculty_ratio        1   10.3241 52.846 8.6170
## 
## Call:
## lm(formula = alumni_data$alumni_giving_rate1 ~ student_faculty_ratio + 
##     private)
## 
## Coefficients:
##           (Intercept)  student_faculty_ratio               private1  
##                7.3245                -0.1661                 1.0529
  • The AIC of the selected model is 0.18. This value is computed on the transformed response scale, so it is not directly comparable with the AIC of 213.60 from Model 1. The regression equation is \[Y = 7.3245 - 0.1661 \times \text{student\_faculty\_ratio} + 1.0529 \times \text{private}\] where Y is now the Box-Cox-transformed alumni giving rate.

  • The summary conclusions are:

    • R-squared = 62.98%
    • Adjusted R-squared = 61.34%
    • The p-value is below 0.05, so we reject the null hypothesis of no linear relationship: there is a statistically significant linear relationship between the transformed response variable and the predictors student_faculty_ratio and private.
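Because this model predicts the transformed response, predictions have to be back-transformed before they can be read as giving rates. The sketch below inverts the Box-Cox transformation using the rounded lambda (0.343) and the coefficients reported above; the helper functions and the example school (student/faculty ratio of 10) are ours, and the results are approximate because of rounding.

```r
lambda    <- 0.343     # rounded value reported above
b0        <- 7.3245    # intercept
b_sfr     <- -0.1661   # student_faculty_ratio coefficient
b_private <- 1.0529    # private1 coefficient

# Box-Cox transformation and its inverse (valid for lambda != 0)
box_cox     <- function(y, lambda) (y^lambda - 1) / lambda
inv_box_cox <- function(z, lambda) (lambda * z + 1)^(1 / lambda)

# Predicted alumni giving rate on the original percentage scale
predict_rate <- function(sfr, private) {
  z <- b0 + b_sfr * sfr + b_private * private
  inv_box_cox(z, lambda)
}

rate_private <- predict_rate(sfr = 10, private = 1)
rate_public  <- predict_rate(sfr = 10, private = 0)
c(private = rate_private, public = rate_public)
```

A back-transformed prediction of this kind is better read as a typical (roughly median) giving rate than as a mean, since the model's errors are assumed symmetric only on the transformed scale.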

  • Observations
    • We see no specific pattern in the Residuals vs. Student Faculty Ratio graph
    • The residual plot for Private shows two vertical bands, as expected, because this variable takes only two values, 0 and 1
    • The plot of residuals against the transformed alumni giving rate shows a linear trend, which is expected because least-squares residuals are always positively correlated with the observed response
    • The residuals vs. fitted values scatter plot is consistent with the constant variance assumption and is an improvement over Model 1

Plotting the histograms, QQ plots, Cook’s distance plot and performing the Shapiro test for Normality:

##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 48    0 0.95  -0.24   -0.02 0.92 -2.1 1.95  4.06 0.19    -0.63 0.14
## 
##  Shapiro-Wilk normality test
## 
## data:  alumni_model2$residuals
## W = 0.9791, p-value = 0.5418
  • From the above graphs we can observe the following:
    • Fitted vs. Residual - The graph is better than for Model 1 and we have achieved almost constant variance
    • QQ plot / Histogram - The QQ plot also follows the reference line more closely, which suggests approximately normal residuals. The p-value from the Shapiro-Wilk test is greater than 0.05, so we cannot reject the null hypothesis; the residuals are consistent with a normal distribution.

Breusch-Pagan test for constant variance

lmtest::bptest(alumni_model2) 
## 
##  studentized Breusch-Pagan test
## 
## data:  alumni_model2
## BP = 2.4586, df = 2, p-value = 0.2925
  • To check the constant variance assumption, we first look at the residuals versus fitted values plot
  • In the plot, we see an improvement over Model 1.
  • To confirm, we use the Breusch-Pagan test. The p-value (0.2925) is greater than 0.05, so we cannot reject the null hypothesis; the test is consistent with the residuals having constant variance.

VIF

## student_faculty_ratio               private 
##              2.956517              2.956517
  • The VIF values are below the commonly used threshold of 10, so we conclude that there is no serious multicollinearity in the suggested model

Outliers

  • Outliers - Points 33 and 43 are the outliers. Also, the sum of influential points is zero.

Sum of Influential points

## [1] 0

Interaction Model

We will check another model that considers an interaction between the two predictor variables. This additional model is included to explore the categorical nature of the ‘Private’ variable in the dataset. \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon\]

## 
## Call:
## lm(formula = alumni_giving_rate1 ~ student_faculty_ratio + private + 
##     student_faculty_ratio:private)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9076 -0.5697 -0.2487  0.8057  2.1484 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     6.04950    1.40435   4.308 9.11e-05 ***
## student_faculty_ratio          -0.09258    0.07973  -1.161    0.252    
## private1                        2.75368    1.52592   1.805    0.078 .  
## student_faculty_ratio:private1 -0.12135    0.10241  -1.185    0.242    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9677 on 44 degrees of freedom
## Multiple R-squared:  0.6413, Adjusted R-squared:  0.6168 
## F-statistic: 26.22 on 3 and 44 DF,  p-value: 6.99e-10
  • From the coefficients above, the suggested regression equation is: \[Y = 6.04950 - 0.09258 \times \text{student\_faculty\_ratio} + 2.75368 \times \text{private} - 0.12135 \times \text{student\_faculty\_ratio} \times \text{private}\]
  • From the summary statistics of the interaction model, we observe that none of the individual predictor terms is statistically significant at the 0.05 threshold; in particular, the interaction term has a p-value of 0.242.
  • Since the interaction term adds little (adjusted R-squared of 61.68% versus 61.34% for Model 2), we retain the simpler Model 2 and do not analyse the interaction model further
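A common way to formalise this comparison is a nested F-test between the additive model and the interaction model. The sketch below runs it on simulated data (not the report's dataset); with a single added term, the F statistic equals the square of the interaction coefficient's t value.

```r
# Simulated data, for illustration only
set.seed(11)
sim <- data.frame(
  x = runif(48, 3, 23),
  g = factor(rep(c(0, 1), each = 24))
)
sim$y <- 7 - 0.15 * sim$x + 1 * (sim$g == "1") + rnorm(48, sd = 1)

fit_add <- lm(y ~ x + g, data = sim)   # additive model (no interaction)
fit_int <- lm(y ~ x * g, data = sim)   # adds the x:g interaction term

# Nested F-test: does the interaction term improve the fit?
cmp <- anova(fit_add, fit_int)
cmp

f_stat <- cmp$F[2]
t_int  <- summary(fit_int)$coefficients["x:g1", "t value"]
c(F = f_stat, t_squared = t_int^2)
```

A large p-value in the second row of the anova table supports keeping the simpler additive model.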

Conclusions

Regression equations (Y is the Box-Cox-transformed giving rate in Models 2 and 3)
Model 1: Y = 41.429 - 1.486 * student_faculty_ratio + 7.267 * private
Model 2: Y = 7.3245 - 0.1661 * student_faculty_ratio + 1.0529 * private
Model 3: Y = 6.04950 - 0.09258 * student_faculty_ratio + 2.75368 * private - 0.12135 * student_faculty_ratio * private

Parameters                         Model 1                     Model 2                     Model 3
R-squared value                    57.28%                      62.98%                      64.13%
Adjusted R-squared                 55.39%                      61.34%                      61.68%
Assumption of constant variance    Non-constant (from plot)    Constant                    NA
Assumption of normality            Not normal                  Normal                      NA
Multicollinearity                  Not observed                Not observed                NA
AIC                                213.60 (original scale)     0.18 (transformed scale)    NA
  • We used the step-wise algorithm to determine the best model for the given predictor variables.

  • As per the conclusion table, the best performance is shown by Model 2. The model was obtained through the Box-Cox variance-stabilizing transformation. After the transformation, the model’s residuals were consistent with constant variance (Breusch-Pagan test) and with normality (Shapiro-Wilk test). Model 2 also had the lowest AIC among the candidates in its step-wise search, although this AIC (0.18) is on the transformed scale and cannot be compared directly with Model 1’s 213.60.