Alumni Donation Case Study

Introduction

Alumni donations are an important source of revenue for colleges and universities. With a regression model that identifies the factors associated with higher alumni donations, administrators could make informed decisions to increase donations and, in turn, the overall revenue from this source.

A study shows that students who have more access to the faculty are more likely to be satisfied. As a result, one might suspect that smaller class sizes and lower student-faculty ratios might lead to a higher percentage of satisfied graduates, which in turn might lead to an increase in the percentage of alumni who donate.

This report pursues that goal. We examine the factors associated with alumni donations and build a regression model whose response variable is the alumni giving rate and whose candidate predictors are the percentage of classes under 20, the student-faculty ratio, and whether the school is private.

Data Description

Variable	Definition
School	Name of the school/university
% of Classes Under 20	% of classes with fewer than 20 students
Student/Faculty Ratio	The number of students enrolled divided by the total number of faculty
Alumni Giving Rate	The percentage of alumni that made a donation to the university
Private	A binary variable: ‘1’ if the university is private, ‘0’ otherwise

Packages used

library(readxl)
library(car)
library(tidyverse)
library(psych)
library(packHV)

EDA

We convert the categorical variable ‘Private’ to a factor for the analysis that follows.
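
The conversion can be sketched as follows. This is a minimal illustration, assuming the data frame is called alumni_data (the name used later in this report) and that the private column holds 0/1 codes; the three rows are made-up placeholders, not the study data.

```r
# Hypothetical three-row stand-in for the real data set (values are made up).
alumni_data <- data.frame(
  student_faculty_ratio = c(8, 15, 20),
  private               = c(1, 0, 0)
)

# Convert the 0/1 indicator to a factor so lm() treats it as categorical.
alumni_data$private <- factor(alumni_data$private, levels = c(0, 1))

str(alumni_data$private)      # a factor with two levels
levels(alumni_data$private)   # "0" "1"
```

With "0" as the reference level, lm() reports the effect of a private school under the coefficient name private1, which is the name that appears in the model output.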

PART A) EDA for Percentage Of Class under 20:

##    vars  n  mean    sd median trimmed  mad min max range  skew kurtosis  se
## X1    1 48 55.73 13.19   59.5    56.4 12.6  29  77    48 -0.47    -1.07 1.9

Supplementing a Histogram plus boxplot for visual representation:

Observations
- We see that there are no outliers in this data.
- Additionally, the distribution does not look normal.
- From the describe metrics for this variable we see only mild skewness (-0.47) but clearly negative kurtosis (-1.07), i.e. lighter tails than a normal distribution.

PART B) EDA for Student Faculty Ratio:

##    vars  n  mean   sd median trimmed  mad min max range skew kurtosis  se
## X1    1 48 11.54 4.85   10.5    11.3 3.71   3  23    20 0.55    -0.62 0.7

Supplementing a Histogram plus boxplot for visual representation:

Observations
- We see that there are no outliers in this data.
- Additionally, the distribution does not look normal.
- From the describe metrics for this variable we see mild positive skewness (0.55) and some negative kurtosis (-0.62).

PART C) EDA for alumni giving rate:

##    vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 48 29.27 13.44     29   28.75 15.57   7  67    60 0.35     -0.3 1.94

Supplementing a Histogram plus boxplot for visual representation:

Observations
- We see that there are no outliers in this data.
- Additionally, the distribution does not look normal.
- From the describe metrics for this variable we see mild positive skewness (0.35) and slightly negative kurtosis (-0.3).
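
As a quick consistency check on the three summary tables, the se and range columns can be recomputed from the other reported values. The sketch below assumes that se is the standard error of the mean, sd / sqrt(n), and that range is max - min; the variable names are those used in the model code later in this report.

```r
# Summary values copied from the three describe() tables above.
stats <- data.frame(
  variable = c("percent_of_classes_under_20",
               "student_faculty_ratio",
               "alumni_giving_rate"),
  n   = c(48, 48, 48),
  sd  = c(13.19, 4.85, 13.44),
  min = c(29, 3, 7),
  max = c(77, 23, 67)
)

stats$se    <- round(stats$sd / sqrt(stats$n), 2)  # standard error of the mean
stats$range <- stats$max - stats$min               # spread between extremes
print(stats)
```

The recomputed values (se of 1.9, 0.7 and 1.94; range of 48, 20 and 60) agree with the tables.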

Correlation Analysis

Modelling

Model 1

Description of Algorithm

Stepwise regression builds a regression model iteratively by selecting which independent variables enter the final model. At each step, candidate explanatory variables are added or removed one at a time, and the change that lowers the model's AIC the most is kept. The procedure stops when no single addition or removal lowers the AIC any further.

#Stepwise Algorithm to find the best fit model

alumni_fitAll <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio + private)
alumni_fitboth <- lm(alumni_giving_rate ~ 1)
step(alumni_fitboth, direction="both", scope=formula(alumni_fitAll))

## Start:  AIC=250.43
## alumni_giving_rate ~ 1
## 
##                               Df Sum of Sq    RSS    AIC
## + student_faculty_ratio        1    4680.1 3811.4 213.98
## + private                      1    4038.0 4453.5 221.45
## + percent_of_classes_under_20  1    3539.8 4951.7 226.54
## <none>                                     8491.5 250.43
## 
## Step:  AIC=213.98
## alumni_giving_rate ~ student_faculty_ratio
## 
##                               Df Sum of Sq    RSS    AIC
## + private                      1     184.2 3627.2 213.60
## <none>                                     3811.4 213.98
## + percent_of_classes_under_20  1      86.5 3724.9 214.88
## - student_faculty_ratio        1    4680.1 8491.5 250.43
## 
## Step:  AIC=213.6
## alumni_giving_rate ~ student_faculty_ratio + private
## 
##                               Df Sum of Sq    RSS    AIC
## <none>                                     3627.2 213.60
## - private                      1    184.19 3811.4 213.98
## + percent_of_classes_under_20  1     15.34 3611.8 215.40
## - student_faculty_ratio        1    826.34 4453.5 221.45

## 
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + private)
## 
## Coefficients:
##           (Intercept)  student_faculty_ratio               private1  
##                41.429                 -1.486                  7.267
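
The AIC values in the trace can be reproduced from the RSS column. The sketch below assumes that step() scores linear models with n * log(RSS / n) + 2 * p, where p is the number of estimated coefficients including the intercept. The short labels sfr and sfr_private are only illustrative names for the two fitted models.

```r
n   <- 48                                         # number of schools
rss <- c(null = 8491.5, sfr = 3811.4, sfr_private = 3627.2)  # RSS from the trace
p   <- c(null = 1,      sfr = 2,      sfr_private = 3)       # coefficients incl. intercept

aic <- n * log(rss / n) + 2 * p
round(aic, 2)
```

Up to rounding of the printed RSS values, this gives 250.43, 213.98 and 213.60, matching the three "Step" lines. It also shows why adding private is such a close call: the drop in RSS only just outweighs the penalty of 2 for the extra coefficient.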

On running this algorithm, we find the AIC to be 213.60 with the regression equation being \[Y = 41.429 - 1.486 \times \text{studentfacultyratio} + 7.267 \times \text{private}\]
The summary shows that
- R-squared = 57.28%
- Adjusted R-squared = 55.39%
- The p-value of the model is less than 0.05, so the null hypothesis of no linear relationship is rejected: there is a significant linear relationship between the response variable and the predictors student_faculty_ratio and private, taken together.
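
These summary figures can be cross-checked against the RSS values in the stepwise trace. The sketch below is only a check under the usual least-squares formulas; it also computes a partial F-test for private, which is not reported in the original summary and is included here as an illustration.

```r
n         <- 48
rss_null  <- 8491.5   # intercept-only model
rss_sfr   <- 3811.4   # student_faculty_ratio only
rss_final <- 3627.2   # student_faculty_ratio + private
p_final   <- 3        # coefficients in the final model, including the intercept

r2     <- 1 - rss_final / rss_null
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p_final)

# Overall F-test: do the two predictors together explain anything?
mse       <- rss_final / (n - p_final)
f_overall <- ((rss_null - rss_final) / (p_final - 1)) / mse
p_overall <- pf(f_overall, p_final - 1, n - p_final, lower.tail = FALSE)

# Partial F-test: does private add anything once student_faculty_ratio is in?
f_private <- (rss_sfr - rss_final) / mse
p_private <- pf(f_private, 1, n - p_final, lower.tail = FALSE)

round(c(r2 = r2, adj_r2 = adj_r2, f_private = f_private, p_private = p_private), 4)
```

The R-squared and adjusted R-squared values agree with the summary (57.28% and 55.39%). The partial F-test suggests that private, taken on its own, falls short of significance at the 5% level even though AIC retains it, so the "significant" conclusion is best read as a statement about the model as a whole.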

Observations
- We see no specific pattern in the Residuals vs. Student Faculty Ratio plot.
- The residual plot for Private shows only two columns of points, as expected, because this variable takes just two values, 0 and 1.
- The Residuals vs. Alumni Giving Rate plot shows a linear trend. Some positive trend is expected in any least-squares fit, because the residuals are correlated with the observed response whenever R-squared is below 1, so this plot alone does not establish a violation of the model assumptions.
- The Residuals vs. Fitted Values scatter plot resembles the one for student faculty ratio. On visual inspection alone it suggests approximately normal residuals with constant variance; the formal checks that follow test this impression.
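
One point worth checking about the Residuals vs. Alumni Giving Rate plot: in a least-squares fit with an intercept, the correlation between the residuals and the observed response equals sqrt(1 - R^2), so an upward trend appears even when every assumption holds. The sketch below demonstrates the identity on synthetic data (not the study data).

```r
set.seed(1)
# Synthetic data: a linear signal plus independent, constant-variance noise.
x <- runif(48, 3, 23)
y <- 40 - 1.5 * x + rnorm(48, sd = 9)
fit <- lm(y ~ x)

r2 <- summary(fit)$r.squared
c(cor_resid_observed = cor(resid(fit), y),
  sqrt_one_minus_r2  = sqrt(1 - r2))   # the two numbers coincide
```

For Model 1, with R-squared of 57.28%, the same identity implies a correlation of about 0.65 between residuals and observed giving rate, which is enough to produce a visible linear trend.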

Plotting the histograms, QQ plots, Cook’s distance plot and performing the Shapiro test for Normality:

##    vars  n mean   sd median trimmed  mad    min   max range skew kurtosis   se
## X1    1 48    0 8.78  -2.19   -0.59 7.63 -16.37 25.74 42.11 0.74     0.36 1.27

## 
##  Shapiro-Wilk normality test
## 
## data:  alumni_model1$residuals
## W = 0.95092, p-value = 0.04352

From the above graphs we can observe the following:
- Fitted vs. Residuals - The variance of the residuals is not constant; it changes as the fitted value increases.
- QQ plot / Histogram - The model appears to violate the assumption of normality. The p-value from the Shapiro-Wilk test (0.04352) is below 0.05, so the null hypothesis is rejected and we conclude that the residuals do not follow a normal distribution. The hypotheses of the test are:
- H0 - The errors follow a normal distribution (Null Hypothesis)
- H1 - The errors do not follow a normal distribution (Alternate Hypothesis)
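
The decision rule can be sketched with base R alone. The residuals below are synthetic and deliberately right-skewed, purely to illustrate the workflow; they are not the residuals of alumni_model1.

```r
set.seed(42)
# Synthetic, right-skewed "residuals" centred near zero (not the study data).
res <- rexp(48) - 1

qqnorm(res); qqline(res)       # points bending away from the line suggest non-normality
sw <- shapiro.test(res)
sw$statistic                   # the W statistic
sw$p.value < 0.05              # TRUE means H0 (normal errors) is rejected at the 5% level
```

Applied to the model in this report, the same call would be shapiro.test(alumni_model1$residuals), as shown in the output above.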

Model 1 Diagnostics

lmtest::bptest(alumni_model1) #Breusch-Pagan test

## 
##  studentized Breusch-Pagan test
## 
## data:  alumni_model1
## BP = 1.5742, df = 2, p-value = 0.4552

To test the constant variance assumption, we first inspect the residuals versus fitted values plot.
In the plot, the residuals spread out as the fitted values increase, which suggests that the assumption of constant error variance is violated, i.e. that the model is heteroskedastic.
The Breusch-Pagan test, however, gives a p-value of 0.4552, which is greater than 0.05, so we cannot reject the null hypothesis of constant variance. The formal test therefore does not confirm the visual impression of heteroskedasticity.
- H0 - The variance of the residuals is constant (Null Hypothesis)
- H1 - The variance of the residuals is not constant (Alternate Hypothesis)
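
To make the test less of a black box, the studentized Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the model's predictors and take n times the R-squared of that auxiliary regression. The sketch below uses synthetic data with noise that grows with x (not the study data) and base R only.

```r
set.seed(7)
# Synthetic data with deliberately non-constant error variance.
n <- 48
x <- runif(n, 3, 23)
y <- 40 - 1.5 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the same predictor.
aux <- lm(resid(fit)^2 ~ x)
bp  <- n * summary(aux)$r.squared          # studentized (Koenker) statistic
df  <- length(coef(aux)) - 1               # predictors, excluding the intercept
p_value <- pchisq(bp, df = df, lower.tail = FALSE)

c(BP = bp, df = df, p_value = p_value)
```

In the report's model there are two predictors, which is why the test output shows df = 2.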

VIF

## student_faculty_ratio               private 
##              2.956517              2.956517

The VIF values are below the commonly used threshold of 10, so we conclude that multicollinearity is not severe in the selected model.
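
With exactly two predictors, both share the same VIF, 1 / (1 - r^2), where r is the correlation between them. The reported value can therefore be inverted to see how strongly the two predictors are related. This is a back-of-the-envelope sketch using only the number shown above.

```r
vif_value <- 2.956517                # reported for both predictors
r_squared <- 1 - 1 / vif_value       # R^2 from regressing one predictor on the other
abs_r     <- sqrt(r_squared)         # implied absolute correlation

round(c(r_squared = r_squared, abs_r = abs_r), 3)
```

The implied absolute correlation is roughly 0.81, so student_faculty_ratio and private overlap considerably even though the VIF is well under 10.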

Outliers

Points 33 and 43 are identified as outliers.

Sum of Influential points

#Count observations flagged in column 7 of is.inf (the hat/leverage column for this three-coefficient model)
infl <- influence.measures(alumni_model1)
sum(infl$is.inf[,7])

## [1] 0

The number of flagged influential points is zero.
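
Because is.inf holds one column per influence measure, it helps to look at the column names before indexing by position. The sketch below fits a model of the same shape (one numeric predictor and one two-level factor) to synthetic data, not the study data; the names d, x, g and y are illustrative.

```r
set.seed(3)
# Synthetic data with the same model structure as alumni_model1.
d <- data.frame(x = runif(48, 3, 23), g = factor(rep(0:1, 24)))
d$y <- 40 - 1.5 * d$x + 7 * (d$g == "1") + rnorm(48, sd = 9)
fit <- lm(y ~ x + g, data = d)

infl <- influence.measures(fit)
colnames(infl$is.inf)            # dfbetas columns, then dffit, cov.r, cook.d, hat
colSums(infl$is.inf)             # flagged observations per measure
sum(infl$is.inf[, "cook.d"])     # count flagged by Cook's distance specifically
```

For a model with three coefficients, column 7 is the hat (leverage) column and column 6 is Cook's distance, so indexing by name makes the intended measure explicit.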

Remedial Measures and Transformation

Since the residuals vs. fitted plot suggests non-constant variance and the Shapiro-Wilk test rejects normality of the residuals, we apply a Box-Cox transformation to the response and use it to determine the lambda value.

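#Box-Cox transformation
bc <- MASS::boxcox(alumni_giving_rate ~ student_faculty_ratio + private)
#lambda is taken as the grid value that maximises the profile log-likelihood (assumed; gives 0.343 here)
lambda <- bc$x[which.max(bc$y)]

alumni_data$alumni_giving_rate1 <- (alumni_giving_rate ^ lambda - 1) / lambda

The estimated lambda is 0.343; the response variable is transformed with this value and stored in the data frame.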
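
The transformation and its inverse can be wrapped in two small helper functions. The names box_cox and inv_box_cox are hypothetical, introduced only for this sketch; the log branch handles the limiting case lambda = 0.

```r
# Box-Cox transform of a positive response; the log branch covers lambda = 0.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Inverse transform, useful for reporting predictions on the original scale.
inv_box_cox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

lambda <- 0.343
y <- c(7, 29, 67)           # min, median and max of alumni_giving_rate from the EDA table
z <- box_cox(y, lambda)     # values on the transformed scale
inv_box_cox(z, lambda)      # recovers 7, 29, 67
```

Keeping the inverse next to the forward transform makes it harder to forget that coefficients of the transformed model are no longer in percentage points.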

Transformed Model Diagnostics

With the transformed response variable, we re-run the stepwise algorithm to determine the predictors that form the best-fit model.

alumni_fitAll2 <- lm(alumni_giving_rate1 ~ percent_of_classes_under_20 + student_faculty_ratio + private)
alumni_fitboth2 <- lm(alumni_data$alumni_giving_rate1 ~ 1)
step(alumni_fitboth2, direction="both", scope=formula(alumni_fitAll2))

## Start:  AIC=43.88
## alumni_data$alumni_giving_rate1 ~ 1
## 
##                               Df Sum of Sq     RSS    AIC
## + student_faculty_ratio        1    68.481  46.389  2.361
## + private                      1    62.024  52.846  8.617
## + percent_of_classes_under_20  1    50.399  64.472 18.161
## <none>                                     114.870 43.885
## 
## Step:  AIC=2.36
## alumni_data$alumni_giving_rate1 ~ student_faculty_ratio
## 
##                               Df Sum of Sq     RSS    AIC
## + private                      1     3.867  42.522  0.184
## <none>                                      46.389  2.361
## + percent_of_classes_under_20  1     0.935  45.454  3.384
## - student_faculty_ratio        1    68.481 114.870 43.885
## 
## Step:  AIC=0.18
## alumni_data$alumni_giving_rate1 ~ student_faculty_ratio + private
## 
##                               Df Sum of Sq    RSS    AIC
## <none>                                     42.522 0.1836
## + percent_of_classes_under_20  1    0.0217 42.500 2.1591
## - private                      1    3.8668 46.389 2.3613
## - student_faculty_ratio        1   10.3241 52.846 8.6170

## 
## Call:
## lm(formula = alumni_data$alumni_giving_rate1 ~ student_faculty_ratio + 
##     private)
## 
## Coefficients:
##           (Intercept)  student_faculty_ratio               private1  
##                7.3245                -0.1661                 1.0529

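The selected model has an AIC of 0.18. Because the response has been transformed, this value is on a different scale and is not directly comparable with the AIC of 213.60 from Model 1. The regression equation, where Y is now the Box-Cox transformed alumni giving rate, is \[Y = 7.3245 - 0.1661 \times \text{studentfacultyratio} + 1.0529 \times \text{private}\]
The summary conclusions are:
- R-squared = 62.98%
- Adjusted R-squared = 61.34%
- The p-value of the model is less than 0.05, so the null hypothesis of no linear relationship is rejected: there is a significant linear relationship between the transformed response variable and the predictors student_faculty_ratio and private, taken together.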
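
Because the transformed model predicts on the Box-Cox scale, a prediction has to be inverted before it can be read as a giving rate. The sketch below uses the coefficients reported above and lambda = 0.343; the school (private, student-faculty ratio of 10) is a hypothetical example chosen only for illustration.

```r
lambda <- 0.343
# Coefficients copied from the transformed-model output above.
b0 <- 7.3245; b_sfr <- -0.1661; b_private <- 1.0529

# Hypothetical school, not a row of the data set.
sfr <- 10; private <- 1

z_hat <- b0 + b_sfr * sfr + b_private * private   # prediction on the Box-Cox scale
y_hat <- (lambda * z_hat + 1)^(1 / lambda)        # back-transformed giving rate (%)

# Model 1 prediction for the same school, for comparison.
y_hat_model1 <- 41.429 - 1.486 * sfr + 7.267 * private

c(z_hat = z_hat, y_hat = y_hat, y_hat_model1 = y_hat_model1)
```

The two models give similar predictions for this school. Note that a value back-transformed this way is closer to a typical (median-like) giving rate than to a mean, so it should be reported with that caveat.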

Observations
- We see no specific pattern in the Residuals vs. Student Faculty Ratio plot.
- The residual plot for Private shows only two columns of points, as expected, because this variable takes just two values, 0 and 1.
- The Residuals vs. Alumni Giving Rate (transformed) plot shows a linear trend.
- The Residuals vs. Fitted Values scatter plot is consistent with the constant variance assumption and is an improvement over the corresponding plot for Model 1.