Alumni donations are an important source of revenue for colleges and universities. With a regression model that identifies the factors associated with higher alumni donations, administrators could make informed decisions to increase donations and hence the overall revenue from this source.
A study shows that students who have more access to the faculty are more likely to be satisfied. As a result, one might suspect that smaller class sizes and lower student-faculty ratios might lead to a higher percentage of satisfied graduates, which in turn might lead to an increase in the percentage of alumni who donate.
This report aims to build such a model. We look at the factors associated with alumni donations and fit a regression model with the alumni giving rate as the response variable and the percentage of classes under 20, the student/faculty ratio, and the private indicator as candidate predictor variables.
| Variable | Definition |
|---|---|
| School | Name of the school/university |
| % of Classes Under 20 | % of classes with fewer than 20 students |
| Student/Faculty Ratio | The number of students enrolled divided by the total number of faculty |
| Alumni Giving Rate | The percentage of alumni that made a donation to the university |
| Private | A binary variable that is ‘1’ when the university is private and ‘0’ otherwise |
library(readxl)
library(car)
library(tidyverse)
library(psych)
library(packHV)
We converted the categorical variable ‘Private’ to a factor for further analysis.
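A minimal sketch of this conversion, assuming a small made-up data frame (`toy` and its values are hypothetical and are not the report's data). It also shows why the fitted models later report a coefficient named `private1`.

```r
# Hypothetical toy data; column names mirror the report, values are made up.
toy <- data.frame(
  alumni_giving_rate    = c(25, 33, 40, 46, 28, 31),
  student_faculty_ratio = c(19, 8, 3, 5, 10, 13),
  private               = c(0, 1, 1, 1, 0, 0)
)

# Convert the 0/1 indicator to a factor so lm() treats it as categorical.
toy$private <- factor(toy$private)

fit <- lm(alumni_giving_rate ~ student_faculty_ratio + private, data = toy)

# The non-reference level of the factor appears as the coefficient "private1".
names(coef(fit))
```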
PART A) EDA for Percentage of Classes Under 20:
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 48 55.73 13.19 59.5 56.4 12.6 29 77 48 -0.47 -1.07 1.9
PART B) EDA for Student Faculty Ratio:
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 48 11.54 4.85 10.5 11.3 3.71 3 23 20 0.55 -0.62 0.7
PART C) EDA for alumni giving rate:
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 48 29.27 13.44 29 28.75 15.57 7 67 60 0.35 -0.3 1.94
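As a quick consistency check on the three summaries above, the `se` column is the standard error of the mean, `sd / sqrt(n)`. The sketch below recomputes it from the printed `sd` values; the vector names are labels chosen here for illustration.

```r
n <- 48  # number of schools, from the n column

# Standard deviations as printed in the three summaries.
sd_values <- c(
  classes_under_20      = 13.19,
  student_faculty_ratio = 4.85,
  alumni_giving_rate    = 13.44
)

# Standard error of the mean for each variable.
se <- sd_values / sqrt(n)
round(se, 2)
```

The results agree with the printed `se` values of 1.9, 0.7 and 1.94.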
Correlation Analysis
Description of Algorithm
Stepwise regression is the step-by-step iterative construction of a regression model that involves the selection of independent variables to be used in a final model. It involves adding or removing potential explanatory variables in succession and, after each iteration, comparing the candidate models by their AIC values; the change that gives the lowest AIC is kept, and the search stops when no change lowers the AIC further.
#Stepwise Algorithm to find the best fit model
alumni_fitAll <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio + private)
alumni_fitboth <- lm(alumni_giving_rate ~ 1)
step(alumni_fitboth, direction="both", scope=formula(alumni_fitAll))
## Start: AIC=250.43
## alumni_giving_rate ~ 1
##
## Df Sum of Sq RSS AIC
## + student_faculty_ratio 1 4680.1 3811.4 213.98
## + private 1 4038.0 4453.5 221.45
## + percent_of_classes_under_20 1 3539.8 4951.7 226.54
## <none> 8491.5 250.43
##
## Step: AIC=213.98
## alumni_giving_rate ~ student_faculty_ratio
##
## Df Sum of Sq RSS AIC
## + private 1 184.2 3627.2 213.60
## <none> 3811.4 213.98
## + percent_of_classes_under_20 1 86.5 3724.9 214.88
## - student_faculty_ratio 1 4680.1 8491.5 250.43
##
## Step: AIC=213.6
## alumni_giving_rate ~ student_faculty_ratio + private
##
## Df Sum of Sq RSS AIC
## <none> 3627.2 213.60
## - private 1 184.19 3811.4 213.98
## + percent_of_classes_under_20 1 15.34 3611.8 215.40
## - student_faculty_ratio 1 826.34 4453.5 221.45
##
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio + private)
##
## Coefficients:
## (Intercept) student_faculty_ratio private1
## 41.429 -1.486 7.267
On running this algorithm, we find the AIC to be 213.60, with the regression equation being \[Y = 41.429 - 1.486 \times \text{student faculty ratio} + 7.267 \times \text{private}\]
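The AIC values in the stepwise trace can be reproduced from the RSS column. For `lm` fits, `step()` computes AIC (up to an additive constant) as `n * log(RSS / n) + 2 * p`, where `p` is the number of estimated coefficients. The helper name `step_aic` below is introduced here for illustration only.

```r
# AIC as used by step() for linear models, up to an additive constant.
step_aic <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 48  # number of schools

aic_null  <- step_aic(8491.5, n, p = 1)  # intercept only
aic_ratio <- step_aic(3811.4, n, p = 2)  # + student_faculty_ratio
aic_final <- step_aic(3627.2, n, p = 3)  # + private

round(c(aic_null, aic_ratio, aic_final), 2)
```

These match the 250.43, 213.98 and 213.60 reported at the three steps.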
The summary shows that
Plotting the histograms, QQ plots, Cook’s distance plot and performing the Shapiro test for Normality:
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 48 0 8.78 -2.19 -0.59 7.63 -16.37 25.74 42.11 0.74 0.36 1.27
##
## Shapiro-Wilk normality test
##
## data: alumni_model1$residuals
## W = 0.95092, p-value = 0.04352
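A minimal sketch of how the Shapiro-Wilk result is read, assuming the conventional 5% significance level (the report does not state the level it uses). The simulated samples are illustrative only and are not the model residuals.

```r
alpha <- 0.05  # assumed significance level

# Decision for the p-value reported above: reject normality when p < alpha.
report_p <- 0.04352
reject_normality <- report_p < alpha

# Illustration on simulated data (arbitrary seed, not the report's residuals).
set.seed(1)
normal_sample <- rnorm(48)  # drawn from a normal distribution
skewed_sample <- rexp(48)   # drawn from a right-skewed distribution

p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value
c(p_normal = p_normal, p_skewed = p_skewed)
```

At the 5% level the reported p-value of 0.04352 leads to rejecting normality of the residuals, although only narrowly.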
lmtest::bptest(alumni_model1) #Breusch-Pagan test
##
## studentized Breusch-Pagan test
##
## data: alumni_model1
## BP = 1.5742, df = 2, p-value = 0.4552
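The studentized Breusch-Pagan statistic can be sketched by hand in base R: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by `n`. The data below are simulated for illustration and are not the report's data; the last lines only recompute the p-values from the `BP` and `df` values printed in this report.

```r
set.seed(42)  # arbitrary seed; simulated data
n <- 48
x <- runif(n, 3, 23)
y <- 40 - 1.5 * x + rnorm(n, sd = 0.5 * x)  # error spread grows with x

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor.
aux <- lm(resid(fit)^2 ~ x)

bp_stat <- n * summary(aux)$r.squared  # studentized (Koenker) statistic
bp_df   <- length(coef(aux)) - 1       # number of predictors
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(BP = bp_stat, df = bp_df, p.value = p_value)

# p-values implied by the statistics printed in the report (df = 2).
p_model1 <- pchisq(1.5742, df = 2, lower.tail = FALSE)
p_model2 <- pchisq(2.4586, df = 2, lower.tail = FALSE)
round(c(p_model1, p_model2), 4)
```

A large p-value, such as the 0.4552 above, means the test does not reject constant variance.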
VIF
## student_faculty_ratio private
## 2.956517 2.956517
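With exactly two predictors, both VIFs equal `1 / (1 - r^2)`, where `r` is the correlation between the student/faculty ratio and the private indicator, which is why the two values are identical. The sketch below inverts that relation for the reported value.

```r
vif_reported <- 2.956517

# Two predictors: VIF = 1 / (1 - r^2), so r^2 = 1 - 1 / VIF.
r_squared <- 1 - 1 / vif_reported
abs_r     <- sqrt(r_squared)

round(c(r_squared = r_squared, abs_r = abs_r), 3)
```

The implied correlation is roughly 0.81 in absolute value; the VIF is still below the commonly used rule-of-thumb thresholds of 5 or 10.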
Outliers
Sum of Influential points
#Count observations flagged in column 7 of is.inf (the hat-value column for this three-coefficient model)
infl <- influence.measures(alumni_model1)
sum(infl$is.inf[,7])
## [1] 0
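A small sketch of what `influence.measures()` returns, using made-up data (`toy` is hypothetical). For a model with three coefficients, the `is.inf` matrix has seven columns: one DFBETAS column per coefficient, then `dffit`, `cov.r`, `cook.d` and `hat`, so indexing column 7 selects the hat-value flags only.

```r
# Hypothetical toy data; values are made up for illustration.
toy <- data.frame(
  y = c(25, 33, 40, 46, 28, 31, 22, 38),
  x = c(19, 8, 3, 5, 10, 13, 16, 7),
  g = factor(c(0, 1, 1, 1, 0, 0, 0, 1))
)
fit  <- lm(y ~ x + g, data = toy)
infl <- influence.measures(fit)

colnames(infl$is.inf)          # names of the seven influence measures
colSums(infl$is.inf)           # flagged observations per measure
sum(rowSums(infl$is.inf) > 0)  # observations flagged by at least one measure
```

Counting rows flagged by any measure, as in the last line, is one way to get a broader view than a single column.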
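Although the Breusch-Pagan test does not reject constant variance (p = 0.4552), the fitted vs residual plot suggests that the variance is not constant, and the Shapiro-Wilk test rejects normality of the residuals (p = 0.04352). We therefore use the Box-Cox transformation and determine its lambda value.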
#Box-Cox transformation
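bc <- MASS::boxcox(alumni_giving_rate ~ student_faculty_ratio + private)
#lambda taken as the value maximizing the profile log-likelihood (assumed; its definition is not shown in the original)
lambda <- bc$x[which.max(bc$y)]
alumni_data$alumni_giving_rate1 <- (alumni_giving_rate ^ lambda - 1) / lambda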
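A minimal base-R sketch of the Box-Cox power transform itself, including the `lambda = 0` case, which is defined as the logarithm. The function name `box_cox` and the lambda values tried are illustrative; the lambda actually used in this report is not shown.

```r
# Box-Cox power transform; lambda = 0 is defined as the log limit.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))  # the transform requires a positive response
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

y <- c(7, 29, 67)  # min, median and max of the giving rate from the EDA

box_cox(y, 1)    # lambda = 1 only shifts the data: y - 1
box_cox(y, 0.5)  # square-root-like compression of large values
box_cox(y, 0)    # natural log
```

Smaller values of lambda compress the upper tail more strongly, which is how the transform can stabilize variance for a right-skewed response.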
With the transformed response variable, we re-run the stepwise algorithm to determine the predictors that form the best-fit model.
alumni_fitAll2 <- lm(alumni_giving_rate1 ~ percent_of_classes_under_20 + student_faculty_ratio + private)
alumni_fitboth2 <- lm(alumni_data$alumni_giving_rate1 ~ 1)
step(alumni_fitboth2, direction="both", scope=formula(alumni_fitAll2))
## Start: AIC=43.88
## alumni_data$alumni_giving_rate1 ~ 1
##
## Df Sum of Sq RSS AIC
## + student_faculty_ratio 1 68.481 46.389 2.361
## + private 1 62.024 52.846 8.617
## + percent_of_classes_under_20 1 50.399 64.472 18.161
## <none> 114.870 43.885
##
## Step: AIC=2.36
## alumni_data$alumni_giving_rate1 ~ student_faculty_ratio
##
## Df Sum of Sq RSS AIC
## + private 1 3.867 42.522 0.184
## <none> 46.389 2.361
## + percent_of_classes_under_20 1 0.935 45.454 3.384
## - student_faculty_ratio 1 68.481 114.870 43.885
##
## Step: AIC=0.18
## alumni_data$alumni_giving_rate1 ~ student_faculty_ratio + private
##
## Df Sum of Sq RSS AIC
## <none> 42.522 0.1836
## + percent_of_classes_under_20 1 0.0217 42.500 2.1591
## - private 1 3.8668 46.389 2.3613
## - student_faculty_ratio 1 10.3241 52.846 8.6170
##
## Call:
## lm(formula = alumni_data$alumni_giving_rate1 ~ student_faculty_ratio +
## private)
##
## Coefficients:
## (Intercept) student_faculty_ratio private1
## 7.3245 -0.1661 1.0529
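The AIC of the selected model is 0.18. This value is on the scale of the transformed response, so it is not directly comparable with the AIC of 213.60 from the untransformed model. The regression equation, with Y now denoting the transformed giving rate, is \[Y = 7.3245 - 0.1661 \times \text{student faculty ratio} + 1.0529 \times \text{private}\]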
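Because the AIC used by `step()` depends on the scale of the response, values from models with differently transformed responses measure different things. A minimal sketch on simulated data (not the report's data): dividing the response by 10 leaves the quality of the fit unchanged but shifts the AIC by exactly `2 * n * log(10)`.

```r
set.seed(7)  # arbitrary seed; simulated data
n <- 48
x <- runif(n, 3, 23)
y <- 40 - 1.5 * x + rnorm(n, sd = 5)

# extractAIC() returns c(edf, AIC); this is the AIC that step() reports.
aic_original <- extractAIC(lm(y ~ x))[2]
aic_rescaled <- extractAIC(lm(I(y / 10) ~ x))[2]

# Same fit quality, different AIC.
c(original = aic_original, rescaled = aic_rescaled,
  difference = aic_original - aic_rescaled)
```

For this reason AIC is best used to compare candidate models that share the same response, as within each stepwise search.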
The summary conclusions are:
Plotting the histograms, QQ plots, Cook’s distance plot and performing the Shapiro test for Normality:
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 48 0 0.95 -0.24 -0.02 0.92 -2.1 1.95 4.06 0.19 -0.63 0.14
##
## Shapiro-Wilk normality test
##
## data: alumni_model2$residuals
## W = 0.9791, p-value = 0.5418
Breusch-Pagan test for constant variance
lmtest::bptest(alumni_model2)
##
## studentized Breusch-Pagan test
##
## data: alumni_model2
## BP = 2.4586, df = 2, p-value = 0.2925
VIF
## student_faculty_ratio private
## 2.956517 2.956517
Outliers
Sum of Influential points
## [1] 0
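We also fit a third model that includes an interaction between the two predictor variables. This model is added to explore how the categorical ‘Private’ variable changes the effect of the student/faculty ratio. \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon\] Here \(X_1\) is the student/faculty ratio, \(X_2\) is the private indicator and \(Y\) is the transformed giving rate.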
##
## Call:
## lm(formula = alumni_giving_rate1 ~ student_faculty_ratio + private +
## student_faculty_ratio:private)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9076 -0.5697 -0.2487 0.8057 2.1484
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.04950 1.40435 4.308 9.11e-05 ***
## student_faculty_ratio -0.09258 0.07973 -1.161 0.252
## private1 2.75368 1.52592 1.805 0.078 .
## student_faculty_ratio:private1 -0.12135 0.10241 -1.185 0.242
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9677 on 44 degrees of freedom
## Multiple R-squared: 0.6413, Adjusted R-squared: 0.6168
## F-statistic: 26.22 on 3 and 44 DF, p-value: 6.99e-10
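The interaction model implies a separate fitted line for each group. The sketch below derives both lines from the coefficient estimates printed above; the variable names are introduced here for illustration.

```r
# Coefficient estimates copied from the summary output.
b0        <-  6.04950
b_ratio   <- -0.09258
b_private <-  2.75368
b_inter   <- -0.12135

# private = 0: intercept b0, slope b_ratio.
# private = 1: both are shifted by the private terms.
lines_by_group <- data.frame(
  group     = c("non-private", "private"),
  intercept = c(b0, b0 + b_private),
  slope     = c(b_ratio, b_ratio + b_inter)
)
lines_by_group
```

On the transformed scale, the slope for private schools (about -0.214) is steeper than for non-private schools (about -0.093), but the interaction term that produces this difference is not statistically significant (p = 0.242).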
| Parameters | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Regression Equation | Y = 41.429 − 1.486 × student_faculty_ratio + 7.267 × private | Y = 7.3245 − 0.1661 × student_faculty_ratio + 1.0529 × private | Y = 6.04950 − 0.09258 × student_faculty_ratio + 2.75368 × private − 0.12135 × student_faculty_ratio × private |
| R-squared | 57.28% | 62.98% | 64.13% |
| Adjusted R-squared | 55.39% | 61.34% | 61.68% |
| Assumption of constant variance | Non-constant | Constant | NA |
| Assumption of normality | Not normal | Normal | NA |
| Multicollinearity | Not Observed | Not Observed | NA |
| AIC | 213.6 | 0.18 | NA |
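The R-squared and adjusted R-squared entries for Models 1 and 2 can be reproduced from the RSS values in the two stepwise traces. In the sketch below, the intercept-only RSS serves as the total sum of squares; the vector names are labels chosen here.

```r
n   <- 48
tss <- c(model1 = 8491.5, model2 = 114.870)  # RSS of the intercept-only fits
rss <- c(model1 = 3627.2, model2 = 42.522)   # RSS of the selected fits
p   <- 2                                     # predictors in each selected model

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(100 * rbind(r2, adj_r2), 2)
```

This gives 57.28% and 62.98% for R-squared, and 55.39% and 61.34% for adjusted R-squared, matching the table.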
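We used the stepwise algorithm to determine the best model for the given predictor variables.
As per the conclusion table, Model 2 shows the best performance. It was obtained by applying a Box-Cox variance-stabilizing transformation to the response. After the transformation, the model’s residuals showed constant variance (Breusch-Pagan test, p = 0.2925) and normality (Shapiro-Wilk test, p = 0.5418). Model 2 also had the lowest AIC (0.18) among the candidate models in the second stepwise search; because this AIC is on the transformed scale, it is not directly comparable with Model 1’s 213.6. Model 3 has a slightly higher adjusted R-squared, but its interaction term is not statistically significant (p = 0.242), so the simpler Model 2 is preferred.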