Setup

Load packages

library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(lubridate)
library(tibble)
library(knitr)
library(statsr)
library(SignifReg)
library(tidyverse)
library(car)

1. Introduction

Alumni donations are an important source of revenue for colleges and universities. If administrators could determine the factors that influence increased donation among alumni’s, they might be able to implement policies that could lead to increased revenues. Research shows that students who are more satisfied with their contact with teachers are more likely to graduate. As a result, one might suspect that smaller class sizes and lower student-faculty ratios might leads to a higher percentage of satisfied graduates, which in turn might lead to increased alumni donations. We have taken a dataset containing information of 48 national universities (America’s Best Colleges, Year 2000 Edition) and studied how the different factors affect the alumni giving rate. We have implemented a multiple linear regression model to answer this question.


2. Data Description

The alumni data set has 5 variables and 48 observations:

  • School: Name of the school.
  • Percent_of_classes_under_20: The percentage of classes offered with fewer than 20 students.
  • Student_faculty_ratio: The ratio of the students enrolled to the number of faculty in school.
  • Alumni_giving_rate: The percentage of alumni that donated to the university.
  • Private: This is an indicator variable indicating if the school is a private (1) or public institute (0).

Summary Statistics:

  • alumni_giving_rate ranges from 7% to 67% with an average of 30% and no missing values.
  • percent_of_classes_under_20 ranges from 29% to 77% with an average of 48% and no missing values.
  • student_faculty_ratio ranges from 3 to 23 with an average of 12 and no missing values.

All values are reasonable and there seems to be no outlier.

# Reading Data
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)

summary_fn <- function(x)
{
  alumni %>% 
  summarise(min=min(x),
            max=max(x),
            range=diff(range(x)),
            mean=round(mean(x),2),
            median=median(x),
            missing=sum(is.na(x)),
            Q1=quantile(x,probs=0.25),
            Q2=quantile(x,probs=0.75))
}

summary_data <- rbind(summary_fn(alumni$alumni_giving_rate),
                      summary_fn(alumni$percent_of_classes_under_20),
                      summary_fn(alumni$student_faculty_ratio))
summary_data <-
summary_data %>% 
  mutate(Variable = c("alumni_giving_rate",
                      "percent_of_classes_under_20",
                      "student_faculty_ratio")) %>% 
  select(Variable,everything())

kable(summary_data)
Variable min max range mean median missing Q1 Q2
alumni_giving_rate 7 67 60 29.27 29.0 0 18.75 38.50
percent_of_classes_under_20 29 77 48 55.73 59.5 0 44.75 66.25
student_faculty_ratio 3 23 20 11.54 10.5 0 8.00 13.50

Plots:

  • The distribution of alumni_giving_rate seems bi-modal and right skewed.
  • The distribution of of percent_of_classes_under_20 seems bi-modal as well. It has a high positive correlation of 0.646 with the response variable.
  • The distribution of student_faculty_ratio seems right skewed. It has a high negative correlation of -0.742 with the response variable.
  • Roughly 70% of the sampled schools are Private. Also, Private schools seem to have a positive association with the alumni_giving_rate.
  • All predictor variables seem to have strong correlation among them. We will need to check for multi-collinearity post model building.
alumni %>%
select(alumni_giving_rate,percent_of_classes_under_20,student_faculty_ratio,private) %>%
ggpairs()

3. Modeling and Results

Building the Model

Using the Exploratory Data Analysis, these are the predictor variables which have some association with the alumni giving rate:

  • student_faculty_ratio
  • percent_of_classes_under_20
  • private

Now we will run a forward-selection algorithm to determine the best possible predictor variables based on \(R^2_{adj}\). The methodology is as follows:

  1. Start with single predictor regression of response vs each explanatory variable
  2. Pick the model with the hightest \(R^2_{adj}\)
  3. Add remaining variables one at a time to the existing model, and pick the model with the highest \(R^2_{adj}\)
  4. Repeat until the addition of any of the remaining variables does not result in a higher \(R^2_{adj}\)

Note: The alpha used is 0.1 and Bonferroni correction helps reduce the Type-I error rate.

The model selects only student_faculty_ratio as the predictor variable with an \(R^2_{adj} =\) 0.54

# The range of variables that the forward-selection algorithm will examine
scope <- alumni_giving_rate ~
  percent_of_classes_under_20 +
  student_faculty_ratio +
  private

# Building the final model using forward-selection algorithm
model <- SignifReg(scope=scope,
                   data=alumni,
                   alpha=0.1,
                   direction="forward",
                   criterion="r-adj",
                   correction="FDR")

summary(model)
## 
## Call:
## lm(formula = reg, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.328  -5.692  -1.471   4.058  24.272 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            53.0138     3.4215  15.495  < 2e-16 ***
## student_faculty_ratio  -2.0572     0.2737  -7.516 1.54e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.103 on 46 degrees of freedom
## Multiple R-squared:  0.5512, Adjusted R-squared:  0.5414 
## F-statistic: 56.49 on 1 and 46 DF,  p-value: 1.544e-09

Using the residuals vs fitted-values plot, we can conclude that the constant variance assumption is being violated for this model. The variance is fanning out i.e. increasing with increase in fitted_values.

# Constructing a dataframe containing model attributes
model_attributes <-
data.frame(index=1:nrow(alumni),
           residuals = model$residuals, 
           fitted_values = model$fitted.values)

# Residuals vs Fitted-Values Plot
ggplot(model_attributes, aes(x=fitted_values,y=residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  geom_abline(intercept = 0, slope = 0.7, color = "blue") +
  geom_abline(intercept = 0, slope = -0.7, color = "blue") +
  ylim(-25, 25)


To fix the non-constant variance problem we can apply Box-Cox Transformation to the response variable (Y) where:

\[Y^\lambda_{i} = (Y^\lambda_{i}-1)/\lambda\; for \; \lambda \neq 0\]

\[Y^\lambda_{i} = ln(Y{}i)\; for \; \lambda = 0\]

The \(\lambda\) obtained by the Box-Cox function is approximately 0.5

MASS::boxcox(alumni_giving_rate ~ student_faculty_ratio, data = alumni)


Applying forward-selection algorithm using the transformed response variable i.e. \(Y^\lambda_{i} = (Y^\lambda_{i}-1)/\lambda\) for \(\lambda =\) 0.5

The final model selects student_faculty_ratio and private as the predictor variables with an \(R^2_{adj} =\) 0.60.

lambda=0.5
alumni$alumni_giving_rate_boxcox <- (alumni$alumni_giving_rate^lambda-1)/lambda

scope <- alumni_giving_rate_boxcox ~
  percent_of_classes_under_20 +
  student_faculty_ratio +
  private

model1 <- SignifReg(scope=scope,
                    data=alumni,
                    alpha=0.1,
                    direction="forward",
                    criterion="r-adj",
                    correction="FDR")

summary(model1)
## 
## Call:
## lm(formula = reg, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.418 -1.082 -0.439  1.188  3.517 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           10.57852    1.51584   6.979  1.1e-08 ***
## student_faculty_ratio -0.27803    0.08403  -3.309  0.00185 ** 
## private                1.66528    0.87023   1.914  0.06204 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.625 on 45 degrees of freedom
## Multiple R-squared:  0.6207, Adjusted R-squared:  0.6039 
## F-statistic: 36.82 on 2 and 45 DF,  p-value: 3.363e-10

Model Diagnostics

We will check if our final model satistfies all the assumptions:


  1. Errors are normally distributed with mean=0

Using a Histogram, errors seem to be normally distributed and centred at 0.

# Constructing a dataframe containing model attributes
model_attributes1 <-
data.frame(index=1:nrow(alumni),
           residuals = model1$residuals, 
           fitted_values = model1$fitted.values)

# Plotting Histogram of Residuals
model_attributes1 %>% 
  ggplot(aes(x=residuals)) +
  geom_histogram(binwidth=2) 


Using a Q-Q Plot, errors seem to be normally distributed as well.

# Constructing Q-Q Plot
qqnorm(model_attributes1$residuals)
qqline(model_attributes1$residuals, col='red')

  1. Uncorrelated Errors

There seems to be no pattern for the errors over time (index). Thus we can safely assume that the errors are uncorrelated.

# Plotting Residuals over Time
model_attributes1 %>%
  ggplot(aes(x=index,y=residuals)) +
  geom_point()

  1. Constance Variance

We can clearly see that the residuals are constantly varied across the fitted values.

# Residuals vs Fitted-Value Plot
ggplot(model_attributes1, aes(x=fitted_values,y=residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  geom_hline(yintercept = 3, color = "blue") +
  geom_hline(yintercept = -3, color = "blue")