library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(lubridate)
library(tibble)
library(knitr)
library(statsr)
library(SignifReg)
library(tidyverse)
library(car)Alumni donations are an important source of revenue for colleges and universities. If administrators could determine the factors that influence increased donation among alumni’s, they might be able to implement policies that could lead to increased revenues. Research shows that students who are more satisfied with their contact with teachers are more likely to graduate. As a result, one might suspect that smaller class sizes and lower student-faculty ratios might leads to a higher percentage of satisfied graduates, which in turn might lead to increased alumni donations. We have taken a dataset containing information of 48 national universities (America’s Best Colleges, Year 2000 Edition) and studied how the different factors affect the alumni giving rate. We have implemented a multiple linear regression model to answer this question.
The alumni data set has 5 variables and 48 observations:
All values are reasonable and there seems to be no outlier.
# Reading Data
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
summary_fn <- function(x)
{
alumni %>%
summarise(min=min(x),
max=max(x),
range=diff(range(x)),
mean=round(mean(x),2),
median=median(x),
missing=sum(is.na(x)),
Q1=quantile(x,probs=0.25),
Q2=quantile(x,probs=0.75))
}
summary_data <- rbind(summary_fn(alumni$alumni_giving_rate),
summary_fn(alumni$percent_of_classes_under_20),
summary_fn(alumni$student_faculty_ratio))
summary_data <-
summary_data %>%
mutate(Variable = c("alumni_giving_rate",
"percent_of_classes_under_20",
"student_faculty_ratio")) %>%
select(Variable,everything())
kable(summary_data)| Variable | min | max | range | mean | median | missing | Q1 | Q2 |
|---|---|---|---|---|---|---|---|---|
| alumni_giving_rate | 7 | 67 | 60 | 29.27 | 29.0 | 0 | 18.75 | 38.50 |
| percent_of_classes_under_20 | 29 | 77 | 48 | 55.73 | 59.5 | 0 | 44.75 | 66.25 |
| student_faculty_ratio | 3 | 23 | 20 | 11.54 | 10.5 | 0 | 8.00 | 13.50 |
alumni %>%
select(alumni_giving_rate,percent_of_classes_under_20,student_faculty_ratio,private) %>%
ggpairs()Using the Exploratory Data Analysis, these are the predictor variables which have some association with the alumni giving rate:
Now we will run a forward-selection algorithm to determine the best possible predictor variables based on \(R^2_{adj}\). The methodology is as follows:
Note: The alpha used is 0.1 and Bonferroni correction helps reduce the Type-I error rate.
The model selects only student_faculty_ratio as the predictor variable with an \(R^2_{adj} =\) 0.54
# The range of variables that the forward-selection algorithm will examine
scope <- alumni_giving_rate ~
percent_of_classes_under_20 +
student_faculty_ratio +
private
# Building the final model using forward-selection algorithm
model <- SignifReg(scope=scope,
data=alumni,
alpha=0.1,
direction="forward",
criterion="r-adj",
correction="FDR")
summary(model)##
## Call:
## lm(formula = reg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.328 -5.692 -1.471 4.058 24.272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.0138 3.4215 15.495 < 2e-16 ***
## student_faculty_ratio -2.0572 0.2737 -7.516 1.54e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.103 on 46 degrees of freedom
## Multiple R-squared: 0.5512, Adjusted R-squared: 0.5414
## F-statistic: 56.49 on 1 and 46 DF, p-value: 1.544e-09
Using the residuals vs fitted-values plot, we can conclude that the constant variance assumption is being violated for this model. The variance is fanning out i.e. increasing with increase in fitted_values.
# Constructing a dataframe containing model attributes
model_attributes <-
data.frame(index=1:nrow(alumni),
residuals = model$residuals,
fitted_values = model$fitted.values)
# Residuals vs Fitted-Values Plot
ggplot(model_attributes, aes(x=fitted_values,y=residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
geom_abline(intercept = 0, slope = 0.7, color = "blue") +
geom_abline(intercept = 0, slope = -0.7, color = "blue") +
ylim(-25, 25)To fix the non-constant variance problem we can apply Box-Cox Transformation to the response variable (Y) where:
\[Y^\lambda_{i} = (Y^\lambda_{i}-1)/\lambda\; for \; \lambda \neq 0\]
\[Y^\lambda_{i} = ln(Y{}i)\; for \; \lambda = 0\]
The \(\lambda\) obtained by the Box-Cox function is approximately 0.5
MASS::boxcox(alumni_giving_rate ~ student_faculty_ratio, data = alumni)Applying forward-selection algorithm using the transformed response variable i.e. \(Y^\lambda_{i} = (Y^\lambda_{i}-1)/\lambda\) for \(\lambda =\) 0.5
The final model selects student_faculty_ratio and private as the predictor variables with an \(R^2_{adj} =\) 0.60.
lambda=0.5
alumni$alumni_giving_rate_boxcox <- (alumni$alumni_giving_rate^lambda-1)/lambda
scope <- alumni_giving_rate_boxcox ~
percent_of_classes_under_20 +
student_faculty_ratio +
private
model1 <- SignifReg(scope=scope,
data=alumni,
alpha=0.1,
direction="forward",
criterion="r-adj",
correction="FDR")
summary(model1)##
## Call:
## lm(formula = reg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.418 -1.082 -0.439 1.188 3.517
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.57852 1.51584 6.979 1.1e-08 ***
## student_faculty_ratio -0.27803 0.08403 -3.309 0.00185 **
## private 1.66528 0.87023 1.914 0.06204 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.625 on 45 degrees of freedom
## Multiple R-squared: 0.6207, Adjusted R-squared: 0.6039
## F-statistic: 36.82 on 2 and 45 DF, p-value: 3.363e-10
We will check if our final model satistfies all the assumptions:
Using a Histogram, errors seem to be normally distributed and centred at 0.
# Constructing a dataframe containing model attributes
model_attributes1 <-
data.frame(index=1:nrow(alumni),
residuals = model1$residuals,
fitted_values = model1$fitted.values)
# Plotting Histogram of Residuals
model_attributes1 %>%
ggplot(aes(x=residuals)) +
geom_histogram(binwidth=2) Using a Q-Q Plot, errors seem to be normally distributed as well.
# Constructing Q-Q Plot
qqnorm(model_attributes1$residuals)
qqline(model_attributes1$residuals, col='red')There seems to be no pattern for the errors over time (index). Thus we can safely assume that the errors are uncorrelated.
# Plotting Residuals over Time
model_attributes1 %>%
ggplot(aes(x=index,y=residuals)) +
geom_point()We can clearly see that the residuals are constantly varied across the fitted values.
# Residuals vs Fitted-Value Plot
ggplot(model_attributes1, aes(x=fitted_values,y=residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
geom_hline(yintercept = 3, color = "blue") +
geom_hline(yintercept = -3, color = "blue")As the Variation Inflation Factor < 5 for each predictor variable, we can assume that there is no multi-collinearity.
vif(model1)## student_faculty_ratio private
## 2.956517 2.956517
Almost all standardized errors are below the absolute value of 2. Thus we do not have any extreme outliers that may be influencing the regression line.
# Plotting Studentized/Standardized Errors
rstan <- rstandard(model1)
plot(rstan)Some of this data is easy to obtain: A quick search on each University gives us the Year Established, Number of Undergrads and Current Endowment Size. However, some of the data is subjective; like measuring alumni relations team’s effectiveness.