Linear Regression

Predicting Alumni Donation

Introduction

What am I trying to do?

Alumni donations are a crucial part of a universities’ revenue. Understanding what factors are contributing to a higher donation rate and implementing a few changes could lead to an increase in donations by future alumni.

What I did?

Several linear regression models were made to determine best fit. Our final model included student to faculty ratio and private/public variables in relation to alumni giving.
We concluded that a lower student to faculty ratio results in higher percentage of alumni giving, and private schools also had higher alumni giving. Since a university cannot change whether it is a private or public university, we suggest decreasing the student to faculty ratio. This may lead to an increase in alumni donations and thus an increase in revenue.

Packages Required

library(tidyverse)  #to visualize, transform, input, tidy and join data
library(dplyr)      #data wrangling
library(kableExtra) #to create HTML Table
library(DT)         #to preview the data sets
library(tibble)     #to better preview tibbles
library(scales)     # to clear cluttering of labels in plots

Data

Data source

We used data from America’s Best Colleges which included 48 national universities and four variables to analyze these factors.

url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)

                       
# Looking at first few rows of the data set
head(as_tibble(alumni))

## # A tibble: 6 x 5
##   ï..school    percent_of_class~ student_faculty~ alumni_giving_r~ private
##   <fct>                    <int>            <int>            <int>   <int>
## 1 Boston Coll~                39               13               25       1
## 2 "Brandeis U~                68                8               33       1
## 3 Brown Unive~                60                8               40       1
## 4 California ~                65                3               46       1
## 5 Carnegie Me~                67               10               28       1
## 6 Case Wester~                52                8               31       1

# Keeping relevant variables as per the problem statement
alum.data <- data.frame(y=alumni$alumni_giving_rate, x1=alumni$percent_of_classes_under_20, x2=alumni$student_faculty_ratio, x3=alumni$private) 

# Looking at first few rows of the data frame
head(alum.data)

##    y x1 x2 x3
## 1 25 39 13  1
## 2 33 68  8  1
## 3 40 60  8  1
## 4 46 65  3  1
## 5 28 67 10  1
## 6 31 52  8  1

Exploratory Data Analysis

Data Wrangling

Objective: Predicting alumni giving rate (y) using x1=percent of classes under 20 years of age, x2=student faculty ratio, x3=private

Scatter plots for different variables:

GGally::ggpairs(data = alum.data)

x1 and x2 seem to be correlated. Going to keep one of them. There is a little non linearity in y vs x2. But we can ignore as we have less data points. Anyway, 2 possible remedies could be : 1. Adding x2^2 term along with x2 and x3 in the lm model 2. Transforming x

In this case, non linearity does not seem a big trouble. Going to leave that. There do not seem to be aby outliers as well.

In the following code, we will perform an F test to check if we need to include the interaction terms or not.

Hypothesis to be checked

Ho: The coefficient of x2*x3=0

H1: The coefficient of x2*x3!=0

fit1 <- lm(y ~ x2 + x3 + x2*x3 , data = alum.data)
fit2 <- lm(formula = y ~ x2 +x3, data = alum.data)
anova(fit2,fit1)

## Analysis of Variance Table
## 
## Model 1: y ~ x2 + x3
## Model 2: y ~ x2 + x3 + x2 * x3
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     45 3627.2                           
## 2     44 3430.1  1    197.04 2.5275  0.119

Since, the p value>0.05, we do not have sufficient evidence to reject the hypothesis. Hence, we will not include the interaction term in our model. Our revised model is:

fit_full_model <- lm(formula = y ~  x2  + x3, data = alum.data)

aug.l_model<- fit_full_model %>% broom::augment() %>% mutate(row_num=1:n())
head(aug.l_model)

## # A tibble: 6 x 11
##       y    x2    x3 .fitted .se.fit .resid   .hat .sigma .cooksd .std.resid
##   <int> <int> <int>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
## 1    25    13     1    29.4    2.46  -4.37 0.0750   9.05 0.00694     -0.507
## 2    33     8     1    36.8    1.62  -3.81 0.0325   9.06 0.00208     -0.431
## 3    40     8     1    36.8    1.62   3.19 0.0325   9.07 0.00147      0.362
## 4    46     3     1    44.2    3.16   1.76 0.124    9.07 0.00207      0.210
## 5    28    10     1    33.8    1.64  -5.83 0.0335   9.04 0.00504     -0.661
## 6    31     8     1    36.8    1.62  -5.81 0.0325   9.04 0.00484     -0.657
## # ... with 1 more variable: row_num <int>

# Standardized Residuals vs Fitted Values ---------------------------------------------------

ggplot(data=aug.l_model, aes(x=.fitted,y=.std.resid))+geom_point() + geom_smooth(se=FALSE) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2", size=1) +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted", size=1) +
  xlab("Fitted value") + ylab("Standardized residual") + ggtitle("Standardized Residuals vs Fitted Values")+
  theme(plot.title = element_text(hjust = 0.5))

# QQ Plot
ggplot(data=aug.l_model, aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line(linetype = "dashed", color = "red2") +
  xlab("Theoretical quantile") +
  ylab("Sample quantile") + ggtitle("Q-Q Plot")+
  theme(plot.title = element_text(hjust = 0.5))

Analysis

We can see that the variance is not constant through the first plot. We can apply the box cox procedure to transform y to remedy this problem. Additionally, there is some non linearity, which we can ignore as we have few data points.
The model almost follows the 45 degree line in QQ plot. We do have long tails which suggests some skenwness. However, the plot is good enough for considering the normality assumption to be valid.

Applying Box Cox Transformation

bc <- MASS::boxcox(y ~ x2 + x3 + x3, data = alum.data)

lambda <- bc$x[which.max(bc$y)]

# creating a new variable y2 for lambda=0.3434

alum.data <- data.frame(y=alumni$alumni_giving_rate, x1=alumni$percent_of_classes_under_20,
                        x2=alumni$student_faculty_ratio, x3=alumni$private) 

alum.data$y2<-((alum.data$y^lambda) - 1)/ lambda

#Revising lr model

fit_full_model <- lm(formula = y2 ~ x2 + x3 , data = alum.data)
summary(fit_full_model)

## 
## Call:
## lm(formula = y2 ~ x2 + x3, data = alum.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1031 -0.6559 -0.2420  0.7583  1.9529 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.32445    0.90662   8.079 2.65e-10 ***
## x2          -0.16613    0.05026  -3.305  0.00187 ** 
## x3           1.05289    0.52049   2.023  0.04905 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9721 on 45 degrees of freedom
## Multiple R-squared:  0.6298, Adjusted R-squared:  0.6134 
## F-statistic: 38.28 on 2 and 45 DF,  p-value: 1.946e-10

The hypothesis tested is that the coefficients of the predictor variables are 0 (individually). We can see that p-value of both these predictors are less than 0.05. Hence, we reject the hypothesis and both these variables are significant.

The adjusted Rsq (0.613) is better than most of the other models that we tried.

# Revised Standardized Residuals vs Fitted Values ---------------------------------------------------

ggplot(data=aug.l_model, aes(x=.fitted,y=.std.resid))+geom_point() + 
geom_smooth(se=FALSE) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red2", size=1) +
geom_hline(yintercept = c(-2, 2), linetype = "dotted", size=1) +
xlab("Fitted value") + 
ylab("Standardized residual") + 
ggtitle("Revised Standardized Residuals vs Fitted Values")+
theme(plot.title = element_text(hjust = 0.5))

# Revised QQ Plot
ggplot(data=aug.l_model, aes(sample = .std.resid)) +
geom_qq() +
geom_qq_line(linetype = "dashed", color = "red2") +
xlab("Theoretical quantile") +
ylab("Sample quantile") + ggtitle("Revised Q-Q Plot")+
theme(plot.title = element_text(hjust = 0.5))

The revised standard residual vs fitted values plot indicate that the assumption of linearity and constant variance is being met as the data points are scattered randomly across the plot. The plot seems better than the one before transformation.
The Q-Q plot does not indicate any substantial deviation from the normality. Hence, the assumptions for linear regression are being met by our model

Final Plot

Based on our work in the previous section, we get the final plot as follows:

ggplot(broom::augment(fit_full_model), aes(x=x2,y=.fitted))+
geom_point(aes(y=y2,color=as.factor(x3)),size=2) + 
geom_line(aes(group=x3),color="black") +
xlab("Student Faculty Ratio") + 
ylab("Transformed Alumni Giving Rate") + 
ggtitle("Linear Regression Model") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_discrete(name="Private University",labels = c("No","Yes"))

Analysis

Our final regression line created after regressing the transformed response variable against two predictors, student faculty ratio and public/private is shown above.
We can see that the alumni giving rate (after transformation) decreases by 0.16613 for every unit increase in the student faculty ratio. However, the intercepts for the private and non-private universities are different. Also, the alumni giving rate is higher for private universities than the non-private universities. It decreases as the student faculty ratio increases for both types of universities.

Summary

Our results show a very interesting connection between alumni donations, student to faculty ratio, and private vs public schools. Universities with high student to faculty ratios receive fewer alumni donations, and public universities receive even fewer alumni donations. On average, public universities also have higher student to faculty ratios, so the effect is magnified. The data set used in this analysis consists of just 48 universities. It would be interesting to see if our final model could accurately predict alumni giving for all universities.

Linear Regression

Aabhaas Sethi

December 20, 2018

Predicting Alumni Donation

Introduction

What am I trying to do?

What I did?

Packages Required

Data

Exploratory Data Analysis

Data Wrangling

Ho: The coefficient of x2*x3=0

H1: The coefficient of x2*x3!=0

Final Plot

Summary