Linear_Regression_Analysis_to_predict_alumni_giving

Chapter 1: Introduction

The goal of this project is to build a linear regression model that can predict alumni giving rate if certain attributes related to university are given.
The regression model is built from alumni dataset that contains 48 observations and has four variables of interest - percentage of classes under 20, student faculty ratio, private and alumni giving rate.
This model would be helpful in finding factors that influence alumni donation rate which would help management of an university to take necessary steps in that direction.
This would help university to get more funds helping them to further improve these factors. This becomes a continuous cycle.

Chapter 2: Exploratory Data Analysis

url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
dim(alumni)

## [1] 48  5

alumni <- alumni[c(2,3,4,5)]
sapply(alumni, class)

## percent_of_classes_under_20       student_faculty_ratio 
##                   "integer"                   "integer" 
##          alumni_giving_rate                     private 
##                   "integer"                   "integer"

alumni$private=as.factor(alumni$private)
summary(alumni)

##  percent_of_classes_under_20 student_faculty_ratio alumni_giving_rate
##  Min.   :29.00               Min.   : 3.00         Min.   : 7.00     
##  1st Qu.:44.75               1st Qu.: 8.00         1st Qu.:18.75     
##  Median :59.50               Median :10.50         Median :29.00     
##  Mean   :55.73               Mean   :11.54         Mean   :29.27     
##  3rd Qu.:66.25               3rd Qu.:13.50         3rd Qu.:38.50     
##  Max.   :77.00               Max.   :23.00         Max.   :67.00     
##  private
##  0:15   
##  1:33   
##         
##         
##         
##

boxplot(alumni, use.cols = TRUE)

pairs(alumni, cex = 1.2, pch = 19, col = adjustcolor("darkred", alpha.f = 0.5))
GGally::ggpairs(alumni)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

alumni %>%
  lm(alumni_giving_rate ~ student_faculty_ratio + percent_of_classes_under_20 , data = .) %>%
  vif()

##       student_faculty_ratio percent_of_classes_under_20 
##                    2.611671                    2.611671

Exploring Summary Statistics of all variables

The dataset alumni contains 48 observations and has four variables of interest - percentage of classes under 20, student faculty ratio, private and alumni giving rate. There are no missing values in these four variables.
We would be using alumni giving rate as response variable and rest of the 3 variables would be considered as predictor variables
Exploring Summary Statistics of all variables
Summary statistics of the variables of interest
- Private is a categorical variable
- The response variable has the highest range of 60
- The Mean and median of percent_of_class_under_20 is not similar suggesting a non-symmetric distribution. The distribution looks left skewed.

Exploring variables individually and interaction between variables

The distribution of variables percent of classes under 200 and student faculty ratio looks bimodal. One peak for could be for private and another peak for non-private universities. The distribution of variable alumni giving rate looks approximately normal.
There is positive linear relationship between alumni giving rate and percent ofclassesunder20
There is negative linear relationship between alumni giving rate and student faculty ratio
There seems to be a negative relationship between percent of classes under 20 and student faculty ratio. There can be a problem of multicollinearity. When we check the the VIF value we the value of it as 2.61. This is far less than 10. Hence there is no multicollinearity between these two variables

Chapter 3: Model Selection

For model selection, we try all models from no effect to 3- way interactions based on AIC and BIC using 3 ways of model selection – forward selection, backward elimination and step wise selection.

Code:

fit_min <- lm(alumni_giving_rate ~ 1, data = alumni)
fit_max_2 <- lm(alumni_giving_rate ~ .^3, data = alumni)
fs_1 <- step(fit_min, direction = "forward", 
             scope = list(lower = fit_min,
                          upper = fit_max_2),
             trace = 0, k = log(nrow(alumni)))
be_1 <- step(fit_max_2, direction = "backward", 
             trace = 0, k = log(nrow(alumni)))
be_2 <- step(fit_max_2, direction = "backward", 
             trace = 0, k = 2)
ss_1 <- step(be_2, direction = "both", 
             scope = list(lower = fit_min,
                          upper = fit_max_2),
             trace = 0, k = log(nrow(alumni)))
fs_2 <- step(fit_min, direction = "forward", 
             scope = list(lower = fit_min,
                          upper = fit_max_2),
             trace = 0, k = 2)

ss_2 <- step(be_2, direction = "both", 
             scope = list(lower = fit_min,
                          upper = fit_max_2),
             trace = 0, k = 2)

Function to compute the PRESS statistic (a form of cross-validation). Note: smaller is better!

PRESS <- function(object, ...) {
  if(!missing(...)) {
    res <- sapply(list(object, ...), FUN = function(x) {
      sum(rstandard(x, type = "predictive") ^ 2)
    })
    names(res) <- as.character(match.call()[-1L])
    res
  } else {
    sum(rstandard(object, type = "predictive") ^ 2)
  }
}

Function to compute various model metrics

modelMetrics <- function(object, ...) {
  if(!missing(...)) {
    res <- sapply(list(object, ...), FUN = function(x) {
      c("AIC" = AIC(x), "BIC" = BIC(x), 
        "adjR2" = summary(x)$adj.r.squared,
        "RMSE"  = sigma(x), "PRESS" = PRESS(x), 
        "nterms" = length(coef(x)))
    })
    colnames(res) <- as.character(match.call()[-1L])
    res
  } else {
    c("AIC" = AIC(object), "BIC" = BIC(object), 
      "adjR2" = summary(object)$adj.r.squared, 
      "RMSE"  = sigma(object), "PRESS" = PRESS(object),
      "nterms" = length(coef(object)))
  }
}

Comparing models

res <- modelMetrics(be_1,fs_1,ss_1,be_2,fs_2,ss_2)
round(res, digits = 3)

##            be_1     fs_1     ss_1     be_2     fs_2     ss_2
## AIC     352.196  352.196  352.196  351.365  351.137  351.365
## BIC     357.810  357.810  357.810  362.592  360.493  362.592
## adjR2     0.541    0.541    0.541    0.574    0.569    0.574
## RMSE      9.103    9.103    9.103    8.768    8.829    8.768
## PRESS  4138.880 4138.880 4138.880 4020.749 4124.727 4020.749
## nterms    2.000    2.000    2.000    5.000    4.000    5.000

Based on the above parameters mainly PRESS and adjusted R-squated value, we have 2 major choices

Model 1:

alumni_giving_rate ~ student_faculty_ratio

Observations:

This model has only one predictor variable(student_faculty_ratio) making it simple model
The sign of the estimated parameter is in sync with the negative relationship between response variable and student_faculty_ratio. The p-value<0.05 of the t-test helps us to reject the null hypothesis(_1=0)
The adjusted R-squared value of this model is 0.5414
The p-value of F-statistic is also less than 0.05 which helps to reject the null hypothesis (_1=0) which is similar to t-test in this case. The PRESS value of this model is 4138.88

Model 2:

alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio + private + percent_of_classes_under_20:student_faculty_ratio

Observations:

This model has only four predictor variable(student_faculty_ratio) including an interaction variable between percent_classes_under_20 and student_faculty_ratio
The sign of the estimated parameter of student_faculty_ratio is not in sync with the negative relationship between response variable and student_faculty_ratio. The p-value>0.05 of the t-test doesnot allow us to reject the null hypothesis(_1=0)
The adjusted R-squared value of this model is 0.5745
The p-value of F-statistic is less than 0.05 which helps to reject the null hypothesis (_1=_2=_3=_4=0)
The PRESS value of this model is 4020.749

Considering below factors we choose our model to be model 1. a. Model 1 is a very simple model with only 1 predictor variable b. There is not much improvement in the value of adjusted R-squared as we include more variables and interaction between variables as seen in model 2 c. The model 2 doesnot capture the correct trend between response variable and predictor variable. (sign of estimated parameter of student_faculty_Ratio is opposite)

Therefore our model is

alumni_giving_rate = 53.0138 - 2.0572(student_faculty_ratio)

The adjusted R-squated value for this model is 0.5414.
This model states that for every one unit increase in student_faculty_ratio, the alumni_giving_rate decreases by 2.0572 units

Chapter 4: Model Diagnostic Analysis

We look at different diagnostic plots to know the adequacy of the selected regression model

yhat <- fitted(ss_2)
rstan <- rstandard(ss_2)
library(broom)
alumni2 <- alumni %>%
  lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio + private + percent_of_classes_under_20:student_faculty_ratio, data = .) %>%
  augment() %>%
  mutate(row_num = 1:n())

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.3

#yhat vs residuals
ggplot(alumni2, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  # geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  xlab("Fitted value") +
  ylab("Standardized residual") +
  theme_light()

#QQ plot
ggplot(alumni2, aes(sample = .std.resid)) +
  geom_qq(alpha = 0.3) +
  geom_qq_line(linetype = "dashed", color = "red2") +
  xlab("Theoretical quantile") +
  ylab("Sample quantile") +
  theme_light()

ggplot(alumni2, aes(x = row_num, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  xlab("Index") +
  ylab("Standardized residual") +
  theme_light()

ggplot(alumni2, aes(x = percent_of_classes_under_20, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residual") +
  theme_light()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(alumni2, aes(x = student_faculty_ratio, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residual") +
  theme_light()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(alumni2, aes(x = private, y = .std.resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red2") +
  geom_hline(yintercept = c(-2, 2), linetype = "dotted") +
  geom_smooth(color = "forestgreen", alpha = 0.1, se = FALSE) +
  ylab("Standardized residual") +
  theme_light()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#outliers
h <- hatvalues(ss_2)
plot(h, type = "h", ylim = extendrange(h, f = 0.15))
abline(h = 2 * 5 / nrow(alumni), lty = "dotted")
text(h, labels = seq_len(nrow(alumni)), pos = 3, col = "red2")

Observations:

The varainace looks constant – The residuals are randomly scattered against both fitted values and predictor variable student_faculty_ratio
There seems to be no misspecification of mean
The error look approximately normally distributed from QQ plot. It looks slightly right skewed
Therefore we can conclude that our model looks ideal.
Our Final model is

alumni_giving_rate = 53.0138 - 2.0572(student_faculty_ratio)

Chapter 5: Discussion and Future Scope

Scope for improvement with variable tranformation: We haven’t considered variable transformation for both predictor and response variables which could help in building model with better model metrics.
Interaction between variables: Though the value of VIF between percent of classes under 20 and student faculty ratio is very low, there is clear negative linear relation between these two variables. This has to be further explored.

Chapter 6: Summary

Using Linear Regression, we developed a model to predict the alumni giving rate of universities. According to the final model, the alumni giving rate is dependent on student faculty ration of the university. For every unit increase in the student faculty ratio, the alumni giving rate decreases by 2.0572. This suggests universities to either have fewer intake of students or hire more faculty to maintain a better student faculty ratio. This would increase the university’s income from alumni.

Linear_Regression_Analysis_to_predict_alumni_giving_rate

Pravallika Mulukuri

5/21/2020

Chapter 1: Introduction

Chapter 2: Exploratory Data Analysis

Exploring Summary Statistics of all variables

Exploring variables individually and interaction between variables

Chapter 3: Model Selection

Function to compute the PRESS statistic (a form of cross-validation). Note: smaller is better!

Function to compute various model metrics

Comparing models

Model 1:

alumni_giving_rate ~ student_faculty_ratio

Observations:

Model 2:

alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio + private + percent_of_classes_under_20:student_faculty_ratio

Observations:

alumni_giving_rate = 53.0138 - 2.0572(student_faculty_ratio)

Chapter 4: Model Diagnostic Analysis

Observations:

alumni_giving_rate = 53.0138 - 2.0572(student_faculty_ratio)

Chapter 5: Discussion and Future Scope

Chapter 6: Summary