Linear Regression

Alumni Donation Case Study

Introduction

The original data comprises of three predictors for response variable alumni giving rate. The report explores all combinations of simple and multiple linear regression. Given the original data set, the summary statistics and model selection parameters, the best fit is obtained by combining student-faculty ratio and Private predictor to build the multiple linear regression model.

As an extension to the original data, we scraped educational institution metrics from federal websites to incorporate predictors such as admission rate, graduation completion rate, etc. into the regression analysis. With inclusion of graduation completion rate and student faculty ratio, we obtain another alternate fit with high R-square and low AIC/BIC values.

Based on the validity of the underlying assumptions of new predictors, one of the two models can be finalized.

Data Description

The data description section can be broken down into

The predictor variables in the given dataset have the following description- 1. School: Name of the school to which the alumni belong to 2. Student/Faculty Ratio: The number of students per teacher present in a class 3. % of Classes Under 20: The percentage of the classes having the average age under 20 4. Private: A binary column indicating if the school/university is public or private 5. Alumni Giving rate: The percentage of alumni who have been contributing to the School from which they have graduated from.

The new predictor variables introduced into the data set have the following description [as per source listed in reference] 1. ADM_Rate: Overall admission rate of the college institute for a 4-year program. 2. SATMT75: SAT Math test score of the 75th percentile student. 3. PFTFAC: Percentage of full-time faculty in the institute. 4. C150_4: The proportion of full-time, first-time, degree/certificate-seeking undergraduates who completed a degree or certificate at the institution within 150 percent of normal time, calculated from standardized metrics. 5. UNEMP_RATE: Unemployment rate via census data.

The rationale behind introducing these predictors was to adopt a holistic perspective towards alumni donations. With the inclusion of these predictors, the merit and earning potential of a candidate is also taken into consideration when estimating alumni donations. Assumption with the new predictors

The additional predictor metrics obtained from public sources represent the latest update made on 1/17/2021. The alumni data set has observations from the year 2000. While the observations may seem anachronous, we assume that the chosen metrics are time-invariant or change uniformly. For example, the latest ADM_rate values for California Institute of Technology and Carnegie Mellon University are 6% and 17% respectively. Given the competitiveness of all major universities in USA, it is highly likely that these values remain constant over time or have increased/reduced at the same average rate for all colleges. The same assumption is extended for other predictors such as C150_4, UNEMP_Rate, etc.

In case of missing values on some observations of the new predictors, imputations were performed on predictors by substituting missing data with mean value of the available observations.

library(readxl)
alumni_data <- read_xls("C:/Users/Paaras Saxena/Documents/UCINN/Academics/Flex2/BANA7052- LINEAR REGRESSION/Lecture5/alumni.xls")
str(alumni_data)

## tibble [48 x 10] (S3: tbl_df/tbl/data.frame)
##  $ School               : chr [1:48] "Boston College" "Brandeis University" "Brown University" "California Institute of Technology" ...
##  $ % of Classes Under 20: num [1:48] 39 68 60 65 67 52 45 69 72 61 ...
##  $ Student/Faculty Ratio: num [1:48] 13 8 8 3 10 8 12 7 13 10 ...
##  $ Alumni Giving Rate   : num [1:48] 25 33 40 46 28 31 27 31 35 53 ...
##  $ Private              : num [1:48] 1 1 1 1 1 1 1 1 1 1 ...
##  $ ADM_RATE             : num [1:48] 0.2789 0.3115 0.0767 0.0662 0.1712 ...
##  $ SATMT75              : num [1:48] 770 780 790 800 800 765 765 765 790 790 ...
##  $ PFTFAC               : num [1:48] 0.648 0.68 0.896 0.919 0.952 ...
##  $ C150_4               : num [1:48] 0.919 0.88 0.95 0.92 0.888 ...
##  $ UNEMP_RATE           : num [1:48] 2.88 3 3.28 3.14 2.98 ...

alumni_reg <- alumni_data[, -c(1,5)]
alumni_reg1 <- alumni_data[, -c(1,5)]
names(alumni_reg)

## [1] "% of Classes Under 20" "Student/Faculty Ratio" "Alumni Giving Rate"   
## [4] "ADM_RATE"              "SATMT75"               "PFTFAC"               
## [7] "C150_4"                "UNEMP_RATE"

str(alumni_reg)

## tibble [48 x 8] (S3: tbl_df/tbl/data.frame)
##  $ % of Classes Under 20: num [1:48] 39 68 60 65 67 52 45 69 72 61 ...
##  $ Student/Faculty Ratio: num [1:48] 13 8 8 3 10 8 12 7 13 10 ...
##  $ Alumni Giving Rate   : num [1:48] 25 33 40 46 28 31 27 31 35 53 ...
##  $ ADM_RATE             : num [1:48] 0.2789 0.3115 0.0767 0.0662 0.1712 ...
##  $ SATMT75              : num [1:48] 770 780 790 800 800 765 765 765 790 790 ...
##  $ PFTFAC               : num [1:48] 0.648 0.68 0.896 0.919 0.952 ...
##  $ C150_4               : num [1:48] 0.919 0.88 0.95 0.92 0.888 ...
##  $ UNEMP_RATE           : num [1:48] 2.88 3 3.28 3.14 2.98 ...

head(tibble::as_tibble(alumni_data), n = 5)

## # A tibble: 5 x 10
##   School `% of Classes U~ `Student/Facult~ `Alumni Giving ~ Private ADM_RATE
##   <chr>             <dbl>            <dbl>            <dbl>   <dbl>    <dbl>
## 1 Bosto~               39               13               25       1   0.279 
## 2 Brand~               68                8               33       1   0.312 
## 3 Brown~               60                8               40       1   0.0767
## 4 Calif~               65                3               46       1   0.0662
## 5 Carne~               67               10               28       1   0.171 
## # ... with 4 more variables: SATMT75 <dbl>, PFTFAC <dbl>, C150_4 <dbl>,
## #   UNEMP_RATE <dbl>

Correlation

GGally::ggpairs(alumni_reg, mapping = aes(alpha = 0.5)) + theme_light()

Here is the correlation plot among all the numerical variables of the dataset including the new predictors. Additionally, Alumni Giving Rate has a strong correlation with a majority of the predictors. Same with C150_4 and Student/Faculty Ratio.

Subset Variable Selection (SVS)

Upon adding new columns, Subset varaiable selection method was used to find out better predictors for the new data. SVS is a wrapper method where the predictive power of the variables are evaluated jointly.

Initially, no interaction parameters were considered.

regsubsets function was used to out the best fit MLR model using the least BIC. MLR with Student/Faculty Ratio + C150_4 came out to be the best.

library(ISLR)

fit_u = regsubsets(`Alumni Giving Rate` ~ ., alumni_reg1, nvmax = 7)
summary(fit_u)

## Subset selection object
## Call: regsubsets.formula(`Alumni Giving Rate` ~ ., alumni_reg1, nvmax = 7)
## 7 Variables  (and intercept)
##                         Forced in Forced out
## `% of Classes Under 20`     FALSE      FALSE
## `Student/Faculty Ratio`     FALSE      FALSE
## ADM_RATE                    FALSE      FALSE
## SATMT75                     FALSE      FALSE
## PFTFAC                      FALSE      FALSE
## C150_4                      FALSE      FALSE
## UNEMP_RATE                  FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
##          `% of Classes Under 20` `Student/Faculty Ratio` ADM_RATE SATMT75
## 1  ( 1 ) " "                     "*"                     " "      " "    
## 2  ( 1 ) " "                     "*"                     " "      " "    
## 3  ( 1 ) "*"                     "*"                     " "      " "    
## 4  ( 1 ) "*"                     "*"                     " "      " "    
## 5  ( 1 ) "*"                     "*"                     " "      " "    
## 6  ( 1 ) "*"                     "*"                     "*"      " "    
## 7  ( 1 ) "*"                     "*"                     "*"      "*"    
##          PFTFAC C150_4 UNEMP_RATE
## 1  ( 1 ) " "    " "    " "       
## 2  ( 1 ) " "    "*"    " "       
## 3  ( 1 ) " "    "*"    " "       
## 4  ( 1 ) " "    "*"    "*"       
## 5  ( 1 ) "*"    "*"    "*"       
## 6  ( 1 ) "*"    "*"    "*"       
## 7  ( 1 ) "*"    "*"    "*"

plot(fit_u, scale = "bic")

Then, Interaction parameters were considered.

The maximum possible number of predictors on allowing 2 way interaction parameters were “268,435,456”. To avoid such heavy computation, Forward selection, Backward elimination and Step-wise selection were used.

numSubsets <- function(x, max.int = 1) {
  if (max.int > x) {
    stop("`max.int` cannot be larger than ",
         x,".", call = FALSE)
  }
  
  x <- as.integer(x)
  max.int <- as.integer(max.int)
  res <- 0
  for (i in seq_len(max.int)) {
    res <- res + choose( n = x, k = i)
  }
  2 ^ res     
}
#interactions?

x <- c(numSubsets(7, max.int= 1),
       numSubsets(7, max.int= 2),
       numSubsets(7, max.int= 3)
       )
scales::comma(x)

## [1] "128"                       "268,435,456"              
## [3] "9,223,372,036,854,775,808"

Backward Elimination -

Here, we start with fitting all the possible 2-way interaction predictors, followed by removing the least significant parameters (ie p-value > 0.05) at each iteration to provide a model with better fit. However, few predictors can have p-values more than 0.05 and still may seem significant as its insignificance may be nullified by others strongly significant parameters.

fit_max_1 <- lm(`Alumni Giving Rate` ~ ., data = alumni_reg)
be_1 <- step(fit_max_1, direction = "backward", trace = 0, k = log(nrow(alumni_reg)))


fit_max_2 <- lm(`Alumni Giving Rate` ~ .^2, data = alumni_reg)
be_2 <- step(fit_max_2, direction = "backward", trace = 0, k = log(nrow(alumni_reg)))

Forward Selection -

Here, we start with a null model with zero predictors, followed by adding significant parameters one by one at each iteration until an addition of a new variable does not improve the fit.

Here, the scope was limited to the initial parameters of backward elimination to avoid any computational load.

fit_min <- lm(`Alumni Giving Rate` ~ 1, data = alumni_reg)
fs_1 <- step(fit_min, direction = "forward",
             scope = list(lower = fit_min,
                          upper = fit_max_1),
             trace = 0, k = log(nrow(alumni_reg)))

fs_2 <- step(fit_min, direction = "forward",
             scope = list(lower = fit_min,
                          upper = fit_max_2),
             trace = 0, k = log(nrow(alumni_reg)))

Step-wise regression -

This method combines both Forward selection and Backward elimination. Its scope was limited to initial parameters of Forward selection (lower value) to Backward elimination (higher value)

## Step_wise selection

ss_1 <- step(be_1, direction = "both", 
            scope = list(lower = fit_min,
                         upper = fit_max_1),
            trace = 0, k = log(nrow(alumni_reg)))

ss_2 <- step(be_2, direction = "both", 
            scope = list(lower = fit_min,
                         upper = fit_max_2),
            trace = 0, k = log(nrow(alumni_reg)))

PRESS

From all these models, BIC values were calculated by providing k = log(nrow). PRESS statistics is also taken into account as a cross validation for the better fit.

PRESS <- function(object, ...){
  if(!missing(...)) {
    res <- sapply(list(object, ...), FUN = function(x) {
      sum(rstandard(x, type = "predictive") ^ 2)
    })
    names(res) <- as.character(match.call()[-1L])
    res
  } else {
    sum(rstandard(object, type = "predictive") ^ 2)
  }
}

Model Metrics

Additionally, a model metric was created too to find out values of various other parameters like AIC, RMSE, adjR2 and nterms (number of predictors) for all the above models.

modelMetrics <- function(object, ...) {
  if(!missing(...)) {
    res <- sapply(list(object, ...), FUN = function(x) {
      c("AIC" = AIC(x), "BIC" = BIC(x),
        "adjR2" = summary(x)$adj.r.squared,
        "RMSE" = sigma(x), "PRESS" = PRESS(x),
        "nterms" = length(coef(x)))
    })
    colnames(res) <- as.character(match.call()[-1L])
    res
  } else {
    c("AIC" = AIC(object), "BIC" = BIC(object),
      "adjR2" = summary(object)$adj.r.squared,
      "RMSE" = sigma(object), "PRESS" = PRESS(object),
      "nterms" = length(coef(object)))
  }
}

Compare models

res <- modelMetrics(be_1, be_2, fs_1, fs_2, ss_1, ss_2)
round(res, digits = 3)

##            be_1     be_2     fs_1     fs_2     ss_1     ss_2
## AIC     338.003  333.482  338.003  338.003  338.003  333.482
## BIC     345.488  361.550  345.488  345.488  345.488  361.550
## adjR2     0.665    0.745    0.665    0.665    0.665    0.745
## RMSE      7.775    6.785    7.775    7.775    7.775    6.785
## PRESS  3109.729 3662.297 3109.729 3109.729 3109.729 3662.297
## nterms    3.000   14.000    3.000    3.000    3.000   14.000

Summary of SS1

summary(ss_1)

## 
## Call:
## lm(formula = `Alumni Giving Rate` ~ `Student/Faculty Ratio` + 
##     C150_4, data = alumni_reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.614  -4.254  -1.258   3.472  20.509 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -100.9704    36.3560  -2.777 0.007964 ** 
## `Student/Faculty Ratio`   -1.4322     0.2762  -5.186 4.95e-06 ***
## C150_4                   160.4298    37.7552   4.249 0.000106 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.775 on 45 degrees of freedom
## Multiple R-squared:  0.6797, Adjusted R-squared:  0.6654 
## F-statistic: 47.74 on 2 and 45 DF,  p-value: 7.51e-12

Summary of SS2

summary(ss_2)

## 
## Call:
## lm(formula = `Alumni Giving Rate` ~ `% of Classes Under 20` + 
##     `Student/Faculty Ratio` + ADM_RATE + SATMT75 + C150_4 + UNEMP_RATE + 
##     `% of Classes Under 20`:SATMT75 + `% of Classes Under 20`:C150_4 + 
##     `Student/Faculty Ratio`:ADM_RATE + `Student/Faculty Ratio`:UNEMP_RATE + 
##     ADM_RATE:SATMT75 + SATMT75:C150_4 + SATMT75:UNEMP_RATE, data = alumni_reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7257  -2.7884   0.2828   3.1750  16.3737 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         1.212e+04  3.137e+03   3.865 0.000476 ***
## `% of Classes Under 20`            -1.397e+00  9.052e+00  -0.154 0.878295    
## `Student/Faculty Ratio`            -1.121e+01  4.839e+00  -2.316 0.026726 *  
## ADM_RATE                           -2.512e+03  1.083e+03  -2.319 0.026514 *  
## SATMT75                            -1.681e+01  4.444e+00  -3.781 0.000603 ***
## C150_4                             -9.308e+03  2.743e+03  -3.393 0.001769 ** 
## UNEMP_RATE                         -6.596e+02  2.236e+02  -2.950 0.005719 ** 
## `% of Classes Under 20`:SATMT75     2.246e-02  1.068e-02   2.103 0.042923 *  
## `% of Classes Under 20`:C150_4     -1.718e+01  7.885e+00  -2.179 0.036360 *  
## `Student/Faculty Ratio`:ADM_RATE    7.058e+00  4.044e+00   1.745 0.089982 .  
## `Student/Faculty Ratio`:UNEMP_RATE  2.711e+00  1.421e+00   1.908 0.064840 .  
## ADM_RATE:SATMT75                    3.155e+00  1.382e+00   2.282 0.028848 *  
## SATMT75:C150_4                      1.351e+01  3.981e+00   3.393 0.001771 ** 
## SATMT75:UNEMP_RATE                  8.006e-01  2.747e-01   2.915 0.006260 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.785 on 34 degrees of freedom
## Multiple R-squared:  0.8157, Adjusted R-squared:  0.7452 
## F-statistic: 11.57 on 13 and 34 DF,  p-value: 6.006e-09

Best Model Selection:

SS1 is preferred over SS2 because of the following reasons:

SS1 (MLR)	SS2 (Interaction Predictors)
Lower BIC value	Higher BIC value
Easy to interpret	Difficult to interpret
Less Likely to overfit	More likely to overfit