The original data comprises of three predictors for response variable alumni giving rate. The report explores all combinations of simple and multiple linear regression. Given the original data set, the summary statistics and model selection parameters, the best fit is obtained by combining student-faculty ratio and Private predictor to build the multiple linear regression model.
As an extension to the original data, we scraped educational institution metrics from federal websites to incorporate predictors such as admission rate, graduation completion rate, etc. into the regression analysis. With inclusion of graduation completion rate and student faculty ratio, we obtain another alternate fit with high R-square and low AIC/BIC values.
Based on the validity of the underlying assumptions of new predictors, one of the two models can be finalized.
The data description section can be broken down into
The predictor variables in the given dataset have the following description- 1. School: Name of the school to which the alumni belong to 2. Student/Faculty Ratio: The number of students per teacher present in a class 3. % of Classes Under 20: The percentage of the classes having the average age under 20 4. Private: A binary column indicating if the school/university is public or private 5. Alumni Giving rate: The percentage of alumni who have been contributing to the School from which they have graduated from.
The new predictor variables introduced into the data set have the following description [as per source listed in reference] 1. ADM_Rate: Overall admission rate of the college institute for a 4-year program. 2. SATMT75: SAT Math test score of the 75th percentile student. 3. PFTFAC: Percentage of full-time faculty in the institute. 4. C150_4: The proportion of full-time, first-time, degree/certificate-seeking undergraduates who completed a degree or certificate at the institution within 150 percent of normal time, calculated from standardized metrics. 5. UNEMP_RATE: Unemployment rate via census data.
The rationale behind introducing these predictors was to adopt a holistic perspective towards alumni donations. With the inclusion of these predictors, the merit and earning potential of a candidate is also taken into consideration when estimating alumni donations. Assumption with the new predictors
The additional predictor metrics obtained from public sources represent the latest update made on 1/17/2021. The alumni data set has observations from the year 2000. While the observations may seem anachronous, we assume that the chosen metrics are time-invariant or change uniformly. For example, the latest ADM_rate values for California Institute of Technology and Carnegie Mellon University are 6% and 17% respectively. Given the competitiveness of all major universities in USA, it is highly likely that these values remain constant over time or have increased/reduced at the same average rate for all colleges. The same assumption is extended for other predictors such as C150_4, UNEMP_Rate, etc.
In case of missing values on some observations of the new predictors, imputations were performed on predictors by substituting missing data with mean value of the available observations.
library(readxl)
alumni_data <- read_xls("C:/Users/Paaras Saxena/Documents/UCINN/Academics/Flex2/BANA7052- LINEAR REGRESSION/Lecture5/alumni.xls")
str(alumni_data)
## tibble [48 x 10] (S3: tbl_df/tbl/data.frame)
## $ School : chr [1:48] "Boston College" "Brandeis University" "Brown University" "California Institute of Technology" ...
## $ % of Classes Under 20: num [1:48] 39 68 60 65 67 52 45 69 72 61 ...
## $ Student/Faculty Ratio: num [1:48] 13 8 8 3 10 8 12 7 13 10 ...
## $ Alumni Giving Rate : num [1:48] 25 33 40 46 28 31 27 31 35 53 ...
## $ Private : num [1:48] 1 1 1 1 1 1 1 1 1 1 ...
## $ ADM_RATE : num [1:48] 0.2789 0.3115 0.0767 0.0662 0.1712 ...
## $ SATMT75 : num [1:48] 770 780 790 800 800 765 765 765 790 790 ...
## $ PFTFAC : num [1:48] 0.648 0.68 0.896 0.919 0.952 ...
## $ C150_4 : num [1:48] 0.919 0.88 0.95 0.92 0.888 ...
## $ UNEMP_RATE : num [1:48] 2.88 3 3.28 3.14 2.98 ...
alumni_reg <- alumni_data[, -c(1,5)]
alumni_reg1 <- alumni_data[, -c(1,5)]
names(alumni_reg)
## [1] "% of Classes Under 20" "Student/Faculty Ratio" "Alumni Giving Rate"
## [4] "ADM_RATE" "SATMT75" "PFTFAC"
## [7] "C150_4" "UNEMP_RATE"
str(alumni_reg)
## tibble [48 x 8] (S3: tbl_df/tbl/data.frame)
## $ % of Classes Under 20: num [1:48] 39 68 60 65 67 52 45 69 72 61 ...
## $ Student/Faculty Ratio: num [1:48] 13 8 8 3 10 8 12 7 13 10 ...
## $ Alumni Giving Rate : num [1:48] 25 33 40 46 28 31 27 31 35 53 ...
## $ ADM_RATE : num [1:48] 0.2789 0.3115 0.0767 0.0662 0.1712 ...
## $ SATMT75 : num [1:48] 770 780 790 800 800 765 765 765 790 790 ...
## $ PFTFAC : num [1:48] 0.648 0.68 0.896 0.919 0.952 ...
## $ C150_4 : num [1:48] 0.919 0.88 0.95 0.92 0.888 ...
## $ UNEMP_RATE : num [1:48] 2.88 3 3.28 3.14 2.98 ...
head(tibble::as_tibble(alumni_data), n = 5)
## # A tibble: 5 x 10
## School `% of Classes U~ `Student/Facult~ `Alumni Giving ~ Private ADM_RATE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Bosto~ 39 13 25 1 0.279
## 2 Brand~ 68 8 33 1 0.312
## 3 Brown~ 60 8 40 1 0.0767
## 4 Calif~ 65 3 46 1 0.0662
## 5 Carne~ 67 10 28 1 0.171
## # ... with 4 more variables: SATMT75 <dbl>, PFTFAC <dbl>, C150_4 <dbl>,
## # UNEMP_RATE <dbl>
GGally::ggpairs(alumni_reg, mapping = aes(alpha = 0.5)) + theme_light()
Here is the correlation plot among all the numerical variables of the dataset including the new predictors. Additionally, Alumni Giving Rate has a strong correlation with a majority of the predictors. Same with C150_4 and Student/Faculty Ratio.
Upon adding new columns, Subset varaiable selection method was used to find out better predictors for the new data. SVS is a wrapper method where the predictive power of the variables are evaluated jointly.
Initially, no interaction parameters were considered.
regsubsets function was used to out the best fit MLR model using the least BIC. MLR with Student/Faculty Ratio + C150_4 came out to be the best.
library(ISLR)
fit_u = regsubsets(`Alumni Giving Rate` ~ ., alumni_reg1, nvmax = 7)
summary(fit_u)
## Subset selection object
## Call: regsubsets.formula(`Alumni Giving Rate` ~ ., alumni_reg1, nvmax = 7)
## 7 Variables (and intercept)
## Forced in Forced out
## `% of Classes Under 20` FALSE FALSE
## `Student/Faculty Ratio` FALSE FALSE
## ADM_RATE FALSE FALSE
## SATMT75 FALSE FALSE
## PFTFAC FALSE FALSE
## C150_4 FALSE FALSE
## UNEMP_RATE FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
## `% of Classes Under 20` `Student/Faculty Ratio` ADM_RATE SATMT75
## 1 ( 1 ) " " "*" " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) "*" "*" " " " "
## 4 ( 1 ) "*" "*" " " " "
## 5 ( 1 ) "*" "*" " " " "
## 6 ( 1 ) "*" "*" "*" " "
## 7 ( 1 ) "*" "*" "*" "*"
## PFTFAC C150_4 UNEMP_RATE
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) " " "*" "*"
## 5 ( 1 ) "*" "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
plot(fit_u, scale = "bic")
Then, Interaction parameters were considered.
The maximum possible number of predictors on allowing 2 way interaction parameters were “268,435,456”. To avoid such heavy computation, Forward selection, Backward elimination and Step-wise selection were used.
numSubsets <- function(x, max.int = 1) {
if (max.int > x) {
stop("`max.int` cannot be larger than ",
x,".", call = FALSE)
}
x <- as.integer(x)
max.int <- as.integer(max.int)
res <- 0
for (i in seq_len(max.int)) {
res <- res + choose( n = x, k = i)
}
2 ^ res
}
#interactions?
x <- c(numSubsets(7, max.int= 1),
numSubsets(7, max.int= 2),
numSubsets(7, max.int= 3)
)
scales::comma(x)
## [1] "128" "268,435,456"
## [3] "9,223,372,036,854,775,808"
Here, we start with fitting all the possible 2-way interaction predictors, followed by removing the least significant parameters (ie p-value > 0.05) at each iteration to provide a model with better fit. However, few predictors can have p-values more than 0.05 and still may seem significant as its insignificance may be nullified by others strongly significant parameters.
fit_max_1 <- lm(`Alumni Giving Rate` ~ ., data = alumni_reg)
be_1 <- step(fit_max_1, direction = "backward", trace = 0, k = log(nrow(alumni_reg)))
fit_max_2 <- lm(`Alumni Giving Rate` ~ .^2, data = alumni_reg)
be_2 <- step(fit_max_2, direction = "backward", trace = 0, k = log(nrow(alumni_reg)))
Here, we start with a null model with zero predictors, followed by adding significant parameters one by one at each iteration until an addition of a new variable does not improve the fit.
Here, the scope was limited to the initial parameters of backward elimination to avoid any computational load.
fit_min <- lm(`Alumni Giving Rate` ~ 1, data = alumni_reg)
fs_1 <- step(fit_min, direction = "forward",
scope = list(lower = fit_min,
upper = fit_max_1),
trace = 0, k = log(nrow(alumni_reg)))
fs_2 <- step(fit_min, direction = "forward",
scope = list(lower = fit_min,
upper = fit_max_2),
trace = 0, k = log(nrow(alumni_reg)))
This method combines both Forward selection and Backward elimination. Its scope was limited to initial parameters of Forward selection (lower value) to Backward elimination (higher value)
## Step_wise selection
ss_1 <- step(be_1, direction = "both",
scope = list(lower = fit_min,
upper = fit_max_1),
trace = 0, k = log(nrow(alumni_reg)))
ss_2 <- step(be_2, direction = "both",
scope = list(lower = fit_min,
upper = fit_max_2),
trace = 0, k = log(nrow(alumni_reg)))
From all these models, BIC values were calculated by providing k = log(nrow). PRESS statistics is also taken into account as a cross validation for the better fit.
PRESS <- function(object, ...){
if(!missing(...)) {
res <- sapply(list(object, ...), FUN = function(x) {
sum(rstandard(x, type = "predictive") ^ 2)
})
names(res) <- as.character(match.call()[-1L])
res
} else {
sum(rstandard(object, type = "predictive") ^ 2)
}
}
Additionally, a model metric was created too to find out values of various other parameters like AIC, RMSE, adjR2 and nterms (number of predictors) for all the above models.
modelMetrics <- function(object, ...) {
if(!missing(...)) {
res <- sapply(list(object, ...), FUN = function(x) {
c("AIC" = AIC(x), "BIC" = BIC(x),
"adjR2" = summary(x)$adj.r.squared,
"RMSE" = sigma(x), "PRESS" = PRESS(x),
"nterms" = length(coef(x)))
})
colnames(res) <- as.character(match.call()[-1L])
res
} else {
c("AIC" = AIC(object), "BIC" = BIC(object),
"adjR2" = summary(object)$adj.r.squared,
"RMSE" = sigma(object), "PRESS" = PRESS(object),
"nterms" = length(coef(object)))
}
}
res <- modelMetrics(be_1, be_2, fs_1, fs_2, ss_1, ss_2)
round(res, digits = 3)
## be_1 be_2 fs_1 fs_2 ss_1 ss_2
## AIC 338.003 333.482 338.003 338.003 338.003 333.482
## BIC 345.488 361.550 345.488 345.488 345.488 361.550
## adjR2 0.665 0.745 0.665 0.665 0.665 0.745
## RMSE 7.775 6.785 7.775 7.775 7.775 6.785
## PRESS 3109.729 3662.297 3109.729 3109.729 3109.729 3662.297
## nterms 3.000 14.000 3.000 3.000 3.000 14.000
summary(ss_1)
##
## Call:
## lm(formula = `Alumni Giving Rate` ~ `Student/Faculty Ratio` +
## C150_4, data = alumni_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.614 -4.254 -1.258 3.472 20.509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -100.9704 36.3560 -2.777 0.007964 **
## `Student/Faculty Ratio` -1.4322 0.2762 -5.186 4.95e-06 ***
## C150_4 160.4298 37.7552 4.249 0.000106 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.775 on 45 degrees of freedom
## Multiple R-squared: 0.6797, Adjusted R-squared: 0.6654
## F-statistic: 47.74 on 2 and 45 DF, p-value: 7.51e-12
summary(ss_2)
##
## Call:
## lm(formula = `Alumni Giving Rate` ~ `% of Classes Under 20` +
## `Student/Faculty Ratio` + ADM_RATE + SATMT75 + C150_4 + UNEMP_RATE +
## `% of Classes Under 20`:SATMT75 + `% of Classes Under 20`:C150_4 +
## `Student/Faculty Ratio`:ADM_RATE + `Student/Faculty Ratio`:UNEMP_RATE +
## ADM_RATE:SATMT75 + SATMT75:C150_4 + SATMT75:UNEMP_RATE, data = alumni_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7257 -2.7884 0.2828 3.1750 16.3737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.212e+04 3.137e+03 3.865 0.000476 ***
## `% of Classes Under 20` -1.397e+00 9.052e+00 -0.154 0.878295
## `Student/Faculty Ratio` -1.121e+01 4.839e+00 -2.316 0.026726 *
## ADM_RATE -2.512e+03 1.083e+03 -2.319 0.026514 *
## SATMT75 -1.681e+01 4.444e+00 -3.781 0.000603 ***
## C150_4 -9.308e+03 2.743e+03 -3.393 0.001769 **
## UNEMP_RATE -6.596e+02 2.236e+02 -2.950 0.005719 **
## `% of Classes Under 20`:SATMT75 2.246e-02 1.068e-02 2.103 0.042923 *
## `% of Classes Under 20`:C150_4 -1.718e+01 7.885e+00 -2.179 0.036360 *
## `Student/Faculty Ratio`:ADM_RATE 7.058e+00 4.044e+00 1.745 0.089982 .
## `Student/Faculty Ratio`:UNEMP_RATE 2.711e+00 1.421e+00 1.908 0.064840 .
## ADM_RATE:SATMT75 3.155e+00 1.382e+00 2.282 0.028848 *
## SATMT75:C150_4 1.351e+01 3.981e+00 3.393 0.001771 **
## SATMT75:UNEMP_RATE 8.006e-01 2.747e-01 2.915 0.006260 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.785 on 34 degrees of freedom
## Multiple R-squared: 0.8157, Adjusted R-squared: 0.7452
## F-statistic: 11.57 on 13 and 34 DF, p-value: 6.006e-09
SS1 is preferred over SS2 because of the following reasons:
| SS1 (MLR) | SS2 (Interaction Predictors) |
|---|---|
| Lower BIC value | Higher BIC value |
| Easy to interpret | Difficult to interpret |
| Less Likely to overfit | More likely to overfit |