Guided Exercise 4 - The College Admissions Yield Problem

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(tidymodels)

## Registered S3 method overwritten by 'tune':
##   method                   from   
##   required_pkgs.model_spec parsnip

## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.3 ──

## ✓ broom        0.7.9      ✓ rsample      0.1.0 
## ✓ dials        0.0.10     ✓ tune         0.1.6 
## ✓ infer        1.0.0      ✓ workflows    0.2.3 
## ✓ modeldata    0.1.1      ✓ workflowsets 0.1.0 
## ✓ parsnip      0.1.7      ✓ yardstick    0.0.8 
## ✓ recipes      0.1.16

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.

library(knitr)
library(modelr)

## 
## Attaching package: 'modelr'

## The following objects are masked from 'package:yardstick':
## 
##     mae, mape, rmse

## The following object is masked from 'package:broom':
## 
##     bootstrap

Load in our admissions data from the college:

ad<-read_rds("admit_data.Rds")%>%
  ungroup()%>%
  mutate(yield_f=as_factor(ifelse(yield==1,"Yes","No")))

Our team has been hired as the data science team for a liberal arts college. The president of the college and the board of trustees have laid out two main strategic goals for us to accomplish. They want the average SAT score to be increased to 1300 as well as admit at least 200 more students with incomes less than $50,000. However, there are certain things that must not be changed. Tuition revenues are not allowed to drop below $30 million and the size of the entering class should not change from about 1,500 students, give or take 50 students.

To better understand the profile of the admitted students, we first decided to use conditional means to describe the probability of enrolling once admitted, our yield, for different groups of students.

We first looked at quintiles of SAT.

ad%>%
  mutate(sat_quintile=ntile(sat,n=5))%>%
  group_by(sat_quintile)%>%
  summarize(min_sat=min(sat),
  pr_attend=mean(yield))

## # A tibble: 5 × 3
##   sat_quintile min_sat pr_attend
##          <int>   <dbl>     <dbl>
## 1            1   1000      0.440
## 2            2   1114.     0.533
## 3            3   1173.     0.691
## 4            4   1227.     0.828
## 5            5   1285.     0.919

It appears that as SAT score increases, the probability of attendance also increases.

Next we examined deciles of income.

ad%>%
  mutate(income_decile=ntile(income,n=10))%>%
  group_by(income_decile)%>%
  summarize(min_income=min(income),
  pr_attend=mean(yield))

## # A tibble: 10 × 3
##    income_decile min_income pr_attend
##            <int>      <dbl>     <dbl>
##  1             1      8844.     0.205
##  2             2     41483.     0.265
##  3             3     55177.     0.447
##  4             4     68818.     0.591
##  5             5     82099.     0.651
##  6             6     99895.     0.809
##  7             7    119776.     0.884
##  8             8    143791.     0.967
##  9             9    178036.     1    
## 10            10    241071.     1

It appears that as student family income increases with each decile, the probability of attendance increases, reaching over 80% when a family’s income is a minimum of essentially 100k.

Finally, we looked at deciles of net_price.

ad%>%
  mutate(net_price_decile=ntile(net_price,n=10))%>%
  group_by(net_price_decile)%>%
  summarize(min_net_price=min(net_price),
  pr_attend=mean(yield))

## # A tibble: 10 × 3
##    net_price_decile min_net_price pr_attend
##               <int>         <dbl>     <dbl>
##  1                1            0      0.512
##  2                2            0      0.786
##  3                3         9464.     0.884
##  4                4        11689.     0.921
##  5                5        13053.     0.926
##  6                6        14109.     0.414
##  7                7        22505.     0.279
##  8                8        31248.     0.419
##  9                9        43766.     0.819
## 10               10        45000      0.860

The relationship between net price and the probability of attendance is not as straightforward as the prior two relationships. Probability of attendance increases with net price up until around the 5th decile when net price hits $14000. Here, the probability drops sharply before rising back up to 86% for those paying full price. The sharp change in the trend is likely due to the nature of how merit aid is awarded. The minimum merit aid is $30500 (shown below), so anyone paying over $14500 is receiving strictly need-based aid. We know that those students come from poorer families, which correlates with a lower probability of attendance, as shown above in the conditional means on income.

ad %>%
  filter(merit_aid != 0) %>%
  summarize(min_merit_aid = min(merit_aid))

##   min_merit_aid
## 1      30507.21

In order to try to predict the yield for admitted students, we decided to create a model. When setting up the dependent variable, we need to specify the reference category. This is very important and affects how we interpret the results of our model once we have them. We also made adjustments in scale such as measuring SAT in 100s and income and distance in 1000s. This will make it easier to understand our estimates.

ad_sub <- ad %>%
  mutate(yield_f = as_factor(ifelse(yield == 1, "Yes", "No"))) %>%
  mutate(yield_f = relevel(yield_f, ref = "No")) %>% ## This is what I forgot last time
  mutate(sat=sat/100,
         income=income/1000,
         distance=distance/1000)%>%
  select(ID,
         yield_f,
         legacy,
         visit,
         registered,
         sent_scores,
         sat,
         income,
         gpa,
         distance,
         net_price)

We chose to run a LASSO mode, this time using the logit link function. To do this, we needed to use the logistic_reg specification for a glmnet model. We specified a logistic regression for classification as opposed to a linear model.

A benefit of the lasso model is that it down weights variables mostly by dropping variables that are highly correlated with one another, leaving only one of the correlated variables as contributors to the model.

Instead of choosing a penalty ourselves, we decided to use hyper parametric tuning to allow the penalty to vary across a sensible range of possibilities.

To set up the model for tuning, we set penalty=tune().

mixture_spec<-1

lasso_logit_mod<- 
  logistic_reg(penalty=tune(),
             mixture=mixture_spec) %>% 
  set_engine("glmnet")%>%
  set_mode("classification")

Using the grid_regular command gives us a reasonable set of possible penalties to work with.

lasso_grid<-grid_regular(parameters(lasso_logit_mod) ,levels=10)

We set up a formula that includes all of the variables in the data set. As we are using the lasso model, we go ahead and normalize all of the predictors, turning them into Z scores.

admit_formula <- as.formula(
                        "yield_f~.") #typical for feature selection - '.' gives all independent vars

admit_recipe<-recipe(admit_formula,ad_sub)%>%
  update_role(ID, new_role = "id variable")%>%
  step_log(distance)%>%
  step_normalize(all_predictors())%>%  #shortcut for feature selection: 'all_predictors()'
  step_naomit(all_predictors())

Having set up the model and the recipe, we can create a workflow.

ad_wf<-workflow()%>%
  add_model(lasso_logit_mod)%>%
  add_recipe(admit_recipe)

To properly evaluate model fit, we need to resample. We will use a Monte Carlo resample. We are using 25 resamples for this example, but in practice we would want to do more. 1000 resamples would be typical in many applications.

ad_rs<-mc_cv(ad_sub,times=25) # DO MORE TIMES IN PRACTICE

This code fits the model to the re sampled data set, generating parameter estimates from the training split and estimates of predictive accuracy from the testing split.

ad_tune_fit <- ad_wf %>%
    tune_grid(ad_rs,grid=lasso_grid, metrics = metric_set(roc_auc, sens, spec,accuracy))

ad_tune_fit%>%
  collect_metrics()%>%
  filter(.metric=="roc_auc")%>%
  arrange(-mean)

## # A tibble: 10 × 7
##          penalty .metric .estimator  mean     n std_err .config              
##            <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1 0.000464      roc_auc binary     0.913    25 0.00260 Preprocessor1_Model07
##  2 0.0000000001  roc_auc binary     0.913    25 0.00260 Preprocessor1_Model01
##  3 0.00000000129 roc_auc binary     0.913    25 0.00260 Preprocessor1_Model02
##  4 0.0000000167  roc_auc binary     0.913    25 0.00260 Preprocessor1_Model03
##  5 0.000000215   roc_auc binary     0.913    25 0.00260 Preprocessor1_Model04
##  6 0.00000278    roc_auc binary     0.913    25 0.00260 Preprocessor1_Model05
##  7 0.0000359     roc_auc binary     0.913    25 0.00260 Preprocessor1_Model06
##  8 0.00599       roc_auc binary     0.912    25 0.00270 Preprocessor1_Model08
##  9 0.0774        roc_auc binary     0.895    25 0.00310 Preprocessor1_Model09
## 10 1             roc_auc binary     0.5      25 0       Preprocessor1_Model10

Having fit the model to re samples and identified the best model fit through cross validation, we can select that model and apply it to the full dataset.

ad_final <-
  finalize_workflow(ad_wf,
                    select_best(ad_tune_fit,
                                metric = "roc_auc")) %>%
  fit(ad_sub)

With the model fit in hand, we can check on our parameter estimates.

ad_final%>%
  extract_fit_parsnip()%>%
  tidy()%>%
  mutate(penalty=prettyNum(penalty,digits=4))%>%
  kable()

term	estimate	penalty
(Intercept)	2.8709837	0.0004642
legacy	0.2279305	0.0004642
visit	0.1392716	0.0004642
registered	0.2243953	0.0004642
sent_scores	0.2953763	0.0004642
sat	0.0000000	0.0004642
income	5.0013621	0.0004642
gpa	0.4059959	0.0004642
distance	-0.3990855	0.0004642
net_price	-0.8248575	0.0004642

In terms of the predictors, income is clearly the strongest relative to other predictors. All other predictors besides distance and net price have a positive correlation with yield.

The SAT was removed as a predictor because it is highly correlated with other independent variables (we know SAT is directly related to merit aid, so its removal is understandable).

We can look the model’s metrics.

ad_tune_fit%>%
  collect_metrics()%>%
  filter(penalty == 1.000000e-10)

## # A tibble: 4 × 7
##        penalty .metric  .estimator  mean     n std_err .config              
##          <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
## 1 0.0000000001 accuracy binary     0.845    25 0.00290 Preprocessor1_Model01
## 2 0.0000000001 roc_auc  binary     0.913    25 0.00260 Preprocessor1_Model01
## 3 0.0000000001 sens     binary     0.768    25 0.00649 Preprocessor1_Model01
## 4 0.0000000001 spec     binary     0.882    25 0.00326 Preprocessor1_Model01

Overall, we are confident in the accuracy of our model. The accuracy is 0.85, which is strong, and its ROC-AUC of 0.92 is in the A range, suggesting a very good model fit. Our model correctly predicted 76 percent of all the enrolled students. Our model also correctly classified 88 percent of those who did not enroll.

To try to adjust the current pricing strategy, we decided on the strategy of changing price in the data set based on factors such as income and SAT. We then used our model to predict yield and calculate variables of interest (avg SAT > 13, 300 with income < 50, sum(net_price) > 3,000,000, students = 1500 +/- 50). Finally, we compared our strategy and compared it to these goals. We also discussed the limitations of the college’s admissions process when it comes to the feasibility of these goals.

First, we must understand the baseline situation at the college before enacting policy change.

Baseline from real data

ad_sub%>%
  filter(yield_f=="Yes")%>%
  summarize(mean(sat))

##   mean(sat)
## 1  12.25941

The average SAT is 1225 currently.

ad_sub%>%
  filter(yield_f=="Yes",income<50)%>%
  count()

##    n
## 1 77

77 students with incomes below $50,000 currently yield.

ad_sub%>%
  filter(yield_f=="Yes")%>%
  summarize(dollar(sum(net_price)))

##   dollar(sum(net_price))
## 1            $30,674,149

We currently have revenues of $30.6 million.

ad_sub%>%
  filter(yield_f=="Yes")%>%
  count()

##      n
## 1 1466

We have a class size of 1466 currently.

ad_sub%>%
  filter(income<50,sat>12)%>%
  group_by(yield_f)%>%
  count()

## # A tibble: 2 × 2
## # Groups:   yield_f [2]
##   yield_f     n
##   <fct>   <int>
## 1 No         67
## 2 Yes        40

There are also 67 students not enrolled whose SAT is greater than a 12000 and whose income is less than 50000. We can try to change policy to get this type of student to enroll.

Before using our model to predict the effect of policy change, we need to understand its baseline as well. Comparing this to the real baseline is also a way of confirming our model is accurate.

pred_baseline<-ad_final%>%
  predict(ad_sub)%>%
  bind_cols(ad_sub)

pred_baseline%>%
  filter(.pred_class=="Yes")%>%
  summarize(mean(sat))

## # A tibble: 1 × 1
##   `mean(sat)`
##         <dbl>
## 1        12.3

pred_baseline%>%
  filter(.pred_class=="Yes",income<50)%>%
  count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1    27

pred_baseline%>%
  filter(.pred_class=="Yes")%>%
  summarize(dollar(sum(net_price)))

## # A tibble: 1 × 1
##   `dollar(sum(net_price))`
##   <chr>                   
## 1 $30,161,380

pred_baseline%>%
  filter(.pred_class=="Yes")%>%
  count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1  1455

pred_baseline%>%
  filter(income<50,sat>12)%>%
  group_by(.pred_class)%>%
  count()

## # A tibble: 2 × 2
## # Groups:   .pred_class [2]
##   .pred_class     n
##   <fct>       <int>
## 1 No             81
## 2 Yes            26

From our current policy, our model predicts that we would have a class of 1455 with an average SAT of 1233, yield 27 low-income students, and garner $30 million in tuition revenue. This is relatively close to the real baseline. However, our model predicts we would yield much fewer low income students than we currently do (around 1/3 as many). We will keep this in mind when we use the results from our model to draw conclusions.

Now, we will use our model to see the effect of different changes in tuition policy on yield.

We want to attract students with incomes below $50,000 and students with high SAT scores - what about those in both categories? Unfortunately, such students would already be paying nothing in tuition between need aid and merit aid. We can’t do anything further from a tuition standpoint to incentivize them to enroll.

In order to target the low-income student population, we will offer heavy discounts on tuition. What is the effect of making tuition free for anyone with a family income below $50,000?

ad_lowincome<-ad_sub%>%
  mutate(net_price=ifelse(income<50,0,net_price))

pred_lowincome<-ad_final%>%
  predict(ad_lowincome)%>%
  bind_cols(ad_lowincome)

pred_lowincome%>%
  filter(.pred_class=="Yes",income<50)%>%
  count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1    60

Even with free tuition, we can only get the number of low-income students up to 60.

To increase our SAT average, we will offer more generous merit aid to those who have SATs above 1200, and charge full tuition to anyone with a score below 1200 who is not low-income.

ad_highsat<-ad_lowincome%>%
  mutate(net_price=ifelse(sat>12,net_price-100*sat,net_price))%>% # sliding scale
  mutate(net_price=ifelse(sat<12 & income>50,45000,net_price))%>%
  mutate(net_price=ifelse(net_price<0,0,net_price)) # correcting tuitions that go negative from the sliding scale

pred_highsat<-ad_final%>%
  predict(ad_highsat)%>%
  bind_cols(ad_highsat)

pred_highsat%>%
  filter(.pred_class=="Yes")%>%
  summarize(mean(sat))

## # A tibble: 1 × 1
##   `mean(sat)`
##         <dbl>
## 1        12.3

Our average SAT maxes out at around 1233.

pred_highsat%>%
  filter(.pred_class=="Yes")%>%
  summarize(dollar(sum(net_price)))

## # A tibble: 1 × 1
##   `dollar(sum(net_price))`
##   <chr>                   
## 1 $27,954,910

But after offering so many scholarships, our revenues are down to 2 million below our target. We can look to cut aid for other populations in the admitted student pool. The university’s merit aid policy is currently quite generous to those that can afford the full cost of tuition. Even those with high incomes can receive up to a full ride on the basis of SAT scores.

pred_baseline%>%
  filter(income>175, net_price<45000)%>%
  group_by(yield_f)%>%
  count()

## # A tibble: 1 × 2
## # Groups:   yield_f [1]
##   yield_f     n
##   <fct>   <int>
## 1 Yes       295

Indeed, 295 students with incomes of over $175,000 are not paying full price. How will super-rich applicants react to losing their merit aid?

ad_changemerit<-ad_highsat%>%
  mutate(net_price=ifelse(income>175,45000,net_price))

pred_changemerit<-ad_final%>%
  predict(ad_changemerit)%>%
  bind_cols(ad_changemerit)

pred_changemerit%>%
  filter(income>175)%>%
  group_by(.pred_class)%>%
  count()

## # A tibble: 1 × 2
## # Groups:   .pred_class [1]
##   .pred_class     n
##   <fct>       <int>
## 1 Yes           445

pred_changemerit%>%
  filter(.pred_class=="Yes")%>%
  summarize(dollar(sum(net_price)))

## # A tibble: 1 × 1
##   `dollar(sum(net_price))`
##   <chr>                   
## 1 $38,027,342

From our model, all 445 students with incomes over $175,000 will still enroll despite losing their merit aid. This is a huge untapped source of revenue for the university - it boosts overall revenue by $10 million.

pred_changemerit%>%
  filter(.pred_class=="Yes",income<50)%>%
  count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1    60

pred_changemerit%>%
  filter(.pred_class=="Yes")%>%
  summarize(dollar(sum(net_price)))

## # A tibble: 1 × 1
##   `dollar(sum(net_price))`
##   <chr>                   
## 1 $38,027,342

pred_changemerit%>%
  filter(.pred_class=="Yes")%>%
  count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1  1451

pred_changemerit%>%
  filter(.pred_class=="Yes")%>%
  summarize(mean(sat))

## # A tibble: 1 × 1
##   `mean(sat)`
##         <dbl>
## 1        12.3

In conclusion, we were able to meet the goal of revenues above $30 million by over $8 million and were able to increase the number of low income students from 27 to 60 (in our new model). We were able to only raise the average SAT to a 1230 while still encouraging low income students to attend. The biggest way to increase SAT we found was to deter those with low SAT from attending.

Our new model baseline most likely under counts low income students. When we change the policy to try to attract more low income students, the model might not pick up on all the low income students.

There are ethical implications to our work. The biggest one is the cutoffs of SAT above 1200 and family income below $50,000 being chosen arbitrarily which is most likely unethical. Also, catering a plan toward specifically high income and low income students, therefore ignoring the middle class can be seen as price discrimination.

Guided Exercise 4 - The College Admissions Yield Problem

David Atwood, Bobby Kent