In the absence of data, it makes the most sense to assume, as Wendy Agrusa presented to the SGA, that students who are closer to graduation are less likely to benefit from switching to the new Gen Ed and to new major programs. Wendy and Valentina also mentioned that students’ majors might matter as well.

Based on their perfectly reasonable suppositions, one might think that, for example, students from CHSS might have left the most Gen Ed classes for their senior year and might tend to benefit more, and students from CNHS might tend to benefit less. As they said, we cannot know for everyone: there may be a student who has 110 credits but who still has more than 2 semesters of classes. But it seems likely that most students with 110 credits will not benefit from switching.

This simulation was modified considerably on 1 March 2015 when I learned that 55% of our students have more than 90 accumulated credits. I also added a section trying to explain why, even if we cannot say perfectly how many credits a student needs to graduate as a function of accumulated credits, accumulated credits are still useful for predicting credits needed to graduate. If the assumptions I got from Valentina and Wendy are correct, this only strengthens the case against setting it up as an opt-out decision, because their assumption (which I think makes sense) is that students who are farther along will be less likely to benefit from switching to the 201580 catalog term.

Set Up Simulation

In this simulation, I set up a structure like the one that Wendy and Valentina presented to the CHSS College Council, to estimate the number of errors generated by a 100% opt-out strategy like the one we will be adopting vs. an opt-out/opt-in policy based on credits and college.

Data

  1. The number of undergraduate students in each college is taken from the most recent Comparative Enrollment Report.
  2. 55% of HPU undergraduates have more than 90 accumulated credits. I wonder if that varies by college. If it does, that would be an important piece of information.

Key Assumptions

  1. Without looking at each student individually, available predictors of which catalog term would lead students to completing their degree most rapidly are probabilistic.
  2. The closer a student is to graduation, the less likely it is that switching to the 201580 catalog term will be beneficial.
  3. Different colleges may have encouraged students to complete their Gen Ed requirements more quickly, and this may affect the rate at which the benefit of switching to the 201580 catalog term declines.

Auxiliary Assumptions

  1. The probability that there could be a benefit from switching to the 201580 catalog term never drops to 0.
  2. The probability that there could be a benefit from switching to the 201580 catalog term declines non-linearly from 1 to some asymptotic floor.
## g = steepness of the decline, b = midpoint of the decline (in credits), m = asymptotic floor
switch.good <- function(credits, g, b, m) {return(((1-m) / (1 + exp(g*(credits-b)))) + m)}

In the simulation, I use the logistic function to model the probability that switching to the new Gen Ed and major program will be better for a student. Note that the probability here never drops to 0. This indicates that there is always a chance that switching to the new Gen Ed and major program will help, but that the chance does, as Wendy and Valentina suggested, decrease as students get closer to graduation.
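
Written as a formula, with \(c\) for accumulated credits and \(p(c)\) as shorthand (not a name used in the code) for the probability that switching helps, switch.good computes:

\[ p(c) = m + \frac{1 - m}{1 + e^{g (c - b)}} \]

For \(g > 0\), \(p(c)\) is close to 1 for students whose credits are far below \(b\), equals \((1 + m)/2\) at \(c = b\), and approaches the floor \(m\) as credits grow, so it never reaches 0 as long as \(m > 0\).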

Based on the Spring 2015 Comparative Enrollment Report, there are the following numbers of undergraduate students in each college. This, I believe, includes students who are currently in OCP, so I do not count them separately (although based on something Wendy said about military credits, they might tend to be in a different situation than non-OCP students).

N  <- c(cba = 913, chss = 840, cncs = 809, cnhs = 797)

Distribution of Credits

I was given to understand that 55% of our students have more than 90 accumulated credits. I do not know what the distribution of credits looks like, but I use a truncated negatively skewed ex-Gaussian distribution with \(\mu = 120\), \(\sigma = 35\), and \(\lambda = .04\). I truncate the low end strictly at 0, and at the high end I do a “noisy” truncation at 140.

I am not sure if the distribution is correct, but I do get close to the one data point I was given. In this simulated distribution, 55% of students have more than 90 credits.
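
The code that generated the credits is not shown here, so the following is only a sketch of one way to draw from a distribution of this kind. It assumes that "negatively skewed ex-Gaussian" means a normal draw minus an exponential draw, that \(\lambda\) is the exponential rate, and that the "noisy" truncation gives each draw its own upper cap scattered around 140. The function name r.credits, the cap.sd argument, the rounding, and the seed are all hypothetical choices made for this illustration.

```r
## Hypothetical sketch: NOT the code used for the simulation in this document
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

r.credits <- function(n, mu = 120, sigma = 35, lambda = .04,
                      cap = 140, cap.sd = 5) {
  out <- numeric(0)
  while (length(out) < n) {
    ## Assumed form of a negatively skewed ex-Gaussian: normal minus exponential
    x <- rnorm(n, mu, sigma) - rexp(n, rate = lambda)
    ## Strict lower bound at 0; "noisy" upper bound: each draw gets its own cap
    keep <- x >= 0 & x <= rnorm(n, cap, cap.sd)
    out <- c(out, x[keep])  # rejection sampling: keep only draws inside the bounds
  }
  round(out[seq_len(n)])
}

credits <- r.credits(3359)   # 3359 = total undergraduates across the four colleges
mean(credits > 90)           # share of simulated students with more than 90 credits
```

Whether this particular construction lands near the 55% figure depends on the assumed details, which is why the last line reports the share rather than asserting it.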

Credits vs Credits to Graduate

One concern that I have heard voiced by Valentina and now by the provost is that we cannot deterministically tell how many credits someone needs to graduate based on the number of credits they have. I can see how that is a problem, but I cannot see how that prevents us from using the number of credits in a student’s transcript to predict credits to graduate. This uncertainty is built into the logistic model above indirectly. For example, a simulated CHSS student with 120 credits still has about a 25% chance that switching to the new Gen Ed and program will help.

Even if the relationship between credits and credits to graduate looks something like the figure below, it would still have predictive validity. In other words, if you tell me how many credits a student has, I will be able to make a more precise prediction about how many credits he or she needs to graduate than if I do not have that information.

The key idea in this mini simulation (which does not directly influence any of the other simulations) is that, yes, we cannot perfectly predict credits to graduate from accumulated credits, but we can make an imperfect, noisy prediction that is better than ignoring earned credits altogether. This can be shown by noting that in this simulation, using class level (Freshman, Sophomore, etc.) to predict credits remaining accounts for 83% of the variance. It can also be shown by noting that the correlation between accumulated credits and credits needed to graduate is \(r = -0.92\).

cred.rem.lm  <- lm(cred.rem ~ class.level, students)
summary(cred.rem.lm)
## 
## Call:
## lm(formula = cred.rem ~ class.level, data = students)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.553  -8.135  -1.553   6.447  62.447 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          108.7603     0.7540  144.24   <2e-16 ***
## class.levelSophomore -32.2626     0.9395  -34.34   <2e-16 ***
## class.levelJunior    -59.6253     0.8553  -69.71   <2e-16 ***
## class.levelSenior    -85.2072     0.8022 -106.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.73 on 3355 degrees of freedom
## Multiple R-squared:  0.8306, Adjusted R-squared:  0.8305 
## F-statistic:  5484 on 3 and 3355 DF,  p-value: < 2.2e-16
describeBy(cred.rem, students$class.level)
## group: Freshman
##   vars   n   mean sd median trimmed   mad min max range  skew kurtosis
## 1    1 242 108.76 10    110  109.35 14.83  90 120    30 -0.21    -1.45
##     se
## 1 0.64
## -------------------------------------------------------- 
## group: Sophomore
##   vars   n mean   sd median trimmed   mad min max range skew kurtosis   se
## 1    1 438 76.5 9.97     76   76.17 11.86  60 103    43 0.27    -0.69 0.48
## -------------------------------------------------------- 
## group: Junior
##   vars   n  mean    sd median trimmed   mad min max range skew kurtosis
## 1    1 844 49.14 12.46     48   48.35 13.34  30  93    63 0.57     0.02
##     se
## 1 0.43
## -------------------------------------------------------- 
## group: Senior
##   vars    n  mean    sd median trimmed mad min max range skew kurtosis
## 1    1 1835 23.55 11.98     22   22.34 8.9   0  86    86  1.2     2.31
##     se
## 1 0.28

The summary statistics by class level above and this boxplot below show that the regression above can still make good use of class level to predict number of credits remaining even if, for example, among “seniors,” many still need 50 credits to graduate.

The point I have tried to make in several places in this section is that the inability to predict perfectly is not the same as the complete inability to predict. It seems as if decision makers are adopting a strategy that says, “if we do not have perfect information about every student, we should act as if we have no information.” This may well be justified, but I have not seen a good case for it.
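
One way to put a number on "imperfect but useful" is to compare the typical prediction error with and without class level, using only the regression summary printed above. The sketch below backs out the overall standard deviation of credits remaining from the reported residual standard error and R-squared. The variable names are introduced here for illustration.

```r
## Values copied from the summary(cred.rem.lm) output above
rse    <- 11.73    # residual standard error (knowing class level)
r2     <- 0.8306   # multiple R-squared
df.res <- 3355     # residual degrees of freedom
n      <- 3359     # number of simulated students (df.res + 4 coefficients)

ss.res   <- rse^2 * df.res        # residual sum of squares
ss.tot   <- ss.res / (1 - r2)     # total sum of squares, since R^2 = 1 - SSres/SStot
sd.total <- sqrt(ss.tot / (n - 1))  # SD of credits remaining, ignoring class level

round(c(without.class.level = sd.total, with.class.level = rse), 1)
```

In this simulated data, knowing nothing about a student leaves a typical error of about 28 credits, while knowing only class level brings it down to about 12.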

Set Up Simulation

students$prob.good[students$college == "CBA"] <-
  switch.good(students$credits[students$college == "CBA"], .11, 80, .10)
students$prob.good[students$college == "CHSS"] <-
  switch.good(students$credits[students$college == "CHSS"], .07, 90, .15)
students$prob.good[students$college == "CNCS"] <-
  switch.good(students$credits[students$college == "CNCS"], .095, 65, .125)
students$prob.good[students$college == "CNHS"] <-
  switch.good(students$credits[students$college == "CNHS"], .12, 70, .075)

## describeBy(students$prob.good, students$college)
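
To see what these parameter choices imply, the sketch below evaluates switch.good for each college at a few credit totals. The function and the parameter values are copied from above. The params table and the particular credit totals (30, 60, 90, 120) are illustrative choices, not part of the original simulation.

```r
## Same function as defined earlier in this document
switch.good <- function(credits, g, b, m) (1 - m) / (1 + exp(g * (credits - b))) + m

## Per-college parameters copied from the assignments above
params <- data.frame(
  college = c("CBA", "CHSS", "CNCS", "CNHS"),
  g = c(.11, .07, .095, .12),   # steepness of the decline
  b = c(80, 90, 65, 70),        # midpoint of the decline, in credits
  m = c(.10, .15, .125, .075)   # floor the probability never drops below
)

credits <- c(30, 60, 90, 120)   # illustrative credit totals

## One row per college, one column per credit total
probs <- t(sapply(seq_len(nrow(params)), function(i)
  switch.good(credits, params$g[i], params$b[i], params$m[i])))
dimnames(probs) <- list(params$college, paste0(credits, "cr"))

round(probs, 2)
```

With these parameters, a CHSS student with 120 credits comes out at about 0.24, and a CHSS student with 90 credits sits exactly at the midpoint value (1 + m)/2 = 0.575.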

Run the Simulation

This is where the actual simulation happens. It takes each student’s probability that switching is a good idea and “flips” a weighted coin. If a student has a 68% chance that it is good for them to switch, the coin will come up heads 68% of the time and tails 32% of the time. This means that a senior who probably should not switch could still turn out to be someone for whom switching really would have been better after all.

students$is.good <- runif(nrow(students)) < students$prob.good
with(students, table(is.good))
## is.good
## FALSE  TRUE 
##  2009  1350
with(students, table(college, is.good))
##        is.good
## college FALSE TRUE
##    CBA    524  389
##    CHSS   384  456
##    CNCS   556  253
##    CNHS   545  252
with(students, table(class.level, is.good))
##            is.good
## class.level FALSE TRUE
##   Freshman      1  241
##   Sophomore    39  399
##   Junior      400  444
##   Senior     1569  266
with(students, ftable(college, class.level, is.good))
##                     is.good FALSE TRUE
## college class.level                   
## CBA     Freshman                0   66
##         Sophomore               7  125
##         Junior                 84  138
##         Senior                433   60
## CHSS    Freshman                1   65
##         Sophomore               8   91
##         Junior                 48  170
##         Senior                327  130
## CNCS    Freshman                0   56
##         Sophomore              13   81
##         Junior                139   74
##         Senior                404   42
## CNHS    Freshman                0   54
##         Sophomore              11  102
##         Junior                129   62
##         Senior                405   34
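
The weighted coin flip that creates is.good can be checked in isolation. This is a minimal sketch using the 68% example from the text. The seed and the number of flips are arbitrary choices, not values from the simulation.

```r
set.seed(42)          # arbitrary seed, only for reproducibility
p <- 0.68             # probability that switching is good for this student
n.flips <- 1e5        # many flips, so the observed share settles near p

## runif() draws uniformly on (0, 1), so the comparison is TRUE with probability p
flips <- runif(n.flips) < p
mean(flips)           # observed share of "heads"; should be close to 0.68
```

Any single student gets only one flip, which is why a senior with a low probability can still come up TRUE.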

Count the number of initial mis-classifications. The students are all still responsible for where they end up. This is just an initial guess based on the overall base-rate, class level, or class level plus college.

students$opt.out.err  <- !students$is.good     # not good to switch -> mis-classification
students$opt.in.err  <- students$is.good       # good to switch -> mis-classification
students$level.err[students$class.level == "Freshman"]  <- # Freshmen opt out
  !students$is.good[students$class.level == "Freshman"]
students$level.err[students$class.level == "Sophomore"]  <- # Sophomores opt out
  !students$is.good[students$class.level == "Sophomore"]
students$level.err[students$class.level == "Junior"]  <- # Juniors opt out
  !students$is.good[students$class.level == "Junior"]
students$level.err[students$class.level == "Senior"]  <- # Seniors opt in
  students$is.good[students$class.level == "Senior"]

## Paying attention to college changes the default only for CNCS and CNHS Juniors
students$level.college.err <- students$level.err
## Simulated CNCS & CNHS Juniors should actually opt in as well
jr.opt.in <- students$class.level == "Junior" &
  students$college %in% c("CNCS", "CNHS")
students$level.college.err[jr.opt.in] <- students$is.good[jr.opt.in]

How many initial simulated mis-classifications would there be if every student had to opt out, as under the policy we are adopting?

with(students, table(opt.out.err))
## opt.out.err
## FALSE  TRUE 
##  1350  2009

How many initial simulated mis-classifications would there be if every student had to opt in, which is the opposite of what we are doing?

with(students, table(opt.in.err))
## opt.in.err
## FALSE  TRUE 
##  2009  1350

Based on the simulated results above, if you are just paying attention to class rank, you should have Freshmen, Sophomores, and Juniors opt out and Seniors opt in. How many initial simulated mis-classifications would there be if you followed that strategy?

with(students, table(level.err))
## level.err
## FALSE  TRUE 
##  2653   706

Based on the simulated results above, if you are paying attention to class rank and college, you should have Freshmen and Sophomores opt out and Seniors opt in. For Juniors, you should have the CBA and CHSS Juniors opt out and the CNCS and CNHS Juniors opt in. How many initial simulated mis-classifications would there be if you followed that strategy?

with(students, table(level.college.err))
## level.college.err
## FALSE  TRUE 
##  2785   574
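
The four error counts above can be reproduced from the college-by-class-level table alone, which makes the decision rule explicit: within each group, pick the default that is wrong for fewer students. This is a sketch built from the counts printed earlier, not the code that produced them. The names tab, min.err, and errors are introduced here for illustration.

```r
## Counts copied from the ftable(college, class.level, is.good) output above
tab <- data.frame(
  college  = rep(c("CBA", "CHSS", "CNCS", "CNHS"), each = 4),
  level    = rep(c("Freshman", "Sophomore", "Junior", "Senior"), times = 4),
  not.good = c(0, 7, 84, 433,   1, 8, 48, 327,   0, 13, 139, 404,   0, 11, 129, 405),
  good     = c(66, 125, 138, 60,   65, 91, 170, 130,   56, 81, 74, 42,   54, 102, 62, 34)
)

## Errors when each group gets whichever default is wrong for fewer students:
## defaulting to the switch mis-classifies the not.good students,
## defaulting to no switch mis-classifies the good students.
min.err <- function(groups) {
  ng <- as.numeric(tapply(tab$not.good, groups, sum))
  g  <- as.numeric(tapply(tab$good, groups, sum))
  sum(pmin(ng, g))
}

errors <- c(
  all.opt.out      = sum(tab$not.good),   # everyone switched by default
  all.opt.in       = sum(tab$good),       # no one switched by default
  by.level         = min.err(tab$level),
  by.level.college = min.err(paste(tab$college, tab$level))
)

errors
round(100 * errors / sum(tab$not.good + tab$good))  # as a percent of all students
```

Grouping by class level gives 706 errors and grouping by class level and college gives 574, matching the tables above.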

Conclusion

I set up this simulation by translating some straightforward assumptions into math. The simulation assumes that as students accumulate credits, the likelihood that switching to the new Gen Ed and their new major programs will be beneficial decreases. I also assumed that the benefit would be probabilistic: given two students in the same college with the same number of credits, one might benefit from switching and one might not.

Given the way I instantiated these mild assumptions, the simulation suggests that we could go from 60% initial mis-classifications to 21% initial mis-classifications just by attending to class level, and to 17% by also attending to college.

Each initial mis-classification has some potential to take up a lot of staff and faculty time. If we can cut initial mis-classifications by roughly two-thirds (from 2009 to 706 in this run), that should cut the amount of time staff and faculty need to devote to fixing initial mis-classifications that turn into problematic mis-classifications. It should also mean a big increase in happy students.