By: Jeffrey Ellington
9:00am Mon-Wed Section
4/19/17

1. Introduction: Assessing College Admissions Data

High school guidance counselors use college admissions data to properly advise high school seniors on where to apply to college. Studying the generally available summary-statistic level admissions data is one source of valuable insight into college selection criteria and the admissions process. However, summary-statistic level data is a crude tool for providing any specific student with insight into his or her individual chance of being admitted to a particular university. Using individual level admissions data is a potentially useful method to bolster a guidance counselor’s ability to provide tailored guidance to each student. Unfortunately, gathering individual level data from the over 2,000 colleges in the USA is expensive, if not impossible. As a result, we will examine individual level admissions outcome data from one ‘elite’ US university. This information can be analyzed to help advise students on their prospects of being admitted to this particular elite university (at least if they had theoretically applied in the given year from which our data was collected). Additionally, we will explore if it’s responsible to extrapolate these findings to other years at this specific university and potentially to other similarly elite universities.

2. Data Description

At first glance, it is unclear how the data were collected. The 8,700 observations were obtained from last year’s applicants at a particular ‘elite’ university. It’s not specified whether this is a sample of the data or the complete set of observations. Quickly reviewing publicly available Class of 2020 admissions data from the eight Ivy League schools reveals the range in the number of applicants stretches from 20,675 (Dartmouth) to 44,966 (Cornell). The average number of applicants across the eight schools 33,923. Given this, it is almost certain our 8,700 observations are a sample from a broader set of observations at this particular school. We can’t assume that this sample is representative of the presumed broader applicant pool. Finally, the data violate the IID assumption: admitting one applicant has an effect on subsequent applicants given there is a relatively fixed class size and composition. For our treatment, we will assume the data were not generated via a random process. Instead, for our treatment, we will examine the 8,700 observations of nine variables assuming the data were not collected randomly. The nine variables are:

1. admit — Coded “1” for admit and “0” for reject.
2. anglo — Coded “1” for anglo and “0” otherwise.
3. asian — Coded “1” for asian and “0” otherwise.
4. black — Coded “1” for black and “0” otherwise.
5. gpa.wtd — high school GPA weighted for AP courses. 
6. sati.verb — SAT verbal score.
7. sati.math — SAT math score.
8. income — household income capped at $100,000. 
9. sex — Coded “1” for male and “0” for female.

Admit is the response variable. It is a binary variable (“1” for admit / “0” for reject). Admit results were collected for the 8,700 applicants, where each individual result is assumed to be the outcome of a considered admissions review process (i.e. not random). The tally for admissions results can be seen below in Fig. 2.1. 2,686 of the 8,700 students were admitted, reflecting a 30.87% acceptance rate for our specific sample. Note that this does not imply that 30.87% is the overall acceptance rate for this particular university.

The other eight variables will be examined in much greater detail once they’ve been properly cleaned. However, it’s worth stating up front that while these eight variables seem important in an analysis of admissions, they are far from complete. An expert on university admissions, especially at elite schools, could likely point to several other important missing variables. A few examples might include athletic or artistic talent, legacy relationships, service or entrepreneurial recognition, and/or the quality of recommendation letters. There are countless others. In addition, we also have no framework for existing individual admissions decisions/policies at this particular “elite” university, which may impact the analysis. Whatever our estimate of the mean, we can be assured it will be misspecified without greater information. With that caveat stated, there are still many avenues of analysis available. (Administrative Note: while all of the variables are lower-case in the data, I will refer to them via capitalization to make it clear we are referencing a variable (e.g. “admit” will be written “Admit”)).

ggp <- ggplot(data.frame(admissions),aes(x=admit), aes(y=count))
# counts
ggp + geom_bar(fill="lightgreen") + labs(x = "0=Rejected, 1=Accepted", y = "Count", title = "Fig 2.1 Admissions Results", subtitle = "Acceptance Rate = 30.87%")

3. Data Cleaning

Before examining the relationships between Admit and the other eight variables, we will first need to clean and prepare the data. Reviewing the summary statistics for the nine variables below, a number of issues immediately pop out.

summary(admissions)
 admit     anglo       asian       black         gpa.wtd         sati.verb        sati.math         income        sex      
 0:6014   0   :5061   0   :4454   0   :7515   Min.   :-3.200   Min.   :   0.0   Min.   :  0.0   Min.   :  120   0   :4633  
 1:2686   1   :2811   1   :3417   1   : 356   1st Qu.: 3.500   1st Qu.: 510.0   1st Qu.:550.0   1st Qu.:32881   1   :4043  
          NA's: 828   NA's: 829   NA's: 829   Median : 3.850   Median : 580.0   Median :630.0   Median :65000   NA's:  24  
                                              Mean   : 3.789   Mean   : 555.5   Mean   :592.3   Mean   :63245              
                                              3rd Qu.: 4.150   3rd Qu.: 650.0   3rd Qu.:690.0   3rd Qu.:99999              
                                              Max.   : 4.950   Max.   :5240.0   Max.   :800.0   Max.   :99999              
                                                                                                NA's   :1724               

After systematically reviewing the data, eight data cleaning/transformation issues are apparent.

  1. The response variable, Admit, is encoded as an integer/numeric variable (as opposed to factor). We will need to co-erce it to a factor variable to ensure our future analysis uses the proper statistical techniques. We will also do this for all other categorical variables (Anglo, Asian, Black, and Sex).
  2. It’s immediately apparent Anglo has extraneous values. We will transform “white” to “1”" for observation #200 (and recede “NA” to 0 for Asian/Black in this case) and transform from “10” to “1” for observation #10, given Black and Asian are both 0. It’s worth noting this evidence for observation #10 is not conclusive, but most likely correct. Additionally, Anglo was initially input as character data, so we transformed to factor type.
  3. Turning our attention to the treatment of the categorical variables associated with race (Angle, Asian, Black), it’s quickly clear that a subset of observations (1288) are coded “0” for all three. This could reflect missing data. However, given there are 828 actual NA values, it’s more likely that this implies another category of race. We will add “Other” as another factor variable for these 1288 observations (note: used a cute mathematical transformation to create the factor!).
  4. The 828 remaining missing values (NAs) for race proved to be perplexing. Thankfully, the NAs were systematically placed in the same rows among Anglo, Asian, and Black (as opposed to be randomly scattered). The missing values represented ~10% of the total data set, yet there was no obvious way to properly use the data. Given our response is Admit, it seemed logical to first check the overall sample population admittance rate with the sub-population rate of NAs across Anglo, Asian, and Black. The results were similar: 30.87% vs. 33.69&, respectively. The NA’s were being admitted at a slightly higher rate. While the difference is likely material, not to mention cutting 10% of the data set is really painful, I didn’t know how to code a good transformation of the data. Simply imputing the mode would have treated the NA’s as Asian. Intuitively this felt wrong. Ideally, I would have been able to regress all the other variables to predict race; however, I did not have the R background to complete this. Ultimately, I cut the 828 NA’s.
  5. In addition, I removed the two remaining NA’s from Sex. This was less than 0.01% of the data, so felt immaterial.
  6. Income had a max value encoded as $99,999. Given 25% was coded as $99,999, we will treat this as a capped value and not an artifact/mis-reported statistic implying some type of missing value. Furthermore, Income had 1724 missing values coded as NAs. 1724 observations was certainly too many to strike from the analysis. Again, it made sense to compare the overall sample population admittance rate with the sub-population rate of NAs in Income. The results were similar: 30.87% vs. 33.58&, respectively. Missing values for Income are associated with a slightly higher acceptance rate. Without more information, it seemed the best option was to impute the NAs as the median value of $65k. While not perfect, I saw no better way to include these values without adding noise. This is admittedly less than perfect, as there could be many reasons beyond randomly missing data that NAs exist. Perhaps the applicant wants to hide that they are unable to pay for college or that they are wealthy and under performing for some reason. There are a multitude of reasonable explanations.
  7. The minimum GPA was initially reported as -3.20. This was simply inverted to 3.20. Additionally, 19 GPAs were entered as 0. Given the small proportion, the values were transformed to the mean of 3.80.
  8. Possible SAT scores range from 200-800, which quickly highlighted two errors. One verbal score was entered as 5,240. Given this student was admitted, the score was transformed to the median score of admitted students (650). More perplexing were the 457 0’s for both SAT Math and Verbal. Not coincidentally, the 0 values were always paired together, implying some type of systematic effect. The most likely scenarios are the students either a) took the ACT and did not report SAT scores or b) were permitted to be considered for admission without any standardized test scores. No transformations were made to these values, given there is likely signal in the decision not to report. We will examine further later.

As a result of these transformations, the final data set to be used for analysis contains 7,870 observations of 10 variables.

4. Analysis of Univariate Statistics

Starting with merit based predictors, let’s examine GPA and SAT scores. GPA is reported on a weighted scale from 0.00 - 5.00, with an actual range of 1.25 to 4.95. The median is 3.85 and the mean is 3.80. The histogram in Fig 4.1 follows a normal distribution, with a slight left-skew. This is good, as it implies there’s no artifacts/systematic errors in the data when converting from non-weighted to weighted GPAs. Individual Math/Verbal SAT scores highlight two interesting features of the data. First, as noted during EDA, there are a number of SAT Scores at 0. We’re leaving these as is, even though it’s impossible to score 0 (range is 200-800). They are indicative of something other than missing data – likely an active decision to take the ACT or no standardized test at all. We suspect this decision will be informative in the admissions decisions. Second, the individual scores follow roughly normal distributions, though there is a fair bit of right-skew bunching on Math at 800. When combined into an SAT Total score, the distribution becomes almost perfectly normal. This flattening of the right tails in both Math (especially) and Verbal (somewhat) implies that there a significant number of students who do exceptionally well on a particular section, but far less students do exceptionally well on both sections. Most importantly, both GPA and SAT Total scores feel representative of what we would expect to see on a national level from a distribution shape. However, the mean/median seem to be right shifted. Looking at all SAT data, the mean score is 1020 for all test takers and 1080 for all college bound students. From a Level I perspective, the sample of students applying to this particular “elite” university are above average nationally. Moving to a Level II analysis, it seems fair to extrapolate that there is a self-selection bias for better students to apply to better universities. We would expect this trend to hold across other elite universities and across other years.

ggp.v3 <- ggplot(admissions.final, aes(gpa.wtd)) + geom_histogram(fill="lightgreen") + labs(x = "GPA (Weighted)", y = "Count", title = "Fig 4.1: GPA Histogram", 
                                                    subtitle = "Median GPA = 3.85")
  ggp.v4 <- ggplot(admissions.final, aes(sati.math)) + geom_histogram(fill="lightgreen") + labs(x = "SAT Math", y = "Count", title = "Fig 4.2: SAT Math", 
                                                    subtitle = "Median Math SAT = 630")
  ggp.v5 <- ggplot(admissions.final, aes(sati.verb)) + geom_histogram(fill="lightgreen") + labs(x = "SAT Verb", y = "Count", title = "Fig 4.3: SAT Verb", 
                                                    subtitle = "Median Verb SAT = 580")
  ggp.v6 <- ggplot(admissions.final, aes(sati.total)) + geom_histogram(fill="lightgreen") + labs(x = "SAT Total", y = "Count", title = "Fig 4.4: SAT Total", 
                                                    subtitle = "Median Total SAT = 1210")
  grid.arrange(ggp.v3, ggp.v4, ggp.v5, ggp.v6, nrow=2, ncol=2)

Looking at demographic data, we see a few interesting facets of the data. There are roughly 17% more female applications than male. This could certainly be a result of our specific sample. From a Level II perspective, however, this wouldn’t be surprising. On almost every level (applications, enrollment, graduation, etc.) there are more female undergraduates than male represented. Given this, it feels reasonable to treat this male/female applicant count gap as representative of the broader population. Race, however, paints a different picture. Black students made up 356 (4.5%) of total applicants. Whether this is a Level I sampling anomaly or Level II broader representation of the college applicant pool is difficult to say. Looking at 2015 applications to “elite”" schools, most schools (Penn, Columbia, Yale, Duke, among many others reporting) had black student applicants comprising 10-13% of the applicant pool, between 2-3x our 4.5%. In contrast, UC Berkeley received just 4.9% of its applicants from black students (lowest among “elite” schools). While there is no conclusive evidence, using the best data available seems to indicate the proportion of black students in our sample is systematically low. Looking at Income, we included two histograms: the original (with 1724 NAs omitted) and the final with the 1724 NAs imputed to the mean. This obviously skewed the distribution, but seemed the “fairest” way to do so without adding noise. The distribution is right skewed, with about 25% of population making more than $100k. Among families with incomes under $100k, the distribution is fairly uniform.

par(mfrow=c(2,2))
      barplot(admit.apps.comp.gen, col = "lightgreen",
            xlab = "Gender: Male, Female",
            ylab = "Count",
            main = "Fig 4.5: Applicant Count by Gender",
            ylim = c(0, 5000))
      barplot(admit.apps.comp, col = "lightgreen",
              xlab = "Applicants: Anglo, Asian, Black, Other",
              ylab = "Count",
              main = "Fig 4.6: Applicant Count by Race",
              ylim = c(0, 4000))
      hist(admissions.final$income, col = "lightgreen",
           xlab = "Family Income ($)",
           ylab = "Count",
           main = "Fig 4.7: App Count by Income (Imputed)",
           ylim = c(0, 3000))
      hist(admissions$income, col = "lightgreen",
           xlab = "Family Income ($)",
           ylab = "Count",
           main = "Fig 4.8: App Count by Income (Original",
           ylim = c(0, 3000))

5. Analysis of Bivariate Statistics

Recreating the histogram (Fig 5.1) of the response variable (Admit) from the final data set, we see the acceptance rate has changed minorly from 30.87% to 30.58%. From here we will turn our attention to uncovering what features of the data set are predictive for admissions.

ggp <- ggplot(data.frame(admissions.final),aes(x=admit), aes(y=count))
# counts
ggp + geom_bar(fill="lightgreen") + labs(x = "0=Rejected, 1=Accepted", y = "Count", title = "Fig 5.1 Admissions Results (Final Dataset)", subtitle = "Acceptance Rate = 30.58%")

Based on an intuitive first pass, at least for anyone who’s spent time in the college admissions process, an immediate question pops up: for an “elite” school, is 30.58% a reasonable acceptance rate? Turning back to the publicly available Class of 2020 Ivy League information, acceptance rates ranged from 3.4% (Harvard) to 12.5% (Cornell). Boston College, my Alma mater, has an acceptance rate of 29% and is ranked around #30 nationally. So 30% is either systematically low for a truly “elite” college (and implies we have a skewed sample) or is representative of a school ranked around #30. How we decided to interpret will have implications on whether we can perform a Level I or II analysis.

Let’s turn to demographics. First, looking at race, we can see in Fig 5.2 that there is disparity between Anglo/Asian admit rates (32%) and Black/Other admit rates (24%). This information challenges, at least for this particular sample, an often repeated quip that it’s easier for underrepresented minority students to get into elite universities than Anglo or Asian students. This will be interesting to explore further in the multi-variate analysis once interaction effects like income, SAT, and/or GPA are considered. Examining income, let’s start by splitting the data in to two buckets: lower/middle-class (income less than $99,999) and affluent (income greater than $99,999). The admission rates are pretty similar: 30.1% and 32.1%, respectively. While affluent applicants are slightly favored in this data set, it’s not nearly as startling as the racial disparities. Finally, looking at gender, we see similar results to income: a small disparity between the male admission rate (29.8%) and the female admission rate (31.3%). Whether this is strictly a Level I sample artifact or a Level II generalization about college admissions is difficult to say based on this data alone and somewhat irrelevant given the disparity is minimal.

barplot(admit.rates.race.hold*100, col = "lightgreen",
            xlab = "Admit Rate: Total(30.6%), Anglo(32.5%), Asian(32.0%), Black(24.4%), Other(24.4%)",
            ylab = "Percent",
            main = "Fig 5.2: Admit Rate by Race",
            ylim = c(0, 40))

Merit based predictors prove to be more illustrative. We can look at how GPA, SAT Math, SAT Verbal, and SAT Total impact admissions decisions. Given we are working in the wrong model perspective and not focused on causality, let’s focus on visualizing the data. We can look at admissions rates for each of the 4 predictors in quartiles in Figs 5.3-6. It’s abundantly clear that GPA, SAT M, SAT V, and SAT T all greatly effect admissions decision in the sample population. Applicants from the Top quartile in any given category are 7-24x more likely to be admitted from those from the bottom. There is clearly co-linearity between these predictors, bot explicitly (SAT M + V = SAT Total) and implicitly (good SAT takers probably on average have higher GPAs). We can see this empirically. The correlations between SAT Math & SAT Verbal (0.85) and SAT Total and GPA (0.33) are printed below.

    cor(admissions.final$sati.math, admissions.final$sati.verb)
[1] 0.8509055
    cor(admissions.final$sati.total, admissions.final$gpa.wtd)
[1] 0.3286649
 par(mfrow=c(2,2))
    plot(c(top.gpa.admit.rate,
           scnd.gpa.admit.rate,
           thrd.gpa.admit.rate,
           bot.gpa.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.3: Admit by GPA",
         pch = 16,
         cex = 1.25,
         col = "lightgreen")
    plot(c(top.satm.admit.rate,
           scnd.satm.admit.rate,
           thrd.satm.admit.rate,
           bot.satm.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.4: Admit by SAT Math",
         pch = 16,
         cex = 1.25,
         col = "lightgreen")
    plot(c( top.satv.admit.rate,
            scnd.satv.admit.rate,
            thrd.satv.admit.rate,
            bot.satv.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.5: Admit by SAT Verbal",
         pch = 16,
         cex = 1.25,
         col = "lightgreen")
    plot(c( top.satt.admit.rate,
            scnd.satt.admit.rate,
            thrd.satt.admit.rate,
            bot.satt.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.5: Admit by SAT Total",
         pch = 16,
         cex = 1.25,
         col = "lightgreen")

6. Analysis of Multivariate Statistics

Finally, we get to the Random Forest (RF) Analysis: our principle topic of inquiry. Before we start, it’s important to lay out a few statistical concepts. First, as noted in Section 2: Data Description, our classification using RF will lack many potentially explanatory variables. In short, we have the 11 predictors we’ve observed, but there is a lot more information required to specify the true mean function. Given that, we will still be OK moving ahead at Level I and potentially Level II analysis, but we will not be able to draw conclusions at Level III. Second, in Section 5. Analysis of Bivariate Statistics, we pointed out that it’s unwise to imply causality with the limited data we’re analyzing. Now, we will completely depart from the idea of regression coefficients when using RF. The transformations and interaction effects involved in creating the model are sufficiently complex that there is not even a coefficient type analogy to consider.

To evaluate our model, we will allow the algorithm to split the data into training data and test data for us. The RF algorithm does this automatically by training on a subset of observations and then testing the quality of the decision tree it creates on the Out-of-Bag (OOB) test data. This framework is a good way to prevent over fitting, especially given it is averaged over a large number of trees. Given we have almost two orders of magnitude more observations than base predictors, the model should be stable.

6.1: No Misclassification Costs

For our initial model, we grow a random forest of 750 trees. Based on the plot below in Fig. 6.1, we can see the model improvement tapers off substantially by 50, and is almost flat by 200. 750 trees should capture plenty signal without wasting compute resources. The misclassification error rate is 15.91%, which can seen below in the model print out, showing OOB error rate and the confusion table. 15.91% seems reasonable at first pass given the amount of missing data in our model. We’d expect the error rate to drop if we could inform the model of everyone’s other talents, service, recommendation letters, and myriad other criteria. What’s potentially more interesting is the comparison of false negatives to false positives. The model had sensitivity of 66.9% and specificity of 91.6%. Our model preferred to predict someone would NOT be accepted than would be. The implied cost ratio was a bit less that 1:2. It forecasted 26.2% of applicants would be accepted (compared to the actual 30.6%). As a result, 796 students who would have been admitted would have been discouraged from applying. It’s worth tuning the false positive to false negative cost ratio in our model to promote a more useful prediction; after-all, we don’t want to discourage every high school senior during what is already an incredibly exhausting and trying time.

plot(rf1, main = "Fig 6.1: RF1 Plot")

print(rf1)

Call:
 randomForest(formula = admit ~ anglo + asian + black + other +      gpa.wtd + sati.verb + sati.math + sati.total + income + sex,      data = admissions.final, importance = T, ntree = 750) 
               Type of random forest: classification
                     Number of trees: 750
No. of variables tried at each split: 3

        OOB estimate of  error rate: 15.91%
Confusion matrix:
     0    1 class.error
0 5007  456  0.08347062
1  796 1611  0.33070212

6.2: Setting Misclassification Costs

For our next model, let’s build in a target 4:1 cost ratio. This essentially says that we want the model to favor false positives at 4x the rate it values false negatives. This is going to invariably degrade precision: we will be wrong more often when predicting “Yes”. However, the motivation seems just. As counselors, we can either inflate our vanity metrics of saying our students got into their “Top” school more often OR we can sacrifice our accuracy by pushing students to apply to better schools that they really do have a fighting chance to get into. I’d certainly rather strive for excellence and come up short sometimes than settle for mediocrity and spending a life wondering “what if”. Looking at the results, we achieved a 3.8:1 cost ratio. As a cost, the OOB error rate rose to 20.27%. What’s valuable here from a guidance counselor perspective is that the model is still accurately predicting acceptances at 62%. I’d be shocked if most students did not apply to an “elite” school (assuming they liked the other factors of the school) with a 62% chance of being admitted. On the flip side, when we tell a student that the model is not optimistic about their chances, it’s right 93% of the time. Not too many people will hang their hats on a 1:14 chance. For fun, let’s presume a student with average risk-appetite will apply to a school if there’s 1/3 chance of being accepted. What does that cost ratio look like? The cost ratio is about 155:1. Interestingly, though, the false negative error rate only gets cut by about 50%. This is a poor trade off to make.

print(rf2)

Call:
 randomForest(formula = admit ~ anglo + asian + black + other +      gpa.wtd + sati.verb + sati.math + sati.total + income + sex,      data = admissions.final, importance = T, ntree = 750, sampsize = c(1000,          1700)) 
               Type of random forest: classification
                     Number of trees: 750
No. of variables tried at each split: 3

        OOB estimate of  error rate: 20.27%
Confusion matrix:
     0    1 class.error
0 4199 1264   0.2313747
1  331 2076   0.1375156
print(rf3)

Call:
 randomForest(formula = admit ~ anglo + asian + black + other +      gpa.wtd + sati.verb + sati.math + sati.total + income + sex,      data = admissions.final, importance = T, ntree = 750, sampsize = c(100,          2000)) 
               Type of random forest: classification
                     Number of trees: 750
No. of variables tried at each split: 3

        OOB estimate of  error rate: 56.15%
Confusion matrix:
     0    1 class.error
0 1072 4391  0.80377082
1   28 2379  0.01163274

6.3: Importance Plots and Prediction Accuracy

Looking at the importance plots below in Fig 6.2 and Fig 6.3, reminds us of the previous observations in the bivariate analysis. In descending order, the most important predictors are: GPA, SAT Total, SAT Verbal, and SAT Math. On an unstandardized basis, what this chart is saying is that when a given variable is shuffled, we lose the corresponding amount of predictive power (OOB classification error, formally). For example, shuffling GPA.wtd, the most important predictor causes a subsequent drop in over 25% of predictive power. From a Level I perspective, it’s interesting that GPA and SAT Total were most predictive and also followed the most normal distributions. The broader range of possible results (0.00 - 5.00 or 400 - 1600) and variance among results allowed the random forest algorithm to find incremental signal in a way that is very hard to do with a limited set of indicator variables.

 par(mfrow=c(1,2))
    varImpPlot(rf2,type=1,scale=F,class=1,
               main="Importance Plot for Admission
               (Unstandarized)",col="lightgreen",cex=1,pch=19)
    varImpPlot(rf2,type=1,scale=T,class=1,
               main="Importance Plot for Admission
               (Standardized)", col="lightgreen",cex=1,pch=19)
  par(mfrow=c(1,2))

    varImpPlot(rf2,type=1,scale=F,class=0,
               main="Importance Plot for Rejection
               (Unstandardized)", col="lightgreen",cex=1,pch=19)
    varImpPlot(rf2,type=1,scale=T,class=0,
               main="Importance Plot for Rejection
               (Standardized)", col="lightgreen",cex=1,pch=19)

6.4: Partial Dependence Plots

The partial dependence plots below in Fig 6.4 and Fig 6.5 show the relationships between the GPA/SAT Total and the centered log odds of admission. Both GPA and SAT Score are numeric with normal distributions. The chances of admission increase linearly with GPA from 3.3 until 4.5, at which point the data becomes thin and unstable. For SAT Scores the relationship is slightly different, growing near exponentially from around 1100 until 1300, before settling into a liner relationship and then plateauing at around 1400. High school students with GPAs around 4.5 and SAT Scores above 1400 have the highest chance of admission. The slight declines at the extreme right tails are probably best explained by sparse data. An astute college admissions specialist might have domain knowledge that could explain differently: perhaps students with 5.0 GPAs and 1600 SATs are lacking in other characteristics. There’s not enough evidence here to say.

par(mfrow=c(2,1))
    part1<-partialPlot(rf2,pred.data=admissions.final,x.var=gpa.wtd,
                       rug=T,which.class=1, main="Partial Dependence Plot for GPA on Admission", xlab="GPA",
                       ylab="Centered Log Odds")
    scatter.smooth(part1$x,part1$y,span=1/3,col="lightgreen",pch=19, main="Partial Dependence Plot for GPA on Admission", xlab="GPA",
                   ylab="Centered Log Odds")
    par(mfrow=c(2,1))

    part2<-partialPlot(rf2,pred.data=admissions.final,x.var=sati.total,
                       rug=T,which.class=1, main="Partial Dependence Plot for SAT Total on Admission", xlab="SAT Score",
                       ylab="Centered Log Odds")
    scatter.smooth(part1$x,part1$y,span=1/3,col="lightgreen",pch=19, main="Partial Dependence Plot for SAT Total on Admission", xlab="SAT Score",
                   ylab="Centered Log Odds")

7. Summary and Conclusions

Our final model used 750 trees and a cost ratio of 3.8:1. This lead to a 20.27% OOB Classification error rate, which we accepted due to it’s willingness to predict more students would be admitted, even if at lower precision. This was a policy decision driven by a desire to encourage high quality students to apply to an “elite” university. We can now summarize our findings and conclusions. There are three core issues I’d like to address:

  1. From a data collection perspective, this data set is less than ideal for a few reasons. First, we are unsure whether it is a complete sample of one year’s admission results from an elite university or just a sample of results. The limited number of observations compared to a normal elite university application count implies the data is a sample. From there, we don’t know if was generated randomly or as a biased selection. Furthermore, college admissions data is tricky in that is not fully independent. There are a relatively fixed number of seats and similar students compete over the same seat. This destroys the independence assumption – all previous decisions impact the subsequent ones in a given year (as clearly seen comparing early-decision admittance rates and regular-decision admittance rates). For the remainder of this discussion, we will assume the data were collected as a biased sample from a larger set of admissions observations. Luckily, gender, sex, and income were very minimally, if at all, predictive in our sample set.
  2. From a level I perspective, we can see the data are complex with many missing features. Addressing only these 7870 observations, we cannot approach the true mean specification. That said, we can directionally measure direction and importance of select predictors. For example, GPA and STAT Total appeared to be both correlated with each other and correlated with higher rates of admission, which we observed in the bivariate analysis as well as partial dependence and importance plots. At the far right of both GPA and SAT Scores, the distributions become thin and unstable. Most of the apparent reduction in admission rates there can be attributed to instability. The Random Forest algorithm proved to be a powerful classification tool, performing with an OOB classification error rate of 15.91% when unrestricted in cost ratio. Even after adding a 3.8:1 cost ratio, the algorithm slid 4.5% percentage points on OOB classification error (20.27%); a small trade-off for a much more useful predictive tool.
  3. From a Level II perspective, which despite the data collection issues, I believe we can justify, we can extrapolate some findings that could prove useful moving forward in real world applications for our high school students. I believe we can justify Level II analysis for two principle reasons. First, our model found the most predictive variables to be Weighted GPA and Total STAT Score, which are continually reported to be critical factors in college admissions. Rather than these findings being an artifact of our limited sample, they seem to be indicative of a much broader trend. Second, our data covered a broad set of values. Both GPA and STAT Total formed nearly perfectly normal distributions. Income showed a relative uniformity, aside from the spike at $99,999 (and subsequent spike at $65,000 from median imputation). Anglo, Asian, Other, and Sex were well represented. The one sparsity was around Black students. It would be wise to exercise caution in extrapolating the model for Black students without further data. If these distributions were sparser or highly skewed in illogical ways, I’d be much more cautious to extend to Level II analysis. This raises the question: how should we actually use the model? The answer is: it depends. As my recommendation, the model should be used sparingly overall. The model is best fit for the following situation: an 1) Anglo or Asian student who has 2) taken the SAT and is 3) applying to a University with a ~30% acceptance rate. If the student is Black, the school is substantially easier or harder to get into, the student is missing SAT data, or the time period has shifted significantly from when these data were collected, the model could be wildly less predictive. My recommendation would be to identify students at my high school that fit this profile, use the algorithm to predict their results as a pilot, and then evaluate the true predictive results vs. the OOB data. All the while, I’d be pushing to gather more robust data. As it stands today, the predictive power is for a limited scope of student applying to a fairly specific type of school.

Appendix: R Code

setwd("/Users/jeffreyellington/Dropbox/Wharton/Spring2017/STAT974/data/")
###### first glance at data
summary(admissions)
str(admissions)

## Plot Admissions histogram
library(ggplot2)
ggp <- ggplot(data.frame(admissions),aes(x=admit), aes(y=count))
# counts
ggp + geom_bar(fill="lightgreen") + labs(x = "0=Rejected, 1=Accepted", y = "Count", title = "Fig 2.1 Admissions Results", 
              subtitle = "Acceptance Rate = 30.87%")
## Calc acceptance rate
sum(admissions$admit)/length(admissions$admit)*100


##### EDA
## Find Weird Values for Categorical Variables, if any
str(admissions)
unique(admissions$anglo)
    ## Anglo, Misc "10"/"White" --> recode as "1" (and remove NA's for obs #200)
    admissions[[200,2]]=1
    admissions[[200,3]]=0
    admissions[[200,4]]=0
    admissions[[10,2]]=1
unique(admissions$asian)
unique(admissions$black)
unique(admissions$sex)

## Factor/Numeric variable transformations
admissions$admit<-as.factor(admissions$admit)
admissions$anglo<-as.factor(admissions$anglo)
admissions$asian<-as.factor(admissions$asian)
admissions$black<-as.factor(admissions$black)
admissions$sex<-as.factor(admissions$sex)
admissions$income<-as.numeric(admissions$income)
admissions$sati.math<-as.numeric(admissions$sati.math)
str(admissions)
summary(admissions)

## GPA Values
  ## Check if GPA is normally distributed (verifying no systematic deviances due to mis-haps in weighting)
  hist(admissions$gpa.wtd) ## looks normal without any systematic deviance // bi-modalities
  ## Invert negative value from -3.20 to 3.20
  admissions[[2000,5]]=3.20
  ## Make 0.00 GPA = Mean/Median GPA
  summary(admissions$gpa.wtd)
  ## Median = 3.85, Mean = 3.79, settled on 3.80, assuming mean was slightly pulled down by the 19 GPA=0.00 values
  admissions[[382,5]]=3.80
  admissions[[560,5]]=3.80
  admissions[[903,5]]=3.80
  admissions[[1900,5]]=3.80
  admissions[[3081,5]]=3.80
  admissions[[3146,5]]=3.80
  admissions[[3541,5]]=3.80
  admissions[[3816,5]]=3.80
  admissions[[3925,5]]=3.80
  admissions[[4034,5]]=3.80
  admissions[[4486,5]]=3.80
  admissions[[4576,5]]=3.80
  admissions[[5140,5]]=3.80
  admissions[[5462,5]]=3.80
  admissions[[6604,5]]=3.80
  admissions[[7131,5]]=3.80
  admissions[[7759,5]]=3.80
  admissions[[7938,5]]=3.80
  admissions[[8061,5]]=3.80
      #Don't laugh... couldn't get function to work :-(
  
## Income Max Encoded at 99,999
  sum(admissions$income >= 99999, na.rm = TRUE)
  2164/8700 *100
  ## 2164 at max income (24.87%)
  ## Seems OK to leave as ... 25% making more than $100k is reasonable for this pool
  
## SAT Max Score Error
  ## Max SAT Verbal = 5240; regressing to the median of admitted students (given they were admitted)
  admitted <- subset(admissions, admissions$admit == 1)
  summary(admitted$sati.verb) ## median = 650
  admissions[[20,6]]=650 


## Find NA's and remove if sensible
  summary(admissions)
  ## Sex NAs (24 total, 22 have tons of other missing values, striking from data set)
  which(is.na(admissions$sex))
  admissions.v1 <- admissions[-c(353, 379, 443, 762, 799, 1321, 1780, 4451, 4486, 4681, 5197, 5229, 5558, 5792, 6027, 6284, 6323, 6426, 6465, 6762, 6921, 7505, 7934, 8274),]
  
  ## NA for Income (1724, too big to strike)
  sum(is.na(admissions$income))
    ## check if systematic deviance from normal acceptance rate (579 Admitted out of 1724 Total)
    579/1724 *100 ## 33.58% vs. 30.87% ... seems omitted income info is slightly predictive, but not much. Regessing to median
    admissions.test <- admissions.v1 
    samp<-na.roughfix(admissions.test$income, median) ## impute NAs to median value
    admissions.test$income <- samp ## replace NAs with median value
    sum(admissions.test$income==65000) - sum(admissions.v1$income==65000, na.rm = TRUE) #1711, correct value
    admissions.v2 <- admissions.test ## create new dataframe with imputed values
    summary(admissions.v2)
    
  ## NA for Race (828)
  sum(is.na(admissions$asian))
  sum(is.na(admissions$black))
  sum(is.na(admissions$anglo))
    ## check if systematic deviance from normal acceptance rate
    summary(admissions) #828
    summary(admitted) #279
    279/828 * 100 ## 33.69% vs. 30.87% ... seems omitted race is slightly predictive, but not much
    ## can't regress to mean, the others are likely too predictive...
    ## striking from dataset... 10% hurts, but it's the best solution I see
    admissions.v3 <- subset(admissions.v2, (!is.na(admissions.v2[,2])) & (!is.na(admissions.v2[,3])) & (!is.na(admissions.v2[,4])))
    summary(admissions.v3)
    

## Is there another race category?
  ## Race Counts
  sum(is.na(admissions.v3)) ## No NA's in dataset, 7870 observations
  ## White / Asian / Black
  sum(admissions.v3$anglo==1) + sum(admissions.v3$asian==1) + sum(admissions.v3$black==1) ## 6582
  7870 - 6582 #1288
  ## Another level? or Multiple?
  sum(admissions.v3$anglo==0 & admissions.v3$asian==0 & admissions.v3$black==0) #1288, checks out
    ## Create 4th level for race [no idea how to do this with functions, so why not use some cute math :-)...]
    admissions.v3$other <- as.numeric(admissions.v3$anglo) + as.numeric(admissions.v3$asian) + as.numeric(admissions.v3$black)
    admissions.v3$other <- admissions.v3$other - 4
    admissions.v3$other <- admissions.v3$other * -1
    admissions.v3$other <- as.factor(admissions.v3$other)
  

## More SAT Issues, 0 values
  ## SAT Math = 0
  sum(admissions.v3$sati.math == 0)
  ## 408
  
  ## SAT Verb = 0
  sum(admissions.v3$sati.verb == 0)
  ## 408
  
  ## Not a coincidence --> Likely took the ACT
    ## check if systematic deviance from normal acceptance rate
    admitted.v3 <- subset(admissions.v3, admissions.v3$admit == 1)
    summary(admitted.v3)
    2407/7870 * 100 ## subset base admissions rate = 30.58% (negliblby different from original 30.87%)
    sum(admitted.v3$sati.math == 0)/sum(admissions.v3$sati.math == 0) * 100 ## 10.05% --> VERY DIFFERENT, need to consider
    sum(admitted.v3$sati.verb == 0)/sum(admissions.v3$sati.verb == 0) * 100 ## sanity check, verbal == math
    sub.test <- subset(admissions.v3, admissions.v3$admit == 1 & admissions.v3$sati.verb == 0)
    summary(sub.test)
    sub.test2 <- subset(admissions.v3, admissions.v3$sati.verb == 0) 
    sort(sub.test2$gpa.wtd)

## Final Dataset
    admissions.final <- admissions.v3

##### Univariate Analysis
  ## GPA
  ## Check if GPA is normally distributed (verifying no systematic deviances due to mis-haps in weighting)
  ggp.v3 <- ggplot(admissions.final, aes(gpa.wtd)) 
  ggp.v3 + geom_histogram(fill="lightgreen") + labs(x = "GPA (Weighted)", y = "Count", title = "Fig 4.1: GPA Histogram", 
                          subtitle = "Median GPA = 3.85")## looks normal without any systematic deviance // bi-modalities
  ## SAT Scores
  ## Math
  summary(admissions.final$sati.math)
  ggp.v4 <- ggplot(admissions.final, aes(sati.math)) 
  ggp.v4 + geom_histogram(fill="lightgreen") + labs(x = "SAT Math", y = "Count", title = "Fig 4.2: SAT Math", 
                          subtitle = "Median Math SAT = 630")
  ## Verb
  summary(admissions.final$sati.verb)
  ggp.v5 <- ggplot(admissions.final, aes(sati.verb)) 
  ggp.v5 + geom_histogram(fill="lightgreen") + labs(x = "SAT Verb", y = "Count", title = "Fig 4.3: SAT Verb", 
                                                    subtitle = "Median Verb SAT = 580")
  ## Combined
  ## Create new combined/total SAT variable
  admissions.final$sati.total <- admissions.final$sati.math + admissions.final$sati.verb
  summary(admissions.final$sati.total)
  ggp.v6 <- ggplot(admissions.final, aes(sati.total)) 
  ggp.v6 + geom_histogram(fill="lightgreen") + labs(x = "SAT Total", y = "Count", title = "Fig 4.4: SAT Total", 
                                                    subtitle = "Median Total SAT = 1210")
  ## combine 
  library(gridExtra)
  ggp.v3 <- ggplot(admissions.final, aes(gpa.wtd)) + geom_histogram(fill="lightgreen") + labs(x = "GPA (Weighted)", y = "Count", title = "Fig 4.1: GPA Histogram", 
                                                    subtitle = "Median GPA = 3.85")
  ggp.v4 <- ggplot(admissions.final, aes(sati.math)) + geom_histogram(fill="lightgreen") + labs(x = "SAT Math", y = "Count", title = "Fig 4.2: SAT Math", 
                                                    subtitle = "Median Math SAT = 630")
  ggp.v5 <- ggplot(admissions.final, aes(sati.verb)) + geom_histogram(fill="lightgreen") + labs(x = "SAT Verb", y = "Count", title = "Fig 4.3: SAT Verb", 
                                                    subtitle = "Median Verb SAT = 580")
  ggp.v6 <- ggplot(admissions.final, aes(sati.total)) + geom_histogram(fill="lightgreen") + labs(x = "SAT Total", y = "Count", title = "Fig 4.4: SAT Total", 
                                                    subtitle = "Median Total SAT = 1210")
  grid.arrange(ggp.v3, ggp.v4, ggp.v5, ggp.v6, nrow=2, ncol=2)
  
  ## Demographics: 
  ## Applicants by Race
  anglo.apps <- sum(admissions.final$anglo==1)
  asian.apps <- sum(admissions.final$asian==1)
  black.apps <- sum(admissions.final$black==1)
  other.apps <- sum(admissions.final$other==1)
  admit.apps.comp <- c(anglo.apps, asian.apps, black.apps, other.apps)
  admit.apps.comp
    ## Plot
    barplot(admit.apps.comp, col = "lightgreen",
          xlab = "Applicants: Anglo, Asian, Black, Other",
          ylab = "Count",
          main = "Fig 4.4: Applicant Count by Race",
          ylim = c(0, 4000))
    
  ## Income Histogram
    hist(admissions.final$income, col = "lightgreen",
         xlab = "Family Income ($)",
         ylab = "Count",
         main = "Fig 4.5: Applicant Count by Income",
         ylim = c(0, 3000))
  ## Original Income Histogram
    hist(admissions$income, col = "lightgreen",
         xlab = "Family Income ($)",
         ylab = "Count",
         main = "Fig 4.5: Applicant Count by Income",
         ylim = c(0, 3000))
    
  ## Gender Histogram
    ## Applicants by Sex
    male.apps <- sum(admissions.final$sex==1)
    female.apps <- sum(admissions.final$sex==0)
    admit.apps.comp.gen <- c(male.apps, female.apps)
    admit.apps.comp.gen
    barplot(admit.apps.comp.gen, col = "lightgreen",
         xlab = "Gender: Male, Female",
         ylab = "Count",
         main = "Fig 4.6: Applicant Count by Gender",
         ylim = c(0, 5000))
  
  ## Combine Plots
    par(mfrow=c(2,2))
      barplot(admit.apps.comp.gen, col = "lightgreen",
            xlab = "Gender: Male, Female",
            ylab = "Count",
            main = "Fig 4.5: Applicant Count by Gender",
            ylim = c(0, 5000))
      barplot(admit.apps.comp, col = "lightgreen",
              xlab = "Applicants: Anglo, Asian, Black, Other",
              ylab = "Count",
              main = "Fig 4.6: Applicant Count by Race",
              ylim = c(0, 4000))
      hist(admissions.final$income, col = "lightgreen",
           xlab = "Family Income ($)",
           ylab = "Count",
           main = "Fig 4.7: Transformed Applicant Count by Income",
           ylim = c(0, 3000))
      hist(admissions$income, col = "lightgreen",
           xlab = "Family Income ($)",
           ylab = "Count",
           main = "Fig 4.8: Original Applicant Count by Income",
           ylim = c(0, 3000))
    
  
###### Bivariate Analysis
  ## Setup / Basic Hist
  summary(admissions.final)
  total.admit.rate <- sum(admissions.final$admit==1)/length(admissions.final$admit)
  ggp <- ggplot(data.frame(admissions.final),aes(x=admit), aes(y=count))
  # histogram
  ggp + geom_bar(fill="lightgreen") + labs(x = "0=Rejected, 1=Accepted", y = "Count", 
                                           title = "Fig 5.1 Admissions Results (Final Dataset)", subtitle = "Acceptance Rate = 30.58%")
  
  ## Demographics (Race/Sex/$)
    ## Admissions Rates by Race
    anglo.admit.rate <- sum(admissions.final$admit==1 & admissions.final$anglo==1)/sum(admissions.final$anglo == 1)
    asian.admit.rate <- sum(admissions.final$admit==1 & admissions.final$asian==1)/sum(admissions.final$asian == 1)
    black.admit.rate <- sum(admissions.final$admit==1 & admissions.final$black==1)/sum(admissions.final$black == 1)
    other.admit.rate <- sum(admissions.final$admit==1 & admissions.final$other==1)/sum(admissions.final$other == 1)
    admit.rates.race.comp <- c(total.admit.rate, anglo.admit.rate, asian.admit.rate, black.admit.rate, other.admit.rate)
    admit.rates.race.comp
      ## Plot
      barplot(admit.rates.race.hold*100, col = "lightgreen",
              xlab = "Admit Rate: Total, Anglo, Asian, Black, Other",
              ylab = "Percent",
              main = "Fig 5.2: Admit Rate by Race",
              ylim = c(0, 40))
    ## Admissions by Income
    income.under.100 <- subset(admissions.final, admissions.final$income <=99998)
    income.over.100 <- subset(admissions.final, admissions.final$income == 99999)
    summary(income.under.100)
    summary(income.over.100)
    income.under.100.admit.rate <- 1769/(1769+4113)
    income.over.100.admit.rate <- 638/(638+1350)
    income.under.100.admit.rate
    income.over.100.admit.rate
    ## Admissions by Gender
    male.apps <- sum(admissions.final$sex==1)
    female.apps <- sum(admissions.final$sex==0)
    male.admits <- sum(admissions.final$sex==1 & admissions.final$admit==1)
    female.admits <- sum(admissions.final$sex==0 & admissions.final$admit==1)
    male.admit.rate <- male.admits/male.apps
    female.admit.rate <- female.admits/female.apps
    male.admit.rate
    female.admit.rate
  
    
  ## Merit (SAT/GPA)
    ## GPA Quartiles
    summary(admissions.final$gpa.wtd)
    top.gpa.q <- subset(admissions.final, admissions.final$gpa.wtd > 4.151)
    scnd.gpa.q <- subset(admissions.final, admissions.final$gpa.wtd <= 4.151 & admissions.final$gpa.wtd > 3.850)
    thrd.gpa.q <- subset(admissions.final, admissions.final$gpa.wtd <= 3.850 & admissions.final$gpa.wtd > 3.500)
    bot.gpa.q <- subset(admissions.final, admissions.final$gpa.wtd <= 3.500)
    top.gpa.admit.rate <- sum(top.gpa.q$admit == 1)/length(top.gpa.q$gpa.wtd)*100
    scnd.gpa.admit.rate <- sum(scnd.gpa.q$admit == 1)/length(scnd.gpa.q$gpa.wtd)*100
    thrd.gpa.admit.rate <- sum(thrd.gpa.q$admit == 1)/length(thrd.gpa.q$gpa.wtd)*100
    bot.gpa.admit.rate <- sum(bot.gpa.q$admit == 1)/length(bot.gpa.q$gpa.wtd)*100
    top.gpa.admit.rate
    scnd.gpa.admit.rate
    thrd.gpa.admit.rate
    bot.gpa.admit.rate
    plot(c(top.gpa.admit.rate,
           scnd.gpa.admit.rate,
           thrd.gpa.admit.rate,
           bot.gpa.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.3: Admit by GPA",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    
    ## SAT Math
    summary(admissions.final$sati.math)
    top.satm.q <- subset(admissions.final, admissions.final$sati.math >= 690)
    scnd.satm.q <- subset(admissions.final, admissions.final$sati.math < 690 & admissions.final$sati.math >= 630)
    thrd.satm.q <- subset(admissions.final, admissions.final$sati.math < 630 & admissions.final$sati.math >= 550)
    bot.satm.q <- subset(admissions.final, admissions.final$sati.math < 550)
    top.satm.admit.rate <- sum(top.satm.q$admit == 1)/length(top.satm.q$sati.math)*100
    scnd.satm.admit.rate <- sum(scnd.satm.q$admit == 1)/length(scnd.satm.q$sati.math)*100
    thrd.satm.admit.rate <- sum(thrd.satm.q$admit == 1)/length(thrd.satm.q$sati.math)*100
    bot.satm.admit.rate <- sum(bot.satm.q$admit == 1)/length(bot.satm.q$sati.math)*100
    top.satm.admit.rate
    scnd.satm.admit.rate
    thrd.satm.admit.rate
    bot.satm.admit.rate
    plot(c(top.satm.admit.rate,
           scnd.satm.admit.rate,
           thrd.satm.admit.rate,
           bot.satm.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.4: Admit by SAT Math",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    
    ## SAT Verb
    summary(admissions.final$sati.verb)
    top.satv.q <- subset(admissions.final, admissions.final$sati.verb >= 650)
    scnd.satv.q <- subset(admissions.final, admissions.final$sati.verb < 650 & admissions.final$sati.verb > 580)
    thrd.satv.q <- subset(admissions.final, admissions.final$sati.verb <= 580 & admissions.final$sati.verb > 500)
    bot.satv.q <- subset(admissions.final, admissions.final$sati.verb <= 500)
    top.satv.admit.rate <- sum(top.satv.q$admit == 1)/length(top.satv.q$sati.verb)*100
    scnd.satv.admit.rate <- sum(scnd.satv.q$admit == 1)/length(scnd.satv.q$sati.verb)*100
    thrd.satv.admit.rate <- sum(thrd.satv.q$admit == 1)/length(thrd.satv.q$sati.verb)*100
    bot.satv.admit.rate <- sum(bot.satv.q$admit == 1)/length(bot.satv.q$sati.verb)*100
    top.satv.admit.rate
    scnd.satv.admit.rate
    thrd.satv.admit.rate
    bot.satv.admit.rate
    plot(c( top.satv.admit.rate,
            scnd.satv.admit.rate,
            thrd.satv.admit.rate,
            bot.satv.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.5: Admit by SAT Verbal",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    
    ## SAT Total
    summary(admissions.final$sati.total)
    top.satt.q <- subset(admissions.final, admissions.final$sati.total > 1320)
    scnd.satt.q <- subset(admissions.final, admissions.final$sati.total <= 1320 & admissions.final$sati.total >= 1210)
    thrd.satt.q <- subset(admissions.final, admissions.final$sati.total < 1210 & admissions.final$sati.total >= 1070)
    bot.satt.q <- subset(admissions.final, admissions.final$sati.total < 1070)
    top.satt.admit.rate <- sum(top.satt.q$admit == 1)/length(top.satt.q$sati.total)*100
    scnd.satt.admit.rate <- sum(scnd.satt.q$admit == 1)/length(scnd.satt.q$sati.total)*100
    thrd.satt.admit.rate <- sum(thrd.satt.q$admit == 1)/length(thrd.satt.q$sati.total)*100
    bot.satt.admit.rate <- sum(bot.satt.q$admit == 1)/length(bot.satt.q$sati.total)*100
    top.satt.admit.rate
    scnd.satt.admit.rate
    thrd.satt.admit.rate
    bot.satt.admit.rate
    plot(c( top.satt.admit.rate,
            scnd.satt.admit.rate,
            thrd.satt.admit.rate,
            bot.satt.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.5: Admit by SAT Total",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    
    ## Par -- Graph all 4
    par(mfrow=c(2,2))
    plot(c(top.gpa.admit.rate,
           scnd.gpa.admit.rate,
           thrd.gpa.admit.rate,
           bot.gpa.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.3: Admit by GPA",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    plot(c(top.satm.admit.rate,
           scnd.satm.admit.rate,
           thrd.satm.admit.rate,
           bot.satm.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.4: Admit by SAT Math",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    plot(c( top.satv.admit.rate,
            scnd.satv.admit.rate,
            thrd.satv.admit.rate,
            bot.satv.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.5: Admit by SAT Verbal",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    plot(c( top.satt.admit.rate,
            scnd.satt.admit.rate,
            thrd.satt.admit.rate,
            bot.satt.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.5: Admit by SAT Total",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    
    ## Correlation 
    ##SAT M and SAT V
    cor(admissions.final$sati.math, admissions.final$sati.verb)
    ## SAT T and GPA
    cor(admissions.final$sati.total, admissions.final$gpa.wtd)
    
##### Multivariate Analysis (RF)
  ## No Misclassification costs
    # random forests
    rf1<-randomForest(admit~anglo+asian+black+other+gpa.wtd+sati.verb+sati.math+sati.total+income+sex,
      data=admissions.final,importance=T,ntree=750)
    rf1.conf <- rf1$confusion
    rf1.conf
    plot(rf1)
    print(rf1)
  ## Misclassification costs
    ## 4:1
    rf2<-randomForest(admit~anglo+asian+black+other+gpa.wtd+sati.verb+sati.math+sati.total+income+sex,
                      data=admissions.final,importance=T,ntree=750, sampsize=c(1000,1700))
    print(rf2)
    
    rf3<-randomForest(admit~anglo+asian+black+other+gpa.wtd+sati.verb+sati.math+sati.total+income+sex,
                      data=admissions.final,importance=T,ntree=750, sampsize=c(100,2000))
    print(rf3)
    
  ## Prediction Accuracy // Importance Plots
    # Variable Importance Plots
    par(mfrow=c(2,2))
    varImpPlot(rf2,type=1,scale=F,class=1,
               main="Forecasting Importance Plot for Admission
               (Unstandardized)",col="lightgreen",cex=1,pch=19)
    varImpPlot(rf2,type=1,scale=T,class=1,
               main="Forecasting Importance Plot for Admission
               (Standardized)", col="lightgreen",cex=1,pch=19)
    varImpPlot(rf2,type=1,scale=F,class=0,
               main="Forecasting Importance Plot for Rejection
               (Unstandardized)", col="lightgreen",cex=1,pch=19)
    varImpPlot(rf2,type=1,scale=T,class=0,
               main="Forecasting Importance Plot for Rejection
               (Standardized)", col="lightgreen",cex=1,pch=19)
    
    # Partial Dependence Plots
    par(mfrow=c(2,1))
    part1<-partialPlot(rf2,pred.data=admissions.final,x.var=gpa.wtd,
                       rug=T,which.class=1, main="Partial Dependence Plot for GPA on Admission", xlab="GPA",
                       ylab="Centered Log Odds")
    scatter.smooth(part1$x,part1$y,span=1/3,col="blue",pch=19, main="Partial Dependence Plot for GPA on Admission", xlab="GPA",
                   ylab="Centered Log Odds")
    par(mfrow=c(2,1))
    part2<-partialPlot(rf2,pred.data=admissions.final,x.var=sati.total,
                       rug=T,which.class=1, main="Partial Dependence Plot for SAT Total on Admission", xlab="SAT Score",
                       ylab="Centered Log Odds")
    scatter.smooth(part1$x,part1$y,span=1/3,col="blue",pch=19, main="Partial Dependence Plot for SAT Total on Admission", xlab="SAT Score",
                   ylab="Centered Log Odds")
---
title: "STAT974 Project #2: Applying RF to College Admissions Data"
output: html_notebook
---
##### By: Jeffrey Ellington
##### 9:00am Mon-Wed Section
##### 4/19/17

### 1.  Introduction: Assessing College Admissions Data
High school guidance counselors use college admissions data to properly advise high school seniors on where to apply to college. Studying the generally available summary-statistic level admissions data is one source of valuable insight into college selection criteria and the admissions process. However, summary-statistic level data is a crude tool for providing any specific student with insight into his or her individual chance of being admitted to a particular university. Using individual level admissions data is a potentially useful method to bolster a guidance counselor's ability to provide tailored guidance to each student. Unfortunately, gathering individual level data from the over 2,000 colleges in the USA is expensive, if not impossible. As a result, we will examine individual level admissions outcome data from one 'elite' US university. This information can be analyzed to help advise students on their prospects of being admitted to this particular elite university (at least if they had theoretically applied in the given year from which our data was collected). Additionally, we will explore if it's responsible to extrapolate these findings to other years at this specific university and potentially to other similarly elite universities. 

### 2.  Data Description
At first glance, it is unclear how the data were collected. The 8,700 observations were obtained from last year's applicants at a particular 'elite' university. It's not specified whether this is a sample of the data or the complete set of observations. Quickly reviewing publicly available Class of 2020 admissions data from the eight Ivy League schools reveals the range in the number of applicants stretches from 20,675 (Dartmouth) to 44,966 (Cornell). The average number of applicants across the eight schools 33,923. Given this, it is almost certain our 8,700 observations are a sample from a broader set of observations at this particular school. We can't assume that this sample is representative of the presumed broader applicant pool. Finally, the data violate the IID assumption: admitting one applicant has an effect on subsequent applicants given there is a relatively fixed class size and composition. For our treatment, we will assume the data were **not** generated via a random process. Instead, for our treatment, we will examine the 8,700 observations of nine variables assuming the data were not collected randomly. The nine variables are:

    1. admit — Coded “1” for admit and “0” for reject.
    2. anglo — Coded “1” for anglo and “0” otherwise.
    3. asian — Coded “1” for asian and “0” otherwise.
    4. black — Coded “1” for black and “0” otherwise.
    5. gpa.wtd — high school GPA weighted for AP courses. 
    6. sati.verb — SAT verbal score.
    7. sati.math — SAT math score.
    8. income — household income capped at $100,000. 
    9. sex — Coded “1” for male and “0” for female.
Admit is the response variable. It is a binary variable ("1" for admit / "0" for reject). Admit results were collected for the 8,700 applicants, where each individual result is assumed to be the outcome of a considered admissions review process (i.e. not random). The tally for admissions results can be seen below in Fig. 2.1. 2,686 of the 8,700 students were admitted, reflecting a 30.87% acceptance rate for our specific sample. Note that this does **not** imply that 30.87% is the overall acceptance rate for this particular university.

The other eight variables will be examined in much greater detail once they've been properly cleaned. However, it's worth stating up front that while these eight variables seem important in an analysis of admissions, they are far from complete. An expert on university admissions, especially at elite schools, could likely point to several other important missing variables. A few examples might include athletic or artistic talent, legacy relationships, service or entrepreneurial recognition, and/or the quality of recommendation letters. There are countless others. In addition, we also have no framework for existing individual admissions decisions/policies at this particular "elite" university, which may impact the analysis. Whatever our estimate of the mean, we can be assured it will be misspecified without greater information. With that caveat stated, there are still many avenues of analysis available. (Administrative Note: while all of the variables are lower-case in the data, I will refer to them via capitalization to make it clear we are referencing a variable (e.g. "admit" will be written "Admit")).

```{r}
ggp <- ggplot(data.frame(admissions),aes(x=admit), aes(y=count))
# counts
ggp + geom_bar(fill="lightgreen") + labs(x = "0=Rejected, 1=Accepted", y = "Count", title = "Fig 2.1 Admissions Results", subtitle = "Acceptance Rate = 30.87%")
```

### 3.  Data Cleaning
Before examining the relationships between Admit and the other eight variables, we will first need to clean and prepare the data. Reviewing the summary statistics for the nine variables below, a number of issues immediately pop out.
```{r}
summary(admissions)
```
After systematically reviewing the data, eight data cleaning/transformation issues are apparent. 

1. The response variable, Admit, is encoded as an integer/numeric variable (as opposed to factor). We will need to co-erce it to a factor variable to ensure our future analysis uses the proper statistical techniques. We will also do this for all other categorical variables (Anglo, Asian, Black, and Sex). 
2. It's immediately apparent Anglo has extraneous values. We will transform "white" to "1"" for observation #200 (and recede "NA" to 0 for Asian/Black in this case) and transform from "10" to "1" for observation #10, given Black and Asian are both 0. It's worth noting this evidence for observation #10 is not conclusive, but most likely correct. Additionally, Anglo was initially input as character data, so we transformed to factor type. 
3. Turning our attention to the treatment of the categorical variables associated with race (Angle, Asian, Black), it's quickly clear that a subset of observations (1288) are coded "0" for all three. This could reflect missing data. However, given there are 828 actual NA values, it's more likely that this implies another category of race. We will add "Other" as another factor variable for these 1288 observations (note: used a cute mathematical transformation to create the factor!). 
4. The 828 remaining missing values (NAs) for race proved to be perplexing. Thankfully, the NAs were systematically placed in the same rows among Anglo, Asian, and Black (as opposed to be randomly scattered). The missing values represented ~10% of the total data set, yet there was no obvious way to properly use the data. Given our response is Admit, it seemed logical to first check the overall sample population admittance rate with the sub-population rate of NAs across Anglo, Asian, and Black. The results were similar: 30.87% vs. 33.69&, respectively. The NA's were being admitted at a slightly higher rate. While the difference is likely material, not to mention cutting 10% of the data set is really painful, I didn't know how to code a good transformation of the data. Simply imputing the mode would have treated the NA's as Asian. Intuitively this felt wrong. Ideally, I would have been able to regress all the other variables to predict race; however, I did not have the R background to complete this. Ultimately, I cut the 828 NA's.
5. In addition, I removed the two remaining NA's from Sex. This was less than 0.01% of the data, so felt immaterial.
6. Income had a max value encoded as $99,999. Given 25% was coded as $99,999, we will treat this as a capped value and not an artifact/mis-reported statistic implying some type of missing value. Furthermore, Income had 1724 missing values coded as NAs. 1724 observations was certainly too many to strike from the analysis. Again, it made sense to compare the overall sample population admittance rate with the sub-population rate of NAs in Income. The results were similar: 30.87% vs. 33.58&, respectively. Missing values for Income are associated with a slightly higher acceptance rate. Without more information, it seemed the best option was to impute the NAs as the median value of $65k. While not perfect, I saw no better way to include these values without adding noise. This is admittedly less than perfect, as there could be many reasons beyond randomly missing data that NAs exist. Perhaps the applicant wants to hide that they are unable to pay for college **or** that they are wealthy and under performing for some reason. There are a multitude of reasonable explanations. 
7. The minimum GPA was initially reported as -3.20. This was simply inverted to 3.20. Additionally, 19 GPAs were entered as 0. Given the small proportion, the values were transformed to the mean of 3.80.
8. Possible SAT scores range from 200-800, which quickly highlighted two errors. One verbal score was entered as 5,240. Given this student was admitted, the score was transformed to the median score of admitted students (650). More perplexing were the 457 0's for both SAT Math and Verbal. Not coincidentally, the 0 values were always paired together, implying some type of systematic effect. The most likely scenarios are the students either a) took the ACT and did not report SAT scores or b) were permitted to be considered for admission without any standardized test scores. No transformations were made to these values, given there is likely signal in the decision not to report. We will examine further later.

As a result of these transformations, the final data set to be used for analysis contains 7,870 observations of 10 variables.

### 4.  Analysis of Univariate Statistics
Starting with merit based predictors, let's examine GPA and SAT scores. GPA is reported on a weighted scale from 0.00 - 5.00, with an actual range of 1.25 to 4.95. The median is 3.85 and the mean is 3.80. The histogram in Fig 4.1 follows a normal distribution, with a slight left-skew. This is good, as it implies there's no artifacts/systematic errors in the data when converting from non-weighted to weighted GPAs. Individual Math/Verbal SAT scores highlight two interesting features of the data. First, as noted during EDA, there are a number of SAT Scores at 0. We're leaving these as is, even though it's impossible to score 0 (range is 200-800). They are indicative of something other than missing data -- likely an active decision to take the ACT or no standardized test at all. We suspect this decision will be informative in the admissions decisions. Second, the individual scores follow roughly normal distributions, though there is a fair bit of right-skew bunching on Math at 800. When combined into an SAT Total score, the distribution becomes almost perfectly normal. This flattening of the right tails in both Math (especially) and Verbal (somewhat) implies that there a significant number of students who do exceptionally well on a particular section, but far less students do exceptionally well on **both** sections. Most importantly, both GPA and SAT Total scores feel representative of what we would expect to see on a national level from a distribution shape. However, the mean/median seem to be right shifted. Looking at all SAT data, the mean score is 1020 for all test takers and 1080 for all college bound students. From a Level I perspective, the sample of students applying to this particular "elite" university are above average nationally. Moving to a Level II analysis, it seems fair to extrapolate that there is a self-selection bias for better students to apply to better universities. We would expect this trend to hold across other elite universities and across other years. 
```{r}
ggp.v3 <- ggplot(admissions.final, aes(gpa.wtd)) + geom_histogram(fill="lightgreen") + labs(x = "GPA (Weighted)", y = "Count", title = "Fig 4.1: GPA Histogram", 
                                                    subtitle = "Median GPA = 3.85")
  ggp.v4 <- ggplot(admissions.final, aes(sati.math)) + geom_histogram(fill="lightgreen") + labs(x = "SAT Math", y = "Count", title = "Fig 4.2: SAT Math", 
                                                    subtitle = "Median Math SAT = 630")
  ggp.v5 <- ggplot(admissions.final, aes(sati.verb)) + geom_histogram(fill="lightgreen") + labs(x = "SAT Verb", y = "Count", title = "Fig 4.3: SAT Verb", 
                                                    subtitle = "Median Verb SAT = 580")
  ggp.v6 <- ggplot(admissions.final, aes(sati.total)) + geom_histogram(fill="lightgreen") + labs(x = "SAT Total", y = "Count", title = "Fig 4.4: SAT Total", 
                                                    subtitle = "Median Total SAT = 1210")
  grid.arrange(ggp.v3, ggp.v4, ggp.v5, ggp.v6, nrow=2, ncol=2)
```

Looking at demographic data, we see a few interesting facets of the data. There are roughly 17% more female applications than male. This could certainly be a result of our specific sample. From a Level II perspective, however, this wouldn't be surprising. On almost every level (applications, enrollment, graduation, etc.) there are more female undergraduates than male represented. Given this, it feels reasonable to treat this male/female applicant count gap as representative of the broader population. Race, however, paints a different picture. Black students made up 356 (4.5%) of total applicants. Whether this is a Level I sampling anomaly or Level II broader representation of the college applicant pool is difficult to say. Looking at 2015 applications to "elite"" schools, most schools (Penn, Columbia, Yale, Duke, among many others reporting) had black student applicants comprising 10-13% of the applicant pool, between 2-3x our 4.5%. In contrast, UC Berkeley received just 4.9% of its applicants from black students (lowest among "elite" schools). While there is no conclusive evidence, using the best data available seems to indicate the proportion of black students in our sample is systematically low. Looking at Income, we included two histograms: the original (with 1724 NAs omitted) and the final with the 1724 NAs imputed to the mean. This obviously skewed the distribution, but seemed the "fairest" way to do so without adding noise. The distribution is right skewed, with about 25% of population making more than $100k. Among families with incomes under $100k, the distribution is fairly uniform.
```{r}
par(mfrow=c(2,2))
      barplot(admit.apps.comp.gen, col = "lightgreen",
            xlab = "Gender: Male, Female",
            ylab = "Count",
            main = "Fig 4.5: Applicant Count by Gender",
            ylim = c(0, 5000))
      barplot(admit.apps.comp, col = "lightgreen",
              xlab = "Applicants: Anglo, Asian, Black, Other",
              ylab = "Count",
              main = "Fig 4.6: Applicant Count by Race",
              ylim = c(0, 4000))
      hist(admissions.final$income, col = "lightgreen",
           xlab = "Family Income ($)",
           ylab = "Count",
           main = "Fig 4.7: App Count by Income (Imputed)",
           ylim = c(0, 3000))
      hist(admissions$income, col = "lightgreen",
           xlab = "Family Income ($)",
           ylab = "Count",
           main = "Fig 4.8: App Count by Income (Original",
           ylim = c(0, 3000))
```

### 5. Analysis of Bivariate Statistics
Recreating the histogram (Fig 5.1) of the response variable (Admit) from the final data set, we see the acceptance rate has changed minorly from 30.87% to 30.58%. From here we will turn our attention to uncovering what features of the data set are predictive for admissions.
```{r}
ggp <- ggplot(data.frame(admissions.final),aes(x=admit), aes(y=count))
# counts
ggp + geom_bar(fill="lightgreen") + labs(x = "0=Rejected, 1=Accepted", y = "Count", title = "Fig 5.1 Admissions Results (Final Dataset)", subtitle = "Acceptance Rate = 30.58%")
```

Based on an intuitive first pass, at least for anyone who's spent time in the college admissions process, an immediate question pops up: for an "elite" school, is 30.58% a reasonable acceptance rate? Turning back to the publicly available Class of 2020 Ivy League information, acceptance rates ranged from 3.4% (Harvard) to 12.5% (Cornell). Boston College, my Alma mater, has an acceptance rate of 29% and is ranked around #30 nationally. So 30% is either systematically low for a truly "elite" college (and implies we have a skewed sample) or is representative of a school ranked around #30. How we decided to interpret will have implications on whether we can perform a Level I or II analysis.

Let's turn to demographics. First, looking at race, we can see in Fig 5.2 that there is disparity between Anglo/Asian admit rates (32%) and Black/Other admit rates (24%). This information challenges, at least for this particular sample, an often repeated quip that it's easier for underrepresented minority students to get into elite universities than Anglo or Asian students. This will be interesting to explore further in the multi-variate analysis once interaction effects like income, SAT, and/or GPA are considered. Examining income, let's start by splitting the data in to two buckets: lower/middle-class (income less than $99,999) and affluent (income greater than $99,999). The admission rates are pretty similar: 30.1% and 32.1%, respectively. While affluent applicants are slightly favored in this data set, it's not nearly as startling as the racial disparities. Finally, looking at gender, we see similar results to income: a small disparity between the male admission rate (29.8%) and the female admission rate (31.3%). Whether this is strictly a Level I sample artifact or a Level II generalization about college admissions is difficult to say based on this data alone and somewhat irrelevant given the disparity is minimal.

```{r}
barplot(admit.rates.race.hold*100, col = "lightgreen",
            xlab = "Admit Rate: Total(30.6%), Anglo(32.5%), Asian(32.0%), Black(24.4%), Other(24.4%)",
            ylab = "Percent",
            main = "Fig 5.2: Admit Rate by Race",
            ylim = c(0, 40))
```

Merit based predictors prove to be more illustrative. We can look at how GPA, SAT Math, SAT Verbal, and SAT Total impact admissions decisions. Given we are working in the wrong model perspective and not focused on causality, let's focus on visualizing the data. We can look at admissions rates for each of the 4 predictors in quartiles in Figs 5.3-6. It's abundantly clear that GPA, SAT M, SAT V, and SAT T all greatly effect admissions decision in the sample population. Applicants from the Top quartile in any given category are 7-24x more likely to be admitted from those from the bottom. There is clearly co-linearity between these predictors, bot explicitly (SAT M + V = SAT Total) and implicitly (good SAT takers probably on average have higher GPAs). We can see this empirically. The correlations between SAT Math & SAT Verbal (0.85) **and** SAT Total and GPA (0.33) are printed below. 
```{r}
    cor(admissions.final$sati.math, admissions.final$sati.verb)
    cor(admissions.final$sati.total, admissions.final$gpa.wtd)
```

```{r}
 par(mfrow=c(2,2))
    plot(c(top.gpa.admit.rate,
           scnd.gpa.admit.rate,
           thrd.gpa.admit.rate,
           bot.gpa.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.3: Admit by GPA",
         pch = 16,
         cex = 1.25,
         col = "lightgreen")
    plot(c(top.satm.admit.rate,
           scnd.satm.admit.rate,
           thrd.satm.admit.rate,
           bot.satm.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.4: Admit by SAT Math",
         pch = 16,
         cex = 1.25,
         col = "lightgreen")
    plot(c( top.satv.admit.rate,
            scnd.satv.admit.rate,
            thrd.satv.admit.rate,
            bot.satv.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.5: Admit by SAT Verbal",
         pch = 16,
         cex = 1.25,
         col = "lightgreen")
    plot(c( top.satt.admit.rate,
            scnd.satt.admit.rate,
            thrd.satt.admit.rate,
            bot.satt.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.5: Admit by SAT Total",
         pch = 16,
         cex = 1.25,
         col = "lightgreen")
```

### 6.  Analysis of Multivariate Statistics
Finally, we get to the Random Forest (RF) Analysis: our principle topic of inquiry. Before we start, it's important to lay out a few statistical concepts. First, as noted in *Section 2: Data Description*, our classification using RF will lack many potentially explanatory variables. In short, we have the 11 predictors we've observed, but there is a lot more information required to specify the true mean function. Given that, we will still be OK moving ahead at Level I and potentially Level II analysis, but we will not be able to draw conclusions at Level III. Second, in *Section 5.  Analysis of Bivariate Statistics*, we pointed out that it's unwise to imply causality with the limited data we're analyzing. Now, we will completely depart from the idea of regression coefficients when using RF. The transformations and interaction effects involved in creating the model are sufficiently complex that there is not even a coefficient type analogy to consider.

To evaluate our model, we will allow the algorithm to split the data into training data and test data for us. The RF algorithm does this automatically by training on a subset of observations and then testing the quality of the decision tree it creates on the Out-of-Bag (OOB) test data. This framework is a good way to prevent over fitting, especially given it is averaged over a large number of trees. Given we have almost two orders of magnitude more observations than base predictors, the model should be stable. 

#### 6.1: No Misclassification Costs
For our initial model, we grow a random forest of 750 trees. Based on the plot below in Fig. 6.1, we can see the model improvement tapers off substantially by 50, and is almost flat by 200. 750 trees should capture plenty signal without wasting compute resources. The misclassification error rate is 15.91%, which can seen below in the model print out, showing OOB error rate and the confusion table. 15.91% seems reasonable at first pass given the amount of missing data in our model. We'd expect the error rate to drop if we could inform the model of everyone's other talents, service, recommendation letters, and myriad other criteria. What's potentially more interesting is the comparison of false negatives to false positives. The model had sensitivity of 66.9% and specificity of 91.6%. Our model preferred to predict someone would **NOT** be accepted than would be. The implied cost ratio was a bit less that 1:2. It forecasted 26.2% of applicants would be accepted (compared to the actual 30.6%). As a result, 796 students who would have been admitted would have been discouraged from applying. It's worth tuning the false positive to false negative cost ratio in our model to promote a more useful prediction; after-all, we don't want to discourage every high school senior during what is already an incredibly exhausting and trying time.
```{r}
plot(rf1, main = "Fig 6.1: RF1 Plot")
```
```{r}
print(rf1)
```

#### 6.2: Setting Misclassification Costs
For our next model, let's build in a target 4:1 cost ratio. This essentially says that we want the model to favor false positives at 4x the rate it values false negatives. This is going to invariably degrade precision: we will be wrong more often when predicting "Yes". However, the motivation seems just. As counselors, we can either inflate our vanity metrics of saying our students got into their "Top" school more often **OR** we can sacrifice our accuracy by pushing students to apply to better schools that they really do have a fighting chance to get into. I'd certainly rather strive for excellence and come up short sometimes than settle for mediocrity and spending a life wondering "what if". Looking at the results, we achieved a 3.8:1 cost ratio. As a cost, the OOB error rate rose to 20.27%. What's valuable here from a guidance counselor perspective is that the model is still accurately predicting acceptances at 62%. I'd be shocked if most students did **not** apply to an "elite" school (assuming they liked the other factors of the school) with a 62% chance of being admitted. On the flip side, when we tell a student that the model is not optimistic about their chances, it's right 93% of the time. Not too many people will hang their hats on a 1:14 chance. For fun, let's presume a student with average risk-appetite will apply to a school if there's 1/3 chance of being accepted. What does that cost ratio look like? The cost ratio is about 155:1. Interestingly, though, the false negative error rate only gets cut by about 50%. This is a poor trade off to make. 
```{r}
print(rf2)

print(rf3)
```

#### 6.3: Importance Plots and Prediction Accuracy
Looking at the importance plots below in Fig 6.2 and Fig 6.3, reminds us of the previous observations in the bivariate analysis. In descending order, the most important predictors are: GPA, SAT Total, SAT Verbal, and SAT Math. On an unstandardized basis, what this chart is saying is that when a given variable is shuffled, we lose the corresponding amount of predictive power (OOB classification error, formally). For example, shuffling GPA.wtd, the most important predictor causes a subsequent drop in over 25% of predictive power. From a Level I perspective, it's interesting that GPA and SAT Total were most predictive and also followed the most normal distributions. The broader range of possible results (0.00 - 5.00 or 400 - 1600) and variance among results allowed the random forest algorithm to find incremental signal in a way that is very hard to do with a limited set of indicator variables. 
```{r}
 par(mfrow=c(1,2))
    varImpPlot(rf2,type=1,scale=F,class=1,
               main="Importance Plot for Admission
               (Unstandarized)",col="lightgreen",cex=1,pch=19)
    varImpPlot(rf2,type=1,scale=T,class=1,
               main="Importance Plot for Admission
               (Standardized)", col="lightgreen",cex=1,pch=19)
  par(mfrow=c(1,2))
    varImpPlot(rf2,type=1,scale=F,class=0,
               main="Importance Plot for Rejection
               (Unstandardized)", col="lightgreen",cex=1,pch=19)
    varImpPlot(rf2,type=1,scale=T,class=0,
               main="Importance Plot for Rejection
               (Standardized)", col="lightgreen",cex=1,pch=19)
```

#### 6.4: Partial Dependence Plots
The partial dependence plots below in Fig 6.4 and Fig 6.5 show the relationships between the GPA/SAT Total and the centered log odds of admission. Both GPA and SAT Score are numeric with normal distributions. The chances of admission increase linearly with GPA from 3.3 until 4.5, at which point the data becomes thin and unstable. For SAT Scores the relationship is slightly different, growing near exponentially from around 1100 until 1300, before settling into a liner relationship and then plateauing at around 1400. High school students with GPAs around 4.5 and SAT Scores above 1400 have the highest chance of admission.  The slight declines at the extreme right tails are probably best explained by sparse data. An astute college admissions specialist might have domain knowledge that could explain differently: perhaps students with 5.0 GPAs and 1600 SATs are lacking in other characteristics. There's not enough evidence here to say.
```{r}
par(mfrow=c(2,1))
    part1<-partialPlot(rf2,pred.data=admissions.final,x.var=gpa.wtd,
                       rug=T,which.class=1, main="Fig 6.4: Partial Dependence Plot for GPA on Admission", xlab="GPA",
                       ylab="Centered Log Odds")
    scatter.smooth(part1$x,part1$y,span=1/3,col="lightgreen",pch=19, main="Fig 6.4: Partial Dependence Plot for GPA on Admission", xlab="GPA",
                   ylab="Centered Log Odds")
    par(mfrow=c(2,1))
    part2<-partialPlot(rf2,pred.data=admissions.final,x.var=sati.total,
                       rug=T,which.class=1, main="Fig 6.5: Partial Dependence Plot for SAT Total on Admission", xlab="SAT Score",
                       ylab="Centered Log Odds")
    scatter.smooth(part1$x,part1$y,span=1/3,col="lightgreen",pch=19, main="Fig 6.5: Partial Dependence Plot for SAT Total on Admission", xlab="SAT Score",
                   ylab="Centered Log Odds")
```

### 7.  Summary and Conclusions
Our final model used 750 trees and a cost ratio of 3.8:1. This lead to a 20.27% OOB Classification error rate, which we accepted due to it's willingness to predict more students would be admitted, even if at lower precision. This was a policy decision driven by a desire to encourage high quality students to apply to an "elite" university. We can now summarize our findings and conclusions. There are three core issues I'd like to address:

1. From a data collection perspective, this data set is less than ideal for a few reasons. First, we are unsure whether it is a complete sample of one year's admission results from an elite university or just a sample of results. The limited number of observations compared to a normal elite  university application count implies the data is a sample. From there, we don't know if was generated randomly or as a biased selection. Furthermore, college admissions data is tricky in that is not fully independent. There are a relatively fixed number of seats and similar students compete over the same seat. This destroys the independence assumption -- all previous decisions impact the subsequent ones in a given year (as clearly seen comparing early-decision admittance rates and regular-decision admittance rates). For the remainder of this discussion, we will assume the data were collected as a biased sample from a larger set of admissions observations. Luckily, gender, sex, and income were very minimally, if at all, predictive in our sample set.
2. From a level I perspective, we can see the data are complex with many missing features. Addressing only these 7870 observations, we cannot approach the true mean specification. That said, we can directionally measure direction and importance of select predictors. For example, GPA and STAT Total appeared to be both correlated with each other and correlated with higher rates of admission, which we observed in the bivariate analysis as well as partial dependence and importance plots. At the far right of both GPA and SAT Scores, the distributions become thin and unstable. Most of the apparent reduction in admission rates there can be attributed to instability. The Random Forest algorithm proved to be a powerful classification tool, performing with an OOB classification error rate of 15.91% when unrestricted in cost ratio. Even after adding a 3.8:1 cost ratio, the algorithm slid 4.5% percentage points on OOB classification error (20.27%); a small trade-off for a much more useful predictive tool.
3. From a Level II perspective, which despite the data collection issues, I believe we can justify, we can extrapolate some findings that could prove useful moving forward in real world applications for our high school students. I believe we can justify Level II analysis for two principle reasons. First, our model found the most predictive variables to be Weighted GPA and Total STAT Score, which are continually reported to be critical factors in college admissions. Rather than these findings being an artifact of our limited sample, they seem to be indicative of a much broader trend. Second, our data covered a broad set of values. Both GPA and STAT Total formed nearly perfectly normal distributions. Income showed a relative uniformity, aside from the spike at $99,999 (and subsequent spike at $65,000 from median imputation). Anglo, Asian, Other, and Sex were well represented. The one sparsity was around Black students. It would be wise to exercise caution in extrapolating the model for Black students without further data. If these distributions were sparser or highly skewed in illogical ways, I'd be much more cautious to extend to Level II analysis. This raises the question: how should we actually use the model? The answer is: it depends. As my recommendation, the model should be used sparingly overall. The model is best fit for the following situation: an 1) Anglo or Asian student who has 2) taken the SAT and is 3) applying to a University with a ~30% acceptance rate. If the student is Black, the school is substantially easier or harder to get into, the student is missing SAT data, or the time period has shifted significantly from when these data were collected, the model could be wildly less predictive. My recommendation would be to identify students at my high school that fit this profile, use the algorithm to predict their results as a pilot, and then evaluate the true predictive results vs. the OOB data. All the while, I'd be pushing to gather more robust data. As it stands today, the predictive power is for a limited scope of student applying to a fairly specific type of school. 


### Appendix: R Code
```{r}
setwd("/Users/jeffreyellington/Dropbox/Wharton/Spring2017/STAT974/data/")
###### first glance at data
summary(admissions)
str(admissions)

## Plot Admissions histogram
library(ggplot2)
ggp <- ggplot(data.frame(admissions),aes(x=admit), aes(y=count))
# counts
ggp + geom_bar(fill="lightgreen") + labs(x = "0=Rejected, 1=Accepted", y = "Count", title = "Fig 2.1 Admissions Results", 
              subtitle = "Acceptance Rate = 30.87%")
## Calc acceptance rate
sum(admissions$admit)/length(admissions$admit)*100


##### EDA
## Find Weird Values for Categorical Variables, if any
str(admissions)
unique(admissions$anglo)
    ## Anglo, Misc "10"/"White" --> recode as "1" (and remove NA's for obs #200)
    admissions[[200,2]]=1
    admissions[[200,3]]=0
    admissions[[200,4]]=0
    admissions[[10,2]]=1
unique(admissions$asian)
unique(admissions$black)
unique(admissions$sex)

## Factor/Numeric variable transformations
admissions$admit<-as.factor(admissions$admit)
admissions$anglo<-as.factor(admissions$anglo)
admissions$asian<-as.factor(admissions$asian)
admissions$black<-as.factor(admissions$black)
admissions$sex<-as.factor(admissions$sex)
admissions$income<-as.numeric(admissions$income)
admissions$sati.math<-as.numeric(admissions$sati.math)
str(admissions)
summary(admissions)

## GPA Values
  ## Check if GPA is normally distributed (verifying no systematic deviances due to mis-haps in weighting)
  hist(admissions$gpa.wtd) ## looks normal without any systematic deviance // bi-modalities
  ## Invert negative value from -3.20 to 3.20
  admissions[[2000,5]]=3.20
  ## Make 0.00 GPA = Mean/Median GPA
  summary(admissions$gpa.wtd)
  ## Median = 3.85, Mean = 3.79, settled on 3.80, assuming mean was slightly pulled down by the 19 GPA=0.00 values
  admissions[[382,5]]=3.80
  admissions[[560,5]]=3.80
  admissions[[903,5]]=3.80
  admissions[[1900,5]]=3.80
  admissions[[3081,5]]=3.80
  admissions[[3146,5]]=3.80
  admissions[[3541,5]]=3.80
  admissions[[3816,5]]=3.80
  admissions[[3925,5]]=3.80
  admissions[[4034,5]]=3.80
  admissions[[4486,5]]=3.80
  admissions[[4576,5]]=3.80
  admissions[[5140,5]]=3.80
  admissions[[5462,5]]=3.80
  admissions[[6604,5]]=3.80
  admissions[[7131,5]]=3.80
  admissions[[7759,5]]=3.80
  admissions[[7938,5]]=3.80
  admissions[[8061,5]]=3.80
      #Don't laugh... couldn't get function to work :-(
  
## Income Max Encoded at 99,999
  sum(admissions$income >= 99999, na.rm = TRUE)
  2164/8700 *100
  ## 2164 at max income (24.87%)
  ## Seems OK to leave as ... 25% making more than $100k is reasonable for this pool
  
## SAT Max Score Error
  ## Max SAT Verbal = 5240; regressing to the median of admitted students (given they were admitted)
  admitted <- subset(admissions, admissions$admit == 1)
  summary(admitted$sati.verb) ## median = 650
  admissions[[20,6]]=650 


## Find NA's and remove if sensible
  summary(admissions)
  ## Sex NAs (24 total, 22 have tons of other missing values, striking from data set)
  which(is.na(admissions$sex))
  admissions.v1 <- admissions[-c(353, 379, 443, 762, 799, 1321, 1780, 4451, 4486, 4681, 5197, 5229, 5558, 5792, 6027, 6284, 6323, 6426, 6465, 6762, 6921, 7505, 7934, 8274),]
  
  ## NA for Income (1724, too big to strike)
  sum(is.na(admissions$income))
    ## check if systematic deviance from normal acceptance rate (579 Admitted out of 1724 Total)
    579/1724 *100 ## 33.58% vs. 30.87% ... seems omitted income info is slightly predictive, but not much. Regessing to median
    admissions.test <- admissions.v1 
    samp<-na.roughfix(admissions.test$income, median) ## impute NAs to median value
    admissions.test$income <- samp ## replace NAs with median value
    sum(admissions.test$income==65000) - sum(admissions.v1$income==65000, na.rm = TRUE) #1711, correct value
    admissions.v2 <- admissions.test ## create new dataframe with imputed values
    summary(admissions.v2)
    
  ## NA for Race (828)
  sum(is.na(admissions$asian))
  sum(is.na(admissions$black))
  sum(is.na(admissions$anglo))
    ## check if systematic deviance from normal acceptance rate
    summary(admissions) #828
    summary(admitted) #279
    279/828 * 100 ## 33.69% vs. 30.87% ... seems omitted race is slightly predictive, but not much
    ## can't regress to mean, the others are likely too predictive...
    ## striking from dataset... 10% hurts, but it's the best solution I see
    admissions.v3 <- subset(admissions.v2, (!is.na(admissions.v2[,2])) & (!is.na(admissions.v2[,3])) & (!is.na(admissions.v2[,4])))
    summary(admissions.v3)
    

## Is there another race category?
  ## Race Counts
  sum(is.na(admissions.v3)) ## No NA's in dataset, 7870 observations
  ## White / Asian / Black
  sum(admissions.v3$anglo==1) + sum(admissions.v3$asian==1) + sum(admissions.v3$black==1) ## 6582
  7870 - 6582 #1288
  ## Another level? or Multiple?
  sum(admissions.v3$anglo==0 & admissions.v3$asian==0 & admissions.v3$black==0) #1288, checks out
    ## Create 4th level for race [no idea how to do this with functions, so why not use some cute math :-)...]
    admissions.v3$other <- as.numeric(admissions.v3$anglo) + as.numeric(admissions.v3$asian) + as.numeric(admissions.v3$black)
    admissions.v3$other <- admissions.v3$other - 4
    admissions.v3$other <- admissions.v3$other * -1
    admissions.v3$other <- as.factor(admissions.v3$other)
  

## More SAT Issues, 0 values
  ## SAT Math = 0
  sum(admissions.v3$sati.math == 0)
  ## 408
  
  ## SAT Verb = 0
  sum(admissions.v3$sati.verb == 0)
  ## 408
  
  ## Not a coincidence --> Likely took the ACT
    ## check if systematic deviance from normal acceptance rate
    admitted.v3 <- subset(admissions.v3, admissions.v3$admit == 1)
    summary(admitted.v3)
    2407/7870 * 100 ## subset base admissions rate = 30.58% (negliblby different from original 30.87%)
    sum(admitted.v3$sati.math == 0)/sum(admissions.v3$sati.math == 0) * 100 ## 10.05% --> VERY DIFFERENT, need to consider
    sum(admitted.v3$sati.verb == 0)/sum(admissions.v3$sati.verb == 0) * 100 ## sanity check, verbal == math
    sub.test <- subset(admissions.v3, admissions.v3$admit == 1 & admissions.v3$sati.verb == 0)
    summary(sub.test)
    sub.test2 <- subset(admissions.v3, admissions.v3$sati.verb == 0) 
    sort(sub.test2$gpa.wtd)

## Final Dataset
    admissions.final <- admissions.v3

##### Univariate Analysis
  ## GPA
  ## Check if GPA is normally distributed (verifying no systematic deviances due to mis-haps in weighting)
  ggp.v3 <- ggplot(admissions.final, aes(gpa.wtd)) 
  ggp.v3 + geom_histogram(fill="lightgreen") + labs(x = "GPA (Weighted)", y = "Count", title = "Fig 4.1: GPA Histogram", 
                          subtitle = "Median GPA = 3.85")## looks normal without any systematic deviance // bi-modalities
  ## SAT Scores
  ## Math
  summary(admissions.final$sati.math)
  ggp.v4 <- ggplot(admissions.final, aes(sati.math)) 
  ggp.v4 + geom_histogram(fill="lightgreen") + labs(x = "SAT Math", y = "Count", title = "Fig 4.2: SAT Math", 
                          subtitle = "Median Math SAT = 630")
  ## Verb
  summary(admissions.final$sati.verb)
  ggp.v5 <- ggplot(admissions.final, aes(sati.verb)) 
  ggp.v5 + geom_histogram(fill="lightgreen") + labs(x = "SAT Verb", y = "Count", title = "Fig 4.3: SAT Verb", 
                                                    subtitle = "Median Verb SAT = 580")
  ## Combined
  ## Create new combined/total SAT variable
  admissions.final$sati.total <- admissions.final$sati.math + admissions.final$sati.verb
  summary(admissions.final$sati.total)
  ggp.v6 <- ggplot(admissions.final, aes(sati.total)) 
  ggp.v6 + geom_histogram(fill="lightgreen") + labs(x = "SAT Total", y = "Count", title = "Fig 4.4: SAT Total", 
                                                    subtitle = "Median Total SAT = 1210")
  ## combine 
  library(gridExtra)
  ggp.v3 <- ggplot(admissions.final, aes(gpa.wtd)) + geom_histogram(fill="lightgreen") + labs(x = "GPA (Weighted)", y = "Count", title = "Fig 4.1: GPA Histogram", 
                                                    subtitle = "Median GPA = 3.85")
  ggp.v4 <- ggplot(admissions.final, aes(sati.math)) + geom_histogram(fill="lightgreen") + labs(x = "SAT Math", y = "Count", title = "Fig 4.2: SAT Math", 
                                                    subtitle = "Median Math SAT = 630")
  ggp.v5 <- ggplot(admissions.final, aes(sati.verb)) + geom_histogram(fill="lightgreen") + labs(x = "SAT Verb", y = "Count", title = "Fig 4.3: SAT Verb", 
                                                    subtitle = "Median Verb SAT = 580")
  ggp.v6 <- ggplot(admissions.final, aes(sati.total)) + geom_histogram(fill="lightgreen") + labs(x = "SAT Total", y = "Count", title = "Fig 4.4: SAT Total", 
                                                    subtitle = "Median Total SAT = 1210")
  grid.arrange(ggp.v3, ggp.v4, ggp.v5, ggp.v6, nrow=2, ncol=2)
  
  ## Demographics: 
  ## Applicants by Race
  anglo.apps <- sum(admissions.final$anglo==1)
  asian.apps <- sum(admissions.final$asian==1)
  black.apps <- sum(admissions.final$black==1)
  other.apps <- sum(admissions.final$other==1)
  admit.apps.comp <- c(anglo.apps, asian.apps, black.apps, other.apps)
  admit.apps.comp
    ## Plot
    barplot(admit.apps.comp, col = "lightgreen",
          xlab = "Applicants: Anglo, Asian, Black, Other",
          ylab = "Count",
          main = "Fig 4.4: Applicant Count by Race",
          ylim = c(0, 4000))
    
  ## Income Histogram
    hist(admissions.final$income, col = "lightgreen",
         xlab = "Family Income ($)",
         ylab = "Count",
         main = "Fig 4.5: Applicant Count by Income",
         ylim = c(0, 3000))
  ## Original Income Histogram
    hist(admissions$income, col = "lightgreen",
         xlab = "Family Income ($)",
         ylab = "Count",
         main = "Fig 4.5: Applicant Count by Income",
         ylim = c(0, 3000))
    
  ## Gender Histogram
    ## Applicants by Sex
    male.apps <- sum(admissions.final$sex==1)
    female.apps <- sum(admissions.final$sex==0)
    admit.apps.comp.gen <- c(male.apps, female.apps)
    admit.apps.comp.gen
    barplot(admit.apps.comp.gen, col = "lightgreen",
         xlab = "Gender: Male, Female",
         ylab = "Count",
         main = "Fig 4.6: Applicant Count by Gender",
         ylim = c(0, 5000))
  
  ## Combine Plots
    par(mfrow=c(2,2))
      barplot(admit.apps.comp.gen, col = "lightgreen",
            xlab = "Gender: Male, Female",
            ylab = "Count",
            main = "Fig 4.5: Applicant Count by Gender",
            ylim = c(0, 5000))
      barplot(admit.apps.comp, col = "lightgreen",
              xlab = "Applicants: Anglo, Asian, Black, Other",
              ylab = "Count",
              main = "Fig 4.6: Applicant Count by Race",
              ylim = c(0, 4000))
      hist(admissions.final$income, col = "lightgreen",
           xlab = "Family Income ($)",
           ylab = "Count",
           main = "Fig 4.7: Transformed Applicant Count by Income",
           ylim = c(0, 3000))
      hist(admissions$income, col = "lightgreen",
           xlab = "Family Income ($)",
           ylab = "Count",
           main = "Fig 4.8: Original Applicant Count by Income",
           ylim = c(0, 3000))
    
  
###### Bivariate Analysis
  ## Setup / Basic Hist
  summary(admissions.final)
  total.admit.rate <- sum(admissions.final$admit==1)/length(admissions.final$admit)
  ggp <- ggplot(data.frame(admissions.final),aes(x=admit), aes(y=count))
  # histogram
  ggp + geom_bar(fill="lightgreen") + labs(x = "0=Rejected, 1=Accepted", y = "Count", 
                                           title = "Fig 5.1 Admissions Results (Final Dataset)", subtitle = "Acceptance Rate = 30.58%")
  
  ## Demographics (Race/Sex/$)
    ## Admissions Rates by Race
    anglo.admit.rate <- sum(admissions.final$admit==1 & admissions.final$anglo==1)/sum(admissions.final$anglo == 1)
    asian.admit.rate <- sum(admissions.final$admit==1 & admissions.final$asian==1)/sum(admissions.final$asian == 1)
    black.admit.rate <- sum(admissions.final$admit==1 & admissions.final$black==1)/sum(admissions.final$black == 1)
    other.admit.rate <- sum(admissions.final$admit==1 & admissions.final$other==1)/sum(admissions.final$other == 1)
    admit.rates.race.comp <- c(total.admit.rate, anglo.admit.rate, asian.admit.rate, black.admit.rate, other.admit.rate)
    admit.rates.race.comp
      ## Plot
      barplot(admit.rates.race.hold*100, col = "lightgreen",
              xlab = "Admit Rate: Total, Anglo, Asian, Black, Other",
              ylab = "Percent",
              main = "Fig 5.2: Admit Rate by Race",
              ylim = c(0, 40))
    ## Admissions by Income
    income.under.100 <- subset(admissions.final, admissions.final$income <=99998)
    income.over.100 <- subset(admissions.final, admissions.final$income == 99999)
    summary(income.under.100)
    summary(income.over.100)
    income.under.100.admit.rate <- 1769/(1769+4113)
    income.over.100.admit.rate <- 638/(638+1350)
    income.under.100.admit.rate
    income.over.100.admit.rate
    ## Admissions by Gender
    male.apps <- sum(admissions.final$sex==1)
    female.apps <- sum(admissions.final$sex==0)
    male.admits <- sum(admissions.final$sex==1 & admissions.final$admit==1)
    female.admits <- sum(admissions.final$sex==0 & admissions.final$admit==1)
    male.admit.rate <- male.admits/male.apps
    female.admit.rate <- female.admits/female.apps
    male.admit.rate
    female.admit.rate
  
    
  ## Merit (SAT/GPA)
    ## GPA Quartiles
    summary(admissions.final$gpa.wtd)
    top.gpa.q <- subset(admissions.final, admissions.final$gpa.wtd > 4.151)
    scnd.gpa.q <- subset(admissions.final, admissions.final$gpa.wtd <= 4.151 & admissions.final$gpa.wtd > 3.850)
    thrd.gpa.q <- subset(admissions.final, admissions.final$gpa.wtd <= 3.850 & admissions.final$gpa.wtd > 3.500)
    bot.gpa.q <- subset(admissions.final, admissions.final$gpa.wtd <= 3.500)
    top.gpa.admit.rate <- sum(top.gpa.q$admit == 1)/length(top.gpa.q$gpa.wtd)*100
    scnd.gpa.admit.rate <- sum(scnd.gpa.q$admit == 1)/length(scnd.gpa.q$gpa.wtd)*100
    thrd.gpa.admit.rate <- sum(thrd.gpa.q$admit == 1)/length(thrd.gpa.q$gpa.wtd)*100
    bot.gpa.admit.rate <- sum(bot.gpa.q$admit == 1)/length(bot.gpa.q$gpa.wtd)*100
    top.gpa.admit.rate
    scnd.gpa.admit.rate
    thrd.gpa.admit.rate
    bot.gpa.admit.rate
    plot(c(top.gpa.admit.rate,
           scnd.gpa.admit.rate,
           thrd.gpa.admit.rate,
           bot.gpa.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.3: Admit by GPA",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    
    ## SAT Math
    summary(admissions.final$sati.math)
    top.satm.q <- subset(admissions.final, admissions.final$sati.math >= 690)
    scnd.satm.q <- subset(admissions.final, admissions.final$sati.math < 690 & admissions.final$sati.math >= 630)
    thrd.satm.q <- subset(admissions.final, admissions.final$sati.math < 630 & admissions.final$sati.math >= 550)
    bot.satm.q <- subset(admissions.final, admissions.final$sati.math < 550)
    top.satm.admit.rate <- sum(top.satm.q$admit == 1)/length(top.satm.q$sati.math)*100
    scnd.satm.admit.rate <- sum(scnd.satm.q$admit == 1)/length(scnd.satm.q$sati.math)*100
    thrd.satm.admit.rate <- sum(thrd.satm.q$admit == 1)/length(thrd.satm.q$sati.math)*100
    bot.satm.admit.rate <- sum(bot.satm.q$admit == 1)/length(bot.satm.q$sati.math)*100
    top.satm.admit.rate
    scnd.satm.admit.rate
    thrd.satm.admit.rate
    bot.satm.admit.rate
    plot(c(top.satm.admit.rate,
           scnd.satm.admit.rate,
           thrd.satm.admit.rate,
           bot.satm.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.4: Admit by SAT Math",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    
    ## SAT Verb
    summary(admissions.final$sati.verb)
    top.satv.q <- subset(admissions.final, admissions.final$sati.verb >= 650)
    scnd.satv.q <- subset(admissions.final, admissions.final$sati.verb < 650 & admissions.final$sati.verb > 580)
    thrd.satv.q <- subset(admissions.final, admissions.final$sati.verb <= 580 & admissions.final$sati.verb > 500)
    bot.satv.q <- subset(admissions.final, admissions.final$sati.verb <= 500)
    top.satv.admit.rate <- sum(top.satv.q$admit == 1)/length(top.satv.q$sati.verb)*100
    scnd.satv.admit.rate <- sum(scnd.satv.q$admit == 1)/length(scnd.satv.q$sati.verb)*100
    thrd.satv.admit.rate <- sum(thrd.satv.q$admit == 1)/length(thrd.satv.q$sati.verb)*100
    bot.satv.admit.rate <- sum(bot.satv.q$admit == 1)/length(bot.satv.q$sati.verb)*100
    top.satv.admit.rate
    scnd.satv.admit.rate
    thrd.satv.admit.rate
    bot.satv.admit.rate
    plot(c( top.satv.admit.rate,
            scnd.satv.admit.rate,
            thrd.satv.admit.rate,
            bot.satv.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.5: Admit by SAT Verbal",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    
    ## SAT Total
    summary(admissions.final$sati.total)
    top.satt.q <- subset(admissions.final, admissions.final$sati.total > 1320)
    scnd.satt.q <- subset(admissions.final, admissions.final$sati.total <= 1320 & admissions.final$sati.total >= 1210)
    thrd.satt.q <- subset(admissions.final, admissions.final$sati.total < 1210 & admissions.final$sati.total >= 1070)
    bot.satt.q <- subset(admissions.final, admissions.final$sati.total < 1070)
    top.satt.admit.rate <- sum(top.satt.q$admit == 1)/length(top.satt.q$sati.total)*100
    scnd.satt.admit.rate <- sum(scnd.satt.q$admit == 1)/length(scnd.satt.q$sati.total)*100
    thrd.satt.admit.rate <- sum(thrd.satt.q$admit == 1)/length(thrd.satt.q$sati.total)*100
    bot.satt.admit.rate <- sum(bot.satt.q$admit == 1)/length(bot.satt.q$sati.total)*100
    top.satt.admit.rate
    scnd.satt.admit.rate
    thrd.satt.admit.rate
    bot.satt.admit.rate
    plot(c( top.satt.admit.rate,
            scnd.satt.admit.rate,
            thrd.satt.admit.rate,
            bot.satt.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.5: Admit by SAT Total",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    
    ## Par -- Graph all 4
    par(mfrow=c(2,2))
    plot(c(top.gpa.admit.rate,
           scnd.gpa.admit.rate,
           thrd.gpa.admit.rate,
           bot.gpa.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.3: Admit by GPA",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    plot(c(top.satm.admit.rate,
           scnd.satm.admit.rate,
           thrd.satm.admit.rate,
           bot.satm.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.4: Admit by SAT Math",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    plot(c( top.satv.admit.rate,
            scnd.satv.admit.rate,
            thrd.satv.admit.rate,
            bot.satv.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.5: Admit by SAT Verbal",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    plot(c( top.satt.admit.rate,
            scnd.satt.admit.rate,
            thrd.satt.admit.rate,
            bot.satt.admit.rate),
         xlab = "Quartiles (1-4)",
         ylab = "Admit Rate %",
         main = "Fig 5.5: Admit by SAT Total",
         pch = 16,
         cex = 1.75,
         col = "lightgreen")
    
    ## Correlation 
    ##SAT M and SAT V
    cor(admissions.final$sati.math, admissions.final$sati.verb)
    ## SAT T and GPA
    cor(admissions.final$sati.total, admissions.final$gpa.wtd)
    
##### Multivariate Analysis (RF)
  ## No Misclassification costs
    # random forests
    rf1<-randomForest(admit~anglo+asian+black+other+gpa.wtd+sati.verb+sati.math+sati.total+income+sex,
      data=admissions.final,importance=T,ntree=750)
    rf1.conf <- rf1$confusion
    rf1.conf
    plot(rf1)
    print(rf1)
  ## Misclassification costs
    ## 4:1
    rf2<-randomForest(admit~anglo+asian+black+other+gpa.wtd+sati.verb+sati.math+sati.total+income+sex,
                      data=admissions.final,importance=T,ntree=750, sampsize=c(1000,1700))
    print(rf2)
    
    rf3<-randomForest(admit~anglo+asian+black+other+gpa.wtd+sati.verb+sati.math+sati.total+income+sex,
                      data=admissions.final,importance=T,ntree=750, sampsize=c(100,2000))
    print(rf3)
    
  ## Prediction Accuracy // Importance Plots
    # Variable Importance Plots
    par(mfrow=c(2,2))
    varImpPlot(rf2,type=1,scale=F,class=1,
               main="Forecasting Importance Plot for Admission
               (Unstandardized)",col="lightgreen",cex=1,pch=19)
    varImpPlot(rf2,type=1,scale=T,class=1,
               main="Forecasting Importance Plot for Admission
               (Standardized)", col="lightgreen",cex=1,pch=19)
    varImpPlot(rf2,type=1,scale=F,class=0,
               main="Forecasting Importance Plot for Rejection
               (Unstandardized)", col="lightgreen",cex=1,pch=19)
    varImpPlot(rf2,type=1,scale=T,class=0,
               main="Forecasting Importance Plot for Rejection
               (Standardized)", col="lightgreen",cex=1,pch=19)
    
    # Partial Dependence Plots
    par(mfrow=c(2,1))
    part1<-partialPlot(rf2,pred.data=admissions.final,x.var=gpa.wtd,
                       rug=T,which.class=1, main="Partial Dependence Plot for GPA on Admission", xlab="GPA",
                       ylab="Centered Log Odds")
    scatter.smooth(part1$x,part1$y,span=1/3,col="blue",pch=19, main="Partial Dependence Plot for GPA on Admission", xlab="GPA",
                   ylab="Centered Log Odds")
    par(mfrow=c(2,1))
    part2<-partialPlot(rf2,pred.data=admissions.final,x.var=sati.total,
                       rug=T,which.class=1, main="Partial Dependence Plot for SAT Total on Admission", xlab="SAT Score",
                       ylab="Centered Log Odds")
    scatter.smooth(part1$x,part1$y,span=1/3,col="blue",pch=19, main="Partial Dependence Plot for SAT Total on Admission", xlab="SAT Score",
                   ylab="Centered Log Odds")
```
