Background

Twelve months ago, four of eight CU medical students applying to Obstetrics and Gynecology residency did not match. A model that predicts a medical student’s chances of matching into an obstetrics and gynecology residency may facilitate improved counseling and fewer unmatched medical students.

Objective

Create and validate a predictive model to understand a medical student’s chances of matching into obstetrics and gynecology residency.

Computational Reproducibility

I used the renv package to make the project more isolated, portable, and reproducible. In addition, the project was kept under version control on www.github.com/mufflyt in a private repository called nomogram. Lastly, the environment was controlled by using RStudio Cloud (https://rstudio.cloud/spaces/191846/content/3527244).

LOADING DATA INTO R ENVIRONMENT.

After we’ve loaded the data in R, we’ll quickly look at the variables. ‘Match_Status’ is a binary categorical variable, meaning only two categories are possible. all_merged_Feature_engineering1 is a data frame of the independent and dependent variables for review. Each variable is contained in a column, and each row represents a single unique medical student. If a student applied in more than one year, the most recent data were used.

We can see that these data have 7226 observations of 52 features, and that about 54.2 percent of medical students applying to OB/GYN residency matched. Let’s create a few plots to get a sense of the data. Remember, the goal here will be to predict whether a given medical student will match into OB/GYN residency. We make ‘Not_Matched’ the reference level in the target variable, and this will make logit models predict the probability of ‘Matched’.

Data Reduction for the Variable Selection Process

Correlation of Features

In this correlation plot we want to look for the bright, large circles, which immediately show the strong correlations (size and shading depend on the absolute values of the coefficients; color depends on direction). This shows whether two features are connected so that one changes with a predictable trend if you change the other. The closer this coefficient is to zero, the weaker the correlation. Anything that you would have to squint to see is usually not worth seeing!

Option A:

Option B: Find correlations for mixtures of continuous, polytomous, and dichotomous variables:

Data reduction was conducted in two phases. First, highly correlated variables were removed using a correlation threshold of R = 0.7; five variables were removed from the data. Second, group LASSO (Least Absolute Shrinkage and Selection Operator) was used for automatic variable selection. The LASSO penalizes the absolute size of the regression coefficients to drive the coefficients of irrelevant variables to zero [tibshirani1996regression].
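The first phase can be sketched in base R as follows. This is a minimal illustration, not the code used for the analysis: the data frame `toy` and the helper `drop_correlated()` are hypothetical, and the rule used here (drop the later variable of each pair whose absolute correlation exceeds the threshold) is only one simple choice among several.

```r
# Hypothetical helper: drop one variable from each highly correlated pair.
drop_correlated <- function(df, threshold = 0.7) {
  r <- abs(cor(df))                  # absolute pairwise Pearson correlations
  r[lower.tri(r, diag = TRUE)] <- 0  # keep each pair once (upper triangle)
  # A column is dropped if it is highly correlated with any earlier column.
  to_drop <- colnames(r)[apply(r, 2, function(col) any(col > threshold))]
  df[, setdiff(colnames(df), to_drop), drop = FALSE]
}

set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.1),  # nearly a copy of x1
  x3 = rnorm(200)                  # generated independently of x1
)
reduced <- drop_correlated(toy)
names(reduced)
```

Here `x2` is removed because it is almost a duplicate of `x1`, while `x3` is kept.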

Descriptive Analysis of `reduced_Data2` dataframe

Description of Applicants by Match Status
	Not_Matched	Matched	p.overall
	N=3307	N=3919
Gender:			<0.001
Female	2585 (78.2%)	3327 (84.9%)
Male	722 (21.8%)	592 (15.1%)
Age	28.0 [27.0;31.0]	27.0 [26.0;29.0]	<0.001
Medical_Degree:			<0.001
DO	704 (21.4%)	521 (13.3%)
MD	2581 (78.6%)	3392 (86.7%)
Military_Service_Obligation	41 (1.24%)	44 (1.12%)	0.726
Visa_Sponsorship_Needed	365 (11.0%)	88 (2.25%)	<0.001
Medical_Education_or_Training_Interrupted	601 (18.2%)	398 (10.2%)	<0.001
Misdemeanor_Conviction	58 (1.75%)	54 (1.38%)	0.233
Alpha_Omega_Alpha	187 (5.65%)	521 (13.3%)	<0.001
Gold_Humanism_Honor_Society	335 (10.1%)	704 (18.0%)	<0.001
Couples_Match	205 (6.20%)	406 (10.4%)	<0.001
Count_of_Oral_Presentation	0.00 [0.00;1.00]	0.00 [0.00;1.00]	<0.001
Count_of_Peer_Reviewed_Book_Chapter	0.00 [0.00;0.00]	0.00 [0.00;0.00]	0.028
Count_of_Peer_Reviewed_Journal_Articles_Abstracts	0.00 [0.00;1.00]	0.00 [0.00;1.00]	<0.001
Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published	0.00 [0.00;0.00]	0.00 [0.00;1.00]	<0.001
Count_of_Poster_Presentation	1.00 [0.00;2.00]	1.00 [0.00;3.00]	<0.001
Year:			<0.001
2016	148 (4.48%)	294 (7.50%)
2017	777 (23.5%)	1060 (27.0%)
2018	573 (17.3%)	711 (18.1%)
2019	1076 (32.5%)	1292 (33.0%)
2020	733 (22.2%)	562 (14.3%)
USMLE_Pass_Fail_replaced:			<0.001
Failed attempt	88 (3.01%)	9 (0.24%)
Passed	2835 (97.0%)	3741 (99.8%)
Location:			<0.001
BSW	174 (5.26%)	129 (3.29%)
CCAG	333 (10.1%)	270 (6.89%)
CU	1431 (43.3%)	1449 (37.0%)
Duke	409 (12.4%)	802 (20.5%)
Truman	170 (5.14%)	84 (2.14%)
U_Washington	126 (3.81%)	180 (4.59%)
UAB	130 (3.93%)	110 (2.81%)
Utah	534 (16.1%)	895 (22.8%)
Meeting_Name_Presented:			<0.001
Did not present at a meeting	2534 (76.6%)	2657 (67.8%)
Presented at a meeting	773 (23.4%)	1262 (32.2%)
TopNIHfunded:			<0.001
Did not attend NIH top-funded medical school	2568 (77.7%)	2247 (57.3%)
Attended a NIH top-funded Medical School	739 (22.3%)	1672 (42.7%)
Higher_Education_Institution:			<0.001
No Ivy League Education	3228 (97.6%)	3698 (94.4%)
Ivy League	79 (2.39%)	221 (5.64%)
Higher_Education_Degree:			<0.001
Not a B.S. degree	2132 (64.5%)	2136 (54.5%)
B.S.	1175 (35.5%)	1783 (45.5%)
Interest_Group:			1.000
No Interest Group	3303 (99.9%)	3914 (99.9%)
Mentions Interest Group	4 (0.12%)	5 (0.13%)
Language_Fluency:			0.054
Speaks English and Another Language	1778 (75.6%)	2256 (73.2%)
Speaks English	575 (24.4%)	825 (26.8%)
ACLS	1125 (47.8%)	1262 (41.0%)	<0.001
Other_Service_Obligation	19 (0.81%)	36 (1.17%)	0.238
Photo_Received	2305 (98.0%)	3068 (99.6%)	<0.001
Tracks_Applied_by_Applicant_1:			<0.001
Applying for a Preliminary Position	2203 (66.6%)	2035 (51.9%)
Categorical Applicant	1104 (33.4%)	1884 (48.1%)
AMA:			<0.001
No AMA Membership	2386 (72.1%)	2302 (58.7%)
American Medical Association Member	921 (27.9%)	1617 (41.3%)
ACOG:			<0.001
No ACOG Membership	2291 (69.3%)	1886 (48.1%)
ACOG Member	1016 (30.7%)	2033 (51.9%)
Latin_Honors:			0.001
Latin_honors	31 (0.94%)	12 (0.31%)
No_cum_laude	3276 (99.1%)	3907 (99.7%)
Scholarship:			<0.001
No_scholarship	2769 (83.7%)	3033 (77.4%)
Scholarship	538 (16.3%)	886 (22.6%)
Grant:			<0.001
Grant_funding	123 (3.72%)	231 (5.89%)
No_Grant_funding	3184 (96.3%)	3688 (94.1%)
Phi_beta_kappa:			<0.001
No_Phi_Beta_Kappa	3275 (99.0%)	3835 (97.9%)
Phi_Beta_Kappa	32 (0.97%)	84 (2.14%)
NCAA:			0.008
NCAA_athlente	19 (0.57%)	47 (1.20%)
Not_a_NCAA_athlete	3288 (99.4%)	3872 (98.8%)
Boy_Scouts:			0.269
Boy/Girl_Scouts	9 (0.27%)	18 (0.46%)
Not_a_Boy/Girl_Scout	3298 (99.7%)	3901 (99.5%)
Valedictorian:			0.044
Not_a_Valedictorian	3287 (99.4%)	3877 (98.9%)
Valedictorian	20 (0.60%)	42 (1.07%)
NIH:			0.656
NIH_present	24 (0.73%)	24 (0.61%)
No_NIH_involvement	3283 (99.3%)	3895 (99.4%)
NCI:			0.317
NCI_present	81 (2.45%)	112 (2.86%)
No_NCI_involvement	3226 (97.6%)	3807 (97.1%)
total_OBGYN_letter_writers	2.00 [1.00;2.00]	2.00 [2.00;3.00]	<0.001
number_of_applicant_first_author_publications	0.00 [0.00;1.00]	1.00 [0.00;3.00]	<0.001
Advance_Degree:			0.015
M.B.A.	20 (0.60%)	14 (0.36%)
No Advanced Degree	3027 (91.5%)	3651 (93.2%)
Ph.D.	19 (0.57%)	10 (0.26%)
Other	241 (7.29%)	244 (6.23%)
Type_of_medical_school:			.
U.S. Public School	1024 (31.0%)	1937 (49.4%)
International School	1075 (32.5%)	260 (6.63%)
Osteopathic School	704 (21.3%)	523 (13.3%)
Osteopathic School,International School	1 (0.03%)	0 (0.00%)
U.S. Private School	503 (15.2%)	1199 (30.6%)
work_exp_count	2.00 [0.00;4.00]	2.00 [0.00;4.00]	0.052
Volunteer_exp_count	4.00 [0.00;8.00]	6.00 [2.00;9.00]	<0.001
Research_exp_count	1.00 [0.00;2.00]	2.00 [0.00;3.00]	<0.001

#>  [1] "Match_Status"                                                          
#>  [2] "Gender"                                                                
#>  [3] "Age"                                                                   
#>  [4] "Medical_Degree"                                                        
#>  [5] "Military_Service_Obligation"                                           
#>  [6] "Visa_Sponsorship_Needed"                                               
#>  [7] "Medical_Education_or_Training_Interrupted"                             
#>  [8] "Misdemeanor_Conviction"                                                
#>  [9] "Alpha_Omega_Alpha"                                                     
#> [10] "Gold_Humanism_Honor_Society"                                           
#> [11] "Couples_Match"                                                         
#> [12] "Count_of_Oral_Presentation"                                            
#> [13] "Count_of_Peer_Reviewed_Book_Chapter"                                   
#> [14] "Count_of_Peer_Reviewed_Journal_Articles_Abstracts"                     
#> [15] "Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published"
#> [16] "Count_of_Poster_Presentation"                                          
#> [17] "Year"                                                                  
#> [18] "USMLE_Pass_Fail_replaced"                                              
#> [19] "Location"                                                              
#> [20] "Meeting_Name_Presented"                                                
#> [21] "TopNIHfunded"                                                          
#> [22] "Higher_Education_Institution"                                          
#> [23] "Higher_Education_Degree"                                               
#> [24] "Interest_Group"                                                        
#> [25] "Language_Fluency"                                                      
#> [26] "ACLS"                                                                  
#> [27] "Other_Service_Obligation"                                              
#> [28] "Photo_Received"                                                        
#> [29] "Tracks_Applied_by_Applicant_1"                                         
#> [30] "AMA"                                                                   
#> [31] "ACOG"                                                                  
#> [32] "Latin_Honors"                                                          
#> [33] "Scholarship"                                                           
#> [34] "Grant"                                                                 
#> [35] "Phi_beta_kappa"                                                        
#> [36] "NCAA"                                                                  
#> [37] "Boy_Scouts"                                                            
#> [38] "Valedictorian"                                                         
#> [39] "NIH"                                                                   
#> [40] "NCI"                                                                   
#> [41] "total_OBGYN_letter_writers"                                            
#> [42] "number_of_applicant_first_author_publications"                         
#> [43] "Advance_Degree"                                                        
#> [44] "Type_of_medical_school"                                                
#> [45] "work_exp_count"                                                        
#> [46] "Volunteer_exp_count"                                                   
#> [47] "Research_exp_count"

Recursive feature elimination required more RAM than was available, so it was run locally in a file called ‘Predictive modeling across sets.RMD’.

Variable Selection using Group LASSO

LASSO, as a feature selection method, focuses on deletion of irrelevant or redundant features.

1- Create dummy variables from the categorical data. Here we create a feature matrix in which the categorical features are converted to numeric with one-hot encoding; it will be used in glmnet when we train logit models with shrinkage.

2- Drop reference dummies.

3- Combine dummies with numeric variables, then convert them to a matrix X that includes the dummy variables. With a binary categorical outcome the only difference is that we must specify family = "binomial" when using glmnet.

4- Create your X matrix (predictors) and Y vector (outcome variable).

5- Create a group vector that distinguishes the groups.
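The steps above can be sketched with base R's `model.matrix()`, which one-hot encodes factors and drops each reference level in one call. This is an illustration under assumptions: the toy values are made up, and the report's own code may build the dummies and the group vector differently.

```r
# Toy data; column names echo the report but the values are made up.
toy <- data.frame(
  Age      = c(27, 31, 26, 29),
  Gender   = factor(c("Female", "Male", "Female", "Female")),
  Location = factor(c("CU", "Duke", "Utah", "CU"))
)
Y <- c(1, 0, 1, 1)  # 1 = Matched, 0 = Not_Matched

# model.matrix() one-hot encodes factors and drops each reference level.
mm <- model.matrix(~ ., data = toy)
X <- mm[, -1, drop = FALSE]        # remove the intercept column
group <- attr(mm, "assign")[-1]    # index of the source variable per column

colnames(X)
group
```

The two `Location` dummies share one group index, so a group penalty keeps or drops them together.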

Conduct the LASSO (reported as grLasso, the group LASSO, in the output below). This performs k-fold cross-validation for penalized regression models with grouped covariates over a grid of values for the regularization parameter lambda. First we need to find the amount of penalty λ by cross-validation. We will search for the λ that gives the minimum cross-validation error.

#> grLasso-penalized logistic regression with n=4913, p=60
#> At minimum cross-validation error (lambda=0.0033):
#> -------------------------------------------------
#>   Nonzero coefficients: 42
#>   Nonzero groups: 29
#>   Cross-validation error of 1.10
#>   Maximum R-squared: 0.22
#>   Maximum signal-to-noise ratio: 0.23
#>   Prediction error at lambda.min: 0.266

Cross-validation result. Use cross-validation to identify the best value of λ for the LASSO model. It is hoped (because it is not always the case in practice) that the cross-validation curve is U-shaped, like the one shown here, so that we can spot the optimal value of λ, i.e., the one that corresponds to the lowest point of the dip.

To view the best model and the corresponding coefficients, use cv.fit$lambda.min, the lambda value that gives the model with the smallest cross-validation error.

This extracts the fitted regression parameters of the logistic regression model at the selected lambda value of 0.0033, and picks out which predictors have a nonzero coefficient.

#> [1] 43

#>                                                            (Intercept) 
#>                                                               2.615970 
#>                                                                    Age 
#>                                                              -0.064361 
#>                                             Count_of_Oral_Presentation 
#>                                                               0.005322 
#>                                    Count_of_Peer_Reviewed_Book_Chapter 
#>                                                              -0.061396 
#>                      Count_of_Peer_Reviewed_Journal_Articles_Abstracts 
#>                                                               0.019971 
#> Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published 
#>                                                               0.065676 
#>                                             total_OBGYN_letter_writers 
#>                                                               0.294902 
#>                          number_of_applicant_first_author_publications 
#>                                                               0.009723 
#>                                                         work_exp_count 
#>                                                              -0.014067 
#>                                                    Volunteer_exp_count 
#>                                                               0.000398 
#>                                                     Research_exp_count 
#>                                                               0.064153 
#>                                            Visa_Sponsorship_Needed_Yes 
#>                                                              -0.502786 
#>                          Medical_Education_or_Training_Interrupted_Yes 
#>                                                              -0.465621 
#>                                             Misdemeanor_Conviction_Yes 
#>                                                              -0.020187 
#>                                        Gold_Humanism_Honor_Society_Yes 
#>                                                               0.200768 
#>                                                              Year_2017 
#>                                                              -0.384054 
#>                                                              Year_2018 
#>                                                              -0.692776 
#>                                                              Year_2019 
#>                                                              -0.894986 
#>                                USMLE_Pass_Fail_replaced_Failed attempt 
#>                                                              -1.565250 
#>                                                           Location_BSW 
#>                                                              -0.081417 
#>                                                          Location_CCAG 
#>                                                               0.254194 
#>                                                            Location_CU 
#>                                                               0.188666 
#>                                                          Location_Duke 
#>                                                               0.355834 
#>                                                        Location_Truman 
#>                                                              -0.126899 
#>                                                  Location_U_Washington 
#>                                                               0.319443 
#>                                                           Location_UAB 
#>                                                              -0.051413 
#>                    Meeting_Name_Presented_Did not present at a meeting 
#>                                                              -0.062828 
#>                  TopNIHfunded_Attended a NIH top-funded Medical School 
#>                                                               0.207639 
#>                                Higher_Education_Institution_Ivy League 
#>                                                               0.253969 
#>                                            Other_Service_Obligation_No 
#>                                                              -0.060423 
#>                                           Other_Service_Obligation_Yes 
#>                                                               0.060423 
#>                                                      Photo_Received_No 
#>                                                              -0.158495 
#>      Tracks_Applied_by_Applicant_1_Applying for a Preliminary Position 
#>                                                              -0.521196 
#>                                                  AMA_No AMA Membership 
#>                                                               0.061868 
#>                                                ACOG_No ACOG Membership 
#>                                                              -0.435343 
#>                                                     NCAA_NCAA_athlente 
#>                                                               0.123874 
#>                                                        NIH_NIH_present 
#>                                                              -0.196396 
#>                                                  Advance_Degree_M.B.A. 
#>                                                              -0.013734 
#>                                      Advance_Degree_No Advanced Degree 
#>                                                               0.158927 
#>                                                   Advance_Degree_Ph.D. 
#>                                                               0.111481 
#>                              Type_of_medical_school_U.S. Public School 
#>                                                              -0.189360 
#>                            Type_of_medical_school_International School 
#>                                                              -1.287720 
#>                              Type_of_medical_school_Osteopathic School 
#>                                                              -0.431929
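Picking out the predictors with nonzero coefficients, as in the output above, can be sketched on a named vector. The first three values are copied from the output; the two zero entries and their names are hypothetical and are added only to show the filtering.

```r
# A few coefficients copied from the output above, plus two zero
# entries with hypothetical names, added purely for illustration.
beta <- c(
  "(Intercept)"              = 2.615970,
  Age                        = -0.064361,
  total_OBGYN_letter_writers = 0.294902,
  hypothetical_dropped_1     = 0,
  hypothetical_dropped_2     = 0
)

# Keep every term whose coefficient is not exactly zero,
# then set the intercept aside.
selected <- setdiff(names(beta)[beta != 0], "(Intercept)")
selected
```

Note that the test is `!= 0` rather than `> 0`, so predictors with negative coefficients are retained as well.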

Here we use the option type = “response” to directly obtain the probabilities.
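As a small illustration of what `type = "response"` returns, the sketch below uses base R's `glm()` on the built-in `mtcars` data as a stand-in for the applicant data (an assumption made so the example is self-contained). It shows that the response-scale prediction is the inverse logit of the linear predictor.

```r
# Built-in mtcars data stand in for the applicant data (an assumption).
fit <- glm(am ~ wt, data = mtcars, family = binomial)

link <- predict(fit, type = "link")      # linear predictor (log-odds)
prob <- predict(fit, type = "response")  # probabilities between 0 and 1

# type = "response" is the inverse logit of the linear predictor.
all.equal(prob, plogis(link))
```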

The Brier score is the primary evaluation metric. The remaining metrics are reported for academic purposes only.

Temporal validation - Validation for Year vs. Year

The final reduced data include 27 variables for analysis.

Imputation

Next, we will impute missing values. This will make our data ready for machine learning.

Splitting the Data into training set and test set

In some cases, our data is naturally separated into two sets, one of which can be used to fit a model and the other to evaluate it. A common example of this is when data has been collected during two distinct time periods, and the older data is used to fit a model that is evaluated on the newer data, to see if historical data can be used to predict the future.

Here we check the sample sizes and the proportions of the target, “Match_Status”, in the training and test sets. The proportions should be approximately the same across the training set, the test set, and the whole dataset. Do NOT up- or down-sample the training set: the evaluation metric is the Brier score, and there is no evidence that balancing the data will improve out-of-sample performance. In fact, based on my experiments, it makes the models’ performance much worse. For more explanation, see this link: https://stats.stackexchange.com/a/474431

#> The training set has 18315 observations.

#> The training set has 15605 observations.

#> Distribution of the target in the training set:

#> n_TRUE 
#>  0.567

#> Distribution of the target in the test set:

#> n_TRUE 
#>  0.506
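The check above can be sketched as follows. The data are simulated, and cutting the split at 2020 is an assumption made for illustration; the report does not state which years were assigned to each set.

```r
# Toy data; years and outcomes are simulated, not the applicant data.
set.seed(42)
toy <- data.frame(
  Year = sample(2016:2020, 500, replace = TRUE),
  Match_Status = factor(
    sample(c("Not_Matched", "Matched"), 500, replace = TRUE),
    levels = c("Not_Matched", "Matched")
  )
)

# Temporal split (assumed): earlier years train, the latest year tests.
train <- toy[toy$Year < 2020, ]
test  <- toy[toy$Year == 2020, ]

# Proportion of "Matched" in each set and in the whole data.
prop_matched <- function(d) mean(d$Match_Status == "Matched")
props <- c(train = prop_matched(train),
           test  = prop_matched(test),
           all   = prop_matched(toy))
round(props, 3)
```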

Feature Exploration

The next step is to explore potential predictive relationships between individual predictors and the response and between pairs of predictors and the response.

Two logistic regression models were considered. The “null model” contains only an intercept term, while the more complex model has a single term for an individual predictor from the risk set. The results are presented in the table, which orders the risk predictors from most to least significant in terms of improvement in Brier score. For the training data, several predictors provide marginal but significant improvement in predicting the Match_Status outcome as measured by the improvement in Brier score. Based on these results, our intuition would lead us to believe that the significant categorical predictors will likely be integral to the final predictive model.

The improvement in Brier score was also compared with the p-value for the importance of each interaction term among the numerical predictors. Adding interactions changed the Brier score by less than 0.1.

#> ~Age:Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published + 
#>     Age:work_exp_count + Age:Volunteer_exp_count + Count_of_Oral_Presentation:Count_of_Peer_Reviewed_Book_Chapter + 
#>     Count_of_Oral_Presentation:Count_of_Peer_Reviewed_Journal_Articles_Abstracts + 
#>     Count_of_Oral_Presentation:Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published + 
#>     Count_of_Peer_Reviewed_Book_Chapter:total_OBGYN_letter_writers + 
#>     Count_of_Peer_Reviewed_Book_Chapter:Research_exp_count + 
#>     Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published:Research_exp_count + 
#>     total_OBGYN_letter_writers:number_of_applicant_first_author_publications + 
#>     total_OBGYN_letter_writers:Research_exp_count + work_exp_count:Volunteer_exp_count

NULL MODEL

For all model evaluation we will use the Brier score. The Brier score for a binary classifier is defined as the mean squared error between the predicted probabilities and the true target values (0’s or 1’s). The smaller, the better.
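A minimal sketch of the Brier score as defined above; the function name `brier_score()` and the example probabilities are hypothetical.

```r
# Brier score: mean squared difference between the predicted probability
# and the observed outcome coded 0/1. Smaller is better.
brier_score <- function(prob, outcome) {
  stopifnot(length(prob) == length(outcome))
  mean((prob - outcome)^2)
}

outcome <- c(1, 0, 1, 1)          # 1 = Matched, 0 = Not_Matched
prob    <- c(0.9, 0.2, 0.6, 0.5)  # hypothetical predicted probabilities
brier_score(prob, outcome)        # (0.01 + 0.04 + 0.16 + 0.25) / 4 = 0.115
```

A perfect forecaster scores 0, and always predicting 0.5 scores 0.25 regardless of the outcomes.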

Baseline Performance

Model performance and utility will be judged by how much more accurately the model predicts a residency match than a human can. Since the target of the prediction is binary (Matched or Not_Matched), we can assume a baseline of 50% for random chance. Next we want to determine a rather modest human baseline. If a human were to predict “Matched” for every applicant, they would be accurate 54.23% of the time, 4.23 percentage points above chance. Though we can imagine more elaborate methods to improve human performance, we will use this method to establish a baseline. Human performance could just as easily fall below random chance: a human who predicted “Not_Matched” for every applicant would be accurate only 45.77% of the time. Let’s give the benefit of the doubt, however, and use the higher accuracy. Therefore, to be considered successful, the model must achieve an accuracy greater than 54.23%. Even a slight improvement in performance is consequential, given that the model is intended for counseling applicants.

#> A naive prediction of all matches gives a test Brier score: 0.494
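The baseline figures can be reproduced by simple arithmetic from numbers reported earlier in this document. The sketch below assumes that the naive prediction assigns probability 1 of matching to every applicant, which is what reproduces the reported score.

```r
# Counts from the descriptive table in this report.
n_matched     <- 3919
n_not_matched <- 3307

# Accuracy of always predicting "Matched".
acc_all_matched <- n_matched / (n_matched + n_not_matched)
round(100 * acc_all_matched, 2)   # 54.23

# Brier score of predicting probability 1 for everyone: each matched
# applicant contributes 0 and each unmatched applicant contributes 1,
# so the score equals the unmatched proportion.
p_test_matched <- 0.506           # test-set proportion reported above
brier_all_matched <- 1 - p_test_matched
brier_all_matched                 # 0.494
```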

Model Training

Setting Control parameters

Often we are interested in out-of-sample measures of model performance. Recall that cross-validation is one way of obtaining out-of-sample performance estimates; we’ll use the caret package to perform cross-validation. The idea behind cross-validation is similar to that behind splitting the data into training and test sets. We partition the data into several sets and use one of them for evaluation. The key difference is that in cross-validation we partition the data into more than two sets and use all of them (one by one) for evaluation.
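To make the idea concrete, here is a hand-rolled k-fold cross-validation in base R that scores each held-out fold with the Brier score. It is a sketch on simulated data and does not use caret; the variable names and the choice of k = 5 are assumptions for illustration.

```r
set.seed(123)
n <- 300
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.8 * toy$x))  # simulated binary outcome

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold labels

cv_brier <- sapply(seq_len(k), function(i) {
  # Fit on every fold except fold i, then evaluate on fold i.
  fit  <- glm(y ~ x, data = toy[fold != i, ], family = binomial)
  held <- toy[fold == i, ]
  prob <- predict(fit, newdata = held, type = "response")
  mean((prob - held$y)^2)  # Brier score on the held-out fold
})

cv_brier        # one score per fold
mean(cv_brier)  # cross-validated estimate
```

Every observation is used for evaluation exactly once, which is the difference from a single training/test split.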

We will work with the caret (Classification and Regression Training) package in R. A strength of the caret package is that the same syntax can be used whether the model we are considering has a categorical or numeric outcome.

TRAINING THE LOGISTIC REGRESSION MODEL USING THE `caret` PACKAGE

2- Train models on the training data. The caret::train function is a high-level API that manages model training through a common interface. Our algorithm optimized for the Brier score; this is set with the argument metric.

Logistic regression

The glmnet package is extremely efficient and fast, even on very large data sets (mostly due to its use of Fortran to solve the lasso problem via coordinate descent); note, however, that it only accepts the non-formula XY interface so prior to modeling we need to separate our feature and target sets.

Least absolute shrinkage and selection operator model

#> [1] "Best tuning parameter values found after 3 repeated 10-fold cross-validations:\n"

#>   lambda min_brier_score brier_score_sd
#> 1 0.0302           0.197          0.025

Best grid search terms in logit_shrink model.

Random forest

xgboost

xgboost is a gradient boosting model. Tree-based models of this type are known for their stability and performance. XGBoost is one implementation of boosting, in which each new tree is fitted to the errors of the current model to improve performance.

CatBoost

Neural net

Training a neural network for Classification

Model ensemble

Because the caret ensemble does not support custom models, we will add the CatBoost model manually.

#>       logit lasso   xgb  nnet    rf rpart
#> logit 1.000 0.999 0.891 0.992 0.990 0.840
#> lasso 0.999 1.000 0.909 0.987 0.994 0.852
#> xgb   0.891 0.909 1.000 0.843 0.928 0.890
#> nnet  0.992 0.987 0.843 1.000 0.965 0.765
#> rf    0.990 0.994 0.928 0.965 1.000 0.905
#> rpart 0.840 0.852 0.890 0.765 0.905 1.000

All models are highly correlated, so they are not good candidates for an ensemble.

Find Confidence interval

3- Model Comparison -Summary Table

#> 
#> Call:
#> summary.resamples(object = results)
#> 
#> Models: Logit, Lasso, RF, XGB, Nnet 
#> Number of resamples: 4 
#> 
#> Acc 
#>        Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
#> Logit 0.598   0.690  0.722 0.705   0.738 0.777    0
#> Lasso 0.590   0.685  0.721 0.703   0.739 0.781    0
#> RF    0.614   0.675  0.696 0.695   0.716 0.775    0
#> XGB   0.619   0.671  0.700 0.690   0.718 0.740    0
#> Nnet  0.603   0.671  0.696 0.685   0.710 0.743    0
#> 
#> AUCPR 
#>        Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
#> Logit 0.668   0.715  0.750 0.750   0.784 0.832    0
#> Lasso 0.669   0.711  0.746 0.748   0.782 0.830    0
#> RF    0.644   0.698  0.743 0.734   0.779 0.807    0
#> XGB   0.645   0.699  0.732 0.726   0.759 0.795    0
#> Nnet  0.662   0.663  0.700 0.720   0.757 0.820    0
#> 
#> AUCROC 
#>        Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
#> Logit 0.674   0.745  0.771 0.749   0.775 0.782    0
#> Lasso 0.672   0.745  0.771 0.751   0.777 0.788    0
#> RF    0.664   0.729  0.760 0.742   0.772 0.784    0
#> XGB   0.673   0.710  0.737 0.730   0.756 0.773    0
#> Nnet  0.661   0.715  0.740 0.723   0.748 0.753    0
#> 
#> Brier 
#>        Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
#> Logit 0.174   0.186  0.192 0.198   0.204 0.231    0
#> Lasso 0.170   0.185  0.190 0.196   0.202 0.235    0
#> RF    0.170   0.195  0.204 0.204   0.213 0.237    0
#> XGB   0.190   0.199  0.210 0.214   0.224 0.247    0
#> Nnet  0.178   0.202  0.214 0.213   0.225 0.244    0
#> 
#> F 
#>        Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
#> Logit 0.539   0.687  0.754 0.723   0.790 0.843    0
#> Lasso 0.520   0.681  0.752 0.718   0.788 0.848    0
#> RF    0.600   0.695  0.747 0.735   0.786 0.846    0
#> XGB   0.615   0.701  0.734 0.724   0.757 0.815    0
#> Nnet  0.610   0.691  0.734 0.725   0.768 0.823    0
#> 
#> Kappa 
#>        Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
#> Logit 0.202   0.365  0.436 0.384   0.455 0.464    0
#> Lasso 0.189   0.357  0.432 0.380   0.455 0.468    0
#> RF    0.230   0.326  0.378 0.356   0.408 0.438    0
#> XGB   0.238   0.329  0.371 0.352   0.393 0.426    0
#> Nnet  0.206   0.324  0.365 0.335   0.375 0.403    0
#> 
#> Precision 
#>        Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
#> Logit 0.657   0.675  0.692 0.708   0.725 0.792    0
#> Lasso 0.655   0.676  0.692 0.707   0.723 0.789    0
#> RF    0.640   0.642  0.654 0.681   0.693 0.774    0
#> XGB   0.640   0.657  0.677 0.692   0.711 0.773    0
#> Nnet  0.617   0.645  0.668 0.678   0.701 0.759    0
#> 
#> Recall 
#>        Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
#> Logit 0.457   0.715  0.829 0.754   0.867 0.900    0
#> Lasso 0.431   0.703  0.822 0.748   0.867 0.915    0
#> RF    0.562   0.771  0.872 0.810   0.910 0.934    0
#> XGB   0.593   0.742  0.801 0.764   0.823 0.861    0
#> Nnet  0.602   0.746  0.814 0.782   0.850 0.899    0

-Plots

Variable importance - `varImp` showcases variable importance of the variables used in the training model.

Plot Variable importance

-Statistical Significance

There is no significant difference between models.

TESTING THE LOGISTIC REGRESSION MODEL

Predicting Test Set Results. Finally, scoring is performed on the test sample using the parameter estimates obtained from the model-building process. This step is easily implemented with the predict function, which is as simple as it gets with caret models. We’ll make predictions using the test data in order to evaluate the performance of our logistic regression model. The procedure is as follows: (1) predict the class membership probabilities of observations based on the predictor variables; (2) assign each observation to the class with the highest probability score (i.e., above 0.5).
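The two-step procedure can be sketched with a plain `glm()` on simulated data. This stands in for the caret model used in the report; the data, the split, and the single predictor `x` are assumptions for illustration.

```r
set.seed(7)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$Match_Status <- factor(
  ifelse(rbinom(n, 1, plogis(toy$x)) == 1, "Matched", "Not_Matched"),
  levels = c("Not_Matched", "Matched")  # reference level first
)
train <- toy[1:150, ]
test  <- toy[151:200, ]

# Step 1: predict class membership probabilities, here P(Matched).
fit  <- glm(Match_Status ~ x, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")

# Step 2: assign the class with the higher probability (cutoff 0.5).
pred <- factor(ifelse(prob > 0.5, "Matched", "Not_Matched"),
               levels = levels(test$Match_Status))
table(predicted = pred, observed = test$Match_Status)
```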

Which classes do these probabilities refer to? In our example, the output is the probability that Match_Status is ‘Matched’ rather than ‘Not_Matched’. We know this because the contrasts() function indicates that R has created a dummy variable with a 1 for ‘Matched’ and a 0 for ‘Not_Matched’. Check the dummy coding:
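A minimal sketch of that check on a toy factor, assuming ‘Not_Matched’ is set as the first (reference) level as described earlier in this document:

```r
# Not_Matched is listed first, so it becomes the reference level.
Match_Status <- factor(
  c("Matched", "Not_Matched", "Matched"),
  levels = c("Not_Matched", "Matched")
)
contrasts(Match_Status)
#>             Matched
#> Not_Matched       0
#> Matched           1
```

The single column named `Matched` confirms that a fitted logit model estimates the probability of ‘Matched’.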

Plotting the Predicted Probabilities

1- Lift plot

2-Calibration plot

3- Logistic Calibration plot

Plot Calibration

Table and Plot of CI

Concordance Index
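For a binary outcome, the concordance index (c-statistic) is the proportion of (Matched, Not_Matched) pairs in which the matched applicant receives the higher predicted probability, with ties counted as one half. A minimal base R sketch follows; the function name `c_index` and the example vectors are hypothetical and are not taken from the analysis.

```r
# Concordance index for a binary outcome (1 = Matched, 0 = Not_Matched).
c_index <- function(prob, outcome) {
  pos <- prob[outcome == 1]
  neg <- prob[outcome == 0]
  # Compare every positive with every negative; ties count one half.
  cmp <- outer(pos, neg, FUN = function(p, q) (p > q) + 0.5 * (p == q))
  mean(cmp)
}

# Made-up predicted probabilities and outcomes, for illustration only.
prob    <- c(0.9, 0.4, 0.6, 0.3, 0.7)
outcome <- c(1,   1,   0,   0,   1)
c_index(prob, outcome)  # 5 of the 6 pairs are concordant
```

A value of 0.5 corresponds to chance-level discrimination and 1 to perfect separation of the two classes.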

Model Interaction

RF and nnet will learn interactions on their own. We are going to use the iml package: fit the random forest first, then run the iml feature interaction.

Log Loss

2- train models on train data for LOG LOSS

-Summary Table of LOG LOSS

#> 
#> Call:
#> summary.resamples(object = results3)
#> 
#> Models: Logit, Lasso, RF, XGB, Nnet 
#> Number of resamples: 4 
#> 
#> logLoss 
#>        Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
#> Logit 0.514   0.549  0.566 0.577   0.594 0.664    0
#> Lasso 0.517   0.549  0.562 0.576   0.590 0.663    0
#> RF    0.528   0.583  0.602 0.611   0.630 0.712    0
#> XGB   0.593   0.599  0.639 0.656   0.696 0.751    0
#> Nnet  0.508   0.567  0.587 0.590   0.610 0.679    0
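As a reminder of what the logLoss column measures: log loss is the mean negative log-likelihood of the observed classes under the predicted probabilities, so lower is better and confident wrong predictions are penalized heavily. The sketch below is a generic base R definition under that assumption; `log_loss` is a hypothetical helper, not the function that produced the table.

```r
# Binary log loss: y is 0/1 (1 = Matched), p is the predicted P(Matched).
log_loss <- function(y, p, eps = 1e-15) {
  # Clip probabilities away from 0 and 1 so log() stays finite.
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# An uninformative model that always predicts 0.5 scores log(2).
log_loss(c(1, 0), c(0.5, 0.5))

# Made-up example: confident and correct predictions score lower.
log_loss(c(1, 0, 1), c(0.9, 0.2, 0.8))
```

Because log(2) is roughly 0.693, mean values below that in a table like the one above indicate predictions that are more informative than a constant 0.5.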

-Plots of LOG LOSS

Find LOG LOSS Confidence interval

-Statistical Significance

#> 
#> Call:
#> summary.diff.resamples(object = diffs)
#> 
#> p-value adjustment: bonferroni 
#> Upper diagonal: estimates of the difference
#> Lower diagonal: p-value for H0: difference = 0
#> 
#> logLoss 
#>       Logit Lasso    RF       XGB      Nnet    
#> Logit        0.00118 -0.03384 -0.07835 -0.01286
#> Lasso 1.000          -0.03502 -0.07953 -0.01404
#> RF    0.217 0.256             -0.04451  0.02098
#> XGB   0.228 0.196    0.794              0.06549
#> Nnet  1.000 1.000    0.152    0.350
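The table above reports pairwise differences with Bonferroni-adjusted p-values. As a rough sketch of that kind of comparison, assuming a paired t-test on per-resample values followed by a Bonferroni correction over the ten model pairs, the base R version looks like this. The per-resample numbers are made up and do not reproduce the table.

```r
# Hypothetical per-resample log loss for two models (made-up values).
logit_ll <- c(0.55, 0.57, 0.60, 0.52)
rf_ll    <- c(0.58, 0.61, 0.62, 0.56)

# Paired test: both models are scored on the same resamples.
tt <- t.test(logit_ll, rf_ll, paired = TRUE)
p_raw <- tt$p.value

# Five models give choose(5, 2) = 10 pairwise comparisons.
n_comparisons <- choose(5, 2)
p_adj <- p.adjust(p_raw, method = "bonferroni", n = n_comparisons)

c(estimate = unname(tt$estimate), p_raw = p_raw, p_adj = p_adj)
```

The Bonferroni adjustment multiplies each raw p-value by the number of comparisons and caps it at 1, which is why several entries in a table like this can read 1.000.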

Gives the linear predictor formula

Create the y = mx+b Logistic Model Formula

#> [1] "y = -2.43 + 0.45*Military_Service_ObligationYes - 0.33*NCAANot_a_NCAA_athlete + 0.69*NIHNo_NIH_involvement - 0.09*Age - 0.55*Medical_Education_or_Training_InterruptedYes + 0.09*Misdemeanor_ConvictionYes + 0.3*Gold_Humanism_Honor_SocietyYes + 0.04*Count_of_Oral_Presentation - 0.13*Count_of_Peer_Reviewed_Book_Chapter + 0.03*Count_of_Peer_Reviewed_Journal_Articles_Abstracts + 0.08*Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published - 0.77*Year2018 + 1.61*USMLE_Pass_Fail_replacedPassed + 0.12*LocationCCAG - 0.08*LocationCU + 0.19*LocationDuke - 0.42*LocationTruman - 0.62*LocationU_Washington - 0.53*LocationUAB - 0.95*LocationUtah + 0.09*Meeting_Name_PresentedPresented at a meeting + 0.24*TopNIHfundedAttended a NIH top-funded Medical School - 0.07*Higher_Education_InstitutionIvy League + 0.22*Other_Service_ObligationYes + 0.81*Photo_ReceivedYes + 0.97*Tracks_Applied_by_Applicant_1Categorical Applicant - 0.05*AMAAmerican Medical Association Member + 0.41*ACOGACOG Member + 0.36*total_OBGYN_letter_writers + 0.02*number_of_applicant_first_author_publications + 0.34*Advance_DegreeNo Advanced Degree + 0.67*Advance_DegreePh.D. + 0.02*Advance_DegreeOther + 0.74*Type_of_medical_schoolOsteopathic School + 1.54*Type_of_medical_schoolU.S. Private School + 1.3*Type_of_medical_schoolU.S. Public School - 0.02*work_exp_count + 0.01*Volunteer_exp_count + 0.04*Research_exp_count - 0*logit"
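In a logistic model the linear predictor y is on the log-odds scale, so it has to pass through the inverse logit to become a probability of matching. The sketch below illustrates that conversion only; `inv_logit` is a hypothetical helper, and the example inputs are arbitrary values rather than a full applicant profile.

```r
# Inverse logit: convert a log-odds value y into a probability.
inv_logit <- function(y) 1 / (1 + exp(-y))

# A linear predictor of 0 corresponds to a probability of 0.5.
inv_logit(0)

# Base R's plogis() implements the same transformation.
all.equal(inv_logit(c(-2.43, 0, 1.5)), plogis(c(-2.43, 0, 1.5)))
```

To score a real applicant, every term of the formula would first be summed for that applicant's covariate values, and the resulting y passed to the inverse logit.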

Odds Ratio table
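An odds ratio is the exponential of a logistic regression coefficient. As a small illustration, the sketch below exponentiates three of the rounded coefficients printed in the formula above; the vector name `coefs` is hypothetical, and confidence intervals are omitted because the standard errors are not shown here.

```r
# A few rounded coefficients copied from the printed model formula.
coefs <- c(
  USMLE_Pass_Fail_replacedPassed = 1.61,
  Photo_ReceivedYes              = 0.81,
  Year2018                       = -0.77
)

# Odds ratio = exp(coefficient); values above 1 raise the odds of matching.
odds_ratios <- exp(coefs)
round(odds_ratios, 2)
```

An odds ratio above 1 means the factor is associated with higher odds of matching, holding the other variables fixed, and a value below 1 with lower odds.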

A Model to Predict Chances of Matching into Obstetrics and Gynecology Residency at All Participating Sites

Tyler M. Muffly, MD

Department of Obstetrics and Gynecology, Denver Health, Denver, CO

Background

Objective

Computational Reproducibility

LOADING DATA INTO R ENVIRONMENT.

Data Reduction for the Variable Selection Process

Correlation of Features

Descriptive Analysis of `reduced_Data2` dataframe

Variable Selection using Group LASSO

Imputation

Splitting the Data into training set and test set

Feature Exploration

Baseline Performance

Model Training

Setting Control parameters

TRAINING THE LOGISTIC REGRESSION MODEL USING THE `caret` PACKAGE

Logistic regression

Least absolute shrinkage and selection operator model

Random forest

xgboost

CatBOOST

Neural net

Model ensemble

Variable importance - `varImp` showcases variable importance of the variables used in the training model.

TESTING THE LOGISTIC REGRESSION MODEL

Plotting the Predicted Probabilities

Concordance Index

Log Loss

Closing Time Procedures
