##Data Reduction
#>
#> --------Summary descriptives table by 'Match_Status'---------
#>
#> _______________________________________________________________________________________________________________________
#> Matched Not_Matched p.overall N
#> N=3919 N=3307
#> ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
#> Gender: <0.001 7226
#> Female 3327 (84.9%) 2585 (78.2%)
#> Male 592 (15.1%) 722 (21.8%)
#> Age 27.0 [26.0;29.0] 28.0 [27.0;31.0] <0.001 7226
#> Medical_Degree: <0.001 7198
#> DO 521 (13.3%) 704 (21.4%)
#> MD 3392 (86.7%) 2581 (78.6%)
#> Military_Service_Obligation: 0.726 7226
#> No 3875 (98.9%) 3266 (98.8%)
#> Yes 44 (1.12%) 41 (1.24%)
#> Visa_Sponsorship_Needed: <0.001 7226
#> No 3831 (97.8%) 2942 (89.0%)
#> Yes 88 (2.25%) 365 (11.0%)
#> Medical_Education_or_Training_Interrupted: <0.001 7226
#> No 3521 (89.8%) 2706 (81.8%)
#> Yes 398 (10.2%) 601 (18.2%)
#> Misdemeanor_Conviction: 0.233 7226
#> No 3865 (98.6%) 3249 (98.2%)
#> Yes 54 (1.38%) 58 (1.75%)
#> Alpha_Omega_Alpha: <0.001 7226
#> No 3398 (86.7%) 3120 (94.3%)
#> Yes 521 (13.3%) 187 (5.65%)
#> Gold_Humanism_Honor_Society: <0.001 7226
#> No 3215 (82.0%) 2972 (89.9%)
#> Yes 704 (18.0%) 335 (10.1%)
#> Couples_Match: <0.001 7226
#> No 3513 (89.6%) 3102 (93.8%)
#> Yes 406 (10.4%) 205 (6.20%)
#> Count_of_Oral_Presentation 0.00 [0.00;1.00] 0.00 [0.00;1.00] <0.001 7226
#> Count_of_Peer_Reviewed_Book_Chapter 0.00 [0.00;0.00] 0.00 [0.00;0.00] 0.028 7226
#> Count_of_Peer_Reviewed_Journal_Articles_Abstracts 0.00 [0.00;1.00] 0.00 [0.00;1.00] <0.001 7226
#> Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published 0.00 [0.00;1.00] 0.00 [0.00;0.00] <0.001 7226
#> Count_of_Poster_Presentation 1.00 [0.00;3.00] 1.00 [0.00;2.00] <0.001 7226
#> Year: <0.001 7226
#> 2016 294 (7.50%) 148 (4.48%)
#> 2017 1060 (27.0%) 777 (23.5%)
#> 2018 711 (18.1%) 573 (17.3%)
#> 2019 1292 (33.0%) 1076 (32.5%)
#> 2020 562 (14.3%) 733 (22.2%)
#> USMLE_Pass_Fail_replaced: <0.001 6673
#> Failed attempt 9 (0.24%) 88 (3.01%)
#> Passed 3741 (99.8%) 2835 (97.0%)
#> Location: <0.001 7226
#> BSW 129 (3.29%) 174 (5.26%)
#> CCAG 270 (6.89%) 333 (10.1%)
#> CU 1449 (37.0%) 1431 (43.3%)
#> Duke 802 (20.5%) 409 (12.4%)
#> Truman 84 (2.14%) 170 (5.14%)
#> U_Washington 180 (4.59%) 126 (3.81%)
#> UAB 110 (2.81%) 130 (3.93%)
#> Utah 895 (22.8%) 534 (16.1%)
#> Meeting_Name_Presented: <0.001 7226
#> Did not present at a meeting 2657 (67.8%) 2534 (76.6%)
#> Presented at a meeting 1262 (32.2%) 773 (23.4%)
#> TopNIHfunded: <0.001 7226
#> Did not attend NIH top-funded medical school 2247 (57.3%) 2568 (77.7%)
#> Attended a NIH top-funded Medical School 1672 (42.7%) 739 (22.3%)
#> Higher_Education_Institution: <0.001 7226
#> No Ivy League Education 3698 (94.4%) 3228 (97.6%)
#> Ivy League 221 (5.64%) 79 (2.39%)
#> Higher_Education_Degree: <0.001 7226
#> Not a B.S. degree 2136 (54.5%) 2132 (64.5%)
#> B.S. 1783 (45.5%) 1175 (35.5%)
#> Interest_Group: 1.000 7226
#> No Interest Group 3914 (99.9%) 3303 (99.9%)
#> Mentions Interest Group 5 (0.13%) 4 (0.12%)
#> Language_Fluency: 0.054 5434
#> Speaks English and Another Language 2256 (73.2%) 1778 (75.6%)
#> Speaks English 825 (26.8%) 575 (24.4%)
#> ACLS: <0.001 5434
#> No 1819 (59.0%) 1228 (52.2%)
#> Yes 1262 (41.0%) 1125 (47.8%)
#> Other_Service_Obligation: 0.238 5434
#> No 3045 (98.8%) 2334 (99.2%)
#> Yes 36 (1.17%) 19 (0.81%)
#> Photo_Received: <0.001 5434
#> No 13 (0.42%) 48 (2.04%)
#> Yes 3068 (99.6%) 2305 (98.0%)
#> Tracks_Applied_by_Applicant_1: <0.001 7226
#> Applying for a Preliminary Position 2035 (51.9%) 2203 (66.6%)
#> Categorical Applicant 1884 (48.1%) 1104 (33.4%)
#> AMA: <0.001 7226
#> No AMA Membership 2302 (58.7%) 2386 (72.1%)
#> American Medical Association Member 1617 (41.3%) 921 (27.9%)
#> ACOG: <0.001 7226
#> No ACOG Membership 1886 (48.1%) 2291 (69.3%)
#> ACOG Member 2033 (51.9%) 1016 (30.7%)
#> Latin_Honors: 0.001 7226
#> Latin_honors 12 (0.31%) 31 (0.94%)
#> No_cum_laude 3907 (99.7%) 3276 (99.1%)
#> Scholarship: <0.001 7226
#> No_scholarship 3033 (77.4%) 2769 (83.7%)
#> Scholarship 886 (22.6%) 538 (16.3%)
#> Grant: <0.001 7226
#> Grant_funding 231 (5.89%) 123 (3.72%)
#> No_Grant_funding 3688 (94.1%) 3184 (96.3%)
#> Phi_beta_kappa: <0.001 7226
#> No_Phi_Beta_Kappa 3835 (97.9%) 3275 (99.0%)
#> Phi_Beta_Kappa 84 (2.14%) 32 (0.97%)
#> NCAA: 0.008 7226
#> NCAA_athlente 47 (1.20%) 19 (0.57%)
#> Not_a_NCAA_athlete 3872 (98.8%) 3288 (99.4%)
#> Boy_Scouts: 0.269 7226
#> Boy/Girl_Scouts 18 (0.46%) 9 (0.27%)
#> Not_a_Boy/Girl_Scout 3901 (99.5%) 3298 (99.7%)
#> Valedictorian: 0.044 7226
#> Not_a_Valedictorian 3877 (98.9%) 3287 (99.4%)
#> Valedictorian 42 (1.07%) 20 (0.60%)
#> NIH: 0.656 7226
#> NIH_present 24 (0.61%) 24 (0.73%)
#> No_NIH_involvement 3895 (99.4%) 3283 (99.3%)
#> NCI: 0.317 7226
#> NCI_present 112 (2.86%) 81 (2.45%)
#> No_NCI_involvement 3807 (97.1%) 3226 (97.6%)
#> total_OBGYN_letter_writers 2.00 [2.00;3.00] 2.00 [1.00;2.00] <0.001 5348
#> number_of_applicant_first_author_publications 1.00 [0.00;3.00] 0.00 [0.00;1.00] <0.001 7226
#> Advance_Degree: 0.015 7226
#> M.B.A. 14 (0.36%) 20 (0.60%)
#> No Advanced Degree 3651 (93.2%) 3027 (91.5%)
#> Ph.D. 10 (0.26%) 19 (0.57%)
#> Other 244 (6.23%) 241 (7.29%)
#> Type_of_medical_school: . 7226
#> U.S. Public School 1937 (49.4%) 1024 (31.0%)
#> International School 260 (6.63%) 1075 (32.5%)
#> Osteopathic School 523 (13.3%) 704 (21.3%)
#> Osteopathic School,International School 0 (0.00%) 1 (0.03%)
#> U.S. Private School 1199 (30.6%) 503 (15.2%)
#> work_exp_count 2.00 [0.00;4.00] 2.00 [0.00;4.00] 0.052 7226
#> Volunteer_exp_count 6.00 [2.00;9.00] 4.00 [0.00;8.00] <0.001 7226
#> Research_exp_count 2.00 [0.00;3.00] 1.00 [0.00;2.00] <0.001 7226
#> ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
#>
#> --------Summary descriptives table ---------
#>
#> ____________________________________________________________________________________________
#> [ALL] N
#> N=7226
#> ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
#> Match_Status: 7226
#> Matched 3919 (54.2%)
#> Not_Matched 3307 (45.8%)
#> Gender: 7226
#> Female 5912 (81.8%)
#> Male 1314 (18.2%)
#> Age 28.0 [26.0;30.0] 7226
#> Medical_Degree: 7198
#> DO 1225 (17.0%)
#> MD 5973 (83.0%)
#> Military_Service_Obligation: 7226
#> No 7141 (98.8%)
#> Yes 85 (1.18%)
#> Visa_Sponsorship_Needed: 7226
#> No 6773 (93.7%)
#> Yes 453 (6.27%)
#> Medical_Education_or_Training_Interrupted: 7226
#> No 6227 (86.2%)
#> Yes 999 (13.8%)
#> Misdemeanor_Conviction: 7226
#> No 7114 (98.5%)
#> Yes 112 (1.55%)
#> Alpha_Omega_Alpha: 7226
#> No 6518 (90.2%)
#> Yes 708 (9.80%)
#> Gold_Humanism_Honor_Society: 7226
#> No 6187 (85.6%)
#> Yes 1039 (14.4%)
#> Couples_Match: 7226
#> No 6615 (91.5%)
#> Yes 611 (8.46%)
#> Count_of_Oral_Presentation 0.00 [0.00;1.00] 7226
#> Count_of_Peer_Reviewed_Book_Chapter 0.00 [0.00;0.00] 7226
#> Count_of_Peer_Reviewed_Journal_Articles_Abstracts 0.00 [0.00;1.00] 7226
#> Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published 0.00 [0.00;0.00] 7226
#> Count_of_Poster_Presentation 1.00 [0.00;3.00] 7226
#> Year: 7226
#> 2016 442 (6.12%)
#> 2017 1837 (25.4%)
#> 2018 1284 (17.8%)
#> 2019 2368 (32.8%)
#> 2020 1295 (17.9%)
#> USMLE_Pass_Fail_replaced: 6673
#> Failed attempt 97 (1.45%)
#> Passed 6576 (98.5%)
#> Location: 7226
#> BSW 303 (4.19%)
#> CCAG 603 (8.34%)
#> CU 2880 (39.9%)
#> Duke 1211 (16.8%)
#> Truman 254 (3.52%)
#> U_Washington 306 (4.23%)
#> UAB 240 (3.32%)
#> Utah 1429 (19.8%)
#> Meeting_Name_Presented: 7226
#> Did not present at a meeting 5191 (71.8%)
#> Presented at a meeting 2035 (28.2%)
#> TopNIHfunded: 7226
#> Did not attend NIH top-funded medical school 4815 (66.6%)
#> Attended a NIH top-funded Medical School 2411 (33.4%)
#> Higher_Education_Institution: 7226
#> No Ivy League Education 6926 (95.8%)
#> Ivy League 300 (4.15%)
#> Higher_Education_Degree: 7226
#> Not a B.S. degree 4268 (59.1%)
#> B.S. 2958 (40.9%)
#> Interest_Group: 7226
#> No Interest Group 7217 (99.9%)
#> Mentions Interest Group 9 (0.12%)
#> Language_Fluency: 5434
#> Speaks English and Another Language 4034 (74.2%)
#> Speaks English 1400 (25.8%)
#> ACLS: 5434
#> No 3047 (56.1%)
#> Yes 2387 (43.9%)
#> Other_Service_Obligation: 5434
#> No 5379 (99.0%)
#> Yes 55 (1.01%)
#> Photo_Received: 5434
#> No 61 (1.12%)
#> Yes 5373 (98.9%)
#> Tracks_Applied_by_Applicant_1: 7226
#> Applying for a Preliminary Position 4238 (58.6%)
#> Categorical Applicant 2988 (41.4%)
#> AMA: 7226
#> No AMA Membership 4688 (64.9%)
#> American Medical Association Member 2538 (35.1%)
#> ACOG: 7226
#> No ACOG Membership 4177 (57.8%)
#> ACOG Member 3049 (42.2%)
#> Latin_Honors: 7226
#> Latin_honors 43 (0.60%)
#> No_cum_laude 7183 (99.4%)
#> Scholarship: 7226
#> No_scholarship 5802 (80.3%)
#> Scholarship 1424 (19.7%)
#> Grant: 7226
#> Grant_funding 354 (4.90%)
#> No_Grant_funding 6872 (95.1%)
#> Phi_beta_kappa: 7226
#> No_Phi_Beta_Kappa 7110 (98.4%)
#> Phi_Beta_Kappa 116 (1.61%)
#> NCAA: 7226
#> NCAA_athlente 66 (0.91%)
#> Not_a_NCAA_athlete 7160 (99.1%)
#> Boy_Scouts: 7226
#> Boy/Girl_Scouts 27 (0.37%)
#> Not_a_Boy/Girl_Scout 7199 (99.6%)
#> Valedictorian: 7226
#> Not_a_Valedictorian 7164 (99.1%)
#> Valedictorian 62 (0.86%)
#> NIH: 7226
#> NIH_present 48 (0.66%)
#> No_NIH_involvement 7178 (99.3%)
#> NCI: 7226
#> NCI_present 193 (2.67%)
#> No_NCI_involvement 7033 (97.3%)
#> total_OBGYN_letter_writers 2.00 [2.00;3.00] 5348
#> number_of_applicant_first_author_publications 0.00 [0.00;2.00] 7226
#> Advance_Degree: 7226
#> M.B.A. 34 (0.47%)
#> No Advanced Degree 6678 (92.4%)
#> Ph.D. 29 (0.40%)
#> Other 485 (6.71%)
#> Type_of_medical_school: 7226
#> U.S. Public School 2961 (41.0%)
#> International School 1335 (18.5%)
#> Osteopathic School 1227 (17.0%)
#> Osteopathic School,International School 1 (0.01%)
#> U.S. Private School 1702 (23.6%)
#> work_exp_count 2.00 [0.00;4.00] 7226
#> Volunteer_exp_count 5.00 [0.00;9.00] 7226
#> Research_exp_count 1.00 [0.00;3.00] 7226
#> ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
#> [1] "Match_Status"
#> [2] "Gender"
#> [3] "Age"
#> [4] "Medical_Degree"
#> [5] "Military_Service_Obligation"
#> [6] "Visa_Sponsorship_Needed"
#> [7] "Medical_Education_or_Training_Interrupted"
#> [8] "Misdemeanor_Conviction"
#> [9] "Alpha_Omega_Alpha"
#> [10] "Gold_Humanism_Honor_Society"
#> [11] "Couples_Match"
#> [12] "Count_of_Oral_Presentation"
#> [13] "Count_of_Peer_Reviewed_Book_Chapter"
#> [14] "Count_of_Peer_Reviewed_Journal_Articles_Abstracts"
#> [15] "Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published"
#> [16] "Count_of_Poster_Presentation"
#> [17] "Year"
#> [18] "USMLE_Pass_Fail_replaced"
#> [19] "Location"
#> [20] "Meeting_Name_Presented"
#> [21] "TopNIHfunded"
#> [22] "Higher_Education_Institution"
#> [23] "Higher_Education_Degree"
#> [24] "Interest_Group"
#> [25] "Language_Fluency"
#> [26] "ACLS"
#> [27] "Other_Service_Obligation"
#> [28] "Photo_Received"
#> [29] "Tracks_Applied_by_Applicant_1"
#> [30] "AMA"
#> [31] "ACOG"
#> [32] "Latin_Honors"
#> [33] "Scholarship"
#> [34] "Grant"
#> [35] "Phi_beta_kappa"
#> [36] "NCAA"
#> [37] "Boy_Scouts"
#> [38] "Valedictorian"
#> [39] "NIH"
#> [40] "NCI"
#> [41] "total_OBGYN_letter_writers"
#> [42] "number_of_applicant_first_author_publications"
#> [43] "Advance_Degree"
#> [44] "Type_of_medical_school"
#> [45] "work_exp_count"
#> [46] "Volunteer_exp_count"
#> [47] "Research_exp_count"
1- Split Data
```{r}
# set.seed(123)
# d_part <- createDataPartition(y = reduced_Data$Match_Status, p = 0.70, list = FALSE)
# tstSamples <- reduced_Data[-d_part, ]
# trgSamples <- reduced_Data[d_part, ]
```

```{r}
# # set lambda sequence
# lambda <- 10^seq(-3, 3, length = 100)
# # train model
# lasso <- train(Match_Status ~ ., data = trgSamples, method = "glmnet", family = "binomial",
#                trControl = trainControl("cv", number = 10),
#                tuneGrid = expand.grid(alpha = 1, lambda = lambda))
```

```{r}
# # Model coefficients
# coefs <- coef(lasso$finalModel, lasso$bestTune$lambda)
# ix <- which(abs(coefs[, 1]) > 0)
# length(ix)
# coefs[ix, 1, drop = FALSE]
```

```{r}
# # Make predictions
# predictions <- lasso %>% predict(tstSamples)
# cm <- confusionMatrix(predictions, tstSamples$Match_Status, positive = "Matched")
# Accuracy <- cm$overall[1]
# Sensitivity <- cm$byClass[1]
# Specificity <- cm$byClass[2]
# Precision <- cm$byClass[5]
# F1 <- cm$byClass[7]
# modelPerformance <- data.frame(Accuracy = cm$overall[1], Sensitivity = cm$byClass[1],
#                                Specificity = cm$byClass[2], Precision = cm$byClass[5],
#                                F1 = cm$byClass[7])
# modelPerformance
```
1- Create dummy variables from categorical data
2-Drop reference dummies
3-Combine dummies with numeric variables then convert them to a matrix X
4- Outcome Variable
5- Create a group vector that distinguishes the groups
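The five steps above can be sketched in base R as follows. This is only an illustration under assumptions: the toy data frame and its values are invented (only the column names are borrowed from the tables above), and the report's actual preprocessing code is not shown, so it may differ.

```r
# Toy data: column names taken from the report, values invented for illustration.
toy <- data.frame(
  Match_Status = factor(c("Matched", "Not_Matched", "Matched", "Not_Matched")),
  Age          = c(27, 30, 26, 29),
  Gender       = factor(c("Female", "Male", "Female", "Female")),
  Location     = factor(c("Duke", "Utah", "CU", "Duke"))
)
predictors <- toy[, setdiff(names(toy), "Match_Status")]

# Steps 1-3: model.matrix() expands factors into dummy columns, drops the
# reference level of each factor, and returns a numeric matrix.
mm <- model.matrix(~ ., data = predictors)
X  <- mm[, -1]                     # drop the intercept column

# Step 4: outcome coded 1 = Matched, 0 = Not_Matched.
y <- as.integer(toy$Match_Status == "Matched")

# Step 5: one group id per column of X; dummies that come from the same
# factor share an id, which is what a group LASSO needs.
group <- attr(mm, "assign")[-1]

colnames(X)
group
```

Here `LocationDuke` and `LocationUtah` receive the same group id, so a group penalty keeps or drops the whole `Location` factor at once.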
#> [1] 45
#> (Intercept)
#> 2.7028166060
#> Age
#> -0.0644008290
#> Count_of_Oral_Presentation
#> 0.0068303907
#> Count_of_Peer_Reviewed_Book_Chapter
#> -0.0804134625
#> Count_of_Peer_Reviewed_Journal_Articles_Abstracts
#> 0.0214819797
#> Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published
#> 0.0683113347
#> total_OBGYN_letter_writers
#> 0.2972333604
#> number_of_applicant_first_author_publications
#> 0.0099665971
#> work_exp_count
#> -0.0150879310
#> Volunteer_exp_count
#> 0.0007676075
#> Research_exp_count
#> 0.0644441525
#> Military_Service_Obligation_Yes
#> -0.0318256987
#> Visa_Sponsorship_Needed_Yes
#> -0.5105278656
#> Medical_Education_or_Training_Interrupted_Yes
#> -0.4695121962
#> Misdemeanor_Conviction_Yes
#> -0.0438318999
#> Gold_Humanism_Honor_Society_Yes
#> 0.2048801515
#> Year_2017
#> -0.4448468466
#> Year_2018
#> -0.7920745253
#> Year_2019
#> -1.0012131967
#> USMLE_Pass_Fail_replaced_Failed attempt
#> -1.6106680799
#> Location_BSW
#> -0.0423034916
#> Location_CCAG
#> 0.3172336056
#> Location_CU
#> 0.2536836760
#> Location_Duke
#> 0.4242711983
#> Location_Truman
#> -0.0905061764
#> Location_U_Washington
#> 0.3895741126
#> Location_UAB
#> -0.0098410791
#> Meeting_Name_Presented_Did not present at a meeting
#> -0.0681419911
#> TopNIHfunded_Attended a NIH top-funded Medical School
#> 0.2068550243
#> Higher_Education_Institution_Ivy League
#> 0.2631380197
#> Interest_Group_No Interest Group
#> -0.0431535955
#> Other_Service_Obligation_No
#> -0.0769676431
#> Other_Service_Obligation_Yes
#> 0.0769676431
#> Photo_Received_No
#> -0.2148202210
#> Tracks_Applied_by_Applicant_1_Applying for a Preliminary Position
#> -0.5586389384
#> AMA_No AMA Membership
#> 0.0743980900
#> ACOG_No ACOG Membership
#> -0.4428659113
#> NCAA_NCAA_athlente
#> 0.1496755231
#> NIH_NIH_present
#> -0.2271965431
#> Advance_Degree_M.B.A.
#> -0.0151805770
#> Advance_Degree_No Advanced Degree
#> 0.1737718800
#> Advance_Degree_Ph.D.
#> 0.1239380303
#> Type_of_medical_school_U.S. Public School
#> -0.1896174159
#> Type_of_medical_school_International School
#> -1.2966275262
#> Type_of_medical_school_Osteopathic School
#> -0.4307976573
1- Prepare training scheme
Add brier score to matrix
2- train models on train data
##Model ensemble
Because the caret ensemble tooling does not support custom models, the CATBoost model is added manually.
#> logit lasso xgb nnet rf rpart CATboost
#> logit 1.0000000 0.9999351 0.8880047 0.9968440 0.9971100 0.8427780 0.9814025
#> lasso 0.9999351 1.0000000 0.8930157 0.9968824 0.9976580 0.8462075 0.9835208
#> xgb 0.8880047 0.8930157 1.0000000 0.8800297 0.9156156 0.9360580 0.9550694
#> nnet 0.9968440 0.9968824 0.8800297 1.0000000 0.9902172 0.8079485 0.9811130
#> rf 0.9971100 0.9976580 0.9156156 0.9902172 1.0000000 0.8805577 0.9881312
#> rpart 0.8427780 0.8462075 0.9360580 0.8079485 0.8805577 1.0000000 0.8796145
#> CATboost 0.9814025 0.9835208 0.9550694 0.9811130 0.9881312 0.8796145 1.0000000
All models are highly correlated, so they are not good candidates for an ensemble.
3- Model Comparison
-Summary Table
#>
#> Call:
#> summary.resamples(object = results)
#>
#> Models: Logit, Lasso, RF, XGB, CAT, Nnet
#> Number of resamples: 4
#>
#> Acc
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.6074271 0.6845423 0.7167154 0.7034744 0.7356475 0.7730399 0
#> Lasso 0.5941645 0.6822781 0.7127151 0.6978148 0.7282518 0.7716644 0
#> RF 0.6127321 0.6805682 0.7062614 0.6964788 0.7221720 0.7606602 0
#> XGB 0.6312997 0.6699357 0.6894637 0.6861881 0.7057162 0.7345254 0
#> CAT 0.5915119 0.6755596 0.7228125 0.7015126 0.7487654 0.7689133 0
#> Nnet 0.6233422 0.6891400 0.7141936 0.7055045 0.7305581 0.7702889 0
#>
#> AUCPR
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.6963751 0.7007829 0.7321083 0.7420537 0.7733791 0.8076228 0
#> Lasso 0.6974151 0.7018986 0.7325478 0.7432135 0.7738627 0.8103436 0
#> RF 0.6774010 0.6791621 0.7061786 0.7208920 0.7479084 0.7938099 0
#> XGB 0.6682657 0.6841082 0.7099361 0.7231657 0.7489936 0.8045248 0
#> CAT 0.6773208 0.6948163 0.7222133 0.7395354 0.7669325 0.8363942 0
#> Nnet 0.6948113 0.7007222 0.7292614 0.7397723 0.7683115 0.8057551 0
#>
#> AUCROC
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.6816236 0.7447619 0.7705780 0.7522735 0.7780897 0.7863145 0
#> Lasso 0.6822151 0.7446997 0.7707367 0.7529009 0.7789378 0.7879152 0
#> RF 0.6735818 0.7181008 0.7451371 0.7349916 0.7620280 0.7761104 0
#> XGB 0.6749761 0.7188217 0.7396482 0.7323707 0.7531972 0.7752101 0
#> CAT 0.6636387 0.7321357 0.7577719 0.7468202 0.7724563 0.8080982 0
#> Nnet 0.6769196 0.7442541 0.7675981 0.7490198 0.7723638 0.7839636 0
#>
#> Brier
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.1722116 0.1844084 0.1896108 0.1950277 0.2002300 0.2286775 0
#> Lasso 0.1707416 0.1833249 0.1880585 0.1947664 0.1995000 0.2322068 0
#> RF 0.1799489 0.1905923 0.1949141 0.2004738 0.2047956 0.2321181 0
#> XGB 0.1828559 0.1941797 0.2032792 0.2078545 0.2169541 0.2420038 0
#> CAT 0.1835637 0.1873832 0.1928299 0.1993416 0.2047883 0.2281426 0
#> Nnet 0.1659708 0.1873544 0.1946628 0.1957851 0.2030935 0.2278441 0
#>
#> F
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.6062053 0.6348846 0.6460317 0.6474995 0.6586466 0.6917293 0
#> Lasso 0.6084906 0.6259813 0.6367488 0.6463212 0.6570888 0.7032967 0
#> RF 0.5837321 0.6188069 0.6327493 0.6345299 0.6484722 0.6888889 0
#> XGB 0.5563218 0.5931694 0.6258812 0.6217972 0.6545089 0.6791045 0
#> CAT 0.5820896 0.6007976 0.6110168 0.6351463 0.6453655 0.7364621 0
#> Nnet 0.5546667 0.5790736 0.6033996 0.6097213 0.6340473 0.6774194 0
#>
#> Kappa
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.2207045 0.3732169 0.4250306 0.3808895 0.4327032 0.4527925 0
#> Lasso 0.1952424 0.3508282 0.4163395 0.3700956 0.4356068 0.4524609 0
#> RF 0.2288660 0.3559897 0.4038631 0.3647268 0.4126002 0.4223150 0
#> XGB 0.2650454 0.3256951 0.3584094 0.3443734 0.3770877 0.3956295 0
#> CAT 0.1866121 0.3332217 0.4072066 0.3716872 0.4456721 0.4857235 0
#> Nnet 0.2434640 0.3596681 0.4083276 0.3752422 0.4239016 0.4408496 0
#>
#> Precision
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.5738397 0.6877456 0.7417770 0.7076257 0.7616571 0.7731092 0
#> Lasso 0.5614754 0.6778689 0.7268900 0.6942901 0.7433112 0.7619048 0
#> RF 0.5852535 0.6721754 0.7162210 0.6934482 0.7374938 0.7560976 0
#> XGB 0.6047619 0.6263214 0.6581779 0.6682959 0.7001525 0.7520661 0
#> CAT 0.5668203 0.6970848 0.7433735 0.7095456 0.7558343 0.7846154 0
#> Nnet 0.6273292 0.7224515 0.7740260 0.7517661 0.8033406 0.8316832 0
#>
#> Recall
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.5204918 0.5505482 0.5932087 0.6125196 0.6551801 0.7431694 0
#> Lasso 0.5286885 0.5467984 0.6029481 0.6208047 0.6769544 0.7486339 0
#> RF 0.5000000 0.5405928 0.5933884 0.5951915 0.6479871 0.6939891 0
#> XGB 0.4959016 0.5318362 0.5814310 0.5881882 0.6377830 0.6939891 0
#> CAT 0.4795082 0.5035755 0.5918645 0.5892787 0.6775677 0.6938776 0
#> Nnet 0.4262295 0.5008873 0.5388429 0.5188360 0.5567916 0.5714286 0
-Plots
4- Variable importance
Plot Variable importance
-Statistical Significance
There is no significant difference between models.
-Prediction
1- Lift plot
2-Calibration plot
3- Logistic Calibration plot
Plot Calibration
##Table and Plot of CI
##Model Interaction
##Decision Curve
##Log Loss
2- train models on train data
-Summary Table
#>
#> Call:
#> summary.resamples(object = results3)
#>
#> Models: Logit, Lasso, RF, XGB, CAT, Nnet
#> Number of resamples: 4
#>
#> logLoss
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.5181535 0.5482928 0.5597231 0.5722876 0.5837178 0.6515505 0
#> Lasso 0.5196284 0.5465371 0.5572951 0.5714494 0.5822074 0.6515790 0
#> RF 0.5401670 0.5630581 0.5736689 0.5864182 0.5970290 0.6581681 0
#> XGB 0.5516856 0.5747121 0.6008363 0.6113223 0.6374465 0.6919308 0
#> CAT 0.5569512 0.5686929 0.5739522 0.5888041 0.5940634 0.6503607 0
#> Nnet 0.5082789 0.5558810 0.5719629 0.5740597 0.5901416 0.6440341 0
-Plots
-Statistical Significance
#>
#> Call:
#> summary.diff.resamples(object = diffs)
#>
#> p-value adjustment: bonferroni
#> Upper diagonal: estimates of the difference
#> Lower diagonal: p-value for H0: difference = 0
#>
#> logLoss
#> Logit Lasso RF XGB CAT Nnet
#> Logit 0.0008382 -0.0141307 -0.0390347 -0.0165165 -0.0017721
#> Lasso 1.0000 -0.0149688 -0.0398729 -0.0173547 -0.0026103
#> RF 0.3264 0.2323 -0.0249041 -0.0023859 0.0123585
#> XGB 0.1851 0.1824 0.7644 0.0225182 0.0372626
#> CAT 1.0000 1.0000 1.0000 1.0000 0.0147444
#> Nnet 1.0000 1.0000 1.0000 0.3922 1.0000
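A comparison table of this kind can be approximated in base R with paired t-tests across resamples, followed by a Bonferroni adjustment. The sketch below is not the code that produced the table above: the per-resample log loss values are hypothetical, and only three of the six models are included to keep it short.

```r
# Hypothetical log loss per resample (rows) and model (columns).
ll <- cbind(
  Logit = c(0.52, 0.55, 0.57, 0.65),
  Lasso = c(0.52, 0.54, 0.56, 0.66),
  XGB   = c(0.55, 0.58, 0.62, 0.69)
)

# All unordered pairs of models.
model_pairs <- combn(colnames(ll), 2)

# Paired t-test per pair: the same resamples are used for every model.
raw_p <- apply(model_pairs, 2, function(m) {
  t.test(ll[, m[1]], ll[, m[2]], paired = TRUE)$p.value
})
names(raw_p) <- apply(model_pairs, 2, paste, collapse = " vs ")

# Bonferroni: multiply each p-value by the number of comparisons, capped at 1.
adj_p <- p.adjust(raw_p, method = "bonferroni")
adj_p
```

With only four resamples per model, such tests have little power, which is worth keeping in mind when reading adjusted p-values close to 1.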
#> [1] "y = -3.08 - 0.38*Age - 0.92*Visa_Sponsorship_NeededYes - 0.59*Medical_Education_or_Training_InterruptedYes + 0.09*Misdemeanor_ConvictionYes + 0.32*Gold_Humanism_Honor_SocietyYes + 0.05*Count_of_Oral_Presentation - 0.03*Count_of_Peer_Reviewed_Book_Chapter + 0.11*Count_of_Peer_Reviewed_Journal_Articles_Abstracts + 0.08*Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published + 1.4*USMLE_Pass_Fail_replacedPassed + 0.11*LocationCCAG - 0.24*LocationCU + 0.17*LocationDuke - 0.47*LocationTruman - 0.35*LocationU_Washington - 0.58*LocationUAB - 0.3*LocationUtah + 0.07*Meeting_Name_PresentedPresented at a meeting - 0.24*TopNIHfundedDid not attend NIH top-funded medical school + 0.03*Higher_Education_InstitutionNo Ivy League Education + 0.06*Interest_GroupNo Interest Group + 0.16*Other_Service_ObligationYes + 0.77*Photo_ReceivedYes + 0.33*Tracks_Applied_by_Applicant_1Categorical Applicant + 0.06*AMANo AMA Membership - 0.38*ACOGNo ACOG Membership + 0.33*total_OBGYN_letter_writers + 0.07*number_of_applicant_first_author_publications + 0.38*Advance_DegreeNo Advanced Degree - 0.04*Advance_DegreeOther + 0.66*Advance_DegreePh.D. + 0.43*Type_of_medical_schoolOsteopathic School + 1.3*Type_of_medical_schoolU.S. Private School + 1.03*Type_of_medical_schoolU.S. Public School - 0.04*work_exp_count + 0.02*Volunteer_exp_count + 0.08*Research_exp_count + 0.39*Military_Service_ObligationYes - 0.28*NCAANot_a_NCAA_athlete + 0.67*NIHNo_NIH_involvement"
###Odds Ratio table
This cross-sectional study examines the association between acceptance into US obstetrics and gynecology residency programs and applicant characteristics from 2017 to 2020. A total of 7226 applicants participated in the survey. (More information to be added, e.g., whether the survey was online and how groups were selected.)
The survey consists of 52 items that describe 7 main categories. The first category is demographics, which includes gender, age, nationality (US/Canadian), military service, and language fluency. The second category is educational excellence, such as higher education institution, degree, and whether the applicant received a scholarship, grants, Phi Beta Kappa membership, or an advanced degree. The third category is personal excellence and honors, such as misdemeanor conviction, Alpha Omega Alpha, Boy Scouts, meeting presentation, oral presentation count, interest group, and NCAA. The fourth category is medical school and training, such as medical degree, medical school type, whether the medical school is NIH funded, previous medical training, interrupted medical training or education, Gold Humanism Honor Society, AMA and ACOG membership, ACLS, and PALS. The fifth category is USMLE/match status, such as passing the USMLE, visa sponsorship needed, couples match, tracks applied by applicant, and LORs count. The sixth category is work experience, which includes counts of work, volunteer, and research experience. The seventh category is academic research experience, which includes counts of peer-reviewed book chapters, peer-reviewed journal articles or abstracts, poster presentations, peer-reviewed journal articles or abstracts other than published, the number of applicant first-author publications, and records count. The outcome variable is match status.
Data reduction was conducted in two phases. First, highly correlated variables were removed using a correlation threshold of 0.7; five variables were removed from the data. Second, group LASSO (Least Absolute Shrinkage and Selection Operator) was used for automatic variable selection. The LASSO penalizes the absolute size of the regression coefficients, which drives the coefficients of irrelevant variables to zero [tibshirani1996regression]. The final reduced data include 29 variables.
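The first phase can be illustrated with a small base R sketch. It is a greedy filter written for this example, not the report's actual code; `drop_correlated()` is a hypothetical helper, and the toy columns `a`, `b`, and `c` are invented. The caret package provides `findCorrelation()` for the same purpose.

```r
# Greedy correlation filter: repeatedly take the most correlated pair and drop
# the member that is, on average, more correlated with the other variables.
drop_correlated <- function(X, cutoff = 0.7) {
  cm <- abs(cor(X))
  diag(cm) <- 0
  dropped <- character(0)
  while (max(cm) > cutoff) {
    idx   <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    pair  <- colnames(cm)[idx]
    worst <- pair[which.max(rowMeans(cm[pair, , drop = FALSE]))]
    dropped <- c(dropped, worst)
    keep <- setdiff(colnames(cm), worst)
    cm <- cm[keep, keep, drop = FALSE]
  }
  dropped
}

# Toy numeric matrix: b is almost a copy of a, c is only weakly related to both.
X <- cbind(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(1, 2, 3, 4, 5, 7),
  c = c(1, -1, 1, -1, 1, -1)
)
to_drop   <- drop_correlated(X, cutoff = 0.7)
X_reduced <- X[, setdiff(colnames(X), to_drop), drop = FALSE]
to_drop
```

In this toy case only `b` is removed, because `a` and `b` are the only pair whose absolute correlation exceeds 0.7.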
Because the data were missing at random, the observed values can provide a valid reference for the missing ones. Missing values were therefore imputed using predictive mean matching, which replaces each missing value with an observed value from a case with a similar predicted mean.
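The idea behind predictive mean matching can be shown with a deliberately simplified, single-variable sketch. The report does not say which implementation was used (the mice package with `method = "pmm"` is a common choice), so `pmm_impute()` below is a hypothetical helper; it also omits the random draw of regression parameters that full implementations perform.

```r
# Simplified predictive mean matching for one incomplete variable y, given a
# fully observed predictor x. k is the number of candidate donors.
pmm_impute <- function(y, x, k = 5) {
  obs  <- !is.na(y)
  fit  <- lm(y ~ x)                              # rows with missing y are dropped
  pred <- predict(fit, newdata = data.frame(x = x))
  donor_y    <- y[obs]
  donor_pred <- pred[obs]
  for (i in which(!obs)) {
    # The k observed cases whose predicted mean is closest to that of case i.
    nearest <- order(abs(donor_pred - pred[i]))[seq_len(min(k, sum(obs)))]
    # Pick one donor at random and copy its observed value.
    y[i] <- donor_y[nearest[sample.int(length(nearest), 1)]]
  }
  y
}

set.seed(1)
y <- c(1, 2, NA, 4, 5)
x <- 1:5
imputed <- pmm_impute(y, x, k = 2)
imputed
```

Because the imputed value is always copied from an observed case, it stays within the range of values that actually occur in the data.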
The data were partitioned into training and test sets based on the ‘Year’ variable for temporal validation. Data from 2017 and 2018 were used as training data, while data from 2019 and 2020 were used as test data.
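A minimal sketch of such a temporal split is given below. The toy rows are invented; only the `Year` and `Match_Status` column names come from the report.

```r
# Toy rows; only the Year column matters for the split.
toy <- data.frame(
  Year = c(2017, 2018, 2019, 2020, 2018, 2019),
  Match_Status = c("Matched", "Not_Matched", "Matched",
                   "Matched", "Not_Matched", "Matched")
)

train_years <- c(2017, 2018)
test_years  <- c(2019, 2020)

train <- toy[toy$Year %in% train_years, ]
test  <- toy[toy$Year %in% test_years, ]

# No application cycle may appear in both sets.
stopifnot(length(intersect(train$Year, test$Year)) == 0)
```

Under this rule, rows from any other year (the descriptive table above also lists a 2016 cycle) fall into neither set.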
Descriptive analysis of the variables by match status (the outcome variable) was performed in the form of frequency tables for categorical variables and median [IQR] for continuous variables. The association between variables was assessed by the Chi-square test of independence for categorical variables and the t test for continuous variables. A P value < 0.05 was deemed significant.
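As a worked example of the categorical test, the Gender counts from the descriptive table above can be passed to `chisq.test()`. The counts are taken from that table; the use of the default continuity correction is an assumption, since the report does not state its settings.

```r
# Gender by match status, counts copied from the descriptive table.
gender <- matrix(
  c(3327, 592,    # Matched: Female, Male
    2585, 722),   # Not_Matched: Female, Male
  nrow = 2,
  dimnames = list(Gender = c("Female", "Male"),
                  Match_Status = c("Matched", "Not_Matched"))
)

res <- chisq.test(gender)   # Yates continuity correction is applied by default
res$p.value < 0.001
```

The column totals reproduce the group sizes of the table (3919 matched, 3307 not matched), and the p-value falls below 0.001, in line with the `<0.001` reported for Gender.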
###Classification Modeling
Multiple classification models were fitted to find the best classifier for the data. Six models were used, namely logistic regression, LASSO logistic regression, Random Forest, XGBoost (Extreme Gradient Boosting), CATBoost, and a neural network. To obtain out-of-sample performance estimates, 10-fold cross-validation with 3 repeats was performed using the caret package in R. Location groups were preserved in cross-validation sampling. The models are explained below:
Logistic classifier: Logistic regression is a simple classifier that is used as the base model because it is easy to interpret.
LASSO logistic classifier: It penalizes the absolute size of the regression coefficients, based on the value of a tuning parameter λ. A log-spaced sequence of λ values from 10^-3 to 10^3 was used to tune λ.
Random Forest: Random Forest is a tree-based ensemble technique. It was selected because it can provide better accuracy on categorical data and handles categorical predictors more naturally than logistic regression. Of the 47 variables in our data, 36 are categorical.
XGBoost: XGBoost is an ensemble method that boosts trees using a gradient descent algorithm. It is often faster and more accurate than random forest. The learning rate was set to 0.1, the maximum depth of each tree was tuned over 10, 15, 20, and 25, and the fraction of observations used for each individual tree was set to 0.8.
CATBoost: CATBoost is designed to handle categorical variables well when building boosted decision trees. The learning rate was set to 0.1, and the number of splits for numerical features was set to 30.
Neural net: A neural network can outperform decision tree methods if the training sample is sufficiently large.
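One possible reading of "location groups were preserved in cross-validation sampling" is that all applicants from one location are kept in the same fold. The base R sketch below illustrates that reading only; `make_group_folds()` is a hypothetical helper, and the report's actual resampling indices may have been built differently (caret offers `groupKFold()` for this purpose).

```r
# Assign every row to a fold such that all rows of a group share one fold.
make_group_folds <- function(groups, k) {
  ids <- unique(as.character(groups))
  fold_of_group <- sample(rep_len(seq_len(k), length(ids)))
  names(fold_of_group) <- ids
  unname(fold_of_group[as.character(groups)])
}

set.seed(42)
# Location labels from the report, repeated to mimic several applicants each.
location <- rep(c("BSW", "CCAG", "CU", "Duke", "Truman",
                  "U_Washington", "UAB", "Utah"), times = 3)
folds <- make_group_folds(location, k = 4)

# Each location should map to exactly one fold.
table(location, folds)
```

With eight locations and four folds, each fold holds out two whole locations, so performance is always estimated on locations the model has not seen during training.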
The Brier score was used along with other metrics to evaluate the performance of the models as well as the importance of covariates. The Brier score is a metric that measures the accuracy of predicted probabilities. It ranges from 0 to 1, with 0 indicating perfect accuracy. In addition, calibration, lift, and decision curves were used to compare models.
* The lift curve describes how well a model ranks samples for one class.
* The calibration curve can be used to characterize how consistent the predicted class probabilities are with the observed event rates.
* The decision curve quantifies the net benefit of using each model.
* The concordance index is a measure of goodness of fit for binary outcomes in a logistic regression model. It ranges from 0 to 1, with 1 indicating the strongest model.
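The probability-based metrics used in this report can be written down in a few lines of base R. The functions below are a sketch using the standard textbook definitions, not the exact code behind the tables; the predictions `p` and outcomes `y` are invented.

```r
# p: predicted probability of the positive class; y: 1 = Matched, 0 = Not_Matched.
brier_score <- function(p, y) mean((p - y)^2)

log_loss <- function(p, y, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)   # avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Concordance index for a binary outcome: the share of (positive, negative)
# pairs in which the positive case gets the higher prediction (ties count 1/2).
c_index <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

p <- c(0.9, 0.8, 0.3, 0.2)
y <- c(1, 0, 1, 0)

brier <- brier_score(p, y)
cidx  <- c_index(p, y)
ll    <- log_loss(p, y)
```

A useful reference point: a model that always predicts 0.5 has a Brier score of 0.25 and a log loss of log(2), about 0.693, so scores are informative only to the extent that they fall below those values.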
All statistical tests were performed using R version 4.0.3.
[0.4543579- 0.8038737]
The results show the five most important variables of each model:
* Logistic regression: USMLE_Pass_Fail_replacedPassed, Visa_Sponsorship_NeededYes, Type_of_medical_schoolU.S. Private School, Type_of_medical_schoolU.S. Public School, and Photo_ReceivedYes.
* LASSO regression: Type_of_medical_schoolU.S. Private School, USMLE_Pass_Fail_replacedPassed, Visa_Sponsorship_NeededYes, Type_of_medical_schoolU.S. Public School, and Medical_Education_or_Training_InterruptedYes.
* Random forest: Age, Volunteer_exp_count, work_exp_count, total_OBGYN_letter_writers, and Research_exp_count.
* XGBoost: Volunteer_exp_count, Age, number_of_applicant_first_author_publications, work_exp_count, and Research_exp_count.
* CATBoost: Type_of_medical_school, Age, total_OBGYN_letter_writers, ACOG, and Medical_Education_or_Training_Interrupted.
* Neural network: USMLE_Pass_Fail_replacedPassed, Type_of_medical_schoolU.S. Private School, Visa_Sponsorship_NeededYes, Type_of_medical_schoolU.S. Public School, and Photo_ReceivedYes.
The lift curves show that all models perform very similarly, with around 50.6% of samples needing to be sampled when ordered by predicted probability.
The calibration plots show that CATBoost and the neural network fail to predict the matched class at probabilities higher than 0.75. XGBoost is overconfident: its predicted probabilities for the matched class are too low below 0.5 and too high above 0.5, in contrast to the CATBoost model. The LASSO model appears to be the best calibrated, followed by the logistic model.
Decision curves show that the random forest and XGBoost models have the lowest net benefit. The other models are close to each other.
The concordance index (CI) results show that the random forest model is the strongest model.
The odds ratios were calculated from the logistic regression model coefficients. They show that matching is significantly associated with younger age (OR = 0.68, CI = [0.60, 0.77], P = $1.4 \times 10^{-9}$). Matching is also less likely for applicants who need visa sponsorship (OR = 0.40, CI = [0.27, 0.60], P = $1.3 \times 10^{-5}$), have interrupted medical training or education (OR = 0.55, CI = [0.43, 0.71], P = $2.2 \times 10^{-6}$), did not attend a top NIH-funded medical school (OR = 0.78, CI = [0.64, 0.96], P = 0.022), or do not have ACOG membership (OR = 0.68, CI = [0.57, 0.82], P = $2.9 \times 10^{-5}$). In contrast, matching is more likely for applicants with Gold Humanism Honor Society membership (OR = 1.37, CI = [1.05, 1.79], P = 0.019), who passed the USMLE exam (OR = 4.04, CI = [1.65, 9.89], P = 0.0022), who applied to categorical tracks (OR = 1.39, CI = [1.14, 1.70], P = 0.0013), with a higher OBGYN letter count (OR = 1.39, CI = [1.27, 1.52], P = $1.4 \times 10^{-12}$), who attended osteopathic schools (OR = 1.53, CI = [1.11, 2.11], P = 0.0095), who attended US private schools (OR = 3.67, CI = [2.63, 5.13], P = $2.6 \times 10^{-14}$), or who attended US public schools (OR = 2.81, CI = [2.05, 3.84], P = $1.1 \times 10^{-10}$). The count of peer-reviewed book chapters was not significantly associated with match status (OR = 0.97, CI = [0.88, 1.06], P = 0.48).
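The link between a logistic regression coefficient and its odds ratio can be made explicit with a short sketch. The coefficient -0.38 for Age is taken from the model equation printed above; the standard error of 0.063 and the 95% Wald interval are assumptions made for illustration, since the report does not state either.

```r
# Odds ratio and Wald confidence interval from a log-odds coefficient.
odds_ratio <- function(beta, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  c(OR    = exp(beta),
    lower = exp(beta - z * se),
    upper = exp(beta + z * se))
}

# Age: coefficient from the printed equation, standard error assumed.
age_or <- odds_ratio(beta = -0.38, se = 0.063)
round(age_or, 2)
```

Under these assumptions the result is close to the odds ratio and interval reported for Age, which suggests (but does not confirm) that the reported intervals are symmetric on the log-odds scale.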