Twelve months ago, four of eight CU medical students applying to Obstetrics and Gynecology residency did not match. A model that predicts a medical student’s chances of matching into an obstetrics and gynecology residency may facilitate improved counseling and fewer unmatched medical students.
Create and validate a predictive model to understand a medical student’s chances of matching into obstetrics and gynecology residency.
I used the package `renv` to make the project more isolated, portable and reproducible. In addition, version control for the project was stored on www.github.com/mufflyt in a private repository called `nomogram`. Lastly, the environment was controlled by using RStudio Cloud (https://rstudio.cloud/spaces/191846/content/3527244).
After we’ve loaded the data in R, we’ll quickly look at the
variables. ‘Match_Status’ is a categorical binary variable, meaning only
two categories are possible.
`all_merged_Feature_engineering1` is a dataframe of the independent and the dependent variables for review. Each variable is contained in a column, and each row represents a single unique medical student. If students applied in more than one year, the most contemporary data was used.
We can see that these data have 7226 observations of 52 features, and
that about 54.2 percent of medical students applying to OB/GYN residency
matched. Let’s create a few plots to get a sense of the data. Remember,
the goal here will be to predict whether a given medical student will
match into OB/GYN residency. We make ‘Not_Matched’ the reference level
in the target variable, and this will make logit models predict the
probability of ‘Matched’.
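A minimal sketch of setting the reference level, using a small made-up vector rather than the study data. With `Not_Matched` as the first factor level, a binomial `glm` would model the probability of `Matched`.

```r
# Toy outcome vector; the values are made up for illustration only.
match_status <- c("Matched", "Not_Matched", "Matched", "Matched", "Not_Matched")

# Make 'Not_Matched' the reference (first) level of the factor.
match_status <- factor(match_status, levels = c("Not_Matched", "Matched"))

# The first level listed is the reference level.
levels(match_status)

# Share of applicants who matched in the toy vector.
prop_matched <- mean(match_status == "Matched")
prop_matched
```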
In this correlation plot we want to look for the bright, large
circles, which immediately show the strong correlations (size and shading
depend on the absolute values of the coefficients; color depends on
direction). This shows whether two features are connected so that one
changes with a predictable trend if you change the other. The closer
this coefficient is to zero the weaker is the correlation. Anything that
you would have to squint to see is usually not worth seeing!
Option A:
Option B: Find correlations for mixtures of continuous, polytomous, and dichotomous variables:
Data reduction was conducted in two phases. First, highly correlated variables were removed using a correlation threshold of 0.7; five variables were removed from the data. Second, group LASSO (Least Absolute Shrinkage and Selection Operator) was used for automatic variable selection. The LASSO penalizes the absolute size of the regression coefficients, driving the coefficients of irrelevant variables to zero [tibshirani1996regression].
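The first phase can be sketched in base R as a greedy filter. This is an illustration under stated assumptions, not the code used in the analysis: the helper `drop_correlated` and the toy columns `x1` to `x4` are hypothetical, and the rule for choosing which member of a correlated pair to drop (the one with the larger mean absolute correlation) is one common choice. The `caret` function `findCorrelation` offers a packaged alternative.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
# Toy numeric data: x2 is nearly a copy of x1, x3 and x4 are independent noise.
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),
  x3 = rnorm(n),
  x4 = rnorm(n)
)

# Hypothetical helper: repeatedly find the most correlated pair and drop the
# member with the larger mean absolute correlation, until no pair exceeds
# the cutoff.
drop_correlated <- function(df, cutoff = 0.7) {
  cm <- abs(cor(df))
  diag(cm) <- 0
  dropped <- character(0)
  while (ncol(cm) > 1 && max(cm) > cutoff) {
    idx <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    pair <- colnames(cm)[idx]
    worst <- pair[which.max(colMeans(cm)[pair])]
    dropped <- c(dropped, worst)
    keep <- setdiff(colnames(cm), worst)
    cm <- cm[keep, keep, drop = FALSE]
  }
  dropped
}

dropped <- drop_correlated(toy, cutoff = 0.7)
reduced <- toy[, setdiff(names(toy), dropped), drop = FALSE]
dropped
```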
reduced_Data2 dataframe

Variable | Not_Matched (N=3307) | Matched (N=3919) | p.overall |
---|---|---|---|
Gender: | | | <0.001 |
Female | 2585 (78.2%) | 3327 (84.9%) | |
Male | 722 (21.8%) | 592 (15.1%) | |
Age | 28.0 [27.0;31.0] | 27.0 [26.0;29.0] | <0.001 |
Medical_Degree: | | | <0.001 |
DO | 704 (21.4%) | 521 (13.3%) | |
MD | 2581 (78.6%) | 3392 (86.7%) | |
Military_Service_Obligation | 41 (1.24%) | 44 (1.12%) | 0.726 |
Visa_Sponsorship_Needed | 365 (11.0%) | 88 (2.25%) | <0.001 |
Medical_Education_or_Training_Interrupted | 601 (18.2%) | 398 (10.2%) | <0.001 |
Misdemeanor_Conviction | 58 (1.75%) | 54 (1.38%) | 0.233 |
Alpha_Omega_Alpha | 187 (5.65%) | 521 (13.3%) | <0.001 |
Gold_Humanism_Honor_Society | 335 (10.1%) | 704 (18.0%) | <0.001 |
Couples_Match | 205 (6.20%) | 406 (10.4%) | <0.001 |
Count_of_Oral_Presentation | 0.00 [0.00;1.00] | 0.00 [0.00;1.00] | <0.001 |
Count_of_Peer_Reviewed_Book_Chapter | 0.00 [0.00;0.00] | 0.00 [0.00;0.00] | 0.028 |
Count_of_Peer_Reviewed_Journal_Articles_Abstracts | 0.00 [0.00;1.00] | 0.00 [0.00;1.00] | <0.001 |
Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published | 0.00 [0.00;0.00] | 0.00 [0.00;1.00] | <0.001 |
Count_of_Poster_Presentation | 1.00 [0.00;2.00] | 1.00 [0.00;3.00] | <0.001 |
Year: | | | <0.001 |
2016 | 148 (4.48%) | 294 (7.50%) | |
2017 | 777 (23.5%) | 1060 (27.0%) | |
2018 | 573 (17.3%) | 711 (18.1%) | |
2019 | 1076 (32.5%) | 1292 (33.0%) | |
2020 | 733 (22.2%) | 562 (14.3%) | |
USMLE_Pass_Fail_replaced: | | | <0.001 |
Failed attempt | 88 (3.01%) | 9 (0.24%) | |
Passed | 2835 (97.0%) | 3741 (99.8%) | |
Location: | | | <0.001 |
BSW | 174 (5.26%) | 129 (3.29%) | |
CCAG | 333 (10.1%) | 270 (6.89%) | |
CU | 1431 (43.3%) | 1449 (37.0%) | |
Duke | 409 (12.4%) | 802 (20.5%) | |
Truman | 170 (5.14%) | 84 (2.14%) | |
U_Washington | 126 (3.81%) | 180 (4.59%) | |
UAB | 130 (3.93%) | 110 (2.81%) | |
Utah | 534 (16.1%) | 895 (22.8%) | |
Meeting_Name_Presented: | | | <0.001 |
Did not present at a meeting | 2534 (76.6%) | 2657 (67.8%) | |
Presented at a meeting | 773 (23.4%) | 1262 (32.2%) | |
TopNIHfunded: | | | <0.001 |
Did not attend NIH top-funded medical school | 2568 (77.7%) | 2247 (57.3%) | |
Attended a NIH top-funded Medical School | 739 (22.3%) | 1672 (42.7%) | |
Higher_Education_Institution: | | | <0.001 |
No Ivy League Education | 3228 (97.6%) | 3698 (94.4%) | |
Ivy League | 79 (2.39%) | 221 (5.64%) | |
Higher_Education_Degree: | | | <0.001 |
Not a B.S. degree | 2132 (64.5%) | 2136 (54.5%) | |
B.S. | 1175 (35.5%) | 1783 (45.5%) | |
Interest_Group: | | | 1.000 |
No Interest Group | 3303 (99.9%) | 3914 (99.9%) | |
Mentions Interest Group | 4 (0.12%) | 5 (0.13%) | |
Language_Fluency: | | | 0.054 |
Speaks English and Another Language | 1778 (75.6%) | 2256 (73.2%) | |
Speaks English | 575 (24.4%) | 825 (26.8%) | |
ACLS | 1125 (47.8%) | 1262 (41.0%) | <0.001 |
Other_Service_Obligation | 19 (0.81%) | 36 (1.17%) | 0.238 |
Photo_Received | 2305 (98.0%) | 3068 (99.6%) | <0.001 |
Tracks_Applied_by_Applicant_1: | | | <0.001 |
Applying for a Preliminary Position | 2203 (66.6%) | 2035 (51.9%) | |
Categorical Applicant | 1104 (33.4%) | 1884 (48.1%) | |
AMA: | | | <0.001 |
No AMA Membership | 2386 (72.1%) | 2302 (58.7%) | |
American Medical Association Member | 921 (27.9%) | 1617 (41.3%) | |
ACOG: | | | <0.001 |
No ACOG Membership | 2291 (69.3%) | 1886 (48.1%) | |
ACOG Member | 1016 (30.7%) | 2033 (51.9%) | |
Latin_Honors: | | | 0.001 |
Latin_honors | 31 (0.94%) | 12 (0.31%) | |
No_cum_laude | 3276 (99.1%) | 3907 (99.7%) | |
Scholarship: | | | <0.001 |
No_scholarship | 2769 (83.7%) | 3033 (77.4%) | |
Scholarship | 538 (16.3%) | 886 (22.6%) | |
Grant: | | | <0.001 |
Grant_funding | 123 (3.72%) | 231 (5.89%) | |
No_Grant_funding | 3184 (96.3%) | 3688 (94.1%) | |
Phi_beta_kappa: | | | <0.001 |
No_Phi_Beta_Kappa | 3275 (99.0%) | 3835 (97.9%) | |
Phi_Beta_Kappa | 32 (0.97%) | 84 (2.14%) | |
NCAA: | | | 0.008 |
NCAA_athlente | 19 (0.57%) | 47 (1.20%) | |
Not_a_NCAA_athlete | 3288 (99.4%) | 3872 (98.8%) | |
Boy_Scouts: | | | 0.269 |
Boy/Girl_Scouts | 9 (0.27%) | 18 (0.46%) | |
Not_a_Boy/Girl_Scout | 3298 (99.7%) | 3901 (99.5%) | |
Valedictorian: | | | 0.044 |
Not_a_Valedictorian | 3287 (99.4%) | 3877 (98.9%) | |
Valedictorian | 20 (0.60%) | 42 (1.07%) | |
NIH: | | | 0.656 |
NIH_present | 24 (0.73%) | 24 (0.61%) | |
No_NIH_involvement | 3283 (99.3%) | 3895 (99.4%) | |
NCI: | | | 0.317 |
NCI_present | 81 (2.45%) | 112 (2.86%) | |
No_NCI_involvement | 3226 (97.6%) | 3807 (97.1%) | |
total_OBGYN_letter_writers | 2.00 [1.00;2.00] | 2.00 [2.00;3.00] | <0.001 |
number_of_applicant_first_author_publications | 0.00 [0.00;1.00] | 1.00 [0.00;3.00] | <0.001 |
Advance_Degree: | | | 0.015 |
M.B.A. | 20 (0.60%) | 14 (0.36%) | |
No Advanced Degree | 3027 (91.5%) | 3651 (93.2%) | |
Ph.D. | 19 (0.57%) | 10 (0.26%) | |
Other | 241 (7.29%) | 244 (6.23%) | |
Type_of_medical_school: | | | . |
U.S. Public School | 1024 (31.0%) | 1937 (49.4%) | |
International School | 1075 (32.5%) | 260 (6.63%) | |
Osteopathic School | 704 (21.3%) | 523 (13.3%) | |
Osteopathic School,International School | 1 (0.03%) | 0 (0.00%) | |
U.S. Private School | 503 (15.2%) | 1199 (30.6%) | |
work_exp_count | 2.00 [0.00;4.00] | 2.00 [0.00;4.00] | 0.052 |
Volunteer_exp_count | 4.00 [0.00;8.00] | 6.00 [2.00;9.00] | <0.001 |
Research_exp_count | 1.00 [0.00;2.00] | 2.00 [0.00;3.00] | <0.001 |
#> [1] "Match_Status"
#> [2] "Gender"
#> [3] "Age"
#> [4] "Medical_Degree"
#> [5] "Military_Service_Obligation"
#> [6] "Visa_Sponsorship_Needed"
#> [7] "Medical_Education_or_Training_Interrupted"
#> [8] "Misdemeanor_Conviction"
#> [9] "Alpha_Omega_Alpha"
#> [10] "Gold_Humanism_Honor_Society"
#> [11] "Couples_Match"
#> [12] "Count_of_Oral_Presentation"
#> [13] "Count_of_Peer_Reviewed_Book_Chapter"
#> [14] "Count_of_Peer_Reviewed_Journal_Articles_Abstracts"
#> [15] "Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published"
#> [16] "Count_of_Poster_Presentation"
#> [17] "Year"
#> [18] "USMLE_Pass_Fail_replaced"
#> [19] "Location"
#> [20] "Meeting_Name_Presented"
#> [21] "TopNIHfunded"
#> [22] "Higher_Education_Institution"
#> [23] "Higher_Education_Degree"
#> [24] "Interest_Group"
#> [25] "Language_Fluency"
#> [26] "ACLS"
#> [27] "Other_Service_Obligation"
#> [28] "Photo_Received"
#> [29] "Tracks_Applied_by_Applicant_1"
#> [30] "AMA"
#> [31] "ACOG"
#> [32] "Latin_Honors"
#> [33] "Scholarship"
#> [34] "Grant"
#> [35] "Phi_beta_kappa"
#> [36] "NCAA"
#> [37] "Boy_Scouts"
#> [38] "Valedictorian"
#> [39] "NIH"
#> [40] "NCI"
#> [41] "total_OBGYN_letter_writers"
#> [42] "number_of_applicant_first_author_publications"
#> [43] "Advance_Degree"
#> [44] "Type_of_medical_school"
#> [45] "work_exp_count"
#> [46] "Volunteer_exp_count"
#> [47] "Research_exp_count"
Recursive feature elimination required more RAM than was available, so it was run locally in a file called ‘Predictive modeling across sets.RMD’.
LASSO, as a feature selection method, focuses on deletion of irrelevant or redundant features.
1- Create dummy variables from categorical data. Here we create a feature matrix where the categorical features are converted to numeric with one-hot encoding; it will be used in `glmnet` when we train logit models with shrinkage.
3- Combine the dummies with the numeric variables, then convert them to a matrix X that includes the dummy variables. With a binary categorical outcome, the only difference is that we must specify `family = "binomial"` when using `glmnet`.
4- Create your X matrix (predictors) and Y vector (outcome variable).
5- Create a group vector that distinguishes the groups.
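The steps above can be sketched in base R on a made-up data frame. Two assumptions are worth flagging: the toy columns only borrow names from the study data, and `model.matrix` uses treatment coding (it drops one reference level per factor), which differs from the full one-hot encoding described above. The group vector is taken from the `assign` attribute, which records the original variable each column came from.

```r
# Toy data; values are made up, column names borrowed for readability.
toy <- data.frame(
  Match_Status = factor(
    c("Matched", "Not_Matched", "Matched", "Not_Matched", "Matched", "Matched"),
    levels = c("Not_Matched", "Matched")
  ),
  Age = c(26, 31, 27, 29, 28, 25),
  Gender = factor(c("Female", "Male", "Female", "Female", "Male", "Female")),
  Location = factor(c("CU", "Duke", "Utah", "CU", "Utah", "Duke"))
)

# Steps 1 and 3: expand factors into dummy columns next to the numeric columns.
mm <- model.matrix(Match_Status ~ ., data = toy)

# Step 4: predictor matrix X (intercept column removed) and 0/1 outcome vector Y.
X <- mm[, -1, drop = FALSE]
Y <- as.integer(toy$Match_Status == "Matched")

# Step 5: group vector; dummy columns from the same variable share an index.
group <- attr(mm, "assign")[-1]

colnames(X)
group
```

A grouped penalty then keeps or drops all dummy columns of one variable together, which is the reason for building `group` at all.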
Use glmnet to conduct the LASSO. This performs k-fold cross-validation for penalized regression models with grouped covariates over a grid of values for the regularization parameter lambda. First we need to find the amount of penalty λ by cross-validation. We will search for the λ that gives the minimum cross-validation error.
#> grLasso-penalized logistic regression with n=4913, p=60
#> At minimum cross-validation error (lambda=0.0033):
#> -------------------------------------------------
#> Nonzero coefficients: 42
#> Nonzero groups: 29
#> Cross-validation error of 1.10
#> Maximum R-squared: 0.22
#> Maximum signal-to-noise ratio: 0.23
#> Prediction error at lambda.min: 0.266
Cross-validation result. Use cross-validation to identify the best value of λ for the LASSO model. It is hoped (because it is not always the case in practice) that the cross-validation curve is U-shaped, like the one shown here, so that we can spot the optimal value of λ, i.e., the one that corresponds to the lowest dip point.
Next, view the best model and the corresponding coefficients. `cv.fit$lambda.min` is the lambda value that gives the model with the smallest cross-validation error.
This extracts the fitted regression parameters of the logistic regression model at the selected lambda value (0.003) and picks out the predictors with a nonzero coefficient.
#> [1] 43
#> (Intercept)
#> 2.615970
#> Age
#> -0.064361
#> Count_of_Oral_Presentation
#> 0.005322
#> Count_of_Peer_Reviewed_Book_Chapter
#> -0.061396
#> Count_of_Peer_Reviewed_Journal_Articles_Abstracts
#> 0.019971
#> Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published
#> 0.065676
#> total_OBGYN_letter_writers
#> 0.294902
#> number_of_applicant_first_author_publications
#> 0.009723
#> work_exp_count
#> -0.014067
#> Volunteer_exp_count
#> 0.000398
#> Research_exp_count
#> 0.064153
#> Visa_Sponsorship_Needed_Yes
#> -0.502786
#> Medical_Education_or_Training_Interrupted_Yes
#> -0.465621
#> Misdemeanor_Conviction_Yes
#> -0.020187
#> Gold_Humanism_Honor_Society_Yes
#> 0.200768
#> Year_2017
#> -0.384054
#> Year_2018
#> -0.692776
#> Year_2019
#> -0.894986
#> USMLE_Pass_Fail_replaced_Failed attempt
#> -1.565250
#> Location_BSW
#> -0.081417
#> Location_CCAG
#> 0.254194
#> Location_CU
#> 0.188666
#> Location_Duke
#> 0.355834
#> Location_Truman
#> -0.126899
#> Location_U_Washington
#> 0.319443
#> Location_UAB
#> -0.051413
#> Meeting_Name_Presented_Did not present at a meeting
#> -0.062828
#> TopNIHfunded_Attended a NIH top-funded Medical School
#> 0.207639
#> Higher_Education_Institution_Ivy League
#> 0.253969
#> Other_Service_Obligation_No
#> -0.060423
#> Other_Service_Obligation_Yes
#> 0.060423
#> Photo_Received_No
#> -0.158495
#> Tracks_Applied_by_Applicant_1_Applying for a Preliminary Position
#> -0.521196
#> AMA_No AMA Membership
#> 0.061868
#> ACOG_No ACOG Membership
#> -0.435343
#> NCAA_NCAA_athlente
#> 0.123874
#> NIH_NIH_present
#> -0.196396
#> Advance_Degree_M.B.A.
#> -0.013734
#> Advance_Degree_No Advanced Degree
#> 0.158927
#> Advance_Degree_Ph.D.
#> 0.111481
#> Type_of_medical_school_U.S. Public School
#> -0.189360
#> Type_of_medical_school_International School
#> -1.287720
#> Type_of_medical_school_Osteopathic School
#> -0.431929
Here we use the option `type = "response"` to directly obtain the probabilities.
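As a small illustration of what `type = "response"` does, the sketch below uses an unpenalized `stats::glm` on simulated data (an assumption; the analysis itself uses a penalized model). The response-scale prediction is the inverse logit of the link-scale prediction.

```r
set.seed(42)
n <- 300
# Simulated predictor and binary outcome; names are illustrative only.
toy <- data.frame(n_letters = rpois(n, 2))
toy$matched <- rbinom(n, 1, plogis(-0.5 + 0.4 * toy$n_letters))

fit <- glm(matched ~ n_letters, data = toy, family = binomial)

new_students <- data.frame(n_letters = c(0, 2, 4))

# Log-odds (linear predictor) versus probabilities.
p_link <- predict(fit, newdata = new_students, type = "link")
p_resp <- predict(fit, newdata = new_students, type = "response")

# The two scales are related by the inverse logit.
all.equal(unname(plogis(p_link)), unname(p_resp))
```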
Temporal validation - Validation for Year vs. Year
The final reduced data include 27 variables for analysis.
Next, we will impute missing values. This will make our data ready for machine learning.
In some cases, our data is naturally separated into two sets, one of which can be used to fit a model and the other to evaluate it. A common example of this is when data has been collected during two distinct time periods, and the older data is used to fit a model that is evaluated on the newer data, to see if historical data can be used to predict the future.
Here we check the sample sizes and the proportions of target=“Match_Status” in the training and test sets. The proportions of target=“Match_Status” should be approximately the same across the training set, the test set, and the whole dataset. Do NOT up- or down-sample the training set, because the evaluation metric is the Brier Score and there is no evidence that balancing the data will improve out-of-sample performance. In fact, based on my experiments, it makes the models’ performance much worse. For more explanation, see this link: https://stats.stackexchange.com/a/474431
#> The training set has 18315 observations.
#> The training set has 15605 observations.
#> Distribution of the target in the training set:
#> n_TRUE
#> 0.567
#> Distribution of the target in the test set:
#> n_TRUE
#> 0.506
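A temporal split and the proportion check can be sketched as follows. The data are simulated, and the cutoff (2020 as the test year, earlier years as training) is an assumption made for illustration; the text does not state which years were held out.

```r
set.seed(123)
# Simulated applicants; only the column names mirror the study data.
toy <- data.frame(
  Year = sample(2016:2020, 500, replace = TRUE),
  Match_Status = factor(
    sample(c("Not_Matched", "Matched"), 500, replace = TRUE, prob = c(0.45, 0.55)),
    levels = c("Not_Matched", "Matched")
  )
)

# Assumed cutoff: fit on earlier years, evaluate on the most recent year.
train <- toy[toy$Year < 2020, ]
test  <- toy[toy$Year == 2020, ]

# Sample sizes and outcome proportions in each set.
c(train = nrow(train), test = nrow(test))
prop_train <- prop.table(table(train$Match_Status))
prop_test  <- prop.table(table(test$Match_Status))
prop_train
prop_test
```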
The next step is to explore potential predictive relationships between individual predictors and the response and between pairs of predictors and the response.
Two logistic regression models were considered. The “null model” contains only an intercept term, while the more complex model has a single term for an individual predictor from the risk set. These results are presented in the table, which orders the risk predictors from most significant to least significant in terms of improvement in Brier Score. For the training data, several predictors provide marginal but significant improvement in predicting the Match_Status outcome, as measured by the improvement in Brier Score. Based on these results, our intuition would lead us to believe that the significant categorical predictors will likely be integral to the final predictive model. For the numerical predictors, the improvement in Brier Score was compared with the p-value for the importance of each interaction term. The improvement from using interactions was less than a 0.1 change in the Brier Score.
#> ~Age:Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published +
#> Age:work_exp_count + Age:Volunteer_exp_count + Count_of_Oral_Presentation:Count_of_Peer_Reviewed_Book_Chapter +
#> Count_of_Oral_Presentation:Count_of_Peer_Reviewed_Journal_Articles_Abstracts +
#> Count_of_Oral_Presentation:Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published +
#> Count_of_Peer_Reviewed_Book_Chapter:total_OBGYN_letter_writers +
#> Count_of_Peer_Reviewed_Book_Chapter:Research_exp_count +
#> Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published:Research_exp_count +
#> total_OBGYN_letter_writers:number_of_applicant_first_author_publications +
#> total_OBGYN_letter_writers:Research_exp_count + work_exp_count:Volunteer_exp_count
#> <environment: 0x7f9f09fbddf0>
NULL MODEL
For all our model evaluation we will use the Brier score as the metric. The Brier score for a binary classifier is defined as the mean squared error between the predicted probabilities and the true target values (0’s or 1’s). The smaller, the better.
Model performance and utility will be judged on how much more accurately the model can predict a residency match than a human. Since the target of the prediction is binary (Matched or Not_Matched), we can assume a baseline of 50% for random chance. Next we want to determine a rather modest human baseline. If a human were to predict “Matched” for every applicant, they would be accurate 54.23% of the time, 4.23 percentage points above chance. Though we can imagine more elaborate methods to improve human performance, we will use this method to establish a baseline. Human performance could just as easily fall below random chance: a human who predicted “Not_Matched” for every applicant would be accurate only 45.77% of the time. Let’s give the benefit of the doubt, however, and use the higher accuracy. Therefore, to create a successful model, we must aim for an accuracy greater than 54.23%. Even a slight improvement in performance is consequential, given the nature of the model.
#> A naive prediction of all matches gives a test Brier score: 0.494
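The baseline figures can be reproduced with a few lines of R. The counts come from the descriptive table (3919 matched, 3307 not matched); the helper name `brier_score` and the toy test vector, built to have 50.6% matched like the test set, are assumptions for illustration.

```r
# Brier score: mean squared difference between predicted probability and 0/1 outcome.
brier_score <- function(p, y) mean((p - y)^2)

# Accuracy of always predicting "Matched" or always predicting "Not_Matched".
n_matched <- 3919
n_not_matched <- 3307
acc_all_matched <- n_matched / (n_matched + n_not_matched)
acc_all_not_matched <- n_not_matched / (n_matched + n_not_matched)
round(100 * c(acc_all_matched, acc_all_not_matched), 2)  # 54.23 45.77

# Toy test set with 50.6% matched; predicting probability 1 for everyone
# gives a Brier score equal to the share of unmatched applicants.
y_test <- c(rep(1, 506), rep(0, 494))
naive_brier <- brier_score(rep(1, length(y_test)), y_test)
naive_brier  # 0.494
```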
Often we are interested in out-of-sample measures of model performance. Recall that cross-validation is one way of obtaining out-of-sample performance estimates; we’ll use the `caret` package to perform cross-validation. The idea behind cross-validation is similar to that behind a test-training split of the data. We partition the data into several sets and use one of them for evaluation. The key difference is that in cross-validation we partition the data into more than two sets and use all of them (one by one) for evaluation.
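To make the partitioning idea concrete, here is a hand-rolled 5-fold cross-validation in base R on simulated data. It is a sketch of the mechanism only; the number of folds, the variable names, and the plain `glm` model are assumptions, and the analysis itself relies on `caret` for this step.

```r
set.seed(2020)
n <- 400
# Simulated predictor and binary outcome.
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.3 + 0.8 * toy$x))

# Assign each row to one of k folds at random, with equal fold sizes.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold.
cv_brier <- sapply(seq_len(k), function(i) {
  fit <- glm(y ~ x, data = toy[fold_id != i, ], family = binomial)
  p <- predict(fit, newdata = toy[fold_id == i, ], type = "response")
  mean((p - toy$y[fold_id == i])^2)
})

cv_brier        # one out-of-sample Brier score per fold
mean(cv_brier)  # cross-validated estimate
```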
We will work with the `caret` (Classification and Regression Training) package in R. A strength of the `caret` package is that the same syntax can be used whether the model we are considering has a categorical or numeric outcome.
`caret` PACKAGE
2- train models on train data. The `caret::train` function is a high-level API that manages everything for us regarding model training, with a common interface. Our algorithm was optimizing for the Brier Score. This is set with the argument `metric`.
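One way to make `caret::train` optimize the Brier score is a custom summary function passed through `trainControl`. The sketch below is an assumption about how this could be wired up, not the code used in the analysis: `brier_summary` is a hypothetical helper, the data are simulated, and the `train` call only runs if `caret` is installed.

```r
# Hypothetical summary function in the form caret expects:
# 'data' has the observed class in 'obs' and one probability column per class.
brier_summary <- function(data, lev = NULL, model = NULL) {
  y <- as.numeric(data$obs == "Matched")
  c(Brier = mean((data[, "Matched"] - y)^2))
}

# Quick check on a hand-made frame: ((0.8 - 1)^2 + (0.3 - 0)^2) / 2 = 0.065
chk <- data.frame(
  obs = factor(c("Matched", "Not_Matched"), levels = c("Not_Matched", "Matched")),
  Not_Matched = c(0.2, 0.7),
  Matched = c(0.8, 0.3)
)
chk_brier <- brier_summary(chk, lev = levels(chk$obs))
chk_brier

if (requireNamespace("caret", quietly = TRUE)) {
  set.seed(1)
  toy <- data.frame(x = rnorm(200))
  toy$Match_Status <- factor(
    ifelse(runif(200) < plogis(toy$x), "Matched", "Not_Matched"),
    levels = c("Not_Matched", "Matched")
  )
  # 3 repeats of 10-fold cross-validation, scored with the custom metric.
  ctrl <- caret::trainControl(
    method = "repeatedcv", number = 10, repeats = 3,
    classProbs = TRUE, summaryFunction = brier_summary
  )
  fit <- caret::train(
    Match_Status ~ x, data = toy, method = "glm",
    metric = "Brier", maximize = FALSE, trControl = ctrl
  )
  print(fit$results)
}
```

Setting `maximize = FALSE` matters here because a smaller Brier score is better.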
The glmnet package is extremely efficient and fast, even on very large data sets (mostly due to its use of Fortran to solve the lasso problem via coordinate descent); note, however, that it only accepts the non-formula XY interface so prior to modeling we need to separate our feature and target sets.
#> [1] "Best tuning parameter values found after 3 repeated 10-fold cross-validations:\n"
#> lambda min_brier_score brier_score_sd
#> 1 0.0302 0.197 0.025
Best grid search terms in the `logit_shrink` model.
`xgboost` is a gradient boosting model. These types of tree-based models are known for their stability and performance. XGBoost is one implementation of these boosting models, which rely on the model’s errors to improve their performance.
Training a neural network for Classification
Because the caret ensemble does not support custom models, we will add the cat model manually.
#> logit lasso xgb nnet rf rpart
#> logit 1.000 0.999 0.891 0.992 0.990 0.840
#> lasso 0.999 1.000 0.909 0.987 0.994 0.852
#> xgb 0.891 0.909 1.000 0.843 0.928 0.890
#> nnet 0.992 0.987 0.843 1.000 0.965 0.765
#> rf 0.990 0.994 0.928 0.965 1.000 0.905
#> rpart 0.840 0.852 0.890 0.765 0.905 1.000
All models are highly correlated, so they are not good candidates for an ensemble.
Find Confidence interval
3- Model Comparison
-Summary Table
#>
#> Call:
#> summary.resamples(object = results)
#>
#> Models: Logit, Lasso, RF, XGB, Nnet
#> Number of resamples: 4
#>
#> Acc
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.598 0.690 0.722 0.705 0.738 0.777 0
#> Lasso 0.590 0.685 0.721 0.703 0.739 0.781 0
#> RF 0.614 0.675 0.696 0.695 0.716 0.775 0
#> XGB 0.619 0.671 0.700 0.690 0.718 0.740 0
#> Nnet 0.603 0.671 0.696 0.685 0.710 0.743 0
#>
#> AUCPR
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.668 0.715 0.750 0.750 0.784 0.832 0
#> Lasso 0.669 0.711 0.746 0.748 0.782 0.830 0
#> RF 0.644 0.698 0.743 0.734 0.779 0.807 0
#> XGB 0.645 0.699 0.732 0.726 0.759 0.795 0
#> Nnet 0.662 0.663 0.700 0.720 0.757 0.820 0
#>
#> AUCROC
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.674 0.745 0.771 0.749 0.775 0.782 0
#> Lasso 0.672 0.745 0.771 0.751 0.777 0.788 0
#> RF 0.664 0.729 0.760 0.742 0.772 0.784 0
#> XGB 0.673 0.710 0.737 0.730 0.756 0.773 0
#> Nnet 0.661 0.715 0.740 0.723 0.748 0.753 0
#>
#> Brier
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.174 0.186 0.192 0.198 0.204 0.231 0
#> Lasso 0.170 0.185 0.190 0.196 0.202 0.235 0
#> RF 0.170 0.195 0.204 0.204 0.213 0.237 0
#> XGB 0.190 0.199 0.210 0.214 0.224 0.247 0
#> Nnet 0.178 0.202 0.214 0.213 0.225 0.244 0
#>
#> F
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.539 0.687 0.754 0.723 0.790 0.843 0
#> Lasso 0.520 0.681 0.752 0.718 0.788 0.848 0
#> RF 0.600 0.695 0.747 0.735 0.786 0.846 0
#> XGB 0.615 0.701 0.734 0.724 0.757 0.815 0
#> Nnet 0.610 0.691 0.734 0.725 0.768 0.823 0
#>
#> Kappa
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.202 0.365 0.436 0.384 0.455 0.464 0
#> Lasso 0.189 0.357 0.432 0.380 0.455 0.468 0
#> RF 0.230 0.326 0.378 0.356 0.408 0.438 0
#> XGB 0.238 0.329 0.371 0.352 0.393 0.426 0
#> Nnet 0.206 0.324 0.365 0.335 0.375 0.403 0
#>
#> Precision
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.657 0.675 0.692 0.708 0.725 0.792 0
#> Lasso 0.655 0.676 0.692 0.707 0.723 0.789 0
#> RF 0.640 0.642 0.654 0.681 0.693 0.774 0
#> XGB 0.640 0.657 0.677 0.692 0.711 0.773 0
#> Nnet 0.617 0.645 0.668 0.678 0.701 0.759 0
#>
#> Recall
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.457 0.715 0.829 0.754 0.867 0.900 0
#> Lasso 0.431 0.703 0.822 0.748 0.867 0.915 0
#> RF 0.562 0.771 0.872 0.810 0.910 0.934 0
#> XGB 0.593 0.742 0.801 0.764 0.823 0.861 0
#> Nnet 0.602 0.746 0.814 0.782 0.850 0.899 0
-Plots
#> [1] "Function Sanity Check: Saving TIFF of what is in the viewer"
#> null device
#> 1
`varImp` showcases the importance of the variables used in the training model.
Plot Variable importance
#> svg
#> 2
-Statistical Significance
There is no significant difference between models.
Predicting Test Set Results. Finally, scoring needs to be performed on the test sample using the parameter estimates obtained from the model building process. This step can be easily implemented with the help of the `predict` function. Using `predict` is as simple as it gets with `caret` models. We’ll make predictions using the test data in order to evaluate the performance of our logistic regression model. The procedure is as follows:
1- Predict the class membership probabilities of observations based on predictor variables.
2- Assign the observations to the class with the highest probability score (i.e., above 0.5).
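The class-assignment step can be sketched with made-up probabilities (these are not model output). A probability above 0.5 is labelled `Matched`; anything else is labelled `Not_Matched`.

```r
# Made-up predicted probabilities of 'Matched' and the observed outcomes.
p_matched <- c(0.91, 0.48, 0.50, 0.12, 0.67)
observed <- factor(
  c("Matched", "Matched", "Not_Matched", "Not_Matched", "Matched"),
  levels = c("Not_Matched", "Matched")
)

# Assign the class with the highest probability (cutoff 0.5).
predicted <- factor(
  ifelse(p_matched > 0.5, "Matched", "Not_Matched"),
  levels = c("Not_Matched", "Matched")
)

# Confusion matrix and accuracy.
conf_mat <- table(Predicted = predicted, Observed = observed)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
conf_mat
accuracy  # 0.8
```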
Which classes do these probabilities refer to? In our example, the output is the probability that Match_Status is ‘Matched’. We know that these values correspond to the probability of ‘Matched’, rather than ‘Not_Matched’, because the contrasts() function indicates that R has created a dummy variable with a 1 for ‘Matched’ and a 0 for ‘Not_Matched’. Check the dummy coding:
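A minimal check of the dummy coding on a toy factor (values made up), assuming `Not_Matched` has been set as the reference level:

```r
match_status <- factor(
  c("Matched", "Not_Matched", "Matched"),
  levels = c("Not_Matched", "Matched")
)

# Rows are the factor levels; the single column shows the 0/1 coding.
coding <- contrasts(match_status)
coding
```

The row for `Matched` carries the 1, so fitted probabilities refer to `Matched`.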
1- Lift plot
#> [1] "Function Sanity Check: Saving TIFF of what is in the viewer"
#> null device
#> 1
2-Calibration plot
#> [1] "Function Sanity Check: Saving TIFF of what is in the viewer"
#> null device
#> 1
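The numbers behind a calibration plot can be computed by binning predicted probabilities and comparing each bin's mean prediction with its observed event rate. This sketch uses simulated, well-calibrated predictions and ten equal-width bins; both are assumptions for illustration.

```r
set.seed(7)
n <- 2000
# Simulated predictions; outcomes are drawn from them, so calibration is good.
p_hat <- runif(n)
y <- rbinom(n, 1, p_hat)

# Ten equal-width probability bins.
bins <- cut(p_hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

calib <- data.frame(
  bin = levels(bins),
  mean_predicted = as.vector(tapply(p_hat, bins, mean)),
  observed_rate = as.vector(tapply(y, bins, mean)),
  n = as.vector(table(bins))
)
calib
```

For a well-calibrated model the two middle columns track each other closely, which is what points near the diagonal of a calibration plot show.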
3- Logistic Calibration plot
#> [1] 0
#> [1] 0
#> [1] 0
Plot Calibration
#> svg
#> 2
##Table and Plot of CI
`IML` package. Use `rf` first, then run the `IML` feature interaction.
2- train models on train data for LOG LOSS
-Summary Table of LOG LOSS
#>
#> Call:
#> summary.resamples(object = results3)
#>
#> Models: Logit, Lasso, RF, XGB, Nnet
#> Number of resamples: 4
#>
#> logLoss
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> Logit 0.514 0.549 0.566 0.577 0.594 0.664 0
#> Lasso 0.517 0.549 0.562 0.576 0.590 0.663 0
#> RF 0.528 0.583 0.602 0.611 0.630 0.712 0
#> XGB 0.593 0.599 0.639 0.656 0.696 0.751 0
#> Nnet 0.508 0.567 0.587 0.590 0.610 0.679 0
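For reference, log loss for a binary outcome can be computed directly. The helper `log_loss` below is a hypothetical base R version with clipping to avoid `log(0)`; it is not the function used to produce the table.

```r
# Binary log loss; probabilities are clipped away from 0 and 1.
log_loss <- function(p, y, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# An uninformative prediction of 0.5 gives log(2), about 0.693.
uninformative <- log_loss(c(0.5, 0.5), c(1, 0))

# Confident, correct predictions give a smaller loss.
confident <- log_loss(c(0.9, 0.1), c(1, 0))

c(uninformative = uninformative, confident = confident)
```

The mean log loss values in the table (roughly 0.58 to 0.66) fall below the 0.693 of an uninformative prediction.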
-Plots of LOG LOSS
Find LOG LOSS Confidence interval
-Statistical Significance
#>
#> Call:
#> summary.diff.resamples(object = diffs)
#>
#> p-value adjustment: bonferroni
#> Upper diagonal: estimates of the difference
#> Lower diagonal: p-value for H0: difference = 0
#>
#> logLoss
#> Logit Lasso RF XGB Nnet
#> Logit 0.00118 -0.03384 -0.07835 -0.01286
#> Lasso 1.000 -0.03502 -0.07953 -0.01404
#> RF 0.217 0.256 -0.04451 0.02098
#> XGB 0.228 0.196 0.794 0.06549
#> Nnet 1.000 1.000 0.152 0.350
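The structure of this comparison, pairwise tests on resampled metrics with a Bonferroni adjustment, can be imitated in base R. The loss values and model names `A`, `B`, `C` below are made up, and paired t-tests are used as an assumption about the kind of test involved.

```r
# Made-up log loss values for three models over the same four resamples.
loss <- data.frame(
  A = c(0.52, 0.55, 0.58, 0.66),
  B = c(0.53, 0.54, 0.59, 0.65),
  C = c(0.60, 0.61, 0.66, 0.74)
)

# All pairs of models.
model_pairs <- combn(names(loss), 2, simplify = FALSE)

# Paired t-test for each pair, since resamples are shared across models.
raw_p <- sapply(model_pairs, function(pr) {
  t.test(loss[[pr[1]]], loss[[pr[2]]], paired = TRUE)$p.value
})
names(raw_p) <- sapply(model_pairs, paste, collapse = " vs ")

# Bonferroni: multiply each p-value by the number of comparisons, capped at 1.
adj_p <- p.adjust(raw_p, method = "bonferroni")
adj_p
```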
Gives the polynomial formula
Create the y = mx+b Logistic Model Formula
#> [1] "y = -2.43 + 0.45*Military_Service_ObligationYes - 0.33*NCAANot_a_NCAA_athlete + 0.69*NIHNo_NIH_involvement - 0.09*Age - 0.55*Medical_Education_or_Training_InterruptedYes + 0.09*Misdemeanor_ConvictionYes + 0.3*Gold_Humanism_Honor_SocietyYes + 0.04*Count_of_Oral_Presentation - 0.13*Count_of_Peer_Reviewed_Book_Chapter + 0.03*Count_of_Peer_Reviewed_Journal_Articles_Abstracts + 0.08*Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published - 0.77*Year2018 + 1.61*USMLE_Pass_Fail_replacedPassed + 0.12*LocationCCAG - 0.08*LocationCU + 0.19*LocationDuke - 0.42*LocationTruman - 0.62*LocationU_Washington - 0.53*LocationUAB - 0.95*LocationUtah + 0.09*Meeting_Name_PresentedPresented at a meeting + 0.24*TopNIHfundedAttended a NIH top-funded Medical School - 0.07*Higher_Education_InstitutionIvy League + 0.22*Other_Service_ObligationYes + 0.81*Photo_ReceivedYes + 0.97*Tracks_Applied_by_Applicant_1Categorical Applicant - 0.05*AMAAmerican Medical Association Member + 0.41*ACOGACOG Member + 0.36*total_OBGYN_letter_writers + 0.02*number_of_applicant_first_author_publications + 0.34*Advance_DegreeNo Advanced Degree + 0.67*Advance_DegreePh.D. + 0.02*Advance_DegreeOther + 0.74*Type_of_medical_schoolOsteopathic School + 1.54*Type_of_medical_schoolU.S. Private School + 1.3*Type_of_medical_schoolU.S. Public School - 0.02*work_exp_count + 0.01*Volunteer_exp_count + 0.04*Research_exp_count - 0*logit"
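A formula string of this kind can be assembled from a named coefficient vector, and the resulting linear predictor `y` is converted to a probability with the inverse logit. The coefficients and predictors (`x1`, `x2`) below are hypothetical and unrelated to the fitted model.

```r
# Hypothetical coefficients, as would be returned by coef() on a logistic model.
coefs <- c("(Intercept)" = 0.5, x1 = -0.25, x2 = 1)

# Build "y = b0 +b1*x1 +b2*x2" with signed, rounded coefficients.
terms <- paste0(sprintf("%+.2f", coefs[-1]), "*", names(coefs)[-1])
formula_string <- paste("y =", sprintf("%.2f", coefs[1]), paste(terms, collapse = " "))
formula_string

# Evaluate the linear predictor for one hypothetical applicant.
applicant <- c(x1 = 2, x2 = 1)
y <- unname(coefs[1] + sum(coefs[names(applicant)] * applicant))

# Inverse logit turns the log-odds into a probability.
p <- plogis(y)
c(log_odds = y, probability = p)
```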
Odds Ratio table
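An odds ratio table is obtained by exponentiating the logistic regression coefficients and their confidence limits. The sketch below uses simulated data and Wald intervals from `confint.default`; both choices are assumptions and the variable names are illustrative.

```r
set.seed(99)
n <- 500
# Simulated predictors and binary outcome.
toy <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.4))
toy$y <- rbinom(n, 1, plogis(-0.2 + 0.7 * toy$x1 + 0.5 * toy$x2))

fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Exponentiate coefficients and Wald 95% limits to get odds ratios.
ci <- exp(confint.default(fit))
or_table <- data.frame(
  term = names(coef(fit)),
  odds_ratio = exp(coef(fit)),
  ci_low = ci[, 1],
  ci_high = ci[, 2],
  row.names = NULL
)
or_table
```

An odds ratio above 1 means the odds of matching rise with the predictor; below 1 means they fall.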