Introduction

Obstetrics and gynecology clerkship directors and medical students alike desire reliable prognostic matching information tailored to the individual. One predictive tool is the nomogram, which creates a simple graphical representation of a statistical predictive model that generates a numerical probability of an event. There is an inverse relationship between model accuracy and model interpretability. Here was have chosen for a more interpretable model with a simple logistic regression. The goal is to predict accurately whether a new student, not in this dataset, will match or not. The outcome variable is a variable called Match_Status: coded as “Matched” or “Not_Matched”. Other classification questions are to be or not to be, to vote or not to vote, and to click or not to click.

Objective: We sought to construct and validate a model that predict a medical student’s chances of matching into an obstetrics and gynecology residency.

This is the skeleton I used for creating the lrm and the GLM: * Factorize categorical variables because R didn’t do the job * Remove collinear variables * Add relevant power-transformations * Add relevant variable interaction * Remove insignificant variables with relevant testing criteria * Repeat above steps until you’ve exhausted your options * Remove any outlying datapoints * Evaluate your model * Visualizing your findings

Create folder structure, load bespoke functions, and Load necessary R libraries:

Load data

Data Dictionary

data_dictionary <- read_csv("~/Dropbox (Personal)/Nomogram/nomogram/data_dictionary.csv")
knitr::kable(x = data_dictionary, align = "l", caption = "Data Dictionary of Features", padding = 2) %>% 
    kable_styling(bootstrap_options = "striped", full_width = F) %>% kable_paper("hover", full_width = F) %>% 
    # row_spec(0, angle = -45) %>%
footnote(c("See 03-Feature_Selection.Rmd for more details"))

Data Dictionary of Features
Varialble Name	Data Type	Definition
Match_Status	factor	Not_Matched versus Matched, Did the applicant match? Based on search of “meet our residents” web page.
Gender	factor	Male versus Female. One person did not list gender and was removed for privacy concerns.
Age	double	Age at the time they applied to OBGYN residency.
US_or_Canadian_Applicant	factor	Yes versus No, US or Canadian Senior applicant
Medical_Degree	factor	MD versus DO
Military_Service_Obligation	factor	Yes or No
Visa_Sponsorship_Needed	factor	Yes or No
Medical_Education_or_Training_Interrupted	factor	Yes or No
Misdemeanor_Conviction	factor	Yes or No
Alpha_Omega_Alpha	factor	Yes or No
Gold_Humanism_Honor_Society	factor	Yes or No
Couples_Match	factor	Yes or No
Count_of_Oral_Presentation	integer	Number
Count_of_Peer_Reviewed_Book_Chapter	integer	Number
Count_of_Peer_Reviewed_Journal_Articles_Abstracts	integer	Number
Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published	integer	Number
Count_of_Poster_Presentation	integer	Number
Year	factor	Season of application (2017, 2018, 2019, 2020)
USMLE_Pass_Fail_replaced	factor	Passed versus Failed attempt
Location	factor	Residency site of application: CU, Duke, Baylor Scott & White, Cleveland Clinic Akron,
Meeting_Name_Presented	factor	Did not present at a meeting versus Presented at a meeting
TopNIHfunded	factor	Attended a NIH top-funded Medical School versus Did not attend NIH top-funded medical school
Higher_Education_Institution	factor	Ivy League College vs. No Ivy league college
Higher_Education_Degree	factor	BS vs. Not a BS degree
Interest_Group	factor	Did they participate in OBGYN interest group
Birth_Place	factor	Born outside the USA versus Born in the USA
Language_Fluency	factor	Speaks English and Another Language versus Speaks English
ACLS	factor	Yes or No
PALS	factor	Yes or No
Other_Service_Obligation	factor	National Service Corps versus No
Medical_Training_Discipline	factor	Prior Residency Training versus No prior residency training
Photo_Received	factor	Yes or No
Tracks_Applied_by_Applicant_1	factor	Applying for a preliminary position versus Categorical applicant
AMA	factor	AMA membership versus No AMA membership
ACOG	factor	ACOG Member versus No ACOG member
Latin_Honors	character	Cum laude versus no latin honors
Scholarship	character	Scholarship versus No_scholarship
Grant	character	Grant_funding versus No_Grant_funding
Phi_beta_kappa	character	Phi_Beta_Kappa versus No_Phi_Beta_Kappa
NCAA	character	NCAA_athlete versus Not_a_NCAA_athlete
Boy_Scouts	character	Boy_Scout/Girl_Scout versus Not_a_Boy_Scout/Girl_Scout
Valedictorian	character	Valedictorian versus Not_a_Valedictorian
NIH	character	NIH_present versus No_NIH_involvement
NCI	character	NCI_present versus No_NCI_involvement
total_OBGYN_letter_writers	integer	Total number of letters written by OBGYNs
reco_count	integer	Total number of letters of recommendation available
number_of_applicant_first_author_publications	integer	Number of first author publications
Advance_Degree	factor	M.B.A. versus No Advanced Degree versus Ph.D. versus Other
Birth_Resident_status	character	Born outside the USA and now lives in USA versus Born in USA and lives in USA
female_reco_letter	integer	Number of female letter of recommendation authors
Note:
See 03-Feature_Selection.Rmd for more details

# Remove year and location columns from data as those are not modifiable risk factors.
all_data %<>% select(-Year, -Location)

info <- sessionInfo()
r_ver <- paste(info$R.version$major, info$R.version$minor, sep = ".")

All analyses were performed using R (ver. 4.0.3)[R Core Team, 2013] and packages rms (ver. 6.1-0) [F. Harrell, 2014] for analysis, Gmisc for plot and table output (ver. 1.11.0), and knitr (ver 1.30) [Xie, 2013] for reproducible research.

Data Preparation: Factorize categorical variables

# Copy and pasted fullformula output here.
var.labels <- c(ACLS = "ACLS", ACOG = "ACOG member", Advance_Degree = "Additional Degree", Age = "Age, years", 
    Alpha_Omega_Alpha = "AOA member", AMA = "AMA member", Birth_Place = "Location of Birth", Birth_Resident_status = "Immigrant", 
    Boy_Scouts = "Boy or Girl Scout", Count_of_Poster_Presentation = "Count of Poster Presentations", 
    Count_of_Oral_Presentation = "Count of Oral Presentations", Count_of_Peer_Reviewed_Journal_Articles_Abstracts = "Count of Published Abstracts", 
    Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published = "Count of Other Published Products", 
    Count_of_Online_Publications = "Count of Online Publications", Count_of_Peer_Reviewed_Book_Chapter = "Count of Book Chapters", 
    Couples_Match = "Couples Match", female_reco_letter = "Number of letters of recommendation written by women", 
    Gender = "Sex", Gold_Humanism_Honor_Society = "Gold Humanism membership", Grant = "Received grants", 
    Higher_Education_Degree = "Undergraduate degree", Higher_Education_Institution = "Ivy League College Education", 
    Interest_Group = "Participates in OBGYN Interest Group", Language_Fluency = "Speaks foreign language", 
    Latin_Honors = "Graduated cum laude", Medical_Degree = "Medical degree", Medical_Education_or_Training_Interrupted = "Medical Education Interrupted", 
    Medical_Training_Discipline = "Previous residency training", Meeting_Name_Presented = "Presented at a scientific meeting", 
    Military_Service_Obligation = "Military service obligation", Misdemeanor_Conviction = "Prior Misdemeanor", 
    NCAA = "College athlete", NCI = "National Cancer Institute experience", NIH = "National Institutes of Health experience", 
    number_of_applicant_first_author_publications = "Count of First-Author Publications", Other_Service_Obligation = "National Health Service Corps", 
    PALS = "PALS", Phi_beta_kappa = "Phi Beta Kappa membership", Photo_Received = "Photo in Application", 
    reco_count = "Number of letters of recommendation", Scholarship = "Received scholarship", TopNIHfunded = "Attended Top NIH funded Medical School", 
    total_OBGYN_letter_writers = "Count of OBGYN writing letter of recommendation", Tracks_Applied_by_Applicant_1 = "Application Track", 
    US_or_Canadian_Applicant = "US or Canadian Applicant", USMLE_Pass_Fail_replaced = "USMLE Step 1", 
    Valedictorian = "Prior valedictorian", Visa_Sponsorship_Needed = "Visa Sponsorship needed", y = "Match Outcome")

# These Hmisc labels will become the labels on the nomogram ************
Hmisc::label(all_data$ACLS) = "ACLS"
Hmisc::label(all_data$ACOG) = "ACOG member"
Hmisc::label(all_data$Advance_Degree) = "Additional Degree"
Hmisc::label(all_data$Age) = "Age"
units(all_data$Age) <- "years"
Hmisc::label(all_data$Alpha_Omega_Alpha) = "AOA member"
Hmisc::label(all_data$AMA) = "AMA member"
Hmisc::label(all_data$Birth_Place) = "Location of Birth"
Hmisc::label(all_data$Birth_Resident_status) = "Immigrant"
Hmisc::label(all_data$Boy_Scouts) = "Boy or Girl Scout"
Hmisc::label(all_data$Count_of_Poster_Presentation) = "Count of Poster Presentations"
Hmisc::label(all_data$Count_of_Oral_Presentation) = "Count of Oral Presentations"
Hmisc::label(all_data$Count_of_Peer_Reviewed_Journal_Articles_Abstracts) = "Count of Published Abstracts"
Hmisc::label(all_data$Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published) = "Count of Other Published Work"
# Hmisc::label(all_data$Count`) = 'Count of Online Publications'
Hmisc::label(all_data$Count_of_Peer_Reviewed_Book_Chapter) = "Count of Book Chapters"
Hmisc::label(all_data$female_reco_letter) = "Number of letters of recommendation written by women"
Hmisc::label(all_data$Gender) = "Sex"
Hmisc::label(all_data$Gold_Humanism_Honor_Society) = "Gold Humanism membership"
Hmisc::label(all_data$Grant) = "Received grants"
Hmisc::label(all_data$Higher_Education_Degree) = "Undergraduate degree"
Hmisc::label(all_data$Higher_Education_Institution) = "Ivy League College Education"
Hmisc::label(all_data$Interest_Group) = "Participates in OBGYN Interest Group"
Hmisc::label(all_data$Language_Fluency) = "Speaks foreign language"
Hmisc::label(all_data$Latin_Honors) = "Graduated cum laude"
Hmisc::label(all_data$Medical_Degree) = "Medical degree"
Hmisc::label(all_data$Medical_Education_or_Training_Interrupted) = "Medical Education Interrupted"
Hmisc::label(all_data$Medical_Training_Discipline) = "Prior Residency Training"
Hmisc::label(all_data$Meeting_Name_Presented) = "Presented at a scientific meeting"
Hmisc::label(all_data$Military_Service_Obligation) = "Military service obligation"
Hmisc::label(all_data$Misdemeanor_Conviction) = "Prior Misdemeanor"
Hmisc::label(all_data$NCAA) = "College athlete"
Hmisc::label(all_data$NCI) = "National Cancer Institute experience"
Hmisc::label(all_data$NIH) = "National Institutes of Health experience"
Hmisc::label(all_data$number_of_applicant_first_author_publications) = "Count of First Author Publications"
Hmisc::label(all_data$Other_Service_Obligation) = "National Health Service Corps"
Hmisc::label(all_data$PALS) = "PALS"
Hmisc::label(all_data$Phi_beta_kappa) = "Phi Beta Kappa membership"
Hmisc::label(all_data$Photo_Received) = "Photo in Application"
Hmisc::label(all_data$reco_count) = "Number of letters of recommendation"
Hmisc::label(all_data$Scholarship) = "Received scholarship"
Hmisc::label(all_data$TopNIHfunded) = "Attended Top NIH funded Medical School"
Hmisc::label(all_data$total_OBGYN_letter_writers) = "Count of OBGYN writing letter of recommendation"
Hmisc::label(all_data$Tracks_Applied_by_Applicant_1) = "Application Track"
Hmisc::label(all_data$US_or_Canadian_Applicant) = "US or Canadian Applicant"
Hmisc::label(all_data$USMLE_Pass_Fail_replaced) = "USMLE Step 1"
Hmisc::label(all_data$Valedictorian) = "Prior valedictorian"
Hmisc::label(all_data$Visa_Sponsorship_Needed) = "Visa Sponsorship needed"
Hmisc::label(all_data$Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published) = "Unpublished Articles"
Hmisc::label(all_data$Count_of_Peer_Reviewed_Journal_Articles_Abstracts) = "Peer-Reviewed Articles"
Hmisc::label(all_data$Count_of_Oral_Presentation) = "Oral Presentations"
Hmisc::label(all_data$Count_of_Peer_Reviewed_Book_Chapter) = "Book Chapters"
Hmisc::label(all_data$Count_of_Poster_Presentation) = "Poster Presentations"
# Hmisc::label(all_data$Self_Identify) = 'Race/Ethnicity'
# Hmisc::label(all_data$Type_of_medical_school) = 'Type of Medical School'
Hmisc::label(all_data$Medical_Education_or_Training_Interrupted) = "Medical Education Interrupted"
Hmisc::label(all_data$Alpha_Omega_Alpha) = "Alpha Omega Alpha"
Hmisc::label(all_data$Gold_Humanism_Honor_Society) = "Gold Humanism Society"
Hmisc::label(all_data$Couples_Match) = "Couples Match"
Hmisc::label(all_data$Visa_Sponsorship_Needed) = "Visa Sponsorship Needed"
Hmisc::label(all_data$USMLE_Pass_Fail_replaced) = "USMLE Step 1"
Hmisc::label(all_data$Medical_Degree) = "Medical Degree"

There are 822 missing variables from the all_data dataframe addressed with imputation using mice.

# Assigns imputed data to all_data variable name
all_data <- completeData

Data split

Validation set Approach - The validation set approach consists of randomly splitting the data into two sets: one set is used to train the model and the remaining other set sis used to test the model.

Results

The rate of successfully matching into an OBGYN residency for this population was 56.97% with a 95% confidence interval of 55.63% to 58.31%. The sample is slightly unbalanced, with almost 56.97percent of the samples having positive y/Matching. There is no outcome imbalance.

A total of 47 features were available in the trainSet dataframe before checking for multicollinearity. While this is a large number of predictors, we start by using all of them, look at performance, and then explore reducing the dimension.

`all.features.full` model

We started by looking at feature importance of all the variables in all.features.full.

The estimated glm.all.features.full model with 47 predictors is shown below.

Multicollinearity (Variance inflation factor)

In model adequacy checking, we will check for multicollinearity in order to understand the linear dependence amongst the covariates. Explore the vif values of the all.features.full model and determine if you meet the assumption of additivity (meaning no multicollinearity).

Hence, we will use VIF to calculate multicollinearity for each covariate. We kept on removing the covariates one after the other with p values greater than 0.05. Based on multicollinearity result, removing one covariate linked with each other:

# create vector of VIF values
vif_values <- vif(all.features.full)

  * Type_of_medical_school+ #Removed due to VIF
  * Count_of_Scientific_Monograph+ #Stupid variable that I really don't understand
  * Year #removed for VIF issues
  * Medical_School_of_Graduation+ # removed due to VIF issues
  * work_exp_percentage+ #removed for VIF issues
  * Research_exp_percentage+ #removed for VIF issues
  * Advance_Degree,  #removed for VIF issues
  * Year_diff_from_graduation_to_application,,  #removed for VIF issues

Ideally, there should be no multicollinearity between these covariates now so we will check multicollinearity between these covariates. The revised_formula has no collinear terms present and only includes p<0.05. Our VIF values for the covariates are close to 1 hence there exists no multicollinearity between these covariates and all.features is our best model now.

`all.features` model without multicollinear variables

all.features_model

			Crude		Adjusted
Adjusted and unadjusted estimates for Matching into OBGYN Residency.
Variable	Total	Event	OR	2.5 % to 97.5 %	OR	2.5 % to 97.5 %
ACOG member
No	1640	728 (44.4%)		ref		ref
Yes	2090	1,397 (66.8%)	2.5	2.2 to 2.9	1.5	1.3 to 1.8
Additional Degree
No Addl Degree	3,348	1,941 (58.0%)		ref		ref
MBA	20	6 (30.0%)	0.3	0.1 to 0.8	0.6	0.2 to 1.8
Other	344	173 (50.3%)	0.7	0.6 to 0.9	0.7	0.5 to 0.9
PhD	18	5 (27.8%)	0.3	0.1 to 0.8	0.6	0.2 to 1.9
Age	29.1 (±4.1)	28.1 (±2.5)	0.8	0.8 to 0.9	0.9	0.9 to 1.0
Location of Birth
Born in the USA	2899	1,821 (62.8%)		ref		ref
Born outside the USA	831	304 (36.6%)	0.3	0.3 to 0.4	0.6	0.5 to 0.8
Peer-Reviewed Articles	1.1 (±2.4)	1.2 (±2.3)	1.0	1.0 to 1.1	1.1	1.0 to 1.1
Unpublished Articles	0.4 (±1.0)	0.4 (±1.1)	1.2	1.1 to 1.3	1.1	1.0 to 1.2
Gold Humanism Society
No	3231	1,776 (55.0%)		ref		ref
Yes	499	349 (69.9%)	1.9	1.6 to 2.3	1.2	1.0 to 1.5
Ivy League College Education
No Ivy League Education	3529	1,981 (56.1%)		ref		ref
Ivy League	201	144 (71.6%)	2.0	1.4 to 2.7	1.4	1.0 to 2.0
Medical Degree
MD	3125	1,848 (59.1%)		ref		ref
DO	605	277 (45.8%)	0.6	0.5 to 0.7	0.5	0.4 to 0.6
Medical Education Interrupted
No	3200	1,913 (59.8%)		ref		ref
Yes	530	212 (40.0%)	0.4	0.4 to 0.5	0.6	0.5 to 0.7
National Health Service Corps
No	3694	2,099 (56.8%)		ref		ref
Yes	36	26 (72.2%)	2.0	1.0 to 4.1	2.7	1.2 to 6.1
Number of letters of recommendation	3.6 (±0.6)	3.6 (±0.5)	1.4	1.2 to 1.5	1.4	1.2 to 1.6
Attended Top NIH funded Medical School
Did not attend NIH top-funded medical school	2461	1,204 (48.9%)		ref		ref
Attended a NIH top-funded Medical School	1269	921 (72.6%)	2.8	2.4 to 3.2	1.4	1.1 to 1.6
Count of OBGYN writing letter of recommendation	2.0 (±0.9)	2.2 (±0.7)	1.9	1.7 to 2.0	1.4	1.3 to 1.5
Application Track
Categorical Applicant	2010	1,283 (63.8%)		ref		ref
Applying for a Preliminary Position	1720	842 (49.0%)	0.5	0.5 to 0.6	0.7	0.6 to 0.9
US or Canadian Applicant
No	717	157 (21.9%)		ref		ref
Yes	3013	1,968 (65.3%)	6.7	5.5 to 8.1	3.0	2.3 to 3.9
USMLE Step 1
Passed	3657	2,117 (57.9%)		ref		ref
Failed attempt	73	8 (11.0%)	0.1	0.0 to 0.2	0.2	0.1 to 0.5
Visa Sponsorship Needed
No	3483	2,071 (59.5%)		ref		ref
Yes	247	54 (21.9%)	0.2	0.1 to 0.3	0.7	0.5 to 1.1

plot(anova(all.features))

# summary(all.features)

Variable importance for the all.features_model. After adjusting for the 47 variables, only 15 remained significant.

Stepwise Backwards selection using aic and p-values

kable(colnames(trainSet1), digits = 2, caption = "Features following Backwards Regression Selection", 
    align = "l", padding = 2, format.args = list(decimal.mark = ",", big.mark = "'"))

Features following Backwards Regression Selection
x
Match_Status
Age
Visa_Sponsorship_Needed
Alpha_Omega_Alpha
Count_of_Peer_Reviewed_Journal_Articles_Abstracts
Other_Service_Obligation
Medical_Training_Discipline
NCI
number_of_applicant_first_author_publications
Birth_Resident_status
female_reco_letter

Table 1

# must have results="asis"

trainSet_ordered <- trainSet1

Table_1 <- arsenal::tableby(formula = Match_Status ~ ., data=trainSet1, 
                 control =arsenal::tableby.control(test = TRUE,
                 total = TRUE,
                 digits = 1L, digits.count = 0L, cat.simplify = TRUE, numeric.simplify = TRUE,numeric.stats = c(#"Nmiss2", 
                                                                                                                  "median",  "q1q3"), cat.stats =c(#"Nmiss2", 
                                                                                                                                                   "countpct"), stats.labels = list(Nmiss = "N Missing", Nmiss2 ="N Missing", meansd = "Mean (SD)", medianrange = "Median (Range)",median ="Median", medianq1q3 = "Median (Q1, Q3)",q1q3 = "Q1, Q3",iqr = "IQR",range = "Range",countpct = "Count (Pct)", Nevents = "Events", medSurv ="Median Survival",medTime = "Median Follow-Up")))
  
summary(Table_1,
          text=F,
          title = 'Applicant Descriptive Variables by Matched or Did Not Match from 2017 to 2020', 
          labelTranslations = mylabels, #Seen in additional functions file
          pfootnote=TRUE)

Applicant Descriptive Variables by Matched or Did Not Match from 2017 to 2020
	Not_Matched (N=1605)	Matched (N=2125)	Total (N=3730)	p value
Age, yrs				< 0.001¹
Median	29.0	28.0	28.0
Q1, Q3	27.0, 31.0	27.0, 29.0	27.0, 30.0
Visa Sponsorship Needed	193 (12.0%)	54 (2.5%)	247 (6.6%)	< 0.001²
Alpha Omega Alpha Member	91 (5.7%)	277 (13.0%)	368 (9.9%)	< 0.001²
Count of Peer Reviewed Journal Articles				0.004¹
Median	0.0	0.0	0.0
Q1, Q3	0.0, 1.0	0.0, 1.0	0.0, 1.0
National Health Service Corps	10 (0.6%)	26 (1.2%)	36 (1.0%)	0.063²
Prior Residency Training	200 (12.5%)	46 (2.2%)	246 (6.6%)	< 0.001²
National Cancer Institute experience	53 (3.3%)	84 (4.0%)	137 (3.7%)	0.295²
Count of First Author Publications				< 0.001¹
Median	0.0	1.0	1.0
Q1, Q3	0.0, 2.0	0.0, 3.0	0.0, 3.0
Immigrant	602 (37.5%)	360 (16.9%)	962 (25.8%)	< 0.001²
Number of letters of recommendation written by women				< 0.001¹
Median	1.0	1.0	1.0
Q1, Q3	0.0, 2.0	1.0, 2.0	1.0, 2.0

Linear Model ANOVA
Pearson’s Chi-squared test

table1 <- Table_1 %>% as.data.frame() 

#View(as.data.frame(Table_1))
#table1$Total[[2]]
#tm_write2word(tm_arsenal_table_output, "tm_arsenal_table_output1")
#tm_write2pdf(tm_arsenal_table_output, "tm_arsenal_table_output1")

The typical applicant to the included Obstetrics and Gynecology residency program is a 28-year-old (IQR: 25 - 31)female. A majority of applicants were trained in an allopathic medical school. The majority of students do not need visa sponsorship. About 13% of applicants had interrupted their medical school training either to get another degree or to take time for an illness or another reason. Typically applicants were not members of AOA or of the Gold Humanism Society.

Unsurprisingly the most common research output was a poster. In terms of research the applicants had presented a median of 1 poster (IQR: -2 - 4).

In terms of prior history predicting the future we looked at prior USMLE Step 1 failure. The most common was that most students Step 1.

Table 1 by Year

Model Training: `monica`

The `monica` model is going to include only the variables chosen by backwards selection above.

# monica is the model that uses only statistically significant features from backwards selection.

d <- rms::datadist(trainSet1)
options(datadist = "d")  #often a good idea, to store info with fit

monica <- rms::lrm(formula = ssformula, data = trainSet1, method = "lrm.fit", tol = 0.000000000001, linear.predictors = TRUE, 
    x = T, y = T)

No multicollinearity present in `monica` model

# Must have results='asis' in the rmarkdown header

vif_monica <- vif(monica)  # test for multicolinearity
# Fox, J. and Monette, G. (1992) Generalized collinearity diagnostics. JASA, 87, 178–183.

stargazer::stargazer(vif_monica, title = "Variable-Inflation Factors for monica model", type = "html", 
    notes = ("GVIF = Generalized Variance-inflation Factors.  Df = Degrees of freedom.  "), notes.align = "l", 
    notes.append = TRUE)

**Variable-Inflation Factors for monica model**

Age	Alpha_Omega_Alpha	Birth_Resident_status	Count_of_Peer_Reviewed_Journal_Articles_Abstracts	female_reco_letter	Medical_Training_Discipline	NCI	number_of_applicant_first_author_publications	Other_Service_Obligation	Visa_Sponsorship_Needed

110.000	1.100	1.600	1.600	3.000	1.200	1.000	1.700	1.000	1.300

GVIF = Generalized Variance-inflation Factors. Df = Degrees of freedom.

The mean variable inflation factor for the monica model is 12.35.

monica.glm

Variable Importance

Adding relevant power-transformations using the Nadaraya-Watson kernel regression estimate

Here, we’ll check the linear relationship between continuous predictor variables and the logit of the outcome. This can be done by visually inspecting the scatter plot between each predictor and the logit values.

Linearity assumptions

ggplot(mydata, aes(logit, predictor.value)) + geom_point(size = 0.5, alpha = 0.5) + geom_smooth(method = "loess") + 
    theme_bw() + facet_wrap(~predictors, scales = "free_y")

The variables Age, female_reco_letter, and Number_of_Peer_reviewed_journal_articles are not linear and might need some transformations. If the scatter plot shows non-linearity, you need other methods to build the model such as including 2 or 3-power terms, fractional polynomials and spline function.

Here we compare the significance of terms as linear or as quadratic. In regression, polynomial effects are predictors that have a power greater than one.

The original monica model without polynomials effects had an AIC of 4536.5 but the updated model with the second degree polynomials for age, number of peer-reviewed articles, and number of applicants where they are the first author was 4527.75. There was no change in r.squared as well. This AIC difference of 8.76 was not significant. Plots could be used in a supplement. We found no quadratic or cubic terms that explained the variation in matching above and beyond the linear terms.

Adding variable interactions

Use the anova function to answer if the addition of the interaction was significant.

knitr::kable(as.data.frame(x) %>% filter(x$`Pr(>Chi)` < 0.05) %>% select("Pr(>Chi)"), digits = 4, padding = 2, 
    caption = "Interactions between Variables in the monica.glm", format.args = list(decimal.mark = ",", 
        big.mark = "'"))

Interactions between Variables in the monica.glm
	Pr(>Chi)
Age:Alpha_Omega_Alpha	0,001
Count_of_Peer_Reviewed_Journal_Articles_Abstracts:number_of_applicant_first_author_publications	0,014
Age:Count_of_Peer_Reviewed_Journal_Articles_Abstracts	0,018
Count_of_Peer_Reviewed_Journal_Articles_Abstracts:Medical_Training_Discipline	0,024
Alpha_Omega_Alpha:Other_Service_Obligation	0,034
Birth_Resident_status:Count_of_Peer_Reviewed_Journal_Articles_Abstracts	0,049

# Includes table of interactions Counts the number of statistically significant interactions between
# variables

There are 6 interactions between the factors in the monica model.

Let’s test for more interactions after adding the variable interaction Age:Alpha_Omega_Alpha:

The original monica model without interactions had an AIC of 4536.5 but the updated model with the six feature interactions produced an AIC of 4527.68. This difference of 8.83 was not significant. We found no quadratic or cubic terms that affected the model.

Remove insignficant variables

The p-values in monica.glm indicate the probability of a variable not being important for prediction.

There is no change in AIC between the two models with and without the statistically insignificant model. The main advantage of this model over our previous is the added simplicity inherent in the reduced number of explanatory variables! Removing these variables does not violate the principle of marginality.

# Update monica model where we remove NCI and other_service
monica <- rms::lrm(formula = Match_Status ~ Age + Alpha_Omega_Alpha + Birth_Resident_status + female_reco_letter + 
    Medical_Training_Discipline + number_of_applicant_first_author_publications + Visa_Sponsorship_Needed, 
    data = trainSet1, method = "lrm.fit", tol = 0.000000000001, linear.predictors = TRUE, x = T, y = T)

Removing outliers

So now that we have a model we’re satisfied with we can look for outliers that negatively effect the model.

When you have outliers in a continuous predictor, potential solutions include: * Removing the concerned records - I chose this option.
* Transform the data into log scale * Use non parametric methods

Data point that is greater than three standard deviations different from the mean.

# Filter potential influential data points with abs(.std.res) > 3:
model.data %>% filter(abs(.std.resid) > 3)

We see that the datapoint 1925 is classified as an outlier in both tests, we can take a look at the point in a few of our plots to gauge whether or not to remove it. Code turned off.

Down to almost 4527.1 AIC from the original 4536.5, this isn’t really a relevant measure of performance when comparing the AIC of two different sets of data.

Look for overfitting

In the code above we’re using k-fold cross-validation with k = 73 (since 73 is a factor of 3730 which is the length of trainSet1) this means we’re splitting our data in 73 ‘chunks’.

Table 2: Interpreting Model Coefficients

Performance?

The overall performance of the monica model without interactions was evaluated using the Brier score, rescaled to range from 0 to 1 (with higher values indicating better performance) as suggested by Steyerberg et al (2010). Brier score for monica was 0.21. The C-statistic/AUC was 0.71 for the monica model. The R2 was 0.19.

# rms::Predict is not working and so will use glm predict as another option.

# Page 139-140 in Harrell Text p <- rms::Predict(object = monica, #rms fit object digits=4, conf.int
# = FALSE) plot(p) ggplot(p, groups=FALSE)

# https://darrendahly.github.io/post/homr/
trainSet1$Match_Status <- as.numeric(trainSet1$Match_Status) - 1

monica.glm <- glm(formula = glm_ssformula, data = trainSet1, family = binomial(link = "logit"))

dust(monica.glm)

term	estimate	std.error	statistic	p.value
(Intercept)	4.25	0.37	11.39	0
Age	-0.15	0.01	-11.29	0
Alpha_Omega_AlphaYes	0.52	0.13	3.97	0
Birth_Resident_statusImmigrated to US	-0.52	0.09	-5.55	0
Count_of_Peer_Reviewed_Journal_Articles_Abstracts	0.11	0.02	6	0
female_reco_letter	0.2	0.04	5.44	0
Medical_Training_DisciplineYes	-0.73	0.2	-3.71	0
NCIYes	0.17	0.19	0.88	0.38
number_of_applicant_first_author_publications	0.03	0.01	2.28	0.02
Other_Service_ObligationYes	0.86	0.4	2.12	0.03
Visa_Sponsorship_NeededYes	-0.97	0.18	-5.37	0

Generating Predictions

The first step is predicting the match probabilities of test data set using predict( ) function. Setting a cut-off value (0.5 for binary classification). Below 0.5 of probability did not match (0) and above that pos (1) for matching.

#https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62

#Create Predictions
probs <- 
  stats::predict(monica.glm, # Model created with the trainSet data
  newdata = testSet, #Need a test set called testSet
  type = "response")    # Return predicted probabilities
pred <- factor(ifelse(probs > 0.5, 1, 0))

# #Set levels so they are the same
# pred <- factor(pred, levels=c("1", "0"), labels = c("1", "0"))
# pred <- relevel(pred, ref = "1")
# levels(pred)
# class(pred)

# #Set levels so they are the same
# testSet$Match_Status <- factor(testSet$Match_Status, levels=c("1", "0"), labels = c("1", "0"))
# testSet$Match_Status <- relevel(testSet$Match_Status, ref = "1")
# levels(testSet$Match_Status)
# class(testSet$Match_Status)

Model Evaluation on Test data Set

Confusion Matrix

After fitting a binary logistic regression model, the next step is to check how well the fitted model performs on unseen data i.e. 30% test data. Thus, the next step is to predict the classes in the test data set and generating a confusion matrix.

testSet$Match_Status <- as.numeric(testSet$Match_Status) - 1

cm <- caret::confusionMatrix(data = pred, reference = factor(testSet$Match_Status), positive = as.character(1))
cm

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 291 124
         1 397 787
                                             
               Accuracy : 0.674              
                 95% CI : (0.651, 0.697)     
    No Information Rate : 0.57               
    P-Value [Acc > NIR] : <0.0000000000000002
                                             
                  Kappa : 0.301              
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.864              
            Specificity : 0.423              
         Pos Pred Value : 0.665              
         Neg Pred Value : 0.701              
             Prevalence : 0.570              
         Detection Rate : 0.492              
   Detection Prevalence : 0.740              
      Balanced Accuracy : 0.643              
                                             
       'Positive' Class : 1

draw_confusion_matrix(cm)

The accuracy for the model was 0.67 with sensitivity of NA, and finally NA.

Nomogram

Figure 1: Nomogram: An analogue tool to deliver digital knowledge

We generated the nomogram to provide a pre-match, personalized estimate of the chance of matching into OBGYN residency at all institutions whereby points in the nomogram were assigned in proportion to the effect sizes in the multivariable logistic regression analysis model. The nomogram was based on presurgical variables including pre-Match education preparations, research accomplishments, and applicant demographics.
Points were allocated for each variable, summed, and then used to calculate a medical student-specific, pre-application risk chance of Matching. The nomogram illustrates the strength of association of the predictors to the outcome as well as the nonlinear associations between age and count of poster presentations and matching.

The goal of machine learning is to build models that generate accurate predictions on previously unseen data (Max Kuhn, http://www.feat.engineering). Here we see that in action:

par(mar=c(1, 1, 1, 1)) # 1 inch margins all over 
#https://www.statmethods.net/advgraphs/parameters.html

plot(nomo.nomo, 
     xfrac=0.5, #distance of variable names to the bars
     cex.axis=0.6, #Size of the words  (e.g. "Male")
     cex.var=0.6, #Variable name size (e.g. size of "Age, years")
     total.points.label="Sum of all points",
     force.label = TRUE,
     lmgp = 0.3,
     #tcl = 0.8,
       label.every=2)

Equation

varNames <- colnames(stats::model.matrix(monica))
equationStr <- paste(round(coef(monica), 2), varNames, sep = "*", collapse = "   +   ")
equationStr <- gsub("*(Intercept)", "", equationStr, fixed = TRUE)
equationStr <- paste(all.features$terms[[2]], "=", equationStr)
equationStr[[2]] %>% knitr::kable(format = "html", align = "l", caption = "Polynomial Equation to Predict Matching for OBGYN Residency:", 
    padding = 2) %>% kableExtra::kable_styling()

Polynomial Equation to Predict Matching for OBGYN Residency:
x
Match_Status = 3.82 + -0.13Age + 0.52Alpha_Omega_AlphaYes + -0.5Birth_Resident_statusImmigrated to US + 0.19female_reco_letter + -0.64Medical_Training_DisciplineYes + 0.06number_of_applicant_first_author_publications + -0.95*Visa_Sponsorship_NeededYes

The overall performance of these models was evaluated using the Brier score, rescaled to range from 0 to 1 (with higher values indicating better performance) as suggested by Steyerberg et al (2010). We assessed calibration graphically, in addition to using the maximum and average difference in predicted vs loess-calibrated probabilities (Emax and Eavg); and we used the c-statistic to assess discrimination. For each of these metrics, we reported bootstrapped 95% confidence intervals. Finally, for Model Revision, we estimated an optimism-corrected c-statistic and shrinkage factor using bootstrapping, as described in Harrell, Lee and Mark (1996). Uncertainty in these metrics was evaluated with bootstrapping:

# # Metrics

# val_monica <- rms::val.prob( p = testSet$monica.glm_pred, # predicted probability y =
# testSet$Match_Status, #vector of binary outcomes pl = FALSE) %>% #plot calibration curves round(3)
# rescale_brier <- function(x, p, ...){ format(round(1 - (x / (mean(p) * (1 - mean(p)))), digits =
# 2), nsmall = 2) } b1 <- rescale_brier(val_monica['Brier'], 0.41) # Note: 0.41 is the marginal
# probabilty of not matching in the entire sample pander(val_monica) #Uncertainty of the ???
# set.seed(12345678) boot_val <- function(data, formula, ...){ out <- list() for(i in 1:500){ df <-
# sample_n(data, nrow(data), replace = TRUE) md <- glm(formula, data = df, family = binomial)
# out[[i]] <- rms::val.prob(predict(md, type = 'response'), as.numeric(df$Match_Status) - 1, pl =
# FALSE) %>% round(3) } return(out) } forglm_revisedformula <- as.formula( paste0('Match_Status ~',
# linearphrase) ) boot_vals_monica <- boot_val(trainSet, forglm_revisedformula) ##Should this be
# trainSet or testSet?  #This code just pulls out 95% intervals from the bootstrapped values.
# calc_ci <- function(metric, boot_vals, n){ x <- unlist(map(boot_vals, `[`, c(metric))) if(metric ==
# 'Brier'){x <- as.numeric(rescale_brier(x, 0.184))} paste0('(', round(quantile(x, 0.025), n), ' to
# ', round(quantile(x, 0.975), n), ')') } # m1 monica_c_boot_ci <- calc_ci('C (ROC)',
# boot_vals_monica, 2) monica_brier_boot_ci <- calc_ci('Brier', boot_vals_monica, 2)
# monica_emax_boot_ci <- calc_ci('Emax', boot_vals_monica, 2) monica_eavg_boot_ci <- calc_ci('Eavg',
# boot_vals_monica, 2)

# # Function to produce the calibration plots cal_plot <- function(model, model_name, pred_var, ...){
# require(tidyverse) require(viridis) require(gridExtra) # The calibration plot g1 <-
# dplyr::mutate(all_data, bin = ntile(get(pred_var), 10)) %>% # Bin prediction into 10ths
# group_by(bin) %>% dplyr::mutate(n = n(), # Get ests and CIs bin_pred = mean(get(pred_var)),
# bin_prob = mean(as.numeric(alive) - 1), se = sqrt((bin_prob * (1 - bin_prob)) / n), ul = bin_prob +
# 1.96 * se, ll = bin_prob - 1.96 * se) %>% ungroup() %>% ggplot(aes(x = bin_pred, y = bin_prob, ymin
# = ll, ymax = ul)) + geom_pointrange(size = 0.5, color = 'black') + scale_y_continuous(limits = c(0,
# 1), breaks = seq(0, 1, by = 0.1)) + scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by =
# 0.1)) + geom_abline() + # 45 degree line indicating perfect calibration geom_smooth(method = 'lm',
# se = FALSE, linetype = 'dashed', color = 'black', formula = y~-1 + x) + # straight line fit through
# estimates geom_smooth(aes(x = get(pred_var), y = as.numeric(alive) - 1), color = 'red', se = FALSE,
# method = 'loess') + # loess fit through estimates xlab('') + ylab('Observed Probability') +
# theme_minimal() + ggtitle(model_name) # The distribution plot g2 <- ggplot(all_data, aes(x =
# get(pred_var))) + geom_histogram(fill = 'black', bins = 200) + scale_x_continuous(limits = c(0, 1),
# breaks = seq(0, 1, by = 0.1)) + xlab('Predicted Probability') + ylab('') + theme_minimal() +
# scale_y_continuous(breaks = c(0, 40)) + theme(panel.grid.minor = element_blank()) # Combine them g
# <- arrangeGrob(g1, g2, respect = TRUE, heights = c(1, 0.25), ncol = 1) grid.newpage() grid.draw(g)
# return(g[[3]]) }

# model_tab_1 <- data_frame( Measures = c('Intercept', 'Slope', 'Residual deviance', 'Df', 'LRT Chisq
# p-value', 'Brier score (rescaled)', 'Emax', 'Eavg', 'c-statistic'), all.features_Values =
# c(round(c(0, 1, monica$deviance, monica$df.residual, 0), 2), paste(b1, monica_brier_boot_ci),
# paste(val_monica['Emax'], monica_emax_boot_ci), paste(val_monica['Eavg'], monica_eavg_boot_ci),
# paste(round(val_monica['C (ROC)'], 2), monica_c_boot_ci)), ) names(model_tab_1) <- c('', 'monica
# model') knitr::kable(model_tab_1)

Plot the Full distributions of bootstrapped values.

# AMAZING!!!!: https://darrendahly.github.io/post/homr/ boots <- function(metric, boot_vals){ x <-
# as.numeric(unlist(map(boot_vals, `[`, metric))) if(metric == 'Brier'){x <-
# as.numeric(rescale_brier(x, 0.184))} return(x) } boot_list <- list(boot_vals_monica) metrics <-
# c('C (ROC)', 'Brier', 'Eavg', 'Emax') x <- c() for(i in metrics){ for(j in 1:1){ x <- c(x, boots(i,
# boot_list[[j]])) } } y <- rep(c(paste0('all.features model', ' c-index'), paste0('all.features
# model', ' Brier'), paste0('all.features model', ' Emax'), paste0('all.features model',' Eavg')),
# each = 500) boot_data <- data_frame(var = y, val = x) #Density plots ggplot(boot_data, aes(x =
# val)) + geom_density() + facet_wrap(~var, nrow = 4, scales = 'free') + theme_minimal() + xlab('') +
# ylab('Density')

Evaluating the model: Overview

To evaluate the monica Model, we followed the procedure outlined in Vergouwe et al (2016).

monica_1 <- glm(formula = ssformula, data = trainSet, family = binomial(link = "logit"))

# Model predicted probabilties
trainSet$monica_pred <- stats::predict(monica_1, type = "response")

# x <- cal_plot(monica_1, 'monica model', 'all_data$monica_pred')

# sjPlot::tab_model(model = monica.glm, 
#                     show.p = TRUE, 
#                     show.ci = FALSE, 
#                     show.se = TRUE, 
#                     transform = NULL)
  
# sjlabelled::term_labels(monica.glm)

sjPlot::plot_model(model = monica.glm, 
                     transform = NULL, 
                     show.values = TRUE, 
                     #axis.labels = TRUE, 
                     show.legend = TRUE,
                     value.offset = .4, 
                     colors = "bw",
                     auto.label = FALSE) #Uses the column header names

#jtools::effect_plot(monica.glm, pred = Age, data = all_data, interval = TRUE)

Odds Ratios

Table 2: Interpreting Model Coefficients

#http://rstudio-pubs-static.s3.amazonaws.com/481253_60f332ede0ea4ad9bb8e8c1148f560cd.html#5_multiple_imputation_with_a_logistic_regression_model

modeltab <- arsenal::modelsum(formula = Match_Status ~ ., 
             family = "binomial", 
             data = trainSet,  #train or test?
             binomial.stats=c("estimate","CI.lower.estimate","CI.upper.estimate","p.value"))
             #adjust = ~ Age + Gender + US_or_Canadian_Applicant + Medical_Degree + Alpha_Omega_Alpha)

summary(modeltab, 
        title='Odds Ratios for monica model', 
        control=arsenal::modelsum.control(gaussian.stats=c("N","estimate", "Nmiss2")), 
        show.intercept = FALSE)

Odds Ratios for monica model
	estimate	CI.lower.estimate	CI.upper.estimate	p.value
Sex Male	-0.417	-0.583	-0.252	< 0.001
Age	-0.166	-0.188	-0.144	< 0.001
US or Canadian Applicant Yes	1.905	1.715	2.100	< 0.001
Medical Degree DO	-0.539	-0.714	-0.364	< 0.001
Military service obligation Yes	0.126	-0.598	0.885	0.737
Visa Sponsorship Needed Yes	-1.657	-1.975	-1.355	< 0.001
Medical Education Interrupted Yes	-0.802	-0.990	-0.615	< 0.001
Prior Misdemeanor Yes	-0.349	-0.847	0.146	0.166
Alpha Omega Alpha Yes	0.914	0.672	1.165	< 0.001
Gold Humanism Society Yes	0.645	0.444	0.851	< 0.001
Couples Match Yes	0.670	0.419	0.928	< 0.001
Oral Presentations	0.016	-0.021	0.055	0.403
Book Chapters	-0.188	-0.411	0.022	0.086
Peer-Reviewed Articles	0.044	0.015	0.075	0.004
Unpublished Articles	0.182	0.108	0.261	< 0.001
Poster Presentations	0.123	0.092	0.156	< 0.001
USMLE Step 1 Failed attempt	-2.413	-3.231	-1.737	< 0.001
Presented at a scientific meeting Presented at a meeting	0.361	0.226	0.497	< 0.001
Attended Top NIH funded Medical School Attended a NIH top-funded Medical School	1.016	0.871	1.164	< 0.001
Ivy League College Education Ivy League	0.680	0.372	1.001	< 0.001
Undergraduate degree Not a B.S. degree	-0.305	-0.436	-0.175	< 0.001
Participates in OBGYN Interest Group Mentions Interest Group	-0.975	-2.952	0.660	0.260
Location of Birth Born outside the USA	-1.074	-1.235	-0.915	< 0.001
Speaks foreign language Speaks English and Another Language	-0.118	-0.266	0.030	0.120
ACLS Yes	-0.247	-0.378	-0.116	< 0.001
PALS Yes	-0.966	-1.260	-0.680	< 0.001
National Health Service Corps Yes	0.681	-0.020	1.461	0.068
Prior Residency Training Yes	-1.862	-2.200	-1.544	< 0.001
Photo in Application No	-1.767	-2.559	-1.086	< 0.001
Application Track Applying for a Preliminary Position	-0.610	-0.741	-0.479	< 0.001
AMA member American Medical Association Member	0.521	0.389	0.653	< 0.001
ACOG member Yes	0.926	0.793	1.060	< 0.001
Graduated cum laude Latin_honors	-1.229	-2.056	-0.489	0.002
Received scholarship Scholarship	0.313	0.166	0.462	< 0.001
Received grants Grant_funding	0.466	0.186	0.755	0.001
Phi Beta Kappa membership Phi_Beta_Kappa	0.886	0.386	1.432	< 0.001
College athlete NCAA_athlente	0.627	0.003	1.310	0.058
Boy or Girl Scout Boy/Girl_Scouts	0.495	-0.436	1.543	0.317
Prior valedictorian Valedictorian	0.520	-0.095	1.185	0.108
National Institutes of Health experience NIH_present	-0.509	-1.183	0.150	0.131
National Cancer Institute experience Yes	0.187	-0.160	0.542	0.296
Count of OBGYN writing letter of recommendation	0.618	0.538	0.700	< 0.001
Number of letters of recommendation	0.317	0.200	0.434	< 0.001
Count of First Author Publications	0.054	0.032	0.077	< 0.001
Additional Degree MBA	-1.169	-2.209	-0.253	0.017
Additional Degree Other	-0.310	-0.532	-0.088	0.006
Additional Degree PhD	-1.277	-2.416	-0.300	0.015
Immigrant Immigrated to US	-1.079	-1.232	-0.928	< 0.001
Number of letters of recommendation written by women	0.306	0.240	0.372	< 0.001
monica_pred	4.742	4.310	5.186	< 0.001

# all.features_predict <- stats::predict(monica, x=TRUE, y=TRUE)

Internal Validation

Bootstrap validation of the monica model. In bootstrap validation random samples are drawn with replacement from the original data set are the same size as the original cohort.

# page 284 in Harrell's book, page 301
set.seed(123456)

monica <- update(monica, x = TRUE, y = TRUE)

v_monica <- rms::validate(fit = monica, method = "boot", B = 1000, pr = FALSE, digits = 2, bw = FALSE, 
    tol = 0.000000000001, maxdim = 5, maxiter = 100, dxy = TRUE)

head(v_monica, B = 20, digits = 3)

          index.orig training   test optimism index.corrected    n
Dxy             0.40     0.40 0.3946   0.0044          0.3911 1000
R2              0.18     0.18 0.1785   0.0044          0.1762 1000
Intercept       0.00     0.00 0.0056  -0.0056          0.0056 1000
Slope           1.00     1.00 0.9840   0.0160          0.9840 1000
Emax            0.00     0.00 0.0045   0.0045          0.0045 1000
D               0.14     0.15 0.1425   0.0038          0.1404 1000

Figure 2: Calibration of the `monica` model

graphics::plot(cal_monica, main = "Bootstrap Overfitting-Corrected Calibration Curve", xlim = c(0, 1), 
    ylim = c(0, 1), xlab = "Nomogram-Predicted Probability of Matching", ylab = "Actual Matching (proportion)", 
    lwd = 2, lty = 1, errbar.col = c(rgb(0, 118, 192, maxColorValue = 255)), legend = FALSE, subtitles = FALSE, 
    cex.lab = 1.2, cex.axis = 1, cex.main = 1, cex.sub = 0.6, scat1d.opts = list(nhistSpike = 240, side = 3, 
        frac = 0.08, lwd = 1, nint = 50))


n=3730   Mean absolute error=0.013   Mean squared error=0.00033
0.9 Quantile of absolute error=0.019

lines(cal_monica, lwd = 2, lty = 3, col = c(rgb(255, 0, 0, maxColorValue = 255)))
abline(0, 1, lty = 5, lwd = 2, col = c(rgb(0, 0, 255, maxColorValue = 255)))
legend("bottomright", cex = 0.8, legend = c("Apparent", "Bias-corrected", "Ideal"), col = c("red", "black", 
    "blue"), lwd = c(1, 1, 1), lty = c(1, 1, 2))

Site Validation

The process works as follow:

Build (train) the model on the training data set
Apply the model to the test data set to predict the outcome of new unseen observations
Quantify the prediction error as the mean squared difference between the observed and the predicted outcome values.

# Create the data for each individual site source('Code/000_1f_Site_Data_cleaning_function_maybe.R')

A Model to Predict Chances of Matching into Obstetrics and Gynecology Residency at All Participating Sites

Tyler M. Muffly, MD

Department of Obstetrics and Gynecology, Denver Health, Denver, CO

Introduction

Create folder structure, load bespoke functions, and Load necessary R libraries:

Load data

Data Dictionary

Data Preparation: Factorize categorical variables

Data split

Results

all.features.full model

Multicollinearity (Variance inflation factor)

all.features model without multicollinear variables

Stepwise Backwards selection using aic and p-values

Table 1

Table 1 by Year

Model Training: monica

The monica model is going to include only the variables chosen by backwards selection above.

No multicollinearity present in monica model

monica.glm

Variable Importance

Adding relevant power-transformations using the Nadaraya-Watson kernel regression estimate

Linearity assumptions

Adding variable interactions

Remove insignficant variables

Removing outliers

Look for overfitting

Table 2: Interpreting Model Coefficients

Performance?

Generating Predictions

Model Evaluation on Test data Set

Confusion Matrix

Nomogram

Figure 1: Nomogram: An analogue tool to deliver digital knowledge

Equation

Plot the Full distributions of bootstrapped values.

Evaluating the model: Overview

Odds Ratios

Table 2: Interpreting Model Coefficients

Internal Validation

Figure 2: Calibration of the monica model

Site Validation

`all.features.full` model

`all.features` model without multicollinear variables

Model Training: `monica`

The `monica` model is going to include only the variables chosen by backwards selection above.

No multicollinearity present in `monica` model

Figure 2: Calibration of the `monica` model