Background

Twelve months ago, four of eight CU medical students applying to Obstetrics and Gynecology residency did not match. A model that predicts a medical student’s chances of matching into an obstetrics and gynecology residency may facilitate improved counseling and fewer unmatched medical students.

Objective

Create and validate a predictive model to understand a medical student’s chances of matching into obstetrics and gynecology residency.

Computational Reproducibility

I used the package renv to make the project more: isolated, portable, and reproducible. Therefore a lockfile with all versions of packages is set with every run of the code. In addition, version control for the project was stored on www.github.com/mufflyt in a private repository called nomogram. I tried keeping the code “readable” and “literate”. The start of data is downloaded from Dropbox each time to ensure we always start with the Next; the environment was controlled using Rstudio Cloud (https://posit.cloud/spaces/191846/content/4636956). We worked with the caret package (Classification and Regression Training) package using R given the number of tutorials and support. A strength of the caret package is that the exact syntax can be used whether the model we are considering has a categorical or numeric outcome. Analyses were conducted using the R Statistical language (version 4.2.2; R Core Team, 2022) on Ubuntu 20.04.5 LTS. I will not discuss mathematical background and equations. Lastly, the environment was controlled by using Rstudio Cloud (https://rstudio.cloud/spaces/191846/content/4546858).

LOADING DATA INTO R ENVIRONMENT.

After we’ve loaded the data in R, we’ll quickly look at the variables. ‘Match_Status’ is a categorical binary variable, meaning only two categories are possible. all_merged_Feature_engineering1 is a dataframe of the independent and the dependent variables for review. Each variable is contained in a column, and each row represents a single unique medical student. The most recent data was used if students applied for more than one year.

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####  Data import and prep.  Data types already set as numeric or factor.   ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
#Load Data from the website so it is always the correct file to start with.  

# Define file path and name
file_path <- "output/csv"
file_name <- "all_merged_Feature_engineering1.rds"

# Download file from URL
download.file("https://www.dropbox.com/s/i4hel9s8v33s76a/all_merged_Feature_engineering1.rds?raw=1",
              destfile = file.path(file_path, file_name), 
              method = "auto", 
              cacheOK = TRUE)

# all_merged_Feature_engineering1.rds - This is the original data that has been cleaned and merged for all programs.  The file is assigned to dataframe 'd' for ease of typing.  

# Read file and assign to variable
all_merged_Feature_engineering1 <- tryCatch({
  readRDS(file = file.path(file_path, file_name))
}, error = function(e) {
  # Handle error
  stop(paste("Error reading file:", e))
})

# Assign data to variable d
d <- all_merged_Feature_engineering1

We can see that these data have 7,226 observations of 52 features and that 54.2 percent of medical students applying to OB/GYN residency matched. Let’s create a few plots to get a sense of the data. Remember, the goal here will be to predict whether a given medical student will match into OB/GYN residency. We make ‘Not_Matched’ the reference level in the target variable, which will make logit models predict the probability of ‘Matched’.

# Subset data to exclude NA values in the Match_Status column
subset_data <- d[!is.na(d$Match_Status),]

# Create ggplot object using subsetted data and mapping Match_Status to the x-axis and fill color
whomatched <- ggplot2::ggplot(data = subset_data, 
                              mapping = aes(x = Match_Status, 
                                            fill = Match_Status)) +

# Add bar plot with count on y-axis
ggplot2::geom_bar(stat='count') +

# Add plot labels
labs(title = 'How many applicants \n matched into OBGYN?',
     caption = "Data from the 9 participating\n OBGYN residency programs.",
     colour = "Gears") +

# Add labels on top of bars using comma formatter
geom_label(stat='count',
           mapping = aes(label=scales::comma(after_stat(count))), 
           size=7) +

# Apply grey theme with base font size of 18
ggplot2::theme_grey(base_size = 18); whomatched

Data Dictionary/Descriptive Statistics with Exploratory Data Analysis

Some of the variables are skewed but I did not transform them because I wanted to make the nomogram with the original variables instead of a variables scaled and centered.

Codebook

A codebook is a technical description of the data that was collected for a particular purpose. It describes how the data are arranged in the computer file or files, what the various numbers and letters mean, and any special instructions on properly using the data. Predictors under consideration: 2020, 2019, 2018, 2017:

all_data$Match_Status - Did they Match or not Match?
all_data$Age - Age at the time of the match, numerical variable based on Date of Birth POSIXct
all_data$Year - Year of participation in the match, 4 level categorical
all_data$Gender - Male or Female, 2 level categorical
all_data$Couples_Match - Participating in the couples match? 2 level categorical
all_data$US_or_Canadian_Applicant - Are they a US Senior or an IMG, 2 level categorical
all_data$Medical_Education_Interrupted - Taking breaks, 2 level categorical
all_data$Alpha_Omega_Alpha - Membership in AOA, 2 level categorical
all_data$Military_Service_Obligation
all_data$USMLE_Step_1_Score - I did not use Step 2 score because most students will not have those numbers at the time they apply, numerical variable. Step 1 is now pass/fail.
all_data$Count_of_Poster_Presentation - numerical variable
all_data$Count_of_Oral_Presentation - numerical variable
all_data$Count_of_Articles_Abstracts - numerical variable
all_data$Count_of_Peer_Reviewed_Book_Chapter - numerical variable
all_data$Count_of_Other_than_Published - numerical variable
all_data$Visa_Sponsorship_Needed - numerical variable
all_data$Medical_Degree - Allopathic versus Osteopathic medical school education, 2 level categorical
all_data$AAMC_ID (Additional data not for analysis)
all_data$Applicant Name (Additional data not for analysis)
all_data$SOAP Applicant (2017 and earlier does not have SOAP data)
all_data$SOAP Reapply Applicant (2017 and earlier does not have SOAP data)
all_data$SOAP Match Status (2017 and earlier does not have SOAP data)
all_data$SOAP Reapply Track Applied (2017 and earlier does not have SOAP data)
all_data$Track Applied by Applicant - Prelim or categorical.
all_data$Gold Humanism Honor Society
all_data$Medical School of Graduation
all_data$Medical School Type
all_data$Misdemeanor Conviction
all_data$Felony Conviction - No one reporting a history of felonies.
all_data$Malpractice_Cases_Pending - One person with a pending malpractice case.
all_data$Location - Location where the data came from.

DATA VIZ: UNIVARIATE AND BIVARIATE ANALYSIS

Univariate analysis with Exploratory Data Analysis

Univariate data analysis does not deal with causes or relationships since it only looks at one variable. The purpose is to find trends within the data using histograms and box plots.

# Use the plot_bar function from the DataExplorer package to create a bar plot of all numeric variables in the dataframe
DataExplorer::plot_bar(d, theme_config = list("text" = element_text(size = 10)), nrow = 3, ncol = 3)

Bivariate analysis with Exploratory Data Analysis

Bivariate visualizations and summary statistics assess the relationship between each variable and the target variable Match_Status. To decide which predictors we should include, we can visualize the relationships between the predictors with the outcome variable. Boxplot to see if the distribution of the continuous predictor is different across the two classes, i.e., if it is different, it means the predictor could help separate the two classes. Here we can see that Age could do so.

DataExplorer::plot_bar(d, by = "Match_Status", theme_config = list("text" = element_text(size = 10)), nrow = 3L, ncol = 3L)

#Transforming the data to long format can sometimes make the plotting syntax a little easier. This allows for a simpler plot creation and means that we can make use of the facetting capabilites of ggplot2; facet_wrap in this example.

data_long %>%
    ggplot(aes(Match_Status, value, color = Match_Status)) + 
    geom_boxplot(alpha = 0.5, outlier.size = 2, notch = FALSE) + 
    geom_jitter(width = 0.1, height = 0, alpha = 0.5) +
    facet_wrap(~ key, scales='free_y', ncol = 3) + 
    labs(x = NULL, y = NULL) +
    theme_classic()

Difference in Values Between the Match_Status Classes

#We can also look at this by doing some calculations, and see by how much the values differ between the two classes of Match_Status Doing this we can put a numeric value on the difference between the average values for each of our variables.

#data_long_output is a cleaner version of 'data_long' that can be outputed into a table.  

data_long_output <- data_long %>%
    dplyr::group_by(key, Match_Status) %>%
    dplyr::summarise(mean = mean(value)) %>%
    spread(Match_Status, mean) %>% # Spread the data to make the table 'wide'
    dplyr::mutate(Diff = Matched - Not_Matched, 
           `Percentage Difference` = (Matched - Not_Matched) / Matched * 100) %>%
    dplyr::arrange(desc(abs(`Percentage Difference`))) # Arrange by the largest differences, accounting for negatives

knitr_table_html(data_long_output, "Difference in Values Between the Match_Status Classes")

Difference in Values Between the Match_Status Classes
key	Not_Matched	Matched	Diff	Percentage Difference
Count_of_Peer_Reviewed_Book_Chapter	0.056	0.041	-0.015	-37.76
Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published	0.309	0.462	0.153	33.12
number_of_applicant_first_author_publications	1.305	1.872	0.566	30.26
Count_of_Poster_Presentation	1.548	2.114	0.566	26.79
Count_of_Peer_Reviewed_Journal_Articles_Abstracts	1.085	1.239	0.154	12.44
Age	29.919	27.891	-2.027	-7.27
Count_of_Oral_Presentation	0.907	0.962	0.056	5.81

Identify Missingness with Exploratory Data Analysis

# Check for NaN, NA, Inf, and -Inf values
nan_vars <- names(d)[map_lgl(d, ~ any(is.nan(.x)))] 
na_vars  <- names(d)[map_lgl(d, ~ any(is.na(.x) & !is.nan(.x)))]
inf_vars <- names(d)[map_lgl(d, ~ any(is.infinite(.x)))] 

# Print warning messages for variables with NaN, NA, or Inf values
if (length(nan_vars) > 0) {
  warning(paste("Check the NaN counts in these variables:", paste(nan_vars, collapse = ', ')))
}
if (length(na_vars) > 0) {
  warning(paste("Check the NA counts in these variables:", paste(na_vars, collapse = ', ')))
}
if (length(inf_vars) > 0) {
  warning(paste("Check the Inf counts in these variables:", paste(inf_vars, collapse = ', ')))
}
#It uses the map_lgl() function from the tidyverse package to apply the any(), is.nan(), is.na(), and is.infinite() functions to each variable in the data frame d. This creates logical vectors indicating which variables have any NaN, NA, or Inf values.
# It uses the names() function and indexing with the logical vectors to get the names of the variables with NaN, NA, or Inf values. These names are assigned to the nan_vars, na_vars, and inf_vars variables, respectively.
# It uses the warning() function to print warning messages
###

rm(nan_vars)
rm(na_vars)
rm(inf_vars)

options(repr.plot.width = 6, repr.plot.height = 4)

#missing data - This is a dataframe count of all the missing data in the dataframe.  Can be removed after visualization.  

# use `select_if()` to select only variables, instead of summarizing all variables 
missing_data <- d %>% dplyr::select_if(is.numeric) %>%
  dplyr::summarize_all(funs(sum(is.na(.))/n())) %>% 
  tidyr::gather(key = "variable", value = "percent_missing")

# Use ggplot2 for visualization
# Use fct_reorder() to reorder x-axis, to make it more readable and make it clear that the variable is a factor
ggplot(missing_data, aes(x = forcats::fct_reorder(variable, percent_missing), y = percent_missing)) +
    geom_col(fill = "red", aes(color = I('white')), linewidth = 0.3) + 
    xlab("Variables") + 
    ylab("Percent Missing") +
    coord_flip() + 
    theme_bw() +
    ggtitle("Missing Data by Variable")

rm(missing_data)

Remove constant variables from ‘d’ dataframe

All constant variables are removed from ‘d’ here.

#This code performs a logistic regression analysis of the variable 'Match_Status' on all other variables in the data frame 'train_match'. To check the linearity assumption, it first predicts the probability of positive 'Match_Status' and assigns predicted class based on threshold of 0.5. Then it selects only numeric predictors and makes a scatter plot with the logit (log of the ratio of predicted probability to 1-predicted probability) on the x-axis and predictor variable values on the y-axis. It also adds a loess smoother line to the scatter plot and facets by predictors to display all predictors in separate plots.

glm <- glm(Match_Status ~ ., 
           family = binomial,
           data=d)

Remove Near-zero/low-information variables from ‘d’ dataframe

#nearZeroVar() takes in data x, then looks at the ratio of the most common value to the second most common value, freqCut, and the percentage of distinct values out of the number of total samples, uniqueCut. By default, caret uses freqCut = 19 and uniqueCut = 10, which is fairly conservative. 

# Data:  Uses the d dataframe without any constant variables.  

# # Identify near zero variance predictors: remove_cols
remove_cols <- caret::nearZeroVar(d, names = TRUE, freqCut = 19, uniqueCut = 10)
remove_cols <- base::sort(remove_cols)

# Remove near zero variance predictors from data frame
d <- d[, setdiff(names(d), remove_cols)]

# uses the names() function from the tidyverse package to get the names of all the variables in the data frame d, rather than storing them in a separate variable. This can make the code more concise and easier to read.

knitr::kable(x = remove_cols,
             format = "html", 
             padding = 2,
             booktabs = TRUE,
             col.names = c("Names of Removed Variable"),
             format.args = list(big.sep = ",", decimal.mark = "."),
             digits = getOption("digits"),
             caption = "Zero Variance or Near Zero Variance Variables Removed",
             escape = FALSE)

Zero Variance or Near Zero Variance Variables Removed
Names of Removed Variable
Boy_Scouts
Count_of_Peer_Reviewed_Book_Chapter
Grant
Higher_Education_Institution
Interest_Group
Latin_Honors
Military_Service_Obligation
Misdemeanor_Conviction
NCAA
NCI
NIH
Other_Service_Obligation
Phi_beta_kappa
Photo_Received
USMLE_Pass_Fail_replaced
Valedictorian

Multicollinearity

CORRELATION

Correlation of Continuous Variables

Determine Correlation for Polytomous vs. Dichotomous vs. Numerical Variables

Levels of Variables for Polytomous vs. Dichotomous

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
##### Find correlations for mixtures of continuous, polytomous, and dichotomous variables ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

#Data: Here the imputed dataframe 'df' has any remaining NAs filled with 0.  
df <- df %>% replace(is.na(.), 0)

cor.max <- psych::mixedCor(data = df,
c = dput(names(numerical_vars)), #continuous variables
p = cat(capture.output(dput(names(polytomous_vars)))), #polytomous items
d = cat(capture.output(dput(names(dichotomous_vars)))), #dichotomous items
global = FALSE,
smooth = TRUE,
#use = "complete.obs", #Use only complete observations
method = "pearson") # Use Pearson correlations

#> c("Age", "Count_of_Oral_Presentation", "Count_of_Peer_Reviewed_Journal_Articles_Abstracts", 
#> "Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published", 
#> "Count_of_Poster_Presentation", "total_OBGYN_letter_writers", 
#> "reco_count", "number_of_applicant_first_author_publications", 
#> "work_exp_count", "Volunteer_exp_count", "Research_exp_count"
#> )
#> c("Advance_Degree", "Type_of_medical_school")c("Match_Status", "Gender", "US_or_Canadian_Applicant", "Medical_Degree",  "Visa_Sponsorship_Needed", "Medical_Education_or_Training_Interrupted",  "Alpha_Omega_Alpha", "Gold_Humanism_Honor_Society", "Couples_Match",  "Meeting_Name_Presented", "TopNIHfunded", "Higher_Education_Degree",  "Birth_Place", "Language_Fluency", "ACLS", "PALS", "Medical_Training_Discipline",  "Tracks_Applied_by_Applicant_1", "AMA", "ACOG")

# Extract the correlation matrix from the result
cor.max2 <- base::as.matrix(x = cor.max$rho) #rho is the complete matrix of continuous, polytomous, and dichotomous items

In this correlation plot, we want to look for the bright, large circles that immediately show strong correlations (size and shading depending on the coefficients’ absolute values; color depends on direction). This indicates whether two features are connected so that one changes with a predictable trend if you change the other. The closer this coefficient is to zero, the weaker the correlation. Anything you would have to squint to see is usually not worth seeing!

#Data: Using the imputed dataframe 'df' here.  

# Rename some variables for clarity
df %>%
  dplyr::rename(Count_of_Other_than_Published = Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published) %>%
  dplyr::rename(first_author_publications = number_of_applicant_first_author_publications) %>%
  dplyr::rename(Count_of_Journal_Articles_Abstracts = Count_of_Peer_Reviewed_Journal_Articles_Abstracts) %>%
  
  # Select only numeric variables (excluding total_OBGYN_letter_writers)
  dplyr::select_if(is.numeric) %>%
  #dplyr::select(-total_OBGYN_letter_writers) %>%
  
  # Compute the correlation matrix using only complete observations
  stats::cor(use="complete.obs") %>%
    
  # Plot the upper triangle of the correlation matrix
corrplot(type = "upper", diag = FALSE, outline.col = "white", 
         insig = "blank", sig.level = 0.05, mar = c(0, 0.5, 0.5, 0.5))

#This code sets the mar argument specifies the margin sizes in the following order: bottom, left, top, right. By increasing the margin sizes, you can make the plot larger.

# Save the plot as a PNG image
ggplot2::ggsave("output/fig/train_correlation.tiff", width = 8, height = 8, units = "in")

# library(plotly)
# 
# df1 <- df %>%
#   dplyr::rename(Count_of_Other_than_Published = Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published) %>%
#   dplyr::rename(first_author_publications = number_of_applicant_first_author_publications) %>%
#   dplyr::rename(Count_of_Journal_Articles_Abstracts = Count_of_Peer_Reviewed_Journal_Articles_Abstracts) %>%
#   
#   # Select only numeric variables (excluding total_OBGYN_letter_writers)
#   dplyr::select_if(is.numeric) #%>%
#   
#   #dplyr::select(-total_OBGYN_letter_writers) %>%
# 
# ###
# df %>%
#   dplyr::rename(Count_of_Other_than_Published = Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published) %>%
#   dplyr::rename(first_author_publications = number_of_applicant_first_author_publications) %>%
#   dplyr::rename(Count_of_Journal_Articles_Abstracts = Count_of_Peer_Reviewed_Journal_Articles_Abstracts) %>%
#   
#   # Select only numeric variables (excluding total_OBGYN_letter_writers)
#   dplyr::select_if(is.numeric)
#   
#   # Compute the correlation matrix using only complete observations
# cor_mat <- stats::cor(use="complete.obs") %>%
#   plot_ly(x = df1, z = cor_mat, x = names(df1), y = names(df1), type = "heatmap", colorscale = "Viridis")

Rank of Correlations

This plot ranks the correlation and produces a gradually decreasing order of columns. Which is really useful to analyze the top most correlated variables.

#Data:  Use the original 'd' dataframe again here.  

lares::corr_cross((d %>% dplyr::select(-Location, -Year)), 
                  rm.na = T,
                  max_pvalue = 0.05, 
                  top = 10, 
                  type=1,
                  grid = T,
                  plot=TRUE)

Based on the plots above I decided to remove Medical_Degree because it understandably correlates with ‘Type_of_medical_school’.

d <- d %>%
  dplyr::select(-Medical_Degree)

Data reduction was conducted in two phases. First, highly correlated variables were removed with R threshold = 0.7. Variables were removed from the data. Second, group LASSO (Least Absolute Shrinkage and Selection Operator) was used for automatic variables selection. The LASSO penalizes the absolute size of the regression coefficients to drive the coefficients of irrelevant variables to zero.

TABLES

Descriptive Analysis of `df` dataframe

Table 1

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####  Create tables displaying results of univariate analyses, stratified or not by categorical variable groupings ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #####[tibshirani1996regression]

# Perform a Shapiro-Wilks test to compare the groups defined by the target variable (Match_Status) and the predictors
# (all the other variables in the data)

d1 <- d %>%
  dplyr::rename (c(#"Match Status" = Match_Status,
                    "Gender" = Gender, 
                    "Age (years)" = Age,
                   "US or Canadian Applicant" = US_or_Canadian_Applicant,
                   "Visa Sponsorship Needed" = Visa_Sponsorship_Needed,
                   "Medical Education Interrupted"= Medical_Education_or_Training_Interrupted,
                   "Alpha Omega Alpha Membership" = Alpha_Omega_Alpha,
                   "Gold Humanism Honor Society Membership" = Gold_Humanism_Honor_Society,
                   "Participating in a Couples Match" = Couples_Match,
                   "Number of Oral Presentations" = Count_of_Oral_Presentation,
                   "Number of Peer-Reviewed Journal Articles or Abstracts" = Count_of_Peer_Reviewed_Journal_Articles_Abstracts, 
                   "Number of Peer-Reviewed Non-published Journal_Articles" = Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published,
                   "Number of Poster Presentations" = Count_of_Poster_Presentation, 
                   "Applicant Match Year" = Year,
                   "Presented at a Scientific Meeting" = Meeting_Name_Presented,
                   "Trained at a Top NIH-funded Medical School" = TopNIHfunded,
                   "Bachelor of Science Degree" = Higher_Education_Degree, 
                   "Birth Place" = Birth_Place,
                   "Fluency in a Foreign Language" = Language_Fluency, 
                   "Certification in Advanced Cardiac Life Support" = ACLS,
                   "Certification in Pediatric Advanced Life Support" = PALS,
                   "Prior Residency Training" = Medical_Training_Discipline, 
                   "Residency Tracks Applied" = Tracks_Applied_by_Applicant_1, 
                   "Membership in the American Medical Association" = AMA, 
                   "Membership in the American College of Obstetricians and Gynecologists" = ACOG,
                   "Received a Scholarship" = Scholarship, 
                   "Number of letter of recommendation OBGYN authors" = total_OBGYN_letter_writers,
                   "Number of letters of recommendation" = reco_count, 
                   "Number of first author publications" = number_of_applicant_first_author_publications,
                   "Advanced Degree Other Than Medical Degree" = Advance_Degree, 
                   "Type of medical school" = Type_of_medical_school, 
                   "Number of Work Experiences" = work_exp_count, 
                   "Number of Volunteer Experiences" = Volunteer_exp_count, 
                   "Number of Research Experiences" = Research_exp_count))
  

# Create a table with the test results
t1 <-  compareGroups::compareGroups(formula = Match_Status ~ .,
                                    data = d1,
                                    method = 4,                  #Shapiro-Wilks test
                                    na.action = "na.pass",
                                    alpha = 0.05)                #alpha significance threshold

# Create a table of the results of the test
t1tab <- compareGroups::createTable(x = t1, 
         digits.p = 3, sd.type = 1,  # Specify the number of decimal places for p-values 
                                     #and type of standard deviation
         show.p.overall = TRUE,      # Include an overall p-value in the table
         hide.no = "no")             # Don't hide any variables with "no" in the name

# Export the table as an HTML-formatted Markdown table
compareGroups::export2md(t1tab, 
                         strip = TRUE,  # remove leading and trailing white space
                         first.strip = TRUE,  # remove leading white space from the first line
                         digits = 1L,   # specify the number of decimal places
                         caption = "Description of Applicants by Match Status",  # add a caption to the table
                         booktabs = TRUE, width = "400px",  # use booktabs formatting and specify the width of the table
                         format = "html")  # output as HTML

Description of Applicants by Match Status
	Not_Matched	Matched	p.overall
	N=3307	N=3919
Gender:			<0.001
Female	2585 (78.2%)	3327 (84.9%)
Male	722 (21.8%)	592 (15.1%)
Age (years)	28.0 [27.0;31.0]	27.0 [26.0;29.0]	<0.001
US or Canadian Applicant	2232 (67.5%)	3659 (93.4%)	<0.001
Visa Sponsorship Needed	365 (11.0%)	88 (2.25%)	<0.001
Medical Education Interrupted	601 (18.2%)	398 (10.2%)	<0.001
Alpha Omega Alpha Membership	187 (5.65%)	521 (13.3%)	<0.001
Gold Humanism Honor Society Membership	335 (10.1%)	704 (18.0%)	<0.001
Participating in a Couples Match	205 (6.20%)	406 (10.4%)	<0.001
Number of Oral Presentations	0.00 [0.00;1.00]	0.00 [0.00;1.00]	<0.001
Number of Peer-Reviewed Journal Articles or Abstracts	0.00 [0.00;1.00]	0.00 [0.00;1.00]	<0.001
Number of Peer-Reviewed Non-published Journal_Articles	0.00 [0.00;0.00]	0.00 [0.00;1.00]	<0.001
Number of Poster Presentations	1.00 [0.00;2.00]	1.00 [0.00;3.00]	<0.001
Applicant Match Year:			<0.001
2016	148 (4.48%)	294 (7.50%)
2017	777 (23.5%)	1060 (27.0%)
2018	573 (17.3%)	711 (18.1%)
2019	1076 (32.5%)	1292 (33.0%)
2020	733 (22.2%)	562 (14.3%)
Location:			<0.001
BSW	174 (5.26%)	129 (3.29%)
CCAG	333 (10.1%)	270 (6.89%)
CU	1431 (43.3%)	1449 (37.0%)
Duke	409 (12.4%)	802 (20.5%)
Truman	170 (5.14%)	84 (2.14%)
U_Washington	126 (3.81%)	180 (4.59%)
UAB	130 (3.93%)	110 (2.81%)
Utah	534 (16.1%)	895 (22.8%)
Presented at a Scientific Meeting:			<0.001
Did not present at a meeting	2534 (76.6%)	2657 (67.8%)
Presented at a meeting	773 (23.4%)	1262 (32.2%)
Trained at a Top NIH-funded Medical School:			<0.001
Did not attend NIH top-funded medical school	2568 (77.7%)	2247 (57.3%)
Attended a NIH top-funded Medical School	739 (22.3%)	1672 (42.7%)
Bachelor of Science Degree:			<0.001
Not a B.S. degree	2132 (64.5%)	2136 (54.5%)
B.S.	1175 (35.5%)	1783 (45.5%)
Birth Place:			<0.001
Born outside the USA	1713 (51.8%)	1271 (32.4%)
Born in the USA	1594 (48.2%)	2648 (67.6%)
Fluency in a Foreign Language:			0.054
Speaks English and Another Language	1778 (75.6%)	2256 (73.2%)
Speaks English	575 (24.4%)	825 (26.8%)
Certification in Advanced Cardiac Life Support	1125 (47.8%)	1262 (41.0%)	<0.001
Certification in Pediatric Advanced Life Support	205 (8.71%)	106 (3.44%)	<0.001
Prior Residency Training:			<0.001
No Prior Residency Training	3006 (90.9%)	3843 (98.1%)
Prior Residency Training	301 (9.10%)	76 (1.94%)
Residency Tracks Applied:			<0.001
Applying for a Preliminary Position	2203 (66.6%)	2035 (51.9%)
Categorical Applicant	1104 (33.4%)	1884 (48.1%)
Membership in the American Medical Association:			<0.001
No AMA Membership	2386 (72.1%)	2302 (58.7%)
American Medical Association Member	921 (27.9%)	1617 (41.3%)
Membership in the American College of Obstetricians and Gynecologists:			<0.001
No ACOG Membership	2291 (69.3%)	1886 (48.1%)
ACOG Member	1016 (30.7%)	2033 (51.9%)
Received a Scholarship:			<0.001
No_scholarship	2769 (83.7%)	3033 (77.4%)
Scholarship	538 (16.3%)	886 (22.6%)
Number of letter of recommendation OBGYN authors	2.00 [1.00;2.00]	2.00 [2.00;3.00]	<0.001
Number of letters of recommendation	3.00 [0.00;4.00]	3.00 [3.00;4.00]	<0.001
Number of first author publications	0.00 [0.00;1.00]	1.00 [0.00;3.00]	<0.001
Advanced Degree Other Than Medical Degree:			0.015
M.B.A.	20 (0.60%)	14 (0.36%)
No Advanced Degree	3027 (91.5%)	3651 (93.2%)
Ph.D.	19 (0.57%)	10 (0.26%)
Other	241 (7.29%)	244 (6.23%)
Type of medical school:			.
U.S. Public School	1024 (31.0%)	1937 (49.4%)
International School	1075 (32.5%)	260 (6.63%)
Osteopathic School	704 (21.3%)	523 (13.3%)
Osteopathic School,International School	1 (0.03%)	0 (0.00%)
U.S. Private School	503 (15.2%)	1199 (30.6%)
Number of Work Experiences	2.00 [0.00;4.00]	2.00 [0.00;4.00]	0.052
Number of Volunteer Experiences	4.00 [0.00;8.00]	6.00 [2.00;9.00]	<0.001
Number of Research Experiences	1.00 [0.00;2.00]	2.00 [0.00;3.00]	<0.001

Table 2

Demographics of the All Residency Applicants

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####  Exports the result of createTable ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

t2 <-  compareGroups::compareGroups( ~ .,
                                     data = d1,
                                     method = 4)

t2 <- compareGroups::createTable(t2,
                                 type = 2,
                                 show.n = TRUE)

# Export the table as an HTML-formatted Markdown table
compareGroups::export2md(t2, 
                         strip = TRUE,  # remove leading and trailing white space
                         first.strip = TRUE,  # remove leading white space from the first line
                         digits = 1L,   # specify the number of decimal places
                         caption = "Demographics of the All Residency Applicants",  # add a caption to the table
                         booktabs = TRUE, width = "400px",  # use booktabs formatting and specify the width of the table
                         format = "html")  # output as HTML

Demographics of the All Residency Applicants
	[ALL]	N
	N=7226
Match_Status:		7226
Not_Matched	3307 (45.8%)
Matched	3919 (54.2%)
Gender:		7226
Female	5912 (81.8%)
Male	1314 (18.2%)
Age (years)	28.0 [26.0;30.0]	7226
US or Canadian Applicant:		7226
No	1335 (18.5%)
Yes	5891 (81.5%)
Visa Sponsorship Needed:		7226
No	6773 (93.7%)
Yes	453 (6.27%)
Medical Education Interrupted:		7226
No	6227 (86.2%)
Yes	999 (13.8%)
Alpha Omega Alpha Membership:		7226
No	6518 (90.2%)
Yes	708 (9.80%)
Gold Humanism Honor Society Membership:		7226
No	6187 (85.6%)
Yes	1039 (14.4%)
Participating in a Couples Match:		7226
No	6615 (91.5%)
Yes	611 (8.46%)
Number of Oral Presentations	0.00 [0.00;1.00]	7226
Number of Peer-Reviewed Journal Articles or Abstracts	0.00 [0.00;1.00]	7226
Number of Peer-Reviewed Non-published Journal_Articles	0.00 [0.00;0.00]	7226
Number of Poster Presentations	1.00 [0.00;3.00]	7226
Applicant Match Year:		7226
2016	442 (6.12%)
2017	1837 (25.4%)
2018	1284 (17.8%)
2019	2368 (32.8%)
2020	1295 (17.9%)
Location:		7226
BSW	303 (4.19%)
CCAG	603 (8.34%)
CU	2880 (39.9%)
Duke	1211 (16.8%)
Truman	254 (3.52%)
U_Washington	306 (4.23%)
UAB	240 (3.32%)
Utah	1429 (19.8%)
Presented at a Scientific Meeting:		7226
Did not present at a meeting	5191 (71.8%)
Presented at a meeting	2035 (28.2%)
Trained at a Top NIH-funded Medical School:		7226
Did not attend NIH top-funded medical school	4815 (66.6%)
Attended a NIH top-funded Medical School	2411 (33.4%)
Bachelor of Science Degree:		7226
Not a B.S. degree	4268 (59.1%)
B.S.	2958 (40.9%)
Birth Place:		7226
Born outside the USA	2984 (41.3%)
Born in the USA	4242 (58.7%)
Fluency in a Foreign Language:		5434
Speaks English and Another Language	4034 (74.2%)
Speaks English	1400 (25.8%)
Certification in Advanced Cardiac Life Support:		5434
No	3047 (56.1%)
Yes	2387 (43.9%)
Certification in Pediatric Advanced Life Support:		5434
No	5123 (94.3%)
Yes	311 (5.72%)
Prior Residency Training:		7226
No Prior Residency Training	6849 (94.8%)
Prior Residency Training	377 (5.22%)
Residency Tracks Applied:		7226
Applying for a Preliminary Position	4238 (58.6%)
Categorical Applicant	2988 (41.4%)
Membership in the American Medical Association:		7226
No AMA Membership	4688 (64.9%)
American Medical Association Member	2538 (35.1%)
Membership in the American College of Obstetricians and Gynecologists:		7226
No ACOG Membership	4177 (57.8%)
ACOG Member	3049 (42.2%)
Received a Scholarship:		7226
No_scholarship	5802 (80.3%)
Scholarship	1424 (19.7%)
Number of letter of recommendation OBGYN authors	2.00 [2.00;3.00]	5348
Number of letters of recommendation	3.00 [1.00;4.00]	7226
Number of first author publications	0.00 [0.00;2.00]	7226
Advanced Degree Other Than Medical Degree:		7226
M.B.A.	34 (0.47%)
No Advanced Degree	6678 (92.4%)
Ph.D.	29 (0.40%)
Other	485 (6.71%)
Type of medical school:		7226
U.S. Public School	2961 (41.0%)
International School	1335 (18.5%)
Osteopathic School	1227 (17.0%)
Osteopathic School,International School	1 (0.01%)
U.S. Private School	1702 (23.6%)
Number of Work Experiences	2.00 [0.00;4.00]	7226
Number of Volunteer Experiences	5.00 [0.00;9.00]	7226
Number of Research Experiences	1.00 [0.00;3.00]	7226

###

NOTES

Data Transformation: Applying transformations to variables, such as log or square root, to make them better suited for analysis. This was not done given the need to create a nomogram for the models.
Dimensionality Reduction: Was not done as we have few numerical variables.

A Model to Predict Chances of Matching into Obstetrics and Gynecology Residency at All Participating Sites

Tyler M. Muffly, MD

Department of Obstetrics and Gynecology, Denver Health, Denver, CO

Background

Objective

Computational Reproducibility

LOADING DATA INTO R ENVIRONMENT.

Data Dictionary/Descriptive Statistics with Exploratory Data Analysis

Codebook

DATA VIZ: UNIVARIATE AND BIVARIATE ANALYSIS

Univariate analysis with Exploratory Data Analysis

Bivariate analysis with Exploratory Data Analysis

Difference in Values Between the Match_Status Classes

Identify Missingness with Exploratory Data Analysis

Remove constant variables from ‘d’ dataframe

Remove Near-zero/low-information variables from ‘d’ dataframe

Multicollinearity

CORRELATION

Correlation of Continuous Variables

Determine Correlation for Polytomous vs. Dichotomous vs. Numerical Variables

Levels of Variables for Polytomous vs. Dichotomous

Rank of Correlations

TABLES

Descriptive Analysis of `df` dataframe

Table 1

Table 2

NOTES

A Model to Predict Chances of Matching into Obstetrics and Gynecology Residency at All Participating Sites

Tyler M. Muffly, MD

Department of Obstetrics and Gynecology, Denver Health, Denver, CO

Background

Objective

Computational Reproducibility

LOADING DATA INTO R ENVIRONMENT.

Data Dictionary/Descriptive Statistics with Exploratory Data Analysis

Codebook

DATA VIZ: UNIVARIATE AND BIVARIATE ANALYSIS

Univariate analysis with Exploratory Data Analysis

Bivariate analysis with Exploratory Data Analysis

Difference in Values Between the Match_Status Classes

Identify Missingness with Exploratory Data Analysis

Remove constant variables from ‘d’ dataframe

Remove Near-zero/low-information variables from ‘d’ dataframe

Multicollinearity

CORRELATION

Correlation of Continuous Variables

Determine Correlation for Polytomous vs. Dichotomous vs. Numerical Variables

Levels of Variables for Polytomous vs. Dichotomous

Rank of Correlations

TABLES

Descriptive Analysis of df dataframe

Table 1

Table 2

NOTES

Descriptive Analysis of `df` dataframe