Background

Twelve months ago, four of eight CU medical students applying to Obstetrics and Gynecology residency did not match. A model that predicts a medical student’s chances of matching into an obstetrics and gynecology residency may facilitate improved counseling and fewer unmatched medical students.

Objective

Create and validate a predictive model to understand a medical student’s chances of matching into obstetrics and gynecology residency.

Computational Reproducibility

I used the package renv to make the project more: isolated, portable, and reproducible. Therefore a lockfile with all versions of packages is set with every run of the code. In addition, version control for the project was stored on www.github.com/mufflyt in a private repository called nomogram. I tried keeping the code “readable” and “literate”. The start of data is downloaded from Dropbox each time to ensure we always start with the Next; the environment was controlled using Rstudio Cloud (https://posit.cloud/spaces/191846/content/4636956). We worked with the caret package (Classification and Regression Training) package using R given the number of tutorials and support. A strength of the caret package is that the exact syntax can be used whether the model we are considering has a categorical or numeric outcome. Analyses were conducted using the R Statistical language (version 4.2.2; R Core Team, 2022) on Ubuntu 20.04.5 LTS. I will not discuss mathematical background and equations. Lastly, the environment was controlled by using Rstudio Cloud (https://rstudio.cloud/spaces/191846/content/4546858).

LOADING DATA INTO R ENVIRONMENT.

After we’ve loaded the data in R, we’ll quickly look at the variables. ‘Match_Status’ is a categorical binary variable, meaning only two categories are possible. all_merged_Feature_engineering1 is a dataframe of the independent and the dependent variables for review. Each variable is contained in a column, and each row represents a single unique medical student. The most recent data was used if students applied for more than one year.

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####  Data import and prep.  Data types already set as numeric or factor.   ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
#Load Data from the website so it is always the correct file to start with.  

# Define file path and name
file_path <- "output/csv"
file_name <- "all_merged_Feature_engineering1.rds"

# Download file from URL
download.file("https://www.dropbox.com/s/i4hel9s8v33s76a/all_merged_Feature_engineering1.rds?raw=1",
              destfile = file.path(file_path, file_name), 
              method = "auto", 
              cacheOK = TRUE)

# all_merged_Feature_engineering1.rds - This is the original data that has been cleaned and merged for all programs.  The file is assigned to dataframe 'd' for ease of typing.  

# Read file and assign to variable
all_merged_Feature_engineering1 <- tryCatch({
  readRDS(file = file.path(file_path, file_name))
}, error = function(e) {
  # Handle error
  stop(paste("Error reading file:", e))
})

# Assign data to variable d
d <- all_merged_Feature_engineering1

We can see that these data have 7,226 observations of 52 features and that 54.2 percent of medical students applying to OB/GYN residency matched. Let’s create a few plots to get a sense of the data. Remember, the goal here will be to predict whether a given medical student will match into OB/GYN residency. We make ‘Not_Matched’ the reference level in the target variable, which will make logit models predict the probability of ‘Matched’.

# Subset data to exclude NA values in the Match_Status column
subset_data <- d[!is.na(d$Match_Status),]

# Create ggplot object using subsetted data and mapping Match_Status to the x-axis and fill color
whomatched <- ggplot2::ggplot(data = subset_data, 
                              mapping = aes(x = Match_Status, 
                                            fill = Match_Status)) +

# Add bar plot with count on y-axis
ggplot2::geom_bar(stat='count') +

# Add plot labels
labs(title = 'How many applicants \n matched into OBGYN?',
     caption = "Data from the 9 participating\n OBGYN residency programs.",
     colour = "Gears") +

# Add labels on top of bars using comma formatter
geom_label(stat='count',
           mapping = aes(label=scales::comma(after_stat(count))), 
           size=7) +

# Apply grey theme with base font size of 18
ggplot2::theme_grey(base_size = 18); whomatched

Data Dictionary/Descriptive Statistics with Exploratory Data Analysis

Some of the variables are skewed but I did not transform them because I wanted to make the nomogram with the original variables instead of a variables scaled and centered.

Codebook

A codebook is a technical description of the data that was collected for a particular purpose. It describes how the data are arranged in the computer file or files, what the various numbers and letters mean, and any special instructions on properly using the data. Predictors under consideration: 2020, 2019, 2018, 2017:

  1. all_data$Match_Status - Did they Match or not Match?
  2. all_data$Age - Age at the time of the match, numerical variable based on Date of Birth POSIXct
  3. all_data$Year - Year of participation in the match, 4 level categorical
  4. all_data$Gender - Male or Female, 2 level categorical
  5. all_data$Couples_Match - Participating in the couples match? 2 level categorical
  6. all_data$US_or_Canadian_Applicant - Are they a US Senior or an IMG, 2 level categorical
  7. all_data$Medical_Education_Interrupted - Taking breaks, 2 level categorical
  8. all_data$Alpha_Omega_Alpha - Membership in AOA, 2 level categorical
  9. all_data$Military_Service_Obligation
  10. all_data$USMLE_Step_1_Score - I did not use Step 2 score because most students will not have those numbers at the time they apply, numerical variable. Step 1 is now pass/fail.
  11. all_data$Count_of_Poster_Presentation - numerical variable
  12. all_data$Count_of_Oral_Presentation - numerical variable
  13. all_data$Count_of_Articles_Abstracts - numerical variable
  14. all_data$Count_of_Peer_Reviewed_Book_Chapter - numerical variable
  15. all_data$Count_of_Other_than_Published - numerical variable
  16. all_data$Visa_Sponsorship_Needed - numerical variable
  17. all_data$Medical_Degree - Allopathic versus Osteopathic medical school education, 2 level categorical
  18. all_data$AAMC_ID (Additional data not for analysis)
  19. all_data$Applicant Name (Additional data not for analysis)
  20. all_data$SOAP Applicant (2017 and earlier does not have SOAP data)
  21. all_data$SOAP Reapply Applicant (2017 and earlier does not have SOAP data)
  22. all_data$SOAP Match Status (2017 and earlier does not have SOAP data)
  23. all_data$SOAP Reapply Track Applied (2017 and earlier does not have SOAP data)
  24. all_data$Track Applied by Applicant - Prelim or categorical.
  25. all_data$Gold Humanism Honor Society
  26. all_data$Medical School of Graduation
  27. all_data$Medical School Type
  28. all_data$Misdemeanor Conviction
  29. all_data$Felony Conviction - No one reporting a history of felonies.
  30. all_data$Malpractice_Cases_Pending - One person with a pending malpractice case.
  31. all_data$Location - Location where the data came from.

DATA VIZ: UNIVARIATE AND BIVARIATE ANALYSIS

Univariate analysis with Exploratory Data Analysis

Univariate data analysis does not deal with causes or relationships since it only looks at one variable. The purpose is to find trends within the data using histograms and box plots.

# Use the plot_bar function from the DataExplorer package to create a bar plot of all numeric variables in the dataframe
DataExplorer::plot_bar(d, theme_config = list("text" = element_text(size = 10)), nrow = 3, ncol = 3)

Bivariate analysis with Exploratory Data Analysis

Bivariate visualizations and summary statistics assess the relationship between each variable and the target variable Match_Status. To decide which predictors we should include, we can visualize the relationships between the predictors with the outcome variable. Boxplot to see if the distribution of the continuous predictor is different across the two classes, i.e., if it is different, it means the predictor could help separate the two classes. Here we can see that Age could do so.

DataExplorer::plot_bar(d, by = "Match_Status", theme_config = list("text" = element_text(size = 10)), nrow = 3L, ncol = 3L) 

#Transforming the data to long format can sometimes make the plotting syntax a little easier. This allows for a simpler plot creation and means that we can make use of the facetting capabilites of ggplot2; facet_wrap in this example.

data_long %>%
    ggplot(aes(Match_Status, value, color = Match_Status)) + 
    geom_boxplot(alpha = 0.5, outlier.size = 2, notch = FALSE) + 
    geom_jitter(width = 0.1, height = 0, alpha = 0.5) +
    facet_wrap(~ key, scales='free_y', ncol = 3) + 
    labs(x = NULL, y = NULL) +
    theme_classic() 

Difference in Values Between the Match_Status Classes

#We can also look at this by doing some calculations, and see by how much the values differ between the two classes of Match_Status Doing this we can put a numeric value on the difference between the average values for each of our variables.

#data_long_output is a cleaner version of 'data_long' that can be outputed into a table.  

data_long_output <- data_long %>%
    dplyr::group_by(key, Match_Status) %>%
    dplyr::summarise(mean = mean(value)) %>%
    spread(Match_Status, mean) %>% # Spread the data to make the table 'wide'
    dplyr::mutate(Diff = Matched - Not_Matched, 
           `Percentage Difference` = (Matched - Not_Matched) / Matched * 100) %>%
    dplyr::arrange(desc(abs(`Percentage Difference`))) # Arrange by the largest differences, accounting for negatives
knitr_table_html(data_long_output, "Difference in Values Between the Match_Status Classes")
Difference in Values Between the Match_Status Classes
key Not_Matched Matched Diff Percentage Difference
Count_of_Peer_Reviewed_Book_Chapter 0.056 0.041 -0.015 -37.76
Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published 0.309 0.462 0.153 33.12
number_of_applicant_first_author_publications 1.305 1.872 0.566 30.26
Count_of_Poster_Presentation 1.548 2.114 0.566 26.79
Count_of_Peer_Reviewed_Journal_Articles_Abstracts 1.085 1.239 0.154 12.44
Age 29.919 27.891 -2.027 -7.27
Count_of_Oral_Presentation 0.907 0.962 0.056 5.81

Identify Missingness with Exploratory Data Analysis

# Check for NaN, NA, Inf, and -Inf values
nan_vars <- names(d)[map_lgl(d, ~ any(is.nan(.x)))] 
na_vars  <- names(d)[map_lgl(d, ~ any(is.na(.x) & !is.nan(.x)))]
inf_vars <- names(d)[map_lgl(d, ~ any(is.infinite(.x)))] 

# Print warning messages for variables with NaN, NA, or Inf values
if (length(nan_vars) > 0) {
  warning(paste("Check the NaN counts in these variables:", paste(nan_vars, collapse = ', ')))
}
if (length(na_vars) > 0) {
  warning(paste("Check the NA counts in these variables:", paste(na_vars, collapse = ', ')))
}
if (length(inf_vars) > 0) {
  warning(paste("Check the Inf counts in these variables:", paste(inf_vars, collapse = ', ')))
}
#It uses the map_lgl() function from the tidyverse package to apply the any(), is.nan(), is.na(), and is.infinite() functions to each variable in the data frame d. This creates logical vectors indicating which variables have any NaN, NA, or Inf values.
# It uses the names() function and indexing with the logical vectors to get the names of the variables with NaN, NA, or Inf values. These names are assigned to the nan_vars, na_vars, and inf_vars variables, respectively.
# It uses the warning() function to print warning messages
###
rm(nan_vars)
rm(na_vars)
rm(inf_vars)
options(repr.plot.width = 6, repr.plot.height = 4)

#missing data - This is a dataframe count of all the missing data in the dataframe.  Can be removed after visualization.  

# use `select_if()` to select only variables, instead of summarizing all variables 
missing_data <- d %>% dplyr::select_if(is.numeric) %>%
  dplyr::summarize_all(funs(sum(is.na(.))/n())) %>% 
  tidyr::gather(key = "variable", value = "percent_missing")

# Use ggplot2 for visualization
# Use fct_reorder() to reorder x-axis, to make it more readable and make it clear that the variable is a factor
ggplot(missing_data, aes(x = forcats::fct_reorder(variable, percent_missing), y = percent_missing)) +
    geom_col(fill = "red", aes(color = I('white')), linewidth = 0.3) + 
    xlab("Variables") + 
    ylab("Percent Missing") +
    coord_flip() + 
    theme_bw() +
    ggtitle("Missing Data by Variable")

rm(missing_data)

Remove constant variables from ‘d’ dataframe

All constant variables are removed from ‘d’ here.

#This code performs a logistic regression analysis of the variable 'Match_Status' on all other variables in the data frame 'train_match'. To check the linearity assumption, it first predicts the probability of positive 'Match_Status' and assigns predicted class based on threshold of 0.5. Then it selects only numeric predictors and makes a scatter plot with the logit (log of the ratio of predicted probability to 1-predicted probability) on the x-axis and predictor variable values on the y-axis. It also adds a loess smoother line to the scatter plot and facets by predictors to display all predictors in separate plots.

glm <- glm(Match_Status ~ ., 
           family = binomial,
           data=d)

Remove Near-zero/low-information variables from ‘d’ dataframe

#nearZeroVar() takes in data x, then looks at the ratio of the most common value to the second most common value, freqCut, and the percentage of distinct values out of the number of total samples, uniqueCut. By default, caret uses freqCut = 19 and uniqueCut = 10, which is fairly conservative. 

# Data:  Uses the d dataframe without any constant variables.  

# # Identify near zero variance predictors: remove_cols
remove_cols <- caret::nearZeroVar(d, names = TRUE, freqCut = 19, uniqueCut = 10)
remove_cols <- base::sort(remove_cols)

# Remove near zero variance predictors from data frame
d <- d[, setdiff(names(d), remove_cols)]

# uses the names() function from the tidyverse package to get the names of all the variables in the data frame d, rather than storing them in a separate variable. This can make the code more concise and easier to read.
knitr::kable(x = remove_cols,
             format = "html", 
             padding = 2,
             booktabs = TRUE,
             col.names = c("Names of Removed Variable"),
             format.args = list(big.sep = ",", decimal.mark = "."),
             digits = getOption("digits"),
             caption = "Zero Variance or Near Zero Variance Variables Removed",
             escape = FALSE)
Zero Variance or Near Zero Variance Variables Removed
Names of Removed Variable
Boy_Scouts
Count_of_Peer_Reviewed_Book_Chapter
Grant
Higher_Education_Institution
Interest_Group
Latin_Honors
Military_Service_Obligation
Misdemeanor_Conviction
NCAA
NCI
NIH
Other_Service_Obligation
Phi_beta_kappa
Photo_Received
USMLE_Pass_Fail_replaced
Valedictorian

Multicollinearity

CORRELATION

Correlation of Continuous Variables

Determine Correlation for Polytomous vs. Dichotomous vs. Numerical Variables

Levels of Variables for Polytomous vs. Dichotomous

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
##### Find correlations for mixtures of continuous, polytomous, and dichotomous variables ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

#Data: Here the imputed dataframe 'df' has any remaining NAs filled with 0.  
df <- df %>% replace(is.na(.), 0)

cor.max <- psych::mixedCor(data = df,
c = dput(names(numerical_vars)), #continuous variables
p = cat(capture.output(dput(names(polytomous_vars)))), #polytomous items
d = cat(capture.output(dput(names(dichotomous_vars)))), #dichotomous items
global = FALSE,
smooth = TRUE,
#use = "complete.obs", #Use only complete observations
method = "pearson") # Use Pearson correlations
#> c("Age", "Count_of_Oral_Presentation", "Count_of_Peer_Reviewed_Journal_Articles_Abstracts", 
#> "Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published", 
#> "Count_of_Poster_Presentation", "total_OBGYN_letter_writers", 
#> "reco_count", "number_of_applicant_first_author_publications", 
#> "work_exp_count", "Volunteer_exp_count", "Research_exp_count"
#> )
#> c("Advance_Degree", "Type_of_medical_school")c("Match_Status", "Gender", "US_or_Canadian_Applicant", "Medical_Degree",  "Visa_Sponsorship_Needed", "Medical_Education_or_Training_Interrupted",  "Alpha_Omega_Alpha", "Gold_Humanism_Honor_Society", "Couples_Match",  "Meeting_Name_Presented", "TopNIHfunded", "Higher_Education_Degree",  "Birth_Place", "Language_Fluency", "ACLS", "PALS", "Medical_Training_Discipline",  "Tracks_Applied_by_Applicant_1", "AMA", "ACOG")
# Extract the correlation matrix from the result
cor.max2 <- base::as.matrix(x = cor.max$rho) #rho is the complete matrix of continuous, polytomous, and dichotomous items

In this correlation plot, we want to look for the bright, large circles that immediately show strong correlations (size and shading depending on the coefficients’ absolute values; color depends on direction). This indicates whether two features are connected so that one changes with a predictable trend if you change the other. The closer this coefficient is to zero, the weaker the correlation. Anything you would have to squint to see is usually not worth seeing!

#Data: Using the imputed dataframe 'df' here.  

# Rename some variables for clarity
df %>%
  dplyr::rename(Count_of_Other_than_Published = Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published) %>%
  dplyr::rename(first_author_publications = number_of_applicant_first_author_publications) %>%
  dplyr::rename(Count_of_Journal_Articles_Abstracts = Count_of_Peer_Reviewed_Journal_Articles_Abstracts) %>%
  
  # Select only numeric variables (excluding total_OBGYN_letter_writers)
  dplyr::select_if(is.numeric) %>%
  #dplyr::select(-total_OBGYN_letter_writers) %>%
  
  # Compute the correlation matrix using only complete observations
  stats::cor(use="complete.obs") %>%
    
  # Plot the upper triangle of the correlation matrix
corrplot(type = "upper", diag = FALSE, outline.col = "white", 
         insig = "blank", sig.level = 0.05, mar = c(0, 0.5, 0.5, 0.5)) 

#This code sets the mar argument specifies the margin sizes in the following order: bottom, left, top, right. By increasing the margin sizes, you can make the plot larger.

# Save the plot as a PNG image
ggplot2::ggsave("output/fig/train_correlation.tiff", width = 8, height = 8, units = "in")
# library(plotly)
# 
# df1 <- df %>%
#   dplyr::rename(Count_of_Other_than_Published = Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published) %>%
#   dplyr::rename(first_author_publications = number_of_applicant_first_author_publications) %>%
#   dplyr::rename(Count_of_Journal_Articles_Abstracts = Count_of_Peer_Reviewed_Journal_Articles_Abstracts) %>%
#   
#   # Select only numeric variables (excluding total_OBGYN_letter_writers)
#   dplyr::select_if(is.numeric) #%>%
#   
#   #dplyr::select(-total_OBGYN_letter_writers) %>%
# 
# ###
# df %>%
#   dplyr::rename(Count_of_Other_than_Published = Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published) %>%
#   dplyr::rename(first_author_publications = number_of_applicant_first_author_publications) %>%
#   dplyr::rename(Count_of_Journal_Articles_Abstracts = Count_of_Peer_Reviewed_Journal_Articles_Abstracts) %>%
#   
#   # Select only numeric variables (excluding total_OBGYN_letter_writers)
#   dplyr::select_if(is.numeric)
#   
#   # Compute the correlation matrix using only complete observations
# cor_mat <- stats::cor(use="complete.obs") %>%
#   plot_ly(x = df1, z = cor_mat, x = names(df1), y = names(df1), type = "heatmap", colorscale = "Viridis")

Rank of Correlations

This plot ranks the correlation and produces a gradually decreasing order of columns. Which is really useful to analyze the top most correlated variables.

#Data:  Use the original 'd' dataframe again here.  

lares::corr_cross((d %>% dplyr::select(-Location, -Year)), 
                  rm.na = T,
                  max_pvalue = 0.05, 
                  top = 10, 
                  type=1,
                  grid = T,
                  plot=TRUE)

Based on the plots above I decided to remove Medical_Degree because it understandably correlates with ‘Type_of_medical_school’.

d <- d %>%
  dplyr::select(-Medical_Degree)

Data reduction was conducted in two phases. First, highly correlated variables were removed with R threshold = 0.7. Variables were removed from the data. Second, group LASSO (Least Absolute Shrinkage and Selection Operator) was used for automatic variables selection. The LASSO penalizes the absolute size of the regression coefficients to drive the coefficients of irrelevant variables to zero.

TABLES

Descriptive Analysis of df dataframe

Table 1

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####  Create tables displaying results of univariate analyses, stratified or not by categorical variable groupings ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #####[tibshirani1996regression]

# Perform a Shapiro-Wilks test to compare the groups defined by the target variable (Match_Status) and the predictors
# (all the other variables in the data)

d1 <- d %>%
  dplyr::rename (c(#"Match Status" = Match_Status,
                    "Gender" = Gender, 
                    "Age (years)" = Age,
                   "US or Canadian Applicant" = US_or_Canadian_Applicant,
                   "Visa Sponsorship Needed" = Visa_Sponsorship_Needed,
                   "Medical Education Interrupted"= Medical_Education_or_Training_Interrupted,
                   "Alpha Omega Alpha Membership" = Alpha_Omega_Alpha,
                   "Gold Humanism Honor Society Membership" = Gold_Humanism_Honor_Society,
                   "Participating in a Couples Match" = Couples_Match,
                   "Number of Oral Presentations" = Count_of_Oral_Presentation,
                   "Number of Peer-Reviewed Journal Articles or Abstracts" = Count_of_Peer_Reviewed_Journal_Articles_Abstracts, 
                   "Number of Peer-Reviewed Non-published Journal_Articles" = Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published,
                   "Number of Poster Presentations" = Count_of_Poster_Presentation, 
                   "Applicant Match Year" = Year,
                   "Presented at a Scientific Meeting" = Meeting_Name_Presented,
                   "Trained at a Top NIH-funded Medical School" = TopNIHfunded,
                   "Bachelor of Science Degree" = Higher_Education_Degree, 
                   "Birth Place" = Birth_Place,
                   "Fluency in a Foreign Language" = Language_Fluency, 
                   "Certification in Advanced Cardiac Life Support" = ACLS,
                   "Certification in Pediatric Advanced Life Support" = PALS,
                   "Prior Residency Training" = Medical_Training_Discipline, 
                   "Residency Tracks Applied" = Tracks_Applied_by_Applicant_1, 
                   "Membership in the American Medical Association" = AMA, 
                   "Membership in the American College of Obstetricians and Gynecologists" = ACOG,
                   "Received a Scholarship" = Scholarship, 
                   "Number of letter of recommendation OBGYN authors" = total_OBGYN_letter_writers,
                   "Number of letters of recommendation" = reco_count, 
                   "Number of first author publications" = number_of_applicant_first_author_publications,
                   "Advanced Degree Other Than Medical Degree" = Advance_Degree, 
                   "Type of medical school" = Type_of_medical_school, 
                   "Number of Work Experiences" = work_exp_count, 
                   "Number of Volunteer Experiences" = Volunteer_exp_count, 
                   "Number of Research Experiences" = Research_exp_count))
  

# Create a table with the test results
t1 <-  compareGroups::compareGroups(formula = Match_Status ~ .,
                                    data = d1,
                                    method = 4,                  #Shapiro-Wilks test
                                    na.action = "na.pass",
                                    alpha = 0.05)                #alpha significance threshold

# Create a table of the results of the test
t1tab <- compareGroups::createTable(x = t1, 
         digits.p = 3, sd.type = 1,  # Specify the number of decimal places for p-values 
                                     #and type of standard deviation
         show.p.overall = TRUE,      # Include an overall p-value in the table
         hide.no = "no")             # Don't hide any variables with "no" in the name

# Export the table as an HTML-formatted Markdown table
compareGroups::export2md(t1tab, 
                         strip = TRUE,  # remove leading and trailing white space
                         first.strip = TRUE,  # remove leading white space from the first line
                         digits = 1L,   # specify the number of decimal places
                         caption = "Description of Applicants by Match Status",  # add a caption to the table
                         booktabs = TRUE, width = "400px",  # use booktabs formatting and specify the width of the table
                         format = "html")  # output as HTML
Description of Applicants by Match Status
Not_Matched Matched p.overall
N=3307 N=3919
Gender: <0.001
Female 2585 (78.2%) 3327 (84.9%)
Male 722 (21.8%) 592 (15.1%)
Age (years) 28.0 [27.0;31.0] 27.0 [26.0;29.0] <0.001
US or Canadian Applicant 2232 (67.5%) 3659 (93.4%) <0.001
Visa Sponsorship Needed 365 (11.0%) 88 (2.25%) <0.001
Medical Education Interrupted 601 (18.2%) 398 (10.2%) <0.001
Alpha Omega Alpha Membership 187 (5.65%) 521 (13.3%) <0.001
Gold Humanism Honor Society Membership 335 (10.1%) 704 (18.0%) <0.001
Participating in a Couples Match 205 (6.20%) 406 (10.4%) <0.001
Number of Oral Presentations 0.00 [0.00;1.00] 0.00 [0.00;1.00] <0.001
Number of Peer-Reviewed Journal Articles or Abstracts 0.00 [0.00;1.00] 0.00 [0.00;1.00] <0.001
Number of Peer-Reviewed Non-published Journal_Articles 0.00 [0.00;0.00] 0.00 [0.00;1.00] <0.001
Number of Poster Presentations 1.00 [0.00;2.00] 1.00 [0.00;3.00] <0.001
Applicant Match Year: <0.001
2016 148 (4.48%) 294 (7.50%)
2017 777 (23.5%) 1060 (27.0%)
2018 573 (17.3%) 711 (18.1%)
2019 1076 (32.5%) 1292 (33.0%)
2020 733 (22.2%) 562 (14.3%)
Location: <0.001
BSW 174 (5.26%) 129 (3.29%)
CCAG 333 (10.1%) 270 (6.89%)
CU 1431 (43.3%) 1449 (37.0%)
Duke 409 (12.4%) 802 (20.5%)
Truman 170 (5.14%) 84 (2.14%)
U_Washington 126 (3.81%) 180 (4.59%)
UAB 130 (3.93%) 110 (2.81%)
Utah 534 (16.1%) 895 (22.8%)
Presented at a Scientific Meeting: <0.001
Did not present at a meeting 2534 (76.6%) 2657 (67.8%)
Presented at a meeting 773 (23.4%) 1262 (32.2%)
Trained at a Top NIH-funded Medical School: <0.001
Did not attend NIH top-funded medical school 2568 (77.7%) 2247 (57.3%)
Attended a NIH top-funded Medical School 739 (22.3%) 1672 (42.7%)
Bachelor of Science Degree: <0.001
Not a B.S. degree 2132 (64.5%) 2136 (54.5%)
B.S. 1175 (35.5%) 1783 (45.5%)
Birth Place: <0.001
Born outside the USA 1713 (51.8%) 1271 (32.4%)
Born in the USA 1594 (48.2%) 2648 (67.6%)
Fluency in a Foreign Language: 0.054
Speaks English and Another Language 1778 (75.6%) 2256 (73.2%)
Speaks English 575 (24.4%) 825 (26.8%)
Certification in Advanced Cardiac Life Support 1125 (47.8%) 1262 (41.0%) <0.001
Certification in Pediatric Advanced Life Support 205 (8.71%) 106 (3.44%) <0.001
Prior Residency Training: <0.001
No Prior Residency Training 3006 (90.9%) 3843 (98.1%)
Prior Residency Training 301 (9.10%) 76 (1.94%)
Residency Tracks Applied: <0.001
Applying for a Preliminary Position 2203 (66.6%) 2035 (51.9%)
Categorical Applicant 1104 (33.4%) 1884 (48.1%)
Membership in the American Medical Association: <0.001
No AMA Membership 2386 (72.1%) 2302 (58.7%)
American Medical Association Member 921 (27.9%) 1617 (41.3%)
Membership in the American College of Obstetricians and Gynecologists: <0.001
No ACOG Membership 2291 (69.3%) 1886 (48.1%)
ACOG Member 1016 (30.7%) 2033 (51.9%)
Received a Scholarship: <0.001
No_scholarship 2769 (83.7%) 3033 (77.4%)
Scholarship 538 (16.3%) 886 (22.6%)
Number of letter of recommendation OBGYN authors 2.00 [1.00;2.00] 2.00 [2.00;3.00] <0.001
Number of letters of recommendation 3.00 [0.00;4.00] 3.00 [3.00;4.00] <0.001
Number of first author publications 0.00 [0.00;1.00] 1.00 [0.00;3.00] <0.001
Advanced Degree Other Than Medical Degree: 0.015
M.B.A. 20 (0.60%) 14 (0.36%)
No Advanced Degree 3027 (91.5%) 3651 (93.2%)
Ph.D.  19 (0.57%) 10 (0.26%)
Other 241 (7.29%) 244 (6.23%)
Type of medical school: .
U.S. Public School 1024 (31.0%) 1937 (49.4%)
International School 1075 (32.5%) 260 (6.63%)
Osteopathic School 704 (21.3%) 523 (13.3%)
Osteopathic School,International School 1 (0.03%) 0 (0.00%)
U.S. Private School 503 (15.2%) 1199 (30.6%)
Number of Work Experiences 2.00 [0.00;4.00] 2.00 [0.00;4.00] 0.052
Number of Volunteer Experiences 4.00 [0.00;8.00] 6.00 [2.00;9.00] <0.001
Number of Research Experiences 1.00 [0.00;2.00] 2.00 [0.00;3.00] <0.001

Table 2

Demographics of the All Residency Applicants

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####  Exports the result of createTable ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

t2 <-  compareGroups::compareGroups( ~ .,
                                     data = d1,
                                     method = 4)

t2 <- compareGroups::createTable(t2,
                                 type = 2,
                                 show.n = TRUE)

# Export the table as an HTML-formatted Markdown table
compareGroups::export2md(t2, 
                         strip = TRUE,  # remove leading and trailing white space
                         first.strip = TRUE,  # remove leading white space from the first line
                         digits = 1L,   # specify the number of decimal places
                         caption = "Demographics of the All Residency Applicants",  # add a caption to the table
                         booktabs = TRUE, width = "400px",  # use booktabs formatting and specify the width of the table
                         format = "html")  # output as HTML
Demographics of the All Residency Applicants
[ALL] N
N=7226
Match_Status: 7226
Not_Matched 3307 (45.8%)
Matched 3919 (54.2%)
Gender: 7226
Female 5912 (81.8%)
Male 1314 (18.2%)
Age (years) 28.0 [26.0;30.0] 7226
US or Canadian Applicant: 7226
No 1335 (18.5%)
Yes 5891 (81.5%)
Visa Sponsorship Needed: 7226
No 6773 (93.7%)
Yes 453 (6.27%)
Medical Education Interrupted: 7226
No 6227 (86.2%)
Yes 999 (13.8%)
Alpha Omega Alpha Membership: 7226
No 6518 (90.2%)
Yes 708 (9.80%)
Gold Humanism Honor Society Membership: 7226
No 6187 (85.6%)
Yes 1039 (14.4%)
Participating in a Couples Match: 7226
No 6615 (91.5%)
Yes 611 (8.46%)
Number of Oral Presentations 0.00 [0.00;1.00] 7226
Number of Peer-Reviewed Journal Articles or Abstracts 0.00 [0.00;1.00] 7226
Number of Peer-Reviewed Non-published Journal_Articles 0.00 [0.00;0.00] 7226
Number of Poster Presentations 1.00 [0.00;3.00] 7226
Applicant Match Year: 7226
2016 442 (6.12%)
2017 1837 (25.4%)
2018 1284 (17.8%)
2019 2368 (32.8%)
2020 1295 (17.9%)
Location: 7226
BSW 303 (4.19%)
CCAG 603 (8.34%)
CU 2880 (39.9%)
Duke 1211 (16.8%)
Truman 254 (3.52%)
U_Washington 306 (4.23%)
UAB 240 (3.32%)
Utah 1429 (19.8%)
Presented at a Scientific Meeting: 7226
Did not present at a meeting 5191 (71.8%)
Presented at a meeting 2035 (28.2%)
Trained at a Top NIH-funded Medical School: 7226
Did not attend NIH top-funded medical school 4815 (66.6%)
Attended a NIH top-funded Medical School 2411 (33.4%)
Bachelor of Science Degree: 7226
Not a B.S. degree 4268 (59.1%)
B.S. 2958 (40.9%)
Birth Place: 7226
Born outside the USA 2984 (41.3%)
Born in the USA 4242 (58.7%)
Fluency in a Foreign Language: 5434
Speaks English and Another Language 4034 (74.2%)
Speaks English 1400 (25.8%)
Certification in Advanced Cardiac Life Support: 5434
No 3047 (56.1%)
Yes 2387 (43.9%)
Certification in Pediatric Advanced Life Support: 5434
No 5123 (94.3%)
Yes 311 (5.72%)
Prior Residency Training: 7226
No Prior Residency Training 6849 (94.8%)
Prior Residency Training 377 (5.22%)
Residency Tracks Applied: 7226
Applying for a Preliminary Position 4238 (58.6%)
Categorical Applicant 2988 (41.4%)
Membership in the American Medical Association: 7226
No AMA Membership 4688 (64.9%)
American Medical Association Member 2538 (35.1%)
Membership in the American College of Obstetricians and Gynecologists: 7226
No ACOG Membership 4177 (57.8%)
ACOG Member 3049 (42.2%)
Received a Scholarship: 7226
No_scholarship 5802 (80.3%)
Scholarship 1424 (19.7%)
Number of letter of recommendation OBGYN authors 2.00 [2.00;3.00] 5348
Number of letters of recommendation 3.00 [1.00;4.00] 7226
Number of first author publications 0.00 [0.00;2.00] 7226
Advanced Degree Other Than Medical Degree: 7226
M.B.A. 34 (0.47%)
No Advanced Degree 6678 (92.4%)
Ph.D.  29 (0.40%)
Other 485 (6.71%)
Type of medical school: 7226
U.S. Public School 2961 (41.0%)
International School 1335 (18.5%)
Osteopathic School 1227 (17.0%)
Osteopathic School,International School 1 (0.01%)
U.S. Private School 1702 (23.6%)
Number of Work Experiences 2.00 [0.00;4.00] 7226
Number of Volunteer Experiences 5.00 [0.00;9.00] 7226
Number of Research Experiences 1.00 [0.00;3.00] 7226
###

NOTES

  1. Data Transformation: Applying transformations to variables, such as log or square root, to make them better suited for analysis. This was not done given the need to create a nomogram for the models.

  2. Dimensionality Reduction: Was not done as we have few numerical variables.