Twelve months ago, four of eight CU medical students applying to Obstetrics and Gynecology residency did not match. A model that predicts a medical student’s chances of matching into an obstetrics and gynecology residency may facilitate improved counseling and fewer unmatched medical students.
Create and validate a predictive model to understand a medical student’s chances of matching into obstetrics and gynecology residency.
I used the package renv
to make the project more:
isolated, portable, and reproducible. Therefore a lockfile with all
versions of packages is set with every run of the code. In addition,
version control for the project was stored on www.github.com/mufflyt in
a private repository called nomogram
. I tried keeping the
code “readable” and “literate”. The start of data is downloaded from
Dropbox each time to ensure we always start with the Next; the
environment was controlled using Rstudio Cloud (https://posit.cloud/spaces/191846/content/4636956). We
worked with the caret
package (Classification and
Regression Training) package using R
given the number of
tutorials and support. A strength of the caret
package is
that the exact syntax can be used whether the model we are considering
has a categorical or numeric outcome. Analyses were conducted using the
R Statistical language (version 4.2.2; R Core Team, 2022) on Ubuntu
20.04.5 LTS. I will not discuss mathematical background and equations.
Lastly, the environment was controlled by using Rstudio Cloud (https://rstudio.cloud/spaces/191846/content/4546858).
After we’ve loaded the data in R, we’ll quickly look at the
variables. ‘Match_Status’ is a categorical binary variable, meaning only
two categories are possible.
all_merged_Feature_engineering1
is a dataframe of the
independent and the dependent variables for review. Each variable is
contained in a column, and each row represents a single unique medical
student. The most recent data was used if students applied for more than
one year.
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
##### Data import and prep. Data types already set as numeric or factor. ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
#Load Data from the website so it is always the correct file to start with.
# Define file path and name
file_path <- "output/csv"
file_name <- "all_merged_Feature_engineering1.rds"
# Download file from URL
download.file("https://www.dropbox.com/s/i4hel9s8v33s76a/all_merged_Feature_engineering1.rds?raw=1",
destfile = file.path(file_path, file_name),
method = "auto",
cacheOK = TRUE)
# all_merged_Feature_engineering1.rds - This is the original data that has been cleaned and merged for all programs. The file is assigned to dataframe 'd' for ease of typing.
# Read file and assign to variable
all_merged_Feature_engineering1 <- tryCatch({
readRDS(file = file.path(file_path, file_name))
}, error = function(e) {
# Handle error
stop(paste("Error reading file:", e))
})
# Assign data to variable d
d <- all_merged_Feature_engineering1
We can see that these data have 7,226 observations of 52 features and that 54.2 percent of medical students applying to OB/GYN residency matched. Let’s create a few plots to get a sense of the data. Remember, the goal here will be to predict whether a given medical student will match into OB/GYN residency. We make ‘Not_Matched’ the reference level in the target variable, which will make logit models predict the probability of ‘Matched’.
# Subset data to exclude NA values in the Match_Status column
subset_data <- d[!is.na(d$Match_Status),]
# Create ggplot object using subsetted data and mapping Match_Status to the x-axis and fill color
whomatched <- ggplot2::ggplot(data = subset_data,
mapping = aes(x = Match_Status,
fill = Match_Status)) +
# Add bar plot with count on y-axis
ggplot2::geom_bar(stat='count') +
# Add plot labels
labs(title = 'How many applicants \n matched into OBGYN?',
caption = "Data from the 9 participating\n OBGYN residency programs.",
colour = "Gears") +
# Add labels on top of bars using comma formatter
geom_label(stat='count',
mapping = aes(label=scales::comma(after_stat(count))),
size=7) +
# Apply grey theme with base font size of 18
ggplot2::theme_grey(base_size = 18); whomatched
Some of the variables are skewed but I did not transform them because I wanted to make the nomogram with the original variables instead of a variables scaled and centered.
A codebook is a technical description of the data that was collected for a particular purpose. It describes how the data are arranged in the computer file or files, what the various numbers and letters mean, and any special instructions on properly using the data. Predictors under consideration: 2020, 2019, 2018, 2017:
all_data$Match_Status
- Did they Match or not
Match?all_data$Age
- Age at the time of the match, numerical
variable based on Date of Birth POSIXctall_data$Year
- Year of participation in the match, 4
level categoricalall_data$Gender
- Male or Female, 2 level
categoricalall_data$Couples_Match
- Participating in the couples
match? 2 level categoricalall_data$US_or_Canadian_Applicant
- Are they a US
Senior or an IMG, 2 level categoricalall_data$Medical_Education_Interrupted
- Taking breaks,
2 level categoricalall_data$Alpha_Omega_Alpha
- Membership in AOA, 2 level
categoricalall_data$Military_Service_Obligation
all_data$USMLE_Step_1_Score
- I did not use Step 2
score because most students will not have those numbers at the time they
apply, numerical variable. Step 1 is now pass/fail.all_data$Count_of_Poster_Presentation
- numerical
variableall_data$Count_of_Oral_Presentation
- numerical
variableall_data$Count_of_Articles_Abstracts
- numerical
variableall_data$Count_of_Peer_Reviewed_Book_Chapter
-
numerical variableall_data$Count_of_Other_than_Published
- numerical
variableall_data$Visa_Sponsorship_Needed
- numerical
variableall_data$Medical_Degree
- Allopathic versus Osteopathic
medical school education, 2 level categoricalall_data$AAMC_ID
(Additional data not for
analysis)all_data$Applicant Name
(Additional data not for
analysis)all_data$SOAP Applicant
(2017 and earlier does not have
SOAP data)all_data$SOAP Reapply Applicant
(2017 and earlier does
not have SOAP data)all_data$SOAP Match Status
(2017 and earlier does not
have SOAP data)all_data$SOAP Reapply Track Applied
(2017 and earlier
does not have SOAP data)all_data$Track Applied by Applicant
- Prelim or
categorical.all_data$Gold Humanism Honor Society
all_data$Medical School of Graduation
all_data$Medical School Type
all_data$Misdemeanor Conviction
all_data$Felony Conviction
- No one reporting a history
of felonies.all_data$Malpractice_Cases_Pending
- One person with a
pending malpractice case.all_data$Location
- Location where the data came
from.Univariate data analysis does not deal with causes or relationships since it only looks at one variable. The purpose is to find trends within the data using histograms and box plots.
# Use the plot_bar function from the DataExplorer package to create a bar plot of all numeric variables in the dataframe
DataExplorer::plot_bar(d, theme_config = list("text" = element_text(size = 10)), nrow = 3, ncol = 3)
Bivariate visualizations and summary statistics assess the
relationship between each variable and the target variable
Match_Status
. To decide which predictors we should include,
we can visualize the relationships between the predictors with the
outcome variable. Boxplot to see if the distribution of the continuous
predictor is different across the two classes, i.e., if it is different,
it means the predictor could help separate the two classes. Here we can
see that Age
could do so.
DataExplorer::plot_bar(d, by = "Match_Status", theme_config = list("text" = element_text(size = 10)), nrow = 3L, ncol = 3L)
#Transforming the data to long format can sometimes make the plotting syntax a little easier. This allows for a simpler plot creation and means that we can make use of the facetting capabilites of ggplot2; facet_wrap in this example.
data_long %>%
ggplot(aes(Match_Status, value, color = Match_Status)) +
geom_boxplot(alpha = 0.5, outlier.size = 2, notch = FALSE) +
geom_jitter(width = 0.1, height = 0, alpha = 0.5) +
facet_wrap(~ key, scales='free_y', ncol = 3) +
labs(x = NULL, y = NULL) +
theme_classic()