Tyler M. Muffly, MD (Denver Health) and Rich Amini, MD (University of Arizona)
Objective: We sought to construct and validate a model that predicts a medical student’s chances of matching into an emergency medicine residency.
Package installation, data download from Dropbox.com, and functions written for this project are all loaded in a separate “Additional_functions_nomogram.R” file.
## [1] "/Users/tylermuffly/Dropbox (Personal)/Amini_study"
The training set contains a known outcome (all_data$Match_Status), and the model learns from these data so that it can generalize to data it has not seen. The model then predicts the outcome for the held-out test data, which makes it possible to estimate its prediction accuracy.
Overfitting is one of the biggest challenges in machine learning. You do not want a model that fits the training data, where it already knows the answer, perfectly but performs poorly on data where it does not know the answer. The steps below are designed to prevent that over-fitting.
knitr::include_graphics(here::here("images", "overfitting.jpg"))
Overfitting means that the model has been trained “too well”: it learns the noise present in the training data as if it were a reliable pattern. Overfitting hurts the model’s ability to perform well on unseen data, which is known as generalisation. Two well-known strategies to overcome overfitting are the train/validation split and cross-validation.
knitr::include_graphics(here::here("images", "Train-Test-Split-Diagram.jpg"))
The diagram above shows the train/test data split.
I will call the training sample train and the test sample test. Creative! There are 1997 medical students in the training data set and 353 in the test data set. The two samples should have the same proportion of students who matched and did not match.
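A rough sketch of how a stratified split like this could be produced is shown below; the 85/15 proportion, the seed, and the use of caret::createDataPartition are assumptions for illustration, not the project's actual split code.
set.seed(1978)  # seed value is illustrative
# caret::createDataPartition() samples within each level of Match_Status,
# so both samples keep roughly the same matched/unmatched ratio.
train_index <- caret::createDataPartition(all_data$Match_Status, p = 0.85, list = FALSE)
train <- all_data[train_index, ]
test  <- all_data[-train_index, ]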
train data characteristics are all reported with medians and IQR.
### tableby labels
#mylabels <- list(white_non_white = "Race", Age = "Age, years", Gender = "Sex", Couples_Match = "Participating in the Couples Match", US_or_Canadian_Applicant = "US or Canadian Applicant", Medical_Education_Interrupted = "Medical Education Process was Interrupted", Alpha_Omega_Alpha = "Alpha Omega Alpha", Military_Service_Obligation = "Military Service Obligation", USMLE_Step_1_Score = "USMLE Step 1 Score", Military_Service_Obligation = "Military Service Obligations", Count_of_Poster_Presentation = "Count of Poster Presentations", Count_of_Oral_Presentation = "Count of Oral Presentations", Count_of_Articles_Abstracts = "Count of Published Abstracts", Count_of_Peer_Reviewed_Book_Chapter = "Count of Peer Reviewed Book Chapters", Count_of_Other_than_Published = "Count of Other Published Products", Count_of_Online_Publications = "Count of Online Publications", Visa_Sponsorship_Needed = "Visa Sponsorship is Needed", Medical_Degree = "Medical Degree Training")
# Build a descriptive table (medians/IQR for numeric variables, counts/percents for categorical variables) comparing the groups defined by the by variable.
tm_arsenal_table = function(df, by){
print("Function Sanity Check: Create Arsenal Table using arsenal package")
table_variable_within_function <- arsenal::tableby(by ~ .,
data=df, control = arsenal::tableby.control(test = TRUE,
total = F,
digits = 1L,
digits.p = 2L,
digits.count = 0L,
numeric.simplify = F,
numeric.stats =
c("median",
"q1q3"),
cat.stats =
c("Nmiss",
"countpct"),
stats.labels = list(Nmiss = "N Missing",
Nmiss2 ="N Missing",
meansd = "Mean (SD)",
medianrange = "Median (Range)",
median ="Median",
medianq1q3 = "Median (Q1, Q3)",
q1q3 = "Q1, Q3",
iqr = "IQR",
range = "Range",
countpct = "Count (Pct)",
Nevents = "Events",
medSurv ="Median Survival",
medTime = "Median Follow-Up")))
final <- summary(table_variable_within_function,
text=T,
title = 'Table: Applicant Descriptive Variables by Matched or Did Not Match from 2015 to 2018',
#labelTranslations = mylabels, #Seen in additional functions file #TURNED OFF TEMPORARILY
pfootnote=TRUE)
return(final)
}
train_table_characteristics <- tm_arsenal_table(
df=train,
by=train$Match_Status)
[1] "Function Sanity Check: Create Arsenal Table using arsenal package"
train_table_characteristics
 | N = Did Not Match (N=135) | Y = Matched (N=1862) | p value |
---|---|---|---|
STAR_ID | < 0.01 (1) | ||
- Median | 2019040274.0 | 2020040226.5 | |
- Q1, Q3 | 2018040302.0, 2020040377.0 | 2019040108.8, 2021040138.8 | |
Survey_Year | < 0.01 (1) | ||
- Median | 2019.0 | 2020.0 | |
- Q1, Q3 | 2018.0, 2020.0 | 2019.0, 2021.0 | |
Applied_Total | 0.76 (1) | ||
- Median | 34.0 | 39.0 | |
- Q1, Q3 | 15.0, 59.0 | 29.0, 52.0 | |
Interview_Offer_Total | < 0.01 (1) | ||
- Median | 7.0 | 21.0 | |
- Q1, Q3 | 3.0, 16.5 | 13.0, 32.0 | |
Match_Status | < 0.01 (2) | ||
- N | 135 (100.0%) | 0 (0.0%) | |
- Y | 0 (0.0%) | 1862 (100.0%) | |
Home_State | 0.04 (2) | ||
- AL | 0 (0.0%) | 7 (0.4%) | |
- AR | 0 (0.0%) | 2 (0.1%) | |
- AZ | 1 (0.7%) | 39 (2.1%) | |
- CA | 3 (2.2%) | 55 (3.0%) | |
- CT | 3 (2.2%) | 29 (1.6%) | |
- DC | 4 (3.0%) | 51 (2.7%) | |
- FL | 12 (8.9%) | 112 (6.0%) | |
- GA | 1 (0.7%) | 24 (1.3%) | |
- IA | 0 (0.0%) | 16 (0.9%) | |
- ID | 0 (0.0%) | 29 (1.6%) | |
- IL | 16 (11.9%) | 82 (4.4%) | |
- KY | 2 (1.5%) | 47 (2.5%) | |
- LA | 11 (8.1%) | 50 (2.7%) | |
- MA | 2 (1.5%) | 49 (2.6%) | |
- MD | 0 (0.0%) | 2 (0.1%) | |
- MI | 6 (4.4%) | 79 (4.2%) | |
- MN | 3 (2.2%) | 41 (2.2%) | |
- MO | 1 (0.7%) | 15 (0.8%) | |
- MS | 0 (0.0%) | 14 (0.8%) | |
- NC | 4 (3.0%) | 67 (3.6%) | |
- ND | 1 (0.7%) | 1 (0.1%) | |
- NE | 1 (0.7%) | 14 (0.8%) | |
- NJ | 0 (0.0%) | 35 (1.9%) | |
- NM | 0 (0.0%) | 6 (0.3%) | |
- NV | 0 (0.0%) | 5 (0.3%) | |
- NY | 10 (7.4%) | 181 (9.7%) | |
- OH | 6 (4.4%) | 89 (4.8%) | |
- OK | 1 (0.7%) | 16 (0.9%) | |
- PA | 8 (5.9%) | 141 (7.6%) | |
- SC | 6 (4.4%) | 66 (3.5%) | |
- SD | 0 (0.0%) | 4 (0.2%) | |
- TN | 4 (3.0%) | 41 (2.2%) | |
- TX | 23 (17.0%) | 281 (15.1%) | |
- VA | 2 (1.5%) | 71 (3.8%) | |
- WA | 2 (1.5%) | 44 (2.4%) | |
- WI | 1 (0.7%) | 48 (2.6%) | |
- WV | 1 (0.7%) | 9 (0.5%) | |
Step_1_Score | < 0.01 (1) | ||
- Median | 227.0 | 237.0 | |
- Q1, Q3 | 217.0, 242.0 | 222.0, 247.0 | |
Cumulative_Quartile | < 0.01 (2) | ||
- 1st | 26 (19.3%) | 433 (23.3%) | |
- 2nd | 28 (20.7%) | 373 (20.0%) | |
- 3rd | 21 (15.6%) | 287 (15.4%) | |
- 4th | 25 (18.5%) | 148 (7.9%) | |
- Unknown | 35 (25.9%) | 621 (33.4%) | |
Quartile_Rank | 0.83 (1) | ||
- Median | 2.0 | 2.0 | |
- Q1, Q3 | 0.0, 3.0 | 0.0, 3.0 | |
number_Honored_Clerkships | < 0.01 (1) | ||
- Median | 2.0 | 3.0 | |
- Q1, Q3 | 0.0, 4.0 | 1.0, 5.0 | |
Honors_A_This_Specialty | < 0.01 (2) | ||
- No | 82 (60.7%) | 840 (45.1%) | |
- Yes | 53 (39.3%) | 1022 (54.9%) | |
AOA_Sigma | 0.12 (2) | ||
- No | 115 (85.2%) | 1446 (77.7%) | |
- No School Chapter | 5 (3.7%) | 107 (5.7%) | |
- Yes | 15 (11.1%) | 309 (16.6%) | |
GHHS | 0.45 (2) | ||
- No | 113 (83.7%) | 1475 (79.2%) | |
- No School Chapter | 4 (3.0%) | 78 (4.2%) | |
- Yes | 18 (13.3%) | 309 (16.6%) | |
Couples_Match | 0.12 (2) | ||
- No | 131 (97.0%) | 1745 (93.7%) | |
- Yes | 4 (3.0%) | 117 (6.3%) | |
Other_Degrees | 0.06 (2) | ||
- MBA | 4 (3.0%) | 26 (1.4%) | |
- MDiv | 0 (0.0%) | 4 (0.2%) | |
- MEd | 1 (0.7%) | 12 (0.6%) | |
- MPH | 4 (3.0%) | 83 (4.5%) | |
- MSc | 11 (8.1%) | 117 (6.3%) | |
- No additional degree | 98 (72.6%) | 1502 (80.7%) | |
- Other | 16 (11.9%) | 102 (5.5%) | |
- PhD | 1 (0.7%) | 16 (0.9%) | |
number_Research_Experiences | 0.17 (1) | ||
- Median | 3.0 | 3.0 | |
- Q1, Q3 | 1.0, 4.0 | 2.0, 4.0 | |
number_Abstracts_Pres_Posters | 0.06 (1) | ||
- Median | 2.0 | 2.0 | |
- Q1, Q3 | 0.0, 4.0 | 1.0, 4.0 | |
number_Peer_Rev_Publications | 0.55 (1) | ||
- Median | 1.0 | 1.0 | |
- Q1, Q3 | 0.0, 2.0 | 0.0, 2.0 | |
number_Volunteer_Experiences | < 0.01 (1) | ||
- Median | 5.0 | 7.0 | |
- Q1, Q3 | 4.0, 9.0 | 4.0, 10.0 | |
number_Leadership_Positions | 0.22 (1) | ||
- Median | 3.0 | 4.0 | |
- Q1, Q3 | 2.0, 5.0 | 2.0, 5.0 | |
number_Programs_Applied | 0.04 (1) | ||
- Median | 48.0 | 45.0 | |
- Q1, Q3 | 33.5, 69.0 | 35.0, 60.0 | |
number_Interviews_Attended | < 0.01 (1) | ||
- Median | 9.0 | 13.0 | |
- Q1, Q3 | 5.0, 14.0 | 11.0, 16.0 |
#
# tm_write2word(train_table_characteristics, "train_table_characteristics")
# tm_write2pdf(train_table_characteristics, "train_table_characteristics")
test data characteristics are all reported with medians and IQR.
test_table_characteristics <- tm_arsenal_table(df = test, by = test$Match_Status)
[1] "Function Sanity Check: Create Arsenal Table using arsenal package"
test_table_characteristics
 | N = Did Not Match (N=32) | Y = Matched (N=321) | p value |
---|---|---|---|
STAR_ID | < 0.01 (1) | ||
- Median | 2019040201.5 | 2020040083.0 | |
- Q1, Q3 | 2018040185.2, 2020040258.2 | 2019040116.0, 2021040136.0 | |
Survey_Year | < 0.01 (1) | ||
- Median | 2019.0 | 2020.0 | |
- Q1, Q3 | 2018.0, 2020.0 | 2019.0, 2021.0 | |
Applied_Total | 0.23 (1) | ||
- Median | 35.5 | 40.0 | |
- Q1, Q3 | 21.8, 50.5 | 30.0, 51.0 | |
Interview_Offer_Total | < 0.01 (1) | ||
- Median | 10.5 | 23.0 | |
- Q1, Q3 | 4.0, 20.5 | 15.0, 33.0 | |
Match_Status | < 0.01 (2) | ||
- N | 32 (100.0%) | 0 (0.0%) | |
- Y | 0 (0.0%) | 321 (100.0%) | |
Home_State | 0.02 (2) | ||
- AL | 0 (0.0%) | 2 (0.6%) | |
- AR | 0 (0.0%) | 1 (0.3%) | |
- AZ | 0 (0.0%) | 5 (1.6%) | |
- CA | 0 (0.0%) | 7 (2.2%) | |
- CT | 1 (3.1%) | 4 (1.2%) | |
- DC | 4 (12.5%) | 6 (1.9%) | |
- FL | 3 (9.4%) | 15 (4.7%) | |
- GA | 0 (0.0%) | 2 (0.6%) | |
- IA | 0 (0.0%) | 4 (1.2%) | |
- ID | 0 (0.0%) | 1 (0.3%) | |
- IL | 4 (12.5%) | 13 (4.0%) | |
- KY | 2 (6.2%) | 8 (2.5%) | |
- LA | 0 (0.0%) | 14 (4.4%) | |
- MA | 0 (0.0%) | 18 (5.6%) | |
- MI | 1 (3.1%) | 14 (4.4%) | |
- MN | 4 (12.5%) | 5 (1.6%) | |
- MO | 0 (0.0%) | 2 (0.6%) | |
- MS | 0 (0.0%) | 1 (0.3%) | |
- NC | 0 (0.0%) | 14 (4.4%) | |
- NE | 0 (0.0%) | 4 (1.2%) | |
- NJ | 1 (3.1%) | 6 (1.9%) | |
- NV | 1 (3.1%) | 4 (1.2%) | |
- NY | 2 (6.2%) | 36 (11.2%) | |
- OH | 0 (0.0%) | 17 (5.3%) | |
- OK | 0 (0.0%) | 4 (1.2%) | |
- PA | 0 (0.0%) | 26 (8.1%) | |
- SC | 0 (0.0%) | 10 (3.1%) | |
- TN | 0 (0.0%) | 7 (2.2%) | |
- TX | 6 (18.8%) | 46 (14.3%) | |
- VA | 1 (3.1%) | 11 (3.4%) | |
- WA | 1 (3.1%) | 9 (2.8%) | |
- WI | 1 (3.1%) | 5 (1.6%) | |
Step_1_Score | 0.01 (1) | ||
- Median | 229.5 | 237.0 | |
- Q1, Q3 | 215.8, 242.0 | 227.0, 242.0 | |
Cumulative_Quartile | 0.14 (2) | ||
- 1st | 2 (6.2%) | 78 (24.3%) | |
- 2nd | 9 (28.1%) | 74 (23.1%) | |
- 3rd | 8 (25.0%) | 46 (14.3%) | |
- 4th | 2 (6.2%) | 26 (8.1%) | |
- Unknown | 11 (34.4%) | 97 (30.2%) | |
Quartile_Rank | 0.20 (1) | ||
- Median | 2.0 | 2.0 | |
- Q1, Q3 | 0.0, 3.0 | 0.0, 3.0 | |
number_Honored_Clerkships | 0.19 (1) | ||
- Median | 2.0 | 3.0 | |
- Q1, Q3 | 1.0, 4.0 | 2.0, 5.0 | |
Honors_A_This_Specialty | < 0.01 (2) | ||
- No | 21 (65.6%) | 128 (39.9%) | |
- Yes | 11 (34.4%) | 193 (60.1%) | |
AOA_Sigma | 0.38 (2) | ||
- No | 28 (87.5%) | 251 (78.2%) | |
- No School Chapter | 2 (6.2%) | 21 (6.5%) | |
- Yes | 2 (6.2%) | 49 (15.3%) | |
GHHS | 0.45 (2) | ||
- No | 27 (84.4%) | 239 (74.5%) | |
- No School Chapter | 1 (3.1%) | 21 (6.5%) | |
- Yes | 4 (12.5%) | 61 (19.0%) | |
Couples_Match | 0.27 (2) | ||
- No | 31 (96.9%) | 293 (91.3%) | |
- Yes | 1 (3.1%) | 28 (8.7%) | |
Other_Degrees | 0.65 (2) | ||
- MBA | 0 (0.0%) | 5 (1.6%) | |
- MEd | 0 (0.0%) | 1 (0.3%) | |
- MPH | 1 (3.1%) | 14 (4.4%) | |
- MSc | 4 (12.5%) | 21 (6.5%) | |
- No additional degree | 27 (84.4%) | 260 (81.0%) | |
- Other | 0 (0.0%) | 15 (4.7%) | |
- PhD | 0 (0.0%) | 5 (1.6%) | |
number_Research_Experiences | 0.99 (1) | ||
- Median | 2.5 | 3.0 | |
- Q1, Q3 | 1.0, 4.0 | 2.0, 4.0 | |
number_Abstracts_Pres_Posters | 0.93 (1) | ||
- Median | 2.0 | 2.0 | |
- Q1, Q3 | 1.0, 4.0 | 1.0, 4.0 | |
number_Peer_Rev_Publications | 0.87 (1) | ||
- Median | 0.0 | 1.0 | |
- Q1, Q3 | 0.0, 2.0 | 0.0, 2.0 | |
number_Volunteer_Experiences | 0.49 (1) | ||
- Median | 6.0 | 7.0 | |
- Q1, Q3 | 4.8, 9.5 | 5.0, 10.0 | |
number_Leadership_Positions | 0.48 (1) | ||
- Median | 4.0 | 4.0 | |
- Q1, Q3 | 2.0, 6.2 | 2.0, 5.0 | |
number_Programs_Applied | < 0.01 (1) | ||
- Median | 56.5 | 45.0 | |
- Q1, Q3 | 38.5, 90.2 | 35.0, 60.0 | |
number_Interviews_Attended | < 0.01 (1) | ||
- Median | 10.5 | 14.0 | |
- Q1, Q3 | 6.8, 14.0 | 12.0, 16.0 |
# tm_write2word(test_table_characteristics, "test_table_characteristics")
# tm_write2pdf(test_table_characteristics, "test_table_characteristics")
Response distribution in the train and test samples, followed by the distribution in the original data set; the response ratio should be consistent across all three.
Data set | N (Did Not Match) | Y (Matched) |
---|---|---|
train | 0.0676014 | 0.9323986 |
test | 0.09065156 | 0.90934844 |
all_data | 0.0710638 | 0.9289362 |
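A sketch of how these proportions could be reproduced (the chunk that generated the output above is not shown):
prop.table(table(train$Match_Status))   # response ratio in the training sample
prop.table(table(test$Match_Status))    # response ratio in the test sample
as.data.frame(prop.table(table(all_data$Match_Status)))  # overall distribution (Var1/Freq columns)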
knitr::include_graphics(here::here("images/skewness---mean-median-mode.jpg"))
Using the training sample, we will provide numerical summaries of each predictor variable and the outcome, as well as graphical summaries of the outcome variable. The results should now show no missing values in any variable. We also need to determine whether there are any evident problems, such as substantial skew in any of the variables.
knitr::kable(skewed_feature_names, caption="List of skewed variables for all_data", format="html")
Skewed variable |
---|
Interview_Offer_Total |
Applied_Total |
number_Peer_Rev_Publications |
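One way such a list of skewed features could be generated is sketched below; the e1071::skewness() function and the cutoff of |skewness| > 1 are illustrative assumptions, not necessarily what this project used.
# Illustrative sketch: flag numeric predictors with substantial skew.
numeric_vars <- all_data[sapply(all_data, is.numeric)]
skew_values <- sapply(numeric_vars, e1071::skewness, na.rm = TRUE)
skewed_feature_names <- names(skew_values[abs(skew_values) > 1])  # cutoff of 1 is illustrative
skewed_feature_names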
Get a correlation matrix for elementary feature selection and remove highly correlated features.
knitr::include_graphics(here::here("images/correlation1.jpg"))
knitr::include_graphics(here::here("images/iu.jpg"))
Correlation values range from -1 to 1: a value of 0 means there is no linear correlation between the variables, 1 is perfect positive correlation, and -1 is perfect negative correlation.
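For example, two numeric predictors from the tables above can be compared directly (this variable pair is chosen only for illustration):
# Pearson correlation between number of applications and number of interview offers in the training data.
# A value near 1 means the two rise together; a value near 0 means little linear association.
stats::cor(train$Applied_Total, train$Interview_Offer_Total, use = "complete.obs")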
This is a great interactive plot that shows which variables are positively and negatively correlated with matching into emergency medicine. The closer the dots are to the zero line, the weaker the correlation with matching.
The chi-squared test is a statistical test used to assess whether there is an association between two categorical variables.
tm_chi_square_test <- function (variable) {
print("Function Sanity Test: chi-square test")
chisq <- stats::chisq.test(x = variable,
y = train$Match_Status,
correct = FALSE)
return(chisq)
}
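For example, the helper can be applied to a categorical predictor such as Couples_Match (chosen here only for illustration):
# Illustrative call: association between couples matching and Match_Status in the training data.
# chisq.test() returns the X-squared statistic, degrees of freedom, and p-value.
tm_chi_square_test(train$Couples_Match)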
# In this correlation plot we want to look for the bright, large circles which immediately show the strong correlations
# (size and shading depends on the absolute values of the coefficients; color depends on direction).
# This shows whether two features are connected so that one changes with a predictable trend if you change the other. The closer this coefficient is to zero the weaker is the correlation. Anything that you would have to squint to see is usually not worth seeing!
tm_print_save <- function (filename) {
print("Function Sanity Check: Saving TIFF of what is in the viewer")
dev.print(tiff, (here::here("results", filename)), compression = "lzw",width=2000, height=2000, bg="transparent", res = 200, units = "px" )
dev.off()
}
# Numeric variables only
# cnum <- train[,c("Age", "Count_of_Oral_Presentation",
# "Count_of_Peer_Reviewed_Book_Chapter",
# "Count_of_Peer_Reviewed_Journal_Articles_Abstracts",
# "Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published",
# "Count_of_Poster_Presentation")]
# cormat <- stats::cor(cnum) # Select only numeric variables that are significant on univariate testing.
# pairs <- graphics::pairs(cnum)
# pairs
# #tm_print_save("significant_numeric_variable_correlation.tiff")
# #
# #corrplot::corrplot(cormat, method="circle")
# #corrplot::corrplot(cormat, method="circle", addCoef.col="black") # With correlation
# # tm_print_save("corrplot.tiff")
# rm(cnum)
# rm(cormat)
tm_ggsave <- function (object, filename, ...){ #make sure the file name has quotation marks around it.
print("Function Sanity Check: Saving a ggplot image as a TIFF")
ggplot2::ggsave(here::here("results", filename), object, device = "tiff", width = 12, height = 10, dpi = 200)
}
train <- train %>% select(-STAR_ID)
inspect_cor_plot <- inspectdf::inspect_cor(train, method = "pearson", alpha = 0.05) %>% inspectdf::show_plot()
inspect_cor_plot
tm_ggsave(inspect_cor_plot, "inspect_cor_plot.tiff")
## [1] "Function Sanity Check: Saving a ggplot image as a TIFF"