Tyler M. Muffly, MD (Denver Health) and Rich Amini, MD (University of Arizona)
Objective: We sought to construct and validate a model that predicts a medical student’s chances of matching into an emergency medicine residency.
Package installation, data download from Dropbox.com, and functions written for this project are all loaded in a separate “Additional_functions_nomogram.R” file.
## [1] "/Users/tylermuffly/Dropbox (Personal)/Amini_study"
The training set contains a known outcome (all_data$Match_Status), and the model learns from these data so that it can generalize to data it has not seen. The model then predicts the outcome for the held-out test data, which makes it possible to estimate its prediction accuracy.
Overfitting is one of the biggest challenges in machine learning. You do not want a model that fits the training data, where it already knows the answer, perfectly but performs poorly on data where it does not know the answer. The steps below are designed to prevent that over-fitting.
knitr::include_graphics(here::here("images", "overfitting.jpg"))
Overfitting means that the model has been trained “too well”: it learns the noise present in the training data as if it were a reliable pattern. Overfitting hurts the model’s ability to perform well on unseen data, which is known as generalisation. Two well-known strategies to overcome overfitting are the train/validation split and cross-validation.
knitr::include_graphics(here::here("images", "Train-Test-Split-Diagram.jpg"))
The diagram above shows the train/test data split.
I will call the training sample train and the test sample test. Creative! There are 1997 medical students in the training data set and 353 in the test data set. The two samples should have the same proportion of students who matched and did not match.
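A rough sketch of how a stratified split like this could be produced is shown below; the 85/15 proportion, the seed, and the use of caret::createDataPartition are assumptions for illustration, not the project's actual split code.
set.seed(1978)  # seed value is illustrative
# caret::createDataPartition() samples within each level of Match_Status,
# so both samples keep roughly the same matched/unmatched ratio.
train_index <- caret::createDataPartition(all_data$Match_Status, p = 0.85, list = FALSE)
train <- all_data[train_index, ]
test  <- all_data[-train_index, ]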
train data characteristics are all reported with medians and IQR.
### tableby labels
#mylabels <- list(white_non_white = "Race", Age = "Age, years", Gender = "Sex", Couples_Match = "Participating in the Couples Match", US_or_Canadian_Applicant = "US or Canadian Applicant", Medical_Education_Interrupted = "Medical Education Process was Interrupted", Alpha_Omega_Alpha = "Alpha Omega Alpha", Military_Service_Obligation = "Military Service Obligation", USMLE_Step_1_Score = "USMLE Step 1 Score", Military_Service_Obligation = "Military Service Obligations", Count_of_Poster_Presentation = "Count of Poster Presentations", Count_of_Oral_Presentation = "Count of Oral Presentations", Count_of_Articles_Abstracts = "Count of Published Abstracts", Count_of_Peer_Reviewed_Book_Chapter = "Count of Peer Reviewed Book Chapters", Count_of_Other_than_Published = "Count of Other Published Products", Count_of_Online_Publications = "Count of Online Publications", Visa_Sponsorship_Needed = "Visa Sponsorship is Needed", Medical_Degree = "Medical Degree Training")
# Build a descriptive table (medians/IQR for numeric variables, counts/percents for categorical variables) comparing the groups defined by the by variable.
tm_arsenal_table = function(df, by){
print("Function Sanity Check: Create Arsenal Table using arsenal package")
table_variable_within_function <- arsenal::tableby(by ~ .,
data=df, control = arsenal::tableby.control(test = TRUE,
total = F,
digits = 1L,
digits.p = 2L,
digits.count = 0L,
numeric.simplify = F,
numeric.stats =
c("median",
"q1q3"),
cat.stats =
c("Nmiss",
"countpct"),
stats.labels = list(Nmiss = "N Missing",
Nmiss2 ="N Missing",
meansd = "Mean (SD)",
medianrange = "Median (Range)",
median ="Median",
medianq1q3 = "Median (Q1, Q3)",
q1q3 = "Q1, Q3",
iqr = "IQR",
range = "Range",
countpct = "Count (Pct)",
Nevents = "Events",
medSurv ="Median Survival",
medTime = "Median Follow-Up")))
final <- summary(table_variable_within_function,
text=T,
title = 'Table: Applicant Descriptive Variables by Matched or Did Not Match from 2015 to 2018',
#labelTranslations = mylabels, #Seen in additional functions file #TURNED OFF TEMPORARILY
pfootnote=TRUE)
return(final)
}
train_table_characteristics <- tm_arsenal_table(
df=train,
by=train$Match_Status)
[1] "Function Sanity Check: Create Arsenal Table using arsenal package"
train_table_characteristics
 | N = Did Not Match (N=135) | Y = Matched (N=1862) | p value |
---|---|---|---|
STAR_ID | < 0.01 (1) | ||
- Median | 2019040274.0 | 2020040226.5 | |
- Q1, Q3 | 2018040302.0, 2020040377.0 | 2019040108.8, 2021040138.8 | |
Survey_Year | < 0.01 (1) | ||
- Median | 2019.0 | 2020.0 | |
- Q1, Q3 | 2018.0, 2020.0 | 2019.0, 2021.0 | |
Applied_Total | 0.76 (1) | ||
- Median | 34.0 | 39.0 | |
- Q1, Q3 | 15.0, 59.0 | 29.0, 52.0 | |
Interview_Offer_Total | < 0.01 (1) | ||
- Median | 7.0 | 21.0 | |
- Q1, Q3 | 3.0, 16.5 | 13.0, 32.0 | |
Match_Status | < 0.01 (2) | ||
- N | 135 (100.0%) | 0 (0.0%) | |
- Y | 0 (0.0%) | 1862 (100.0%) | |
Home_State | 0.04 (2) | ||
- AL | 0 (0.0%) | 7 (0.4%) | |
- AR | 0 (0.0%) | 2 (0.1%) | |
- AZ | 1 (0.7%) | 39 (2.1%) | |
- CA | 3 (2.2%) | 55 (3.0%) | |
- CT | 3 (2.2%) | 29 (1.6%) | |
- DC | 4 (3.0%) | 51 (2.7%) | |
- FL | 12 (8.9%) | 112 (6.0%) | |
- GA | 1 (0.7%) | 24 (1.3%) | |
- IA | 0 (0.0%) | 16 (0.9%) | |
- ID | 0 (0.0%) | 29 (1.6%) | |
- IL | 16 (11.9%) | 82 (4.4%) | |
- KY | 2 (1.5%) | 47 (2.5%) | |
- LA | 11 (8.1%) | 50 (2.7%) | |
- MA | 2 (1.5%) | 49 (2.6%) | |
- MD | 0 (0.0%) | 2 (0.1%) | |
- MI | 6 (4.4%) | 79 (4.2%) | |
- MN | 3 (2.2%) | 41 (2.2%) | |
- MO | 1 (0.7%) | 15 (0.8%) | |
- MS | 0 (0.0%) | 14 (0.8%) | |
- NC | 4 (3.0%) | 67 (3.6%) | |
- ND | 1 (0.7%) | 1 (0.1%) | |
- NE | 1 (0.7%) | 14 (0.8%) | |
- NJ | 0 (0.0%) | 35 (1.9%) | |
- NM | 0 (0.0%) | 6 (0.3%) | |
- NV | 0 (0.0%) | 5 (0.3%) | |
- NY | 10 (7.4%) | 181 (9.7%) | |
- OH | 6 (4.4%) | 89 (4.8%) | |
- OK | 1 (0.7%) | 16 (0.9%) | |
- PA | 8 (5.9%) | 141 (7.6%) | |
- SC | 6 (4.4%) | 66 (3.5%) | |
- SD | 0 (0.0%) | 4 (0.2%) | |
- TN | 4 (3.0%) | 41 (2.2%) | |
- TX | 23 (17.0%) | 281 (15.1%) | |
- VA | 2 (1.5%) | 71 (3.8%) | |
- WA | 2 (1.5%) | 44 (2.4%) | |
- WI | 1 (0.7%) | 48 (2.6%) | |
- WV | 1 (0.7%) | 9 (0.5%) | |
Step_1_Score | < 0.01 (1) | ||
- Median | 227.0 | 237.0 | |
- Q1, Q3 | 217.0, 242.0 | 222.0, 247.0 | |
Cumulative_Quartile | < 0.01 (2) | ||
- 1st | 26 (19.3%) | 433 (23.3%) | |
- 2nd | 28 (20.7%) | 373 (20.0%) | |
- 3rd | 21 (15.6%) | 287 (15.4%) | |
- 4th | 25 (18.5%) | 148 (7.9%) | |
- Unknown | 35 (25.9%) | 621 (33.4%) | |
Quartile_Rank | 0.83 (1) | ||
- Median | 2.0 | 2.0 | |
- Q1, Q3 | 0.0, 3.0 | 0.0, 3.0 | |
number_Honored_Clerkships | < 0.01 (1) | ||
- Median | 2.0 | 3.0 | |
- Q1, Q3 | 0.0, 4.0 | 1.0, 5.0 | |
Honors_A_This_Specialty | < 0.01 (2) | ||
- No | 82 (60.7%) | 840 (45.1%) | |
- Yes | 53 (39.3%) | 1022 (54.9%) | |
AOA_Sigma | 0.12 (2) | ||
- No | 115 (85.2%) | 1446 (77.7%) | |
- No School Chapter | 5 (3.7%) | 107 (5.7%) | |
- Yes | 15 (11.1%) | 309 (16.6%) | |
GHHS | 0.45 (2) | ||
- No | 113 (83.7%) | 1475 (79.2%) | |
- No School Chapter | 4 (3.0%) | 78 (4.2%) | |
- Yes | 18 (13.3%) | 309 (16.6%) | |
Couples_Match | 0.12 (2) | ||
- No | 131 (97.0%) | 1745 (93.7%) | |
- Yes | 4 (3.0%) | 117 (6.3%) | |
Other_Degrees | 0.06 (2) | ||
- MBA | 4 (3.0%) | 26 (1.4%) | |
- MDiv | 0 (0.0%) | 4 (0.2%) | |
- MEd | 1 (0.7%) | 12 (0.6%) | |
- MPH | 4 (3.0%) | 83 (4.5%) | |
- MSc | 11 (8.1%) | 117 (6.3%) | |
- No additional degree | 98 (72.6%) | 1502 (80.7%) | |
- Other | 16 (11.9%) | 102 (5.5%) | |
- PhD | 1 (0.7%) | 16 (0.9%) | |
number_Research_Experiences | 0.17 (1) | ||
- Median | 3.0 | 3.0 | |
- Q1, Q3 | 1.0, 4.0 | 2.0, 4.0 | |
number_Abstracts_Pres_Posters | 0.06 (1) | ||
- Median | 2.0 | 2.0 | |
- Q1, Q3 | 0.0, 4.0 | 1.0, 4.0 | |
number_Peer_Rev_Publications | 0.55 (1) | ||
- Median | 1.0 | 1.0 | |
- Q1, Q3 | 0.0, 2.0 | 0.0, 2.0 | |
number_Volunteer_Experiences | < 0.01 (1) | ||
- Median | 5.0 | 7.0 | |
- Q1, Q3 | 4.0, 9.0 | 4.0, 10.0 | |
number_Leadership_Positions | 0.22 (1) | ||
- Median | 3.0 | 4.0 | |
- Q1, Q3 | 2.0, 5.0 | 2.0, 5.0 | |
number_Programs_Applied | 0.04 (1) | ||
- Median | 48.0 | 45.0 | |
- Q1, Q3 | 33.5, 69.0 | 35.0, 60.0 | |
number_Interviews_Attended | < 0.01 (1) | ||
- Median | 9.0 | 13.0 | |
- Q1, Q3 | 5.0, 14.0 | 11.0, 16.0 |
#
# tm_write2word(train_table_characteristics, "train_table_characteristics")
# tm_write2pdf(train_table_characteristics, "train_table_characteristics")
test data characteristics are all reported with medians and IQR.
test_table_characteristics <- tm_arsenal_table(df = test, by = test$Match_Status)
[1] "Function Sanity Check: Create Arsenal Table using arsenal package"
test_table_characteristics
 | N = Did Not Match (N=32) | Y = Matched (N=321) | p value |
---|---|---|---|
STAR_ID | < 0.01 (1) | ||
- Median | 2019040201.5 | 2020040083.0 | |
- Q1, Q3 | 2018040185.2, 2020040258.2 | 2019040116.0, 2021040136.0 | |
Survey_Year | < 0.01 (1) | ||
- Median | 2019.0 | 2020.0 | |
- Q1, Q3 | 2018.0, 2020.0 | 2019.0, 2021.0 | |
Applied_Total | 0.23 (1) | ||
- Median | 35.5 | 40.0 | |
- Q1, Q3 | 21.8, 50.5 | 30.0, 51.0 | |
Interview_Offer_Total | < 0.01 (1) | ||
- Median | 10.5 | 23.0 | |
- Q1, Q3 | 4.0, 20.5 | 15.0, 33.0 | |
Match_Status | < 0.01 (2) | ||
- N | 32 (100.0%) | 0 (0.0%) | |
- Y | 0 (0.0%) | 321 (100.0%) | |
Home_State | 0.02 (2) | ||
- AL | 0 (0.0%) | 2 (0.6%) | |
- AR | 0 (0.0%) | 1 (0.3%) | |
- AZ | 0 (0.0%) | 5 (1.6%) | |
- CA | 0 (0.0%) | 7 (2.2%) | |
- CT | 1 (3.1%) | 4 (1.2%) | |
- DC | 4 (12.5%) | 6 (1.9%) | |
- FL | 3 (9.4%) | 15 (4.7%) | |
- GA | 0 (0.0%) | 2 (0.6%) | |
- IA | 0 (0.0%) | 4 (1.2%) | |
- ID | 0 (0.0%) | 1 (0.3%) | |
- IL | 4 (12.5%) | 13 (4.0%) | |
- KY | 2 (6.2%) | 8 (2.5%) | |
- LA | 0 (0.0%) | 14 (4.4%) | |
- MA | 0 (0.0%) | 18 (5.6%) | |
- MI | 1 (3.1%) | 14 (4.4%) | |
- MN | 4 (12.5%) | 5 (1.6%) | |
- MO | 0 (0.0%) | 2 (0.6%) | |
- MS | 0 (0.0%) | 1 (0.3%) | |
- NC | 0 (0.0%) | 14 (4.4%) | |
- NE | 0 (0.0%) | 4 (1.2%) | |
- NJ | 1 (3.1%) | 6 (1.9%) | |
- NV | 1 (3.1%) | 4 (1.2%) | |
- NY | 2 (6.2%) | 36 (11.2%) | |
- OH | 0 (0.0%) | 17 (5.3%) | |
- OK | 0 (0.0%) | 4 (1.2%) | |
- PA | 0 (0.0%) | 26 (8.1%) | |
- SC | 0 (0.0%) | 10 (3.1%) | |
- TN | 0 (0.0%) | 7 (2.2%) | |
- TX | 6 (18.8%) | 46 (14.3%) | |
- VA | 1 (3.1%) | 11 (3.4%) | |
- WA | 1 (3.1%) | 9 (2.8%) | |
- WI | 1 (3.1%) | 5 (1.6%) | |
Step_1_Score | 0.01 (1) | ||
- Median | 229.5 | 237.0 | |
- Q1, Q3 | 215.8, 242.0 | 227.0, 242.0 | |
Cumulative_Quartile | 0.14 (2) | ||
- 1st | 2 (6.2%) | 78 (24.3%) | |
- 2nd | 9 (28.1%) | 74 (23.1%) | |
- 3rd | 8 (25.0%) | 46 (14.3%) | |
- 4th | 2 (6.2%) | 26 (8.1%) | |
- Unknown | 11 (34.4%) | 97 (30.2%) | |
Quartile_Rank | 0.20 (1) | ||
- Median | 2.0 | 2.0 | |
- Q1, Q3 | 0.0, 3.0 | 0.0, 3.0 | |
number_Honored_Clerkships | 0.19 (1) | ||
- Median | 2.0 | 3.0 | |
- Q1, Q3 | 1.0, 4.0 | 2.0, 5.0 | |
Honors_A_This_Specialty | < 0.01 (2) | ||
- No | 21 (65.6%) | 128 (39.9%) | |
- Yes | 11 (34.4%) | 193 (60.1%) | |
AOA_Sigma | 0.38 (2) | ||
- No | 28 (87.5%) | 251 (78.2%) | |
- No School Chapter | 2 (6.2%) | 21 (6.5%) | |
- Yes | 2 (6.2%) | 49 (15.3%) | |
GHHS | 0.45 (2) | ||
- No | 27 (84.4%) | 239 (74.5%) | |
- No School Chapter | 1 (3.1%) | 21 (6.5%) | |
- Yes | 4 (12.5%) | 61 (19.0%) | |
Couples_Match | 0.27 (2) | ||
- No | 31 (96.9%) | 293 (91.3%) | |
- Yes | 1 (3.1%) | 28 (8.7%) | |
Other_Degrees | 0.65 (2) | ||
- MBA | 0 (0.0%) | 5 (1.6%) | |
- MEd | 0 (0.0%) | 1 (0.3%) | |
- MPH | 1 (3.1%) | 14 (4.4%) | |
- MSc | 4 (12.5%) | 21 (6.5%) | |
- No additional degree | 27 (84.4%) | 260 (81.0%) | |
- Other | 0 (0.0%) | 15 (4.7%) | |
- PhD | 0 (0.0%) | 5 (1.6%) | |
number_Research_Experiences | 0.99 (1) | ||
- Median | 2.5 | 3.0 | |
- Q1, Q3 | 1.0, 4.0 | 2.0, 4.0 | |
number_Abstracts_Pres_Posters | 0.93 (1) | ||
- Median | 2.0 | 2.0 | |
- Q1, Q3 | 1.0, 4.0 | 1.0, 4.0 | |
number_Peer_Rev_Publications | 0.87 (1) | ||
- Median | 0.0 | 1.0 | |
- Q1, Q3 | 0.0, 2.0 | 0.0, 2.0 | |
number_Volunteer_Experiences | 0.49 (1) | ||
- Median | 6.0 | 7.0 | |
- Q1, Q3 | 4.8, 9.5 | 5.0, 10.0 | |
number_Leadership_Positions | 0.48 (1) | ||
- Median | 4.0 | 4.0 | |
- Q1, Q3 | 2.0, 6.2 | 2.0, 5.0 | |
number_Programs_Applied | < 0.01 (1) | ||
- Median | 56.5 | 45.0 | |
- Q1, Q3 | 38.5, 90.2 | 35.0, 60.0 | |
number_Interviews_Attended | < 0.01 (1) | ||
- Median | 10.5 | 14.0 | |
- Q1, Q3 | 6.8, 14.0 | 12.0, 16.0 |
# tm_write2word(test_table_characteristics, "test_table_characteristics")
# tm_write2pdf(test_table_characteristics, "test_table_characteristics")
Response distribution in the train and test samples, followed by the distribution in the original data set; the response ratio should be consistent across all three.
Data set | N (Did Not Match) | Y (Matched) |
---|---|---|
train | 0.0676014 | 0.9323986 |
test | 0.09065156 | 0.90934844 |
all_data | 0.0710638 | 0.9289362 |
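A sketch of how these proportions could be reproduced (the chunk that generated the output above is not shown):
prop.table(table(train$Match_Status))   # response ratio in the training sample
prop.table(table(test$Match_Status))    # response ratio in the test sample
as.data.frame(prop.table(table(all_data$Match_Status)))  # overall distribution (Var1/Freq columns)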
knitr::include_graphics(here::here("images/skewness---mean-median-mode.jpg"))
Using the training sample, we will provide numerical summaries of each predictor variable and the outcome, as well as graphical summaries of the outcome variable. The results should now show no missing values in any variable. We also need to determine whether there are any evident problems, such as substantial skew in any of the variables.
knitr::kable(skewed_feature_names, caption="List of skewed variables for all_data", format="html")
Skewed variable |
---|
Interview_Offer_Total |
Applied_Total |
number_Peer_Rev_Publications |
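One way such a list of skewed features could be generated is sketched below; the e1071::skewness() function and the cutoff of |skewness| > 1 are illustrative assumptions, not necessarily what this project used.
# Illustrative sketch: flag numeric predictors with substantial skew.
numeric_vars <- all_data[sapply(all_data, is.numeric)]
skew_values <- sapply(numeric_vars, e1071::skewness, na.rm = TRUE)
skewed_feature_names <- names(skew_values[abs(skew_values) > 1])  # cutoff of 1 is illustrative
skewed_feature_names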
Get a correlation matrix for elementary feature selection and remove highly correlated features.
knitr::include_graphics(here::here("images/correlation1.jpg"))
knitr::include_graphics(here::here("images/iu.jpg"))
Correlation values range from -1 to 1: a value of 0 means there is no linear correlation between the variables, 1 is perfect positive correlation, and -1 is perfect negative correlation.
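For example, two numeric predictors from the tables above can be compared directly (this variable pair is chosen only for illustration):
# Pearson correlation between number of applications and number of interview offers in the training data.
# A value near 1 means the two rise together; a value near 0 means little linear association.
stats::cor(train$Applied_Total, train$Interview_Offer_Total, use = "complete.obs")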
This is a great interactive plot that shows which variables are positively and negatively correlated with matching into emergency medicine. The closer the dots are to the zero line, the weaker the correlation with matching.
The chi-squared test is a statistical test used to assess whether there is an association between two categorical variables.
tm_chi_square_test <- function (variable) {
print("Function Sanity Test: chi-square test")
chisq <- stats::chisq.test(x = variable,
y = train$Match_Status,
correct = FALSE)
return(chisq)
}
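For example, the helper can be applied to a categorical predictor such as Couples_Match (chosen here only for illustration):
# Illustrative call: association between couples matching and Match_Status in the training data.
# chisq.test() returns the X-squared statistic, degrees of freedom, and p-value.
tm_chi_square_test(train$Couples_Match)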
# In this correlation plot we want to look for the bright, large circles which immediately show the strong correlations
# (size and shading depends on the absolute values of the coefficients; color depends on direction).
# This shows whether two features are connected so that one changes with a predictable trend if you change the other. The closer this coefficient is to zero the weaker is the correlation. Anything that you would have to squint to see is usually not worth seeing!
tm_print_save <- function (filename) {
print("Function Sanity Check: Saving TIFF of what is in the viewer")
dev.print(tiff, (here::here("results", filename)), compression = "lzw",width=2000, height=2000, bg="transparent", res = 200, units = "px" )
dev.off()
}
# Numeric variables only
# cnum <- train[,c("Age", "Count_of_Oral_Presentation",
# "Count_of_Peer_Reviewed_Book_Chapter",
# "Count_of_Peer_Reviewed_Journal_Articles_Abstracts",
# "Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published",
# "Count_of_Poster_Presentation")]
# cormat <- stats::cor(cnum) # Select only numeric variables that are significant on univariate testing.
# pairs <- graphics::pairs(cnum)
# pairs
# #tm_print_save("significant_numeric_variable_correlation.tiff")
# #
# #corrplot::corrplot(cormat, method="circle")
# #corrplot::corrplot(cormat, method="circle", addCoef.col="black") # With correlation
# # tm_print_save("corrplot.tiff")
# rm(cnum)
# rm(cormat)
tm_ggsave <- function (object, filename, ...){ #make sure the file name has quotation marks around it.
print("Function Sanity Check: Saving a ggplot image as a TIFF")
ggplot2::ggsave(here::here("results", filename), object, device = "tiff", width = 12, height = 10, dpi = 200)
}
train <- train %>% select(-STAR_ID)
inspect_cor_plot <- inspectdf::inspect_cor(train, method = "pearson", alpha = 0.05) %>% inspectdf::show_plot()
inspect_cor_plot
tm_ggsave(inspect_cor_plot, "inspect_cor_plot.tiff")
## [1] "Function Sanity Check: Saving a ggplot image as a TIFF"