Tyler M. Muffly, MD (Denver Health) and Rich Amini, MD (University of Arizona)

Objective: We sought to construct and validate a model that predicts a medical student’s chances of matching into an emergency medicine residency.

These data were cleaned in a separate R script with the help of exploratory.io. The project was created in R version 4.0.1 and run inside RStudio 1.2.5019. Session info is at the bottom of the script. Package installation, the data download from Dropbox.com, and the functions written for this project are all handled in a separate “Additional_functions_nomogram.R” file.

Ingest Data

all_data is a data frame of the independent and dependent variables. Each column holds one variable, and each row represents a single unique medical student. If a student applied in more than one year, the most recent data were used.
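
The "most recent application per student" rule can be sketched in base R as follows. The cleaning script is not shown in this report, so this is only an illustration of the rule, not the authors' code; the column `student_key` and all values in the toy data are hypothetical.

```r
# Toy data: `student_key` is a hypothetical identifier linking repeat applicants.
applications <- data.frame(
  student_key  = c("a", "a", "b", "c", "c"),
  Survey_Year  = c(2018, 2020, 2019, 2021, 2017),
  Step_1_Score = c(222, 232, 247, 237, 227)
)

# Sort so the newest survey year comes first within each student,
# then keep the first row seen for each student.
ordered_apps <- applications[order(applications$student_key,
                                   -applications$Survey_Year), ]
latest_apps  <- ordered_apps[!duplicated(ordered_apps$student_key), ]
latest_apps
```

Sorting before calling `duplicated()` makes the choice of retained row explicit instead of depending on the original row order.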

It makes sense to start with the outcome we want to predict, the match status of each EM applicant:

ggplot(all_data, aes(x = Match_Status)) + 
  geom_bar() +
  labs(y = "Number of Applicants")

There are far more matched applicants (2183) than unmatched applicants (167).

Near-zero and zero-variance variables

Find the near-zero-variance variables (which include any zero-variance variables) and remove them from the original data frame:

## Candidate Near Zero-Variance Variables
zero_variance_vars <- names(all_data)[caret::nearZeroVar(all_data)]  
zero_variance_vars

## [1] "Degree"                "Research_Year"         "Absence_Year"         
## [4] "Required_to_Remediate" "Pass_Attempt_Step_1"

## Remove Near-Zero-Variance Features
# any_of() tolerates names that are absent, unlike all_of(); see https://tidyselect.r-lib.org/reference/all_of.html
all_data <- all_data %>% select(!any_of(zero_variance_vars))

#Creates all_data1 with no PII present
all_data1 <- all_data
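
To make the filter above less of a black box, here is a rough base R sketch of the heuristic behind `caret::nearZeroVar()`. It assumes the documented defaults `freqCut = 95/5` and `uniqueCut = 10`; the function name `near_zero_var_sketch` is hypothetical, and caret's own implementation may differ in details.

```r
# Sketch of the near-zero-variance heuristic for a single column.
# freq_cut and unique_cut mirror caret's documented defaults (95/5 and 10).
near_zero_var_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value to the second most common value
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Percentage of distinct values relative to the number of rows
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique <= unique_cut
}

# Counts taken from the Couples_Match table in this report
couples <- rep(c("No", "Yes"), times = c(2200, 150))
near_zero_var_sketch(couples)   # ratio 2200/150 is about 14.7, below 19

# Made-up column where one value dominates
lopsided <- rep(c("No", "Yes"), times = c(2300, 50))
near_zero_var_sketch(lopsided)  # ratio 46 exceeds 19
```

Under these assumptions a column is flagged only when one value dominates and there are few distinct values, which is consistent with Couples_Match and Match_Status being retained despite their imbalance.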

Data Check

Check every column for NA or infinite values. FALSE means that the column contains neither.

## Check for bad values
apply(all_data1, 2, function(x) any(is.na(x) | is.infinite(x)))

##                       STAR_ID                   Survey_Year 
##                         FALSE                         FALSE 
##         Interview_Offer_Total                  Match_Status 
##                         FALSE                         FALSE 
##                    Home_State                  Step_1_Score 
##                         FALSE                         FALSE 
##           Cumulative_Quartile                 Quartile_Rank 
##                         FALSE                         FALSE 
##     number_Honored_Clerkships       Honors_A_This_Specialty 
##                         FALSE                         FALSE 
##                     AOA_Sigma                          GHHS 
##                         FALSE                         FALSE 
##                 Couples_Match                 Other_Degrees 
##                         FALSE                         FALSE 
##   number_Research_Experiences number_Abstracts_Pres_Posters 
##                         FALSE                         FALSE 
##  number_Peer_Rev_Publications  number_Volunteer_Experiences 
##                         FALSE                         FALSE 
##   number_Leadership_Positions       number_Programs_Applied 
##                         FALSE                         FALSE 
##    number_Interviews_Attended 
##                         FALSE
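
One caveat worth checking: `apply()` converts a data frame to a matrix first, and when factor columns are present that matrix is of type character, so `is.infinite()` returns FALSE for every cell. A type-preserving variant is sketched below on made-up data; the helper name `has_bad_value` is hypothetical.

```r
# Toy data frame with a factor column and an infinite numeric value (made up)
toy <- data.frame(
  score = c(230, Inf, 245),
  state = factor(c("TX", "NY", "PA"))
)

# apply() first converts the data frame to a character matrix,
# so is.infinite() never fires for the numeric column.
check_apply <- apply(toy, 2, function(x) any(is.na(x) | is.infinite(x)))

# vapply() visits each column with its original type intact.
has_bad_value <- function(x) {
  anyNA(x) || (is.numeric(x) && any(is.infinite(x)))
}
check_vapply <- vapply(toy, has_bad_value, logical(1))

check_apply
check_vapply
```

On this toy data the `apply()` version reports FALSE for `score` while the `vapply()` version reports TRUE, so the second form is the safer check for infinite values in mixed-type data.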

The data are given below:

DT::datatable(all_data1, options = list(pageLength = 10))

Checking the rate of matching

# original response distribution
tmp <- table(all_data1$Match_Status)
match_rate <- (tmp[[2]]/(tmp[[2]] + tmp[[1]]))*100
match_rate

## [1] 92.89362

rm(tmp)
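
The positional lookup above relies on "Y" being the second factor level. A name-based variant is sketched here, rebuilding the outcome from the counts reported for Match_Status (167 "N" and 2183 "Y") so that it runs on its own.

```r
# Counts from the Match_Status table in this report: 167 "N" and 2183 "Y"
match_status <- factor(rep(c("N", "Y"), times = c(167, 2183)))

# Look the level up by name so the result does not depend on level order
match_rate <- 100 * prop.table(table(match_status))[["Y"]]
match_rate
```

Because roughly 92.9 percent of applicants matched, a rule that predicts "matched" for everyone would already be right about 92.9 percent of the time, so raw accuracy is a weak yardstick for any model fit to these data.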

Complete Exploratory Data Analysis

First approach to data

A summary of the variables is listed below.

We can see that these data have 2350 observations of 21 features, and that about 92.9 percent of medical students applying to EM residency matched. Let’s create a few plots to get a sense of the data. Remember, the goal here will be to predict whether a given medical student will match into EM residency, based on the variables listed in the codebook.

Structure of the data.

## Check data types
sapply(all_data1, class)

##                       STAR_ID                   Survey_Year 
##                     "numeric"                      "factor" 
##         Interview_Offer_Total                  Match_Status 
##                     "numeric"                      "factor" 
##                    Home_State                  Step_1_Score 
##                      "factor"                     "numeric" 
##           Cumulative_Quartile                 Quartile_Rank 
##                      "factor"                     "numeric" 
##     number_Honored_Clerkships       Honors_A_This_Specialty 
##                     "numeric"                      "factor" 
##                     AOA_Sigma                          GHHS 
##                      "factor"                      "factor" 
##                 Couples_Match                 Other_Degrees 
##                      "factor"                      "factor" 
##   number_Research_Experiences number_Abstracts_Pres_Posters 
##                     "numeric"                     "numeric" 
##  number_Peer_Rev_Publications  number_Volunteer_Experiences 
##                     "numeric"                     "numeric" 
##   number_Leadership_Positions       number_Programs_Applied 
##                     "numeric"                     "numeric" 
##    number_Interviews_Attended 
##                     "numeric"
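
The listing includes both a factor, Cumulative_Quartile, and a numeric column, Quartile_Rank. The frequency tables in this report (for example, 539 applicants in "1st" and 539 with rank 4) suggest that the numeric column is a recode of the factor with "Unknown" mapped to 0. The mapping below is inferred from those counts and is an assumption; it is not taken from the cleaning script.

```r
# Assumed mapping, inferred from matching frequencies in the report
quartile_to_rank <- c("1st" = 4, "2nd" = 3, "3rd" = 2, "4th" = 1, "Unknown" = 0)

# Rebuild the factor from the reported counts
cumulative_quartile <- factor(rep(names(quartile_to_rank),
                                  times = c(539, 484, 362, 201, 764)))

# Recode each level to its assumed numeric rank
quartile_rank <- unname(quartile_to_rank[as.character(cumulative_quartile)])
table(quartile_rank)
```

If this mapping is right, the numeric scale places "Unknown" below the 4th quartile, which may be worth keeping in mind if Quartile_Rank is used as a continuous predictor.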

Describe the Data

Hmisc::describe(all_data1)

## all_data1 
## 
##  21  Variables      2350  Observations
## --------------------------------------------------------------------------------
## STAR_ID 
##         n   missing  distinct      Info      Mean       Gmd       .05       .10 
##      2350         0      2350         1  2.02e+09   1300887 2.018e+09 2.018e+09 
##       .25       .50       .75       .90       .95 
## 2.019e+09 2.020e+09 2.021e+09 2.021e+09 2.021e+09 
## 
## lowest : 2017040001 2017040002 2017040003 2017040004 2017040005
## highest: 2021040718 2021040719 2021040720 2021040721 2021040722
##                                                                  
## Value      2017050000 2018050000 2019050000 2020050000 2021050000
## Frequency          70        437        504        652        687
## Proportion      0.030      0.186      0.214      0.277      0.292
## 
## For the frequency table, variable is rounded to the nearest 50000
## --------------------------------------------------------------------------------
## Survey_Year 
##        n  missing distinct 
##     2350        0        5 
## 
## lowest : 2017 2018 2019 2020 2021, highest: 2017 2018 2019 2020 2021
##                                         
## Value       2017  2018  2019  2020  2021
## Frequency     70   437   504   652   687
## Proportion 0.030 0.186 0.214 0.277 0.292
## --------------------------------------------------------------------------------
## Interview_Offer_Total 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2350        0      100    0.999     24.5    17.95        4        7 
##      .25      .50      .75      .90      .95 
##       13       21       32       46       57 
## 
## lowest :   0   1   2   3   4, highest: 133 134 180 192 249
## --------------------------------------------------------------------------------
## Match_Status 
##        n  missing distinct 
##     2350        0        2 
##                       
## Value          N     Y
## Frequency    167  2183
## Proportion 0.071 0.929
## --------------------------------------------------------------------------------
## Home_State 
##        n  missing distinct 
##     2350        0       37 
## 
## lowest : AL AR AZ CA CT, highest: TX VA WA WI WV
## --------------------------------------------------------------------------------
## Step_1_Score 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2350        0       17    0.991    234.1    17.26      207      212 
##      .25      .50      .75      .90      .95 
##      222      237      247      252      257 
## 
## lowest : 192 197 202 207 212, highest: 252 257 262 267 272
##                                                                             
## Value        192   197   202   207   212   217   222   227   232   237   242
## Frequency      7    17    54    74   108   153   208   262   274   321   251
## Proportion 0.003 0.007 0.023 0.031 0.046 0.065 0.089 0.111 0.117 0.137 0.107
##                                               
## Value        247   252   257   262   267   272
## Frequency    236   180   114    59    26     6
## Proportion 0.100 0.077 0.049 0.025 0.011 0.003
## --------------------------------------------------------------------------------
## Cumulative_Quartile 
##        n  missing distinct 
##     2350        0        5 
## 
## lowest : 1st     2nd     3rd     4th     Unknown
## highest: 1st     2nd     3rd     4th     Unknown
##                                                   
## Value          1st     2nd     3rd     4th Unknown
## Frequency      539     484     362     201     764
## Proportion   0.229   0.206   0.154   0.086   0.325
## --------------------------------------------------------------------------------
## Quartile_Rank 
##        n  missing distinct     Info     Mean      Gmd 
##     2350        0        5    0.941    1.929    1.769 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency    764   201   362   484   539
## Proportion 0.325 0.086 0.154 0.206 0.229
## --------------------------------------------------------------------------------
## number_Honored_Clerkships 
##        n  missing distinct     Info     Mean      Gmd 
##     2350        0        9    0.983    3.155    2.686 
## 
## lowest : 0 1 2 3 4, highest: 4 5 6 7 8
##                                                                 
## Value          0     1     2     3     4     5     6     7     8
## Frequency    390   308   347   315   303   230   206   138   113
## Proportion 0.166 0.131 0.148 0.134 0.129 0.098 0.088 0.059 0.048
## --------------------------------------------------------------------------------
## Honors_A_This_Specialty 
##        n  missing distinct 
##     2350        0        2 
##                       
## Value         No   Yes
## Frequency   1071  1279
## Proportion 0.456 0.544
## --------------------------------------------------------------------------------
## AOA_Sigma 
##        n  missing distinct 
##     2350        0        3 
##                                                                 
## Value                     No No School Chapter               Yes
## Frequency               1840               135               375
## Proportion             0.783             0.057             0.160
## --------------------------------------------------------------------------------
## GHHS 
##        n  missing distinct 
##     2350        0        3 
##                                                                 
## Value                     No No School Chapter               Yes
## Frequency               1854               104               392
## Proportion             0.789             0.044             0.167
## --------------------------------------------------------------------------------
## Couples_Match 
##        n  missing distinct 
##     2350        0        2 
##                       
## Value         No   Yes
## Frequency   2200   150
## Proportion 0.936 0.064
## --------------------------------------------------------------------------------
## Other_Degrees 
##        n  missing distinct 
##     2350        0        8 
## 
## lowest : MBA                  MDiv                 MEd                  MPH                  MSc                 
## highest: MPH                  MSc                  No additional degree Other                PhD                 
## 
## MBA (35, 0.015), MDiv (4, 0.002), MEd (14, 0.006), MPH (102, 0.043), MSc (153,
## 0.065), No additional degree (1887, 0.803), Other (133, 0.057), PhD (22, 0.009)
## --------------------------------------------------------------------------------
## number_Research_Experiences 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2350        0       12    0.971    2.919    2.217        0        1 
##      .25      .50      .75      .90      .95 
##        2        3        4        5        7 
## 
## lowest :  0  1  2  3  4, highest:  7  8  9 10 11
##                                                                             
## Value          0     1     2     3     4     5     6     7     8     9    10
## Frequency    215   368   542   477   322   200   101    49    29     3     4
## Proportion 0.091 0.157 0.231 0.203 0.137 0.085 0.043 0.021 0.012 0.001 0.002
##                 
## Value         11
## Frequency     40
## Proportion 0.017
## --------------------------------------------------------------------------------
## number_Abstracts_Pres_Posters 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2350        0       12    0.979    3.114    3.156        0        0 
##      .25      .50      .75      .90      .95 
##        1        2        4        8       11 
## 
## lowest :  0  1  2  3  4, highest:  7  8  9 10 11
##                                                                             
## Value          0     1     2     3     4     5     6     7     8     9    10
## Frequency    452   360   451   303   223   145    96    74    59    33    26
## Proportion 0.192 0.153 0.192 0.129 0.095 0.062 0.041 0.031 0.025 0.014 0.011
##                 
## Value         11
## Frequency    128
## Proportion 0.054
## --------------------------------------------------------------------------------
## number_Peer_Rev_Publications 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2350        0       12    0.901    1.383    1.813        0        0 
##      .25      .50      .75      .90      .95 
##        0        1        2        4        5 
## 
## lowest :  0  1  2  3  4, highest:  7  8  9 10 11
##                                                                             
## Value          0     1     2     3     4     5     6     7     8     9    10
## Frequency   1009   585   359   160    88    43    23    25    13     3     7
## Proportion 0.429 0.249 0.153 0.068 0.037 0.018 0.010 0.011 0.006 0.001 0.003
##                 
## Value         11
## Frequency     35
## Proportion 0.015
## --------------------------------------------------------------------------------
## number_Volunteer_Experiences 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2350        0       12    0.982    6.869    3.452        2        3 
##      .25      .50      .75      .90      .95 
##        4        7       10       11       11 
## 
## lowest :  0  1  2  3  4, highest:  7  8  9 10 11
##                                                                             
## Value          0     1     2     3     4     5     6     7     8     9    10
## Frequency     12    37   108   198   249   291   246   219   240   107   121
## Proportion 0.005 0.016 0.046 0.084 0.106 0.124 0.105 0.093 0.102 0.046 0.051
##                 
## Value         11
## Frequency    522
## Proportion 0.222
## --------------------------------------------------------------------------------
## number_Leadership_Positions 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2350        0       12    0.984    4.041    2.956        0        1 
##      .25      .50      .75      .90      .95 
##        2        4        5        8       10 
## 
## lowest :  0  1  2  3  4, highest:  7  8  9 10 11
##                                                                             
## Value          0     1     2     3     4     5     6     7     8     9    10
## Frequency    155   215   357   434   319   302   153   171    81    23    31
## Proportion 0.066 0.091 0.152 0.185 0.136 0.129 0.065 0.073 0.034 0.010 0.013
##                 
## Value         11
## Frequency    109
## Proportion 0.046
## --------------------------------------------------------------------------------
## number_Programs_Applied 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2350        0      138    0.999    49.22    23.92    21.45    28.00 
##      .25      .50      .75      .90      .95 
##    35.00    45.00    60.00    77.00    92.00 
## 
## lowest :   1   2   3   4   5, highest: 160 169 176 191 192
## --------------------------------------------------------------------------------
## number_Interviews_Attended 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     2350        0       35    0.994    12.98    5.182        4        7 
##      .25      .50      .75      .90      .95 
##       11       13       16       18       20 
## 
## lowest :  0  1  2  3  4, highest: 30 31 32 33 36
## --------------------------------------------------------------------------------
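
The Gmd column in the output above is Gini's mean difference, the average absolute difference between all pairs of observations. A minimal base R sketch follows; the function name `gini_mean_difference` is hypothetical, and the Quartile_Rank vector is rebuilt from the counts printed above.

```r
# Gini mean difference: average |x_i - x_j| over all pairs of observations
gini_mean_difference <- function(x) {
  x <- x[!is.na(x)]
  mean(as.vector(stats::dist(x)))
}

# Small made-up example: the pairs differ by 1, 3 and 2, so the mean is 2
gini_mean_difference(c(1, 2, 4))

# Quartile_Rank rebuilt from the reported frequencies
quartile_rank <- rep(0:4, times = c(764, 201, 362, 484, 539))
round(gini_mean_difference(quartile_rank), 3)  # compare with the Gmd of 1.769 above
```

Unlike the standard deviation, this measure is on the original scale without squaring, which makes it less sensitive to a handful of extreme values.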

Another view of the data

A nice data summary is available from the skimr package.

#skimr::skim(all_data1)

Third view of the data

## Warning in breaks[-1L] + breaks[-nB]: NAs produced by integer overflow

Data Frame Summary

all_data1

Dimensions: 2350 x 21
Duplicates: 0

STAR_ID [numeric]
  Stats / Values: Mean (sd): 2019656896 (1171521); min < med < max: 2017040001 < 2020040172 < 2021040722; IQR (CV): 2000020 (0)
  Freqs (% of Valid): 2350 distinct values
  Missing: 0 (0.0%)

Survey_Year [factor]
  Values and Freqs (% of Valid):
    1. 2017: 70 (3.0%)
    2. 2018: 437 (18.6%)
    3. 2019: 504 (21.4%)
    4. 2020: 652 (27.7%)
    5. 2021: 687 (29.2%)
  Missing: 0 (0.0%)

Interview_Offer_Total [numeric]
  Stats / Values: Mean (sd): 24.5 (18.07); min < med < max: 0 < 21 < 249; IQR (CV): 19 (0.74)
  Freqs (% of Valid): 100 distinct values
  Missing: 0 (0.0%)

Match_Status [factor]
  Values and Freqs (% of Valid):
    1. N: 167 (7.1%)
    2. Y: 2183 (92.9%)
  Missing: 0 (0.0%)

Home_State [factor]
  Values and Freqs (% of Valid):
    1. AL: 9 (0.4%)
    2. AR: 3 (0.1%)
    3. AZ: 45 (1.9%)
    4. CA: 65 (2.8%)
    5. CT: 37 (1.6%)
    6. DC: 65 (2.8%)
    7. FL: 142 (6.0%)
    8. GA: 27 (1.1%)
    9. IA: 20 (0.9%)
    10. ID: 30 (1.3%)
    [ 27 others ]: 1907 (81.1%)
  Missing: 0 (0.0%)

Step_1_Score [numeric]
  Stats / Values: Mean (sd): 234.06 (15.24); min < med < max: 192 < 237 < 272; IQR (CV): 25 (0.07)
  Freqs (% of Valid): 17 distinct values
  Missing: 0 (0.0%)

Cumulative_Quartile [factor]
  Values and Freqs (% of Valid):
    1. 1st: 539 (22.9%)
    2. 2nd: 484 (20.6%)
    3. 3rd: 362 (15.4%)
    4. 4th: 201 (8.6%)
    5. Unknown: 764 (32.5%)
  Missing: 0 (0.0%)

Quartile_Rank [numeric]
  Stats / Values: Mean (sd): 1.93 (1.58); min < med < max: 0 < 2 < 4; IQR (CV): 3 (0.82)
  Freqs (% of Valid):
    0: 764 (32.5%)
    1: 201 (8.6%)
    2: 362 (15.4%)
    3: 484 (20.6%)
    4: 539 (22.9%)
  Missing: 0 (0.0%)

number_Honored_Clerkships [numeric]
  Stats / Values: Mean (sd): 3.16 (2.37); min < med < max: 0 < 3 < 8; IQR (CV): 4 (0.75)
  Freqs (% of Valid):
    0: 390 (16.6%)
    1: 308 (13.1%)
    2: 347 (14.8%)
    3: 315 (13.4%)
    4: 303 (12.9%)
    5: 230 (9.8%)
    6: 206 (8.8%)
    7: 138 (5.9%)
    8: 113 (4.8%)
  Missing: 0 (0.0%)

Honors_A_This_Specialty [factor]
  Values and Freqs (% of Valid):
    1. No: 1071 (45.6%)
    2. Yes: 1279 (54.4%)
  Missing: 0 (0.0%)

AOA_Sigma [factor]
  Values and Freqs (% of Valid):
    1. No: 1840 (78.3%)
    2. No School Chapter: 135 (5.7%)
    3. Yes: 375 (16.0%)
  Missing: 0 (0.0%)

GHHS [factor]
  Values and Freqs (% of Valid):
    1. No: 1854 (78.9%)
    2. No School Chapter: 104 (4.4%)
    3. Yes: 392 (16.7%)
  Missing: 0 (0.0%)

Couples_Match [factor]
  Values and Freqs (% of Valid):
    1. No: 2200 (93.6%)
    2. Yes: 150 (6.4%)
  Missing: 0 (0.0%)

Other_Degrees [factor]
  Values and Freqs (% of Valid):
    1. MBA: 35 (1.5%)
    2. MDiv: 4 (0.2%)
    3. MEd: 14 (0.6%)
    4. MPH: 102 (4.3%)
    5. MSc: 153 (6.5%)
    6. No additional degree: 1887 (80.3%)
    7. Other: 133 (5.7%)
    8. PhD: 22 (0.9%)
  Missing: 0 (0.0%)

number_Research_Experiences [numeric]
  Stats / Values: Mean (sd): 2.92 (2.1); min < med < max: 0 < 3 < 11; IQR (CV): 2 (0.72)
  Freqs (% of Valid): 12 distinct values
  Missing: 0 (0.0%)

number_Abstracts_Pres_Posters [numeric]
  Stats / Values: Mean (sd): 3.11 (2.98); min < med < max: 0 < 2 < 11; IQR (CV): 3 (0.96)
  Freqs (% of Valid): 12 distinct values
  Missing: 0 (0.0%)

number_Peer_Rev_Publications [numeric]
  Stats / Values: Mean (sd): 1.38 (2.01); min < med < max: 0 < 1 < 11; IQR (CV): 2 (1.46)
  Freqs (% of Valid): 12 distinct values
  Missing: 0 (0.0%)

number_Volunteer_Experiences [numeric]
  Stats / Values: Mean (sd): 6.87 (3.03); min < med < max: 0 < 7 < 11; IQR (CV): 6 (0.44)
  Freqs (% of Valid): 12 distinct values
  Missing: 0 (0.0%)

number_Leadership_Positions [numeric]
  Stats / Values: Mean (sd): 4.04 (2.69); min < med < max: 0 < 4 < 11; IQR (CV): 3 (0.67)
  Freqs (% of Valid): 12 distinct values
  Missing: 0 (0.0%)

number_Programs_Applied [numeric]
  Stats / Values: Mean (sd): 49.22 (23.05); min < med < max: 1 < 45 < 192; IQR (CV): 25 (0.47)
  Freqs (% of Valid): 138 distinct values
  Missing: 0 (0.0%)

number_Interviews_Attended [numeric]
  Stats / Values: Mean (sd): 12.98 (4.79); min < med < max: 0 < 13 < 36; IQR (CV): 5 (0.37)
  Freqs (% of Valid): 35 distinct values
  Missing: 0 (0.0%)

Generated by summarytools 0.9.9 (R version 4.1.0)
2021-07-13

Missing values

The new data set, all_data1, includes 2350 rows and 21 columns. None of the 2350 applicant records has a missing value.

Check for missing data

#plot_str(all_data) #COOL BUT USELESS HERE
DataExplorer::plot_missing(all_data1,
  ggtheme = theme_gray(),
  theme_config = list(),
  title = "DataExplorer NA Plot")

The hidden code has four additional ways to check for missingness in the data.
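
The hidden chunk itself is not reproduced here. As a stand-in, the following base R checks show the kind of quick missingness tests that could be used; they are not necessarily the four the authors ran, and the toy data frame is made up.

```r
# Toy data with two gaps (made up); the report's data have none
toy <- data.frame(
  Step_1_Score = c(232, NA, 247),
  Home_State   = c("TX", "NY", NA)
)

any_missing     <- anyNA(toy)                   # TRUE if any cell is missing
missing_per_col <- colSums(is.na(toy))          # missing count per column
incomplete_rows <- sum(!complete.cases(toy))    # rows with at least one gap
pct_missing     <- colMeans(is.na(toy)) * 100   # percent missing per column

any_missing
missing_per_col
incomplete_rows
pct_missing
```

Run against all_data1, each of these would be expected to report zero missing values, in line with the summaries above.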

Visualising distributions

Data Description and Univariate analysis of variables.

After the data check was completed, an exploratory data analysis (EDA) was conducted to look for interesting relationships among the variables. Histograms were used to visualize the distributions of the predictors. Because predicting Match_Status is a classification problem, relationships between each predictor and the dichotomous outcome were also examined.
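
As a minimal illustration of relating predictors to a two-level outcome, the sketch below compares a numeric predictor across outcome groups and computes the match rate within each level of a categorical predictor. The data are made up; only the column names are borrowed from this report.

```r
# Made-up toy data using three column names from the report
toy <- data.frame(
  Match_Status  = factor(c("Y", "Y", "N", "Y", "N", "Y")),
  Step_1_Score  = c(242, 237, 212, 252, 222, 232),
  Couples_Match = factor(c("No", "Yes", "No", "No", "Yes", "No"))
)

# Numeric predictor: compare its median across outcome groups
median_by_outcome <- tapply(toy$Step_1_Score, toy$Match_Status, median)

# Categorical predictor: share matched within each level (rows sum to 1)
rate_by_level <- prop.table(table(toy$Couples_Match, toy$Match_Status), margin = 1)

median_by_outcome
rate_by_level
```

Row-wise proportions are used for the categorical predictor so that levels with very different sample sizes can still be compared on the same scale.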

Categorical and numerical variable plots:

Analyzing categorical variables

#General Data Description, nice start for overview
inspect_cat_plot <- inspectdf::inspect_cat(all_data1) %>% inspectdf::show_plot() 
inspect_cat_plot

tm_ggsave(object = inspect_cat_plot, filename = "inspect_cat_plot.tiff")

## [1] "Function Sanity Check: Saving a ggplot image as a TIFF"

funModeling::freq(data=all_data1, plot = TRUE, na.rm = FALSE)

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

##   Survey_Year frequency percentage cumulative_perc
## 1        2021       687      29.23           29.23
## 2        2020       652      27.74           56.97
## 3        2019       504      21.45           78.42
## 4        2018       437      18.60           97.02
## 5        2017        70       2.98          100.00

##   Match_Status frequency percentage cumulative_perc
## 1            Y      2183      92.89           92.89
## 2            N       167       7.11          100.00

##    Home_State frequency percentage cumulative_perc
## 1          TX       356      15.15           15.15
## 2          NY       229       9.74           24.89
## 3          PA       175       7.45           32.34
## 4          FL       142       6.04           38.38
## 5          IL       115       4.89           43.27
## 6          OH       112       4.77           48.04
## 7          MI       100       4.26           52.30
## 8          NC        85       3.62           55.92
## 9          VA        85       3.62           59.54
## 10         SC        82       3.49           63.03
## 11         LA        75       3.19           66.22
## 12         MA        69       2.94           69.16
## 13         CA        65       2.77           71.93
## 14         DC        65       2.77           74.70
## 15         KY        59       2.51           77.21
## 16         WA        56       2.38           79.59
## 17         WI        55       2.34           81.93
## 18         MN        53       2.26           84.19
## 19         TN        52       2.21           86.40
## 20         AZ        45       1.91           88.31
## 21         NJ        42       1.79           90.10
## 22         CT        37       1.57           91.67
## 23         ID        30       1.28           92.95
## 24         GA        27       1.15           94.10
## 25         OK        21       0.89           94.99
## 26         IA        20       0.85           95.84
## 27         NE        19       0.81           96.65
## 28         MO        18       0.77           97.42
## 29         MS        15       0.64           98.06
## 30         NV        10       0.43           98.49
## 31         WV        10       0.43           98.92
## 32         AL         9       0.38           99.30
## 33         NM         6       0.26           99.56
## 34         SD         4       0.17           99.73
## 35         AR         3       0.13           99.86
## 36         MD         2       0.09           99.95
## 37         ND         2       0.09          100.00

##   Cumulative_Quartile frequency percentage cumulative_perc
## 1             Unknown       764      32.51           32.51
## 2                 1st       539      22.94           55.45
## 3                 2nd       484      20.60           76.05
## 4                 3rd       362      15.40           91.45
## 5                 4th       201       8.55          100.00

##   Honors_A_This_Specialty frequency percentage cumulative_perc
## 1                     Yes      1279      54.43           54.43
## 2                      No      1071      45.57          100.00

##           AOA_Sigma frequency percentage cumulative_perc
## 1                No      1840      78.30           78.30
## 2               Yes       375      15.96           94.26
## 3 No School Chapter       135       5.74          100.00

##                GHHS frequency percentage cumulative_perc
## 1                No      1854      78.89           78.89
## 2               Yes       392      16.68           95.57
## 3 No School Chapter       104       4.43          100.00

##   Couples_Match frequency percentage cumulative_perc
## 1            No      2200      93.62           93.62
## 2           Yes       150       6.38          100.00

##          Other_Degrees frequency percentage cumulative_perc
## 1 No additional degree      1887      80.30           80.30
## 2                  MSc       153       6.51           86.81
## 3                Other       133       5.66           92.47
## 4                  MPH       102       4.34           96.81
## 5                  MBA        35       1.49           98.30
## 6                  PhD        22       0.94           99.24
## 7                  MEd        14       0.60           99.84
## 8                 MDiv         4       0.17          100.00

## [1] "Variables processed: Survey_Year, Match_Status, Home_State, Cumulative_Quartile, Honors_A_This_Specialty, AOA_Sigma, GHHS, Couples_Match, Other_Degrees"

Analyzing numerical variables

## $page_1

## [1] "Function Sanity Check: Saving a ggplot image as a TIFF"

Summary stats of the numerical data showing means, medians, skew

create_profiling_num_output <- create_profiling_num(all_data1)

## [1] "Function Sanity Check: Plot Numeric Features"

all_data1 %>% mosaic::inspect()  #another good option

## 
## categorical variables:  
##                      name  class levels    n missing
## 1             Survey_Year factor      5 2350       0
## 2            Match_Status factor      2 2350       0
## 3              Home_State factor     37 2350       0
## 4     Cumulative_Quartile factor      5 2350       0
## 5 Honors_A_This_Specialty factor      2 2350       0
## 6               AOA_Sigma factor      3 2350       0
## 7                    GHHS factor      3 2350       0
## 8           Couples_Match factor      2 2350       0
## 9           Other_Degrees factor      8 2350       0
##                                    distribution
## 1 2021 (29.2%), 2020 (27.7%) ...               
## 2 Y (92.9%), N (7.1%)                          
## 3 TX (15.1%), NY (9.7%), PA (7.4%) ...         
## 4 Unknown (32.5%), 1st (22.9%) ...             
## 5 Yes (54.4%), No (45.6%)                      
## 6 No (78.3%), Yes (16%) ...                    
## 7 No (78.9%), Yes (16.7%) ...                  
## 8 No (93.6%), Yes (6.4%)                       
## 9 No additional degree (80.3%) ...             
## 
## quantitative variables:  
##                                name   class        min         Q1     median
## ...1                        STAR_ID numeric 2017040001 2019040083 2020040172
## ...2          Interview_Offer_Total numeric          0         13         21
## ...3                   Step_1_Score numeric        192        222        237
## ...4                  Quartile_Rank numeric          0          0          2
## ...5      number_Honored_Clerkships numeric          0          1          3
## ...6    number_Research_Experiences numeric          0          2          3
## ...7  number_Abstracts_Pres_Posters numeric          0          1          2
## ...8   number_Peer_Rev_Publications numeric          0          0          1
## ...9   number_Volunteer_Experiences numeric          0          4          7
## ...10   number_Leadership_Positions numeric          0          2          4
## ...11       number_Programs_Applied numeric          1         35         45
## ...12    number_Interviews_Attended numeric          0         11         13
##               Q3        max         mean           sd    n missing
## ...1  2021040104 2021040722 2.019657e+09 1.171521e+06 2350       0
## ...2          32        249 2.449872e+01 1.806869e+01 2350       0
## ...3         247        272 2.340574e+02 1.524495e+01 2350       0
## ...4           3          4 1.928936e+00 1.582839e+00 2350       0
## ...5           5          8 3.155319e+00 2.365344e+00 2350       0
## ...6           4         11 2.918723e+00 2.097808e+00 2350       0
## ...7           4         11 3.114468e+00 2.983437e+00 2350       0
## ...8           2         11 1.382553e+00 2.012634e+00 2350       0
## ...9          10         11 6.868936e+00 3.025902e+00 2350       0
## ...10          5         11 4.040851e+00 2.694465e+00 2350       0
## ...11         60        192 4.921617e+01 2.305222e+01 2350       0
## ...12         16         36 1.298043e+01 4.786684e+00 2350       0

readr::write_csv(create_profiling_num_output, (here::here("results", "create_profiling_num_output.csv")))
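
In the summary above, several means sit above their medians (Interview_Offer_Total has a mean of about 24.5, a median of 21 and a maximum of 249), which points to right skew. The definition of skewness used by the profiling function is not shown in this report; the sketch below uses the moment-based definition, one of several in common use, and the function name `skewness_sketch` is hypothetical.

```r
# Moment-based sample skewness (one of several common definitions)
skewness_sketch <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skewness_sketch(c(1, 2, 3, 4, 5))   # symmetric values give 0
skewness_sketch(c(0, 0, 1, 1, 11))  # long right tail gives a positive value
```

Positive values indicate a longer right tail, which is what one would expect for count variables with a floor at zero.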

Bivariate/Cross plots of predictors by outcome

DataExplorer::plot_boxplot(all_data1, by = "Match_Status", 
  ggtheme = theme_gray(),
  theme_config = list(),
  nrow = 10L,
  ncol = 2L, 
  title = "DataExplorer of Variables")

Typical values/Table 1

Table: Applicant Descriptive Variables by Matched or Did Not Match from 2017 to 2021

# Builds a descriptive "Table 1" of the variables, stratified by `by`
tm_arsenal_table <- function(df, by){
  print("Function Sanity Check: Create Arsenal Table using arsenal package")
  table_variable_within_function <- arsenal::tableby(formula = by ~ .,
                 data=df, control = arsenal::tableby.control(test = TRUE,
                                                                total = FALSE,
                                                                digits = 1L,
                                                                digits.p = 2L,
                                                                digits.count = 0L,
                                                                numeric.simplify = FALSE,
                                                                numeric.stats =
                                                                  c("median",
                                                                    "q1q3"),
                                                                cat.stats =
                                                                  c("Nmiss",
                                                                    "countpct"),
                                                                stats.labels = list(Nmiss = "N Missing",
                                                                                    Nmiss2 ="N Missing",
                                                                                    meansd = "Mean (SD)",
                                                                                    medianrange = "Median (Range)",
                                                                                    median ="Median",
                                                                                    medianq1q3 = "Median (Q1, Q3)",
                                                                                    q1q3 = "Q1, Q3",
                                                                                    iqr = "IQR",
                                                                                    range = "Range",
                                                                                    countpct = "Count (Pct)",
                                                                                    Nevents = "Events",
                                                                                    medSurv ="Median Survival",
                                                                                    medTime = "Median Follow-Up")))
  final <- summary(table_variable_within_function,
          text=TRUE,
          title = 'Table: Applicant Descriptive Variables by Matched or Did Not Match from 2017 to 2021',
          #labelTranslations = mylabels, #Seen in additional functions file
          pfootnote=TRUE)
  return(final)
}
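In the function above, the `.` on the right-hand side of `by ~ .` expands to every column of `df`. Because the grouping vector is passed separately as `by`, the Match_Status column appears to remain among the summarized variables, which would explain why Match_Status shows up as its own row in the table. One way to leave the outcome out of the right-hand side is to build the formula explicitly. This is a minimal sketch using base R's `reformulate()`, with a hypothetical toy data frame (`toy_df` is not study data):

```r
# Hypothetical toy data frame standing in for all_data1 (not study data).
toy_df <- data.frame(
  Match_Status  = factor(c("N", "Y", "Y", "N")),
  Step_1_Score  = c(215, 240, 236, 222),
  Couples_Match = factor(c("No", "No", "Yes", "No"))
)

outcome <- "Match_Status"

# Every column except the outcome goes on the right-hand side, giving
# Match_Status ~ Step_1_Score + Couples_Match
predictors <- setdiff(names(toy_df), outcome)
table_formula <- reformulate(predictors, response = outcome)
table_formula
```

A formula built this way could then be supplied as the `formula` argument of `arsenal::tableby()` together with `data = toy_df`, in place of `by ~ .`.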

tm_arsenal_table_output <- tm_arsenal_table(df = all_data1 %>% select(-STAR_ID), by = all_data1$Match_Status)

## [1] "Function Sanity Check: Create Arsenal Table using arsenal package"

tm_arsenal_table_output

Table: Applicant Descriptive Variables by Matched or Did Not Match from 2017 to 2021
	N (N=167)	Y (N=2183)	p value
Survey_Year			< 0.01 (1)
- 2017	5 (3.0%)	65 (3.0%)
- 2018	55 (32.9%)	382 (17.5%)
- 2019	41 (24.6%)	463 (21.2%)
- 2020	53 (31.7%)	599 (27.4%)
- 2021	13 (7.8%)	674 (30.9%)
Interview_Offer_Total			< 0.01 (2)
- Median	7.0	22.0
- Q1, Q3	3.5, 17.5	13.0, 32.5
Match_Status			< 0.01 (1)
- N	167 (100.0%)	0 (0.0%)
- Y	0 (0.0%)	2183 (100.0%)
Home_State			< 0.01 (1)
- AL	0 (0.0%)	9 (0.4%)
- AR	0 (0.0%)	3 (0.1%)
- AZ	1 (0.6%)	44 (2.0%)
- CA	3 (1.8%)	62 (2.8%)
- CT	4 (2.4%)	33 (1.5%)
- DC	8 (4.8%)	57 (2.6%)
- FL	15 (9.0%)	127 (5.8%)
- GA	1 (0.6%)	26 (1.2%)
- IA	0 (0.0%)	20 (0.9%)
- ID	0 (0.0%)	30 (1.4%)
- IL	20 (12.0%)	95 (4.4%)
- KY	4 (2.4%)	55 (2.5%)
- LA	11 (6.6%)	64 (2.9%)
- MA	2 (1.2%)	67 (3.1%)
- MD	0 (0.0%)	2 (0.1%)
- MI	7 (4.2%)	93 (4.3%)
- MN	7 (4.2%)	46 (2.1%)
- MO	1 (0.6%)	17 (0.8%)
- MS	0 (0.0%)	15 (0.7%)
- NC	4 (2.4%)	81 (3.7%)
- ND	1 (0.6%)	1 (0.0%)
- NE	1 (0.6%)	18 (0.8%)
- NJ	1 (0.6%)	41 (1.9%)
- NM	0 (0.0%)	6 (0.3%)
- NV	1 (0.6%)	9 (0.4%)
- NY	12 (7.2%)	217 (9.9%)
- OH	6 (3.6%)	106 (4.9%)
- OK	1 (0.6%)	20 (0.9%)
- PA	8 (4.8%)	167 (7.7%)
- SC	6 (3.6%)	76 (3.5%)
- SD	0 (0.0%)	4 (0.2%)
- TN	4 (2.4%)	48 (2.2%)
- TX	29 (17.4%)	327 (15.0%)
- VA	3 (1.8%)	82 (3.8%)
- WA	3 (1.8%)	53 (2.4%)
- WI	2 (1.2%)	53 (2.4%)
- WV	1 (0.6%)	9 (0.4%)
Step_1_Score			< 0.01 (2)
- Median	227.0	237.0
- Q1, Q3	217.0, 242.0	222.0, 247.0
Cumulative_Quartile			< 0.01 (1)
- 1st	28 (16.8%)	511 (23.4%)
- 2nd	37 (22.2%)	447 (20.5%)
- 3rd	29 (17.4%)	333 (15.3%)
- 4th	27 (16.2%)	174 (8.0%)
- Unknown	46 (27.5%)	718 (32.9%)
Quartile_Rank			0.47 (2)
- Median	2.0	2.0
- Q1, Q3	0.0, 3.0	0.0, 3.0
number_Honored_Clerkships			< 0.01 (2)
- Median	2.0	3.0
- Q1, Q3	0.0, 4.0	1.0, 5.0
Honors_A_This_Specialty			< 0.01 (1)
- No	103 (61.7%)	968 (44.3%)
- Yes	64 (38.3%)	1215 (55.7%)
AOA_Sigma			0.06 (1)
- No	143 (85.6%)	1697 (77.7%)
- No School Chapter	7 (4.2%)	128 (5.9%)
- Yes	17 (10.2%)	358 (16.4%)
GHHS			0.26 (1)
- No	140 (83.8%)	1714 (78.5%)
- No School Chapter	5 (3.0%)	99 (4.5%)
- Yes	22 (13.2%)	370 (16.9%)
Couples_Match			0.06 (1)
- No	162 (97.0%)	2038 (93.4%)
- Yes	5 (3.0%)	145 (6.6%)
Other_Degrees			0.22 (1)
- MBA	4 (2.4%)	31 (1.4%)
- MDiv	0 (0.0%)	4 (0.2%)
- MEd	1 (0.6%)	13 (0.6%)
- MPH	5 (3.0%)	97 (4.4%)
- MSc	15 (9.0%)	138 (6.3%)
- No additional degree	125 (74.9%)	1762 (80.7%)
- Other	16 (9.6%)	117 (5.4%)
- PhD	1 (0.6%)	21 (1.0%)
number_Research_Experiences			0.20 (2)
- Median	3.0	3.0
- Q1, Q3	1.0, 4.0	2.0, 4.0
number_Abstracts_Pres_Posters			0.09 (2)
- Median	2.0	2.0
- Q1, Q3	1.0, 4.0	1.0, 4.0
number_Peer_Rev_Publications			0.64 (2)
- Median	0.0	1.0
- Q1, Q3	0.0, 2.0	0.0, 2.0
number_Volunteer_Experiences			< 0.01 (2)
- Median	5.0	7.0
- Q1, Q3	4.0, 9.0	4.0, 10.0
number_Leadership_Positions			0.44 (2)
- Median	3.0	4.0
- Q1, Q3	2.0, 5.5	2.0, 5.0
number_Programs_Applied			< 0.01 (2)
- Median	50.0	45.0
- Q1, Q3	34.0, 70.0	35.0, 60.0
number_Interviews_Attended			< 0.01 (2)
- Median	10.0	13.0
- Q1, Q3	5.0, 14.0	11.0, 16.0

(1) Pearson’s Chi-squared test
(2) Linear Model ANOVA
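The two footnoted tests can be reproduced approximately in base R. The sketch below uses hypothetical toy data (`toy` is not study data) and assumes the chi-squared test is run without continuity correction, which should be checked against the arsenal documentation. Because the table reports medians and quartiles while ANOVA compares means, a rank-based Kruskal-Wallis test is shown alongside as a possible alternative, not as the method used here.

```r
# Hypothetical toy data; values are illustrative only, not from the study.
toy <- data.frame(
  Match_Status = factor(c("N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y", "Y")),
  Honors       = factor(c("No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No")),
  Interviews   = c(4, 7, 9, 10, 11, 12, 13, 14, 15, 16)
)

# (1) Categorical predictor: Pearson's chi-squared test on the cross-table.
# suppressWarnings() hides the small-cell-count warning this toy data triggers.
chisq_p <- suppressWarnings(
  chisq.test(table(toy$Honors, toy$Match_Status), correct = FALSE)$p.value
)

# (2) Continuous predictor: one-way ANOVA from a linear model.
anova_p <- anova(lm(Interviews ~ Match_Status, data = toy))[["Pr(>F)"]][1]

# Rank-based alternative for skewed counts: Kruskal-Wallis test.
kw_p <- kruskal.test(Interviews ~ Match_Status, data = toy)$p.value

c(chisq = chisq_p, anova = anova_p, kruskal = kw_p)
```

The arsenal documentation describes a `numeric.test` argument to `tableby.control()` for switching the numeric test; its accepted values should be confirmed there before use.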

#tm_write2word(tm_arsenal_table_output, "tm_arsenal_table_output1")
#tm_write2pdf(tm_arsenal_table_output, "tm_arsenal_table_output1")

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##   [1] exploratory_6.6.3.1             anonymizer_0.2.2               
##   [3] rgl_0.106.8                     MLmetrics_1.1.1                
##   [5] plotROC_2.2.1                   R.methodsS3_1.8.1              
##   [7] ranger_0.12.1                   psych_2.1.6                    
##   [9] correlationfunnel_0.2.0         fs_1.5.0                       
##  [11] remotes_2.4.0                   summarytools_0.9.9             
##  [13] doMC_1.3.7                      discrim_0.1.2                  
##  [15] RSQLite_2.2.7                   odbc_1.3.2                     
##  [17] doParallel_1.0.16               iterators_1.0.13               
##  [19] factoextra_1.0.7                corrgram_1.14                  
##  [21] ezknitr_0.6                     vip_0.3.2                      
##  [23] BH_1.75.0-0                     plotly_4.9.4.1                 
##  [25] shinyWidgets_0.6.0              shinyjs_2.0.0                  
##  [27] flexdashboard_0.5.2             lime_0.5.2                     
##  [29] cowplot_1.1.1                   tidyquant_1.0.3                
##  [31] quantmod_0.4.18                 TTR_0.24.2                     
##  [33] PerformanceAnalytics_2.0.4      xts_0.12.1                     
##  [35] yardstick_0.0.8                 workflowsets_0.0.2             
##  [37] workflows_0.2.2                 tune_0.1.5                     
##  [39] rsample_0.1.0                   recipes_0.1.16                 
##  [41] parsnip_0.1.6                   modeldata_0.1.0                
##  [43] infer_0.5.4                     dials_0.0.9                    
##  [45] tidymodels_0.1.3                mltools_0.3.5                  
##  [47] broom_0.7.6                     mosaic_1.8.3                   
##  [49] ggridges_0.5.3                  mosaicData_0.20.2              
##  [51] ggformula_0.10.1                ggstance_0.3.5                 
##  [53] english_1.2-5                   naniar_0.6.1                   
##  [55] skimr_2.1.3                     data.table_1.14.0              
##  [57] naivebayes_0.9.7                kernlab_0.9-29                 
##  [59] kableExtra_1.3.4                RColorBrewer_1.1-2             
##  [61] rpart.plot_3.0.9                visNetwork_2.0.9               
##  [63] drake_7.13.2                    knitr_1.33                     
##  [65] tableone_0.12.0                 ggpubr_0.4.0                   
##  [67] inspectdf_0.0.11                rsconnect_0.8.18               
##  [69] DataExplorer_0.8.2              labeling_0.4.2                 
##  [71] highr_0.9                       vctrs_0.3.8                    
##  [73] progress_1.2.2                  corrplot_0.89                  
##  [75] Rmisc_1.5                       plyr_1.8.6                     
##  [77] caretEnsemble_2.0.1             tinytex_0.32                   
##  [79] funModeling_1.9.4               rmda_1.6                       
##  [81] rattle_5.4.0                    bitops_1.0-7                   
##  [83] rmarkdown_2.9                   ResourceSelection_0.3-5        
##  [85] lmtest_0.9-38                   zoo_1.8-9                      
##  [87] utf8_1.2.1                      fansi_0.5.0                    
##  [89] beepr_1.3                       rpart_4.1-15                   
##  [91] mice_3.13.0                     car_3.0-10                     
##  [93] carData_3.0-4                   MatchIt_4.2.0                  
##  [95] leaps_3.1                       moments_0.14                   
##  [97] pander_0.6.4                    arsenal_3.6.3                  
##  [99] gbm_2.1.8                       DescTools_0.99.41              
## [101] scoring_0.6                     pscl_1.5.5                     
## [103] InformationValue_1.2.3          tidylog_1.0.2                  
## [105] ggforce_0.3.3                   glmnet_4.1-2                   
## [107] Matrix_1.3-4                    Boruta_7.0.0                   
## [109] fastAdaboost_1.0.0              earth_5.3.0                    
## [111] plotmo_3.6.0                    TeachingDemos_2.12             
## [113] plotrix_3.8-1                   shiny_1.6.0                    
## [115] AppliedPredictiveModeling_1.1-7 RANN_2.6.1                     
## [117] Metrics_0.1.4                   xgboost_1.4.1.1                
## [119] ipred_0.9-11                    randomForest_4.6-14            
## [121] mlbench_2.1-3                   caTools_1.18.2                 
## [123] DynNom_5.0.1                    magrittr_2.0.1                 
## [125] packrat_0.6.0                   nnet_7.3-16                    
## [127] ROCR_1.0-11                     pROC_1.17.0.1                  
## [129] rms_6.2-0                       SparseM_1.81                   
## [131] Hmisc_4.5-0                     Formula_1.2-4                  
## [133] survival_3.2-11                 PASWR_1.1                      
## [135] MASS_7.3-54                     e1071_1.7-7                    
## [137] foreach_1.5.1                   tidyverse_1.3.1                
## [139] rgdal_1.5-23                    sp_1.4-5                       
## [141] scales_1.1.1                    munsell_0.5.0                  
## [143] bit64_4.0.5                     bit_4.0.4                      
## [145] tibble_3.1.2                    RcppRoll_0.3.0                 
## [147] forcats_0.5.1                   openxlsx_4.2.4                 
## [149] stringr_1.4.0                   tidyr_1.1.3                    
## [151] hms_1.1.0                       lubridate_1.7.10               
## [153] janitor_2.1.0                   magick_2.7.2                   
## [155] dplyr_1.0.7                     readr_1.4.0.1                  
## [157] purrr_0.3.4                     devtools_2.4.2                 
## [159] usethis_2.0.1                   reshape2_1.4.4                 
## [161] XML_3.99-0.6                    readxl_1.3.1                   
## [163] caret_6.0-88                    ggplot2_3.3.5                  
## [165] lattice_0.20-44                 here_1.0.1                     
## 
## loaded via a namespace (and not attached):
##   [1] storr_1.2.5             clisymbols_1.2.0        mitools_2.4            
##   [4] pbapply_1.4-3           haven_2.4.1             tcltk_4.1.0            
##   [7] expm_0.999-6            blob_1.2.1              prodlim_2019.11.13     
##  [10] later_1.2.0             DBI_1.1.1               jpeg_0.1-8.1           
##  [13] MatrixModels_0.5-0      htmlwidgets_1.5.3       mvtnorm_1.1-1          
##  [16] future_1.21.0           Rcpp_1.0.7              DT_0.18                
##  [19] promises_1.2.0.1        pkgload_1.2.1           leaflet_2.0.4.1        
##  [22] textshaping_0.3.5       mnormt_2.0.2            digest_0.6.27          
##  [25] png_0.1-7               polspline_1.1.19        pkgconfig_2.0.3        
##  [28] gower_0.2.2             GPfit_1.0-8             xfun_0.24              
##  [31] bslib_0.2.5.1           tidyselect_1.1.1        labelled_2.8.0         
##  [34] viridisLite_0.4.0       pkgbuild_1.2.0          rlang_0.4.11           
##  [37] manipulateWidget_0.11.0 jquerylib_0.1.4         glue_1.4.2             
##  [40] pryr_0.1.4              lhs_1.1.1               modelr_0.1.8           
##  [43] matrixStats_0.58.0      lava_1.6.9              ggsignif_0.6.2         
##  [46] httpuv_1.6.1            class_7.3-19            Rttf2pt1_1.3.8         
##  [49] TH.data_1.0-10          CORElearn_1.56.0        webshot_0.5.2          
##  [52] jsonlite_1.7.2          tmvnsim_1.0-2           mime_0.10              
##  [55] systemfonts_1.0.2       gridExtra_2.3           Exact_2.1              
##  [58] stringi_1.6.2           processx_3.5.2          survey_4.0             
##  [61] quadprog_1.5-8          cli_3.0.0               rstudioapi_0.13        
##  [64] nlme_3.1-152            listenv_0.8.0           miniUI_0.1.1.1         
##  [67] dbplyr_2.1.1            entropy_1.3.0           sessioninfo_1.1.1      
##  [70] lifecycle_1.0.0         networkD3_0.4           mosaicCore_0.9.0       
##  [73] timeDate_3043.102       Quandl_2.10.0           ggfittext_0.9.1        
##  [76] cellranger_1.1.0        codetools_0.2-18        triebeard_0.3.0        
##  [79] htmlTable_2.2.1         xtable_1.8-4            abind_1.4-5            
##  [82] farver_2.1.0            parallelly_1.25.0       rapportools_1.0        
##  [85] BBmisc_1.11             visdat_0.5.3            compare_0.2-6          
##  [88] base64url_1.4           ggdendro_0.1.22         cluster_2.1.2          
##  [91] extrafontdb_1.0         ellipsis_0.3.2          prettyunits_1.1.1      
##  [94] reprex_2.0.0            igraph_1.2.6            testthat_3.0.2         
##  [97] htmltools_0.5.1.1       yaml_2.2.1              pkgdown_1.6.1          
## [100] ModelMetrics_1.2.2.2    foreign_0.8-81          withr_2.4.2            
## [103] rootSolve_1.8.2.1       multcomp_1.4-17         ragg_1.1.3             
## [106] prediction_0.3.14       memoise_2.0.0           evaluate_0.14          
## [109] rio_0.5.26              extrafont_0.17          callr_3.7.0            
## [112] lmom_2.8                ps_1.6.0                curl_4.3.1             
## [115] urltools_1.7.3          furrr_0.2.3             conquer_1.0.2          
## [118] checkmate_2.0.0         cachem_1.0.5            desc_1.3.0             
## [121] ellipse_0.4.2           rstatix_0.7.0           stargazer_5.2.2        
## [124] ggrepel_0.9.1           dtw_1.22-3              rprojroot_2.0.2        
## [127] tools_4.1.0             sass_0.4.0              sandwich_3.0-1         
## [130] proxy_0.4-26            xml2_1.3.2              httr_1.4.2             
## [133] assertthat_0.2.1        boot_1.3-28             globals_0.14.0         
## [136] R6_2.5.0                shape_1.4.6             repr_1.1.3             
## [139] splines_4.1.0           snakecase_0.11.0        colorspace_2.0-2       
## [142] generics_0.1.0          stats4_4.1.0            base64enc_0.1-3        
## [145] pillar_1.6.1            txtq_0.2.4              tweenr_1.0.2           
## [148] audio_0.1-7             gtable_0.3.0            rvest_1.0.0            
## [151] zip_2.1.1               latticeExtra_0.6-29     fastmap_1.1.0          
## [154] crosstalk_1.1.1         quantreg_5.86           filelock_1.0.2         
## [157] backports_1.2.1         gld_2.6.2               polyclip_1.10-0        
## [160] grid_4.1.0              DiceDesign_1.9          lazyeval_0.2.2         
## [163] crayon_1.4.1            reshape_0.8.8           svglite_2.0.0          
## [166] compiler_4.1.0

pacman::p_loaded()

##   [1] "exploratory"               "anonymizer"               
##   [3] "rgl"                       "MLmetrics"                
##   [5] "plotROC"                   "R.methodsS3"              
##   [7] "ranger"                    "psych"                    
##   [9] "correlationfunnel"         "fs"                       
##  [11] "remotes"                   "summarytools"             
##  [13] "doMC"                      "discrim"                  
##  [15] "RSQLite"                   "odbc"                     
##  [17] "doParallel"                "iterators"                
##  [19] "factoextra"                "corrgram"                 
##  [21] "ezknitr"                   "vip"                      
##  [23] "BH"                        "plotly"                   
##  [25] "shinyWidgets"              "shinyjs"                  
##  [27] "flexdashboard"             "lime"                     
##  [29] "cowplot"                   "tidyquant"                
##  [31] "quantmod"                  "TTR"                      
##  [33] "PerformanceAnalytics"      "xts"                      
##  [35] "yardstick"                 "workflowsets"             
##  [37] "workflows"                 "tune"                     
##  [39] "rsample"                   "recipes"                  
##  [41] "parsnip"                   "modeldata"                
##  [43] "infer"                     "dials"                    
##  [45] "tidymodels"                "mltools"                  
##  [47] "broom"                     "mosaic"                   
##  [49] "ggridges"                  "mosaicData"               
##  [51] "ggformula"                 "ggstance"                 
##  [53] "english"                   "naniar"                   
##  [55] "skimr"                     "data.table"               
##  [57] "naivebayes"                "kernlab"                  
##  [59] "kableExtra"                "RColorBrewer"             
##  [61] "rpart.plot"                "visNetwork"               
##  [63] "drake"                     "knitr"                    
##  [65] "tableone"                  "ggpubr"                   
##  [67] "inspectdf"                 "rsconnect"                
##  [69] "DataExplorer"              "labeling"                 
##  [71] "highr"                     "vctrs"                    
##  [73] "progress"                  "corrplot"                 
##  [75] "Rmisc"                     "plyr"                     
##  [77] "caretEnsemble"             "tinytex"                  
##  [79] "funModeling"               "rmda"                     
##  [81] "rattle"                    "bitops"                   
##  [83] "rmarkdown"                 "ResourceSelection"        
##  [85] "lmtest"                    "zoo"                      
##  [87] "utf8"                      "fansi"                    
##  [89] "beepr"                     "rpart"                    
##  [91] "mice"                      "car"                      
##  [93] "carData"                   "MatchIt"                  
##  [95] "leaps"                     "moments"                  
##  [97] "pander"                    "arsenal"                  
##  [99] "gbm"                       "DescTools"                
## [101] "scoring"                   "pscl"                     
## [103] "InformationValue"          "tidylog"                  
## [105] "ggforce"                   "glmnet"                   
## [107] "Matrix"                    "Boruta"                   
## [109] "fastAdaboost"              "earth"                    
## [111] "plotmo"                    "TeachingDemos"            
## [113] "plotrix"                   "shiny"                    
## [115] "AppliedPredictiveModeling" "RANN"                     
## [117] "Metrics"                   "xgboost"                  
## [119] "ipred"                     "randomForest"             
## [121] "mlbench"                   "caTools"                  
## [123] "DynNom"                    "magrittr"                 
## [125] "packrat"                   "nnet"                     
## [127] "ROCR"                      "pROC"                     
## [129] "rms"                       "SparseM"                  
## [131] "Hmisc"                     "Formula"                  
## [133] "survival"                  "PASWR"                    
## [135] "MASS"                      "e1071"                    
## [137] "foreach"                   "tidyverse"                
## [139] "rgdal"                     "sp"                       
## [141] "scales"                    "munsell"                  
## [143] "bit64"                     "bit"                      
## [145] "tibble"                    "RcppRoll"                 
## [147] "forcats"                   "openxlsx"                 
## [149] "stringr"                   "tidyr"                    
## [151] "hms"                       "lubridate"                
## [153] "janitor"                   "magick"                   
## [155] "dplyr"                     "readr"                    
## [157] "purrr"                     "devtools"                 
## [159] "usethis"                   "reshape2"                 
## [161] "XML"                       "readxl"                   
## [163] "caret"                     "ggplot2"                  
## [165] "lattice"                   "here"

A Model to Predict Chances of Matching into Emergency Medicine Residency

Tyler M. Muffly, MD; Richard Amini, MD

Department of Obstetrics and Gynecology, Denver Health, Denver, CO; Department of Emergency Medicine, University of Arizona
