The number of NAs in the ‘d’ dataframe is: 0.

1 - Create dummy variables from categorical data.

Here we create a feature matrix in which the categorical features are converted to numeric columns with one-hot encoding; this matrix will be used in glmnet when we train logit models with shrinkage.

Variable Selection using Group LASSO

Group LASSO is used in preference to LASSO when features fall into natural groups, meaning that they are related in a certain way: for example, the dummy variables derived from ‘Location’ or ‘Year’. In such cases, LASSO treats each feature individually and may not capture the relationship between features within the same group. Group LASSO, on the other hand, selects groups of features rather than individual features, leading to a more interpretable model. This can be especially useful when dealing with high-dimensional data, where the number of features is much larger than the number of observations. By penalizing the magnitude of the coefficients of an entire group rather than the magnitude of each coefficient individually, group LASSO produces sparse models while preserving the relationships among features in a group.
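For reference, a minimal statement of the group LASSO objective for logistic regression (standard notation, assumed here rather than taken from this document): with the coefficient vector partitioned into groups β_1, …, β_G of sizes p_1, …, p_G, the estimate solves

$$\hat{\beta} = \arg\min_{\beta}\Big\{-\ell(\beta) + \lambda \sum_{g=1}^{G} \sqrt{p_g}\,\lVert\beta_g\rVert_2\Big\},$$

where ℓ(β) is the binomial log-likelihood and λ is the penalty tuned by cross-validation below. Because the ℓ2 norm of a group is zero only when every coefficient in it is zero, a whole group enters or leaves the model together.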

To handle the categorical predictors, feature selection was performed using LASSO, a popular feature selection method that eliminates irrelevant or redundant features by shrinking their coefficients to zero. To use LASSO in logit models with shrinkage, categorical features must first be converted into numeric ones through one-hot encoding. The resulting feature matrix is then used in the glmnet package to train logit models with shrinkage.

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####  One hot encoding to dummify the variables ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

#This code performs one-hot encoding, also known as "dummifying", on the factor variables in the dataframe "reduced_Data".
#The function "select_if" selects only the factor variables in the dataframe. The function "dummy_cols" then creates a new
#column for each level of each factor, with the levels represented by binary variables (0 or 1). The option
#"remove_first_dummy" is set to TRUE so that one level of each factor variable is removed to prevent collinearity issues.
#The option "remove_selected_columns" is also set to TRUE so that the original factor columns are removed from the
#dataframe, and "select_columns" names the factor columns to encode. The final step removes the first column of the
#dummies to avoid collinearity issues.

d_factor2 <- dplyr::select_if(reduced_Data, is.factor)
dummies <- dummy_cols(d_factor2, 
    remove_first_dummy = TRUE, #remove the first dummy variable column to prevent multicollinearity
    remove_selected_columns = TRUE, #remove the original factor columns from the dataframe
    select_columns = colnames(d_factor2))

dummies <- dummies[,-1]
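As a quick, self-contained illustration of what dummy_cols() produces (a toy factor, not the project data):

library(fastDummies)

# Toy example: one factor with three levels
toy <- data.frame(Location = factor(c("Duke", "Utah", "CU")))
dummy_cols(toy,
           select_columns = "Location",
           remove_first_dummy = TRUE,      # drops Location_CU, the first level alphabetically
           remove_selected_columns = TRUE) # drops the original Location column
# Leaves two binary columns: Location_Duke and Location_Utah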

2 - Drop reference dummies

The step of “dropping reference dummies” reflects the fact that in group LASSO, each group of features is assigned a single penalty term. To keep that penalty meaningful, the reference dummy of each group is excluded from the predictor set, so that the size of the penalty reflects the size of the remaining features in the group and collinearity among the dummies is avoided.

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
#####  Drop the reference dummies ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

dummies <- dummies %>% 
  dplyr::select(-starts_with(c("Location_", "Year_", "Match_Status_")))

3 - Combine dummies with numeric variables then convert them to a matrix X.

The third step is to combine the dummy variables with the numeric variables and convert them into a matrix X. This matrix will be used in the glmnet function, which is a penalized regression method. When using a binary categorical outcome, the only difference is to specify family = “binomial” in the glmnet function.

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
#####  Convert to a matrix because the models need the data in a particular way ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

d2 <- dplyr::select_if(reduced_Data, is.numeric)
X <- cbind(d2, dummies)
X <- as.matrix(X) #For {glmnet}, must use predictors as matrix and response as vector
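For a plain (ungrouped) LASSO, this matrix could be passed straight to glmnet; a minimal sketch, assuming the y vector built in step 4 below (the group LASSO fit in this analysis uses grpreg instead):

# Sketch only (not run): cv.glmnet() takes the predictor matrix and a 0/1 response
# lasso_cv <- glmnet::cv.glmnet(X, y, family = "binomial", alpha = 1)
# coef(lasso_cv, s = "lambda.min")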

4 - Create your X matrix (predictors) and Y vector (outcome variable)

In this step, the predictors and the outcome variable are separated and prepared for modeling. The predictors are combined into a matrix called X, while the outcome variable is put into a vector called Y. This separation is a common step in many machine learning algorithms to prepare the data for modeling.

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
#####  Create your X matrix (predictors) and Y vector (outcome variable) ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

d_factor2$Match_Status <- ifelse(d_factor2$Match_Status == "Matched", 1, 0)
y <- d_factor2[,1]
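A quick sanity check (illustrative, not part of the original pipeline) that the predictor matrix and response vector align:

# Match_Status is the first column of d_factor2, recoded to 1/0 above
stopifnot(nrow(X) == length(y)) # every row of X must have a corresponding outcome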

5 - Create a group vector that distinguishes groups

The group vector distinguishes groups of predictors in the data. It assigns predictors to groups based on some criterion, such as the type of predictor or the relationship between predictors; here, the dummy columns derived from the same factor belong to one group. This step is necessary for group LASSO, as it allows the penalty to be applied to whole groups of predictors rather than to individual coefficients.

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
#####  Create the grouping variable that is 'Location' ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

group <- dummies
colnames(group) <- sub("_[^_]+$", "", colnames(group)) #strip the trailing "_level" suffix from each dummy name
group <- cbind(d2, group)
group <- colnames(group) #one group label per column of X
group <- ifelse(group == "Location_U", "Location", group)
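To inspect the grouping (an illustrative check): each element of group is the group label for the corresponding column of X, so dummies derived from the same factor share a label while each numeric predictor forms its own group.

table(group) # counts of columns per group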

‘grpreg’ is used to perform the group LASSO. In this step, k-fold cross-validation is conducted over a range of values for the regularization parameter lambda, which determines the amount of penalty to be applied. The goal is to find the value of lambda that results in the minimum cross-validation error. This determines the optimal level of regularization for the model.

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
#####  Perform k-fold cross-validation for penalized regression models with grouped covariates over a grid of values for the regularization parameter lambda ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

#In this code, X is the design matrix created from the predictors, y is the binary response vector, and group is the vector
#describing the grouping of the coefficients. The grpreg function is used to fit a penalized regression model with grouped
#covariates, and the cv.grpreg function is used to perform k-fold cross validation on the model to tune the regularization
#parameter lambda. The summary function is used to print a summary of the cross-validated model, and the plot function is
#used to visualize the path trajectory of the fitted sparse regression parameters.

# Fit penalized regression model with grouped covariates
fit <- grpreg::grpreg(X, y, group, penalty="grLasso", family="binomial")

plot(fit, labels = TRUE,
     legend = TRUE,
     xlab = "Penalty Parameter (Lambda)", 
     ylab = "Coefficients")

# Perform k-fold cross validation to tune regularization parameter lambda
cvfit <- grpreg::cv.grpreg(X, y, group, penalty="grLasso", family="binomial")

# Print summary of cross-validated model
summary(cvfit)
#> grLasso-penalized logistic regression with n=5348, p=35
#> At minimum cross-validation error (lambda=0.0032):
#> -------------------------------------------------
#>   Nonzero coefficients: 28
#>   Nonzero groups: 23
#>   Cross-validation error of 1.14
#>   Maximum R-squared: 0.20
#>   Maximum signal-to-noise ratio: 0.20
#>   Prediction error at lambda.min: 0.283
# Plot path trajectory of fitted sparse regression parameters
plot(cvfit, labels = TRUE,
     main = "Group LASSO: Residency Location",
     xlab = "Lambda",
     ylab = "Coefficients",
     vertical.line=TRUE, 
     col="red")

a <- summary(cvfit)

We found the amount of penalty, λ, by cross-validation. λ is a tuning parameter that helps keep our model from over-fitting to the training data. To identify the optimal λ value we use k-fold cross-validation. Extra predictors eroded the performance of plain logistic regression, while the LASSO penalty kept performance stable as the number of irrelevant predictors increased. We searched for the λ that gives the minimum prediction error.

A vertical line is drawn at the value where the cross-validation prediction error is minimized, at which point 28 variables were selected (the nonzero coefficients reported in the summary above). However, as the confidence intervals show, there is substantial uncertainty about this minimum value.

plot(cvfit, labels = TRUE, type='cve')

Path trajectory of the fitted regression parameters: the figure should be read from right to left (i.e., λ from small to large). Variables whose coefficients reach zero later are stronger, since a larger λ is needed to force them to zero; variables whose coefficients quickly become zero are weak or insignificant.

plot(fit)

tm_print_save("output/fig/box_and_whisker_plot_of_LASSO_model.tiff")
#> [1] "Function Sanity Check: Saving TIFF of what is in the viewer"
#> null device 
#>           1

To view the best model and the corresponding coefficients, we use cvfit$lambda.min, the lambda value that yields the model with the smallest cross-validation error.

cvfit$lambda.min 
#> [1] 0.00323
# This extracts the fitted regression parameters of the logistic regression model at the chosen lambda value (0.00323, cvfit$lambda.min) and picks out which predictors have a coefficient > 0.

This code uses cvfit, the cross-validated fit of the logistic regression model to the data. The cv.grpreg function performed cross-validation to select the optimal value of lambda for the logistic regression: 0.003.

Variables selected by group LASSO: Location

Table of Demographics by Location

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####  Create tables displaying results of univariate analyses, stratified or not by categorical variable groupings ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### [tibshirani1996regression]
d1 <- d %>%
  dplyr::rename (c(#"Match Status" = Match_Status,
                    "Gender" = Gender,
                    "Age (years)" = Age,
                   "US or Canadian Applicant" = US_or_Canadian_Applicant,
                   "Visa Sponsorship Needed" = Visa_Sponsorship_Needed,
                   "Medical Education Interrupted"= Medical_Education_or_Training_Interrupted,
                   "Alpha Omega Alpha Membership" = Alpha_Omega_Alpha,
                   "Gold Humanism Honor Society Membership" = Gold_Humanism_Honor_Society,
                   "Participating in a Couples Match" = Couples_Match,
                   "Number of Oral Presentations" = Count_of_Oral_Presentation,
                   "Number of Peer-Reviewed Journal Articles or Abstracts" = Count_of_Peer_Reviewed_Journal_Articles_Abstracts,
                   "Number of Peer-Reviewed Non-published Journal_Articles" = Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published,
                   "Number of Poster Presentations" = Count_of_Poster_Presentation,
                   #"Applicant Match Year" = Year,
                   "Presented at a Scientific Meeting" = Meeting_Name_Presented,
                   "Trained at a Top NIH-funded Medical School" = TopNIHfunded,
                   "Bachelor of Science Degree" = Higher_Education_Degree,
                   #"Birth Place" = Birth_Place,
                   "Fluency in a Foreign Language" = Language_Fluency,
                   "Certification in Advanced Cardiac Life Support" = ACLS,
                   "Certification in Pediatric Advanced Life Support" = PALS,
                   "Prior Residency Training" = Medical_Training_Discipline,
                   "Residency Tracks Applied" = Tracks_Applied_by_Applicant_1,
                   "Membership in the American Medical Association" = AMA,
                   "Membership in the American College of Obstetricians and Gynecologists" = ACOG,
                   "Received a Scholarship" = Scholarship,
                   "Number of letter of recommendation OBGYN authors" = total_OBGYN_letter_writers,
                   "Number of letters of recommendation" = reco_count,
                   "Number of first author publications" = number_of_applicant_first_author_publications,
                   "Advanced Degree Other Than Medical Degree" = Advance_Degree,
                   "Type of medical school" = Type_of_medical_school,
                   "Number of Work Experiences" = work_exp_count,
                   "Number of Volunteer Experiences" = Volunteer_exp_count,
                   "Number of Research Experiences" = Research_exp_count))
# ######################################################################################
# #
# #
# #  Demographics of the Location
# #
# #
# ######################################################################################
# Define the formula for the tableby function
formula <- Location ~ .

# Define the control argument for the tableby function
control <- arsenal::tableby.control(
  test = TRUE,
  total = TRUE,
  digits = 0,
  digits.p = 2,
  digits.count = 0,
  numeric.simplify = TRUE,
  cat.simplify = FALSE,
  numeric.stats = c("median", "q1q3"),
  cat.stats = c("Nmiss", "countpct"),
  stats.labels = list(
    Nmiss = "N Missing",
    Nmiss2 = "N Missing",
    meansd = "Mean (SD)",
    medianrange = "Median (Range)",
    median = "Median",
    medianq1q3 = "Median (Q1, Q3)",
    q1q3 = "Q1, Q3",
    iqr = "IQR",
    range = "Range",
    countpct = "Count (Pct)",
    Nevents = "Events",
    medSurv = "Median Survival",
    medTime = "Median Follow-Up"
  )
)

# Create the tableby object
arsenal_table <- arsenal::tableby(formula, data = d1, control = control)
# Create the summary of the tableby object
final <- summary(
  arsenal_table
#,
  #text = TRUE,
 # title = "Table: Locations",
  #pfootnote = FALSE
); final
BSW (N=303) CCAG (N=603) CU (N=2880) Duke (N=1211) Truman (N=254) U_Washington (N=306) UAB (N=240) Utah (N=1429) Total (N=7226) p value
Match_Status < 0.01
   Not_Matched 174 (57.4%) 333 (55.2%) 1431 (49.7%) 409 (33.8%) 170 (66.9%) 126 (41.2%) 130 (54.2%) 534 (37.4%) 3307 (45.8%)
   Matched 129 (42.6%) 270 (44.8%) 1449 (50.3%) 802 (66.2%) 84 (33.1%) 180 (58.8%) 110 (45.8%) 895 (62.6%) 3919 (54.2%)
Gender < 0.01
   Female 244 (80.5%) 481 (79.8%) 2330 (80.9%) 1043 (86.1%) 195 (76.8%) 263 (85.9%) 187 (77.9%) 1169 (81.8%) 5912 (81.8%)
   Male 59 (19.5%) 122 (20.2%) 550 (19.1%) 168 (13.9%) 59 (23.2%) 43 (14.1%) 53 (22.1%) 260 (18.2%) 1314 (18.2%)
Age (years) < 0.01
   Median 28 28 28 27 29 28 28 27 28
   Q1, Q3 27, 30 27, 30 27, 30 26, 29 27, 31 27, 30 27, 30 26, 29 26, 30
US or Canadian Applicant < 0.01
   No 46 (15.2%) 96 (15.9%) 639 (22.2%) 145 (12.0%) 109 (42.9%) 55 (18.0%) 54 (22.5%) 191 (13.4%) 1335 (18.5%)
   Yes 257 (84.8%) 507 (84.1%) 2241 (77.8%) 1066 (88.0%) 145 (57.1%) 251 (82.0%) 186 (77.5%) 1238 (86.6%) 5891 (81.5%)
Visa Sponsorship Needed < 0.01
   No 287 (94.7%) 573 (95.0%) 2663 (92.5%) 1147 (94.7%) 226 (89.0%) 283 (92.5%) 226 (94.2%) 1368 (95.7%) 6773 (93.7%)
   Yes 16 (5.3%) 30 (5.0%) 217 (7.5%) 64 (5.3%) 28 (11.0%) 23 (7.5%) 14 (5.8%) 61 (4.3%) 453 (6.3%)
Medical Education Interrupted 0.02
   No 260 (85.8%) 531 (88.1%) 2461 (85.5%) 1064 (87.9%) 206 (81.1%) 276 (90.2%) 207 (86.2%) 1222 (85.5%) 6227 (86.2%)
   Yes 43 (14.2%) 72 (11.9%) 419 (14.5%) 147 (12.1%) 48 (18.9%) 30 (9.8%) 33 (13.8%) 207 (14.5%) 999 (13.8%)
Alpha Omega Alpha Membership < 0.01
   No 302 (99.7%) 598 (99.2%) 2562 (89.0%) 1045 (86.3%) 253 (99.6%) 292 (95.4%) 234 (97.5%) 1232 (86.2%) 6518 (90.2%)
   Yes 1 (0.3%) 5 (0.8%) 318 (11.0%) 166 (13.7%) 1 (0.4%) 14 (4.6%) 6 (2.5%) 197 (13.8%) 708 (9.8%)
Gold Humanism Honor Society Membership < 0.01
   No 270 (89.1%) 543 (90.0%) 2472 (85.8%) 1015 (83.8%) 234 (92.1%) 267 (87.3%) 213 (88.8%) 1173 (82.1%) 6187 (85.6%)
   Yes 33 (10.9%) 60 (10.0%) 408 (14.2%) 196 (16.2%) 20 (7.9%) 39 (12.7%) 27 (11.2%) 256 (17.9%) 1039 (14.4%)
Participating in a Couples Match < 0.01
   No 287 (94.7%) 556 (92.2%) 2659 (92.3%) 1093 (90.3%) 243 (95.7%) 294 (96.1%) 224 (93.3%) 1259 (88.1%) 6615 (91.5%)
   Yes 16 (5.3%) 47 (7.8%) 221 (7.7%) 118 (9.7%) 11 (4.3%) 12 (3.9%) 16 (6.7%) 170 (11.9%) 611 (8.5%)
Number of Oral Presentations < 0.01
   Median 0 0 0 0 0 0 0 0 0
   Q1, Q3 0, 1 0, 1 0, 1 0, 1 0, 1 0, 1 0, 1 0, 1 0, 1
Number of Peer-Reviewed Journal Articles or Abstracts < 0.01
   Median 0 0 0 0 0 0 0 0 0
   Q1, Q3 0, 1 0, 1 0, 1 0, 2 0, 1 0, 1 0, 1 0, 2 0, 1
Number of Peer-Reviewed Non-published Journal_Articles < 0.01
   Median 0 0 0 0 0 0 0 0 0
   Q1, Q3 0, 0 0, 0 0, 0 0, 1 0, 0 0, 0 0, 0 0, 1 0, 0
Number of Poster Presentations < 0.01
   Median 1 1 1 1 0 1 0 1 1
   Q1, Q3 0, 2 0, 2 0, 3 0, 3 0, 2 0, 3 0, 2 0, 3 0, 3
Year < 0.01
   2016 0 (0.0%) 0 (0.0%) 0 (0.0%) 0 (0.0%) 0 (0.0%) 0 (0.0%) 0 (0.0%) 442 (30.9%) 442 (6.1%)
   2017 57 (18.8%) 125 (20.7%) 638 (22.2%) 328 (27.1%) 61 (24.0%) 77 (25.2%) 61 (25.4%) 490 (34.3%) 1837 (25.4%)
   2018 60 (19.8%) 135 (22.4%) 606 (21.0%) 284 (23.5%) 63 (24.8%) 82 (26.8%) 54 (22.5%) 0 (0.0%) 1284 (17.8%)
   2019 100 (33.0%) 173 (28.7%) 1054 (36.6%) 323 (26.7%) 66 (26.0%) 92 (30.1%) 63 (26.2%) 497 (34.8%) 2368 (32.8%)
   2020 86 (28.4%) 170 (28.2%) 582 (20.2%) 276 (22.8%) 64 (25.2%) 55 (18.0%) 62 (25.8%) 0 (0.0%) 1295 (17.9%)
Presented at a Scientific Meeting < 0.01
   Did not present at a meeting 219 (72.3%) 448 (74.3%) 2031 (70.5%) 850 (70.2%) 201 (79.1%) 204 (66.7%) 187 (77.9%) 1051 (73.5%) 5191 (71.8%)
   Presented at a meeting 84 (27.7%) 155 (25.7%) 849 (29.5%) 361 (29.8%) 53 (20.9%) 102 (33.3%) 53 (22.1%) 378 (26.5%) 2035 (28.2%)
Trained at a Top NIH-funded Medical School < 0.01
   Did not attend NIH top-funded medical school 251 (82.8%) 525 (87.1%) 1912 (66.4%) 708 (58.5%) 221 (87.0%) 217 (70.9%) 181 (75.4%) 800 (56.0%) 4815 (66.6%)
   Attended a NIH top-funded Medical School 52 (17.2%) 78 (12.9%) 968 (33.6%) 503 (41.5%) 33 (13.0%) 89 (29.1%) 59 (24.6%) 629 (44.0%) 2411 (33.4%)
Bachelor of Science Degree < 0.01
   B.S. 129 (42.6%) 276 (45.8%) 1184 (41.1%) 538 (44.4%) 92 (36.2%) 117 (38.2%) 114 (47.5%) 508 (35.5%) 2958 (40.9%)
   Not a B.S. degree 174 (57.4%) 327 (54.2%) 1696 (58.9%) 673 (55.6%) 162 (63.8%) 189 (61.8%) 126 (52.5%) 921 (64.5%) 4268 (59.1%)
Fluency in a Foreign Language < 0.01
   N Missing 86 170 582 276 64 55 62 497 1792
   Speaks English 63 (29.0%) 174 (40.2%) 529 (23.0%) 264 (28.2%) 58 (30.5%) 42 (16.7%) 58 (32.6%) 212 (22.7%) 1400 (25.8%)
   Speaks English and Another Language 154 (71.0%) 259 (59.8%) 1769 (77.0%) 671 (71.8%) 132 (69.5%) 209 (83.3%) 120 (67.4%) 720 (77.3%) 4034 (74.2%)
Certification in Advanced Cardiac Life Support < 0.01
   N Missing 86 170 582 276 64 55 62 497 1792
   No 70 (32.3%) 117 (27.0%) 1392 (60.6%) 586 (62.7%) 92 (48.4%) 135 (53.8%) 92 (51.7%) 563 (60.4%) 3047 (56.1%)
   Yes 147 (67.7%) 316 (73.0%) 906 (39.4%) 349 (37.3%) 98 (51.6%) 116 (46.2%) 86 (48.3%) 369 (39.6%) 2387 (43.9%)
Certification in Pediatric Advanced Life Support < 0.01
   N Missing 86 170 582 276 64 55 62 497 1792
   No 204 (94.0%) 395 (91.2%) 2181 (94.9%) 888 (95.0%) 174 (91.6%) 233 (92.8%) 159 (89.3%) 889 (95.4%) 5123 (94.3%)
   Yes 13 (6.0%) 38 (8.8%) 117 (5.1%) 47 (5.0%) 16 (8.4%) 18 (7.2%) 19 (10.7%) 43 (4.6%) 311 (5.7%)
Prior Residency Training < 0.01
   No 290 (95.7%) 580 (96.2%) 2686 (93.3%) 1167 (96.4%) 238 (93.7%) 288 (94.1%) 231 (96.2%) 1369 (95.8%) 6849 (94.8%)
   Yes 13 (4.3%) 23 (3.8%) 194 (6.7%) 44 (3.6%) 16 (6.3%) 18 (5.9%) 9 (3.8%) 60 (4.2%) 377 (5.2%)
Residency Tracks Applied < 0.01
   Applying for a Preliminary Position 143 (47.2%) 295 (48.9%) 1862 (64.7%) 604 (49.9%) 125 (49.2%) 147 (48.0%) 123 (51.2%) 939 (65.7%) 4238 (58.6%)
   Categorical Applicant 160 (52.8%) 308 (51.1%) 1018 (35.3%) 607 (50.1%) 129 (50.8%) 159 (52.0%) 117 (48.8%) 490 (34.3%) 2988 (41.4%)
Membership in the American Medical Association < 0.01
   No AMA Membership 210 (69.3%) 434 (72.0%) 1798 (62.4%) 758 (62.6%) 182 (71.7%) 197 (64.4%) 158 (65.8%) 951 (66.6%) 4688 (64.9%)
   American Medical Association Member 93 (30.7%) 169 (28.0%) 1082 (37.6%) 453 (37.4%) 72 (28.3%) 109 (35.6%) 82 (34.2%) 478 (33.4%) 2538 (35.1%)
Membership in the American College of Obstetricians and Gynecologists < 0.01
   No 196 (64.7%) 387 (64.2%) 1598 (55.5%) 662 (54.7%) 166 (65.4%) 168 (54.9%) 143 (59.6%) 857 (60.0%) 4177 (57.8%)
   Yes 107 (35.3%) 216 (35.8%) 1282 (44.5%) 549 (45.3%) 88 (34.6%) 138 (45.1%) 97 (40.4%) 572 (40.0%) 3049 (42.2%)
Received a Scholarship < 0.01
   No_scholarship 260 (85.8%) 492 (81.6%) 2246 (78.0%) 964 (79.6%) 213 (83.9%) 237 (77.5%) 193 (80.4%) 1197 (83.8%) 5802 (80.3%)
   Scholarship 43 (14.2%) 111 (18.4%) 634 (22.0%) 247 (20.4%) 41 (16.1%) 69 (22.5%) 47 (19.6%) 232 (16.2%) 1424 (19.7%)
Number of letter of recommendation OBGYN authors < 0.01
   Median 2 2 2 2 2 2 2 2 2
   Q1, Q3 1, 2 1, 2 2, 3 2, 3 1, 2 2, 3 1, 3 2, 3 2, 3
Number of letters of recommendation < 0.01
   Median 3 3 3 4 3 3 3 3 3
   Q1, Q3 0, 4 0, 4 3, 4 3, 4 0, 4 3, 4 0, 4 0, 4 1, 4
Number of first author publications < 0.01
   Median 0 0 0 0 0 0 0 0 0
   Q1, Q3 0, 1 0, 1 0, 3 0, 3 0, 1 0, 3 0, 2 0, 2 0, 2
Advanced Degree Other Than Medical Degree < 0.01
   No Addl Degree 278 (91.7%) 566 (93.9%) 2622 (91.0%) 1119 (92.4%) 229 (90.2%) 280 (91.5%) 226 (94.2%) 1358 (95.0%) 6678 (92.4%)
   MBA 0 (0.0%) 3 (0.5%) 17 (0.6%) 9 (0.7%) 1 (0.4%) 0 (0.0%) 0 (0.0%) 4 (0.3%) 34 (0.5%)
   Other 23 (7.6%) 34 (5.6%) 225 (7.8%) 78 (6.4%) 24 (9.4%) 26 (8.5%) 14 (5.8%) 61 (4.3%) 485 (6.7%)
   PhD 2 (0.7%) 0 (0.0%) 16 (0.6%) 5 (0.4%) 0 (0.0%) 0 (0.0%) 0 (0.0%) 6 (0.4%) 29 (0.4%)
Type of medical school < 0.01
   Osteopathic School 143 (47.2%) 345 (57.2%) 351 (12.2%) 77 (6.4%) 61 (24.0%) 73 (23.9%) 41 (17.1%) 136 (9.5%) 1227 (17.0%)
   U.S. Public School 76 (25.1%) 110 (18.2%) 1163 (40.4%) 562 (46.4%) 53 (20.9%) 96 (31.4%) 115 (47.9%) 786 (55.0%) 2961 (41.0%)
   International School 48 (15.8%) 96 (15.9%) 639 (22.2%) 145 (12.0%) 108 (42.5%) 55 (18.0%) 54 (22.5%) 190 (13.3%) 1335 (18.5%)
   U.S. Private School 36 (11.9%) 52 (8.6%) 727 (25.2%) 427 (35.3%) 31 (12.2%) 82 (26.8%) 30 (12.5%) 317 (22.2%) 1702 (23.6%)
   Osteopathic School,International School 0 (0.0%) 0 (0.0%) 0 (0.0%) 0 (0.0%) 1 (0.4%) 0 (0.0%) 0 (0.0%) 0 (0.0%) 1 (0.0%)
Number of Work Experiences < 0.01
   Median 2 2 2 2 2 2 2 1 2
   Q1, Q3 0, 4 0, 3 0, 4 0, 4 0, 4 0, 4 0, 4 0, 4 0, 4
Number of Volunteer Experiences < 0.01
   Median 5 5 5 6 4 6 5 4 5
   Q1, Q3 0, 9 0, 9 1, 9 1, 9 0, 7 1, 9 0, 9 0, 8 0, 9
Number of Research Experiences < 0.01
   Median 1 1 2 2 1 2 1 1 1
   Q1, Q3 0, 2 0, 2 0, 3 0, 3 0, 2 0, 3 0, 2 0, 3 0, 3
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####  Picks out which predictors have a coefficient > 0.   ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

#This code extracts the results of the LASSO (Least Absolute Shrinkage and Selection Operator) regularization from the cross-validated model fit (cvfit). The purpose of the LASSO regularization is to select the subset of predictor variables that have non-zero coefficients and are important for the outcome.

# The line coef2 <- stats::coef(cvfit, lambda = cvfit$lambda.min) computes the coefficients of the fit at the value of the penalty parameter that minimizes the cross-validation error (lambda.min).
# 
# ix <- which(abs(coef2) > 0) identifies the coefficients that are not zero.
# 
# selected_vars <- names(coef2)[ix] and selected_vars <- intersect(selected_vars, colnames(d)) find the variables in the data frame d that correspond to the non-zero coefficients.
# 
# non_existing_vars <- setdiff(selected_vars, colnames(d)) checks whether any of the variables in selected_vars do not exist in the data frame d. If there are any non-existing variables, a warning message is printed. If all of the variables exist in d, a new data frame reduced_d is created that contains only the columns in d that correspond to the variables selected by LASSO.
# 
# Finally, names(reduced_d) returns the names of the columns in the reduced_d data frame.

coef2 <- stats::coef(cvfit, lambda = cvfit$lambda.min) # View coefficients at the best lambda

ix <- which(abs(coef2) > 0) #These are the variables to include.  
length(ix)
#> [1] 29

#removes the intercept so you can compare 'coef2' and the 'd' column names with no difference
selected_vars <- names(coef2)[ix]
selected_vars <- setdiff(selected_vars, "(Intercept)")
selected_vars <- intersect(selected_vars, colnames(d))
reduced_d <- dplyr::select(d, dplyr::one_of(selected_vars))

#Function that determines whether the variables in selected_vars are found in d. If not, you get a warning. If they are all present, it creates a new dataframe called 'reduced_d' with only the variables that LASSO recommended.  
non_existing_vars <- setdiff(selected_vars, colnames(d))
if (length(non_existing_vars) > 0) {
  # Print an error message
  print(paste("The following variable names in 'selected_vars' do not exist in the 'd' data:", paste(non_existing_vars, collapse = ", ")))
} else {
  # Extract only the columns in d that are listed in selected_vars
  reduced_d <- d[, selected_vars, drop = FALSE]
}

This code creates two vectors, one called names_in_ix and another called not_in_ix. names_in_ix contains the names of variables that were included in the LASSO model, based on the variable indices stored in ix. not_in_ix contains the names of variables that were not included in the LASSO model, which are obtained by finding the difference between the names of all variables stored in coef2 and the names of variables included in names_in_ix.

names_in_ix <- names(coef2)[ix]                             #Variables included by LASSO
not_in_ix <- names(coef2)[!(names(coef2) %in% names_in_ix)] #These are the variables to REMOVE.

knitr_table_html(data = not_in_ix, caption = "Table of Variables Removed by the LASSO Grouped by Location Model.")
Table of Variables Removed by the LASSO Grouped by Location Model.
x
Count_of_Peer_Reviewed_Journal_Articles_Abstracts
reco_count
Visa_Sponsorship_Needed_Yes
Alpha_Omega_Alpha_Yes
Medical_Training_Discipline_Prior Residency Training
ACOG_ACOG Member
Advance_Degree_No Advanced Degree
knitr_table_html(data = names(ix), caption = "Table of Variables Included by the LASSO Grouped by Location Model.")
Table of Variables Included by the LASSO Grouped by Location Model.
x
(Intercept)
Age
Count_of_Peer_Reviewed_Journal_Articles_Abstracts
Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published
total_OBGYN_letter_writers
reco_count
number_of_applicant_first_author_publications
work_exp_count
Research_exp_count
US_or_Canadian_Applicant_Yes
Visa_Sponsorship_Needed_Yes
Medical_Education_or_Training_Interrupted_Yes
Alpha_Omega_Alpha_Yes
Gold_Humanism_Honor_Society_Yes
Meeting_Name_Presented_Presented at a meeting
TopNIHfunded_Attended a NIH top-funded Medical School
ACLS_Yes
PALS_Yes
Medical_Training_Discipline_Prior Residency Training
Tracks_Applied_by_Applicant_1_Categorical Applicant
AMA_American Medical Association Member
ACOG_ACOG Member
Advance_Degree_No Advanced Degree
Advance_Degree_Ph.D. 
Advance_Degree_Other
Type_of_medical_school_International School
Type_of_medical_school_Osteopathic School
Type_of_medical_school_Osteopathic School,International School
Type_of_medical_school_U.S. Private School
knitr_table_html(modelPerformance, caption = "Table of Model Characteristics for the LASSO Model.")

Can’t get this to work: Variables selected by group LASSO: Year vs. Year.

Imputation

Imputation is the process of replacing missing values with estimated ones. It is a crucial step in preparing the data for machine learning models as many algorithms do not work well with missing data. There are various methods for imputation such as mean imputation, median imputation, and regression imputation. By imputing missing values, we can ensure that the data is complete and ready for analysis.
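As a minimal sketch of one simple approach, median imputation of the numeric columns (illustrative only; impute_median is a hypothetical helper, not part of this pipeline):

# Replace each numeric NA with that column's median (sketch)
impute_median <- function(x) replace(x, is.na(x), stats::median(x, na.rm = TRUE))
d_imputed <- dplyr::mutate(d, dplyr::across(where(is.numeric), impute_median))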

Table of Split Train and Test

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####  Create tables displaying results of univariate analyses, stratified or not by categorical variable groupings ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### [tibshirani1996regression]
d1 <- d
d1$Year <- ifelse(d1$Year %in% c(2017,2018), "train", ifelse(d1$Year %in% c(2019,2020), "test", NA))

convert_to_factor(data = d1, 
                       variable = "Year", 
                       levels = c("train", "test"), 
                       labels = c("train", "test"), 
                       reference = "train")
d1$Year <- as.factor(d1$Year)

d1 <- d %>%
  dplyr::rename (c(#"Match Status" = Match_Status,
                    "Gender" = Gender,
                    "Age (years)" = Age,
                   "US or Canadian Applicant" = US_or_Canadian_Applicant,
                   "Visa Sponsorship Needed" = Visa_Sponsorship_Needed,
                   "Medical Education Interrupted"= Medical_Education_or_Training_Interrupted,
                   "Alpha Omega Alpha Membership" = Alpha_Omega_Alpha,
                   "Gold Humanism Honor Society Membership" = Gold_Humanism_Honor_Society,
                   "Participating in a Couples Match" = Couples_Match,
                   "Number of Oral Presentations" = Count_of_Oral_Presentation,
                   "Number of Peer-Reviewed Journal Articles or Abstracts" = Count_of_Peer_Reviewed_Journal_Articles_Abstracts,
                   "Number of Peer-Reviewed Non-published Journal_Articles" = Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published,
                   "Number of Poster Presentations" = Count_of_Poster_Presentation,
                   #"Applicant Match Year" = Year,
                   "Presented at a Scientific Meeting" = Meeting_Name_Presented,
                   "Trained at a Top NIH-funded Medical School" = TopNIHfunded,
                   "Bachelor of Science Degree" = Higher_Education_Degree,
                   #"Birth Place" = Birth_Place,
                   "Fluency in a Foreign Language" = Language_Fluency,
                   "Certification in Advanced Cardiac Life Support" = ACLS,
                   "Certification in Pediatric Advanced Life Support" = PALS,
                   "Prior Residency Training" = Medical_Training_Discipline,
                   "Residency Tracks Applied" = Tracks_Applied_by_Applicant_1,
                   "Membership in the American Medical Association" = AMA,
                   "Membership in the American College of Obstetricians and Gynecologists" = ACOG,
                   "Received a Scholarship" = Scholarship,
                   "Number of letter of recommendation OBGYN authors" = total_OBGYN_letter_writers,
                   "Number of letters of recommendation" = reco_count,
                   "Number of first author publications" = number_of_applicant_first_author_publications,
                   "Advanced Degree Other Than Medical Degree" = Advance_Degree,
                   "Type of medical school" = Type_of_medical_school,
                   "Number of Work Experiences" = work_exp_count,
                   "Number of Volunteer Experiences" = Volunteer_exp_count,
                   "Number of Research Experiences" = Research_exp_count))
  

# Create a table with the test results
t1_year <-  compareGroups::compareGroups(formula = Year ~ .,
                                    data = d1,
                                    method = 4,                  #Shapiro-Wilk test decides normal vs. non-normal
                                    na.action = "na.pass",
                                    alpha = 0.05)                #alpha significance threshold

# Create a table of the results of the test
t1tab_year <- compareGroups::createTable(x = t1_year, 
         digits.p = 3, sd.type = 1,  # Specify the number of decimal places for p-values 
                                     #and type of standard deviation
         show.p.overall = TRUE,      # Include an overall p-value in the table
         hide.no = "no")             # Don't hide any variables with "no" in the name

# Export the table as an HTML-formatted Markdown table
compareGroups::export2md(t1tab_year, 
                         strip = TRUE,  # shade alternating rows of the table
                         first.strip = TRUE,  # start the shading at the first row
                         digits = 1L,   # specify the number of decimal places
                         caption = "Description of Applicants by Application Year",  # add a caption to the table
                         booktabs = TRUE, width = "400px",  # use booktabs formatting and specify the width of the table
                         format = "html")  # output as HTML
Description of Applicants by Application Year
2016 2017 2018 2019 2020 p.overall
N=442 N=1837 N=1284 N=2368 N=1295
Match_Status: <0.001
Not_Matched 148 (33.5%) 777 (42.3%) 573 (44.6%) 1076 (45.4%) 733 (56.6%)
Matched 294 (66.5%) 1060 (57.7%) 711 (55.4%) 1292 (54.6%) 562 (43.4%)
Gender: 0.028
Female 361 (81.7%) 1458 (79.4%) 1053 (82.0%) 1963 (82.9%) 1077 (83.2%)
Male 81 (18.3%) 379 (20.6%) 231 (18.0%) 405 (17.1%) 218 (16.8%)
Age (years) 27.0 [26.0;29.0] 28.0 [26.0;30.0] 28.0 [27.0;30.0] 28.0 [26.0;30.0] 28.0 [27.0;30.0] <0.001
US or Canadian Applicant 369 (83.5%) 1409 (76.7%) 1033 (80.5%) 2022 (85.4%) 1058 (81.7%) <0.001
Visa Sponsorship Needed 21 (4.75%) 136 (7.40%) 96 (7.48%) 128 (5.41%) 72 (5.56%) 0.011
Medical Education Interrupted 58 (13.1%) 271 (14.8%) 181 (14.1%) 325 (13.7%) 164 (12.7%) 0.546
Alpha Omega Alpha Membership 56 (12.7%) 183 (9.96%) 100 (7.79%) 269 (11.4%) 100 (7.72%) <0.001
Gold Humanism Honor Society Membership 76 (17.2%) 228 (12.4%) 155 (12.1%) 389 (16.4%) 191 (14.7%) <0.001
Participating in a Couples Match 50 (11.3%) 166 (9.04%) 86 (6.70%) 203 (8.57%) 106 (8.19%) 0.030
Number of Oral Presentations 0.00 [0.00;1.00] 0.00 [0.00;1.00] 0.00 [0.00;1.00] 0.00 [0.00;1.00] 0.00 [0.00;1.00] <0.001
Number of Peer-Reviewed Journal Articles or Abstracts 0.00 [0.00;1.00] 0.00 [0.00;1.00] 0.00 [0.00;1.00] 0.00 [0.00;1.00] 0.00 [0.00;1.00] <0.001
Number of Peer-Reviewed Non-published Journal_Articles 0.00 [0.00;0.00] 0.00 [0.00;0.00] 0.00 [0.00;0.00] 0.00 [0.00;0.00] 0.00 [0.00;0.00] <0.001
Number of Poster Presentations 1.00 [0.00;2.00] 1.00 [0.00;2.00] 1.00 [0.00;3.00] 1.00 [0.00;3.00] 1.00 [0.00;3.00] <0.001
Location: 0.000
BSW 0 (0.00%) 57 (3.10%) 60 (4.67%) 100 (4.22%) 86 (6.64%)
CCAG 0 (0.00%) 125 (6.80%) 135 (10.5%) 173 (7.31%) 170 (13.1%)
CU 0 (0.00%) 638 (34.7%) 606 (47.2%) 1054 (44.5%) 582 (44.9%)
Duke 0 (0.00%) 328 (17.9%) 284 (22.1%) 323 (13.6%) 276 (21.3%)
Truman 0 (0.00%) 61 (3.32%) 63 (4.91%) 66 (2.79%) 64 (4.94%)
U_Washington 0 (0.00%) 77 (4.19%) 82 (6.39%) 92 (3.89%) 55 (4.25%)
UAB 0 (0.00%) 61 (3.32%) 54 (4.21%) 63 (2.66%) 62 (4.79%)
Utah 442 (100%) 490 (26.7%) 0 (0.00%) 497 (21.0%) 0 (0.00%)
Presented at a Scientific Meeting: <0.001
Did not present at a meeting 278 (62.9%) 1172 (63.8%) 807 (62.9%) 1639 (69.2%) 1295 (100%)
Presented at a meeting 164 (37.1%) 665 (36.2%) 477 (37.1%) 729 (30.8%) 0 (0.00%)
Trained at a Top NIH-funded Medical School: <0.001
Did not attend NIH top-funded medical school 250 (56.6%) 1226 (66.7%) 894 (69.6%) 1529 (64.6%) 916 (70.7%)
Attended a NIH top-funded Medical School 192 (43.4%) 611 (33.3%) 390 (30.4%) 839 (35.4%) 379 (29.3%)
Bachelor of Science Degree: <0.001
B.S. 271 (61.3%) 1022 (55.6%) 655 (51.0%) 1010 (42.7%) 0 (0.00%)
Not a B.S. degree 171 (38.7%) 815 (44.4%) 629 (49.0%) 1358 (57.3%) 1295 (100%)
Fluency in a Foreign Language: <0.001
Speaks English 110 (24.9%) 415 (22.6%) 344 (26.8%) 671 (28.3%) 398 (30.7%)
Speaks English and Another Language 332 (75.1%) 1422 (77.4%) 940 (73.2%) 1697 (71.7%) 897 (69.3%)
Certification in Advanced Cardiac Life Support 177 (40.0%) 789 (43.0%) 593 (46.2%) 1019 (43.0%) 605 (46.7%) 0.028
Certification in Pediatric Advanced Life Support 20 (4.52%) 108 (5.88%) 82 (6.39%) 108 (4.56%) 22 (1.70%) <0.001
Prior Residency Training 30 (6.79%) 176 (9.58%) 103 (8.02%) 68 (2.87%) 0 (0.00%) <0.001
Residency Tracks Applied: 0.000
Applying for a Preliminary Position 442 (100%) 1251 (68.1%) 277 (21.6%) 973 (41.1%) 1295 (100%)
Categorical Applicant 0 (0.00%) 586 (31.9%) 1007 (78.4%) 1395 (58.9%) 0 (0.00%)
Membership in the American Medical Association: <0.001
No AMA Membership 222 (50.2%) 1019 (55.5%) 684 (53.3%) 1468 (62.0%) 1295 (100%)
American Medical Association Member 220 (49.8%) 818 (44.5%) 600 (46.7%) 900 (38.0%) 0 (0.00%)
Membership in the American College of Obstetricians and Gynecologists 263 (59.5%) 970 (52.8%) 685 (53.3%) 1131 (47.8%) 0 (0.00%) <0.001
Received a Scholarship: <0.001
No_scholarship 326 (73.8%) 1379 (75.1%) 961 (74.8%) 1841 (77.7%) 1295 (100%)
Scholarship 116 (26.2%) 458 (24.9%) 323 (25.2%) 527 (22.3%) 0 (0.00%)
Number of letter of recommendation OBGYN authors 2.00 [2.00;3.00] 2.00 [1.00;3.00] 2.00 [1.00;3.00] 2.00 [2.00;3.00] 2.00 [2.00;3.00] <0.001
Number of letters of recommendation 4.00 [3.00;4.00] 4.00 [3.00;4.00] 4.00 [3.00;4.00] 3.00 [3.00;4.00] 0.00 [0.00;0.00] 0.000
Number of first author publications 1.00 [0.00;3.00] 1.00 [0.00;3.00] 1.00 [0.00;3.00] 0.00 [0.00;3.00] 0.00 [0.00;0.00] <0.001
Advanced Degree Other Than Medical Degree: .
No Addl Degree 441 (99.8%) 1735 (94.4%) 1086 (84.6%) 2121 (89.6%) 1295 (100%)
MBA 0 (0.00%) 6 (0.33%) 11 (0.86%) 17 (0.72%) 0 (0.00%)
Other 0 (0.00%) 86 (4.68%) 178 (13.9%) 221 (9.33%) 0 (0.00%)
PhD 1 (0.23%) 10 (0.54%) 9 (0.70%) 9 (0.38%) 0 (0.00%)
Type of medical school: .
Osteopathic School 34 (7.69%) 238 (13.0%) 275 (21.4%) 419 (17.7%) 261 (20.2%)
U.S. Public School 239 (54.1%) 729 (39.7%) 464 (36.1%) 1052 (44.4%) 477 (36.8%)
International School 72 (16.3%) 425 (23.1%) 250 (19.5%) 348 (14.7%) 240 (18.5%)
U.S. Private School 97 (21.9%) 444 (24.2%) 295 (23.0%) 549 (23.2%) 317 (24.5%)
Osteopathic School,International School 0 (0.00%) 1 (0.05%) 0 (0.00%) 0 (0.00%) 0 (0.00%)
Number of Work Experiences 3.00 [1.00;5.00] 3.00 [1.00;5.00] 3.00 [1.00;5.00] 2.00 [0.00;4.00] 0.00 [0.00;0.00] 0.000
Number of Volunteer Experiences 7.00 [5.00;10.0] 6.00 [4.00;9.00] 7.00 [4.00;10.0] 6.00 [1.00;9.00] 0.00 [0.00;0.00] 0.000
Number of Research Experiences 2.00 [1.00;3.00] 2.00 [1.00;3.00] 2.00 [1.00;3.00] 2.00 [0.00;3.00] 0.00 [0.00;0.00] 0.000

Splitting the Data into Training and Test Sets

In some cases, our data is naturally separated into two sets, one of which can be used to fit a model and the other to evaluate it. A common example of this is when data has been collected during two distinct time periods, and the older data is used to fit a model that is evaluated on the newer data, to see if historical data can be used to predict the future.
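A minimal sketch of such a time-based split as used here, where the 2017-2018 application cycles (1,837 + 1,284 = 3,121 applicants) form the training set and 2019-2020 (2,368 + 1,295 = 3,663) the test set; the actual train_match and test_match objects are created upstream of this section:

# Illustrative time-based split (objects already exist in this pipeline):
# train_match <- subset(d, Year %in% c(2017, 2018))
# test_match  <- subset(d, Year %in% c(2019, 2020))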

Here we check the sample sizes and proportions of the target (“Match_Status”) in the training and test sets; the proportions should be approximately the same across the training set, the test set, and the whole dataset. Do NOT up- or down-sample the training set: the evaluation metric is the Brier score, and there is no evidence that balancing the data improves out-of-sample performance. In fact, based on my experiments, it makes the models’ performance much worse. For more explanation, see this link: https://stats.stackexchange.com/a/474431

In general, a lower Brier score indicates a better model. Therefore, in this case, the code should aim to minimize the Brier score. This is reflected in the maximize argument of the caret::train function, which is set to FALSE to minimize the Brier score.

The Brier score is a measure of the accuracy of probabilistic predictions, and it is defined as the mean squared difference between the predicted probabilities and the observed outcomes. A model with a low Brier score is able to make more accurate probabilistic predictions, as the predicted probabilities are closer to the observed outcomes.

In this case, the Brier score is used as an evaluation metric to compare the performance of different models: the model with the lower Brier score makes more accurate probabilistic predictions and is generally preferred.
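Formally, for N predictions with predicted probabilities p̂_i and observed outcomes y_i ∈ {0, 1}, the Brier score is

$$\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i - y_i)^2,$$

so a perfectly calibrated and confident classifier scores 0, while a confidently wrong one scores 1.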

# check sample sizes and proportions of target=Y in training and test sets
# the proportions of target=Y should be approximately the same across the 
# training, test, and the whole dataset.

# Calculate number of observations in training and test sets
ntest <- nrow(test_match)
ntrain <- nrow(train_match)

# Print the number of observations in each set
cat("The training set has", ntrain, "observations.\n")
#> The training set has 3121 observations.
cat("The testing set has", ntest, "observations.\n\n")
#> The testing set has 3663 observations.
# Calculate and print the proportion of target=Y in the training set
prop_train <- sum(train_match$Match_Status == "Matched") / ntrain
cat("Proportion of target=Y in the training set:", prop_train, "\n\n")
#> Proportion of target=Y in the training set: 0.567
# Calculate and print the proportion of target=Y in the test set
prop_test <- sum(test_match$Match_Status == "Matched") / ntest
cat("Proportion of target=Y in the test set:", prop_test, "\n\n")
#> Proportion of target=Y in the test set: 0.506
invisible(gc())
###
#This code first calculates the proportion of target=Y in the training and test sets, then creates a data frame with these values. Finally, it uses ggplot2 to create a bar chart with the proportions for the training and test sets. The bars are filled with light blue color and the chart has appropriate labels for the title, x-axis, and y-axis.

# Calculate proportions of target=Y in training and test sets
prop_train <- nrow(train_match[train_match$Match_Status=="Matched",]) / (nrow(train_match[train_match$Match_Status=="Matched",]) + nrow(train_match[train_match$Match_Status=="Not_Matched",]))
prop_test <- nrow(test_match[test_match$Match_Status=="Matched",]) / (nrow(test_match[test_match$Match_Status=="Matched",]) + nrow(test_match[test_match$Match_Status=="Not_Matched",]))

# Create data frame with proportions for training and test sets
proportions_df <- data.frame(set = c("Training", "Test"), proportion = c(prop_train, prop_test))

# Plot proportions using ggplot2
ggplotly(ggplot2::ggplot(proportions_df, aes(x = set, y = proportion)) +
  geom_col(fill = "lightblue", color = "black", width = 0.5) +
  geom_text(aes(label = sprintf("%.2f%%", 100 * proportion)),
            hjust = 1.5, vjust = -0.5, color = "black", size = 8) +
  labs(title = "Proportion of Target=Match_Status in Training and Test Sets",
       x = "Splits", y = "Proportion of Applicants who Matched"))

Feature Exploration

The next step is to explore potential predictive relationships between individual predictors and the response and between pairs of predictors and the response.

  # `a` and `b` are two models from `train()`
  compare_models_1way <- function(a, b, metric = a$metric[1], ...) {
    # Create a list of the models to compare
    mods <- list(a, b)
    # Use caret::resamples to compare the models
    rs <- caret::resamples(mods)
    # Calculate the difference in performance between the models
    diffs <- base::diff(rs, metric = metric[1], ...)
     # Return the first element of the first element of the statistics list
    diffs$statistics[[1]][[1]]
  }

  
VC_preds <- names(select_if(d, is.numeric)) 
risk_preds <- names(select_if(d, is.factor))

write.csv(VC_preds, file = "output/csv/features_VC_preds.csv")
write.csv(risk_preds, file = "output/csv/features_risk_preds.csv")

# ------------------------------------------------------------------------------
# Create a "null model" with no predictors to get baseline performance ---------
# This model will be used as a reference point to compare the performance of other models against
null_mat <- data.frame(intercept = rep_len(1, nrow(train_match)))

#Set up trainControl object to specify the type of cross-validation to use
ctrl <- trainControl(method = "repeatedcv", number = 5, classProbs = TRUE, summaryFunction = BigSummary)

#Train the null model using a generalized linear model with Brier metric
null_mod <- caret::train(x = null_mat, 
                  y = train_match$Match_Status, 
                  method = "glm", 
                  metric = "Brier", 
                  trControl = ctrl)

#It might be good to print the results of the null model performance so it can be visualized and compared
print(null_mod)
#> Generalized Linear Model 
#> 
#> 3121 samples
#>    1 predictor
#>    2 classes: 'Not_Matched', 'Matched' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 1 times) 
#> Summary of sample sizes: 2497, 2496, 2497, 2497, 2497 
#> Resampling results:
#> 
#>   Acc    Kappa  AUCROC  AUCPR  Brier  Precision  Recall  F    
#>   0.567  0      0.5     0      0.245  0.567      1       0.724
# ------------------------------------------------------------------------------
# Compare the models with single predictors to the risk model. These data make
# https://bookdown.org/max/FES/stroke-tour.html#tab:stroke-strokeRiskAssociations
# `VC_preds` and `risk_preds` contain the predictor names for different sets.

one_predictor_res <- data.frame(Predictor = setdiff(names(train_match), "Match_Status"), 
             Improvement = NA,
             Pvalue = NA,
             Brier = NA,
             stringsAsFactors = FALSE)

#Removes the Outcome from the dataframe.  
one_predictor_res <- one_predictor_res %>% 
  dplyr::filter(Predictor != "Match_Status")

#Loop through each predictor variable
for (i in 1:nrow(one_predictor_res)) {
  set.seed(1978)
  #Train a model using the current predictor variable
  var_mod <- caret::train(Match_Status ~ ., 
                   data = train_match[, c("Match_Status", one_predictor_res$Predictor[i])], 
                   method = "glm", 
                   metric = "Brier",
                   trControl = ctrl)  
  
  #Compare the model with the null model
  tmp_diff <- compare_models_1way(var_mod, null_mod, alternative = "greater")
  
  #Store the Brier score, improvement in model performance, and p-value of the improvement in the data frame
  one_predictor_res$Brier[i] <- caret::getTrainPerf(var_mod)[1, "TrainBrier"]
  one_predictor_res$Improvement[i] <- tmp_diff$estimate
  one_predictor_res$Pvalue[i] <- tmp_diff$p.value
}

Two logistic regression models were considered. The “null model” contains only an intercept term, while the more complex model adds a single term for an individual predictor from the risk set. These results are presented in the table below, which orders the risk predictors from most significant to least significant in terms of improvement in Brier score. For the training data, several predictors provide marginal but significant improvement in predicting the Match_Status outcome, as measured by the improvement in Brier score. Based on these results, our intuition would lead us to believe that the significant categorical predictors will likely be integral to the final predictive model.

# ------------------------------------------------------------------------------
# Data in table 2.3
# https://bookdown.org/max/FES/stroke-tour.html#tab:stroke-strokeRiskAssociations

one_predictor_res %>% 
  dplyr::filter(Predictor %in% risk_preds) %>% 
  dplyr::arrange(Brier)
write.csv(one_predictor_res, file = "output/csv/one_predictor_res.csv"); invisible(gc())
# ------------------------------------------------------------------------------
# Figure 2.4
# https://bookdown.org/max/FES/stroke-tour.html#fig:stroke-vascuCAPAssocations
vc_pred <-
  recipes::recipe(Match_Status ~ ., data = d %>% dplyr::select(Match_Status, !!!VC_preds)) %>%
  recipes::prep(train_match %>%
  dplyr::select(Match_Status, !!!VC_preds)) %>%
  recipes::juice() %>%
  tidyr::gather(Predictor, value, -Match_Status)
# get_max value per predictor

pred_max <- vc_pred %>% 
  dplyr::inner_join(one_predictor_res %>% dplyr::select(Pvalue, Predictor)) %>%
  dplyr::group_by(Predictor) %>% 
  dplyr::summarize(max_val = max(value), Pvalue = min(Pvalue)) %>% 
  dplyr::mutate(x = 1.5, value = 1.25 * max_val,
    label = paste0("p-value: ", format.pval(Pvalue, digits = 2, sci = FALSE, eps = .0001)))
###
fig_2_4 <- ggplot2::ggplot(vc_pred, aes(x = Match_Status, y = value)) +
  geom_boxplot() +
  geom_point(alpha = 0.3, cex = .5) +
  geom_text(data = pred_max, aes(x = x, label = label), size = 3) +
  facet_wrap(~Predictor, scales = "free_y") +
  ylab("")
fig_2_4

#Univariate associations between the numeric predictors and match status. The p-value of the improvement in Brier score for each predictor over the intercept-only logistic regression model is listed in the top center of each facet.

ggplot2::ggsave(fig_2_4, 
       filename = "output/fig/fig_2_4_univariate_associations_numeric_match_status.tiff",
       units = c("cm"),
       width = 20, height = 20, 
       dpi = 800); invisible(gc())
# ------------------------------------------------------------------------------
# Interaction exploration
#This code creates a data frame of all pairs of elements from the VC_preds object, transposes the resulting matrix, and converts it to a data frame. Then, it adds three new columns to the data frame called Improvement, Pvalue, and Brier, and sets their values to NA.  The utils::combn function generates a matrix with all possible combinations of elements from a given vector. The t function transposes the resulting matrix. The as.data.frame function converts the matrix to a data frame. Finally, the dplyr::mutate function adds the three new columns to the data frame and sets their values to NA.


pairs <- utils::combn(VC_preds, 2) %>% 
  t() %>% 
  as.data.frame(stringsAsFactors = FALSE) %>% 
  dplyr::mutate(Improvement = NA, Pvalue = NA, Brier = NA)

#Creates a vector called tmp_vars that consists of the Match_Status column and the two columns identified by the current row of pairs. 
#Uses the caret::train function to train two models: main_eff and main_int. The first is a main-effects-only logistic regression, while the second also includes the interaction term. Both models are trained on the train_match data frame, using only the columns identified by tmp_vars.
# Uses the compare_models_1way function to compare the two models. This function returns the estimate of the difference between the models and the p-value of the difference.
# Stores the Brier score of the main_eff model, the estimate of the difference between the models, and the p-value of the difference in the appropriate columns of the pairs data frame.
# The loop repeats these steps for each row of the pairs data frame. At the end of the loop, the pairs data frame has three filled-in columns with the Brier scores, improvement estimates, and p-values for each pair of variables.

for (i in 1:nrow(pairs)) {
  tmp_vars <- c("Match_Status", pairs$V1[i], pairs$V2[i])
  set.seed(1978)
  main_eff <- caret::train(Match_Status ~ ., 
                    data = train_match[, tmp_vars], 
                    method = "glm", 
                    metric = "Brier",
                    trControl = ctrl)
  set.seed(1978)
  main_int <- caret::train(Match_Status ~ (.)^2, 
                    data = train_match[, tmp_vars], 
                    method = "glm", 
                    metric = "Brier", 
                    trControl = ctrl)  
  
  tmp_diff <- compare_models_1way(main_int, main_eff, alternative = "greater")
  pairs$Brier[i] <- caret::getTrainPerf(main_eff)[1, "TrainBrier"]
  pairs$Improvement[i] <- tmp_diff$estimate
  pairs$Pvalue[i] <- tmp_diff$p.value
}

#This code filters the pairs data frame to retain only those rows that meet two criteria:The value in the Brier column is less than 0.25. The value in the Pvalue column is less than or equal to the value of the p_value variable (which is 0.2 in this case). The resulting data frame is stored in the retained_pairs object. The dplyr::filter function is used to select only those rows that meet the specified criteria. The resulting retained_pairs data frame will contain only those pairs of variables that have a Brier score less than 0.25 and a p-value less than or equal to 0.2.

p_value = 0.2
retained_pairs <- 
  pairs %>% 
  dplyr::filter(Brier < 0.25  & Pvalue <= p_value) %>%
  dplyr::arrange(Brier) %>%
  dplyr::rename("Variable 1" = "V1") %>%
  dplyr::rename("Variable 2" = "V2")
knitr_table_html(data = retained_pairs, caption = "No significant improvement in Brier Score with Interactions")
No significant improvement in Brier Score with Interactions
Variable 1 Variable 2 Improvement Pvalue Brier
Age Volunteer_exp_count 0.000 0.178 0.218
Age work_exp_count 0.000 0.085 0.222
total_OBGYN_letter_writers Research_exp_count 0.000 0.081 0.222
total_OBGYN_letter_writers number_of_applicant_first_author_publications 0.000 0.165 0.226
work_exp_count Volunteer_exp_count 0.000 0.118 0.232
Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published Volunteer_exp_count 0.000 0.120 0.234
Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published Research_exp_count 0.000 0.090 0.235
reco_count Volunteer_exp_count 0.000 0.041 0.235
Count_of_Poster_Presentation work_exp_count 0.000 0.006 0.239
reco_count work_exp_count 0.000 0.105 0.242
reco_count number_of_applicant_first_author_publications 0.000 0.185 0.242
Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published reco_count 0.000 0.092 0.244
Count_of_Oral_Presentation Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published 0.000 0.016 0.244
Count_of_Oral_Presentation reco_count 0.000 0.013 0.245
Count_of_Peer_Reviewed_Journal_Articles_Abstracts reco_count 0.000 0.035 0.245
Count_of_Oral_Presentation Count_of_Peer_Reviewed_Journal_Articles_Abstracts 0.001 0.121 0.245

Volcano Plot

The improvement in Brier score was compared with the p-value of the interaction term for each pair of numerical predictors. The improvement from adding interactions was at most 0.001 in Brier score, indicating a minimal gain in accuracy. This suggests that the interaction terms for the numerical predictors do not contribute meaningfully to improving the Brier score.

# ------------------------------------------------------------------------------
# Figure 2.6
# https://bookdown.org/max/FES/stroke-tour.html#fig:stroke-interactionScreening

#The scatterplot displays the Improvement values on the x-axis and the negative logarithm of the Pvalue values on the y-axis. The size of
#the points is determined by the Brier values, and the text label for each point combines the two variables (V1 and V2) with the
#corresponding Brier score and p-value.
#The code first filters the pairs data frame to retain only rows where the Brier value is less than 0.22. It then adds a new column
#called Term, a string containing the two variables, the Brier score, and the p-value.
#Next, ggplot2::ggplot creates a scatterplot with the Improvement and -log10(Pvalue) values on the x- and y-axes. aes maps point size to
#the Brier values and text labels to the Term values; xlab sets the label for the x-axis; geom_point adds the points with an alpha value
#of 0.2 to make them partially transparent.
#Finally, plotly::ggplotly converts the scatterplot to an interactive plotly object with the tooltip set to the Term values. The result
#is stored in vol_plot and displayed.

vol_data <- pairs %>% 
  dplyr::filter(Brier < 0.22) ## select only rows with Brier score < 0.22

# create a new column that combines the interaction term, Brier score, and p-value
vol_data <- vol_data %>%
  dplyr::mutate(Term = paste(V1, "\nby\n", V2, "\nBrier:", round(Brier, 2), 
                             "\n-log10(p-value):", round(-log10(Pvalue), 2)))

# create the plot
vol_plot <- 
  ggplot2::ggplot(vol_data, aes(x = Improvement, y = -log10(Pvalue))) + 
  xlab("Improvement") +
  ylab("-log10(p-value)") +
  ggtitle("Volcano Plot of Interaction Effects")+
  geom_point(alpha = .2, aes(size = Brier, color = Brier), shape = 21) +
  geom_text(aes(label = Term), hjust = -0.2, vjust = -0.2) +
  scale_size_continuous(range = c(1, 1))

# use ggplotly for interactive version 
vol_plot <- plotly::ggplotly(vol_plot) %>% 
  plotly::layout(xaxis = list(title = "Improvement", autorange = "reversed"), 
                 yaxis = list(title = "-log10(p-value)", autorange = "reversed")); vol_plot
#A volcano plot is a scatter plot that is used to quickly identify the genes or features that show the largest changes in expression or activity between two or more conditions. The plot typically uses the logarithm of the p-value on the y-axis and the logarithm of the fold change (FC) or some other measure of effect size on the x-axis. The name "volcano" comes from the characteristic shape of the plot, where there is a large number of genes or features with small or insignificant changes in expression, which form the volcano's slope, and a smaller number of genes or features with large or significant changes in expression, which form the volcano's peak. The volcano plot is useful for identifying the genes or features that are likely to be most important in a given biological process or disease state and for filtering large lists of differentially expressed genes or features.

Baseline Performance

The Brier score is a widely used evaluation metric for binary classifiers. It measures the mean squared error between the predicted probabilities and the true target values (0’s or 1’s); in other words, it quantifies how well the model predicts binary outcomes, and a smaller Brier score indicates better performance. Because it accounts for both the accuracy and the calibration of the predicted probabilities, the Brier score provides a clear and concise measure of the performance of binary classifiers.

# Count number of "Not_Matched" and "Matched" values
counts <- table(d$Match_Status)

# Calculate negative baseline performance
base.neg <- counts["Not_Matched"] / sum(counts)

# Calculate positive baseline performance
base.pos <- counts["Matched"] / sum(counts)

Model performance and utility will be judged on how much more accurately the model can predict a residency match than a human. Since the target of the prediction is binary (Matched or Not_Matched), we can assume a baseline of 50% for random chance. Next we want to establish a rather modest human baseline. If a human were to predict “Matched” for every applicant, they would be accurate 54.23% of the time, a 4.23% increase over chance. Though we can imagine more elaborate methods to improve human performance, we will use this method to establish a baseline. Human performance could just as easily fall below random chance: predicting “Not_Matched” for every applicant would drop accuracy to 45.77%. Let’s give the benefit of the doubt, however, and use the higher accuracy. Therefore, in order to create a successful model, we must aim for an accuracy greater than 54.23%. Even a slight improvement in performance is not inconsequential, given the nature of the application.
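Using the counts computed above (3,919 of 7,226 applicants matched), the all-“Matched” baseline accuracy can be checked directly (illustrative; the printed value is the expected result, not captured output):

# Accuracy of always predicting "Matched": 3919 / 7226
round(100 * unname(base.pos), 2)
# Expected value: 54.23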

# convert the target values on the test set to 0's and 1's
ytest = ifelse(test.dat[[yvar]]=='Matched', 1, 0)
mse = function(ytru, yhat) mean((yhat - ytru)^2)

#The function brier_score_binary takes two inputs:
# ytru: the true binary outcome
# pred_prob: the predicted probability of the positive class
# It uses the mse function to compute the mean squared error between the true binary outcome and the predicted probability of the positive class, which is the Brier score.
# In the next line, pred_probs is assigned a vector of 1s with the same number of rows as the test.dat data frame. This vector represents a naive prediction that assigns a probability of 1 to every observation in the test set being in the positive class.
# 
# test_brier_naive is assigned the output of brier_score_binary, where ytest is the true binary outcome and pred_probs is the vector of 1s.

brier_score_binary = function(ytru, pred_prob) mse(ytru, pred_prob)
pred_probs = rep(1, nrow(test.dat))
test_brier_naive = brier_score_binary(ytest, pred_probs)

A naive prediction of all matches gives a test Brier score of 0.494, which is simply the proportion of Not_Matched applicants in the test set (1 − 0.506).