The number of NAs in the ‘d’ dataframe is: 0.
Here we create a feature matrix where the categorical features are
converted to numeric with one-hot encoding, and it’ll be used in
glmnet
when we train logit models with shrinkage.
Group LASSO is used in preference to LASSO when features are naturally grouped into clusters, meaning that they are related in a certain way: ‘Locaiton’ or ‘Year’. In such cases, LASSO would treat each feature individually and may not capture the relationship between the features within the same group. Group LASSO, on the other hand, allows for the selection of groups of features, rather than individual features, leading to a more interpretable model. This can be especially useful when dealing with high dimensional data, where the number of features is much larger than the number of observations. By penalizing the magnitude of the coefficients of the entire group rather than the magnitude of each coefficient individually, group LASSO can produce sparse models while preserving the relationship between features.
To handle this, feature selection was performed using LASSO. LASSO is a popular feature selection method that focuses on eliminating irrelevant or redundant features by shrinking their coefficients to zero. To use LASSO in logit models with shrinkage, it’s necessary to convert categorical features into numeric ones. This can be achieved through the process of one-hot encoding. The resulting feature matrix is then used in the glmnet package to train logit models with shrinkage.
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
##### One hot encoding to dummify the variables ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
#This code is performing one-hot encoding, also known as "dummifying", on the factors variables in the dataframe "reduced_Data". The function "select_if" is used to select only the factor variables in the dataframe. The function "dummy_cols" is then used to create new columns for each level of the factors variables, with the levels being represented by binary variables (0 or 1). The option "remove_first_dummy" is set to true so that one level of each factor variable is removed to prevent collinearity issues. The option "remove_selected_columns" is also set to true so that the original factor columns are removed from the dataframe. Finally, the option "select_columns" is set to the names of the factor columns. The final step is to remove the first column of the dummies to avoid collinearity issues. The last line is to update the column names of the dummies.
d_factor2 <- dplyr::select_if(reduced_Data, is.factor)
dummies <- dummy_cols(d_factor2,
remove_first_dummy = T, #remove the first dummy variable column to prevent multicollinearity
remove_selected_columns = T, #remove the original factor columns from the dataframe
select_columns = colnames(d_factor2))
dummies <- dummies[,-1]
The step of “dropping reference dummies” refers to the fact that in group lasso, each group of features is assigned a single penalty term. To account for the different scales of the features in each group, a reference dummy is often added to each group. The reference dummy is typically excluded from the penalty term, so that the size of the penalty reflects the size of the other features in the group.
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
##### Drop the reference dummies ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
dummies <- dummies %>%
dplyr::select(-starts_with(c("Location_", "Year_", "Match_Status_")))
The third step is to combine the dummy variables with the numeric variables and convert them into a matrix X. This matrix will be used in the glmnet function, which is a penalized regression method. When using a binary categorical outcome, the only difference is to specify family = “binomial” in the glmnet function.
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
##### Convert to a matrix because the models need the data in a particular way ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
d2 <-dplyr::select_if(reduced_Data, is.numeric)
X <- cbind(d2, dummies)
X <- as.matrix(X) #For {glmnet}, must use predictors as matrix and response as vector
In this step, the predictors and the outcome variable are separated and prepared for modeling. The predictors are combined into a matrix called X, while the outcome variable is put into a vector called Y. This separation is a common step in many machine learning algorithms to prepare the data for modeling.
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
##### Create your X matrix (predictors) and Y vector (outcome variable) ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
d_factor2$Match_Status <- ifelse(d_factor2$Match_Status == "Matched", 1, 0)
y <- d_factor2[,1]
The group vector is used to distinguish groups of predictors in the data. It separates the predictors into different groups based on some criteria such as the type of predictor or the relationship between predictors. This step is necessary for the group LASSO step, as it allows for the lasso to apply different penalties to different groups of predictors.
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
##### Create the grouping variable that is 'Location' ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
group <- dummies
colnames(group) <- sub("_[^_]+$", "", colnames(group))
group <- cbind(d2, group)
group <- colnames(group)
group <- ifelse(group == "Location_U", "Location", group)
‘glmnet’ is used to perform LASSO. In this step, k-fold cross-validation is conducted over a range of values for the regularization parameter lambda, which determines the amount of penalty to be applied. The goal is to find the value of lambda that results in the minimum mean squared error (MSE) using cross-validation. This helps to determine the optimal level of regularization for the model.
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #####
# Performs k-fold cross validation for penalized regression models with grouped covariates over a grid of values for the regularization parameter lambda. ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
#In this code, data is a data frame containing the predictors (stored in the predictors variable) and response (stored in the response variable) variables. grouping_vector is a vector that describes the grouping of the coefficients. X is a design matrix created from the predictors in data, y is the response vector, and group is the vector describing the grouping of the coefficients. The grpreg function is used to fit a penalized regression model with grouped covariates, and the cv.grpreg function is used to perform k-fold cross validation on the model to tune the regularization parameter lambda. The summary function is used to print a summary of the cross-validated model, and the plot function is used to visualize the path trajectory of the fitted sparse regression parameters.
# Fit penalized regression model with grouped covariates
fit <- grpreg::grpreg(X, y, group, penalty="grLasso", family="binomial")
plot(fit, labels = TRUE,
legend = TRUE,
xlab = "Penalty Parameter (Lambda)",
ylab = "Coefficients")
# Perform k-fold cross validation to tune regularization parameter lambda
cvfit <- grpreg::cv.grpreg(X, y, group, penalty="grLasso", family="binomial")
# Print summary of cross-validated model
summary(cvfit)
#> grLasso-penalized logistic regression with n=5348, p=35
#> At minimum cross-validation error (lambda=0.0032):
#> -------------------------------------------------
#> Nonzero coefficients: 28
#> Nonzero groups: 23
#> Cross-validation error of 1.14
#> Maximum R-squared: 0.20
#> Maximum signal-to-noise ratio: 0.20
#> Prediction error at lambda.min: 0.283
# Plot path trajectory of fitted sparse regression parameters
plot(cvfit, labels = TRUE,
main = "Group LASSO: Residency Location",
xlab = "Lambda",
ylab = "Coefficients",
vertical.line=TRUE,
col="red")
a<- summary(cvfit)
We found the amount of penalty, λ by cross-validation. λ is a tuning parameter that helps to control our model from over-fitting to the training data. To identify the optimal λ value we can use k-fold cross-validation. Extra predictors eroded the performance of logistic regression while the LASSO penalty resulted in stable performance as the number of irrelevant predictors increased. We searched for the λ that give the minimum prediction error.
A vertical line is drawn at the value where cross-validation prediction error is minimized, at which point 25 variables were selected. However, as the confidence intervals show, there is substantial uncertainty about this minimum value.
plot(cvfit, labels = TRUE, type='cve')
Path trajectory of the fitted regression parameters: The figure should be read from right to left (i.e., λ from small to large). Variables that become zero later are stronger (i.e., since a larger λ is needed to make them become 0). The variables that quickly become zero are weak or insignificant variables.
plot(fit)
tm_print_save("output/fig/box_and_whisker_plot_of_LASSO_model.tiff")
#> [1] "Function Sanity Check: Saving TIFF of what is in the viewer"
#> null device
#> 1
To view the best model and the corresponding coefficients. cv.fit$lambda.min is the best lambda value that results in the best model with smallest cross-validation error.
cvfit$lambda.min
#> [1] 0.00323
# This extracts the fitted regression parameters of the logistic regression model using the given lambda value of `r cvfit$lambda.min`. Picks out which predictors have a coefficient > 0.
This code uses the cvfit to fit a logistic regression model to your data. The cv.glmnet function performs cross-validation to select the optimal value of lambda for the logistic regression: 0.003.
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
##### Create tables displaying results of univariate analyses, stratified or not by categorical variable groupings ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #####[tibshirani1996regression]
d1 <- d %>%
dplyr::rename (c(#"Match Status" = Match_Status,
"Gender" = Gender,
"Age (years)" = Age,
"US or Canadian Applicant" = US_or_Canadian_Applicant,
"Visa Sponsorship Needed" = Visa_Sponsorship_Needed,
"Medical Education Interrupted"= Medical_Education_or_Training_Interrupted,
"Alpha Omega Alpha Membership" = Alpha_Omega_Alpha,
"Gold Humanism Honor Society Membership" = Gold_Humanism_Honor_Society,
"Participating in a Couples Match" = Couples_Match,
"Number of Oral Presentations" = Count_of_Oral_Presentation,
"Number of Peer-Reviewed Journal Articles or Abstracts" = Count_of_Peer_Reviewed_Journal_Articles_Abstracts,
"Number of Peer-Reviewed Non-published Journal_Articles" = Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published,
"Number of Poster Presentations" = Count_of_Poster_Presentation,
#"Applicant Match Year" = Year,
"Presented at a Scientific Meeting" = Meeting_Name_Presented,
"Trained at a Top NIH-funded Medical School" = TopNIHfunded,
"Bachelor of Science Degree" = Higher_Education_Degree,
#"Birth Place" = Birth_Place,
"Fluency in a Foreign Language" = Language_Fluency,
"Certification in Advanced Cardiac Life Support" = ACLS,
"Certification in Pediatric Advanced Life Support" = PALS,
"Prior Residency Training" = Medical_Training_Discipline,
"Residency Tracks Applied" = Tracks_Applied_by_Applicant_1,
"Membership in the American Medical Association" = AMA,
"Membership in the American College of Obstetricians and Gynecologists" = ACOG,
"Received a Scholarship" = Scholarship,
"Number of letter of recommendation OBGYN authors" = total_OBGYN_letter_writers,
"Number of letters of recommendation" = reco_count,
"Number of first author publications" = number_of_applicant_first_author_publications,
"Advanced Degree Other Than Medical Degree" = Advance_Degree,
"Type of medical school" = Type_of_medical_school,
"Number of Work Experiences" = work_exp_count,
"Number of Volunteer Experiences" = Volunteer_exp_count,
"Number of Research Experiences" = Research_exp_count))
# ######################################################################################
# #
# #
# # Demographics of the Location
# #
# #
# ######################################################################################
# Define the formula for the tableby function
formula <- d1$Location ~ .
# Define the control argument for the tableby function
control <- arsenal::tableby.control(
test = TRUE,
total = TRUE,
digits = 0,
digits.p = 2,
digits.count = 0,
numeric.simplify = TRUE,
cat.simplify = FALSE,
numeric.stats = c("median", "q1q3"),
cat.stats = c("Nmiss", "countpct"),
stats.labels = list(
Nmiss = "N Missing",
Nmiss2 = "N Missing",
meansd = "Mean (SD)",
medianrange = "Median (Range)",
median = "Median",
medianq1q3 = "Median (Q1, Q3)",
q1q3 = "Q1, Q3",
iqr = "IQR",
range = "Range",
countpct = "Count (Pct)",
Nevents = "Events",
medSurv = "Median Survival",
medTime = "Median Follow-Up"
)
)
# Create the tableby object
arsenal_table <- arsenal::tableby(formula, data = d1, control = control)
# Create the summary of the tableby object
final <- summary(
arsenal_table
#,
#text = TRUE,
# title = "Table: Locations",
#pfootnote = FALSE
); final
BSW (N=303) | CCAG (N=603) | CU (N=2880) | Duke (N=1211) | Truman (N=254) | U_Washington (N=306) | UAB (N=240) | Utah (N=1429) | Total (N=7226) | p value | |
---|---|---|---|---|---|---|---|---|---|---|
Match_Status | < 0.01 | |||||||||
Not_Matched | 174 (57.4%) | 333 (55.2%) | 1431 (49.7%) | 409 (33.8%) | 170 (66.9%) | 126 (41.2%) | 130 (54.2%) | 534 (37.4%) | 3307 (45.8%) | |
Matched | 129 (42.6%) | 270 (44.8%) | 1449 (50.3%) | 802 (66.2%) | 84 (33.1%) | 180 (58.8%) | 110 (45.8%) | 895 (62.6%) | 3919 (54.2%) | |
Gender | < 0.01 | |||||||||
Female | 244 (80.5%) | 481 (79.8%) | 2330 (80.9%) | 1043 (86.1%) | 195 (76.8%) | 263 (85.9%) | 187 (77.9%) | 1169 (81.8%) | 5912 (81.8%) | |
Male | 59 (19.5%) | 122 (20.2%) | 550 (19.1%) | 168 (13.9%) | 59 (23.2%) | 43 (14.1%) | 53 (22.1%) | 260 (18.2%) | 1314 (18.2%) | |
Age (years) | < 0.01 | |||||||||
Median | 28 | 28 | 28 | 27 | 29 | 28 | 28 | 27 | 28 | |
Q1, Q3 | 27, 30 | 27, 30 | 27, 30 | 26, 29 | 27, 31 | 27, 30 | 27, 30 | 26, 29 | 26, 30 | |
US or Canadian Applicant | < 0.01 | |||||||||
No | 46 (15.2%) | 96 (15.9%) | 639 (22.2%) | 145 (12.0%) | 109 (42.9%) | 55 (18.0%) | 54 (22.5%) | 191 (13.4%) | 1335 (18.5%) | |
Yes | 257 (84.8%) | 507 (84.1%) | 2241 (77.8%) | 1066 (88.0%) | 145 (57.1%) | 251 (82.0%) | 186 (77.5%) | 1238 (86.6%) | 5891 (81.5%) | |
Visa Sponsorship Needed | < 0.01 | |||||||||
No | 287 (94.7%) | 573 (95.0%) | 2663 (92.5%) | 1147 (94.7%) | 226 (89.0%) | 283 (92.5%) | 226 (94.2%) | 1368 (95.7%) | 6773 (93.7%) | |
Yes | 16 (5.3%) | 30 (5.0%) | 217 (7.5%) | 64 (5.3%) | 28 (11.0%) | 23 (7.5%) | 14 (5.8%) | 61 (4.3%) | 453 (6.3%) | |
Medical Education Interrupted | 0.02 | |||||||||
No | 260 (85.8%) | 531 (88.1%) | 2461 (85.5%) | 1064 (87.9%) | 206 (81.1%) | 276 (90.2%) | 207 (86.2%) | 1222 (85.5%) | 6227 (86.2%) | |
Yes | 43 (14.2%) | 72 (11.9%) | 419 (14.5%) | 147 (12.1%) | 48 (18.9%) | 30 (9.8%) | 33 (13.8%) | 207 (14.5%) | 999 (13.8%) | |
Alpha Omega Alpha Membership | < 0.01 | |||||||||
No | 302 (99.7%) | 598 (99.2%) | 2562 (89.0%) | 1045 (86.3%) | 253 (99.6%) | 292 (95.4%) | 234 (97.5%) | 1232 (86.2%) | 6518 (90.2%) | |
Yes | 1 (0.3%) | 5 (0.8%) | 318 (11.0%) | 166 (13.7%) | 1 (0.4%) | 14 (4.6%) | 6 (2.5%) | 197 (13.8%) | 708 (9.8%) | |
Gold Humanism Honor Society Membership | < 0.01 | |||||||||
No | 270 (89.1%) | 543 (90.0%) | 2472 (85.8%) | 1015 (83.8%) | 234 (92.1%) | 267 (87.3%) | 213 (88.8%) | 1173 (82.1%) | 6187 (85.6%) | |
Yes | 33 (10.9%) | 60 (10.0%) | 408 (14.2%) | 196 (16.2%) | 20 (7.9%) | 39 (12.7%) | 27 (11.2%) | 256 (17.9%) | 1039 (14.4%) | |
Participating in a Couples Match | < 0.01 | |||||||||
No | 287 (94.7%) | 556 (92.2%) | 2659 (92.3%) | 1093 (90.3%) | 243 (95.7%) | 294 (96.1%) | 224 (93.3%) | 1259 (88.1%) | 6615 (91.5%) | |
Yes | 16 (5.3%) | 47 (7.8%) | 221 (7.7%) | 118 (9.7%) | 11 (4.3%) | 12 (3.9%) | 16 (6.7%) | 170 (11.9%) | 611 (8.5%) | |
Number of Oral Presentations | < 0.01 | |||||||||
Median | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
Q1, Q3 | 0, 1 | 0, 1 | 0, 1 | 0, 1 | 0, 1 | 0, 1 | 0, 1 | 0, 1 | 0, 1 | |
Number of Peer-Reviewed Journal Articles or Abstracts | < 0.01 | |||||||||
Median | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
Q1, Q3 | 0, 1 | 0, 1 | 0, 1 | 0, 2 | 0, 1 | 0, 1 | 0, 1 | 0, 2 | 0, 1 | |
Number of Peer-Reviewed Non-published Journal_Articles | < 0.01 | |||||||||
Median | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
Q1, Q3 | 0, 0 | 0, 0 | 0, 0 | 0, 1 | 0, 0 | 0, 0 | 0, 0 | 0, 1 | 0, 0 | |
Number of Poster Presentations | < 0.01 | |||||||||
Median | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | |
Q1, Q3 | 0, 2 | 0, 2 | 0, 3 | 0, 3 | 0, 2 | 0, 3 | 0, 2 | 0, 3 | 0, 3 | |
Year | < 0.01 | |||||||||
2016 | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 442 (30.9%) | 442 (6.1%) | |
2017 | 57 (18.8%) | 125 (20.7%) | 638 (22.2%) | 328 (27.1%) | 61 (24.0%) | 77 (25.2%) | 61 (25.4%) | 490 (34.3%) | 1837 (25.4%) | |
2018 | 60 (19.8%) | 135 (22.4%) | 606 (21.0%) | 284 (23.5%) | 63 (24.8%) | 82 (26.8%) | 54 (22.5%) | 0 (0.0%) | 1284 (17.8%) | |
2019 | 100 (33.0%) | 173 (28.7%) | 1054 (36.6%) | 323 (26.7%) | 66 (26.0%) | 92 (30.1%) | 63 (26.2%) | 497 (34.8%) | 2368 (32.8%) | |
2020 | 86 (28.4%) | 170 (28.2%) | 582 (20.2%) | 276 (22.8%) | 64 (25.2%) | 55 (18.0%) | 62 (25.8%) | 0 (0.0%) | 1295 (17.9%) | |
Presented at a Scientific Meeting | < 0.01 | |||||||||
Did not present at a meeting | 219 (72.3%) | 448 (74.3%) | 2031 (70.5%) | 850 (70.2%) | 201 (79.1%) | 204 (66.7%) | 187 (77.9%) | 1051 (73.5%) | 5191 (71.8%) | |
Presented at a meeting | 84 (27.7%) | 155 (25.7%) | 849 (29.5%) | 361 (29.8%) | 53 (20.9%) | 102 (33.3%) | 53 (22.1%) | 378 (26.5%) | 2035 (28.2%) | |
Trained at a Top NIH-funded Medical School | < 0.01 | |||||||||
Did not attend NIH top-funded medical school | 251 (82.8%) | 525 (87.1%) | 1912 (66.4%) | 708 (58.5%) | 221 (87.0%) | 217 (70.9%) | 181 (75.4%) | 800 (56.0%) | 4815 (66.6%) | |
Attended a NIH top-funded Medical School | 52 (17.2%) | 78 (12.9%) | 968 (33.6%) | 503 (41.5%) | 33 (13.0%) | 89 (29.1%) | 59 (24.6%) | 629 (44.0%) | 2411 (33.4%) | |
Bachelor of Science Degree | < 0.01 | |||||||||
B.S. | 129 (42.6%) | 276 (45.8%) | 1184 (41.1%) | 538 (44.4%) | 92 (36.2%) | 117 (38.2%) | 114 (47.5%) | 508 (35.5%) | 2958 (40.9%) | |
Not a B.S. degree | 174 (57.4%) | 327 (54.2%) | 1696 (58.9%) | 673 (55.6%) | 162 (63.8%) | 189 (61.8%) | 126 (52.5%) | 921 (64.5%) | 4268 (59.1%) | |
Fluency in a Foreign Language | < 0.01 | |||||||||
N Missing | 86 | 170 | 582 | 276 | 64 | 55 | 62 | 497 | 1792 | |
Speaks English | 63 (29.0%) | 174 (40.2%) | 529 (23.0%) | 264 (28.2%) | 58 (30.5%) | 42 (16.7%) | 58 (32.6%) | 212 (22.7%) | 1400 (25.8%) | |
Speaks English and Another Language | 154 (71.0%) | 259 (59.8%) | 1769 (77.0%) | 671 (71.8%) | 132 (69.5%) | 209 (83.3%) | 120 (67.4%) | 720 (77.3%) | 4034 (74.2%) | |
Certification in Advanced Cardiac Life Support | < 0.01 | |||||||||
N Missing | 86 | 170 | 582 | 276 | 64 | 55 | 62 | 497 | 1792 | |
No | 70 (32.3%) | 117 (27.0%) | 1392 (60.6%) | 586 (62.7%) | 92 (48.4%) | 135 (53.8%) | 92 (51.7%) | 563 (60.4%) | 3047 (56.1%) | |
Yes | 147 (67.7%) | 316 (73.0%) | 906 (39.4%) | 349 (37.3%) | 98 (51.6%) | 116 (46.2%) | 86 (48.3%) | 369 (39.6%) | 2387 (43.9%) | |
Certification in Pediatric Advanced Life Support | < 0.01 | |||||||||
N Missing | 86 | 170 | 582 | 276 | 64 | 55 | 62 | 497 | 1792 | |
No | 204 (94.0%) | 395 (91.2%) | 2181 (94.9%) | 888 (95.0%) | 174 (91.6%) | 233 (92.8%) | 159 (89.3%) | 889 (95.4%) | 5123 (94.3%) | |
Yes | 13 (6.0%) | 38 (8.8%) | 117 (5.1%) | 47 (5.0%) | 16 (8.4%) | 18 (7.2%) | 19 (10.7%) | 43 (4.6%) | 311 (5.7%) | |
Prior Residency Training | < 0.01 | |||||||||
No | 290 (95.7%) | 580 (96.2%) | 2686 (93.3%) | 1167 (96.4%) | 238 (93.7%) | 288 (94.1%) | 231 (96.2%) | 1369 (95.8%) | 6849 (94.8%) | |
Yes | 13 (4.3%) | 23 (3.8%) | 194 (6.7%) | 44 (3.6%) | 16 (6.3%) | 18 (5.9%) | 9 (3.8%) | 60 (4.2%) | 377 (5.2%) | |
Residency Tracks Applied | < 0.01 | |||||||||
Applying for a Preliminary Position | 143 (47.2%) | 295 (48.9%) | 1862 (64.7%) | 604 (49.9%) | 125 (49.2%) | 147 (48.0%) | 123 (51.2%) | 939 (65.7%) | 4238 (58.6%) | |
Categorical Applicant | 160 (52.8%) | 308 (51.1%) | 1018 (35.3%) | 607 (50.1%) | 129 (50.8%) | 159 (52.0%) | 117 (48.8%) | 490 (34.3%) | 2988 (41.4%) | |
Membership in the American Medical Association | < 0.01 | |||||||||
No AMA Membership | 210 (69.3%) | 434 (72.0%) | 1798 (62.4%) | 758 (62.6%) | 182 (71.7%) | 197 (64.4%) | 158 (65.8%) | 951 (66.6%) | 4688 (64.9%) | |
American Medical Association Member | 93 (30.7%) | 169 (28.0%) | 1082 (37.6%) | 453 (37.4%) | 72 (28.3%) | 109 (35.6%) | 82 (34.2%) | 478 (33.4%) | 2538 (35.1%) | |
Membership in the American College of Obstetricians and Gynecologists | < 0.01 | |||||||||
No | 196 (64.7%) | 387 (64.2%) | 1598 (55.5%) | 662 (54.7%) | 166 (65.4%) | 168 (54.9%) | 143 (59.6%) | 857 (60.0%) | 4177 (57.8%) | |
Yes | 107 (35.3%) | 216 (35.8%) | 1282 (44.5%) | 549 (45.3%) | 88 (34.6%) | 138 (45.1%) | 97 (40.4%) | 572 (40.0%) | 3049 (42.2%) | |
Received a Scholarship | < 0.01 | |||||||||
No_scholarship | 260 (85.8%) | 492 (81.6%) | 2246 (78.0%) | 964 (79.6%) | 213 (83.9%) | 237 (77.5%) | 193 (80.4%) | 1197 (83.8%) | 5802 (80.3%) | |
Scholarship | 43 (14.2%) | 111 (18.4%) | 634 (22.0%) | 247 (20.4%) | 41 (16.1%) | 69 (22.5%) | 47 (19.6%) | 232 (16.2%) | 1424 (19.7%) | |
Number of letter of recommendation OBGYN authors | < 0.01 | |||||||||
Median | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | |
Q1, Q3 | 1, 2 | 1, 2 | 2, 3 | 2, 3 | 1, 2 | 2, 3 | 1, 3 | 2, 3 | 2, 3 | |
Number of letters of recommendation | < 0.01 | |||||||||
Median | 3 | 3 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | |
Q1, Q3 | 0, 4 | 0, 4 | 3, 4 | 3, 4 | 0, 4 | 3, 4 | 0, 4 | 0, 4 | 1, 4 | |
Number of first author publications | < 0.01 | |||||||||
Median | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
Q1, Q3 | 0, 1 | 0, 1 | 0, 3 | 0, 3 | 0, 1 | 0, 3 | 0, 2 | 0, 2 | 0, 2 | |
Advanced Degree Other Than Medical Degree | < 0.01 | |||||||||
No Addl Degree | 278 (91.7%) | 566 (93.9%) | 2622 (91.0%) | 1119 (92.4%) | 229 (90.2%) | 280 (91.5%) | 226 (94.2%) | 1358 (95.0%) | 6678 (92.4%) | |
MBA | 0 (0.0%) | 3 (0.5%) | 17 (0.6%) | 9 (0.7%) | 1 (0.4%) | 0 (0.0%) | 0 (0.0%) | 4 (0.3%) | 34 (0.5%) | |
Other | 23 (7.6%) | 34 (5.6%) | 225 (7.8%) | 78 (6.4%) | 24 (9.4%) | 26 (8.5%) | 14 (5.8%) | 61 (4.3%) | 485 (6.7%) | |
PhD | 2 (0.7%) | 0 (0.0%) | 16 (0.6%) | 5 (0.4%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 6 (0.4%) | 29 (0.4%) | |
Type of medical school | < 0.01 | |||||||||
Osteopathic School | 143 (47.2%) | 345 (57.2%) | 351 (12.2%) | 77 (6.4%) | 61 (24.0%) | 73 (23.9%) | 41 (17.1%) | 136 (9.5%) | 1227 (17.0%) | |
U.S. Public School | 76 (25.1%) | 110 (18.2%) | 1163 (40.4%) | 562 (46.4%) | 53 (20.9%) | 96 (31.4%) | 115 (47.9%) | 786 (55.0%) | 2961 (41.0%) | |
International School | 48 (15.8%) | 96 (15.9%) | 639 (22.2%) | 145 (12.0%) | 108 (42.5%) | 55 (18.0%) | 54 (22.5%) | 190 (13.3%) | 1335 (18.5%) | |
U.S. Private School | 36 (11.9%) | 52 (8.6%) | 727 (25.2%) | 427 (35.3%) | 31 (12.2%) | 82 (26.8%) | 30 (12.5%) | 317 (22.2%) | 1702 (23.6%) | |
Osteopathic School,International School | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 1 (0.4%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 1 (0.0%) | |
Number of Work Experiences | < 0.01 | |||||||||
Median | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | |
Q1, Q3 | 0, 4 | 0, 3 | 0, 4 | 0, 4 | 0, 4 | 0, 4 | 0, 4 | 0, 4 | 0, 4 | |
Number of Volunteer Experiences | < 0.01 | |||||||||
Median | 5 | 5 | 5 | 6 | 4 | 6 | 5 | 4 | 5 | |
Q1, Q3 | 0, 9 | 0, 9 | 1, 9 | 1, 9 | 0, 7 | 1, 9 | 0, 9 | 0, 8 | 0, 9 | |
Number of Research Experiences | < 0.01 | |||||||||
Median | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | |
Q1, Q3 | 0, 2 | 0, 2 | 0, 3 | 0, 3 | 0, 2 | 0, 3 | 0, 2 | 0, 3 | 0, 3 |
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
##### Picks out which predictors have a coefficient > 0. ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
#This code performs LASSO (Least Absolute Shrinkage and Selection Operator) regularization on a cross-validated model fit (cvfit). The purpose of the LASSO regularization is to select a subset of predictor variables that have non-zero coefficients and are important for the outcome.
# The line coef2 <- stats::coef(cvfit, s = "lambda.min") computes the coefficients for the fit at the minimum value of the penalty parameter (lambda.min).
#
# ix = which(abs(coef2) > 0) and coef2 <- coef2[ix, drop=FALSE] extract the coefficients that are not zero.
#
# selected_vars <- names(coef2)[ix] and selected_vars <- intersect(selected_vars, colnames(d)) are used to find the variables in the data frame d that correspond to the non-zero coefficients.
#
# non_existing_vars <- setdiff(selected_vars, colnames(d)) is used to check if any of the variables in selected_vars do not exist in the data frame d. If there are any non-existing variables, a warning message is printed. If all of the variables exist in d, then a new data frame reduced_d is created that contains only the columns in d that correspond to the variables selected by LASSO.
#
# Finally, the line names(reduced_d) returns the names of the columns in the reduced_d data frame.
coef2 <- stats::coef(cvfit, s = "lambda.min") # View coefficients
ix = which(abs(coef2) > 0) #These are the variables to include.
length(ix)
#> [1] 29
coef2 <- coef2[ix, drop=FALSE]
#removes the intercept so you can compare 'coef2' and the 'd' column names with no difference
selected_vars <- names(coef2)[ix]
selected_vars <- intersect(selected_vars, colnames(d))
reduced_d <- dplyr::select(d, one_of(selected_vars))
#Function that determines if the variables in the coef2 are found in d. If not you get a warning. If they are in both dataframes then it creates a new dataframe called 'reduced_d' with only the variables that LASSO recommended.
non_existing_vars <- setdiff(selected_vars, colnames(d))
if (length(non_existing_vars) > 0) {
# Print an error message
print(paste("The following variable names in 'selected_vars' do not exist in the 'X' data:", paste(non_existing_vars, collapse = ", ")))
} else {
# Extract only the columns in X that are listed in selected_vars
reduced_d <- d[, selected_vars, drop = FALSE]
}
This code creates two vectors, one called names_in_ix and another called not_in_ix. names_in_ix contains the names of variables that were included in the LASSO model, based on the variable indices stored in ix. not_in_ix contains the names of variables that were not included in the LASSO model, which are obtained by finding the difference between the names of all variables stored in coef2 and the names of variables included in names_in_ix.
names_in_ix = names(coef2)[ix] #Variables included from LASSO
not_in_ix = names(coef2)[!(names(coef2) %in% names_in_ix)] #These are the variables to REMOVE.
knitr_table_html(data = not_in_ix, caption = "Table of Variables Removed by the LASSO Grouped by Location Model.")
x |
---|
Count_of_Peer_Reviewed_Journal_Articles_Abstracts |
reco_count |
Visa_Sponsorship_Needed_Yes |
Alpha_Omega_Alpha_Yes |
Medical_Training_Discipline_Prior Residency Training |
ACOG_ACOG Member |
Advance_Degree_No Advanced Degree |
knitr_table_html(data = names(ix), caption = "Table of Variables Included by the LASSO Grouped by Location Model.")
x |
---|
(Intercept) |
Age |
Count_of_Peer_Reviewed_Journal_Articles_Abstracts |
Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published |
total_OBGYN_letter_writers |
reco_count |
number_of_applicant_first_author_publications |
work_exp_count |
Research_exp_count |
US_or_Canadian_Applicant_Yes |
Visa_Sponsorship_Needed_Yes |
Medical_Education_or_Training_Interrupted_Yes |
Alpha_Omega_Alpha_Yes |
Gold_Humanism_Honor_Society_Yes |
Meeting_Name_Presented_Presented at a meeting |
TopNIHfunded_Attended a NIH top-funded Medical School |
ACLS_Yes |
PALS_Yes |
Medical_Training_Discipline_Prior Residency Training |
Tracks_Applied_by_Applicant_1_Categorical Applicant |
AMA_American Medical Association Member |
ACOG_ACOG Member |
Advance_Degree_No Advanced Degree |
Advance_Degree_Ph.D. |
Advance_Degree_Other |
Type_of_medical_school_International School |
Type_of_medical_school_Osteopathic School |
Type_of_medical_school_Osteopathic School,International School |
Type_of_medical_school_U.S. Private School |
knitr_table_html(modelPerformance, caption = "Table of Model Characteristics for the LASSO Model.")
Imputation is the process of replacing missing values with estimated ones. It is a crucial step in preparing the data for machine learning models as many algorithms do not work well with missing data. There are various methods for imputation such as mean imputation, median imputation, and regression imputation. By imputing missing values, we can ensure that the data is complete and ready for analysis.
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
##### Create tables displaying results of univariate analyses, stratified or not by categorical variable groupings ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #####[tibshirani1996regression]
d1 <- d
d1$Year <- ifelse(d1$Year %in% c(2017,2018), "train", ifelse(d1$Year %in% c(2019,2020), "test", NA))
convert_to_factor(data = d1,
variable = "Year",
levels = c("train", "test"),
labels = c("train", "test"),
reference = "train")
d1$Year <- as.factor(d1$Year)
d1 <- d %>%
dplyr::rename (c(#"Match Status" = Match_Status,
"Gender" = Gender,
"Age (years)" = Age,
"US or Canadian Applicant" = US_or_Canadian_Applicant,
"Visa Sponsorship Needed" = Visa_Sponsorship_Needed,
"Medical Education Interrupted"= Medical_Education_or_Training_Interrupted,
"Alpha Omega Alpha Membership" = Alpha_Omega_Alpha,
"Gold Humanism Honor Society Membership" = Gold_Humanism_Honor_Society,
"Participating in a Couples Match" = Couples_Match,
"Number of Oral Presentations" = Count_of_Oral_Presentation,
"Number of Peer-Reviewed Journal Articles or Abstracts" = Count_of_Peer_Reviewed_Journal_Articles_Abstracts,
"Number of Peer-Reviewed Non-published Journal_Articles" = Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published,
"Number of Poster Presentations" = Count_of_Poster_Presentation,
#"Applicant Match Year" = Year,
"Presented at a Scientific Meeting" = Meeting_Name_Presented,
"Trained at a Top NIH-funded Medical School" = TopNIHfunded,
"Bachelor of Science Degree" = Higher_Education_Degree,
#"Birth Place" = Birth_Place,
"Fluency in a Foreign Language" = Language_Fluency,
"Certification in Advanced Cardiac Life Support" = ACLS,
"Certification in Pediatric Advanced Life Support" = PALS,
"Prior Residency Training" = Medical_Training_Discipline,
"Residency Tracks Applied" = Tracks_Applied_by_Applicant_1,
"Membership in the American Medical Association" = AMA,
"Membership in the American College of Obstetricians and Gynecologists" = ACOG,
"Received a Scholarship" = Scholarship,
"Number of letter of recommendation OBGYN authors" = total_OBGYN_letter_writers,
"Number of letters of recommendation" = reco_count,
"Number of first author publications" = number_of_applicant_first_author_publications,
"Advanced Degree Other Than Medical Degree" = Advance_Degree,
"Type of medical school" = Type_of_medical_school,
"Number of Work Experiences" = work_exp_count,
"Number of Volunteer Experiences" = Volunteer_exp_count,
"Number of Research Experiences" = Research_exp_count))
# Create a table with the test results
t1_year <- compareGroups::compareGroups(formula = Year ~ .,
data = d1,
method = 4, #Shapiro-Wilks test
na.action = "na.pass",
alpha = 0.05) #alpha significance threshold
# Create a table of the results of the test
t1tab_year <- compareGroups::createTable(x = t1_year,
digits.p = 3, sd.type = 1, # Specify the number of decimal places for p-values
#and type of standard deviation
show.p.overall = TRUE, # Include an overall p-value in the table
hide.no = "no") # Don't hide any variables with "no" in the name
# Export the table as an HTML-formatted Markdown table
compareGroups::export2md(t1tab_year,
strip = TRUE, # remove leading and trailing white space
first.strip = TRUE, # remove leading white space from the first line
digits = 1L, # specify the number of decimal places
caption = "Description of Applicants by Match Status", # add a caption to the table
booktabs = TRUE, width = "400px", # use booktabs formatting and specify the width of the table
format = "html") # output as HTML
2016 | 2017 | 2018 | 2019 | 2020 | p.overall | |
---|---|---|---|---|---|---|
N=442 | N=1837 | N=1284 | N=2368 | N=1295 | ||
Match_Status: | <0.001 | |||||
Not_Matched | 148 (33.5%) | 777 (42.3%) | 573 (44.6%) | 1076 (45.4%) | 733 (56.6%) | |
Matched | 294 (66.5%) | 1060 (57.7%) | 711 (55.4%) | 1292 (54.6%) | 562 (43.4%) | |
Gender: | 0.028 | |||||
Female | 361 (81.7%) | 1458 (79.4%) | 1053 (82.0%) | 1963 (82.9%) | 1077 (83.2%) | |
Male | 81 (18.3%) | 379 (20.6%) | 231 (18.0%) | 405 (17.1%) | 218 (16.8%) | |
Age (years) | 27.0 [26.0;29.0] | 28.0 [26.0;30.0] | 28.0 [27.0;30.0] | 28.0 [26.0;30.0] | 28.0 [27.0;30.0] | <0.001 |
US or Canadian Applicant | 369 (83.5%) | 1409 (76.7%) | 1033 (80.5%) | 2022 (85.4%) | 1058 (81.7%) | <0.001 |
Visa Sponsorship Needed | 21 (4.75%) | 136 (7.40%) | 96 (7.48%) | 128 (5.41%) | 72 (5.56%) | 0.011 |
Medical Education Interrupted | 58 (13.1%) | 271 (14.8%) | 181 (14.1%) | 325 (13.7%) | 164 (12.7%) | 0.546 |
Alpha Omega Alpha Membership | 56 (12.7%) | 183 (9.96%) | 100 (7.79%) | 269 (11.4%) | 100 (7.72%) | <0.001 |
Gold Humanism Honor Society Membership | 76 (17.2%) | 228 (12.4%) | 155 (12.1%) | 389 (16.4%) | 191 (14.7%) | <0.001 |
Participating in a Couples Match | 50 (11.3%) | 166 (9.04%) | 86 (6.70%) | 203 (8.57%) | 106 (8.19%) | 0.030 |
Number of Oral Presentations | 0.00 [0.00;1.00] | 0.00 [0.00;1.00] | 0.00 [0.00;1.00] | 0.00 [0.00;1.00] | 0.00 [0.00;1.00] | <0.001 |
Number of Peer-Reviewed Journal Articles or Abstracts | 0.00 [0.00;1.00] | 0.00 [0.00;1.00] | 0.00 [0.00;1.00] | 0.00 [0.00;1.00] | 0.00 [0.00;1.00] | <0.001 |
Number of Peer-Reviewed Non-published Journal_Articles | 0.00 [0.00;0.00] | 0.00 [0.00;0.00] | 0.00 [0.00;0.00] | 0.00 [0.00;0.00] | 0.00 [0.00;0.00] | <0.001 |
Number of Poster Presentations | 1.00 [0.00;2.00] | 1.00 [0.00;2.00] | 1.00 [0.00;3.00] | 1.00 [0.00;3.00] | 1.00 [0.00;3.00] | <0.001 |
Location: | 0.000 | |||||
BSW | 0 (0.00%) | 57 (3.10%) | 60 (4.67%) | 100 (4.22%) | 86 (6.64%) | |
CCAG | 0 (0.00%) | 125 (6.80%) | 135 (10.5%) | 173 (7.31%) | 170 (13.1%) | |
CU | 0 (0.00%) | 638 (34.7%) | 606 (47.2%) | 1054 (44.5%) | 582 (44.9%) | |
Duke | 0 (0.00%) | 328 (17.9%) | 284 (22.1%) | 323 (13.6%) | 276 (21.3%) | |
Truman | 0 (0.00%) | 61 (3.32%) | 63 (4.91%) | 66 (2.79%) | 64 (4.94%) | |
U_Washington | 0 (0.00%) | 77 (4.19%) | 82 (6.39%) | 92 (3.89%) | 55 (4.25%) | |
UAB | 0 (0.00%) | 61 (3.32%) | 54 (4.21%) | 63 (2.66%) | 62 (4.79%) | |
Utah | 442 (100%) | 490 (26.7%) | 0 (0.00%) | 497 (21.0%) | 0 (0.00%) | |
Presented at a Scientific Meeting: | <0.001 | |||||
Did not present at a meeting | 278 (62.9%) | 1172 (63.8%) | 807 (62.9%) | 1639 (69.2%) | 1295 (100%) | |
Presented at a meeting | 164 (37.1%) | 665 (36.2%) | 477 (37.1%) | 729 (30.8%) | 0 (0.00%) | |
Trained at a Top NIH-funded Medical School: | <0.001 | |||||
Did not attend NIH top-funded medical school | 250 (56.6%) | 1226 (66.7%) | 894 (69.6%) | 1529 (64.6%) | 916 (70.7%) | |
Attended a NIH top-funded Medical School | 192 (43.4%) | 611 (33.3%) | 390 (30.4%) | 839 (35.4%) | 379 (29.3%) | |
Bachelor of Science Degree: | <0.001 | |||||
B.S. | 271 (61.3%) | 1022 (55.6%) | 655 (51.0%) | 1010 (42.7%) | 0 (0.00%) | |
Not a B.S. degree | 171 (38.7%) | 815 (44.4%) | 629 (49.0%) | 1358 (57.3%) | 1295 (100%) | |
Fluency in a Foreign Language: | <0.001 | |||||
Speaks English | 110 (24.9%) | 415 (22.6%) | 344 (26.8%) | 671 (28.3%) | 398 (30.7%) | |
Speaks English and Another Language | 332 (75.1%) | 1422 (77.4%) | 940 (73.2%) | 1697 (71.7%) | 897 (69.3%) | |
Certification in Advanced Cardiac Life Support | 177 (40.0%) | 789 (43.0%) | 593 (46.2%) | 1019 (43.0%) | 605 (46.7%) | 0.028 |
Certification in Pediatric Advanced Life Support | 20 (4.52%) | 108 (5.88%) | 82 (6.39%) | 108 (4.56%) | 22 (1.70%) | <0.001 |
Prior Residency Training | 30 (6.79%) | 176 (9.58%) | 103 (8.02%) | 68 (2.87%) | 0 (0.00%) | <0.001 |
Residency Tracks Applied: | 0.000 | |||||
Applying for a Preliminary Position | 442 (100%) | 1251 (68.1%) | 277 (21.6%) | 973 (41.1%) | 1295 (100%) | |
Categorical Applicant | 0 (0.00%) | 586 (31.9%) | 1007 (78.4%) | 1395 (58.9%) | 0 (0.00%) | |
Membership in the American Medical Association: | <0.001 | |||||
No AMA Membership | 222 (50.2%) | 1019 (55.5%) | 684 (53.3%) | 1468 (62.0%) | 1295 (100%) | |
American Medical Association Member | 220 (49.8%) | 818 (44.5%) | 600 (46.7%) | 900 (38.0%) | 0 (0.00%) | |
Membership in the American College of Obstetricians and Gynecologists | 263 (59.5%) | 970 (52.8%) | 685 (53.3%) | 1131 (47.8%) | 0 (0.00%) | <0.001 |
Received a Scholarship: | <0.001 | |||||
No_scholarship | 326 (73.8%) | 1379 (75.1%) | 961 (74.8%) | 1841 (77.7%) | 1295 (100%) | |
Scholarship | 116 (26.2%) | 458 (24.9%) | 323 (25.2%) | 527 (22.3%) | 0 (0.00%) | |
Number of letter of recommendation OBGYN authors | 2.00 [2.00;3.00] | 2.00 [1.00;3.00] | 2.00 [1.00;3.00] | 2.00 [2.00;3.00] | 2.00 [2.00;3.00] | <0.001 |
Number of letters of recommendation | 4.00 [3.00;4.00] | 4.00 [3.00;4.00] | 4.00 [3.00;4.00] | 3.00 [3.00;4.00] | 0.00 [0.00;0.00] | 0.000 |
Number of first author publications | 1.00 [0.00;3.00] | 1.00 [0.00;3.00] | 1.00 [0.00;3.00] | 0.00 [0.00;3.00] | 0.00 [0.00;0.00] | <0.001 |
Advanced Degree Other Than Medical Degree: | . | |||||
No Addl Degree | 441 (99.8%) | 1735 (94.4%) | 1086 (84.6%) | 2121 (89.6%) | 1295 (100%) | |
MBA | 0 (0.00%) | 6 (0.33%) | 11 (0.86%) | 17 (0.72%) | 0 (0.00%) | |
Other | 0 (0.00%) | 86 (4.68%) | 178 (13.9%) | 221 (9.33%) | 0 (0.00%) | |
PhD | 1 (0.23%) | 10 (0.54%) | 9 (0.70%) | 9 (0.38%) | 0 (0.00%) | |
Type of medical school: | . | |||||
Osteopathic School | 34 (7.69%) | 238 (13.0%) | 275 (21.4%) | 419 (17.7%) | 261 (20.2%) | |
U.S. Public School | 239 (54.1%) | 729 (39.7%) | 464 (36.1%) | 1052 (44.4%) | 477 (36.8%) | |
International School | 72 (16.3%) | 425 (23.1%) | 250 (19.5%) | 348 (14.7%) | 240 (18.5%) | |
U.S. Private School | 97 (21.9%) | 444 (24.2%) | 295 (23.0%) | 549 (23.2%) | 317 (24.5%) | |
Osteopathic School,International School | 0 (0.00%) | 1 (0.05%) | 0 (0.00%) | 0 (0.00%) | 0 (0.00%) | |
Number of Work Experiences | 3.00 [1.00;5.00] | 3.00 [1.00;5.00] | 3.00 [1.00;5.00] | 2.00 [0.00;4.00] | 0.00 [0.00;0.00] | 0.000 |
Number of Volunteer Experiences | 7.00 [5.00;10.0] | 6.00 [4.00;9.00] | 7.00 [4.00;10.0] | 6.00 [1.00;9.00] | 0.00 [0.00;0.00] | 0.000 |
Number of Research Experiences | 2.00 [1.00;3.00] | 2.00 [1.00;3.00] | 2.00 [1.00;3.00] | 2.00 [0.00;3.00] | 0.00 [0.00;0.00] | 0.000 |
In some cases, our data is naturally separated into two sets, one of which can be used to fit a model and the other to evaluate it. A common example of this is when data has been collected during two distinct time periods, and the older data is used to fit a model that is evaluated on the newer data, to see if historical data can be used to predict the future.
Here we check sample sizes and proportions of target=“Match_Status” in training and test sets the proportions of target=“Match_Status” should be approximately the same across the training, test, and the whole dataset. Do NOT up- or down- sample the training set because the eval metric is Brier Score, and there is no evidence showing making the data balanced will improve out-of-sample performance. In fact, based on my experiments, it will make the models performance much worse. For more explanations, see this link: https://stats.stackexchange.com/a/474431
In general, a lower Brier score indicates a better model. Therefore, in this case, the code should aim to minimize the Brier score. This is reflected in the maximize argument of the caret::train function, which is set to FALSE to minimize the Brier score.
The Brier score is a measure of the accuracy of probabilistic predictions, and it is defined as the mean squared difference between the predicted probabilities and the observed outcomes. A model with a low Brier score is able to make more accurate probabilistic predictions, as the predicted probabilities are closer to the observed outcomes.
In this case, the Brier score is being used as an evaluation metric to compare the performance of different models. A model with a lower Brier score is generally considered to be a better model, as it is able to make more accurate probabilistic predictions.
# check sample sizes and proportions of target=Y in training and test sets
# the proportions of target=Y should be approximately the same across the
# training, test, and the whole dataset.
# Calculate number of observations in training and test sets
ntest <- nrow(test_match)
ntrain <- nrow(train_match)
# Print the number of observations in each set
cat("The training set has", ntrain, "observations.\n")
#> The training set has 3121 observations.
cat("The testing set has", ntest, "observations.\n\n")
#> The testing set has 3663 observations.
# Calculate and print the proportion of target=Y in the training set
prop_train <- sum(train_match$Match_Status == "Matched") / ntrain
cat("Proportion of target=Y in the training set:", prop_train, "\n\n")
#> Proportion of target=Y in the training set: 0.567
# Calculate and print the proportion of target=Y in the test set
prop_test <- sum(test_match$Match_Status == "Matched") / ntest
cat("Proportion of target=Y in the test set:", prop_test, "\n\n")
#> Proportion of target=Y in the test set: 0.506
invisible(gc())
###
#This code first calculates the proportion of target=Y in the training and test sets, then creates a data frame with these values. Finally, it uses ggplot2 to create a bar chart with the proportions for the training and test sets. The bars are filled with light blue color and the chart has appropriate labels for the title, x-axis, and y-axis.
# Calculate proportions of target=Y in training and test sets
prop_train <- nrow(train_match[train_match$Match_Status=="Matched",]) / (nrow(train_match[train_match$Match_Status=="Matched",]) + nrow(train_match[train_match$Match_Status=="Not_Matched",]))
prop_test <- nrow(test_match[test_match$Match_Status=="Matched",]) / (nrow(test_match[test_match$Match_Status=="Matched",]) + nrow(test_match[test_match$Match_Status=="Not_Matched",]))
# Create data frame with proportions for training and test sets
proportions_df <- data.frame(set = c("Training", "Test"), proportion = c(prop_train, prop_test))
# Plot proportions using ggplot2
ggplotly(ggplot2::ggplot(proportions_df, aes(x = set, y = proportion)) +
geom_col(fill = "lightblue", color = "black", width = 0.5) +
geom_text(aes(label = sprintf("%.2f%%", 100 * proportion)),
hjust = 1.5, vjust = -0.5, color = "black", size = 8) +
labs(title = "Proportion of Target=Match_Status in Training and Test Sets",
x = "Splits", y = "Proportion of Applicants who Matched"))
The next step is to explore potential predictive relationships between individual predictors and the response and between pairs of predictors and the response.
# `a` and `b` are two models from `train()`
compare_models_1way <- function(a, b, metric = a$metric[1], ...) {
# Create a list of the models to compare
mods <- list(a, b)
# Use caret::resamples to compare the models
rs <- caret::resamples(mods)
# Calculate the difference in performance between the models
diffs <- base::diff(rs, metric = metric[1], ...)
# Return the first element of the first element of the statistics list
diffs$statistics[[1]][[1]]
}
VC_preds <- names(select_if(d, is.numeric))
risk_preds <- names(select_if(d, is.factor))
write.csv(VC_preds, file = "output/csv/features_VC_preds.csv")
write.csv(risk_preds, file = "output/csv/features_risk_preds.csv")
# ------------------------------------------------------------------------------
# Create a "null model" with no predictors to get baseline performance ---------
# This model will be used as a reference point to compare the performance of other models against
null_mat <- data.frame(intercept = rep_len(1, nrow(train_match)))
#Set up trainControl object to specify the type of cross-validation to use
ctrl <- trainControl(method = "repeatedcv", number = 5, classProbs = TRUE, summaryFunction = BigSummary)
#Train the null model using a generalized linear model with Brier metric
null_mod <- caret::train(x = null_mat,
y = train_match$Match_Status,
method = "glm",
metric = "Brier",
trControl = ctrl)
#t might be good to print the results of the null model performance so it can be visualised and compared
print(null_mod)
#> Generalized Linear Model
#>
#> 3121 samples
#> 1 predictor
#> 2 classes: 'Not_Matched', 'Matched'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 1 times)
#> Summary of sample sizes: 2497, 2496, 2497, 2497, 2497
#> Resampling results:
#>
#> Acc Kappa AUCROC AUCPR Brier Precision Recall F
#> 0.567 0 0.5 0 0.245 0.567 1 0.724
# ------------------------------------------------------------------------------
# Compare the models with single predictors to the risk model. These data make
# https://bookdown.org/max/FES/stroke-tour.html#tab:stroke-strokeRiskAssociations
# `VC_preds` and `risk_preds` contain the predictor names for different sets.
one_predictor_res <- data.frame(Predictor = setdiff(names(train_match), "Match_Status"),
Improvement = NA,
Pvalue = NA,
Brier = NA,
stringsAsFactors = FALSE)
#Removes the Outcome from the dataframe.
one_predictor_res <- one_predictor_res %>%
dplyr::filter(Predictor != "Match_Status")
#Loop through each predictor variable
for (i in 1:nrow(one_predictor_res)) {
set.seed(1978)
#Train a model using the current predictor variable
var_mod <- caret::train(Match_Status ~ .,
data = train_match[, c("Match_Status", one_predictor_res$Predictor[i])],
method = "glm",
metric = "Brier",
trControl = ctrl)
#Compare the model with the null model
tmp_diff <- compare_models_1way(var_mod, null_mod, alternative = "greater")
#Store the Brier score, improvement in model performance, and p-value of the improvement in the data frame
one_predictor_res$Brier[i] <- caret::getTrainPerf(var_mod)[1, "TrainBrier"]
one_predictor_res$Improvement[i] <- tmp_diff$estimate
one_predictor_res$Pvalue[i] <- tmp_diff$p.value
}
Two logistic regression models were considered. The “null model” contains only an intercept term while the model complex model has a single term for an individual predictor from the risk set. These results are presented in the table and orders the risk predictors from most significant to least significant in terms of improvement in Brier Score. For the training data, several predictors provide marginal, but significant improvement in predicting Match_Status outcome as measured by the improvement in Brier Score. Based on these results, our intuition would lead us to believe that the significant categorical set predictors will likely be integral to the final predictive model.
# ------------------------------------------------------------------------------
# Data in table 2.3
# https://bookdown.org/max/FES/stroke-tour.html#tab:stroke-strokeRiskAssociations
one_predictor_res %>%
dplyr::filter(Predictor %in% risk_preds) %>%
dplyr::arrange(Brier)
write.csv(one_predictor_res, file = "output/csv/one_predictor_res.csv"); invisible(gc())
# ------------------------------------------------------------------------------
# Figure 2.4
# https://bookdown.org/max/FES/stroke-tour.html#fig:stroke-vascuCAPAssocations
vc_pred <-
recipes::recipe(Match_Status ~ ., data = d %>% dplyr::select(Match_Status, !!!VC_preds)) %>%
recipes::prep(train_match %>%
dplyr::select(Match_Status, !!!VC_preds)) %>%
recipes::juice() %>%
tidyr::gather(Predictor, value, -Match_Status)
# get_max value per predictor
pred_max <- vc_pred %>%
dplyr::inner_join(one_predictor_res %>% dplyr::select(Pvalue, Predictor)) %>%
dplyr::group_by(Predictor) %>%
dplyr::summarize(max_val = max(value), Pvalue = min(Pvalue)) %>%
dplyr::mutate(x = 1.5, value = 1.25 * max_val,
label = paste0("p-value: ", format.pval(Pvalue, digits = 2, sci = FALSE, eps = .0001)))
###
fig_2_4 <-ggplot2::ggplot(vc_pred, aes(x = Match_Status, y = value)) +
geom_boxplot() +
geom_point(alpha = 0.3, cex = .5) +
geom_text(data = pred_max, aes(x = x, label = label), size = 3) +
facet_wrap(~Predictor, scales = "free_y") +
ylab(""); fig_2_4