Objective

Create and validate a predictive model to understand a medical student’s chances of matching into obstetrics and gynecology residency.

invisible(gc())

Read in the split data and the previously trained models (every model fit on train_match ends in .fit, e.g., 'logit.fit')

train_match <- read.csv(file = "output/csv/train_dat_2017_2018_years.csv")
test_match <- read.csv(file = "output/csv/test_dat_2019_2020_years.csv")

logit.fit <- readr::read_rds(file.path(mod_path, "logit.fit.rds"))
lasso.fit <- readr::read_rds(file.path(mod_path, 'lasso_fit.rds'))
lasso.best.fit <- readr::read_rds(file.path(mod_path, 'lasso_best_fit.rds'))
rf.fit <- readr::read_rds(file.path(mod_path, "random_forest_fit.rds"))
xgb.fit <- readr::read_rds(file.path(mod_path, 'xgb_fit.rds'))
nnet.fit <- readr::read_rds(file.path(mod_path, "neural_network_fit.rds"))

PREDICTING TEST SET RESULTS

Finally, scoring needs to be performed on the test sample using the parameter estimates obtained during model building. This step is implemented with the predict function, which is straightforward to use with caret models. We make predictions on the test data in order to evaluate the performance of our models. The procedure is as follows: (1) predict the class-membership probabilities of each observation from the predictor variables, and (2) assign each observation to the class with the highest probability (i.e., above 0.5 for this binary outcome).
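The first step yields class-membership probabilities; a minimal sketch of the second step, assigning a class label at a 0.5 cut-off on the "Matched" probability column, is shown below (the helper name and threshold are illustrative choices, not part of the pipeline that follows):

# A minimal sketch: convert predicted probabilities into class labels
# using a 0.5 cut-off on the "Matched" column.
prob_to_class <- function(prob_df, threshold = 0.5) {
  factor(ifelse(prob_df[, "Matched"] > threshold, "Matched", "Not_Matched"),
         levels = c("Not_Matched", "Matched"))
}
# Example usage once logit.test (below) exists:
# logit.class <- prob_to_class(logit.test)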

Logistic regression: logit.test

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####   Logistic regression model. Out-of-sample----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
logit.test <- stats::predict(logit.fit, 
                             newdata = test_match, 
                             type = "prob")

# Save results to file
readr::write_rds(logit.test, file.path(mod_path, "logit.test.rds"))

LASSO Optimized: lasso.best.test

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####   LASSO Optimized model. Out-of-sample----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
lasso.best.test <- stats::predict(lasso.best.fit, 
                             newdata = test_match, 
                             type = "prob")

# Save results to file
readr::write_rds(lasso.best.test, file.path(mod_path, "lasso.best.test.rds"))

Random forest: rf.test

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####   Random Forest. Out-of-sample----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
rf.test <- stats::predict(rf.fit, 
                             newdata = test_match, 
                             type = "prob")

# Save results to file
readr::write_rds(rf.test, file.path(mod_path, "rf.test.rds"))

Xgboost: xgb.test

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####   Xgboost predictions. Out-of-sample----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
xgb.test <- stats::predict(xgb.fit, 
                             newdata = test_match, 
                             type = "prob")

# Save results to file
readr::write_rds(xgb.test, file.path(mod_path, "xgb.test.rds"))

Neural net: nnet.test

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####   Neural net predictions. Out-of-sample----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
nnet.test <- stats::predict(nnet.fit, 
                             newdata = test_match, 
                             type = "raw")

# Save results to file
readr::write_rds(nnet.test, file.path(mod_path, "nnet.test.rds"))
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####   Generate results of the test set data in each model. Out-of-sample. ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

#Lift is a measure of how much better a model's predictions are compared to a random guess, and it is typically used in binary classification to evaluate model performance on a binary response variable. Lift is defined as the ratio of the positive-class rate among the observations the model ranks highest to the overall positive-class rate in the population. A lift of 1 indicates that the model is no better than random guessing, while a lift greater than 1 indicates that the model is making better predictions than random guessing. Lift can be used to compare the performance of different models or to compare the performance of a single model under different settings or configurations.

## Generate the test set results
lift_results <- data.frame(Class = test_match$Match_Status)
lift_results$Logit <- stats::predict(logit.fit, test_match, type = "prob")[,"Matched"]
lift_results$Lasso <- stats::predict(lasso.fit, test_match, type = "prob")[,"Matched"]
lift_results$RF <- stats::predict(rf.fit, test_match, type = "prob")[,"Matched"]
lift_results$XGB <- stats::predict(xgb.fit, test_match, type = "prob")[,"Matched"]
#lift_results$CAT <- stats::predict(CAT.fit, x2, type = "prob")[,"Matched"]
lift_results$Nnet <- stats::predict(nnet.fit, test_match, type = "prob")[,"Matched"]
tmp <- lift_results
colnames(tmp) <- c("Actual Class", "Logistic Regression", "Lasso", "Random Forest", "XGBoost", "Neural Network")
knitr_table_html(data = head(tmp), caption = "Actual Class Results versus the Probability Results of Various Models on the Test Data Set")
Actual Class Results versus the Probability Results of Various Models on the Test Data Set

Actual Class    Logistic Regression   Lasso   Random Forest   XGBoost   Neural Network
Not_Matched     0.547                 0.567   0.504           0.335     0.212
Matched         0.705                 0.567   0.606           0.581     0.633
Matched         0.579                 0.567   0.600           0.389     0.310
Matched         0.616                 0.567   0.638           0.381     0.769
Matched         0.495                 0.567   0.524           0.242     0.176
Not_Matched     0.665                 0.567   0.644           0.521     0.406
rm(tmp)

Plotting the Predicted Probabilities

Lift plot

A lift plot is a performance visualization used in binary classification problems to evaluate the effectiveness of a predictive model. It is a plot of the proportion of positive class cases that are correctly identified by the model, as a function of the proportion of the total sample size used to make the predictions. The lift plot provides a way to compare the performance of the model against a random selection process in which the positive cases are chosen by chance.

In the code below, the lift function from the caret package is used to create a lift plot for the results of multiple binary classifiers (Logit, Lasso, RF, XGB, Nnet). The lift plot shows the improvement in the proportion of positive-class cases that are correctly identified by each classifier compared to a random selection process. The lift_results data frame is passed as the input data, the Class variable is converted into a factor, and the resulting lift object is stored in lift_obj.

The dotted line on the lift plot that moves upward at a 45 degree angle and then plateaus horizontally is called the random chance line. It represents a situation where the model is making random predictions, so the ratio of true positive cases to the total number of positive cases is constant regardless of the threshold used. The line starts at the origin (0,0) and increases at a 45-degree angle, meaning that the rate of true positive cases is equal to the rate of false positive cases. The line then plateaus horizontally, meaning that the rate of true positive cases stops increasing, regardless of the threshold used. The lift plot compares the performance of a model to this random chance line, and if the model’s lift curve is above the random chance line, it means that the model is doing better than random chance.
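As a concrete check on what the lift curve summarizes, the top-decile lift can also be computed by hand from the lift_results data frame built above; this is a sketch, and the 10% cut-off is an arbitrary illustration:

# Hand-computed lift in the top decile of logistic-regression scores (a sketch).
top_frac <- 0.10
n_top    <- ceiling(nrow(lift_results) * top_frac)
top_rows <- order(lift_results$Logit, decreasing = TRUE)[seq_len(n_top)]
rate_top <- mean(lift_results$Class[top_rows] == "Matched")  # match rate among the top 10%
rate_all <- mean(lift_results$Class == "Matched")            # overall match rate
rate_top / rate_all  # lift; 1 means no better than random selection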

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####   Lift plot. Out-of-sample. ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

# Set caret theme for lattice plots
invisible(lattice::trellis.par.set(caretTheme()))

# Create lift object using caret::lift function
# uses the caret::lift function to create a lift object using the Class variable as the response variable and the predicted probabilities of Match Status (Logit, Lasso, RF, XGB, and Nnet) as predictors.


#lift_results$Class <- ifelse(test_match$Match_Status == "Matched", 1, 0)
lift_results$Class <- as.factor(lift_results$Class)

lift_obj <- caret::lift(Class ~ Logit + Lasso + RF + XGB + Nnet, 
                        data = lift_results)

# Create ggplot object
a <- ggplot2::ggplot(lift_obj)

# Save plot to file
ggplot2::ggsave(a, 
                file = "output/fig/lift_obj_out_of_sample.tiff",
                units = "cm",
                width = 20, 
                height = 20, 
                dpi = 800)

# Convert the ggplot object to a plotly object
a_plotly <- ggplotly(a)

# Show the plotly object
a_plotly

This graph shows the scatterplot of predicted probabilities of the matching status of the test data. The x-axis represents the predicted probability of the match status, and the y-axis represents the actual class. The points in the plot represent the relationship between the predicted and actual values for the test data. The plot helps to visually inspect the performance of the model by showing how well the predicted values match the actual class.

# Scatterplot of predicted probabilities
scatterplot <- ggplot2::ggplot(lift_results, aes(x = Logit, y = Class)) +
  geom_point() +
  ggtitle("Scatterplot of Probabilities of Matching Status (test data)") +
  xlab("Predicted Probability of Match Status") +
  ylab("Actual Class")

ggplotly(scatterplot)

Calibration plot

A calibration plot is used to assess the quality of predicted probabilities generated by a classification model. In the context of comparing five models (Logit, Lasso, RF, XGB, Nnet), a calibration plot on out-of-sample or out-of-bag data provides a visual representation of how well the predicted probabilities align with the actual observed outcomes.

The plot represents the predicted probabilities along the x-axis and the observed outcomes along the y-axis. The plot can be broken down into multiple curves, one for each model being compared. Ideally, the plot should show a diagonal line from the bottom left to the top right, indicating that the model’s predicted probabilities are aligned with the actual outcomes. If the curves are not close to the diagonal line, it suggests that the model’s predictions are not well calibrated, and the model may be over- or under-predicting some outcomes.

By comparing the calibration plots of the five models, one can evaluate the performance of each model in terms of its predicted probabilities aligning with the actual outcomes, and select the model that is best calibrated for the specific use case.

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####   Calibration plot.  ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

cal_obj <- caret::calibration(Class ~ Logit + Lasso + RF + XGB + #CAT + 
                                Nnet, data = lift_results)

plot(cal_obj, type = "l",
  auto.key = list(
    columns = 5,
    lines = TRUE,
    points = FALSE,
    title = "",
    cex = 0.8,
    lty = 1:5,
    col = viridis(5)
  ),
  main = "Calibration Plot",
  xlab = "Predicted Probability",
  ylab = "Observed Frequency",
  lty = 1:6,
  pch = 16,
  lwd = 2)

#tm_print_save("output/fig/cal_obj_out_of_sample.tiff")

A model is considered to be well-calibrated if the predicted probabilities match the observed frequencies in the data. The goal of a calibration plot is to visualize this match between predicted probabilities and observed frequencies. A well-calibrated model will have points close to the diagonal line in the plot, indicating that the predicted probabilities are similar to the observed frequencies.

In general, the following evaluation criteria can be used to compare the calibration of different models: (1) the closer the points are to the diagonal line, the better the model's calibration; (2) the slope of the points should be similar to the slope of the diagonal line, indicating that the model is similarly accurate across probability levels; and (3) the spread of the points should be similar across probability levels, indicating that the model has similar uncertainty at different probability levels.

Based on these criteria, the model that has a calibration plot closest to the diagonal line, with a similar slope and similar spread of the points, is considered to be the best calibrated model.
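These criteria can also be checked numerically. The sketch below bins the logistic-regression probabilities stored in lift_results and compares the mean predicted probability in each bin with the observed match rate; the ten equal-width bins are an illustrative choice:

# By-hand calibration check for the logistic regression probabilities (a sketch).
calib_check <- lift_results %>%
  dplyr::mutate(bin = cut(Logit, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)) %>%
  dplyr::group_by(bin) %>%
  dplyr::summarise(mean_predicted = mean(Logit),
                   observed_rate  = mean(Class == "Matched"),
                   n = dplyr::n())
calib_check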

Pick the Best Model by Calibration Plot and Brier Score

logit = logit.fit$results$Brier[1]
lasso = lasso.best.fit$results$Brier[1]
rf = rf.fit$results$Brier[1]
xgb = xgb.fit$results$Brier[1]
#cat = CAT.fit$results$Brier[1]
nnet = nnet.fit$results$Brier[1]

Metric_value <- data.frame(logit = logit, lasso = lasso, RF = rf, XGB = xgb, #CAT = cat,
                           Nnet = nnet) # model types and names

best_model_name <- Metric_value %>% 
  # Add a "model" column taken from the row names of Metric_value.
  dplyr::mutate(model = row.names(Metric_value)) %>% 
  # Move the new "model" column to the front; the metric columns follow.
  dplyr::select(model, everything()) %>% 
  # Reshape from wide to long so each model/Brier-score pair becomes its own row,
  # with the model names collected into "metric" and the scores into "value"
  # (-model keeps the "model" column out of the gather).
  tidyr::gather(metric, value, -model) %>% 
  # Arrange in ascending order of "value", the Brier score.
  dplyr::arrange(value) %>% 
  # Keep the first row, i.e. the model with the lowest Brier score.
  dplyr::slice_head(n = 1) %>% 
  # Extract the "metric" column, which holds the winning model's name.
  dplyr::pull(metric); best_model_name
#> [1] "logit"

best_model_brier_value <- Metric_value %>% 
  dplyr::mutate(model = row.names(Metric_value)) %>% 
  dplyr::select(model, everything()) %>% 
  tidyr::gather(metric, value, -model) %>% 
  dplyr::arrange(value) %>% 
  slice_head(n = 1) %>% 
  dplyr::pull(value); best_model_brier_value
#> [1] 0.19

The model with the lowest Brier score is logit (logistic regression), with a Brier score of 0.19.
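The Brier values above come from caret's resampling on the training data. As a cross-check, a test-set Brier score can be computed directly from the probabilities saved in lift_results (a sketch):

# Test-set Brier score for each model, computed from the saved probabilities (a sketch).
brier_test <- function(prob_matched, actual) {
  mean((prob_matched - as.numeric(actual == "Matched"))^2)
}
sapply(lift_results[, c("Logit", "Lasso", "RF", "XGB", "Nnet")],
       brier_test, actual = lift_results$Class)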

Plot Calibration

A calibration plot shows the relationship between the predicted probabilities of a model and the observed frequency of the target variable. The plot starts by calculating the midpoint of each binned interval of predicted probabilities and plotting them on the x-axis. On the y-axis, the observed frequency of the target variable is calculated as the proportion of positive cases within each binned interval of predicted probabilities.

A ribbon is then added to the plot, representing the confidence interval of the observed frequency. The confidence interval is calculated using the lower and upper values, which are added to the plot using the geom_ribbon function. The blue line represents the mean of the observed frequency within each binned interval of predicted probabilities, and is added to the plot using the geom_line function.

cal_obj_best_model_brier <- caret::calibration(Class ~ Logit, data = lift_results) # I had to type in Logit by hand

# plot(cal_obj_best_model_brier, type = "l",
#   auto.key = list(
#     columns = 1,
#     lines = TRUE,
#     points = FALSE,
#     title = "",
#     cex = 0.8,
#     lty = 1:5,
#     col = viridis(5)
#   ),
#   main = "Calibration Plot",
#   xlab = "Predicted Probability",
#   ylab = "Observed Frequency",
#   lty = 1:6,
#   pch = 16,
#   lwd = 2)

# tm_print_save("output/fig/cal_obj_out_of_sample.tiff")

b <- ggplot2::ggplot(cal_obj_best_model_brier) +
  geom_ribbon(aes(ymin = Lower, ymax = Upper, x = midpoint, y = Percent), alpha = 0.2) +
  geom_line(aes(x = midpoint, y = Percent), color = "blue", linetype = 1, linewidth = 2, alpha = 0.5) +
  xlab("Predicted Probability") +
  ylab("Observed Frequency") +
  ggtitle("Logistic Regression Model on Test Data:\nCalibration Plot with Confidence Interval"); b

#plotly(b)

ggsave("only_logistic_regression_model_cal.tiff", plot = b,
                units = "cm",
                width = 20, 
                height = 20, 
                dpi = 800)
ggplotly(ggplot2::ggplot(cal_obj) +
  geom_ribbon(aes(ymin = Lower, ymax = Upper, x = midpoint, y = Percent), alpha = 0.05) +
  geom_line(aes(x = midpoint, y = Percent)) +
  xlab("Predicted Probabilities") +
  ylab("Observed Frequencies") +
  ggtitle("Calibration Plot of the Models"))

Concordance Index

Calibration measures the agreement between the predicted probabilities and the observed outcomes, while the concordance index measures the overall accuracy of the model in ranking the subjects based on the predicted risk or probability of an event. Together, they provide a comprehensive evaluation of the model’s performance in predicting the outcome.

The concordance index, also known as the c-index, is a widely used measure for evaluating the performance of binary classification models. It measures the ability of a model to correctly rank pairs of positive and negative outcomes, and is defined as the proportion of all positive-negative pairs for which the predicted probabilities produce the correct rank. The c-index ranges from 0.5 for a random model to 1 for a perfect model. A model with a c-index of 1 predicts the outcomes of all positive-negative pairs correctly, while a model with a c-index of 0.5 predicts the outcomes of positive-negative pairs randomly.
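To make the definition concrete, the sketch below computes the c-index for the logistic model by brute force over all matched/not-matched pairs; ties are ignored here, whereas InformationValue::Concordance, used below, also tracks tied pairs:

# Brute-force c-index for the logistic regression model (a sketch; ties ignored).
p_matched <- stats::predict(logit.fit, test_match, type = "prob")[, "Matched"]
pos <- p_matched[test_match$Match_Status == "Matched"]
neg <- p_matched[test_match$Match_Status == "Not_Matched"]
concordant <- outer(pos, neg, ">")  # TRUE where a matched applicant outranks a non-matched one
mean(concordant)                    # proportion of concordant pairs, roughly the c-index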

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####   Checks the AUC/concordance index for each of the models.  ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

Class =  test_match$Match_Status
Logit.pred <- stats::predict(logit.fit, test_match, type = "prob")[,"Matched"]
Lasso.pred <- stats::predict(lasso.fit, test_match, type = "prob")[,"Matched"]
RF.pred <- stats::predict(rf.fit, test_match, type = "prob")[,"Matched"]
XGB.pred <- stats::predict(xgb.fit, test_match, type = "prob")[,"Matched"]
#CAT.pred <- stats::predict(CAT.fit, test_match, type = "prob")[,"Matched"]
Nnet.pred <- stats::predict(nnet.fit, test_match, type = "prob")[,"Matched"]

# Recode the outcome as numeric for the concordance calculations (1 = Matched, 0 = Not_Matched)
Class <- test_match
Class$Match_Status <- as.numeric(ifelse(Class$Match_Status == "Matched", 1, 0))
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####   Checks the AUC/concordance index for each of the models.  ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####

logit_concor <- InformationValue::Concordance(Class$Match_Status, Logit.pred )
lasso_concor <- InformationValue::Concordance(Class$Match_Status, Lasso.pred)
RF_concor <- InformationValue::Concordance(Class$Match_Status, RF.pred)
XGB_concor <- InformationValue::Concordance(Class$Match_Status, XGB.pred)
#CAT_concor <- InformationValue::Concordance(Class$Match_Status, CAT.pred)
Nnet_concor <- InformationValue::Concordance(Class$Match_Status, Nnet.pred)

concor <- data.frame(logit = logit_concor$Concordance, lasso = lasso_concor$Concordance, RF = RF_concor$Concordance, XGB = XGB_concor$Concordance, #CAT = CAT_concor$Concordance,
                     Nnet = Nnet_concor$Concordance)
colnames(concor) <- c("Logistic Regression", "Lasso", "Random Forest", "XGBoost", "Neural Network")
knitr_table_html(concor, "Concordance Index or C-index for Each Model")
Concordance Index or C-index for Each Model
Logistic Regression   Lasso   Random Forest   XGBoost   Neural Network
0.733                 0       0.733           0.725     0.692
test_match$Match_Status <- factor(test_match$Match_Status, levels = c( "Not_Matched", "Matched"))

#This is backwards from the prior order.  
#test_match$Match_Status <- factor(test_match$Match_Status, levels=c("Match", "Not_Matched"), labels = c("Matched", "Not_Matched"))

test_match$Match_Status <- relevel(test_match$Match_Status, ref = "Not_Matched")
contrasts(test_match$Match_Status)
#>             Matched
#> Not_Matched       0
#> Matched           1
class(test_match$Match_Status)
#> [1] "factor"

GLM Model: needed for non-technical plots

if ("X" %in% colnames(test_match)) {
  test_match <- test_match %>% dplyr::select(-X)
}

if ("Year" %in% colnames(test_match)) {
  test_match <- test_match %>% dplyr::select(-Year)
}

if ("Location" %in% colnames(test_match)) {
  test_match <- test_match %>% dplyr::select(-Location)
}

Imputation

Next, we will impute missing values. This will make our data ready for various machine learning models.

#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### #### 
#####  Check for missing observations; Multivariate Imputation by Chained Equations (MICE) would impute any missing data ----
#### #### #### #### #### #### #### #### #### #### #### #### #### #### #### ####
#selected_data <- reduced_data2 #I did this because the grpreg is so screwed up above this.  

sum(is.na(test_match)) # Check for NA values
#> [1] 0
# use `select_if(is.numeric)` to restrict to numeric variables, instead of summarizing all variables
missing_data <- test_match %>% #dplyr::select_if(is.numeric) %>%
  dplyr::summarize_all(~ sum(is.na(.)) / dplyr::n()) %>% 
  tidyr::gather(key = "variable", value = "percent_missing")

# Use ggplot2 for visualization
# Use fct_reorder() to reorder x-axis, to make it more readable and make it clear that the variable is a factor
ggplot(missing_data, aes(x = forcats::fct_reorder(variable, percent_missing), y = percent_missing)) +
    geom_col(fill = "red", aes(color = I('white')), linewidth = 0.3) + 
    xlab("Variables") + 
    ylab("Percent Missing") +
    coord_flip() + 
    theme_bw() +
    ggtitle("Missing Data by Variable")

glm <- glm(Match_Status ~ .,
           family = binomial(link = logit),
           data=test_match) #Weird that this does not work with data = train_match.

Equation

The full equation is available upon request from the authors for further research purposes. This is a logistic regression equation, which models the probability of an outcome (in this case, the chance of matching into an OBGYN residency). The equation takes the form:

\[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \textbf{Variable 1} + \beta_2 \cdot \textbf{Variable 2} + \dots + \beta_n \cdot \textbf{Variable n}\] where \(p\) is the chance of matching into OBGYN residency and \(\beta_0, \beta_1, \dots, \beta_n\) are coefficients that capture the effect of each predictor variable on the log odds of the outcome; applying the inverse logit to the right-hand side converts the log odds back to a probability. The predictors are Gender, Age, and the other applicant characteristics listed in the fitted equation below, where the left-hand side labeled "Chance of Matching into OBGYN Residency" is written on this log-odds scale. Each binary predictor takes a value of 0 or 1, with 1 indicating the presence of a certain characteristic (e.g., Male, Yes, etc.).

\[\textbf{Chance of Matching into OBGYN Residency} = -0.87 - 0.29 \cdot \textbf{Gender} \text{Male} - 0.06 \cdot \text{Age} + 0.53 \cdot \textbf{US or Canadian Applicant} \text{Yes}\] \[- 0.4 \cdot \textbf{Visa Sponsorship Needed} \text{Yes} - 0.45 \cdot \textbf{Medical Education or Training Interrupted} \text{Yes} + 0.26 \cdot \textbf{Alpha Omega Alpha} \text{Yes} + 0.33 \cdot \textbf{Gold Humanism Honor Society} \text{Yes} + 0.16 \cdot \textbf{Couples Match} \text{Yes} + 0.01 \cdot \textbf{Count of Oral Presentation}\] \[ + 0.02 \cdot \textbf{Count of Peer Reviewed Journal Articles Abstracts} + 0.07 \cdot \textbf{Count of Peer Reviewed Journal Articles Abstracts} \textbf{Other than Published} + 0.03 \cdot \text{Count of Poster Presentation} - 0.06 \cdot \textbf{Meeting Name Presented} \textbf{Presented at a meeting} - 0.12 \cdot \textbf{TopNIHfunded} \text{Did not attend NIH top-funded medical school}\] \[ + 0.19 \cdot \textbf{Higher Education Degree} \text{Not a B.S. degree} - 0.13 \cdot \textbf{Language Fluency} \text{Speaks English and Another Language} - 0.19 \cdot \textbf{ACLS} \text{Yes} - 0.35 \cdot \textbf{PALS} \text{Yes} + 0.24 \cdot \textbf{Medical Training Discipline} \text{Yes} + 0.71 \cdot \textbf{Tracks Applied by Applicant 1} \text{Categorical Applicant}\] \[ + 0.3 \cdot \textbf{AMA} \text{No AMA Membership} + 0.56 \cdot \textbf{ACOG} \text{Yes} - 0.1 \cdot \textbf{Scholarship} \text{Scholarship} + 0.28 \cdot \textbf{total OBGYN letter writers} - 0.02 \cdot \textbf{reco count} - 0.01 \cdot \textbf{number of applicant first author publications}\] \[ + 0.37 \cdot \textbf{Advance Degree} \text{No Addl Degree} + 0.23 \cdot \textbf{Advance Degree} \text{Other} - 0.92 \cdot \textbf{Advance Degree} \text{PhD} + 0.27 \cdot \textbf{Type of medical school} \text{Osteopathic School} + 0.9 \cdot \textbf{Type of medical school} \text{U.S. Private School}\]
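As a worked illustration of how the fitted log odds translate into a probability, consider a hypothetical applicant evaluated with the rounded coefficients shown above; this is purely illustrative, and real predictions should come from predict(glm, newdata = ..., type = "response"):

# Hypothetical applicant: age 27, US or Canadian, categorical-track applicant,
# ACOG member, two OBGYN letter writers, everything else at its reference level
# or zero (a sketch using the rounded coefficients above).
lp <- -0.87 +        # intercept
  0.53 +             # US or Canadian Applicant: Yes
  0.71 +             # Tracks Applied by Applicant: Categorical Applicant
  0.56 +             # ACOG: Yes
  0.28 * 2 +         # two total OBGYN letter writers
  -0.06 * 27         # Age 27
plogis(lp)           # inverse logit of the log odds; roughly a 47% chance of matching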

Odds Ratio table

An odds ratio (OR) is a statistical measure that represents the strength of the association between a binary outcome and an independent variable. In this logistic regression model, the odds ratios represent the change in the odds of the outcome (binary dependent variable) per unit increase in the independent variable.

An OR of 1 indicates that there is no association between the outcome and the independent variable. An OR greater than 1 indicates that the odds of the outcome increase with an increase in the independent variable, while an OR less than 1 indicates that the odds of the outcome decrease with an increase in the independent variable.

The Pr(>|Z|) column in the table represents the p-value for each independent variable, which is used to test the hypothesis that the corresponding coefficient is equal to 0 (equivalently, that the OR is equal to 1). A low p-value (e.g., <0.01) indicates that the OR is significantly different from 1, meaning that the independent variable is associated with the outcome. The lower the p-value, the stronger the evidence for an association.
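The same odds ratios and Wald confidence intervals can be recovered directly from the fitted glm object in base R as a sanity check on the epiDisplay table below; confint.default() uses Wald intervals, so small differences from profile-likelihood intervals are expected:

# Odds ratios and Wald 95% confidence intervals straight from the glm fit (a sketch).
or_base <- exp(cbind(OR = coef(glm), confint.default(glm)))
head(round(or_base, 2))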

oR.table <- epiDisplay::logistic.display(glm, simplified = T, crude = F)
oR.table <- as.data.frame(oR.table$table)
oR.table <- oR.table %>%
  dplyr::mutate_if(is.numeric, ~ round(., 2))

oR.table <- oR.table %>%
  dplyr::mutate(`Pr(>|Z|)` = ifelse(`Pr(>|Z|)` < 0.01, "<0.01", round(`Pr(>|Z|)`, 2))) %>%
  dplyr::arrange(`Pr(>|Z|)`) %>%
  dplyr::mutate_if(is.numeric, ~ format(., nsmall = 2, scientific = FALSE))

Although this table may be appropriate for technical audiences familiar with logistic regression, non-technical audiences will not understand the meaning behind each of the columns. Furthermore, the numbers in the table can be taxing on the eyes, making it difficult to absorb insights from the model. We can aid the audience by swapping out the table for a caterpillar plot of the predictors (see below).

colnames(oR.table) <- c("Odds Ratio", "Lower 95% Confidence Interval", "Upper 95% Confidence Interval", "P-value")

kable(oR.table, 
      format = "html", 
      escape = F,
      caption = "Odds Ratio Predicting Match Status, Ordered by P-value", # %>% 
      align = c("l", "c", "c", "c")) %>% 
  kable_styling(full_width = F, 
                position = "center", 
                font_size = 12)
Odds Ratio Predicting Match Status, Ordered by P-value
Variable   Odds Ratio   Lower 95% Confidence Interval   Upper 95% Confidence Interval   P-value
GenderMale 0.75 0.62 0.92 <0.01
Age 0.94 0.92 0.97 <0.01
Medical_Education_or_Training_InterruptedYes 0.64 0.51 0.80 <0.01
Gold_Humanism_Honor_SocietyYes 1.40 1.14 1.71 <0.01
Tracks_Applied_by_Applicant_1Categorical Applicant 2.03 1.60 2.58 <0.01
ACOGYes 1.74 1.39 2.19 <0.01
total_OBGYN_letter_writers 1.32 1.20 1.46 <0.01
AMANo AMA Membership 1.35 1.08 1.69 0.01
Research_exp_count 1.07 1.01 1.14 0.02
ACLSYes 0.83 0.69 0.99 0.03
Alpha_Omega_AlphaYes 1.30 1.01 1.68 0.04
Count_of_Poster_Presentation 1.03 1.00 1.07 0.06
Visa_Sponsorship_NeededYes 0.67 0.43 1.04 0.07
Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published 1.07 0.99 1.16 0.07
Higher_Education_DegreeNot a B.S. degree 1.21 0.98 1.50 0.08
Count_of_Peer_Reviewed_Journal_Articles_Abstracts 1.02 1.00 1.05 0.1
Language_FluencySpeaks English and Another Language 0.88 0.75 1.03 0.12
PALSYes 0.70 0.44 1.13 0.14
TopNIHfundedDid not attend NIH top-funded medical school 0.89 0.75 1.05 0.17
Type_of_medical_schoolU.S. Private School 2.46 0.65 9.37 0.19
Couples_MatchYes 1.17 0.90 1.52 0.24
Type_of_medical_schoolU.S. Public School 1.91 0.50 7.26 0.34
Volunteer_exp_count 0.99 0.96 1.01 0.37
Advance_DegreePhD 0.40 0.05 3.29 0.39
ScholarshipScholarship 0.91 0.72 1.15 0.42
US_or_Canadian_ApplicantYes 1.69 0.45 6.45 0.44
Medical_Training_DisciplineYes 1.27 0.66 2.46 0.47
Advance_DegreeNo Addl Degree 1.44 0.47 4.43 0.52
Meeting_Name_PresentedPresented at a meeting 0.94 0.75 1.17 0.56
work_exp_count 0.99 0.95 1.03 0.57
Count_of_Oral_Presentation 1.01 0.96 1.07 0.63
reco_count 0.98 0.89 1.08 0.67
Advance_DegreeOther 1.26 0.40 3.98 0.69
Type_of_medical_schoolOsteopathic School 1.31 0.34 4.99 0.69
number_of_applicant_first_author_publications 0.99 0.95 1.04 0.77

Logistic Regression for a non-technical audience

Attribution: https://www.youtube.com/watch?v=svHT7H1ZykA, https://github.com/keikcaw/visualizing-logistic-regression

Disclaimer about Causality

Several of the visualizations below strongly imply a causal effect of some predictor on the outcome. However, this discussion is not about causality, and we will not discuss how to robustly evaluate causality. Although the logistic regression model estimates associations between the predictor variables and the outcome, whether a causal relationship actually exists requires careful consideration and domain knowledge surrounding the research question. You should use your judgment when considering these visualizations: if your data doesn't justify a causal interpretation, don't use them! (https://github.com/keikcaw/visualizing-logistic-regression)

We pulled out the model coefficients and created a dataframe. This dataframe will be used throughout our visualization-exploration journey. (https://github.com/keikcaw/visualizing-logistic-regression)

# Pull the model coefficients into a data frame: rownames_to_column() turns the term names into a column, mutate() adds a display-friendly copy of each name, and select() keeps and renames the estimate, standard error, z-value, and p-value columns.
coefs.df = summary(glm)$coefficients %>%
  data.frame() %>%
  tibble::rownames_to_column("parameter") %>% 
  dplyr::mutate(pretty.parameter = parameter) %>% #case_when(parameter == "(Intercept)" ~ "Intercept")) %>%
  dplyr::select(parameter, pretty.parameter, est = Estimate, se = Std..Error,
                z = z.value, p = Pr...z..)
coefs.df$parameter <- gsub("_", " ", coefs.df$parameter)
coefs.df$pretty.parameter <- gsub("_", " ", coefs.df$parameter)

#https://github.com/keikcaw/visualizing-logistic-regression/blob/main/R_code/fit_model.R
#Color Palette
good.color = "#0571B0"
neutral.color = "gray"
bad.color = "#CA0020"

Change in Log odds

The characteristic which makes this plot more visually appealing than a table is that the numbers are isolated in one place: the x-axis. However, using the raw outputs from the logistic regression means that the magnitude of effect is in the log odds scale. This visual is perfect for audiences who know exactly what a log odds is, and problematic for non-technical audiences who do not. The visual can leave your non-technical audience member wondering what a 0.4 change in the log odds even means. Is the difference between a change in the log odds of 0.4 and 0.8 big or small? (https://github.com/keikcaw/visualizing-logistic-regression)

#Fresh glm
glm <- glm(Match_Status ~ .,
           family = binomial(link = logit),
           data=test_match) #Weird that this does not work with data = train_match.  

coefs.df_asdf = summary(glm)$coefficients %>%
  data.frame() %>%
  tibble::rownames_to_column("parameter") %>% 
  dplyr::mutate(pretty.parameter = parameter) %>% 
  #case_when(parameter == "(Intercept)" ~ "Intercept") %>%
  dplyr::rename(est = Estimate, 
                se = Std..Error,
                p = Pr...z..)
coefs.df_asdf$parameter <- gsub("_", " ", coefs.df_asdf$parameter)
coefs.df_asdf$pretty.parameter <- gsub("_", " ", coefs.df_asdf$parameter)
log.odds.p = coefs.df_asdf %>%
  filter(parameter != "(Intercept)") %>%
  mutate(
    pretty.parameter = fct_reorder(pretty.parameter, est),
    lower.95 = est + (qnorm(0.025) * se),
    lower.50 = est + (qnorm(0.25) * se),
    upper.50 = est + (qnorm(0.75) * se),
    upper.95 = est + (qnorm(0.975) * se),
    signif = case_when(p > 0.05 ~ "Not significant",
                       est > 0 ~ "Positive",
                       est < 0 ~ "Negative"),
    signif = fct_relevel(signif, "Positive",
                         "Not significant",
                         "Negative")
  ) %>%
  ggplot(aes(x = pretty.parameter, color = signif)) +
  geom_linerange(aes(ymin = lower.95,
                     ymax = upper.95),
                 size = 1) +
  geom_linerange(aes(ymin = lower.50,
                     ymax = upper.50),
                 size = 2) +
  geom_point(aes(y = est), size = 3) +
  geom_hline(yintercept = 0) +
  scale_color_manual(
    "Relationship to\nlog odds of matching",
    values = c(good.color, neutral.color, bad.color)
  ) +
  labs(x = "", y = "Change in log odds",
       title = "Estimated relationships between\nApplicant characteristics\nand log odds of matching") +
  coord_flip(clip = "off")
log.odds.p

###

Rather than trying to have your audience understand log odds, the simplest adjustment you could make is to relabel the x-axis. This second visual is exactly the same as the first, with the only difference being the x-axis: instead of showing the log odds scale explicitly, it simply indicates whether the chance of matching is higher, the same, or lower.

# #Secret log odds
  
secret.log.odds.p = coefs.df_asdf %>%
  filter(parameter != "(Intercept)") %>%
  mutate(
    pretty.parameter = fct_reorder(pretty.parameter, est),
    lower.95 = est + (qnorm(0.025) * se),
    lower.50 = est + (qnorm(0.25) * se),
    upper.50 = est + (qnorm(0.75) * se),
    upper.95 = est + (qnorm(0.975) * se),
    signif = case_when(p > 0.05 ~ "Not significant",
                       est > 0 ~ "Positive",
                       est < 0 ~ "Negative"),
    signif = fct_relevel(signif, "Positive",
                         "Not significant",
                         "Negative")) %>%
  ggplot(aes(x = pretty.parameter, color = signif)) +
  geom_linerange(aes(ymin = lower.95, ymax = upper.95), size = 1) +
  geom_linerange(aes(ymin = lower.50, ymax = upper.50), size = 2) +
  geom_point(aes(y = est), size = 3) +
  geom_hline(yintercept = 0) +
  scale_y_continuous(
    breaks = c(-1, 0, 1),
    labels = c("← Lower",
               "Same",
               "Higher →")
  ) +
  scale_color_manual(
    "Relationship to\nlog odds of matching",
    values = c(good.color, neutral.color, bad.color)
  ) +
  labs(x = "", y = "Chance of matching",
       title = "Estimated relationships between\nApplicant characteristics\nand chance of matching") +
  coord_flip(); secret.log.odds.p

###

This graph is perfect if your audience is only interested in knowing which variables (if any) are related to the outcome, and what the sign of each relationship is. Within limits, it can also show which variables are more strongly related to the outcome than others (depending on how continuous predictors are scaled). However, relabeling the axis this way means that there are bands of different lengths but no unit of measurement to describe their absolute effect size. Therefore, just like the first graph, we run into the same issue: the non-technical audience member still will not get a sense of the overall magnitude of effect.

Probability relative to some baseline

Your audience may not be familiar with log odds or odds ratios, but they are certainly familiar with probabilities. We’ve already discussed the primary obstacle to translating logistic regression coefficients into probabilities: the change in percentage points depends on the baseline starting value. But if we can choose an appropriate baseline probability, we can present our model coefficients on a probability scale after all.
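A small numeric illustration of that baseline dependence (hypothetical values, not model output): adding the same log-odds coefficient to different baselines shifts the probability by different amounts, with the largest change near the middle of the scale.

# The same 0.5 log-odds increase applied to three different baseline probabilities.
coef_example <- 0.5                  # hypothetical log-odds coefficient
baselines    <- c(0.10, 0.50, 0.90)  # baseline probabilities
shifted      <- plogis(qlogis(baselines) + coef_example)
round(shifted - baselines, 3)        # roughly +0.055, +0.122, +0.037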

One approach is to use the overall intercept of the model as a baseline. We can then add each coefficient individually to the intercept to discover the predicted probability of matching for a baseline applicant with one characteristic changed:

# Get the intercept value from the "coefs.df_asdf" data frame
intercept = coefs.df_asdf$est[coefs.df_asdf$parameter == "(Intercept)"]

# Calculate the probabilities relative to the baseline by filtering out the "(Intercept)" parameter, reordering the levels of the "pretty.parameter" factor by the estimated values, and calculating the confidence intervals
prob.baseline.p = coefs.df_asdf %>%
  filter(parameter != "(Intercept)") %>%
  mutate(
    pretty.parameter = fct_reorder(pretty.parameter, est),
    lower.95 = est + (qnorm(0.025) * se),
    lower.50 = est + (qnorm(0.25) * se),
    upper.50 = est + (qnorm(0.75) * se),
    upper.95 = est + (qnorm(0.975) * se),
    signif = case_when(
      p > 0.05 ~ "Not significant",
      est > 0 ~ "Positive",
      est < 0 ~ "Negative"
    ),
    signif = fct_relevel(signif, "Positive", "Not significant", "Negative")
  ) %>%
  # Transform the estimates and confidence intervals back to probabilities using the invlogit function
  mutate(
    across(
      matches("est|lower|upper"),
      ~ invlogit(. + intercept)
    )
  ) %>%
  # Create a ggplot object to visualize the probabilities by "pretty.parameter" and "signif"
  ggplot(aes(x = pretty.parameter, color = signif)) +
  # Add confidence intervals for each "pretty.parameter" level
  geom_linerange(aes(ymin = lower.95, ymax = upper.95), size = 1) +
  geom_linerange(aes(ymin = lower.50, ymax = upper.50), size = 2) +
  # Add points for the estimated probabilities for each "pretty.parameter" level
  geom_point(aes(y = est), size = 3) +
  # Add a horizontal line for the baseline probability
  geom_hline(yintercept = invlogit(intercept)) +
  # Set the y-axis to show probabilities and limit it from 0 to 1
  scale_y_continuous(limits = c(0, 1), labels = scales::percent_format()) +
  # Set the color scale for the "signif" factor
  scale_color_manual(
    "Relationship to\nprobability of matching",
    values = c(good.color, neutral.color, bad.color)
  ) +
  # Set the axis and plot titles, and flip the coordinates to show the "pretty.parameter" levels vertically
  labs(
    x = "",
    y = "Probability of matching",
    title = "Estimated relationships between\nApplicant characteristics\nand probability of matching"
  ) +
  coord_flip() +
  # Set the theme to a basic white and black scheme
  theme_bw(); prob.baseline.p

This graph includes confidence intervals around the predicted probabilities, just like the confidence intervals around the coefficients in the earlier graphs. Alternatively, we can use arrows to emphasize that we’re showing predicted differences from a baseline:

#Probabilities relative to a baseline: arrows
prob.baseline.arrows.p = coefs.df_asdf %>%
  filter(parameter != "(Intercept)") %>%
  mutate(pretty.parameter = fct_reorder(pretty.parameter, est),
         signif = case_when(p > 0.05 ~ "Not significant",
                            est > 0 ~ "Positive",
                            est < 0 ~ "Negative"),
         signif = fct_relevel(signif, "Positive",
                              "Not significant",
                              "Negative"),
         est = invlogit(est + intercept)) %>%
  ggplot(aes(x = invlogit(intercept),
             xend = est,
             y = pretty.parameter,
             yend = pretty.parameter,
             color = signif)) +
  geom_segment(
    size = 1,
    arrow = arrow(length = unit(0.1, "in"),
                  type = "closed")
  ) +
  geom_vline(xintercept = invlogit(intercept)) +
  scale_x_continuous(
    limits = c(0, 1),
    labels = scales::percent_format()
  ) +
  scale_color_manual("Relationship to\nprobability of matching",
                     values = c(good.color, neutral.color, bad.color)) +
  labs(x = "Probability of matching", y = "",
       title = "Estimated relationships between\nApplicants characteristics\nand probability of matching") +
  theme_bw()
prob.baseline.arrows.p

Combined plot with research experience count variable

This graph is a combined histogram, rug plot, and logistic regression prediction. The x-axis represents the number of research experiences and the y-axis represents the probability of matching into a residency. The graph shows the distribution of research experiences between matched and non-matched applicants, with the histogram bars representing the number of applicants in each bin and the rug plot showing the individual observations. The logistic regression prediction line represents the predicted probability of matching into a residency based on the number of research experiences.
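The plot code below expects two helper objects, dat and h, that are not built in this section. The sketch below constructs plausible versions, assuming Research_exp_count is the predictor of interest and one-unit histogram bins (both are assumptions):

# dat: one row per applicant with x1 = research experience count and y = 1 if matched.
dat <- data.frame(x1 = test_match$Research_exp_count,
                  y  = as.numeric(test_match$Match_Status == "Matched"))

# h: per-bin histogram heights, drawn upward from 0 for unmatched applicants
# and downward from 1 for matched applicants.
brks <- seq(0, max(dat$x1) + 1, by = 1)
h <- do.call(rbind, lapply(0:1, function(grp) {
  cnt <- hist(dat$x1[dat$y == grp], breaks = brks, plot = FALSE)$counts
  data.frame(breaks = head(brks, -1),
             y      = grp,
             pct    = if (grp == 0) cnt / sum(cnt) else 1 - cnt / sum(cnt))
}))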

ggplot() +
    geom_segment(data=h, size=4, show.legend=FALSE,
                 aes(x=breaks, xend=breaks, y=y, yend=pct, colour=factor(y))) +
    geom_segment(dat=dat[dat$y==0,], aes(x=x1, xend=x1, y=0, yend=-0.02), size=0.2, colour="grey30") +
    geom_segment(dat=dat[dat$y==1,], aes(x=x1, xend=x1, y=1, yend=1.02), size=0.2, colour="grey30") +
    geom_line(data=data.frame(x=seq(-2,20,0.1), 
                              y=predict(glm(y ~ x1, family="binomial", data=dat), 
                                        newdata=data.frame(x1=seq(-2,20,0.1)),
                                        type="response")), 
              aes(x,y), colour="grey50", lwd=1) +
    scale_x_discrete() +
    theme_bw(base_size=12)

DALEX Variable Importance Plot

# #Random forest
# library(DALEX)
# 
# # Create an explainer object for the random forest model
# explainer_rf <- DALEX::explain(model = rf.fit, 
#                                y = dat$Match_Status, 
#                                label = "Random Forest")
# 
# #set seed for reproducibility
# set.seed(1978)
# 
# #compute mean variable importance 
# (vip.rf <- model_parts(explainer = explainer_rf, 
#                    loss_function = loss_root_mean_square,
#                                B = 50,
#                                type = "difference"))
# 
# #plot the mean variable importance over 50 permutations
# p <- plot(vip.rf) +
#   ggtitle("Mean variable-importance over 50 permutations", "") + 
#   ylab("Variable Importance"); p
#Logistic regression

explainer_glm <- DALEX::explain(model = logit.fit, 
                               y = dat$Match_Status, 
                               label = "logistic regression")

set.seed(1980)
(vip.glm <- DALEX::model_parts(explainer = explainer_glm, 
                   loss_function = loss_root_mean_square,
                               B = 50,
                            type = "difference"))

p <- plot( vip.glm) +
  ggtitle("Mean variable-importance over 50 permutations", ""); p
#Lasso regression
explainer_lasso <- DALEX::explain(model = lasso.fit, 
                               y = dat$Match_Status, 
                               label = "lasso regression")

set.seed(1980)
(vip.lasso <- model_parts(explainer = explainer_lasso, 
                   loss_function = loss_root_mean_square,
                               B = 50,
                            type = "difference"))

p <- plot(vip.lasso) +
  ggtitle("Mean variable-importance over 50 permutations", "") 
#Xgb

explainer_xgb <- DALEX::explain(model = xgb.fit, 
                               y = dat$Match_Status, 
                               label = "XGB regression")

set.seed(1980)
(vip.xgb <- model_parts(explainer = explainer_xgb, 
                   loss_function = loss_root_mean_square,
                               B = 50,
                            type = "difference"))
p <- plot(vip.xgb) +
  ggtitle("Mean variable-importance over 50 permutations", "") 
#nnet
explainer_nnet <- DALEX::explain(model = nnet.fit, 
                               y = dat$Match_Status, 
                               label = "Neural net regression")

set.seed(1978)

# Inverse logit (sigmoid): maps the raw network output onto the probability scale
inv_logit <- function(x) exp(x)/(1+exp(x))
custom_loss <- function(observed, predicted){
   sum((observed - inv_logit(predicted))^2)
}

(vip.nnet <- DALEX::model_parts(explainer = explainer_nnet, 
                   loss_function = custom_loss,
                               B = 50,
                            type = "difference"))

p <- plot(vip.nnet) +
  ggtitle("Mean variable-importance over 50 permutations", "") 
#code_folder <- "~/Dropbox (Personal)/Nomogram/nomogram/Code/"
#data_folder <- "~/Dropbox (Personal)/Nomogram/nomogram/data/"

### Nomogram 
# rsconnect::setAccountInfo(name = 'mufflyt',
#                           token = Sys.getenv("rsc_token"),
#                           secret = Sys.getenv("rsc_secret"))
# rsconnect::accountInfo(name = "mufflyt")

# Keep the rsconnect token and secret in .Renviron (or another secret store) rather than
# hard-coding them in the script; Sys.getenv() below reads them at run time.
Sys.setenv(rsc_account_name = "mufflyt")

rsconnect::setAccountInfo(
  name = Sys.getenv("rsc_account_name"),
  token = Sys.getenv("rsc_token"),
  secret = Sys.getenv("rsc_secret")
)

Using the Harrelverse

Read in the split data and the previously trained models (every model fit on train_match ends in .fit, e.g., 'logit.fit')

train_match <- read.csv(file = "output/csv/train_dat_2017_2018_years.csv") %>% 
  dplyr::select(-X, -Year, -Location)

# Recode the "Type_of_medical_school" variable
train_match$Type_of_medical_school <- ifelse(train_match$Type_of_medical_school == "Osteopathic School,International School",
                                   "Osteopathic School",
                                   train_match$Type_of_medical_school)

test_match <- read.csv(file = "output/csv/test_dat_2019_2020_years.csv") %>%
  dplyr::select(-X, -Year, -Location)

# Recode the "Type_of_medical_school" variable
test_match$Type_of_medical_school <- ifelse(test_match$Type_of_medical_school == "Osteopathic School,International School",
                                   "Osteopathic School",
                                   test_match$Type_of_medical_school)

Create model with lrm

The lrm function in the rms package is a flexible tool for fitting binary logistic regression models. It allows for the inclusion of interactions and nonlinear effects, as well as penalized estimation. In this code, the model is fit with the formula Match_Status ~ ., which includes all predictors except Type_of_medical_school; that variable is dropped to avoid singularity issues and because it may not be an important predictor of the outcome. The x = TRUE and y = TRUE options store the design matrix and response vector in the fit object, which the validate() and calibrate() steps below require.

# Specify the distribution of the data for use in later steps
t.data <- rms::datadist(train_match)
options(datadist = t.data)

# Fit a logistic regression model with formula Match_Status ~ . (all predictors except Type_of_medical_school)
fit <- rms::lrm(
  formula = Match_Status ~ ., 
  data = train_match %>% dplyr::select(-Type_of_medical_school), # Remove Type_of_medical_school to avoid singularity
  x = TRUE, # Include information matrix of predictors
  y = TRUE # Include information about the response variable
)
knitr::kable(broom::tidy(fit$stats), 
             digits =2, 
             caption = "Performance statistics of the Training Model")
Performance statistics of the Training Model
Statistic   Value
Obs 3121.00
Max Deriv 0.00
Model L.R. 875.67
d.f. 32.00
P 0.00
C 0.78
Dxy 0.56
Gamma 0.56
Tau-a 0.28
R2 0.33
R2(3121) 0.24
R2(32,3121) 0.24
R2(2298.2) 0.32
R2(32,2298.2) 0.31
Brier 0.18
g 1.45
gr 4.27
gp 0.28
plot(anova(fit), cex=0.5, cex.lab=0.4, cex.axis = 0.4)

plot(summary(fit), cex=0.5, cex.lab=0.7, cex.axis = 0.7)

oddsratios <- rms::orm(Match_Status ~ .,
                       data = train_match %>% dplyr::select(-Type_of_medical_school),
                       x = TRUE,
                       y = TRUE); oddsratios
#> Logistic (Proportional Odds) Ordinal Regression Model
#>  
#>  rms::orm(formula = Match_Status ~ ., data = train_match %>% dplyr::select(-Type_of_medical_school), 
#>      x = TRUE, y = TRUE)
#>  
#>                         Model Likelihood               Discrimination    Rank Discrim.    
#>                               Ratio Test                      Indexes          Indexes    
#>  Obs          3121    LR chi2     875.67    R2                  0.328    rho     0.483    
#>   Matched     1771    d.f.            32    R2(32,3121)         0.237                     
#>   Not_Matched 1350    Pr(> chi2) <0.0001    R2(32,2298.2)       0.307                     
#>  Distinct Y      2    Score chi2  779.37    |Pr(Y>=median)-0.5| 0.229                     
#>  Median Y        1    Pr(> chi2) <0.0001                                                  
#>  max |deriv| 0.007                                                                        
#>  
#>                                                                         Coef    S.E.   Wald Z Pr(>|Z|)
#>  Intercept                                                              -0.3986 0.8368 -0.48  0.6338  
#>  Gender=Male                                                             0.1486 0.1088  1.37  0.1719  
#>  Age                                                                     0.0877 0.0161  5.44  <0.0001 
#>  US_or_Canadian_Applicant=Yes                                           -0.9489 0.1503 -6.32  <0.0001 
#>  Visa_Sponsorship_Needed=Yes                                             0.8512 0.2107  4.04  <0.0001 
#>  Medical_Education_or_Training_Interrupted=Yes                           0.6540 0.1240  5.28  <0.0001 
#>  Alpha_Omega_Alpha=Yes                                                  -0.0175 0.1571 -0.11  0.9113  
#>  Gold_Humanism_Honor_Society=Yes                                        -0.3072 0.1336 -2.30  0.0214  
#>  Couples_Match=Yes                                                       0.0741 0.1577  0.47  0.6385  
#>  Count_of_Oral_Presentation                                             -0.0199 0.0297 -0.67  0.5020  
#>  Count_of_Peer_Reviewed_Journal_Articles_Abstracts                      -0.0437 0.0166 -2.63  0.0086  
#>  Count_of_Peer_Reviewed_Journal_Articles_Abstracts_Other_than_Published -0.1020 0.0511 -2.00  0.0458  
#>  Count_of_Poster_Presentation                                            0.0420 0.0250  1.68  0.0929  
#>  Meeting_Name_Presented=Presented at a meeting                          -0.1406 0.0929 -1.51  0.1303  
#>  TopNIHfunded=Did not attend NIH top-funded medical school               0.3656 0.1027  3.56  0.0004  
#>  Higher_Education_Degree=Not a B.S. degree                               0.0731 0.0916  0.80  0.4252  
#>  Language_Fluency=Speaks English and Another Language                   -0.0009 0.0987 -0.01  0.9925  
#>  ACLS=Yes                                                                0.2353 0.0934  2.52  0.0118  
#>  PALS=Yes                                                               -0.0095 0.1835 -0.05  0.9587  
#>  Medical_Training_Discipline=Yes                                         0.3654 0.2044  1.79  0.0738  
#>  Tracks_Applied_by_Applicant_1=Categorical Applicant                    -0.2772 0.0875 -3.17  0.0015  
#>  AMA=No AMA Membership                                                   0.0205 0.0916  0.22  0.8227  
#>  ACOG=Yes                                                               -0.3543 0.0913 -3.88  0.0001  
#>  Scholarship=Scholarship                                                -0.0396 0.0985 -0.40  0.6875  
#>  total_OBGYN_letter_writers                                             -0.4081 0.0517 -7.90  <0.0001 
#>  reco_count                                                             -0.1685 0.0783 -2.15  0.0314  
#>  number_of_applicant_first_author_publications                          -0.0423 0.0224 -1.89  0.0593  
#>  Advance_Degree=No Addl Degree                                          -0.2500 0.5965 -0.42  0.6751  
#>  Advance_Degree=Other                                                    0.2261 0.6104  0.37  0.7111  
#>  Advance_Degree=PhD                                                     -0.6160 0.8560 -0.72  0.4717  
#>  work_exp_count                                                          0.0194 0.0152  1.28  0.2007  
#>  Volunteer_exp_count                                                    -0.0046 0.0101 -0.45  0.6519  
#>  Research_exp_count                                                     -0.0722 0.0260 -2.78  0.0055  
#> 

Bootstrap Internal Validation

Bootstrap validation involves repeatedly resampling the dataset to estimate the model’s performance and variability. This method has several benefits, including its ability to estimate the bias and variance of the model, its flexibility to work with small or unbalanced datasets, and its simplicity to implement. However, bootstrapping can be computationally intensive and may produce biased estimates if the resampled datasets are not representative of the population. Cross-validation, on the other hand, involves partitioning the data into training and validation sets and iteratively estimating the model on different partitions. This method can also estimate the model’s performance and generalize well to new data, but may be more sensitive to the particular partitioning scheme used and may not work well with small datasets or imbalanced classes.

#page 284 in Harrell's book, page 301
set.seed(1978)

v_fit <- rms::validate(
  fit = fit, # The fitted lrm model to be validated
  method = "boot", # The resampling method to use (bootstrapping)
  B = 1000, # The number of bootstrap samples to draw
  pr = FALSE, # Whether to print the results of each resample
  digits = 2, # The number of digits to display in output
  bw = FALSE, # Whether to repeat fast backward variable selection within each resample
  tol = 1e-12, # Singularity tolerance used in matrix inversion
  maxdim = 5, # Additional control argument passed through by validate()
  maxiter = 100, # Maximum number of iterations for the refitted models
  dxy = TRUE # Report Somers' Dxy rank correlation
)

Calibration of the fit model

The calibrate() function in the rms package estimates an overfitting-corrected calibration curve for the fitted logistic regression model by resampling. The code specifies bootstrapping as the resampling method and uses lowess smoothing for the calibration curve. The results show how well the model's predicted probabilities match the actual outcomes; plotting the calibration object also reports summary statistics such as the mean absolute calibration error.
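The calibration object cal_monica plotted below is not created in this section; a minimal sketch of how it could be built from the lrm fit with rms::calibrate is shown here, where B = 1000 and the seed are illustrative choices:

# Bootstrap, overfitting-corrected calibration curve for the lrm fit (a sketch).
set.seed(1978)
cal_monica <- rms::calibrate(fit, method = "boot", B = 1000)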

# Plot the calibration curves from the calibrated logistic regression model
graphics::plot(
  cal_monica, # The calibration object to be plotted
  main = "In Sample/Bootstrap Overfitting-Corrected Calibration Curve", # The main title of the plot
  xlim = c(0,1), # The limits of the x-axis
  ylim = c(0,1), # The limits of the y-axis
  xlab = "Nomogram-Predicted Probability of Matching", # The label for the x-axis
  ylab = "Actual Matching (proportion)", # The label for the y-axis
  lwd = 2, # The width of the lines in the plot
  lty = 1, # The line type of the calibration curve
  #errbar.col = c(rgb(0,118,192,maxColorValue=255)), # The color of the error bars
  legend = FALSE, # Whether to display a legend for the plot
  subtitles = FALSE, # Whether to display subtitles for each curve
  cex.lab = 1.2, cex.axis = 1, cex.main = 1, cex.sub = 0.6, # The size of the plot labels and title
  scat1d.opts = list(nhistSpike = 240, side = 3, frac = 0.08, lwd = 1, nint = 50) # Options for the scatterplot display
)
#> 
#> n=3121   Mean absolute error=0.015   Mean squared error=0.00044
#> 0.9 Quantile of absolute error=0.037
# Add lines to the plot to represent the ideal calibration curve and the bias-corrected curve
lines(cal_monica, lwd = 2, lty = 3, col = c(rgb(255,0,0,maxColorValue=255))) # Add the bias-corrected line
abline(0, 1, lty = 5, lwd = 2, col = c(rgb(0,0,255,maxColorValue= 255))) # Add the ideal line

# Add a legend to the plot
legend("bottomright", cex = 0.8, legend = c("Apparent", "Bias-corrected","Ideal"),
col = c("red", "black", "blue"), lwd = c(1, 1, 1), lty = c(1, 1, 2))

Closing Time Procedures

# parallel::stopCluster(cl)
# #grDevices::dev.off()
# invisible(gc())
# sessioninfo::session_info()
# beepr::beep(sound = 4)
# if (!interactive())
#   q("no")
report::report_packages(include_R = TRUE)