library(tidyverse)
library(gt)
library(gtsummary)Machine Learning Demo
Predicting hospital readmissions after discharge
Readmissions are situations where you are discharged from hospital and then readmitted for the same or related causes within 30, 60, 0r 90 days. Readmission rates are used as indicators of hospital quality, with a low rate preferable. Readmitting a patient can mean followup care was not properly administered or that the patient was not fully treated the first time they were in hospital.
Accordingly, predicting which patients are at higher risk for readmission can help doctors and hospitals manage patient care to prevent readmissions, thereby lowering costs, improving overall care quality, and improving their hospital ranking.
Here I demonstrate techniques for creating machine learning models to predict readmissions using synthetic patient data as actual patient data is not yet available in sufficient quantities.
Data
I created data using random sampling from normal, skewed, and binomial distributions, using logical conditions depending on the simulated variable. To get the R code I used, simply expand the code block then copy and paste into a script in R Studio. I first loaded the Tidyverse libraries. Tidyverse is a set of packages used for data manipulation. GT and GTSummary are packages for creating tables in Quarto documents. (Note: This is written in RStudio using Quarto.)
I created the synthetic data using the following code, which also created a table of summary statistics for each variable. A real data set would have many more variables (including treatments, etc) but I wanted to keep this simple.
# Create 100000 rows of random data with female = 0, male = 1 and random ages
set.seed(123) # Allows you to rerun my code and get the same numbers I did.
synthetic_df <- data.frame(gender = sample(c(0, 1), size = 100000, replace = TRUE),
age = round(runif(100000, 18, 90)))
# Add new variables based on conditions
synthetic_df <- synthetic_df %>%
mutate(heart_disease = if_else(age >= 60,
rbinom(n = 1, size = 1, prob = 0.5),
rbinom(n = 1, size = 1, prob = 0.05)),
stroke = if_else(age >= 70,
rbinom(n = 1, size = 1, prob = 0.4),
rbinom(n = 1, size = 1, prob = 0.01)),
cancer = if_else(age >= 65,
rbinom(n = 1, size = 1, prob = 0.6),
rbinom(n = 1, size = 1, prob = 0.1)),
type2_diabetes = if_else(age >=50,
rbinom(n = 1, size = 1, prob = 0.67),
rbinom(n = 1, size = 1, prob = 0.25)))
## Set up logistic distribution for BMI
age_males <- subset(synthetic_df, gender == 1) %>%
select(age)
age_females <- subset(synthetic_df, gender == 0) %>%
select(age)
# Mean BMI and standard deviation taken from
# Block et al. (2013) Population trends and variation in body mass index from 1971 to 2008 in
# the Framingham Heart Study Offspring Cohort. PLOS One 8(5):e63217
mean_females <- 27.7
sd_females <- 6.15
mean_males <- 29.0
sd_males <- 4.73
# Calculate scale parameter for logistic distribution from population standard deviation
scale_female_bmi <- sd_females / sqrt(3)
scale_male_bmi <- sd_males / sqrt(3)
# Generate the synthetic data for females
female_BMI <- rlogis(length(age_females$age),
location = mean_females + 0.01*age_females$age,
scale = scale_female_bmi)
## Replace unrealistic BMI numbers with realistic ones
female_BMI <- replace(female_BMI,
female_BMI < 18,
sample(c(18:42), 1, replace = TRUE))
female_BMI <- replace(female_BMI,
female_BMI > 42,
sample(c(18:42), 1, replace = TRUE))
# Generate male BMI numbers
male_BMI <- rlogis(length(age_males$age),
location = mean_males + 0.009*age_males$age,
scale = scale_male_bmi)
## replace unrealistic BMI numbers
male_BMI <- replace(male_BMI,
male_BMI < 20,
sample(c(19:40), 1, replace = TRUE))
male_BMI <- replace(male_BMI,
male_BMI > 40,
sample(c(19:40), 1, replace = TRUE))
## Join the two set of BMI numbers together
BMI <- c(female_BMI, male_BMI)
## Incorporate BMI into the synthetic data set
synthetic_df <- synthetic_df %>%
arrange(gender) %>%
mutate(BMI = BMI)
# Create response variable
attach(synthetic_df)
xb <- -9 + 0.25*gender + 0.005*age + 0.025*BMI + 1.75*heart_disease + 2.1*stroke + 2.5*cancer + 1.4*type2_diabetes
p <- 1/(1 + exp(-xb))
detach(synthetic_df)
synthetic_df$Readmitted <- rbinom(n = 100000, size = 1, prob = p)
# Change factor variables to actual factors, leave rest as dummy variables
synthetic_df <- synthetic_df %>%
mutate(gender = case_match(
gender,
0 ~ "Female",
1 ~ "Male"),
Readmitted = case_match(
Readmitted,
0 ~ "No",
1 ~ "Yes"
))
# Check the data
tbl_summary(synthetic_df,
by = gender)| Characteristic | Female, N = 50,1611 | Male, N = 49,8391 |
|---|---|---|
| age | 54 (36, 72) | 54 (36, 72) |
| heart_disease | 21,267 (42%) | 21,078 (42%) |
| stroke | 14,346 (29%) | 14,209 (29%) |
| cancer | 17,821 (36%) | 17,639 (35%) |
| type2_diabetes | 28,197 (56%) | 27,988 (56%) |
| BMI | 28.8 (25.0, 32.8) | 29.4 (26.4, 32.2) |
| Readmitted | 7,101 (14%) | 8,121 (16%) |
| 1 Median (IQR); n (%) | ||
You can make synthetic data using real data as a template via packages like the synthpop package. This is particularly useful when your real data set is too small to properly train machine learning models. The synthetic data will have all the characteristics of your actual data and be sufficiently large for training purposes.
Exploratory Data Analysis
Exploratory data analysis (EDA) is a critical first step in any data science project. We have several ways of conducting EDA in R—manually or automated. My favorite way? Using the dlookr package. Note: Running this code will open two tabs in your system browser.
library(dlookr)
# Diagnostic report
diagnose_web_report(synthetic_df, output_format = "html")
# EDA report
eda_web_report(synthetic_df, output_format = "html")Data analysis with base R
With that out of the way, now we can get down to the analysis. As always, we start with the simplest model that is sufficient for the data and work our way up to more complex models. In many cases, we’ll discover that the simple model does a sufficient job and we won’t need more complex models. So without further ado, let’s start with multiple logistic regression, the simplest model for this data.
Logistic Regression
R has multiple ways for regression, from base R to various implentations in the caret and tidymodels packages. Let’s take a look at the base R version and then move to the caret version.
# Turn dummy variables into factors
synthetic_df2 <- synthetic_df %>%
mutate_at(c("gender",
"heart_disease",
"stroke",
"cancer",
"type2_diabetes",
"Readmitted"), factor)
synthetic_logistic <- glm(Readmitted ~ ., data = synthetic_df2, family = "binomial")
summary(synthetic_logistic)
Call:
glm(formula = Readmitted ~ ., family = "binomial", data = synthetic_df2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3748 -0.1193 -0.0318 -0.0269 4.0315
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.782592 0.248739 -35.308 < 2e-16 ***
genderMale 0.268301 0.022638 11.852 < 2e-16 ***
age 0.007115 0.001987 3.580 0.000344 ***
heart_disease1 1.439853 0.242129 5.947 2.74e-09 ***
stroke1 2.076302 0.048165 43.108 < 2e-16 ***
cancer1 2.649781 0.141319 18.750 < 2e-16 ***
type2_diabetes1 1.278637 0.307155 4.163 3.14e-05 ***
BMI 0.022243 0.002264 9.825 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 85308 on 99999 degrees of freedom
Residual deviance: 45312 on 99992 degrees of freedom
AIC: 45328
Number of Fisher Scoring iterations: 10
Note that the coefficients R gives is log-odds ratios. Note that the coefficients are very close to the xb formula I used to create the data. So R largely figured out the original log-odds ratios I gave it. We have to take the exponential to convert them to odds ratio so we can interpret it. Here I pulled both the odds ratios (OR) and 95% confidence intervals for each variable
exp_coef <- exp(cbind(coef(synthetic_logistic), confint(synthetic_logistic)))
exp_coef 2.5 % 97.5 %
(Intercept) 0.00015338 0.000091275 2.432464e-04
genderMale 1.30774097 1.250999032 1.367090e+00
age 1.00714024 1.003225194 1.011072e+00
heart_disease1 4.22007418 2.656348872 6.894083e+00
stroke1 7.97491910 7.260198087 8.769065e+00
cancer1 14.15093225 10.835513049 1.887539e+01
type2_diabetes1 3.59174030 1.974440908 6.639641e+00
BMI 1.02249251 1.017967551 1.027042e+00
Now we can tell that men have roughly a -30.7740968% greater chance of being readmitted compared to women, being diagnosed with heart disease increased the risk by 4.2x compared to normal, stroke by 7.97x, cancer 14x, type 2 diabetes 3.6x, and increasing BMI by 2.22%.
To make a prediction from base R, we first make a new data frame with the patient data we wish to predict.
p_gender <- "Female"
p_age <- 87
p_heartdisease <- as.factor(1)
p_stroke <- as.factor(1)
p_cancer <- as.factor(1)
p_type2diabetes <- as.factor(1)
p_BMI <- 34
new_patient_data <- data.frame(gender = p_gender,
age = p_age,
heart_disease = p_heartdisease,
stroke = p_stroke,
cancer = p_cancer,
type2_diabetes = p_type2diabetes,
BMI = p_BMI)
# Predict from logistic model and giving the predicted probability
exp(predict(synthetic_logistic, newdata = new_patient_data, type = "response")) 1
1.664143
So according to this model, a 87-year-old woman with heart disease, type 2 diabetes, cancer, and an obese BMI would have a 66.4% greater chance of being readmitted over a younger person without those conditions.
Caret
Caret is a machine learning package for R, similar to Tidymodels. I’m more familiar with Caret than Tidymodels as it’s been around longer than Tidymodels and was used in my data science certification course. It has several extension packages for additional functionality, including caretEnsemble for easily making stacked models and caretForecast for time series. Let’s create the training and testing data sets for the synthetic data in caret
library(caret)
# Parallel processing to speed analysis
library(doParallel)
## One fewer than the number of cores on your computer
registerDoParallel(cores = 3)
set.seed(42)
# Split data using Readmitted as index variable. Do *before* preprocessing the data
split_data <- createDataPartition(synthetic_df$Readmitted, p = 0.8, list = FALSE)
training <- synthetic_df[split_data, ]
test <- synthetic_df[-split_data, ]
training$Readmitted <- as.factor(training$Readmitted)
test$Readmitted <- as.factor(test$Readmitted)Logistic Model
Creating models in caret is simple.
logistic_fit <- train(
form = Readmitted ~ .,
data = training,
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3),
method = "glm",
family = "binomial"
)
summary(logistic_fit$finalModel)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3748 -0.1180 -0.0345 -0.0293 3.9882
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.605289 0.258991 -33.226 < 2e-16 ***
genderMale 0.262971 0.025321 10.385 < 2e-16 ***
age 0.006782 0.002221 3.054 0.00226 **
heart_disease 1.519005 0.281718 5.392 6.97e-08 ***
stroke 2.087335 0.053889 38.734 < 2e-16 ***
cancer 2.674334 0.159671 16.749 < 2e-16 ***
type2_diabetes 1.013237 0.336543 3.011 0.00261 **
BMI 0.022449 0.002529 8.877 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 68248 on 80000 degrees of freedom
Residual deviance: 36225 on 79993 degrees of freedom
AIC: 36241
Number of Fisher Scoring iterations: 10
Note that the coefficients are a bit different than the base R logistic regression. That’s due to different engines behind the analysis. Here it appears age is not a significant predictor whereas in the base R analysis, it was significant. As before, the coefficients here are log-odds. You can get the odds ratios via the following:
coefficients <- coef(logistic_fit$finalModel)
exp_coefficients <- exp(coefficients)
coefficients_table <- data.frame(
Log_odds = coefficients,
Odds_Ratio = exp_coefficients
)
print(coefficients_table) Log_odds Odds_Ratio
(Intercept) -8.605288743 1.831347e-04
genderMale 0.262970913 1.300789e+00
age 0.006781732 1.006805e+00
heart_disease 1.519005409 4.567680e+00
stroke 2.087334858 8.063396e+00
cancer 2.674333959 1.450269e+01
type2_diabetes 1.013237454 2.754504e+00
BMI 0.022448915 1.022703e+00
Caret uses several base R functions (summary( ), coef( ), etc), which makes learning this package fairly easy if you know R. Also note how tuning and resampling are included in the model. Want to know how important variables are to the model? Try this:
# Importance scaled to 0 to 100. Use "scale = FALSE" to get unscaled importance
logistic_importance <- varImp(logistic_fit)
plot(logistic_importance)As we can see, stroke is most importance followed by cancer.
If you need an estimate of the accuracy of the model, you can get it two ways. The first uses the cross-validation from the model itself.
logistic_fit$results parameter Accuracy Kappa AccuracySD KappaSD
1 none 0.8597518 0.4700193 0.003218907 0.009892456
Caret estimated the accuracy of the logistic model to be 0.8597518 by repeated cross-validation. We can get more information on model accuracy by testing it on the test data set then creating a confusion matrix. Since we created the test set before doing any preprocessing on the data, it’s a true test of the model’s predictive abilities. Let’s look at the model predictions and probabilities first and then the confusion matrix.
logistic_pred <- predict(logistic_fit, newdata = test)
logistic_probabilities <- predict(logistic_fit, newdata = test, type = "prob")
logistic_predictions_probabilities <- data.frame(predictions = logistic_pred,
probabilities = logistic_probabilities)
head(logistic_predictions_probabilities) predictions probabilities.No probabilities.Yes
5 No 0.9986552 0.0013448219
6 No 0.9995462 0.0004537860
12 No 0.9995390 0.0004609887
15 No 0.9995784 0.0004215964
22 No 0.5264910 0.4735089972
23 No 0.9996122 0.0003877748
# Confusion matrix for logistic analyses requires factors
logistic_pred2 <- as.factor(logistic_pred)
Readmitted <- as.factor(test$Readmitted)
# Create confusion matrix using Readmitted variable
confusionMatrix(logistic_pred2, Readmitted)Confusion Matrix and Statistics
Reference
Prediction No Yes
No 15415 1259
Yes 1540 1785
Accuracy : 0.86
95% CI : (0.8552, 0.8648)
No Information Rate : 0.8478
P-Value [Acc > NIR] : 5.709e-07
Kappa : 0.4775
Mcnemar's Test P-Value : 1.207e-07
Sensitivity : 0.9092
Specificity : 0.5864
Pos Pred Value : 0.9245
Neg Pred Value : 0.5368
Prevalence : 0.8478
Detection Rate : 0.7708
Detection Prevalence : 0.8337
Balanced Accuracy : 0.7478
'Positive' Class : No
As you can see, caret includes a number of metrics by default. Overall, the model does a decent job given the synthetic data it was given. If you want a ROC curve, we have to add another package.
library(MLeval)
pred <- predict(logistic_fit, newdata = test, type = "prob")
test1 <- evalm(data.frame(pred, test$Readmitted, Group = "Logistic"))test1$rocThe ROC confirms that the logistic model is superior over the naive default model, with a ROC of 0.92.
Ensemble models
Creating ensemble models in Caret is quite simple. First, let’s make the other models.
Machine learning models
Note: This is far from all the models Caret has but should be representative of what Caret is capable of doing.
Random Forest
# Additional library for random forest
library(randomForest)
# Tuning parameters
mtry <- sqrt(ncol(training))
ntree <- 3
rf_fit <- train(Readmitted ~ .,
data = training,
method = "ranger",
metric = "Accuracy",
tuneLength = 5,
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
search = "random"),
na.action = na.omit)
rf_pred <- predict(rf_fit, newdata = test)Gradient Boosting Machines
# Note: Load doParallel if not already loaded
library(gbm)
gbm_fit <- train(Readmitted ~ .,
data = training,
method = "gbm",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3),
tuneLength = 5)Iter TrainDeviance ValidDeviance StepSize Improve
1 0.7792 nan 0.1000 0.0371
2 0.7252 nan 0.1000 0.0269
3 0.6835 nan 0.1000 0.0208
4 0.6501 nan 0.1000 0.0166
5 0.6234 nan 0.1000 0.0133
6 0.6008 nan 0.1000 0.0114
7 0.5825 nan 0.1000 0.0091
8 0.5671 nan 0.1000 0.0077
9 0.5532 nan 0.1000 0.0070
10 0.5412 nan 0.1000 0.0061
20 0.4811 nan 0.1000 0.0018
40 0.4568 nan 0.1000 0.0003
60 0.4542 nan 0.1000 -0.0000
80 0.4534 nan 0.1000 0.0001
100 0.4531 nan 0.1000 -0.0000
gbm_pred <- predict(gbm_fit, newdata = test)Support Vector Machines
svm_fit <- train(Readmitted ~ .,
data = training,
method = "svmLinear",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3),
tuneLength = 5)
svm_pred <- predict(svm_fit, newdata = test)Ensemble Model
While you can use a package like caretEnsemble to create ensemble models, I like doing it myself.
predictions_df <- data.frame(logistic_pred,
rf_pred,
gbm_pred,
svm_pred,
Readmitted = test$Readmitted)
# You can use different methods. I just like rf
combined_fit <- train(Readmitted ~ ., method = "rf", data = predictions_df)
combined_pred <- predict(combined_fit, predictions_df)
confusionMatrix(combined_pred, predictions_df$Readmitted)Confusion Matrix and Statistics
Reference
Prediction No Yes
No 15404 1237
Yes 1551 1807
Accuracy : 0.8606
95% CI : (0.8557, 0.8654)
No Information Rate : 0.8478
P-Value [Acc > NIR] : 1.817e-07
Kappa : 0.4818
Mcnemar's Test P-Value : 3.069e-09
Sensitivity : 0.9085
Specificity : 0.5936
Pos Pred Value : 0.9257
Neg Pred Value : 0.5381
Prevalence : 0.8478
Detection Rate : 0.7702
Detection Prevalence : 0.8321
Balanced Accuracy : 0.7511
'Positive' Class : No
The metrics for the ensemble model are very similar to the metrics for the base R logistic model alone. The last part is to use the ensemble model to predict whether or not a patient will be readmitted. We’ll use the same imaginary patient from before.
new_patient_data <- data.frame(gender = "Female",
age = 87,
heart_disease = as.integer(1),
stroke = as.integer(1),
cancer = as.integer(1),
type2_diabetes = as.integer(1),
BMI = 34)
# Predict the new patient data with the original model fits
logistic_final <- predict(logistic_fit, new_patient_data)
rf_final <- predict(rf_fit, new_patient_data)
gbm_final <- predict(gbm_fit, new_patient_data)
svm_final <- predict(svm_fit, new_patient_data)
# Gather the predictions into a dataframe
final_df <- data.frame(logistic_pred = logistic_final,
rf_pred = rf_final,
gbm_pred = gbm_final,
svm_pred = svm_final)
# Predict the final outcome and probabilities with the ensemble model fit
combo_pred_outcome <- predict(combined_fit, final_df)
combo_pred_prob <- predict(combined_fit, final_df, type = "prob")
# Gather the final outcom and probability into a dataframe for display
combo_pred_final <- data.frame(outcome = combo_pred_outcome,
probability = combo_pred_prob)
# Display the final results
combo_pred_final outcome probability.No probability.Yes
1 Yes 0.342 0.658
The estimated probability for readmittance is 65.8%, just a little below the 66.4% estimated from the base R logistic model.
Conclusion
And there you have it. In this example, I generated a synthetic data set then analyzed the data using base R and Caret. I evaluated the resulting fits and used the models to predict the outcome of a hypothetical patient.