Machine Learning Demo

Author

Jim Milks

Predicting hospital readmissions after discharge

Readmissions are situations where you are discharged from hospital and then readmitted for the same or related causes within 30, 60, 0r 90 days. Readmission rates are used as indicators of hospital quality, with a low rate preferable. Readmitting a patient can mean followup care was not properly administered or that the patient was not fully treated the first time they were in hospital.

Accordingly, predicting which patients are at higher risk for readmission can help doctors and hospitals manage patient care to prevent readmissions, thereby lowering costs, improving overall care quality, and improving their hospital ranking.

Here I demonstrate techniques for creating machine learning models to predict readmissions using synthetic patient data as actual patient data is not yet available in sufficient quantities.

Data

I created data using random sampling from normal, skewed, and binomial distributions, using logical conditions depending on the simulated variable. To get the R code I used, simply expand the code block then copy and paste into a script in R Studio. I first loaded the Tidyverse libraries. Tidyverse is a set of packages used for data manipulation. GT and GTSummary are packages for creating tables in Quarto documents. (Note: This is written in RStudio using Quarto.)

library(tidyverse)
library(gt)
library(gtsummary)

I created the synthetic data using the following code, which also created a table of summary statistics for each variable. A real data set would have many more variables (including treatments, etc) but I wanted to keep this simple.

# Create 100000 rows of random data with female = 0, male = 1 and random ages
set.seed(123) # Allows you to rerun my code and get the same numbers I did.

synthetic_df <- data.frame(gender = sample(c(0, 1), size = 100000, replace = TRUE), 
                        age = round(runif(100000, 18, 90)))

# Add new variables based on conditions
synthetic_df <- synthetic_df %>%
        mutate(heart_disease = if_else(age >= 60,
                                       rbinom(n = 1, size = 1, prob = 0.5),
                                       rbinom(n = 1, size = 1, prob = 0.05)),
               stroke = if_else(age >= 70,
                                rbinom(n = 1, size = 1, prob = 0.4),
                                rbinom(n = 1, size = 1, prob = 0.01)),
               cancer = if_else(age >= 65,
                                rbinom(n = 1, size = 1, prob = 0.6),
                                rbinom(n = 1, size = 1, prob = 0.1)),
               type2_diabetes = if_else(age >=50,
                                        rbinom(n = 1, size = 1, prob = 0.67),
                                        rbinom(n = 1, size = 1, prob = 0.25)))

## Set up logistic distribution for BMI
age_males <- subset(synthetic_df, gender == 1) %>%
        select(age)

age_females <- subset(synthetic_df, gender == 0) %>%
        select(age)

# Mean BMI and standard deviation taken from 
# Block et al. (2013) Population trends and variation in body mass index from 1971 to 2008 in 
# the Framingham Heart Study Offspring Cohort. PLOS One 8(5):e63217

mean_females <- 27.7
sd_females <- 6.15

mean_males <- 29.0
sd_males <- 4.73

# Calculate scale parameter for logistic distribution from population standard deviation
scale_female_bmi <- sd_females / sqrt(3)
scale_male_bmi <- sd_males / sqrt(3)

# Generate the synthetic data for females
female_BMI <- rlogis(length(age_females$age),
                     location = mean_females + 0.01*age_females$age,
                     scale = scale_female_bmi)
## Replace unrealistic BMI numbers with realistic ones
female_BMI <- replace(female_BMI,
                      female_BMI < 18,
                      sample(c(18:42), 1, replace = TRUE))
female_BMI <- replace(female_BMI,
                      female_BMI > 42,
                      sample(c(18:42), 1, replace = TRUE))

# Generate male BMI numbers
male_BMI <- rlogis(length(age_males$age),
                   location = mean_males + 0.009*age_males$age,
                   scale = scale_male_bmi)
## replace unrealistic BMI numbers
male_BMI <- replace(male_BMI,
                    male_BMI < 20,
                    sample(c(19:40), 1, replace = TRUE))
male_BMI <- replace(male_BMI,
                    male_BMI > 40,
                    sample(c(19:40), 1, replace = TRUE))
## Join the two set of BMI numbers together
BMI <- c(female_BMI, male_BMI)

## Incorporate BMI into the synthetic data set
synthetic_df <- synthetic_df %>%
        arrange(gender) %>%
        mutate(BMI = BMI)

# Create response variable
attach(synthetic_df)
xb <- -9 + 0.25*gender + 0.005*age + 0.025*BMI + 1.75*heart_disease + 2.1*stroke + 2.5*cancer + 1.4*type2_diabetes
p <- 1/(1 + exp(-xb))
detach(synthetic_df)
synthetic_df$Readmitted <- rbinom(n = 100000, size = 1, prob = p)

# Change factor variables to actual factors, leave rest as dummy variables
synthetic_df <- synthetic_df %>%
        mutate(gender = case_match(
                gender,
                0 ~ "Female",
                1 ~ "Male"),
               Readmitted = case_match(
                       Readmitted,
                       0 ~ "No",
                       1 ~ "Yes"
               ))

# Check the data
tbl_summary(synthetic_df,
            by = gender)

Characteristic	Female, N = 50,161¹	Male, N = 49,839¹
age	54 (36, 72)	54 (36, 72)
heart_disease	21,267 (42%)	21,078 (42%)
stroke	14,346 (29%)	14,209 (29%)
cancer	17,821 (36%)	17,639 (35%)
type2_diabetes	28,197 (56%)	27,988 (56%)
BMI	28.8 (25.0, 32.8)	29.4 (26.4, 32.2)
Readmitted	7,101 (14%)	8,121 (16%)
¹ Median (IQR); n (%)

You can make synthetic data using real data as a template via packages like the synthpop package. This is particularly useful when your real data set is too small to properly train machine learning models. The synthetic data will have all the characteristics of your actual data and be sufficiently large for training purposes.

Exploratory Data Analysis

Exploratory data analysis (EDA) is a critical first step in any data science project. We have several ways of conducting EDA in R—manually or automated. My favorite way? Using the dlookr package. Note: Running this code will open two tabs in your system browser.

library(dlookr)

# Diagnostic report
diagnose_web_report(synthetic_df, output_format = "html")

# EDA report
eda_web_report(synthetic_df, output_format = "html")

Data analysis with base R

With that out of the way, now we can get down to the analysis. As always, we start with the simplest model that is sufficient for the data and work our way up to more complex models. In many cases, we’ll discover that the simple model does a sufficient job and we won’t need more complex models. So without further ado, let’s start with multiple logistic regression, the simplest model for this data.

Logistic Regression

R has multiple ways for regression, from base R to various implentations in the caret and tidymodels packages. Let’s take a look at the base R version and then move to the caret version.

# Turn dummy variables into factors
synthetic_df2 <- synthetic_df %>%
        mutate_at(c("gender",
                    "heart_disease",
                    "stroke",
                    "cancer",
                    "type2_diabetes",
                    "Readmitted"), factor)

synthetic_logistic <- glm(Readmitted ~ ., data = synthetic_df2, family = "binomial")

summary(synthetic_logistic)


Call:
glm(formula = Readmitted ~ ., family = "binomial", data = synthetic_df2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3748  -0.1193  -0.0318  -0.0269   4.0315  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -8.782592   0.248739 -35.308  < 2e-16 ***
genderMale       0.268301   0.022638  11.852  < 2e-16 ***
age              0.007115   0.001987   3.580 0.000344 ***
heart_disease1   1.439853   0.242129   5.947 2.74e-09 ***
stroke1          2.076302   0.048165  43.108  < 2e-16 ***
cancer1          2.649781   0.141319  18.750  < 2e-16 ***
type2_diabetes1  1.278637   0.307155   4.163 3.14e-05 ***
BMI              0.022243   0.002264   9.825  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 85308  on 99999  degrees of freedom
Residual deviance: 45312  on 99992  degrees of freedom
AIC: 45328

Number of Fisher Scoring iterations: 10

Note that the coefficients R gives is log-odds ratios. Note that the coefficients are very close to the xb formula I used to create the data. So R largely figured out the original log-odds ratios I gave it. We have to take the exponential to convert them to odds ratio so we can interpret it. Here I pulled both the odds ratios (OR) and 95% confidence intervals for each variable

exp_coef <- exp(cbind(coef(synthetic_logistic), confint(synthetic_logistic)))
exp_coef

                                   2.5 %       97.5 %
(Intercept)      0.00015338  0.000091275 2.432464e-04
genderMale       1.30774097  1.250999032 1.367090e+00
age              1.00714024  1.003225194 1.011072e+00
heart_disease1   4.22007418  2.656348872 6.894083e+00
stroke1          7.97491910  7.260198087 8.769065e+00
cancer1         14.15093225 10.835513049 1.887539e+01
type2_diabetes1  3.59174030  1.974440908 6.639641e+00
BMI              1.02249251  1.017967551 1.027042e+00

Now we can tell that men have roughly a -30.7740968% greater chance of being readmitted compared to women, being diagnosed with heart disease increased the risk by 4.2x compared to normal, stroke by 7.97x, cancer 14x, type 2 diabetes 3.6x, and increasing BMI by 2.22%.

To make a prediction from base R, we first make a new data frame with the patient data we wish to predict.

p_gender <- "Female"
p_age <- 87
p_heartdisease <- as.factor(1)
p_stroke <- as.factor(1)
p_cancer <- as.factor(1)
p_type2diabetes <- as.factor(1)
p_BMI <- 34

new_patient_data <- data.frame(gender = p_gender,
                       age = p_age,
                       heart_disease = p_heartdisease,
                       stroke = p_stroke,
                       cancer = p_cancer,
                       type2_diabetes = p_type2diabetes,
                       BMI = p_BMI)

# Predict from logistic model and giving the predicted probability
exp(predict(synthetic_logistic, newdata = new_patient_data, type = "response"))

       1 
1.664143

So according to this model, a 87-year-old woman with heart disease, type 2 diabetes, cancer, and an obese BMI would have a 66.4% greater chance of being readmitted over a younger person without those conditions.

Caret

Caret is a machine learning package for R, similar to Tidymodels. I’m more familiar with Caret than Tidymodels as it’s been around longer than Tidymodels and was used in my data science certification course. It has several extension packages for additional functionality, including caretEnsemble for easily making stacked models and caretForecast for time series. Let’s create the training and testing data sets for the synthetic data in caret

library(caret)

# Parallel processing to speed analysis
library(doParallel)

## One fewer than the number of cores on your computer
registerDoParallel(cores = 3)

set.seed(42)

# Split data using Readmitted as index variable. Do *before* preprocessing the data
split_data <- createDataPartition(synthetic_df$Readmitted, p = 0.8, list = FALSE)

training <- synthetic_df[split_data, ]
test <- synthetic_df[-split_data, ]

training$Readmitted <- as.factor(training$Readmitted)
test$Readmitted <- as.factor(test$Readmitted)

Logistic Model

Creating models in caret is simple.

logistic_fit <- train(
        form = Readmitted ~ .,
        data = training,
        trControl = trainControl(method = "repeatedcv",
                                 number = 5,
                                 repeats = 3),
        method = "glm",
        family = "binomial"
)

summary(logistic_fit$finalModel)


Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3748  -0.1180  -0.0345  -0.0293   3.9882  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -8.605289   0.258991 -33.226  < 2e-16 ***
genderMale      0.262971   0.025321  10.385  < 2e-16 ***
age             0.006782   0.002221   3.054  0.00226 ** 
heart_disease   1.519005   0.281718   5.392 6.97e-08 ***
stroke          2.087335   0.053889  38.734  < 2e-16 ***
cancer          2.674334   0.159671  16.749  < 2e-16 ***
type2_diabetes  1.013237   0.336543   3.011  0.00261 ** 
BMI             0.022449   0.002529   8.877  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68248  on 80000  degrees of freedom
Residual deviance: 36225  on 79993  degrees of freedom
AIC: 36241

Number of Fisher Scoring iterations: 10

Note that the coefficients are a bit different than the base R logistic regression. That’s due to different engines behind the analysis. Here it appears age is not a significant predictor whereas in the base R analysis, it was significant. As before, the coefficients here are log-odds. You can get the odds ratios via the following:

coefficients <- coef(logistic_fit$finalModel)
exp_coefficients <- exp(coefficients)
coefficients_table <- data.frame(
        Log_odds = coefficients,
        Odds_Ratio = exp_coefficients
)
print(coefficients_table)

                   Log_odds   Odds_Ratio
(Intercept)    -8.605288743 1.831347e-04
genderMale      0.262970913 1.300789e+00
age             0.006781732 1.006805e+00
heart_disease   1.519005409 4.567680e+00
stroke          2.087334858 8.063396e+00
cancer          2.674333959 1.450269e+01
type2_diabetes  1.013237454 2.754504e+00
BMI             0.022448915 1.022703e+00

Caret uses several base R functions (summary( ), coef( ), etc), which makes learning this package fairly easy if you know R. Also note how tuning and resampling are included in the model. Want to know how important variables are to the model? Try this:

# Importance scaled to 0 to 100. Use "scale = FALSE" to get unscaled importance
logistic_importance <- varImp(logistic_fit)
plot(logistic_importance)

As we can see, stroke is most importance followed by cancer.

If you need an estimate of the accuracy of the model, you can get it two ways. The first uses the cross-validation from the model itself.

logistic_fit$results

  parameter  Accuracy     Kappa  AccuracySD     KappaSD
1      none 0.8597518 0.4700193 0.003218907 0.009892456

Caret estimated the accuracy of the logistic model to be 0.8597518 by repeated cross-validation. We can get more information on model accuracy by testing it on the test data set then creating a confusion matrix. Since we created the test set before doing any preprocessing on the data, it’s a true test of the model’s predictive abilities. Let’s look at the model predictions and probabilities first and then the confusion matrix.

logistic_pred <- predict(logistic_fit, newdata = test)
logistic_probabilities <- predict(logistic_fit, newdata = test, type = "prob")
logistic_predictions_probabilities <- data.frame(predictions = logistic_pred, 
                                                 probabilities = logistic_probabilities)

head(logistic_predictions_probabilities)

   predictions probabilities.No probabilities.Yes
5           No        0.9986552      0.0013448219
6           No        0.9995462      0.0004537860
12          No        0.9995390      0.0004609887
15          No        0.9995784      0.0004215964
22          No        0.5264910      0.4735089972
23          No        0.9996122      0.0003877748

# Confusion matrix for logistic analyses requires factors
logistic_pred2 <- as.factor(logistic_pred)
Readmitted <- as.factor(test$Readmitted)

# Create confusion matrix using Readmitted variable
confusionMatrix(logistic_pred2, Readmitted)

Confusion Matrix and Statistics

          Reference
Prediction    No   Yes
       No  15415  1259
       Yes  1540  1785
                                          
               Accuracy : 0.86            
                 95% CI : (0.8552, 0.8648)
    No Information Rate : 0.8478          
    P-Value [Acc > NIR] : 5.709e-07       
                                          
                  Kappa : 0.4775          
                                          
 Mcnemar's Test P-Value : 1.207e-07       
                                          
            Sensitivity : 0.9092          
            Specificity : 0.5864          
         Pos Pred Value : 0.9245          
         Neg Pred Value : 0.5368          
             Prevalence : 0.8478          
         Detection Rate : 0.7708          
   Detection Prevalence : 0.8337          
      Balanced Accuracy : 0.7478          
                                          
       'Positive' Class : No

As you can see, caret includes a number of metrics by default. Overall, the model does a decent job given the synthetic data it was given. If you want a ROC curve, we have to add another package.

library(MLeval)
pred <- predict(logistic_fit, newdata = test, type = "prob")
test1 <- evalm(data.frame(pred, test$Readmitted, Group = "Logistic"))

test1$roc

The ROC confirms that the logistic model is superior over the naive default model, with a ROC of 0.92.

Ensemble models

Creating ensemble models in Caret is quite simple. First, let’s make the other models.

Machine learning models

Note: This is far from all the models Caret has but should be representative of what Caret is capable of doing.

Random Forest

# Additional library for random forest
library(randomForest)

# Tuning parameters
mtry <- sqrt(ncol(training))
ntree <- 3

rf_fit <- train(Readmitted ~ .,
                  data = training,
                  method = "ranger",
                  metric = "Accuracy",
                  tuneLength = 5,
                  trControl = trainControl(method = "repeatedcv",
                                           number = 5,
                                           repeats = 3,
                                           search = "random"),
                  na.action = na.omit)
rf_pred <- predict(rf_fit, newdata = test)

Gradient Boosting Machines

# Note: Load doParallel if not already loaded

library(gbm)

gbm_fit <- train(Readmitted ~ .,
                   data = training,
                   method = "gbm",
                   trControl = trainControl(method = "repeatedcv",
                                            number = 5,
                                            repeats = 3),
                   tuneLength = 5)

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        0.7792             nan     0.1000    0.0371
     2        0.7252             nan     0.1000    0.0269
     3        0.6835             nan     0.1000    0.0208
     4        0.6501             nan     0.1000    0.0166
     5        0.6234             nan     0.1000    0.0133
     6        0.6008             nan     0.1000    0.0114
     7        0.5825             nan     0.1000    0.0091
     8        0.5671             nan     0.1000    0.0077
     9        0.5532             nan     0.1000    0.0070
    10        0.5412             nan     0.1000    0.0061
    20        0.4811             nan     0.1000    0.0018
    40        0.4568             nan     0.1000    0.0003
    60        0.4542             nan     0.1000   -0.0000
    80        0.4534             nan     0.1000    0.0001
   100        0.4531             nan     0.1000   -0.0000

gbm_pred <- predict(gbm_fit, newdata = test)

Support Vector Machines

svm_fit <- train(Readmitted ~ .,
                   data = training,
                   method = "svmLinear",
                   trControl = trainControl(method = "repeatedcv",
                                            number = 5,
                                            repeats = 3),
                   tuneLength = 5)

svm_pred <- predict(svm_fit, newdata = test)

Ensemble Model

While you can use a package like caretEnsemble to create ensemble models, I like doing it myself.

predictions_df <- data.frame(logistic_pred, 
                             rf_pred, 
                             gbm_pred, 
                             svm_pred, 
                             Readmitted = test$Readmitted)

# You can use different methods. I just like rf
combined_fit <- train(Readmitted ~ ., method = "rf", data = predictions_df)

combined_pred <- predict(combined_fit, predictions_df)

confusionMatrix(combined_pred, predictions_df$Readmitted)

Confusion Matrix and Statistics

          Reference
Prediction    No   Yes
       No  15404  1237
       Yes  1551  1807
                                          
               Accuracy : 0.8606          
                 95% CI : (0.8557, 0.8654)
    No Information Rate : 0.8478          
    P-Value [Acc > NIR] : 1.817e-07       
                                          
                  Kappa : 0.4818          
                                          
 Mcnemar's Test P-Value : 3.069e-09       
                                          
            Sensitivity : 0.9085          
            Specificity : 0.5936          
         Pos Pred Value : 0.9257          
         Neg Pred Value : 0.5381          
             Prevalence : 0.8478          
         Detection Rate : 0.7702          
   Detection Prevalence : 0.8321          
      Balanced Accuracy : 0.7511          
                                          
       'Positive' Class : No

The metrics for the ensemble model are very similar to the metrics for the base R logistic model alone. The last part is to use the ensemble model to predict whether or not a patient will be readmitted. We’ll use the same imaginary patient from before.

new_patient_data <- data.frame(gender = "Female",
                       age = 87,
                       heart_disease = as.integer(1),
                       stroke = as.integer(1),
                       cancer = as.integer(1),
                       type2_diabetes = as.integer(1),
                       BMI = 34)

# Predict the new patient data with the original model fits
logistic_final <- predict(logistic_fit, new_patient_data)
rf_final <- predict(rf_fit, new_patient_data)
gbm_final <- predict(gbm_fit, new_patient_data)
svm_final <- predict(svm_fit, new_patient_data)

# Gather the predictions into a dataframe
final_df <- data.frame(logistic_pred = logistic_final,
                       rf_pred = rf_final,
                       gbm_pred = gbm_final,
                       svm_pred = svm_final)

# Predict the final outcome and probabilities with the ensemble model fit
combo_pred_outcome <- predict(combined_fit, final_df)
combo_pred_prob <- predict(combined_fit, final_df, type = "prob")

# Gather the final outcom and probability into a dataframe for display
combo_pred_final <- data.frame(outcome = combo_pred_outcome, 
                               probability = combo_pred_prob)

# Display the final results
combo_pred_final

  outcome probability.No probability.Yes
1     Yes          0.342           0.658

The estimated probability for readmittance is 65.8%, just a little below the 66.4% estimated from the base R logistic model.

Conclusion

And there you have it. In this example, I generated a synthetic data set then analyzed the data using base R and Caret. I evaluated the resulting fits and used the models to predict the outcome of a hypothetical patient.