• Term Paper
  • Fall 2020, DSPA (HS650)
  • Name: Yihao Yang, Fengyi Gao
  • UMich E-mail: ,
  • We certify that the following paper represents our own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.

1 Abstract

Hospital readmission consumes a large portion of medical resources and has been increasingly recognized as an important health care quality measure. Because diabetic patients are at higher risk of readmission, reducing their readmission rates will not only lessen the burden of their medical costs but also improve their quality of life. In this study, we dug deep into the diabetes data from 130 US hospitals, comprising almost 70,000 inpatient diabetes encounters. Tree-based models and multivariate logistic regression were used to predict early readmission and to infer its important factors. Among the predictive models we built, random forest achieves the highest prediction accuracy of 93.09%, with sensitivity of 96.51% and specificity of 90.44%. Moreover, our findings demonstrate that admission type/source, number of diagnoses/procedures, and diagnosis type, among other features, are important factors impacting early hospital readmission. These results can inform future directions for improving the health care of diabetic patients.

Keywords: Hospital Early Readmission, Diabetes, Predictive Analysis, Clustering, Feature selection

2 Intro/Background

Diabetes, one of the leading causes of death globally, has more than doubled in prevalence during the past 20 years (Zimmet et al. 2014). Early readmission plays a critical role, as a measure of quality of care, in guiding medical professionals to improve their treatment. It is defined by the presence of a hospital claim starting within 30 days of discharge from the index hospitalization (Lum et al. 2012). Early readmission not only takes up a large portion of medical resources but has also been increasingly recognized as an important health care quality measure (Rubin 2015). Because patients with diabetes may be at higher risk of readmission than those without diabetes (Rubin 2015), reducing readmission rates of diabetic patients will not only lessen the burden of their medical costs but also improve their quality of life.

Therefore, it is important to know/predict whether a patient will be readmitted to the hospital, so that actions such as a change of treatment or medication can be taken to prevent readmission. Utilizing the diabetes dataset (Strack et al. 2014) spanning 10 years (1999-2008) of clinical care at 130 US hospitals, we developed several risk models to predict whether a diabetic patient will be readmitted. Additionally, we examined the relationship between multiple risk factors and the 30-day readmission rate of diabetic patients, and identified combinations of risk factors that strongly impact hospital readmission. We hypothesized that more actions taken at the hospital (a larger number of procedures/tests) and more careful diagnosis are associated with a reduction in readmission rates among individuals admitted to the hospital. Our results provide insight into useful predictors of readmission, which may prove valuable in developing strategies to reduce readmission rates and the cost of care for individuals with diabetes mellitus.

3 Methods

3.1 Data assembly

In this study, we use the Diabetes 130-US hospitals for years 1999-2008 Data Set (Strack et al. 2014). This dataset consists of risk factors related to hospital readmission of diabetic patients, collected from 130 hospitals and integrated delivery networks between 1999 and 2008. It includes over 50 features representing patient and hospital outcomes across over 70,000 observations. Examples of attributes include race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetes medications, and number of outpatient, inpatient, and emergency visits in the year before the hospitalization.

Readmission is the key label used for prediction in this study. It has three levels in the original dataset:

  • No readmission
  • A readmission in less than 30 days (this situation is not good, because the treatment may not have been appropriate)
  • A readmission in more than 30 days (this is not good either, although the reason may be the state of the patient)

Our goal is to use the risk factors to predict hospital readmission, especially readmission in less than 30 days, so that we can inform treatment improvement for hospitals.
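
The following is a minimal sketch of the binary outcome used throughout this study (it assumes the raw data frame data_main loaded in Section 4.1.1; readmit30 is a hypothetical name used only for illustration, and the actual recoding appears in Section 4.1.4):

# "<30" becomes the positive class; ">30" and "NO" are collapsed into the negative class
readmit30 <- factor(ifelse(data_main$readmitted == "<30", "YES", "NO"))
table(readmit30)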

The data used for later analysis is created in two steps. We first performed data collection and preprocessing, including formatting/dropping missing values and factorizing features. This step involves only transformations, without any feature selection or observation filtering. We then cleaned the data based on the following criteria.

  • Remove repeated patients (keep only the first encounter) to ensure data independence.
  • Remove all encounters that resulted in either discharge to a hospice or patient death, to avoid biasing our analysis.
  • Deselect features with over 25% missing values, including weight, medical_specialty, and payer_code.
  • Drop irrelevant features/identifiers, including encounter_id and patient_nbr, based on our prior knowledge.

After the comprehensive data cleaning, we carried out exploratory data analysis to gain some intuition about the feature distributions and correlations. Lastly, we performed data processing such as labelling and splitting before diving into predictive modelling. The process is as follows:

  • First, divide readmitted into two groups by merging long-term readmission (“>30”) into the “NO” class. This results in highly imbalanced data, so we rebalance it using the SMOTE function (a toy sketch of the SMOTE interpolation idea follows this list).
Illustration of SMOTE algorithm [resource](https://towardsdatascience.com/telecom-customer-churn-prediction-using-smote-powered-machine-learning-a7354d54380d)


  • Then we drop the rows where readmitted is NA.
  • Lastly, we split the data into 90:10 training:testing (randomly).
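
Below is a minimal, self-contained sketch of the SMOTE interpolation idea on a hypothetical toy minority class with two numeric features; the actual rebalancing in Section 4.1.4 uses ubBalance(type = "ubSMOTE") from the unbalanced package.

# Toy sketch of the SMOTE interpolation step (hypothetical numeric minority class)
set.seed(53)
minority <- data.frame(x1 = rnorm(20, mean = 2), x2 = rnorm(20, mean = 2))
k <- 5
d <- as.matrix(dist(minority))                 # pairwise Euclidean distances
synthetic <- t(sapply(seq_len(nrow(minority)), function(i) {
  nbrs <- order(d[i, ])[2:(k + 1)]             # k nearest neighbours, excluding the point itself
  j    <- sample(nbrs, 1)                      # pick one neighbour at random
  gap  <- runif(1)                             # interpolation weight in [0, 1]
  unlist(minority[i, ] + gap * (minority[j, ] - minority[i, ]))
}))
head(synthetic)  # each synthetic sample lies on the segment between a minority point and one of its neighbours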

3.2 Classification

We implemented three machine learning algorithms, Random Forest, Decision Tree, and Logistic Regression, to identify diabetic patients who have a higher likelihood of being readmitted within 30 days. To improve our training accuracy/efficiency and assess the factors significantly associated with readmission, we carried out the model implementation in four steps.

  • Build a random forest model and evaluate its confusion matrix on the test set.
  • Implement a decision tree model using rpart and evaluate its confusion matrix on the test set.
  • With the intuition gained from the tree-based models, perform feature selection using three approaches: recursive feature elimination (RFE), rpart variable importance, and the Boruta method.
  • Based on the feature selection results, select the confirmed important features for logistic regression training and evaluate its confusion matrix as well as its ROC curve.

The best model for this purpose has been found to be Random Forest, with over 93% test accuracy.

knitr::include_graphics("Feature_selection.png")
Feature Selection [resource](https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/)


3.3 Dimensionality Reduction and Clustering

Clustering and dimensionality reduction allow us to better understand how a sample might be composed of distinct subgroups given a set of variables. Here, we tried to visualize the dataset in lower dimensions to see whether we can find readmission subgroups given the features we selected. In particular, we applied t-SNE to project the feature space into two dimensions. The data is not separable in two dimensions, since the feature space in our study is quite large. For clustering, a popular choice of metric is Euclidean distance. However, Euclidean distance is only valid for continuous variables and is thus not applicable here. Instead, we use a distance metric called Gower distance that can handle mixed data types. The following plot illustrates the definition of Gower distance, and a small numeric example follows the plot.

knitr::include_graphics("gower.jpg")
Illustration of Gower distance [resource](https://slideplayer.com/slide/12445898/)

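As a hedged illustration of how Gower dissimilarity treats mixed variable types, the toy example below uses cluster::daisy() on a hypothetical three-row data frame (the actual computation on the pruned training data appears in Section 4.3.2):

# Toy illustration of Gower dissimilarity on mixed data types
library(cluster)
toy <- data.frame(age      = c(25, 60, 62),
                  gender   = factor(c("Female", "Male", "Male")),
                  num_meds = c(5, 18, 20))
# numeric columns contribute range-normalised absolute differences,
# factor columns contribute 0/1 mismatches, and the dissimilarity is their average
daisy(toy, metric = "gower")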

We then select the optimal number of clusters (2) based on silhouette width and visualize the result in two dimensions. The result is more separable than the pure t-SNE visualization. However, there is no strong indication of a readmission difference between the two groups, although the proportion of readmitted patients is slightly higher in the second cluster. Finally, we apply hierarchical clustering to visualize the distribution of the readmission groups among the branches.

4 Results and Discussions

R Libraries used in this study:

library(tidyverse)
library(ggplot2)
library(DT)
library(corrplot)
library(GGally)
library(ggforce)
library(unbalanced)
library(Boruta)
library(caret)
library(randomForest)
library(reshape2)
library(rpart.plot)
library(pROC)
library(MASS)
library(Rtsne)
library(lubridate)
library(rgl)
library(tidyr)
library(cluster)
library(dendextend)

4.1 Data Collection and Preprocessing

In this section, we perform data collection and cleaning, including formatting missing values, factorizing features, etc.

4.1.1 Data Collection

Here we collected the data from the case-studies on the Canvas site.

data_main <- as.data.frame(read.csv("https://umich.instructure.com/files/5525199/download?download_frd=1"))
data_mapping <- as.data.frame(read.csv("https://umich.instructure.com/files/5525197/download?download_frd=1"))
# different link but same data: 
# df3 <- as.data.frame((read.csv("https://umich.instructure.com/files/7843517/download?download_frd=1")))

First look at the data:

datatable(t(head((data_main))))
#str(data_main)

Here we show a table describing the features in the dataset.

List of features and their descriptions in the initial dataset (Strack et al. 2014)

4.1.2 Data Cleaning

4.1.2.1 Handling Missing Values

We observed that some missing values in the dataset are represented as “?” while others are coded as specific integer IDs. Here we convert all of the missing values to NA.

data_main <- data_main %>% 
      # Convert "?" values to NA 
      mutate_all(~na_if(.,"?")) 
      # Set missing values to NA according to mapping data
data_main[(data_main$admission_type_id %in% c(5,6,8)),"admission_type_id"] <- NA

data_main[(data_main$discharge_disposition_id %in% c(18,25,26)), "discharge_disposition_id"] <- NA

data_main[(data_main$admission_source_id %in% c(9,15,17,20,21)),"admission_source_id"] <- NA

4.1.2.2 Data Type Conversion

In this section we convert the appropriate features into categorical data and map the diagnosis features into diagnosis groups based on their ICD-9 codes (see Table 2 in Strack et al. 2014).

data_main$race <- as.factor(data_main$race)
data_main$gender <- as.factor(data_main$gender)
data_main$age <- as.factor(data_main$age)
data_main$payer_code <- as.factor(data_main$payer_code)
data_main$max_glu_serum <- as.factor(data_main$max_glu_serum)
data_main$A1Cresult <- as.factor(data_main$A1Cresult)
data_main$admission_type_id <- as.factor(data_main$admission_type_id)
data_main$admission_source_id <- as.factor(data_main$admission_source_id)
data_main$discharge_disposition_id <- as.factor(data_main$discharge_disposition_id)

for (i in 25:50){
  data_main[,i] <- as.factor(data_main[,i])
}
# Map an ICD-9 diagnosis code column into a broad diagnosis group (see Table 2 in Strack et al. 2014)
convert_diag <- function(feature,data_main)
{
  val_name <- paste(feature,"_group",sep='')   # e.g. "diag_1" -> "diag_1_group"
  data_main[,val_name] <- NA
  for (i in 1:length(data_main[,feature]))
  {
    if (!is.na(data_main[i,feature]))
    {
      # E and V codes (external causes / supplementary classification) go to "Other"
      if (grepl("[E-Ve-v]", data_main[i,feature], perl = T)) {data_main[i,val_name] <- "Other"}
      else{
          cur_val <- as.numeric(data_main[i,feature])
          # Group the remaining numeric ICD-9 codes by chapter
          if (cur_val %in% c(390:459,785)) {data_main[i,val_name] <- "Circulatory"}
          else if (cur_val %in% c(460:519,786)) {data_main[i,val_name] <- "Respiratory"}
          else if (cur_val %in% c(520:579,787)) {data_main[i,val_name] <- "Digestive"}
          else if (cur_val %in% c(800:999)) {data_main[i,val_name] <- "Injury"}
          else if (cur_val %in% c(710:739)) {data_main[i,val_name] <- "Musculoskeletal"}
          else if (cur_val %in% c(580:629,788)) {data_main[i,val_name] <- "Genitourinary"}
          else if (cur_val %in% c(140:239)) {data_main[i,val_name] <- "Neoplasms"}
          else if (cur_val>=250 & cur_val<251) {data_main[i,val_name] <- "Diabetes"}   # 250.xx codes
          else {data_main[i,val_name] <- "Other"}
        }
    }
  }
  return(data_main)
}

for (i in c("diag_1","diag_2","diag_3")){
  data_main <- convert_diag(i,data_main)
}
data_main$diag_1_group <- as.factor(data_main$diag_1_group)
data_main$diag_2_group <- as.factor(data_main$diag_2_group)
data_main$diag_3_group <- as.factor(data_main$diag_3_group)

data_main <- subset(data_main, select = -c(diag_1,diag_2,diag_3))

4.1.2.3 Data Filtering

We did the following to ensure data independence and reduce bias, in order to improve the prediction of readmission.

  1. Remove repeated patients (keep only the first encounter).
  2. Remove encounters that resulted in patient death, discharge to hospice, or an unknown/invalid discharge disposition.
hospiceORdeathORnull_id <- c(11,13,14,19,20,21,18,25,26)
data_main <- group_by(data_main, patient_nbr) %>% slice(1) %>% 
      filter(!discharge_disposition_id %in% hospiceORdeathORnull_id)
data_main <- as.data.frame(data_main)
#length(unique(data_main$patient_nbr))/length(data_main$patient_nbr)

4.1.2.4 Feature Pruning

In this section, we pruned the features based on our prior knowledge and on criteria such as missing value rates and possible correlations.

Prune Features Based on Data Missing Rates

Here, we computed the percentage of missing values for each feature to determine whether any feature with a large amount of missing values should be removed.

miss_data <- data.frame(features = colnames(data_main),
                     missing_rates = round(colSums(is.na(data_main))/nrow(data_main),3))
miss_data <- miss_data[order(-miss_data$missing_rates),]
# datatable(miss_data)
miss_data$features = factor(miss_data$features, levels = unique(miss_data$features))
ggplot(miss_data, aes(features,missing_rates,fill=missing_rates))+
  geom_bar(stat="identity")+
  ggtitle("Missing value rates of 50 features") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

We decided to deselect the features with over 25% missing values: weight, medical_specialty, and payer_code.

Prune Features Based on Prior Knowledge

  • Since encounter_id is just the unique identifier of an encounter, we can clearly drop it for predictive analysis.
  • Similarly, patient_nbr is the unique identifier of a patient, so it is appropriate to drop this feature as well.

Feature Pruning Summary: To summarize, there are five features to drop: weight, medical_specialty, payer_code, encounter_id, and patient_nbr.

data_main <- subset(data_main, select = -c(medical_specialty,encounter_id,weight,patient_nbr,payer_code))

4.1.3 Exploratory Data Analysis

In this section, we carry out some exploratory data analysis to gain some intuition.

4.1.3.1 The Distribution of Categorical Variables

data_factor <- data_main[, unlist(lapply(data_main, is.factor))]
num_subplots <- 8
for (col_start in seq(1, ncol(data_factor),num_subplots)){
      long <- pivot_longer(data_factor[,col_start:min(col_start+num_subplots-1,ncol(data_factor))], everything(),
                           names_to = "features", 
                           values_to = "categories")
      
      g <- ggplot(long, aes(x=categories, fill = features))
      
      print(g + geom_histogram(stat = "count") +
                  facet_wrap(~features,scales = "free", ncol = 4) +
                  ggtitle("Histogram Plots") + coord_flip() + 
                  scale_y_continuous(breaks = scales::pretty_breaks(n=5)) +
                  theme(legend.position = "top", strip.text.x = element_blank(),
                        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)))
}

Some preliminary conclusions from the visualizations of categorical variables:

  • Elderly patients have a higher likelihood of hospital admission.
  • Caucasian patients account for a large portion of hospital admissions.
  • Most of the patients do not have an A1C result.
  • Female patients account for a slightly larger portion of hospital admissions.
  • The most common diagnosis group for these diabetic patients is circulatory.

4.1.3.2 The Distribution of Numerical Features

data_numeric <- data_main[, unlist(lapply(data_main, is.numeric))]
num_subplots <- 8
for (col_start in seq(1, ncol(data_numeric),num_subplots)){
      long <- pivot_longer(data_numeric[,col_start:min(col_start+num_subplots-1,ncol(data_numeric))], everything(),
                           names_to = "features", 
                           values_to = "values")
      
      g <- ggplot(long, aes(x=values, fill = features))
      
      print(g + geom_histogram(stat = "count") +
                  facet_wrap(~features,scales = "free", ncol = 4) +
                  ggtitle("Histogram Plots") + coord_flip() + 
                  scale_y_continuous(breaks = scales::pretty_breaks(n=5)) +
                  theme(legend.position = "top", strip.text.x = element_blank(),
                        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)))
}

Some preliminary conclusions from the visualizations of numerical variables:

  • The most common admission type is emergency.
  • The most common admission source is the emergency room.
  • The most common discharge disposition is to home.
  • The number of lab procedures ranges from 0 to 100.
  • The number of medications mostly ranges from 10 to 20.
  • Most of the patients have more than 5 diagnoses.

4.1.3.3 Correlation between Numerical Variables

Here we plot the correlations between the numerical variables. We found that time spent in hospital, number of medications, and number of procedures are positively correlated.

data_numeric_cor <- cor(data_numeric,use="pairwise.complete.obs")
corrplot(data_numeric_cor, method="color", type="upper", tl.cex = 0.8, mar=c(0,0,1,0), 
         tl.pos='d',tl.offset = 1)

# mtext("Correlation of numerical features", at=3, line=0.5, cex=1)

4.1.4 Data Preprocessing

In this section, we prepared data before diving into the predictive modelling. We did the following:

  1. First, divide readmitted into two groups by merging long-term readmission (“>30”) into the “NO” class. This results in highly imbalanced data, so we rebalance it using the SMOTE function.
  2. Then we drop the rows where readmitted is NA.
  3. Lastly, we split the data into 90:10 training:testing (randomly).
data_predict <- drop_na(data_main, readmitted) %>%
      mutate(readmitted = as.factor(if_else(readmitted == "<30", 1, 0)))

# rebalance dataset with SMOTE (ubSMOTE from the 'unbalanced' package)
set.seed(53)
# percOver: % of synthetic minority samples to create per existing minority sample
# (chosen so that the minority class grows to roughly the size of the majority class)
percOver <- (sum(data_predict$readmitted==0)-sum(data_predict$readmitted==1))/sum(data_predict$readmitted==1)*100
# percUnder: % of majority samples to keep per synthetic sample created
# (chosen so that essentially all majority samples are retained)
percUnder <- sum(data_predict$readmitted==0)/(sum(data_predict$readmitted==0)-sum(data_predict$readmitted==1))*100
balanced <- ubBalance(X=data_predict[ , -which(names(data_predict) %in% c("readmitted"))],
            Y=data_predict$readmitted, type="ubSMOTE",
            percOver=percOver, percUnder=percUnder, verbose=TRUE)
WARNING: NAs generated by SMOTE removed 
Proportion of positives after ubSMOTE : 50.69 % of 100685 observations 
# plot the result
counts <- rbind(table(data_predict$readmitted),table(balanced$Y))
barplot(t(counts), main="Before and After Rebalancing", 
        names.arg=c("Before", "After"), beside=TRUE,
        xlab="Raw vs. Rebalanced Data",ylab="Number of cases", 
        col=c("orange","darkgreen"), legend = c("NO", "YES")) 

readmitted <- as.factor(if_else(balanced$Y == 0, "NO", "YES"))
data_predict <- cbind(balanced$X, readmitted)
num_samples <- nrow(data_predict)
test_id <- sample(num_samples, 0.1*num_samples)
data_train <- data_predict[-test_id,]
data_test  <- data_predict[ test_id,]

4.2 Predictive Analysis

In this section, we will predict the readmitted outcome with different models.

After setting up the data, we use multiple prediction methods for this classification problem, including random forest, decision tree, and logistic regression.

4.2.1 Tree-based Models

4.2.1.1 Random Forest

Build the training model

library(randomForest)
library(caTools)

rf <- randomForest(readmitted~., 
                   data = data_train,
                   na.action = na.omit,
                   importance = TRUE)
print(rf)

Call:
 randomForest(formula = readmitted ~ ., data = data_train, importance = TRUE,      na.action = na.omit) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 6

        OOB estimate of  error rate: 7.23%
Confusion matrix:
       NO   YES class.error
NO  34547  1448  0.04022781
YES  4393 40353  0.09817637
# Random forest parameter tuning
# set.seed(1)
# data_train_temp <- data_train[complete.cases(data_train),]
# bestRF <- tuneRF(x = select(data_train_temp, -readmitted),
#                  y = data_train_temp$readmitted,
#                  stepFactor = 1.5, 
#                  improve = 1e-5,
#                  ntree = 500)

Evaluate the model performance on test data

test_pred <- predict(rf, data_test)
rf_cfm <- caret::confusionMatrix(test_pred, data_test$readmitted)
rf_cfm
Confusion Matrix and Statistics

          Reference
Prediction   NO  YES
       NO  3786  483
       YES  137 4570
                                          
               Accuracy : 0.9309          
                 95% CI : (0.9255, 0.9361)
    No Information Rate : 0.5629          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.861           
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9651          
            Specificity : 0.9044          
         Pos Pred Value : 0.8869          
         Neg Pred Value : 0.9709          
             Prevalence : 0.4371          
         Detection Rate : 0.4218          
   Detection Prevalence : 0.4756          
      Balanced Accuracy : 0.9347          
                                          
       'Positive' Class : NO              
                                          
fourfoldplot(rf_cfm$table, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "RF - Confusion Matrix (ACC: 93.09%)")

Plot ROC curve

test_prob <- predict(rf, data_test, type = "prob")
par(pty="s")
roc_rf <- roc(data_test$readmitted ~ test_prob[,2], plot = TRUE, print.auc = TRUE, main="ROC curve of random forest")

4.2.1.2 Recursive Partitioning And Regression Trees (rpart)

Build the tree model

library(rpart)
set.seed(53)
control <- rpart.control(cp = 0.000, xval = 1000, minsplit = 1, maxdepth = 5)

rpart_model <- rpart(readmitted~., 
                     data = data_train,
                     method = "class",
                     control = control)

printcp(rpart_model)

Classification tree:
rpart(formula = readmitted ~ ., data = data_train, method = "class", 
    control = control)

Variables actually used in tree construction:
 [1] A1Cresult           admission_source_id admission_type_id  
 [4] age                 diag_1_group        diag_2_group       
 [7] diag_3_group        num_lab_procedures  num_medications    
[10] number_diagnoses    time_in_hospital   

Root node error: 44737/90617 = 0.49369

n= 90617 

          CP nsplit rel error  xerror      xstd
1  0.2008405      0   1.00000 1.00000 0.0033641
2  0.1219572      1   0.79916 0.79916 0.0032887
3  0.0313164      2   0.67720 0.67720 0.0031744
4  0.0238840      4   0.61457 0.61493 0.0030939
5  0.0212352      6   0.56680 0.56539 0.0030184
6  0.0130541      7   0.54557 0.54637 0.0029864
7  0.0103941      8   0.53251 0.53473 0.0029660
8  0.0087400      9   0.52212 0.52730 0.0029527
9  0.0072423     10   0.51338 0.52078 0.0029407
10 0.0070747     11   0.50614 0.51633 0.0029325
11 0.0069294     13   0.49199 0.51141 0.0029232
12 0.0038223     14   0.48506 0.50348 0.0029081
13 0.0030176     16   0.47741 0.49498 0.0028915
14 0.0000000     17   0.47439 0.48687 0.0028752
# rpart.plot(rpart_model, type = 1, 
#            clip.right.labs = TRUE, 
#            extra = 101,
#            under.cex = 1,
#            round = 0, branch.lwd = 2,
#            branch = .5, under = TRUE)

rattle::fancyRpartPlot(rpart_model, sub = "")

Evaluation on test set

rpart_test_pred <- predict(rpart_model, data_test)
rpart_test_pred <- factor(colnames(rpart_test_pred)[max.col(rpart_test_pred)])
rpart_test_cfm <- caret::confusionMatrix(rpart_test_pred, data_test$readmitted)
rpart_test_cfm
Confusion Matrix and Statistics

          Reference
Prediction   NO  YES
       NO  3770 1184
       YES 1143 3971
                                          
               Accuracy : 0.7689          
                 95% CI : (0.7605, 0.7771)
    No Information Rate : 0.512           
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.5376          
                                          
 Mcnemar's Test P-Value : 0.407           
                                          
            Sensitivity : 0.7674          
            Specificity : 0.7703          
         Pos Pred Value : 0.7610          
         Neg Pred Value : 0.7765          
             Prevalence : 0.4880          
         Detection Rate : 0.3745          
   Detection Prevalence : 0.4921          
      Balanced Accuracy : 0.7688          
                                          
       'Positive' Class : NO              
                                          
fourfoldplot(rpart_test_cfm$table, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "rpart - Confusion Matrix (ACC: 76.89%)")

Plot ROC curve

test_prob <- predict(rpart_model, data_test, type = "prob")
par(pty="s")
roc_dt <- roc(data_test$readmitted ~ test_prob[,2], plot = TRUE, print.auc = TRUE, main="ROC curve of decision tree")

4.2.2 Feature Selection

4.2.2.1 RFE

set.seed(123)
# use 10% of the training data for feature selection
num_train <- nrow(data_train)
sele_id <- sample(num_train, 0.1*num_train)
data_train.sele <- data_train[sele_id,]

control <- rfeControl(functions = rfFuncs, method = "cv", number=10)
rf.select <- rfe(data_train.sele[complete.cases(data_train.sele), ][, -length(colnames(data_train.sele))], 
                 data_train.sele[complete.cases(data_train.sele), ][, length(colnames(data_train.sele))], 
                 sizes=seq(1,45,by=10), rfeControl=control)
rf.select

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         1   0.7564 0.4864    0.01231 0.02695         
        11   0.8770 0.7529    0.01091 0.02175         
        21   0.8803 0.7593    0.01284 0.02548         
        31   0.8851 0.7688    0.01240 0.02478         
        41   0.8865 0.7715    0.01301 0.02596        *
        44   0.8852 0.7691    0.01289 0.02574         

The top 5 variables (out of 41):
   number_diagnoses, diag_2_group, num_medications, time_in_hospital, num_procedures
plot(rf.select, type=c("g", "o"), cex=1,main="RFE")

RFE_feature <- predictors(rf.select)
RFE_feature
 [1] "number_diagnoses"         "diag_2_group"            
 [3] "num_medications"          "time_in_hospital"        
 [5] "num_procedures"           "diag_1_group"            
 [7] "num_lab_procedures"       "age"                     
 [9] "diag_3_group"             "admission_type_id"       
[11] "gender"                   "A1Cresult"               
[13] "admission_source_id"      "repaglinide"             
[15] "metformin"                "number_inpatient"        
[17] "pioglitazone"             "discharge_disposition_id"
[19] "glipizide"                "race"                    
[21] "number_outpatient"        "diabetesMed"             
[23] "number_emergency"         "glimepiride"             
[25] "change"                   "insulin"                 
[27] "glyburide"                "rosiglitazone"           
[29] "glyburide.metformin"      "nateglinide"             
[31] "max_glu_serum"            "acarbose"                
[33] "tolazamide"               "acetohexamide"           
[35] "chlorpropamide"           "citoglipton"             
[37] "examide"                  "glimepiride.pioglitazone"
[39] "glipizide.metformin"      "metformin.pioglitazone"  
[41] "metformin.rosiglitazone" 

4.2.2.2 Rpart variable importance

rpartImp <- varImp(rpart_model)
rpartImp <- data.frame(feature = row.names(rpartImp), rpartImp)
rpartImp <- rpartImp[order(-rpartImp$Overall),]
rownames(rpartImp) <- 1:length(rpartImp$Overall)
rpartImp$feature <- factor(rpartImp$feature,levels = unique(rpartImp$feature))
ggplot(rpartImp, aes(feature,Overall,fill=Overall))+
  geom_bar(stat="identity")+
  ggtitle("Rpart Variable Importance") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

rpart_feature <- rpartImp[rpartImp$Overall>0,]$feature
# dist of other features across the two group
feature_all <- c(as.character(rpart_feature),"readmitted")

data2plot <- data_main %>% 
  mutate(admission_source_id = as.factor(admission_source_id)) %>%
  mutate(admission_type_id = as.factor(admission_type_id)) %>%
  drop_na(readmitted) %>%
  mutate(readmitted = as.factor(if_else(readmitted == "<30", "YES", "NO")))

# numerical features
numerical <- unlist(lapply(data2plot, is.numeric))
data_numerical <- data.frame(data2plot[,numerical], readmitted=data2plot$readmitted)
data_numerical <- data_numerical[which(colnames(data_numerical) %in% feature_all)]

long <- gather(data_numerical, key, value, -readmitted)
ggplot(long, aes(x = value,fill=readmitted, group = readmitted)) + 
  geom_histogram(aes(y=..density..),position = "dodge",alpha=0.8) + 
  facet_wrap(~ key,scales = "free") + 
  ggtitle("Visualize important features Across two readmitted groups")

# categorical feature
categorical <- unlist(lapply(data2plot, is.factor))
data_categorical <- data.frame(data2plot[,categorical], readmitted=data2plot[,ncol(data2plot)])
data_categorical <- data_categorical[which(colnames(data_categorical) %in% feature_all)]

long <- gather(data_categorical, key, value, -readmitted)
ggplot(long, aes(x = value,fill=readmitted,color=readmitted, group = readmitted)) + 
  geom_bar(aes(y = ..prop..),stat = "count", alpha=0.8,position = 'dodge') + 
  facet_wrap(~ key,scales = "free", ncol = 3) + 
  ggtitle("Visualize important features Across two readmitted groups") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) + scale_y_continuous(labels=scales::percent)

4.2.2.3 Boruta

set.seed(123)
data_boruta <- Boruta(readmitted~., data=data_train.sele[complete.cases(data_train.sele), ], doTrace=0)
print(data_boruta)
Boruta performed 99 iterations in 6.976531 mins.
 28 attributes confirmed important: A1Cresult, admission_source_id,
admission_type_id, age, change and 23 more;
 14 attributes confirmed unimportant: acarbose, acetohexamide,
chlorpropamide, citoglipton, examide and 9 more;
 2 tentative attributes left: glyburide.metformin, nateglinide;
par(mar=c(8,8,4,4),oma=c(0.1,0.1,0.1,0))
plot(data_boruta, xlab="", xaxt="n",cex.lab=3,cex.axis=2)
title("Boruta Feature Selection")
lz <- lapply(1:ncol(data_boruta$ImpHistory), function(i)
data_boruta$ImpHistory[is.finite(data_boruta$ImpHistory[, i]), i])
names(lz) <- colnames(data_boruta$ImpHistory)
lb <- sort(sapply(lz, median))
axis(side=1, las=2, labels=names(lb), at=1:ncol(data_boruta$ImpHistory), cex.axis=0.5, font = 4)

boruta_feature <- getSelectedAttributes(data_boruta, withTentative = F)

4.2.2.4 Feature Selection Comparison

all_var <- colnames(data_train)[-45]
plot_data <- data.frame(variable = all_var,
                        Boruta = ifelse(all_var %in% boruta_feature, 1, 0),
                        RFE = ifelse(all_var %in% RFE_feature, 1, 0),
                        Rpart = ifelse(all_var %in% rpart_feature, 1, 0))

# plot only selected features
plot_data <- subset(plot_data,!(Boruta==0 & RFE==0 & Rpart==0))
plot_data <- melt(plot_data, value.name = "select")
colnames(plot_data) <- c("variable","method","select")

ggplot(plot_data, aes(x=variable, y=select, fill=method)) + 
    geom_bar(position="stack", stat="identity") +
    ggtitle("Comparison of selected features") +
    #theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
    xlab("") +
    coord_flip()

The three methods select similar features, and we will use the features selected by Boruta for logistic regression. This reduces the feature space from 44 to 28 features.

4.2.3 Logistic regression

# select important features and remove rows with missing value
data_train.prune <- data_train[,c(boruta_feature,"readmitted")]
data_test.prune <- data_test[,c(boruta_feature,"readmitted")]
logit <- glm(readmitted ~ ., data = data_train.prune, family = "binomial", 
             na.action = na.omit)
summary(logit)

Call:
glm(formula = readmitted ~ ., family = "binomial", data = data_train.prune, 
    na.action = na.omit)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.3649  -0.7178   0.2616   0.6827   3.3186  

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 -7.496e+00  1.030e+00  -7.279 3.37e-13 ***
raceAsian                   -2.988e-01  1.286e-01  -2.324 0.020150 *  
raceCaucasian                4.973e-01  2.704e-02  18.390  < 2e-16 ***
raceHispanic                -3.508e-02  8.113e-02  -0.432 0.665495    
raceOther                   -3.228e-01  9.039e-02  -3.571 0.000355 ***
genderMale                   7.604e-01  1.892e-02  40.199  < 2e-16 ***
age[10-20)                   6.437e-01  4.981e-01   1.292 0.196299    
age[20-30)                   6.442e-01  4.790e-01   1.345 0.178617    
age[30-40)                   3.780e-01  4.746e-01   0.797 0.425710    
age[40-50)                   2.360e-01  4.724e-01   0.500 0.617415    
age[50-60)                   1.055e+00  4.716e-01   2.237 0.025275 *  
age[60-70)                   1.445e-01  4.718e-01   0.306 0.759454    
age[70-80)                   6.632e-01  4.717e-01   1.406 0.159731    
age[80-90)                   1.004e+00  4.719e-01   2.127 0.033427 *  
age[90-100)                  2.126e-01  4.758e-01   0.447 0.654955    
admission_type_id2           5.940e-01  2.453e-02  24.218  < 2e-16 ***
admission_type_id3          -8.052e-01  3.313e-02 -24.306  < 2e-16 ***
admission_type_id4           7.940e-01  9.087e-01   0.874 0.382256    
admission_type_id7          -1.246e+01  7.133e+01  -0.175 0.861336    
discharge_disposition_id2   -1.690e-01  6.210e-02  -2.722 0.006497 ** 
discharge_disposition_id3    8.898e-02  3.146e-02   2.828 0.004681 ** 
discharge_disposition_id4   -2.890e-01  1.090e-01  -2.652 0.007997 ** 
discharge_disposition_id5    7.009e-01  7.817e-02   8.967  < 2e-16 ***
discharge_disposition_id6   -3.191e-01  3.247e-02  -9.828  < 2e-16 ***
discharge_disposition_id7   -2.050e-01  1.260e-01  -1.627 0.103634    
discharge_disposition_id8   -3.354e-01  2.930e-01  -1.145 0.252299    
discharge_disposition_id9    1.328e+00  1.148e+00   1.156 0.247485    
discharge_disposition_id12   5.427e-01  9.035e-01   0.601 0.548090    
discharge_disposition_id15   1.145e+00  3.075e-01   3.722 0.000197 ***
discharge_disposition_id22   1.154e+00  6.005e-02  19.220  < 2e-16 ***
discharge_disposition_id23  -9.847e-01  2.196e-01  -4.483 7.35e-06 ***
discharge_disposition_id24  -3.717e-01  6.184e-01  -0.601 0.547812    
discharge_disposition_id27  -1.124e+01  3.247e+02  -0.035 0.972397    
discharge_disposition_id28   2.084e+00  2.566e-01   8.122 4.59e-16 ***
admission_source_id2        -1.295e+00  9.416e-02 -13.757  < 2e-16 ***
admission_source_id3         4.395e-01  2.649e-01   1.659 0.097107 .  
admission_source_id4        -1.405e+00  5.752e-02 -24.419  < 2e-16 ***
admission_source_id5        -1.156e+00  1.225e-01  -9.440  < 2e-16 ***
admission_source_id6        -9.939e-01  7.181e-02 -13.840  < 2e-16 ***
admission_source_id7        -9.757e-01  2.528e-02 -38.595  < 2e-16 ***
admission_source_id8        -8.703e-01  9.070e-01  -0.960 0.337303    
admission_source_id10       -1.439e+01  1.371e+02  -0.105 0.916433    
admission_source_id11       -1.190e+01  2.296e+02  -0.052 0.958672    
admission_source_id13       -1.385e+01  3.247e+02  -0.043 0.965976    
admission_source_id14       -1.100e+01  2.296e+02  -0.048 0.961778    
admission_source_id22       -1.106e+01  1.562e+02  -0.071 0.943569    
admission_source_id25       -1.207e+01  2.296e+02  -0.053 0.958066    
time_in_hospital            -3.141e-03  4.469e-03  -0.703 0.482147    
num_lab_procedures           1.609e-02  6.284e-04  25.597  < 2e-16 ***
num_procedures               1.078e-01  6.302e-03  17.099  < 2e-16 ***
num_medications             -6.368e-02  1.734e-03 -36.736  < 2e-16 ***
number_outpatient           -1.614e-01  1.170e-02 -13.792  < 2e-16 ***
number_emergency            -3.634e-03  2.087e-02  -0.174 0.861770    
number_inpatient             1.988e-01  1.625e-02  12.236  < 2e-16 ***
number_diagnoses             2.876e-01  6.563e-03  43.830  < 2e-16 ***
A1Cresult>8                  7.208e-02  6.346e-02   1.136 0.256017    
A1CresultNone                6.439e-01  5.267e-02  12.225  < 2e-16 ***
A1CresultNorm                1.396e+00  6.049e-02  23.071  < 2e-16 ***
metforminNo                  1.062e-01  1.370e-01   0.775 0.438139    
metforminSteady             -4.561e-01  1.382e-01  -3.300 0.000966 ***
metforminUp                 -3.137e-01  1.700e-01  -1.846 0.064948 .  
repaglinideNo                1.376e+00  6.398e-01   2.150 0.031541 *  
repaglinideSteady            3.131e+00  6.425e-01   4.873 1.10e-06 ***
repaglinideUp                2.700e+00  6.980e-01   3.868 0.000110 ***
glimepirideNo                1.720e-01  2.039e-01   0.843 0.399082    
glimepirideSteady           -6.006e-01  2.088e-01  -2.876 0.004029 ** 
glimepirideUp                2.927e-02  2.686e-01   0.109 0.913238    
glipizideNo                  2.815e-01  1.331e-01   2.115 0.034441 *  
glipizideSteady              6.266e-01  1.345e-01   4.660 3.17e-06 ***
glipizideUp                  5.772e-02  1.730e-01   0.334 0.738587    
glyburideNo                  1.798e-01  1.444e-01   1.245 0.213156    
glyburideSteady             -4.038e-01  1.474e-01  -2.739 0.006163 ** 
glyburideUp                 -2.978e-01  1.835e-01  -1.623 0.104576    
pioglitazoneNo              -1.750e-01  2.797e-01  -0.626 0.531498    
pioglitazoneSteady           4.507e-01  2.809e-01   1.605 0.108535    
pioglitazoneUp              -3.949e-01  3.489e-01  -1.132 0.257781    
rosiglitazoneNo              1.747e+00  4.910e-01   3.559 0.000373 ***
rosiglitazoneSteady          1.308e+00  4.925e-01   2.656 0.007916 ** 
rosiglitazoneUp              2.031e+00  5.445e-01   3.730 0.000191 ***
insulinNo                    2.840e-01  3.971e-02   7.152 8.53e-13 ***
insulinSteady                3.264e-01  3.673e-02   8.888  < 2e-16 ***
insulinUp                   -3.484e-02  4.577e-02  -0.761 0.446530    
changeNo                     1.409e-01  2.508e-02   5.620 1.91e-08 ***
diabetesMedYes               5.741e-01  2.752e-02  20.863  < 2e-16 ***
diag_1_groupDiabetes        -6.096e-01  4.257e-02 -14.318  < 2e-16 ***
diag_1_groupDigestive       -1.148e+00  3.938e-02 -29.157  < 2e-16 ***
diag_1_groupGenitourinary   -9.424e-01  4.881e-02 -19.306  < 2e-16 ***
diag_1_groupInjury          -7.420e-01  4.267e-02 -17.391  < 2e-16 ***
diag_1_groupMusculoskeletal -7.626e-01  5.027e-02 -15.170  < 2e-16 ***
diag_1_groupNeoplasms       -1.163e+00  6.020e-02 -19.316  < 2e-16 ***
diag_1_groupOther           -1.033e+00  3.024e-02 -34.166  < 2e-16 ***
diag_1_groupRespiratory     -2.226e-01  2.943e-02  -7.564 3.90e-14 ***
diag_2_groupDiabetes        -1.719e-02  3.516e-02  -0.489 0.624930    
diag_2_groupDigestive       -2.148e-01  5.642e-02  -3.808 0.000140 ***
diag_2_groupGenitourinary   -3.964e-01  4.053e-02  -9.781  < 2e-16 ***
diag_2_groupInjury          -1.695e-02  6.525e-02  -0.260 0.794997    
diag_2_groupMusculoskeletal  1.509e+00  5.764e-02  26.180  < 2e-16 ***
diag_2_groupNeoplasms        2.044e+00  4.864e-02  42.027  < 2e-16 ***
diag_2_groupOther            1.718e-01  2.517e-02   6.827 8.67e-12 ***
diag_2_groupRespiratory     -5.950e-01  3.678e-02 -16.178  < 2e-16 ***
diag_3_groupDiabetes         1.000e-01  2.653e-02   3.771 0.000163 ***
diag_3_groupDigestive       -6.250e-01  5.658e-02 -11.046  < 2e-16 ***
diag_3_groupGenitourinary   -8.173e-01  4.432e-02 -18.443  < 2e-16 ***
diag_3_groupInjury          -5.557e-01  7.410e-02  -7.498 6.46e-14 ***
diag_3_groupMusculoskeletal -9.346e-01  8.098e-02 -11.541  < 2e-16 ***
diag_3_groupNeoplasms       -8.766e-01  7.731e-02 -11.340  < 2e-16 ***
diag_3_groupOther           -8.043e-01  2.498e-02 -32.204  < 2e-16 ***
diag_3_groupRespiratory     -7.967e-01  4.127e-02 -19.304  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 110980  on 80740  degrees of freedom
Residual deviance:  73077  on 80633  degrees of freedom
  (9876 observations deleted due to missingness)
AIC: 73293

Number of Fisher Scoring iterations: 11

Important features associated with a higher readmission rate

selectedFeatures <- rownames(summary(logit)$coefficients)[summary(logit)$coefficients[,4] < 0.001 & summary(logit)$coefficients[,1] > 0]
selectedFeatures
 [1] "raceCaucasian"               "genderMale"                 
 [3] "admission_type_id2"          "discharge_disposition_id5"  
 [5] "discharge_disposition_id15"  "discharge_disposition_id22" 
 [7] "discharge_disposition_id28"  "num_lab_procedures"         
 [9] "num_procedures"              "number_inpatient"           
[11] "number_diagnoses"            "A1CresultNone"              
[13] "A1CresultNorm"               "repaglinideSteady"          
[15] "repaglinideUp"               "glipizideSteady"            
[17] "rosiglitazoneNo"             "rosiglitazoneUp"            
[19] "insulinNo"                   "insulinSteady"              
[21] "changeNo"                    "diabetesMedYes"             
[23] "diag_2_groupMusculoskeletal" "diag_2_groupNeoplasms"      
[25] "diag_2_groupOther"           "diag_3_groupDiabetes"       

Important features associated with a lower readmission rate

selectedFeatures <- rownames(summary(logit)$coefficients)[summary(logit)$coefficients[,4] < 0.001 & summary(logit)$coefficients[,1] < 0]
selectedFeatures
 [1] "(Intercept)"                 "raceOther"                  
 [3] "admission_type_id3"          "discharge_disposition_id6"  
 [5] "discharge_disposition_id23"  "admission_source_id2"       
 [7] "admission_source_id4"        "admission_source_id5"       
 [9] "admission_source_id6"        "admission_source_id7"       
[11] "num_medications"             "number_outpatient"          
[13] "metforminSteady"             "diag_1_groupDiabetes"       
[15] "diag_1_groupDigestive"       "diag_1_groupGenitourinary"  
[17] "diag_1_groupInjury"          "diag_1_groupMusculoskeletal"
[19] "diag_1_groupNeoplasms"       "diag_1_groupOther"          
[21] "diag_1_groupRespiratory"     "diag_2_groupDigestive"      
[23] "diag_2_groupGenitourinary"   "diag_2_groupRespiratory"    
[25] "diag_3_groupDigestive"       "diag_3_groupGenitourinary"  
[27] "diag_3_groupInjury"          "diag_3_groupMusculoskeletal"
[29] "diag_3_groupNeoplasms"       "diag_3_groupOther"          
[31] "diag_3_groupRespiratory"    

By analyzing the important features from the logistic model, we have the following observations (a quick odds-ratio check follows this list):

  • Males, patients aged 50 to 60, and patients with urgent admissions are more likely to be readmitted.
  • Patients discharged/transferred to another type of inpatient care institution can have a higher risk.
  • Patients with no or steady insulin treatment have a higher readmission risk compared with those whose insulin dosage was increased.
  • If the primary or secondary diagnosis includes diabetes, patients have a lower chance of being readmitted within 30 days.
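
As a hedged sanity check on the sign-based interpretation above, one can exponentiate the fitted coefficients to read them as odds ratios; this sketch assumes the logit object fitted above and uses Wald confidence intervals.

# Convert log-odds coefficients of the fitted logistic model into odds ratios with 95% Wald CIs
odds_ratios <- exp(cbind(OR = coef(logit), confint.default(logit)))
# features with OR > 1 are associated with higher odds of readmission within 30 days
head(odds_ratios[order(-odds_ratios[, "OR"]), ], 10)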

Evaluation on the test data

# remove rows with missing value
data_train.prune <- data_train.prune[complete.cases(data_train.prune), ]
data_test.prune <- data_test.prune[complete.cases(data_test.prune), ]
logit.prob <- predict(logit, data_test.prune, type="response")
logit.pred <- factor(ifelse(logit.prob>0.5,"YES","NO"))
logit_test_cfm <- caret::confusionMatrix(logit.pred, data_test.prune$readmitted)
logit_test_cfm
Confusion Matrix and Statistics

          Reference
Prediction   NO  YES
       NO  2972  860
       YES  951 4193
                                          
               Accuracy : 0.7982          
                 95% CI : (0.7898, 0.8065)
    No Information Rate : 0.5629          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.5889          
                                          
 Mcnemar's Test P-Value : 0.03444         
                                          
            Sensitivity : 0.7576          
            Specificity : 0.8298          
         Pos Pred Value : 0.7756          
         Neg Pred Value : 0.8151          
             Prevalence : 0.4371          
         Detection Rate : 0.3311          
   Detection Prevalence : 0.4269          
      Balanced Accuracy : 0.7937          
                                          
       'Positive' Class : NO              
                                          
fourfoldplot(logit_test_cfm$table, color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1, main = "Logistic Regression - Confusion Matrix (79.82%)")

par(pty="s")
roc_lg <- roc(data_test.prune$readmitted ~ logit.prob, plot = TRUE, print.auc = TRUE, main="ROC curve of logistic regression")

4.2.4 Model comparison

In this section, we summarize the overall performance of three models we applied for predicting readmission likelihood.

compare_model <- data.frame(Model=c("Random Forest", "Decision Tree", "Logistic Regression"),
                            Accuracy = c(0.9309,0.7689,0.7982),
                            Kappa = c(0.861,0.5376,0.5889),
                            Sensitivity = c(0.9651,0.7674,0.7576),
                            Specificity = c(0.9044,0.7703,0.8298),
                            AUC = c(0.978, 0.811, 0.874))
knitr::kable(compare_model)
Model Accuracy Kappa Sensitivity Specificity AUC
Random Forest 0.9309 0.8610 0.9651 0.9044 0.978
Decision Tree 0.7689 0.5376 0.7674 0.7703 0.811
Logistic Regression 0.7982 0.5889 0.7576 0.8298 0.874
par(pty="s")
plot(roc_rf, col = "green", main = "ROC for three models")
lines(roc_dt, col = "red")
lines(roc_lg, col = "blue")
legend("bottomright", c("Random Forest (0.978)","Decision Tree (0.811)","Logistic Regression (0.874)"), fill=c("green","red","blue"),bty = "n")

4.3 Dimensionality Reduction and Clustering

4.3.1 t-SNE

set.seed(100)
data_train.prune <- data_train.prune[complete.cases(data_train.prune), ]
num_train <- nrow(data_train.prune)
sele_id <- sample(num_train, 0.1*num_train)
data_train.temp <- data_train.prune[sele_id,]
rownames(data_train.temp) <- 1:nrow(data_train.temp)
tsne <- Rtsne(data_train.temp[,-ncol(data_train.temp)], 
              dims = 2, perplexity=30, verbose=TRUE, 
              max_iter = 500, check_duplicates = FALSE)
Performing PCA
Read the 7235 x 50 data matrix successfully!
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Building tree...
Done in 1.13 seconds (sparsity = 0.016896)!
Learning embedding...
Iteration 50: error is 93.502540 (50 iterations in 1.19 seconds)
Iteration 100: error is 78.804539 (50 iterations in 1.02 seconds)
Iteration 150: error is 76.776172 (50 iterations in 0.95 seconds)
Iteration 200: error is 76.237789 (50 iterations in 0.99 seconds)
Iteration 250: error is 75.990168 (50 iterations in 0.99 seconds)
Iteration 300: error is 2.564907 (50 iterations in 0.87 seconds)
Iteration 350: error is 2.203439 (50 iterations in 0.89 seconds)
Iteration 400: error is 2.008632 (50 iterations in 0.88 seconds)
Iteration 450: error is 1.884065 (50 iterations in 0.90 seconds)
Iteration 500: error is 1.797850 (50 iterations in 0.90 seconds)
Fitting performed in 9.58 seconds.
tsne_data <- tsne$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(readmitted = data_train.temp$readmitted)
ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = readmitted))

4.3.2 Clustering Mixed Data Types

gower_dist <- daisy(data_train.temp[, -ncol(data_train.temp)],
                    metric = "gower",
                    type = list(logratio = 3))
summary(gower_dist)
26168995 dissimilarities, summarized :
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.2847  0.3385  0.3371  0.3891  0.6699 
Metric :  mixed ;  Types = N, N, N, N, N, N, I, I, I, I, I, I, I, I, N, N, N, N, N, N, N, N, N, N, N, N, N, N 
Number of objects : 7235
sil_width <- c(NA)
for(i in 2:10){
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  
  sil_width[i] <- pam_fit$silinfo$avg.width
  
}
plot(1:10, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette Width")
lines(1:10, sil_width)

pam_fit <- pam(gower_dist, diss = TRUE, k = 2)
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering),
         readmitted = data_train.temp$readmitted)
ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster)) + ggtitle("Clustering on pruned data")

ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = readmitted, shape = readmitted)) +
  ggtitle("Compare cluster and readmitted type")

table(tsne_data$cluster, tsne_data$readmitted)
   
      NO  YES
  1 2129 2057
  2 1163 1886

From the table above, we find no strong indication of a readmission difference between the two clusters.

4.3.3 Hierarchical Clustering

hc <- hclust(gower_dist, method = "ward.D")
sub_grp <- cutree(hc, k = 2)
# data_train.temp %>% mutate(cluster = sub_grp)
# rect.hclust(hc, k = 2, border = 2:3)
dend <- as.dendrogram(hc)
plot(dend,cex=0,main="Hierarchical Clustering",leaflab = "none") 
colored_bars(colors = ifelse(data_train.temp$readmitted=="NO", "blue", "red"), dend = dend, rowLabels = "readmitted")

Although readmission status cannot be inferred directly from the two clusters, we observe aggregation of readmitted/non-readmitted cases at the endpoints of the deeper branches. This tree-like structure, similar to that of the tree-based models, is consistent with the improved performance of the random forest.

5 Conclusion

In this study, we leveraged a hospital dataset that contains demographic, clinical, and diagnosis-related features, as well as medication information, to identify diabetic patients who have a higher likelihood of being readmitted within 30 days. Based on our analysis of three machine learning algorithms, Random Forest, Decision Tree, and Logistic Regression, the best model for this purpose was found to be Random Forest, which yielded an accuracy of 93.09%. Some of the key factors important for readmission are the number of diagnoses, the number of lab procedures, the number of medications, and the diagnosis group. In contrast, many of the medications provided in the dataset, such as acarbose, acetohexamide, chlorpropamide, citoglipton, and examide, are not informative for readmission prediction.

Based on the results of this study, hospitals are advised to work not only on inpatient treatment but also on continued care after discharge. For example, more care should be taken with males aged 50 to 60 and with patients admitted urgently, as they are more likely to be readmitted. The finding that patients discharged/transferred to another type of inpatient care institution can have a higher risk of readmission suggests that hospitals may want to develop strategies to work more closely with other health care institutions. Our results can also provide medical guidance. For example, patients with no or steady insulin treatment have a higher risk of readmission compared with those whose insulin dosage was increased, which suggests that hospitals should adjust insulin usage in a timely manner to prevent readmission. Another interesting observation concerns the timing of the diabetes diagnosis: a diabetes diagnosis recorded later (as a tertiary rather than a primary or secondary diagnosis) is associated with a higher risk of readmission. Therefore, it would be beneficial for doctors and patients to take the diagnosis seriously at an early stage.

It is not surprising that random forest, which integrates multiple tree-based models, performs best among all the methods, considering the large feature space with mixed categorical and numerical types. After the feature selection process, we can see that multiple variables contribute to the readmission rate. The clustering results also support this, showing that the data is not separable in fewer than three dimensions. In particular, hierarchical clustering shows the aggregation of readmitted cases at the deeper branches, which is consistent with the architecture and good performance of tree-based models. In the future, we could try to improve model performance by tuning parameters such as mtry in the random forest, or by applying boosting algorithms for comparison; a sketch of such a follow-up appears below.
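
As a rough, untested sketch of that follow-up (assuming the data_train object defined above; caret's method = "gbm" additionally requires the gbm package, and cross-validated training on the full rebalanced set may be slow), one possible comparison could look like:

# Sketch only: tune mtry for the random forest and compare against a boosted tree model via caret
library(caret)
set.seed(53)
ctrl <- trainControl(method = "cv", number = 5)
rf_tuned <- train(readmitted ~ ., data = data_train, method = "rf",
                  trControl = ctrl, tuneGrid = expand.grid(mtry = c(4, 6, 8, 12)),
                  na.action = na.omit)
gbm_fit  <- train(readmitted ~ ., data = data_train, method = "gbm",
                  trControl = ctrl, verbose = FALSE, na.action = na.omit)
# compare cross-validated accuracy/Kappa of the two models
summary(resamples(list(RF = rf_tuned, GBM = gbm_fit)))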

6 Appendix

Mapping between the admission/discharge/source IDs and their descriptions

admission_mapping <- as.data.frame(data_mapping[1:8,])
knitr::kable(admission_mapping)
admission_type_id description
1 Emergency
2 Urgent
3 Elective
4 Newborn
5 Not Available
6 NULL
7 Trauma Center
8 Not Mapped
discharge_mapping <- as.data.frame(data_mapping[11:40,])
colnames(discharge_mapping) <- c("discharge_disposition_id","description")
knitr::kable(discharge_mapping)
discharge_disposition_id description
11 1 Discharged to home
12 2 Discharged/transferred to another short term hospital
13 3 Discharged/transferred to SNF
14 4 Discharged/transferred to ICF
15 5 Discharged/transferred to another type of inpatient care institution
16 6 Discharged/transferred to home with home health service
17 7 Left AMA
18 8 Discharged/transferred to home under care of Home IV provider
19 9 Admitted as an inpatient to this hospital
20 10 Neonate discharged to another hospital for neonatal aftercare
21 11 Expired
22 12 Still patient or expected to return for outpatient services
23 13 Hospice / home
24 14 Hospice / medical facility
25 15 Discharged/transferred within this institution to Medicare approved swing bed
26 16 Discharged/transferred/referred another institution for outpatient services
27 17 Discharged/transferred/referred to this institution for outpatient services
28 18 NULL
29 19 Expired at home. Medicaid only, hospice.
30 20 Expired in a medical facility. Medicaid only, hospice.
31 21 Expired, place unknown. Medicaid only, hospice.
32 22 Discharged/transferred to another rehab fac including rehab units of a hospital .
33 23 Discharged/transferred to a long term care hospital.
34 24 Discharged/transferred to a nursing facility certified under Medicaid but not certified under Medicare.
35 25 Not Mapped
36 26 Unknown/Invalid
37 30 Discharged/transferred to another Type of Health Care Institution not Defined Elsewhere
38 27 Discharged/transferred to a federal health care facility.
39 28 Discharged/transferred/referred to a psychiatric hospital of psychiatric distinct part unit of a hospital
40 29 Discharged/transferred to a Critical Access Hospital (CAH).
source_mapping <- as.data.frame(data_mapping[43:67,])
colnames(source_mapping) <- c("admission_source_id","description")
knitr::kable(source_mapping)
admission_source_id description
43 1 Physician Referral
44 2 Clinic Referral
45 3 HMO Referral
46 4 Transfer from a hospital
47 5 Transfer from a Skilled Nursing Facility (SNF)
48 6 Transfer from another health care facility
49 7 Emergency Room
50 8 Court/Law Enforcement
51 9 Not Available
52 10 Transfer from critial access hospital
53 11 Normal Delivery
54 12 Premature Delivery
55 13 Sick Baby
56 14 Extramural Birth
57 15 Not Available
58 17 NULL
59 18 Transfer From Another Home Health Agency
60 19 Readmission to Same Home Health Agency
61 20 Not Mapped
62 21 Unknown/Invalid
63 22 Transfer from hospital inpt/same fac reslt in a sep claim
64 23 Born inside this hospital
65 24 Born outside this hospital
66 25 Transfer from Ambulatory Surgery Center
67 26 Transfer from Hospice

References

Lum, Hillary D, Stephanie A Studenski, Howard B Degenholtz, and Susan E Hardy. 2012. “Early Hospital Readmission Is a Predictor of One-Year Mortality in Community-Dwelling Older Medicare Beneficiaries.” Journal of General Internal Medicine 27 (11): 1467–74.

Rubin, Daniel J. 2015. “Hospital Readmission of Patients with Diabetes.” Current Diabetes Reports 15 (4): 17.

Strack, Beata, Jonathan P DeShazo, Chris Gennings, Juan L Olmo, Sebastian Ventura, Krzysztof J Cios, and John N Clore. 2014. “Impact of Hba1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records.” BioMed Research International 2014.

Zimmet, Paul Z, Dianna J Magliano, William H Herman, and Jonathan E Shaw. 2014. “Diabetes: A 21st Century Challenge.” The Lancet Diabetes & Endocrinology 2 (1): 56–64. https://doi.org/https://doi.org/10.1016/S2213-8587(13)70112-8.