Hospital readmission consumes a large portion of medical resources and is increasingly recognized as an important health care quality measure. Because diabetic patients are at higher risk of readmission, reducing their readmission rates will both lessen the burden of medical costs and improve their quality of life. In this study, we dug into the diabetes data from 130 US hospitals, covering almost 70,000 inpatient diabetes encounters. Tree-based models and multivariate logistic regression were used to predict early readmission and to infer its important factors. Among the predictive models we built, random forest yields the highest prediction accuracy of 93.09%, with sensitivity and specificity of 96.51% and 90.44%. Moreover, our findings show that admission type/source, number of diagnoses/procedures, and diagnosis type are important features impacting early hospital readmission. These results can inform future directions for improving diabetic patient care.
Keywords: Hospital Early Readmission, Diabetes, Predictive Analysis, Clustering, Feature selection
Diabetes, one of the leading causes of death globally, has more than doubled in prevalence over the past 20 years (Zimmet et al. 2014). Early readmission, as a measure of quality of care, plays a critical role in guiding medical professionals to improve treatment. It is defined by the presence of a hospital claim starting within 30 days of discharge from the index hospitalization (Lum et al. 2012). Early readmission not only consumes a large portion of medical resources but is also increasingly recognized as an important health care quality measure (Rubin 2015). As patients with diabetes may be at higher risk of readmission than those without (Rubin 2015), reducing readmission rates of diabetic patients will both lessen the burden of medical costs and improve quality of life.
Therefore, it is important to predict whether a patient will be readmitted, so that actions such as changes in treatment or medication can be taken to prevent readmission. Using the diabetes dataset (Strack et al. 2014), covering 10 years (1999-2008) of clinical care at 130 US hospitals, we develop several risk models to predict whether a diabetic patient will be readmitted. Additionally, we examine the relationship between multiple risk factors and the 30-day readmission rate of diabetic patients, and identify combinations of risk factors that strongly impact hospital readmission. We hypothesized that more actions taken at the hospital (a larger number of procedures/tests) and more careful diagnosis would be associated with a reduction in readmission rates among admitted individuals. Our results highlight useful predictors of readmission, which may prove valuable in developing strategies to reduce readmission rates and costs for the care of individuals with diabetes mellitus.
In this study, we use the Diabetes 130-US hospitals for years 1999-2008 Data Set (Strack et al. 2014). This dataset consists of risk factors related to hospital readmission of diabetes patients, collected from 130 hospitals and integrated delivery networks between 1999 and 2008. It includes over 50 features representing patient and hospital outcomes across over 70,000 observations. Attributes include race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnoses, number of medications, diabetes medications, and numbers of outpatient, inpatient, and emergency visits in the year before the hospitalization.
Readmission is the key label used for prediction in this study. It takes three values in the original dataset:
- No readmission.
- A readmission in less than 30 days (undesirable, since the treatment may not have been appropriate).
- A readmission in more than 30 days (also undesirable, although it may reflect the patient's underlying condition rather than the treatment).
Our goal is to use the risk factors to predict hospital readmission, especially readmission within 30 days, so that we can inform treatment improvement for hospitals.
The data used for the later analysis is created in two steps. We first performed data collection and preprocessing, including formatting and dropping missing values and factorizing features. This step involves only transformations, without any feature selection or observation filtering. We then cleaned the data based on the criteria that follow.
We dropped the features weight, medical_specialty, payer_code, encounter_id, and patient_nbr based on missing-value rates and our prior knowledge. After this comprehensive data cleaning, we carried out exploratory data analysis to gain intuition about the feature distributions and correlations. Lastly, we did data processing such as labelling and splitting before diving into predictive modelling. The process is as follows:
- We binarized readmitted into two classes by merging the long-term readmission class ">30" into the "NO" class. This yields highly imbalanced data, so we rebalanced it using SMOTE (a sketch of the interpolation idea follows this list).
- We dropped observations whose readmitted label is NA.

(Figure: illustration of the SMOTE algorithm.)
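SMOTE generates synthetic minority-class samples by interpolating between a minority observation and one of its nearest minority-class neighbors; a minimal sketch of that interpolation step (toy vectors, not the package implementation):

# toy illustration of SMOTE's interpolation (hypothetical two-feature vectors)
x_i <- c(2, 5)     # a minority-class observation
x_nn <- c(4, 6)    # one of its k nearest minority-class neighbors
lambda <- runif(1) # random interpolation weight in [0, 1]
x_new <- x_i + lambda * (x_nn - x_i) # synthetic sample on the segment between them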
We implemented three machine learning algorithms, namely Random Forest, Decision Tree and Logistic Regression, to identify diabetic patients who have a higher likelihood of being readmitted within 30 days. To improve training accuracy and efficiency, and to assess the important factors significantly associated with readmission, we carried out the model implementation in four steps.
The best model for this purpose has been found to be Random Forest, with over 93% test accuracy.
(Figure: illustration of feature selection.)
Clustering and dimensionality reduction allow us to better understand how a sample might be composed of distinct subgroups given a set of variables. Here we visualize the dataset in lower dimensions to see whether we can find subgroups of readmission given the selected features. In particular, we applied t-SNE to project the feature space into two dimensions. The data is not separable in two dimensions, since the feature space is quite large in our study. For clustering, a popular choice of metric is Euclidean distance. However, Euclidean distance is only valid for continuous variables and is thus not applicable here. Instead, we use the Gower distance, a metric that can handle mixed data types. The following plot illustrates the definition of the Gower distance.
(Figure: illustration of the Gower distance.)
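As a concrete example, the cluster::daisy function used later in this section computes the Gower distance directly; a minimal sketch on a hypothetical mixed-type data frame:

library(cluster)
# hypothetical toy data mixing a numeric and a categorical feature
toy <- data.frame(num_medications = c(10, 25, 12),
                  insulin = factor(c("No", "Up", "No")))
# Gower: range-normalized absolute difference for numeric columns,
# simple matching (0/1 dissimilarity) for categorical columns, averaged across columns
daisy(toy, metric = "gower")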
We then select the optimal number of clusters (2) based on the silhouette width and visualize the result in two dimensions. The result is more separable than the pure t-SNE visualization. However, there is no strong indication of a readmission difference between the two groups, and the proportion of readmitted patients is only slightly larger in the second group. Finally, we apply hierarchical clustering to visualize the distribution of the readmission groups among the branches.
R Libraries used in this study:
library(tidyverse)
library(ggplot2)
library(DT)
library(corrplot)
library(GGally)
library(ggforce)
library(unbalanced)
library(Boruta)
library(caret)
library(randomForest)
library(reshape2)
library(rpart.plot)
library(pROC)
library(MASS)
library(Rtsne)
library(lubridate)
library(rgl)
library(tidyr)
library(cluster)
library(dendextend)

In this section, we perform data collection and cleaning, including formatting missing values, factorizing features, etc.
Here we collected the data from the case-studies on the Canvas site.
data_main <- as.data.frame(read.csv("https://umich.instructure.com/files/5525199/download?download_frd=1"))
data_mapping <- as.data.frame(read.csv("https://umich.instructure.com/files/5525197/download?download_frd=1"))
# different link but same data:
# df3 <- as.data.frame((read.csv("https://umich.instructure.com/files/7843517/download?download_frd=1")))

First look at the data:
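The preview itself is omitted here; a minimal sketch of a quick first look (standard inspection helpers, our choice):

dim(data_main)        # number of encounters and number of features
str(data_main[, 1:8]) # types of the first few columns
# datatable(head(data_main)) # interactive preview via the DT package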
Here we showed a table describing the features in the dataset.
List of features and their descriptions in the initial dataset (Strack et al. 2014)
We observed that some missing values in this dataset are represented as "?" while others are coded as specific integer IDs. Here we convert all of them to NA.
data_main <- data_main %>%
# Convert "?" values to NA
mutate_all(~na_if(.,"?"))
# Set missing values to NA according to mapping data
data_main[(data_main$admission_type_id %in% c(5,6,8)),"admission_type_id"] <- NA
data_main[(data_main$discharge_disposition_id %in% c(18,25,26)), "discharge_disposition_id"] <- NA
data_main[(data_main$admission_source_id %in% c(9,15,17,20,21)),"admission_source_id"] <- NA

In this section we transform appropriate features into categorical data, and convert the diagnosis features into diagnosis groups based on the ICD-9 codes (see Table 2 in Strack et al. 2014).
data_main$race <- as.factor(data_main$race)
data_main$gender <- as.factor(data_main$gender)
data_main$age <- as.factor(data_main$age)
data_main$payer_code <- as.factor(data_main$payer_code)
data_main$max_glu_serum <- as.factor(data_main$max_glu_serum)
data_main$A1Cresult <- as.factor(data_main$A1Cresult)
data_main$admission_type_id <- as.factor(data_main$admission_type_id)
data_main$admission_source_id <- as.factor(data_main$admission_source_id)
data_main$discharge_disposition_id <- as.factor(data_main$discharge_disposition_id)
for (i in 25:50){
data_main[,i] <- as.factor(data_main[,i])
}

convert_diag <- function(feature,data_main)
{
val_name <- paste(feature,"_group",sep='')
data_main[,val_name] <- NA
for (i in 1:length(data_main[,feature]))
{
if (!is.na(data_main[i,feature]))
{
if (grepl("[E-Ve-v]", data_main[i,feature], perl = T)) {data_main[i,val_name] <- "Other"}
else{
cur_val <- as.numeric(data_main[i,feature])
if (cur_val %in% c(390:459,785)) {data_main[i,val_name] <- "Circulatory"}
else if (cur_val %in% c(460:519,786)) {data_main[i,val_name] <- "Respiratory"}
else if (cur_val %in% c(520:579,787)) {data_main[i,val_name] <- "Digestive"}
else if (cur_val %in% c(800:999)) {data_main[i,val_name] <- "Injury"}
else if (cur_val %in% c(710:739)) {data_main[i,val_name] <- "Musculoskeletal"}
else if (cur_val %in% c(580:629,788)) {data_main[i,val_name] <- "Genitourinary"}
else if (cur_val %in% c(140:239)) {data_main[i,val_name] <- "Neoplasms"}
else if (cur_val>=250 & cur_val<251) {data_main[i,val_name] <- "Diabetes"}
else {data_main[i,val_name] <- "Other"}
}
}
}
return(data_main)
}
for (i in c("diag_1","diag_2","diag_3")){
data_main <- convert_diag(i,data_main)
}
data_main$diag_1_group <- as.factor(data_main$diag_1_group)
data_main$diag_2_group <- as.factor(data_main$diag_2_group)
data_main$diag_3_group <- as.factor(data_main$diag_3_group)
data_main <- subset(data_main, select = -c(diag_1,diag_2,diag_3))

We did the following to ensure data independence and reduce bias, in order to improve the prediction of readmission.
In this section, we pruned the features based on our prior knowledge and on criteria such as the missing-value rate and possible correlations.
Prune Features Based on Data Missing Rates
Here we compute the percentage of missing values for each feature and determine whether to remove features with a large amount of missing values.
miss_data <- data.frame(features = colnames(data_main),
missing_rates = round(colSums(is.na(data_main))/nrow(data_main),3))
miss_data <- miss_data[order(-miss_data$missing_rates),]
# datatable(miss_data)
miss_data$features = factor(miss_data$features, levels = unique(miss_data$features))
ggplot(miss_data, aes(features,missing_rates,fill=missing_rates))+
geom_bar(stat="identity")+
ggtitle("Missing value rates of 50 features") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

We decided to drop the features with over 25% missing values: weight, medical_specialty, and payer_code.
Prune Features Based on Prior Knowledge
Feature Pruning Summary: in total there are five features to drop: weight, medical_specialty, payer_code, encounter_id, and patient_nbr. The corresponding pruning step is sketched below.
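A minimal sketch of this pruning step (feature names exactly as listed above):

# drop the five features identified by missing rates and prior knowledge
data_main <- subset(data_main,
                    select = -c(weight, medical_specialty, payer_code,
                                encounter_id, patient_nbr))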
In this section, we carry out some exploratory data analysis to gain some intuition.
data_factor <- data_main[, unlist(lapply(data_main, is.factor))]
num_subplots <- 8
for (col_start in seq(1, ncol(data_factor),num_subplots)){
long <- pivot_longer(data_factor[,col_start:min(col_start+num_subplots-1,ncol(data_factor))], everything(),
names_to = "features",
values_to = "categories")
g <- ggplot(long, aes(x=categories, fill = features))
print(g + geom_histogram(stat = "count") +
facet_wrap(~features,scales = "free", ncol = 4) +
ggtitle("Histogram Plots") + coord_flip() +
scale_y_continuous(breaks = scales::pretty_breaks(n=5)) +
theme(legend.position = "top", strip.text.x = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)))
}

Some preliminary conclusions from the visualizations of the categorical variables:
data_numeric <- data_main[, unlist(lapply(data_main, is.numeric))]
num_subplots <- 8
for (col_start in seq(1, ncol(data_numeric),num_subplots)){
long <- pivot_longer(data_numeric[,col_start:min(col_start+num_subplots-1,ncol(data_numeric))], everything(),
names_to = "features",
values_to = "values")
g <- ggplot(long, aes(x=values, fill = features))
print(g + geom_histogram(stat = "count") +
facet_wrap(~features,scales = "free", ncol = 4) +
ggtitle("Histogram Plots") + coord_flip() +
scale_y_continuous(breaks = scales::pretty_breaks(n=5)) +
theme(legend.position = "top", strip.text.x = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)))
}

Some preliminary conclusions from the visualizations of the numerical variables:
Here we plot the correlations between the numerical variables, and find that time spent in hospital, number of medications, and number of procedures are positively correlated.
data_numeric_cor <- cor(data_numeric,use="pairwise.complete.obs")
corrplot(data_numeric_cor, method="color", type="upper", tl.cex = 0.8, mar=c(0,0,1,0),
tl.pos='d',tl.offset = 1)

In this section, we prepared the data before diving into predictive modelling. We did the following:
- We binarize readmitted into two classes by merging the long-term readmission class ">30" into the "NO" class. This yields highly imbalanced data, so we rebalance it using SMOTE.
- We drop observations whose readmitted label is NA.

data_predict <- drop_na(data_main, readmitted) %>%
mutate(readmitted = as.factor(if_else(readmitted == "<30", 1, 0)))
# rebalance dataset
set.seed(53)
percOver <- (sum(data_predict$readmitted==0)-sum(data_predict$readmitted==1))/sum(data_predict$readmitted==1)*100
percUnder <- sum(data_predict$readmitted==0)/(sum(data_predict$readmitted==0)-sum(data_predict$readmitted==1))*100
balanced <- ubBalance(X=data_predict[ , -which(names(data_predict) %in% c("readmitted"))],
Y=data_predict$readmitted, type="ubSMOTE",
percOver=percOver, percUnder=percUnder, verbose=TRUE)

WARNING: NAs generated by SMOTE removed
Proportion of positives after ubSMOTE : 50.69 % of 100685 observations
# plot the result
counts <- rbind(table(data_predict$readmitted),table(balanced$Y))
barplot(t(counts), main="Before and After Rebalancing",
names.arg=c("Before", "After"), beside=TRUE,
xlab="Raw vs. Rebalanced Data",ylab="Number of cases",
col=c("orange","darkgreen"), legend = c("NO", "YES")) In this section, we will predict the readmitted outcome with different models.
After setting up the data, we use multiple prediction methods for this classification problem, including random forest, decision tree, and logistic regression.
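The split of the rebalanced data into data_train and data_test is not shown; a minimal sketch, assuming the SMOTE output is recombined and split roughly 90/10 (a ratio we infer from the reported train and test set sizes):

set.seed(53)
# recombine the rebalanced features and label returned by ubBalance
data_balanced <- cbind(balanced$X, readmitted = balanced$Y)
train_id <- sample(nrow(data_balanced), round(0.9 * nrow(data_balanced)))
data_train <- data_balanced[train_id, ]
data_test <- data_balanced[-train_id, ] # note: outputs below show the label relabelled as NO/YES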
Build the training model
library(randomForest)
library(caTools)
rf <- randomForest(readmitted~.,
data = data_train,
na.action = na.omit,
importance = TRUE)
print(rf)
Call:
randomForest(formula = readmitted ~ ., data = data_train, importance = TRUE, na.action = na.omit)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 6
OOB estimate of error rate: 7.23%
Confusion matrix:
NO YES class.error
NO 34547 1448 0.04022781
YES 4393 40353 0.09817637
# Random forest parameter tuning
# set.seed(1)
# data_train_temp <- data_train[complete.cases(data_train),]
# bestRF <- tuneRF(x = select(data_train_temp, -readmitted),
# y = data_train_temp$readmitted,
# stepFactor = 1.5,
# improve = 1e-5,
# ntree = 500)

Evaluate the model performance on test data
test_pred <- predict(rf, data_test)
rf_cfm <- caret::confusionMatrix(test_pred, data_test$readmitted)
rf_cfm

Confusion Matrix and Statistics
Reference
Prediction NO YES
NO 3786 483
YES 137 4570
Accuracy : 0.9309
95% CI : (0.9255, 0.9361)
No Information Rate : 0.5629
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.861
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9651
Specificity : 0.9044
Pos Pred Value : 0.8869
Neg Pred Value : 0.9709
Prevalence : 0.4371
Detection Rate : 0.4218
Detection Prevalence : 0.4756
Balanced Accuracy : 0.9347
'Positive' Class : NO
fourfoldplot(rf_cfm$table, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "RF - Confusion Matrix (ACC: 93.09%)")Plot ROC curve
test_prob <- predict(rf, data_test, type = "prob")
par(pty="s")
roc_rf <- roc(data_test$readmitted ~ test_prob[,2], plot = TRUE, print.auc = TRUE, main="ROC curve of random forest")

Build the tree model
library(rpart)
set.seed(53)
control <- rpart.control(cp = 0.000, xval = 1000, minsplit = 1, maxdepth = 5) # rpart.control's cross-validation argument is xval
rpart_model <- rpart(readmitted~.,
data = data_train,
method = "class",
control = control)
printcp(rpart_model)
Classification tree:
rpart(formula = readmitted ~ ., data = data_train, method = "class",
control = control)
Variables actually used in tree construction:
[1] A1Cresult admission_source_id admission_type_id
[4] age diag_1_group diag_2_group
[7] diag_3_group num_lab_procedures num_medications
[10] number_diagnoses time_in_hospital
Root node error: 44737/90617 = 0.49369
n= 90617
CP nsplit rel error xerror xstd
1 0.2008405 0 1.00000 1.00000 0.0033641
2 0.1219572 1 0.79916 0.79916 0.0032887
3 0.0313164 2 0.67720 0.67720 0.0031744
4 0.0238840 4 0.61457 0.61493 0.0030939
5 0.0212352 6 0.56680 0.56539 0.0030184
6 0.0130541 7 0.54557 0.54637 0.0029864
7 0.0103941 8 0.53251 0.53473 0.0029660
8 0.0087400 9 0.52212 0.52730 0.0029527
9 0.0072423 10 0.51338 0.52078 0.0029407
10 0.0070747 11 0.50614 0.51633 0.0029325
11 0.0069294 13 0.49199 0.51141 0.0029232
12 0.0038223 14 0.48506 0.50348 0.0029081
13 0.0030176 16 0.47741 0.49498 0.0028915
14 0.0000000 17 0.47439 0.48687 0.0028752
# rpart.plot(rpart_model, type = 1,
# clip.right.labs = TRUE,
# extra = 101,
# under.cex = 1,
# round = 0, branch.lwd = 2,
# branch = .5, under = TRUE)
rattle::fancyRpartPlot(rpart_model, sub = "")

Evaluation on test set
rpart_test_pred <- predict(rpart_model, data_test)
rpart_test_pred <- factor(colnames(rpart_test_pred)[max.col(rpart_test_pred)])
rpart_test_cfm <- caret::confusionMatrix(rpart_test_pred, data_test$readmitted)
rpart_test_cfm

Confusion Matrix and Statistics
Reference
Prediction NO YES
NO 3770 1184
YES 1143 3971
Accuracy : 0.7689
95% CI : (0.7605, 0.7771)
No Information Rate : 0.512
P-Value [Acc > NIR] : <2e-16
Kappa : 0.5376
Mcnemar's Test P-Value : 0.407
Sensitivity : 0.7674
Specificity : 0.7703
Pos Pred Value : 0.7610
Neg Pred Value : 0.7765
Prevalence : 0.4880
Detection Rate : 0.3745
Detection Prevalence : 0.4921
Balanced Accuracy : 0.7688
'Positive' Class : NO
fourfoldplot(rpart_test_cfm$table, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "rpart - Confusion Matrix (ACC: 76.89%)")Plot ROC curve
test_prob <- predict(rpart_model, data_test, type = "prob")
par(pty="s")
roc_dt <- roc(data_test$readmitted ~ test_prob[,2], plot = TRUE, print.auc = TRUE, main="ROC curve of decision tree")

Next, we perform recursive feature elimination (RFE) on a 10% subsample of the training data.

set.seed(123)
# use 10% of the training data for feature selection
num_train <- nrow(data_train)
sele_id <- sample(num_train, 0.1*num_train)
data_train.sele <- data_train[sele_id,]
control <- rfeControl(functions = rfFuncs, method = "cv", number=10)
rf.select <- rfe(data_train.sele[complete.cases(data_train.sele), ][, -length(colnames(data_train.sele))],
data_train.sele[complete.cases(data_train.sele), ][, length(colnames(data_train.sele))],
sizes=seq(1,45,by=10), rfeControl=control)
rf.select
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold)
Resampling performance over subset size:
Variables Accuracy Kappa AccuracySD KappaSD Selected
1 0.7564 0.4864 0.01231 0.02695
11 0.8770 0.7529 0.01091 0.02175
21 0.8803 0.7593 0.01284 0.02548
31 0.8851 0.7688 0.01240 0.02478
41 0.8865 0.7715 0.01301 0.02596 *
44 0.8852 0.7691 0.01289 0.02574
The top 5 variables (out of 41):
number_diagnoses, diag_2_group, num_medications, time_in_hospital, num_procedures
[1] "number_diagnoses" "diag_2_group"
[3] "num_medications" "time_in_hospital"
[5] "num_procedures" "diag_1_group"
[7] "num_lab_procedures" "age"
[9] "diag_3_group" "admission_type_id"
[11] "gender" "A1Cresult"
[13] "admission_source_id" "repaglinide"
[15] "metformin" "number_inpatient"
[17] "pioglitazone" "discharge_disposition_id"
[19] "glipizide" "race"
[21] "number_outpatient" "diabetesMed"
[23] "number_emergency" "glimepiride"
[25] "change" "insulin"
[27] "glyburide" "rosiglitazone"
[29] "glyburide.metformin" "nateglinide"
[31] "max_glu_serum" "acarbose"
[33] "tolazamide" "acetohexamide"
[35] "chlorpropamide" "citoglipton"
[37] "examide" "glimepiride.pioglitazone"
[39] "glipizide.metformin" "metformin.pioglitazone"
[41] "metformin.rosiglitazone"
rpartImp <- varImp(rpart_model)
rpartImp <- data.frame(feature = row.names(rpartImp), rpartImp)
rpartImp <- rpartImp[order(-rpartImp$Overall),]
rownames(rpartImp) <- 1:length(rpartImp$Overall)
rpartImp$feature <- factor(rpartImp$feature,levels = unique(rpartImp$feature))
ggplot(rpartImp, aes(feature,Overall,fill=Overall))+
geom_bar(stat="identity")+
ggtitle("Rpart Variable Importance") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
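The vector rpart_feature used in the next chunk is not constructed in the code shown; a minimal sketch of a plausible definition (our assumption), taking the variables actually used for splits in the fitted tree:

# assumed definition: split variables of the fitted rpart model
rpart_feature <- unique(as.character(rpart_model$frame$var[rpart_model$frame$var != "<leaf>"]))

# dist of other features across the two groups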
feature_all <- c(as.character(rpart_feature),"readmitted")
data2plot <- data_main %>%
mutate(admission_source_id = as.factor(admission_source_id)) %>%
mutate(admission_type_id = as.factor(admission_type_id)) %>%
drop_na(readmitted) %>%
mutate(readmitted = as.factor(if_else(readmitted == "<30", "YES", "NO")))
# numerical features
numerical <- unlist(lapply(data2plot, is.numeric))
data_numerical <- data.frame(data2plot[,numerical], readmitted=data2plot$readmitted)
data_numerical <- data_numerical[which(colnames(data_numerical) %in% feature_all)]
long <- gather(data_numerical, key, value, -readmitted)
ggplot(long, aes(x = value,fill=readmitted, group = readmitted)) +
geom_histogram(aes(y=..density..),position = "dodge",alpha=0.8) +
facet_wrap(~ key,scales = "free") +
ggtitle("Visualize important features Across two readmitted groups")# categorical feature
categorical <- unlist(lapply(data2plot, is.factor))
data_categorical <- data.frame(data2plot[,categorical], readmitted=data2plot[,ncol(data2plot)])
data_categorical <- data_categorical[which(colnames(data_categorical) %in% feature_all)]
long <- gather(data_categorical, key, value, -readmitted)
ggplot(long, aes(x = value,fill=readmitted,color=readmitted, group = readmitted)) +
geom_bar(aes(y = ..prop..),stat = "count", alpha=0.8,position = 'dodge') +
facet_wrap(~ key,scales = "free", ncol = 3) +
ggtitle("Visualize important features Across two readmitted groups") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) + scale_y_continuous(labels=scales::percent)set.seed(123)
data_boruta <- Boruta(readmitted~., data=data_train.sele[complete.cases(data_train.sele), ], doTrace=0)
print(data_boruta)

Boruta performed 99 iterations in 6.976531 mins.
28 attributes confirmed important: A1Cresult, admission_source_id,
admission_type_id, age, change and 23 more;
14 attributes confirmed unimportant: acarbose, acetohexamide,
chlorpropamide, citoglipton, examide and 9 more;
2 tentative attributes left: glyburide.metformin, nateglinide;
par(mar=c(8,8,4,4),oma=c(0.1,0.1,0.1,0))
plot(data_boruta, xlab="", xaxt="n",cex.lab=3,cex.axis=2)
title("Boruta Feature Selection")
lz <- lapply(1:ncol(data_boruta$ImpHistory), function(i)
data_boruta$ImpHistory[is.finite(data_boruta$ImpHistory[, i]), i])
names(lz) <- colnames(data_boruta$ImpHistory)
lb <- sort(sapply(lz, median))
axis(side=1, las=2, labels=names(lb), at=1:ncol(data_boruta$ImpHistory), cex.axis=0.5, font = 4)
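The vectors boruta_feature and RFE_feature compared in the next chunk are likewise not defined in the code shown; a minimal sketch of plausible definitions (our assumption, based on the Boruta and RFE results above):

boruta_feature <- getSelectedAttributes(data_boruta) # attributes Boruta confirmed important (assumed)
RFE_feature <- predictors(rf.select)                 # variables retained by RFE (assumed)

all_var <- colnames(data_train)[-45]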
plot_data <- data.frame(variable = all_var,
Boruta = ifelse(all_var %in% boruta_feature, 1, 0),
RFE = ifelse(all_var %in% RFE_feature, 1, 0),
Rpart = ifelse(all_var %in% rpart_feature, 1, 0))
# plot only selected features
plot_data <- subset(plot_data,!(Boruta==0 & RFE==0 & Rpart==0))
plot_data <- melt(plot_data, value.name = "select")
colnames(plot_data) <- c("variable","method","select")
ggplot(plot_data, aes(x=variable, y=select, fill=method)) +
geom_bar(position="stack", stat="identity") +
ggtitle("Comparison of selected features") +
#theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
xlab("") +
coord_flip()

The three methods select similar features, and we will use the features selected by Boruta for the logistic regression. This reduces the feature space from 44 to 28.
# select important features and remove rows with missing value
data_train.prune <- data_train[,c(boruta_feature,"readmitted")]
data_test.prune <- data_test[,c(boruta_feature,"readmitted")]
logit <- glm(readmitted ~ ., data = data_train.prune, family = "binomial",
na.action = na.omit)
summary(logit)
Call:
glm(formula = readmitted ~ ., family = "binomial", data = data_train.prune,
na.action = na.omit)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.3649 -0.7178 0.2616 0.6827 3.3186
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.496e+00 1.030e+00 -7.279 3.37e-13 ***
raceAsian -2.988e-01 1.286e-01 -2.324 0.020150 *
raceCaucasian 4.973e-01 2.704e-02 18.390 < 2e-16 ***
raceHispanic -3.508e-02 8.113e-02 -0.432 0.665495
raceOther -3.228e-01 9.039e-02 -3.571 0.000355 ***
genderMale 7.604e-01 1.892e-02 40.199 < 2e-16 ***
age[10-20) 6.437e-01 4.981e-01 1.292 0.196299
age[20-30) 6.442e-01 4.790e-01 1.345 0.178617
age[30-40) 3.780e-01 4.746e-01 0.797 0.425710
age[40-50) 2.360e-01 4.724e-01 0.500 0.617415
age[50-60) 1.055e+00 4.716e-01 2.237 0.025275 *
age[60-70) 1.445e-01 4.718e-01 0.306 0.759454
age[70-80) 6.632e-01 4.717e-01 1.406 0.159731
age[80-90) 1.004e+00 4.719e-01 2.127 0.033427 *
age[90-100) 2.126e-01 4.758e-01 0.447 0.654955
admission_type_id2 5.940e-01 2.453e-02 24.218 < 2e-16 ***
admission_type_id3 -8.052e-01 3.313e-02 -24.306 < 2e-16 ***
admission_type_id4 7.940e-01 9.087e-01 0.874 0.382256
admission_type_id7 -1.246e+01 7.133e+01 -0.175 0.861336
discharge_disposition_id2 -1.690e-01 6.210e-02 -2.722 0.006497 **
discharge_disposition_id3 8.898e-02 3.146e-02 2.828 0.004681 **
discharge_disposition_id4 -2.890e-01 1.090e-01 -2.652 0.007997 **
discharge_disposition_id5 7.009e-01 7.817e-02 8.967 < 2e-16 ***
discharge_disposition_id6 -3.191e-01 3.247e-02 -9.828 < 2e-16 ***
discharge_disposition_id7 -2.050e-01 1.260e-01 -1.627 0.103634
discharge_disposition_id8 -3.354e-01 2.930e-01 -1.145 0.252299
discharge_disposition_id9 1.328e+00 1.148e+00 1.156 0.247485
discharge_disposition_id12 5.427e-01 9.035e-01 0.601 0.548090
discharge_disposition_id15 1.145e+00 3.075e-01 3.722 0.000197 ***
discharge_disposition_id22 1.154e+00 6.005e-02 19.220 < 2e-16 ***
discharge_disposition_id23 -9.847e-01 2.196e-01 -4.483 7.35e-06 ***
discharge_disposition_id24 -3.717e-01 6.184e-01 -0.601 0.547812
discharge_disposition_id27 -1.124e+01 3.247e+02 -0.035 0.972397
discharge_disposition_id28 2.084e+00 2.566e-01 8.122 4.59e-16 ***
admission_source_id2 -1.295e+00 9.416e-02 -13.757 < 2e-16 ***
admission_source_id3 4.395e-01 2.649e-01 1.659 0.097107 .
admission_source_id4 -1.405e+00 5.752e-02 -24.419 < 2e-16 ***
admission_source_id5 -1.156e+00 1.225e-01 -9.440 < 2e-16 ***
admission_source_id6 -9.939e-01 7.181e-02 -13.840 < 2e-16 ***
admission_source_id7 -9.757e-01 2.528e-02 -38.595 < 2e-16 ***
admission_source_id8 -8.703e-01 9.070e-01 -0.960 0.337303
admission_source_id10 -1.439e+01 1.371e+02 -0.105 0.916433
admission_source_id11 -1.190e+01 2.296e+02 -0.052 0.958672
admission_source_id13 -1.385e+01 3.247e+02 -0.043 0.965976
admission_source_id14 -1.100e+01 2.296e+02 -0.048 0.961778
admission_source_id22 -1.106e+01 1.562e+02 -0.071 0.943569
admission_source_id25 -1.207e+01 2.296e+02 -0.053 0.958066
time_in_hospital -3.141e-03 4.469e-03 -0.703 0.482147
num_lab_procedures 1.609e-02 6.284e-04 25.597 < 2e-16 ***
num_procedures 1.078e-01 6.302e-03 17.099 < 2e-16 ***
num_medications -6.368e-02 1.734e-03 -36.736 < 2e-16 ***
number_outpatient -1.614e-01 1.170e-02 -13.792 < 2e-16 ***
number_emergency -3.634e-03 2.087e-02 -0.174 0.861770
number_inpatient 1.988e-01 1.625e-02 12.236 < 2e-16 ***
number_diagnoses 2.876e-01 6.563e-03 43.830 < 2e-16 ***
A1Cresult>8 7.208e-02 6.346e-02 1.136 0.256017
A1CresultNone 6.439e-01 5.267e-02 12.225 < 2e-16 ***
A1CresultNorm 1.396e+00 6.049e-02 23.071 < 2e-16 ***
metforminNo 1.062e-01 1.370e-01 0.775 0.438139
metforminSteady -4.561e-01 1.382e-01 -3.300 0.000966 ***
metforminUp -3.137e-01 1.700e-01 -1.846 0.064948 .
repaglinideNo 1.376e+00 6.398e-01 2.150 0.031541 *
repaglinideSteady 3.131e+00 6.425e-01 4.873 1.10e-06 ***
repaglinideUp 2.700e+00 6.980e-01 3.868 0.000110 ***
glimepirideNo 1.720e-01 2.039e-01 0.843 0.399082
glimepirideSteady -6.006e-01 2.088e-01 -2.876 0.004029 **
glimepirideUp 2.927e-02 2.686e-01 0.109 0.913238
glipizideNo 2.815e-01 1.331e-01 2.115 0.034441 *
glipizideSteady 6.266e-01 1.345e-01 4.660 3.17e-06 ***
glipizideUp 5.772e-02 1.730e-01 0.334 0.738587
glyburideNo 1.798e-01 1.444e-01 1.245 0.213156
glyburideSteady -4.038e-01 1.474e-01 -2.739 0.006163 **
glyburideUp -2.978e-01 1.835e-01 -1.623 0.104576
pioglitazoneNo -1.750e-01 2.797e-01 -0.626 0.531498
pioglitazoneSteady 4.507e-01 2.809e-01 1.605 0.108535
pioglitazoneUp -3.949e-01 3.489e-01 -1.132 0.257781
rosiglitazoneNo 1.747e+00 4.910e-01 3.559 0.000373 ***
rosiglitazoneSteady 1.308e+00 4.925e-01 2.656 0.007916 **
rosiglitazoneUp 2.031e+00 5.445e-01 3.730 0.000191 ***
insulinNo 2.840e-01 3.971e-02 7.152 8.53e-13 ***
insulinSteady 3.264e-01 3.673e-02 8.888 < 2e-16 ***
insulinUp -3.484e-02 4.577e-02 -0.761 0.446530
changeNo 1.409e-01 2.508e-02 5.620 1.91e-08 ***
diabetesMedYes 5.741e-01 2.752e-02 20.863 < 2e-16 ***
diag_1_groupDiabetes -6.096e-01 4.257e-02 -14.318 < 2e-16 ***
diag_1_groupDigestive -1.148e+00 3.938e-02 -29.157 < 2e-16 ***
diag_1_groupGenitourinary -9.424e-01 4.881e-02 -19.306 < 2e-16 ***
diag_1_groupInjury -7.420e-01 4.267e-02 -17.391 < 2e-16 ***
diag_1_groupMusculoskeletal -7.626e-01 5.027e-02 -15.170 < 2e-16 ***
diag_1_groupNeoplasms -1.163e+00 6.020e-02 -19.316 < 2e-16 ***
diag_1_groupOther -1.033e+00 3.024e-02 -34.166 < 2e-16 ***
diag_1_groupRespiratory -2.226e-01 2.943e-02 -7.564 3.90e-14 ***
diag_2_groupDiabetes -1.719e-02 3.516e-02 -0.489 0.624930
diag_2_groupDigestive -2.148e-01 5.642e-02 -3.808 0.000140 ***
diag_2_groupGenitourinary -3.964e-01 4.053e-02 -9.781 < 2e-16 ***
diag_2_groupInjury -1.695e-02 6.525e-02 -0.260 0.794997
diag_2_groupMusculoskeletal 1.509e+00 5.764e-02 26.180 < 2e-16 ***
diag_2_groupNeoplasms 2.044e+00 4.864e-02 42.027 < 2e-16 ***
diag_2_groupOther 1.718e-01 2.517e-02 6.827 8.67e-12 ***
diag_2_groupRespiratory -5.950e-01 3.678e-02 -16.178 < 2e-16 ***
diag_3_groupDiabetes 1.000e-01 2.653e-02 3.771 0.000163 ***
diag_3_groupDigestive -6.250e-01 5.658e-02 -11.046 < 2e-16 ***
diag_3_groupGenitourinary -8.173e-01 4.432e-02 -18.443 < 2e-16 ***
diag_3_groupInjury -5.557e-01 7.410e-02 -7.498 6.46e-14 ***
diag_3_groupMusculoskeletal -9.346e-01 8.098e-02 -11.541 < 2e-16 ***
diag_3_groupNeoplasms -8.766e-01 7.731e-02 -11.340 < 2e-16 ***
diag_3_groupOther -8.043e-01 2.498e-02 -32.204 < 2e-16 ***
diag_3_groupRespiratory -7.967e-01 4.127e-02 -19.304 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 110980 on 80740 degrees of freedom
Residual deviance: 73077 on 80633 degrees of freedom
(9876 observations deleted due to missingness)
AIC: 73293
Number of Fisher Scoring iterations: 11
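To aid interpretation of the coefficient magnitudes, the log-odds can be converted to odds ratios; a minimal sketch using the fitted logit object:

# odds ratios with Wald 95% confidence intervals
or_table <- exp(cbind(OR = coef(logit), confint.default(logit)))
head(or_table[order(-or_table[, "OR"]), ], 10) # ten largest odds ratios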
Important features associated with a higher readmission rate
selectedFeatures <- rownames(summary(logit)$coefficients)[summary(logit)$coefficients[,4] < 0.001 & summary(logit)$coefficients[,1] > 0]
selectedFeatures

[1] "raceCaucasian" "genderMale"
[3] "admission_type_id2" "discharge_disposition_id5"
[5] "discharge_disposition_id15" "discharge_disposition_id22"
[7] "discharge_disposition_id28" "num_lab_procedures"
[9] "num_procedures" "number_inpatient"
[11] "number_diagnoses" "A1CresultNone"
[13] "A1CresultNorm" "repaglinideSteady"
[15] "repaglinideUp" "glipizideSteady"
[17] "rosiglitazoneNo" "rosiglitazoneUp"
[19] "insulinNo" "insulinSteady"
[21] "changeNo" "diabetesMedYes"
[23] "diag_2_groupMusculoskeletal" "diag_2_groupNeoplasms"
[25] "diag_2_groupOther" "diag_3_groupDiabetes"
Important features associated with a lower readmission rate
selectedFeatures <- rownames(summary(logit)$coefficients)[summary(logit)$coefficients[,4] < 0.001 & summary(logit)$coefficients[,1] < 0]
selectedFeatures

[1] "(Intercept)" "raceOther"
[3] "admission_type_id3" "discharge_disposition_id6"
[5] "discharge_disposition_id23" "admission_source_id2"
[7] "admission_source_id4" "admission_source_id5"
[9] "admission_source_id6" "admission_source_id7"
[11] "num_medications" "number_outpatient"
[13] "metforminSteady" "diag_1_groupDiabetes"
[15] "diag_1_groupDigestive" "diag_1_groupGenitourinary"
[17] "diag_1_groupInjury" "diag_1_groupMusculoskeletal"
[19] "diag_1_groupNeoplasms" "diag_1_groupOther"
[21] "diag_1_groupRespiratory" "diag_2_groupDigestive"
[23] "diag_2_groupGenitourinary" "diag_2_groupRespiratory"
[25] "diag_3_groupDigestive" "diag_3_groupGenitourinary"
[27] "diag_3_groupInjury" "diag_3_groupMusculoskeletal"
[29] "diag_3_groupNeoplasms" "diag_3_groupOther"
[31] "diag_3_groupRespiratory"
By analyzing the important features from the logistic model, we make the following observations:
Evaluation on the test data
# remove rows with missing value
data_train.prune <- data_train.prune[complete.cases(data_train.prune), ]
data_test.prune <- data_test.prune[complete.cases(data_test.prune), ]
logit.prob <- predict(logit, data_test.prune, type="response")
logit.pred <- factor(ifelse(logit.prob>0.5,"YES","NO"))
# implied by the later use of logit_test_cfm: compute the confusion matrix shown below
logit_test_cfm <- caret::confusionMatrix(logit.pred, data_test.prune$readmitted)
logit_test_cfm

Confusion Matrix and Statistics
Reference
Prediction NO YES
NO 2972 860
YES 951 4193
Accuracy : 0.7982
95% CI : (0.7898, 0.8065)
No Information Rate : 0.5629
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.5889
Mcnemar's Test P-Value : 0.03444
Sensitivity : 0.7576
Specificity : 0.8298
Pos Pred Value : 0.7756
Neg Pred Value : 0.8151
Prevalence : 0.4371
Detection Rate : 0.3311
Detection Prevalence : 0.4269
Balanced Accuracy : 0.7937
'Positive' Class : NO
fourfoldplot(logit_test_cfm$table, color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1, main = "Logistic Regression - Confusion Matrix (79.82%)")par(pty="s")
roc_lg <- roc(data_test.prune$readmitted ~ logit.prob, plot = TRUE, print.auc = TRUE, main="ROC curve of logistic regression")

In this section, we summarize the overall performance of the three models we applied for predicting readmission likelihood.
compare_model <- data.frame(Model=c("Random Forest", "Decision Tree", "Logistic Regression"),
Accuracy = c(0.9309,0.7689,0.7982),
Kappa = c(0.861,0.5376,0.5889),
Sensitivity = c(0.9651,0.7674,0.7576),
Specificity = c(0.9044,0.7703,0.8298),
AUC = c(0.978, 0.811, 0.874))
knitr::kable(compare_model)

| Model | Accuracy | Kappa | Sensitivity | Specificity | AUC |
|---|---|---|---|---|---|
| Random Forest | 0.9309 | 0.8610 | 0.9651 | 0.9044 | 0.978 |
| Decision Tree | 0.7689 | 0.5376 | 0.7674 | 0.7703 | 0.811 |
| Logistic Regression | 0.7982 | 0.5889 | 0.7576 | 0.8298 | 0.874 |
par(pty="s")
plot(roc_rf, col = "green", main = "ROC for three models")
lines(roc_dt, col = "red")
lines(roc_lg, col = "blue")
legend("bottomright", c("Random Forest (0.978)","Decision Tree (0.811)","Logistic Regression (0.874)"), fill=c("green","red","blue"),bty = "n")set.seed(100)
data_train.prune <- data_train.prune[complete.cases(data_train.prune), ]
num_train <- nrow(data_train.prune)
sele_id <- sample(num_train, 0.1*num_train)
data_train.temp <- data_train.prune[sele_id,]
rownames(data_train.temp) <- 1:nrow(data_train.temp)
tsne <- Rtsne(data_train.temp[,-ncol(data_train.temp)],
dims = 2, perplexity=30, verbose=TRUE,
max_iter = 500, check_duplicates = FALSE)

Performing PCA
Read the 7235 x 50 data matrix successfully!
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Building tree...
Done in 1.13 seconds (sparsity = 0.016896)!
Learning embedding...
Iteration 50: error is 93.502540 (50 iterations in 1.19 seconds)
Iteration 100: error is 78.804539 (50 iterations in 1.02 seconds)
Iteration 150: error is 76.776172 (50 iterations in 0.95 seconds)
Iteration 200: error is 76.237789 (50 iterations in 0.99 seconds)
Iteration 250: error is 75.990168 (50 iterations in 0.99 seconds)
Iteration 300: error is 2.564907 (50 iterations in 0.87 seconds)
Iteration 350: error is 2.203439 (50 iterations in 0.89 seconds)
Iteration 400: error is 2.008632 (50 iterations in 0.88 seconds)
Iteration 450: error is 1.884065 (50 iterations in 0.90 seconds)
Iteration 500: error is 1.797850 (50 iterations in 0.90 seconds)
Fitting performed in 9.58 seconds.
tsne_data <- tsne$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(readmitted = data_train.temp$readmitted)
ggplot(aes(x = X, y = Y), data = tsne_data) +
geom_point(aes(color = readmitted))

gower_dist <- daisy(data_train.temp[, -ncol(data_train.temp)],
metric = "gower",
type = list(logratio = 3))
summary(gower_dist)

26168995 dissimilarities, summarized :
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.2847 0.3385 0.3371 0.3891 0.6699
Metric : mixed ; Types = N, N, N, N, N, N, I, I, I, I, I, I, I, I, N, N, N, N, N, N, N, N, N, N, N, N, N, N
Number of objects : 7235
sil_width <- c(NA)
for(i in 2:10){
pam_fit <- pam(gower_dist,
diss = TRUE,
k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
plot(1:10, sil_width,
xlab = "Number of clusters",
ylab = "Silhouette Width")
lines(1:10, sil_width)

pam_fit <- pam(gower_dist, diss = TRUE, k = 2)
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit$clustering),
readmitted = data_train.temp$readmitted)
ggplot(aes(x = X, y = Y), data = tsne_data) +
geom_point(aes(color = cluster)) + ggtitle("Clustering on pruned data")

ggplot(aes(x = X, y = Y), data = tsne_data) +
geom_point(aes(color = readmitted, shape = readmitted)) +
ggtitle("Compare cluster and readmitted type")
# cross-tabulation of cluster membership vs. readmission status (implied by the output below)
table(tsne_data$cluster, tsne_data$readmitted)

NO YES
1 2129 2057
2 1163 1886
From the above we find no strong indication of a readmission difference between the two clusters.
hc <- hclust(gower_dist, method = "ward.D") # "ward.D" is the current name for the legacy "ward" method
sub_grp <- cutree(hc, k = 2)
# data_train.temp %>% mutate(cluster = sub_grp)
# rect.hclust(hc, k = 2, border = 2:3)
dend <- as.dendrogram(hc)
plot(dend,cex=0,main="Hierachical Clustering",leaflab = "none")
colored_bars(colors = ifelse(data_train.temp$readmitted=="NO", "blue", "red"), dend = dend, rowLabels = "readmitted")

Although readmission status cannot be inferred directly from the two clusters, we observe aggregation of readmitted/non-readmitted cases at the deeper branches. This tree-like structure is consistent with the improved performance of the tree-based models, in particular random forest.
In this study, we leveraged a hospital dataset containing demographic, clinical, and diagnosis-related features, as well as medication information, to identify diabetic patients who have a higher likelihood of being readmitted within 30 days. Of the three machine learning algorithms we analyzed (Random Forest, Decision Tree and Logistic Regression), the best model for this purpose was Random Forest, which yielded an accuracy of 93.09%. Key factors important for readmission include the number of diagnoses, the number of lab procedures, the number of medications, and the diagnosis group, whereas many of the medicines in the dataset, such as acarbose, acetohexamide, chlorpropamide, citoglipton, and examide, are not informative for readmission prediction.
Based on the results of this study, hospitals are advised to work not only on inpatient treatment but also on continued care after discharge. For example, more care should be taken with males aged 50-60, and patients with urgent admission are more likely to be readmitted. The finding that patients discharged/transferred to another type of inpatient care institution can have a higher risk of readmission suggests that hospitals may want to develop strategies to work closely with other health care institutions. Our results can also provide medical guidance. For example, patients with no or steady insulin treatment have a higher readmission risk than those with an increased insulin dosage, which suggests that hospitals should adjust insulin usage in a timely manner to prevent readmission. Another interesting observation concerns the timing of the diabetes diagnosis: diabetes appearing as a later (secondary) diagnosis may carry a higher readmission risk than diabetes as the primary diagnosis, so it would be beneficial for doctors and patients to treat the diagnosis seriously at an early stage.
It is not surprising that random forest, which integrates multiple tree-based models, performs best among all the methods, given the large feature space with mixed categorical and numerical types. After the feature selection process, we see that multiple variables contribute to the readmission rate. The clustering results also support this, showing that the data is not separable in fewer than three dimensions. In particular, hierarchical clustering illustrates the aggregation of readmitted cases at deeper branches, an architecture similar to that of tree-based models, consistent with their good performance. In the future, we could try to improve model performance through parameter tuning, such as the mtry parameter in random forest, or by applying boosting algorithms for comparison.
Mapping between admission/discharge id and description
| admission_type_id | description |
|---|---|
| 1 | Emergency |
| 2 | Urgent |
| 3 | Elective |
| 4 | Newborn |
| 5 | Not Available |
| 6 | NULL |
| 7 | Trauma Center |
| 8 | Not Mapped |
| | discharge_disposition_id | description |
|---|---|---|
| 11 | 1 | Discharged to home |
| 12 | 2 | Discharged/transferred to another short term hospital |
| 13 | 3 | Discharged/transferred to SNF |
| 14 | 4 | Discharged/transferred to ICF |
| 15 | 5 | Discharged/transferred to another type of inpatient care institution |
| 16 | 6 | Discharged/transferred to home with home health service |
| 17 | 7 | Left AMA |
| 18 | 8 | Discharged/transferred to home under care of Home IV provider |
| 19 | 9 | Admitted as an inpatient to this hospital |
| 20 | 10 | Neonate discharged to another hospital for neonatal aftercare |
| 21 | 11 | Expired |
| 22 | 12 | Still patient or expected to return for outpatient services |
| 23 | 13 | Hospice / home |
| 24 | 14 | Hospice / medical facility |
| 25 | 15 | Discharged/transferred within this institution to Medicare approved swing bed |
| 26 | 16 | Discharged/transferred/referred another institution for outpatient services |
| 27 | 17 | Discharged/transferred/referred to this institution for outpatient services |
| 28 | 18 | NULL |
| 29 | 19 | Expired at home. Medicaid only, hospice. |
| 30 | 20 | Expired in a medical facility. Medicaid only, hospice. |
| 31 | 21 | Expired, place unknown. Medicaid only, hospice. |
| 32 | 22 | Discharged/transferred to another rehab fac including rehab units of a hospital . |
| 33 | 23 | Discharged/transferred to a long term care hospital. |
| 34 | 24 | Discharged/transferred to a nursing facility certified under Medicaid but not certified under Medicare. |
| 35 | 25 | Not Mapped |
| 36 | 26 | Unknown/Invalid |
| 37 | 30 | Discharged/transferred to another Type of Health Care Institution not Defined Elsewhere |
| 38 | 27 | Discharged/transferred to a federal health care facility. |
| 39 | 28 | Discharged/transferred/referred to a psychiatric hospital of psychiatric distinct part unit of a hospital |
| 40 | 29 | Discharged/transferred to a Critical Access Hospital (CAH). |
discharge_mapping <- as.data.frame(data_mapping[11:40,]) # assumed, mirroring the source_mapping construction below
colnames(discharge_mapping) <- c("discharge_disposition_id","description")
source_mapping <- as.data.frame(data_mapping[43:67,])
colnames(source_mapping) <- c("admission_source_id","description")
knitr::kable(source_mapping)

| | admission_source_id | description |
|---|---|---|
| 43 | 1 | Physician Referral |
| 44 | 2 | Clinic Referral |
| 45 | 3 | HMO Referral |
| 46 | 4 | Transfer from a hospital |
| 47 | 5 | Transfer from a Skilled Nursing Facility (SNF) |
| 48 | 6 | Transfer from another health care facility |
| 49 | 7 | Emergency Room |
| 50 | 8 | Court/Law Enforcement |
| 51 | 9 | Not Available |
| 52 | 10 | Transfer from critial access hospital |
| 53 | 11 | Normal Delivery |
| 54 | 12 | Premature Delivery |
| 55 | 13 | Sick Baby |
| 56 | 14 | Extramural Birth |
| 57 | 15 | Not Available |
| 58 | 17 | NULL |
| 59 | 18 | Transfer From Another Home Health Agency |
| 60 | 19 | Readmission to Same Home Health Agency |
| 61 | 20 | Not Mapped |
| 62 | 21 | Unknown/Invalid |
| 63 | 22 | Transfer from hospital inpt/same fac reslt in a sep claim |
| 64 | 23 | Born inside this hospital |
| 65 | 24 | Born outside this hospital |
| 66 | 25 | Transfer from Ambulatory Surgery Center |
| 67 | 26 | Transfer from Hospice |
Lum, Hillary D, Stephanie A Studenski, Howard B Degenholtz, and Susan E Hardy. 2012. “Early Hospital Readmission Is a Predictor of One-Year Mortality in Community-Dwelling Older Medicare Beneficiaries.” Journal of General Internal Medicine 27 (11): 1467–74.
Rubin, Daniel J. 2015. “Hospital Readmission of Patients with Diabetes.” Current Diabetes Reports 15 (4): 17.
Strack, Beata, Jonathan P DeShazo, Chris Gennings, Juan L Olmo, Sebastian Ventura, Krzysztof J Cios, and John N Clore. 2014. “Impact of Hba1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records.” BioMed Research International 2014.
Zimmet, Paul Z, Dianna J Magliano, William H Herman, and Jonathan E Shaw. 2014. “Diabetes: A 21st Century Challenge.” The Lancet Diabetes & Endocrinology 2 (1): 56–64. https://doi.org/https://doi.org/10.1016/S2213-8587(13)70112-8.