Title: “Diabetes Readmission Prediction -Data Analysis”

The dataset used is from UCI Machine learning website and represents 10 years (1999-2008) of clinical care at 130 U.S. hospitals. Hospital readmission for diabetic patients is a major concern in the United States. Over $250 million dollars was spent on treatment of readmitted diabetic inpatients in 2011. This disease is chronic and does not have any specific cure. Hospital readmissions are expensive as hospitals face penalties if their readmission rate is higher than expected and reflects the inadequacies in health care system. Most hospitals can agree that their main goals are centered around improving outcomes, creating more satisfied patients, and better value. For these reasons, it is important for the hospitals to improve focus on reducing readmission rates. This study attempts to identify the key factors that influence readmission for diabetes and to predict the probability of patient readmission.

The results obtained by random forest and decision tree on the actual imbalanced data set are the same as previous research on this data set, as evidence towards the predictors available are not encompassing the true cause of readmission. However, I used synthetic data balancing techniques and improved model accuracy


DIABETES READMISION PREDICTION:DATA ANALYSIS

Data Source- UCI Machine learning website

DATA INGESTION

#binData <- getBinaryURL("https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip", ssl.verifypeer=FALSE)
#head(binData)
#conObj <- file("dataset_diabetes.zip", open = "wb") # writing binary file
#writeBin(binData, conObj) #transfer binary data 
#close(conObj) # close connection
#files <- unzip("dataset_diabetes.zip") 
#diabetes <- read.csv(files[1], stringsAsFactors = FALSE) # file[1] = diabetic_data.csv
#names(diabetes)
#head(diabetes)

data cleansing

#1: remove non useful columns-
#import data if data ingenstion does not work- we have shortned the title of some variables like discharge_disposition_id as discharge_id so that  decision trees are better
diabetes <- read.csv("C:/Users/Amit/Dropbox/Dataset/diabetic_data.csv")
diabetes1 <- subset(diabetes,select=-c(encounter_id, patient_nbr, examide,citoglipton,weight, payer_code, medical_spec)) 
diabetes2 <- diabetes1[diabetes1$race != "?",] # No. of observations drops by 2273
diabetes2 <- diabetes2[diabetes2$diag_1 != "?",] # No of observations drops by 21
diabetes2 <- diabetes2[diabetes2$diag_2 != "?",] # No of observations drops by 358
diabetes2 <- diabetes2[diabetes2$diag_3 != "?",] # No of observations drops by 1453
removed non useful columns
examide and citoglipton were not given even to a single patient hence removed and weight is missing for 97% of observations and payer code for 47% of observations

binary representation of readmitted within 30 days

“0” represnts No readmisson or readmission after 30 days whereas 1 represents readmission within 30 days

diabetes2$readmittedbin <- ifelse(diabetes2$readmitted == "<30",1,0) 

To make factor of levels

diabetes3 <- cbind(diabetes2[c(7:13,17)], lapply(diabetes2[c(1:6,14:16,18:44)],factor))
head(diabetes3)
##   time_in_hos num_lab_proc num_proc num_medic num_out num_emerg num_inpat
## 2           3           59        0        18       0         0         0
## 3           2           11        5        13       2         0         1
## 4           2           44        1        16       0         0         0
## 5           1           51        0         8       0         0         0
## 6           3           31        6        16       0         0         0
## 7           4           70        1        21       0         0         0
##   num_diag            race gender     age adm_type_id discharge_id
## 2        9       Caucasian Female [10-20)           1            1
## 3        6 AfricanAmerican Female [20-30)           1            1
## 4        7       Caucasian   Male [30-40)           1            1
## 5        5       Caucasian   Male [40-50)           1            1
## 6        9       Caucasian   Male [50-60)           2            1
## 7        7       Caucasian   Male [60-70)           3            1
##   adm_source_id diag_1 diag_2 diag_3 max_glu_ser A1Cresult metformin
## 2             7    276 250.01    255        None      None        No
## 3             7    648    250    V27        None      None        No
## 4             7      8 250.43    403        None      None        No
## 5             7    197    157    250        None      None        No
## 6             2    414    411    250        None      None        No
## 7             2    414    411    V45        None      None    Steady
##   repaglinide nateglinide chlorpropamide glimepiride acetohexamide
## 2          No          No             No          No            No
## 3          No          No             No          No            No
## 4          No          No             No          No            No
## 5          No          No             No          No            No
## 6          No          No             No          No            No
## 7          No          No             No      Steady            No
##   glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose
## 2        No        No          No           No            No       No
## 3    Steady        No          No           No            No       No
## 4        No        No          No           No            No       No
## 5    Steady        No          No           No            No       No
## 6        No        No          No           No            No       No
## 7        No        No          No           No            No       No
##   miglitol troglitazone tolazamide insulin glyburide.metformin
## 2       No           No         No      Up                  No
## 3       No           No         No      No                  No
## 4       No           No         No      Up                  No
## 5       No           No         No  Steady                  No
## 6       No           No         No  Steady                  No
## 7       No           No         No  Steady                  No
##   glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone
## 2                  No                       No                      No
## 3                  No                       No                      No
## 4                  No                       No                      No
## 5                  No                       No                      No
## 6                  No                       No                      No
## 7                  No                       No                      No
##   metformin.pioglitazone change diabetesMed readmitted readmittedbin
## 2                     No     Ch         Yes        >30             0
## 3                     No     No         Yes         NO             0
## 4                     No     Ch         Yes         NO             0
## 5                     No     Ch         Yes         NO             0
## 6                     No     No         Yes        >30             0
## 7                     No     Ch         Yes         NO             0

table to get frquency of different levels of readmission

table(diabetes3$readmitted)
## 
##   <30   >30    NO 
## 11066 34649 52338
prop.table(table(diabetes3$readmitted))
## 
##       <30       >30        NO 
## 0.1128573 0.3533701 0.5337726
table(diabetes3$readmittedbin)
## 
##     0     1 
## 86987 11066
prop.table(table(diabetes3$readmittedbin))
## 
##         0         1 
## 0.8871427 0.1128573
racewise <- table(diabetes3$readmittedbin,diabetes3$race)
racewise
##    
##     AfricanAmerican Asian Caucasian Hispanic Other
##   0           16741   560     66567     1777  1342
##   1            2140    65      8512      207   142
genderwise <- table(diabetes3$readmittedbin,diabetes3$gender)
genderwise
##    
##     Female  Male Unknown/Invalid
##   0  46831 40155               1
##   1   6002  5064               0
agewise <- table(diabetes3$readmittedbin,diabetes3$age)
agewise
##    
##     [0-10) [10-20) [20-30) [30-40) [40-50) [50-60) [60-70) [70-80) [80-90)
##   0     64     435    1265    3140    8265   15059   19349   22302   14690
##   1      1      31     213     408    1000    1638    2460    3004    2012
##    
##     [90-100)
##   0     2418
##   1      299
timinhoswise <- table(diabetes3$readmittedbin,diabetes3$time_in_hos)
timinhoswise
##    
##         1     2     3     4     5     6     7     8     9    10    11
##   0 12363 14791 15201 11845  8519  6396  4961  3661  2524  1954  1618
##   1  1127  1650  1848  1589  1180   924   733   615   404   333   191
##    
##        12    13    14
##   0  1231  1038   885
##   1   193   147   132
Number_patients <- table(diabetes3$readmitted)
plot(Number_patients,col ="lightblue", xlab = " Readmission Days ", main= " Frequency of Readmission", lwd =20,pch=18)

Number_patients_bin <- table(diabetes3$readmittedbin)
plot(Number_patients_bin,col ="lightblue", xlab = " Readmission Days ", main= " Frequency of Readmission", lwd =20,pch=18)

# split the data into training and testing data sets #### We are using a sample training data as our original sample has 100,000 observations. This large data takes a long time to finish and we face issues with Rmarkdown. Hence we are using a smaller training data (20,000 observations)

set.seed(111)
inTrain <- createDataPartition(diabetes3$readmittedbin, p=.2, list=FALSE)
objTrain <-diabetes3[inTrain,]
objTest <- diabetes3[-inTrain,]
table(objTrain$readmittedbin)
## 
##     0     1 
## 17398  2214
prop.table(table(objTrain$readmittedbin))
## 
##         0         1 
## 0.8871099 0.1128901
table(objTest$readmittedbin)
## 
##     0     1 
## 69589  8852
prop.table(table(objTest$readmittedbin))
## 
##         0         1 
## 0.8871509 0.1128491

As we see, this data set contains only 11% of cases of readmission within 30 days and 89% of cases with readmission after 30 days or No readmission. This is a severely imbalanced data set.

Let us see how badly can this affect our prediction accuracy ? Let’s build a model on this data. I’ll be using decision tree algorithm for modeling purpose. #Prediction with three levels of response variable

cfit <- rpart(readmitted ~ time_in_hos + num_lab_proc + num_proc + num_medic + num_out + num_emerg + num_inpat + race + age + adm_type_id + discharge_id + adm_source_id + num_diag + max_glu_ser + A1Cresult + metformin + insulin, data = objTrain, method="class", minsplit = 20, minbucket = 5, cp = 0.001)

head(predict(cfit))
##           <30       >30        NO
## 7  0.08252352 0.3054913 0.6119852
## 10 0.08252352 0.3054913 0.6119852
## 19 0.08252352 0.3054913 0.6119852
## 33 0.08252352 0.3054913 0.6119852
## 37 0.08252352 0.3054913 0.6119852
## 39 0.08252352 0.3054913 0.6119852
par(mar=c(1,1,0.25,1))
plot(cfit, branch = 0.4,uniform = TRUE, compress = TRUE)
text(cfit, pretty = 0)

rpart.predict <- predict(cfit, newdata = objTrain, type="class")
tail(rpart.predict)
## 101736 101744 101746 101748 101753 101757 
##    >30     NO     NO     NO     NO    >30 
## Levels: <30 >30 NO
#accuracy.meas(objTest$readmitted, rpart.predict[,2])
cf <-confusionMatrix(rpart.predict, objTrain$readmitted)
cf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  <30  >30   NO
##        <30  131   85   28
##        >30  771 2328 1470
##        NO  1312 4513 8974
## 
## Overall Statistics
##                                          
##                Accuracy : 0.583          
##                  95% CI : (0.576, 0.5899)
##     No Information Rate : 0.534          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.1877         
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: <30 Class: >30 Class: NO
## Sensitivity             0.05917     0.3361    0.8570
## Specificity             0.99351     0.8233    0.3627
## Pos Pred Value          0.53689     0.5095    0.6064
## Neg Pred Value          0.89245     0.6943    0.6888
## Prevalence              0.11289     0.3532    0.5340
## Detection Rate          0.00668     0.1187    0.4576
## Detection Prevalence    0.01244     0.2330    0.7546
## Balanced Accuracy       0.52634     0.5797    0.6098
#Mean error rate
mean.error.rate.rpart <- 1- cf$overall[1]
mean.error.rate.rpart
##  Accuracy 
## 0.4170406
par(mar=c(3,3,3,3))
plotcp(cfit,lty = 3, col = 1)
printcp(cfit)
## 
## Classification tree:
## rpart(formula = readmitted ~ time_in_hos + num_lab_proc + num_proc + 
##     num_medic + num_out + num_emerg + num_inpat + race + age + 
##     adm_type_id + discharge_id + adm_source_id + num_diag + max_glu_ser + 
##     A1Cresult + metformin + insulin, data = objTrain, method = "class", 
##     minsplit = 20, minbucket = 5, cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] adm_source_id adm_type_id   age           discharge_id  insulin      
##  [6] num_diag      num_inpat     num_lab_proc  num_medic     num_proc     
## 
## Root node error: 9140/19612 = 0.46604
## 
## n= 19612 
## 
##          CP nsplit rel error  xerror      xstd
## 1 0.0468271      0   1.00000 1.00000 0.0076433
## 2 0.0203501      1   0.95317 0.95317 0.0076132
## 3 0.0051422      2   0.93282 0.93315 0.0075957
## 4 0.0014442      5   0.91740 0.92987 0.0075926
## 5 0.0014223     12   0.90613 0.92987 0.0075926
## 6 0.0012035     13   0.90470 0.92921 0.0075920
## 7 0.0011488     15   0.90230 0.92834 0.0075912
## 8 0.0010284     17   0.90000 0.92746 0.0075903
## 9 0.0010000     22   0.89486 0.92681 0.0075897
# The plot shows 3 branches as optimum. The CP value for 3 branches would be around 0.0014. Pruning and replotting the tree:

# Cross-validation of the tree
cfit.tree <- tree(readmitted ~ time_in_hos + num_lab_proc + num_proc + num_medic + num_out + num_emerg + num_inpat + race + age + adm_type_id + discharge_id + adm_source_id + num_diag + max_glu_ser + A1Cresult + metformin + insulin, data = objTrain, method="class")
cv.cfit.tree <- cv.tree(cfit.tree, FUN = prune.misclass)
cv.cfit.tree
## $size
## [1] 2 1
## 
## $dev
## [1] 8712 9140
## 
## $k
## [1] -Inf  428
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
prune.cfit.tree <- prune.misclass(cfit.tree, best = 4)
## Warning in prune.tree(tree = cfit.tree, best = 4, method = "misclass"):
## best is bigger than tree size
#plot(prune.cfit.tree)
text(prune.cfit.tree, pretty = 0)

#After pruning
cfit2 = prune(cfit, cp = 0.0014)

par(mar=c(1,1,0.25,1))
plot(cfit2, branch = 0.4,uniform = TRUE, compress = TRUE)
text(cfit2)

#Using the pruned tree to predict and pulling up the mean error rate and confusion matrix

#Prediction on test set
rpart.prune.predict <- predict(cfit2, newdata = objTest,type = "class")

cf.prune <-confusionMatrix(rpart.prune.predict,objTest$readmitted)

#Mean error rate
mean.error.rate.rpart.prune <- 1- cf.prune$overall[1]
mean.error.rate.rpart.prune
##  Accuracy 
## 0.4312923
cf.prune$table
##           Reference
## Prediction   <30   >30    NO
##        <30     8    37    42
##        >30  3628  9735  6957
##        NO   5216 17951 34867

Decision Tree with two class response variable

cfit_bin <- rpart(readmittedbin ~ time_in_hos + num_lab_proc + num_proc + num_medic + num_out + num_emerg + num_inpat + race + age + adm_type_id + discharge_id + adm_source_id + num_diag + max_glu_ser + A1Cresult + metformin + insulin, data = objTrain, method="class", minsplit = 1, minbucket = 1, cp = 0.001)

par(mar=c(2,2,0.25,1))
plot(cfit_bin, branch = 0.4,uniform = TRUE, compress = TRUE)
text(cfit_bin, pretty = 0)

#How to read plotcp - http://www.wekaleamstudios.co.uk/posts/classification-trees-using-the-rpart-function/#m3mLNpeke0I
rpart.predict_bin <- predict(cfit_bin, newdata = objTrain,type="prob")

View(objTrain)
head(rpart.predict_bin)
##            0          1
## 7  0.9122884 0.08771157
## 10 0.9122884 0.08771157
## 19 0.9122884 0.08771157
## 33 0.9122884 0.08771157
## 37 0.9122884 0.08771157
## 39 0.9122884 0.08771157
View(rpart.predict_bin)
accuracy.meas(objTrain$readmittedbin, rpart.predict_bin[,2])
## 
## Call: 
## accuracy.meas(response = objTrain$readmittedbin, predicted = rpart.predict_bin[, 
##     2])
## 
## Examples are labelled as positive when predicted is greater than 0.5 
## 
## precision: 0.801
## recall: 0.062
## F: 0.057
roc.curve(objTrain$readmittedbin, rpart.predict_bin[,2], plotit = T)

## Area under the curve (AUC): 0.610
par =TRUE

AUC = 0.61 is a terribly low score. Therefore, it is necessary to balanced data before applying a machine learning algorithm. In this case, the algorithm gets biased toward the majority class and fails to map minority class. I have used the sampling techniques and try to improve this prediction accuracy.

str(rpart.predict_bin)
##  num [1:19612, 1:2] 0.912 0.912 0.912 0.912 0.912 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:19612] "7" "10" "19" "33" ...
##   ..$ : chr [1:2] "0" "1"
str(objTrain$readmittedbin)
##  Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
#cf_bin <-confusionMatrix(rpart.predict_bin, objTrain$readmittedbin)
#cf_bin
#Mean error rate
#mean.error.rate.rpart_bin <- 1- cf_bin$overall[1]
#mean.error.rate.rpart_bin
#par(mar=c(3,3,3,3))
#plotcp(cfit_bin,lty = 3, col = 1)
#printcp(cfit_bin)

The plot shows 34 branches as optimum. The CP value for 2 branches would be around 0.001. Pruning and replotting the tree:

#After pruning
cfit2_bin = prune(cfit_bin, cp = 0.0001)

par(mar=c(.5,.5,.5,.5))
plot(cfit2_bin, branch = 0.4,uniform = TRUE, compress = TRUE)
text(cfit2_bin, pretty=0)

head(predict(cfit2_bin))
##            0          1
## 7  0.9122884 0.08771157
## 10 0.9122884 0.08771157
## 19 0.9122884 0.08771157
## 33 0.9122884 0.08771157
## 37 0.9122884 0.08771157
## 39 0.9122884 0.08771157
#Prediction on training  set
rpart.prune.predict2_bin <- predict(cfit2_bin, newdata = objTrain,type = "class")

cf.prune_bin <-confusionMatrix(rpart.prune.predict2_bin,objTrain$readmittedbin)
cf.prune_bin
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 17364  2077
##          1    34   137
##                                           
##                Accuracy : 0.8924          
##                  95% CI : (0.8879, 0.8967)
##     No Information Rate : 0.8871          
##     P-Value [Acc > NIR] : 0.01001         
##                                           
##                   Kappa : 0.1003          
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.99805         
##             Specificity : 0.06188         
##          Pos Pred Value : 0.89316         
##          Neg Pred Value : 0.80117         
##              Prevalence : 0.88711         
##          Detection Rate : 0.88538         
##    Detection Prevalence : 0.99128         
##       Balanced Accuracy : 0.52996         
##                                           
##        'Positive' Class : 0               
## 
#Mean error rate
mean.error.rate.rpart.prune2 <- 1- cf.prune_bin$overall[1]
mean.error.rate.rpart.prune2
##  Accuracy 
## 0.1076382

over sampling

table(objTrain$readmittedbin)
## 
##     0     1 
## 17398  2214
data_balanced_over <- ovun.sample(readmittedbin ~ time_in_hos + num_lab_proc + num_proc + num_medic + num_out + num_emerg + num_inpat + race + age + adm_type_id + discharge_id + adm_source_id + num_diag + max_glu_ser + A1Cresult + metformin + insulin, data = objTrain, method = "over", N = 34794)$data
table(data_balanced_over$readmittedbin)
## 
##     0     1 
## 17398 17396

under sampling

data_balanced_under <- ovun.sample(readmittedbin ~ time_in_hos + num_lab_proc + num_proc + num_medic + num_out + num_emerg + num_inpat + race + age + adm_type_id + discharge_id + adm_source_id + num_diag + max_glu_ser + A1Cresult + metformin + insulin, data = objTrain, method = "under", N = 4428, seed=1)$data
table(data_balanced_under$readmittedbin)
## 
##    0    1 
## 2214 2214

Balanced sampling

Let us do both undersampling and oversampling on this imbalanced data. This can be achieved using method = “both”. In this case, the minority class is oversampled with replacement and majority class is undersampled without replacement.

data_balanced_both <- ovun.sample(readmittedbin ~ time_in_hos + num_lab_proc + num_proc + num_medic + num_out + num_emerg + num_inpat + race + age + adm_type_id + discharge_id + adm_source_id + num_diag + max_glu_ser + A1Cresult + metformin + insulin, data = objTrain, method = "both", N = 19610, seed=1)$data
table(data_balanced_both$readmittedbin)
## 
##    0    1 
## 9847 9763

ROSE SYTHETIC DATA BALANCING

ROSE helps us to generate data synthetically as well. The data generated using ROSE is considered to provide better estimate of original data.

data.rose <- ROSE(readmittedbin ~ time_in_hos + num_lab_proc + num_proc + num_medic + num_out + num_emerg + num_inpat + race + age + adm_type_id + discharge_id + adm_source_id + num_diag + max_glu_ser + A1Cresult + metformin + insulin, data = objTrain,seed=1)$data
table(data.rose$readmittedbin)
## 
##    0    1 
## 9848 9764

build decision tree models-Rose

cfit.rose <- rpart(readmittedbin ~ time_in_hos + num_lab_proc + num_proc + num_medic + num_out + num_emerg + num_inpat + race + age + adm_type_id + discharge_id + adm_source_id + num_diag + max_glu_ser + A1Cresult + metformin + insulin, data = data.rose)
head(data.rose)
##   time_in_hos num_lab_proc   num_proc num_medic    num_out    num_emerg
## 1    3.875275    15.260395  0.5041033  8.912528  1.1967522 -0.003483168
## 2   11.566065    59.596280 -0.1071821 19.516326 -0.2592942  0.591731331
## 3    3.825353     1.079877 -0.9974357  3.793863 -0.1146781  0.946548770
## 4    4.756819    38.079734  1.3753153 28.513673  1.0952137  0.199230920
## 5    1.423249    40.854999  1.4328905  7.907884  0.6224822  0.183480966
## 6    2.379349    23.891914  1.8367864 16.530931 -0.6984490  0.063265616
##    num_inpat num_diag            race     age adm_type_id discharge_id
## 1  0.8878647 7.493924       Caucasian [40-50)           1            1
## 2  3.1756714 6.979433 AfricanAmerican [70-80)           1           18
## 3 -0.4678477 4.832296       Caucasian [60-70)           1            1
## 4 -0.1664868 9.073524 AfricanAmerican [40-50)           3            2
## 5  1.0278402 9.767368 AfricanAmerican [70-80)           3            6
## 6  0.2439110 4.901822 AfricanAmerican [70-80)           2            3
##   adm_source_id max_glu_ser A1Cresult metformin insulin readmittedbin
## 1             7        None      None        No  Steady             0
## 2             7        None      None        No      No             0
## 3             7        None      None        No      No             0
## 4             1        None      None        No      No             0
## 5             1        None      None        No  Steady             0
## 6             1        None      None        No  Steady             0
rpart.predict.rose <- predict(cfit.rose, newdata = data.rose)
par(2,2,2,2)
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
#roc.curve(data.rose$readmittedbin, rpart.predict.rose[,2], col= redblue(10000), add =TRUE)
par =TRUE
#Prediction on rose set
rpart.prune.predict3_bin <- predict(cfit.rose, newdata = data.rose,type = "class")

cf.prune_bin <-confusionMatrix(rpart.prune.predict3_bin,objTrain$readmittedbin)
cf.prune_bin
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7733 1017
##          1 9665 1197
##                                           
##                Accuracy : 0.4553          
##                  95% CI : (0.4483, 0.4623)
##     No Information Rate : 0.8871          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0055         
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.4445          
##             Specificity : 0.5407          
##          Pos Pred Value : 0.8838          
##          Neg Pred Value : 0.1102          
##              Prevalence : 0.8871          
##          Detection Rate : 0.3943          
##    Detection Prevalence : 0.4462          
##       Balanced Accuracy : 0.4926          
##                                           
##        'Positive' Class : 0               
## 
#Mean error rate
mean.error.rate.rpart.prune2 <- 1- cf.prune_bin$overall[1]
mean.error.rate.rpart.prune2
##  Accuracy 
## 0.5446665

decision tree models-over sampling

cfit.over <- rpart(readmittedbin ~ time_in_hos + num_lab_proc + num_proc + num_medic + num_out + num_emerg + num_inpat + race + age + adm_type_id + discharge_id + adm_source_id + num_diag + max_glu_ser + A1Cresult + metformin + insulin,  data = data_balanced_over)
rpart.predict.over <- predict(cfit.over, newdata = data_balanced_over)

#plot(, colorize = TRUE)
#plot(perf2, add = TRUE, colorize = TRUE)
#roc.curve(data_balanced_over$readmittedbin, rpart.predict.over[,2], add =TRUE, col = greenred(2) )
#str(rpart.predict.over)
#confusionMatrix(data_balanced_over$readmittedbin, rpart.predict.over[,2])

decision tree model-undersampling

cfit.under <- rpart(readmittedbin ~ time_in_hos + num_lab_proc + num_proc + num_medic + num_out + num_emerg + num_inpat + race + age + adm_type_id + discharge_id + adm_source_id + num_diag + max_glu_ser + A1Cresult + metformin + insulin, data = data_balanced_under)
rpart.predict.under <- predict(cfit.over, newdata = data_balanced_under)
par(new=TRUE)
## Warning in par(new = TRUE): calling par(new=TRUE) with no plot
#roc.curve(data_balanced_under$readmittedbin, rpart.predict.under[,2], add =TRUE, col = bluered(2))

decision tree model-both under and over sampling

cfit.both <- rpart(readmittedbin ~ time_in_hos + num_lab_proc + num_proc + num_medic + num_out + num_emerg + num_inpat + race + age + adm_type_id + discharge_id + adm_source_id + num_diag + max_glu_ser + A1Cresult + metformin + insulin, data = data_balanced_both)
rpart.predict.both <- predict(cfit.both, newdata = data_balanced_both)
#roc.curve(data_balanced_both$readmittedbin, rpart.predict.both[,2], add =TRUE, col = redblue(5))

# ROC curve comparison
img1 <-  rasterGrob(as.raster(readPNG("C:/Users/Amit/Desktop/DA6813/Project/Final/ROC curve comparison.png")), interpolate = FALSE)
#img1
grid.arrange(img1,ncol = 1)