0. Abstract

This paper presents a big data analysis of a clinical dataset containing early readmission information for patients undergoing diabetes treatment. The study involves prediction and causal inference analyses on patients’ readmission tendencies. Nine machine learning models were employed for prediction, with logistic lasso regression demonstrating the best performance, achieving the highest accuracy and sensitivity simultaneously.
Additionally, causal inference analysis revealed that patients with the highest A1C test results, indicating elevated blood sugar levels associated with diabetes, exhibited a significantly lower readmission rate compared to other patients. The findings suggest that the hospital tends to provide more effective care for patients with severe diabetes levels.

1. Introduction

Failure to provide proper diabetes care not only increases the managing costs for the hospitals (as the patients are readmitted) but also impacts the mortality of the patients, who may face complications associated with diabetes. Thus, it is important for hospitals to provide proper medical care to the patients minimize the patients’ readmission.

2. Dataset and Objective

The dataset we used was obtained from : https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008. The dataset represents ten years (1999-2008) of clinical care at 130 US hospitals.
Our objective with this dataset is to build a model that could predict patient’s early readmission and analyze whether a certain factor affects the early readmission rate or not.

3. Method

For prediction analysis, we will be using logistic regression, logistic lasso, single tree, bagging decision tree, random forest, gradient boosting classification tree, hard margin support vector machine, k-nearest neighbors, and neural network. Our goal is to maximize the accuracy and sensitivity, while minimizing the false negative rate.
For causal inference, we will utilize logistic regression to define how each variable affects the odds ratio of readmission, and t.test using propensity score matching method to define whether AlC blood sugar test affect the readmission rate or not.

4. Dataset Exploration

Let’s explore how the raw data looks like:

raw_data <- read.csv('diabetic_data.csv')
head(raw_data)

##   encounter_id patient_nbr            race gender     age weight
## 1      2278392     8222157       Caucasian Female  [0-10)      ?
## 2       149190    55629189       Caucasian Female [10-20)      ?
## 3        64410    86047875 AfricanAmerican Female [20-30)      ?
## 4       500364    82442376       Caucasian   Male [30-40)      ?
## 5        16680    42519267       Caucasian   Male [40-50)      ?
## 6        35754    82637451       Caucasian   Male [50-60)      ?
##   admission_type_id discharge_disposition_id admission_source_id
## 1                 6                       25                   1
## 2                 1                        1                   7
## 3                 1                        1                   7
## 4                 1                        1                   7
## 5                 1                        1                   7
## 6                 2                        1                   2
##   time_in_hospital payer_code        medical_specialty num_lab_procedures
## 1                1          ? Pediatrics-Endocrinology                 41
## 2                3          ?                        ?                 59
## 3                2          ?                        ?                 11
## 4                2          ?                        ?                 44
## 5                1          ?                        ?                 51
## 6                3          ?                        ?                 31
##   num_procedures num_medications number_outpatient number_emergency
## 1              0               1                 0                0
## 2              0              18                 0                0
## 3              5              13                 2                0
## 4              1              16                 0                0
## 5              0               8                 0                0
## 6              6              16                 0                0
##   number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum
## 1                0 250.83      ?      ?                1          None
## 2                0    276 250.01    255                9          None
## 3                1    648    250    V27                6          None
## 4                0      8 250.43    403                7          None
## 5                0    197    157    250                5          None
## 6                0    414    411    250                9          None
##   A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride
## 1      None        No          No          No             No          No
## 2      None        No          No          No             No          No
## 3      None        No          No          No             No          No
## 4      None        No          No          No             No          No
## 5      None        No          No          No             No          No
## 6      None        No          No          No             No          No
##   acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone
## 1            No        No        No          No           No            No
## 2            No        No        No          No           No            No
## 3            No    Steady        No          No           No            No
## 4            No        No        No          No           No            No
## 5            No    Steady        No          No           No            No
## 6            No        No        No          No           No            No
##   acarbose miglitol troglitazone tolazamide examide citoglipton insulin
## 1       No       No           No         No      No          No      No
## 2       No       No           No         No      No          No      Up
## 3       No       No           No         No      No          No      No
## 4       No       No           No         No      No          No      Up
## 5       No       No           No         No      No          No  Steady
## 6       No       No           No         No      No          No  Steady
##   glyburide.metformin glipizide.metformin glimepiride.pioglitazone
## 1                  No                  No                       No
## 2                  No                  No                       No
## 3                  No                  No                       No
## 4                  No                  No                       No
## 5                  No                  No                       No
## 6                  No                  No                       No
##   metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
## 1                      No                     No     No          No         NO
## 2                      No                     No     Ch         Yes        >30
## 3                      No                     No     No         Yes         NO
## 4                      No                     No     Ch         Yes         NO
## 5                      No                     No     Ch         Yes         NO
## 6                      No                     No     No         Yes        >30

Hmisc::describe(raw_data)

## raw_data 
## 
##  50  Variables      101766  Observations
## --------------------------------------------------------------------------------
## encounter_id 
##         n   missing  distinct      Info      Mean       Gmd       .05       .10 
##    101766         0    101766         1 165201646 114704306  27170784  43368426 
##       .25       .50       .75       .90       .95 
##  84961194 152388987 230270888 311346359 378962843 
## 
## lowest :     12522     15738     16680     28236     35754
## highest: 443847548 443847782 443854148 443857166 443867222
## --------------------------------------------------------------------------------
## patient_nbr 
##         n   missing  distinct      Info      Mean       Gmd       .05       .10 
##    101766         0     71518         1  54330401  43738203   1456972   3957116 
##       .25       .50       .75       .90       .95 
##  23413221  45505143  87545950 103287825 111480273 
## 
## lowest :       135       378       729       774       927
## highest: 189351095 189365864 189445127 189481478 189502619
## --------------------------------------------------------------------------------
## race 
##        n  missing distinct 
##   101766        0        6 
##                                                                           
## Value                    ? AfricanAmerican           Asian       Caucasian
## Frequency             2273           19210             641           76099
## Proportion           0.022           0.189           0.006           0.748
##                                           
## Value             Hispanic           Other
## Frequency             2037            1506
## Proportion           0.020           0.015
## --------------------------------------------------------------------------------
## gender 
##        n  missing distinct 
##   101766        0        3 
##                                                           
## Value               Female            Male Unknown/Invalid
## Frequency            54708           47055               3
## Proportion           0.538           0.462           0.000
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct 
##   101766        0       10 
##                                                                          
## Value        [0-10)  [10-20)  [20-30)  [30-40)  [40-50)  [50-60)  [60-70)
## Frequency       161      691     1657     3775     9685    17256    22483
## Proportion    0.002    0.007    0.016    0.037    0.095    0.170    0.221
##                                      
## Value       [70-80)  [80-90) [90-100)
## Frequency     26068    17197     2793
## Proportion    0.256    0.169    0.027
## --------------------------------------------------------------------------------
## weight 
##        n  missing distinct 
##   101766        0       10 
##                                                                       
## Value              ?    [0-25) [100-125) [125-150) [150-175) [175-200)
## Frequency      98569        48       625       145        35        11
## Proportion     0.969     0.000     0.006     0.001     0.000     0.000
##                                                   
## Value        [25-50)   [50-75)  [75-100)      >200
## Frequency         97       897      1336         3
## Proportion     0.001     0.009     0.013     0.000
## --------------------------------------------------------------------------------
## admission_type_id 
##        n  missing distinct     Info     Mean      Gmd 
##   101766        0        8    0.838    2.024    1.393 
##                                                           
## Value          1     2     3     4     5     6     7     8
## Frequency  53990 18480 18869    10  4785  5291    21   320
## Proportion 0.531 0.182 0.185 0.000 0.047 0.052 0.000 0.003
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## discharge_disposition_id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   101766        0       26    0.788    3.716    4.264        1        1 
##      .25      .50      .75      .90      .95 
##        1        1        4        7       18 
## 
## lowest :  1  2  3  4  5, highest: 23 24 25 27 28
## --------------------------------------------------------------------------------
## admission_source_id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   101766        0       17    0.795    5.754    3.905        1        1 
##      .25      .50      .75      .90      .95 
##        1        7        7        7       17 
##                                                                             
## Value          1     2     3     4     5     6     7     8     9    10    11
## Frequency  29565  1104   187  3187   855  2264 57494    16   125     8     2
## Proportion 0.291 0.011 0.002 0.031 0.008 0.022 0.565 0.000 0.001 0.000 0.000
##                                               
## Value         13    14    17    20    22    25
## Frequency      1     2  6781   161    12     2
## Proportion 0.000 0.000 0.067 0.002 0.000 0.000
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## time_in_hospital 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   101766        0       14    0.983    4.396    3.198        1        1 
##      .25      .50      .75      .90      .95 
##        2        4        6        9       11 
##                                                                             
## Value          1     2     3     4     5     6     7     8     9    10    11
## Frequency  14208 17224 17756 13924  9966  7539  5859  4391  3002  2342  1855
## Proportion 0.140 0.169 0.174 0.137 0.098 0.074 0.058 0.043 0.029 0.023 0.018
##                             
## Value         12    13    14
## Frequency   1448  1210  1042
## Proportion 0.014 0.012 0.010
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## payer_code 
##        n  missing distinct 
##   101766        0       18 
##                                                                             
## Value          ?    BC    CH    CM    CP    DM    FR    HM    MC    MD    MP
## Frequency  40256  4655   146  1937  2533   549     1  6274 32439  3532    79
## Proportion 0.396 0.046 0.001 0.019 0.025 0.005 0.000 0.062 0.319 0.035 0.001
##                                                     
## Value         OG    OT    PO    SI    SP    UN    WC
## Frequency   1033    95   592    55  5007  2448   135
## Proportion 0.010 0.001 0.006 0.001 0.049 0.024 0.001
## --------------------------------------------------------------------------------
## medical_specialty 
##        n  missing distinct 
##   101766        0       73 
## 
## lowest : ?                                AllergyandImmunology             Anesthesiology                   Anesthesiology-Pediatric         Cardiology                      
## highest: Surgery-PlasticwithinHeadandNeck Surgery-Thoracic                 Surgery-Vascular                 SurgicalSpecialty                Urology                         
## --------------------------------------------------------------------------------
## num_lab_procedures 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   101766        0      118        1     43.1     22.2        4       14 
##      .25      .50      .75      .90      .95 
##       31       44       57       67       73 
## 
## lowest :   1   2   3   4   5, highest: 120 121 126 129 132
## --------------------------------------------------------------------------------
## num_procedures 
##        n  missing distinct     Info     Mean      Gmd 
##   101766        0        7    0.892     1.34    1.728 
##                                                     
## Value          0     1     2     3     4     5     6
## Frequency  46652 20742 12717  9443  4180  3078  4954
## Proportion 0.458 0.204 0.125 0.093 0.041 0.030 0.049
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## num_medications 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   101766        0       75    0.998    16.02    8.662        6        7 
##      .25      .50      .75      .90      .95 
##       10       15       20       26       31 
## 
## lowest :  1  2  3  4  5, highest: 72 74 75 79 81
## --------------------------------------------------------------------------------
## number_outpatient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   101766        0       39    0.416   0.3694   0.6655        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        1        2 
## 
## lowest :  0  1  2  3  4, highest: 37 38 39 40 42
## --------------------------------------------------------------------------------
## number_emergency 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   101766        0       33    0.299   0.1978   0.3672        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        1        1 
## 
## lowest :  0  1  2  3  4, highest: 46 54 63 64 76
## --------------------------------------------------------------------------------
## number_inpatient 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   101766        0       21    0.699   0.6356   0.9915        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        1        2        3 
## 
## lowest :  0  1  2  3  4, highest: 16 17 18 19 21
## --------------------------------------------------------------------------------
## diag_1 
##        n  missing distinct 
##   101766        0      717 
## 
## lowest : ?   10  11  110 112, highest: V63 V66 V67 V70 V71
## --------------------------------------------------------------------------------
## diag_2 
##        n  missing distinct 
##   101766        0      749 
## 
## lowest : ?   11  110 111 112, highest: V69 V70 V72 V85 V86
## --------------------------------------------------------------------------------
## diag_3 
##        n  missing distinct 
##   101766        0      790 
## 
## lowest : ?   11  110 111 112, highest: V66 V70 V72 V85 V86
## --------------------------------------------------------------------------------
## number_diagnoses 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   101766        0       16     0.88    7.423    2.022        4        5 
##      .25      .50      .75      .90      .95 
##        6        8        9        9        9 
##                                                                             
## Value          1     2     3     4     5     6     7     8     9    10    11
## Frequency    219  1023  2835  5537 11393 10161 10393 10616 49474    17    11
## Proportion 0.002 0.010 0.028 0.054 0.112 0.100 0.102 0.104 0.486 0.000 0.000
##                                         
## Value         12    13    14    15    16
## Frequency      9    16     7    10    45
## Proportion 0.000 0.000 0.000 0.000 0.000
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## max_glu_serum 
##        n  missing distinct 
##   101766        0        4 
##                                   
## Value       >200  >300  None  Norm
## Frequency   1485  1264 96420  2597
## Proportion 0.015 0.012 0.947 0.026
## --------------------------------------------------------------------------------
## A1Cresult 
##        n  missing distinct 
##   101766        0        4 
##                                   
## Value         >7    >8  None  Norm
## Frequency   3812  8216 84748  4990
## Proportion 0.037 0.081 0.833 0.049
## --------------------------------------------------------------------------------
## metformin 
##        n  missing distinct 
##   101766        0        4 
##                                       
## Value        Down     No Steady     Up
## Frequency     575  81778  18346   1067
## Proportion  0.006  0.804  0.180  0.010
## --------------------------------------------------------------------------------
## repaglinide 
##        n  missing distinct 
##   101766        0        4 
##                                       
## Value        Down     No Steady     Up
## Frequency      45 100227   1384    110
## Proportion  0.000  0.985  0.014  0.001
## --------------------------------------------------------------------------------
## nateglinide 
##        n  missing distinct 
##   101766        0        4 
##                                       
## Value        Down     No Steady     Up
## Frequency      11 101063    668     24
## Proportion  0.000  0.993  0.007  0.000
## --------------------------------------------------------------------------------
## chlorpropamide 
##        n  missing distinct 
##   101766        0        4 
##                                       
## Value        Down     No Steady     Up
## Frequency       1 101680     79      6
## Proportion  0.000  0.999  0.001  0.000
## --------------------------------------------------------------------------------
## glimepiride 
##        n  missing distinct 
##   101766        0        4 
##                                       
## Value        Down     No Steady     Up
## Frequency     194  96575   4670    327
## Proportion  0.002  0.949  0.046  0.003
## --------------------------------------------------------------------------------
## acetohexamide 
##        n  missing distinct 
##   101766        0        2 
##                         
## Value          No Steady
## Frequency  101765      1
## Proportion      1      0
## --------------------------------------------------------------------------------
## glipizide 
##        n  missing distinct 
##   101766        0        4 
##                                       
## Value        Down     No Steady     Up
## Frequency     560  89080  11356    770
## Proportion  0.006  0.875  0.112  0.008
## --------------------------------------------------------------------------------
## glyburide 
##        n  missing distinct 
##   101766        0        4 
##                                       
## Value        Down     No Steady     Up
## Frequency     564  91116   9274    812
## Proportion  0.006  0.895  0.091  0.008
## --------------------------------------------------------------------------------
## tolbutamide 
##        n  missing distinct 
##   101766        0        2 
##                         
## Value          No Steady
## Frequency  101743     23
## Proportion      1      0
## --------------------------------------------------------------------------------
## pioglitazone 
##        n  missing distinct 
##   101766        0        4 
##                                       
## Value        Down     No Steady     Up
## Frequency     118  94438   6976    234
## Proportion  0.001  0.928  0.069  0.002
## --------------------------------------------------------------------------------
## rosiglitazone 
##        n  missing distinct 
##   101766        0        4 
##                                       
## Value        Down     No Steady     Up
## Frequency      87  95401   6100    178
## Proportion  0.001  0.937  0.060  0.002
## --------------------------------------------------------------------------------
## acarbose 
##        n  missing distinct 
##   101766        0        4 
##                                       
## Value        Down     No Steady     Up
## Frequency       3 101458    295     10
## Proportion  0.000  0.997  0.003  0.000
## --------------------------------------------------------------------------------
## miglitol 
##        n  missing distinct 
##   101766        0        4 
##                                       
## Value        Down     No Steady     Up
## Frequency       5 101728     31      2
## Proportion      0      1      0      0
## --------------------------------------------------------------------------------
## troglitazone 
##        n  missing distinct 
##   101766        0        2 
##                         
## Value          No Steady
## Frequency  101763      3
## Proportion      1      0
## --------------------------------------------------------------------------------
## tolazamide 
##        n  missing distinct 
##   101766        0        3 
##                                
## Value          No Steady     Up
## Frequency  101727     38      1
## Proportion      1      0      0
## --------------------------------------------------------------------------------
## examide 
##        n  missing distinct    value 
##   101766        0        1       No 
##                  
## Value          No
## Frequency  101766
## Proportion      1
## --------------------------------------------------------------------------------
## citoglipton 
##        n  missing distinct    value 
##   101766        0        1       No 
##                  
## Value          No
## Frequency  101766
## Proportion      1
## --------------------------------------------------------------------------------
## insulin 
##        n  missing distinct 
##   101766        0        4 
##                                       
## Value        Down     No Steady     Up
## Frequency   12218  47383  30849  11316
## Proportion  0.120  0.466  0.303  0.111
## --------------------------------------------------------------------------------
## glyburide.metformin 
##        n  missing distinct 
##   101766        0        4 
##                                       
## Value        Down     No Steady     Up
## Frequency       6 101060    692      8
## Proportion  0.000  0.993  0.007  0.000
## --------------------------------------------------------------------------------
## glipizide.metformin 
##        n  missing distinct 
##   101766        0        2 
##                         
## Value          No Steady
## Frequency  101753     13
## Proportion      1      0
## --------------------------------------------------------------------------------
## glimepiride.pioglitazone 
##        n  missing distinct 
##   101766        0        2 
##                         
## Value          No Steady
## Frequency  101765      1
## Proportion      1      0
## --------------------------------------------------------------------------------
## metformin.rosiglitazone 
##        n  missing distinct 
##   101766        0        2 
##                         
## Value          No Steady
## Frequency  101764      2
## Proportion      1      0
## --------------------------------------------------------------------------------
## metformin.pioglitazone 
##        n  missing distinct 
##   101766        0        2 
##                         
## Value          No Steady
## Frequency  101765      1
## Proportion      1      0
## --------------------------------------------------------------------------------
## change 
##        n  missing distinct 
##   101766        0        2 
##                       
## Value         Ch    No
## Frequency  47011 54755
## Proportion 0.462 0.538
## --------------------------------------------------------------------------------
## diabetesMed 
##        n  missing distinct 
##   101766        0        2 
##                       
## Value         No   Yes
## Frequency  23403 78363
## Proportion  0.23  0.77
## --------------------------------------------------------------------------------
## readmitted 
##        n  missing distinct 
##   101766        0        3 
##                             
## Value        <30   >30    NO
## Frequency  11357 35545 54864
## Proportion 0.112 0.349 0.539
## --------------------------------------------------------------------------------

It seems that the raw data contains values with ‘?’. Let’s see this more precisely.

colSums(raw_data=='?')

##             encounter_id              patient_nbr                     race 
##                        0                        0                     2273 
##                   gender                      age                   weight 
##                        0                        0                    98569 
##        admission_type_id discharge_disposition_id      admission_source_id 
##                        0                        0                        0 
##         time_in_hospital               payer_code        medical_specialty 
##                        0                    40256                    49949 
##       num_lab_procedures           num_procedures          num_medications 
##                        0                        0                        0 
##        number_outpatient         number_emergency         number_inpatient 
##                        0                        0                        0 
##                   diag_1                   diag_2                   diag_3 
##                       21                      358                     1423 
##         number_diagnoses            max_glu_serum                A1Cresult 
##                        0                        0                        0 
##                metformin              repaglinide              nateglinide 
##                        0                        0                        0 
##           chlorpropamide              glimepiride            acetohexamide 
##                        0                        0                        0 
##                glipizide                glyburide              tolbutamide 
##                        0                        0                        0 
##             pioglitazone            rosiglitazone                 acarbose 
##                        0                        0                        0 
##                 miglitol             troglitazone               tolazamide 
##                        0                        0                        0 
##                  examide              citoglipton                  insulin 
##                        0                        0                        0 
##      glyburide.metformin      glipizide.metformin glimepiride.pioglitazone 
##                        0                        0                        0 
##  metformin.rosiglitazone   metformin.pioglitazone                   change 
##                        0                        0                        0 
##              diabetesMed               readmitted 
##                        0                        0

5. Data Cleaning

Since the data has a lot of issues, we will be cleaning the data. Please see the explanation in the chunk for detailed steps.

# 0 Drop unnecessary columns
# Drop patients who died right after the diagnosis
data <- raw_data[!(raw_data$discharge_disposition_id %in% c(13,14,18,19,20,21,25,26)), ]

#get rid of duplicated values
duplicates <- duplicated(data$patient_nbr)
data <- data[!duplicates,]

# 1 Extract the relevant value  
# Get rid of values that consists of single value
data <- data[,c(-1,-2,-6,-7,-8,-9,-11,-13,-26,-27,-28,-30,-33,-36,-37,-38,-39,-40,-41,-43,-44,-45,-46,-47)]

# 2 The dataset contains '?' values. Change this to 'Unknown'
for (col in names(data)) {
  data[data[, col] == '?', col] <- 'Unknown'
}

# 3 Get rid of unknown gender since there are only 3 rows 
data <- data[data$gender!='Unknown/Invalid',]

# 4 Change the dependent variable to bianry. Less than 30 is 1, other 0
data$readmitted <- ifelse(data$readmitted=='<30','Yes','No')

# 5 Change the age column to simpler format
data$age <- factor(data$age, 
       levels = c("[0-10)", "[10-20)", "[20-30)", 
                  "[30-40)", "[40-50)", "[50-60)", 
                  "[60-70)", "[70-80)", "[80-90)", 
                  "[90-100)"),
       labels = c("Under 10", "10s", "20s", 
                  "30s","40s", "50s",
                  "60s", "70s", "80s", 
                  "90s"))

# 6 Change the diagnosis column
categorize_values <- function(data, column_names) {
  for (column_name in column_names) {
    value <- ifelse(data[[column_name]] %in% c(390:459, 785), "Circulatory",
                        ifelse(data[[column_name]] %in% c(460:519, 786), "Respiratory",
                        ifelse(data[[column_name]] %in% c(520:579, 787), "Digestive",
                        ifelse(data[[column_name]] %in% "250", "Diabetes",
                        ifelse(data[[column_name]] %in% c(800:999), "Injury",
                        ifelse(data[[column_name]] %in% c(710:739), "Musculoskeletal",
                        ifelse(data[[column_name]] %in% c(580:629, 788), "Genitourinary",
                        ifelse(data[[column_name]] %in% c(140:239), "Neoplasms", "Other"))))))))
    

  }
  return(value)
}

data$diag_1 <-categorize_values(data, 'diag_1')
data$diag_2 <-categorize_values(data, 'diag_2')
data$diag_3 <-categorize_values(data, 'diag_3')

# 7 Calculate the table of medical_specialty
table_medical_specialty <- table(data$medical_specialty)

# Replace values with counts less than or equal to 1000 with "Other"
data$medical_specialty[!(data$medical_specialty %in% names(table_medical_specialty)[table_medical_specialty > 1000])] <- "Other"

This is the result after cleaning the data

library(dplyr)
# Change the chr to factor prior to the analysis
chr_columns <- sapply(data,is.character)
data <- data %>%
  mutate_if(chr_columns,as.factor)

head(data)

##              race gender age time_in_hospital medical_specialty num_procedures
## 2       Caucasian Female 10s                3           Unknown              0
## 3 AfricanAmerican Female 20s                2           Unknown              5
## 4       Caucasian   Male 30s                2           Unknown              1
## 5       Caucasian   Male 40s                1           Unknown              0
## 6       Caucasian   Male 50s                3           Unknown              6
## 7       Caucasian   Male 60s                4           Unknown              1
##   num_medications number_outpatient number_emergency number_inpatient
## 2              18                 0                0                0
## 3              13                 2                0                1
## 4              16                 0                0                0
## 5               8                 0                0                0
## 6              16                 0                0                0
## 7              21                 0                0                0
##        diag_1      diag_2      diag_3 number_diagnoses max_glu_serum A1Cresult
## 2       Other       Other       Other                9          None      None
## 3       Other    Diabetes       Other                6          None      None
## 4       Other       Other Circulatory                7          None      None
## 5   Neoplasms   Neoplasms    Diabetes                5          None      None
## 6 Circulatory Circulatory    Diabetes                9          None      None
## 7 Circulatory Circulatory       Other                7          None      None
##   metformin glimepiride glipizide glyburide pioglitazone rosiglitazone insulin
## 2        No          No        No        No           No            No      Up
## 3        No          No    Steady        No           No            No      No
## 4        No          No        No        No           No            No      Up
## 5        No          No    Steady        No           No            No  Steady
## 6        No          No        No        No           No            No  Steady
## 7    Steady      Steady        No        No           No            No  Steady
##   change diabetesMed readmitted
## 2     Ch         Yes         No
## 3     No         Yes         No
## 4     Ch         Yes         No
## 5     Ch         Yes         No
## 6     No         Yes         No
## 7     Ch         Yes         No

Data Exploration

In this section, we will be visualizing some relevant information in the data using ggplot2.
We have used a custom theme from the website: https://rpubs.com/ledongnhatnam.

library(ggplot2)
library(tidyverse)
library(gridExtra)
library(tidyr)

my_theme <- function(base_size = 10, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 10),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 12),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#fffcfc"),
      strip.background = element_rect(fill = "#820000", color = "#820000", size =0.5),
      strip.text = element_text(face = "bold", size = 10, color = "white"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())

mycolors=c("#db3434","#ccc7c7","#871618","#590050","#7f34a8")

generate_chart <- function(data, variable) {
  var_sym <- ensym(variable)
  p <- data %>%
    group_by(!!var_sym) %>%
    summarize(count = n(),
              prop = round(count / nrow(data), 2),
              labels = scales::percent(prop)) %>%
    ggplot(aes(x = readmitted, fill = factor(!!var_sym), y = count)) +
    geom_bar(width = 0.5, stat = 'identity', color = 'black',alpha=0.8) +
    scale_fill_manual(values=mycolors) +
    labs(fill = as_label(var_sym)) +
    theme(legend.position = "right")
  
  return(p)
}

# Generate chart using the function
generate_chart(data, readmitted)+
  labs(title='Readmission (Dependent Variable)',fill='Readmission')+
   geom_text(aes(label = labels), color='black',
             position = position_stack(vjust=0.5),show.legend = FALSE) +
  coord_flip()

make_bar_chart <- function(data, variable) {
  var_sym <- ensym(variable)
  p <- data %>%
    group_by(!!var_sym,readmitted) %>%
    summarise(count = n()) %>%
    group_by(!!var_sym) %>%
    mutate(total_count = sum(count)) %>%
    mutate(prop = round(count / total_count, 2)) %>%
    mutate(labels = scales::percent(prop)) %>%
    mutate(labels2 = ifelse(row_number() %% 2 == 1, labels, NA))%>%
    ggplot(aes(x = !!var_sym, fill = factor(readmitted), y = count)) +
    geom_bar(width = 0.7, stat = 'identity', color = 'black',alpha=0.8) +
    scale_fill_manual(values=mycolors) +
    labs(fill = 'Readmission')+
    geom_text(aes(label = labels2), color='black',
             position = position_stack(vjust=0.5),show.legend = FALSE)+
    theme(legend.position = "right")
    
    return(p)
}

b1 <- make_bar_chart(data,gender)+labs(title = 'Readmission by Gender')+coord_flip()
b2 <- make_bar_chart(data,change)+labs(title = 'Readmission by Change in Medication')+coord_flip()
b3 <- make_bar_chart(data,diabetesMed)+labs(title = 'Readmission by Diabetic Medication Prescription')+coord_flip()

grid.arrange(b1,b2,b3,nrow=3)

b4 <- make_bar_chart(data,race)+labs(title = 'Readmission by Race')+coord_flip()
b5 <- make_bar_chart(data,age)+labs(title = 'Readmission by Age')+coord_flip()
b6 <- make_bar_chart(data,max_glu_serum)+labs(title = 'Readmission by Glucose Serum Test Result')+coord_flip()
b7 <- make_bar_chart(data,A1Cresult)+labs(title = 'Readmission by A1C Test Result')+coord_flip()

grid.arrange(b4,b5,b6,b7,ncol=1)

b8 <- make_bar_chart(data,medical_specialty)+labs(title = 'Readmission by Medical Specialty')
b9 <- make_bar_chart(data,diag_1) +labs(title = 'Readmission by Initial Diagnosis')
b10 <- make_bar_chart(data,diag_2) +labs(title = 'Readmission by Secondary Diagnosis')
b11 <- make_bar_chart(data,diag_3) +labs(title = 'Readmission by Third Diagnosis')

grid.arrange(b8,b9,b10,b11,ncol=1)

generate_histogram2 <- function(data, variable) {
  var_sym <- ensym(variable)
  p <- data%>%
    ggplot(aes(!!var_sym,y=..count..,fill=readmitted))+
    geom_histogram(color='black',alpha = 0.8, binwidth = 1) +
    labs(fill = 'Readmission')+
    scale_fill_manual(values=mycolors) + 
    theme_minimal()+
    labs(color='Readmission')
    
  return(p)
}


h0 <- generate_histogram2(data,time_in_hospital)+labs(title='Time in Hospital')
h1 <- generate_histogram2(data,num_procedures)+labs(title= 'Number of Procedures')
h2 <- generate_histogram2(data,num_medications)+labs(title='Number of Medications')
h3 <- generate_histogram2(data,number_emergency)+labs(title='Number of Emergency Visits')
h4 <- generate_histogram2(data,number_outpatient)+labs(title='Number of Outpatient Visits')
h5 <- generate_histogram2(data,number_inpatient)+labs(title='Number of Inpatient Visits')
h6 <- generate_histogram2(data,number_diagnoses)+labs(title='Number of Diagnoses')

grid.arrange(h0,h1,h2,h3,h4,h5,h6,ncol=3)

5. Prediction Modeling

Undersampling

Since the dataset is biased towards no readmission, we will be undersampling the train dataset to predict the test dataset.

set.seed(0)
library(caret)
# splot into train test dataset
index <- createDataPartition(data$readmitted, p = 0.7, list = FALSE)
train_data<-data[index,]
test_data <- data[-index, ]

train_data$readmitted %>%table()

## .
##    No   Yes 
## 43633  4249

test_data$readmitted %>%table()

## .
##    No   Yes 
## 18699  1821

# undersample train dataset due to 
set.seed(0)
re1 <- train_data[(train_data$readmitted=='Yes'),]
re0 <- train_data[(train_data$readmitted=='No'),]
idx0 <- sample(1:nrow(re0), sum(train_data$readmitted=='Yes'))
re0 <- re0[idx0,]
re_undersample <- rbind(re1,re0)

idx1 <-  sample(1:nrow(re_undersample), nrow(re_undersample)*0.7)
u_train_data <- re_undersample[idx1, ]
u_test_data <-  re_undersample[-idx1, ]

re_undersample$readmitted%>%table()

## .
##   No  Yes 
## 4249 4249

# Generate chart using the function
t1 <- generate_chart(train_data, readmitted)+
  labs(title='Original Train Dataset',fill='Readmission')+
   geom_text(aes(label = labels), color='black',
             position = position_stack(vjust=0.5),show.legend = FALSE)+
  coord_flip()

t2 <- generate_chart(test_data, readmitted)+
  labs(title='Original Test Dataset',fill='Readmission')+
   geom_text(aes(label = labels), color='black',
             position = position_stack(vjust=0.5),show.legend = FALSE)+
  coord_flip()

t3 <- generate_chart(re_undersample, readmitted)+
  labs(title='Undersampling Dataset',fill='Readmission')+
   geom_text(aes(label = labels), color='black',
             position = position_stack(vjust=0.5),show.legend = FALSE)+
  coord_flip()

grid.arrange(t1,t2,t3,ncol=1)

Models

In this section, we will be using several machine learning methods to choose the best model for our prediction. We use undersampled dataset to predict the original test data set.

Logistic Regression

simple_logit_model <- glm(data=re_undersample, readmitted~.,family='binomial')

# p is the cutoff for the classification
p <- 0.5

# predict out of sample 
logit_pred <- predict(simple_logit_model, newdata = test_data, type = "response")
logit_pred <- as.factor(ifelse(logit_pred>p,'Yes','No'))


# compute the accuracy
logit_cm <- confusionMatrix(data=logit_pred,
                reference=test_data$readmitted)

logit_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  11494   865
##        Yes  7205   956
##                                        
##                Accuracy : 0.6067       
##                  95% CI : (0.6, 0.6134)
##     No Information Rate : 0.9113       
##     P-Value [Acc > NIR] : 1            
##                                        
##                   Kappa : 0.0543       
##                                        
##  Mcnemar's Test P-Value : <2e-16       
##                                        
##             Sensitivity : 0.6147       
##             Specificity : 0.5250       
##          Pos Pred Value : 0.9300       
##          Neg Pred Value : 0.1171       
##              Prevalence : 0.9113       
##          Detection Rate : 0.5601       
##    Detection Prevalence : 0.6023       
##       Balanced Accuracy : 0.5698       
##                                        
##        'Positive' Class : No           
##

Logistic Lasso

library(gamlr)
library(Matrix)

set.seed(0)
Xmatrix_is <- sparse.model.matrix(readmitted ~., data=naref(re_undersample))[,-1]
logit_lasso_model <- cv.gamlr(Xmatrix_is,re_undersample$readmitted,family='binomial',verb=TRUE)

## fold 1,2,3,4,5,done.

plot(logit_lasso_model)

length(coef(logit_lasso_model))

## [1] 101

sum(coef(logit_lasso_model, select="min")!=0)

## [1] 50

sum(coef(logit_lasso_model, select="1se")!=0)

## [1] 15

Xmatrix_oos <- sparse.model.matrix(readmitted ~., data=naref(test_data))[,-1]
lasso_pred <- drop(predict(logit_lasso_model, Xmatrix_oos, type="response"))
lasso_pred <- as.factor(ifelse(lasso_pred>p,'Yes','No'))

# compute the accuracy
lasso_cm <- confusionMatrix(data=lasso_pred,
                reference=test_data$readmitted)

Simple Classification Tree

library(rpart.plot)
library(caret)
library(rattle)

set.seed(0)
tree_is <- rpart(formula = readmitted ~ .,
                 data = re_undersample,
                 method = 'class',
                 xval=5, #5 fold cross-validation
                 model = TRUE) 

fancyRpartPlot(tree_is,palettes = 'Reds')

tree_pred <- predict(tree_is,test_data,type='class')
tree_cm <- confusionMatrix(data=tree_pred,
                reference=test_data$readmitted)
tree_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  10882   814
##        Yes  7817  1007
##                                           
##                Accuracy : 0.5794          
##                  95% CI : (0.5726, 0.5862)
##     No Information Rate : 0.9113          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0493          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5820          
##             Specificity : 0.5530          
##          Pos Pred Value : 0.9304          
##          Neg Pred Value : 0.1141          
##              Prevalence : 0.9113          
##          Detection Rate : 0.5303          
##    Detection Prevalence : 0.5700          
##       Balanced Accuracy : 0.5675          
##                                           
##        'Positive' Class : No              
##

Bagging

set.seed(0)
bag_is <- train(readmitted ~ ., 
                   data = re_undersample, method = "treebag",  # for bagging
                   tuneLength = 5,  # choose up to 5 combinations of tuning parameter 
                   metric = "ROC",  # evaluate hyperparamter combinations with ROC
                   trControl = trainControl(
                     method = "cv",  # k-fold cross validation
                     number = 5,  # k=5 folds
                     savePredictions = "final",       # save predictions for the optimal tuning parameters
                     classProbs = TRUE,  # return class probabilities in addition to predicted values
                     summaryFunction = twoClassSummary  # for binary response variable
                      )
                    )

bag_pred <- predict(bag_is,test_data,type='raw')
bag_cm <- confusionMatrix(data=bag_pred,reference=test_data$readmitted)
bag_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  10470   845
##        Yes  8229   976
##                                          
##                Accuracy : 0.5578         
##                  95% CI : (0.551, 0.5646)
##     No Information Rate : 0.9113         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0339         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.5599         
##             Specificity : 0.5360         
##          Pos Pred Value : 0.9253         
##          Neg Pred Value : 0.1060         
##              Prevalence : 0.9113         
##          Detection Rate : 0.5102         
##    Detection Prevalence : 0.5514         
##       Balanced Accuracy : 0.5479         
##                                          
##        'Positive' Class : No             
##

Random Forest

set.seed(0)
random_is <- train(readmitted ~ ., 
                   data = re_undersample, method = "ranger",  # for bagging
                   tuneLength = 5,  # choose up to 5 combinations of tuning parameter 
                   metric = "ROC",  # evaluate hyperparamter combinations with ROC
                   trControl = trainControl(
                     method = "cv",  # k-fold cross validation
                     number = 5,  # k=5 folds
                     savePredictions = "final",       # save predictions for the optimal tuning parameters
                     classProbs = TRUE,  # return class probabilities in addition to predicted values
                     summaryFunction = twoClassSummary  # for binary response variable
                      )
                    )

ggplot(random_is)+scale_color_manual(values=c(mycolors[1],mycolors[3]))

# compute variable importance for RF
library(ranger)
library(tree)
var_imp <- ranger(readmitted ~., data=re_undersample,write.forest = TRUE,num.trees = 200, min.node.size = 5, importance ='impurity')

library(timetk)
var_imp <- ranger::importance(var_imp) %>% sort(decreasing=TRUE)
var_imp9 <- var_imp[1:9]%>%data.frame()%>%tk_tbl()

# Create the bar graph
ggplot(aes(x = reorder(index,.), y = ., fill = reorder(index,.)), data = var_imp9) +
  geom_bar(stat = 'identity', color = 'black',show.legend = F) +
  scale_fill_brewer(palette = 'Reds') +
  coord_flip() +
  labs(x = "Features", y = "Importance", title = "Random Forest Variable Importance")

random_pred <- predict(random_is,test_data,type='raw')
random_cm <- confusionMatrix(data=random_pred,reference=test_data$readmitted)
random_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  10627   764
##        Yes  8072  1057
##                                           
##                Accuracy : 0.5694          
##                  95% CI : (0.5626, 0.5762)
##     No Information Rate : 0.9113          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0529          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.5683          
##             Specificity : 0.5805          
##          Pos Pred Value : 0.9329          
##          Neg Pred Value : 0.1158          
##              Prevalence : 0.9113          
##          Detection Rate : 0.5179          
##    Detection Prevalence : 0.5551          
##       Balanced Accuracy : 0.5744          
##                                           
##        'Positive' Class : No              
##

Gradient Boosting Classification Tree

set.seed(0)
gdb_is <- train(readmitted ~ ., 
               data = re_undersample, 
               method = "gbm",  # for gradient boosting
               tuneLength = 5,  # choose up to 5 combinations of tuning parameters
               metric = "ROC",  # evaluate hyperparamter combinations with ROC
               trControl = trainControl(
                 method = "cv",  # k-fold cross validation
                 number = 5,  # 5 folds
                 savePredictions = "final",       # save predictions for the optimal tuning parameter1
                      classProbs = TRUE,  # return class probabilities in addition to predicted values
                      summaryFunction = twoClassSummary  # for binary response variable
                      ))

ggplot(gdb_is)+scale_color_manual(values=mycolors)

gdb_pred <- predict(gdb_is,test_data,type='raw')
gdb_cm <- confusionMatrix(data=gdb_pred,reference=test_data$readmitted)
gdb_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  11647   862
##        Yes  7052   959
##                                          
##                Accuracy : 0.6143         
##                  95% CI : (0.6076, 0.621)
##     No Information Rate : 0.9113         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.059          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.6229         
##             Specificity : 0.5266         
##          Pos Pred Value : 0.9311         
##          Neg Pred Value : 0.1197         
##              Prevalence : 0.9113         
##          Detection Rate : 0.5676         
##    Detection Prevalence : 0.6096         
##       Balanced Accuracy : 0.5748         
##                                          
##        'Positive' Class : No             
##

Hard Margin SVM

library(MLmetrics)
set.seed(0)
svm_is <- train(readmitted ~ .,
                data = re_undersample, 
                method = "svmLinear2",
                trControl = trainControl(method = "cv",
                                         number = 5, 
                                         savePredictions = TRUE,
                                         classProbs=TRUE,
                                         verboseIter = FALSE,
                                         summaryFunction = multiClassSummary))

svm_pred <- predict(svm_is,test_data,type='raw')
svm_cm <- confusionMatrix(data=svm_pred,reference=test_data$readmitted)
svm_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  11991   898
##        Yes  6708   923
##                                          
##                Accuracy : 0.6293         
##                  95% CI : (0.6227, 0.636)
##     No Information Rate : 0.9113         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0607         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.6413         
##             Specificity : 0.5069         
##          Pos Pred Value : 0.9303         
##          Neg Pred Value : 0.1210         
##              Prevalence : 0.9113         
##          Detection Rate : 0.5844         
##    Detection Prevalence : 0.6281         
##       Balanced Accuracy : 0.5741         
##                                          
##        'Positive' Class : No             
##

k-Nearest Neighbors

knn_is <- train(readmitted ~ .,
                          data = re_undersample,
                          method = "kknn",
                         trControl = trainControl(method = "cv", 
                                                   number = 5, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE,
                                                  summaryFunction = multiClassSummary))

knn_is

## k-Nearest Neighbors 
## 
## 8498 samples
##   25 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 6799, 6798, 6798, 6798, 6799 
## Resampling results across tuning parameters:
## 
##   kmax  Accuracy   Kappa       F1         Sensitivity  Specificity
##   5     0.5075332  0.01506926  0.5134579  0.5196552    0.4954141  
##   7     0.5123586  0.02472230  0.5186078  0.5253040    0.4994186  
##   9     0.5143590  0.02872073  0.5217147  0.5297759    0.4989449  
##   Pos_Pred_Value  Neg_Pred_Value  Precision  Recall     Detection_Rate
##   0.5074869       0.5075790       0.5074869  0.5196552  0.2598270     
##   0.5121738       0.5125634       0.5121738  0.5253040  0.2626512     
##   0.5139609       0.5147897       0.5139609  0.5297759  0.2648875     
##   Balanced_Accuracy
##   0.5075347        
##   0.5123613        
##   0.5143604        
## 
## Tuning parameter 'distance' was held constant at a value of 2
## Tuning
##  parameter 'kernel' was held constant at a value of optimal
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were kmax = 9, distance = 2 and kernel
##  = optimal.

knn_pred <- predict(knn_is,test_data,type='raw')
knn_cm <- confusionMatrix(data=knn_pred,reference=test_data$readmitted)
knn_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  9874  887
##        Yes 8825  934
##                                           
##                Accuracy : 0.5267          
##                  95% CI : (0.5198, 0.5336)
##     No Information Rate : 0.9113          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0138          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.52805         
##             Specificity : 0.51290         
##          Pos Pred Value : 0.91757         
##          Neg Pred Value : 0.09571         
##              Prevalence : 0.91126         
##          Detection Rate : 0.48119         
##    Detection Prevalence : 0.52442         
##       Balanced Accuracy : 0.52048         
##                                           
##        'Positive' Class : No              
##

Neural Network

nnet_is <- train(readmitted ~ .,
                           data = re_undersample,
                           method = "nnet",
                           trControl = trainControl(method = "cv", 
                                                    number = 5, 
                                                    savePredictions = TRUE, 
                                                    verboseIter = FALSE,
                                                    search = "grid",
                                                    summaryFunction = multiClassSummary),
                                                    tuneLength = 5)

ggplot(nnet_is)+scale_color_manual(values=mycolors)

nnet_pred <- predict(nnet_is,test_data,type='raw')
nnet_cm <- confusionMatrix(data=nnet_pred,reference=test_data$readmitted)
nnet_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  11419   851
##        Yes  7280   970
##                                          
##                Accuracy : 0.6038         
##                  95% CI : (0.597, 0.6105)
##     No Information Rate : 0.9113         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0553         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.6107         
##             Specificity : 0.5327         
##          Pos Pred Value : 0.9306         
##          Neg Pred Value : 0.1176         
##              Prevalence : 0.9113         
##          Detection Rate : 0.5565         
##    Detection Prevalence : 0.5980         
##       Balanced Accuracy : 0.5717         
##                                          
##        'Positive' Class : No             
##

Results Summary

# creating function that extracts relevant statistics from the confusion matrix
calculate_metrics <- function(confusion_matrix) {
  accuracy <- confusion_matrix$overall[1]
  sensitivity <- confusion_matrix$byClass[1]
  specificity <- confusion_matrix$byClass[2]
  f1score <- confusion_matrix$byClass[7]
  fn <- confusion_matrix$table[1,2]/(confusion_matrix$table[1,2]+confusion_matrix$table[2,2])
  return(c(accuracy, sensitivity,specificity,f1score,fn))
}

confusion_matrices <- list(logit_cm, lasso_cm, tree_cm, bag_cm, random_cm, gdb_cm, svm_cm, knn_cm, nnet_cm)

metrics_list <- lapply(confusion_matrices, calculate_metrics)

accuracy <- round(sapply(metrics_list, `[`, 1),4)
sensitivity <- round(sapply(metrics_list, `[`, 2),4)
specificity <- round(sapply(metrics_list, `[`, 3),4)
f1score <- round(sapply(metrics_list, `[`, 4),4)
fn <- round(sapply(metrics_list, `[`, 5),4)
models <- c('Logit','Lasso','Tree','Bag','RF','GDB','SVM','KNN','NNET')

results <-  tibble(models=rep(models,5),
       metrics=c(rep('accuracy',9),rep('sensitivity',9),rep('specificity',9),rep('f1sore',9),rep('fn',9)),
       scores=c(accuracy,sensitivity,specificity,f1score,fn))
results$models <- reorder(results$models, results$scores)

results %>% 
  ggplot(aes(x=models,y=scores,fill=metrics))+
  geom_bar(color='black',stat='identity',alpha=0.8,show.legend = F,width = 0.3)+
  scale_fill_manual(values=mycolors)+
  coord_flip()+
  facet_grid(~metrics)+
  labs(title='Model Summary')

results_wide <- pivot_wider(results,names_from =models, values_from=scores)
knitr::kable(results_wide)

metrics	Logit	Lasso	Tree	Bag	RF	GDB	SVM	KNN	NNET
accuracy	0.6067	0.6718	0.5794	0.5578	0.5694	0.6143	0.6293	0.5267	0.6038
sensitivity	0.6147	0.6935	0.5820	0.5599	0.5683	0.6229	0.6413	0.5280	0.6107
specificity	0.5250	0.4492	0.5530	0.5360	0.5805	0.5266	0.5069	0.5129	0.5327
f1sore	0.7402	0.7938	0.7160	0.6977	0.7063	0.7464	0.7592	0.6703	0.7374
fn	0.4750	0.5508	0.4470	0.4640	0.4195	0.4734	0.4931	0.4871	0.4673

In terms of accuracy and f1score, Lasso delivered the highest score. Although SVM model delivered good results as well, considering the sensitivty and the computation cost, we will choose Logistic Lasso as our final model

6. Causal Inference

Logistic Regression

library(magrittr)
library(broom)


glm(data = data, readmitted~.,family='binomial') %>% summary()

## 
## Call:
## glm(formula = readmitted ~ ., family = "binomial", data = data)
## 
## Coefficients:
##                                              Estimate Std. Error z value
## (Intercept)                                 -4.366572   1.015783  -4.299
## raceAsian                                    0.019792   0.168082   0.118
## raceCaucasian                                0.016009   0.037499   0.427
## raceHispanic                                -0.047402   0.103546  -0.458
## raceOther                                   -0.198979   0.123736  -1.608
## raceUnknown                                 -0.180487   0.095513  -1.890
## genderMale                                   0.014980   0.027691   0.541
## age10s                                       0.726906   0.620068   1.172
## age20s                                       1.145377   0.596801   1.919
## age30s                                       1.121664   0.591097   1.898
## age40s                                       1.166759   0.588449   1.983
## age50s                                       1.129370   0.587969   1.921
## age60s                                       1.342771   0.587814   2.284
## age70s                                       1.450256   0.587835   2.467
## age80s                                       1.467212   0.588233   2.494
## age90s                                       1.312095   0.593311   2.211
## time_in_hospital                             0.038229   0.005214   7.332
## medical_specialtyEmergency/Trauma            0.118773   0.085986   1.381
## medical_specialtyFamily/GeneralPractice      0.301131   0.080767   3.728
## medical_specialtyInternalMedicine            0.245286   0.072063   3.404
## medical_specialtyOrthopedics-Reconstructive -0.093367   0.147585  -0.633
## medical_specialtyOther                       0.196931   0.075320   2.615
## medical_specialtySurgery-General             0.192138   0.101395   1.895
## medical_specialtyUnknown                     0.225335   0.066399   3.394
## num_procedures                              -0.035216   0.009361  -3.762
## num_medications                              0.004703   0.002152   2.186
## number_outpatient                           -0.008844   0.012301  -0.719
## number_emergency                             0.078790   0.021017   3.749
## number_inpatient                             0.326592   0.015860  20.592
## diag_1Diabetes                              -0.494374   0.329034  -1.503
## diag_1Digestive                             -0.201359   0.056134  -3.587
## diag_1Genitourinary                         -0.120231   0.067604  -1.778
## diag_1Injury                                 0.059809   0.057046   1.048
## diag_1Musculoskeletal                       -0.038734   0.069358  -0.558
## diag_1Neoplasms                             -0.237184   0.082181  -2.886
## diag_1Other                                 -0.106119   0.040472  -2.622
## diag_1Respiratory                           -0.339824   0.048903  -6.949
## diag_2Diabetes                              -0.180238   0.069236  -2.603
## diag_2Digestive                             -0.026896   0.076637  -0.351
## diag_2Genitourinary                         -0.101094   0.055825  -1.811
## diag_2Injury                                 0.023883   0.086702   0.275
## diag_2Musculoskeletal                       -0.122380   0.111998  -1.093
## diag_2Neoplasms                              0.291447   0.088777   3.283
## diag_2Other                                  0.043165   0.036364   1.187
## diag_2Respiratory                           -0.134534   0.051122  -2.632
## diag_3Diabetes                              -0.110950   0.051673  -2.147
## diag_3Digestive                              0.144616   0.073708   1.962
## diag_3Genitourinary                          0.014403   0.060226   0.239
## diag_3Injury                                 0.031084   0.097091   0.320
## diag_3Musculoskeletal                       -0.044376   0.105411  -0.421
## diag_3Neoplasms                              0.190152   0.100451   1.893
## diag_3Other                                  0.037572   0.035052   1.072
## diag_3Respiratory                            0.058341   0.056099   1.040
## number_diagnoses                             0.022872   0.008688   2.633
## max_glu_serum>300                           -0.101754   0.167432  -0.608
## max_glu_serumNone                           -0.117163   0.109713  -1.068
## max_glu_serumNorm                            0.017129   0.136331   0.126
## A1Cresult>8                                 -0.031024   0.084683  -0.366
## A1CresultNone                                0.065747   0.070643   0.931
## A1CresultNorm                                0.011609   0.091151   0.127
## metforminNo                                 -0.194192   0.162510  -1.195
## metforminSteady                             -0.275948   0.162953  -1.693
## metforminUp                                 -0.420402   0.210757  -1.995
## glimepirideNo                               -0.276060   0.259699  -1.063
## glimepirideSteady                           -0.373060   0.264728  -1.409
## glimepirideUp                               -0.213089   0.343142  -0.621
## glipizideNo                                 -0.290284   0.163711  -1.773
## glipizideSteady                             -0.247316   0.164275  -1.506
## glipizideUp                                 -0.140074   0.209667  -0.668
## glyburideNo                                  0.117804   0.186522   0.632
## glyburideSteady                              0.131853   0.187668   0.703
## glyburideUp                                  0.150627   0.232081   0.649
## pioglitazoneNo                              -0.088423   0.364132  -0.243
## pioglitazoneSteady                          -0.188050   0.366770  -0.513
## pioglitazoneUp                               0.040129   0.432425   0.093
## rosiglitazoneNo                              0.873177   0.594422   1.469
## rosiglitazoneSteady                          0.844145   0.596145   1.416
## rosiglitazoneUp                              0.909834   0.662820   1.373
## insulinNo                                   -0.141764   0.070373  -2.014
## insulinSteady                               -0.150790   0.054568  -2.763
## insulinUp                                   -0.106220   0.056968  -1.865
## changeNo                                     0.033859   0.050045   0.677
## diabetesMedYes                               0.224787   0.049889   4.506
##                                             Pr(>|z|)    
## (Intercept)                                 1.72e-05 ***
## raceAsian                                   0.906262    
## raceCaucasian                               0.669441    
## raceHispanic                                0.647106    
## raceOther                                   0.107815    
## raceUnknown                                 0.058803 .  
## genderMale                                  0.588532    
## age10s                                      0.241076    
## age20s                                      0.054960 .  
## age30s                                      0.057749 .  
## age40s                                      0.047393 *  
## age50s                                      0.054757 .  
## age60s                                      0.022351 *  
## age70s                                      0.013621 *  
## age80s                                      0.012622 *  
## age90s                                      0.027003 *  
## time_in_hospital                            2.26e-13 ***
## medical_specialtyEmergency/Trauma           0.167184    
## medical_specialtyFamily/GeneralPractice     0.000193 ***
## medical_specialtyInternalMedicine           0.000665 ***
## medical_specialtyOrthopedics-Reconstructive 0.526976    
## medical_specialtyOther                      0.008934 ** 
## medical_specialtySurgery-General            0.058098 .  
## medical_specialtyUnknown                    0.000690 ***
## num_procedures                              0.000169 ***
## num_medications                             0.028837 *  
## number_outpatient                           0.472168    
## number_emergency                            0.000178 ***
## number_inpatient                             < 2e-16 ***
## diag_1Diabetes                              0.132967    
## diag_1Digestive                             0.000334 ***
## diag_1Genitourinary                         0.075330 .  
## diag_1Injury                                0.294442    
## diag_1Musculoskeletal                       0.576526    
## diag_1Neoplasms                             0.003900 ** 
## diag_1Other                                 0.008742 ** 
## diag_1Respiratory                           3.68e-12 ***
## diag_2Diabetes                              0.009235 ** 
## diag_2Digestive                             0.725620    
## diag_2Genitourinary                         0.070154 .  
## diag_2Injury                                0.782959    
## diag_2Musculoskeletal                       0.274524    
## diag_2Neoplasms                             0.001027 ** 
## diag_2Other                                 0.235214    
## diag_2Respiratory                           0.008498 ** 
## diag_3Diabetes                              0.031780 *  
## diag_3Digestive                             0.049762 *  
## diag_3Genitourinary                         0.810987    
## diag_3Injury                                0.748856    
## diag_3Musculoskeletal                       0.673770    
## diag_3Neoplasms                             0.058359 .  
## diag_3Other                                 0.283766    
## diag_3Respiratory                           0.298356    
## number_diagnoses                            0.008475 ** 
## max_glu_serum>300                           0.543364    
## max_glu_serumNone                           0.285561    
## max_glu_serumNorm                           0.900016    
## A1Cresult>8                                 0.714098    
## A1CresultNone                               0.352012    
## A1CresultNorm                               0.898659    
## metforminNo                                 0.232105    
## metforminSteady                             0.090376 .  
## metforminUp                                 0.046073 *  
## glimepirideNo                               0.287781    
## glimepirideSteady                           0.158771    
## glimepirideUp                               0.534604    
## glipizideNo                                 0.076204 .  
## glipizideSteady                             0.132195    
## glipizideUp                                 0.504082    
## glyburideNo                                 0.527658    
## glyburideSteady                             0.482314    
## glyburideUp                                 0.516321    
## pioglitazoneNo                              0.808137    
## pioglitazoneSteady                          0.608148    
## pioglitazoneUp                              0.926063    
## rosiglitazoneNo                             0.141846    
## rosiglitazoneSteady                         0.156774    
## rosiglitazoneUp                             0.169854    
## insulinNo                                   0.043961 *  
## insulinSteady                               0.005721 ** 
## insulinUp                                   0.062241 .  
## changeNo                                    0.498680    
## diabetesMedYes                              6.61e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 40988  on 68401  degrees of freedom
## Residual deviance: 39892  on 68319  degrees of freedom
## AIC: 40058
## 
## Number of Fisher Scoring iterations: 6

From the logistic regression, we can say that the defualt probability for no early readmission is very high which could be observed from the intercept. Note that NA values are the baseline. The coefficients that increased the odds ratio for early readmission significantly were time_in_hospital,number_inpatient,diag_1,diag_2.

Propensity Score Matching

Here, we will be using A1C result to define whether this variable had a causal effect on readmission rate. A1C or HbA1c is a test that estimates how much blood sugar has been in the bloodstream over the last 2~3 months.
Let’s explore the statistics of A1C result data.

data$readmitted1 <- as.character(data$readmitted)
data$readmitted1 <- ifelse(data$readmitted=='Yes',1,0)

ci1 <- data%>%
  ggplot(aes(x=readmitted,fill=A1Cresult))+
  geom_bar(position="fill",color="black",alpha=0.8,show.legend = F)+
  scale_fill_manual(values=mycolors)+
  coord_flip()+
  ggtitle('HbA1c Test')
  
ci2 <- data %>% 
  ggplot(aes(x=A1Cresult,y=..count..,fill=A1Cresult))+
  geom_bar(color='black',alpha=0.8,show.legend = F)+
  scale_fill_manual(values=mycolors)+
  coord_flip()+
  facet_grid(readmitted~.)

ci3 <- data %>% group_by(A1Cresult)%>%
  summarise(mean=round(mean(readmitted1),4))%>%
  ggplot(aes(x=as.factor(A1Cresult),y=mean,fill=as.factor(A1Cresult)))+
  geom_bar(stat='identity',color='black',alpha=0.8)+
  scale_fill_manual(values=mycolors)+
  coord_flip()+
  labs(x='A1Cresult',y='Mean of Readmission',fill='A1Cresult',title='Mean of Readmission')

grid.arrange(ci1,ci2,ci3,ncol=1)

For our first analysis, we are going to define whether the taking the A1C result affected the readmission rate or not. We will use normal t-test and t-test after propensity score matching to match the patient’s information.

library(MatchIt)
library(tableone)
library(tidyr)

# 1 if A1C test was taken, 0 if not
data$A1Cresult_group<- ifelse(data$A1Cresult == 'None', 0, 1)
data$group<-as.logical(data$A1Cresult_group)
data$group <- ifelse(data$group==1, 'Treatment', 'Control')
y_treatment <-data$readmitted1[data$group=='Treatment']  
y_control<- data$readmitted1[data$group=='Control']

ci4 <- data %>% 
  ggplot(aes(x=as.factor(group),y=..count..,fill=as.factor(group)))+
  geom_bar(color='black',alpha=0.8,show.legend = F)+
  scale_fill_manual(values=mycolors)+
  coord_flip()+
  facet_grid(readmitted~.)+
  labs(x='Group',y='Count',fill='Group',title='Treatment Group Count')

ci5 <- data %>% group_by(group)%>%
  summarise(mean=round(mean(readmitted1),4))%>%
  ggplot(aes(x=as.factor(group),y=mean,fill=as.factor(group)))+
  geom_bar(stat='identity',color='black',alpha=0.8,show.legend = F)+
  scale_fill_manual(values=mycolors)+
  coord_flip()+
  labs(x='Group',y='Mean',fill='Group',title='Mean of Readmission by Group')

set.seed(0)
match.it<-matchit(A1Cresult_group~., data=data,method = 'nearest',ratio=1)
psm_matchit_data <- match.data(match.it)  
y_treatment_psm_matchit <-psm_matchit_data$readmitted1[psm_matchit_data$A1Cresult_group==1]
y_control_psm_matchit <- psm_matchit_data$readmitted1[psm_matchit_data$A1Cresult_group==0]



ci6 <- psm_matchit_data %>% group_by(group)%>%
  summarise(mean=round(mean(readmitted1),4))%>%
  ggplot(aes(x=as.factor(group),y=mean,fill=as.factor(group)))+
  geom_bar(stat='identity',color='black',alpha=0.8)+
  scale_fill_manual(values=mycolors)+
  coord_flip()+
  labs(x='Group',y='Mean',fill='Group',title='PSM Mean of Group')

grid.arrange(ci4,ci5,ci6,ncol=1)

Interestingly, the propensity score matching score method changed the mean of the readmission rate between the control and treatment group. According to the PSM, patients with the treatment (taking the A1C test) had higher readmission rate than those who didn’t. Let’s look at the p-value to see the statistical significance.

set.seed(0)
# t test before PSM
t.test(y_treatment,  
       y_control,  
       var.equal =F,  
       paired=FALSE,  
       conf.level=0.95)

## 
##  Welch Two Sample t-test
## 
## data:  y_treatment and y_control
## t = -2.831, df = 18949, p-value = 0.004645
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.013153009 -0.002390939
## sample estimates:
##  mean of x  mean of y 
## 0.08238317 0.09015514

# t test after PSM
t.test(y_treatment_psm_matchit,  
       y_control_psm_matchit,  
       var.equal = F,  
       paired=FALSE,  
       conf.level=0.95)

## 
##  Welch Two Sample t-test
## 
## data:  y_treatment_psm_matchit and y_control_psm_matchit
## t = 4.6046, df = 24686, p-value = 4.154e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.008808077 0.021864799
## sample estimates:
##  mean of x  mean of y 
## 0.08238317 0.06704673

The p-value tells that normal t-test showed a statistical difference between the treatment and control group while the PSM method is not valid. Considering the PSM being more reliable since it reflects similar patient information, it is hard to conclude whether the taking the A1C test helps to reduce the readmission.
Now, let’s see whether high blood sugar (more than 8%) had affected the readmission rate or not.

# 1 if A1C >8%, if not 0
data$A1Cresult_group<- ifelse(data$A1Cresult == '>8', 1, 0)
data$group<-as.logical(data$A1Cresult_group)
data$group <- ifelse(data$A1Cresult_group==1, 'Treatment', 'Control')
y_treatment <-data$readmitted1[data$group=='Treatment']  
y_control<- data$readmitted1[data$group=='Control']

ci7 <- data %>% 
  ggplot(aes(x=as.factor(group),y=..count..,fill=as.factor(group)))+
  geom_bar(color='black',alpha=0.8,show.legend = F)+
  scale_fill_manual(values=mycolors)+
  coord_flip()+
  facet_grid(readmitted~.)+
  labs(x='Group',y='Count',fill='Group',title='Treatment Group Count')

ci8 <- data %>% group_by(group)%>%
  summarise(mean=round(mean(readmitted1),4))%>%
  ggplot(aes(x=as.factor(group),y=mean,fill=as.factor(group)))+
  geom_bar(stat='identity',color='black',alpha=0.8,show.legend = F)+
  scale_fill_manual(values=mycolors)+
  coord_flip()+
  labs(x='Group',y='Mean',fill='Group',title='Mean of Readmission by Group')

set.seed(0)
match.it<-matchit(A1Cresult_group~., data=data,method = 'nearest',ratio=1)
psm_matchit_data <- match.data(match.it)  
y_treatment_psm_matchit <-psm_matchit_data$readmitted1[psm_matchit_data$A1Cresult_group==1] 
y_control_psm_matchit <- psm_matchit_data$readmitted1[psm_matchit_data$A1Cresult_group==0]

ci9 <- psm_matchit_data %>% group_by(group)%>%
  summarise(mean=round(mean(readmitted1),4))%>%
  ggplot(aes(x=as.factor(group),y=mean,fill=as.factor(group)))+
  geom_bar(stat='identity',color='black',alpha=0.8)+
  scale_fill_manual(values=mycolors)+
  coord_flip()+
  labs(x='Group',y='Mean',fill='Group',title='PSM Mean of Readmission by Group')

grid.arrange(ci7,ci8,ci9,ncol=1)

- For those who were diagnosed as higher A1C result (8%), both t-test method showed that patients had lower readmission rate.

t.test(y_treatment,  
       y_control,  
       var.equal =F,  
       paired=FALSE,  
       conf.level=0.95)

## 
##  Welch Two Sample t-test
## 
## data:  y_treatment and y_control
## t = -2.7441, df = 7376.1, p-value = 0.006083
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01726678 -0.00287674
## sample estimates:
##  mean of x  mean of y 
## 0.07955489 0.08962665

t.test(y_treatment_psm_matchit,  
       y_control_psm_matchit,  
       var.equal = F,  
       paired=FALSE,  
       conf.level=0.95)

## 
##  Welch Two Sample t-test
## 
## data:  y_treatment_psm_matchit and y_control_psm_matchit
## t = 0.40641, df = 12038, p-value = 0.6844
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.007619461  0.011605510
## sample estimates:
##  mean of x  mean of y 
## 0.07955489 0.07756187

The p-value for both methods were significant indicating that the A1C result higher than 8% had a lower chance of early readmission. This could indicate that the hospital treats patient who has higher than 8% A1C result seriously and provide better care that might be reducing the early readmission.

7. Conclusion

In this project, we have analyzed clinical big data, in which the data contains approximately 70,000 patients’ early hospital readmission information who were hospitalized for diabetes treatment. To analyze the dataset, we have established 9 models for predicting the early readmission. Considering several factors, Lasso Logistic model performed the best among the models to predict patients’ early readmission. Also, we have used A1C test result factor to implement two-sample t-test and utilized Propensity Score Matching method to verify whether the factor influenced the early readmission rate or not. As a result, we have discovered that patients who were diagnosed with higher than 8% A1C test results showed a statistically lower readmission rate than any other groups.

Readmission Analysis

Jaehyun Ahn

2023-11-29

0. Abstract

1. Introduction

2. Dataset and Objective

3. Method

4. Dataset Exploration

5. Data Cleaning

Data Exploration

5. Prediction Modeling

Undersampling

Models

Logistic Regression

Logistic Lasso

Simple Classification Tree

Bagging

Random Forest

Gradient Boosting Classification Tree

Hard Margin SVM

k-Nearest Neighbors

Neural Network

Results Summary

6. Causal Inference

Logistic Regression

Propensity Score Matching

7. Conclusion