Overview

The goal of this case analysis is to predict whether or not a patient will be readmitted to the hospital within 30 days using data from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008). The data covers 10 years (1999-2008) of readmissions in 130 U.S. hospitals. Information regarding patient history, demographics, medications administered, and diagnoses are included in the data. Each record satisfies the following:

  1. It is an inpatient encounter (a hospital admission)
  2. It is a diabetic encounter (one during which any kind of diabetes was entered to the system as a diagnosis)
  3. The length of stay was at least 1 day and at most 14 days
  4. Laboratory tests were performed during the encounter
  5. Medications were administered during the encounter

The main constraints of this case study are model interpretability and misclassification. Doctors need to be able to understand and explain why a model predicted that a patient will be readmitted to the hospital so they can make appropriate modifications to a patient’s course of treatment and also the reason for the changes. If a model makes an incorrect prediction, it could be financially and emotionally costly to the patient if he/she is subject to additional unnecessary time in the hospital or is discharged home but has to come home a few days later because their symptoms or pain are back. Due to these constraint we will be testing logistic regression and decision tree models.

Read and Summarize Data

#import data saved on github
df_raw <- read.csv(url("https://raw.githubusercontent.com/KatherineEvers/698/main/diabetic_data.csv"), header=TRUE)

#first few rows of data
head(df_raw) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "800px", height = "200px")
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
2278392 8222157 Caucasian Female [0-10) ? 6 25 1 1 ? Pediatrics-Endocrinology 41 0 1 0 0 0 250.83 ? ? 1 None None No No No No No No No No No No No No No No No No No No No No No No No No No NO
149190 55629189 Caucasian Female [10-20) ? 1 1 7 3 ? ? 59 0 18 0 0 0 276 250.01 255 9 None None No No No No No No No No No No No No No No No No No Up No No No No No Ch Yes >30
64410 86047875 AfricanAmerican Female [20-30) ? 1 1 7 2 ? ? 11 5 13 2 0 1 648 250 V27 6 None None No No No No No No Steady No No No No No No No No No No No No No No No No No Yes NO
500364 82442376 Caucasian Male [30-40) ? 1 1 7 2 ? ? 44 1 16 0 0 0 8 250.43 403 7 None None No No No No No No No No No No No No No No No No No Up No No No No No Ch Yes NO
16680 42519267 Caucasian Male [40-50) ? 1 1 7 1 ? ? 51 0 8 0 0 0 197 157 250 5 None None No No No No No No Steady No No No No No No No No No No Steady No No No No No Ch Yes NO
35754 82637451 Caucasian Male [50-60) ? 2 1 2 3 ? ? 31 6 16 0 0 0 414 411 250 9 None None No No No No No No No No No No No No No No No No No Steady No No No No No No Yes >30
#summarize data
skim(df_raw)
Data summary
Name df_raw
Number of rows 101766
Number of columns 50
_______________________
Column type frequency:
character 37
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
race 0 1 1 15 0 6 0
gender 0 1 4 15 0 3 0
age 0 1 6 8 0 10 0
weight 0 1 1 9 0 10 0
payer_code 0 1 1 2 0 18 0
medical_specialty 0 1 1 36 0 73 0
diag_1 0 1 1 6 0 717 0
diag_2 0 1 1 6 0 749 0
diag_3 0 1 1 6 0 790 0
max_glu_serum 0 1 4 4 0 4 0
A1Cresult 0 1 2 4 0 4 0
metformin 0 1 2 6 0 4 0
repaglinide 0 1 2 6 0 4 0
nateglinide 0 1 2 6 0 4 0
chlorpropamide 0 1 2 6 0 4 0
glimepiride 0 1 2 6 0 4 0
acetohexamide 0 1 2 6 0 2 0
glipizide 0 1 2 6 0 4 0
glyburide 0 1 2 6 0 4 0
tolbutamide 0 1 2 6 0 2 0
pioglitazone 0 1 2 6 0 4 0
rosiglitazone 0 1 2 6 0 4 0
acarbose 0 1 2 6 0 4 0
miglitol 0 1 2 6 0 4 0
troglitazone 0 1 2 6 0 2 0
tolazamide 0 1 2 6 0 3 0
examide 0 1 2 2 0 1 0
citoglipton 0 1 2 2 0 1 0
insulin 0 1 2 6 0 4 0
glyburide.metformin 0 1 2 6 0 4 0
glipizide.metformin 0 1 2 6 0 2 0
glimepiride.pioglitazone 0 1 2 6 0 2 0
metformin.rosiglitazone 0 1 2 6 0 2 0
metformin.pioglitazone 0 1 2 6 0 2 0
change 0 1 2 2 0 2 0
diabetesMed 0 1 2 3 0 2 0
readmitted 0 1 2 3 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
encounter_id 0 1 165201645.62 102640295.98 12522 84961194 152388987 230270888 443867222 ▆▇▅▂▂
patient_nbr 0 1 54330400.69 38696359.35 135 23413221 45505143 87545950 189502619 ▇▆▆▁▁
admission_type_id 0 1 2.02 1.45 1 1 1 3 8 ▇▂▁▁▁
discharge_disposition_id 0 1 3.72 5.28 1 1 1 4 28 ▇▁▁▁▁
admission_source_id 0 1 5.75 4.06 1 1 7 7 25 ▅▇▁▁▁
time_in_hospital 0 1 4.40 2.99 1 2 4 6 14 ▇▅▂▁▁
num_lab_procedures 0 1 43.10 19.67 1 31 44 57 132 ▃▇▅▁▁
num_procedures 0 1 1.34 1.71 0 0 1 2 6 ▇▂▁▁▁
num_medications 0 1 16.02 8.13 1 10 15 20 81 ▇▃▁▁▁
number_outpatient 0 1 0.37 1.27 0 0 0 0 42 ▇▁▁▁▁
number_emergency 0 1 0.20 0.93 0 0 0 0 76 ▇▁▁▁▁
number_inpatient 0 1 0.64 1.26 0 0 0 1 21 ▇▁▁▁▁
number_diagnoses 0 1 7.42 1.93 1 6 8 9 16 ▁▅▇▁▁
#describe(df)

#sum_data <- summary(df)
#pander(sum_data, split.table = 100, style = 'rmarkdown')

As seen in the summary table, the data contains 101,766 rows and 50 columns. Of the 50 columns, 13 are numeric and 37 are character (nominal) variables. The table also provides distributions of the numerical variables and the number of unique values and most common values of nominal variables. Note that examide and citoglipton have only 1 unique value and diag_1, diag_2 and diag_3 have over 700 unique values.

Missing Values

Some variables have “?” or other labels coded as missing/not available/not measured, which is why no missing values are reported in the summary table above. After replacing these labels with NA, we get the following table of variables with missing values. Since weight, max_glu_serum, and A1Cresult have over 50% missing values, they will be dropped from the modeling data. In the data transformation section later on missing values for the remaining factors with less than 50% missing values will be filled with mode or kept as a separate category.

#recode missing labels to NA
missing_df <- df_raw %>% 
    mutate(race = na_if(race, '?')) %>% 
    mutate(weight = na_if(weight, '?')) %>% 
    mutate(admission_type_id = na_if(admission_type_id, '5')) %>% 
    mutate(admission_type_id = na_if(admission_type_id, '6')) %>% 
    mutate(admission_type_id = na_if(admission_type_id, '8')) %>% 
    mutate(discharge_disposition_id = na_if(discharge_disposition_id, '18')) %>% 
    mutate(discharge_disposition_id = na_if(discharge_disposition_id, '25')) %>% 
    mutate(discharge_disposition_id = na_if(discharge_disposition_id, '26')) %>% 
    mutate(admission_source_id = na_if(admission_source_id, '15')) %>% 
    mutate(admission_source_id = na_if(admission_source_id, '17'))%>% 
    mutate(admission_source_id = na_if(admission_source_id, '20')) %>% 
    mutate(admission_source_id = na_if(admission_source_id, '21')) %>% 
    mutate(payer_code = na_if(payer_code, '?')) %>% 
    mutate(medical_specialty = na_if(medical_specialty, '?')) %>% 
    mutate(payer_code = na_if(payer_code, '?')) %>% 
    mutate(diag_1 = na_if(diag_1, '?')) %>% 
    mutate(diag_2 = na_if(diag_2, '?')) %>% 
    mutate(diag_3 = na_if(diag_3, '?')) %>% 
    mutate(max_glu_serum = na_if(max_glu_serum, 'None')) %>% 
    mutate(A1Cresult = na_if(A1Cresult, 'None')) 

#sapply(missing_df, function(x) sum(is.na(x)))
missing_df %>% 
  summarize_all(funs(sum(is.na(.)) / length(.) * 100)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "800px", height = "100px")
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
0 0 2.233555 0 0 96.85848 10.21559 4.598785 6.821532 0 39.55742 49.08221 0 0 0 0 0 0 0.0206356 0.3517874 1.398306 0 94.74677 83.27732 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#missing/not available/not measured
#table(df_raw['race'])["?"] #2273
#table(df_raw['weight'])["?"] #98569
#table(df_raw['admission_type_id']) #5: 4785 ,6: 5291,8: 320
#table(df_raw['discharge_disposition_id']) #18: 3691, 25:989, 26:0
#table(df_raw['admission_source_id']) #15:0, 17: 6781, 20: 161, 21:0
#table(df_raw['payer_code'])["?"] #40245
#table(df_raw['medical_specialty'])["?"]  #49949
#table(df_raw['diag_1'])["?"] #21
#table(df_raw['diag_2'])["?"] #358
#table(df_raw['diag_3'])["?"] #1423
#table(df_raw['max_glu_serum'])["None"] #96420
#table(df_raw['A1Cresult'])["None"] #84748

Bias

Duplicate patient records

Upon investigation we find that there are cases of multiple records for one patient number. Based on the five requirements of the patient records we will assume each record is an independent, separate occurrence and keep all records.

dup <- df_raw %>% 
  count(patient_nbr) %>%
  arrange(desc(n))

head(dup) %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "800px", height = "200px")
patient_nbr n
88785891 40
43140906 28
1660293 23
23199021 23
88227540 23
23643405 22
#examine example patient number
df_raw[df_raw$patient_nbr==8693541,] %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  scroll_box(width = "800px", height = "200px")
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
19311 68920296 8693541 Caucasian Female [80-90) ? 1 1 7 12 ? ? 74 0 23 0 0 0 410 428 427 9 None None No No No No No No No Down No No No No No No No No No No No No No No No Ch Yes >30
20345 71711130 8693541 Caucasian Female [80-90) ? 1 1 7 6 ? ? 60 0 13 0 0 1 410 428 427 8 None None No No No No No No No Steady No No No No No No No No No No No No No No No No Yes >30
21637 75256344 8693541 Caucasian Female [80-90) ? 1 1 7 5 ? ? 61 0 10 0 0 2 428 496 427 6 None None No No No No No No No Down No No No No No No No No No No No No No No No Ch Yes >30
24712 83442996 8693541 Caucasian Female [80-90) ? 1 3 7 13 ? ? 66 1 19 0 0 3 428 486 427 8 None None No No No No No No No Down No No No No No No No No No No No No No No No Ch Yes <30
25682 85503798 8693541 Caucasian Female [80-90) ? 1 6 7 6 ? ? 49 2 16 0 0 4 398 486 427 8 None None No No No No No No No Steady No No No No No No No No No No No No No No No No Yes NO

Expired/Hospice Patients

Since we are predicting whether a patient was readmitted or not, we want to remove any patients who expired in the hospital or were discharged to hospice in order to avoid bias. These patients have a discharge_disposition_id of 11, 13, 14, 19, 20, or 21. There are 2,423 records removed with 99,343 remaining.

table(df_raw['discharge_disposition_id']) 
## 
##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 60234  2128 13954   815  1184 12902   623   108    21     6  1642     3   399 
##    14    15    16    17    18    19    20    22    23    24    25    27    28 
##   372    63    11    14  3691     8     2  1993   412    48   989     5   139
df <- subset(df_raw, !(discharge_disposition_id %in% c('11','13','14','19','20', '21'))) #drop 2423 records, 99343 remaining

#table(df['discharge_disposition_id'])

Data Transformation

#convert from character to factor
df$race <- as.factor(df$race)
df$gender <- as.factor(df$gender)
df$age <- as.factor(df$age)
df$weight <- as.factor(df$weight)
df$admission_type_id <- as.factor(df$admission_type_id)
df$discharge_disposition_id <- as.factor(df$discharge_disposition_id)
df$admission_source_id <- as.factor(df$admission_source_id)
df$payer_code <- as.factor(df$payer_code)
df$medical_specialty <- as.factor(df$medical_specialty)
df$diag_1 <- as.factor(df$diag_1)
df$diag_2 <- as.factor(df$diag_2)
df$diag_3 <- as.factor(df$diag_3)
df$max_glu_serum <- as.factor(df$max_glu_serum)
df$A1Cresult <- as.factor(df$A1Cresult)
df$metformin <- as.factor(df$metformin)
df$repaglinide <- as.factor(df$repaglinide)
df$nateglinide <- as.factor(df$nateglinide)
df$chlorpropamide <- as.factor(df$chlorpropamide)
df$glimepiride <- as.factor(df$glimepiride)
df$acetohexamide <- as.factor(df$acetohexamide)
df$glipizide <- as.factor(df$glipizide)
df$glyburide <- as.factor(df$glyburide)
df$tolbutamide <- as.factor(df$tolbutamide)
df$pioglitazone <- as.factor(df$pioglitazone)
df$rosiglitazone <- as.factor(df$rosiglitazone)
df$acarbose <- as.factor(df$acarbose)
df$miglitol <- as.factor(df$miglitol)
df$troglitazone <- as.factor(df$troglitazone)
df$tolazamide <- as.factor(df$tolazamide)
df$examide <- as.factor(df$examide)
df$citoglipton <- as.factor(df$citoglipton)
df$insulin <- as.factor(df$insulin)
df$glyburide.metformin <- as.factor(df$glyburide.metformin)
df$glipizide.metformin <- as.factor(df$glipizide.metformin)
df$glimepiride.pioglitazone <- as.factor(df$glimepiride.pioglitazone)
df$metformin.rosiglitazone <- as.factor(df$metformin.rosiglitazone)
df$metformin.pioglitazone <- as.factor(df$metformin.pioglitazone)
df$change <- as.factor(df$change)
df$diabetesMed <- as.factor(df$diabetesMed)
df$readmitted <- as.factor(df$readmitted)

Many of the nominal variables have multiple levels such as diag_1 and discharge_disposition_id. If a nominal variable has too many levels or levels that rarely occur, it is less likely that these levels will impact the model. So in order to reduce the number of levels for these variables we will recode and combine some levels. We will also create the Charlson Comorbidity Index (CCI) 1992 version based on the ICD-9 (International Classification of Diseases) codes for diag_1, diag_2, and diag_3

#readmitted-convert to binary
df_2 <- df %>% mutate(readmitted=recode(readmitted,
'<30' = 1,
'>30' = 0,
'NO' = 0))

#table(df['readmitted'])

#count unique values for each variable
#sapply(lapply(df, unique), length)

#admission type id
df_2 <- df_2 %>% mutate(admission_type_id=recode(admission_type_id,
'2'=3,
'1'=3,
'7'=3,
'3'=1,
'4'=2,
'5'=0,
'6'=0,
'8'=0))

#discharge_disposition_id
df_2 <- df_2 %>% mutate(discharge_disposition_id=recode(discharge_disposition_id,
       '1'=1,  
       '2'=2,
       '18'=0,
       '25'=0,
       '26'=0,
       '3'=2,
       '4'=2,
       '5'=2,
       '6'=2,
       '8'=2,
       '9'=2,
       '10'=2,
       '15'=2,
       '16'=2,
       '17'=2,
       '22'=2,
       '23'=2,
       '24'=2,
       '30'=2,
       '27'=2,
       '28'=2,
       '29'=2,
       '7'=3,
       '11'=4,
       '12'=2,
       '19'=4,
       '20'=4,
       '21'=4,
       '13'=4,
       '14'=4
))

#table(df_2['discharge_disposition_id'])

#admission source id
df_2 <- df_2 %>% mutate(admission_source_id=recode(admission_source_id,
'1' = 3,
'2'=3,
'3'=3,
'4'=1,
'5'=1,
'6'=1,
'7'=2,
'8'=3,
'9'=0,
'10'=1,
'11'=3,
'12'=3,
'13'=3,
'14'=3,
'15'=0,
'17'=0,
'18'=1,
'19'=4,
'20'=0,
'21'=0,
'22'=1,
'23'=3,
'24'=3,
'25'=1,
'26'=1
))

#table(df_2['admission_source_id'])

#fill missing with mode
#race
df_2 <- df_2 %>% mutate(race=recode(race,
'Caucasian' = 0,
'AfricanAmerican' = 1,
'Asian' = 1,
'Hispanic'=1,
'Other'=1,
'?'=0))

#gender
df_2 <- df_2 %>% mutate(gender=recode(gender,
'Unknown/Invalid' = 1,
'Male' = 0,
'Female' = 1))

#age
df_2 <- df_2 %>% mutate(age=recode(age,
'[0-10)' = 5,
'[10-20)' = 15,
'[20-30)' = 25,
'[30-40)' = 35,
'[40-50)' = 45,
'[50-60)' = 55,
'[60-70)' = 65,
'[70-80)' = 75,
'[80-90)' = 85,
'[90-100)' = 95))

#weight
#prop.table(table(df$weight))["?"] #0.97-drop

#payer_code
df_2 <- df_2 %>% mutate(payer_code=recode(payer_code,
MC=1,
MD=2,
HM=2,
UN=2,
BC=2,
SP=3,
CP=2,
SI=2,
DM=2,
CM=2,
CH=2,
PO=2,
WC=2,
OT=2,
OG=2,
MP=2,
FR=2,
'?'=0))

#table(df_2['payer_code'])

#medical_specialty
df_2 <- df_2 %>% mutate(medical_specialty=recode(medical_specialty,
'?'=0,
'AllergyandImmunology'=7,
'Anesthesiology'=7,
'Anesthesiology-Pediatric'=13,
'Cardiology'=1,
'Cardiology-Pediatric'=13,
'DCPTEAM'=7,
'Dentistry'=7,
'Dermatology'=7,
'Emergency/Trauma'=2,
'Endocrinology'=7,
'Endocrinology-Metabolism'=7,
'Family/GeneralPractice'=4,
'Gastroenterology'=5,
'Gynecology'=7,
'Hematology'=6,
'Hematology/Oncology'=6,
'Hospitalist'=7,
'InfectiousDiseases'=7,
'InternalMedicine'=4,
'Nephrology'=8,
'Neurology'=9,
'Neurophysiology'=9,
'Obsterics&Gynecology-GynecologicOnco'=7,
'Obstetrics'=7,
'ObstetricsandGynecology'=7,
'Oncology'=10,
'Ophthalmology'=7,
'Orthopedics'=11,
'Orthopedics-Reconstructive'=11,
'Osteopath'=11,
'Otolaryngology'=7,
'OutreachServices'=7,
'Pathology'=7,
'Pediatrics'=13,
'Pediatrics-AllergyandImmunology'=13,
'Pediatrics-CriticalCare'=13,
'Pediatrics-EmergencyMedicine'=13,
'Pediatrics-Endocrinology'=13,
'Pediatrics-Hematology-Oncology'=13,
'Pediatrics-InfectiousDiseases'=13,
'Pediatrics-Neurology'=13,
'Pediatrics-Pulmonology'=13,
'Perinatology'=7,
'PhysicalMedicineandRehabilitation'=7,
'PhysicianNotFound'=7,
'Podiatry'=10,
'Proctology'=8,
'Psychiatry'=7,
'Psychiatry-Addictive'=7,
'Psychiatry-Child/Adolescent'=13,
'Psychology'=7,
'Pulmonology'=12,
'Radiologist'=11,
'Radiology'=11,
'Resident'=7,
'Rheumatology'=11,
'Speech'=7,
'SportsMedicine'=11,
'Surgeon'=3,
'Surgery-Cardiovascular'=3,
'Surgery-Cardiovascular/Thoracic'=3,
'Surgery-Colon&Rectal'=3,
'Surgery-General'=3,
'Surgery-Maxillofacial'=3,
'Surgery-Neuro'=3,
'Surgery-Pediatric'=3,
'Surgery-Plastic'=3,
'Surgery-PlasticwithinHeadandNeck'=3,
'Surgery-Thoracic'=3,
'Surgery-Vascular'=3,
'SurgicalSpecialty'=3,
'Urology'=8))

#table(df_2['medical_specialty'])

#diag_1 
#reassign ? to separate category
levels(df_2$diag_1)[levels(df_2$diag_1)=="?"] <- "9999"
                                
#convert factor levels to numeric
df_2$diag_1 <- as.numeric(levels(df_2$diag_1))[df_2$diag_1] 

df_2 <- transform(df_2, AMI_1= ifelse(diag_1==410, 1, 0))
df_2 <- transform(df_2, AMI_history_1= ifelse(diag_1==412, 1, 0))
df_2 <- transform(df_2, CHF_1= ifelse(diag_1==428, 1, 0))
df_2$PVD_1 <- ifelse(df_2$diag_1==411 | df_2$diag_1==443| df_2$diag_1==785, 1, 0)
df_2$CVD_1 <- ifelse( df_2$diag_1 >= 430 & df_2$diag_1 <= 438, 1, 0)
df_2$COPD_1 <- ifelse( (df_2$diag_1 >= 490 & df_2$diag_1 <= 496) | (df_2$diag_1 >= 500 & df_2$diag_1 <= 505), 1, 0)
df_2 <- transform(df_2, dementia_1= ifelse(diag_1==290, 1, 0))
df_2 <- transform(df_2, paralysis_1= ifelse(diag_1==342, 1, 0))
df_2$diabetes_1 <- ifelse(df_2$diag_1 == 250 | (df_2$diag_1 >= 250.01 & df_2$diag_1 <= 250.39) | df_2$diag_1 == 250.7, 1, 0)
df_2$diabetes_w_comp_1 <- ifelse((df_2$diag_1 >= 250.4 & df_2$diag_1 <= 250.69) | (df_2$diag_1 >= 250.8 & df_2$diag_1 == 250.99), 1, 0)
df_2$renal_1 <- ifelse((df_2$diag_1 >= 582 & df_2$diag_1 <= 583) | (df_2$diag_1 >= 585 & df_2$diag_1 <= 586) | df$diag_1==588, 1, 0)
df_2 <- transform(df_2, mild_liver_1= ifelse(diag_1==571, 1, 0))
df_2$mod_sev_liver_1 <- ifelse(df_2$diag_1 == 456 | df_2$diag_1 == 572, 1, 0)
df_2$peptic_ulcer_1 <- ifelse( df_2$diag_1 >= 531 & df_2$diag_1 == 534, 1, 0)
df_2$rheumatologic_1 <- ifelse(df_2$diag_1 == 710 | df_2$diag_1 == 714, 1, 0)
df_2$aids_1 <- ifelse(df_2$diag_1 == 42 | df_2$diag_1 == 43 | df_2$diag_1 == 44, 1, 0)
df_2$malig_1 <- ifelse((df_2$diag_1 >= 140 & df_2$diag_1 <= 172) | (df_2$diag_1 >= 174 & df_2$diag_1 <= 195) |
                      (df_2$diag_1 >= 200 & df_2$diag_1 <= 208), 1, 0)
df_2$tumor_1 <- ifelse((df_2$diag_1 >= 196 & df_2$diag_1 <= 199), 1, 0)

df_2 <- transform(df_2, wt_AMI_1= ifelse(AMI_1==1, 1, 0))
df_2 <- transform(df_2, wt_AMI_history_1= ifelse(AMI_history_1==1, 1, 0))
df_2 <- transform(df_2, wt_CHF_1= ifelse(CHF_1==1, 1, 0))
df_2 <- transform(df_2, wt_PVD_1= ifelse(PVD_1==1, 1, 0))
df_2 <- transform(df_2, wt_CVD_1= ifelse(PVD_1==1, 1, 0))
df_2 <- transform(df_2, wt_COPD_1= ifelse(COPD_1==1, 1, 0))
df_2 <- transform(df_2, wt_dementia_1= ifelse(dementia_1==1, 1, 0))
df_2 <- transform(df_2, wt_paralysis_1= ifelse(paralysis_1==1, 2, 0))
df_2 <- transform(df_2, wt_diabetes_1= ifelse(diabetes_1==1, 1, 0))
df_2 <- transform(df_2, wt_diabetes_w_comp_1= ifelse(diabetes_w_comp_1==1, 2, 0))
df_2 <- transform(df_2, wt_renal_1= ifelse(renal_1==1, 2, 0))
df_2 <- transform(df_2, wt_mild_liver_1= ifelse(mild_liver_1==1, 1, 0))
df_2 <- transform(df_2, wt_mod_sev_liver_1= ifelse(mod_sev_liver_1==1, 3, 0))
df_2 <- transform(df_2, wt_peptic_ulcer_1= ifelse(peptic_ulcer_1==1, 1, 0))
df_2 <- transform(df_2, wt_rheumatologic_1= ifelse(rheumatologic_1==1, 1, 0))
df_2 <- transform(df_2, wt_aids_1= ifelse(aids_1==1, 6, 0))
df_2 <- transform(df_2, wt_malig_1= ifelse(malig_1==1, 2, 0))
df_2 <- transform(df_2, wt_tumor_1= ifelse(tumor_1==1, 6, 0))

df_2$wt_AMI_1[is.na(df_2$wt_AMI_1)] <- 0
df_2$wt_AMI_history_1[is.na(df_2$wt_AMI_history_1)] <- 0
df_2$wt_CHF_1[is.na(df_2$wt_CHF_1)] <- 0
df_2$wt_PVD_1[is.na(df_2$wt_PVD_1)] <- 0
df_2$wt_CVD_1[is.na(df_2$wt_CVD_1)] <- 0
df_2$wt_COPD_1[is.na(df_2$wt_COPD_1)] <- 0
df_2$wt_dementia_1[is.na(df_2$wt_dementia_1)] <- 0
df_2$wt_paralysis_1[is.na(df_2$wt_paralysis_1)] <- 0
df_2$wt_diabetes_1[is.na(df_2$wt_diabetes_1)] <- 0
df_2$wt_diabetes_w_comp_1[is.na(df_2$wt_diabetes_w_comp_1)] <- 0
df_2$wt_renal_1[is.na(df_2$wt_renal_1)] <- 0
df_2$wt_mild_liver_1[is.na(df_2$wt_mild_liver_1)] <- 0
df_2$wt_mod_sev_liver_1[is.na(df_2$wt_mod_sev_liver_1)] <- 0
df_2$wt_peptic_ulcer_1[is.na(df_2$wt_peptic_ulcer_1)] <- 0
df_2$wt_rheumatologic_1[is.na(df_2$wt_rheumatologic_1)] <- 0
df_2$wt_aids_1[is.na(df_2$wt_aids_1)] <- 0
df_2$wt_malig_1[is.na(df_2$wt_malig_1)] <- 0
df_2$wt_tumor_1[is.na(df_2$wt_tumor_1)] <- 0


#table(df_2['diag_1'])
#table(df['diag_3'])

#diag_2 
#reassign ?
levels(df_2$diag_2)[levels(df_2$diag_2)=="?"] <- "9999"

df_2$diag_2 <- as.numeric(levels(df_2$diag_2))[df_2$diag_2] 

df_2 <- transform(df_2, AMI_2= ifelse(diag_2==410, 1, 0))
df_2 <- transform(df_2, AMI_history_2= ifelse(diag_2==412, 1, 0))
df_2 <- transform(df_2, CHF_2= ifelse(diag_2==428, 1, 0))
df_2$PVD_2 <- ifelse(df_2$diag_2==411 | df_2$diag_2==443| df_2$diag_2==785, 1, 0)
df_2$CVD_2 <- ifelse( df_2$diag_2 >= 430 & df_2$diag_2 <= 438, 1, 0)
df_2$COPD_2 <- ifelse( (df_2$diag_2 >= 490 & df_2$diag_2 <= 496) | (df_2$diag_2 >= 500 & df_2$diag_2 <= 505), 1, 0)
df_2 <- transform(df_2, dementia_2= ifelse(diag_2==290, 1, 0))
df_2 <- transform(df_2, paralysis_2= ifelse(diag_2==342, 1, 0))
df_2$diabetes_2 <- ifelse(df_2$diag_2 == 250 | (df_2$diag_2 >= 250.01 & df_2$diag_2 <= 250.39) | df_2$diag_2 == 250.7, 1, 0)
df_2$diabetes_w_comp_2 <- ifelse((df_2$diag_2 >= 250.4 & df_2$diag_2 <= 250.69) | (df_2$diag_2 >= 250.8 & df_2$diag_2 == 250.99), 1, 0)
df_2$renal_2 <- ifelse((df_2$diag_2 >= 582 & df_2$diag_2 <= 583) | (df_2$diag_2 >= 585 & df_2$diag_2 <= 586) | df$diag_2==588, 1, 0)
df_2 <- transform(df_2, mild_liver_2= ifelse(diag_2==571, 1, 0))
df_2$mod_sev_liver_2 <- ifelse(df_2$diag_2 == 456 | df_2$diag_2 == 572, 1, 0)
df_2$peptic_ulcer_2 <- ifelse( df_2$diag_2 >= 531 & df_2$diag_2 == 534, 1, 0)
df_2$rheumatologic_2 <- ifelse(df_2$diag_2 == 710 | df_2$diag_2 == 714, 1, 0)
df_2$aids_2 <- ifelse(df_2$diag_2 == 42 | df_2$diag_2 == 43 | df_2$diag_2 == 44, 1, 0)
df_2$malig_2 <- ifelse((df_2$diag_2 >= 140 & df_2$diag_2 <= 172) | (df_2$diag_2 >= 174 & df_2$diag_2 <= 195) |
                      (df_2$diag_2 >= 200 & df_2$diag_2 <= 208), 1, 0)
df_2$tumor_2 <- ifelse((df_2$diag_2 >= 196 & df_2$diag_2 <= 199), 1, 0)

df_2 <- transform(df_2, wt_AMI_2= ifelse(AMI_2==1, 1, 0))
df_2 <- transform(df_2, wt_AMI_history_2= ifelse(AMI_history_2==1, 1, 0))
df_2 <- transform(df_2, wt_CHF_2= ifelse(CHF_2==1, 1, 0))
df_2 <- transform(df_2, wt_PVD_2= ifelse(PVD_2==1, 1, 0))
df_2 <- transform(df_2, wt_CVD_2= ifelse(PVD_2==1, 1, 0))
df_2 <- transform(df_2, wt_COPD_2= ifelse(COPD_2==1, 1, 0))
df_2 <- transform(df_2, wt_dementia_2= ifelse(dementia_2==1, 1, 0))
df_2 <- transform(df_2, wt_paralysis_2= ifelse(paralysis_2==1, 2, 0))
df_2 <- transform(df_2, wt_diabetes_2= ifelse(diabetes_2==1, 1, 0))
df_2 <- transform(df_2, wt_diabetes_w_comp_2= ifelse(diabetes_w_comp_2==1, 2, 0))
df_2 <- transform(df_2, wt_renal_2= ifelse(renal_2==1, 2, 0))
df_2 <- transform(df_2, wt_mild_liver_2= ifelse(mild_liver_2==1, 1, 0))
df_2 <- transform(df_2, wt_mod_sev_liver_2= ifelse(mod_sev_liver_2==1, 3, 0))
df_2 <- transform(df_2, wt_peptic_ulcer_2= ifelse(peptic_ulcer_2==1, 1, 0))
df_2 <- transform(df_2, wt_rheumatologic_2= ifelse(rheumatologic_2==1, 1, 0))
df_2 <- transform(df_2, wt_aids_2= ifelse(aids_2==1, 6, 0))
df_2 <- transform(df_2, wt_malig_2= ifelse(malig_2==1, 2, 0))
df_2 <- transform(df_2, wt_tumor_2= ifelse(tumor_2==1, 6, 0))

df_2$wt_AMI_2[is.na(df_2$wt_AMI_2)] <- 0
df_2$wt_AMI_history_2[is.na(df_2$wt_AMI_history_2)] <- 0
df_2$wt_CHF_2[is.na(df_2$wt_CHF_2)] <- 0
df_2$wt_PVD_2[is.na(df_2$wt_PVD_2)] <- 0
df_2$wt_CVD_2[is.na(df_2$wt_CVD_2)] <- 0
df_2$wt_COPD_2[is.na(df_2$wt_COPD_2)] <- 0
df_2$wt_dementia_2[is.na(df_2$wt_dementia_2)] <- 0
df_2$wt_paralysis_2[is.na(df_2$wt_paralysis_2)] <- 0
df_2$wt_diabetes_2[is.na(df_2$wt_diabetes_2)] <- 0
df_2$wt_diabetes_w_comp_2[is.na(df_2$wt_diabetes_w_comp_2)] <- 0
df_2$wt_renal_2[is.na(df_2$wt_renal_2)] <- 0
df_2$wt_mild_liver_2[is.na(df_2$wt_mild_liver_2)] <- 0
df_2$wt_mod_sev_liver_2[is.na(df_2$wt_mod_sev_liver_2)] <- 0
df_2$wt_peptic_ulcer_2[is.na(df_2$wt_peptic_ulcer_2)] <- 0
df_2$wt_rheumatologic_2[is.na(df_2$wt_rheumatologic_2)] <- 0
df_2$wt_aids_2[is.na(df_2$wt_aids_2)] <- 0
df_2$wt_malig_2[is.na(df_2$wt_malig_2)] <- 0
df_2$wt_tumor_2[is.na(df_2$wt_tumor_2)] <- 0
#table(df_2['diag_2'])

#diag_3
#reassign ?
levels(df_2$diag_3)[levels(df_2$diag_3)=="?"] <- "9999"

#convert to numeric
df_2$diag_3 <- as.numeric(levels(df_2$diag_3))[df_2$diag_3] 

df_2 <- transform(df_2, AMI_3= ifelse(diag_3==410, 1, 0))
df_2 <- transform(df_2, AMI_history_3= ifelse(diag_3==412, 1, 0))
df_2 <- transform(df_2, CHF_3= ifelse(diag_3==428, 1, 0))
df_2$PVD_3 <- ifelse(df_2$diag_3==411 | df_2$diag_3==443| df_2$diag_3==785, 1, 0)
df_2$CVD_3 <- ifelse( df_2$diag_3 >= 430 & df_2$diag_3 <= 438, 1, 0)
df_2$COPD_3 <- ifelse( (df_2$diag_3 >= 490 & df_2$diag_3 <= 496) | (df_2$diag_3 >= 500 & df_2$diag_3 <= 505), 1, 0)
df_2 <- transform(df_2, dementia_3= ifelse(diag_3==290, 1, 0))
df_2 <- transform(df_2, paralysis_3= ifelse(diag_3==342, 1, 0))
df_2$diabetes_3 <- ifelse(df_2$diag_3 == 250 | (df_2$diag_3 >= 250.01 & df_2$diag_3 <= 250.39) | df_2$diag_3 == 250.7, 1, 0)
df_2$diabetes_w_comp_3 <- ifelse((df_2$diag_3 >= 250.4 & df_2$diag_3 <= 250.69) | (df_2$diag_3 >= 250.8 & df_2$diag_3 == 250.99), 1, 0)
df_2$renal_3 <- ifelse((df_2$diag_3 >= 582 & df_2$diag_3 <= 583) | (df_2$diag_3 >= 585 & df_2$diag_3 <= 586) | df$diag_3==588, 1, 0)
df_2 <- transform(df_2, mild_liver_3= ifelse(diag_3==571, 1, 0))
df_2$mod_sev_liver_3 <- ifelse(df_2$diag_3 == 456 | df_2$diag_3 == 572, 1, 0)
df_2$peptic_ulcer_3 <- ifelse( df_2$diag_3 >= 531 & df_2$diag_3 == 534, 1, 0)
df_2$rheumatologic_3 <- ifelse(df_2$diag_3 == 710 | df_2$diag_3 == 714, 1, 0)
df_2$aids_3 <- ifelse(df_2$diag_3 == 42 | df_2$diag_3 == 43 | df_2$diag_3 == 44, 1, 0)
df_2$malig_3 <- ifelse((df_2$diag_3 >= 140 & df_2$diag_3 <= 172) | (df_2$diag_3 >= 174 & df_2$diag_3 <= 195) |
                      (df_2$diag_3 >= 200 & df_2$diag_3 <= 208), 1, 0)
df_2$tumor_3 <- ifelse((df_2$diag_3 >= 196 & df_2$diag_3 <= 199), 1, 0)

df_2 <- transform(df_2, wt_AMI_3= ifelse(AMI_3==1, 1, 0))
df_2 <- transform(df_2, wt_AMI_history_3= ifelse(AMI_history_3==1, 1, 0))
df_2 <- transform(df_2, wt_CHF_3= ifelse(CHF_3==1, 1, 0))
df_2 <- transform(df_2, wt_PVD_3= ifelse(PVD_3==1, 1, 0))
df_2 <- transform(df_2, wt_CVD_3= ifelse(PVD_3==1, 1, 0))
df_2 <- transform(df_2, wt_COPD_3= ifelse(COPD_3==1, 1, 0))
df_2 <- transform(df_2, wt_dementia_3= ifelse(dementia_3==1, 1, 0))
df_2 <- transform(df_2, wt_paralysis_3= ifelse(paralysis_3==1, 2, 0))
df_2 <- transform(df_2, wt_diabetes_3= ifelse(diabetes_3==1, 1, 0))
df_2 <- transform(df_2, wt_diabetes_w_comp_3= ifelse(diabetes_w_comp_3==1, 2, 0))
df_2 <- transform(df_2, wt_renal_3= ifelse(renal_3==1, 2, 0))
df_2 <- transform(df_2, wt_mild_liver_3= ifelse(mild_liver_3==1, 1, 0))
df_2 <- transform(df_2, wt_mod_sev_liver_3= ifelse(mod_sev_liver_3==1, 3, 0))
df_2 <- transform(df_2, wt_peptic_ulcer_3= ifelse(peptic_ulcer_3==1, 1, 0))
df_2 <- transform(df_2, wt_rheumatologic_3= ifelse(rheumatologic_3==1, 1, 0))
df_2 <- transform(df_2, wt_aids_3= ifelse(aids_3==1, 6, 0))
df_2 <- transform(df_2, wt_malig_3= ifelse(malig_3==1, 2, 0))
df_2 <- transform(df_2, wt_tumor_3= ifelse(tumor_3==1, 6, 0))

df_2$wt_AMI_3[is.na(df_2$wt_AMI_3)] <- 0
df_2$wt_AMI_history_3[is.na(df_2$wt_AMI_history_3)] <- 0
df_2$wt_CHF_3[is.na(df_2$wt_CHF_3)] <- 0
df_2$wt_PVD_3[is.na(df_2$wt_PVD_3)] <- 0
df_2$wt_CVD_3[is.na(df_2$wt_CVD_3)] <- 0
df_2$wt_COPD_3[is.na(df_2$wt_COPD_3)] <- 0
df_2$wt_dementia_3[is.na(df_2$wt_dementia_3)] <- 0
df_2$wt_paralysis_3[is.na(df_2$wt_paralysis_3)] <- 0
df_2$wt_diabetes_3[is.na(df_2$wt_diabetes_3)] <- 0
df_2$wt_diabetes_w_comp_3[is.na(df_2$wt_diabetes_w_comp_3)] <- 0
df_2$wt_renal_3[is.na(df_2$wt_renal_3)] <- 0
df_2$wt_mild_liver_3[is.na(df_2$wt_mild_liver_3)] <- 0
df_2$wt_mod_sev_liver_3[is.na(df_2$wt_mod_sev_liver_3)] <- 0
df_2$wt_peptic_ulcer_3[is.na(df_2$wt_peptic_ulcer_3)] <- 0
df_2$wt_rheumatologic_3[is.na(df_2$wt_rheumatologic_3)] <- 0
df_2$wt_aids_3[is.na(df_2$wt_aids_3)] <- 0
df_2$wt_malig_3[is.na(df_2$wt_malig_3)] <- 0
df_2$wt_tumor_3[is.na(df_2$wt_tumor_3)] <- 0


#Create Charlson score
df_2$CCI <- df_2$wt_AMI_1 + df_2$wt_AMI_history_1 + df_2$wt_CHF_1 + df_2$wt_PVD_1 + df_2$wt_CVD_1 + df_2$wt_COPD_1 + 
            df_2$wt_dementia_1 + df_2$wt_paralysis_1 + df_2$wt_diabetes_1 + df_2$wt_diabetes_w_comp_1 + df_2$wt_renal_1 + 
            df_2$wt_mild_liver_1 + df_2$wt_mod_sev_liver_1 + df_2$wt_peptic_ulcer_1 + df_2$wt_rheumatologic_1 + df_2$wt_aids_1 +
            df_2$wt_AMI_2 + df_2$wt_AMI_history_2 + df_2$wt_CHF_2 + df_2$wt_PVD_2 + df_2$wt_CVD_2 + df_2$wt_COPD_2 + 
            df_2$wt_dementia_2 + df_2$wt_paralysis_2 + df_2$wt_diabetes_2 + df_2$wt_diabetes_w_comp_2 + df_2$wt_renal_2 + 
            df_2$wt_mild_liver_2 + df_2$wt_mod_sev_liver_2 + df_2$wt_peptic_ulcer_2 + df_2$wt_rheumatologic_2 + df_2$wt_aids_2 +
            df_2$wt_AMI_3 + df_2$wt_AMI_history_3 + df_2$wt_CHF_3 + df_2$wt_PVD_3 + df_2$wt_CVD_3 + df_2$wt_COPD_3 + 
            df_2$wt_dementia_3 + df_2$wt_paralysis_3 + df_2$wt_diabetes_3 + df_2$wt_diabetes_w_comp_3 + df_2$wt_renal_3 + 
            df_2$wt_mild_liver_3 + df_2$wt_mod_sev_liver_3 + df_2$wt_peptic_ulcer_3 + df_2$wt_rheumatologic_3 + df_2$wt_aids_3 + df_2$wt_malig_1 + df_2$wt_tumor_1 + df_2$wt_malig_2 + df_2$wt_tumor_2 + df_2$wt_malig_3 + df_2$wt_tumor_3  


#table(df_2['diag_3'])

#max_glu_serum
df_2 <- df_2 %>% mutate(max_glu_serum=recode(max_glu_serum,
'None'=0,
'Norm'=1,
'>200'=2,
'>300'=3))

#A1Cresult
df_2 <- df_2 %>% mutate(A1Cresult=recode(A1Cresult,
'None'=0,
'Norm'=1,
'>7'=2,
'>8'=3))

#metformin:metformin.pioglitazone
df_2 <- df_2 %>% 
  mutate_at(vars(metformin: metformin.pioglitazone), recode, 'No'=0,'Steady'=1,'Up'=1,'Down'=1)

#df_2 <- df_2 %>% 
  #mutate_at(vars(metformin: metformin.pioglitazone), recode, 'No'=0,'Steady'=1,'Up'=2,'Down'=3)

#change
df_2 <- df_2 %>% mutate(change=recode(change,
'Ch' = 1,
'No' = 0))

#diabetesMed
df_2 <- df_2 %>% mutate(diabetesMed=recode(diabetesMed,
'Yes' = 1,
'No' = 0))

#convert to factors
#************************
df_2$race <- as.factor(df_2$race)
df_2$gender <- as.factor(df_2$gender)
df_2$admission_type_id <- as.factor(df_2$admission_type_id)
df_2$discharge_disposition_id <- as.factor(df_2$discharge_disposition_id)
df_2$admission_source_id <- as.factor(df_2$admission_source_id)
df_2$payer_code <- as.factor(df_2$payer_code)
df_2$medical_specialty <- as.factor(df_2$medical_specialty)
df_2$diag_1 <- as.factor(df_2$diag_1)
df_2$diag_2 <- as.factor(df_2$diag_2)
df_2$diag_3 <- as.factor(df_2$diag_3)
df_2$max_glu_serum <- as.factor(df_2$max_glu_serum)
df_2$A1Cresult <- as.factor(df_2$A1Cresult)
df_2$metformin <- as.factor(df_2$metformin)
df_2$repaglinide <- as.factor(df_2$repaglinide)
df_2$nateglinide <- as.factor(df_2$nateglinide)
df_2$chlorpropamide <- as.factor(df_2$chlorpropamide)
df_2$glimepiride <- as.factor(df_2$glimepiride)
df_2$acetohexamide <- as.factor(df_2$acetohexamide)
df_2$glipizide <- as.factor(df_2$glipizide)
df_2$glyburide <- as.factor(df_2$glyburide)
df_2$tolbutamide <- as.factor(df_2$tolbutamide)
df_2$pioglitazone <- as.factor(df_2$pioglitazone)
df_2$rosiglitazone <- as.factor(df_2$rosiglitazone)
df_2$acarbose <- as.factor(df_2$acarbose)
df_2$miglitol <- as.factor(df_2$miglitol)
df_2$troglitazone <- as.factor(df_2$troglitazone)
df_2$tolazamide <- as.factor(df_2$tolazamide)
df_2$examide <- as.factor(df_2$examide)
df_2$citoglipton <- as.factor(df_2$citoglipton)
df_2$insulin <- as.factor(df_2$insulin)
df_2$glyburide.metformin <- as.factor(df_2$glyburide.metformin)
df_2$glipizide.metformin <- as.factor(df_2$glipizide.metformin)
df_2$glimepiride.pioglitazone <- as.factor(df_2$glimepiride.pioglitazone)
df_2$metformin.rosiglitazone <- as.factor(df_2$metformin.rosiglitazone)
df_2$metformin.pioglitazone <- as.factor(df_2$metformin.pioglitazone)
df_2$change <- as.factor(df_2$change)
df_2$diabetesMed <- as.factor(df_2$diabetesMed)
df_2$readmitted <- as.factor(df_2$readmitted)

#drop one level factors and unneeded columns
dev_df <- subset(df_2, select=-c(weight,encounter_id, patient_nbr, examide, citoglipton,
                                 diag_1, diag_2, diag_3,
                                 wt_AMI_1 , wt_AMI_history_1 , wt_CHF_1 , wt_PVD_1 , wt_CVD_1 , wt_COPD_1 , 
                                 wt_dementia_1 , wt_paralysis_1 , wt_diabetes_1 , wt_diabetes_w_comp_1 , wt_renal_1 , 
                                 wt_mild_liver_1 , wt_mod_sev_liver_1 , wt_peptic_ulcer_1 , wt_rheumatologic_1 , wt_aids_1 ,
                                 wt_AMI_2 , wt_AMI_history_2 , wt_CHF_2 , wt_PVD_2 , wt_CVD_2 , wt_COPD_2 , 
                                 wt_dementia_2 , wt_paralysis_2 , wt_diabetes_2 , wt_diabetes_w_comp_2 , wt_renal_2 , 
                                 wt_mild_liver_2 , wt_mod_sev_liver_2 , wt_peptic_ulcer_2 , wt_rheumatologic_2 , wt_aids_2 ,
                                 wt_AMI_3 , wt_AMI_history_3 , wt_CHF_3 , wt_PVD_3 , wt_CVD_3 , wt_COPD_3 , 
                                 wt_dementia_3 , wt_paralysis_3 , wt_diabetes_3 , wt_diabetes_w_comp_3 , wt_renal_3 , 
                                 wt_mild_liver_3 , wt_mod_sev_liver_3 , wt_peptic_ulcer_3 , wt_rheumatologic_3 , wt_aids_3,
                                 AMI_1, AMI_history_1, CHF_1,PVD_1,CVD_1,COPD_1,dementia_1,paralysis_1,diabetes_1,
                                 diabetes_w_comp_1,renal_1, mild_liver_1,mod_sev_liver_1,peptic_ulcer_1,rheumatologic_1,aids_1,
                                 AMI_2, AMI_history_2, CHF_2,PVD_2,CVD_2,COPD_2,dementia_2,paralysis_2,diabetes_2,
                                 diabetes_w_comp_2,renal_2, mild_liver_2,mod_sev_liver_2,peptic_ulcer_2,rheumatologic_2,aids_2,
                                 AMI_3, AMI_history_3, CHF_3,PVD_3,CVD_3,COPD_3,dementia_3,paralysis_3,diabetes_3,
                                 diabetes_w_comp_3,renal_3, mild_liver_3,mod_sev_liver_3,peptic_ulcer_3,rheumatologic_3,aids_3, malig_1, malig_2, malig_3, wt_malig_1,
                                 wt_malig_2, wt_malig_3, tumor_1, tumor_2, tumor_3, wt_tumor_1, wt_tumor_2,
                                 wt_tumor_3))

Exploratory Data Analysis

Univariate Analysis

ggplot(dev_df, aes(x=readmitted, y=age, fill=readmitted)) + 
    geom_boxplot()

ggplot(dev_df, aes(x=readmitted, y=time_in_hospital, fill=readmitted)) + 
    geom_boxplot()

ggplot(dev_df, aes(x=readmitted, y=num_lab_procedures, fill=readmitted)) + 
    geom_boxplot()

ggplot(dev_df, aes(x=readmitted, y=num_procedures, fill=readmitted)) + 
    geom_boxplot()

ggplot(dev_df, aes(x=readmitted, y=num_medications, fill=readmitted)) + 
    geom_boxplot()

ggplot(dev_df, aes(x=readmitted, y=number_outpatient, fill=readmitted)) + 
    geom_boxplot()

ggplot(dev_df, aes(x=readmitted, y=number_emergency, fill=readmitted)) + 
    geom_boxplot()

ggplot(dev_df, aes(x=readmitted, y=number_inpatient, fill=readmitted)) + 
    geom_boxplot()

ggplot(dev_df, aes(x=readmitted, y=number_diagnoses, fill=readmitted)) + 
    geom_boxplot()

ggplot(dev_df, aes(x=readmitted, y=CCI, fill=readmitted)) + 
    geom_boxplot()

From the box plots, it is evident that age and number of diagnoses are left skewed. On the other hand, number of lab procedures, number of medications, number of procedures, number of emergency/inpatient/outpatient visits, total time in hospital, and CCI are right skewed. Number of inpatient visits, number of diagnoses, and CCI appear to impact the target variable the most.

#categorical variables
cat_df <- dev_df %>% select_if(~class(.) == 'factor')

marrangeGrob(
  map(
    names(cat_df), 
    ~ ggplot(dev_df, aes_string(.x)) + 
      geom_bar() + geom_bar(aes(fill = readmitted))  +
  theme(legend.position="none")
  ),
  ncol = 4,
  nrow = 4,
  top = "Variable Distribution"
)

#table(dev_df$readmitted)
#11314/88029 #12.9%

From the bar plots it is evident that there is a class imbalance as there are many more cases of not being readmitted vs being readmitted. In addition, many of the medication variables have few unique values and a large ratio of frequency of the most common value to the frequency of the other value. We will remove variables with these characteristics since they will not be useful for prediction.

nearZeroVar(cat_df, freqCut = 95/5, uniqueCut = 10, saveMetrics = FALSE,
  names = TRUE, allowParallel = TRUE)
##  [1] "max_glu_serum"            "repaglinide"             
##  [3] "nateglinide"              "chlorpropamide"          
##  [5] "acetohexamide"            "tolbutamide"             
##  [7] "acarbose"                 "miglitol"                
##  [9] "troglitazone"             "tolazamide"              
## [11] "glyburide.metformin"      "glipizide.metformin"     
## [13] "glimepiride.pioglitazone" "metformin.rosiglitazone" 
## [15] "metformin.pioglitazone"

The above variables have near zero variance so we will remove them from the modeling data.

Bi-variate Analysis

#age and payer code
payer_names <- list(
  '0'="Missing",
  '1'="Medicare",
  '2'= "Other",
  '3'= "Self_Pay"
)

payer_labeller <- function(variable,value){
  return(payer_names[value])
}

ggplot(dev_df, aes(age, fill=readmitted)) +
  geom_bar(position = "fill") +
  facet_wrap(~payer_code, ncol=2, scales="free", labeller=payer_labeller) +
  scale_y_continuous() +
  ylab("Percent")

#race and gender
race_names <- list(
  '0'="Caucasian",
  '1'="Non-caucasian"
)

race_labeller <- function(variable,value){
  return(race_names[value])
}

ggplot(dev_df, aes(gender, fill=readmitted)) +
  geom_bar(position = "fill") +
  facet_wrap(~race, ncol=2, scales="free", labeller=race_labeller) +
  scale_y_continuous() +
  ylab("Percent")

#payer code and discharge disposition id
ggplot(dev_df, aes(discharge_disposition_id, fill=readmitted)) +
  geom_bar(position = "fill") +
  facet_wrap(~payer_code, ncol=2, scales="free", labeller=payer_labeller) +
  scale_y_continuous() +
  ylab("Percent")

#time in hospital and diabetesmed
ggplot(dev_df, aes(time_in_hospital, fill=readmitted)) +
  geom_bar(position = "fill") +
  facet_wrap(~diabetesMed, ncol=2, scales="free") +
  scale_y_continuous() +
  ylab("Percent")

#num meds and time in hospital
ggplot(dev_df, aes(x=num_medications, y=time_in_hospital, color=readmitted)) +
  geom_point()

Investigating the relationship between age and payer code, it appears that 25 year old patients using Medicare have a higher chance of being readmitted. It appears that race and gender do not have a large impact on likelihood of being readmitted. Patients using Medicare, self-pay, or have missing payer code are more likely to be readmitted if they were discharged, transferred, or still a patient. Patients using other health care insurance were more likely to be readmitted of they left the hospital against medical advice. There does not appear to be a relationship between time in hospital and whether a diabetes medication was prescribed. It is evident that there is a direct relationship between number of medications and time in hospital.

Correlation Between Predictors

Numerical

num_df <- select_if(dev_df, is.numeric)

#correlation matrix
data_cor = cor(num_df)
corrplot(data_cor, type = "lower")

Based on the correlation matrix, it is evident that there is low correlation between numerical predictors.

Categorical

Some categorical predictors such as diag_1 and medical_specialy are potentially correlated since they provide similar information. We will use Cramer’s V to test the strength of association between two nominal predictors. It ranges from 0 to 1 where 0 indicates no association between the two variables and 1 indicates a perfect association between the two variables.

cramersV(dev_df$diabetesMed, dev_df$insulin) #0.5802898
## [1] 0.5802898
cramersV(dev_df$diabetesMed, dev_df$metformin) #0.2712807
## [1] 0.2712807
#Final modeling data set
#drop diabetes medications (0 variance variables) and correlated variables
dev_df_final <- subset(dev_df, select=-c(metformin, repaglinide, nateglinide, chlorpropamide,
                                         glimepiride, acetohexamide, glipizide, glyburide,
                                         tolbutamide,  pioglitazone, rosiglitazone,
                                         acarbose, miglitol, troglitazone, tolazamide, insulin,
                                         glyburide.metformin, glipizide.metformin, glimepiride.pioglitazone,
                                         metformin.rosiglitazone, metformin.pioglitazone, rosiglitazone,
                                         medical_specialty, max_glu_serum, A1Cresult
                                         ))

Based on the results, dibetesMed and insulin have a strong association and dibetesMed and metformin have a moderate association SincediabetesMed and the other diabetic medications such as insulin probably have strong associations and to limit the number of predictors in our models we will keep diabetesMed and drop the other 22 anti-diabetic medications. In addition, since medical_specialty and diag_1/2/3 probably have an association we will drop medical_specialty since it has a greater percentage of missing values.

Models

Under-Sampling

Readmission (target variable = 1) rate is about 12% in the development data. Since our data is imbalanced and contains categorical predictors we will test both under and over random under-sampling to create a balanced data set. Let’s start with under-sampling first.

Logistic Regression

#table(dev_df_final$readmitted)

n_readmitted <- 11314
new_n_total <- n_readmitted/0.5

undersampled_df <- ovun.sample(readmitted ~ ., data=dev_df_final, method="under", N=new_n_total, seed=123)

under_df <- undersampled_df$data
#table(under_df$readmitted)

In our baseline model, we will keep all possible predictors.

set.seed(1234)
indexes_under = createDataPartition(under_df$readmitted, p = .7, list = F)
train_under = under_df[indexes_under, ]
test_under = under_df[-indexes_under, ]

# Training model
lr_under <- glm(readmitted ~ ., 
                      data = train_under, 
                      family = "binomial")
   
# Summary
summary(lr_under)
## 
## Call:
## glm(formula = readmitted ~ ., family = "binomial", data = train_under)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2957  -1.0828  -0.3372   1.1436   1.7137  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.2198946  0.1434747  -8.503  < 2e-16 ***
## race1                     -0.0398364  0.0405898  -0.981 0.326377    
## gender1                   -0.0280484  0.0333856  -0.840 0.400833    
## age                        0.0027816  0.0012233   2.274 0.022972 *  
## admission_type_id1        -0.0740245  0.0863268  -0.857 0.391173    
## admission_type_id2        -0.0866521  1.4204128  -0.061 0.951355    
## admission_type_id3        -0.0150404  0.0764383  -0.197 0.844011    
## discharge_disposition_id1 -0.2231381  0.0807307  -2.764 0.005710 ** 
## discharge_disposition_id2  0.2705246  0.0824316   3.282 0.001031 ** 
## discharge_disposition_id3  0.1316901  0.2123430   0.620 0.535142    
## admission_source_id1       0.2160858  0.1143299   1.890 0.058755 .  
## admission_source_id2       0.1675115  0.0929713   1.802 0.071584 .  
## admission_source_id3       0.1835964  0.0930896   1.972 0.048580 *  
## time_in_hospital           0.0039724  0.0066233   0.600 0.548663    
## payer_code1               -0.1348853  0.0434050  -3.108 0.001886 ** 
## payer_code2               -0.1792118  0.0469923  -3.814 0.000137 ***
## payer_code3               -0.0434589  0.0814418  -0.534 0.593605    
## num_lab_procedures         0.0009287  0.0009577   0.970 0.332225    
## num_procedures            -0.0094872  0.0111462  -0.851 0.394679    
## num_medications            0.0031718  0.0026573   1.194 0.232628    
## number_outpatient         -0.0064254  0.0139288  -0.461 0.644582    
## number_emergency           0.1036519  0.0232970   4.449 8.62e-06 ***
## number_inpatient           0.2842894  0.0139855  20.327  < 2e-16 ***
## number_diagnoses           0.0606576  0.0097549   6.218 5.03e-10 ***
## change1                    0.0182188  0.0391377   0.466 0.641571    
## diabetesMed1               0.1604345  0.0465866   3.444 0.000574 ***
## CCI                        0.0677853  0.0110828   6.116 9.58e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21959  on 15839  degrees of freedom
## Residual deviance: 20786  on 15813  degrees of freedom
## AIC: 20840
## 
## Number of Fisher Scoring iterations: 4

From the summary, race, gender, admission_type_id, time_in_hospital, num_lab_procedures, num_procedures, number_outpatient, and change are not significant to the model. We will drop these predictors in the final model.

# Training model
lr_under_2 <- glm(readmitted ~ age + discharge_disposition_id + +admission_source_id + payer_code 
                      + number_emergency + number_inpatient + 
                      number_diagnoses + diabetesMed + CCI,
                      data = train_under, 
                      family = "binomial")
   
# Summary
summary(lr_under_2)
## 
## Call:
## glm(formula = readmitted ~ age + discharge_disposition_id + +admission_source_id + 
##     payer_code + number_emergency + number_inpatient + number_diagnoses + 
##     diabetesMed + CCI, family = "binomial", data = train_under)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2755  -1.0818  -0.3346   1.1426   1.6837  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.208071   0.138684  -8.711  < 2e-16 ***
## age                        0.002738   0.001202   2.277 0.022759 *  
## discharge_disposition_id1 -0.226368   0.080062  -2.827 0.004693 ** 
## discharge_disposition_id2  0.283686   0.082078   3.456 0.000548 ***
## discharge_disposition_id3  0.130082   0.211824   0.614 0.539147    
## admission_source_id1       0.199098   0.093768   2.123 0.033728 *  
## admission_source_id2       0.165189   0.068108   2.425 0.015292 *  
## admission_source_id3       0.146587   0.071112   2.061 0.039269 *  
## payer_code1               -0.139390   0.042478  -3.281 0.001033 ** 
## payer_code2               -0.186422   0.046184  -4.036 5.43e-05 ***
## payer_code3               -0.055246   0.080112  -0.690 0.490443    
## number_emergency           0.100059   0.023120   4.328 1.51e-05 ***
## number_inpatient           0.285308   0.013879  20.556  < 2e-16 ***
## number_diagnoses           0.066087   0.009289   7.114 1.12e-12 ***
## diabetesMed1               0.183857   0.040527   4.537 5.72e-06 ***
## CCI                        0.068926   0.011061   6.231 4.63e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21959  on 15839  degrees of freedom
## Residual deviance: 20795  on 15824  degrees of freedom
## AIC: 20827
## 
## Number of Fisher Scoring iterations: 4
# Predict test data based on model
predict_lr_under_2 <- predict(lr_under_2, 
                       test_under, type = "response")

# Changing probabilities
predict_lr_under_2 <- ifelse(predict_lr_under_2 >0.5, 1, 0)
   
# Evaluating model accuracy
# using confusion matrix
#table(test_under$readmitted, predict_lr_under_2)
confusionMatrix(as.factor(predict_lr_under_2), test_under$readmitted, positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2188 1475
##          1 1206 1919
##                                           
##                Accuracy : 0.605           
##                  95% CI : (0.5933, 0.6167)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2101          
##                                           
##  Mcnemar's Test P-Value : 2.268e-07       
##                                           
##             Sensitivity : 0.5654          
##             Specificity : 0.6447          
##          Pos Pred Value : 0.6141          
##          Neg Pred Value : 0.5973          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2827          
##    Detection Prevalence : 0.4604          
##       Balanced Accuracy : 0.6050          
##                                           
##        'Positive' Class : 1               
## 
#p <- predict(lr_under_2, test_under, type = 'response')

# ROC-AUC Curve
ROCPred <- prediction(predict_lr_under_2, test_under$readmitted) 
ROCPer <- performance(ROCPred, measure = "tpr", 
                             x.measure = "fpr")
   
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
#auc #0.6050383

# Plotting curve
plot(ROCPer, main = "Logistic Regression ROC AUC - Under-sampling",col = 'blue', lwd = 2)
abline(a = 0,b = 1,lwd = 2,lty = 2)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)

Our final logistic regression model built from under-sampled data has AUC of 60.5, sensitivity of 56.5, and specificity of 64.5.

Decision Tree

accuracy_tune <- function(fit) {
    predict_over <- predict(fit, test_under, type = 'class')
    table_mat <- table(test_under$readmitted, predict_over)
    accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
    accuracy_Test
}

#control <- rpart.control(minsplit = 100,
 #   minbucket = round(100 / 3),
#    maxdepth = 30,
 #   cp = 0)

#tune_fit <- rpart(readmitted~., data = train_under, method = 'class', control = control)
#accuracy_tune(tune_fit) #0.5767531

control <- rpart.control(minsplit = 100,
    minbucket = round(100 / 4),
    maxdepth = 10,
    cp = 0.001)

tune_fit_4_over <- rpart(readmitted~., data = train_under, method = 'class', control = control)
#accuracy_tune(tune_fit_4_over) #0.604449

printcp(tune_fit_4_over)
## 
## Classification tree:
## rpart(formula = readmitted ~ ., data = train_under, method = "class", 
##     control = control)
## 
## Variables actually used in tree construction:
##  [1] admission_source_id      age                      CCI                     
##  [4] diabetesMed              discharge_disposition_id num_lab_procedures      
##  [7] num_medications          number_diagnoses         number_emergency        
## [10] number_inpatient         payer_code               time_in_hospital        
## 
## Root node error: 7920/15840 = 0.5
## 
## n= 15840 
## 
##           CP nsplit rel error  xerror      xstd
## 1  0.1904040      0   1.00000 1.01490 0.0079446
## 2  0.0267677      1   0.80960 0.80960 0.0078002
## 3  0.0042929      2   0.78283 0.78333 0.0077568
## 4  0.0037879      3   0.77854 0.78674 0.0077627
## 5  0.0022727      4   0.77475 0.78460 0.0077590
## 6  0.0018939      6   0.77020 0.78409 0.0077581
## 7  0.0017677      7   0.76831 0.78321 0.0077566
## 8  0.0014731      8   0.76654 0.78359 0.0077572
## 9  0.0013889     13   0.75821 0.78207 0.0077545
## 10 0.0012626     14   0.75682 0.78169 0.0077539
## 11 0.0011364     17   0.75227 0.78144 0.0077534
## 12 0.0010101     25   0.74306 0.77980 0.0077505
## 13 0.0010000     26   0.74205 0.77955 0.0077500
#prp(tune_fit_4_over,type=2,extra=106,nn = TRUE,branch=1,varlen=0,yesno=2)

#as.data.frame(tune_fit_4_over$variable.importance)

dt_varimport <- data.frame(imp = tune_fit_4_over$variable.importance)

dt_varimport_2 <- dt_varimport %>% 
  rownames_to_column() %>% 
  rename("variable" = rowname) %>% 
  arrange(imp) %>%
  mutate(variable = forcats::fct_inorder(variable))

ggplot(dt_varimport_2) +
  geom_col(aes(x = variable, y = imp),
            show.legend = F) +
  coord_flip() 

predicted <- predict(tune_fit_4_over, test_under, type="class")
#table(test_under$readmitted, predicted)
confusionMatrix(test_under$readmitted, predicted)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1849 1545
##          1 1140 2254
##                                           
##                Accuracy : 0.6044          
##                  95% CI : (0.5927, 0.6161)
##     No Information Rate : 0.5597          
##     P-Value [Acc > NIR] : 4.595e-14       
##                                           
##                   Kappa : 0.2089          
##                                           
##  Mcnemar's Test P-Value : 6.356e-15       
##                                           
##             Sensitivity : 0.6186          
##             Specificity : 0.5933          
##          Pos Pred Value : 0.5448          
##          Neg Pred Value : 0.6641          
##              Prevalence : 0.4403          
##          Detection Rate : 0.2724          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.6060          
##                                           
##        'Positive' Class : 0               
## 
pred <- prediction(predict(tune_fit_4_over, test_under, type="prob")[, 2], test_under$readmitted)

auc <- performance(pred, "auc")
auc <- auc@y.values[[1]]
#auc #0.6360124

plot(performance(pred, 'tpr', 'fpr'), col='blue', main='Decision Tree ROC AUC - Under-sampling')
abline(0, 1, lty=2)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)

After tuning and fitting, our final decision tree model using under-sampled data has AUC of 63.6, sensitivity of 61.9, and specificity of 59.3. The top 3 most important predictors are number_inpatient, discharge_disposition_id, and number_emergency.

Over-Sampling

n_not_readmitted <- 88029
new_n_total <- n_not_readmitted/0.5

oversampled_df <- ovun.sample(readmitted ~ ., data=dev_df_final, method="over", N=new_n_total, seed=123)

over_df <- oversampled_df$data
#table(over_df$readmitted)

Logistic Regression

Again in our baseline model we will keep all possible predictors.

indexes_over = createDataPartition(over_df$readmitted, p = .7, list = F)
train_over = over_df[indexes_over, ]
test_over = over_df[-indexes_over, ]

# Training model
lr_over <- glm(readmitted ~ ., 
                      data = train_over, 
                      family = "binomial")
   
# Summary
summary(lr_over)
## 
## Call:
## glm(formula = readmitted ~ ., family = "binomial", data = train_over)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5118  -1.0865  -0.3091   1.1483   1.6931  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.1142471  0.0517499 -21.531  < 2e-16 ***
## race1                      0.0039574  0.0145498   0.272 0.785628    
## gender1                   -0.0456644  0.0119697  -3.815 0.000136 ***
## age                        0.0021106  0.0004379   4.820 1.44e-06 ***
## admission_type_id1        -0.0151414  0.0305743  -0.495 0.620436    
## admission_type_id2         0.7677991  0.5781769   1.328 0.184189    
## admission_type_id3         0.0563767  0.0269691   2.090 0.036580 *  
## discharge_disposition_id1 -0.2332626  0.0289258  -8.064 7.37e-16 ***
## discharge_disposition_id2  0.2369942  0.0295786   8.012 1.13e-15 ***
## discharge_disposition_id3  0.1438469  0.0777047   1.851 0.064141 .  
## admission_source_id1       0.0528998  0.0407973   1.297 0.194751    
## admission_source_id2       0.1545198  0.0330516   4.675 2.94e-06 ***
## admission_source_id3       0.1551447  0.0331373   4.682 2.84e-06 ***
## time_in_hospital           0.0090983  0.0023902   3.806 0.000141 ***
## payer_code1               -0.1192046  0.0154763  -7.702 1.34e-14 ***
## payer_code2               -0.2171511  0.0168570 -12.882  < 2e-16 ***
## payer_code3               -0.0948955  0.0295048  -3.216 0.001299 ** 
## num_lab_procedures         0.0004336  0.0003444   1.259 0.208090    
## num_procedures            -0.0127553  0.0039835  -3.202 0.001365 ** 
## num_medications            0.0042848  0.0009525   4.498 6.85e-06 ***
## number_outpatient         -0.0042054  0.0047410  -0.887 0.375056    
## number_emergency           0.0894122  0.0077810  11.491  < 2e-16 ***
## number_inpatient           0.2876933  0.0049548  58.064  < 2e-16 ***
## number_diagnoses           0.0449196  0.0034864  12.884  < 2e-16 ***
## change1                   -0.0063233  0.0140351  -0.451 0.652321    
## diabetesMed1               0.1807648  0.0167038  10.822  < 2e-16 ***
## CCI                        0.0824808  0.0040210  20.512  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 170850  on 123241  degrees of freedom
## Residual deviance: 161676  on 123215  degrees of freedom
## AIC: 161730
## 
## Number of Fisher Scoring iterations: 4

The predictors race, num_lab_procedures, number_outpatient, and change are not significant so we will drop then in the final model.

# Training model
lr_over_2 <- glm(readmitted ~ gender + age + admission_type_id + discharge_disposition_id + 
                   admission_source_id + time_in_hospital + payer_code + num_procedures +
                   num_medications + number_emergency + number_inpatient +
                   number_diagnoses  + diabetesMed + CCI, 
                      data = train_over, 
                      family = "binomial")
   
# Summary
summary(lr_over_2)
## 
## Call:
## glm(formula = readmitted ~ gender + age + admission_type_id + 
##     discharge_disposition_id + admission_source_id + time_in_hospital + 
##     payer_code + num_procedures + num_medications + number_emergency + 
##     number_inpatient + number_diagnoses + diabetesMed + CCI, 
##     family = "binomial", data = train_over)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5062  -1.0867  -0.3112   1.1488   1.6912  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.1048737  0.0508794 -21.716  < 2e-16 ***
## gender1                   -0.0456124  0.0119437  -3.819 0.000134 ***
## age                        0.0020873  0.0004326   4.825 1.40e-06 ***
## admission_type_id1        -0.0180604  0.0302819  -0.596 0.550901    
## admission_type_id2         0.7684483  0.5781426   1.329 0.183793    
## admission_type_id3         0.0556922  0.0268855   2.071 0.038316 *  
## discharge_disposition_id1 -0.2318455  0.0288848  -8.027 1.00e-15 ***
## discharge_disposition_id2  0.2370022  0.0295508   8.020 1.06e-15 ***
## discharge_disposition_id3  0.1457847  0.0776799   1.877 0.060554 .  
## admission_source_id1       0.0620893  0.0403376   1.539 0.123746    
## admission_source_id2       0.1639098  0.0323086   5.073 3.91e-07 ***
## admission_source_id3       0.1607654  0.0328390   4.896 9.80e-07 ***
## time_in_hospital           0.0097914  0.0023361   4.191 2.77e-05 ***
## payer_code1               -0.1223920  0.0153155  -7.991 1.33e-15 ***
## payer_code2               -0.2205474  0.0166930 -13.212  < 2e-16 ***
## payer_code3               -0.0987262  0.0292680  -3.373 0.000743 ***
## num_procedures            -0.0126409  0.0039744  -3.181 0.001470 ** 
## num_medications            0.0044094  0.0009209   4.788 1.68e-06 ***
## number_emergency           0.0888016  0.0077497  11.459  < 2e-16 ***
## number_inpatient           0.2873435  0.0049384  58.186  < 2e-16 ***
## number_diagnoses           0.0448999  0.0034712  12.935  < 2e-16 ***
## diabetesMed1               0.1769868  0.0147853  11.970  < 2e-16 ***
## CCI                        0.0824999  0.0040202  20.521  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 170850  on 123241  degrees of freedom
## Residual deviance: 161679  on 123219  degrees of freedom
## AIC: 161725
## 
## Number of Fisher Scoring iterations: 4
# Predict test data based on model
predict_lr_over_2 <- predict(lr_over_2, 
                       test_over, type = "response")

# Changing probabilities
predict_lr_over_2 <- ifelse(predict_lr_over_2 >0.5, 1, 0)
   
# Evaluating model accuracy
# using confusion matrix
#table(test_over$readmitted, predict_lr_over_2)
confusionMatrix(as.factor(predict_lr_over_2), test_over$readmitted, positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 17420 11753
##          1  8988 14655
##                                           
##                Accuracy : 0.6073          
##                  95% CI : (0.6031, 0.6115)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2146          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5549          
##             Specificity : 0.6596          
##          Pos Pred Value : 0.6198          
##          Neg Pred Value : 0.5971          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2775          
##    Detection Prevalence : 0.4476          
##       Balanced Accuracy : 0.6073          
##                                           
##        'Positive' Class : 1               
## 
# ROC-AUC Curve
ROCPred <- prediction(predict_lr_over_2, test_over$readmitted) 
ROCPer <- performance(ROCPred, measure = "tpr", 
                             x.measure = "fpr")
   
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
#auc #0.607297
   
# Plotting curve
plot(ROCPer, main = "Logistic Regression ROC AUC - Over-sampling",col = 'blue', lwd = 2)
abline(a = 0,b = 1,lwd = 2,lty = 2)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)

The final logistic regression model using over-sampling has AUC of 60.7, sensitivity 55.5, and specificity 66.0.

Decision Tree

control <- rpart.control(minsplit = 100,
    minbucket = round(100 / 3),
    maxdepth = 30,
    cp = 0)

tune_fit <- rpart(readmitted~., data = train_over, method = 'class', control = control)
#accuracy_tune(tune_fit) #0.6946081

control <- rpart.control(minsplit = 50,
    minbucket = round(100/3),
    maxdepth = 20,
    cp = 0)

tune_fit_4 <- rpart(readmitted~., data = train_over, method = 'class', control = control)
#accuracy_tune(tune_fit_4) #0.7144962

printcp(tune_fit_4)
## 
## Classification tree:
## rpart(formula = readmitted ~ ., data = train_over, method = "class", 
##     control = control)
## 
## Variables actually used in tree construction:
##  [1] admission_source_id      admission_type_id        age                     
##  [4] CCI                      change                   diabetesMed             
##  [7] discharge_disposition_id gender                   num_lab_procedures      
## [10] num_medications          num_procedures           number_diagnoses        
## [13] number_emergency         number_inpatient         number_outpatient       
## [16] payer_code               race                     time_in_hospital        
## 
## Root node error: 61621/123242 = 0.5
## 
## n= 123242 
## 
##             CP nsplit rel error  xerror      xstd
## 1   1.8648e-01      0   1.00000 1.00729 0.0028485
## 2   2.1097e-02      1   0.81352 0.81805 0.0028010
## 3   3.5296e-03      2   0.79242 0.79315 0.0027869
## 4   2.9698e-03      4   0.78537 0.78700 0.0027832
## 5   1.7851e-03      5   0.78240 0.78408 0.0027813
## 6   1.5579e-03      6   0.78061 0.78397 0.0027813
## 7   1.4443e-03     12   0.76983 0.78264 0.0027804
## 8   1.0711e-03     13   0.76839 0.77480 0.0027754
## 9   9.8992e-04     14   0.76732 0.77219 0.0027736
## 10  9.8181e-04     15   0.76633 0.77191 0.0027734
## 11  9.5747e-04     17   0.76437 0.77178 0.0027734
## 12  9.4124e-04     18   0.76341 0.77177 0.0027733
## 13  8.7632e-04     19   0.76247 0.77131 0.0027730
## 14  6.3290e-04     20   0.76159 0.76800 0.0027708
## 15  6.1667e-04     21   0.76096 0.76604 0.0027695
## 16  5.8963e-04     24   0.75904 0.76508 0.0027688
## 17  5.4635e-04     28   0.75638 0.76372 0.0027679
## 18  5.4365e-04     35   0.75073 0.76227 0.0027669
## 19  5.0308e-04     37   0.74965 0.76127 0.0027662
## 20  4.7062e-04     40   0.74814 0.75742 0.0027634
## 21  4.4898e-04     42   0.74720 0.75619 0.0027626
## 22  4.4628e-04     55   0.74118 0.75562 0.0027622
## 23  4.3816e-04     58   0.73968 0.75523 0.0027619
## 24  4.0571e-04     59   0.73924 0.75296 0.0027602
## 25  3.8505e-04     62   0.73803 0.75122 0.0027590
## 26  3.5702e-04     74   0.73327 0.74903 0.0027574
## 27  3.4891e-04     75   0.73292 0.74744 0.0027562
## 28  3.4079e-04     78   0.73183 0.74639 0.0027554
## 29  3.2456e-04     82   0.73047 0.74471 0.0027541
## 30  3.2186e-04     84   0.72982 0.74356 0.0027533
## 31  3.1916e-04     96   0.72516 0.74342 0.0027532
## 32  3.1645e-04    118   0.71735 0.74329 0.0027531
## 33  3.0834e-04    121   0.71640 0.74251 0.0027525
## 34  2.9211e-04    126   0.71485 0.74045 0.0027509
## 35  2.7588e-04    131   0.71339 0.73728 0.0027485
## 36  2.6777e-04    137   0.71174 0.73467 0.0027464
## 37  2.6506e-04    139   0.71120 0.73269 0.0027449
## 38  2.5965e-04    150   0.70750 0.73184 0.0027442
## 39  2.5154e-04    157   0.70568 0.72763 0.0027408
## 40  2.4342e-04    165   0.70328 0.72415 0.0027380
## 41  2.3801e-04    166   0.70304 0.72193 0.0027362
## 42  2.3531e-04    176   0.70036 0.72172 0.0027360
## 43  2.2720e-04    180   0.69942 0.71928 0.0027340
## 44  2.1908e-04    198   0.69483 0.71552 0.0027308
## 45  2.1421e-04    205   0.69277 0.71373 0.0027293
## 46  2.1097e-04    215   0.69037 0.71088 0.0027269
## 47  2.0556e-04    226   0.68796 0.70942 0.0027256
## 48  2.0285e-04    229   0.68735 0.70768 0.0027241
## 49  2.0169e-04    258   0.68089 0.70747 0.0027239
## 50  2.0015e-04    265   0.67948 0.70739 0.0027239
## 51  1.9474e-04    269   0.67862 0.70521 0.0027219
## 52  1.8933e-04    305   0.67076 0.70408 0.0027209
## 53  1.8662e-04    333   0.66373 0.70129 0.0027185
## 54  1.8392e-04    339   0.66262 0.69962 0.0027170
## 55  1.8257e-04    353   0.65999 0.69924 0.0027166
## 56  1.8083e-04    358   0.65896 0.69924 0.0027166
## 57  1.7851e-04    382   0.65374 0.69580 0.0027135
## 58  1.7310e-04    449   0.63767 0.69368 0.0027116
## 59  1.7040e-04    455   0.63642 0.69223 0.0027103
## 60  1.6769e-04    480   0.63199 0.69002 0.0027082
## 61  1.6553e-04    483   0.63149 0.68710 0.0027055
## 62  1.6228e-04    490   0.63027 0.68667 0.0027051
## 63  1.5687e-04    540   0.62125 0.68306 0.0027017
## 64  1.5417e-04    544   0.62062 0.68194 0.0027006
## 65  1.5255e-04    554   0.61903 0.68181 0.0027005
## 66  1.4605e-04    566   0.61703 0.67824 0.0026971
## 67  1.4403e-04    603   0.61111 0.67673 0.0026956
## 68  1.4064e-04    611   0.60995 0.67170 0.0026906
## 69  1.3956e-04    614   0.60953 0.67128 0.0026902
## 70  1.3794e-04    624   0.60749 0.67081 0.0026898
## 71  1.3524e-04    640   0.60507 0.67013 0.0026891
## 72  1.3388e-04    646   0.60426 0.66747 0.0026864
## 73  1.2983e-04    655   0.60301 0.66687 0.0026858
## 74  1.2577e-04    684   0.59923 0.66570 0.0026846
## 75  1.2442e-04    688   0.59872 0.66143 0.0026803
## 76  1.2171e-04    731   0.59210 0.66143 0.0026803
## 77  1.2009e-04    760   0.58847 0.66122 0.0026801
## 78  1.1901e-04    769   0.58723 0.65866 0.0026774
## 79  1.1360e-04    787   0.58470 0.65633 0.0026750
## 80  1.0954e-04    835   0.57870 0.65481 0.0026734
## 81  1.0819e-04    840   0.57815 0.65466 0.0026733
## 82  1.0548e-04    847   0.57725 0.65106 0.0026695
## 83  1.0278e-04    861   0.57578 0.65091 0.0026693
## 84  9.7369e-05    864   0.57547 0.64902 0.0026673
## 85  9.5747e-05    924   0.56946 0.64721 0.0026654
## 86  9.3312e-05    947   0.56672 0.64640 0.0026645
## 87  9.1960e-05    951   0.56635 0.64629 0.0026644
## 88  8.9255e-05    954   0.56607 0.64616 0.0026642
## 89  8.6551e-05    983   0.56301 0.64202 0.0026598
## 90  8.5198e-05    989   0.56249 0.64134 0.0026590
## 91  8.3459e-05   1049   0.55499 0.64111 0.0026588
## 92  8.1141e-05   1056   0.55441 0.64006 0.0026576
## 93  7.8436e-05   1105   0.55038 0.63918 0.0026566
## 94  7.7084e-05   1122   0.54886 0.63837 0.0026557
## 95  7.5732e-05   1130   0.54824 0.63811 0.0026555
## 96  7.3027e-05   1141   0.54725 0.63790 0.0026552
## 97  7.0322e-05   1185   0.54382 0.63446 0.0026514
## 98  6.8970e-05   1195   0.54311 0.63425 0.0026512
## 99  6.7231e-05   1213   0.54184 0.63415 0.0026511
## 100 6.4913e-05   1223   0.54115 0.63301 0.0026498
## 101 6.1667e-05   1292   0.53590 0.63298 0.0026497
## 102 6.0856e-05   1303   0.53498 0.63311 0.0026499
## 103 5.9504e-05   1308   0.53467 0.63311 0.0026499
## 104 5.6799e-05   1320   0.53375 0.63204 0.0026487
## 105 5.5176e-05   1332   0.53307 0.63181 0.0026484
## 106 5.4094e-05   1337   0.53279 0.63178 0.0026484
## 107 5.1930e-05   1352   0.53198 0.63047 0.0026469
## 108 4.8685e-05   1357   0.53172 0.63008 0.0026465
## 109 4.5439e-05   1446   0.52644 0.63003 0.0026464
## 110 4.4628e-05   1451   0.52622 0.62910 0.0026454
## 111 4.3275e-05   1455   0.52604 0.62917 0.0026454
## 112 4.2193e-05   1475   0.52503 0.62933 0.0026456
## 113 4.0571e-05   1490   0.52422 0.62899 0.0026452
## 114 3.9411e-05   1504   0.52365 0.62909 0.0026453
## 115 3.7866e-05   1511   0.52338 0.62912 0.0026454
## 116 3.6514e-05   1519   0.52294 0.62907 0.0026453
## 117 3.2456e-05   1523   0.52279 0.62871 0.0026449
## 118 2.7047e-05   1581   0.52091 0.62889 0.0026451
## 119 2.4342e-05   1593   0.52059 0.62912 0.0026454
## 120 2.1638e-05   1601   0.52039 0.62883 0.0026450
## 121 1.6228e-05   1616   0.52007 0.62893 0.0026452
## 122 8.1141e-06   1655   0.51943 0.62860 0.0026448
## 123 5.4094e-06   1685   0.51919 0.62881 0.0026450
## 124 4.0571e-06   1691   0.51916 0.62896 0.0026452
## 125 0.0000e+00   1695   0.51914 0.62868 0.0026449
predicted <- predict(tune_fit_4, test_over, type="class")
#table(test_over$readmitted, predicted)
confusionMatrix(test_over$readmitted, predicted)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 17353  9055
##          1  6990 19418
##                                           
##                Accuracy : 0.6962          
##                  95% CI : (0.6923, 0.7001)
##     No Information Rate : 0.5391          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3924          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7129          
##             Specificity : 0.6820          
##          Pos Pred Value : 0.6571          
##          Neg Pred Value : 0.7353          
##              Prevalence : 0.4609          
##          Detection Rate : 0.3286          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.6974          
##                                           
##        'Positive' Class : 0               
## 
pred <- prediction(predict(tune_fit_4, test_over, type="prob")[, 2], test_over$readmitted)

auc <- performance(pred, "auc")
auc <- auc@y.values[[1]]
#auc #0.7529766

#as.data.frame(tune_fit$variable.importance)
dt_varimport <- data.frame(imp = tune_fit_4$variable.importance)

dt_varimport_2 <- dt_varimport %>% 
  rownames_to_column() %>% 
  rename("variable" = rowname) %>% 
  arrange(imp) %>%
  mutate(variable = forcats::fct_inorder(variable))

ggplot(dt_varimport_2) +
  geom_col(aes(x = variable, y = imp),
            show.legend = F) +
  coord_flip() 

plot(performance(pred, 'tpr', 'fpr'), col='blue', main='Decision Tree ROC AUC - Over-sampling')
abline(0, 1, lty=2)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)

AUC for the decision tree model using over-sampled data is 75.3, sensitivity is 71.3, and specificity is 68.2. The top 3 important predictors are num_lab_procedures, number_inpatient, and number_medications.

Summary Performance

The table below summarizes the performance metrics for each model that was developed. Overall, the decision tree model using over-sampled data had the best performance across the board. It is interesting that this model out performed the others by such a large margin. In general, it appears the decision trees performed better than the logistic regression models and note than number_inpatient was the top predictor in both decision trees.

#summary table
Model <- c("Logistic Regression (Under-sampling)", "Decision Tree (Under-sampling)", "Logistic Regression (Over-sampling)", "Decision Tree (Over-sampling)")
   
auc_values <- c(60.5, 63.6, 60.7, 75.3)
   
spec_values <- c(56.5, 61.9, 55.5, 71.3)
    
sens_values <- c(64.5, 59.3, 66.0, 68.2)
    
sum_df<- data.frame(Model, auc_values, spec_values, sens_values)

# rename columnscolnames(sum_df)[which(names(sum_df) == "auc_values")] <- "AUC"
colnames(sum_df)[which(names(sum_df) == "spec_values")] <- "Specificity"
colnames(sum_df)[which(names(sum_df) == "sens_values")] <- "Sensitivity"

sum_df
##                                  Model auc_values Specificity Sensitivity
## 1 Logistic Regression (Under-sampling)       60.5        56.5        64.5
## 2       Decision Tree (Under-sampling)       63.6        61.9        59.3
## 3  Logistic Regression (Over-sampling)       60.7        55.5        66.0
## 4        Decision Tree (Over-sampling)       75.3        71.3        68.2

Conclusion

In general, the decision trees perform better than the logistic regression models. Since our models’ goal was to predict whether a patient is readmitted to the hospital or not, false negatives (incorrectly predicting a patient will not be readmitted) are more costly than false positives (incorrectly predicting a patient will be readmitted). So we will stress sensitivity as the most important performance metric because it is the ratio of true positives over actual positives. Thus, we recommend the decision tree model using over-sampled data as the best model to use based on this analysis. Future possible steps include testing each model on other data sets to see how they perform and testing other types of models.