The goal of this case analysis is to predict whether or not a patient will be readmitted to the hospital within 30 days using data from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008). The data covers 10 years (1999-2008) of readmissions in 130 U.S. hospitals. Information regarding patient history, demographics, medications administered, and diagnoses are included in the data. Each record satisfies the following:
The main constraints of this case study are model interpretability and misclassification. Doctors need to be able to understand and explain why a model predicted that a patient will be readmitted to the hospital so they can make appropriate modifications to a patient’s course of treatment and also the reason for the changes. If a model makes an incorrect prediction, it could be financially and emotionally costly to the patient if he/she is subject to additional unnecessary time in the hospital or is discharged home but has to come home a few days later because their symptoms or pain are back. Due to these constraint we will be testing logistic regression and decision tree models.
#import data saved on github
df_raw <- read.csv(url("https://raw.githubusercontent.com/KatherineEvers/698/main/diabetic_data.csv"), header=TRUE)
#first few rows of data
head(df_raw) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "800px", height = "200px")
encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide.metformin | glipizide.metformin | glimepiride.pioglitazone | metformin.rosiglitazone | metformin.pioglitazone | change | diabetesMed | readmitted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ? | Pediatrics-Endocrinology | 41 | 0 | 1 | 0 | 0 | 0 | 250.83 | ? | ? | 1 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | NO |
149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ? | ? | 59 | 0 | 18 | 0 | 0 | 0 | 276 | 250.01 | 255 | 9 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Up | No | No | No | No | No | Ch | Yes | >30 |
64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ? | ? | 11 | 5 | 13 | 2 | 0 | 1 | 648 | 250 | V27 | 6 | None | None | No | No | No | No | No | No | Steady | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes | NO |
500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ? | ? | 44 | 1 | 16 | 0 | 0 | 0 | 8 | 250.43 | 403 | 7 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Up | No | No | No | No | No | Ch | Yes | NO |
16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ? | ? | 51 | 0 | 8 | 0 | 0 | 0 | 197 | 157 | 250 | 5 | None | None | No | No | No | No | No | No | Steady | No | No | No | No | No | No | No | No | No | No | Steady | No | No | No | No | No | Ch | Yes | NO |
35754 | 82637451 | Caucasian | Male | [50-60) | ? | 2 | 1 | 2 | 3 | ? | ? | 31 | 6 | 16 | 0 | 0 | 0 | 414 | 411 | 250 | 9 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Steady | No | No | No | No | No | No | Yes | >30 |
#summarize data
skim(df_raw)
Name | df_raw |
Number of rows | 101766 |
Number of columns | 50 |
_______________________ | |
Column type frequency: | |
character | 37 |
numeric | 13 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
race | 0 | 1 | 1 | 15 | 0 | 6 | 0 |
gender | 0 | 1 | 4 | 15 | 0 | 3 | 0 |
age | 0 | 1 | 6 | 8 | 0 | 10 | 0 |
weight | 0 | 1 | 1 | 9 | 0 | 10 | 0 |
payer_code | 0 | 1 | 1 | 2 | 0 | 18 | 0 |
medical_specialty | 0 | 1 | 1 | 36 | 0 | 73 | 0 |
diag_1 | 0 | 1 | 1 | 6 | 0 | 717 | 0 |
diag_2 | 0 | 1 | 1 | 6 | 0 | 749 | 0 |
diag_3 | 0 | 1 | 1 | 6 | 0 | 790 | 0 |
max_glu_serum | 0 | 1 | 4 | 4 | 0 | 4 | 0 |
A1Cresult | 0 | 1 | 2 | 4 | 0 | 4 | 0 |
metformin | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
repaglinide | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
nateglinide | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
chlorpropamide | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
glimepiride | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
acetohexamide | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
glipizide | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
glyburide | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
tolbutamide | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
pioglitazone | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
rosiglitazone | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
acarbose | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
miglitol | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
troglitazone | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
tolazamide | 0 | 1 | 2 | 6 | 0 | 3 | 0 |
examide | 0 | 1 | 2 | 2 | 0 | 1 | 0 |
citoglipton | 0 | 1 | 2 | 2 | 0 | 1 | 0 |
insulin | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
glyburide.metformin | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
glipizide.metformin | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
glimepiride.pioglitazone | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
metformin.rosiglitazone | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
metformin.pioglitazone | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
change | 0 | 1 | 2 | 2 | 0 | 2 | 0 |
diabetesMed | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
readmitted | 0 | 1 | 2 | 3 | 0 | 3 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
encounter_id | 0 | 1 | 165201645.62 | 102640295.98 | 12522 | 84961194 | 152388987 | 230270888 | 443867222 | ▆▇▅▂▂ |
patient_nbr | 0 | 1 | 54330400.69 | 38696359.35 | 135 | 23413221 | 45505143 | 87545950 | 189502619 | ▇▆▆▁▁ |
admission_type_id | 0 | 1 | 2.02 | 1.45 | 1 | 1 | 1 | 3 | 8 | ▇▂▁▁▁ |
discharge_disposition_id | 0 | 1 | 3.72 | 5.28 | 1 | 1 | 1 | 4 | 28 | ▇▁▁▁▁ |
admission_source_id | 0 | 1 | 5.75 | 4.06 | 1 | 1 | 7 | 7 | 25 | ▅▇▁▁▁ |
time_in_hospital | 0 | 1 | 4.40 | 2.99 | 1 | 2 | 4 | 6 | 14 | ▇▅▂▁▁ |
num_lab_procedures | 0 | 1 | 43.10 | 19.67 | 1 | 31 | 44 | 57 | 132 | ▃▇▅▁▁ |
num_procedures | 0 | 1 | 1.34 | 1.71 | 0 | 0 | 1 | 2 | 6 | ▇▂▁▁▁ |
num_medications | 0 | 1 | 16.02 | 8.13 | 1 | 10 | 15 | 20 | 81 | ▇▃▁▁▁ |
number_outpatient | 0 | 1 | 0.37 | 1.27 | 0 | 0 | 0 | 0 | 42 | ▇▁▁▁▁ |
number_emergency | 0 | 1 | 0.20 | 0.93 | 0 | 0 | 0 | 0 | 76 | ▇▁▁▁▁ |
number_inpatient | 0 | 1 | 0.64 | 1.26 | 0 | 0 | 0 | 1 | 21 | ▇▁▁▁▁ |
number_diagnoses | 0 | 1 | 7.42 | 1.93 | 1 | 6 | 8 | 9 | 16 | ▁▅▇▁▁ |
#describe(df)
#sum_data <- summary(df)
#pander(sum_data, split.table = 100, style = 'rmarkdown')
As seen in the summary table, the data contains 101,766 rows and 50 columns. Of the 50 columns, 13 are numeric and 37 are character (nominal) variables. The table also provides distributions of the numerical variables and the number of unique values and most common values of nominal variables. Note that examide
and citoglipton
have only 1 unique value and diag_1
, diag_2
and diag_3
have over 700 unique values.
Some variables have “?” or other labels coded as missing/not available/not measured, which is why no missing values are reported in the summary table above. After replacing these labels with NA, we get the following table of variables with missing values. Since weight
, max_glu_serum
, and A1Cresult
have over 50% missing values, they will be dropped from the modeling data. In the data transformation section later on missing values for the remaining factors with less than 50% missing values will be filled with mode or kept as a separate category.
#recode missing labels to NA
missing_df <- df_raw %>%
mutate(race = na_if(race, '?')) %>%
mutate(weight = na_if(weight, '?')) %>%
mutate(admission_type_id = na_if(admission_type_id, '5')) %>%
mutate(admission_type_id = na_if(admission_type_id, '6')) %>%
mutate(admission_type_id = na_if(admission_type_id, '8')) %>%
mutate(discharge_disposition_id = na_if(discharge_disposition_id, '18')) %>%
mutate(discharge_disposition_id = na_if(discharge_disposition_id, '25')) %>%
mutate(discharge_disposition_id = na_if(discharge_disposition_id, '26')) %>%
mutate(admission_source_id = na_if(admission_source_id, '15')) %>%
mutate(admission_source_id = na_if(admission_source_id, '17'))%>%
mutate(admission_source_id = na_if(admission_source_id, '20')) %>%
mutate(admission_source_id = na_if(admission_source_id, '21')) %>%
mutate(payer_code = na_if(payer_code, '?')) %>%
mutate(medical_specialty = na_if(medical_specialty, '?')) %>%
mutate(payer_code = na_if(payer_code, '?')) %>%
mutate(diag_1 = na_if(diag_1, '?')) %>%
mutate(diag_2 = na_if(diag_2, '?')) %>%
mutate(diag_3 = na_if(diag_3, '?')) %>%
mutate(max_glu_serum = na_if(max_glu_serum, 'None')) %>%
mutate(A1Cresult = na_if(A1Cresult, 'None'))
#sapply(missing_df, function(x) sum(is.na(x)))
missing_df %>%
summarize_all(funs(sum(is.na(.)) / length(.) * 100)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "800px", height = "100px")
encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide.metformin | glipizide.metformin | glimepiride.pioglitazone | metformin.rosiglitazone | metformin.pioglitazone | change | diabetesMed | readmitted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2.233555 | 0 | 0 | 96.85848 | 10.21559 | 4.598785 | 6.821532 | 0 | 39.55742 | 49.08221 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0206356 | 0.3517874 | 1.398306 | 0 | 94.74677 | 83.27732 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
#missing/not available/not measured
#table(df_raw['race'])["?"] #2273
#table(df_raw['weight'])["?"] #98569
#table(df_raw['admission_type_id']) #5: 4785 ,6: 5291,8: 320
#table(df_raw['discharge_disposition_id']) #18: 3691, 25:989, 26:0
#table(df_raw['admission_source_id']) #15:0, 17: 6781, 20: 161, 21:0
#table(df_raw['payer_code'])["?"] #40245
#table(df_raw['medical_specialty'])["?"] #49949
#table(df_raw['diag_1'])["?"] #21
#table(df_raw['diag_2'])["?"] #358
#table(df_raw['diag_3'])["?"] #1423
#table(df_raw['max_glu_serum'])["None"] #96420
#table(df_raw['A1Cresult'])["None"] #84748
Upon investigation we find that there are cases of multiple records for one patient number. Based on the five requirements of the patient records we will assume each record is an independent, separate occurrence and keep all records.
dup <- df_raw %>%
count(patient_nbr) %>%
arrange(desc(n))
head(dup) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "800px", height = "200px")
patient_nbr | n |
---|---|
88785891 | 40 |
43140906 | 28 |
1660293 | 23 |
23199021 | 23 |
88227540 | 23 |
23643405 | 22 |
#examine example patient number
df_raw[df_raw$patient_nbr==8693541,] %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
scroll_box(width = "800px", height = "200px")
encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide.metformin | glipizide.metformin | glimepiride.pioglitazone | metformin.rosiglitazone | metformin.pioglitazone | change | diabetesMed | readmitted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
19311 | 68920296 | 8693541 | Caucasian | Female | [80-90) | ? | 1 | 1 | 7 | 12 | ? | ? | 74 | 0 | 23 | 0 | 0 | 0 | 410 | 428 | 427 | 9 | None | None | No | No | No | No | No | No | No | Down | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Ch | Yes | >30 |
20345 | 71711130 | 8693541 | Caucasian | Female | [80-90) | ? | 1 | 1 | 7 | 6 | ? | ? | 60 | 0 | 13 | 0 | 0 | 1 | 410 | 428 | 427 | 8 | None | None | No | No | No | No | No | No | No | Steady | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes | >30 |
21637 | 75256344 | 8693541 | Caucasian | Female | [80-90) | ? | 1 | 1 | 7 | 5 | ? | ? | 61 | 0 | 10 | 0 | 0 | 2 | 428 | 496 | 427 | 6 | None | None | No | No | No | No | No | No | No | Down | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Ch | Yes | >30 |
24712 | 83442996 | 8693541 | Caucasian | Female | [80-90) | ? | 1 | 3 | 7 | 13 | ? | ? | 66 | 1 | 19 | 0 | 0 | 3 | 428 | 486 | 427 | 8 | None | None | No | No | No | No | No | No | No | Down | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Ch | Yes | <30 |
25682 | 85503798 | 8693541 | Caucasian | Female | [80-90) | ? | 1 | 6 | 7 | 6 | ? | ? | 49 | 2 | 16 | 0 | 0 | 4 | 398 | 486 | 427 | 8 | None | None | No | No | No | No | No | No | No | Steady | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes | NO |
Since we are predicting whether a patient was readmitted or not, we want to remove any patients who expired in the hospital or were discharged to hospice in order to avoid bias. These patients have a discharge_disposition_id of 11, 13, 14, 19, 20, or 21. There are 2,423 records removed with 99,343 remaining.
table(df_raw['discharge_disposition_id'])
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 60234 2128 13954 815 1184 12902 623 108 21 6 1642 3 399
## 14 15 16 17 18 19 20 22 23 24 25 27 28
## 372 63 11 14 3691 8 2 1993 412 48 989 5 139
df <- subset(df_raw, !(discharge_disposition_id %in% c('11','13','14','19','20', '21'))) #drop 2423 records, 99343 remaining
#table(df['discharge_disposition_id'])
#convert from character to factor
df$race <- as.factor(df$race)
df$gender <- as.factor(df$gender)
df$age <- as.factor(df$age)
df$weight <- as.factor(df$weight)
df$admission_type_id <- as.factor(df$admission_type_id)
df$discharge_disposition_id <- as.factor(df$discharge_disposition_id)
df$admission_source_id <- as.factor(df$admission_source_id)
df$payer_code <- as.factor(df$payer_code)
df$medical_specialty <- as.factor(df$medical_specialty)
df$diag_1 <- as.factor(df$diag_1)
df$diag_2 <- as.factor(df$diag_2)
df$diag_3 <- as.factor(df$diag_3)
df$max_glu_serum <- as.factor(df$max_glu_serum)
df$A1Cresult <- as.factor(df$A1Cresult)
df$metformin <- as.factor(df$metformin)
df$repaglinide <- as.factor(df$repaglinide)
df$nateglinide <- as.factor(df$nateglinide)
df$chlorpropamide <- as.factor(df$chlorpropamide)
df$glimepiride <- as.factor(df$glimepiride)
df$acetohexamide <- as.factor(df$acetohexamide)
df$glipizide <- as.factor(df$glipizide)
df$glyburide <- as.factor(df$glyburide)
df$tolbutamide <- as.factor(df$tolbutamide)
df$pioglitazone <- as.factor(df$pioglitazone)
df$rosiglitazone <- as.factor(df$rosiglitazone)
df$acarbose <- as.factor(df$acarbose)
df$miglitol <- as.factor(df$miglitol)
df$troglitazone <- as.factor(df$troglitazone)
df$tolazamide <- as.factor(df$tolazamide)
df$examide <- as.factor(df$examide)
df$citoglipton <- as.factor(df$citoglipton)
df$insulin <- as.factor(df$insulin)
df$glyburide.metformin <- as.factor(df$glyburide.metformin)
df$glipizide.metformin <- as.factor(df$glipizide.metformin)
df$glimepiride.pioglitazone <- as.factor(df$glimepiride.pioglitazone)
df$metformin.rosiglitazone <- as.factor(df$metformin.rosiglitazone)
df$metformin.pioglitazone <- as.factor(df$metformin.pioglitazone)
df$change <- as.factor(df$change)
df$diabetesMed <- as.factor(df$diabetesMed)
df$readmitted <- as.factor(df$readmitted)
Many of the nominal variables have multiple levels such as diag_1
and discharge_disposition_id
. If a nominal variable has too many levels or levels that rarely occur, it is less likely that these levels will impact the model. So in order to reduce the number of levels for these variables we will recode and combine some levels. We will also create the Charlson Comorbidity Index (CCI) 1992 version based on the ICD-9 (International Classification of Diseases) codes for diag_1
, diag_2
, and diag_3
#readmitted-convert to binary
df_2 <- df %>% mutate(readmitted=recode(readmitted,
'<30' = 1,
'>30' = 0,
'NO' = 0))
#table(df['readmitted'])
#count unique values for each variable
#sapply(lapply(df, unique), length)
#admission type id
df_2 <- df_2 %>% mutate(admission_type_id=recode(admission_type_id,
'2'=3,
'1'=3,
'7'=3,
'3'=1,
'4'=2,
'5'=0,
'6'=0,
'8'=0))
#discharge_disposition_id
df_2 <- df_2 %>% mutate(discharge_disposition_id=recode(discharge_disposition_id,
'1'=1,
'2'=2,
'18'=0,
'25'=0,
'26'=0,
'3'=2,
'4'=2,
'5'=2,
'6'=2,
'8'=2,
'9'=2,
'10'=2,
'15'=2,
'16'=2,
'17'=2,
'22'=2,
'23'=2,
'24'=2,
'30'=2,
'27'=2,
'28'=2,
'29'=2,
'7'=3,
'11'=4,
'12'=2,
'19'=4,
'20'=4,
'21'=4,
'13'=4,
'14'=4
))
#table(df_2['discharge_disposition_id'])
#admission source id
df_2 <- df_2 %>% mutate(admission_source_id=recode(admission_source_id,
'1' = 3,
'2'=3,
'3'=3,
'4'=1,
'5'=1,
'6'=1,
'7'=2,
'8'=3,
'9'=0,
'10'=1,
'11'=3,
'12'=3,
'13'=3,
'14'=3,
'15'=0,
'17'=0,
'18'=1,
'19'=4,
'20'=0,
'21'=0,
'22'=1,
'23'=3,
'24'=3,
'25'=1,
'26'=1
))
#table(df_2['admission_source_id'])
#fill missing with mode
#race
df_2 <- df_2 %>% mutate(race=recode(race,
'Caucasian' = 0,
'AfricanAmerican' = 1,
'Asian' = 1,
'Hispanic'=1,
'Other'=1,
'?'=0))
#gender
df_2 <- df_2 %>% mutate(gender=recode(gender,
'Unknown/Invalid' = 1,
'Male' = 0,
'Female' = 1))
#age
df_2 <- df_2 %>% mutate(age=recode(age,
'[0-10)' = 5,
'[10-20)' = 15,
'[20-30)' = 25,
'[30-40)' = 35,
'[40-50)' = 45,
'[50-60)' = 55,
'[60-70)' = 65,
'[70-80)' = 75,
'[80-90)' = 85,
'[90-100)' = 95))
#weight
#prop.table(table(df$weight))["?"] #0.97-drop
#payer_code
df_2 <- df_2 %>% mutate(payer_code=recode(payer_code,
MC=1,
MD=2,
HM=2,
UN=2,
BC=2,
SP=3,
CP=2,
SI=2,
DM=2,
CM=2,
CH=2,
PO=2,
WC=2,
OT=2,
OG=2,
MP=2,
FR=2,
'?'=0))
#table(df_2['payer_code'])
#medical_specialty
df_2 <- df_2 %>% mutate(medical_specialty=recode(medical_specialty,
'?'=0,
'AllergyandImmunology'=7,
'Anesthesiology'=7,
'Anesthesiology-Pediatric'=13,
'Cardiology'=1,
'Cardiology-Pediatric'=13,
'DCPTEAM'=7,
'Dentistry'=7,
'Dermatology'=7,
'Emergency/Trauma'=2,
'Endocrinology'=7,
'Endocrinology-Metabolism'=7,
'Family/GeneralPractice'=4,
'Gastroenterology'=5,
'Gynecology'=7,
'Hematology'=6,
'Hematology/Oncology'=6,
'Hospitalist'=7,
'InfectiousDiseases'=7,
'InternalMedicine'=4,
'Nephrology'=8,
'Neurology'=9,
'Neurophysiology'=9,
'Obsterics&Gynecology-GynecologicOnco'=7,
'Obstetrics'=7,
'ObstetricsandGynecology'=7,
'Oncology'=10,
'Ophthalmology'=7,
'Orthopedics'=11,
'Orthopedics-Reconstructive'=11,
'Osteopath'=11,
'Otolaryngology'=7,
'OutreachServices'=7,
'Pathology'=7,
'Pediatrics'=13,
'Pediatrics-AllergyandImmunology'=13,
'Pediatrics-CriticalCare'=13,
'Pediatrics-EmergencyMedicine'=13,
'Pediatrics-Endocrinology'=13,
'Pediatrics-Hematology-Oncology'=13,
'Pediatrics-InfectiousDiseases'=13,
'Pediatrics-Neurology'=13,
'Pediatrics-Pulmonology'=13,
'Perinatology'=7,
'PhysicalMedicineandRehabilitation'=7,
'PhysicianNotFound'=7,
'Podiatry'=10,
'Proctology'=8,
'Psychiatry'=7,
'Psychiatry-Addictive'=7,
'Psychiatry-Child/Adolescent'=13,
'Psychology'=7,
'Pulmonology'=12,
'Radiologist'=11,
'Radiology'=11,
'Resident'=7,
'Rheumatology'=11,
'Speech'=7,
'SportsMedicine'=11,
'Surgeon'=3,
'Surgery-Cardiovascular'=3,
'Surgery-Cardiovascular/Thoracic'=3,
'Surgery-Colon&Rectal'=3,
'Surgery-General'=3,
'Surgery-Maxillofacial'=3,
'Surgery-Neuro'=3,
'Surgery-Pediatric'=3,
'Surgery-Plastic'=3,
'Surgery-PlasticwithinHeadandNeck'=3,
'Surgery-Thoracic'=3,
'Surgery-Vascular'=3,
'SurgicalSpecialty'=3,
'Urology'=8))
#table(df_2['medical_specialty'])
#diag_1
#reassign ? to separate category
levels(df_2$diag_1)[levels(df_2$diag_1)=="?"] <- "9999"
#convert factor levels to numeric
df_2$diag_1 <- as.numeric(levels(df_2$diag_1))[df_2$diag_1]
df_2 <- transform(df_2, AMI_1= ifelse(diag_1==410, 1, 0))
df_2 <- transform(df_2, AMI_history_1= ifelse(diag_1==412, 1, 0))
df_2 <- transform(df_2, CHF_1= ifelse(diag_1==428, 1, 0))
df_2$PVD_1 <- ifelse(df_2$diag_1==411 | df_2$diag_1==443| df_2$diag_1==785, 1, 0)
df_2$CVD_1 <- ifelse( df_2$diag_1 >= 430 & df_2$diag_1 <= 438, 1, 0)
df_2$COPD_1 <- ifelse( (df_2$diag_1 >= 490 & df_2$diag_1 <= 496) | (df_2$diag_1 >= 500 & df_2$diag_1 <= 505), 1, 0)
df_2 <- transform(df_2, dementia_1= ifelse(diag_1==290, 1, 0))
df_2 <- transform(df_2, paralysis_1= ifelse(diag_1==342, 1, 0))
df_2$diabetes_1 <- ifelse(df_2$diag_1 == 250 | (df_2$diag_1 >= 250.01 & df_2$diag_1 <= 250.39) | df_2$diag_1 == 250.7, 1, 0)
df_2$diabetes_w_comp_1 <- ifelse((df_2$diag_1 >= 250.4 & df_2$diag_1 <= 250.69) | (df_2$diag_1 >= 250.8 & df_2$diag_1 == 250.99), 1, 0)
df_2$renal_1 <- ifelse((df_2$diag_1 >= 582 & df_2$diag_1 <= 583) | (df_2$diag_1 >= 585 & df_2$diag_1 <= 586) | df$diag_1==588, 1, 0)
df_2 <- transform(df_2, mild_liver_1= ifelse(diag_1==571, 1, 0))
df_2$mod_sev_liver_1 <- ifelse(df_2$diag_1 == 456 | df_2$diag_1 == 572, 1, 0)
df_2$peptic_ulcer_1 <- ifelse( df_2$diag_1 >= 531 & df_2$diag_1 == 534, 1, 0)
df_2$rheumatologic_1 <- ifelse(df_2$diag_1 == 710 | df_2$diag_1 == 714, 1, 0)
df_2$aids_1 <- ifelse(df_2$diag_1 == 42 | df_2$diag_1 == 43 | df_2$diag_1 == 44, 1, 0)
df_2$malig_1 <- ifelse((df_2$diag_1 >= 140 & df_2$diag_1 <= 172) | (df_2$diag_1 >= 174 & df_2$diag_1 <= 195) |
(df_2$diag_1 >= 200 & df_2$diag_1 <= 208), 1, 0)
df_2$tumor_1 <- ifelse((df_2$diag_1 >= 196 & df_2$diag_1 <= 199), 1, 0)
df_2 <- transform(df_2, wt_AMI_1= ifelse(AMI_1==1, 1, 0))
df_2 <- transform(df_2, wt_AMI_history_1= ifelse(AMI_history_1==1, 1, 0))
df_2 <- transform(df_2, wt_CHF_1= ifelse(CHF_1==1, 1, 0))
df_2 <- transform(df_2, wt_PVD_1= ifelse(PVD_1==1, 1, 0))
df_2 <- transform(df_2, wt_CVD_1= ifelse(PVD_1==1, 1, 0))
df_2 <- transform(df_2, wt_COPD_1= ifelse(COPD_1==1, 1, 0))
df_2 <- transform(df_2, wt_dementia_1= ifelse(dementia_1==1, 1, 0))
df_2 <- transform(df_2, wt_paralysis_1= ifelse(paralysis_1==1, 2, 0))
df_2 <- transform(df_2, wt_diabetes_1= ifelse(diabetes_1==1, 1, 0))
df_2 <- transform(df_2, wt_diabetes_w_comp_1= ifelse(diabetes_w_comp_1==1, 2, 0))
df_2 <- transform(df_2, wt_renal_1= ifelse(renal_1==1, 2, 0))
df_2 <- transform(df_2, wt_mild_liver_1= ifelse(mild_liver_1==1, 1, 0))
df_2 <- transform(df_2, wt_mod_sev_liver_1= ifelse(mod_sev_liver_1==1, 3, 0))
df_2 <- transform(df_2, wt_peptic_ulcer_1= ifelse(peptic_ulcer_1==1, 1, 0))
df_2 <- transform(df_2, wt_rheumatologic_1= ifelse(rheumatologic_1==1, 1, 0))
df_2 <- transform(df_2, wt_aids_1= ifelse(aids_1==1, 6, 0))
df_2 <- transform(df_2, wt_malig_1= ifelse(malig_1==1, 2, 0))
df_2 <- transform(df_2, wt_tumor_1= ifelse(tumor_1==1, 6, 0))
df_2$wt_AMI_1[is.na(df_2$wt_AMI_1)] <- 0
df_2$wt_AMI_history_1[is.na(df_2$wt_AMI_history_1)] <- 0
df_2$wt_CHF_1[is.na(df_2$wt_CHF_1)] <- 0
df_2$wt_PVD_1[is.na(df_2$wt_PVD_1)] <- 0
df_2$wt_CVD_1[is.na(df_2$wt_CVD_1)] <- 0
df_2$wt_COPD_1[is.na(df_2$wt_COPD_1)] <- 0
df_2$wt_dementia_1[is.na(df_2$wt_dementia_1)] <- 0
df_2$wt_paralysis_1[is.na(df_2$wt_paralysis_1)] <- 0
df_2$wt_diabetes_1[is.na(df_2$wt_diabetes_1)] <- 0
df_2$wt_diabetes_w_comp_1[is.na(df_2$wt_diabetes_w_comp_1)] <- 0
df_2$wt_renal_1[is.na(df_2$wt_renal_1)] <- 0
df_2$wt_mild_liver_1[is.na(df_2$wt_mild_liver_1)] <- 0
df_2$wt_mod_sev_liver_1[is.na(df_2$wt_mod_sev_liver_1)] <- 0
df_2$wt_peptic_ulcer_1[is.na(df_2$wt_peptic_ulcer_1)] <- 0
df_2$wt_rheumatologic_1[is.na(df_2$wt_rheumatologic_1)] <- 0
df_2$wt_aids_1[is.na(df_2$wt_aids_1)] <- 0
df_2$wt_malig_1[is.na(df_2$wt_malig_1)] <- 0
df_2$wt_tumor_1[is.na(df_2$wt_tumor_1)] <- 0
#table(df_2['diag_1'])
#table(df['diag_3'])
#diag_2
#reassign ?
levels(df_2$diag_2)[levels(df_2$diag_2)=="?"] <- "9999"
df_2$diag_2 <- as.numeric(levels(df_2$diag_2))[df_2$diag_2]
df_2 <- transform(df_2, AMI_2= ifelse(diag_2==410, 1, 0))
df_2 <- transform(df_2, AMI_history_2= ifelse(diag_2==412, 1, 0))
df_2 <- transform(df_2, CHF_2= ifelse(diag_2==428, 1, 0))
df_2$PVD_2 <- ifelse(df_2$diag_2==411 | df_2$diag_2==443| df_2$diag_2==785, 1, 0)
df_2$CVD_2 <- ifelse( df_2$diag_2 >= 430 & df_2$diag_2 <= 438, 1, 0)
df_2$COPD_2 <- ifelse( (df_2$diag_2 >= 490 & df_2$diag_2 <= 496) | (df_2$diag_2 >= 500 & df_2$diag_2 <= 505), 1, 0)
df_2 <- transform(df_2, dementia_2= ifelse(diag_2==290, 1, 0))
df_2 <- transform(df_2, paralysis_2= ifelse(diag_2==342, 1, 0))
df_2$diabetes_2 <- ifelse(df_2$diag_2 == 250 | (df_2$diag_2 >= 250.01 & df_2$diag_2 <= 250.39) | df_2$diag_2 == 250.7, 1, 0)
df_2$diabetes_w_comp_2 <- ifelse((df_2$diag_2 >= 250.4 & df_2$diag_2 <= 250.69) | (df_2$diag_2 >= 250.8 & df_2$diag_2 == 250.99), 1, 0)
df_2$renal_2 <- ifelse((df_2$diag_2 >= 582 & df_2$diag_2 <= 583) | (df_2$diag_2 >= 585 & df_2$diag_2 <= 586) | df$diag_2==588, 1, 0)
df_2 <- transform(df_2, mild_liver_2= ifelse(diag_2==571, 1, 0))
df_2$mod_sev_liver_2 <- ifelse(df_2$diag_2 == 456 | df_2$diag_2 == 572, 1, 0)
df_2$peptic_ulcer_2 <- ifelse( df_2$diag_2 >= 531 & df_2$diag_2 == 534, 1, 0)
df_2$rheumatologic_2 <- ifelse(df_2$diag_2 == 710 | df_2$diag_2 == 714, 1, 0)
df_2$aids_2 <- ifelse(df_2$diag_2 == 42 | df_2$diag_2 == 43 | df_2$diag_2 == 44, 1, 0)
df_2$malig_2 <- ifelse((df_2$diag_2 >= 140 & df_2$diag_2 <= 172) | (df_2$diag_2 >= 174 & df_2$diag_2 <= 195) |
(df_2$diag_2 >= 200 & df_2$diag_2 <= 208), 1, 0)
df_2$tumor_2 <- ifelse((df_2$diag_2 >= 196 & df_2$diag_2 <= 199), 1, 0)
df_2 <- transform(df_2, wt_AMI_2= ifelse(AMI_2==1, 1, 0))
df_2 <- transform(df_2, wt_AMI_history_2= ifelse(AMI_history_2==1, 1, 0))
df_2 <- transform(df_2, wt_CHF_2= ifelse(CHF_2==1, 1, 0))
df_2 <- transform(df_2, wt_PVD_2= ifelse(PVD_2==1, 1, 0))
df_2 <- transform(df_2, wt_CVD_2= ifelse(PVD_2==1, 1, 0))
df_2 <- transform(df_2, wt_COPD_2= ifelse(COPD_2==1, 1, 0))
df_2 <- transform(df_2, wt_dementia_2= ifelse(dementia_2==1, 1, 0))
df_2 <- transform(df_2, wt_paralysis_2= ifelse(paralysis_2==1, 2, 0))
df_2 <- transform(df_2, wt_diabetes_2= ifelse(diabetes_2==1, 1, 0))
df_2 <- transform(df_2, wt_diabetes_w_comp_2= ifelse(diabetes_w_comp_2==1, 2, 0))
df_2 <- transform(df_2, wt_renal_2= ifelse(renal_2==1, 2, 0))
df_2 <- transform(df_2, wt_mild_liver_2= ifelse(mild_liver_2==1, 1, 0))
df_2 <- transform(df_2, wt_mod_sev_liver_2= ifelse(mod_sev_liver_2==1, 3, 0))
df_2 <- transform(df_2, wt_peptic_ulcer_2= ifelse(peptic_ulcer_2==1, 1, 0))
df_2 <- transform(df_2, wt_rheumatologic_2= ifelse(rheumatologic_2==1, 1, 0))
df_2 <- transform(df_2, wt_aids_2= ifelse(aids_2==1, 6, 0))
df_2 <- transform(df_2, wt_malig_2= ifelse(malig_2==1, 2, 0))
df_2 <- transform(df_2, wt_tumor_2= ifelse(tumor_2==1, 6, 0))
df_2$wt_AMI_2[is.na(df_2$wt_AMI_2)] <- 0
df_2$wt_AMI_history_2[is.na(df_2$wt_AMI_history_2)] <- 0
df_2$wt_CHF_2[is.na(df_2$wt_CHF_2)] <- 0
df_2$wt_PVD_2[is.na(df_2$wt_PVD_2)] <- 0
df_2$wt_CVD_2[is.na(df_2$wt_CVD_2)] <- 0
df_2$wt_COPD_2[is.na(df_2$wt_COPD_2)] <- 0
df_2$wt_dementia_2[is.na(df_2$wt_dementia_2)] <- 0
df_2$wt_paralysis_2[is.na(df_2$wt_paralysis_2)] <- 0
df_2$wt_diabetes_2[is.na(df_2$wt_diabetes_2)] <- 0
df_2$wt_diabetes_w_comp_2[is.na(df_2$wt_diabetes_w_comp_2)] <- 0
df_2$wt_renal_2[is.na(df_2$wt_renal_2)] <- 0
df_2$wt_mild_liver_2[is.na(df_2$wt_mild_liver_2)] <- 0
df_2$wt_mod_sev_liver_2[is.na(df_2$wt_mod_sev_liver_2)] <- 0
df_2$wt_peptic_ulcer_2[is.na(df_2$wt_peptic_ulcer_2)] <- 0
df_2$wt_rheumatologic_2[is.na(df_2$wt_rheumatologic_2)] <- 0
df_2$wt_aids_2[is.na(df_2$wt_aids_2)] <- 0
df_2$wt_malig_2[is.na(df_2$wt_malig_2)] <- 0
df_2$wt_tumor_2[is.na(df_2$wt_tumor_2)] <- 0
#table(df_2['diag_2'])
#diag_3
#reassign ?
levels(df_2$diag_3)[levels(df_2$diag_3)=="?"] <- "9999"
#convert to numeric
df_2$diag_3 <- as.numeric(levels(df_2$diag_3))[df_2$diag_3]
df_2 <- transform(df_2, AMI_3= ifelse(diag_3==410, 1, 0))
df_2 <- transform(df_2, AMI_history_3= ifelse(diag_3==412, 1, 0))
df_2 <- transform(df_2, CHF_3= ifelse(diag_3==428, 1, 0))
df_2$PVD_3 <- ifelse(df_2$diag_3==411 | df_2$diag_3==443| df_2$diag_3==785, 1, 0)
df_2$CVD_3 <- ifelse( df_2$diag_3 >= 430 & df_2$diag_3 <= 438, 1, 0)
df_2$COPD_3 <- ifelse( (df_2$diag_3 >= 490 & df_2$diag_3 <= 496) | (df_2$diag_3 >= 500 & df_2$diag_3 <= 505), 1, 0)
df_2 <- transform(df_2, dementia_3= ifelse(diag_3==290, 1, 0))
df_2 <- transform(df_2, paralysis_3= ifelse(diag_3==342, 1, 0))
df_2$diabetes_3 <- ifelse(df_2$diag_3 == 250 | (df_2$diag_3 >= 250.01 & df_2$diag_3 <= 250.39) | df_2$diag_3 == 250.7, 1, 0)
df_2$diabetes_w_comp_3 <- ifelse((df_2$diag_3 >= 250.4 & df_2$diag_3 <= 250.69) | (df_2$diag_3 >= 250.8 & df_2$diag_3 == 250.99), 1, 0)
df_2$renal_3 <- ifelse((df_2$diag_3 >= 582 & df_2$diag_3 <= 583) | (df_2$diag_3 >= 585 & df_2$diag_3 <= 586) | df$diag_3==588, 1, 0)
df_2 <- transform(df_2, mild_liver_3= ifelse(diag_3==571, 1, 0))
df_2$mod_sev_liver_3 <- ifelse(df_2$diag_3 == 456 | df_2$diag_3 == 572, 1, 0)
df_2$peptic_ulcer_3 <- ifelse( df_2$diag_3 >= 531 & df_2$diag_3 == 534, 1, 0)
df_2$rheumatologic_3 <- ifelse(df_2$diag_3 == 710 | df_2$diag_3 == 714, 1, 0)
df_2$aids_3 <- ifelse(df_2$diag_3 == 42 | df_2$diag_3 == 43 | df_2$diag_3 == 44, 1, 0)
df_2$malig_3 <- ifelse((df_2$diag_3 >= 140 & df_2$diag_3 <= 172) | (df_2$diag_3 >= 174 & df_2$diag_3 <= 195) |
(df_2$diag_3 >= 200 & df_2$diag_3 <= 208), 1, 0)
df_2$tumor_3 <- ifelse((df_2$diag_3 >= 196 & df_2$diag_3 <= 199), 1, 0)
df_2 <- transform(df_2, wt_AMI_3= ifelse(AMI_3==1, 1, 0))
df_2 <- transform(df_2, wt_AMI_history_3= ifelse(AMI_history_3==1, 1, 0))
df_2 <- transform(df_2, wt_CHF_3= ifelse(CHF_3==1, 1, 0))
df_2 <- transform(df_2, wt_PVD_3= ifelse(PVD_3==1, 1, 0))
df_2 <- transform(df_2, wt_CVD_3= ifelse(PVD_3==1, 1, 0))
df_2 <- transform(df_2, wt_COPD_3= ifelse(COPD_3==1, 1, 0))
df_2 <- transform(df_2, wt_dementia_3= ifelse(dementia_3==1, 1, 0))
df_2 <- transform(df_2, wt_paralysis_3= ifelse(paralysis_3==1, 2, 0))
df_2 <- transform(df_2, wt_diabetes_3= ifelse(diabetes_3==1, 1, 0))
df_2 <- transform(df_2, wt_diabetes_w_comp_3= ifelse(diabetes_w_comp_3==1, 2, 0))
df_2 <- transform(df_2, wt_renal_3= ifelse(renal_3==1, 2, 0))
df_2 <- transform(df_2, wt_mild_liver_3= ifelse(mild_liver_3==1, 1, 0))
df_2 <- transform(df_2, wt_mod_sev_liver_3= ifelse(mod_sev_liver_3==1, 3, 0))
df_2 <- transform(df_2, wt_peptic_ulcer_3= ifelse(peptic_ulcer_3==1, 1, 0))
df_2 <- transform(df_2, wt_rheumatologic_3= ifelse(rheumatologic_3==1, 1, 0))
df_2 <- transform(df_2, wt_aids_3= ifelse(aids_3==1, 6, 0))
df_2 <- transform(df_2, wt_malig_3= ifelse(malig_3==1, 2, 0))
df_2 <- transform(df_2, wt_tumor_3= ifelse(tumor_3==1, 6, 0))
df_2$wt_AMI_3[is.na(df_2$wt_AMI_3)] <- 0
df_2$wt_AMI_history_3[is.na(df_2$wt_AMI_history_3)] <- 0
df_2$wt_CHF_3[is.na(df_2$wt_CHF_3)] <- 0
df_2$wt_PVD_3[is.na(df_2$wt_PVD_3)] <- 0
df_2$wt_CVD_3[is.na(df_2$wt_CVD_3)] <- 0
df_2$wt_COPD_3[is.na(df_2$wt_COPD_3)] <- 0
df_2$wt_dementia_3[is.na(df_2$wt_dementia_3)] <- 0
df_2$wt_paralysis_3[is.na(df_2$wt_paralysis_3)] <- 0
df_2$wt_diabetes_3[is.na(df_2$wt_diabetes_3)] <- 0
df_2$wt_diabetes_w_comp_3[is.na(df_2$wt_diabetes_w_comp_3)] <- 0
df_2$wt_renal_3[is.na(df_2$wt_renal_3)] <- 0
df_2$wt_mild_liver_3[is.na(df_2$wt_mild_liver_3)] <- 0
df_2$wt_mod_sev_liver_3[is.na(df_2$wt_mod_sev_liver_3)] <- 0
df_2$wt_peptic_ulcer_3[is.na(df_2$wt_peptic_ulcer_3)] <- 0
df_2$wt_rheumatologic_3[is.na(df_2$wt_rheumatologic_3)] <- 0
df_2$wt_aids_3[is.na(df_2$wt_aids_3)] <- 0
df_2$wt_malig_3[is.na(df_2$wt_malig_3)] <- 0
df_2$wt_tumor_3[is.na(df_2$wt_tumor_3)] <- 0
#Create Charlson score
df_2$CCI <- df_2$wt_AMI_1 + df_2$wt_AMI_history_1 + df_2$wt_CHF_1 + df_2$wt_PVD_1 + df_2$wt_CVD_1 + df_2$wt_COPD_1 +
df_2$wt_dementia_1 + df_2$wt_paralysis_1 + df_2$wt_diabetes_1 + df_2$wt_diabetes_w_comp_1 + df_2$wt_renal_1 +
df_2$wt_mild_liver_1 + df_2$wt_mod_sev_liver_1 + df_2$wt_peptic_ulcer_1 + df_2$wt_rheumatologic_1 + df_2$wt_aids_1 +
df_2$wt_AMI_2 + df_2$wt_AMI_history_2 + df_2$wt_CHF_2 + df_2$wt_PVD_2 + df_2$wt_CVD_2 + df_2$wt_COPD_2 +
df_2$wt_dementia_2 + df_2$wt_paralysis_2 + df_2$wt_diabetes_2 + df_2$wt_diabetes_w_comp_2 + df_2$wt_renal_2 +
df_2$wt_mild_liver_2 + df_2$wt_mod_sev_liver_2 + df_2$wt_peptic_ulcer_2 + df_2$wt_rheumatologic_2 + df_2$wt_aids_2 +
df_2$wt_AMI_3 + df_2$wt_AMI_history_3 + df_2$wt_CHF_3 + df_2$wt_PVD_3 + df_2$wt_CVD_3 + df_2$wt_COPD_3 +
df_2$wt_dementia_3 + df_2$wt_paralysis_3 + df_2$wt_diabetes_3 + df_2$wt_diabetes_w_comp_3 + df_2$wt_renal_3 +
df_2$wt_mild_liver_3 + df_2$wt_mod_sev_liver_3 + df_2$wt_peptic_ulcer_3 + df_2$wt_rheumatologic_3 + df_2$wt_aids_3 + df_2$wt_malig_1 + df_2$wt_tumor_1 + df_2$wt_malig_2 + df_2$wt_tumor_2 + df_2$wt_malig_3 + df_2$wt_tumor_3
#table(df_2['diag_3'])
#max_glu_serum
df_2 <- df_2 %>% mutate(max_glu_serum=recode(max_glu_serum,
'None'=0,
'Norm'=1,
'>200'=2,
'>300'=3))
#A1Cresult
df_2 <- df_2 %>% mutate(A1Cresult=recode(A1Cresult,
'None'=0,
'Norm'=1,
'>7'=2,
'>8'=3))
#metformin:metformin.pioglitazone
df_2 <- df_2 %>%
mutate_at(vars(metformin: metformin.pioglitazone), recode, 'No'=0,'Steady'=1,'Up'=1,'Down'=1)
#df_2 <- df_2 %>%
#mutate_at(vars(metformin: metformin.pioglitazone), recode, 'No'=0,'Steady'=1,'Up'=2,'Down'=3)
#change
df_2 <- df_2 %>% mutate(change=recode(change,
'Ch' = 1,
'No' = 0))
#diabetesMed
df_2 <- df_2 %>% mutate(diabetesMed=recode(diabetesMed,
'Yes' = 1,
'No' = 0))
#convert to factors
#************************
df_2$race <- as.factor(df_2$race)
df_2$gender <- as.factor(df_2$gender)
df_2$admission_type_id <- as.factor(df_2$admission_type_id)
df_2$discharge_disposition_id <- as.factor(df_2$discharge_disposition_id)
df_2$admission_source_id <- as.factor(df_2$admission_source_id)
df_2$payer_code <- as.factor(df_2$payer_code)
df_2$medical_specialty <- as.factor(df_2$medical_specialty)
df_2$diag_1 <- as.factor(df_2$diag_1)
df_2$diag_2 <- as.factor(df_2$diag_2)
df_2$diag_3 <- as.factor(df_2$diag_3)
df_2$max_glu_serum <- as.factor(df_2$max_glu_serum)
df_2$A1Cresult <- as.factor(df_2$A1Cresult)
df_2$metformin <- as.factor(df_2$metformin)
df_2$repaglinide <- as.factor(df_2$repaglinide)
df_2$nateglinide <- as.factor(df_2$nateglinide)
df_2$chlorpropamide <- as.factor(df_2$chlorpropamide)
df_2$glimepiride <- as.factor(df_2$glimepiride)
df_2$acetohexamide <- as.factor(df_2$acetohexamide)
df_2$glipizide <- as.factor(df_2$glipizide)
df_2$glyburide <- as.factor(df_2$glyburide)
df_2$tolbutamide <- as.factor(df_2$tolbutamide)
df_2$pioglitazone <- as.factor(df_2$pioglitazone)
df_2$rosiglitazone <- as.factor(df_2$rosiglitazone)
df_2$acarbose <- as.factor(df_2$acarbose)
df_2$miglitol <- as.factor(df_2$miglitol)
df_2$troglitazone <- as.factor(df_2$troglitazone)
df_2$tolazamide <- as.factor(df_2$tolazamide)
df_2$examide <- as.factor(df_2$examide)
df_2$citoglipton <- as.factor(df_2$citoglipton)
df_2$insulin <- as.factor(df_2$insulin)
df_2$glyburide.metformin <- as.factor(df_2$glyburide.metformin)
df_2$glipizide.metformin <- as.factor(df_2$glipizide.metformin)
df_2$glimepiride.pioglitazone <- as.factor(df_2$glimepiride.pioglitazone)
df_2$metformin.rosiglitazone <- as.factor(df_2$metformin.rosiglitazone)
df_2$metformin.pioglitazone <- as.factor(df_2$metformin.pioglitazone)
df_2$change <- as.factor(df_2$change)
df_2$diabetesMed <- as.factor(df_2$diabetesMed)
df_2$readmitted <- as.factor(df_2$readmitted)
#drop one level factors and unneeded columns
dev_df <- subset(df_2, select=-c(weight,encounter_id, patient_nbr, examide, citoglipton,
diag_1, diag_2, diag_3,
wt_AMI_1 , wt_AMI_history_1 , wt_CHF_1 , wt_PVD_1 , wt_CVD_1 , wt_COPD_1 ,
wt_dementia_1 , wt_paralysis_1 , wt_diabetes_1 , wt_diabetes_w_comp_1 , wt_renal_1 ,
wt_mild_liver_1 , wt_mod_sev_liver_1 , wt_peptic_ulcer_1 , wt_rheumatologic_1 , wt_aids_1 ,
wt_AMI_2 , wt_AMI_history_2 , wt_CHF_2 , wt_PVD_2 , wt_CVD_2 , wt_COPD_2 ,
wt_dementia_2 , wt_paralysis_2 , wt_diabetes_2 , wt_diabetes_w_comp_2 , wt_renal_2 ,
wt_mild_liver_2 , wt_mod_sev_liver_2 , wt_peptic_ulcer_2 , wt_rheumatologic_2 , wt_aids_2 ,
wt_AMI_3 , wt_AMI_history_3 , wt_CHF_3 , wt_PVD_3 , wt_CVD_3 , wt_COPD_3 ,
wt_dementia_3 , wt_paralysis_3 , wt_diabetes_3 , wt_diabetes_w_comp_3 , wt_renal_3 ,
wt_mild_liver_3 , wt_mod_sev_liver_3 , wt_peptic_ulcer_3 , wt_rheumatologic_3 , wt_aids_3,
AMI_1, AMI_history_1, CHF_1,PVD_1,CVD_1,COPD_1,dementia_1,paralysis_1,diabetes_1,
diabetes_w_comp_1,renal_1, mild_liver_1,mod_sev_liver_1,peptic_ulcer_1,rheumatologic_1,aids_1,
AMI_2, AMI_history_2, CHF_2,PVD_2,CVD_2,COPD_2,dementia_2,paralysis_2,diabetes_2,
diabetes_w_comp_2,renal_2, mild_liver_2,mod_sev_liver_2,peptic_ulcer_2,rheumatologic_2,aids_2,
AMI_3, AMI_history_3, CHF_3,PVD_3,CVD_3,COPD_3,dementia_3,paralysis_3,diabetes_3,
diabetes_w_comp_3,renal_3, mild_liver_3,mod_sev_liver_3,peptic_ulcer_3,rheumatologic_3,aids_3, malig_1, malig_2, malig_3, wt_malig_1,
wt_malig_2, wt_malig_3, tumor_1, tumor_2, tumor_3, wt_tumor_1, wt_tumor_2,
wt_tumor_3))
ggplot(dev_df, aes(x=readmitted, y=age, fill=readmitted)) +
geom_boxplot()
ggplot(dev_df, aes(x=readmitted, y=time_in_hospital, fill=readmitted)) +
geom_boxplot()
ggplot(dev_df, aes(x=readmitted, y=num_lab_procedures, fill=readmitted)) +
geom_boxplot()
ggplot(dev_df, aes(x=readmitted, y=num_procedures, fill=readmitted)) +
geom_boxplot()
ggplot(dev_df, aes(x=readmitted, y=num_medications, fill=readmitted)) +
geom_boxplot()
ggplot(dev_df, aes(x=readmitted, y=number_outpatient, fill=readmitted)) +
geom_boxplot()
ggplot(dev_df, aes(x=readmitted, y=number_emergency, fill=readmitted)) +
geom_boxplot()
ggplot(dev_df, aes(x=readmitted, y=number_inpatient, fill=readmitted)) +
geom_boxplot()
ggplot(dev_df, aes(x=readmitted, y=number_diagnoses, fill=readmitted)) +
geom_boxplot()
ggplot(dev_df, aes(x=readmitted, y=CCI, fill=readmitted)) +
geom_boxplot()
From the box plots, it is evident that age and number of diagnoses are left skewed. On the other hand, number of lab procedures, number of medications, number of procedures, number of emergency/inpatient/outpatient visits, total time in hospital, and CCI are right skewed. Number of inpatient visits, number of diagnoses, and CCI appear to impact the target variable the most.
#categorical variables
cat_df <- dev_df %>% select_if(~class(.) == 'factor')
marrangeGrob(
map(
names(cat_df),
~ ggplot(dev_df, aes_string(.x)) +
geom_bar() + geom_bar(aes(fill = readmitted)) +
theme(legend.position="none")
),
ncol = 4,
nrow = 4,
top = "Variable Distribution"
)
#table(dev_df$readmitted)
#11314/88029 #12.9%
From the bar plots it is evident that there is a class imbalance as there are many more cases of not being readmitted vs being readmitted. In addition, many of the medication variables have few unique values and a large ratio of frequency of the most common value to the frequency of the other value. We will remove variables with these characteristics since they will not be useful for prediction.
nearZeroVar(cat_df, freqCut = 95/5, uniqueCut = 10, saveMetrics = FALSE,
names = TRUE, allowParallel = TRUE)
## [1] "max_glu_serum" "repaglinide"
## [3] "nateglinide" "chlorpropamide"
## [5] "acetohexamide" "tolbutamide"
## [7] "acarbose" "miglitol"
## [9] "troglitazone" "tolazamide"
## [11] "glyburide.metformin" "glipizide.metformin"
## [13] "glimepiride.pioglitazone" "metformin.rosiglitazone"
## [15] "metformin.pioglitazone"
The above variables have near zero variance so we will remove them from the modeling data.
#age and payer code
payer_names <- list(
'0'="Missing",
'1'="Medicare",
'2'= "Other",
'3'= "Self_Pay"
)
payer_labeller <- function(variable,value){
return(payer_names[value])
}
ggplot(dev_df, aes(age, fill=readmitted)) +
geom_bar(position = "fill") +
facet_wrap(~payer_code, ncol=2, scales="free", labeller=payer_labeller) +
scale_y_continuous() +
ylab("Percent")
#race and gender
race_names <- list(
'0'="Caucasian",
'1'="Non-caucasian"
)
race_labeller <- function(variable,value){
return(race_names[value])
}
ggplot(dev_df, aes(gender, fill=readmitted)) +
geom_bar(position = "fill") +
facet_wrap(~race, ncol=2, scales="free", labeller=race_labeller) +
scale_y_continuous() +
ylab("Percent")
#payer code and discharge disposition id
ggplot(dev_df, aes(discharge_disposition_id, fill=readmitted)) +
geom_bar(position = "fill") +
facet_wrap(~payer_code, ncol=2, scales="free", labeller=payer_labeller) +
scale_y_continuous() +
ylab("Percent")
#time in hospital and diabetesmed
ggplot(dev_df, aes(time_in_hospital, fill=readmitted)) +
geom_bar(position = "fill") +
facet_wrap(~diabetesMed, ncol=2, scales="free") +
scale_y_continuous() +
ylab("Percent")
#num meds and time in hospital
ggplot(dev_df, aes(x=num_medications, y=time_in_hospital, color=readmitted)) +
geom_point()
Investigating the relationship between age and payer code, it appears that 25 year old patients using Medicare have a higher chance of being readmitted. It appears that race and gender do not have a large impact on likelihood of being readmitted. Patients using Medicare, self-pay, or have missing payer code are more likely to be readmitted if they were discharged, transferred, or still a patient. Patients using other health care insurance were more likely to be readmitted of they left the hospital against medical advice. There does not appear to be a relationship between time in hospital and whether a diabetes medication was prescribed. It is evident that there is a direct relationship between number of medications and time in hospital.
num_df <- select_if(dev_df, is.numeric)
#correlation matrix
data_cor = cor(num_df)
corrplot(data_cor, type = "lower")
Based on the correlation matrix, it is evident that there is low correlation between numerical predictors.
Some categorical predictors such as diag_1
and medical_specialy
are potentially correlated since they provide similar information. We will use Cramer’s V to test the strength of association between two nominal predictors. It ranges from 0 to 1 where 0 indicates no association between the two variables and 1 indicates a perfect association between the two variables.
cramersV(dev_df$diabetesMed, dev_df$insulin) #0.5802898
## [1] 0.5802898
cramersV(dev_df$diabetesMed, dev_df$metformin) #0.2712807
## [1] 0.2712807
#Final modeling data set
#drop diabetes medications (0 variance variables) and correlated variables
dev_df_final <- subset(dev_df, select=-c(metformin, repaglinide, nateglinide, chlorpropamide,
glimepiride, acetohexamide, glipizide, glyburide,
tolbutamide, pioglitazone, rosiglitazone,
acarbose, miglitol, troglitazone, tolazamide, insulin,
glyburide.metformin, glipizide.metformin, glimepiride.pioglitazone,
metformin.rosiglitazone, metformin.pioglitazone, rosiglitazone,
medical_specialty, max_glu_serum, A1Cresult
))
Based on the results, dibetesMed
and insulin
have a strong association and dibetesMed
and metformin
have a moderate association SincediabetesMed
and the other diabetic medications such as insulin
probably have strong associations and to limit the number of predictors in our models we will keep diabetesMed
and drop the other 22 anti-diabetic medications. In addition, since medical_specialty
and diag_1/2/3
probably have an association we will drop medical_specialty
since it has a greater percentage of missing values.
Readmission (target variable = 1) rate is about 12% in the development data. Since our data is imbalanced and contains categorical predictors we will test both under and over random under-sampling to create a balanced data set. Let’s start with under-sampling first.
#table(dev_df_final$readmitted)
n_readmitted <- 11314
new_n_total <- n_readmitted/0.5
undersampled_df <- ovun.sample(readmitted ~ ., data=dev_df_final, method="under", N=new_n_total, seed=123)
under_df <- undersampled_df$data
#table(under_df$readmitted)
In our baseline model, we will keep all possible predictors.
set.seed(1234)
indexes_under = createDataPartition(under_df$readmitted, p = .7, list = F)
train_under = under_df[indexes_under, ]
test_under = under_df[-indexes_under, ]
# Training model
lr_under <- glm(readmitted ~ .,
data = train_under,
family = "binomial")
# Summary
summary(lr_under)
##
## Call:
## glm(formula = readmitted ~ ., family = "binomial", data = train_under)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2957 -1.0828 -0.3372 1.1436 1.7137
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2198946 0.1434747 -8.503 < 2e-16 ***
## race1 -0.0398364 0.0405898 -0.981 0.326377
## gender1 -0.0280484 0.0333856 -0.840 0.400833
## age 0.0027816 0.0012233 2.274 0.022972 *
## admission_type_id1 -0.0740245 0.0863268 -0.857 0.391173
## admission_type_id2 -0.0866521 1.4204128 -0.061 0.951355
## admission_type_id3 -0.0150404 0.0764383 -0.197 0.844011
## discharge_disposition_id1 -0.2231381 0.0807307 -2.764 0.005710 **
## discharge_disposition_id2 0.2705246 0.0824316 3.282 0.001031 **
## discharge_disposition_id3 0.1316901 0.2123430 0.620 0.535142
## admission_source_id1 0.2160858 0.1143299 1.890 0.058755 .
## admission_source_id2 0.1675115 0.0929713 1.802 0.071584 .
## admission_source_id3 0.1835964 0.0930896 1.972 0.048580 *
## time_in_hospital 0.0039724 0.0066233 0.600 0.548663
## payer_code1 -0.1348853 0.0434050 -3.108 0.001886 **
## payer_code2 -0.1792118 0.0469923 -3.814 0.000137 ***
## payer_code3 -0.0434589 0.0814418 -0.534 0.593605
## num_lab_procedures 0.0009287 0.0009577 0.970 0.332225
## num_procedures -0.0094872 0.0111462 -0.851 0.394679
## num_medications 0.0031718 0.0026573 1.194 0.232628
## number_outpatient -0.0064254 0.0139288 -0.461 0.644582
## number_emergency 0.1036519 0.0232970 4.449 8.62e-06 ***
## number_inpatient 0.2842894 0.0139855 20.327 < 2e-16 ***
## number_diagnoses 0.0606576 0.0097549 6.218 5.03e-10 ***
## change1 0.0182188 0.0391377 0.466 0.641571
## diabetesMed1 0.1604345 0.0465866 3.444 0.000574 ***
## CCI 0.0677853 0.0110828 6.116 9.58e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21959 on 15839 degrees of freedom
## Residual deviance: 20786 on 15813 degrees of freedom
## AIC: 20840
##
## Number of Fisher Scoring iterations: 4
From the summary, race
, gender
, admission_type_id
, time_in_hospital
, num_lab_procedures
, num_procedures
, number_outpatient
, and change
are not significant to the model. We will drop these predictors in the final model.
# Training model
lr_under_2 <- glm(readmitted ~ age + discharge_disposition_id + +admission_source_id + payer_code
+ number_emergency + number_inpatient +
number_diagnoses + diabetesMed + CCI,
data = train_under,
family = "binomial")
# Summary
summary(lr_under_2)
##
## Call:
## glm(formula = readmitted ~ age + discharge_disposition_id + +admission_source_id +
## payer_code + number_emergency + number_inpatient + number_diagnoses +
## diabetesMed + CCI, family = "binomial", data = train_under)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2755 -1.0818 -0.3346 1.1426 1.6837
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.208071 0.138684 -8.711 < 2e-16 ***
## age 0.002738 0.001202 2.277 0.022759 *
## discharge_disposition_id1 -0.226368 0.080062 -2.827 0.004693 **
## discharge_disposition_id2 0.283686 0.082078 3.456 0.000548 ***
## discharge_disposition_id3 0.130082 0.211824 0.614 0.539147
## admission_source_id1 0.199098 0.093768 2.123 0.033728 *
## admission_source_id2 0.165189 0.068108 2.425 0.015292 *
## admission_source_id3 0.146587 0.071112 2.061 0.039269 *
## payer_code1 -0.139390 0.042478 -3.281 0.001033 **
## payer_code2 -0.186422 0.046184 -4.036 5.43e-05 ***
## payer_code3 -0.055246 0.080112 -0.690 0.490443
## number_emergency 0.100059 0.023120 4.328 1.51e-05 ***
## number_inpatient 0.285308 0.013879 20.556 < 2e-16 ***
## number_diagnoses 0.066087 0.009289 7.114 1.12e-12 ***
## diabetesMed1 0.183857 0.040527 4.537 5.72e-06 ***
## CCI 0.068926 0.011061 6.231 4.63e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21959 on 15839 degrees of freedom
## Residual deviance: 20795 on 15824 degrees of freedom
## AIC: 20827
##
## Number of Fisher Scoring iterations: 4
# Predict test data based on model
predict_lr_under_2 <- predict(lr_under_2,
test_under, type = "response")
# Changing probabilities
predict_lr_under_2 <- ifelse(predict_lr_under_2 >0.5, 1, 0)
# Evaluating model accuracy
# using confusion matrix
#table(test_under$readmitted, predict_lr_under_2)
confusionMatrix(as.factor(predict_lr_under_2), test_under$readmitted, positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2188 1475
## 1 1206 1919
##
## Accuracy : 0.605
## 95% CI : (0.5933, 0.6167)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2101
##
## Mcnemar's Test P-Value : 2.268e-07
##
## Sensitivity : 0.5654
## Specificity : 0.6447
## Pos Pred Value : 0.6141
## Neg Pred Value : 0.5973
## Prevalence : 0.5000
## Detection Rate : 0.2827
## Detection Prevalence : 0.4604
## Balanced Accuracy : 0.6050
##
## 'Positive' Class : 1
##
#p <- predict(lr_under_2, test_under, type = 'response')
# ROC-AUC Curve
ROCPred <- prediction(predict_lr_under_2, test_under$readmitted)
ROCPer <- performance(ROCPred, measure = "tpr",
x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
#auc #0.6050383
# Plotting curve
plot(ROCPer, main = "Logistic Regression ROC AUC - Under-sampling",col = 'blue', lwd = 2)
abline(a = 0,b = 1,lwd = 2,lty = 2)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
Our final logistic regression model built from under-sampled data has AUC of 60.5, sensitivity of 56.5, and specificity of 64.5.
accuracy_tune <- function(fit) {
predict_over <- predict(fit, test_under, type = 'class')
table_mat <- table(test_under$readmitted, predict_over)
accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
accuracy_Test
}
#control <- rpart.control(minsplit = 100,
# minbucket = round(100 / 3),
# maxdepth = 30,
# cp = 0)
#tune_fit <- rpart(readmitted~., data = train_under, method = 'class', control = control)
#accuracy_tune(tune_fit) #0.5767531
control <- rpart.control(minsplit = 100,
minbucket = round(100 / 4),
maxdepth = 10,
cp = 0.001)
tune_fit_4_over <- rpart(readmitted~., data = train_under, method = 'class', control = control)
#accuracy_tune(tune_fit_4_over) #0.604449
printcp(tune_fit_4_over)
##
## Classification tree:
## rpart(formula = readmitted ~ ., data = train_under, method = "class",
## control = control)
##
## Variables actually used in tree construction:
## [1] admission_source_id age CCI
## [4] diabetesMed discharge_disposition_id num_lab_procedures
## [7] num_medications number_diagnoses number_emergency
## [10] number_inpatient payer_code time_in_hospital
##
## Root node error: 7920/15840 = 0.5
##
## n= 15840
##
## CP nsplit rel error xerror xstd
## 1 0.1904040 0 1.00000 1.01490 0.0079446
## 2 0.0267677 1 0.80960 0.80960 0.0078002
## 3 0.0042929 2 0.78283 0.78333 0.0077568
## 4 0.0037879 3 0.77854 0.78674 0.0077627
## 5 0.0022727 4 0.77475 0.78460 0.0077590
## 6 0.0018939 6 0.77020 0.78409 0.0077581
## 7 0.0017677 7 0.76831 0.78321 0.0077566
## 8 0.0014731 8 0.76654 0.78359 0.0077572
## 9 0.0013889 13 0.75821 0.78207 0.0077545
## 10 0.0012626 14 0.75682 0.78169 0.0077539
## 11 0.0011364 17 0.75227 0.78144 0.0077534
## 12 0.0010101 25 0.74306 0.77980 0.0077505
## 13 0.0010000 26 0.74205 0.77955 0.0077500
#prp(tune_fit_4_over,type=2,extra=106,nn = TRUE,branch=1,varlen=0,yesno=2)
#as.data.frame(tune_fit_4_over$variable.importance)
dt_varimport <- data.frame(imp = tune_fit_4_over$variable.importance)
dt_varimport_2 <- dt_varimport %>%
rownames_to_column() %>%
rename("variable" = rowname) %>%
arrange(imp) %>%
mutate(variable = forcats::fct_inorder(variable))
ggplot(dt_varimport_2) +
geom_col(aes(x = variable, y = imp),
show.legend = F) +
coord_flip()
predicted <- predict(tune_fit_4_over, test_under, type="class")
#table(test_under$readmitted, predicted)
confusionMatrix(test_under$readmitted, predicted)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1849 1545
## 1 1140 2254
##
## Accuracy : 0.6044
## 95% CI : (0.5927, 0.6161)
## No Information Rate : 0.5597
## P-Value [Acc > NIR] : 4.595e-14
##
## Kappa : 0.2089
##
## Mcnemar's Test P-Value : 6.356e-15
##
## Sensitivity : 0.6186
## Specificity : 0.5933
## Pos Pred Value : 0.5448
## Neg Pred Value : 0.6641
## Prevalence : 0.4403
## Detection Rate : 0.2724
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.6060
##
## 'Positive' Class : 0
##
pred <- prediction(predict(tune_fit_4_over, test_under, type="prob")[, 2], test_under$readmitted)
auc <- performance(pred, "auc")
auc <- auc@y.values[[1]]
#auc #0.6360124
plot(performance(pred, 'tpr', 'fpr'), col='blue', main='Decision Tree ROC AUC - Under-sampling')
abline(0, 1, lty=2)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
After tuning and fitting, our final decision tree model using under-sampled data has AUC of 63.6, sensitivity of 61.9, and specificity of 59.3. The top 3 most important predictors are number_inpatient
, discharge_disposition_id
, and number_emergency
.
n_not_readmitted <- 88029
new_n_total <- n_not_readmitted/0.5
oversampled_df <- ovun.sample(readmitted ~ ., data=dev_df_final, method="over", N=new_n_total, seed=123)
over_df <- oversampled_df$data
#table(over_df$readmitted)
Again in our baseline model we will keep all possible predictors.
indexes_over = createDataPartition(over_df$readmitted, p = .7, list = F)
train_over = over_df[indexes_over, ]
test_over = over_df[-indexes_over, ]
# Training model
lr_over <- glm(readmitted ~ .,
data = train_over,
family = "binomial")
# Summary
summary(lr_over)
##
## Call:
## glm(formula = readmitted ~ ., family = "binomial", data = train_over)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5118 -1.0865 -0.3091 1.1483 1.6931
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1142471 0.0517499 -21.531 < 2e-16 ***
## race1 0.0039574 0.0145498 0.272 0.785628
## gender1 -0.0456644 0.0119697 -3.815 0.000136 ***
## age 0.0021106 0.0004379 4.820 1.44e-06 ***
## admission_type_id1 -0.0151414 0.0305743 -0.495 0.620436
## admission_type_id2 0.7677991 0.5781769 1.328 0.184189
## admission_type_id3 0.0563767 0.0269691 2.090 0.036580 *
## discharge_disposition_id1 -0.2332626 0.0289258 -8.064 7.37e-16 ***
## discharge_disposition_id2 0.2369942 0.0295786 8.012 1.13e-15 ***
## discharge_disposition_id3 0.1438469 0.0777047 1.851 0.064141 .
## admission_source_id1 0.0528998 0.0407973 1.297 0.194751
## admission_source_id2 0.1545198 0.0330516 4.675 2.94e-06 ***
## admission_source_id3 0.1551447 0.0331373 4.682 2.84e-06 ***
## time_in_hospital 0.0090983 0.0023902 3.806 0.000141 ***
## payer_code1 -0.1192046 0.0154763 -7.702 1.34e-14 ***
## payer_code2 -0.2171511 0.0168570 -12.882 < 2e-16 ***
## payer_code3 -0.0948955 0.0295048 -3.216 0.001299 **
## num_lab_procedures 0.0004336 0.0003444 1.259 0.208090
## num_procedures -0.0127553 0.0039835 -3.202 0.001365 **
## num_medications 0.0042848 0.0009525 4.498 6.85e-06 ***
## number_outpatient -0.0042054 0.0047410 -0.887 0.375056
## number_emergency 0.0894122 0.0077810 11.491 < 2e-16 ***
## number_inpatient 0.2876933 0.0049548 58.064 < 2e-16 ***
## number_diagnoses 0.0449196 0.0034864 12.884 < 2e-16 ***
## change1 -0.0063233 0.0140351 -0.451 0.652321
## diabetesMed1 0.1807648 0.0167038 10.822 < 2e-16 ***
## CCI 0.0824808 0.0040210 20.512 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 170850 on 123241 degrees of freedom
## Residual deviance: 161676 on 123215 degrees of freedom
## AIC: 161730
##
## Number of Fisher Scoring iterations: 4
The predictors race
, num_lab_procedures
, number_outpatient
, and change
are not significant so we will drop then in the final model.
# Training model
lr_over_2 <- glm(readmitted ~ gender + age + admission_type_id + discharge_disposition_id +
admission_source_id + time_in_hospital + payer_code + num_procedures +
num_medications + number_emergency + number_inpatient +
number_diagnoses + diabetesMed + CCI,
data = train_over,
family = "binomial")
# Summary
summary(lr_over_2)
##
## Call:
## glm(formula = readmitted ~ gender + age + admission_type_id +
## discharge_disposition_id + admission_source_id + time_in_hospital +
## payer_code + num_procedures + num_medications + number_emergency +
## number_inpatient + number_diagnoses + diabetesMed + CCI,
## family = "binomial", data = train_over)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5062 -1.0867 -0.3112 1.1488 1.6912
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1048737 0.0508794 -21.716 < 2e-16 ***
## gender1 -0.0456124 0.0119437 -3.819 0.000134 ***
## age 0.0020873 0.0004326 4.825 1.40e-06 ***
## admission_type_id1 -0.0180604 0.0302819 -0.596 0.550901
## admission_type_id2 0.7684483 0.5781426 1.329 0.183793
## admission_type_id3 0.0556922 0.0268855 2.071 0.038316 *
## discharge_disposition_id1 -0.2318455 0.0288848 -8.027 1.00e-15 ***
## discharge_disposition_id2 0.2370022 0.0295508 8.020 1.06e-15 ***
## discharge_disposition_id3 0.1457847 0.0776799 1.877 0.060554 .
## admission_source_id1 0.0620893 0.0403376 1.539 0.123746
## admission_source_id2 0.1639098 0.0323086 5.073 3.91e-07 ***
## admission_source_id3 0.1607654 0.0328390 4.896 9.80e-07 ***
## time_in_hospital 0.0097914 0.0023361 4.191 2.77e-05 ***
## payer_code1 -0.1223920 0.0153155 -7.991 1.33e-15 ***
## payer_code2 -0.2205474 0.0166930 -13.212 < 2e-16 ***
## payer_code3 -0.0987262 0.0292680 -3.373 0.000743 ***
## num_procedures -0.0126409 0.0039744 -3.181 0.001470 **
## num_medications 0.0044094 0.0009209 4.788 1.68e-06 ***
## number_emergency 0.0888016 0.0077497 11.459 < 2e-16 ***
## number_inpatient 0.2873435 0.0049384 58.186 < 2e-16 ***
## number_diagnoses 0.0448999 0.0034712 12.935 < 2e-16 ***
## diabetesMed1 0.1769868 0.0147853 11.970 < 2e-16 ***
## CCI 0.0824999 0.0040202 20.521 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 170850 on 123241 degrees of freedom
## Residual deviance: 161679 on 123219 degrees of freedom
## AIC: 161725
##
## Number of Fisher Scoring iterations: 4
# Predict test data based on model
predict_lr_over_2 <- predict(lr_over_2,
test_over, type = "response")
# Changing probabilities
predict_lr_over_2 <- ifelse(predict_lr_over_2 >0.5, 1, 0)
# Evaluating model accuracy
# using confusion matrix
#table(test_over$readmitted, predict_lr_over_2)
confusionMatrix(as.factor(predict_lr_over_2), test_over$readmitted, positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 17420 11753
## 1 8988 14655
##
## Accuracy : 0.6073
## 95% CI : (0.6031, 0.6115)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2146
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5549
## Specificity : 0.6596
## Pos Pred Value : 0.6198
## Neg Pred Value : 0.5971
## Prevalence : 0.5000
## Detection Rate : 0.2775
## Detection Prevalence : 0.4476
## Balanced Accuracy : 0.6073
##
## 'Positive' Class : 1
##
# ROC-AUC Curve
ROCPred <- prediction(predict_lr_over_2, test_over$readmitted)
ROCPer <- performance(ROCPred, measure = "tpr",
x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
#auc #0.607297
# Plotting curve
plot(ROCPer, main = "Logistic Regression ROC AUC - Over-sampling",col = 'blue', lwd = 2)
abline(a = 0,b = 1,lwd = 2,lty = 2)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
The final logistic regression model using over-sampling has AUC of 60.7, sensitivity 55.5, and specificity 66.0.
control <- rpart.control(minsplit = 100,
minbucket = round(100 / 3),
maxdepth = 30,
cp = 0)
tune_fit <- rpart(readmitted~., data = train_over, method = 'class', control = control)
#accuracy_tune(tune_fit) #0.6946081
control <- rpart.control(minsplit = 50,
minbucket = round(100/3),
maxdepth = 20,
cp = 0)
tune_fit_4 <- rpart(readmitted~., data = train_over, method = 'class', control = control)
#accuracy_tune(tune_fit_4) #0.7144962
printcp(tune_fit_4)
##
## Classification tree:
## rpart(formula = readmitted ~ ., data = train_over, method = "class",
## control = control)
##
## Variables actually used in tree construction:
## [1] admission_source_id admission_type_id age
## [4] CCI change diabetesMed
## [7] discharge_disposition_id gender num_lab_procedures
## [10] num_medications num_procedures number_diagnoses
## [13] number_emergency number_inpatient number_outpatient
## [16] payer_code race time_in_hospital
##
## Root node error: 61621/123242 = 0.5
##
## n= 123242
##
## CP nsplit rel error xerror xstd
## 1 1.8648e-01 0 1.00000 1.00729 0.0028485
## 2 2.1097e-02 1 0.81352 0.81805 0.0028010
## 3 3.5296e-03 2 0.79242 0.79315 0.0027869
## 4 2.9698e-03 4 0.78537 0.78700 0.0027832
## 5 1.7851e-03 5 0.78240 0.78408 0.0027813
## 6 1.5579e-03 6 0.78061 0.78397 0.0027813
## 7 1.4443e-03 12 0.76983 0.78264 0.0027804
## 8 1.0711e-03 13 0.76839 0.77480 0.0027754
## 9 9.8992e-04 14 0.76732 0.77219 0.0027736
## 10 9.8181e-04 15 0.76633 0.77191 0.0027734
## 11 9.5747e-04 17 0.76437 0.77178 0.0027734
## 12 9.4124e-04 18 0.76341 0.77177 0.0027733
## 13 8.7632e-04 19 0.76247 0.77131 0.0027730
## 14 6.3290e-04 20 0.76159 0.76800 0.0027708
## 15 6.1667e-04 21 0.76096 0.76604 0.0027695
## 16 5.8963e-04 24 0.75904 0.76508 0.0027688
## 17 5.4635e-04 28 0.75638 0.76372 0.0027679
## 18 5.4365e-04 35 0.75073 0.76227 0.0027669
## 19 5.0308e-04 37 0.74965 0.76127 0.0027662
## 20 4.7062e-04 40 0.74814 0.75742 0.0027634
## 21 4.4898e-04 42 0.74720 0.75619 0.0027626
## 22 4.4628e-04 55 0.74118 0.75562 0.0027622
## 23 4.3816e-04 58 0.73968 0.75523 0.0027619
## 24 4.0571e-04 59 0.73924 0.75296 0.0027602
## 25 3.8505e-04 62 0.73803 0.75122 0.0027590
## 26 3.5702e-04 74 0.73327 0.74903 0.0027574
## 27 3.4891e-04 75 0.73292 0.74744 0.0027562
## 28 3.4079e-04 78 0.73183 0.74639 0.0027554
## 29 3.2456e-04 82 0.73047 0.74471 0.0027541
## 30 3.2186e-04 84 0.72982 0.74356 0.0027533
## 31 3.1916e-04 96 0.72516 0.74342 0.0027532
## 32 3.1645e-04 118 0.71735 0.74329 0.0027531
## 33 3.0834e-04 121 0.71640 0.74251 0.0027525
## 34 2.9211e-04 126 0.71485 0.74045 0.0027509
## 35 2.7588e-04 131 0.71339 0.73728 0.0027485
## 36 2.6777e-04 137 0.71174 0.73467 0.0027464
## 37 2.6506e-04 139 0.71120 0.73269 0.0027449
## 38 2.5965e-04 150 0.70750 0.73184 0.0027442
## 39 2.5154e-04 157 0.70568 0.72763 0.0027408
## 40 2.4342e-04 165 0.70328 0.72415 0.0027380
## 41 2.3801e-04 166 0.70304 0.72193 0.0027362
## 42 2.3531e-04 176 0.70036 0.72172 0.0027360
## 43 2.2720e-04 180 0.69942 0.71928 0.0027340
## 44 2.1908e-04 198 0.69483 0.71552 0.0027308
## 45 2.1421e-04 205 0.69277 0.71373 0.0027293
## 46 2.1097e-04 215 0.69037 0.71088 0.0027269
## 47 2.0556e-04 226 0.68796 0.70942 0.0027256
## 48 2.0285e-04 229 0.68735 0.70768 0.0027241
## 49 2.0169e-04 258 0.68089 0.70747 0.0027239
## 50 2.0015e-04 265 0.67948 0.70739 0.0027239
## 51 1.9474e-04 269 0.67862 0.70521 0.0027219
## 52 1.8933e-04 305 0.67076 0.70408 0.0027209
## 53 1.8662e-04 333 0.66373 0.70129 0.0027185
## 54 1.8392e-04 339 0.66262 0.69962 0.0027170
## 55 1.8257e-04 353 0.65999 0.69924 0.0027166
## 56 1.8083e-04 358 0.65896 0.69924 0.0027166
## 57 1.7851e-04 382 0.65374 0.69580 0.0027135
## 58 1.7310e-04 449 0.63767 0.69368 0.0027116
## 59 1.7040e-04 455 0.63642 0.69223 0.0027103
## 60 1.6769e-04 480 0.63199 0.69002 0.0027082
## 61 1.6553e-04 483 0.63149 0.68710 0.0027055
## 62 1.6228e-04 490 0.63027 0.68667 0.0027051
## 63 1.5687e-04 540 0.62125 0.68306 0.0027017
## 64 1.5417e-04 544 0.62062 0.68194 0.0027006
## 65 1.5255e-04 554 0.61903 0.68181 0.0027005
## 66 1.4605e-04 566 0.61703 0.67824 0.0026971
## 67 1.4403e-04 603 0.61111 0.67673 0.0026956
## 68 1.4064e-04 611 0.60995 0.67170 0.0026906
## 69 1.3956e-04 614 0.60953 0.67128 0.0026902
## 70 1.3794e-04 624 0.60749 0.67081 0.0026898
## 71 1.3524e-04 640 0.60507 0.67013 0.0026891
## 72 1.3388e-04 646 0.60426 0.66747 0.0026864
## 73 1.2983e-04 655 0.60301 0.66687 0.0026858
## 74 1.2577e-04 684 0.59923 0.66570 0.0026846
## 75 1.2442e-04 688 0.59872 0.66143 0.0026803
## 76 1.2171e-04 731 0.59210 0.66143 0.0026803
## 77 1.2009e-04 760 0.58847 0.66122 0.0026801
## 78 1.1901e-04 769 0.58723 0.65866 0.0026774
## 79 1.1360e-04 787 0.58470 0.65633 0.0026750
## 80 1.0954e-04 835 0.57870 0.65481 0.0026734
## 81 1.0819e-04 840 0.57815 0.65466 0.0026733
## 82 1.0548e-04 847 0.57725 0.65106 0.0026695
## 83 1.0278e-04 861 0.57578 0.65091 0.0026693
## 84 9.7369e-05 864 0.57547 0.64902 0.0026673
## 85 9.5747e-05 924 0.56946 0.64721 0.0026654
## 86 9.3312e-05 947 0.56672 0.64640 0.0026645
## 87 9.1960e-05 951 0.56635 0.64629 0.0026644
## 88 8.9255e-05 954 0.56607 0.64616 0.0026642
## 89 8.6551e-05 983 0.56301 0.64202 0.0026598
## 90 8.5198e-05 989 0.56249 0.64134 0.0026590
## 91 8.3459e-05 1049 0.55499 0.64111 0.0026588
## 92 8.1141e-05 1056 0.55441 0.64006 0.0026576
## 93 7.8436e-05 1105 0.55038 0.63918 0.0026566
## 94 7.7084e-05 1122 0.54886 0.63837 0.0026557
## 95 7.5732e-05 1130 0.54824 0.63811 0.0026555
## 96 7.3027e-05 1141 0.54725 0.63790 0.0026552
## 97 7.0322e-05 1185 0.54382 0.63446 0.0026514
## 98 6.8970e-05 1195 0.54311 0.63425 0.0026512
## 99 6.7231e-05 1213 0.54184 0.63415 0.0026511
## 100 6.4913e-05 1223 0.54115 0.63301 0.0026498
## 101 6.1667e-05 1292 0.53590 0.63298 0.0026497
## 102 6.0856e-05 1303 0.53498 0.63311 0.0026499
## 103 5.9504e-05 1308 0.53467 0.63311 0.0026499
## 104 5.6799e-05 1320 0.53375 0.63204 0.0026487
## 105 5.5176e-05 1332 0.53307 0.63181 0.0026484
## 106 5.4094e-05 1337 0.53279 0.63178 0.0026484
## 107 5.1930e-05 1352 0.53198 0.63047 0.0026469
## 108 4.8685e-05 1357 0.53172 0.63008 0.0026465
## 109 4.5439e-05 1446 0.52644 0.63003 0.0026464
## 110 4.4628e-05 1451 0.52622 0.62910 0.0026454
## 111 4.3275e-05 1455 0.52604 0.62917 0.0026454
## 112 4.2193e-05 1475 0.52503 0.62933 0.0026456
## 113 4.0571e-05 1490 0.52422 0.62899 0.0026452
## 114 3.9411e-05 1504 0.52365 0.62909 0.0026453
## 115 3.7866e-05 1511 0.52338 0.62912 0.0026454
## 116 3.6514e-05 1519 0.52294 0.62907 0.0026453
## 117 3.2456e-05 1523 0.52279 0.62871 0.0026449
## 118 2.7047e-05 1581 0.52091 0.62889 0.0026451
## 119 2.4342e-05 1593 0.52059 0.62912 0.0026454
## 120 2.1638e-05 1601 0.52039 0.62883 0.0026450
## 121 1.6228e-05 1616 0.52007 0.62893 0.0026452
## 122 8.1141e-06 1655 0.51943 0.62860 0.0026448
## 123 5.4094e-06 1685 0.51919 0.62881 0.0026450
## 124 4.0571e-06 1691 0.51916 0.62896 0.0026452
## 125 0.0000e+00 1695 0.51914 0.62868 0.0026449
predicted <- predict(tune_fit_4, test_over, type="class")
#table(test_over$readmitted, predicted)
confusionMatrix(test_over$readmitted, predicted)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 17353 9055
## 1 6990 19418
##
## Accuracy : 0.6962
## 95% CI : (0.6923, 0.7001)
## No Information Rate : 0.5391
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3924
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7129
## Specificity : 0.6820
## Pos Pred Value : 0.6571
## Neg Pred Value : 0.7353
## Prevalence : 0.4609
## Detection Rate : 0.3286
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.6974
##
## 'Positive' Class : 0
##
pred <- prediction(predict(tune_fit_4, test_over, type="prob")[, 2], test_over$readmitted)
auc <- performance(pred, "auc")
auc <- auc@y.values[[1]]
#auc #0.7529766
#as.data.frame(tune_fit$variable.importance)
dt_varimport <- data.frame(imp = tune_fit_4$variable.importance)
dt_varimport_2 <- dt_varimport %>%
rownames_to_column() %>%
rename("variable" = rowname) %>%
arrange(imp) %>%
mutate(variable = forcats::fct_inorder(variable))
ggplot(dt_varimport_2) +
geom_col(aes(x = variable, y = imp),
show.legend = F) +
coord_flip()
plot(performance(pred, 'tpr', 'fpr'), col='blue', main='Decision Tree ROC AUC - Over-sampling')
abline(0, 1, lty=2)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
AUC for the decision tree model using over-sampled data is 75.3, sensitivity is 71.3, and specificity is 68.2. The top 3 important predictors are num_lab_procedures
, number_inpatient
, and number_medications
.
The table below summarizes the performance metrics for each model that was developed. Overall, the decision tree model using over-sampled data had the best performance across the board. It is interesting that this model out performed the others by such a large margin. In general, it appears the decision trees performed better than the logistic regression models and note than number_inpatient
was the top predictor in both decision trees.
#summary table
Model <- c("Logistic Regression (Under-sampling)", "Decision Tree (Under-sampling)", "Logistic Regression (Over-sampling)", "Decision Tree (Over-sampling)")
auc_values <- c(60.5, 63.6, 60.7, 75.3)
spec_values <- c(56.5, 61.9, 55.5, 71.3)
sens_values <- c(64.5, 59.3, 66.0, 68.2)
sum_df<- data.frame(Model, auc_values, spec_values, sens_values)
# rename columnscolnames(sum_df)[which(names(sum_df) == "auc_values")] <- "AUC"
colnames(sum_df)[which(names(sum_df) == "spec_values")] <- "Specificity"
colnames(sum_df)[which(names(sum_df) == "sens_values")] <- "Sensitivity"
sum_df
## Model auc_values Specificity Sensitivity
## 1 Logistic Regression (Under-sampling) 60.5 56.5 64.5
## 2 Decision Tree (Under-sampling) 63.6 61.9 59.3
## 3 Logistic Regression (Over-sampling) 60.7 55.5 66.0
## 4 Decision Tree (Over-sampling) 75.3 71.3 68.2
In general, the decision trees perform better than the logistic regression models. Since our models’ goal was to predict whether a patient is readmitted to the hospital or not, false negatives (incorrectly predicting a patient will not be readmitted) are more costly than false positives (incorrectly predicting a patient will be readmitted). So we will stress sensitivity as the most important performance metric because it is the ratio of true positives over actual positives. Thus, we recommend the decision tree model using over-sampled data as the best model to use based on this analysis. Future possible steps include testing each model on other data sets to see how they perform and testing other types of models.
https://www.guru99.com/r-decision-trees.html
https://www.statology.org/classification-and-regression-trees-in-r/
https://healthcaredelivery.cancer.gov/seermedicare/considerations/comorbidity-table.html
https://healthcaredelivery.cancer.gov/seermedicare/program/charlson.comorbidity.macro.v2000.txt http://mchp-appserv.cpe.umanitoba.ca/concept/Charlson%20Comorbidities%20-%20Coding%20Algorithms%20for%20ICD-9-CM%20and%20ICD-10.pdf https://academic.oup.com/aje/article/173/6/676/182985