The dataset we used was obtained from : https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008. The dataset represents ten years (1999-2008) of clinical care at 130 US hospitals.
Our objective with this dataset is to build a model that could predict patient’s early readmission and analyze whether a certain factor affects the early readmission rate or not.
For prediction analysis, we will be using ensemble method to select the optimal model. We will be using logistic regression, logistic lasso, single tree, bagging decision tree, random forest, support vector machine, k-nearest neighbors, and neural network. Our goal is to maximize the accuracy and sensitivity, while minimizing the false negative rate.
For causal inference, we will utilize logistic regression to define how each variable affects the odds ratio of readmission, and t.test using propensity score matching method to define whether AlC blood sugar test affect the readmission rate or not.
raw_data <- read.csv('diabetic_data.csv')
head(raw_data)
## encounter_id patient_nbr race gender age weight
## 1 2278392 8222157 Caucasian Female [0-10) ?
## 2 149190 55629189 Caucasian Female [10-20) ?
## 3 64410 86047875 AfricanAmerican Female [20-30) ?
## 4 500364 82442376 Caucasian Male [30-40) ?
## 5 16680 42519267 Caucasian Male [40-50) ?
## 6 35754 82637451 Caucasian Male [50-60) ?
## admission_type_id discharge_disposition_id admission_source_id
## 1 6 25 1
## 2 1 1 7
## 3 1 1 7
## 4 1 1 7
## 5 1 1 7
## 6 2 1 2
## time_in_hospital payer_code medical_specialty num_lab_procedures
## 1 1 ? Pediatrics-Endocrinology 41
## 2 3 ? ? 59
## 3 2 ? ? 11
## 4 2 ? ? 44
## 5 1 ? ? 51
## 6 3 ? ? 31
## num_procedures num_medications number_outpatient number_emergency
## 1 0 1 0 0
## 2 0 18 0 0
## 3 5 13 2 0
## 4 1 16 0 0
## 5 0 8 0 0
## 6 6 16 0 0
## number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum
## 1 0 250.83 ? ? 1 None
## 2 0 276 250.01 255 9 None
## 3 1 648 250 V27 6 None
## 4 0 8 250.43 403 7 None
## 5 0 197 157 250 5 None
## 6 0 414 411 250 9 None
## A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride
## 1 None No No No No No
## 2 None No No No No No
## 3 None No No No No No
## 4 None No No No No No
## 5 None No No No No No
## 6 None No No No No No
## acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone
## 1 No No No No No No
## 2 No No No No No No
## 3 No Steady No No No No
## 4 No No No No No No
## 5 No Steady No No No No
## 6 No No No No No No
## acarbose miglitol troglitazone tolazamide examide citoglipton insulin
## 1 No No No No No No No
## 2 No No No No No No Up
## 3 No No No No No No No
## 4 No No No No No No Up
## 5 No No No No No No Steady
## 6 No No No No No No Steady
## glyburide.metformin glipizide.metformin glimepiride.pioglitazone
## 1 No No No
## 2 No No No
## 3 No No No
## 4 No No No
## 5 No No No
## 6 No No No
## metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
## 1 No No No No NO
## 2 No No Ch Yes >30
## 3 No No No Yes NO
## 4 No No Ch Yes NO
## 5 No No Ch Yes NO
## 6 No No No Yes >30
Hmisc::describe(raw_data)
## raw_data
##
## 50 Variables 101766 Observations
## --------------------------------------------------------------------------------
## encounter_id
## n missing distinct Info Mean Gmd .05 .10
## 101766 0 101766 1 165201646 114704306 27170784 43368426
## .25 .50 .75 .90 .95
## 84961194 152388987 230270888 311346359 378962843
##
## lowest : 12522 15738 16680 28236 35754
## highest: 443847548 443847782 443854148 443857166 443867222
## --------------------------------------------------------------------------------
## patient_nbr
## n missing distinct Info Mean Gmd .05 .10
## 101766 0 71518 1 54330401 43738203 1456972 3957116
## .25 .50 .75 .90 .95
## 23413221 45505143 87545950 103287825 111480273
##
## lowest : 135 378 729 774 927
## highest: 189351095 189365864 189445127 189481478 189502619
## --------------------------------------------------------------------------------
## race
## n missing distinct
## 101766 0 6
##
## Value ? AfricanAmerican Asian Caucasian
## Frequency 2273 19210 641 76099
## Proportion 0.022 0.189 0.006 0.748
##
## Value Hispanic Other
## Frequency 2037 1506
## Proportion 0.020 0.015
## --------------------------------------------------------------------------------
## gender
## n missing distinct
## 101766 0 3
##
## Value Female Male Unknown/Invalid
## Frequency 54708 47055 3
## Proportion 0.538 0.462 0.000
## --------------------------------------------------------------------------------
## age
## n missing distinct
## 101766 0 10
##
## Value [0-10) [10-20) [20-30) [30-40) [40-50) [50-60) [60-70)
## Frequency 161 691 1657 3775 9685 17256 22483
## Proportion 0.002 0.007 0.016 0.037 0.095 0.170 0.221
##
## Value [70-80) [80-90) [90-100)
## Frequency 26068 17197 2793
## Proportion 0.256 0.169 0.027
## --------------------------------------------------------------------------------
## weight
## n missing distinct
## 101766 0 10
##
## Value ? [0-25) [100-125) [125-150) [150-175) [175-200)
## Frequency 98569 48 625 145 35 11
## Proportion 0.969 0.000 0.006 0.001 0.000 0.000
##
## Value [25-50) [50-75) [75-100) >200
## Frequency 97 897 1336 3
## Proportion 0.001 0.009 0.013 0.000
## --------------------------------------------------------------------------------
## admission_type_id
## n missing distinct Info Mean Gmd
## 101766 0 8 0.838 2.024 1.393
##
## Value 1 2 3 4 5 6 7 8
## Frequency 53990 18480 18869 10 4785 5291 21 320
## Proportion 0.531 0.182 0.185 0.000 0.047 0.052 0.000 0.003
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## discharge_disposition_id
## n missing distinct Info Mean Gmd .05 .10
## 101766 0 26 0.788 3.716 4.264 1 1
## .25 .50 .75 .90 .95
## 1 1 4 7 18
##
## lowest : 1 2 3 4 5, highest: 23 24 25 27 28
## --------------------------------------------------------------------------------
## admission_source_id
## n missing distinct Info Mean Gmd .05 .10
## 101766 0 17 0.795 5.754 3.905 1 1
## .25 .50 .75 .90 .95
## 1 7 7 7 17
##
## Value 1 2 3 4 5 6 7 8 9 10 11
## Frequency 29565 1104 187 3187 855 2264 57494 16 125 8 2
## Proportion 0.291 0.011 0.002 0.031 0.008 0.022 0.565 0.000 0.001 0.000 0.000
##
## Value 13 14 17 20 22 25
## Frequency 1 2 6781 161 12 2
## Proportion 0.000 0.000 0.067 0.002 0.000 0.000
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## time_in_hospital
## n missing distinct Info Mean Gmd .05 .10
## 101766 0 14 0.983 4.396 3.198 1 1
## .25 .50 .75 .90 .95
## 2 4 6 9 11
##
## Value 1 2 3 4 5 6 7 8 9 10 11
## Frequency 14208 17224 17756 13924 9966 7539 5859 4391 3002 2342 1855
## Proportion 0.140 0.169 0.174 0.137 0.098 0.074 0.058 0.043 0.029 0.023 0.018
##
## Value 12 13 14
## Frequency 1448 1210 1042
## Proportion 0.014 0.012 0.010
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## payer_code
## n missing distinct
## 101766 0 18
##
## Value ? BC CH CM CP DM FR HM MC MD MP
## Frequency 40256 4655 146 1937 2533 549 1 6274 32439 3532 79
## Proportion 0.396 0.046 0.001 0.019 0.025 0.005 0.000 0.062 0.319 0.035 0.001
##
## Value OG OT PO SI SP UN WC
## Frequency 1033 95 592 55 5007 2448 135
## Proportion 0.010 0.001 0.006 0.001 0.049 0.024 0.001
## --------------------------------------------------------------------------------
## medical_specialty
## n missing distinct
## 101766 0 73
##
## lowest : ? AllergyandImmunology Anesthesiology Anesthesiology-Pediatric Cardiology
## highest: Surgery-PlasticwithinHeadandNeck Surgery-Thoracic Surgery-Vascular SurgicalSpecialty Urology
## --------------------------------------------------------------------------------
## num_lab_procedures
## n missing distinct Info Mean Gmd .05 .10
## 101766 0 118 1 43.1 22.2 4 14
## .25 .50 .75 .90 .95
## 31 44 57 67 73
##
## lowest : 1 2 3 4 5, highest: 120 121 126 129 132
## --------------------------------------------------------------------------------
## num_procedures
## n missing distinct Info Mean Gmd
## 101766 0 7 0.892 1.34 1.728
##
## Value 0 1 2 3 4 5 6
## Frequency 46652 20742 12717 9443 4180 3078 4954
## Proportion 0.458 0.204 0.125 0.093 0.041 0.030 0.049
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## num_medications
## n missing distinct Info Mean Gmd .05 .10
## 101766 0 75 0.998 16.02 8.662 6 7
## .25 .50 .75 .90 .95
## 10 15 20 26 31
##
## lowest : 1 2 3 4 5, highest: 72 74 75 79 81
## --------------------------------------------------------------------------------
## number_outpatient
## n missing distinct Info Mean Gmd .05 .10
## 101766 0 39 0.416 0.3694 0.6655 0 0
## .25 .50 .75 .90 .95
## 0 0 0 1 2
##
## lowest : 0 1 2 3 4, highest: 37 38 39 40 42
## --------------------------------------------------------------------------------
## number_emergency
## n missing distinct Info Mean Gmd .05 .10
## 101766 0 33 0.299 0.1978 0.3672 0 0
## .25 .50 .75 .90 .95
## 0 0 0 1 1
##
## lowest : 0 1 2 3 4, highest: 46 54 63 64 76
## --------------------------------------------------------------------------------
## number_inpatient
## n missing distinct Info Mean Gmd .05 .10
## 101766 0 21 0.699 0.6356 0.9915 0 0
## .25 .50 .75 .90 .95
## 0 0 1 2 3
##
## lowest : 0 1 2 3 4, highest: 16 17 18 19 21
## --------------------------------------------------------------------------------
## diag_1
## n missing distinct
## 101766 0 717
##
## lowest : ? 10 11 110 112, highest: V63 V66 V67 V70 V71
## --------------------------------------------------------------------------------
## diag_2
## n missing distinct
## 101766 0 749
##
## lowest : ? 11 110 111 112, highest: V69 V70 V72 V85 V86
## --------------------------------------------------------------------------------
## diag_3
## n missing distinct
## 101766 0 790
##
## lowest : ? 11 110 111 112, highest: V66 V70 V72 V85 V86
## --------------------------------------------------------------------------------
## number_diagnoses
## n missing distinct Info Mean Gmd .05 .10
## 101766 0 16 0.88 7.423 2.022 4 5
## .25 .50 .75 .90 .95
## 6 8 9 9 9
##
## Value 1 2 3 4 5 6 7 8 9 10 11
## Frequency 219 1023 2835 5537 11393 10161 10393 10616 49474 17 11
## Proportion 0.002 0.010 0.028 0.054 0.112 0.100 0.102 0.104 0.486 0.000 0.000
##
## Value 12 13 14 15 16
## Frequency 9 16 7 10 45
## Proportion 0.000 0.000 0.000 0.000 0.000
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## max_glu_serum
## n missing distinct
## 101766 0 4
##
## Value >200 >300 None Norm
## Frequency 1485 1264 96420 2597
## Proportion 0.015 0.012 0.947 0.026
## --------------------------------------------------------------------------------
## A1Cresult
## n missing distinct
## 101766 0 4
##
## Value >7 >8 None Norm
## Frequency 3812 8216 84748 4990
## Proportion 0.037 0.081 0.833 0.049
## --------------------------------------------------------------------------------
## metformin
## n missing distinct
## 101766 0 4
##
## Value Down No Steady Up
## Frequency 575 81778 18346 1067
## Proportion 0.006 0.804 0.180 0.010
## --------------------------------------------------------------------------------
## repaglinide
## n missing distinct
## 101766 0 4
##
## Value Down No Steady Up
## Frequency 45 100227 1384 110
## Proportion 0.000 0.985 0.014 0.001
## --------------------------------------------------------------------------------
## nateglinide
## n missing distinct
## 101766 0 4
##
## Value Down No Steady Up
## Frequency 11 101063 668 24
## Proportion 0.000 0.993 0.007 0.000
## --------------------------------------------------------------------------------
## chlorpropamide
## n missing distinct
## 101766 0 4
##
## Value Down No Steady Up
## Frequency 1 101680 79 6
## Proportion 0.000 0.999 0.001 0.000
## --------------------------------------------------------------------------------
## glimepiride
## n missing distinct
## 101766 0 4
##
## Value Down No Steady Up
## Frequency 194 96575 4670 327
## Proportion 0.002 0.949 0.046 0.003
## --------------------------------------------------------------------------------
## acetohexamide
## n missing distinct
## 101766 0 2
##
## Value No Steady
## Frequency 101765 1
## Proportion 1 0
## --------------------------------------------------------------------------------
## glipizide
## n missing distinct
## 101766 0 4
##
## Value Down No Steady Up
## Frequency 560 89080 11356 770
## Proportion 0.006 0.875 0.112 0.008
## --------------------------------------------------------------------------------
## glyburide
## n missing distinct
## 101766 0 4
##
## Value Down No Steady Up
## Frequency 564 91116 9274 812
## Proportion 0.006 0.895 0.091 0.008
## --------------------------------------------------------------------------------
## tolbutamide
## n missing distinct
## 101766 0 2
##
## Value No Steady
## Frequency 101743 23
## Proportion 1 0
## --------------------------------------------------------------------------------
## pioglitazone
## n missing distinct
## 101766 0 4
##
## Value Down No Steady Up
## Frequency 118 94438 6976 234
## Proportion 0.001 0.928 0.069 0.002
## --------------------------------------------------------------------------------
## rosiglitazone
## n missing distinct
## 101766 0 4
##
## Value Down No Steady Up
## Frequency 87 95401 6100 178
## Proportion 0.001 0.937 0.060 0.002
## --------------------------------------------------------------------------------
## acarbose
## n missing distinct
## 101766 0 4
##
## Value Down No Steady Up
## Frequency 3 101458 295 10
## Proportion 0.000 0.997 0.003 0.000
## --------------------------------------------------------------------------------
## miglitol
## n missing distinct
## 101766 0 4
##
## Value Down No Steady Up
## Frequency 5 101728 31 2
## Proportion 0 1 0 0
## --------------------------------------------------------------------------------
## troglitazone
## n missing distinct
## 101766 0 2
##
## Value No Steady
## Frequency 101763 3
## Proportion 1 0
## --------------------------------------------------------------------------------
## tolazamide
## n missing distinct
## 101766 0 3
##
## Value No Steady Up
## Frequency 101727 38 1
## Proportion 1 0 0
## --------------------------------------------------------------------------------
## examide
## n missing distinct value
## 101766 0 1 No
##
## Value No
## Frequency 101766
## Proportion 1
## --------------------------------------------------------------------------------
## citoglipton
## n missing distinct value
## 101766 0 1 No
##
## Value No
## Frequency 101766
## Proportion 1
## --------------------------------------------------------------------------------
## insulin
## n missing distinct
## 101766 0 4
##
## Value Down No Steady Up
## Frequency 12218 47383 30849 11316
## Proportion 0.120 0.466 0.303 0.111
## --------------------------------------------------------------------------------
## glyburide.metformin
## n missing distinct
## 101766 0 4
##
## Value Down No Steady Up
## Frequency 6 101060 692 8
## Proportion 0.000 0.993 0.007 0.000
## --------------------------------------------------------------------------------
## glipizide.metformin
## n missing distinct
## 101766 0 2
##
## Value No Steady
## Frequency 101753 13
## Proportion 1 0
## --------------------------------------------------------------------------------
## glimepiride.pioglitazone
## n missing distinct
## 101766 0 2
##
## Value No Steady
## Frequency 101765 1
## Proportion 1 0
## --------------------------------------------------------------------------------
## metformin.rosiglitazone
## n missing distinct
## 101766 0 2
##
## Value No Steady
## Frequency 101764 2
## Proportion 1 0
## --------------------------------------------------------------------------------
## metformin.pioglitazone
## n missing distinct
## 101766 0 2
##
## Value No Steady
## Frequency 101765 1
## Proportion 1 0
## --------------------------------------------------------------------------------
## change
## n missing distinct
## 101766 0 2
##
## Value Ch No
## Frequency 47011 54755
## Proportion 0.462 0.538
## --------------------------------------------------------------------------------
## diabetesMed
## n missing distinct
## 101766 0 2
##
## Value No Yes
## Frequency 23403 78363
## Proportion 0.23 0.77
## --------------------------------------------------------------------------------
## readmitted
## n missing distinct
## 101766 0 3
##
## Value <30 >30 NO
## Frequency 11357 35545 54864
## Proportion 0.112 0.349 0.539
## --------------------------------------------------------------------------------
colSums(raw_data=='?')
## encounter_id patient_nbr race
## 0 0 2273
## gender age weight
## 0 0 98569
## admission_type_id discharge_disposition_id admission_source_id
## 0 0 0
## time_in_hospital payer_code medical_specialty
## 0 40256 49949
## num_lab_procedures num_procedures num_medications
## 0 0 0
## number_outpatient number_emergency number_inpatient
## 0 0 0
## diag_1 diag_2 diag_3
## 21 358 1423
## number_diagnoses max_glu_serum A1Cresult
## 0 0 0
## metformin repaglinide nateglinide
## 0 0 0
## chlorpropamide glimepiride acetohexamide
## 0 0 0
## glipizide glyburide tolbutamide
## 0 0 0
## pioglitazone rosiglitazone acarbose
## 0 0 0
## miglitol troglitazone tolazamide
## 0 0 0
## examide citoglipton insulin
## 0 0 0
## glyburide.metformin glipizide.metformin glimepiride.pioglitazone
## 0 0 0
## metformin.rosiglitazone metformin.pioglitazone change
## 0 0 0
## diabetesMed readmitted
## 0 0
# 0 Drop unnecessary columns
# Drop patients who died right after the diagnosis
data <- raw_data[!(raw_data$discharge_disposition_id %in% c(13,14,18,19,20,21,25,26)), ]
#get rid of duplicated values
duplicates <- duplicated(data$patient_nbr)
data <- data[!duplicates,]
# 1 Extract the relevant value
# Get rid of values that consists of single value
data <- data[,c(-1,-2,-6,-7,-8,-9,-11,-13,-26,-27,-28,-30,-33,-36,-37,-38,-39,-40,-41,-43,-44,-45,-46,-47)]
# 2 The dataset contains '?' values. Change this to 'Unknown'
for (col in names(data)) {
data[data[, col] == '?', col] <- 'Unknown'
}
# 3 Get rid of unknown gender since there are only 3 rows
data <- data[data$gender!='Unknown/Invalid',]
# 4 Change the dependent variable to bianry. Less than 30 is 1, other 0
data$readmitted <- ifelse(data$readmitted=='<30','Yes','No')
# 5 Change the age column to simpler format
data$age <- factor(data$age,
levels = c("[0-10)", "[10-20)", "[20-30)",
"[30-40)", "[40-50)", "[50-60)",
"[60-70)", "[70-80)", "[80-90)",
"[90-100)"),
labels = c("Under 10", "10s", "20s",
"30s","40s", "50s",
"60s", "70s", "80s",
"90s"))
# 6 Change the diagnosis column
categorize_values <- function(data, column_names) {
for (column_name in column_names) {
value <- ifelse(data[[column_name]] %in% c(390:459, 785), "Circulatory",
ifelse(data[[column_name]] %in% c(460:519, 786), "Respiratory",
ifelse(data[[column_name]] %in% c(520:579, 787), "Digestive",
ifelse(data[[column_name]] %in% "250", "Diabetes",
ifelse(data[[column_name]] %in% c(800:999), "Injury",
ifelse(data[[column_name]] %in% c(710:739), "Musculoskeletal",
ifelse(data[[column_name]] %in% c(580:629, 788), "Genitourinary",
ifelse(data[[column_name]] %in% c(140:239), "Neoplasms", "Other"))))))))
}
return(value)
}
data$diag_1 <-categorize_values(data, 'diag_1')
data$diag_2 <-categorize_values(data, 'diag_2')
data$diag_3 <-categorize_values(data, 'diag_3')
# 7 Calculate the table of medical_specialty
table_medical_specialty <- table(data$medical_specialty)
# Replace values with counts less than or equal to 1000 with "Other"
data$medical_specialty[!(data$medical_specialty %in% names(table_medical_specialty)[table_medical_specialty > 1000])] <- "Other"
library(dplyr)
# Change the chr to factor prior to the analysis
chr_columns <- sapply(data,is.character)
data <- data %>%
mutate_if(chr_columns,as.factor)
head(data)
## race gender age time_in_hospital medical_specialty num_procedures
## 2 Caucasian Female 10s 3 Unknown 0
## 3 AfricanAmerican Female 20s 2 Unknown 5
## 4 Caucasian Male 30s 2 Unknown 1
## 5 Caucasian Male 40s 1 Unknown 0
## 6 Caucasian Male 50s 3 Unknown 6
## 7 Caucasian Male 60s 4 Unknown 1
## num_medications number_outpatient number_emergency number_inpatient
## 2 18 0 0 0
## 3 13 2 0 1
## 4 16 0 0 0
## 5 8 0 0 0
## 6 16 0 0 0
## 7 21 0 0 0
## diag_1 diag_2 diag_3 number_diagnoses max_glu_serum A1Cresult
## 2 Other Other Other 9 None None
## 3 Other Diabetes Other 6 None None
## 4 Other Other Circulatory 7 None None
## 5 Neoplasms Neoplasms Diabetes 5 None None
## 6 Circulatory Circulatory Diabetes 9 None None
## 7 Circulatory Circulatory Other 7 None None
## metformin glimepiride glipizide glyburide pioglitazone rosiglitazone insulin
## 2 No No No No No No Up
## 3 No No Steady No No No No
## 4 No No No No No No Up
## 5 No No Steady No No No Steady
## 6 No No No No No No Steady
## 7 Steady Steady No No No No Steady
## change diabetesMed readmitted
## 2 Ch Yes No
## 3 No Yes No
## 4 Ch Yes No
## 5 Ch Yes No
## 6 No Yes No
## 7 Ch Yes No
library(ggplot2)
library(tidyverse)
library(gridExtra)
library(tidyr)
my_theme <- function(base_size = 10, base_family = "sans"){
theme_minimal(base_size = base_size, base_family = base_family) +
theme(
axis.text = element_text(size = 10),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
axis.title = element_text(size = 12),
panel.grid.major = element_line(color = "grey"),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "#fffcfc"),
strip.background = element_rect(fill = "#820000", color = "#820000", size =0.5),
strip.text = element_text(face = "bold", size = 10, color = "white"),
legend.position = "bottom",
legend.justification = "center",
legend.background = element_blank(),
panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
)
}
theme_set(my_theme())
mycolors=c("#db3434","#ccc7c7","#871618","#590050","#7f34a8")
generate_chart <- function(data, variable) {
var_sym <- ensym(variable)
p <- data %>%
group_by(!!var_sym) %>%
summarize(count = n(),
prop = round(count / nrow(data), 2),
labels = scales::percent(prop)) %>%
ggplot(aes(x = readmitted, fill = factor(!!var_sym), y = count)) +
geom_bar(width = 0.5, stat = 'identity', color = 'black',alpha=0.8) +
scale_fill_manual(values=mycolors) +
labs(fill = as_label(var_sym)) +
theme(legend.position = "right")
return(p)
}
# Generate chart using the function
generate_chart(data, readmitted)+
labs(title='Readmission (Dependent Variable)',fill='Readmission')+
geom_text(aes(label = labels), color='black',
position = position_stack(vjust=0.5),show.legend = FALSE) +
coord_flip()
make_bar_chart <- function(data, variable) {
var_sym <- ensym(variable)
p <- data %>%
group_by(!!var_sym,readmitted) %>%
summarise(count = n()) %>%
group_by(!!var_sym) %>%
mutate(total_count = sum(count)) %>%
mutate(prop = round(count / total_count, 2)) %>%
mutate(labels = scales::percent(prop)) %>%
mutate(labels2 = ifelse(row_number() %% 2 == 1, labels, NA))%>%
ggplot(aes(x = !!var_sym, fill = factor(readmitted), y = count)) +
geom_bar(width = 0.7, stat = 'identity', color = 'black',alpha=0.8) +
scale_fill_manual(values=mycolors) +
labs(fill = 'Readmission')+
geom_text(aes(label = labels2), color='black',
position = position_stack(vjust=0.5),show.legend = FALSE)+
theme(legend.position = "right")
return(p)
}
b1 <- make_bar_chart(data,gender)+labs(title = 'Readmission by Gender')+coord_flip()
b2 <- make_bar_chart(data,change)+labs(title = 'Readmission by Change in Medication')+coord_flip()
b3 <- make_bar_chart(data,diabetesMed)+labs(title = 'Readmission by Diabetic Medication Prescription')+coord_flip()
grid.arrange(b1,b2,b3,nrow=3)
b4 <- make_bar_chart(data,race)+labs(title = 'Readmission by Race')+coord_flip()
b5 <- make_bar_chart(data,age)+labs(title = 'Readmission by Age')+coord_flip()
b6 <- make_bar_chart(data,max_glu_serum)+labs(title = 'Readmission by Glucose Serum Test Result')+coord_flip()
b7 <- make_bar_chart(data,A1Cresult)+labs(title = 'Readmission by A1C Test Result')+coord_flip()
grid.arrange(b4,b5,b6,b7,ncol=1)
b8 <- make_bar_chart(data,medical_specialty)+labs(title = 'Readmission by Medical Specialty')
b9 <- make_bar_chart(data,diag_1) +labs(title = 'Readmission by Initial Diagnosis')
b10 <- make_bar_chart(data,diag_2) +labs(title = 'Readmission by Secondary Diagnosis')
b11 <- make_bar_chart(data,diag_3) +labs(title = 'Readmission by Third Diagnosis')
grid.arrange(b8,b9,b10,b11,ncol=1)
generate_histogram2 <- function(data, variable) {
var_sym <- ensym(variable)
p <- data%>%
ggplot(aes(!!var_sym,y=..count..,fill=readmitted))+
geom_histogram(color='black',alpha = 0.8, binwidth = 1) +
labs(fill = 'Readmission')+
scale_fill_manual(values=mycolors) +
theme_minimal()+
labs(color='Readmission')
return(p)
}
h0 <- generate_histogram2(data,time_in_hospital)+labs(title='Time in Hospital')
h1 <- generate_histogram2(data,num_procedures)+labs(title= 'Number of Procedures')
h2 <- generate_histogram2(data,num_medications)+labs(title='Number of Medications')
h3 <- generate_histogram2(data,number_emergency)+labs(title='Number of Emergency Visits')
h4 <- generate_histogram2(data,number_outpatient)+labs(title='Number of Outpatient Visits')
h5 <- generate_histogram2(data,number_inpatient)+labs(title='Number of Inpatient Visits')
h6 <- generate_histogram2(data,number_diagnoses)+labs(title='Number of Diagnoses')
grid.arrange(h0,h1,h2,h3,h4,h5,h6,ncol=3)
set.seed(0)
library(caret)
# splot into train test dataset
index <- createDataPartition(data$readmitted, p = 0.7, list = FALSE)
train_data<-data[index,]
test_data <- data[-index, ]
train_data$readmitted %>%table()
## .
## No Yes
## 43633 4249
test_data$readmitted %>%table()
## .
## No Yes
## 18699 1821
# undersample train dataset due to
set.seed(0)
re1 <- train_data[(train_data$readmitted=='Yes'),]
re0 <- train_data[(train_data$readmitted=='No'),]
idx0 <- sample(1:nrow(re0), sum(train_data$readmitted=='Yes'))
re0 <- re0[idx0,]
re_undersample <- rbind(re1,re0)
idx1 <- sample(1:nrow(re_undersample), nrow(re_undersample)*0.7)
u_train_data <- re_undersample[idx1, ]
u_test_data <- re_undersample[-idx1, ]
re_undersample$readmitted%>%table()
## .
## No Yes
## 4249 4249
# Generate chart using the function
t1 <- generate_chart(train_data, readmitted)+
labs(title='Original Train Dataset',fill='Readmission')+
geom_text(aes(label = labels), color='black',
position = position_stack(vjust=0.5),show.legend = FALSE)+
coord_flip()
t2 <- generate_chart(test_data, readmitted)+
labs(title='Original Test Dataset',fill='Readmission')+
geom_text(aes(label = labels), color='black',
position = position_stack(vjust=0.5),show.legend = FALSE)+
coord_flip()
t3 <- generate_chart(re_undersample, readmitted)+
labs(title='Undersampling Dataset',fill='Readmission')+
geom_text(aes(label = labels), color='black',
position = position_stack(vjust=0.5),show.legend = FALSE)+
coord_flip()
grid.arrange(t1,t2,t3,ncol=1)
simple_logit_model <- glm(data=re_undersample, readmitted~.,family='binomial')
# p is the cutoff for the classification
p <- 0.5
# predict out of sample
logit_pred <- predict(simple_logit_model, newdata = test_data, type = "response")
logit_pred <- as.factor(ifelse(logit_pred>p,'Yes','No'))
# compute the accuracy
logit_cm <- confusionMatrix(data=logit_pred,
reference=test_data$readmitted)
logit_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 11494 865
## Yes 7205 956
##
## Accuracy : 0.6067
## 95% CI : (0.6, 0.6134)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0543
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6147
## Specificity : 0.5250
## Pos Pred Value : 0.9300
## Neg Pred Value : 0.1171
## Prevalence : 0.9113
## Detection Rate : 0.5601
## Detection Prevalence : 0.6023
## Balanced Accuracy : 0.5698
##
## 'Positive' Class : No
##
library(gamlr)
library(Matrix)
set.seed(0)
Xmatrix_is <- sparse.model.matrix(readmitted ~., data=naref(re_undersample))[,-1]
logit_lasso_model <- cv.gamlr(Xmatrix_is,re_undersample$readmitted,family='binomial',verb=TRUE)
## fold 1,2,3,4,5,done.
plot(logit_lasso_model)
length(coef(logit_lasso_model))
## [1] 101
sum(coef(logit_lasso_model, select="min")!=0)
## [1] 50
sum(coef(logit_lasso_model, select="1se")!=0)
## [1] 15
Xmatrix_oos <- sparse.model.matrix(readmitted ~., data=naref(test_data))[,-1]
lasso_pred <- drop(predict(logit_lasso_model, Xmatrix_oos, type="response"))
lasso_pred <- as.factor(ifelse(lasso_pred>p,'Yes','No'))
# compute the accuracy
lasso_cm <- confusionMatrix(data=lasso_pred,
reference=test_data$readmitted)
library(rpart.plot)
library(caret)
library(rattle)
set.seed(0)
tree_is <- rpart(formula = readmitted ~ .,
data = re_undersample,
method = 'class',
xval=5, #5 fold cross-validation
model = TRUE)
fancyRpartPlot(tree_is,palettes = 'Reds')
tree_pred <- predict(tree_is,test_data,type='class')
tree_cm <- confusionMatrix(data=tree_pred,
reference=test_data$readmitted)
tree_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 10882 814
## Yes 7817 1007
##
## Accuracy : 0.5794
## 95% CI : (0.5726, 0.5862)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0493
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5820
## Specificity : 0.5530
## Pos Pred Value : 0.9304
## Neg Pred Value : 0.1141
## Prevalence : 0.9113
## Detection Rate : 0.5303
## Detection Prevalence : 0.5700
## Balanced Accuracy : 0.5675
##
## 'Positive' Class : No
##
set.seed(0)
bag_is <- train(readmitted ~ .,
data = re_undersample, method = "treebag", # for bagging
tuneLength = 5, # choose up to 5 combinations of tuning parameter
metric = "ROC", # evaluate hyperparamter combinations with ROC
trControl = trainControl(
method = "cv", # k-fold cross validation
number = 5, # k=5 folds
savePredictions = "final", # save predictions for the optimal tuning parameters
classProbs = TRUE, # return class probabilities in addition to predicted values
summaryFunction = twoClassSummary # for binary response variable
)
)
bag_pred <- predict(bag_is,test_data,type='raw')
bag_cm <- confusionMatrix(data=bag_pred,reference=test_data$readmitted)
bag_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 10470 845
## Yes 8229 976
##
## Accuracy : 0.5578
## 95% CI : (0.551, 0.5646)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0339
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5599
## Specificity : 0.5360
## Pos Pred Value : 0.9253
## Neg Pred Value : 0.1060
## Prevalence : 0.9113
## Detection Rate : 0.5102
## Detection Prevalence : 0.5514
## Balanced Accuracy : 0.5479
##
## 'Positive' Class : No
##
set.seed(0)
random_is <- train(readmitted ~ .,
data = re_undersample, method = "ranger", # for bagging
tuneLength = 5, # choose up to 5 combinations of tuning parameter
metric = "ROC", # evaluate hyperparamter combinations with ROC
trControl = trainControl(
method = "cv", # k-fold cross validation
number = 5, # k=5 folds
savePredictions = "final", # save predictions for the optimal tuning parameters
classProbs = TRUE, # return class probabilities in addition to predicted values
summaryFunction = twoClassSummary # for binary response variable
)
)
ggplot(random_is)+scale_color_manual(values=c(mycolors[1],mycolors[3]))
# compute variable importance for RF
library(ranger)
library(tree)
var_imp <- ranger(readmitted ~., data=re_undersample,write.forest = TRUE,num.trees = 200, min.node.size = 5, importance ='impurity')
library(timetk)
var_imp <- ranger::importance(var_imp) %>% sort(decreasing=TRUE)
var_imp9 <- var_imp[1:9]%>%data.frame()%>%tk_tbl()
# Create the bar graph
ggplot(aes(x = reorder(index,.), y = ., fill = reorder(index,.)), data = var_imp9) +
geom_bar(stat = 'identity', color = 'black',show.legend = F) +
scale_fill_brewer(palette = 'Reds') +
coord_flip() +
labs(x = "Features", y = "Importance", title = "Random Forest Variable Importance")
random_pred <- predict(random_is,test_data,type='raw')
random_cm <- confusionMatrix(data=random_pred,reference=test_data$readmitted)
random_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 10627 764
## Yes 8072 1057
##
## Accuracy : 0.5694
## 95% CI : (0.5626, 0.5762)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0529
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5683
## Specificity : 0.5805
## Pos Pred Value : 0.9329
## Neg Pred Value : 0.1158
## Prevalence : 0.9113
## Detection Rate : 0.5179
## Detection Prevalence : 0.5551
## Balanced Accuracy : 0.5744
##
## 'Positive' Class : No
##
set.seed(0)
gdb_is <- train(readmitted ~ .,
data = re_undersample,
method = "gbm", # for gradient boosting
tuneLength = 5, # choose up to 5 combinations of tuning parameters
metric = "ROC", # evaluate hyperparamter combinations with ROC
trControl = trainControl(
method = "cv", # k-fold cross validation
number = 5, # 5 folds
savePredictions = "final", # save predictions for the optimal tuning parameter1
classProbs = TRUE, # return class probabilities in addition to predicted values
summaryFunction = twoClassSummary # for binary response variable
))
ggplot(gdb_is)+scale_color_manual(values=mycolors)
gdb_pred <- predict(gdb_is,test_data,type='raw')
gdb_cm <- confusionMatrix(data=gdb_pred,reference=test_data$readmitted)
gdb_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 11647 862
## Yes 7052 959
##
## Accuracy : 0.6143
## 95% CI : (0.6076, 0.621)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.059
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6229
## Specificity : 0.5266
## Pos Pred Value : 0.9311
## Neg Pred Value : 0.1197
## Prevalence : 0.9113
## Detection Rate : 0.5676
## Detection Prevalence : 0.6096
## Balanced Accuracy : 0.5748
##
## 'Positive' Class : No
##
set.seed(0)
svm_is <- train(readmitted ~ .,
data = re_undersample,
method = "svmLinear",
trControl = trainControl(method = "cv",
number = 5,
savePredictions = TRUE,
verboseIter = FALSE,
summaryFunction = multiClassSummary))
svm_pred <- predict(svm_is,test_data,type='raw')
svm_cm <- confusionMatrix(data=svm_pred,reference=test_data$readmitted)
svm_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 11991 898
## Yes 6708 923
##
## Accuracy : 0.6293
## 95% CI : (0.6227, 0.636)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0607
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6413
## Specificity : 0.5069
## Pos Pred Value : 0.9303
## Neg Pred Value : 0.1210
## Prevalence : 0.9113
## Detection Rate : 0.5844
## Detection Prevalence : 0.6281
## Balanced Accuracy : 0.5741
##
## 'Positive' Class : No
##
knn_is <- train(readmitted ~ .,
data = re_undersample,
method = "kknn",
trControl = trainControl(method = "cv",
number = 5,
savePredictions = TRUE,
verboseIter = FALSE,
summaryFunction = multiClassSummary))
knn_is
## k-Nearest Neighbors
##
## 8498 samples
## 25 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 6798, 6798, 6799, 6799, 6798
## Resampling results across tuning parameters:
##
## kmax Accuracy Kappa F1 Sensitivity Specificity
## 5 0.5121198 0.02423941 0.5137321 0.5154172 0.5088223
## 7 0.5180037 0.03600560 0.5199282 0.5220046 0.5140010
## 9 0.5178862 0.03577155 0.5206508 0.5236555 0.5121161
## Pos_Pred_Value Neg_Pred_Value Precision Recall Detection_Rate
## 0.5120810 0.5121613 0.5120810 0.5154172 0.2577076
## 0.5178926 0.5181185 0.5178926 0.5220046 0.2610028
## 0.5177102 0.5180686 0.5177102 0.5236555 0.2618269
## Balanced_Accuracy
## 0.5121197
## 0.5180028
## 0.5178858
##
## Tuning parameter 'distance' was held constant at a value of 2
## Tuning
## parameter 'kernel' was held constant at a value of optimal
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were kmax = 7, distance = 2 and kernel
## = optimal.
knn_pred <- predict(knn_is,test_data,type='raw')
knn_cm <- confusionMatrix(data=knn_pred,reference=test_data$readmitted)
knn_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 9883 891
## Yes 8816 930
##
## Accuracy : 0.5269
## 95% CI : (0.5201, 0.5338)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0132
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.52853
## Specificity : 0.51071
## Pos Pred Value : 0.91730
## Neg Pred Value : 0.09542
## Prevalence : 0.91126
## Detection Rate : 0.48163
## Detection Prevalence : 0.52505
## Balanced Accuracy : 0.51962
##
## 'Positive' Class : No
##
nnet_is <- train(readmitted ~ .,
data = re_undersample,
method = "nnet",
trControl = trainControl(method = "cv",
number = 5,
savePredictions = TRUE,
verboseIter = FALSE,
search = "grid",
summaryFunction = multiClassSummary),
tuneLength = 5)
ggplot(nnet_is)+scale_color_manual(values=mycolors)
nnet_pred <- predict(nnet_is,test_data,type='raw')
nnet_cm <- confusionMatrix(data=nnet_pred,reference=test_data$readmitted)
nnet_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 11039 829
## Yes 7660 992
##
## Accuracy : 0.5863
## 95% CI : (0.5795, 0.5931)
## No Information Rate : 0.9113
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0502
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.5904
## Specificity : 0.5448
## Pos Pred Value : 0.9301
## Neg Pred Value : 0.1147
## Prevalence : 0.9113
## Detection Rate : 0.5380
## Detection Prevalence : 0.5784
## Balanced Accuracy : 0.5676
##
## 'Positive' Class : No
##
# creating function that extracts relevant statistics from the confusion matrix
calculate_metrics <- function(confusion_matrix) {
accuracy <- confusion_matrix$overall[1]
sensitivity <- confusion_matrix$byClass[1]
specificity <- confusion_matrix$byClass[2]
f1score <- confusion_matrix$byClass[7]
fn <- confusion_matrix$table[1,2]/(confusion_matrix$table[1,2]+confusion_matrix$table[2,2])
return(c(accuracy, sensitivity,specificity,f1score,fn))
}
confusion_matrices <- list(logit_cm, lasso_cm, tree_cm, bag_cm, random_cm, gdb_cm, svm_cm, knn_cm, nnet_cm)
metrics_list <- lapply(confusion_matrices, calculate_metrics)
accuracy <- round(sapply(metrics_list, `[`, 1),4)
sensitivity <- round(sapply(metrics_list, `[`, 2),4)
specificity <- round(sapply(metrics_list, `[`, 3),4)
f1score <- round(sapply(metrics_list, `[`, 4),4)
fn <- round(sapply(metrics_list, `[`, 5),4)
models <- c('Logit','Lasso','Tree','Bag','RF','GDB','SVM','KNN','NNET')
results <- tibble(models=rep(models,5),
metrics=c(rep('accuracy',9),rep('sensitivity',9),rep('specificity',9),rep('f1sore',9),rep('fn',9)),
scores=c(accuracy,sensitivity,specificity,f1score,fn))
results$models <- reorder(results$models, results$scores)
results %>%
ggplot(aes(x=models,y=scores,fill=metrics))+
geom_bar(color='black',stat='identity',alpha=0.8,show.legend = F,width = 0.3)+
scale_fill_manual(values=mycolors)+
coord_flip()+
facet_grid(~metrics)+
labs(title='Model Summary')
results_wide <- pivot_wider(results,names_from =models, values_from=scores)
knitr::kable(results_wide)
| metrics | Logit | Lasso | Tree | Bag | RF | GDB | SVM | KNN | NNET |
|---|---|---|---|---|---|---|---|---|---|
| accuracy | 0.6067 | 0.6718 | 0.5794 | 0.5578 | 0.5694 | 0.6143 | 0.6293 | 0.5269 | 0.5863 |
| sensitivity | 0.6147 | 0.6935 | 0.5820 | 0.5599 | 0.5683 | 0.6229 | 0.6413 | 0.5285 | 0.5904 |
| specificity | 0.5250 | 0.4492 | 0.5530 | 0.5360 | 0.5805 | 0.5266 | 0.5069 | 0.5107 | 0.5448 |
| f1sore | 0.7402 | 0.7938 | 0.7160 | 0.6977 | 0.7063 | 0.7464 | 0.7592 | 0.6706 | 0.7223 |
| fn | 0.4750 | 0.5508 | 0.4470 | 0.4640 | 0.4195 | 0.4734 | 0.4931 | 0.4893 | 0.4552 |
library(magrittr)
glm(data = data, readmitted~.,family='binomial') %>% summary()
##
## Call:
## glm(formula = readmitted ~ ., family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0732 -0.4565 -0.4025 -0.3487 2.8039
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -4.366572 1.015783 -4.299
## raceAsian 0.019792 0.168082 0.118
## raceCaucasian 0.016009 0.037499 0.427
## raceHispanic -0.047402 0.103546 -0.458
## raceOther -0.198979 0.123736 -1.608
## raceUnknown -0.180487 0.095513 -1.890
## genderMale 0.014980 0.027691 0.541
## age10s 0.726906 0.620068 1.172
## age20s 1.145377 0.596801 1.919
## age30s 1.121664 0.591097 1.898
## age40s 1.166759 0.588449 1.983
## age50s 1.129370 0.587969 1.921
## age60s 1.342771 0.587814 2.284
## age70s 1.450256 0.587835 2.467
## age80s 1.467212 0.588233 2.494
## age90s 1.312095 0.593311 2.211
## time_in_hospital 0.038229 0.005214 7.332
## medical_specialtyEmergency/Trauma 0.118773 0.085986 1.381
## medical_specialtyFamily/GeneralPractice 0.301131 0.080767 3.728
## medical_specialtyInternalMedicine 0.245286 0.072063 3.404
## medical_specialtyOrthopedics-Reconstructive -0.093367 0.147585 -0.633
## medical_specialtyOther 0.196931 0.075320 2.615
## medical_specialtySurgery-General 0.192138 0.101395 1.895
## medical_specialtyUnknown 0.225335 0.066399 3.394
## num_procedures -0.035216 0.009361 -3.762
## num_medications 0.004703 0.002152 2.186
## number_outpatient -0.008844 0.012301 -0.719
## number_emergency 0.078790 0.021017 3.749
## number_inpatient 0.326592 0.015860 20.592
## diag_1Diabetes -0.494374 0.329034 -1.503
## diag_1Digestive -0.201359 0.056134 -3.587
## diag_1Genitourinary -0.120231 0.067604 -1.778
## diag_1Injury 0.059809 0.057046 1.048
## diag_1Musculoskeletal -0.038734 0.069358 -0.558
## diag_1Neoplasms -0.237184 0.082181 -2.886
## diag_1Other -0.106119 0.040472 -2.622
## diag_1Respiratory -0.339824 0.048903 -6.949
## diag_2Diabetes -0.180238 0.069236 -2.603
## diag_2Digestive -0.026896 0.076637 -0.351
## diag_2Genitourinary -0.101094 0.055825 -1.811
## diag_2Injury 0.023883 0.086702 0.275
## diag_2Musculoskeletal -0.122380 0.111998 -1.093
## diag_2Neoplasms 0.291447 0.088777 3.283
## diag_2Other 0.043165 0.036364 1.187
## diag_2Respiratory -0.134534 0.051122 -2.632
## diag_3Diabetes -0.110950 0.051673 -2.147
## diag_3Digestive 0.144616 0.073708 1.962
## diag_3Genitourinary 0.014403 0.060226 0.239
## diag_3Injury 0.031084 0.097091 0.320
## diag_3Musculoskeletal -0.044376 0.105411 -0.421
## diag_3Neoplasms 0.190152 0.100451 1.893
## diag_3Other 0.037572 0.035052 1.072
## diag_3Respiratory 0.058341 0.056099 1.040
## number_diagnoses 0.022872 0.008688 2.633
## max_glu_serum>300 -0.101754 0.167432 -0.608
## max_glu_serumNone -0.117163 0.109713 -1.068
## max_glu_serumNorm 0.017129 0.136331 0.126
## A1Cresult>8 -0.031024 0.084683 -0.366
## A1CresultNone 0.065747 0.070643 0.931
## A1CresultNorm 0.011609 0.091151 0.127
## metforminNo -0.194192 0.162510 -1.195
## metforminSteady -0.275948 0.162953 -1.693
## metforminUp -0.420402 0.210757 -1.995
## glimepirideNo -0.276060 0.259699 -1.063
## glimepirideSteady -0.373060 0.264728 -1.409
## glimepirideUp -0.213089 0.343142 -0.621
## glipizideNo -0.290284 0.163711 -1.773
## glipizideSteady -0.247316 0.164275 -1.506
## glipizideUp -0.140074 0.209667 -0.668
## glyburideNo 0.117804 0.186522 0.632
## glyburideSteady 0.131853 0.187668 0.703
## glyburideUp 0.150627 0.232081 0.649
## pioglitazoneNo -0.088423 0.364132 -0.243
## pioglitazoneSteady -0.188050 0.366770 -0.513
## pioglitazoneUp 0.040129 0.432425 0.093
## rosiglitazoneNo 0.873177 0.594422 1.469
## rosiglitazoneSteady 0.844145 0.596145 1.416
## rosiglitazoneUp 0.909834 0.662820 1.373
## insulinNo -0.141764 0.070373 -2.014
## insulinSteady -0.150790 0.054568 -2.763
## insulinUp -0.106220 0.056968 -1.865
## changeNo 0.033859 0.050045 0.677
## diabetesMedYes 0.224787 0.049889 4.506
## Pr(>|z|)
## (Intercept) 1.72e-05 ***
## raceAsian 0.906262
## raceCaucasian 0.669441
## raceHispanic 0.647106
## raceOther 0.107815
## raceUnknown 0.058803 .
## genderMale 0.588532
## age10s 0.241076
## age20s 0.054960 .
## age30s 0.057749 .
## age40s 0.047393 *
## age50s 0.054757 .
## age60s 0.022351 *
## age70s 0.013621 *
## age80s 0.012622 *
## age90s 0.027003 *
## time_in_hospital 2.26e-13 ***
## medical_specialtyEmergency/Trauma 0.167184
## medical_specialtyFamily/GeneralPractice 0.000193 ***
## medical_specialtyInternalMedicine 0.000665 ***
## medical_specialtyOrthopedics-Reconstructive 0.526976
## medical_specialtyOther 0.008934 **
## medical_specialtySurgery-General 0.058098 .
## medical_specialtyUnknown 0.000690 ***
## num_procedures 0.000169 ***
## num_medications 0.028837 *
## number_outpatient 0.472168
## number_emergency 0.000178 ***
## number_inpatient < 2e-16 ***
## diag_1Diabetes 0.132967
## diag_1Digestive 0.000334 ***
## diag_1Genitourinary 0.075330 .
## diag_1Injury 0.294442
## diag_1Musculoskeletal 0.576526
## diag_1Neoplasms 0.003900 **
## diag_1Other 0.008742 **
## diag_1Respiratory 3.68e-12 ***
## diag_2Diabetes 0.009235 **
## diag_2Digestive 0.725620
## diag_2Genitourinary 0.070154 .
## diag_2Injury 0.782959
## diag_2Musculoskeletal 0.274524
## diag_2Neoplasms 0.001027 **
## diag_2Other 0.235214
## diag_2Respiratory 0.008498 **
## diag_3Diabetes 0.031780 *
## diag_3Digestive 0.049762 *
## diag_3Genitourinary 0.810987
## diag_3Injury 0.748856
## diag_3Musculoskeletal 0.673770
## diag_3Neoplasms 0.058359 .
## diag_3Other 0.283766
## diag_3Respiratory 0.298356
## number_diagnoses 0.008475 **
## max_glu_serum>300 0.543364
## max_glu_serumNone 0.285561
## max_glu_serumNorm 0.900016
## A1Cresult>8 0.714098
## A1CresultNone 0.352012
## A1CresultNorm 0.898659
## metforminNo 0.232105
## metforminSteady 0.090376 .
## metforminUp 0.046073 *
## glimepirideNo 0.287781
## glimepirideSteady 0.158771
## glimepirideUp 0.534604
## glipizideNo 0.076204 .
## glipizideSteady 0.132195
## glipizideUp 0.504082
## glyburideNo 0.527658
## glyburideSteady 0.482314
## glyburideUp 0.516321
## pioglitazoneNo 0.808137
## pioglitazoneSteady 0.608148
## pioglitazoneUp 0.926063
## rosiglitazoneNo 0.141846
## rosiglitazoneSteady 0.156774
## rosiglitazoneUp 0.169854
## insulinNo 0.043961 *
## insulinSteady 0.005721 **
## insulinUp 0.062241 .
## changeNo 0.498680
## diabetesMedYes 6.61e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 40988 on 68401 degrees of freedom
## Residual deviance: 39892 on 68319 degrees of freedom
## AIC: 40058
##
## Number of Fisher Scoring iterations: 6
Here, we will be using A1C result to define whether this variable had a causal effect on readmission rate. A1C or HbA1c is a test that estimates how much blood sugar has been in the bloodstream over the last 2~3 months.
Let’s explore the statistics of A1C result data.
data$readmitted1 <- as.character(data$readmitted)
data$readmitted1 <- ifelse(data$readmitted=='Yes',1,0)
ci1 <- data%>%
ggplot(aes(x=readmitted,fill=A1Cresult))+
geom_bar(position="fill",color="black",alpha=0.8,show.legend = F)+
scale_fill_manual(values=mycolors)+
coord_flip()+
ggtitle('HbA1c Test')
ci2 <- data %>%
ggplot(aes(x=A1Cresult,y=..count..,fill=A1Cresult))+
geom_bar(color='black',alpha=0.8,show.legend = F)+
scale_fill_manual(values=mycolors)+
coord_flip()+
facet_grid(readmitted~.)
ci3 <- data %>% group_by(A1Cresult)%>%
summarise(mean=round(mean(readmitted1),4))%>%
ggplot(aes(x=as.factor(A1Cresult),y=mean,fill=as.factor(A1Cresult)))+
geom_bar(stat='identity',color='black',alpha=0.8)+
scale_fill_manual(values=mycolors)+
coord_flip()+
labs(x='A1Cresult',y='Mean of Readmission',fill='A1Cresult',title='Mean of Readmission')
grid.arrange(ci1,ci2,ci3,ncol=1)
library(MatchIt)
library(tableone)
library(tidyr)
# 1 if A1C test was taken, 0 if not
data$A1Cresult_group<- ifelse(data$A1Cresult == 'None', 0, 1)
data$group<-as.logical(data$A1Cresult_group)
data$group <- ifelse(data$group==1, 'Treatment', 'Control')
y_treatment <-data$readmitted1[data$group=='Treatment']
y_control<- data$readmitted1[data$group=='Control']
ci4 <- data %>%
ggplot(aes(x=as.factor(group),y=..count..,fill=as.factor(group)))+
geom_bar(color='black',alpha=0.8,show.legend = F)+
scale_fill_manual(values=mycolors)+
coord_flip()+
facet_grid(readmitted~.)+
labs(x='Group',y='Count',fill='Group',title='Treatment Group Count')
ci5 <- data %>% group_by(group)%>%
summarise(mean=round(mean(readmitted1),4))%>%
ggplot(aes(x=as.factor(group),y=mean,fill=as.factor(group)))+
geom_bar(stat='identity',color='black',alpha=0.8,show.legend = F)+
scale_fill_manual(values=mycolors)+
coord_flip()+
labs(x='Group',y='Mean',fill='Group',title='Mean of Readmission by Group')
set.seed(0)
match.it<-matchit(A1Cresult_group~., data=data,method = 'nearest',ratio=1)
psm_matchit_data <- match.data(match.it)
y_treatment_psm_matchit <-psm_matchit_data$readmitted1[psm_matchit_data$A1Cresult_group==1]
y_control_psm_matchit <- psm_matchit_data$readmitted1[psm_matchit_data$A1Cresult_group==0]
ci6 <- psm_matchit_data %>% group_by(group)%>%
summarise(mean=round(mean(readmitted1),4))%>%
ggplot(aes(x=as.factor(group),y=mean,fill=as.factor(group)))+
geom_bar(stat='identity',color='black',alpha=0.8)+
scale_fill_manual(values=mycolors)+
coord_flip()+
labs(x='Group',y='Mean',fill='Group',title='PSM Mean of Group')
grid.arrange(ci4,ci5,ci6,ncol=1)
set.seed(0)
# t test before PSM
t.test(y_treatment,
y_control,
var.equal =F,
paired=FALSE,
conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: y_treatment and y_control
## t = -2.831, df = 18949, p-value = 0.004645
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.013153009 -0.002390939
## sample estimates:
## mean of x mean of y
## 0.08238317 0.09015514
# t test after PSM
t.test(y_treatment_psm_matchit,
y_control_psm_matchit,
var.equal = F,
paired=FALSE,
conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: y_treatment_psm_matchit and y_control_psm_matchit
## t = 0.85975, df = 24899, p-value = 0.3899
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.003802218 0.009744084
## sample estimates:
## mean of x mean of y
## 0.08238317 0.07941224
The p-value tells that normal t-test showed a statistical difference between the treatment and control group while the PSM method is not valid. Considering the PSM being more reliable since it reflects similar patient information, it is hard to conclude whether the taking the A1C test helps to reduce the readmission.
Now, let’s see whether high blood sugar (more than 8%) had affected the readmission rate or not.
# 1 if A1C >8%, if not 0
data$A1Cresult_group<- ifelse(data$A1Cresult == '>8', 1, 0)
data$group<-as.logical(data$A1Cresult_group)
data$group <- ifelse(data$A1Cresult_group==1, 'Treatment', 'Control')
y_treatment <-data$readmitted1[data$group=='Treatment']
y_control<- data$readmitted1[data$group=='Control']
ci7 <- data %>%
ggplot(aes(x=as.factor(group),y=..count..,fill=as.factor(group)))+
geom_bar(color='black',alpha=0.8,show.legend = F)+
scale_fill_manual(values=mycolors)+
coord_flip()+
facet_grid(readmitted~.)+
labs(x='Group',y='Count',fill='Group',title='Treatment Group Count')
ci8 <- data %>% group_by(group)%>%
summarise(mean=round(mean(readmitted1),4))%>%
ggplot(aes(x=as.factor(group),y=mean,fill=as.factor(group)))+
geom_bar(stat='identity',color='black',alpha=0.8,show.legend = F)+
scale_fill_manual(values=mycolors)+
coord_flip()+
labs(x='Group',y='Mean',fill='Group',title='Mean of Readmission by Group')
set.seed(0)
match.it<-matchit(A1Cresult_group~., data=data,method = 'nearest',ratio=1)
psm_matchit_data <- match.data(match.it)
y_treatment_psm_matchit <-psm_matchit_data$readmitted1[psm_matchit_data$A1Cresult_group==1]
y_control_psm_matchit <- psm_matchit_data$readmitted1[psm_matchit_data$A1Cresult_group==0]
ci9 <- psm_matchit_data %>% group_by(group)%>%
summarise(mean=round(mean(readmitted1),4))%>%
ggplot(aes(x=as.factor(group),y=mean,fill=as.factor(group)))+
geom_bar(stat='identity',color='black',alpha=0.8)+
scale_fill_manual(values=mycolors)+
coord_flip()+
labs(x='Group',y='Mean',fill='Group',title='PSM Mean of Readmission by Group')
grid.arrange(ci7,ci8,ci9,ncol=1)
- For those who were diagnosed as higher A1C result (8%), both t-test
method showed that patients had lower readmission rate.
t.test(y_treatment,
y_control,
var.equal =F,
paired=FALSE,
conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: y_treatment and y_control
## t = -2.7441, df = 7376.1, p-value = 0.006083
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.01726678 -0.00287674
## sample estimates:
## mean of x mean of y
## 0.07955489 0.08962665
t.test(y_treatment_psm_matchit,
y_control_psm_matchit,
var.equal = F,
paired=FALSE,
conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: y_treatment_psm_matchit and y_control_psm_matchit
## t = -21.527, df = 10404, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1491346 -0.1242419
## sample estimates:
## mean of x mean of y
## 0.07955489 0.21624315