For this discussion, I’m looking at a larger UCI dataset that looks at patient characteristics of a diabetic patient cohort and their rates of hospital readmission within 30 days. Including demographics, there are also insurance reimbursement codes, diagnoses, and information about the utilization of medical services. Specific to diabetes, there are columns for A1C and serum glucose, which are both diagnostic of diabetes. Lastly, there’s information about the number and type of medication therapy each patient is on. The variable I’m interested in predicting is the total number of medications each patient is on. The link to dataset is found here: https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008#
set.seed(1121)
dm_df <- read.csv(file = 'dataset_diabetes/diabetic_data.csv')
row.number <- sample(1:nrow(dm_df), 0.8*nrow(dm_df))
dm_train <- dm_df[row.number, ]
test <- dm_df[-row.number, ] #test portion
summary(dm_train)
## encounter_id patient_nbr race
## Min. : 12522 Min. : 135 ? : 1851
## 1st Qu.: 85058214 1st Qu.: 23420045 AfricanAmerican:15375
## Median :152530755 Median : 45539028 Asian : 521
## Mean :165370543 Mean : 54393317 Caucasian :60854
## 3rd Qu.:230700490 3rd Qu.: 87561142 Hispanic : 1599
## Max. :443867222 Max. :189502619 Other : 1212
##
## gender age weight admission_type_id
## Female :43780 [70-80):20927 ? :78812 Min. :1.000
## Male :37629 [60-70):17960 [75-100) : 1083 1st Qu.:1.000
## Unknown/Invalid: 3 [50-60):13804 [50-75) : 733 Median :1.000
## [80-90):13764 [100-125): 515 Mean :2.023
## [40-50): 7704 [125-150): 112 3rd Qu.:3.000
## [30-40): 3008 [25-50) : 77 Max. :8.000
## (Other): 4245 (Other) : 80
## discharge_disposition_id admission_source_id time_in_hospital payer_code
## Min. : 1.000 Min. : 1.000 Min. : 1.000 ? :32118
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 2.000 MC :25987
## Median : 1.000 Median : 7.000 Median : 4.000 HM : 5042
## Mean : 3.715 Mean : 5.756 Mean : 4.398 SP : 4031
## 3rd Qu.: 4.000 3rd Qu.: 7.000 3rd Qu.: 6.000 BC : 3736
## Max. :28.000 Max. :25.000 Max. :14.000 MD : 2807
## (Other): 7691
## medical_specialty num_lab_procedures num_procedures
## ? :39957 Min. : 1.00 Min. :0.000
## InternalMedicine :11649 1st Qu.: 31.00 1st Qu.:0.000
## Emergency/Trauma : 6062 Median : 44.00 Median :1.000
## Family/GeneralPractice: 5928 Mean : 43.08 Mean :1.342
## Cardiology : 4268 3rd Qu.: 57.00 3rd Qu.:2.000
## Surgery-General : 2466 Max. :132.00 Max. :6.000
## (Other) :11082
## num_medications number_outpatient number_emergency number_inpatient
## Min. : 1.00 Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:10.00 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :15.00 Median : 0.0000 Median : 0.0000 Median : 0.0000
## Mean :16.03 Mean : 0.3678 Mean : 0.1974 Mean : 0.6355
## 3rd Qu.:20.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 1.0000
## Max. :81.00 Max. :39.0000 Max. :76.0000 Max. :21.0000
##
## diag_1 diag_2 diag_3 number_diagnoses max_glu_serum
## 428 : 5456 276 : 5388 250 : 9202 Min. : 1.000 >200: 1166
## 414 : 5288 428 : 5342 401 : 6622 1st Qu.: 6.000 >300: 984
## 786 : 3187 250 : 4829 276 : 4165 Median : 8.000 None:77171
## 410 : 2878 427 : 3989 428 : 3689 Mean : 7.424 Norm: 2091
## 486 : 2809 401 : 2953 427 : 3189 3rd Qu.: 9.000
## 427 : 2227 496 : 2655 414 : 2920 Max. :16.000
## (Other):59567 (Other):56256 (Other):51625
## A1Cresult metformin repaglinide nateglinide chlorpropamide
## >7 : 3041 Down : 465 Down : 34 Down : 11 Down : 1
## >8 : 6569 No :65448 No :80172 No :80836 No :81336
## None:67812 Steady:14634 Steady: 1119 Steady: 545 Steady: 70
## Norm: 3990 Up : 865 Up : 87 Up : 20 Up : 5
##
##
##
## glimepiride acetohexamide glipizide glyburide tolbutamide
## Down : 149 No :81411 Down : 454 Down : 451 No :81393
## No :77258 Steady: 1 No :71246 No :72960 Steady: 19
## Steady: 3736 Steady: 9109 Steady: 7369
## Up : 269 Up : 603 Up : 632
##
##
##
## pioglitazone rosiglitazone acarbose miglitol troglitazone
## Down : 87 Down : 61 Down : 2 Down : 4 No :81409
## No :75570 No :76298 No :81161 No :81381 Steady: 3
## Steady: 5556 Steady: 4904 Steady: 242 Steady: 26
## Up : 199 Up : 149 Up : 7 Up : 1
##
##
##
## tolazamide examide citoglipton insulin glyburide.metformin
## No :81388 No:81412 No:81412 Down : 9833 Down : 5
## Steady: 23 No :37843 No :80861
## Up : 1 Steady:24656 Steady: 539
## Up : 9080 Up : 7
##
##
##
## glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone
## No :81405 No :81411 No :81410
## Steady: 7 Steady: 1 Steady: 2
##
##
##
##
##
## metformin.pioglitazone change diabetesMed readmitted
## No :81411 Ch:37631 No :18715 <30: 9071
## Steady: 1 No:43781 Yes:62697 >30:28534
## NO :43807
##
##
##
##
print(colnames(dm_train))
## [1] "encounter_id" "patient_nbr"
## [3] "race" "gender"
## [5] "age" "weight"
## [7] "admission_type_id" "discharge_disposition_id"
## [9] "admission_source_id" "time_in_hospital"
## [11] "payer_code" "medical_specialty"
## [13] "num_lab_procedures" "num_procedures"
## [15] "num_medications" "number_outpatient"
## [17] "number_emergency" "number_inpatient"
## [19] "diag_1" "diag_2"
## [21] "diag_3" "number_diagnoses"
## [23] "max_glu_serum" "A1Cresult"
## [25] "metformin" "repaglinide"
## [27] "nateglinide" "chlorpropamide"
## [29] "glimepiride" "acetohexamide"
## [31] "glipizide" "glyburide"
## [33] "tolbutamide" "pioglitazone"
## [35] "rosiglitazone" "acarbose"
## [37] "miglitol" "troglitazone"
## [39] "tolazamide" "examide"
## [41] "citoglipton" "insulin"
## [43] "glyburide.metformin" "glipizide.metformin"
## [45] "glimepiride.pioglitazone" "metformin.rosiglitazone"
## [47] "metformin.pioglitazone" "change"
## [49] "diabetesMed" "readmitted"
About half of the rows describe medication use. These describe whether the patient is on the medication during the study, and whether the dose was intensified or titrated down. The study takes place between 1999 - 2008, which means there are several contemporary therapies that will not be included. Additionally, the most important diabetic therapy is treated rather unfairly. On the one indicate anything about the patient’s underlying illness. Contrast this with the ‘insulin’ category, which probably encapsulates a broad swath of available products, which are different in their use patterns. Briefly, insulin are either rapid-acting or long-acting. Rapid-acting doses are given with food or to reverse dangerously high blood glucose levels. Long-acting insulins are given less frequently and target elevations fasting blood glucose, which indicates more severe disease. Beyond this major limitation, the dataset does a good job characterizing certain important drug combinations. Finally, there is an interesting addition of rosiligtazone, which was removed from the market in 2007 after another study found a link to increased risk of heart attack.
Overall, this dataset also includes a mixture of variable types. I am most interested in the numerical variables, including the amounts of hospital admissions, diagnoses, and labs drawn. In addition to the medications treated as categorical variables, there are also some typically numerical values that are treated as ordinal - age, weight, and some important clinical outcomes like hospital readmissions and A1C. I’ll start by looking at the variable I’m interested in predicting - number of medications administered while hospitalized.
hist(dm_train$num_medications)
The variable I’m interested in predicting is skewed to the right, with the distribution centered around 15-16. This strikes me as a lot of medications even for a hospitalization, and skewness may also pose a problem for the model.
Next, I looked at medical specialty to see which admissions were best represented:
dm_train %>% group_by(medical_specialty) %>% count() %>% arrange(by = n) %>% tail(10)
## # A tibble: 10 x 2
## # Groups: medical_specialty [10]
## medical_specialty n
## <fct> <int>
## 1 Radiologist 948
## 2 Orthopedics-Reconstructive 973
## 3 Orthopedics 1143
## 4 Nephrology 1277
## 5 Surgery-General 2466
## 6 Cardiology 4268
## 7 Family/GeneralPractice 5928
## 8 Emergency/Trauma 6062
## 9 InternalMedicine 11649
## 10 ? 39957
The vast majority of subjects are missing a medical specialty associated with their admission. I suspect that this means the patient was a general admission, but I don’t think this is a usable variable because of how much missing information is provided.
Next, I looked at were the two clinical findings used to stage diabetes, A1C and serum glucose.
dm_train$A1Cresult <- dm_train$A1Cresult %>% recode_factor('>8' = '>7')
ggplot(data = dm_train, aes(x = A1Cresult, y = num_medications)) +
geom_boxplot()
I ended up keeping ‘None’ values for A1C because this represented a smaller proportion of the subjects and is only a periodically taken clinical marker - usually only taken every 3 months. However,r for max serum glucose, I found that there were many missing values, which to me was a counter intuitive finding. Not only does a general hospital population have blood glucose taken routinely, but it’s a near certainty that a diabetic subject would have at least one blood glucose taken while admitted to a hospital. For this reason, I omitted many rows of data which were missing blood glucose.
ggplot(data = dm_train, aes(x = max_glu_serum, y = num_medications)) +
geom_boxplot()
dm_train$max_glu_serum <- dm_train$max_glu_serum %>% recode_factor('>300' = '>200')
dm_train <- dm_train %>% filter(max_glu_serum != 'None') %>% drop()
ggplot(data = dm_train, aes(x = max_glu_serum, y = num_medications)) +
geom_boxplot()
I also generated a correlation matrix with all of the numerical values. Note that several numerical values’ correlations can be ignored as these are arbitrary id’s. Any correlation likely has to do with time as a confounder.
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.3
## corrplot 0.84 loaded
corrplot(cor(dm_train[, unlist(lapply(dm_train, is.numeric))])) #remove categorical variables
There appear to be some positive correlations between number of medications and variables like the time spent in hospital, number of procedures, and diagnoses. Next, I’ll look at the categorical variable age, and see if this conforms to my expectation that diabetic patients trend to be older on average.
plot(dm_train$age, las = 2)
title("Age Range of Hospitalized Diabetic Patients")
As expected, diabetic patients tend to be more elderly, with the average patient past retirement age of 65.
I also explored several categorical variables to see if there were any differences in distribution with respect to age versus number of medications. There weren’t’ any major anomalies aside from categories with too many missing values. Below is an example for patients on insulin. Ultimately, I decided to treat age as ordinal as it appeared to be a candidate for having a quadratic relationship.
ggplot(dm_df, aes(x = age, y = num_medications)) +
geom_boxplot() +
facet_wrap(~insulin)
ggplot(data = dm_train, aes(x = age, y = num_medications)) +
geom_boxplot()
dm_train$age <- factor(dm_train$age, order = TRUE)
There appears to be a parabolic relationship where medication use increases with age, then starts decreasing in the oldest subjects.
To visualize one quantitative versus quantitative relationship, I plotted the time in hospital versus the number of medications.
ggplot(dm_train, aes(x = time_in_hospital, y = num_medications, col = age, alpha= 0.9)) +
geom_point() +
geom_jitter()
There appears to be a slightly positive relationship between length of stay and medication use. There also aren’t any age clusters apparent by color.
To find a multivariate regression, I compared number of medications with all of my quantitative variables, plus the dichotomous value ‘diabetesMed’ which provides a yes/no of whether the subject is on any diabetes medications. I also explored the interaction between the number of diagnoses and whether the patient is on diabetes medications. I also recoded two medications I was particularly interested, metformin and insulin, so that they were dichotomous.
dm_train$insulin <- dm_train$insulin %>% recode_factor('Steady' = 'Yes')
dm_train$insulin <- dm_train$insulin %>% recode_factor('Up' = 'Yes')
dm_train$insulin <- dm_train$insulin %>% recode_factor('Down' = 'Yes')
dm_train$metformin <- dm_train$metformin %>% recode_factor('Steady' = 'Yes')
dm_train$metformin <- dm_train$metformin %>% recode_factor('Up' = 'Yes')
dm_train$metformin <- dm_train$metformin %>% recode_factor('Down' = 'Yes')
dm_lm <- lm(data = dm_train, formula = num_medications ~ number_diagnoses + num_procedures + time_in_hospital**2 + diabetesMed + max_glu_serum + age + gender + insulin + change + change*num_lab_procedures)
summary(dm_lm)
##
## Call:
## lm(formula = num_medications ~ number_diagnoses + num_procedures +
## time_in_hospital^2 + diabetesMed + max_glu_serum + age +
## gender + insulin + change + change * num_lab_procedures,
## data = dm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0452 -3.3385 -0.4302 3.0157 25.8739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.062372 0.532596 9.505 < 2e-16 ***
## number_diagnoses 0.915879 0.045593 20.088 < 2e-16 ***
## num_procedures 1.104903 0.072187 15.306 < 2e-16 ***
## time_in_hospital 0.908042 0.030526 29.746 < 2e-16 ***
## diabetesMedYes 1.147145 0.243797 4.705 2.62e-06 ***
## max_glu_serumNorm -0.646647 0.171210 -3.777 0.000161 ***
## age.L 0.454179 0.933337 0.487 0.626554
## age.Q -3.128829 0.911092 -3.434 0.000600 ***
## age.C -0.270253 0.791301 -0.342 0.732722
## age^4 0.333394 0.645636 0.516 0.605616
## age^5 0.189362 0.510372 0.371 0.710636
## age^6 -0.184257 0.402377 -0.458 0.647033
## age^7 -0.017672 0.311794 -0.057 0.954803
## age^8 -0.154514 0.243393 -0.635 0.525573
## genderMale -0.637136 0.160024 -3.981 6.96e-05 ***
## insulinNo -0.442093 0.224391 -1.970 0.048881 *
## changeNo -2.936694 0.327296 -8.973 < 2e-16 ***
## num_lab_procedures -0.023019 0.008567 -2.687 0.007243 **
## changeNo:num_lab_procedures 0.036404 0.009690 3.757 0.000174 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.134 on 4222 degrees of freedom
## Multiple R-squared: 0.4349, Adjusted R-squared: 0.4325
## F-statistic: 180.5 on 18 and 4222 DF, p-value: < 2.2e-16
plot(dm_lm)
hist(dm_lm$residuals)
mean(abs(dm_lm$residuals))
## [1] 3.979724
The ultimate model only explains 43% of the variation of the number of medications, which in my opinion provides too high of residuals to be clinically meaningful. However, many of the correlations I found conform to my intuition. Generally speaking, the more complicated the hospital stay, the more likely the patient will be on more medications. It’s also worth pointing out that the y-intercept for this model is about 4, I interpret as medications that patient’s are put on as part of policy, like heparin for DVT prophylaxis, or perhaps ‘as needed’ orders for pain and nausea.
The model has two quadratic features: age and time in hospital. The time in hospital value is of limited value unfortunately as the data have an artificial maximum of 14 days length of stay. The original analysis omitted hospital lengths of stay >15 days. Additionally, the age versus number of medications has several interesting interpretations. It makes sense that younger people would be on fewer medications, but it appears that the oldest patients are less likely to be on multiple medications. This could be because their doctors are concerned about too many medications or a survivor’s bias.
There were also several dichotomous variables, some of which I adapted. Patients on diabetes medications were much more likely to be on additional medications. By contrast, ‘no changes’ to medications was associated with being on fewer medications, as well as not being insulin-dependent. Men were also own fewer medications.
Finally, I incorporated one interaction into my model. I looked at the number of lab procedures versus medication changes. My rationale for this is that labs would drive the doses of medications to be changed. However, this last interaction does not have a very large impact on the prediction made by the model.
gvlma(dm_lm)
##
## Call:
## lm(formula = num_medications ~ number_diagnoses + num_procedures +
## time_in_hospital^2 + diabetesMed + max_glu_serum + age +
## gender + insulin + change + change * num_lab_procedures,
## data = dm_train)
##
## Coefficients:
## (Intercept) number_diagnoses
## 5.06237 0.91588
## num_procedures time_in_hospital
## 1.10490 0.90804
## diabetesMedYes max_glu_serumNorm
## 1.14714 -0.64665
## age.L age.Q
## 0.45418 -3.12883
## age.C age^4
## -0.27025 0.33339
## age^5 age^6
## 0.18936 -0.18426
## age^7 age^8
## -0.01767 -0.15451
## genderMale insulinNo
## -0.63714 -0.44209
## changeNo num_lab_procedures
## -2.93669 -0.02302
## changeNo:num_lab_procedures
## 0.03640
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = dm_lm)
##
## Value p-value Decision
## Global Stat 234.5632 0.0000 Assumptions NOT satisfied!
## Skewness 151.4939 0.0000 Assumptions NOT satisfied!
## Kurtosis 80.3722 0.0000 Assumptions NOT satisfied!
## Link Function 2.1621 0.1415 Assumptions acceptable.
## Heteroscedasticity 0.5349 0.4645 Assumptions acceptable.
Per output of glvma, the skewness and kurtosis assumptions are not satisfied. As I mentioned earlier, the response variable is skewed to the right. Transforming the response variable to the square root of number of medications did correct the distribution, but assumptions were still not acceptable.
test$max_glu_serum <- test$max_glu_serum %>% recode_factor('>300' = '>200') #recode test data
test <- test %>% filter(max_glu_serum != 'None') %>% drop()
test$insulin <- test$insulin %>% recode_factor('Steady' = 'Yes')
test$insulin <- test$insulin %>% recode_factor('Up' = 'Yes')
test$insulin <- test$insulin %>% recode_factor('Down' = 'Yes')
test$metformin <- test$metformin %>% recode_factor('Steady' = 'Yes')
test$metformin <- test$metformin %>% recode_factor('Up' = 'Yes')
test$metformin <- test$metformin %>% recode_factor('Down' = 'Yes')
prediction <- predict(dm_lm, test)
error <- prediction - test['num_medications']
hist(error$num_medications)
t.test(error)
##
## One Sample t-test
##
## data: error
## t = -1.592, df = 1104, p-value = 0.1117
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.54453528 0.05670154
## sample estimates:
## mean of x
## -0.2439169
Performing a prediction on my test set supports the validity of the model.
I was interested in this dataset because of the large amount of subjects and potential features. There was a good mix of numerical and categorical data, and there are many other avenues still worth exploring. Some parts I found limiting about this data set were the categorical variables like weight and age that could have easily been numerical. The final model I found could only explain less than half of the variation of number of medications, and the assumptions of the model weren’t satisfied. However, the model was able to make satisfactory predictions for many subjects, but the variance may make these findings clinically insignificant. In the future, I would like to consolidate many of the drug features into their respective therapeutic drug classes for more meaningful conclusions about drug therapy.