Assignment: Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

For this discussion, I’m looking at a larger UCI dataset that looks at patient characteristics of a diabetic patient cohort and their rates of hospital readmission within 30 days. Including demographics, there are also insurance reimbursement codes, diagnoses, and information about the utilization of medical services. Specific to diabetes, there are columns for A1C and serum glucose, which are both diagnostic of diabetes. Lastly, there’s information about the number and type of medication therapy each patient is on. The variable I’m interested in predicting is the total number of medications each patient is on. The link to dataset is found here: https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008#

set.seed(1121)


dm_df <- read.csv(file = 'dataset_diabetes/diabetic_data.csv')

row.number <- sample(1:nrow(dm_df), 0.8*nrow(dm_df))

dm_train <- dm_df[row.number, ]
test <- dm_df[-row.number, ] #test portion
summary(dm_train)
##   encounter_id        patient_nbr                     race      
##  Min.   :    12522   Min.   :      135   ?              : 1851  
##  1st Qu.: 85058214   1st Qu.: 23420045   AfricanAmerican:15375  
##  Median :152530755   Median : 45539028   Asian          :  521  
##  Mean   :165370543   Mean   : 54393317   Caucasian      :60854  
##  3rd Qu.:230700490   3rd Qu.: 87561142   Hispanic       : 1599  
##  Max.   :443867222   Max.   :189502619   Other          : 1212  
##                                                                 
##              gender           age              weight      admission_type_id
##  Female         :43780   [70-80):20927   ?        :78812   Min.   :1.000    
##  Male           :37629   [60-70):17960   [75-100) : 1083   1st Qu.:1.000    
##  Unknown/Invalid:    3   [50-60):13804   [50-75)  :  733   Median :1.000    
##                          [80-90):13764   [100-125):  515   Mean   :2.023    
##                          [40-50): 7704   [125-150):  112   3rd Qu.:3.000    
##                          [30-40): 3008   [25-50)  :   77   Max.   :8.000    
##                          (Other): 4245   (Other)  :   80                    
##  discharge_disposition_id admission_source_id time_in_hospital   payer_code   
##  Min.   : 1.000           Min.   : 1.000      Min.   : 1.000   ?      :32118  
##  1st Qu.: 1.000           1st Qu.: 1.000      1st Qu.: 2.000   MC     :25987  
##  Median : 1.000           Median : 7.000      Median : 4.000   HM     : 5042  
##  Mean   : 3.715           Mean   : 5.756      Mean   : 4.398   SP     : 4031  
##  3rd Qu.: 4.000           3rd Qu.: 7.000      3rd Qu.: 6.000   BC     : 3736  
##  Max.   :28.000           Max.   :25.000      Max.   :14.000   MD     : 2807  
##                                                                (Other): 7691  
##               medical_specialty num_lab_procedures num_procedures 
##  ?                     :39957   Min.   :  1.00     Min.   :0.000  
##  InternalMedicine      :11649   1st Qu.: 31.00     1st Qu.:0.000  
##  Emergency/Trauma      : 6062   Median : 44.00     Median :1.000  
##  Family/GeneralPractice: 5928   Mean   : 43.08     Mean   :1.342  
##  Cardiology            : 4268   3rd Qu.: 57.00     3rd Qu.:2.000  
##  Surgery-General       : 2466   Max.   :132.00     Max.   :6.000  
##  (Other)               :11082                                     
##  num_medications number_outpatient number_emergency  number_inpatient 
##  Min.   : 1.00   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.:10.00   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median :15.00   Median : 0.0000   Median : 0.0000   Median : 0.0000  
##  Mean   :16.03   Mean   : 0.3678   Mean   : 0.1974   Mean   : 0.6355  
##  3rd Qu.:20.00   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 1.0000  
##  Max.   :81.00   Max.   :39.0000   Max.   :76.0000   Max.   :21.0000  
##                                                                       
##      diag_1          diag_2          diag_3      number_diagnoses max_glu_serum
##  428    : 5456   276    : 5388   250    : 9202   Min.   : 1.000   >200: 1166   
##  414    : 5288   428    : 5342   401    : 6622   1st Qu.: 6.000   >300:  984   
##  786    : 3187   250    : 4829   276    : 4165   Median : 8.000   None:77171   
##  410    : 2878   427    : 3989   428    : 3689   Mean   : 7.424   Norm: 2091   
##  486    : 2809   401    : 2953   427    : 3189   3rd Qu.: 9.000                
##  427    : 2227   496    : 2655   414    : 2920   Max.   :16.000                
##  (Other):59567   (Other):56256   (Other):51625                                 
##  A1Cresult     metformin     repaglinide    nateglinide    chlorpropamide
##  >7  : 3041   Down  :  465   Down  :   34   Down  :   11   Down  :    1  
##  >8  : 6569   No    :65448   No    :80172   No    :80836   No    :81336  
##  None:67812   Steady:14634   Steady: 1119   Steady:  545   Steady:   70  
##  Norm: 3990   Up    :  865   Up    :   87   Up    :   20   Up    :    5  
##                                                                          
##                                                                          
##                                                                          
##  glimepiride    acetohexamide   glipizide      glyburide     tolbutamide   
##  Down  :  149   No    :81411   Down  :  454   Down  :  451   No    :81393  
##  No    :77258   Steady:    1   No    :71246   No    :72960   Steady:   19  
##  Steady: 3736                  Steady: 9109   Steady: 7369                 
##  Up    :  269                  Up    :  603   Up    :  632                 
##                                                                            
##                                                                            
##                                                                            
##  pioglitazone   rosiglitazone    acarbose       miglitol     troglitazone  
##  Down  :   87   Down  :   61   Down  :    2   Down  :    4   No    :81409  
##  No    :75570   No    :76298   No    :81161   No    :81381   Steady:    3  
##  Steady: 5556   Steady: 4904   Steady:  242   Steady:   26                 
##  Up    :  199   Up    :  149   Up    :    7   Up    :    1                 
##                                                                            
##                                                                            
##                                                                            
##   tolazamide    examide    citoglipton   insulin      glyburide.metformin
##  No    :81388   No:81412   No:81412    Down  : 9833   Down  :    5       
##  Steady:   23                          No    :37843   No    :80861       
##  Up    :    1                          Steady:24656   Steady:  539       
##                                        Up    : 9080   Up    :    7       
##                                                                          
##                                                                          
##                                                                          
##  glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone
##  No    :81405        No    :81411             No    :81410           
##  Steady:    7        Steady:    1             Steady:    2           
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  metformin.pioglitazone change     diabetesMed readmitted 
##  No    :81411           Ch:37631   No :18715   <30: 9071  
##  Steady:    1           No:43781   Yes:62697   >30:28534  
##                                                NO :43807  
##                                                           
##                                                           
##                                                           
## 
print(colnames(dm_train))
##  [1] "encounter_id"             "patient_nbr"             
##  [3] "race"                     "gender"                  
##  [5] "age"                      "weight"                  
##  [7] "admission_type_id"        "discharge_disposition_id"
##  [9] "admission_source_id"      "time_in_hospital"        
## [11] "payer_code"               "medical_specialty"       
## [13] "num_lab_procedures"       "num_procedures"          
## [15] "num_medications"          "number_outpatient"       
## [17] "number_emergency"         "number_inpatient"        
## [19] "diag_1"                   "diag_2"                  
## [21] "diag_3"                   "number_diagnoses"        
## [23] "max_glu_serum"            "A1Cresult"               
## [25] "metformin"                "repaglinide"             
## [27] "nateglinide"              "chlorpropamide"          
## [29] "glimepiride"              "acetohexamide"           
## [31] "glipizide"                "glyburide"               
## [33] "tolbutamide"              "pioglitazone"            
## [35] "rosiglitazone"            "acarbose"                
## [37] "miglitol"                 "troglitazone"            
## [39] "tolazamide"               "examide"                 
## [41] "citoglipton"              "insulin"                 
## [43] "glyburide.metformin"      "glipizide.metformin"     
## [45] "glimepiride.pioglitazone" "metformin.rosiglitazone" 
## [47] "metformin.pioglitazone"   "change"                  
## [49] "diabetesMed"              "readmitted"

About half of the rows describe medication use. These describe whether the patient is on the medication during the study, and whether the dose was intensified or titrated down. The study takes place between 1999 - 2008, which means there are several contemporary therapies that will not be included. Additionally, the most important diabetic therapy is treated rather unfairly. On the one indicate anything about the patient’s underlying illness. Contrast this with the ‘insulin’ category, which probably encapsulates a broad swath of available products, which are different in their use patterns. Briefly, insulin are either rapid-acting or long-acting. Rapid-acting doses are given with food or to reverse dangerously high blood glucose levels. Long-acting insulins are given less frequently and target elevations fasting blood glucose, which indicates more severe disease. Beyond this major limitation, the dataset does a good job characterizing certain important drug combinations. Finally, there is an interesting addition of rosiligtazone, which was removed from the market in 2007 after another study found a link to increased risk of heart attack.

Overall, this dataset also includes a mixture of variable types. I am most interested in the numerical variables, including the amounts of hospital admissions, diagnoses, and labs drawn. In addition to the medications treated as categorical variables, there are also some typically numerical values that are treated as ordinal - age, weight, and some important clinical outcomes like hospital readmissions and A1C. I’ll start by looking at the variable I’m interested in predicting - number of medications administered while hospitalized.

Exploring features

hist(dm_train$num_medications)

The variable I’m interested in predicting is skewed to the right, with the distribution centered around 15-16. This strikes me as a lot of medications even for a hospitalization, and skewness may also pose a problem for the model.

Next, I looked at medical specialty to see which admissions were best represented:

dm_train %>% group_by(medical_specialty) %>% count() %>% arrange(by = n) %>% tail(10)
## # A tibble: 10 x 2
## # Groups:   medical_specialty [10]
##    medical_specialty              n
##    <fct>                      <int>
##  1 Radiologist                  948
##  2 Orthopedics-Reconstructive   973
##  3 Orthopedics                 1143
##  4 Nephrology                  1277
##  5 Surgery-General             2466
##  6 Cardiology                  4268
##  7 Family/GeneralPractice      5928
##  8 Emergency/Trauma            6062
##  9 InternalMedicine           11649
## 10 ?                          39957

The vast majority of subjects are missing a medical specialty associated with their admission. I suspect that this means the patient was a general admission, but I don’t think this is a usable variable because of how much missing information is provided.

Next, I looked at were the two clinical findings used to stage diabetes, A1C and serum glucose.

dm_train$A1Cresult <- dm_train$A1Cresult  %>% recode_factor('>8' = '>7')
                                          
ggplot(data = dm_train, aes(x = A1Cresult, y = num_medications)) + 
  geom_boxplot()

I ended up keeping ‘None’ values for A1C because this represented a smaller proportion of the subjects and is only a periodically taken clinical marker - usually only taken every 3 months. However,r for max serum glucose, I found that there were many missing values, which to me was a counter intuitive finding. Not only does a general hospital population have blood glucose taken routinely, but it’s a near certainty that a diabetic subject would have at least one blood glucose taken while admitted to a hospital. For this reason, I omitted many rows of data which were missing blood glucose.

ggplot(data = dm_train, aes(x = max_glu_serum, y = num_medications)) + 
  geom_boxplot()

dm_train$max_glu_serum <- dm_train$max_glu_serum  %>% recode_factor('>300' = '>200')

dm_train <- dm_train %>% filter(max_glu_serum != 'None') %>% drop()

ggplot(data = dm_train, aes(x = max_glu_serum, y = num_medications)) + 
  geom_boxplot()

I also generated a correlation matrix with all of the numerical values. Note that several numerical values’ correlations can be ignored as these are arbitrary id’s. Any correlation likely has to do with time as a confounder.

library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.3
## corrplot 0.84 loaded
corrplot(cor(dm_train[, unlist(lapply(dm_train, is.numeric))])) #remove categorical variables

There appear to be some positive correlations between number of medications and variables like the time spent in hospital, number of procedures, and diagnoses. Next, I’ll look at the categorical variable age, and see if this conforms to my expectation that diabetic patients trend to be older on average.

plot(dm_train$age, las = 2)
title("Age Range of Hospitalized Diabetic Patients")

As expected, diabetic patients tend to be more elderly, with the average patient past retirement age of 65.

I also explored several categorical variables to see if there were any differences in distribution with respect to age versus number of medications. There weren’t’ any major anomalies aside from categories with too many missing values. Below is an example for patients on insulin. Ultimately, I decided to treat age as ordinal as it appeared to be a candidate for having a quadratic relationship.

ggplot(dm_df, aes(x = age, y = num_medications)) +
  geom_boxplot() +
  facet_wrap(~insulin)

ggplot(data = dm_train, aes(x = age, y = num_medications)) + 
  geom_boxplot()

dm_train$age <- factor(dm_train$age, order = TRUE)

There appears to be a parabolic relationship where medication use increases with age, then starts decreasing in the oldest subjects.

To visualize one quantitative versus quantitative relationship, I plotted the time in hospital versus the number of medications.

ggplot(dm_train, aes(x = time_in_hospital, y = num_medications, col = age, alpha= 0.9)) +
  geom_point() + 
  geom_jitter()

There appears to be a slightly positive relationship between length of stay and medication use. There also aren’t any age clusters apparent by color.

To find a multivariate regression, I compared number of medications with all of my quantitative variables, plus the dichotomous value ‘diabetesMed’ which provides a yes/no of whether the subject is on any diabetes medications. I also explored the interaction between the number of diagnoses and whether the patient is on diabetes medications. I also recoded two medications I was particularly interested, metformin and insulin, so that they were dichotomous.

dm_train$insulin <- dm_train$insulin  %>% recode_factor('Steady' = 'Yes')
dm_train$insulin <- dm_train$insulin  %>% recode_factor('Up' = 'Yes')
dm_train$insulin <- dm_train$insulin  %>% recode_factor('Down' = 'Yes')


dm_train$metformin <- dm_train$metformin  %>% recode_factor('Steady' = 'Yes')
dm_train$metformin <- dm_train$metformin  %>% recode_factor('Up' = 'Yes')
dm_train$metformin <- dm_train$metformin  %>% recode_factor('Down' = 'Yes')

Regression

dm_lm <- lm(data = dm_train, formula = num_medications ~ number_diagnoses + num_procedures + time_in_hospital**2 + diabetesMed + max_glu_serum + age + gender + insulin + change + change*num_lab_procedures)

summary(dm_lm)
## 
## Call:
## lm(formula = num_medications ~ number_diagnoses + num_procedures + 
##     time_in_hospital^2 + diabetesMed + max_glu_serum + age + 
##     gender + insulin + change + change * num_lab_procedures, 
##     data = dm_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0452  -3.3385  -0.4302   3.0157  25.8739 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  5.062372   0.532596   9.505  < 2e-16 ***
## number_diagnoses             0.915879   0.045593  20.088  < 2e-16 ***
## num_procedures               1.104903   0.072187  15.306  < 2e-16 ***
## time_in_hospital             0.908042   0.030526  29.746  < 2e-16 ***
## diabetesMedYes               1.147145   0.243797   4.705 2.62e-06 ***
## max_glu_serumNorm           -0.646647   0.171210  -3.777 0.000161 ***
## age.L                        0.454179   0.933337   0.487 0.626554    
## age.Q                       -3.128829   0.911092  -3.434 0.000600 ***
## age.C                       -0.270253   0.791301  -0.342 0.732722    
## age^4                        0.333394   0.645636   0.516 0.605616    
## age^5                        0.189362   0.510372   0.371 0.710636    
## age^6                       -0.184257   0.402377  -0.458 0.647033    
## age^7                       -0.017672   0.311794  -0.057 0.954803    
## age^8                       -0.154514   0.243393  -0.635 0.525573    
## genderMale                  -0.637136   0.160024  -3.981 6.96e-05 ***
## insulinNo                   -0.442093   0.224391  -1.970 0.048881 *  
## changeNo                    -2.936694   0.327296  -8.973  < 2e-16 ***
## num_lab_procedures          -0.023019   0.008567  -2.687 0.007243 ** 
## changeNo:num_lab_procedures  0.036404   0.009690   3.757 0.000174 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.134 on 4222 degrees of freedom
## Multiple R-squared:  0.4349, Adjusted R-squared:  0.4325 
## F-statistic: 180.5 on 18 and 4222 DF,  p-value: < 2.2e-16
plot(dm_lm)

hist(dm_lm$residuals)

mean(abs(dm_lm$residuals))
## [1] 3.979724

The ultimate model only explains 43% of the variation of the number of medications, which in my opinion provides too high of residuals to be clinically meaningful. However, many of the correlations I found conform to my intuition. Generally speaking, the more complicated the hospital stay, the more likely the patient will be on more medications. It’s also worth pointing out that the y-intercept for this model is about 4, I interpret as medications that patient’s are put on as part of policy, like heparin for DVT prophylaxis, or perhaps ‘as needed’ orders for pain and nausea.

The model has two quadratic features: age and time in hospital. The time in hospital value is of limited value unfortunately as the data have an artificial maximum of 14 days length of stay. The original analysis omitted hospital lengths of stay >15 days. Additionally, the age versus number of medications has several interesting interpretations. It makes sense that younger people would be on fewer medications, but it appears that the oldest patients are less likely to be on multiple medications. This could be because their doctors are concerned about too many medications or a survivor’s bias.

There were also several dichotomous variables, some of which I adapted. Patients on diabetes medications were much more likely to be on additional medications. By contrast, ‘no changes’ to medications was associated with being on fewer medications, as well as not being insulin-dependent. Men were also own fewer medications.

Finally, I incorporated one interaction into my model. I looked at the number of lab procedures versus medication changes. My rationale for this is that labs would drive the doses of medications to be changed. However, this last interaction does not have a very large impact on the prediction made by the model.

gvlma(dm_lm)
## 
## Call:
## lm(formula = num_medications ~ number_diagnoses + num_procedures + 
##     time_in_hospital^2 + diabetesMed + max_glu_serum + age + 
##     gender + insulin + change + change * num_lab_procedures, 
##     data = dm_train)
## 
## Coefficients:
##                 (Intercept)             number_diagnoses  
##                     5.06237                      0.91588  
##              num_procedures             time_in_hospital  
##                     1.10490                      0.90804  
##              diabetesMedYes            max_glu_serumNorm  
##                     1.14714                     -0.64665  
##                       age.L                        age.Q  
##                     0.45418                     -3.12883  
##                       age.C                        age^4  
##                    -0.27025                      0.33339  
##                       age^5                        age^6  
##                     0.18936                     -0.18426  
##                       age^7                        age^8  
##                    -0.01767                     -0.15451  
##                  genderMale                    insulinNo  
##                    -0.63714                     -0.44209  
##                    changeNo           num_lab_procedures  
##                    -2.93669                     -0.02302  
## changeNo:num_lab_procedures  
##                     0.03640  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = dm_lm) 
## 
##                       Value p-value                   Decision
## Global Stat        234.5632  0.0000 Assumptions NOT satisfied!
## Skewness           151.4939  0.0000 Assumptions NOT satisfied!
## Kurtosis            80.3722  0.0000 Assumptions NOT satisfied!
## Link Function        2.1621  0.1415    Assumptions acceptable.
## Heteroscedasticity   0.5349  0.4645    Assumptions acceptable.

Per output of glvma, the skewness and kurtosis assumptions are not satisfied. As I mentioned earlier, the response variable is skewed to the right. Transforming the response variable to the square root of number of medications did correct the distribution, but assumptions were still not acceptable.

Prediction

test$max_glu_serum <- test$max_glu_serum  %>% recode_factor('>300' = '>200') #recode test data

test <- test %>% filter(max_glu_serum != 'None') %>% drop()
test$insulin <- test$insulin  %>% recode_factor('Steady' = 'Yes')
test$insulin <- test$insulin  %>% recode_factor('Up' = 'Yes')
test$insulin <- test$insulin  %>% recode_factor('Down' = 'Yes')


test$metformin <- test$metformin  %>% recode_factor('Steady' = 'Yes')
test$metformin <- test$metformin  %>% recode_factor('Up' = 'Yes')
test$metformin <- test$metformin  %>% recode_factor('Down' = 'Yes')


prediction <- predict(dm_lm, test)

error <- prediction - test['num_medications']

hist(error$num_medications)

t.test(error)
## 
##  One Sample t-test
## 
## data:  error
## t = -1.592, df = 1104, p-value = 0.1117
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.54453528  0.05670154
## sample estimates:
##  mean of x 
## -0.2439169

Performing a prediction on my test set supports the validity of the model.

Conclusions

I was interested in this dataset because of the large amount of subjects and potential features. There was a good mix of numerical and categorical data, and there are many other avenues still worth exploring. Some parts I found limiting about this data set were the categorical variables like weight and age that could have easily been numerical. The final model I found could only explain less than half of the variation of number of medications, and the assumptions of the model weren’t satisfied. However, the model was able to make satisfactory predictions for many subjects, but the variance may make these findings clinically insignificant. In the future, I would like to consolidate many of the drug features into their respective therapeutic drug classes for more meaningful conclusions about drug therapy.