Diabetes Report

Diabetes 130-US hospitals dataset: an analysis of factors related to readmission and other outcomes pertaining to patients with diabetes.

1. Introduction and Methodology

The management of hyperglycemia in hospitalized patients is of significant importance, as it affects outcomes such as morbidity and mortality. In this project, we analyze factors related to readmission as well as other outcomes pertaining to patients with diabetes. The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and contains over 50 features describing patient and hospital outcomes. Information was extracted for encounters that satisfied the following criteria:

  • It is an inpatient encounter (a hospital admission)

  • It is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis.

  • The length of stay was at least 1 day and at most 14 days

  • Laboratory tests were performed during the encounter

  • Medications were administered during the encounter

The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnoses, number of medications, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, and many others.

Methodology

In this project, our goal is to pre-process, analyze, visualize, and conduct unsupervised learning on this dataset. In detail, we plan to use clustering analysis to group patients together based on their features. We first start with data pre-processing: exploring the data, managing missing values, taking care of near-zero variance variables, and transforming variables. Following that, we carry out the clustering analysis, first with the k-means algorithm and second with hierarchical clustering.

2. Data cleaning and exploration

The diabetes dataset is available from the UCI Machine Learning Repository at: https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008. Initially, the dataset contains 101766 observations and 49 features. In this part, our goal is to become familiar with the dataset and get it ready through preliminary statistical analysis, variable transformation, handling of NA values, and so on.

## 'data.frame':    101766 obs. of  49 variables:
##  $ patient_nbr             : int  8222157 55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 ...
##  $ race                    : chr  "Caucasian" "Caucasian" "AfricanAmerican" "Caucasian" ...
##  $ gender                  : chr  "Female" "Female" "Female" "Male" ...
##  $ age                     : chr  "[0-10)" "[10-20)" "[20-30)" "[30-40)" ...
##  $ weight                  : chr  "?" "?" "?" "?" ...
##  $ admission_type_id       : int  6 1 1 1 1 2 3 1 2 3 ...
##  $ discharge_disposition_id: int  25 1 1 1 1 1 1 1 1 3 ...
##  $ admission_source_id     : int  1 7 7 7 7 2 2 7 4 4 ...
##  $ time_in_hospital        : int  1 3 2 2 1 3 4 5 13 12 ...
##  $ payer_code              : chr  "?" "?" "?" "?" ...
##  $ medical_specialty       : chr  "Pediatrics-Endocrinology" "?" "?" "?" ...
##  $ num_lab_procedures      : int  41 59 11 44 51 31 70 73 68 33 ...
##  $ num_procedures          : int  0 0 5 1 0 6 1 0 2 3 ...
##  $ num_medications         : int  1 18 13 16 8 16 21 12 28 18 ...
##  $ number_outpatient       : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ number_emergency        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ number_inpatient        : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ diag_1                  : chr  "250.83" "276" "648" "8" ...
##  $ diag_2                  : chr  "?" "250.01" "250" "250.43" ...
##  $ diag_3                  : chr  "?" "255" "27" "403" ...
##  $ number_diagnoses        : int  1 9 6 7 5 9 7 8 8 8 ...
##  $ max_glu_serum           : chr  "None" "None" "None" "None" ...
##  $ A1Cresult               : chr  "None" "None" "None" "None" ...
##  $ metformin               : chr  "No" "No" "No" "No" ...
##  $ repaglinide             : chr  "No" "No" "No" "No" ...
##  $ nateglinide             : chr  "No" "No" "No" "No" ...
##  $ chlorpropamide          : chr  "No" "No" "No" "No" ...
##  $ glimepiride             : chr  "No" "No" "No" "No" ...
##  $ acetohexamide           : chr  "No" "No" "No" "No" ...
##  $ glipizide               : chr  "No" "No" "Steady" "No" ...
##  $ glyburide               : chr  "No" "No" "No" "No" ...
##  $ tolbutamide             : chr  "No" "No" "No" "No" ...
##  $ pioglitazone            : chr  "No" "No" "No" "No" ...
##  $ rosiglitazone           : chr  "No" "No" "No" "No" ...
##  $ acarbose                : chr  "No" "No" "No" "No" ...
##  $ miglitol                : chr  "No" "No" "No" "No" ...
##  $ troglitazone            : chr  "No" "No" "No" "No" ...
##  $ tolazamide              : chr  "No" "No" "No" "No" ...
##  $ examide                 : chr  "No" "No" "No" "No" ...
##  $ citoglipton             : chr  "No" "No" "No" "No" ...
##  $ insulin                 : chr  "No" "Up" "No" "Up" ...
##  $ glyburide.metformin     : chr  "No" "No" "No" "No" ...
##  $ glipizide.metformin     : chr  "No" "No" "No" "No" ...
##  $ glimepiride.pioglitazone: chr  "No" "No" "No" "No" ...
##  $ metformin.rosiglitazone : chr  "No" "No" "No" "No" ...
##  $ metformin.pioglitazone  : chr  "No" "No" "No" "No" ...
##  $ change                  : chr  "No" "Ch" "No" "Ch" ...
##  $ diabetesMed             : chr  "No" "Yes" "Yes" "Yes" ...
##  $ readmitted              : chr  "NO" ">30" "NO" "NO" ...
## [1] 101766     49

2.1. Missingness

The dataset has many data points recorded as '?'. This is evidently an indication of a missing value, so we replaced every '?' with NA to manage them more easily. Upon examination, we see that missing values are found in only seven of the 49 features. The table below summarizes the percentage of missing data for each of these features; for some variables, such as weight, the proportion of missing values is extremely high.

##              race            weight        payer_code medical_specialty 
##              2.23             96.86             39.56             49.08 
##            diag_1            diag_2            diag_3 
##              0.02              0.35              1.40

To deal with the missing values, we consider a threshold of 30%, shown as a red line in the histogram of missing-value percentages below; variables with more than 30% missing values were removed from consideration at this point. Thus, the variables weight, payer_code and medical_specialty were removed from the dataset. The other variables are left as they are for now, since further manipulations of the dataset will take care of their missing values.
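A minimal R sketch of these two steps (replacing '?' with NA, then dropping the high-missingness variables), assuming the data frame is named diabetes; the 30% cut-off is the threshold described above:

# Treat '?' as missing, then compute the percentage of NAs per column
diabetes[diabetes == "?"] <- NA
na_pct <- colMeans(is.na(diabetes)) * 100
round(na_pct[na_pct > 0], 2)

# Drop the variables whose share of missing values exceeds 30%
diabetes <- diabetes[, na_pct <= 30]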

2.2. Near-zero variance variables

Near-zero variance (NZV) refers to the case where a variable takes almost a single value across the whole dataset. Such variables are not only often non-informative; they can also break some of the data mining methods you may want to use. It is therefore good practice to address them, particularly when dealing with a dataset with a large number of variables. In our case, we have a few factor variables with near-zero variance. Keeping them would later generate a considerable number of dummy variables and increase the computational complexity and resource requirements. Consequently, we removed all NZV variables. By default, the nearZeroVar() function from the R package caret flags a variable as near-zero variance if the percentage of unique values is less than 10% and the ratio of the most frequent value to the second most frequent is larger than 95/5. Executing this function on our dataset removes 18 near-zero variance variables, listed below.

##  [1] "max_glu_serum"            "repaglinide"             
##  [3] "nateglinide"              "chlorpropamide"          
##  [5] "glimepiride"              "acetohexamide"           
##  [7] "tolbutamide"              "acarbose"                
##  [9] "miglitol"                 "troglitazone"            
## [11] "tolazamide"               "examide"                 
## [13] "citoglipton"              "glyburide.metformin"     
## [15] "glipizide.metformin"      "glimepiride.pioglitazone"
## [17] "metformin.rosiglitazone"  "metformin.pioglitazone"
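A sketch of this step with the caret package, assuming the data frame is named diabetes:

library(caret)

# Flag columns whose frequency ratio and unique-value percentage
# exceed the default thresholds (95/5 and 10%, respectively)
nzv <- nearZeroVar(diabetes, names = TRUE)
nzv  # the 18 variables listed above

diabetes <- diabetes[, !(names(diabetes) %in% nzv)]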

2.3. Multiple encounters of a patient

The dataset contains multiple rows with the same patient number (patient ID). It is unclear whether multiple encounters of the same patient, i.e. visits, are independent; there is a risk that they are correlated and would therefore introduce bias. To eliminate this risk, we decided to keep exactly one encounter per patient, namely the one with the maximum time_in_hospital, assuming that time_in_hospital is characteristic of readmission and presents sufficient variance in the training data. The number of duplicate encounters removed this way is printed below. Having done that, we removed the patient ID and encounter ID variables. The dataset is thus reduced to 27 variables and 71518 observations.

## [1] 30248
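A base-R sketch of this deduplication, assuming the data frame is named diabetes:

# Sort so each patient's longest stay comes first, then keep
# only the first row per patient
diabetes <- diabetes[order(diabetes$patient_nbr, -diabetes$time_in_hospital), ]
diabetes <- diabetes[!duplicated(diabetes$patient_nbr), ]

# Drop the identifier columns (encounter_id, if still present)
diabetes$patient_nbr  <- NULL
diabetes$encounter_id <- NULL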

2.4. Variable transformation

2.4.1. Categorical variables

Some of the categorical variables contain too many categories. This is the case for the three variables diag_1, diag_2, and diag_3, each with some 700 levels, which would require around 900 dummy variables and make the computation expensive to manage. To consolidate the levels of these three variables, we follow the rule in table 2 of the original paper about this dataset: https://www.hindawi.com/journals/bmri/2014/781670/. The levels are thereby reduced to 9 categories, and at the same time the NA values of these three variables are taken care of by assigning them to the 'Other' level.

Besides these variables, we also took care of the other factor variables. The levels of the gender variable were reduced from 3 (male, female and unknown) to 2 (male and female); the age variable was consolidated from 10 levels to 4; and the admission source variable was consolidated from 25 levels to 5 (referral, transfer, birth, unknown and other). For the race variable, we decided to remove the missing values, which brings the number of observations to 69598 patients.

One additional point to emphasize: since our goal is clustering analysis, and those algorithms typically need numerical values to compute metrics such as distances, one solution for using categorical variables is to transform them into numerical values with one-hot encoding. That is the solution we adopt in this project (a sketch follows the factor list below).

## 
##     circulatory        Diabetes       Digestive   Genitourinary          Injury 
##           21641            5710            7819            3562            4938 
## Musculoskeletal       Neoplasms           Other     Respiratory 
##            3945            2828           11357            9718
## 
##     circulatory        Diabetes       Digestive   Genitourinary          Injury 
##           23227            9325            3094            5710            2636 
## Musculoskeletal       Neoplasms           Other     Respiratory 
##            1336            1888           16879            7423
## 
##     circulatory        Diabetes       Digestive   Genitourinary          Injury 
##           22892           12239            2814            4458            2595 
## Musculoskeletal       Neoplasms           Other     Respiratory 
##            1412            1599           18426            5083
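A sketch of the consolidation rule for one of the diagnosis columns, following the ICD-9 ranges in table 2 of the paper linked above; the helper name group_diag is ours, and NA values together with the non-numeric V/E codes fall through to 'Other':

# Map an ICD-9 code to one of the nine diagnosis groups
group_diag <- function(code) {
  x <- suppressWarnings(as.numeric(code))  # V/E codes become NA here
  ifelse(is.na(x), "Other",
  ifelse(floor(x) == 250, "Diabetes",
  ifelse((x >= 390 & x <= 459) | floor(x) == 785, "circulatory",
  ifelse((x >= 460 & x <= 519) | floor(x) == 786, "Respiratory",
  ifelse((x >= 520 & x <= 579) | floor(x) == 787, "Digestive",
  ifelse((x >= 580 & x <= 629) | floor(x) == 788, "Genitourinary",
  ifelse(x >= 800 & x <= 999, "Injury",
  ifelse(x >= 710 & x <= 739, "Musculoskeletal",
  ifelse(x >= 140 & x <= 239, "Neoplasms", "Other")))))))))
}

diabetes$diag_1_fact <- factor(group_diag(diabetes$diag_1))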

Factor variables in the dataset:

## [1] "gender"                   "age"                     
## [3] "admission_type_id"        "discharge_disposition_id"
## [5] "admission_source_id"      "diag_1_fact"             
## [7] "diag_2_fact"              "diag_3_fact"
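For the one-hot encoding itself, a sketch using caret's dummyVars, assuming every remaining factor is to be expanded into 0/1 indicator columns:

library(caret)

# Expand all factor columns into indicator variables;
# numeric columns pass through unchanged
dmy <- dummyVars(~ ., data = diabetes)
diabetes_num <- data.frame(predict(dmy, newdata = diabetes))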

2.4.2. Numerical variables

Concerning the numerical variables, we have 8 numerical features in the dataset, none of which has missing values, but a few of them have outliers that need to be investigated.

  • Outliers

Plotting the numerical variables, we find that the extreme values are spread among a handful of observations; the variables with outliers are number_outpatient, number_inpatient and number_emergency. However, these variables have means very close to zero, so removing the few outliers would leave them nearly constant and zero out their summary statistics, which caused computational issues in subsequent processing. Consequently, the few outliers were kept as they were.

## [1] "time_in_hospital"   "num_lab_procedures" "num_procedures"    
## [4] "num_medications"    "number_outpatient"  "number_emergency"  
## [7] "number_inpatient"   "number_diagnoses"

  • Correlations

Another aspect to explore for the numerical variables is the correlation between them. The heatmap below shows these correlations in absolute value: as the color goes from dark blue to dark red, the absolute correlation grows from 0 to 1. We see that, in general, these variables do not have strong correlations.
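A sketch of the computation behind the heatmap, using the eight numeric columns listed above (the report's actual plotting code may differ):

num_vars <- c("time_in_hospital", "num_lab_procedures", "num_procedures",
              "num_medications", "number_outpatient", "number_emergency",
              "number_inpatient", "number_diagnoses")

cor_abs <- abs(cor(diabetes[, num_vars]))  # absolute correlations
heatmap(cor_abs, symm = TRUE)              # base-R heatmap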

After all the cleaning, a summary of the final dataset is presented below.

##      race              gender           age        admission_type_id
##  Length:69598       Female:37062   [0-30) : 1758   urgent  :48388   
##  Class :character   Male  :32536   [30-60):21342   elective:13577   
##  Mode  :character                  [60-80):33199   unknown : 7633   
##                                    80+    :13299                    
##                                                                     
##                                                                     
##                                                                     
##  discharge_disposition_id admission_source_id time_in_hospital
##  d:63515                  r:59800             Min.   : 1.000  
##  o: 2975                  t: 4734             1st Qu.: 2.000  
##  h:  568                  o:   13             Median : 4.000  
##  u: 2540                  u:  193             Mean   : 4.768  
##                           b: 4858             3rd Qu.: 7.000  
##                                               Max.   :14.000  
##                                                               
##  num_lab_procedures num_procedures  num_medications number_outpatient
##  Min.   :  1.00     Min.   :0.000   Min.   : 1.0    Min.   : 0.0000  
##  1st Qu.: 32.00     1st Qu.:0.000   1st Qu.:10.0    1st Qu.: 0.0000  
##  Median : 45.00     Median :1.000   Median :15.0    Median : 0.0000  
##  Mean   : 44.11     Mean   :1.491   Mean   :16.4    Mean   : 0.3065  
##  3rd Qu.: 58.00     3rd Qu.:2.000   3rd Qu.:21.0    3rd Qu.: 0.0000  
##  Max.   :132.00     Max.   :6.000   Max.   :81.0    Max.   :40.0000  
##                                                                      
##  number_emergency  number_inpatient  number_diagnoses  A1Cresult        
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 1.000   Length:69598      
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 6.000   Class :character  
##  Median : 0.0000   Median : 0.0000   Median : 8.000   Mode  :character  
##  Mean   : 0.1254   Mean   : 0.3264   Mean   : 7.359                     
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 9.000                     
##  Max.   :63.0000   Max.   :19.0000   Max.   :16.000                     
##                                                                         
##   metformin          glipizide          glyburide         pioglitazone      
##  Length:69598       Length:69598       Length:69598       Length:69598      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  rosiglitazone        insulin             change          diabetesMed       
##  Length:69598       Length:69598       Length:69598       Length:69598      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   readmitted             diag_1_fact           diag_2_fact   
##  Length:69598       circulatory:21047   circulatory  :22619  
##  Class :character   Other      :11031   Other        :16397  
##  Mode  :character   Respiratory: 9465   Diabetes     : 9047  
##                     Digestive  : 7636   Respiratory  : 7250  
##                     Diabetes   : 5551   Genitourinary: 5590  
##                     Injury     : 4806   Digestive    : 3014  
##                     (Other)    :10062   (Other)      : 5681  
##         diag_3_fact   
##  circulatory  :22306  
##  Other        :17933  
##  Diabetes     :11868  
##  Respiratory  : 4960  
##  Genitourinary: 4335  
##  Digestive    : 2747  
##  (Other)      : 5449

3. Unsupervised learning

3.1. PCA

We begin this part with a principal component analysis (PCA) on the numerical variables, in order to transform the set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables. The idea is to reduce the number of variables while maintaining approximately the same amount of information. After the computations, it turns out that the proportion of variance explained (PVE) by the first component is around 25%, and from the figure below we see that the PVE of the first two principal components together is around 40%.
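The report ran this step in Python; an equivalent sketch in base R, assuming the one-hot encoded data frame diabetes_num from the sketch in section 2.4:

# PCA on the standardized features
pca <- prcomp(diabetes_num, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(pve[1:2], 3)  # first two components

plot(cumsum(pve), type = "b",
     xlab = "Principal component", ylab = "Cumulative PVE")

# Loadings of the first component, sorted (cf. the listing below)
sort(pca$rotation[, 1])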


One thing we can do with the principal components is to use the first two of them to explore the data, taking into account the classes of the readmission variable. The figure below presents the projection of the dataset onto the first two principal components. There appear to be three primary clusters: one especially large one on the left, and two identically shaped smaller ones on the right.

In the above graphic, the clusters are separated along the direction of the first component, which means that this component's loadings can tell us which features change as we move in that direction. Analyzing the loadings therefore provides insight into which categorical features vary along the first component.

## change_No              -0.391364
## diabetesMed_No         -0.373782
## insulin_No             -0.288792
## metformin_No           -0.207929
## glipizide_No           -0.128099
## pioglitazone_No        -0.119529
## rosiglitazone_No       -0.117951
## glyburide_No           -0.112803
## glipizide_Steady        0.112025
## rosiglitazone_Steady    0.113399
## pioglitazone_Steady     0.113618
## insulin_Steady          0.121416
## insulin_Down            0.140969
## insulin_Up              0.143116
## num_medications         0.163507
## metformin_Steady        0.189344
## diabetesMed_Yes         0.373782
## change_Ch               0.391364
## Name: PC_0, dtype: float64

We can see that this direction expresses the patient's usage of diabetes medication: a lower score suggests that a patient did not really change medication or was not using any diabetes medications; conversely, a larger score suggests that a patient had a change in medication, was using at least one medication, tended to use more medications, or was using insulin or metformin. We were curious to know what these groups represented in detail, which leads us to use K-means clustering with k = 3 in the next section to explore the clusters. We will come back to the meaning of the loading values when analyzing the clusters in more depth.

3.2. Clustering

3.2.1. K-means clustering

To begin with, we use k-means clustering to split the dataset into the three clusters identified above. Running the algorithm in Python yields three well-partitioned clusters that can now be interrogated for their meaning with respect to the original dataset.
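An equivalent sketch in R (the report ran k-means in Python; k = 3 follows from the PCA plot above):

set.seed(42)  # k-means depends on its random initialization
km <- kmeans(scale(diabetes_num), centers = 3, nstart = 20)
table(km$cluster)  # cluster sizes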

To make sense of these clusters, we take each observation's cluster number and add it to the data frame as a column, so we can analyze the variables within each cluster. We start with the numerical variables and compute their means within each cluster, normalized so that variables of different magnitudes can be compared on a similar scale. Our findings are presented in the following table and figure (a sketch of the computation follows the table).

##                     Cluster 1  Cluster 2  Cluster 3
## time_in_hospital     1.113976   0.926647   0.959377
## num_lab_procedures   1.049365   0.970667   0.979968
## num_procedures       1.019749   1.002686   0.977565
## num_medications      1.187464   0.852253   0.960283
## number_outpatient    1.139794   0.869580   0.990626
## number_emergency     1.289647   0.766241   0.944113
## number_inpatient     1.111326   0.847134   1.041540
## number_diagnoses     1.019887   0.990194   0.989918
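A sketch of how a table like the one above can be produced, reusing num_vars and km from the sketches above: per-cluster means of the raw numeric variables divided by their overall means, so that values above 1 mean "above average for that variable":

# Mean of each numeric variable within each cluster
cluster_means <- aggregate(diabetes[, num_vars],
                           by = list(cluster = km$cluster), FUN = mean)

# Normalize by the overall means (rows = variables, columns = clusters)
overall_means <- colMeans(diabetes[, num_vars])
round(t(cluster_means[, -1]) / overall_means, 6)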

From the above charts and table, we can make a few observations:

• Patients in the first cluster spent between a third and half a day longer in hospital

• Patients in the first cluster had about 5% more lab procedures than those in clusters two or three, and were, on average, using between 15 and 25% more medications.

• Patients in the first cluster had a record of more encounters (inpatient, emergency, and outpatient).

Recall from the principal component section the interpretation of the loadings of the first principal component in terms of the categorical variables: lower scores suggested that a patient did not really change medication or was not using any diabetes medications, while larger scores suggested a change in medication, use of at least one medication, use of more medications overall, or use of insulin/metformin. Thus, we can say that patients in cluster 1 tended to use medications, patients in cluster 2 tended not to, and patients in cluster 3 tended to fall somewhere in between. One interesting variable in the dataset is the diagnosis type, so it is interesting to see how the frequencies of each diagnosis type differ between the clusters. The figure below presents exactly this.

We then observe that:

• Circulatory diagnoses were the most frequent in all clusters

• Patients on fewer/no diabetes medications were readmitted much more frequently for digestive, respiratory and injury problems. Patients with many diabetes medications were admitted more frequently for respiratory and diabetes problems.

The distribution of diagnoses suggests that diabetes diagnoses were more prominent among patients who took more diabetes-related medications than among patients who took fewer or none at all, for whom digestive problems were more frequent.

3.2.2. Hierarchical clustering

In this part, we analyze the dataset using hierarchical clustering, which, like k-means, groups together data points with similar characteristics. We execute the code in Python to compute the dendrogram, shown in the following figure. When constructing the dendrogram, we decided to cut the tree at distance 160, which gives the 6 clusters shown below in 6 different colors.
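An equivalent sketch in R (the report used Python here, in part because the full distance matrix over 69598 observations is very large; the Ward linkage is our assumption, and the cut height of 160 is the one quoted above and depends on the scaling and linkage used):

d  <- dist(scale(diabetes_num))      # Euclidean distances
hc <- hclust(d, method = "ward.D2")  # Ward linkage

plot(hc, labels = FALSE)             # dendrogram
clusters <- cutree(hc, h = 160)      # cutting at height 160 gives 6 clusters
table(clusters)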

Below, we again plot the dataset on the first two principal components and color the points by the cluster they belong to. Unlike the K-means result, the 6 clusters are not well separated on the first two principal components. We only see that patients in cluster 2 (the green ones) lie predominantly on the positive side of the first principal component; this cluster represents patients who often change medication and are generally diagnosed with respiratory and diabetes problems.


4. Conclusion

This project was a great learning opportunity. For the cleaning process, I had to develop a strategy to clean and explore this large dataset, with its many variables, using R. For the modeling part, because the methods are computationally expensive, I decided to use Python, which made much better use of memory.

Using K-means clustering with k = 3, we have seen that the data cluster into three different groups that can be well explained by differences in the feature values. One group represents patients who take a lot of medications; these are the ones usually diagnosed with diabetes and respiratory problems. Another group does not change or does not take many medications; these patients were usually diagnosed with digestive problems. The last group falls between the first two and shares characteristics of both.

I also used hierarchical clustering, cutting the tree at a level that generates 6 clusters. These 6 clusters are not well separated on the first two principal components, and we obtained better insights from K-means than from hierarchical clustering. One limitation of our project is that we used only the first two principal components to interpret the differences between the clusters; further study could address this by including the third and fourth principal components.