Finding Patients with Diabetes

Tomás A. Maccor

29/1/2020

Exploratory data analysis & analytics on complex data:
Identifying risks to project data and to subject safety

This presentation will have 2 sections:

Past Experience

Reviewing oral corticosteroid (OCS) doses in an asthma phase III trial

B. Cooper & R. Patterson, “The Corticosteroid Graph”, J. Allergy Clin. Immunol. 1976, Vol 8, Nr.6

A typical OCS Dose Graph

How it was done in real life

A way to do this visually:

Self-developed Scenario

The question that we have asked ourselves:

The dataset contains healthy and diabetic female subjects (source: https://data.world/data-society/pima-indians-diabetes-database)

Cases: 768

Variables: 10

The 2 “label” variables (that indicate whether each subject has diabetes or not) were removed from the dataset.

First quick look at the dataset structure:

##  Pregnancies      Glucose        BloodPressure   SkinThickness  
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:768       FALSE:768       FALSE:768       FALSE:768      
##   Insulin           BMI             Age         
##  Mode :logical   Mode :logical   Mode :logical  
##  FALSE:768       FALSE:768       FALSE:768
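
Output of this shape (one logical column per variable, all FALSE) is what R prints when a missing-value check is summarised. The sketch below is an assumption about how such a check could be run, not the code used for the slides; the data frame `pima` and its values are hypothetical stand-ins, since the real dataset is not loaded here.

```r
# Hypothetical stand-in data frame (made-up values, not the real dataset).
pima <- data.frame(
  Glucose = c(101, 92, 170, 88),
  BMI     = c(30.1, 25.4, 36.9, 27.0),
  Age     = c(45, 29, 52, 23)
)

# is.na() returns a logical matrix; summary() tabulates TRUE/FALSE per column.
na_overview <- summary(is.na(pima))
print(na_overview)

# A more compact view: number of missing values per column.
na_counts <- colSums(is.na(pima))
print(na_counts)
```

A column with any `TRUE` entries in the summary would indicate values recorded as `NA`.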

We then perform a first visual review of the dataset, and a descriptive statistics review:

First insight:

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI             Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :81.00
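
The summary shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI. A zero is not a plausible measurement for a living subject for most of these, so one reasonable reading (an assumption, not something stated in the slides) is that 0 stands for "not recorded". The sketch below counts such zeros and recodes them as `NA`; the rows are hypothetical, and the choice of columns in `implausible_zero` is itself an assumption.

```r
# Hypothetical stand-in rows; column names follow the summary above.
pima <- data.frame(
  Pregnancies   = c(0, 2, 5, 1),
  Glucose       = c(110, 0, 150, 95),
  BloodPressure = c(70, 64, 0, 80),
  Insulin       = c(0, 0, 120, 90),
  BMI           = c(31.2, 0, 35.8, 27.4)
)

# Columns where 0 is assumed to mean "missing" (Pregnancies is excluded: 0 is valid).
implausible_zero <- c("Glucose", "BloodPressure", "Insulin", "BMI")

# Count zeros per column.
zero_counts <- colSums(pima[implausible_zero] == 0)
print(zero_counts)

# One option: recode those zeros as NA so they are treated as missing.
pima_clean <- pima
pima_clean[implausible_zero] <- lapply(
  pima_clean[implausible_zero],
  function(x) replace(x, x == 0, NA)
)
print(summary(pima_clean))
```

Whether to recode, impute or drop such rows is a separate decision; the point here is only to make the zeros visible before clustering.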

AGGREGATING/CLUSTERING THE PATIENTS

Since all variables are numerical, we will try to separate the healthy from the diabetic patients using numeric cluster analysis.

The k-means algorithm seems the best choice in this case, given that only 2 groups are expected (healthy vs. diabetic). Hierarchical clustering is better suited to a larger number of clusters.

Here, k-means results in a clearer visualization

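After reviewing the ranges of the dataset variables, the dataset was scaled, so that variables with wide ranges (e.g. Insulin, 0 to 846) do not dominate the distance calculation over variables with narrow ranges (e.g. Pregnancies, 0 to 17). This gives a more accurate clustering.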
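
A minimal sketch of the scaling step, assuming standardisation with base R's `scale()` (the slides do not say which scaling method was used). The two columns reuse variable names from the summary; the values are hypothetical.

```r
# Hypothetical values chosen to span very different ranges.
pima <- data.frame(
  Pregnancies = c(0, 1, 3, 6, 17),
  Insulin     = c(0, 30, 80, 127, 846)
)

# scale() centres each column on its mean and divides by its standard deviation.
pima_scaled <- scale(pima)

# After scaling, every column has mean 0 and standard deviation 1.
col_means <- colMeans(pima_scaled)
col_sds   <- apply(pima_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)
```

Because k-means relies on Euclidean distance, putting all columns on the same scale keeps any single variable from driving the cluster assignment.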

Three (3) k-means clustering runs were generated for the dataset and are plotted below:
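
The runs could be produced along the following lines. This is a sketch under stated assumptions: the data are synthetic (two well-separated groups drawn with `rnorm()`), the seeds are arbitrary, and base R's `kmeans()` with `centers = 2` stands in for whatever settings were used for the slides.

```r
# Synthetic stand-in data: two groups with different Glucose and BMI levels.
set.seed(1)
n <- 100
synthetic <- data.frame(
  Glucose = c(rnorm(n, mean = 100, sd = 10), rnorm(n, mean = 160, sd = 10)),
  BMI     = c(rnorm(n, mean = 26,  sd = 2),  rnorm(n, mean = 38,  sd = 2))
)
scaled <- scale(synthetic)

# Three independent runs, each starting from a different random initialisation.
runs <- lapply(1:3, function(seed) {
  set.seed(seed)
  kmeans(scaled, centers = 2)
})

# Compare cluster sizes and total within-cluster sum of squares across runs.
for (i in seq_along(runs)) {
  cat("Run", i,
      "sizes:", sort(runs[[i]]$size),
      "tot.withinss:", round(runs[[i]]$tot.withinss, 2), "\n")
}
```

Runs that agree on cluster sizes and within-cluster sum of squares suggest a stable solution. A call such as `plot(synthetic, col = runs[[1]]$cluster)` would colour the points by assigned cluster.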

Reviewing the soundness of the clustering obtained

The k-means algorithm separates patients with lower Glucose & BMI from those with higher values of both variables

Similarly with Blood Pressure & Skin Thickness:
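
Besides inspecting plots, a separation of this kind can be checked numerically by comparing the per-cluster means of the original (unscaled) variables. The sketch below uses synthetic Glucose and BMI values as an assumption; the same call would apply to Blood Pressure and Skin Thickness.

```r
# Synthetic stand-in data with a lower and a higher group.
set.seed(42)
n <- 100
synthetic <- data.frame(
  Glucose = c(rnorm(n, 100, 10), rnorm(n, 160, 10)),
  BMI     = c(rnorm(n, 26, 2),   rnorm(n, 38, 2))
)

# Cluster on the scaled data, then summarise on the original scale.
fit <- kmeans(scale(synthetic), centers = 2)

# Mean of each variable within each cluster.
cluster_means <- aggregate(synthetic,
                           by  = list(cluster = fit$cluster),
                           FUN = mean)
print(cluster_means)
```

If one cluster shows higher means on both variables, the numeric summary agrees with the visual impression; it does not by itself show that the clusters correspond to healthy and diabetic subjects.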

Worth further questioning & exploring

Further steps: ?

Reviewing safety signals