Tomás A. Maccor
29/1/2020
This presentation will have 2 sections:
B. Cooper & R. Patterson, “The Corticosteroid Graph”, J. Allergy Clin. Immunol. 1976, Vol 8, Nr.6
A typical OCS Dose Graph
The question that we have asked ourselves:
The dataset contains healthy and diabetic female subjects (source: https://data.world/data-society/pima-indians-diabetes-database)
Cases: 768
Variables: 10
The 2 “label” variables (that indicate whether each subject has diabetes or not) were removed from the dataset.
##  Pregnancies     Glucose         BloodPressure   SkinThickness
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical
##  FALSE:768       FALSE:768       FALSE:768       FALSE:768
##  Insulin         BMI             Age
##  Mode :logical   Mode :logical   Mode :logical
##  FALSE:768       FALSE:768       FALSE:768
We then perform a first visual review of the dataset, and a descriptive statistics review:
First insight:
##   Pregnancies        Glucose       BloodPressure    SkinThickness
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
##     Insulin            BMI             Age
##  Min.   :  0.0   Min.   : 0.00   Min.   :21.00
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:24.00
##  Median : 30.5   Median :32.00   Median :29.00
##  Mean   : 79.8   Mean   :31.99   Mean   :33.24
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:41.00
##  Max.   :846.0   Max.   :67.10   Max.   :81.00
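A per-column summary like the one above can be reproduced with a few lines of code. The following is a minimal sketch in Python (standard library only), not the code used to produce the output above; the helper name `six_number_summary` and the sample ages are invented for illustration.

```python
from statistics import mean, quantiles

def six_number_summary(values):
    """Return min, quartiles, mean and max for one numeric column."""
    # method="inclusive" treats the data as the full population of interest.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
    }

# Invented ages, for illustration only (not taken from the dataset).
ages = [21, 25, 30, 42, 60]
summary = six_number_summary(ages)
print(summary)
```

A minimum of 0 in a column where 0 is not a plausible measurement is the kind of anomaly this summary is meant to surface.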
BMI & Blood Pressure have several incorrect records/datapoints: their values cannot be “0”
Data imputation is not recommended in this case (not realistic/convenient)
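One option that avoids imputation is to drop the affected records. The sketch below is in Python and assumes each record is a dictionary keyed by the column names shown in the summary; the helper name `drop_zero_rows` and the sample rows are hypothetical, and this is not necessarily how the records were handled in this analysis.

```python
# Columns in which a recorded value of 0 cannot be a real measurement.
INVALID_IF_ZERO = ("BMI", "BloodPressure")

def drop_zero_rows(rows, columns=INVALID_IF_ZERO):
    """Return only the rows in which none of `columns` is recorded as 0."""
    return [row for row in rows if all(row[col] != 0 for col in columns)]

# Invented rows, for illustration only (not taken from the dataset).
sample = [
    {"Glucose": 148, "BloodPressure": 72, "BMI": 33.6},
    {"Glucose": 115, "BloodPressure": 0, "BMI": 35.3},
    {"Glucose": 125, "BloodPressure": 96, "BMI": 0.0},
]

clean = drop_zero_rows(sample)
print(len(sample) - len(clean), "rows dropped;", len(clean), "kept")
# -> 2 rows dropped; 1 kept
```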
Since all variables are numerical, we will try to separate the healthy from the diabetic patients using numeric cluster analysis.
The k-means algorithm seems the best choice in this case, given that only 2 groups are expected (healthy vs. diabetic). Hierarchical clustering is better suited to a larger number of clusters.
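To make the idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm) with k = 2, written in plain Python. It is an illustration under stated assumptions, not the implementation used for this presentation: the first k points seed the centroids so that the result is deterministic, whereas library implementations typically use random starts, and the (Glucose, BMI) pairs are invented.

```python
def kmeans(points, k=2, max_iterations=100):
    """Plain Lloyd's algorithm on a list of numeric tuples.

    Returns (centroids, labels). The first k points seed the centroids,
    an assumption made here only to keep the sketch deterministic.
    """
    centroids = [tuple(float(v) for v in p) for p in points[:k]]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, labels

# Invented (Glucose, BMI) pairs, for illustration only.
points = [(90, 22), (95, 24), (100, 23), (160, 35), (170, 38), (165, 36)]
centroids, labels = kmeans(points, k=2)
print(labels)  # -> [0, 0, 0, 1, 1, 1]
```

With well-separated groups the two clusters are recovered after a few iterations; on real data the result can depend on the starting centroids, which is one reason to compare several runs.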
Here, k-means results in a clearer visualization
After reviewing the ranges of the dataset variables, scaling of the dataset was performed. Scaling keeps variables with large ranges (such as Insulin) from dominating the distance calculation, which gives a more accurate clustering.
Three (3) k-means clustering runs are generated for the dataset and plotted below:
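The scaling step can be sketched as z-score standardization: subtract each column's mean and divide by its sample standard deviation. This is a minimal Python sketch assuming that form of scaling; the exact scaling used for the plots is not stated here, and the helper name `scale_columns` and the (Insulin, BMI) values are invented.

```python
from statistics import mean, stdev

def scale_columns(rows):
    """Standardize each column to mean 0 and sample standard deviation 1.

    Assumes every column has at least two distinct values; a constant
    column would cause a division by zero.
    """
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]
    return [
        tuple((value - c) / s for value, c, s in zip(row, centers, spreads))
        for row in rows
    ]

# Invented (Insulin, BMI) pairs: Insulin spans a far wider range than BMI,
# so unscaled distances would be driven almost entirely by Insulin.
raw = [(0, 22.0), (100, 30.0), (800, 38.0)]
scaled = scale_columns(raw)
for row in scaled:
    print(tuple(round(v, 3) for v in row))
```

After scaling, both columns contribute on a comparable footing to the Euclidean distances that k-means relies on.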
The k-means algorithm separates patients with lower Glucose & BMI from those with higher values of both variables
Similarly with Blood Pressure & Skin Thickness:
Worth further questioning & exploring
Further steps: ?