Data analysis was performed to extract insights and KPIs from the data and to evaluate the performance of various machine learning algorithms.
# Name columns
colnames(diabetes)
## [1] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness"
## [5] "Insulin" "BMI" "DPF" "Age"
## [9] "Outcome"
# Print description
summary(diabetes[,-9])
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DPF Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
| Outcome | Status | Indicator | Count | Average Age |
|---|---|---|---|---|
| Healthy | Absence | 0 | 500 | 31 |
| Diabetic | Presence | 1 | 268 | 37 |
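The counts and average ages in the table above can be obtained with a grouped summary. The following is a minimal sketch, assuming a data frame with `Outcome` and `Age` columns as named earlier. The `toy` data frame and its values are made up for illustration and are not the diabetes data.

```r
# Hypothetical toy data standing in for the `diabetes` data frame;
# only the column names Outcome and Age are taken from the text.
toy <- data.frame(
  Outcome = c(0, 0, 0, 1, 1),
  Age     = c(25, 30, 35, 40, 50)
)

# Number of rows per outcome class
counts <- table(toy$Outcome)

# Mean age per outcome class
avg_age <- tapply(toy$Age, toy$Outcome, mean)

# Combine both summaries into one small table
by_outcome <- data.frame(
  Indicator  = as.integer(names(counts)),
  Count      = as.vector(counts),
  AverageAge = as.vector(avg_age)
)
print(by_outcome)
```

Applied to the real data, the same pattern would use `diabetes` in place of `toy`.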
## 'data.frame': 705 obs. of 6 variables:
## $ Outcome : int 1 0 1 0 0 1 0 1 0 1 ...
## $ Glucose : int 148 85 183 89 116 78 115 197 110 168 ...
## $ BMI : num 33.6 26.6 23.3 28.1 25.6 31 35.3 30.5 37.6 38 ...
## $ Age : int 50 31 32 21 30 26 29 53 30 34 ...
## $ Pregnancies: int 6 1 8 1 5 3 10 2 4 10 ...
## $ DPF : num 0.627 0.351 0.672 0.167 0.201 0.248 0.134 0.158 0.191 0.537 ...
| Characteristic | Mean_H | Max_H | Mean_D | Max_D |
|---|---|---|---|---|
| Outcome | 0.000000 | 0.000 | 1.000000 | 1.00 |
| Glucose | 109.980000 | 197.000 | 141.257463 | 199.00 |
| BMI | 30.304200 | 57.300 | 35.142537 | 67.10 |
| Age | 31.190000 | 81.000 | 37.067164 | 70.00 |
| Pregnancies | 3.298000 | 13.000 | 4.865672 | 17.00 |
| DPF | 0.429734 | 2.329 | 0.550500 | 2.42 |
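A per-class table like the one above can be built by splitting the data on `Outcome` and summarising each column. This is only a sketch under assumptions: the `toy` data frame and its values are invented, and the source does not show how the original table was produced.

```r
# Hypothetical toy data; column names follow the text, values are made up.
toy <- data.frame(
  Outcome = c(0, 0, 1, 1),
  Glucose = c(90, 110, 140, 160),
  BMI     = c(25, 31, 33, 37)
)

# Split into the two outcome classes
healthy  <- toy[toy$Outcome == 0, ]
diabetic <- toy[toy$Outcome == 1, ]

# Mean and maximum of every column, separately for each class
class_summary <- data.frame(
  Characteristic = names(toy),
  Mean_H = unname(sapply(healthy, mean)),
  Max_H  = unname(sapply(healthy, max)),
  Mean_D = unname(sapply(diabetic, mean)),
  Max_D  = unname(sapply(diabetic, max))
)
print(class_summary)
```

As in the table above, the `Outcome` row is constant within each class (0 for healthy, 1 for diabetic).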
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| KNN | 0.7045455 | 0.7642276 | 0.8034188 | 0.7833333 |
| SVM | 0.7840909 | 0.7883212 | 0.9230769 | 0.8503937 |
| Decision Tree | 0.7500000 | 0.8173913 | 0.8034188 | 0.8103448 |
| Logistic Regression | 0.7727273 | 0.7938931 | 0.8888889 | 0.8387097 |
| LDA | 0.7613636 | 0.7906977 | 0.8717949 | 0.8292683 |
| Naive Bayes | 0.7840909 | 0.8062016 | 0.8888889 | 0.8455285 |
| Random Forest | 0.7500000 | 0.8067227 | 0.8205128 | 0.8135593 |
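To show how the four metrics relate, the sketch below computes them from confusion-matrix counts. The counts are an assumption: they are back-calculated so that they reproduce the KNN row and are not reported in the text, which also does not state which class was treated as positive.

```r
# Confusion-matrix counts back-calculated from the KNN row (an assumption,
# not reported in the text): 176 test cases, 117 in the positive class.
tp <- 94  # true positives
fp <- 29  # false positives
fn <- 23  # false negatives
tn <- 30  # true negatives

accuracy  <- (tp + tn) / (tp + fp + fn + tn)  # share of correct predictions
precision <- tp / (tp + fp)                   # correct among predicted positives
recall    <- tp / (tp + fn)                   # found among actual positives
f1        <- 2 * precision * recall / (precision + recall)

metrics <- round(c(Accuracy = accuracy, Precision = precision,
                   Recall = recall, F1 = f1), 7)
print(metrics)
# Accuracy 0.7045455, Precision 0.7642276, Recall 0.8034188, F1 0.7833333
```

Accuracy counts both classes equally, while precision and recall look only at the positive class, which is why two models can tie on accuracy and still differ on F1.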
Here are some plots that help in understanding the structure of the dataset and in evaluating the performance of the various ML techniques applied.
Because this is a medical diagnosis problem, we focus on the accuracy of the model. An ROC curve would have been an alternative, but it would have complicated the comparison. From the model performance plot above and the last table, Support Vector Machines (SVM) are the best classification model in our case, with an accuracy of 78.41%. Naive Bayes reaches the same accuracy, but SVM has the higher recall (0.923) and F1 score (0.850).