Objective

Exploratory Data Analysis

Data analysis was performed to extract insights and key performance indicators (KPIs) from the data and to evaluate the performance of several machine learning algorithms.

# Name columns
colnames(diabetes)
## [1] "Pregnancies"   "Glucose"       "BloodPressure" "SkinThickness"
## [5] "Insulin"       "BMI"           "DPF"           "Age"          
## [9] "Outcome"
# Print description
summary(diabetes[,-9])
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI             DPF              Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00

Counts of Diabetes Outcome

Outcome   Status    Indicator  Count  Average Age
Healthy   Absence   0          500    31
Diabetic  Presence  1          268    37
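
A summary of this kind can be produced with base R's `aggregate`. The sketch below is a minimal illustration, not the original code: it uses only the first ten rows visible in the `str()` output shown just below, and the names `sample_rows` and `outcome_summary` are hypothetical. Its counts and averages therefore do not match the full-data table.

```r
# First ten rows as printed in the str() output of this report.
# Illustrative sample only, not the full dataset.
sample_rows <- data.frame(
  Outcome = c(1, 0, 1, 0, 0, 1, 0, 1, 0, 1),
  Age     = c(50, 31, 32, 21, 30, 26, 29, 53, 30, 34)
)

# Row count and mean age per outcome (0 = healthy, 1 = diabetic).
# aggregate() returns groups sorted by Outcome, so 0 comes first.
counts  <- aggregate(Age ~ Outcome, data = sample_rows, FUN = length)
avg_age <- aggregate(Age ~ Outcome, data = sample_rows, FUN = mean)

outcome_summary <- data.frame(
  Outcome    = c("Healthy", "Diabetic"),
  Indicator  = counts$Outcome,
  Count      = counts$Age,
  AverageAge = round(avg_age$Age)
)
print(outcome_summary)
```

Applied to the full `diabetes` data frame instead of the sample, the same two `aggregate` calls would be expected to yield the counts and average ages reported in the table.
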
## 'data.frame':    705 obs. of  6 variables:
##  $ Outcome    : int  1 0 1 0 0 1 0 1 0 1 ...
##  $ Glucose    : int  148 85 183 89 116 78 115 197 110 168 ...
##  $ BMI        : num  33.6 26.6 23.3 28.1 25.6 31 35.3 30.5 37.6 38 ...
##  $ Age        : int  50 31 32 21 30 26 29 53 30 34 ...
##  $ Pregnancies: int  6 1 8 1 5 3 10 2 4 10 ...
##  $ DPF        : num  0.627 0.351 0.672 0.167 0.201 0.248 0.134 0.158 0.191 0.537 ...

Comparison of Health Characteristics Between Healthy (H) and Diabetic (D)

Characteristic  Mean_H      Max_H    Mean_D      Max_D
Outcome           0.000000    0.000    1.000000    1.00
Glucose         109.980000  197.000  141.257463  199.00
BMI              30.304200   57.300   35.142537   67.10
Age              31.190000   81.000   37.067164   70.00
Pregnancies       3.298000   13.000    4.865672   17.00
DPF               0.429734    2.329    0.550500    2.42
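
The per-group means and maxima can be computed by splitting the data on `Outcome` and applying `mean` and `max` column by column. The following is a hedged sketch using only the first ten rows printed in the `str()` output earlier in this report (so the numbers differ from the table); `sample_rows`, `healthy`, `diabetic`, and `comparison` are hypothetical names.

```r
# First ten rows as printed in the str() output; illustrative sample only.
sample_rows <- data.frame(
  Outcome = c(1, 0, 1, 0, 0, 1, 0, 1, 0, 1),
  Glucose = c(148, 85, 183, 89, 116, 78, 115, 197, 110, 168),
  BMI     = c(33.6, 26.6, 23.3, 28.1, 25.6, 31, 35.3, 30.5, 37.6, 38)
)

# Split into healthy (0) and diabetic (1) subsets, keeping the features.
healthy  <- sample_rows[sample_rows$Outcome == 0, c("Glucose", "BMI")]
diabetic <- sample_rows[sample_rows$Outcome == 1, c("Glucose", "BMI")]

# One row per characteristic, one column per group statistic.
comparison <- data.frame(
  Characteristic = c("Glucose", "BMI"),
  Mean_H = unname(sapply(healthy, mean)),
  Max_H  = unname(sapply(healthy, max)),
  Mean_D = unname(sapply(diabetic, mean)),
  Max_D  = unname(sapply(diabetic, max))
)
print(comparison)
```
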

Performance Metrics for Various Models

Model                Accuracy   Precision  Recall     F1 Score
KNN                  0.7045455  0.7642276  0.8034188  0.7833333
SVM                  0.7840909  0.7883212  0.9230769  0.8503937
Decision Tree        0.7500000  0.8173913  0.8034188  0.8103448
Logistic Regression  0.7727273  0.7938931  0.8888889  0.8387097
LDA                  0.7613636  0.7906977  0.8717949  0.8292683
Naive Bayes          0.7840909  0.8062016  0.8888889  0.8455285
Random Forest        0.7500000  0.8067227  0.8205128  0.8135593
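
For reference, the four metrics derive from a confusion matrix as shown in the sketch below. The report does not print the confusion matrices, so the counts used here are an assumption: they were inferred from the SVM row, whose fractions are consistent with a 176-row test set containing 117 positive-class cases.

```r
# Confusion-matrix counts inferred from the reported SVM metrics.
# These are a reconstruction (assumption), not values printed in the report.
tp <- 108  # positive-class cases predicted as positive
fn <- 9    # positive-class cases predicted as negative
fp <- 29   # negative-class cases predicted as positive
tn <- 30   # negative-class cases predicted as negative

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

metrics <- round(
  c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1),
  7
)
print(metrics)
```

With these counts the computed values reproduce the SVM row of the table (0.7840909, 0.7883212, 0.9230769, 0.8503937).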

Plots

The following plots help in understanding the structure of the dataset and in evaluating the performance of the various ML techniques applied.

Conclusion

Because this is a medical diagnosis problem, we use the accuracy of the model as the primary criterion. An ROC curve analysis was also an option, but it would have complicated the comparison. From the model performance plot above and the performance metrics table, SVM and Naive Bayes tie for the highest accuracy at 78.41%, and SVM also has the highest recall (0.9231) and F1 score (0.8504). It’s therefore clear that the Support Vector Machine (SVM) is the best classification model in our case.
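
The model selection can be made explicit by ranking the reported metrics. The sketch below copies the Accuracy and F1 Score columns from the performance table and sorts by accuracy; using F1 score as the tie-breaker is an assumption of this sketch, and `model_metrics` and `ranked` are hypothetical names.

```r
# Accuracy and F1 score copied from the performance metrics table.
model_metrics <- data.frame(
  Model = c("KNN", "SVM", "Decision Tree", "Logistic Regression",
            "LDA", "Naive Bayes", "Random Forest"),
  Accuracy = c(0.7045455, 0.7840909, 0.7500000, 0.7727273,
               0.7613636, 0.7840909, 0.7500000),
  F1 = c(0.7833333, 0.8503937, 0.8103448, 0.8387097,
         0.8292683, 0.8455285, 0.8135593)
)

# Sort by accuracy (descending); break ties with F1 (descending).
ranked <- model_metrics[order(-model_metrics$Accuracy, -model_metrics$F1), ]
rownames(ranked) <- NULL
print(ranked)
```

SVM and Naive Bayes share the top accuracy, and the F1 tie-breaker places SVM first.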