Classification Techniques for Diabetes

Objective

The objective is to predict whether a patient is diabetic or not from the measurements in the dataset.
The dataset consists of several predictor variables and one target variable (Outcome).

Exploratory Data Analysis

Exploratory data analysis was performed to gain insight into the data and derive key indicators (KPIs), and then to evaluate the performance of various machine learning algorithms.

# Name columns
colnames(diabetes)

## [1] "Pregnancies"   "Glucose"       "BloodPressure" "SkinThickness"
## [5] "Insulin"       "BMI"           "DPF"           "Age"          
## [9] "Outcome"

# Print description
summary(diabetes[,-9])

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI             DPF              Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00
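The summary shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI. A value of 0 is not plausible for measurements such as glucose or BMI, so these zeros probably stand for missing readings. How the report handled them is not stated; the following is only a minimal sketch of how such zeros could be counted and recoded, using a made-up data frame `toy` in place of `diabetes`.

```r
# Toy stand-in for `diabetes`; the values are invented for illustration only.
toy <- data.frame(
  Glucose = c(148, 85, 0, 89),
  BMI     = c(33.6, 0, 23.3, 28.1),
  Insulin = c(0, 0, 0, 94)
)

# Count the zero entries in each column.
zero_counts <- colSums(toy == 0)
print(zero_counts)

# Assumption: a zero in these columns means "not recorded", so recode it as NA.
toy_na <- toy
toy_na[toy_na == 0] <- NA
print(colSums(is.na(toy_na)))
```

Recoding to NA keeps the placeholder zeros from pulling down means and medians; whether to drop or impute the affected rows afterwards is a separate decision.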

Counts of Diabetes Outcome
Outcome	Status	Indicator	Count	Average Age
Healthy	Absence	0	500	31
Diabetic	Presence	1	268	37
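A table of this shape can be built with base R alone. The sketch below assumes a data frame with an `Outcome` indicator and an `Age` column, as in `diabetes`; the small data frame `toy` and its values are hypothetical.

```r
# Toy stand-in for `diabetes`; the values are invented for illustration only.
toy <- data.frame(
  Outcome = c(0, 0, 0, 1, 1),
  Age     = c(25, 31, 40, 35, 45)
)

# Number of patients per outcome and mean age per outcome.
counts   <- table(toy$Outcome)
mean_age <- tapply(toy$Age, toy$Outcome, mean)

summary_tbl <- data.frame(
  Indicator   = as.integer(names(counts)),
  Count       = as.vector(counts),
  Average_Age = as.vector(mean_age)
)
print(summary_tbl)
```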

## 'data.frame':    705 obs. of  6 variables:
##  $ Outcome    : int  1 0 1 0 0 1 0 1 0 1 ...
##  $ Glucose    : int  148 85 183 89 116 78 115 197 110 168 ...
##  $ BMI        : num  33.6 26.6 23.3 28.1 25.6 31 35.3 30.5 37.6 38 ...
##  $ Age        : int  50 31 32 21 30 26 29 53 30 34 ...
##  $ Pregnancies: int  6 1 8 1 5 3 10 2 4 10 ...
##  $ DPF        : num  0.627 0.351 0.672 0.167 0.201 0.248 0.134 0.158 0.191 0.537 ...

Comparison of Health Characteristics Between Healthy (H) and Diabetic (D)
Characteristic	Mean_H	Max_H	Mean_D	Max_D
Outcome	0.000000	0.000	1.000000	1.00
Glucose	109.980000	197.000	141.257463	199.00
BMI	30.304200	57.300	35.142537	67.10
Age	31.190000	81.000	37.067164	70.00
Pregnancies	3.298000	13.000	4.865672	17.00
DPF	0.429734	2.329	0.550500	2.42
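The per-group means and maxima above can be reproduced by splitting the data on `Outcome` and summarising each column. This is a sketch on a hypothetical data frame `toy`, not the code used for the report; `Outcome` itself is left out of the feature list because its row carries no information.

```r
# Toy stand-in for `diabetes`; the values are invented for illustration only.
toy <- data.frame(
  Outcome = c(0, 0, 1, 1),
  Glucose = c(90, 110, 140, 160),
  BMI     = c(25, 31, 33, 37)
)

# Every column except the target is treated as a characteristic.
features <- setdiff(names(toy), "Outcome")
healthy  <- toy[toy$Outcome == 0, features]
diabetic <- toy[toy$Outcome == 1, features]

# One row per characteristic, with mean and maximum for each group.
comparison <- data.frame(
  Characteristic = features,
  Mean_H = unname(sapply(healthy, mean)),
  Max_H  = unname(sapply(healthy, max)),
  Mean_D = unname(sapply(diabetic, mean)),
  Max_D  = unname(sapply(diabetic, max))
)
print(comparison)
```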

Performance Metrics for Various Models
Model	Accuracy	Precision	Recall	F1 Score
KNN	0.7045455	0.7642276	0.8034188	0.7833333
SVM	0.7840909	0.7883212	0.9230769	0.8503937
Decision Tree	0.7500000	0.8173913	0.8034188	0.8103448
Logistic Regression	0.7727273	0.7938931	0.8888889	0.8387097
LDA	0.7613636	0.7906977	0.8717949	0.8292683
Naive Bayes	0.7840909	0.8062016	0.8888889	0.8455285
Random Forest	0.7500000	0.8067227	0.8205128	0.8135593
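All four metrics follow from the confusion matrix of a model. The sketch below defines a hypothetical helper, `classification_metrics`, and applies it to one set of counts (TP = 94, FP = 29, FN = 23, TN = 30) that is consistent with the KNN row. These counts are inferred from the reported metrics and do not appear in the source; they would imply a test set of 176 cases, 117 of them in the positive class. Which class was treated as positive is not stated in the report.

```r
# Hypothetical helper: derive the four metrics from confusion-matrix counts.
classification_metrics <- function(tp, fp, fn, tn) {
  accuracy  <- (tp + tn) / (tp + fp + fn + tn)
  precision <- tp / (tp + fp)   # share of predicted positives that are correct
  recall    <- tp / (tp + fn)   # share of actual positives that are found
  f1        <- 2 * precision * recall / (precision + recall)
  c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1)
}

# Counts inferred from the KNN row (an assumption, not taken from the source).
knn <- classification_metrics(tp = 94, fp = 29, fn = 23, tn = 30)
print(round(knn, 7))
# Accuracy 0.7045455, Precision 0.7642276, Recall 0.8034188, F1 0.7833333
```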

Plots

Here are some plots that help in understanding the structure of the dataset and in evaluating the performance of the various ML techniques applied.

Conclusion

Since this is a medical condition, we focus on the accuracy of the model. We could have used ROC curves instead, but that would have complicated the comparison. From the model performance plot above and the last table, the Support Vector Machine (SVM) is the best classification model in our case, with an accuracy of 78.41%. Naive Bayes reaches the same accuracy, but SVM has the higher recall (0.92) and F1 score (0.85).
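For reference, the area under the ROC curve can be computed without extra packages. The sketch below uses the rank-based (Mann-Whitney) formulation: the AUC is the fraction of positive/negative pairs in which the positive case gets the higher score. The function name `auc_from_scores`, the labels and the scores are all hypothetical; no model scores from this report are used.

```r
# Hypothetical helper: AUC from class labels (1 = positive, 0 = negative)
# and predicted scores, via the rank-sum of the positive cases.
auc_from_scores <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  r   <- rank(c(pos, neg))                    # ties receive average ranks
  rank_sum_pos <- sum(r[seq_along(pos)])
  (rank_sum_pos - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
}

# Invented labels and scores for illustration only.
labels <- c(1, 1, 1, 0, 0, 0)
scores <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.2)

auc <- auc_from_scores(labels, scores)
print(auc)  # 8 of the 9 positive/negative pairs are ordered correctly
```

Unlike accuracy, this measure does not depend on a single classification threshold, which is why it is often reported alongside the metrics in the table.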

Classification Techniques for Diabetes

Afzal

2024-04-12
