Diabetes Data Modelling

Objective

My goal for this project is to solve a business problem (diabetes detection) using classification in data mining. The ultimate goal of this study is to accurately predict the target class (Outcome: yes or no) for each case in the data set.

Before moving forward, I would like to briefly introduce myself. I am a Data Analyst with experience spanning academia and industry, focused on extracting value from data and making complex concepts broadly accessible. I love turning data into gold. By enabling data-driven decision making, I want to make an impact by adding value to my clients' businesses. Currently, I am a student pursuing an MS in Business Analytics at CSU, East Bay.

Introduction

The goal for this project is to predict (classify) diabetes in individual women based on several medical conditions and clinical tests. Diabetes is a disease in which the body’s ability to produce or respond to the hormone insulin is impaired, resulting in abnormal metabolism of carbohydrates and elevated levels of glucose in the blood and urine. Glucose comes from the foods you eat. Insulin is a hormone that helps the glucose get into your cells to give them energy. Over time, having too much glucose in your blood can cause serious problems. It can damage your eyes, kidneys, and nerves. Diabetes can also cause heart disease, stroke and even the need to remove a limb. Pregnant women can also get diabetes, called gestational diabetes. The models used in this data mining project predict (classify) diabetes in individual women. Such modelling could help the medical industry diagnose diabetes with less effort and more reliability.

Motivation of Prediction Modelling

According to the U.S. Centers for Disease Control and Prevention, about 29.1 million Americans, nearly a tenth of the U.S. population, have diabetes, as do over 200 million people worldwide.



Medical practitioners generate data containing a wealth of hidden information, but it is not being used effectively for predictions. Such otherwise unused data can be collected into datasets and modelled using different data mining techniques. These striking facts and the available data mining techniques motivate me to develop a model that classifies diabetes patients so that they can be treated in time.

Related Work

Before starting the advanced data mining work, I consulted several related works. Links to these references are listed here:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3262726/
http://www.sciencedirect.com/science/article/pii/S2001037016300733
https://www.kaggle.com/uciml/pima-indians-diabetes-database

Problem Definition

In my project, I used the Diabetes dataset from Kaggle. I ran several data cleaning techniques on the dataset, which showed that it had already been cleaned at the source. The data has 8 defining attributes and 1 outcome (diabetes YES or NO). I split the data into a 75% training set and a 25% testing set using stratified sampling, and ensured that the class ratio remained intact in the split. Using several algorithms (C5.0, SVM, Neural Network, ensembling, voting and stacking), I identified the algorithm that works best with the available dataset. I used 10-fold cross-validation for all the algorithms.

Outcome   Dataset    Proportion
no        Training   0.6510417
yes       Training   0.3489583
no        Original   0.6510417
yes       Original   0.3489583


Fig : Full data to stratified training data ratio, showing the same YES/NO ratio after stratified sampling
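
The stratified 75/25 split described above can be sketched in a few lines. This is only an illustration in Python, not the code used in the project. The helper names (stratified_split, class_proportions) are hypothetical, and the class counts of 500 "no" and 268 "yes" cases are an assumption chosen to be consistent with the proportions in the table above.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_fraction=0.75, seed=42):
    """Split indices so that each class contributes train_fraction of its cases to training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within each class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


def class_proportions(labels):
    """Return the share of each class label in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Assumed class counts (500 "no", 268 "yes"), matching the proportions reported above.
labels = ["no"] * 500 + ["yes"] * 268
train_idx, test_idx = stratified_split(labels)
train_labels = [labels[i] for i in train_idx]

print("Original:", class_proportions(labels))
print("Training:", class_proportions(train_labels))
```

Because the split is done separately inside each class, the training set keeps the same YES/NO proportions as the full data, which is the property the table and figure above check.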

Dataset Information

I took this data from Kaggle; it originally comes from the UCI Machine Learning Repository (archive.ics.uci.edu).

Attribute / Variable      Description                                                                Data type
Pregnancies               Number of times pregnant                                                   Integer
Glucose                   Plasma glucose concentration at 2 hours in an oral glucose tolerance test  Integer
BloodPressure             Diastolic blood pressure (mm Hg)                                           Integer
SkinThickness             Triceps skin fold thickness (mm)                                           Integer
Insulin                   2-Hour serum insulin (mu U/ml)                                             Integer
BMI                       Body mass index (weight in kg/(height in m)^2)                             Number
DiabetesPedigreeFunction  Diabetes pedigree function                                                 Number
Age                       Age (years)                                                                Integer
Outcome                   Class variable (0 or 1)                                                    Factor, nominal

https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data
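
As a rough illustration of how the attribute table maps onto the raw file, the sketch below parses comma-separated rows into typed records. It is written in Python for illustration and is not the project's own code. It assumes the raw file has no header row and lists the nine columns in the order of the table above, and that Outcome 1 means "yes". The two sample rows are hypothetical values, not records from the dataset, and parse_rows is a hypothetical helper name.

```python
import csv
import io

# Column names and types, in the order given in the attribute table above.
COLUMNS = [
    ("Pregnancies", int),
    ("Glucose", int),
    ("BloodPressure", int),
    ("SkinThickness", int),
    ("Insulin", int),
    ("BMI", float),
    ("DiabetesPedigreeFunction", float),
    ("Age", int),
    ("Outcome", int),
]


def parse_rows(text):
    """Parse header-less CSV text into a list of dicts with typed values."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        if len(record) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(record)}")
        row = {name: cast(value) for (name, cast), value in zip(COLUMNS, record)}
        # Assumption: 1 encodes diabetes "yes", 0 encodes "no".
        row["Outcome"] = "yes" if row["Outcome"] == 1 else "no"
        rows.append(row)
    return rows


# Hypothetical sample rows in the assumed layout (not real records).
SAMPLE = "2,120,70,30,80,28.5,0.45,35,1\n0,95,64,20,0,22.1,0.2,24,0\n"

rows = parse_rows(SAMPLE)
for row in rows:
    print(row["Glucose"], row["BMI"], row["Outcome"])
```

Converting Outcome to a yes/no label at load time mirrors its role as a nominal factor rather than a number.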


Methods (Algorithms)


Results

We ran different algorithms with varied parameter (PCA) configurations and found that C5.0 with PCA gives the highest accuracy (0.760) and ties for the highest sensitivity (0.582), the measure that accounts for false negatives, so C5.0 with PCA is the best among the tested algorithms. SVM and RF (from the ensemble model) come close in accuracy, but their sensitivity is lower, and sensitivity matters a lot in a diabetes study: we want to minimize false negatives as much as possible. We therefore consider C5.0 with PCA the best choice for the given dataset.


Algorithm                     Accuracy      Sensitivity (Recall)  Specificity  Precision
Decision Tree C 5.0           0.7291667     0.5820896             0.8080000    0.6190476
Decision Tree C 5.0 with PCA  0.7604166667  0.5820896             0.8560000    0.6842105
Naïve Bayes                   0.71875000    0.5223881             0.8240000    0.6140351
Neural Network                0.72395833    0.5820896             0.8000000    0.6093750
SVM (SMO)                     0.755208333   0.507462              0.98774215   0.65538462
RF                            0.72395833    0.5223881             0.8320000    0.6250000
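
The selection reasoning can be made explicit with a small sketch. It ranks the algorithms by sensitivity first and uses accuracy to break ties; this ordering rule is my reading of the argument above and is an assumption, not a procedure taken from the project code. The accuracy and sensitivity values are copied from the table, and rank_models is a hypothetical helper name.

```python
# Accuracy and sensitivity copied from the results table above.
results = {
    "Decision Tree C 5.0": {"accuracy": 0.7291667, "sensitivity": 0.5820896},
    "Decision Tree C 5.0 with PCA": {"accuracy": 0.7604166667, "sensitivity": 0.5820896},
    "Naïve Bayes": {"accuracy": 0.71875000, "sensitivity": 0.5223881},
    "Neural Network": {"accuracy": 0.72395833, "sensitivity": 0.5820896},
    "SVM (SMO)": {"accuracy": 0.755208333, "sensitivity": 0.507462},
    "RF": {"accuracy": 0.72395833, "sensitivity": 0.5223881},
}


def rank_models(results):
    """Order models by sensitivity, then accuracy, both descending."""
    return sorted(
        results,
        key=lambda name: (results[name]["sensitivity"], results[name]["accuracy"]),
        reverse=True,
    )


ranking = rank_models(results)
for position, name in enumerate(ranking, start=1):
    scores = results[name]
    print(position, name, scores["sensitivity"], scores["accuracy"])
```

Under this rule three models tie on sensitivity, and accuracy then separates them, which puts C5.0 with PCA first and SVM last despite its high accuracy.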

Fig : Decision Tree C 5.0 with PCA

Prediction Vs Actual

(Showing True Positive, True Negative, False Positive & False Negative)

Prediction / Actual   no    yes
no                    112   31
yes                   13    36

Prediction / Actual   no    yes
no                    110   32
yes                   15    35

Prediction / Actual   no    yes
no                    108   25
yes                   17    42

Prediction / Actual   no    yes
no                    108   48
yes                   17    19

Prediction / Actual   no    yes
no                    109   28
yes                   16    39
Note: Predicted values are listed row-wise and actual values are listed column-wise.
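
To show how such a matrix turns into the reported measures, here is a minimal sketch in Python that reads one matrix in the layout noted above (rows are predictions, columns are actual values) and derives accuracy, sensitivity, specificity and precision. It treats "yes" as the positive class, which is an assumption, and uses the counts of the first matrix above; metrics_from_matrix is a hypothetical helper name.

```python
def metrics_from_matrix(matrix, positive="yes", negative="no"):
    """Compute metrics from a matrix indexed as matrix[predicted][actual]."""
    tp = matrix[positive][positive]  # predicted yes, actually yes
    fp = matrix[positive][negative]  # predicted yes, actually no
    fn = matrix[negative][positive]  # predicted no, actually yes
    tn = matrix[negative][negative]  # predicted no, actually no
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall: share of actual yes cases found
        "specificity": tn / (tn + fp),  # share of actual no cases found
        "precision": tp / (tp + fp),    # share of yes predictions that are correct
        "false_negatives": fn,
    }


# Counts from the first matrix above: rows are predictions, columns are actual values.
first_matrix = {
    "no": {"no": 112, "yes": 31},
    "yes": {"no": 13, "yes": 36},
}

metrics = metrics_from_matrix(first_matrix)
for name, value in metrics.items():
    print(name, value)
```

The false negative count is returned alongside the ratios because missed diabetes cases are the error this study most wants to minimize.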

Analysis and Recommendation

Comparison With published existing research on Kaggle

Project Published on RPubs

http://rpubs.com/Devansh/Diabetes

References