Pima Indian Diabetes Dataset

The data set PimaIndiansDiabetes2 contains a corrected version of the original data set. While the UCI repository index claims that there are no missing values, closer inspection of the data shows several physical impossibilities, e.g., blood pressure or body mass index of 0. In PimaIndiansDiabetes2, all zero values of glucose, pressure, triceps, insulin and mass have been set to NA; see also Wahba et al. (1995) and Ripley (1996).

Source

Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
Donor of database: Vincent Sigillito

These data have been taken from the UCI Repository Of Machine Learning Databases at

ftp://ftp.ics.uci.edu/pub/machine-learning-databases

http://www.ics.uci.edu/~mlearn/MLRepository.html

and were converted to R format by Friedrich Leisch.

Many algorithms for predicting diabetes

Because data quality and structure vary, no single technique gives the highest accuracy for all diseases: a classifier that performs best on one dataset may be outperformed by another method on a different disease. This study focuses on a common classifier for diabetes disease (DD) classification and prediction, with a little fine tuning, to improve the accuracy rate.

Previous work _ random sample only

  1. Diagnosis of Diabetes Mellitus using K Nearest Neighbor Algorithm. 2014
    Krati Saxena, Dr. Zubair Khan, Shefali Singh
    Department of Computer Science Engineering, Invertis University, Bareilly, India
    ** Method= knn. Accuracy rate = 70%
    http://www.ijcstjournal.org/volume-2/issue-4/IJCST-V2I4P6.pdf

  2. A Smart Clinical Decision Support System to Predict diabetes Disease Using Classification Techniques. 2018
    K. Lakshmi, D. Iyajaz Ahmed, G. Siva Kumar
    Department of Computer Science and Engineering, G. Pullaiah College of Engineering and Technology, Kurnool, India
    ** Method= knn. Accuracy rate = 95%
    https://www.academia.edu/37070070/A_Smart_Clinical_Decision_Support_System_to_Predict_diabetes_Disease_Using_Classification_Techniques

  3. A comparative analysis of data mining techniques for prediction of postprandial blood glucose. 2018
    Huan-Cheng Chang, Pin-Hsiang Chang, Sung-Chin Tseng, Chi-Chang Chang, Yen-Chiao Lu
    Dept. of Healthcare Management, Yuanpei University of Medical Technology, Hsinchu, Taiwan
    School of Medical Informatics, Chung-Shan Medical University/Hospital, Taichung, Taiwan
    ** Method= RF. Accuracy rate = 82.68%
    ** Method= C4.5 Accuracy rate = 76.56%
    https://www.econstor.eu/bitstream/10419/176839/1/full-09.pdf

*** Overall:

At 95%, the second study reports one of the highest accuracy rates for predicting diabetes published in an accredited professional publication.

** The goal of this project is to improve the prediction outcome.

Preliminary calculations appear to show that my proposed approach significantly outperforms current approaches.


Dataset : Pima Indians Diabetes Survey

Checking for NA values in observations

## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
##  $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
##  $ pressure: num  72 66 64 66 40 74 50 NA 70 96 ...
##  $ triceps : num  35 29 NA 23 35 NA 32 NA 45 NA ...
##  $ insulin : num  NA NA NA 94 168 NA 88 NA 543 NA ...
##  $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
##  $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
##  $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35      NA 33.6    0.627  50      pos
## 2        1      85       66      29      NA 26.6    0.351  31      neg
## 3        8     183       64      NA      NA 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
## 5        0     137       40      35     168 43.1    2.288  33      pos
## 6        5     116       74      NA      NA 25.6    0.201  30      neg
## Number of missing value: 652

##    feature num_missing pct_missing
## 1 pregnant           0 0.000000000
## 2  glucose           5 0.006510417
## 3 pressure          35 0.045572917
## 4  triceps         227 0.295572917
## 5  insulin         374 0.486979167
## 6     mass          11 0.014322917
## 7 pedigree           0 0.000000000
## 8      age           0 0.000000000
## 9 diabetes           0 0.000000000
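
The per-column tally above can be sketched in a few lines. The example below is in Python for illustration (the report itself was produced in R), uses only the six rows printed above with None standing in for NA, and the helper name `missing_summary` is hypothetical. It is not the code that generated the table.

```python
# Six head rows copied from the printout above; None stands in for R's NA.
COLUMNS = ["pregnant", "glucose", "pressure", "triceps",
           "insulin", "mass", "pedigree", "age", "diabetes"]
ROWS = [
    (6, 148, 72, 35, None, 33.6, 0.627, 50, "pos"),
    (1, 85, 66, 29, None, 26.6, 0.351, 31, "neg"),
    (8, 183, 64, None, None, 23.3, 0.672, 32, "pos"),
    (1, 89, 66, 23, 94, 28.1, 0.167, 21, "neg"),
    (0, 137, 40, 35, 168, 43.1, 2.288, 33, "pos"),
    (5, 116, 74, None, None, 25.6, 0.201, 30, "neg"),
]


def missing_summary(rows, columns):
    """Return {column: (num_missing, fraction_missing)} for a list of row tuples."""
    n = len(rows)
    summary = {}
    for j, name in enumerate(columns):
        num_missing = sum(1 for row in rows if row[j] is None)
        summary[name] = (num_missing, num_missing / n)
    return summary


summary = missing_summary(ROWS, COLUMNS)
for name, (count, fraction) in summary.items():
    print(f"{name:9s} {count:2d} {fraction:.3f}")
```

Note that the pct_missing column in the table is a fraction rather than a percentage: for insulin, 374 of 768 rows is about 0.487, i.e. 48.7%.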

Removing NA values.

  • Insulin is an important independent predictor of diabetes, so the variable is kept and rows with missing values are removed instead of dropping the column.

##    feature num_missing pct_missing
## 1 pregnant           0           0
## 2  glucose           0           0
## 3 pressure           0           0
## 4  triceps           0           0
## 5  insulin           0           0
## 6     mass           0           0
## 7 pedigree           0           0
## 8      age           0           0
## 9 diabetes           0           0
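
Removing incomplete rows (complete-case filtering) can be sketched as follows. This is an illustrative Python version under the assumption that any row containing an NA is dropped; the function name `complete_cases` is hypothetical and only the six head rows shown earlier are used.

```python
# Six head rows from the earlier printout; None stands in for R's NA.
ROWS = [
    (6, 148, 72, 35, None, 33.6, 0.627, 50, "pos"),
    (1, 85, 66, 29, None, 26.6, 0.351, 31, "neg"),
    (8, 183, 64, None, None, 23.3, 0.672, 32, "pos"),
    (1, 89, 66, 23, 94, 28.1, 0.167, 21, "neg"),
    (0, 137, 40, 35, 168, 43.1, 2.288, 33, "pos"),
    (5, 116, 74, None, None, 25.6, 0.201, 30, "neg"),
]


def complete_cases(rows):
    """Keep only the rows in which no value is missing."""
    return [row for row in rows if all(value is not None for value in row)]


kept = complete_cases(ROWS)
print(len(kept), "of", len(ROWS), "rows kept")
```

Of the six head rows, only rows 4 and 5 survive. On the full data, the two summaries in the next section contain 79 and 313 rows, so 392 of the 768 observations remain after filtering.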

Subsetting data into testing and training

##     pregnant         glucose         pressure        triceps     
##  Min.   : 0.000   Min.   : 71.0   Min.   :24.00   Min.   :10.00  
##  1st Qu.: 1.000   1st Qu.: 95.0   1st Qu.:60.00   1st Qu.:22.50  
##  Median : 2.000   Median :112.0   Median :70.00   Median :30.00  
##  Mean   : 3.215   Mean   :120.4   Mean   :68.76   Mean   :30.52  
##  3rd Qu.: 5.000   3rd Qu.:143.0   3rd Qu.:77.00   3rd Qu.:37.50  
##  Max.   :12.000   Max.   :198.0   Max.   :94.00   Max.   :63.00  
##     insulin           mass          pedigree           age        diabetes
##  Min.   : 14.0   Min.   :19.60   Min.   :0.1070   Min.   :21.00   neg:57  
##  1st Qu.: 66.0   1st Qu.:28.55   1st Qu.:0.3105   1st Qu.:23.00   pos:22  
##  Median :116.0   Median :33.60   Median :0.4970   Median :26.00           
##  Mean   :131.7   Mean   :33.07   Mean   :0.5437   Mean   :30.33           
##  3rd Qu.:168.5   3rd Qu.:36.55   3rd Qu.:0.7020   3rd Qu.:36.00           
##  Max.   :402.0   Max.   :59.40   Max.   :2.4200   Max.   :61.00
##     pregnant         glucose         pressure         triceps    
##  Min.   : 0.000   Min.   : 56.0   Min.   : 30.00   Min.   : 7.0  
##  1st Qu.: 1.000   1st Qu.:100.0   1st Qu.: 62.00   1st Qu.:20.0  
##  Median : 2.000   Median :120.0   Median : 70.00   Median :29.0  
##  Mean   : 3.323   Mean   :123.2   Mean   : 71.14   Mean   :28.8  
##  3rd Qu.: 5.000   3rd Qu.:144.0   3rd Qu.: 80.00   3rd Qu.:37.0  
##  Max.   :17.000   Max.   :197.0   Max.   :110.00   Max.   :60.0  
##     insulin           mass          pedigree           age     diabetes 
##  Min.   : 15.0   Min.   :18.20   Min.   :0.0850   Min.   :21   neg:205  
##  1st Qu.: 82.0   1st Qu.:28.30   1st Qu.:0.2640   1st Qu.:23   pos:108  
##  Median :126.0   Median :33.10   Median :0.4390   Median :27            
##  Mean   :162.2   Mean   :33.09   Mean   :0.5178   Mean   :31            
##  3rd Qu.:192.0   3rd Qu.:37.10   3rd Qu.:0.6660   3rd Qu.:36            
##  Max.   :846.0   Max.   :67.10   Max.   :2.3290   Max.   :81
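
The first summary above covers 79 rows (57 neg, 22 pos) and the second 313 rows (205 neg, 108 pos). The report does not show how the split was drawn, so the sketch below simply assumes a random hold-out of 79 rows; the function name, the seed, and the use of integer row ids as stand-ins for the data are all illustrative choices.

```python
import random


def train_test_split(rows, n_test, seed=0):
    """Randomly hold out n_test rows and return (train, test)."""
    if not 0 <= n_test <= len(rows):
        raise ValueError("n_test must be between 0 and len(rows)")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(392))  # stand-in ids for the 392 complete observations
train, test = train_test_split(rows, n_test=79, seed=2019)
print(len(train), len(test))  # 313 79
```

A plain random split like this does not preserve the class ratio exactly; a stratified split would be an alternative when the positive class is small.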

Exploratory Data Analysis _ Descriptive Analysis of Variables

## Warning: Removed 1 rows containing non-finite values (stat_count).
## Warning: Removed 1 rows containing missing values (geom_bar).

Classification Method

Precision, Recall and Specificity
In addition to the raw classification accuracy, there are many other metrics that are widely used to examine the performance of a classification model, including:

Precision, which is the proportion of true positives among all the individuals that have been predicted to be diabetes-positive by the model. This represents the accuracy of a predicted positive outcome.
Precision = TruePositives/(TruePositives + FalsePositives).

Sensitivity (or Recall), which is the True Positive Rate (TPR) or the proportion of identified positives among the diabetes-positive population (class = 1).
Sensitivity = TruePositives/(TruePositives + FalseNegatives).

Specificity, which measures the True Negative Rate (TNR), that is the proportion of identified negatives among the diabetes-negative population (class = 0).
Specificity = TrueNegatives/(TrueNegatives + FalsePositives).

False Positive Rate (FPR), which represents the proportion of identified positives among the healthy individuals (i.e. diabetes-negative). This can be seen as a false alarm.
The FPR can also be calculated as 1-specificity. When positives are rare, even a modest FPR can produce more false positives than true positives, leading to the situation where a predicted positive is most likely a negative.
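
To see how rare positives affect the meaning of a predicted positive, precision can be derived from prevalence, sensitivity and FPR with Bayes' rule. The numbers below (30% versus 1% prevalence, sensitivity 0.90, FPR 0.05) are illustrative assumptions and are not taken from the data.

```python
def precision_from_rates(prevalence, sensitivity, fpr):
    """Precision implied by class prevalence, sensitivity and false positive rate."""
    true_pos = prevalence * sensitivity        # share of all cases that are true positives
    false_pos = (1.0 - prevalence) * fpr       # share of all cases that are false positives
    return true_pos / (true_pos + false_pos)


common = precision_from_rates(prevalence=0.30, sensitivity=0.90, fpr=0.05)
rare = precision_from_rates(prevalence=0.01, sensitivity=0.90, fpr=0.05)
print(round(common, 3), round(rare, 3))  # 0.885 0.154
```

With the same classifier quality, precision drops from about 89% to about 15% when the positive class becomes rare, so most predicted positives are then actually negatives.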

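Sensitivity and Specificity are commonly used to measure the performance of a predictive model. Note that the confusion matrix outputs in this report state 'Positive' Class : neg, so their Sensitivity and Specificity values treat diabetes-negative as the positive class, the reverse of the definitions above.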

Confusion Matrix:

True positives : these are cases in which we predicted the individuals would be diabetes-positive and they were.
True negatives : We predicted diabetes-negative, and the individuals were diabetes-negative.
False positives : We predicted diabetes-positive, but the individuals didn’t actually have diabetes.
False negatives : We predicted diabetes-negative, but they did have diabetes.
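
The four cells translate directly into the metrics. The sketch below uses the standard definitions (in particular specificity = TN/(TN + FP)) and plugs in the counts from the first Support Vector Machine table in this report. Because that output declares neg as the 'Positive' class, the first call treats diabetes-negative as positive; the second call swaps the roles. The function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from the four cells of a 2x2 confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
    }


# First SVM table, with 'neg' as the positive class (as in the printed output).
neg_as_positive = classification_metrics(tp=52, fp=6, fn=5, tn=16)
# Same table, with diabetes-positive as the positive class.
pos_as_positive = classification_metrics(tp=16, fp=5, fn=6, tn=52)

for name, value in neg_as_positive.items():
    print(f"{name:12s} {value:.4f}")
```

The first call gives sensitivity 0.9123 and specificity 0.7273, as printed for the SVM. With diabetes-positive as the positive class the two values swap: 16 of the 22 diabetic test cases are detected (sensitivity 0.7273).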


1. Support Vector Machine

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg  52   6
##        pos   5  16
##                                           
##                Accuracy : 0.8608          
##                  95% CI : (0.7645, 0.9284)
##     No Information Rate : 0.7215          
##     P-Value [Acc > NIR] : 0.002649        
##                                           
##                   Kappa : 0.6486          
##                                           
##  Mcnemar's Test P-Value : 1.000000        
##                                           
##             Sensitivity : 0.9123          
##             Specificity : 0.7273          
##          Pos Pred Value : 0.8966          
##          Neg Pred Value : 0.7619          
##              Prevalence : 0.7215          
##          Detection Rate : 0.6582          
##    Detection Prevalence : 0.7342          
##       Balanced Accuracy : 0.8198          
##                                           
##        'Positive' Class : neg             
## 
## [1] " Fitting svm model with 4 top predictor variables :  glucose, pressure, insulin,  mass"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg  53   8
##        pos   4  14
##                                          
##                Accuracy : 0.8481         
##                  95% CI : (0.7497, 0.919)
##     No Information Rate : 0.7215         
##     P-Value [Acc > NIR] : 0.006194       
##                                          
##                   Kappa : 0.5997         
##                                          
##  Mcnemar's Test P-Value : 0.386476       
##                                          
##             Sensitivity : 0.9298         
##             Specificity : 0.6364         
##          Pos Pred Value : 0.8689         
##          Neg Pred Value : 0.7778         
##              Prevalence : 0.7215         
##          Detection Rate : 0.6709         
##    Detection Prevalence : 0.7722         
##       Balanced Accuracy : 0.7831         
##                                          
##        'Positive' Class : neg            
## 

Method = Support Vector Machine

Accuracy rate = 86%
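
Besides accuracy, the output above reports Kappa and Balanced Accuracy. As a hedged illustration, both can be recomputed from the two SVM tables using the textbook formulas for Cohen's kappa and for the mean of the two per-class recall values; the function names are hypothetical and this is not the library's own code.

```python
def cohen_kappa(table):
    """Cohen's kappa; table[i][j] is the count with prediction i and reference j."""
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (observed - expected) / (1 - expected)


def balanced_accuracy(table):
    """Mean of per-class recall (correct count divided by reference column total)."""
    k = len(table)
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    return sum(table[j][j] / col_totals[j] for j in range(k)) / k


svm_all = [[52, 6], [5, 16]]    # rows: predicted neg, pos; columns: reference neg, pos
svm_top4 = [[53, 8], [4, 14]]
print(f"{cohen_kappa(svm_all):.4f} {balanced_accuracy(svm_all):.4f}")
print(f"{cohen_kappa(svm_top4):.4f} {balanced_accuracy(svm_top4):.4f}")
```

Both measures correct for the class imbalance in the test set (57 neg versus 22 pos), which raw accuracy does not.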


2. RWeka package with J48 method

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg  50   4
##        pos   7  18
##                                           
##                Accuracy : 0.8608          
##                  95% CI : (0.7645, 0.9284)
##     No Information Rate : 0.7215          
##     P-Value [Acc > NIR] : 0.002649        
##                                           
##                   Kappa : 0.6674          
##                                           
##  Mcnemar's Test P-Value : 0.546494        
##                                           
##             Sensitivity : 0.8772          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.9259          
##          Neg Pred Value : 0.7200          
##              Prevalence : 0.7215          
##          Detection Rate : 0.6329          
##    Detection Prevalence : 0.6835          
##       Balanced Accuracy : 0.8477          
##                                           
##        'Positive' Class : neg             
## 
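
The Mcnemar's Test P-Value line compares the two kinds of error (the off-diagonal cells). A minimal sketch, assuming the continuity-corrected chi-square form of the test with one degree of freedom, reproduces the reported values from the off-diagonal counts alone; the function name is hypothetical.

```python
import math


def mcnemar_p_value(b, c):
    """P-value of McNemar's test with continuity correction (1 degree of freedom).

    b and c are the two off-diagonal counts of a 2x2 confusion matrix.
    """
    if b + c == 0:
        # No disagreements to compare; returning 1.0 is a convention of this sketch.
        return 1.0
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2))


print(f"SVM            {mcnemar_p_value(6, 5):.6f}")
print(f"SVM top 4      {mcnemar_p_value(8, 4):.6f}")
print(f"J48            {mcnemar_p_value(4, 7):.6f}")
print(f"Random Forest  {mcnemar_p_value(5, 7):.6f}")
```

A large p-value, as in all of these tables, means there is no evidence that the model errs more in one direction than in the other.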

3. Random Forest

## [1] "** Random Forest "
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg  50   5
##        pos   7  17
##                                          
##                Accuracy : 0.8481         
##                  95% CI : (0.7497, 0.919)
##     No Information Rate : 0.7215         
##     P-Value [Acc > NIR] : 0.006194       
##                                          
##                   Kappa : 0.6323         
##                                          
##  Mcnemar's Test P-Value : 0.772830       
##                                          
##             Sensitivity : 0.8772         
##             Specificity : 0.7727         
##          Pos Pred Value : 0.9091         
##          Neg Pred Value : 0.7083         
##              Prevalence : 0.7215         
##          Detection Rate : 0.6329         
##    Detection Prevalence : 0.6962         
##       Balanced Accuracy : 0.8250         
##                                          
##        'Positive' Class : neg            
## 

4. Random Forest with fine tuning

## [1] "** Random Forest with fine tuning"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg  57   2
##        pos   0  20
##                                           
##                Accuracy : 0.9747          
##                  95% CI : (0.9115, 0.9969)
##     No Information Rate : 0.7215          
##     P-Value [Acc > NIR] : 3.106e-09       
##                                           
##                   Kappa : 0.9352          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.9661          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.7215          
##          Detection Rate : 0.7215          
##    Detection Prevalence : 0.7468          
##       Balanced Accuracy : 0.9545          
##                                           
##        'Positive' Class : neg             
## 
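
The 95% CI line gives the uncertainty around an accuracy measured on only 79 test cases. The sketch below assumes the interval is the exact (Clopper-Pearson) binomial interval and finds it by bisection on the binomial distribution; the helper names are hypothetical and `math.comb` requires Python 3.8 or later.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(func, target, increasing, iterations=60):
    """Solve func(p) = target on [0, 1] for a monotone func."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies to the left of mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion k / n."""
    # Lower bound: P(X >= k | p) = alpha / 2, which increases with p.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, increasing=True)
    # Upper bound: P(X <= k | p) = alpha / 2, which decreases with p.
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper


low, high = clopper_pearson(77, 79)  # 77 of 79 test cases classified correctly
print(f"{low:.4f} {high:.4f}")
```

For 77 correct out of 79, the result should agree with the reported interval (0.9115, 0.9969) to about three decimals.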

Findings:

Method= Random Forest. Accuracy Rate= 84.8%

Method= Support Vector Machine. Accuracy Rate = 86%

Method= J48 _ RWeka package. Accuracy Rate=86%

Method= Support Vector Machine with top 4 variables. Accuracy Rate=84.8%


Method= Random Forest with node-size=5. Accuracy Rate=97%


Method                    Accuracy Rate
Random Forest             84.8 %
Support Vector Machine    86 %
J48_RWeka                 86 %
Tuned Random Forest       97 %
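
As a cross-check, the accuracy rates can be recomputed from the confusion matrices printed earlier in this report. The sketch below is illustrative Python; the dictionary labels are shorthand chosen here.

```python
# Confusion matrices copied from the outputs above.
# Rows: predicted neg, pos. Columns: reference neg, pos.
TABLES = {
    "Support Vector Machine": [[52, 6], [5, 16]],
    "SVM, top 4 predictors": [[53, 8], [4, 14]],
    "J48 (RWeka)": [[50, 4], [7, 18]],
    "Random Forest": [[50, 5], [7, 17]],
    "Tuned Random Forest": [[57, 2], [0, 20]],
}


def accuracy(table):
    """Share of test cases on the diagonal of the confusion matrix."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total


for name, table in TABLES.items():
    print(f"{name:24s} {100 * accuracy(table):.1f} %")
```

Every table sums to the same 79 test cases, so a difference of one percentage point between two methods corresponds to less than one test case.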

Conclusion:

Preliminary results seem to indicate that my proposed method of “Tuned Random Forest” significantly outperforms current approaches.

Joe Long, data analyst, San Diego, CA, October 2019