The data set PimaIndiansDiabetes2 contains a corrected version of the original data set. While the UCI repository index claims that there are no missing values, closer inspection of the data shows several physical impossibilities, e.g., blood pressure or body mass index of 0. In PimaIndiansDiabetes2, all zero values of glucose, pressure, triceps, insulin and mass have been set to NA; see also Wahba et al. (1995) and Ripley (1996).

## Source

Original owners: National Institute of Diabetes and Digestive and Kidney Diseases

Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu)

These data have been taken from the UCI Repository Of Machine Learning Databases at
ftp://ftp.ics.uci.edu/pub/machine-learning-databases
http://www.ics.uci.edu/~mlearn/MLRepository.html
and were converted to R format by Friedrich Leisch.
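The zero-to-NA correction described at the top of this section can be sketched in a few lines. This is only an illustration in Python (the report itself was produced in R): the helper name `zeros_to_missing` and the sample record are hypothetical, and `None` stands in for R's `NA`.

```python
# Columns where a recorded 0 is physically impossible (names as in PimaIndiansDiabetes2).
ZERO_MEANS_MISSING = ("glucose", "pressure", "triceps", "insulin", "mass")

def zeros_to_missing(row):
    """Return a copy of one record with impossible zeros replaced by None."""
    cleaned = dict(row)
    for col in ZERO_MEANS_MISSING:
        if cleaned.get(col) == 0:
            cleaned[col] = None
    return cleaned

# Hypothetical raw record: insulin was recorded as 0, while pregnant = 0 is a legitimate value.
raw = {"pregnant": 0, "glucose": 148, "pressure": 72, "triceps": 35,
       "insulin": 0, "mass": 33.6, "pedigree": 0.627, "age": 50, "diabetes": "pos"}
clean = zeros_to_missing(raw)
print(clean["insulin"], clean["pregnant"])  # None 0
```

Only the five listed columns are touched, so a zero in `pregnant` is left alone.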
Because data quality and structure vary, no single technique gives the highest accuracy for all diseases: a classifier that performs best on one dataset may be outperformed by another method on a different disease. This study focuses on a common, standard classifier for diabetes disease (DD) classification and prediction, and applies a little fine tuning to improve its accuracy rate.
Diagnosis of Diabetes Mellitus using K Nearest Neighbor Algorithm. 2014
Krati Saxena, Dr. Zubair Khan, Shefali Singh
Department of Computer Science Engineering, Invertis University, Bareilly, India
** Method= knn. Accuracy rate = 70%
http://www.ijcstjournal.org/volume-2/issue-4/IJCST-V2I4P6.pdf
A Smart Clinical Decision Support System to Predict diabetes Disease Using Classification Techniques. 2018
K. Lakshmi, D. Iyajaz Ahmed, G. Siva Kumar
Department of Computer Science and Engineering, G. Pullaiah College of Engineering and Technology, Kurnool, India
** Method= knn. Accuracy rate = 95%
https://www.academia.edu/37070070/A_Smart_Clinical_Decision_Support_System_to_Predict_diabetes_Disease_Using_Classification_Techniques
A comparative analysis of data mining techniques for prediction of postprandial blood glucose. 2018
Huan-Cheng Chang, Pin-Hsiang Chang, Sung-Chin Tseng, Chi-Chang Chang, Yen-Chiao Lu
Dept. of Healthcare Management, Yuanpei University of Medical Technology, Hsinchu, Taiwan
School of Medical Informatics, Chung-Shan Medical University/Hospital, Taichung, Taiwan
** Method= RF. Accuracy rate = 82.68%
** Method= C4.5. Accuracy rate = 76.56%
https://www.econstor.eu/bitstream/10419/176839/1/full-09.pdf
*** Overall:
95% is among the highest accuracy rates for predicting diabetes disease published in an accredited professional publication.
Preliminary calculations appear to show that my proposed approach significantly outperforms current approaches.
Checking for NA values in observations
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: num 72 66 64 66 40 74 50 NA 70 96 ...
## $ triceps : num 35 29 NA 23 35 NA 32 NA 45 NA ...
## $ insulin : num NA NA NA 94 168 NA 88 NA 543 NA ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 NA 33.6 0.627 50 pos
## 2 1 85 66 29 NA 26.6 0.351 31 neg
## 3 8 183 64 NA NA 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 5 0 137 40 35 168 43.1 2.288 33 pos
## 6 5 116 74 NA NA 25.6 0.201 30 neg
## Number of missing value: 652
## feature num_missing pct_missing
## 1 pregnant 0 0.000000000
## 2 glucose 5 0.006510417
## 3 pressure 35 0.045572917
## 4 triceps 227 0.295572917
## 5 insulin 374 0.486979167
## 6 mass 11 0.014322917
## 7 pedigree 0 0.000000000
## 8 age 0 0.000000000
## 9 diabetes 0 0.000000000
## feature num_missing pct_missing
## 1 pregnant 0 0
## 2 glucose 0 0
## 3 pressure 0 0
## 4 triceps 0 0
## 5 insulin 0 0
## 6 mass 0 0
## 7 pedigree 0 0
## 8 age 0 0
## 9 diabetes 0 0
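The two missing-value tables above can be reproduced in spirit with a per-column count. The sketch below is in Python rather than the report's R, uses only the six observations printed earlier, and represents `NA` as `None`. Note that the `pct_missing` column in the output is a fraction between 0 and 1 (for insulin, 374/768 ≈ 0.487), and the sketch follows that convention. How the report arrived at zero missing values is not shown; dropping incomplete rows is assumed here because the class counts in the later summaries (57 + 22 and 205 + 108) add up to 392 rows, but that remains an assumption.

```python
COLUMNS = ("pregnant", "glucose", "pressure", "triceps", "insulin",
           "mass", "pedigree", "age", "diabetes")

# First six observations as printed above; None stands in for R's NA.
HEAD = [
    (6, 148, 72, 35, None, 33.6, 0.627, 50, "pos"),
    (1, 85, 66, 29, None, 26.6, 0.351, 31, "neg"),
    (8, 183, 64, None, None, 23.3, 0.672, 32, "pos"),
    (1, 89, 66, 23, 94, 28.1, 0.167, 21, "neg"),
    (0, 137, 40, 35, 168, 43.1, 2.288, 33, "pos"),
    (5, 116, 74, None, None, 25.6, 0.201, 30, "neg"),
]
rows = [dict(zip(COLUMNS, values)) for values in HEAD]

def missing_summary(rows):
    """Return (feature, num_missing, fraction_missing) for every column."""
    n = len(rows)
    summary = []
    for col in COLUMNS:
        num_missing = sum(1 for r in rows if r[col] is None)
        summary.append((col, num_missing, num_missing / n))
    return summary

def complete_cases(rows):
    """Keep only the rows with no missing value in any column (assumed cleaning step)."""
    return [r for r in rows if all(v is not None for v in r.values())]

for feature, num_missing, frac in missing_summary(rows):
    print(f"{feature:9s} {num_missing:3d} {frac:.3f}")
print("complete rows:", len(complete_cases(rows)))  # 2 of the 6 rows shown
```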
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 71.0 Min. :24.00 Min. :10.00
## 1st Qu.: 1.000 1st Qu.: 95.0 1st Qu.:60.00 1st Qu.:22.50
## Median : 2.000 Median :112.0 Median :70.00 Median :30.00
## Mean : 3.215 Mean :120.4 Mean :68.76 Mean :30.52
## 3rd Qu.: 5.000 3rd Qu.:143.0 3rd Qu.:77.00 3rd Qu.:37.50
## Max. :12.000 Max. :198.0 Max. :94.00 Max. :63.00
## insulin mass pedigree age diabetes
## Min. : 14.0 Min. :19.60 Min. :0.1070 Min. :21.00 neg:57
## 1st Qu.: 66.0 1st Qu.:28.55 1st Qu.:0.3105 1st Qu.:23.00 pos:22
## Median :116.0 Median :33.60 Median :0.4970 Median :26.00
## Mean :131.7 Mean :33.07 Mean :0.5437 Mean :30.33
## 3rd Qu.:168.5 3rd Qu.:36.55 3rd Qu.:0.7020 3rd Qu.:36.00
## Max. :402.0 Max. :59.40 Max. :2.4200 Max. :61.00
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 56.0 Min. : 30.00 Min. : 7.0
## 1st Qu.: 1.000 1st Qu.:100.0 1st Qu.: 62.00 1st Qu.:20.0
## Median : 2.000 Median :120.0 Median : 70.00 Median :29.0
## Mean : 3.323 Mean :123.2 Mean : 71.14 Mean :28.8
## 3rd Qu.: 5.000 3rd Qu.:144.0 3rd Qu.: 80.00 3rd Qu.:37.0
## Max. :17.000 Max. :197.0 Max. :110.00 Max. :60.0
## insulin mass pedigree age diabetes
## Min. : 15.0 Min. :18.20 Min. :0.0850 Min. :21 neg:205
## 1st Qu.: 82.0 1st Qu.:28.30 1st Qu.:0.2640 1st Qu.:23 pos:108
## Median :126.0 Median :33.10 Median :0.4390 Median :27
## Mean :162.2 Mean :33.09 Mean :0.5178 Mean :31
## 3rd Qu.:192.0 3rd Qu.:37.10 3rd Qu.:0.6660 3rd Qu.:36
## Max. :846.0 Max. :67.10 Max. :2.3290 Max. :81
Precision, Recall and Specificity
In addition to the raw classification accuracy, there are many other metrics that are widely used to examine the performance of a classification model, including:
Precision, which is the proportion of true positives among all the individuals that have been predicted to be diabetes-positive by the model. This represents the accuracy of a predicted positive outcome.
Precision = TruePositives/(TruePositives + FalsePositives).
Sensitivity (or Recall), which is the True Positive Rate (TPR) or the proportion of identified positives among the diabetes-positive population (class = 1).
Sensitivity = TruePositives/(TruePositives + FalseNegatives).
Specificity, which measures the True Negative Rate (TNR), that is the proportion of identified negatives among the diabetes-negative population (class = 0).
Specificity = TrueNegatives/(TrueNegatives + FalsePositives).
False Positive Rate (FPR), which represents the proportion of identified positives among the healthy individuals (i.e. diabetes-negative). This can be seen as a false alarm.
The FPR can also be calculated as 1 - specificity. When positives are rare, even a modest FPR produces many false alarms relative to the few true positives, leading to the situation where a predicted positive is most likely a negative.
Sensitivity and Specificity are commonly used to measure the performance of a predictive model.
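The remark about rare positives can be checked with a short calculation. The sketch below, in Python, applies Bayes' rule to get the share of predicted positives that are truly positive. The prevalence, sensitivity, and FPR values are illustrative assumptions and are not taken from this data set.

```python
def positive_predictive_value(prevalence, sensitivity, fpr):
    """Share of predicted positives that are truly positive (Bayes' rule)."""
    true_pos = prevalence * sensitivity    # fraction of population: positive and flagged
    false_pos = (1 - prevalence) * fpr     # fraction of population: negative but flagged
    return true_pos / (true_pos + false_pos)

# Same hypothetical classifier (sensitivity 0.90, FPR 0.05) at two prevalences.
for prevalence in (0.30, 0.01):
    ppv = positive_predictive_value(prevalence, sensitivity=0.90, fpr=0.05)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}")
# prevalence=0.30  PPV=0.885
# prevalence=0.01  PPV=0.154
```

With 1% prevalence, roughly 85% of the predicted positives are false alarms even though the FPR itself is unchanged.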
True positives : these are cases in which we predicted the individuals would be diabetes-positive and they were.
True negatives : We predicted diabetes-negative, and the individuals were diabetes-negative.
False positives : We predicted diabetes-positive, but the individuals didn’t actually have diabetes.
False negatives : We predicted diabetes-negative, but they did have diabetes.
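As a worked example, the four counts can be plugged into the standard definitions, with specificity taken as TrueNegatives/(TrueNegatives + FalsePositives). The sketch below is in Python; the function name is hypothetical, and the counts are read from the first confusion matrix printed below, treating diabetes-positive (`pos`) as the class of interest.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, sensitivity, specificity and FPR from the four cell counts."""
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
    }

# First confusion matrix: predicted pos / actually pos = 16, predicted pos / actually neg = 5,
# predicted neg / actually neg = 52, predicted neg / actually pos = 6.
metrics = classification_metrics(tp=16, fp=5, tn=52, fn=6)
for name, value in metrics.items():
    print(f"{name:12s} {value:.4f}")
# precision    0.7619
# sensitivity  0.7273
# specificity  0.9123
# fpr          0.0877
```

These values differ from the Sensitivity and Specificity lines in the printed statistics because that output reports `neg` as its 'Positive' class, which swaps the two rates.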
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 52 6
## pos 5 16
##
## Accuracy : 0.8608
## 95% CI : (0.7645, 0.9284)
## No Information Rate : 0.7215
## P-Value [Acc > NIR] : 0.002649
##
## Kappa : 0.6486
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.9123
## Specificity : 0.7273
## Pos Pred Value : 0.8966
## Neg Pred Value : 0.7619
## Prevalence : 0.7215
## Detection Rate : 0.6582
## Detection Prevalence : 0.7342
## Balanced Accuracy : 0.8198
##
## 'Positive' Class : neg
##
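The statistics above treat `neg` as the 'Positive' class, so the reported Sensitivity (0.9123) is the rate for diabetes-negative individuals. The sketch below, in Python, recomputes several of the printed figures from the four cell counts under that convention. The layout of the output resembles `confusionMatrix` from the R caret package, but the report does not name the function, so that is an assumption. The function name `matrix_stats` is hypothetical.

```python
def matrix_stats(matrix, positive):
    """matrix[predicted][reference] holds counts for a two-class problem."""
    negative = next(c for c in matrix if c != positive)
    tp = matrix[positive][positive]
    fp = matrix[positive][negative]
    fn = matrix[negative][positive]
    tn = matrix[negative][negative]
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Cohen's kappa: agreement beyond what the row and column totals give by chance.
    chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "accuracy": accuracy,
        "no_information_rate": max(tp + fn, fp + tn) / n,
        "kappa": (accuracy - chance) / (1 - chance),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Counts copied from the confusion matrix above (outer key = prediction, inner key = reference).
first = {"neg": {"neg": 52, "pos": 6}, "pos": {"neg": 5, "pos": 16}}
stats = matrix_stats(first, positive="neg")
for name, value in stats.items():
    print(f"{name:20s} {value:.4f}")
```

Calling `matrix_stats(first, positive="pos")` instead swaps sensitivity and specificity while leaving accuracy and kappa unchanged.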
## [1] " Fitting svm model with 4 top predictor variables : glucose, pressure, insulin, mass"
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 53 8
## pos 4 14
##
## Accuracy : 0.8481
## 95% CI : (0.7497, 0.919)
## No Information Rate : 0.7215
## P-Value [Acc > NIR] : 0.006194
##
## Kappa : 0.5997
##
## Mcnemar's Test P-Value : 0.386476
##
## Sensitivity : 0.9298
## Specificity : 0.6364
## Pos Pred Value : 0.8689
## Neg Pred Value : 0.7778
## Prevalence : 0.7215
## Detection Rate : 0.6709
## Detection Prevalence : 0.7722
## Balanced Accuracy : 0.7831
##
## 'Positive' Class : neg
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 50 4
## pos 7 18
##
## Accuracy : 0.8608
## 95% CI : (0.7645, 0.9284)
## No Information Rate : 0.7215
## P-Value [Acc > NIR] : 0.002649
##
## Kappa : 0.6674
##
## Mcnemar's Test P-Value : 0.546494
##
## Sensitivity : 0.8772
## Specificity : 0.8182
## Pos Pred Value : 0.9259
## Neg Pred Value : 0.7200
## Prevalence : 0.7215
## Detection Rate : 0.6329
## Detection Prevalence : 0.6835
## Balanced Accuracy : 0.8477
##
## 'Positive' Class : neg
##
## [1] "** Random Forest "
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 50 5
## pos 7 17
##
## Accuracy : 0.8481
## 95% CI : (0.7497, 0.919)
## No Information Rate : 0.7215
## P-Value [Acc > NIR] : 0.006194
##
## Kappa : 0.6323
##
## Mcnemar's Test P-Value : 0.772830
##
## Sensitivity : 0.8772
## Specificity : 0.7727
## Pos Pred Value : 0.9091
## Neg Pred Value : 0.7083
## Prevalence : 0.7215
## Detection Rate : 0.6329
## Detection Prevalence : 0.6962
## Balanced Accuracy : 0.8250
##
## 'Positive' Class : neg
##
## [1] "** Random Forest with fine tuning"
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 57 2
## pos 0 20
##
## Accuracy : 0.9747
## 95% CI : (0.9115, 0.9969)
## No Information Rate : 0.7215
## P-Value [Acc > NIR] : 3.106e-09
##
## Kappa : 0.9352
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.9091
## Pos Pred Value : 0.9661
## Neg Pred Value : 1.0000
## Prevalence : 0.7215
## Detection Rate : 0.7215
## Detection Prevalence : 0.7468
## Balanced Accuracy : 0.9545
##
## 'Positive' Class : neg
##
| Method | Accuracy Rate |
|---|---|
| Random Forest | 84.8% |
| Support Vector Machine | 86.1% |
| J48_RWeka | 86.1% |
| Tuned Random Forest | 97.5% |
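The accuracy figures in this table can be recomputed from the five confusion matrices printed earlier, as correct predictions divided by the 79 test observations. The sketch below is in Python. Two of the matrices carry no label in the output, so they are identified here only by position rather than guessed, and the results agree with the table up to rounding.

```python
# Each matrix is [[pred neg & ref neg, pred neg & ref pos],
#                 [pred pos & ref neg, pred pos & ref pos]], copied from the output above.
MATRICES = [
    ("first matrix (unlabeled in the output)", [[52, 6], [5, 16]]),
    ("SVM, 4 top predictors", [[53, 8], [4, 14]]),
    ("third matrix (unlabeled in the output)", [[50, 4], [7, 18]]),
    ("Random Forest", [[50, 5], [7, 17]]),
    ("Random Forest with fine tuning", [[57, 2], [0, 20]]),
]

def accuracy(matrix):
    """Share of observations on the diagonal of a 2x2 confusion matrix."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

results = [(label, accuracy(matrix)) for label, matrix in MATRICES]
for label, acc in results:
    print(f"{label:40s} {100 * acc:5.1f}%")
```

On a test set of this size each observation is worth about 1.3 percentage points, so the gap between 84.8% and 86.1% corresponds to a single prediction.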
Preliminary results on the 79-observation test set seem to indicate that my proposed method of “Tuned Random Forest” (accuracy 97.5%, 95% CI 91.2% to 99.7%) outperforms the current approaches listed above.
Joe Long, data analyst, San Diego, CA, October 2019