Diabetes is a chronic condition in which the body cannot properly regulate blood sugar (glucose) levels. This happens either because the body does not produce enough insulin or because it cannot effectively use the insulin it produces. Common symptoms include frequent urination, excessive thirst, unexplained weight loss, fatigue, blurred vision, and slow wound healing.
The Diabetes prediction dataset is a collection of medical and demographic data from patients, along with their diabetes status (positive or negative). The data includes features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. This dataset can be used to build machine learning models to predict diabetes in patients based on their medical history and demographic information. This can be useful for healthcare professionals in identifying patients who may be at risk of developing diabetes and in developing personalized treatment plans. Additionally, the dataset can be used by researchers to explore the relationships between various medical and demographic factors and the likelihood of developing diabetes.
Gender – the person’s recorded gender (Female, Male, or Other).
Age – how old the person is.
Hypertension – whether the person has high blood pressure.
Heart_disease – whether the person has any heart-related condition.
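Smoking_history / smoking_clean – the person’s smoking habit. Smoking_history is the raw column (never, current, former, ever, not current, or No Info); smoking_clean is the regrouped version (Smoker, Ex_Smoker, Non_Smoker, or No Info).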
BMI – a measure of body weight relative to height; used to check if someone is underweight, normal, or overweight.
HbA1c_level – average blood sugar level over the past few months.
Blood_glucose_level – the person’s current blood sugar level.
Diabetes – shows whether the person has diabetes or not.
To investigate the factors associated with diabetes and develop an effective model for predicting diabetes risk.
Which factors are significantly associated with diabetes in the dataset?
Which machine learning model provides the most accurate prediction of diabetes?
What are the key drivers of diabetes based on the best-performing model?
To analyze the relationship between clinical and lifestyle variables and diabetes using statistical methods.
To build and compare different machine learning models for diabetes prediction.
To evaluate model performance and identify the most important features influencing diabetes prediction.
Packages loaded: tidyverse 2.0.0 (dplyr 1.1.4, ggplot2 4.0.0, tidyr 1.3.1, and others), plotly, caret, randomForest 4.7-1.2, xgboost, lightgbm, pROC, smotefamily, and ROSE 0.0-4.
## [1] "gender" "age" "hypertension"
## [4] "heart_disease" "smoking_history" "bmi"
## [7] "HbA1c_level" "blood_glucose_level" "diabetes"
## gender age hypertension heart_disease smoking_history bmi HbA1c_level
## 1 Female 80 0 1 never 25.19 6.6
## 2 Female 54 0 0 No Info 27.32 6.6
## 3 Male 28 0 0 never 27.32 5.7
## 4 Female 36 0 0 current 23.45 5.0
## 5 Male 76 1 1 current 20.14 4.8
## 6 Female 20 0 0 never 27.32 6.6
## blood_glucose_level diabetes
## 1 140 0
## 2 80 0
## 3 158 0
## 4 155 0
## 5 155 0
## 6 85 0
## gender age hypertension heart_disease
## Length:100000 Min. : 0.08 Min. :0.00000 Min. :0.00000
## Class :character 1st Qu.:24.00 1st Qu.:0.00000 1st Qu.:0.00000
## Mode :character Median :43.00 Median :0.00000 Median :0.00000
## Mean :41.89 Mean :0.07485 Mean :0.03942
## 3rd Qu.:60.00 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :80.00 Max. :1.00000 Max. :1.00000
## smoking_history bmi HbA1c_level blood_glucose_level
## Length:100000 Min. :10.01 Min. :3.500 Min. : 80.0
## Class :character 1st Qu.:23.63 1st Qu.:4.800 1st Qu.:100.0
## Mode :character Median :27.32 Median :5.800 Median :140.0
## Mean :27.32 Mean :5.528 Mean :138.1
## 3rd Qu.:29.58 3rd Qu.:6.200 3rd Qu.:159.0
## Max. :95.69 Max. :9.000 Max. :300.0
## diabetes
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.085
## 3rd Qu.:0.000
## Max. :1.000
## [1] 100000 9
## gender age hypertension heart_disease
## 0 0 0 0
## smoking_history bmi HbA1c_level blood_glucose_level
## 0 0 0 0
## diabetes
## 0
##
## current ever former never No Info not current
## 9286 4004 9352 35095 35816 6447
##
## Female Male Other
## 58552 41430 18
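Regrouping the smoking history column for clarity: ever, former, and not current are combined into Ex_Smoker, never becomes Non_Smoker, current becomes Smoker, and No Info is kept as is.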
## gender age hypertension heart_disease smoking_history bmi HbA1c_level
## 1 Female 80 0 1 never 25.19 6.6
## 2 Female 54 0 0 No Info 27.32 6.6
## 3 Male 28 0 0 never 27.32 5.7
## 4 Female 36 0 0 current 23.45 5.0
## 5 Male 76 1 1 current 20.14 4.8
## 6 Female 20 0 0 never 27.32 6.6
## blood_glucose_level diabetes smoking_clean
## 1 140 0 Non_Smoker
## 2 80 0 No Info
## 3 158 0 Non_Smoker
## 4 155 0 Smoker
## 5 155 0 Smoker
## 6 85 0 Non_Smoker
##
## Ex_Smoker No Info Non_Smoker Smoker
## 19803 35816 35095 9286
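The regrouping can be sketched in base R as follows. The mapping is inferred from the two frequency tables (4004 + 9352 + 6447 = 19803), not copied from the original code, and `regroup_smoking` is a hypothetical helper name.

```r
# Hypothetical helper: collapse the six raw smoking_history labels into four groups.
# The mapping is an assumption inferred from the frequency tables in this report.
regroup_smoking <- function(x) {
  out <- rep(NA_character_, length(x))
  out[x %in% "never"] <- "Non_Smoker"
  out[x %in% "current"] <- "Smoker"
  out[x %in% c("ever", "former", "not current")] <- "Ex_Smoker"
  out[x %in% "No Info"] <- "No Info"
  out
}

# Raw counts as reported in the first frequency table.
raw_counts <- c(current = 9286, ever = 4004, former = 9352,
                never = 35095, `No Info` = 35816, `not current` = 6447)

# Sum the raw counts within each new group label.
grouped <- tapply(raw_counts, regroup_smoking(names(raw_counts)), sum)
print(grouped)
```

The grouped totals match the smoking_clean table, which supports the inferred mapping.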
Distribution of Age
## [1] 80.00 54.00 28.00 36.00 76.00 20.00 44.00 79.00 42.00 32.00 53.00 78.00
## [13] 67.00 15.00 37.00 40.00 5.00 69.00 72.00 4.00 30.00 45.00 43.00 50.00
## [25] 41.00 26.00 34.00 73.00 77.00 66.00 29.00 60.00 38.00 3.00 57.00 74.00
## [37] 19.00 46.00 21.00 59.00 27.00 13.00 56.00 2.00 7.00 11.00 6.00 55.00
## [49] 9.00 62.00 47.00 12.00 68.00 75.00 22.00 58.00 18.00 24.00 17.00 25.00
## [61] 0.08 33.00 16.00 61.00 31.00 8.00 49.00 39.00 65.00 14.00 70.00 0.56
## [73] 48.00 51.00 71.00 0.88 64.00 63.00 52.00 0.16 10.00 35.00 23.00 0.64
## [85] 1.16 1.64 0.72 1.88 1.32 0.80 1.24 1.00 1.80 0.48 1.56 1.08
## [97] 0.24 1.40 0.40 0.32 1.72 1.48
Which factors are related to diabetes?
Converting some variables to factors
Count of smoking history
##
## Ex_Smoker No Info Non_Smoker Smoker
## Female 10925 19700 22869 5058
## Male 8869 16110 12223 4228
## Other 9 6 3 0
Plot
##
## Welch Two Sample t-test
##
## data: HbA1c_level by diabetes
## t = -127.01, df = 9828.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -1.561932 -1.514453
## sample estimates:
## mean in group 0 mean in group 1
## 5.396761 6.934953
People with diabetes have a much higher mean HbA1c (6.93 versus 5.40), which means they have higher long-term blood sugar levels. HbA1c is strongly associated with diabetes.
##
## Welch Two Sample t-test
##
## data: blood_glucose_level by diabetes
## t = -94.795, df = 9045.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -62.50864 -59.97583
## sample estimates:
## mean in group 0 mean in group 1
## 132.8525 194.0947
Patients with diabetes have significantly higher blood glucose levels than non-diabetic patients (mean 194.1 versus 132.9).
##
## Welch Two Sample t-test
##
## data: bmi by diabetes
## t = -60.265, df = 9654.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -5.267143 -4.935294
## sample estimates:
## mean in group 0 mean in group 1
## 26.88716 31.98838
People with diabetes have a higher mean BMI (31.99 versus 26.89), so higher body weight relative to height is associated with diabetes.
##
## Welch Two Sample t-test
##
## data: age by diabetes
## t = -119.59, df = 12560, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -21.17285 -20.48995
## sample estimates:
## mean in group 0 mean in group 1
## 40.11519 60.94659
People with diabetes are about 21 years older on average (mean age 60.9 versus 40.1), so older people are much more likely to have diabetes.
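The four comparisons above are Welch two-sample t-tests. A minimal sketch of how such a test can be run and checked by hand is shown here on simulated data; the data frame `toy` and its parameters are made up for illustration and are not the real dataset.

```r
set.seed(1)
# Simulated stand-in for the real data: two groups with different means.
toy <- data.frame(
  diabetes = factor(rep(c(0, 1), times = c(900, 100))),
  HbA1c_level = c(rnorm(900, mean = 5.4, sd = 1.0),
                  rnorm(100, mean = 6.9, sd = 1.1))
)

# t.test() runs Welch's test by default (var.equal = FALSE).
welch <- t.test(HbA1c_level ~ diabetes, data = toy)

# The same statistic computed by hand: difference in means over its standard error.
g0 <- toy$HbA1c_level[toy$diabetes == 0]
g1 <- toy$HbA1c_level[toy$diabetes == 1]
se <- sqrt(var(g0) / length(g0) + var(g1) / length(g1))
t_manual <- (mean(g0) - mean(g1)) / se

print(c(t_test = unname(welch$statistic), manual = t_manual))
```

The statistic is negative for the same reason as in the report: group 0 is subtracted first and has the lower mean.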
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(diab$hypertension, diab$diabetes)
## X-squared = 3910.7, df = 1, p-value < 2.2e-16
People with high blood pressure are more likely to have diabetes.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(diab$heart_disease, diab$diabetes)
## X-squared = 2945.8, df = 1, p-value < 2.2e-16
People with heart problems are more likely to also have diabetes.
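A chi-squared test shows only that two categorical variables are associated; the direction comes from the row proportions. The sketch below illustrates this on simulated binary data, where the variable names mirror the report but the values and risk levels are invented.

```r
set.seed(2)
# Simulated binary variables; the outcome risk is set higher in the exposed group.
n <- 5000
hypertension <- rbinom(n, 1, 0.08)
diabetes <- rbinom(n, 1, ifelse(hypertension == 1, 0.28, 0.07))

tab <- table(hypertension, diabetes)

# For a 2x2 table, chisq.test() applies Yates' continuity correction by default.
test <- chisq.test(tab)

# Share of diabetes within each hypertension group gives the direction.
rates <- prop.table(tab, margin = 1)[, "1"]

print(test$p.value)
print(rates)
```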
## Df Sum Sq Mean Sq F value Pr(>F)
## smoking_clean 3 301 100.38 87.79 <2e-16 ***
## Residuals 99996 114332 1.14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The one-way ANOVA shows that mean HbA1c differs across smoking groups (F = 87.79, p < 2e-16). This is an association and does not by itself show that smoking status causes the difference in HbA1c.
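With 100,000 rows, even small differences reach significance, so it may help to look at effect size as well. The sketch below computes eta squared from the sums of squares printed in the ANOVA table; the variable names are introduced here for illustration.

```r
# Sums of squares copied from the ANOVA table in this report.
ss_between <- 301
ss_residual <- 114332

# Eta squared: share of total HbA1c variance explained by smoking group.
eta_sq <- ss_between / (ss_between + ss_residual)
print(round(eta_sq, 4))
```

A value this small suggests that smoking group explains well under 1% of the variation in HbA1c, even though the group means differ significantly.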
##
## Pearson's Chi-squared test
##
## data: table(diab$diabetes, diab$smoking_clean)
## X-squared = 1732.7, df = 3, p-value < 2.2e-16
Smoking habits are linked to whether someone has diabetes.
In summary, people with diabetes tend to have higher HbA1c and blood glucose levels, are older, have a higher BMI, and are more likely to have hypertension or heart disease.
Converting the target variable to a factor
##
## 0 1
## 91500 8500
##
## 0 1
## 0.915 0.085
##
## 0 1
## 35335 34665
##
## 0 1
## 0.5047857 0.4952143
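The tables above show the class balance before and after resampling. The test-set matrices later in the report hold 27450 and 2550 cases, exactly 30% of each class, which suggests a stratified 70/30 split. The loaded packages suggest ROSE or smotefamily was used for balancing, but that is a guess. The sketch below swaps in plain random oversampling in base R on toy data to show the general idea.

```r
set.seed(42)
# Toy data with the same 8.5% positive rate as the real dataset.
n <- 2000
toy <- data.frame(x = rnorm(n),
                  diabetes = factor(rep(c(0, 1), times = c(1830, 170))))

# Stratified 70/30 split: sample 70% of the row indices within each class.
train_idx <- unlist(
  lapply(split(seq_len(n), toy$diabetes),
         function(idx) sample(idx, size = round(0.7 * length(idx)))),
  use.names = FALSE
)
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Random oversampling of the minority class, applied to the training set only
# so that the test set keeps the original class proportions.
minority <- train[train$diabetes == 1, ]
majority <- train[train$diabetes == 0, ]
extra <- minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
train_bal <- rbind(majority, extra)

print(table(train_bal$diabetes))
print(prop.table(table(test$diabetes)))
```

Balancing only the training data matters because metrics computed on a resampled test set would not reflect the real 8.5% prevalence.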
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Logistic regression (in this and the next three outputs the positive class is 0, non-diabetic, so the reported Sensitivity is the share of non-diabetic patients classified correctly and Specificity is the share of diabetic patients detected)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 24234 279
## 1 3216 2271
##
## Accuracy : 0.8835
## 95% CI : (0.8798, 0.8871)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.508
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8828
## Specificity : 0.8906
## Pos Pred Value : 0.9886
## Neg Pred Value : 0.4139
## Prevalence : 0.9150
## Detection Rate : 0.8078
## Detection Prevalence : 0.8171
## Balanced Accuracy : 0.8867
##
## 'Positive' Class : 0
##
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9646
Random forest
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 25398 324
## 1 2052 2226
##
## Accuracy : 0.9208
## 95% CI : (0.9177, 0.9238)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : 0.0001434
##
## Kappa : 0.6105
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9252
## Specificity : 0.8729
## Pos Pred Value : 0.9874
## Neg Pred Value : 0.5203
## Prevalence : 0.9150
## Detection Rate : 0.8466
## Detection Prevalence : 0.8574
## Balanced Accuracy : 0.8991
##
## 'Positive' Class : 0
##
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9707
XGBoost
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27345 785
## 1 105 1765
##
## Accuracy : 0.9703
## 95% CI : (0.9684, 0.9722)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.783
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9962
## Specificity : 0.6922
## Pos Pred Value : 0.9721
## Neg Pred Value : 0.9439
## Prevalence : 0.9150
## Detection Rate : 0.9115
## Detection Prevalence : 0.9377
## Balanced Accuracy : 0.8442
##
## 'Positive' Class : 0
##
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9767
LightGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27398 808
## 1 52 1742
##
## Accuracy : 0.9713
## 95% CI : (0.9694, 0.9732)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7871
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9981
## Specificity : 0.6831
## Pos Pred Value : 0.9714
## Neg Pred Value : 0.9710
## Prevalence : 0.9150
## Detection Rate : 0.9133
## Detection Prevalence : 0.9402
## Balanced Accuracy : 0.8406
##
## 'Positive' Class : 0
##
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.978
## Model AUC
## 1 Logistic 0.9646062
## 2 Random Forest 0.9706717
## 3 XGBoost 0.9766643
## 4 LightGBM 0.9780440
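The AUC values compared above can be read as the probability that a randomly chosen diabetic patient receives a higher predicted score than a randomly chosen non-diabetic one. A minimal rank-based sketch of that definition follows. `auc_rank` is a hypothetical helper, not the pROC function used in the report, and the labels and scores are invented.

```r
# AUC as the probability that a random positive outscores a random negative
# (ties count one half). Equivalent to the Mann-Whitney U statistic / (n1 * n0).
auc_rank <- function(labels, scores) {
  r <- rank(scores)                 # average ranks handle tied scores
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Tiny invented example: 3 positives, 3 negatives.
labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.9)
print(auc_rank(labels, scores))   # 6 of the 9 positive-negative pairs are ordered correctly
```

Because AUC depends only on the ranking of scores, it does not change with the classification threshold, which is why threshold choice is examined separately.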
The goal: to build a model that identifies patients at risk of diabetes. In medicine, missing a patient who has the disease (a false negative) is usually much worse than raising a false alarm (a false positive).
The strategy: we tested a range of classification thresholds on the predicted probabilities of the LightGBM model, this time with class 1 (diabetic) as the positive class. The default threshold is 0.5; lowering it to 0.3 raised sensitivity from 0.683 to 0.733, while specificity fell from 0.998 to 0.987 and accuracy from 0.971 to 0.965.
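The trade-off can be checked by hand from the counts in the threshold 0.5 and 0.3 confusion matrices printed in this report. The sketch below treats class 1 (diabetic) as positive; `metrics` and its argument names are introduced here for illustration.

```r
# Metrics for a 2x2 confusion matrix, with "positive" meaning diabetic (class 1).
# tp/fn/fp/tn are hypothetical argument names introduced for this sketch.
metrics <- function(tp, fn, fp, tn) {
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy = (tp + tn) / (tp + fn + fp + tn))
}

# Counts copied from the confusion matrices reported at thresholds 0.5 and 0.3.
at_050 <- metrics(tp = 1742, fn = 808, fp = 52, tn = 27398)
at_030 <- metrics(tp = 1869, fn = 681, fp = 365, tn = 27085)

print(round(rbind(`0.5` = at_050, `0.3` = at_030), 4))
```

Lowering the threshold finds 127 more diabetic patients (1869 versus 1742) at the cost of 313 more false alarms (365 versus 52).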
##
## Threshold: 0.3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27085 681
## 1 365 1869
##
## Accuracy : 0.9651
## 95% CI : (0.963, 0.9672)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7625
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.73294
## Specificity : 0.98670
## Pos Pred Value : 0.83662
## Neg Pred Value : 0.97547
## Prevalence : 0.08500
## Detection Rate : 0.06230
## Detection Prevalence : 0.07447
## Balanced Accuracy : 0.85982
##
## 'Positive' Class : 1
##
##
## Threshold: 0.35
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27216 731
## 1 234 1819
##
## Accuracy : 0.9678
## 95% CI : (0.9658, 0.9698)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7732
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.71333
## Specificity : 0.99148
## Pos Pred Value : 0.88602
## Neg Pred Value : 0.97384
## Prevalence : 0.08500
## Detection Rate : 0.06063
## Detection Prevalence : 0.06843
## Balanced Accuracy : 0.85240
##
## 'Positive' Class : 1
##
##
## Threshold: 0.4
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27299 757
## 1 151 1793
##
## Accuracy : 0.9697
## 95% CI : (0.9677, 0.9716)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7819
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.70314
## Specificity : 0.99450
## Pos Pred Value : 0.92233
## Neg Pred Value : 0.97302
## Prevalence : 0.08500
## Detection Rate : 0.05977
## Detection Prevalence : 0.06480
## Balanced Accuracy : 0.84882
##
## 'Positive' Class : 1
##
##
## Threshold: 0.45
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27353 788
## 1 97 1762
##
## Accuracy : 0.9705
## 95% CI : (0.9685, 0.9724)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7838
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.69098
## Specificity : 0.99647
## Pos Pred Value : 0.94782
## Neg Pred Value : 0.97200
## Prevalence : 0.08500
## Detection Rate : 0.05873
## Detection Prevalence : 0.06197
## Balanced Accuracy : 0.84372
##
## 'Positive' Class : 1
##
##
## Threshold: 0.5
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27398 808
## 1 52 1742
##
## Accuracy : 0.9713
## 95% CI : (0.9694, 0.9732)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7871
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.68314
## Specificity : 0.99811
## Pos Pred Value : 0.97101
## Neg Pred Value : 0.97135
## Prevalence : 0.08500
## Detection Rate : 0.05807
## Detection Prevalence : 0.05980
## Balanced Accuracy : 0.84062
##
## 'Positive' Class : 1
##
##
## Threshold: 0.55
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27419 824
## 1 31 1726
##
## Accuracy : 0.9715
## 95% CI : (0.9696, 0.9734)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7867
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.67686
## Specificity : 0.99887
## Pos Pred Value : 0.98236
## Neg Pred Value : 0.97082
## Prevalence : 0.08500
## Detection Rate : 0.05753
## Detection Prevalence : 0.05857
## Balanced Accuracy : 0.83787
##
## 'Positive' Class : 1
##
##
## Threshold: 0.6
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27430 838
## 1 20 1712
##
## Accuracy : 0.9714
## 95% CI : (0.9695, 0.9733)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7848
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.67137
## Specificity : 0.99927
## Pos Pred Value : 0.98845
## Neg Pred Value : 0.97036
## Prevalence : 0.08500
## Detection Rate : 0.05707
## Detection Prevalence : 0.05773
## Balanced Accuracy : 0.83532
##
## 'Positive' Class : 1
##
##
## Threshold: 0.65
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27440 844
## 1 10 1706
##
## Accuracy : 0.9715
## 95% CI : (0.9696, 0.9734)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7851
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.66902
## Specificity : 0.99964
## Pos Pred Value : 0.99417
## Neg Pred Value : 0.97016
## Prevalence : 0.08500
## Detection Rate : 0.05687
## Detection Prevalence : 0.05720
## Balanced Accuracy : 0.83433
##
## 'Positive' Class : 1
##
##
## Threshold: 0.7
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27446 848
## 1 4 1702
##
## Accuracy : 0.9716
## 95% CI : (0.9697, 0.9735)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7852
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.66745
## Specificity : 0.99985
## Pos Pred Value : 0.99766
## Neg Pred Value : 0.97003
## Prevalence : 0.08500
## Detection Rate : 0.05673
## Detection Prevalence : 0.05687
## Balanced Accuracy : 0.83365
##
## 'Positive' Class : 1
##
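Threshold tuning is the process of finding the cut-off on the probability scale that best balances the real-world costs of false negatives and false positives for a given application.
In a medical screening context, the threshold is usually lowered because missing a diagnosis is worse than a false alarm. In the results above, sensitivity rises steadily as the threshold drops, from 0.667 at 0.7 to 0.733 at 0.3.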
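One way to make the choice of threshold explicit is to assign a cost to each type of error and pick the threshold with the lowest total cost. The sketch below does this on simulated probabilities. The 5-to-1 cost ratio and the beta-distributed scores are assumptions for illustration only and are not taken from the report.

```r
set.seed(7)
# Simulated predicted probabilities: positives tend to score higher than negatives.
y <- rep(c(0, 1), times = c(900, 100))
p <- c(rbeta(900, 2, 8), rbeta(100, 6, 3))

# Hypothetical costs: a missed case is treated as five times worse than a false alarm.
cost_fn <- 5
cost_fp <- 1

thresholds <- seq(0.30, 0.70, by = 0.05)
results <- t(sapply(thresholds, function(th) {
  pred <- as.integer(p >= th)
  fn <- sum(pred == 0 & y == 1)   # diabetic cases missed
  fp <- sum(pred == 1 & y == 0)   # false alarms
  c(threshold = th,
    sensitivity = 1 - fn / sum(y == 1),
    specificity = 1 - fp / sum(y == 0),
    cost = cost_fn * fn + cost_fp * fp)
}))

# Threshold with the lowest total cost under the assumed cost ratio.
best <- results[which.min(results[, "cost"]), ]

print(round(results, 3))
print(best["threshold"])
```

The selected threshold depends entirely on the assumed cost ratio, so in practice that ratio would need to come from clinical judgment.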