Diabetes Prediction

Introduction

Diabetes is a chronic condition in which the body cannot properly regulate blood sugar (glucose) levels. This happens either because the body does not produce enough insulin or because it cannot effectively use the insulin it produces. Common symptoms include frequent urination, excessive thirst, unexplained weight loss, fatigue, blurred vision, and slow healing of wounds.

About the Dataset

The Diabetes prediction dataset is a collection of medical and demographic data from patients, along with their diabetes status (positive or negative). The data includes features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. This dataset can be used to build machine learning models to predict diabetes in patients based on their medical history and demographic information. This can be useful for healthcare professionals in identifying patients who may be at risk of developing diabetes and in developing personalized treatment plans. Additionally, the dataset can be used by researchers to explore the relationships between various medical and demographic factors and the likelihood of developing diabetes.

Explanation of Variable Names

Gender – the person’s recorded gender (Female, Male, or Other).

Age – how old the person is.

Hypertension – whether the person has high blood pressure.

Heart_disease – whether the person has any heart-related condition.

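Smoking_history / smoking_clean – the person’s smoking habit. Smoking_history holds the raw categories (never, No Info, current, former, ever, not current); smoking_clean regroups them into Smoker, Ex_Smoker, Non_Smoker, and No Info.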

BMI – a measure of body weight relative to height; used to check if someone is underweight, normal, or overweight.

HbA1c_level – average blood sugar level over the past few months.

Blood_glucose_level – the person’s current blood sugar level.

Diabetes – shows whether the person has diabetes or not.

AIM

To investigate the factors associated with diabetes and develop an effective model for predicting diabetes risk.

Research Questions

  1. Which factors are significantly associated with diabetes in the dataset?

  2. Which machine learning model provides the most accurate prediction of diabetes?

  3. What are the key drivers of diabetes based on the best-performing model?

Research Objectives

  1. To analyze the relationship between clinical and lifestyle variables and diabetes using statistical methods.

  2. To build and compare different machine learning models for diabetes prediction.

  3. To evaluate model performance and identify the most important features influencing diabetes prediction.

Loading Necessary Packages

The analysis uses the following packages: tidyverse 2.0.0 (dplyr, forcats, ggplot2, lubridate, purrr, readr, stringr, tibble, tidyr), plotly, caret (with lattice), randomForest 4.7-1.2, xgboost, lightgbm, pROC, smotefamily, and ROSE 0.0-4.
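A minimal sketch of a setup chunk for the packages named in this section. It checks which packages are installed and attaches them without startup messages. The names `pkgs` and `available` are illustrative and do not come from the original script.

```r
# Packages named in this report's setup step.
pkgs <- c("tidyverse", "plotly", "caret", "randomForest", "xgboost",
          "lightgbm", "pROC", "smotefamily", "ROSE")

# Check availability first so a missing package gives a clear message
# instead of a failure halfway through the analysis.
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

if (any(!available)) {
  message("Not installed: ", paste(pkgs[!available], collapse = ", "))
}

# Attach the installed packages without the startup chatter.
for (p in pkgs[available]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

In an R Markdown document, the chunk options `message = FALSE` and `warning = FALSE` have a similar effect on the rendered report.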

Importing dataset

## [1] "gender"              "age"                 "hypertension"       
## [4] "heart_disease"       "smoking_history"     "bmi"                
## [7] "HbA1c_level"         "blood_glucose_level" "diabetes"
##   gender age hypertension heart_disease smoking_history   bmi HbA1c_level
## 1 Female  80            0             1           never 25.19         6.6
## 2 Female  54            0             0         No Info 27.32         6.6
## 3   Male  28            0             0           never 27.32         5.7
## 4 Female  36            0             0         current 23.45         5.0
## 5   Male  76            1             1         current 20.14         4.8
## 6 Female  20            0             0           never 27.32         6.6
##   blood_glucose_level diabetes
## 1                 140        0
## 2                  80        0
## 3                 158        0
## 4                 155        0
## 5                 155        0
## 6                  85        0
##     gender               age         hypertension     heart_disease    
##  Length:100000      Min.   : 0.08   Min.   :0.00000   Min.   :0.00000  
##  Class :character   1st Qu.:24.00   1st Qu.:0.00000   1st Qu.:0.00000  
##  Mode  :character   Median :43.00   Median :0.00000   Median :0.00000  
##                     Mean   :41.89   Mean   :0.07485   Mean   :0.03942  
##                     3rd Qu.:60.00   3rd Qu.:0.00000   3rd Qu.:0.00000  
##                     Max.   :80.00   Max.   :1.00000   Max.   :1.00000  
##  smoking_history         bmi         HbA1c_level    blood_glucose_level
##  Length:100000      Min.   :10.01   Min.   :3.500   Min.   : 80.0      
##  Class :character   1st Qu.:23.63   1st Qu.:4.800   1st Qu.:100.0      
##  Mode  :character   Median :27.32   Median :5.800   Median :140.0      
##                     Mean   :27.32   Mean   :5.528   Mean   :138.1      
##                     3rd Qu.:29.58   3rd Qu.:6.200   3rd Qu.:159.0      
##                     Max.   :95.69   Max.   :9.000   Max.   :300.0      
##     diabetes    
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.085  
##  3rd Qu.:0.000  
##  Max.   :1.000
## [1] 100000      9
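The output above has the shape of what `names()`, `head()`, `summary()` and `dim()` print for the imported data frame. A minimal sketch of that step follows. The report does not give the file name, so `diabetes_prediction_dataset.csv` is a hypothetical path; when the file is absent the block falls back to the six rows shown by `head()` so that it runs on its own.

```r
# Hypothetical path: the report does not show the actual file name.
csv_path <- "diabetes_prediction_dataset.csv"

if (file.exists(csv_path)) {
  diab <- read.csv(csv_path, stringsAsFactors = FALSE)
} else {
  # Fallback: the first six rows printed by head() in this report.
  diab <- data.frame(
    gender = c("Female", "Female", "Male", "Female", "Male", "Female"),
    age = c(80, 54, 28, 36, 76, 20),
    hypertension = c(0, 0, 0, 0, 1, 0),
    heart_disease = c(1, 0, 0, 0, 1, 0),
    smoking_history = c("never", "No Info", "never", "current", "current", "never"),
    bmi = c(25.19, 27.32, 27.32, 23.45, 20.14, 27.32),
    HbA1c_level = c(6.6, 6.6, 5.7, 5.0, 4.8, 6.6),
    blood_glucose_level = c(140, 80, 158, 155, 155, 85),
    diabetes = c(0, 0, 0, 0, 0, 0),
    stringsAsFactors = FALSE
  )
}

names(diab)    # column names
head(diab)     # first rows
summary(diab)  # per-column summaries
dim(diab)      # number of rows and columns
```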

Checking for missing values

##              gender                 age        hypertension       heart_disease 
##                   0                   0                   0                   0 
##     smoking_history                 bmi         HbA1c_level blood_glucose_level 
##                   0                   0                   0                   0 
##            diabetes 
##                   0
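A per-column count like the one above is typically produced with `colSums(is.na(...))`. The sketch below shows the idea on a toy data frame (illustrative values only). It also counts the `"No Info"` label separately, because a text placeholder of that kind is not `NA` and so is not picked up by `is.na()`.

```r
# Toy data frame with deliberately missing entries (illustrative only).
toy <- data.frame(
  age = c(80, NA, 28),
  bmi = c(25.19, 27.32, NA),
  smoking_history = c("never", "No Info", "current"),
  stringsAsFactors = FALSE
)

# Missing values per column; the report's data returned 0 for every column.
na_counts <- colSums(is.na(toy))
na_counts

# Placeholder labels are ordinary strings, so they need their own check.
no_info <- sum(toy$smoking_history == "No Info")
no_info
```

In the report's data the `No Info` category covers 35,816 of the 100,000 rows, so the smoking variable is effectively unknown for roughly a third of patients even though the `NA` count is zero.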

Checking the distribution of the categorical variables

## 
##     current        ever      former       never     No Info not current 
##        9286        4004        9352       35095       35816        6447
## 
## Female   Male  Other 
##  58552  41430     18

Regrouping the smoking history column for clarity

##   gender age hypertension heart_disease smoking_history   bmi HbA1c_level
## 1 Female  80            0             1           never 25.19         6.6
## 2 Female  54            0             0         No Info 27.32         6.6
## 3   Male  28            0             0           never 27.32         5.7
## 4 Female  36            0             0         current 23.45         5.0
## 5   Male  76            1             1         current 20.14         4.8
## 6 Female  20            0             0           never 27.32         6.6
##   blood_glucose_level diabetes smoking_clean
## 1                 140        0    Non_Smoker
## 2                  80        0       No Info
## 3                 158        0    Non_Smoker
## 4                 155        0        Smoker
## 5                 155        0        Smoker
## 6                  85        0    Non_Smoker
## 
##  Ex_Smoker    No Info Non_Smoker     Smoker 
##      19803      35816      35095       9286
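The regrouping rule is not printed, but the counts are consistent with one mapping: `ever`, `former` and `not current` (4004 + 9352 + 6447 = 19803) become `Ex_Smoker`, `current` becomes `Smoker`, `never` becomes `Non_Smoker`, and `No Info` is kept. The sketch below treats that inferred mapping as an assumption and checks it against the reported totals; `lookup` and `orig_counts` are illustrative names.

```r
# Counts of the original smoking_history levels, as printed in this report.
orig_counts <- c(current = 9286, ever = 4004, former = 9352,
                 never = 35095, "No Info" = 35816, "not current" = 6447)

# Rebuild a column with those frequencies (row order is arbitrary here).
smoking_history <- rep(names(orig_counts), times = orig_counts)

# Mapping inferred from the regrouped counts (an assumption, see above).
lookup <- c(current = "Smoker",
            ever = "Ex_Smoker", former = "Ex_Smoker", "not current" = "Ex_Smoker",
            never = "Non_Smoker",
            "No Info" = "No Info")

# Named-vector lookup: each raw label is replaced by its group.
smoking_clean <- unname(lookup[smoking_history])
table(smoking_clean)
```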

Unique values of Age

##   [1] 80.00 54.00 28.00 36.00 76.00 20.00 44.00 79.00 42.00 32.00 53.00 78.00
##  [13] 67.00 15.00 37.00 40.00  5.00 69.00 72.00  4.00 30.00 45.00 43.00 50.00
##  [25] 41.00 26.00 34.00 73.00 77.00 66.00 29.00 60.00 38.00  3.00 57.00 74.00
##  [37] 19.00 46.00 21.00 59.00 27.00 13.00 56.00  2.00  7.00 11.00  6.00 55.00
##  [49]  9.00 62.00 47.00 12.00 68.00 75.00 22.00 58.00 18.00 24.00 17.00 25.00
##  [61]  0.08 33.00 16.00 61.00 31.00  8.00 49.00 39.00 65.00 14.00 70.00  0.56
##  [73] 48.00 51.00 71.00  0.88 64.00 63.00 52.00  0.16 10.00 35.00 23.00  0.64
##  [85]  1.16  1.64  0.72  1.88  1.32  0.80  1.24  1.00  1.80  0.48  1.56  1.08
##  [97]  0.24  1.40  0.40  0.32  1.72  1.48

Exploratory and Statistical Analysis

Which factors are related to diabetes?

Converting some variables to factors

Count of smoking history


Gender vs Smoking

##         
##          Ex_Smoker No Info Non_Smoker Smoker
##   Female     10925   19700      22869   5058
##   Male        8869   16110      12223   4228
##   Other          9       6          3      0

Plot

Hypothesis Testing

## 
##  Welch Two Sample t-test
## 
## data:  HbA1c_level by diabetes
## t = -127.01, df = 9828.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -1.561932 -1.514453
## sample estimates:
## mean in group 0 mean in group 1 
##        5.396761        6.934953

People with diabetes have a much higher mean HbA1c (6.93 versus 5.40), and the difference is highly significant (p < 2.2e-16). HbA1c is therefore strongly related to diabetes mellitus: people with diabetes have higher long-term blood sugar levels.
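The Welch tests in this section compare the mean of one numeric variable between the two diabetes groups. A minimal sketch of such a call, run on simulated data rather than the report's dataset (the group sizes, means and standard deviations below are assumptions chosen only for illustration):

```r
set.seed(42)

# Simulated stand-in for the real data (NOT the report's dataset):
# 500 non-diabetic and 500 diabetic patients with different mean HbA1c.
sim <- data.frame(
  diabetes = factor(rep(c(0, 1), each = 500)),
  HbA1c_level = c(rnorm(500, mean = 5.4, sd = 1.0),
                  rnorm(500, mean = 6.9, sd = 1.1))
)

# t.test() defaults to var.equal = FALSE, which is the Welch test.
res <- t.test(HbA1c_level ~ diabetes, data = sim)
res

# Difference in group means (group 0 minus group 1) and its 95% CI.
unname(res$estimate[1] - res$estimate[2])
res$conf.int
```

The same pattern applies to `blood_glucose_level`, `bmi` and `age` by changing the left-hand side of the formula. The negative t statistics in the report arise because the difference is taken as group 0 minus group 1.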

## 
##  Welch Two Sample t-test
## 
## data:  blood_glucose_level by diabetes
## t = -94.795, df = 9045.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -62.50864 -59.97583
## sample estimates:
## mean in group 0 mean in group 1 
##        132.8525        194.0947

Patients with diabetes have significantly higher blood glucose levels than non-diabetic patients (mean 194.1 versus 132.9).

## 
##  Welch Two Sample t-test
## 
## data:  bmi by diabetes
## t = -60.265, df = 9654.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -5.267143 -4.935294
## sample estimates:
## mean in group 0 mean in group 1 
##        26.88716        31.98838

People with diabetes have a higher mean BMI (31.99 versus 26.89), so higher body weight relative to height is associated with diabetes.

## 
##  Welch Two Sample t-test
## 
## data:  age by diabetes
## t = -119.59, df = 12560, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -21.17285 -20.48995
## sample estimates:
## mean in group 0 mean in group 1 
##        40.11519        60.94659

People with diabetes are much older on average (mean age 60.9 versus 40.1 years), so diabetes is far more common among older people.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(diab$hypertension, diab$diabetes)
## X-squared = 3910.7, df = 1, p-value < 2.2e-16

People with high blood pressure are more likely to have diabetes.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(diab$heart_disease, diab$diabetes)
## X-squared = 2945.8, df = 1, p-value < 2.2e-16

People with heart problems are more likely to also have diabetes.
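A chi-squared test shows that two categorical variables are associated, but the statistic alone does not show the direction. Row-wise proportions fill that gap. The sketch below uses illustrative counts that are not taken from the dataset; with the real data the table would come from `table(diab$hypertension, diab$diabetes)`, as in the test output.

```r
# Illustrative 2 x 2 counts (NOT from the report's data):
# rows = hypertension (0/1), columns = diabetes (0/1).
tab <- matrix(c(850, 50,
                 65, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(hypertension = c("0", "1"),
                              diabetes = c("0", "1")))

# For 2 x 2 tables chisq.test() applies Yates' continuity correction by default.
res <- chisq.test(tab)
res

# Share of diabetic patients within each hypertension group.
row_props <- prop.table(tab, margin = 1)
row_props
```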

##                  Df Sum Sq Mean Sq F value Pr(>F)    
## smoking_clean     3    301  100.38   87.79 <2e-16 ***
## Residuals     99996 114332    1.14                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This indicates that mean HbA1c differs across the smoking groups. The test shows an association between smoking status and HbA1c level; on its own it does not show that smoking changes HbA1c.
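With 100,000 rows, even a very small difference between groups produces a tiny p-value, so it helps to look at effect size as well. One common measure is eta squared, the share of total variation explained by the grouping factor. The sketch below computes it from the sums of squares printed in the ANOVA table; those printed values are rounded, so the result is approximate.

```r
# Sums of squares as printed in the ANOVA table (rounded in the output).
ss_between <- 301      # smoking_clean
ss_residual <- 114332  # Residuals

# Eta squared: share of total variation attributable to the grouping factor.
eta_sq <- ss_between / (ss_between + ss_residual)
round(eta_sq, 4)  # 0.0026
```

On these numbers smoking group accounts for roughly 0.3% of the variation in HbA1c, so the difference is statistically detectable but small.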

## 
##  Pearson's Chi-squared test
## 
## data:  table(diab$diabetes, diab$smoking_clean)
## X-squared = 1732.7, df = 3, p-value < 2.2e-16

Smoking habits are linked to whether someone has diabetes.

In summary, people with diabetes tend to have higher blood sugar, are older, often have a higher BMI, and are more likely to have conditions such as high blood pressure or heart disease.

Machine Learning

Converting the target to a factor

## 
##     0     1 
## 91500  8500
## 
##     0     1 
## 0.915 0.085
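The two tables above are the class counts and class proportions of the target. A minimal sketch that rebuilds them from the printed counts, so no data file is needed:

```r
# Rebuild the target from the counts printed above: 91,500 zeros, 8,500 ones.
diabetes <- factor(rep(c(0, 1), times = c(91500, 8500)), levels = c(0, 1))

class_counts <- table(diabetes)
class_props <- prop.table(class_counts)
class_counts
class_props

# Accuracy of always predicting the majority class.
max(class_props)
```

A model that always predicts class 0 would already be 91.5% accurate, which is the same figure as the No Information Rate in the confusion-matrix summaries. Accuracy alone is therefore a weak yardstick for this target.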

Target and feature engineering first (before creating dummy variables)

Removing the target and splitting the data
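The split call is not shown, but the test-set confusion matrices contain 27,450 non-diabetic and 2,550 diabetic patients, which is exactly 30% of each class. That is consistent with a stratified 70/30 split, such as the one caret's `createDataPartition()` produces. The base R sketch below illustrates the idea under that assumption; the seed and the variable names are illustrative.

```r
set.seed(123)

# Target rebuilt from the class counts in this report.
y <- rep(c(0, 1), times = c(91500, 8500))

# Stratified split: sample 70% of the row indices within each class.
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))
test_idx <- setdiff(seq_along(y), train_idx)

table(y[train_idx])  # 64050 zeros and 5950 ones
table(y[test_idx])   # 27450 zeros and 2550 ones
```

Stratifying keeps the 8.5% prevalence identical in both parts, which matters when the positive class is this rare.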

One-hot encoding
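One-hot encoding turns each level of a categorical variable into its own 0/1 column. The reference to dummy variables in this part of the report suggests caret's `dummyVars()`, but the actual call is not shown. The sketch below uses base R's `model.matrix()` on a toy data frame instead and should be read as one possible approach, not the report's code.

```r
# Toy predictors using levels that appear in this report.
toy <- data.frame(
  gender = factor(c("Female", "Male", "Female", "Male")),
  smoking_clean = factor(c("Non_Smoker", "No Info", "Smoker", "Smoker")),
  age = c(80, 54, 28, 36)
)

# Identity contrasts keep one indicator column per level (full one-hot).
full_contrasts <- lapply(toy[c("gender", "smoking_clean")],
                         contrasts, contrasts = FALSE)

# "- 1" drops the intercept so only indicator and numeric columns remain.
x <- model.matrix(~ . - 1, data = toy, contrasts.arg = full_contrasts)
colnames(x)
```

Encoding after the split, with the column layout defined on the training data and reused for the test data, keeps the two matrices aligned and avoids using test-set information during preparation.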

Handling class imbalance

## 
##     0     1 
## 35335 34665
## 
##         0         1 
## 0.5047857 0.4952143
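The balanced set has 70,000 rows (35,335 + 34,665), the same size as a 70% training split, with a near-even but not exact class ratio. That pattern is consistent with combined over- and under-sampling of the training data, for example ROSE's `ovun.sample()` with `method = "both"`, although the exact call is not shown. The base R sketch below illustrates the idea on toy data; every name and number in it is an assumption made for illustration.

```r
set.seed(2024)

# Toy imbalanced training set (illustrative): 915 negatives, 85 positives.
train <- data.frame(
  x = rnorm(1000),
  diabetes = factor(rep(c(0, 1), times = c(915, 85)))
)

n <- nrow(train)

# Decide how many rows of the balanced set come from the minority class;
# a binomial draw gives roughly, not exactly, half.
n_pos <- rbinom(1, size = n, prob = 0.5)
n_neg <- n - n_pos

pos_rows <- which(train$diabetes == 1)
neg_rows <- which(train$diabetes == 0)

# Positives are oversampled (repeats are needed); negatives are thinned out.
balanced <- train[c(sample(pos_rows, n_pos, replace = TRUE),
                    sample(neg_rows, n_neg, replace = TRUE)), ]

table(balanced$diabetes)
prop.table(table(balanced$diabetes))
```

Only the training data should be rebalanced. The test-set summaries later in the report still show a prevalence of 0.085, so evaluation reflects the original class mix.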

Logistic Regression

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 24234   279
##          1  3216  2271
##                                           
##                Accuracy : 0.8835          
##                  95% CI : (0.8798, 0.8871)
##     No Information Rate : 0.915           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.508           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8828          
##             Specificity : 0.8906          
##          Pos Pred Value : 0.9886          
##          Neg Pred Value : 0.4139          
##              Prevalence : 0.9150          
##          Detection Rate : 0.8078          
##    Detection Prevalence : 0.8171          
##       Balanced Accuracy : 0.8867          
##                                           
##        'Positive' Class : 0               
## 
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9646
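A minimal sketch of the logistic regression workflow: fit, predict probabilities, apply a 0.5 cut-off, and tabulate predictions against the reference. The data are simulated and the coefficients used to generate labels are assumptions, so the numbers will not match the report. The `glm.fit` warning above usually means some patients are predicted with near certainty; it does not by itself invalidate the predictions, though coefficient estimates can be less stable in that situation.

```r
set.seed(1)

# Simulated training and test data (NOT the report's dataset).
make_data <- function(n) {
  hba1c <- rnorm(n, mean = 5.5, sd = 1)
  glucose <- rnorm(n, mean = 140, sd = 40)
  # Assumed log-odds, used only to generate labels for this sketch.
  log_odds <- -14 + 1.5 * hba1c + 0.03 * glucose
  data.frame(HbA1c_level = hba1c,
             blood_glucose_level = glucose,
             diabetes = factor(rbinom(n, 1, plogis(log_odds)), levels = c(0, 1)))
}
train <- make_data(2000)
test <- make_data(1000)

# Logistic regression: binomial GLM with the default logit link.
fit <- glm(diabetes ~ HbA1c_level + blood_glucose_level,
           data = train, family = binomial)

# Predicted probability of class 1, then a 0.5 cut-off.
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(as.integer(prob >= 0.5), levels = c(0, 1))

# Rows = prediction, columns = reference, as in the caret output.
cm <- table(Prediction = pred, Reference = test$diabetes)
cm
sum(diag(cm)) / sum(cm)  # accuracy
```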

Random forest

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 25398   324
##          1  2052  2226
##                                           
##                Accuracy : 0.9208          
##                  95% CI : (0.9177, 0.9238)
##     No Information Rate : 0.915           
##     P-Value [Acc > NIR] : 0.0001434       
##                                           
##                   Kappa : 0.6105          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9252          
##             Specificity : 0.8729          
##          Pos Pred Value : 0.9874          
##          Neg Pred Value : 0.5203          
##              Prevalence : 0.9150          
##          Detection Rate : 0.8466          
##    Detection Prevalence : 0.8574          
##       Balanced Accuracy : 0.8991          
##                                           
##        'Positive' Class : 0               
## 
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9707

XGBoost

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 27345   785
##          1   105  1765
##                                           
##                Accuracy : 0.9703          
##                  95% CI : (0.9684, 0.9722)
##     No Information Rate : 0.915           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.783           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9962          
##             Specificity : 0.6922          
##          Pos Pred Value : 0.9721          
##          Neg Pred Value : 0.9439          
##              Prevalence : 0.9150          
##          Detection Rate : 0.9115          
##    Detection Prevalence : 0.9377          
##       Balanced Accuracy : 0.8442          
##                                           
##        'Positive' Class : 0               
## 
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.9767

LightGBM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 27398   808
##          1    52  1742
##                                           
##                Accuracy : 0.9713          
##                  95% CI : (0.9694, 0.9732)
##     No Information Rate : 0.915           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7871          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9981          
##             Specificity : 0.6831          
##          Pos Pred Value : 0.9714          
##          Neg Pred Value : 0.9710          
##              Prevalence : 0.9150          
##          Detection Rate : 0.9133          
##    Detection Prevalence : 0.9402          
##       Balanced Accuracy : 0.8406          
##                                           
##        'Positive' Class : 0               
## 
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.978
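The four summaries above treat class 0 (no diabetes) as the positive class, so the row labelled Sensitivity describes how well non-diabetic patients are recognised. For a screening task the diabetic class is the one of interest. The sketch below recomputes recall and precision for class 1 from the confusion-matrix cells printed above; only the column names are new.

```r
# Confusion-matrix cells from the four caret summaries (threshold 0.5).
# tp = predicted 1 and actually 1, fn = predicted 0 but actually 1,
# fp = predicted 1 but actually 0.
cm <- data.frame(
  model = c("Logistic", "Random Forest", "XGBoost", "LightGBM"),
  tp = c(2271, 2226, 1765, 1742),
  fn = c(279, 324, 785, 808),
  fp = c(3216, 2052, 105, 52)
)

# Recall and precision with diabetes (class 1) as the positive class.
cm$recall_1 <- round(cm$tp / (cm$tp + cm$fn), 4)
cm$precision_1 <- round(cm$tp / (cm$tp + cm$fp), 4)
cm
```

These values equal the Specificity and Neg Pred Value rows in the summaries. At the 0.5 cut-off, logistic regression and random forest find about 88% of diabetic patients but raise many false alarms, while XGBoost and LightGBM find about 69% with very few false alarms.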

Model Comparison

##           Model       AUC
## 1      Logistic 0.9646062
## 2 Random Forest 0.9706717
## 3       XGBoost 0.9766643
## 4      LightGBM 0.9780440
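AUC can be read as the probability that a randomly chosen diabetic patient receives a higher predicted score than a randomly chosen non-diabetic patient. The report's messages suggest the values were computed with pROC; the sketch below is an independent base R version of the same quantity using ranks, shown on a tiny hand-checkable example with made-up scores.

```r
# AUC as the probability that a randomly chosen positive case receives
# a higher score than a randomly chosen negative case (ties count half).
auc_rank <- function(score, label) {
  r <- rank(score)                 # average ranks handle ties
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny example (illustrative scores, not model output).
label <- c(0, 0, 1, 0, 1, 1)
score <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.90)
auc_rank(score, label)  # 8 of the 9 positive-negative pairs are ordered correctly
```

Because AUC does not depend on any single cut-off, models with very similar AUC values can still behave quite differently once a specific threshold such as 0.5 is applied.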

The plot

The Goal: Our goal was to build a model that identifies patients at risk of diabetes. In medicine, missing a sick person (a false negative) is usually much worse than a false alarm (a false positive).

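The Strategy: We tested different “sensitivity settings” (thresholds) on the predicted probabilities; the 0.5 results below match the LightGBM model. The default threshold is 0.5. Lowering it to 0.3 raised sensitivity for the diabetic class from 0.683 to 0.733, while false alarms rose from 52 to 365 and specificity fell from 0.998 to 0.987.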
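A minimal sketch of a threshold sweep of this kind, treating class 1 as the positive class. The labels and probabilities are simulated, so the numbers differ from the output that follows; `metrics_at` and `results` are illustrative names. In caret, the same choice of positive class is made with the `positive` argument of `confusionMatrix()`.

```r
set.seed(7)

# Simulated labels and predicted probabilities (NOT the report's model output).
actual <- rbinom(2000, 1, 0.085)
prob <- ifelse(actual == 1, rbeta(2000, 4, 2), rbeta(2000, 1, 6))

# Sensitivity and specificity at one threshold, with class 1 as positive.
metrics_at <- function(thr) {
  pred <- as.integer(prob >= thr)
  c(threshold = thr,
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0))
}

# One row per threshold from 0.30 to 0.70 in steps of 0.05.
results <- t(sapply(seq(0.30, 0.70, by = 0.05), metrics_at))
round(results, 3)
```

Raising the threshold can only move predictions from class 1 to class 0, so sensitivity never increases and specificity never decreases along the sweep.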

## 
## Threshold: 0.3 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 27085   681
##          1   365  1869
##                                          
##                Accuracy : 0.9651         
##                  95% CI : (0.963, 0.9672)
##     No Information Rate : 0.915          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7625         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.73294        
##             Specificity : 0.98670        
##          Pos Pred Value : 0.83662        
##          Neg Pred Value : 0.97547        
##              Prevalence : 0.08500        
##          Detection Rate : 0.06230        
##    Detection Prevalence : 0.07447        
##       Balanced Accuracy : 0.85982        
##                                          
##        'Positive' Class : 1              
##                                          
## 
## Threshold: 0.35 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 27216   731
##          1   234  1819
##                                           
##                Accuracy : 0.9678          
##                  95% CI : (0.9658, 0.9698)
##     No Information Rate : 0.915           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7732          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.71333         
##             Specificity : 0.99148         
##          Pos Pred Value : 0.88602         
##          Neg Pred Value : 0.97384         
##              Prevalence : 0.08500         
##          Detection Rate : 0.06063         
##    Detection Prevalence : 0.06843         
##       Balanced Accuracy : 0.85240         
##                                           
##        'Positive' Class : 1               
##                                           
## 
## Threshold: 0.4 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 27299   757
##          1   151  1793
##                                           
##                Accuracy : 0.9697          
##                  95% CI : (0.9677, 0.9716)
##     No Information Rate : 0.915           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7819          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.70314         
##             Specificity : 0.99450         
##          Pos Pred Value : 0.92233         
##          Neg Pred Value : 0.97302         
##              Prevalence : 0.08500         
##          Detection Rate : 0.05977         
##    Detection Prevalence : 0.06480         
##       Balanced Accuracy : 0.84882         
##                                           
##        'Positive' Class : 1               
##                                           
## 
## Threshold: 0.45 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 27353   788
##          1    97  1762
##                                           
##                Accuracy : 0.9705          
##                  95% CI : (0.9685, 0.9724)
##     No Information Rate : 0.915           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7838          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.69098         
##             Specificity : 0.99647         
##          Pos Pred Value : 0.94782         
##          Neg Pred Value : 0.97200         
##              Prevalence : 0.08500         
##          Detection Rate : 0.05873         
##    Detection Prevalence : 0.06197         
##       Balanced Accuracy : 0.84372         
##                                           
##        'Positive' Class : 1               
##                                           
## 
## Threshold: 0.5 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 27398   808
##          1    52  1742
##                                           
##                Accuracy : 0.9713          
##                  95% CI : (0.9694, 0.9732)
##     No Information Rate : 0.915           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7871          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.68314         
##             Specificity : 0.99811         
##          Pos Pred Value : 0.97101         
##          Neg Pred Value : 0.97135         
##              Prevalence : 0.08500         
##          Detection Rate : 0.05807         
##    Detection Prevalence : 0.05980         
##       Balanced Accuracy : 0.84062         
##                                           
##        'Positive' Class : 1               
##                                           
## 
## Threshold: 0.55 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 27419   824
##          1    31  1726
##                                           
##                Accuracy : 0.9715          
##                  95% CI : (0.9696, 0.9734)
##     No Information Rate : 0.915           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7867          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.67686         
##             Specificity : 0.99887         
##          Pos Pred Value : 0.98236         
##          Neg Pred Value : 0.97082         
##              Prevalence : 0.08500         
##          Detection Rate : 0.05753         
##    Detection Prevalence : 0.05857         
##       Balanced Accuracy : 0.83787         
##                                           
##        'Positive' Class : 1               
##                                           
## 
## Threshold: 0.6 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 27430   838
##          1    20  1712
##                                           
##                Accuracy : 0.9714          
##                  95% CI : (0.9695, 0.9733)
##     No Information Rate : 0.915           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7848          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.67137         
##             Specificity : 0.99927         
##          Pos Pred Value : 0.98845         
##          Neg Pred Value : 0.97036         
##              Prevalence : 0.08500         
##          Detection Rate : 0.05707         
##    Detection Prevalence : 0.05773         
##       Balanced Accuracy : 0.83532         
##                                           
##        'Positive' Class : 1               
##                                           
## 
## Threshold: 0.65 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 27440   844
##          1    10  1706
##                                           
##                Accuracy : 0.9715          
##                  95% CI : (0.9696, 0.9734)
##     No Information Rate : 0.915           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7851          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.66902         
##             Specificity : 0.99964         
##          Pos Pred Value : 0.99417         
##          Neg Pred Value : 0.97016         
##              Prevalence : 0.08500         
##          Detection Rate : 0.05687         
##    Detection Prevalence : 0.05720         
##       Balanced Accuracy : 0.83433         
##                                           
##        'Positive' Class : 1               
##                                           
## 
## Threshold: 0.7 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 27446   848
##          1     4  1702
##                                           
##                Accuracy : 0.9716          
##                  95% CI : (0.9697, 0.9735)
##     No Information Rate : 0.915           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7852          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.66745         
##             Specificity : 0.99985         
##          Pos Pred Value : 0.99766         
##          Neg Pred Value : 0.97003         
##              Prevalence : 0.08500         
##          Detection Rate : 0.05673         
##    Detection Prevalence : 0.05687         
##       Balanced Accuracy : 0.83365         
##                                           
##        'Positive' Class : 1               
## 

Threshold tuning is the process of finding the “sweet spot” on the probability scale that minimizes the specific real-world risks of the project.

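In a medical context, we usually lower the threshold because missing a diagnosis is worse than a false alarm. In the results above, sensitivity for the diabetic class falls from 0.733 at a threshold of 0.3 to 0.667 at 0.7, while specificity stays above 0.98 throughout.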
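One way to make the “sweet spot” concrete is to score each threshold with a metric that weights recall more heavily than precision. The sketch below uses the confusion-matrix cells printed above and an F-beta score with beta = 2. The choice of metric and of beta is an assumption made for illustration, not something the report specifies.

```r
# Cells read from the confusion matrices above (class 1 = diabetes).
thr <- seq(0.30, 0.70, by = 0.05)
tp <- c(1869, 1819, 1793, 1762, 1742, 1726, 1712, 1706, 1702)
fn <- c(681, 731, 757, 788, 808, 824, 838, 844, 848)
fp <- c(365, 234, 151, 97, 52, 31, 20, 10, 4)

recall <- tp / (tp + fn)
precision <- tp / (tp + fp)

# F-beta with beta = 2 weights recall more heavily than precision
# (the choice of beta is an assumption made for this sketch).
beta <- 2
f2 <- (1 + beta^2) * precision * recall / (beta^2 * precision + recall)

data.frame(threshold = thr,
           recall = round(recall, 4),
           precision = round(precision, 4),
           f2 = round(f2, 4))

# Threshold with the highest F2 among those tested.
thr[which.max(f2)]
```

Among the tested values the lowest threshold, 0.3, scores best. Even there, 681 of the 2,550 diabetic patients in the test set are still missed, and thresholds below 0.3 were not evaluated, so the best cut-off under this criterion may lie lower.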

Feature Importance in LightGBM

Feature Importance in Random Forest