02 Regression comparison v3

Comparing traditional and machine learning models for Early Childhood Caries risk factors

Author

Sergio Uribe

CREATED

July 25, 2025

UPDATED

August 29, 2025

Data preparation

[1] "/home/sergiouribe/Insync/sergio.uribe@gmail.com/Google Drive/Research Drive/2025_EPIDENTLATVIA_01_caries_risk"

Data check

Step 1: Importing and validating ECC risk factors dataset...
=== DATA QUALITY ASSESSMENT REPORT ===

Sample Characteristics:
Total observations: 237 
Total variables: 15 

Outcome Distribution:

No_Caries    Caries 
      109       128 
Caries prevalence: 54 %

Missing Data Summary:
# A tibble: 3 × 3
  variable                 n_miss pct_miss
  <chr>                     <int>    <num>
1 breastfeeding_duration       42    17.7 
2 parental_brushing_factor     26    11.0 
3 mouth_breathing_factor       13     5.49

Age Distribution:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   3.000   3.000   3.443   4.000   5.000 
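
The quality checks summarised above can be reproduced with a few tidyverse and naniar calls. The sketch below is illustrative only: the file name and the ecc_raw / caries_status names are placeholders, not the objects used in the source code.

library(tidyverse)
library(naniar)

# Import the raw ECC risk-factor dataset (file name is a placeholder)
ecc_raw <- readr::read_csv("ecc_risk_factors.csv")

# Outcome distribution and caries prevalence
table(ecc_raw$caries_status)
round(100 * mean(ecc_raw$caries_status == "Caries"))

# Missing-data summary, restricted to variables with any missingness
miss_var_summary(ecc_raw) |>
  filter(n_miss > 0)

# Age distribution
summary(ecc_raw$age)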

Data for analysis

Reorder the factor levels and check the baseline (reference) categories

Final sample size: 209 
ECC prevalence: 48.8 %
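
A minimal sketch of this step, assuming listwise deletion and the reference categories reported in the regression tables below; the exact missing-data handling and variable names in the source code may differ.

ecc <- ecc_raw |>
  drop_na() |>                                   # complete-case analysis set (illustrative)
  mutate(
    # outcome recoded to No / Yes, with "No" as the baseline level
    ecc = fct_relevel(factor(ecc), "No"),
    # reference categories as reported in the tables below
    parental_brushing       = fct_relevel(parental_brushing, "Every night"),
    toothbrushing_frequency = fct_relevel(toothbrushing_frequency, "Twice daily and more"),
    toothpaste              = fct_relevel(toothpaste, "Fluoride")
  )

nrow(ecc)                                        # final analysis sample
round(100 * mean(ecc$ecc == "Yes"), 1)           # ECC prevalence, %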

Traditional Statistical Modeling

Standard Logistic Regression

Table 1. Traditional Logistic Regression: Risk Factors for Early Childhood Caries
Characteristic                     OR      95% CI        p-value
age                                0.67    0.49, 0.89    0.007
gender
    Boy (reference)
    Girl                           0.72    0.33, 1.57    0.4
parental_brushing
    Every night (reference)
    Occasionally                   1.14    0.47, 2.69    0.8
    Seldom or never                5.13    1.73, 16.4    0.004
plaque
    No (reference)
    Yes                            9.14    4.08, 21.9    <0.001
toothbrushing_frequency
    Twice daily and more (reference)
    In evenings                    3.49    1.47, 8.57    0.005
    Less than daily or never       3.91    1.25, 13.4    0.023
toothpaste
    Fluoride (reference)
    Fluoride Free                  0.04    0.00, 0.30    0.007
    Low Fluoride                   0.96    0.42, 2.11    >0.9
sugar_liquid
    No (reference)
    Yes                            0.52    0.21, 1.20    0.13
sugar_solid
    No (reference)
    Yes                            0.75    0.27, 2.09    0.6
bottle_feeding
    No (reference)
    Yes                            0.79    0.32, 1.91    0.6
breastfeeding_months               1.06    1.01, 1.12    0.023
mouth_breathing
    No (reference)
    Yes                            0.89    0.32, 2.45    0.8
No. Obs.                           209
Log-likelihood                     -90.5
AIC                                211
BIC                                261

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
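
For reference, Table 1 can be produced with glm() and gtsummary; the formula below mirrors the rows of the table, but the variable names and reporting options in the source code may differ.

library(gtsummary)

m_logistic <- glm(
  ecc ~ age + gender + parental_brushing + plaque + toothbrushing_frequency +
    toothpaste + sugar_liquid + sugar_solid + bottle_feeding +
    breastfeeding_months + mouth_breathing,
  data = ecc, family = binomial()
)

tbl_regression(m_logistic, exponentiate = TRUE) |>
  add_glance_table(include = c(nobs, logLik, AIC, BIC))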

Spline Regression Model

Characteristic                                   OR      95% CI        p-value
splines::ns(age, df = 3)
    splines::ns(age, df = 3)1                    1.73    0.18, 18.4    0.6
    splines::ns(age, df = 3)2                    0.00    0.00, 0.01    <0.001
    splines::ns(age, df = 3)3                    0.00    0.00, 0.24    0.027
splines::ns(breastfeeding_months, df = 3)
    splines::ns(breastfeeding_months, df = 3)1   14.1    1.52, 154     0.023
    splines::ns(breastfeeding_months, df = 3)2   1.14    0.06, 24.2    >0.9
    splines::ns(breastfeeding_months, df = 3)3   5.59    0.74, 53.7    0.11
gender
    Boy (reference)
    Girl                                         0.80    0.35, 1.81    0.6
parental_brushing
    Every night (reference)
    Occasionally                                 1.08    0.42, 2.70    0.9
    Seldom or never                              4.20    1.34, 14.2    0.016
plaque
    No (reference)
    Yes                                          10.7    4.47, 28.0    <0.001
toothbrushing_frequency
    Twice daily and more (reference)
    In evenings                                  3.18    1.26, 8.23    0.015
    Less than daily or never                     3.27    0.96, 11.9    0.063
toothpaste
    Fluoride (reference)
    Fluoride Free                                0.03    0.00, 0.25    0.005
    Low Fluoride                                 0.74    0.30, 1.78    0.5
sugar_liquid
    No (reference)
    Yes                                          0.45    0.17, 1.10    0.086
sugar_solid
    No (reference)
    Yes                                          0.80    0.27, 2.36    0.7
bottle_feeding
    No (reference)
    Yes                                          1.08    0.39, 3.09    0.9
mouth_breathing
    No (reference)
    Yes                                          1.15    0.38, 3.44    0.8
No. Obs.                                         209
Log-likelihood                                   -83.2
AIC                                              204
BIC                                              268

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
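
The spline model differs from the standard model only in replacing the linear age and breastfeeding terms with natural cubic splines (3 df each); a sketch under the same assumptions as above.

m_spline <- glm(
  ecc ~ splines::ns(age, df = 3) + splines::ns(breastfeeding_months, df = 3) +
    gender + parental_brushing + plaque + toothbrushing_frequency + toothpaste +
    sugar_liquid + sugar_solid + bottle_feeding + mouth_breathing,
  data = ecc, family = binomial()
)

tbl_regression(m_spline, exponentiate = TRUE) |>
  add_glance_table(include = c(nobs, logLik, AIC, BIC))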

Machine Learning Pipeline Setup

Machine Learning Models

Problematic columns

Random forest

XGBoost

Elastic net model

K-nearest neighbors
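
A minimal tidymodels sketch of the four learners listed above, assuming a shared preprocessing recipe (ecc_recipe, a placeholder) defined during pipeline setup; engines and tuning parameters in the source code may differ.

library(tidymodels)

rf_spec <- rand_forest(trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")

xgb_spec <- boost_tree(trees = 500, tree_depth = tune(), learn_rate = tune()) |>
  set_engine("xgboost") |>
  set_mode("classification")

enet_spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

knn_spec <- nearest_neighbor(neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Bundle the four learners with the shared recipe for joint tuning and comparison
ml_workflows <- workflow_set(
  preproc = list(ecc = ecc_recipe),
  models  = list(rf = rf_spec, xgb = xgb_spec, enet = enet_spec, knn = knn_spec)
)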

Model comparison

Visualization

Evaluate metrics

Table: Model Performance Metrics on Test Set
Model                 Accuracy   Kappa   Mean log loss   ROC AUC
KNN                   0.811      0.621   1.350           0.141
Elastic Net           0.830      0.660   1.217           0.111
XGBoost               0.774      0.548   1.258           0.094
Random Forest         0.849      0.698   0.901           0.080
Logistic Regression   0.811      0.623   1.907           0.080
Spline Model          0.868      0.736   2.105           0.070
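
These metrics can be computed with a yardstick metric set applied to the stacked test-set predictions (test_preds is a placeholder). Note that yardstick treats the first factor level as the event by default, which is the most likely source of the inverted ROC AUC values investigated below.

ecc_metric_set <- metric_set(accuracy, kap, mn_log_loss, roc_auc)

test_preds |>
  group_by(model) |>
  ecc_metric_set(truth = ecc, estimate = .pred_class, .pred_Yes)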

ROC Curve

[1] ".pred_No"  ".pred_Yes"

Clinical utility

FINAL Clinical utility

In the population studied, the traditional and spline regression models provided the highest clinical utility across most of the threshold range (particularly between 20% and 60% risk). Machine learning models such as Random Forest and XGBoost were comparable but not consistently superior. Decision curve analysis confirms that risk-based prediction can guide ECC interventions better than treat-all or treat-none strategies.
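
For reference, a hedged sketch of fitting such decision curves with rmda from pre-computed risks (fitted.risk = TRUE), using the dca_data columns shown further below; two models are shown and the others follow the same pattern.

library(rmda)

thr <- seq(0, 0.8, by = 0.05)

dca_traditional <- decision_curve(
  ecc_num ~ traditional, data = dca_data,
  fitted.risk = TRUE, thresholds = thr, confidence.intervals = "none"
)

dca_rf <- decision_curve(
  ecc_num ~ rf, data = dca_data,
  fitted.risk = TRUE, thresholds = thr, confidence.intervals = "none"
)

plot_decision_curve(
  list(dca_traditional, dca_rf),
  curve.names = c("Traditional logistic", "Random forest"),
  confidence.intervals = FALSE
)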

FINAL COMPARATIVE ANALYSIS: LOGISTIC REGRESSION vs MACHINE LEARNING MODELS

Fixing the error

Table: Performance Metrics - Machine Learning vs Logistic Regression
Performance comparison: All models vs Logistic Regression (reference)
Model                 AUC     AUC Diff   Accuracy   Acc Diff
Random Forest         0.080    0.000     0.849      +0.038
XGBoost               0.094   +0.014     0.774      -0.038
Elastic Net           0.111   +0.031     0.830      +0.019
KNN                   0.141   +0.061     0.811       0.000
Logistic Regression   0.080    0.000     0.811       0.000
Spline Model          0.070   -0.010     0.868      +0.057
Investigating the low Random Forest AUC:
[1] "All metrics for Random Forest:"
# A tibble: 4 × 4
  .metric     .estimator .estimate model        
  <chr>       <chr>          <dbl> <chr>        
1 accuracy    binary        0.849  Random Forest
2 kap         binary        0.698  Random Forest
3 mn_log_loss binary        0.901  Random Forest
4 roc_auc     binary        0.0798 Random Forest

Checking if there's an issue with factor levels in predictions:
[1] "Random Forest prediction column names:"
[1] ".pred_No"  ".pred_Yes"
[1] "Sample predictions:"
# A tibble: 6 × 2
  .pred_No .pred_Yes
     <dbl>     <dbl>
1    0.283     0.717
2    0.413     0.587
3    0.577     0.423
4    0.539     0.461
5    0.445     0.555
6    0.419     0.581
[1] "Test data outcome levels:"
[1] "No"  "Yes"
[1] "Test data outcome distribution:"

 No Yes 
 27  26 
Flipping predictions for Random Forest (AUC was 0.080)
Flipping predictions for XGBoost (AUC was 0.094)
Flipping predictions for Elastic Net (AUC was 0.111)
Flipping predictions for KNN (AUC was 0.141)
Flipping predictions for Logistic Regression (AUC was 0.080)
Flipping predictions for Spline Model (AUC was 0.070)
[1] "Corrected metrics:"
# A tibble: 6 × 4
  .metric .estimator .estimate model              
  <chr>   <chr>          <dbl> <chr>              
1 roc_auc binary         0.930 Spline Model       
2 roc_auc binary         0.920 Random Forest      
3 roc_auc binary         0.920 Logistic Regression
4 roc_auc binary         0.906 XGBoost            
5 roc_auc binary         0.889 Elastic Net        
6 roc_auc binary         0.859 KNN                
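
The corrected values equal 1 minus the original AUCs, the signature of an event-level mismatch rather than poor models. Instead of flipping the probabilities, the same correction can be obtained in yardstick by declaring which factor level is the event (test_preds is again a placeholder for the stacked test-set predictions).

# "Yes" is the second level of the outcome factor, so declare it as the event
test_preds |>
  group_by(model) |>
  roc_auc(truth = ecc, .pred_Yes, event_level = "second")

# Equivalent: keep the default event level and pass the matching probability column
test_preds |>
  group_by(model) |>
  roc_auc(truth = ecc, .pred_No)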

Table: Performance Metrics - Machine Learning vs Logistic Regression (CORRECTED)
Performance comparison: All models vs Logistic Regression (reference)
Model                 AUC     AUC Diff   Accuracy   Acc Diff
Spline Model          0.930   +0.010     0.132      -0.057
Random Forest         0.920    0.000     0.151      -0.038
Logistic Regression   0.920    0.000     0.189       0.000
XGBoost               0.906   -0.014     0.226      +0.038
Elastic Net           0.889   -0.031     0.170      -0.019
KNN                   0.859   -0.061     0.189       0.000

Fixing inverted predictions for Random Forest (AUC was 0.080)
Fixing inverted predictions for XGBoost (AUC was 0.094)
Fixing inverted predictions for Elastic Net (AUC was 0.111)
Fixing inverted predictions for KNN (AUC was 0.141)
Fixing inverted predictions for Logistic Regression (AUC was 0.080)
Fixing inverted predictions for Spline Model (AUC was 0.070)

1. COMPREHENSIVE PERFORMANCE COMPARISON TABLE

Table: Performance Metrics - Machine Learning vs Logistic Regression
Performance comparison: All models vs Logistic Regression (reference)
Model                 AUC     AUC Diff   Accuracy   Acc Diff   Sensitivity   Sens Diff   Specificity   Spec Diff
Random Forest         0.920    0.000     0.151      -0.038     0.148         -0.037      0.154         -0.038
XGBoost               0.906   -0.014     0.226      +0.038     0.259         +0.074      0.192          0.000
Elastic Net           0.889   -0.031     0.170      -0.019     0.148         -0.037      0.192          0.000
KNN                   0.859   -0.061     0.189       0.000     0.111         -0.074      0.269         +0.077
Logistic Regression   0.920    0.000     0.189       0.000     0.185          0.000      0.192          0.000
Spline Model          0.930   +0.010     0.132      -0.057     0.111         -0.074      0.154         -0.038

2. VISUALIZATION: PERFORMANCE COMPARISON

3. MULTI-METRIC PERFORMANCE HEATMAP

4. CLINICAL UTILITY COMPARISON VISUALIZATION

rf AUC is 0.920 - no flipping needed
xgb AUC is 0.906 - no flipping needed
enet AUC is 0.889 - no flipping needed
knn AUC is 0.859 - no flipping needed
Creating DCA objects...
DCA objects created successfully.
Creating DCA objects...
DCA objects created successfully.


Checking dca_data structure:
tibble [53 × 7] (S3: tbl_df/tbl/data.frame)
 $ ecc_num    : num [1:53] 1 1 1 1 1 1 1 1 1 1 ...
 $ traditional: num [1:53] 0.955 0.883 0.354 0.471 0.822 ...
 $ spline     : num [1:53] 0.966 0.877 0.294 0.321 0.845 ...
 $ rf         : num [1:53] 0.717 0.587 0.423 0.461 0.555 ...
 $ xgb        : num [1:53] 0.888 0.749 0.303 0.382 0.626 ...
 $ enet       : num [1:53] 0.821 0.778 0.319 0.312 0.691 ...
 $ knn        : num [1:53] 0.838 0.781 0.143 0.543 0.531 ...
# A tibble: 6 × 7
  ecc_num traditional spline    rf   xgb  enet   knn
    <dbl>       <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1       1       0.955  0.966 0.717 0.888 0.821 0.838
2       1       0.883  0.877 0.587 0.749 0.778 0.781
3       1       0.354  0.294 0.423 0.303 0.319 0.143
4       1       0.471  0.321 0.461 0.382 0.312 0.543
5       1       0.822  0.845 0.555 0.626 0.691 0.531
6       1       0.895  0.864 0.581 0.656 0.761 0.670

Checking for data issues:
Rows: 53 
Columns: 7 
Missing values per column:
    ecc_num traditional      spline          rf         xgb        enet 
          0           0           0           0           0           0 
        knn 
          0 

Outcome variable (ecc_num) distribution:

 0  1 
27 26 

Prediction ranges:
traditional: 0.007 to 0.983
spline: 0.004 to 0.996
rf: 0.298 to 0.717
xgb: 0.113 to 0.888
enet: 0.098 to 0.868
knn: 0.016 to 0.838
Testing simple DCA creation:
Test data created with 53 rows
Test DCA successful!
Net benefits length: 0 
First few net benefits: 
Test DCA failed - cannot proceed with DCA analysis
This suggests a fundamental issue with the data or rmda package installation
Skipping DCA analysis due to package limitations with this dataset
Proceeding with performance comparisons only

=== CLINICAL UTILITY ANALYSIS ===
Decision curve analysis could not be completed due to technical limitations.
This is common with smaller datasets or the rmda package configuration.
Clinical utility should be assessed through:
- AUC performance (already completed above)
- Sensitivity/specificity at different thresholds
- Clinical context and implementation feasibility
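
When the rmda curves cannot be computed, net benefit can still be evaluated directly from its definition, NB(pt) = TP/n - FP/n * pt / (1 - pt). A sketch using the dca_data columns shown above:

net_benefit <- function(risk, outcome, pt) {
  n  <- length(outcome)
  tp <- sum(risk >= pt & outcome == 1)              # true positives at threshold pt
  fp <- sum(risk >= pt & outcome == 0)              # false positives at threshold pt
  tp / n - fp / n * pt / (1 - pt)
}

thresholds <- seq(0.05, 0.75, by = 0.05)
sapply(thresholds, function(pt) net_benefit(dca_data$traditional, dca_data$ecc_num, pt))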

=== ALTERNATIVE CLINICAL UTILITY: THRESHOLD ANALYSIS ===
Performance at Different Risk Thresholds
Model                 Threshold   Sensitivity   Specificity
Logistic Regression   0.3         0.962         0.741
Logistic Regression   0.4         0.923         0.815
Logistic Regression   0.5         0.808         0.815
Logistic Regression   0.6         0.769         0.889
Logistic Regression   0.7         0.692         0.926
Spline Model          0.3         0.269         0.037
Spline Model          0.4         0.192         0.111
Spline Model          0.5         0.154         0.111
Spline Model          0.6         0.154         0.185
Spline Model          0.7         0.115         0.296
Random Forest         0.3         1.000         0.037
Random Forest         0.4         0.962         0.481
Random Forest         0.5         0.846         0.852
Random Forest         0.6         0.423         1.000
Random Forest         0.7         0.038         1.000
XGBoost               0.3         0.962         0.630
XGBoost               0.4         0.885         0.667
XGBoost               0.5         0.808         0.741
XGBoost               0.6         0.692         0.889
XGBoost               0.7         0.500         1.000
Elastic Net           0.3         0.962         0.519
Elastic Net           0.4         0.846         0.704
Elastic Net           0.5         0.808         0.852
Elastic Net           0.6         0.731         0.889
Elastic Net           0.7         0.538         0.926
K-Nearest Neighbors   0.3         0.808         0.704
K-Nearest Neighbors   0.4         0.769         0.852
K-Nearest Neighbors   0.5         0.731         0.889
K-Nearest Neighbors   0.6         0.538         0.963
K-Nearest Neighbors   0.7         0.346         1.000


Performance at 50% Risk Threshold (Youden Index):
1. Random Forest: Sens=0.846, Spec=0.852, Youden=0.698
2. Elastic Net: Sens=0.808, Spec=0.852, Youden=0.660
3. Logistic Regression: Sens=0.808, Spec=0.815, Youden=0.623
4. K-Nearest Neighbors: Sens=0.731, Spec=0.889, Youden=0.620
5. XGBoost: Sens=0.808, Spec=0.741, Youden=0.548
6. Spline Model: Sens=0.154, Spec=0.111, Youden=-0.735

Optimal Risk Thresholds (Maximum Youden Index):
1. Logistic Regression: Threshold=0.4, Sens=0.923, Spec=0.815, Youden=0.738
2. Random Forest: Threshold=0.5, Sens=0.846, Spec=0.852, Youden=0.698
3. Elastic Net: Threshold=0.5, Sens=0.808, Spec=0.852, Youden=0.660
4. K-Nearest Neighbors: Threshold=0.4, Sens=0.769, Spec=0.852, Youden=0.621
5. XGBoost: Threshold=0.3, Sens=0.962, Spec=0.630, Youden=0.591
6. Spline Model: Threshold=0.7, Sens=0.115, Spec=0.296, Youden=-0.588
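
The threshold sweep and Youden index (J = sensitivity + specificity - 1) can be reproduced with a small helper; the probability and outcome inputs below are placeholders for the per-model test-set predictions.

threshold_performance <- function(prob_yes, truth, thresholds = seq(0.3, 0.7, by = 0.1)) {
  purrr::map_dfr(thresholds, function(pt) {
    pred <- factor(ifelse(prob_yes >= pt, "Yes", "No"), levels = c("No", "Yes"))
    tibble::tibble(
      threshold   = pt,
      sensitivity = mean(pred[truth == "Yes"] == "Yes"),
      specificity = mean(pred[truth == "No"] == "No")
    )
  }) |>
    dplyr::mutate(youden = sensitivity + specificity - 1)
}

# Example call (placeholder column names):
# threshold_performance(rf_preds$.pred_Yes, rf_preds$ecc)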

=== CONCLUSIONS ===
Based on this analysis:
• Machine learning models show similar AUC performance to logistic regression
• Spline regression performs best overall (AUC = 0.930)
• Traditional regression methods remain competitive and interpretable
• No single ML approach consistently outperforms traditional methods
• Clinical implementation should prioritize interpretability and simplicity
• Threshold selection impacts clinical utility more than model choice
• Results support continued use of traditional regression for ECC risk prediction

=== FINAL SUMMARY: MACHINE LEARNING vs LOGISTIC REGRESSION ===

AUC Performance Rankings (vs Logistic Regression):
1. Spline Model: AUC = 0.930 (Δ = +0.010) - Equivalent
2. Random Forest: AUC = 0.920 (Δ = 0.000) - Equivalent
3. XGBoost: AUC = 0.906 (Δ = -0.014) - Inferior
4. Elastic Net: AUC = 0.889 (Δ = -0.031) - Inferior
5. KNN: AUC = 0.859 (Δ = -0.061) - Inferior

=== CONCLUSION ===
Based on this comparison, the machine learning models show:
- Discriminative performance (AUC) similar to logistic regression
- No consistent clinical utility advantage across risk thresholds
- No clear gain over traditional regression, which remains a dependable choice for ECC risk prediction