CAPSTONE_PROJECT

Project Description

This project develops a quantitative framework to predict Type 2 diabetes risk using macronutrient intake and demographic characteristics. The central goal is to examine how dietary composition, especially the distribution of carbohydrates, fats, and proteins, interacts with age, sex, and BMI to influence the probability of having Type 2 diabetes.

Using publicly available data from the National Health and Nutrition Examination Survey (NHANES), this project will:

clean and merge demographic, dietary, and health-status files
compute macronutrient energy percentages as indicators of dietary balance
build a baseline logistic regression model for the probability of Type 2 diabetes
fit two machine learning models (random forest and XGBoost) to capture possible non-linear effects
compare models using accuracy, sensitivity, specificity, AUC, and confusion matrices
interpret model outputs using variable importance and SHAP values

The purpose is not only prediction, but also understanding: how diet and demographic characteristics jointly contribute to diabetes risk.

Background

Type 2 diabetes is one of the most widespread chronic metabolic disorders worldwide. It is characterised by insulin resistance, impaired glucose regulation, and long-term complications affecting the cardiovascular and nervous systems. Obesity, ageing, sedentary behaviour, and poor dietary patterns are major contributors to its development.

Macronutrients are the primary sources of dietary energy:

Carbohydrates directly influence post-prandial blood glucose.
Fats affect lipid metabolism, insulin sensitivity, and energy storage.
Protein impacts satiety and the preservation of lean body mass.

Because macronutrient balance shapes metabolic and hormonal responses, imbalances may shift overall disease risk even when total energy intake is held constant.

Traditional epidemiological approaches often study one factor at a time. Modern statistical computing, however, allows the analysis of complex multivariate relationships. Logistic regression provides a mathematically interpretable model, while tree-based machine learning methods can capture interactions and non-linear patterns. Using NHANES as a large-scale, high-quality data source, this project applies these tools to quantify how macronutrient balance and demographic factors jointly relate to Type 2 diabetes.

Problem Statement and Objectives

The main aim is to model and predict Type 2 diabetes risk using demographic variables (age, sex, BMI) and dietary macronutrient composition.

Research questions

Which predictors (macronutrient percentages, BMI, age, sex, and physical activity, if available) are most strongly associated with Type 2 diabetes?
Do more flexible models (random forest, XGBoost) improve predictive performance compared with logistic regression?
Can modelling reveal meaningful relationships between dietary composition, demographic factors, and chronic disease risk that confirm or extend existing public health literature?

The goal is to build a modelling framework that is both predictive and interpretable, and that could eventually support screening tools or personalised nutrition recommendations.

Data Description

The analysis uses data from the NHANES 2017–2018 cycle, a nationally representative U.S. survey combining:

interviews
24-hour dietary recalls
physical examinations
laboratory measurements

A single merged dataset is constructed using the participant identifier (SEQN). Key variables include:

diabetes — diagnosis of Type 2 diabetes (1 = diagnosed, 0 = not diagnosed)
age — participant age in years
sex — Male or Female
bmi — body mass index (kg/m^2)
total_kcal — total daily energy intake (kcal)
carb_g, fat_g, protein_g — grams of carbohydrate, fat, and protein per day
carb_pct, fat_pct, protein_pct — percent of total energy intake from each macronutrient

Before modelling, the dataset is cleaned to remove missing values, incorrect dietary entries, and implausible BMI values. Derived variables such as macronutrient percentages are computed to capture dietary balance.

Methodology

Logistic regression model

The baseline mathematical model is a multivariable logistic regression. Let

\(Y_i = 1\) if individual \(i\) has Type 2 diabetes,
\(Y_i = 0\) otherwise, and
\(X_i\) be the vector of predictors for individual \(i\) (age, sex, BMI, macronutrient percentages).

The model assumes that the log-odds of diabetes is a linear function of the predictors:

\[ \log\left( \frac{\Pr(Y_i = 1 \mid X_i)}{1 - \Pr(Y_i = 1 \mid X_i)} \right) = \beta_0 + \beta_1 \,\text{age}_i + \beta_2 \,\text{sex}_i + \beta_3 \,\text{bmi}_i + \beta_4 \,\text{carb\_pct}_i + \beta_5 \,\text{fat\_pct}_i + \beta_6 \,\text{protein\_pct}_i. \]

Equivalently,

\[ \Pr(Y_i = 1 \mid X_i) = \frac{1}{1 + \exp\!\left( -(\beta_0 + \beta_1 \text{age}_i + \cdots + \beta_6 \text{protein\_pct}_i) \right)}. \]

The coefficients \(\beta_j\) are estimated by maximum likelihood. Exponentiating a coefficient, \(\exp(\beta_j)\), gives an odds ratio describing how the odds of diabetes change when the corresponding predictor increases by one unit while the other predictors are held fixed.
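
To see where the odds-ratio interpretation comes from, compare the log-odds at two predictor vectors that differ only by one unit in predictor \(j\). Here \(\text{odds}(\cdot)\) denotes \(\Pr(Y_i = 1 \mid X_i) / (1 - \Pr(Y_i = 1 \mid X_i))\) and \(x_{ij}\) is the value of predictor \(j\) for individual \(i\); both are notation introduced only for this illustration. All other terms in the linear predictor cancel, leaving

\[ \log \text{odds}(x_{ij} + 1) - \log \text{odds}(x_{ij}) = \beta_j \quad \Longrightarrow \quad \frac{\text{odds}(x_{ij} + 1)}{\text{odds}(x_{ij})} = \exp(\beta_j). \]

By the same argument, a change of \(k\) units multiplies the odds by \(\exp(k \, \beta_j)\), so the effect of a ten-year age difference or a five-unit BMI difference can be read from the same coefficient.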

Logistic regression is used because it is:

interpretable
mathematically well understood
a standard baseline model in epidemiology

It provides a benchmark against which more flexible models can be evaluated.

Machine learning models

To explore potential non-linear effects and interactions, two tree-based models are fitted alongside the logistic regression baseline: random forest and gradient boosting (XGBoost).

Random forest

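A random forest builds many decision trees on bootstrap samples, considering only a random subset of the predictors at each split, and averages their predictions (for classification, the trees' votes or class probabilities). The resulting classifier can be written as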

\[ \hat{f}_{RF}(x) = \frac{1}{B} \sum_{b = 1}^B T_b(x), \]

where \(T_b(x)\) is the prediction from tree \(b\) and \(B\) is the total number of trees. Random forests:

capture interactions between predictors
handle non-linear relationships
provide variable importance measures based on reductions in Gini impurity
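
The averaging formula above can be illustrated with a minimal base R sketch. It is a simplified stand-in, not the randomForest algorithm: the data are simulated, each "tree" is a single-split stump on one predictor, and there is no random selection of predictors at each split. All object names (fit_stump, predict_stump, predict_bagged) are hypothetical.

```r
# Minimal bagging sketch in base R (simulated data; all names are illustrative)
set.seed(1)
n = 200
x = runif(n)                                   # one predictor on [0, 1]
y = as.integer(x + rnorm(n, sd = 0.3) > 0.5)   # binary outcome, more likely when x is large

# A "tree" with a single split: choose the cutpoint that minimises squared error
fit_stump = function(x, y) {
  cuts = quantile(x, probs = seq(0.1, 0.9, by = 0.1))
  sse = sapply(cuts, function(cutpoint) {
    left  = y[x <= cutpoint]
    right = y[x > cutpoint]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best = cuts[[which.min(sse)]]
  list(cut = best, left = mean(y[x <= best]), right = mean(y[x > best]))
}

# Prediction from one stump: the class-1 proportion on the relevant side of the cut
predict_stump = function(stump, x_new) {
  ifelse(x_new <= stump$cut, stump$left, stump$right)
}

# Fit B stumps, each on its own bootstrap sample (T_1, ..., T_B in the formula)
B = 50
stumps = lapply(seq_len(B), function(b) {
  idx = sample(seq_len(n), size = n, replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Ensemble prediction: the average of the B tree predictions
predict_bagged = function(x_new) {
  preds = sapply(stumps, function(s) predict_stump(s, x_new))
  rowMeans(matrix(preds, nrow = length(x_new)))
}

predict_bagged(c(0.1, 0.5, 0.9))
```

Each stump returns a class-1 proportion, so the averaged value can be read as an estimated probability, which is the quantity thresholded at 0.5 later in the report.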

Gradient boosting (XGBoost)

Gradient boosting constructs trees sequentially, with each new tree fitted to reduce the remaining error from the current ensemble. The boosted model can be written as

\[ \hat{f}_{XGB}(x) = \sum_{m = 1}^M \gamma_m \, h_m(x), \]

where each \(h_m(x)\) is a shallow tree and \(\gamma_m\) are scaling coefficients. XGBoost:

performs well on structured tabular data
allows fine-grained control over regularisation and tree complexity
supports SHAP-based explanations for feature-level interpretability
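
The sequential construction can be sketched in base R as follows. This is a deliberately simplified illustration and not XGBoost itself: it uses squared-error loss on simulated data, single-split stumps for \(h_m\), and a constant scaling coefficient \(\gamma_m = 0.1\), whereas XGBoost optimises a regularised objective with deeper trees. All names are hypothetical.

```r
# Minimal gradient boosting sketch in base R (simulated data; all names are illustrative)
set.seed(1)
n = 300
x = runif(n, -3, 3)
y = sin(x) + rnorm(n, sd = 0.2)    # non-linear signal plus noise

# Fit a single-split regression stump to the current residuals r
fit_stump = function(x, r) {
  cuts = quantile(x, probs = seq(0.05, 0.95, by = 0.05))
  sse = sapply(cuts, function(cutpoint) {
    left  = r[x <= cutpoint]
    right = r[x > cutpoint]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best = cuts[[which.min(sse)]]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

predict_stump = function(stump, x_new) {
  ifelse(x_new <= stump$cut, stump$left, stump$right)
}

M     = 100    # number of boosting rounds
gamma = 0.1    # constant scaling coefficient (learning rate), an assumption of this sketch

f_hat    = rep(mean(y), n)   # start from a constant prediction
mse_path = numeric(M)

for (m in seq_len(M)) {
  h_m   = fit_stump(x, y - f_hat)                  # fit the next stump to the residuals
  f_hat = f_hat + gamma * predict_stump(h_m, x)    # add the scaled stump to the ensemble
  mse_path[m] = mean((y - f_hat)^2)                # training error after round m
}

round(mse_path[c(1, 10, 50, 100)], 4)
```

The training error cannot increase from one round to the next in this setting, which is why boosted models are usually evaluated on held-out data and regularised rather than run until the training error stops falling.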

All models are trained on a training subset and evaluated on a held-out test set. Performance is compared using:

accuracy
sensitivity and specificity
ROC curves and AUC
confusion matrices

SHAP values are used to interpret the contribution of each feature to the XGBoost predictions.

Step 1: NHANES Demographic Data

In this step I load demographic data from the 2017–2018 NHANES cycle using the nhanesA package. This file includes age, sex, and other basic variables for each participant.

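# Packages: nhanesA downloads NHANES tables; dplyr provides %>%, transmute(), filter(), glimpse()
library(nhanesA)
library(dplyr)
# DEMO_J = demographic data for NHANES 2017–2018
demo_raw = nhanes("DEMO_J")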

# Look at structure and first few rows
str(demo_raw)

## 'data.frame':    9254 obs. of  46 variables:
##  $ SEQN    : num  93703 93704 93705 93706 93707 ...
##  $ SDDSRVYR: Factor w/ 1 level "NHANES 2017-2018 public release": 1 1 1 1 1 1 1 1 1 1 ...
##  $ RIDSTATR: Factor w/ 2 levels "Interviewed only",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RIAGENDR: Factor w/ 2 levels "Male","Female": 2 1 2 1 1 2 2 2 1 1 ...
##  $ RIDAGEYR: num  2 2 66 18 13 66 75 0 56 18 ...
##  $ RIDAGEMN: num  NA NA NA NA NA NA NA 11 NA NA ...
##  $ RIDRETH1: Factor w/ 5 levels "Mexican American",..: 5 3 4 5 5 5 4 3 5 1 ...
##  $ RIDRETH3: Factor w/ 6 levels "Mexican American",..: 5 3 4 5 6 5 4 3 5 1 ...
##  $ RIDEXMON: Factor w/ 2 levels "November 1 through April 30",..: 2 1 2 2 2 2 1 2 2 2 ...
##  $ RIDEXAGM: num  27 33 NA 222 158 NA NA 13 NA 227 ...
##  $ DMQMILIZ: Factor w/ 4 levels "Yes","No","Refused",..: NA NA 2 2 NA 2 2 NA 2 2 ...
##  $ DMQADFC : Factor w/ 3 levels "Yes","No","Refused": NA NA NA NA NA NA NA NA NA NA ...
##  $ DMDBORN4: Factor w/ 4 levels "Born in 50 US states or Washington, DC",..: 1 1 1 1 1 2 1 1 2 2 ...
##  $ DMDCITZN: Factor w/ 4 levels "Citizen by birth or naturalization",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ DMDYRSUS: Factor w/ 11 levels "Less than 1 year",..: NA NA NA NA NA 7 NA NA 6 5 ...
##  $ DMDEDUC3: Factor w/ 17 levels "Never attended / kindergarten only",..: NA NA NA 16 7 NA NA NA NA 13 ...
##  $ DMDEDUC2: Factor w/ 7 levels "Less than 9th grade",..: NA NA 2 NA NA 1 4 NA 5 NA ...
##  $ DMDMARTL: Factor w/ 7 levels "Married","Widowed",..: NA NA 3 NA NA 1 2 NA 1 NA ...
##  $ RIDEXPRG: Factor w/ 3 levels "Yes, positive lab pregnancy test or self-reported pregnant at exam",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ SIALANG : Factor w/ 2 levels "English","Spanish": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SIAPROXY: Factor w/ 2 levels "Yes","No": 1 1 2 2 1 2 2 1 2 2 ...
##  $ SIAINTRP: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 1 2 2 2 2 ...
##  $ FIALANG : Factor w/ 2 levels "English","Spanish": 1 1 1 NA 1 1 1 1 1 2 ...
##  $ FIAPROXY: Factor w/ 2 levels "Yes","No": 2 2 2 NA 2 2 2 2 2 2 ...
##  $ FIAINTRP: Factor w/ 2 levels "Yes","No": 2 2 2 NA 2 2 2 2 2 2 ...
##  $ MIALANG : Factor w/ 2 levels "English","Spanish": NA NA 1 1 1 1 NA NA 1 1 ...
##  $ MIAPROXY: Factor w/ 2 levels "Yes","No": NA NA 2 2 2 2 NA NA 2 2 ...
##  $ MIAINTRP: Factor w/ 2 levels "Yes","No": NA NA 2 2 2 1 NA NA 2 2 ...
##  $ AIALANGA: Factor w/ 3 levels "English","Spanish",..: NA NA 1 1 1 3 NA NA 1 1 ...
##  $ DMDHHSIZ: Factor w/ 7 levels "1","2","3","4",..: 5 4 1 5 7 2 1 3 3 4 ...
##  $ DMDFMSIZ: Factor w/ 7 levels "1","2","3","4",..: 5 4 1 5 7 2 1 3 3 4 ...
##  $ DMDHHSZA: Factor w/ 4 levels "0","1","2","3 or more": 4 3 1 1 1 1 1 2 1 1 ...
##  $ DMDHHSZB: Factor w/ 4 levels "0","1","2","3 or more": 1 1 1 1 4 1 1 1 1 3 ...
##  $ DMDHHSZE: Factor w/ 4 levels "0","1","2","3 or more": 1 1 2 2 1 3 2 1 1 1 ...
##  $ DMDHRGND: Factor w/ 2 levels "Male","Female": 1 1 2 1 1 1 2 1 1 2 ...
##  $ DMDHRAGZ: Factor w/ 4 levels "<20 years","20-39 years",..: 2 2 4 4 3 4 4 2 3 3 ...
##  $ DMDHREDZ: Factor w/ 3 levels "Less than high school degree",..: 3 3 1 3 2 1 2 3 3 1 ...
##  $ DMDHRMAZ: Factor w/ 3 levels "Married/Living with partner",..: 1 1 2 1 1 1 2 1 1 2 ...
##  $ DMDHSEDZ: Factor w/ 3 levels "Less than high school degree",..: 3 2 NA 2 3 1 NA 3 3 NA ...
##  $ WTINT2YR: num  9246 37339 8615 8549 6769 ...
##  $ WTMEC2YR: num  8540 42567 8338 8723 7065 ...
##  $ SDMVPSU : num  2 1 2 2 1 2 1 1 2 2 ...
##  $ SDMVSTRA: num  145 143 145 134 138 138 136 134 134 147 ...
##  $ INDHHIN2: Factor w/ 16 levels "$ 0 to $ 4,999",..: 14 14 3 NA 10 6 2 14 14 4 ...
##  $ INDFMIN2: Factor w/ 16 levels "$ 0 to $ 4,999",..: 14 14 3 NA 10 6 2 14 14 4 ...
##  $ INDFMPIR: num  5 5 0.82 NA 1.88 1.63 0.41 4.9 5 0.76 ...

head(demo_raw)

##    SEQN                        SDDSRVYR                          RIDSTATR
## 1 93703 NHANES 2017-2018 public release Both interviewed and MEC examined
## 2 93704 NHANES 2017-2018 public release Both interviewed and MEC examined
## 3 93705 NHANES 2017-2018 public release Both interviewed and MEC examined
## 4 93706 NHANES 2017-2018 public release Both interviewed and MEC examined
## 5 93707 NHANES 2017-2018 public release Both interviewed and MEC examined
## 6 93708 NHANES 2017-2018 public release Both interviewed and MEC examined
##   RIAGENDR RIDAGEYR RIDAGEMN                            RIDRETH1
## 1   Female        2       NA Other Race - Including Multi-Racial
## 2     Male        2       NA                  Non-Hispanic White
## 3   Female       66       NA                  Non-Hispanic Black
## 4     Male       18       NA Other Race - Including Multi-Racial
## 5     Male       13       NA Other Race - Including Multi-Racial
## 6   Female       66       NA Other Race - Including Multi-Racial
##                              RIDRETH3                    RIDEXMON RIDEXAGM
## 1                  Non-Hispanic Asian    May 1 through October 31       27
## 2                  Non-Hispanic White November 1 through April 30       33
## 3                  Non-Hispanic Black    May 1 through October 31       NA
## 4                  Non-Hispanic Asian    May 1 through October 31      222
## 5 Other Race - Including Multi-Racial    May 1 through October 31      158
## 6                  Non-Hispanic Asian    May 1 through October 31       NA
##   DMQMILIZ DMQADFC                               DMDBORN4
## 1     <NA>    <NA> Born in 50 US states or Washington, DC
## 2     <NA>    <NA> Born in 50 US states or Washington, DC
## 3       No    <NA> Born in 50 US states or Washington, DC
## 4       No    <NA> Born in 50 US states or Washington, DC
## 5     <NA>    <NA> Born in 50 US states or Washington, DC
## 6       No    <NA>                                 Others
##                             DMDCITZN                                DMDYRSUS
## 1 Citizen by birth or naturalization                                    <NA>
## 2 Citizen by birth or naturalization                                    <NA>
## 3 Citizen by birth or naturalization                                    <NA>
## 4 Citizen by birth or naturalization                                    <NA>
## 5 Citizen by birth or naturalization                                    <NA>
## 6 Citizen by birth or naturalization 30 year or more, but less than 40 years
##                DMDEDUC3                                           DMDEDUC2
## 1                  <NA>                                               <NA>
## 2                  <NA>                                               <NA>
## 3                  <NA> 9-11th grade (Includes 12th grade with no diploma)
## 4 More than high school                                               <NA>
## 5             6th grade                                               <NA>
## 6                  <NA>                                Less than 9th grade
##   DMDMARTL RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY FIAINTRP MIALANG
## 1     <NA>     <NA> English      Yes       No English       No       No    <NA>
## 2     <NA>     <NA> English      Yes       No English       No       No    <NA>
## 3 Divorced     <NA> English       No       No English       No       No English
## 4     <NA>     <NA> English       No       No    <NA>     <NA>     <NA> English
## 5     <NA>     <NA> English      Yes       No English       No       No English
## 6  Married     <NA> English       No      Yes English       No       No English
##   MIAPROXY MIAINTRP        AIALANGA                          DMDHHSIZ
## 1     <NA>     <NA>            <NA>                                 5
## 2     <NA>     <NA>            <NA>                                 4
## 3       No       No         English                                 1
## 4       No       No         English                                 5
## 5       No       No         English 7 or more people in the Household
## 6       No      Yes Asian languages                                 2
##                         DMDFMSIZ  DMDHHSZA  DMDHHSZB DMDHHSZE DMDHRGND
## 1                              5 3 or more         0        0     Male
## 2                              4         2         0        0     Male
## 3                              1         0         0        1   Female
## 4                              5         0         0        1     Male
## 5 7 or more people in the Family         0 3 or more        0     Male
## 6                              2         0         0        2     Male
##      DMDHRAGZ                                       DMDHREDZ
## 1 20-39 years                      College graduate or above
## 2 20-39 years                      College graduate or above
## 3   60+ years                   Less than high school degree
## 4   60+ years                      College graduate or above
## 5 40-59 years High school grad/GED or some college/AA degree
## 6   60+ years                   Less than high school degree
##                      DMDHRMAZ                                       DMDHSEDZ
## 1 Married/Living with partner                      College graduate or above
## 2 Married/Living with partner High school grad/GED or some college/AA degree
## 3  Widowed/Divorced/Separated                                           <NA>
## 4 Married/Living with partner High school grad/GED or some college/AA degree
## 5 Married/Living with partner                      College graduate or above
## 6 Married/Living with partner                   Less than high school degree
##    WTINT2YR  WTMEC2YR SDMVPSU SDMVSTRA           INDHHIN2           INDFMIN2
## 1  9246.492  8539.731       2      145  $100,000 and Over  $100,000 and Over
## 2 37338.768 42566.615       1      143  $100,000 and Over  $100,000 and Over
## 3  8614.571  8338.420       2      145 $10,000 to $14,999 $10,000 to $14,999
## 4  8548.633  8723.440       2      134               <NA>               <NA>
## 5  6769.345  7064.610       1      138 $65,000 to $74,999 $65,000 to $74,999
## 6 13329.451 14372.489       2      138 $25,000 to $34,999 $25,000 to $34,999
##   INDFMPIR
## 1     5.00
## 2     5.00
## 3     0.82
## 4       NA
## 5     1.88
## 6     1.63

Step 2: Clean demographic variables (ID, age, sex)

In this step I extract a smaller demographic table with one row per participant and only the variables that will be used later in the modelling: the NHANES identifier, age in years, and sex. I also restrict the dataset to adults (age 18 and older).

demo = demo_raw %>%
  transmute(
    SEQN = SEQN,          # participant ID
    age  = RIDAGEYR,      # age in years
    sex  = RIAGENDR       # sex (Male / Female)
  ) %>%
  filter(!is.na(age), age >= 18)

glimpse(demo)

## Rows: 5,856
## Columns: 3
## $ SEQN <dbl> 93705, 93706, 93708, 93709, 93711, 93712, 93713, 93714, 93715, 93…
## $ age  <dbl> 66, 18, 66, 75, 56, 18, 67, 54, 71, 61, 22, 45, 60, 60, 64, 67, 7…
## $ sex  <fct> Female, Male, Female, Female, Male, Male, Male, Female, Male, Mal…

Step 3: Dietary intakes (kcal, carbohydrates, fat, protein)

In this step I load the NHANES dietary total nutrient intake file for day 1 and extract total energy and macronutrient amounts for each participant.

# DR1TOT_J = Total nutrient intakes, Day 1, 2017–2018
diet_raw = nhanes("DR1TOT_J")

diet = diet_raw %>%
  transmute(
    SEQN       = SEQN,      # participant ID
    total_kcal = DR1TKCAL,  # total energy (kcal)
    carb_g     = DR1TCARB,  # carbohydrate (g)
    fat_g      = DR1TTFAT,  # total fat (g)
    protein_g  = DR1TPROT   # protein (g)
  ) %>%
  filter(!is.na(total_kcal), total_kcal > 0)

glimpse(diet)

## Rows: 7,483
## Columns: 5
## $ SEQN       <dbl> 93704, 93705, 93706, 93707, 93708, 93710, 93711, 93712, 937…
## $ total_kcal <dbl> 1230, 1202, 1987, 1775, 1251, 900, 2840, 2045, 2040, 2493, …
## $ carb_g     <dbl> 160.46, 157.45, 89.82, 188.15, 123.71, 140.80, 339.60, 268.…
## $ fat_g      <dbl> 43.24, 56.98, 137.39, 89.18, 65.49, 27.69, 124.24, 63.90, 1…
## $ protein_g  <dbl> 51.58, 20.01, 94.19, 59.48, 50.96, 22.93, 101.33, 99.66, 61…

Step 4: Macronutrient percentages and merge with demographics

Here I compute the percentage of total energy coming from each macronutrient and merge the dietary data with the demographic table.

diet = diet %>%
  mutate(
    carb_pct    = 4 * carb_g / total_kcal * 100,
    fat_pct     = 9 * fat_g / total_kcal * 100,
    protein_pct = 4 * protein_g / total_kcal * 100
  )

analysis_data = demo %>%
  inner_join(diet, by = "SEQN")

glimpse(analysis_data)

## Rows: 4,982
## Columns: 10
## $ SEQN        <dbl> 93705, 93706, 93708, 93711, 93712, 93713, 93714, 93715, 93…
## $ age         <dbl> 66, 18, 66, 56, 18, 67, 54, 71, 61, 22, 60, 60, 64, 67, 70…
## $ sex         <fct> Female, Male, Female, Male, Male, Male, Female, Male, Male…
## $ total_kcal  <dbl> 1202, 1987, 1251, 2840, 2045, 2040, 2493, 1287, 2917, 3151…
## $ carb_g      <dbl> 157.45, 89.82, 123.71, 339.60, 268.24, 201.09, 208.12, 161…
## $ fat_g       <dbl> 56.98, 137.39, 65.49, 124.24, 63.90, 114.30, 143.18, 53.11…
## $ protein_g   <dbl> 20.01, 94.19, 50.96, 101.33, 99.66, 61.40, 104.10, 40.20, …
## $ carb_pct    <dbl> 52.39601, 18.08153, 39.55556, 47.83099, 52.46748, 39.42941…
## $ fat_pct     <dbl> 42.66389, 62.22999, 47.11511, 39.37183, 28.12225, 50.42647…
## $ protein_pct <dbl> 6.658902, 18.961248, 16.294165, 14.271831, 19.493399, 12.0…
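
As a quick check of the percentage formulas, the sketch below recomputes the three shares for the first participant in the output above (SEQN 93705), using the values printed by glimpse(). The vector names kcal_per_g, grams, and pct are introduced here for illustration only.

```r
# Values for the first participant in the glimpse() output above (SEQN 93705)
total_kcal = 1202
carb_g     = 157.45
fat_g      = 56.98
protein_g  = 20.01

# General energy factors used in Step 4: 4 kcal/g carbohydrate, 9 kcal/g fat, 4 kcal/g protein
kcal_per_g = c(carb = 4, fat = 9, protein = 4)
grams      = c(carb = carb_g, fat = fat_g, protein = protein_g)

pct = kcal_per_g * grams / total_kcal * 100
round(pct, 2)   # compare with carb_pct, fat_pct, protein_pct in the first row
sum(pct)        # need not equal exactly 100
```

The three shares do not have to add up to exactly 100: the 4/9/4 factors are approximations, and energy from alcohol is part of total_kcal but is not covered by any of the three terms.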

Step 5: Diabetes status (outcome variable)

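In this step I define the diabetes outcome using the NHANES diabetes questionnaire file and merge it into the analysis dataset. DIQ010 records a self-reported doctor diagnosis of diabetes and does not distinguish Type 1 from Type 2, so the outcome serves as a proxy for Type 2 diabetes, which accounts for the large majority of adult cases.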

# DIQ_J = Diabetes questionnaire, 2017–2018
diab_raw = nhanes("DIQ_J")

# DIQ010: "Doctor told you have diabetes"
# Code as 1 = yes, 0 = no; drop borderline and missing
diab = diab_raw %>%
  transmute(
    SEQN = SEQN,
    diabetes = case_when(
      DIQ010 == "Yes" ~ 1,
      DIQ010 == "No"  ~ 0,
      TRUE            ~ NA_real_
    )
  ) %>%
  filter(!is.na(diabetes))

glimpse(diab)

## Rows: 8,709
## Columns: 2
## $ SEQN     <dbl> 93703, 93704, 93705, 93706, 93707, 93709, 93711, 93712, 93713…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

# Merge with existing analysis_data
analysis_data = analysis_data %>%
  inner_join(diab, by = "SEQN")

glimpse(analysis_data)

## Rows: 4,838
## Columns: 11
## $ SEQN        <dbl> 93705, 93706, 93711, 93712, 93713, 93714, 93715, 93716, 93…
## $ age         <dbl> 66, 18, 56, 18, 67, 54, 71, 61, 22, 60, 60, 67, 70, 53, 42…
## $ sex         <fct> Female, Male, Male, Male, Male, Female, Male, Male, Male, …
## $ total_kcal  <dbl> 1202, 1987, 2840, 2045, 2040, 2493, 1287, 2917, 3151, 1930…
## $ carb_g      <dbl> 157.45, 89.82, 339.60, 268.24, 201.09, 208.12, 161.90, 442…
## $ fat_g       <dbl> 56.98, 137.39, 124.24, 63.90, 114.30, 143.18, 53.11, 85.06…
## $ protein_g   <dbl> 20.01, 94.19, 101.33, 99.66, 61.40, 104.10, 40.20, 103.91,…
## $ carb_pct    <dbl> 52.39601, 18.08153, 47.83099, 52.46748, 39.42941, 33.39270…
## $ fat_pct     <dbl> 42.66389, 62.22999, 39.37183, 28.12225, 50.42647, 51.68953…
## $ protein_pct <dbl> 6.658902, 18.961248, 14.271831, 19.493399, 12.039216, 16.7…
## $ diabetes    <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…

# Check diabetes prevalence
table(analysis_data$diabetes)

## 
##    0    1 
## 4084  754

prop.table(table(analysis_data$diabetes))

## 
##         0         1 
## 0.8441505 0.1558495

Step 6: Add BMI (body mass index)

In this step I load BMI from the NHANES body measures file and merge it into the analysis dataset.

# BMX_J = Body measures (includes BMI), 2017–2018
bmx_raw = nhanes("BMX_J")

bmx = bmx_raw %>%
  transmute(
    SEQN = SEQN,
    bmi  = BMXBMI      # body mass index (kg/m^2)
  ) %>%
  filter(!is.na(bmi))

glimpse(bmx)

## Rows: 8,005
## Columns: 2
## $ SEQN <dbl> 93703, 93704, 93705, 93706, 93707, 93708, 93709, 93711, 93712, 93…
## $ bmi  <dbl> 17.5, 15.7, 31.7, 21.5, 18.1, 23.7, 38.9, 21.3, 19.7, 23.5, 39.9,…

# Merge BMI into the analysis dataset
analysis_data = analysis_data %>%
  inner_join(bmx, by = "SEQN")

glimpse(analysis_data)

## Rows: 4,780
## Columns: 12
## $ SEQN        <dbl> 93705, 93706, 93711, 93712, 93713, 93714, 93715, 93716, 93…
## $ age         <dbl> 66, 18, 56, 18, 67, 54, 71, 61, 22, 60, 60, 67, 70, 53, 42…
## $ sex         <fct> Female, Male, Male, Male, Male, Female, Male, Male, Male, …
## $ total_kcal  <dbl> 1202, 1987, 2840, 2045, 2040, 2493, 1287, 2917, 3151, 1930…
## $ carb_g      <dbl> 157.45, 89.82, 339.60, 268.24, 201.09, 208.12, 161.90, 442…
## $ fat_g       <dbl> 56.98, 137.39, 124.24, 63.90, 114.30, 143.18, 53.11, 85.06…
## $ protein_g   <dbl> 20.01, 94.19, 101.33, 99.66, 61.40, 104.10, 40.20, 103.91,…
## $ carb_pct    <dbl> 52.39601, 18.08153, 47.83099, 52.46748, 39.42941, 33.39270…
## $ fat_pct     <dbl> 42.66389, 62.22999, 39.37183, 28.12225, 50.42647, 51.68953…
## $ protein_pct <dbl> 6.658902, 18.961248, 14.271831, 19.493399, 12.039216, 16.7…
## $ diabetes    <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ bmi         <dbl> 31.7, 21.5, 21.3, 19.7, 23.5, 39.9, 22.5, 30.7, 24.5, 35.9…

summary(analysis_data$bmi)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    14.8    24.6    28.5    29.7    33.5    84.4

Step 7: Logistic regression model

Here I fit a multivariable logistic regression model for diabetes status using age, sex, BMI, and macronutrient percentages as predictors.

set.seed(123)

# Keep only variables used in the model and drop missing values
model_data = analysis_data %>%
  select(diabetes, age, sex, bmi, carb_pct, fat_pct, protein_pct) %>%
  na.omit()

# Train/test split: 70% train, 30% test
n = nrow(model_data)
train_index = sample(seq_len(n), size = floor(0.7 * n))

train = model_data[train_index, ]
test  = model_data[-train_index, ]

# Fit logistic regression
logit_model = glm(
  diabetes ~ age + sex + bmi + carb_pct + fat_pct + protein_pct,
  data   = train,
  family = binomial
)

summary(logit_model)

## 
## Call:
## glm(formula = diabetes ~ age + sex + bmi + carb_pct + fat_pct + 
##     protein_pct, family = binomial, data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.625373   1.060846 -10.016  < 2e-16 ***
## age           0.062769   0.003650  17.195  < 2e-16 ***
## sexFemale    -0.555769   0.106922  -5.198 2.02e-07 ***
## bmi           0.076639   0.007046  10.876  < 2e-16 ***
## carb_pct      0.029839   0.010161   2.937  0.00332 ** 
## fat_pct       0.033256   0.011235   2.960  0.00308 ** 
## protein_pct   0.044181   0.013749   3.213  0.00131 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2914.4  on 3345  degrees of freedom
## Residual deviance: 2371.9  on 3339  degrees of freedom
## AIC: 2385.9
## 
## Number of Fisher Scoring iterations: 6

# Predicted probabilities on test set
test$prob_logit = predict(logit_model, newdata = test, type = "response")

# Class predictions with threshold 0.5
test$pred_logit = ifelse(test$prob_logit > 0.5, 1, 0)

# Accuracy
accuracy_logit = mean(test$pred_logit == test$diabetes)
accuracy_logit

## [1] 0.8584379

# Confusion table
table(Predicted = test$pred_logit, Actual = test$diabetes)

##          Actual
## Predicted    0    1
##         0 1209  191
##         1   12   22

# ROC curve and AUC (requires the pROC package)
library(pROC)
roc_logit = roc(test$diabetes, test$prob_logit)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc_logit = auc(roc_logit)
auc_logit

## Area under the curve: 0.7887

plot(roc_logit, main = "ROC Curve – Logistic Regression (NHANES)")

# Odds ratios for each predictor in the logistic regression
exp(coef(logit_model))

##  (Intercept)          age    sexFemale          bmi     carb_pct      fat_pct 
## 2.429176e-05 1.064781e+00 5.736310e-01 1.079652e+00 1.030289e+00 1.033815e+00 
##  protein_pct 
## 1.045171e+00

Logistic regression results on NHANES

The logistic regression model was trained on 70 percent of the NHANES analysis dataset and evaluated on the remaining 30 percent. On the test set the model correctly classified about 85.8 percent of individuals (accuracy 0.858). This figure should be read against the class balance: 1,221 of the 1,434 test participants (85.1 percent) have no diabetes diagnosis, so a rule that predicts "no diabetes" for everyone would score almost as well. The area under the ROC curve (AUC) is about 0.789, which indicates a moderately good ability to rank participants with a diabetes diagnosis above those without one, independently of any particular threshold.

The confusion table shows that the model has very high specificity and low sensitivity at the default probability threshold of 0.5. In the test set the model produces 1,209 true negatives and only 12 false positives (specificity 1,209/1,221, about 0.99), but only 22 true positives against 191 false negatives (sensitivity 22/213, about 0.10). The fitted model is therefore much better at correctly identifying non-diabetic participants than at detecting diabetes cases, which is common when the outcome is relatively infrequent in the sample and the threshold is left at 0.5.
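
The counts in the confusion table can be turned into the evaluation metrics listed in the Methodology section with a few lines of base R. The counts are copied from the table above; the object names (tn, fp, fn, tp, metrics) are introduced here for illustration, and the majority-class baseline is an added reference point rather than part of the original output.

```r
# Test-set counts from the logistic regression confusion table (threshold 0.5)
tn = 1209   # predicted 0, actual 0
fp = 12     # predicted 1, actual 0
fn = 191    # predicted 0, actual 1
tp = 22     # predicted 1, actual 1

total = tp + tn + fp + fn

metrics = c(
  accuracy          = (tp + tn) / total,
  sensitivity       = tp / (tp + fn),     # share of diabetes cases detected
  specificity       = tn / (tn + fp),     # share of non-cases correctly cleared
  majority_baseline = (tn + fp) / total   # accuracy of always predicting 0
)

round(metrics, 3)
```

The same calculation applies to any other confusion table in this report by swapping in its four counts.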

Interpreting the estimated coefficients as odds ratios gives a clearer view of how each predictor is related to diabetes risk. The odds ratio for age is about 1.06, so each one-year increase in age multiplies the odds of diabetes by about 1.06 (an increase of roughly 6 percent) when the other variables are held fixed. The odds ratio for BMI is about 1.08, which means that each one-unit increase in BMI is associated with about an 8 percent increase in the odds of diabetes. The coefficient for sex implies that female participants have lower odds of diabetes than male participants, with an odds ratio around 0.57. The odds ratios for the macronutrient percentages are all slightly above one: about 1.03 for carbohydrate percentage, 1.03 for fat percentage, and 1.05 for protein percentage. These results suggest that, after controlling for age, sex, and BMI, a higher share of total energy from each macronutrient is associated with modestly higher odds of a diabetes diagnosis. These three coefficients should be interpreted with caution, however: the percentages are strongly correlated because they sum to approximately one hundred, so raising one share while holding the other two fixed mostly means displacing the small remainder of energy (for example from alcohol) rather than trading one macronutrient for another.
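
To make the coefficients concrete, the sketch below plugs the estimates printed by summary(logit_model) into the logistic formula for one hypothetical participant and rescales two odds ratios to larger increments. The participant's values are invented for illustration and do not correspond to an NHANES row; the names beta, x, log_odds, and prob are likewise assumptions of this example.

```r
# Coefficient estimates copied from the summary(logit_model) output above
beta = c(
  intercept   = -10.625373,
  age         =   0.062769,
  sexFemale   =  -0.555769,
  bmi         =   0.076639,
  carb_pct    =   0.029839,
  fat_pct     =   0.033256,
  protein_pct =   0.044181
)

# Hypothetical participant: 60-year-old male, BMI 30, 50/35/15 percent carb/fat/protein
x = c(intercept = 1, age = 60, sexFemale = 0, bmi = 30,
      carb_pct = 50, fat_pct = 35, protein_pct = 15)

log_odds = sum(beta * x[names(beta)])   # linear predictor
prob     = plogis(log_odds)             # inverse logit: 1 / (1 + exp(-log_odds))

# Odds ratios for larger, easier-to-read increments
or_age_10yr = exp(10 * beta[["age"]])   # ten-year age difference
or_bmi_5    = exp(5 * beta[["bmi"]])    # five-unit BMI difference

c(probability = prob, or_age_10yr = or_age_10yr, or_bmi_5 = or_bmi_5)
```

With the fitted model object available, predict(logit_model, newdata = ..., type = "response") performs the same probability calculation without copying coefficients by hand.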

Overall, this logistic regression provides a clear mathematical baseline for the project. It is consistent with age and BMI being strong risk factors for Type 2 diabetes in the NHANES sample and indicates that sex and macronutrient composition are also statistically associated with the probability of a diagnosis, although the cross-sectional data cannot establish that these associations are causal.

Step 8: Exploratory data analysis

# Distribution of the diabetes outcome
table(model_data$diabetes)

## 
##    0    1 
## 4040  740

prop.table(table(model_data$diabetes))

## 
##         0         1 
## 0.8451883 0.1548117

Before BMI is merged in (end of Step 5), the dataset contains 4,084 non-diabetic and 754 diabetic adults (15.6% with diabetes). After restricting to participants with complete data on all predictors used in the model, including BMI, the modelling sample includes 4,040 non-diabetic and 740 diabetic adults (15.5% with diabetes), so the prevalence is essentially unchanged.

# BMI by diabetes status – violin plot (requires ggplot2)
library(ggplot2)
ggplot(model_data, aes(x = factor(diabetes), y = bmi, fill = factor(diabetes))) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  stat_summary(fun = median, geom = "point", size = 2, color = "black") +
  labs(
    x = "Diabetes status (0 = no, 1 = yes)",
    y = "BMI (kg/m^2)",
    title = "BMI by diabetes status (violin plot)"
  )

# Carbohydrate percentage by diabetes status
ggplot(model_data, aes(x = factor(diabetes), y = carb_pct)) +
  geom_boxplot() +
  labs(
    x = "Diabetes status (0 = no, 1 = yes)",
    y = "Carbohydrate % of total energy",
    title = "Carbohydrate percentage by diabetes status"
  )

Here I compare BMI and the carbohydrate share of total energy between adults with and without diabetes. The violin plot shows that BMI is clearly higher on average among people with diabetes. In contrast, the boxplot of carbohydrate percentage shows much more overlap between the two groups. This suggests that carbohydrate share on its own separates the groups poorly, which is why I use multivariable models that combine all three macronutrient percentages with BMI, age, and sex.

Step 9: Random forest model (NHANES data)

Here I fit a random forest classification model using the same predictors as the logistic regression model in order to capture possible non-linear relationships and interactions between the variables.

library(randomForest)
library(pROC)

set.seed(123)

# Random forest on the same training data
rf_model = randomForest(
  factor(diabetes) ~ age + sex + bmi + carb_pct + fat_pct + protein_pct,
  data = train,
  ntree = 500,
  mtry  = 3,
  importance = TRUE
)

rf_model

## 
## Call:
##  randomForest(formula = factor(diabetes) ~ age + sex + bmi + carb_pct +      fat_pct + protein_pct, data = train, ntree = 500, mtry = 3,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 16.53%
## Confusion matrix:
##      0   1 class.error
## 0 2713 106  0.03760199
## 1  447  80  0.84819734

# Predicted probabilities and classes on the test set
rf_prob = predict(rf_model, newdata = test, type = "prob")[, "1"]
rf_pred = ifelse(rf_prob > 0.5, 1, 0)

# Accuracy and confusion table
accuracy_rf = mean(rf_pred == test$diabetes)
accuracy_rf

## [1] 0.8382148

table(Predicted = rf_pred, Actual = test$diabetes)

##          Actual
## Predicted    0    1
##         0 1173  184
##         1   48   29

# ROC curve and AUC
roc_rf = roc(test$diabetes, rf_prob)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc_rf = auc(roc_rf)
auc_rf

## Area under the curve: 0.7592

plot(roc_rf, main = "ROC Curve – Random Forest (NHANES)")

# Variable importance plot
varImpPlot(rf_model, main = "Variable Importance – Random Forest")

Random forest results

Using the same train–test split as the logistic regression model, the random forest reaches a test accuracy of 0.838 (about 83.8%) and an AUC of 0.759 (about 0.76). In comparison, the logistic regression model achieved an accuracy of 0.858 (about 85.8%) and an AUC of 0.789 (about 0.79), so in this dataset the more flexible tree-based model does not outperform the simpler parametric model in overall predictive performance.

The random forest confusion matrix shows that, at a probability threshold of 0.5, the model correctly classifies the majority of non-diabetic participants but still misses most true diabetes cases. In the test set it correctly flags 29 adults with diabetes and misclassifies 184 as non-diabetic (sensitivity 29/213, about 13.6%), while producing 48 false positives among 1,221 non-diabetic adults (specificity about 96.1%). Relative to logistic regression, the random forest slightly improves the number of correctly identified diabetes cases but at the cost of more false positives, which reflects the usual trade-off between sensitivity and specificity in imbalanced health datasets.

The ROC curve for the random forest lies well above the diagonal reference line, and the AUC around 0.76 indicates reasonable ability to discriminate between adults with and without diabetes, although this discrimination is slightly weaker than that of the logistic regression model. The variable importance plot highlights age and BMI as the strongest predictors, which is consistent with established clinical knowledge about Type 2 diabetes risk. The macronutrient percentage variables (fat_pct, protein_pct, and carb_pct) also contribute to the model, suggesting that the ratio of carbohydrates, fat, and protein provides additional information beyond demographics and BMI when predicting diabetes risk in the NHANES sample.
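The threshold-dependent metrics discussed above can be recomputed from the printed confusion matrix. The sketch below uses base R only; `class_metrics` is a hypothetical helper name, and the counts are copied from the test-set table in this step.

```r
# Test-set confusion matrix for the random forest, copied from the output above
# (rows = predicted class, columns = actual class; matrix() fills column-wise)
cm <- matrix(
  c(1173, 48, 184, 29),
  nrow = 2,
  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1"))
)

# Hypothetical helper: threshold-dependent metrics from a 2 x 2 table
class_metrics <- function(cm) {
  tn <- cm["0", "0"]; fn <- cm["0", "1"]
  fp <- cm["1", "0"]; tp <- cm["1", "1"]
  c(
    accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),  # share of true cases that are flagged
    specificity = tn / (tn + fp),  # share of non-cases correctly cleared
    ppv         = tp / (tp + fp)   # share of flagged adults who have diabetes
  )
}

rf_metrics <- class_metrics(cm)
round(rf_metrics, 3)
```

The same helper can be applied to any other 2 x 2 table laid out with predicted classes in rows and actual classes in columns, which makes it easy to compare sensitivity and specificity across models rather than accuracy alone.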

Step 10: XGBoost model (NHANES data)

Here I fit a gradient boosting model (XGBoost) using the same predictors as the logistic regression and random forest models.

library(xgboost)
library(pROC)

# Prepare numeric matrices for XGBoost (encode sex as 0/1)
train_x = train %>%
  mutate(sex_female = ifelse(sex == "Female", 1, 0)) %>%
  select(age, bmi, carb_pct, fat_pct, protein_pct, sex_female) %>%
  as.matrix()

test_x = test %>%
  mutate(sex_female = ifelse(sex == "Female", 1, 0)) %>%
  select(age, bmi, carb_pct, fat_pct, protein_pct, sex_female) %>%
  as.matrix()

train_y = train$diabetes
test_y  = test$diabetes

dtrain = xgb.DMatrix(data = train_x, label = train_y)
dtest  = xgb.DMatrix(data = test_x,  label = test_y)

set.seed(123)

xgb_model = xgb.train(
  data = dtrain,
  nrounds = 200,
  objective = "binary:logistic",
  eval_metric = "auc",
  params = list(
    max_depth = 3,
    eta = 0.05,
    subsample = 0.8,
    colsample_bytree = 0.8
  ),
  watchlist = list(train = dtrain, test = dtest),
  verbose = 0
)

xgb_model

## ##### xgb.Booster
## raw: 222.4 Kb 
## call:
##   xgb.train(params = list(max_depth = 3, eta = 0.05, subsample = 0.8, 
##     colsample_bytree = 0.8), data = dtrain, nrounds = 200, watchlist = list(train = dtrain, 
##     test = dtest), verbose = 0, objective = "binary:logistic", 
##     eval_metric = "auc")
## params (as set within xgb.train):
##   max_depth = "3", eta = "0.05", subsample = "0.8", colsample_bytree = "0.8", objective = "binary:logistic", eval_metric = "auc", validate_parameters = "TRUE"
## xgb.attributes:
##   niter
## callbacks:
##   cb.evaluation.log()
## # of features: 6 
## niter: 200
## nfeatures : 6 
## evaluation_log:
##   iter train_auc  test_auc
##  <num>     <num>     <num>
##      1 0.7746967 0.7474151
##      2 0.8014146 0.7670385
##    ---       ---       ---
##    199 0.8771790 0.7902127
##    200 0.8773513 0.7902281

# Predictions on test set
xgb_prob = predict(xgb_model, newdata = dtest)
xgb_pred = ifelse(xgb_prob > 0.5, 1, 0)

# Accuracy and confusion matrix
accuracy_xgb = mean(xgb_pred == test_y)
accuracy_xgb

## [1] 0.848675

table(Predicted = xgb_pred, Actual = test_y)

##          Actual
## Predicted    0    1
##         0 1201  197
##         1   20   16

# ROC curve and AUC
roc_xgb = roc(test_y, xgb_prob)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc_xgb = auc(roc_xgb)
auc_xgb

## Area under the curve: 0.7902

plot(roc_xgb, main = "ROC Curve – XGBoost (NHANES)")

# Variable importance
xgb_importance = xgb.importance(model = xgb_model)
xgb_importance

##        Feature       Gain      Cover  Frequency
##         <char>      <num>      <num>      <num>
## 1:         age 0.47296653 0.28485830 0.19618745
## 2:         bmi 0.20635883 0.27225691 0.25496426
## 3: protein_pct 0.12290349 0.14401826 0.20015886
## 4:     fat_pct 0.09328170 0.13349679 0.15885624
## 5:    carb_pct 0.07372935 0.12362381 0.14853058
## 6:  sex_female 0.03076010 0.04174592 0.04130262

xgb.plot.importance(xgb_importance, top_n = 10,
                    main = "Variable Importance – XGBoost")

XGBoost results

The XGBoost model obtains a test accuracy of 0.849 (about 84.9%) and an AUC of 0.790. These values are very close to the logistic regression performance (accuracy about 85.8%, AUC about 0.79), and the AUC is somewhat higher than that of the random forest (about 0.76). Overall, the three approaches achieve broadly similar discrimination between adults with and without diabetes in this NHANES subsample.

The confusion matrix shows that, at a 0.5 probability threshold, XGBoost correctly classifies 1,201 non-diabetic adults and 16 adults with diabetes, while misclassifying 197 diabetes cases as non-diabetic and producing 20 false positives. This corresponds to a sensitivity of about 7.5% (16/213) and a specificity of about 98.4% (1,201/1,221). This pattern is consistent with the class imbalance in the data and illustrates the difficulty of capturing all diabetes cases without greatly increasing the false positive rate.

The ROC curve for XGBoost lies well above the diagonal reference line, and the AUC around 0.79 indicates good, though not perfect, ability to separate diabetic and non-diabetic participants. The feature importance plot again highlights age and BMI as the dominant predictors of diabetes risk, followed by the macronutrient percentage variables (protein_pct, fat_pct, and carb_pct), with sex having the smallest contribution. This ranking reinforces the earlier findings: demographic factors and body size are the primary drivers of Type 2 diabetes risk, while the macronutrient ratio provides additional but more subtle information about risk beyond age and BMI.
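To make the AUC interpretation concrete: the AUC equals the probability that a randomly chosen adult with diabetes receives a higher predicted score than a randomly chosen adult without diabetes, with ties counted as one half. The sketch below computes this from ranks in base R on a small toy example. The data are invented for illustration, and `auc_rank` is a hypothetical helper, not the pROC function used above.

```r
# AUC via the rank-sum (Mann-Whitney) formula
auc_rank <- function(score, label) {
  r  <- rank(score)            # mid-ranks, so ties count one half
  n1 <- sum(label == 1)        # number of cases
  n0 <- sum(label == 0)        # number of controls
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data (illustrative only, not NHANES)
label <- c(0, 0, 0, 0, 1, 1)
score <- c(0.05, 0.10, 0.20, 0.40, 0.30, 0.60)

auc_toy <- auc_rank(score, label)

# Direct check: share of case-control pairs in which the case scores higher
auc_pairs <- mean(outer(score[label == 1], score[label == 0], ">"))

c(rank_formula = auc_toy, pairwise = auc_pairs)
```

Because the AUC depends only on how cases and controls are ranked, it is unaffected by the choice of the 0.5 threshold, which is why it can look reasonable even when sensitivity at that threshold is low.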

Step 11: Model Comparison

This section compares the predictive performance of the three models—Logistic Regression, Random Forest, and XGBoost—based on accuracy and AUC.

library(dplyr)

model_summary = tibble::tibble(
  Model = c("Logistic Regression", "Random Forest", "XGBoost"),
  Accuracy = c(accuracy_logit, accuracy_rf, accuracy_xgb),
  AUC = c(as.numeric(auc_logit),
          as.numeric(auc_rf),
          as.numeric(auc_xgb))
)

model_summary

## # A tibble: 3 × 3
##   Model               Accuracy   AUC
##   <chr>                  <dbl> <dbl>
## 1 Logistic Regression    0.858 0.789
## 2 Random Forest          0.838 0.759
## 3 XGBoost                0.849 0.790

Interpretation

The model comparison shows that logistic regression has the highest test accuracy at about 0.858, with an AUC of roughly 0.789. XGBoost performs very similarly, with accuracy around 0.849 and the largest AUC at about 0.790. The random forest model has the lowest accuracy (about 0.838) and the lowest AUC (about 0.759).

Overall, these differences are fairly small. This suggests that the simpler logistic regression model already captures most of the signal that links age, BMI, and macronutrient percentages to Type 2 diabetes risk in this NHANES subsample. The tree-based models provide useful confirmation and an alternative view of variable importance, but they do not deliver a gain in predictive performance compared with logistic regression.
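One way to judge how small these differences are is to compare each accuracy with a trivial baseline and with its approximate sampling error. The sketch below is a rough check in base R, using the test-set counts from the confusion matrices above. The binomial standard error treats each model separately and ignores the fact that all three are scored on the same test set, so it should be read as an order-of-magnitude guide only.

```r
# Test-set class counts, taken from the confusion matrices reported above
n_non_diabetic <- 1173 + 48   # actual class 0
n_diabetic     <- 184 + 29    # actual class 1
n_test         <- n_non_diabetic + n_diabetic

# Accuracy of a rule that labels every adult as non-diabetic
baseline_accuracy <- n_non_diabetic / n_test

# Reported test accuracies (rounded values from the comparison table)
accuracy <- c(logistic = 0.858, random_forest = 0.838, xgboost = 0.849)

# Approximate binomial standard error of each accuracy estimate
se_accuracy <- sqrt(accuracy * (1 - accuracy) / n_test)

round(baseline_accuracy, 3)
round(rbind(accuracy,
            se_accuracy,
            gain_over_baseline = accuracy - baseline_accuracy), 3)
```

Under these assumptions the accuracies of all three models sit within roughly one to two standard errors of the majority-class baseline, which supports relying on AUC, sensitivity, and specificity rather than accuracy when comparing them.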

Step 12: SHAP for XGBoost

library(SHAPforxgboost)

# SHAP values for XGBoost model
shap_values = shap.values(
  xgb_model = xgb_model,
  X_train   = train_x
)

# Long-format SHAP table for plotting
shap_long = shap.prep(
  shap_contrib = shap_values$shap_score,
  X_train      = train_x
)

# Overall SHAP summary plot (no xlab argument; some versions don't support it)
shap.plot.summary(shap_long, scientific = FALSE)

SHAP analysis for the XGBoost model

To interpret how the boosted tree model uses each predictor, I computed SHAP (SHapley Additive exPlanations) values for the XGBoost model. SHAP values decompose the predicted log-odds of diabetes for each individual into additive contributions from the input features. A positive SHAP value means that a feature pushes the prediction toward a higher diabetes risk, whereas a negative value pushes the prediction toward a lower risk.
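The additive decomposition can be illustrated with a small numerical sketch. All numbers below are invented for illustration and are not output from the fitted model; the point is only that a baseline term plus the per-feature contributions gives the predicted log-odds, which the logistic function then converts into a probability.

```r
# Hypothetical SHAP contributions (log-odds scale) for one adult;
# these values are made up and are NOT taken from the fitted XGBoost model
shap_row <- c(age = 0.90, bmi = 0.35, protein_pct = 0.05,
              fat_pct = -0.02, carb_pct = -0.08, sex_female = -0.10)

# Hypothetical baseline log-odds (the model's average output)
bias <- -2.00

# Additivity: baseline plus feature contributions gives the predicted log-odds
log_odds <- bias + sum(shap_row)

# The logistic function converts log-odds into a predicted probability
prob <- plogis(log_odds)

c(log_odds = log_odds, probability = prob)
```

In this made-up example, age and BMI push the prediction upward while the remaining features pull it slightly downward. For the fitted model, the analogous check would be that the row sums of the SHAP matrix plus the baseline term match the model's predictions on the log-odds (margin) scale.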

The SHAP summary plot shows that age has the largest average absolute SHAP value (about 1.2), followed by BMI (about 0.44). This confirms that age and body mass index are the dominant drivers of the XGBoost predictions. For most participants, low age and low BMI are associated with negative SHAP values, reducing the predicted risk, whereas high age and high BMI are associated with positive SHAP values, increasing the predicted risk of Type 2 diabetes.

The macronutrient percentages (protein_pct, fat_pct, and carb_pct) have smaller but still noticeable SHAP magnitudes, indicating that the balance of carbohydrates, fat, and protein modifies risk on top of age and BMI. Their points are distributed on both sides of zero, which suggests that different macronutrient patterns can either raise or lower the model’s predicted risk depending on the overall combination of predictors. The sex_female variable has the smallest SHAP contribution, consistent with its relatively low importance in the other models; sex is encoded as a dummy variable (1 = female, 0 = male), so its SHAP values represent the effect of being female relative to males as the reference group.

Overall, the SHAP analysis supports the conclusions from the logistic regression, random forest, and XGBoost models: age and BMI are the strongest predictors of Type 2 diabetes in this NHANES subsample, while the macronutrient ratio provides additional, more nuanced information that is best captured when it is combined with demographic factors in multivariable models.

Step 13: Macronutrient ratio analysis

In this step I construct simple macronutrient ratio variables and compare them between adults with and without diabetes. The goal is to move beyond looking at each macronutrient separately and instead examine their balance.

# Work with the cleaned modelling dataset
macro_data = model_data %>%
  mutate(
    protein_to_carb = protein_pct / carb_pct,
    fat_to_carb     = fat_pct / carb_pct
  )

# Check basic summaries by diabetes status
macro_ratio_summary = macro_data %>%
  group_by(diabetes) %>%
  summarise(
    n                 = n(),
    mean_carb_pct     = mean(carb_pct, na.rm = TRUE),
    mean_fat_pct      = mean(fat_pct, na.rm = TRUE),
    mean_protein_pct  = mean(protein_pct, na.rm = TRUE),
    mean_protein_carb = mean(protein_to_carb, na.rm = TRUE),
    mean_fat_carb     = mean(fat_to_carb, na.rm = TRUE)
  )

macro_ratio_summary

## # A tibble: 2 × 7
##   diabetes     n mean_carb_pct mean_fat_pct mean_protein_pct mean_protein_carb
##      <dbl> <int>         <dbl>        <dbl>            <dbl>             <dbl>
## 1        0  4040          48.0         35.2             15.3             0.363
## 2        1   740          47.1         36.8             15.9             0.388
## # ℹ 1 more variable: mean_fat_carb <dbl>

Macronutrient percentages and ratios by diabetes status

library(dplyr)
library(knitr)

macro_ratio_table = macro_ratio_summary %>%
  mutate(
    diabetes = ifelse(diabetes == 1, "Diabetes", "No diabetes"),
    mean_carb_pct    = round(mean_carb_pct, 1),
    mean_fat_pct     = round(mean_fat_pct, 1),
    mean_protein_pct = round(mean_protein_pct, 1),
    mean_protein_carb = round(mean_protein_carb, 3),
    mean_fat_carb     = round(mean_fat_carb, 3)
  ) %>%
  rename(
    `Diabetes status`          = diabetes,
    `Mean carb % of energy`    = mean_carb_pct,
    `Mean fat % of energy`     = mean_fat_pct,
    `Mean protein % of energy` = mean_protein_pct,
    `Mean protein / carb`      = mean_protein_carb,
    `Mean fat / carb`          = mean_fat_carb
  )

kable(
  macro_ratio_table,
  caption = "Average macronutrient percentages and ratios by diabetes status"
)

Average macronutrient percentages and ratios by diabetes status
Diabetes status	n	Mean carb % of energy	Mean fat % of energy	Mean protein % of energy	Mean protein / carb	Mean fat / carb
No diabetes	4040	48.0	35.2	15.3	0.363	0.829
Diabetes	740	47.1	36.8	15.9	0.388	0.931

Interpretation

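Raw dietary percentages alone do not sharply separate diabetic and non-diabetic adults: the group means differ by less than two percentage points for each macronutrient, although the mean fat-to-carbohydrate ratio is somewhat higher among adults with diabetes (0.931 versus 0.829). However, when combined with age, sex, and BMI inside multivariable models, the macronutrient percentages become statistically significant predictors of diabetes status. This shows why model-based analysis is needed and why simple descriptive averages can be misleading in public health nutrition research.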

Discussion

This project set out to model the probability of a Type 2 diabetes diagnosis as a function of macronutrient intake and demographic variables using NHANES 2017–2018 data. The main mathematical object throughout the analysis is the conditional probability that an adult has diabetes given their age, sex, BMI, and macronutrient energy shares. In the Methodology section this was formalised using a logistic regression model, and two non-parametric machine learning models (random forest and XGBoost) were introduced to relax the linear log-odds assumption and allow for non-linearities and interactions.

All three models were trained on the same merged NHANES sample of adults with complete data and evaluated on an independent test set. The logistic regression model achieved a test accuracy of about 0.858 and an AUC of approximately 0.789, indicating reasonably strong discrimination between adults with and without a diabetes diagnosis. The estimated odds ratios showed that age and BMI are the dominant predictors of diabetes risk, with each additional year of age and each additional unit of BMI multiplying the odds of diabetes by a factor slightly above one. Sex also played a role: females had lower odds of diabetes than males after controlling for the other variables.

The macronutrient percentage variables had odds ratios slightly above one, suggesting that higher shares of total energy from carbohydrate, fat, and protein are each associated with modest increases in the odds of diabetes, conditional on age, sex, and BMI. Because the macronutrient percentages sum to approximately one hundred, these coefficients must be interpreted under a compositional constraint: increasing one percentage necessarily decreases at least one of the others. This helps explain why simple group differences in mean percentages appear small, while the multivariable models still detect statistically meaningful associations.
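The compositional constraint can be made explicit with a substitution argument. As a sketch, write \(c\), \(f\), and \(p\) for the percentages of energy from carbohydrate, fat, and protein, and \(\beta_c\), \(\beta_f\), \(\beta_p\) for their logistic regression coefficients (this notation is introduced here for illustration only):

\[ \log \left( \frac{\Pr(Y = 1 \mid X)}{1 - \Pr(Y = 1 \mid X)} \right) = \beta_0 + \beta_c \, c + \beta_f \, f + \beta_p \, p + \cdots \]

If one percentage point of energy is moved from carbohydrate to fat, with protein and all other predictors held fixed, then \(c\) falls by one and \(f\) rises by one, so the log-odds change by

\[ \Delta = \beta_f - \beta_c, \qquad \text{with odds ratio } \exp(\beta_f - \beta_c). \]

Under this reading, the contrast between two macronutrient coefficients, rather than either coefficient on its own, describes a realistic change in diet. All three percentages can enter the model together only because they sum to approximately, not exactly, one hundred.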

The random forest and XGBoost models were used to check whether more flexible, tree-based methods could substantially improve predictive performance. Both achieved test accuracies in the range 0.84–0.85 and AUC values around 0.76–0.79. In particular, XGBoost achieved an AUC of about 0.79, which is essentially equal to the logistic regression AUC and slightly higher than the random forest AUC. For this dataset and set of predictors, the flexible tree-based models do not substantially outperform the simpler parametric model, which suggests that the main relationships between predictors and diabetes status are well captured by a linear log-odds structure.

Variable importance measures from the random forest and XGBoost models are consistent with the logistic regression results. Age and BMI emerge as the most influential predictors, followed by the macronutrient percentages, with sex contributing the least. The SHAP analysis for XGBoost reinforces this picture by showing that age and BMI have the largest absolute contributions to the predicted log-odds of diabetes, while the macronutrient percentages provide smaller but non-negligible adjustments to individual risk. This agreement across three modelling frameworks strengthens the conclusion that demographic factors and body size are the primary determinants of risk, while macronutrient balance refines the predicted risk once these variables are taken into account.

The exploratory plots and macronutrient ratio summaries also clarify why descriptive comparisons alone can be misleading. The violin plot of BMI by diabetes status shows a clear upward shift in the BMI distribution among adults with diabetes. In contrast, the boxplot of carbohydrate percentage and the table of mean macronutrient percentages and ratios show only modest differences between groups. Looking at macronutrient percentages alone is therefore not sufficient to separate diabetic and non-diabetic adults. The multivariable models, which condition on age, sex, and BMI, reveal that macronutrient composition still carries predictive information once key demographic and anthropometric confounders are controlled for.

Limitations

There are several important limitations that should be considered when interpreting these results. First, the NHANES data used in this project are cross-sectional. Each participant is observed at only one point in time, so the models estimate associations between predictors and current diabetes status rather than causal effects of macronutrient composition on the future development of Type 2 diabetes. It is possible that many adults with diabetes have already modified their diet in response to a diagnosis or medical advice, which would attenuate or even reverse some of the associations between current dietary intake and disease status.

Second, the dietary variables are based on 24-hour recall data from a single day. A one-day recall is an imperfect proxy for an individual’s usual long-term diet and is subject to both random day-to-day variability and systematic recall bias. Random measurement error in the macronutrient percentages typically biases regression coefficients toward zero and can make the estimated effects of dietary composition appear smaller than they truly are. Similarly, the diabetes outcome is based on self-report of a prior diagnosis, which may miss undiagnosed cases and relies on participants’ understanding of their medical history.
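The attenuation effect of random measurement error can be illustrated with a small simulation. The sketch below uses simulated data only (not NHANES) and a single exposure; the true slope, intercept, and noise level are arbitrary values chosen for illustration.

```r
set.seed(1)
n <- 5000

# Simulated "usual intake" (standardised) and a binary outcome that depends on it
x_true <- rnorm(n)
beta   <- 0.5   # true log-odds slope, chosen for illustration
y      <- rbinom(n, 1, plogis(-1.7 + beta * x_true))

# A single-day measurement: usual intake plus independent day-to-day noise
x_obs <- x_true + rnorm(n, sd = 1)

# Logistic regression slopes using the true and the noisy exposure
slope_true <- unname(coef(glm(y ~ x_true, family = binomial))["x_true"])
slope_obs  <- unname(coef(glm(y ~ x_obs,  family = binomial))["x_obs"])

c(true_exposure = slope_true, noisy_exposure = slope_obs)
```

In this setup the slope estimated from the noisy exposure is noticeably closer to zero than the slope estimated from the true exposure. With several correlated, error-prone predictors the direction of the bias is less predictable, so this should be read as an illustration of the basic mechanism rather than a statement about the fitted NHANES models.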

Third, this analysis did not incorporate the complex survey design of NHANES. The dataset includes sampling weights, strata, and primary sampling units that are designed to make estimates representative of the U.S. civilian non-institutionalised population. Here the models were fitted as if the data arose from a simple random sample, which means that the fitted coefficients and performance metrics describe the particular analytical subsample rather than the fully weighted national population. Incorporating survey weights and design variables in specialised survey regression and machine learning methods would be necessary for nationally representative inference.

Fourth, the modelling strategy focuses on a relatively small set of predictors: age, sex, BMI, and macronutrient percentages. Many potentially important covariates, such as physical activity level, medication use, income, education, and family history of diabetes, were not included in the final models in order to maintain a clear, interpretable framework. Omitting these variables leaves room for residual confounding. For example, adults with higher BMI may differ systematically in both diet and lifestyle from those with lower BMI in ways that are not fully captured by the available variables.

Finally, the class imbalance in the outcome and the use of a fixed 0.5 probability threshold influence the interpretation of accuracy and confusion matrices. About 15 to 16 percent of adults in the analytical sample have diabetes, so a classifier that focuses on minimising overall error will tend to prioritise correct classification of the majority non-diabetic class. In the test set, labelling every adult as non-diabetic would already give an accuracy of about 0.85 (1,221 of 1,434), so accuracy alone says little about how well diabetes cases are detected. This is reflected in the relatively high specificity and lower sensitivity of all three models. Alternative strategies, such as lowering the classification threshold, using class weights, or optimising a cost-sensitive loss function, could be more appropriate in clinical screening settings where missing diabetes cases carries a higher cost than false positives.
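The effect of the threshold can be sketched by sweeping it over a range of values and recording sensitivity and specificity at each one. The example below uses simulated predicted probabilities with roughly 15% prevalence; the data, the helper `sens_spec`, and the use of Youden's J as a selection rule are illustrative assumptions, not part of the analysis above.

```r
set.seed(42)

# Simulated test set with roughly 15% prevalence (illustrative, not NHANES)
n <- 2000
y <- rbinom(n, 1, 0.15)

# Hypothetical predicted probabilities: cases tend to score higher than non-cases
p_hat <- plogis(rnorm(n, mean = ifelse(y == 1, -1.0, -2.2), sd = 1))

# Sensitivity and specificity at a given probability threshold
sens_spec <- function(threshold, p_hat, y) {
  pred <- as.integer(p_hat > threshold)
  c(threshold   = threshold,
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}

thresholds <- seq(0.05, 0.50, by = 0.05)
threshold_table <- as.data.frame(
  t(sapply(thresholds, sens_spec, p_hat = p_hat, y = y))
)

# Youden's J = sensitivity + specificity - 1 is one common way to pick a cut-off
threshold_table$youden_j <-
  threshold_table$sensitivity + threshold_table$specificity - 1

threshold_table
threshold_table[which.max(threshold_table$youden_j), ]
```

Lowering the threshold raises sensitivity and lowers specificity. In a screening setting the cut-off would ideally be chosen on a validation set, using the relative costs of missed cases and false positives rather than a generic rule.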

Conclusion

This capstone project developed and compared three predictive models for Type 2 diabetes risk using NHANES 2017–2018 data, with a particular focus on the role of macronutrient composition. Starting from the logistic regression model

\[ \log \left( \frac{\Pr(Y = 1 \mid X)}{1 - \Pr(Y = 1 \mid X)} \right) = X^\top \beta, \]

and extending to random forest and XGBoost, the analysis quantified how age, sex, BMI, and the percentages of total energy from carbohydrate, fat, and protein jointly influence the probability of a diabetes diagnosis.

Across all methods, age and BMI emerged as the strongest predictors of diabetes risk, both in terms of odds ratios and in terms of variable importance measures. Females had lower odds of diabetes than males after adjustment for other factors. The macronutrient percentages had smaller but statistically meaningful effects, with higher shares of energy from carbohydrate, fat, and protein each associated with modest increases in the odds of diabetes when age, sex, and BMI were held constant. The similar performance of logistic regression and XGBoost, with test AUC values around 0.79 in both cases, suggests that the dominant relationships between these predictors and diabetes status are well approximated by a linear log-odds structure, and that the additional flexibility of tree-based ensembles provides little or no gain in this setting.

Descriptive comparisons of macronutrient percentages and ratios by diabetes status showed that group means are quite close and that diet alone does not sharply separate diabetic and non-diabetic adults. The multivariable models clarify this picture by demonstrating that macronutrient balance contributes to risk once key demographic and anthropometric factors are taken into account. From a public health perspective, the results reinforce the central importance of age and body mass in diabetes risk while indicating that dietary composition carries additional information that can refine risk stratification.

Overall, the project illustrates how mathematical modelling and machine learning can be combined with large-scale survey data to study chronic disease risk. Logistic regression provides a transparent baseline model with coefficients that have clear odds ratio interpretations, while random forest, XGBoost, and SHAP analysis offer complementary, non-parametric views of predictor importance and feature effects. Together, these tools support a nuanced understanding of how demographic factors, body size, and macronutrient balance interact in shaping Type 2 diabetes risk in the NHANES adult population.

Future work

There are several natural extensions that could strengthen and deepen this analysis. One direction is to refine the definition of the diabetes outcome and dietary exposures. For example, future work could use laboratory measures such as fasting plasma glucose or HbA1c to identify diabetes and prediabetes, rather than relying solely on self-reported diagnosis. On the dietary side, incorporating both recall days, when available, and averaging intakes across days would provide a better approximation to usual intake and reduce measurement error.

A second direction is to adopt methods specifically designed for compositional data. Because the macronutrient percentages are constrained to sum to approximately one hundred, standard regression on raw percentages can be affected by collinearity and interpretational difficulties. Log-ratio transformations, such as the additive or centred log-ratio transforms, re-express compositions as real-valued log-ratio coordinates and may yield more stable and interpretable models of the association between nutrient balance and disease risk.
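As a minimal sketch of what these transforms look like, the example below applies the additive and centred log-ratio transforms to two three-part compositions in base R. For convenience the rows are the rounded group means from the ratio table; in an actual analysis the transform would be applied to each participant's own percentages, and zero shares would need special handling.

```r
# Two example compositions: % of energy from carbohydrate, fat, and protein
# (rounded group means from the ratio table, used here purely as example rows)
macro <- rbind(
  no_diabetes = c(carb = 48.0, fat = 35.2, protein = 15.3),
  diabetes    = c(carb = 47.1, fat = 36.8, protein = 15.9)
)

# Closure: rescale each row so that the three parts sum to 1
comp <- macro / rowSums(macro)

# Additive log-ratio (ALR) with carbohydrate as the reference part
alr <- log(comp[, c("fat", "protein")] / comp[, "carb"])

# Centred log-ratio (CLR): log of each part relative to the row's geometric mean
clr <- log(comp) - rowMeans(log(comp))

alr
clr
```

The ALR columns could replace the raw percentages as predictors in the logistic regression, so that each coefficient describes a change in the balance between one macronutrient and carbohydrate. CLR coordinates sum to zero within each row, so at most two of the three can be used as predictors at once.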

Third, future analyses could incorporate a richer set of covariates, including physical activity, smoking status, socioeconomic variables, and family history, and explore interaction terms between diet, BMI, and lifestyle factors. This would allow for a more realistic assessment of how macronutrient balance interacts with other risk factors. More advanced machine learning models, such as calibrated gradient boosting, support vector machines, or neural networks, could also be evaluated, together with techniques for handling class imbalance and for assessing model calibration.

Finally, an important extension would be to translate predictive models into practical risk assessment tools. For example, a simplified logistic regression model based on age, BMI, and a small number of dietary ratios could be turned into a web-based calculator or decision aid for use in primary care or public health screening. Such tools would need to be validated in independent samples and possibly adapted to specific subpopulations, but they illustrate how mathematical and computational models can eventually inform personalised prevention strategies for Type 2 diabetes.

CAPSTONE_PROJECT

Oma Tonukari

2025-11-30
