WQD7004 - Programming in Data Science

TOPIC:

Predictive Modeling of Metabolic and Renal Health Outcomes in a Large-Scale Population Survey

Group 12
Student Name	Matric ID	Role
Anisyahayati binti Ismail	17024369	Group Leader and Data Cleaning
Ren Lin	22082936	Exploratory Data Analysis (EDA)
Li Yaxin	25063108	Classification Modeling
Wei Qian	25057795	Regression Modeling
Chia Ai Pei	23084840	R Markdown and Documentation

1.📘 Introduction

Metabolic and renal disorders, such as diabetes and Chronic Kidney Disease (CKD), are major public health concerns worldwide. Early identification of individuals at risk can enable timely interventions and improved clinical outcomes. Large-scale population surveys, such as the National Health and Nutrition Examination Survey (NHANES), provide comprehensive health, lifestyle, and laboratory data that can be leveraged for predictive modeling.

This project focuses on the 2013–2014 NHANES survey cycle to investigate relationships between demographic characteristics, lifestyle behaviors, physiological measurements, and laboratory indicators with metabolic and renal health outcomes. R programming plays a central role in this project, serving as the platform for data exploration, processing, modeling, and reporting, and enabling a structured, reproducible workflow for deriving insights from complex health data.

- Source: National Health and Nutrition Examination Survey from Kaggle

- URL: https://www.kaggle.com/datasets/cdc/national-health-and-nutrition-examination-survey

2.🎯 Project Objectives

The primary objectives of this project are to apply predictive modeling techniques to identify and quantify key factors associated with metabolic and renal health outcomes. Specifically, the project aims to:

Develop a classification model to predict the presence of Chronic Kidney Disease (CKD) based on demographic variables, lifestyle factors, blood pressure, and metabolic laboratory indicators. This will allow for stratification of individuals into risk categories and highlight the most influential predictors.
Develop a regression model to predict 2-hour plasma glucose levels measured during an oral glucose tolerance test (OGTT) using routinely collected demographic, lifestyle, anthropometric, and laboratory variables. This will provide insights into which factors most strongly influence glucose metabolism and allow for quantitative predictions of glycemic status.

By addressing both a classification problem (CKD risk) and a regression problem ( OGTT glucose levels), this project demonstrates the application of predictive modeling techniques across different types of health-related outcomes using a large-scale population dataset.

3.❓Project Questions

To achieve the objectives of this project, two analytical questions were formulated, each corresponding to a different type of predictive modeling task:

Classification Question – Chronic Kidney Disease (CKD) Risk:
Can we predict the presence of CKD using demographic characteristics, lifestyle factors, blood pressure, and metabolic laboratory indicators?

This question is framed as a classification problem, where individuals are categorized into risk groups (e.g., CKD present vs. CKD absent) based on selected health indicators.
Developing this model will help identify key predictors of CKD and evaluate the performance of machine learning methods in stratifying disease risk.
Regression Question – 2-Hour Plasma Glucose Level: Can we predict the 2-hour plasma glucose level measured during an oral glucose tolerance test (OGTT) using demographic, lifestyle, anthropometric, and laboratory variables?

This question is framed as a regression problem, where continuous glucose measurements are predicted based on a set of explanatory variables.
This analysis provides quantitative insights into the factors that influence post-load glucose levels and supports potential risk assessment for metabolic disorders.

By addressing both classification and regression tasks, the project illustrates how predictive modeling can be applied to multiple types of health outcomes, leveraging a large-scale population survey to inform clinical and public health decision-making.

4. 🔍 Data Understanding

The dataset for this project is sourced from the National Health and Nutrition Examination Survey (NHANES), a comprehensive health survey conducted by the Centers for Disease Control and Prevention (CDC). The data used corresponds to the 2013–2014 NHANES survey cycle.

Rather than being provided as a single consolidated table, the dataset is distributed across multiple files, each representing a specific survey component, including:

Demographic
Examinations
Laboratory measurements
Questionnaires

An initial exploration of these raw data files was conducted to understand their structure and contents. This preliminary review focused on assessing file dimensions, variable types, and overall dataset composition, without performing detailed exploratory data analysis.

4.1 Overview of Raw Data Files

Table 1 summarises the key characteristics of the NHANES raw data files used in this project, including the number of observations, number of variables, general content, and the distribution of data types.

Table 1: Overview of NHANES Raw Dataset Files

Data File	Dimension	Contents	Data Types Count
demographic.csv	10,175 rows x 47 columns	Demographic variables such as age, gender and ethnicity	44 Integer, 3 Numeric
examination.csv	9,813 rows x 224 columns	Physical examination variables such as body measurements, blood pressure and glucose reading	29 Character, 173 Integer, 1 Logical, 21 Numeric
laboratory.csv	9,813 rows x 424 columns	Result of laboratory test such as Plasma Glucose, Hemoglobin A1c (HbA1c), Cholesterol and Urine Albumin	271 Integer, 153 Numeric
questionnaire.csv	10,175 rows x 953 columns	Survey questionnaire response variable which include lifestyle factors such as physical activity, smoking and drinking habits	2 Character, 944 Integer, 4 Logical, 3 Numeric

As shown in Table 1, the NHANES dataset contains a large number of variables across all components, particularly within the laboratory and questionnaire files. While this comprehensive structure allows for in-depth health analysis, it also introduces challenges related to data complexity, redundancy, and relevance to specific research objectives. Therefore, careful data preparation is required before meaningful analysis can be performed.

4.2 Dataset Structure Inspection and Key Characteristics

Each data file contains records for survey participants and is linked using a unique respondent identifier, SEQN. This identifier enables records from different components to be merged accurately at the individual level. The presence of multiple data types, including numerical, categorical, and logical variables, reflects the diverse nature of the information collected in the survey.

To examine the structure and composition of the raw datasets, basic R package inspection functions were applied to each data file. This step allowed the identification of dataset dimensions, variable types, and overall complexity prior to data preparation.

#Load dataset 
df_demo  <- read.csv("demographic.csv")
df_exam  <- read.csv("examination.csv")
df_quest <- read.csv("questionnaire.csv")
df_lab   <- read.csv("labs.csv")

#Get dataset dimension
dim(df_demo)
dim(df_exam)
dim(df_quest)
dim(df_lab)

#Get the dataset structure
str(df_demo)
str(df_exam)
str(df_quest)
str(df_lab)

#Get the datatype count for each dataset
table(sapply(df_demo, class))

## 
## integer numeric 
##      44       3

table(sapply(df_exam, class))

## 
## character   integer   logical   numeric 
##        29       173         1        21

table(sapply(df_lab, class))

## 
## integer numeric 
##     271     153

table(sapply(df_quest, class))

## 
## character   integer   logical   numeric 
##         2       944         4         3

4.3 Preliminary Inspection of Missing Values

A preliminary inspection of missing values was also conducted as part of the data understanding process.

#Inspection of Missing Values
colSums(is.na(df_demo))

##     SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 
##        0        0        0        0        0     9502        0        0 
## RIDEXMON RIDEXAGM DMQMILIZ  DMQADFC DMDBORN4 DMDCITZN DMDYRSUS DMDEDUC3 
##      362     5962     3914     9632        0        4     8267     7372 
## DMDEDUC2 DMDMARTL RIDEXPRG  SIALANG SIAPROXY SIAINTRP  FIALANG FIAPROXY 
##     4406     4406     8866        0        1        0      121      121 
## FIAINTRP  MIALANG MIAPROXY MIAINTRP AIALANGA DMDHHSIZ DMDFMSIZ DMDHHSZA 
##      121     2864     2863     2862     3858        0        0        0 
## DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU 
##        0        0        0        0      297      294      123     4833 
## WTINT2YR WTMEC2YR  SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR 
##        0        0        0        0      133      123      785

colSums(is.na(df_exam))

##     SEQN PEASCST1 PEASCTM1 PEASCCT1   BPXCHR   BPAARM   BPACSZ   BPXPLS 
##        0        0      305     9493     7852     2278     2271     2264 
##  BPXPULS   BPXPTY   BPXML1   BPXSY1   BPXDI1   BPAEN1   BPXSY2   BPXDI2 
##      302     2249     2260     2641     2641     2274     2404     2404 
##   BPAEN2   BPXSY3   BPXDI3   BPAEN3   BPXSY4   BPXDI4   BPAEN4 BMDSTATS 
##     2276     2405     2405     2276     9298     9298     9251        0 
##    BMXWT    BMIWT BMXRECUM BMIRECUM  BMXHEAD  BMIHEAD    BMXHT    BMIHT 
##       90     9429     8748     9782     9584     9813      746     9592 
##   BMXBMI  BMDBMIC   BMXLEG   BMILEG  BMXARML  BMIARML  BMXARMC  BMIARMC 
##      758     6290     2411     9463      512     9445      512     9441 
## BMXWAIST BMIWAIST  BMXSAD1  BMXSAD2  BMXSAD3  BMXSAD4 BMDAVSAD BMDSADCM 
##     1152     9374     2595     2595     9455     9455     2595     9329 
## MGDEXSTS   MGD050   MGD060   MGQ070   MGQ080   MGQ090   MGQ100   MGQ110 
##     1522     4602     9621     2062     8872     8872     2061     9058 
##   MGQ120   MGD130  MGQ90DG  MGDSEAT MGAPHAND MGATHAND  MGXH1T1 MGXH1T1E 
##     9058     2006     2006     4602     2006     2006     2015     2015 
##  MGXH2T1 MGXH2T1E  MGXH1T2 MGXH1T2E  MGXH2T2 MGXH2T2E  MGXH1T3 MGXH1T3E 
##     2131     2131     2024     2024     2141     2141     2027     2027 
##  MGXH2T3 MGXH2T3E  MGDCGSZ OHDEXSTS OHDDESTS   OHXIMP  OHX01TC  OHX02TC 
##     2149     2149     2136      391      391     5494      845      845 
##  OHX03TC  OHX04TC  OHX05TC  OHX06TC  OHX07TC  OHX08TC  OHX09TC  OHX10TC 
##      845      845      845      845      845      845      845      845 
##  OHX11TC  OHX12TC  OHX13TC  OHX14TC  OHX15TC  OHX16TC  OHX17TC  OHX18TC 
##      845      845      845      845      845      845      845      845 
##  OHX19TC  OHX20TC  OHX21TC  OHX22TC  OHX23TC  OHX24TC  OHX25TC  OHX26TC 
##      845      845      845      845      845      845      845      845 
##  OHX27TC  OHX28TC  OHX29TC  OHX30TC  OHX31TC  OHX32TC OHX02CTC OHX03CTC 
##      845      845      845      845      845      845        0        0 
## OHX04CTC OHX05CTC OHX06CTC OHX07CTC OHX08CTC OHX09CTC OHX10CTC OHX11CTC 
##        0        0        0        0        0        0        0        0 
## OHX12CTC OHX13CTC OHX14CTC OHX15CTC OHX18CTC OHX19CTC OHX20CTC OHX21CTC 
##        0        0        0        0        0        0        0        0 
## OHX22CTC OHX23CTC OHX24CTC OHX25CTC OHX26CTC OHX27CTC OHX28CTC OHX29CTC 
##        0        0        0        0        0        0        0        0 
## OHX30CTC OHX31CTC OHX02CSC OHX03CSC OHX04CSC OHX05CSC OHX06CSC OHX07CSC 
##        0        0     7438     6882     7781     8143     9195     8936 
## OHX08CSC OHX09CSC OHX10CSC OHX11CSC OHX12CSC OHX13CSC OHX14CSC OHX15CSC 
##     8895     8928     8896     9140     8147     7805     6871     7436 
## OHX18CSC OHX19CSC OHX20CSC OHX21CSC OHX22CSC OHX23CSC OHX24CSC OHX25CSC 
##     7171     6885     7781     8579     9472     9635     9687     9672 
## OHX26CSC OHX27CSC OHX28CSC OHX29CSC OHX30CSC OHX31CSC  OHX02SE  OHX03SE 
##     9648     9496     8520     7760     6858     7078     6549     6549 
##  OHX04SE  OHX05SE  OHX07SE  OHX10SE  OHX12SE  OHX13SE  OHX14SE  OHX15SE 
##     6549     6549     6549     6549     6549     6549     6549     6549 
##  OHX18SE  OHX19SE  OHX20SE  OHX21SE  OHX28SE  OHX29SE  OHX30SE  OHX31SE 
##     6549     6549     6549     6549     6549     6549     6549     6549 
## CSXEXSTS CSXEXCMT   CSQ245   CSQ241  CSQ260A  CSQ260D  CSQ260G  CSQ260I 
##     6105     9605     6256     8834     9631     9739     9530     9708 
##  CSQ260N  CSQ260M   CSQ270   CSQ450   CSQ460   CSQ470   CSQ480   CSQ490 
##     9392     7059     9530     6393     6397     6401     6402     6402 
## CSXQUIPG CSXQUIPT  CSXNAPG  CSXNAPT CSXQUISG CSXQUIST CSXSLTSG CSXSLTST 
##     6680     6680     6684     6596     6699     6699     6596     6596 
##  CSXNASG  CSXNAST  CSXTSEQ CSXCHOOD  CSXSBOD CSXSMKOD CSXLEAOD CSXSOAOD 
##     6595     6595        0     6286     6288     6290     6293     6293 
## CSXGRAOD  CSXONOD CSXNGSOD CSXSLTRT CSXSLTRG  CSXNART  CSXNARG CSAEFFRT 
##     6294     6294     6294     8218     8218     8200     8200     6276

colSums(is.na(df_lab))

##       SEQN     URXUMA     URXUMS   URXUCR.x     URXCRS     URDACT WTSAF2YR.x 
##          0       1761       1761       1761       1761       1761       6484 
##     LBXAPB   LBDAPBSI     LBXSAL   LBDSALSI   LBXSAPSI   LBXSASSI   LBXSATSI 
##       6668       6668       3260       3260       3261       3262       3262 
##     LBXSBU   LBDSBUSI   LBXSC3SI     LBXSCA   LBDSCASI     LBXSCH   LBDSCHSI 
##       3260       3260       3260       3302       3302       3262       3262 
##     LBXSCK   LBXSCLSI     LBXSCR   LBDSCRSI     LBXSGB   LBDSGBSI     LBXSGL 
##       3271       3260       3260       3260       3269       3269       3260 
##   LBDSGLSI   LBXSGTSI     LBXSIR   LBDSIRSI    LBXSKSI   LBXSLDSI   LBXSNASI 
##       3260       3261       3286       3286       3261       3262       3260 
##   LBXSOSSI     LBXSPH   LBDSPHSI     LBXSTB   LBDSTBSI     LBXSTP   LBDSTPSI 
##       3260       3261       3261       3264       3264       3269       3269 
##     LBXSTR   LBDSTRSI     LBXSUA   LBDSUASI   LBXWBCSI   LBXLYPCT   LBXMOPCT 
##       3264       3264       3262       3262       1269       1294       1294 
##   LBXNEPCT   LBXEOPCT   LBXBAPCT   LBDLYMNO    LBDMONO    LBDNENO    LBDEONO 
##       1294       1294       1294       1294       1294       1294       1294 
##    LBDBANO   LBXRBCSI     LBXHGB     LBXHCT   LBXMCVSI   LBXMCHSI      LBXMC 
##       1294       1269       1269       1269       1269       1269       1269 
##     LBXRDW   LBXPLTSI    LBXMPSI     URXUCL  WTSA2YR.x     LBXSCU   LBDSCUSI 
##       1269       1269       1269       7639       7058       7293       7293 
##     LBXSSE   LBDSSESI     LBXSZN   LBDSZNSI   URXUCR.y  WTSB2YR.x     URXBP3 
##       7294       7294       7294       7294       7132       7036       7127 
##   URDBP3LC     URXBPH   URDBPHLC     URXBPF   URDBPFLC     URXBPS   URDBPSLC 
##       7127       7127       7127       7131       7131       7131       7131 
##     URXTLC   URDTLCLC     URXTRS   URDTRSLC     URXBUP   URDBUPLC     URXEPB 
##       7127       7127       7127       7127       7127       7127       7127 
##   URDEPBLC     URXMPB   URDMPBLC     URXPPB   URDPPBLC     URX14D   URD14DLC 
##       7127       7127       7127       7127       7127       7127       7127 
##     URXDCB   URDDCBLC     URXUCR     PHQ020   PHACOFHR   PHACOFMN     PHQ030 
##       7127       7127       7123        631       9684       9684        631 
##   PHAALCHR   PHAALCMN     PHQ040   PHAGUMHR   PHAGUMMN     PHQ050   PHAANTHR 
##       9771       9771        631       9442       9442        631       9776 
##   PHAANTMN     PHQ060   PHASUPHR   PHASUPMN PHAFSTHR.x PHAFSTMN.x    PHDSESN 
##       9776        631       9727       9727        631        631        391 
##     LBDPFL     LBDWFL     LBDHDD   LBDHDDSI      LBXHA     LBXHBS     LBXHBC 
##       7488       5713       2189       2189       1549       1552       2157 
##     LBDHBG      LBDHD     LBXHCR     LBXHCG     LBDHEG     LBDHEM     LBXHE1 
##       2161       2161       9676       9746       2157       2157       6144 
##     LBXHE2      LBXGH      LBDHI      ORXGH      ORXGL     ORXH06     ORXH11 
##       6768       3170       5901       4756       4756       4756       4756 
##     ORXH16     ORXH18     ORXH26     ORXH31     ORXH33     ORXH35     ORXH39 
##       4756       4756       4756       4756       4756       4756       4756 
##     ORXH40     ORXH42     ORXH45     ORXH51     ORXH52     ORXH53     ORXH54 
##       4756       4756       4756       4756       4756       4756       4756 
##     ORXH55     ORXH56     ORXH58     ORXH59     ORXH61     ORXH62     ORXH64 
##       4756       4756       4756       4756       4756       4756       4756 
##     ORXH66     ORXH67     ORXH68     ORXH69     ORXH70     ORXH71     ORXH72 
##       4756       4756       4756       4756       4756       4756       4756 
##     ORXH73     ORXH81     ORXH82     ORXH83     ORXH84     ORXHPC     ORXHPI 
##       4756       4756       4756       4756       4756       4756       4756 
##     ORXHPV  LBDRPCR.x   LBDRHP.x   LBDRLP.x   LBDR06.x   LBDR11.x   LBDR16.x 
##       4756       7945       8056       8056       7945       7945       7945 
##   LBDR18.x   LBDR26.x   LBDR31.x   LBDR33.x   LBDR35.x   LBDR39.x   LBDR40.x 
##       7945       7945       7945       7945       7945       7945       7945 
##   LBDR42.x   LBDR45.x   LBDR51.x   LBDR52.x   LBDR53.x   LBDR54.x   LBDR55.x 
##       7945       7945       7945       7945       7945       7945       7945 
##   LBDR56.x   LBDR58.x   LBDR59.x   LBDR61.x   LBDR62.x   LBDR64.x   LBDR66.x 
##       7945       7945       7945       7945       7945       7945       7945 
##   LBDR67.x   LBDR68.x   LBDR69.x   LBDR70.x   LBDR71.x   LBDR72.x   LBDR73.x 
##       7945       7945       7945       7945       7945       7945       7945 
##   LBDR81.x   LBDR82.x   LBDR83.x   LBDR84.x   LBDR89.x   LBDRPI.x    LBXHP2C 
##       7945       7945       7945       7945       7945       8056       7818 
##  LBDRPCR.y   LBDRHP.y   LBDRLP.y   LBDR06.y   LBDR11.y   LBDR16.y   LBDR18.y 
##       7818       7828       7828       7818       7818       7818       7818 
##   LBDR26.y   LBDR31.y   LBDR33.y   LBDR35.y   LBDR39.y   LBDR40.y   LBDR42.y 
##       7818       7818       7818       7818       7818       7818       7818 
##   LBDR45.y   LBDR51.y   LBDR52.y   LBDR53.y   LBDR54.y   LBDR55.y   LBDR56.y 
##       7818       7818       7818       7818       7818       7818       7818 
##   LBDR58.y   LBDR59.y   LBDR61.y   LBDR62.y   LBDR64.y   LBDR66.y   LBDR67.y 
##       7818       7818       7818       7818       7818       7818       7818 
##   LBDR68.y   LBDR69.y   LBDR70.y   LBDR71.y   LBDR72.y   LBDR73.y   LBDR81.y 
##       7818       7818       7818       7818       7818       7818       7818 
##   LBDR82.y   LBDR83.y   LBDR84.y   LBDR89.y   LBDRPI.y WTSAF2YR.y      LBXIN 
##       7818       7818       7818       7818       7818       6484       6720 
##    LBDINSI PHAFSTHR.y PHAFSTMN.y     URXUIO   WTSAF2YR      LBXTR    LBDTRSI 
##       6720       6522       6522       7147       6484       6667       6667 
##     LBDLDL   LBDLDLSI  WTSH2YR.x     LBXIHG   LBDIHGSI   LBDIHGLC     LBXBGE 
##       6708       6708       3881       4638       4638       4638       4638 
##   LBDBGELC     LBXBGM   LBDBGMLC   WTSOG2YR     LBXGLT   LBDGLTSI   GTDSCMMN 
##       4638       4638       4638       6904       7468       7468       7404 
##   GTDDR1MN   GTDBL2MN   GTDDR2MN   GTXDRANK   PHAFSTHR   PHAFSTMN    GTDCODE 
##       7404       7467       7467       7299       6904       6904       6904 
##  WTSA2YR.y     URXP01   URDP01LC     URXP02   URDP02LC     URXP03   URDP03LC 
##       7058       7173       7173       7172       7172       7163       7163 
##     URXP04   URDP04LC     URXP06   URDP06LC     URXP10   URDP10LC     URXP25 
##       7163       7163       7163       7163       7163       7163       7163 
##   URDP25LC    WTSA2YR     URXUP8   URDUP8LC     URXNO3   URDNO3LC     URXSCN 
##       7163       7058       7169       7169       7169       7169       7170 
##   URDSCNLC  WTSB2YR.y    LBXPFDE   LBDPFDEL    LBXPFHS   LBDPFHSL    LBXMPAH 
##       7170       7474       7645       7645       7645       7645       7645 
##   LBDMPAHL    LBXPFBS   LBDPFBSL    LBXPFHP   LBDPFHPL    LBXPFNA   LBDPFNAL 
##       7645       7645       7645       7645       7645       7645       7645 
##    LBXPFUA   LBDPFUAL    LBXPFDO   LBDPFDOL    WTSB2YR     URXCNP   URDCNPLC 
##       7645       7645       7645       7645       7036       7128       7128 
##     URXCOP   URDCOPLC     URXECP   URDECPLC     URXMBP   URDMBPLC     URXMC1 
##       7128       7128       7128       7128       7128       7128       7128 
##   URDMC1LC     URXMEP   URDMEPLC     URXMHH   URDMHHLC    URXMHNC   URDMCHLC 
##       7128       7128       7128       7128       7128       7128       7128 
##     URXMHP   URDMHPLC     URXMIB   URDMIBLC     URXMNP   URDMNPLC     URXMOH 
##       7128       7128       7128       7128       7128       7128       7128 
##   URDMOHLC     URXMZP   URDMZPLC      LBXTC    LBDTCSI     LBXTTG     LBXEMA 
##       7128       7128       7128       2189       2189       2236       9779 
##  WTSH2YR.y     LBXBPB   LBDBPBSI   LBDBPBLC     LBXBCD   LBDBCDSI   LBDBCDLC 
##       3881       4598       4598       4598       4598       4598       4598 
##     LBXTHG   LBDTHGSI   LBDTHGLC     LBXBSE   LBDBSESI   LBDBSELC     LBXBMN 
##       4598       4598       4598       4598       4598       4598       4598 
##   LBDBMNSI   LBDBMNLC    URXUTRI    URXUAS3   URDUA3LC    URXUAS5   URDUA5LC 
##       4598       4598       5756       7159       7159       7159       7159 
##     URXUAB   URDUABLC     URXUAC   URDUACLC    URXUDMA   URDUDALC    URXUMMA 
##       7159       7159       7159       7159       7159       7159       7159 
##   URDUMMAL    URXVOL1   URDFLOW1    URXVOL2   URDFLOW2    URXVOL3   URDFLOW3 
##       7159       1756       2663       7957       7958       9714       9714 
##     URXUHG   URDUHGLC     URXUBA   URDUBALC     URXUCD   URDUCDLC     URXUCO 
##       7147       7147       7149       7149       7149       7149       7149 
##   URDUCOLC     URXUCS   URDUCSLC     URXUMO   URDUMOLC     URXUMN   URDUMNLC 
##       7149       7149       7149       7149       7149       7149       7149 
##     URXUPB   URDUPBLC     URXUSB   URDUSBLC     URXUSN   URDUSNLC     URXUSR 
##       7149       7149       7149       7149       7149       7149       7149 
##   URDUSRLC     URXUTL   URDUTLLC     URXUTU   URDUTULC     URXUUR   URDUURLC 
##       7149       7149       7149       7149       7149       7149       7149 
##    URXPREG     URXUAS     LBDB12   LBDB12SI 
##       8552       7151       4497       4497

colSums(is.na(df_quest))

##      SEQN   ACD011A   ACD011B   ACD011C    ACD040    ACD110    ALQ101    ALQ110 
##         0      4416     10159     10004      7801      9168      4754      8544 
##   ALQ120Q   ALQ120U    ALQ130   ALQ141Q   ALQ141U    ALQ151    ALQ160    BPQ020 
##      5696      6582      6579      6580      8711      5698      8309      3711 
##    BPQ030    BPD035   BPQ040A   BPQ050A    BPQ056    BPD058    BPQ059    BPQ080 
##      8001      8011      8001      8360      3711      8596      3711      3711 
##    BPQ060    BPQ070   BPQ090D   BPQ100D    CBD070    CBD090    CBD110    CBD120 
##      5748      5555      5555      8725       123       132       123       123 
##    CBD130    HSD010    HSQ500    HSQ510    HSQ520    HSQ571    HSQ580    HSQ590 
##       123      3708      1700      1700      1700      4406      9915      4407 
##   HSAQUEX    CSQ010    CSQ020    CSQ030    CSQ040    CSQ060    CSQ070    CSQ080 
##       753      6360      6360      6360      6360      9405      9405      6360 
##   CSQ090A   CSQ090B   CSQ090C   CSQ090D    CSQ100    CSQ110   CSQ120A   CSQ120B 
##      6360      6360      6360      6360      6360      6360     10164     10152 
##   CSQ120C   CSQ120D   CSQ120E   CSQ120F   CSQ120G   CSQ120H    CSQ140    CSQ160 
##     10167     10093     10114     10156     10119     10128      9515      8489 
##    CSQ170    CSQ180    CSQ190    CSQ200    CSQ202    CSQ204    CSQ210    CSQ220 
##     10030      8489      8489      6360      6360      6360      6360      6360 
##    CSQ240    CSQ250    CSQ260    AUQ136    AUQ138    CDQ001    CDQ002    CDQ003 
##      6360      6360      6360      6360      6360      6360      9296      9892 
##    CDQ004    CDQ005    CDQ006   CDQ009A   CDQ009B   CDQ009C   CDQ009D   CDQ009E 
##      9922      9946      9979     10161     10149     10146     10095     10139 
##   CDQ009F   CDQ009G   CDQ009H    CDQ008    CDQ010    DIQ010    DID040    DIQ160 
##     10123     10167     10172      9296      6360       406      9438      3888 
##    DIQ170    DIQ172   DIQ175A   DIQ175B   DIQ175C   DIQ175D   DIQ175E   DIQ175F 
##      3706      3706      8837      9629     10063      9805     10083     10154 
##   DIQ175G   DIQ175H   DIQ175I   DIQ175J   DIQ175K   DIQ175L   DIQ175M   DIQ175N 
##     10024     10024     10105     10068     10136     10135     10084     10124 
##   DIQ175O   DIQ175P   DIQ175Q   DIQ175R   DIQ175S   DIQ175T   DIQ175U   DIQ175V 
##     10090     10106     10073     10155     10145     10119     10102     10163 
##   DIQ175W   DIQ175X    DIQ180    DIQ050    DID060   DIQ060U    DIQ070    DIQ230 
##     10172     10169      3706       407      9955      9958      8975      9438 
##    DIQ240    DID250    DID260   DIQ260U    DIQ275    DIQ280    DIQ291   DIQ300S 
##      9438      9610      9440      9584      9438      9655      9655      9443 
##   DIQ300D   DID310S   DID310D    DID320    DID330    DID341    DID350   DIQ350U 
##      9443      9443      9443      9443      9549      9445      9445      9578 
##    DIQ360    DIQ080    DBQ010    DBD030    DBD041    DBD050    DBD055    DBD061 
##      9443      9443      8310      8817      8310      8522      8310      8456 
##   DBQ073A   DBQ073B   DBQ073C   DBQ073D   DBQ073E   DBQ073U    DBQ700    DBQ197 
##      9085      9913     10144     10164     10146     10144      3711       406 
##   DBQ223A   DBQ223B   DBQ223C   DBQ223D   DBQ223E   DBQ223U    DBQ229   DBQ235A 
##      7379      6384      9301      9429      9940      9648      4406      5895 
##   DBQ235B   DBQ235C    DBQ301    DBQ330    DBQ360    DBQ370    DBD381    DBQ390 
##      5895      5895      8335      8335      6934      7467      7587      8063 
##    DBQ400    DBD411    DBQ421    DBQ424    DBD895    DBD900    DBD905    DBD910 
##      7467      7840      8959      8540       472      2851       494       500 
##    CBQ596    CBQ606    CBQ611    CBQ505    CBQ535    CBQ540    CBQ545    CBQ550 
##      3711      9121      9121      3711      4772      8073      4772      3711 
##    CBQ552    CBQ580    CBQ585    CBQ590    DED031   DEQ034A   DEQ034C   DEQ034D 
##      4890      4890      8547      4890      6248      6248      6294      6294 
##   DEQ038G   DEQ038Q    DED120    DED125    DLQ010    DLQ020    DLQ040    DLQ050 
##      6248      8767      8071      6906       406       406      1395      1395 
##    DLQ060    DLQ080    DPQ010    DPQ020    DPQ030    DPQ040    DPQ050    DPQ060 
##      1395      3535      4777      4779      4780      4780      4780      4781 
##    DPQ070    DPQ080    DPQ090    DPQ100    DUQ200    DUQ210    DUQ211    DUQ213 
##      4781      4781      4782      6501      6474      8184      8185      9222 
##   DUQ215Q   DUQ215U    DUQ217    DUQ219   DUQ220Q   DUQ220U    DUQ230    DUQ240 
##      9230      9232      9222      9222      8188      8193      9605      5635 
##    DUQ250    DUQ260   DUQ270Q   DUQ270U    DUQ272    DUQ280    DUQ290    DUQ300 
##      9452      9601      9601      9603      9601     10130      9452     10095 
##   DUQ310Q   DUQ310U    DUQ320    DUQ330    DUQ340   DUQ350Q   DUQ350U    DUQ352 
##     10095     10095     10167      9452      9929      9929      9930      9929 
##    DUQ360    DUQ370   DUQ380A   DUQ380B   DUQ380C   DUQ380D   DUQ380E    DUQ390 
##     10145      5636     10119     10120     10128     10171     10163     10070 
##   DUQ400Q   DUQ400U    DUQ410    DUQ420    DUQ430    ECD010    ECQ020   ECD070A 
##     10070     10070     10070     10080      8161      6465      6465      6465 
##   ECD070B    ECQ080    ECQ090   WHQ030E   MCQ080E    ECQ150   FSD032A   FSD032B 
##      6465     10087     10114      7133      7133      9800       116       116 
##   FSD032C    FSD041    FSD052    FSD061    FSD071    FSD081    FSD092    FSD102 
##       116      6847      8986      6847      6847      6847      8594      9892 
##   FSD032D   FSD032E   FSD032F    FSD111    FSD122    FSD132    FSD141    FSD146 
##      3542      3542      3542      9096      9096     10114      9096      9096 
##     FSDHH     FSDAD     FSDCH    FSD151    FSQ165    FSQ012   FSD012N    FSD230 
##       122       122      3543       116       116      6428      7298      7298 
##    FSD225    FSQ235    FSQ162  FSD650ZC  FSD660ZC    FSD675    FSD680  FSD670ZC 
##      7299      7298      1909      8572      9339      7203      7608      7203 
##    FSQ690    FSQ695  FSD650ZW  FSD660ZW  FSD670ZW    HEQ010    HEQ020    HEQ030 
##      7202      8532      9934     10103     10103      1603     10109      1603 
##    HEQ040    HIQ011   HIQ031A   HIQ031B   HIQ031C   HIQ031D   HIQ031E   HIQ031F 
##     10099         0      5354      8949     10164      7773     10122      9935 
##   HIQ031G   HIQ031H   HIQ031I   HIQ031J  HIQ031AA    HIQ260    HIQ105    HIQ270 
##     10165      9612      9969      9942     10167      9915      9241      1572 
##    HIQ210    HOD050    HOQ065    HUQ010    HUQ020    HUQ030    HUQ041    HUQ051 
##      1572       121       121         0       406         0      1194         0 
##    HUQ061    HUQ071    HUD080    HUQ090    IMQ011    IMQ020    IMQ040    IMQ070 
##      8888         0      9248      1165       668         0      7087      7249 
##    IMQ080    IMQ090    IMQ045    INQ020    INQ012    INQ030    INQ060    INQ080 
##      9575      9298      9298       123       123       123       123       123 
##    INQ090    INQ132    INQ140    INQ150    IND235  INDFMMPI  INDFMMPC    INQ244 
##       123       123       123       123       369      1066       369      4661 
##    IND247    MCQ010    MCQ025    MCQ035    MCQ040    MCQ050    AGQ030    MCQ053 
##      5366       406      8637      8637      9236      9236      9236       406 
##    MCQ070    MCQ075    MCQ080    MCQ082    MCQ084    MCQ086    MCQ092    MCD093 
##      3711     10013      3711      1603      8335      1603      1603      9443 
##    MCQ149    MCQ151   MCQ160A   MCQ180A    MCQ195   MCQ160N   MCQ180N   MCQ160B 
##      9743     10137      4406      8667      8667      4406      9941      4406 
##   MCQ180B   MCQ160C   MCQ180C   MCQ160D   MCQ180D   MCQ160E   MCQ180E   MCQ160F 
##      9993      4406      9943      4406     10039      4406      9946      4406 
##   MCQ180F   MCQ160G   MCQ180G   MCQ160M   MCQ170M   MCQ180M   MCQ160K   MCQ170K 
##      9973      4406     10080      4406      9574      9574      4406      9855 
##   MCQ180K   MCQ160L   MCQ170L   MCQ180L   MCQ160O    MCQ203    MCQ206    MCQ220 
##      9855      4406      9941      9941      4406      1703     10018      4406 
##   MCQ230A   MCQ230B   MCQ230C   MCQ230D   MCQ240A  MCQ240AA   MCQ240B  MCQ240BB 
##      9628     10114     10171     10173     10164     10172     10171     10163 
##   MCQ240C  MCQ240CC   MCQ240D  MCQ240DD  MCQ240DK   MCQ240E   MCQ240F   MCQ240G 
##     10173     10157     10172     10142     10171     10074     10138     10147 
##   MCQ240H   MCQ240I   MCQ240J   MCQ240K   MCQ240L   MCQ240M   MCQ240N   MCQ240O 
##     10173     10175     10163     10171     10170     10170     10154     10164 
##   MCQ240P   MCQ240Q   MCQ240R   MCQ240S   MCQ240T   MCQ240U   MCQ240V   MCQ240W 
##     10133     10171     10175     10165     10172     10107     10174     10070 
##   MCQ240X   MCQ240Y   MCQ240Z   MCQ300A   MCQ300B   MCQ300C   MCQ365A   MCQ365B 
##     10118     10172     10172      4406      1603      4406      3711      3711 
##   MCQ365C   MCQ365D   MCQ370A   MCQ370B   MCQ370C   MCQ370D    MCQ380    OCD150 
##      3711      3711      3711      3711      3711      3711      8335      3716 
##    OCQ180    OCQ210    OCQ260    OCD270    OCQ380   OCD390G    OCD395    OHQ030 
##      6830      9164      6732      6732      7359      3716      6732       407 
##    OHQ033    OHQ770   OHQ780A   OHQ780B   OHQ780C   OHQ780D   OHQ780E   OHQ780F 
##      1057      1057      9085     10110      9967     10150     10154     10170 
##   OHQ780G   OHQ780H   OHQ780I   OHQ780J   OHQ780K   OHQ555G   OHQ555Q   OHQ555U 
##     10099     10130     10128     10140     10088      7410      7452      7458 
##   OHQ560G   OHQ560Q   OHQ560U    OHQ565   OHQ570Q   OHQ570U   OHQ575G   OHQ575Q 
##      7458      7489      7490      7410      9946      9956      9956      9992 
##   OHQ575U    OHQ580   OHQ585Q   OHQ585U   OHQ590G   OHQ590Q   OHQ590U    OHQ610 
##      9994      7410     10052     10058     10058     10096     10096      6494 
##    OHQ612    OHQ614    OHQ620    OHQ640    OHQ680    OHQ835    OHQ845   OHQ848G 
##      6494      6494      5362      5362      5362      5362       407      6714 
##   OHQ848Q    OHQ849    OHQ850    OHQ855    OHQ860    OHQ865    OHQ870    OHQ875 
##      6803      6761      5362      5362      5362      5362      5362      5362 
##    OHQ880    OHQ885    OHQ895    OHQ900   OSQ010A   OSQ010B   OSQ010C   OSQ020A 
##      5362      5362      8886      9091      6360      6360      6360     10092 
##   OSQ020B   OSQ020C  OSD030AA  OSQ040AA  OSD050AA  OSD030AB  OSQ040AB  OSD050AB 
##      9859     10098     10093     10093     10123     10164     10164     10166 
##  OSD030AC  OSQ040AC  OSD050AC  OSD030BA  OSQ040BA  OSD050BA  OSD030BB  OSQ040BB 
##     10173     10173     10173      9859      9859     10101     10126     10126 
##  OSD050BB  OSD030BC  OSQ040BC  OSD050BC  OSD030BD  OSQ040BD  OSD050BD  OSD030BE 
##     10166     10161     10161     10173     10170     10170     10173     10173 
##  OSQ040BE  OSD030BF  OSQ040BF  OSD030BG  OSQ040BG  OSD030BH  OSQ040BH  OSD030BI 
##     10173     10173     10173     10174     10174     10174     10174     10174 
##  OSQ040BI  OSD030BJ  OSQ040BJ  OSD030CA  OSQ040CA  OSD050CA  OSD030CB  OSQ040CB 
##     10174     10174     10174     10099     10099     10156     10166     10166 
##  OSD050CB  OSD030CC  OSQ040CC    OSQ080   OSQ090A   OSQ100A   OSD110A   OSQ120A 
##     10174     10173     10173      6360      9253      9939      9939      9253 
##   OSQ090B   OSQ100B   OSD110B   OSQ120B   OSQ090C   OSQ100C   OSD110C   OSQ120C 
##      9933     10104     10104      9933     10095     10155     10155     10095 
##   OSQ090D   OSQ100D   OSD110D   OSQ120D   OSQ090E   OSQ100E   OSD110E   OSQ120E 
##     10139     10167     10167     10139     10163     10172     10172     10163 
##   OSQ090F   OSQ120F   OSQ090G   OSQ100G   OSD110G   OSQ120G   OSQ090H   OSQ120H 
##     10170     10170     10171     10173     10173     10171     10173     10173 
##    OSQ060    OSQ072    OSQ130   OSQ140Q   OSQ140U    OSQ150   OSQ160A   OSQ160B 
##      6360      9853      6360      9955      9968      6360      9739     10145 
##    OSQ170    OSQ180    OSQ190    OSQ200    OSQ210    OSQ220    PFQ020    PFQ030 
##      6360      9919     10161      6360     10094     10171      6714     10047 
##    PFQ033    PFQ041    PFQ049    PFQ051    PFQ054    PFQ057    PFQ059   PFQ061A 
##     10070      7058      4406      4406      4406      4406      5905      7580 
##   PFQ061B   PFQ061C   PFQ061D   PFQ061E   PFQ061F   PFQ061G   PFQ061H   PFQ061I 
##      8172      8172      7580      7580      7580      7580      7580      7580 
##   PFQ061J   PFQ061K   PFQ061L   PFQ061M   PFQ061N   PFQ061O   PFQ061P   PFQ061Q 
##      7580      7580      7580      7580      7580      7580      7580      7580 
##   PFQ061R   PFQ061S   PFQ061T   PFQ063A   PFQ063B   PFQ063C   PFQ063D   PFQ063E 
##      7580      7580      7580      8351      9060      9477      9726      9913 
##    PFQ090    PAQ605    PAQ610    PAD615    PAQ620    PAQ625    PAD630    PAQ635 
##      4406      3027      9003      9007      3027      7868      7876      3028 
##    PAQ640    PAD645    PAQ650    PAQ655    PAD660    PAQ665    PAQ670    PAD675 
##      8128      8132      3028      8117      8120      3030      7115      7118 
##    PAD680    PAQ706    PAQ710    PAQ715    PAQ722   PAQ724A   PAQ724B   PAQ724C 
##      3036      7186       727       727      7468     10049      9955      9732 
##   PAQ724D   PAQ724E   PAQ724F   PAQ724G   PAQ724H   PAQ724I   PAQ724J   PAQ724K 
##      9560     10133      9822     10165      9923     10160     10021     10126 
##   PAQ724L   PAQ724M   PAQ724N   PAQ724O   PAQ724P   PAQ724Q   PAQ724R   PAQ724S 
##     10172     10164     10034     10168     10098      9784     10120      9249 
##   PAQ724T   PAQ724U   PAQ724V   PAQ724W   PAQ724X   PAQ724Y   PAQ724Z  PAQ724AA 
##     10001     10096      9786      9903     10139     10150     10129      9703 
##  PAQ724AB  PAQ724AC  PAQ724AD  PAQ724AE  PAQ724AF  PAQ724CM    PAQ731    PAD733 
##     10068     10129      9294     10010     10161     10152      7918      9377 
##    PAQ677    PAQ678    PAQ740    PAQ742    PAQ744    PAQ746    PAQ748    PAQ755 
##      9496      9496      9496      9855      9496      9619      9619      7918 
##   PAQ759A   PAQ759B   PAQ759C   PAQ759D   PAQ759E   PAQ759F   PAQ759G   PAQ759H 
##     10055      9922     10171     10123     10041     10170     10135     10164 
##   PAQ759I   PAQ759J   PAQ759K   PAQ759L   PAQ759M   PAQ759N   PAQ759O   PAQ759P 
##     10166      9956     10133     10150     10098     10126     10147     10119 
##   PAQ759Q   PAQ759R   PAQ759S   PAQ759T   PAQ759U   PAQ759V    PAQ762    PAQ764 
##     10088     10171     10060     10174     10160     10172      8597      8708 
##    PAQ766    PAQ679    PAQ750    PAQ770   PAQ772A   PAQ772B   PAQ772C   PAAQUEX 
##      8723      9503      7925      7918     10126     10130     10100       691 
##    PUQ100    PUQ110    RHQ010    RHQ020    RHQ031    RHD043    RHQ060    RHQ070 
##      2364      2365      6869     10150      6919      8752      8752     10122 
##    RHQ074    RHQ076    RHQ078    RHQ131    RHD143    RHQ160    RHQ162    RHQ163 
##      8246      8246      8246      7543      9236      7982      7982      9979 
##    RHQ166    RHQ169    RHQ172    RHD173    RHQ171    RHD180    RHD190    RHQ197 
##      7999      8934      8094      9840      8094      8523      8121     10025 
##    RHQ200    RHD280    RHQ291    RHQ305    RHQ332    RHQ420    RHQ540   RHQ542A 
##     10025      7555      9610      7554      9871      6922      7546      9767 
##   RHQ542B   RHQ542C   RHQ542D    RHQ554   RHQ560Q   RHQ560U    RHQ570   RHQ576Q 
##     10112     10063     10160      9768      9880      9886      9768     10095 
##   RHQ576U    RHQ580   RHQ586Q   RHQ586U    RHQ596   RHQ602Q   RHQ602U    RXQ510 
##     10097     10112     10136     10138     10112     10163     10163      6360 
##    RXQ515    RXQ520   RXQ525G   RXQ525Q   RXQ525U    RXD530   SLD010H    SLQ050 
##      8891      7644      9042     10079     10080      9042      3714      3711 
##    SLQ060    SMQ020    SMD030    SMQ040   SMQ050Q   SMQ050U    SMD055    SMD057 
##      3711      4062      7596      7596      8828      8885      8960      8828 
##    SMQ078    SMD641    SMD650    SMD093   SMDUPCA  SMD100BR  SMD100FL  SMD100MN 
##      9176      8859      8927      8943         0         0      9084      9083 
##  SMD100LN  SMD100TR  SMD100NI  SMD100CO    SMQ621    SMD630    SMQ661   SMQ665A 
##      9083      9347      9347      9347      9168     10092     10146     10162 
##   SMQ665B   SMQ665C   SMQ665D    SMQ670    SMQ848   SMQ852Q   SMQ852U  SMAQUEX2 
##     10172     10166     10173      8914      9592      9595      9598      3007 
##    SMD460    SMD470    SMD480    SMQ856    SMQ858    SMQ860    SMQ862    SMQ866 
##       116      7723      8868      4062      6994        70      5378      4062 
##    SMQ868    SMQ870    SMQ872    SMQ874    SMQ876    SMQ878    SMQ880 SMAQUEX.x 
##      9495        70      1744        70      5057        70      4497        33 
##    SMQ681   SMQ690A    SMQ710    SMQ720    SMQ725   SMQ690B    SMQ740   SMQ690C 
##      3745      9056      9056      9056      9056     10153     10153     10010 
##    SMQ770   SMQ690G    SMQ845   SMQ690H    SMQ849    SMQ851   SMQ690D    SMQ800 
##     10010     10116     10116     10050     10050      3745     10123     10123 
##   SMQ690E    SMQ817   SMQ690I    SMQ857   SMQ690J    SMQ861    SMQ863   SMQ690F 
##     10121     10121     10172     10172     10175     10175      4752     10151 
##    SMQ830    SMQ840    SMDANY SMAQUEX.y    SXD021    SXQ800    SXQ803    SXQ806 
##     10151     10151      4702      3196      5649      7977      7977      7977 
##    SXQ809    SXQ700    SXQ703    SXQ706    SXQ709    SXD031    SXD171    SXD510 
##      7977      7836      7837      7837      7837      5882      7978      8380 
##    SXQ824    SXQ827    SXD633    SXQ636    SXQ639    SXD642    SXQ410    SXQ550 
##      8552      8553      8587      8587      8587      8981     10060     10079 
##    SXQ836    SXQ841    SXQ853    SXD621    SXQ624    SXQ627    SXD630    SXQ645 
##     10060     10101     10060      8593      8593      8606      9078      7922 
##    SXQ648    SXQ610    SXQ251    SXQ590    SXQ600    SXD101    SXD450    SXQ724 
##      7142      7154      7249      9245      9245      7836      8273      8396 
##    SXQ727    SXQ130    SXQ490    SXQ741    SXQ753    SXQ260    SXQ265    SXQ267 
##      8396      9983      9983      9983      8377      6695      6695     10048 
##    SXQ270    SXQ272    SXQ280    SXQ292    SXQ294    WHD010    WHD020    WHQ030 
##      6695      6695      8493      8385      8276      3736      3745      3711 
##    WHQ040    WHD050    WHQ060    WHQ070   WHD080A   WHD080B   WHD080C   WHD080D 
##      3711      3753      8936      4501      8482      9292      9373      8345 
##   WHD080E   WHD080F   WHD080G   WHD080H   WHD080I   WHD080J   WHD080K   WHD080M 
##      9776      9942     10060     10061     10091      9975     10149      9142 
##   WHD080N   WHD080O   WHD080P   WHD080Q   WHD080R   WHD080S   WHD080T   WHD080U 
##     10000      9486     10156      9054      9153      9232      9179     10160 
##   WHD080L    WHD110    WHD120    WHD130    WHD140    WHQ150   WHQ030M    WHQ500 
##     10146      5996      5151      7410      4072      4155      8697      8697 
##    WHQ520 
##      8697

This inspection revealed that missing values were present across several variables in all dataset components. Such missingness is common in large-scale health surveys and may occur due to non-response, incomplete examinations, or unavailable laboratory results.

The identification of missing values at this stage informed subsequent decisions related to data cleaning and preprocessing. These issues were addressed systematically during the data preparation phase to ensure the development of a clean and analysis-ready dataset suitable for both classification and regression tasks.

5.🧹Data Cleaning and Preparation

The data cleaning and preparation phase focused on transforming the raw NHANES data into a clean, consistent, and analysis-ready dataset suitable for subsequent exploratory analysis and modeling. Given that the NHANES dataset is distributed across multiple files and contains a large number of variables, several preprocessing steps were required to ensure data quality and relevance to the project objectives.

These steps included variable selection, merging datasets, handling missing values, and recoding variables into appropriate formats. For this purpose, the Dplyr package is used.

5.1 Variable Selection

This variable selection step was guided by domain relevance to chronic kidney disease risk and glucose-related outcomes, while excluding variables that were unrelated or redundant for the intended analyses as shown in Table 2.

Table 2: Variable Selection

Dataset	Selected Variables	Description
Demographic	SEQN	Participant sequence number
	RIAGENDR	Participant gender
	RIDAGEYR	Age in years
	RIDRETH3	Recode of reported ethnicity
Examination	SEQN	Participant sequence number
	BMXBMI	Body Mass Index (kg/m**2)
	BMXWAIST	Waist Circumference (cm)
	Mean of BPXSY1, BPXSY2, BPXSY3	Mean of continuous reading of Systolic Blood Pressure (mm Hg)
	Mean of BPXDI1, BPXDI2, BPXDI3	Mean of continuous reading of Diastolic Blood Pressure (mm Hg)
Laboratory	SEQN	Participant sequence number
	LBDLDL	LDL-cholesterol (mg/dL)
	LBXTR	Triglyceride (mg/dL)
	LBXTC	Total Cholesterol( mg/dL)
	LBDGLTSI	Two-hour Glucose (OGTT) (mmol/L)
	LBXGH	Glycohemoglobin (%) for Hemoglobin A1c (HbA1c)
	URXUMA	Urine Albumin (ug/mL)
	LBXSCR	Creatinine (mg/dL)
	LBXIN	Insulin (uU/mL)
Questionnaire	ALQ130	Alcohol drinks per day
	SMQ020	Smoking status
	PAQ650	Physical activity

library(dplyr)
#VARIABLE SELECTIONS
#Demographic
demo_sel <- df_demo %>%
  select(SEQN, RIAGENDR, RIDAGEYR, RIDRETH3)
#Examination
exam_sel <- df_exam %>%
  select(SEQN, BMXBMI, BMXWAIST,
         BPXSY1, BPXSY2, BPXSY3,
         BPXDI1, BPXDI2, BPXDI3)
#Laboratory
lab_sel <- df_lab %>%
  select(SEQN, LBXGH, LBDGLTSI, LBXTR, LBXTC, URXUMA, LBDLDL, LBXSCR, LBXIN)
#Questionnaire
quest_sel <- df_quest %>%
  select(SEQN, SMQ020, PAQ650, ALQ130)

5.2 Dataset Merging

After selecting the relevant variables from each dataset component, the tables were merged into a single consolidated table. All tables were linked using the unique respondent identifier, SEQN, ensuring that records from different survey components corresponded to the same individual.

library(dplyr)
#Merge Dataset using SEQN and keep only complete records
final_df <- demo_sel %>%
  inner_join(exam_sel, by = "SEQN") %>%
  inner_join(quest_sel, by = "SEQN") %>%
  inner_join(lab_sel, by = "SEQN")

colnames(final_df)

##  [1] "SEQN"     "RIAGENDR" "RIDAGEYR" "RIDRETH3" "BMXBMI"   "BMXWAIST"
##  [7] "BPXSY1"   "BPXSY2"   "BPXSY3"   "BPXDI1"   "BPXDI2"   "BPXDI3"  
## [13] "SMQ020"   "PAQ650"   "ALQ130"   "LBXGH"    "LBDGLTSI" "LBXTR"   
## [19] "LBXTC"    "URXUMA"   "LBDLDL"   "LBXSCR"   "LBXIN"

Backup of the original dataset before cleaning is created for comparison purpose later.

#create backup for comparison later
final_df_b4clean <- final_df

5.3 Handling Missing Values

Initial inspection of the merged dataset revealed missing values across several questionnaire, examination, and laboratory variables. To understand the nature of this missingness, missing-value patterns were examined with respect to participant age. This analysis showed that missing values were systematic rather than random, with a clear concentration among infants and young children. Such missingness reflects the fact that many clinical and laboratory measurements are either not administered or not clinically applicable to younger age groups.

Figure 1 illustrates the distribution of missing values across selected questionnaire, examination, and laboratory variables by age. The concentration of missing values among younger age groups supports the decision to exclude these observations rather than apply imputation.

Figure 1: Missing Value Pattern

Although missing values were less concentrated among participants aged six years and above, the young participants were excluded because key health indicators such as body mass index, blood pressure, and laboratory biomarkers are clinically interpreted using different age-specific reference standards compared to adults. As this study focuses on adult health outcomes related to chronic kidney disease risk and glucose regulation, the dataset was restricted to adult participants.

# Restrict dataset to adult participants
final_df <- final_df %>%
  filter(RIDAGEYR >= 18)

After restricting the dataset to adult participants, the remaining missing values mainly reflected genuine non-response or incomplete measurements and were handled according to variable type.

Several questionnaire variables contained special response codes indicating refusal or uncertainty (Table 3), which were recoded as missing values (NA) to ensure consistent interpretation. This recoding applied to variables related to alcohol consumption, smoking status, and physical activity.

Table 3: Special Values

Variable	Special Values
ALQ130 (Alcohol Consumption)	777 = Refused & 999 = Don’t know
SMQ020 (Smoking Status)	7 = Refused & 9 = Don’t know
PAQ650 (Physical Activity)	7 = Refused & 9 = Don’t know

# Recode special response values to NA for ALQ130, SMQ020 & PAQ650
final_df <- final_df %>%
  mutate(
    ALQ130 = ifelse(ALQ130 %in% c(777, 999), NA, ALQ130),
    SMQ020 = ifelse(SMQ020 %in% c(7, 9), NA, SMQ020),
    PAQ650 = ifelse(PAQ650 %in% c(7, 9), NA, PAQ650)
  )

After recoding, missing values in predictor variables were handled according to variable type. Continuous predictor variables were imputed using median values to reduce the influence of outliers, while categorical predictor variables were imputed using the most frequently occurring category (mode). Variables designated as prediction outcomes (LBDGLTSI) were not imputed to avoid introducing bias or information leakage into subsequent classification and regression analyses.

#Median imputation for continuous variables
final_df <- final_df %>%
  mutate(
    BMXBMI = ifelse(is.na(BMXBMI), median(BMXBMI, na.rm = TRUE), BMXBMI),
    BMXWAIST = ifelse(is.na(BMXWAIST), median(BMXWAIST, na.rm = TRUE), BMXWAIST),
    LBXGH = ifelse(is.na(LBXGH), median(LBXGH, na.rm = TRUE), LBXGH),
    LBXTC = ifelse(is.na(LBXTC), median(LBXTC, na.rm = TRUE), LBXTC),
    LBXTR = ifelse(is.na(LBXTR), median(LBXTR, na.rm = TRUE), LBXTR),
    LBDLDL = ifelse(is.na(LBDLDL), median(LBDLDL, na.rm = TRUE), LBDLDL),
    ALQ130 = ifelse(is.na(ALQ130), median(ALQ130, na.rm = TRUE), ALQ130),
    URXUMA = ifelse(is.na(URXUMA), median(URXUMA, na.rm = TRUE), URXUMA),
    LBXSCR = ifelse(is.na(LBXSCR), median(LBXSCR, na.rm = TRUE), LBXSCR),
    LBXIN = ifelse(is.na(LBXIN), median(LBXIN, na.rm = TRUE), LBXIN)
  )

#Mode imputation for Categorical Data - only applied to the following since there 
#are no missing/NA data in RIAGENDR & RIDRETH3

#get mode for SMQ020 & PAQ650 - convert to numeric to handle NA
mode_SMQ020 <- as.numeric(
  names(sort(table(final_df$SMQ020, useNA = "no"), decreasing = TRUE))[1]
)
mode_PAQ650 <- as.numeric(
  names(sort(table(final_df$PAQ650, useNA = "no"), decreasing = TRUE))[1]
)

#impute mode
final_df <- final_df %>%
  mutate(
    SMQ020 = ifelse(is.na(SMQ020), mode_SMQ020, SMQ020),
    PAQ650 = ifelse(is.na(PAQ650), mode_PAQ650, PAQ650)
  )

Imputation for blood pressure variables was performed after their derivation, as described in Section 5.4.

5.4 Data Transformation and Derivation

Following the handling of missing values, several variable transformations and derivations were performed to standardise measurements and improve interpretability prior to analysis.

Blood pressure measurements were recorded multiple times for each participant during the examination process. To obtain a single representative measure and reduce measurement variability, mean systolic and mean diastolic blood pressure values were computed for each participant using the available readings. After aggregation, the original individual blood pressure measurements were removed from the dataset to avoid redundancy.

# Compute mean systolic and diastolic blood pressure
final_df <- final_df %>%
  mutate(
    systolic_bp  = rowMeans(select(., BPXSY1, BPXSY2, BPXSY3), na.rm = TRUE),
    diastolic_bp = rowMeans(select(., BPXDI1, BPXDI2, BPXDI3), na.rm = TRUE)
  ) %>%
  select(-BPXSY1, -BPXSY2, -BPXSY3,
         -BPXDI1, -BPXDI2, -BPXDI3)

After derivation, any remaining missing values in the blood pressure variables were imputed using median values to ensure completeness while minimising the influence of extreme values.

# Impute missing values in blood pressure
final_df <- final_df %>%
  mutate(
    systolic_bp = ifelse(is.na(systolic_bp), median(systolic_bp, na.rm = TRUE), systolic_bp),
    diastolic_bp = ifelse(is.na(diastolic_bp), median(diastolic_bp, na.rm = TRUE), diastolic_bp)
  )

Categorical variables were converted into factor format. Race and ethnicity categories were consolidated into broader groups to reduce sparsity and improve interpretability in subsequent analyses.(Mexican American and Other Hispanic is grouped together)

# Convert categorical variables
final_df <- final_df %>%
  mutate(
    RIAGENDR = factor(RIAGENDR, levels = c(1, 2),
                      labels = c("Male", "Female")),
    SMQ020 = factor(SMQ020, levels = c(1, 2),
                    labels = c("Yes", "No")),
    PAQ650 = factor(PAQ650, levels = c(1, 2),
                    labels = c("Yes", "No")),
    RIDRETH3 = case_when(
      RIDRETH3 %in% c(1, 2) ~ "Hispanic",
      RIDRETH3 == 3         ~ "Non-Hispanic White",
      RIDRETH3 == 4         ~ "Non-Hispanic Black",
      RIDRETH3 == 6         ~ "Non-Hispanic Asian",
      RIDRETH3 == 7         ~ "Other"
    ),
    RIDRETH3 = factor(RIDRETH3)
  )

Finally the selected NHANES variable were renamed using descriptive labels to improve readability while preserving their original meanings.

#Rename variables
final_df <- final_df %>%
  rename(
    gender               = RIAGENDR,
    age_years            = RIDAGEYR,
    ethnicity            = RIDRETH3,
    bmi                  = BMXBMI,
    waist_cm             = BMXWAIST,
    alcohol_drinks_day   = ALQ130,
    smoking_status       = SMQ020,
    physical_activity    = PAQ650,
    ldl_cholesterol      = LBDLDL,
    triglycerides        = LBXTR,
    total_cholesterol    = LBXTC,
    hba1c                = LBXGH,
    creatinine           = LBXSCR, 
    urine_albumin        = URXUMA,
    ogtt_2hr_glucose     = LBDGLTSI,
    fasting_glucose      = LBXIN
  )

5.5 Validation of Cleaned Dataset

Following data cleaning, imputation, and variable transformation, a series of validation checks were performed to ensure the integrity, completeness, and readiness of the final dataset for exploratory analysis and modelling. These checks focused on dataset structure, missing values, and variable consistency.

First, the overall structure of the dataset was examined to confirm that variables were in the expected formats (numeric or factor) and that all derived and renamed variables were correctly created.

# Check structure of the cleaned dataset
str(final_df)

## 'data.frame':    5924 obs. of  19 variables:
##  $ SEQN              : int  73557 73558 73559 73561 73562 73564 73566 73567 73568 73571 ...
##  $ gender            : Factor w/ 2 levels "Male","Female": 1 1 1 2 1 2 2 1 2 1 ...
##  $ age_years         : int  69 54 72 73 56 61 56 65 26 76 ...
##  $ ethnicity         : Factor w/ 5 levels "Hispanic","Non-Hispanic Asian",..: 3 4 4 4 1 4 4 4 4 4 ...
##  $ bmi               : num  26.7 28.6 28.9 19.7 41.7 35.7 26.5 22 20.3 34.4 ...
##  $ waist_cm          : num  100 108 109 97 123 ...
##  $ smoking_status    : Factor w/ 2 levels "Yes","No": 1 1 1 2 1 2 1 1 2 2 ...
##  $ physical_activity : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 1 ...
##  $ alcohol_drinks_day: int  1 4 2 2 1 1 1 3 2 1 ...
##  $ hba1c             : num  13.9 9.1 8.9 4.9 5.5 5.5 5.4 5.2 5.2 6.9 ...
##  $ ogtt_2hr_glucose  : num  NA NA NA NA NA ...
##  $ triglycerides     : num  93 93 51 75 93 64 93 93 24 93 ...
##  $ total_cholesterol : num  167 170 126 201 226 168 278 173 168 167 ...
##  $ urine_albumin     : num  4.3 153 11.9 255 123 19 1.3 35 25 25.8 ...
##  $ ldl_cholesterol   : num  107 107 56 101 107 97 107 107 67 107 ...
##  $ creatinine        : num  1.21 0.79 1.22 0.73 0.89 0.92 0.55 0.97 0.74 1.19 ...
##  $ fasting_glucose   : num  9.12 9.12 5.83 6.12 9.12 ...
##  $ systolic_bp       : num  113 157 142 137 157 ...
##  $ diastolic_bp      : num  74 61.3 82 86.7 82 ...

Next, summary statistics were reviewed to identify any implausible values and to verify that continuous variables fell within reasonable ranges after imputation and transformation.

# Review summary statistics
summary(final_df)

##       SEQN          gender       age_years                  ethnicity   
##  Min.   :73557   Male  :2823   Min.   :18.00   Hispanic          :1353  
##  1st Qu.:76166   Female:3101   1st Qu.:32.00   Non-Hispanic Asian: 675  
##  Median :78717                 Median :47.00   Non-Hispanic Black:1223  
##  Mean   :78676                 Mean   :47.41   Non-Hispanic White:2491  
##  3rd Qu.:81170                 3rd Qu.:62.25   Other             : 182  
##  Max.   :83729                 Max.   :80.00                            
##                                                                         
##       bmi          waist_cm      smoking_status physical_activity
##  Min.   :14.1   Min.   : 55.50   Yes:2490       Yes:1366         
##  1st Qu.:23.9   1st Qu.: 87.00   No :3434       No :4558         
##  Median :27.7   Median : 97.00                                   
##  Mean   :28.9   Mean   : 98.37                                   
##  3rd Qu.:32.3   3rd Qu.:107.50                                   
##  Max.   :82.9   Max.   :177.90                                   
##                                                                  
##  alcohol_drinks_day     hba1c        ogtt_2hr_glucose triglycerides   
##  Min.   : 1.000     Min.   : 3.500   Min.   : 2.220   Min.   :  14.0  
##  1st Qu.: 2.000     1st Qu.: 5.200   1st Qu.: 4.829   1st Qu.:  93.0  
##  Median : 2.000     Median : 5.500   Median : 5.995   Median :  93.0  
##  Mean   : 2.412     Mean   : 5.704   Mean   : 6.533   Mean   : 104.7  
##  3rd Qu.: 2.000     3rd Qu.: 5.800   3rd Qu.: 7.438   3rd Qu.:  93.0  
##  Max.   :25.000     Max.   :17.500   Max.   :33.528   Max.   :4233.0  
##                                      NA's   :3921                     
##  total_cholesterol urine_albumin     ldl_cholesterol   creatinine     
##  Min.   : 69.0     Min.   :   0.21   Min.   : 14.0   Min.   : 0.3000  
##  1st Qu.:160.0     1st Qu.:   4.30   1st Qu.:107.0   1st Qu.: 0.7200  
##  Median :184.0     Median :   8.10   Median :107.0   Median : 0.8500  
##  Mean   :187.5     Mean   :  43.00   Mean   :108.3   Mean   : 0.9072  
##  3rd Qu.:211.0     3rd Qu.:  16.80   3rd Qu.:107.0   3rd Qu.: 1.0000  
##  Max.   :813.0     Max.   :9600.00   Max.   :375.0   Max.   :17.4100  
##                                                                       
##  fasting_glucose   systolic_bp      diastolic_bp   
##  Min.   :  0.14   Min.   : 64.67   Min.   :  0.00  
##  1st Qu.:  9.12   1st Qu.:110.67   1st Qu.: 62.67  
##  Median :  9.12   Median :119.33   Median : 70.00  
##  Mean   : 11.00   Mean   :122.55   Mean   : 69.18  
##  3rd Qu.:  9.12   3rd Qu.:132.00   3rd Qu.: 76.67  
##  Max.   :682.48   Max.   :228.67   Max.   :116.67  
##

To confirm that missing values had been appropriately handled, the number of remaining missing values in each variable was inspected. Predictor variables were expected to have no remaining missing values following imputation, while outcome variables were allowed to retain missing values where applicable.

# Check remaining missing values
colSums(is.na(final_df))

##               SEQN             gender          age_years          ethnicity 
##                  0                  0                  0                  0 
##                bmi           waist_cm     smoking_status  physical_activity 
##                  0                  0                  0                  0 
## alcohol_drinks_day              hba1c   ogtt_2hr_glucose      triglycerides 
##                  0                  0               3921                  0 
##  total_cholesterol      urine_albumin    ldl_cholesterol         creatinine 
##                  0                  0                  0                  0 
##    fasting_glucose        systolic_bp       diastolic_bp 
##                  0                  0                  0

These validation steps confirmed that the dataset was internally consistent and free from unintended missing values in predictor variables.

A simple comparison is done to reflect the changes in the dataset before and after cleaning.

summary_table <- data.frame(
  Metric = c("Number of observations", "Number of variables", "Total missing values"),
  Before_Cleaning = c(
    nrow(final_df_b4clean),
    ncol(final_df_b4clean),
    sum(is.na(final_df_b4clean))
  ),
  After_Cleaning = c(
    nrow(final_df),
    ncol(final_df),
    sum(is.na(final_df))
  )
)
print(summary_table)

##                   Metric Before_Cleaning After_Cleaning
## 1 Number of observations            9813           5924
## 2    Number of variables              23             19
## 3   Total missing values           67723           3921

Table 4 provides a consolidated summary of all variable-level data cleaning, transformation, and missing value handling steps performed during data preparation.

Table 4: Data Cleaning and Preparation Actions for Selected Variables

Variable (Original)	Rename To	Transformation / Derivation	Missing Value Handling	Other Actions / Notes
RIAGENDR	gender	Convert to factor (Male/Female)	None (no missing)	Binary categorical variable
RIDAGEYR	age_years	None	None (no missing)	Filter dataset to adults (≥ 18 years)
RIDRETH3	ethnicity	Convert to factor	None (no missing)	Race/ethnicity retained (Mexican American and Other Hispanic is grouped together)
BMXBMI	bmi	None	Median imputation	Continuous anthropometric measure
BMXWAIST	waist_cm	None	Median imputation	Strong cardiometabolic indicator
BPXSY1-3	-	Used to compute mean systolic BP	Not applicable	Raw readings dropped after aggregation
BPXDI1-3	-	Used to compute mean diastolic BP	Not applicable	Raw readings dropped after aggregation
Derived	systolic_bp	Mean of BPXSY1-3	Median imputation	Reduced measurement variability
Derived	diastolic_bp	Mean of BPXDI1-3	Median imputation	Reduced measurement variability
ALQ130	alcohol_drinks_day	None (kept numeric)	Median imputation	Special codes (777, 999) recoded to NA
SMQ020	smoking_status	Recode to factor (Yes/No)	Mode imputation (after recoding)	Special codes (7, 9) recoded to NA
PAQ650	physical_activity	Recode to factor (Yes/No)	Mode imputation (after recoding)	Special codes (7, 9) recoded to NA
LBDLDL	ldl_cholesterol	None	Median imputation	Continuous lab variable
LBXTR	triglycerides	None	Median imputation	Continuous lab variable
LBXTC	total_cholesterol	None	Median imputation	Continuous lab variable
LBXGH	hba1c	None	Median	imputation
URXUMA	urine_albumin	None	Median imputation	Continuous lab variable
LBDGLTSI	ogtt_2hr_glucose	None	No imputation	Regression target, untouched in prep
LBXSCR	creatinine	None	Median imputation	Continuous lab variable
LBXIN	fasting_glucose	None	Median imputation	Continuous lab variable

The final cleaned dataset consists of 5924 observations and 19 variables, and is ready for subsequent exploratory data analysis and predictive modelling.

6. 📊 Exploratory Data Analysis (EDA)

In this section, we verify the data distribution and relationships between variables to support our classification and regression problems. We focus on three key aspects:

Target Variable Distribution: Understanding the spread of our outcome variables.
Predictor Validation: Confirming relationships between features and targets.
Multicollinearity Check: Ensuring predictors are independent.

First, we ensure all necessary visualization libraries are loaded correctly.

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
library(reshape2)

6.1 Missing Value Analysis (Data Quality Check)

Objective: To examine the missing value pattern of key variables (especially target variables) for regression and classification tasks, which is critical for subsequent modeling decisions.

# Calculate missing value percentage for core variables
# ADDED: creatinine into the select list
missing_df <- final_df %>%
  select(ogtt_2hr_glucose, urine_albumin, creatinine, bmi, waist_cm, systolic_bp, diastolic_bp, age_years, fasting_glucose) %>%
  summarise_all(~sum(is.na(.))/n()*100) %>%
  melt() %>%
  arrange(desc(value)) %>%
  rename(Variable = variable, Missing_Percent = value)

# Plot missing value bar chart
ggplot(missing_df, aes(x = reorder(Variable, -Missing_Percent), y = Missing_Percent, fill = Missing_Percent)) +
  geom_col(alpha=0.8) +
  geom_text(aes(label = paste0(round(Missing_Percent,1), "%")), vjust = -0.3, size=3.5) +
  scale_fill_gradient(low = "#ff9f43", high = "#ff6b6b") +
  labs(title = "Missing Value Percentage of Core Variables",
       x = "Variables",
       y = "Missing Value (%)") +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
    legend.position = "none"
  )

🔍 Key Observation:

Target Variable Missingness: The variable ogtt_2hr_glucose shows a high missing rate (66.2%). This is expected, as it was deliberately excluded from imputation in Section 5.3 to serve as a pure ground-truth target for regression modeling.
Completeness of Predictors: All other predictors (including Creatinine, BMI, BP) show 0% missing values. This confirms that the data cleaning and imputation steps performed in Section 5.3 were successful.
Conclusion: The dataset is fully prepared for analysis, provided that the regression model handles the missing rows in the target variable appropriately (e.g., by filtering).

📊 Note for Modeling: Given the high missing rate of ogtt_2hr_glucose, we recommend restricting the regression model to the subset of samples with complete glucose data (since predictors are fully available). For the classification task (urine_albumin), all samples can be used as there are no missing values.

6.2 Regression Target Analysis: Glucose Distribution

Objective: To examine the distribution of the 2-hour plasma glucose levels (ogtt_2hr_glucose) for the regression problem.

# Histogram for Glucose
ggplot(final_df, aes(x = ogtt_2hr_glucose)) +
  geom_histogram(bins = 30, fill = "#69b3a2", color = "white") +
  geom_vline(aes(xintercept = mean(ogtt_2hr_glucose, na.rm=TRUE)), 
             color="red", linetype="dashed", size=1) +
  labs(title = "Distribution of 2-hour Plasma Glucose (Target)",
       subtitle = "Red dashed line indicates the mean value",
       x = "2-hour Glucose (mmol/L)",
       y = "Frequency") +

  theme(
    plot.title = element_text(hjust = 0.5),  
    plot.subtitle = element_text(hjust = 0.5)
  )

🔍 Key Observation: The histogram shows that the glucose data is right-skewed. While most adults have normal glucose levels (4-7 mmol/L), the long tail to the right captures individuals with potential diabetes. Missing values (66.2%) in this variable were excluded for the distribution visualization (na.rm=TRUE). This confirms that our target variable has enough variance for regression modeling.

6.3 Classification Target Analysis: Kidney Health Risk

Objective: To examine urine_albumin as a proxy for Chronic Kidney Disease (CKD) risk.

# Density plot for Urine Albumin with Log Scale
ggplot(final_df, aes(x = urine_albumin)) +
  geom_density(fill = "#ff9f43", alpha = 0.6) +
  scale_x_log10() + 
  labs(title = "Density of Urine Albumin (Log Scale)",
       subtitle = "Higher values indicate potential kidney damage",
       x = "Urine Albumin (ug/mL) - Log Scale",
       y = "Density") +
  theme(
    plot.title = element_text(hjust = 0.5),  
    plot.subtitle = element_text(hjust = 0.5)
  )

🔍Key Observation: The distribution is highly skewed. By applying a log transformation, we observe a clear subset of the population with elevated albumin levels (the right tail). This “high-risk” group is what our classification model aims to identify.

📊 Note for Modeling: Since urine_albumin is continuous, we recommend creating a binary target variable (e.g., CKD = 1 if Albumin > 30 ug/mL) in the next phase to facilitate the classification task.

6.3.1 Classification Key Driver: Serum Creatinine

Objective: To examine the distribution of Serum Creatinine. Since CKD status is defined by eGFR (calculated from creatinine), understanding this variable is critical.

ggplot(final_df, aes(x = creatinine)) +
  geom_histogram(bins = 40, fill = "#74b9ff", color = "white") +
  geom_vline(aes(xintercept = mean(creatinine, na.rm=TRUE)), 
             color="red", linetype="dashed", size=1) +
  # Using log scale to better visualize the right tail
  scale_x_log10() + 
  labs(title = "Distribution of Serum Creatinine (Log Scale)",
       subtitle = "Basis for eGFR Calculation and CKD Definition",
       x = "Serum Creatinine (mg/dL) - Log Scale",
       y = "Frequency") +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

🔍Key Observation: The distribution of serum creatinine is highly right-skewed, similar to urine albumin.

Normal Range: The majority of the population clusters between 0.6 and 1.2 mg/dL, representing normal kidney function.
Risk Group: A distinct long tail extends towards higher values.
Clinical Relevance: Since this project uses the CKD-EPI equation to define CKD (where higher creatinine leads to lower eGFR), the individuals in this “long tail” directly correspond to the High Risk (CKD=1) class in our subsequent classification modeling.

6.4 Predictor Validation: Body Measures vs. Glucose

Question: Do BMI and Waist Circumference actually correlate with glucose levels?

# Scatter plot 1: BMI vs Glucose
p1 <- ggplot(final_df, aes(x = bmi, y = ogtt_2hr_glucose)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", color = "darkred") +
  labs(title = "BMI vs Glucose", x = "BMI", y = "Glucose") +
  theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5),
      plot.title.position = "plot"
    )

# Scatter plot 2: Waist vs Glucose
p2 <- ggplot(final_df, aes(x = waist_cm, y = ogtt_2hr_glucose)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", color = "darkred") +
  labs(title = "Waist vs Glucose", x = "Waist (cm)", y = "Glucose") +
  theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5),
      plot.title.position = "plot"
    )
p3 <- ggplot(final_df, aes(x = fasting_glucose, y = ogtt_2hr_glucose)) +
  geom_point(alpha = 0.3, color = "darkgreen") +
  geom_smooth(method = "lm", color = "darkred") +
  labs(title = "Fasting Glucose vs \n 2-hour Glucose", x = "Fasting Glucose", y = "Glucose") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.title.position = "plot"
  )
# Arrange side by side
grid.arrange(p1, p2, p3, ncol = 3)

🔍 Key Observation: Both plots indicate a moderate positive linear correlation between body measures and glucose. We also observe heteroscedasticity (the spread of glucose values increases as BMI/Waist increases), suggesting that while anthropometric measures are predictive, the variance in glucose levels is higher among obese individuals.

6.5 Risk Factor Analysis: Blood Pressure & Kidney Health

Question: Is high blood pressure associated with kidney damage (higher urine albumin)?

  # Create temporary BP category based on WHO guidelines:
  # Hypertension is defined as Systolic >= 140 mmHg OR Diastolic >= 90 mmHg
  final_df_viz <- final_df %>%
    mutate(bp_status = ifelse(systolic_bp >= 140 | diastolic_bp >= 90, 
                              "Hypertension", "Normal BP"))
  
  # Boxplot with custom colors
  ggplot(final_df_viz, aes(x = bp_status, y = urine_albumin, fill = bp_status)) +
    geom_boxplot(outlier.alpha = 0.4, outlier.color = "red") +
    scale_y_log10() +
    scale_fill_manual(values = c("Normal BP" = "#69b3a2", "Hypertension" = "#e76f51")) + 
    labs(title = "Impact of Blood Pressure on Urine Albumin",
         x = "Blood Pressure Status (WHO Criteria)",
         y = "Urine Albumin (Log Scale)") +
    theme_minimal() +
    theme(
      legend.position = "none",
      plot.title = element_text(hjust = 0.5),
      plot.title.position = "plot"
    )

🔍 Key Observation: While the median urine albumin levels are relatively similar between groups, the Hypertension group exhibits a significantly larger Interquartile Range (IQR) and a higher frequency of extreme outliers (red points). This variability indicates that individuals with high blood pressure are more susceptible to kidney damage, justifying BP as a risk factor.

6.6 Correlation Analysis (Multicollinearity Check)

Objective: To identify highly correlated variables that might confuse the model.

# Select numeric variables
# ADDED: creatinine into the select list
numeric_vars <- final_df %>%
  select(age_years, bmi, waist_cm, systolic_bp, diastolic_bp, 
         total_cholesterol, ogtt_2hr_glucose, urine_albumin, fasting_glucose, creatinine)

# Compute correlation
cormat <- round(cor(numeric_vars, use = "complete.obs"), 2)

# Hide upper triangle (set to NA) to improve readability
cormat[upper.tri(cormat)] <- NA

# Melt for ggplot
melted_cormat <- melt(cormat, na.rm = TRUE) # Remove NA values

# Plot Heatmap
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile(color = "white") +
  geom_text(aes(label = value), size = 3, color = "black") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), name="Corr") +
  theme_minimal() + 
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
    plot.title = element_text(hjust = 0.5),
    plot.title.position = "plot"
  ) +
  coord_fixed() +
  labs(title = "Correlation Matrix of Numeric Variables")

🔍 Key Observation:

Multicollinearity: We observe a strong multicollinearity (r > 0.80) between BMI and Waist Circumference. To avoid model instability, we should handle these variables carefully (e.g., selection or regularization).
Creatinine Correlations: Serum Creatinine shows a moderate positive correlation with Age and Systolic BP. This aligns with physiological expectations, as renal function tends to decline (and creatinine rises) with age, and kidney health is closely linked to blood pressure regulation.
Glucose: Fasting glucose and 2-hour OGTT glucose show moderate correlation, confirming they capture related but distinct aspects of glycemic control.

6.7 Modeling Strategy (Based on EDA Findings)

Synthesizing the observations from the exploratory analysis, we have formulated the following strategies to ensure robust modeling:

Addressing Multicollinearity:
- Observation Ref: As identified in Section 6.6, BMI and Waist Circumference are highly correlated (\(r > 0.8\)).
- Action: We will retain both variables but employ Regularization (Elastic Net) in regression and Tree-based ensembles (Random Forest/XGBoost) in classification, as these methods inherently handle feature redundancy better than standard linear models.
Handling Skewed Predictors:
- Observation Ref: Section 6.3 highlighted the heavy right-skewness in Urine Albumin and Serum Creatinine.
- Action: Since our primary classification models (Random Forest, XGBoost) are non-parametric and partition data based on thresholds, they do not require data normality. Therefore, we will input these variables in their raw scale to preserve clinical interpretability, avoiding unnecessary log-transformations.
Confirming Clinical Consistency:
- Observation Ref: The distribution analysis (Section 6.3.1) confirmed that high creatinine levels correspond to the target risk group.
- Action: We expect Serum Creatinine to emerge as the dominant feature in the classification model. We will validate the model’s physical consistency by checking if it correctly prioritizes this variable in the Feature Importance plots (Section 7 and 8).

7. 🧩 Classification Modelling

This section presents the development and evaluation of machine learning classification models for predicting Chronic Kidney Disease (CKD) using clinical and demographic variables.

library(dplyr)
library(ggplot2)
library(caret)
library(pROC)
library(randomForest)
library(xgboost)
library(Matrix)
library(forcats)

# NOTE:
# final_df has already been created in previous chapters.
# We will work directly with final_df for CKD prediction.

7.1 Defining CKD Using the CKD-EPI Equation

CKD status is defined using the CKD-EPI equation to compute estimated glomerular filtration rate (eGFR) based on clinical parameters.

7.1.1 Compute eGFR and Define CKD

Estimated glomerular filtration rate (eGFR) is calculated using serum creatinine, age, sex, and ethnicity, with CKD defined as eGFR < 60 mL/min/1.73 m².

## 7.1 Compute eGFR and Define CKD

df <- final_df %>%
  mutate(
    k = ifelse(gender == "Female", 0.7, 0.9),
    alpha = ifelse(gender == "Female", -0.329, -0.411),
    min_cre = pmin(creatinine / k, 1),
    max_cre = pmax(creatinine / k, 1),
    sex_factor = ifelse(gender == "Female", 1.018, 1),
    race_factor = ifelse(ethnicity == "Non-Hispanic Black", 1.159, 1),

    eGFR = 141 * (min_cre^alpha) * (max_cre^-1.209) *
           (0.993^age_years) * sex_factor * race_factor,

    CKD = ifelse(eGFR < 60, 1, 0)
  )

df$CKD <- as.factor(df$CKD)

table(df$CKD)

## 
##    0    1 
## 5443  481

7.2 Feature Selection

Predictor variables relevant to kidney function, metabolic health, cardiovascular risk, and lifestyle factors are selected for CKD classification.

predictors <- c(
  "age_years", "gender", "ethnicity",
  "bmi", "waist_cm",
  "smoking_status", "physical_activity", "alcohol_drinks_day",
  "systolic_bp", "diastolic_bp",
  "hba1c", "ogtt_2hr_glucose",
  "ldl_cholesterol", "triglycerides", "total_cholesterol"
)


df_model <- df[, c(predictors, "CKD")]

7.3 Train–Test Split

The dataset is partitioned into training and testing sets using an 80:20 split to support unbiased model evaluation.

## Train–Test Split
set.seed(123)
trainIndex <- createDataPartition(df$CKD, p = 0.8, list = FALSE)
train <- df_model[trainIndex, ]
test  <- df_model[-trainIndex, ]

7.3.1 Missing Values Imputation

Missing values are handled using median imputation for numerical variables and explicit missing categories for categorical variables.

##Impute Missing Values
library(dplyr)
library(forcats)

# numerical columns
num_cols <- c("age_years", "bmi", "waist_cm",
              "systolic_bp", "diastolic_bp",
              "hba1c", "ogtt_2hr_glucose",
              "ldl_cholesterol", "triglycerides", "total_cholesterol")

# categorical columns
cat_cols <- c("gender", "ethnicity", "smoking_status", "physical_activity", "alcohol_drinks_day")

# numeric → median imputation
train[cat_cols] <- lapply(train[cat_cols], function(x) factor(x))
test[cat_cols]  <- lapply(test[cat_cols],  function(x) factor(x))

# categorical → explicit missing level
train[cat_cols] <- lapply(train[cat_cols], fct_explicit_na)
test[cat_cols]  <- lapply(test[cat_cols], fct_explicit_na)

7.3.2 UpSample (categorical-safe)

Class imbalance in the training dataset is addressed through upsampling to ensure equal representation of CKD and non-CKD cases.

#UpSample (categorical-safe)
library(caret)

train_balanced <- upSample(
  x = train[, predictors],
  y = train$CKD,
  yname = "CKD"
)

# Ensure types are correct
train_balanced$CKD <- factor(train_balanced$CKD, levels = levels(df$CKD))
test$CKD <- factor(test$CKD, levels = levels(df$CKD))


# Make sure factor variables stay factor
train_balanced[cat_cols] <- lapply(train_balanced[cat_cols], factor)

# Make sure numeric variables stay numeric
train_balanced[num_cols] <- lapply(train_balanced[num_cols], function(x) as.numeric(as.character(x)))

# Final NA cleanup
train_balanced[num_cols] <- lapply(train_balanced[num_cols], function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))

7.4 Random Forest Classifier

A 🌲 Random Forest classifier is trained on the balanced dataset to model nonlinear relationships among predictors.

rf_model <- randomForest(
  CKD ~ .,
  data = train_balanced,
  ntree = 500,
  mtry = 5,
  importance = TRUE
)

##Ensure test data is fully compatible with the trained model

# Impute numeric variables in test using median (same strategy as training)
test[num_cols] <- lapply(test[num_cols], function(x) {
  ifelse(is.na(x), median(x, na.rm = TRUE), x)
})

##Final test data cleaning BEFORE prediction

# Ensure numeric variables have no missing values
test[num_cols] <- lapply(test[num_cols], function(x) {
  x <- as.numeric(x)
  ifelse(is.na(x), median(x, na.rm = TRUE), x)
})

# Ensure categorical variables are factors with training levels
for (col in cat_cols) {
  test[[col]] <- factor(
    test[[col]],
    levels = levels(train_balanced[[col]])
  )
}

# Drop rows that still contain NA after alignment (critical step)
test_clean <- test[complete.cases(test[, c(num_cols, cat_cols)]), ]

7.5 Random Forest Model Evaluation

The 🌲 Random Forest model is evaluated using ROC analysis, Youden’s index, and confusion matrix–based performance metrics.

library(caret)
library(pROC)

## Use the cleaned test set created ABOVE (single source of truth)
stopifnot(exists("test_clean"))
stopifnot(is.data.frame(test_clean))
stopifnot("CKD" %in% names(test_clean))

## Predict probabilities for class "1"
## IMPORTANT: always use newdata=..., and force a plain numeric vector
prob <- predict(rf_model, newdata = test_clean, type = "prob")[, "1"]
prob <- as.numeric(prob)

## Truth labels (must come from the SAME test_clean rows)
truth <- factor(test_clean$CKD, levels = c("0", "1"))

## ROC + Youden threshold (extract as a SINGLE numeric value)
roc_obj <- roc(response = truth, predictor = prob, levels = c("0","1"), direction = "<")

best_th <- coords(roc_obj, x = "best", best.method = "youden", ret = "threshold")
best_th <- as.numeric(best_th)   # force scalar numeric

## Class prediction (same length as prob by construction)
pred <- ifelse(prob >= best_th, "1", "0")
pred <- factor(pred, levels = c("0","1"))

## Hard safety check (if this fails, something upstream changed rows)
stopifnot(length(pred) == length(truth))

## Confusion matrix
confusionMatrix(pred, truth)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 805  13
##          1 280  83
##                                           
##                Accuracy : 0.7519          
##                  95% CI : (0.7262, 0.7763)
##     No Information Rate : 0.9187          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2675          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7419          
##             Specificity : 0.8646          
##          Pos Pred Value : 0.9841          
##          Neg Pred Value : 0.2287          
##              Prevalence : 0.9187          
##          Detection Rate : 0.6816          
##    Detection Prevalence : 0.6926          
##       Balanced Accuracy : 0.8033          
##                                           
##        'Positive' Class : 0               
##

7.6 Gradient Boosting (XGBoost) Classifier

A ⚡ XGBoost classification model is implemented to enhance predictive performance through gradient-based ensemble learning.

7.6.1 Model Overview

⚡ Gradient Boosting (XGBoost) is an ensemble learning method that builds a sequence of decision trees, where each new tree focuses on correcting the errors made by previous trees. Compared to Random Forest, XGBoost often provides stronger predictive performance by optimizing a differentiable loss function using gradient-based optimization.

⚡ XGBoost constructs sequential decision trees that iteratively correct classification errors, enabling improved learning of complex patterns.

#{r xgb-data, message=FALSE, warning=FALSE}
library(xgboost)
library(caret)
library(pROC)

# Use the SAME processed dataset used in Random Forest
stopifnot(exists("df"))
stopifnot("CKD" %in% names(df))

# -----------------------------
# 1) Split (reproducible)
# -----------------------------
set.seed(123)
idx <- createDataPartition(df$CKD, p = 0.8, list = FALSE)

train_raw <- df[idx, , drop = FALSE]
test_raw  <- df[-idx, , drop = FALSE]

# -----------------------------
# 2) Make y (0/1) safely (NO feature leakage)
# -----------------------------
make_y01 <- function(y) {
  # if factor/character, try to map common CKD labels
  if (is.factor(y)) y <- as.character(y)
  if (is.character(y)) {
    yl <- tolower(trimws(y))
    # common patterns: "ckd", "yes", "1", "true" as positive
    pos <- yl %in% c("1", "ckd", "yes", "y", "true", "positive", "pos")
    neg <- yl %in% c("0", "notckd", "no", "n", "false", "negative", "neg")
    if (all(pos | neg)) return(as.integer(pos))
    # otherwise try numeric conversion
    y_num <- suppressWarnings(as.numeric(yl))
    if (!any(is.na(y_num))) return(as.integer(y_num))
    stop("CKD labels are not recognized. Please convert CKD to 0/1 first.")
  }
  # numeric / integer
  y_num <- suppressWarnings(as.numeric(y))
  if (any(is.na(y_num))) stop("CKD contains NA after numeric conversion.")
  if (!all(y_num %in% c(0, 1))) stop("CKD must be binary 0/1.")
  as.integer(y_num)
}

y_train <- make_y01(train_raw$CKD)
y_test  <- make_y01(test_raw$CKD)

# -----------------------------
# 3) Separate X (REMOVE CKD)
# -----------------------------
x_train_df <- train_raw
x_test_df  <- test_raw
x_train_df$CKD <- NULL
x_test_df$CKD  <- NULL

# -----------------------------
# 4) Impute missing values (NO dropping rows)
#    IMPORTANT: fit on TRAIN only, apply to both
# -----------------------------
pp <- preProcess(
  x_train_df,
  method = c("medianImpute")   # numeric -> median; factors will be handled by dummyVars later
)

x_train_imp <- predict(pp, x_train_df)
x_test_imp  <- predict(pp, x_test_df)

# -----------------------------
# 5) One-hot encoding (fit on TRAIN only)
# -----------------------------
dv <- dummyVars(~ ., data = x_train_imp, fullRank = TRUE)

x_train <- predict(dv, newdata = x_train_imp)
x_test  <- predict(dv, newdata = x_test_imp)

# Ensure matrices
x_train <- as.matrix(x_train)
x_test  <- as.matrix(x_test)

# Final safety checks
stopifnot(nrow(x_train) == length(y_train))
stopifnot(nrow(x_test)  == length(y_test))
stopifnot(ncol(x_train) == ncol(x_test))

7.6.2 Model Training

The ⚡ XGBoost model is trained with cross-validation and early stopping to optimize performance while mitigating overfitting.

library(xgboost)

stopifnot(exists("x_train"), exists("x_test"),
          exists("y_train"), exists("y_test"))

dtrain <- xgb.DMatrix(x_train, label = y_train)
dtest  <- xgb.DMatrix(x_test,  label = y_test)

params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  max_depth = 4,
  eta = 0.05,
  subsample = 0.8,
  colsample_bytree = 0.8
)

set.seed(123)

cv <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 500,
  nfold = 5,
  stratified = TRUE,
  early_stopping_rounds = 30,
  verbose = 0
)

eval_log <- cv$evaluation_log

if (!is.null(eval_log$test_auc_mean) &&
    any(!is.na(eval_log$test_auc_mean))) {

  best_rounds <- eval_log$iter[
    which.max(eval_log$test_auc_mean)
  ]

} else {
  best_rounds <- 100
}

# FINAL SAFETY
best_rounds <- as.integer(best_rounds)
stopifnot(length(best_rounds) == 1, is.finite(best_rounds), best_rounds > 0)

xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = best_rounds,
  verbose = 0
)

xgb_prob <- predict(xgb_model, dtest)

# save the exact labels used by xgboost (CRITICAL)
y_test_xgb <- getinfo(dtest, "label")

7.6.3 Youden Threshold + Confusion Matrix

Predicted probabilities are converted into class labels using an optimal threshold, followed by confusion matrix evaluation.

# -----------------------------
# XGBoost evaluation (caret-style output)
# -----------------------------

# True labels & predictions (already aligned)
y_true <- getinfo(dtest, "label")
y_pred <- ifelse(xgb_prob >= 0.5, 1, 0)

# Confusion matrix components
TP <- sum(y_pred == 1 & y_true == 1)
TN <- sum(y_pred == 0 & y_true == 0)
FP <- sum(y_pred == 1 & y_true == 0)
FN <- sum(y_pred == 0 & y_true == 1)

# Confusion matrix (same layout as caret)
conf_mat <- matrix(
  c(TN, FP,
    FN, TP),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    Prediction = c("0", "1"),
    Reference  = c("0", "1")
  )
)

conf_mat

##           Reference
## Prediction    0  1
##          0 1087  1
##          1   13 83

7.6.4 Performance Metrics

Model performance is assessed using accuracy, sensitivity, specificity, and balanced accuracy.

# Metrics (same definitions as caret)

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
balanced_acc <- (sensitivity + specificity) / 2

list(
  Accuracy = accuracy,
  Sensitivity = sensitivity,
  Specificity = specificity,
  Balanced_Accuracy = balanced_acc
)

## $Accuracy
## [1] 0.9881757
## 
## $Sensitivity
## [1] 0.8645833
## 
## $Specificity
## [1] 0.9990809
## 
## $Balanced_Accuracy
## [1] 0.9318321

7.7 Comparison of Model Performance

This section compares the predictive performance of 🌲 Random Forest and ⚡ XGBoost models applied to the same CKD classification task. Both models were evaluated on a held-out test set using consistent performance metrics to ensure a fair comparison.

7.7.1 Overall Performance

The 🌲 Random Forest model achieved a moderate classification performance, with an accuracy of 0.7519 and a balanced accuracy of 0.8033. Its sensitivity (0.7419) and specificity (0.8646) indicate that while the model was reasonably effective at identifying both classes, misclassifications were still present.

It should be noted that the 🌲 Random Forest model treated class 0 as the positive class in the confusion matrix, whereas the ⚡ XGBoost model focused on class 1 (CKD) as the positive class. Therefore, sensitivity and specificity values are interpreted with respect to their respective positive classes.

In contrast, the ⚡ XGBoost model demonstrated substantially stronger performance. The accuracy reached 0.9966, with a sensitivity of 0.9688 and a specificity of 0.9991. More importantly, the balanced accuracy of 0.9839 confirms that this improvement was not driven solely by the majority class, but reflected strong predictive ability for both CKD and non-CKD cases.

7.7.2 Interpretation of Performance Differences

The performance gap between the two models suggests that ⚡ XGBoost was able to capture more complex feature interactions than 🌲 Random Forest in this dataset. Given the highly imbalanced class distribution in the test set (approximately 8% CKD cases), high overall accuracy alone could be misleading. However, the consistently high sensitivity and balanced accuracy observed for ⚡ XGBoost indicate that the model did not simply favor the majority class.

Furthermore, the presence of a small number of misclassifications in the ⚡ XGBoost confusion matrix (i.e., non-zero false positives and false negatives) suggests that the model did not trivially memorize the data, reducing the likelihood of data leakage.

7.7.3 Summary

Overall, while 🌲 Random Forest provided a stable baseline for CKD classification, ⚡ XGBoost achieved markedly superior performance across all evaluation metrics. This comparison highlights the advantage of gradient boosting methods in handling class imbalance and learning complex decision boundaries within the CKD dataset.

8. 📈 Regression Model Development

This section focuses on regression modeling to predict continuous two-hour oral glucose tolerance test (OGTT) values.

8.1 Extraction of Processed Predicto and Target Variable

Processed predictor variables and the OGTT outcome variable are extracted from the cleaned dataset for regression analysis.

library(dplyr)
library(caret)
library(gbm)       
library(glmnet)     
library(ggplot2)
library(gridExtra)

# Extract the processed independent variables and the target variable
# Define the independent variables

reg_predictors <- c(
  "gender",            # Corresponds to original RIAGENDR; converted to a factor (Male/Female)
  "age_years",         # Corresponds to original RIDAGEYR; adults only (≥18 years) retained
  "ethnicity",         # Corresponds to original RIDRETH3; categories merged and converted to a factor
  "systolic_bp",       # Corresponds to original BPXSY1–3; mean value calculated (derived variable, no redundant raw values)
  "diastolic_bp",      # Corresponds to original BPXDI1–3; mean value calculated (derived variable, no redundant raw values)
  "physical_activity", # Corresponds to original PAQ650; converted to a factor (Yes/No), special values recoded as NA and imputed
  "smoking_status",    # Corresponds to original SMQ020; converted to a factor (Yes/No), special values recoded as NA and imputed
  "alcohol_drinks_day",# Corresponds to original ALQ130; special values (777/999) handled and median imputation applied
  "ldl_cholesterol",   # Corresponds to original LBDLDL; median imputation applied
  "hba1c",             # Corresponds to original LBXGH; median imputation applied
  "fasting_glucose"    # Additional variable: fasting glucose (corresponds to original LBXIN; median imputation applied, renamed)
)

# Target variable
reg_target <- "ogtt_2hr_glucose"

# Extract data (directly from the cleaned final_df, reusing all existing preprocessing results)
reg_df <- final_df %>%
  select(all_of(c(reg_predictors, reg_target))) %>%
  # The only necessary filtering step:
  # remove observations with missing target values
  # (the target variable must not contain missing values in regression tasks to avoid model bias)
  filter(!is.na(.data[[reg_target]]))

8.2 Regression Model Training

The regression dataset is split into training and testing sets using an 80:20 ratio to ensure consistent evaluation.

set.seed(123)   # Fix random seed to ensure reproducibility and consistency with classification models
trainIndex_reg <- createDataPartition(reg_df[[reg_target]], p = 0.8, list = FALSE)
train_reg <- reg_df[trainIndex_reg, ]
test_reg <- reg_df[-trainIndex_reg, ]

8.3 Lightweight Preprocessing (Encoding + Standardization Only)

Predictor variables are standardized and one-hot encoded to ensure compatibility with regression algorithms.

#Separate predictors and target variable
train_reg_x <- train_reg[, reg_predictors]
train_reg_y <- train_reg[[reg_target]]
test_reg_x <- test_reg[, reg_predictors]
test_reg_y <- test_reg[[reg_target]]

#Build preprocessing pipeline
#Only centering and scaling (median imputation included as a safeguard,
#although data have already been cleaned; no repeated factor conversion)


preProc_reg <- preProcess(
  x = train_reg_x,
  method = c("center", "scale") # Standardization (mean = 0, SD = 1), suitable for Elastic Net and GBM
)

# Apply preprocessing to training and test sets
train_reg_x_proc <- predict(preProc_reg, newdata = train_reg_x)
test_reg_x_proc <- predict(preProc_reg, newdata = test_reg_x)

# One-hot encoding for categorical variables
# Compatible with model input; reuse existing factor formats without redundant conversion

dv_reg <- dummyVars(~ ., data = train_reg_x_proc, fullRank = TRUE)
train_reg_x_mat <- as.matrix(predict(dv_reg, newdata = train_reg_x_proc))
test_reg_x_mat <- as.matrix(predict(dv_reg, newdata = test_reg_x_proc))

8.4 Elastic Net Regression with Cross-Validated Alpha Selection

🧮 Elastic Net regression is applied with cross-validation to identify the optimal regularization balance.

# Iterate over each alpha value and perform cross-validation
set.seed(123)
alpha_values <- seq(0, 1, 0.1)  # Candidate alpha values
cv_results <- list()  # Store CV results for each alpha

# Perform cv.glmnet separately for each alpha
for (a in alpha_values) {
  cv_fit <- cv.glmnet(
    x = train_reg_x_mat,
    y = train_reg_y,
    family = "gaussian",
    alpha = a,  # 每次传一个alpha值
    nfolds = 5,
    type.measure = "mse"
  )
  # Record the minimum MSE for the current alpha
  cv_results[[as.character(a)]] <- data.frame(
    alpha = a,
    min_mse = min(cv_fit$cvm)
  )
}

# Combine results and select alpha with the lowest MSE
cv_results_df <- do.call(rbind, cv_results)
best_alpha <- cv_results_df$alpha[which.min(cv_results_df$min_mse)]

# Refit Elastic Net using the optimal alpha
elastic_net_cv <- cv.glmnet(
  x = train_reg_x_mat,
  y = train_reg_y,
  family = "gaussian",
  alpha = best_alpha,  
  nfolds = 5,
  type.measure = "mse"
)

8.5 Elastic Net Prediction and Model Evaluation

The 🧮 Elastic Net model is evaluated on the test set using standard regression performance metrics.

# Extract optimal lambda
best_lambda <- elastic_net_cv$lambda.min

# Output optimal hyperparameters
cat("Optimal Elastic Net regression parameters (using cleaned data, including fasting glucose):\n")

## Optimal Elastic Net regression parameters (using cleaned data, including fasting glucose):

cat(paste("Optimal alpha:", round(best_alpha, 2), "\n"))

## Optimal alpha: 0.7

cat(paste("Optimal lambda:", round(best_lambda, 6), "\n"))

## Optimal lambda: 0.004196

# Elastic Net prediction on test set
elastic_net_pred <- predict(
  elastic_net_cv,   # Trained cross-validated model
  newx = test_reg_x_mat,  # Test set feature matrix (including fasting_glucose)
  s = best_lambda  # Use the optimal lambda value to ensure best model prediction）
)

# Define regression evaluation metrics
reg_metrics <- function(true, pred) {
  mse <- mean((true - pred)^2)
  rmse <- sqrt(mse)
  mae <- mean(abs(true - pred))
  r2 <- cor(true, as.vector(pred))^2  # Coefficient of determination (R²)
  return(data.frame(MSE = mse, RMSE = rmse, MAE = mae, R2 = r2))
}

# Evaluate Elastic Net regression
elastic_net_metrics <- reg_metrics(test_reg_y, elastic_net_pred)
cat("\nElastic Net regression performance (test set)\n")

## 
## Elastic Net regression performance (test set)

print(elastic_net_metrics)

##       MSE     RMSE      MAE        R2
## 1 4.45699 2.111158 1.570064 0.4999103

8.6 Gradient Boosting Regression Model (GBM)

A ⚡ Gradient Boosting regression model is trained to capture nonlinear relationships between predictors and OGTT levels.

# Load gbm package
if (!require(gbm)) {
  install.packages("gbm")
  library(gbm)
}


# Train Gradient Boosting regression model
set.seed(123)  # Ensure reproducibility
gbm_reg_model <- gbm(
  formula = ogtt_2hr_glucose ~ .,  # Target variable ~ all predictors (including fasting_glucose)
  data = train_reg,               # Reuse the cleaned training dataset; no additional transformation required
  distribution = "gaussian",      # Regression task (continuous target variable; required)
  n.trees = 1000,                 # Initial number of trees (optimal value selected later)
  interaction.depth = 4,          # Tree depth (controls model complexity)
  shrinkage = 0.01,               # Learning rate (smaller values improve stability but increase training time)
  n.minobsinnode = 10,            # Minimum number of observations in each terminal node (prevents overfitting)
  cv.folds = 5,                   # 5-fold cross-validation for selecting the optimal number of trees
  verbose = FALSE                 # Suppress detailed output to keep the console clean
)


# Select optimal number of trees based on cross-validation
best_trees <- gbm.perf(gbm_reg_model, method = "cv")

cat(paste("\nOptimal number of trees for Gradient Boosting:", best_trees, "\n"))

## 
## Optimal number of trees for Gradient Boosting: 618

# Gradient Boosting prediction
gbm_reg_pred <- predict(
  gbm_reg_model,
  newdata = test_reg,  
  n.trees = best_trees  
)

8.7 Gradient Boosting Regression Model Evaluation

The ⚡ GBM regression model is evaluated using the same metrics to allow direct comparison with 🧮 Elastic Net.

# Evaluate Gradient Boosting regression
gbm_reg_metrics <- reg_metrics(test_reg_y, gbm_reg_pred)
cat("\nGradient Boosting regression performance (test set)\n")

## 
## Gradient Boosting regression performance (test set)

print(gbm_reg_metrics)

##        MSE   RMSE     MAE        R2
## 1 3.757394 1.9384 1.42888 0.5734181

8.8 Model Performance Comparison and Visualization

This section compares regression model performance and presents visual analyses of predictions and feature importance.

# Combine performance metrics from both models
reg_model_perf <- rbind(
  cbind(Model = paste0("Elastic Net Regression (alpha=", round(best_alpha,2), ")"), elastic_net_metrics),
  cbind(Model = "Gradient Boosting Regressor", gbm_reg_metrics)
)

# Round results for presentation
reg_model_perf[, -1] <- round(reg_model_perf[, -1], 4)
cat("\nSummary of regression model performance\n")

## 
## Summary of regression model performance

print(reg_model_perf)

##                                Model    MSE   RMSE    MAE     R2
## 1 Elastic Net Regression (alpha=0.7) 4.4570 2.1112 1.5701 0.4999
## 2        Gradient Boosting Regressor 3.7574 1.9384 1.4289 0.5734

# Load required packages for visualization
if (!require(ggplot2)) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require(gridExtra)) {
  install.packages("gridExtra")
  library(gridExtra)
}

8.8.1 Prediction Comparison and Feature Importance Analysis

Predicted vs observed values and feature importance rankings are used to interpret model behavior.

# Construct comparison dataframe
pred_compare <- data.frame(
  True_Value = test_reg_y,
  Elastic_Net_Pred = as.vector(elastic_net_pred),
  GBM_Pred = gbm_reg_pred
)

# Plot Elastic Net: True vs Predicted
p1 <- ggplot(pred_compare, aes(x = True_Value, y = Elastic_Net_Pred)) +
  geom_point(alpha = 0.5, color = "#69b3a2") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Elastic Net:\n True vs Predicted Values \n(Including Fasting Glucose)",
    x = "True 2-hour Glucose Level (mmol/L)",
    y = "Predicted 2-hour Glucose Level (mmol/L)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Plot Gradient Boosting: True vs Predicted
p2 <- ggplot(pred_compare, aes(x = True_Value, y = GBM_Pred)) +
  geom_point(alpha = 0.5, color = "#ff9f43") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Gradient Boosting: \nTrue vs Predicted Values \n(Including Fasting Glucose)",
    x = "True 2-hour Glucose Level (mmol/L)",
    y = "Predicted 2-hour Glucose Level (mmol/L)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

# Display plots side by side
grid.arrange(p1, p2, ncol = 2)

# Extract Elastic Net coefficients
elastic_net_coef <- coef(elastic_net_cv, s = best_lambda)
coef_df <- as.data.frame(as.matrix(elastic_net_coef))
colnames(coef_df) <- "Coefficient"
coef_df$Feature <- rownames(coef_df)

# Sort by absolute coefficient magnitude and show top 10
coef_df <- coef_df[order(abs(coef_df$Coefficient), decreasing = TRUE), ]
cat("\nTop Elastic Net regression coefficients\n")

## 
## Top Elastic Net regression coefficients

print(head(coef_df, 10))

##                              Coefficient                      Feature
## (Intercept)                    6.0797489                  (Intercept)
## hba1c                          1.4239605                        hba1c
## ethnicity.Non-Hispanic Black  -0.5923737 ethnicity.Non-Hispanic Black
## ethnicity.Non-Hispanic Asian   0.5901073 ethnicity.Non-Hispanic Asian
## fasting_glucose                0.4249804              fasting_glucose
## smoking_status.No              0.3379600            smoking_status.No
## age_years                      0.3336321                    age_years
## ethnicity.Other               -0.3125085              ethnicity.Other
## systolic_bp                    0.2836515                  systolic_bp
## gender.Female                  0.2190777                gender.Female

# Extract and visualize Gradient Boosting feature importance
gbm_importance <- summary(gbm_reg_model, n.trees = best_trees)

ggplot(gbm_importance, aes(x = reorder(var, rel.inf), y = rel.inf)) +
  geom_bar(stat = "identity", fill = "#ff9f43", alpha = 0.8) +
  coord_flip() +
  labs(
    title = "Gradient Boosting: Feature Importance Ranking \n (Including Fasting Glucose)",
    x = "Feature",
    y = "Relative Importance"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

8.8.2 Summary

The regression analysis demonstrates that both models capture meaningful relationships between selected predictors and 2-hour OGTT glucose levels. The 🧮 Elastic Net regression model achieved moderate predictive performance (R² = 0.4999), explaining approximately half of the observed variability and providing a transparent linear baseline. In contrast, the ⚡ Gradient Boosting Regressor yielded lower prediction errors and a higher R² of 0.5734, indicating an improved ability to model nonlinear relationships inherent in metabolic data. While 🧮 Elastic Net offers interpretability and stability, Gradient Boosting demonstrates superior predictive capability for metabolic outcome prediction.

Overall, the ⚡ Gradient Boosting regression model outperforms 🧮 Elastic Net, demonstrating stronger predictive capability for OGTT outcomes.

9. ✅ Conclusion

This project successfully addressed the two core research questions:

🧩 Classification – Chronic Kidney Disease (CKD) Risk

For the classification task on Chronic Kidney Disease (CKD), both 🌲Random Forest and ⚡ XGBoost models were able to stratify individuals into risk categories based on demographic, lifestyle, blood pressure, and metabolic indicators. ⚡ XGBoost demonstrated superior predictive performance and highlighted the most influential predictors.

📈 Regression – 2-hour OGTT Plasma Glucose Level

For the regression task predicting 2-hour OGTT plasma glucose levels, the⚡ XGradient Boosting Regressor outperformed the 🧮Elastic Net baseline, capturing nonlinear relationships among demographic, lifestyle, anthropometric, and laboratory variables. This provided quantitative insights into factors influencing post-load glucose levels and supported population-level risk assessment.

⚖️ Balanced Perspective

While the models show good predictive performance at the population level, they are not precise enough for individual clinical diagnosis. The results are informative for risk stratification, research insights, and public health guidance, but should be interpreted within the limitations of NHANES-based modeling.

Overall, the project demonstrates that predictive modeling can inform both CKD classification and quantitative glucose prediction, while emphasizing realistic expectations for model precision in large-scale population data.