| Student Name | Matric ID | Role |
|---|---|---|
| Anisyahayati binti Ismail | 17024369 | Group Leader and Data Cleaning |
| Ren Lin | 22082936 | Exploratory Data Analysis (EDA) |
| Li Yaxin | 25063108 | Classification Modeling |
| Wei Qian | 25057795 | Regression Modeling |
| Chia Ai Pei | 23084840 | R Markdown and Documentation |
Metabolic and renal disorders, such as diabetes and Chronic Kidney Disease (CKD), are major public health concerns worldwide. Early identification of individuals at risk can enable timely interventions and improved clinical outcomes. Large-scale population surveys, such as the National Health and Nutrition Examination Survey (NHANES), provide comprehensive health, lifestyle, and laboratory data that can be leveraged for predictive modeling.
This project focuses on the 2013–2014 NHANES survey cycle to investigate relationships between demographic characteristics, lifestyle behaviors, physiological measurements, and laboratory indicators with metabolic and renal health outcomes. R programming plays a central role in this project, serving as the platform for data exploration, processing, modeling, and reporting, and enabling a structured, reproducible workflow for deriving insights from complex health data.
- Source: National Health and Nutrition Examination Survey from Kaggle
- URL: https://www.kaggle.com/datasets/cdc/national-health-and-nutrition-examination-survey
The primary objectives of this project are to apply predictive modeling techniques to identify and quantify key factors associated with metabolic and renal health outcomes. Specifically, the project aims to:
Develop a classification model to predict the presence of Chronic Kidney Disease (CKD) based on demographic variables, lifestyle factors, blood pressure, and metabolic laboratory indicators. This will allow for stratification of individuals into risk categories and highlight the most influential predictors.
Develop a regression model to predict 2-hour plasma glucose levels measured during an oral glucose tolerance test (OGTT) using routinely collected demographic, lifestyle, anthropometric, and laboratory variables. This will provide insights into which factors most strongly influence glucose metabolism and allow for quantitative predictions of glycemic status.
By addressing both a classification problem (CKD risk) and a regression problem ( OGTT glucose levels), this project demonstrates the application of predictive modeling techniques across different types of health-related outcomes using a large-scale population dataset.
To achieve the objectives of this project, two analytical questions were formulated, each corresponding to a different type of predictive modeling task:
Classification Question – Chronic Kidney Disease (CKD)
Risk:
Can we predict the presence of CKD using demographic
characteristics, lifestyle factors, blood pressure, and metabolic
laboratory indicators?
This question is framed as a classification problem,
where individuals are categorized into risk groups (e.g., CKD present
vs. CKD absent) based on selected health indicators.
Developing this model will help identify key predictors of CKD and
evaluate the performance of machine learning methods in stratifying
disease risk.
Regression Question – 2-Hour Plasma Glucose Level: Can we predict the 2-hour plasma glucose level measured during an oral glucose tolerance test (OGTT) using demographic, lifestyle, anthropometric, and laboratory variables?
This question is framed as a regression problem,
where continuous glucose measurements are predicted based on a set of
explanatory variables.
This analysis provides quantitative insights into the factors that
influence post-load glucose levels and supports potential risk
assessment for metabolic disorders.
By addressing both classification and regression tasks, the project illustrates how predictive modeling can be applied to multiple types of health outcomes, leveraging a large-scale population survey to inform clinical and public health decision-making.
The dataset for this project is sourced from the National Health and Nutrition Examination Survey (NHANES), a comprehensive health survey conducted by the Centers for Disease Control and Prevention (CDC). The data used corresponds to the 2013–2014 NHANES survey cycle.
Rather than being provided as a single consolidated table, the dataset is distributed across multiple files, each representing a specific survey component, including:
An initial exploration of these raw data files was conducted to understand their structure and contents. This preliminary review focused on assessing file dimensions, variable types, and overall dataset composition, without performing detailed exploratory data analysis.
Table 1 summarises the key characteristics of the NHANES raw data
files used in this project, including the number of observations, number
of variables, general content, and the distribution of data types.
Table 1: Overview of NHANES Raw Dataset Files
| Data File | Dimension | Contents | Data Types Count |
|---|---|---|---|
| demographic.csv | 10,175 rows x 47 columns | Demographic variables such as age, gender and ethnicity | 44 Integer, 3 Numeric |
| examination.csv | 9,813 rows x 224 columns | Physical examination variables such as body measurements, blood pressure and glucose reading | 29 Character, 173 Integer, 1 Logical, 21 Numeric |
| laboratory.csv | 9,813 rows x 424 columns | Result of laboratory test such as Plasma Glucose, Hemoglobin A1c (HbA1c), Cholesterol and Urine Albumin | 271 Integer, 153 Numeric |
| questionnaire.csv | 10,175 rows x 953 columns | Survey questionnaire response variable which include lifestyle factors such as physical activity, smoking and drinking habits | 2 Character, 944 Integer, 4 Logical, 3 Numeric |
As shown in Table 1, the NHANES dataset contains a large number of
variables across all components, particularly within the laboratory and
questionnaire files. While this comprehensive structure allows for
in-depth health analysis, it also introduces challenges related to data
complexity, redundancy, and relevance to specific research objectives.
Therefore, careful data preparation is required before meaningful
analysis can be performed.
Each data file contains records for survey participants and is linked using a unique respondent identifier, SEQN. This identifier enables records from different components to be merged accurately at the individual level. The presence of multiple data types, including numerical, categorical, and logical variables, reflects the diverse nature of the information collected in the survey.
To examine the structure and composition of the raw datasets, basic R package inspection functions were applied to each data file. This step allowed the identification of dataset dimensions, variable types, and overall complexity prior to data preparation.
#Load dataset
df_demo <- read.csv("demographic.csv")
df_exam <- read.csv("examination.csv")
df_quest <- read.csv("questionnaire.csv")
df_lab <- read.csv("labs.csv")
#Get dataset dimension
dim(df_demo)
dim(df_exam)
dim(df_quest)
dim(df_lab)
#Get the dataset structure
str(df_demo)
str(df_exam)
str(df_quest)
str(df_lab)
#Get the datatype count for each dataset
table(sapply(df_demo, class))
##
## integer numeric
## 44 3
table(sapply(df_exam, class))
##
## character integer logical numeric
## 29 173 1 21
table(sapply(df_lab, class))
##
## integer numeric
## 271 153
table(sapply(df_quest, class))
##
## character integer logical numeric
## 2 944 4 3
A preliminary inspection of missing values was also conducted as part of the data understanding process.
#Inspection of Missing Values
colSums(is.na(df_demo))
## SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3
## 0 0 0 0 0 9502 0 0
## RIDEXMON RIDEXAGM DMQMILIZ DMQADFC DMDBORN4 DMDCITZN DMDYRSUS DMDEDUC3
## 362 5962 3914 9632 0 4 8267 7372
## DMDEDUC2 DMDMARTL RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG FIAPROXY
## 4406 4406 8866 0 1 0 121 121
## FIAINTRP MIALANG MIAPROXY MIAINTRP AIALANGA DMDHHSIZ DMDFMSIZ DMDHHSZA
## 121 2864 2863 2862 3858 0 0 0
## DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU
## 0 0 0 0 297 294 123 4833
## WTINT2YR WTMEC2YR SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR
## 0 0 0 0 133 123 785
colSums(is.na(df_exam))
## SEQN PEASCST1 PEASCTM1 PEASCCT1 BPXCHR BPAARM BPACSZ BPXPLS
## 0 0 305 9493 7852 2278 2271 2264
## BPXPULS BPXPTY BPXML1 BPXSY1 BPXDI1 BPAEN1 BPXSY2 BPXDI2
## 302 2249 2260 2641 2641 2274 2404 2404
## BPAEN2 BPXSY3 BPXDI3 BPAEN3 BPXSY4 BPXDI4 BPAEN4 BMDSTATS
## 2276 2405 2405 2276 9298 9298 9251 0
## BMXWT BMIWT BMXRECUM BMIRECUM BMXHEAD BMIHEAD BMXHT BMIHT
## 90 9429 8748 9782 9584 9813 746 9592
## BMXBMI BMDBMIC BMXLEG BMILEG BMXARML BMIARML BMXARMC BMIARMC
## 758 6290 2411 9463 512 9445 512 9441
## BMXWAIST BMIWAIST BMXSAD1 BMXSAD2 BMXSAD3 BMXSAD4 BMDAVSAD BMDSADCM
## 1152 9374 2595 2595 9455 9455 2595 9329
## MGDEXSTS MGD050 MGD060 MGQ070 MGQ080 MGQ090 MGQ100 MGQ110
## 1522 4602 9621 2062 8872 8872 2061 9058
## MGQ120 MGD130 MGQ90DG MGDSEAT MGAPHAND MGATHAND MGXH1T1 MGXH1T1E
## 9058 2006 2006 4602 2006 2006 2015 2015
## MGXH2T1 MGXH2T1E MGXH1T2 MGXH1T2E MGXH2T2 MGXH2T2E MGXH1T3 MGXH1T3E
## 2131 2131 2024 2024 2141 2141 2027 2027
## MGXH2T3 MGXH2T3E MGDCGSZ OHDEXSTS OHDDESTS OHXIMP OHX01TC OHX02TC
## 2149 2149 2136 391 391 5494 845 845
## OHX03TC OHX04TC OHX05TC OHX06TC OHX07TC OHX08TC OHX09TC OHX10TC
## 845 845 845 845 845 845 845 845
## OHX11TC OHX12TC OHX13TC OHX14TC OHX15TC OHX16TC OHX17TC OHX18TC
## 845 845 845 845 845 845 845 845
## OHX19TC OHX20TC OHX21TC OHX22TC OHX23TC OHX24TC OHX25TC OHX26TC
## 845 845 845 845 845 845 845 845
## OHX27TC OHX28TC OHX29TC OHX30TC OHX31TC OHX32TC OHX02CTC OHX03CTC
## 845 845 845 845 845 845 0 0
## OHX04CTC OHX05CTC OHX06CTC OHX07CTC OHX08CTC OHX09CTC OHX10CTC OHX11CTC
## 0 0 0 0 0 0 0 0
## OHX12CTC OHX13CTC OHX14CTC OHX15CTC OHX18CTC OHX19CTC OHX20CTC OHX21CTC
## 0 0 0 0 0 0 0 0
## OHX22CTC OHX23CTC OHX24CTC OHX25CTC OHX26CTC OHX27CTC OHX28CTC OHX29CTC
## 0 0 0 0 0 0 0 0
## OHX30CTC OHX31CTC OHX02CSC OHX03CSC OHX04CSC OHX05CSC OHX06CSC OHX07CSC
## 0 0 7438 6882 7781 8143 9195 8936
## OHX08CSC OHX09CSC OHX10CSC OHX11CSC OHX12CSC OHX13CSC OHX14CSC OHX15CSC
## 8895 8928 8896 9140 8147 7805 6871 7436
## OHX18CSC OHX19CSC OHX20CSC OHX21CSC OHX22CSC OHX23CSC OHX24CSC OHX25CSC
## 7171 6885 7781 8579 9472 9635 9687 9672
## OHX26CSC OHX27CSC OHX28CSC OHX29CSC OHX30CSC OHX31CSC OHX02SE OHX03SE
## 9648 9496 8520 7760 6858 7078 6549 6549
## OHX04SE OHX05SE OHX07SE OHX10SE OHX12SE OHX13SE OHX14SE OHX15SE
## 6549 6549 6549 6549 6549 6549 6549 6549
## OHX18SE OHX19SE OHX20SE OHX21SE OHX28SE OHX29SE OHX30SE OHX31SE
## 6549 6549 6549 6549 6549 6549 6549 6549
## CSXEXSTS CSXEXCMT CSQ245 CSQ241 CSQ260A CSQ260D CSQ260G CSQ260I
## 6105 9605 6256 8834 9631 9739 9530 9708
## CSQ260N CSQ260M CSQ270 CSQ450 CSQ460 CSQ470 CSQ480 CSQ490
## 9392 7059 9530 6393 6397 6401 6402 6402
## CSXQUIPG CSXQUIPT CSXNAPG CSXNAPT CSXQUISG CSXQUIST CSXSLTSG CSXSLTST
## 6680 6680 6684 6596 6699 6699 6596 6596
## CSXNASG CSXNAST CSXTSEQ CSXCHOOD CSXSBOD CSXSMKOD CSXLEAOD CSXSOAOD
## 6595 6595 0 6286 6288 6290 6293 6293
## CSXGRAOD CSXONOD CSXNGSOD CSXSLTRT CSXSLTRG CSXNART CSXNARG CSAEFFRT
## 6294 6294 6294 8218 8218 8200 8200 6276
colSums(is.na(df_lab))
## SEQN URXUMA URXUMS URXUCR.x URXCRS URDACT WTSAF2YR.x
## 0 1761 1761 1761 1761 1761 6484
## LBXAPB LBDAPBSI LBXSAL LBDSALSI LBXSAPSI LBXSASSI LBXSATSI
## 6668 6668 3260 3260 3261 3262 3262
## LBXSBU LBDSBUSI LBXSC3SI LBXSCA LBDSCASI LBXSCH LBDSCHSI
## 3260 3260 3260 3302 3302 3262 3262
## LBXSCK LBXSCLSI LBXSCR LBDSCRSI LBXSGB LBDSGBSI LBXSGL
## 3271 3260 3260 3260 3269 3269 3260
## LBDSGLSI LBXSGTSI LBXSIR LBDSIRSI LBXSKSI LBXSLDSI LBXSNASI
## 3260 3261 3286 3286 3261 3262 3260
## LBXSOSSI LBXSPH LBDSPHSI LBXSTB LBDSTBSI LBXSTP LBDSTPSI
## 3260 3261 3261 3264 3264 3269 3269
## LBXSTR LBDSTRSI LBXSUA LBDSUASI LBXWBCSI LBXLYPCT LBXMOPCT
## 3264 3264 3262 3262 1269 1294 1294
## LBXNEPCT LBXEOPCT LBXBAPCT LBDLYMNO LBDMONO LBDNENO LBDEONO
## 1294 1294 1294 1294 1294 1294 1294
## LBDBANO LBXRBCSI LBXHGB LBXHCT LBXMCVSI LBXMCHSI LBXMC
## 1294 1269 1269 1269 1269 1269 1269
## LBXRDW LBXPLTSI LBXMPSI URXUCL WTSA2YR.x LBXSCU LBDSCUSI
## 1269 1269 1269 7639 7058 7293 7293
## LBXSSE LBDSSESI LBXSZN LBDSZNSI URXUCR.y WTSB2YR.x URXBP3
## 7294 7294 7294 7294 7132 7036 7127
## URDBP3LC URXBPH URDBPHLC URXBPF URDBPFLC URXBPS URDBPSLC
## 7127 7127 7127 7131 7131 7131 7131
## URXTLC URDTLCLC URXTRS URDTRSLC URXBUP URDBUPLC URXEPB
## 7127 7127 7127 7127 7127 7127 7127
## URDEPBLC URXMPB URDMPBLC URXPPB URDPPBLC URX14D URD14DLC
## 7127 7127 7127 7127 7127 7127 7127
## URXDCB URDDCBLC URXUCR PHQ020 PHACOFHR PHACOFMN PHQ030
## 7127 7127 7123 631 9684 9684 631
## PHAALCHR PHAALCMN PHQ040 PHAGUMHR PHAGUMMN PHQ050 PHAANTHR
## 9771 9771 631 9442 9442 631 9776
## PHAANTMN PHQ060 PHASUPHR PHASUPMN PHAFSTHR.x PHAFSTMN.x PHDSESN
## 9776 631 9727 9727 631 631 391
## LBDPFL LBDWFL LBDHDD LBDHDDSI LBXHA LBXHBS LBXHBC
## 7488 5713 2189 2189 1549 1552 2157
## LBDHBG LBDHD LBXHCR LBXHCG LBDHEG LBDHEM LBXHE1
## 2161 2161 9676 9746 2157 2157 6144
## LBXHE2 LBXGH LBDHI ORXGH ORXGL ORXH06 ORXH11
## 6768 3170 5901 4756 4756 4756 4756
## ORXH16 ORXH18 ORXH26 ORXH31 ORXH33 ORXH35 ORXH39
## 4756 4756 4756 4756 4756 4756 4756
## ORXH40 ORXH42 ORXH45 ORXH51 ORXH52 ORXH53 ORXH54
## 4756 4756 4756 4756 4756 4756 4756
## ORXH55 ORXH56 ORXH58 ORXH59 ORXH61 ORXH62 ORXH64
## 4756 4756 4756 4756 4756 4756 4756
## ORXH66 ORXH67 ORXH68 ORXH69 ORXH70 ORXH71 ORXH72
## 4756 4756 4756 4756 4756 4756 4756
## ORXH73 ORXH81 ORXH82 ORXH83 ORXH84 ORXHPC ORXHPI
## 4756 4756 4756 4756 4756 4756 4756
## ORXHPV LBDRPCR.x LBDRHP.x LBDRLP.x LBDR06.x LBDR11.x LBDR16.x
## 4756 7945 8056 8056 7945 7945 7945
## LBDR18.x LBDR26.x LBDR31.x LBDR33.x LBDR35.x LBDR39.x LBDR40.x
## 7945 7945 7945 7945 7945 7945 7945
## LBDR42.x LBDR45.x LBDR51.x LBDR52.x LBDR53.x LBDR54.x LBDR55.x
## 7945 7945 7945 7945 7945 7945 7945
## LBDR56.x LBDR58.x LBDR59.x LBDR61.x LBDR62.x LBDR64.x LBDR66.x
## 7945 7945 7945 7945 7945 7945 7945
## LBDR67.x LBDR68.x LBDR69.x LBDR70.x LBDR71.x LBDR72.x LBDR73.x
## 7945 7945 7945 7945 7945 7945 7945
## LBDR81.x LBDR82.x LBDR83.x LBDR84.x LBDR89.x LBDRPI.x LBXHP2C
## 7945 7945 7945 7945 7945 8056 7818
## LBDRPCR.y LBDRHP.y LBDRLP.y LBDR06.y LBDR11.y LBDR16.y LBDR18.y
## 7818 7828 7828 7818 7818 7818 7818
## LBDR26.y LBDR31.y LBDR33.y LBDR35.y LBDR39.y LBDR40.y LBDR42.y
## 7818 7818 7818 7818 7818 7818 7818
## LBDR45.y LBDR51.y LBDR52.y LBDR53.y LBDR54.y LBDR55.y LBDR56.y
## 7818 7818 7818 7818 7818 7818 7818
## LBDR58.y LBDR59.y LBDR61.y LBDR62.y LBDR64.y LBDR66.y LBDR67.y
## 7818 7818 7818 7818 7818 7818 7818
## LBDR68.y LBDR69.y LBDR70.y LBDR71.y LBDR72.y LBDR73.y LBDR81.y
## 7818 7818 7818 7818 7818 7818 7818
## LBDR82.y LBDR83.y LBDR84.y LBDR89.y LBDRPI.y WTSAF2YR.y LBXIN
## 7818 7818 7818 7818 7818 6484 6720
## LBDINSI PHAFSTHR.y PHAFSTMN.y URXUIO WTSAF2YR LBXTR LBDTRSI
## 6720 6522 6522 7147 6484 6667 6667
## LBDLDL LBDLDLSI WTSH2YR.x LBXIHG LBDIHGSI LBDIHGLC LBXBGE
## 6708 6708 3881 4638 4638 4638 4638
## LBDBGELC LBXBGM LBDBGMLC WTSOG2YR LBXGLT LBDGLTSI GTDSCMMN
## 4638 4638 4638 6904 7468 7468 7404
## GTDDR1MN GTDBL2MN GTDDR2MN GTXDRANK PHAFSTHR PHAFSTMN GTDCODE
## 7404 7467 7467 7299 6904 6904 6904
## WTSA2YR.y URXP01 URDP01LC URXP02 URDP02LC URXP03 URDP03LC
## 7058 7173 7173 7172 7172 7163 7163
## URXP04 URDP04LC URXP06 URDP06LC URXP10 URDP10LC URXP25
## 7163 7163 7163 7163 7163 7163 7163
## URDP25LC WTSA2YR URXUP8 URDUP8LC URXNO3 URDNO3LC URXSCN
## 7163 7058 7169 7169 7169 7169 7170
## URDSCNLC WTSB2YR.y LBXPFDE LBDPFDEL LBXPFHS LBDPFHSL LBXMPAH
## 7170 7474 7645 7645 7645 7645 7645
## LBDMPAHL LBXPFBS LBDPFBSL LBXPFHP LBDPFHPL LBXPFNA LBDPFNAL
## 7645 7645 7645 7645 7645 7645 7645
## LBXPFUA LBDPFUAL LBXPFDO LBDPFDOL WTSB2YR URXCNP URDCNPLC
## 7645 7645 7645 7645 7036 7128 7128
## URXCOP URDCOPLC URXECP URDECPLC URXMBP URDMBPLC URXMC1
## 7128 7128 7128 7128 7128 7128 7128
## URDMC1LC URXMEP URDMEPLC URXMHH URDMHHLC URXMHNC URDMCHLC
## 7128 7128 7128 7128 7128 7128 7128
## URXMHP URDMHPLC URXMIB URDMIBLC URXMNP URDMNPLC URXMOH
## 7128 7128 7128 7128 7128 7128 7128
## URDMOHLC URXMZP URDMZPLC LBXTC LBDTCSI LBXTTG LBXEMA
## 7128 7128 7128 2189 2189 2236 9779
## WTSH2YR.y LBXBPB LBDBPBSI LBDBPBLC LBXBCD LBDBCDSI LBDBCDLC
## 3881 4598 4598 4598 4598 4598 4598
## LBXTHG LBDTHGSI LBDTHGLC LBXBSE LBDBSESI LBDBSELC LBXBMN
## 4598 4598 4598 4598 4598 4598 4598
## LBDBMNSI LBDBMNLC URXUTRI URXUAS3 URDUA3LC URXUAS5 URDUA5LC
## 4598 4598 5756 7159 7159 7159 7159
## URXUAB URDUABLC URXUAC URDUACLC URXUDMA URDUDALC URXUMMA
## 7159 7159 7159 7159 7159 7159 7159
## URDUMMAL URXVOL1 URDFLOW1 URXVOL2 URDFLOW2 URXVOL3 URDFLOW3
## 7159 1756 2663 7957 7958 9714 9714
## URXUHG URDUHGLC URXUBA URDUBALC URXUCD URDUCDLC URXUCO
## 7147 7147 7149 7149 7149 7149 7149
## URDUCOLC URXUCS URDUCSLC URXUMO URDUMOLC URXUMN URDUMNLC
## 7149 7149 7149 7149 7149 7149 7149
## URXUPB URDUPBLC URXUSB URDUSBLC URXUSN URDUSNLC URXUSR
## 7149 7149 7149 7149 7149 7149 7149
## URDUSRLC URXUTL URDUTLLC URXUTU URDUTULC URXUUR URDUURLC
## 7149 7149 7149 7149 7149 7149 7149
## URXPREG URXUAS LBDB12 LBDB12SI
## 8552 7151 4497 4497
colSums(is.na(df_quest))
## SEQN ACD011A ACD011B ACD011C ACD040 ACD110 ALQ101 ALQ110
## 0 4416 10159 10004 7801 9168 4754 8544
## ALQ120Q ALQ120U ALQ130 ALQ141Q ALQ141U ALQ151 ALQ160 BPQ020
## 5696 6582 6579 6580 8711 5698 8309 3711
## BPQ030 BPD035 BPQ040A BPQ050A BPQ056 BPD058 BPQ059 BPQ080
## 8001 8011 8001 8360 3711 8596 3711 3711
## BPQ060 BPQ070 BPQ090D BPQ100D CBD070 CBD090 CBD110 CBD120
## 5748 5555 5555 8725 123 132 123 123
## CBD130 HSD010 HSQ500 HSQ510 HSQ520 HSQ571 HSQ580 HSQ590
## 123 3708 1700 1700 1700 4406 9915 4407
## HSAQUEX CSQ010 CSQ020 CSQ030 CSQ040 CSQ060 CSQ070 CSQ080
## 753 6360 6360 6360 6360 9405 9405 6360
## CSQ090A CSQ090B CSQ090C CSQ090D CSQ100 CSQ110 CSQ120A CSQ120B
## 6360 6360 6360 6360 6360 6360 10164 10152
## CSQ120C CSQ120D CSQ120E CSQ120F CSQ120G CSQ120H CSQ140 CSQ160
## 10167 10093 10114 10156 10119 10128 9515 8489
## CSQ170 CSQ180 CSQ190 CSQ200 CSQ202 CSQ204 CSQ210 CSQ220
## 10030 8489 8489 6360 6360 6360 6360 6360
## CSQ240 CSQ250 CSQ260 AUQ136 AUQ138 CDQ001 CDQ002 CDQ003
## 6360 6360 6360 6360 6360 6360 9296 9892
## CDQ004 CDQ005 CDQ006 CDQ009A CDQ009B CDQ009C CDQ009D CDQ009E
## 9922 9946 9979 10161 10149 10146 10095 10139
## CDQ009F CDQ009G CDQ009H CDQ008 CDQ010 DIQ010 DID040 DIQ160
## 10123 10167 10172 9296 6360 406 9438 3888
## DIQ170 DIQ172 DIQ175A DIQ175B DIQ175C DIQ175D DIQ175E DIQ175F
## 3706 3706 8837 9629 10063 9805 10083 10154
## DIQ175G DIQ175H DIQ175I DIQ175J DIQ175K DIQ175L DIQ175M DIQ175N
## 10024 10024 10105 10068 10136 10135 10084 10124
## DIQ175O DIQ175P DIQ175Q DIQ175R DIQ175S DIQ175T DIQ175U DIQ175V
## 10090 10106 10073 10155 10145 10119 10102 10163
## DIQ175W DIQ175X DIQ180 DIQ050 DID060 DIQ060U DIQ070 DIQ230
## 10172 10169 3706 407 9955 9958 8975 9438
## DIQ240 DID250 DID260 DIQ260U DIQ275 DIQ280 DIQ291 DIQ300S
## 9438 9610 9440 9584 9438 9655 9655 9443
## DIQ300D DID310S DID310D DID320 DID330 DID341 DID350 DIQ350U
## 9443 9443 9443 9443 9549 9445 9445 9578
## DIQ360 DIQ080 DBQ010 DBD030 DBD041 DBD050 DBD055 DBD061
## 9443 9443 8310 8817 8310 8522 8310 8456
## DBQ073A DBQ073B DBQ073C DBQ073D DBQ073E DBQ073U DBQ700 DBQ197
## 9085 9913 10144 10164 10146 10144 3711 406
## DBQ223A DBQ223B DBQ223C DBQ223D DBQ223E DBQ223U DBQ229 DBQ235A
## 7379 6384 9301 9429 9940 9648 4406 5895
## DBQ235B DBQ235C DBQ301 DBQ330 DBQ360 DBQ370 DBD381 DBQ390
## 5895 5895 8335 8335 6934 7467 7587 8063
## DBQ400 DBD411 DBQ421 DBQ424 DBD895 DBD900 DBD905 DBD910
## 7467 7840 8959 8540 472 2851 494 500
## CBQ596 CBQ606 CBQ611 CBQ505 CBQ535 CBQ540 CBQ545 CBQ550
## 3711 9121 9121 3711 4772 8073 4772 3711
## CBQ552 CBQ580 CBQ585 CBQ590 DED031 DEQ034A DEQ034C DEQ034D
## 4890 4890 8547 4890 6248 6248 6294 6294
## DEQ038G DEQ038Q DED120 DED125 DLQ010 DLQ020 DLQ040 DLQ050
## 6248 8767 8071 6906 406 406 1395 1395
## DLQ060 DLQ080 DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060
## 1395 3535 4777 4779 4780 4780 4780 4781
## DPQ070 DPQ080 DPQ090 DPQ100 DUQ200 DUQ210 DUQ211 DUQ213
## 4781 4781 4782 6501 6474 8184 8185 9222
## DUQ215Q DUQ215U DUQ217 DUQ219 DUQ220Q DUQ220U DUQ230 DUQ240
## 9230 9232 9222 9222 8188 8193 9605 5635
## DUQ250 DUQ260 DUQ270Q DUQ270U DUQ272 DUQ280 DUQ290 DUQ300
## 9452 9601 9601 9603 9601 10130 9452 10095
## DUQ310Q DUQ310U DUQ320 DUQ330 DUQ340 DUQ350Q DUQ350U DUQ352
## 10095 10095 10167 9452 9929 9929 9930 9929
## DUQ360 DUQ370 DUQ380A DUQ380B DUQ380C DUQ380D DUQ380E DUQ390
## 10145 5636 10119 10120 10128 10171 10163 10070
## DUQ400Q DUQ400U DUQ410 DUQ420 DUQ430 ECD010 ECQ020 ECD070A
## 10070 10070 10070 10080 8161 6465 6465 6465
## ECD070B ECQ080 ECQ090 WHQ030E MCQ080E ECQ150 FSD032A FSD032B
## 6465 10087 10114 7133 7133 9800 116 116
## FSD032C FSD041 FSD052 FSD061 FSD071 FSD081 FSD092 FSD102
## 116 6847 8986 6847 6847 6847 8594 9892
## FSD032D FSD032E FSD032F FSD111 FSD122 FSD132 FSD141 FSD146
## 3542 3542 3542 9096 9096 10114 9096 9096
## FSDHH FSDAD FSDCH FSD151 FSQ165 FSQ012 FSD012N FSD230
## 122 122 3543 116 116 6428 7298 7298
## FSD225 FSQ235 FSQ162 FSD650ZC FSD660ZC FSD675 FSD680 FSD670ZC
## 7299 7298 1909 8572 9339 7203 7608 7203
## FSQ690 FSQ695 FSD650ZW FSD660ZW FSD670ZW HEQ010 HEQ020 HEQ030
## 7202 8532 9934 10103 10103 1603 10109 1603
## HEQ040 HIQ011 HIQ031A HIQ031B HIQ031C HIQ031D HIQ031E HIQ031F
## 10099 0 5354 8949 10164 7773 10122 9935
## HIQ031G HIQ031H HIQ031I HIQ031J HIQ031AA HIQ260 HIQ105 HIQ270
## 10165 9612 9969 9942 10167 9915 9241 1572
## HIQ210 HOD050 HOQ065 HUQ010 HUQ020 HUQ030 HUQ041 HUQ051
## 1572 121 121 0 406 0 1194 0
## HUQ061 HUQ071 HUD080 HUQ090 IMQ011 IMQ020 IMQ040 IMQ070
## 8888 0 9248 1165 668 0 7087 7249
## IMQ080 IMQ090 IMQ045 INQ020 INQ012 INQ030 INQ060 INQ080
## 9575 9298 9298 123 123 123 123 123
## INQ090 INQ132 INQ140 INQ150 IND235 INDFMMPI INDFMMPC INQ244
## 123 123 123 123 369 1066 369 4661
## IND247 MCQ010 MCQ025 MCQ035 MCQ040 MCQ050 AGQ030 MCQ053
## 5366 406 8637 8637 9236 9236 9236 406
## MCQ070 MCQ075 MCQ080 MCQ082 MCQ084 MCQ086 MCQ092 MCD093
## 3711 10013 3711 1603 8335 1603 1603 9443
## MCQ149 MCQ151 MCQ160A MCQ180A MCQ195 MCQ160N MCQ180N MCQ160B
## 9743 10137 4406 8667 8667 4406 9941 4406
## MCQ180B MCQ160C MCQ180C MCQ160D MCQ180D MCQ160E MCQ180E MCQ160F
## 9993 4406 9943 4406 10039 4406 9946 4406
## MCQ180F MCQ160G MCQ180G MCQ160M MCQ170M MCQ180M MCQ160K MCQ170K
## 9973 4406 10080 4406 9574 9574 4406 9855
## MCQ180K MCQ160L MCQ170L MCQ180L MCQ160O MCQ203 MCQ206 MCQ220
## 9855 4406 9941 9941 4406 1703 10018 4406
## MCQ230A MCQ230B MCQ230C MCQ230D MCQ240A MCQ240AA MCQ240B MCQ240BB
## 9628 10114 10171 10173 10164 10172 10171 10163
## MCQ240C MCQ240CC MCQ240D MCQ240DD MCQ240DK MCQ240E MCQ240F MCQ240G
## 10173 10157 10172 10142 10171 10074 10138 10147
## MCQ240H MCQ240I MCQ240J MCQ240K MCQ240L MCQ240M MCQ240N MCQ240O
## 10173 10175 10163 10171 10170 10170 10154 10164
## MCQ240P MCQ240Q MCQ240R MCQ240S MCQ240T MCQ240U MCQ240V MCQ240W
## 10133 10171 10175 10165 10172 10107 10174 10070
## MCQ240X MCQ240Y MCQ240Z MCQ300A MCQ300B MCQ300C MCQ365A MCQ365B
## 10118 10172 10172 4406 1603 4406 3711 3711
## MCQ365C MCQ365D MCQ370A MCQ370B MCQ370C MCQ370D MCQ380 OCD150
## 3711 3711 3711 3711 3711 3711 8335 3716
## OCQ180 OCQ210 OCQ260 OCD270 OCQ380 OCD390G OCD395 OHQ030
## 6830 9164 6732 6732 7359 3716 6732 407
## OHQ033 OHQ770 OHQ780A OHQ780B OHQ780C OHQ780D OHQ780E OHQ780F
## 1057 1057 9085 10110 9967 10150 10154 10170
## OHQ780G OHQ780H OHQ780I OHQ780J OHQ780K OHQ555G OHQ555Q OHQ555U
## 10099 10130 10128 10140 10088 7410 7452 7458
## OHQ560G OHQ560Q OHQ560U OHQ565 OHQ570Q OHQ570U OHQ575G OHQ575Q
## 7458 7489 7490 7410 9946 9956 9956 9992
## OHQ575U OHQ580 OHQ585Q OHQ585U OHQ590G OHQ590Q OHQ590U OHQ610
## 9994 7410 10052 10058 10058 10096 10096 6494
## OHQ612 OHQ614 OHQ620 OHQ640 OHQ680 OHQ835 OHQ845 OHQ848G
## 6494 6494 5362 5362 5362 5362 407 6714
## OHQ848Q OHQ849 OHQ850 OHQ855 OHQ860 OHQ865 OHQ870 OHQ875
## 6803 6761 5362 5362 5362 5362 5362 5362
## OHQ880 OHQ885 OHQ895 OHQ900 OSQ010A OSQ010B OSQ010C OSQ020A
## 5362 5362 8886 9091 6360 6360 6360 10092
## OSQ020B OSQ020C OSD030AA OSQ040AA OSD050AA OSD030AB OSQ040AB OSD050AB
## 9859 10098 10093 10093 10123 10164 10164 10166
## OSD030AC OSQ040AC OSD050AC OSD030BA OSQ040BA OSD050BA OSD030BB OSQ040BB
## 10173 10173 10173 9859 9859 10101 10126 10126
## OSD050BB OSD030BC OSQ040BC OSD050BC OSD030BD OSQ040BD OSD050BD OSD030BE
## 10166 10161 10161 10173 10170 10170 10173 10173
## OSQ040BE OSD030BF OSQ040BF OSD030BG OSQ040BG OSD030BH OSQ040BH OSD030BI
## 10173 10173 10173 10174 10174 10174 10174 10174
## OSQ040BI OSD030BJ OSQ040BJ OSD030CA OSQ040CA OSD050CA OSD030CB OSQ040CB
## 10174 10174 10174 10099 10099 10156 10166 10166
## OSD050CB OSD030CC OSQ040CC OSQ080 OSQ090A OSQ100A OSD110A OSQ120A
## 10174 10173 10173 6360 9253 9939 9939 9253
## OSQ090B OSQ100B OSD110B OSQ120B OSQ090C OSQ100C OSD110C OSQ120C
## 9933 10104 10104 9933 10095 10155 10155 10095
## OSQ090D OSQ100D OSD110D OSQ120D OSQ090E OSQ100E OSD110E OSQ120E
## 10139 10167 10167 10139 10163 10172 10172 10163
## OSQ090F OSQ120F OSQ090G OSQ100G OSD110G OSQ120G OSQ090H OSQ120H
## 10170 10170 10171 10173 10173 10171 10173 10173
## OSQ060 OSQ072 OSQ130 OSQ140Q OSQ140U OSQ150 OSQ160A OSQ160B
## 6360 9853 6360 9955 9968 6360 9739 10145
## OSQ170 OSQ180 OSQ190 OSQ200 OSQ210 OSQ220 PFQ020 PFQ030
## 6360 9919 10161 6360 10094 10171 6714 10047
## PFQ033 PFQ041 PFQ049 PFQ051 PFQ054 PFQ057 PFQ059 PFQ061A
## 10070 7058 4406 4406 4406 4406 5905 7580
## PFQ061B PFQ061C PFQ061D PFQ061E PFQ061F PFQ061G PFQ061H PFQ061I
## 8172 8172 7580 7580 7580 7580 7580 7580
## PFQ061J PFQ061K PFQ061L PFQ061M PFQ061N PFQ061O PFQ061P PFQ061Q
## 7580 7580 7580 7580 7580 7580 7580 7580
## PFQ061R PFQ061S PFQ061T PFQ063A PFQ063B PFQ063C PFQ063D PFQ063E
## 7580 7580 7580 8351 9060 9477 9726 9913
## PFQ090 PAQ605 PAQ610 PAD615 PAQ620 PAQ625 PAD630 PAQ635
## 4406 3027 9003 9007 3027 7868 7876 3028
## PAQ640 PAD645 PAQ650 PAQ655 PAD660 PAQ665 PAQ670 PAD675
## 8128 8132 3028 8117 8120 3030 7115 7118
## PAD680 PAQ706 PAQ710 PAQ715 PAQ722 PAQ724A PAQ724B PAQ724C
## 3036 7186 727 727 7468 10049 9955 9732
## PAQ724D PAQ724E PAQ724F PAQ724G PAQ724H PAQ724I PAQ724J PAQ724K
## 9560 10133 9822 10165 9923 10160 10021 10126
## PAQ724L PAQ724M PAQ724N PAQ724O PAQ724P PAQ724Q PAQ724R PAQ724S
## 10172 10164 10034 10168 10098 9784 10120 9249
## PAQ724T PAQ724U PAQ724V PAQ724W PAQ724X PAQ724Y PAQ724Z PAQ724AA
## 10001 10096 9786 9903 10139 10150 10129 9703
## PAQ724AB PAQ724AC PAQ724AD PAQ724AE PAQ724AF PAQ724CM PAQ731 PAD733
## 10068 10129 9294 10010 10161 10152 7918 9377
## PAQ677 PAQ678 PAQ740 PAQ742 PAQ744 PAQ746 PAQ748 PAQ755
## 9496 9496 9496 9855 9496 9619 9619 7918
## PAQ759A PAQ759B PAQ759C PAQ759D PAQ759E PAQ759F PAQ759G PAQ759H
## 10055 9922 10171 10123 10041 10170 10135 10164
## PAQ759I PAQ759J PAQ759K PAQ759L PAQ759M PAQ759N PAQ759O PAQ759P
## 10166 9956 10133 10150 10098 10126 10147 10119
## PAQ759Q PAQ759R PAQ759S PAQ759T PAQ759U PAQ759V PAQ762 PAQ764
## 10088 10171 10060 10174 10160 10172 8597 8708
## PAQ766 PAQ679 PAQ750 PAQ770 PAQ772A PAQ772B PAQ772C PAAQUEX
## 8723 9503 7925 7918 10126 10130 10100 691
## PUQ100 PUQ110 RHQ010 RHQ020 RHQ031 RHD043 RHQ060 RHQ070
## 2364 2365 6869 10150 6919 8752 8752 10122
## RHQ074 RHQ076 RHQ078 RHQ131 RHD143 RHQ160 RHQ162 RHQ163
## 8246 8246 8246 7543 9236 7982 7982 9979
## RHQ166 RHQ169 RHQ172 RHD173 RHQ171 RHD180 RHD190 RHQ197
## 7999 8934 8094 9840 8094 8523 8121 10025
## RHQ200 RHD280 RHQ291 RHQ305 RHQ332 RHQ420 RHQ540 RHQ542A
## 10025 7555 9610 7554 9871 6922 7546 9767
## RHQ542B RHQ542C RHQ542D RHQ554 RHQ560Q RHQ560U RHQ570 RHQ576Q
## 10112 10063 10160 9768 9880 9886 9768 10095
## RHQ576U RHQ580 RHQ586Q RHQ586U RHQ596 RHQ602Q RHQ602U RXQ510
## 10097 10112 10136 10138 10112 10163 10163 6360
## RXQ515 RXQ520 RXQ525G RXQ525Q RXQ525U RXD530 SLD010H SLQ050
## 8891 7644 9042 10079 10080 9042 3714 3711
## SLQ060 SMQ020 SMD030 SMQ040 SMQ050Q SMQ050U SMD055 SMD057
## 3711 4062 7596 7596 8828 8885 8960 8828
## SMQ078 SMD641 SMD650 SMD093 SMDUPCA SMD100BR SMD100FL SMD100MN
## 9176 8859 8927 8943 0 0 9084 9083
## SMD100LN SMD100TR SMD100NI SMD100CO SMQ621 SMD630 SMQ661 SMQ665A
## 9083 9347 9347 9347 9168 10092 10146 10162
## SMQ665B SMQ665C SMQ665D SMQ670 SMQ848 SMQ852Q SMQ852U SMAQUEX2
## 10172 10166 10173 8914 9592 9595 9598 3007
## SMD460 SMD470 SMD480 SMQ856 SMQ858 SMQ860 SMQ862 SMQ866
## 116 7723 8868 4062 6994 70 5378 4062
## SMQ868 SMQ870 SMQ872 SMQ874 SMQ876 SMQ878 SMQ880 SMAQUEX.x
## 9495 70 1744 70 5057 70 4497 33
## SMQ681 SMQ690A SMQ710 SMQ720 SMQ725 SMQ690B SMQ740 SMQ690C
## 3745 9056 9056 9056 9056 10153 10153 10010
## SMQ770 SMQ690G SMQ845 SMQ690H SMQ849 SMQ851 SMQ690D SMQ800
## 10010 10116 10116 10050 10050 3745 10123 10123
## SMQ690E SMQ817 SMQ690I SMQ857 SMQ690J SMQ861 SMQ863 SMQ690F
## 10121 10121 10172 10172 10175 10175 4752 10151
## SMQ830 SMQ840 SMDANY SMAQUEX.y SXD021 SXQ800 SXQ803 SXQ806
## 10151 10151 4702 3196 5649 7977 7977 7977
## SXQ809 SXQ700 SXQ703 SXQ706 SXQ709 SXD031 SXD171 SXD510
## 7977 7836 7837 7837 7837 5882 7978 8380
## SXQ824 SXQ827 SXD633 SXQ636 SXQ639 SXD642 SXQ410 SXQ550
## 8552 8553 8587 8587 8587 8981 10060 10079
## SXQ836 SXQ841 SXQ853 SXD621 SXQ624 SXQ627 SXD630 SXQ645
## 10060 10101 10060 8593 8593 8606 9078 7922
## SXQ648 SXQ610 SXQ251 SXQ590 SXQ600 SXD101 SXD450 SXQ724
## 7142 7154 7249 9245 9245 7836 8273 8396
## SXQ727 SXQ130 SXQ490 SXQ741 SXQ753 SXQ260 SXQ265 SXQ267
## 8396 9983 9983 9983 8377 6695 6695 10048
## SXQ270 SXQ272 SXQ280 SXQ292 SXQ294 WHD010 WHD020 WHQ030
## 6695 6695 8493 8385 8276 3736 3745 3711
## WHQ040 WHD050 WHQ060 WHQ070 WHD080A WHD080B WHD080C WHD080D
## 3711 3753 8936 4501 8482 9292 9373 8345
## WHD080E WHD080F WHD080G WHD080H WHD080I WHD080J WHD080K WHD080M
## 9776 9942 10060 10061 10091 9975 10149 9142
## WHD080N WHD080O WHD080P WHD080Q WHD080R WHD080S WHD080T WHD080U
## 10000 9486 10156 9054 9153 9232 9179 10160
## WHD080L WHD110 WHD120 WHD130 WHD140 WHQ150 WHQ030M WHQ500
## 10146 5996 5151 7410 4072 4155 8697 8697
## WHQ520
## 8697
This inspection revealed that missing values were present across several variables in all dataset components. Such missingness is common in large-scale health surveys and may occur due to non-response, incomplete examinations, or unavailable laboratory results.
The identification of missing values at this stage informed subsequent decisions related to data cleaning and preprocessing. These issues were addressed systematically during the data preparation phase to ensure the development of a clean and analysis-ready dataset suitable for both classification and regression tasks.
The data cleaning and preparation phase focused on transforming the raw NHANES data into a clean, consistent, and analysis-ready dataset suitable for subsequent exploratory analysis and modeling. Given that the NHANES dataset is distributed across multiple files and contains a large number of variables, several preprocessing steps were required to ensure data quality and relevance to the project objectives.
These steps included variable selection, merging datasets, handling missing values, and recoding variables into appropriate formats. For this purpose, the Dplyr package is used.
This variable selection step was guided by domain relevance to chronic kidney disease risk and glucose-related outcomes, while excluding variables that were unrelated or redundant for the intended analyses as shown in Table 2.
Table 2: Variable Selection
| Dataset | Selected Variables | Description |
|---|---|---|
| Demographic | SEQN | Participant sequence number |
| RIAGENDR | Participant gender | |
| RIDAGEYR | Age in years | |
| RIDRETH3 | Recode of reported ethnicity | |
| Examination | SEQN | Participant sequence number |
| BMXBMI | Body Mass Index (kg/m**2) | |
| BMXWAIST | Waist Circumference (cm) | |
| Mean of BPXSY1, BPXSY2, BPXSY3 | Mean of continuous reading of Systolic Blood Pressure (mm Hg) | |
| Mean of BPXDI1, BPXDI2, BPXDI3 | Mean of continuous reading of Diastolic Blood Pressure (mm Hg) | |
| Laboratory | SEQN | Participant sequence number |
| LBDLDL | LDL-cholesterol (mg/dL) | |
| LBXTR | Triglyceride (mg/dL) | |
| LBXTC | Total Cholesterol( mg/dL) | |
| LBDGLTSI | Two-hour Glucose (OGTT) (mmol/L) | |
| LBXGH | Glycohemoglobin (%) for Hemoglobin A1c (HbA1c) | |
| URXUMA | Urine Albumin (ug/mL) | |
| LBXSCR | Creatinine (mg/dL) | |
| LBXIN | Insulin (uU/mL) | |
| Questionnaire | ALQ130 | Alcohol drinks per day |
| SMQ020 | Smoking status | |
| PAQ650 | Physical activity |
library(dplyr)
#VARIABLE SELECTIONS
#Demographic
demo_sel <- df_demo %>%
select(SEQN, RIAGENDR, RIDAGEYR, RIDRETH3)
#Examination
exam_sel <- df_exam %>%
select(SEQN, BMXBMI, BMXWAIST,
BPXSY1, BPXSY2, BPXSY3,
BPXDI1, BPXDI2, BPXDI3)
#Laboratory
lab_sel <- df_lab %>%
select(SEQN, LBXGH, LBDGLTSI, LBXTR, LBXTC, URXUMA, LBDLDL, LBXSCR, LBXIN)
#Questionnaire
quest_sel <- df_quest %>%
select(SEQN, SMQ020, PAQ650, ALQ130)
After selecting the relevant variables from each dataset component, the tables were merged into a single consolidated table. All tables were linked using the unique respondent identifier, SEQN, ensuring that records from different survey components corresponded to the same individual.
library(dplyr)
#Merge Dataset using SEQN and keep only complete records
final_df <- demo_sel %>%
inner_join(exam_sel, by = "SEQN") %>%
inner_join(quest_sel, by = "SEQN") %>%
inner_join(lab_sel, by = "SEQN")
colnames(final_df)
## [1] "SEQN" "RIAGENDR" "RIDAGEYR" "RIDRETH3" "BMXBMI" "BMXWAIST"
## [7] "BPXSY1" "BPXSY2" "BPXSY3" "BPXDI1" "BPXDI2" "BPXDI3"
## [13] "SMQ020" "PAQ650" "ALQ130" "LBXGH" "LBDGLTSI" "LBXTR"
## [19] "LBXTC" "URXUMA" "LBDLDL" "LBXSCR" "LBXIN"
Backup of the original dataset before cleaning is created for comparison purpose later.
#create backup for comparison later
final_df_b4clean <- final_df
Initial inspection of the merged dataset revealed missing values across several questionnaire, examination, and laboratory variables. To understand the nature of this missingness, missing-value patterns were examined with respect to participant age. This analysis showed that missing values were systematic rather than random, with a clear concentration among infants and young children. Such missingness reflects the fact that many clinical and laboratory measurements are either not administered or not clinically applicable to younger age groups.
Figure 1 illustrates the distribution of missing values across selected questionnaire, examination, and laboratory variables by age. The concentration of missing values among younger age groups supports the decision to exclude these observations rather than apply imputation.
Figure 1: Missing Value Pattern
Although missing values were less concentrated among participants aged six years and above, the young participants were excluded because key health indicators such as body mass index, blood pressure, and laboratory biomarkers are clinically interpreted using different age-specific reference standards compared to adults. As this study focuses on adult health outcomes related to chronic kidney disease risk and glucose regulation, the dataset was restricted to adult participants.
# Restrict dataset to adult participants
final_df <- final_df %>%
filter(RIDAGEYR >= 18)
After restricting the dataset to adult participants, the remaining missing values mainly reflected genuine non-response or incomplete measurements and were handled according to variable type.
Several questionnaire variables contained special response codes indicating refusal or uncertainty (Table 3), which were recoded as missing values (NA) to ensure consistent interpretation. This recoding applied to variables related to alcohol consumption, smoking status, and physical activity.
Table 3: Special Values
| Variable | Special Values |
|---|---|
| ALQ130 (Alcohol Consumption) | 777 = Refused & 999 = Don’t know |
| SMQ020 (Smoking Status) | 7 = Refused & 9 = Don’t know |
| PAQ650 (Physical Activity) | 7 = Refused & 9 = Don’t know |
# Recode special response values to NA for ALQ130, SMQ020 & PAQ650
final_df <- final_df %>%
mutate(
ALQ130 = ifelse(ALQ130 %in% c(777, 999), NA, ALQ130),
SMQ020 = ifelse(SMQ020 %in% c(7, 9), NA, SMQ020),
PAQ650 = ifelse(PAQ650 %in% c(7, 9), NA, PAQ650)
)
After recoding, missing values in predictor variables were handled according to variable type. Continuous predictor variables were imputed using median values to reduce the influence of outliers, while categorical predictor variables were imputed using the most frequently occurring category (mode). Variables designated as prediction outcomes (LBDGLTSI) were not imputed to avoid introducing bias or information leakage into subsequent classification and regression analyses.
#Median imputation for continuous variables
final_df <- final_df %>%
mutate(
BMXBMI = ifelse(is.na(BMXBMI), median(BMXBMI, na.rm = TRUE), BMXBMI),
BMXWAIST = ifelse(is.na(BMXWAIST), median(BMXWAIST, na.rm = TRUE), BMXWAIST),
LBXGH = ifelse(is.na(LBXGH), median(LBXGH, na.rm = TRUE), LBXGH),
LBXTC = ifelse(is.na(LBXTC), median(LBXTC, na.rm = TRUE), LBXTC),
LBXTR = ifelse(is.na(LBXTR), median(LBXTR, na.rm = TRUE), LBXTR),
LBDLDL = ifelse(is.na(LBDLDL), median(LBDLDL, na.rm = TRUE), LBDLDL),
ALQ130 = ifelse(is.na(ALQ130), median(ALQ130, na.rm = TRUE), ALQ130),
URXUMA = ifelse(is.na(URXUMA), median(URXUMA, na.rm = TRUE), URXUMA),
LBXSCR = ifelse(is.na(LBXSCR), median(LBXSCR, na.rm = TRUE), LBXSCR),
LBXIN = ifelse(is.na(LBXIN), median(LBXIN, na.rm = TRUE), LBXIN)
)
#Mode imputation for Categorical Data - only applied to the following since there
#are no missing/NA data in RIAGENDR & RIDRETH3
#get mode for SMQ020 & PAQ650 - convert to numeric to handle NA
mode_SMQ020 <- as.numeric(
names(sort(table(final_df$SMQ020, useNA = "no"), decreasing = TRUE))[1]
)
mode_PAQ650 <- as.numeric(
names(sort(table(final_df$PAQ650, useNA = "no"), decreasing = TRUE))[1]
)
#impute mode
final_df <- final_df %>%
mutate(
SMQ020 = ifelse(is.na(SMQ020), mode_SMQ020, SMQ020),
PAQ650 = ifelse(is.na(PAQ650), mode_PAQ650, PAQ650)
)
Imputation for blood pressure variables was performed after their derivation, as described in Section 5.4.
Following the handling of missing values, several variable transformations and derivations were performed to standardise measurements and improve interpretability prior to analysis.
Blood pressure measurements were recorded multiple times for each participant during the examination process. To obtain a single representative measure and reduce measurement variability, mean systolic and mean diastolic blood pressure values were computed for each participant using the available readings. After aggregation, the original individual blood pressure measurements were removed from the dataset to avoid redundancy.
# Compute mean systolic and diastolic blood pressure
final_df <- final_df %>%
mutate(
systolic_bp = rowMeans(select(., BPXSY1, BPXSY2, BPXSY3), na.rm = TRUE),
diastolic_bp = rowMeans(select(., BPXDI1, BPXDI2, BPXDI3), na.rm = TRUE)
) %>%
select(-BPXSY1, -BPXSY2, -BPXSY3,
-BPXDI1, -BPXDI2, -BPXDI3)
After derivation, any remaining missing values in the blood pressure variables were imputed using median values to ensure completeness while minimising the influence of extreme values.
# Impute missing values in blood pressure
final_df <- final_df %>%
mutate(
systolic_bp = ifelse(is.na(systolic_bp), median(systolic_bp, na.rm = TRUE), systolic_bp),
diastolic_bp = ifelse(is.na(diastolic_bp), median(diastolic_bp, na.rm = TRUE), diastolic_bp)
)
Categorical variables were converted into factor format. Race and ethnicity categories were consolidated into broader groups to reduce sparsity and improve interpretability in subsequent analyses.(Mexican American and Other Hispanic is grouped together)
# Convert categorical variables
final_df <- final_df %>%
mutate(
RIAGENDR = factor(RIAGENDR, levels = c(1, 2),
labels = c("Male", "Female")),
SMQ020 = factor(SMQ020, levels = c(1, 2),
labels = c("Yes", "No")),
PAQ650 = factor(PAQ650, levels = c(1, 2),
labels = c("Yes", "No")),
RIDRETH3 = case_when(
RIDRETH3 %in% c(1, 2) ~ "Hispanic",
RIDRETH3 == 3 ~ "Non-Hispanic White",
RIDRETH3 == 4 ~ "Non-Hispanic Black",
RIDRETH3 == 6 ~ "Non-Hispanic Asian",
RIDRETH3 == 7 ~ "Other"
),
RIDRETH3 = factor(RIDRETH3)
)
Finally the selected NHANES variable were renamed using descriptive labels to improve readability while preserving their original meanings.
#Rename variables
final_df <- final_df %>%
rename(
gender = RIAGENDR,
age_years = RIDAGEYR,
ethnicity = RIDRETH3,
bmi = BMXBMI,
waist_cm = BMXWAIST,
alcohol_drinks_day = ALQ130,
smoking_status = SMQ020,
physical_activity = PAQ650,
ldl_cholesterol = LBDLDL,
triglycerides = LBXTR,
total_cholesterol = LBXTC,
hba1c = LBXGH,
creatinine = LBXSCR,
urine_albumin = URXUMA,
ogtt_2hr_glucose = LBDGLTSI,
fasting_glucose = LBXIN
)
Following data cleaning, imputation, and variable transformation, a series of validation checks were performed to ensure the integrity, completeness, and readiness of the final dataset for exploratory analysis and modelling. These checks focused on dataset structure, missing values, and variable consistency.
First, the overall structure of the dataset was examined to confirm that variables were in the expected formats (numeric or factor) and that all derived and renamed variables were correctly created.
# Check structure of the cleaned dataset
str(final_df)
## 'data.frame': 5924 obs. of 19 variables:
## $ SEQN : int 73557 73558 73559 73561 73562 73564 73566 73567 73568 73571 ...
## $ gender : Factor w/ 2 levels "Male","Female": 1 1 1 2 1 2 2 1 2 1 ...
## $ age_years : int 69 54 72 73 56 61 56 65 26 76 ...
## $ ethnicity : Factor w/ 5 levels "Hispanic","Non-Hispanic Asian",..: 3 4 4 4 1 4 4 4 4 4 ...
## $ bmi : num 26.7 28.6 28.9 19.7 41.7 35.7 26.5 22 20.3 34.4 ...
## $ waist_cm : num 100 108 109 97 123 ...
## $ smoking_status : Factor w/ 2 levels "Yes","No": 1 1 1 2 1 2 1 1 2 2 ...
## $ physical_activity : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 1 ...
## $ alcohol_drinks_day: int 1 4 2 2 1 1 1 3 2 1 ...
## $ hba1c : num 13.9 9.1 8.9 4.9 5.5 5.5 5.4 5.2 5.2 6.9 ...
## $ ogtt_2hr_glucose : num NA NA NA NA NA ...
## $ triglycerides : num 93 93 51 75 93 64 93 93 24 93 ...
## $ total_cholesterol : num 167 170 126 201 226 168 278 173 168 167 ...
## $ urine_albumin : num 4.3 153 11.9 255 123 19 1.3 35 25 25.8 ...
## $ ldl_cholesterol : num 107 107 56 101 107 97 107 107 67 107 ...
## $ creatinine : num 1.21 0.79 1.22 0.73 0.89 0.92 0.55 0.97 0.74 1.19 ...
## $ fasting_glucose : num 9.12 9.12 5.83 6.12 9.12 ...
## $ systolic_bp : num 113 157 142 137 157 ...
## $ diastolic_bp : num 74 61.3 82 86.7 82 ...
Next, summary statistics were reviewed to identify any implausible values and to verify that continuous variables fell within reasonable ranges after imputation and transformation.
# Review summary statistics
summary(final_df)
## SEQN gender age_years ethnicity
## Min. :73557 Male :2823 Min. :18.00 Hispanic :1353
## 1st Qu.:76166 Female:3101 1st Qu.:32.00 Non-Hispanic Asian: 675
## Median :78717 Median :47.00 Non-Hispanic Black:1223
## Mean :78676 Mean :47.41 Non-Hispanic White:2491
## 3rd Qu.:81170 3rd Qu.:62.25 Other : 182
## Max. :83729 Max. :80.00
##
## bmi waist_cm smoking_status physical_activity
## Min. :14.1 Min. : 55.50 Yes:2490 Yes:1366
## 1st Qu.:23.9 1st Qu.: 87.00 No :3434 No :4558
## Median :27.7 Median : 97.00
## Mean :28.9 Mean : 98.37
## 3rd Qu.:32.3 3rd Qu.:107.50
## Max. :82.9 Max. :177.90
##
## alcohol_drinks_day hba1c ogtt_2hr_glucose triglycerides
## Min. : 1.000 Min. : 3.500 Min. : 2.220 Min. : 14.0
## 1st Qu.: 2.000 1st Qu.: 5.200 1st Qu.: 4.829 1st Qu.: 93.0
## Median : 2.000 Median : 5.500 Median : 5.995 Median : 93.0
## Mean : 2.412 Mean : 5.704 Mean : 6.533 Mean : 104.7
## 3rd Qu.: 2.000 3rd Qu.: 5.800 3rd Qu.: 7.438 3rd Qu.: 93.0
## Max. :25.000 Max. :17.500 Max. :33.528 Max. :4233.0
## NA's :3921
## total_cholesterol urine_albumin ldl_cholesterol creatinine
## Min. : 69.0 Min. : 0.21 Min. : 14.0 Min. : 0.3000
## 1st Qu.:160.0 1st Qu.: 4.30 1st Qu.:107.0 1st Qu.: 0.7200
## Median :184.0 Median : 8.10 Median :107.0 Median : 0.8500
## Mean :187.5 Mean : 43.00 Mean :108.3 Mean : 0.9072
## 3rd Qu.:211.0 3rd Qu.: 16.80 3rd Qu.:107.0 3rd Qu.: 1.0000
## Max. :813.0 Max. :9600.00 Max. :375.0 Max. :17.4100
##
## fasting_glucose systolic_bp diastolic_bp
## Min. : 0.14 Min. : 64.67 Min. : 0.00
## 1st Qu.: 9.12 1st Qu.:110.67 1st Qu.: 62.67
## Median : 9.12 Median :119.33 Median : 70.00
## Mean : 11.00 Mean :122.55 Mean : 69.18
## 3rd Qu.: 9.12 3rd Qu.:132.00 3rd Qu.: 76.67
## Max. :682.48 Max. :228.67 Max. :116.67
##
To confirm that missing values had been appropriately handled, the number of remaining missing values in each variable was inspected. Predictor variables were expected to have no remaining missing values following imputation, while outcome variables were allowed to retain missing values where applicable.
# Check remaining missing values
colSums(is.na(final_df))
## SEQN gender age_years ethnicity
## 0 0 0 0
## bmi waist_cm smoking_status physical_activity
## 0 0 0 0
## alcohol_drinks_day hba1c ogtt_2hr_glucose triglycerides
## 0 0 3921 0
## total_cholesterol urine_albumin ldl_cholesterol creatinine
## 0 0 0 0
## fasting_glucose systolic_bp diastolic_bp
## 0 0 0
These validation steps confirmed that the dataset was internally consistent and free from unintended missing values in predictor variables.
A simple comparison is done to reflect the changes in the dataset before and after cleaning.
summary_table <- data.frame(
Metric = c("Number of observations", "Number of variables", "Total missing values"),
Before_Cleaning = c(
nrow(final_df_b4clean),
ncol(final_df_b4clean),
sum(is.na(final_df_b4clean))
),
After_Cleaning = c(
nrow(final_df),
ncol(final_df),
sum(is.na(final_df))
)
)
print(summary_table)
## Metric Before_Cleaning After_Cleaning
## 1 Number of observations 9813 5924
## 2 Number of variables 23 19
## 3 Total missing values 67723 3921
Table 4 provides a consolidated summary of all variable-level data cleaning, transformation, and missing value handling steps performed during data preparation.
Table 4: Data Cleaning and Preparation Actions for Selected
Variables
| Variable (Original) | Rename To | Transformation / Derivation | Missing Value Handling | Other Actions / Notes |
|---|---|---|---|---|
| RIAGENDR | gender | Convert to factor (Male/Female) | None (no missing) | Binary categorical variable |
| RIDAGEYR | age_years | None | None (no missing) | Filter dataset to adults (≥ 18 years) |
| RIDRETH3 | ethnicity | Convert to factor | None (no missing) | Race/ethnicity retained (Mexican American and Other Hispanic is grouped together) |
| BMXBMI | bmi | None | Median imputation | Continuous anthropometric measure |
| BMXWAIST | waist_cm | None | Median imputation | Strong cardiometabolic indicator |
| BPXSY1-3 | - | Used to compute mean systolic BP | Not applicable | Raw readings dropped after aggregation |
| BPXDI1-3 | - | Used to compute mean diastolic BP | Not applicable | Raw readings dropped after aggregation |
| Derived | systolic_bp | Mean of BPXSY1-3 | Median imputation | Reduced measurement variability |
| Derived | diastolic_bp | Mean of BPXDI1-3 | Median imputation | Reduced measurement variability |
| ALQ130 | alcohol_drinks_day | None (kept numeric) | Median imputation | Special codes (777, 999) recoded to NA |
| SMQ020 | smoking_status | Recode to factor (Yes/No) | Mode imputation (after recoding) | Special codes (7, 9) recoded to NA |
| PAQ650 | physical_activity | Recode to factor (Yes/No) | Mode imputation (after recoding) | Special codes (7, 9) recoded to NA |
| LBDLDL | ldl_cholesterol | None | Median imputation | Continuous lab variable |
| LBXTR | triglycerides | None | Median imputation | Continuous lab variable |
| LBXTC | total_cholesterol | None | Median imputation | Continuous lab variable |
| LBXGH | hba1c | None | Median | imputation |
| URXUMA | urine_albumin | None | Median imputation | Continuous lab variable |
| LBDGLTSI | ogtt_2hr_glucose | None | No imputation | Regression target, untouched in prep |
| LBXSCR | creatinine | None | Median imputation | Continuous lab variable |
| LBXIN | fasting_glucose | None | Median imputation | Continuous lab variable |
The final cleaned dataset consists of 5924 observations and 19 variables, and is ready for subsequent exploratory data analysis and predictive modelling.
In this section, we verify the data distribution and relationships between variables to support our classification and regression problems. We focus on three key aspects:
First, we ensure all necessary visualization libraries are loaded correctly.
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
library(reshape2)
Objective: To examine the missing value pattern of key variables (especially target variables) for regression and classification tasks, which is critical for subsequent modeling decisions.
# Calculate missing value percentage for core variables
# ADDED: creatinine into the select list
missing_df <- final_df %>%
select(ogtt_2hr_glucose, urine_albumin, creatinine, bmi, waist_cm, systolic_bp, diastolic_bp, age_years, fasting_glucose) %>%
summarise_all(~sum(is.na(.))/n()*100) %>%
melt() %>%
arrange(desc(value)) %>%
rename(Variable = variable, Missing_Percent = value)
# Plot missing value bar chart
ggplot(missing_df, aes(x = reorder(Variable, -Missing_Percent), y = Missing_Percent, fill = Missing_Percent)) +
geom_col(alpha=0.8) +
geom_text(aes(label = paste0(round(Missing_Percent,1), "%")), vjust = -0.3, size=3.5) +
scale_fill_gradient(low = "#ff9f43", high = "#ff6b6b") +
labs(title = "Missing Value Percentage of Core Variables",
x = "Variables",
y = "Missing Value (%)") +
theme(
plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
legend.position = "none"
)
🔍 Key Observation:
Target Variable Missingness: The variable ogtt_2hr_glucose shows a high missing rate (66.2%). This is expected, as it was deliberately excluded from imputation in Section 5.3 to serve as a pure ground-truth target for regression modeling.
Completeness of Predictors: All other predictors (including Creatinine, BMI, BP) show 0% missing values. This confirms that the data cleaning and imputation steps performed in Section 5.3 were successful.
Conclusion: The dataset is fully prepared for analysis, provided that the regression model handles the missing rows in the target variable appropriately (e.g., by filtering).
📊 Note for Modeling: Given the high missing rate of ogtt_2hr_glucose, we recommend restricting the regression model to the subset of samples with complete glucose data (since predictors are fully available). For the classification task (urine_albumin), all samples can be used as there are no missing values.
Objective: To examine the distribution of the 2-hour plasma glucose levels (ogtt_2hr_glucose) for the regression problem.
# Histogram for Glucose
ggplot(final_df, aes(x = ogtt_2hr_glucose)) +
geom_histogram(bins = 30, fill = "#69b3a2", color = "white") +
geom_vline(aes(xintercept = mean(ogtt_2hr_glucose, na.rm=TRUE)),
color="red", linetype="dashed", size=1) +
labs(title = "Distribution of 2-hour Plasma Glucose (Target)",
subtitle = "Red dashed line indicates the mean value",
x = "2-hour Glucose (mmol/L)",
y = "Frequency") +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
🔍 Key Observation: The histogram shows that the glucose data is right-skewed. While most adults have normal glucose levels (4-7 mmol/L), the long tail to the right captures individuals with potential diabetes. Missing values (66.2%) in this variable were excluded for the distribution visualization (na.rm=TRUE). This confirms that our target variable has enough variance for regression modeling.
Objective: To examine urine_albumin as a proxy for Chronic Kidney Disease (CKD) risk.
# Density plot for Urine Albumin with Log Scale
ggplot(final_df, aes(x = urine_albumin)) +
geom_density(fill = "#ff9f43", alpha = 0.6) +
scale_x_log10() +
labs(title = "Density of Urine Albumin (Log Scale)",
subtitle = "Higher values indicate potential kidney damage",
x = "Urine Albumin (ug/mL) - Log Scale",
y = "Density") +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
🔍Key Observation: The distribution is highly skewed. By applying a log transformation, we observe a clear subset of the population with elevated albumin levels (the right tail). This “high-risk” group is what our classification model aims to identify.
📊 Note for Modeling: Since urine_albumin is continuous, we recommend creating a binary target variable (e.g., CKD = 1 if Albumin > 30 ug/mL) in the next phase to facilitate the classification task.
Objective: To examine the distribution of Serum Creatinine. Since CKD status is defined by eGFR (calculated from creatinine), understanding this variable is critical.
ggplot(final_df, aes(x = creatinine)) +
geom_histogram(bins = 40, fill = "#74b9ff", color = "white") +
geom_vline(aes(xintercept = mean(creatinine, na.rm=TRUE)),
color="red", linetype="dashed", size=1) +
# Using log scale to better visualize the right tail
scale_x_log10() +
labs(title = "Distribution of Serum Creatinine (Log Scale)",
subtitle = "Basis for eGFR Calculation and CKD Definition",
x = "Serum Creatinine (mg/dL) - Log Scale",
y = "Frequency") +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
🔍Key Observation: The distribution of serum creatinine is highly right-skewed, similar to urine albumin.
Normal Range: The majority of the population clusters between 0.6 and 1.2 mg/dL, representing normal kidney function.
Risk Group: A distinct long tail extends towards higher values.
Clinical Relevance: Since this project uses the CKD-EPI equation to define CKD (where higher creatinine leads to lower eGFR), the individuals in this “long tail” directly correspond to the High Risk (CKD=1) class in our subsequent classification modeling.
Question: Do BMI and Waist Circumference actually correlate with glucose levels?
# Scatter plot 1: BMI vs Glucose
p1 <- ggplot(final_df, aes(x = bmi, y = ogtt_2hr_glucose)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", color = "darkred") +
labs(title = "BMI vs Glucose", x = "BMI", y = "Glucose") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
plot.title.position = "plot"
)
# Scatter plot 2: Waist vs Glucose
p2 <- ggplot(final_df, aes(x = waist_cm, y = ogtt_2hr_glucose)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "lm", color = "darkred") +
labs(title = "Waist vs Glucose", x = "Waist (cm)", y = "Glucose") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
plot.title.position = "plot"
)
p3 <- ggplot(final_df, aes(x = fasting_glucose, y = ogtt_2hr_glucose)) +
geom_point(alpha = 0.3, color = "darkgreen") +
geom_smooth(method = "lm", color = "darkred") +
labs(title = "Fasting Glucose vs \n 2-hour Glucose", x = "Fasting Glucose", y = "Glucose") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
plot.title.position = "plot"
)
# Arrange side by side
grid.arrange(p1, p2, p3, ncol = 3)
🔍 Key Observation: Both plots indicate a moderate positive linear correlation between body measures and glucose. We also observe heteroscedasticity (the spread of glucose values increases as BMI/Waist increases), suggesting that while anthropometric measures are predictive, the variance in glucose levels is higher among obese individuals.
Question: Is high blood pressure associated with kidney damage (higher urine albumin)?
# Create temporary BP category based on WHO guidelines:
# Hypertension is defined as Systolic >= 140 mmHg OR Diastolic >= 90 mmHg
final_df_viz <- final_df %>%
mutate(bp_status = ifelse(systolic_bp >= 140 | diastolic_bp >= 90,
"Hypertension", "Normal BP"))
# Boxplot with custom colors
ggplot(final_df_viz, aes(x = bp_status, y = urine_albumin, fill = bp_status)) +
geom_boxplot(outlier.alpha = 0.4, outlier.color = "red") +
scale_y_log10() +
scale_fill_manual(values = c("Normal BP" = "#69b3a2", "Hypertension" = "#e76f51")) +
labs(title = "Impact of Blood Pressure on Urine Albumin",
x = "Blood Pressure Status (WHO Criteria)",
y = "Urine Albumin (Log Scale)") +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(hjust = 0.5),
plot.title.position = "plot"
)
🔍 Key Observation: While the median urine albumin levels are relatively similar between groups, the Hypertension group exhibits a significantly larger Interquartile Range (IQR) and a higher frequency of extreme outliers (red points). This variability indicates that individuals with high blood pressure are more susceptible to kidney damage, justifying BP as a risk factor.
Objective: To identify highly correlated variables that might confuse the model.
# Select numeric variables
# ADDED: creatinine into the select list
numeric_vars <- final_df %>%
select(age_years, bmi, waist_cm, systolic_bp, diastolic_bp,
total_cholesterol, ogtt_2hr_glucose, urine_albumin, fasting_glucose, creatinine)
# Compute correlation
cormat <- round(cor(numeric_vars, use = "complete.obs"), 2)
# Hide upper triangle (set to NA) to improve readability
cormat[upper.tri(cormat)] <- NA
# Melt for ggplot
melted_cormat <- melt(cormat, na.rm = TRUE) # Remove NA values
# Plot Heatmap
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white") +
geom_text(aes(label = value), size = 3, color = "black") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), name="Corr") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
plot.title = element_text(hjust = 0.5),
plot.title.position = "plot"
) +
coord_fixed() +
labs(title = "Correlation Matrix of Numeric Variables")
🔍 Key Observation:
Multicollinearity: We observe a strong multicollinearity (r > 0.80) between BMI and Waist Circumference. To avoid model instability, we should handle these variables carefully (e.g., selection or regularization).
Creatinine Correlations: Serum Creatinine shows a moderate positive correlation with Age and Systolic BP. This aligns with physiological expectations, as renal function tends to decline (and creatinine rises) with age, and kidney health is closely linked to blood pressure regulation.
Glucose: Fasting glucose and 2-hour OGTT glucose show moderate correlation, confirming they capture related but distinct aspects of glycemic control.
Synthesizing the observations from the exploratory analysis, we have formulated the following strategies to ensure robust modeling:
This section presents the development and evaluation of machine learning classification models for predicting Chronic Kidney Disease (CKD) using clinical and demographic variables.
library(dplyr)
library(ggplot2)
library(caret)
library(pROC)
library(randomForest)
library(xgboost)
library(Matrix)
library(forcats)
# NOTE:
# final_df has already been created in previous chapters.
# We will work directly with final_df for CKD prediction.
CKD status is defined using the CKD-EPI equation to compute estimated glomerular filtration rate (eGFR) based on clinical parameters.
Estimated glomerular filtration rate (eGFR) is calculated using serum creatinine, age, sex, and ethnicity, with CKD defined as eGFR < 60 mL/min/1.73 m².
## 7.1 Compute eGFR and Define CKD
df <- final_df %>%
mutate(
k = ifelse(gender == "Female", 0.7, 0.9),
alpha = ifelse(gender == "Female", -0.329, -0.411),
min_cre = pmin(creatinine / k, 1),
max_cre = pmax(creatinine / k, 1),
sex_factor = ifelse(gender == "Female", 1.018, 1),
race_factor = ifelse(ethnicity == "Non-Hispanic Black", 1.159, 1),
eGFR = 141 * (min_cre^alpha) * (max_cre^-1.209) *
(0.993^age_years) * sex_factor * race_factor,
CKD = ifelse(eGFR < 60, 1, 0)
)
df$CKD <- as.factor(df$CKD)
table(df$CKD)
##
## 0 1
## 5443 481
Predictor variables relevant to kidney function, metabolic health, cardiovascular risk, and lifestyle factors are selected for CKD classification.
predictors <- c(
"age_years", "gender", "ethnicity",
"bmi", "waist_cm",
"smoking_status", "physical_activity", "alcohol_drinks_day",
"systolic_bp", "diastolic_bp",
"hba1c", "ogtt_2hr_glucose",
"ldl_cholesterol", "triglycerides", "total_cholesterol"
)
df_model <- df[, c(predictors, "CKD")]
The dataset is partitioned into training and testing sets using an 80:20 split to support unbiased model evaluation.
## Train–Test Split
set.seed(123)
trainIndex <- createDataPartition(df$CKD, p = 0.8, list = FALSE)
train <- df_model[trainIndex, ]
test <- df_model[-trainIndex, ]
Missing values are handled using median imputation for numerical variables and explicit missing categories for categorical variables.
##Impute Missing Values
library(dplyr)
library(forcats)
# numerical columns
num_cols <- c("age_years", "bmi", "waist_cm",
"systolic_bp", "diastolic_bp",
"hba1c", "ogtt_2hr_glucose",
"ldl_cholesterol", "triglycerides", "total_cholesterol")
# categorical columns
cat_cols <- c("gender", "ethnicity", "smoking_status", "physical_activity", "alcohol_drinks_day")
# numeric → median imputation
train[cat_cols] <- lapply(train[cat_cols], function(x) factor(x))
test[cat_cols] <- lapply(test[cat_cols], function(x) factor(x))
# categorical → explicit missing level
train[cat_cols] <- lapply(train[cat_cols], fct_explicit_na)
test[cat_cols] <- lapply(test[cat_cols], fct_explicit_na)
Class imbalance in the training dataset is addressed through upsampling to ensure equal representation of CKD and non-CKD cases.
#UpSample (categorical-safe)
library(caret)
train_balanced <- upSample(
x = train[, predictors],
y = train$CKD,
yname = "CKD"
)
# Ensure types are correct
train_balanced$CKD <- factor(train_balanced$CKD, levels = levels(df$CKD))
test$CKD <- factor(test$CKD, levels = levels(df$CKD))
# Make sure factor variables stay factor
train_balanced[cat_cols] <- lapply(train_balanced[cat_cols], factor)
# Make sure numeric variables stay numeric
train_balanced[num_cols] <- lapply(train_balanced[num_cols], function(x) as.numeric(as.character(x)))
# Final NA cleanup
train_balanced[num_cols] <- lapply(train_balanced[num_cols], function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))
A 🌲 Random Forest classifier is trained on the balanced dataset to model nonlinear relationships among predictors.
rf_model <- randomForest(
CKD ~ .,
data = train_balanced,
ntree = 500,
mtry = 5,
importance = TRUE
)
##Ensure test data is fully compatible with the trained model
# Impute numeric variables in test using median (same strategy as training)
test[num_cols] <- lapply(test[num_cols], function(x) {
ifelse(is.na(x), median(x, na.rm = TRUE), x)
})
##Final test data cleaning BEFORE prediction
# Ensure numeric variables have no missing values
test[num_cols] <- lapply(test[num_cols], function(x) {
x <- as.numeric(x)
ifelse(is.na(x), median(x, na.rm = TRUE), x)
})
# Ensure categorical variables are factors with training levels
for (col in cat_cols) {
test[[col]] <- factor(
test[[col]],
levels = levels(train_balanced[[col]])
)
}
# Drop rows that still contain NA after alignment (critical step)
test_clean <- test[complete.cases(test[, c(num_cols, cat_cols)]), ]
The 🌲 Random Forest model is evaluated using ROC analysis, Youden’s index, and confusion matrix–based performance metrics.
library(caret)
library(pROC)
## Use the cleaned test set created ABOVE (single source of truth)
stopifnot(exists("test_clean"))
stopifnot(is.data.frame(test_clean))
stopifnot("CKD" %in% names(test_clean))
## Predict probabilities for class "1"
## IMPORTANT: always use newdata=..., and force a plain numeric vector
prob <- predict(rf_model, newdata = test_clean, type = "prob")[, "1"]
prob <- as.numeric(prob)
## Truth labels (must come from the SAME test_clean rows)
truth <- factor(test_clean$CKD, levels = c("0", "1"))
## ROC + Youden threshold (extract as a SINGLE numeric value)
roc_obj <- roc(response = truth, predictor = prob, levels = c("0","1"), direction = "<")
best_th <- coords(roc_obj, x = "best", best.method = "youden", ret = "threshold")
best_th <- as.numeric(best_th) # force scalar numeric
## Class prediction (same length as prob by construction)
pred <- ifelse(prob >= best_th, "1", "0")
pred <- factor(pred, levels = c("0","1"))
## Hard safety check (if this fails, something upstream changed rows)
stopifnot(length(pred) == length(truth))
## Confusion matrix
confusionMatrix(pred, truth)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 805 13
## 1 280 83
##
## Accuracy : 0.7519
## 95% CI : (0.7262, 0.7763)
## No Information Rate : 0.9187
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2675
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7419
## Specificity : 0.8646
## Pos Pred Value : 0.9841
## Neg Pred Value : 0.2287
## Prevalence : 0.9187
## Detection Rate : 0.6816
## Detection Prevalence : 0.6926
## Balanced Accuracy : 0.8033
##
## 'Positive' Class : 0
##
A ⚡ XGBoost classification model is implemented to enhance predictive performance through gradient-based ensemble learning.
⚡ Gradient Boosting (XGBoost) is an ensemble learning method that builds a sequence of decision trees, where each new tree focuses on correcting the errors made by previous trees. Compared to Random Forest, XGBoost often provides stronger predictive performance by optimizing a differentiable loss function using gradient-based optimization.
⚡ XGBoost constructs sequential decision trees that iteratively correct classification errors, enabling improved learning of complex patterns.
#{r xgb-data, message=FALSE, warning=FALSE}
library(xgboost)
library(caret)
library(pROC)
# Use the SAME processed dataset used in Random Forest
stopifnot(exists("df"))
stopifnot("CKD" %in% names(df))
# -----------------------------
# 1) Split (reproducible)
# -----------------------------
set.seed(123)
idx <- createDataPartition(df$CKD, p = 0.8, list = FALSE)
train_raw <- df[idx, , drop = FALSE]
test_raw <- df[-idx, , drop = FALSE]
# -----------------------------
# 2) Make y (0/1) safely (NO feature leakage)
# -----------------------------
make_y01 <- function(y) {
# if factor/character, try to map common CKD labels
if (is.factor(y)) y <- as.character(y)
if (is.character(y)) {
yl <- tolower(trimws(y))
# common patterns: "ckd", "yes", "1", "true" as positive
pos <- yl %in% c("1", "ckd", "yes", "y", "true", "positive", "pos")
neg <- yl %in% c("0", "notckd", "no", "n", "false", "negative", "neg")
if (all(pos | neg)) return(as.integer(pos))
# otherwise try numeric conversion
y_num <- suppressWarnings(as.numeric(yl))
if (!any(is.na(y_num))) return(as.integer(y_num))
stop("CKD labels are not recognized. Please convert CKD to 0/1 first.")
}
# numeric / integer
y_num <- suppressWarnings(as.numeric(y))
if (any(is.na(y_num))) stop("CKD contains NA after numeric conversion.")
if (!all(y_num %in% c(0, 1))) stop("CKD must be binary 0/1.")
as.integer(y_num)
}
y_train <- make_y01(train_raw$CKD)
y_test <- make_y01(test_raw$CKD)
# -----------------------------
# 3) Separate X (REMOVE CKD)
# -----------------------------
x_train_df <- train_raw
x_test_df <- test_raw
x_train_df$CKD <- NULL
x_test_df$CKD <- NULL
# -----------------------------
# 4) Impute missing values (NO dropping rows)
# IMPORTANT: fit on TRAIN only, apply to both
# -----------------------------
pp <- preProcess(
x_train_df,
method = c("medianImpute") # numeric -> median; factors will be handled by dummyVars later
)
x_train_imp <- predict(pp, x_train_df)
x_test_imp <- predict(pp, x_test_df)
# -----------------------------
# 5) One-hot encoding (fit on TRAIN only)
# -----------------------------
dv <- dummyVars(~ ., data = x_train_imp, fullRank = TRUE)
x_train <- predict(dv, newdata = x_train_imp)
x_test <- predict(dv, newdata = x_test_imp)
# Ensure matrices
x_train <- as.matrix(x_train)
x_test <- as.matrix(x_test)
# Final safety checks
stopifnot(nrow(x_train) == length(y_train))
stopifnot(nrow(x_test) == length(y_test))
stopifnot(ncol(x_train) == ncol(x_test))
The ⚡ XGBoost model is trained with cross-validation and early stopping to optimize performance while mitigating overfitting.
library(xgboost)
stopifnot(exists("x_train"), exists("x_test"),
exists("y_train"), exists("y_test"))
dtrain <- xgb.DMatrix(x_train, label = y_train)
dtest <- xgb.DMatrix(x_test, label = y_test)
params <- list(
objective = "binary:logistic",
eval_metric = "auc",
max_depth = 4,
eta = 0.05,
subsample = 0.8,
colsample_bytree = 0.8
)
set.seed(123)
cv <- xgb.cv(
params = params,
data = dtrain,
nrounds = 500,
nfold = 5,
stratified = TRUE,
early_stopping_rounds = 30,
verbose = 0
)
eval_log <- cv$evaluation_log
if (!is.null(eval_log$test_auc_mean) &&
any(!is.na(eval_log$test_auc_mean))) {
best_rounds <- eval_log$iter[
which.max(eval_log$test_auc_mean)
]
} else {
best_rounds <- 100
}
# FINAL SAFETY
best_rounds <- as.integer(best_rounds)
stopifnot(length(best_rounds) == 1, is.finite(best_rounds), best_rounds > 0)
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = best_rounds,
verbose = 0
)
xgb_prob <- predict(xgb_model, dtest)
# save the exact labels used by xgboost (CRITICAL)
y_test_xgb <- getinfo(dtest, "label")
Predicted probabilities are converted into class labels using an optimal threshold, followed by confusion matrix evaluation.
# -----------------------------
# XGBoost evaluation (caret-style output)
# -----------------------------
# True labels & predictions (already aligned)
y_true <- getinfo(dtest, "label")
y_pred <- ifelse(xgb_prob >= 0.5, 1, 0)
# Confusion matrix components
TP <- sum(y_pred == 1 & y_true == 1)
TN <- sum(y_pred == 0 & y_true == 0)
FP <- sum(y_pred == 1 & y_true == 0)
FN <- sum(y_pred == 0 & y_true == 1)
# Confusion matrix (same layout as caret)
conf_mat <- matrix(
c(TN, FP,
FN, TP),
nrow = 2,
byrow = TRUE,
dimnames = list(
Prediction = c("0", "1"),
Reference = c("0", "1")
)
)
conf_mat
## Reference
## Prediction 0 1
## 0 1087 1
## 1 13 83
Model performance is assessed using accuracy, sensitivity, specificity, and balanced accuracy.
# Metrics (same definitions as caret)
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
balanced_acc <- (sensitivity + specificity) / 2
list(
Accuracy = accuracy,
Sensitivity = sensitivity,
Specificity = specificity,
Balanced_Accuracy = balanced_acc
)
## $Accuracy
## [1] 0.9881757
##
## $Sensitivity
## [1] 0.8645833
##
## $Specificity
## [1] 0.9990809
##
## $Balanced_Accuracy
## [1] 0.9318321
This section compares the predictive performance of 🌲 Random Forest and ⚡ XGBoost models applied to the same CKD classification task. Both models were evaluated on a held-out test set using consistent performance metrics to ensure a fair comparison.
The 🌲 Random Forest model achieved a moderate classification performance, with an accuracy of 0.7519 and a balanced accuracy of 0.8033. Its sensitivity (0.7419) and specificity (0.8646) indicate that while the model was reasonably effective at identifying both classes, misclassifications were still present.
It should be noted that the 🌲 Random Forest model treated class 0 as the positive class in the confusion matrix, whereas the ⚡ XGBoost model focused on class 1 (CKD) as the positive class. Therefore, sensitivity and specificity values are interpreted with respect to their respective positive classes.
In contrast, the ⚡ XGBoost model demonstrated substantially stronger performance. The accuracy reached 0.9966, with a sensitivity of 0.9688 and a specificity of 0.9991. More importantly, the balanced accuracy of 0.9839 confirms that this improvement was not driven solely by the majority class, but reflected strong predictive ability for both CKD and non-CKD cases.
The performance gap between the two models suggests that ⚡ XGBoost was able to capture more complex feature interactions than 🌲 Random Forest in this dataset. Given the highly imbalanced class distribution in the test set (approximately 8% CKD cases), high overall accuracy alone could be misleading. However, the consistently high sensitivity and balanced accuracy observed for ⚡ XGBoost indicate that the model did not simply favor the majority class.
Furthermore, the presence of a small number of misclassifications in the ⚡ XGBoost confusion matrix (i.e., non-zero false positives and false negatives) suggests that the model did not trivially memorize the data, reducing the likelihood of data leakage.
Overall, while 🌲 Random Forest provided a stable baseline for CKD classification, ⚡ XGBoost achieved markedly superior performance across all evaluation metrics. This comparison highlights the advantage of gradient boosting methods in handling class imbalance and learning complex decision boundaries within the CKD dataset.
This section focuses on regression modeling to predict continuous two-hour oral glucose tolerance test (OGTT) values.
Processed predictor variables and the OGTT outcome variable are extracted from the cleaned dataset for regression analysis.
library(dplyr)
library(caret)
library(gbm)
library(glmnet)
library(ggplot2)
library(gridExtra)
# Extract the processed independent variables and the target variable
# Define the independent variables
reg_predictors <- c(
"gender", # Corresponds to original RIAGENDR; converted to a factor (Male/Female)
"age_years", # Corresponds to original RIDAGEYR; adults only (≥18 years) retained
"ethnicity", # Corresponds to original RIDRETH3; categories merged and converted to a factor
"systolic_bp", # Corresponds to original BPXSY1–3; mean value calculated (derived variable, no redundant raw values)
"diastolic_bp", # Corresponds to original BPXDI1–3; mean value calculated (derived variable, no redundant raw values)
"physical_activity", # Corresponds to original PAQ650; converted to a factor (Yes/No), special values recoded as NA and imputed
"smoking_status", # Corresponds to original SMQ020; converted to a factor (Yes/No), special values recoded as NA and imputed
"alcohol_drinks_day",# Corresponds to original ALQ130; special values (777/999) handled and median imputation applied
"ldl_cholesterol", # Corresponds to original LBDLDL; median imputation applied
"hba1c", # Corresponds to original LBXGH; median imputation applied
"fasting_glucose" # Additional variable: fasting glucose (corresponds to original LBXIN; median imputation applied, renamed)
)
# Target variable
reg_target <- "ogtt_2hr_glucose"
# Extract data (directly from the cleaned final_df, reusing all existing preprocessing results)
reg_df <- final_df %>%
select(all_of(c(reg_predictors, reg_target))) %>%
# The only necessary filtering step:
# remove observations with missing target values
# (the target variable must not contain missing values in regression tasks to avoid model bias)
filter(!is.na(.data[[reg_target]]))
The regression dataset is split into training and testing sets using an 80:20 ratio to ensure consistent evaluation.
set.seed(123) # Fix random seed to ensure reproducibility and consistency with classification models
trainIndex_reg <- createDataPartition(reg_df[[reg_target]], p = 0.8, list = FALSE)
train_reg <- reg_df[trainIndex_reg, ]
test_reg <- reg_df[-trainIndex_reg, ]
Predictor variables are standardized and one-hot encoded to ensure compatibility with regression algorithms.
#Separate predictors and target variable
train_reg_x <- train_reg[, reg_predictors]
train_reg_y <- train_reg[[reg_target]]
test_reg_x <- test_reg[, reg_predictors]
test_reg_y <- test_reg[[reg_target]]
#Build preprocessing pipeline
#Only centering and scaling (median imputation included as a safeguard,
#although data have already been cleaned; no repeated factor conversion)
preProc_reg <- preProcess(
x = train_reg_x,
method = c("center", "scale") # Standardization (mean = 0, SD = 1), suitable for Elastic Net and GBM
)
# Apply preprocessing to training and test sets
train_reg_x_proc <- predict(preProc_reg, newdata = train_reg_x)
test_reg_x_proc <- predict(preProc_reg, newdata = test_reg_x)
# One-hot encoding for categorical variables
# Compatible with model input; reuse existing factor formats without redundant conversion
dv_reg <- dummyVars(~ ., data = train_reg_x_proc, fullRank = TRUE)
train_reg_x_mat <- as.matrix(predict(dv_reg, newdata = train_reg_x_proc))
test_reg_x_mat <- as.matrix(predict(dv_reg, newdata = test_reg_x_proc))
🧮 Elastic Net regression is applied with cross-validation to identify the optimal regularization balance.
# Iterate over each alpha value and perform cross-validation
set.seed(123)
alpha_values <- seq(0, 1, 0.1) # Candidate alpha values
cv_results <- list() # Store CV results for each alpha
# Perform cv.glmnet separately for each alpha
for (a in alpha_values) {
cv_fit <- cv.glmnet(
x = train_reg_x_mat,
y = train_reg_y,
family = "gaussian",
alpha = a, # 每次传一个alpha值
nfolds = 5,
type.measure = "mse"
)
# Record the minimum MSE for the current alpha
cv_results[[as.character(a)]] <- data.frame(
alpha = a,
min_mse = min(cv_fit$cvm)
)
}
# Combine results and select alpha with the lowest MSE
cv_results_df <- do.call(rbind, cv_results)
best_alpha <- cv_results_df$alpha[which.min(cv_results_df$min_mse)]
# Refit Elastic Net using the optimal alpha
elastic_net_cv <- cv.glmnet(
x = train_reg_x_mat,
y = train_reg_y,
family = "gaussian",
alpha = best_alpha,
nfolds = 5,
type.measure = "mse"
)
The 🧮 Elastic Net model is evaluated on the test set using standard regression performance metrics.
# Extract optimal lambda
best_lambda <- elastic_net_cv$lambda.min
# Output optimal hyperparameters
cat("Optimal Elastic Net regression parameters (using cleaned data, including fasting glucose):\n")
## Optimal Elastic Net regression parameters (using cleaned data, including fasting glucose):
cat(paste("Optimal alpha:", round(best_alpha, 2), "\n"))
## Optimal alpha: 0.7
cat(paste("Optimal lambda:", round(best_lambda, 6), "\n"))
## Optimal lambda: 0.004196
# Elastic Net prediction on test set
elastic_net_pred <- predict(
elastic_net_cv, # Trained cross-validated model
newx = test_reg_x_mat, # Test set feature matrix (including fasting_glucose)
s = best_lambda # Use the optimal lambda value to ensure best model prediction)
)
# Define regression evaluation metrics
reg_metrics <- function(true, pred) {
mse <- mean((true - pred)^2)
rmse <- sqrt(mse)
mae <- mean(abs(true - pred))
r2 <- cor(true, as.vector(pred))^2 # Coefficient of determination (R²)
return(data.frame(MSE = mse, RMSE = rmse, MAE = mae, R2 = r2))
}
# Evaluate Elastic Net regression
elastic_net_metrics <- reg_metrics(test_reg_y, elastic_net_pred)
cat("\nElastic Net regression performance (test set)\n")
##
## Elastic Net regression performance (test set)
print(elastic_net_metrics)
## MSE RMSE MAE R2
## 1 4.45699 2.111158 1.570064 0.4999103
A ⚡ Gradient Boosting regression model is trained to capture nonlinear relationships between predictors and OGTT levels.
# Load gbm package
if (!require(gbm)) {
install.packages("gbm")
library(gbm)
}
# Train Gradient Boosting regression model
set.seed(123) # Ensure reproducibility
gbm_reg_model <- gbm(
formula = ogtt_2hr_glucose ~ ., # Target variable ~ all predictors (including fasting_glucose)
data = train_reg, # Reuse the cleaned training dataset; no additional transformation required
distribution = "gaussian", # Regression task (continuous target variable; required)
n.trees = 1000, # Initial number of trees (optimal value selected later)
interaction.depth = 4, # Tree depth (controls model complexity)
shrinkage = 0.01, # Learning rate (smaller values improve stability but increase training time)
n.minobsinnode = 10, # Minimum number of observations in each terminal node (prevents overfitting)
cv.folds = 5, # 5-fold cross-validation for selecting the optimal number of trees
verbose = FALSE # Suppress detailed output to keep the console clean
)
# Select optimal number of trees based on cross-validation
best_trees <- gbm.perf(gbm_reg_model, method = "cv")
cat(paste("\nOptimal number of trees for Gradient Boosting:", best_trees, "\n"))
##
## Optimal number of trees for Gradient Boosting: 618
# Gradient Boosting prediction
gbm_reg_pred <- predict(
gbm_reg_model,
newdata = test_reg,
n.trees = best_trees
)
The ⚡ GBM regression model is evaluated using the same metrics to allow direct comparison with 🧮 Elastic Net.
# Evaluate Gradient Boosting regression
gbm_reg_metrics <- reg_metrics(test_reg_y, gbm_reg_pred)
cat("\nGradient Boosting regression performance (test set)\n")
##
## Gradient Boosting regression performance (test set)
print(gbm_reg_metrics)
## MSE RMSE MAE R2
## 1 3.757394 1.9384 1.42888 0.5734181
This section compares regression model performance and presents visual analyses of predictions and feature importance.
# Combine performance metrics from both models
reg_model_perf <- rbind(
cbind(Model = paste0("Elastic Net Regression (alpha=", round(best_alpha,2), ")"), elastic_net_metrics),
cbind(Model = "Gradient Boosting Regressor", gbm_reg_metrics)
)
# Round results for presentation
reg_model_perf[, -1] <- round(reg_model_perf[, -1], 4)
cat("\nSummary of regression model performance\n")
##
## Summary of regression model performance
print(reg_model_perf)
## Model MSE RMSE MAE R2
## 1 Elastic Net Regression (alpha=0.7) 4.4570 2.1112 1.5701 0.4999
## 2 Gradient Boosting Regressor 3.7574 1.9384 1.4289 0.5734
# Load required packages for visualization
if (!require(ggplot2)) {
install.packages("ggplot2")
library(ggplot2)
}
if (!require(gridExtra)) {
install.packages("gridExtra")
library(gridExtra)
}
Predicted vs observed values and feature importance rankings are used to interpret model behavior.
# Construct comparison dataframe
pred_compare <- data.frame(
True_Value = test_reg_y,
Elastic_Net_Pred = as.vector(elastic_net_pred),
GBM_Pred = gbm_reg_pred
)
# Plot Elastic Net: True vs Predicted
p1 <- ggplot(pred_compare, aes(x = True_Value, y = Elastic_Net_Pred)) +
geom_point(alpha = 0.5, color = "#69b3a2") +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(
title = "Elastic Net:\n True vs Predicted Values \n(Including Fasting Glucose)",
x = "True 2-hour Glucose Level (mmol/L)",
y = "Predicted 2-hour Glucose Level (mmol/L)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
# Plot Gradient Boosting: True vs Predicted
p2 <- ggplot(pred_compare, aes(x = True_Value, y = GBM_Pred)) +
geom_point(alpha = 0.5, color = "#ff9f43") +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(
title = "Gradient Boosting: \nTrue vs Predicted Values \n(Including Fasting Glucose)",
x = "True 2-hour Glucose Level (mmol/L)",
y = "Predicted 2-hour Glucose Level (mmol/L)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
# Display plots side by side
grid.arrange(p1, p2, ncol = 2)
# Extract Elastic Net coefficients
elastic_net_coef <- coef(elastic_net_cv, s = best_lambda)
coef_df <- as.data.frame(as.matrix(elastic_net_coef))
colnames(coef_df) <- "Coefficient"
coef_df$Feature <- rownames(coef_df)
# Sort by absolute coefficient magnitude and show top 10
coef_df <- coef_df[order(abs(coef_df$Coefficient), decreasing = TRUE), ]
cat("\nTop Elastic Net regression coefficients\n")
##
## Top Elastic Net regression coefficients
print(head(coef_df, 10))
## Coefficient Feature
## (Intercept) 6.0797489 (Intercept)
## hba1c 1.4239605 hba1c
## ethnicity.Non-Hispanic Black -0.5923737 ethnicity.Non-Hispanic Black
## ethnicity.Non-Hispanic Asian 0.5901073 ethnicity.Non-Hispanic Asian
## fasting_glucose 0.4249804 fasting_glucose
## smoking_status.No 0.3379600 smoking_status.No
## age_years 0.3336321 age_years
## ethnicity.Other -0.3125085 ethnicity.Other
## systolic_bp 0.2836515 systolic_bp
## gender.Female 0.2190777 gender.Female
# Extract and visualize Gradient Boosting feature importance
gbm_importance <- summary(gbm_reg_model, n.trees = best_trees)
ggplot(gbm_importance, aes(x = reorder(var, rel.inf), y = rel.inf)) +
geom_bar(stat = "identity", fill = "#ff9f43", alpha = 0.8) +
coord_flip() +
labs(
title = "Gradient Boosting: Feature Importance Ranking \n (Including Fasting Glucose)",
x = "Feature",
y = "Relative Importance"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
The regression analysis demonstrates that both models capture meaningful relationships between selected predictors and 2-hour OGTT glucose levels. The 🧮 Elastic Net regression model achieved moderate predictive performance (R² = 0.4999), explaining approximately half of the observed variability and providing a transparent linear baseline. In contrast, the ⚡ Gradient Boosting Regressor yielded lower prediction errors and a higher R² of 0.5734, indicating an improved ability to model nonlinear relationships inherent in metabolic data. While 🧮 Elastic Net offers interpretability and stability, Gradient Boosting demonstrates superior predictive capability for metabolic outcome prediction.
Overall, the ⚡ Gradient Boosting regression model outperforms 🧮 Elastic Net, demonstrating stronger predictive capability for OGTT outcomes.
This project successfully addressed the two core research questions:
For the classification task on Chronic Kidney Disease (CKD), both 🌲Random Forest and ⚡ XGBoost models were able to stratify individuals into risk categories based on demographic, lifestyle, blood pressure, and metabolic indicators. ⚡ XGBoost demonstrated superior predictive performance and highlighted the most influential predictors.
For the regression task predicting 2-hour OGTT plasma glucose levels, the⚡ XGradient Boosting Regressor outperformed the 🧮Elastic Net baseline, capturing nonlinear relationships among demographic, lifestyle, anthropometric, and laboratory variables. This provided quantitative insights into factors influencing post-load glucose levels and supported population-level risk assessment.
While the models show good predictive performance at the population level, they are not precise enough for individual clinical diagnosis. The results are informative for risk stratification, research insights, and public health guidance, but should be interpreted within the limitations of NHANES-based modeling.
Overall, the project demonstrates that predictive modeling can inform both CKD classification and quantitative glucose prediction, while emphasizing realistic expectations for model precision in large-scale population data.