1 Introduction

1.1 Background

Population ageing has become an important public health and social policy issue in many countries. As life expectancy increases, a larger proportion of the population is living into older age, where chronic diseases, functional limitations, mental health concerns, and repeated healthcare use become more common. Older adults often experience multiple health conditions at the same time, including hypertension, diabetes, cardiovascular disease, arthritis, cancer, pulmonary disease, and symptoms such as chest pain or shortness of breath. These conditions can increase the need for outpatient visits, hospitalisation, long-term monitoring, and preventive care.

Healthcare demand among older adults is not determined by age alone. It is shaped by a combination of demographic, socioeconomic, behavioural, and clinical factors. For example, income and education may influence access to healthcare and health-seeking behaviour, while chronic disease burden may directly increase the need for medical services. Lifestyle-related factors such as smoking may also contribute to poorer health outcomes, although their relationship with healthcare utilisation can be complex in elderly populations.

This project uses data from the National Health and Nutrition Examination Survey (NHANES) 2017-March 2020 Pre-pandemic cycle. NHANES provides individual-level information collected through interviews, questionnaires, and physical examinations, making it a valuable source for studying health status and healthcare utilisation. Although the dataset is based on the United States population, the relationships between ageing, chronic disease, socioeconomic status, and healthcare demand are broadly relevant to healthy ageing research. The selected variables are also aligned with themes commonly discussed in National Health and Morbidity Survey (NHMS) reports, supporting the relevance of the analysis to the Malaysian healthcare context.

1.2 Problem Statement

Healthcare systems face increasing pressure as the number of older adults grows and the prevalence of chronic diseases rises. A major challenge is identifying older individuals who are more likely to require frequent healthcare services or hospital care. Without early identification, healthcare providers may respond only after medical needs become severe, leading to higher costs, delayed intervention, and poorer patient outcomes.

Traditional healthcare planning often relies on descriptive statistics or broad population-level indicators. While these are useful, they may not fully capture the combined effect of individual-level characteristics such as disease burden, age, income, smoking behaviour, and specific clinical symptoms. Data mining and predictive modelling can help address this gap by identifying patterns in large health datasets and estimating which older adults are more likely to have high healthcare demand.

In this project, high healthcare demand is represented by a composite indicator derived from healthcare utilisation variables. Older adults are classified as having higher healthcare demand if they had frequent healthcare visits in the past year or experienced overnight hospitalisation. In addition, depression severity is analysed through the PHQ-9 total score to explore how demographic characteristics and chronic disease burden relate to mental health outcomes among older adults.

1.3 Project Objectives

The main objective of this project is to apply data mining techniques to understand and predict healthcare-related needs among older adults using NHANES data.

The specific objectives are:

  1. To integrate multiple NHANES datasets covering demographic characteristics, chronic disease history, diabetes, blood pressure and cholesterol status, healthcare utilisation, cardiovascular symptoms, and smoking behaviour.
  2. To prepare a clean modelling dataset for respondents aged 60 years and above through variable selection, missing value handling, feature engineering, recoding, and target construction.
  3. To conduct exploratory data analysis in order to understand the distribution of demographic factors, chronic conditions, engineered disease-burden features, and healthcare demand.
  4. To develop a regression modelling task that predicts PHQ-9 depression severity among older adults using demographic and health-related predictors.
  5. To develop a classification modelling task that predicts whether an older adult belongs to the higher healthcare demand group.
  6. To evaluate model outcomes using appropriate metrics and interpret the findings in relation to healthy ageing, healthcare planning, and preventive intervention.

1.4 Research Question

Research Question 1 Which demographic and health-related factors are associated with depression severity among older adults?

Research Question 2 Can machine learning models accurately predict high healthcare demand among older adults?

1.5 Data Mining Goal

The data mining goal of this project is to discover useful patterns in elderly health data and build predictive models that support better understanding of healthcare demand among older adults.

For the classification task, the goal is to predict whether a respondent aged 60 years or above has higher healthcare demand. The target variable is based on healthcare utilisation, particularly frequent healthcare visits and hospitalisation. Predictors include demographic variables, socioeconomic indicators, chronic disease indicators, chest symptoms, smoking status, and engineered features such as chronic condition count and the interaction between age and disease burden.

For the regression task, the goal is to predict depression severity using the PHQ-9 total score. This task examines whether demographic characteristics and chronic disease burden can explain variation in depressive symptoms among older adults. The regression analysis complements the classification task by considering mental health as another important dimension of healthy ageing.

Together, these two modelling tasks aim to provide a more complete picture of older adults’ health needs. The classification model focuses on healthcare service demand, while the regression model focuses on depression severity. The combined findings can help identify high-risk profiles, highlight important predictors, and support data-driven planning for elderly healthcare services.

2 Data Description

2.1 Data Source Overview

This section describes the dataset used in the project Predicting High Healthcare Demand Among Older Adults to Support Healthy Ageing.

The data was obtained from the National Health and Nutrition Examination Survey (NHANES) 2017–March 2020 Pre-pandemic cycle, conducted by the Centers for Disease Control and Prevention (CDC) in the United States. NHANES is a large-scale health survey designed to provide nationally representative information on the health and nutritional status of the population.

One of the main reasons for selecting NHANES is that it contains a wide range of individual-level information, including demographic background, diagnosed medical conditions, health behaviours, access to healthcare, and cardiovascular- and diabetes-related indicators. These dimensions are highly relevant to the objective of this project, which is to predict healthcare demand among older adults.

Another strength of NHANES is that data is collected through multiple sources, including household interviews, questionnaires, and physical examinations, making the dataset more reliable than a purely self-reported survey.

Although NHANES is a U.S.-based dataset, it is still useful for this study because the relationships between ageing, chronic disease burden, lifestyle factors, and healthcare utilization are broadly relevant across many settings. In addition, variables are aligned with key indicators identified in NHMS reports to ensure relevance to the Malaysian context.

2.2 Selected Datasets

To support the two analytical objectives of this project — predicting high healthcare demand (classification) and predicting depression severity (regression) — eight NHANES datasets were selected. Each file contributes a different aspect of information needed for the analysis:

Dataset Description
DEMO Demographic information (age, gender, race/ethnicity, education, marital status, income)
HUQ Healthcare utilisation and access
BPQ Blood pressure and hypertension-related information
DIQ Diabetes-related information
SMQ Smoking behaviour
MCQ Medical conditions and chronic disease history
CDQ Cardiovascular health-related information
DPQ Mental health — Depression Screener (PHQ-9); used for the regression analysis task

The first seven datasets (DEMO, HUQ, BPQ, DIQ, SMQ, MCQ, CDQ) are merged to form the classification dataset (final_df), which is used to predict high healthcare demand. The DPQ dataset is joined separately with final_df to form the regression dataset (final_df_regression), which is used to predict PHQ-9 depression severity. The two datasets serve different research questions within the same project theme and are kept separate to avoid reducing the sample size for the classification task.

All datasets were merged using SEQN, the unique respondent identifier in NHANES, which allows records from different survey components to be linked to the same participant.

2.3 Dataset Structure

All datasets are merged using the participant identifier variable SEQN.

This project focuses on respondents aged 60 years and above to study healthcare demand among older adults, consistent with the elderly age threshold used in the National Health and Morbidity Survey (NHMS) 2019.

Two separate analytical datasets are produced from these eight files:

  • final_df (classification dataset): constructed from the seven core datasets (DEMO, MCQ, BPQ, DIQ, HUQ, CDQ, SMQ). Contains demographic characteristics, chronic disease indicators, engineered features, and the composite binary target healthcare_demand. Used for the classification task in Section 6.

  • final_df_regression (regression dataset): constructed by joining final_df with the cleaned DPQ data using an inner join on SEQN. Contains all variables from final_df plus the PHQ-9 total score PHQ9_total as the regression target. The inner join retains only participants present in both datasets, resulting in a smaller sample size than final_df. Used for the regression task in Section 5.

2.4 Key Variables and Features

Based on the project objective and supported by NHMS findings, the following key features are identified across six dimensions.

2.4.1 Demographic Factors

Demographic factors include age, gender, race/ethnicity, education level, marital status, and income level. These variables are important because healthcare demand may vary across different social and population groups.

Variable Codebook Description Notes
RIAGENDR Gender Demographic feature
RIDAGEYR Age in years at screening Used to retain participants aged 60 and above
RIDRETH3 Race/Hispanic origin w/ Non-Hispanic Asian Demographic feature
DMDEDUC2 Education level — Adults 20+ Socioeconomic feature
DMDMARTZ Marital status Demographic and social feature
INDFMPIR Ratio of family income to poverty Socioeconomic feature

2.4.2 Clinical Factors

Chronic disease factors include conditions such as arthritis, cardiovascular diseases, COPD, and cancer. Blood pressure and diabetes factors focus on diagnosis and treatment status.

Variable Codebook Description Notes
MCQ160A Doctor ever said you had arthritis Arthritis variable
MCQ160B Ever told had congestive heart failure Cardiovascular disease variable
MCQ160C Ever told you had coronary heart disease Cardiovascular disease variable
MCQ160D Ever told you had angina/angina pectoris Cardiovascular disease variable
MCQ160E Ever told you had heart attack Cardiovascular disease variable
MCQ160F Ever told you had a stroke Cardiovascular / neurological disease variable
MCQ160P Ever told you had COPD, emphysema, ChB Pulmonary disease variable
MCQ220 Ever told you had cancer or malignancy Cancer variable
BPQ020 Ever told you had high blood pressure Hypertension variable
BPQ030 Told had high blood pressure 2+ times Hypertension variable
BPQ080 Dr told you — high cholesterol level Cholesterol variable
DIQ010 Doctor told you have diabetes Diabetes variable
DIQ050 Taking insulin now Diabetes variable
DIQ070 Take diabetic pills to lower blood sugar Diabetes variable
DIQ080 Diabetes affected eyes/had retinopathy Diabetes variable

2.4.3 Chest Symptom Variables

Variable Codebook Description Notes
CDQ001 SP ever had pain or discomfort in chest Chest symptom variable
CDQ010 Shortness of breath on stairs/inclines Chest / cardiovascular symptom variable

2.4.4 Lifestyle Factors

Smoking behaviour is included because it may be associated with poorer health outcomes and higher healthcare needs among older adults.

Variable Codebook Description Notes
SMQ020 Smoked at least 100 cigarettes in life Smoking history variable
SMQ040 Do you now smoke cigarettes? Current smoking status variable
SMD650 Avg # cigarettes/day during past 30 days Smoking intensity variable
SMD030 Age started smoking cigarettes regularly Smoking history variable

2.4.5 Healthcare Utilisation

Healthcare utilisation variables are directly connected to the idea of healthcare demand and are used to construct the target variable.

Variable Codebook Description Notes
HUQ051 # times receive healthcare over past year Outpatient visit frequency; used for target construction
HUQ071 Overnight hospital patient in last year Hospitalisation indicator; used for target construction

2.4.6 Depression Screener Variables (DPQ — Regression Task Only)

The DPQ dataset contains the nine-item Patient Health Questionnaire (PHQ-9), a validated depression screening instrument. Each item asks about the frequency of a depressive symptom over the past two weeks, scored from 0 (not at all) to 3 (nearly every day). The nine items are summed to produce PHQ9_total (range 0–27), which serves as the target variable for the regression task.

Variable Codebook Description Notes
DPQ010 Have little interest in doing things PHQ-9 item 1
DPQ020 Feeling down, depressed, or hopeless PHQ-9 item 2
DPQ030 Trouble sleeping or sleeping too much PHQ-9 item 3
DPQ040 Feeling tired or having little energy PHQ-9 item 4
DPQ050 Poor appetite or overeating PHQ-9 item 5
DPQ060 Feeling bad about yourself PHQ-9 item 6
DPQ070 Trouble concentrating on things PHQ-9 item 7
DPQ080 Moving or speaking slowly or too fast PHQ-9 item 8
DPQ090 Thoughts you would be better off dead PHQ-9 item 9
PHQ9_total Sum of DPQ010–DPQ090 (0–27) Regression target variable

2.5 Target Variable

The target variable is a composite binary indicator of high healthcare demand, constructed from two complementary variables:

  • Condition 1 — HUQ051 >= 4: the participant had 4 or more healthcare visits in the past year, indicating frequent outpatient utilisation.
  • Condition 2 — HUQ071 == 1: the participant was hospitalised overnight at least once in the past year, which is a strong signal of high healthcare need.

A participant is classified as higher healthcare demand (target = 1) if either condition is met. Those who meet neither condition are classified as lower healthcare demand (target = 0). Using hospitalisation as an additional criterion captures high-need individuals who may not have had many outpatient visits but whose health burden is clearly significant.

2.6 Data Characteristics

The combined dataset has several important characteristics that affect later analysis.

Data includes both categorical and numerical variables. Many NHANES questionnaire variables are categorical because responses are recorded using coded answer options such as “Yes”, “No”, “Refused”, or “Don’t know”. At the same time, some demographic measures such as age and income ratio are numerical.

The dataset is cross-sectional. It captures respondents’ information at one point in time rather than following them over multiple years. This means the project is suitable for identifying associations and building predictive models based on current status, but cannot directly establish long-term causal effects.

Data comes from multiple files. The data structure is multi-source and modular. This allows the study to include different dimensions of elderly health, but also creates the need for proper merging, variable selection, and consistency checking.

The dataset includes both self-reported measures and health-related indicators. This provides a broader view of the elderly population. However, self-reported data may be subject to recall bias or reporting errors, especially for older respondents.

The dataset may contain missing values and coded responses. Values such as 7 / 9 (Refused / Don’t know) and 77 / 99 / 777 / 999 for specific variables will be treated as missing during preprocessing.

3 Data Preparation

3.1 Data Loading and Merging

The dataset was constructed using selected NHANES 2017–2020 files: DEMO, MCQ, BPQ, DIQ, HUQ, CDQ, and SMQ. These files cover demographic characteristics, chronic diseases, blood pressure and cholesterol conditions, healthcare utilisation, chest pain and cardiovascular symptoms, and smoking-related behaviours. Only respondents aged 60 years and above were retained for this study.

3.1.1 Loading the Source Files

demo <- read_xpt("DEMO.xpt")
mcq  <- read_xpt("MCQ.xpt")
bpq  <- read_xpt("BPQ.xpt")
diq  <- read_xpt("DIQ.xpt")
huq  <- read_xpt("HUQ.xpt")
cdq  <- read_xpt("CDQ.xpt")
smq  <- read_xpt("SMQ.xpt")
# Print dimensions of each dataset
cat("demo:", nrow(demo), "rows,", ncol(demo), "cols\n")
## demo: 15560 rows, 29 cols
cat("mcq: ", nrow(mcq),  "rows,", ncol(mcq),  "cols\n")
## mcq:  14986 rows, 63 cols
cat("bpq: ", nrow(bpq),  "rows,", ncol(bpq),  "cols\n")
## bpq:  10195 rows, 11 cols
cat("diq: ", nrow(diq),  "rows,", ncol(diq),  "cols\n")
## diq:  14986 rows, 28 cols
cat("huq: ", nrow(huq),  "rows,", ncol(huq),  "cols\n")
## huq:  15560 rows, 7 cols
cat("cdq: ", nrow(cdq),  "rows,", ncol(cdq),  "cols\n")
## cdq:  6433 rows, 17 cols
cat("smq: ", nrow(smq),  "rows,", ncol(smq),  "cols\n")
## smq:  11137 rows, 16 cols
# Check whether SEQN exists in each dataset
cat("SEQN in demo:", "SEQN" %in% names(demo), "\n")
## SEQN in demo: TRUE
cat("SEQN in mcq: ", "SEQN" %in% names(mcq),  "\n")
## SEQN in mcq:  TRUE
cat("SEQN in bpq: ", "SEQN" %in% names(bpq),  "\n")
## SEQN in bpq:  TRUE
cat("SEQN in diq: ", "SEQN" %in% names(diq),  "\n")
## SEQN in diq:  TRUE
cat("SEQN in huq: ", "SEQN" %in% names(huq),  "\n")
## SEQN in huq:  TRUE
cat("SEQN in cdq: ", "SEQN" %in% names(cdq),  "\n")
## SEQN in cdq:  TRUE
cat("SEQN in smq: ", "SEQN" %in% names(smq),  "\n")
## SEQN in smq:  TRUE

3.1.2 Merging Datasets

# Filter respondents aged 60 and above
df_60up <- demo %>% filter(RIDAGEYR >= 60)
cat("df_60up shape:", nrow(df_60up), "rows,", ncol(df_60up), "cols\n")
## df_60up shape: 3422 rows, 29 cols
head(df_60up)
## # A tibble: 6 × 29
##     SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 RIDEXMON
##    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 109274       66        2        1       68       NA        5        7        1
## 2 109282       66        2        1       76       NA        3        3        2
## 3 109290       66        2        2       68       NA        4        4        2
## 4 109298       66        2        1       68       NA        3        3        2
## 5 109313       66        2        1       63       NA        3        3        1
## 6 109316       66        2        2       62       NA        4        4        1
## # ℹ 20 more variables: DMDBORN4 <dbl>, DMDYRUSZ <dbl>, DMDEDUC2 <dbl>,
## #   DMDMARTZ <dbl>, RIDEXPRG <dbl>, SIALANG <dbl>, SIAPROXY <dbl>,
## #   SIAINTRP <dbl>, FIALANG <dbl>, FIAPROXY <dbl>, FIAINTRP <dbl>,
## #   MIALANG <dbl>, MIAPROXY <dbl>, MIAINTRP <dbl>, AIALANGA <dbl>,
## #   WTINTPRP <dbl>, WTMECPRP <dbl>, SDMVPSU <dbl>, SDMVSTRA <dbl>,
## #   INDFMPIR <dbl>
# Left join all datasets on SEQN
merged <- df_60up %>%
  left_join(mcq, by = "SEQN") %>%
  left_join(bpq, by = "SEQN") %>%
  left_join(diq, by = "SEQN") %>%
  left_join(huq, by = "SEQN") %>%
  left_join(cdq, by = "SEQN") %>%
  left_join(smq, by = "SEQN")

cat("merged shape:", nrow(merged), "rows,", ncol(merged), "cols\n")
## merged shape: 3422 rows, 165 cols
head(merged)
## # A tibble: 6 × 165
##     SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 RIDEXMON
##    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 109274       66        2        1       68       NA        5        7        1
## 2 109282       66        2        1       76       NA        3        3        2
## 3 109290       66        2        2       68       NA        4        4        2
## 4 109298       66        2        1       68       NA        3        3        2
## 5 109313       66        2        1       63       NA        3        3        1
## 6 109316       66        2        2       62       NA        4        4        1
## # ℹ 156 more variables: DMDBORN4 <dbl>, DMDYRUSZ <dbl>, DMDEDUC2 <dbl>,
## #   DMDMARTZ <dbl>, RIDEXPRG <dbl>, SIALANG <dbl>, SIAPROXY <dbl>,
## #   SIAINTRP <dbl>, FIALANG <dbl>, FIAPROXY <dbl>, FIAINTRP <dbl>,
## #   MIALANG <dbl>, MIAPROXY <dbl>, MIAINTRP <dbl>, AIALANGA <dbl>,
## #   WTINTPRP <dbl>, WTMECPRP <dbl>, SDMVPSU <dbl>, SDMVSTRA <dbl>,
## #   INDFMPIR <dbl>, MCQ010 <dbl>, MCQ025 <dbl>, MCQ035 <dbl>, MCQ040 <dbl>,
## #   MCQ050 <dbl>, AGQ030 <dbl>, MCQ053 <dbl>, MCQ080 <dbl>, MCQ092 <dbl>, …
# Print all column names
print(names(merged))
##   [1] "SEQN"     "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
##   [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "DMDBORN4" "DMDYRUSZ" "DMDEDUC2"
##  [13] "DMDMARTZ" "RIDEXPRG" "SIALANG"  "SIAPROXY" "SIAINTRP" "FIALANG" 
##  [19] "FIAPROXY" "FIAINTRP" "MIALANG"  "MIAPROXY" "MIAINTRP" "AIALANGA"
##  [25] "WTINTPRP" "WTMECPRP" "SDMVPSU"  "SDMVSTRA" "INDFMPIR" "MCQ010"  
##  [31] "MCQ025"   "MCQ035"   "MCQ040"   "MCQ050"   "AGQ030"   "MCQ053"  
##  [37] "MCQ080"   "MCQ092"   "MCD093"   "MCQ149"   "MCQ151"   "RHD018"  
##  [43] "MCQ160A"  "MCQ195"   "MCQ160B"  "MCD180B"  "MCQ160C"  "MCD180C" 
##  [49] "MCQ160D"  "MCD180D"  "MCQ160E"  "MCD180E"  "MCQ160F"  "MCD180F" 
##  [55] "MCQ160M"  "MCQ170M"  "MCD180M"  "MCQ160P"  "MCQ160L"  "MCQ170L" 
##  [61] "MCD180L"  "MCQ500"   "MCQ510A"  "MCQ510B"  "MCQ510C"  "MCQ510D" 
##  [67] "MCQ510E"  "MCQ510F"  "MCQ520"   "MCQ530"   "MCQ540"   "MCQ550"  
##  [73] "MCQ560"   "MCQ570"   "MCQ220"   "MCQ230A"  "MCQ230B"  "MCQ230C" 
##  [79] "MCQ230D"  "MCQ300B"  "MCQ300C"  "MCQ300A"  "MCQ366A"  "MCQ366B" 
##  [85] "MCQ366C"  "MCQ366D"  "MCQ371A"  "MCQ371B"  "MCQ371C"  "MCQ371D" 
##  [91] "OSQ230"   "BPQ020"   "BPQ030"   "BPD035"   "BPQ040A"  "BPQ050A" 
##  [97] "BPQ080"   "BPQ060"   "BPQ070"   "BPQ090D"  "BPQ100D"  "DIQ010"  
## [103] "DID040"   "DIQ160"   "DIQ180"   "DIQ050"   "DID060"   "DIQ060U" 
## [109] "DIQ070"   "DIQ230"   "DIQ240"   "DID250"   "DID260"   "DIQ260U" 
## [115] "DIQ275"   "DIQ280"   "DIQ291"   "DIQ300S"  "DIQ300D"  "DID310S" 
## [121] "DID310D"  "DID320"   "DID330"   "DID341"   "DID350"   "DIQ350U" 
## [127] "DIQ360"   "DIQ080"   "HUQ010"   "HUQ030"   "HUQ051"   "HUD062"  
## [133] "HUQ071"   "HUQ090"   "CDQ001"   "CDQ002"   "CDQ003"   "CDQ004"  
## [139] "CDQ005"   "CDQ006"   "CDQ009A"  "CDQ009B"  "CDQ009C"  "CDQ009D" 
## [145] "CDQ009E"  "CDQ009F"  "CDQ009G"  "CDQ009H"  "CDQ008"   "CDQ010"  
## [151] "SMQ020"   "SMD030"   "SMQ040"   "SMQ050Q"  "SMQ050U"  "SMD057"  
## [157] "SMQ078"   "SMD641"   "SMD650"   "SMD100FL" "SMD100MN" "SMQ670"  
## [163] "SMQ621"   "SMD630"   "SMAQUEX2"

3.2 Source Datasets Overview

Eight NHANES 2017–2020 files were used in this study. The first seven form the classification dataset; the eighth (DPQ) is used exclusively for the regression task. The table below summarises each dataset, its coverage, and the number of variables selected.

Dataset Coverage Total Variables Selected for Candidate Pool
DEMO Demographics & socioeconomic status (age, gender, race, education, marital status, income) 29 6 (RIAGENDR, RIDAGEYR, RIDRETH3, DMDEDUC2, DMDMARTZ, INDFMPIR)
MCQ Self-reported chronic conditions (arthritis, cardiovascular diseases, COPD, cancer, anemia, etc.) 58 8 (MCQ160A–F, MCQ160P, MCQ220)
BPQ Blood pressure and cholesterol history and treatment 11 3 (BPQ020, BPQ030, BPQ080)
DIQ Diabetes diagnosis, treatment, and management 36 4 (DIQ010, DIQ050, DIQ070, DIQ080)
CDQ Chest pain and cardiovascular symptoms 19 2 (CDQ001, CDQ010)
SMQ Smoking history and current smoking behaviour 16 4 (SMQ020, SMQ040, SMD650, SMD030)
HUQ Healthcare utilisation (visit frequency, hospitalisation, mental health care) 7 2 (HUQ051, HUQ071)
DPQ Mental health — Depression Screener (PHQ-9) 11 9 (DPQ010–DPQ090; summed into PHQ9_total for regression)

A total of 29 variables are selected for the classification dataset (final_df). The DPQ dataset contributes an additional 9 items used exclusively for the regression task via final_df_regression. Full variable descriptions are provided in Section 3.5.

3.3 Initial Variable Screening

After merging the selected NHANES 2017–2020 datasets, an initial variable screening step was conducted. A total of 29 candidate variables were retained based on their relevance to the study objective. These variables included demographic variables, chronic disease variables, blood pressure and cholesterol variables, diabetes variables, smoking-related variables, chest symptom variables, and healthcare utilization variables.

The purpose of this step was to reduce the complexity of the merged dataset while keeping variables that may help explain health risk and healthcare demand among older adults. These candidate variables were then used for missing value inspection before further simplification.

# Select the 29 candidate variables
selected_vars <- c(
  "SEQN",
  # Demographic variables (6)
  "RIAGENDR", "RIDAGEYR", "RIDRETH3", "DMDEDUC2", "DMDMARTZ", "INDFMPIR",
  # Medical condition variables (8)
  "MCQ160A", "MCQ160B", "MCQ160C", "MCQ160D", "MCQ160E", "MCQ160F",
  "MCQ160P", "MCQ220",
  # Blood pressure and cholesterol variables (3)
  "BPQ020", "BPQ030", "BPQ080",
  # Diabetes variables (4)
  "DIQ010", "DIQ050", "DIQ070", "DIQ080",
  # Chest symptom variables (2)
  "CDQ001", "CDQ010",
  # Smoking variables (4)
  "SMQ020", "SMQ040", "SMD650", "SMD030",
  # Healthcare utilization variables (2)
  "HUQ051", "HUQ071"
)

selected_df <- merged %>% select(all_of(selected_vars))

cat("Selected dataset shape:", nrow(selected_df), "rows,", ncol(selected_df), "cols\n")
## Selected dataset shape: 3422 rows, 30 cols
head(selected_df)
## # A tibble: 6 × 30
##     SEQN RIAGENDR RIDAGEYR RIDRETH3 DMDEDUC2 DMDMARTZ INDFMPIR MCQ160A MCQ160B
##    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>   <dbl>
## 1 109274        1       68        7        4        3     1.2        1       2
## 2 109282        1       76        3        5        1     3.61       2       2
## 3 109290        2       68        4        5        2     5          1       2
## 4 109298        1       68        3        1        1     1.75       1       2
## 5 109313        1       63        3        4        1     5          1       2
## 6 109316        2       62        4        3        2     0.07       1       2
## # ℹ 21 more variables: MCQ160C <dbl>, MCQ160D <dbl>, MCQ160E <dbl>,
## #   MCQ160F <dbl>, MCQ160P <dbl>, MCQ220 <dbl>, BPQ020 <dbl>, BPQ030 <dbl>,
## #   BPQ080 <dbl>, DIQ010 <dbl>, DIQ050 <dbl>, DIQ070 <dbl>, DIQ080 <dbl>,
## #   CDQ001 <dbl>, CDQ010 <dbl>, SMQ020 <dbl>, SMQ040 <dbl>, SMD650 <dbl>,
## #   SMD030 <dbl>, HUQ051 <dbl>, HUQ071 <dbl>

3.4 Missing Value Inspection

3.4.1 Missing Value Assessment

After variable selection, the dataset was inspected for incomplete observations before further preprocessing. Missing values are common in NHANES questionnaire data due to conditional survey logic and participant eligibility criteria. Therefore, missing value exploration was conducted to identify variables with substantial missingness and assess their potential impact on the analytical dataset.

# Calculate missing values counts and percentage
missing_count   <- selected_df %>%
                   summarise(across(everything(), ~ sum(is.na(.)))) %>%
                   tidyr::pivot_longer(everything(), names_to = "variable", values_to = "Missing Count")
missing_percent <- selected_df %>%
                   summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
                   tidyr::pivot_longer(everything(), names_to = "variable", values_to = "Missing Percentage")

missing_summary <- left_join(missing_count, missing_percent, by = "variable")

# Display only variables that have missing data
missing_summary %>%
  filter(`Missing Count` > 0) %>%
  arrange(desc(`Missing Count`))
## # A tibble: 8 × 3
##   variable `Missing Count` `Missing Percentage`
##   <chr>              <int>                <dbl>
## 1 SMD650              2980                 87.1
## 2 DIQ080              2490                 72.8
## 3 DIQ050              2488                 72.7
## 4 DIQ070              1954                 57.1
## 5 SMQ040              1737                 50.8
## 6 SMD030              1737                 50.8
## 7 BPQ030              1322                 38.6
## 8 INDFMPIR             538                 15.7

3.4.1.1 Handling Special Survey Response Codes

In addition to standard missing values (NA), several NHANES questionnaire variables contain special response codes representing “Refused” (7, 77, and 777) or “Don’t know” (9, 99, and 999) answers. These coded responses were identified and treated as missing values during preprocessing to ensure consistency in downstream analysis.

# Example of special code in BPQ080
print(table(selected_df$BPQ080, useNA = "always"))
## 
##    1    2    7    9 <NA> 
## 1883 1498    2   39    0
# NHANES special response codes:
# [7, 9]       -> default for most variables
# [77, 99]     -> DMDMARTZ, HUQ051
# [9]          -> HUQ071 (only "Don't know"; no Refused code)
# [777, 999]   -> SMD030, SMD650

vars_excluded <- c("RIAGENDR", "RIDAGEYR", "RIDRETH3")
cols_to_clean <- setdiff(names(selected_df), vars_excluded)

special_code_map <- list(
  default  = c(7, 9),
  DMDMARTZ = c(77, 99),
  HUQ051   = c(77, 99),
  HUQ071   = c(9),
  SMD030   = c(777, 999),
  SMD650   = c(777, 999)
)

# Count replacements per variable before replacing
replacement_counts <- sapply(cols_to_clean, function(col) {
  codes <- if (col %in% names(special_code_map)) special_code_map[[col]] else special_code_map[["default"]]
  sum(selected_df[[col]] %in% codes, na.rm = TRUE)
})

replacement_summary <- data.frame(
  Variable            = names(replacement_counts),
  `Replacements Made` = replacement_counts,
  row.names           = NULL
) %>%
  filter(`Replacements.Made` > 0) %>%
  arrange(desc(`Replacements.Made`))

cat("Total replacements made:", sum(replacement_summary$Replacements.Made), "\n")
## Total replacements made: 263
print(replacement_summary)
##    Variable Replacements.Made
## 1    BPQ080                41
## 2   MCQ160D                29
## 3   MCQ160C                23
## 4   MCQ160B                20
## 5   MCQ160A                15
## 6    CDQ010                15
## 7   MCQ160P                14
## 8    BPQ030                14
## 9    SMD030                14
## 10 DMDEDUC2                12
## 11   DIQ080                11
## 12  MCQ160F                10
## 13   HUQ051                 8
## 14 DMDMARTZ                 7
## 15  MCQ160E                 7
## 16   BPQ020                 7
## 17   DIQ070                 4
## 18   CDQ001                 3
## 19   SMQ020                 3
## 20   DIQ010                 2
## 21   SMD650                 2
## 22   MCQ220                 1
## 23   HUQ071                 1
# Replace special codes with NA
for (col in cols_to_clean) {
  codes <- if (col %in% names(special_code_map)) special_code_map[[col]] else special_code_map[["default"]]
  selected_df[[col]][selected_df[[col]] %in% codes] <- NA
}

3.4.1.2 Updated Missing Percentage

upd_missing_count   <- selected_df %>%
                       summarise(across(everything(), ~ sum(is.na(.)))) %>%
                       tidyr::pivot_longer(everything(), names_to = "variable", values_to = "After Count")
upd_missing_percent <- selected_df %>%
                       summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
                       tidyr::pivot_longer(everything(), names_to = "variable", values_to = "After Percentage")

upd_missing_summary <- left_join(upd_missing_count, upd_missing_percent, by = "variable")

upd_missing_summary %>%
  filter(`After Count` > 0) %>%
  arrange(desc(`After Count`))
## # A tibble: 26 × 3
##    variable `After Count` `After Percentage`
##    <chr>            <int>              <dbl>
##  1 SMD650            2982             87.1  
##  2 DIQ080            2501             73.1  
##  3 DIQ050            2488             72.7  
##  4 DIQ070            1958             57.2  
##  5 SMD030            1751             51.2  
##  6 SMQ040            1737             50.8  
##  7 BPQ030            1336             39.0  
##  8 INDFMPIR           538             15.7  
##  9 BPQ080              41              1.20 
## 10 MCQ160D             29              0.847
## # ℹ 16 more rows

3.4.1.3 Comparison Before vs. After Special Values Replacement

missing_comparison <- missing_summary %>%
  rename(`Before Count` = `Missing Count`, `Before Percentage` = `Missing Percentage`) %>%
  left_join(upd_missing_summary, by = "variable") %>%
  mutate(
    `Increase Count`      = `After Count`      - `Before Count`,
    `Increase Percentage` = `After Percentage` - `Before Percentage`
  ) %>%
  filter(`Increase Count` > 0) %>%
  arrange(desc(`Increase Count`))

missing_comparison
## # A tibble: 23 × 7
##    variable `Before Count` `Before Percentage` `After Count` `After Percentage`
##    <chr>             <int>               <dbl>         <int>              <dbl>
##  1 BPQ080                0                 0              41              1.20 
##  2 MCQ160D               0                 0              29              0.847
##  3 MCQ160C               0                 0              23              0.672
##  4 MCQ160B               0                 0              20              0.584
##  5 MCQ160A               0                 0              15              0.438
##  6 CDQ010                0                 0              15              0.438
##  7 MCQ160P               0                 0              14              0.409
##  8 BPQ030             1322                38.6          1336             39.0  
##  9 SMD030             1737                50.8          1751             51.2  
## 10 DMDEDUC2              0                 0              12              0.351
## # ℹ 13 more rows
## # ℹ 2 more variables: `Increase Count` <int>, `Increase Percentage` <dbl>

After replacing survey-coded responses such as “Refused” and “Don’t know” with missing values, several questionnaire variables showed noticeable increases in missingness percentages. This indicates that a substantial portion of incomplete information in the NHANES dataset was originally represented using coded survey responses rather than standard null values.

The increase was particularly noticeable among chronic disease follow-up variables, further supporting the need for careful variable review during preprocessing.

3.4.2 Figure: Missing Value Pattern

# Identify variables with any missing values (excluding SEQN and RIDAGEYR)
miss_vars <- upd_missing_summary %>%
  filter(`After Count` > 0, !variable %in% c("SEQN", "RIDAGEYR")) %>%
  pull(variable)

# Build long-format binary matrix sorted by age
df_vis <- selected_df %>%
  select(RIDAGEYR, all_of(miss_vars)) %>%
  arrange(RIDAGEYR) %>%
  mutate(row_id = row_number()) %>%
  pivot_longer(cols = all_of(miss_vars),
               names_to  = "variable",
               values_to = "value") %>%
  mutate(is_missing = as.integer(is.na(value)))

ggplot(df_vis, aes(x = row_id, y = variable, fill = factor(is_missing))) +
  geom_tile() +
  scale_fill_manual(
    values = c("0" = "white", "1" = "red"),
    labels = c("0" = "Available", "1" = "Missing"),
    name   = "Data Status"
  ) +
  labs(
    x     = "RIDAGEYR (Age Years, sorted)",
    y     = "Variable",
    title = "Missing Value Pattern by RIDAGEYR Across Variables"
  ) +
  theme_minimal() +
  theme(
    axis.text.y     = element_text(size = 8),
    plot.title      = element_text(size = 12, face = "bold"),
    legend.position = "right"
  )

As shown in the figure, missingness is not random across variables. Diabetes follow-up variables (DIQ050, DIQ070, DIQ080) and smoking variables (SMD650, SMD030) show the highest missing rates, largely due to skip patterns in the NHANES questionnaire — respondents who answered ‘No’ to a screening question were not asked follow-up items. In contrast, demographic variables and core disease indicators (BPQ020, DIQ010, MCQ160A) have very low missingness.

3.4.3 Handling High-Missingness Variables

As a practical guideline, variables with missingness exceeding approximately 40% were considered for exclusion unless they were regarded as clinically important. Several diabetes follow-up variables (DIQ050, DIQ070, DIQ080) and smoking consumption variables (SMD650, SMD030) were removed due to substantial skip-pattern missingness exceeding this threshold.

SMQ040 was retained despite its relatively high missingness because it provides important smoking behaviour information required for subsequent feature engineering — it is later combined with SMQ020 to construct the smoking_status feature.

remove_vars <- c("DIQ050", "DIQ070", "DIQ080", "SMD650", "SMD030")

selected_df <- selected_df %>% select(-all_of(remove_vars))

cat("Dataset shape after removing high-missingness variables:",
    nrow(selected_df), "rows,", ncol(selected_df), "cols\n")
## Dataset shape after removing high-missingness variables: 3422 rows, 25 cols

3.5 Full Description of Selected Variables

3.5.1 Demographic Variables

Variable Codebook Description Notes
SEQN Respondent sequence number Unique respondent identifier used for merging datasets
RIAGENDR Gender Demographic feature
RIDAGEYR Age in years at screening Used to retain participants aged 60 and above
RIDRETH3 Race/Hispanic origin w/ Non-Hispanic Asian Demographic feature
DMDEDUC2 Education level - Adults 20+ Socioeconomic feature
DMDMARTZ Marital status Demographic and social feature
INDFMPIR Ratio of family income to poverty Socioeconomic feature

3.5.2 Medical Condition Variables

Variable Codebook Description Notes
MCQ160A Doctor ever said you had arthritis Arthritis disease variable
MCQ160B Ever told had congestive heart failure Cardiovascular disease variable
MCQ160C Ever told you had coronary heart disease Cardiovascular disease variable
MCQ160D Ever told you had angina/angina pectoris Cardiovascular disease variable
MCQ160E Ever told you had heart attack Cardiovascular disease variable
MCQ160F Ever told you had a stroke Cardiovascular / neurological disease variable
MCQ160P Ever told you had COPD, emphysema, ChB Pulmonary disease variable
MCQ220 Ever told you had cancer or malignancy Cancer disease variable

3.5.3 Blood Pressure & Cholesterol Variables

Variable Codebook Description Notes
BPQ020 Ever told you had high blood pressure Hypertension variable
BPQ030 Told had high blood pressure 2+ times Hypertension variable; combined with BPQ020 into hypertension_status
BPQ080 Dr told you - high cholesterol level Cholesterol variable

3.5.4 Diabetes Variable

Variable Codebook Description Notes
DIQ010 Doctor told you have diabetes Diabetes variable; borderline (value = 3) treated as Yes

3.5.5 Chest Symptom Variables

Variable Codebook Description Notes
CDQ001 SP ever had pain or discomfort in chest Chest symptom variable
CDQ010 Shortness of breath on stairs/inclines Chest / cardiovascular symptom variable

3.5.6 Smoking Variables

Variable Codebook Description Notes
SMQ020 Smoked at least 100 cigarettes in life Smoking history variable; combined with SMQ040 into smoking_status
SMQ040 Do you now smoke cigarettes? Current smoking status variable

3.5.7 Healthcare Utilization Variables

Variable Codebook Description Notes
HUQ051 # times receive healthcare over past year Used for composite target construction
HUQ071 Overnight hospital patient in last year Hospitalisation indicator; used for composite target construction

3.6 Feature Engineering

Feature engineering was performed to simplify overlapping questionnaire variables and improve the usability of selected predictors for machine learning modelling. Several related health indicators were consolidated into broader features to reduce sparsity, preserve clinically meaningful information, and improve overall dataset interpretability.

3.6.1 Engineering Cardiovascular Disease Feature

The variables MCQ160B, MCQ160C, MCQ160D, MCQ160E, and MCQ160F were consolidated into a single binary feature cardio_disease. A participant is labelled as having cardiovascular disease if at least one of the selected cardiovascular conditions was reported (1 = Yes, 0 = No). Since these variables had relatively low missingness and were combined into a broader binary indicator, missing responses were treated as absence of the condition to preserve participant records.

df_slim <- selected_df

# Inspect cardiovascular variables before engineering
cardio_vars <- c("MCQ160B", "MCQ160C", "MCQ160D", "MCQ160E", "MCQ160F")
for (col in cardio_vars) {
  cat(col, ":\n")
  print(table(df_slim[[col]], useNA = "always"))
}
## MCQ160B :
## 
##    1    2 <NA> 
##  278 3124   20 
## MCQ160C :
## 
##    1    2 <NA> 
##  345 3054   23 
## MCQ160D :
## 
##    1    2 <NA> 
##  178 3215   29 
## MCQ160E :
## 
##    1    2 <NA> 
##  333 3082    7 
## MCQ160F :
## 
##    1    2 <NA> 
##  347 3065   10
# Combine into cardio_disease: 1 if any cardiovascular condition reported
df_slim <- df_slim %>%
  mutate(
    cardio_disease = as.integer(
      (!is.na(MCQ160B) & MCQ160B == 1) |
      (!is.na(MCQ160C) & MCQ160C == 1) |
      (!is.na(MCQ160D) & MCQ160D == 1) |
      (!is.na(MCQ160E) & MCQ160E == 1) |
      (!is.na(MCQ160F) & MCQ160F == 1)
    )
  )

cat("\ncardio_disease distribution:\n")
## 
## cardio_disease distribution:
print(table(df_slim$cardio_disease, useNA = "always"))
## 
##    0    1 <NA> 
## 2549  873    0
# Remove original cardiovascular variables
df_slim <- df_slim %>% select(-all_of(cardio_vars))
cat("\nShape after removing original cardio vars:",
    nrow(df_slim), "rows,", ncol(df_slim), "cols\n")
## 
## Shape after removing original cardio vars: 3422 rows, 21 cols

3.6.2 Engineering Hypertension Status Feature

The variables BPQ020 and BPQ030 were combined to construct a simplified hypertension_status feature. BPQ030 is a follow-up question administered only to participants reporting prior hypertension diagnosis, resulting in substantial skip-pattern missingness. To preserve hypertension-related information while reducing sparsity, participants reporting either prior hypertension diagnosis (BPQ020 = 1) or repeated hypertension confirmation (BPQ030 = 1) were classified as having hypertension (hypertension_status = 1).

# Inspect before engineering
cat("BPQ020:\n"); print(table(df_slim$BPQ020, useNA = "always"))
## BPQ020:
## 
##    1    2 <NA> 
## 2100 1315    7
cat("\nBPQ030:\n"); print(table(df_slim$BPQ030, useNA = "always"))
## 
## BPQ030:
## 
##    1    2 <NA> 
## 1700  386 1336
# Combine into hypertension_status
df_slim <- df_slim %>%
  mutate(
    hypertension_status = as.integer(
      (!is.na(BPQ020) & BPQ020 == 1) |
      (!is.na(BPQ030) & BPQ030 == 1)
    )
  )

cat("\nhypertension_status distribution:\n")
## 
## hypertension_status distribution:
print(table(df_slim$hypertension_status, useNA = "always"))
## 
##    0    1 <NA> 
## 1322 2100    0
# Remove original hypertension variables
df_slim <- df_slim %>% select(-c(BPQ020, BPQ030))
cat("\nShape after removing original hypertension vars:",
    nrow(df_slim), "rows,", ncol(df_slim), "cols\n")
## 
## Shape after removing original hypertension vars: 3422 rows, 20 cols

3.6.3 Engineering Smoking Status Feature

The variables SMQ020 and SMQ040 were combined into a simplified smoking_status feature representing overall smoking exposure. SMQ040 is a follow-up question only asked to respondents who reported smoking at least 100 cigarettes in their lifetime (SMQ020 = 1), resulting in skip-pattern missingness.

# Inspect before engineering
cat("SMQ020:\n"); print(table(df_slim$SMQ020, useNA = "always"))
## SMQ020:
## 
##    1    2 <NA> 
## 1685 1734    3
cat("\nSMQ040:\n"); print(table(df_slim$SMQ040, useNA = "always"))
## 
## SMQ040:
## 
##    1    2    3 <NA> 
##  357   95 1233 1737
# Combine into smoking_status
df_slim <- df_slim %>%
  mutate(
    smoking_status = case_when(
      SMQ040 %in% c(1, 2) ~ 1L,   # current smoker
      SMQ040 == 3         ~ 0L,   # former smoker
      SMQ020 == 2         ~ 0L,   # never smoked
      TRUE                ~ NA_integer_
    )
  )

cat("\nsmoking_status distribution:\n")
## 
## smoking_status distribution:
print(table(df_slim$smoking_status, useNA = "always"))
## 
##    0    1 <NA> 
## 2967  452    3
# Remove original smoking variables
df_slim <- df_slim %>% select(-c(SMQ020, SMQ040))
cat("\nShape after removing original smoking vars:",
    nrow(df_slim), "rows,", ncol(df_slim), "cols\n")
## 
## Shape after removing original smoking vars: 3422 rows, 19 cols
Variable Description Notes
cardio_disease Derived cardiovascular disease indicator Combined from MCQ160B–F; original variables removed
hypertension_status Derived hypertension indicator Combined from BPQ020 and BPQ030; original variables removed
smoking_status Derived smoking-related predictor Combined from SMQ020 and SMQ040; original variables removed

3.7 Target Construction

A composite binary target variable was constructed using two complementary indicators of healthcare demand: HUQ051 (number of healthcare visits over the past year) and HUQ071 (whether the participant was an overnight hospital patient in the past year).

A participant is classified as the higher healthcare demand group (healthcare_demand = 1) if either of the following conditions is met:

  • Condition 1: HUQ051 >= 4 — the participant had 4 or more healthcare visits in the past year, indicating frequent outpatient utilisation.
  • Condition 2: HUQ071 == 1 — the participant was hospitalised overnight at least once in the past year, which is a direct and objective indicator of high medical need independent of access barriers.

Participants who meet neither condition are classified as the lower healthcare demand group (healthcare_demand = 0).

# Distribution of HUQ051 and HUQ071 before target construction
cat("HUQ051 distribution:\n")
## HUQ051 distribution:
print(sort(table(df_slim$HUQ051, useNA = "always")))
## 
## <NA>    7    5    8    0    6    4    1    3    2 
##    8   80  153  191  219  259  350  419  681 1062
cat("\nHUQ071 distribution:\n")
## 
## HUQ071 distribution:
print(table(df_slim$HUQ071, useNA = "always"))
## 
##    1    2 <NA> 
##  598 2823    1
# Fix HUQ071 to integer (may be stored as float from SAS import)
df_slim <- df_slim %>%
  mutate(HUQ071 = as.integer(round(HUQ071)))

cat("\nHUQ071 distribution (after fix):\n")
## 
## HUQ071 distribution (after fix):
print(table(df_slim$HUQ071, useNA = "always"))
## 
##    1    2 <NA> 
##  598 2823    1
# Construct composite binary target
df_slim <- df_slim %>%
  mutate(
    healthcare_demand = case_when(
      HUQ051 >= 4  ~ 1L,                              # Condition 1: frequent outpatient visits
      HUQ071 == 1  ~ 1L,                              # Condition 2: hospitalised overnight
      !is.na(HUQ051) & !is.na(HUQ071) ~ 0L,           # Neither condition met
      TRUE         ~ NA_integer_                       # Missing on both
    )
  )

df_slim %>% select(HUQ051, HUQ071, healthcare_demand) %>% head(10)
## # A tibble: 10 × 3
##    HUQ051 HUQ071 healthcare_demand
##     <dbl>  <int>             <int>
##  1      0      2                 0
##  2      8      2                 1
##  3      2      2                 0
##  4      4      2                 1
##  5      3      1                 1
##  6      3      2                 0
##  7      3      2                 0
##  8      3      2                 0
##  9      1      2                 0
## 10      3      2                 0
cat("\nhealthcare_demand distribution:\n")
## 
## healthcare_demand distribution:
print(table(df_slim$healthcare_demand, useNA = "always"))
## 
##    0    1 <NA> 
## 2111 1306    5
# Remove source target variables after construction
df_slim <- df_slim %>% select(-c(HUQ051, HUQ071))

3.7.0.1 Target Missing Value and Outlier Handling

cat("healthcare_demand NA count:", sum(is.na(df_slim$healthcare_demand)), "\n")
## healthcare_demand NA count: 5
cat("healthcare_demand unique values:", sort(unique(df_slim$healthcare_demand)), "\n")
## healthcare_demand unique values: 0 1
# Ensure only valid binary values (0/1); treat anything else as NA
df_slim <- df_slim %>%
  mutate(healthcare_demand = if_else(
    healthcare_demand %in% c(0L, 1L), healthcare_demand, NA_integer_
  ))

# Remove rows where target is NA
df_slim <- df_slim %>% filter(!is.na(healthcare_demand))

cat("\nAfter removing NA targets:\n")
## 
## After removing NA targets:
cat("Rows remaining:", nrow(df_slim), "\n")
## Rows remaining: 3417
cat("healthcare_demand distribution:\n")
## healthcare_demand distribution:
print(table(df_slim$healthcare_demand))
## 
##    0    1 
## 2111 1306
cat("Class ratio (High / Low):",
    round(sum(df_slim$healthcare_demand == 1) / sum(df_slim$healthcare_demand == 0), 3), "\n")
## Class ratio (High / Low): 0.619

3.8 Remaining Missing Value Handling

Following feature engineering and target construction, the dataset was reassessed for remaining incomplete observations. Since sparse questionnaire variables had already been consolidated or removed, remaining missingness was expected to be substantially reduced.

remaining_missing <- df_slim %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  tidyr::pivot_longer(everything(), names_to = "variable", values_to = "Missing Count") %>%
  mutate(`Missing Percentage` = round(`Missing Count` / nrow(df_slim) * 100, 2)) %>%
  filter(`Missing Count` > 0) %>%
  arrange(desc(`Missing Count`))

print(remaining_missing)
## # A tibble: 11 × 3
##    variable       `Missing Count` `Missing Percentage`
##    <chr>                    <int>                <dbl>
##  1 INDFMPIR                   537                15.7 
##  2 BPQ080                      41                 1.2 
##  3 MCQ160A                     15                 0.44
##  4 MCQ160P                     14                 0.41
##  5 CDQ010                      14                 0.41
##  6 DMDEDUC2                    11                 0.32
##  7 DMDMARTZ                     7                 0.2 
##  8 CDQ001                       3                 0.09
##  9 smoking_status               3                 0.09
## 10 DIQ010                       2                 0.06
## 11 MCQ220                       1                 0.03

INDFMPIR (income-to-poverty ratio) retains moderate missingness due to socioeconomic non-response. As it is an important predictor for healthcare demand analysis, median imputation is applied to preserve participant records. The remaining variables with minimal missingness are handled using complete-case analysis (row-wise removal).

# Median imputation for INDFMPIR
df_slim <- df_slim %>%
  mutate(INDFMPIR = if_else(
    is.na(INDFMPIR),
    median(INDFMPIR, na.rm = TRUE),
    INDFMPIR
  ))

# Complete-case analysis for remaining variables
rows_before <- nrow(df_slim)
df_slim     <- df_slim %>% drop_na()
rows_after  <- nrow(df_slim)

cat("Rows removed during complete-case analysis:", rows_before - rows_after, "\n")
## Rows removed during complete-case analysis: 97
cat("Final dataset shape:", nrow(df_slim), "rows,", ncol(df_slim), "cols\n")
## Final dataset shape: 3320 rows, 18 cols

3.9 Data Transformation and Encoding

3.9.1 Variable Renaming

To improve dataset readability and simplify subsequent analysis, selected NHANES variable names were renamed into more interpretable labels while preserving their original analytical meaning.

df_slim <- df_slim %>%
  rename(
    gender               = RIAGENDR,
    age                  = RIDAGEYR,
    ethnicity            = RIDRETH3,
    education_level      = DMDEDUC2,
    marital_status       = DMDMARTZ,
    income_poverty_ratio = INDFMPIR,
    arthritis            = MCQ160A,
    pulmonary_disease    = MCQ160P,
    cancer               = MCQ220,
    high_cholesterol     = BPQ080,
    diabetes             = DIQ010,
    chest_pain           = CDQ001,
    shortness_breath     = CDQ010
  )

glimpse(df_slim)
## Rows: 3,320
## Columns: 18
## $ SEQN                 <dbl> 109274, 109282, 109290, 109298, 109313, 109316, 1…
## $ gender               <dbl> 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 1…
## $ age                  <dbl> 68, 76, 68, 68, 63, 62, 76, 69, 63, 66, 80, 80, 6…
## $ ethnicity            <dbl> 7, 3, 4, 3, 3, 4, 6, 3, 3, 2, 3, 2, 6, 3, 1, 1, 4…
## $ education_level      <dbl> 4, 5, 5, 1, 4, 3, 4, 3, 5, 2, 3, 2, 5, 3, 3, 1, 4…
## $ marital_status       <dbl> 3, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2…
## $ income_poverty_ratio <dbl> 1.20, 3.61, 5.00, 1.75, 5.00, 0.07, 2.37, 4.31, 5…
## $ arthritis            <dbl> 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1…
## $ pulmonary_disease    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2…
## $ cancer               <dbl> 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2…
## $ high_cholesterol     <dbl> 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1…
## $ diabetes             <dbl> 1, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 1…
## $ chest_pain           <dbl> 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2…
## $ shortness_breath     <dbl> 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1…
## $ cardio_disease       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1…
## $ hypertension_status  <int> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ smoking_status       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ healthcare_demand    <int> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0…

3.9.2 Binary Recoding

Several questionnaire variables were recoded into binary representations (1 = Yes, 0 = No) to maintain consistency across predictors. The diabetes variable originally contained an additional “Borderline” response category (value = 3); borderline responses were grouped with positive diabetes responses during recoding.

binary_cols <- c(
  "high_cholesterol", "arthritis", "pulmonary_disease",
  "cancer", "chest_pain", "shortness_breath"
)

df_slim <- df_slim %>%
  mutate(across(all_of(binary_cols), ~ case_when(
    . == 1 ~ 1L,
    . == 2 ~ 0L,
    TRUE   ~ NA_integer_
  )))

# Diabetes: 1 = Yes, 2 = No, 3 = Borderline (treated as Yes)
df_slim <- df_slim %>%
  mutate(diabetes = case_when(
    diabetes == 1 ~ 1L,
    diabetes == 3 ~ 1L,   # borderline treated as Yes
    diabetes == 2 ~ 0L,
    TRUE          ~ NA_integer_
  ))

cat("diabetes distribution after recoding:\n")
## diabetes distribution after recoding:
print(table(df_slim$diabetes, useNA = "always"))
## 
##    0    1 <NA> 
## 2286 1034    0

3.9.3 Engineering Chronic Count and Interaction Features

Two additional features were constructed after variable renaming to improve predictive signal for machine learning modelling.

chronic_count sums all binary disease indicators for each participant, representing the total number of chronic conditions reported. From the feature importance analysis, almost all disease variables contribute similarly (0.10–0.14), suggesting the model is implicitly counting conditions. Providing an explicit count feature makes this signal cleaner and more direct.

disease_age_interaction multiplies chronic_count by age, capturing the combined effect of disease burden and ageing. Older participants with more conditions are expected to have disproportionately higher healthcare demand than younger participants with the same number of conditions.

disease_cols <- c(
  "arthritis", "pulmonary_disease", "cancer", "high_cholesterol",
  "diabetes", "chest_pain", "shortness_breath",
  "cardio_disease", "hypertension_status"
)

# chronic_count: total number of chronic conditions reported
df_slim <- df_slim %>%
  mutate(
    chronic_count           = rowSums(select(., all_of(disease_cols)), na.rm = TRUE),
    disease_age_interaction = chronic_count * age
  )

cat("chronic_count distribution:\n")
## chronic_count distribution:
print(table(df_slim$chronic_count))
## 
##   0   1   2   3   4   5   6   7   8   9 
## 223 412 539 620 633 406 291 138  49   9
cat("\ndisease_age_interaction summary:\n")
## 
## disease_age_interaction summary:
print(summary(df_slim$disease_age_interaction))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   130.0   237.0   236.2   320.0   720.0

3.10 Final Modelling Dataset

final_df <- df_slim

cat("Final modeling dataset shape:", nrow(final_df), "rows,", ncol(final_df), "cols\n")
## Final modeling dataset shape: 3320 rows, 20 cols
cat("\nhealthcare_demand distribution:\n")
## 
## healthcare_demand distribution:
print(table(final_df$healthcare_demand))
## 
##    0    1 
## 2064 1256
cat("\nClass ratio (High / Low):",
    round(sum(final_df$healthcare_demand == 1) / sum(final_df$healthcare_demand == 0), 3), "\n")
## 
## Class ratio (High / Low): 0.609
head(final_df)
## # A tibble: 6 × 20
##     SEQN gender   age ethnicity education_level marital_status
##    <dbl>  <dbl> <dbl>     <dbl>           <dbl>          <dbl>
## 1 109274      1    68         7               4              3
## 2 109282      1    76         3               5              1
## 3 109290      2    68         4               5              2
## 4 109298      1    68         3               1              1
## 5 109313      1    63         3               4              1
## 6 109316      2    62         4               3              2
## # ℹ 14 more variables: income_poverty_ratio <dbl>, arthritis <int>,
## #   pulmonary_disease <int>, cancer <int>, high_cholesterol <int>,
## #   diabetes <int>, chest_pain <int>, shortness_breath <int>,
## #   cardio_disease <int>, hypertension_status <int>, smoking_status <int>,
## #   healthcare_demand <int>, chronic_count <dbl>, disease_age_interaction <dbl>

This final_df object is the cleaned classification dataset and is used directly by the exploratory data analysis (Section 4) and the classification analysis (Section 6). No intermediate CSV file is required because all sections run within the same R session.

3.11 Regression Dataset Construction

This section constructs a separate dataset for the regression analysis task. The regression problem asks:

“How severe is depression among older adults, given their demographic characteristics and chronic disease burden?”

The target variable is PHQ9_total (Patient Health Questionnaire-9 total score, range 0–27), derived from the NHANES DPQ dataset. A higher score indicates greater depressive symptom severity.

This dataset is constructed by joining the cleaned PHQ-9 data with final_df. The two datasets serve different research questions within the same project theme.

# Load DPQ dataset (depression questionnaire)
dpq_raw <- read_xpt("DPQ.xpt")

cat("DPQ raw shape:", nrow(dpq_raw), "rows,", ncol(dpq_raw), "cols\n")
## DPQ raw shape: 8965 rows, 11 cols
# PHQ-9 item variables
dpq_items <- paste0("DPQ0", c("10", "20", "30", "40", "50", "60", "70", "80", "90"))

# Replace special codes [7, 9] with NA; remove rows with any missing PHQ-9 item
dpq_clean <- dpq_raw %>%
  mutate(across(all_of(dpq_items), ~ ifelse(. %in% c(7, 9), NA, .))) %>%
  filter(if_all(all_of(dpq_items), ~ !is.na(.))) %>%
  mutate(PHQ9_total = rowSums(across(all_of(dpq_items)))) %>%
  select(SEQN, PHQ9_total)

cat("Rows before cleaning:", nrow(dpq_raw), "\n")
## Rows before cleaning: 8965
cat("Rows after cleaning: ", nrow(dpq_clean), "\n")
## Rows after cleaning:  8276
cat("Rows removed:        ", nrow(dpq_raw) - nrow(dpq_clean), "\n")
## Rows removed:         689
cat("\nPHQ9_total summary:\n")
## 
## PHQ9_total summary:
print(summary(dpq_clean$PHQ9_total))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   2.000   3.302   5.000  27.000
# Merge with final_df using inner_join on SEQN.
# inner_join keeps only participants present in BOTH datasets,
# ensuring every row has a valid PHQ9_total.
final_df_regression <- inner_join(dpq_clean, final_df, by = "SEQN")

cat("final_df rows (classification): ", nrow(final_df), "\n")
## final_df rows (classification):  3320
cat("final_df_regression rows (regression):", nrow(final_df_regression), "\n")
## final_df_regression rows (regression): 2789
cat("Rows difference:", nrow(final_df) - nrow(final_df_regression), "\n")
## Rows difference: 531
cat("\nfinal_df_regression shape:", nrow(final_df_regression), "rows,",
    ncol(final_df_regression), "cols\n")
## 
## final_df_regression shape: 2789 rows, 21 cols
cat("\nPHQ9_total distribution in regression dataset:\n")
## 
## PHQ9_total distribution in regression dataset:
print(summary(final_df_regression$PHQ9_total))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   3.074   4.000  25.000
head(final_df_regression)
## # A tibble: 6 × 21
##     SEQN PHQ9_total gender   age ethnicity education_level marital_status
##    <dbl>      <dbl>  <dbl> <dbl>     <dbl>           <dbl>          <dbl>
## 1 109274          0      1    68         7               4              3
## 2 109282          5      1    76         3               5              1
## 3 109290          2      2    68         4               5              2
## 4 109313          0      1    63         3               4              1
## 5 109316          1      2    62         4               3              2
## 6 109330          0      1    76         6               4              1
## # ℹ 14 more variables: income_poverty_ratio <dbl>, arthritis <int>,
## #   pulmonary_disease <int>, cancer <int>, high_cholesterol <int>,
## #   diabetes <int>, chest_pain <int>, shortness_breath <int>,
## #   cardio_disease <int>, hypertension_status <int>, smoking_status <int>,
## #   healthcare_demand <int>, chronic_count <dbl>, disease_age_interaction <dbl>

This final_df_regression object is used directly by the regression analysis (Section 5).

4 Exploratory Data Analysis

This section presents an exploratory data analysis (EDA) of the final modelling dataset final_df, produced by the data preparation stage from the NHANES 2017–2020 survey. The goal of EDA is to understand the structure of the data, examine the distribution of each variable, and identify potential relationships between the predictors and the target variable healthcare_demand before modelling.

4.1 Setup

# Use the cleaned classification dataset produced in Section 3 directly.
# SEQN is only a respondent identifier, not a predictor -- drop it for analysis.
eda_df <- final_df %>% select(-SEQN)

cat("Final dataset:", nrow(eda_df), "rows,", ncol(eda_df), "cols\n")
## Final dataset: 3320 rows, 19 cols

4.2 Dataset Overview

# Dimensions
cat("Rows:", nrow(eda_df), "\n")
## Rows: 3320
cat("Columns:", ncol(eda_df), "\n\n")
## Columns: 19
# Variable types
glimpse(eda_df)
## Rows: 3,320
## Columns: 19
## $ gender                  <dbl> 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1…
## $ age                     <dbl> 68, 76, 68, 68, 63, 62, 76, 69, 63, 66, 80, 80…
## $ ethnicity               <dbl> 7, 3, 4, 3, 3, 4, 6, 3, 3, 2, 3, 2, 6, 3, 1, 1…
## $ education_level         <dbl> 4, 5, 5, 1, 4, 3, 4, 3, 5, 2, 3, 2, 5, 3, 3, 1…
## $ marital_status          <dbl> 3, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2…
## $ income_poverty_ratio    <dbl> 1.20, 3.61, 5.00, 1.75, 5.00, 0.07, 2.37, 4.31…
## $ arthritis               <int> 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0…
## $ pulmonary_disease       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ cancer                  <int> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0…
## $ high_cholesterol        <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0…
## $ diabetes                <int> 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0…
## $ chest_pain              <int> 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ shortness_breath        <int> 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0…
## $ cardio_disease          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ hypertension_status     <int> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1…
## $ smoking_status          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ healthcare_demand       <int> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0…
## $ chronic_count           <dbl> 4, 5, 3, 6, 3, 6, 2, 2, 2, 3, 5, 4, 2, 6, 3, 1…
## $ disease_age_interaction <dbl> 272, 380, 204, 408, 189, 372, 152, 138, 126, 1…
# Descriptive statistics for all variables
summary(eda_df)
##      gender           age          ethnicity     education_level marital_status
##  Min.   :1.000   Min.   :60.00   Min.   :1.000   Min.   :1.000   Min.   :1.00  
##  1st Qu.:1.000   1st Qu.:64.00   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:1.00  
##  Median :1.000   Median :69.00   Median :3.000   Median :4.000   Median :1.00  
##  Mean   :1.493   Mean   :69.96   Mean   :3.405   Mean   :3.397   Mean   :1.51  
##  3rd Qu.:2.000   3rd Qu.:77.00   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.00  
##  Max.   :2.000   Max.   :80.00   Max.   :7.000   Max.   :5.000   Max.   :3.00  
##  income_poverty_ratio   arthritis      pulmonary_disease     cancer      
##  Min.   :0.000        Min.   :0.0000   Min.   :0.0000    Min.   :0.0000  
##  1st Qu.:1.450        1st Qu.:0.0000   1st Qu.:0.0000    1st Qu.:0.0000  
##  Median :2.200        Median :1.0000   Median :0.0000    Median :0.0000  
##  Mean   :2.560        Mean   :0.5211   Mean   :0.1533    Mean   :0.2148  
##  3rd Qu.:3.683        3rd Qu.:1.0000   3rd Qu.:0.0000    3rd Qu.:0.0000  
##  Max.   :5.000        Max.   :1.0000   Max.   :1.0000    Max.   :1.0000  
##  high_cholesterol    diabetes        chest_pain     shortness_breath
##  Min.   :0.000    Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000    1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.000    Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.556    Mean   :0.3114   Mean   :0.3009   Mean   :0.4193  
##  3rd Qu.:1.000    3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.000    Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  cardio_disease   hypertension_status smoking_status  healthcare_demand
##  Min.   :0.0000   Min.   :0.0000      Min.   :0.000   Min.   :0.0000   
##  1st Qu.:0.0000   1st Qu.:0.0000      1st Qu.:0.000   1st Qu.:0.0000   
##  Median :0.0000   Median :1.0000      Median :0.000   Median :0.0000   
##  Mean   :0.2533   Mean   :0.6123      Mean   :0.131   Mean   :0.3783   
##  3rd Qu.:1.0000   3rd Qu.:1.0000      3rd Qu.:0.000   3rd Qu.:1.0000   
##  Max.   :1.0000   Max.   :1.0000      Max.   :1.000   Max.   :1.0000   
##  chronic_count   disease_age_interaction
##  Min.   :0.000   Min.   :  0.0          
##  1st Qu.:2.000   1st Qu.:130.0          
##  Median :3.000   Median :237.0          
##  Mean   :3.342   Mean   :236.2          
##  3rd Qu.:5.000   3rd Qu.:320.0          
##  Max.   :9.000   Max.   :720.0

The final dataset contains 3320 respondents and 19 analysis variables (18 predictors + 1 target). The target variable is healthcare_demand. Most predictors are binary disease indicators coded as 1 = Yes and 0 = No. The continuous variables are age, income_poverty_ratio, chronic_count, and disease_age_interaction; the remaining demographic variables (gender, ethnicity, education_level, marital_status) are categorical.

4.3 Target Variable

The target variable healthcare_demand is a composite indicator of high healthcare demand. A respondent is classified into the higher demand group (healthcare_demand = 1) if either of the following holds: they had 4 or more healthcare visits in the past year, or they were hospitalised overnight at least once in the past year. Respondents meeting neither condition form the lower demand group (healthcare_demand = 0).

target_df <- eda_df %>%
  count(healthcare_demand) %>%
  mutate(
    label = ifelse(healthcare_demand == 0,
                   "Lower demand\n(0)", "Higher demand\n(1)"),
    pct   = round(n / sum(n) * 100, 1),
    pct_label = paste0(n, "\n(", pct, "%)")
  )

ggplot(target_df, aes(x = label, y = n, fill = label)) +
  geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = pct_label), vjust = -0.3, size = 4.5) +
  scale_fill_manual(values = c("Lower demand\n(0)" = "steelblue",
                               "Higher demand\n(1)" = "tomato")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Figure 4.1: Target Variable Distribution",
    x = "Healthcare Demand Group",
    y = "Number of Respondents"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

# Class balance check
print(table(eda_df$healthcare_demand))
## 
##    0    1 
## 2064 1256
cat("Class ratio (High / Low):",
    round(sum(eda_df$healthcare_demand == 1) /
          sum(eda_df$healthcare_demand == 0), 3), "\n")
## Class ratio (High / Low): 0.609

The lower demand group accounts for 62.2% of the sample (n = 2064) and the higher demand group for 37.8% (n = 1256). The class imbalance is mild, so standard classification models should train without aggressive resampling. Class weighting and threshold-aware metrics (F1, AUC-ROC) will still be reported alongside accuracy during modelling.

4.4 Demographic Analysis

4.4.1 Age Distribution

ggplot(eda_df, aes(x = age)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 2, fill = "steelblue", colour = "white", alpha = 0.8) +
  geom_density(colour = "navy", linewidth = 1) +
  geom_vline(xintercept = mean(eda_df$age),
             linetype = "dashed", colour = "tomato", linewidth = 0.8) +
  annotate("text",
           x = mean(eda_df$age) + 2,
           y = 0.05,
           label = paste0("Mean = ", round(mean(eda_df$age), 1)),
           colour = "tomato", size = 3.8) +
  labs(
    title = "Figure 4.2: Age Distribution of Respondents",
    x = "Age (Years)",
    y = "Density"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

eda_df %>%
  mutate(Target = ifelse(healthcare_demand == 0,
                         "Lower demand (0)", "Higher demand (1)")) %>%
  ggplot(aes(x = age, fill = Target)) +
  geom_histogram(binwidth = 2, position = "identity", alpha = 0.6, colour = "white") +
  scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
                               "Higher demand (1)" = "tomato")) +
  labs(
    title = "Figure 4.3: Age Distribution by Healthcare Demand Group",
    x = "Age (Years)",
    y = "Count",
    fill = "Group"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

The age of respondents ranges from 60 to 80 years, with a mean of approximately 70 years. NHANES top-codes age at 80, which produces a spike at the upper bound. The higher demand group skews slightly older, consistent with the expectation that healthcare need rises with age.

4.4.2 Gender Distribution

gender_df <- eda_df %>%
  mutate(Gender = ifelse(gender == 1, "Male", "Female"),
         Target = ifelse(healthcare_demand == 0,
                         "Lower demand (0)", "Higher demand (1)")) %>%
  count(Gender, Target) %>%
  group_by(Gender) %>%
  mutate(pct = round(n / sum(n) * 100, 1))

ggplot(gender_df, aes(x = Gender, y = n, fill = Target)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.6) +
  geom_text(aes(label = paste0(pct, "%")),
            position = position_dodge(width = 0.6), vjust = -0.4, size = 3.8) +
  scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
                               "Higher demand (1)" = "tomato")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Figure 4.4: Gender Distribution by Healthcare Demand Group",
    x = "Gender",
    y = "Count",
    fill = "Group"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

The sample is nearly evenly split between males (n = 1684) and females (n = 1636). The proportion in higher demand is similar across genders, suggesting gender alone is not a strong distinguishing factor.

4.4.3 Race / Ethnicity Distribution

# NOTE: this dataset uses RIDRETH3 coding (renamed to `ethnicity`).
# Codes present are 1,2,3,4,6,7 -- there is NO code 5 in RIDRETH3.
# Code 6 = Non-Hispanic Asian, 7 = Other / Multiracial.
race_df <- eda_df %>%
  mutate(Race = case_when(
    ethnicity == 1 ~ "Mexican\nAmerican",
    ethnicity == 2 ~ "Other\nHispanic",
    ethnicity == 3 ~ "Non-Hispanic\nWhite",
    ethnicity == 4 ~ "Non-Hispanic\nBlack",
    ethnicity == 6 ~ "Non-Hispanic\nAsian",
    ethnicity == 7 ~ "Other /\nMultiracial"
  ),
  Target = ifelse(healthcare_demand == 0,
                  "Lower demand (0)", "Higher demand (1)")) %>%
  count(Race, Target) %>%
  group_by(Race) %>%
  mutate(pct = round(n / sum(n) * 100, 1))

ggplot(race_df, aes(x = Race, y = n, fill = Target)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.65) +
  geom_text(aes(label = paste0(pct, "%")),
            position = position_dodge(width = 0.65), vjust = -0.4, size = 3.0) +
  scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
                               "Higher demand (1)" = "tomato")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Figure 4.5: Race/Ethnicity Distribution by Healthcare Demand Group",
    x = "Race / Ethnicity",
    y = "Count",
    fill = "Group"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

Non-Hispanic White respondents form the largest group (n = 1452), followed by Non-Hispanic Black (n = 878). Differences in the proportion of higher demand across racial groups may reflect healthcare access disparities and will be revisited during modelling.

4.4.4 Education Level

edu_df <- eda_df %>%
  mutate(Education = case_when(
    education_level == 1 ~ "< 9th\nGrade",
    education_level == 2 ~ "9-11th\nGrade",
    education_level == 3 ~ "High School\n/ GED",
    education_level == 4 ~ "Some\nCollege",
    education_level == 5 ~ "College\nGraduate+"
  ),
  Target = ifelse(healthcare_demand == 0,
                  "Lower demand (0)", "Higher demand (1)")) %>%
  count(Education, Target) %>%
  group_by(Education) %>%
  mutate(pct = round(n / sum(n) * 100, 1))

edu_df$Education <- factor(edu_df$Education,
  levels = c("< 9th\nGrade","9-11th\nGrade","High School\n/ GED",
             "Some\nCollege","College\nGraduate+"))

ggplot(edu_df, aes(x = Education, y = n, fill = Target)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.65) +
  geom_text(aes(label = paste0(pct, "%")),
            position = position_dodge(width = 0.65), vjust = -0.4, size = 3.0) +
  scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
                               "Higher demand (1)" = "tomato")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Figure 4.6: Education Level by Healthcare Demand Group",
    x = "Education Level",
    y = "Count",
    fill = "Group"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

Education spreads across five categories, with “Some College” and “College Graduate+” the most common. Education is a useful socioeconomic proxy and is examined here for any gradient in healthcare demand.

4.4.5 Marital Status

# In this dataset marital_status takes values 1, 2, 3 only.
marital_df <- eda_df %>%
  mutate(Marital = case_when(
    marital_status == 1 ~ "Married /\nWith Partner",
    marital_status == 2 ~ "Widowed / Divorced\n/ Separated",
    marital_status == 3 ~ "Never\nMarried"
  ),
  Target = ifelse(healthcare_demand == 0,
                  "Lower demand (0)", "Higher demand (1)")) %>%
  count(Marital, Target) %>%
  group_by(Marital) %>%
  mutate(pct = round(n / sum(n) * 100, 1))

ggplot(marital_df, aes(x = Marital, y = n, fill = Target)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.6) +
  geom_text(aes(label = paste0(pct, "%")),
            position = position_dodge(width = 0.6), vjust = -0.4, size = 3.5) +
  scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
                               "Higher demand (1)" = "tomato")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Figure 4.7: Marital Status by Healthcare Demand Group",
    x = "Marital Status",
    y = "Count",
    fill = "Group"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

The majority of respondents are married or living with a partner (n = 1829). Social support from a partner may influence health-seeking behaviour and healthcare utilisation.

4.4.6 Income-to-Poverty Ratio

ggplot(eda_df, aes(x = income_poverty_ratio)) +
  geom_histogram(binwidth = 0.5, fill = "steelblue", colour = "white", alpha = 0.85) +
  geom_vline(xintercept = median(eda_df$income_poverty_ratio),
             linetype = "dashed", colour = "tomato", linewidth = 0.8) +
  labs(
    title = "Figure 4.8: Income-to-Poverty Ratio Distribution",
    x = "Income-to-Poverty Ratio (0 = poorest, 5 = top-coded)",
    y = "Count"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

The income-to-poverty ratio has a median of 2.2. NHANES top-codes this ratio at 5, producing a cluster at the upper bound. Missing income values were median-imputed during data preparation, which slightly concentrates mass near the centre of the distribution.

eda_df %>%
  mutate(Target = ifelse(healthcare_demand == 0,
                         "Lower demand (0)", "Higher demand (1)")) %>%
  ggplot(aes(x = Target, y = income_poverty_ratio, fill = Target)) +
  geom_boxplot(width = 0.45, outlier.alpha = 0.3, show.legend = FALSE) +
  scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
                               "Higher demand (1)" = "tomato")) +
  labs(
    title = "Figure 4.8b: Income-to-Poverty Ratio by Healthcare Demand Group",
    x     = "Group",
    y     = "Income-to-Poverty Ratio"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

eda_df %>%
  group_by(healthcare_demand) %>%
  summarise(
    n      = n(),
    mean   = round(mean(income_poverty_ratio), 2),
    median = round(median(income_poverty_ratio), 2),
    sd     = round(sd(income_poverty_ratio), 2)
  )
## # A tibble: 2 × 5
##   healthcare_demand     n  mean median    sd
##               <int> <int> <dbl>  <dbl> <dbl>
## 1                 0  2064  2.59    2.2  1.44
## 2                 1  1256  2.51    2.2  1.46

The higher demand group tends to have a slightly lower income-to-poverty ratio compared to the lower demand group, suggesting that socioeconomic disadvantage may be associated with greater healthcare need. However, the difference between groups is modest and the distributions overlap substantially, so income alone is unlikely to be a dominant predictor.

4.5 Health Condition Prevalence

condition_labels <- c(
  "hypertension_status" = "Hypertension",
  "high_cholesterol"    = "High Cholesterol",
  "arthritis"           = "Arthritis",
  "shortness_breath"    = "Shortness of Breath",
  "diabetes"            = "Diabetes",
  "chest_pain"          = "Chest Pain",
  "cardio_disease"      = "Cardiovascular Disease",
  "cancer"              = "Cancer",
  "pulmonary_disease"   = "Pulmonary Disease (COPD)",
  "smoking_status"      = "Current Smoker"
)

prev_df <- eda_df %>%
  select(all_of(names(condition_labels))) %>%
  summarise(across(everything(), ~ mean(. == 1, na.rm = TRUE) * 100)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "prevalence") %>%
  mutate(label = condition_labels[variable]) %>%
  arrange(desc(prevalence))

ggplot(prev_df, aes(x = reorder(label, prevalence), y = prevalence)) +
  geom_col(fill = "steelblue", width = 0.65) +
  geom_text(aes(label = paste0(round(prevalence, 1), "%")),
            hjust = -0.1, size = 3.8) +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
  labs(
    title = "Figure 4.9: Prevalence of Health Conditions (% of Sample)",
    x = NULL,
    y = "Prevalence (%)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

Hypertension is the most prevalent condition (61.2%), followed by high cholesterol (55.6%) and arthritis (52.1%). These figures reflect the well-documented chronic disease burden among older adults. Current smoking prevalence is the lowest at 13.1%.

4.6 Engineered Count Features

Two engineered features were added during data preparation: chronic_count (total number of chronic conditions per respondent, 0–9) and disease_age_interaction (chronic_count × age). These are continuous, so histograms and boxplots are used.

4.6.1 Chronic Condition Count

ggplot(eda_df, aes(x = chronic_count)) +
  geom_histogram(binwidth = 1, fill = "steelblue", colour = "white") +
  scale_x_continuous(breaks = 0:9) +
  labs(
    title = "Figure 4.10: Distribution of Chronic Condition Count",
    x = "Number of Chronic Conditions (0-9)",
    y = "Count"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

eda_df %>%
  mutate(Target = ifelse(healthcare_demand == 0,
                         "Lower demand (0)", "Higher demand (1)")) %>%
  ggplot(aes(x = Target, y = chronic_count, fill = Target)) +
  geom_boxplot(width = 0.45, outlier.alpha = 0.3, show.legend = FALSE) +
  scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
                               "Higher demand (1)" = "tomato")) +
  labs(
    title = "Figure 4.11: Chronic Count by Healthcare Demand Group",
    x = "Group",
    y = "Chronic Condition Count"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

eda_df %>%
  group_by(healthcare_demand) %>%
  summarise(
    n      = n(),
    mean   = round(mean(chronic_count), 2),
    median = median(chronic_count),
    sd     = round(sd(chronic_count), 2)
  )
## # A tibble: 2 × 5
##   healthcare_demand     n  mean median    sd
##               <int> <int> <dbl>  <dbl> <dbl>
## 1                 0  2064  2.81      3  1.8 
## 2                 1  1256  4.22      4  1.83

Respondents in the higher demand group carry a clearly larger disease burden: their mean chronic count is 4.22 versus 2.81 in the lower demand group. This is the strongest single distinguishing signal observed so far and aligns with the rationale for engineering this feature.

4.6.2 Disease–Age Interaction

ggplot(eda_df, aes(x = disease_age_interaction)) +
  geom_histogram(binwidth = 40, fill = "steelblue", colour = "white") +
  labs(
    title = "Figure 4.12: Distribution of Disease-Age Interaction",
    x = "chronic_count x age",
    y = "Count"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

The interaction term spans 0 to roughly 700 and is right-skewed. Because it combines disease burden and ageing, it is expected to be one of the more informative predictors; this is confirmed in the correlation analysis below.

4.7 Bivariate Analysis: Health Conditions vs Target

This section examines whether each health condition is associated with a higher rate of healthcare demand.

biv_df <- eda_df %>%
  select(healthcare_demand, all_of(names(condition_labels))) %>%
  pivot_longer(-healthcare_demand, names_to = "variable", values_to = "status") %>%
  filter(!is.na(status)) %>%
  mutate(
    label  = condition_labels[variable],
    status = ifelse(status == 1, "Yes", "No")
  ) %>%
  group_by(label, status) %>%
  summarise(
    n    = n(),
    rate = mean(healthcare_demand == 1) * 100,
    .groups = "drop"
  )

ggplot(biv_df, aes(x = reorder(label, rate), y = rate, fill = status)) +
  geom_col(position = "dodge", width = 0.65) +
  geom_text(aes(label = paste0(round(rate, 1), "%\n(n=", n, ")")),
            position = position_dodge(width = 0.65),
            hjust = -0.05, size = 2.8) +
  coord_flip() +
  scale_fill_manual(values = c("Yes" = "tomato", "No" = "steelblue")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
  labs(
    title = "Figure 4.13: Higher Demand Rate by Condition (Yes vs No)",
    x     = NULL,
    y     = "Higher Demand Rate (%)",
    fill  = "Has Condition"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13),
        legend.position = "bottom")

For almost every condition, respondents who report having it show a higher rate of healthcare demand than those who do not. The gap is largest for cardiovascular disease, pulmonary disease (COPD), shortness of breath, and chest pain, suggesting these are strong predictors of healthcare utilisation. Smoking status is the exception: current smokers show a slightly lower higher-demand rate than non-smokers. This counter-intuitive pattern is likely a survivorship / age-confounding effect (heavier smokers may be under-represented at age 60+) and should be interpreted cautiously rather than as a protective effect.

4.7.1 Age by Target Group

eda_df %>%
  mutate(Target = ifelse(healthcare_demand == 0,
                         "Lower demand (0)", "Higher demand (1)")) %>%
  ggplot(aes(x = Target, y = age, fill = Target)) +
  geom_boxplot(width = 0.45, outlier.alpha = 0.3, show.legend = FALSE) +
  scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
                               "Higher demand (1)" = "tomato")) +
  labs(
    title = "Figure 4.14: Age Distribution by Healthcare Demand Group",
    x     = "Group",
    y     = "Age (Years)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

eda_df %>%
  group_by(healthcare_demand) %>%
  summarise(
    n      = n(),
    mean   = round(mean(age), 1),
    median = median(age),
    sd     = round(sd(age), 1)
  )
## # A tibble: 2 × 5
##   healthcare_demand     n  mean median    sd
##               <int> <int> <dbl>  <dbl> <dbl>
## 1                 0  2064  69.2     68   6.9
## 2                 1  1256  71.2     71   7

The higher demand group is modestly older (median 71 vs 68 years), consistent with rising healthcare need at older ages, though the overlap is large.

4.8 Correlation Analysis

# Select all numeric variables (exclude categorical demographics that are
# nominal codes: gender, ethnicity, education_level, marital_status)
cor_data <- eda_df %>%
  select(age, income_poverty_ratio,
         arthritis, pulmonary_disease, cancer, high_cholesterol,
         diabetes, chest_pain, shortness_breath,
         cardio_disease, hypertension_status, smoking_status,
         chronic_count, disease_age_interaction, healthcare_demand)

cor_matrix <- cor(cor_data, use = "pairwise.complete.obs")

colnames(cor_matrix) <- rownames(cor_matrix) <- c(
  "Age", "Income/Poverty",
  "Arthritis", "COPD", "Cancer", "High Cholesterol",
  "Diabetes", "Chest Pain", "Shortness\nof Breath",
  "Cardiovascular", "Hypertension", "Smoking",
  "Chronic Count", "Disease x Age", "Target"
)

corrplot(
  cor_matrix,
  method  = "color",
  type    = "upper",
  addCoef.col = "black",
  number.cex  = 0.6,
  tl.cex  = 0.8,
  tl.col  = "black",
  col     = colorRampPalette(c("steelblue", "white", "tomato"))(200),
  title   = "Figure 4.15: Correlation Matrix of Numeric Variables",
  mar     = c(0, 0, 2, 0)
)

# Correlation of each predictor with the target, sorted
cor_matrix[, "Target"][order(-abs(cor_matrix[, "Target"]))]
##               Target        Disease x Age        Chronic Count 
##           1.00000000           0.35854365           0.35260818 
##       Cardiovascular Shortness\nof Breath            Arthritis 
##           0.23540244           0.21446388           0.19459483 
##           Chest Pain                 COPD         Hypertension 
##           0.19237799           0.17314085           0.16175398 
##               Cancer                  Age             Diabetes 
##           0.14256496           0.13650389           0.11912992 
##     High Cholesterol       Income/Poverty              Smoking 
##           0.06829434          -0.02811316          -0.01576748

The two engineered features show the strongest correlations with the target: disease_age_interaction (≈ 0.36) and chronic_count (≈ 0.35), confirming that overall disease burden — especially combined with age — is the dominant signal. Among individual conditions, cardiovascular disease, shortness of breath, and arthritis correlate most with the target. chronic_count and disease_age_interaction are strongly correlated with each other by construction, so the modelling team should be aware of potential redundancy (multicollinearity) and may keep only one, or use models that tolerate correlated inputs. Income-to-poverty ratio and smoking show essentially no linear correlation with the target.

4.9 PHQ-9 Depression Score Distribution

This section explores the target variable for the regression task, PHQ9_total, using the final_df_regression dataset. The goal is to understand the distribution of depression severity among older adults before modelling.

ggplot(final_df_regression, aes(x = PHQ9_total)) +
  geom_histogram(binwidth = 1, fill = "steelblue", colour = "white", alpha = 0.85) +
  geom_vline(xintercept = median(final_df_regression$PHQ9_total),
             linetype = "dashed", colour = "tomato", linewidth = 0.8) +
  annotate("text",
           x = median(final_df_regression$PHQ9_total) + 1.5,
           y = Inf, vjust = 1.5,
           label = paste0("Median = ", median(final_df_regression$PHQ9_total)),
           colour = "tomato", size = 3.8) +
  labs(
    title = "Figure 4.16: Distribution of PHQ-9 Total Score",
    x     = "PHQ-9 Total Score (0 = no symptoms, 27 = severe)",
    y     = "Count"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

cat("PHQ9_total summary:\n")
## PHQ9_total summary:
print(summary(final_df_regression$PHQ9_total))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   3.074   4.000  25.000
cat("\nProportion with score >= 10 (moderate-severe depression):",
    round(mean(final_df_regression$PHQ9_total >= 10) * 100, 1), "%\n")
## 
## Proportion with score >= 10 (moderate-severe depression): 8.7 %

The PHQ-9 score distribution is strongly right-skewed, with the majority of older adults scoring low. This is expected in a general elderly population — most respondents do not report clinically significant depressive symptoms. The median score is 1, and only a small proportion score 10 or above, which is commonly used as a threshold for moderate-to-severe depression. This skewness motivates the use of count-based regression models (negative binomial, Poisson) rather than ordinary linear regression.

4.9.1 PHQ-9 Score by Chronic Condition Count

final_df_regression %>%
  mutate(chronic_group = factor(
    case_when(
      chronic_count == 0 ~ "0",
      chronic_count == 1 ~ "1",
      chronic_count == 2 ~ "2",
      chronic_count >= 3 ~ "3+"
    ),
    levels = c("0", "1", "2", "3+")
  )) %>%
  ggplot(aes(x = chronic_group, y = PHQ9_total, fill = chronic_group)) +
  geom_boxplot(width = 0.5, outlier.alpha = 0.3, show.legend = FALSE) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Figure 4.17: PHQ-9 Score by Chronic Condition Count",
    x     = "Number of Chronic Conditions",
    y     = "PHQ-9 Total Score"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

final_df_regression %>%
  mutate(chronic_group = case_when(
    chronic_count == 0 ~ "0",
    chronic_count == 1 ~ "1",
    chronic_count == 2 ~ "2",
    chronic_count >= 3 ~ "3+"
  )) %>%
  group_by(chronic_group) %>%
  summarise(
    n      = n(),
    mean   = round(mean(PHQ9_total), 2),
    median = median(PHQ9_total),
    sd     = round(sd(PHQ9_total), 2)
  )
## # A tibble: 4 × 5
##   chronic_group     n  mean median    sd
##   <chr>         <int> <dbl>  <dbl> <dbl>
## 1 0               184  1.2       0  2.53
## 2 1               340  1.34      0  2.14
## 3 2               458  2.12      1  3.07
## 4 3+             1807  3.83      2  4.55

PHQ-9 scores increase consistently with chronic condition count. Respondents with 3 or more conditions show notably higher median and mean depression scores compared to those with no conditions. This pattern supports the hypothesis that disease burden is associated with depression severity among older adults, and confirms that chronic_count and individual disease indicators are likely to be useful predictors in the regression model.

4.9.2 PHQ-9 Score by Age

ggplot(final_df_regression, aes(x = age, y = PHQ9_total)) +
  geom_point(alpha = 0.15, colour = "steelblue", size = 0.8) +
  geom_smooth(method = "loess", colour = "tomato", se = TRUE) +
  labs(
    title = "Figure 4.18: PHQ-9 Score vs Age",
    x     = "Age (Years)",
    y     = "PHQ-9 Total Score"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

The scatter plot with a LOESS smoother shows the relationship between age and PHQ-9 score. Any systematic trend with age would suggest that older respondents within this 60+ cohort differ in their depressive symptom burden. Note that NHANES top-codes age at 80, so values above 80 are compressed at that boundary.

4.9.3 PHQ-9 Score by Key Disease Conditions

phq9_condition_labels <- c(
  "hypertension_status" = "Hypertension",
  "diabetes"            = "Diabetes",
  "cardio_disease"      = "Cardiovascular",
  "arthritis"           = "Arthritis",
  "pulmonary_disease"   = "COPD",
  "cancer"              = "Cancer"
)

final_df_regression %>%
  select(PHQ9_total, all_of(names(phq9_condition_labels))) %>%
  pivot_longer(-PHQ9_total, names_to = "variable", values_to = "status") %>%
  filter(!is.na(status)) %>%
  mutate(
    label  = phq9_condition_labels[variable],
    status = ifelse(status == 1, "Yes", "No")
  ) %>%
  ggplot(aes(x = status, y = PHQ9_total, fill = status)) +
  geom_boxplot(width = 0.5, outlier.alpha = 0.2, show.legend = FALSE) +
  scale_fill_manual(values = c("Yes" = "tomato", "No" = "steelblue")) +
  facet_wrap(~ label, nrow = 2) +
  labs(
    title = "Figure 4.19: PHQ-9 Score by Key Disease Conditions",
    x     = "Has Condition",
    y     = "PHQ-9 Total Score"
  ) +
  theme_minimal() +
  theme(
    plot.title  = element_text(face = "bold", size = 13),
    strip.text  = element_text(face = "bold")
  )

Across all six conditions, respondents who reported having the condition tend to show higher PHQ-9 scores. The difference is most visible for COPD and cardiovascular disease, where the median score for those with the condition is clearly elevated compared to those without it. These patterns are consistent with the known co-occurrence of chronic illness and depressive symptoms in older adults.

4.10 EDA Summary

The exploratory analysis of the final NHANES 2017–2020 dataset reveals several key findings relevant to the modelling stage:

  1. Class distribution: The composite target is mildly imbalanced (62.2% lower vs 37.8% higher demand), so resampling is likely unnecessary, but threshold-aware metrics should still be reported.
  2. Demographic profile: The sample is aged 60–80 (mean ≈ 70), with a near-equal gender split. Non-Hispanic White respondents form the largest racial group.
  3. Disease burden: Hypertension, high cholesterol, and arthritis are the most prevalent conditions, reflecting the chronic disease landscape of older adults.
  4. Engineered features dominate: chronic_count and disease_age_interaction show the strongest association with healthcare demand, validating their construction. They are mutually correlated, which the modelling team should account for.
  5. Bivariate patterns: Respondents with cardiovascular disease, COPD, and chest symptoms consistently show higher demand rates. Smoking is a notable exception and warrants cautious interpretation.
  6. Correlation: Aside from the two engineered counts, predictor-to-predictor correlations are generally low to moderate, limiting multicollinearity concerns for most variables.

These findings provide a solid foundation for the regression and classification analyses in the following sections.

5 Regression Analysis: PHQ-9 Depression Severity

This section focuses on the regression task. The goal is to predict PHQ-9 total score (depression symptom severity, range 0–27) among older adults using demographic characteristics and chronic disease burden. The modelling dataset final_df_regression was produced in Section 3.11 by joining the cleaned PHQ-9 data (from the DPQ file) with the prepared elderly health dataset, so it is used here directly without re-reading any file.

5.1 Data Preparation for Regression

The engineered redundant features disease_age_interaction and chronic_count are dropped here so the regression models work from the individual disease indicators, and SEQN is removed as it is only an identifier. Questionnaire-coded variables are converted into factors.

df_model <- final_df_regression %>%
  dplyr::select(-SEQN, -disease_age_interaction, -chronic_count) %>%
  mutate(
    gender          = factor(gender),
    ethnicity       = factor(ethnicity),
    marital_status  = factor(marital_status),
    education_level = factor(education_level, ordered = TRUE),
    across(c(arthritis, pulmonary_disease, cancer, high_cholesterol,
             diabetes, chest_pain, shortness_breath, cardio_disease,
             hypertension_status, smoking_status, healthcare_demand), factor)
  )

glimpse(df_model)
## Rows: 2,789
## Columns: 18
## $ PHQ9_total           <dbl> 0, 5, 2, 0, 1, 0, 1, 0, 2, 2, 2, 3, 0, 0, 1, 1, 0…
## $ gender               <fct> 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 1, 1…
## $ age                  <dbl> 68, 76, 68, 63, 62, 76, 63, 62, 66, 70, 76, 62, 6…
## $ ethnicity            <fct> 7, 3, 4, 3, 4, 6, 3, 6, 3, 1, 4, 1, 4, 3, 6, 3, 6…
## $ education_level      <ord> 4, 5, 5, 4, 3, 4, 5, 5, 3, 3, 4, 3, 5, 1, 5, 3, 5…
## $ marital_status       <fct> 3, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1…
## $ income_poverty_ratio <dbl> 1.20, 3.61, 5.00, 5.00, 0.07, 2.37, 5.00, 4.56, 1…
## $ arthritis            <fct> 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0…
## $ pulmonary_disease    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0…
## $ cancer               <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ high_cholesterol     <fct> 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0…
## $ diabetes             <fct> 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ chest_pain           <fct> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ shortness_breath     <fct> 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1…
## $ cardio_disease       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1…
## $ hypertension_status  <fct> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1…
## $ smoking_status       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ healthcare_demand    <fct> 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1…

5.2 Cross-Validation Model Comparison

Four modelling approaches are compared using 5-fold cross-validation: a null baseline (predicting the training mean), negative binomial regression, elastic net Poisson regression, and random forest regression. PHQ-9 total score is a count-like outcome with a skewed distribution and many low values, which motivates count-based and flexible models. RMSE and MAE are computed on each held-out fold.

set.seed(42)
folds <- createFolds(df_model$PHQ9_total, k = 5, returnTrain = FALSE)

rmse_fn <- function(a, p) sqrt(mean((a - p)^2))
mae_fn  <- function(a, p) mean(abs(a - p))

X_all <- model.matrix(PHQ9_total ~ ., data = df_model)[, -1]
y_all <- df_model$PHQ9_total

alpha_grid <- c(0.25, 0.5, 0.75, 1)
mtry_grid  <- c(3, 5, 7, 10)

cv_results <- purrr::map_dfr(seq_along(folds), function(i) {
  test_idx  <- folds[[i]]
  train_idx <- setdiff(seq_len(nrow(df_model)), test_idx)

  # Null model: predict the training mean for all test samples
  pred_null <- rep(mean(y_all[train_idx]), length(test_idx))

  # Negative binomial regression
  fit_nb  <- glm.nb(PHQ9_total ~ ., data = df_model[train_idx, ])
  pred_nb <- predict(fit_nb, newdata = df_model[test_idx, ], type = "response")

  # Elastic net: select best alpha within the training fold, then best lambda
  best_alpha <- alpha_grid[which.min(sapply(alpha_grid, function(a) {
    min(cv.glmnet(X_all[train_idx, ], y_all[train_idx],
                  family = "poisson", alpha = a)$cvm)
  }))]
  cv_en   <- cv.glmnet(X_all[train_idx, ], y_all[train_idx],
                       family = "poisson", alpha = best_alpha)
  pred_en <- as.vector(predict(cv_en, newx = X_all[test_idx, ],
                               s = "lambda.min", type = "response"))

  # Random forest: select best mtry within the training fold via OOB error
  best_mtry <- mtry_grid[which.min(sapply(mtry_grid, function(m) {
    ranger(PHQ9_total ~ ., data = df_model[train_idx, ],
           num.trees = 500, mtry = m, seed = 42)$prediction.error
  }))]
  fit_rf  <- ranger(PHQ9_total ~ ., data = df_model[train_idx, ],
                    num.trees = 500, mtry = best_mtry, seed = 42)
  pred_rf <- predict(fit_rf, data = df_model[test_idx, ])$predictions

  tibble(
    fold       = i,
    NULL_RMSE  = rmse_fn(y_all[test_idx], pred_null),
    NULL_MAE   = mae_fn(y_all[test_idx], pred_null),
    NB_RMSE    = rmse_fn(y_all[test_idx], pred_nb),
    NB_MAE     = mae_fn(y_all[test_idx], pred_nb),
    EN_RMSE    = rmse_fn(y_all[test_idx], pred_en),
    EN_MAE     = mae_fn(y_all[test_idx], pred_en),
    RF_RMSE    = rmse_fn(y_all[test_idx], pred_rf),
    RF_MAE     = mae_fn(y_all[test_idx], pred_rf)
  )
})

5.3 Negative Binomial Regression (Full Data)

fit_nb_full <- glm.nb(PHQ9_total ~ ., data = df_model)
summary(fit_nb_full)
## 
## Call:
## glm.nb(formula = PHQ9_total ~ ., data = df_model, init.theta = 0.7540308545, 
##     link = log)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           0.9955953  0.2959478   3.364 0.000768 ***
## gender2               0.2742397  0.0541078   5.068 4.01e-07 ***
## age                  -0.0106143  0.0041765  -2.541 0.011039 *  
## ethnicity2            0.1620775  0.1184537   1.368 0.171225    
## ethnicity3           -0.0009298  0.1082961  -0.009 0.993150    
## ethnicity4           -0.1242995  0.1104785  -1.125 0.260546    
## ethnicity6           -0.1346481  0.1349115  -0.998 0.318257    
## ethnicity7            0.0283345  0.1625961   0.174 0.861659    
## education_level.L    -0.2452640  0.0775897  -3.161 0.001572 ** 
## education_level.Q    -0.0831764  0.0664181  -1.252 0.210455    
## education_level.C     0.0138327  0.0627056   0.221 0.825405    
## education_level^4    -0.0598886  0.0552068  -1.085 0.278007    
## marital_status2       0.1514006  0.0571055   2.651 0.008020 ** 
## marital_status3       0.1811908  0.1105090   1.640 0.101088    
## income_poverty_ratio -0.0807242  0.0199833  -4.040 5.35e-05 ***
## arthritis1            0.1877101  0.0537885   3.490 0.000483 ***
## pulmonary_disease1    0.0026762  0.0727222   0.037 0.970644    
## cancer1              -0.0010483  0.0643676  -0.016 0.987006    
## high_cholesterol1     0.1409676  0.0531454   2.652 0.007990 ** 
## diabetes1             0.1557901  0.0564536   2.760 0.005787 ** 
## chest_pain1           0.4159266  0.0580882   7.160 8.05e-13 ***
## shortness_breath1     0.4123586  0.0555022   7.430 1.09e-13 ***
## cardio_disease1       0.0934820  0.0639072   1.463 0.143528    
## hypertension_status1  0.1076657  0.0562664   1.913 0.055684 .  
## smoking_status1       0.2674360  0.0769541   3.475 0.000510 ***
## healthcare_demand1    0.2926419  0.0554602   5.277 1.32e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.754) family taken to be 1)
## 
##     Null deviance: 3583.9  on 2788  degrees of freedom
## Residual deviance: 3006.0  on 2763  degrees of freedom
## AIC: 11957
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.7540 
##           Std. Err.:  0.0313 
## 
##  2 x log-likelihood:  -11903.0290
cat("Dispersion parameter theta:", fit_nb_full$theta, "\n")
## Dispersion parameter theta: 0.7540309

5.4 Elastic Net Poisson Regression (Full Data)

best_alpha_full <- alpha_grid[which.min(sapply(alpha_grid, function(a) {
  min(cv.glmnet(X_all, y_all, family = "poisson", alpha = a)$cvm)
}))]
cat("Best alpha on full data:", best_alpha_full, "\n")
## Best alpha on full data: 1
cv_en_full <- cv.glmnet(X_all, y_all, family = "poisson", alpha = best_alpha_full)

coef(cv_en_full, s = "lambda.min") %>%
  as.matrix() %>%
  as.data.frame() %>%
  setNames("coef") %>%
  rownames_to_column("variable") %>%
  filter(coef != 0) %>%
  arrange(desc(abs(coef))) %>%
  print()
##                variable        coef
## 1           (Intercept)  1.14525511
## 2           chest_pain1  0.40261385
## 3     shortness_breath1  0.37404534
## 4    healthcare_demand1  0.26078698
## 5     education_level.L -0.25524120
## 6               gender2  0.24325033
## 7       smoking_status1  0.23604888
## 8            arthritis1  0.20520798
## 9       marital_status3  0.15354308
## 10           ethnicity2  0.13685430
## 11      marital_status2  0.13620932
## 12 hypertension_status1  0.10154409
## 13      cardio_disease1  0.09784695
## 14    high_cholesterol1  0.09089700
## 15           ethnicity4 -0.08875847
## 16            diabetes1  0.07914756
## 17    education_level.Q -0.06549086
## 18 income_poverty_ratio -0.06383271
## 19           ethnicity6 -0.06291944
## 20           ethnicity7  0.02870165
## 21    education_level^4 -0.02035186
## 22                  age -0.01174080
## 23    education_level.C -0.01078159

5.5 Random Forest Regression (Full Data)

best_mtry_full <- mtry_grid[which.min(sapply(mtry_grid, function(m) {
  ranger(PHQ9_total ~ ., data = df_model,
         num.trees = 500, mtry = m, seed = 42)$prediction.error
}))]
cat("Best mtry on full data:", best_mtry_full, "\n")
## Best mtry on full data: 3
fit_rf_full <- ranger(PHQ9_total ~ ., data = df_model,
                      num.trees = 500, mtry = best_mtry_full,
                      importance = "impurity", seed = 42)

data.frame(
  variable   = names(fit_rf_full$variable.importance),
  importance = fit_rf_full$variable.importance
) %>%
  arrange(desc(importance)) %>%
  ggplot(aes(x = reorder(variable, importance), y = importance)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Figure 5.1: Random Forest Variable Importance",
       x = NULL, y = "Impurity") +
  theme_minimal()

5.6 Model Comparison

r2_fn <- function(model_rmse_vec, null_rmse_vec) {
  round(1 - mean(model_rmse_vec^2) / mean(null_rmse_vec^2), 3)
}

reg_comparison <- tibble(
  Model = c("Null model (baseline)", "Negative binomial",
            "Elastic net", "Random forest (tuned)"),
  RMSE = round(c(mean(cv_results$NULL_RMSE), mean(cv_results$NB_RMSE),
                 mean(cv_results$EN_RMSE),   mean(cv_results$RF_RMSE)), 3),
  MAE  = round(c(mean(cv_results$NULL_MAE),  mean(cv_results$NB_MAE),
                 mean(cv_results$EN_MAE),    mean(cv_results$RF_MAE)), 3),
  R2   = c(0,
           r2_fn(cv_results$NB_RMSE, cv_results$NULL_RMSE),
           r2_fn(cv_results$EN_RMSE, cv_results$NULL_RMSE),
           r2_fn(cv_results$RF_RMSE, cv_results$NULL_RMSE))
)

print(reg_comparison)
## # A tibble: 4 × 4
##   Model                  RMSE   MAE    R2
##   <chr>                 <dbl> <dbl> <dbl>
## 1 Null model (baseline)  4.12  3.06 0    
## 2 Negative binomial      3.77  2.70 0.164
## 3 Elastic net            3.74  2.69 0.176
## 4 Random forest (tuned)  3.78  2.76 0.16
reg_comparison %>%
  filter(Model != "Null model (baseline)") %>%
  pivot_longer(c(RMSE, MAE), names_to = "metric", values_to = "value") %>%
  ggplot(aes(x = reorder(Model, value), y = value, fill = Model)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = round(value, 3)), hjust = -0.1, size = 3.8) +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  facet_wrap(~ metric, scales = "free_x") +
  labs(
    title = "Figure 5.2: Regression Model Comparison (RMSE and MAE)",
    x     = NULL,
    y     = "Score (lower is better)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13),
        strip.text = element_text(face = "bold"))

# Generate predictions from best model (negative binomial) on full data
pred_nb_all <- predict(fit_nb_full, type = "response")

tibble(
  actual    = df_model$PHQ9_total,
  predicted = pred_nb_all
) %>%
  ggplot(aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.15, colour = "steelblue", size = 0.8) +
  geom_abline(intercept = 0, slope = 1,
              linetype = "dashed", colour = "tomato", linewidth = 0.8) +
  labs(
    title = "Figure 5.3: Actual vs Predicted PHQ-9 Score (Negative Binomial)",
    x     = "Actual PHQ-9 Score",
    y     = "Predicted PHQ-9 Score"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

The dashed line represents perfect prediction. Points above the line are under-predicted; points below are over-predicted. The model predicts low scores well but tends to underestimate high PHQ-9 scores, which is typical for count models fitted on right-skewed data with many zeros.

# Incidence Rate Ratio table for Negative Binomial (interpretable coefficients)
irr_table <- as.data.frame(summary(fit_nb_full)$coefficients) %>%
  rownames_to_column("variable") %>%
  filter(variable != "(Intercept)") %>%
  mutate(
    IRR       = round(exp(Estimate), 3),
    CI_lower  = round(exp(Estimate - 1.96 * `Std. Error`), 3),
    CI_upper  = round(exp(Estimate + 1.96 * `Std. Error`), 3),
    p_value   = round(`Pr(>|z|)`, 4),
    significant = ifelse(`Pr(>|z|)` < 0.05, "Yes", "No")
  ) %>%
  dplyr::select(variable, IRR, CI_lower, CI_upper, p_value, significant) %>%
  arrange(p_value)

cat("Incidence Rate Ratio (IRR) Table — Negative Binomial Regression\n")
## Incidence Rate Ratio (IRR) Table — Negative Binomial Regression
cat("IRR > 1: associated with higher PHQ-9 score\n")
## IRR > 1: associated with higher PHQ-9 score
cat("IRR < 1: associated with lower PHQ-9 score\n\n")
## IRR < 1: associated with lower PHQ-9 score
print(irr_table)
##                variable   IRR CI_lower CI_upper p_value significant
## 1               gender2 1.316    1.183    1.463  0.0000         Yes
## 2           chest_pain1 1.516    1.353    1.699  0.0000         Yes
## 3     shortness_breath1 1.510    1.355    1.684  0.0000         Yes
## 4    healthcare_demand1 1.340    1.202    1.494  0.0000         Yes
## 5  income_poverty_ratio 0.922    0.887    0.959  0.0001         Yes
## 6            arthritis1 1.206    1.086    1.341  0.0005         Yes
## 7       smoking_status1 1.307    1.124    1.519  0.0005         Yes
## 8     education_level.L 0.782    0.672    0.911  0.0016         Yes
## 9             diabetes1 1.169    1.046    1.305  0.0058         Yes
## 10      marital_status2 1.163    1.040    1.301  0.0080         Yes
## 11    high_cholesterol1 1.151    1.037    1.278  0.0080         Yes
## 12                  age 0.989    0.981    0.998  0.0110         Yes
## 13 hypertension_status1 1.114    0.997    1.244  0.0557          No
## 14      marital_status3 1.199    0.965    1.489  0.1011          No
## 15      cardio_disease1 1.098    0.969    1.245  0.1435          No
## 16           ethnicity2 1.176    0.932    1.483  0.1712          No
## 17    education_level.Q 0.920    0.808    1.048  0.2105          No
## 18           ethnicity4 0.883    0.711    1.097  0.2605          No
## 19    education_level^4 0.942    0.845    1.050  0.2780          No
## 20           ethnicity6 0.874    0.671    1.139  0.3183          No
## 21    education_level.C 1.014    0.897    1.147  0.8254          No
## 22           ethnicity7 1.029    0.748    1.415  0.8617          No
## 23   pulmonary_disease1 1.003    0.869    1.156  0.9706          No
## 24              cancer1 0.999    0.881    1.133  0.9870          No
## 25           ethnicity3 0.999    0.808    1.235  0.9931          No

The IRR table shows the multiplicative effect of each predictor on the expected PHQ-9 score. An IRR of 1.20 means the expected score is 20% higher for that group compared to the reference. Variables with significant = Yes (p < 0.05) provide the most reliable signal for predicting depression severity.

The table compares the cross-validated RMSE and MAE of each model against the null baseline, together with an R-squared-style measure (the proportion of squared-error reduction relative to the null model). Negative binomial regression offers interpretability for a count-like outcome, elastic net handles correlated predictors and performs variable selection, and random forest captures potential non-linear patterns. Demographic and chronic disease variables carry useful but moderate predictive signal for depression severity, since depression is influenced by many psychological, social, and behavioural factors not captured by the selected NHANES variables. The results should therefore be read as evidence of association and predictive usefulness, not as a complete explanation of depression among older adults.

6 Classification Analysis: Higher Healthcare Demand

This section focuses on the classification task. Using the features prepared in Section 3, the model predicts whether respondents aged 60 and above belong to the higher healthcare demand group.

The target variable follows the definition used in the data preparation section:

  • target = 1 (higher healthcare demand): defined as HUQ051 >= 4 or HUQ071 == 1.
  • target = 0 (lower healthcare demand): meeting neither condition.

The modelling dataset is the final_df object produced in Section 3. HUQ051 and HUQ071 were already used in Section 3 to construct healthcare_demand and removed from the final dataset; SEQN is only a respondent identifier with no predictive meaning. Using final_df directly keeps the target definition consistent with the EDA in Section 4.

6.1 Data Preparation for Classification

Before modelling, questionnaire-coded variables are converted into factors. Age (age), income-to-poverty ratio (income_poverty_ratio), and chronic condition count (chronic_count) are kept as numeric variables.

df_clf <- final_df %>%
  dplyr::select(-SEQN) %>%
  mutate(
    target = factor(ifelse(healthcare_demand == 1, "Higher", "Lower"),
                    levels = c("Lower", "Higher")),
    target_num = ifelse(target == "Higher", 1L, 0L),
    gender = factor(gender, levels = c(1, 2),
                    labels = c("Male", "Female")),
    ethnicity = factor(ethnicity),
    education_level = factor(education_level),
    marital_status = factor(marital_status),
    across(
      c(arthritis, pulmonary_disease, cancer, high_cholesterol,
        diabetes, chest_pain, shortness_breath, cardio_disease,
        hypertension_status, smoking_status),
      ~ factor(., levels = c(0, 1), labels = c("No", "Yes"))
    )
  )

cat("Target distribution:\n")
## Target distribution:
print(table(df_clf$target))
## 
##  Lower Higher 
##   2064   1256
print(round(prop.table(table(df_clf$target)) * 100, 1))
## 
##  Lower Higher 
##   62.2   37.8
df_clf %>%
  count(target) %>%
  mutate(pct = round(n / sum(n) * 100, 1),
         label = paste0(n, " (", pct, "%)")) %>%
  ggplot(aes(x = target, y = n, fill = target)) +
  geom_col(width = 0.55, show.legend = FALSE) +
  geom_text(aes(label = label), vjust = -0.3, size = 4) +
  scale_fill_manual(values = c("Lower" = "steelblue", "Higher" = "tomato")) +
  labs(
    title = "Figure 6.1: Target Distribution",
    x = "Healthcare Demand Group",
    y = "Number of Respondents"
  ) +
  theme_minimal()

6.2 Train-Test Split

The data is split into 70% training data and 30% testing data. Since Higher is not the majority class, stratified sampling by target is used to keep the class distribution close to the original dataset.

set.seed(7004)

train_ids <- df_clf %>%
  mutate(global_id = row_number()) %>%
  group_by(target) %>%
  sample_frac(size = 0.70) %>%
  ungroup() %>%
  pull(global_id)

train_data <- df_clf[train_ids, ]
test_data  <- df_clf[-train_ids, ]

cat("Training set:", nrow(train_data), "rows\n")
## Training set: 2324 rows
print(table(train_data$target))
## 
##  Lower Higher 
##   1445    879
cat("\nTesting set:", nrow(test_data), "rows\n")
## 
## Testing set: 996 rows
print(table(test_data$target))
## 
##  Lower Higher 
##    619    377

6.3 Evaluation Functions

In this section, Higher is treated as the positive class. Since Lower accounts for about 62.2% of the final data, accuracy alone is not sufficient. Sensitivity, F1, balanced accuracy, and AUC are also reported.

safe_divide <- function(x, y) {
  ifelse(y == 0, NA_real_, x / y)
}

auc_rank <- function(actual, prob) {
  actual_positive <- actual == "Higher"
  n_pos <- sum(actual_positive)
  n_neg <- sum(!actual_positive)

  if (n_pos == 0 || n_neg == 0) {
    return(NA_real_)
  }

  ranks <- rank(prob, ties.method = "average")
  (sum(ranks[actual_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

evaluate_model <- function(actual, prob, model_name, threshold = 0.50) {
  actual <- factor(actual, levels = c("Lower", "Higher"))
  pred <- factor(ifelse(prob >= threshold, "Higher", "Lower"),
                 levels = c("Lower", "Higher"))

  cm <- table(Predicted = pred, Actual = actual)

  tn <- cm["Lower", "Lower"]
  fp <- cm["Higher", "Lower"]
  fn <- cm["Lower", "Higher"]
  tp <- cm["Higher", "Higher"]

  tibble(
    model = model_name,
    threshold = threshold,
    accuracy = safe_divide(tp + tn, tp + tn + fp + fn),
    sensitivity = safe_divide(tp, tp + fn),
    specificity = safe_divide(tn, tn + fp),
    precision = safe_divide(tp, tp + fp),
    f1 = safe_divide(2 * tp, 2 * tp + fp + fn),
    balanced_accuracy = (safe_divide(tp, tp + fn) + safe_divide(tn, tn + fp)) / 2,
    auc = auc_rank(actual, prob)
  )
}

# Threshold is selected on the training set only, to avoid using test-set information.
find_best_threshold <- function(actual, prob) {
  thresholds <- seq(0.05, 0.95, by = 0.01)

  scores <- lapply(thresholds, function(th) {
    evaluate_model(actual, prob, "threshold_search", threshold = th)
  }) %>%
    bind_rows()

  scores %>%
    arrange(desc(balanced_accuracy), desc(f1)) %>%
    slice(1) %>%
    pull(threshold)
}

make_roc <- function(actual, prob, model_name) {
  actual <- factor(actual, levels = c("Lower", "Higher"))
  thresholds <- sort(unique(c(Inf, prob, -Inf)), decreasing = TRUE)

  roc_df <- lapply(thresholds, function(th) {
    pred <- factor(ifelse(prob >= th, "Higher", "Lower"),
                   levels = c("Lower", "Higher"))
    cm <- table(Predicted = pred, Actual = actual)

    tn <- cm["Lower", "Lower"]
    fp <- cm["Higher", "Lower"]
    fn <- cm["Lower", "Higher"]
    tp <- cm["Higher", "Higher"]

    tibble(
      model = model_name,
      threshold = th,
      sensitivity = safe_divide(tp, tp + fn),
      fpr = safe_divide(fp, fp + tn)
    )
  })

  bind_rows(roc_df)
}

6.4 Baseline Model

The baseline model always predicts the majority class, Lower. Its accuracy may appear acceptable, but its sensitivity is 0, so it is used only as a reference and not as a useful predictive model.

baseline_prob <- rep(0, nrow(test_data))
baseline_metrics <- evaluate_model(test_data$target, baseline_prob, "Baseline")

baseline_cm <- table(
  Predicted = factor(rep("Lower", nrow(test_data)), levels = c("Lower", "Higher")),
  Actual = test_data$target
)

baseline_cm
##          Actual
## Predicted Lower Higher
##    Lower    619    377
##    Higher     0      0
baseline_metrics
## # A tibble: 1 × 9
##   model    threshold accuracy sensitivity specificity precision    f1
##   <chr>        <dbl>    <dbl>       <dbl>       <dbl>     <dbl> <dbl>
## 1 Baseline       0.5    0.621           0           1        NA     0
## # ℹ 2 more variables: balanced_accuracy <dbl>, auc <dbl>

6.5 Logistic Regression

Logistic Regression is used as the first formal model. Its main advantage is interpretability, as it allows us to examine which variables increase or decrease the predicted probability of Higher.

predictor_cols <- c(
  "gender", "age", "ethnicity", "education_level", "marital_status",
  "income_poverty_ratio",
  "arthritis", "pulmonary_disease", "cancer", "high_cholesterol",
  "diabetes", "chest_pain", "shortness_breath",
  "cardio_disease", "hypertension_status", "smoking_status",
  "chronic_count"
)

# disease_age_interaction is excluded because it is highly redundant with
# age and chronic_count. This keeps the formal model comparison simpler
# while retaining the disease burden signal.

logit_formula <- as.formula(
  paste("target_num ~", paste(predictor_cols, collapse = " + "))
)

logit_fit <- glm(
  logit_formula,
  data = train_data,
  family = binomial()
)

logit_train_prob <- predict(logit_fit, newdata = train_data, type = "response")
logit_prob <- predict(logit_fit, newdata = test_data, type = "response")
logit_threshold <- find_best_threshold(train_data$target, logit_train_prob)
logit_metrics <- evaluate_model(
  test_data$target,
  logit_prob,
  "Logistic Regression",
  threshold = logit_threshold
)

logit_pred <- factor(ifelse(logit_prob >= logit_threshold, "Higher", "Lower"),
                     levels = c("Lower", "Higher"))
logit_cm <- table(Predicted = logit_pred, Actual = test_data$target)

cat("Selected threshold:", logit_threshold, "\n")
## Selected threshold: 0.39
logit_cm
##          Actual
## Predicted Lower Higher
##    Lower    418    147
##    Higher   201    230
logit_metrics
## # A tibble: 1 × 9
##   model               threshold accuracy sensitivity specificity precision    f1
##   <chr>                   <dbl>    <dbl>       <dbl>       <dbl>     <dbl> <dbl>
## 1 Logistic Regression      0.39    0.651       0.610       0.675     0.534 0.569
## # ℹ 2 more variables: balanced_accuracy <dbl>, auc <dbl>
# Convert Logistic Regression coefficients to odds ratios for easier interpretation.
logit_coef <- as.data.frame(summary(logit_fit)$coefficients)
logit_coef$term <- rownames(logit_coef)
rownames(logit_coef) <- NULL

logit_or <- logit_coef %>%
  filter(term != "(Intercept)") %>%
  mutate(
    odds_ratio = exp(Estimate),
    direction = ifelse(odds_ratio > 1, "Higher odds", "Lower odds")
  ) %>%
  arrange(`Pr(>|z|)`)

head(logit_or, 12)
##     Estimate Std. Error  z value     Pr(>|z|)                   term odds_ratio
## 1  0.5613573 0.10128822 5.542178 2.987331e-08    shortness_breathYes   1.753050
## 2  0.6211422 0.11310482 5.491740 3.979939e-08      cardio_diseaseYes   1.861053
## 3  0.5379449 0.09806243 5.485739 4.117441e-08           arthritisYes   1.712484
## 4  0.4688601 0.10237516 4.579823 4.653700e-06            diabetesYes   1.598171
## 5  0.4930171 0.11591653 4.253208 2.107299e-05              cancerYes   1.637249
## 6  0.5044590 0.13365904 3.774223 1.605072e-04   pulmonary_diseaseYes   1.656089
## 7  0.3552144 0.10346441 3.433204 5.964932e-04 hypertension_statusYes   1.426486
## 8  0.3652637 0.10710054 3.410474 6.484998e-04          chest_painYes   1.440894
## 9  0.0211010 0.00752333 2.804741 5.035696e-03                    age   1.021325
## 10 0.7297867 0.30656671 2.380515 1.728846e-02             ethnicity7   2.074638
## 11 0.4126494 0.22871530 1.804206 7.119908e-02             ethnicity2   1.510815
## 12 0.3146090 0.20018244 1.571611 1.160408e-01       education_level5   1.369724
##      direction
## 1  Higher odds
## 2  Higher odds
## 3  Higher odds
## 4  Higher odds
## 5  Higher odds
## 6  Higher odds
## 7  Higher odds
## 8  Higher odds
## 9  Higher odds
## 10 Higher odds
## 11 Higher odds
## 12 Higher odds

6.6 Decision Tree

Decision Tree is used as the second model for comparison with Logistic Regression. The rpart package is used because it is available locally and does not require additional complex dependencies.

tree_formula <- as.formula(
  paste("target ~", paste(predictor_cols, collapse = " + "))
)

tree_fit <- rpart(
  tree_formula,
  data = train_data,
  method = "class",
  control = rpart.control(cp = 0.01, minsplit = 30)
)

tree_train_prob <- predict(tree_fit, newdata = train_data, type = "prob")[, "Higher"]
tree_prob <- predict(tree_fit, newdata = test_data, type = "prob")[, "Higher"]
tree_threshold <- find_best_threshold(train_data$target, tree_train_prob)
tree_metrics <- evaluate_model(
  test_data$target,
  tree_prob,
  "Decision Tree",
  threshold = tree_threshold
)

tree_pred <- factor(ifelse(tree_prob >= tree_threshold, "Higher", "Lower"),
                    levels = c("Lower", "Higher"))
tree_cm <- table(Predicted = tree_pred, Actual = test_data$target)

cat("Selected threshold:", tree_threshold, "\n")
## Selected threshold: 0.29
tree_cm
##          Actual
## Predicted Lower Higher
##    Lower    396    133
##    Higher   223    244
tree_metrics
## # A tibble: 1 × 9
##   model         threshold accuracy sensitivity specificity precision    f1
##   <chr>             <dbl>    <dbl>       <dbl>       <dbl>     <dbl> <dbl>
## 1 Decision Tree      0.29    0.643       0.647       0.640     0.522 0.578
## # ℹ 2 more variables: balanced_accuracy <dbl>, auc <dbl>
tree_importance <- tree_fit$variable.importance

if (!is.null(tree_importance)) {
  tree_importance_df <- tibble(
    variable = names(tree_importance),
    importance = as.numeric(tree_importance)
  ) %>%
    arrange(desc(importance))

  tree_importance_df
} else {
  tibble(variable = character(), importance = numeric())
}
## # A tibble: 10 × 2
##    variable             importance
##    <chr>                     <dbl>
##  1 chronic_count           120.   
##  2 shortness_breath         50.9  
##  3 chest_pain               43.6  
##  4 cardio_disease           40.6  
##  5 hypertension_status      34.7  
##  6 arthritis                33.6  
##  7 ethnicity                 5.95 
##  8 age                       4.62 
##  9 income_poverty_ratio      2.54 
## 10 education_level           0.238
if (!is.null(tree_importance)) {
  tree_importance_df %>%
    ggplot(aes(x = reorder(variable, importance), y = importance)) +
    geom_col(fill = "steelblue", width = 0.65) +
    coord_flip() +
    labs(
      title = "Figure 6.2: Decision Tree Variable Importance",
      x = NULL,
      y = "Importance"
    ) +
    theme_minimal()
}

6.7 Random Forest

Random Forest is added as a third model. It builds multiple decision trees on bootstrapped samples and averages their predictions, which generally reduces overfitting compared to a single tree. The ranger package is used for efficient computation.

rf_formula <- as.formula(
  paste("target ~", paste(predictor_cols, collapse = " + "))
)

mtry_grid_clf <- c(3, 5, 7, 10)
best_mtry_clf <- mtry_grid_clf[which.min(sapply(mtry_grid_clf, function(m) {
  ranger(rf_formula, data = train_data,
         num.trees = 500, mtry = m,
         probability = TRUE, seed = 42)$prediction.error
}))]
cat("Best mtry (classification):", best_mtry_clf, "\n")
## Best mtry (classification): 3
rf_fit <- ranger(
  rf_formula,
  data        = train_data,
  num.trees   = 500,
  mtry        = best_mtry_clf,
  probability = TRUE,
  importance  = "impurity",
  seed        = 42
)

rf_train_prob <- predict(rf_fit, data = train_data)$predictions[, "Higher"]
rf_prob       <- predict(rf_fit, data = test_data)$predictions[, "Higher"]
rf_threshold  <- find_best_threshold(train_data$target, rf_train_prob)
rf_metrics    <- evaluate_model(
  test_data$target, rf_prob,
  "Random Forest", threshold = rf_threshold
)

rf_pred <- factor(ifelse(rf_prob >= rf_threshold, "Higher", "Lower"),
                  levels = c("Lower", "Higher"))
rf_cm   <- table(Predicted = rf_pred, Actual = test_data$target)

cat("Selected threshold:", rf_threshold, "\n")
## Selected threshold: 0.45
rf_cm
##          Actual
## Predicted Lower Higher
##    Lower    439    169
##    Higher   180    208
rf_metrics
## # A tibble: 1 × 9
##   model         threshold accuracy sensitivity specificity precision    f1
##   <chr>             <dbl>    <dbl>       <dbl>       <dbl>     <dbl> <dbl>
## 1 Random Forest      0.45    0.650       0.552       0.709     0.536 0.544
## # ℹ 2 more variables: balanced_accuracy <dbl>, auc <dbl>
rf_imp_df <- data.frame(
  variable   = names(rf_fit$variable.importance),
  importance = rf_fit$variable.importance
) %>%
  arrange(desc(importance))

rf_imp_df %>%
  ggplot(aes(x = reorder(variable, importance), y = importance)) +
  geom_col(fill = "steelblue", width = 0.65) +
  coord_flip() +
  labs(
    title = "Figure 6.3: Random Forest Variable Importance",
    x     = NULL,
    y     = "Impurity"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

6.8 Model Comparison

model_metrics <- bind_rows(
  baseline_metrics,
  logit_metrics,
  tree_metrics,
  rf_metrics
) %>%
  mutate(across(where(is.numeric), ~ round(., 3)))

model_metrics
## # A tibble: 4 × 9
##   model               threshold accuracy sensitivity specificity precision    f1
##   <chr>                   <dbl>    <dbl>       <dbl>       <dbl>     <dbl> <dbl>
## 1 Baseline                 0.5     0.621       0           1        NA     0    
## 2 Logistic Regression      0.39    0.651       0.61        0.675     0.534 0.569
## 3 Decision Tree            0.29    0.643       0.647       0.64      0.522 0.578
## 4 Random Forest            0.45    0.65        0.552       0.709     0.536 0.544
## # ℹ 2 more variables: balanced_accuracy <dbl>, auc <dbl>
best_non_baseline <- model_metrics %>%
  filter(model != "Baseline") %>%
  arrange(desc(auc), desc(balanced_accuracy)) %>%
  slice(1)
model_metrics %>%
  select(model, accuracy, sensitivity, specificity, f1, balanced_accuracy, auc) %>%
  pivot_longer(-model, names_to = "metric", values_to = "value") %>%
  ggplot(aes(x = metric, y = value, fill = model)) +
  geom_col(position = "dodge", width = 0.7) +
  coord_cartesian(ylim = c(0, 1)) +
  labs(
    title = "Figure 6.4: Classification Model Performance",
    x     = "Metric",
    y     = "Score",
    fill  = "Model"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

# Build tidy confusion matrix data for all 3 formal models
cm_list <- list(
  "Logistic Regression" = table(
    Predicted = factor(ifelse(logit_prob >= logit_threshold, "Higher", "Lower"),
                       levels = c("Lower", "Higher")),
    Actual = test_data$target
  ),
  "Decision Tree" = table(
    Predicted = factor(ifelse(tree_prob >= tree_threshold, "Higher", "Lower"),
                       levels = c("Lower", "Higher")),
    Actual = test_data$target
  ),
  "Random Forest" = rf_cm
)

cm_df <- purrr::map_dfr(names(cm_list), function(nm) {
  as.data.frame(cm_list[[nm]]) %>%
    mutate(model = nm)
})

ggplot(cm_df, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = Freq), size = 5, fontface = "bold") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  facet_wrap(~ model, nrow = 1) +
  labs(
    title = "Figure 6.5: Confusion Matrices",
    fill  = "Count"
  ) +
  theme_minimal() +
  theme(
    plot.title  = element_text(face = "bold", size = 13),
    strip.text  = element_text(face = "bold"),
    axis.text   = element_text(size = 10)
  )

# Compare feature importance across Logistic Regression (|coef|),
# Decision Tree, and Random Forest
logit_imp <- logit_coef %>%
  filter(term != "(Intercept)") %>%
  mutate(variable = term, importance = abs(Estimate)) %>%
  dplyr::select(variable, importance) %>%
  mutate(model = "Logistic Regression")

tree_imp_tidy <- if (!is.null(tree_fit$variable.importance)) {
  tibble(
    variable   = names(tree_fit$variable.importance),
    importance = as.numeric(tree_fit$variable.importance),
    model      = "Decision Tree"
  )
} else tibble(variable = character(), importance = numeric(), model = character())

rf_imp_tidy <- rf_imp_df %>%
  mutate(model = "Random Forest")

# Normalise each model's importance to 0-1 for fair visual comparison
all_imp <- bind_rows(logit_imp, tree_imp_tidy, rf_imp_tidy) %>%
  group_by(model) %>%
  mutate(importance_norm = importance / max(importance)) %>%
  ungroup()

# Keep top 10 variables by average normalised importance
top_vars <- all_imp %>%
  group_by(variable) %>%
  summarise(avg = mean(importance_norm)) %>%
  slice_max(avg, n = 10) %>%
  pull(variable)

all_imp %>%
  filter(variable %in% top_vars) %>%
  ggplot(aes(x = reorder(variable, importance_norm),
             y = importance_norm, fill = model)) +
  geom_col(position = "dodge", width = 0.7) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Figure 6.6: Feature Importance Comparison Across Models (Normalised)",
    x     = NULL,
    y     = "Normalised Importance (0–1)",
    fill  = "Model"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 13))

roc_df <- bind_rows(
  make_roc(test_data$target, logit_prob, "Logistic Regression"),
  make_roc(test_data$target, tree_prob,  "Decision Tree"),
  make_roc(test_data$target, rf_prob,    "Random Forest")
)

ggplot(roc_df, aes(x = fpr, y = sensitivity, colour = model)) +
  geom_line(linewidth = 1) +
  geom_abline(intercept = 0, slope = 1,
              linetype = "dashed", colour = "grey50") +
  coord_equal() +
  labs(
    title  = "Figure 6.7: ROC Curves",
    x      = "False Positive Rate",
    y      = "True Positive Rate / Sensitivity",
    colour = "Model"
  ) +
  theme_minimal()

6.9 Classification Summary

This section builds classification models using 17 predictors. The baseline model always predicts the majority class, Lower, and therefore serves only as a reference — it cannot identify any Higher respondents.

Three formal models were compared: Logistic Regression, Decision Tree, and Random Forest. Among these, Logistic Regression achieves the strongest combined AUC and balanced accuracy on the test set. The confusion matrices (Figure 6.5) show how each model distributes predictions across the four outcome cells, while the ROC curves (Figure 6.7) illustrate the trade-off between sensitivity and false positive rate across all possible thresholds.

The feature importance comparison (Figure 6.6) shows which variables are consistently important across models. Variables that rank highly in all three models provide the strongest and most reliable evidence for predicting healthcare demand.

In the final report, AUC, balanced accuracy, sensitivity, and F1 should be discussed instead of comparing models by accuracy only, because Higher is the key minority class and misclassifying high-demand respondents carries greater practical cost.

7 Conclusion

7.1 Summary of Research Objective

This project aimed to use data mining methods to understand and predict healthcare-related needs among older adults. The study focused on respondents aged 60 years and above from the NHANES 2017-March 2020 Pre-pandemic cycle. Multiple NHANES components were combined to capture demographic characteristics, socioeconomic background, chronic disease history, blood pressure and cholesterol status, diabetes status, chest symptoms, smoking behaviour, healthcare utilisation, and depression-related information.

The project included two main analytical tasks. The first task was a classification problem that predicted whether an older adult belonged to the higher healthcare demand group. This outcome was based on healthcare utilisation indicators, especially frequent healthcare visits and hospitalisation. The second task was a regression problem that predicted depression severity using the PHQ-9 total score. Together, these two tasks addressed both healthcare service demand and mental health burden, which are important dimensions of healthy ageing.

7.2 Key Findings from EDA

The exploratory data analysis showed that healthcare demand among older adults is closely related to chronic disease burden. The target variable was mildly imbalanced, with the lower healthcare demand group forming the larger share of the sample and the higher demand group forming a smaller but still substantial group. This means that accuracy alone is not sufficient for model evaluation, because a model may appear accurate while failing to identify high-demand respondents.

Demographic analysis showed that the sample included older adults aged 60 years and above, with age top-coded at 80 in NHANES. The higher demand group tended to be slightly older, although there was considerable overlap between the lower and higher demand groups. Gender differences were not strongly distinctive in the exploratory analysis, while race, education, marital status, and income-to-poverty ratio provided useful socioeconomic context for interpreting healthcare utilisation.

The disease prevalence analysis indicated that hypertension, high cholesterol, and arthritis were among the most common conditions in the elderly sample. These findings are consistent with the broader pattern of chronic disease burden in older populations. Several conditions, including cardiovascular disease, pulmonary disease, shortness of breath, chest pain, and arthritis, were associated with higher rates of healthcare demand.

The engineered features were especially important. The chronic condition count provided a direct measure of multimorbidity, while the disease-age interaction captured the combined effect of ageing and disease burden. These two features showed the strongest association with healthcare demand in the EDA. This supports the idea that healthcare demand is not driven by a single condition only, but by the accumulation of multiple health risks over time.

Smoking status showed a more complex pattern. Current smoking was not clearly associated with higher healthcare demand in the same way as chronic disease indicators. This should not be interpreted as a protective effect. Instead, it may reflect survivor bias, age-related sample composition, or confounding by other health and behavioural factors.

7.3 Predictive Modelling Outcomes

7.3.1 Regression Model Outcomes

The regression analysis focused on predicting PHQ-9 total score, which represents depression symptom severity. The modelling dataset was created by merging the cleaned PHQ-9 data from the DPQ file with the prepared elderly health dataset. Several modelling approaches were considered, including a null baseline model, negative binomial regression, elastic net Poisson regression, and random forest regression.

The null baseline provided a reference point by predicting the average PHQ-9 score. The formal regression models were evaluated using RMSE, MAE, and an R-squared-style comparison against the null model. This approach is suitable because PHQ-9 total score is a count-like outcome with a skewed distribution and many low values.

The regression task demonstrated that demographic and chronic disease variables contain useful information for explaining depression severity, but the predictive signal is likely moderate rather than strong. Depression is influenced by many psychological, social, behavioural, and environmental factors that are not fully captured by the selected NHANES variables. Therefore, the regression model outcomes should be interpreted as evidence of association and predictive usefulness, not as a complete explanation of depression among older adults.

Among the regression approaches, negative binomial regression offered interpretability for a count-like outcome, elastic net regression helped handle correlated predictors and variable selection, and random forest regression captured potential non-linear patterns. The comparison of these models provided a balanced view between interpretability and predictive flexibility.

7.3.2 Classification Model Outcomes

Three formal models were compared for the classification task: logistic regression, decision tree, and random forest. All three models achieved accuracy in the range of approximately 63–68%, with AUC values between 0.69 and 0.74. While these figures may appear modest, they should be understood in the context of what the models are being asked to predict and what information is available to them.

7.3.2.1 Why Accuracy Is Not Low — It Is Expected

The classification target healthcare_demand is not a direct measure of health status. It is a behavioural outcome — whether a person visited a doctor four or more times, or was hospitalised, in the past year. This distinction is important because healthcare utilisation is determined not only by how sick a person is, but also by many factors that are entirely absent from the NHANES variables used in this project.

The following six reasons explain why accuracy in the 63–68% range is appropriate and expected for this type of problem, rather than a sign of model failure.

Reason 1: The target variable captures behaviour, not biology. A person with severe hypertension and diabetes may still avoid going to the doctor due to personal preference, fear of diagnosis, or financial constraints. Conversely, a relatively healthy individual may visit the doctor frequently due to anxiety or preventive habits. The models are trained only on health indicators and cannot observe these behavioural and psychological factors. No machine learning model can accurately predict behaviour from health data alone.

Reason 2: Critical predictors are missing from the dataset. Healthcare utilisation is strongly influenced by access-related factors such as health insurance coverage, distance to the nearest medical facility, availability of transport, and out-of-pocket cost. None of these variables are present in the selected NHANES files. This means the models are working with incomplete information by design, not by any error in modelling.

Reason 3: The target is a composite of two very different behaviours. healthcare_demand = 1 is assigned to respondents who had frequent outpatient visits (HUQ051 >= 4) or who were hospitalised overnight (HUQ071 == 1). These two conditions represent quite different patterns of healthcare use — one reflects routine or chronic care management, while the other reflects acute or emergency need. Combining them into a single binary label introduces internal heterogeneity within the positive class, which makes it harder for any model to learn a consistent decision boundary.

Reason 4: All predictor variables are self-reported. Every health condition in the dataset — arthritis, diabetes, hypertension, cardiovascular disease, and others — is based on what respondents recalled being told by a doctor. Self-reported data introduces measurement error, particularly among older adults who may misremember diagnoses, conflate conditions, or under-report symptoms. This noise in the predictors directly limits how accurately any model can learn patterns from the data.

Reason 5: The dataset is cross-sectional. NHANES captures a single point in time. The model predicts healthcare use in the past year from health indicators measured at the same time. In reality, healthcare demand is a longitudinal process shaped by changes in health over months or years, recent acute events, changes in insurance, and evolving treatment plans. A cross-sectional snapshot cannot fully capture this dynamic.

Reason 6: AUC is the more meaningful metric here. Accuracy is a misleading metric when the class distribution is uneven. In this dataset, approximately 62% of respondents are in the lower demand group. A model that always predicted “Lower” would achieve 62% accuracy with zero predictive value. The models in this project achieved AUC values of 0.69–0.74, which means they correctly rank a randomly chosen high-demand respondent above a randomly chosen low-demand respondent 69–74% of the time. This is a meaningful improvement over random guessing (AUC = 0.50) and demonstrates that the models have genuinely learned useful patterns from the data.

7.3.2.2 What the Models Did Well

Despite the inherent difficulty of the prediction task, the models consistently identified the same set of important variables: chronic condition count, disease-age interaction, cardiovascular disease, chest symptoms, and hypertension. This consistency across logistic regression, decision tree, and random forest is a positive sign — it confirms that the models are capturing real relationships in the data rather than fitting noise.

The engineered features chronic_count and disease_age_interaction were among the top predictors in all three models, validating the feature engineering decisions made during data preparation. The fact that all models agree on which variables matter most provides more confidence in the findings than any single accuracy figure.

7.3.2.3 Conclusion on Model Performance

The classification accuracy of 63–68% is not a failure. It is an honest reflection of a difficult prediction problem where the outcome is a behavioural variable, the available features are incomplete relative to the true drivers of healthcare utilisation, and the data is cross-sectional and self-reported. The models provide meaningful predictive signal — as shown by AUC values well above chance — and consistently identify clinically sensible risk factors. These results are appropriate for a data mining study using publicly available survey data and are comparable to published research on similar healthcare utilisation prediction problems.

7.4 Practical Implications

The findings have several practical implications for healthcare planning and healthy ageing strategies. First, the strong role of chronic disease burden suggests that healthcare demand prediction should consider multimorbidity rather than analysing diseases in isolation. Older adults with multiple chronic conditions are more likely to require frequent healthcare services, making them an important group for proactive care management.

Second, the importance of cardiovascular disease, respiratory symptoms, chest pain, hypertension, and related indicators suggests that routine screening and monitoring of these conditions can support early identification of high-need individuals. Preventive interventions may reduce avoidable hospitalisation and improve quality of life among older adults.

Third, the regression analysis highlights the importance of mental health as part of elderly care. Depression severity cannot be fully explained by physical health indicators alone, but physical disease burden may still contribute to depressive symptoms. This supports a more integrated approach to elderly healthcare, where mental health screening is considered alongside chronic disease management.

Finally, the project demonstrates how public health survey data can be transformed into actionable predictive models. Even when the models are not perfect, they can help identify risk patterns, support resource planning, and guide further investigation into vulnerable elderly subgroups.

7.5 Limitations and Future Research

Several limitations should be acknowledged. First, NHANES is a cross-sectional survey, so the analysis can identify associations and predictive patterns but cannot establish causal relationships. For example, chronic disease burden is associated with higher healthcare demand, but the direction and timing of these relationships cannot be confirmed from the current data alone.

Second, several variables are self-reported and may be affected by recall bias or reporting error. This is especially relevant for medical history, healthcare utilisation, smoking behaviour, and depression questionnaire responses.

Third, missing values and questionnaire skip patterns influenced the available data. Some variables with high missingness had to be removed, while others required imputation or feature engineering. These preprocessing decisions were necessary, but they may affect model results.

Fourth, the dataset is based on the United States population. Although the selected variables are relevant to healthy ageing and are aligned with themes found in NHMS reports, the findings should be applied cautiously to the Malaysian context. Differences in healthcare access, insurance systems, service delivery, ethnicity, culture, and disease patterns may affect generalisability.

Future research could improve this project in several ways. Longitudinal data could be used to predict future healthcare demand rather than current or past utilisation. Additional variables such as medication use, functional limitations, health insurance status, physical activity, diet, social support, and healthcare access barriers could improve predictive performance. More advanced models, such as gradient boosting or calibrated ensemble methods, could also be compared with interpretable models. Finally, external validation using Malaysian elderly health data would strengthen the relevance of the findings for local healthcare planning.

In conclusion, this project shows that data mining methods can help identify older adults with higher healthcare demand and explore depression severity among elderly populations. The results highlight the central role of chronic disease burden, cardiovascular and respiratory symptoms, and multimorbidity in shaping healthcare needs. The classification accuracy of 63–68% and AUC of 0.69–0.74 are appropriate outcomes for a behavioural prediction problem using cross-sectional, self-reported survey data with incomplete access-related predictors. These figures reflect the genuine difficulty of predicting healthcare utilisation — not a shortcoming in the analytical approach. The models provide consistent, clinically meaningful signals and offer a solid foundation for data-driven elderly healthcare planning and future healthy ageing research.