Population ageing has become an important public health and social policy issue in many countries. As life expectancy increases, a larger proportion of the population is living into older age, where chronic diseases, functional limitations, mental health concerns, and repeated healthcare use become more common. Older adults often experience multiple health conditions at the same time, including hypertension, diabetes, cardiovascular disease, arthritis, cancer, pulmonary disease, and symptoms such as chest pain or shortness of breath. These conditions can increase the need for outpatient visits, hospitalisation, long-term monitoring, and preventive care.
Healthcare demand among older adults is not determined by age alone. It is shaped by a combination of demographic, socioeconomic, behavioural, and clinical factors. For example, income and education may influence access to healthcare and health-seeking behaviour, while chronic disease burden may directly increase the need for medical services. Lifestyle-related factors such as smoking may also contribute to poorer health outcomes, although their relationship with healthcare utilisation can be complex in elderly populations.
This project uses data from the National Health and Nutrition Examination Survey (NHANES) 2017-March 2020 Pre-pandemic cycle. NHANES provides individual-level information collected through interviews, questionnaires, and physical examinations, making it a valuable source for studying health status and healthcare utilisation. Although the dataset is based on the United States population, the relationships between ageing, chronic disease, socioeconomic status, and healthcare demand are broadly relevant to healthy ageing research. The selected variables are also aligned with themes commonly discussed in National Health and Morbidity Survey (NHMS) reports, supporting the relevance of the analysis to the Malaysian healthcare context.
Healthcare systems face increasing pressure as the number of older adults grows and the prevalence of chronic diseases rises. A major challenge is identifying older individuals who are more likely to require frequent healthcare services or hospital care. Without early identification, healthcare providers may respond only after medical needs become severe, leading to higher costs, delayed intervention, and poorer patient outcomes.
Traditional healthcare planning often relies on descriptive statistics or broad population-level indicators. While these are useful, they may not fully capture the combined effect of individual-level characteristics such as disease burden, age, income, smoking behaviour, and specific clinical symptoms. Data mining and predictive modelling can help address this gap by identifying patterns in large health datasets and estimating which older adults are more likely to have high healthcare demand.
In this project, high healthcare demand is represented by a composite indicator derived from healthcare utilisation variables. Older adults are classified as having higher healthcare demand if they had frequent healthcare visits in the past year or experienced overnight hospitalisation. In addition, depression severity is analysed through the PHQ-9 total score to explore how demographic characteristics and chronic disease burden relate to mental health outcomes among older adults.
The main objective of this project is to apply data mining techniques to understand and predict healthcare-related needs among older adults using NHANES data.
The specific objectives are:
Research Question 1 Which demographic and health-related factors are associated with depression severity among older adults?
Research Question 2 Can machine learning models accurately predict high healthcare demand among older adults?
The data mining goal of this project is to discover useful patterns in elderly health data and build predictive models that support better understanding of healthcare demand among older adults.
For the classification task, the goal is to predict whether a respondent aged 60 years or above has higher healthcare demand. The target variable is based on healthcare utilisation, particularly frequent healthcare visits and hospitalisation. Predictors include demographic variables, socioeconomic indicators, chronic disease indicators, chest symptoms, smoking status, and engineered features such as chronic condition count and the interaction between age and disease burden.
For the regression task, the goal is to predict depression severity using the PHQ-9 total score. This task examines whether demographic characteristics and chronic disease burden can explain variation in depressive symptoms among older adults. The regression analysis complements the classification task by considering mental health as another important dimension of healthy ageing.
Together, these two modelling tasks aim to provide a more complete picture of older adults’ health needs. The classification model focuses on healthcare service demand, while the regression model focuses on depression severity. The combined findings can help identify high-risk profiles, highlight important predictors, and support data-driven planning for elderly healthcare services.
This section describes the dataset used in the project Predicting High Healthcare Demand Among Older Adults to Support Healthy Ageing.
The data was obtained from the National Health and Nutrition Examination Survey (NHANES) 2017–March 2020 Pre-pandemic cycle, conducted by the Centers for Disease Control and Prevention (CDC) in the United States. NHANES is a large-scale health survey designed to provide nationally representative information on the health and nutritional status of the population.
One of the main reasons for selecting NHANES is that it contains a wide range of individual-level information, including demographic background, diagnosed medical conditions, health behaviours, access to healthcare, and cardiovascular- and diabetes-related indicators. These dimensions are highly relevant to the objective of this project, which is to predict healthcare demand among older adults.
Another strength of NHANES is that data is collected through multiple sources, including household interviews, questionnaires, and physical examinations, making the dataset more reliable than a purely self-reported survey.
Although NHANES is a U.S.-based dataset, it is still useful for this study because the relationships between ageing, chronic disease burden, lifestyle factors, and healthcare utilization are broadly relevant across many settings. In addition, variables are aligned with key indicators identified in NHMS reports to ensure relevance to the Malaysian context.
To support the two analytical objectives of this project — predicting high healthcare demand (classification) and predicting depression severity (regression) — eight NHANES datasets were selected. Each file contributes a different aspect of information needed for the analysis:
| Dataset | Description |
|---|---|
| DEMO | Demographic information (age, gender, race/ethnicity, education, marital status, income) |
| HUQ | Healthcare utilisation and access |
| BPQ | Blood pressure and hypertension-related information |
| DIQ | Diabetes-related information |
| SMQ | Smoking behaviour |
| MCQ | Medical conditions and chronic disease history |
| CDQ | Cardiovascular health-related information |
| DPQ | Mental health — Depression Screener (PHQ-9); used for the regression analysis task |
The first seven datasets (DEMO, HUQ, BPQ, DIQ, SMQ, MCQ, CDQ) are
merged to form the classification dataset
(final_df), which is used to predict high healthcare
demand. The DPQ dataset is joined separately with final_df
to form the regression dataset
(final_df_regression), which is used to predict PHQ-9
depression severity. The two datasets serve different research questions
within the same project theme and are kept separate to avoid reducing
the sample size for the classification task.
All datasets were merged using SEQN, the unique respondent identifier in NHANES, which allows records from different survey components to be linked to the same participant.
All datasets are merged using the participant identifier variable
SEQN.
This project focuses on respondents aged 60 years and above to study healthcare demand among older adults, consistent with the elderly age threshold used in the National Health and Morbidity Survey (NHMS) 2019.
Two separate analytical datasets are produced from these eight files:
final_df (classification dataset):
constructed from the seven core datasets (DEMO, MCQ, BPQ, DIQ, HUQ, CDQ,
SMQ). Contains demographic characteristics, chronic disease indicators,
engineered features, and the composite binary target
healthcare_demand. Used for the classification task in
Section 6.
final_df_regression (regression
dataset): constructed by joining final_df with the
cleaned DPQ data using an inner join on SEQN. Contains all variables
from final_df plus the PHQ-9 total score
PHQ9_total as the regression target. The inner join retains
only participants present in both datasets, resulting in a smaller
sample size than final_df. Used for the regression task in
Section 5.
Based on the project objective and supported by NHMS findings, the following key features are identified across six dimensions.
Demographic factors include age, gender, race/ethnicity, education level, marital status, and income level. These variables are important because healthcare demand may vary across different social and population groups.
| Variable | Codebook Description | Notes |
|---|---|---|
| RIAGENDR | Gender | Demographic feature |
| RIDAGEYR | Age in years at screening | Used to retain participants aged 60 and above |
| RIDRETH3 | Race/Hispanic origin w/ Non-Hispanic Asian | Demographic feature |
| DMDEDUC2 | Education level — Adults 20+ | Socioeconomic feature |
| DMDMARTZ | Marital status | Demographic and social feature |
| INDFMPIR | Ratio of family income to poverty | Socioeconomic feature |
Chronic disease factors include conditions such as arthritis, cardiovascular diseases, COPD, and cancer. Blood pressure and diabetes factors focus on diagnosis and treatment status.
| Variable | Codebook Description | Notes |
|---|---|---|
| MCQ160A | Doctor ever said you had arthritis | Arthritis variable |
| MCQ160B | Ever told had congestive heart failure | Cardiovascular disease variable |
| MCQ160C | Ever told you had coronary heart disease | Cardiovascular disease variable |
| MCQ160D | Ever told you had angina/angina pectoris | Cardiovascular disease variable |
| MCQ160E | Ever told you had heart attack | Cardiovascular disease variable |
| MCQ160F | Ever told you had a stroke | Cardiovascular / neurological disease variable |
| MCQ160P | Ever told you had COPD, emphysema, ChB | Pulmonary disease variable |
| MCQ220 | Ever told you had cancer or malignancy | Cancer variable |
| BPQ020 | Ever told you had high blood pressure | Hypertension variable |
| BPQ030 | Told had high blood pressure 2+ times | Hypertension variable |
| BPQ080 | Dr told you — high cholesterol level | Cholesterol variable |
| DIQ010 | Doctor told you have diabetes | Diabetes variable |
| DIQ050 | Taking insulin now | Diabetes variable |
| DIQ070 | Take diabetic pills to lower blood sugar | Diabetes variable |
| DIQ080 | Diabetes affected eyes/had retinopathy | Diabetes variable |
| Variable | Codebook Description | Notes |
|---|---|---|
| CDQ001 | SP ever had pain or discomfort in chest | Chest symptom variable |
| CDQ010 | Shortness of breath on stairs/inclines | Chest / cardiovascular symptom variable |
Smoking behaviour is included because it may be associated with poorer health outcomes and higher healthcare needs among older adults.
| Variable | Codebook Description | Notes |
|---|---|---|
| SMQ020 | Smoked at least 100 cigarettes in life | Smoking history variable |
| SMQ040 | Do you now smoke cigarettes? | Current smoking status variable |
| SMD650 | Avg # cigarettes/day during past 30 days | Smoking intensity variable |
| SMD030 | Age started smoking cigarettes regularly | Smoking history variable |
Healthcare utilisation variables are directly connected to the idea of healthcare demand and are used to construct the target variable.
| Variable | Codebook Description | Notes |
|---|---|---|
| HUQ051 | # times receive healthcare over past year | Outpatient visit frequency; used for target construction |
| HUQ071 | Overnight hospital patient in last year | Hospitalisation indicator; used for target construction |
The DPQ dataset contains the nine-item Patient Health Questionnaire
(PHQ-9), a validated depression screening instrument. Each item asks
about the frequency of a depressive symptom over the past two weeks,
scored from 0 (not at all) to 3 (nearly every day). The nine items are
summed to produce PHQ9_total (range 0–27), which serves as
the target variable for the regression task.
| Variable | Codebook Description | Notes |
|---|---|---|
| DPQ010 | Have little interest in doing things | PHQ-9 item 1 |
| DPQ020 | Feeling down, depressed, or hopeless | PHQ-9 item 2 |
| DPQ030 | Trouble sleeping or sleeping too much | PHQ-9 item 3 |
| DPQ040 | Feeling tired or having little energy | PHQ-9 item 4 |
| DPQ050 | Poor appetite or overeating | PHQ-9 item 5 |
| DPQ060 | Feeling bad about yourself | PHQ-9 item 6 |
| DPQ070 | Trouble concentrating on things | PHQ-9 item 7 |
| DPQ080 | Moving or speaking slowly or too fast | PHQ-9 item 8 |
| DPQ090 | Thoughts you would be better off dead | PHQ-9 item 9 |
| PHQ9_total | Sum of DPQ010–DPQ090 (0–27) | Regression target variable |
The target variable is a composite binary indicator of high healthcare demand, constructed from two complementary variables:
HUQ051 >= 4: the
participant had 4 or more healthcare visits in the past year, indicating
frequent outpatient utilisation.HUQ071 == 1: the
participant was hospitalised overnight at least once in the past year,
which is a strong signal of high healthcare need.A participant is classified as higher healthcare demand
(target = 1) if either condition is met. Those who
meet neither condition are classified as lower healthcare demand
(target = 0). Using hospitalisation as an
additional criterion captures high-need individuals who may not have had
many outpatient visits but whose health burden is clearly
significant.
The combined dataset has several important characteristics that affect later analysis.
Data includes both categorical and numerical variables. Many NHANES questionnaire variables are categorical because responses are recorded using coded answer options such as “Yes”, “No”, “Refused”, or “Don’t know”. At the same time, some demographic measures such as age and income ratio are numerical.
The dataset is cross-sectional. It captures respondents’ information at one point in time rather than following them over multiple years. This means the project is suitable for identifying associations and building predictive models based on current status, but cannot directly establish long-term causal effects.
Data comes from multiple files. The data structure is multi-source and modular. This allows the study to include different dimensions of elderly health, but also creates the need for proper merging, variable selection, and consistency checking.
The dataset includes both self-reported measures and health-related indicators. This provides a broader view of the elderly population. However, self-reported data may be subject to recall bias or reporting errors, especially for older respondents.
The dataset may contain missing values and coded
responses. Values such as 7 / 9
(Refused / Don’t know) and 77 / 99 /
777 / 999 for specific variables will be
treated as missing during preprocessing.
The dataset was constructed using selected NHANES 2017–2020 files:
DEMO, MCQ, BPQ, DIQ,
HUQ, CDQ, and SMQ. These files
cover demographic characteristics, chronic diseases, blood pressure and
cholesterol conditions, healthcare utilisation, chest pain and
cardiovascular symptoms, and smoking-related behaviours. Only
respondents aged 60 years and above were retained for this study.
demo <- read_xpt("DEMO.xpt")
mcq <- read_xpt("MCQ.xpt")
bpq <- read_xpt("BPQ.xpt")
diq <- read_xpt("DIQ.xpt")
huq <- read_xpt("HUQ.xpt")
cdq <- read_xpt("CDQ.xpt")
smq <- read_xpt("SMQ.xpt")## demo: 15560 rows, 29 cols
## mcq: 14986 rows, 63 cols
## bpq: 10195 rows, 11 cols
## diq: 14986 rows, 28 cols
## huq: 15560 rows, 7 cols
## cdq: 6433 rows, 17 cols
## smq: 11137 rows, 16 cols
## SEQN in demo: TRUE
## SEQN in mcq: TRUE
## SEQN in bpq: TRUE
## SEQN in diq: TRUE
## SEQN in huq: TRUE
## SEQN in cdq: TRUE
## SEQN in smq: TRUE
# Filter respondents aged 60 and above
df_60up <- demo %>% filter(RIDAGEYR >= 60)
cat("df_60up shape:", nrow(df_60up), "rows,", ncol(df_60up), "cols\n")## df_60up shape: 3422 rows, 29 cols
## # A tibble: 6 × 29
## SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 RIDEXMON
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 109274 66 2 1 68 NA 5 7 1
## 2 109282 66 2 1 76 NA 3 3 2
## 3 109290 66 2 2 68 NA 4 4 2
## 4 109298 66 2 1 68 NA 3 3 2
## 5 109313 66 2 1 63 NA 3 3 1
## 6 109316 66 2 2 62 NA 4 4 1
## # ℹ 20 more variables: DMDBORN4 <dbl>, DMDYRUSZ <dbl>, DMDEDUC2 <dbl>,
## # DMDMARTZ <dbl>, RIDEXPRG <dbl>, SIALANG <dbl>, SIAPROXY <dbl>,
## # SIAINTRP <dbl>, FIALANG <dbl>, FIAPROXY <dbl>, FIAINTRP <dbl>,
## # MIALANG <dbl>, MIAPROXY <dbl>, MIAINTRP <dbl>, AIALANGA <dbl>,
## # WTINTPRP <dbl>, WTMECPRP <dbl>, SDMVPSU <dbl>, SDMVSTRA <dbl>,
## # INDFMPIR <dbl>
# Left join all datasets on SEQN
merged <- df_60up %>%
left_join(mcq, by = "SEQN") %>%
left_join(bpq, by = "SEQN") %>%
left_join(diq, by = "SEQN") %>%
left_join(huq, by = "SEQN") %>%
left_join(cdq, by = "SEQN") %>%
left_join(smq, by = "SEQN")
cat("merged shape:", nrow(merged), "rows,", ncol(merged), "cols\n")## merged shape: 3422 rows, 165 cols
## # A tibble: 6 × 165
## SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3 RIDEXMON
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 109274 66 2 1 68 NA 5 7 1
## 2 109282 66 2 1 76 NA 3 3 2
## 3 109290 66 2 2 68 NA 4 4 2
## 4 109298 66 2 1 68 NA 3 3 2
## 5 109313 66 2 1 63 NA 3 3 1
## 6 109316 66 2 2 62 NA 4 4 1
## # ℹ 156 more variables: DMDBORN4 <dbl>, DMDYRUSZ <dbl>, DMDEDUC2 <dbl>,
## # DMDMARTZ <dbl>, RIDEXPRG <dbl>, SIALANG <dbl>, SIAPROXY <dbl>,
## # SIAINTRP <dbl>, FIALANG <dbl>, FIAPROXY <dbl>, FIAINTRP <dbl>,
## # MIALANG <dbl>, MIAPROXY <dbl>, MIAINTRP <dbl>, AIALANGA <dbl>,
## # WTINTPRP <dbl>, WTMECPRP <dbl>, SDMVPSU <dbl>, SDMVSTRA <dbl>,
## # INDFMPIR <dbl>, MCQ010 <dbl>, MCQ025 <dbl>, MCQ035 <dbl>, MCQ040 <dbl>,
## # MCQ050 <dbl>, AGQ030 <dbl>, MCQ053 <dbl>, MCQ080 <dbl>, MCQ092 <dbl>, …
## [1] "SEQN" "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
## [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "DMDBORN4" "DMDYRUSZ" "DMDEDUC2"
## [13] "DMDMARTZ" "RIDEXPRG" "SIALANG" "SIAPROXY" "SIAINTRP" "FIALANG"
## [19] "FIAPROXY" "FIAINTRP" "MIALANG" "MIAPROXY" "MIAINTRP" "AIALANGA"
## [25] "WTINTPRP" "WTMECPRP" "SDMVPSU" "SDMVSTRA" "INDFMPIR" "MCQ010"
## [31] "MCQ025" "MCQ035" "MCQ040" "MCQ050" "AGQ030" "MCQ053"
## [37] "MCQ080" "MCQ092" "MCD093" "MCQ149" "MCQ151" "RHD018"
## [43] "MCQ160A" "MCQ195" "MCQ160B" "MCD180B" "MCQ160C" "MCD180C"
## [49] "MCQ160D" "MCD180D" "MCQ160E" "MCD180E" "MCQ160F" "MCD180F"
## [55] "MCQ160M" "MCQ170M" "MCD180M" "MCQ160P" "MCQ160L" "MCQ170L"
## [61] "MCD180L" "MCQ500" "MCQ510A" "MCQ510B" "MCQ510C" "MCQ510D"
## [67] "MCQ510E" "MCQ510F" "MCQ520" "MCQ530" "MCQ540" "MCQ550"
## [73] "MCQ560" "MCQ570" "MCQ220" "MCQ230A" "MCQ230B" "MCQ230C"
## [79] "MCQ230D" "MCQ300B" "MCQ300C" "MCQ300A" "MCQ366A" "MCQ366B"
## [85] "MCQ366C" "MCQ366D" "MCQ371A" "MCQ371B" "MCQ371C" "MCQ371D"
## [91] "OSQ230" "BPQ020" "BPQ030" "BPD035" "BPQ040A" "BPQ050A"
## [97] "BPQ080" "BPQ060" "BPQ070" "BPQ090D" "BPQ100D" "DIQ010"
## [103] "DID040" "DIQ160" "DIQ180" "DIQ050" "DID060" "DIQ060U"
## [109] "DIQ070" "DIQ230" "DIQ240" "DID250" "DID260" "DIQ260U"
## [115] "DIQ275" "DIQ280" "DIQ291" "DIQ300S" "DIQ300D" "DID310S"
## [121] "DID310D" "DID320" "DID330" "DID341" "DID350" "DIQ350U"
## [127] "DIQ360" "DIQ080" "HUQ010" "HUQ030" "HUQ051" "HUD062"
## [133] "HUQ071" "HUQ090" "CDQ001" "CDQ002" "CDQ003" "CDQ004"
## [139] "CDQ005" "CDQ006" "CDQ009A" "CDQ009B" "CDQ009C" "CDQ009D"
## [145] "CDQ009E" "CDQ009F" "CDQ009G" "CDQ009H" "CDQ008" "CDQ010"
## [151] "SMQ020" "SMD030" "SMQ040" "SMQ050Q" "SMQ050U" "SMD057"
## [157] "SMQ078" "SMD641" "SMD650" "SMD100FL" "SMD100MN" "SMQ670"
## [163] "SMQ621" "SMD630" "SMAQUEX2"
Eight NHANES 2017–2020 files were used in this study. The first seven form the classification dataset; the eighth (DPQ) is used exclusively for the regression task. The table below summarises each dataset, its coverage, and the number of variables selected.
| Dataset | Coverage | Total Variables | Selected for Candidate Pool |
|---|---|---|---|
| DEMO | Demographics & socioeconomic status (age, gender, race, education, marital status, income) | 29 | 6 (RIAGENDR, RIDAGEYR, RIDRETH3, DMDEDUC2, DMDMARTZ, INDFMPIR) |
| MCQ | Self-reported chronic conditions (arthritis, cardiovascular diseases, COPD, cancer, anemia, etc.) | 58 | 8 (MCQ160A–F, MCQ160P, MCQ220) |
| BPQ | Blood pressure and cholesterol history and treatment | 11 | 3 (BPQ020, BPQ030, BPQ080) |
| DIQ | Diabetes diagnosis, treatment, and management | 36 | 4 (DIQ010, DIQ050, DIQ070, DIQ080) |
| CDQ | Chest pain and cardiovascular symptoms | 19 | 2 (CDQ001, CDQ010) |
| SMQ | Smoking history and current smoking behaviour | 16 | 4 (SMQ020, SMQ040, SMD650, SMD030) |
| HUQ | Healthcare utilisation (visit frequency, hospitalisation, mental health care) | 7 | 2 (HUQ051, HUQ071) |
| DPQ | Mental health — Depression Screener (PHQ-9) | 11 | 9 (DPQ010–DPQ090; summed into PHQ9_total for regression) |
A total of 29 variables are selected for the
classification dataset (final_df). The DPQ dataset
contributes an additional 9 items used exclusively for
the regression task via final_df_regression. Full variable
descriptions are provided in Section 3.5.
After merging the selected NHANES 2017–2020 datasets, an initial variable screening step was conducted. A total of 29 candidate variables were retained based on their relevance to the study objective. These variables included demographic variables, chronic disease variables, blood pressure and cholesterol variables, diabetes variables, smoking-related variables, chest symptom variables, and healthcare utilization variables.
The purpose of this step was to reduce the complexity of the merged dataset while keeping variables that may help explain health risk and healthcare demand among older adults. These candidate variables were then used for missing value inspection before further simplification.
# Select the 29 candidate variables
selected_vars <- c(
"SEQN",
# Demographic variables (6)
"RIAGENDR", "RIDAGEYR", "RIDRETH3", "DMDEDUC2", "DMDMARTZ", "INDFMPIR",
# Medical condition variables (8)
"MCQ160A", "MCQ160B", "MCQ160C", "MCQ160D", "MCQ160E", "MCQ160F",
"MCQ160P", "MCQ220",
# Blood pressure and cholesterol variables (3)
"BPQ020", "BPQ030", "BPQ080",
# Diabetes variables (4)
"DIQ010", "DIQ050", "DIQ070", "DIQ080",
# Chest symptom variables (2)
"CDQ001", "CDQ010",
# Smoking variables (4)
"SMQ020", "SMQ040", "SMD650", "SMD030",
# Healthcare utilization variables (2)
"HUQ051", "HUQ071"
)
selected_df <- merged %>% select(all_of(selected_vars))
cat("Selected dataset shape:", nrow(selected_df), "rows,", ncol(selected_df), "cols\n")## Selected dataset shape: 3422 rows, 30 cols
## # A tibble: 6 × 30
## SEQN RIAGENDR RIDAGEYR RIDRETH3 DMDEDUC2 DMDMARTZ INDFMPIR MCQ160A MCQ160B
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 109274 1 68 7 4 3 1.2 1 2
## 2 109282 1 76 3 5 1 3.61 2 2
## 3 109290 2 68 4 5 2 5 1 2
## 4 109298 1 68 3 1 1 1.75 1 2
## 5 109313 1 63 3 4 1 5 1 2
## 6 109316 2 62 4 3 2 0.07 1 2
## # ℹ 21 more variables: MCQ160C <dbl>, MCQ160D <dbl>, MCQ160E <dbl>,
## # MCQ160F <dbl>, MCQ160P <dbl>, MCQ220 <dbl>, BPQ020 <dbl>, BPQ030 <dbl>,
## # BPQ080 <dbl>, DIQ010 <dbl>, DIQ050 <dbl>, DIQ070 <dbl>, DIQ080 <dbl>,
## # CDQ001 <dbl>, CDQ010 <dbl>, SMQ020 <dbl>, SMQ040 <dbl>, SMD650 <dbl>,
## # SMD030 <dbl>, HUQ051 <dbl>, HUQ071 <dbl>
After variable selection, the dataset was inspected for incomplete observations before further preprocessing. Missing values are common in NHANES questionnaire data due to conditional survey logic and participant eligibility criteria. Therefore, missing value exploration was conducted to identify variables with substantial missingness and assess their potential impact on the analytical dataset.
# Calculate missing values counts and percentage
missing_count <- selected_df %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
tidyr::pivot_longer(everything(), names_to = "variable", values_to = "Missing Count")
missing_percent <- selected_df %>%
summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
tidyr::pivot_longer(everything(), names_to = "variable", values_to = "Missing Percentage")
missing_summary <- left_join(missing_count, missing_percent, by = "variable")
# Display only variables that have missing data
missing_summary %>%
filter(`Missing Count` > 0) %>%
arrange(desc(`Missing Count`))## # A tibble: 8 × 3
## variable `Missing Count` `Missing Percentage`
## <chr> <int> <dbl>
## 1 SMD650 2980 87.1
## 2 DIQ080 2490 72.8
## 3 DIQ050 2488 72.7
## 4 DIQ070 1954 57.1
## 5 SMQ040 1737 50.8
## 6 SMD030 1737 50.8
## 7 BPQ030 1322 38.6
## 8 INDFMPIR 538 15.7
In addition to standard missing values (NA), several
NHANES questionnaire variables contain special response codes
representing “Refused” (7, 77, and 777) or “Don’t know” (9, 99, and 999)
answers. These coded responses were identified and treated as missing
values during preprocessing to ensure consistency in downstream
analysis.
##
## 1 2 7 9 <NA>
## 1883 1498 2 39 0
# NHANES special response codes:
# [7, 9] -> default for most variables
# [77, 99] -> DMDMARTZ, HUQ051
# [9] -> HUQ071 (only "Don't know"; no Refused code)
# [777, 999] -> SMD030, SMD650
vars_excluded <- c("RIAGENDR", "RIDAGEYR", "RIDRETH3")
cols_to_clean <- setdiff(names(selected_df), vars_excluded)
special_code_map <- list(
default = c(7, 9),
DMDMARTZ = c(77, 99),
HUQ051 = c(77, 99),
HUQ071 = c(9),
SMD030 = c(777, 999),
SMD650 = c(777, 999)
)
# Count replacements per variable before replacing
replacement_counts <- sapply(cols_to_clean, function(col) {
codes <- if (col %in% names(special_code_map)) special_code_map[[col]] else special_code_map[["default"]]
sum(selected_df[[col]] %in% codes, na.rm = TRUE)
})
replacement_summary <- data.frame(
Variable = names(replacement_counts),
`Replacements Made` = replacement_counts,
row.names = NULL
) %>%
filter(`Replacements.Made` > 0) %>%
arrange(desc(`Replacements.Made`))
cat("Total replacements made:", sum(replacement_summary$Replacements.Made), "\n")## Total replacements made: 263
## Variable Replacements.Made
## 1 BPQ080 41
## 2 MCQ160D 29
## 3 MCQ160C 23
## 4 MCQ160B 20
## 5 MCQ160A 15
## 6 CDQ010 15
## 7 MCQ160P 14
## 8 BPQ030 14
## 9 SMD030 14
## 10 DMDEDUC2 12
## 11 DIQ080 11
## 12 MCQ160F 10
## 13 HUQ051 8
## 14 DMDMARTZ 7
## 15 MCQ160E 7
## 16 BPQ020 7
## 17 DIQ070 4
## 18 CDQ001 3
## 19 SMQ020 3
## 20 DIQ010 2
## 21 SMD650 2
## 22 MCQ220 1
## 23 HUQ071 1
upd_missing_count <- selected_df %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
tidyr::pivot_longer(everything(), names_to = "variable", values_to = "After Count")
upd_missing_percent <- selected_df %>%
summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
tidyr::pivot_longer(everything(), names_to = "variable", values_to = "After Percentage")
upd_missing_summary <- left_join(upd_missing_count, upd_missing_percent, by = "variable")
upd_missing_summary %>%
filter(`After Count` > 0) %>%
arrange(desc(`After Count`))## # A tibble: 26 × 3
## variable `After Count` `After Percentage`
## <chr> <int> <dbl>
## 1 SMD650 2982 87.1
## 2 DIQ080 2501 73.1
## 3 DIQ050 2488 72.7
## 4 DIQ070 1958 57.2
## 5 SMD030 1751 51.2
## 6 SMQ040 1737 50.8
## 7 BPQ030 1336 39.0
## 8 INDFMPIR 538 15.7
## 9 BPQ080 41 1.20
## 10 MCQ160D 29 0.847
## # ℹ 16 more rows
missing_comparison <- missing_summary %>%
rename(`Before Count` = `Missing Count`, `Before Percentage` = `Missing Percentage`) %>%
left_join(upd_missing_summary, by = "variable") %>%
mutate(
`Increase Count` = `After Count` - `Before Count`,
`Increase Percentage` = `After Percentage` - `Before Percentage`
) %>%
filter(`Increase Count` > 0) %>%
arrange(desc(`Increase Count`))
missing_comparison## # A tibble: 23 × 7
## variable `Before Count` `Before Percentage` `After Count` `After Percentage`
## <chr> <int> <dbl> <int> <dbl>
## 1 BPQ080 0 0 41 1.20
## 2 MCQ160D 0 0 29 0.847
## 3 MCQ160C 0 0 23 0.672
## 4 MCQ160B 0 0 20 0.584
## 5 MCQ160A 0 0 15 0.438
## 6 CDQ010 0 0 15 0.438
## 7 MCQ160P 0 0 14 0.409
## 8 BPQ030 1322 38.6 1336 39.0
## 9 SMD030 1737 50.8 1751 51.2
## 10 DMDEDUC2 0 0 12 0.351
## # ℹ 13 more rows
## # ℹ 2 more variables: `Increase Count` <int>, `Increase Percentage` <dbl>
After replacing survey-coded responses such as “Refused” and “Don’t know” with missing values, several questionnaire variables showed noticeable increases in missingness percentages. This indicates that a substantial portion of incomplete information in the NHANES dataset was originally represented using coded survey responses rather than standard null values.
The increase was particularly noticeable among chronic disease follow-up variables, further supporting the need for careful variable review during preprocessing.
# Identify variables with any missing values (excluding SEQN and RIDAGEYR)
miss_vars <- upd_missing_summary %>%
filter(`After Count` > 0, !variable %in% c("SEQN", "RIDAGEYR")) %>%
pull(variable)
# Build long-format binary matrix sorted by age
df_vis <- selected_df %>%
select(RIDAGEYR, all_of(miss_vars)) %>%
arrange(RIDAGEYR) %>%
mutate(row_id = row_number()) %>%
pivot_longer(cols = all_of(miss_vars),
names_to = "variable",
values_to = "value") %>%
mutate(is_missing = as.integer(is.na(value)))
ggplot(df_vis, aes(x = row_id, y = variable, fill = factor(is_missing))) +
geom_tile() +
scale_fill_manual(
values = c("0" = "white", "1" = "red"),
labels = c("0" = "Available", "1" = "Missing"),
name = "Data Status"
) +
labs(
x = "RIDAGEYR (Age Years, sorted)",
y = "Variable",
title = "Missing Value Pattern by RIDAGEYR Across Variables"
) +
theme_minimal() +
theme(
axis.text.y = element_text(size = 8),
plot.title = element_text(size = 12, face = "bold"),
legend.position = "right"
)As shown in the figure, missingness is not random across variables.
Diabetes follow-up variables (DIQ050, DIQ070,
DIQ080) and smoking variables (SMD650,
SMD030) show the highest missing rates, largely due to skip
patterns in the NHANES questionnaire — respondents who answered ‘No’ to
a screening question were not asked follow-up items. In contrast,
demographic variables and core disease indicators (BPQ020,
DIQ010, MCQ160A) have very low
missingness.
As a practical guideline, variables with missingness exceeding
approximately 40% were considered for exclusion unless they were
regarded as clinically important. Several diabetes follow-up variables
(DIQ050, DIQ070, DIQ080) and
smoking consumption variables (SMD650, SMD030)
were removed due to substantial skip-pattern missingness exceeding this
threshold.
SMQ040 was retained despite its relatively high
missingness because it provides important smoking behaviour information
required for subsequent feature engineering — it is later combined with
SMQ020 to construct the smoking_status
feature.
remove_vars <- c("DIQ050", "DIQ070", "DIQ080", "SMD650", "SMD030")
selected_df <- selected_df %>% select(-all_of(remove_vars))
cat("Dataset shape after removing high-missingness variables:",
nrow(selected_df), "rows,", ncol(selected_df), "cols\n")## Dataset shape after removing high-missingness variables: 3422 rows, 25 cols
| Variable | Codebook Description | Notes |
|---|---|---|
| SEQN | Respondent sequence number | Unique respondent identifier used for merging datasets |
| RIAGENDR | Gender | Demographic feature |
| RIDAGEYR | Age in years at screening | Used to retain participants aged 60 and above |
| RIDRETH3 | Race/Hispanic origin w/ Non-Hispanic Asian | Demographic feature |
| DMDEDUC2 | Education level - Adults 20+ | Socioeconomic feature |
| DMDMARTZ | Marital status | Demographic and social feature |
| INDFMPIR | Ratio of family income to poverty | Socioeconomic feature |
| Variable | Codebook Description | Notes |
|---|---|---|
| MCQ160A | Doctor ever said you had arthritis | Arthritis disease variable |
| MCQ160B | Ever told had congestive heart failure | Cardiovascular disease variable |
| MCQ160C | Ever told you had coronary heart disease | Cardiovascular disease variable |
| MCQ160D | Ever told you had angina/angina pectoris | Cardiovascular disease variable |
| MCQ160E | Ever told you had heart attack | Cardiovascular disease variable |
| MCQ160F | Ever told you had a stroke | Cardiovascular / neurological disease variable |
| MCQ160P | Ever told you had COPD, emphysema, ChB | Pulmonary disease variable |
| MCQ220 | Ever told you had cancer or malignancy | Cancer disease variable |
| Variable | Codebook Description | Notes |
|---|---|---|
| BPQ020 | Ever told you had high blood pressure | Hypertension variable |
| BPQ030 | Told had high blood pressure 2+ times | Hypertension variable; combined with BPQ020 into hypertension_status |
| BPQ080 | Dr told you - high cholesterol level | Cholesterol variable |
| Variable | Codebook Description | Notes |
|---|---|---|
| DIQ010 | Doctor told you have diabetes | Diabetes variable; borderline (value = 3) treated as Yes |
| Variable | Codebook Description | Notes |
|---|---|---|
| CDQ001 | SP ever had pain or discomfort in chest | Chest symptom variable |
| CDQ010 | Shortness of breath on stairs/inclines | Chest / cardiovascular symptom variable |
| Variable | Codebook Description | Notes |
|---|---|---|
| SMQ020 | Smoked at least 100 cigarettes in life | Smoking history variable; combined with SMQ040 into smoking_status |
| SMQ040 | Do you now smoke cigarettes? | Current smoking status variable |
| Variable | Codebook Description | Notes |
|---|---|---|
| HUQ051 | # times receive healthcare over past year | Used for composite target construction |
| HUQ071 | Overnight hospital patient in last year | Hospitalisation indicator; used for composite target construction |
Feature engineering was performed to simplify overlapping questionnaire variables and improve the usability of selected predictors for machine learning modelling. Several related health indicators were consolidated into broader features to reduce sparsity, preserve clinically meaningful information, and improve overall dataset interpretability.
The variables MCQ160B, MCQ160C,
MCQ160D, MCQ160E, and MCQ160F
were consolidated into a single binary feature
cardio_disease. A participant is labelled as having
cardiovascular disease if at least one of the selected cardiovascular
conditions was reported (1 = Yes, 0 = No). Since these variables had
relatively low missingness and were combined into a broader binary
indicator, missing responses were treated as absence of the condition to
preserve participant records.
df_slim <- selected_df
# Inspect cardiovascular variables before engineering
cardio_vars <- c("MCQ160B", "MCQ160C", "MCQ160D", "MCQ160E", "MCQ160F")
for (col in cardio_vars) {
cat(col, ":\n")
print(table(df_slim[[col]], useNA = "always"))
}## MCQ160B :
##
## 1 2 <NA>
## 278 3124 20
## MCQ160C :
##
## 1 2 <NA>
## 345 3054 23
## MCQ160D :
##
## 1 2 <NA>
## 178 3215 29
## MCQ160E :
##
## 1 2 <NA>
## 333 3082 7
## MCQ160F :
##
## 1 2 <NA>
## 347 3065 10
# Combine into cardio_disease: 1 if any cardiovascular condition reported
df_slim <- df_slim %>%
mutate(
cardio_disease = as.integer(
(!is.na(MCQ160B) & MCQ160B == 1) |
(!is.na(MCQ160C) & MCQ160C == 1) |
(!is.na(MCQ160D) & MCQ160D == 1) |
(!is.na(MCQ160E) & MCQ160E == 1) |
(!is.na(MCQ160F) & MCQ160F == 1)
)
)
cat("\ncardio_disease distribution:\n")##
## cardio_disease distribution:
##
## 0 1 <NA>
## 2549 873 0
# Remove original cardiovascular variables
df_slim <- df_slim %>% select(-all_of(cardio_vars))
cat("\nShape after removing original cardio vars:",
nrow(df_slim), "rows,", ncol(df_slim), "cols\n")##
## Shape after removing original cardio vars: 3422 rows, 21 cols
The variables BPQ020 and BPQ030 were
combined to construct a simplified hypertension_status
feature. BPQ030 is a follow-up question administered only
to participants reporting prior hypertension diagnosis, resulting in
substantial skip-pattern missingness. To preserve hypertension-related
information while reducing sparsity, participants reporting either prior
hypertension diagnosis (BPQ020 = 1) or repeated
hypertension confirmation (BPQ030 = 1) were classified as
having hypertension (hypertension_status = 1).
## BPQ020:
##
## 1 2 <NA>
## 2100 1315 7
##
## BPQ030:
##
## 1 2 <NA>
## 1700 386 1336
# Combine into hypertension_status
df_slim <- df_slim %>%
mutate(
hypertension_status = as.integer(
(!is.na(BPQ020) & BPQ020 == 1) |
(!is.na(BPQ030) & BPQ030 == 1)
)
)
cat("\nhypertension_status distribution:\n")##
## hypertension_status distribution:
##
## 0 1 <NA>
## 1322 2100 0
# Remove original hypertension variables
df_slim <- df_slim %>% select(-c(BPQ020, BPQ030))
cat("\nShape after removing original hypertension vars:",
nrow(df_slim), "rows,", ncol(df_slim), "cols\n")##
## Shape after removing original hypertension vars: 3422 rows, 20 cols
The variables SMQ020 and SMQ040 were
combined into a simplified smoking_status feature
representing overall smoking exposure. SMQ040 is a
follow-up question only asked to respondents who reported smoking at
least 100 cigarettes in their lifetime (SMQ020 = 1),
resulting in skip-pattern missingness.
## SMQ020:
##
## 1 2 <NA>
## 1685 1734 3
##
## SMQ040:
##
## 1 2 3 <NA>
## 357 95 1233 1737
# Combine into smoking_status
df_slim <- df_slim %>%
mutate(
smoking_status = case_when(
SMQ040 %in% c(1, 2) ~ 1L, # current smoker
SMQ040 == 3 ~ 0L, # former smoker
SMQ020 == 2 ~ 0L, # never smoked
TRUE ~ NA_integer_
)
)
cat("\nsmoking_status distribution:\n")##
## smoking_status distribution:
##
## 0 1 <NA>
## 2967 452 3
# Remove original smoking variables
df_slim <- df_slim %>% select(-c(SMQ020, SMQ040))
cat("\nShape after removing original smoking vars:",
nrow(df_slim), "rows,", ncol(df_slim), "cols\n")##
## Shape after removing original smoking vars: 3422 rows, 19 cols
| Variable | Description | Notes |
|---|---|---|
| cardio_disease | Derived cardiovascular disease indicator | Combined from MCQ160B–F; original variables removed |
| hypertension_status | Derived hypertension indicator | Combined from BPQ020 and BPQ030; original variables removed |
| smoking_status | Derived smoking-related predictor | Combined from SMQ020 and SMQ040; original variables removed |
A composite binary target variable was constructed using two
complementary indicators of healthcare demand: HUQ051
(number of healthcare visits over the past year) and HUQ071
(whether the participant was an overnight hospital patient in the past
year).
A participant is classified as the higher healthcare demand
group (healthcare_demand = 1) if either of the
following conditions is met:
HUQ051 >= 4 — the
participant had 4 or more healthcare visits in the past year, indicating
frequent outpatient utilisation.HUQ071 == 1 — the
participant was hospitalised overnight at least once in the past year,
which is a direct and objective indicator of high medical need
independent of access barriers.Participants who meet neither condition are classified as the
lower healthcare demand group
(healthcare_demand = 0).
## HUQ051 distribution:
##
## <NA> 7 5 8 0 6 4 1 3 2
## 8 80 153 191 219 259 350 419 681 1062
##
## HUQ071 distribution:
##
## 1 2 <NA>
## 598 2823 1
# Fix HUQ071 to integer (may be stored as float from SAS import)
df_slim <- df_slim %>%
mutate(HUQ071 = as.integer(round(HUQ071)))
cat("\nHUQ071 distribution (after fix):\n")##
## HUQ071 distribution (after fix):
##
## 1 2 <NA>
## 598 2823 1
# Construct composite binary target
df_slim <- df_slim %>%
mutate(
healthcare_demand = case_when(
HUQ051 >= 4 ~ 1L, # Condition 1: frequent outpatient visits
HUQ071 == 1 ~ 1L, # Condition 2: hospitalised overnight
!is.na(HUQ051) & !is.na(HUQ071) ~ 0L, # Neither condition met
TRUE ~ NA_integer_ # Missing on both
)
)
df_slim %>% select(HUQ051, HUQ071, healthcare_demand) %>% head(10)## # A tibble: 10 × 3
## HUQ051 HUQ071 healthcare_demand
## <dbl> <int> <int>
## 1 0 2 0
## 2 8 2 1
## 3 2 2 0
## 4 4 2 1
## 5 3 1 1
## 6 3 2 0
## 7 3 2 0
## 8 3 2 0
## 9 1 2 0
## 10 3 2 0
##
## healthcare_demand distribution:
##
## 0 1 <NA>
## 2111 1306 5
# Remove source target variables after construction
df_slim <- df_slim %>% select(-c(HUQ051, HUQ071))## healthcare_demand NA count: 5
## healthcare_demand unique values: 0 1
# Ensure only valid binary values (0/1); treat anything else as NA
df_slim <- df_slim %>%
mutate(healthcare_demand = if_else(
healthcare_demand %in% c(0L, 1L), healthcare_demand, NA_integer_
))
# Remove rows where target is NA
df_slim <- df_slim %>% filter(!is.na(healthcare_demand))
cat("\nAfter removing NA targets:\n")##
## After removing NA targets:
## Rows remaining: 3417
## healthcare_demand distribution:
##
## 0 1
## 2111 1306
cat("Class ratio (High / Low):",
round(sum(df_slim$healthcare_demand == 1) / sum(df_slim$healthcare_demand == 0), 3), "\n")## Class ratio (High / Low): 0.619
Following feature engineering and target construction, the dataset was reassessed for remaining incomplete observations. Since sparse questionnaire variables had already been consolidated or removed, remaining missingness was expected to be substantially reduced.
remaining_missing <- df_slim %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
tidyr::pivot_longer(everything(), names_to = "variable", values_to = "Missing Count") %>%
mutate(`Missing Percentage` = round(`Missing Count` / nrow(df_slim) * 100, 2)) %>%
filter(`Missing Count` > 0) %>%
arrange(desc(`Missing Count`))
print(remaining_missing)## # A tibble: 11 × 3
## variable `Missing Count` `Missing Percentage`
## <chr> <int> <dbl>
## 1 INDFMPIR 537 15.7
## 2 BPQ080 41 1.2
## 3 MCQ160A 15 0.44
## 4 MCQ160P 14 0.41
## 5 CDQ010 14 0.41
## 6 DMDEDUC2 11 0.32
## 7 DMDMARTZ 7 0.2
## 8 CDQ001 3 0.09
## 9 smoking_status 3 0.09
## 10 DIQ010 2 0.06
## 11 MCQ220 1 0.03
INDFMPIR (income-to-poverty ratio) retains moderate
missingness due to socioeconomic non-response. As it is an important
predictor for healthcare demand analysis, median
imputation is applied to preserve participant records. The
remaining variables with minimal missingness are handled using
complete-case analysis (row-wise removal).
# Median imputation for INDFMPIR
df_slim <- df_slim %>%
mutate(INDFMPIR = if_else(
is.na(INDFMPIR),
median(INDFMPIR, na.rm = TRUE),
INDFMPIR
))
# Complete-case analysis for remaining variables
rows_before <- nrow(df_slim)
df_slim <- df_slim %>% drop_na()
rows_after <- nrow(df_slim)
cat("Rows removed during complete-case analysis:", rows_before - rows_after, "\n")## Rows removed during complete-case analysis: 97
## Final dataset shape: 3320 rows, 18 cols
To improve dataset readability and simplify subsequent analysis, selected NHANES variable names were renamed into more interpretable labels while preserving their original analytical meaning.
df_slim <- df_slim %>%
rename(
gender = RIAGENDR,
age = RIDAGEYR,
ethnicity = RIDRETH3,
education_level = DMDEDUC2,
marital_status = DMDMARTZ,
income_poverty_ratio = INDFMPIR,
arthritis = MCQ160A,
pulmonary_disease = MCQ160P,
cancer = MCQ220,
high_cholesterol = BPQ080,
diabetes = DIQ010,
chest_pain = CDQ001,
shortness_breath = CDQ010
)
glimpse(df_slim)## Rows: 3,320
## Columns: 18
## $ SEQN <dbl> 109274, 109282, 109290, 109298, 109313, 109316, 1…
## $ gender <dbl> 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 1…
## $ age <dbl> 68, 76, 68, 68, 63, 62, 76, 69, 63, 66, 80, 80, 6…
## $ ethnicity <dbl> 7, 3, 4, 3, 3, 4, 6, 3, 3, 2, 3, 2, 6, 3, 1, 1, 4…
## $ education_level <dbl> 4, 5, 5, 1, 4, 3, 4, 3, 5, 2, 3, 2, 5, 3, 3, 1, 4…
## $ marital_status <dbl> 3, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2…
## $ income_poverty_ratio <dbl> 1.20, 3.61, 5.00, 1.75, 5.00, 0.07, 2.37, 4.31, 5…
## $ arthritis <dbl> 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1…
## $ pulmonary_disease <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2…
## $ cancer <dbl> 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2…
## $ high_cholesterol <dbl> 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1…
## $ diabetes <dbl> 1, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 1…
## $ chest_pain <dbl> 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2…
## $ shortness_breath <dbl> 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1…
## $ cardio_disease <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1…
## $ hypertension_status <int> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ smoking_status <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ healthcare_demand <int> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0…
Several questionnaire variables were recoded into binary
representations (1 = Yes, 0 = No) to maintain consistency across
predictors. The diabetes variable originally contained an
additional “Borderline” response category (value = 3); borderline
responses were grouped with positive diabetes responses during
recoding.
binary_cols <- c(
"high_cholesterol", "arthritis", "pulmonary_disease",
"cancer", "chest_pain", "shortness_breath"
)
df_slim <- df_slim %>%
mutate(across(all_of(binary_cols), ~ case_when(
. == 1 ~ 1L,
. == 2 ~ 0L,
TRUE ~ NA_integer_
)))
# Diabetes: 1 = Yes, 2 = No, 3 = Borderline (treated as Yes)
df_slim <- df_slim %>%
mutate(diabetes = case_when(
diabetes == 1 ~ 1L,
diabetes == 3 ~ 1L, # borderline treated as Yes
diabetes == 2 ~ 0L,
TRUE ~ NA_integer_
))
cat("diabetes distribution after recoding:\n")## diabetes distribution after recoding:
##
## 0 1 <NA>
## 2286 1034 0
Two additional features were constructed after variable renaming to improve predictive signal for machine learning modelling.
chronic_count sums all binary disease
indicators for each participant, representing the total number of
chronic conditions reported. From the feature importance analysis,
almost all disease variables contribute similarly (0.10–0.14),
suggesting the model is implicitly counting conditions. Providing an
explicit count feature makes this signal cleaner and more direct.
disease_age_interaction multiplies
chronic_count by age, capturing the combined
effect of disease burden and ageing. Older participants with more
conditions are expected to have disproportionately higher healthcare
demand than younger participants with the same number of conditions.
disease_cols <- c(
"arthritis", "pulmonary_disease", "cancer", "high_cholesterol",
"diabetes", "chest_pain", "shortness_breath",
"cardio_disease", "hypertension_status"
)
# chronic_count: total number of chronic conditions reported
df_slim <- df_slim %>%
mutate(
chronic_count = rowSums(select(., all_of(disease_cols)), na.rm = TRUE),
disease_age_interaction = chronic_count * age
)
cat("chronic_count distribution:\n")## chronic_count distribution:
##
## 0 1 2 3 4 5 6 7 8 9
## 223 412 539 620 633 406 291 138 49 9
##
## disease_age_interaction summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 130.0 237.0 236.2 320.0 720.0
final_df <- df_slim
cat("Final modeling dataset shape:", nrow(final_df), "rows,", ncol(final_df), "cols\n")## Final modeling dataset shape: 3320 rows, 20 cols
##
## healthcare_demand distribution:
##
## 0 1
## 2064 1256
cat("\nClass ratio (High / Low):",
round(sum(final_df$healthcare_demand == 1) / sum(final_df$healthcare_demand == 0), 3), "\n")##
## Class ratio (High / Low): 0.609
## # A tibble: 6 × 20
## SEQN gender age ethnicity education_level marital_status
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 109274 1 68 7 4 3
## 2 109282 1 76 3 5 1
## 3 109290 2 68 4 5 2
## 4 109298 1 68 3 1 1
## 5 109313 1 63 3 4 1
## 6 109316 2 62 4 3 2
## # ℹ 14 more variables: income_poverty_ratio <dbl>, arthritis <int>,
## # pulmonary_disease <int>, cancer <int>, high_cholesterol <int>,
## # diabetes <int>, chest_pain <int>, shortness_breath <int>,
## # cardio_disease <int>, hypertension_status <int>, smoking_status <int>,
## # healthcare_demand <int>, chronic_count <dbl>, disease_age_interaction <dbl>
This final_df object is the cleaned classification
dataset and is used directly by the exploratory data analysis (Section
4) and the classification analysis (Section 6). No intermediate CSV file
is required because all sections run within the same R session.
This section constructs a separate dataset for the regression analysis task. The regression problem asks:
“How severe is depression among older adults, given their demographic characteristics and chronic disease burden?”
The target variable is PHQ9_total
(Patient Health Questionnaire-9 total score, range 0–27), derived from
the NHANES DPQ dataset. A higher score indicates greater depressive
symptom severity.
This dataset is constructed by joining the cleaned PHQ-9 data with
final_df. The two datasets serve different research
questions within the same project theme.
# Load DPQ dataset (depression questionnaire)
dpq_raw <- read_xpt("DPQ.xpt")
cat("DPQ raw shape:", nrow(dpq_raw), "rows,", ncol(dpq_raw), "cols\n")## DPQ raw shape: 8965 rows, 11 cols
# PHQ-9 item variables
dpq_items <- paste0("DPQ0", c("10", "20", "30", "40", "50", "60", "70", "80", "90"))
# Replace special codes [7, 9] with NA; remove rows with any missing PHQ-9 item
dpq_clean <- dpq_raw %>%
mutate(across(all_of(dpq_items), ~ ifelse(. %in% c(7, 9), NA, .))) %>%
filter(if_all(all_of(dpq_items), ~ !is.na(.))) %>%
mutate(PHQ9_total = rowSums(across(all_of(dpq_items)))) %>%
select(SEQN, PHQ9_total)
cat("Rows before cleaning:", nrow(dpq_raw), "\n")## Rows before cleaning: 8965
## Rows after cleaning: 8276
## Rows removed: 689
##
## PHQ9_total summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 2.000 3.302 5.000 27.000
# Merge with final_df using inner_join on SEQN.
# inner_join keeps only participants present in BOTH datasets,
# ensuring every row has a valid PHQ9_total.
final_df_regression <- inner_join(dpq_clean, final_df, by = "SEQN")
cat("final_df rows (classification): ", nrow(final_df), "\n")## final_df rows (classification): 3320
## final_df_regression rows (regression): 2789
## Rows difference: 531
cat("\nfinal_df_regression shape:", nrow(final_df_regression), "rows,",
ncol(final_df_regression), "cols\n")##
## final_df_regression shape: 2789 rows, 21 cols
##
## PHQ9_total distribution in regression dataset:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 3.074 4.000 25.000
## # A tibble: 6 × 21
## SEQN PHQ9_total gender age ethnicity education_level marital_status
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 109274 0 1 68 7 4 3
## 2 109282 5 1 76 3 5 1
## 3 109290 2 2 68 4 5 2
## 4 109313 0 1 63 3 4 1
## 5 109316 1 2 62 4 3 2
## 6 109330 0 1 76 6 4 1
## # ℹ 14 more variables: income_poverty_ratio <dbl>, arthritis <int>,
## # pulmonary_disease <int>, cancer <int>, high_cholesterol <int>,
## # diabetes <int>, chest_pain <int>, shortness_breath <int>,
## # cardio_disease <int>, hypertension_status <int>, smoking_status <int>,
## # healthcare_demand <int>, chronic_count <dbl>, disease_age_interaction <dbl>
This final_df_regression object is used directly by the
regression analysis (Section 5).
This section presents an exploratory data analysis (EDA) of the final
modelling dataset final_df, produced by the data
preparation stage from the NHANES 2017–2020 survey. The goal of EDA is
to understand the structure of the data, examine the distribution of
each variable, and identify potential relationships between the
predictors and the target variable healthcare_demand before
modelling.
# Use the cleaned classification dataset produced in Section 3 directly.
# SEQN is only a respondent identifier, not a predictor -- drop it for analysis.
eda_df <- final_df %>% select(-SEQN)
cat("Final dataset:", nrow(eda_df), "rows,", ncol(eda_df), "cols\n")## Final dataset: 3320 rows, 19 cols
## Rows: 3320
## Columns: 19
## Rows: 3,320
## Columns: 19
## $ gender <dbl> 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1…
## $ age <dbl> 68, 76, 68, 68, 63, 62, 76, 69, 63, 66, 80, 80…
## $ ethnicity <dbl> 7, 3, 4, 3, 3, 4, 6, 3, 3, 2, 3, 2, 6, 3, 1, 1…
## $ education_level <dbl> 4, 5, 5, 1, 4, 3, 4, 3, 5, 2, 3, 2, 5, 3, 3, 1…
## $ marital_status <dbl> 3, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2…
## $ income_poverty_ratio <dbl> 1.20, 3.61, 5.00, 1.75, 5.00, 0.07, 2.37, 4.31…
## $ arthritis <int> 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0…
## $ pulmonary_disease <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ cancer <int> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0…
## $ high_cholesterol <int> 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0…
## $ diabetes <int> 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0…
## $ chest_pain <int> 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ shortness_breath <int> 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0…
## $ cardio_disease <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ hypertension_status <int> 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1…
## $ smoking_status <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ healthcare_demand <int> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0…
## $ chronic_count <dbl> 4, 5, 3, 6, 3, 6, 2, 2, 2, 3, 5, 4, 2, 6, 3, 1…
## $ disease_age_interaction <dbl> 272, 380, 204, 408, 189, 372, 152, 138, 126, 1…
## gender age ethnicity education_level marital_status
## Min. :1.000 Min. :60.00 Min. :1.000 Min. :1.000 Min. :1.00
## 1st Qu.:1.000 1st Qu.:64.00 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:1.00
## Median :1.000 Median :69.00 Median :3.000 Median :4.000 Median :1.00
## Mean :1.493 Mean :69.96 Mean :3.405 Mean :3.397 Mean :1.51
## 3rd Qu.:2.000 3rd Qu.:77.00 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.00
## Max. :2.000 Max. :80.00 Max. :7.000 Max. :5.000 Max. :3.00
## income_poverty_ratio arthritis pulmonary_disease cancer
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.450 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2.200 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :2.560 Mean :0.5211 Mean :0.1533 Mean :0.2148
## 3rd Qu.:3.683 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :5.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## high_cholesterol diabetes chest_pain shortness_breath
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.556 Mean :0.3114 Mean :0.3009 Mean :0.4193
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## cardio_disease hypertension_status smoking_status healthcare_demand
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :0.000 Median :0.0000
## Mean :0.2533 Mean :0.6123 Mean :0.131 Mean :0.3783
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1.0000
## chronic_count disease_age_interaction
## Min. :0.000 Min. : 0.0
## 1st Qu.:2.000 1st Qu.:130.0
## Median :3.000 Median :237.0
## Mean :3.342 Mean :236.2
## 3rd Qu.:5.000 3rd Qu.:320.0
## Max. :9.000 Max. :720.0
The final dataset contains 3320 respondents and
19 analysis variables (18 predictors + 1 target). The
target variable is healthcare_demand. Most predictors are
binary disease indicators coded as 1 = Yes and
0 = No. The continuous variables are age,
income_poverty_ratio, chronic_count, and
disease_age_interaction; the remaining demographic
variables (gender, ethnicity,
education_level, marital_status) are
categorical.
The target variable healthcare_demand is a
composite indicator of high healthcare demand. A
respondent is classified into the higher demand group
(healthcare_demand = 1) if either of the following
holds: they had 4 or more healthcare visits in the past year,
or they were hospitalised overnight at least once in
the past year. Respondents meeting neither condition form the
lower demand group
(healthcare_demand = 0).
target_df <- eda_df %>%
count(healthcare_demand) %>%
mutate(
label = ifelse(healthcare_demand == 0,
"Lower demand\n(0)", "Higher demand\n(1)"),
pct = round(n / sum(n) * 100, 1),
pct_label = paste0(n, "\n(", pct, "%)")
)
ggplot(target_df, aes(x = label, y = n, fill = label)) +
geom_bar(stat = "identity", width = 0.5, show.legend = FALSE) +
geom_text(aes(label = pct_label), vjust = -0.3, size = 4.5) +
scale_fill_manual(values = c("Lower demand\n(0)" = "steelblue",
"Higher demand\n(1)" = "tomato")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Figure 4.1: Target Variable Distribution",
x = "Healthcare Demand Group",
y = "Number of Respondents"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))##
## 0 1
## 2064 1256
cat("Class ratio (High / Low):",
round(sum(eda_df$healthcare_demand == 1) /
sum(eda_df$healthcare_demand == 0), 3), "\n")## Class ratio (High / Low): 0.609
The lower demand group accounts for 62.2% of the sample (n = 2064) and the higher demand group for 37.8% (n = 1256). The class imbalance is mild, so standard classification models should train without aggressive resampling. Class weighting and threshold-aware metrics (F1, AUC-ROC) will still be reported alongside accuracy during modelling.
ggplot(eda_df, aes(x = age)) +
geom_histogram(aes(y = after_stat(density)),
binwidth = 2, fill = "steelblue", colour = "white", alpha = 0.8) +
geom_density(colour = "navy", linewidth = 1) +
geom_vline(xintercept = mean(eda_df$age),
linetype = "dashed", colour = "tomato", linewidth = 0.8) +
annotate("text",
x = mean(eda_df$age) + 2,
y = 0.05,
label = paste0("Mean = ", round(mean(eda_df$age), 1)),
colour = "tomato", size = 3.8) +
labs(
title = "Figure 4.2: Age Distribution of Respondents",
x = "Age (Years)",
y = "Density"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))eda_df %>%
mutate(Target = ifelse(healthcare_demand == 0,
"Lower demand (0)", "Higher demand (1)")) %>%
ggplot(aes(x = age, fill = Target)) +
geom_histogram(binwidth = 2, position = "identity", alpha = 0.6, colour = "white") +
scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
"Higher demand (1)" = "tomato")) +
labs(
title = "Figure 4.3: Age Distribution by Healthcare Demand Group",
x = "Age (Years)",
y = "Count",
fill = "Group"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))The age of respondents ranges from 60 to 80 years, with a mean of approximately 70 years. NHANES top-codes age at 80, which produces a spike at the upper bound. The higher demand group skews slightly older, consistent with the expectation that healthcare need rises with age.
gender_df <- eda_df %>%
mutate(Gender = ifelse(gender == 1, "Male", "Female"),
Target = ifelse(healthcare_demand == 0,
"Lower demand (0)", "Higher demand (1)")) %>%
count(Gender, Target) %>%
group_by(Gender) %>%
mutate(pct = round(n / sum(n) * 100, 1))
ggplot(gender_df, aes(x = Gender, y = n, fill = Target)) +
geom_bar(stat = "identity", position = "dodge", width = 0.6) +
geom_text(aes(label = paste0(pct, "%")),
position = position_dodge(width = 0.6), vjust = -0.4, size = 3.8) +
scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
"Higher demand (1)" = "tomato")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Figure 4.4: Gender Distribution by Healthcare Demand Group",
x = "Gender",
y = "Count",
fill = "Group"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))The sample is nearly evenly split between males (n = 1684) and females (n = 1636). The proportion in higher demand is similar across genders, suggesting gender alone is not a strong distinguishing factor.
# NOTE: this dataset uses RIDRETH3 coding (renamed to `ethnicity`).
# Codes present are 1,2,3,4,6,7 -- there is NO code 5 in RIDRETH3.
# Code 6 = Non-Hispanic Asian, 7 = Other / Multiracial.
race_df <- eda_df %>%
mutate(Race = case_when(
ethnicity == 1 ~ "Mexican\nAmerican",
ethnicity == 2 ~ "Other\nHispanic",
ethnicity == 3 ~ "Non-Hispanic\nWhite",
ethnicity == 4 ~ "Non-Hispanic\nBlack",
ethnicity == 6 ~ "Non-Hispanic\nAsian",
ethnicity == 7 ~ "Other /\nMultiracial"
),
Target = ifelse(healthcare_demand == 0,
"Lower demand (0)", "Higher demand (1)")) %>%
count(Race, Target) %>%
group_by(Race) %>%
mutate(pct = round(n / sum(n) * 100, 1))
ggplot(race_df, aes(x = Race, y = n, fill = Target)) +
geom_bar(stat = "identity", position = "dodge", width = 0.65) +
geom_text(aes(label = paste0(pct, "%")),
position = position_dodge(width = 0.65), vjust = -0.4, size = 3.0) +
scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
"Higher demand (1)" = "tomato")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Figure 4.5: Race/Ethnicity Distribution by Healthcare Demand Group",
x = "Race / Ethnicity",
y = "Count",
fill = "Group"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))Non-Hispanic White respondents form the largest group (n = 1452), followed by Non-Hispanic Black (n = 878). Differences in the proportion of higher demand across racial groups may reflect healthcare access disparities and will be revisited during modelling.
edu_df <- eda_df %>%
mutate(Education = case_when(
education_level == 1 ~ "< 9th\nGrade",
education_level == 2 ~ "9-11th\nGrade",
education_level == 3 ~ "High School\n/ GED",
education_level == 4 ~ "Some\nCollege",
education_level == 5 ~ "College\nGraduate+"
),
Target = ifelse(healthcare_demand == 0,
"Lower demand (0)", "Higher demand (1)")) %>%
count(Education, Target) %>%
group_by(Education) %>%
mutate(pct = round(n / sum(n) * 100, 1))
edu_df$Education <- factor(edu_df$Education,
levels = c("< 9th\nGrade","9-11th\nGrade","High School\n/ GED",
"Some\nCollege","College\nGraduate+"))
ggplot(edu_df, aes(x = Education, y = n, fill = Target)) +
geom_bar(stat = "identity", position = "dodge", width = 0.65) +
geom_text(aes(label = paste0(pct, "%")),
position = position_dodge(width = 0.65), vjust = -0.4, size = 3.0) +
scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
"Higher demand (1)" = "tomato")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Figure 4.6: Education Level by Healthcare Demand Group",
x = "Education Level",
y = "Count",
fill = "Group"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))Education spreads across five categories, with “Some College” and “College Graduate+” the most common. Education is a useful socioeconomic proxy and is examined here for any gradient in healthcare demand.
# In this dataset marital_status takes values 1, 2, 3 only.
marital_df <- eda_df %>%
mutate(Marital = case_when(
marital_status == 1 ~ "Married /\nWith Partner",
marital_status == 2 ~ "Widowed / Divorced\n/ Separated",
marital_status == 3 ~ "Never\nMarried"
),
Target = ifelse(healthcare_demand == 0,
"Lower demand (0)", "Higher demand (1)")) %>%
count(Marital, Target) %>%
group_by(Marital) %>%
mutate(pct = round(n / sum(n) * 100, 1))
ggplot(marital_df, aes(x = Marital, y = n, fill = Target)) +
geom_bar(stat = "identity", position = "dodge", width = 0.6) +
geom_text(aes(label = paste0(pct, "%")),
position = position_dodge(width = 0.6), vjust = -0.4, size = 3.5) +
scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
"Higher demand (1)" = "tomato")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Figure 4.7: Marital Status by Healthcare Demand Group",
x = "Marital Status",
y = "Count",
fill = "Group"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))The majority of respondents are married or living with a partner (n = 1829). Social support from a partner may influence health-seeking behaviour and healthcare utilisation.
ggplot(eda_df, aes(x = income_poverty_ratio)) +
geom_histogram(binwidth = 0.5, fill = "steelblue", colour = "white", alpha = 0.85) +
geom_vline(xintercept = median(eda_df$income_poverty_ratio),
linetype = "dashed", colour = "tomato", linewidth = 0.8) +
labs(
title = "Figure 4.8: Income-to-Poverty Ratio Distribution",
x = "Income-to-Poverty Ratio (0 = poorest, 5 = top-coded)",
y = "Count"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))The income-to-poverty ratio has a median of 2.2. NHANES top-codes this ratio at 5, producing a cluster at the upper bound. Missing income values were median-imputed during data preparation, which slightly concentrates mass near the centre of the distribution.
eda_df %>%
mutate(Target = ifelse(healthcare_demand == 0,
"Lower demand (0)", "Higher demand (1)")) %>%
ggplot(aes(x = Target, y = income_poverty_ratio, fill = Target)) +
geom_boxplot(width = 0.45, outlier.alpha = 0.3, show.legend = FALSE) +
scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
"Higher demand (1)" = "tomato")) +
labs(
title = "Figure 4.8b: Income-to-Poverty Ratio by Healthcare Demand Group",
x = "Group",
y = "Income-to-Poverty Ratio"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))eda_df %>%
group_by(healthcare_demand) %>%
summarise(
n = n(),
mean = round(mean(income_poverty_ratio), 2),
median = round(median(income_poverty_ratio), 2),
sd = round(sd(income_poverty_ratio), 2)
)## # A tibble: 2 × 5
## healthcare_demand n mean median sd
## <int> <int> <dbl> <dbl> <dbl>
## 1 0 2064 2.59 2.2 1.44
## 2 1 1256 2.51 2.2 1.46
The higher demand group tends to have a slightly lower income-to-poverty ratio compared to the lower demand group, suggesting that socioeconomic disadvantage may be associated with greater healthcare need. However, the difference between groups is modest and the distributions overlap substantially, so income alone is unlikely to be a dominant predictor.
condition_labels <- c(
"hypertension_status" = "Hypertension",
"high_cholesterol" = "High Cholesterol",
"arthritis" = "Arthritis",
"shortness_breath" = "Shortness of Breath",
"diabetes" = "Diabetes",
"chest_pain" = "Chest Pain",
"cardio_disease" = "Cardiovascular Disease",
"cancer" = "Cancer",
"pulmonary_disease" = "Pulmonary Disease (COPD)",
"smoking_status" = "Current Smoker"
)
prev_df <- eda_df %>%
select(all_of(names(condition_labels))) %>%
summarise(across(everything(), ~ mean(. == 1, na.rm = TRUE) * 100)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "prevalence") %>%
mutate(label = condition_labels[variable]) %>%
arrange(desc(prevalence))
ggplot(prev_df, aes(x = reorder(label, prevalence), y = prevalence)) +
geom_col(fill = "steelblue", width = 0.65) +
geom_text(aes(label = paste0(round(prevalence, 1), "%")),
hjust = -0.1, size = 3.8) +
coord_flip() +
scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
labs(
title = "Figure 4.9: Prevalence of Health Conditions (% of Sample)",
x = NULL,
y = "Prevalence (%)"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))Hypertension is the most prevalent condition (61.2%), followed by high cholesterol (55.6%) and arthritis (52.1%). These figures reflect the well-documented chronic disease burden among older adults. Current smoking prevalence is the lowest at 13.1%.
Two engineered features were added during data preparation:
chronic_count (total number of chronic conditions per
respondent, 0–9) and disease_age_interaction
(chronic_count × age). These are continuous,
so histograms and boxplots are used.
ggplot(eda_df, aes(x = chronic_count)) +
geom_histogram(binwidth = 1, fill = "steelblue", colour = "white") +
scale_x_continuous(breaks = 0:9) +
labs(
title = "Figure 4.10: Distribution of Chronic Condition Count",
x = "Number of Chronic Conditions (0-9)",
y = "Count"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))eda_df %>%
mutate(Target = ifelse(healthcare_demand == 0,
"Lower demand (0)", "Higher demand (1)")) %>%
ggplot(aes(x = Target, y = chronic_count, fill = Target)) +
geom_boxplot(width = 0.45, outlier.alpha = 0.3, show.legend = FALSE) +
scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
"Higher demand (1)" = "tomato")) +
labs(
title = "Figure 4.11: Chronic Count by Healthcare Demand Group",
x = "Group",
y = "Chronic Condition Count"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))eda_df %>%
group_by(healthcare_demand) %>%
summarise(
n = n(),
mean = round(mean(chronic_count), 2),
median = median(chronic_count),
sd = round(sd(chronic_count), 2)
)## # A tibble: 2 × 5
## healthcare_demand n mean median sd
## <int> <int> <dbl> <dbl> <dbl>
## 1 0 2064 2.81 3 1.8
## 2 1 1256 4.22 4 1.83
Respondents in the higher demand group carry a clearly larger disease burden: their mean chronic count is 4.22 versus 2.81 in the lower demand group. This is the strongest single distinguishing signal observed so far and aligns with the rationale for engineering this feature.
ggplot(eda_df, aes(x = disease_age_interaction)) +
geom_histogram(binwidth = 40, fill = "steelblue", colour = "white") +
labs(
title = "Figure 4.12: Distribution of Disease-Age Interaction",
x = "chronic_count x age",
y = "Count"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))The interaction term spans 0 to roughly 700 and is right-skewed. Because it combines disease burden and ageing, it is expected to be one of the more informative predictors; this is confirmed in the correlation analysis below.
This section examines whether each health condition is associated with a higher rate of healthcare demand.
biv_df <- eda_df %>%
select(healthcare_demand, all_of(names(condition_labels))) %>%
pivot_longer(-healthcare_demand, names_to = "variable", values_to = "status") %>%
filter(!is.na(status)) %>%
mutate(
label = condition_labels[variable],
status = ifelse(status == 1, "Yes", "No")
) %>%
group_by(label, status) %>%
summarise(
n = n(),
rate = mean(healthcare_demand == 1) * 100,
.groups = "drop"
)
ggplot(biv_df, aes(x = reorder(label, rate), y = rate, fill = status)) +
geom_col(position = "dodge", width = 0.65) +
geom_text(aes(label = paste0(round(rate, 1), "%\n(n=", n, ")")),
position = position_dodge(width = 0.65),
hjust = -0.05, size = 2.8) +
coord_flip() +
scale_fill_manual(values = c("Yes" = "tomato", "No" = "steelblue")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
labs(
title = "Figure 4.13: Higher Demand Rate by Condition (Yes vs No)",
x = NULL,
y = "Higher Demand Rate (%)",
fill = "Has Condition"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
legend.position = "bottom")For almost every condition, respondents who report having it show a higher rate of healthcare demand than those who do not. The gap is largest for cardiovascular disease, pulmonary disease (COPD), shortness of breath, and chest pain, suggesting these are strong predictors of healthcare utilisation. Smoking status is the exception: current smokers show a slightly lower higher-demand rate than non-smokers. This counter-intuitive pattern is likely a survivorship / age-confounding effect (heavier smokers may be under-represented at age 60+) and should be interpreted cautiously rather than as a protective effect.
eda_df %>%
mutate(Target = ifelse(healthcare_demand == 0,
"Lower demand (0)", "Higher demand (1)")) %>%
ggplot(aes(x = Target, y = age, fill = Target)) +
geom_boxplot(width = 0.45, outlier.alpha = 0.3, show.legend = FALSE) +
scale_fill_manual(values = c("Lower demand (0)" = "steelblue",
"Higher demand (1)" = "tomato")) +
labs(
title = "Figure 4.14: Age Distribution by Healthcare Demand Group",
x = "Group",
y = "Age (Years)"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))eda_df %>%
group_by(healthcare_demand) %>%
summarise(
n = n(),
mean = round(mean(age), 1),
median = median(age),
sd = round(sd(age), 1)
)## # A tibble: 2 × 5
## healthcare_demand n mean median sd
## <int> <int> <dbl> <dbl> <dbl>
## 1 0 2064 69.2 68 6.9
## 2 1 1256 71.2 71 7
The higher demand group is modestly older (median 71 vs 68 years), consistent with rising healthcare need at older ages, though the overlap is large.
# Select all numeric variables (exclude categorical demographics that are
# nominal codes: gender, ethnicity, education_level, marital_status)
cor_data <- eda_df %>%
select(age, income_poverty_ratio,
arthritis, pulmonary_disease, cancer, high_cholesterol,
diabetes, chest_pain, shortness_breath,
cardio_disease, hypertension_status, smoking_status,
chronic_count, disease_age_interaction, healthcare_demand)
cor_matrix <- cor(cor_data, use = "pairwise.complete.obs")
colnames(cor_matrix) <- rownames(cor_matrix) <- c(
"Age", "Income/Poverty",
"Arthritis", "COPD", "Cancer", "High Cholesterol",
"Diabetes", "Chest Pain", "Shortness\nof Breath",
"Cardiovascular", "Hypertension", "Smoking",
"Chronic Count", "Disease x Age", "Target"
)
corrplot(
cor_matrix,
method = "color",
type = "upper",
addCoef.col = "black",
number.cex = 0.6,
tl.cex = 0.8,
tl.col = "black",
col = colorRampPalette(c("steelblue", "white", "tomato"))(200),
title = "Figure 4.15: Correlation Matrix of Numeric Variables",
mar = c(0, 0, 2, 0)
)# Correlation of each predictor with the target, sorted
cor_matrix[, "Target"][order(-abs(cor_matrix[, "Target"]))]## Target Disease x Age Chronic Count
## 1.00000000 0.35854365 0.35260818
## Cardiovascular Shortness\nof Breath Arthritis
## 0.23540244 0.21446388 0.19459483
## Chest Pain COPD Hypertension
## 0.19237799 0.17314085 0.16175398
## Cancer Age Diabetes
## 0.14256496 0.13650389 0.11912992
## High Cholesterol Income/Poverty Smoking
## 0.06829434 -0.02811316 -0.01576748
The two engineered features show the strongest correlations with the
target: disease_age_interaction (≈ 0.36) and
chronic_count (≈ 0.35), confirming that overall disease
burden — especially combined with age — is the dominant signal. Among
individual conditions, cardiovascular disease, shortness of breath, and
arthritis correlate most with the target. chronic_count and
disease_age_interaction are strongly correlated with
each other by construction, so the modelling team should be aware
of potential redundancy (multicollinearity) and may keep only one, or
use models that tolerate correlated inputs. Income-to-poverty ratio and
smoking show essentially no linear correlation with the target.
This section explores the target variable for the regression task,
PHQ9_total, using the final_df_regression
dataset. The goal is to understand the distribution of depression
severity among older adults before modelling.
ggplot(final_df_regression, aes(x = PHQ9_total)) +
geom_histogram(binwidth = 1, fill = "steelblue", colour = "white", alpha = 0.85) +
geom_vline(xintercept = median(final_df_regression$PHQ9_total),
linetype = "dashed", colour = "tomato", linewidth = 0.8) +
annotate("text",
x = median(final_df_regression$PHQ9_total) + 1.5,
y = Inf, vjust = 1.5,
label = paste0("Median = ", median(final_df_regression$PHQ9_total)),
colour = "tomato", size = 3.8) +
labs(
title = "Figure 4.16: Distribution of PHQ-9 Total Score",
x = "PHQ-9 Total Score (0 = no symptoms, 27 = severe)",
y = "Count"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))## PHQ9_total summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 3.074 4.000 25.000
cat("\nProportion with score >= 10 (moderate-severe depression):",
round(mean(final_df_regression$PHQ9_total >= 10) * 100, 1), "%\n")##
## Proportion with score >= 10 (moderate-severe depression): 8.7 %
The PHQ-9 score distribution is strongly right-skewed, with the majority of older adults scoring low. This is expected in a general elderly population — most respondents do not report clinically significant depressive symptoms. The median score is 1, and only a small proportion score 10 or above, which is commonly used as a threshold for moderate-to-severe depression. This skewness motivates the use of count-based regression models (negative binomial, Poisson) rather than ordinary linear regression.
final_df_regression %>%
mutate(chronic_group = factor(
case_when(
chronic_count == 0 ~ "0",
chronic_count == 1 ~ "1",
chronic_count == 2 ~ "2",
chronic_count >= 3 ~ "3+"
),
levels = c("0", "1", "2", "3+")
)) %>%
ggplot(aes(x = chronic_group, y = PHQ9_total, fill = chronic_group)) +
geom_boxplot(width = 0.5, outlier.alpha = 0.3, show.legend = FALSE) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Figure 4.17: PHQ-9 Score by Chronic Condition Count",
x = "Number of Chronic Conditions",
y = "PHQ-9 Total Score"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))final_df_regression %>%
mutate(chronic_group = case_when(
chronic_count == 0 ~ "0",
chronic_count == 1 ~ "1",
chronic_count == 2 ~ "2",
chronic_count >= 3 ~ "3+"
)) %>%
group_by(chronic_group) %>%
summarise(
n = n(),
mean = round(mean(PHQ9_total), 2),
median = median(PHQ9_total),
sd = round(sd(PHQ9_total), 2)
)## # A tibble: 4 × 5
## chronic_group n mean median sd
## <chr> <int> <dbl> <dbl> <dbl>
## 1 0 184 1.2 0 2.53
## 2 1 340 1.34 0 2.14
## 3 2 458 2.12 1 3.07
## 4 3+ 1807 3.83 2 4.55
PHQ-9 scores increase consistently with chronic condition count.
Respondents with 3 or more conditions show notably higher median and
mean depression scores compared to those with no conditions. This
pattern supports the hypothesis that disease burden is associated with
depression severity among older adults, and confirms that
chronic_count and individual disease indicators are likely
to be useful predictors in the regression model.
ggplot(final_df_regression, aes(x = age, y = PHQ9_total)) +
geom_point(alpha = 0.15, colour = "steelblue", size = 0.8) +
geom_smooth(method = "loess", colour = "tomato", se = TRUE) +
labs(
title = "Figure 4.18: PHQ-9 Score vs Age",
x = "Age (Years)",
y = "PHQ-9 Total Score"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))The scatter plot with a LOESS smoother shows the relationship between age and PHQ-9 score. Any systematic trend with age would suggest that older respondents within this 60+ cohort differ in their depressive symptom burden. Note that NHANES top-codes age at 80, so values above 80 are compressed at that boundary.
phq9_condition_labels <- c(
"hypertension_status" = "Hypertension",
"diabetes" = "Diabetes",
"cardio_disease" = "Cardiovascular",
"arthritis" = "Arthritis",
"pulmonary_disease" = "COPD",
"cancer" = "Cancer"
)
final_df_regression %>%
select(PHQ9_total, all_of(names(phq9_condition_labels))) %>%
pivot_longer(-PHQ9_total, names_to = "variable", values_to = "status") %>%
filter(!is.na(status)) %>%
mutate(
label = phq9_condition_labels[variable],
status = ifelse(status == 1, "Yes", "No")
) %>%
ggplot(aes(x = status, y = PHQ9_total, fill = status)) +
geom_boxplot(width = 0.5, outlier.alpha = 0.2, show.legend = FALSE) +
scale_fill_manual(values = c("Yes" = "tomato", "No" = "steelblue")) +
facet_wrap(~ label, nrow = 2) +
labs(
title = "Figure 4.19: PHQ-9 Score by Key Disease Conditions",
x = "Has Condition",
y = "PHQ-9 Total Score"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 13),
strip.text = element_text(face = "bold")
)Across all six conditions, respondents who reported having the condition tend to show higher PHQ-9 scores. The difference is most visible for COPD and cardiovascular disease, where the median score for those with the condition is clearly elevated compared to those without it. These patterns are consistent with the known co-occurrence of chronic illness and depressive symptoms in older adults.
The exploratory analysis of the final NHANES 2017–2020 dataset reveals several key findings relevant to the modelling stage:
chronic_count and disease_age_interaction show
the strongest association with healthcare demand, validating their
construction. They are mutually correlated, which the modelling team
should account for.These findings provide a solid foundation for the regression and classification analyses in the following sections.
This section focuses on the regression task. The goal is to predict
PHQ-9 total score (depression symptom severity, range
0–27) among older adults using demographic characteristics and chronic
disease burden. The modelling dataset final_df_regression
was produced in Section 3.11 by joining the cleaned PHQ-9 data (from the
DPQ file) with the prepared elderly health dataset, so it is used here
directly without re-reading any file.
The engineered redundant features
disease_age_interaction and chronic_count are
dropped here so the regression models work from the individual disease
indicators, and SEQN is removed as it is only an
identifier. Questionnaire-coded variables are converted into
factors.
df_model <- final_df_regression %>%
dplyr::select(-SEQN, -disease_age_interaction, -chronic_count) %>%
mutate(
gender = factor(gender),
ethnicity = factor(ethnicity),
marital_status = factor(marital_status),
education_level = factor(education_level, ordered = TRUE),
across(c(arthritis, pulmonary_disease, cancer, high_cholesterol,
diabetes, chest_pain, shortness_breath, cardio_disease,
hypertension_status, smoking_status, healthcare_demand), factor)
)
glimpse(df_model)## Rows: 2,789
## Columns: 18
## $ PHQ9_total <dbl> 0, 5, 2, 0, 1, 0, 1, 0, 2, 2, 2, 3, 0, 0, 1, 1, 0…
## $ gender <fct> 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 1, 1…
## $ age <dbl> 68, 76, 68, 63, 62, 76, 63, 62, 66, 70, 76, 62, 6…
## $ ethnicity <fct> 7, 3, 4, 3, 4, 6, 3, 6, 3, 1, 4, 1, 4, 3, 6, 3, 6…
## $ education_level <ord> 4, 5, 5, 4, 3, 4, 5, 5, 3, 3, 4, 3, 5, 1, 5, 3, 5…
## $ marital_status <fct> 3, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1…
## $ income_poverty_ratio <dbl> 1.20, 3.61, 5.00, 5.00, 0.07, 2.37, 5.00, 4.56, 1…
## $ arthritis <fct> 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0…
## $ pulmonary_disease <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0…
## $ cancer <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ high_cholesterol <fct> 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0…
## $ diabetes <fct> 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0…
## $ chest_pain <fct> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ shortness_breath <fct> 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1…
## $ cardio_disease <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1…
## $ hypertension_status <fct> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1…
## $ smoking_status <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ healthcare_demand <fct> 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1…
Four modelling approaches are compared using 5-fold cross-validation: a null baseline (predicting the training mean), negative binomial regression, elastic net Poisson regression, and random forest regression. PHQ-9 total score is a count-like outcome with a skewed distribution and many low values, which motivates count-based and flexible models. RMSE and MAE are computed on each held-out fold.
set.seed(42)
folds <- createFolds(df_model$PHQ9_total, k = 5, returnTrain = FALSE)
rmse_fn <- function(a, p) sqrt(mean((a - p)^2))
mae_fn <- function(a, p) mean(abs(a - p))
X_all <- model.matrix(PHQ9_total ~ ., data = df_model)[, -1]
y_all <- df_model$PHQ9_total
alpha_grid <- c(0.25, 0.5, 0.75, 1)
mtry_grid <- c(3, 5, 7, 10)
cv_results <- purrr::map_dfr(seq_along(folds), function(i) {
test_idx <- folds[[i]]
train_idx <- setdiff(seq_len(nrow(df_model)), test_idx)
# Null model: predict the training mean for all test samples
pred_null <- rep(mean(y_all[train_idx]), length(test_idx))
# Negative binomial regression
fit_nb <- glm.nb(PHQ9_total ~ ., data = df_model[train_idx, ])
pred_nb <- predict(fit_nb, newdata = df_model[test_idx, ], type = "response")
# Elastic net: select best alpha within the training fold, then best lambda
best_alpha <- alpha_grid[which.min(sapply(alpha_grid, function(a) {
min(cv.glmnet(X_all[train_idx, ], y_all[train_idx],
family = "poisson", alpha = a)$cvm)
}))]
cv_en <- cv.glmnet(X_all[train_idx, ], y_all[train_idx],
family = "poisson", alpha = best_alpha)
pred_en <- as.vector(predict(cv_en, newx = X_all[test_idx, ],
s = "lambda.min", type = "response"))
# Random forest: select best mtry within the training fold via OOB error
best_mtry <- mtry_grid[which.min(sapply(mtry_grid, function(m) {
ranger(PHQ9_total ~ ., data = df_model[train_idx, ],
num.trees = 500, mtry = m, seed = 42)$prediction.error
}))]
fit_rf <- ranger(PHQ9_total ~ ., data = df_model[train_idx, ],
num.trees = 500, mtry = best_mtry, seed = 42)
pred_rf <- predict(fit_rf, data = df_model[test_idx, ])$predictions
tibble(
fold = i,
NULL_RMSE = rmse_fn(y_all[test_idx], pred_null),
NULL_MAE = mae_fn(y_all[test_idx], pred_null),
NB_RMSE = rmse_fn(y_all[test_idx], pred_nb),
NB_MAE = mae_fn(y_all[test_idx], pred_nb),
EN_RMSE = rmse_fn(y_all[test_idx], pred_en),
EN_MAE = mae_fn(y_all[test_idx], pred_en),
RF_RMSE = rmse_fn(y_all[test_idx], pred_rf),
RF_MAE = mae_fn(y_all[test_idx], pred_rf)
)
})##
## Call:
## glm.nb(formula = PHQ9_total ~ ., data = df_model, init.theta = 0.7540308545,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9955953 0.2959478 3.364 0.000768 ***
## gender2 0.2742397 0.0541078 5.068 4.01e-07 ***
## age -0.0106143 0.0041765 -2.541 0.011039 *
## ethnicity2 0.1620775 0.1184537 1.368 0.171225
## ethnicity3 -0.0009298 0.1082961 -0.009 0.993150
## ethnicity4 -0.1242995 0.1104785 -1.125 0.260546
## ethnicity6 -0.1346481 0.1349115 -0.998 0.318257
## ethnicity7 0.0283345 0.1625961 0.174 0.861659
## education_level.L -0.2452640 0.0775897 -3.161 0.001572 **
## education_level.Q -0.0831764 0.0664181 -1.252 0.210455
## education_level.C 0.0138327 0.0627056 0.221 0.825405
## education_level^4 -0.0598886 0.0552068 -1.085 0.278007
## marital_status2 0.1514006 0.0571055 2.651 0.008020 **
## marital_status3 0.1811908 0.1105090 1.640 0.101088
## income_poverty_ratio -0.0807242 0.0199833 -4.040 5.35e-05 ***
## arthritis1 0.1877101 0.0537885 3.490 0.000483 ***
## pulmonary_disease1 0.0026762 0.0727222 0.037 0.970644
## cancer1 -0.0010483 0.0643676 -0.016 0.987006
## high_cholesterol1 0.1409676 0.0531454 2.652 0.007990 **
## diabetes1 0.1557901 0.0564536 2.760 0.005787 **
## chest_pain1 0.4159266 0.0580882 7.160 8.05e-13 ***
## shortness_breath1 0.4123586 0.0555022 7.430 1.09e-13 ***
## cardio_disease1 0.0934820 0.0639072 1.463 0.143528
## hypertension_status1 0.1076657 0.0562664 1.913 0.055684 .
## smoking_status1 0.2674360 0.0769541 3.475 0.000510 ***
## healthcare_demand1 0.2926419 0.0554602 5.277 1.32e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.754) family taken to be 1)
##
## Null deviance: 3583.9 on 2788 degrees of freedom
## Residual deviance: 3006.0 on 2763 degrees of freedom
## AIC: 11957
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.7540
## Std. Err.: 0.0313
##
## 2 x log-likelihood: -11903.0290
## Dispersion parameter theta: 0.7540309
best_alpha_full <- alpha_grid[which.min(sapply(alpha_grid, function(a) {
min(cv.glmnet(X_all, y_all, family = "poisson", alpha = a)$cvm)
}))]
cat("Best alpha on full data:", best_alpha_full, "\n")## Best alpha on full data: 1
cv_en_full <- cv.glmnet(X_all, y_all, family = "poisson", alpha = best_alpha_full)
coef(cv_en_full, s = "lambda.min") %>%
as.matrix() %>%
as.data.frame() %>%
setNames("coef") %>%
rownames_to_column("variable") %>%
filter(coef != 0) %>%
arrange(desc(abs(coef))) %>%
print()## variable coef
## 1 (Intercept) 1.14525511
## 2 chest_pain1 0.40261385
## 3 shortness_breath1 0.37404534
## 4 healthcare_demand1 0.26078698
## 5 education_level.L -0.25524120
## 6 gender2 0.24325033
## 7 smoking_status1 0.23604888
## 8 arthritis1 0.20520798
## 9 marital_status3 0.15354308
## 10 ethnicity2 0.13685430
## 11 marital_status2 0.13620932
## 12 hypertension_status1 0.10154409
## 13 cardio_disease1 0.09784695
## 14 high_cholesterol1 0.09089700
## 15 ethnicity4 -0.08875847
## 16 diabetes1 0.07914756
## 17 education_level.Q -0.06549086
## 18 income_poverty_ratio -0.06383271
## 19 ethnicity6 -0.06291944
## 20 ethnicity7 0.02870165
## 21 education_level^4 -0.02035186
## 22 age -0.01174080
## 23 education_level.C -0.01078159
best_mtry_full <- mtry_grid[which.min(sapply(mtry_grid, function(m) {
ranger(PHQ9_total ~ ., data = df_model,
num.trees = 500, mtry = m, seed = 42)$prediction.error
}))]
cat("Best mtry on full data:", best_mtry_full, "\n")## Best mtry on full data: 3
fit_rf_full <- ranger(PHQ9_total ~ ., data = df_model,
num.trees = 500, mtry = best_mtry_full,
importance = "impurity", seed = 42)
data.frame(
variable = names(fit_rf_full$variable.importance),
importance = fit_rf_full$variable.importance
) %>%
arrange(desc(importance)) %>%
ggplot(aes(x = reorder(variable, importance), y = importance)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Figure 5.1: Random Forest Variable Importance",
x = NULL, y = "Impurity") +
theme_minimal()r2_fn <- function(model_rmse_vec, null_rmse_vec) {
round(1 - mean(model_rmse_vec^2) / mean(null_rmse_vec^2), 3)
}
reg_comparison <- tibble(
Model = c("Null model (baseline)", "Negative binomial",
"Elastic net", "Random forest (tuned)"),
RMSE = round(c(mean(cv_results$NULL_RMSE), mean(cv_results$NB_RMSE),
mean(cv_results$EN_RMSE), mean(cv_results$RF_RMSE)), 3),
MAE = round(c(mean(cv_results$NULL_MAE), mean(cv_results$NB_MAE),
mean(cv_results$EN_MAE), mean(cv_results$RF_MAE)), 3),
R2 = c(0,
r2_fn(cv_results$NB_RMSE, cv_results$NULL_RMSE),
r2_fn(cv_results$EN_RMSE, cv_results$NULL_RMSE),
r2_fn(cv_results$RF_RMSE, cv_results$NULL_RMSE))
)
print(reg_comparison)## # A tibble: 4 × 4
## Model RMSE MAE R2
## <chr> <dbl> <dbl> <dbl>
## 1 Null model (baseline) 4.12 3.06 0
## 2 Negative binomial 3.77 2.70 0.164
## 3 Elastic net 3.74 2.69 0.176
## 4 Random forest (tuned) 3.78 2.76 0.16
reg_comparison %>%
filter(Model != "Null model (baseline)") %>%
pivot_longer(c(RMSE, MAE), names_to = "metric", values_to = "value") %>%
ggplot(aes(x = reorder(Model, value), y = value, fill = Model)) +
geom_col(width = 0.6, show.legend = FALSE) +
geom_text(aes(label = round(value, 3)), hjust = -0.1, size = 3.8) +
coord_flip() +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
facet_wrap(~ metric, scales = "free_x") +
labs(
title = "Figure 5.2: Regression Model Comparison (RMSE and MAE)",
x = NULL,
y = "Score (lower is better)"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
strip.text = element_text(face = "bold"))# Generate predictions from best model (negative binomial) on full data
pred_nb_all <- predict(fit_nb_full, type = "response")
tibble(
actual = df_model$PHQ9_total,
predicted = pred_nb_all
) %>%
ggplot(aes(x = actual, y = predicted)) +
geom_point(alpha = 0.15, colour = "steelblue", size = 0.8) +
geom_abline(intercept = 0, slope = 1,
linetype = "dashed", colour = "tomato", linewidth = 0.8) +
labs(
title = "Figure 5.3: Actual vs Predicted PHQ-9 Score (Negative Binomial)",
x = "Actual PHQ-9 Score",
y = "Predicted PHQ-9 Score"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))The dashed line represents perfect prediction. Points above the line are under-predicted; points below are over-predicted. The model predicts low scores well but tends to underestimate high PHQ-9 scores, which is typical for count models fitted on right-skewed data with many zeros.
# Incidence Rate Ratio table for Negative Binomial (interpretable coefficients)
irr_table <- as.data.frame(summary(fit_nb_full)$coefficients) %>%
rownames_to_column("variable") %>%
filter(variable != "(Intercept)") %>%
mutate(
IRR = round(exp(Estimate), 3),
CI_lower = round(exp(Estimate - 1.96 * `Std. Error`), 3),
CI_upper = round(exp(Estimate + 1.96 * `Std. Error`), 3),
p_value = round(`Pr(>|z|)`, 4),
significant = ifelse(`Pr(>|z|)` < 0.05, "Yes", "No")
) %>%
dplyr::select(variable, IRR, CI_lower, CI_upper, p_value, significant) %>%
arrange(p_value)
cat("Incidence Rate Ratio (IRR) Table — Negative Binomial Regression\n")## Incidence Rate Ratio (IRR) Table — Negative Binomial Regression
## IRR > 1: associated with higher PHQ-9 score
## IRR < 1: associated with lower PHQ-9 score
## variable IRR CI_lower CI_upper p_value significant
## 1 gender2 1.316 1.183 1.463 0.0000 Yes
## 2 chest_pain1 1.516 1.353 1.699 0.0000 Yes
## 3 shortness_breath1 1.510 1.355 1.684 0.0000 Yes
## 4 healthcare_demand1 1.340 1.202 1.494 0.0000 Yes
## 5 income_poverty_ratio 0.922 0.887 0.959 0.0001 Yes
## 6 arthritis1 1.206 1.086 1.341 0.0005 Yes
## 7 smoking_status1 1.307 1.124 1.519 0.0005 Yes
## 8 education_level.L 0.782 0.672 0.911 0.0016 Yes
## 9 diabetes1 1.169 1.046 1.305 0.0058 Yes
## 10 marital_status2 1.163 1.040 1.301 0.0080 Yes
## 11 high_cholesterol1 1.151 1.037 1.278 0.0080 Yes
## 12 age 0.989 0.981 0.998 0.0110 Yes
## 13 hypertension_status1 1.114 0.997 1.244 0.0557 No
## 14 marital_status3 1.199 0.965 1.489 0.1011 No
## 15 cardio_disease1 1.098 0.969 1.245 0.1435 No
## 16 ethnicity2 1.176 0.932 1.483 0.1712 No
## 17 education_level.Q 0.920 0.808 1.048 0.2105 No
## 18 ethnicity4 0.883 0.711 1.097 0.2605 No
## 19 education_level^4 0.942 0.845 1.050 0.2780 No
## 20 ethnicity6 0.874 0.671 1.139 0.3183 No
## 21 education_level.C 1.014 0.897 1.147 0.8254 No
## 22 ethnicity7 1.029 0.748 1.415 0.8617 No
## 23 pulmonary_disease1 1.003 0.869 1.156 0.9706 No
## 24 cancer1 0.999 0.881 1.133 0.9870 No
## 25 ethnicity3 0.999 0.808 1.235 0.9931 No
The IRR table shows the multiplicative effect of each predictor on
the expected PHQ-9 score. An IRR of 1.20 means the expected score is 20%
higher for that group compared to the reference. Variables with
significant = Yes (p < 0.05) provide the most reliable
signal for predicting depression severity.
The table compares the cross-validated RMSE and MAE of each model against the null baseline, together with an R-squared-style measure (the proportion of squared-error reduction relative to the null model). Negative binomial regression offers interpretability for a count-like outcome, elastic net handles correlated predictors and performs variable selection, and random forest captures potential non-linear patterns. Demographic and chronic disease variables carry useful but moderate predictive signal for depression severity, since depression is influenced by many psychological, social, and behavioural factors not captured by the selected NHANES variables. The results should therefore be read as evidence of association and predictive usefulness, not as a complete explanation of depression among older adults.
This section focuses on the classification task. Using the features prepared in Section 3, the model predicts whether respondents aged 60 and above belong to the higher healthcare demand group.
The target variable follows the definition used in the data preparation section:
target = 1 (higher healthcare demand):
defined as HUQ051 >= 4 or
HUQ071 == 1.target = 0 (lower healthcare demand):
meeting neither condition.The modelling dataset is the final_df object produced in
Section 3. HUQ051 and HUQ071 were already used
in Section 3 to construct healthcare_demand and removed
from the final dataset; SEQN is only a respondent
identifier with no predictive meaning. Using final_df
directly keeps the target definition consistent with the EDA in Section
4.
Before modelling, questionnaire-coded variables are converted into
factors. Age (age), income-to-poverty ratio
(income_poverty_ratio), and chronic condition count
(chronic_count) are kept as numeric variables.
df_clf <- final_df %>%
dplyr::select(-SEQN) %>%
mutate(
target = factor(ifelse(healthcare_demand == 1, "Higher", "Lower"),
levels = c("Lower", "Higher")),
target_num = ifelse(target == "Higher", 1L, 0L),
gender = factor(gender, levels = c(1, 2),
labels = c("Male", "Female")),
ethnicity = factor(ethnicity),
education_level = factor(education_level),
marital_status = factor(marital_status),
across(
c(arthritis, pulmonary_disease, cancer, high_cholesterol,
diabetes, chest_pain, shortness_breath, cardio_disease,
hypertension_status, smoking_status),
~ factor(., levels = c(0, 1), labels = c("No", "Yes"))
)
)
cat("Target distribution:\n")## Target distribution:
##
## Lower Higher
## 2064 1256
##
## Lower Higher
## 62.2 37.8
df_clf %>%
count(target) %>%
mutate(pct = round(n / sum(n) * 100, 1),
label = paste0(n, " (", pct, "%)")) %>%
ggplot(aes(x = target, y = n, fill = target)) +
geom_col(width = 0.55, show.legend = FALSE) +
geom_text(aes(label = label), vjust = -0.3, size = 4) +
scale_fill_manual(values = c("Lower" = "steelblue", "Higher" = "tomato")) +
labs(
title = "Figure 6.1: Target Distribution",
x = "Healthcare Demand Group",
y = "Number of Respondents"
) +
theme_minimal()The data is split into 70% training data and 30% testing data. Since
Higher is not the majority class, stratified sampling by
target is used to keep the class distribution close to the
original dataset.
set.seed(7004)
train_ids <- df_clf %>%
mutate(global_id = row_number()) %>%
group_by(target) %>%
sample_frac(size = 0.70) %>%
ungroup() %>%
pull(global_id)
train_data <- df_clf[train_ids, ]
test_data <- df_clf[-train_ids, ]
cat("Training set:", nrow(train_data), "rows\n")## Training set: 2324 rows
##
## Lower Higher
## 1445 879
##
## Testing set: 996 rows
##
## Lower Higher
## 619 377
In this section, Higher is treated as the positive
class. Since Lower accounts for about 62.2% of the final
data, accuracy alone is not sufficient. Sensitivity, F1, balanced
accuracy, and AUC are also reported.
safe_divide <- function(x, y) {
ifelse(y == 0, NA_real_, x / y)
}
auc_rank <- function(actual, prob) {
actual_positive <- actual == "Higher"
n_pos <- sum(actual_positive)
n_neg <- sum(!actual_positive)
if (n_pos == 0 || n_neg == 0) {
return(NA_real_)
}
ranks <- rank(prob, ties.method = "average")
(sum(ranks[actual_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
evaluate_model <- function(actual, prob, model_name, threshold = 0.50) {
actual <- factor(actual, levels = c("Lower", "Higher"))
pred <- factor(ifelse(prob >= threshold, "Higher", "Lower"),
levels = c("Lower", "Higher"))
cm <- table(Predicted = pred, Actual = actual)
tn <- cm["Lower", "Lower"]
fp <- cm["Higher", "Lower"]
fn <- cm["Lower", "Higher"]
tp <- cm["Higher", "Higher"]
tibble(
model = model_name,
threshold = threshold,
accuracy = safe_divide(tp + tn, tp + tn + fp + fn),
sensitivity = safe_divide(tp, tp + fn),
specificity = safe_divide(tn, tn + fp),
precision = safe_divide(tp, tp + fp),
f1 = safe_divide(2 * tp, 2 * tp + fp + fn),
balanced_accuracy = (safe_divide(tp, tp + fn) + safe_divide(tn, tn + fp)) / 2,
auc = auc_rank(actual, prob)
)
}
# Threshold is selected on the training set only, to avoid using test-set information.
find_best_threshold <- function(actual, prob) {
thresholds <- seq(0.05, 0.95, by = 0.01)
scores <- lapply(thresholds, function(th) {
evaluate_model(actual, prob, "threshold_search", threshold = th)
}) %>%
bind_rows()
scores %>%
arrange(desc(balanced_accuracy), desc(f1)) %>%
slice(1) %>%
pull(threshold)
}
make_roc <- function(actual, prob, model_name) {
actual <- factor(actual, levels = c("Lower", "Higher"))
thresholds <- sort(unique(c(Inf, prob, -Inf)), decreasing = TRUE)
roc_df <- lapply(thresholds, function(th) {
pred <- factor(ifelse(prob >= th, "Higher", "Lower"),
levels = c("Lower", "Higher"))
cm <- table(Predicted = pred, Actual = actual)
tn <- cm["Lower", "Lower"]
fp <- cm["Higher", "Lower"]
fn <- cm["Lower", "Higher"]
tp <- cm["Higher", "Higher"]
tibble(
model = model_name,
threshold = th,
sensitivity = safe_divide(tp, tp + fn),
fpr = safe_divide(fp, fp + tn)
)
})
bind_rows(roc_df)
}The baseline model always predicts the majority class,
Lower. Its accuracy may appear acceptable, but its
sensitivity is 0, so it is used only as a reference and not as a useful
predictive model.
baseline_prob <- rep(0, nrow(test_data))
baseline_metrics <- evaluate_model(test_data$target, baseline_prob, "Baseline")
baseline_cm <- table(
Predicted = factor(rep("Lower", nrow(test_data)), levels = c("Lower", "Higher")),
Actual = test_data$target
)
baseline_cm## Actual
## Predicted Lower Higher
## Lower 619 377
## Higher 0 0
## # A tibble: 1 × 9
## model threshold accuracy sensitivity specificity precision f1
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Baseline 0.5 0.621 0 1 NA 0
## # ℹ 2 more variables: balanced_accuracy <dbl>, auc <dbl>
Logistic Regression is used as the first formal model. Its main
advantage is interpretability, as it allows us to examine which
variables increase or decrease the predicted probability of
Higher.
predictor_cols <- c(
"gender", "age", "ethnicity", "education_level", "marital_status",
"income_poverty_ratio",
"arthritis", "pulmonary_disease", "cancer", "high_cholesterol",
"diabetes", "chest_pain", "shortness_breath",
"cardio_disease", "hypertension_status", "smoking_status",
"chronic_count"
)
# disease_age_interaction is excluded because it is highly redundant with
# age and chronic_count. This keeps the formal model comparison simpler
# while retaining the disease burden signal.
logit_formula <- as.formula(
paste("target_num ~", paste(predictor_cols, collapse = " + "))
)
logit_fit <- glm(
logit_formula,
data = train_data,
family = binomial()
)
logit_train_prob <- predict(logit_fit, newdata = train_data, type = "response")
logit_prob <- predict(logit_fit, newdata = test_data, type = "response")
logit_threshold <- find_best_threshold(train_data$target, logit_train_prob)
logit_metrics <- evaluate_model(
test_data$target,
logit_prob,
"Logistic Regression",
threshold = logit_threshold
)
logit_pred <- factor(ifelse(logit_prob >= logit_threshold, "Higher", "Lower"),
levels = c("Lower", "Higher"))
logit_cm <- table(Predicted = logit_pred, Actual = test_data$target)
cat("Selected threshold:", logit_threshold, "\n")## Selected threshold: 0.39
## Actual
## Predicted Lower Higher
## Lower 418 147
## Higher 201 230
## # A tibble: 1 × 9
## model threshold accuracy sensitivity specificity precision f1
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Logistic Regression 0.39 0.651 0.610 0.675 0.534 0.569
## # ℹ 2 more variables: balanced_accuracy <dbl>, auc <dbl>
# Convert Logistic Regression coefficients to odds ratios for easier interpretation.
logit_coef <- as.data.frame(summary(logit_fit)$coefficients)
logit_coef$term <- rownames(logit_coef)
rownames(logit_coef) <- NULL
logit_or <- logit_coef %>%
filter(term != "(Intercept)") %>%
mutate(
odds_ratio = exp(Estimate),
direction = ifelse(odds_ratio > 1, "Higher odds", "Lower odds")
) %>%
arrange(`Pr(>|z|)`)
head(logit_or, 12)## Estimate Std. Error z value Pr(>|z|) term odds_ratio
## 1 0.5613573 0.10128822 5.542178 2.987331e-08 shortness_breathYes 1.753050
## 2 0.6211422 0.11310482 5.491740 3.979939e-08 cardio_diseaseYes 1.861053
## 3 0.5379449 0.09806243 5.485739 4.117441e-08 arthritisYes 1.712484
## 4 0.4688601 0.10237516 4.579823 4.653700e-06 diabetesYes 1.598171
## 5 0.4930171 0.11591653 4.253208 2.107299e-05 cancerYes 1.637249
## 6 0.5044590 0.13365904 3.774223 1.605072e-04 pulmonary_diseaseYes 1.656089
## 7 0.3552144 0.10346441 3.433204 5.964932e-04 hypertension_statusYes 1.426486
## 8 0.3652637 0.10710054 3.410474 6.484998e-04 chest_painYes 1.440894
## 9 0.0211010 0.00752333 2.804741 5.035696e-03 age 1.021325
## 10 0.7297867 0.30656671 2.380515 1.728846e-02 ethnicity7 2.074638
## 11 0.4126494 0.22871530 1.804206 7.119908e-02 ethnicity2 1.510815
## 12 0.3146090 0.20018244 1.571611 1.160408e-01 education_level5 1.369724
## direction
## 1 Higher odds
## 2 Higher odds
## 3 Higher odds
## 4 Higher odds
## 5 Higher odds
## 6 Higher odds
## 7 Higher odds
## 8 Higher odds
## 9 Higher odds
## 10 Higher odds
## 11 Higher odds
## 12 Higher odds
Decision Tree is used as the second model for comparison with
Logistic Regression. The rpart package is used because it
is available locally and does not require additional complex
dependencies.
tree_formula <- as.formula(
paste("target ~", paste(predictor_cols, collapse = " + "))
)
tree_fit <- rpart(
tree_formula,
data = train_data,
method = "class",
control = rpart.control(cp = 0.01, minsplit = 30)
)
tree_train_prob <- predict(tree_fit, newdata = train_data, type = "prob")[, "Higher"]
tree_prob <- predict(tree_fit, newdata = test_data, type = "prob")[, "Higher"]
tree_threshold <- find_best_threshold(train_data$target, tree_train_prob)
tree_metrics <- evaluate_model(
test_data$target,
tree_prob,
"Decision Tree",
threshold = tree_threshold
)
tree_pred <- factor(ifelse(tree_prob >= tree_threshold, "Higher", "Lower"),
levels = c("Lower", "Higher"))
tree_cm <- table(Predicted = tree_pred, Actual = test_data$target)
cat("Selected threshold:", tree_threshold, "\n")## Selected threshold: 0.29
## Actual
## Predicted Lower Higher
## Lower 396 133
## Higher 223 244
## # A tibble: 1 × 9
## model threshold accuracy sensitivity specificity precision f1
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Decision Tree 0.29 0.643 0.647 0.640 0.522 0.578
## # ℹ 2 more variables: balanced_accuracy <dbl>, auc <dbl>
tree_importance <- tree_fit$variable.importance
if (!is.null(tree_importance)) {
tree_importance_df <- tibble(
variable = names(tree_importance),
importance = as.numeric(tree_importance)
) %>%
arrange(desc(importance))
tree_importance_df
} else {
tibble(variable = character(), importance = numeric())
}## # A tibble: 10 × 2
## variable importance
## <chr> <dbl>
## 1 chronic_count 120.
## 2 shortness_breath 50.9
## 3 chest_pain 43.6
## 4 cardio_disease 40.6
## 5 hypertension_status 34.7
## 6 arthritis 33.6
## 7 ethnicity 5.95
## 8 age 4.62
## 9 income_poverty_ratio 2.54
## 10 education_level 0.238
if (!is.null(tree_importance)) {
tree_importance_df %>%
ggplot(aes(x = reorder(variable, importance), y = importance)) +
geom_col(fill = "steelblue", width = 0.65) +
coord_flip() +
labs(
title = "Figure 6.2: Decision Tree Variable Importance",
x = NULL,
y = "Importance"
) +
theme_minimal()
}Random Forest is added as a third model. It builds multiple decision
trees on bootstrapped samples and averages their predictions, which
generally reduces overfitting compared to a single tree. The
ranger package is used for efficient computation.
rf_formula <- as.formula(
paste("target ~", paste(predictor_cols, collapse = " + "))
)
mtry_grid_clf <- c(3, 5, 7, 10)
best_mtry_clf <- mtry_grid_clf[which.min(sapply(mtry_grid_clf, function(m) {
ranger(rf_formula, data = train_data,
num.trees = 500, mtry = m,
probability = TRUE, seed = 42)$prediction.error
}))]
cat("Best mtry (classification):", best_mtry_clf, "\n")## Best mtry (classification): 3
rf_fit <- ranger(
rf_formula,
data = train_data,
num.trees = 500,
mtry = best_mtry_clf,
probability = TRUE,
importance = "impurity",
seed = 42
)
rf_train_prob <- predict(rf_fit, data = train_data)$predictions[, "Higher"]
rf_prob <- predict(rf_fit, data = test_data)$predictions[, "Higher"]
rf_threshold <- find_best_threshold(train_data$target, rf_train_prob)
rf_metrics <- evaluate_model(
test_data$target, rf_prob,
"Random Forest", threshold = rf_threshold
)
rf_pred <- factor(ifelse(rf_prob >= rf_threshold, "Higher", "Lower"),
levels = c("Lower", "Higher"))
rf_cm <- table(Predicted = rf_pred, Actual = test_data$target)
cat("Selected threshold:", rf_threshold, "\n")## Selected threshold: 0.45
## Actual
## Predicted Lower Higher
## Lower 439 169
## Higher 180 208
## # A tibble: 1 × 9
## model threshold accuracy sensitivity specificity precision f1
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Random Forest 0.45 0.650 0.552 0.709 0.536 0.544
## # ℹ 2 more variables: balanced_accuracy <dbl>, auc <dbl>
rf_imp_df <- data.frame(
variable = names(rf_fit$variable.importance),
importance = rf_fit$variable.importance
) %>%
arrange(desc(importance))
rf_imp_df %>%
ggplot(aes(x = reorder(variable, importance), y = importance)) +
geom_col(fill = "steelblue", width = 0.65) +
coord_flip() +
labs(
title = "Figure 6.3: Random Forest Variable Importance",
x = NULL,
y = "Impurity"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))model_metrics <- bind_rows(
baseline_metrics,
logit_metrics,
tree_metrics,
rf_metrics
) %>%
mutate(across(where(is.numeric), ~ round(., 3)))
model_metrics## # A tibble: 4 × 9
## model threshold accuracy sensitivity specificity precision f1
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Baseline 0.5 0.621 0 1 NA 0
## 2 Logistic Regression 0.39 0.651 0.61 0.675 0.534 0.569
## 3 Decision Tree 0.29 0.643 0.647 0.64 0.522 0.578
## 4 Random Forest 0.45 0.65 0.552 0.709 0.536 0.544
## # ℹ 2 more variables: balanced_accuracy <dbl>, auc <dbl>
best_non_baseline <- model_metrics %>%
filter(model != "Baseline") %>%
arrange(desc(auc), desc(balanced_accuracy)) %>%
slice(1)model_metrics %>%
select(model, accuracy, sensitivity, specificity, f1, balanced_accuracy, auc) %>%
pivot_longer(-model, names_to = "metric", values_to = "value") %>%
ggplot(aes(x = metric, y = value, fill = model)) +
geom_col(position = "dodge", width = 0.7) +
coord_cartesian(ylim = c(0, 1)) +
labs(
title = "Figure 6.4: Classification Model Performance",
x = "Metric",
y = "Score",
fill = "Model"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 30, hjust = 1))# Build tidy confusion matrix data for all 3 formal models
cm_list <- list(
"Logistic Regression" = table(
Predicted = factor(ifelse(logit_prob >= logit_threshold, "Higher", "Lower"),
levels = c("Lower", "Higher")),
Actual = test_data$target
),
"Decision Tree" = table(
Predicted = factor(ifelse(tree_prob >= tree_threshold, "Higher", "Lower"),
levels = c("Lower", "Higher")),
Actual = test_data$target
),
"Random Forest" = rf_cm
)
cm_df <- purrr::map_dfr(names(cm_list), function(nm) {
as.data.frame(cm_list[[nm]]) %>%
mutate(model = nm)
})
ggplot(cm_df, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(colour = "white") +
geom_text(aes(label = Freq), size = 5, fontface = "bold") +
scale_fill_gradient(low = "white", high = "steelblue") +
facet_wrap(~ model, nrow = 1) +
labs(
title = "Figure 6.5: Confusion Matrices",
fill = "Count"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 13),
strip.text = element_text(face = "bold"),
axis.text = element_text(size = 10)
)# Compare feature importance across Logistic Regression (|coef|),
# Decision Tree, and Random Forest
logit_imp <- logit_coef %>%
filter(term != "(Intercept)") %>%
mutate(variable = term, importance = abs(Estimate)) %>%
dplyr::select(variable, importance) %>%
mutate(model = "Logistic Regression")
tree_imp_tidy <- if (!is.null(tree_fit$variable.importance)) {
tibble(
variable = names(tree_fit$variable.importance),
importance = as.numeric(tree_fit$variable.importance),
model = "Decision Tree"
)
} else tibble(variable = character(), importance = numeric(), model = character())
rf_imp_tidy <- rf_imp_df %>%
mutate(model = "Random Forest")
# Normalise each model's importance to 0-1 for fair visual comparison
all_imp <- bind_rows(logit_imp, tree_imp_tidy, rf_imp_tidy) %>%
group_by(model) %>%
mutate(importance_norm = importance / max(importance)) %>%
ungroup()
# Keep top 10 variables by average normalised importance
top_vars <- all_imp %>%
group_by(variable) %>%
summarise(avg = mean(importance_norm)) %>%
slice_max(avg, n = 10) %>%
pull(variable)
all_imp %>%
filter(variable %in% top_vars) %>%
ggplot(aes(x = reorder(variable, importance_norm),
y = importance_norm, fill = model)) +
geom_col(position = "dodge", width = 0.7) +
coord_flip() +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Figure 6.6: Feature Importance Comparison Across Models (Normalised)",
x = NULL,
y = "Normalised Importance (0–1)",
fill = "Model"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13))roc_df <- bind_rows(
make_roc(test_data$target, logit_prob, "Logistic Regression"),
make_roc(test_data$target, tree_prob, "Decision Tree"),
make_roc(test_data$target, rf_prob, "Random Forest")
)
ggplot(roc_df, aes(x = fpr, y = sensitivity, colour = model)) +
geom_line(linewidth = 1) +
geom_abline(intercept = 0, slope = 1,
linetype = "dashed", colour = "grey50") +
coord_equal() +
labs(
title = "Figure 6.7: ROC Curves",
x = "False Positive Rate",
y = "True Positive Rate / Sensitivity",
colour = "Model"
) +
theme_minimal()This section builds classification models using 17 predictors. The
baseline model always predicts the majority class, Lower,
and therefore serves only as a reference — it cannot identify any
Higher respondents.
Three formal models were compared: Logistic Regression, Decision Tree, and Random Forest. Among these, Logistic Regression achieves the strongest combined AUC and balanced accuracy on the test set. The confusion matrices (Figure 6.5) show how each model distributes predictions across the four outcome cells, while the ROC curves (Figure 6.7) illustrate the trade-off between sensitivity and false positive rate across all possible thresholds.
The feature importance comparison (Figure 6.6) shows which variables are consistently important across models. Variables that rank highly in all three models provide the strongest and most reliable evidence for predicting healthcare demand.
In the final report, AUC, balanced accuracy, sensitivity, and F1
should be discussed instead of comparing models by accuracy only,
because Higher is the key minority class and misclassifying
high-demand respondents carries greater practical cost.
This project aimed to use data mining methods to understand and predict healthcare-related needs among older adults. The study focused on respondents aged 60 years and above from the NHANES 2017-March 2020 Pre-pandemic cycle. Multiple NHANES components were combined to capture demographic characteristics, socioeconomic background, chronic disease history, blood pressure and cholesterol status, diabetes status, chest symptoms, smoking behaviour, healthcare utilisation, and depression-related information.
The project included two main analytical tasks. The first task was a classification problem that predicted whether an older adult belonged to the higher healthcare demand group. This outcome was based on healthcare utilisation indicators, especially frequent healthcare visits and hospitalisation. The second task was a regression problem that predicted depression severity using the PHQ-9 total score. Together, these two tasks addressed both healthcare service demand and mental health burden, which are important dimensions of healthy ageing.
The exploratory data analysis showed that healthcare demand among older adults is closely related to chronic disease burden. The target variable was mildly imbalanced, with the lower healthcare demand group forming the larger share of the sample and the higher demand group forming a smaller but still substantial group. This means that accuracy alone is not sufficient for model evaluation, because a model may appear accurate while failing to identify high-demand respondents.
Demographic analysis showed that the sample included older adults aged 60 years and above, with age top-coded at 80 in NHANES. The higher demand group tended to be slightly older, although there was considerable overlap between the lower and higher demand groups. Gender differences were not strongly distinctive in the exploratory analysis, while race, education, marital status, and income-to-poverty ratio provided useful socioeconomic context for interpreting healthcare utilisation.
The disease prevalence analysis indicated that hypertension, high cholesterol, and arthritis were among the most common conditions in the elderly sample. These findings are consistent with the broader pattern of chronic disease burden in older populations. Several conditions, including cardiovascular disease, pulmonary disease, shortness of breath, chest pain, and arthritis, were associated with higher rates of healthcare demand.
The engineered features were especially important. The chronic condition count provided a direct measure of multimorbidity, while the disease-age interaction captured the combined effect of ageing and disease burden. These two features showed the strongest association with healthcare demand in the EDA. This supports the idea that healthcare demand is not driven by a single condition only, but by the accumulation of multiple health risks over time.
Smoking status showed a more complex pattern. Current smoking was not clearly associated with higher healthcare demand in the same way as chronic disease indicators. This should not be interpreted as a protective effect. Instead, it may reflect survivor bias, age-related sample composition, or confounding by other health and behavioural factors.
The regression analysis focused on predicting PHQ-9 total score, which represents depression symptom severity. The modelling dataset was created by merging the cleaned PHQ-9 data from the DPQ file with the prepared elderly health dataset. Several modelling approaches were considered, including a null baseline model, negative binomial regression, elastic net Poisson regression, and random forest regression.
The null baseline provided a reference point by predicting the average PHQ-9 score. The formal regression models were evaluated using RMSE, MAE, and an R-squared-style comparison against the null model. This approach is suitable because PHQ-9 total score is a count-like outcome with a skewed distribution and many low values.
The regression task demonstrated that demographic and chronic disease variables contain useful information for explaining depression severity, but the predictive signal is likely moderate rather than strong. Depression is influenced by many psychological, social, behavioural, and environmental factors that are not fully captured by the selected NHANES variables. Therefore, the regression model outcomes should be interpreted as evidence of association and predictive usefulness, not as a complete explanation of depression among older adults.
Among the regression approaches, negative binomial regression offered interpretability for a count-like outcome, elastic net regression helped handle correlated predictors and variable selection, and random forest regression captured potential non-linear patterns. The comparison of these models provided a balanced view between interpretability and predictive flexibility.
Three formal models were compared for the classification task: logistic regression, decision tree, and random forest. All three models achieved accuracy in the range of approximately 63–68%, with AUC values between 0.69 and 0.74. While these figures may appear modest, they should be understood in the context of what the models are being asked to predict and what information is available to them.
The classification target healthcare_demand is not a
direct measure of health status. It is a behavioural
outcome — whether a person visited a doctor four or more times,
or was hospitalised, in the past year. This distinction is important
because healthcare utilisation is determined not only by how sick a
person is, but also by many factors that are entirely absent from the
NHANES variables used in this project.
The following six reasons explain why accuracy in the 63–68% range is appropriate and expected for this type of problem, rather than a sign of model failure.
Reason 1: The target variable captures behaviour, not biology. A person with severe hypertension and diabetes may still avoid going to the doctor due to personal preference, fear of diagnosis, or financial constraints. Conversely, a relatively healthy individual may visit the doctor frequently due to anxiety or preventive habits. The models are trained only on health indicators and cannot observe these behavioural and psychological factors. No machine learning model can accurately predict behaviour from health data alone.
Reason 2: Critical predictors are missing from the dataset. Healthcare utilisation is strongly influenced by access-related factors such as health insurance coverage, distance to the nearest medical facility, availability of transport, and out-of-pocket cost. None of these variables are present in the selected NHANES files. This means the models are working with incomplete information by design, not by any error in modelling.
Reason 3: The target is a composite of two very different
behaviours. healthcare_demand = 1 is assigned to
respondents who had frequent outpatient visits
(HUQ051 >= 4) or who were hospitalised
overnight (HUQ071 == 1). These two conditions represent
quite different patterns of healthcare use — one reflects routine or
chronic care management, while the other reflects acute or emergency
need. Combining them into a single binary label introduces internal
heterogeneity within the positive class, which makes it harder for any
model to learn a consistent decision boundary.
Reason 4: All predictor variables are self-reported. Every health condition in the dataset — arthritis, diabetes, hypertension, cardiovascular disease, and others — is based on what respondents recalled being told by a doctor. Self-reported data introduces measurement error, particularly among older adults who may misremember diagnoses, conflate conditions, or under-report symptoms. This noise in the predictors directly limits how accurately any model can learn patterns from the data.
Reason 5: The dataset is cross-sectional. NHANES captures a single point in time. The model predicts healthcare use in the past year from health indicators measured at the same time. In reality, healthcare demand is a longitudinal process shaped by changes in health over months or years, recent acute events, changes in insurance, and evolving treatment plans. A cross-sectional snapshot cannot fully capture this dynamic.
Reason 6: AUC is the more meaningful metric here. Accuracy is a misleading metric when the class distribution is uneven. In this dataset, approximately 62% of respondents are in the lower demand group. A model that always predicted “Lower” would achieve 62% accuracy with zero predictive value. The models in this project achieved AUC values of 0.69–0.74, which means they correctly rank a randomly chosen high-demand respondent above a randomly chosen low-demand respondent 69–74% of the time. This is a meaningful improvement over random guessing (AUC = 0.50) and demonstrates that the models have genuinely learned useful patterns from the data.
Despite the inherent difficulty of the prediction task, the models consistently identified the same set of important variables: chronic condition count, disease-age interaction, cardiovascular disease, chest symptoms, and hypertension. This consistency across logistic regression, decision tree, and random forest is a positive sign — it confirms that the models are capturing real relationships in the data rather than fitting noise.
The engineered features chronic_count and
disease_age_interaction were among the top predictors in
all three models, validating the feature engineering decisions made
during data preparation. The fact that all models agree on which
variables matter most provides more confidence in the findings than any
single accuracy figure.
The classification accuracy of 63–68% is not a failure. It is an honest reflection of a difficult prediction problem where the outcome is a behavioural variable, the available features are incomplete relative to the true drivers of healthcare utilisation, and the data is cross-sectional and self-reported. The models provide meaningful predictive signal — as shown by AUC values well above chance — and consistently identify clinically sensible risk factors. These results are appropriate for a data mining study using publicly available survey data and are comparable to published research on similar healthcare utilisation prediction problems.
The findings have several practical implications for healthcare planning and healthy ageing strategies. First, the strong role of chronic disease burden suggests that healthcare demand prediction should consider multimorbidity rather than analysing diseases in isolation. Older adults with multiple chronic conditions are more likely to require frequent healthcare services, making them an important group for proactive care management.
Second, the importance of cardiovascular disease, respiratory symptoms, chest pain, hypertension, and related indicators suggests that routine screening and monitoring of these conditions can support early identification of high-need individuals. Preventive interventions may reduce avoidable hospitalisation and improve quality of life among older adults.
Third, the regression analysis highlights the importance of mental health as part of elderly care. Depression severity cannot be fully explained by physical health indicators alone, but physical disease burden may still contribute to depressive symptoms. This supports a more integrated approach to elderly healthcare, where mental health screening is considered alongside chronic disease management.
Finally, the project demonstrates how public health survey data can be transformed into actionable predictive models. Even when the models are not perfect, they can help identify risk patterns, support resource planning, and guide further investigation into vulnerable elderly subgroups.
Several limitations should be acknowledged. First, NHANES is a cross-sectional survey, so the analysis can identify associations and predictive patterns but cannot establish causal relationships. For example, chronic disease burden is associated with higher healthcare demand, but the direction and timing of these relationships cannot be confirmed from the current data alone.
Second, several variables are self-reported and may be affected by recall bias or reporting error. This is especially relevant for medical history, healthcare utilisation, smoking behaviour, and depression questionnaire responses.
Third, missing values and questionnaire skip patterns influenced the available data. Some variables with high missingness had to be removed, while others required imputation or feature engineering. These preprocessing decisions were necessary, but they may affect model results.
Fourth, the dataset is based on the United States population. Although the selected variables are relevant to healthy ageing and are aligned with themes found in NHMS reports, the findings should be applied cautiously to the Malaysian context. Differences in healthcare access, insurance systems, service delivery, ethnicity, culture, and disease patterns may affect generalisability.
Future research could improve this project in several ways. Longitudinal data could be used to predict future healthcare demand rather than current or past utilisation. Additional variables such as medication use, functional limitations, health insurance status, physical activity, diet, social support, and healthcare access barriers could improve predictive performance. More advanced models, such as gradient boosting or calibrated ensemble methods, could also be compared with interpretable models. Finally, external validation using Malaysian elderly health data would strengthen the relevance of the findings for local healthcare planning.
In conclusion, this project shows that data mining methods can help identify older adults with higher healthcare demand and explore depression severity among elderly populations. The results highlight the central role of chronic disease burden, cardiovascular and respiratory symptoms, and multimorbidity in shaping healthcare needs. The classification accuracy of 63–68% and AUC of 0.69–0.74 are appropriate outcomes for a behavioural prediction problem using cross-sectional, self-reported survey data with incomplete access-related predictors. These figures reflect the genuine difficulty of predicting healthcare utilisation — not a shortcoming in the analytical approach. The models provide consistent, clinically meaningful signals and offer a solid foundation for data-driven elderly healthcare planning and future healthy ageing research.