Heart disease remains one of the leading causes of mortality in the United States, affecting millions of adults across all demographic groups. It is closely associated with multiple risk factors, including poor physical health, unhealthy lifestyle behaviours such as smoking and physical inactivity, as well as chronic conditions such as diabetes and obesity. These factors often develop over time and contribute to a higher risk of cardiovascular disease.
Understanding the relationship between personal health indicators, overall physical well-being, and heart disease risk can provide valuable insights for public health interventions and early detection strategies. This information is important for identifying individuals at high risk and improving the effectiveness of preventive healthcare strategies.
This project utilises the Behavioral Risk Factor Surveillance System (BRFSS) 2022 dataset, a large-scale health survey conducted by the Centers for Disease Control and Prevention (CDC). The dataset contains self-reported information from U.S. adults, including health behaviours, chronic conditions, and use of preventive health services.
This project aims to investigate the relationship between personal health indicators, physical well-being, and heart disease risk using the BRFSS 2022 dataset. Two predictive modelling tasks are conducted.
Can the number of physically unhealthy days (PhysicalHealthDays) experienced by a U.S. adult be predicted based on personal health indicators?
Objectives:
1. To identify key factors associated with poor physical health among
adults.
2. To examine the relationship between lifestyle behaviours, chronic
conditions, and physically unhealthy days.
3. To develop a regression model for predicting physical health
outcomes.
Can the occurrence of a heart attack (HadHeartAttack) be classified based on demographic and lifestyle characteristics of U.S. adults?
Objectives:
1. To identify important demographic, lifestyle, and health-related
factors associated with heart disease risk.
2. To develop a classification model for predicting heart attack
occurrence.
3. To support early risk identification and improve preventive
healthcare strategies using data-driven insights.
The dataset used in this project is the Personal Key Indicators of Heart Disease (2022 update), published on Kaggle by Kamil Pytlak. It is based on the annual Behavioral Risk Factor Surveillance System (BRFSS) survey conducted by the Centers for Disease Control and Prevention (CDC). The dataset contains self-reported information from non-institutionalised adults in the United States, covering health risk behaviours, chronic conditions, and use of preventive health services.
The 40 features are organised into six thematic categories covering different domains of personal health information.
| Category | Features | Description |
|---|---|---|
| Numerical health measures | PhysicalHealthDays, MentalHealthDays,
SleepHours, BMI, HeightInMeters,
WeightInKilograms |
Continuous measurements of physical and mental well-being, body composition, and sleep |
| Chronic conditions | HadHeartAttack, HadAngina,
HadStroke, HadAsthma, HadCOPD,
HadDiabetes, HadArthritis,
HadKidneyDisease |
Self-reported history of diagnosed chronic diseases and cardiovascular conditions (Yes / No) |
| Lifestyle & behaviour | SmokerStatus, ECigaretteUsage,
PhysicalActivities, AlcoholDrinkers,
HighRiskLastYear |
Modifiable behavioural risk factors including smoking, physical activity, and alcohol use |
| Physical difficulties | DifficultyWalking,
DifficultyConcentrating,
DifficultyDressingBathing, DifficultyErrands,
BlindOrVisionDifficulty,
DeafOrHardOfHearing |
Functional limitations and disability indicators reflecting health-related quality of life (Yes / No) |
| Preventive care | FluVaxLast12, PneumoVaxEver,
HIVTesting, TetanusLast10Tdap,
ChestScan, LastCheckupTime |
Engagement with preventive health services including vaccines, screenings, and routine check-ups |
| Demographics | Sex, AgeCategory,
RaceEthnicityCategory, State,
GeneralHealth, RemovedTeeth,
CovidPos |
Respondent identity, socioeconomic background, geographic location, and self-rated health status |
We start by loading the libraries used throughout the analysis and
reading the CSV file into R. Text columns are kept as plain characters
for now (stringsAsFactors = FALSE) so they can be cleaned
first; the conversion to factors is done once the data is tidy.
library(dplyr)
library(ggplot2)
library(tidyr)
library(knitr)
library(caret)
library(summarytools)
library(VIM)
library(themis)
library(recipes)
library(scales)
library(gridExtra)
library(reshape2)
df <- read.csv('heart_2022.csv', stringsAsFactors = FALSE)
In the dataset, missing values were stored as empty strings
(""). To make these gaps clearly visible and ensure they
are handled consistently, the empty strings were converted to NA. This
step standardises the representation of missing values and prepares the
data for cleaning and analysis later on.
df[df == ""] <- NA
A quick check of the dataset’s structure confirms the data loaded correctly and provides an overview of the variable types. The dataset contains 445,132 rows and 40 columns — 6 numerical columns and 34 categorical columns covering chronic conditions, lifestyle behaviours, and demographics.
glimpse(df)
## Rows: 445,132
## Columns: 40
## $ State <chr> "Alabama", "Alabama", "Alabama", "Alabama", …
## $ Sex <chr> "Female", "Female", "Female", "Female", "Fem…
## $ GeneralHealth <chr> "Very good", "Excellent", "Very good", "Exce…
## $ PhysicalHealthDays <dbl> 0, 0, 2, 0, 2, 1, 0, 0, 0, 1, 8, 0, 5, 0, 30…
## $ MentalHealthDays <dbl> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 5,…
## $ LastCheckupTime <chr> "Within past year (anytime less than 12 mont…
## $ PhysicalActivities <chr> "No", "No", "Yes", "Yes", "Yes", "No", "Yes"…
## $ SleepHours <dbl> 8, 6, 5, 7, 9, 7, 7, 8, 6, 7, 8, 6, 6, 8, 8,…
## $ RemovedTeeth <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ HadHeartAttack <chr> "No", "No", "No", "No", "No", "Yes", "No", "…
## $ HadAngina <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ HadStroke <chr> "No", "No", "No", "No", "No", "Yes", "No", "…
## $ HadAsthma <chr> "No", "No", "No", "Yes", "No", "No", "No", "…
## $ HadSkinCancer <chr> "No", "Yes", "Yes", "No", "No", "No", "No", …
## $ HadCOPD <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ HadDepressiveDisorder <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ HadKidneyDisease <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ HadArthritis <chr> "No", "No", "No", "Yes", "No", "No", "No", "…
## $ HadDiabetes <chr> "Yes", "No", "No", "No", "No", "Yes", "No", …
## $ DeafOrHardOfHearing <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ BlindOrVisionDifficulty <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ DifficultyConcentrating <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ DifficultyWalking <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ DifficultyDressingBathing <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ DifficultyErrands <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ SmokerStatus <chr> "Never smoked", "Never smoked", "Never smoke…
## $ ECigaretteUsage <chr> "Not at all (right now)", "Never used e-ciga…
## $ ChestScan <chr> "No", "No", "No", "Yes", "Yes", "No", "No", …
## $ RaceEthnicityCategory <chr> "White only, Non-Hispanic", "White only, Non…
## $ AgeCategory <chr> "Age 80 or older", "Age 80 or older", "Age 5…
## $ HeightInMeters <dbl> NA, 1.60, 1.57, 1.65, 1.57, 1.80, 1.65, 1.63…
## $ WeightInKilograms <dbl> NA, 68.04, 63.50, 63.50, 53.98, 84.82, 62.60…
## $ BMI <dbl> NA, 26.57, 25.61, 23.30, 21.77, 26.08, 22.96…
## $ AlcoholDrinkers <chr> "No", "No", "No", "No", "Yes", "No", "Yes", …
## $ HIVTesting <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ FluVaxLast12 <chr> "Yes", "No", "No", "Yes", "No", "No", "No", …
## $ PneumoVaxEver <chr> "No", "No", "No", "Yes", "Yes", "Yes", "No",…
## $ TetanusLast10Tdap <chr> "Yes, received tetanus shot but not sure wha…
## $ HighRiskLastYear <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ CovidPos <chr> "No", "No", "Yes", "No", "No", "No", "No", "…
Numerical variables such as PhysicalHealthDays,
MentalHealthDays, BMI, and
SleepHours show wide variation. Most respondents report a
low number of unhealthy days, while a small proportion show extreme
values. The categorical variables show that most respondents do not
report chronic conditions such as heart disease, and missing values are
present in several variables.
summary(df)
## State Sex GeneralHealth PhysicalHealthDays
## Length :445132 Length :445132 Length :445132 Min. : 0.000
## N.unique : 54 N.unique : 2 N.unique : 5 1st Qu.: 0.000
## N.blank : 0 N.blank : 0 N.blank : 0 Median : 0.000
## Min.nchar: 4 Min.nchar: 4 Min.nchar: 4 Mean : 4.348
## Max.nchar: 20 Max.nchar: 6 Max.nchar: 9 3rd Qu.: 3.000
## NAs : 1198 Max. :30.000
## NAs :10927
## MentalHealthDays LastCheckupTime PhysicalActivities SleepHours
## Min. : 0.000 Length :445132 Length :445132 Min. : 1.000
## 1st Qu.: 0.000 N.unique : 4 N.unique : 2 1st Qu.: 6.000
## Median : 0.000 N.blank : 0 N.blank : 0 Median : 7.000
## Mean : 4.383 Min.nchar: 19 Min.nchar: 2 Mean : 7.023
## 3rd Qu.: 5.000 Max.nchar: 55 Max.nchar: 3 3rd Qu.: 8.000
## Max. :30.000 NAs : 8308 NAs : 1093 Max. :24.000
## NAs :9067 NAs :5453
## RemovedTeeth HadHeartAttack HadAngina HadStroke
## Length :445132 Length :445132 Length :445132 Length :445132
## N.unique : 4 N.unique : 2 N.unique : 2 N.unique : 2
## N.blank : 0 N.blank : 0 N.blank : 0 N.blank : 0
## Min.nchar: 3 Min.nchar: 2 Min.nchar: 2 Min.nchar: 2
## Max.nchar: 22 Max.nchar: 3 Max.nchar: 3 Max.nchar: 3
## NAs : 11360 NAs : 3065 NAs : 4405 NAs : 1557
##
## HadAsthma HadSkinCancer HadCOPD HadDepressiveDisorder
## Length :445132 Length :445132 Length :445132 Length :445132
## N.unique : 2 N.unique : 2 N.unique : 2 N.unique : 2
## N.blank : 0 N.blank : 0 N.blank : 0 N.blank : 0
## Min.nchar: 2 Min.nchar: 2 Min.nchar: 2 Min.nchar: 2
## Max.nchar: 3 Max.nchar: 3 Max.nchar: 3 Max.nchar: 3
## NAs : 1773 NAs : 3143 NAs : 2219 NAs : 2812
##
## HadKidneyDisease HadArthritis HadDiabetes DeafOrHardOfHearing
## Length :445132 Length :445132 Length :445132 Length :445132
## N.unique : 2 N.unique : 2 N.unique : 4 N.unique : 2
## N.blank : 0 N.blank : 0 N.blank : 0 N.blank : 0
## Min.nchar: 2 Min.nchar: 2 Min.nchar: 2 Min.nchar: 2
## Max.nchar: 3 Max.nchar: 3 Max.nchar: 39 Max.nchar: 3
## NAs : 1926 NAs : 2633 NAs : 1087 NAs : 20647
##
## BlindOrVisionDifficulty DifficultyConcentrating DifficultyWalking
## Length :445132 Length :445132 Length :445132
## N.unique : 2 N.unique : 2 N.unique : 2
## N.blank : 0 N.blank : 0 N.blank : 0
## Min.nchar: 2 Min.nchar: 2 Min.nchar: 2
## Max.nchar: 3 Max.nchar: 3 Max.nchar: 3
## NAs : 21564 NAs : 24240 NAs : 24012
##
## DifficultyDressingBathing DifficultyErrands SmokerStatus
## Length :445132 Length :445132 Length :445132
## N.unique : 2 N.unique : 2 N.unique : 4
## N.blank : 0 N.blank : 0 N.blank : 0
## Min.nchar: 2 Min.nchar: 2 Min.nchar: 12
## Max.nchar: 3 Max.nchar: 3 Max.nchar: 37
## NAs : 23915 NAs : 25656 NAs : 35462
##
## ECigaretteUsage ChestScan RaceEthnicityCategory AgeCategory
## Length :445132 Length :445132 Length :445132 Length :445132
## N.unique : 4 N.unique : 2 N.unique : 5 N.unique : 13
## N.blank : 0 N.blank : 0 N.blank : 0 N.blank : 0
## Min.nchar: 18 Min.nchar: 2 Min.nchar: 8 Min.nchar: 12
## Max.nchar: 41 Max.nchar: 3 Max.nchar: 29 Max.nchar: 15
## NAs : 35660 NAs : 56046 NAs : 14057 NAs : 9079
##
## HeightInMeters WeightInKilograms BMI AlcoholDrinkers
## Min. :0.910 Min. : 22.68 Min. :12.02 Length :445132
## 1st Qu.:1.630 1st Qu.: 68.04 1st Qu.:24.13 N.unique : 2
## Median :1.700 Median : 80.74 Median :27.44 N.blank : 0
## Mean :1.703 Mean : 83.07 Mean :28.53 Min.nchar: 2
## 3rd Qu.:1.780 3rd Qu.: 95.25 3rd Qu.:31.75 Max.nchar: 3
## Max. :2.410 Max. :292.57 Max. :99.64 NAs : 46574
## NAs :28652 NAs :42078 NAs :48806
## HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap
## Length :445132 Length :445132 Length :445132 Length :445132
## N.unique : 2 N.unique : 2 N.unique : 2 N.unique : 4
## N.blank : 0 N.blank : 0 N.blank : 0 N.blank : 0
## Min.nchar: 2 Min.nchar: 2 Min.nchar: 2 Min.nchar: 18
## Max.nchar: 3 Max.nchar: 3 Max.nchar: 3 Max.nchar: 57
## NAs : 66127 NAs : 47121 NAs : 77040 NAs : 82516
##
## HighRiskLastYear CovidPos
## Length :445132 Length :445132
## N.unique : 2 N.unique : 3
## N.blank : 0 N.blank : 0
## Min.nchar: 2 Min.nchar: 2
## Max.nchar: 3 Max.nchar: 61
## NAs : 50623 NAs : 50764
##
The dfSummary() function provides a detailed overview of
the dataset, including frequencies, distributions and missing-value
percentages, which helps to guide the cleaning decisions.
print(dfSummary(df,
graph.magnif = 0.75),
method = "render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | State [character] |
|
|
445132 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | Sex [character] |
|
|
445132 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | GeneralHealth [character] |
|
|
443934 (99.7%) | 1198 (0.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | PhysicalHealthDays [numeric] |
|
31 distinct values | 434205 (97.5%) | 10927 (2.5%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | MentalHealthDays [numeric] |
|
31 distinct values | 436065 (98.0%) | 9067 (2.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | LastCheckupTime [character] |
|
|
436824 (98.1%) | 8308 (1.9%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | PhysicalActivities [character] |
|
|
444039 (99.8%) | 1093 (0.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | SleepHours [numeric] |
|
24 distinct values | 439679 (98.8%) | 5453 (1.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | RemovedTeeth [character] |
|
|
433772 (97.4%) | 11360 (2.6%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | HadHeartAttack [character] |
|
|
442067 (99.3%) | 3065 (0.7%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | HadAngina [character] |
|
|
440727 (99.0%) | 4405 (1.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | HadStroke [character] |
|
|
443575 (99.7%) | 1557 (0.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | HadAsthma [character] |
|
|
443359 (99.6%) | 1773 (0.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | HadSkinCancer [character] |
|
|
441989 (99.3%) | 3143 (0.7%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | HadCOPD [character] |
|
|
442913 (99.5%) | 2219 (0.5%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | HadDepressiveDisorder [character] |
|
|
442320 (99.4%) | 2812 (0.6%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 17 | HadKidneyDisease [character] |
|
|
443206 (99.6%) | 1926 (0.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 18 | HadArthritis [character] |
|
|
442499 (99.4%) | 2633 (0.6%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 19 | HadDiabetes [character] |
|
|
444045 (99.8%) | 1087 (0.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20 | DeafOrHardOfHearing [character] |
|
|
424485 (95.4%) | 20647 (4.6%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 21 | BlindOrVisionDifficulty [character] |
|
|
423568 (95.2%) | 21564 (4.8%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 22 | DifficultyConcentrating [character] |
|
|
420892 (94.6%) | 24240 (5.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 23 | DifficultyWalking [character] |
|
|
421120 (94.6%) | 24012 (5.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 24 | DifficultyDressingBathing [character] |
|
|
421217 (94.6%) | 23915 (5.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 25 | DifficultyErrands [character] |
|
|
419476 (94.2%) | 25656 (5.8%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 26 | SmokerStatus [character] |
|
|
409670 (92.0%) | 35462 (8.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 27 | ECigaretteUsage [character] |
|
|
409472 (92.0%) | 35660 (8.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 28 | ChestScan [character] |
|
|
389086 (87.4%) | 56046 (12.6%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 29 | RaceEthnicityCategory [character] |
|
|
431075 (96.8%) | 14057 (3.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 30 | AgeCategory [character] |
|
|
436053 (98.0%) | 9079 (2.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 31 | HeightInMeters [numeric] |
|
109 distinct values | 416480 (93.6%) | 28652 (6.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 32 | WeightInKilograms [numeric] |
|
599 distinct values | 403054 (90.5%) | 42078 (9.5%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 33 | BMI [numeric] |
|
3985 distinct values | 396326 (89.0%) | 48806 (11.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 34 | AlcoholDrinkers [character] |
|
|
398558 (89.5%) | 46574 (10.5%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 35 | HIVTesting [character] |
|
|
379005 (85.1%) | 66127 (14.9%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 36 | FluVaxLast12 [character] |
|
|
398011 (89.4%) | 47121 (10.6%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 37 | PneumoVaxEver [character] |
|
|
368092 (82.7%) | 77040 (17.3%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 38 | TetanusLast10Tdap [character] |
|
|
362616 (81.5%) | 82516 (18.5%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 39 | HighRiskLastYear [character] |
|
|
394509 (88.6%) | 50623 (11.4%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 40 | CovidPos [character] |
|
|
394368 (88.6%) | 50764 (11.4%) |
Generated by summarytools 1.1.5 (R version 4.6.0)
2026-06-09
First, we find out how much data is actually missing and where. The table below counts the missing values in each variable.
data.frame(variable = colnames(df),
missing_count = sapply(df, function(x) sum(is.na(x))),
missing_pct = round(100 * sapply(df, function(x) sum(is.na(x))) / nrow(df), 1),
row.names = NULL)
## variable missing_count missing_pct
## 1 State 0 0.0
## 2 Sex 0 0.0
## 3 GeneralHealth 1198 0.3
## 4 PhysicalHealthDays 10927 2.5
## 5 MentalHealthDays 9067 2.0
## 6 LastCheckupTime 8308 1.9
## 7 PhysicalActivities 1093 0.2
## 8 SleepHours 5453 1.2
## 9 RemovedTeeth 11360 2.6
## 10 HadHeartAttack 3065 0.7
## 11 HadAngina 4405 1.0
## 12 HadStroke 1557 0.3
## 13 HadAsthma 1773 0.4
## 14 HadSkinCancer 3143 0.7
## 15 HadCOPD 2219 0.5
## 16 HadDepressiveDisorder 2812 0.6
## 17 HadKidneyDisease 1926 0.4
## 18 HadArthritis 2633 0.6
## 19 HadDiabetes 1087 0.2
## 20 DeafOrHardOfHearing 20647 4.6
## 21 BlindOrVisionDifficulty 21564 4.8
## 22 DifficultyConcentrating 24240 5.4
## 23 DifficultyWalking 24012 5.4
## 24 DifficultyDressingBathing 23915 5.4
## 25 DifficultyErrands 25656 5.8
## 26 SmokerStatus 35462 8.0
## 27 ECigaretteUsage 35660 8.0
## 28 ChestScan 56046 12.6
## 29 RaceEthnicityCategory 14057 3.2
## 30 AgeCategory 9079 2.0
## 31 HeightInMeters 28652 6.4
## 32 WeightInKilograms 42078 9.5
## 33 BMI 48806 11.0
## 34 AlcoholDrinkers 46574 10.5
## 35 HIVTesting 66127 14.9
## 36 FluVaxLast12 47121 10.6
## 37 PneumoVaxEver 77040 17.3
## 38 TetanusLast10Tdap 82516 18.5
## 39 HighRiskLastYear 50623 11.4
## 40 CovidPos 50764 11.4
Most variables (38 out of 40) turn out to have some missing values, ranging from very small amounts up to around 18.5%. No variable is missing so much data that it has to be discarded entirely.
Before deciding how to treat the missing values, we need to
understand why the data is missing, because that affects
whether deleting or filling in the values is the better choice. To check
this, we compare the missing rate of BMI across age
groups.
df %>%
mutate(BMI_missing = is.na(BMI)) %>%
group_by(AgeCategory) %>%
summarise(pct_missing = round(100 * mean(BMI_missing), 1)) %>%
arrange(AgeCategory)
## # A tibble: 14 × 2
## AgeCategory pct_missing
## <chr> <dbl>
## 1 Age 18 to 24 11.4
## 2 Age 25 to 29 12.2
## 3 Age 30 to 34 13.2
## 4 Age 35 to 39 11.9
## 5 Age 40 to 44 11.3
## 6 Age 45 to 49 10.8
## 7 Age 50 to 54 10.7
## 8 Age 55 to 59 10.2
## 9 Age 60 to 64 9.8
## 10 Age 65 to 69 9.2
## 11 Age 70 to 74 8.7
## 12 Age 75 to 79 8.1
## 13 Age 80 or older 8
## 14 <NA> 48.9
The result shows that the missing rate is higher for younger respondents (around 11-13%) and lower for older respondents (around 8%), meaning the missingness follows a pattern related to an observed variable, which is age. This tells us the data is Missing At Random (MAR) rather than missing in a completely random way. Deleting rows with missing values would remove proportionally more younger respondents and bias the sample, so filling in the predictor values (imputation) is the safer approach.
HadHeartAttack and PhysicalHealthDays are
the two variables we are trying to predict. A record with no target
value cannot be used to train or evaluate a model, and imputing it would
fabricate the outcome being predicted. These rows are therefore
removed.
df <- df %>% filter(!is.na(HadHeartAttack), !is.na(PhysicalHealthDays))
The dropped fractions are small (0.7% and 2.5%), so the sample loss is minimal. Removing rows with a missing target reduced the dataset from 445,132 to 431,470 records.
The numeric predictors are filled in using the median. The median is
used rather than the mean because, based on the summary statistics, the
data is skewed, and the median is not affected by skew or extreme
values. Note that PhysicalHealthDays is deliberately left
out here, since it is a target rather than a predictor.
num_predictors <- c("MentalHealthDays", "SleepHours",
"HeightInMeters", "WeightInKilograms", "BMI")
for (col in num_predictors) {
df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
}
# Quick check
sapply(num_predictors, function(c) sum(is.na(df[[c]])))
## MentalHealthDays SleepHours HeightInMeters WeightInKilograms
## 0 0 0 0
## BMI
## 0
Categorical variables have no mean or median, so the missing values are filled with the mode, the most common category in each column. This keeps every record and avoids creating an artificial “Missing” category that could be mistaken for a real group.
get_mode <- function(x) {
x <- x[!is.na(x)]
names(sort(table(x), decreasing = TRUE))[1]
}
cat_cols <- names(df)[sapply(df, is.character)]
for (col in cat_cols) {
df[[col]][is.na(df[[col]])] <- get_mode(df[[col]])
}
A final check confirms that the deletion and imputation steps removed all missing values. Every variable should now show a count of zero.
data.frame(variable = colnames(df),
missing_count = sapply(df, function(x) sum(is.na(x))),
row.names = NULL)
## variable missing_count
## 1 State 0
## 2 Sex 0
## 3 GeneralHealth 0
## 4 PhysicalHealthDays 0
## 5 MentalHealthDays 0
## 6 LastCheckupTime 0
## 7 PhysicalActivities 0
## 8 SleepHours 0
## 9 RemovedTeeth 0
## 10 HadHeartAttack 0
## 11 HadAngina 0
## 12 HadStroke 0
## 13 HadAsthma 0
## 14 HadSkinCancer 0
## 15 HadCOPD 0
## 16 HadDepressiveDisorder 0
## 17 HadKidneyDisease 0
## 18 HadArthritis 0
## 19 HadDiabetes 0
## 20 DeafOrHardOfHearing 0
## 21 BlindOrVisionDifficulty 0
## 22 DifficultyConcentrating 0
## 23 DifficultyWalking 0
## 24 DifficultyDressingBathing 0
## 25 DifficultyErrands 0
## 26 SmokerStatus 0
## 27 ECigaretteUsage 0
## 28 ChestScan 0
## 29 RaceEthnicityCategory 0
## 30 AgeCategory 0
## 31 HeightInMeters 0
## 32 WeightInKilograms 0
## 33 BMI 0
## 34 AlcoholDrinkers 0
## 35 HIVTesting 0
## 36 FluVaxLast12 0
## 37 PneumoVaxEver 0
## 38 TetanusLast10Tdap 0
## 39 HighRiskLastYear 0
## 40 CovidPos 0
The dataset is checked for duplicate rows. Because it contains no personal identifiers and is made up entirely of categorical health and demographic answers, identical rows most likely come from different respondents who happened to give the same responses. They are therefore kept, as removing them would discard valid data.
cat("Duplicate rows:", sum(duplicated(df)), "\n")
## Duplicate rows: 322
The State column is a geographic identifier with 50
categories but no direct clinical relevance to heart attack risk. Any
health differences between states are already captured by other
variables such as age and lifestyle, so removing it reduces complexity
and noise without losing useful information.
df <- df %>% select(-State)
The HadDiabetes variable originally had four categories,
including small groups such as pre-diabetes and pregnancy-only. These
were combined into a clear Yes/No indicator, creating a simpler binary
predictor and avoiding sparse categories.
df <- df %>%
mutate(HadDiabetes = case_when(
HadDiabetes == "Yes" ~ "Yes",
TRUE ~ "No"))
Boxplots are a quick way to spot extreme values in the numeric variables.
num_vars <- c("PhysicalHealthDays", "MentalHealthDays", "SleepHours",
"HeightInMeters", "WeightInKilograms", "BMI")
par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))
for (col in num_vars) boxplot(df[[col]], main = col)
par(mfrow = c(1, 1))
The boxplots reveal extreme values across the six numeric variables,
but not every high value is a problem. The high values in
PhysicalHealthDays and MentalHealthDays simply
reflect the survey design, where 0 to 30 days is a valid answer rather
than an error, so these two variables are left unchanged. In contrast,
the body-measurement and sleep variables show extreme values that are
more likely caused by measurement or reporting errors. To address this,
the IQR method is applied, whereby any value beyond 1.5 times the
interquartile range is replaced with the column median.
clean_iqr <- function(data, col) {
Q1 <- quantile(data[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(data[[col]], 0.75, na.rm = TRUE)
iqr <- Q3 - Q1
lo <- Q1 - 1.5 * iqr
hi <- Q3 + 1.5 * iqr
out <- data[[col]] < lo | data[[col]] > hi
data[[col]][out] <- median(data[[col]], na.rm = TRUE)
data
}
for (col in c("BMI", "WeightInKilograms", "HeightInMeters", "SleepHours")) {
df <- clean_iqr(df, col)
}
Some useful information is not directly available in a single column
but can be created from existing ones. Three new variables were added to
improve model performance. BMI_category groups the numeric
BMI into the standard WHO weight bands, which carry clearer health
meaning than a raw number. ChronicConditionCount summarises
overall health burden by counting eight chronic conditions, providing a
single measure instead of multiple columns. PoorHealth is a
simple flag for anyone who rated their general health as fair or
poor.
# BMI grouped into WHO weight bands
df$BMI_category <- cut(df$BMI,
breaks = c(0, 18.5, 25, 30, 100),
labels = c("Underweight", "Normal", "Overweight", "Obese"),
right = FALSE)
# Number of chronic conditions per person
condition_cols <- c("HadAngina", "HadStroke", "HadAsthma", "HadCOPD",
"HadKidneyDisease", "HadArthritis", "HadDiabetes",
"HadDepressiveDisorder")
df$ChronicConditionCount <- rowSums(df[condition_cols] == "Yes")
# Flag for self-reported fair or poor health
df$PoorHealth <- ifelse(df$GeneralHealth %in% c("Fair", "Poor"), 1, 0)
ncol(df)
## [1] 42
table(df$BMI_category) # counts per BMI band
##
## Underweight Normal Overweight Obese
## 6213 113752 200846 110659
table(df$ChronicConditionCount) # counts per number of conditions (0 to 8)
##
## 0 1 2 3 4 5 6 7 8
## 177916 133738 69426 30744 12838 4815 1534 395 64
table(df$PoorHealth) # counts of 0 vs 1
##
## 0 1
## 356582 74888
R’s modelling functions expect categorical variables to be stored as
factors rather than plain text, so all the character columns are
converted here. HadHeartAttack is set as a factor because
it is the classification target, while AgeCategory is made
an ordered factor since its categories have a natural ranking from
youngest to oldest. A final summary then confirms the dataset structure
after all cleaning steps.
cat_cols <- names(df)[sapply(df, is.character)]
df[cat_cols] <- lapply(df[cat_cols], as.factor)
df$HadHeartAttack <- as.factor(df$HadHeartAttack)
df$BMI_category <- as.factor(df$BMI_category)
df$AgeCategory <- factor(df$AgeCategory, ordered = TRUE)
print(dfSummary(df,
graph.magnif = 0.75),
method = "render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Sex [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | GeneralHealth [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | PhysicalHealthDays [numeric] |
|
31 distinct values | 431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | MentalHealthDays [numeric] |
|
31 distinct values | 431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | LastCheckupTime [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | PhysicalActivities [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | SleepHours [numeric] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | RemovedTeeth [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | HadHeartAttack [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | HadAngina [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | HadStroke [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | HadAsthma [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | HadSkinCancer [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | HadCOPD [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | HadDepressiveDisorder [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | HadKidneyDisease [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 17 | HadArthritis [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 18 | HadDiabetes [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 19 | DeafOrHardOfHearing [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 20 | BlindOrVisionDifficulty [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 21 | DifficultyConcentrating [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 22 | DifficultyWalking [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 23 | DifficultyDressingBathing [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 24 | DifficultyErrands [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 25 | SmokerStatus [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 26 | ECigaretteUsage [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 27 | ChestScan [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 28 | RaceEthnicityCategory [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 29 | AgeCategory [ordered, factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 30 | HeightInMeters [numeric] |
|
57 distinct values | 431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 31 | WeightInKilograms [numeric] |
|
293 distinct values | 431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 32 | BMI [numeric] |
|
2301 distinct values | 431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 33 | AlcoholDrinkers [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 34 | HIVTesting [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 35 | FluVaxLast12 [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 36 | PneumoVaxEver [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 37 | TetanusLast10Tdap [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 38 | HighRiskLastYear [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 39 | CovidPos [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 40 | BMI_category [factor] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 41 | ChronicConditionCount [numeric] |
|
|
431470 (100.0%) | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 42 | PoorHealth [numeric] |
|
|
431470 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.1.5 (R version 4.6.0)
2026-06-09
sapply(df, class)
## $Sex
## [1] "factor"
##
## $GeneralHealth
## [1] "factor"
##
## $PhysicalHealthDays
## [1] "numeric"
##
## $MentalHealthDays
## [1] "numeric"
##
## $LastCheckupTime
## [1] "factor"
##
## $PhysicalActivities
## [1] "factor"
##
## $SleepHours
## [1] "numeric"
##
## $RemovedTeeth
## [1] "factor"
##
## $HadHeartAttack
## [1] "factor"
##
## $HadAngina
## [1] "factor"
##
## $HadStroke
## [1] "factor"
##
## $HadAsthma
## [1] "factor"
##
## $HadSkinCancer
## [1] "factor"
##
## $HadCOPD
## [1] "factor"
##
## $HadDepressiveDisorder
## [1] "factor"
##
## $HadKidneyDisease
## [1] "factor"
##
## $HadArthritis
## [1] "factor"
##
## $HadDiabetes
## [1] "factor"
##
## $DeafOrHardOfHearing
## [1] "factor"
##
## $BlindOrVisionDifficulty
## [1] "factor"
##
## $DifficultyConcentrating
## [1] "factor"
##
## $DifficultyWalking
## [1] "factor"
##
## $DifficultyDressingBathing
## [1] "factor"
##
## $DifficultyErrands
## [1] "factor"
##
## $SmokerStatus
## [1] "factor"
##
## $ECigaretteUsage
## [1] "factor"
##
## $ChestScan
## [1] "factor"
##
## $RaceEthnicityCategory
## [1] "factor"
##
## $AgeCategory
## [1] "ordered" "factor"
##
## $HeightInMeters
## [1] "numeric"
##
## $WeightInKilograms
## [1] "numeric"
##
## $BMI
## [1] "numeric"
##
## $AlcoholDrinkers
## [1] "factor"
##
## $HIVTesting
## [1] "factor"
##
## $FluVaxLast12
## [1] "factor"
##
## $PneumoVaxEver
## [1] "factor"
##
## $TetanusLast10Tdap
## [1] "factor"
##
## $HighRiskLastYear
## [1] "factor"
##
## $CovidPos
## [1] "factor"
##
## $BMI_category
## [1] "factor"
##
## $ChronicConditionCount
## [1] "numeric"
##
## $PoorHealth
## [1] "numeric"
The data is split into 80% for training and 20% for testing, so that
models can be built on one portion and judged on data they have not
seen. The split is stratified on HadHeartAttack, which
keeps the same proportion of heart-attack cases in both sets, and a
fixed random seed makes the split reproducible. Importantly, the split
is done before any class balancing, so that the test set stays a
realistic sample.
set.seed(123)
train_index <- createDataPartition(df$HadHeartAttack, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
nrow(train_data)
## [1] 345177
nrow(test_data)
## [1] 86293
In the training set, heart-attack cases make up only about 5.6% of the records. This imbalance means a model could achieve high accuracy just by always predicting “No”, while completely missing the cases that matter most.
library(recipes)
library(themis)
table(train_data$HadHeartAttack)
##
## No Yes
## 325860 19317
To correct the imbalance, SMOTE-NC is applied to the training set. Instead of duplicating existing cases, it creates synthetic minority-class records. This version of SMOTE is designed for datasets that mix categorical and numeric variables, which fits this dataset well. Balancing is applied only to the training set, while the test set is left unchanged so that evaluation reflects real-world performance.
balance_recipe <- recipe(HadHeartAttack ~ ., data = train_data) %>%
step_smotenc(HadHeartAttack, over_ratio = 1, seed = 123)
train_balanced <- prep(balance_recipe) %>% juice()
After balancing, the two classes are equal in the training set.
table(train_balanced$HadHeartAttack)
##
## No Yes
## 325860 325860
The preprocessing is now complete, and it produces three datasets for
the modelling stage. train_balanced is the balanced
training set used to build the classification model for Question 2.
test_data is the untouched test set, kept at the realistic
heart-attack rate, used to evaluate that model fairly. df
is the full cleaned dataset used for the Question 1 regression task.
Along the way, missing values were handled, outliers corrected where needed, three new features engineered, categorical variables converted to factors, and the class imbalance fixed. The data is now clean, consistent, and ready for modelling.
Exploratory Data Analysis (EDA) is conducted on the cleaned dataset
(df) after preprocessing. The goal is to understand the
distribution of key variables, examine relationships between predictors
and the two target variables (PhysicalHealthDays for
regression and HadHeartAttack for classification), and
uncover patterns that inform the modelling stage.
The distribution of PhysicalHealthDays is heavily
right-skewed. The majority of respondents report zero physically
unhealthy days, while a smaller proportion reports values closer to 30
days. This skewness confirms that median-based imputation was
appropriate during preprocessing and should be considered when choosing
regression models.
p1 <- ggplot(df, aes(x = PhysicalHealthDays)) +
geom_histogram(binwidth = 1, fill = "steelblue", colour = "white", alpha = 0.85) +
labs(title = "Distribution of Physical Health Days",
subtitle = "Heavily right-skewed; most respondents report 0 unhealthy days",
x = "Number of Physically Unhealthy Days (Past 30)",
y = "Count") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
p2 <- ggplot(df, aes(y = PhysicalHealthDays)) +
geom_boxplot(fill = "steelblue", alpha = 0.7, outlier.colour = "firebrick",
outlier.alpha = 0.3, outlier.size = 0.8) +
labs(title = "Boxplot of Physical Health Days",
subtitle = "Median = 0; upper outliers are valid (max = 30)",
y = "Physical Health Days") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"),
axis.text.x = element_blank())
grid.arrange(p1, p2, ncol = 2)
The bar chart below confirms the severe class imbalance: approximately 94% of respondents did not have a heart attack, while only ~6% did. This imbalance was handled during preprocessing using SMOTE-NC on the training set.
df %>%
count(HadHeartAttack) %>%
mutate(pct = n / sum(n),
label = paste0(comma(n), "\n(", percent(pct, accuracy = 0.1), ")")) %>%
ggplot(aes(x = HadHeartAttack, y = n, fill = HadHeartAttack)) +
geom_col(width = 0.55, show.legend = FALSE) +
geom_text(aes(label = label), vjust = -0.3, size = 3.5, fontface = "bold") +
scale_fill_manual(values = c("No" = "forestgreen", "Yes" = "firebrick")) +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
labs(title = "Class Distribution: Had Heart Attack",
subtitle = "Severe imbalance — ~94% No vs ~6% Yes",
x = "Had Heart Attack", y = "Count") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
The sample is approximately balanced between female and male respondents, with females slightly outnumbering males.
df %>%
count(Sex) %>%
mutate(pct = percent(n / sum(n), accuracy = 0.1)) %>%
ggplot(aes(x = Sex, y = n, fill = Sex)) +
geom_col(width = 0.5, show.legend = FALSE) +
geom_text(aes(label = paste0(comma(n), "\n(", pct, ")")),
vjust = -0.3, size = 3.5, fontface = "bold") +
scale_fill_manual(values = c("Female" = "deeppink", "Male" = "navy")) +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
labs(title = "Respondents by Sex",
x = "Sex", y = "Count") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
The sample skews older, with respondents aged 60 and above making up the largest proportion. This is expected for a health survey focused on chronic conditions.
df %>%
count(AgeCategory) %>%
ggplot(aes(x = AgeCategory, y = n)) +
geom_col(fill = "slateblue", alpha = 0.85) +
geom_text(aes(label = comma(n)), vjust = -0.3, size = 3, fontface = "bold") +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
labs(title = "Distribution by Age Category",
subtitle = "Older age groups are more represented in the survey",
x = "Age Category", y = "Count") +
theme_minimal(base_size = 11) +
theme(axis.text.x = element_text(angle = 35, hjust = 1),
plot.title = element_text(face = "bold"))
White Non-Hispanic respondents represent the largest group by a wide margin, reflecting the broader demographic composition of BRFSS survey responses.
df %>%
count(RaceEthnicityCategory, sort = TRUE) %>%
ggplot(aes(x = reorder(RaceEthnicityCategory, n), y = n)) +
geom_col(fill = "cadetblue", alpha = 0.85) +
geom_text(aes(label = comma(n)), hjust = -0.1, size = 3) +
coord_flip() +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
labs(title = "Distribution by Race/Ethnicity",
x = NULL, y = "Count") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))
The six numeric variables are examined together using density plots and a correlation heatmap.
num_vars <- c("PhysicalHealthDays", "MentalHealthDays",
"SleepHours", "BMI", "HeightInMeters", "WeightInKilograms")
df %>%
select(all_of(num_vars)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value, fill = Variable)) +
geom_density(alpha = 0.75, colour = "white") +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Density Plots of Numerical Variables",
subtitle = "PhysicalHealthDays and MentalHealthDays are right-skewed; BMI and Weight are approximately normal",
x = NULL, y = "Density") +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"))
cor_mat <- df %>% select(all_of(num_vars)) %>% cor(use = "complete.obs")
cor_melt <- melt(cor_mat)
ggplot(cor_melt, aes(Var1, Var2, fill = value)) +
geom_tile(colour = "white") +
geom_text(aes(label = round(value, 2)), size = 3.2, fontface = "bold") +
scale_fill_gradient2(low = "navy", mid = "white", high = "darkred",
midpoint = 0, limits = c(-1, 1), name = "Correlation") +
labs(title = "Correlation Heatmap — Numerical Variables",
x = NULL, y = NULL) +
theme_minimal(base_size = 11) +
theme(axis.text.x = element_text(angle = 35, hjust = 1),
plot.title = element_text(face = "bold"))
Key observations:
BMI, WeightInKilograms, and
HeightInMeters are moderately correlated, which is expected
since BMI is derived from height and weight.PhysicalHealthDays and MentalHealthDays
show a moderate positive correlation, suggesting that physical and
mental health conditions co-occur.Most respondents rate their health as “Good” or “Very Good”. Only a minority report “Poor” health, though this group is important as it is closely linked to both target variables.
health_order <- c("Poor", "Fair", "Good", "Very good", "Excellent")
df %>%
count(GeneralHealth) %>%
mutate(GeneralHealth = factor(GeneralHealth, levels = health_order)) %>%
ggplot(aes(x = GeneralHealth, y = n, fill = GeneralHealth)) +
geom_col(show.legend = FALSE, alpha = 0.85) +
geom_text(aes(label = comma(n)), vjust = -0.3, size = 3.5, fontface = "bold") +
scale_fill_manual(values = c("Poor" = "red3",
"Fair" = "tomato",
"Good" = "gold",
"Very good" = "mediumseagreen",
"Excellent" = "dodgerblue")) +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
labs(title = "General Health Self-Rating",
subtitle = "Most respondents rate their health as Good or Very good",
x = "General Health", y = "Count") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
binary_vars <- c("PhysicalActivities", "AlcoholDrinkers", "HighRiskLastYear")
df %>%
select(all_of(binary_vars)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
count(Variable, Value) %>%
group_by(Variable) %>%
mutate(pct = n / sum(n)) %>%
ggplot(aes(x = Variable, y = pct, fill = Value)) +
geom_col(position = "fill", alpha = 0.85) +
geom_text(aes(label = percent(pct, accuracy = 1)),
position = position_fill(vjust = 0.5), size = 3.5, colour = "white",
fontface = "bold") +
scale_fill_manual(values = c("Yes" = "darkgreen", "No" = "grey75")) +
scale_y_continuous(labels = percent_format()) +
labs(title = "Lifestyle Behaviour Variables (Yes vs No)",
x = NULL, y = "Proportion", fill = NULL) +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
df %>%
count(SmokerStatus, sort = TRUE) %>%
ggplot(aes(x = reorder(SmokerStatus, n), y = n, fill = SmokerStatus)) +
geom_col(show.legend = FALSE, alpha = 0.85) +
geom_text(aes(label = comma(n)), hjust = -0.1, size = 3.5) +
coord_flip() +
scale_fill_brewer(palette = "Oranges", direction = -1) +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
labs(title = "Smoker Status Distribution", x = NULL, y = "Count") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
condition_cols <- c("HadAngina", "HadStroke", "HadAsthma", "HadCOPD",
"HadKidneyDisease", "HadArthritis", "HadDiabetes",
"HadDepressiveDisorder", "HadSkinCancer", "HadHeartAttack")
df %>%
select(all_of(condition_cols)) %>%
pivot_longer(everything(), names_to = "Condition", values_to = "Value") %>%
filter(Value == "Yes") %>%
count(Condition, sort = TRUE) %>%
mutate(pct = n / nrow(df)) %>%
ggplot(aes(x = reorder(Condition, pct), y = pct)) +
geom_col(fill = "red3", alpha = 0.82) +
geom_text(aes(label = percent(pct, accuracy = 0.1)), hjust = -0.1, size = 3.5) +
coord_flip() +
scale_y_continuous(labels = percent_format(),
expand = expansion(mult = c(0, 0.18))) +
labs(title = "Prevalence of Chronic Conditions",
subtitle = "Arthritis and Depression are most prevalent; Heart Attack affects ~6%",
x = NULL, y = "Prevalence (% of respondents)") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
df %>%
count(ChronicConditionCount) %>%
ggplot(aes(x = factor(ChronicConditionCount), y = n)) +
geom_col(fill = "purple", alpha = 0.82) +
geom_text(aes(label = comma(n)), vjust = -0.3, size = 3.5, fontface = "bold") +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
labs(title = "Distribution of Chronic Condition Count",
subtitle = "Most respondents have 0–2 chronic conditions",
x = "Number of Chronic Conditions", y = "Count") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
Respondents who rate their health as “Poor” report a markedly higher
median number of physically unhealthy days. This strong monotonic
pattern confirms that GeneralHealth will be an important
predictor in the regression model.
df %>%
mutate(GeneralHealth = factor(GeneralHealth, levels = health_order)) %>%
ggplot(aes(x = GeneralHealth, y = PhysicalHealthDays, fill = GeneralHealth)) +
geom_boxplot(alpha = 0.8, outlier.alpha = 0.1, outlier.size = 0.5,
show.legend = FALSE) +
scale_fill_manual(values = c("Poor" = "red3",
"Fair" = "tomato",
"Good" = "gold",
"Very good" = "mediumseagreen",
"Excellent" = "dodgerblue")) +
labs(title = "Physical Health Days by General Health Rating",
subtitle = "Clear monotonic increase: worse health → more unhealthy days",
x = "General Health", y = "Physical Health Days (Past 30)") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
Respondents with more chronic conditions report substantially more
physically unhealthy days, showing that
ChronicConditionCount captures cumulative health burden
effectively.
df %>%
mutate(ChronicConditionCount = as.numeric(as.character(ChronicConditionCount))) %>%
group_by(ChronicConditionCount) %>%
summarise(mean_phd = mean(PhysicalHealthDays)) %>%
ggplot(aes(x = factor(ChronicConditionCount), y = mean_phd)) +
geom_col(fill = "dodgerblue", alpha = 0.85) +
geom_text(aes(label = round(mean_phd, 1)), vjust = -0.4, size = 3.5, fontface = "bold") +
scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
labs(title = "Mean Physical Health Days by Chronic Condition Count",
subtitle = "Mean unhealthy days rises sharply with each additional condition",
x = "Number of Chronic Conditions", y = "Mean Physical Health Days") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
Older respondents tend to report slightly more physically unhealthy days on average, though the medians remain low across all groups due to the skewed distribution.
df %>%
group_by(AgeCategory) %>%
summarise(mean_phd = mean(PhysicalHealthDays)) %>%
ggplot(aes(x = AgeCategory, y = mean_phd, group = 1)) +
geom_line(colour = "darkorange", linewidth = 1.2) +
geom_point(colour = "darkorange", size = 3) +
geom_text(aes(label = round(mean_phd, 1)), vjust = -0.7, size = 3.2) +
scale_y_continuous(expand = expansion(mult = c(0.05, 0.15))) +
labs(title = "Mean Physical Health Days by Age Category",
subtitle = "Physically unhealthy days peak in middle age (55–64) and remain elevated in older groups",
x = "Age Category", y = "Mean Physical Health Days") +
theme_minimal(base_size = 11) +
theme(axis.text.x = element_text(angle = 35, hjust = 1),
plot.title = element_text(face = "bold"))
df %>%
filter(!is.na(BMI_category)) %>%
ggplot(aes(x = BMI_category, y = PhysicalHealthDays, fill = BMI_category)) +
geom_boxplot(alpha = 0.8, outlier.alpha = 0.1, outlier.size = 0.5,
show.legend = FALSE) +
scale_fill_manual(values = c("Underweight" = "lightyellow",
"Normal" = "lightgreen",
"Overweight" = "sandybrown",
"Obese" = "tomato")) +
labs(title = "Physical Health Days by BMI Category",
subtitle = "Obese respondents report higher physical unhealthy days",
x = "BMI Category", y = "Physical Health Days (Past 30)") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
The rate of heart attack increases steeply with age. Respondents aged 75 and above have the highest prevalence, which is consistent with cardiovascular disease epidemiology.
df %>%
group_by(AgeCategory) %>%
summarise(ha_rate = mean(HadHeartAttack == "Yes")) %>%
ggplot(aes(x = AgeCategory, y = ha_rate, group = 1)) +
geom_line(colour = "darkred", linewidth = 1.3) +
geom_point(colour = "darkred", size = 3.5) +
geom_text(aes(label = percent(ha_rate, accuracy = 0.1)),
vjust = -0.8, size = 3.2) +
scale_y_continuous(labels = percent_format(),
expand = expansion(mult = c(0.05, 0.18))) +
labs(title = "Heart Attack Rate by Age Category",
subtitle = "Risk increases sharply from age 50 onwards",
x = "Age Category", y = "Heart Attack Rate (%)") +
theme_minimal(base_size = 11) +
theme(axis.text.x = element_text(angle = 35, hjust = 1),
plot.title = element_text(face = "bold"))
Male respondents have a notably higher rate of heart attack compared to female respondents, consistent with established cardiovascular disease patterns.
df %>%
group_by(Sex) %>%
summarise(ha_rate = mean(HadHeartAttack == "Yes")) %>%
ggplot(aes(x = Sex, y = ha_rate, fill = Sex)) +
geom_col(width = 0.5, show.legend = FALSE, alpha = 0.85) +
geom_text(aes(label = percent(ha_rate, accuracy = 0.1)),
vjust = -0.5, size = 4, fontface = "bold") +
scale_fill_manual(values = c("Female" = "deeppink", "Male" = "navy")) +
scale_y_continuous(labels = percent_format(),
expand = expansion(mult = c(0, 0.15))) +
labs(title = "Heart Attack Rate by Sex",
subtitle = "Males have a higher heart attack prevalence",
x = "Sex", y = "Heart Attack Rate (%)") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
Respondents who self-rate their health as “Poor” are the most likely to have had a heart attack. Even “Excellent” health raters show a non-zero rate, suggesting self-reported health does not fully capture cardiovascular risk.
df %>%
mutate(GeneralHealth = factor(GeneralHealth, levels = health_order)) %>%
group_by(GeneralHealth) %>%
summarise(ha_rate = mean(HadHeartAttack == "Yes")) %>%
ggplot(aes(x = GeneralHealth, y = ha_rate, fill = GeneralHealth)) +
geom_col(show.legend = FALSE, alpha = 0.85) +
geom_text(aes(label = percent(ha_rate, accuracy = 0.1)),
vjust = -0.4, size = 3.8, fontface = "bold") +
scale_fill_manual(values = c("Poor" = "red3",
"Fair" = "tomato",
"Good" = "gold",
"Very good" = "mediumseagreen",
"Excellent" = "dodgerblue")) +
scale_y_continuous(labels = percent_format(),
expand = expansion(mult = c(0, 0.15))) +
labs(title = "Heart Attack Rate by General Health Rating",
subtitle = "Heart attack risk is highest among those rating their health as Poor",
x = "General Health", y = "Heart Attack Rate (%)") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
Having more chronic conditions is strongly associated with a higher heart attack rate. Respondents with 4 or more conditions have a heart attack rate more than 10 times that of those with none.
df %>%
group_by(ChronicConditionCount) %>%
summarise(ha_rate = mean(HadHeartAttack == "Yes"), .groups = "drop") %>%
ggplot(aes(x = factor(ChronicConditionCount), y = ha_rate)) +
geom_col(fill = "purple4", alpha = 0.85) +
geom_text(aes(label = percent(ha_rate, accuracy = 0.1)),
vjust = -0.4, size = 3.8, fontface = "bold") +
scale_y_continuous(labels = percent_format(),
expand = expansion(mult = c(0, 0.15))) +
labs(title = "Heart Attack Rate by Chronic Condition Count",
subtitle = "Sharp increase in risk as the number of conditions rises",
x = "Number of Chronic Conditions", y = "Heart Attack Rate (%)") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
Respondents who smoke daily or formerly smoked have a higher heart attack rate compared to those who have never smoked, reflecting the well-established link between tobacco use and cardiovascular disease.
df %>%
group_by(SmokerStatus) %>%
summarise(ha_rate = mean(HadHeartAttack == "Yes")) %>%
ggplot(aes(x = reorder(SmokerStatus, ha_rate), y = ha_rate)) +
geom_col(fill = "orangered4", alpha = 0.85) +
geom_text(aes(label = percent(ha_rate, accuracy = 0.1)),
hjust = -0.2, size = 3.8) +
coord_flip() +
scale_y_continuous(labels = percent_format(),
expand = expansion(mult = c(0, 0.18))) +
labs(title = "Heart Attack Rate by Smoker Status",
subtitle = "Daily smokers and former smokers carry higher cardiovascular risk",
x = NULL, y = "Heart Attack Rate (%)") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
df %>%
filter(!is.na(BMI_category)) %>%
group_by(BMI_category) %>%
summarise(ha_rate = mean(HadHeartAttack == "Yes")) %>%
ggplot(aes(x = BMI_category, y = ha_rate, fill = BMI_category)) +
geom_col(show.legend = FALSE, alpha = 0.85) +
geom_text(aes(label = percent(ha_rate, accuracy = 0.1)),
vjust = -0.4, size = 3.8, fontface = "bold") +
scale_fill_manual(values = c("Underweight" = "lightyellow",
"Normal" = "lightgreen",
"Overweight" = "sandybrown",
"Obese" = "tomato")) +
scale_y_continuous(labels = percent_format(),
expand = expansion(mult = c(0, 0.15))) +
labs(title = "Heart Attack Rate by BMI Category",
subtitle = "Overweight and obese respondents show elevated risk",
x = "BMI Category", y = "Heart Attack Rate (%)") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
The exploratory analysis reveals several important insights that will guide the modelling stage:
| Finding | Relevance |
|---|---|
PhysicalHealthDays is heavily right-skewed with many
zeros |
Supports use of non-linear or tree-based regression models |
HadHeartAttack has severe class imbalance (~94%
No) |
Confirmed the need for SMOTE-NC balancing (applied in preprocessing) |
GeneralHealth shows a strong monotonic relationship
with PhysicalHealthDays |
Likely a key predictor in the regression model |
ChronicConditionCount rises steeply with both
targets |
The engineered feature captures cumulative health burden effectively |
| Heart attack rate increases sharply after age 50 | AgeCategory is an important classification
predictor |
| Males and smokers show higher heart attack rates | Sex and SmokerStatus are relevant
classification features |
| BMI, weight, and height are correlated | Multicollinearity should be considered in linear regression |
The purpose of this regression analysis is to predict the number of physically unhealthy days (PhysicalHealthDays) experienced by U.S. adults using demographic characteristics, lifestyle behaviours, and chronic health conditions. The regression models are developed to:
1. Identify the key factors associated with poor physical health.
2. Examine the relationships between lifestyle behaviours, chronic diseases, and physically unhealthy days.
3. Compare multiple regression algorithms in predicting physical health outcomes.
#Set seed for consistency
set.seed(123)
#Partition 80% for training the regression model, 20% for testing
reg_index <- createDataPartition(df$PhysicalHealthDays, p = 0.8, list = FALSE)
reg_train <- df[reg_index, ]
reg_test <- df[-reg_index, ]
# Remove the classification target from the regression data
reg_train <- reg_train %>% select(-HadHeartAttack)
reg_test <- reg_test %>% select(-HadHeartAttack)
The dataset is splitted into 80% training set and 20% training set randomly. Training data is used to train the regression models. Testing data is used to evaluate model performance on unseen data. HadHeartAttack is removed because it is the target variable for classification.
To predict PhysicalHealthDays, three models were built to compare a traditional baseline against non-linear machine learning techniques: Multiple Linear Regression (MLR), Decision Tree, and XGBoost.
The analysis followed a three-step workflow:
Multicollinearity Check — Evaluated Variance Inflation Factors (VIF) for the MLR model to ensure highly correlated predictors did not distort the results.
Performance Evaluation — Compared model accuracy on the 20% test set using MAE (average error), RMSE (penalty for large errors), and R-squared (variance explained) to identify the best-performing algorithm.
Feature Importance — Extracted the top predictors from each model using regression coefficients for MLR and variable importance scores for tree-based models, to determine which factors most strongly drive poor physical health.
# Remove variables that caused multicollinearity / leakage
# Apply the SAME column filtering to both training and testing data
reg_remove_vars <- c(
"BMI_category",
"ChronicConditionCount",
"PoorHealth"
)
reg_train <- reg_train %>% select(-any_of(reg_remove_vars))
reg_test <- reg_test %>% select(-any_of(reg_remove_vars))
# -------------------------------
# Model 1: Multiple Linear Regression
# -------------------------------
lm_model <- lm(
PhysicalHealthDays ~ .,
data = reg_train
)
# Display regression summary
summary(lm_model)
##
## Call:
## lm(formula = PhysicalHealthDays ~ ., data = reg_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.588 -2.464 -1.010 0.377 31.278
##
## Coefficients:
## Estimate
## (Intercept) -0.474089
## SexMale -0.088749
## GeneralHealthFair 6.329129
## GeneralHealthGood 1.106633
## GeneralHealthPoor 15.765496
## GeneralHealthVery good 0.236445
## MentalHealthDays 0.138069
## LastCheckupTimeWithin past 2 years (1 year but less than 2 years ago) 0.183285
## LastCheckupTimeWithin past 5 years (2 years but less than 5 years ago) 0.062447
## LastCheckupTimeWithin past year (anytime less than 12 months ago) 0.410626
## PhysicalActivitiesYes -1.026868
## SleepHours -0.088518
## RemovedTeeth6 or more, but not all 0.142376
## RemovedTeethAll -0.034518
## RemovedTeethNone of them 0.030287
## HadAnginaYes 0.404943
## HadStrokeYes 0.258391
## HadAsthmaYes 0.421201
## HadSkinCancerYes 0.268762
## HadCOPDYes 0.676686
## HadDepressiveDisorderYes 0.139001
## HadKidneyDiseaseYes 0.558421
## HadArthritisYes 1.080825
## HadDiabetesYes -0.195221
## DeafOrHardOfHearingYes -0.069555
## BlindOrVisionDifficultyYes -0.044944
## DifficultyConcentratingYes 0.101607
## DifficultyWalkingYes 3.469760
## DifficultyDressingBathingYes 2.983413
## DifficultyErrandsYes 1.692900
## SmokerStatusCurrent smoker - now smokes some days 0.246066
## SmokerStatusFormer smoker 0.333373
## SmokerStatusNever smoked 0.374750
## ECigaretteUsageNot at all (right now) -0.020681
## ECigaretteUsageUse them every day -0.265751
## ECigaretteUsageUse them some days -0.113992
## ChestScanYes 0.285206
## RaceEthnicityCategoryHispanic 0.425116
## RaceEthnicityCategoryMultiracial, Non-Hispanic 0.846140
## RaceEthnicityCategoryOther race only, Non-Hispanic 0.458535
## RaceEthnicityCategoryWhite only, Non-Hispanic 0.821241
## AgeCategory.L -0.185830
## AgeCategory.Q -0.926816
## AgeCategory.C -0.194844
## AgeCategory^4 -0.087265
## AgeCategory^5 0.027721
## AgeCategory^6 -0.086562
## AgeCategory^7 -0.167156
## AgeCategory^8 -0.108057
## AgeCategory^9 -0.066952
## AgeCategory^10 0.028585
## AgeCategory^11 -0.029771
## AgeCategory^12 0.008210
## HeightInMeters 1.021743
## WeightInKilograms -0.005042
## BMI -0.020350
## AlcoholDrinkersYes -0.128509
## HIVTestingYes 0.149213
## FluVaxLast12Yes 0.093880
## PneumoVaxEverYes -0.082744
## TetanusLast10TdapYes, received Tdap -0.016540
## TetanusLast10TdapYes, received tetanus shot but not sure what type -0.025812
## TetanusLast10TdapYes, received tetanus shot, but not Tdap -0.019289
## HighRiskLastYearYes -0.109835
## CovidPosTested positive using home test without a health professional 0.875165
## CovidPosYes 0.423176
## Std. Error
## (Intercept) 0.320267
## SexMale 0.031608
## GeneralHealthFair 0.047021
## GeneralHealthGood 0.035782
## GeneralHealthPoor 0.071024
## GeneralHealthVery good 0.034109
## MentalHealthDays 0.001618
## LastCheckupTimeWithin past 2 years (1 year but less than 2 years ago) 0.065790
## LastCheckupTimeWithin past 5 years (2 years but less than 5 years ago) 0.072029
## LastCheckupTimeWithin past year (anytime less than 12 months ago) 0.057449
## PhysicalActivitiesYes 0.028625
## SleepHours 0.009339
## RemovedTeeth6 or more, but not all 0.041997
## RemovedTeethAll 0.054149
## RemovedTeethNone of them 0.026946
## HadAnginaYes 0.050802
## HadStrokeYes 0.057907
## HadAsthmaYes 0.032870
## HadSkinCancerYes 0.043187
## HadCOPDYes 0.046079
## HadDepressiveDisorderYes 0.032339
## HadKidneyDiseaseYes 0.056390
## HadArthritisYes 0.027581
## HadDiabetesYes 0.035823
## DeafOrHardOfHearingYes 0.042347
## BlindOrVisionDifficultyYes 0.053290
## DifficultyConcentratingYes 0.041262
## DifficultyWalkingYes 0.040060
## DifficultyDressingBathingYes 0.069627
## DifficultyErrandsYes 0.052784
## SmokerStatusCurrent smoker - now smokes some days 0.075273
## SmokerStatusFormer smoker 0.047406
## SmokerStatusNever smoked 0.046368
## ECigaretteUsageNot at all (right now) 0.032069
## ECigaretteUsageUse them every day 0.077929
## ECigaretteUsageUse them some days 0.072791
## ChestScanYes 0.025719
## RaceEthnicityCategoryHispanic 0.055008
## RaceEthnicityCategoryMultiracial, Non-Hispanic 0.086418
## RaceEthnicityCategoryOther race only, Non-Hispanic 0.064894
## RaceEthnicityCategoryWhite only, Non-Hispanic 0.043529
## AgeCategory.L 0.062049
## AgeCategory.Q 0.046891
## AgeCategory.C 0.041643
## AgeCategory^4 0.042533
## AgeCategory^5 0.042946
## AgeCategory^6 0.042654
## AgeCategory^7 0.042534
## AgeCategory^8 0.041567
## AgeCategory^9 0.040783
## AgeCategory^10 0.039613
## AgeCategory^11 0.039875
## AgeCategory^12 0.040727
## HeightInMeters 0.180940
## WeightInKilograms 0.001351
## BMI 0.004304
## AlcoholDrinkersYes 0.024112
## HIVTestingYes 0.027006
## FluVaxLast12Yes 0.024701
## PneumoVaxEverYes 0.028815
## TetanusLast10TdapYes, received Tdap 0.030564
## TetanusLast10TdapYes, received tetanus shot but not sure what type 0.028628
## TetanusLast10TdapYes, received tetanus shot, but not Tdap 0.048517
## HighRiskLastYearYes 0.060364
## CovidPosTested positive using home test without a health professional 0.065878
## CovidPosYes 0.026883
## t value
## (Intercept) -1.480
## SexMale -2.808
## GeneralHealthFair 134.601
## GeneralHealthGood 30.927
## GeneralHealthPoor 221.973
## GeneralHealthVery good 6.932
## MentalHealthDays 85.335
## LastCheckupTimeWithin past 2 years (1 year but less than 2 years ago) 2.786
## LastCheckupTimeWithin past 5 years (2 years but less than 5 years ago) 0.867
## LastCheckupTimeWithin past year (anytime less than 12 months ago) 7.148
## PhysicalActivitiesYes -35.874
## SleepHours -9.478
## RemovedTeeth6 or more, but not all 3.390
## RemovedTeethAll -0.637
## RemovedTeethNone of them 1.124
## HadAnginaYes 7.971
## HadStrokeYes 4.462
## HadAsthmaYes 12.814
## HadSkinCancerYes 6.223
## HadCOPDYes 14.685
## HadDepressiveDisorderYes 4.298
## HadKidneyDiseaseYes 9.903
## HadArthritisYes 39.188
## HadDiabetesYes -5.450
## DeafOrHardOfHearingYes -1.642
## BlindOrVisionDifficultyYes -0.843
## DifficultyConcentratingYes 2.463
## DifficultyWalkingYes 86.614
## DifficultyDressingBathingYes 42.848
## DifficultyErrandsYes 32.072
## SmokerStatusCurrent smoker - now smokes some days 3.269
## SmokerStatusFormer smoker 7.032
## SmokerStatusNever smoked 8.082
## ECigaretteUsageNot at all (right now) -0.645
## ECigaretteUsageUse them every day -3.410
## ECigaretteUsageUse them some days -1.566
## ChestScanYes 11.089
## RaceEthnicityCategoryHispanic 7.728
## RaceEthnicityCategoryMultiracial, Non-Hispanic 9.791
## RaceEthnicityCategoryOther race only, Non-Hispanic 7.066
## RaceEthnicityCategoryWhite only, Non-Hispanic 18.867
## AgeCategory.L -2.995
## AgeCategory.Q -19.766
## AgeCategory.C -4.679
## AgeCategory^4 -2.052
## AgeCategory^5 0.645
## AgeCategory^6 -2.029
## AgeCategory^7 -3.930
## AgeCategory^8 -2.600
## AgeCategory^9 -1.642
## AgeCategory^10 0.722
## AgeCategory^11 -0.747
## AgeCategory^12 0.202
## HeightInMeters 5.647
## WeightInKilograms -3.732
## BMI -4.728
## AlcoholDrinkersYes -5.330
## HIVTestingYes 5.525
## FluVaxLast12Yes 3.801
## PneumoVaxEverYes -2.872
## TetanusLast10TdapYes, received Tdap -0.541
## TetanusLast10TdapYes, received tetanus shot but not sure what type -0.902
## TetanusLast10TdapYes, received tetanus shot, but not Tdap -0.398
## HighRiskLastYearYes -1.820
## CovidPosTested positive using home test without a health professional 13.285
## CovidPosYes 15.741
## Pr(>|t|)
## (Intercept) 0.138797
## SexMale 0.004988
## GeneralHealthFair < 2e-16
## GeneralHealthGood < 2e-16
## GeneralHealthPoor < 2e-16
## GeneralHealthVery good 4.15e-12
## MentalHealthDays < 2e-16
## LastCheckupTimeWithin past 2 years (1 year but less than 2 years ago) 0.005338
## LastCheckupTimeWithin past 5 years (2 years but less than 5 years ago) 0.385962
## LastCheckupTimeWithin past year (anytime less than 12 months ago) 8.84e-13
## PhysicalActivitiesYes < 2e-16
## SleepHours < 2e-16
## RemovedTeeth6 or more, but not all 0.000699
## RemovedTeethAll 0.523818
## RemovedTeethNone of them 0.261009
## HadAnginaYes 1.58e-15
## HadStrokeYes 8.12e-06
## HadAsthmaYes < 2e-16
## HadSkinCancerYes 4.88e-10
## HadCOPDYes < 2e-16
## HadDepressiveDisorderYes 1.72e-05
## HadKidneyDiseaseYes < 2e-16
## HadArthritisYes < 2e-16
## HadDiabetesYes 5.05e-08
## DeafOrHardOfHearingYes 0.100489
## BlindOrVisionDifficultyYes 0.399012
## DifficultyConcentratingYes 0.013798
## DifficultyWalkingYes < 2e-16
## DifficultyDressingBathingYes < 2e-16
## DifficultyErrandsYes < 2e-16
## SmokerStatusCurrent smoker - now smokes some days 0.001079
## SmokerStatusFormer smoker 2.04e-12
## SmokerStatusNever smoked 6.39e-16
## ECigaretteUsageNot at all (right now) 0.518999
## ECigaretteUsageUse them every day 0.000649
## ECigaretteUsageUse them some days 0.117346
## ChestScanYes < 2e-16
## RaceEthnicityCategoryHispanic 1.09e-14
## RaceEthnicityCategoryMultiracial, Non-Hispanic < 2e-16
## RaceEthnicityCategoryOther race only, Non-Hispanic 1.60e-12
## RaceEthnicityCategoryWhite only, Non-Hispanic < 2e-16
## AgeCategory.L 0.002745
## AgeCategory.Q < 2e-16
## AgeCategory.C 2.88e-06
## AgeCategory^4 0.040197
## AgeCategory^5 0.518610
## AgeCategory^6 0.042418
## AgeCategory^7 8.50e-05
## AgeCategory^8 0.009334
## AgeCategory^9 0.100658
## AgeCategory^10 0.470528
## AgeCategory^11 0.455309
## AgeCategory^12 0.840232
## HeightInMeters 1.64e-08
## WeightInKilograms 0.000190
## BMI 2.27e-06
## AlcoholDrinkersYes 9.84e-08
## HIVTestingYes 3.29e-08
## FluVaxLast12Yes 0.000144
## PneumoVaxEverYes 0.004084
## TetanusLast10TdapYes, received Tdap 0.588389
## TetanusLast10TdapYes, received tetanus shot but not sure what type 0.367259
## TetanusLast10TdapYes, received tetanus shot, but not Tdap 0.690946
## HighRiskLastYearYes 0.068832
## CovidPosTested positive using home test without a health professional < 2e-16
## CovidPosYes < 2e-16
##
## (Intercept)
## SexMale **
## GeneralHealthFair ***
## GeneralHealthGood ***
## GeneralHealthPoor ***
## GeneralHealthVery good ***
## MentalHealthDays ***
## LastCheckupTimeWithin past 2 years (1 year but less than 2 years ago) **
## LastCheckupTimeWithin past 5 years (2 years but less than 5 years ago)
## LastCheckupTimeWithin past year (anytime less than 12 months ago) ***
## PhysicalActivitiesYes ***
## SleepHours ***
## RemovedTeeth6 or more, but not all ***
## RemovedTeethAll
## RemovedTeethNone of them
## HadAnginaYes ***
## HadStrokeYes ***
## HadAsthmaYes ***
## HadSkinCancerYes ***
## HadCOPDYes ***
## HadDepressiveDisorderYes ***
## HadKidneyDiseaseYes ***
## HadArthritisYes ***
## HadDiabetesYes ***
## DeafOrHardOfHearingYes
## BlindOrVisionDifficultyYes
## DifficultyConcentratingYes *
## DifficultyWalkingYes ***
## DifficultyDressingBathingYes ***
## DifficultyErrandsYes ***
## SmokerStatusCurrent smoker - now smokes some days **
## SmokerStatusFormer smoker ***
## SmokerStatusNever smoked ***
## ECigaretteUsageNot at all (right now)
## ECigaretteUsageUse them every day ***
## ECigaretteUsageUse them some days
## ChestScanYes ***
## RaceEthnicityCategoryHispanic ***
## RaceEthnicityCategoryMultiracial, Non-Hispanic ***
## RaceEthnicityCategoryOther race only, Non-Hispanic ***
## RaceEthnicityCategoryWhite only, Non-Hispanic ***
## AgeCategory.L **
## AgeCategory.Q ***
## AgeCategory.C ***
## AgeCategory^4 *
## AgeCategory^5
## AgeCategory^6 *
## AgeCategory^7 ***
## AgeCategory^8 **
## AgeCategory^9
## AgeCategory^10
## AgeCategory^11
## AgeCategory^12
## HeightInMeters ***
## WeightInKilograms ***
## BMI ***
## AlcoholDrinkersYes ***
## HIVTestingYes ***
## FluVaxLast12Yes ***
## PneumoVaxEverYes **
## TetanusLast10TdapYes, received Tdap
## TetanusLast10TdapYes, received tetanus shot but not sure what type
## TetanusLast10TdapYes, received tetanus shot, but not Tdap
## HighRiskLastYearYes .
## CovidPosTested positive using home test without a health professional ***
## CovidPosYes ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.579 on 345111 degrees of freedom
## Multiple R-squared: 0.4226, Adjusted R-squared: 0.4225
## F-statistic: 3886 on 65 and 345111 DF, p-value: < 2.2e-16
Multiple Linear Regression examines the linear relationship between predictor variables and the target variable (“PhysicalHealthDays”). This model helps to identify significant predictors, measure the direction of relationships, estimate how strongly each variable affects physical health days.
Prior to running the Multiple Linear Regression model, “BMI_category”, “ChronicConditionCount”, and “PoorHealth” were removed from the dataset. These variables created perfect multicollinearity, causing model instability and mathematical errors. After removing them, the model was successfully fitted, and a post-estimation Variance Inflation Factor (VIF) check was conducted to confirm that no remaining multicollinearity issues exist.
# Check multicollinearity using VIF
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif_values <- vif(lm_model)
# Convert VIF output into dataframe properly
vif_df <- data.frame(vif_values)
# Add variable names
vif_df$Variable <- rownames(vif_df)
# Reorder columns
vif_df <- vif_df %>%
select(Variable, everything())
# Display table
kable(
vif_df,
caption = "Variance Inflation Factor (VIF)"
)
| Variable | GVIF | Df | GVIF..1..2.Df.. | |
|---|---|---|---|---|
| Sex | Sex | 1.984494 | 1 | 1.408721 |
| GeneralHealth | GeneralHealth | 1.749553 | 4 | 1.072422 |
| MentalHealthDays | MentalHealthDays | 1.439293 | 1 | 1.199706 |
| LastCheckupTime | LastCheckupTime | 1.150064 | 3 | 1.023576 |
| PhysicalActivities | PhysicalActivities | 1.173882 | 1 | 1.083458 |
| SleepHours | SleepHours | 1.077918 | 1 | 1.038228 |
| RemovedTeeth | RemovedTeeth | 1.388600 | 3 | 1.056241 |
| HadAngina | HadAngina | 1.132570 | 1 | 1.064223 |
| HadStroke | HadStroke | 1.079322 | 1 | 1.038904 |
| HadAsthma | HadAsthma | 1.094944 | 1 | 1.046396 |
| HadSkinCancer | HadSkinCancer | 1.092327 | 1 | 1.045145 |
| HadCOPD | HadCOPD | 1.225845 | 1 | 1.107179 |
| HadDepressiveDisorder | HadDepressiveDisorder | 1.358346 | 1 | 1.165481 |
| HadKidneyDisease | HadKidneyDisease | 1.083800 | 1 | 1.041057 |
| HadArthritis | HadArthritis | 1.355438 | 1 | 1.164233 |
| HadDiabetes | HadDiabetes | 1.195933 | 1 | 1.093587 |
| DeafOrHardOfHearing | DeafOrHardOfHearing | 1.119321 | 1 | 1.057980 |
| BlindOrVisionDifficulty | BlindOrVisionDifficulty | 1.103305 | 1 | 1.050383 |
| DifficultyConcentrating | DifficultyConcentrating | 1.330369 | 1 | 1.153416 |
| DifficultyWalking | DifficultyWalking | 1.615998 | 1 | 1.271219 |
| DifficultyDressingBathing | DifficultyDressingBathing | 1.355466 | 1 | 1.164245 |
| DifficultyErrands | DifficultyErrands | 1.450554 | 1 | 1.204389 |
| SmokerStatus | SmokerStatus | 1.422921 | 3 | 1.060547 |
| ECigaretteUsage | ECigaretteUsage | 1.296604 | 3 | 1.044242 |
| ChestScan | ChestScan | 1.230834 | 1 | 1.109430 |
| RaceEthnicityCategory | RaceEthnicityCategory | 1.260622 | 4 | 1.029374 |
| AgeCategory | AgeCategory | 2.785987 | 12 | 1.043616 |
| HeightInMeters | HeightInMeters | 2.659517 | 1 | 1.630803 |
| WeightInKilograms | WeightInKilograms | 4.132705 | 1 | 2.032905 |
| BMI | BMI | 3.248924 | 1 | 1.802477 |
| AlcoholDrinkers | AlcoholDrinkers | 1.128150 | 1 | 1.062144 |
| HIVTesting | HIVTesting | 1.201230 | 1 | 1.096006 |
| FluVaxLast12 | FluVaxLast12 | 1.188634 | 1 | 1.090245 |
| PneumoVaxEver | PneumoVaxEver | 1.489767 | 1 | 1.220560 |
| TetanusLast10Tdap | TetanusLast10Tdap | 1.182102 | 3 | 1.028275 |
| HighRiskLastYear | HighRiskLastYear | 1.085673 | 1 | 1.041956 |
| CovidPos | CovidPos | 1.077644 | 2 | 1.018870 |
Variance Inflation Factor (VIF) is used to detect multicollinearity among predictor variables. If VIF < 5 → acceptable VIF > 10 → serious multicollinearity (need to be removed)
Since VIF of all variables are < 5, then no more variables need to be removed.
# -------------------------------
# Model 2: Decision Tree Regression
# -------------------------------
library(rpart)
tree_reg_model <- rpart(
PhysicalHealthDays ~ .,
data = reg_train,
method = "anova"
)
Decision Tree Regression aim to capture non-linear relationships between variables. It is easy to interpret, can handles interactions automatically, and able to handle complex health relationships.
# -------------------------------
# Model 3: XGBoost Regression
# -------------------------------
library(xgboost)
# Convert categorical variables into dummy variables
# IMPORTANT: train and test matrices must have the same columns
xgb_formula <- PhysicalHealthDays ~ .
train_matrix <- model.matrix(xgb_formula, data = reg_train)[, -1]
test_matrix <- model.matrix(xgb_formula, data = reg_test)[, -1]
# Align test matrix columns to the training matrix to avoid XGBoost column mismatch
missing_cols <- setdiff(colnames(train_matrix), colnames(test_matrix))
for (col in missing_cols) {
test_matrix <- cbind(test_matrix, setNames(data.frame(0), col))
}
extra_cols <- setdiff(colnames(test_matrix), colnames(train_matrix))
if (length(extra_cols) > 0) {
test_matrix <- test_matrix[, !colnames(test_matrix) %in% extra_cols, drop = FALSE]
}
test_matrix <- test_matrix[, colnames(train_matrix), drop = FALSE]
# Target variable
train_label <- reg_train$PhysicalHealthDays
# Train XGBoost model
set.seed(123)
xgb_reg_model <- xgboost(
x = train_matrix,
y = train_label,
nrounds = 100,
max_depth = 6,
learning_rate = 0.1,
objective = "reg:squarederror"
)
# Prediction
xgb_pred <- predict(
xgb_reg_model,
test_matrix
)
XGBoost (Extreme Gradient Boosting) is an advanced machine learning algorithm that improves prediction performance by combining multiple decision trees. It has high predictive accuracy, and it is able to handles non-linear relationships. XGBoost can captures complex interactions among health indicators too.
library(xgboost)
library(dplyr)
library(knitr)
# 1. Generate predictions for Multiple Linear Regression and Decision Tree
lm_preds <- predict(lm_model, newdata = reg_test)
tree_preds <- predict(tree_reg_model, newdata = reg_test)
# 2. Generate XGBoost predictions using the aligned test matrix created in Section 5.3
xgboost_preds <- predict(xgb_reg_model, newdata = test_matrix)
# 3. Combine the actual values and all predictions into one dataframe
prediction_comparison <- data.frame(
Record_ID = 1:nrow(reg_test),
Actual_Days = reg_test$PhysicalHealthDays,
MLR_Predicted = round(lm_preds, 2),
DecTree_Predicted = round(tree_preds, 2),
XGBoost_Predicted = round(xgboost_preds, 2)
)
# 4. Display the first 10 rows of the comparison table neatly
kable(
head(prediction_comparison, 10),
caption = "Comparison of Actual vs. Predicted Physical Health Days Across Models"
)
| Record_ID | Actual_Days | MLR_Predicted | DecTree_Predicted | XGBoost_Predicted | |
|---|---|---|---|---|---|
| 6 | 1 | 1 | 16.39 | 23.61 | 20.54 |
| 34 | 2 | 0 | 1.38 | 1.86 | 1.03 |
| 40 | 3 | 0 | 11.37 | 15.11 | 10.47 |
| 42 | 4 | 5 | 6.40 | 6.78 | 5.13 |
| 45 | 5 | 5 | 2.52 | 1.86 | 2.07 |
| 52 | 6 | 0 | 0.81 | 1.86 | 1.07 |
| 56 | 7 | 0 | 12.56 | 15.11 | 12.48 |
| 58 | 8 | 0 | 5.35 | 6.78 | 4.92 |
| 61 | 9 | 0 | 2.67 | 1.86 | 2.86 |
| 73 | 10 | 2 | 2.11 | 1.86 | 1.79 |
The table above showed the actual “PhysicalHealthDays” predicted by 3 models (Multiple Linear Regression, Decision Tree and XGBoost)
To determine the most competent model for predicting “PhysicalHealthDays”, the predictive performance of the Multiple Linear Regression, Decision Tree, and XGBoost models was evaluated and compared on the unseen test dataset (the 20% test dataset).
Three standard statistical metrics were utilized to evaluate the performance of each models:
# Generate predictions
lm_pred <- predict(lm_model, newdata = reg_test)
tree_pred <- predict(
tree_reg_model,
newdata = reg_test
)
xgb_pred <- predict(
xgb_reg_model,
test_matrix
)
# Function to evaluate regression models
eval_regression <- function(actual, predicted, model_name) {
# Root Mean Squared Error
rmse <- sqrt(mean((actual - predicted)^2))
# Mean Absolute Error
mae <- mean(abs(actual - predicted))
# R-squared
r2 <- cor(actual, predicted)^2
data.frame(
Model = model_name,
RMSE = round(rmse, 4),
MAE = round(mae, 4),
R_Squared = round(r2, 4)
)
}
# Compare all models
regression_perf <- bind_rows(
eval_regression(
reg_test$PhysicalHealthDays,
lm_pred,
"Linear Regression"
),
eval_regression(
reg_test$PhysicalHealthDays,
tree_pred,
"Decision Tree Regression"
),
eval_regression(
reg_test$PhysicalHealthDays,
xgb_pred,
"XGBoost Regression"
)
)
# Display results
kable(
regression_perf,
caption = "Regression Model Performance Comparison"
)
| Model | RMSE | MAE | R_Squared |
|---|---|---|---|
| Linear Regression | 6.6201 | 4.0077 | 0.4185 |
| Decision Tree Regression | 6.8758 | 4.2679 | 0.3727 |
| XGBoost Regression | 6.5410 | 3.9067 | 0.4324 |
Best Performing Algorithm: XGBoost
This section identifies and interprets the top predictors influencing “PhysicalHealthDays” across all three models. Cross-referencing the outputs from three models allows us to isolate the most consistent demographic, lifestyle, and clinical risk factors driving poor physical health.
Key factors were identified using the absolute values of the regression coefficients and their associated p-values. This allows us to quantify the direct linear impact of each variable on PhysicalHealthDays.
summary_stats <- summary(lm_model)$coefficients
summary_df <- data.frame(
Variable = rownames(summary_stats),
Estimate = summary_stats[, "Estimate"],
P_Value = summary_stats[, "Pr(>|t|)"]
) %>%
filter(Variable != "(Intercept)") %>%
arrange(desc(abs(Estimate)))
# Display top influential factors
kable(
head(summary_df, 10),
caption = "Top 10 Factors Influencing PhysicalHealthDays"
)
| Variable | Estimate | P_Value | |
|---|---|---|---|
| GeneralHealthPoor | GeneralHealthPoor | 15.7654955 | 0 |
| GeneralHealthFair | GeneralHealthFair | 6.3291289 | 0 |
| DifficultyWalkingYes | DifficultyWalkingYes | 3.4697602 | 0 |
| DifficultyDressingBathingYes | DifficultyDressingBathingYes | 2.9834132 | 0 |
| DifficultyErrandsYes | DifficultyErrandsYes | 1.6928997 | 0 |
| GeneralHealthGood | GeneralHealthGood | 1.1066332 | 0 |
| HadArthritisYes | HadArthritisYes | 1.0808248 | 0 |
| PhysicalActivitiesYes | PhysicalActivitiesYes | -1.0268675 | 0 |
| HeightInMeters | HeightInMeters | 1.0217432 | 0 |
| AgeCategory.Q | AgeCategory.Q | -0.9268163 | 0 |
Predictor importance was determined by evaluating Variable Importance scores, which measure how frequently and effectively a specific variable was chosen to split the data into distinct, homogenous health groups.
tree_imp <- data.frame(
Variable = names(tree_reg_model$variable.importance),
Importance = as.numeric(
tree_reg_model$variable.importance
)
) %>%
arrange(desc(Importance))
kable(
head(tree_imp, 10),
caption = "Top Predictors from Decision Tree Regression"
)
| Variable | Importance |
|---|---|
| GeneralHealth | 8701301.1220 |
| DifficultyWalking | 1423734.5811 |
| DifficultyDressingBathing | 712883.1810 |
| DifficultyErrands | 641383.3313 |
| HadArthritis | 43026.6735 |
| BlindOrVisionDifficulty | 25816.0041 |
| HadCOPD | 24089.2758 |
| BMI | 241.1825 |
The top predictors were extracted using Feature Importance (Gain) metrics. Gain calculates the fractional contribution of each feature to the model’s overall predictive power, highlighting the variables that provide the greatest reduction in prediction error.
xgb_imp <- xgb.importance(
model = xgb_reg_model
)
# Display top predictors
kable(
head(xgb_imp, 10),
caption = "Top Predictors from XGBoost Regression"
)
| Feature | Gain | Cover | Frequency |
|---|---|---|---|
| GeneralHealthPoor | 0.4854848 | 0.0620056 | 0.0199646 |
| GeneralHealthFair | 0.2467141 | 0.0642205 | 0.0328450 |
| DifficultyWalkingYes | 0.0911368 | 0.0483557 | 0.0275318 |
| MentalHealthDays | 0.0601061 | 0.0827465 | 0.0834004 |
| HadArthritisYes | 0.0148580 | 0.0294214 | 0.0272098 |
| DifficultyDressingBathingYes | 0.0128497 | 0.0347810 | 0.0257607 |
| DifficultyErrandsYes | 0.0121763 | 0.0288004 | 0.0222186 |
| GeneralHealthGood | 0.0098582 | 0.0302832 | 0.0133634 |
| PhysicalActivitiesYes | 0.0077017 | 0.0342225 | 0.0307519 |
| AgeCategory.L | 0.0056853 | 0.0110243 | 0.0305909 |
Objective 1: Identify the key factors associated with poor physical health.
Across all three models, the following key drivers consistently
dominated the prediction of PhysicalHealthDays:
GeneralHealth)
— Subjective self-reported poor or fair health is overwhelmingly the
strongest predictor of physically unhealthy days across every
algorithm.DifficultyWalking) — Having physical trouble walking or
climbing stairs consistently ranks as the second most critical
factor.DifficultyDressingBathing and
DifficultyErrands) — Difficulty taking care of oneself or
handling errands outside the home strongly increases the predicted
number of unhealthy days.HadArthritis) —
Arthritis consistently appears near the top of the clinical indicators
across all models.Objective 2: Examine the relationships between lifestyle behaviours, chronic diseases, and physically unhealthy days.
Multiple Linear Regression:
PhysicalActivitiesYes)
reduces unhealthy days by approximately −1.03 days, serving as a
protective lifestyle factor.Decision Tree:
GeneralHealth has an importance score of 8.7 million,
far surpassing all other predictors. This indicates that self-perceived
health status is the primary splitting criterion in the tree
structure.XGBoost:
GeneralHealthPoor and GeneralHealthFair
together account for nearly 73% of the total predictive Gain (0.485 +
0.246).MentalHealthDays ranks 4th with a 6% Gain contribution,
higher than in the other models, suggesting that mental health interacts
non-linearly with physical health outcomes.Objective 3: Compare multiple regression algorithms in predicting physical health outcomes.
PoorHealth,
ChronicConditionCount, and BMI_category to
resolve severe multicollinearity.All three regression objectives have been achieved.
The classification task focuses on predicting whether a respondent
has experienced a heart attack. The target variable is
HadHeartAttack, which contains two classes:
Yes and No. This is a binary classification
problem because each respondent belongs to only one of these two outcome
groups.
The objective of this section is to build classification models using the cleaned and balanced training data, then evaluate how well the models identify heart attack cases on the untouched test data. The EDA section shows that heart attack occurrence is related to variables such as age, sex, general health, chronic condition count, smoker status, and BMI category, so these predictors are expected to contribute useful information during modelling.
PhysicalHealthDays is excluded from the classification
predictors because it is the target variable for the regression
task.
The classification models use train_balanced for
training and test_data for evaluation. The training set was
balanced during preprocessing using SMOTE-NC, while the test set remains
unchanged so that the final evaluation reflects the real distribution of
heart attack cases. To keep the classification task separate from the
regression task, PhysicalHealthDays is removed from the
classification modelling data before training the models.
class_train <- train_balanced
class_test <- test_data
# Exclude the regression target from classification predictors.
class_train <- class_train %>% select(-PhysicalHealthDays)
class_test <- class_test %>% select(-PhysicalHealthDays)
# Set the reference level clearly so "Yes" can be treated as the positive class.
class_train$HadHeartAttack <- relevel(as.factor(class_train$HadHeartAttack), ref = "No")
class_test$HadHeartAttack <- factor(class_test$HadHeartAttack,
levels = levels(class_train$HadHeartAttack))
# Treat age groups as regular categories for modelling.
class_train$AgeCategory <- factor(class_train$AgeCategory, ordered = FALSE)
class_test$AgeCategory <- factor(class_test$AgeCategory,
levels = levels(class_train$AgeCategory),
ordered = FALSE)
table(class_train$HadHeartAttack)
##
## No Yes
## 325860 325860
table(class_test$HadHeartAttack)
##
## No Yes
## 81464 4829
The classification data consists of a balanced training set with 325860 No and 325860 Yes cases, while the test set keeps the original distribution with 81464 No and 4829 Yes cases.
The models are evaluated using a confusion matrix and several performance metrics. Accuracy shows the overall proportion of correct predictions, but it is not enough for this dataset because most respondents did not have a heart attack. Therefore, sensitivity, specificity, precision and F1-score are also used.
extract_metrics <- function(conf_matrix, model_name) {
data.frame(
Model = model_name,
Accuracy = round(as.numeric(conf_matrix$overall["Accuracy"]), 4),
Sensitivity = round(as.numeric(conf_matrix$byClass["Sensitivity"]), 4),
Specificity = round(as.numeric(conf_matrix$byClass["Specificity"]), 4),
Precision = round(as.numeric(conf_matrix$byClass["Pos Pred Value"]), 4),
F1_Score = round(as.numeric(conf_matrix$byClass["F1"]), 4)
)
}
plot_confusion_matrix <- function(conf_matrix, title_text) {
as.data.frame(conf_matrix$table) %>%
ggplot(aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 4) +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = title_text,
x = "Actual Class",
y = "Predicted Class",
fill = "Count") +
theme_minimal()
}
Logistic regression is used as the first classification model because
it is suitable for binary outcomes and provides an interpretable
baseline. The model estimates the probability that each respondent
belongs to the Yes class for
HadHeartAttack.
log_model <- glm(HadHeartAttack ~ .,
data = class_train,
family = binomial)
log_prob <- predict(log_model, newdata = class_test, type = "response")
log_pred <- ifelse(log_prob >= 0.5, "Yes", "No")
log_pred <- factor(log_pred, levels = levels(class_test$HadHeartAttack))
log_cm <- confusionMatrix(log_pred,
class_test$HadHeartAttack,
positive = "Yes")
log_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 69208 1571
## Yes 12256 3258
##
## Accuracy : 0.8398
## 95% CI : (0.8373, 0.8422)
## No Information Rate : 0.944
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2569
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.67467
## Specificity : 0.84955
## Pos Pred Value : 0.21000
## Neg Pred Value : 0.97780
## Prevalence : 0.05596
## Detection Rate : 0.03776
## Detection Prevalence : 0.17978
## Balanced Accuracy : 0.76211
##
## 'Positive' Class : Yes
##
plot_confusion_matrix(log_cm, "Confusion Matrix: Logistic Regression")
The most influential variables in the logistic regression model are identified using the absolute z-value of the model coefficients. Larger absolute z-values indicate stronger evidence that the predictor is associated with the target variable.
log_coef <- as.data.frame(summary(log_model)$coefficients)
log_coef$Variable <- rownames(log_coef)
log_importance <- log_coef %>%
filter(Variable != "(Intercept)") %>%
mutate(Abs_Z_Value = abs(`z value`)) %>%
arrange(desc(Abs_Z_Value)) %>%
select(Variable, Estimate, `z value`, `Pr(>|z|)`, Abs_Z_Value) %>%
head(10)
kable(log_importance,
caption = "Top 10 Logistic Regression Predictors Based on Absolute Z-value")
| Variable | Estimate | z value | Pr(>|z|) | Abs_Z_Value | |
|---|---|---|---|---|---|
| ChronicConditionCount | ChronicConditionCount | 1.2588716 | 134.27160 | 0 | 134.27160 |
| HadAnginaYes | HadAnginaYes | 1.3567008 | 99.01654 | 0 | 99.01654 |
| HadAsthmaYes | HadAsthmaYes | -1.4202208 | -96.85512 | 0 | 96.85512 |
| HadKidneyDiseaseYes | HadKidneyDiseaseYes | -1.6216314 | -89.21245 | 0 | 89.21245 |
| ChestScanYes | ChestScanYes | 0.7318143 | 89.04639 | 0 | 89.04639 |
| HadArthritisYes | HadArthritisYes | -1.0480605 | -86.66914 | 0 | 86.66914 |
| HadDepressiveDisorderYes | HadDepressiveDisorderYes | -1.2131575 | -84.48736 | 0 | 84.48736 |
| SexMale | SexMale | 0.8520101 | 79.97112 | 0 | 79.97112 |
| GeneralHealthGood | GeneralHealthGood | 1.2052778 | 74.86627 | 0 | 74.86627 |
| RaceEthnicityCategoryWhite only, Non-Hispanic | RaceEthnicityCategoryWhite only, Non-Hispanic | 1.3212142 | 74.75610 | 0 | 74.75610 |
A decision tree is used as the second classification model. This
model separates respondents into groups based on decision rules, making
it easier to interpret how different factors contribute to the
prediction. Like the logistic regression model, it excludes
PhysicalHealthDays to avoid overlapping with the regression
task.
# Import library:
library(rpart)
library(rpart.plot)
tree_model <- rpart(HadHeartAttack ~ .,
data = class_train,
method = "class",
control = rpart.control(cp = 0.001,
minsplit = 500,
maxdepth = 5))
tree_pred <- predict(tree_model, newdata = class_test, type = "class")
tree_pred <- factor(tree_pred, levels = levels(class_test$HadHeartAttack))
tree_cm <- confusionMatrix(tree_pred,
class_test$HadHeartAttack,
positive = "Yes")
tree_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 64123 1345
## Yes 17341 3484
##
## Accuracy : 0.7835
## 95% CI : (0.7807, 0.7862)
## No Information Rate : 0.944
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1988
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.72147
## Specificity : 0.78713
## Pos Pred Value : 0.16730
## Neg Pred Value : 0.97946
## Prevalence : 0.05596
## Detection Rate : 0.04037
## Detection Prevalence : 0.24133
## Balanced Accuracy : 0.75430
##
## 'Positive' Class : Yes
##
plot_confusion_matrix(tree_cm, "Confusion Matrix: Decision Tree")
rpart.plot(tree_model,
type = 2,
extra = 104,
fallen.leaves = TRUE,
under = TRUE,
tweak = 1.1,
cex = 0.65,
varlen = 0,
faclen = 0,
roundint = FALSE,
box.palette = "RdYlGn",
shadow.col = "gray")
The decision tree’s variable importance values show which predictors contributed most to the splits in the tree.
tree_importance <- data.frame(
Variable = names(tree_model$variable.importance),
Importance = as.numeric(tree_model$variable.importance),
row.names = NULL
) %>%
arrange(desc(Importance))
kable(head(tree_importance, 10),
caption = "Top 10 Decision Tree Predictor Importance Values")
| Variable | Importance |
|---|---|
| HadAngina | 85607.612 |
| AgeCategory | 30845.945 |
| ChronicConditionCount | 26121.296 |
| GeneralHealth | 13407.729 |
| LastCheckupTime | 8373.542 |
| RaceEthnicityCategory | 6878.666 |
| ECigaretteUsage | 2339.482 |
| HighRiskLastYear | 1723.262 |
| HadKidneyDisease | 1084.050 |
| PoorHealth | 957.155 |
The table below compares the logistic regression and decision tree models using the same test set. Since the test set was not balanced, these results provide a fairer estimate of how the models may perform on real-world data.
classification_results <- bind_rows(
extract_metrics(log_cm, "Logistic Regression"),
extract_metrics(tree_cm, "Decision Tree")
)
kable(classification_results,
caption = "Classification Model Performance Comparison")
| Model | Accuracy | Sensitivity | Specificity | Precision | F1_Score |
|---|---|---|---|---|---|
| Logistic Regression | 0.8398 | 0.6747 | 0.8496 | 0.2100 | 0.3203 |
| Decision Tree | 0.7835 | 0.7215 | 0.7871 | 0.1673 | 0.2716 |
The classification results should be interpreted carefully because the original test data is imbalanced, with many more respondents who did not have a heart attack than respondents who did. Therefore, accuracy alone is not enough to judge model performance. Sensitivity, precision and F1-score are also important because they show how well the model identifies the minority class, which is the respondents who had a heart attack.
Logistic Regression achieved better overall performance than the Decision Tree. It recorded higher accuracy (0.8398), specificity (0.8496), precision (0.2100), and F1-score (0.3203). This means Logistic Regression was better at correctly classifying the overall test data and produced a better balance between precision and recall. Although its precision was still low, it performed better than the Decision Tree in this metric.
The Decision Tree achieved higher sensitivity (0.7215) compared to Logistic Regression (0.6747), meaning it was slightly better at identifying actual heart attack cases. However, it had lower accuracy, specificity, precision and F1-score. This suggests that although the Decision Tree detected more positive cases, it also produced more false positive predictions.
Overall, Logistic Regression is selected as the preferred classification model because it achieved the best balance across most evaluation metrics, especially accuracy, specificity, precision and F1-score. For this health-related classification task, identifying positive cases is important, but the model should also avoid too many incorrect positive predictions. Therefore, Logistic Regression provides a more balanced and reliable result compared to the Decision Tree.
It should be noted that some coefficients in the Logistic Regression
model, such as those for HadAsthmaYes and
HadKidneyDiseaseYes, showed unexpected negative signs. This
is likely attributable to multicollinearity among the large number of
correlated health predictors rather than a true protective effect, and
should be interpreted with caution.
This project investigated two predictive modelling tasks using the
BRFSS 2022 dataset, which contains self-reported health information from
445,132 U.S. adults. The two tasks were: (1) predicting the number of
physically unhealthy days (PhysicalHealthDays) using
regression models, and (2) predicting whether a respondent had
experienced a heart attack (HadHeartAttack) using
classification models.
Three regression models were developed and compared: Multiple Linear Regression, Decision Tree Regression, and XGBoost Regression. XGBoost was the best-performing model, as shown in the table below.
| Model | MAE | RMSE | R-Squared |
|---|---|---|---|
| Multiple Linear Regression | 4.0077 | 6.6201 | 0.4185 |
| Decision Tree Regression | 4.2679 | 6.8758 | 0.3727 |
| XGBoost Regression | 3.9067 | 6.5410 | 0.4324 |
The key predictors identified across all three models were:
Two classification models were built to predict heart attack occurrence: Logistic Regression and Decision Tree. Since only about 6% of respondents reported having a heart attack, SMOTE-NC was applied to the training data to address the class imbalance. The performance of both models on the test set is summarised below.
| Model | Accuracy | Sensitivity | Specificity | Precision | F1-Score |
|---|---|---|---|---|---|
| Logistic Regression | 0.8398 | 0.6747 | 0.8496 | 0.2100 | 0.3203 |
| Decision Tree | 0.7835 | 0.7215 | 0.7871 | 0.1673 | 0.2716 |
Logistic Regression was selected as the preferred model as it achieved a better overall balance across most evaluation metrics, particularly in terms of accuracy, specificity, precision, and F1-score. Across both models, the key predictors of heart attack risk were history of angina, age category, and chronic condition count. General health status was additionally identified as a contributing predictor in both models, while sex was found to be a contributing factor in the Logistic Regression model.
It should be noted that some coefficients in the Logistic Regression
model, such as those for HadAsthmaYes and
HadKidneyDiseaseYes, showed unexpected negative signs. This
is likely attributable to multicollinearity among the large number of
correlated health predictors rather than a true protective effect, and
should be interpreted with caution.
The findings from both tasks consistently show that general health status, chronic disease burden, physical limitations, and lifestyle behaviours are important factors in predicting both physically unhealthy days and heart attack risk. These results support the use of personal health indicators from large-scale public health surveys in identifying high-risk individuals and guiding preventive healthcare strategies.
This project has several limitations that should be acknowledged:
PhysicalHealthDays, suggesting that unmeasured factors such
as income level, healthcare access, and environmental conditions may
also play a role.Future studies could build upon this work by incorporating additional variables such as socioeconomic indicators and healthcare access, applying more advanced class balancing strategies, and exploring a wider range of machine learning algorithms to improve overall predictive performance.