Group Member

Chai Juan Ci 25078348
Liew Pooi Yee 24201420
Chen Chian Xuan 25079982
Lum Yong Sheng 24202422
Tan Zhi Ying 25082095

Project Title

Predicting Physically Unhealthy Days and Heart Disease Risk Classification Using Personal Health Indicators Among US Adults

1 Introduction

1.1 Background

Heart disease remains one of the leading causes of mortality in the United States, affecting millions of adults across all demographic groups. It is closely associated with multiple risk factors, including poor physical health, unhealthy lifestyle behaviours such as smoking and physical inactivity, as well as chronic conditions such as diabetes and obesity. These factors often develop over time and contribute to a higher risk of cardiovascular disease.

Understanding the relationship between personal health indicators, overall physical well-being, and heart disease risk can provide valuable insights for public health interventions and early detection strategies. This information is important for identifying individuals at high risk and improving the effectiveness of preventive healthcare strategies.

This project utilises the Behavioral Risk Factor Surveillance System (BRFSS) 2022 dataset, a large-scale health survey conducted by the Centers for Disease Control and Prevention (CDC). The dataset contains self-reported information from U.S. adults, including health behaviours, chronic conditions, and use of preventive health services.

1.2 Objective

This project aims to investigate the relationship between personal health indicators, physical well-being, and heart disease risk using the BRFSS 2022 dataset. Two predictive modelling tasks are conducted.

Question 1 (Regression)

Can the number of physically unhealthy days (PhysicalHealthDays) experienced by a U.S. adult be predicted based on personal health indicators?

Objectives:
1. To identify key factors associated with poor physical health among adults.
2. To examine the relationship between lifestyle behaviours, chronic conditions, and physically unhealthy days.
3. To develop a regression model for predicting physical health outcomes.

Question 2 (Classification)

Can the occurrence of a heart attack (HadHeartAttack) be classified based on demographic and lifestyle characteristics of U.S. adults?

Objectives:
1. To identify important demographic, lifestyle, and health-related factors associated with heart disease risk.
2. To develop a classification model for predicting heart attack occurrence.
3. To support early risk identification and improve preventive healthcare strategies using data-driven insights.

2 Data Overview

The dataset used in this project is the Personal Key Indicators of Heart Disease (2022 update), published on Kaggle by Kamil Pytlak. It is based on the annual Behavioral Risk Factor Surveillance System (BRFSS) survey conducted by the Centers for Disease Control and Prevention (CDC). The dataset contains self-reported information from non-institutionalised adults in the United States, covering health risk behaviours, chronic conditions, and use of preventive health services.

Types of Data Collected

The 40 features are organised into six thematic categories covering different domains of personal health information.

Category	Features	Description
Numerical health measures	`PhysicalHealthDays`, `MentalHealthDays`, `SleepHours`, `BMI`, `HeightInMeters`, `WeightInKilograms`	Continuous measurements of physical and mental well-being, body composition, and sleep
Chronic conditions	`HadHeartAttack`, `HadAngina`, `HadStroke`, `HadAsthma`, `HadCOPD`, `HadDiabetes`, `HadArthritis`, `HadKidneyDisease`	Self-reported history of diagnosed chronic diseases and cardiovascular conditions (Yes / No)
Lifestyle & behaviour	`SmokerStatus`, `ECigaretteUsage`, `PhysicalActivities`, `AlcoholDrinkers`, `HighRiskLastYear`	Modifiable behavioural risk factors including smoking, physical activity, and alcohol use
Physical difficulties	`DifficultyWalking`, `DifficultyConcentrating`, `DifficultyDressingBathing`, `DifficultyErrands`, `BlindOrVisionDifficulty`, `DeafOrHardOfHearing`	Functional limitations and disability indicators reflecting health-related quality of life (Yes / No)
Preventive care	`FluVaxLast12`, `PneumoVaxEver`, `HIVTesting`, `TetanusLast10Tdap`, `ChestScan`, `LastCheckupTime`	Engagement with preventive health services including vaccines, screenings, and routine check-ups
Demographics	`Sex`, `AgeCategory`, `RaceEthnicityCategory`, `State`, `GeneralHealth`, `RemovedTeeth`, `CovidPos`	Respondent identity, socioeconomic background, geographic location, and self-rated health status

2.1 Loading the packages and the data

We start by loading the libraries used throughout the analysis and reading the CSV file into R. Text columns are kept as plain characters for now (stringsAsFactors = FALSE) so they can be cleaned first; the conversion to factors is done once the data is tidy.

library(dplyr)
library(ggplot2)
library(tidyr)
library(knitr)
library(caret)
library(summarytools)
library(VIM)
library(themis)
library(recipes)
library(scales)
library(gridExtra)
library(reshape2)

df <- read.csv('heart_2022.csv', stringsAsFactors = FALSE)

2.2 Fixing missing values

In the dataset, missing values were stored as empty strings (""). To make these gaps clearly visible and ensure they are handled consistently, the empty strings were converted to NA. This step standardises the representation of missing values and prepares the data for cleaning and analysis later on.

df[df == ""] <- NA

2.3 Structure of the dataset

A quick check of the dataset’s structure confirms the data loaded correctly and provides an overview of the variable types. The dataset contains 445,132 rows and 40 columns — 6 numerical columns and 34 categorical columns covering chronic conditions, lifestyle behaviours, and demographics.

glimpse(df)

## Rows: 445,132
## Columns: 40
## $ State                     <chr> "Alabama", "Alabama", "Alabama", "Alabama", …
## $ Sex                       <chr> "Female", "Female", "Female", "Female", "Fem…
## $ GeneralHealth             <chr> "Very good", "Excellent", "Very good", "Exce…
## $ PhysicalHealthDays        <dbl> 0, 0, 2, 0, 2, 1, 0, 0, 0, 1, 8, 0, 5, 0, 30…
## $ MentalHealthDays          <dbl> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 5,…
## $ LastCheckupTime           <chr> "Within past year (anytime less than 12 mont…
## $ PhysicalActivities        <chr> "No", "No", "Yes", "Yes", "Yes", "No", "Yes"…
## $ SleepHours                <dbl> 8, 6, 5, 7, 9, 7, 7, 8, 6, 7, 8, 6, 6, 8, 8,…
## $ RemovedTeeth              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ HadHeartAttack            <chr> "No", "No", "No", "No", "No", "Yes", "No", "…
## $ HadAngina                 <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ HadStroke                 <chr> "No", "No", "No", "No", "No", "Yes", "No", "…
## $ HadAsthma                 <chr> "No", "No", "No", "Yes", "No", "No", "No", "…
## $ HadSkinCancer             <chr> "No", "Yes", "Yes", "No", "No", "No", "No", …
## $ HadCOPD                   <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ HadDepressiveDisorder     <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ HadKidneyDisease          <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ HadArthritis              <chr> "No", "No", "No", "Yes", "No", "No", "No", "…
## $ HadDiabetes               <chr> "Yes", "No", "No", "No", "No", "Yes", "No", …
## $ DeafOrHardOfHearing       <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ BlindOrVisionDifficulty   <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ DifficultyConcentrating   <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ DifficultyWalking         <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ DifficultyDressingBathing <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ DifficultyErrands         <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ SmokerStatus              <chr> "Never smoked", "Never smoked", "Never smoke…
## $ ECigaretteUsage           <chr> "Not at all (right now)", "Never used e-ciga…
## $ ChestScan                 <chr> "No", "No", "No", "Yes", "Yes", "No", "No", …
## $ RaceEthnicityCategory     <chr> "White only, Non-Hispanic", "White only, Non…
## $ AgeCategory               <chr> "Age 80 or older", "Age 80 or older", "Age 5…
## $ HeightInMeters            <dbl> NA, 1.60, 1.57, 1.65, 1.57, 1.80, 1.65, 1.63…
## $ WeightInKilograms         <dbl> NA, 68.04, 63.50, 63.50, 53.98, 84.82, 62.60…
## $ BMI                       <dbl> NA, 26.57, 25.61, 23.30, 21.77, 26.08, 22.96…
## $ AlcoholDrinkers           <chr> "No", "No", "No", "No", "Yes", "No", "Yes", …
## $ HIVTesting                <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ FluVaxLast12              <chr> "Yes", "No", "No", "Yes", "No", "No", "No", …
## $ PneumoVaxEver             <chr> "No", "No", "No", "Yes", "Yes", "Yes", "No",…
## $ TetanusLast10Tdap         <chr> "Yes, received tetanus shot but not sure wha…
## $ HighRiskLastYear          <chr> "No", "No", "No", "No", "No", "No", "No", "N…
## $ CovidPos                  <chr> "No", "No", "Yes", "No", "No", "No", "No", "…

2.4 Summary statistics

Numerical variables such as PhysicalHealthDays, MentalHealthDays, BMI, and SleepHours show wide variation. Most respondents report a low number of unhealthy days, while a small proportion show extreme values. The categorical variables show that most respondents do not report chronic conditions such as heart disease, and missing values are present in several variables.

summary(df)

##        State               Sex           GeneralHealth    PhysicalHealthDays
##  Length   :445132   Length   :445132   Length   :445132   Min.   : 0.000    
##  N.unique :    54   N.unique :     2   N.unique :     5   1st Qu.: 0.000    
##  N.blank  :     0   N.blank  :     0   N.blank  :     0   Median : 0.000    
##  Min.nchar:     4   Min.nchar:     4   Min.nchar:     4   Mean   : 4.348    
##  Max.nchar:    20   Max.nchar:     6   Max.nchar:     9   3rd Qu.: 3.000    
##                                        NAs      :  1198   Max.   :30.000    
##                                                           NAs    :10927     
##  MentalHealthDays  LastCheckupTime   PhysicalActivities   SleepHours    
##  Min.   : 0.000   Length   :445132   Length   :445132   Min.   : 1.000  
##  1st Qu.: 0.000   N.unique :     4   N.unique :     2   1st Qu.: 6.000  
##  Median : 0.000   N.blank  :     0   N.blank  :     0   Median : 7.000  
##  Mean   : 4.383   Min.nchar:    19   Min.nchar:     2   Mean   : 7.023  
##  3rd Qu.: 5.000   Max.nchar:    55   Max.nchar:     3   3rd Qu.: 8.000  
##  Max.   :30.000   NAs      :  8308   NAs      :  1093   Max.   :24.000  
##  NAs    :9067                                           NAs    :5453    
##     RemovedTeeth      HadHeartAttack       HadAngina          HadStroke     
##  Length   :445132   Length   :445132   Length   :445132   Length   :445132  
##  N.unique :     4   N.unique :     2   N.unique :     2   N.unique :     2  
##  N.blank  :     0   N.blank  :     0   N.blank  :     0   N.blank  :     0  
##  Min.nchar:     3   Min.nchar:     2   Min.nchar:     2   Min.nchar:     2  
##  Max.nchar:    22   Max.nchar:     3   Max.nchar:     3   Max.nchar:     3  
##  NAs      : 11360   NAs      :  3065   NAs      :  4405   NAs      :  1557  
##                                                                             
##      HadAsthma        HadSkinCancer         HadCOPD       HadDepressiveDisorder
##  Length   :445132   Length   :445132   Length   :445132   Length   :445132     
##  N.unique :     2   N.unique :     2   N.unique :     2   N.unique :     2     
##  N.blank  :     0   N.blank  :     0   N.blank  :     0   N.blank  :     0     
##  Min.nchar:     2   Min.nchar:     2   Min.nchar:     2   Min.nchar:     2     
##  Max.nchar:     3   Max.nchar:     3   Max.nchar:     3   Max.nchar:     3     
##  NAs      :  1773   NAs      :  3143   NAs      :  2219   NAs      :  2812     
##                                                                                
##   HadKidneyDisease     HadArthritis       HadDiabetes     DeafOrHardOfHearing
##  Length   :445132   Length   :445132   Length   :445132   Length   :445132   
##  N.unique :     2   N.unique :     2   N.unique :     4   N.unique :     2   
##  N.blank  :     0   N.blank  :     0   N.blank  :     0   N.blank  :     0   
##  Min.nchar:     2   Min.nchar:     2   Min.nchar:     2   Min.nchar:     2   
##  Max.nchar:     3   Max.nchar:     3   Max.nchar:    39   Max.nchar:     3   
##  NAs      :  1926   NAs      :  2633   NAs      :  1087   NAs      : 20647   
##                                                                              
##  BlindOrVisionDifficulty DifficultyConcentrating DifficultyWalking 
##  Length   :445132        Length   :445132        Length   :445132  
##  N.unique :     2        N.unique :     2        N.unique :     2  
##  N.blank  :     0        N.blank  :     0        N.blank  :     0  
##  Min.nchar:     2        Min.nchar:     2        Min.nchar:     2  
##  Max.nchar:     3        Max.nchar:     3        Max.nchar:     3  
##  NAs      : 21564        NAs      : 24240        NAs      : 24012  
##                                                                    
##  DifficultyDressingBathing DifficultyErrands     SmokerStatus   
##  Length   :445132          Length   :445132   Length   :445132  
##  N.unique :     2          N.unique :     2   N.unique :     4  
##  N.blank  :     0          N.blank  :     0   N.blank  :     0  
##  Min.nchar:     2          Min.nchar:     2   Min.nchar:    12  
##  Max.nchar:     3          Max.nchar:     3   Max.nchar:    37  
##  NAs      : 23915          NAs      : 25656   NAs      : 35462  
##                                                                 
##   ECigaretteUsage       ChestScan      RaceEthnicityCategory    AgeCategory    
##  Length   :445132   Length   :445132   Length   :445132      Length   :445132  
##  N.unique :     4   N.unique :     2   N.unique :     5      N.unique :    13  
##  N.blank  :     0   N.blank  :     0   N.blank  :     0      N.blank  :     0  
##  Min.nchar:    18   Min.nchar:     2   Min.nchar:     8      Min.nchar:    12  
##  Max.nchar:    41   Max.nchar:     3   Max.nchar:    29      Max.nchar:    15  
##  NAs      : 35660   NAs      : 56046   NAs      : 14057      NAs      :  9079  
##                                                                                
##  HeightInMeters  WeightInKilograms      BMI         AlcoholDrinkers  
##  Min.   :0.910   Min.   : 22.68    Min.   :12.02   Length   :445132  
##  1st Qu.:1.630   1st Qu.: 68.04    1st Qu.:24.13   N.unique :     2  
##  Median :1.700   Median : 80.74    Median :27.44   N.blank  :     0  
##  Mean   :1.703   Mean   : 83.07    Mean   :28.53   Min.nchar:     2  
##  3rd Qu.:1.780   3rd Qu.: 95.25    3rd Qu.:31.75   Max.nchar:     3  
##  Max.   :2.410   Max.   :292.57    Max.   :99.64   NAs      : 46574  
##  NAs    :28652   NAs    :42078     NAs    :48806                     
##      HIVTesting        FluVaxLast12      PneumoVaxEver    TetanusLast10Tdap 
##  Length   :445132   Length   :445132   Length   :445132   Length   :445132  
##  N.unique :     2   N.unique :     2   N.unique :     2   N.unique :     4  
##  N.blank  :     0   N.blank  :     0   N.blank  :     0   N.blank  :     0  
##  Min.nchar:     2   Min.nchar:     2   Min.nchar:     2   Min.nchar:    18  
##  Max.nchar:     3   Max.nchar:     3   Max.nchar:     3   Max.nchar:    57  
##  NAs      : 66127   NAs      : 47121   NAs      : 77040   NAs      : 82516  
##                                                                             
##   HighRiskLastYear       CovidPos     
##  Length   :445132   Length   :445132  
##  N.unique :     2   N.unique :     3  
##  N.blank  :     0   N.blank  :     0  
##  Min.nchar:     2   Min.nchar:     2  
##  Max.nchar:     3   Max.nchar:    61  
##  NAs      : 50623   NAs      : 50764  
##

2.5 Detailed summary

The dfSummary() function provides a detailed overview of the dataset, including frequencies, distributions and missing-value percentages, which helps to guide the cleaning decisions.

print(dfSummary(df,
                     graph.magnif = 0.75),
           method = "render")

Data Frame Summary

df

Dimensions: 445132 x 40
Duplicates: 157

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

State [character]

1. Washington

2. New York

3. Minnesota

4. Ohio

5. Maryland

6. Texas

7. Florida

8. Wisconsin

9. Kansas

10. Massachusetts

[ 44 others ]

26152	(	5.9%	)
17800	(	4.0%	)
16821	(	3.8%	)
16487	(	3.7%	)
16418	(	3.7%	)
14245	(	3.2%	)
13393	(	3.0%	)
11276	(	2.5%	)
11247	(	2.5%	)
11029	(	2.5%	)
290264	(	65.2%	)

445132 (100.0%)

0 (0.0%)

Sex [character]

1. Female

2. Male

235893	(	53.0%	)
209239	(	47.0%	)

445132 (100.0%)

0 (0.0%)

GeneralHealth [character]

1. Excellent

2. Fair

3. Good

4. Poor

5. Very good

71878	(	16.2%	)
60273	(	13.6%	)
143598	(	32.3%	)
19741	(	4.4%	)
148444	(	33.4%	)

443934 (99.7%)

1198 (0.3%)

PhysicalHealthDays [numeric]

Mean (sd) : 4.3 (8.7)

min ≤ med ≤ max:

0 ≤ 0 ≤ 30

IQR (CV) : 3 (2)

31 distinct values

434205 (97.5%)

10927 (2.5%)

MentalHealthDays [numeric]

Mean (sd) : 4.4 (8.4)

min ≤ med ≤ max:

0 ≤ 0 ≤ 30

IQR (CV) : 5 (1.9)

31 distinct values

436065 (98.0%)

9067 (2.0%)

LastCheckupTime [character]

1. 5 or more years ago

2. Within past 2 years (1 ye

3. Within past 5 years (2 ye

4. Within past year (anytime

19079	(	4.4%	)
41919	(	9.6%	)
24882	(	5.7%	)
350944	(	80.3%	)

436824 (98.1%)

8308 (1.9%)

PhysicalActivities [character]

1. No

2. Yes

106480	(	24.0%	)
337559	(	76.0%	)

444039 (99.8%)

1093 (0.2%)

SleepHours [numeric]

Mean (sd) : 7 (1.5)

min ≤ med ≤ max:

1 ≤ 7 ≤ 24

IQR (CV) : 2 (0.2)

24 distinct values

439679 (98.8%)

5453 (1.2%)

RemovedTeeth [character]

1. 1 to 5

2. 6 or more, but not all

3. All

4. None of them

129294	(	29.8%	)
45570	(	10.5%	)
25453	(	5.9%	)
233455	(	53.8%	)

433772 (97.4%)

11360 (2.6%)

HadHeartAttack [character]

1. No

2. Yes

416959	(	94.3%	)
25108	(	5.7%	)

442067 (99.3%)

3065 (0.7%)

HadAngina [character]

1. No

2. Yes

414176	(	94.0%	)
26551	(	6.0%	)

440727 (99.0%)

4405 (1.0%)

HadStroke [character]

1. No

2. Yes

424336	(	95.7%	)
19239	(	4.3%	)

443575 (99.7%)

1557 (0.3%)

HadAsthma [character]

1. No

2. Yes

376665	(	85.0%	)
66694	(	15.0%	)

443359 (99.6%)

1773 (0.4%)

HadSkinCancer [character]

1. No

2. Yes

406504	(	92.0%	)
35485	(	8.0%	)

441989 (99.3%)

3143 (0.7%)

HadCOPD [character]

1. No

2. Yes

407257	(	91.9%	)
35656	(	8.1%	)

442913 (99.5%)

2219 (0.5%)

HadDepressiveDisorder [character]

1. No

2. Yes

350910	(	79.3%	)
91410	(	20.7%	)

442320 (99.4%)

2812 (0.6%)

HadKidneyDisease [character]

1. No

2. Yes

422891	(	95.4%	)
20315	(	4.6%	)

443206 (99.6%)

1926 (0.4%)

HadArthritis [character]

1. No

2. Yes

291351	(	65.8%	)
151148	(	34.2%	)

442499 (99.4%)

2633 (0.6%)

HadDiabetes [character]

1. No

2. No, pre-diabetes or borde

3. Yes

4. Yes, but only during preg

368722	(	83.0%	)
10329	(	2.3%	)
61158	(	13.8%	)
3836	(	0.9%	)

444045 (99.8%)

1087 (0.2%)

DeafOrHardOfHearing [character]

1. No

2. Yes

385539	(	90.8%	)
38946	(	9.2%	)

424485 (95.4%)

20647 (4.6%)

BlindOrVisionDifficulty [character]

1. No

2. Yes

399910	(	94.4%	)
23658	(	5.6%	)

423568 (95.2%)

21564 (4.8%)

DifficultyConcentrating [character]

1. No

2. Yes

370792	(	88.1%	)
50100	(	11.9%	)

420892 (94.6%)

24240 (5.4%)

DifficultyWalking [character]

1. No

2. Yes

353039	(	83.8%	)
68081	(	16.2%	)

421120 (94.6%)

24012 (5.4%)

DifficultyDressingBathing [character]

1. No

2. Yes

404404	(	96.0%	)
16813	(	4.0%	)

421217 (94.6%)

23915 (5.4%)

DifficultyErrands [character]

1. No

2. Yes

387029	(	92.3%	)
32447	(	7.7%	)

419476 (94.2%)

25656 (5.8%)

SmokerStatus [character]

1. Current smoker - now smok

2. Current smoker - now smok

3. Former smoker

4. Never smoked

36003	(	8.8%	)
13938	(	3.4%	)
113774	(	27.8%	)
245955	(	60.0%	)

409670 (92.0%)

35462 (8.0%)

ECigaretteUsage [character]

1. Never used e-cigarettes i

2. Not at all (right now)

3. Use them every day

4. Use them some days

311988	(	76.2%	)
75368	(	18.4%	)
10382	(	2.5%	)
11734	(	2.9%	)

409472 (92.0%)

35660 (8.0%)

ChestScan [character]

1. No

2. Yes

223221	(	57.4%	)
165865	(	42.6%	)

389086 (87.4%)

56046 (12.6%)

RaceEthnicityCategory [character]

1. Black only, Non-Hispanic

2. Hispanic

3. Multiracial, Non-Hispanic

4. Other race only, Non-Hisp

5. White only, Non-Hispanic

35446	(	8.2%	)
42917	(	10.0%	)
9578	(	2.2%	)
22713	(	5.3%	)
320421	(	74.3%	)

431075 (96.8%)

14057 (3.2%)

AgeCategory [character]

1. Age 65 to 69

2. Age 60 to 64

3. Age 70 to 74

4. Age 55 to 59

5. Age 80 or older

6. Age 50 to 54

7. Age 75 to 79

8. Age 40 to 44

9. Age 45 to 49

10. Age 35 to 39

[ 3 others ]

47099	(	10.8%	)
44511	(	10.2%	)
43472	(	10.0%	)
36821	(	8.4%	)
36251	(	8.3%	)
33644	(	7.7%	)
32518	(	7.5%	)
29942	(	6.9%	)
28531	(	6.5%	)
28526	(	6.5%	)
74738	(	17.1%	)

436053 (98.0%)

9079 (2.0%)

HeightInMeters [numeric]

Mean (sd) : 1.7 (0.1)

min ≤ med ≤ max:

0.9 ≤ 1.7 ≤ 2.4

IQR (CV) : 0.2 (0.1)

109 distinct values

416480 (93.6%)

28652 (6.4%)

WeightInKilograms [numeric]

Mean (sd) : 83.1 (21.4)

min ≤ med ≤ max:

22.7 ≤ 80.7 ≤ 292.6

IQR (CV) : 27.2 (0.3)

599 distinct values

403054 (90.5%)

42078 (9.5%)

BMI [numeric]

Mean (sd) : 28.5 (6.6)

min ≤ med ≤ max:

12 ≤ 27.4 ≤ 99.6

IQR (CV) : 7.6 (0.2)

3985 distinct values

396326 (89.0%)

48806 (11.0%)

AlcoholDrinkers [character]

1. No

2. Yes

187667	(	47.1%	)
210891	(	52.9%	)

398558 (89.5%)

46574 (10.5%)

HIVTesting [character]

1. No

2. Yes

249919	(	65.9%	)
129086	(	34.1%	)

379005 (85.1%)

66127 (14.9%)

FluVaxLast12 [character]

1. No

2. Yes

188755	(	47.4%	)
209256	(	52.6%	)

398011 (89.4%)

47121 (10.6%)

PneumoVaxEver [character]

1. No

2. Yes

215604	(	58.6%	)
152488	(	41.4%	)

368092 (82.7%)

77040 (17.3%)

TetanusLast10Tdap [character]

1. No, did not receive any t

2. Yes, received Tdap

3. Yes, received tetanus sho

4. Yes, received tetanus sho

121493	(	33.5%	)
99943	(	27.6%	)
113725	(	31.4%	)
27455	(	7.6%	)

362616 (81.5%)

82516 (18.5%)

HighRiskLastYear [character]

1. No

2. Yes

377324	(	95.6%	)
17185	(	4.4%	)

394509 (88.6%)

50623 (11.4%)

CovidPos [character]

1. No

2. Tested positive using hom

3. Yes

270055	(	68.5%	)
13436	(	3.4%	)
110877	(	28.1%	)

394368 (88.6%)

50764 (11.4%)

Generated by summarytools 1.1.5 (R version 4.6.0)
2026-06-09

3 Data Preprocessing

3.1 Missing Data

3.1.1 Identifying how much data is missing

First, we find out how much data is actually missing and where. The table below counts the missing values in each variable.

data.frame(variable      = colnames(df),
           missing_count = sapply(df, function(x) sum(is.na(x))),
           missing_pct   = round(100 * sapply(df, function(x) sum(is.na(x))) / nrow(df), 1),
           row.names = NULL)

##                     variable missing_count missing_pct
## 1                      State             0         0.0
## 2                        Sex             0         0.0
## 3              GeneralHealth          1198         0.3
## 4         PhysicalHealthDays         10927         2.5
## 5           MentalHealthDays          9067         2.0
## 6            LastCheckupTime          8308         1.9
## 7         PhysicalActivities          1093         0.2
## 8                 SleepHours          5453         1.2
## 9               RemovedTeeth         11360         2.6
## 10            HadHeartAttack          3065         0.7
## 11                 HadAngina          4405         1.0
## 12                 HadStroke          1557         0.3
## 13                 HadAsthma          1773         0.4
## 14             HadSkinCancer          3143         0.7
## 15                   HadCOPD          2219         0.5
## 16     HadDepressiveDisorder          2812         0.6
## 17          HadKidneyDisease          1926         0.4
## 18              HadArthritis          2633         0.6
## 19               HadDiabetes          1087         0.2
## 20       DeafOrHardOfHearing         20647         4.6
## 21   BlindOrVisionDifficulty         21564         4.8
## 22   DifficultyConcentrating         24240         5.4
## 23         DifficultyWalking         24012         5.4
## 24 DifficultyDressingBathing         23915         5.4
## 25         DifficultyErrands         25656         5.8
## 26              SmokerStatus         35462         8.0
## 27           ECigaretteUsage         35660         8.0
## 28                 ChestScan         56046        12.6
## 29     RaceEthnicityCategory         14057         3.2
## 30               AgeCategory          9079         2.0
## 31            HeightInMeters         28652         6.4
## 32         WeightInKilograms         42078         9.5
## 33                       BMI         48806        11.0
## 34           AlcoholDrinkers         46574        10.5
## 35                HIVTesting         66127        14.9
## 36              FluVaxLast12         47121        10.6
## 37             PneumoVaxEver         77040        17.3
## 38         TetanusLast10Tdap         82516        18.5
## 39          HighRiskLastYear         50623        11.4
## 40                  CovidPos         50764        11.4

Most variables (38 out of 40) turn out to have some missing values, ranging from very small amounts up to around 18.5%. No variable is missing so much data that it has to be discarded entirely.

3.1.2 Is the missingness random (MAR)?

Before deciding how to treat the missing values, we need to understand why the data is missing, because that affects whether deleting or filling in the values is the better choice. To check this, we compare the missing rate of BMI across age groups.

df %>%
  mutate(BMI_missing = is.na(BMI)) %>%
  group_by(AgeCategory) %>%
  summarise(pct_missing = round(100 * mean(BMI_missing), 1)) %>%
  arrange(AgeCategory)

## # A tibble: 14 × 2
##    AgeCategory     pct_missing
##    <chr>                 <dbl>
##  1 Age 18 to 24           11.4
##  2 Age 25 to 29           12.2
##  3 Age 30 to 34           13.2
##  4 Age 35 to 39           11.9
##  5 Age 40 to 44           11.3
##  6 Age 45 to 49           10.8
##  7 Age 50 to 54           10.7
##  8 Age 55 to 59           10.2
##  9 Age 60 to 64            9.8
## 10 Age 65 to 69            9.2
## 11 Age 70 to 74            8.7
## 12 Age 75 to 79            8.1
## 13 Age 80 or older         8  
## 14 <NA>                   48.9

The result shows that the missing rate is higher for younger respondents (around 11-13%) and lower for older respondents (around 8%), meaning the missingness follows a pattern related to an observed variable, which is age. This tells us the data is Missing At Random (MAR) rather than missing in a completely random way. Deleting rows with missing values would remove proportionally more younger respondents and bias the sample, so filling in the predictor values (imputation) is the safer approach.

3.1.3 Removing rows with a missing target

HadHeartAttack and PhysicalHealthDays are the two variables we are trying to predict. A record with no target value cannot be used to train or evaluate a model, and imputing it would fabricate the outcome being predicted. These rows are therefore removed.

df <- df %>% filter(!is.na(HadHeartAttack), !is.na(PhysicalHealthDays))

The dropped fractions are small (0.7% and 2.5%), so the sample loss is minimal. Removing rows with a missing target reduced the dataset from 445,132 to 431,470 records.

3.1.4 Imputing numeric predictors with the median

The numeric predictors are filled in using the median. The median is used rather than the mean because, based on the summary statistics, the data is skewed, and the median is not affected by skew or extreme values. Note that PhysicalHealthDays is deliberately left out here, since it is a target rather than a predictor.

num_predictors <- c("MentalHealthDays", "SleepHours",
                     "HeightInMeters", "WeightInKilograms", "BMI")
for (col in num_predictors) {
  df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
}

# Quick check
sapply(num_predictors, function(c) sum(is.na(df[[c]])))

##  MentalHealthDays        SleepHours    HeightInMeters WeightInKilograms 
##                 0                 0                 0                 0 
##               BMI 
##                 0

3.1.5 Imputing categorical predictors with the mode

Categorical variables have no mean or median, so the missing values are filled with the mode, the most common category in each column. This keeps every record and avoids creating an artificial “Missing” category that could be mistaken for a real group.

get_mode <- function(x) {
  x <- x[!is.na(x)]
  names(sort(table(x), decreasing = TRUE))[1]
}
cat_cols <- names(df)[sapply(df, is.character)]
for (col in cat_cols) {
  df[[col]][is.na(df[[col]])] <- get_mode(df[[col]])
}

3.1.6 Confirming no missing data remains

A final check confirms that the deletion and imputation steps removed all missing values. Every variable should now show a count of zero.

data.frame(variable      = colnames(df),
           missing_count = sapply(df, function(x) sum(is.na(x))),
           row.names = NULL)

##                     variable missing_count
## 1                      State             0
## 2                        Sex             0
## 3              GeneralHealth             0
## 4         PhysicalHealthDays             0
## 5           MentalHealthDays             0
## 6            LastCheckupTime             0
## 7         PhysicalActivities             0
## 8                 SleepHours             0
## 9               RemovedTeeth             0
## 10            HadHeartAttack             0
## 11                 HadAngina             0
## 12                 HadStroke             0
## 13                 HadAsthma             0
## 14             HadSkinCancer             0
## 15                   HadCOPD             0
## 16     HadDepressiveDisorder             0
## 17          HadKidneyDisease             0
## 18              HadArthritis             0
## 19               HadDiabetes             0
## 20       DeafOrHardOfHearing             0
## 21   BlindOrVisionDifficulty             0
## 22   DifficultyConcentrating             0
## 23         DifficultyWalking             0
## 24 DifficultyDressingBathing             0
## 25         DifficultyErrands             0
## 26              SmokerStatus             0
## 27           ECigaretteUsage             0
## 28                 ChestScan             0
## 29     RaceEthnicityCategory             0
## 30               AgeCategory             0
## 31            HeightInMeters             0
## 32         WeightInKilograms             0
## 33                       BMI             0
## 34           AlcoholDrinkers             0
## 35                HIVTesting             0
## 36              FluVaxLast12             0
## 37             PneumoVaxEver             0
## 38         TetanusLast10Tdap             0
## 39          HighRiskLastYear             0
## 40                  CovidPos             0

3.2 Data Transformation

3.2.1 Checking for duplicate rows

The dataset is checked for duplicate rows. Because it contains no personal identifiers and is made up entirely of categorical health and demographic answers, identical rows most likely come from different respondents who happened to give the same responses. They are therefore kept, as removing them would discard valid data.

cat("Duplicate rows:", sum(duplicated(df)), "\n")

## Duplicate rows: 322

3.2.2 Dropping the State column

The State column is a geographic identifier with 50 categories but no direct clinical relevance to heart attack risk. Any health differences between states are already captured by other variables such as age and lifestyle, so removing it reduces complexity and noise without losing useful information.

df <- df %>% select(-State)

3.2.3 Simplifying the HadDiabetes variable

The HadDiabetes variable originally had four categories, including small groups such as pre-diabetes and pregnancy-only. These were combined into a clear Yes/No indicator, creating a simpler binary predictor and avoiding sparse categories.

df <- df %>%
  mutate(HadDiabetes = case_when(
    HadDiabetes == "Yes" ~ "Yes",
    TRUE ~ "No"))

3.2.4 Checking for outliers

Boxplots are a quick way to spot extreme values in the numeric variables.

num_vars <- c("PhysicalHealthDays", "MentalHealthDays", "SleepHours",
              "HeightInMeters", "WeightInKilograms", "BMI")
par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))
for (col in num_vars) boxplot(df[[col]], main = col)

par(mfrow = c(1, 1))

3.2.5 Handling the outliers

The boxplots reveal extreme values across the six numeric variables, but not every high value is a problem. The high values in PhysicalHealthDays and MentalHealthDays simply reflect the survey design, where 0 to 30 days is a valid answer rather than an error, so these two variables are left unchanged. In contrast, the body-measurement and sleep variables show extreme values that are more likely caused by measurement or reporting errors. To address this, the IQR method is applied, whereby any value beyond 1.5 times the interquartile range is replaced with the column median.

clean_iqr <- function(data, col) {
  Q1  <- quantile(data[[col]], 0.25, na.rm = TRUE)
  Q3  <- quantile(data[[col]], 0.75, na.rm = TRUE)
  iqr <- Q3 - Q1
  lo  <- Q1 - 1.5 * iqr
  hi  <- Q3 + 1.5 * iqr
  out <- data[[col]] < lo | data[[col]] > hi
  data[[col]][out] <- median(data[[col]], na.rm = TRUE)
  data
}
for (col in c("BMI", "WeightInKilograms", "HeightInMeters", "SleepHours")) {
  df <- clean_iqr(df, col)
}

3.3 Feature Engineering

Some useful information is not directly available in a single column but can be created from existing ones. Three new variables were added to improve model performance. BMI_category groups the numeric BMI into the standard WHO weight bands, which carry clearer health meaning than a raw number. ChronicConditionCount summarises overall health burden by counting eight chronic conditions, providing a single measure instead of multiple columns. PoorHealth is a simple flag for anyone who rated their general health as fair or poor.

# BMI grouped into WHO weight bands
df$BMI_category <- cut(df$BMI,
                       breaks = c(0, 18.5, 25, 30, 100),
                       labels = c("Underweight", "Normal", "Overweight", "Obese"),
                       right = FALSE)

# Number of chronic conditions per person
condition_cols <- c("HadAngina", "HadStroke", "HadAsthma", "HadCOPD",
                     "HadKidneyDisease", "HadArthritis", "HadDiabetes",
                     "HadDepressiveDisorder")
df$ChronicConditionCount <- rowSums(df[condition_cols] == "Yes")

# Flag for self-reported fair or poor health
df$PoorHealth <- ifelse(df$GeneralHealth %in% c("Fair", "Poor"), 1, 0)

ncol(df)

## [1] 42

table(df$BMI_category)          # counts per BMI band

## 
## Underweight      Normal  Overweight       Obese 
##        6213      113752      200846      110659

table(df$ChronicConditionCount) # counts per number of conditions (0 to 8)

## 
##      0      1      2      3      4      5      6      7      8 
## 177916 133738  69426  30744  12838   4815   1534    395     64

table(df$PoorHealth)            # counts of 0 vs 1

## 
##      0      1 
## 356582  74888

3.4 Converting categorical columns to factors

R’s modelling functions expect categorical variables to be stored as factors rather than plain text, so all the character columns are converted here. HadHeartAttack is set as a factor because it is the classification target, while AgeCategory is made an ordered factor since its categories have a natural ranking from youngest to oldest. A final summary then confirms the dataset structure after all cleaning steps.

cat_cols <- names(df)[sapply(df, is.character)]
df[cat_cols] <- lapply(df[cat_cols], as.factor)

df$HadHeartAttack <- as.factor(df$HadHeartAttack)
df$BMI_category   <- as.factor(df$BMI_category)
df$AgeCategory    <- factor(df$AgeCategory, ordered = TRUE)

print(dfSummary(df,
                     graph.magnif = 0.75),
           method = "render")

Data Frame Summary

df

Dimensions: 431470 x 42
Duplicates: 2856

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

Sex [factor]

1. Female

2. Male

228485	(	53.0%	)
202985	(	47.0%	)

431470 (100.0%)

0 (0.0%)

GeneralHealth [factor]

1. Excellent

2. Fair

3. Good

4. Poor

5. Very good

70818	(	16.4%	)
56355	(	13.1%	)
138787	(	32.2%	)
18533	(	4.3%	)
146977	(	34.1%	)

431470 (100.0%)

0 (0.0%)

PhysicalHealthDays [numeric]

Mean (sd) : 4.3 (8.7)

min ≤ med ≤ max:

0 ≤ 0 ≤ 30

IQR (CV) : 3 (2)

31 distinct values

431470 (100.0%)

0 (0.0%)

MentalHealthDays [numeric]

Mean (sd) : 4.3 (8.3)

min ≤ med ≤ max:

0 ≤ 0 ≤ 30

IQR (CV) : 4 (1.9)

31 distinct values

431470 (100.0%)

0 (0.0%)

LastCheckupTime [factor]

1. 5 or more years ago

2. Within past 2 years (1 ye

3. Within past 5 years (2 ye

4. Within past year (anytime

18606	(	4.3%	)
40952	(	9.5%	)
24310	(	5.6%	)
347602	(	80.6%	)

431470 (100.0%)

0 (0.0%)

PhysicalActivities [factor]

1. No

2. Yes

101358	(	23.5%	)
330112	(	76.5%	)

431470 (100.0%)

0 (0.0%)

SleepHours [numeric]

Mean (sd) : 7 (1.2)

min ≤ med ≤ max:

3 ≤ 7 ≤ 11

IQR (CV) : 2 (0.2)

3	:	3100	(	0.7%	)
4	:	11830	(	2.7%	)
5	:	29038	(	6.7%	)
6	:	93206	(	21.6%	)
7	:	141343	(	32.8%	)
8	:	121746	(	28.2%	)
9	:	20561	(	4.8%	)
10	:	9994	(	2.3%	)
11	:	652	(	0.2%	)

431470 (100.0%)

0 (0.0%)

RemovedTeeth [factor]

1. 1 to 5

2. 6 or more, but not all

3. All

4. None of them

125316	(	29.0%	)
43645	(	10.1%	)
23986	(	5.6%	)
238523	(	55.3%	)

431470 (100.0%)

0 (0.0%)

HadHeartAttack [factor]

1. No

2. Yes

407324	(	94.4%	)
24146	(	5.6%	)

431470 (100.0%)

0 (0.0%)

HadAngina [factor]

1. No

2. Yes

406244	(	94.2%	)
25226	(	5.8%	)

431470 (100.0%)

0 (0.0%)

HadStroke [factor]

1. No

2. Yes

413383	(	95.8%	)
18087	(	4.2%	)

431470 (100.0%)

0 (0.0%)

HadAsthma [factor]

1. No

2. Yes

367086	(	85.1%	)
64384	(	14.9%	)

431470 (100.0%)

0 (0.0%)

HadSkinCancer [factor]

1. No

2. Yes

397133	(	92.0%	)
34337	(	8.0%	)

431470 (100.0%)

0 (0.0%)

HadCOPD [factor]

1. No

2. Yes

397635	(	92.2%	)
33835	(	7.8%	)

431470 (100.0%)

0 (0.0%)

HadDepressiveDisorder [factor]

1. No

2. Yes

343131	(	79.5%	)
88339	(	20.5%	)

431470 (100.0%)

0 (0.0%)

HadKidneyDisease [factor]

1. No

2. Yes

412171	(	95.5%	)
19299	(	4.5%	)

431470 (100.0%)

0 (0.0%)

HadArthritis [factor]

1. No

2. Yes

286299	(	66.4%	)
145171	(	33.6%	)

431470 (100.0%)

0 (0.0%)

HadDiabetes [factor]

1. No

2. Yes

373081	(	86.5%	)
58389	(	13.5%	)

431470 (100.0%)

0 (0.0%)

DeafOrHardOfHearing [factor]

1. No

2. Yes

394501	(	91.4%	)
36969	(	8.6%	)

431470 (100.0%)

0 (0.0%)

BlindOrVisionDifficulty [factor]

1. No

2. Yes

409283	(	94.9%	)
22187	(	5.1%	)

431470 (100.0%)

0 (0.0%)

DifficultyConcentrating [factor]

1. No

2. Yes

383891	(	89.0%	)
47579	(	11.0%	)

431470 (100.0%)

0 (0.0%)

DifficultyWalking [factor]

1. No

2. Yes

367595	(	85.2%	)
63875	(	14.8%	)

431470 (100.0%)

0 (0.0%)

DifficultyDressingBathing [factor]

1. No

2. Yes

415796	(	96.4%	)
15674	(	3.6%	)

431470 (100.0%)

0 (0.0%)

DifficultyErrands [factor]

1. No

2. Yes

401069	(	93.0%	)
30401	(	7.0%	)

431470 (100.0%)

0 (0.0%)

SmokerStatus [factor]

1. Current smoker - now smok

2. Current smoker - now smok

3. Former smoker

4. Never smoked

34703	(	8.0%	)
13429	(	3.1%	)
110308	(	25.6%	)
273030	(	63.3%	)

431470 (100.0%)

0 (0.0%)

ECigaretteUsage [factor]

1. Never used e-cigarettes i

2. Not at all (right now)

3. Use them every day

4. Use them some days

337220	(	78.2%	)
72748	(	16.9%	)
10116	(	2.3%	)
11386	(	2.6%	)

431470 (100.0%)

0 (0.0%)

ChestScan [factor]

1. No

2. Yes

271575	(	62.9%	)
159895	(	37.1%	)

431470 (100.0%)

0 (0.0%)

RaceEthnicityCategory [factor]

1. Black only, Non-Hispanic

2. Hispanic

3. Multiracial, Non-Hispanic

4. Other race only, Non-Hisp

5. White only, Non-Hispanic

33996	(	7.9%	)
41098	(	9.5%	)
9295	(	2.2%	)
21874	(	5.1%	)
325207	(	75.4%	)

431470 (100.0%)

0 (0.0%)

AgeCategory [ordered, factor]

1. Age 18 to 24

2. Age 25 to 29

3. Age 30 to 34

4. Age 35 to 39

5. Age 40 to 44

6. Age 45 to 49

7. Age 50 to 54

8. Age 55 to 59

9. Age 60 to 64

10. Age 65 to 69

[ 3 others ]

26300	(	6.1%	)
21507	(	5.0%	)
25238	(	5.8%	)
27906	(	6.5%	)
29332	(	6.8%	)
27919	(	6.5%	)
32864	(	7.6%	)
35915	(	8.3%	)
43269	(	10.0%	)
54166	(	12.6%	)
107054	(	24.8%	)

431470 (100.0%)

0 (0.0%)

HeightInMeters [numeric]

Mean (sd) : 1.7 (0.1)

min ≤ med ≤ max:

1.4 ≤ 1.7 ≤ 2

IQR (CV) : 0.2 (0.1)

57 distinct values

431470 (100.0%)

0 (0.0%)

WeightInKilograms [numeric]

Mean (sd) : 81 (16.9)

min ≤ med ≤ max:

33.1 ≤ 80.7 ≤ 127.5

IQR (CV) : 22.2 (0.2)

293 distinct values

431470 (100.0%)

0 (0.0%)

BMI [numeric]

Mean (sd) : 27.6 (4.7)

min ≤ med ≤ max:

14.5 ≤ 27.4 ≤ 40.9

IQR (CV) : 5.7 (0.2)

2301 distinct values

431470 (100.0%)

0 (0.0%)

AlcoholDrinkers [factor]

1. No

2. Yes

180543	(	41.8%	)
250927	(	58.2%	)

431470 (100.0%)

0 (0.0%)

HIVTesting [factor]

1. No

2. Yes

305608	(	70.8%	)
125862	(	29.2%	)

431470 (100.0%)

0 (0.0%)

FluVaxLast12 [factor]

1. No

2. Yes

183210	(	42.5%	)
248260	(	57.5%	)

431470 (100.0%)

0 (0.0%)

PneumoVaxEver [factor]

1. No

2. Yes

283924	(	65.8%	)
147546	(	34.2%	)

431470 (100.0%)

0 (0.0%)

TetanusLast10Tdap [factor]

1. No, did not receive any t

2. Yes, received Tdap

3. Yes, received tetanus sho

4. Yes, received tetanus sho

196077	(	45.4%	)
98029	(	22.7%	)
110652	(	25.6%	)
26712	(	6.2%	)

431470 (100.0%)

0 (0.0%)

HighRiskLastYear [factor]

1. No

2. Yes

414697	(	96.1%	)
16773	(	3.9%	)

431470 (100.0%)

0 (0.0%)

CovidPos [factor]

1. No

2. Tested positive using hom

3. Yes

310177	(	71.9%	)
13187	(	3.1%	)
108106	(	25.1%	)

431470 (100.0%)

0 (0.0%)

BMI_category [factor]

1. Underweight

2. Normal

3. Overweight

4. Obese

6213	(	1.4%	)
113752	(	26.4%	)
200846	(	46.5%	)
110659	(	25.6%	)

431470 (100.0%)

0 (0.0%)

ChronicConditionCount [numeric]

Mean (sd) : 1 (1.2)

min ≤ med ≤ max:

0 ≤ 1 ≤ 8

IQR (CV) : 2 (1.1)

0	:	177916	(	41.2%	)
1	:	133738	(	31.0%	)
2	:	69426	(	16.1%	)
3	:	30744	(	7.1%	)
4	:	12838	(	3.0%	)
5	:	4815	(	1.1%	)
6	:	1534	(	0.4%	)
7	:	395	(	0.1%	)
8	:	64	(	0.0%	)

431470 (100.0%)

0 (0.0%)

PoorHealth [numeric]

Min : 0

Mean : 0.2

Max : 1

0	:	356582	(	82.6%	)
1	:	74888	(	17.4%	)

431470 (100.0%)

0 (0.0%)

Generated by summarytools 1.1.5 (R version 4.6.0)
2026-06-09

sapply(df, class)

## $Sex
## [1] "factor"
## 
## $GeneralHealth
## [1] "factor"
## 
## $PhysicalHealthDays
## [1] "numeric"
## 
## $MentalHealthDays
## [1] "numeric"
## 
## $LastCheckupTime
## [1] "factor"
## 
## $PhysicalActivities
## [1] "factor"
## 
## $SleepHours
## [1] "numeric"
## 
## $RemovedTeeth
## [1] "factor"
## 
## $HadHeartAttack
## [1] "factor"
## 
## $HadAngina
## [1] "factor"
## 
## $HadStroke
## [1] "factor"
## 
## $HadAsthma
## [1] "factor"
## 
## $HadSkinCancer
## [1] "factor"
## 
## $HadCOPD
## [1] "factor"
## 
## $HadDepressiveDisorder
## [1] "factor"
## 
## $HadKidneyDisease
## [1] "factor"
## 
## $HadArthritis
## [1] "factor"
## 
## $HadDiabetes
## [1] "factor"
## 
## $DeafOrHardOfHearing
## [1] "factor"
## 
## $BlindOrVisionDifficulty
## [1] "factor"
## 
## $DifficultyConcentrating
## [1] "factor"
## 
## $DifficultyWalking
## [1] "factor"
## 
## $DifficultyDressingBathing
## [1] "factor"
## 
## $DifficultyErrands
## [1] "factor"
## 
## $SmokerStatus
## [1] "factor"
## 
## $ECigaretteUsage
## [1] "factor"
## 
## $ChestScan
## [1] "factor"
## 
## $RaceEthnicityCategory
## [1] "factor"
## 
## $AgeCategory
## [1] "ordered" "factor" 
## 
## $HeightInMeters
## [1] "numeric"
## 
## $WeightInKilograms
## [1] "numeric"
## 
## $BMI
## [1] "numeric"
## 
## $AlcoholDrinkers
## [1] "factor"
## 
## $HIVTesting
## [1] "factor"
## 
## $FluVaxLast12
## [1] "factor"
## 
## $PneumoVaxEver
## [1] "factor"
## 
## $TetanusLast10Tdap
## [1] "factor"
## 
## $HighRiskLastYear
## [1] "factor"
## 
## $CovidPos
## [1] "factor"
## 
## $BMI_category
## [1] "factor"
## 
## $ChronicConditionCount
## [1] "numeric"
## 
## $PoorHealth
## [1] "numeric"

3.5 Splitting into training and test sets

The data is split into 80% for training and 20% for testing, so that models can be built on one portion and judged on data they have not seen. The split is stratified on HadHeartAttack, which keeps the same proportion of heart-attack cases in both sets, and a fixed random seed makes the split reproducible. Importantly, the split is done before any class balancing, so that the test set stays a realistic sample.

set.seed(123)
train_index <- createDataPartition(df$HadHeartAttack, p = 0.8, list = FALSE)
train_data  <- df[train_index, ]
test_data   <- df[-train_index, ]

nrow(train_data)

## [1] 345177

nrow(test_data)

## [1] 86293

3.6 Handling class imbalance

In the training set, heart-attack cases make up only about 5.6% of the records. This imbalance means a model could achieve high accuracy just by always predicting “No”, while completely missing the cases that matter most.

library(recipes)
library(themis)
table(train_data$HadHeartAttack)

## 
##     No    Yes 
## 325860  19317

To correct the imbalance, SMOTE-NC is applied to the training set. Instead of duplicating existing cases, it creates synthetic minority-class records. This version of SMOTE is designed for datasets that mix categorical and numeric variables, which fits this dataset well. Balancing is applied only to the training set, while the test set is left unchanged so that evaluation reflects real-world performance.

balance_recipe <- recipe(HadHeartAttack ~ ., data = train_data) %>%
  step_smotenc(HadHeartAttack, over_ratio = 1, seed = 123)

train_balanced <- prep(balance_recipe) %>% juice()

After balancing, the two classes are equal in the training set.

table(train_balanced$HadHeartAttack)

## 
##     No    Yes 
## 325860 325860

3.7 Data Preprocessing Summary

The preprocessing is now complete, and it produces three datasets for the modelling stage. train_balanced is the balanced training set used to build the classification model for Question 2. test_data is the untouched test set, kept at the realistic heart-attack rate, used to evaluate that model fairly. df is the full cleaned dataset used for the Question 1 regression task.

Along the way, missing values were handled, outliers corrected where needed, three new features engineered, categorical variables converted to factors, and the class imbalance fixed. The data is now clean, consistent, and ready for modelling.

4 Exploratory Data Analysis

Exploratory Data Analysis (EDA) is conducted on the cleaned dataset (df) after preprocessing. The goal is to understand the distribution of key variables, examine relationships between predictors and the two target variables (PhysicalHealthDays for regression and HadHeartAttack for classification), and uncover patterns that inform the modelling stage.

4.1 Distribution of Target Variables

4.1.1 PhysicalHealthDays (Regression Target)

The distribution of PhysicalHealthDays is heavily right-skewed. The majority of respondents report zero physically unhealthy days, while a smaller proportion reports values closer to 30 days. This skewness confirms that median-based imputation was appropriate during preprocessing and should be considered when choosing regression models.

p1 <- ggplot(df, aes(x = PhysicalHealthDays)) +
  geom_histogram(binwidth = 1, fill = "steelblue", colour = "white", alpha = 0.85) +
  labs(title = "Distribution of Physical Health Days",
       subtitle = "Heavily right-skewed; most respondents report 0 unhealthy days",
       x = "Number of Physically Unhealthy Days (Past 30)",
       y = "Count") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

p2 <- ggplot(df, aes(y = PhysicalHealthDays)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7, outlier.colour = "firebrick",
               outlier.alpha = 0.3, outlier.size = 0.8) +
  labs(title = "Boxplot of Physical Health Days",
       subtitle = "Median = 0; upper outliers are valid (max = 30)",
       y = "Physical Health Days") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"),
        axis.text.x = element_blank())

grid.arrange(p1, p2, ncol = 2)

4.1.2 HadHeartAttack (Classification Target)

The bar chart below confirms the severe class imbalance: approximately 94% of respondents did not have a heart attack, while only ~6% did. This imbalance was handled during preprocessing using SMOTE-NC on the training set.

df %>%
  count(HadHeartAttack) %>%
  mutate(pct = n / sum(n),
         label = paste0(comma(n), "\n(", percent(pct, accuracy = 0.1), ")")) %>%
  ggplot(aes(x = HadHeartAttack, y = n, fill = HadHeartAttack)) +
  geom_col(width = 0.55, show.legend = FALSE) +
  geom_text(aes(label = label), vjust = -0.3, size = 3.5, fontface = "bold") +
  scale_fill_manual(values = c("No" = "forestgreen", "Yes" = "firebrick")) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Class Distribution: Had Heart Attack",
       subtitle = "Severe imbalance — ~94% No vs ~6% Yes",
       x = "Had Heart Attack", y = "Count") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.2 Demographic Overview

4.2.1 Distribution by Sex

The sample is approximately balanced between female and male respondents, with females slightly outnumbering males.

df %>%
  count(Sex) %>%
  mutate(pct = percent(n / sum(n), accuracy = 0.1)) %>%
  ggplot(aes(x = Sex, y = n, fill = Sex)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  geom_text(aes(label = paste0(comma(n), "\n(", pct, ")")),
            vjust = -0.3, size = 3.5, fontface = "bold") +
  scale_fill_manual(values = c("Female" = "deeppink", "Male" = "navy")) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Respondents by Sex",
       x = "Sex", y = "Count") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.2.2 Distribution by Age Category

The sample skews older, with respondents aged 60 and above making up the largest proportion. This is expected for a health survey focused on chronic conditions.

df %>%
  count(AgeCategory) %>%
  ggplot(aes(x = AgeCategory, y = n)) +
  geom_col(fill = "slateblue", alpha = 0.85) +
  geom_text(aes(label = comma(n)), vjust = -0.3, size = 3, fontface = "bold") +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
  labs(title = "Distribution by Age Category",
       subtitle = "Older age groups are more represented in the survey",
       x = "Age Category", y = "Count") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1),
        plot.title = element_text(face = "bold"))

4.2.3 Distribution by Race/Ethnicity

White Non-Hispanic respondents represent the largest group by a wide margin, reflecting the broader demographic composition of BRFSS survey responses.

df %>%
  count(RaceEthnicityCategory, sort = TRUE) %>%
  ggplot(aes(x = reorder(RaceEthnicityCategory, n), y = n)) +
  geom_col(fill = "cadetblue", alpha = 0.85) +
  geom_text(aes(label = comma(n)), hjust = -0.1, size = 3) +
  coord_flip() +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Distribution by Race/Ethnicity",
       x = NULL, y = "Count") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

4.3 Distribution of Numerical Health Variables

The six numeric variables are examined together using density plots and a correlation heatmap.

num_vars <- c("PhysicalHealthDays", "MentalHealthDays",
              "SleepHours", "BMI", "HeightInMeters", "WeightInKilograms")

df %>%
  select(all_of(num_vars)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value, fill = Variable)) +
  geom_density(alpha = 0.75, colour = "white") +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Density Plots of Numerical Variables",
       subtitle = "PhysicalHealthDays and MentalHealthDays are right-skewed; BMI and Weight are approximately normal",
       x = NULL, y = "Density") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"))

4.3.1 Correlation Among Numerical Variables

cor_mat  <- df %>% select(all_of(num_vars)) %>% cor(use = "complete.obs")
cor_melt <- melt(cor_mat)

ggplot(cor_melt, aes(Var1, Var2, fill = value)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = round(value, 2)), size = 3.2, fontface = "bold") +
  scale_fill_gradient2(low = "navy", mid = "white", high = "darkred",
                       midpoint = 0, limits = c(-1, 1), name = "Correlation") +
  labs(title = "Correlation Heatmap — Numerical Variables",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1),
        plot.title = element_text(face = "bold"))

Key observations:

BMI, WeightInKilograms, and HeightInMeters are moderately correlated, which is expected since BMI is derived from height and weight.
PhysicalHealthDays and MentalHealthDays show a moderate positive correlation, suggesting that physical and mental health conditions co-occur.
Sleep hours has weak correlations with all other variables.

4.4 General Health and Lifestyle Variables

4.4.1 General Health Distribution

Most respondents rate their health as “Good” or “Very Good”. Only a minority report “Poor” health, though this group is important as it is closely linked to both target variables.

health_order <- c("Poor", "Fair", "Good", "Very good", "Excellent")

df %>%
  count(GeneralHealth) %>%
  mutate(GeneralHealth = factor(GeneralHealth, levels = health_order)) %>%
  ggplot(aes(x = GeneralHealth, y = n, fill = GeneralHealth)) +
  geom_col(show.legend = FALSE, alpha = 0.85) +
  geom_text(aes(label = comma(n)), vjust = -0.3, size = 3.5, fontface = "bold") +
  scale_fill_manual(values = c("Poor"      = "red3",
                               "Fair"      = "tomato",
                               "Good"      = "gold",
                               "Very good" = "mediumseagreen",
                               "Excellent" = "dodgerblue")) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
  labs(title = "General Health Self-Rating",
       subtitle = "Most respondents rate their health as Good or Very good",
       x = "General Health", y = "Count") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.4.2 Lifestyle Behaviour Summary

binary_vars <- c("PhysicalActivities", "AlcoholDrinkers", "HighRiskLastYear")

df %>%
  select(all_of(binary_vars)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
  count(Variable, Value) %>%
  group_by(Variable) %>%
  mutate(pct = n / sum(n)) %>%
  ggplot(aes(x = Variable, y = pct, fill = Value)) +
  geom_col(position = "fill", alpha = 0.85) +
  geom_text(aes(label = percent(pct, accuracy = 1)),
            position = position_fill(vjust = 0.5), size = 3.5, colour = "white",
            fontface = "bold") +
  scale_fill_manual(values = c("Yes" = "darkgreen", "No" = "grey75")) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Lifestyle Behaviour Variables (Yes vs No)",
       x = NULL, y = "Proportion", fill = NULL) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

df %>%
  count(SmokerStatus, sort = TRUE) %>%
  ggplot(aes(x = reorder(SmokerStatus, n), y = n, fill = SmokerStatus)) +
  geom_col(show.legend = FALSE, alpha = 0.85) +
  geom_text(aes(label = comma(n)), hjust = -0.1, size = 3.5) +
  coord_flip() +
  scale_fill_brewer(palette = "Oranges", direction = -1) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Smoker Status Distribution", x = NULL, y = "Count") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.5 Chronic Condition Prevalence

4.5.1 Prevalence of Each Chronic Condition

condition_cols <- c("HadAngina", "HadStroke", "HadAsthma", "HadCOPD",
                    "HadKidneyDisease", "HadArthritis", "HadDiabetes",
                    "HadDepressiveDisorder", "HadSkinCancer", "HadHeartAttack")

df %>%
  select(all_of(condition_cols)) %>%
  pivot_longer(everything(), names_to = "Condition", values_to = "Value") %>%
  filter(Value == "Yes") %>%
  count(Condition, sort = TRUE) %>%
  mutate(pct = n / nrow(df)) %>%
  ggplot(aes(x = reorder(Condition, pct), y = pct)) +
  geom_col(fill = "red3", alpha = 0.82) +
  geom_text(aes(label = percent(pct, accuracy = 0.1)), hjust = -0.1, size = 3.5) +
  coord_flip() +
  scale_y_continuous(labels = percent_format(),
                     expand = expansion(mult = c(0, 0.18))) +
  labs(title = "Prevalence of Chronic Conditions",
       subtitle = "Arthritis and Depression are most prevalent; Heart Attack affects ~6%",
       x = NULL, y = "Prevalence (% of respondents)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.5.2 Chronic Condition Count Distribution (Engineered Feature)

df %>%
  count(ChronicConditionCount) %>%
  ggplot(aes(x = factor(ChronicConditionCount), y = n)) +
  geom_col(fill = "purple", alpha = 0.82) +
  geom_text(aes(label = comma(n)), vjust = -0.3, size = 3.5, fontface = "bold") +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
  labs(title = "Distribution of Chronic Condition Count",
       subtitle = "Most respondents have 0–2 chronic conditions",
       x = "Number of Chronic Conditions", y = "Count") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.6 EDA Focused on Regression Target: PhysicalHealthDays

4.6.1 PhysicalHealthDays by General Health

Respondents who rate their health as “Poor” report a markedly higher median number of physically unhealthy days. This strong monotonic pattern confirms that GeneralHealth will be an important predictor in the regression model.

df %>%
  mutate(GeneralHealth = factor(GeneralHealth, levels = health_order)) %>%
  ggplot(aes(x = GeneralHealth, y = PhysicalHealthDays, fill = GeneralHealth)) +
  geom_boxplot(alpha = 0.8, outlier.alpha = 0.1, outlier.size = 0.5,
               show.legend = FALSE) +
  scale_fill_manual(values = c("Poor"      = "red3",
                               "Fair"      = "tomato",
                               "Good"      = "gold",
                               "Very good" = "mediumseagreen",
                               "Excellent" = "dodgerblue")) +
  labs(title = "Physical Health Days by General Health Rating",
       subtitle = "Clear monotonic increase: worse health → more unhealthy days",
       x = "General Health", y = "Physical Health Days (Past 30)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.6.2 PhysicalHealthDays by Chronic Condition Count

Respondents with more chronic conditions report substantially more physically unhealthy days, showing that ChronicConditionCount captures cumulative health burden effectively.

df %>%
  mutate(ChronicConditionCount = as.numeric(as.character(ChronicConditionCount))) %>%
  group_by(ChronicConditionCount) %>%
  summarise(mean_phd = mean(PhysicalHealthDays)) %>%
  ggplot(aes(x = factor(ChronicConditionCount), y = mean_phd)) +
  geom_col(fill = "dodgerblue", alpha = 0.85) +
  geom_text(aes(label = round(mean_phd, 1)), vjust = -0.4, size = 3.5, fontface = "bold") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
  labs(title = "Mean Physical Health Days by Chronic Condition Count",
       subtitle = "Mean unhealthy days rises sharply with each additional condition",
       x = "Number of Chronic Conditions", y = "Mean Physical Health Days") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.6.3 PhysicalHealthDays by Age Category

Older respondents tend to report slightly more physically unhealthy days on average, though the medians remain low across all groups due to the skewed distribution.

df %>%
  group_by(AgeCategory) %>%
  summarise(mean_phd = mean(PhysicalHealthDays)) %>%
  ggplot(aes(x = AgeCategory, y = mean_phd, group = 1)) +
  geom_line(colour = "darkorange", linewidth = 1.2) +
  geom_point(colour = "darkorange", size = 3) +
  geom_text(aes(label = round(mean_phd, 1)), vjust = -0.7, size = 3.2) +
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.15))) +
  labs(title = "Mean Physical Health Days by Age Category",
       subtitle = "Physically unhealthy days peak in middle age (55–64) and remain elevated in older groups",
       x = "Age Category", y = "Mean Physical Health Days") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1),
        plot.title = element_text(face = "bold"))

4.6.4 PhysicalHealthDays by BMI Category

df %>%
  filter(!is.na(BMI_category)) %>%
  ggplot(aes(x = BMI_category, y = PhysicalHealthDays, fill = BMI_category)) +
  geom_boxplot(alpha = 0.8, outlier.alpha = 0.1, outlier.size = 0.5,
               show.legend = FALSE) +
  scale_fill_manual(values = c("Underweight" = "lightyellow",
                               "Normal"      = "lightgreen",
                               "Overweight"  = "sandybrown",
                               "Obese"       = "tomato")) +
  labs(title = "Physical Health Days by BMI Category",
       subtitle = "Obese respondents report higher physical unhealthy days",
       x = "BMI Category", y = "Physical Health Days (Past 30)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.7 EDA Focused on Classification Target: HadHeartAttack

4.7.1 Heart Attack Rate by Age Category

The rate of heart attack increases steeply with age. Respondents aged 75 and above have the highest prevalence, which is consistent with cardiovascular disease epidemiology.

df %>%
  group_by(AgeCategory) %>%
  summarise(ha_rate = mean(HadHeartAttack == "Yes")) %>%
  ggplot(aes(x = AgeCategory, y = ha_rate, group = 1)) +
  geom_line(colour = "darkred", linewidth = 1.3) +
  geom_point(colour = "darkred", size = 3.5) +
  geom_text(aes(label = percent(ha_rate, accuracy = 0.1)),
            vjust = -0.8, size = 3.2) +
  scale_y_continuous(labels = percent_format(),
                     expand = expansion(mult = c(0.05, 0.18))) +
  labs(title = "Heart Attack Rate by Age Category",
       subtitle = "Risk increases sharply from age 50 onwards",
       x = "Age Category", y = "Heart Attack Rate (%)") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1),
        plot.title = element_text(face = "bold"))

4.7.2 Heart Attack Rate by Sex

Male respondents have a notably higher rate of heart attack compared to female respondents, consistent with established cardiovascular disease patterns.

df %>%
  group_by(Sex) %>%
  summarise(ha_rate = mean(HadHeartAttack == "Yes")) %>%
  ggplot(aes(x = Sex, y = ha_rate, fill = Sex)) +
  geom_col(width = 0.5, show.legend = FALSE, alpha = 0.85) +
  geom_text(aes(label = percent(ha_rate, accuracy = 0.1)),
            vjust = -0.5, size = 4, fontface = "bold") +
  scale_fill_manual(values = c("Female" = "deeppink", "Male" = "navy")) +
  scale_y_continuous(labels = percent_format(),
                     expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Heart Attack Rate by Sex",
       subtitle = "Males have a higher heart attack prevalence",
       x = "Sex", y = "Heart Attack Rate (%)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.7.3 Heart Attack Rate by General Health Rating

Respondents who self-rate their health as “Poor” are the most likely to have had a heart attack. Even “Excellent” health raters show a non-zero rate, suggesting self-reported health does not fully capture cardiovascular risk.

df %>%
  mutate(GeneralHealth = factor(GeneralHealth, levels = health_order)) %>%
  group_by(GeneralHealth) %>%
  summarise(ha_rate = mean(HadHeartAttack == "Yes")) %>%
  ggplot(aes(x = GeneralHealth, y = ha_rate, fill = GeneralHealth)) +
  geom_col(show.legend = FALSE, alpha = 0.85) +
  geom_text(aes(label = percent(ha_rate, accuracy = 0.1)),
            vjust = -0.4, size = 3.8, fontface = "bold") +
  scale_fill_manual(values = c("Poor"      = "red3",
                               "Fair"      = "tomato",
                               "Good"      = "gold",
                               "Very good" = "mediumseagreen",
                               "Excellent" = "dodgerblue")) +
  scale_y_continuous(labels = percent_format(),
                     expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Heart Attack Rate by General Health Rating",
       subtitle = "Heart attack risk is highest among those rating their health as Poor",
       x = "General Health", y = "Heart Attack Rate (%)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.7.4 Heart Attack Rate by Chronic Condition Count

Having more chronic conditions is strongly associated with a higher heart attack rate. Respondents with 4 or more conditions have a heart attack rate more than 10 times that of those with none.

df %>%
  group_by(ChronicConditionCount) %>%
  summarise(ha_rate = mean(HadHeartAttack == "Yes"), .groups = "drop") %>%
  ggplot(aes(x = factor(ChronicConditionCount), y = ha_rate)) +
  geom_col(fill = "purple4", alpha = 0.85) +
  geom_text(aes(label = percent(ha_rate, accuracy = 0.1)),
            vjust = -0.4, size = 3.8, fontface = "bold") +
  scale_y_continuous(labels = percent_format(),
                     expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Heart Attack Rate by Chronic Condition Count",
       subtitle = "Sharp increase in risk as the number of conditions rises",
       x = "Number of Chronic Conditions", y = "Heart Attack Rate (%)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.7.5 Heart Attack Rate by Smoker Status

Respondents who smoke daily or formerly smoked have a higher heart attack rate compared to those who have never smoked, reflecting the well-established link between tobacco use and cardiovascular disease.

df %>%
  group_by(SmokerStatus) %>%
  summarise(ha_rate = mean(HadHeartAttack == "Yes")) %>%
  ggplot(aes(x = reorder(SmokerStatus, ha_rate), y = ha_rate)) +
  geom_col(fill = "orangered4", alpha = 0.85) +
  geom_text(aes(label = percent(ha_rate, accuracy = 0.1)),
            hjust = -0.2, size = 3.8) +
  coord_flip() +
  scale_y_continuous(labels = percent_format(),
                     expand = expansion(mult = c(0, 0.18))) +
  labs(title = "Heart Attack Rate by Smoker Status",
       subtitle = "Daily smokers and former smokers carry higher cardiovascular risk",
       x = NULL, y = "Heart Attack Rate (%)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.7.6 Heart Attack Rate by BMI Category

df %>%
  filter(!is.na(BMI_category)) %>%
  group_by(BMI_category) %>%
  summarise(ha_rate = mean(HadHeartAttack == "Yes")) %>%
  ggplot(aes(x = BMI_category, y = ha_rate, fill = BMI_category)) +
  geom_col(show.legend = FALSE, alpha = 0.85) +
  geom_text(aes(label = percent(ha_rate, accuracy = 0.1)),
            vjust = -0.4, size = 3.8, fontface = "bold") +
  scale_fill_manual(values = c("Underweight" = "lightyellow",
                               "Normal"      = "lightgreen",
                               "Overweight"  = "sandybrown",
                               "Obese"       = "tomato")) +
  scale_y_continuous(labels = percent_format(),
                     expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Heart Attack Rate by BMI Category",
       subtitle = "Overweight and obese respondents show elevated risk",
       x = "BMI Category", y = "Heart Attack Rate (%)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

4.8 EDA Summary

The exploratory analysis reveals several important insights that will guide the modelling stage:

Finding	Relevance
`PhysicalHealthDays` is heavily right-skewed with many zeros	Supports use of non-linear or tree-based regression models
`HadHeartAttack` has severe class imbalance (~94% No)	Confirmed the need for SMOTE-NC balancing (applied in preprocessing)
`GeneralHealth` shows a strong monotonic relationship with `PhysicalHealthDays`	Likely a key predictor in the regression model
`ChronicConditionCount` rises steeply with both targets	The engineered feature captures cumulative health burden effectively
Heart attack rate increases sharply after age 50	`AgeCategory` is an important classification predictor
Males and smokers show higher heart attack rates	`Sex` and `SmokerStatus` are relevant classification features
BMI, weight, and height are correlated	Multicollinearity should be considered in linear regression

5 Regression

5.1 Regression Objective

The purpose of this regression analysis is to predict the number of physically unhealthy days (PhysicalHealthDays) experienced by U.S. adults using demographic characteristics, lifestyle behaviours, and chronic health conditions. The regression models are developed to:

1. Identify the key factors associated with poor physical health.

2. Examine the relationships between lifestyle behaviours, chronic diseases, and physically unhealthy days.

3. Compare multiple regression algorithms in predicting physical health outcomes.

5.2 Splitting the data

#Set seed for consistency
set.seed(123)

#Partition 80% for training the regression model, 20% for testing
reg_index <- createDataPartition(df$PhysicalHealthDays, p = 0.8, list = FALSE)
reg_train <- df[reg_index, ]
reg_test  <- df[-reg_index, ]

# Remove the classification target from the regression data
reg_train <- reg_train %>% select(-HadHeartAttack)
reg_test  <- reg_test %>% select(-HadHeartAttack)

The dataset is splitted into 80% training set and 20% training set randomly. Training data is used to train the regression models. Testing data is used to evaluate model performance on unseen data. HadHeartAttack is removed because it is the target variable for classification.

5.3 Building the Regression Models

To predict PhysicalHealthDays, three models were built to compare a traditional baseline against non-linear machine learning techniques: Multiple Linear Regression (MLR), Decision Tree, and XGBoost.

The analysis followed a three-step workflow:

Multicollinearity Check — Evaluated Variance Inflation Factors (VIF) for the MLR model to ensure highly correlated predictors did not distort the results.
Performance Evaluation — Compared model accuracy on the 20% test set using MAE (average error), RMSE (penalty for large errors), and R-squared (variance explained) to identify the best-performing algorithm.
Feature Importance — Extracted the top predictors from each model using regression coefficients for MLR and variable importance scores for tree-based models, to determine which factors most strongly drive poor physical health.

Model 1: Multiple Linear Regression

# Remove variables that caused multicollinearity / leakage
# Apply the SAME column filtering to both training and testing data
reg_remove_vars <- c(
  "BMI_category",
  "ChronicConditionCount",
  "PoorHealth"
)

reg_train <- reg_train %>% select(-any_of(reg_remove_vars))
reg_test  <- reg_test  %>% select(-any_of(reg_remove_vars))
# -------------------------------
# Model 1: Multiple Linear Regression
# -------------------------------
lm_model <- lm(
  PhysicalHealthDays ~ .,
  data = reg_train
)

# Display regression summary
summary(lm_model)

## 
## Call:
## lm(formula = PhysicalHealthDays ~ ., data = reg_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.588  -2.464  -1.010   0.377  31.278 
## 
## Coefficients:
##                                                                         Estimate
## (Intercept)                                                            -0.474089
## SexMale                                                                -0.088749
## GeneralHealthFair                                                       6.329129
## GeneralHealthGood                                                       1.106633
## GeneralHealthPoor                                                      15.765496
## GeneralHealthVery good                                                  0.236445
## MentalHealthDays                                                        0.138069
## LastCheckupTimeWithin past 2 years (1 year but less than 2 years ago)   0.183285
## LastCheckupTimeWithin past 5 years (2 years but less than 5 years ago)  0.062447
## LastCheckupTimeWithin past year (anytime less than 12 months ago)       0.410626
## PhysicalActivitiesYes                                                  -1.026868
## SleepHours                                                             -0.088518
## RemovedTeeth6 or more, but not all                                      0.142376
## RemovedTeethAll                                                        -0.034518
## RemovedTeethNone of them                                                0.030287
## HadAnginaYes                                                            0.404943
## HadStrokeYes                                                            0.258391
## HadAsthmaYes                                                            0.421201
## HadSkinCancerYes                                                        0.268762
## HadCOPDYes                                                              0.676686
## HadDepressiveDisorderYes                                                0.139001
## HadKidneyDiseaseYes                                                     0.558421
## HadArthritisYes                                                         1.080825
## HadDiabetesYes                                                         -0.195221
## DeafOrHardOfHearingYes                                                 -0.069555
## BlindOrVisionDifficultyYes                                             -0.044944
## DifficultyConcentratingYes                                              0.101607
## DifficultyWalkingYes                                                    3.469760
## DifficultyDressingBathingYes                                            2.983413
## DifficultyErrandsYes                                                    1.692900
## SmokerStatusCurrent smoker - now smokes some days                       0.246066
## SmokerStatusFormer smoker                                               0.333373
## SmokerStatusNever smoked                                                0.374750
## ECigaretteUsageNot at all (right now)                                  -0.020681
## ECigaretteUsageUse them every day                                      -0.265751
## ECigaretteUsageUse them some days                                      -0.113992
## ChestScanYes                                                            0.285206
## RaceEthnicityCategoryHispanic                                           0.425116
## RaceEthnicityCategoryMultiracial, Non-Hispanic                          0.846140
## RaceEthnicityCategoryOther race only, Non-Hispanic                      0.458535
## RaceEthnicityCategoryWhite only, Non-Hispanic                           0.821241
## AgeCategory.L                                                          -0.185830
## AgeCategory.Q                                                          -0.926816
## AgeCategory.C                                                          -0.194844
## AgeCategory^4                                                          -0.087265
## AgeCategory^5                                                           0.027721
## AgeCategory^6                                                          -0.086562
## AgeCategory^7                                                          -0.167156
## AgeCategory^8                                                          -0.108057
## AgeCategory^9                                                          -0.066952
## AgeCategory^10                                                          0.028585
## AgeCategory^11                                                         -0.029771
## AgeCategory^12                                                          0.008210
## HeightInMeters                                                          1.021743
## WeightInKilograms                                                      -0.005042
## BMI                                                                    -0.020350
## AlcoholDrinkersYes                                                     -0.128509
## HIVTestingYes                                                           0.149213
## FluVaxLast12Yes                                                         0.093880
## PneumoVaxEverYes                                                       -0.082744
## TetanusLast10TdapYes, received Tdap                                    -0.016540
## TetanusLast10TdapYes, received tetanus shot but not sure what type     -0.025812
## TetanusLast10TdapYes, received tetanus shot, but not Tdap              -0.019289
## HighRiskLastYearYes                                                    -0.109835
## CovidPosTested positive using home test without a health professional   0.875165
## CovidPosYes                                                             0.423176
##                                                                        Std. Error
## (Intercept)                                                              0.320267
## SexMale                                                                  0.031608
## GeneralHealthFair                                                        0.047021
## GeneralHealthGood                                                        0.035782
## GeneralHealthPoor                                                        0.071024
## GeneralHealthVery good                                                   0.034109
## MentalHealthDays                                                         0.001618
## LastCheckupTimeWithin past 2 years (1 year but less than 2 years ago)    0.065790
## LastCheckupTimeWithin past 5 years (2 years but less than 5 years ago)   0.072029
## LastCheckupTimeWithin past year (anytime less than 12 months ago)        0.057449
## PhysicalActivitiesYes                                                    0.028625
## SleepHours                                                               0.009339
## RemovedTeeth6 or more, but not all                                       0.041997
## RemovedTeethAll                                                          0.054149
## RemovedTeethNone of them                                                 0.026946
## HadAnginaYes                                                             0.050802
## HadStrokeYes                                                             0.057907
## HadAsthmaYes                                                             0.032870
## HadSkinCancerYes                                                         0.043187
## HadCOPDYes                                                               0.046079
## HadDepressiveDisorderYes                                                 0.032339
## HadKidneyDiseaseYes                                                      0.056390
## HadArthritisYes                                                          0.027581
## HadDiabetesYes                                                           0.035823
## DeafOrHardOfHearingYes                                                   0.042347
## BlindOrVisionDifficultyYes                                               0.053290
## DifficultyConcentratingYes                                               0.041262
## DifficultyWalkingYes                                                     0.040060
## DifficultyDressingBathingYes                                             0.069627
## DifficultyErrandsYes                                                     0.052784
## SmokerStatusCurrent smoker - now smokes some days                        0.075273
## SmokerStatusFormer smoker                                                0.047406
## SmokerStatusNever smoked                                                 0.046368
## ECigaretteUsageNot at all (right now)                                    0.032069
## ECigaretteUsageUse them every day                                        0.077929
## ECigaretteUsageUse them some days                                        0.072791
## ChestScanYes                                                             0.025719
## RaceEthnicityCategoryHispanic                                            0.055008
## RaceEthnicityCategoryMultiracial, Non-Hispanic                           0.086418
## RaceEthnicityCategoryOther race only, Non-Hispanic                       0.064894
## RaceEthnicityCategoryWhite only, Non-Hispanic                            0.043529
## AgeCategory.L                                                            0.062049
## AgeCategory.Q                                                            0.046891
## AgeCategory.C                                                            0.041643
## AgeCategory^4                                                            0.042533
## AgeCategory^5                                                            0.042946
## AgeCategory^6                                                            0.042654
## AgeCategory^7                                                            0.042534
## AgeCategory^8                                                            0.041567
## AgeCategory^9                                                            0.040783
## AgeCategory^10                                                           0.039613
## AgeCategory^11                                                           0.039875
## AgeCategory^12                                                           0.040727
## HeightInMeters                                                           0.180940
## WeightInKilograms                                                        0.001351
## BMI                                                                      0.004304
## AlcoholDrinkersYes                                                       0.024112
## HIVTestingYes                                                            0.027006
## FluVaxLast12Yes                                                          0.024701
## PneumoVaxEverYes                                                         0.028815
## TetanusLast10TdapYes, received Tdap                                      0.030564
## TetanusLast10TdapYes, received tetanus shot but not sure what type       0.028628
## TetanusLast10TdapYes, received tetanus shot, but not Tdap                0.048517
## HighRiskLastYearYes                                                      0.060364
## CovidPosTested positive using home test without a health professional    0.065878
## CovidPosYes                                                              0.026883
##                                                                        t value
## (Intercept)                                                             -1.480
## SexMale                                                                 -2.808
## GeneralHealthFair                                                      134.601
## GeneralHealthGood                                                       30.927
## GeneralHealthPoor                                                      221.973
## GeneralHealthVery good                                                   6.932
## MentalHealthDays                                                        85.335
## LastCheckupTimeWithin past 2 years (1 year but less than 2 years ago)    2.786
## LastCheckupTimeWithin past 5 years (2 years but less than 5 years ago)   0.867
## LastCheckupTimeWithin past year (anytime less than 12 months ago)        7.148
## PhysicalActivitiesYes                                                  -35.874
## SleepHours                                                              -9.478
## RemovedTeeth6 or more, but not all                                       3.390
## RemovedTeethAll                                                         -0.637
## RemovedTeethNone of them                                                 1.124
## HadAnginaYes                                                             7.971
## HadStrokeYes                                                             4.462
## HadAsthmaYes                                                            12.814
## HadSkinCancerYes                                                         6.223
## HadCOPDYes                                                              14.685
## HadDepressiveDisorderYes                                                 4.298
## HadKidneyDiseaseYes                                                      9.903
## HadArthritisYes                                                         39.188
## HadDiabetesYes                                                          -5.450
## DeafOrHardOfHearingYes                                                  -1.642
## BlindOrVisionDifficultyYes                                              -0.843
## DifficultyConcentratingYes                                               2.463
## DifficultyWalkingYes                                                    86.614
## DifficultyDressingBathingYes                                            42.848
## DifficultyErrandsYes                                                    32.072
## SmokerStatusCurrent smoker - now smokes some days                        3.269
## SmokerStatusFormer smoker                                                7.032
## SmokerStatusNever smoked                                                 8.082
## ECigaretteUsageNot at all (right now)                                   -0.645
## ECigaretteUsageUse them every day                                       -3.410
## ECigaretteUsageUse them some days                                       -1.566
## ChestScanYes                                                            11.089
## RaceEthnicityCategoryHispanic                                            7.728
## RaceEthnicityCategoryMultiracial, Non-Hispanic                           9.791
## RaceEthnicityCategoryOther race only, Non-Hispanic                       7.066
## RaceEthnicityCategoryWhite only, Non-Hispanic                           18.867
## AgeCategory.L                                                           -2.995
## AgeCategory.Q                                                          -19.766
## AgeCategory.C                                                           -4.679
## AgeCategory^4                                                           -2.052
## AgeCategory^5                                                            0.645
## AgeCategory^6                                                           -2.029
## AgeCategory^7                                                           -3.930
## AgeCategory^8                                                           -2.600
## AgeCategory^9                                                           -1.642
## AgeCategory^10                                                           0.722
## AgeCategory^11                                                          -0.747
## AgeCategory^12                                                           0.202
## HeightInMeters                                                           5.647
## WeightInKilograms                                                       -3.732
## BMI                                                                     -4.728
## AlcoholDrinkersYes                                                      -5.330
## HIVTestingYes                                                            5.525
## FluVaxLast12Yes                                                          3.801
## PneumoVaxEverYes                                                        -2.872
## TetanusLast10TdapYes, received Tdap                                     -0.541
## TetanusLast10TdapYes, received tetanus shot but not sure what type      -0.902
## TetanusLast10TdapYes, received tetanus shot, but not Tdap               -0.398
## HighRiskLastYearYes                                                     -1.820
## CovidPosTested positive using home test without a health professional   13.285
## CovidPosYes                                                             15.741
##                                                                        Pr(>|t|)
## (Intercept)                                                            0.138797
## SexMale                                                                0.004988
## GeneralHealthFair                                                       < 2e-16
## GeneralHealthGood                                                       < 2e-16
## GeneralHealthPoor                                                       < 2e-16
## GeneralHealthVery good                                                 4.15e-12
## MentalHealthDays                                                        < 2e-16
## LastCheckupTimeWithin past 2 years (1 year but less than 2 years ago)  0.005338
## LastCheckupTimeWithin past 5 years (2 years but less than 5 years ago) 0.385962
## LastCheckupTimeWithin past year (anytime less than 12 months ago)      8.84e-13
## PhysicalActivitiesYes                                                   < 2e-16
## SleepHours                                                              < 2e-16
## RemovedTeeth6 or more, but not all                                     0.000699
## RemovedTeethAll                                                        0.523818
## RemovedTeethNone of them                                               0.261009
## HadAnginaYes                                                           1.58e-15
## HadStrokeYes                                                           8.12e-06
## HadAsthmaYes                                                            < 2e-16
## HadSkinCancerYes                                                       4.88e-10
## HadCOPDYes                                                              < 2e-16
## HadDepressiveDisorderYes                                               1.72e-05
## HadKidneyDiseaseYes                                                     < 2e-16
## HadArthritisYes                                                         < 2e-16
## HadDiabetesYes                                                         5.05e-08
## DeafOrHardOfHearingYes                                                 0.100489
## BlindOrVisionDifficultyYes                                             0.399012
## DifficultyConcentratingYes                                             0.013798
## DifficultyWalkingYes                                                    < 2e-16
## DifficultyDressingBathingYes                                            < 2e-16
## DifficultyErrandsYes                                                    < 2e-16
## SmokerStatusCurrent smoker - now smokes some days                      0.001079
## SmokerStatusFormer smoker                                              2.04e-12
## SmokerStatusNever smoked                                               6.39e-16
## ECigaretteUsageNot at all (right now)                                  0.518999
## ECigaretteUsageUse them every day                                      0.000649
## ECigaretteUsageUse them some days                                      0.117346
## ChestScanYes                                                            < 2e-16
## RaceEthnicityCategoryHispanic                                          1.09e-14
## RaceEthnicityCategoryMultiracial, Non-Hispanic                          < 2e-16
## RaceEthnicityCategoryOther race only, Non-Hispanic                     1.60e-12
## RaceEthnicityCategoryWhite only, Non-Hispanic                           < 2e-16
## AgeCategory.L                                                          0.002745
## AgeCategory.Q                                                           < 2e-16
## AgeCategory.C                                                          2.88e-06
## AgeCategory^4                                                          0.040197
## AgeCategory^5                                                          0.518610
## AgeCategory^6                                                          0.042418
## AgeCategory^7                                                          8.50e-05
## AgeCategory^8                                                          0.009334
## AgeCategory^9                                                          0.100658
## AgeCategory^10                                                         0.470528
## AgeCategory^11                                                         0.455309
## AgeCategory^12                                                         0.840232
## HeightInMeters                                                         1.64e-08
## WeightInKilograms                                                      0.000190
## BMI                                                                    2.27e-06
## AlcoholDrinkersYes                                                     9.84e-08
## HIVTestingYes                                                          3.29e-08
## FluVaxLast12Yes                                                        0.000144
## PneumoVaxEverYes                                                       0.004084
## TetanusLast10TdapYes, received Tdap                                    0.588389
## TetanusLast10TdapYes, received tetanus shot but not sure what type     0.367259
## TetanusLast10TdapYes, received tetanus shot, but not Tdap              0.690946
## HighRiskLastYearYes                                                    0.068832
## CovidPosTested positive using home test without a health professional   < 2e-16
## CovidPosYes                                                             < 2e-16
##                                                                           
## (Intercept)                                                               
## SexMale                                                                ** 
## GeneralHealthFair                                                      ***
## GeneralHealthGood                                                      ***
## GeneralHealthPoor                                                      ***
## GeneralHealthVery good                                                 ***
## MentalHealthDays                                                       ***
## LastCheckupTimeWithin past 2 years (1 year but less than 2 years ago)  ** 
## LastCheckupTimeWithin past 5 years (2 years but less than 5 years ago)    
## LastCheckupTimeWithin past year (anytime less than 12 months ago)      ***
## PhysicalActivitiesYes                                                  ***
## SleepHours                                                             ***
## RemovedTeeth6 or more, but not all                                     ***
## RemovedTeethAll                                                           
## RemovedTeethNone of them                                                  
## HadAnginaYes                                                           ***
## HadStrokeYes                                                           ***
## HadAsthmaYes                                                           ***
## HadSkinCancerYes                                                       ***
## HadCOPDYes                                                             ***
## HadDepressiveDisorderYes                                               ***
## HadKidneyDiseaseYes                                                    ***
## HadArthritisYes                                                        ***
## HadDiabetesYes                                                         ***
## DeafOrHardOfHearingYes                                                    
## BlindOrVisionDifficultyYes                                                
## DifficultyConcentratingYes                                             *  
## DifficultyWalkingYes                                                   ***
## DifficultyDressingBathingYes                                           ***
## DifficultyErrandsYes                                                   ***
## SmokerStatusCurrent smoker - now smokes some days                      ** 
## SmokerStatusFormer smoker                                              ***
## SmokerStatusNever smoked                                               ***
## ECigaretteUsageNot at all (right now)                                     
## ECigaretteUsageUse them every day                                      ***
## ECigaretteUsageUse them some days                                         
## ChestScanYes                                                           ***
## RaceEthnicityCategoryHispanic                                          ***
## RaceEthnicityCategoryMultiracial, Non-Hispanic                         ***
## RaceEthnicityCategoryOther race only, Non-Hispanic                     ***
## RaceEthnicityCategoryWhite only, Non-Hispanic                          ***
## AgeCategory.L                                                          ** 
## AgeCategory.Q                                                          ***
## AgeCategory.C                                                          ***
## AgeCategory^4                                                          *  
## AgeCategory^5                                                             
## AgeCategory^6                                                          *  
## AgeCategory^7                                                          ***
## AgeCategory^8                                                          ** 
## AgeCategory^9                                                             
## AgeCategory^10                                                            
## AgeCategory^11                                                            
## AgeCategory^12                                                            
## HeightInMeters                                                         ***
## WeightInKilograms                                                      ***
## BMI                                                                    ***
## AlcoholDrinkersYes                                                     ***
## HIVTestingYes                                                          ***
## FluVaxLast12Yes                                                        ***
## PneumoVaxEverYes                                                       ** 
## TetanusLast10TdapYes, received Tdap                                       
## TetanusLast10TdapYes, received tetanus shot but not sure what type        
## TetanusLast10TdapYes, received tetanus shot, but not Tdap                 
## HighRiskLastYearYes                                                    .  
## CovidPosTested positive using home test without a health professional  ***
## CovidPosYes                                                            ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.579 on 345111 degrees of freedom
## Multiple R-squared:  0.4226, Adjusted R-squared:  0.4225 
## F-statistic:  3886 on 65 and 345111 DF,  p-value: < 2.2e-16

Multiple Linear Regression examines the linear relationship between predictor variables and the target variable (“PhysicalHealthDays”). This model helps to identify significant predictors, measure the direction of relationships, estimate how strongly each variable affects physical health days.

Prior to running the Multiple Linear Regression model, “BMI_category”, “ChronicConditionCount”, and “PoorHealth” were removed from the dataset. These variables created perfect multicollinearity, causing model instability and mathematical errors. After removing them, the model was successfully fitted, and a post-estimation Variance Inflation Factor (VIF) check was conducted to confirm that no remaining multicollinearity issues exist.

# Check multicollinearity using VIF
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

vif_values <- vif(lm_model)

# Convert VIF output into dataframe properly
vif_df <- data.frame(vif_values)

# Add variable names
vif_df$Variable <- rownames(vif_df)

# Reorder columns
vif_df <- vif_df %>%
  select(Variable, everything())

# Display table
kable(
  vif_df,
  caption = "Variance Inflation Factor (VIF)"
)

Variance Inflation Factor (VIF)
	Variable	GVIF	Df	GVIF..1..2.Df..
Sex	Sex	1.984494	1	1.408721
GeneralHealth	GeneralHealth	1.749553	4	1.072422
MentalHealthDays	MentalHealthDays	1.439293	1	1.199706
LastCheckupTime	LastCheckupTime	1.150064	3	1.023576
PhysicalActivities	PhysicalActivities	1.173882	1	1.083458
SleepHours	SleepHours	1.077918	1	1.038228
RemovedTeeth	RemovedTeeth	1.388600	3	1.056241
HadAngina	HadAngina	1.132570	1	1.064223
HadStroke	HadStroke	1.079322	1	1.038904
HadAsthma	HadAsthma	1.094944	1	1.046396
HadSkinCancer	HadSkinCancer	1.092327	1	1.045145
HadCOPD	HadCOPD	1.225845	1	1.107179
HadDepressiveDisorder	HadDepressiveDisorder	1.358346	1	1.165481
HadKidneyDisease	HadKidneyDisease	1.083800	1	1.041057
HadArthritis	HadArthritis	1.355438	1	1.164233
HadDiabetes	HadDiabetes	1.195933	1	1.093587
DeafOrHardOfHearing	DeafOrHardOfHearing	1.119321	1	1.057980
BlindOrVisionDifficulty	BlindOrVisionDifficulty	1.103305	1	1.050383
DifficultyConcentrating	DifficultyConcentrating	1.330369	1	1.153416
DifficultyWalking	DifficultyWalking	1.615998	1	1.271219
DifficultyDressingBathing	DifficultyDressingBathing	1.355466	1	1.164245
DifficultyErrands	DifficultyErrands	1.450554	1	1.204389
SmokerStatus	SmokerStatus	1.422921	3	1.060547
ECigaretteUsage	ECigaretteUsage	1.296604	3	1.044242
ChestScan	ChestScan	1.230834	1	1.109430
RaceEthnicityCategory	RaceEthnicityCategory	1.260622	4	1.029374
AgeCategory	AgeCategory	2.785987	12	1.043616
HeightInMeters	HeightInMeters	2.659517	1	1.630803
WeightInKilograms	WeightInKilograms	4.132705	1	2.032905
BMI	BMI	3.248924	1	1.802477
AlcoholDrinkers	AlcoholDrinkers	1.128150	1	1.062144
HIVTesting	HIVTesting	1.201230	1	1.096006
FluVaxLast12	FluVaxLast12	1.188634	1	1.090245
PneumoVaxEver	PneumoVaxEver	1.489767	1	1.220560
TetanusLast10Tdap	TetanusLast10Tdap	1.182102	3	1.028275
HighRiskLastYear	HighRiskLastYear	1.085673	1	1.041956
CovidPos	CovidPos	1.077644	2	1.018870

Variance Inflation Factor (VIF) is used to detect multicollinearity among predictor variables. If VIF < 5 → acceptable VIF > 10 → serious multicollinearity (need to be removed)

Since VIF of all variables are < 5, then no more variables need to be removed.

Model 2: Decision Tree Regression

# -------------------------------
# Model 2: Decision Tree Regression
# -------------------------------

library(rpart)

tree_reg_model <- rpart(
  PhysicalHealthDays ~ .,
  data = reg_train,
  method = "anova"
)

Decision Tree Regression aim to capture non-linear relationships between variables. It is easy to interpret, can handles interactions automatically, and able to handle complex health relationships.

Model 3: XGBoost Regression

# -------------------------------
# Model 3: XGBoost Regression
# -------------------------------

library(xgboost)

# Convert categorical variables into dummy variables
# IMPORTANT: train and test matrices must have the same columns
xgb_formula <- PhysicalHealthDays ~ .

train_matrix <- model.matrix(xgb_formula, data = reg_train)[, -1]
test_matrix  <- model.matrix(xgb_formula, data = reg_test)[, -1]

# Align test matrix columns to the training matrix to avoid XGBoost column mismatch
missing_cols <- setdiff(colnames(train_matrix), colnames(test_matrix))
for (col in missing_cols) {
  test_matrix <- cbind(test_matrix, setNames(data.frame(0), col))
}

extra_cols <- setdiff(colnames(test_matrix), colnames(train_matrix))
if (length(extra_cols) > 0) {
  test_matrix <- test_matrix[, !colnames(test_matrix) %in% extra_cols, drop = FALSE]
}

test_matrix <- test_matrix[, colnames(train_matrix), drop = FALSE]

# Target variable
train_label <- reg_train$PhysicalHealthDays

# Train XGBoost model
set.seed(123)

xgb_reg_model <- xgboost(
  x = train_matrix,
  y = train_label,
  nrounds = 100,
  max_depth = 6,
  learning_rate = 0.1,
  objective = "reg:squarederror"
)

# Prediction
xgb_pred <- predict(
  xgb_reg_model,
  test_matrix
)

XGBoost (Extreme Gradient Boosting) is an advanced machine learning algorithm that improves prediction performance by combining multiple decision trees. It has high predictive accuracy, and it is able to handles non-linear relationships. XGBoost can captures complex interactions among health indicators too.

5.4 Predicted Physical Health Days Across Three Models

library(xgboost)
library(dplyr)
library(knitr)

# 1. Generate predictions for Multiple Linear Regression and Decision Tree
lm_preds   <- predict(lm_model, newdata = reg_test)
tree_preds <- predict(tree_reg_model, newdata = reg_test)

# 2. Generate XGBoost predictions using the aligned test matrix created in Section 5.3
xgboost_preds <- predict(xgb_reg_model, newdata = test_matrix)

# 3. Combine the actual values and all predictions into one dataframe
prediction_comparison <- data.frame(
  Record_ID         = 1:nrow(reg_test),
  Actual_Days       = reg_test$PhysicalHealthDays,
  MLR_Predicted     = round(lm_preds, 2),
  DecTree_Predicted = round(tree_preds, 2),
  XGBoost_Predicted = round(xgboost_preds, 2)
)

# 4. Display the first 10 rows of the comparison table neatly
kable(
  head(prediction_comparison, 10),
  caption = "Comparison of Actual vs. Predicted Physical Health Days Across Models"
)

Comparison of Actual vs. Predicted Physical Health Days Across Models
	Record_ID	Actual_Days	MLR_Predicted	DecTree_Predicted	XGBoost_Predicted
6	1	1	16.39	23.61	20.54
34	2	0	1.38	1.86	1.03
40	3	0	11.37	15.11	10.47
42	4	5	6.40	6.78	5.13
45	5	5	2.52	1.86	2.07
52	6	0	0.81	1.86	1.07
56	7	0	12.56	15.11	12.48
58	8	0	5.35	6.78	4.92
61	9	0	2.67	1.86	2.86
73	10	2	2.11	1.86	1.79

The table above showed the actual “PhysicalHealthDays” predicted by 3 models (Multiple Linear Regression, Decision Tree and XGBoost)

5.5 Model Prediction and Evaluation

To determine the most competent model for predicting “PhysicalHealthDays”, the predictive performance of the Multiple Linear Regression, Decision Tree, and XGBoost models was evaluated and compared on the unseen test dataset (the 20% test dataset).

Three standard statistical metrics were utilized to evaluate the performance of each models:

Mean Absolute Error (MAE)

Measures the average magnitude of the errors in the predictions, without considering their direction.
Lower MAE values indicate better accuracy.

Root Mean Squared Error (RMSE)

Squares the errors before averaging them, which penalizes larger errors more heavily.
A significantly higher RMSE compared to MAE indicates that the model is making a few exceptionally large prediction errors.
Lower RMSE values indicate a more stable and accurate model.

Coefficient of Determination (R-squared)

Represents the proportion of the variance in “PhysicalHealthDays” that can be explained by the predictor variables in the model.
Higher R-squared score indicates a better fit.

# Generate predictions
lm_pred <- predict(lm_model, newdata = reg_test)

tree_pred <- predict(
  tree_reg_model,
  newdata = reg_test
)

xgb_pred <- predict(
  xgb_reg_model,
  test_matrix
)

# Function to evaluate regression models
eval_regression <- function(actual, predicted, model_name) {

  # Root Mean Squared Error
  rmse <- sqrt(mean((actual - predicted)^2))

  # Mean Absolute Error
  mae <- mean(abs(actual - predicted))

  # R-squared
  r2 <- cor(actual, predicted)^2

  data.frame(
    Model = model_name,
    RMSE = round(rmse, 4),
    MAE = round(mae, 4),
    R_Squared = round(r2, 4)
  )
}

# Compare all models
regression_perf <- bind_rows(

  eval_regression(
    reg_test$PhysicalHealthDays,
    lm_pred,
    "Linear Regression"
  ),

  eval_regression(
    reg_test$PhysicalHealthDays,
    tree_pred,
    "Decision Tree Regression"
  ),

  eval_regression(
    reg_test$PhysicalHealthDays,
    xgb_pred,
    "XGBoost Regression"
  )
)

# Display results
kable(
  regression_perf,
  caption = "Regression Model Performance Comparison"
)

Regression Model Performance Comparison
Model	RMSE	MAE	R_Squared
Linear Regression	6.6201	4.0077	0.4185
Decision Tree Regression	6.8758	4.2679	0.3727
XGBoost Regression	6.5410	3.9067	0.4324

Best Performing Algorithm: XGBoost

Lowest MAE: XGBoost’s individual day-to-day predictions are, on average, the closest to the actual number of physically unhealthy days reported by respondents.
Lowest RMSE: XGBoost is the most stable model and the least prone to making large, wildly inaccurate prediction errors.
Highest R-squared: XGBoost successfully captured and explained the highest percentage of variance in “PhysicalHealthDays”.

5.6 Extracting Key Factors and Relationships

This section identifies and interprets the top predictors influencing “PhysicalHealthDays” across all three models. Cross-referencing the outputs from three models allows us to isolate the most consistent demographic, lifestyle, and clinical risk factors driving poor physical health.

Linear Regression Coefficients

Key factors were identified using the absolute values of the regression coefficients and their associated p-values. This allows us to quantify the direct linear impact of each variable on PhysicalHealthDays.

summary_stats <- summary(lm_model)$coefficients

summary_df <- data.frame(
  Variable = rownames(summary_stats),
  Estimate = summary_stats[, "Estimate"],
  P_Value = summary_stats[, "Pr(>|t|)"]
) %>%

  filter(Variable != "(Intercept)") %>%

  arrange(desc(abs(Estimate)))

# Display top influential factors
kable(
  head(summary_df, 10),
  caption = "Top 10 Factors Influencing PhysicalHealthDays"
)

Top 10 Factors Influencing PhysicalHealthDays
	Variable	Estimate
GeneralHealthPoor	GeneralHealthPoor	15.7654955
GeneralHealthFair	GeneralHealthFair	6.3291289
DifficultyWalkingYes	DifficultyWalkingYes	3.4697602
DifficultyDressingBathingYes	DifficultyDressingBathingYes	2.9834132
DifficultyErrandsYes	DifficultyErrandsYes	1.6928997
GeneralHealthGood	GeneralHealthGood	1.1066332
HadArthritisYes	HadArthritisYes	1.0808248
PhysicalActivitiesYes	PhysicalActivitiesYes	-1.0268675
HeightInMeters	HeightInMeters	1.0217432
AgeCategory.Q	AgeCategory.Q	-0.9268163

Decision Tree Variable Importance

Predictor importance was determined by evaluating Variable Importance scores, which measure how frequently and effectively a specific variable was chosen to split the data into distinct, homogenous health groups.

tree_imp <- data.frame(
  Variable = names(tree_reg_model$variable.importance),
  Importance = as.numeric(
    tree_reg_model$variable.importance
  )
) %>%

  arrange(desc(Importance))

kable(
  head(tree_imp, 10),
  caption = "Top Predictors from Decision Tree Regression"
)

Top Predictors from Decision Tree Regression
Variable	Importance
GeneralHealth	8701301.1220
DifficultyWalking	1423734.5811
DifficultyDressingBathing	712883.1810
DifficultyErrands	641383.3313
HadArthritis	43026.6735
BlindOrVisionDifficulty	25816.0041
HadCOPD	24089.2758
BMI	241.1825

XGBoost Feature Importance

The top predictors were extracted using Feature Importance (Gain) metrics. Gain calculates the fractional contribution of each feature to the model’s overall predictive power, highlighting the variables that provide the greatest reduction in prediction error.

xgb_imp <- xgb.importance(
  model = xgb_reg_model
)

# Display top predictors
kable(
  head(xgb_imp, 10),
  caption = "Top Predictors from XGBoost Regression"
)

Top Predictors from XGBoost Regression
Feature	Gain	Cover	Frequency
GeneralHealthPoor	0.4854848	0.0620056	0.0199646
GeneralHealthFair	0.2467141	0.0642205	0.0328450
DifficultyWalkingYes	0.0911368	0.0483557	0.0275318
MentalHealthDays	0.0601061	0.0827465	0.0834004
HadArthritisYes	0.0148580	0.0294214	0.0272098
DifficultyDressingBathingYes	0.0128497	0.0347810	0.0257607
DifficultyErrandsYes	0.0121763	0.0288004	0.0222186
GeneralHealthGood	0.0098582	0.0302832	0.0133634
PhysicalActivitiesYes	0.0077017	0.0342225	0.0307519
AgeCategory.L	0.0056853	0.0110243	0.0305909

5.7 Regression Summary

Objective 1: Identify the key factors associated with poor physical health.

Across all three models, the following key drivers consistently dominated the prediction of PhysicalHealthDays:

General Health Status (GeneralHealth) — Subjective self-reported poor or fair health is overwhelmingly the strongest predictor of physically unhealthy days across every algorithm.
Functional Mobility Limitations (DifficultyWalking) — Having physical trouble walking or climbing stairs consistently ranks as the second most critical factor.
Daily Living Impairments (DifficultyDressingBathing and DifficultyErrands) — Difficulty taking care of oneself or handling errands outside the home strongly increases the predicted number of unhealthy days.
Chronic Pain (HadArthritis) — Arthritis consistently appears near the top of the clinical indicators across all models.

Objective 2: Examine the relationships between lifestyle behaviours, chronic diseases, and physically unhealthy days.

Multiple Linear Regression:

Individuals who rate their health as “Poor” experience a statistically significant increase of +15.77 physically unhealthy days compared to the reference group, while “Fair” health adds +6.33 days.
Engaging in physical activity (PhysicalActivitiesYes) reduces unhealthy days by approximately −1.03 days, serving as a protective lifestyle factor.

Decision Tree:

GeneralHealth has an importance score of 8.7 million, far surpassing all other predictors. This indicates that self-perceived health status is the primary splitting criterion in the tree structure.

XGBoost:

GeneralHealthPoor and GeneralHealthFair together account for nearly 73% of the total predictive Gain (0.485 + 0.246).
MentalHealthDays ranks 4th with a 6% Gain contribution, higher than in the other models, suggesting that mental health interacts non-linearly with physical health outcomes.

Objective 3: Compare multiple regression algorithms in predicting physical health outcomes.

XGBoost — Best overall performance, achieving the lowest MAE, lowest RMSE, and highest R-squared. It effectively captures non-linear patterns and complex interactions among health indicators.
Multiple Linear Regression — Provided good interpretability and quantified direct variable impacts, but was limited by its linear assumptions and required removing PoorHealth, ChronicConditionCount, and BMI_category to resolve severe multicollinearity.
Decision Tree — Confirmed that self-perceived health status is the dominant primary split in the population’s health structure.

All three regression objectives have been achieved.

6 Classification

6.1 Classification objective

The classification task focuses on predicting whether a respondent has experienced a heart attack. The target variable is HadHeartAttack, which contains two classes: Yes and No. This is a binary classification problem because each respondent belongs to only one of these two outcome groups.

The objective of this section is to build classification models using the cleaned and balanced training data, then evaluate how well the models identify heart attack cases on the untouched test data. The EDA section shows that heart attack occurrence is related to variables such as age, sex, general health, chronic condition count, smoker status, and BMI category, so these predictors are expected to contribute useful information during modelling.

PhysicalHealthDays is excluded from the classification predictors because it is the target variable for the regression task.

6.2 Preparing the classification data

The classification models use train_balanced for training and test_data for evaluation. The training set was balanced during preprocessing using SMOTE-NC, while the test set remains unchanged so that the final evaluation reflects the real distribution of heart attack cases. To keep the classification task separate from the regression task, PhysicalHealthDays is removed from the classification modelling data before training the models.

class_train <- train_balanced
class_test  <- test_data

# Exclude the regression target from classification predictors.
class_train <- class_train %>% select(-PhysicalHealthDays)
class_test  <- class_test %>% select(-PhysicalHealthDays)

# Set the reference level clearly so "Yes" can be treated as the positive class.
class_train$HadHeartAttack <- relevel(as.factor(class_train$HadHeartAttack), ref = "No")
class_test$HadHeartAttack  <- factor(class_test$HadHeartAttack,
                                     levels = levels(class_train$HadHeartAttack))

# Treat age groups as regular categories for modelling.
class_train$AgeCategory <- factor(class_train$AgeCategory, ordered = FALSE)
class_test$AgeCategory  <- factor(class_test$AgeCategory,
                                  levels = levels(class_train$AgeCategory),
                                  ordered = FALSE)

table(class_train$HadHeartAttack)

## 
##     No    Yes 
## 325860 325860

table(class_test$HadHeartAttack)

## 
##    No   Yes 
## 81464  4829

The classification data consists of a balanced training set with 325860 No and 325860 Yes cases, while the test set keeps the original distribution with 81464 No and 4829 Yes cases.

6.3 Evaluation approach

The models are evaluated using a confusion matrix and several performance metrics. Accuracy shows the overall proportion of correct predictions, but it is not enough for this dataset because most respondents did not have a heart attack. Therefore, sensitivity, specificity, precision and F1-score are also used.

Sensitivity measures how many actual heart attack cases are correctly identified.
Specificity measures how many non-heart attack cases are correctly identified.
Precision measures how many predicted heart attack cases are actually correct.
F1-score balances precision and sensitivity into one measure.

extract_metrics <- function(conf_matrix, model_name) {
  data.frame(
    Model = model_name,
    Accuracy = round(as.numeric(conf_matrix$overall["Accuracy"]), 4),
    Sensitivity = round(as.numeric(conf_matrix$byClass["Sensitivity"]), 4),
    Specificity = round(as.numeric(conf_matrix$byClass["Specificity"]), 4),
    Precision = round(as.numeric(conf_matrix$byClass["Pos Pred Value"]), 4),
    F1_Score = round(as.numeric(conf_matrix$byClass["F1"]), 4)
  )
}

plot_confusion_matrix <- function(conf_matrix, title_text) {
  as.data.frame(conf_matrix$table) %>%
    ggplot(aes(x = Reference, y = Prediction, fill = Freq)) +
    geom_tile(color = "white") +
    geom_text(aes(label = Freq), size = 4) +
    scale_fill_gradient(low = "white", high = "steelblue") +
    labs(title = title_text,
         x = "Actual Class",
         y = "Predicted Class",
         fill = "Count") +
    theme_minimal()
}

6.4 Model 1: Logistic regression

Logistic regression is used as the first classification model because it is suitable for binary outcomes and provides an interpretable baseline. The model estimates the probability that each respondent belongs to the Yes class for HadHeartAttack.

log_model <- glm(HadHeartAttack ~ .,
                 data = class_train,
                 family = binomial)

log_prob <- predict(log_model, newdata = class_test, type = "response")
log_pred <- ifelse(log_prob >= 0.5, "Yes", "No")
log_pred <- factor(log_pred, levels = levels(class_test$HadHeartAttack))

log_cm <- confusionMatrix(log_pred,
                          class_test$HadHeartAttack,
                          positive = "Yes")

log_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  69208  1571
##        Yes 12256  3258
##                                           
##                Accuracy : 0.8398          
##                  95% CI : (0.8373, 0.8422)
##     No Information Rate : 0.944           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2569          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.67467         
##             Specificity : 0.84955         
##          Pos Pred Value : 0.21000         
##          Neg Pred Value : 0.97780         
##              Prevalence : 0.05596         
##          Detection Rate : 0.03776         
##    Detection Prevalence : 0.17978         
##       Balanced Accuracy : 0.76211         
##                                           
##        'Positive' Class : Yes             
##

plot_confusion_matrix(log_cm, "Confusion Matrix: Logistic Regression")

The most influential variables in the logistic regression model are identified using the absolute z-value of the model coefficients. Larger absolute z-values indicate stronger evidence that the predictor is associated with the target variable.

log_coef <- as.data.frame(summary(log_model)$coefficients)
log_coef$Variable <- rownames(log_coef)

log_importance <- log_coef %>%
  filter(Variable != "(Intercept)") %>%
  mutate(Abs_Z_Value = abs(`z value`)) %>%
  arrange(desc(Abs_Z_Value)) %>%
  select(Variable, Estimate, `z value`, `Pr(>|z|)`, Abs_Z_Value) %>%
  head(10)

kable(log_importance,
      caption = "Top 10 Logistic Regression Predictors Based on Absolute Z-value")

Top 10 Logistic Regression Predictors Based on Absolute Z-value
	Variable	Estimate	z value	Abs_Z_Value
ChronicConditionCount	ChronicConditionCount	1.2588716	134.27160	134.27160
HadAnginaYes	HadAnginaYes	1.3567008	99.01654	99.01654
HadAsthmaYes	HadAsthmaYes	-1.4202208	-96.85512	96.85512
HadKidneyDiseaseYes	HadKidneyDiseaseYes	-1.6216314	-89.21245	89.21245
ChestScanYes	ChestScanYes	0.7318143	89.04639	89.04639
HadArthritisYes	HadArthritisYes	-1.0480605	-86.66914	86.66914
HadDepressiveDisorderYes	HadDepressiveDisorderYes	-1.2131575	-84.48736	84.48736
SexMale	SexMale	0.8520101	79.97112	79.97112
GeneralHealthGood	GeneralHealthGood	1.2052778	74.86627	74.86627
RaceEthnicityCategoryWhite only, Non-Hispanic	RaceEthnicityCategoryWhite only, Non-Hispanic	1.3212142	74.75610	74.75610

6.5 Model 2: Decision tree

A decision tree is used as the second classification model. This model separates respondents into groups based on decision rules, making it easier to interpret how different factors contribute to the prediction. Like the logistic regression model, it excludes PhysicalHealthDays to avoid overlapping with the regression task.

# Import library:
library(rpart)
library(rpart.plot)

tree_model <- rpart(HadHeartAttack ~ .,
                    data = class_train,
                    method = "class",
                    control = rpart.control(cp = 0.001,
                                            minsplit = 500,
                                            maxdepth = 5))

tree_pred <- predict(tree_model, newdata = class_test, type = "class")
tree_pred <- factor(tree_pred, levels = levels(class_test$HadHeartAttack))

tree_cm <- confusionMatrix(tree_pred,
                           class_test$HadHeartAttack,
                           positive = "Yes")

tree_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  64123  1345
##        Yes 17341  3484
##                                           
##                Accuracy : 0.7835          
##                  95% CI : (0.7807, 0.7862)
##     No Information Rate : 0.944           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1988          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.72147         
##             Specificity : 0.78713         
##          Pos Pred Value : 0.16730         
##          Neg Pred Value : 0.97946         
##              Prevalence : 0.05596         
##          Detection Rate : 0.04037         
##    Detection Prevalence : 0.24133         
##       Balanced Accuracy : 0.75430         
##                                           
##        'Positive' Class : Yes             
##

plot_confusion_matrix(tree_cm, "Confusion Matrix: Decision Tree")

rpart.plot(tree_model,
           type = 2,
           extra = 104,
           fallen.leaves = TRUE,
           under = TRUE,
           tweak = 1.1,
           cex = 0.65,
           varlen = 0,
           faclen = 0,
           roundint = FALSE,
           box.palette = "RdYlGn",
           shadow.col = "gray")

The decision tree’s variable importance values show which predictors contributed most to the splits in the tree.

tree_importance <- data.frame(
  Variable = names(tree_model$variable.importance),
  Importance = as.numeric(tree_model$variable.importance),
  row.names = NULL
) %>%
  arrange(desc(Importance))

kable(head(tree_importance, 10),
      caption = "Top 10 Decision Tree Predictor Importance Values")

Top 10 Decision Tree Predictor Importance Values
Variable	Importance
HadAngina	85607.612
AgeCategory	30845.945
ChronicConditionCount	26121.296
GeneralHealth	13407.729
LastCheckupTime	8373.542
RaceEthnicityCategory	6878.666
ECigaretteUsage	2339.482
HighRiskLastYear	1723.262
HadKidneyDisease	1084.050
PoorHealth	957.155

6.6 Model comparison

The table below compares the logistic regression and decision tree models using the same test set. Since the test set was not balanced, these results provide a fairer estimate of how the models may perform on real-world data.

classification_results <- bind_rows(
  extract_metrics(log_cm, "Logistic Regression"),
  extract_metrics(tree_cm, "Decision Tree")
)

kable(classification_results,
      caption = "Classification Model Performance Comparison")

Classification Model Performance Comparison
Model	Accuracy	Sensitivity	Specificity	Precision	F1_Score
Logistic Regression	0.8398	0.6747	0.8496	0.2100	0.3203
Decision Tree	0.7835	0.7215	0.7871	0.1673	0.2716

6.7 Classification Summary

The classification results should be interpreted carefully because the original test data is imbalanced, with many more respondents who did not have a heart attack than respondents who did. Therefore, accuracy alone is not enough to judge model performance. Sensitivity, precision and F1-score are also important because they show how well the model identifies the minority class, which is the respondents who had a heart attack.

Logistic Regression achieved better overall performance than the Decision Tree. It recorded higher accuracy (0.8398), specificity (0.8496), precision (0.2100), and F1-score (0.3203). This means Logistic Regression was better at correctly classifying the overall test data and produced a better balance between precision and recall. Although its precision was still low, it performed better than the Decision Tree in this metric.

The Decision Tree achieved higher sensitivity (0.7215) compared to Logistic Regression (0.6747), meaning it was slightly better at identifying actual heart attack cases. However, it had lower accuracy, specificity, precision and F1-score. This suggests that although the Decision Tree detected more positive cases, it also produced more false positive predictions.

Overall, Logistic Regression is selected as the preferred classification model because it achieved the best balance across most evaluation metrics, especially accuracy, specificity, precision and F1-score. For this health-related classification task, identifying positive cases is important, but the model should also avoid too many incorrect positive predictions. Therefore, Logistic Regression provides a more balanced and reliable result compared to the Decision Tree.

It should be noted that some coefficients in the Logistic Regression model, such as those for HadAsthmaYes and HadKidneyDiseaseYes, showed unexpected negative signs. This is likely attributable to multicollinearity among the large number of correlated health predictors rather than a true protective effect, and should be interpreted with caution.

7 Conclusion

7.1 Summary of Findings

This project investigated two predictive modelling tasks using the BRFSS 2022 dataset, which contains self-reported health information from 445,132 U.S. adults. The two tasks were: (1) predicting the number of physically unhealthy days (PhysicalHealthDays) using regression models, and (2) predicting whether a respondent had experienced a heart attack (HadHeartAttack) using classification models.

Regression Task

Three regression models were developed and compared: Multiple Linear Regression, Decision Tree Regression, and XGBoost Regression. XGBoost was the best-performing model, as shown in the table below.

Model	MAE	RMSE	R-Squared
Multiple Linear Regression	4.0077	6.6201	0.4185
Decision Tree Regression	4.2679	6.8758	0.3727
XGBoost Regression	3.9067	6.5410	0.4324

The key predictors identified across all three models were:

General health status — the strongest predictor of physically unhealthy days
Functional mobility limitations — the second most important factor, particularly difficulty walking
Daily living impairments — such as difficulty dressing and running errands
Arthritis — the most influential chronic condition

Classification Task

Two classification models were built to predict heart attack occurrence: Logistic Regression and Decision Tree. Since only about 6% of respondents reported having a heart attack, SMOTE-NC was applied to the training data to address the class imbalance. The performance of both models on the test set is summarised below.

Model	Accuracy	Sensitivity	Specificity	Precision	F1-Score
Logistic Regression	0.8398	0.6747	0.8496	0.2100	0.3203
Decision Tree	0.7835	0.7215	0.7871	0.1673	0.2716

Logistic Regression was selected as the preferred model as it achieved a better overall balance across most evaluation metrics, particularly in terms of accuracy, specificity, precision, and F1-score. Across both models, the key predictors of heart attack risk were history of angina, age category, and chronic condition count. General health status was additionally identified as a contributing predictor in both models, while sex was found to be a contributing factor in the Logistic Regression model.

7.2 Overall Conclusion

The findings from both tasks consistently show that general health status, chronic disease burden, physical limitations, and lifestyle behaviours are important factors in predicting both physically unhealthy days and heart attack risk. These results support the use of personal health indicators from large-scale public health surveys in identifying high-risk individuals and guiding preventive healthcare strategies.

7.3 Limitations and Future Work

This project has several limitations that should be acknowledged:

Self-reported data — All variables in the BRFSS dataset are based on self-reports, which may introduce recall bias and inaccurate responses.
Unexplained variance — The regression models explained approximately 43% of the variance in PhysicalHealthDays, suggesting that unmeasured factors such as income level, healthcare access, and environmental conditions may also play a role.
Class imbalance — With only approximately 6% of respondents reporting a heart attack, predicting rare health events remains an inherently challenging task, even with SMOTE-NC applied.

Future studies could build upon this work by incorporating additional variables such as socioeconomic indicators and healthcare access, applying more advanced class balancing strategies, and exploring a wider range of machine learning algorithms to improve overall predictive performance.

Group 4_Group Project

Heart Disease

2026-06-09

Group Member

Project Title

Predicting Physically Unhealthy Days and Heart Disease Risk Classification Using Personal Health Indicators Among US Adults

1 Introduction

1.1 Background

1.2 Objective

Question 1 (Regression)

Question 2 (Classification)

2 Data Overview

Types of Data Collected

2.1 Loading the packages and the data

2.2 Fixing missing values

2.3 Structure of the dataset

2.4 Summary statistics

2.5 Detailed summary

Data Frame Summary

df

3 Data Preprocessing

3.1 Missing Data

3.1.1 Identifying how much data is missing

3.1.2 Is the missingness random (MAR)?

3.1.3 Removing rows with a missing target

3.1.4 Imputing numeric predictors with the median

3.1.5 Imputing categorical predictors with the mode

3.1.6 Confirming no missing data remains

3.2 Data Transformation

3.2.1 Checking for duplicate rows

3.2.2 Dropping the State column

3.2.3 Simplifying the HadDiabetes variable

3.2.4 Checking for outliers

3.2.5 Handling the outliers

3.3 Feature Engineering

3.4 Converting categorical columns to factors

Data Frame Summary

df

3.5 Splitting into training and test sets

3.6 Handling class imbalance

3.7 Data Preprocessing Summary

4 Exploratory Data Analysis

4.1 Distribution of Target Variables

4.1.1 PhysicalHealthDays (Regression Target)

4.1.2 HadHeartAttack (Classification Target)

4.2 Demographic Overview

4.2.1 Distribution by Sex

4.2.2 Distribution by Age Category

4.2.3 Distribution by Race/Ethnicity

4.3 Distribution of Numerical Health Variables

4.3.1 Correlation Among Numerical Variables

4.4 General Health and Lifestyle Variables

4.4.1 General Health Distribution

4.4.2 Lifestyle Behaviour Summary

4.5 Chronic Condition Prevalence

4.5.1 Prevalence of Each Chronic Condition

4.5.2 Chronic Condition Count Distribution (Engineered Feature)

4.6 EDA Focused on Regression Target: PhysicalHealthDays

4.6.1 PhysicalHealthDays by General Health

4.6.2 PhysicalHealthDays by Chronic Condition Count

4.6.3 PhysicalHealthDays by Age Category

4.6.4 PhysicalHealthDays by BMI Category

4.7 EDA Focused on Classification Target: HadHeartAttack

4.7.1 Heart Attack Rate by Age Category

4.7.2 Heart Attack Rate by Sex

4.7.3 Heart Attack Rate by General Health Rating

4.7.4 Heart Attack Rate by Chronic Condition Count

4.7.5 Heart Attack Rate by Smoker Status

4.7.6 Heart Attack Rate by BMI Category

4.8 EDA Summary

5 Regression

5.1 Regression Objective

5.2 Splitting the data

5.3 Building the Regression Models

Model 1: Multiple Linear Regression

Model 2: Decision Tree Regression

Model 3: XGBoost Regression

5.4 Predicted Physical Health Days Across Three Models

5.5 Model Prediction and Evaluation

5.6 Extracting Key Factors and Relationships