Title : Predictive Modeling for Early Detection of Heart Attack Risk

1.0 Business Understanding

The primary objective of this project is to leverage data science techniques to identify and predict heart attack risks among individuals. Heart disease is one of the leading causes of death globally, and early identification of individuals at risk can save lives by enabling timely interventions and promoting preventive measures. This project focuses on analyzing health and lifestyle data to create actionable insights and predictive models to support healthcare providers, policymakers, and individuals.

1.1 Introduction (Project Background)

Our project focus on heart attack. Heart attack under heart disease is one of the leading causes of mortality worldwide. Early detection of heart attack risk can significantly improve patient outcomes by enabling timely medical interventions. This project focuses on predictive modeling techniques to analyze health-related datasets and identify individuals at risk of a heart attack. By leveraging machine learning algorithms, we aim to provide accurate predictions and support healthcare professionals in making informed decisions.

1.2 Project Questions & Objectives

Classification
- Question: How can we classify individuals as “Low Risk” or “High Risk” of a heart attack based on basic health data?
- Objective: To develop a simple classification model that predicts whether an individual falls into “Low Risk” or “High Risk” category using easily accessible data such as age, BMI, and blood pressure.
Regression
- Question: How can we predict the probability or severity score of a heart attack occurrence for an individual?
- Objective: To build a regression model that predicts the likelihood or severity score of a heart attack, aiding personalized risk assessments.

1.3 Project Success Criteria

The success of this project will be measured by:

Model Performance Metrics:

Achieving an accuracy > 85%.

Achieving an ROC-AUC Score > 80%.

Maintaining a false-negative rate of < 20%.

High precision and recall to flag high-risk patients correctly.

2.0 Data Understanding

This section provides an overview of the dataset, including its source and structure.

This dataset was obtained from https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease.

R packages including “readr”, “dplyr” and “knitr” were downloaded to load and explore the raw data.

# Load necessary libraries
library(readr)    # For reading CSV files
library(dplyr)    # For data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(knitr)    # For table formatting (optional)
# Load the dataset
df <- read_csv("heart_2022_with_nans.csv")

## Rows: 445132 Columns: 40

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): State, Sex, GeneralHealth, LastCheckupTime, PhysicalActivities, Re...
## dbl  (6): PhysicalHealthDays, MentalHealthDays, SleepHours, HeightInMeters, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Display the first few rows
head(df)

## # A tibble: 6 × 40
##   State  Sex   GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime
##   <chr>  <chr> <chr>                      <dbl>            <dbl> <chr>          
## 1 Alaba… Fema… Very good                      0                0 Within past ye…
## 2 Alaba… Fema… Excellent                      0                0 <NA>           
## 3 Alaba… Fema… Very good                      2                3 Within past ye…
## 4 Alaba… Fema… Excellent                      0                0 Within past ye…
## 5 Alaba… Fema… Fair                           2                0 Within past ye…
## 6 Alaba… Male  Poor                           1                0 Within past ye…
## # ℹ 34 more variables: PhysicalActivities <chr>, SleepHours <dbl>,
## #   RemovedTeeth <chr>, HadHeartAttack <chr>, HadAngina <chr>, HadStroke <chr>,
## #   HadAsthma <chr>, HadSkinCancer <chr>, HadCOPD <chr>,
## #   HadDepressiveDisorder <chr>, HadKidneyDisease <chr>, HadArthritis <chr>,
## #   HadDiabetes <chr>, DeafOrHardOfHearing <chr>,
## #   BlindOrVisionDifficulty <chr>, DifficultyConcentrating <chr>,
## #   DifficultyWalking <chr>, DifficultyDressingBathing <chr>, …

2.1 Dataset Details

The dataset used in this project is the Heart Disease Dataset (2022). It contains 445,132 records and a total of 40 columns. The information is about various health-related metrics collected from patients, which can be used to predict the likelihood of heart disease.

2.2 Content Overview

2.2.1 Input Variables

The dataset includes demographic, health, and lifestyle input variables related to heart attack, such as:

Demographics: State, Sex, Race/Ethnicity, AgeCategory
Health Metrics: BMI, SleepHours, HeightInMeters, WeightInKilograms
Risk Factors: Smoking status, PhysicalHealthDays, MentalHealthDays
Medical History: HadAngina, HadStroke, HadDiabetes, etc.

2.2.2 Target Variable

This is the target variable that will be used for the heart attack prediction model.

HadHeartAttack: Indicates whether the patient has heart attack (1 = yes, 0 = no).

2.2.3 Purpose

The dataset will be used to:

1. Explore patterns and trends in health metrics.

2. Build predictive models to identify individuals at risk of heart attack.

2.3 Structure of Data

Contains a mix of numerical and categorical variables.
Example data types:
- Numerical: PhysicalHealthDays, MentalHealthDays, BMI
- Categorical: Sex, GeneralHealth, SmokerStatus

2.4 Summary of Data

Numerical Variables:
- BMI: 1st Q = 24.13, Median = 27.44, Mean = 28.53, 3rd Q = 31.75, Range = [12.02, 99.64]
- SleepHours: 1st Q = 6.00, Median = 7.00, Mean = 7.02, 3rd Q = 8.00, Range = [1.00, 24.00]
Categorical Variables:
- Sex: Categories = [Male, Female]
- SmokerStatus: Categories = [Never smoked, Former smoker, Current smoker (Every Days), Current Smokers (Some Days)]

#Dimensions of the dataset
cat("Dataset Dimensions: ", dim(df)[1], "rows and", dim(df)[2], "columns\n")

## Dataset Dimensions:  445132 rows and 40 columns

# Print dataset dimensions
cat("The dataset has", nrow(df), "rows and", ncol(df), "columns.\n")

## The dataset has 445132 rows and 40 columns.

# Dataset structure
str(df)

## spc_tbl_ [445,132 × 40] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ State                    : chr [1:445132] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ Sex                      : chr [1:445132] "Female" "Female" "Female" "Female" ...
##  $ GeneralHealth            : chr [1:445132] "Very good" "Excellent" "Very good" "Excellent" ...
##  $ PhysicalHealthDays       : num [1:445132] 0 0 2 0 2 1 0 0 0 1 ...
##  $ MentalHealthDays         : num [1:445132] 0 0 3 0 0 0 0 0 0 0 ...
##  $ LastCheckupTime          : chr [1:445132] "Within past year (anytime less than 12 months ago)" NA "Within past year (anytime less than 12 months ago)" "Within past year (anytime less than 12 months ago)" ...
##  $ PhysicalActivities       : chr [1:445132] "No" "No" "Yes" "Yes" ...
##  $ SleepHours               : num [1:445132] 8 6 5 7 9 7 7 8 6 7 ...
##  $ RemovedTeeth             : chr [1:445132] NA NA NA NA ...
##  $ HadHeartAttack           : chr [1:445132] "No" "No" "No" "No" ...
##  $ HadAngina                : chr [1:445132] "No" "No" "No" "No" ...
##  $ HadStroke                : chr [1:445132] "No" "No" "No" "No" ...
##  $ HadAsthma                : chr [1:445132] "No" "No" "No" "Yes" ...
##  $ HadSkinCancer            : chr [1:445132] "No" "Yes" "Yes" "No" ...
##  $ HadCOPD                  : chr [1:445132] "No" "No" "No" "No" ...
##  $ HadDepressiveDisorder    : chr [1:445132] "No" "No" "No" "No" ...
##  $ HadKidneyDisease         : chr [1:445132] "No" "No" "No" "No" ...
##  $ HadArthritis             : chr [1:445132] "No" "No" "No" "Yes" ...
##  $ HadDiabetes              : chr [1:445132] "Yes" "No" "No" "No" ...
##  $ DeafOrHardOfHearing      : chr [1:445132] "No" "No" "No" "No" ...
##  $ BlindOrVisionDifficulty  : chr [1:445132] "No" "No" "No" "No" ...
##  $ DifficultyConcentrating  : chr [1:445132] "No" "No" "No" "No" ...
##  $ DifficultyWalking        : chr [1:445132] "No" "No" "No" "No" ...
##  $ DifficultyDressingBathing: chr [1:445132] "No" "No" "No" "No" ...
##  $ DifficultyErrands        : chr [1:445132] "No" "No" "No" "No" ...
##  $ SmokerStatus             : chr [1:445132] "Never smoked" "Never smoked" "Never smoked" "Current smoker - now smokes some days" ...
##  $ ECigaretteUsage          : chr [1:445132] "Not at all (right now)" "Never used e-cigarettes in my entire life" "Never used e-cigarettes in my entire life" "Never used e-cigarettes in my entire life" ...
##  $ ChestScan                : chr [1:445132] "No" "No" "No" "Yes" ...
##  $ RaceEthnicityCategory    : chr [1:445132] "White only, Non-Hispanic" "White only, Non-Hispanic" "White only, Non-Hispanic" "White only, Non-Hispanic" ...
##  $ AgeCategory              : chr [1:445132] "Age 80 or older" "Age 80 or older" "Age 55 to 59" NA ...
##  $ HeightInMeters           : num [1:445132] NA 1.6 1.57 1.65 1.57 1.8 1.65 1.63 1.7 1.68 ...
##  $ WeightInKilograms        : num [1:445132] NA 68 63.5 63.5 54 ...
##  $ BMI                      : num [1:445132] NA 26.6 25.6 23.3 21.8 ...
##  $ AlcoholDrinkers          : chr [1:445132] "No" "No" "No" "No" ...
##  $ HIVTesting               : chr [1:445132] "No" "No" "No" "No" ...
##  $ FluVaxLast12             : chr [1:445132] "Yes" "No" "No" "Yes" ...
##  $ PneumoVaxEver            : chr [1:445132] "No" "No" "No" "Yes" ...
##  $ TetanusLast10Tdap        : chr [1:445132] "Yes, received tetanus shot but not sure what type" "No, did not receive any tetanus shot in the past 10 years" NA "No, did not receive any tetanus shot in the past 10 years" ...
##  $ HighRiskLastYear         : chr [1:445132] "No" "No" "No" "No" ...
##  $ CovidPos                 : chr [1:445132] "No" "No" "Yes" "No" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   State = col_character(),
##   ..   Sex = col_character(),
##   ..   GeneralHealth = col_character(),
##   ..   PhysicalHealthDays = col_double(),
##   ..   MentalHealthDays = col_double(),
##   ..   LastCheckupTime = col_character(),
##   ..   PhysicalActivities = col_character(),
##   ..   SleepHours = col_double(),
##   ..   RemovedTeeth = col_character(),
##   ..   HadHeartAttack = col_character(),
##   ..   HadAngina = col_character(),
##   ..   HadStroke = col_character(),
##   ..   HadAsthma = col_character(),
##   ..   HadSkinCancer = col_character(),
##   ..   HadCOPD = col_character(),
##   ..   HadDepressiveDisorder = col_character(),
##   ..   HadKidneyDisease = col_character(),
##   ..   HadArthritis = col_character(),
##   ..   HadDiabetes = col_character(),
##   ..   DeafOrHardOfHearing = col_character(),
##   ..   BlindOrVisionDifficulty = col_character(),
##   ..   DifficultyConcentrating = col_character(),
##   ..   DifficultyWalking = col_character(),
##   ..   DifficultyDressingBathing = col_character(),
##   ..   DifficultyErrands = col_character(),
##   ..   SmokerStatus = col_character(),
##   ..   ECigaretteUsage = col_character(),
##   ..   ChestScan = col_character(),
##   ..   RaceEthnicityCategory = col_character(),
##   ..   AgeCategory = col_character(),
##   ..   HeightInMeters = col_double(),
##   ..   WeightInKilograms = col_double(),
##   ..   BMI = col_double(),
##   ..   AlcoholDrinkers = col_character(),
##   ..   HIVTesting = col_character(),
##   ..   FluVaxLast12 = col_character(),
##   ..   PneumoVaxEver = col_character(),
##   ..   TetanusLast10Tdap = col_character(),
##   ..   HighRiskLastYear = col_character(),
##   ..   CovidPos = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

sapply(df, class)

##                     State                       Sex             GeneralHealth 
##               "character"               "character"               "character" 
##        PhysicalHealthDays          MentalHealthDays           LastCheckupTime 
##                 "numeric"                 "numeric"               "character" 
##        PhysicalActivities                SleepHours              RemovedTeeth 
##               "character"                 "numeric"               "character" 
##            HadHeartAttack                 HadAngina                 HadStroke 
##               "character"               "character"               "character" 
##                 HadAsthma             HadSkinCancer                   HadCOPD 
##               "character"               "character"               "character" 
##     HadDepressiveDisorder          HadKidneyDisease              HadArthritis 
##               "character"               "character"               "character" 
##               HadDiabetes       DeafOrHardOfHearing   BlindOrVisionDifficulty 
##               "character"               "character"               "character" 
##   DifficultyConcentrating         DifficultyWalking DifficultyDressingBathing 
##               "character"               "character"               "character" 
##         DifficultyErrands              SmokerStatus           ECigaretteUsage 
##               "character"               "character"               "character" 
##                 ChestScan     RaceEthnicityCategory               AgeCategory 
##               "character"               "character"               "character" 
##            HeightInMeters         WeightInKilograms                       BMI 
##                 "numeric"                 "numeric"                 "numeric" 
##           AlcoholDrinkers                HIVTesting              FluVaxLast12 
##               "character"               "character"               "character" 
##             PneumoVaxEver         TetanusLast10Tdap          HighRiskLastYear 
##               "character"               "character"               "character" 
##                  CovidPos 
##               "character"

# Summary statistics
summary(df)

##     State               Sex            GeneralHealth      PhysicalHealthDays
##  Length:445132      Length:445132      Length:445132      Min.   : 0.000    
##  Class :character   Class :character   Class :character   1st Qu.: 0.000    
##  Mode  :character   Mode  :character   Mode  :character   Median : 0.000    
##                                                           Mean   : 4.348    
##                                                           3rd Qu.: 3.000    
##                                                           Max.   :30.000    
##                                                           NA's   :10927     
##  MentalHealthDays LastCheckupTime    PhysicalActivities   SleepHours    
##  Min.   : 0.000   Length:445132      Length:445132      Min.   : 1.000  
##  1st Qu.: 0.000   Class :character   Class :character   1st Qu.: 6.000  
##  Median : 0.000   Mode  :character   Mode  :character   Median : 7.000  
##  Mean   : 4.383                                         Mean   : 7.023  
##  3rd Qu.: 5.000                                         3rd Qu.: 8.000  
##  Max.   :30.000                                         Max.   :24.000  
##  NA's   :9067                                           NA's   :5453    
##  RemovedTeeth       HadHeartAttack      HadAngina          HadStroke        
##  Length:445132      Length:445132      Length:445132      Length:445132     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HadAsthma         HadSkinCancer        HadCOPD          HadDepressiveDisorder
##  Length:445132      Length:445132      Length:445132      Length:445132        
##  Class :character   Class :character   Class :character   Class :character     
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character     
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  HadKidneyDisease   HadArthritis       HadDiabetes        DeafOrHardOfHearing
##  Length:445132      Length:445132      Length:445132      Length:445132      
##  Class :character   Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  BlindOrVisionDifficulty DifficultyConcentrating DifficultyWalking 
##  Length:445132           Length:445132           Length:445132     
##  Class :character        Class :character        Class :character  
##  Mode  :character        Mode  :character        Mode  :character  
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##  DifficultyDressingBathing DifficultyErrands  SmokerStatus      
##  Length:445132             Length:445132      Length:445132     
##  Class :character          Class :character   Class :character  
##  Mode  :character          Mode  :character   Mode  :character  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  ECigaretteUsage     ChestScan         RaceEthnicityCategory AgeCategory       
##  Length:445132      Length:445132      Length:445132         Length:445132     
##  Class :character   Class :character   Class :character      Class :character  
##  Mode  :character   Mode  :character   Mode  :character      Mode  :character  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  HeightInMeters  WeightInKilograms      BMI        AlcoholDrinkers   
##  Min.   :0.910   Min.   : 22.68    Min.   :12.02   Length:445132     
##  1st Qu.:1.630   1st Qu.: 68.04    1st Qu.:24.13   Class :character  
##  Median :1.700   Median : 80.74    Median :27.44   Mode  :character  
##  Mean   :1.703   Mean   : 83.07    Mean   :28.53                     
##  3rd Qu.:1.780   3rd Qu.: 95.25    3rd Qu.:31.75                     
##  Max.   :2.410   Max.   :292.57    Max.   :99.64                     
##  NA's   :28652   NA's   :42078     NA's   :48806                     
##   HIVTesting        FluVaxLast12       PneumoVaxEver      TetanusLast10Tdap 
##  Length:445132      Length:445132      Length:445132      Length:445132     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  HighRiskLastYear     CovidPos        
##  Length:445132      Length:445132     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

2.5 Percentage of Missing Values

This section identifies missing values in the dataset and calculates the percentage of null values for each feature. Missing data can affect the performance of machine learning models, so it’s crucial to address them during the data pre-processing stage.

###Check Dataset for Percentage of Null Values

Here’s the description of 9 features with more than 10% missing values:

ChestScan (12.59%): Indicates whether the patient has undergone a chest scan.
BMI (10.96%): Body Mass Index, a measure of body fat based on height and weight.
AlcoholDrinkers (10.46%): Indicates whether the patient consumes alcohol.
HIVTesting (14.86%): Indicates whether the patient has been tested for HIV.
FluVaxLast12 (10.58%): Indicates whether the patient received a flu vaccine in the last 12 months.
PneumoVaxEver (17.31%): Indicates whether the patient has ever received the pneumococcal vaccine.
TetanusLast10Tdap (18.54%): Indicates whether the patient received a tetanus vaccine in the last 10 years.
CovidPos (11.40%): Indicates whether the patient tested positive for COVID-19.
HighRiskLastYear (11.37%): Indicates whether the patient was categorized as high-risk in the previous year.

# Calculate the percentage of missing values in each column
missing_percentage <- colSums(is.na(df)) / nrow(df) * 100

# Display the percentages in a neat table
kable(data.frame(Column = names(missing_percentage), MissingPercentage = missing_percentage),
      col.names = c("Column Name", "Missing Percentage (%)"),
      caption = "Percentage of Missing Values in Each Column")

Percentage of Missing Values in Each Column
	Column Name	Missing Percentage (%)
State	State	0.0000000
Sex	Sex	0.0000000
GeneralHealth	GeneralHealth	0.2691337
PhysicalHealthDays	PhysicalHealthDays	2.4547775
MentalHealthDays	MentalHealthDays	2.0369239
LastCheckupTime	LastCheckupTime	1.8664127
PhysicalActivities	PhysicalActivities	0.2455451
SleepHours	SleepHours	1.2250299
RemovedTeeth	RemovedTeeth	2.5520520
HadHeartAttack	HadHeartAttack	0.6885598
HadAngina	HadAngina	0.9895941
HadStroke	HadStroke	0.3497839
HadAsthma	HadAsthma	0.3983088
HadSkinCancer	HadSkinCancer	0.7060827
HadCOPD	HadCOPD	0.4985038
HadDepressiveDisorder	HadDepressiveDisorder	0.6317227
HadKidneyDisease	HadKidneyDisease	0.4326806
HadArthritis	HadArthritis	0.5915099
HadDiabetes	HadDiabetes	0.2441972
DeafOrHardOfHearing	DeafOrHardOfHearing	4.6383994
BlindOrVisionDifficulty	BlindOrVisionDifficulty	4.8444057
DifficultyConcentrating	DifficultyConcentrating	5.4455757
DifficultyWalking	DifficultyWalking	5.3943549
DifficultyDressingBathing	DifficultyDressingBathing	5.3725636
DifficultyErrands	DifficultyErrands	5.7636836
SmokerStatus	SmokerStatus	7.9666256
ECigaretteUsage	ECigaretteUsage	8.0111068
ChestScan	ChestScan	12.5908719
RaceEthnicityCategory	RaceEthnicityCategory	3.1579397
AgeCategory	AgeCategory	2.0396197
HeightInMeters	HeightInMeters	6.4367424
WeightInKilograms	WeightInKilograms	9.4529263
BMI	BMI	10.9643881
AlcoholDrinkers	AlcoholDrinkers	10.4629638
HIVTesting	HIVTesting	14.8555934
FluVaxLast12	FluVaxLast12	10.5858487
PneumoVaxEver	PneumoVaxEver	17.3072257
TetanusLast10Tdap	TetanusLast10Tdap	18.5374226
HighRiskLastYear	HighRiskLastYear	11.3725816
CovidPos	CovidPos	11.4042576

barplot(missing_percentage,
        main = "Percentage of Missing Values by Column",
        col = "steelblue",
        xlab = "Columns",
        ylab = "Percentage of Missing Values",
        las = 2,  # Rotate x-axis labels
        ylim = c(0, max(missing_percentage) + 5))  # Adjust y-axis limit

2.6 Imbalance Ratio Calculation

This section analyzes the distribution of the target variable (Heart Attack) to check for class imbalance. Imbalanced datasets can bias machine learning models toward the majority class, necessitating techniques like resampling or using weighted metrics.

If IR > 1.5 = Imbalance

If IR < 1.5 = Balance

If IR > 9 = Highly Imbalance

The code calculates the Imbalance Ratio (IR) for the target variable HadHeartAttack, which measures the proportion of the majority class (No heart attack) to the minority class (Had heart attack). A high imbalance ratio indicates that the dataset is significantly skewed toward one class, which can impact the performance of machine learning models.

Output:

The Imbalance Ratio (IR) is calculated as approximately 16.61, which means there are around 16.61 instances of “No Heart Attack” for every instance of “Yes Heart Attack” in the dataset. This indicates a severe imbalance in the dataset.

Implications:

Model Bias: Without addressing this imbalance, the machine learning model may become biased toward predicting the majority class (No heart attack), resulting in poor performance for identifying the minority class (Yes heart attack).

Evaluation Metrics: Accuracy alone may not be a reliable metric. Precision, recall, and F1-score should be emphasized to evaluate model performance on the minority class.

# Count values for "Yes" and "No" in the "HadHeartAttack" column
heart_attack_count <- sum(df$HadHeartAttack == "Yes", na.rm = TRUE)
heart_attack_count_no <- sum(df$HadHeartAttack == "No", na.rm = TRUE)

# Calculate the imbalance ratio (IR)
IR <- heart_attack_count_no / heart_attack_count

# Print the result
cat("Imbalance Ratio (IR):", IR, "\n")

## Imbalance Ratio (IR): 16.60662

3.0. Data Preparation

This section details the steps taken to clean and preprocess the data, such as handling missing values and encoding categorical variables.

Data preparation is a crucial step in the data science workflow, involving the transformation of raw data into a clean and structured format suitable for analysis and modeling. This process ensures data quality and enhances the performance of machine learning models.

3.1 Handling Noise Data for BMI Column

The goal is to address noise data in the HeightInMeters, WeightInKilograms and BMI columns by applying specific strategies:

Statistical imputation for missing BMI values based on HeightInMeters and WeightInKilograms.
Remove rows with missing BMI values or missing values in either HeightInMeters or WeightInKilograms.
Removal of biologically unrealistic outliers for HeightInMeters and BMI.

Summary of Output:

HeightInMeters:

Heights below 1.3m or above 2.2m were removed.

BMI:

Missing BMI values were imputed using height and weight.
Boxplot visualization highlighted the central tendency and spread of BMI.

R packages including “ggplot2”, and “dplyr” were downloaded to clean and preprocess the data.

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Create a boxplot for HeightInMeters
ggplot(df, aes(x = HeightInMeters)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Boxplot for HeightInMeters", x = "HeightInMeters") +
  theme_minimal()

# Calculate quartiles and IQR
upper_quartile <- quantile(df$HeightInMeters, 0.75, na.rm = TRUE)
lower_quartile <- quantile(df$HeightInMeters, 0.25, na.rm = TRUE)
iqr <- upper_quartile - lower_quartile

#Statistical Summary for Height 
# Calculate whiskers
upper_whisker <- max(df$HeightInMeters[df$HeightInMeters <= (upper_quartile + 1.5 * iqr)], na.rm = TRUE)
lower_whisker <- min(df$HeightInMeters[df$HeightInMeters >= (lower_quartile - 1.5 * iqr)], na.rm = TRUE)

# Print results
cat("Lower Whisker:", lower_whisker, "\n")

## Lower Whisker: 1.42

cat("Upper Whisker:", upper_whisker, "\n")

## Upper Whisker: 2

cat("Min Height:", min(df$HeightInMeters, na.rm = TRUE), "\n")

## Min Height: 0.91

cat("Max Height:", max(df$HeightInMeters, na.rm = TRUE), "\n")

## Max Height: 2.41

#Filter Height for Dwarfism and Gigantism
# Remove rows where height is outside 1.3m to 2.2m
df <- df %>% filter(HeightInMeters > 1.3 & HeightInMeters < 2.2)

# Print new height range
cat("New Min Height:", min(df$HeightInMeters, na.rm = TRUE), "\n")

## New Min Height: 1.32

cat("New Max Height:", max(df$HeightInMeters, na.rm = TRUE), "\n")

## New Max Height: 2.18

#BMI Calculation 
# Calculate missing BMI using height and weight
df <- df %>%
  mutate(BMI = ifelse(
    is.na(BMI) & !is.na(HeightInMeters) & !is.na(WeightInKilograms),
    WeightInKilograms / (HeightInMeters ^ 2),
    BMI
  ))

# Verify changes
cat("BMI calculation completed.\n")

## BMI calculation completed.

#Boxplot BMI 
# Create a boxplot for BMI
ggplot(df, aes(x = BMI)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Boxplot for BMI", x = "BMI") +
  theme_minimal()

#Statistical Summary for BMI
# Calculate quartiles and IQR for BMI
upper_quartile <- quantile(df$BMI, 0.75, na.rm = TRUE)
lower_quartile <- quantile(df$BMI, 0.25, na.rm = TRUE)
iqr <- upper_quartile - lower_quartile

# Calculate whiskers
upper_whisker <- max(df$BMI[df$BMI <= (upper_quartile + 1.5 * iqr)], na.rm = TRUE)
lower_whisker <- min(df$BMI[df$BMI >= (lower_quartile - 1.5 * iqr)], na.rm = TRUE)

# Print results
cat("Lower Whisker:", lower_whisker, "\n")

## Lower Whisker: 12.75

cat("Upper Whisker:", upper_whisker, "\n")

## Upper Whisker: 43.14

3.2 Remove Null Values for Other Columns By Row

The aim of this process is to remove rows with any missing values (null) from the dataset, ensuring that the resulting data is completed and ready for analysis.

Summary of the Output:

After removing null values,the dataset contains 247,205 rows (individual records) and still 40 columns (features or variables).

# Check the number of missing values in each column
missing_values <- colSums(is.na(df))
print(missing_values)

##                     State                       Sex             GeneralHealth 
##                         0                         0                       982 
##        PhysicalHealthDays          MentalHealthDays           LastCheckupTime 
##                      9626                      7970                      7035 
##        PhysicalActivities                SleepHours              RemovedTeeth 
##                       855                      4597                      9954 
##            HadHeartAttack                 HadAngina                 HadStroke 
##                      2546                      3885                      1254 
##                 HadAsthma             HadSkinCancer                   HadCOPD 
##                      1465                      2755                      1839 
##     HadDepressiveDisorder          HadKidneyDisease              HadArthritis 
##                      2352                      1572                      2216 
##               HadDiabetes       DeafOrHardOfHearing   BlindOrVisionDifficulty 
##                       786                      2772                      3612 
##   DifficultyConcentrating         DifficultyWalking DifficultyDressingBathing 
##                      6087                      5887                      5752 
##         DifficultyErrands              SmokerStatus           ECigaretteUsage 
##                      7361                     16303                     16504 
##                 ChestScan     RaceEthnicityCategory               AgeCategory 
##                     36082                     11337                      5972 
##            HeightInMeters         WeightInKilograms                       BMI 
##                         0                     17885                     17885 
##           AlcoholDrinkers                HIVTesting              FluVaxLast12 
##                     26544                     45820                     27322 
##             PneumoVaxEver         TetanusLast10Tdap          HighRiskLastYear 
##                     56537                     61652                     30769 
##                  CovidPos 
##                     30896

# Drop rows with missing values
df_clean <- na.omit(df)

# Check the shape (dimensions) of the cleaned dataset
cat("Number of rows and columns after cleaning:", dim(df_clean), "\n")

## Number of rows and columns after cleaning: 247205 40

3.3 Heart Attack Ratio Analysis Across U.S. States

This section visualizes geographical distribution of the Heart Attack Ratio by State which indicates the percentage of heart attack cases relative to the total records for each state across U.S. states.

The maps use state abbreviations and color intensity to represent the magnitude of heart attack occurrences or ratios, making it easier to identify high-risk areas.Packages plotly has been used to insert plot the map.

# State abbreviation mapping
state_abbrev <- c(
  'Alabama' = 'AL', 'Alaska' = 'AK', 'Arizona' = 'AZ', 'Arkansas' = 'AR', 'California' = 'CA',
  'Colorado' = 'CO', 'Connecticut' = 'CT', 'Delaware' = 'DE', 'Florida' = 'FL', 'Georgia' = 'GA',
  'Hawaii' = 'HI', 'Idaho' = 'ID', 'Illinois' = 'IL', 'Indiana' = 'IN', 'Iowa' = 'IA',
  'Kansas' = 'KS', 'Kentucky' = 'KY', 'Louisiana' = 'LA', 'Maine' = 'ME', 'Maryland' = 'MD',
  'Massachusetts' = 'MA', 'Michigan' = 'MI', 'Minnesota' = 'MN', 'Mississippi' = 'MS',
  'Missouri' = 'MO', 'Montana' = 'MT', 'Nebraska' = 'NE', 'Nevada' = 'NV', 'New Hampshire' = 'NH',
  'New Jersey' = 'NJ', 'New Mexico' = 'NM', 'New York' = 'NY', 'North Carolina' = 'NC',
  'North Dakota' = 'ND', 'Ohio' = 'OH', 'Oklahoma' = 'OK', 'Oregon' = 'OR', 'Pennsylvania' = 'PA',
  'Rhode Island' = 'RI', 'South Carolina' = 'SC', 'South Dakota' = 'SD', 'Tennessee' = 'TN',
  'Texas' = 'TX', 'Utah' = 'UT', 'Vermont' = 'VT', 'Virginia' = 'VA', 'Washington' = 'WA',
  'West Virginia' = 'WV', 'Wisconsin' = 'WI', 'Wyoming' = 'WY'
)

# Add state abbreviations to the dataframe
df_clean$State <- state_abbrev[df_clean$State]

library(dplyr)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

# Heart Attack Ratios
# Step 1: Group by State and Calculate the Ratio of Heart Attacks
state_heartattack_ratio <- df_clean %>%
  group_by(State) %>%
  summarise(HeartAttackRatio = sum(HadHeartAttack == "Yes") / n() * 100)  # Calculate ratio of "Yes"

# Step 2: Create a US Map Plot
fig2 <- plot_ly(
  data = state_heartattack_ratio,
  type = "choropleth",
  locations = ~State,                 # State abbreviations
  locationmode = "USA-states",        # Mode for state abbreviations
  z = ~HeartAttackRatio,              # Data to color by
  colors = "Blues"                    # Color scheme
) %>%
  layout(
    title = "Heart Attack Ratio by State",
    geo = list(scope = "usa")         # Focus on the US map
  )

# Step 3: Display the Maps
fig2

df_clean <- df_clean %>%  select(-State)

3.4 Binning and Encoding for Further Analysis

This section involves comprehensive preprocessing techniques applied to transform and encode categorical and numerical variables into structured formats suitable for analysis and modelling.

Key techniques include:

Binning: Continuous numerical features are grouped into discrete intervals (bins), enabling better interpretability and analysis.
Categorical Encoding: Categorical data is encoded into integers or mapped to numerical representations for seamless integration into machine learning models.

Techniques used for Categorical Encoding:

Mapping: Assigning meaningful integers to categories
Label Encoding: where converting categorical values to integers for binary variables and other categorical features.

Column - SleepHours

# Define the bin edges and labels for SleepHours
bin_edges <- c(0, 4, 8, 12, 16, 20, 24)
bin_labels <- c('0-4', '4-8', '8-12', '12-16', '16-20', '20-24')

# Create a mapping of bin labels to integer values
SleepHours_int <- c('0-4' = 2, '4-8' = 6, '8-12' = 10, '12-16' = 14, '16-20' = 18, '20-24' = 22)

# Bin the SleepHours data
df_clean$SleepHoursBinned <- cut(df_clean$SleepHours, breaks = bin_edges, labels = bin_labels, right = TRUE, include.lowest = TRUE)

# Map the SleepHoursBinned categories to the corresponding integer values
df_clean$SleepHours_INT <- SleepHours_int[as.character(df_clean$SleepHoursBinned)]

Column - PhysicalHealthDays

# Define the bin edges and labels for PhysicalHealthDays
bin_edges <- c(0, 5, 10, 15, 20, 25, 30)
bin_labels <- c('0-5', '6-10', '11-15', '16-20', '21-25', '26-30')

# Create a mapping of bin labels to integer values
PhysicalHealthDays_int <- c('0-5' = 3, '6-10' = 8, '11-15' = 13, '16-20' = 18, '21-25' = 23, '26-30' = 28)

# Bin the PhysicalHealthDays data
df_clean$PhysicalHealthDaysBinned <- cut(df_clean$PhysicalHealthDays, breaks = bin_edges, labels = bin_labels, right = TRUE, include.lowest = TRUE)

# Map the PhysicalHealthDaysBinned categories to the corresponding integer values
df_clean$PhysicalHealthDays_INT <- PhysicalHealthDays_int[as.character(df_clean$PhysicalHealthDaysBinned)]

Column - MentalHealthDays

# Define the bin edges and labels for MentalHealthDays
bin_edges <- c(0, 5, 10, 15, 20, 25, 30)
bin_labels <- c('0-5', '6-10', '11-15', '16-20', '21-25', '26-30')

# Create a mapping of bin labels to integer values
MentalHealthDays_int <- c('0-5' = 3, '6-10' = 8, '11-15' = 13, '16-20' = 18, '21-25' = 23, '26-30' = 28)

# Bin the MentalHealthDays data
df_clean$MentalHealthDaysBinned <- cut(df_clean$MentalHealthDays, breaks = bin_edges, labels = bin_labels, right = TRUE, include.lowest = TRUE)

# Map the MentalHealthDaysBinned categories to the corresponding integer values
df_clean$MentalHealthDays_INT <- MentalHealthDays_int[as.character(df_clean$MentalHealthDaysBinned)]

Column - HeightInMeters

# Define the bin edges and labels for HeightInMeters
bin_edges <- c(0.91, 1.20, 1.50, 1.80, 2.10, 2.41)
bin_labels <- c('0.91-1.20', '1.21-1.50', '1.51-1.80', '1.81-2.10', '2.11-2.41')

# Create a mapping of bin labels to integer values
HeightInMeters_int <- c('0.91-1.20' = 1, '1.21-1.50' = 2, '1.51-1.80' = 3, '1.81-2.10' = 4, '2.11-2.41' = 5)

# Bin the HeightInMeters data
df_clean$HeightInMetersBinned <- cut(df_clean$HeightInMeters, breaks = bin_edges, labels = bin_labels, right = FALSE)

# Map the HeightInMetersBinned categories to the corresponding integer values
df_clean$HeightInMeters_INT <- HeightInMeters_int[as.character(df_clean$HeightInMetersBinned)]

Column - WeightInKilograms

# Define the bin edges and labels for WeightInKilograms
bin_edges <- c(20.0, 50.0, 100.0, 150.0, 200.0, 250.0, 295.0)
bin_labels <- c('20.0-50.0', '50.1-100.0', '100.1-150.0', '150.1-200.0', '200.1-250.0', '250.1-295.0')

# Create a mapping of bin labels to integer values
WeightInKilograms_int <- c('20.0-50.0' = 1, '50.1-100.0' = 2, '100.1-150.0' = 3, '150.1-200.0' = 4, '200.1-250.0' = 5, '250.1-295.0' = 6)

# Bin the WeightInKilograms data
df_clean$WeightBinned <- cut(df_clean$WeightInKilograms, breaks = bin_edges, labels = bin_labels, right = TRUE, include.lowest = TRUE)

# Map the WeightBinned categories to the corresponding integer values
df_clean$Weight_INT <- WeightInKilograms_int[as.character(df_clean$WeightBinned)]

Column - BMI

# Define the bin edges and labels for BMI
bin_edges <- c(0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 200.0)
bin_labels <- c('0.0-10.0', '10.0-20.0', '20.0-30.0', '30.0-40.0', '40.0-50.0', '50.0-60.0', '60.0-70.0', '70.0-80.0', '80.0-90.0', '90.0-100.0', '100.0-200.0')

# Create a mapping of bin labels to integer values
BMInt_map <- c('0.0-10.0' = 0, '10.0-20.0' = 1, '20.0-30.0' = 2, '30.0-40.0' = 3, '40.0-50.0' = 4, '50.0-60.0' = 5, 
               '60.0-70.0' = 6, '70.0-80.0' = 7, '80.0-90.0' = 8, '90.0-100.0' = 9, '100.0-200.0'= 10)

# Bin the BMI data
df_clean$BMIBinned <- cut(df_clean$BMI, breaks = bin_edges, labels = bin_labels, right = TRUE, include.lowest = TRUE)

# Map the BMIBinned categories to the corresponding integer values
df_clean$BMI_INT <- BMInt_map[as.character(df_clean$BMIBinned)]

Column - AgeCategory

# Create a mapping dictionary for Age Category
AgeCategory_int <- c('Age 80 or older' = 80,
                     'Age 75 to 79' = 77,
                     'Age 70 to 74' = 72,
                     'Age 65 to 69' = 67,
                     'Age 60 to 64' = 62,
                     'Age 55 to 59' = 57,
                     'Age 50 to 54' = 52,
                     'Age 45 to 49' = 47,
                     'Age 40 to 44' = 42,
                     'Age 35 to 39' = 37,
                     'Age 30 to 34' = 32,
                     'Age 25 to 29' = 27,
                     'Age 18 to 24' = 21)

# Map the AgeCategory categories to the corresponding values
df_clean$AgeCategory_INT <- AgeCategory_int[as.character(df_clean$AgeCategory)]

Column - LastCheckupTime

# Create a mapping dictionary for LastCheckupTime
LastCheckupTime_int <- c('Within past year (anytime less than 12 months ago)' = 0,
                         'Within past 2 years (1 year but less than 2 years ago)' = 1,
                         'Within past 5 years (2 years but less than 5 years ago)' = 2,
                         '5 or more years ago' = 3)

# Map the LastCheckupTime categories to the corresponding values
df_clean$LastCheckupTime_INT <- LastCheckupTime_int[as.character(df_clean$LastCheckupTime)]

Column - RemovedTeeth

# Create a mapping dictionary for RemovedTeeth
RemovedTeeth_int <- c('None of them' = 0,
                      '1 to 5' = 1,
                      '6 or more, but not all' = 2,
                      'All' = 3)

# Map the RemovedTeeth categories to the corresponding values
df_clean$RemovedTeeth_INT <- RemovedTeeth_int[as.character(df_clean$RemovedTeeth)]

Column - RaceEthnicityCategory

# Create a mapping dictionary for RaceEthnicityCategory
RaceEthnicityCategory_int <- c('White only, Non-Hispanic' = 0,
                               'Black only, Non-Hispanic' = 1,
                               'Other race only, Non-Hispanic' = 2,
                               'Multiracial, Non-Hispanic' = 3,
                               'Hispanic' = 4)

# Map the RaceEthnicityCategory categories to the corresponding values
df_clean$RaceEthnicityCategory_INT <- RaceEthnicityCategory_int[as.character(df_clean$RaceEthnicityCategory)]

Column - GeneralHealth

# Create a mapping dictionary for GeneralHealth
GeneralHealth_int <- c('Poor' = 0,
                       'Fair' = 1,
                       'Good' = 2,
                       'Very good' = 3,
                       'Excellent' = 4)

# Map the GeneralHealth categories to the corresponding values
df_clean$GeneralHealth_INT <- GeneralHealth_int[as.character(df_clean$GeneralHealth)]

Column - SmokerStatus

# Create a mapping dictionary for SmokerStatus
SmokerStatus_int <- c('Never smoked' = 0,
                      'Former smoker' = 1,
                      'Current smoker - now smokes some days' = 2,
                      'Current smoker - now smokes every day' = 3)

# Map the SmokerStatus categories to the corresponding values
df_clean$SmokerStatus_INT <- SmokerStatus_int[as.character(df_clean$SmokerStatus)]

Column - ECigaretteUsage

# Create a mapping dictionary for ECigaretteUsage
ECigaretteUsage_int <- c('Never used e-cigarettes in my entire life' = 0,
                         'Not at all (right now)' = 1,
                         'Use them some days' = 2,
                         'Use them every day' = 3)

# Map the ECigaretteUsage categories to the corresponding values
df_clean$ECigaretteUsage_INT <- ECigaretteUsage_int[as.character(df_clean$ECigaretteUsage)]

Column - PhysicalActivities

# Convert PhysicalActivities to a factor (category) and then to numeric codes
df_clean$PhysicalActivities_INT <- as.numeric(factor(df_clean$PhysicalActivities)) - 1

Column - Sex

# Label Encoding for the Sex column
df_clean$Sex_INT <- as.integer(factor(df_clean$Sex))

Rest of the column

# Define a list of condition columns
condition_list <- c('HadHeartAttack', 'HadAngina', 'HadStroke', 'HadAsthma', 'HadSkinCancer', 
                    'HadCOPD', 'HadDepressiveDisorder', 'HadKidneyDisease', 'HadArthritis', 
                    'HadDiabetes', 'DeafOrHardOfHearing', 'BlindOrVisionDifficulty', 
                    'DifficultyConcentrating', 'DifficultyWalking', 'DifficultyDressingBathing', 
                    'DifficultyErrands', 'ChestScan', 'AlcoholDrinkers', 'HIVTesting', 
                    'FluVaxLast12', 'PneumoVaxEver', 'HighRiskLastYear', 'TetanusLast10Tdap', 'CovidPos')

# Loop through each condition and apply label encoding
for (condition in condition_list) {
  df_clean[[paste0(condition, "_INT")]] <- as.integer(factor(df_clean[[condition]]))
}

missing_values <- colSums(is.na(df_clean))
print(missing_values)

##                           Sex                 GeneralHealth 
##                             0                             0 
##            PhysicalHealthDays              MentalHealthDays 
##                             0                             0 
##               LastCheckupTime            PhysicalActivities 
##                             0                             0 
##                    SleepHours                  RemovedTeeth 
##                             0                             0 
##                HadHeartAttack                     HadAngina 
##                             0                             0 
##                     HadStroke                     HadAsthma 
##                             0                             0 
##                 HadSkinCancer                       HadCOPD 
##                             0                             0 
##         HadDepressiveDisorder              HadKidneyDisease 
##                             0                             0 
##                  HadArthritis                   HadDiabetes 
##                             0                             0 
##           DeafOrHardOfHearing       BlindOrVisionDifficulty 
##                             0                             0 
##       DifficultyConcentrating             DifficultyWalking 
##                             0                             0 
##     DifficultyDressingBathing             DifficultyErrands 
##                             0                             0 
##                  SmokerStatus               ECigaretteUsage 
##                             0                             0 
##                     ChestScan         RaceEthnicityCategory 
##                             0                             0 
##                   AgeCategory                HeightInMeters 
##                             0                             0 
##             WeightInKilograms                           BMI 
##                             0                             0 
##               AlcoholDrinkers                    HIVTesting 
##                             0                             0 
##                  FluVaxLast12                 PneumoVaxEver 
##                             0                             0 
##             TetanusLast10Tdap              HighRiskLastYear 
##                             0                             0 
##                      CovidPos              SleepHoursBinned 
##                             0                             0 
##                SleepHours_INT      PhysicalHealthDaysBinned 
##                             0                             0 
##        PhysicalHealthDays_INT        MentalHealthDaysBinned 
##                             0                             0 
##          MentalHealthDays_INT          HeightInMetersBinned 
##                             0                             0 
##            HeightInMeters_INT                  WeightBinned 
##                             0                             0 
##                    Weight_INT                     BMIBinned 
##                             0                             0 
##                       BMI_INT               AgeCategory_INT 
##                             0                             0 
##           LastCheckupTime_INT              RemovedTeeth_INT 
##                             0                             0 
##     RaceEthnicityCategory_INT             GeneralHealth_INT 
##                             0                             0 
##              SmokerStatus_INT           ECigaretteUsage_INT 
##                             0                             0 
##        PhysicalActivities_INT                       Sex_INT 
##                             0                             0 
##            HadHeartAttack_INT                 HadAngina_INT 
##                             0                             0 
##                 HadStroke_INT                 HadAsthma_INT 
##                             0                             0 
##             HadSkinCancer_INT                   HadCOPD_INT 
##                             0                             0 
##     HadDepressiveDisorder_INT          HadKidneyDisease_INT 
##                             0                             0 
##              HadArthritis_INT               HadDiabetes_INT 
##                             0                             0 
##       DeafOrHardOfHearing_INT   BlindOrVisionDifficulty_INT 
##                             0                             0 
##   DifficultyConcentrating_INT         DifficultyWalking_INT 
##                             0                             0 
## DifficultyDressingBathing_INT         DifficultyErrands_INT 
##                             0                             0 
##                 ChestScan_INT           AlcoholDrinkers_INT 
##                             0                             0 
##                HIVTesting_INT              FluVaxLast12_INT 
##                             0                             0 
##             PneumoVaxEver_INT          HighRiskLastYear_INT 
##                             0                             0 
##         TetanusLast10Tdap_INT                  CovidPos_INT 
##                             0                             0

4.0 Data Analysis

The objective of the data analysis is to process the dataset to derive meaningful insights about heart attack risks and develop predictive models that can assist in identifying individuals at risk. The analysis involves both regression and classification approaches to answer the project objectives.

The goals of the analysis are to:

Understand the Relationships:
- Exploratory Data Analysis (EDA): Explore the relationships between critical factors and the occurrence of heart attacks through .
Build Predictive Modelling:
- Classification Problem: Predict whether an individual is at “Low Risk” or “High Risk” of a heart attack.
- Regression Problem: Predict the severity score or likelihood of a heart attack based on specific features.

4.1 Exploratory Data Analysis (EDA)

This section visualizes the relationship between six critical factors—Sleep Hours, BMI (Body Mass Index), Age Categories, Physical Activities, Smoking Status, and Physical Health Days—and heart attack occurrences.

For each factor, the analysis includes:

Heart Attack Count: The number of individuals in each category who experienced a heart attack.
Heart Attack Ratio: The percentage of individuals in each category who had a heart attack.

The visualizations combine:

Bar Charts: Representing heart attack counts for each category.
Line Charts: Displaying the heart attack ratio as a percentage for each category.

Each factor highlights specific health behaviors, demographics, and physical conditions as critical contributors to heart attack risks. These insights can guide targeted interventions, promoting better health outcomes and reducing heart attack incidence in high-risk populations.

R packages including “ggplot2” and “dplyr” were loaded and plot the bar charts and line charts for visualisation.

4.1.1 Sleep Hours vs Heart Attack Risk:

Very low sleep hours (e.g., 1-4) and very high sleep hours (e.g., 17-20) show higher heart attack ratios, emphasizing the detrimental effects of insufficient or excessive sleep on heart health. Moderate sleep hours (e.g., 5-8) demonstrate lower heart attack ratios, aligning with recommended sleep durations for optimal health.

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Define the function for SleepHours vs Heart Attack Analysis
plot_heart_attack_analysis <- function(df, col) {
  # Calculate counts and ratios for each sleep hour category
  analysis <- df %>%
    group_by(!!sym(col)) %>%
    summarise(
      heart_attacks = sum(HadHeartAttack_INT, na.rm = TRUE),
      total = n()
    ) %>%
    mutate(
      ratio = round((heart_attacks / total) * 100, 1)
    )
  
  # Create the plot
  p <- ggplot(analysis, aes(x = as.factor(!!sym(col)))) +
    # Heart attack bar
    geom_bar(aes(y = heart_attacks, fill = "Heart Attacks"), 
             stat = "identity", position = "dodge", width = 0.5, color = "black") +
    # Heart attack ratio line
    geom_line(aes(y = ratio * max(heart_attacks) / 100, color = "Heart Attack Ratio (%)"), 
              linewidth = 1.2, group = 1) +
    geom_point(aes(y = ratio * max(heart_attacks) / 100, color = "Heart Attack Ratio (%)"), size = 3) +
    
    # Labels for bars and ratios
    geom_text(aes(y = heart_attacks, label = heart_attacks), 
              vjust = -0.5, size = 3) +
    geom_text(aes(y = ratio * max(heart_attacks) / 100, label = paste0(ratio, "%")), 
              vjust = -1, size = 3, color = "darkred") +
    
    # Customize y-axes
    scale_y_continuous(
      name = "Heart Attack Count",
      sec.axis = sec_axis(~ . * 100 / max(analysis$heart_attacks), name = "Heart Attack Ratio (%)")
    ) +
    
    # Customize theme and aesthetics
    labs(
      title = paste(col, "vs Heart Attack Analysis"),
      x = col,
      y = "Heart Attack Count",
      fill = "Legend",
      color = "Legend"
    ) +
    scale_fill_manual(values = c("Heart Attacks" = "lightcoral")) +
    scale_color_manual(values = c("Heart Attack Ratio (%)" = "darkred")) +
    theme_minimal() +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1),
      legend.position = "top",
      legend.title = element_blank()
    )
  
  return(p)
}

plot_heart_attack_analysis(df_clean, "SleepHours")

4.1.2 Age Categories vs Heart Attack Risk:

Heart attack risk steadily increases with age, with the highest ratios observed in the oldest age categories (e.g., 80 or older, 75-79). Younger age groups (18-24, 25-29) exhibit very low heart attack ratios, reflecting the lower risk in these populations.

plot_heart_attack_analysis(df_clean, "AgeCategory")

4.1.3 Physical Health Days vs Heart Attack Risk:

The heart attack ratio increases significantly as the number of poor physical health days rises. Individuals reporting 26-30 poor physical health days show the highest ratio (14.4%), indicating a strong correlation between physical health and heart disease.

plot_heart_attack_analysis(df_clean, "PhysicalHealthDays_INT")

4.1.4 Physical Activities vs Heart Attack Risk:

Individuals reporting no physical activities show a significantly higher heart attack ratio compared to those engaging in physical activities. This emphasizes the protective effects of regular physical activity on heart health.

plot_heart_attack_analysis(df_clean, "PhysicalActivities")

4.1.5 BMI (Body Mass Index) vs Heart Attack Risk:

Higher BMI categories (e.g., 61.0-70.0) show an increasing heart attack ratio, suggesting a positive correlation between obesity and heart health risks. Extremely high BMI categories (e.g., 81.0-90.0, 91.0-100.0) show a decrease in the ratio, likely due to insufficient sample size or other confounding factors.

# Call the function to plot BMI vs Heart Attack Analysis
plot <- plot_heart_attack_analysis(df_clean, "BMI_INT")

# Define the custom labels for BMI_INT
custom_labels <- c(
  "10.0-20.0", "20.0-30.0", "30.0-40.0", "40.0-50.0",
  "50.0-60.0", "60.0-70.0", "70.0-80.0", "80.0-90.0", "90.0-100.0"
)

# Update the plot to include custom x-axis labels
plot <- plot +
  scale_x_discrete(labels = custom_labels) +
  labs(x = "BMI Range")

# Display the updated plot
print(plot)

4.1.6 Smoking Status vs Heart Attack Risk:

Current smokers (every day) show the highest heart attack ratio, followed by former smokers. Surprisingly, current smokers (some days) have a lower ratio, possibly due to under reporting or reduced exposure. Non-smokers exhibit the lowest risk.

plot_heart_attack_analysis(df_clean, "SmokerStatus")

4.2 Feature Selection

Feature selection is a critical step in machine learning that helps in improving model performance by selecting the most relevant features, models are less likely to overfit or underfit and reducing dimensionality by eliminating irrelevant or redundant features to simplify the model and decreases computational complexity.

Methods Used:

1. Chi-Square Test (Chi2):

A statistical test used to measure the association between categorical features and the target variable.

Features with the higher Chi-Square scores (more relevant) have stronger association with the target variable.

BMI: The strongest association with heart attack occurrence.
PhysicalHealthDays: Significant impact on heart attack prevalence.
HadAngina: Strongly correlated with heart attack occurrence.
AgeCategory: Indicates the importance of age in predicting heart attack risk.

The top 27 features were selected for further analysis:

Health-Related Variables:

BMI, PhysicalHealthDays, MentalHealthDays, HadAngina, HadStroke, HadDiabetes, HadArthritis, HadCOPD,HadKidneyDisease, HadArthritis, BlindorVisionDifficulty, DeaforHardOfHearing,HadSkinCancer

Lifestyle Variables:

SmokerStatus, AlcoholDrinkers, DifficultyWalking, PhysicalActivities, DifficultyDressingBathing_INT,DifficultyErrands_,DifficultyConcentrating

Demographic Variables:

AgeCategory, Sex

Vaccination and Preventive Health:

PneumoVaxEver

Other Variables:

GeneralHealth, WeightInKilogram, LastCheckupTime, RemovedTeeth

R packages including “dplyr” and “caret” were downloaded and used for data manipulation and feature selection.

# Load necessary libraries
library(dplyr)
library(caret)

## Loading required package: lattice

# Assuming df1 is the data frame
data <- df_clean

# Define the target variable and features
X <- data %>% select(-HadHeartAttack, -HadHeartAttack_INT)  # Select all features except the target variables
y <- data$HadHeartAttack  # Target variable

# Convert all features and the target variable to factors (equivalent to LabelEncoder)
X[] <- lapply(X, factor)  # Convert each column in X to a factor
y <- factor(y)  # Convert target variable to factor

# Perform Chi-Square test for each feature
chi_square_results <- apply(X, 2, function(feature) {
  # Use tryCatch to handle potential errors and skip problematic features
  tryCatch({
    chi_test <- chisq.test(table(feature, y))
    c(Chi2_Score = chi_test$statistic, P_Value = chi_test$p.value)
  }, error = function(e) {
    # Return NA if an error occurs (e.g., if there are empty combinations)
    return(c(Chi2_Score = NA, P_Value = NA))
  })
})

## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect

# Convert the results to a data frame
chi2_df <- data.frame(
  Feature = colnames(X),
  Chi2_Score = chi_square_results[1,],
  P_Value = chi_square_results[2,]
)

# Sort by Chi-Square score (descending order)
chi2_df <- chi2_df %>% arrange(desc(Chi2_Score))

# Print the results
print(chi2_df)

##                                                     Feature  Chi2_Score
## HadAngina                                         HadAngina 49142.83694
## HadAngina_INT                                 HadAngina_INT 49142.83694
## GeneralHealth                                 GeneralHealth  9923.94909
## GeneralHealth_INT                         GeneralHealth_INT  9923.94909
## AgeCategory                                     AgeCategory  7991.87048
## AgeCategory_INT                             AgeCategory_INT  7991.87048
## HadStroke                                         HadStroke  7778.62312
## HadStroke_INT                                 HadStroke_INT  7778.62312
## RemovedTeeth                                   RemovedTeeth  7053.06255
## RemovedTeeth_INT                           RemovedTeeth_INT  7053.06255
## ChestScan                                         ChestScan  6957.54315
## ChestScan_INT                                 ChestScan_INT  6957.54315
## DifficultyWalking                         DifficultyWalking  6322.24299
## DifficultyWalking_INT                 DifficultyWalking_INT  6322.24299
## BMI                                                     BMI  5619.53679
## HadDiabetes                                     HadDiabetes  5381.49128
## HadDiabetes_INT                             HadDiabetes_INT  5381.49128
## PhysicalHealthDays                       PhysicalHealthDays  4524.09366
## HadCOPD                                             HadCOPD  4396.04632
## HadCOPD_INT                                     HadCOPD_INT  4396.04632
## PhysicalHealthDaysBinned           PhysicalHealthDaysBinned  4227.70419
## PhysicalHealthDays_INT               PhysicalHealthDays_INT  4227.70419
## PneumoVaxEver                                 PneumoVaxEver  3562.14868
## PneumoVaxEver_INT                         PneumoVaxEver_INT  3562.14868
## HadArthritis                                   HadArthritis  3449.14725
## HadArthritis_INT                           HadArthritis_INT  3449.14725
## HadKidneyDisease                           HadKidneyDisease  2964.82763
## HadKidneyDisease_INT                   HadKidneyDisease_INT  2964.82763
## DeafOrHardOfHearing                     DeafOrHardOfHearing  2374.07372
## DeafOrHardOfHearing_INT             DeafOrHardOfHearing_INT  2374.07372
## SmokerStatus                                   SmokerStatus  2255.89727
## SmokerStatus_INT                           SmokerStatus_INT  2255.89727
## DifficultyErrands                         DifficultyErrands  1972.73570
## DifficultyErrands_INT                 DifficultyErrands_INT  1972.73570
## WeightInKilograms                         WeightInKilograms  1825.37700
## DifficultyDressingBathing         DifficultyDressingBathing  1704.60215
## DifficultyDressingBathing_INT DifficultyDressingBathing_INT  1704.60215
## PhysicalActivities                       PhysicalActivities  1696.65196
## PhysicalActivities_INT               PhysicalActivities_INT  1696.65196
## SleepHours                                       SleepHours  1668.82165
## Sex                                                     Sex  1359.03360
## Sex_INT                                             Sex_INT  1359.03360
## BlindOrVisionDifficulty             BlindOrVisionDifficulty  1322.02027
## BlindOrVisionDifficulty_INT     BlindOrVisionDifficulty_INT  1322.02027
## AlcoholDrinkers                             AlcoholDrinkers  1312.82073
## AlcoholDrinkers_INT                     AlcoholDrinkers_INT  1312.82073
## LastCheckupTime                             LastCheckupTime  1238.81557
## LastCheckupTime_INT                     LastCheckupTime_INT  1238.81557
## SleepHoursBinned                           SleepHoursBinned  1110.38287
## SleepHours_INT                               SleepHours_INT  1110.38287
## DifficultyConcentrating             DifficultyConcentrating   661.23857
## DifficultyConcentrating_INT     DifficultyConcentrating_INT   661.23857
## MentalHealthDays                           MentalHealthDays   652.91197
## HadSkinCancer                                 HadSkinCancer   610.66687
## HadSkinCancer_INT                         HadSkinCancer_INT   610.66687
## FluVaxLast12                                   FluVaxLast12   503.34442
## FluVaxLast12_INT                           FluVaxLast12_INT   503.34442
## HeightInMeters                               HeightInMeters   494.25712
## TetanusLast10Tdap                         TetanusLast10Tdap   422.24065
## TetanusLast10Tdap_INT                 TetanusLast10Tdap_INT   422.24065
## MentalHealthDaysBinned               MentalHealthDaysBinned   373.45879
## MentalHealthDays_INT                   MentalHealthDays_INT   373.45879
## CovidPos                                           CovidPos   214.08532
## CovidPos_INT                                   CovidPos_INT   214.08532
## WeightBinned                                   WeightBinned   211.71933
## Weight_INT                                       Weight_INT   211.71933
## BMIBinned                                         BMIBinned   198.96608
## BMI_INT                                             BMI_INT   198.96608
## RaceEthnicityCategory                 RaceEthnicityCategory   198.02261
## RaceEthnicityCategory_INT         RaceEthnicityCategory_INT   198.02261
## HadAsthma                                         HadAsthma   140.32243
## HadAsthma_INT                                 HadAsthma_INT   140.32243
## HadDepressiveDisorder                 HadDepressiveDisorder   137.15091
## HadDepressiveDisorder_INT         HadDepressiveDisorder_INT   137.15091
## ECigaretteUsage                             ECigaretteUsage   113.27251
## ECigaretteUsage_INT                     ECigaretteUsage_INT   113.27251
## HighRiskLastYear                           HighRiskLastYear   108.46258
## HighRiskLastYear_INT                   HighRiskLastYear_INT   108.46258
## HeightInMetersBinned                   HeightInMetersBinned    81.82726
## HeightInMeters_INT                       HeightInMeters_INT    81.82726
## HIVTesting                                       HIVTesting    54.27412
## HIVTesting_INT                               HIVTesting_INT    54.27412
##                                     P_Value
## HadAngina                      0.000000e+00
## HadAngina_INT                  0.000000e+00
## GeneralHealth                  0.000000e+00
## GeneralHealth_INT              0.000000e+00
## AgeCategory                    0.000000e+00
## AgeCategory_INT                0.000000e+00
## HadStroke                      0.000000e+00
## HadStroke_INT                  0.000000e+00
## RemovedTeeth                   0.000000e+00
## RemovedTeeth_INT               0.000000e+00
## ChestScan                      0.000000e+00
## ChestScan_INT                  0.000000e+00
## DifficultyWalking              0.000000e+00
## DifficultyWalking_INT          0.000000e+00
## BMI                            1.327513e-47
## HadDiabetes                    0.000000e+00
## HadDiabetes_INT                0.000000e+00
## PhysicalHealthDays             0.000000e+00
## HadCOPD                        0.000000e+00
## HadCOPD_INT                    0.000000e+00
## PhysicalHealthDaysBinned       0.000000e+00
## PhysicalHealthDays_INT         0.000000e+00
## PneumoVaxEver                  0.000000e+00
## PneumoVaxEver_INT              0.000000e+00
## HadArthritis                   0.000000e+00
## HadArthritis_INT               0.000000e+00
## HadKidneyDisease               0.000000e+00
## HadKidneyDisease_INT           0.000000e+00
## DeafOrHardOfHearing            0.000000e+00
## DeafOrHardOfHearing_INT        0.000000e+00
## SmokerStatus                   0.000000e+00
## SmokerStatus_INT               0.000000e+00
## DifficultyErrands              0.000000e+00
## DifficultyErrands_INT          0.000000e+00
## WeightInKilograms             8.181677e-140
## DifficultyDressingBathing      0.000000e+00
## DifficultyDressingBathing_INT  0.000000e+00
## PhysicalActivities             0.000000e+00
## PhysicalActivities_INT         0.000000e+00
## SleepHours                     0.000000e+00
## Sex                           1.677291e-297
## Sex_INT                       1.677291e-297
## BlindOrVisionDifficulty       1.853268e-289
## BlindOrVisionDifficulty_INT   1.853268e-289
## AlcoholDrinkers               1.849729e-287
## AlcoholDrinkers_INT           1.849729e-287
## LastCheckupTime               2.775940e-268
## LastCheckupTime_INT           2.775940e-268
## SleepHoursBinned              7.544387e-238
## SleepHours_INT                7.544387e-238
## DifficultyConcentrating       8.034817e-146
## DifficultyConcentrating_INT   8.034817e-146
## MentalHealthDays              3.119448e-118
## HadSkinCancer                 8.011317e-135
## HadSkinCancer_INT             8.011317e-135
## FluVaxLast12                  1.779495e-111
## FluVaxLast12_INT              1.779495e-111
## HeightInMeters                 1.572055e-65
## TetanusLast10Tdap              3.367854e-91
## TetanusLast10Tdap_INT          3.367854e-91
## MentalHealthDaysBinned         1.552816e-78
## MentalHealthDays_INT           1.552816e-78
## CovidPos                       3.250595e-47
## CovidPos_INT                   3.250595e-47
## WeightBinned                   8.817200e-44
## Weight_INT                     8.817200e-44
## BMIBinned                      2.651463e-37
## BMI_INT                        2.651463e-37
## RaceEthnicityCategory          9.999654e-42
## RaceEthnicityCategory_INT      9.999654e-42
## HadAsthma                      2.263116e-32
## HadAsthma_INT                  2.263116e-32
## HadDepressiveDisorder          1.117607e-31
## HadDepressiveDisorder_INT      1.117607e-31
## ECigaretteUsage                2.167571e-24
## ECigaretteUsage_INT            2.167571e-24
## HighRiskLastYear               2.128303e-25
## HighRiskLastYear_INT           2.128303e-25
## HeightInMetersBinned           1.244627e-17
## HeightInMeters_INT             1.244627e-17
## HIVTesting                     1.743832e-13
## HIVTesting_INT                 1.743832e-13

# Load necessary libraries
library(ggplot2)
# Ensure Chi2_Score is numeric and remove NA values
chi2_df$Chi2_Score <- as.numeric(chi2_df$Chi2_Score)

# Remove rows with NA in either Chi2_Score or Feature
chi2_df_clean <- chi2_df[!is.na(chi2_df$Chi2_Score) & !is.na(chi2_df$Feature), ]

# Sort the data by Chi2_Score in descending order and select the top 10
top_10_features <- chi2_df_clean[order(-chi2_df_clean$Chi2_Score), ][1:10, ]

# Create the plot for the top 10 features
ggplot(top_10_features, aes(x = Chi2_Score, y = reorder(Feature, Chi2_Score))) +
  geom_bar(stat = "identity", fill = "skyblue") +  # Horizontal bar plot
  xlab("Chi2 Score") +
  ylab("Feature") +
  ggtitle("Top 10 Feature Importance (Chi2)") +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 10))  # Adjust text size if needed

# Select relevant columns from df_clean
X <- df_clean %>%
  select(
    BMI, WeightInKilograms, AgeCategory_INT, HadAngina_INT, HadDiabetes_INT, 
    HadStroke_INT, DifficultyWalking_INT, HadCOPD_INT, ChestScan_INT, 
    HadKidneyDisease_INT, HadArthritis_INT, DeafOrHardOfHearing_INT, 
    PneumoVaxEver_INT, DifficultyErrands_INT, DifficultyDressingBathing_INT, 
    BlindOrVisionDifficulty_INT, AlcoholDrinkers_INT, DifficultyConcentrating_INT, 
    HadSkinCancer_INT, PhysicalActivities_INT, Sex_INT, PhysicalHealthDays_INT, 
    RemovedTeeth_INT, GeneralHealth_INT, MentalHealthDays_INT, LastCheckupTime_INT, 
    SmokerStatus_INT
  )

library(dplyr)
library(caret)

df_clean$HadHeartAttack <- factor(df_clean$HadHeartAttack, levels = c("No", "Yes"))

y <- df_clean$HadHeartAttack  # Target variable

# Convert all features and the target variable to factors (equivalent to LabelEncoder)
X[] <- lapply(X, factor)  # Convert each column in X to a factor
y <- factor(y)  # Convert target variable to factor

# Perform Chi-Square test for each feature
chi_square_results <- apply(X, 2, function(feature) {
  # Use tryCatch to handle potential errors and skip problematic features
  tryCatch({
    chi_test <- chisq.test(table(feature, y))
    c(Chi2_Score = chi_test$statistic, P_Value = chi_test$p.value)
  }, error = function(e) {
    # Return NA if an error occurs (e.g., if there are empty combinations)
    return(c(Chi2_Score = NA, P_Value = NA))
  })
})

## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect

# Convert the results to a data frame
chi2_df <- data.frame(
  Feature = colnames(X),
  Chi2_Score = chi_square_results[1,],
  P_Value = chi_square_results[2,]
)

# Sort by Chi-Square score (descending order)
chi2_df <- chi2_df %>% arrange(desc(Chi2_Score))

# Print the results
print(chi2_df)

##                                                     Feature Chi2_Score
## HadAngina_INT                                 HadAngina_INT 49142.8369
## GeneralHealth_INT                         GeneralHealth_INT  9923.9491
## AgeCategory_INT                             AgeCategory_INT  7991.8705
## HadStroke_INT                                 HadStroke_INT  7778.6231
## RemovedTeeth_INT                           RemovedTeeth_INT  7053.0625
## ChestScan_INT                                 ChestScan_INT  6957.5431
## DifficultyWalking_INT                 DifficultyWalking_INT  6322.2430
## BMI                                                     BMI  5619.5368
## HadDiabetes_INT                             HadDiabetes_INT  5381.4913
## HadCOPD_INT                                     HadCOPD_INT  4396.0463
## PhysicalHealthDays_INT               PhysicalHealthDays_INT  4227.7042
## PneumoVaxEver_INT                         PneumoVaxEver_INT  3562.1487
## HadArthritis_INT                           HadArthritis_INT  3449.1472
## HadKidneyDisease_INT                   HadKidneyDisease_INT  2964.8276
## DeafOrHardOfHearing_INT             DeafOrHardOfHearing_INT  2374.0737
## SmokerStatus_INT                           SmokerStatus_INT  2255.8973
## DifficultyErrands_INT                 DifficultyErrands_INT  1972.7357
## WeightInKilograms                         WeightInKilograms  1825.3770
## DifficultyDressingBathing_INT DifficultyDressingBathing_INT  1704.6022
## PhysicalActivities_INT               PhysicalActivities_INT  1696.6520
## Sex_INT                                             Sex_INT  1359.0336
## BlindOrVisionDifficulty_INT     BlindOrVisionDifficulty_INT  1322.0203
## AlcoholDrinkers_INT                     AlcoholDrinkers_INT  1312.8207
## LastCheckupTime_INT                     LastCheckupTime_INT  1238.8156
## DifficultyConcentrating_INT     DifficultyConcentrating_INT   661.2386
## HadSkinCancer_INT                         HadSkinCancer_INT   610.6669
## MentalHealthDays_INT                   MentalHealthDays_INT   373.4588
##                                     P_Value
## HadAngina_INT                  0.000000e+00
## GeneralHealth_INT              0.000000e+00
## AgeCategory_INT                0.000000e+00
## HadStroke_INT                  0.000000e+00
## RemovedTeeth_INT               0.000000e+00
## ChestScan_INT                  0.000000e+00
## DifficultyWalking_INT          0.000000e+00
## BMI                            1.327513e-47
## HadDiabetes_INT                0.000000e+00
## HadCOPD_INT                    0.000000e+00
## PhysicalHealthDays_INT         0.000000e+00
## PneumoVaxEver_INT              0.000000e+00
## HadArthritis_INT               0.000000e+00
## HadKidneyDisease_INT           0.000000e+00
## DeafOrHardOfHearing_INT        0.000000e+00
## SmokerStatus_INT               0.000000e+00
## DifficultyErrands_INT          0.000000e+00
## WeightInKilograms             8.181677e-140
## DifficultyDressingBathing_INT  0.000000e+00
## PhysicalActivities_INT         0.000000e+00
## Sex_INT                       1.677291e-297
## BlindOrVisionDifficulty_INT   1.853268e-289
## AlcoholDrinkers_INT           1.849729e-287
## LastCheckupTime_INT           2.775940e-268
## DifficultyConcentrating_INT   8.034817e-146
## HadSkinCancer_INT             8.011317e-135
## MentalHealthDays_INT           1.552816e-78

4.3 Predictive Modeling

The modeling phase focuses on developing predictive models to address the identified research questions. This involves building and evaluating both classification and regression models, ensuring robust and interpretable results.

4.3.1 Prediction Modelling: Random Forest Classification

1. Data Splitting:

A total of 247,205 samples were split into 80% training set with 197,765 samples and 20% testing set with 49,440 samples.

2. Model Training:

A Random Forest Ranger was used with the parameters below:

num.trees = 100: 100 decision trees in the forest.
mtry = 3: 3 features considered at each split
importance = ‘impurity’:Computes feature importance based on the decrease in impurity.
probability = TRUE: Enables probability predictions for classification.

R packages including “ranger”, “pROC” and “caret” were downloaded and used for building, evaluating, and interpreting a Random Forest model.

# Load necessary libraries
library(ranger)
library(caret)
library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# Assuming X and y are already defined
# Split the data into training and test sets
set.seed(42)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[trainIndex, ]
y_train <- y[trainIndex]
X_test <- X[-trainIndex, ]
y_test <- y[-trainIndex]

# Convert X_train and y_train to data frames (if they are not already)
X_train <- as.data.frame(X_train)
y_train <- as.data.frame(y_train)

# Combine the features and target variable into a single data frame for training
train_data <- data.frame(y_train, X_train)

# Train the Random Forest model using ranger with probability enabled
rf_model_ranger <- ranger(
  formula = y_train ~ .,  # Formula notation: target variable ~ predictors
  data = train_data,  # Use the combined data frame
  num.trees = 100,  # Number of trees in the forest
  mtry = 3,  # Number of variables to consider at each split
  importance = 'impurity',  # Feature importance method
  probability = TRUE,     # Enable probability predictions
)

# Print the model
print(rf_model_ranger)

## Ranger result
## 
## Call:
##  ranger(formula = y_train ~ ., data = train_data, num.trees = 100,      mtry = 3, importance = "impurity", probability = TRUE, ) 
## 
## Type:                             Probability estimation 
## Number of trees:                  100 
## Sample size:                      197765 
## Number of independent variables:  27 
## Mtry:                             3 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.03947575

# Predict on the test set
y_pred_ranger <- predict(rf_model_ranger, data = X_test)$predictions

# Since it’s binary classification, y_pred_ranger will be a matrix where
# the second column contains the probabilities for class 1 (positive class)
y_proba <- y_pred_ranger[, 2]  # Probabilities for the positive class (class 1)

# Adjust predictions using a threshold (e.g., 0.3)
threshold <- 0.3
y_pred_adjusted <- ifelse(y_proba >= threshold, "Yes", "No")

y_pred_adjusted <- factor(y_pred_adjusted, levels = levels(factor(y_test)))

# Evaluate the performance using confusion matrix
confusion_ranger <- confusionMatrix(as.factor(y_pred_adjusted), as.factor(y_test))
print(confusion_ranger)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  45356  1496
##        Yes  1398  1190
##                                           
##                Accuracy : 0.9415          
##                  95% CI : (0.9394, 0.9435)
##     No Information Rate : 0.9457          
##     P-Value [Acc > NIR] : 0.99998         
##                                           
##                   Kappa : 0.4204          
##                                           
##  Mcnemar's Test P-Value : 0.07137         
##                                           
##             Sensitivity : 0.9701          
##             Specificity : 0.4430          
##          Pos Pred Value : 0.9681          
##          Neg Pred Value : 0.4598          
##              Prevalence : 0.9457          
##          Detection Rate : 0.9174          
##    Detection Prevalence : 0.9477          
##       Balanced Accuracy : 0.7066          
##                                           
##        'Positive' Class : No              
##

# Feature importance
feature_importances <- rf_model_ranger$variable.importance
print(feature_importances)

##                           BMI             WeightInKilograms 
##                    1056.63979                     949.71666 
##               AgeCategory_INT                 HadAngina_INT 
##                     667.25577                    2887.25066 
##               HadDiabetes_INT                 HadStroke_INT 
##                     249.93391                     336.83182 
##         DifficultyWalking_INT                   HadCOPD_INT 
##                     201.75906                     152.93841 
##                 ChestScan_INT          HadKidneyDisease_INT 
##                     251.48779                     139.78377 
##              HadArthritis_INT       DeafOrHardOfHearing_INT 
##                     152.89513                     141.70842 
##             PneumoVaxEver_INT         DifficultyErrands_INT 
##                     163.35246                     112.41947 
## DifficultyDressingBathing_INT   BlindOrVisionDifficulty_INT 
##                      86.88416                     114.87921 
##           AlcoholDrinkers_INT   DifficultyConcentrating_INT 
##                     148.35171                     122.35758 
##             HadSkinCancer_INT        PhysicalActivities_INT 
##                     127.84722                     146.31240 
##                       Sex_INT        PhysicalHealthDays_INT 
##                     214.77660                     302.23701 
##              RemovedTeeth_INT             GeneralHealth_INT 
##                     423.21381                     488.63699 
##          MentalHealthDays_INT           LastCheckupTime_INT 
##                     243.43797                     131.95551 
##              SmokerStatus_INT 
##                     296.09221

# Plot feature importance
barplot(feature_importances, main = "Feature Importances", col = "blue", las = 2)

# Calculate ROC-AUC
roc_curve <- roc(as.factor(y_test), y_proba)

## Setting levels: control = No, case = Yes

## Setting direction: controls < cases

roc_auc <- auc(roc_curve)

# Print ROC-AUC
cat("ROC-AUC:", roc_auc, "\n")

## ROC-AUC: 0.8818473

# Print model summary
print("model_classification")

## [1] "model_classification"

# Predict the outcome on the same dataset (for simplicity)

# Define control for repeated k-fold cross-validation
control <- trainControl(  method = "repeatedcv",  
                          number = 5,            # Number of folds  
                          repeats = 3,           # Number of repetitions
                          verboseIter = TRUE # Show training progress
                          )

tuneGrid <- expand.grid(  mtry = c(2, 3, 4),
                          num.trees = c(100, 200) 
                          )

model_classification <- ranger(  formula = y_train ~ .,   
                                 data = data.frame(X_train, y_train),          
                                 num.trees = 100,                   
                                 importance = "impurity" 
                                 )

## Growing trees.. Progress: 90%. Estimated remaining time: 3 seconds.

predictions <- predict(model_classification, data = X_test)$predictions 
# Compare predicted values with actual values
predictions <- factor(predictions, levels = c("No", "Yes"))

confusion_matrix <- confusionMatrix(predictions, y_test)

print(confusion_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  46289  2149
##        Yes   465   537
##                                           
##                Accuracy : 0.9471          
##                  95% CI : (0.9451, 0.9491)
##     No Information Rate : 0.9457          
##     P-Value [Acc > NIR] : 0.07755         
##                                           
##                   Kappa : 0.2697          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.9901          
##             Specificity : 0.1999          
##          Pos Pred Value : 0.9556          
##          Neg Pred Value : 0.5359          
##              Prevalence : 0.9457          
##          Detection Rate : 0.9363          
##    Detection Prevalence : 0.9797          
##       Balanced Accuracy : 0.5950          
##                                           
##        'Positive' Class : No              
##

# Print model summary
print(model_classification)# Get feature importance

## Ranger result
## 
## Call:
##  ranger(formula = y_train ~ ., data = data.frame(X_train, y_train),      num.trees = 100, importance = "impurity") 
## 
## Type:                             Classification 
## Number of trees:                  100 
## Sample size:                      197765 
## Number of independent variables:  27 
## Mtry:                             5 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             5.20 %

importance(model_classification)

##                           BMI             WeightInKilograms 
##                     2618.1375                     2232.6347 
##               AgeCategory_INT                 HadAngina_INT 
##                     1256.0165                     3369.7638 
##               HadDiabetes_INT                 HadStroke_INT 
##                      402.9200                      381.9373 
##         DifficultyWalking_INT                   HadCOPD_INT 
##                      290.9224                      267.0708 
##                 ChestScan_INT          HadKidneyDisease_INT 
##                      346.4512                      234.9295 
##              HadArthritis_INT       DeafOrHardOfHearing_INT 
##                      356.6207                      283.6256 
##             PneumoVaxEver_INT         DifficultyErrands_INT 
##                      331.5210                      204.5603 
## DifficultyDressingBathing_INT   BlindOrVisionDifficulty_INT 
##                      155.5923                      214.2703 
##           AlcoholDrinkers_INT   DifficultyConcentrating_INT 
##                      384.0180                      249.2714 
##             HadSkinCancer_INT        PhysicalActivities_INT 
##                      270.9854                      358.9717 
##                       Sex_INT        PhysicalHealthDays_INT 
##                      307.9428                      576.6549 
##              RemovedTeeth_INT             GeneralHealth_INT 
##                      753.3156                      791.5328 
##          MentalHealthDays_INT           LastCheckupTime_INT 
##                      480.2359                      251.1312 
##              SmokerStatus_INT 
##                      621.0829

# Calculate the risk_score based on the categories
df_clean <- df_clean %>%
  mutate(risk_score = BMI_INT + Weight_INT + AgeCategory_INT + HadAngina_INT + HadDiabetes_INT + 
           HadStroke_INT + DifficultyWalking_INT + HadCOPD_INT + ChestScan_INT + 
           HadKidneyDisease_INT + HadArthritis_INT + DeafOrHardOfHearing_INT + 
           PneumoVaxEver_INT + DifficultyErrands_INT + DifficultyDressingBathing_INT + 
           BlindOrVisionDifficulty_INT + AlcoholDrinkers_INT + DifficultyConcentrating_INT + 
           HadSkinCancer_INT + PhysicalActivities_INT + Sex_INT + PhysicalHealthDays_INT +
           RemovedTeeth_INT + GeneralHealth_INT + MentalHealthDays_INT + LastCheckupTime_INT + SmokerStatus_INT)

# View the dataset with the new risk_score variable
print(df_clean)

## # A tibble: 247,205 × 85
##    Sex    GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime     
##    <chr>  <chr>                      <dbl>            <dbl> <chr>               
##  1 Female Very good                      4                0 Within past year (a…
##  2 Male   Very good                      0                0 Within past year (a…
##  3 Male   Very good                      0                0 Within past year (a…
##  4 Female Fair                           5                0 Within past year (a…
##  5 Female Good                           3               15 Within past year (a…
##  6 Male   Good                           0                0 Within past year (a…
##  7 Female Good                           3                0 Within past year (a…
##  8 Male   Fair                           5                0 Within past year (a…
##  9 Male   Good                           2                0 5 or more years ago 
## 10 Female Very good                      0                0 Within past year (a…
## # ℹ 247,195 more rows
## # ℹ 80 more variables: PhysicalActivities <chr>, SleepHours <dbl>,
## #   RemovedTeeth <chr>, HadHeartAttack <fct>, HadAngina <chr>, HadStroke <chr>,
## #   HadAsthma <chr>, HadSkinCancer <chr>, HadCOPD <chr>,
## #   HadDepressiveDisorder <chr>, HadKidneyDisease <chr>, HadArthritis <chr>,
## #   HadDiabetes <chr>, DeafOrHardOfHearing <chr>,
## #   BlindOrVisionDifficulty <chr>, DifficultyConcentrating <chr>, …

# Prepare the data for training
XR <- df_clean %>%
  select(risk_score)  # Only selecting the risk_score column as predictor

# Create train and test splits
XR_train <- XR[trainIndex, ]
yR_train <- y[trainIndex]
XR_test <- XR[-trainIndex, ]
yR_test <- y[-trainIndex]

# Ensure XR_train and yR_train are data frames
XR_train <- as.data.frame(XR_train)
yR_train <- as.data.frame(yR_train)

# Train a Random Forest model for regression
model_regression <- train(yR_train ~ risk_score,
               data = data.frame(XR_train, yR_train),
               method = "glm",  # Multinomial logistic regression
               trControl = control)# View the model summary

## + Fold1.Rep1: parameter=none 
## - Fold1.Rep1: parameter=none 
## + Fold2.Rep1: parameter=none 
## - Fold2.Rep1: parameter=none 
## + Fold3.Rep1: parameter=none 
## - Fold3.Rep1: parameter=none 
## + Fold4.Rep1: parameter=none 
## - Fold4.Rep1: parameter=none 
## + Fold5.Rep1: parameter=none 
## - Fold5.Rep1: parameter=none 
## + Fold1.Rep2: parameter=none 
## - Fold1.Rep2: parameter=none 
## + Fold2.Rep2: parameter=none 
## - Fold2.Rep2: parameter=none 
## + Fold3.Rep2: parameter=none 
## - Fold3.Rep2: parameter=none 
## + Fold4.Rep2: parameter=none 
## - Fold4.Rep2: parameter=none 
## + Fold5.Rep2: parameter=none 
## - Fold5.Rep2: parameter=none 
## + Fold1.Rep3: parameter=none 
## - Fold1.Rep3: parameter=none 
## + Fold2.Rep3: parameter=none 
## - Fold2.Rep3: parameter=none 
## + Fold3.Rep3: parameter=none 
## - Fold3.Rep3: parameter=none 
## + Fold4.Rep3: parameter=none 
## - Fold4.Rep3: parameter=none 
## + Fold5.Rep3: parameter=none 
## - Fold5.Rep3: parameter=none 
## Aggregating results
## Fitting final model on full training set

summary(model_regression)# Make predictions using the model

## 
## Call:
## NULL
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.7295099  0.0573577 -134.76   <2e-16 ***
## risk_score   0.0461178  0.0004961   92.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 83488  on 197764  degrees of freedom
## Residual deviance: 73983  on 197763  degrees of freedom
## AIC: 73987
## 
## Number of Fisher Scoring iterations: 6

predictions <- predict(model_regression, newdata = data.frame(XR_test, yR_test))
# Evaluate model performance (confusion matrix)
confusion_matrix <- confusionMatrix(predictions, yR_test)
print(confusion_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  46730  2673
##        Yes    24    13
##                                           
##                Accuracy : 0.9454          
##                  95% CI : (0.9434, 0.9474)
##     No Information Rate : 0.9457          
##     P-Value [Acc > NIR] : 0.5913          
##                                           
##                   Kappa : 0.0081          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.99949         
##             Specificity : 0.00484         
##          Pos Pred Value : 0.94589         
##          Neg Pred Value : 0.35135         
##              Prevalence : 0.94567         
##          Detection Rate : 0.94519         
##    Detection Prevalence : 0.99925         
##       Balanced Accuracy : 0.50216         
##                                           
##        'Positive' Class : No              
##

# Save the best trained model using save() function
saveRDS(rf_model_ranger, file = "random_forest_model.rds")
print("Model saved successfully!")

## [1] "Model saved successfully!"

4.4 Model Evaluation

Evaluation ensures that the deployed model meets performance expectations in real-world scenarios. The following steps describe key evaluation metrics and methods:

Performance Metrics: Evaluate the model on real-world data using:
Accuracy: Proportion of correctly classified cases.
Precision: Proportion of predicted positives that are true positives.
Recall (Sensitivity): Proportion of actual positives that are correctly identified.
F1-Score: Balance between precision and recall.
ROC-AUC Score: Measures the model’s ability to distinguish between classes.

Monitoring Real-World Performance:

Data Drift: Monitor if the input data characteristics change over time, which may degrade model performance.
Prediction Drift: Ensure that predicted probabilities remain consistent with real-world trends.
Feedback Loop: Incorporate feedback from end-users (e.g., clinicians) to improve the model. Use new labeled data to periodically retrain and update the model.

Additional packages like smotefamily, rpart, ROSE were downloaded to treat inbalanced data, etc.

library(caret)
library(smotefamily)  # For SMOTE functionality
library(rpart)
library(pROC)         # For ROC-AUC
library(ROSE)         # For random under-sampling

## Loaded ROSE 0.0-4

# Set seed for reproducibility
set.seed(42)

# Split the dataset into training and testing sets
trainIndex <- createDataPartition(y, p = 0.7, list = FALSE)  # 70% train, 30% test
X_train <- X[trainIndex, ]
y_train <- y[trainIndex]
X_test <- X[-trainIndex, ]
y_test <- y[-trainIndex]

# Combine features and target into a data frame for random under-sampling
train_data <- data.frame(X_train, y_train)

# Apply Random Undersampling using ROSE
train_data_resampled <- ROSE(y_train ~ ., data = train_data, seed = 42)$data
X_train_resampled <- train_data_resampled[, -ncol(train_data_resampled)]  # Features
y_train_resampled <- train_data_resampled[, ncol(train_data_resampled)]   # Target

# Train a Decision Tree model with class weights (balanced)
dt_model <- rpart(y_train_resampled ~ ., data = cbind(X_train_resampled, y_train_resampled),
                  method = "class", control = rpart.control(cp = 0))

# Predict on the test set
y_pred <- predict(dt_model, X_test, type = "class")

# Evaluate the model: confusion matrix
conf_matrix <- confusionMatrix(as.factor(y_pred), as.factor(y_test))
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  61391  1887
##        Yes  8741  2142
##                                           
##                Accuracy : 0.8567          
##                  95% CI : (0.8541, 0.8592)
##     No Information Rate : 0.9457          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2259          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8754          
##             Specificity : 0.5316          
##          Pos Pred Value : 0.9702          
##          Neg Pred Value : 0.1968          
##              Prevalence : 0.9457          
##          Detection Rate : 0.8278          
##    Detection Prevalence : 0.8533          
##       Balanced Accuracy : 0.7035          
##                                           
##        'Positive' Class : No              
##

# ROC-AUC: Probability of the positive class
y_prob <- predict(dt_model, X_test, type = "prob")[, 2]

# Compute the ROC curve
roc_curve <- roc(as.numeric(as.factor(y_test)), y_prob)

## Setting levels: control = 1, case = 2

## Setting direction: controls < cases

# Calculate the AUC (Area Under the Curve)
roc_auc <- auc(roc_curve)

# Print the ROC-AUC score
print(paste("ROC-AUC Score:", roc_auc))

## [1] "ROC-AUC Score: 0.72966603259659"

# Load required libraries
library(caret)          # For data splitting and evaluation
library(ROSE)           # For random under-sampling
library(rpart)          # For decision tree model
library(pROC)           # For ROC-AUC calculation

# Assuming df1 is your data frame and y is your target variable
# Select features (X1)
X1 <- df_clean[, c('PhysicalHealthDays', 'MentalHealthDays', 'SleepHours',
              'HadArthritis', 'AgeCategory', 'HeightInMeters', 'WeightInKilograms',
              'BMI', 'AlcoholDrinkers', 'HIVTesting', 'FluVaxLast12',
              'TetanusLast10Tdap', 'PhysicalHealthDaysBinned', 'AgeCategory_INT',
              'RemovedTeeth_INT', 'RaceEthnicityCategory_INT', 'GeneralHealth_INT',
              'SmokerStatus_INT', 'ECigaretteUsage_INT', 'HadAngina_INT',
              'HadDiabetes_INT', 'PneumoVaxEver_INT', 'TetanusLast10Tdap_INT',
              'CovidPos_INT')]

# Encode all columns as numeric (similar to LabelEncoder in Python)
X1[] <- lapply(X1, function(col) as.integer(as.factor(col)))

# Split data into training and testing sets (70% training, 30% testing)
set.seed(42)
trainIndex <- createDataPartition(y, p = 0.7, list = FALSE)
X1_train <- X1[trainIndex, ]
y_train <- y[trainIndex]
X1_test <- X1[-trainIndex, ]
y_test <- y[-trainIndex]

# Combine features and target for random under-sampling
train_data <- data.frame(X1_train, y_train)

# Apply Random Under-Sampling using ROSE
train_data_resampled <- ROSE(y_train ~ ., data = train_data, seed = 42)$data
X1_train_resampled <- train_data_resampled[, -ncol(train_data_resampled)]
y_train_resampled <- train_data_resampled[, ncol(train_data_resampled)]

# Train a Decision Tree model
dt_model <- rpart(y_train_resampled ~ ., 
                  data = data.frame(X1_train_resampled, y_train_resampled),
                  method = "class", control = rpart.control(cp = 0))

# Predict on the test set
y_pred <- predict(dt_model, X1_test, type = "class")

# Evaluate the model: Confusion Matrix
conf_matrix <- confusionMatrix(as.factor(y_pred), as.factor(y_test))
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  66633  2123
##        Yes  3499  1906
##                                           
##                Accuracy : 0.9242          
##                  95% CI : (0.9223, 0.9261)
##     No Information Rate : 0.9457          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3645          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9501          
##             Specificity : 0.4731          
##          Pos Pred Value : 0.9691          
##          Neg Pred Value : 0.3526          
##              Prevalence : 0.9457          
##          Detection Rate : 0.8985          
##    Detection Prevalence : 0.9271          
##       Balanced Accuracy : 0.7116          
##                                           
##        'Positive' Class : No              
##

# ROC-AUC: Probability of the positive class
y_prob <- predict(dt_model, X1_test, type = "prob")[, 2]

# Compute ROC Curve
roc_curve <- roc(as.numeric(as.factor(y_test)), y_prob)

## Setting levels: control = 1, case = 2

## Setting direction: controls < cases

# Calculate AUC (Area Under the Curve)
roc_auc <- auc(roc_curve)

# Print ROC-AUC Score
print(paste("ROC-AUC Score:", roc_auc))

## [1] "ROC-AUC Score: 0.834802402962937"

# Load required libraries
library(caret)    # For data splitting and evaluation
library(rpart)    # For decision tree model
library(pROC)     # For ROC-AUC calculation

# Assuming X is the feature data frame and y is the target variable

# Split data into training and testing sets (70% training, 30% testing)
set.seed(42)
trainIndex <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X[trainIndex, ]
y_train <- y[trainIndex]
X_test <- X[-trainIndex, ]
y_test <- y[-trainIndex]

# Combine training features and target for model training
train_data <- data.frame(X_train, y_train)

# Train a Decision Tree model
dt_model <- rpart(y_train ~ ., data = train_data, method = "class", control = rpart.control(cp = 0))

# Predict on the test set
y_pred <- predict(dt_model, X_test, type = "class")

# Evaluate the model: Confusion Matrix
conf_matrix <- confusionMatrix(as.factor(y_pred), as.factor(y_test))
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  68088  2937
##        Yes  2044  1092
##                                          
##                Accuracy : 0.9328         
##                  95% CI : (0.931, 0.9346)
##     No Information Rate : 0.9457         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.2701         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.9709         
##             Specificity : 0.2710         
##          Pos Pred Value : 0.9586         
##          Neg Pred Value : 0.3482         
##              Prevalence : 0.9457         
##          Detection Rate : 0.9181         
##    Detection Prevalence : 0.9577         
##       Balanced Accuracy : 0.6209         
##                                          
##        'Positive' Class : No             
##

# ROC-AUC: Probability of the positive class
y_prob <- predict(dt_model, X_test, type = "prob")[, 2]

# Compute ROC Curve
roc_curve <- roc(as.numeric(as.factor(y_test)), y_prob)

## Setting levels: control = 1, case = 2

## Setting direction: controls < cases

# Calculate AUC (Area Under the Curve)
roc_auc <- auc(roc_curve)

# Print ROC-AUC Score
print(paste("ROC-AUC Score:", roc_auc))

## [1] "ROC-AUC Score: 0.704005223946952"

# Load required libraries
library(caret)     # For data splitting and evaluation
library(e1071)     # For Naive Bayes model
library(pROC)      # For ROC-AUC calculation

# Assuming X is the feature data frame and y is the target variable

# Split data into training and testing sets (70% training, 30% testing)
set.seed(42)
trainIndex <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X[trainIndex, ]
y_train <- y[trainIndex]
X_test <- X[-trainIndex, ]
y_test <- y[-trainIndex]

# Train the Naive Bayes model
nb_model <- naiveBayes(x = X_train, y = as.factor(y_train))

# Predict on the test set
y_pred <- predict(nb_model, X_test)

# Accuracy
accuracy <- sum(y_pred == y_test) / length(y_test)
cat("Accuracy:", accuracy, "\n")

## Accuracy: 0.874732

# Classification report (Confusion Matrix)
conf_matrix <- confusionMatrix(as.factor(y_pred), as.factor(y_test))
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  62506  1664
##        Yes  7626  2365
##                                           
##                Accuracy : 0.8747          
##                  95% CI : (0.8723, 0.8771)
##     No Information Rate : 0.9457          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2818          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8913          
##             Specificity : 0.5870          
##          Pos Pred Value : 0.9741          
##          Neg Pred Value : 0.2367          
##              Prevalence : 0.9457          
##          Detection Rate : 0.8428          
##    Detection Prevalence : 0.8653          
##       Balanced Accuracy : 0.7391          
##                                           
##        'Positive' Class : No              
##

# ROC-AUC: Probability of the positive class
y_prob <- predict(nb_model, X_test, type = "raw")[, 2]  # Get probabilities for the positive class

# Compute ROC Curve
roc_curve <- roc(as.numeric(as.factor(y_test)), y_prob)

## Setting levels: control = 1, case = 2

## Setting direction: controls < cases

# Calculate AUC (Area Under the Curve)
roc_auc <- auc(roc_curve)

# Print ROC-AUC Score
cat("ROC-AUC Score:", roc_auc, "\n")

## ROC-AUC Score: 0.8635652

library(caret)     # For confusionMatrix
library(pROC)      # For ROC-AUC

evaluation <- function(model, X_test, y_test) {
  # Predict class labels
  y_pred <- predict(model, X_test, type = "class")
  
  # Predict probabilities (if available)
  y_proba <- predict(model, X_test, type = "raw")[, 2]  # Probability for the positive class
  
  # Accuracy
  accuracy <- sum(y_pred == y_test) / length(y_test)
  
  # Confusion matrix for precision, recall, F1-Score
  conf_matrix <- confusionMatrix(as.factor(y_pred), as.factor(y_test))
  
  # Precision, Recall, F1-Score
  precision <- conf_matrix$byClass["Pos Pred Value"]
  recall <- conf_matrix$byClass["Sensitivity"]
  f1 <- conf_matrix$byClass["F1"]
  
  # ROC-AUC
  roc_curve <- roc(as.factor(y_test), y_proba)
  roc_auc <- auc(roc_curve)
  
  # Print metrics
  cat("Accuracy:", accuracy, "\n")
  cat("Precision:", precision, "\n")
  cat("Recall:", recall, "\n")
  cat("F1-Score:", f1, "\n")
  cat("ROC-AUC:", roc_auc, "\n")
}

# Now call the evaluation function on your Naive Bayes model
evaluation(nb_model, X_test, y_test)

## Setting levels: control = No, case = Yes

## Setting direction: controls < cases

## Accuracy: 0.874732 
## Precision: 0.9740689 
## Recall: 0.8912622 
## F1-Score: 0.9308275 
## ROC-AUC: 0.8635652

# Load required libraries
library(caret)   # For confusionMatrix and model evaluation
library(pROC)    # For ROC-AUC calculation

# Model performance metrics function
evaluation <- function(model, X_test, y_test) {
  # Predict class labels
  y_pred <- predict(model, X_test, type = "class")

  # Check if the length of predicted and actual labels match
  if (length(y_pred) != length(y_test)) {
    stop("Length of predicted labels and actual labels do not match.")
  }

  # Ensure the predicted labels have the same factor levels as the actual labels
  y_pred <- factor(y_pred, levels = levels(as.factor(y_test)))

  # Predict probabilities (for ROC-AUC calculation)
  y_proba <- predict(model, X_test, type = "prob")[, 2]  # Probability for the positive class

  # Accuracy
  accuracy <- sum(y_pred == y_test) / length(y_test)

  # Confusion matrix for precision, recall, F1-Score
  conf_matrix <- confusionMatrix(as.factor(y_pred), as.factor(y_test))

  # Extract precision, recall, and F1 from confusion matrix
  precision <- conf_matrix$byClass["Pos Pred Value"]
  recall <- conf_matrix$byClass["Sensitivity"]
  f1 <- conf_matrix$byClass["F1"]

  # ROC-AUC
  roc_curve <- roc(as.factor(y_test), y_proba)
  roc_auc <- auc(roc_curve)

  # Print metrics
  cat("Accuracy:", accuracy, "\n")
  cat("Precision:", precision, "\n")
  cat("Recall:", recall, "\n")
  cat("F1-Score:", f1, "\n")
  cat("ROC-AUC:", roc_auc, "\n")
}

# Example usage:
# Assuming 'model' is a trained model, and X_test, y_test are the test dataset and labels
print("Model Performance:")

## [1] "Model Performance:"

evaluation(dt_model, X_test, y_test)

## Setting levels: control = No, case = Yes

## Setting direction: controls < cases

## Accuracy: 0.9328353 
## Precision: 0.9586484 
## Recall: 0.970855 
## F1-Score: 0.9647131 
## ROC-AUC: 0.7040052

# Load necessary libraries
library(caret)
library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ranger':
## 
##     importance

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

library(pROC)

# Split data into training and testing sets
set.seed(42)
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]

# Load the pre-trained Random Forest model
rf_model <- readRDS("random_forest_model.rds")

# Predict using the model
y_pred <- predict(rf_model, X_test)$predictions 
y_proba <- predict(rf_model, X_test, type = "response")$predictions[, 2]

# Adjust predictions based on threshold
threshold <- 0.3
y_pred_adjusted <- ifelse(y_proba >= threshold, "Yes", "No")

# Convert to factors for confusion matrix and metrics
y_pred_adjusted <- factor(y_pred_adjusted, levels = c("Yes", "No"))
y_test <- factor(y_test, levels = c("Yes", "No"))

# Accuracy
accuracy <- sum(y_pred_adjusted == y_test) / length(y_test)

# Confusion matrix
conf_matrix <- confusionMatrix(y_pred_adjusted, y_test)

# Precision, Recall, F1-Score
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
f1 <- conf_matrix$byClass["F1"]

# ROC-AUC
roc_curve <- roc(y_test, y_proba)

## Setting levels: control = Yes, case = No

## Setting direction: controls > cases

roc_auc <- auc(roc_curve)

# Print metrics
cat(sprintf("Accuracy: %.4f\n", accuracy))

## Accuracy: 0.9562

cat(sprintf("Precision: %.4f\n", precision))

## Precision: 0.5900

cat(sprintf("Recall: %.4f\n", recall))

## Recall: 0.6339

cat(sprintf("F1-Score: %.4f\n", f1))

## F1-Score: 0.6112

cat(sprintf("ROC-AUC: %.4f\n", roc_auc))

## ROC-AUC: 0.9599

6. Deployment

The deployment of the heart attack prediction model in R ensures that the trained model is accessible for practical use, such as predicting heart attack risks in new data. By bridging the gap between technical development and real-world application, this model provides actionable insights that can address the business challenges outlined earlier in the project.

To integrate the model into applications, a user-friendly interface is created using the Shiny package, enabling seamless interaction between the model and end-users. Shiny allows stakeholders to input data and receive real-time predictions on heart attack risk. Whether for individual predictions or large-scale population health analyses, this interface provides an accessible tool for stakeholders.

R packages including “Shiny”, “randomForest”, and “caret” were downloaded and used to create the interactive web application, load and apply the pre-trained Random Forest model to make heart attack predictions, and to preprocess and train the model.

# Load necessary libraries
library(shiny)
library(randomForest)

# Load the pre-trained model
model <- readRDS("random_forest_model.rds")

# Define the prediction function
predict_heart_attack_risk <- function(input_data) {
  numeric_features <- input_data[1:3]
  binary_features <- input_data[4:20]
  categorical_features <- input_data[21:length(input_data)]
  
  # Map categorical inputs to numeric values
  mappings <- list(
    c('Yes' = 1, 'No' = 0),  # For binary features
    c('F' = 0, 'M' = 1),  # Gender
    c('0-5' = 3, '6-10' = 8, '11-15' = 13, '16-20' = 18, '21-25' = 23, '26-30' = 28),  # Sick days
    c('None of them' = 0, '1 to 5' = 1, '6 or more, but not all' = 2, 'All' = 3),  # Teeth removed
    c('Poor' = 0, 'Fair' = 1, 'Good' = 2, 'Very good' = 3, 'Excellent' = 4),  # General health
    c('0-5' = 3, '6-10' = 8, '11-15' = 13, '16-20' = 18, '21-25' = 23, '26-30' = 28),  # Mental health days
    c('Within past year (anytime less than 12 months ago)' = 0,
      'Within past 2 years (1 year but less than 2 years ago)' = 1,
      'Within past 5 years (2 years but less than 5 years ago)' = 2,
      '5 or more years ago' = 3),  # Last checkup
    c('Never smoked' = 0, 'Former smoker' = 1,
      'Current smoker - now smokes some days' = 2,
      'Current smoker - now smokes every day' = 3)  # Smoking status
  )
  
  # Transform binary inputs
  binary_values <- sapply(binary_features, function(val) ifelse(val == 'Yes', 1, 0))
  
  # Transform categorical inputs using correct mappings
    categorical_values <- sapply(1:length(categorical_features), function(i) {
    if (i <= length(mappings)) {
      mapping <- mappings[[i]]  # Adjust index to match categorical inputs
      return(mapping[[categorical_features[i]]])
    } else {
      return(NA)  # Handle case where the index exceeds mappings length
    }
  })
  
  # Combine all inputs into a single row
  input_row <- c(numeric_features, binary_values, categorical_values)
  
  # Predict probabilities
  risk_probability <- predict(model, data.frame(t(input_row)), type = "prob")[1, ]
  low_risk_prob <- risk_probability[1] * 100
  high_risk_prob <- risk_probability[2] * 100
  
  # Final prediction
  prediction <- ifelse(high_risk_prob > low_risk_prob, "High Risk of Heart Attack", "Low Risk of Heart Attack")
  
  # Return results
  return(list(
    LowRiskProbability = round(low_risk_prob, 2),
    HighRiskProbability = round(high_risk_prob, 2),
    Prediction = prediction
  ))
}

# Define UI for the Shiny app
ui <- fluidPage(
  titlePanel("Heart Attack Risk Prediction"),
  
  sidebarLayout(
    sidebarPanel(
      numericInput("BMI", "BMI", value = 25, min = 10, max = 50),
      numericInput("Weight", "Weight (in kg)", value = 70, min = 30, max = 200),
      numericInput("Age", "Age", value = 35, min = 18, max = 100),
      
      # Binary Inputs
      checkboxGroupInput(
        "BinaryFeatures", "Health Conditions",
        choices = c("HadAngina", "HadDiabetes", "HadStroke", "DifficultyWalking",
                    "HadCOPD", "ChestScan", "HadKidneyDisease", "HadArthritis",
                    "DeafOrHardOfHearing", "PneumoniaVaccineEver", "DifficultyErrands",
                    "DifficultyDressingBathing", "BlindOrVisionDifficulty",
                    "AlcoholDrinkers", "DifficultyConcentrating", "HadSkinCancer",
                    "PhysicalActivities")
      ),
      
      # Categorical Inputs
      radioButtons("Gender", "Gender", choices = c("F", "M")),
      selectInput("SickDays", "Sick Days in a Month", choices = c("0-5", "6-10", "11-15", "16-20", "21-25", "26-30")),
      selectInput("TeethRemoved", "Teeth Removed", choices = c("None of them", "1 to 5", "6 or more, but not all", "All")),
      selectInput("GeneralHealth", "General Health", choices = c("Poor", "Fair", "Good", "Very good", "Excellent")),
      selectInput("LastCheckup", "Last Checkup", choices = c(
        "Within past year (anytime less than 12 months ago)",
        "Within past 2 years (1 year but less than 2 years ago)",
        "Within past 5 years (2 years but less than 5 years ago)",
        "5 or more years ago"
      )),
      selectInput("SmokingStatus", "Smoking Status", choices = c(
        "Never smoked", "Former smoker", "Current smoker - now smokes some days", "Current smoker - now smokes every day"
      )),
      
      actionButton("submit", "Predict")
    ),
    
    mainPanel(
      h4("Prediction Results"),
      verbatimTextOutput("result")
    )
  )
)

# Define server logic for the Shiny app
server <- function(input, output) {
  observeEvent(input$submit, {
    # Construct input data for the model
    input_data <- c(input$BMI, input$Weight, input$Age,
                    input$Gender, input$SickDays, input$TeethRemoved,
                    input$GeneralHealth, input$LastCheckup, input$SmokingStatus)
    
    # Add binary features
    binary_features <- c(
      "HadAngina", "HadDiabetes", "HadStroke", "DifficultyWalking",
      "HadCOPD", "ChestScan", "HadKidneyDisease", "HadArthritis",
      "DeafOrHardOfHearing", "PneumoniaVaccineEver", "DifficultyErrands",
      "DifficultyDressingBathing", "BlindOrVisionDifficulty",
      "AlcoholDrinkers", "DifficultyConcentrating", "HadSkinCancer",
      "PhysicalActivities"
    )
    for (feature in binary_features) {
      input_data[[feature]] <- ifelse(feature %in% input$BinaryFeatures, 1, 0)
    }
    
    # Predict and display the results
    results <- predict_heart_attack_risk(input_data)
    output$result <- renderText({
      paste(
        "Low Risk Probability:", results$LowRiskProbability, "%\n",
        "High Risk Probability:", results$HighRiskProbability, "%\n",
        "Prediction:", results$Prediction
      )
    })
  })
}

# Launch the Shiny app
shinyApp(ui = ui, server = server)

Shiny applications not supported in static R Markdown documents

#library(reticulate)
 
#library(reticulate)
#use_python("C:/Users/LEE/Documents/.virtualenvs/r-reticulate/Scripts/python.ex#e", required = TRUE)

# Load the model in R
#model_regression <- readRDS("random_forest_model.rds")

#py_run_string("import joblib")# Optionally: Save the model to a .pkl file in R, so Python can load it
#saveRDS(model_regression, file = "model.rds") 

# Use reticulate to save it as a Python-compatible format
#py_run_string("
#import joblib
#model = r.model_regression
#joblib.dump(model, 'random_forest_model.pkl')
#")

# Run the Python script for Gradio
#py_run_file("deployment.py")
#system("python deployment.py", wait = TRUE)

Conclusion

The heart attack prediction model using a Random Forest algorithm effectively assesses an individual’s risk by analyzing various health factors such as age, BMI, lifestyle habits, and medical history. By calculating the probabilities of high or low risk, the model provides personalized predictions that can guide healthcare decisions. This tool aids in early risk detection, encouraging timely interventions and lifestyle changes. While effective, further improvements could involve expanding features and real-time data integration for even more accurate and dynamic risk assessment.

Predictive Modeling for Early Detection of Heart Attack Risk

Group8

2025-01-02