The primary objective of this project is to leverage data science techniques to identify and predict heart attack risks among individuals. Heart disease is one of the leading causes of death globally, and early identification of individuals at risk can save lives by enabling timely interventions and promoting preventive measures. This project focuses on analyzing health and lifestyle data to create actionable insights and predictive models to support healthcare providers, policymakers, and individuals.
Our project focus on heart attack. Heart attack under heart disease is one of the leading causes of mortality worldwide. Early detection of heart attack risk can significantly improve patient outcomes by enabling timely medical interventions. This project focuses on predictive modeling techniques to analyze health-related datasets and identify individuals at risk of a heart attack. By leveraging machine learning algorithms, we aim to provide accurate predictions and support healthcare professionals in making informed decisions.
Classification
Question: How can we classify individuals as “Low Risk” or “High Risk” of a heart attack based on basic health data?
Objective: To develop a simple classification model that predicts whether an individual falls into “Low Risk” or “High Risk” category using easily accessible data such as age, BMI, and blood pressure.
Regression
Question: How can we predict the probability or severity score of a heart attack occurrence for an individual?
Objective: To build a regression model that predicts the likelihood or severity score of a heart attack, aiding personalized risk assessments.
The success of this project will be measured by:
Model Performance Metrics:
Achieving an accuracy > 85%.
Achieving an ROC-AUC Score > 80%.
Maintaining a false-negative rate of < 20%.
High precision and recall to flag high-risk patients correctly.
This section provides an overview of the dataset, including its source and structure.
This dataset was obtained from https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease.
R packages including “readr”, “dplyr” and “knitr” were downloaded to load and explore the raw data.
# Load necessary libraries
library(readr) # For reading CSV files
library(dplyr) # For data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr) # For table formatting (optional)
# Load the dataset
df <- read_csv("heart_2022_with_nans.csv")
## Rows: 445132 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (34): State, Sex, GeneralHealth, LastCheckupTime, PhysicalActivities, Re...
## dbl (6): PhysicalHealthDays, MentalHealthDays, SleepHours, HeightInMeters, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows
head(df)
## # A tibble: 6 × 40
## State Sex GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Alaba… Fema… Very good 0 0 Within past ye…
## 2 Alaba… Fema… Excellent 0 0 <NA>
## 3 Alaba… Fema… Very good 2 3 Within past ye…
## 4 Alaba… Fema… Excellent 0 0 Within past ye…
## 5 Alaba… Fema… Fair 2 0 Within past ye…
## 6 Alaba… Male Poor 1 0 Within past ye…
## # ℹ 34 more variables: PhysicalActivities <chr>, SleepHours <dbl>,
## # RemovedTeeth <chr>, HadHeartAttack <chr>, HadAngina <chr>, HadStroke <chr>,
## # HadAsthma <chr>, HadSkinCancer <chr>, HadCOPD <chr>,
## # HadDepressiveDisorder <chr>, HadKidneyDisease <chr>, HadArthritis <chr>,
## # HadDiabetes <chr>, DeafOrHardOfHearing <chr>,
## # BlindOrVisionDifficulty <chr>, DifficultyConcentrating <chr>,
## # DifficultyWalking <chr>, DifficultyDressingBathing <chr>, …
The dataset used in this project is the Heart Disease Dataset (2022). It contains 445,132 records and a total of 40 columns. The information is about various health-related metrics collected from patients, which can be used to predict the likelihood of heart disease.
The dataset includes demographic, health, and lifestyle input variables related to heart attack, such as:
Demographics: State, Sex, Race/Ethnicity, AgeCategory
Health Metrics: BMI, SleepHours, HeightInMeters, WeightInKilograms
Risk Factors: Smoking status, PhysicalHealthDays, MentalHealthDays
Medical History: HadAngina, HadStroke, HadDiabetes, etc.
This is the target variable that will be used for the heart attack prediction model.
The dataset will be used to:
1. Explore patterns and trends in health metrics.
2. Build predictive models to identify individuals at risk of heart attack.
Contains a mix of numerical and categorical variables.
Example data types:
Numerical: PhysicalHealthDays,
MentalHealthDays, BMI
Categorical: Sex, GeneralHealth,
SmokerStatus
Numerical Variables:
BMI: 1st Q = 24.13, Median = 27.44, Mean = 28.53,
3rd Q = 31.75, Range = [12.02, 99.64]
SleepHours: 1st Q = 6.00, Median = 7.00, Mean =
7.02, 3rd Q = 8.00, Range = [1.00, 24.00]
Categorical Variables:
Sex: Categories = [Male, Female]
SmokerStatus: Categories = [Never smoked, Former
smoker, Current smoker (Every Days), Current Smokers (Some
Days)]
#Dimensions of the dataset
cat("Dataset Dimensions: ", dim(df)[1], "rows and", dim(df)[2], "columns\n")
## Dataset Dimensions: 445132 rows and 40 columns
# Print dataset dimensions
cat("The dataset has", nrow(df), "rows and", ncol(df), "columns.\n")
## The dataset has 445132 rows and 40 columns.
# Dataset structure
str(df)
## spc_tbl_ [445,132 × 40] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ State : chr [1:445132] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ Sex : chr [1:445132] "Female" "Female" "Female" "Female" ...
## $ GeneralHealth : chr [1:445132] "Very good" "Excellent" "Very good" "Excellent" ...
## $ PhysicalHealthDays : num [1:445132] 0 0 2 0 2 1 0 0 0 1 ...
## $ MentalHealthDays : num [1:445132] 0 0 3 0 0 0 0 0 0 0 ...
## $ LastCheckupTime : chr [1:445132] "Within past year (anytime less than 12 months ago)" NA "Within past year (anytime less than 12 months ago)" "Within past year (anytime less than 12 months ago)" ...
## $ PhysicalActivities : chr [1:445132] "No" "No" "Yes" "Yes" ...
## $ SleepHours : num [1:445132] 8 6 5 7 9 7 7 8 6 7 ...
## $ RemovedTeeth : chr [1:445132] NA NA NA NA ...
## $ HadHeartAttack : chr [1:445132] "No" "No" "No" "No" ...
## $ HadAngina : chr [1:445132] "No" "No" "No" "No" ...
## $ HadStroke : chr [1:445132] "No" "No" "No" "No" ...
## $ HadAsthma : chr [1:445132] "No" "No" "No" "Yes" ...
## $ HadSkinCancer : chr [1:445132] "No" "Yes" "Yes" "No" ...
## $ HadCOPD : chr [1:445132] "No" "No" "No" "No" ...
## $ HadDepressiveDisorder : chr [1:445132] "No" "No" "No" "No" ...
## $ HadKidneyDisease : chr [1:445132] "No" "No" "No" "No" ...
## $ HadArthritis : chr [1:445132] "No" "No" "No" "Yes" ...
## $ HadDiabetes : chr [1:445132] "Yes" "No" "No" "No" ...
## $ DeafOrHardOfHearing : chr [1:445132] "No" "No" "No" "No" ...
## $ BlindOrVisionDifficulty : chr [1:445132] "No" "No" "No" "No" ...
## $ DifficultyConcentrating : chr [1:445132] "No" "No" "No" "No" ...
## $ DifficultyWalking : chr [1:445132] "No" "No" "No" "No" ...
## $ DifficultyDressingBathing: chr [1:445132] "No" "No" "No" "No" ...
## $ DifficultyErrands : chr [1:445132] "No" "No" "No" "No" ...
## $ SmokerStatus : chr [1:445132] "Never smoked" "Never smoked" "Never smoked" "Current smoker - now smokes some days" ...
## $ ECigaretteUsage : chr [1:445132] "Not at all (right now)" "Never used e-cigarettes in my entire life" "Never used e-cigarettes in my entire life" "Never used e-cigarettes in my entire life" ...
## $ ChestScan : chr [1:445132] "No" "No" "No" "Yes" ...
## $ RaceEthnicityCategory : chr [1:445132] "White only, Non-Hispanic" "White only, Non-Hispanic" "White only, Non-Hispanic" "White only, Non-Hispanic" ...
## $ AgeCategory : chr [1:445132] "Age 80 or older" "Age 80 or older" "Age 55 to 59" NA ...
## $ HeightInMeters : num [1:445132] NA 1.6 1.57 1.65 1.57 1.8 1.65 1.63 1.7 1.68 ...
## $ WeightInKilograms : num [1:445132] NA 68 63.5 63.5 54 ...
## $ BMI : num [1:445132] NA 26.6 25.6 23.3 21.8 ...
## $ AlcoholDrinkers : chr [1:445132] "No" "No" "No" "No" ...
## $ HIVTesting : chr [1:445132] "No" "No" "No" "No" ...
## $ FluVaxLast12 : chr [1:445132] "Yes" "No" "No" "Yes" ...
## $ PneumoVaxEver : chr [1:445132] "No" "No" "No" "Yes" ...
## $ TetanusLast10Tdap : chr [1:445132] "Yes, received tetanus shot but not sure what type" "No, did not receive any tetanus shot in the past 10 years" NA "No, did not receive any tetanus shot in the past 10 years" ...
## $ HighRiskLastYear : chr [1:445132] "No" "No" "No" "No" ...
## $ CovidPos : chr [1:445132] "No" "No" "Yes" "No" ...
## - attr(*, "spec")=
## .. cols(
## .. State = col_character(),
## .. Sex = col_character(),
## .. GeneralHealth = col_character(),
## .. PhysicalHealthDays = col_double(),
## .. MentalHealthDays = col_double(),
## .. LastCheckupTime = col_character(),
## .. PhysicalActivities = col_character(),
## .. SleepHours = col_double(),
## .. RemovedTeeth = col_character(),
## .. HadHeartAttack = col_character(),
## .. HadAngina = col_character(),
## .. HadStroke = col_character(),
## .. HadAsthma = col_character(),
## .. HadSkinCancer = col_character(),
## .. HadCOPD = col_character(),
## .. HadDepressiveDisorder = col_character(),
## .. HadKidneyDisease = col_character(),
## .. HadArthritis = col_character(),
## .. HadDiabetes = col_character(),
## .. DeafOrHardOfHearing = col_character(),
## .. BlindOrVisionDifficulty = col_character(),
## .. DifficultyConcentrating = col_character(),
## .. DifficultyWalking = col_character(),
## .. DifficultyDressingBathing = col_character(),
## .. DifficultyErrands = col_character(),
## .. SmokerStatus = col_character(),
## .. ECigaretteUsage = col_character(),
## .. ChestScan = col_character(),
## .. RaceEthnicityCategory = col_character(),
## .. AgeCategory = col_character(),
## .. HeightInMeters = col_double(),
## .. WeightInKilograms = col_double(),
## .. BMI = col_double(),
## .. AlcoholDrinkers = col_character(),
## .. HIVTesting = col_character(),
## .. FluVaxLast12 = col_character(),
## .. PneumoVaxEver = col_character(),
## .. TetanusLast10Tdap = col_character(),
## .. HighRiskLastYear = col_character(),
## .. CovidPos = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
sapply(df, class)
## State Sex GeneralHealth
## "character" "character" "character"
## PhysicalHealthDays MentalHealthDays LastCheckupTime
## "numeric" "numeric" "character"
## PhysicalActivities SleepHours RemovedTeeth
## "character" "numeric" "character"
## HadHeartAttack HadAngina HadStroke
## "character" "character" "character"
## HadAsthma HadSkinCancer HadCOPD
## "character" "character" "character"
## HadDepressiveDisorder HadKidneyDisease HadArthritis
## "character" "character" "character"
## HadDiabetes DeafOrHardOfHearing BlindOrVisionDifficulty
## "character" "character" "character"
## DifficultyConcentrating DifficultyWalking DifficultyDressingBathing
## "character" "character" "character"
## DifficultyErrands SmokerStatus ECigaretteUsage
## "character" "character" "character"
## ChestScan RaceEthnicityCategory AgeCategory
## "character" "character" "character"
## HeightInMeters WeightInKilograms BMI
## "numeric" "numeric" "numeric"
## AlcoholDrinkers HIVTesting FluVaxLast12
## "character" "character" "character"
## PneumoVaxEver TetanusLast10Tdap HighRiskLastYear
## "character" "character" "character"
## CovidPos
## "character"
# Summary statistics
summary(df)
## State Sex GeneralHealth PhysicalHealthDays
## Length:445132 Length:445132 Length:445132 Min. : 0.000
## Class :character Class :character Class :character 1st Qu.: 0.000
## Mode :character Mode :character Mode :character Median : 0.000
## Mean : 4.348
## 3rd Qu.: 3.000
## Max. :30.000
## NA's :10927
## MentalHealthDays LastCheckupTime PhysicalActivities SleepHours
## Min. : 0.000 Length:445132 Length:445132 Min. : 1.000
## 1st Qu.: 0.000 Class :character Class :character 1st Qu.: 6.000
## Median : 0.000 Mode :character Mode :character Median : 7.000
## Mean : 4.383 Mean : 7.023
## 3rd Qu.: 5.000 3rd Qu.: 8.000
## Max. :30.000 Max. :24.000
## NA's :9067 NA's :5453
## RemovedTeeth HadHeartAttack HadAngina HadStroke
## Length:445132 Length:445132 Length:445132 Length:445132
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HadAsthma HadSkinCancer HadCOPD HadDepressiveDisorder
## Length:445132 Length:445132 Length:445132 Length:445132
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HadKidneyDisease HadArthritis HadDiabetes DeafOrHardOfHearing
## Length:445132 Length:445132 Length:445132 Length:445132
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BlindOrVisionDifficulty DifficultyConcentrating DifficultyWalking
## Length:445132 Length:445132 Length:445132
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## DifficultyDressingBathing DifficultyErrands SmokerStatus
## Length:445132 Length:445132 Length:445132
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## ECigaretteUsage ChestScan RaceEthnicityCategory AgeCategory
## Length:445132 Length:445132 Length:445132 Length:445132
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HeightInMeters WeightInKilograms BMI AlcoholDrinkers
## Min. :0.910 Min. : 22.68 Min. :12.02 Length:445132
## 1st Qu.:1.630 1st Qu.: 68.04 1st Qu.:24.13 Class :character
## Median :1.700 Median : 80.74 Median :27.44 Mode :character
## Mean :1.703 Mean : 83.07 Mean :28.53
## 3rd Qu.:1.780 3rd Qu.: 95.25 3rd Qu.:31.75
## Max. :2.410 Max. :292.57 Max. :99.64
## NA's :28652 NA's :42078 NA's :48806
## HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap
## Length:445132 Length:445132 Length:445132 Length:445132
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HighRiskLastYear CovidPos
## Length:445132 Length:445132
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
This section identifies missing values in the dataset and calculates the percentage of null values for each feature. Missing data can affect the performance of machine learning models, so it’s crucial to address them during the data pre-processing stage.
###Check Dataset for Percentage of Null Values
This section identifies missing values in the dataset and calculates the percentage of null values for each feature. Missing data can affect the performance of machine learning models, so it’s crucial to address them during the data pre-processing stage.
Here’s the description of 9 features with more than 10% missing values:
ChestScan (12.59%): Indicates whether the patient has undergone a chest scan.
BMI (10.96%): Body Mass Index, a measure of body fat based on height and weight.
AlcoholDrinkers (10.46%): Indicates whether the patient consumes alcohol.
HIVTesting (14.86%): Indicates whether the patient has been tested for HIV.
FluVaxLast12 (10.58%): Indicates whether the patient received a flu vaccine in the last 12 months.
PneumoVaxEver (17.31%): Indicates whether the patient has ever received the pneumococcal vaccine.
TetanusLast10Tdap (18.54%): Indicates whether the patient received a tetanus vaccine in the last 10 years.
CovidPos (11.40%): Indicates whether the patient tested positive for COVID-19.
HighRiskLastYear (11.37%): Indicates whether the patient was categorized as high-risk in the previous year.
# Calculate the percentage of missing values in each column
missing_percentage <- colSums(is.na(df)) / nrow(df) * 100
# Display the percentages in a neat table
kable(data.frame(Column = names(missing_percentage), MissingPercentage = missing_percentage),
col.names = c("Column Name", "Missing Percentage (%)"),
caption = "Percentage of Missing Values in Each Column")
| Column Name | Missing Percentage (%) | |
|---|---|---|
| State | State | 0.0000000 |
| Sex | Sex | 0.0000000 |
| GeneralHealth | GeneralHealth | 0.2691337 |
| PhysicalHealthDays | PhysicalHealthDays | 2.4547775 |
| MentalHealthDays | MentalHealthDays | 2.0369239 |
| LastCheckupTime | LastCheckupTime | 1.8664127 |
| PhysicalActivities | PhysicalActivities | 0.2455451 |
| SleepHours | SleepHours | 1.2250299 |
| RemovedTeeth | RemovedTeeth | 2.5520520 |
| HadHeartAttack | HadHeartAttack | 0.6885598 |
| HadAngina | HadAngina | 0.9895941 |
| HadStroke | HadStroke | 0.3497839 |
| HadAsthma | HadAsthma | 0.3983088 |
| HadSkinCancer | HadSkinCancer | 0.7060827 |
| HadCOPD | HadCOPD | 0.4985038 |
| HadDepressiveDisorder | HadDepressiveDisorder | 0.6317227 |
| HadKidneyDisease | HadKidneyDisease | 0.4326806 |
| HadArthritis | HadArthritis | 0.5915099 |
| HadDiabetes | HadDiabetes | 0.2441972 |
| DeafOrHardOfHearing | DeafOrHardOfHearing | 4.6383994 |
| BlindOrVisionDifficulty | BlindOrVisionDifficulty | 4.8444057 |
| DifficultyConcentrating | DifficultyConcentrating | 5.4455757 |
| DifficultyWalking | DifficultyWalking | 5.3943549 |
| DifficultyDressingBathing | DifficultyDressingBathing | 5.3725636 |
| DifficultyErrands | DifficultyErrands | 5.7636836 |
| SmokerStatus | SmokerStatus | 7.9666256 |
| ECigaretteUsage | ECigaretteUsage | 8.0111068 |
| ChestScan | ChestScan | 12.5908719 |
| RaceEthnicityCategory | RaceEthnicityCategory | 3.1579397 |
| AgeCategory | AgeCategory | 2.0396197 |
| HeightInMeters | HeightInMeters | 6.4367424 |
| WeightInKilograms | WeightInKilograms | 9.4529263 |
| BMI | BMI | 10.9643881 |
| AlcoholDrinkers | AlcoholDrinkers | 10.4629638 |
| HIVTesting | HIVTesting | 14.8555934 |
| FluVaxLast12 | FluVaxLast12 | 10.5858487 |
| PneumoVaxEver | PneumoVaxEver | 17.3072257 |
| TetanusLast10Tdap | TetanusLast10Tdap | 18.5374226 |
| HighRiskLastYear | HighRiskLastYear | 11.3725816 |
| CovidPos | CovidPos | 11.4042576 |
barplot(missing_percentage,
main = "Percentage of Missing Values by Column",
col = "steelblue",
xlab = "Columns",
ylab = "Percentage of Missing Values",
las = 2, # Rotate x-axis labels
ylim = c(0, max(missing_percentage) + 5)) # Adjust y-axis limit
This section analyzes the distribution of the target variable (Heart Attack) to check for class imbalance. Imbalanced datasets can bias machine learning models toward the majority class, necessitating techniques like resampling or using weighted metrics.
If IR > 1.5 = Imbalance
If IR < 1.5 = Balance
If IR > 9 = Highly Imbalance
The code calculates the Imbalance Ratio (IR) for the target variable HadHeartAttack, which measures the proportion of the majority class (No heart attack) to the minority class (Had heart attack). A high imbalance ratio indicates that the dataset is significantly skewed toward one class, which can impact the performance of machine learning models.
Output:
The Imbalance Ratio (IR) is calculated as approximately 16.61, which means there are around 16.61 instances of “No Heart Attack” for every instance of “Yes Heart Attack” in the dataset. This indicates a severe imbalance in the dataset.
Implications:
Model Bias: Without addressing this imbalance, the machine learning model may become biased toward predicting the majority class (No heart attack), resulting in poor performance for identifying the minority class (Yes heart attack).
Evaluation Metrics: Accuracy alone may not be a reliable metric. Precision, recall, and F1-score should be emphasized to evaluate model performance on the minority class.
# Count values for "Yes" and "No" in the "HadHeartAttack" column
heart_attack_count <- sum(df$HadHeartAttack == "Yes", na.rm = TRUE)
heart_attack_count_no <- sum(df$HadHeartAttack == "No", na.rm = TRUE)
# Calculate the imbalance ratio (IR)
IR <- heart_attack_count_no / heart_attack_count
# Print the result
cat("Imbalance Ratio (IR):", IR, "\n")
## Imbalance Ratio (IR): 16.60662
This section details the steps taken to clean and preprocess the data, such as handling missing values and encoding categorical variables.
Data preparation is a crucial step in the data science workflow, involving the transformation of raw data into a clean and structured format suitable for analysis and modeling. This process ensures data quality and enhances the performance of machine learning models.
The goal is to address noise data in the HeightInMeters, WeightInKilograms and BMI columns by applying specific strategies:
Statistical imputation for missing BMI values based on HeightInMeters and WeightInKilograms.
Remove rows with missing BMI values or missing values in either HeightInMeters or WeightInKilograms.
Removal of biologically unrealistic outliers for HeightInMeters and BMI.
Summary of Output:
HeightInMeters:
BMI:
Missing BMI values were imputed using height and weight.
Boxplot visualization highlighted the central tendency and spread of BMI.
R packages including “ggplot2”, and “dplyr” were downloaded to clean and preprocess the data.
# Load necessary libraries
library(ggplot2)
library(dplyr)
# Create a boxplot for HeightInMeters
ggplot(df, aes(x = HeightInMeters)) +
geom_boxplot(fill = "lightblue", color = "black") +
labs(title = "Boxplot for HeightInMeters", x = "HeightInMeters") +
theme_minimal()
# Calculate quartiles and IQR
upper_quartile <- quantile(df$HeightInMeters, 0.75, na.rm = TRUE)
lower_quartile <- quantile(df$HeightInMeters, 0.25, na.rm = TRUE)
iqr <- upper_quartile - lower_quartile
#Statistical Summary for Height
# Calculate whiskers
upper_whisker <- max(df$HeightInMeters[df$HeightInMeters <= (upper_quartile + 1.5 * iqr)], na.rm = TRUE)
lower_whisker <- min(df$HeightInMeters[df$HeightInMeters >= (lower_quartile - 1.5 * iqr)], na.rm = TRUE)
# Print results
cat("Lower Whisker:", lower_whisker, "\n")
## Lower Whisker: 1.42
cat("Upper Whisker:", upper_whisker, "\n")
## Upper Whisker: 2
cat("Min Height:", min(df$HeightInMeters, na.rm = TRUE), "\n")
## Min Height: 0.91
cat("Max Height:", max(df$HeightInMeters, na.rm = TRUE), "\n")
## Max Height: 2.41
#Filter Height for Dwarfism and Gigantism
# Remove rows where height is outside 1.3m to 2.2m
df <- df %>% filter(HeightInMeters > 1.3 & HeightInMeters < 2.2)
# Print new height range
cat("New Min Height:", min(df$HeightInMeters, na.rm = TRUE), "\n")
## New Min Height: 1.32
cat("New Max Height:", max(df$HeightInMeters, na.rm = TRUE), "\n")
## New Max Height: 2.18
#BMI Calculation
# Calculate missing BMI using height and weight
df <- df %>%
mutate(BMI = ifelse(
is.na(BMI) & !is.na(HeightInMeters) & !is.na(WeightInKilograms),
WeightInKilograms / (HeightInMeters ^ 2),
BMI
))
# Verify changes
cat("BMI calculation completed.\n")
## BMI calculation completed.
#Boxplot BMI
# Create a boxplot for BMI
ggplot(df, aes(x = BMI)) +
geom_boxplot(fill = "lightblue", color = "black") +
labs(title = "Boxplot for BMI", x = "BMI") +
theme_minimal()
#Statistical Summary for BMI
# Calculate quartiles and IQR for BMI
upper_quartile <- quantile(df$BMI, 0.75, na.rm = TRUE)
lower_quartile <- quantile(df$BMI, 0.25, na.rm = TRUE)
iqr <- upper_quartile - lower_quartile
# Calculate whiskers
upper_whisker <- max(df$BMI[df$BMI <= (upper_quartile + 1.5 * iqr)], na.rm = TRUE)
lower_whisker <- min(df$BMI[df$BMI >= (lower_quartile - 1.5 * iqr)], na.rm = TRUE)
# Print results
cat("Lower Whisker:", lower_whisker, "\n")
## Lower Whisker: 12.75
cat("Upper Whisker:", upper_whisker, "\n")
## Upper Whisker: 43.14
The aim of this process is to remove rows with any missing values (null) from the dataset, ensuring that the resulting data is completed and ready for analysis.
Summary of the Output:
After removing null values,the dataset contains 247,205 rows (individual records) and still 40 columns (features or variables).
# Check the number of missing values in each column
missing_values <- colSums(is.na(df))
print(missing_values)
## State Sex GeneralHealth
## 0 0 982
## PhysicalHealthDays MentalHealthDays LastCheckupTime
## 9626 7970 7035
## PhysicalActivities SleepHours RemovedTeeth
## 855 4597 9954
## HadHeartAttack HadAngina HadStroke
## 2546 3885 1254
## HadAsthma HadSkinCancer HadCOPD
## 1465 2755 1839
## HadDepressiveDisorder HadKidneyDisease HadArthritis
## 2352 1572 2216
## HadDiabetes DeafOrHardOfHearing BlindOrVisionDifficulty
## 786 2772 3612
## DifficultyConcentrating DifficultyWalking DifficultyDressingBathing
## 6087 5887 5752
## DifficultyErrands SmokerStatus ECigaretteUsage
## 7361 16303 16504
## ChestScan RaceEthnicityCategory AgeCategory
## 36082 11337 5972
## HeightInMeters WeightInKilograms BMI
## 0 17885 17885
## AlcoholDrinkers HIVTesting FluVaxLast12
## 26544 45820 27322
## PneumoVaxEver TetanusLast10Tdap HighRiskLastYear
## 56537 61652 30769
## CovidPos
## 30896
# Drop rows with missing values
df_clean <- na.omit(df)
# Check the shape (dimensions) of the cleaned dataset
cat("Number of rows and columns after cleaning:", dim(df_clean), "\n")
## Number of rows and columns after cleaning: 247205 40
This section visualizes geographical distribution of the Heart Attack Ratio by State which indicates the percentage of heart attack cases relative to the total records for each state across U.S. states.
The maps use state abbreviations and color intensity to represent the magnitude of heart attack occurrences or ratios, making it easier to identify high-risk areas.Packages plotly has been used to insert plot the map.
# State abbreviation mapping
state_abbrev <- c(
'Alabama' = 'AL', 'Alaska' = 'AK', 'Arizona' = 'AZ', 'Arkansas' = 'AR', 'California' = 'CA',
'Colorado' = 'CO', 'Connecticut' = 'CT', 'Delaware' = 'DE', 'Florida' = 'FL', 'Georgia' = 'GA',
'Hawaii' = 'HI', 'Idaho' = 'ID', 'Illinois' = 'IL', 'Indiana' = 'IN', 'Iowa' = 'IA',
'Kansas' = 'KS', 'Kentucky' = 'KY', 'Louisiana' = 'LA', 'Maine' = 'ME', 'Maryland' = 'MD',
'Massachusetts' = 'MA', 'Michigan' = 'MI', 'Minnesota' = 'MN', 'Mississippi' = 'MS',
'Missouri' = 'MO', 'Montana' = 'MT', 'Nebraska' = 'NE', 'Nevada' = 'NV', 'New Hampshire' = 'NH',
'New Jersey' = 'NJ', 'New Mexico' = 'NM', 'New York' = 'NY', 'North Carolina' = 'NC',
'North Dakota' = 'ND', 'Ohio' = 'OH', 'Oklahoma' = 'OK', 'Oregon' = 'OR', 'Pennsylvania' = 'PA',
'Rhode Island' = 'RI', 'South Carolina' = 'SC', 'South Dakota' = 'SD', 'Tennessee' = 'TN',
'Texas' = 'TX', 'Utah' = 'UT', 'Vermont' = 'VT', 'Virginia' = 'VA', 'Washington' = 'WA',
'West Virginia' = 'WV', 'Wisconsin' = 'WI', 'Wyoming' = 'WY'
)
# Add state abbreviations to the dataframe
df_clean$State <- state_abbrev[df_clean$State]
library(dplyr)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Heart Attack Ratios
# Step 1: Group by State and Calculate the Ratio of Heart Attacks
state_heartattack_ratio <- df_clean %>%
group_by(State) %>%
summarise(HeartAttackRatio = sum(HadHeartAttack == "Yes") / n() * 100) # Calculate ratio of "Yes"
# Step 2: Create a US Map Plot
fig2 <- plot_ly(
data = state_heartattack_ratio,
type = "choropleth",
locations = ~State, # State abbreviations
locationmode = "USA-states", # Mode for state abbreviations
z = ~HeartAttackRatio, # Data to color by
colors = "Blues" # Color scheme
) %>%
layout(
title = "Heart Attack Ratio by State",
geo = list(scope = "usa") # Focus on the US map
)
# Step 3: Display the Maps
fig2
df_clean <- df_clean %>% select(-State)
This section involves comprehensive preprocessing techniques applied to transform and encode categorical and numerical variables into structured formats suitable for analysis and modelling.
Key techniques include:
Binning: Continuous numerical features are grouped into discrete intervals (bins), enabling better interpretability and analysis.
Categorical Encoding: Categorical data is encoded into integers or mapped to numerical representations for seamless integration into machine learning models.
Techniques used for Categorical Encoding:
Mapping: Assigning meaningful integers to categories
Label Encoding: where converting categorical values to integers for binary variables and other categorical features.
# Define the bin edges and labels for SleepHours
bin_edges <- c(0, 4, 8, 12, 16, 20, 24)
bin_labels <- c('0-4', '4-8', '8-12', '12-16', '16-20', '20-24')
# Create a mapping of bin labels to integer values
SleepHours_int <- c('0-4' = 2, '4-8' = 6, '8-12' = 10, '12-16' = 14, '16-20' = 18, '20-24' = 22)
# Bin the SleepHours data
df_clean$SleepHoursBinned <- cut(df_clean$SleepHours, breaks = bin_edges, labels = bin_labels, right = TRUE, include.lowest = TRUE)
# Map the SleepHoursBinned categories to the corresponding integer values
df_clean$SleepHours_INT <- SleepHours_int[as.character(df_clean$SleepHoursBinned)]
# Define the bin edges and labels for PhysicalHealthDays
bin_edges <- c(0, 5, 10, 15, 20, 25, 30)
bin_labels <- c('0-5', '6-10', '11-15', '16-20', '21-25', '26-30')
# Create a mapping of bin labels to integer values
PhysicalHealthDays_int <- c('0-5' = 3, '6-10' = 8, '11-15' = 13, '16-20' = 18, '21-25' = 23, '26-30' = 28)
# Bin the PhysicalHealthDays data
df_clean$PhysicalHealthDaysBinned <- cut(df_clean$PhysicalHealthDays, breaks = bin_edges, labels = bin_labels, right = TRUE, include.lowest = TRUE)
# Map the PhysicalHealthDaysBinned categories to the corresponding integer values
df_clean$PhysicalHealthDays_INT <- PhysicalHealthDays_int[as.character(df_clean$PhysicalHealthDaysBinned)]
# Define the bin edges and labels for MentalHealthDays
bin_edges <- c(0, 5, 10, 15, 20, 25, 30)
bin_labels <- c('0-5', '6-10', '11-15', '16-20', '21-25', '26-30')
# Create a mapping of bin labels to integer values
MentalHealthDays_int <- c('0-5' = 3, '6-10' = 8, '11-15' = 13, '16-20' = 18, '21-25' = 23, '26-30' = 28)
# Bin the MentalHealthDays data
df_clean$MentalHealthDaysBinned <- cut(df_clean$MentalHealthDays, breaks = bin_edges, labels = bin_labels, right = TRUE, include.lowest = TRUE)
# Map the MentalHealthDaysBinned categories to the corresponding integer values
df_clean$MentalHealthDays_INT <- MentalHealthDays_int[as.character(df_clean$MentalHealthDaysBinned)]
# Define the bin edges and labels for HeightInMeters
bin_edges <- c(0.91, 1.20, 1.50, 1.80, 2.10, 2.41)
bin_labels <- c('0.91-1.20', '1.21-1.50', '1.51-1.80', '1.81-2.10', '2.11-2.41')
# Create a mapping of bin labels to integer values
HeightInMeters_int <- c('0.91-1.20' = 1, '1.21-1.50' = 2, '1.51-1.80' = 3, '1.81-2.10' = 4, '2.11-2.41' = 5)
# Bin the HeightInMeters data
df_clean$HeightInMetersBinned <- cut(df_clean$HeightInMeters, breaks = bin_edges, labels = bin_labels, right = FALSE)
# Map the HeightInMetersBinned categories to the corresponding integer values
df_clean$HeightInMeters_INT <- HeightInMeters_int[as.character(df_clean$HeightInMetersBinned)]
# Define the bin edges and labels for WeightInKilograms
bin_edges <- c(20.0, 50.0, 100.0, 150.0, 200.0, 250.0, 295.0)
bin_labels <- c('20.0-50.0', '50.1-100.0', '100.1-150.0', '150.1-200.0', '200.1-250.0', '250.1-295.0')
# Create a mapping of bin labels to integer values
WeightInKilograms_int <- c('20.0-50.0' = 1, '50.1-100.0' = 2, '100.1-150.0' = 3, '150.1-200.0' = 4, '200.1-250.0' = 5, '250.1-295.0' = 6)
# Bin the WeightInKilograms data
df_clean$WeightBinned <- cut(df_clean$WeightInKilograms, breaks = bin_edges, labels = bin_labels, right = TRUE, include.lowest = TRUE)
# Map the WeightBinned categories to the corresponding integer values
df_clean$Weight_INT <- WeightInKilograms_int[as.character(df_clean$WeightBinned)]
# Define the bin edges and labels for BMI
bin_edges <- c(0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 200.0)
bin_labels <- c('0.0-10.0', '10.0-20.0', '20.0-30.0', '30.0-40.0', '40.0-50.0', '50.0-60.0', '60.0-70.0', '70.0-80.0', '80.0-90.0', '90.0-100.0', '100.0-200.0')
# Create a mapping of bin labels to integer values
BMInt_map <- c('0.0-10.0' = 0, '10.0-20.0' = 1, '20.0-30.0' = 2, '30.0-40.0' = 3, '40.0-50.0' = 4, '50.0-60.0' = 5,
'60.0-70.0' = 6, '70.0-80.0' = 7, '80.0-90.0' = 8, '90.0-100.0' = 9, '100.0-200.0'= 10)
# Bin the BMI data
df_clean$BMIBinned <- cut(df_clean$BMI, breaks = bin_edges, labels = bin_labels, right = TRUE, include.lowest = TRUE)
# Map the BMIBinned categories to the corresponding integer values
df_clean$BMI_INT <- BMInt_map[as.character(df_clean$BMIBinned)]
# Create a mapping dictionary for Age Category
AgeCategory_int <- c('Age 80 or older' = 80,
'Age 75 to 79' = 77,
'Age 70 to 74' = 72,
'Age 65 to 69' = 67,
'Age 60 to 64' = 62,
'Age 55 to 59' = 57,
'Age 50 to 54' = 52,
'Age 45 to 49' = 47,
'Age 40 to 44' = 42,
'Age 35 to 39' = 37,
'Age 30 to 34' = 32,
'Age 25 to 29' = 27,
'Age 18 to 24' = 21)
# Map the AgeCategory categories to the corresponding values
df_clean$AgeCategory_INT <- AgeCategory_int[as.character(df_clean$AgeCategory)]
# Create a mapping dictionary for LastCheckupTime
LastCheckupTime_int <- c('Within past year (anytime less than 12 months ago)' = 0,
'Within past 2 years (1 year but less than 2 years ago)' = 1,
'Within past 5 years (2 years but less than 5 years ago)' = 2,
'5 or more years ago' = 3)
# Map the LastCheckupTime categories to the corresponding values
df_clean$LastCheckupTime_INT <- LastCheckupTime_int[as.character(df_clean$LastCheckupTime)]
# Create a mapping dictionary for RemovedTeeth
RemovedTeeth_int <- c('None of them' = 0,
'1 to 5' = 1,
'6 or more, but not all' = 2,
'All' = 3)
# Map the RemovedTeeth categories to the corresponding values
df_clean$RemovedTeeth_INT <- RemovedTeeth_int[as.character(df_clean$RemovedTeeth)]
# Create a mapping dictionary for RaceEthnicityCategory
RaceEthnicityCategory_int <- c('White only, Non-Hispanic' = 0,
'Black only, Non-Hispanic' = 1,
'Other race only, Non-Hispanic' = 2,
'Multiracial, Non-Hispanic' = 3,
'Hispanic' = 4)
# Map the RaceEthnicityCategory categories to the corresponding values
df_clean$RaceEthnicityCategory_INT <- RaceEthnicityCategory_int[as.character(df_clean$RaceEthnicityCategory)]
# Create a mapping dictionary for GeneralHealth
GeneralHealth_int <- c('Poor' = 0,
'Fair' = 1,
'Good' = 2,
'Very good' = 3,
'Excellent' = 4)
# Map the GeneralHealth categories to the corresponding values
df_clean$GeneralHealth_INT <- GeneralHealth_int[as.character(df_clean$GeneralHealth)]
# Create a mapping dictionary for SmokerStatus
SmokerStatus_int <- c('Never smoked' = 0,
'Former smoker' = 1,
'Current smoker - now smokes some days' = 2,
'Current smoker - now smokes every day' = 3)
# Map the SmokerStatus categories to the corresponding values
df_clean$SmokerStatus_INT <- SmokerStatus_int[as.character(df_clean$SmokerStatus)]
# Create a mapping dictionary for ECigaretteUsage
ECigaretteUsage_int <- c('Never used e-cigarettes in my entire life' = 0,
'Not at all (right now)' = 1,
'Use them some days' = 2,
'Use them every day' = 3)
# Map the ECigaretteUsage categories to the corresponding values
df_clean$ECigaretteUsage_INT <- ECigaretteUsage_int[as.character(df_clean$ECigaretteUsage)]
# Convert PhysicalActivities to a factor (category) and then to numeric codes
df_clean$PhysicalActivities_INT <- as.numeric(factor(df_clean$PhysicalActivities)) - 1
# Label Encoding for the Sex column
df_clean$Sex_INT <- as.integer(factor(df_clean$Sex))
# Define a list of condition columns
condition_list <- c('HadHeartAttack', 'HadAngina', 'HadStroke', 'HadAsthma', 'HadSkinCancer',
'HadCOPD', 'HadDepressiveDisorder', 'HadKidneyDisease', 'HadArthritis',
'HadDiabetes', 'DeafOrHardOfHearing', 'BlindOrVisionDifficulty',
'DifficultyConcentrating', 'DifficultyWalking', 'DifficultyDressingBathing',
'DifficultyErrands', 'ChestScan', 'AlcoholDrinkers', 'HIVTesting',
'FluVaxLast12', 'PneumoVaxEver', 'HighRiskLastYear', 'TetanusLast10Tdap', 'CovidPos')
# Loop through each condition and apply label encoding
for (condition in condition_list) {
df_clean[[paste0(condition, "_INT")]] <- as.integer(factor(df_clean[[condition]]))
}
missing_values <- colSums(is.na(df_clean))
print(missing_values)
## Sex GeneralHealth
## 0 0
## PhysicalHealthDays MentalHealthDays
## 0 0
## LastCheckupTime PhysicalActivities
## 0 0
## SleepHours RemovedTeeth
## 0 0
## HadHeartAttack HadAngina
## 0 0
## HadStroke HadAsthma
## 0 0
## HadSkinCancer HadCOPD
## 0 0
## HadDepressiveDisorder HadKidneyDisease
## 0 0
## HadArthritis HadDiabetes
## 0 0
## DeafOrHardOfHearing BlindOrVisionDifficulty
## 0 0
## DifficultyConcentrating DifficultyWalking
## 0 0
## DifficultyDressingBathing DifficultyErrands
## 0 0
## SmokerStatus ECigaretteUsage
## 0 0
## ChestScan RaceEthnicityCategory
## 0 0
## AgeCategory HeightInMeters
## 0 0
## WeightInKilograms BMI
## 0 0
## AlcoholDrinkers HIVTesting
## 0 0
## FluVaxLast12 PneumoVaxEver
## 0 0
## TetanusLast10Tdap HighRiskLastYear
## 0 0
## CovidPos SleepHoursBinned
## 0 0
## SleepHours_INT PhysicalHealthDaysBinned
## 0 0
## PhysicalHealthDays_INT MentalHealthDaysBinned
## 0 0
## MentalHealthDays_INT HeightInMetersBinned
## 0 0
## HeightInMeters_INT WeightBinned
## 0 0
## Weight_INT BMIBinned
## 0 0
## BMI_INT AgeCategory_INT
## 0 0
## LastCheckupTime_INT RemovedTeeth_INT
## 0 0
## RaceEthnicityCategory_INT GeneralHealth_INT
## 0 0
## SmokerStatus_INT ECigaretteUsage_INT
## 0 0
## PhysicalActivities_INT Sex_INT
## 0 0
## HadHeartAttack_INT HadAngina_INT
## 0 0
## HadStroke_INT HadAsthma_INT
## 0 0
## HadSkinCancer_INT HadCOPD_INT
## 0 0
## HadDepressiveDisorder_INT HadKidneyDisease_INT
## 0 0
## HadArthritis_INT HadDiabetes_INT
## 0 0
## DeafOrHardOfHearing_INT BlindOrVisionDifficulty_INT
## 0 0
## DifficultyConcentrating_INT DifficultyWalking_INT
## 0 0
## DifficultyDressingBathing_INT DifficultyErrands_INT
## 0 0
## ChestScan_INT AlcoholDrinkers_INT
## 0 0
## HIVTesting_INT FluVaxLast12_INT
## 0 0
## PneumoVaxEver_INT HighRiskLastYear_INT
## 0 0
## TetanusLast10Tdap_INT CovidPos_INT
## 0 0
The objective of the data analysis is to process the dataset to derive meaningful insights about heart attack risks and develop predictive models that can assist in identifying individuals at risk. The analysis involves both regression and classification approaches to answer the project objectives.
The goals of the analysis are to:
Understand the Relationships:
Build Predictive Modelling:
Classification Problem: Predict whether an individual is at “Low Risk” or “High Risk” of a heart attack.
Regression Problem: Predict the severity score or likelihood of a heart attack based on specific features.
This section visualizes the relationship between six critical factors—Sleep Hours, BMI (Body Mass Index), Age Categories, Physical Activities, Smoking Status, and Physical Health Days—and heart attack occurrences.
For each factor, the analysis includes:
Heart Attack Count: The number of individuals in each category who experienced a heart attack.
Heart Attack Ratio: The percentage of individuals in each category who had a heart attack.
The visualizations combine:
Bar Charts: Representing heart attack counts for each category.
Line Charts: Displaying the heart attack ratio as a percentage for each category.
Each factor highlights specific health behaviors, demographics, and physical conditions as critical contributors to heart attack risks. These insights can guide targeted interventions, promoting better health outcomes and reducing heart attack incidence in high-risk populations.
R packages including “ggplot2” and “dplyr” were loaded and plot the bar charts and line charts for visualisation.
Very low sleep hours (e.g., 1-4) and very high sleep hours (e.g., 17-20) show higher heart attack ratios, emphasizing the detrimental effects of insufficient or excessive sleep on heart health. Moderate sleep hours (e.g., 5-8) demonstrate lower heart attack ratios, aligning with recommended sleep durations for optimal health.
# Load necessary libraries
library(ggplot2)
library(dplyr)
# Define the function for SleepHours vs Heart Attack Analysis
plot_heart_attack_analysis <- function(df, col) {
# Calculate counts and ratios for each sleep hour category
analysis <- df %>%
group_by(!!sym(col)) %>%
summarise(
heart_attacks = sum(HadHeartAttack_INT, na.rm = TRUE),
total = n()
) %>%
mutate(
ratio = round((heart_attacks / total) * 100, 1)
)
# Create the plot
p <- ggplot(analysis, aes(x = as.factor(!!sym(col)))) +
# Heart attack bar
geom_bar(aes(y = heart_attacks, fill = "Heart Attacks"),
stat = "identity", position = "dodge", width = 0.5, color = "black") +
# Heart attack ratio line
geom_line(aes(y = ratio * max(heart_attacks) / 100, color = "Heart Attack Ratio (%)"),
linewidth = 1.2, group = 1) +
geom_point(aes(y = ratio * max(heart_attacks) / 100, color = "Heart Attack Ratio (%)"), size = 3) +
# Labels for bars and ratios
geom_text(aes(y = heart_attacks, label = heart_attacks),
vjust = -0.5, size = 3) +
geom_text(aes(y = ratio * max(heart_attacks) / 100, label = paste0(ratio, "%")),
vjust = -1, size = 3, color = "darkred") +
# Customize y-axes
scale_y_continuous(
name = "Heart Attack Count",
sec.axis = sec_axis(~ . * 100 / max(analysis$heart_attacks), name = "Heart Attack Ratio (%)")
) +
# Customize theme and aesthetics
labs(
title = paste(col, "vs Heart Attack Analysis"),
x = col,
y = "Heart Attack Count",
fill = "Legend",
color = "Legend"
) +
scale_fill_manual(values = c("Heart Attacks" = "lightcoral")) +
scale_color_manual(values = c("Heart Attack Ratio (%)" = "darkred")) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "top",
legend.title = element_blank()
)
return(p)
}
plot_heart_attack_analysis(df_clean, "SleepHours")
Heart attack risk steadily increases with age, with the highest ratios observed in the oldest age categories (e.g., 80 or older, 75-79). Younger age groups (18-24, 25-29) exhibit very low heart attack ratios, reflecting the lower risk in these populations.
plot_heart_attack_analysis(df_clean, "AgeCategory")
The heart attack ratio increases significantly as the number of poor physical health days rises. Individuals reporting 26-30 poor physical health days show the highest ratio (14.4%), indicating a strong correlation between physical health and heart disease.
plot_heart_attack_analysis(df_clean, "PhysicalHealthDays_INT")
Individuals reporting no physical activities show a significantly higher heart attack ratio compared to those engaging in physical activities. This emphasizes the protective effects of regular physical activity on heart health.
plot_heart_attack_analysis(df_clean, "PhysicalActivities")
Higher BMI categories (e.g., 61.0-70.0) show an increasing heart attack ratio, suggesting a positive correlation between obesity and heart health risks. Extremely high BMI categories (e.g., 81.0-90.0, 91.0-100.0) show a decrease in the ratio, likely due to insufficient sample size or other confounding factors.
# Call the function to plot BMI vs Heart Attack Analysis
plot <- plot_heart_attack_analysis(df_clean, "BMI_INT")
# Define the custom labels for BMI_INT
custom_labels <- c(
"10.0-20.0", "20.0-30.0", "30.0-40.0", "40.0-50.0",
"50.0-60.0", "60.0-70.0", "70.0-80.0", "80.0-90.0", "90.0-100.0"
)
# Update the plot to include custom x-axis labels
plot <- plot +
scale_x_discrete(labels = custom_labels) +
labs(x = "BMI Range")
# Display the updated plot
print(plot)
Current smokers (every day) show the highest heart attack ratio, followed by former smokers. Surprisingly, current smokers (some days) have a lower ratio, possibly due to under reporting or reduced exposure. Non-smokers exhibit the lowest risk.
plot_heart_attack_analysis(df_clean, "SmokerStatus")
Feature selection is a critical step in machine learning that helps in improving model performance by selecting the most relevant features, models are less likely to overfit or underfit and reducing dimensionality by eliminating irrelevant or redundant features to simplify the model and decreases computational complexity.
Methods Used:
1. Chi-Square Test (Chi2):
A statistical test used to measure the association between categorical features and the target variable.
Features with the higher Chi-Square scores (more relevant) have stronger association with the target variable.
BMI: The strongest association with heart attack occurrence.
PhysicalHealthDays: Significant impact on heart attack prevalence.
HadAngina: Strongly correlated with heart attack occurrence.
AgeCategory: Indicates the importance of age in predicting heart attack risk.
The top 27 features were selected for further analysis:
Health-Related Variables:
Lifestyle Variables:
Demographic Variables:
Vaccination and Preventive Health:
Other Variables:
R packages including “dplyr” and “caret” were downloaded and used for data manipulation and feature selection.
# Load necessary libraries
library(dplyr)
library(caret)
## Loading required package: lattice
# Assuming df1 is the data frame
data <- df_clean
# Define the target variable and features
X <- data %>% select(-HadHeartAttack, -HadHeartAttack_INT) # Select all features except the target variables
y <- data$HadHeartAttack # Target variable
# Convert all features and the target variable to factors (equivalent to LabelEncoder)
X[] <- lapply(X, factor) # Convert each column in X to a factor
y <- factor(y) # Convert target variable to factor
# Perform Chi-Square test for each feature
chi_square_results <- apply(X, 2, function(feature) {
# Use tryCatch to handle potential errors and skip problematic features
tryCatch({
chi_test <- chisq.test(table(feature, y))
c(Chi2_Score = chi_test$statistic, P_Value = chi_test$p.value)
}, error = function(e) {
# Return NA if an error occurs (e.g., if there are empty combinations)
return(c(Chi2_Score = NA, P_Value = NA))
})
})
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
# Convert the results to a data frame
chi2_df <- data.frame(
Feature = colnames(X),
Chi2_Score = chi_square_results[1,],
P_Value = chi_square_results[2,]
)
# Sort by Chi-Square score (descending order)
chi2_df <- chi2_df %>% arrange(desc(Chi2_Score))
# Print the results
print(chi2_df)
## Feature Chi2_Score
## HadAngina HadAngina 49142.83694
## HadAngina_INT HadAngina_INT 49142.83694
## GeneralHealth GeneralHealth 9923.94909
## GeneralHealth_INT GeneralHealth_INT 9923.94909
## AgeCategory AgeCategory 7991.87048
## AgeCategory_INT AgeCategory_INT 7991.87048
## HadStroke HadStroke 7778.62312
## HadStroke_INT HadStroke_INT 7778.62312
## RemovedTeeth RemovedTeeth 7053.06255
## RemovedTeeth_INT RemovedTeeth_INT 7053.06255
## ChestScan ChestScan 6957.54315
## ChestScan_INT ChestScan_INT 6957.54315
## DifficultyWalking DifficultyWalking 6322.24299
## DifficultyWalking_INT DifficultyWalking_INT 6322.24299
## BMI BMI 5619.53679
## HadDiabetes HadDiabetes 5381.49128
## HadDiabetes_INT HadDiabetes_INT 5381.49128
## PhysicalHealthDays PhysicalHealthDays 4524.09366
## HadCOPD HadCOPD 4396.04632
## HadCOPD_INT HadCOPD_INT 4396.04632
## PhysicalHealthDaysBinned PhysicalHealthDaysBinned 4227.70419
## PhysicalHealthDays_INT PhysicalHealthDays_INT 4227.70419
## PneumoVaxEver PneumoVaxEver 3562.14868
## PneumoVaxEver_INT PneumoVaxEver_INT 3562.14868
## HadArthritis HadArthritis 3449.14725
## HadArthritis_INT HadArthritis_INT 3449.14725
## HadKidneyDisease HadKidneyDisease 2964.82763
## HadKidneyDisease_INT HadKidneyDisease_INT 2964.82763
## DeafOrHardOfHearing DeafOrHardOfHearing 2374.07372
## DeafOrHardOfHearing_INT DeafOrHardOfHearing_INT 2374.07372
## SmokerStatus SmokerStatus 2255.89727
## SmokerStatus_INT SmokerStatus_INT 2255.89727
## DifficultyErrands DifficultyErrands 1972.73570
## DifficultyErrands_INT DifficultyErrands_INT 1972.73570
## WeightInKilograms WeightInKilograms 1825.37700
## DifficultyDressingBathing DifficultyDressingBathing 1704.60215
## DifficultyDressingBathing_INT DifficultyDressingBathing_INT 1704.60215
## PhysicalActivities PhysicalActivities 1696.65196
## PhysicalActivities_INT PhysicalActivities_INT 1696.65196
## SleepHours SleepHours 1668.82165
## Sex Sex 1359.03360
## Sex_INT Sex_INT 1359.03360
## BlindOrVisionDifficulty BlindOrVisionDifficulty 1322.02027
## BlindOrVisionDifficulty_INT BlindOrVisionDifficulty_INT 1322.02027
## AlcoholDrinkers AlcoholDrinkers 1312.82073
## AlcoholDrinkers_INT AlcoholDrinkers_INT 1312.82073
## LastCheckupTime LastCheckupTime 1238.81557
## LastCheckupTime_INT LastCheckupTime_INT 1238.81557
## SleepHoursBinned SleepHoursBinned 1110.38287
## SleepHours_INT SleepHours_INT 1110.38287
## DifficultyConcentrating DifficultyConcentrating 661.23857
## DifficultyConcentrating_INT DifficultyConcentrating_INT 661.23857
## MentalHealthDays MentalHealthDays 652.91197
## HadSkinCancer HadSkinCancer 610.66687
## HadSkinCancer_INT HadSkinCancer_INT 610.66687
## FluVaxLast12 FluVaxLast12 503.34442
## FluVaxLast12_INT FluVaxLast12_INT 503.34442
## HeightInMeters HeightInMeters 494.25712
## TetanusLast10Tdap TetanusLast10Tdap 422.24065
## TetanusLast10Tdap_INT TetanusLast10Tdap_INT 422.24065
## MentalHealthDaysBinned MentalHealthDaysBinned 373.45879
## MentalHealthDays_INT MentalHealthDays_INT 373.45879
## CovidPos CovidPos 214.08532
## CovidPos_INT CovidPos_INT 214.08532
## WeightBinned WeightBinned 211.71933
## Weight_INT Weight_INT 211.71933
## BMIBinned BMIBinned 198.96608
## BMI_INT BMI_INT 198.96608
## RaceEthnicityCategory RaceEthnicityCategory 198.02261
## RaceEthnicityCategory_INT RaceEthnicityCategory_INT 198.02261
## HadAsthma HadAsthma 140.32243
## HadAsthma_INT HadAsthma_INT 140.32243
## HadDepressiveDisorder HadDepressiveDisorder 137.15091
## HadDepressiveDisorder_INT HadDepressiveDisorder_INT 137.15091
## ECigaretteUsage ECigaretteUsage 113.27251
## ECigaretteUsage_INT ECigaretteUsage_INT 113.27251
## HighRiskLastYear HighRiskLastYear 108.46258
## HighRiskLastYear_INT HighRiskLastYear_INT 108.46258
## HeightInMetersBinned HeightInMetersBinned 81.82726
## HeightInMeters_INT HeightInMeters_INT 81.82726
## HIVTesting HIVTesting 54.27412
## HIVTesting_INT HIVTesting_INT 54.27412
## P_Value
## HadAngina 0.000000e+00
## HadAngina_INT 0.000000e+00
## GeneralHealth 0.000000e+00
## GeneralHealth_INT 0.000000e+00
## AgeCategory 0.000000e+00
## AgeCategory_INT 0.000000e+00
## HadStroke 0.000000e+00
## HadStroke_INT 0.000000e+00
## RemovedTeeth 0.000000e+00
## RemovedTeeth_INT 0.000000e+00
## ChestScan 0.000000e+00
## ChestScan_INT 0.000000e+00
## DifficultyWalking 0.000000e+00
## DifficultyWalking_INT 0.000000e+00
## BMI 1.327513e-47
## HadDiabetes 0.000000e+00
## HadDiabetes_INT 0.000000e+00
## PhysicalHealthDays 0.000000e+00
## HadCOPD 0.000000e+00
## HadCOPD_INT 0.000000e+00
## PhysicalHealthDaysBinned 0.000000e+00
## PhysicalHealthDays_INT 0.000000e+00
## PneumoVaxEver 0.000000e+00
## PneumoVaxEver_INT 0.000000e+00
## HadArthritis 0.000000e+00
## HadArthritis_INT 0.000000e+00
## HadKidneyDisease 0.000000e+00
## HadKidneyDisease_INT 0.000000e+00
## DeafOrHardOfHearing 0.000000e+00
## DeafOrHardOfHearing_INT 0.000000e+00
## SmokerStatus 0.000000e+00
## SmokerStatus_INT 0.000000e+00
## DifficultyErrands 0.000000e+00
## DifficultyErrands_INT 0.000000e+00
## WeightInKilograms 8.181677e-140
## DifficultyDressingBathing 0.000000e+00
## DifficultyDressingBathing_INT 0.000000e+00
## PhysicalActivities 0.000000e+00
## PhysicalActivities_INT 0.000000e+00
## SleepHours 0.000000e+00
## Sex 1.677291e-297
## Sex_INT 1.677291e-297
## BlindOrVisionDifficulty 1.853268e-289
## BlindOrVisionDifficulty_INT 1.853268e-289
## AlcoholDrinkers 1.849729e-287
## AlcoholDrinkers_INT 1.849729e-287
## LastCheckupTime 2.775940e-268
## LastCheckupTime_INT 2.775940e-268
## SleepHoursBinned 7.544387e-238
## SleepHours_INT 7.544387e-238
## DifficultyConcentrating 8.034817e-146
## DifficultyConcentrating_INT 8.034817e-146
## MentalHealthDays 3.119448e-118
## HadSkinCancer 8.011317e-135
## HadSkinCancer_INT 8.011317e-135
## FluVaxLast12 1.779495e-111
## FluVaxLast12_INT 1.779495e-111
## HeightInMeters 1.572055e-65
## TetanusLast10Tdap 3.367854e-91
## TetanusLast10Tdap_INT 3.367854e-91
## MentalHealthDaysBinned 1.552816e-78
## MentalHealthDays_INT 1.552816e-78
## CovidPos 3.250595e-47
## CovidPos_INT 3.250595e-47
## WeightBinned 8.817200e-44
## Weight_INT 8.817200e-44
## BMIBinned 2.651463e-37
## BMI_INT 2.651463e-37
## RaceEthnicityCategory 9.999654e-42
## RaceEthnicityCategory_INT 9.999654e-42
## HadAsthma 2.263116e-32
## HadAsthma_INT 2.263116e-32
## HadDepressiveDisorder 1.117607e-31
## HadDepressiveDisorder_INT 1.117607e-31
## ECigaretteUsage 2.167571e-24
## ECigaretteUsage_INT 2.167571e-24
## HighRiskLastYear 2.128303e-25
## HighRiskLastYear_INT 2.128303e-25
## HeightInMetersBinned 1.244627e-17
## HeightInMeters_INT 1.244627e-17
## HIVTesting 1.743832e-13
## HIVTesting_INT 1.743832e-13
# Load necessary libraries
library(ggplot2)
# Ensure Chi2_Score is numeric and remove NA values
chi2_df$Chi2_Score <- as.numeric(chi2_df$Chi2_Score)
# Remove rows with NA in either Chi2_Score or Feature
chi2_df_clean <- chi2_df[!is.na(chi2_df$Chi2_Score) & !is.na(chi2_df$Feature), ]
# Sort the data by Chi2_Score in descending order and select the top 10
top_10_features <- chi2_df_clean[order(-chi2_df_clean$Chi2_Score), ][1:10, ]
# Create the plot for the top 10 features
ggplot(top_10_features, aes(x = Chi2_Score, y = reorder(Feature, Chi2_Score))) +
geom_bar(stat = "identity", fill = "skyblue") + # Horizontal bar plot
xlab("Chi2 Score") +
ylab("Feature") +
ggtitle("Top 10 Feature Importance (Chi2)") +
theme_minimal() +
theme(axis.text.y = element_text(size = 10)) # Adjust text size if needed
# Select relevant columns from df_clean
X <- df_clean %>%
select(
BMI, WeightInKilograms, AgeCategory_INT, HadAngina_INT, HadDiabetes_INT,
HadStroke_INT, DifficultyWalking_INT, HadCOPD_INT, ChestScan_INT,
HadKidneyDisease_INT, HadArthritis_INT, DeafOrHardOfHearing_INT,
PneumoVaxEver_INT, DifficultyErrands_INT, DifficultyDressingBathing_INT,
BlindOrVisionDifficulty_INT, AlcoholDrinkers_INT, DifficultyConcentrating_INT,
HadSkinCancer_INT, PhysicalActivities_INT, Sex_INT, PhysicalHealthDays_INT,
RemovedTeeth_INT, GeneralHealth_INT, MentalHealthDays_INT, LastCheckupTime_INT,
SmokerStatus_INT
)
library(dplyr)
library(caret)
df_clean$HadHeartAttack <- factor(df_clean$HadHeartAttack, levels = c("No", "Yes"))
y <- df_clean$HadHeartAttack # Target variable
# Convert all features and the target variable to factors (equivalent to LabelEncoder)
X[] <- lapply(X, factor) # Convert each column in X to a factor
y <- factor(y) # Convert target variable to factor
# Perform Chi-Square test for each feature
chi_square_results <- apply(X, 2, function(feature) {
# Use tryCatch to handle potential errors and skip problematic features
tryCatch({
chi_test <- chisq.test(table(feature, y))
c(Chi2_Score = chi_test$statistic, P_Value = chi_test$p.value)
}, error = function(e) {
# Return NA if an error occurs (e.g., if there are empty combinations)
return(c(Chi2_Score = NA, P_Value = NA))
})
})
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(table(feature, y)): Chi-squared approximation may be
## incorrect
# Convert the results to a data frame
chi2_df <- data.frame(
Feature = colnames(X),
Chi2_Score = chi_square_results[1,],
P_Value = chi_square_results[2,]
)
# Sort by Chi-Square score (descending order)
chi2_df <- chi2_df %>% arrange(desc(Chi2_Score))
# Print the results
print(chi2_df)
## Feature Chi2_Score
## HadAngina_INT HadAngina_INT 49142.8369
## GeneralHealth_INT GeneralHealth_INT 9923.9491
## AgeCategory_INT AgeCategory_INT 7991.8705
## HadStroke_INT HadStroke_INT 7778.6231
## RemovedTeeth_INT RemovedTeeth_INT 7053.0625
## ChestScan_INT ChestScan_INT 6957.5431
## DifficultyWalking_INT DifficultyWalking_INT 6322.2430
## BMI BMI 5619.5368
## HadDiabetes_INT HadDiabetes_INT 5381.4913
## HadCOPD_INT HadCOPD_INT 4396.0463
## PhysicalHealthDays_INT PhysicalHealthDays_INT 4227.7042
## PneumoVaxEver_INT PneumoVaxEver_INT 3562.1487
## HadArthritis_INT HadArthritis_INT 3449.1472
## HadKidneyDisease_INT HadKidneyDisease_INT 2964.8276
## DeafOrHardOfHearing_INT DeafOrHardOfHearing_INT 2374.0737
## SmokerStatus_INT SmokerStatus_INT 2255.8973
## DifficultyErrands_INT DifficultyErrands_INT 1972.7357
## WeightInKilograms WeightInKilograms 1825.3770
## DifficultyDressingBathing_INT DifficultyDressingBathing_INT 1704.6022
## PhysicalActivities_INT PhysicalActivities_INT 1696.6520
## Sex_INT Sex_INT 1359.0336
## BlindOrVisionDifficulty_INT BlindOrVisionDifficulty_INT 1322.0203
## AlcoholDrinkers_INT AlcoholDrinkers_INT 1312.8207
## LastCheckupTime_INT LastCheckupTime_INT 1238.8156
## DifficultyConcentrating_INT DifficultyConcentrating_INT 661.2386
## HadSkinCancer_INT HadSkinCancer_INT 610.6669
## MentalHealthDays_INT MentalHealthDays_INT 373.4588
## P_Value
## HadAngina_INT 0.000000e+00
## GeneralHealth_INT 0.000000e+00
## AgeCategory_INT 0.000000e+00
## HadStroke_INT 0.000000e+00
## RemovedTeeth_INT 0.000000e+00
## ChestScan_INT 0.000000e+00
## DifficultyWalking_INT 0.000000e+00
## BMI 1.327513e-47
## HadDiabetes_INT 0.000000e+00
## HadCOPD_INT 0.000000e+00
## PhysicalHealthDays_INT 0.000000e+00
## PneumoVaxEver_INT 0.000000e+00
## HadArthritis_INT 0.000000e+00
## HadKidneyDisease_INT 0.000000e+00
## DeafOrHardOfHearing_INT 0.000000e+00
## SmokerStatus_INT 0.000000e+00
## DifficultyErrands_INT 0.000000e+00
## WeightInKilograms 8.181677e-140
## DifficultyDressingBathing_INT 0.000000e+00
## PhysicalActivities_INT 0.000000e+00
## Sex_INT 1.677291e-297
## BlindOrVisionDifficulty_INT 1.853268e-289
## AlcoholDrinkers_INT 1.849729e-287
## LastCheckupTime_INT 2.775940e-268
## DifficultyConcentrating_INT 8.034817e-146
## HadSkinCancer_INT 8.011317e-135
## MentalHealthDays_INT 1.552816e-78
The modeling phase focuses on developing predictive models to address the identified research questions. This involves building and evaluating both classification and regression models, ensuring robust and interpretable results.
1. Data Splitting:
2. Model Training:
A Random Forest Ranger was used with the parameters below:
num.trees = 100: 100 decision trees in the forest.
mtry = 3: 3 features considered at each split
importance = ‘impurity’:Computes feature importance based on the decrease in impurity.
probability = TRUE: Enables probability predictions for classification.
R packages including “ranger”, “pROC” and “caret” were downloaded and used for building, evaluating, and interpreting a Random Forest model.
# Load necessary libraries
library(ranger)
library(caret)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Assuming X and y are already defined
# Split the data into training and test sets
set.seed(42)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[trainIndex, ]
y_train <- y[trainIndex]
X_test <- X[-trainIndex, ]
y_test <- y[-trainIndex]
# Convert X_train and y_train to data frames (if they are not already)
X_train <- as.data.frame(X_train)
y_train <- as.data.frame(y_train)
# Combine the features and target variable into a single data frame for training
train_data <- data.frame(y_train, X_train)
# Train the Random Forest model using ranger with probability enabled
rf_model_ranger <- ranger(
formula = y_train ~ ., # Formula notation: target variable ~ predictors
data = train_data, # Use the combined data frame
num.trees = 100, # Number of trees in the forest
mtry = 3, # Number of variables to consider at each split
importance = 'impurity', # Feature importance method
probability = TRUE, # Enable probability predictions
)
# Print the model
print(rf_model_ranger)
## Ranger result
##
## Call:
## ranger(formula = y_train ~ ., data = train_data, num.trees = 100, mtry = 3, importance = "impurity", probability = TRUE, )
##
## Type: Probability estimation
## Number of trees: 100
## Sample size: 197765
## Number of independent variables: 27
## Mtry: 3
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.03947575
# Predict on the test set
y_pred_ranger <- predict(rf_model_ranger, data = X_test)$predictions
# Since it’s binary classification, y_pred_ranger will be a matrix where
# the second column contains the probabilities for class 1 (positive class)
y_proba <- y_pred_ranger[, 2] # Probabilities for the positive class (class 1)
# Adjust predictions using a threshold (e.g., 0.3)
threshold <- 0.3
y_pred_adjusted <- ifelse(y_proba >= threshold, "Yes", "No")
y_pred_adjusted <- factor(y_pred_adjusted, levels = levels(factor(y_test)))
# Evaluate the performance using confusion matrix
confusion_ranger <- confusionMatrix(as.factor(y_pred_adjusted), as.factor(y_test))
print(confusion_ranger)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 45356 1496
## Yes 1398 1190
##
## Accuracy : 0.9415
## 95% CI : (0.9394, 0.9435)
## No Information Rate : 0.9457
## P-Value [Acc > NIR] : 0.99998
##
## Kappa : 0.4204
##
## Mcnemar's Test P-Value : 0.07137
##
## Sensitivity : 0.9701
## Specificity : 0.4430
## Pos Pred Value : 0.9681
## Neg Pred Value : 0.4598
## Prevalence : 0.9457
## Detection Rate : 0.9174
## Detection Prevalence : 0.9477
## Balanced Accuracy : 0.7066
##
## 'Positive' Class : No
##
# Feature importance
feature_importances <- rf_model_ranger$variable.importance
print(feature_importances)
## BMI WeightInKilograms
## 1056.63979 949.71666
## AgeCategory_INT HadAngina_INT
## 667.25577 2887.25066
## HadDiabetes_INT HadStroke_INT
## 249.93391 336.83182
## DifficultyWalking_INT HadCOPD_INT
## 201.75906 152.93841
## ChestScan_INT HadKidneyDisease_INT
## 251.48779 139.78377
## HadArthritis_INT DeafOrHardOfHearing_INT
## 152.89513 141.70842
## PneumoVaxEver_INT DifficultyErrands_INT
## 163.35246 112.41947
## DifficultyDressingBathing_INT BlindOrVisionDifficulty_INT
## 86.88416 114.87921
## AlcoholDrinkers_INT DifficultyConcentrating_INT
## 148.35171 122.35758
## HadSkinCancer_INT PhysicalActivities_INT
## 127.84722 146.31240
## Sex_INT PhysicalHealthDays_INT
## 214.77660 302.23701
## RemovedTeeth_INT GeneralHealth_INT
## 423.21381 488.63699
## MentalHealthDays_INT LastCheckupTime_INT
## 243.43797 131.95551
## SmokerStatus_INT
## 296.09221
# Plot feature importance
barplot(feature_importances, main = "Feature Importances", col = "blue", las = 2)
# Calculate ROC-AUC
roc_curve <- roc(as.factor(y_test), y_proba)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
roc_auc <- auc(roc_curve)
# Print ROC-AUC
cat("ROC-AUC:", roc_auc, "\n")
## ROC-AUC: 0.8818473
# Print model summary
print("model_classification")
## [1] "model_classification"
# Predict the outcome on the same dataset (for simplicity)
# Define control for repeated k-fold cross-validation
control <- trainControl( method = "repeatedcv",
number = 5, # Number of folds
repeats = 3, # Number of repetitions
verboseIter = TRUE # Show training progress
)
tuneGrid <- expand.grid( mtry = c(2, 3, 4),
num.trees = c(100, 200)
)
model_classification <- ranger( formula = y_train ~ .,
data = data.frame(X_train, y_train),
num.trees = 100,
importance = "impurity"
)
## Growing trees.. Progress: 90%. Estimated remaining time: 3 seconds.
predictions <- predict(model_classification, data = X_test)$predictions
# Compare predicted values with actual values
predictions <- factor(predictions, levels = c("No", "Yes"))
confusion_matrix <- confusionMatrix(predictions, y_test)
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 46289 2149
## Yes 465 537
##
## Accuracy : 0.9471
## 95% CI : (0.9451, 0.9491)
## No Information Rate : 0.9457
## P-Value [Acc > NIR] : 0.07755
##
## Kappa : 0.2697
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9901
## Specificity : 0.1999
## Pos Pred Value : 0.9556
## Neg Pred Value : 0.5359
## Prevalence : 0.9457
## Detection Rate : 0.9363
## Detection Prevalence : 0.9797
## Balanced Accuracy : 0.5950
##
## 'Positive' Class : No
##
# Print model summary
print(model_classification)# Get feature importance
## Ranger result
##
## Call:
## ranger(formula = y_train ~ ., data = data.frame(X_train, y_train), num.trees = 100, importance = "impurity")
##
## Type: Classification
## Number of trees: 100
## Sample size: 197765
## Number of independent variables: 27
## Mtry: 5
## Target node size: 1
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error: 5.20 %
importance(model_classification)
## BMI WeightInKilograms
## 2618.1375 2232.6347
## AgeCategory_INT HadAngina_INT
## 1256.0165 3369.7638
## HadDiabetes_INT HadStroke_INT
## 402.9200 381.9373
## DifficultyWalking_INT HadCOPD_INT
## 290.9224 267.0708
## ChestScan_INT HadKidneyDisease_INT
## 346.4512 234.9295
## HadArthritis_INT DeafOrHardOfHearing_INT
## 356.6207 283.6256
## PneumoVaxEver_INT DifficultyErrands_INT
## 331.5210 204.5603
## DifficultyDressingBathing_INT BlindOrVisionDifficulty_INT
## 155.5923 214.2703
## AlcoholDrinkers_INT DifficultyConcentrating_INT
## 384.0180 249.2714
## HadSkinCancer_INT PhysicalActivities_INT
## 270.9854 358.9717
## Sex_INT PhysicalHealthDays_INT
## 307.9428 576.6549
## RemovedTeeth_INT GeneralHealth_INT
## 753.3156 791.5328
## MentalHealthDays_INT LastCheckupTime_INT
## 480.2359 251.1312
## SmokerStatus_INT
## 621.0829
# Calculate the risk_score based on the categories
df_clean <- df_clean %>%
mutate(risk_score = BMI_INT + Weight_INT + AgeCategory_INT + HadAngina_INT + HadDiabetes_INT +
HadStroke_INT + DifficultyWalking_INT + HadCOPD_INT + ChestScan_INT +
HadKidneyDisease_INT + HadArthritis_INT + DeafOrHardOfHearing_INT +
PneumoVaxEver_INT + DifficultyErrands_INT + DifficultyDressingBathing_INT +
BlindOrVisionDifficulty_INT + AlcoholDrinkers_INT + DifficultyConcentrating_INT +
HadSkinCancer_INT + PhysicalActivities_INT + Sex_INT + PhysicalHealthDays_INT +
RemovedTeeth_INT + GeneralHealth_INT + MentalHealthDays_INT + LastCheckupTime_INT + SmokerStatus_INT)
# View the dataset with the new risk_score variable
print(df_clean)
## # A tibble: 247,205 × 85
## Sex GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Female Very good 4 0 Within past year (a…
## 2 Male Very good 0 0 Within past year (a…
## 3 Male Very good 0 0 Within past year (a…
## 4 Female Fair 5 0 Within past year (a…
## 5 Female Good 3 15 Within past year (a…
## 6 Male Good 0 0 Within past year (a…
## 7 Female Good 3 0 Within past year (a…
## 8 Male Fair 5 0 Within past year (a…
## 9 Male Good 2 0 5 or more years ago
## 10 Female Very good 0 0 Within past year (a…
## # ℹ 247,195 more rows
## # ℹ 80 more variables: PhysicalActivities <chr>, SleepHours <dbl>,
## # RemovedTeeth <chr>, HadHeartAttack <fct>, HadAngina <chr>, HadStroke <chr>,
## # HadAsthma <chr>, HadSkinCancer <chr>, HadCOPD <chr>,
## # HadDepressiveDisorder <chr>, HadKidneyDisease <chr>, HadArthritis <chr>,
## # HadDiabetes <chr>, DeafOrHardOfHearing <chr>,
## # BlindOrVisionDifficulty <chr>, DifficultyConcentrating <chr>, …
# Prepare the data for training
XR <- df_clean %>%
select(risk_score) # Only selecting the risk_score column as predictor
# Create train and test splits
XR_train <- XR[trainIndex, ]
yR_train <- y[trainIndex]
XR_test <- XR[-trainIndex, ]
yR_test <- y[-trainIndex]
# Ensure XR_train and yR_train are data frames
XR_train <- as.data.frame(XR_train)
yR_train <- as.data.frame(yR_train)
# Train a Random Forest model for regression
model_regression <- train(yR_train ~ risk_score,
data = data.frame(XR_train, yR_train),
method = "glm", # Multinomial logistic regression
trControl = control)# View the model summary
## + Fold1.Rep1: parameter=none
## - Fold1.Rep1: parameter=none
## + Fold2.Rep1: parameter=none
## - Fold2.Rep1: parameter=none
## + Fold3.Rep1: parameter=none
## - Fold3.Rep1: parameter=none
## + Fold4.Rep1: parameter=none
## - Fold4.Rep1: parameter=none
## + Fold5.Rep1: parameter=none
## - Fold5.Rep1: parameter=none
## + Fold1.Rep2: parameter=none
## - Fold1.Rep2: parameter=none
## + Fold2.Rep2: parameter=none
## - Fold2.Rep2: parameter=none
## + Fold3.Rep2: parameter=none
## - Fold3.Rep2: parameter=none
## + Fold4.Rep2: parameter=none
## - Fold4.Rep2: parameter=none
## + Fold5.Rep2: parameter=none
## - Fold5.Rep2: parameter=none
## + Fold1.Rep3: parameter=none
## - Fold1.Rep3: parameter=none
## + Fold2.Rep3: parameter=none
## - Fold2.Rep3: parameter=none
## + Fold3.Rep3: parameter=none
## - Fold3.Rep3: parameter=none
## + Fold4.Rep3: parameter=none
## - Fold4.Rep3: parameter=none
## + Fold5.Rep3: parameter=none
## - Fold5.Rep3: parameter=none
## Aggregating results
## Fitting final model on full training set
summary(model_regression)# Make predictions using the model
##
## Call:
## NULL
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.7295099 0.0573577 -134.76 <2e-16 ***
## risk_score 0.0461178 0.0004961 92.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 83488 on 197764 degrees of freedom
## Residual deviance: 73983 on 197763 degrees of freedom
## AIC: 73987
##
## Number of Fisher Scoring iterations: 6
predictions <- predict(model_regression, newdata = data.frame(XR_test, yR_test))
# Evaluate model performance (confusion matrix)
confusion_matrix <- confusionMatrix(predictions, yR_test)
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 46730 2673
## Yes 24 13
##
## Accuracy : 0.9454
## 95% CI : (0.9434, 0.9474)
## No Information Rate : 0.9457
## P-Value [Acc > NIR] : 0.5913
##
## Kappa : 0.0081
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99949
## Specificity : 0.00484
## Pos Pred Value : 0.94589
## Neg Pred Value : 0.35135
## Prevalence : 0.94567
## Detection Rate : 0.94519
## Detection Prevalence : 0.99925
## Balanced Accuracy : 0.50216
##
## 'Positive' Class : No
##
# Save the best trained model using save() function
saveRDS(rf_model_ranger, file = "random_forest_model.rds")
print("Model saved successfully!")
## [1] "Model saved successfully!"
Evaluation ensures that the deployed model meets performance expectations in real-world scenarios. The following steps describe key evaluation metrics and methods:
Performance Metrics: Evaluate the model on real-world data using:
Accuracy: Proportion of correctly classified cases.
Precision: Proportion of predicted positives that are true positives.
Recall (Sensitivity): Proportion of actual positives that are correctly identified.
F1-Score: Balance between precision and recall.
ROC-AUC Score: Measures the model’s ability to distinguish between classes.
Monitoring Real-World Performance:
Data Drift: Monitor if the input data characteristics change over time, which may degrade model performance.
Prediction Drift: Ensure that predicted probabilities remain consistent with real-world trends.
Feedback Loop: Incorporate feedback from end-users (e.g., clinicians) to improve the model. Use new labeled data to periodically retrain and update the model.
Additional packages like smotefamily, rpart, ROSE were downloaded to treat inbalanced data, etc.
library(caret)
library(smotefamily) # For SMOTE functionality
library(rpart)
library(pROC) # For ROC-AUC
library(ROSE) # For random under-sampling
## Loaded ROSE 0.0-4
# Set seed for reproducibility
set.seed(42)
# Split the dataset into training and testing sets
trainIndex <- createDataPartition(y, p = 0.7, list = FALSE) # 70% train, 30% test
X_train <- X[trainIndex, ]
y_train <- y[trainIndex]
X_test <- X[-trainIndex, ]
y_test <- y[-trainIndex]
# Combine features and target into a data frame for random under-sampling
train_data <- data.frame(X_train, y_train)
# Apply Random Undersampling using ROSE
train_data_resampled <- ROSE(y_train ~ ., data = train_data, seed = 42)$data
X_train_resampled <- train_data_resampled[, -ncol(train_data_resampled)] # Features
y_train_resampled <- train_data_resampled[, ncol(train_data_resampled)] # Target
# Train a Decision Tree model with class weights (balanced)
dt_model <- rpart(y_train_resampled ~ ., data = cbind(X_train_resampled, y_train_resampled),
method = "class", control = rpart.control(cp = 0))
# Predict on the test set
y_pred <- predict(dt_model, X_test, type = "class")
# Evaluate the model: confusion matrix
conf_matrix <- confusionMatrix(as.factor(y_pred), as.factor(y_test))
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 61391 1887
## Yes 8741 2142
##
## Accuracy : 0.8567
## 95% CI : (0.8541, 0.8592)
## No Information Rate : 0.9457
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2259
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8754
## Specificity : 0.5316
## Pos Pred Value : 0.9702
## Neg Pred Value : 0.1968
## Prevalence : 0.9457
## Detection Rate : 0.8278
## Detection Prevalence : 0.8533
## Balanced Accuracy : 0.7035
##
## 'Positive' Class : No
##
# ROC-AUC: Probability of the positive class
y_prob <- predict(dt_model, X_test, type = "prob")[, 2]
# Compute the ROC curve
roc_curve <- roc(as.numeric(as.factor(y_test)), y_prob)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
# Calculate the AUC (Area Under the Curve)
roc_auc <- auc(roc_curve)
# Print the ROC-AUC score
print(paste("ROC-AUC Score:", roc_auc))
## [1] "ROC-AUC Score: 0.72966603259659"
# Load required libraries
library(caret) # For data splitting and evaluation
library(ROSE) # For random under-sampling
library(rpart) # For decision tree model
library(pROC) # For ROC-AUC calculation
# Assuming df1 is your data frame and y is your target variable
# Select features (X1)
X1 <- df_clean[, c('PhysicalHealthDays', 'MentalHealthDays', 'SleepHours',
'HadArthritis', 'AgeCategory', 'HeightInMeters', 'WeightInKilograms',
'BMI', 'AlcoholDrinkers', 'HIVTesting', 'FluVaxLast12',
'TetanusLast10Tdap', 'PhysicalHealthDaysBinned', 'AgeCategory_INT',
'RemovedTeeth_INT', 'RaceEthnicityCategory_INT', 'GeneralHealth_INT',
'SmokerStatus_INT', 'ECigaretteUsage_INT', 'HadAngina_INT',
'HadDiabetes_INT', 'PneumoVaxEver_INT', 'TetanusLast10Tdap_INT',
'CovidPos_INT')]
# Encode all columns as numeric (similar to LabelEncoder in Python)
X1[] <- lapply(X1, function(col) as.integer(as.factor(col)))
# Split data into training and testing sets (70% training, 30% testing)
set.seed(42)
trainIndex <- createDataPartition(y, p = 0.7, list = FALSE)
X1_train <- X1[trainIndex, ]
y_train <- y[trainIndex]
X1_test <- X1[-trainIndex, ]
y_test <- y[-trainIndex]
# Combine features and target for random under-sampling
train_data <- data.frame(X1_train, y_train)
# Apply Random Under-Sampling using ROSE
train_data_resampled <- ROSE(y_train ~ ., data = train_data, seed = 42)$data
X1_train_resampled <- train_data_resampled[, -ncol(train_data_resampled)]
y_train_resampled <- train_data_resampled[, ncol(train_data_resampled)]
# Train a Decision Tree model
dt_model <- rpart(y_train_resampled ~ .,
data = data.frame(X1_train_resampled, y_train_resampled),
method = "class", control = rpart.control(cp = 0))
# Predict on the test set
y_pred <- predict(dt_model, X1_test, type = "class")
# Evaluate the model: Confusion Matrix
conf_matrix <- confusionMatrix(as.factor(y_pred), as.factor(y_test))
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 66633 2123
## Yes 3499 1906
##
## Accuracy : 0.9242
## 95% CI : (0.9223, 0.9261)
## No Information Rate : 0.9457
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3645
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9501
## Specificity : 0.4731
## Pos Pred Value : 0.9691
## Neg Pred Value : 0.3526
## Prevalence : 0.9457
## Detection Rate : 0.8985
## Detection Prevalence : 0.9271
## Balanced Accuracy : 0.7116
##
## 'Positive' Class : No
##
# ROC-AUC: Probability of the positive class
y_prob <- predict(dt_model, X1_test, type = "prob")[, 2]
# Compute ROC Curve
roc_curve <- roc(as.numeric(as.factor(y_test)), y_prob)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
# Calculate AUC (Area Under the Curve)
roc_auc <- auc(roc_curve)
# Print ROC-AUC Score
print(paste("ROC-AUC Score:", roc_auc))
## [1] "ROC-AUC Score: 0.834802402962937"
# Load required libraries
library(caret) # For data splitting and evaluation
library(rpart) # For decision tree model
library(pROC) # For ROC-AUC calculation
# Assuming X is the feature data frame and y is the target variable
# Split data into training and testing sets (70% training, 30% testing)
set.seed(42)
trainIndex <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X[trainIndex, ]
y_train <- y[trainIndex]
X_test <- X[-trainIndex, ]
y_test <- y[-trainIndex]
# Combine training features and target for model training
train_data <- data.frame(X_train, y_train)
# Train a Decision Tree model
dt_model <- rpart(y_train ~ ., data = train_data, method = "class", control = rpart.control(cp = 0))
# Predict on the test set
y_pred <- predict(dt_model, X_test, type = "class")
# Evaluate the model: Confusion Matrix
conf_matrix <- confusionMatrix(as.factor(y_pred), as.factor(y_test))
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 68088 2937
## Yes 2044 1092
##
## Accuracy : 0.9328
## 95% CI : (0.931, 0.9346)
## No Information Rate : 0.9457
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2701
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9709
## Specificity : 0.2710
## Pos Pred Value : 0.9586
## Neg Pred Value : 0.3482
## Prevalence : 0.9457
## Detection Rate : 0.9181
## Detection Prevalence : 0.9577
## Balanced Accuracy : 0.6209
##
## 'Positive' Class : No
##
# ROC-AUC: Probability of the positive class
y_prob <- predict(dt_model, X_test, type = "prob")[, 2]
# Compute ROC Curve
roc_curve <- roc(as.numeric(as.factor(y_test)), y_prob)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
# Calculate AUC (Area Under the Curve)
roc_auc <- auc(roc_curve)
# Print ROC-AUC Score
print(paste("ROC-AUC Score:", roc_auc))
## [1] "ROC-AUC Score: 0.704005223946952"
# Load required libraries
library(caret) # For data splitting and evaluation
library(e1071) # For Naive Bayes model
library(pROC) # For ROC-AUC calculation
# Assuming X is the feature data frame and y is the target variable
# Split data into training and testing sets (70% training, 30% testing)
set.seed(42)
trainIndex <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X[trainIndex, ]
y_train <- y[trainIndex]
X_test <- X[-trainIndex, ]
y_test <- y[-trainIndex]
# Train the Naive Bayes model
nb_model <- naiveBayes(x = X_train, y = as.factor(y_train))
# Predict on the test set
y_pred <- predict(nb_model, X_test)
# Accuracy
accuracy <- sum(y_pred == y_test) / length(y_test)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.874732
# Classification report (Confusion Matrix)
conf_matrix <- confusionMatrix(as.factor(y_pred), as.factor(y_test))
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 62506 1664
## Yes 7626 2365
##
## Accuracy : 0.8747
## 95% CI : (0.8723, 0.8771)
## No Information Rate : 0.9457
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2818
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8913
## Specificity : 0.5870
## Pos Pred Value : 0.9741
## Neg Pred Value : 0.2367
## Prevalence : 0.9457
## Detection Rate : 0.8428
## Detection Prevalence : 0.8653
## Balanced Accuracy : 0.7391
##
## 'Positive' Class : No
##
# ROC-AUC: Probability of the positive class
y_prob <- predict(nb_model, X_test, type = "raw")[, 2] # Get probabilities for the positive class
# Compute ROC Curve
roc_curve <- roc(as.numeric(as.factor(y_test)), y_prob)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
# Calculate AUC (Area Under the Curve)
roc_auc <- auc(roc_curve)
# Print ROC-AUC Score
cat("ROC-AUC Score:", roc_auc, "\n")
## ROC-AUC Score: 0.8635652
library(caret) # For confusionMatrix
library(pROC) # For ROC-AUC
evaluation <- function(model, X_test, y_test) {
# Predict class labels
y_pred <- predict(model, X_test, type = "class")
# Predict probabilities (if available)
y_proba <- predict(model, X_test, type = "raw")[, 2] # Probability for the positive class
# Accuracy
accuracy <- sum(y_pred == y_test) / length(y_test)
# Confusion matrix for precision, recall, F1-Score
conf_matrix <- confusionMatrix(as.factor(y_pred), as.factor(y_test))
# Precision, Recall, F1-Score
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
f1 <- conf_matrix$byClass["F1"]
# ROC-AUC
roc_curve <- roc(as.factor(y_test), y_proba)
roc_auc <- auc(roc_curve)
# Print metrics
cat("Accuracy:", accuracy, "\n")
cat("Precision:", precision, "\n")
cat("Recall:", recall, "\n")
cat("F1-Score:", f1, "\n")
cat("ROC-AUC:", roc_auc, "\n")
}
# Now call the evaluation function on your Naive Bayes model
evaluation(nb_model, X_test, y_test)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
## Accuracy: 0.874732
## Precision: 0.9740689
## Recall: 0.8912622
## F1-Score: 0.9308275
## ROC-AUC: 0.8635652
# Load required libraries
library(caret) # For confusionMatrix and model evaluation
library(pROC) # For ROC-AUC calculation
# Model performance metrics function
evaluation <- function(model, X_test, y_test) {
# Predict class labels
y_pred <- predict(model, X_test, type = "class")
# Check if the length of predicted and actual labels match
if (length(y_pred) != length(y_test)) {
stop("Length of predicted labels and actual labels do not match.")
}
# Ensure the predicted labels have the same factor levels as the actual labels
y_pred <- factor(y_pred, levels = levels(as.factor(y_test)))
# Predict probabilities (for ROC-AUC calculation)
y_proba <- predict(model, X_test, type = "prob")[, 2] # Probability for the positive class
# Accuracy
accuracy <- sum(y_pred == y_test) / length(y_test)
# Confusion matrix for precision, recall, F1-Score
conf_matrix <- confusionMatrix(as.factor(y_pred), as.factor(y_test))
# Extract precision, recall, and F1 from confusion matrix
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
f1 <- conf_matrix$byClass["F1"]
# ROC-AUC
roc_curve <- roc(as.factor(y_test), y_proba)
roc_auc <- auc(roc_curve)
# Print metrics
cat("Accuracy:", accuracy, "\n")
cat("Precision:", precision, "\n")
cat("Recall:", recall, "\n")
cat("F1-Score:", f1, "\n")
cat("ROC-AUC:", roc_auc, "\n")
}
# Example usage:
# Assuming 'model' is a trained model, and X_test, y_test are the test dataset and labels
print("Model Performance:")
## [1] "Model Performance:"
evaluation(dt_model, X_test, y_test)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
## Accuracy: 0.9328353
## Precision: 0.9586484
## Recall: 0.970855
## F1-Score: 0.9647131
## ROC-AUC: 0.7040052
# Load necessary libraries
library(caret)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(pROC)
# Split data into training and testing sets
set.seed(42)
train_index <- createDataPartition(y, p = 0.7, list = FALSE)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]
# Load the pre-trained Random Forest model
rf_model <- readRDS("random_forest_model.rds")
# Predict using the model
y_pred <- predict(rf_model, X_test)$predictions
y_proba <- predict(rf_model, X_test, type = "response")$predictions[, 2]
# Adjust predictions based on threshold
threshold <- 0.3
y_pred_adjusted <- ifelse(y_proba >= threshold, "Yes", "No")
# Convert to factors for confusion matrix and metrics
y_pred_adjusted <- factor(y_pred_adjusted, levels = c("Yes", "No"))
y_test <- factor(y_test, levels = c("Yes", "No"))
# Accuracy
accuracy <- sum(y_pred_adjusted == y_test) / length(y_test)
# Confusion matrix
conf_matrix <- confusionMatrix(y_pred_adjusted, y_test)
# Precision, Recall, F1-Score
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
f1 <- conf_matrix$byClass["F1"]
# ROC-AUC
roc_curve <- roc(y_test, y_proba)
## Setting levels: control = Yes, case = No
## Setting direction: controls > cases
roc_auc <- auc(roc_curve)
# Print metrics
cat(sprintf("Accuracy: %.4f\n", accuracy))
## Accuracy: 0.9562
cat(sprintf("Precision: %.4f\n", precision))
## Precision: 0.5900
cat(sprintf("Recall: %.4f\n", recall))
## Recall: 0.6339
cat(sprintf("F1-Score: %.4f\n", f1))
## F1-Score: 0.6112
cat(sprintf("ROC-AUC: %.4f\n", roc_auc))
## ROC-AUC: 0.9599
The deployment of the heart attack prediction model in R ensures that the trained model is accessible for practical use, such as predicting heart attack risks in new data. By bridging the gap between technical development and real-world application, this model provides actionable insights that can address the business challenges outlined earlier in the project.
To integrate the model into applications, a user-friendly interface is created using the Shiny package, enabling seamless interaction between the model and end-users. Shiny allows stakeholders to input data and receive real-time predictions on heart attack risk. Whether for individual predictions or large-scale population health analyses, this interface provides an accessible tool for stakeholders.
R packages including “Shiny”, “randomForest”, and “caret” were downloaded and used to create the interactive web application, load and apply the pre-trained Random Forest model to make heart attack predictions, and to preprocess and train the model.
# Load necessary libraries
library(shiny)
library(randomForest)
# Load the pre-trained model
model <- readRDS("random_forest_model.rds")
# Define the prediction function
predict_heart_attack_risk <- function(input_data) {
numeric_features <- input_data[1:3]
binary_features <- input_data[4:20]
categorical_features <- input_data[21:length(input_data)]
# Map categorical inputs to numeric values
mappings <- list(
c('Yes' = 1, 'No' = 0), # For binary features
c('F' = 0, 'M' = 1), # Gender
c('0-5' = 3, '6-10' = 8, '11-15' = 13, '16-20' = 18, '21-25' = 23, '26-30' = 28), # Sick days
c('None of them' = 0, '1 to 5' = 1, '6 or more, but not all' = 2, 'All' = 3), # Teeth removed
c('Poor' = 0, 'Fair' = 1, 'Good' = 2, 'Very good' = 3, 'Excellent' = 4), # General health
c('0-5' = 3, '6-10' = 8, '11-15' = 13, '16-20' = 18, '21-25' = 23, '26-30' = 28), # Mental health days
c('Within past year (anytime less than 12 months ago)' = 0,
'Within past 2 years (1 year but less than 2 years ago)' = 1,
'Within past 5 years (2 years but less than 5 years ago)' = 2,
'5 or more years ago' = 3), # Last checkup
c('Never smoked' = 0, 'Former smoker' = 1,
'Current smoker - now smokes some days' = 2,
'Current smoker - now smokes every day' = 3) # Smoking status
)
# Transform binary inputs
binary_values <- sapply(binary_features, function(val) ifelse(val == 'Yes', 1, 0))
# Transform categorical inputs using correct mappings
categorical_values <- sapply(1:length(categorical_features), function(i) {
if (i <= length(mappings)) {
mapping <- mappings[[i]] # Adjust index to match categorical inputs
return(mapping[[categorical_features[i]]])
} else {
return(NA) # Handle case where the index exceeds mappings length
}
})
# Combine all inputs into a single row
input_row <- c(numeric_features, binary_values, categorical_values)
# Predict probabilities
risk_probability <- predict(model, data.frame(t(input_row)), type = "prob")[1, ]
low_risk_prob <- risk_probability[1] * 100
high_risk_prob <- risk_probability[2] * 100
# Final prediction
prediction <- ifelse(high_risk_prob > low_risk_prob, "High Risk of Heart Attack", "Low Risk of Heart Attack")
# Return results
return(list(
LowRiskProbability = round(low_risk_prob, 2),
HighRiskProbability = round(high_risk_prob, 2),
Prediction = prediction
))
}
# Define UI for the Shiny app
ui <- fluidPage(
titlePanel("Heart Attack Risk Prediction"),
sidebarLayout(
sidebarPanel(
numericInput("BMI", "BMI", value = 25, min = 10, max = 50),
numericInput("Weight", "Weight (in kg)", value = 70, min = 30, max = 200),
numericInput("Age", "Age", value = 35, min = 18, max = 100),
# Binary Inputs
checkboxGroupInput(
"BinaryFeatures", "Health Conditions",
choices = c("HadAngina", "HadDiabetes", "HadStroke", "DifficultyWalking",
"HadCOPD", "ChestScan", "HadKidneyDisease", "HadArthritis",
"DeafOrHardOfHearing", "PneumoniaVaccineEver", "DifficultyErrands",
"DifficultyDressingBathing", "BlindOrVisionDifficulty",
"AlcoholDrinkers", "DifficultyConcentrating", "HadSkinCancer",
"PhysicalActivities")
),
# Categorical Inputs
radioButtons("Gender", "Gender", choices = c("F", "M")),
selectInput("SickDays", "Sick Days in a Month", choices = c("0-5", "6-10", "11-15", "16-20", "21-25", "26-30")),
selectInput("TeethRemoved", "Teeth Removed", choices = c("None of them", "1 to 5", "6 or more, but not all", "All")),
selectInput("GeneralHealth", "General Health", choices = c("Poor", "Fair", "Good", "Very good", "Excellent")),
selectInput("LastCheckup", "Last Checkup", choices = c(
"Within past year (anytime less than 12 months ago)",
"Within past 2 years (1 year but less than 2 years ago)",
"Within past 5 years (2 years but less than 5 years ago)",
"5 or more years ago"
)),
selectInput("SmokingStatus", "Smoking Status", choices = c(
"Never smoked", "Former smoker", "Current smoker - now smokes some days", "Current smoker - now smokes every day"
)),
actionButton("submit", "Predict")
),
mainPanel(
h4("Prediction Results"),
verbatimTextOutput("result")
)
)
)
# Define server logic for the Shiny app
server <- function(input, output) {
observeEvent(input$submit, {
# Construct input data for the model
input_data <- c(input$BMI, input$Weight, input$Age,
input$Gender, input$SickDays, input$TeethRemoved,
input$GeneralHealth, input$LastCheckup, input$SmokingStatus)
# Add binary features
binary_features <- c(
"HadAngina", "HadDiabetes", "HadStroke", "DifficultyWalking",
"HadCOPD", "ChestScan", "HadKidneyDisease", "HadArthritis",
"DeafOrHardOfHearing", "PneumoniaVaccineEver", "DifficultyErrands",
"DifficultyDressingBathing", "BlindOrVisionDifficulty",
"AlcoholDrinkers", "DifficultyConcentrating", "HadSkinCancer",
"PhysicalActivities"
)
for (feature in binary_features) {
input_data[[feature]] <- ifelse(feature %in% input$BinaryFeatures, 1, 0)
}
# Predict and display the results
results <- predict_heart_attack_risk(input_data)
output$result <- renderText({
paste(
"Low Risk Probability:", results$LowRiskProbability, "%\n",
"High Risk Probability:", results$HighRiskProbability, "%\n",
"Prediction:", results$Prediction
)
})
})
}
# Launch the Shiny app
shinyApp(ui = ui, server = server)
#library(reticulate)
#library(reticulate)
#use_python("C:/Users/LEE/Documents/.virtualenvs/r-reticulate/Scripts/python.ex#e", required = TRUE)
# Load the model in R
#model_regression <- readRDS("random_forest_model.rds")
#py_run_string("import joblib")# Optionally: Save the model to a .pkl file in R, so Python can load it
#saveRDS(model_regression, file = "model.rds")
# Use reticulate to save it as a Python-compatible format
#py_run_string("
#import joblib
#model = r.model_regression
#joblib.dump(model, 'random_forest_model.pkl')
#")
# Run the Python script for Gradio
#py_run_file("deployment.py")
#system("python deployment.py", wait = TRUE)
The heart attack prediction model using a Random Forest algorithm effectively assesses an individual’s risk by analyzing various health factors such as age, BMI, lifestyle habits, and medical history. By calculating the probabilities of high or low risk, the model provides personalized predictions that can guide healthcare decisions. This tool aids in early risk detection, encouraging timely interventions and lifestyle changes. While effective, further improvements could involve expanding features and real-time data integration for even more accurate and dynamic risk assessment.