Individuals’ quality of life is influenced by a variety of factors, including genetics, lifestyle choices, access to healthcare, and environmental conditions. Cardiovascular health is a critical element that often goes unrecognized. Heart disease has become a global concern, affecting the health of millions. It discourages people from leading active lives, pursuing their passions, and contributing to society. When cardiovascular problems appear, they are not only difficult to manage, but they also inflict significant financial and emotional pressure on families and healthcare systems. Despite medical advances, there is no universal agreement on preventative strategies.
This assignment focused on the prediction and understanding of heart disorders. Specifically, we investigated, evaluated, and built models for a dataset containing health indicators and behaviors of around 4,238 people, with a focus on factors potentially contributing to heart disease.
We began by scanning the dataset for potential problems such as missing values, abnormalities, and suspected collinearity among predictors. Following this initial inspection, we cleaned the data, correcting the issues identified.
With the cleaned data in hand, we built two decision trees, ensuring they differed by changing the first splitting feature. We also created a Random Forest model, leveraging its capacity to detect detailed, non-linear patterns in the data.
After training, we tested the models’ performance on a held-out evaluation set and identified the model that best balanced predictive accuracy and simplicity. Based on health measurements and behaviors, this model attempts to estimate the chance of a person having a heart attack.
#Load data
data <- read.csv("https://raw.githubusercontent.com/ex-pr/DATA_622/main/HW%202/heart_disease.csv")
The dataset contained 4,238 observations of 16 variables (15 predictors and the target). Each record represented a patient, and each column represented a health aspect. Heart_stroke was the target; it specified whether or not the patient experienced a stroke. Specifically:
Gender: gender of the patient (Male, Female)
age: age of the patient
education: education level of the patient (Uneducated, Primary School, Graduate, Postgraduate)
currentSmoker: “1” if the patient smokes, “0” otherwise
cigsPerDay: number of cigarettes smoked per day
BPMeds: “1” if the patient is on blood pressure medication, “0” otherwise
prevalentStroke: “1” if the patient has a history of stroke, “0” otherwise
prevalentHyp: “1” if the patient has prevalent hypertension, “0” otherwise
diabetes: “1” if the patient has diabetes, “0” otherwise
totChol: total cholesterol level
sysBP: systolic blood pressure
diaBP: diastolic blood pressure
BMI: body mass index
heartRate: heart rate of the patient
glucose: glucose level of the patient
Heart_stroke: “1” if the patient had a heart stroke, “0” otherwise
As the target variable was binary, the following algorithms were considered, and the data preparation below was tailored to this selection. The choice of algorithms was driven by the nature of the target variable and the assignment, not by the size of the data.
Decision tree: visualizes decision-making clearly, handles both numerical and categorical data, and is less sensitive to outliers. However, if not pruned, it can easily become too complex and overfit the data, and it may not capture correlations as intricate as ensemble approaches like Random Forest can.
Random Forest: captures complicated, non-linear patterns in the data, is resistant to overfitting, and handles both numerical and categorical data. However, it is more difficult to interpret than simpler models, and large datasets can be computationally expensive.
The data source: https://www.kaggle.com/datasets/mirzahasnine/heart-disease-dataset/data
# Check first rows of data
DT::datatable(
data[1:25,],
extensions = c('Scroller'),
options = list(scrollY = 350,
scrollX = 500,
deferRender = TRUE,
scroller = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE),
rownames = FALSE)
The table below provides summary statistics for the data. Some variables had missing values (<10% each) to be imputed.
The dataset contained more female observations than male. Patients ranged in age from 32 to 70 years old, with an average age of 49.6 years. Nearly half of the people in the dataset (49.4%) were smokers. The average number of cigarettes smoked per day was 9, though this ranged from non-smokers to heavy smokers who consumed up to 70 cigarettes per day.
The average systolic blood pressure was 132.35, while the average diastolic blood pressure was 82.89. About 2.96% of the participants were taking blood pressure medication. The average total cholesterol level was approximately 236.72, with values ranging from 107 to 696. The average BMI was 25.8, and the average glucose level was around 81.97.
About 31.1% of people had hypertension, while a small percentage (2.57%) had diabetes. Only a small percentage of participants (less than 1%) had a history of stroke, indicating that it was uncommon in our sample. A significant fraction of patients was classified as “uneducated”, while patients with primary school and postgraduate education were also represented.
The vast majority of the individuals in the dataset had never had a stroke, as the target variable indicated. The imbalance in the target variable was addressed later.
# Check summary statistics of the data
print(dfSummary(data, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE), headings = FALSE, max.tbl.height = 500, footnote = NA, col.width=10, method="render")
No | Variable | Distinct values | Missing
---|---|---|---
1 | Gender [character] | 2 | 0 (0.0%)
2 | age [integer] | 39 | 0 (0.0%)
3 | education [character] | 4 | 105 (2.5%)
4 | currentSmoker [integer] | 2 | 0 (0.0%)
5 | cigsPerDay [integer] | 33 | 29 (0.7%)
6 | BPMeds [integer] | 2 | 53 (1.3%)
7 | prevalentStroke [character] | 2 | 0 (0.0%)
8 | prevalentHyp [integer] | 2 | 0 (0.0%)
9 | diabetes [integer] | 2 | 0 (0.0%)
10 | totChol [integer] | 248 | 50 (1.2%)
11 | sysBP [numeric] | 234 | 0 (0.0%)
12 | diaBP [numeric] | 146 | 0 (0.0%)
13 | BMI [numeric] | 1363 | 19 (0.4%)
14 | heartRate [integer] | 73 | 1 (0.0%)
15 | glucose [integer] | 143 | 388 (9.2%)
16 | Heart_.stroke [character] | 2 | 0 (0.0%)
Managing missing values was one of the critical problems to fix before building the models.
The missing values for the continuous variables cigsPerDay, BMI, totChol, heartRate, and glucose were imputed using the median of their respective columns. Using the median rather than the mean reduced the impact of any outliers in the data.
To fill in the missing data points for categorical variables, such as education and BPMeds, the mode (most frequently occurring value) of each column was used. This method preserved the categorical data distribution and introduced no unintentional biases.
# Set seed for constant results
set.seed(42)
# Copy original data
imputed_df <- data
# Impute NAs for 'cigsPerDay', 'totChol', 'BMI', 'heartRate', and 'glucose' with their median
imputed_df <- imputed_df %>% mutate(across(c('cigsPerDay', 'totChol', 'BMI', 'heartRate', 'glucose'), ~replace_na(., median(., na.rm=TRUE))))
# Impute NAs for 'education' with its mode
mode_education <- names(sort(table(imputed_df$education), decreasing = TRUE)[1])
imputed_df$education[is.na(imputed_df$education)] <- mode_education
# Transform 'BPMeds' to factor
imputed_df$BPMeds <- as.factor(imputed_df$BPMeds)
# Impute missing values for 'BPMeds' with its mode
mode_bpmeds <- names(sort(table(imputed_df$BPMeds), decreasing = TRUE)[1])
imputed_df$BPMeds[is.na(imputed_df$BPMeds)] <- mode_bpmeds
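As a quick sanity check that the imputation covered every affected column (a minimal sketch; it simply counts the remaining NAs):
# Count remaining NAs per column; all should be zero after imputation
colSums(is.na(imputed_df))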
We ensured that our dataset was appropriate for a broader range of algorithms that require numerical input by using one-hot encoding, boosting the potential accuracy and effectiveness of our later analysis. Gender, education, and prevalentStroke were identified as categorical columns that would benefit from one-hot encoding.
The encoding of Gender resulted in a new column male, where a value of 1 indicated a male and 0 a female. The encoding of prevalentStroke resulted in a new column prevalStroke, where 1 indicated a patient with a previous stroke and 0 a patient without one. The encoding of the target variable Heart_.stroke resulted in a new column stroke, where 1 indicated a patient who had a stroke and 0 a patient who did not.
Three new columns were created for education: postgraduate, primaryschool, and uneducated. Each of these columns took a binary value, indicating whether that education level was present (1) or absent (0).
The first category of each original categorical variable was dropped during encoding to avoid multicollinearity and reduce redundancy.
# Set seed for constant results
set.seed(42)
# Copy data without NAs
encoded_df <- imputed_df
# Encoding 'Gender', 'Heart_.stroke', 'prevalentStroke' variables to 0 and 1
encoded_df <- encoded_df%>%
mutate(
male = factor(ifelse(Gender == "Male", 1, 0)),
prevalStroke = factor(ifelse(prevalentStroke == "yes", 1, 0)),
stroke = factor(ifelse(Heart_.stroke == "yes", 1, 0))
) %>%
dplyr::select(-c(Gender, prevalentStroke, Heart_.stroke))
# One hot encoding for education
encoded_df <- encoded_df %>%
mutate(across(c(education), ~as.factor(.))) %>%
pivot_wider(names_from = c(education),
values_from = c(education),
values_fill = 0,
values_fn = function(x) 1) %>%
dplyr::select(-c(graduate))
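For reference, base R’s model.matrix() produces an equivalent dummy encoding and drops the first factor level automatically. This is a sketch of an alternative, not the code used above:
# Equivalent one-hot encoding via base R; column 1 (the intercept) is dropped
dummies <- model.matrix(~ education, data = imputed_df)[, -1]
head(dummies)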
Several columns, including diabetes, prevalentHyp, currentSmoker, BPMeds, postgraduate, primaryschool, and uneducated, were converted to factors. This change better reflected the variables’ inherent categorical nature.
We included a new column called BP, representing the ratio of systolic to diastolic blood pressure (from the sysBP and diaBP columns, respectively). This derived attribute could provide new insight into people’s cardiovascular health.
# Transform binary variables to factors
cols <- c("diabetes", "prevalentHyp", "currentSmoker", "BPMeds", "postgraduate", "primaryschool", "uneducated")
encoded_df[cols] <- lapply(encoded_df[cols], factor)
# New column for the ratio of systolic to diastolic blood pressure
encoded_df$BP <- encoded_df$sysBP / encoded_df$diaBP
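A quick range check on the derived ratio (a sketch; most values should cluster around 1.5, as the distribution plots later confirm):
# Distribution of the new systolic/diastolic ratio
summary(encoded_df$BP)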
We used boxplots to detect outliers. Several variables, including cigsPerDay, totChol, sysBP, diaBP, BMI, heartRate, and glucose, appeared to have possible outliers: values falling outside the whiskers of the boxplots. Some variables, including sysBP, glucose, and BP, had a disproportionately high number of outliers. These outliers could have a considerable impact on the outcomes, depending on the analytic or modeling technique used. However, outliers in health data can represent actual and critical findings, and deleting them may result in the loss of vital information. An unusually high glucose level, for example, could indicate an untreated diabetes patient; removing such outliers would discard important clinical insight. As a result, the outliers were retained.
# Numeric columns to check for outliers
continuous_vars <- colnames(select_if(encoded_df, is.numeric))
# List to store plots
plots <- list()
# Generate boxplots for each variable
for(i in 1:length(continuous_vars)) {
p <- ggplot(encoded_df, aes_string(y = continuous_vars[i])) +
geom_boxplot() +
scale_fill_viridis_d() +
ggtitle(continuous_vars[i]) +
theme_minimal()
plots[[i]] <- p
}
# Arrange the plots in a grid
grid.arrange(grobs = plots, ncol = 3)
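To quantify what the boxplots show, the whisker rule (1.5 × IQR) can be applied directly. This sketch counts, for each continuous variable, how many values fall outside that range:
# Count values beyond the 1.5 * IQR whiskers for each continuous variable
sapply(encoded_df[continuous_vars], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})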
After the data transformation, no missing values were detected. New columns were added (male, prevalStroke, stroke, BP, postgraduate, primaryschool, uneducated) while others were removed (Gender, education, Heart_.stroke, prevalentStroke).
The mean of cigsPerDay decreased slightly in the transformed data because missing values were imputed with the median. The statistics for totChol, sysBP, diaBP, BMI, heartRate, and glucose also changed slightly, owing to the imputation of missing values with medians.
DT::datatable(
encoded_df[1:25,],
extensions = c('Scroller'),
options = list(scrollY = 350,
scrollX = 500,
deferRender = TRUE,
scroller = TRUE,
dom = 'lBfrtip',
fixedColumns = TRUE,
searching = FALSE),
rownames = FALSE)
print(dfSummary(encoded_df, text.graph.col = FALSE, graph.col = FALSE, style = "grid", valid.col = FALSE), headings = FALSE, max.tbl.height = 500, footnote = NA, col.width=50, method="render")
No | Variable | Distinct values | Missing
---|---|---|---
1 | age [integer] | 39 | 0 (0.0%)
2 | currentSmoker [factor] | 2 | 0 (0.0%)
3 | cigsPerDay [integer] | 33 | 0 (0.0%)
4 | BPMeds [factor] | 2 | 0 (0.0%)
5 | prevalentHyp [factor] | 2 | 0 (0.0%)
6 | diabetes [factor] | 2 | 0 (0.0%)
7 | totChol [integer] | 248 | 0 (0.0%)
8 | sysBP [numeric] | 234 | 0 (0.0%)
9 | diaBP [numeric] | 146 | 0 (0.0%)
10 | BMI [numeric] | 1363 | 0 (0.0%)
11 | heartRate [integer] | 73 | 0 (0.0%)
12 | glucose [integer] | 143 | 0 (0.0%)
13 | male [factor] | 2 | 0 (0.0%)
14 | prevalStroke [factor] | 2 | 0 (0.0%)
15 | stroke [factor] | 2 | 0 (0.0%)
16 | postgraduate [factor] | 2 | 0 (0.0%)
17 | primaryschool [factor] | 2 | 0 (0.0%)
18 | uneducated [factor] | 2 | 0 (0.0%)
19 | BP [numeric] | 2026 | 0 (0.0%)
First, we checked the distributions of all continuous variables; most patients were between 40 and 60 years old.
Age appeared to be rather evenly distributed, with a small left skew. The distribution of cigsPerDay showed that the majority of people either did not smoke or smoked around 20 cigarettes per day. totChol, sysBP, diaBP, and glucose were slightly skewed to the right, with most values centered around 200-300 mg/dL, 120 mmHg, 80 mmHg, and 80 mg/dL, respectively. BMI was skewed to the right, with the majority of readings falling between 20 and 30. heartRate appeared rather normal, mostly between 70 and 80 beats per minute. BP was slightly right-skewed as well, with most values around 1.5.
# Choose numeric variables
numeric_vars <-colnames(select_if(encoded_df, is.numeric))
# List to store plots
plots <- list()
# Generate histograms for each variable
for (i in 1:length(numeric_vars)) {
p <- ggplot(encoded_df, aes_string(x = numeric_vars[i])) +
geom_histogram(aes(y=..density..), bins = 30, fill = "lightgreen", color = "black", alpha = 0.7) +
geom_density(alpha = 0.2, fill = "#FF6666") +
ggtitle(paste0('Distribution of ', numeric_vars[i])) +
theme_minimal()
plots[[i]] <- p
}
# Plot in grid with 3 columns
grid.arrange(grobs = plots, ncol = 3)
Next, we explored the distribution of all categorical variables.
The number of current smokers and nonsmokers was nearly equal, with nonsmokers somewhat more common. The majority of people did not take blood pressure medicine (BPMeds). Although most people did not have prevalent hypertension, a considerable proportion did. Diabetes affected a small percentage of the population.
The dataset had more females than males. A relatively small percentage of people had suffered a previous stroke. In terms of education, only a few people had a postgraduate education, a small percentage had only completed primary school, and the majority were uneducated.
The majority of people had never had a heart attack, so the target variable showed a large class imbalance. Such imbalances can make modeling difficult, since models become biased toward predicting the majority class. It was critical to keep this in mind while creating and evaluating models, and to consider tactics like resampling or specialized assessment metrics.
# Choose factor variables
factor_vars <- encoded_df %>% select_if(is.factor) %>% colnames()
# List to store plots
plots <- list()
# Generate barplots for each variable
for (i in 1:length(factor_vars)) {
p <- ggplot(encoded_df, aes_string(x = factor_vars[i])) +
geom_bar(fill = "lightgreen", color = "black", alpha = 0.7) +
ggtitle(paste0('Distribution of ', factor_vars[i])) +
theme_minimal()
plots[[i]] <- p
}
# Plot in grid with 3 columns
grid.arrange(grobs = plots, ncol = 3)
To better understand the relationships between the variables and target, we started with boxplots: numeric variables vs target.
Patients who had had a heart attack were usually older. There did not appear to be a significant difference between the two groups in the number of cigarettes smoked each day. Those who had had a heart attack had a slightly higher median total cholesterol level.
Patients who had a heart attack had greater systolic blood pressure and diastolic blood pressure. As a result, those who had had a heart attack had a slightly higher systolic to diastolic blood pressure ratio. The distribution of Body Mass Index appeared to be similar in both groups. There was no discernible difference in heart rate between the two groups. Glucose levels appeared to be higher in persons who had had a heart attack.
The visualizations revealed which characteristics could be major predictors of the occurrence of heart attacks. It was important to note, however, that correlation does not imply causation. Still, the plots agreed with the general knowledge that stroke patients tended to be older than 45, had not led healthy lifestyles, and had problems with blood pressure.
# List to store plots
plots <- list()
# Generate boxplots for each numeric variable vs target
for (i in 1:length(numeric_vars)) {
p <- ggplot(encoded_df, aes_string(x = 'stroke', y = numeric_vars [i])) +
geom_boxplot(fill = "lightgreen") +
ggtitle(paste0(numeric_vars [i], ' vs stroke')) +
theme_minimal()
plots[[i]] <- p
}
# Plot in grid with 3 columns
grid.arrange(grobs = plots, ncol = 3)
Next, we checked the distribution of categorical variables vs target variable.
Males and females appeared to experience an equal proportion of cardiac strokes. The proportion of heart strokes was generally consistent across education levels. People who had had a prevalent stroke had a slightly greater risk of having a heart attack. The proportion of cardiac strokes was comparable between smokers and nonsmokers, which was unexpected.
Patients with prevalent hypertension and who used blood pressure drugs had an increased risk of having a heart attack. The proportion of people who have a heart attack was higher in diabetics.
These plots confirmed the general knowledge about the reasons for stroke.
# Define the categorical variables without the target
factor_vars_no_target <- setdiff(factor_vars, 'stroke')
# List to store plots
plots <- list()
# Generate bar plots for each factor variable vs target
for (i in 1:length(factor_vars_no_target)) {
p <- ggplot(encoded_df, aes_string(x = factor_vars_no_target[i], fill = 'stroke')) +
geom_bar(position = "dodge") +
ggtitle(paste0('Stroke vs ', factor_vars_no_target[i])) +
theme_minimal()
plots[[i]] <- p
}
# Plot in grid with 3 columns
grid.arrange(grobs = plots, ncol = 3)
To understand the relationships between features, we built the correlation matrix.
Age had a positive association with sysBP, diaBP, and totChol, showing that as people got older, their blood pressure and cholesterol levels rose. cigsPerDay had no significant relationship with any other variable, which was unusual given that general knowledge connects smoking with heart disease.
sysBP and diaBP were strongly positively associated, as they both measure blood pressure. As a result, the systolic-to-diastolic ratio BP had a high positive correlation with sysBP and a strong negative correlation with diaBP. Other correlations were modest, indicating only a few linear associations between those variables.
# Check correlation
rcore <- rcorr(as.matrix(encoded_df %>% dplyr::select(where(is.numeric))))
# Take correlation coeff
coeff <- rcore$r
# Build corr plot
corrplot(coeff, tl.cex = .7, tl.col="black", method = 'color', addCoef.col = "black",
type="upper", order="hclust",
diag=FALSE)
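As a direct spot check on the strongest pair flagged above (a minimal sketch):
# Pearson correlation between the two blood pressure measurements
cor(encoded_df$sysBP, encoded_df$diaBP)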
Finally, before proceeding to prediction, we split the data into train (75%) and test (25%) sets to evaluate model performance. The train data contained 3,179 records, the test data 1,059.
The response variable was imbalanced: 85% of responses were 0 and 15% were 1. For the further analysis, balancing the response variable was an option.
With imbalanced data, most machine learning models predict the majority class more reliably than the minority class. To address this behavior, we applied the SMOTE() function to the data in order to achieve more even accuracy across classes. SMOTE oversamples the minority class, using bootstrapping and k-nearest neighbors to synthetically create additional observations of that class. We tried running the models both with and without balancing.
# random seed
set.seed(42)
# 75/25 split of the data set
sample <- sample.split(encoded_df$stroke, SplitRatio = 0.75)
train_data <- subset(encoded_df, sample == TRUE)
test_data <- subset(encoded_df, sample == FALSE)
# Check dimensions of train and test data
dim(train_data)
## [1] 3179 19
dim(test_data)
## [1] 1059 19
# Check class distribution of original, train, and test sets
round(prop.table(table(dplyr::select(encoded_df, stroke), exclude = NULL)), 4) * 100
##
## 0 1
## 84.8 15.2
round(prop.table(table(dplyr::select(train_data, stroke), exclude = NULL)), 4) * 100
##
## 0 1
## 84.81 15.19
round(prop.table(table(dplyr::select(test_data, stroke), exclude = NULL)), 4) * 100
##
## 0 1
## 84.8 15.2
The SMOTE() function was applied to the train data in order to achieve higher accuracy rates between classes.
# random seed
set.seed(42)
# Fix imbalance
train_data <- as.data.frame(train_data)
#train_data$stroke <- as.factor(train_data$stroke)
train_data<- SMOTE(stroke ~ ., train_data, perc.over = 100, perc.under=200)
round(prop.table(table(dplyr::select(train_data, stroke), exclude = NULL)), 4) * 100
##
## 0 1
## 50 50
For the first model, we trained and evaluated a Decision Tree classifier on all the variables except sysBP and diaBP, since the derived variable BP replaced them.
We used the caret package’s train() and trainControl() functions to build the first tree, with the ‘cv’ method running 10-fold cross-validation to evaluate model performance.
# Set seed for constant results
set.seed(42)
# Train Decision Tree
control_1 <- trainControl(method = "cv", number = 10)
tree_1 <- train(stroke ~ . -sysBP -diaBP, method = "rpart", trControl = control_1, tuneLength = 10, data = train_data)
The accuracy of the model was 66%.
The first tree correctly identified 56% of the class 1 cases; the model could predict the minority class at all only because the training data had been balanced. The F1 score of 0.77 combined the precision and recall scores of the model. These metrics indicated that while the model had a decent overall accuracy, its performance in predicting heart strokes (class 1) was not very high (specificity 0.56). This was likely due to the imbalanced nature of the target variable.
The AUC for the ROC curve was 0.62: the model ranked a random positive example higher than a random negative example 62% of the time, indicating moderate predictive power.
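The AUC above was computed from hard class labels. A smoother ROC curve can be drawn from the predicted class probabilities instead; a minimal sketch with the pROC package already used in this report (the probability column name "1" assumes the factor levels of stroke):
# ROC from predicted class probabilities rather than hard labels
probs_1 <- predict(tree_1, newdata = test_data, type = "prob")[, "1"]
roc_1 <- roc(test_data$stroke, probs_1)
plot(roc_1, main = "ROC curve, Decision Tree #1")
auc(roc_1)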
The table below shows the feature importance. The most important features for predicting stroke were age, the number of cigarettes smoked per day, and whether a patient had prevalent hypertension. Accordingly, the age feature was used as the first node.
# Set seed for constant results
set.seed(42)
# Predict using test data
y_pred_1 <- predict(tree_1, newdata = test_data) #, type = "class")
conf_matrix_1 <- confusionMatrix(y_pred_1, test_data$stroke)
# Evaluate Decision tree
results <- tibble(Model = "Model #1 - Decision Tree", Accuracy=conf_matrix_1$overall[1],
"Classification error rate" = 1 - conf_matrix_1$overall[1],
F1 = conf_matrix_1$byClass[7],
Sensitivity = conf_matrix_1$byClass["Sensitivity"],
Specificity = conf_matrix_1$byClass["Specificity"],
Precision = conf_matrix_1$byClass["Precision"],
ROC = auc(roc(test_data$stroke, factor(y_pred_1, ordered = TRUE))))
conf_matrix_1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 610 71
## 1 288 90
##
## Accuracy : 0.661
## 95% CI : (0.6316, 0.6895)
## No Information Rate : 0.848
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1534
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6793
## Specificity : 0.5590
## Pos Pred Value : 0.8957
## Neg Pred Value : 0.2381
## Prevalence : 0.8480
## Detection Rate : 0.5760
## Detection Prevalence : 0.6431
## Balanced Accuracy : 0.6191
##
## 'Positive' Class : 0
##
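The reported F1 of 0.77 can be verified by hand from the confusion matrix above, since F1 is the harmonic mean of precision and sensitivity (with class 0 as the positive class):
# F1 = 2 * precision * recall / (precision + recall)
precision_1 <- 610 / (610 + 71)    # TP / (TP + FP) = 0.8957
recall_1 <- 610 / (610 + 288)      # TP / (TP + FN) = 0.6793
2 * precision_1 * recall_1 / (precision_1 + recall_1)  # ~0.773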
rpart.plot(tree_1$finalModel)
varImp(tree_1)
## rpart variable importance
##
## Overall
## age 100.0000
## cigsPerDay 95.3861
## prevalentHyp1 86.9722
## BPMeds1 80.0869
## glucose 66.6218
## BMI 52.7338
## diabetes1 42.9540
## currentSmoker1 29.6760
## totChol 29.0164
## BP 16.8515
## heartRate 8.7908
## prevalStroke1 4.2730
## male1 3.3960
## primaryschool1 0.6642
## postgraduate1 0.0000
## uneducated1 0.0000
For the second model, we trained a Decision Tree using a different feature for the first split. Since age was used as the first node in the previous tree, that feature was removed from the second model; the variables sysBP and diaBP were again excluded.
The rpart() function with minsplit = 50 (the minimum number of observations that must exist in a node for a split to be attempted) was used to build the second tree.
# Set seed for constant results
set.seed(42)
# Train Decision Tree
tree_2 <- rpart(stroke ~. -sysBP -diaBP -age, data = train_data, method = "class", control = rpart.control(minsplit=50))
prevalentHyp was used as the first node.
The accuracy of the model was 68%.
The second tree correctly identified 55% of the class 1 cases. The F1 score of 0.79 combined the precision and recall scores of the model and was better than the previous model’s. These metrics indicated that while both models had a decent overall accuracy, their performance in predicting heart strokes (class 1) was not very high (specificity 0.55 for the second model). This was likely due to the imbalanced nature of the target variable.
The AUC for the ROC curve was 0.63: the model ranked a random positive example higher than a random negative example 63% of the time, indicating moderate predictive power.
The table below shows the feature importance. With age removed from the model, the most important features were prevalent hypertension (the first node, prevalentHyp), whether a patient was a smoker, and whether a patient took blood pressure medicine.
# Set seed for constant results
set.seed(42)
# Predict using test data
y_pred_2 <- predict(tree_2, newdata = test_data, type = "class")
# Evaluate Decision tree
conf_matrix_2 <- confusionMatrix(y_pred_2, test_data$stroke)
results <- rbind(results, tibble(Model = "Model #2 - Decision Tree", Accuracy=conf_matrix_2$overall[1],
"Classification error rate" = 1 - conf_matrix_2$overall[1],
F1 = conf_matrix_2$byClass[7],
Sensitivity = conf_matrix_2$byClass["Sensitivity"],
Specificity = conf_matrix_2$byClass["Specificity"],
Precision = conf_matrix_2$byClass["Precision"],
ROC = auc(roc(test_data$stroke, factor(y_pred_2, ordered = TRUE)))))
conf_matrix_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 633 73
## 1 265 88
##
## Accuracy : 0.6808
## 95% CI : (0.6518, 0.7088)
## No Information Rate : 0.848
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1689
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7049
## Specificity : 0.5466
## Pos Pred Value : 0.8966
## Neg Pred Value : 0.2493
## Prevalence : 0.8480
## Detection Rate : 0.5977
## Detection Prevalence : 0.6667
## Balanced Accuracy : 0.6257
##
## 'Positive' Class : 0
##
rpart.plot(tree_2)
tree_2$variable.importance
## prevalentHyp currentSmoker BPMeds cigsPerDay glucose
## 71.9111834 52.5057403 48.9055721 29.4262710 26.7470410
## BP BMI diabetes heartRate totChol
## 8.5104332 6.5348906 3.5190174 3.4915286 0.1040433
Next, we tried a Random Forest model with 200 trees. The variables sysBP and diaBP were dropped; the variable BP was used instead.
# Set seed for constant results
set.seed(42)
rf <- randomForest(stroke ~ . -sysBP -diaBP, data = train_data, ntree=200)
The accuracy of the model was 72%.
The Random Forest correctly identified 56% of the class 1 cases. The F1 score of 0.82 combined the precision and recall scores of the model and was the best of the three models. These metrics indicated that while the model had a decent overall accuracy, its performance in predicting heart strokes (class 1) was still not very high (specificity 0.56 for the third model), again owing to the imbalance in the data.
The AUC for the ROC curve was 0.65: the model ranked a random positive example higher than a random negative example 65% of the time, indicating moderate predictive power.
The OOB error vs. trees plot below shows how model performance changed as the number of trees in the ensemble grew. The OOB (Out-of-Bag) error measures how well the Random Forest generalizes to unseen data: each tree is evaluated only on the observations left out of its bootstrap sample.
In the default plot for a classifier, the black line traces the overall OOB error, while the colored lines trace the per-class error rates (here, class 0 and class 1). All lines started relatively high with few trees and then decreased and stabilized as trees were added. The optimal number of trees for the Random Forest model was the point at which the OOB error stabilized (after roughly 100 trees).
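To read the stabilization point numerically rather than from the plot, the per-tree error matrix stored in the fit can be inspected (a minimal sketch; err.rate is randomForest’s standard slot, with one row per tree and columns for the overall OOB error and each class):
# Per-tree errors: overall OOB plus one column per class
head(rf$err.rate, 3)
# Locate where the OOB error bottoms out
oob <- rf$err.rate[, "OOB"]
which.min(oob)
plot(oob, type = "l", xlab = "Number of trees", ylab = "OOB error")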
The plot below shows the feature importance. Mean Decrease Gini measures how important a feature was over all splits in the forest: for each individual split, the Gini importance indicates how much that split reduced the Gini criterion (impurity/heterogeneity). The higher the Mean Decrease Gini, the more important the variable in the model. Age, BMI, and BP were the most important variables.
# Set seed for constant results
set.seed(42)
# Predict using test data
y_pred_3 <- predict(rf, newdata = test_data)
# Evaluate Random Forest
conf_matrix_3 <- confusionMatrix(y_pred_3, test_data$stroke)
results <- rbind(results, tibble(Model = "Model #3 - Random Forest", Accuracy=conf_matrix_3$overall[1],
"Classification error rate" = 1 - conf_matrix_3$overall[1],
F1 = conf_matrix_3$byClass[7],
Sensitivity = conf_matrix_3$byClass["Sensitivity"],
Specificity = conf_matrix_3$byClass["Specificity"],
Precision = conf_matrix_3$byClass["Precision"],
ROC = auc(roc(test_data$stroke, factor(y_pred_3, ordered = TRUE)))))
conf_matrix_3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 671 71
## 1 227 90
##
## Accuracy : 0.7186
## 95% CI : (0.6905, 0.7455)
## No Information Rate : 0.848
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2191
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7472
## Specificity : 0.5590
## Pos Pred Value : 0.9043
## Neg Pred Value : 0.2839
## Prevalence : 0.8480
## Detection Rate : 0.6336
## Detection Prevalence : 0.7007
## Balanced Accuracy : 0.6531
##
## 'Positive' Class : 0
##
plot(rf)
varImpPlot(rf, main = "Variable Importance, Random Forest")
None of the models used the sysBP and diaBP features, as the variable BP was used instead.
When compared to the two decision trees, the Random Forest had greater overall accuracy (72%), AUC-ROC (0.65), and F1 score (0.82).
Interestingly, the second decision tree (without the age feature) performed better than the first tree, which had age as the first node: it had slightly greater accuracy (68% vs. 66%), AUC-ROC (0.63 vs. 0.62), and F1 score (0.79 vs. 0.77).
The ability to predict heart attack (class 1) was low for all three models (specificity ≈ 0.55), owing to the original imbalance in the target variable.
During the work, several techniques were tried: normalization, adding new features, and different balancing techniques. These steps did not improve the models’ performance.
As a result, the Random Forest would be a better choice than the decision trees: it captures complicated, non-linear patterns in the data, is resistant to overfitting, and handles numerical as well as categorical data. A single decision tree has low bias and high variance, which means it fits the training data almost entirely (low bias), but its error grows on new test points (high variance). When many decision trees are combined with row and column sampling, however, the ensemble’s cumulative variance is low. As a result, Random Forest is a low-bias, low-variance model.
nice_table <- function(df, cap=NULL, cols=NULL, dig=3, fw=F){
if (is.null(cols)) {c <- colnames(df)} else {c <- cols}
table <- df %>%
kable(caption=cap, col.names=c, digits=dig) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
html_font = 'monospace',
full_width = fw)
return(table)
}
results %>%
nice_table(cap='Model Comparison') %>%
scroll_box(width='100%')
Model | Accuracy | Classification error rate | F1 | Sensitivity | Specificity | Precision | ROC
---|---|---|---|---|---|---|---
Model #1 - Decision Tree | 0.661 | 0.339 | 0.773 | 0.679 | 0.559 | 0.896 | 0.619
Model #2 - Decision Tree | 0.681 | 0.319 | 0.789 | 0.705 | 0.547 | 0.897 | 0.626
Model #3 - Random Forest | 0.719 | 0.281 | 0.818 | 0.747 | 0.559 | 0.904 | 0.653
In practice, these models would not be a suitable fit for imbalanced data. A standard random forest is built from decision trees, which are sensitive to class imbalance. Each tree is grown on a “bag”, a uniform random sample from the data (with replacement), so the class imbalance biases each tree in the same direction and by the same amount (on average). To improve the results, new health features could be derived from the existing ones (categories for smoking habit or BMI, combining cigarettes per day with smoking status to define heavy, light, and non-smokers, etc.), and an XGBoost model could be used, as boosting gives higher weight to hard-to-classify minority-class observations at each successive iteration.
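As a possible next step rather than part of this analysis, a minimal XGBoost sketch is shown below. It assumes the xgboost package and uses scale_pos_weight to upweight the minority class; this is most useful on the original, unbalanced training split, since the SMOTE-balanced data already has a ratio near 1:
# Hypothetical follow-up: gradient boosting with a class weight
library(xgboost)
# Numeric design matrix and 0/1 label vector
X_train <- model.matrix(stroke ~ . - 1, data = train_data)
y_train <- as.numeric(as.character(train_data$stroke))
# Weight positives by the majority/minority ratio
ratio <- sum(y_train == 0) / sum(y_train == 1)
bst <- xgboost(data = X_train, label = y_train,
               objective = "binary:logistic",
               scale_pos_weight = ratio,
               nrounds = 100, verbose = 0)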
The Decision Trees and Random Forest models struggled to predict the minority class (heart strokes) due to the unequal nature of the target variable. Overall, the Random Forest model was more accurate and would be a better choice than the Decision tree due to its low bias and low variance.
Huh, K. (2021, February 13). Surviving in a random forest with imbalanced datasets. Medium. https://medium.com/sfu-cspmp/surviving-in-a-random-forest-with-imbalanced-datasets-b98b963d52eb
Nwanganga, F., & Chapple, M. (2020). Practical Machine Learning in R. https://doi.org/10.1002/9781119591542
Mirza_Hasnine. (2023, March 11). Heart disease dataset. Kaggle. https://www.kaggle.com/datasets/mirzahasnine/heart-disease-dataset/data
DeciZone. (n.d.). The good, the bad & the ugly of using decision trees - decizone. DeciZone.com. https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees
The work above showed the stages from data preparation to model selection for the heart disease data from the source: https://www.kaggle.com/datasets/mirzahasnine/heart-disease-dataset/data
The first step was data preparation, which began with a dataset of 4,238 records and 16 variables. Each record represented a patient, and each column represented a health aspect. Heart_stroke was the target; it specified whether or not the patient experienced a stroke. A new variable (BP, the ratio of systolic to diastolic blood pressure) was derived from sysBP and diaBP, and those two columns were then excluded from the models. The data transformation process included addressing missing values and encoding categorical variables (Gender, Heart_stroke, prevalentStroke, education), resulting in a clean dataset for modeling. The resulting dataset consisted of 4,238 observations and 19 variables.
Using histograms, we discovered that the age bracket 50-59 was the most represented in our sample, presumably indicating a vulnerable group. Box plots revealed a clear relationship between glucose level and stroke outcomes, implying that higher glucose levels may be a risk factor. The correlation matrix demonstrated a significant relationship between sysBP and diaBP, implying multicollinearity. Furthermore, our investigation into smoking status revealed that roughly half the sample did not currently smoke, and a pie chart of the stroke variable revealed an 85%/15% class imbalance. These findings from the data exploration phase were critical in laying the groundwork for the modeling process.
The dataset was divided into training (75%) and testing (25%) sets to evaluate model performance before proceeding to prediction. The train data contained 3,179 records, the test data 1,059. The response variable was split 85% for the 0 response and 15% for the 1 response. To address this imbalance, we applied the SMOTE() function to the training data in order to achieve more even accuracy across classes. SMOTE oversamples the minority class, using bootstrapping and k-nearest neighbors to synthetically create additional observations of that class. We tried running the models both with and without balancing.
Decision Tree - First Model: For the first model, we trained and evaluated a Decision Tree classifier on all the variables except sysBP and diaBP, since the derived variable BP replaced them. We used the caret package’s train() and trainControl() functions to build the first tree, with the ‘cv’ method running 10-fold cross-validation to evaluate model performance.
The accuracy of the model was 66%. The first tree correctly identified 56% of the class 1 cases; the model could predict the minority class at all only because the training data had been balanced. The F1 score of 0.77 combined the precision and recall scores of the model. These metrics indicated that while the model had a decent overall accuracy, its performance in predicting heart strokes (class 1) was not very high (specificity 0.56), likely due to the imbalanced nature of the target variable. The AUC for the ROC curve was 0.62: the model ranked a random positive example higher than a random negative example 62% of the time, indicating moderate predictive power. The most important features were age, the number of cigarettes smoked per day, and whether a patient had prevalent hypertension; accordingly, the age feature was used as the first node.
Decision Tree - Second Model: For the second model, we trained a Decision Tree using a different feature for the first split. Since age was used as the first node in the previous tree, that feature was removed from the second model; sysBP and diaBP were again excluded. The rpart() function with minsplit = 50 (the minimum number of observations that must exist in a node for a split to be attempted) was used to build the second tree.
prevalentHyp was used as the first node. The accuracy of the model was 68%. The second tree correctly identified 55% of the class 1 cases. The F1 score of 0.79 combined the precision and recall scores of the model and was better than the previous model’s. These metrics indicated that while both models had a decent overall accuracy, their performance in predicting heart strokes (class 1) was not very high (specificity 0.55 for the second model), likely due to the imbalanced nature of the target variable. The AUC for the ROC curve was 0.63: the model ranked a random positive example higher than a random negative example 63% of the time, indicating moderate predictive power. With age removed from the model, the most important features were prevalent hypertension (the first node, prevalentHyp), whether a patient was a smoker, and whether a patient took blood pressure medicine.
Random Forest: Next, we tried a Random Forest model with 200 trees. The variables sysBP and diaBP were dropped; the variable BP was used instead.
The accuracy of the model was 72%. The Random Forest correctly identified 56% of the class 1 cases. The F1 score of 0.82 combined the precision and recall scores of the model and was the best of the three models. These metrics indicated that while the model had a decent overall accuracy, its performance in predicting heart strokes (class 1) was still not very high (specificity 0.56 for the third model), again owing to the imbalance in the data. The AUC for the ROC curve was 0.65: the model ranked a random positive example higher than a random negative example 65% of the time, indicating moderate predictive power.
The OOB error vs. trees plot showed that the optimal number of trees for the Random Forest model was the point at which the OOB error stabilized (after roughly 100 trees). Age, BMI, and BP were the most important variables as per Mean Decrease Gini.
When compared to the two decision trees, the Random Forest had greater overall accuracy (72%), AUC-ROC (0.65), and F1 score (0.82).
Interestingly, the second decision tree (without the age feature) performed better than the first tree, which had age as the first node: it had slightly greater accuracy (68% vs. 66%), AUC-ROC (0.63 vs. 0.62), and F1 score (0.79 vs. 0.77). The ability to predict heart attack (class 1) was low for all three models (specificity ≈ 0.55), owing to the original imbalance in the target variable. During the work, several techniques were tried: normalization, adding new features, and different balancing techniques. These steps did not improve the models’ performance. As a result, the Random Forest would be a better choice than the decision trees: it captures complicated, non-linear patterns in data, is resistant to overfitting, and handles numerical as well as categorical data. A single decision tree has low bias and high variance, which means it fits the training data almost entirely (low bias), but its error grows on new test points (high variance). When many decision trees are combined with row and column sampling, however, the ensemble’s cumulative variance is low. As a result, Random Forest is a low-bias, low-variance model.
In practice, these models would not be a suitable fit for imbalanced data. A standard random forest is built from decision trees, which are sensitive to class imbalance. Each tree is grown on a “bag”, a uniform random sample from the data (with replacement), so the class imbalance biases each tree in the same direction and by the same amount (on average). To improve the results, new health features could be derived from the existing ones (categories for smoking habit or BMI, combining cigarettes per day with smoking status to define heavy, light, and non-smokers, etc.), and an XGBoost model could be used, as boosting gives higher weight to hard-to-classify minority-class observations at each successive iteration.
The Decision Trees and Random Forest models struggled to predict the minority class (heart strokes) due to the imbalanced nature of the target variable. Overall, the Random Forest model was more accurate and would be a better choice than the Decision tree due to its low bias and low variance.
Resolving concerns regarding the usage of decision trees is critical to building trust and ensuring effective implementation, especially in light of real-world issues and the “bad & ugly” characteristics noted in the article [4].
Details & Documentation: The decision tree is a transparent algorithm that forecasts the risk of a heart attack. Being transparent about how predictions are formed and which factors are taken into account might increase user trust. Furthermore, giving additional documentation or explanatory notes regarding the tree’s structure and logic can aid in demystifying its operation.
Complexity & Familiarity: Decision trees may become complicated. While based on data, the decision trees we constructed for forecasting heart stroke risk were intended to be interpretable. In healthcare, terms like ‘age’, ‘glucose’, and ‘BMI’ are common; making sure that decision tree outputs and splits were based on such clear and well-recognized metrics helped lower the perception of complexity.
Subjectivity & Evolution: The model was constructed entirely from quantifiable data, ensuring objectivity in forecasts. However, as medical knowledge and patient data evolve, the tree should be updated and enhanced on a regular basis. The tree’s reliability can be improved by ongoing evolution depending on new research or comments.
Usability: While the model was built in this environment, it is critical that it be integrated into user-friendly platforms to ensure cross-device accessibility. When healthcare professionals or patients interact with the tree, they should be guided by intuitive interfaces.
Continuous input & Improvement: Collect input from end users, whether they are healthcare professionals or patients, on the decision tree’s predictions and usability. This feedback loop can identify areas for improvement and ensure that the model remains relevant to real-world requirements.
Education & Outreach: Providing training sessions or workshops can help familiarize stakeholders with the benefits and workings of the decision tree. Demonstrating its efficacy through case studies or real-world experiences might help to reinforce its worth.
In order to modify people’s perceptions of the decision tree we developed, we must emphasize its clarity, objectivity, and adaptability. We can address many of the concerns about decision trees and demonstrate its efficacy in forecasting heart stroke risks by integrating it into user-friendly platforms, regularly developing it based on feedback, and engaging in proactive education and outreach.