Data 622 Homework 2

library(tidyverse)
library(knitr)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(kableExtra)
library(reactable)
library(caret)
library(car)
library(vtable)
library(outliers)
#library(rstatix)
library(GGally)
library(fixr)
library(rpart)
library(randomForest)
#library(MASS)
library(rsample)
library(tidymodels)
library(visdat)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(corrplot)
library(partykit)
library(reshape)
library(janitor)
library(yardstick)
library(grid)
library(gridExtra)           
library(likert)

I obtained this dataset from Kaggle.

df <- read.csv("/Users/mohamedhassan/Downloads/heart_attack_prediction_dataset.csv", check.names = FALSE)

reactable(df)

st(df)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
Age	8763	54	21	18	35	72	90
Sex	8763
… Female	2652	30%
… Male	6111	70%
Cholesterol	8763	260	81	120	192	330	400
Heart Rate	8763	75	21	40	57	93	110
Diabetes	8763	0.65	0.48	0	0	1	1
Family History	8763	0.49	0.5	0	0	1	1
Smoking	8763	0.9	0.3	0	1	1	1
Obesity	8763	0.5	0.5	0	0	1	1
Alcohol Consumption	8763	0.6	0.49	0	0	1	1
Exercise Hours Per Week	8763	10	5.8	0.0024	5	15	20
Diet	8763
… Average	2912	33%
… Healthy	2960	34%
… Unhealthy	2891	33%
Previous Heart Problems	8763	0.5	0.5	0	0	1	1
Medication Use	8763	0.5	0.5	0	0	1	1
Stress Level	8763	5.5	2.9	1	3	8	10
Sedentary Hours Per Day	8763	6	3.5	0.0013	3	9	12
Income	8763	158263	80575	20062	88310	227749	299954
BMI	8763	29	6.3	18	23	34	40
Triglycerides	8763	418	224	30	226	612	800
Physical Activity Days Per Week	8763	3.5	2.3	0	2	5	7
Sleep Hours Per Day	8763	7	2	4	5	9	10
Continent	8763
… Africa	873	10%
… Asia	2543	29%
… Australia	884	10%
… Europe	2241	26%
… North America	860	10%
… South America	1362	16%
Hemisphere	8763
… Northern Hemisphere	5660	65%
… Southern Hemisphere	3103	35%
Heart Attack Risk	8763	0.36	0.48	0	0	1	1

glimpse(df)

## Rows: 8,763
## Columns: 26
## $ `Patient ID`                      <chr> "BMW7812", "CZE1114", "BNI9906", "JL…
## $ Age                               <int> 67, 21, 21, 84, 66, 54, 90, 84, 20, …
## $ Sex                               <chr> "Male", "Male", "Female", "Male", "M…
## $ Cholesterol                       <int> 208, 389, 324, 383, 318, 297, 358, 2…
## $ `Blood Pressure`                  <chr> "158/88", "165/93", "174/99", "163/1…
## $ `Heart Rate`                      <int> 72, 98, 72, 73, 93, 48, 84, 107, 68,…
## $ Diabetes                          <int> 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, …
## $ `Family History`                  <int> 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, …
## $ Smoking                           <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Obesity                           <int> 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, …
## $ `Alcohol Consumption`             <int> 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, …
## $ `Exercise Hours Per Week`         <dbl> 4.1681888, 1.8132416, 2.0783530, 9.8…
## $ Diet                              <chr> "Average", "Unhealthy", "Healthy", "…
## $ `Previous Heart Problems`         <int> 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, …
## $ `Medication Use`                  <int> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, …
## $ `Stress Level`                    <int> 9, 1, 9, 9, 6, 2, 7, 4, 5, 4, 8, 4, …
## $ `Sedentary Hours Per Day`         <dbl> 6.615001, 4.963459, 9.463426, 7.6489…
## $ Income                            <int> 261404, 285768, 235282, 125640, 1605…
## $ BMI                               <dbl> 31.25123, 27.19497, 28.17657, 36.464…
## $ Triglycerides                     <int> 286, 235, 587, 378, 231, 795, 284, 3…
## $ `Physical Activity Days Per Week` <int> 0, 1, 4, 3, 1, 5, 4, 6, 7, 7, 0, 4, …
## $ `Sleep Hours Per Day`             <int> 6, 7, 4, 4, 5, 10, 10, 7, 4, 7, 4, 8…
## $ Country                           <chr> "Argentina", "Canada", "France", "Ca…
## $ Continent                         <chr> "South America", "North America", "E…
## $ Hemisphere                        <chr> "Southern Hemisphere", "Northern Hem…
## $ `Heart Attack Risk`               <int> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, …

sum(is.na(df))

## [1] 0

I removed Patient ID since it isn’t a health metric:

df2 <- df %>%
  select(-`Patient ID`)

I transformed the Blood Pressure column into two separate columns: Systolic and Diastolic:

df2 <- df2 %>%
  mutate(s0 = str_split(`Blood Pressure`,"/")) %>%
  rowwise() %>%
  mutate(Systolic = as.numeric(s0[1]), 
         Diastolic = as.numeric(s0[2])) %>%
  select(-s0)
  #select(first_bp, id, amount, systole, diastole)

Next, I created a column that categorized each patient’s blood pressure level based on their systolic and diastolic values:

df2 <- df2 %>%
      mutate(BPLevel = (case_when(Systolic <=120 | Diastolic <=80 ~ "Normal",
                                 between(Systolic, 120, 139) | between(Diastolic, 80, 89)~ "Prehypertension",
                                 Systolic>=140 | Diastolic >= 90 ~ "Hypertension",
                                 TRUE ~ "Missing"
                                 )))

After creating the new columns, I used the nearZeroVar function to check if there are any predictors that have little or no variance, which can affect the forthcoming models:

nearZeroVar(df2)

## integer(0)

df2 %>%
  group_by(BPLevel) %>%
  count() %>%
  ggplot(aes(x=reorder(BPLevel, -n), y=n)) +
  geom_bar(stat="identity", position="dodge") +
  labs(x="Blood Pressure Level",
       y="Number of Patients",
       title="Count of Patients in Each Blood Pressure Category") +
  theme(plot.title = element_text(hjust = 0.5))

A majority of the patients have normal blood pressure.

Exploring Each Continent’s Health and Lifestyle

I wanted to analyze the relationship between features in the dataset and the Continents of each patient:

df2 %>%
  group_by(Continent) %>%
  mutate(`Sleep Hours Per Day` = mean(`Sleep Hours Per Day`)) %>%
  ggplot(aes(x=reorder(Continent,-`Sleep Hours Per Day`), y=`Sleep Hours Per Day`, fill=Continent)) +
  geom_bar(stat="identity", position="dodge") +
  labs(x="Continent",
       y="Average Sleep Per Day",
       title="Average Sleep Per Day in Each Continent") +
  theme(plot.title = element_text(hjust = 0.5))

df2 %>%
  group_by(Continent) %>%
  mutate(`Sedentary Hours Per Day` = mean(`Sedentary Hours Per Day`)) %>%
  ggplot(aes(x=reorder(Continent,-`Sedentary Hours Per Day`), y=`Sedentary Hours Per Day`, fill=Continent)) +
  geom_bar(stat="identity", position="dodge") +
  labs(x="Continent",
       y="Average Sedentary Hours Per Day",
       title="Average Sedentary Hours Per Day in Each Continent") +
  theme(plot.title = element_text(hjust = 0.5))

df2 %>%
  group_by(Continent) %>%
  mutate(`Physical Activity Days Per Week` = mean(`Physical Activity Days Per Week`)) %>%
  ggplot(aes(x=reorder(Continent,-`Physical Activity Days Per Week`), y=`Physical Activity Days Per Week`, fill=Continent)) +
  geom_bar(stat="identity", position="dodge") +
  labs(x="Continent",
       y="Average Physical Activitiy Days",
       title="Average Physical Activitiy Days Per Week in Each Continent") +
  theme(plot.title = element_text(hjust = 0.5))

df2 %>%
  group_by(Continent) %>%
  mutate(`Exercise Hours Per Week` = mean(`Exercise Hours Per Week`)) %>%
  ggplot(aes(x=reorder(Continent,-`Exercise Hours Per Week`), y=`Exercise Hours Per Week`, fill=Continent)) +
  geom_bar(stat="identity", position="dodge") +
  labs(x="Continent",
       y="Average Exercise Hours",
       title="Average Exercise Hours Per Week in Each Continent") +
  theme(plot.title = element_text(hjust = 0.5))

df2 %>%
  group_by(Continent) %>%
  mutate(`Stress Level` = mean(`Stress Level`)) %>%
  ggplot(aes(x=reorder(Continent,-`Stress Level`), y=`Stress Level`, fill=Continent)) +
  geom_bar(stat="identity", position="dodge") +
  labs(x="Continent",
       y="Average Stress Level",
       title="Average Stress Level in Each Continent") +
  theme(plot.title = element_text(hjust = 0.5))

df2 %>%
  group_by(Continent) %>%
  mutate(Cholesterol = mean(Cholesterol)) %>%
  ggplot(aes(x=reorder(Continent,-Cholesterol), y=Cholesterol, fill=Continent)) +
  geom_bar(stat="identity", position="dodge") +
  labs(x="Continent",
       y="Average Cholesterol",
       title="Average Cholesterol in Each Continent") +
  theme(plot.title = element_text(hjust = 0.5))

df2 %>%
  group_by(Continent) %>%
  mutate(BMI = mean(BMI)) %>%
  ggplot(aes(x=reorder(Continent,-BMI), y=BMI, fill=Continent)) +
  geom_bar(stat="identity", position="dodge") +
  labs(x="Continent",
       y="Average BMI",
       title="Average BMI in Each Continent") +
  theme(plot.title = element_text(hjust = 0.5))

df2 %>%
  group_by(Continent) %>%
  mutate(Triglycerides = mean(Triglycerides)) %>%
  ggplot(aes(x=reorder(Continent,-Triglycerides), y=Triglycerides, fill=Continent)) +
  geom_bar(stat="identity", position="dodge") +
  labs(x="Continent",
       y="Average Triglycerides",
       title="Average Triglycerides in Each Continent") +
  theme(plot.title = element_text(hjust = 0.5))

There are some interesting observations that are noteworthy. Patients from Australia had the highest average of sleep hours per day and sedentary hours per day, while also having the lowest average stress levels and lowest average exercise hours per week. North American patients had the highest BMI and Cholesterol levels, but averaged the second-highest exercise hours per week. Patients from Africa averaged the highest exercise hours per week and the second-highest BMI, sleep hours per day, and exercise hours per week, respectively.

df3 <- df2 %>%
  select(BPLevel, `Sleep Hours Per Day`, `Sedentary Hours Per Day`,
         `Physical Activity Days Per Week`, BMI, `Stress Level`, `Exercise Hours Per Week`,
         Cholesterol, Triglycerides)

p <- ggpairs(df3, lower = list(continuous = wrap("smooth", se=FALSE, alpha = 0.7, size=0.5)))
p[5,3] <- p[5,3] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p[3,5] <- p[3,5] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p

df4 <- df3 %>%
  select(-BPLevel)

df_cor <- Hmisc::rcorr(as.matrix(df4))

data.frame(df_cor$r) %>% head() %>% kable() %>% kable_styling()

	Sleep.Hours.Per.Day	Sedentary.Hours.Per.Day	Physical.Activity.Days.Per.Week	BMI	Stress.Level	Exercise.Hours.Per.Week	Cholesterol	Triglycerides
Sleep Hours Per Day	1.0000000	0.0047920	0.0140334	-0.0100304	-0.0142054	-0.0012453	0.0044562	-0.0292160
Sedentary Hours Per Day	0.0047920	1.0000000	-0.0061780	-0.0000236	-0.0053972	0.0087556	0.0189145	-0.0057846
Physical Activity Days Per Week	0.0140334	-0.0061780	1.0000000	0.0081104	0.0074046	0.0077252	0.0160559	-0.0075564
BMI	-0.0100304	-0.0000236	0.0081104	1.0000000	-0.0032504	0.0037769	0.0172919	-0.0059636
Stress Level	-0.0142054	-0.0053972	0.0074046	-0.0032504	1.0000000	-0.0091024	-0.0244871	-0.0039213
Exercise Hours Per Week	-0.0012453	0.0087556	0.0077252	0.0037769	-0.0091024	1.0000000	0.0215171	0.0017169

Among the independent variables chosen, there doesn’t appear to be any multicollinearity, which helps eliminate a possible factor in determining model performance.

First Decision Tree Model

The first Decision Tree model will focus on using the newly created BPLevel column as the dependent variable. I will include the independent variables explored when comparing each continent: Sleep Hours Per Day, Sedentary Hours Per Day, Physical Activity Days Per Week, Exercise Hours Per Week, Stress Level, Cholesterol, BMI, and Triglycerides.

df3 <- fix_col_spaces(df3)

set.seed(1234)

df_tree <- createDataPartition(df3$BPLevel, p=0.75, list=FALSE)
# select 20% of the data for validation
df_train <- df3[df_tree,]
# use the remaining 80% of data to training and testing the models
df_test <- df3[-df_tree,]

round(prop.table(table(select(df3, BPLevel), exclude = NULL)), 4) * 100

## BPLevel
##    Hypertension          Normal Prehypertension 
##           19.23           60.55           20.22

round(prop.table(table(select(df_train, BPLevel), exclude = NULL)), 4) * 100

## BPLevel
##    Hypertension          Normal Prehypertension 
##           19.23           60.55           20.22

round(prop.table(table(select(df_test, BPLevel), exclude = NULL)), 4) * 100

## BPLevel
##    Hypertension          Normal Prehypertension 
##           19.22           60.55           20.23

set.seed(123)
tree <- rpart(BPLevel ~ ., data = df_train, cp=0.00001, method="class")
rpart.plot(tree, fallen.leaves = FALSE)

tree %>%
  varImp() %>%
  arrange(desc(Overall))

##                                  Overall
## Exercise_Hours_Per_Week         516.4832
## Triglycerides                   496.6889
## Sedentary_Hours_Per_Day         492.9993
## Cholesterol                     456.8389
## BMI                             427.4433
## Stress_Level                    250.2686
## Sleep_Hours_Per_Day             240.5037
## Physical_Activity_Days_Per_Week 163.5319

set.seed(123)
predict_tree <- predict(tree,df_test,type = 'class')

set.seed(123)
table(df_test$BPLevel, predict_tree)

##                  predict_tree
##                   Hypertension Normal Prehypertension
##   Hypertension              47    300              74
##   Normal                   190    934             202
##   Prehypertension           61    315              67

set.seed(123)
confusionMatrix(predict_tree,as.factor(df_test$BPLevel))

## Confusion Matrix and Statistics
## 
##                  Reference
## Prediction        Hypertension Normal Prehypertension
##   Hypertension              47    190              61
##   Normal                   300    934             315
##   Prehypertension           74    202              67
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4785          
##                  95% CI : (0.4574, 0.4997)
##     No Information Rate : 0.6055          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0147         
##                                           
##  Mcnemar's Test P-Value : 5.826e-11       
## 
## Statistics by Class:
## 
##                      Class: Hypertension Class: Normal Class: Prehypertension
## Sensitivity                      0.11164        0.7044                0.15124
## Specificity                      0.85811        0.2882                0.84201
## Pos Pred Value                   0.15772        0.6030                0.19534
## Neg Pred Value                   0.80233        0.3885                0.79643
## Prevalence                       0.19224        0.6055                0.20228
## Detection Rate                   0.02146        0.4265                0.03059
## Detection Prevalence             0.13607        0.7073                0.15662
## Balanced Accuracy                0.48488        0.4963                0.49663

cm <- confusionMatrix(predict_tree,as.factor(df_test$BPLevel))

cm$byClass

##                        Sensitivity Specificity Pos Pred Value Neg Pred Value
## Class: Hypertension      0.1116390   0.8581119      0.1577181      0.8023256
## Class: Normal            0.7043741   0.2881944      0.6029697      0.3884555
## Class: Prehypertension   0.1512415   0.8420149      0.1953353      0.7964266
##                        Precision    Recall        F1 Prevalence Detection Rate
## Class: Hypertension    0.1577181 0.1116390 0.1307371  0.1922374     0.02146119
## Class: Normal          0.6029697 0.7043741 0.6497391  0.6054795     0.42648402
## Class: Prehypertension 0.1953353 0.1512415 0.1704835  0.2022831     0.03059361
##                        Detection Prevalence Balanced Accuracy
## Class: Hypertension               0.1360731         0.4848754
## Class: Normal                     0.7073059         0.4962843
## Class: Prehypertension            0.1566210         0.4966282

First Decision Tree Variable Importance

tree %>%
  varImp() %>%
  arrange(desc(Overall))

##                                  Overall
## Exercise_Hours_Per_Week         516.4832
## Triglycerides                   496.6889
## Sedentary_Hours_Per_Day         492.9993
## Cholesterol                     456.8389
## BMI                             427.4433
## Stress_Level                    250.2686
## Sleep_Hours_Per_Day             240.5037
## Physical_Activity_Days_Per_Week 163.5319

imp_df <- data.frame(imp = tree$variable.importance)
imp_df2 <- imp_df %>% 
  tibble::rownames_to_column() %>% 
  dplyr::rename("variable" = rowname) %>% 
  dplyr::arrange(imp) %>%
  dplyr::mutate(variable = forcats::fct_inorder(variable))
ggplot2::ggplot(imp_df2) +
  geom_col(aes(x = variable, y = imp),
           col = "black", show.legend = F) +
  coord_flip() +
  scale_fill_grey() +
  theme_bw()

Second Decision Tree Model

I used the feature importances from the first model to determine the variables to use for the second model. This time, I will use less variables and assess what the results are:

df4 <- df2 %>%
  #select(`Heart Attack Risk`, BMI)
  select(BPLevel, `Sedentary Hours Per Day`, Cholesterol, Triglycerides, `Exercise Hours Per Week`)

df4 <- fix_col_spaces(df4)

set.seed(1234)

df_tree2 <- createDataPartition(df4$BPLevel, p=0.75, list=FALSE)
#df_tree2 <- createDataPartition(df4$BPLevel, p=0.8, list=FALSE)
df_train2 <- df4[df_tree2,]
df_test2 <- df4[-df_tree2,]

round(prop.table(table(select(df4, BPLevel), exclude = NULL)), 4) * 100

## BPLevel
##    Hypertension          Normal Prehypertension 
##           19.23           60.55           20.22

round(prop.table(table(select(df_train2, BPLevel), exclude = NULL)), 4) * 100

## BPLevel
##    Hypertension          Normal Prehypertension 
##           19.23           60.55           20.22

round(prop.table(table(select(df_test2, BPLevel), exclude = NULL)), 4) * 100

## BPLevel
##    Hypertension          Normal Prehypertension 
##           19.22           60.55           20.23

set.seed(1234)
tree2 <- rpart(BPLevel ~ ., data = df_train2, cp=0.0001, method="class")
rpart.plot(tree2, fallen.leaves=FALSE, box.palette = "BuGn")

set.seed(123)
predict_tree2 <- predict(tree2, df_test2, type = 'class')

table(df_test2$BPLevel, predict_tree2)

##                  predict_tree2
##                   Hypertension Normal Prehypertension
##   Hypertension              59    292              70
##   Normal                   195    939             192
##   Prehypertension           67    311              65

set.seed(123)
confusionMatrix(predict_tree2,as.factor(df_test2$BPLevel))

## Confusion Matrix and Statistics
## 
##                  Reference
## Prediction        Hypertension Normal Prehypertension
##   Hypertension              59    195              67
##   Normal                   292    939             311
##   Prehypertension           70    192              65
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4854          
##                  95% CI : (0.4643, 0.5066)
##     No Information Rate : 0.6055          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0013          
##                                           
##  Mcnemar's Test P-Value : 2.669e-10       
## 
## Statistics by Class:
## 
##                      Class: Hypertension Class: Normal Class: Prehypertension
## Sensitivity                      0.14014        0.7081                0.14673
## Specificity                      0.85189        0.3021                0.85003
## Pos Pred Value                   0.18380        0.6089                0.19878
## Neg Pred Value                   0.80631        0.4028                0.79710
## Prevalence                       0.19224        0.6055                0.20228
## Detection Rate                   0.02694        0.4288                0.02968
## Detection Prevalence             0.14658        0.7041                0.14932
## Balanced Accuracy                0.49602        0.5051                0.49838

cm2 <- confusionMatrix(predict_tree2,as.factor(df_test2$BPLevel))

cm2$byClass

##                        Sensitivity Specificity Pos Pred Value Neg Pred Value
## Class: Hypertension      0.1401425   0.8518937      0.1838006      0.8063135
## Class: Normal            0.7081448   0.3020833      0.6089494      0.4027778
## Class: Prehypertension   0.1467269   0.8500286      0.1987768      0.7971014
##                        Precision    Recall        F1 Prevalence Detection Rate
## Class: Hypertension    0.1838006 0.1401425 0.1590296  0.1922374     0.02694064
## Class: Normal          0.6089494 0.7081448 0.6548117  0.6054795     0.42876712
## Class: Prehypertension 0.1987768 0.1467269 0.1688312  0.2022831     0.02968037
##                        Detection Prevalence Balanced Accuracy
## Class: Hypertension               0.1465753         0.4960181
## Class: Normal                     0.7041096         0.5051141
## Class: Prehypertension            0.1493151         0.4983777

Second Decision Tree Variable Importance

tree2 %>%
  varImp() %>%
  arrange(desc(Overall))

##                          Overall
## Sedentary_Hours_Per_Day 586.8255
## Triglycerides           564.2332
## Exercise_Hours_Per_Week 532.8059
## Cholesterol             522.9029

imp_df <- data.frame(imp = tree2$variable.importance)
imp_df2 <- imp_df %>% 
  tibble::rownames_to_column() %>% 
  dplyr::rename("variable" = rowname) %>% 
  dplyr::arrange(imp) %>%
  dplyr::mutate(variable = forcats::fct_inorder(variable))
ggplot2::ggplot(imp_df2) +
  geom_col(aes(x = variable, y = imp),
           col = "black", show.legend = F) +
  coord_flip() +
  scale_fill_grey() +
  theme_bw()

Random Forest Model

df5 <- fix_col_spaces(df4)

df5$BPLevel <- factor(df5$BPLevel)

set.seed(123)
df_rf <- createDataPartition(df5$BPLevel, p = 0.8, list = FALSE)
df_train_rf <- df5[df_rf, ]
df_test_rf <- df5[-df_rf, ]

round(prop.table(table(select(df5, BPLevel), exclude = NULL)), 4) * 100

## BPLevel
##    Hypertension          Normal Prehypertension 
##           19.23           60.55           20.22

round(prop.table(table(select(df_train_rf, BPLevel), exclude = NULL)), 4) * 100

## BPLevel
##    Hypertension          Normal Prehypertension 
##           19.23           60.55           20.23

round(prop.table(table(select(df_test_rf, BPLevel), exclude = NULL)), 4) * 100

## BPLevel
##    Hypertension          Normal Prehypertension 
##           19.24           60.56           20.21

set.seed(123)
rf_model <- randomForest(BPLevel ~ ., data = df_train_rf, importance=TRUE, ntree=500)

rf_model

## 
## Call:
##  randomForest(formula = BPLevel ~ ., data = df_train_rf, importance = TRUE,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 41.98%
## Confusion matrix:
##                 Hypertension Normal Prehypertension class.error
## Hypertension              43   1263              42  0.96810089
## Normal                   139   3961             145  0.06690224
## Prehypertension           36   1318              64  0.95486601

set.seed(123)
rf_pred <- predict(rf_model,newdata = df_test_rf)

set.seed(123)
confusionMatrix(rf_pred, df_test_rf$BPLevel)

## Confusion Matrix and Statistics
## 
##                  Reference
## Prediction        Hypertension Normal Prehypertension
##   Hypertension               3     30               8
##   Normal                   322    993             336
##   Prehypertension           12     38              10
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5742          
##                  95% CI : (0.5507, 0.5975)
##     No Information Rate : 0.6056          
##     P-Value [Acc > NIR] : 0.9966          
##                                           
##                   Kappa : -0.0189         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: Hypertension Class: Normal Class: Prehypertension
## Sensitivity                     0.008902       0.93591               0.028249
## Specificity                     0.973145       0.04776               0.964235
## Pos Pred Value                  0.073171       0.60145               0.166667
## Neg Pred Value                  0.804793       0.32673               0.796690
## Prevalence                      0.192352       0.60559               0.202055
## Detection Rate                  0.001712       0.56678               0.005708
## Detection Prevalence            0.023402       0.94235               0.034247
## Balanced Accuracy               0.491023       0.49183               0.496242

cm3 <- confusionMatrix(rf_pred, df_test_rf$BPLevel)

Random Forest Variable Importance

set.seed(123)
rf_model %>%
  varImp()

##                         Hypertension     Normal Prehypertension
## Sedentary_Hours_Per_Day     4.117624  1.0014847        2.880540
## Cholesterol                 6.415973 -1.7999160        2.604052
## Triglycerides               0.615599 -1.3743438        3.640563
## Exercise_Hours_Per_Week     5.031793  0.1870341        0.946421

Confusion Matrix Class Statistics: First Decision Tree

set.seed(123)
cm_d <- as.data.frame(cm$table)
# confusion matrix statistics as data.frame
cm_st <-data.frame(cm$overall)
# round the values
cm_st$cm.overall <- round(cm_st$cm.overall,2)

# here we also have the rounded percentage values
cm_p <- as.data.frame(prop.table(cm$table))
cm_d$Perc <- round(cm_p$Freq*100,2)

set.seed(123)
# plotting the matrix
cm_d_p <-  ggplot(data = cm_d, aes(x = Prediction , y =  Reference, fill = Freq))+
  geom_tile() +
  geom_text(aes(label = paste("",Freq,",",Perc,"%")), color = 'white', size = 2.6) +
  theme_light() +
  guides(fill=FALSE) 

# plotting the stats
cm_st_p <-  tableGrob(cm_st)

# all together
grid.arrange(cm_d_p, cm_st_p,nrow = 1, ncol = 2, 
             top=textGrob("Decision Tree 1 Confusion Matrix",gp=gpar(fontsize=25,font=1)))

Confusion Matrix Class Statistics: Second Decision Tree

cm2$byClass

##                        Sensitivity Specificity Pos Pred Value Neg Pred Value
## Class: Hypertension      0.1401425   0.8518937      0.1838006      0.8063135
## Class: Normal            0.7081448   0.3020833      0.6089494      0.4027778
## Class: Prehypertension   0.1467269   0.8500286      0.1987768      0.7971014
##                        Precision    Recall        F1 Prevalence Detection Rate
## Class: Hypertension    0.1838006 0.1401425 0.1590296  0.1922374     0.02694064
## Class: Normal          0.6089494 0.7081448 0.6548117  0.6054795     0.42876712
## Class: Prehypertension 0.1987768 0.1467269 0.1688312  0.2022831     0.02968037
##                        Detection Prevalence Balanced Accuracy
## Class: Hypertension               0.1465753         0.4960181
## Class: Normal                     0.7041096         0.5051141
## Class: Prehypertension            0.1493151         0.4983777

set.seed(123)
cm_d <- as.data.frame(cm2$table)
# confusion matrix statistics as data.frame
cm_st <-data.frame(cm2$overall)
# round the values
cm_st$cm2.overall <- round(cm_st$cm2.overall,2)

# here we also have the rounded percentage values
cm_p <- as.data.frame(prop.table(cm2$table))
cm_d$Perc <- round(cm_p$Freq*100,2)

set.seed(123)
# plotting the matrix
cm_d_p <-  ggplot(data = cm_d, aes(x = Prediction , y =  Reference, fill = Freq))+
  geom_tile() +
  geom_text(aes(label = paste("",Freq,",",Perc,"%")), color = 'white', size = 2.6) +
  theme_light() +
  guides(fill=FALSE) 

# plotting the stats
cm_st_p <-  tableGrob(cm_st)

# all together
grid.arrange(cm_d_p, cm_st_p,nrow = 1, ncol = 2, 
             top=textGrob("Decision Tree 2 Confusion Matrix",gp=gpar(fontsize=25,font=1)))

Confusion Matrix Class Statistics: Random Forest

cm3$byClass

##                        Sensitivity Specificity Pos Pred Value Neg Pred Value
## Class: Hypertension    0.008902077  0.97314488     0.07317073      0.8047925
## Class: Normal          0.935909519  0.04775687     0.60145366      0.3267327
## Class: Prehypertension 0.028248588  0.96423462     0.16666667      0.7966903
##                         Precision      Recall         F1 Prevalence
## Class: Hypertension    0.07317073 0.008902077 0.01587302  0.1923516
## Class: Normal          0.60145366 0.935909519 0.73230088  0.6055936
## Class: Prehypertension 0.16666667 0.028248588 0.04830918  0.2020548
##                        Detection Rate Detection Prevalence Balanced Accuracy
## Class: Hypertension       0.001712329           0.02340183         0.4910235
## Class: Normal             0.566780822           0.94235160         0.4918332
## Class: Prehypertension    0.005707763           0.03424658         0.4962416

set.seed(123)
cm_d <- as.data.frame(cm3$table)
# confusion matrix statistics as data.frame
cm_st <-data.frame(cm3$overall)
# round the values
cm_st$cm3.overall <- round(cm_st$cm3.overall,2)

# here we also have the rounded percentage values
cm_p <- as.data.frame(prop.table(cm3$table))
cm_d$Perc <- round(cm_p$Freq*100,2)

set.seed(123)
# plotting the matrix
cm_d_p <-  ggplot(data = cm_d, aes(x = Prediction , y =  Reference, fill = Freq))+
  geom_tile() +
  geom_text(aes(label = paste("",Freq,",",Perc,"%")), color = 'white', size = 2.6) +
  theme_light() +
  guides(fill=FALSE) 

# plotting the stats
cm_st_p <-  tableGrob(cm_st)

# all together
grid.arrange(cm_d_p, cm_st_p,nrow = 1, ncol = 2, 
             top=textGrob("Random Forest Model Confusion Matrix",gp=gpar(fontsize=25,font=1)))

Comparing Precision, Recall, F1 Score, and Accuracy

Precision

set.seed(123)
precision_results <- data.frame("Decision Tree 1: Precision" = cm[["byClass"]][ , "Precision"],
                         "Decision Tree 2: Precision" = cm2[["byClass"]][ , "Precision"],
                         "Random Forest: Precision" = cm3[["byClass"]][ , "Precision"], check.names = FALSE)
precision_results %>%
  kbl(align = 'c') %>%
  kable_styling()

	Decision Tree 1: Precision	Decision Tree 2: Precision	Random Forest: Precision
Class: Hypertension	0.1577181	0.1838006	0.0731707
Class: Normal	0.6029697	0.6089494	0.6014537
Class: Prehypertension	0.1953353	0.1987768	0.1666667

Recall

set.seed(123)
recall_results <- data.frame("Decision Tree 1: Recall" = cm[["byClass"]][ , "Recall"],
                         "Decision Tree 2: Recall" = cm2[["byClass"]][ , "Recall"],
                         "Random Forest: Recall" = cm3[["byClass"]][ , "Recall"], check.names = FALSE)
recall_results %>%
  kbl(align = 'c') %>%
  kable_styling()

	Decision Tree 1: Recall	Decision Tree 2: Recall	Random Forest: Recall
Class: Hypertension	0.1116390	0.1401425	0.0089021
Class: Normal	0.7043741	0.7081448	0.9359095
Class: Prehypertension	0.1512415	0.1467269	0.0282486

F1 Score

set.seed(123)
f1_results <- data.frame("Decision Tree 1: F1 Score" = cm[["byClass"]][ , "F1"],
                         "Decision Tree 2: F1 Score" = cm2[["byClass"]][ , "F1"],
                         "Random Forest: F1 Score" = cm3[["byClass"]][ , "F1"], check.names = FALSE)
f1_results %>%
  kbl(align = 'c') %>%
  kable_styling()

	Decision Tree 1: F1 Score	Decision Tree 2: F1 Score	Random Forest: F1 Score
Class: Hypertension	0.1307371	0.1590296	0.0158730
Class: Normal	0.6497391	0.6548117	0.7323009
Class: Prehypertension	0.1704835	0.1688312	0.0483092

Accuracy

accuracy_results <- data.frame("Decision Tree 1: Accuracy" = cm[["byClass"]][ , "Balanced Accuracy"],
                         "Decision Tree 2: Accuracy" = cm2[["byClass"]][ , "Balanced Accuracy"],
                         "Random Forest: Accuracy" = cm3[["byClass"]][ , "Balanced Accuracy"], check.names = FALSE)
accuracy_results %>%
  kbl(align = 'c') %>%
  kable_styling()

	Decision Tree 1: Accuracy	Decision Tree 2: Accuracy	Random Forest: Accuracy
Class: Hypertension	0.4848754	0.4960181	0.4910235
Class: Normal	0.4962843	0.5051141	0.4918332
Class: Prehypertension	0.4966282	0.4983777	0.4962416

Essay

The dataset I used explored Heart Attack Risk. I wanted to create a multiclass classification model focused on Blood Pressure. I used the values provided in the Blood Pressure column to categorize patients into 3 classes: Normal, Prehypertension, and Hypertension, creating aBPLevel column. The first decision tree model had 8 features used to predict BPLevel. The independent variables were all continuous variables, which was purposely done to limit the complications of using a categorical variable with the dependent categorical variable. The visualization of both Decision Tree models did not render clearly, which is a danger of decision trees in terms of producing a complex diagram, making it hard to interpret. When assessing the accuracy of each class for each model, they were relatively unchanged, ranging from 0.487 to 0.504. However, the overall accuracy of each model varied, particularly between the two Decision Tree models and the Random Forest model. The first Decision Tree model produced an overall accuracy of 0.4789, improving slightly with less features in the second Decision Tree model to 0.4897. Using the same number of features as the second Decision Tree model, the Random Forest model performed better, with an overall accuracy of 0.5799.

In a scenario where we want to predict when a patient has pre-hypertension or hypertension, recall would be the metric we would want to use to identify the patients that have hypertension or prehypertension, since this metric is measured by true positives divided by true positives plus false negatives. If a patient actually has high or near-high blood pressure, we want the models to accurately predict those cases rather than predict that the patient has normal blood pressure when their blood pressure is actually not normal. Overall, each of the models don’t perform well when predicting having high or near-high blood pressure. Among the models, Random Forest performed the worst, having a recall value for hypertension of 0.0089021, or 0.89%, and a recall value for prehypertension of 0.0254237, or around 2.54%. The second Decision Tree performed the best when recalling patients with hypertension at around 14% and the first Decision Tree recalling hypertension patients at 11%. Conversely, the first Decision performed performed best when recalling prehypertension patients with a value of around 15.1%, which was slightly better than the second Decision Tree which had a recall value of 14.7%. The reduced number of variables in the second Decision Tree may have impacted the recall performance of patients with hypertension. While Random Forest performed the best when recalling patients with normal blood pressure, recalling patients in this category at a 93.6% rate, its poor performance in recalling patients with prehypertension and hypertension may indicate that the multiclass nature of the dependent variable may be too complex for the model to accurately predict patients in the prehypertension and hypertension categories, despite the low number of independent variables in the model.

The concerns expressed in the blog regarding Decision Trees are valid. Using a multiclass classification model made the visualizations from the Decision Tree models difficult to understand. Removing features from the first model did not reduce the complexity of the second model. I thought the low correlation among the independent variables and zero features that were detected using the nearZeroVar function would produce a more interpretable model. However, I think the multiclass nature of the dependent variable made it difficult to assess how each independent variable impacted each class of the dependent variable.