library(tidyverse)
library(knitr)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(kableExtra)
library(reactable)
library(caret)
library(car)
library(vtable)
library(outliers)
#library(rstatix)
library(GGally)
library(fixr)
library(rpart)
library(randomForest)
#library(MASS)
library(rsample)
library(tidymodels)
library(visdat)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(corrplot)
library(partykit)
library(reshape)
library(janitor)
library(yardstick)
library(grid)
library(gridExtra)
library(likert)
library(e1071)
I obtained this dataset from Kaggle.
df <- read.csv("/Users/mohamedhassan/Downloads/heart_attack_prediction_dataset.csv", check.names = FALSE)
reactable(df)
st(df)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| Age | 8763 | 54 | 21 | 18 | 35 | 72 | 90 |
| Sex | 8763 | ||||||
| … Female | 2652 | 30% | |||||
| … Male | 6111 | 70% | |||||
| Cholesterol | 8763 | 260 | 81 | 120 | 192 | 330 | 400 |
| Heart Rate | 8763 | 75 | 21 | 40 | 57 | 93 | 110 |
| Diabetes | 8763 | 0.65 | 0.48 | 0 | 0 | 1 | 1 |
| Family History | 8763 | 0.49 | 0.5 | 0 | 0 | 1 | 1 |
| Smoking | 8763 | 0.9 | 0.3 | 0 | 1 | 1 | 1 |
| Obesity | 8763 | 0.5 | 0.5 | 0 | 0 | 1 | 1 |
| Alcohol Consumption | 8763 | 0.6 | 0.49 | 0 | 0 | 1 | 1 |
| Exercise Hours Per Week | 8763 | 10 | 5.8 | 0.0024 | 5 | 15 | 20 |
| Diet | 8763 | ||||||
| … Average | 2912 | 33% | |||||
| … Healthy | 2960 | 34% | |||||
| … Unhealthy | 2891 | 33% | |||||
| Previous Heart Problems | 8763 | 0.5 | 0.5 | 0 | 0 | 1 | 1 |
| Medication Use | 8763 | 0.5 | 0.5 | 0 | 0 | 1 | 1 |
| Stress Level | 8763 | 5.5 | 2.9 | 1 | 3 | 8 | 10 |
| Sedentary Hours Per Day | 8763 | 6 | 3.5 | 0.0013 | 3 | 9 | 12 |
| Income | 8763 | 158263 | 80575 | 20062 | 88310 | 227749 | 299954 |
| BMI | 8763 | 29 | 6.3 | 18 | 23 | 34 | 40 |
| Triglycerides | 8763 | 418 | 224 | 30 | 226 | 612 | 800 |
| Physical Activity Days Per Week | 8763 | 3.5 | 2.3 | 0 | 2 | 5 | 7 |
| Sleep Hours Per Day | 8763 | 7 | 2 | 4 | 5 | 9 | 10 |
| Continent | 8763 | ||||||
| … Africa | 873 | 10% | |||||
| … Asia | 2543 | 29% | |||||
| … Australia | 884 | 10% | |||||
| … Europe | 2241 | 26% | |||||
| … North America | 860 | 10% | |||||
| … South America | 1362 | 16% | |||||
| Hemisphere | 8763 | ||||||
| … Northern Hemisphere | 5660 | 65% | |||||
| … Southern Hemisphere | 3103 | 35% | |||||
| Heart Attack Risk | 8763 | 0.36 | 0.48 | 0 | 0 | 1 | 1 |
glimpse(df)
## Rows: 8,763
## Columns: 26
## $ `Patient ID` <chr> "BMW7812", "CZE1114", "BNI9906", "JL…
## $ Age <int> 67, 21, 21, 84, 66, 54, 90, 84, 20, …
## $ Sex <chr> "Male", "Male", "Female", "Male", "M…
## $ Cholesterol <int> 208, 389, 324, 383, 318, 297, 358, 2…
## $ `Blood Pressure` <chr> "158/88", "165/93", "174/99", "163/1…
## $ `Heart Rate` <int> 72, 98, 72, 73, 93, 48, 84, 107, 68,…
## $ Diabetes <int> 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, …
## $ `Family History` <int> 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, …
## $ Smoking <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Obesity <int> 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, …
## $ `Alcohol Consumption` <int> 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, …
## $ `Exercise Hours Per Week` <dbl> 4.1681888, 1.8132416, 2.0783530, 9.8…
## $ Diet <chr> "Average", "Unhealthy", "Healthy", "…
## $ `Previous Heart Problems` <int> 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, …
## $ `Medication Use` <int> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, …
## $ `Stress Level` <int> 9, 1, 9, 9, 6, 2, 7, 4, 5, 4, 8, 4, …
## $ `Sedentary Hours Per Day` <dbl> 6.615001, 4.963459, 9.463426, 7.6489…
## $ Income <int> 261404, 285768, 235282, 125640, 1605…
## $ BMI <dbl> 31.25123, 27.19497, 28.17657, 36.464…
## $ Triglycerides <int> 286, 235, 587, 378, 231, 795, 284, 3…
## $ `Physical Activity Days Per Week` <int> 0, 1, 4, 3, 1, 5, 4, 6, 7, 7, 0, 4, …
## $ `Sleep Hours Per Day` <int> 6, 7, 4, 4, 5, 10, 10, 7, 4, 7, 4, 8…
## $ Country <chr> "Argentina", "Canada", "France", "Ca…
## $ Continent <chr> "South America", "North America", "E…
## $ Hemisphere <chr> "Southern Hemisphere", "Northern Hem…
## $ `Heart Attack Risk` <int> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, …
sum(is.na(df))
## [1] 0
I removed Patient ID since it isn’t a health metric:
df2 <- df %>%
select(-`Patient ID`)
I transformed the Blood Pressure column into two separate columns: Systolic and Diastolic:
df2 <- df2 %>%
mutate(s0 = str_split(`Blood Pressure`,"/")) %>%
rowwise() %>%
mutate(Systolic = as.numeric(s0[1]),
Diastolic = as.numeric(s0[2])) %>%
select(-s0)
#select(first_bp, id, amount, systole, diastole)
Next, I created a column that categorized each patient’s blood pressure level based on their systolic and diastolic values:
df2 <- df2 %>%
mutate(BPLevel = (case_when(Systolic <=120 | Diastolic <=80 ~ "Normal",
between(Systolic, 120, 139) | between(Diastolic, 80, 89)~ "Prehypertension",
Systolic>=140 | Diastolic >= 90 ~ "Hypertension",
TRUE ~ "Missing"
)))
After creating the new columns, I used the nearZeroVar
function to check if there are any predictors that have little or no
variance, which can affect the forthcoming models:
nearZeroVar(df2)
## integer(0)
df2 %>%
group_by(BPLevel) %>%
count() %>%
ggplot(aes(x=reorder(BPLevel, -n), y=n)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Blood Pressure Level",
y="Number of Patients",
title="Count of Patients in Each Blood Pressure Category") +
theme(plot.title = element_text(hjust = 0.5))
A majority of the patients have normal blood pressure.
Exploring Health and Lifestyle For Patients in Each Blood Pressure Level
I wanted to compare patients that have normal, prehypertension, and hypertension. Because of the stark imbalance of patients who had normal blood pressure and those who had prehypertension and hypertension, I examined the median values instead of the average among each categorical blood pressure group:
df2 %>%
group_by(BPLevel) %>%
mutate(`Sleep Hours Per Day` = median(`Sleep Hours Per Day`)) %>%
ggplot(aes(x=reorder(BPLevel,-`Sleep Hours Per Day`), y=`Sleep Hours Per Day`, fill=BPLevel)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Blood Pressure Level",
y="Median Sleep Per Day",
title="Median Sleep Per Day For Each Blood Pressure Level") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(BPLevel) %>%
mutate(`Sedentary Hours Per Day` = median(`Sedentary Hours Per Day`)) %>%
ggplot(aes(x=reorder(BPLevel,-`Sedentary Hours Per Day`), y=`Sedentary Hours Per Day`, fill=BPLevel)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Blood Pressure Level",
y="Median Sedentary Hours Per Day",
title="Median Sedentary Hours Per Day For Each Blood Pressure Level") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(BPLevel) %>%
mutate(`Physical Activity Days Per Week` = median(`Physical Activity Days Per Week`)) %>%
ggplot(aes(x=reorder(BPLevel,-`Physical Activity Days Per Week`), y=`Physical Activity Days Per Week`, fill=BPLevel)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Blood Pressure Level",
y="Median Physical Activity Days",
title="Median Physical Activity Days Per Week For Each Blood Pressure Level") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(BPLevel) %>%
mutate(`Exercise Hours Per Week` = median(`Exercise Hours Per Week`)) %>%
ggplot(aes(x=reorder(BPLevel,-`Exercise Hours Per Week`), y=`Exercise Hours Per Week`, fill=BPLevel)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Blood Pressure Level",
y="Median Exercise Hours",
title="Median Exercise Hours Per Week in For Each Blood Pressure Level") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(BPLevel) %>%
mutate(`Stress Level` = median(`Stress Level`)) %>%
ggplot(aes(x=reorder(BPLevel,-`Stress Level`), y=`Stress Level`, fill=BPLevel)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Blood Pressure Level",
y="Median Stress Level",
title="Median Stress Level by Blood Pressure Level") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(BPLevel) %>%
mutate(Cholesterol = median(Cholesterol)) %>%
ggplot(aes(x=reorder(BPLevel,-Cholesterol), y=Cholesterol, fill=BPLevel)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Blood Pressure Level",
y="MedianCholesterol",
title="Median Cholesterol For Each Blood Pressure Level") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(BPLevel) %>%
mutate(BMI = median(BMI)) %>%
ggplot(aes(x=reorder(BPLevel,-BMI), y=BMI, fill=BPLevel)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Blood Pressure Level",
y="Median BMI",
title="Median BMI For Each Blood Pressure Level") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(BPLevel) %>%
mutate(Triglycerides = median(Triglycerides)) %>%
ggplot(aes(x=reorder(BPLevel,-Triglycerides), y=Triglycerides, fill=BPLevel)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Blood Pressure Level",
y="Median Triglycerides",
title="Median Triglycerides For Each Blood Pressure Level") +
theme(plot.title = element_text(hjust = 0.5))
Exploring Each Continent’s Health and Lifestyle
I wanted to analyze the relationship between features in the dataset and the Continents of each patient:
df2 %>%
group_by(Continent) %>%
mutate(`Sleep Hours Per Day` = mean(`Sleep Hours Per Day`)) %>%
ggplot(aes(x=reorder(Continent,-`Sleep Hours Per Day`), y=`Sleep Hours Per Day`, fill=Continent)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Continent",
y="Average Sleep Per Day",
title="Average Sleep Per Day in Each Continent") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(Continent) %>%
mutate(`Sedentary Hours Per Day` = mean(`Sedentary Hours Per Day`)) %>%
ggplot(aes(x=reorder(Continent,-`Sedentary Hours Per Day`), y=`Sedentary Hours Per Day`, fill=Continent)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Continent",
y="Average Sedentary Hours Per Day",
title="Average Sedentary Hours Per Day in Each Continent") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(Continent) %>%
mutate(`Physical Activity Days Per Week` = mean(`Physical Activity Days Per Week`)) %>%
ggplot(aes(x=reorder(Continent,-`Physical Activity Days Per Week`), y=`Physical Activity Days Per Week`, fill=Continent)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Continent",
y="Average Physical Activity Days",
title="Average Physical Activity Days Per Week in Each Continent") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(Continent) %>%
mutate(`Exercise Hours Per Week` = mean(`Exercise Hours Per Week`)) %>%
ggplot(aes(x=reorder(Continent,-`Exercise Hours Per Week`), y=`Exercise Hours Per Week`, fill=Continent)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Continent",
y="Average Exercise Hours",
title="Average Exercise Hours Per Week in Each Continent") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(Continent) %>%
mutate(`Stress Level` = mean(`Stress Level`)) %>%
ggplot(aes(x=reorder(Continent,-`Stress Level`), y=`Stress Level`, fill=Continent)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Continent",
y="Average Stress Level",
title="Average Stress Level in Each Continent") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(Continent) %>%
mutate(Cholesterol = mean(Cholesterol)) %>%
ggplot(aes(x=reorder(Continent,-Cholesterol), y=Cholesterol, fill=Continent)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Continent",
y="Average Cholesterol",
title="Average Cholesterol in Each Continent") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(Continent) %>%
mutate(BMI = mean(BMI)) %>%
ggplot(aes(x=reorder(Continent,-BMI), y=BMI, fill=Continent)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Continent",
y="Average BMI",
title="Average BMI in Each Continent") +
theme(plot.title = element_text(hjust = 0.5))
df2 %>%
group_by(Continent) %>%
mutate(Triglycerides = mean(Triglycerides)) %>%
ggplot(aes(x=reorder(Continent,-Triglycerides), y=Triglycerides, fill=Continent)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Continent",
y="Average Triglycerides",
title="Average Triglycerides in Each Continent") +
theme(plot.title = element_text(hjust = 0.5))
There are some interesting observations that are noteworthy. Patients from Australia had the highest average of sleep hours per day and sedentary hours per day, while also having the lowest average stress levels and lowest average exercise hours per week. North American patients had the highest BMI and Cholesterol levels, but averaged the second-highest exercise hours per week. Patients from Africa averaged the highest exercise hours per week and the second-highest BMI, sleep hours per day, and exercise hours per week, respectively.
df3 <- df2 %>%
select(BPLevel, `Sleep Hours Per Day`, `Sedentary Hours Per Day`,
`Physical Activity Days Per Week`, BMI, `Stress Level`, `Exercise Hours Per Week`,
Cholesterol, Triglycerides)
p <- ggpairs(df3, lower = list(continuous = wrap("smooth", se=FALSE, alpha = 0.7, size=0.5)))
p[5,3] <- p[5,3] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p[3,5] <- p[3,5] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p
df4 <- df3 %>%
select(-BPLevel)
df_cor <- Hmisc::rcorr(as.matrix(df4))
data.frame(df_cor$r) %>% head() %>% kable() %>% kable_styling()
| Sleep.Hours.Per.Day | Sedentary.Hours.Per.Day | Physical.Activity.Days.Per.Week | BMI | Stress.Level | Exercise.Hours.Per.Week | Cholesterol | Triglycerides | |
|---|---|---|---|---|---|---|---|---|
| Sleep Hours Per Day | 1.0000000 | 0.0047920 | 0.0140334 | -0.0100304 | -0.0142054 | -0.0012453 | 0.0044562 | -0.0292160 |
| Sedentary Hours Per Day | 0.0047920 | 1.0000000 | -0.0061780 | -0.0000236 | -0.0053972 | 0.0087556 | 0.0189145 | -0.0057846 |
| Physical Activity Days Per Week | 0.0140334 | -0.0061780 | 1.0000000 | 0.0081104 | 0.0074046 | 0.0077252 | 0.0160559 | -0.0075564 |
| BMI | -0.0100304 | -0.0000236 | 0.0081104 | 1.0000000 | -0.0032504 | 0.0037769 | 0.0172919 | -0.0059636 |
| Stress Level | -0.0142054 | -0.0053972 | 0.0074046 | -0.0032504 | 1.0000000 | -0.0091024 | -0.0244871 | -0.0039213 |
| Exercise Hours Per Week | -0.0012453 | 0.0087556 | 0.0077252 | 0.0037769 | -0.0091024 | 1.0000000 | 0.0215171 | 0.0017169 |
Among the independent variables chosen, there doesn’t appear to be any multicollinearity, which helps eliminate a possible factor in determining model performance.
The first Decision Tree model will focus on using the newly created
BPLevel column as the dependent variable. I will include
the independent variables explored when comparing each continent:
Sleep Hours Per Day, Sedentary Hours Per Day,
Physical Activity Days Per Week,
Exercise Hours Per Week, Stress Level,
Cholesterol, BMI, and
Triglycerides.
df3 <- fix_col_spaces(df3)
set.seed(1234)
df_tree <- createDataPartition(df3$BPLevel, p=0.75, list=FALSE)
# select 20% of the data for validation
df_train <- df3[df_tree,]
# use the remaining 80% of data to training and testing the models
df_test <- df3[-df_tree,]
round(prop.table(table(select(df3, BPLevel), exclude = NULL)), 4) * 100
## BPLevel
## Hypertension Normal Prehypertension
## 19.23 60.55 20.22
round(prop.table(table(select(df_train, BPLevel), exclude = NULL)), 4) * 100
## BPLevel
## Hypertension Normal Prehypertension
## 19.23 60.55 20.22
round(prop.table(table(select(df_test, BPLevel), exclude = NULL)), 4) * 100
## BPLevel
## Hypertension Normal Prehypertension
## 19.22 60.55 20.23
set.seed(123)
tree <- rpart(BPLevel ~ ., data = df_train, cp=0.00001, method="class")
rpart.plot(tree, fallen.leaves = FALSE)
tree %>%
varImp() %>%
arrange(desc(Overall))
## Overall
## Exercise_Hours_Per_Week 516.4832
## Triglycerides 496.6889
## Sedentary_Hours_Per_Day 492.9993
## Cholesterol 456.8389
## BMI 427.4433
## Stress_Level 250.2686
## Sleep_Hours_Per_Day 240.5037
## Physical_Activity_Days_Per_Week 163.5319
set.seed(123)
predict_tree <- predict(tree,df_test,type = 'class')
set.seed(123)
table(df_test$BPLevel, predict_tree)
## predict_tree
## Hypertension Normal Prehypertension
## Hypertension 47 300 74
## Normal 190 934 202
## Prehypertension 61 315 67
set.seed(123)
confusionMatrix(predict_tree,as.factor(df_test$BPLevel))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Hypertension Normal Prehypertension
## Hypertension 47 190 61
## Normal 300 934 315
## Prehypertension 74 202 67
##
## Overall Statistics
##
## Accuracy : 0.4785
## 95% CI : (0.4574, 0.4997)
## No Information Rate : 0.6055
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0147
##
## Mcnemar's Test P-Value : 5.826e-11
##
## Statistics by Class:
##
## Class: Hypertension Class: Normal Class: Prehypertension
## Sensitivity 0.11164 0.7044 0.15124
## Specificity 0.85811 0.2882 0.84201
## Pos Pred Value 0.15772 0.6030 0.19534
## Neg Pred Value 0.80233 0.3885 0.79643
## Prevalence 0.19224 0.6055 0.20228
## Detection Rate 0.02146 0.4265 0.03059
## Detection Prevalence 0.13607 0.7073 0.15662
## Balanced Accuracy 0.48488 0.4963 0.49663
cm <- confusionMatrix(predict_tree,as.factor(df_test$BPLevel))
cm$byClass
## Sensitivity Specificity Pos Pred Value Neg Pred Value
## Class: Hypertension 0.1116390 0.8581119 0.1577181 0.8023256
## Class: Normal 0.7043741 0.2881944 0.6029697 0.3884555
## Class: Prehypertension 0.1512415 0.8420149 0.1953353 0.7964266
## Precision Recall F1 Prevalence Detection Rate
## Class: Hypertension 0.1577181 0.1116390 0.1307371 0.1922374 0.02146119
## Class: Normal 0.6029697 0.7043741 0.6497391 0.6054795 0.42648402
## Class: Prehypertension 0.1953353 0.1512415 0.1704835 0.2022831 0.03059361
## Detection Prevalence Balanced Accuracy
## Class: Hypertension 0.1360731 0.4848754
## Class: Normal 0.7073059 0.4962843
## Class: Prehypertension 0.1566210 0.4966282
First Decision Tree Variable Importance
tree %>%
varImp() %>%
arrange(desc(Overall))
## Overall
## Exercise_Hours_Per_Week 516.4832
## Triglycerides 496.6889
## Sedentary_Hours_Per_Day 492.9993
## Cholesterol 456.8389
## BMI 427.4433
## Stress_Level 250.2686
## Sleep_Hours_Per_Day 240.5037
## Physical_Activity_Days_Per_Week 163.5319
imp_df <- data.frame(imp = tree$variable.importance)
imp_df2 <- imp_df %>%
tibble::rownames_to_column() %>%
dplyr::rename("variable" = rowname) %>%
dplyr::arrange(imp) %>%
dplyr::mutate(variable = forcats::fct_inorder(variable))
ggplot2::ggplot(imp_df2) +
geom_col(aes(x = variable, y = imp),
col = "black", show.legend = F) +
coord_flip() +
scale_fill_grey() +
theme_bw()
I used the feature importances from the first model to determine the variables to use for the second model. This time, I will use less variables and assess what the results are:
df4 <- df2 %>%
#select(`Heart Attack Risk`, BMI)
select(BPLevel, `Sedentary Hours Per Day`, Cholesterol, Triglycerides, `Exercise Hours Per Week`)
df4 <- fix_col_spaces(df4)
set.seed(1234)
df_tree2 <- createDataPartition(df4$BPLevel, p=0.75, list=FALSE)
#df_tree2 <- createDataPartition(df4$BPLevel, p=0.8, list=FALSE)
df_train2 <- df4[df_tree2,]
df_test2 <- df4[-df_tree2,]
round(prop.table(table(select(df4, BPLevel), exclude = NULL)), 4) * 100
## BPLevel
## Hypertension Normal Prehypertension
## 19.23 60.55 20.22
round(prop.table(table(select(df_train2, BPLevel), exclude = NULL)), 4) * 100
## BPLevel
## Hypertension Normal Prehypertension
## 19.23 60.55 20.22
round(prop.table(table(select(df_test2, BPLevel), exclude = NULL)), 4) * 100
## BPLevel
## Hypertension Normal Prehypertension
## 19.22 60.55 20.23
set.seed(1234)
tree2 <- rpart(BPLevel ~ ., data = df_train2, cp=0.0001, method="class")
rpart.plot(tree2, fallen.leaves=FALSE, box.palette = "BuGn")
set.seed(123)
predict_tree2 <- predict(tree2, df_test2, type = 'class')
table(df_test2$BPLevel, predict_tree2)
## predict_tree2
## Hypertension Normal Prehypertension
## Hypertension 59 292 70
## Normal 195 939 192
## Prehypertension 67 311 65
set.seed(123)
confusionMatrix(predict_tree2,as.factor(df_test2$BPLevel))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Hypertension Normal Prehypertension
## Hypertension 59 195 67
## Normal 292 939 311
## Prehypertension 70 192 65
##
## Overall Statistics
##
## Accuracy : 0.4854
## 95% CI : (0.4643, 0.5066)
## No Information Rate : 0.6055
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0013
##
## Mcnemar's Test P-Value : 2.669e-10
##
## Statistics by Class:
##
## Class: Hypertension Class: Normal Class: Prehypertension
## Sensitivity 0.14014 0.7081 0.14673
## Specificity 0.85189 0.3021 0.85003
## Pos Pred Value 0.18380 0.6089 0.19878
## Neg Pred Value 0.80631 0.4028 0.79710
## Prevalence 0.19224 0.6055 0.20228
## Detection Rate 0.02694 0.4288 0.02968
## Detection Prevalence 0.14658 0.7041 0.14932
## Balanced Accuracy 0.49602 0.5051 0.49838
cm2 <- confusionMatrix(predict_tree2,as.factor(df_test2$BPLevel))
cm2$byClass
## Sensitivity Specificity Pos Pred Value Neg Pred Value
## Class: Hypertension 0.1401425 0.8518937 0.1838006 0.8063135
## Class: Normal 0.7081448 0.3020833 0.6089494 0.4027778
## Class: Prehypertension 0.1467269 0.8500286 0.1987768 0.7971014
## Precision Recall F1 Prevalence Detection Rate
## Class: Hypertension 0.1838006 0.1401425 0.1590296 0.1922374 0.02694064
## Class: Normal 0.6089494 0.7081448 0.6548117 0.6054795 0.42876712
## Class: Prehypertension 0.1987768 0.1467269 0.1688312 0.2022831 0.02968037
## Detection Prevalence Balanced Accuracy
## Class: Hypertension 0.1465753 0.4960181
## Class: Normal 0.7041096 0.5051141
## Class: Prehypertension 0.1493151 0.4983777
Second Decision Tree Variable Importance
tree2 %>%
varImp() %>%
arrange(desc(Overall))
## Overall
## Sedentary_Hours_Per_Day 586.8255
## Triglycerides 564.2332
## Exercise_Hours_Per_Week 532.8059
## Cholesterol 522.9029
imp_df <- data.frame(imp = tree2$variable.importance)
imp_df2 <- imp_df %>%
tibble::rownames_to_column() %>%
dplyr::rename("variable" = rowname) %>%
dplyr::arrange(imp) %>%
dplyr::mutate(variable = forcats::fct_inorder(variable))
ggplot2::ggplot(imp_df2) +
geom_col(aes(x = variable, y = imp),
col = "black", show.legend = F) +
coord_flip() +
scale_fill_grey() +
theme_bw()
df5 <- fix_col_spaces(df4)
df5$BPLevel <- factor(df5$BPLevel)
set.seed(123)
df_rf <- createDataPartition(df5$BPLevel, p = 0.8, list = FALSE)
df_train_rf <- df5[df_rf, ]
df_test_rf <- df5[-df_rf, ]
round(prop.table(table(select(df5, BPLevel), exclude = NULL)), 4) * 100
## BPLevel
## Hypertension Normal Prehypertension
## 19.23 60.55 20.22
round(prop.table(table(select(df_train_rf, BPLevel), exclude = NULL)), 4) * 100
## BPLevel
## Hypertension Normal Prehypertension
## 19.23 60.55 20.23
round(prop.table(table(select(df_test_rf, BPLevel), exclude = NULL)), 4) * 100
## BPLevel
## Hypertension Normal Prehypertension
## 19.24 60.56 20.21
set.seed(123)
rf_model <- randomForest(BPLevel ~ ., data = df_train_rf, importance=TRUE, ntree=500)
rf_model
##
## Call:
## randomForest(formula = BPLevel ~ ., data = df_train_rf, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 41.98%
## Confusion matrix:
## Hypertension Normal Prehypertension class.error
## Hypertension 43 1263 42 0.96810089
## Normal 139 3961 145 0.06690224
## Prehypertension 36 1318 64 0.95486601
set.seed(123)
rf_pred <- predict(rf_model,newdata = df_test_rf)
set.seed(123)
confusionMatrix(rf_pred, df_test_rf$BPLevel)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Hypertension Normal Prehypertension
## Hypertension 3 30 8
## Normal 322 993 336
## Prehypertension 12 38 10
##
## Overall Statistics
##
## Accuracy : 0.5742
## 95% CI : (0.5507, 0.5975)
## No Information Rate : 0.6056
## P-Value [Acc > NIR] : 0.9966
##
## Kappa : -0.0189
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: Hypertension Class: Normal Class: Prehypertension
## Sensitivity 0.008902 0.93591 0.028249
## Specificity 0.973145 0.04776 0.964235
## Pos Pred Value 0.073171 0.60145 0.166667
## Neg Pred Value 0.804793 0.32673 0.796690
## Prevalence 0.192352 0.60559 0.202055
## Detection Rate 0.001712 0.56678 0.005708
## Detection Prevalence 0.023402 0.94235 0.034247
## Balanced Accuracy 0.491023 0.49183 0.496242
cm3 <- confusionMatrix(rf_pred, df_test_rf$BPLevel)
Random Forest Variable Importance
set.seed(123)
rf_model %>%
varImp()
## Hypertension Normal Prehypertension
## Sedentary_Hours_Per_Day 4.117624 1.0014847 2.880540
## Cholesterol 6.415973 -1.7999160 2.604052
## Triglycerides 0.615599 -1.3743438 3.640563
## Exercise_Hours_Per_Week 5.031793 0.1870341 0.946421
df5 <- fix_col_spaces(df4)
df5$BPLevel <- factor(df5$BPLevel)
set.seed(123)
df_svm <- createDataPartition(df5$BPLevel, p = 0.8, list = FALSE)
df_train_svm <- df5[df_svm, ]
df_test_svm <- df5[-df_svm, ]
round(prop.table(table(select(df5, BPLevel), exclude = NULL)), 4) * 100
## BPLevel
## Hypertension Normal Prehypertension
## 19.23 60.55 20.22
round(prop.table(table(select(df_train_svm, BPLevel), exclude = NULL)), 4) * 100
## BPLevel
## Hypertension Normal Prehypertension
## 19.23 60.55 20.23
round(prop.table(table(select(df_test_svm, BPLevel), exclude = NULL)), 4) * 100
## BPLevel
## Hypertension Normal Prehypertension
## 19.24 60.56 20.21
set.seed(123)
svm_model_sigmoid <- svm(BPLevel ~ ., data = df_train_svm, method="C-classification",
kernel="sigmoid", cost = 0.05, scale = TRUE, cross=5, coef0=0.09)
svm_model_linear <- svm(BPLevel ~ ., data = df_train_svm, method="C-classification",
kernel="linear", cost = 0.05, scale = TRUE, cross=5)
summary(svm_model_sigmoid)
##
## Call:
## svm(formula = BPLevel ~ ., data = df_train_svm, method = "C-classification",
## kernel = "sigmoid", cost = 0.05, cross = 5, coef0 = 0.09, scale = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 0.05
## coef.0: 0.09
##
## Number of Support Vectors: 5133
##
## ( 1348 2386 1399 )
##
##
## Number of Classes: 3
##
## Levels:
## Hypertension Normal Prehypertension
##
## 5-fold cross-validation on training data:
##
## Total Accuracy: 56.36856
## Single Accuracies:
## 53.78031 57.4893 59.20114 55.27817 56.09408
summary(svm_model_linear)
##
## Call:
## svm(formula = BPLevel ~ ., data = df_train_svm, method = "C-classification",
## kernel = "linear", cost = 0.05, cross = 5, scale = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.05
##
## Number of Support Vectors: 5146
##
## ( 1348 2380 1418 )
##
##
## Number of Classes: 3
##
## Levels:
## Hypertension Normal Prehypertension
##
## 5-fold cross-validation on training data:
##
## Total Accuracy: 60.54771
## Single Accuracies:
## 60.699 60.12839 60.699 61.26961 59.94298
set.seed(123)
svm_pred_sig <- predict(svm_model_sigmoid,newdata = df_test_svm)
set.seed(123)
svm_pred_lin <- predict(svm_model_linear,newdata = df_test_svm)
set.seed(123)
confusionMatrix(svm_pred_sig, df_test_svm$BPLevel)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Hypertension Normal Prehypertension
## Hypertension 34 109 35
## Normal 284 837 283
## Prehypertension 19 115 36
##
## Overall Statistics
##
## Accuracy : 0.5177
## 95% CI : (0.494, 0.5413)
## No Information Rate : 0.6056
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0142
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: Hypertension Class: Normal Class: Prehypertension
## Sensitivity 0.10089 0.7889 0.10169
## Specificity 0.89823 0.1795 0.90415
## Pos Pred Value 0.19101 0.5962 0.21176
## Neg Pred Value 0.80750 0.3563 0.79899
## Prevalence 0.19235 0.6056 0.20205
## Detection Rate 0.01941 0.4777 0.02055
## Detection Prevalence 0.10160 0.8014 0.09703
## Balanced Accuracy 0.49956 0.4842 0.50292
cm4 <- confusionMatrix(svm_pred_sig, df_test_svm$BPLevel)
set.seed(123)
confusionMatrix(svm_pred_lin, df_test_svm$BPLevel)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Hypertension Normal Prehypertension
## Hypertension 0 0 0
## Normal 337 1061 354
## Prehypertension 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6056
## 95% CI : (0.5823, 0.6286)
## No Information Rate : 0.6056
## P-Value [Acc > NIR] : 0.5104
##
## Kappa : 0
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Hypertension Class: Normal Class: Prehypertension
## Sensitivity 0.0000 1.0000 0.0000
## Specificity 1.0000 0.0000 1.0000
## Pos Pred Value NaN 0.6056 NaN
## Neg Pred Value 0.8076 NaN 0.7979
## Prevalence 0.1924 0.6056 0.2021
## Detection Rate 0.0000 0.6056 0.0000
## Detection Prevalence 0.0000 1.0000 0.0000
## Balanced Accuracy 0.5000 0.5000 0.5000
cm5 <- confusionMatrix(svm_pred_lin, df_test_svm$BPLevel)
cm$byClass
## Sensitivity Specificity Pos Pred Value Neg Pred Value
## Class: Hypertension 0.1116390 0.8581119 0.1577181 0.8023256
## Class: Normal 0.7043741 0.2881944 0.6029697 0.3884555
## Class: Prehypertension 0.1512415 0.8420149 0.1953353 0.7964266
## Precision Recall F1 Prevalence Detection Rate
## Class: Hypertension 0.1577181 0.1116390 0.1307371 0.1922374 0.02146119
## Class: Normal 0.6029697 0.7043741 0.6497391 0.6054795 0.42648402
## Class: Prehypertension 0.1953353 0.1512415 0.1704835 0.2022831 0.03059361
## Detection Prevalence Balanced Accuracy
## Class: Hypertension 0.1360731 0.4848754
## Class: Normal 0.7073059 0.4962843
## Class: Prehypertension 0.1566210 0.4966282
set.seed(123)
cm_d <- as.data.frame(cm$table)
# confusion matrix statistics as data.frame
cm_st <-data.frame(cm$overall)
# round the values
cm_st$cm.overall <- round(cm_st$cm.overall,2)
# here we also have the rounded percentage values
cm_p <- as.data.frame(prop.table(cm$table))
cm_d$Perc <- round(cm_p$Freq*100,2)
set.seed(123)
# plotting the matrix
cm_d_p <- ggplot(data = cm_d, aes(x = Prediction , y = Reference, fill = Freq))+
geom_tile() +
geom_text(aes(label = paste("",Freq,",",Perc,"%")), color = 'white', size = 2.6) +
theme_light() +
guides(fill=FALSE)
# plotting the stats
cm_st_p <- tableGrob(cm_st)
# all together
grid.arrange(cm_d_p, cm_st_p,nrow = 1, ncol = 2,
top=textGrob("Decision Tree 1 Confusion Matrix",gp=gpar(fontsize=25,font=1)))
cm2$byClass
## Sensitivity Specificity Pos Pred Value Neg Pred Value
## Class: Hypertension 0.1401425 0.8518937 0.1838006 0.8063135
## Class: Normal 0.7081448 0.3020833 0.6089494 0.4027778
## Class: Prehypertension 0.1467269 0.8500286 0.1987768 0.7971014
## Precision Recall F1 Prevalence Detection Rate
## Class: Hypertension 0.1838006 0.1401425 0.1590296 0.1922374 0.02694064
## Class: Normal 0.6089494 0.7081448 0.6548117 0.6054795 0.42876712
## Class: Prehypertension 0.1987768 0.1467269 0.1688312 0.2022831 0.02968037
## Detection Prevalence Balanced Accuracy
## Class: Hypertension 0.1465753 0.4960181
## Class: Normal 0.7041096 0.5051141
## Class: Prehypertension 0.1493151 0.4983777
set.seed(123)
cm_d <- as.data.frame(cm2$table)
# confusion matrix statistics as data.frame
cm_st <-data.frame(cm2$overall)
# round the values
cm_st$cm2.overall <- round(cm_st$cm2.overall,2)
# here we also have the rounded percentage values
cm_p <- as.data.frame(prop.table(cm2$table))
cm_d$Perc <- round(cm_p$Freq*100,2)
set.seed(123)
# plotting the matrix
cm_d_p <- ggplot(data = cm_d, aes(x = Prediction , y = Reference, fill = Freq))+
geom_tile() +
geom_text(aes(label = paste("",Freq,",",Perc,"%")), color = 'white', size = 2.6) +
theme_light() +
guides(fill=FALSE)
# plotting the stats
cm_st_p <- tableGrob(cm_st)
# all together
grid.arrange(cm_d_p, cm_st_p,nrow = 1, ncol = 2,
top=textGrob("Decision Tree 2 Confusion Matrix",gp=gpar(fontsize=25,font=1)))
cm3$byClass
## Sensitivity Specificity Pos Pred Value Neg Pred Value
## Class: Hypertension 0.008902077 0.97314488 0.07317073 0.8047925
## Class: Normal 0.935909519 0.04775687 0.60145366 0.3267327
## Class: Prehypertension 0.028248588 0.96423462 0.16666667 0.7966903
## Precision Recall F1 Prevalence
## Class: Hypertension 0.07317073 0.008902077 0.01587302 0.1923516
## Class: Normal 0.60145366 0.935909519 0.73230088 0.6055936
## Class: Prehypertension 0.16666667 0.028248588 0.04830918 0.2020548
## Detection Rate Detection Prevalence Balanced Accuracy
## Class: Hypertension 0.001712329 0.02340183 0.4910235
## Class: Normal 0.566780822 0.94235160 0.4918332
## Class: Prehypertension 0.005707763 0.03424658 0.4962416
set.seed(123)
cm_d <- as.data.frame(cm3$table)
# confusion matrix statistics as data.frame
cm_st <-data.frame(cm3$overall)
# round the values
cm_st$cm3.overall <- round(cm_st$cm3.overall,2)
# here we also have the rounded percentage values
cm_p <- as.data.frame(prop.table(cm3$table))
cm_d$Perc <- round(cm_p$Freq*100,2)
set.seed(123)
# plotting the matrix
cm_d_p <- ggplot(data = cm_d, aes(x = Prediction , y = Reference, fill = Freq))+
geom_tile() +
geom_text(aes(label = paste("",Freq,",",Perc,"%")), color = 'white', size = 2.6) +
theme_light() +
guides(fill=FALSE)
# plotting the stats
cm_st_p <- tableGrob(cm_st)
# all together
grid.arrange(cm_d_p, cm_st_p,nrow = 1, ncol = 2,
top=textGrob("Random Forest Model Confusion Matrix",gp=gpar(fontsize=25,font=1)))
cm4$byClass
## Sensitivity Specificity Pos Pred Value Neg Pred Value
## Class: Hypertension 0.1008902 0.8982332 0.1910112 0.8074968
## Class: Normal 0.7888784 0.1794501 0.5961538 0.3563218
## Class: Prehypertension 0.1016949 0.9041488 0.2117647 0.7989886
## Precision Recall F1 Prevalence Detection Rate
## Class: Hypertension 0.1910112 0.1008902 0.1320388 0.1923516 0.01940639
## Class: Normal 0.5961538 0.7888784 0.6791075 0.6055936 0.47773973
## Class: Prehypertension 0.2117647 0.1016949 0.1374046 0.2020548 0.02054795
## Detection Prevalence Balanced Accuracy
## Class: Hypertension 0.10159817 0.4995617
## Class: Normal 0.80136986 0.4841642
## Class: Prehypertension 0.09703196 0.5029218
set.seed(123)
cm_d <- as.data.frame(cm4$table)
# confusion matrix statistics as data.frame
cm_st <-data.frame(cm4$overall)
# round the values
cm_st$cm4.overall <- round(cm_st$cm4.overall,2)
# here we also have the rounded percentage values
cm_p <- as.data.frame(prop.table(cm4$table))
cm_d$Perc <- round(cm_p$Freq*100,2)
set.seed(123)
# plotting the matrix
cm_d_p <- ggplot(data = cm_d, aes(x = Prediction , y = Reference, fill = Freq))+
geom_tile() +
geom_text(aes(label = paste("",Freq,",",Perc,"%")), color = 'white', size = 2.6) +
theme_light() +
guides(fill=FALSE)
# plotting the stats
cm_st_p <- tableGrob(cm_st)
# all together
grid.arrange(cm_d_p, cm_st_p,nrow = 1, ncol = 2,
top=textGrob("SVM Model (Sigmoid) Confusion Matrix",gp=gpar(fontsize=25,font=1)))
cm5$byClass
## Sensitivity Specificity Pos Pred Value Neg Pred Value
## Class: Hypertension 0 1 NaN 0.8076484
## Class: Normal 1 0 0.6055936 NaN
## Class: Prehypertension 0 1 NaN 0.7979452
## Precision Recall F1 Prevalence Detection Rate
## Class: Hypertension NA 0 NA 0.1923516 0.0000000
## Class: Normal 0.6055936 1 0.7543548 0.6055936 0.6055936
## Class: Prehypertension NA 0 NA 0.2020548 0.0000000
## Detection Prevalence Balanced Accuracy
## Class: Hypertension 0 0.5
## Class: Normal 1 0.5
## Class: Prehypertension 0 0.5
set.seed(123)
cm_d <- as.data.frame(cm5$table)
# confusion matrix statistics as data.frame
cm_st <-data.frame(cm5$overall)
# round the values
cm_st$cm5.overall <- round(cm_st$cm5.overall,2)
# here we also have the rounded percentage values
cm_p <- as.data.frame(prop.table(cm5$table))
cm_d$Perc <- round(cm_p$Freq*100,2)
set.seed(123)
# plotting the matrix
cm_d_p <- ggplot(data = cm_d, aes(x = Prediction , y = Reference, fill = Freq))+
geom_tile() +
geom_text(aes(label = paste("",Freq,",",Perc,"%")), color = 'white', size = 2.6) +
theme_light() +
guides(fill=FALSE)
# plotting the stats
cm_st_p <- tableGrob(cm_st)
# all together
grid.arrange(cm_d_p, cm_st_p,nrow = 1, ncol = 2,
top=textGrob("SVM Model (Linear) Confusion Matrix",gp=gpar(fontsize=25,font=1)))
Precision
set.seed(123)
precision_results <- data.frame("Decision Tree 1: Precision" = cm[["byClass"]][ , "Precision"],
"Decision Tree 2: Precision" = cm2[["byClass"]][ , "Precision"],
"Random Forest: Precision" = cm3[["byClass"]][ , "Precision"],
"SVM (Sigmoid): Precision" = cm4[["byClass"]][ , "Precision"],
"SVM (Linear): Precision" = cm5[["byClass"]][ , "Precision"], check.names = FALSE)
precision_results %>%
kbl(align = 'c') %>%
kable_styling()
| Decision Tree 1: Precision | Decision Tree 2: Precision | Random Forest: Precision | SVM (Sigmoid): Precision | SVM (Linear): Precision | |
|---|---|---|---|---|---|
| Class: Hypertension | 0.1577181 | 0.1838006 | 0.0731707 | 0.1910112 | NA |
| Class: Normal | 0.6029697 | 0.6089494 | 0.6014537 | 0.5961538 | 0.6055936 |
| Class: Prehypertension | 0.1953353 | 0.1987768 | 0.1666667 | 0.2117647 | NA |
Recall
set.seed(123)
recall_results <- data.frame("Decision Tree 1: Recall" = cm[["byClass"]][ , "Recall"],
"Decision Tree 2: Recall" = cm2[["byClass"]][ , "Recall"],
"Random Forest: Recall" = cm3[["byClass"]][ , "Recall"],
"SVM (Sigmoid): Recall" = cm4[["byClass"]][ , "Recall"],
"SVM (Linear): Recall" = cm5[["byClass"]][ , "Recall"], check.names = FALSE)
recall_results %>%
kbl(align = 'c') %>%
kable_styling()
| Decision Tree 1: Recall | Decision Tree 2: Recall | Random Forest: Recall | SVM (Sigmoid): Recall | SVM (Linear): Recall | |
|---|---|---|---|---|---|
| Class: Hypertension | 0.1116390 | 0.1401425 | 0.0089021 | 0.1008902 | 0 |
| Class: Normal | 0.7043741 | 0.7081448 | 0.9359095 | 0.7888784 | 1 |
| Class: Prehypertension | 0.1512415 | 0.1467269 | 0.0282486 | 0.1016949 | 0 |
F1 Score
set.seed(123)
f1_results <- data.frame("Decision Tree 1: F1 Score" = cm[["byClass"]][ , "F1"],
"Decision Tree 2: F1 Score" = cm2[["byClass"]][ , "F1"],
"Random Forest: F1 Score" = cm3[["byClass"]][ , "F1"],
"SVM (Sigmoid): F1 Score" = cm4[["byClass"]][ , "F1"],
"SVM (Linear): F1 Score" = cm5[["byClass"]][ , "F1"],check.names = FALSE)
f1_results %>%
kbl(align = 'c') %>%
kable_styling()
| Decision Tree 1: F1 Score | Decision Tree 2: F1 Score | Random Forest: F1 Score | SVM (Sigmoid): F1 Score | SVM (Linear): F1 Score | |
|---|---|---|---|---|---|
| Class: Hypertension | 0.1307371 | 0.1590296 | 0.0158730 | 0.1320388 | NA |
| Class: Normal | 0.6497391 | 0.6548117 | 0.7323009 | 0.6791075 | 0.7543548 |
| Class: Prehypertension | 0.1704835 | 0.1688312 | 0.0483092 | 0.1374046 | NA |
Accuracy
accuracy_results <- data.frame("Decision Tree 1: Accuracy" = cm[["byClass"]][ , "Balanced Accuracy"],
"Decision Tree 2: Accuracy" = cm2[["byClass"]][ , "Balanced Accuracy"],
"Random Forest: Accuracy" = cm3[["byClass"]][ , "Balanced Accuracy"],
"SVM (Sigmoid): Accuracy" = cm4[["byClass"]][ , "Balanced Accuracy"],
"SVM (Linear): Accuracy" = cm5[["byClass"]][ , "Balanced Accuracy"], check.names = FALSE)
accuracy_results %>%
kbl(align = 'c') %>%
kable_styling()
| Decision Tree 1: Accuracy | Decision Tree 2: Accuracy | Random Forest: Accuracy | SVM (Sigmoid): Accuracy | SVM (Linear): Accuracy | |
|---|---|---|---|---|---|
| Class: Hypertension | 0.4848754 | 0.4960181 | 0.4910235 | 0.4995617 | 0.5 |
| Class: Normal | 0.4962843 | 0.5051141 | 0.4918332 | 0.4841642 | 0.5 |
| Class: Prehypertension | 0.4966282 | 0.4983777 | 0.4962416 | 0.5029218 | 0.5 |
The dataset I used explored Heart Attack Risk. I wanted to create a
multiclass classification model focused on Blood Pressure. I used the
values provided in the Blood Pressure column to categorize
patients into 3 classes: Normal, Prehypertension, and Hypertension,
creating aBPLevel column. The first decision tree model had
8 features used to predict BPLevel. The independent
variables were all continuous variables, which was purposely done to
limit the complications of using a categorical variable with the
dependent categorical variable. The visualization of both Decision Tree
models did not render clearly, which is a danger of decision trees in
terms of producing a complex diagram, making it hard to interpret. When
assessing the accuracy of each class for each model, they were
relatively unchanged, ranging from 0.487 to 0.504. However, the overall
accuracy of each model varied, particularly between the two Decision
Tree models and the Random Forest model. The first Decision Tree model
produced an overall accuracy of 0.4789, improving slightly with less
features in the second Decision Tree model to 0.4897. Using the same
number of features as the second Decision Tree model, the Random Forest
model performed better, with an overall accuracy of 0.5799.
In a scenario where we want to predict when a patient has pre-hypertension or hypertension, recall would be the metric we would want to use to identify the patients that have hypertension or prehypertension, since this metric is measured by true positives divided by true positives plus false negatives. If a patient actually has high or near-high blood pressure, we want the models to accurately predict those cases rather than predict that the patient has normal blood pressure when their blood pressure is actually not normal. Overall, each of the models don’t perform well when predicting having high or near-high blood pressure. Among the models, Random Forest performed the worst, having a recall value for hypertension of 0.0089021, or 0.89%, and a recall value for prehypertension of 0.0254237, or around 2.54%. The second Decision Tree performed the best when recalling patients with hypertension at around 14% and the first Decision Tree recalling hypertension patients at 11%. Conversely, the first Decision performed performed best when recalling prehypertension patients with a value of around 15.1%, which was slightly better than the second Decision Tree which had a recall value of 14.7%. The reduced number of variables in the second Decision Tree may have impacted the recall performance of patients with hypertension. While Random Forest performed the best when recalling patients with normal blood pressure, recalling patients in this category at a 93.6% rate, its poor performance in recalling patients with prehypertension and hypertension may indicate that the multiclass nature of the dependent variable may be too complex for the model to accurately predict patients in the prehypertension and hypertension categories, despite the low number of independent variables in the model.
I used two different types of Support Vector Machine models to
compare with the performance of the Decision Tree and Random Forest
models. The SVM models had different kernels, and they produced
different results. The first SVM model used sigmoid as a
kernel with the cost function set at 0.05. I tried multiple values of
the cost function, and most of the values produced an overall accuracy
under 50%. With the cost function set at 0.05, the overall accuracy of
the SVM Sigmoid model was 52%, which was slightly better than the
Decision Tree models but worse than the Random Forest model. The second
SVM model used linear as a kernel with the same cost
function of 0.05. In terms of overall accuracy, the model performed the
best among all models at 61%. Because of the dataset being used and
structure of the multiclass model I’ve made, accuracy may not be the
most useful performance metric to evaluate the models. Instead,
precision, recall and f1-score of the models would be more effective
measures, with recall being the most preferred metric because of the
reasons described earlier. When comparing the precision, recall, and
f1-score of each model, the SVM Sigmoid model performed the best with
precision, predicting prehypertension at 21% and hypertension at 19%. In
terms of recall, SVM Sigmoid performed much better than Random Forest
but worse than both Decision Tree models. The SVM Linear model did not
produce a precision or recall value, which renders the model not useful
for the purposes of this analysis.