Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used. Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results.
install.packages("RCurl", repos = "http://cran.us.r-project.org")
install.packages("rpart", repos = "http://cran.us.r-project.org")
install.packages("devtools", repos = "http://cran.us.r-project.org")
install.packages("dplyr", repos = "http://cran.us.r-project.org")
install.packages("randomForest", repos = "http://cran.us.r-project.org")
library(devtools)
library(RCurl)
library(tidyverse)
library(dplyr)
library(rpart)
library(rpart.plot)
library(randomForest)
heart <- read.csv("https://raw.githubusercontent.com/JennierJ/CUNY_DATA_622/main/heart.csv", header = TRUE)
head(heart)
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1 N 0.0 Up 0
## 2 N 1.0 Flat 1
## 3 N 0.0 Up 0
## 4 Y 1.5 Flat 1
## 5 N 0.0 Up 0
## 6 N 0.0 Up 0
summary(heart)
## Age Sex ChestPainType RestingBP
## Min. :28.00 Length:918 Length:918 Min. : 0.0
## 1st Qu.:47.00 Class :character Class :character 1st Qu.:120.0
## Median :54.00 Mode :character Mode :character Median :130.0
## Mean :53.51 Mean :132.4
## 3rd Qu.:60.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## Cholesterol FastingBS RestingECG MaxHR
## Min. : 0.0 Min. :0.0000 Length:918 Min. : 60.0
## 1st Qu.:173.2 1st Qu.:0.0000 Class :character 1st Qu.:120.0
## Median :223.0 Median :0.0000 Mode :character Median :138.0
## Mean :198.8 Mean :0.2331 Mean :136.8
## 3rd Qu.:267.0 3rd Qu.:0.0000 3rd Qu.:156.0
## Max. :603.0 Max. :1.0000 Max. :202.0
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## Length:918 Min. :-2.6000 Length:918 Min. :0.0000
## Class :character 1st Qu.: 0.0000 Class :character 1st Qu.:0.0000
## Mode :character Median : 0.6000 Mode :character Median :1.0000
## Mean : 0.8874 Mean :0.5534
## 3rd Qu.: 1.5000 3rd Qu.:1.0000
## Max. : 6.2000 Max. :1.0000
glimpse(heart)
## Rows: 918
## Columns: 12
## $ Age <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,~
## $ Sex <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", ~
## $ ChestPainType <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",~
## $ RestingBP <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, ~
## $ Cholesterol <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, ~
## $ FastingBS <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ RestingECG <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",~
## $ MaxHR <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9~
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", ~
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, ~
## $ ST_Slope <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl~
## $ HeartDisease <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1~
The data set I am using is the Heart Failure Prediction data set from Kaggle. It contains 11 clinical features for predicting heart disease events. HeartDisease is the outcome I am interested in. To make the decision tree easier to read, I am recoding the value 1 in HeartDisease as "HeartDisease" and 0 as "Healthy".
heart <- heart %>%
  mutate(HeartDisease = ifelse(HeartDisease == 1, "HeartDisease", "Healthy"))
The original data is partitioned into training and test subsets by a ratio of 75:25, respectively.
set.seed(1234)
sample_set <- sample(nrow(heart), round(nrow(heart)* .75), replace = FALSE)
heart_train <- heart[sample_set,]
heart_test <- heart[-sample_set,]
round(prop.table(table(select(heart, HeartDisease), exclude = NULL)), 4) * 100
##
## Healthy HeartDisease
## 44.66 55.34
round(prop.table(table(select(heart_train, HeartDisease), exclude = NULL)), 4) * 100
##
## Healthy HeartDisease
## 43.9 56.1
round(prop.table(table(select(heart_test, HeartDisease), exclude = NULL)), 4) * 100
##
## Healthy HeartDisease
## 46.96 53.04
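The 75:25 partition above can be wrapped in a small helper so the same split logic is reusable for other data frames. This is a minimal sketch, not part of the original analysis: the function name `split_train_test` and the toy data frame are hypothetical, and only base R is assumed.

```r
# Hypothetical helper (not in the original analysis): split a data frame
# into training and test subsets by a given proportion.
split_train_test <- function(df, prop = 0.75, seed = 1234) {
  set.seed(seed)
  # Draw row indices for the training subset without replacement
  train_idx <- sample(nrow(df), round(nrow(df) * prop), replace = FALSE)
  list(
    train = df[train_idx, , drop = FALSE],
    test  = df[-train_idx, , drop = FALSE]
  )
}

# Toy data standing in for the heart data set
toy <- data.frame(id = 1:20, y = rep(c("Healthy", "HeartDisease"), 10))
parts <- split_train_test(toy)

nrow(parts$train)  # 15 rows (75% of 20)
nrow(parts$test)   # 5 rows
```

Because the seed is set inside the helper, calling it twice with the same arguments returns the same partition.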
heart_mod <- rpart(
  HeartDisease ~ .,
  method = "class",
  data = heart_train
)
rpart.plot(heart_mod)
Looking at the structure of the tree, of the 11 feature variables in the dataset, the model uses only ST_Slope (the slope of the peak exercise ST segment), Cholesterol, ChestPainType, Age, and Sex.
heart_pred <- predict(heart_mod, heart_test, type = "class")
heart_pred_table <- table(heart_test$HeartDisease, heart_pred)
heart_pred_table
## heart_pred
## Healthy HeartDisease
## Healthy 91 17
## HeartDisease 26 96
sum(diag(heart_pred_table))/nrow(heart_test)
## [1] 0.8130435
The predictive accuracy of the model on the test set is about 81%.
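Overall accuracy hides how the errors split between the two classes. As a rough sketch, the confusion matrix printed above can be re-entered by hand to compute sensitivity and specificity, treating HeartDisease as the positive class (an assumption made here for illustration). The object name `cm` is hypothetical, and only base R is used.

```r
# Confusion matrix copied from the output above
# (rows = actual class, columns = predicted class)
cm <- matrix(
  c(91, 17,
    26, 96),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    actual    = c("Healthy", "HeartDisease"),
    predicted = c("Healthy", "HeartDisease")
  )
)

# Share of all test cases classified correctly
accuracy <- sum(diag(cm)) / sum(cm)
# Share of actual HeartDisease cases the tree catches
sensitivity <- cm["HeartDisease", "HeartDisease"] / sum(cm["HeartDisease", ])
# Share of actual Healthy cases the tree labels as Healthy
specificity <- cm["Healthy", "Healthy"] / sum(cm["Healthy", ])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
# accuracy 0.813, sensitivity 0.787, specificity 0.843
```

On this test set the tree misses 26 of the 122 patients with heart disease, which may matter more than the 17 false alarms in a screening setting.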
First, I limit the predictors to Age, Sex, and Cholesterol to see how a smaller tree compares.
heart1 <- heart %>%
  select(
    Age,
    Sex,
    Cholesterol,
    HeartDisease
  )
set.seed(1234)
sample_set1 <- sample(nrow(heart1), round(nrow(heart1) * .75), replace = FALSE)
heart1_train <- heart1[sample_set1,]
heart1_test <- heart1[-sample_set1,]
round(prop.table(table(select(heart1, HeartDisease), exclude = NULL)), 4) * 100
##
## Healthy HeartDisease
## 44.66 55.34
round(prop.table(table(select(heart1_train, HeartDisease), exclude = NULL)), 4) * 100
##
## Healthy HeartDisease
## 43.9 56.1
round(prop.table(table(select(heart1_test, HeartDisease), exclude = NULL)), 4) * 100
##
## Healthy HeartDisease
## 46.96 53.04
heart_mod1 <- rpart(
  HeartDisease ~ .,
  method = "class",
  data = heart1_train
)
rpart.plot(heart_mod1)
heart1_pred <- predict(heart_mod1, heart1_test, type = "class")
heart1_pred_table <- table(heart1_test$HeartDisease, heart1_pred)
heart1_pred_table
## heart1_pred
## Healthy HeartDisease
## Healthy 59 49
## HeartDisease 22 100
sum(diag(heart1_pred_table))/nrow(heart1_test)
## [1] 0.6913043
The predictive accuracy is 69% when only considering Sex, Age and Cholesterol as predictors.
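To put the two accuracies in context, it can help to compare both trees against a naive baseline that always predicts the more common class in the test set. The sketch below re-enters the two confusion matrices printed above. The names `cm_full`, `cm_reduced`, and `accuracy_from_counts` are hypothetical, and the baseline is an illustration added here rather than something computed in the original analysis.

```r
# Hypothetical helper: accuracy from a confusion matrix of counts
accuracy_from_counts <- function(cm) sum(diag(cm)) / sum(cm)

labels <- c("Healthy", "HeartDisease")

# Tree using all predictors (rows = actual, columns = predicted)
cm_full <- matrix(c(91, 17, 26, 96), nrow = 2, byrow = TRUE,
                  dimnames = list(actual = labels, predicted = labels))

# Tree using only Age, Sex, and Cholesterol
cm_reduced <- matrix(c(59, 49, 22, 100), nrow = 2, byrow = TRUE,
                     dimnames = list(actual = labels, predicted = labels))

# Baseline: always predict the majority class of the test set
baseline <- max(rowSums(cm_full)) / sum(cm_full)

round(c(
  baseline     = baseline,
  full_tree    = accuracy_from_counts(cm_full),
  reduced_tree = accuracy_from_counts(cm_reduced)
), 3)
# baseline 0.530, full_tree 0.813, reduced_tree 0.691
```

Both trees beat the baseline, but the reduced tree mislabels 49 of the 108 healthy patients as having heart disease, which accounts for most of its lost accuracy.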
set.seed(1234)
dim(heart)
## [1] 918 12
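heart$HeartDisease <- as.factor(heart$HeartDisease)
# Fit the forest on a random subset of 500 rows; the remaining rows are not used for training
train <- sample(1:nrow(heart), 500)
rf.heart <- randomForest(HeartDisease ~ ., data = heart, subset = train)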
rf.heart
##
## Call:
## randomForest(formula = HeartDisease ~ ., data = heart, subset = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 14.4%
## Confusion matrix:
## Healthy HeartDisease class.error
## Healthy 185 42 0.1850220
## HeartDisease 30 243 0.1098901
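Because HeartDisease is a factor, the output reports a classification forest, and its OOB error rate and per-class errors follow directly from the confusion matrix it prints. The sketch below reproduces that arithmetic from the counts shown above. The object names are hypothetical, and only base R is used.

```r
# OOB confusion matrix copied from the randomForest output above
# (rows = actual class, columns = predicted class)
oob <- matrix(
  c(185,  42,
     30, 243),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    actual    = c("Healthy", "HeartDisease"),
    predicted = c("Healthy", "HeartDisease")
  )
)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob) / rowSums(oob)
# Overall OOB error: share of all 500 training rows misclassified out of bag
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 7)  # Healthy 0.1850220, HeartDisease 0.1098901
oob_error              # 0.144, i.e. the 14.4% reported above
```

The OOB estimate corresponds to roughly 85.6% accuracy, which is higher than either single tree achieved on its test set, although the two figures come from different evaluation procedures and are not strictly comparable.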
As the article notes, decision trees have good, bad, and ugly sides. I built these trees to find out which features are good predictors of heart disease. The first decision tree tells us the most important predictors are ST_Slope, cholesterol level, chest pain type, maximum heart rate, resting blood pressure, sex, and fasting blood sugar. The tree structure lays out the alternatives clearly and logically, showing at each node the percentage of patients with heart disease and the percentage who are healthy.