HOMEWORK #2

Based on the latest topics presented, bring a dataset of your choice and create a decision tree with which you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used. Switch variables to generate two decision trees and compare the results. Create a random forest for regression and analyze the results.

Install packages

install.packages("RCurl", repos = "http://cran.us.r-project.org")
install.packages("rpart", repos = "http://cran.us.r-project.org")
install.packages("devtools", repos = "http://cran.us.r-project.org")
install.packages("dplyr", repos = "http://cran.us.r-project.org")
install.packages("randomForest", repos = "http://cran.us.r-project.org")
library(devtools)
library(RCurl)
library(tidyverse)
library(dplyr)
library(rpart)
library(rpart.plot)
library(randomForest)

Importing the Data

heart <- read.csv("https://raw.githubusercontent.com/JennierJ/CUNY_DATA_622/main/heart.csv", header = TRUE)
head(heart)
##   Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1  40   M           ATA       140         289         0     Normal   172
## 2  49   F           NAP       160         180         0     Normal   156
## 3  37   M           ATA       130         283         0         ST    98
## 4  48   F           ASY       138         214         0     Normal   108
## 5  54   M           NAP       150         195         0     Normal   122
## 6  39   M           NAP       120         339         0     Normal   170
##   ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1              N     0.0       Up            0
## 2              N     1.0     Flat            1
## 3              N     0.0       Up            0
## 4              Y     1.5     Flat            1
## 5              N     0.0       Up            0
## 6              N     0.0       Up            0
summary(heart)
##       Age            Sex            ChestPainType        RestingBP    
##  Min.   :28.00   Length:918         Length:918         Min.   :  0.0  
##  1st Qu.:47.00   Class :character   Class :character   1st Qu.:120.0  
##  Median :54.00   Mode  :character   Mode  :character   Median :130.0  
##  Mean   :53.51                                         Mean   :132.4  
##  3rd Qu.:60.00                                         3rd Qu.:140.0  
##  Max.   :77.00                                         Max.   :200.0  
##   Cholesterol      FastingBS       RestingECG            MaxHR      
##  Min.   :  0.0   Min.   :0.0000   Length:918         Min.   : 60.0  
##  1st Qu.:173.2   1st Qu.:0.0000   Class :character   1st Qu.:120.0  
##  Median :223.0   Median :0.0000   Mode  :character   Median :138.0  
##  Mean   :198.8   Mean   :0.2331                      Mean   :136.8  
##  3rd Qu.:267.0   3rd Qu.:0.0000                      3rd Qu.:156.0  
##  Max.   :603.0   Max.   :1.0000                      Max.   :202.0  
##  ExerciseAngina        Oldpeak          ST_Slope          HeartDisease   
##  Length:918         Min.   :-2.6000   Length:918         Min.   :0.0000  
##  Class :character   1st Qu.: 0.0000   Class :character   1st Qu.:0.0000  
##  Mode  :character   Median : 0.6000   Mode  :character   Median :1.0000  
##                     Mean   : 0.8874                      Mean   :0.5534  
##                     3rd Qu.: 1.5000                      3rd Qu.:1.0000  
##                     Max.   : 6.2000                      Max.   :1.0000
glimpse(heart)
## Rows: 918
## Columns: 12
## $ Age            <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,~
## $ Sex            <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", ~
## $ ChestPainType  <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",~
## $ RestingBP      <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, ~
## $ Cholesterol    <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, ~
## $ FastingBS      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ RestingECG     <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",~
## $ MaxHR          <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9~
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", ~
## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, ~
## $ ST_Slope       <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl~
## $ HeartDisease   <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1~

The dataset I am using is the Heart Failure Prediction dataset from Kaggle. It contains 11 clinical features for predicting heart disease events. HeartDisease is the outcome I am interested in. To make the decision tree easier to read, I relabel 1 in HeartDisease as "HeartDisease" and 0 as "Healthy".

heart <- heart %>%
  mutate(HeartDisease = ifelse(HeartDisease == 1, "HeartDisease", "Healthy"))

Split data into training and test sets.

The original data is partitioned into training and test subsets in a 75:25 ratio.

set.seed(1234)
sample_set <- sample(nrow(heart), round(nrow(heart)* .75), replace = FALSE)
heart_train <- heart[sample_set,]
heart_test <- heart[-sample_set,]

round(prop.table(table(select(heart, HeartDisease), exclude = NULL)), 4) * 100
## 
##      Healthy HeartDisease 
##        44.66        55.34
round(prop.table(table(select(heart_train, HeartDisease), exclude = NULL)), 4) * 100
## 
##      Healthy HeartDisease 
##         43.9         56.1
round(prop.table(table(select(heart_test, HeartDisease), exclude = NULL)), 4) * 100
## 
##      Healthy HeartDisease 
##        46.96        53.04

Building the Model

heart_mod <-
  rpart(
    HeartDisease ~ .,
    method = "class",
    data = heart_train
  )

Evaluating the Model

rpart.plot(heart_mod)

Looking at the structure of the tree, of the 11 feature variables in the dataset, the model only uses ST_Slope (the slope of the peak exercise ST segment), Cholesterol, ChestPainType, Age, and Sex.
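As a quick check of this, the fitted rpart object keeps importance scores for the variables it considered. The lines below are a sketch I would run here (they are not part of the output above):

heart_mod$variable.importance   # named vector of variable importance scores kept by rpart
printcp(heart_mod)              # complexity table, including the variables actually used in the tree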

heart_pred <- predict(heart_mod, heart_test, type = "class")
heart_pred_table <- table(heart_test$HeartDisease, heart_pred)
heart_pred_table
##               heart_pred
##                Healthy HeartDisease
##   Healthy           91           17
##   HeartDisease      26           96
sum(diag(heart_pred_table))/nrow(heart_test)
## [1] 0.8130435

The predictive accuracy of the model on the test set is about 81%.
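Beyond overall accuracy, class-level performance can be read off the same confusion table. A minimal sketch (the sensitivity and specificity names are my own labels, not objects created earlier):

sensitivity <- heart_pred_table["HeartDisease", "HeartDisease"] / sum(heart_pred_table["HeartDisease", ])  # true positive rate
specificity <- heart_pred_table["Healthy", "Healthy"] / sum(heart_pred_table["Healthy", ])                 # true negative rate
c(sensitivity = sensitivity, specificity = specificity)

From the table above these work out to roughly 0.79 and 0.84, so the model misses heart disease cases somewhat more often than it misclassifies healthy patients.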

Switch variables

First, I limit the predictors to Age, Sex, and Cholesterol.

heart1 <- heart %>%
  select(
    Age,
    Sex,
    Cholesterol,
    HeartDisease
  )

Split Data

set.seed(1234)
sample_set1 <- sample(nrow(heart1), round(nrow(heart1) * .75), replace = FALSE)
heart1_train <- heart1[sample_set1,]
heart1_test <- heart1[-sample_set1,]

round(prop.table(table(select(heart1, HeartDisease), exclude = NULL)), 4) * 100
## 
##      Healthy HeartDisease 
##        44.66        55.34
round(prop.table(table(select(heart1_train, HeartDisease), exclude = NULL)), 4) * 100
## 
##      Healthy HeartDisease 
##         43.9         56.1
round(prop.table(table(select(heart1_test, HeartDisease), exclude = NULL)), 4) * 100
## 
##      Healthy HeartDisease 
##        46.96        53.04

Training the Model

heart_mod1 <-
  rpart(
    HeartDisease ~ .,
    method = "class",
    data = heart1_train
  )

Evaluating the Model

rpart.plot(heart_mod1)

heart1_pred <- predict(heart_mod1, heart1_test, type = "class")
heart1_pred_table <- table(heart1_test$HeartDisease, heart1_pred)
heart1_pred_table
##               heart1_pred
##                Healthy HeartDisease
##   Healthy           59           49
##   HeartDisease      22          100
sum(diag(heart1_pred_table))/nrow(heart1_test)
## [1] 0.6913043

The predictive accuracy drops to about 69% when only Age, Sex, and Cholesterol are used as predictors.
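For a side-by-side comparison of the two trees, a sketch reusing the prediction tables created above:

c(all_predictors = sum(diag(heart_pred_table)) / nrow(heart_test),
  age_sex_chol   = sum(diag(heart1_pred_table)) / nrow(heart1_test))

The drop from about 81% to about 69% suggests that the excluded clinical features, ST_Slope and ChestPainType in particular, carry much of the predictive signal.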

Random Forest

set.seed(1234)
dim(heart)
## [1] 918  12
heart$HeartDisease <- as.factor(heart$HeartDisease)
train <- sample(1:nrow(heart), 500)
rf.heart <- randomForest(HeartDisease ~ ., data = heart, subset = train)
rf.heart
## 
## Call:
##  randomForest(formula = HeartDisease ~ ., data = heart, subset = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 14.4%
## Confusion matrix:
##              Healthy HeartDisease class.error
## Healthy          185           42   0.1850220
## HeartDisease      30          243   0.1098901
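The OOB error above is estimated from the rows each tree did not see during bagging. As a further check, a sketch evaluating the forest on the 418 observations left out of the training sample (this reuses the train index created above):

rf_pred <- predict(rf.heart, newdata = heart[-train, ])   # predict on held-out rows
rf_table <- table(heart$HeartDisease[-train], rf_pred)    # true labels vs. predictions
rf_table
sum(diag(rf_table)) / sum(rf_table)                       # held-out accuracy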

As mentioned in the article, there are good, bad, and ugly sides to using decision trees. The decision trees I created identify which features can be good predictors of heart disease. According to the first decision tree, the most important predictors include ST_Slope, Cholesterol, ChestPainType, MaxHR, RestingBP, Sex, and FastingBS. The tree structure clearly articulates and logically organizes the alternatives, along with the percentage of observations classified as HeartDisease versus Healthy.
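To back up the statement about which predictors matter most, the forest's own importance measures can be inspected; a minimal sketch using functions from the randomForest package:

importance(rf.heart)   # mean decrease in Gini impurity for each predictor
varImpPlot(rf.heart)   # dot chart of the same importance scores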