Goal

  1. Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used.

  2. Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results.

  3. Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?

  4. Format: document with screen captures & analysis.

1. Choose dataset and Split the dataset

For this homework I decided to use the Heart Failure Prediction dataset available on Kaggle at the following link: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download .

This dataset contains 918 observations and 12 variables. I am choosing HeartDisease as my target variable. This variable takes the values 0 and 1, which correspond to not having heart disease and having the disease, respectively.

7 of the 12 variables are numeric, including my target variable. For the purpose of this homework I am changing the target variable's data type to factor while reading the file.
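The analysis below relies on a handful of packages: skimr for skim(), dplyr for select(), rpart and rpart.plot for the decision trees, randomForest for the random forest, and caret for confusionMatrix(). A minimal setup chunk to load them up front would be:

library(skimr)         # skim()
library(dplyr)         # select()
library(rpart)         # rpart()
library(rpart.plot)    # rpart.plot()
library(randomForest)  # randomForest()
library(caret)         # confusionMatrix()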

df <- read.csv("heart.csv", colClasses = c("numeric", "factor", "factor","numeric", "numeric", "numeric","factor", "numeric", "factor","numeric", "factor","factor"))
head(df)
##   Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1  40   M           ATA       140         289         0     Normal   172
## 2  49   F           NAP       160         180         0     Normal   156
## 3  37   M           ATA       130         283         0         ST    98
## 4  48   F           ASY       138         214         0     Normal   108
## 5  54   M           NAP       150         195         0     Normal   122
## 6  39   M           NAP       120         339         0     Normal   170
##   ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1              N     0.0       Up            0
## 2              N     1.0     Flat            1
## 3              N     0.0       Up            0
## 4              Y     1.5     Flat            1
## 5              N     0.0       Up            0
## 6              N     0.0       Up            0

Below is the skim() view of the dataset; it shows that there are no missing values.

skim(df)
Data summary
Name df
Number of rows 918
Number of columns 12
_______________________
Column type frequency:
factor 6
numeric 6
________________________
Group variables None

Variable type: factor

| skim_variable  | n_missing | complete_rate | ordered | n_unique | top_counts                           |
|----------------|-----------|---------------|---------|----------|--------------------------------------|
| Sex            | 0         | 1             | FALSE   | 2        | M: 725, F: 193                       |
| ChestPainType  | 0         | 1             | FALSE   | 4        | ASY: 496, NAP: 203, ATA: 173, TA: 46 |
| RestingECG     | 0         | 1             | FALSE   | 3        | Nor: 552, LVH: 188, ST: 178          |
| ExerciseAngina | 0         | 1             | FALSE   | 2        | N: 547, Y: 371                       |
| ST_Slope       | 0         | 1             | FALSE   | 3        | Fla: 460, Up: 395, Dow: 63           |
| HeartDisease   | 0         | 1             | FALSE   | 2        | 1: 508, 0: 410                       |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean   | sd     | p0   | p25    | p50   | p75   | p100  | hist  |
|---------------|-----------|---------------|--------|--------|------|--------|-------|-------|-------|-------|
| Age           | 0         | 1             | 53.51  | 9.43   | 28.0 | 47.00  | 54.0  | 60.0  | 77.0  | ▁▅▇▆▁ |
| RestingBP     | 0         | 1             | 132.40 | 18.51  | 0.0  | 120.00 | 130.0 | 140.0 | 200.0 | ▁▁▃▇▁ |
| Cholesterol   | 0         | 1             | 198.80 | 109.38 | 0.0  | 173.25 | 223.0 | 267.0 | 603.0 | ▃▇▇▁▁ |
| FastingBS     | 0         | 1             | 0.23   | 0.42   | 0.0  | 0.00   | 0.0   | 0.0   | 1.0   | ▇▁▁▁▂ |
| MaxHR         | 0         | 1             | 136.81 | 25.46  | 60.0 | 120.00 | 138.0 | 156.0 | 202.0 | ▁▃▇▆▂ |
| Oldpeak       | 0         | 1             | 0.89   | 1.07   | -2.6 | 0.00   | 0.6   | 1.5   | 6.2   | ▁▇▆▁▁ |

To move forward, I am splitting the dataset into a train set and a test set in a 75:25 ratio.

set.seed(101)
df.sample <- sample(nrow(df), round(nrow(df)*0.75), replace = FALSE)
df.train <- df[df.sample, ]
df.test <- df[-df.sample, ]

After splitting, I check that both classes of the target variable are present in each set and look at their proportions.

round(prop.table(table(select(df.train, HeartDisease), exclude = NULL)), 4) * 100
## 
##     0     1 
## 43.02 56.98
round(prop.table(table(select(df.test, HeartDisease), exclude = NULL)), 4) * 100
## 
##     0     1 
## 49.57 50.43
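The two sets end up with slightly different class proportions (about 57% vs. 50% positives). If a more balanced split were wanted, caret's createDataPartition() samples within each class of the target; a minimal sketch (the df.train.strat / df.test.strat names are just illustrative):

set.seed(101)
# stratified sampling: both sets keep roughly the same HeartDisease proportions as the full data
idx <- createDataPartition(df$HeartDisease, p = 0.75, list = FALSE)
df.train.strat <- df[idx, ]
df.test.strat  <- df[-idx, ]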

2. Build 2 decision trees

To build the first decision tree, I will use all the available variables.

df.m1 <- rpart(HeartDisease ~ ., method = "class", data = df.train)

rpart.plot(df.m1)

df.m1.pred <- predict(df.m1, df.test, type = "class")  # predicted class labels rather than class probabilities
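These predicted classes can be checked against the test labels with caret's confusionMatrix(); a minimal sketch (output not reproduced here):

confusionMatrix(df.m1.pred, df.test$HeartDisease)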

For the second decision tree I will use a few of the variables that I think are more relevant to the prediction than the rest.

I am using the following variables:

Age + Sex + ChestPainType + RestingBP + RestingECG + MaxHR

df.m2 <- rpart(HeartDisease ~ Age + Sex + ChestPainType + RestingBP + RestingECG + MaxHR, method = "class", data = df.train)

rpart.plot(df.m2)

From both of these models it can be seen that most cases where HeartDisease was 1 fall on the branch where the ChestPainType condition evaluates to FALSE and Sex is M, which is quite interesting.
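To make the comparison between the two trees concrete, both can also be scored on the held-out test set; a quick accuracy-only sketch (not re-run here):

df.m2.pred <- predict(df.m2, df.test, type = "class")
# share of correct predictions for each tree on the same test set
c(full_tree    = mean(df.m1.pred == df.test$HeartDisease),
  reduced_tree = mean(df.m2.pred == df.test$HeartDisease))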

3. Random Forest

Here I am creating a random forest using all the variables in the dataset to predict HeartDisease.

df.randomforest <- randomForest(HeartDisease ~ ., data = df.train)

df.RF.pred <- predict(df.randomforest, df.test)
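Before evaluating the predictions, it is also worth seeing which variables the forest leans on most; a short sketch using randomForest's built-in importance measures:

importance(df.randomforest)   # mean decrease in Gini impurity per predictor
varImpPlot(df.randomforest)   # the same information as a dot plot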

Let's have a look at the confusion matrix for this prediction below.

confusionMatrix(df.RF.pred, df.test$HeartDisease)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  98  13
##          1  16 103
##                                          
##                Accuracy : 0.8739         
##                  95% CI : (0.824, 0.9139)
##     No Information Rate : 0.5043         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.7477         
##                                          
##  Mcnemar's Test P-Value : 0.7103         
##                                          
##             Sensitivity : 0.8596         
##             Specificity : 0.8879         
##          Pos Pred Value : 0.8829         
##          Neg Pred Value : 0.8655         
##              Prevalence : 0.4957         
##          Detection Rate : 0.4261         
##    Detection Prevalence : 0.4826         
##       Balanced Accuracy : 0.8738         
##                                          
##        'Positive' Class : 0              
## 

The accuracy of this prediction is about 87% (95% CI: 0.82–0.91), which is pretty good.

4. Conclusion based on the article

Based on the dataset I have chosen and after reading the article, I feel that a single decision tree is not the best prediction method here. The random forest shows higher accuracy (87%), which makes it the better choice for this problem. Combining many trees in a random forest also addresses the instability and overfitting concerns raised in the article, which helps change the negative perception of tree-based methods when solving a real problem like this one.
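If a single tree is still preferred for its readability, cost-complexity pruning is one standard way to counter the overfitting and instability criticisms from the article. A minimal sketch for the first tree built above (df.m1.pruned is an illustrative name):

printcp(df.m1)  # cross-validated error (xerror) for each complexity parameter (cp) value
# keep the cp with the lowest cross-validated error and prune the tree back to that size
best.cp <- df.m1$cptable[which.min(df.m1$cptable[, "xerror"]), "CP"]
df.m1.pruned <- prune(df.m1, cp = best.cp)
rpart.plot(df.m1.pruned)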