Based on the latest topics presented, bring a dataset of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used.
Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results.
Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?
Format: document with screen captures & analysis.
For this homework I decided to use the heart failure dataset available on Kaggle at the following link: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download .
This dataset contains 918 observations and 12 variables. I am choosing HeartDisease as my target variable. It takes the values 0 and 1, which correspond to not having heart disease and having the disease, respectively. 7 of the 12 variables are numeric, including the target. For the purpose of this homework I am reading the target variable in as a factor (along with the character variables).
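The code in this document relies on several packages. Below is a minimal setup block, inferred from the functions used throughout (skim, select, rpart, rpart.plot, randomForest, confusionMatrix), assuming these packages are installed:
library(skimr)         # skim() data summaries
library(dplyr)         # select()
library(rpart)         # decision trees
library(rpart.plot)    # tree plots
library(randomForest)  # random forests
library(caret)         # confusionMatrix()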
df <- read.csv("heart.csv",
               colClasses = c("numeric", "factor", "factor", "numeric",
                              "numeric", "numeric", "factor", "numeric",
                              "factor", "numeric", "factor", "factor"))
head(df)
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1 N 0.0 Up 0
## 2 N 1.0 Flat 1
## 3 N 0.0 Up 0
## 4 Y 1.5 Flat 1
## 5 N 0.0 Up 0
## 6 N 0.0 Up 0
Below is the skim() view of the dataset, which shows that there are no missing values.
skim(df)
Name | df |
Number of rows | 918 |
Number of columns | 12 |
Column type frequency: | |
factor | 6 |
numeric | 6 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Sex | 0 | 1 | FALSE | 2 | M: 725, F: 193 |
ChestPainType | 0 | 1 | FALSE | 4 | ASY: 496, NAP: 203, ATA: 173, TA: 46 |
RestingECG | 0 | 1 | FALSE | 3 | Nor: 552, LVH: 188, ST: 178 |
ExerciseAngina | 0 | 1 | FALSE | 2 | N: 547, Y: 371 |
ST_Slope | 0 | 1 | FALSE | 3 | Fla: 460, Up: 395, Dow: 63 |
HeartDisease | 0 | 1 | FALSE | 2 | 1: 508, 0: 410 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Age | 0 | 1 | 53.51 | 9.43 | 28.0 | 47.00 | 54.0 | 60.0 | 77.0 | ▁▅▇▆▁ |
RestingBP | 0 | 1 | 132.40 | 18.51 | 0.0 | 120.00 | 130.0 | 140.0 | 200.0 | ▁▁▃▇▁ |
Cholesterol | 0 | 1 | 198.80 | 109.38 | 0.0 | 173.25 | 223.0 | 267.0 | 603.0 | ▃▇▇▁▁ |
FastingBS | 0 | 1 | 0.23 | 0.42 | 0.0 | 0.00 | 0.0 | 0.0 | 1.0 | ▇▁▁▁▂ |
MaxHR | 0 | 1 | 136.81 | 25.46 | 60.0 | 120.00 | 138.0 | 156.0 | 202.0 | ▁▃▇▆▂ |
Oldpeak | 0 | 1 | 0.89 | 1.07 | -2.6 | 0.00 | 0.6 | 1.5 | 6.2 | ▁▇▆▁▁ |
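Note that the minimum of 0 for RestingBP and Cholesterol is not physiologically plausible, so these zeros most likely encode unrecorded measurements rather than true readings. As an illustration only (this cleanup is not applied in the rest of the analysis), one way to treat them as missing would be:
# Hypothetical cleanup sketch: recode implausible zeros as NA
df.clean <- df
df.clean$RestingBP[df.clean$RestingBP == 0] <- NA
df.clean$Cholesterol[df.clean$Cholesterol == 0] <- NA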
To move forward, I split the dataset into training and test sets in a 75:25 ratio.
set.seed(101)  # for reproducibility
df.sample <- sample(nrow(df), round(nrow(df)*0.75), replace = FALSE)  # 75% of the row indices
df.train <- df[df.sample, ]
df.test <- df[-df.sample, ]
After splitting, I check that both classes of the target are present in each set:
round(prop.table(table(select(df.train, HeartDisease), exclude = NULL)), 4) * 100
##
## 0 1
## 43.02 56.98
round(prop.table(table(select(df.test, HeartDisease), exclude = NULL)), 4) * 100
##
## 0 1
## 49.57 50.43
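The two sets end up with somewhat different class proportions (43/57 in training vs. roughly 50/50 in test). If preserving the target distribution mattered, a stratified split could be used instead of the simple random sample above; a sketch using caret's createDataPartition (an alternative, not what was used here):
set.seed(101)
# sample 75% of the rows within each HeartDisease class
strat.idx <- createDataPartition(df$HeartDisease, p = 0.75, list = FALSE)
df.train.strat <- df[strat.idx, ]
df.test.strat <- df[-strat.idx, ]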
To build the first decision tree, I will use all of the available variables.
df.m1 <- rpart(HeartDisease ~ ., method = "class", data = df.train)
rpart.plot(df.m1)
df.m1.pred <- predict(df.m1, df.test, type = "class")  # predicted class labels on the test set
For the second decision tree I will use a few of the variables that I think are more relevant to the prediction than the rest.
I am using the following variables: Age, Sex, ChestPainType, RestingBP, RestingECG, and MaxHR.
df.m2 <- rpart(HeartDisease ~ Age + Sex + ChestPainType + RestingBP + RestingECG + MaxHR,
               method = "class", data = df.train)
rpart.plot(df.m2)
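To compare the two trees quantitatively rather than only visually, their test-set predictions can be evaluated the same way the random forest is evaluated below; a minimal sketch (output not reproduced here):
df.m2.pred <- predict(df.m2, df.test, type = "class")  # labels from the reduced-variable tree
confusionMatrix(df.m1.pred, df.test$HeartDisease)      # tree built with all variables
confusionMatrix(df.m2.pred, df.test$HeartDisease)      # tree built with the selected variables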
From both of these models it can be seen that in most of the cases where HeartDisease was 1, the corresponding ChestPainType was ASY (the branch where the ChestPainType split condition is false in the plots) and Sex was M, which is quite interesting.
Here I am creating a random forest using all the variables in the dataset to predict HeartDisease. (Because the target is a factor, randomForest fits a classification forest.)
df.randomforest <- randomForest(HeartDisease ~ ., data = df.train)
df.RF.pred <- predict(df.randomforest, df.test)
Let's have a look at the confusion matrix for this prediction below.
confusionMatrix(df.RF.pred, df.test$HeartDisease)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 98 13
## 1 16 103
##
## Accuracy : 0.8739
## 95% CI : (0.824, 0.9139)
## No Information Rate : 0.5043
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7477
##
## Mcnemar's Test P-Value : 0.7103
##
## Sensitivity : 0.8596
## Specificity : 0.8879
## Pos Pred Value : 0.8829
## Neg Pred Value : 0.8655
## Prevalence : 0.4957
## Detection Rate : 0.4261
## Detection Prevalence : 0.4826
## Balanced Accuracy : 0.8738
##
## 'Positive' Class : 0
##
The accuracy of this prediction is about 87%, which is pretty good.
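Since interpretability is one of the main selling points of tree-based methods, it is also worth checking which variables the forest relies on most. A short sketch using randomForest's built-in importance measures:
importance(df.randomforest)   # mean decrease in Gini impurity per variable
varImpPlot(df.randomforest)   # dotchart of variable importance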
Based on the dataset I have chosen and after reading the article, I feel that a single decision tree is not the best prediction method here; the random forest shows higher accuracy (87%) and is the better choice. That said, the weaknesses the article describes can largely be managed: validating the tree on a held-out test set, keeping it to a sensible size, and combining many trees into an ensemble all help turn a fragile tree into a reliable and still interpretable model.
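One of the ‘ugly’ aspects the article points out is that unconstrained trees overfit and become unstable. rpart's cross-validated complexity table offers a standard remedy: prune the tree back to the size with the lowest cross-validated error. A minimal sketch applied to the first tree:
printcp(df.m1)  # cross-validated error (xerror) at each complexity parameter value
# prune at the cp value that minimizes the cross-validated error
best.cp <- df.m1$cptable[which.min(df.m1$cptable[, "xerror"]), "CP"]
df.m1.pruned <- prune(df.m1, cp = best.cp)
rpart.plot(df.m1.pruned)
A tree pruned and validated this way keeps the interpretability of a single tree while avoiding much of the overfitting that gives decision trees their bad reputation.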