Based on the latest topics presented, choose a dataset of your choice and create a Decision Tree where you can solve a classification problem and predict the outcome of a particular feature or detail of the data used.
According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.
Author of the Data: fedesoriano
The dataset selected [https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset] is a Kaggle dataset of clinical stroke-prediction data; it will allow us to build a model to predict stroke occurrence from a set of input variables.
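The functions used below come from several packages (inferred from the calls that follow; this assumes the packages are installed):

# Packages used throughout this analysis
library(readr)         # read_csv()
library(skimr)         # skim()
library(DataExplorer)  # plot_missing()
library(caret)         # createDataPartition(), confusionMatrix()
library(rpart)         # decision tree models
library(rpart.plot)    # decision tree plots
library(randomForest)  # random forest models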
data <- readr::read_csv("https://raw.githubusercontent.com/jtul333/Data622/main/healthcare-dataset-stroke-data.csv")
## Rows: 5110 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): gender, ever_married, work_type, Residence_type, smoking_status
## dbl (7): id, age, hypertension, heart_disease, avg_glucose_level, bmi, stroke
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The data contain the variables gender, ever_married, work_type, Residence_type, bmi, smoking_status, id, age, hypertension, heart_disease, avg_glucose_level, and stroke.
head(data)
## # A tibble: 6 × 12
## id gender age hypertension heart_disease ever_married work_type
## <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## # ℹ 5 more variables: Residence_type <chr>, avg_glucose_level <dbl>, bmi <dbl>,
## # smoking_status <chr>, stroke <dbl>
sapply(data,class)
## id gender age hypertension
## "numeric" "character" "numeric" "numeric"
## heart_disease ever_married work_type Residence_type
## "numeric" "character" "character" "character"
## avg_glucose_level bmi smoking_status stroke
## "numeric" "numeric" "character" "numeric"
The plan is to predict stroke based on these variables. Important notation about the data:

- gender - Male, Female, Other
- age - age of the patient
- hypertension - 0 = No, 1 = Yes
- heart_disease - 0 = No, 1 = Yes
- ever_married - Yes, No
- work_type - Private, Govt_job, Self-employed, children, Never_worked
- Residence_type - Urban, Rural
- avg_glucose_level - numeric (average glucose level)
- bmi - numeric (body mass index)
- smoking_status - formerly smoked, never smoked, smokes, Unknown
- stroke - 0 = No, 1 = Yes
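Before summarizing, we can verify these codings by listing the distinct values of each categorical column (a quick sanity check; char_cols is an ad-hoc helper name):

# List the distinct values of each character column to verify the codings above
char_cols <- c("gender", "ever_married", "work_type", "Residence_type", "smoking_status")
lapply(data[char_cols], unique)

The skim function then gives us a quick and detailed view of the dataset: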
skim(data)
| Data summary           |      |
|------------------------|------|
| Name                   | data |
| Number of rows         | 5110 |
| Number of columns      | 12   |
| Column type frequency: |      |
| character              | 5    |
| numeric                | 7    |
| Group variables        | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 0 | 1 | 4 | 6 | 0 | 3 | 0 |
| ever_married | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| work_type | 0 | 1 | 7 | 13 | 0 | 5 | 0 |
| Residence_type | 0 | 1 | 5 | 5 | 0 | 2 | 0 |
| smoking_status | 0 | 1 | 6 | 15 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 36517.83 | 21161.72 | 67.00 | 17741.25 | 36932.00 | 54682.00 | 72940.00 | ▇▇▇▇▇ |
| age | 0 | 1 | 43.23 | 22.61 | 0.08 | 25.00 | 45.00 | 61.00 | 82.00 | ▅▆▇▇▆ |
| hypertension | 0 | 1 | 0.10 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| heart_disease | 0 | 1 | 0.05 | 0.23 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| avg_glucose_level | 0 | 1 | 106.15 | 45.28 | 55.12 | 77.24 | 91.88 | 114.09 | 271.74 | ▇▃▁▁▁ |
| bmi | 0 | 1 | 29.17 | 7.82 | 10.30 | 23.80 | 28.40 | 34.00 | 97.60 | ▇▇▁▁▁ |
| stroke | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
Check for missing data:
plot_missing(data)
There is no missing data.
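As a cross-check that does not require plotting, base R can count the NAs per column directly:

# Count missing values in each column; all zeros confirms the plot above
colSums(is.na(data))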
We create a separate dataset in order to preserve the original data and make all the necessary transformations there. We'll transform the variables of type character into factors.
data_prepared <- data
# Value substitutions cleanup: shorten the gender labels.
# The other character columns already use the desired labels,
# so no substitutions are needed for them.
data_prepared$gender[data_prepared$gender == "Female"] <- "F"
data_prepared$gender[data_prepared$gender == "Male"] <- "M"
# Data type change for columns of interest
data_prepared$work_type <- as.factor(data_prepared$work_type)
data_prepared$gender <- as.factor(data_prepared$gender)
data_prepared$Residence_type <- as.factor(data_prepared$Residence_type)
data_prepared$ever_married <- as.factor(data_prepared$ever_married)
data_prepared$smoking_status <- as.factor(data_prepared$smoking_status)
data_prepared$stroke <- as.integer(data_prepared$stroke)
data_prepared$heart_disease <- as.integer(data_prepared$heart_disease)
data_prepared$bmi <- as.integer(data_prepared$bmi)
data_prepared$avg_glucose_level <- as.integer(data_prepared$avg_glucose_level)
data_prepared$hypertension <- as.integer(data_prepared$hypertension)
data_prepared$age <- as.integer(data_prepared$age)
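As an aside, the per-column conversions above can be written more compactly; a sketch of an equivalent dplyr version (an alternative, not what was run here):

library(dplyr)

# Equivalent, more compact form of the recoding and type conversions above
data_prepared <- data %>%
  mutate(gender = recode(gender, "Female" = "F", "Male" = "M")) %>%
  mutate(across(where(is.character), as.factor)) %>%
  mutate(across(c(stroke, heart_disease, bmi, avg_glucose_level,
                  hypertension, age), as.integer))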
str(data_prepared)
## spc_tbl_ [5,110 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:5110] 9046 51676 31112 60182 1665 ...
## $ gender : Factor w/ 3 levels "F","M","Other": 2 1 2 1 1 2 2 1 1 1 ...
## $ age : int [1:5110] 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int [1:5110] 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int [1:5110] 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
## $ work_type : Factor w/ 5 levels "children","Govt_job",..: 4 5 4 4 5 4 4 4 4 4 ...
## $ Residence_type : Factor w/ 2 levels "Rural","Urban": 2 1 1 2 1 2 1 2 1 2 ...
## $ avg_glucose_level: int [1:5110] 228 202 105 171 174 186 70 94 76 58 ...
## $ bmi : int [1:5110] 36 36 32 34 24 29 27 22 36 24 ...
## $ smoking_status : Factor w/ 4 levels "formerly smoked",..: 1 2 2 3 2 1 2 2 4 4 ...
## $ stroke : int [1:5110] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. gender = col_character(),
## .. age = col_double(),
## .. hypertension = col_double(),
## .. heart_disease = col_double(),
## .. ever_married = col_character(),
## .. work_type = col_character(),
## .. Residence_type = col_character(),
## .. avg_glucose_level = col_double(),
## .. bmi = col_double(),
## .. smoking_status = col_character(),
## .. stroke = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(data_prepared)
## id gender age hypertension
## Min. : 67 F :2994 Min. : 0.00 Min. :0.00000
## 1st Qu.:17741 M :2115 1st Qu.:25.00 1st Qu.:0.00000
## Median :36932 Other: 1 Median :45.00 Median :0.00000
## Mean :36518 Mean :43.22 Mean :0.09746
## 3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
## Max. :72940 Max. :82.00 Max. :1.00000
## heart_disease ever_married work_type Residence_type
## Min. :0.00000 No :1757 children : 687 Rural:2514
## 1st Qu.:0.00000 Yes:3353 Govt_job : 657 Urban:2596
## Median :0.00000 Never_worked : 22
## Mean :0.05401 Private :2925
## 3rd Qu.:0.00000 Self-employed: 819
## Max. :1.00000
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.0 Min. :10.00 formerly smoked: 885 Min. :0.00000
## 1st Qu.: 77.0 1st Qu.:23.00 never smoked :1892 1st Qu.:0.00000
## Median : 91.0 Median :28.00 smokes : 789 Median :0.00000
## Mean :105.7 Mean :28.75 Unknown :1544 Mean :0.04873
## 3rd Qu.:114.0 3rd Qu.:34.00 3rd Qu.:0.00000
## Max. :271.0 Max. :97.00 Max. :1.00000
skim(data_prepared)
| Data summary           |               |
|------------------------|---------------|
| Name                   | data_prepared |
| Number of rows         | 5110          |
| Number of columns      | 12            |
| Column type frequency: |               |
| factor                 | 5             |
| numeric                | 7             |
| Group variables        | None          |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 0 | 1 | FALSE | 3 | F: 2994, M: 2115, Oth: 1 |
| ever_married | 0 | 1 | FALSE | 2 | Yes: 3353, No: 1757 |
| work_type | 0 | 1 | FALSE | 5 | Pri: 2925, Sel: 819, chi: 687, Gov: 657 |
| Residence_type | 0 | 1 | FALSE | 2 | Urb: 2596, Rur: 2514 |
| smoking_status | 0 | 1 | FALSE | 4 | nev: 1892, Unk: 1544, for: 885, smo: 789 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 36517.83 | 21161.72 | 67 | 17741.25 | 36932 | 54682 | 72940 | ▇▇▇▇▇ |
| age | 0 | 1 | 43.22 | 22.63 | 0 | 25.00 | 45 | 61 | 82 | ▅▆▇▇▆ |
| hypertension | 0 | 1 | 0.10 | 0.30 | 0 | 0.00 | 0 | 0 | 1 | ▇▁▁▁▁ |
| heart_disease | 0 | 1 | 0.05 | 0.23 | 0 | 0.00 | 0 | 0 | 1 | ▇▁▁▁▁ |
| avg_glucose_level | 0 | 1 | 105.66 | 45.28 | 55 | 77.00 | 91 | 114 | 271 | ▇▃▁▁▁ |
| bmi | 0 | 1 | 28.75 | 7.83 | 10 | 23.00 | 28 | 34 | 97 | ▇▇▁▁▁ |
| stroke | 0 | 1 | 0.05 | 0.22 | 0 | 0.00 | 0 | 0 | 1 | ▇▁▁▁▁ |
Since stroke is our target variable, let's examine its proportions:
hist(data_prepared$stroke)
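Because stroke is binary, a frequency table is more informative than a histogram; a quick check of the class balance:

# Proportion of each class in the target: about 95% no-stroke vs 5% stroke
prop.table(table(data_prepared$stroke))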
# Impute any missing target values with the median (a safeguard; none are present here)
data_prepared$stroke[is.na(data_prepared$stroke)] <- median(data_prepared$stroke, na.rm=TRUE)
# Truncate to ensure the target contains whole numbers (0/1) only
data_prepared$stroke <- trunc(data_prepared$stroke)
summary(data_prepared)
## id gender age hypertension
## Min. : 67 F :2994 Min. : 0.00 Min. :0.00000
## 1st Qu.:17741 M :2115 1st Qu.:25.00 1st Qu.:0.00000
## Median :36932 Other: 1 Median :45.00 Median :0.00000
## Mean :36518 Mean :43.22 Mean :0.09746
## 3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
## Max. :72940 Max. :82.00 Max. :1.00000
## heart_disease ever_married work_type Residence_type
## Min. :0.00000 No :1757 children : 687 Rural:2514
## 1st Qu.:0.00000 Yes:3353 Govt_job : 657 Urban:2596
## Median :0.00000 Never_worked : 22
## Mean :0.05401 Private :2925
## 3rd Qu.:0.00000 Self-employed: 819
## Max. :1.00000
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.0 Min. :10.00 formerly smoked: 885 Min. :0.00000
## 1st Qu.: 77.0 1st Qu.:23.00 never smoked :1892 1st Qu.:0.00000
## Median : 91.0 Median :28.00 smokes : 789 Median :0.00000
## Mean :105.7 Mean :28.75 Unknown :1544 Mean :0.04873
## 3rd Qu.:114.0 3rd Qu.:34.00 3rd Qu.:0.00000
## Max. :271.0 Max. :97.00 Max. :1.00000
We will predict the "stroke" variable and find out whether the variables "hypertension", "heart_disease", "gender", "avg_glucose_level", and "bmi" have any relationship with our dependent variable.
We first take a look at the "gender" variable to see how "heart_disease" is distributed across it. At first glance, the numbers of females and males in the data appear roughly balanced.
xtabs(~gender + `heart_disease`, data = data_prepared)
## heart_disease
## gender 0 1
## F 2881 113
## M 1952 163
## Other 1 0
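Row proportions make the comparison across genders easier to read; a small follow-up sketch:

# Proportion of heart disease within each gender (rows sum to 1)
prop.table(xtabs(~ gender + heart_disease, data = data_prepared), margin = 1)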
Our next step is to partition the data into training (80%) and test (20%) in order to measure how well the models perform.
# Partition data
set.seed(123)
# create an index for 80% of the rows in the original dataset to use for training
validation_index <- createDataPartition(data_prepared$stroke, p=0.8, list=FALSE)
# hold out the remaining 20% of the data for testing/validation
data_train <- data_prepared[validation_index,]
data_test <- data_prepared[-validation_index,]
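It is worth confirming that the split has the expected sizes and a similar stroke rate in both partitions; a quick sanity check:

# Confirm partition sizes and class balance in each split
nrow(data_train); nrow(data_test)
prop.table(table(data_train$stroke))
prop.table(table(data_test$stroke))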
We run the first decision tree model to predict "stroke" from the variables "age", "avg_glucose_level", and "bmi" below:
model1 <- rpart(stroke ~ age + avg_glucose_level + bmi,
                method = "class",
                data = data_train
                )
rpart.plot(model1)
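The plot shows the fitted splits; to quantify performance, the tree can also be scored on the held-out test set. A minimal evaluation sketch (tree_pred is an ad-hoc name):

# Predict classes on the held-out test set and tabulate against the truth
tree_pred <- predict(model1, data_test, type = "class")
confusionMatrix(tree_pred, as.factor(data_test$stroke))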
We switch variables to generate a second decision tree and compare the results. In this case, smoking_status is added alongside the continuous predictors:
model2 <- rpart(stroke ~ smoking_status + avg_glucose_level + bmi + age,
                method = "class",
                data = data_train
                )
rpart.plot(model2)
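To compare the two trees beyond their plots, rpart records an importance score for each variable used in the splits (NULL if a tree made no splits); a quick look:

# Variable importance for each tree (higher = more influential in the splits)
model1$variable.importance
model2$variable.importance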
Create a random forest for classification and analyze the results. For the random forest, I will use most of the attributes.
rf_model <- randomForest(as.factor(stroke) ~ smoking_status + avg_glucose_level + heart_disease + age + gender + bmi,
                         data = data_train)
rf_pred <- predict(rf_model, data_test)
c_matrix <- confusionMatrix(rf_pred, as.factor(data_test$stroke))
c_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3890 192
## 1 5 1
##
## Accuracy : 0.9518
## 95% CI : (0.9448, 0.9582)
## No Information Rate : 0.9528
## P-Value [Acc > NIR] : 0.6337
##
## Kappa : 0.0072
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.998716
## Specificity : 0.005181
## Pos Pred Value : 0.952964
## Neg Pred Value : 0.166667
## Prevalence : 0.952789
## Detection Rate : 0.951566
## Detection Prevalence : 0.998532
## Balanced Accuracy : 0.501949
##
## 'Positive' Class : 0
##
The headline accuracy is about 95%, but the detailed statistics show this is misleading: the accuracy is no better than the no-information rate (95.3%), Kappa is near zero (0.0072), and specificity is only about 0.5%, so the model almost never identifies the patients who actually had a stroke (balanced accuracy is roughly 0.50). The high accuracy simply reflects the severe class imbalance rather than real predictive power.
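Note also that caret treated "0" (no stroke) as the positive class above. Recomputing with stroke as the positive class makes the weakness explicit, since sensitivity then measures how many actual strokes are caught; a small sketch using the same objects:

# Re-express the metrics with the stroke class ("1") as the positive class
confusionMatrix(rf_pred, as.factor(data_test$stroke), positive = "1")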
Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?
This assignment demonstrated some of the pros and cons of using one algorithm versus another. For example, the decision trees produced a result extremely fast, but at a cost of predictive power. Random forests, on the other hand, produced a model with greater predictive power at the cost of both computational speed and model interpretability.
Unbalanced data is not great for tree-based models; to give a model a fighting chance at predicting the positive class, the data can be resampled, as sketched below. I believe the most important lesson from this project is that a single machine learning algorithm alone is probably not going to provide all of the answers, and that algorithms should be used in conjunction with one another to corroborate or validate the results.
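As an illustration of that resampling idea (a hypothetical sketch, not part of the analysis above; balanced_train and rf_balanced are ad-hoc names), caret's downSample() can balance the classes before refitting:

# Down-sample the majority class so both classes are equally represented,
# then refit the random forest and evaluate on the untouched test set
set.seed(123)
balanced_train <- downSample(x = as.data.frame(data_train[, setdiff(names(data_train), c("id", "stroke"))]),
                             y = as.factor(data_train$stroke),
                             yname = "stroke")
rf_balanced <- randomForest(stroke ~ ., data = balanced_train)
confusionMatrix(predict(rf_balanced, data_test), as.factor(data_test$stroke), positive = "1")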