Assignment 2

Based on the latest topics presented, choose a dataset of your choice and create a Decision Tree where you can solve a classification problem and predict the outcome of a particular feature or detail of the data used.

Data From Kaggle

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

Author of the Data: fedesoriano

Data Review:

The selected dataset [https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset] is a Kaggle dataset of clinical stroke data that will allow us to build a model to predict stroke occurrence based on certain variables.

data <- readr::read_csv("https://raw.githubusercontent.com/jtul333/Data622/main/healthcare-dataset-stroke-data.csv")
## Rows: 5110 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): gender, ever_married, work_type, Residence_type, smoking_status
## dbl (7): id, age, hypertension, heart_disease, avg_glucose_level, bmi, stroke
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data contain the variables gender, ever_married, work_type, Residence_type, bmi, smoking_status, id, age, hypertension, heart_disease, avg_glucose_level, and stroke.

head(data)
## # A tibble: 6 × 12
##      id gender   age hypertension heart_disease ever_married work_type    
##   <dbl> <chr>  <dbl>        <dbl>         <dbl> <chr>        <chr>        
## 1  9046 Male      67            0             1 Yes          Private      
## 2 51676 Female    61            0             0 Yes          Self-employed
## 3 31112 Male      80            0             1 Yes          Private      
## 4 60182 Female    49            0             0 Yes          Private      
## 5  1665 Female    79            1             0 Yes          Self-employed
## 6 56669 Male      81            0             0 Yes          Private      
## # ℹ 5 more variables: Residence_type <chr>, avg_glucose_level <dbl>, bmi <dbl>,
## #   smoking_status <chr>, stroke <dbl>

Types of Variables

sapply(data, class)
##                id            gender               age      hypertension 
##         "numeric"       "character"         "numeric"         "numeric" 
##     heart_disease      ever_married         work_type    Residence_type 
##         "numeric"       "character"       "character"       "character" 
## avg_glucose_level               bmi    smoking_status            stroke 
##         "numeric"         "numeric"       "character"         "numeric"

Statistical Summary

The plan is to predict stroke based on these variables. The skim function gives us a quick and detailed view of the dataset. Important notation about the data:

gender - Male, Female, Other
age - age of the patient
hypertension - 0 = No, 1 = Yes
heart_disease - 0 = No, 1 = Yes
ever_married - Yes, No
work_type - Private, Self-employed, Govt_job, children, Never_worked
Residence_type - Urban, Rural
avg_glucose_level - double
bmi - double
smoking_status - formerly smoked, never smoked, smokes, Unknown
stroke - 0 = No, 1 = Yes

skim(data)
Data summary
Name data
Number of rows 5110
Number of columns 12
_______________________
Column type frequency:
character 5
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 0 1 4 6 0 3 0
ever_married 0 1 2 3 0 2 0
work_type 0 1 7 13 0 5 0
Residence_type 0 1 5 5 0 2 0
smoking_status 0 1 6 15 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 36517.83 21161.72 67.00 17741.25 36932.00 54682.00 72940.00 ▇▇▇▇▇
age 0 1 43.23 22.61 0.08 25.00 45.00 61.00 82.00 ▅▆▇▇▆
hypertension 0 1 0.10 0.30 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
heart_disease 0 1 0.05 0.23 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
avg_glucose_level 0 1 106.15 45.28 55.12 77.24 91.88 114.09 271.74 ▇▃▁▁▁
bmi 0 1 29.17 7.82 10.30 23.80 28.40 34.00 97.60 ▇▇▁▁▁
stroke 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁

Check for Missing Data

plot_missing(data)

There is no missing data.
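As a quick cross-check without DataExplorer, base R can count the NAs per column directly (a minimal sketch):

# Count missing values per column; all zeros agrees with the plot
colSums(is.na(data))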

Data Preparation

We create a separate data set in order to preserve the original data and make all the necessary transformations there. We’ll transform the variables that were of type character into factors.

data_prepared <- data

#Value substitutions cleanup: only gender needs recoding; the other character
#columns already hold the labels we want, so they are left unchanged.
data_prepared$gender[data_prepared$gender == "Female"] <- "F"
data_prepared$gender[data_prepared$gender == "Male"] <- "M"


#Data type change for columns of interest (note that as.integer truncates any decimals in bmi and avg_glucose_level)
data_prepared$work_type <- as.factor(data_prepared$work_type)
data_prepared$gender <- as.factor(data_prepared$gender)
data_prepared$Residence_type <- as.factor(data_prepared$Residence_type)
data_prepared$ever_married <- as.factor(data_prepared$ever_married)
data_prepared$smoking_status <- as.factor(data_prepared$smoking_status)
data_prepared$stroke <- as.integer(data_prepared$stroke)
data_prepared$heart_disease <- as.integer(data_prepared$heart_disease)
data_prepared$bmi <- as.integer(data_prepared$bmi)
data_prepared$avg_glucose_level <- as.integer(data_prepared$avg_glucose_level)
data_prepared$hypertension <- as.integer(data_prepared$hypertension)
data_prepared$age <- as.integer(data_prepared$age)
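For reference, the same preparation can be written more compactly with dplyr (a sketch, not part of the original workflow; assumes the dplyr package is installed):

# Compact equivalent of the recoding and type changes above
library(dplyr)
data_prepared <- data %>%
  mutate(gender = ifelse(gender == "Female", "F",
                         ifelse(gender == "Male", "M", gender)),
         across(where(is.character), as.factor),
         across(c(age, hypertension, heart_disease,
                  avg_glucose_level, bmi, stroke), as.integer))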

Structure of the Variables

str(data_prepared)
## spc_tbl_ [5,110 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id               : num [1:5110] 9046 51676 31112 60182 1665 ...
##  $ gender           : Factor w/ 3 levels "F","M","Other": 2 1 2 1 1 2 2 1 1 1 ...
##  $ age              : int [1:5110] 67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : int [1:5110] 0 0 0 0 1 0 1 0 0 0 ...
##  $ heart_disease    : int [1:5110] 1 0 1 0 0 0 1 0 0 0 ...
##  $ ever_married     : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
##  $ work_type        : Factor w/ 5 levels "children","Govt_job",..: 4 5 4 4 5 4 4 4 4 4 ...
##  $ Residence_type   : Factor w/ 2 levels "Rural","Urban": 2 1 1 2 1 2 1 2 1 2 ...
##  $ avg_glucose_level: int [1:5110] 228 202 105 171 174 186 70 94 76 58 ...
##  $ bmi              : int [1:5110] 36 36 32 34 24 29 27 22 36 24 ...
##  $ smoking_status   : Factor w/ 4 levels "formerly smoked",..: 1 2 2 3 2 1 2 2 4 4 ...
##  $ stroke           : int [1:5110] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   hypertension = col_double(),
##   ..   heart_disease = col_double(),
##   ..   ever_married = col_character(),
##   ..   work_type = col_character(),
##   ..   Residence_type = col_character(),
##   ..   avg_glucose_level = col_double(),
##   ..   bmi = col_double(),
##   ..   smoking_status = col_character(),
##   ..   stroke = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(data_prepared)
##        id          gender          age         hypertension    
##  Min.   :   67   F    :2994   Min.   : 0.00   Min.   :0.00000  
##  1st Qu.:17741   M    :2115   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Other:   1   Median :45.00   Median :0.00000  
##  Mean   :36518                Mean   :43.22   Mean   :0.09746  
##  3rd Qu.:54682                3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                Max.   :82.00   Max.   :1.00000  
##  heart_disease     ever_married         work_type    Residence_type
##  Min.   :0.00000   No :1757     children     : 687   Rural:2514    
##  1st Qu.:0.00000   Yes:3353     Govt_job     : 657   Urban:2596    
##  Median :0.00000                Never_worked :  22                 
##  Mean   :0.05401                Private      :2925                 
##  3rd Qu.:0.00000                Self-employed: 819                 
##  Max.   :1.00000                                                   
##  avg_glucose_level      bmi                smoking_status     stroke       
##  Min.   : 55.0     Min.   :10.00   formerly smoked: 885   Min.   :0.00000  
##  1st Qu.: 77.0     1st Qu.:23.00   never smoked   :1892   1st Qu.:0.00000  
##  Median : 91.0     Median :28.00   smokes         : 789   Median :0.00000  
##  Mean   :105.7     Mean   :28.75   Unknown        :1544   Mean   :0.04873  
##  3rd Qu.:114.0     3rd Qu.:34.00                          3rd Qu.:0.00000  
##  Max.   :271.0     Max.   :97.00                          Max.   :1.00000
skim(data_prepared)
Data summary
Name data_prepared
Number of rows 5110
Number of columns 12
_______________________
Column type frequency:
factor 5
numeric 7
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
gender 0 1 FALSE 3 F: 2994, M: 2115, Oth: 1
ever_married 0 1 FALSE 2 Yes: 3353, No: 1757
work_type 0 1 FALSE 5 Pri: 2925, Sel: 819, chi: 687, Gov: 657
Residence_type 0 1 FALSE 2 Urb: 2596, Rur: 2514
smoking_status 0 1 FALSE 4 nev: 1892, Unk: 1544, for: 885, smo: 789

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 36517.83 21161.72 67 17741.25 36932 54682 72940 ▇▇▇▇▇
age 0 1 43.22 22.63 0 25.00 45 61 82 ▅▆▇▇▆
hypertension 0 1 0.10 0.30 0 0.00 0 0 1 ▇▁▁▁▁
heart_disease 0 1 0.05 0.23 0 0.00 0 0 1 ▇▁▁▁▁
avg_glucose_level 0 1 105.66 45.28 55 77.00 91 114 271 ▇▃▁▁▁
bmi 0 1 28.75 7.83 10 23.00 28 34 97 ▇▇▁▁▁
stroke 0 1 0.05 0.22 0 0.00 0 0 1 ▇▁▁▁▁

Since stroke is our target variable, let’s examine its proportions:

hist(data_prepared$stroke)

# Defensive cleanup: impute any missing target values with the median and
# truncate to whole numbers. Both are no-ops here, since stroke has no
# missing values and is already coded as an integer 0/1.
data_prepared$stroke[is.na(data_prepared$stroke)] <- median(data_prepared$stroke, na.rm=TRUE)
data_prepared$stroke <- trunc(data_prepared$stroke)
summary(data_prepared)
##        id          gender          age         hypertension    
##  Min.   :   67   F    :2994   Min.   : 0.00   Min.   :0.00000  
##  1st Qu.:17741   M    :2115   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Other:   1   Median :45.00   Median :0.00000  
##  Mean   :36518                Mean   :43.22   Mean   :0.09746  
##  3rd Qu.:54682                3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                Max.   :82.00   Max.   :1.00000  
##  heart_disease     ever_married         work_type    Residence_type
##  Min.   :0.00000   No :1757     children     : 687   Rural:2514    
##  1st Qu.:0.00000   Yes:3353     Govt_job     : 657   Urban:2596    
##  Median :0.00000                Never_worked :  22                 
##  Mean   :0.05401                Private      :2925                 
##  3rd Qu.:0.00000                Self-employed: 819                 
##  Max.   :1.00000                                                   
##  avg_glucose_level      bmi                smoking_status     stroke       
##  Min.   : 55.0     Min.   :10.00   formerly smoked: 885   Min.   :0.00000  
##  1st Qu.: 77.0     1st Qu.:23.00   never smoked   :1892   1st Qu.:0.00000  
##  Median : 91.0     Median :28.00   smokes         : 789   Median :0.00000  
##  Mean   :105.7     Mean   :28.75   Unknown        :1544   Mean   :0.04873  
##  3rd Qu.:114.0     3rd Qu.:34.00                          3rd Qu.:0.00000  
##  Max.   :271.0     Max.   :97.00                          Max.   :1.00000
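Since the mean of stroke is about 0.049, roughly 5% of the rows are stroke cases. A frequency table makes this class imbalance explicit (a minimal sketch):

# Counts and proportions of the target classes
table(data_prepared$stroke)
round(prop.table(table(data_prepared$stroke)), 3)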

Build Models

We want to predict the “stroke” variable and find out whether the variables “hypertension”, “heart_disease”, “gender”, “avg_glucose_level”, and “bmi” have any relationship with our dependent variable.

We first take a look at the gender variable to see how heart_disease is distributed across it. At first glance, the numbers of females and males in the data are roughly comparable, though heart disease is more common among the males (163 of 2115) than the females (113 of 2994).

xtabs(~gender + `heart_disease`, data = data_prepared)
##        heart_disease
## gender     0    1
##   F     2881  113
##   M     1952  163
##   Other    1    0
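Row proportions turn these raw counts into rates, which makes the comparison easier to read (a sketch):

# Heart-disease rate within each gender level
round(prop.table(xtabs(~ gender + heart_disease, data = data_prepared), margin = 1), 3)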

Our next step is to partition the data into training (80%) and test (20%) in order to measure how well the models perform.

# Partition data 
set.seed(123)
# create a partition containing 80% of the rows in the original dataset for training
validation_index <- createDataPartition(data_prepared$stroke, p=0.8, list=FALSE)
# use that 80% of the data to train the models
# and hold out the remaining 20% for testing
data_train <- data_prepared[validation_index,]

data_test <- data_prepared[-validation_index,]
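A quick sanity check on the resulting split sizes (a sketch; with p = 0.8 we expect roughly 4088 training rows and 1022 test rows):

# Confirm the 80/20 split
nrow(data_train)
nrow(data_test)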

First Decision Tree

We run the first decision tree model to predict “stroke” from the variables “age”, “avg_glucose_level”, and “bmi” below:

model1 <- rpart(stroke ~ age + avg_glucose_level + bmi,
                         method = "class",
                         data = data_test
                )

rpart.plot(model1)
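The plot alone does not tell us how well this tree classifies. Here is a minimal sketch that scores it on the other partition, mirroring the random-forest evaluation below (tree_pred is a hypothetical name; assumes the caret package is loaded):

# Score the first tree on the partition it was not fitted to
tree_pred <- predict(model1, data_train, type = "class")
confusionMatrix(tree_pred, as.factor(data_train$stroke))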

Second Decision Tree

Switch variables to generate a second decision tree and compare the results (see the scoring sketch after the plot below). In this case, I kept “avg_glucose_level”, “bmi”, and “age” and added “smoking_status”:

model2 <- rpart(stroke ~ smoking_status + avg_glucose_level + bmi + age ,
                         method = "class",
                         data = data_test
                )

rpart.plot(model2)
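To compare the two trees directly, a small hypothetical helper can report each one’s accuracy on the same held-out partition (tree_acc is not part of the original code):

# Classification accuracy of a fitted rpart tree on new data
tree_acc <- function(fit, newdata) {
  preds <- predict(fit, newdata, type = "class")
  mean(as.character(preds) == as.character(newdata$stroke))
}
tree_acc(model1, data_train)  # age + avg_glucose_level + bmi
tree_acc(model2, data_train)  # adds smoking_status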

Random Forest

Create a random forest for classification and analyze the results. For the random forest, I will use a broader set of the attributes. (As with the trees above, the model is fit on the 20% test partition and then scored on the 80% training partition, the reverse of the usual convention.)

rf_model <- randomForest(as.factor(stroke) ~ smoking_status + avg_glucose_level + heart_disease + age + gender + bmi,
                                    data = data_test)
rf_pred <- predict(rf_model, data_train)
c_matrix <- confusionMatrix(rf_pred, as.factor(data_train$stroke))
c_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3890  192
##          1    5    1
##                                           
##                Accuracy : 0.9518          
##                  95% CI : (0.9448, 0.9582)
##     No Information Rate : 0.9528          
##     P-Value [Acc > NIR] : 0.6337          
##                                           
##                   Kappa : 0.0072          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.998716        
##             Specificity : 0.005181        
##          Pos Pred Value : 0.952964        
##          Neg Pred Value : 0.166667        
##              Prevalence : 0.952789        
##          Detection Rate : 0.951566        
##    Detection Prevalence : 0.998532        
##       Balanced Accuracy : 0.501949        
##                                           
##        'Positive' Class : 0               
## 

The accuracy is 95%, but on its own that figure is misleading: it does not beat the No Information Rate of 95.3% (the share of non-stroke cases), Kappa is near zero, and the model identifies almost none of the actual strokes (specificity for the stroke class is 0.5%, i.e., 1 of 193). The class imbalance, not the model, is producing the high accuracy.
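One common remedy for this imbalance is to rebalance the classes before fitting. A minimal sketch using caret’s downSample (down_train and rf_balanced are hypothetical names; here the model is fit, as is conventional, on the training partition):

# Downsample the majority (no-stroke) class so both classes are equally represented
set.seed(123)
down_train <- caret::downSample(
  x = data_train[, c("smoking_status", "avg_glucose_level",
                     "heart_disease", "age", "gender", "bmi")],
  y = as.factor(data_train$stroke),
  yname = "stroke")
rf_balanced <- randomForest(stroke ~ ., data = down_train)
confusionMatrix(predict(rf_balanced, data_test), as.factor(data_test$stroke))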

Conclusion

Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?

This assignment demonstrated some of the pros and cons of using one algorithm versus another. For example, the decision trees produced a result extremely fast but at a cost of predictive power. Random forests, on the other hand, can produce a model with much greater predictive power at the cost of both computational speed and model interpretability.

Unbalanced data is not great for tree-based models. To give the model a fighting chance at predicting the positive class, the data can be resampled, as sketched above. I believe the most important lesson from this particular project is that one machine learning algorithm alone is probably not going to provide all of the answers, and that algorithms should be used in conjunction with one another to corroborate or validate the results.