This R Markdown file reports the data analysis for a project on building and deploying a stroke prediction model in R. It covers data exploration, summary statistics, and building the prediction models. The final report was completed on Sun Jun 2 20:35:02 2024.
Data Description:
According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.
This data set is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.
##                   vars    n   mean    sd median trimmed   mad   min    max
## gender*              1 5110   1.41  0.49   1.00    1.39  0.00  1.00   3.00
## age                  2 5110  43.23 22.61  45.00   43.61 26.69  0.08  82.00
## hypertension         3 5110   0.10  0.30   0.00    0.00  0.00  0.00   1.00
## heart_disease        4 5110   0.05  0.23   0.00    0.00  0.00  0.00   1.00
## ever_married*        5 5110   1.66  0.48   2.00    1.70  0.00  1.00   2.00
## work_type*           6 5110   3.50  1.28   4.00    3.62  0.00  1.00   5.00
## Residence_type*      7 5110   1.51  0.50   2.00    1.51  0.00  1.00   2.00
## avg_glucose_level    8 5110 106.15 45.28  91.88   97.85 26.06 55.12 271.74
## bmi                  9 4909  28.89  7.85  28.10   28.34  6.97 10.30  97.60
## smoking_status*     10 5110   2.59  1.09   2.00    2.61  1.48  1.00   4.00
## stroke              11 5110   0.05  0.22   0.00    0.00  0.00  0.00   1.00
##                    range  skew kurtosis   se
## gender*             2.00  0.35    -1.86 0.01
## age                81.92 -0.14    -0.99 0.32
## hypertension        1.00  2.71     5.37 0.00
## heart_disease       1.00  3.94    13.57 0.00
## ever_married*       1.00 -0.66    -1.57 0.01
## work_type*          4.00 -0.91    -0.49 0.02
## Residence_type*     1.00 -0.03    -2.00 0.01
## avg_glucose_level 216.62  1.57     1.68 0.63
## bmi                87.30  1.05     3.36 0.11
## smoking_status*     3.00  0.08    -1.35 0.02
## stroke              1.00  4.19    15.57 0.00
## [1] "Are there missing values in the dataset? TRUE"
## [1] "How many? 201"
## [1] "What Proportion: 0.0035758761786159"
##            gender               age      hypertension     heart_disease 
##                 0                 0                 0                 0 
##      ever_married         work_type    Residence_type avg_glucose_level 
##                 0                 0                 0                 0 
##               bmi    smoking_status            stroke 
##               201                 0                 0 
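The missing-value summary above can be reproduced with a few base R calls. The sketch below uses a small hypothetical data frame (`toy`, not the stroke data) and is not necessarily how the report computed its figures. It also checks the arithmetic behind the reported proportion, which appears to be missing cells divided by all cells (rows times columns) rather than by rows.

```r
# Hypothetical toy data standing in for the stroke data set.
toy <- data.frame(
  age    = c(67, 61, 80, 49),
  bmi    = c(36.6, NA, 32.5, NA),
  stroke = c(1, 1, 1, 0)
)

any_missing  <- anyNA(toy)           # TRUE if at least one cell is NA
n_missing    <- sum(is.na(toy))      # total number of NA cells
prop_missing <- mean(is.na(toy))     # NA cells divided by all cells
per_column   <- colSums(is.na(toy))  # NA count for each column

# Arithmetic check on the reported figures: 201 missing cells
# in a table of 5110 rows and 11 columns.
reported_prop <- 201 / (5110 * 11)

# Share of rows with a missing bmi, a different denominator.
row_share <- 201 / 5110
```

Measured per row instead of per cell, the same 201 missing `bmi` values affect about 3.9% of patients.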
#Data splitting
## <Training/Testing/Total>
## <3832/1278/5110>
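The counts 3832/1278/5110 are consistent with a 75/25 split, although the proportion itself is not stated in the report. As a minimal base R sketch under that assumption (the seed and the simple random sampling are illustrative choices, and no stratification is applied here):

```r
set.seed(123)  # arbitrary seed, assumed only for reproducibility

n <- 5110                               # total rows, as reported
n_train <- floor(0.75 * n)              # assumed 75% training share

train_idx <- sample(seq_len(n), size = n_train)  # rows for training
test_idx  <- setdiff(seq_len(n), train_idx)      # remaining rows for testing

split_sizes <- c(training = length(train_idx),
                 testing  = length(test_idx),
                 total    = n)
```

With a rare outcome such as stroke, a stratified split is often preferred so that both partitions keep a similar share of positive cases.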
## [1] "Balance Training dataset"
## < table of extent 0 >
## # A tibble: 3,832 × 11
## gender age hypertension heart_disease ever_married work_type
## <fct> <dbl> <dbl> <dbl> <fct> <fct>
## 1 Female -0.0904 -0.324 -0.240 Yes Private
## 2 Male -1.81 -0.324 -0.240 No children
## 3 Male -0.575 -0.324 -0.240 Yes Private
## 4 Female 1.01 -0.324 -0.240 Yes Private
## 5 Female -0.355 -0.324 -0.240 Yes Self-employed
## 6 Female 0.262 -0.324 -0.240 Yes Private
## 7 Male 0.703 -0.324 -0.240 Yes Govt_job
## 8 Male 0.130 3.08 -0.240 Yes Private
## 9 Female -1.02 -0.324 -0.240 No Private
## 10 Male -0.267 -0.324 -0.240 Yes Private
## # ℹ 3,822 more rows
## # ℹ 5 more variables: Residence_type <fct>, avg_glucose_level <dbl>, bmi <dbl>,
## # smoking_status <fct>, stroke <fct>
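The numeric columns in the tibble above have been centered and scaled, which is why a 0/1 indicator such as `hypertension` shows only two distinct values (-0.324 and 3.08). The following sketch illustrates the z-score transformation on a hypothetical indicator vector. It uses base R rather than the recipe step, so it shows the idea, not the report's code.

```r
# Hypothetical 0/1 indicator: one positive case out of ten.
x <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0)

# z-score: subtract the mean, divide by the sample standard deviation.
z <- (x - mean(x)) / sd(x)

# Base R's scale() performs the same centering and scaling.
z_scale <- as.numeric(scale(x))

# A binary column still has exactly two distinct values afterwards.
distinct_values <- sort(unique(round(z, 3)))
```

In a train/test workflow, the mean and standard deviation are normally estimated on the training data only and then applied unchanged to new data.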
## # A tibble: 20 × 7
## mtry .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 accuracy binary 0.950 9 0.00292 Preprocessor1_Model01
## 2 5 roc_auc binary 0.825 9 0.0170 Preprocessor1_Model01
## 3 9 accuracy binary 0.950 9 0.00339 Preprocessor1_Model02
## 4 9 roc_auc binary 0.822 9 0.0158 Preprocessor1_Model02
## 5 2 accuracy binary 0.951 9 0.00310 Preprocessor1_Model03
## 6 2 roc_auc binary 0.833 9 0.0176 Preprocessor1_Model03
## 7 10 accuracy binary 0.950 9 0.00339 Preprocessor1_Model04
## 8 10 roc_auc binary 0.818 9 0.0173 Preprocessor1_Model04
## 9 7 accuracy binary 0.950 9 0.00296 Preprocessor1_Model05
## 10 7 roc_auc binary 0.821 9 0.0173 Preprocessor1_Model05
## 11 3 accuracy binary 0.950 9 0.00304 Preprocessor1_Model06
## 12 3 roc_auc binary 0.829 9 0.0178 Preprocessor1_Model06
## 13 1 accuracy binary 0.951 9 0.00310 Preprocessor1_Model07
## 14 1 roc_auc binary 0.834 9 0.0184 Preprocessor1_Model07
## 15 6 accuracy binary 0.950 9 0.00301 Preprocessor1_Model08
## 16 6 roc_auc binary 0.823 9 0.0189 Preprocessor1_Model08
## 17 4 accuracy binary 0.951 9 0.00273 Preprocessor1_Model09
## 18 4 roc_auc binary 0.827 9 0.0182 Preprocessor1_Model09
## 19 8 accuracy binary 0.950 9 0.00346 Preprocessor1_Model10
## 20 8 roc_auc binary 0.820 9 0.0171 Preprocessor1_Model10
## [1] "------Rank result accounding to Wokflow set 1"
## # A tibble: 1 × 2
## mtry .config
## <int> <chr>
## 1 2 Preprocessor1_Model03
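To see how the selected `mtry = 2` relates to the tuning table, the resampled means can be copied into a plain data frame and ranked. This is a sketch using the rounded values printed above; which metric the report actually ranked on is not shown, so the comparison is only indicative.

```r
# Mean resampled metrics per mtry, copied from the tuning table above.
tuning <- data.frame(
  mtry     = c(5, 9, 2, 10, 7, 3, 1, 6, 4, 8),
  accuracy = c(0.950, 0.950, 0.951, 0.950, 0.950,
               0.950, 0.951, 0.950, 0.951, 0.950),
  roc_auc  = c(0.825, 0.822, 0.833, 0.818, 0.821,
               0.829, 0.834, 0.823, 0.827, 0.820)
)

# Candidate with the highest mean ROC AUC.
best_by_auc <- tuning[which.max(tuning$roc_auc), ]

# Candidates tied for the highest accuracy at three decimals.
top_accuracy <- tuning[tuning$accuracy == max(tuning$accuracy), "mtry"]
```

On these rounded values, `mtry = 1` has a marginally higher ROC AUC, while `mtry = 2` is among the candidates tied on accuracy. The differences are well within the reported standard errors, so the choice between small `mtry` values matters little here.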
## # Resampling results
## # Manual resampling
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [3832/1278]> train/test split <tibble> <tibble> <tibble> <workflow>
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.949 Preprocessor1_Model1
## 2 roc_auc binary 0.823 Preprocessor1_Model1
## 3 brier_class binary 0.0451 Preprocessor1_Model1
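For context on the Brier score of 0.0451, it helps to compare it with a naive predictor that assigns every patient the same probability. The sketch below assumes the binary Brier score is the mean squared difference between the predicted probability and the 0/1 outcome, and uses the 65 positive cases among 1,278 test rows reported in this document's confusion matrix. The helper `brier()` is a hypothetical name introduced for illustration.

```r
# Hypothetical helper: mean squared error between probabilities and outcomes.
brier <- function(p, y) mean((p - y)^2)

# Test-set outcomes reconstructed from the reported counts:
# 65 positive cases and 1213 negative cases.
y <- c(rep(1, 65), rep(0, 1213))

# Naive baseline: predict the observed prevalence for every row.
prevalence <- mean(y)
p_constant <- rep(prevalence, length(y))

brier_baseline <- brier(p_constant, y)  # equals prevalence * (1 - prevalence)
```

The baseline comes out at roughly 0.048, so the reported 0.0451 is only a modest improvement over predicting the prevalence for everyone.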
## # A tibble: 1,278 × 7
## .pred_class .pred_heart_disease .pred_Normal id .row stroke .config
## <fct> <dbl> <dbl> <chr> <int> <fct> <chr>
## 1 Normal 0.0423 0.958 train/test… 1 heart… Prepro…
## 2 Normal 0.262 0.738 train/test… 3 heart… Prepro…
## 3 Normal 0.0755 0.925 train/test… 11 heart… Prepro…
## 4 Normal 0.200 0.800 train/test… 12 heart… Prepro…
## 5 Normal 0.229 0.771 train/test… 13 heart… Prepro…
## 6 Normal 0.199 0.801 train/test… 14 heart… Prepro…
## 7 Normal 0.0768 0.923 train/test… 15 heart… Prepro…
## 8 Normal 0.0341 0.966 train/test… 21 heart… Prepro…
## 9 Normal 0.162 0.838 train/test… 33 heart… Prepro…
## 10 Normal 0.130 0.870 train/test… 34 heart… Prepro…
## # ℹ 1,268 more rows
##                Truth
## Prediction      heart_disease Normal
##   heart_disease             0      0
##   Normal                   65   1213
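The confusion matrix explains why accuracy is high while the model is of limited practical use: every test case is predicted as `Normal`. A short base R sketch, using only the four counts printed above, makes this explicit. The positive class is the level labelled `heart_disease` in the output, which here marks the stroke cases.

```r
# Confusion matrix from the output above (rows = prediction, columns = truth).
cm <- matrix(
  c(0, 65,     # truth = heart_disease: predicted positive, predicted Normal
    0, 1213),  # truth = Normal:        predicted positive, predicted Normal
  nrow = 2,
  dimnames = list(
    Prediction = c("heart_disease", "Normal"),
    Truth      = c("heart_disease", "Normal")
  )
)

accuracy    <- sum(diag(cm)) / sum(cm)                          # correct / all
sensitivity <- cm["heart_disease", "heart_disease"] /
  sum(cm[, "heart_disease"])                                    # true positive rate
specificity <- cm["Normal", "Normal"] / sum(cm[, "Normal"])     # true negative rate
```

Accuracy matches the reported 0.949, but sensitivity is 0: none of the 65 positive cases is detected at the default 0.5 threshold. With about 5% positives, class imbalance handling or a lower decision threshold would be worth considering.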
## [[1]]
## # A tibble: 1,278 × 6
## .pred_heart_disease .pred_Normal .row .pred_class stroke .config
## <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 0.0423 0.958 1 Normal heart_disease Preprocesso…
## 2 0.262 0.738 3 Normal heart_disease Preprocesso…
## 3 0.0755 0.925 11 Normal heart_disease Preprocesso…
## 4 0.200 0.800 12 Normal heart_disease Preprocesso…
## 5 0.229 0.771 13 Normal heart_disease Preprocesso…
## 6 0.199 0.801 14 Normal heart_disease Preprocesso…
## 7 0.0768 0.923 15 Normal heart_disease Preprocesso…
## 8 0.0341 0.966 21 Normal heart_disease Preprocesso…
## 9 0.162 0.838 33 Normal heart_disease Preprocesso…
## 10 0.130 0.870 34 Normal heart_disease Preprocesso…
## # ℹ 1,268 more rows
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## • step_normalize()
## • step_zv()
## • step_corr()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L, x), importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 5110
## Number of independent variables: 10
## Mtry: 2
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.04332416
## # A tibble: 1 × 10
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 Female 47 0 0 Yes Private Urban
## # ℹ 3 more variables: avg_glucose_level <dbl>, bmi <dbl>, smoking_status <chr>
## # A tibble: 1 × 1
## .pred_class
## <fct>
## 1 Normal
#Variable Importance
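# pull_workflow_fit() is deprecated in workflows; extract_fit_engine() returns the underlying ranger object
ranger_obj <- extract_fit_engine(final_model)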
ranger_obj
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L, x), importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 5110
## Number of independent variables: 10
## Mtry: 2
## Target node size: 10
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.04332416
#Display variable importance, sorted in ascending order
sort(ranger_obj$variable.importance)
##            gender    Residence_type      ever_married      hypertension 
##          5.009274          5.363850          5.942271          7.099011 
##     heart_disease         work_type    smoking_status               bmi 
##          7.877685          9.870189         11.611983         43.775349 
## avg_glucose_level               age 
##         56.908398         59.367519 