About Data Analysis Report

This R Markdown report documents the data analysis for a project on building and deploying a stroke prediction model in R. It covers data exploration, summary statistics, and the building of the prediction models. The final report was completed on Sun Jun 2 20:35:02 2024.

Data Description:

According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.

This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row provides the relevant information for one patient.

Task One: Import data and data preprocessing

Load data and install packages
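
The loading code itself is not echoed in this report. As a minimal sketch of what this step might look like in base R, the helper below reads a CSV and coerces text columns to factors. The function name `load_stroke_data`, the file name in the usage comment, the `"N/A"` marker for missing values, and the presence of an `id` column in the raw file are all assumptions, not details taken from the report.

```r
# Hypothetical helper: read the stroke CSV and coerce column types.
load_stroke_data <- function(path) {
  # Assumption: missing values (e.g. in `bmi`) are stored as the text "N/A".
  raw <- read.csv(path, na.strings = c("N/A", ""), stringsAsFactors = TRUE)
  # Assumption: the raw file has a patient `id` column that is not a predictor.
  # Assigning NULL is harmless if the column does not exist.
  raw$id <- NULL
  raw
}

# Usage (placeholder file name, not taken from the report):
# stroke_data <- load_stroke_data("stroke-data.csv")
# str(stroke_data)
```

Reading `"N/A"` as `NA` at import time keeps `bmi` numeric, which is what the summary statistics in this report require.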

Describe and explore the data

##                   vars    n   mean    sd median trimmed   mad   min    max
## gender*              1 5110   1.41  0.49   1.00    1.39  0.00  1.00   3.00
## age                  2 5110  43.23 22.61  45.00   43.61 26.69  0.08  82.00
## hypertension         3 5110   0.10  0.30   0.00    0.00  0.00  0.00   1.00
## heart_disease        4 5110   0.05  0.23   0.00    0.00  0.00  0.00   1.00
## ever_married*        5 5110   1.66  0.48   2.00    1.70  0.00  1.00   2.00
## work_type*           6 5110   3.50  1.28   4.00    3.62  0.00  1.00   5.00
## Residence_type*      7 5110   1.51  0.50   2.00    1.51  0.00  1.00   2.00
## avg_glucose_level    8 5110 106.15 45.28  91.88   97.85 26.06 55.12 271.74
## bmi                  9 4909  28.89  7.85  28.10   28.34  6.97 10.30  97.60
## smoking_status*     10 5110   2.59  1.09   2.00    2.61  1.48  1.00   4.00
## stroke              11 5110   0.05  0.22   0.00    0.00  0.00  0.00   1.00
##                    range  skew kurtosis   se
## gender*             2.00  0.35    -1.86 0.01
## age                81.92 -0.14    -0.99 0.32
## hypertension        1.00  2.71     5.37 0.00
## heart_disease       1.00  3.94    13.57 0.00
## ever_married*       1.00 -0.66    -1.57 0.01
## work_type*          4.00 -0.91    -0.49 0.02
## Residence_type*     1.00 -0.03    -2.00 0.01
## avg_glucose_level 216.62  1.57     1.68 0.63
## bmi                87.30  1.05     3.36 0.11
## smoking_status*     3.00  0.08    -1.35 0.02
## stroke              1.00  4.19    15.57 0.00

Missing Value Computation and Data Preprocessing

## [1] "Are there missing values in the dataset? TRUE"
## [1] "How many? 201"
## [1] "What Proportion: 0.0035758761786159"
##            gender               age      hypertension     heart_disease 
##                 0                 0                 0                 0 
##      ever_married         work_type    Residence_type avg_glucose_level 
##                 0                 0                 0                 0 
##               bmi    smoking_status            stroke 
##               201                 0                 0
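
The figures above can be reproduced with a few base R calls. The sketch below uses a toy data frame (hypothetical values) with the same shape as the stroke data: 5110 rows, 11 columns, and 201 missing values confined to `bmi`. It also shows that the reported proportion is taken over all 5110 × 11 = 56,210 cells; as a share of rows, the missing `bmi` values amount to 201 / 5110, or about 3.9%.

```r
# Toy stand-in with the same dimensions as the stroke data (values are placeholders).
cols <- c("gender", "age", "hypertension", "heart_disease", "ever_married",
          "work_type", "Residence_type", "avg_glucose_level", "bmi",
          "smoking_status", "stroke")
toy <- as.data.frame(matrix(1, nrow = 5110, ncol = length(cols)))
names(toy) <- cols
toy$bmi[seq_len(201)] <- NA          # 201 missing BMI values, as in the report

any_missing <- anyNA(toy)            # are there missing values at all?
n_missing   <- sum(is.na(toy))       # how many cells are missing?
prop_cells  <- mean(is.na(toy))      # share of all cells: 201 / (5110 * 11)
prop_rows   <- mean(!complete.cases(toy))  # share of rows with any NA: 201 / 5110
per_column  <- colSums(is.na(toy))   # missing count per column

print(c(cells = prop_cells, rows = prop_rows))
```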

Data splitting

## <Training/Testing/Total>
## <3832/1278/5110>
## [1] "Balance Training dataset"
## < table of extent 0 >
## # A tibble: 3,832 × 11
##    gender     age hypertension heart_disease ever_married work_type    
##    <fct>    <dbl>        <dbl>         <dbl> <fct>        <fct>        
##  1 Female -0.0904       -0.324        -0.240 Yes          Private      
##  2 Male   -1.81         -0.324        -0.240 No           children     
##  3 Male   -0.575        -0.324        -0.240 Yes          Private      
##  4 Female  1.01         -0.324        -0.240 Yes          Private      
##  5 Female -0.355        -0.324        -0.240 Yes          Self-employed
##  6 Female  0.262        -0.324        -0.240 Yes          Private      
##  7 Male    0.703        -0.324        -0.240 Yes          Govt_job     
##  8 Male    0.130         3.08         -0.240 Yes          Private      
##  9 Female -1.02         -0.324        -0.240 No           Private      
## 10 Male   -0.267        -0.324        -0.240 Yes          Private      
## # ℹ 3,822 more rows
## # ℹ 5 more variables: Residence_type <fct>, avg_glucose_level <dbl>, bmi <dbl>,
## #   smoking_status <fct>, stroke <fct>
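
The `<Training/Testing/Total>` line reports 3832 training rows and 1278 test rows out of 5110, which corresponds to a 75/25 split. The splitting code is not shown in the report, so the following is only a base R sketch that reproduces the same counts. The seed is hypothetical, and any stratification the original code may have applied is ignored here.

```r
set.seed(2024)                       # hypothetical seed, for reproducibility only
n_total   <- 5110
# Draw 75% of the row indices without replacement for training.
train_idx <- sample(n_total, size = floor(0.75 * n_total))
# Everything not drawn for training goes to the test set.
test_idx  <- setdiff(seq_len(n_total), train_idx)

split_sizes <- c(training = length(train_idx),
                 testing  = length(test_idx),
                 total    = n_total)
print(split_sizes)
```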

Task Two: Build prediction models

## # A tibble: 20 × 7
##     mtry .metric  .estimator  mean     n std_err .config              
##    <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
##  1     5 accuracy binary     0.950     9 0.00292 Preprocessor1_Model01
##  2     5 roc_auc  binary     0.825     9 0.0170  Preprocessor1_Model01
##  3     9 accuracy binary     0.950     9 0.00339 Preprocessor1_Model02
##  4     9 roc_auc  binary     0.822     9 0.0158  Preprocessor1_Model02
##  5     2 accuracy binary     0.951     9 0.00310 Preprocessor1_Model03
##  6     2 roc_auc  binary     0.833     9 0.0176  Preprocessor1_Model03
##  7    10 accuracy binary     0.950     9 0.00339 Preprocessor1_Model04
##  8    10 roc_auc  binary     0.818     9 0.0173  Preprocessor1_Model04
##  9     7 accuracy binary     0.950     9 0.00296 Preprocessor1_Model05
## 10     7 roc_auc  binary     0.821     9 0.0173  Preprocessor1_Model05
## 11     3 accuracy binary     0.950     9 0.00304 Preprocessor1_Model06
## 12     3 roc_auc  binary     0.829     9 0.0178  Preprocessor1_Model06
## 13     1 accuracy binary     0.951     9 0.00310 Preprocessor1_Model07
## 14     1 roc_auc  binary     0.834     9 0.0184  Preprocessor1_Model07
## 15     6 accuracy binary     0.950     9 0.00301 Preprocessor1_Model08
## 16     6 roc_auc  binary     0.823     9 0.0189  Preprocessor1_Model08
## 17     4 accuracy binary     0.951     9 0.00273 Preprocessor1_Model09
## 18     4 roc_auc  binary     0.827     9 0.0182  Preprocessor1_Model09
## 19     8 accuracy binary     0.950     9 0.00346 Preprocessor1_Model10
## 20     8 roc_auc  binary     0.820     9 0.0171  Preprocessor1_Model10
## [1] "------Rank results according to workflow set 1"
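
To read the tuning table more easily, the candidates can be ranked by a single metric. The sketch below re-enters the printed (rounded) ROC AUC means and standard errors by hand and sorts them. It is an illustration of the ranking step only, not the code used to produce the report.

```r
# ROC AUC means and standard errors copied from the tuning table above.
tuning <- data.frame(
  mtry    = c(5, 9, 2, 10, 7, 3, 1, 6, 4, 8),
  roc_auc = c(0.825, 0.822, 0.833, 0.818, 0.821, 0.829, 0.834, 0.823, 0.827, 0.820),
  std_err = c(0.0170, 0.0158, 0.0176, 0.0173, 0.0173, 0.0178, 0.0184, 0.0189, 0.0182, 0.0171)
)

# Sort candidates from highest to lowest mean ROC AUC.
ranked <- tuning[order(-tuning$roc_auc), ]
print(head(ranked, 3))

# Gap between the two best candidates, to compare against their standard errors.
gap <- ranked$roc_auc[1] - ranked$roc_auc[2]
```

On these rounded values the two leading candidates (`mtry` = 1 and `mtry` = 2) differ by 0.001, far less than their standard errors of roughly 0.018, so the resampling results do not clearly separate them.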

Task Three: Evaluate and select prediction models

## # A tibble: 1 × 2
##    mtry .config              
##   <int> <chr>                
## 1     2 Preprocessor1_Model03
## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits              id               .metrics .notes   .predictions .workflow 
##   <list>              <chr>            <list>   <list>   <list>       <list>    
## 1 <split [3832/1278]> train/test split <tibble> <tibble> <tibble>     <workflow>
## # A tibble: 3 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary        0.949  Preprocessor1_Model1
## 2 roc_auc     binary        0.823  Preprocessor1_Model1
## 3 brier_class binary        0.0451 Preprocessor1_Model1
## # A tibble: 1,278 × 7
##    .pred_class .pred_heart_disease .pred_Normal id           .row stroke .config
##    <fct>                     <dbl>        <dbl> <chr>       <int> <fct>  <chr>  
##  1 Normal                   0.0423        0.958 train/test…     1 heart… Prepro…
##  2 Normal                   0.262         0.738 train/test…     3 heart… Prepro…
##  3 Normal                   0.0755        0.925 train/test…    11 heart… Prepro…
##  4 Normal                   0.200         0.800 train/test…    12 heart… Prepro…
##  5 Normal                   0.229         0.771 train/test…    13 heart… Prepro…
##  6 Normal                   0.199         0.801 train/test…    14 heart… Prepro…
##  7 Normal                   0.0768        0.923 train/test…    15 heart… Prepro…
##  8 Normal                   0.0341        0.966 train/test…    21 heart… Prepro…
##  9 Normal                   0.162         0.838 train/test…    33 heart… Prepro…
## 10 Normal                   0.130         0.870 train/test…    34 heart… Prepro…
## # ℹ 1,268 more rows
##                Truth
## Prediction      heart_disease Normal
##   heart_disease             0      0
##   Normal                   65   1213

## [[1]]
## # A tibble: 1,278 × 6
##    .pred_heart_disease .pred_Normal  .row .pred_class stroke        .config     
##                  <dbl>        <dbl> <int> <fct>       <fct>         <chr>       
##  1              0.0423        0.958     1 Normal      heart_disease Preprocesso…
##  2              0.262         0.738     3 Normal      heart_disease Preprocesso…
##  3              0.0755        0.925    11 Normal      heart_disease Preprocesso…
##  4              0.200         0.800    12 Normal      heart_disease Preprocesso…
##  5              0.229         0.771    13 Normal      heart_disease Preprocesso…
##  6              0.199         0.801    14 Normal      heart_disease Preprocesso…
##  7              0.0768        0.923    15 Normal      heart_disease Preprocesso…
##  8              0.0341        0.966    21 Normal      heart_disease Preprocesso…
##  9              0.162         0.838    33 Normal      heart_disease Preprocesso…
## 10              0.130         0.870    34 Normal      heart_disease Preprocesso…
## # ℹ 1,268 more rows
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_normalize()
## • step_zv()
## • step_corr()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L,      x), importance = ~"impurity", num.threads = 1, verbose = FALSE,      seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      5110 
## Number of independent variables:  10 
## Mtry:                             2 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.04332416
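
The confusion matrix above can be turned into class-level metrics with a little arithmetic. The sketch below re-enters the four printed counts and computes accuracy, sensitivity, and specificity. It assumes that the level printed as `heart_disease` is the positive (stroke) class of the `stroke` outcome; the report does not state this explicitly.

```r
# Counts copied from the confusion matrix (rows = Prediction, columns = Truth).
cm <- matrix(
  c(0, 65,      # Truth = heart_disease: predicted heart_disease, predicted Normal
    0, 1213),   # Truth = Normal:        predicted heart_disease, predicted Normal
  nrow = 2,
  dimnames = list(Prediction = c("heart_disease", "Normal"),
                  Truth      = c("heart_disease", "Normal"))
)

accuracy    <- sum(diag(cm)) / sum(cm)                                   # 1213 / 1278
# Assumption: "heart_disease" is the positive class.
sensitivity <- cm["heart_disease", "heart_disease"] / sum(cm[, "heart_disease"])
specificity <- cm["Normal", "Normal"] / sum(cm[, "Normal"])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```

The accuracy of about 0.949 matches the test-set metric reported above, but it comes entirely from the majority class: all 1278 test cases are predicted as `Normal`, so none of the 65 positive cases is detected at the default threshold.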

Task Four: Deploy the prediction model

## # A tibble: 1 × 10
##   gender   age hypertension heart_disease ever_married work_type Residence_type
##   <chr>  <dbl>        <dbl>         <dbl> <chr>        <chr>     <chr>         
## 1 Female    47            0             0 Yes          Private   Urban         
## # ℹ 3 more variables: avg_glucose_level <dbl>, bmi <dbl>, smoking_status <chr>
## # A tibble: 1 × 1
##   .pred_class
##   <fct>      
## 1 Normal
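
The prediction above is made for a single new patient described by the ten predictor columns. As a rough sketch of how such an input could be assembled and checked before calling the model, the code below builds a one-row data frame from the values visible in the printed tibble. The three columns that the printout truncates (`avg_glucose_level`, `bmi`, `smoking_status`) are left as `NA` here rather than guessed, and `expected_cols` is a hypothetical helper vector.

```r
# One-row input using the values visible in the output above.
new_patient <- data.frame(
  gender            = "Female",
  age               = 47,
  hypertension      = 0,
  heart_disease     = 0,
  ever_married      = "Yes",
  work_type         = "Private",
  Residence_type    = "Urban",
  avg_glucose_level = NA_real_,       # not shown in the printed output
  bmi               = NA_real_,       # not shown in the printed output
  smoking_status    = NA_character_,  # not shown in the printed output
  stringsAsFactors  = FALSE
)

# Hypothetical check: the input must carry exactly the ten predictor columns.
expected_cols <- c("gender", "age", "hypertension", "heart_disease",
                   "ever_married", "work_type", "Residence_type",
                   "avg_glucose_level", "bmi", "smoking_status")
stopifnot(identical(names(new_patient), expected_cols))

# With the fitted workflow from this report available, and the NA placeholders
# replaced by the patient's actual values, the prediction call would be:
# predict(final_model, new_data = new_patient)
```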

Task Five: Findings and Conclusions

# Variable importance
# extract_fit_engine() replaces the deprecated pull_workflow_fit(...)$fit
ranger_obj <- extract_fit_engine(final_model)
ranger_obj
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L,      x), importance = ~"impurity", num.threads = 1, verbose = FALSE,      seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      5110 
## Number of independent variables:  10 
## Mtry:                             2 
## Target node size:                 10 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.04332416
# Display variable importance

sort(ranger_obj$variable.importance)
##            gender    Residence_type      ever_married      hypertension 
##          5.009274          5.363850          5.942271          7.099011 
##     heart_disease         work_type    smoking_status               bmi 
##          7.877685          9.870189         11.611983         43.775349 
## avg_glucose_level               age 
##         56.908398         59.367519
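
Raw impurity importances are easier to compare as shares of the total. The sketch below re-enters the printed values and converts them to relative importance. It is a post-processing illustration on the numbers shown above, not part of the original analysis code.

```r
# Impurity-based importances copied from the output above.
importance <- c(
  gender = 5.009274, Residence_type = 5.363850, ever_married = 5.942271,
  hypertension = 7.099011, heart_disease = 7.877685, work_type = 9.870189,
  smoking_status = 11.611983, bmi = 43.775349,
  avg_glucose_level = 56.908398, age = 59.367519
)

# Share of total importance per variable, largest first.
relative <- sort(importance / sum(importance), decreasing = TRUE)
print(round(100 * relative, 1))

# Combined share of the three most important variables.
top3_share <- sum(relative[1:3])
```

On these values, `age`, `avg_glucose_level`, and `bmi` together account for roughly three quarters of the total impurity importance, with the remaining seven predictors sharing the rest.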