A stroke, also known as a brain attack, occurs when a blood vessel in the brain bursts or when the blood supply to part of the brain is blocked. According to the World Health Organization, stroke is a leading cause of death and disability globally. Every year more than 795,000 people in the United States have a stroke, and around 610,000 of these are first strokes. People who survive a stroke can experience different levels of disability, such as loss of vision or speech, paralysis, and confusion, and a stroke also places a financial and psychological burden on the patient's family.
Risk factors for stroke include high blood pressure, high cholesterol, heart disease, diabetes, obesity, cigarette smoking, alcohol consumption, an unbalanced diet, age, and gender. Some of these factors, such as high blood pressure, smoking, and diet, are modifiable and can be controlled to reduce the risk of stroke. As prevention is better than cure, it is important to identify patients at high risk of stroke and take preventive measures accordingly.
With the high availability of medical data today, this can be done with machine learning models, which can predict whether a person will have a stroke based on their lifestyle habits and physiological measurements.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.7 v forcats 0.5.1
## v tidyr 1.2.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
filePath = "C:/Users/Teng/Desktop/R/healthcare-dataset-stroke-data.csv"
healthcare_dataset_stroke_data <- read_csv(filePath)
## Rows: 5110 Columns: 12
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): gender, ever_married, work_type, Residence_type, bmi, smoking_status
## dbl (6): id, age, hypertension, heart_disease, avg_glucose_level, stroke
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(healthcare_dataset_stroke_data)
## Rows: 5,110
## Columns: 12
## $ id <dbl> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434~
## $ gender <chr> "Male", "Female", "Male", "Female", "Female", "Male"~
## $ age <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, ~
## $ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1~
## $ heart_disease <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0~
## $ ever_married <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No~
## $ work_type <chr> "Private", "Self-employed", "Private", "Private", "S~
## $ Residence_type <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"~
## $ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0~
## $ bmi <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "~
## $ smoking_status <chr> "formerly smoked", "never smoked", "never smoked", "~
## $ stroke <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
dim(healthcare_dataset_stroke_data)
## [1] 5110 12
head(healthcare_dataset_stroke_data)
## # A tibble: 6 x 12
## id gender age hypertension heart_disease ever_married work_type
## <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## # ... with 5 more variables: Residence_type <chr>, avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
Remove the id column, since it is only an identifier with no predictive value.
new_healthcare_df <- subset(healthcare_dataset_stroke_data, select = -c(id))
head(new_healthcare_df)
## # A tibble: 6 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 Male 67 0 1 Yes Private Urban
## 2 Female 61 0 0 Yes Self-empl~ Rural
## 3 Male 80 0 1 Yes Private Rural
## 4 Female 49 0 0 Yes Private Urban
## 5 Female 79 1 0 Yes Self-empl~ Rural
## 6 Male 81 0 0 Yes Private Urban
## # ... with 4 more variables: avg_glucose_level <dbl>, bmi <chr>,
## # smoking_status <chr>, stroke <dbl>
# Check whether any cell contains an empty string
new_healthcare_df[new_healthcare_df == ""]
## <unspecified> [0]
# Count the "N/A" strings in each column and convert them to real NA values
cname <- names(new_healthcare_df)
for (i in cname){
  print(paste(i, sum(new_healthcare_df[i] == "N/A")))
  new_healthcare_df[!is.na(new_healthcare_df[i]) & new_healthcare_df[i] == "N/A", i] <- NA
}
## [1] "gender 0"
## [1] "age 0"
## [1] "hypertension 0"
## [1] "heart_disease 0"
## [1] "ever_married 0"
## [1] "work_type 0"
## [1] "Residence_type 0"
## [1] "avg_glucose_level 0"
## [1] "bmi 201"
## [1] "smoking_status 0"
## [1] "stroke 0"
Thus, we observe that the bmi column has 201 "N/A" values.
bmi_sorted = new_healthcare_df[which(is.na(new_healthcare_df$bmi)),]
move_bmi = bmi_sorted %>% dplyr::select("bmi", everything())
move_bmi
## # A tibble: 201 x 11
## bmi gender age hypertension heart_disease ever_married work_type
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 <NA> Female 61 0 0 Yes Self-employed
## 2 <NA> Female 59 0 0 Yes Private
## 3 <NA> Male 78 0 1 Yes Private
## 4 <NA> Male 57 0 1 No Govt_job
## 5 <NA> Male 58 0 0 Yes Private
## 6 <NA> Male 59 0 0 Yes Private
## 7 <NA> Female 63 0 0 Yes Private
## 8 <NA> Female 75 0 1 No Self-employed
## 9 <NA> Female 76 0 0 No Private
## 10 <NA> Male 78 1 0 Yes Private
## # ... with 191 more rows, and 4 more variables: Residence_type <chr>,
## # avg_glucose_level <dbl>, smoking_status <chr>, stroke <dbl>
Users on Kaggle have noted that the bmi attribute is not a very clear indicator of stroke. Hence, we remove the 201 rows with missing bmi values, which is only a small portion of the data.
clean_data = na.omit(new_healthcare_df)
clean_data
## # A tibble: 4,909 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 Male 67 0 1 Yes Private Urban
## 2 Male 80 0 1 Yes Private Rural
## 3 Female 49 0 0 Yes Private Urban
## 4 Female 79 1 0 Yes Self-emp~ Rural
## 5 Male 81 0 0 Yes Private Urban
## 6 Male 74 1 1 Yes Private Rural
## 7 Female 69 0 0 No Private Urban
## 8 Female 78 0 0 Yes Private Urban
## 9 Female 81 1 0 Yes Private Rural
## 10 Female 61 0 1 Yes Govt_job Rural
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
The gender, ever_married, and Residence_type columns will be transformed into binary data for the modelling step. We check the gender, Residence_type, ever_married, and work_type columns to see whether any of them have more than two distinct values.
unique_gender = unique(clean_data$gender)
unique_ever_married = unique(clean_data$ever_married)
unique_residence_type = unique(clean_data$Residence_type)
unique_work_type = unique(clean_data$work_type)
unique_gender
## [1] "Male" "Female" "Other"
unique_ever_married
## [1] "Yes" "No"
unique_residence_type
## [1] "Urban" "Rural"
unique_work_type
## [1] "Private" "Self-employed" "Govt_job" "children"
## [5] "Never_worked"
We notice that the gender column has three values: "Male", "Female", and "Other". We recode it as Male = 0, Female = 1, Other = 2.
clean_data = clean_data %>%
mutate(gender = recode(
gender,
"Male" = "0",
"Female" = "1",
"Other" = "2"
))
clean_data
## # A tibble: 4,909 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 0 67 0 1 Yes Private Urban
## 2 0 80 0 1 Yes Private Rural
## 3 1 49 0 0 Yes Private Urban
## 4 1 79 1 0 Yes Self-emp~ Rural
## 5 0 81 0 0 Yes Private Urban
## 6 0 74 1 1 Yes Private Rural
## 7 1 69 0 0 No Private Urban
## 8 1 78 0 0 Yes Private Urban
## 9 1 81 1 0 Yes Private Rural
## 10 1 61 0 1 Yes Govt_job Rural
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
The work_type column has five values: "Private", "Self-employed", "Govt_job", "children", and "Never_worked". We recode it as Private = 0, Self-employed = 1, Govt_job = 2, children = 3, Never_worked = 4.
clean_data = clean_data %>%
mutate(work_type = recode(
work_type,
"Private" = "0",
"Self-employed" = "1",
"Govt_job" = "2",
"children" = "3",
"Never_worked" = "4"
))
clean_data
## # A tibble: 4,909 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 0 67 0 1 Yes 0 Urban
## 2 0 80 0 1 Yes 0 Rural
## 3 1 49 0 0 Yes 0 Urban
## 4 1 79 1 0 Yes 1 Rural
## 5 0 81 0 0 Yes 0 Urban
## 6 0 74 1 1 Yes 0 Rural
## 7 1 69 0 0 No 0 Urban
## 8 1 78 0 0 Yes 0 Urban
## 9 1 81 1 0 Yes 0 Rural
## 10 1 61 0 1 Yes 2 Rural
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
Transform the Residence_type and ever_married columns into binary data.
clean_data$Residence_type <- ifelse(clean_data$Residence_type == "Urban", 1, 0)
clean_data$ever_married <- ifelse(clean_data$ever_married == "Yes", 1, 0)
clean_data
## # A tibble: 4,909 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 0 67 0 1 1 0 1
## 2 0 80 0 1 1 0 0
## 3 1 49 0 0 1 0 1
## 4 1 79 1 0 1 1 0
## 5 0 81 0 0 1 0 1
## 6 0 74 1 1 1 0 0
## 7 1 69 0 0 0 0 1
## 8 1 78 0 0 1 0 1
## 9 1 81 1 0 1 0 0
## 10 1 61 0 1 1 2 0
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
To start the data modelling, let us assume that the data is not labelled and see what a clustering algorithm can tell us if we ask for 2 clusters in the dataset.
Ideally, the 2 clusters should be one with stroke and one without stroke. Clustering algorithms normally work best with continuous variables, but this dataset also contains categorical variables, so we use the Gower distance to measure the distance between two data points.
CRAN has detailed documentation about the Gower distance, available at https://cran.r-project.org/web/packages/gower/vignettes/intro.pdf
k-means clustering works by partitioning n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centre or centroid). All data points are treated as vectors, and the distance between data points can be measured in various ways (Euclidean, Manhattan, etc.).
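As a small illustration of the Gower dissimilarity described above (a hypothetical toy data frame, not part of the analysis pipeline), cluster::daisy() handles a mix of numeric and categorical columns by rescaling each variable to [0, 1] and averaging the per-variable dissimilarities; the same daisy() call is applied to the real data below.
library(cluster)
# Hypothetical mixed-type data: one numeric and one categorical column
toy <- data.frame(
  age    = c(30, 70, 31),
  smoker = factor(c("yes", "no", "yes"))
)
# Pairwise Gower dissimilarities between the three rows
daisy(toy, metric = "gower")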
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(compareGroups)
# install.packages('cluster')
# install.packages('factoextra')
# install.packages('compareGroups')
set.seed(666)
clean_data$gender <- as.factor(clean_data$gender)
clean_data$ever_married <- as.factor(clean_data$ever_married)
clean_data$Residence_type <- as.factor(clean_data$Residence_type)
clean_data$smoking_status <- as.factor(clean_data$smoking_status)
clean_data$stroke <- as.factor(clean_data$stroke)
clean_data$heart_disease <- as.factor(clean_data$heart_disease)
clean_data$hypertension <- as.factor(clean_data$hypertension)
clean_data$bmi <- as.numeric(clean_data$bmi)
df = subset(clean_data, select = c(bmi,avg_glucose_level, age,
hypertension, smoking_status,
stroke) )
df <- na.omit(df)
# Gower dissimilarity handles the mix of numeric and factor columns
distMat <- daisy(df, metric = "gower")
# Partition into 2 clusters, treating each row of the dissimilarity matrix as a feature vector
k2 <- kmeans(distMat, centers = 2, nstart = 25)
df$cluster<-k2$cluster
group<-compareGroups(cluster~.,data=df)
clustab<-createTable(group)
clustab
##
## --------Summary descriptives table by 'cluster'---------
##
## ______________________________________________________
## 1 2 p.overall
## N=691 N=4218
## ˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉ
## bmi 32.6 (8.27) 28.3 (7.62) <0.001
## avg_glucose_level 143 (65.3) 99.1 (36.4) <0.001
## age 64.4 (13.5) 39.3 (21.8) <0.001
## hypertension: 0.000
## 0 240 (34.7%) 4218 (100%)
## 1 451 (65.3%) 0 (0.00%)
## smoking_status: <0.001
## formerly smoked 203 (29.4%) 634 (15.0%)
## never smoked 286 (41.4%) 1566 (37.1%)
## smokes 132 (19.1%) 605 (14.3%)
## Unknown 70 (10.1%) 1413 (33.5%)
## stroke: <0.001
## 0 482 (69.8%) 4218 (100%)
## 1 209 (30.2%) 0 (0.00%)
## ˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉˉ
After that, we use a supervised learning method, classification, to predict whether an individual will have a stroke or not.
Random forest is an ensemble learning method that constructs many decision trees at training time. It can be used for classification as well as regression tasks.
For classification, the output is the class selected by the most trees.
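As a minimal sketch of this majority vote (using the built-in iris data rather than the stroke data, and not part of the analysis pipeline), predict() with predict.all = TRUE exposes the votes of the individual trees:
library(randomForest)
set.seed(1)
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 25)
# Votes cast by each of the 25 trees for the first observation
votes <- predict(rf_demo, newdata = iris[1, ], predict.all = TRUE)
table(votes$individual)
# The forest returns the class chosen by the most trees
votes$aggregate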
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ROSE)
## Loaded ROSE 0.0-4
# The stroke classes are highly imbalanced, so oversample the minority class to about 30% with ROSE
oversampled_data <- ovun.sample(as.factor(stroke)~.,data = clean_data, method = 'over',p = 0.3)$data
sample_index <- createDataPartition(oversampled_data$stroke, p = 0.7,
list = FALSE,
times = 1)
data_train <- oversampled_data[sample_index,]
data_test <- oversampled_data[-sample_index,]
rf_model <- randomForest(stroke~.,data = data_train,ntree = 1000,mtry = 5)
## Model performance on training data (out-of-bag predictions, since no newdata is supplied)
pred_train <- predict(rf_model)
result_train <- confusionMatrix(pred_train, data_train$stroke)
(result_train$overall)['Accuracy']
## Accuracy
## 0.9849608
(result_train$byClass)['Sensitivity']
## Sensitivity
## 0.9796353
(result_train$byClass)['Specificity']
## Specificity
## 0.9972048
(result_train$byClass)['F1']
## F1
## 0.9891054
(result_train$byClass)['Recall']
## Recall
## 0.9796353
fourfoldplot(result_train$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Random Forest Classifier) Confusion Matrix - Train")
## Model performance using testing data
pred_test <- predict(rf_model, newdata = data_test)
result_test <- confusionMatrix(pred_test, data_test$stroke)
(result_test$overall)['Accuracy']
## Accuracy
## 0.9841819
(result_test$byClass)['Sensitivity']
## Sensitivity
## 0.9787234
(result_test$byClass)['Specificity']
## Specificity
## 0.9967374
(result_test$byClass)['F1']
## F1
## 0.9885387
(result_test$byClass)['Recall']
## Recall
## 0.9787234
fourfoldplot(result_test$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Random Forest Classifier) Confusion Matrix - Test")
A Support Vector Machine works by constructing a hyperplane (or a set of hyperplanes) that separates the data points of the different classes with the largest possible margin.
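As a rough illustration (using a two-class subset of iris, not the stroke data, and not part of the analysis pipeline), a linear-kernel ksvm() finds such a maximum-margin hyperplane, which is determined by a small number of support vectors:
library(kernlab)
two_class <- droplevels(subset(iris, Species != "virginica"))
svm_demo <- ksvm(Species ~ Petal.Length + Petal.Width,
                 data = two_class, kernel = "vanilladot", C = 1)
svm_demo   # prints the training error and the number of support vectors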
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
library(e1071)
svm_model <- ksvm(stroke~.,data = data_train, kernel="vanilladot")
## Setting default kernel parameters
## Model performance using training data
pred_train <- predict(svm_model)
result_train <- confusionMatrix(pred_train, data_train$stroke)
(result_train$overall)['Accuracy']
## Accuracy
## 0.798136
(result_train$byClass)['Sensitivity']
## Sensitivity
## 0.8604863
(result_train$byClass)['Specificity']
## Specificity
## 0.6547869
(result_train$byClass)['F1']
## F1
## 0.8559335
(result_train$byClass)['Recall']
## Recall
## 0.8604863
fourfoldplot(result_train$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Support Vector Machine Classifier) Confusion Matrix - Train")
## Model performance using testing data
pred_test <- predict(svm_model, newdata = data_test)
result_test <- confusionMatrix(pred_test, data_test$stroke)
(result_test$overall)['Accuracy']
## Accuracy
## 0.7874444
(result_test$byClass)['Sensitivity']
## Sensitivity
## 0.8595745
(result_test$byClass)['Specificity']
## Specificity
## 0.6215334
(result_test$byClass)['F1']
## F1
## 0.8493343
(result_test$byClass)['Recall']
## Recall
## 0.8595745
fourfoldplot(result_test$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Support Vector Machine Classifier) Confusion Matrix - Test")
The Naive Bayes classifier is a probabilistic machine learning algorithm based on Bayes' theorem, with the "naive" assumption that the features are conditionally independent given the class.
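As a minimal sketch of the Bayes' rule behind this classifier (a hypothetical toy data frame, not the stroke data), the posterior returned by naiveBayes() can be reproduced by multiplying the class prior by the per-feature conditional probabilities and normalising:
library(e1071)
toy <- data.frame(
  smoker  = factor(c("yes", "no", "no", "yes", "no", "yes")),
  married = factor(c("yes", "no", "yes", "no", "yes", "yes")),
  stroke  = factor(c(1, 1, 0, 0, 0, 1))
)
nb <- naiveBayes(stroke ~ ., data = toy)
new_obs <- data.frame(smoker = factor("yes", levels = levels(toy$smoker)),
                      married = factor("no", levels = levels(toy$married)))
# Posterior probabilities computed by the fitted model
predict(nb, new_obs, type = "raw")
# The same posterior by hand: prior * product of per-feature conditionals, normalised
prior  <- nb$apriori / sum(nb$apriori)
unnorm <- prior * nb$tables$smoker[, "yes"] * nb$tables$married[, "no"]
unnorm / sum(unnorm)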
nb_model <- naiveBayes(stroke~.,data = data_train)
## Model performance using training data
pred_train <- predict(nb_model, data_train)
result_train <- confusionMatrix(pred_train, data_train$stroke)
(result_train$overall)['Accuracy']
## Accuracy
## 0.7665749
(result_train$byClass)['Sensitivity']
## Sensitivity
## 0.7829787
(result_train$byClass)['Specificity']
## Specificity
## 0.7288609
(result_train$byClass)['F1']
## F1
## 0.8237928
(result_train$byClass)['Recall']
## Recall
## 0.7829787
fourfoldplot(result_train$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Naive Bayes Classifier) Confusion Matrix - Train")
## Model performance using testing data
pred_test <- predict(nb_model, newdata = data_test)
result_test <- confusionMatrix(pred_test, data_test$stroke)
(result_test$overall)['Accuracy']
## Accuracy
## 0.7572912
(result_test$byClass)['Sensitivity']
## Sensitivity
## 0.7815603
(result_test$byClass)['Specificity']
## Specificity
## 0.7014682
(result_test$byClass)['F1']
## F1
## 0.8178108
(result_test$byClass)['Recall']
## Recall
## 0.7815603
fourfoldplot(result_test$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Naive Bayes Classifier) Confusion Matrix - Test")
With this dataset we can also perform regression on the average glucose level.
Linear regression works by fitting a linear equation relating the input variables to the output. Hence, linear regression may not work well on a dataset with a non-linear nature.
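As a small illustration of that limitation (simulated data, not the stroke dataset), fitting a straight line to a clearly non-linear relationship leaves a systematic pattern in the residuals:
set.seed(42)
x <- runif(200, 0, 10)
y <- x^2 + rnorm(200, sd = 5)   # quadratic, i.e. non-linear, relationship
fit <- lm(y ~ x)
# A curved band in this residual plot signals that a straight-line fit is inadequate
plot(fit$fitted.values, resid(fit), xlab = "Fitted values", ylab = "Residuals")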
# Linear regression with all variables
linear_regression_str <- lm(avg_glucose_level~.,data = data_train)
summary(linear_regression_str)
##
## Call:
## lm(formula = avg_glucose_level ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.72 -31.73 -10.31 25.23 167.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.78289 4.46810 10.694 < 2e-16 ***
## gender1 -8.62641 1.42474 -6.055 1.52e-09 ***
## gender2 55.74107 47.42137 1.175 0.239877
## age 0.42468 0.05193 8.178 3.69e-16 ***
## hypertension1 13.50414 2.07378 6.512 8.20e-11 ***
## heart_disease1 28.44731 2.60689 10.912 < 2e-16 ***
## ever_married1 5.22690 2.04181 2.560 0.010500 *
## work_type1 -6.69649 1.94008 -3.452 0.000562 ***
## work_type2 -3.27198 2.09254 -1.564 0.117969
## work_type3 22.14089 3.37243 6.565 5.76e-11 ***
## work_type4 11.64373 14.42166 0.807 0.419490
## Residence_type1 -0.51483 1.38203 -0.373 0.709525
## bmi 1.28412 0.10201 12.589 < 2e-16 ***
## smoking_statusnever smoked 0.82478 1.94800 0.423 0.672022
## smoking_statussmokes -0.12377 2.38445 -0.052 0.958605
## smoking_statusUnknown -2.71304 2.32437 -1.167 0.243184
## stroke1 14.13256 1.81960 7.767 9.81e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.34 on 4704 degrees of freedom
## Multiple R-squared: 0.1953, Adjusted R-squared: 0.1926
## F-statistic: 71.36 on 16 and 4704 DF, p-value: < 2.2e-16
# Linear regression keeping only the predictors with significant p-values
linear_regression_model <- lm(avg_glucose_level~gender+age+hypertension+heart_disease+bmi+stroke,data = data_train)
summary(linear_regression_model)
##
## Call:
## lm(formula = avg_glucose_level ~ gender + age + hypertension +
## heart_disease + bmi + stroke, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -133.60 -32.26 -10.98 25.06 167.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.19731 3.03674 20.811 < 2e-16 ***
## gender1 -9.11825 1.41950 -6.424 1.46e-10 ***
## gender2 47.85114 47.61372 1.005 0.315
## age 0.27898 0.03671 7.600 3.56e-14 ***
## hypertension1 13.90457 2.05859 6.754 1.61e-11 ***
## heart_disease1 29.77206 2.58662 11.510 < 2e-16 ***
## bmi 1.11732 0.09531 11.723 < 2e-16 ***
## stroke1 15.98919 1.79695 8.898 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.59 on 4713 degrees of freedom
## Multiple R-squared: 0.1852, Adjusted R-squared: 0.184
## F-statistic: 153 on 7 and 4713 DF, p-value: < 2.2e-16
pred_test <- predict(linear_regression_model,data_test)
actuals_preds <- data.frame(cbind(actuals=data_test$avg_glucose_level, predicteds=pred_test))
head(actuals_preds)
## actuals predicteds
## 2 87.96 137.08189
## 3 110.89 75.97577
## 10 205.84 149.63411
## 11 77.08 115.22496
## 12 57.08 139.59010
## 15 95.04 112.61279
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
min_max_accuracy
## [1] 0.7416867
# MAPE Calculation
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)
mape
## [1] 0.3498599
From the results, we can observe which variables are most important by looking at their p-values. The important features identified are:
- gender
- age
- hypertension
- heart_disease
- bmi
- stroke
Random forest is an ensemble learning method that constructs many decision trees at training time. It can be used for classification as well as regression tasks.
For regression, the output is the mean (average) of the individual trees' predictions.
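As a minimal sketch of that averaging (built-in mtcars data, not the stroke data, and not part of the analysis pipeline), the forest's regression prediction equals the mean of the individual trees' predictions:
library(randomForest)
set.seed(1)
rf_reg_demo <- randomForest(mpg ~ ., data = mtcars, ntree = 25)
p <- predict(rf_reg_demo, newdata = mtcars[1:3, ], predict.all = TRUE)
rowMeans(p$individual)   # average over the 25 trees...
p$aggregate              # ...matches the prediction returned by the forest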
rf_regression_model <- randomForest(avg_glucose_level~.,data = data_train,ntree = 1000,mtry = 5)
pred_test <- predict(rf_regression_model,data_test)
actuals_preds <- data.frame(cbind(actuals=data_test$avg_glucose_level, predicteds=pred_test))
head(actuals_preds)
## actuals predicteds
## 2 87.96 151.15280
## 3 110.89 89.30541
## 10 205.84 166.39514
## 11 77.08 90.70945
## 12 57.08 104.59448
## 15 95.04 101.60528
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
min_max_accuracy
## [1] 0.8337048
# MAPE Calculation
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)
mape
## [1] 0.2168683
# Support Vector Regression on the same training data
svm_regression_model <- svm(avg_glucose_level~.,data = data_train)
pred_test <- predict(svm_regression_model,data_test)
actuals_preds <- data.frame(cbind(actuals=data_test$avg_glucose_level, predicteds=pred_test))
head(actuals_preds)
## actuals predicteds
## 2 87.96 135.16215
## 3 110.89 83.44734
## 10 205.84 184.80977
## 11 77.08 104.01751
## 12 57.08 102.74096
## 15 95.04 105.42116
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
min_max_accuracy
## [1] 0.7793447
# MAPE Calculation
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)
mape
## [1] 0.2720404
The k-means clustering algorithm identified two clusters: one without stroke and one with around 30% occurrence of stroke.
We can observe that the cluster without stroke has
- a lower average blood glucose level
- a lower rate of hypertension
- a lower smoking rate
- a lower bmi
- a lower average age
We experimented with 3 classification algorithms (Random Forest, Support Vector Machine, and Naive Bayes) to predict whether a patient will have a stroke.
The Random Forest classifier performed the best, with accuracy above 95% and very few false positives and false negatives.
The performance of the Support Vector and Naive Bayes classifiers is similar.
However, the models' parameters were not tuned, so we can expect a performance improvement for the Support Vector classifier after hyperparameter tuning.
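One possible tuning sketch (illustrative only and not run as part of this report; the grid size and resampling scheme are assumptions) is a cross-validated search with caret, which wraps kernlab's ksvm:
library(caret)
set.seed(666)
# 5-fold cross-validation to compare candidate SVM settings
ctrl <- trainControl(method = "cv", number = 5)
svm_tuned <- train(stroke ~ ., data = data_train,
                   method = "svmRadial",   # RBF kernel; "svmLinear" would tune the cost only
                   tuneLength = 5,         # let caret try 5 candidate cost values
                   trControl = ctrl)
svm_tuned$bestTune                         # best (sigma, C) combination found by CV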
We experimented with 3 regression algorithms (Linear Regression, Support Vector Regression, and Random Forest Regression) to predict the average glucose level of a patient.
Random Forest Regression performed the best, with the lowest mean absolute percentage error (below 0.25).
Linear Regression performed relatively poorly, which might be due to the non-linear nature of the dataset.
Support Vector Regression also performed relatively poorly, but since it is able to model non-linear data, further hyperparameter tuning might improve its regression performance.