Introduction

Stroke, also known as a brain attack, occurs when a blood vessel in the brain bursts or when the blood supply to part of the brain is stopped or blocked. According to the World Health Organization (WHO), stroke is a leading cause of death and disability globally. Annually, more than 795,000 people in the United States have a stroke, and around 610,000 of these are first or new strokes. People who survive a stroke can experience different levels of disability, such as loss of vision and speech, paralysis, and confusion. Stroke also brings financial and psychological burdens to the patient’s family.

The risk factors for stroke include high blood pressure, high cholesterol, heart disease, diabetes, obesity, cigarette smoking, alcohol consumption, an unbalanced diet, age, and gender. Some factors, such as high blood pressure, smoking, and diet, are modifiable and can be controlled to help avoid stroke. As prevention is better than cure, it is important to detect patients at high risk of stroke and take preventive measures accordingly.

With today’s high availability of medical data, this can be achieved using machine learning models, which can predict whether a person will have a stroke based on their lifestyle habits and physiological measurements.

According to the WHO, stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. In this project, we use the “Stroke Prediction Dataset” provided by Fedesoriano on Kaggle. The dataset was last updated in 2021 and contains 5110 instances and 12 attributes. It is used to predict whether a patient is likely to get a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

To summarize the attribute information: the 12 attributes comprise “id”, a unique identifier, and 11 clinical features for predicting stroke events. The “gender” attribute takes the values “Male”, “Female”, or “Other”. The “age”, “avg_glucose_level”, and “bmi” attributes record the patient’s age, average blood glucose level, and body mass index, respectively. For “hypertension”, “heart_disease”, and “stroke”, 0 indicates the patient has no history of the relevant disease, while 1 indicates the patient has it. The “ever_married” attribute records marital status as “No” or “Yes”. The “Residence_type” attribute is categorized into rural and urban. The “work_type” attribute covers children, government jobs, never worked, private jobs, and self-employment. The “smoking_status” attribute takes the values “formerly smoked”, “never smoked”, “smokes”, or “Unknown”.

Dataset

The dataset was taken from: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
The title of the data: Stroke Prediction Dataset
The year of the data: 2021
The dimension: 5110 rows, 12 attributes

Structure

  1. id: unique identifier
  2. gender: “Male”, “Female” or “Other”
  3. age: age of the patient
  4. hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
  5. heart_disease: 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease
  6. ever_married: “No” or “Yes”
  7. work_type: “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed”
  8. Residence_type: “Rural” or “Urban”
  9. avg_glucose_level: average glucose level in blood
  10. bmi: body mass index
  11. smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown”
  12. stroke: 1 if the patient had a stroke or 0 if not

Note: “Unknown” in smoking_status means that the information is unavailable for this patient
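
As a sketch, this structure can be imported with explicit column types using readr (note that bmi must initially be read as character because it encodes missing values as the literal string “N/A”):

library(readr)
stroke_cols <- cols(
  id = col_double(), gender = col_character(), age = col_double(),
  hypertension = col_integer(), heart_disease = col_integer(),
  ever_married = col_character(), work_type = col_character(),
  Residence_type = col_character(), avg_glucose_level = col_double(),
  bmi = col_character(),  # holds the literal string "N/A" for missing values
  smoking_status = col_character(), stroke = col_integer()
)
# stroke_df <- read_csv("healthcare-dataset-stroke-data.csv", col_types = stroke_cols)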

Question

  1. What are the factors that increase the risk of getting a stroke?
  2. What are the characteristics of patients who are likely and unlikely to get a stroke?
  3. Which machine learning model has the best accuracy in predicting the occurrence of stroke?
  4. Which machine learning model has the best accuracy in predicting a patient’s average glucose level?

Objective

  1. To determine the factors that increase the risk of getting a stroke.
  2. To determine the characteristics of patients who are likely and unlikely to get a stroke.
  3. To evaluate the performance of stroke prediction models with various evaluation metrics.
  4. To evaluate the performance of glucose level prediction models with various evaluation metrics.

Exploratory Data Analysis

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers we need, making it easier to discover patterns, spot anomalies, test a hypothesis, or check assumptions. EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis-testing task, and it provides a better understanding of data set variables and the relationships between them. It can also help determine whether the statistical techniques we are considering for data analysis are appropriate. The main purpose of EDA is to look at data before making any assumptions: it can help identify obvious errors, reveal patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables. We can use exploratory analysis to ensure the results we produce are valid and applicable to the desired business outcomes and goals. EDA also helps stakeholders by confirming we are asking the right questions, and it can help answer questions about standard deviations, categorical variables, and confidence intervals.
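
As a minimal sketch of the kinds of summaries EDA produces (toy vectors, not from the stroke dataset):

x <- c(5.1, 4.9, 6.3, 5.8, 7.0)          # a numeric measurement
g <- factor(c("a", "b", "a", "a", "b"))  # a categorical variable
sd(x)               # standard deviation of x
table(g)            # counts per category
t.test(x)$conf.int  # 95% confidence interval for the mean of x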

Setting up

# install.packages("tidyverse")
# install.packages("Hmisc")
# install.packages("openintro")

Load Libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.7     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(readr)
library(dplyr)
library(ggplot2)
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
## 
## Attaching package: 'openintro'
## The following object is masked from 'package:survival':
## 
##     transplant
## The following objects are masked from 'package:lattice':
## 
##     ethanol, lsegments
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:openintro':
## 
##     salinity
## The following object is masked from 'package:survival':
## 
##     aml
## The following object is masked from 'package:lattice':
## 
##     melanoma

Import the data

data <- read.csv("C:\\Users\\Teng\\Desktop\\healthcare-dataset-stroke-data.csv", header=TRUE, stringsAsFactors=FALSE)
head(data)
##      id gender age hypertension heart_disease ever_married     work_type
## 1  9046   Male  67            0             1          Yes       Private
## 2 51676 Female  61            0             0          Yes Self-employed
## 3 31112   Male  80            0             1          Yes       Private
## 4 60182 Female  49            0             0          Yes       Private
## 5  1665 Female  79            1             0          Yes Self-employed
## 6 56669   Male  81            0             0          Yes       Private
##   Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1          Urban            228.69 36.6 formerly smoked      1
## 2          Rural            202.21  N/A    never smoked      1
## 3          Rural            105.92 32.5    never smoked      1
## 4          Urban            171.23 34.4          smokes      1
## 5          Rural            174.12   24    never smoked      1
## 6          Urban            186.21   29 formerly smoked      1

Overview of the data

summary(data)
##        id           gender               age         hypertension    
##  Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000  
##  1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Mode  :character   Median :45.00   Median :0.00000  
##  Mean   :36518                      Mean   :43.23   Mean   :0.09746  
##  3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                      Max.   :82.00   Max.   :1.00000  
##  heart_disease     ever_married        work_type         Residence_type    
##  Min.   :0.00000   Length:5110        Length:5110        Length:5110       
##  1st Qu.:0.00000   Class :character   Class :character   Class :character  
##  Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.05401                                                           
##  3rd Qu.:0.00000                                                           
##  Max.   :1.00000                                                           
##  avg_glucose_level     bmi            smoking_status         stroke       
##  Min.   : 55.12    Length:5110        Length:5110        Min.   :0.00000  
##  1st Qu.: 77.25    Class :character   Class :character   1st Qu.:0.00000  
##  Median : 91.89    Mode  :character   Mode  :character   Median :0.00000  
##  Mean   :106.15                                          Mean   :0.04873  
##  3rd Qu.:114.09                                          3rd Qu.:0.00000  
##  Max.   :271.74                                          Max.   :1.00000

The summary table above reveals two problems: the “gender” column contains an “Other” value, and the “bmi” column contains “N/A” values.

Check the gender and bmi attributes

table(data$gender)
## 
## Female   Male  Other 
##   2994   2115      1
table(data$bmi)
## 
## 10.3 11.3 11.5   12 12.3 12.8   13 13.2 13.3 13.4 13.5 13.7 13.8 13.9   14 14.1 
##    1    1    1    1    1    1    1    1    1    1    1    2    2    1    1    5 
## 14.2 14.3 14.4 14.5 14.6 14.8 14.9   15 15.1 15.2 15.3 15.4 15.5 15.6 15.7 15.8 
##    4    3    2    2    4    4    1    2    8    4    4    3    5    3    3    5 
## 15.9   16 16.1 16.2 16.3 16.4 16.5 16.6 16.7 16.8 16.9   17 17.1 17.2 17.3 17.4 
##    5    8    8   10   10   11    4    8   11    7    7   10   12   11    9   13 
## 17.5 17.6 17.7 17.8 17.9   18 18.1 18.2 18.3 18.4 18.5 18.6 18.7 18.8 18.9   19 
##    7   16   13    7    7   16   12    9   17   10   12   19   13   15    8    7 
## 19.1 19.2 19.3 19.4 19.5 19.6 19.7 19.8 19.9   20 20.1 20.2 20.3 20.4 20.5 20.6 
##   11   13    9   14   21    7    8   17    9   17   25   16   17   23   18   15 
## 20.7 20.8 20.9   21 21.1 21.2 21.3 21.4 21.5 21.6 21.7 21.8 21.9   22 22.1 22.2 
##   12   17   13   17   16   16   21   22   27   16   14   18   14   15   22   30 
## 22.3 22.4 22.5 22.6 22.7 22.8 22.9   23 23.1 23.2 23.3 23.4 23.5 23.6 23.7 23.8 
##   18   22   13   18   25   25   16   27   21   20   19   36   31   24   16   22 
## 23.9   24 24.1 24.2 24.3 24.4 24.5 24.6 24.7 24.8 24.9   25 25.1 25.2 25.3 25.4 
##   24   28   28   29   26   23   26   22   22   31   27   27   34   18   28   26 
## 25.5 25.6 25.7 25.8 25.9   26 26.1 26.2 26.3 26.4 26.5 26.6 26.7 26.8 26.9   27 
##   33   21   15   24   24   25   37   27   23   34   30   29   37   21   34   35 
## 27.1 27.2 27.3 27.4 27.5 27.6 27.7 27.8 27.9   28 28.1 28.2 28.3 28.4 28.5 28.6 
##   28   24   36   22   29   37   37   23   28   28   29   25   30   38   27   27 
## 28.7 28.8 28.9   29 29.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9   30 30.1 30.2 
##   41   26   31   26   29   26   22   30   26   26   27   23   26   27   26   20 
## 30.3 30.4 30.5 30.6 30.7 30.8 30.9   31 31.1 31.2 31.3 31.4 31.5 31.6 31.7 31.8 
##   30   17   24   18   23   21   27   22   26   19   21   30   27   21   14   24 
## 31.9   32 32.1 32.2 32.3 32.4 32.5 32.6 32.7 32.8 32.9   33 33.1 33.2 33.3 33.4 
##   22   21   24   20   28   19   21   19   21   25   13   15   25   17   15   16 
## 33.5 33.6 33.7 33.8 33.9   34 34.1 34.2 34.3 34.4 34.5 34.6 34.7 34.8 34.9   35 
##   23   11   19   13   13   17   15   17   18   18   21   11   20   15   11   12 
## 35.1 35.2 35.3 35.4 35.5 35.6 35.7 35.8 35.9   36 36.1 36.2 36.3 36.4 36.5 36.6 
##   10   16   12    9   13   15   13   24   18   11    7   12   13   10    4   14 
## 36.7 36.8 36.9   37 37.1 37.2 37.3 37.4 37.5 37.6 37.7 37.8 37.9   38 38.1 38.2 
##   15    8   13   10    7    9   13   11    9    9    7   10   11   13   10    9 
## 38.3 38.4 38.5 38.6 38.7 38.8 38.9   39 39.1 39.2 39.3 39.4 39.5 39.6 39.7 39.8 
##    2    6    7    9   11   10    8    7    8   10    6   10    8    9    8    5 
## 39.9   40 40.1 40.2 40.3 40.4 40.5 40.6 40.7 40.8 40.9   41 41.1 41.2 41.3 41.4 
##    5    6   10   10    8    9    7    1    1    7    6    3    7    8    6    3 
## 41.5 41.6 41.7 41.8 41.9   42 42.1 42.2 42.3 42.4 42.5 42.6 42.7 42.8 42.9   43 
##    8    5    7   11    5    3    3    8    5    5    2    4    4    3    3    8 
## 43.1 43.2 43.3 43.4 43.6 43.7 43.8 43.9   44 44.1 44.2 44.3 44.4 44.5 44.6 44.7 
##    4    4    5    6    4    7    9    8    4    1    4    3    1    4    2    6 
## 44.8 44.9   45 45.1 45.2 45.3 45.4 45.5 45.7 45.8 45.9   46 46.1 46.2 46.3 46.4 
##    4    2    5    2    3    4    4    4    2    1    2    4    2    2    1    1 
## 46.5 46.6 46.8 46.9 47.1 47.3 47.4 47.5 47.6 47.8 47.9   48 48.1 48.2 48.3 48.4 
##    2    1    1    2    1    2    1    3    3    2    1    1    1    1    2    1 
## 48.5 48.7 48.8 48.9 49.2 49.3 49.4 49.5 49.8 49.9 50.1 50.2 50.3 50.4 50.5 50.6 
##    2    1    2    3    1    3    1    2    3    1    2    4    2    1    1    2 
## 50.8 50.9   51 51.5 51.7 51.8 51.9 52.3 52.5 52.7 52.8 52.9 53.4 53.5 53.8 53.9 
##    1    1    1    1    1    1    2    1    1    2    3    1    2    1    2    1 
##   54 54.1 54.2 54.3 54.6 54.7 54.8   55 55.1 55.2 55.7 55.9   56 56.1 56.6 57.2 
##    1    1    1    1    2    3    1    2    1    1    4    2    1    1    2    2 
## 57.3 57.5 57.7 57.9 58.1 59.7 60.2 60.9 61.2 61.6 63.3 64.4 64.8 66.8 71.9   78 
##    1    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1 
##   92 97.6  N/A 
##    1    1  201

We will delete the rows containing gender “Other” and bmi “N/A”.

Delete the rows containing gender “Other”

data<- data[data$gender != "Other",]

Delete the rows containing bmi “N/A”

data<- data[data$bmi != "N/A",] 

Re-check the gender and bmi attributes

table(data$gender)
## 
## Female   Male 
##   2897   2011
table(data$bmi)
## 
## 10.3 11.3 11.5   12 12.3 12.8   13 13.2 13.3 13.4 13.5 13.7 13.8 13.9   14 14.1 
##    1    1    1    1    1    1    1    1    1    1    1    2    2    1    1    5 
## 14.2 14.3 14.4 14.5 14.6 14.8 14.9   15 15.1 15.2 15.3 15.4 15.5 15.6 15.7 15.8 
##    4    3    2    2    4    4    1    2    8    4    4    3    5    3    3    5 
## 15.9   16 16.1 16.2 16.3 16.4 16.5 16.6 16.7 16.8 16.9   17 17.1 17.2 17.3 17.4 
##    5    8    8   10   10   11    4    8   11    7    7   10   12   11    9   13 
## 17.5 17.6 17.7 17.8 17.9   18 18.1 18.2 18.3 18.4 18.5 18.6 18.7 18.8 18.9   19 
##    7   16   13    7    7   16   12    9   17   10   12   19   13   15    8    7 
## 19.1 19.2 19.3 19.4 19.5 19.6 19.7 19.8 19.9   20 20.1 20.2 20.3 20.4 20.5 20.6 
##   11   13    9   14   21    7    8   17    9   17   25   16   17   23   18   15 
## 20.7 20.8 20.9   21 21.1 21.2 21.3 21.4 21.5 21.6 21.7 21.8 21.9   22 22.1 22.2 
##   12   17   13   17   16   16   21   22   27   16   14   18   14   15   22   30 
## 22.3 22.4 22.5 22.6 22.7 22.8 22.9   23 23.1 23.2 23.3 23.4 23.5 23.6 23.7 23.8 
##   18   21   13   18   25   25   16   27   21   20   19   36   31   24   16   22 
## 23.9   24 24.1 24.2 24.3 24.4 24.5 24.6 24.7 24.8 24.9   25 25.1 25.2 25.3 25.4 
##   24   28   28   29   26   23   26   22   22   31   27   27   34   18   28   26 
## 25.5 25.6 25.7 25.8 25.9   26 26.1 26.2 26.3 26.4 26.5 26.6 26.7 26.8 26.9   27 
##   33   21   15   24   24   25   37   27   23   34   30   29   37   21   34   35 
## 27.1 27.2 27.3 27.4 27.5 27.6 27.7 27.8 27.9   28 28.1 28.2 28.3 28.4 28.5 28.6 
##   28   24   36   22   29   37   37   23   28   28   29   25   30   38   27   27 
## 28.7 28.8 28.9   29 29.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9   30 30.1 30.2 
##   41   26   31   26   29   26   22   30   26   26   27   23   26   27   26   20 
## 30.3 30.4 30.5 30.6 30.7 30.8 30.9   31 31.1 31.2 31.3 31.4 31.5 31.6 31.7 31.8 
##   30   17   24   18   23   21   27   22   26   19   21   30   27   21   14   24 
## 31.9   32 32.1 32.2 32.3 32.4 32.5 32.6 32.7 32.8 32.9   33 33.1 33.2 33.3 33.4 
##   22   21   24   20   28   19   21   19   21   25   13   15   25   17   15   16 
## 33.5 33.6 33.7 33.8 33.9   34 34.1 34.2 34.3 34.4 34.5 34.6 34.7 34.8 34.9   35 
##   23   11   19   13   13   17   15   17   18   18   21   11   20   15   11   12 
## 35.1 35.2 35.3 35.4 35.5 35.6 35.7 35.8 35.9   36 36.1 36.2 36.3 36.4 36.5 36.6 
##   10   16   12    9   13   15   13   24   18   11    7   12   13   10    4   14 
## 36.7 36.8 36.9   37 37.1 37.2 37.3 37.4 37.5 37.6 37.7 37.8 37.9   38 38.1 38.2 
##   15    8   13   10    7    9   13   11    9    9    7   10   11   13   10    9 
## 38.3 38.4 38.5 38.6 38.7 38.8 38.9   39 39.1 39.2 39.3 39.4 39.5 39.6 39.7 39.8 
##    2    6    7    9   11   10    8    7    8   10    6   10    8    9    8    5 
## 39.9   40 40.1 40.2 40.3 40.4 40.5 40.6 40.7 40.8 40.9   41 41.1 41.2 41.3 41.4 
##    5    6   10   10    8    9    7    1    1    7    6    3    7    8    6    3 
## 41.5 41.6 41.7 41.8 41.9   42 42.1 42.2 42.3 42.4 42.5 42.6 42.7 42.8 42.9   43 
##    8    5    7   11    5    3    3    8    5    5    2    4    4    3    3    8 
## 43.1 43.2 43.3 43.4 43.6 43.7 43.8 43.9   44 44.1 44.2 44.3 44.4 44.5 44.6 44.7 
##    4    4    5    6    4    7    9    8    4    1    4    3    1    4    2    6 
## 44.8 44.9   45 45.1 45.2 45.3 45.4 45.5 45.7 45.8 45.9   46 46.1 46.2 46.3 46.4 
##    4    2    5    2    3    4    4    4    2    1    2    4    2    2    1    1 
## 46.5 46.6 46.8 46.9 47.1 47.3 47.4 47.5 47.6 47.8 47.9   48 48.1 48.2 48.3 48.4 
##    2    1    1    2    1    2    1    3    3    2    1    1    1    1    2    1 
## 48.5 48.7 48.8 48.9 49.2 49.3 49.4 49.5 49.8 49.9 50.1 50.2 50.3 50.4 50.5 50.6 
##    2    1    2    3    1    3    1    2    3    1    2    4    2    1    1    2 
## 50.8 50.9   51 51.5 51.7 51.8 51.9 52.3 52.5 52.7 52.8 52.9 53.4 53.5 53.8 53.9 
##    1    1    1    1    1    1    2    1    1    2    3    1    2    1    2    1 
##   54 54.1 54.2 54.3 54.6 54.7 54.8   55 55.1 55.2 55.7 55.9   56 56.1 56.6 57.2 
##    1    1    1    1    2    3    1    2    1    1    4    2    1    1    2    2 
## 57.3 57.5 57.7 57.9 58.1 59.7 60.2 60.9 61.2 61.6 63.3 64.4 64.8 66.8 71.9   78 
##    1    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1 
##   92 97.6 
##    1    1
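
Convert the categorical attributes to factors and the bmi attribute to numeric
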
data$gender <- as.factor(data$gender)
data$hypertension <- as.factor(data$hypertension)
data$heart_disease <- as.factor(data$heart_disease)
data$ever_married <- as.factor(data$ever_married)
data$bmi <- as.numeric(data$bmi)
data$stroke<- as.factor(data$stroke)

Univariate Analysis

We begin with univariate analysis, where the data being analyzed consists of just one variable. Since only a single variable is involved, it does not deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.

library(ggplot2)
library(gridExtra)
g1 <- ggplot(data = data, aes(x=gender,fill=gender))+geom_bar()
g2 <- ggplot(data = data, aes(x=age))+geom_histogram(binwidth = 20)
g3 <- ggplot(data = data, aes(x=ever_married,fill=ever_married))+geom_bar()
g4 <- ggplot(data = data, aes(x=work_type,fill=work_type))+geom_bar()+theme(axis.text.x = element_text(angle = 45))
grid.arrange(g1,g2,g3,g4, ncol=2)

g5 <- ggplot(data, aes(hypertension, fill=hypertension))+geom_bar()
g6 <- ggplot(data, aes(heart_disease, fill=heart_disease))+geom_bar()
g7 <- ggplot(data, aes(stroke, fill=stroke))+geom_bar()
grid.arrange(g5,g6,g7, ncol=3)

Based on the graphs drawn above, we did not detect any outliers. We find that the dataset contains more female adults than male, a roughly normally distributed age group, more married adults, and more workers in private organizations. We also find that the number of adults with hypertension, heart disease, or stroke is much lower than the number of adults without those diseases.

Multivariate Analysis

Multivariate visualizations and summary statistics allow us to assess the relationship between each variable in the dataset and the target variable. Multivariate data arises from more than one variable, and multivariate EDA techniques generally show the relationship between two or more variables through cross-tabulation or statistics. Based on the plots below, the factors with the biggest impact on the chance of getting a stroke involve average glucose level, work type, and smoking habits.

ggplot(data=data,aes(x=work_type, fill=stroke))+geom_bar(position="fill")+coord_cartesian(ylim = c(0,0.25))+ggtitle("Work type vs Stroke")

ggplot(data=data,aes(x=smoking_status, fill=stroke))+geom_bar(position="fill")+coord_cartesian(ylim = c(0,0.25))+ggtitle("Smoking status vs Stroke")

ggplot(data=data,aes(x=stroke, y=avg_glucose_level))+geom_violin()+ggtitle("Avg glucose level vs Stroke")

It is more common for adults with a higher average glucose level to get a stroke. Self-employed adults have a higher chance of stroke, while adults who never worked have a lower chance. Smokers have a higher chance of getting a stroke.

Data preprocessing

Import the libraries

library(readr)
library(dplyr)
library(stringr)
library(tidyverse)

Overview of the dataset

filePath = "C:\\Users\\Teng\\Desktop\\healthcare-dataset-stroke-data.csv"
healthcare_dataset_stroke_data <- read_csv(filePath)
## Rows: 5110 Columns: 12
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): gender, ever_married, work_type, Residence_type, bmi, smoking_status
## dbl (6): id, age, hypertension, heart_disease, avg_glucose_level, stroke
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(healthcare_dataset_stroke_data)
## Rows: 5,110
## Columns: 12
## $ id                <dbl> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434~
## $ gender            <chr> "Male", "Female", "Male", "Female", "Female", "Male"~
## $ age               <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, ~
## $ hypertension      <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1~
## $ heart_disease     <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0~
## $ ever_married      <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No~
## $ work_type         <chr> "Private", "Self-employed", "Private", "Private", "S~
## $ Residence_type    <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"~
## $ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0~
## $ bmi               <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "~
## $ smoking_status    <chr> "formerly smoked", "never smoked", "never smoked", "~
## $ stroke            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
dim(healthcare_dataset_stroke_data)
## [1] 5110   12
head(healthcare_dataset_stroke_data)
## # A tibble: 6 x 12
##      id gender   age hypertension heart_disease ever_married work_type    
##   <dbl> <chr>  <dbl>        <dbl>         <dbl> <chr>        <chr>        
## 1  9046 Male      67            0             1 Yes          Private      
## 2 51676 Female    61            0             0 Yes          Self-employed
## 3 31112 Male      80            0             1 Yes          Private      
## 4 60182 Female    49            0             0 Yes          Private      
## 5  1665 Female    79            1             0 Yes          Self-employed
## 6 56669 Male      81            0             0 Yes          Private      
## # ... with 5 more variables: Residence_type <chr>, avg_glucose_level <dbl>,
## #   bmi <chr>, smoking_status <chr>, stroke <dbl>

Remove the meaningless column

Remove the id column.

new_healthcare_df <- subset(healthcare_dataset_stroke_data, select = -c(id))
head(new_healthcare_df)
## # A tibble: 6 x 11
##   gender   age hypertension heart_disease ever_married work_type  Residence_type
##   <chr>  <dbl>        <dbl>         <dbl> <chr>        <chr>      <chr>         
## 1 Male      67            0             1 Yes          Private    Urban         
## 2 Female    61            0             0 Yes          Self-empl~ Rural         
## 3 Male      80            0             1 Yes          Private    Rural         
## 4 Female    49            0             0 Yes          Private    Urban         
## 5 Female    79            1             0 Yes          Self-empl~ Rural         
## 6 Male      81            0             0 Yes          Private    Urban         
## # ... with 4 more variables: avg_glucose_level <dbl>, bmi <chr>,
## #   smoking_status <chr>, stroke <dbl>

Check for the missing-data marker “N/A” and convert it to NA

new_healthcare_df[new_healthcare_df == ""]
## <unspecified> [0]
cname <- names(new_healthcare_df)
for (i in cname){
  print(paste(i, sum(new_healthcare_df[i] == "N/A")))
  new_healthcare_df[!is.na(new_healthcare_df[i]) & new_healthcare_df[i] == "N/A", i] <- NA
}
## [1] "gender 0"
## [1] "age 0"
## [1] "hypertension 0"
## [1] "heart_disease 0"
## [1] "ever_married 0"
## [1] "work_type 0"
## [1] "Residence_type 0"
## [1] "avg_glucose_level 0"
## [1] "bmi 201"
## [1] "smoking_status 0"
## [1] "stroke 0"

Thus, we observe that the bmi column has 201 “N/A” values, now converted to NA.
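
As a side note, a more compact alternative is a sketch using readr’s na argument, which marks the listed strings as missing at import time:

healthcare_na <- read_csv(filePath, na = c("", "NA", "N/A"))
sum(is.na(healthcare_na$bmi))  # expected to be 201 for this dataset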

Show the 201 rows with NA bmi values, moving the bmi column to the front for easier observation

bmi_sorted = new_healthcare_df[which(is.na(new_healthcare_df$bmi)),]
move_bmi = bmi_sorted %>% dplyr::select("bmi", everything())
move_bmi
## # A tibble: 201 x 11
##    bmi   gender   age hypertension heart_disease ever_married work_type    
##    <chr> <chr>  <dbl>        <dbl>         <dbl> <chr>        <chr>        
##  1 <NA>  Female    61            0             0 Yes          Self-employed
##  2 <NA>  Female    59            0             0 Yes          Private      
##  3 <NA>  Male      78            0             1 Yes          Private      
##  4 <NA>  Male      57            0             1 No           Govt_job     
##  5 <NA>  Male      58            0             0 Yes          Private      
##  6 <NA>  Male      59            0             0 Yes          Private      
##  7 <NA>  Female    63            0             0 Yes          Private      
##  8 <NA>  Female    75            0             1 No           Self-employed
##  9 <NA>  Female    76            0             0 No           Private      
## 10 <NA>  Male      78            1             0 Yes          Private      
## # ... with 191 more rows, and 4 more variables: Residence_type <chr>,
## #   avg_glucose_level <dbl>, smoking_status <chr>, stroke <dbl>

Deal with the NA values in the bmi column

Users on Kaggle stated that the bmi attribute is not a very clear indicator of stroke. Hence, we will remove the 201 rows with missing bmi values, which are only a small portion of the data.

clean_data = na.omit(new_healthcare_df)
clean_data
## # A tibble: 4,909 x 11
##    gender   age hypertension heart_disease ever_married work_type Residence_type
##    <chr>  <dbl>        <dbl>         <dbl> <chr>        <chr>     <chr>         
##  1 Male      67            0             1 Yes          Private   Urban         
##  2 Male      80            0             1 Yes          Private   Rural         
##  3 Female    49            0             0 Yes          Private   Urban         
##  4 Female    79            1             0 Yes          Self-emp~ Rural         
##  5 Male      81            0             0 Yes          Private   Urban         
##  6 Male      74            1             1 Yes          Private   Rural         
##  7 Female    69            0             0 No           Private   Urban         
##  8 Female    78            0             0 Yes          Private   Urban         
##  9 Female    81            1             0 Yes          Private   Rural         
## 10 Female    61            0             1 Yes          Govt_job  Rural         
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## #   bmi <chr>, smoking_status <chr>, stroke <dbl>

Data Transformation

The gender, ever_married, and Residence_type columns will be encoded as numeric values for our data modelling step. First, however, we check the gender, Residence_type, work_type, and ever_married columns to see whether any of them contains more than two types of values.

unique_gender = unique(clean_data$gender)
unique_ever_married = unique(clean_data$ever_married)
unique_residence_type = unique(clean_data$Residence_type)
unique_work_type = unique(clean_data$work_type)
unique_gender
## [1] "Male"   "Female" "Other"
unique_ever_married
## [1] "Yes" "No"
unique_residence_type
## [1] "Urban" "Rural"
unique_work_type
## [1] "Private"       "Self-employed" "Govt_job"      "children"     
## [5] "Never_worked"

We notice the gender column has three types of values: “Male”, “Female”, and “Other”. We transform it into Male = 0, Female = 1, Other = 2.

clean_data = clean_data %>% 
                mutate(gender = recode(
                  gender,
                  "Male" = "0",
                  "Female" = "1",
                  "Other" = "2"
                ))
clean_data
## # A tibble: 4,909 x 11
##    gender   age hypertension heart_disease ever_married work_type Residence_type
##    <chr>  <dbl>        <dbl>         <dbl> <chr>        <chr>     <chr>         
##  1 0         67            0             1 Yes          Private   Urban         
##  2 0         80            0             1 Yes          Private   Rural         
##  3 1         49            0             0 Yes          Private   Urban         
##  4 1         79            1             0 Yes          Self-emp~ Rural         
##  5 0         81            0             0 Yes          Private   Urban         
##  6 0         74            1             1 Yes          Private   Rural         
##  7 1         69            0             0 No           Private   Urban         
##  8 1         78            0             0 Yes          Private   Urban         
##  9 1         81            1             0 Yes          Private   Rural         
## 10 1         61            0             1 Yes          Govt_job  Rural         
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## #   bmi <chr>, smoking_status <chr>, stroke <dbl>

The work_type column has five types of values: “Private”, “Self-employed”, “Govt_job”, “children”, and “Never_worked”. We transform it into Private = 0, Self-employed = 1, Govt_job = 2, children = 3, Never_worked = 4.

clean_data = clean_data %>% 
                mutate(work_type = recode(
                  work_type,
                  "Private" = "0",
                  "Self-employed" = "1",
                  "Govt_job" = "2",
                  "children" = "3",
                  "Never_worked" = "4"
                ))
clean_data
## # A tibble: 4,909 x 11
##    gender   age hypertension heart_disease ever_married work_type Residence_type
##    <chr>  <dbl>        <dbl>         <dbl> <chr>        <chr>     <chr>         
##  1 0         67            0             1 Yes          0         Urban         
##  2 0         80            0             1 Yes          0         Rural         
##  3 1         49            0             0 Yes          0         Urban         
##  4 1         79            1             0 Yes          1         Rural         
##  5 0         81            0             0 Yes          0         Urban         
##  6 0         74            1             1 Yes          0         Rural         
##  7 1         69            0             0 No           0         Urban         
##  8 1         78            0             0 Yes          0         Urban         
##  9 1         81            1             0 Yes          0         Rural         
## 10 1         61            0             1 Yes          2         Rural         
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## #   bmi <chr>, smoking_status <chr>, stroke <dbl>

Transform the Residence_type and ever_married columns’ data into binary data.

clean_data$Residence_type <- ifelse(clean_data$Residence_type == "Urban", 1, 0)
clean_data$ever_married <- ifelse(clean_data$ever_married == "Yes", 1, 0)
clean_data
## # A tibble: 4,909 x 11
##    gender   age hypertension heart_disease ever_married work_type Residence_type
##    <chr>  <dbl>        <dbl>         <dbl>        <dbl> <chr>              <dbl>
##  1 0         67            0             1            1 0                      1
##  2 0         80            0             1            1 0                      0
##  3 1         49            0             0            1 0                      1
##  4 1         79            1             0            1 1                      0
##  5 0         81            0             0            1 0                      1
##  6 0         74            1             1            1 0                      0
##  7 1         69            0             0            0 0                      1
##  8 1         78            0             0            1 0                      1
##  9 1         81            1             0            1 0                      0
## 10 1         61            0             1            1 2                      0
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## #   bmi <chr>, smoking_status <chr>, stroke <dbl>

Data Modelling

Clustering

To start with data modelling, let’s assume that the data is not labelled and see what a clustering algorithm can tell us when we ask for 2 clusters in the dataset.

Ideally, the 2 clusters should be one with stroke and one without. Clustering algorithms normally work best with continuous variables, but we have categorical variables here, so we use Gower distance to measure the distance between two data points.

CRAN has detailed documentation about Gower distance, available at https://cran.r-project.org/web/packages/gower/vignettes/intro.pdf
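
As a toy sketch of what Gower distance does with mixed numeric and categorical columns (hypothetical values):

library(cluster)
toy_mixed <- data.frame(
  age    = c(25, 60, 33),
  smoker = factor(c("yes", "no", "yes"))
)
daisy(toy_mixed, metric = "gower")  # pairwise dissimilarities scaled to [0, 1]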

k-means clustering

k-means clustering works by partitioning n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid). All data points are treated as vectors, and the distance between data points can be measured using various methods (Euclidean, Manhattan, etc.).
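
A quick sketch of the two distance metrics mentioned above:

pts <- rbind(c(0, 0), c(3, 4))
dist(pts, method = "euclidean")  # sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")  # |3 - 0| + |4 - 0| = 7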

library(cluster)    # clustering algorithms
library(factoextra) # clustering algorithms & visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(compareGroups)

set.seed(666)
clean_data$gender <- as.factor(clean_data$gender)
clean_data$ever_married <- as.factor(clean_data$ever_married)
clean_data$Residence_type <- as.factor(clean_data$Residence_type)
clean_data$smoking_status <- as.factor(clean_data$smoking_status)
clean_data$stroke <- as.factor(clean_data$stroke)
clean_data$heart_disease <- as.factor(clean_data$heart_disease)
clean_data$hypertension <- as.factor(clean_data$hypertension)
clean_data$bmi <- as.numeric(clean_data$bmi)

df = subset(clean_data, select = c(bmi,avg_glucose_level, age, 
                          hypertension, smoking_status, 
                          stroke) )

df <- na.omit(df)

distMat<-daisy(df,metric = "gower")
k2 <- kmeans(distMat, centers = 2, nstart = 25)

df$cluster<-k2$cluster

group<-compareGroups(cluster~.,data=df)
clustab<-createTable(group)
clustab
## 
## --------Summary descriptives table by 'cluster'---------
## 
## ______________________________________________________ 
##                          1           2       p.overall 
##                        N=691       N=4218              
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## bmi                 32.6 (8.27) 28.3 (7.62)   <0.001   
## avg_glucose_level   143 (65.3)  99.1 (36.4)   <0.001   
## age                 64.4 (13.5) 39.3 (21.8)   <0.001   
## hypertension:                                  0.000   
##     0               240 (34.7%) 4218 (100%)            
##     1               451 (65.3%)  0 (0.00%)             
## smoking_status:                               <0.001   
##     formerly smoked 203 (29.4%) 634 (15.0%)            
##     never smoked    286 (41.4%) 1566 (37.1%)           
##     smokes          132 (19.1%) 605 (14.3%)            
##     Unknown         70 (10.1%)  1413 (33.5%)           
## stroke:                                       <0.001   
##     0               482 (69.8%) 4218 (100%)            
##     1               209 (30.2%)  0 (0.00%)             
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Classification

After that, we will use a supervised learning method, classification, to predict whether an individual will get a stroke or not.

Random Forest Classifier

Random forest is an ensemble learning method that constructs many decision trees at training time. It can be used for classification as well as regression tasks.
For classification, the output is the class selected by the most trees (see the vote-inspection sketch after the confusion matrices below).

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:openintro':
## 
##     dotPlot
## The following object is masked from 'package:survival':
## 
##     cluster
## The following object is masked from 'package:purrr':
## 
##     lift
library(ROSE)
## Loaded ROSE 0.0-4
oversampled_data <- ovun.sample(as.factor(stroke)~.,data = clean_data, method = 'over',p = 0.3)$data
sample_index <- createDataPartition(oversampled_data$stroke, p = 0.7, 
                                   list = FALSE, 
                                   times = 1)
data_train <- oversampled_data[sample_index,]
data_test <- oversampled_data[-sample_index,]

rf_model <- randomForest(stroke~.,data = data_train,ntree = 1000,mtry = 5)

## Model performance using training data
pred_train <- predict(rf_model)

result_train <- confusionMatrix(pred_train, data_train$stroke)

(result_train$overall)['Accuracy'] 
##  Accuracy 
## 0.9849608
(result_train$byClass)['Sensitivity']
## Sensitivity 
##   0.9796353
(result_train$byClass)['Specificity']
## Specificity 
##   0.9972048
(result_train$byClass)['F1']
##        F1 
## 0.9891054
(result_train$byClass)['Recall']
##    Recall 
## 0.9796353
fourfoldplot(result_train$table, color = c("firebrick3", "green3"),
             conf.level = 0, margin = 1, main = "(Random Forest Classifier) Confusion Matrix - Train")

## Model performance using testing data
pred_test <- predict(rf_model, newdata = data_test)

result_test <- confusionMatrix(pred_test, data_test$stroke)

(result_test$overall)['Accuracy'] 
##  Accuracy 
## 0.9841819
(result_test$byClass)['Sensitivity']
## Sensitivity 
##   0.9787234
(result_test$byClass)['Specificity']
## Specificity 
##   0.9967374
(result_test$byClass)['F1']
##        F1 
## 0.9885387
(result_test$byClass)['Recall']
##    Recall 
## 0.9787234
fourfoldplot(result_test$table, color = c("firebrick3", "green3"),
             conf.level = 0, margin = 1, main = "(Random Forest Classifier) Confusion Matrix - Test")
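
To see the majority-vote mechanism directly, here is a short sketch using the fitted rf_model and data_test objects above; predict.all = TRUE is a randomForest option that returns each tree’s prediction alongside the aggregate:

all_preds <- predict(rf_model, newdata = data_test, predict.all = TRUE)
all_preds$aggregate[1]            # majority-vote class for the first test row
table(all_preds$individual[1, ])  # how the 1000 trees voted for that row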

Support Vector Machine Classifier

A Support Vector Machine works by constructing a hyperplane (or set of hyperplanes) that separates the classes of data points as distinctly as possible (a toy visual sketch follows the confusion matrices below).

library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
## 
##     impute
svm_model <- ksvm(stroke~.,data = data_train, kernel="vanilladot")
##  Setting default kernel parameters
## Model performance using training data
pred_train <- predict(svm_model)

result_train <- confusionMatrix(pred_train, data_train$stroke)

(result_train$overall)['Accuracy'] 
## Accuracy 
## 0.798136
(result_train$byClass)['Sensitivity']
## Sensitivity 
##   0.8604863
(result_train$byClass)['Specificity']
## Specificity 
##   0.6547869
(result_train$byClass)['F1']
##        F1 
## 0.8559335
(result_train$byClass)['Recall']
##    Recall 
## 0.8604863
fourfoldplot(result_train$table, color = c("firebrick3", "green3"),
             conf.level = 0, margin = 1, main = "(Support Vector Machine Classifier) Confusion Matrix - Train")

## Model performance using testing data
pred_test <- predict(svm_model, newdata = data_test)

result_test <- confusionMatrix(pred_test, data_test$stroke)

(result_test$overall)['Accuracy'] 
##  Accuracy 
## 0.7874444
(result_test$byClass)['Sensitivity']
## Sensitivity 
##   0.8595745
(result_test$byClass)['Specificity']
## Specificity 
##   0.6215334
(result_test$byClass)['F1']
##        F1 
## 0.8493343
(result_test$byClass)['Recall']
##    Recall 
## 0.8595745
fourfoldplot(result_test$table, color = c("firebrick3", "green3"),
             conf.level = 0, margin = 1, main = "(Support Vector Machine Classifier) Confusion Matrix - Test")
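
For visual intuition about the separating boundary, a toy 2-D sketch on hypothetical data (using the e1071 package loaded above):

set.seed(1)
toy <- data.frame(
  x1 = c(rnorm(20, mean = 0), rnorm(20, mean = 3)),
  x2 = c(rnorm(20, mean = 0), rnorm(20, mean = 3)),
  y  = factor(rep(c("a", "b"), each = 20))
)
toy_svm <- svm(y ~ ., data = toy, kernel = "linear")
plot(toy_svm, toy)  # shades the two decision regions and marks support vectors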

Naive Bayes Classifier

The Naive Bayes Classifier is a probabilistic machine learning algorithm based on Bayes’ theorem.
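Concretely, for a feature vector (x1, …, xn) it assigns the class c that maximizes P(c) · P(x1 | c) · … · P(xn | c), relying on the “naive” assumption that the features are conditionally independent given the class.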

nb_model <- naiveBayes(stroke~.,data = data_train)

## Model performance using training data
pred_train <- predict(nb_model, data_train)

result_train <- confusionMatrix(pred_train, data_train$stroke)

(result_train$overall)['Accuracy'] 
##  Accuracy 
## 0.7665749
(result_train$byClass)['Sensitivity']
## Sensitivity 
##   0.7829787
(result_train$byClass)['Specificity']
## Specificity 
##   0.7288609
(result_train$byClass)['F1']
##        F1 
## 0.8237928
(result_train$byClass)['Recall']
##    Recall 
## 0.7829787
fourfoldplot(result_train$table, color = c("firebrick3", "green3"),
             conf.level = 0, margin = 1, main = "(Naive Bayes Classifier) Confusion Matrix - Train")

## Model performance using testing data
pred_test <- predict(nb_model, newdata = data_test)

result_test <- confusionMatrix(pred_test, data_test$stroke)

(result_test$overall)['Accuracy'] 
##  Accuracy 
## 0.7572912
(result_test$byClass)['Sensitivity']
## Sensitivity 
##   0.7815603
(result_test$byClass)['Specificity']
## Specificity 
##   0.7014682
(result_test$byClass)['F1']
##        F1 
## 0.8178108
(result_test$byClass)['Recall']
##    Recall 
## 0.7815603
fourfoldplot(result_test$table, color = c("firebrick3", "green3"),
             conf.level = 0, margin = 1, main = "(Naive Bayes Classifier) Confusion Matrix - Test")

Regression

With this dataset we can perform regression on average glucose level as well.

Linear regression

Linear regression works by fitting a linear equation that maps the input variables to the output. Hence, linear regression might not work well on datasets with a non-linear nature.
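In our case, the model takes the form avg_glucose_level ≈ β0 + β1·x1 + … + βk·xk, with the coefficients βi estimated by least squares.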

# Linear regression with all variables
linear_regression_str <- lm(avg_glucose_level~.,data = data_train)
summary(linear_regression_str)
## 
## Call:
## lm(formula = avg_glucose_level ~ ., data = data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -144.72  -31.73  -10.31   25.23  167.53 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                47.78289    4.46810  10.694  < 2e-16 ***
## gender1                    -8.62641    1.42474  -6.055 1.52e-09 ***
## gender2                    55.74107   47.42137   1.175 0.239877    
## age                         0.42468    0.05193   8.178 3.69e-16 ***
## hypertension1              13.50414    2.07378   6.512 8.20e-11 ***
## heart_disease1             28.44731    2.60689  10.912  < 2e-16 ***
## ever_married1               5.22690    2.04181   2.560 0.010500 *  
## work_type1                 -6.69649    1.94008  -3.452 0.000562 ***
## work_type2                 -3.27198    2.09254  -1.564 0.117969    
## work_type3                 22.14089    3.37243   6.565 5.76e-11 ***
## work_type4                 11.64373   14.42166   0.807 0.419490    
## Residence_type1            -0.51483    1.38203  -0.373 0.709525    
## bmi                         1.28412    0.10201  12.589  < 2e-16 ***
## smoking_statusnever smoked  0.82478    1.94800   0.423 0.672022    
## smoking_statussmokes       -0.12377    2.38445  -0.052 0.958605    
## smoking_statusUnknown      -2.71304    2.32437  -1.167 0.243184    
## stroke1                    14.13256    1.81960   7.767 9.81e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.34 on 4704 degrees of freedom
## Multiple R-squared:  0.1953, Adjusted R-squared:  0.1926 
## F-statistic: 71.36 on 16 and 4704 DF,  p-value: < 2.2e-16
# Linear regression keeping only variables with significant p-values
linear_regression_model <- lm(avg_glucose_level~gender+age+hypertension+heart_disease+bmi+stroke,data = data_train)
summary(linear_regression_model)
## 
## Call:
## lm(formula = avg_glucose_level ~ gender + age + hypertension + 
##     heart_disease + bmi + stroke, data = data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -133.60  -32.26  -10.98   25.06  167.27 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63.19731    3.03674  20.811  < 2e-16 ***
## gender1        -9.11825    1.41950  -6.424 1.46e-10 ***
## gender2        47.85114   47.61372   1.005    0.315    
## age             0.27898    0.03671   7.600 3.56e-14 ***
## hypertension1  13.90457    2.05859   6.754 1.61e-11 ***
## heart_disease1 29.77206    2.58662  11.510  < 2e-16 ***
## bmi             1.11732    0.09531  11.723  < 2e-16 ***
## stroke1        15.98919    1.79695   8.898  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.59 on 4713 degrees of freedom
## Multiple R-squared:  0.1852, Adjusted R-squared:  0.184 
## F-statistic:   153 on 7 and 4713 DF,  p-value: < 2.2e-16
pred_test <- predict(linear_regression_model,data_test)

actuals_preds <- data.frame(cbind(actuals=data_test$avg_glucose_level, predicteds=pred_test))
head(actuals_preds)
##    actuals predicteds
## 2    87.96  137.08189
## 3   110.89   75.97577
## 10  205.84  149.63411
## 11   77.08  115.22496
## 12   57.08  139.59010
## 15   95.04  112.61279
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))  
min_max_accuracy
## [1] 0.7416867
# MAPE Calculation
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)  
mape
## [1] 0.3498599
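
For interpretation: min-max accuracy averages min(actual, predicted) / max(actual, predicted) across the test rows, so values closer to 1 indicate a better fit, while MAPE averages |predicted - actual| / actual, so lower values are better.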

From the results, we can observe which variables are most important by looking at their p-values. The important features identified are
- gender
- age
- hypertension
- heart_disease
- bmi
- stroke

Random Forest Regression

Random forest is an ensemble learning method that constructs many decision trees at training time. It can be used for classification as well as regression tasks.
For regression, the mean (average) prediction of the individual trees is used as the output.

rf_regression_model <- randomForest(avg_glucose_level~.,data = data_train,ntree = 1000,mtry = 5)

pred_test <- predict(rf_regression_model,data_test)

actuals_preds <- data.frame(cbind(actuals=data_test$avg_glucose_level, predicteds=pred_test))
head(actuals_preds)
##    actuals predicteds
## 2    87.96  151.15280
## 3   110.89   89.30541
## 10  205.84  166.39514
## 11   77.08   90.70945
## 12   57.08  104.59448
## 15   95.04  101.60528
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))  
min_max_accuracy
## [1] 0.8337048
# MAPE Calculation
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)  
mape
## [1] 0.2168683

Support Vector Regression
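
Support Vector Regression applies the support vector principle to regression: it fits a function that is allowed to deviate from each observed target by at most a margin ε, penalizing larger deviations, and can capture non-linear relationships through kernels.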

svm_regression_model <- svm(avg_glucose_level~.,data = data_train)
pred_test <- predict(svm_regression_model,data_test)

actuals_preds <- data.frame(cbind(actuals=data_test$avg_glucose_level, predicteds=pred_test))
head(actuals_preds)
##    actuals predicteds
## 2    87.96  135.16215
## 3   110.89   83.44734
## 10  205.84  184.80977
## 11   77.08  104.01751
## 12   57.08  102.74096
## 15   95.04  105.42116
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))  
min_max_accuracy
## [1] 0.7793447
# MAPE Calculation
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)  
mape
## [1] 0.2720404

Results and Discussions

Clustering

The k-means clustering algorithm identified two clusters:
one without stroke and one with around 30% occurrence of stroke.
We can observe that the cluster without stroke has
- a lower average blood glucose level
- a lower rate of hypertension
- a lower smoking rate
- a lower bmi
- a lower age

Classification

We experimented with 3 classification algorithms (Random Forest Classifier, Support Vector Classifier, Naive Bayes Classifier) to predict whether a patient will get a stroke or not.

The Random Forest Classifier performed best, with accuracy above 95% and very low false positives and false negatives.
The performance of the Support Vector Classifier and the Naive Bayes Classifier is similar.
However, the models’ parameters were not tuned; we can expect a performance improvement for the Support Vector Classifier after hyperparameter tuning.

Regression

We experimented with 3 regression algorithms (Linear Regression, Support Vector Regression, Random Forest Regression) to predict the average glucose level of a patient.

Random Forest Regression performed best, with the lowest mean absolute percentage error (MAPE < 0.25).
Linear Regression performed relatively poorly; this might be due to the non-linear nature of the dataset.
Support Vector Regression also performed relatively poorly, but since it is able to model non-linear datasets, further hyperparameter tuning might improve its performance.

Conclusion

Factors that increase the risk of developing a stroke include high blood sugar levels, high blood pressure, smoking habits, and excess weight.

People who have good living habits, do not smoke, and exercise regularly are less likely to get a stroke. On the contrary, a person who smokes regularly, lacks exercise, and is overweight is more prone to stroke.

We experimented with 3 classification algorithms (Random Forest Classifier, Support Vector Classifier, Naive Bayes Classifier) to predict whether a patient will get a stroke or not. The Random Forest Classifier performed best with accuracy above 95%.

We experimented with 3 regression algorithms (Linear Regression, Support Vector Regression, Random Forest Regression) to predict the average glucose level of a patient. Random Forest Regression performed best with the lowest mean absolute percentage error (MAPE < 0.25).