Stroke, also known as a brain attack, occurs when a blood vessel
in the brain bursts or when the blood supply to part of the brain is
stopped or blocked. According to the World Health Organization, stroke is
a leading cause of death and disability globally. Annually, more than
795,000 people in the United States have a stroke, and around 610,000 of
these are first strokes. People who survive a stroke can experience
different levels of disability, such as loss of vision or speech,
paralysis, and confusion. Stroke also brings financial and psychological
burdens to the patient's family.
The risk factors for stroke include high blood pressure, high
cholesterol, heart disease, diabetes, obesity, cigarette smoking,
alcohol consumption, an unbalanced diet, age, and gender. Some factors,
such as high blood pressure, smoking, and diet, are modifiable and can
be controlled to reduce stroke risk effectively. As prevention is better
than cure, it is important to detect patients at high risk of stroke
and take preventive measures accordingly.
With today's high availability of medical data, this can be achieved
using machine learning models, which can predict whether a person will
have a stroke based on their lifestyle habits and physiological
measurements.
According to the World Health Organization (WHO), stroke is the 2nd
leading cause of death globally, responsible for approximately 11% of
total deaths. In this project, we use the "Stroke Prediction
Dataset" provided by Fedesoriano on Kaggle. The dataset was last
updated in 2021 and contains 5110 instances and 12 attributes. It is
used to predict whether a patient is likely to get a stroke based on
input parameters such as gender, age, various diseases, and smoking
status. Each row in the data provides the relevant information about
one patient.
Summarizing the attribute information, the 12 attributes comprise
"id", a unique identifier, and 11 clinical features for predicting
stroke events. The "gender" attribute takes the values "Male",
"Female", or "Other". The "age", "avg_glucose_level" and "bmi"
attributes indicate the patient's age, average blood glucose level,
and body mass index respectively. For "hypertension", "heart_disease"
and "stroke", 0 indicates the patient doesn't have the relevant
disease history while 1 indicates the patient does. The "ever_married"
attribute records marital status, with "No" for single and "Yes" for
married adults. The "Residence_type" attribute is categorized into
rural and urban. The "work_type" attribute includes children,
government jobs, never worked, private jobs and self-employed. The
"smoking_status" attribute includes formerly smoked, never smoked,
smokes and unknown.
The dataset was taken from: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
The title of the data: Stroke Prediction Dataset
The year of the data: 2021
The dimensions: 5110 rows, 12 attributes
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers we need, making it easier to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis-testing task, and it provides a better understanding of data set variables and the relationships between them. It can also help determine whether the statistical techniques we are considering for data analysis are appropriate. The main purpose of EDA is to look at the data before making any assumptions: it can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
We can use exploratory analysis to ensure the results we produce are valid and applicable to the desired business outcomes and goals. EDA also helps stakeholders by confirming we are asking the right questions, and it can help answer questions about standard deviations, categorical variables, and confidence intervals.
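As a minimal sketch of the kind of calls such an analysis typically starts with (the data frame df and its column names here are hypothetical placeholders, not our dataset):
str(df)                 # column types and example values
summary(df)             # five-number summaries and counts
colSums(is.na(df))      # missing values per column
table(df$category_col)  # frequency table for a categorical column
hist(df$numeric_col)    # distribution of a numeric column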
# install.packages("tidyverse")
# install.packages("Hmisc")
# install.packages("openintro")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.7 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(readr)
library(dplyr)
library(ggplot2)
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
##
## Attaching package: 'openintro'
## The following object is masked from 'package:survival':
##
## transplant
## The following objects are masked from 'package:lattice':
##
## ethanol, lsegments
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:openintro':
##
## salinity
## The following object is masked from 'package:survival':
##
## aml
## The following object is masked from 'package:lattice':
##
## melanoma
data <- read.csv("C:\\Users\\Teng\\Desktop\\healthcare-dataset-stroke-data.csv", header=TRUE, stringsAsFactors=FALSE)
head(data)
## id gender age hypertension heart_disease ever_married work_type
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 2 Rural 202.21 N/A never smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24 never smoked 1
## 6 Urban 186.21 29 formerly smoked 1
summary(data)
## id gender age hypertension
## Min. : 67 Length:5110 Min. : 0.08 Min. :0.00000
## 1st Qu.:17741 Class :character 1st Qu.:25.00 1st Qu.:0.00000
## Median :36932 Mode :character Median :45.00 Median :0.00000
## Mean :36518 Mean :43.23 Mean :0.09746
## 3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
## Max. :72940 Max. :82.00 Max. :1.00000
## heart_disease ever_married work_type Residence_type
## Min. :0.00000 Length:5110 Length:5110 Length:5110
## 1st Qu.:0.00000 Class :character Class :character Class :character
## Median :0.00000 Mode :character Mode :character Mode :character
## Mean :0.05401
## 3rd Qu.:0.00000
## Max. :1.00000
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.12 Length:5110 Length:5110 Min. :0.00000
## 1st Qu.: 77.25 Class :character Class :character 1st Qu.:0.00000
## Median : 91.89 Mode :character Mode :character Median :0.00000
## Mean :106.15 Mean :0.04873
## 3rd Qu.:114.09 3rd Qu.:0.00000
## Max. :271.74 Max. :1.00000
The summary table above reveals 2 problems: the "gender" column contains "Other" values and the "bmi" column contains "N/A" values.
table(data$gender)
##
## Female Male Other
## 2994 2115 1
table(data$bmi)
##
## 10.3 11.3 11.5 12 12.3 12.8 13 13.2 13.3 13.4 13.5 13.7 13.8 13.9 14 14.1
## 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 5
## 14.2 14.3 14.4 14.5 14.6 14.8 14.9 15 15.1 15.2 15.3 15.4 15.5 15.6 15.7 15.8
## 4 3 2 2 4 4 1 2 8 4 4 3 5 3 3 5
## 15.9 16 16.1 16.2 16.3 16.4 16.5 16.6 16.7 16.8 16.9 17 17.1 17.2 17.3 17.4
## 5 8 8 10 10 11 4 8 11 7 7 10 12 11 9 13
## 17.5 17.6 17.7 17.8 17.9 18 18.1 18.2 18.3 18.4 18.5 18.6 18.7 18.8 18.9 19
## 7 16 13 7 7 16 12 9 17 10 12 19 13 15 8 7
## 19.1 19.2 19.3 19.4 19.5 19.6 19.7 19.8 19.9 20 20.1 20.2 20.3 20.4 20.5 20.6
## 11 13 9 14 21 7 8 17 9 17 25 16 17 23 18 15
## 20.7 20.8 20.9 21 21.1 21.2 21.3 21.4 21.5 21.6 21.7 21.8 21.9 22 22.1 22.2
## 12 17 13 17 16 16 21 22 27 16 14 18 14 15 22 30
## 22.3 22.4 22.5 22.6 22.7 22.8 22.9 23 23.1 23.2 23.3 23.4 23.5 23.6 23.7 23.8
## 18 22 13 18 25 25 16 27 21 20 19 36 31 24 16 22
## 23.9 24 24.1 24.2 24.3 24.4 24.5 24.6 24.7 24.8 24.9 25 25.1 25.2 25.3 25.4
## 24 28 28 29 26 23 26 22 22 31 27 27 34 18 28 26
## 25.5 25.6 25.7 25.8 25.9 26 26.1 26.2 26.3 26.4 26.5 26.6 26.7 26.8 26.9 27
## 33 21 15 24 24 25 37 27 23 34 30 29 37 21 34 35
## 27.1 27.2 27.3 27.4 27.5 27.6 27.7 27.8 27.9 28 28.1 28.2 28.3 28.4 28.5 28.6
## 28 24 36 22 29 37 37 23 28 28 29 25 30 38 27 27
## 28.7 28.8 28.9 29 29.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30 30.1 30.2
## 41 26 31 26 29 26 22 30 26 26 27 23 26 27 26 20
## 30.3 30.4 30.5 30.6 30.7 30.8 30.9 31 31.1 31.2 31.3 31.4 31.5 31.6 31.7 31.8
## 30 17 24 18 23 21 27 22 26 19 21 30 27 21 14 24
## 31.9 32 32.1 32.2 32.3 32.4 32.5 32.6 32.7 32.8 32.9 33 33.1 33.2 33.3 33.4
## 22 21 24 20 28 19 21 19 21 25 13 15 25 17 15 16
## 33.5 33.6 33.7 33.8 33.9 34 34.1 34.2 34.3 34.4 34.5 34.6 34.7 34.8 34.9 35
## 23 11 19 13 13 17 15 17 18 18 21 11 20 15 11 12
## 35.1 35.2 35.3 35.4 35.5 35.6 35.7 35.8 35.9 36 36.1 36.2 36.3 36.4 36.5 36.6
## 10 16 12 9 13 15 13 24 18 11 7 12 13 10 4 14
## 36.7 36.8 36.9 37 37.1 37.2 37.3 37.4 37.5 37.6 37.7 37.8 37.9 38 38.1 38.2
## 15 8 13 10 7 9 13 11 9 9 7 10 11 13 10 9
## 38.3 38.4 38.5 38.6 38.7 38.8 38.9 39 39.1 39.2 39.3 39.4 39.5 39.6 39.7 39.8
## 2 6 7 9 11 10 8 7 8 10 6 10 8 9 8 5
## 39.9 40 40.1 40.2 40.3 40.4 40.5 40.6 40.7 40.8 40.9 41 41.1 41.2 41.3 41.4
## 5 6 10 10 8 9 7 1 1 7 6 3 7 8 6 3
## 41.5 41.6 41.7 41.8 41.9 42 42.1 42.2 42.3 42.4 42.5 42.6 42.7 42.8 42.9 43
## 8 5 7 11 5 3 3 8 5 5 2 4 4 3 3 8
## 43.1 43.2 43.3 43.4 43.6 43.7 43.8 43.9 44 44.1 44.2 44.3 44.4 44.5 44.6 44.7
## 4 4 5 6 4 7 9 8 4 1 4 3 1 4 2 6
## 44.8 44.9 45 45.1 45.2 45.3 45.4 45.5 45.7 45.8 45.9 46 46.1 46.2 46.3 46.4
## 4 2 5 2 3 4 4 4 2 1 2 4 2 2 1 1
## 46.5 46.6 46.8 46.9 47.1 47.3 47.4 47.5 47.6 47.8 47.9 48 48.1 48.2 48.3 48.4
## 2 1 1 2 1 2 1 3 3 2 1 1 1 1 2 1
## 48.5 48.7 48.8 48.9 49.2 49.3 49.4 49.5 49.8 49.9 50.1 50.2 50.3 50.4 50.5 50.6
## 2 1 2 3 1 3 1 2 3 1 2 4 2 1 1 2
## 50.8 50.9 51 51.5 51.7 51.8 51.9 52.3 52.5 52.7 52.8 52.9 53.4 53.5 53.8 53.9
## 1 1 1 1 1 1 2 1 1 2 3 1 2 1 2 1
## 54 54.1 54.2 54.3 54.6 54.7 54.8 55 55.1 55.2 55.7 55.9 56 56.1 56.6 57.2
## 1 1 1 1 2 3 1 2 1 1 4 2 1 1 2 2
## 57.3 57.5 57.7 57.9 58.1 59.7 60.2 60.9 61.2 61.6 63.3 64.4 64.8 66.8 71.9 78
## 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## 92 97.6 N/A
## 1 1 201
We will delete the rows containing "Other" in the gender column and "N/A" in the bmi column.
data<- data[data$gender != "Other",]
data<- data[data$bmi != "N/A",]
table(data$gender)
##
## Female Male
## 2897 2011
table(data$bmi)
##
## 10.3 11.3 11.5 12 12.3 12.8 13 13.2 13.3 13.4 13.5 13.7 13.8 13.9 14 14.1
## 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 5
## 14.2 14.3 14.4 14.5 14.6 14.8 14.9 15 15.1 15.2 15.3 15.4 15.5 15.6 15.7 15.8
## 4 3 2 2 4 4 1 2 8 4 4 3 5 3 3 5
## 15.9 16 16.1 16.2 16.3 16.4 16.5 16.6 16.7 16.8 16.9 17 17.1 17.2 17.3 17.4
## 5 8 8 10 10 11 4 8 11 7 7 10 12 11 9 13
## 17.5 17.6 17.7 17.8 17.9 18 18.1 18.2 18.3 18.4 18.5 18.6 18.7 18.8 18.9 19
## 7 16 13 7 7 16 12 9 17 10 12 19 13 15 8 7
## 19.1 19.2 19.3 19.4 19.5 19.6 19.7 19.8 19.9 20 20.1 20.2 20.3 20.4 20.5 20.6
## 11 13 9 14 21 7 8 17 9 17 25 16 17 23 18 15
## 20.7 20.8 20.9 21 21.1 21.2 21.3 21.4 21.5 21.6 21.7 21.8 21.9 22 22.1 22.2
## 12 17 13 17 16 16 21 22 27 16 14 18 14 15 22 30
## 22.3 22.4 22.5 22.6 22.7 22.8 22.9 23 23.1 23.2 23.3 23.4 23.5 23.6 23.7 23.8
## 18 21 13 18 25 25 16 27 21 20 19 36 31 24 16 22
## 23.9 24 24.1 24.2 24.3 24.4 24.5 24.6 24.7 24.8 24.9 25 25.1 25.2 25.3 25.4
## 24 28 28 29 26 23 26 22 22 31 27 27 34 18 28 26
## 25.5 25.6 25.7 25.8 25.9 26 26.1 26.2 26.3 26.4 26.5 26.6 26.7 26.8 26.9 27
## 33 21 15 24 24 25 37 27 23 34 30 29 37 21 34 35
## 27.1 27.2 27.3 27.4 27.5 27.6 27.7 27.8 27.9 28 28.1 28.2 28.3 28.4 28.5 28.6
## 28 24 36 22 29 37 37 23 28 28 29 25 30 38 27 27
## 28.7 28.8 28.9 29 29.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30 30.1 30.2
## 41 26 31 26 29 26 22 30 26 26 27 23 26 27 26 20
## 30.3 30.4 30.5 30.6 30.7 30.8 30.9 31 31.1 31.2 31.3 31.4 31.5 31.6 31.7 31.8
## 30 17 24 18 23 21 27 22 26 19 21 30 27 21 14 24
## 31.9 32 32.1 32.2 32.3 32.4 32.5 32.6 32.7 32.8 32.9 33 33.1 33.2 33.3 33.4
## 22 21 24 20 28 19 21 19 21 25 13 15 25 17 15 16
## 33.5 33.6 33.7 33.8 33.9 34 34.1 34.2 34.3 34.4 34.5 34.6 34.7 34.8 34.9 35
## 23 11 19 13 13 17 15 17 18 18 21 11 20 15 11 12
## 35.1 35.2 35.3 35.4 35.5 35.6 35.7 35.8 35.9 36 36.1 36.2 36.3 36.4 36.5 36.6
## 10 16 12 9 13 15 13 24 18 11 7 12 13 10 4 14
## 36.7 36.8 36.9 37 37.1 37.2 37.3 37.4 37.5 37.6 37.7 37.8 37.9 38 38.1 38.2
## 15 8 13 10 7 9 13 11 9 9 7 10 11 13 10 9
## 38.3 38.4 38.5 38.6 38.7 38.8 38.9 39 39.1 39.2 39.3 39.4 39.5 39.6 39.7 39.8
## 2 6 7 9 11 10 8 7 8 10 6 10 8 9 8 5
## 39.9 40 40.1 40.2 40.3 40.4 40.5 40.6 40.7 40.8 40.9 41 41.1 41.2 41.3 41.4
## 5 6 10 10 8 9 7 1 1 7 6 3 7 8 6 3
## 41.5 41.6 41.7 41.8 41.9 42 42.1 42.2 42.3 42.4 42.5 42.6 42.7 42.8 42.9 43
## 8 5 7 11 5 3 3 8 5 5 2 4 4 3 3 8
## 43.1 43.2 43.3 43.4 43.6 43.7 43.8 43.9 44 44.1 44.2 44.3 44.4 44.5 44.6 44.7
## 4 4 5 6 4 7 9 8 4 1 4 3 1 4 2 6
## 44.8 44.9 45 45.1 45.2 45.3 45.4 45.5 45.7 45.8 45.9 46 46.1 46.2 46.3 46.4
## 4 2 5 2 3 4 4 4 2 1 2 4 2 2 1 1
## 46.5 46.6 46.8 46.9 47.1 47.3 47.4 47.5 47.6 47.8 47.9 48 48.1 48.2 48.3 48.4
## 2 1 1 2 1 2 1 3 3 2 1 1 1 1 2 1
## 48.5 48.7 48.8 48.9 49.2 49.3 49.4 49.5 49.8 49.9 50.1 50.2 50.3 50.4 50.5 50.6
## 2 1 2 3 1 3 1 2 3 1 2 4 2 1 1 2
## 50.8 50.9 51 51.5 51.7 51.8 51.9 52.3 52.5 52.7 52.8 52.9 53.4 53.5 53.8 53.9
## 1 1 1 1 1 1 2 1 1 2 3 1 2 1 2 1
## 54 54.1 54.2 54.3 54.6 54.7 54.8 55 55.1 55.2 55.7 55.9 56 56.1 56.6 57.2
## 1 1 1 1 2 3 1 2 1 1 4 2 1 1 2 2
## 57.3 57.5 57.7 57.9 58.1 59.7 60.2 60.9 61.2 61.6 63.3 64.4 64.8 66.8 71.9 78
## 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## 92 97.6
## 1 1
data$gender <- as.factor(data$gender)
data$hypertension <- as.factor(data$hypertension)
data$heart_disease <- as.factor(data$heart_disease)
data$ever_married <- as.factor(data$ever_married)
data$bmi <- as.numeric(data$bmi)
data$stroke<- as.factor(data$stroke)
We will first use univariate analysis, where the data being analyzed consists of just one variable. Since it is a single variable, it does not deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
library(ggplot2)
library(gridExtra)
g1 <- ggplot(data = data, aes(x=gender,fill=gender))+geom_bar()
g2 <- ggplot(data = data, aes(x=age))+geom_histogram(binwidth = 20)
g3 <- ggplot(data = data, aes(x=ever_married,fill=ever_married))+geom_bar()
g4 <- ggplot(data = data, aes(x=work_type,fill=work_type))+geom_bar()+theme(axis.text.x =element_text(angle = 45) )
grid.arrange(g1,g2,g3,g4, ncol=2)
g5 <- ggplot(data, aes(hypertension, fill=hypertension))+geom_bar()
g6 <- ggplot(data, aes(heart_disease, fill=heart_disease))+geom_bar()
g7 <- ggplot(data, aes(stroke, fill=stroke))+geom_bar()
grid.arrange(g5,g6,g7, ncol=3)
Based on the graphs drawn above, we did not detect any outliers. We find
that the dataset contains more female adults, a roughly normally
distributed age group, more married adults, and more workers in private
organizations. We also find that the number of adults with hypertension,
heart disease, or stroke is much lower than the number of adults without
these diseases.
Multivariate visualizations and summary statistics allow us to assess the relationship between each variable in the dataset and the target variable we are interested in. Multivariate data arise from more than one variable, and multivariate EDA techniques generally show the relationships between two or more variables through cross-tabulation or statistics. Based on these relationships, we identify that the biggest impacts on the chance of getting a stroke involve average glucose level, working style, and smoking habits.
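As a concrete example of such a cross-tabulation, using the cleaned data frame from above (this mirrors the "Work type vs Stroke" bar chart drawn below):
# Cross-tabulate work type against stroke, then show within-group proportions.
tab <- table(data$work_type, data$stroke)
tab                                    # raw counts
round(prop.table(tab, margin = 1), 3)  # stroke proportion within each work type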
ggplot(data=data,aes(x=work_type, fill=stroke))+geom_bar(position="fill")+coord_cartesian(ylim = c(0,0.25))+ggtitle("Work type vs Stroke")
ggplot(data=data,aes(x=smoking_status, fill=stroke))+geom_bar(position="fill")+coord_cartesian(ylim = c(0,0.25))+ggtitle("Smoking status vs Stroke")
ggplot(data=data,aes(x=stroke, y=avg_glucose_level))+geom_violin()+ggtitle("Avg glucose level vs Stroke")
It is more common to see adults with a higher average glucose level get
a stroke. Self-employed adults have a higher chance of getting a stroke,
while adults who never worked have a lower chance. Smokers have a higher
chance of getting a stroke.
library(readr)
library(dplyr)
library(stringr)
library(tidyverse)
filePath = "C:\\Users\\Teng\\Desktop\\healthcare-dataset-stroke-data.csv"
healthcare_dataset_stroke_data <- read_csv(filePath)
## Rows: 5110 Columns: 12
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): gender, ever_married, work_type, Residence_type, bmi, smoking_status
## dbl (6): id, age, hypertension, heart_disease, avg_glucose_level, stroke
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(healthcare_dataset_stroke_data)
## Rows: 5,110
## Columns: 12
## $ id <dbl> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434~
## $ gender <chr> "Male", "Female", "Male", "Female", "Female", "Male"~
## $ age <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, ~
## $ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1~
## $ heart_disease <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0~
## $ ever_married <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No~
## $ work_type <chr> "Private", "Self-employed", "Private", "Private", "S~
## $ Residence_type <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"~
## $ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0~
## $ bmi <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "~
## $ smoking_status <chr> "formerly smoked", "never smoked", "never smoked", "~
## $ stroke <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
dim(healthcare_dataset_stroke_data)
## [1] 5110 12
head(healthcare_dataset_stroke_data)
## # A tibble: 6 x 12
## id gender age hypertension heart_disease ever_married work_type
## <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## # ... with 5 more variables: Residence_type <chr>, avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
Remove the id column.
new_healthcare_df <- subset(healthcare_dataset_stroke_data, select = -c(id))
head(new_healthcare_df)
## # A tibble: 6 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 Male 67 0 1 Yes Private Urban
## 2 Female 61 0 0 Yes Self-empl~ Rural
## 3 Male 80 0 1 Yes Private Rural
## 4 Female 49 0 0 Yes Private Urban
## 5 Female 79 1 0 Yes Self-empl~ Rural
## 6 Male 81 0 0 Yes Private Urban
## # ... with 4 more variables: avg_glucose_level <dbl>, bmi <chr>,
## # smoking_status <chr>, stroke <dbl>
new_healthcare_df[new_healthcare_df == ""]  # check for empty-string cells
## <unspecified> [0]
cname <- names(new_healthcare_df)
for (i in cname){
  # count the "N/A" strings in each column, then convert them to real NA
  print(paste(i, sum(new_healthcare_df[i] == "N/A")))
  new_healthcare_df[!is.na(new_healthcare_df[i]) & new_healthcare_df[i] == "N/A", i] <- NA
}
## [1] "gender 0"
## [1] "age 0"
## [1] "hypertension 0"
## [1] "heart_disease 0"
## [1] "ever_married 0"
## [1] "work_type 0"
## [1] "Residence_type 0"
## [1] "avg_glucose_level 0"
## [1] "bmi 201"
## [1] "smoking_status 0"
## [1] "stroke 0"
Thus, we observe that the bmi column has 201 "N/A" values.
bmi_sorted = new_healthcare_df[which(is.na(new_healthcare_df$bmi)),]
move_bmi = bmi_sorted %>% dplyr::select("bmi", everything())
move_bmi
## # A tibble: 201 x 11
## bmi gender age hypertension heart_disease ever_married work_type
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 <NA> Female 61 0 0 Yes Self-employed
## 2 <NA> Female 59 0 0 Yes Private
## 3 <NA> Male 78 0 1 Yes Private
## 4 <NA> Male 57 0 1 No Govt_job
## 5 <NA> Male 58 0 0 Yes Private
## 6 <NA> Male 59 0 0 Yes Private
## 7 <NA> Female 63 0 0 Yes Private
## 8 <NA> Female 75 0 1 No Self-employed
## 9 <NA> Female 76 0 0 No Private
## 10 <NA> Male 78 1 0 Yes Private
## # ... with 191 more rows, and 4 more variables: Residence_type <chr>,
## # avg_glucose_level <dbl>, smoking_status <chr>, stroke <dbl>
Users on Kaggle stated that the bmi attribute is not a very clear indicator of stroke. Hence, we will remove the 201 rows with missing bmi values, which form only a small portion of the data.
clean_data = na.omit(new_healthcare_df)
clean_data
## # A tibble: 4,909 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 Male 67 0 1 Yes Private Urban
## 2 Male 80 0 1 Yes Private Rural
## 3 Female 49 0 0 Yes Private Urban
## 4 Female 79 1 0 Yes Self-emp~ Rural
## 5 Male 81 0 0 Yes Private Urban
## 6 Male 74 1 1 Yes Private Rural
## 7 Female 69 0 0 No Private Urban
## 8 Female 78 0 0 Yes Private Urban
## 9 Female 81 1 0 Yes Private Rural
## 10 Female 61 0 1 Yes Govt_job Rural
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
The gender, ever_married, and Residence_type columns will be transformed into binary data for our data modelling step. However, we will first check the gender, Residence_type, work_type, and ever_married columns to see if any of them has more than 2 types of values.
unique_gender = unique(clean_data$gender)
unique_ever_married = unique(clean_data$ever_married)
unique_residence_type = unique(clean_data$Residence_type)
unique_work_type = unique(clean_data$work_type)
unique_gender
## [1] "Male" "Female" "Other"
unique_ever_married
## [1] "Yes" "No"
unique_residence_type
## [1] "Urban" "Rural"
unique_work_type
## [1] "Private" "Self-employed" "Govt_job" "children"
## [5] "Never_worked"
We notice the gender column has 3 types of values: "Male", "Female", and "Other". We transform them into Male = 0, Female = 1, Other = 2.
clean_data = clean_data %>%
mutate(gender = recode(
gender,
"Male" = "0",
"Female" = "1",
"Other" = "2"
))
clean_data
## # A tibble: 4,909 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 0 67 0 1 Yes Private Urban
## 2 0 80 0 1 Yes Private Rural
## 3 1 49 0 0 Yes Private Urban
## 4 1 79 1 0 Yes Self-emp~ Rural
## 5 0 81 0 0 Yes Private Urban
## 6 0 74 1 1 Yes Private Rural
## 7 1 69 0 0 No Private Urban
## 8 1 78 0 0 Yes Private Urban
## 9 1 81 1 0 Yes Private Rural
## 10 1 61 0 1 Yes Govt_job Rural
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
The work_type column has 5 types of values: "Private", "Self-employed", "Govt_job", "children", and "Never_worked". We transform them into Private = 0, Self-employed = 1, Govt_job = 2, children = 3, Never_worked = 4.
clean_data = clean_data %>%
mutate(work_type = recode(
work_type,
"Private" = "0",
"Self-employed" = "1",
"Govt_job" = "2",
"children" = "3",
"Never_worked" = "4"
))
clean_data
## # A tibble: 4,909 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 0 67 0 1 Yes 0 Urban
## 2 0 80 0 1 Yes 0 Rural
## 3 1 49 0 0 Yes 0 Urban
## 4 1 79 1 0 Yes 1 Rural
## 5 0 81 0 0 Yes 0 Urban
## 6 0 74 1 1 Yes 0 Rural
## 7 1 69 0 0 No 0 Urban
## 8 1 78 0 0 Yes 0 Urban
## 9 1 81 1 0 Yes 0 Rural
## 10 1 61 0 1 Yes 2 Rural
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
Transform the Residence_type and ever_married columns into binary data.
clean_data$Residence_type <- ifelse(clean_data$Residence_type == "Urban", 1, 0)
clean_data$ever_married <- ifelse(clean_data$ever_married == "Yes", 1, 0)
clean_data
## # A tibble: 4,909 x 11
## gender age hypertension heart_disease ever_married work_type Residence_type
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 0 67 0 1 1 0 1
## 2 0 80 0 1 1 0 0
## 3 1 49 0 0 1 0 1
## 4 1 79 1 0 1 1 0
## 5 0 81 0 0 1 0 1
## 6 0 74 1 1 1 0 0
## 7 1 69 0 0 0 0 1
## 8 1 78 0 0 1 0 1
## 9 1 81 1 0 1 0 0
## 10 1 61 0 1 1 2 0
## # ... with 4,899 more rows, and 4 more variables: avg_glucose_level <dbl>,
## # bmi <chr>, smoking_status <chr>, stroke <dbl>
To start with data modelling, let's assume that the data is not labelled and see what a clustering algorithm can tell us if we ask for 2 clusters in the dataset.
Ideally, the 2 clusters should be one with stroke and one without stroke. Clustering algorithms normally work best with continuous variables, but we have categorical variables here, so we use the Gower distance to measure the distance between two data points.
CRAN has detailed documentation about the Gower distance, available at https://cran.r-project.org/web/packages/gower/vignettes/intro.pdf
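As a small illustration of how the Gower distance handles mixed data (the tiny data frame here is made up purely for demonstration), cluster::daisy() can compute it directly:
# Toy mixed-type data: a numeric column and a categorical column.
library(cluster)
toy <- data.frame(age    = c(25, 60, 61),
                  smoker = factor(c("yes", "no", "no")))
daisy(toy, metric = "gower")
# Numeric columns are range-normalized; categorical columns contribute 0 when
# levels match and 1 otherwise, so rows 2 and 3 come out close together.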
k-means clustering works by partitioning n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid). All data points are treated as vectors, and the distance between data points can be measured using various metrics (e.g., Euclidean or Manhattan distance).
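A minimal sketch of kmeans() on made-up numeric data shows the mechanics:
# Two obvious groups of 2-D points; kmeans should recover one centroid each.
set.seed(1)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
km <- kmeans(pts, centers = 2, nstart = 25)
km$centers   # one centroid per cluster
km$cluster   # cluster assignment of each observation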
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(compareGroups)
set.seed(666)
clean_data$gender <- as.factor(clean_data$gender)
clean_data$ever_married <- as.factor(clean_data$ever_married)
clean_data$Residence_type <- as.factor(clean_data$Residence_type)
clean_data$smoking_status <- as.factor(clean_data$smoking_status)
clean_data$stroke <- as.factor(clean_data$stroke)
clean_data$heart_disease <- as.factor(clean_data$heart_disease)
clean_data$hypertension <- as.factor(clean_data$hypertension)
clean_data$bmi <- as.numeric(clean_data$bmi)
df = subset(clean_data, select = c(bmi,avg_glucose_level, age,
hypertension, smoking_status,
stroke) )
df <- na.omit(df)
distMat<-daisy(df,metric = "gower")
k2 <- kmeans(distMat, centers = 2, nstart = 25)
df$cluster<-k2$cluster
group<-compareGroups(cluster~.,data=df)
clustab<-createTable(group)
clustab
##
## --------Summary descriptives table by 'cluster'---------
##
## ______________________________________________________
## 1 2 p.overall
## N=691 N=4218
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## bmi 32.6 (8.27) 28.3 (7.62) <0.001
## avg_glucose_level 143 (65.3) 99.1 (36.4) <0.001
## age 64.4 (13.5) 39.3 (21.8) <0.001
## hypertension: 0.000
## 0 240 (34.7%) 4218 (100%)
## 1 451 (65.3%) 0 (0.00%)
## smoking_status: <0.001
## formerly smoked 203 (29.4%) 634 (15.0%)
## never smoked 286 (41.4%) 1566 (37.1%)
## smokes 132 (19.1%) 605 (14.3%)
## Unknown 70 (10.1%) 1413 (33.5%)
## stroke: <0.001
## 0 482 (69.8%) 4218 (100%)
## 1 209 (30.2%) 0 (0.00%)
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
After that, we will use a supervised learning method, classification, to predict whether an individual will get a stroke or not.
Random forest is an ensemble learning method that constructs many
decision trees at training time. It can be used for classification as
well as regression tasks.
For classification, the output is the class selected by the most
trees.
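The majority vote can be inspected directly: randomForest's predict() accepts predict.all = TRUE, which returns each tree's vote alongside the aggregated class. A sketch, assuming the rf_model and data_test objects created further below:
# Per-tree votes for the first test observation (illustrative only).
pred_all <- predict(rf_model, newdata = data_test, predict.all = TRUE)
pred_all$aggregate[1]            # majority-vote class for the first row
table(pred_all$individual[1, ])  # how the individual trees voted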
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:openintro':
##
## dotPlot
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
library(ROSE)
## Loaded ROSE 0.0-4
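# The stroke classes are heavily imbalanced (only ~4% positive cases), so we
# oversample the minority class with ROSE's ovun.sample() until it makes up
# roughly 30% of the data (p = 0.3).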
oversampled_data <- ovun.sample(as.factor(stroke)~.,data = clean_data, method = 'over',p = 0.3)$data
sample_index <- createDataPartition(oversampled_data$stroke, p = 0.7,
list = FALSE,
times = 1)
data_train <- oversampled_data[sample_index,]
data_test <- oversampled_data[-sample_index,]
rf_model <- randomForest(stroke~.,data = data_train,ntree = 1000,mtry = 5)
## Model performance using training data
pred_train <- predict(rf_model)
result_train <- confusionMatrix(pred_train, data_train$stroke)
(result_train$overall)['Accuracy']
## Accuracy
## 0.9849608
(result_train$byClass)['Sensitivity']
## Sensitivity
## 0.9796353
(result_train$byClass)['Specificity']
## Specificity
## 0.9972048
(result_train$byClass)['F1']
## F1
## 0.9891054
(result_train$byClass)['Recall']
## Recall
## 0.9796353
fourfoldplot(result_train$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Random Forest Classifier) Confusion Matrix - Train")
## Model performance using testing data
pred_test <- predict(rf_model, newdata = data_test)
result_test <- confusionMatrix(pred_test, data_test$stroke)
(result_test$overall)['Accuracy']
## Accuracy
## 0.9841819
(result_test$byClass)['Sensitivity']
## Sensitivity
## 0.9787234
(result_test$byClass)['Specificity']
## Specificity
## 0.9967374
(result_test$byClass)['F1']
## F1
## 0.9885387
(result_test$byClass)['Recall']
## Recall
## 0.9787234
fourfoldplot(result_test$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Random Forest Classifier) Confusion Matrix - Test")
A Support Vector Machine works by constructing a hyperplane (or a set of hyperplanes) that separates the data points of different classes as distinctly as possible.
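A toy linear-kernel example (the two point clouds are made up and linearly separable) shows the idea before we apply it to our data:
# Two separable groups; ksvm with the linear (vanilladot) kernel finds the
# separating hyperplane, defined by a handful of support vectors.
library(kernlab)
set.seed(2)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 4), ncol = 2))
y <- factor(rep(c("A", "B"), each = 10))
toy_svm <- ksvm(x, y, kernel = "vanilladot")
predict(toy_svm, x)  # recovers the two groups
SVindex(toy_svm)     # indices of the support vectors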
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
svm_model <- ksvm(stroke~.,data = data_train, kernel="vanilladot")
## Setting default kernel parameters
## Model performance using training data
pred_train <- predict(svm_model)
result_train <- confusionMatrix(pred_train, data_train$stroke)
(result_train$overall)['Accuracy']
## Accuracy
## 0.798136
(result_train$byClass)['Sensitivity']
## Sensitivity
## 0.8604863
(result_train$byClass)['Specificity']
## Specificity
## 0.6547869
(result_train$byClass)['F1']
## F1
## 0.8559335
(result_train$byClass)['Recall']
## Recall
## 0.8604863
fourfoldplot(result_train$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Support Vector Machine Classifier) Confusion Matrix - Train")
## Model performance using testing data
pred_test <- predict(svm_model, newdata = data_test)
result_test <- confusionMatrix(pred_test, data_test$stroke)
(result_test$overall)['Accuracy']
## Accuracy
## 0.7874444
(result_test$byClass)['Sensitivity']
## Sensitivity
## 0.8595745
(result_test$byClass)['Specificity']
## Specificity
## 0.6215334
(result_test$byClass)['F1']
## F1
## 0.8493343
(result_test$byClass)['Recall']
## Recall
## 0.8595745
fourfoldplot(result_test$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Support Vector Machine Classifier) Confusion Matrix - Test")
The Naive Bayes classifier is a probabilistic machine learning algorithm based on Bayes' theorem.
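A worked toy example (invented counts) shows the rule P(class | x) ∝ P(x | class) P(class), and that e1071's naiveBayes() reproduces the hand calculation:
# Toy data: 2 of 6 rows have stroke = 1, both of them smokers.
library(e1071)
toy <- data.frame(smoker = factor(c("yes", "yes", "no", "no", "no", "yes")),
                  stroke = factor(c(1, 1, 0, 0, 0, 0)))
m <- naiveBayes(stroke ~ smoker, data = toy)
predict(m, data.frame(smoker = factor("yes", levels = c("no", "yes"))),
        type = "raw")
# By hand: P(1|yes) = P(yes|1)P(1) / [P(yes|1)P(1) + P(yes|0)P(0)]
#                   = (1 * 2/6) / (1 * 2/6 + 1/4 * 4/6) = 2/3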
nb_model <- naiveBayes(stroke~.,data = data_train)
## Model performance using training data
pred_train <- predict(nb_model, data_train)
result_train <- confusionMatrix(pred_train, data_train$stroke)
(result_train$overall)['Accuracy']
## Accuracy
## 0.7665749
(result_train$byClass)['Sensitivity']
## Sensitivity
## 0.7829787
(result_train$byClass)['Specificity']
## Specificity
## 0.7288609
(result_train$byClass)['F1']
## F1
## 0.8237928
(result_train$byClass)['Recall']
## Recall
## 0.7829787
fourfoldplot(result_train$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Naive Bayes Classifier) Confusion Matrix - Train")
## Model performance using testing data
pred_test <- predict(nb_model, newdata = data_test)
result_test <- confusionMatrix(pred_test, data_test$stroke)
(result_test$overall)['Accuracy']
## Accuracy
## 0.7572912
(result_test$byClass)['Sensitivity']
## Sensitivity
## 0.7815603
(result_test$byClass)['Specificity']
## Specificity
## 0.7014682
(result_test$byClass)['F1']
## F1
## 0.8178108
(result_test$byClass)['Recall']
## Recall
## 0.7815603
fourfoldplot(result_test$table, color = c("firebrick3", "green3"),
conf.level = 0, margin = 1, main = "(Naive Bayes Classifier) Confusion Matrix - Test")
With this dataset we can also perform regression on the average glucose level.
Linear regression works by fitting a linear equation from the input variables to the output. Hence, linear regression might not work well on datasets with a non-linear nature.
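That caveat is easy to demonstrate on made-up data: a straight line fits a quadratic relationship poorly, which shows up as a low R-squared:
# y depends on x quadratically; the plain linear fit misses the curve.
set.seed(3)
x <- runif(200, -3, 3)
y <- x^2 + rnorm(200, sd = 0.5)
summary(lm(y ~ x))$r.squared           # near 0: linear term alone fails
summary(lm(y ~ poly(x, 2)))$r.squared  # near 1 once a quadratic term is added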
#LR all variable
linear_regression_str <- lm(avg_glucose_level~.,data = data_train)
summary(linear_regression_str)
##
## Call:
## lm(formula = avg_glucose_level ~ ., data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.72 -31.73 -10.31 25.23 167.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.78289 4.46810 10.694 < 2e-16 ***
## gender1 -8.62641 1.42474 -6.055 1.52e-09 ***
## gender2 55.74107 47.42137 1.175 0.239877
## age 0.42468 0.05193 8.178 3.69e-16 ***
## hypertension1 13.50414 2.07378 6.512 8.20e-11 ***
## heart_disease1 28.44731 2.60689 10.912 < 2e-16 ***
## ever_married1 5.22690 2.04181 2.560 0.010500 *
## work_type1 -6.69649 1.94008 -3.452 0.000562 ***
## work_type2 -3.27198 2.09254 -1.564 0.117969
## work_type3 22.14089 3.37243 6.565 5.76e-11 ***
## work_type4 11.64373 14.42166 0.807 0.419490
## Residence_type1 -0.51483 1.38203 -0.373 0.709525
## bmi 1.28412 0.10201 12.589 < 2e-16 ***
## smoking_statusnever smoked 0.82478 1.94800 0.423 0.672022
## smoking_statussmokes -0.12377 2.38445 -0.052 0.958605
## smoking_statusUnknown -2.71304 2.32437 -1.167 0.243184
## stroke1 14.13256 1.81960 7.767 9.81e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.34 on 4704 degrees of freedom
## Multiple R-squared: 0.1953, Adjusted R-squared: 0.1926
## F-statistic: 71.36 on 16 and 4704 DF, p-value: < 2.2e-16
#LR select significant p
linear_regression_model <- lm(avg_glucose_level~gender+age+hypertension+heart_disease+bmi+stroke,data = data_train)
summary(linear_regression_model)
##
## Call:
## lm(formula = avg_glucose_level ~ gender + age + hypertension +
## heart_disease + bmi + stroke, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -133.60 -32.26 -10.98 25.06 167.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.19731 3.03674 20.811 < 2e-16 ***
## gender1 -9.11825 1.41950 -6.424 1.46e-10 ***
## gender2 47.85114 47.61372 1.005 0.315
## age 0.27898 0.03671 7.600 3.56e-14 ***
## hypertension1 13.90457 2.05859 6.754 1.61e-11 ***
## heart_disease1 29.77206 2.58662 11.510 < 2e-16 ***
## bmi 1.11732 0.09531 11.723 < 2e-16 ***
## stroke1 15.98919 1.79695 8.898 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.59 on 4713 degrees of freedom
## Multiple R-squared: 0.1852, Adjusted R-squared: 0.184
## F-statistic: 153 on 7 and 4713 DF, p-value: < 2.2e-16
pred_test <- predict(linear_regression_model,data_test)
actuals_preds <- data.frame(cbind(actuals=data_test$avg_glucose_level, predicteds=pred_test))
head(actuals_preds)
## actuals predicteds
## 2 87.96 137.08189
## 3 110.89 75.97577
## 10 205.84 149.63411
## 11 77.08 115.22496
## 12 57.08 139.59010
## 15 95.04 112.61279
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
min_max_accuracy
## [1] 0.7416867
# MAPE Calculation
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)
mape
## [1] 0.3498599
From the results, we can observe which variables are most important
by looking at their p-values. The important features identified
are
- gender
- age
- hypertension
- heart_disease
- bmi
- stroke
Random forest is an ensemble learning method that constructs many
decision trees at training time. It can be used for classification as
well as regression tasks.
For regression, the mean (average) prediction of the individual trees
is used as the output.
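This averaging can be verified directly with predict.all = TRUE (a sketch, assuming the rf_regression_model fitted just below):
# For regression, the forest prediction is the mean of the per-tree predictions.
p <- predict(rf_regression_model, newdata = data_test, predict.all = TRUE)
p$aggregate[1]           # forest prediction for the first test row
mean(p$individual[1, ])  # the same value: mean over the individual trees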
rf_regression_model <- randomForest(avg_glucose_level~.,data = data_train,ntree = 1000,mtry = 5)
pred_test <- predict(rf_regression_model,data_test)
actuals_preds <- data.frame(cbind(actuals=data_test$avg_glucose_level, predicteds=pred_test))
head(actuals_preds)
## actuals predicteds
## 2 87.96 151.15280
## 3 110.89 89.30541
## 10 205.84 166.39514
## 11 77.08 90.70945
## 12 57.08 104.59448
## 15 95.04 101.60528
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
min_max_accuracy
## [1] 0.8337048
# MAPE Calculation
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)
mape
## [1] 0.2168683
svm_regression_model <- svm(avg_glucose_level~.,data = data_train)
pred_test <- predict(svm_regression_model,data_test)
actuals_preds <- data.frame(cbind(actuals=data_test$avg_glucose_level, predicteds=pred_test))
head(actuals_preds)
## actuals predicteds
## 2 87.96 135.16215
## 3 110.89 83.44734
## 10 205.84 184.80977
## 11 77.08 104.01751
## 12 57.08 102.74096
## 15 95.04 105.42116
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
min_max_accuracy
## [1] 0.7793447
# MAPE Calculation
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)
mape
## [1] 0.2720404
The k-means clustering algorithm identified two clusters:
one without stroke, and one with around 30% occurrence of stroke.
We can observe that the cluster without stroke has
- a lower average blood glucose level
- a lower rate of hypertension
- a lower smoking rate
- a lower bmi
- a lower age
We experimented with 3 classification algorithms (Random Forest
Classifier, Support Vector Classifier, Naive Bayes Classifier) to
predict whether a patient will get a stroke or not.
The Random Forest Classifier performed the best, with accuracy above
95% and very few false positives and false negatives.
The performance of the Support Vector Classifier and the Naive Bayes
Classifier is similar.
However, the models' parameters were not tuned, so we can expect a
performance improvement for the Support Vector Classifier after
hyperparameter tuning.
We experimented with 3 regression algorithms (Linear Regression,
Support Vector Regression, Random Forest Regression) to predict the
average glucose level of a patient.
Random Forest Regression performed the best, with the lowest mean
absolute percentage error (below 0.25).
Linear Regression performed relatively poorly, which might be due to
the non-linear nature of the dataset.
Support Vector Regression also performed relatively poorly, but since
it can model non-linear data, further hyperparameter tuning might
improve its regression performance.
Factors that increase the risk of developing a stroke include high
blood sugar levels, high blood pressure, smoking habits, and excess
weight.
People who have good living habits, do not smoke, and exercise
regularly are less likely to get a stroke. On the contrary, a person
who smokes regularly, lacks exercise, and is overweight is more prone
to stroke.
We experimented with 3 classification algorithms (Random Forest
Classifier, Support Vector Classifier, Naive Bayes Classifier) to
predict whether a patient will get a stroke or not; the Random Forest
Classifier performed the best, with accuracy above 95%.
We experimented with 3 regression algorithms (Linear Regression, Support Vector Regression, Random Forest Regression) to predict the average glucose level of a patient; Random Forest Regression performed the best, with the lowest mean absolute percentage error (below 0.25).