#Load packages
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.1.3
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.1.3
library(ggplot2)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.3
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.1.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ROCit)
## Warning: package 'ROCit' was built under R version 4.1.3
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
#load data
strokeData <- read.csv("./StrokeData.csv")
head(strokeData)
## id gender age hypertension heart_disease ever_married work_type
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 2 Rural 202.21 N/A never smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24 never smoked 1
## 6 Urban 186.21 29 formerly smoked 1
See the dimension of the data
dim(strokeData)
## [1] 5110 12
The dataset has 5110 instances and 12 attributes.
str(strokeData)
## 'data.frame': 5110 obs. of 12 variables:
## $ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
## $ gender : chr "Male" "Female" "Male" "Female" ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : chr "36.6" "N/A" "32.5" "34.4" ...
## $ smoking_status : chr "formerly smoked" "never smoked" "never smoked" "smokes" ...
## $ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
Each attribute is further described below:
*Note: “Unknown” in smoking_status means that the information is unavailable for this patient
summary(strokeData)
## id gender age hypertension
## Min. : 67 Length:5110 Min. : 0.08 Min. :0.00000
## 1st Qu.:17741 Class :character 1st Qu.:25.00 1st Qu.:0.00000
## Median :36932 Mode :character Median :45.00 Median :0.00000
## Mean :36518 Mean :43.23 Mean :0.09746
## 3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
## Max. :72940 Max. :82.00 Max. :1.00000
## heart_disease ever_married work_type Residence_type
## Min. :0.00000 Length:5110 Length:5110 Length:5110
## 1st Qu.:0.00000 Class :character Class :character Class :character
## Median :0.00000 Mode :character Mode :character Mode :character
## Mean :0.05401
## 3rd Qu.:0.00000
## Max. :1.00000
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.12 Length:5110 Length:5110 Min. :0.00000
## 1st Qu.: 77.25 Class :character Class :character 1st Qu.:0.00000
## Median : 91.89 Mode :character Mode :character Median :0.00000
## Mean :106.15 Mean :0.04873
## 3rd Qu.:114.09 3rd Qu.:0.00000
## Max. :271.74 Max. :1.00000
According to the output above:
1.The stroke mean is 0.04, implying that just 4% of persons have a stroke.
2.The hypertension mean is 0.097, implying that there are 9.7% of persons have a hypertension.
3.The heart_disease mean is 0.054, implying that there are 5.4% of persons have a heart_disease.
4.The oldest age in this data is 82, while the variable’s average age is 43.2.
5.The average glucose level in this dataset ranges from 55.12 to 271.74. In this data, the average of avg_glucose_level variable is 106.15.
6.Some character data type columns must be transformed to factors or integers.
Then I see the unique value for each attribute here.
sapply(strokeData, function(x)length(unique(x)))
## id gender age hypertension
## 5110 3 104 2
## heart_disease ever_married work_type Residence_type
## 2 2 5 2
## avg_glucose_level bmi smoking_status stroke
## 3979 419 4 2
We need to transform the data type of variables on the attributes of
gender,
hypertension,heart_disease,
ever_married, work_type,
Residence_type, smoking_status, and
stroke to be a factor data type based on the output above
since these attributes are categorical variables.
So I changed the data type of those attribute here, as well as bmi’s data type from character to numeric, because bmi is supposed to hold numbers rather than characters.
#change the stroke variable data type from integer to factor and assign a value of 1 to Yes and a value of 0 to No
strokeData$stroke <- as.factor(strokeData$stroke)
levels(strokeData$stroke)[levels(strokeData$stroke)== 1] <- "Yes"
levels(strokeData$stroke)[levels(strokeData$stroke)== 0] <- "No"
#change the gender variable data type from character to factor
strokeData$gender <- as.factor(strokeData$gender)
#change the hypertension variable data type from integer to factor and assign a value of 1 to Yes and a value of 0 to No
strokeData$hypertension <- as.factor(strokeData$hypertension)
levels(strokeData$hypertension)[levels(strokeData$hypertension)== 1] <- "Yes"
levels(strokeData$hypertension)[levels(strokeData$hypertension)== 0] <- "No"
#change the heart_disease variable data type from integer to factor and assign a value of 1 to Yes and a value of 0 to No
strokeData$heart_disease <- as.factor(strokeData$heart_disease)
levels(strokeData$heart_disease)[levels(strokeData$heart_disease)== 1] <- "Yes"
levels(strokeData$heart_disease)[levels(strokeData$heart_disease)== 0] <- "No"
#change the ever_married variable data type from character to factor
strokeData$ever_married <- as.factor(strokeData$ever_married)
#change the work_type variable data type from character to factor
strokeData$work_type <- as.factor(strokeData$work_type)
#change the Residence_type variable data type from character to factor
strokeData$Residence_type <- as.factor(strokeData$Residence_type)
#change the smoking_status variable data type from character to factor
strokeData$smoking_status <- as.factor(strokeData$smoking_status)
#change the bmi variable data type from character to numeric
strokeData$bmi <- as.numeric(strokeData$bmi)
## Warning: NAs introduced by coercion
Next I check to see whether the data type of the variable i aim to change has changed.
#check whether the variables data type has changed or not
sapply(strokeData, class)
## id gender age hypertension
## "integer" "factor" "numeric" "factor"
## heart_disease ever_married work_type Residence_type
## "factor" "factor" "factor" "factor"
## avg_glucose_level bmi smoking_status stroke
## "numeric" "numeric" "factor" "factor"
The data type of the variable we intend to modify has already changed, as we can see above.
Then I proceed to examine the description for each variable in the dataset.
describe(strokeData)
## strokeData
##
## 12 Variables 5110 Observations
## --------------------------------------------------------------------------------
## id
## n missing distinct Info Mean Gmd .05 .10
## 5110 0 5110 1 36518 24436 3590 6972
## .25 .50 .75 .90 .95
## 17741 36932 54682 65668 69218
##
## lowest : 67 77 84 91 99, highest: 72911 72914 72915 72918 72940
## --------------------------------------------------------------------------------
## gender
## n missing distinct
## 5110 0 3
##
## Value Female Male Other
## Frequency 2994 2115 1
## Proportion 0.586 0.414 0.000
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 5110 0 104 1 43.23 26.03 5 11
## .25 .50 .75 .90 .95
## 25 45 61 75 79
##
## lowest : 0.08 0.16 0.24 0.32 0.40, highest: 78.00 79.00 80.00 81.00 82.00
## --------------------------------------------------------------------------------
## hypertension
## n missing distinct
## 5110 0 2
##
## Value No Yes
## Frequency 4612 498
## Proportion 0.903 0.097
## --------------------------------------------------------------------------------
## heart_disease
## n missing distinct
## 5110 0 2
##
## Value No Yes
## Frequency 4834 276
## Proportion 0.946 0.054
## --------------------------------------------------------------------------------
## ever_married
## n missing distinct
## 5110 0 2
##
## Value No Yes
## Frequency 1757 3353
## Proportion 0.344 0.656
## --------------------------------------------------------------------------------
## work_type
## n missing distinct
## 5110 0 5
##
## lowest : children Govt_job Never_worked Private Self-employed
## highest: children Govt_job Never_worked Private Self-employed
##
## Value children Govt_job Never_worked Private
## Frequency 687 657 22 2925
## Proportion 0.134 0.129 0.004 0.572
##
## Value Self-employed
## Frequency 819
## Proportion 0.160
## --------------------------------------------------------------------------------
## Residence_type
## n missing distinct
## 5110 0 2
##
## Value Rural Urban
## Frequency 2514 2596
## Proportion 0.492 0.508
## --------------------------------------------------------------------------------
## avg_glucose_level
## n missing distinct Info Mean Gmd .05 .10
## 5110 0 3979 1 106.1 45.38 60.71 65.79
## .25 .50 .75 .90 .95
## 77.24 91.88 114.09 192.18 216.29
##
## lowest : 55.12 55.22 55.23 55.25 55.26, highest: 266.59 267.60 267.61 267.76 271.74
## --------------------------------------------------------------------------------
## bmi
## n missing distinct Info Mean Gmd .05 .10
## 4909 201 418 1 28.89 8.538 17.64 19.70
## .25 .50 .75 .90 .95
## 23.50 28.10 33.10 38.90 42.96
##
## lowest : 10.3 11.3 11.5 12.0 12.3, highest: 66.8 71.9 78.0 92.0 97.6
## --------------------------------------------------------------------------------
## smoking_status
## n missing distinct
## 5110 0 4
##
## Value formerly smoked never smoked smokes Unknown
## Frequency 885 1892 789 1544
## Proportion 0.173 0.370 0.154 0.302
## --------------------------------------------------------------------------------
## stroke
## n missing distinct
## 5110 0 2
##
## Value No Yes
## Frequency 4861 249
## Proportion 0.951 0.049
## --------------------------------------------------------------------------------
According to the output above:
Exploring Categorical Variables
plot(strokeData$stroke, col = "darkgoldenrod3",main = "Stroke Distribution")
According to the output above, there are more individuals who do not suffer from stroke than those who do.
ggplot(strokeData, aes(x = gender, fill = stroke)) + geom_bar(position = "dodge")+ scale_fill_manual(values=c("#FFCC66", "#CC0000")) + ggtitle("Stroke Distribution by Gender")
There are more women than men in the data set. More women suffer from stroke than men.
ggplot(strokeData, aes(x = Residence_type, fill = stroke)) + geom_bar(position = "dodge")+ scale_fill_manual(values=c("#FFCC99", "#FF9933")) + ggtitle("Stroke Distribution by Residence Type")
From the output above, there is no much difference in each percentage having a stroke or not based on the type of residence
ggplot(strokeData, aes(x = ever_married, fill = stroke)) + geom_bar(position = "dodge")+ scale_fill_manual(values=c("#CCCCFF", "#CC99FF")) + ggtitle("Stroke Distribution by Married Status")
In the data set, there are more ever-married persons than never-married people. People who have been married are more likely to have a stroke than those who have not.
ggplot(strokeData, aes(x = hypertension, fill = stroke)) + geom_bar(position = "dodge")+ scale_fill_manual(values=c("#66CCCC", "#336666")) + ggtitle("Stroke Distribution by Hypertension")
According to the plot above, patients with hypertension suffer fewer strokes than those who do not.
ggplot(strokeData, aes(x = heart_disease, fill = stroke)) + geom_bar(position = "dodge")+ scale_fill_manual(values=c("#FF9999", "#990033")) + ggtitle("Stroke Distribution by Heart Disease")
Patients with heart disease suffered fewer strokes than those who did not.
mosaicplot(strokeData$stroke~strokeData$work_type,las=1,off=10,xlab="Stroke",ylab="Work Type",main="Stroke VS Work_Type",color=colors()[145:150])
The majority of patients with stroke work in the government sector, private, or are self-employed.
mosaicplot(strokeData$stroke~strokeData$smoking_status,las=1,off=5,xlab="Stroke",ylab="Smoking Status",main="Stroke VS Smoking_status",color=colors()[20:30])
Patients who formely smoked or who have never smoked had higher stroke occurrences than active smokers. However, we must keep in mind that a significant chunk of the data, represented by the unknown category, lacks a clear record of the patient’s smoking status.
Exploring Numerical Variables
ggplot(strokeData, aes(x = age , fill = stroke)) + geom_histogram(bindwidth = 30) + scale_fill_manual(values=c("#9ebcda", "#8856a7")) + ggtitle("Stroke Distribution by Age")
## Warning: Ignoring unknown parameters: bindwidth
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
According to the plot above, the older a person is, the more probable that person is to suffer a stroke. When a person is around 50 or older, the chance of suffering a stroke increases considerably.
ggplot(strokeData, aes(x = avg_glucose_level , fill = stroke)) + geom_histogram(bindwidth = 30) + scale_fill_manual(values=c("#feb24c", "#f03b20")) + ggtitle("Stroke Distribution by Average Glucose Level")
## Warning: Ignoring unknown parameters: bindwidth
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The plot above shows that there is no apparent relationship between avg glucose level and stroke risk. However, as we all know, high glucose levels of more than 180-200mg/dL can cause diabetes, which can lead to a stroke. A diabetic individual is 1.5 times more likely to have a stroke and this might be a very plausible explanation for the above plot.
ggplot(strokeData, aes(x = bmi , fill = stroke)) + geom_histogram(bindwidth = 30) + scale_fill_manual(values=c("#c994c7", "#dd1c77")) + ggtitle("Stroke Distribution by Body Mass Index")
## Warning: Ignoring unknown parameters: bindwidth
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 201 rows containing non-finite values (stat_bin).
According to the results above, people with a BMI of 20 to 40 are at risk of stroke. As a result, I can conclude that having a higher BMI does not raise the chance of having a stroke.
Checking Data Anomalies
AgeBoxplot <- boxplot(strokeData$age,main = "Age Boxplot")
paste(length(AgeBoxplot$out))
## [1] "0"
There are no outliers detected by boxplot in the age variable.
glucoseBoxplot <- boxplot(strokeData$avg_glucose_level, main = 'Avg Glucose Level Boxplot')
paste(length(glucoseBoxplot$out))
## [1] "627"
There are 627 data points detected as outliers by boxplot in the avg_glucose_level variable.
bmiBoxplot <- boxplot(strokeData$bmi, main = 'bmi Boxplot')
paste(length(bmiBoxplot$out))
## [1] "110"
There are 110 data points detected as outliers by boxplot in the bmi variable.
Hereby, I also check the normality of all variables
hist.data.frame(strokeData)
From the output above, only bmi variables is roughly bell-shaped
Next, I check to see whether there is any missing data in this dataset.
#check missing value
sapply(strokeData, function(x)sum(is.na(x)))
## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 201 0 0
The bmi variable has 201 missing values, which was discovered above.
Explaination of my findings:
Utilizing eda, I may infer that my findings are:
This data is stroke data which contains 5110 instances and 12
attributes. Those attributes are id, gender,
age,hypertension, heart_disease,
ever_married, work_type,
Residence_type,
avg_glucose_level,bmi,
smoking_status, and ‘stroke’.
To predict what factors influence a person’s stroke, I will utilize the stroke variable as the dependent variable.
There are more women than men in the data set.
Stroke are becoming more common among female than male
A person’s type of residence has no bearing on whether or not they have a stroke.
People who have been married are more likely than those who have not to suffer a stroke.
People with hypertension suffer fewer strokes than those who do not.
People with heart disease suffer fewer strokes than those who do not.
The majority of patients with stroke work in the government sector, private, or are self-employed.
People who have never smoked or who formerly smoked have a greater stroke risk than active smokers. However, we must keep in mind that the majority of the data, which are represented by unknown categories, do not contain a clear record of the patients’ smoking status. As a result, before we begin modeling, we must process the categories in the smoking status variable.
A person’s risk of having a stroke increases with age.
There is no clear relation between average glucose levels and stroke risk. High glucose levels of more than 180-200mg/dL, on the other hand, can produce diabetes, which can lead to a stroke. Diabetics are 1.5 times more likely to suffer a stroke.
Stroke is more likely happen to people with a BMI of 20 to 40.
We need to handle the 201 missing values in the bmi variable before we begin modeling.
As we can see from Eda, there are attributes that we need to fix. The handling will be done in the data preparation section below.
The first thing I do is removing the ‘id’ column because it contains as many unique values as there are data points and its useless for predicting whether someone would have a stroke or not.
I handle missing values and unusual values in two ways at this data preparation stage: the first is to delete the missing and unusual value, and the second is to replace the missing and unusual value.I do this in these two ways because I don’t know what the best method to handle missing values at this point in order to generate a decent model later, because we haven’t constructed the model yet.
Missing and unusual values to be handled are ‘other’ levels in gender attributes, ‘unknown’ levels in smoking status attributes, and missing values in bmi attributes. I handled ‘other’ levels in gender attribute since there is only 1 row of Other in column gender. I handled the level of ‘unknown’ in smoking status because “unknown” in smoking status indicates that the information is not available to the patient and the number of unknowns themselves are quite large. And I handle the bmi attribute’s missing value such that it is not biased
FIRST WAY TO HANDLE MISSING AND UNUSUAL VALUE: In the first method here, I remove ‘other’ levels in gender attributes, ‘unknown’ levels in smoking status attributes, and any rows having missing values in bmi attributes.
#Put the original data into a new variable that will be altered.
UpdateStrokeDataDropNA = strokeData
#Drop 'id' columns as i said before that id columns is useless
UpdateStrokeDataDropNA <- subset(UpdateStrokeDataDropNA, select = c("gender","age","hypertension","heart_disease","ever_married","work_type","Residence_type","avg_glucose_level","bmi","smoking_status","stroke"))
#check missing value
sapply(UpdateStrokeDataDropNA, function(x)sum(is.na(x)))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 201 0 0
#Drop 'Unknown' levels at smoking_status attribute
UpdateStrokeDataDropNA <- UpdateStrokeDataDropNA %>% filter(smoking_status!='Unknown')%>%droplevels()
#Drop 'Other' levels at gender attribute
UpdateStrokeDataDropNA <- UpdateStrokeDataDropNA %>% filter(gender!='Other')%>%droplevels()
#check the missing value again
sapply(UpdateStrokeDataDropNA, function(x)sum(is.na(x)))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 140 0 0
#Drop all data rows containing missing value
UpdateStrokeDataDroppedNA = UpdateStrokeDataDropNA
UpdateStrokeDataDroppedNA = na.omit(UpdateStrokeDataDroppedNA)
#check whether missing value has successfully dropped or not
sapply(UpdateStrokeDataDroppedNA, function(x)sum(is.na(x)))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 0 0 0
After handling missing values and unusual values by the first method, which is dropping them out, to see which features are most important to use to predict a person’s stroke or not, I look at the correlation matrix here.
UpdateStrokeDataDroppedNACorrelation = UpdateStrokeDataDroppedNA
UpdateStrokeDataDroppedNACorrelation$gender = as.numeric(UpdateStrokeDataDroppedNACorrelation$gender)
UpdateStrokeDataDroppedNACorrelation$hypertension = as.numeric(UpdateStrokeDataDroppedNACorrelation$hypertension)
UpdateStrokeDataDroppedNACorrelation$heart_disease = as.numeric(UpdateStrokeDataDroppedNACorrelation$heart_disease)
UpdateStrokeDataDroppedNACorrelation$work_type = as.numeric(UpdateStrokeDataDroppedNACorrelation$work_type)
UpdateStrokeDataDroppedNACorrelation$ever_married = as.numeric(UpdateStrokeDataDroppedNACorrelation$ever_married)
UpdateStrokeDataDroppedNACorrelation$Residence_type = as.numeric(UpdateStrokeDataDroppedNACorrelation$Residence_type)
UpdateStrokeDataDroppedNACorrelation$smoking_status = as.numeric(UpdateStrokeDataDroppedNACorrelation$smoking_status)
UpdateStrokeDataDroppedNACorrelation$stroke = as.numeric(UpdateStrokeDataDroppedNACorrelation$stroke)
UpdateStrokeDataDroppedNACorrelation$work_type = as.numeric(UpdateStrokeDataDroppedNACorrelation$work_type)
x <- (UpdateStrokeDataDroppedNACorrelation)
UpdateStrokeDataDroppedNACorrelation.quant = x
UpdateStrokeDataDroppedNACorrelation.cor = round(cor(UpdateStrokeDataDroppedNACorrelation.quant),2)
ggplot(data = reshape2::melt(UpdateStrokeDataDroppedNACorrelation.cor),aes(x=Var1, y=Var2, fill=value)) + geom_tile() + scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Pearson Corr") + geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) + theme(axis.text.x = element_text(angle = 30))
From the output above, age, hypertension, heart disease, and average glucose level are the most important features for predicting stroke because they have the highest correlation to the dependent variable.
SECOND WAY TO HANDLE MISSING AND UNUSUAL VALUE: Next, I used the second method to deal with missing values and unusual values, namely by replacing the missing values and unusual values. I removed the other levels on the gender attribute, replaced the unknown levels in the smoking status attribute with the most frequent category, which is ‘never smoked,’ and replaced the missing value in the bmi’s attribute with its mean.
#Put the original data into a new variable that will be altered.
UpdateStrokeDataReplaceNA = strokeData
#Drop 'id' columns as i said before that id columns is useless
UpdateStrokeDataReplaceNA <- subset(UpdateStrokeDataReplaceNA, select = c("gender","age","hypertension","heart_disease","ever_married","work_type","Residence_type","avg_glucose_level","bmi","smoking_status","stroke"))
#replace missing value in bmi attribute with its mean
UpdateStrokeDataReplaceNA$bmi[is.na(UpdateStrokeDataReplaceNA$bmi)] <- mean(UpdateStrokeDataReplaceNA$bmi, na.rm = T)
#replace the unknown with the most frequent category, which is ‘never smoked’
UpdateStrokeDataReplaceNA <- UpdateStrokeDataReplaceNA %>% mutate(smoking_status = replace(smoking_status, smoking_status == "Unknown", "never smoked"))
#buang levels 'Other' pada atribut gender
UpdateStrokeDataReplaceNA <- UpdateStrokeDataReplaceNA %>% filter(gender!='Other')%>%droplevels()
#check whether missing and unsual value has successfully replaced or not
sapply(UpdateStrokeDataReplaceNA, function(x)sum(is.na(x)))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 0 0 0
After handling missing values using the replace approach, I look at the correlation matrix to evaluate which features are most important to utilize to predict a person’s stroke or not.
UpdateStrokeDataReplaceNACorrelation = UpdateStrokeDataReplaceNA
UpdateStrokeDataReplaceNACorrelation$gender = as.numeric(UpdateStrokeDataReplaceNACorrelation$gender)
UpdateStrokeDataReplaceNACorrelation$hypertension = as.numeric(UpdateStrokeDataReplaceNACorrelation$hypertension)
UpdateStrokeDataReplaceNACorrelation$heart_disease = as.numeric(UpdateStrokeDataReplaceNACorrelation$heart_disease)
UpdateStrokeDataReplaceNACorrelation$work_type = as.numeric(UpdateStrokeDataReplaceNACorrelation$work_type)
UpdateStrokeDataReplaceNACorrelation$ever_married = as.numeric(UpdateStrokeDataReplaceNACorrelation$ever_married)
UpdateStrokeDataReplaceNACorrelation$Residence_type = as.numeric(UpdateStrokeDataReplaceNACorrelation$Residence_type)
UpdateStrokeDataReplaceNACorrelation$smoking_status = as.numeric(UpdateStrokeDataReplaceNACorrelation$smoking_status)
UpdateStrokeDataReplaceNACorrelation$stroke = as.numeric(UpdateStrokeDataReplaceNACorrelation$stroke)
UpdateStrokeDataReplaceNACorrelation$work_type = as.numeric(UpdateStrokeDataReplaceNACorrelation$work_type)
x <- (UpdateStrokeDataReplaceNACorrelation)
UpdateStrokeDataReplaceNACorrelation.quant = x
UpdateStrokeDataReplaceNACorrelation.cor = round(cor(UpdateStrokeDataReplaceNACorrelation.quant),2)
ggplot(data = reshape2::melt(UpdateStrokeDataReplaceNACorrelation.cor),aes(x=Var1, y=Var2, fill=value)) + geom_tile() + scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Pearson Corr") + geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) + theme(axis.text.x = element_text(angle = 30))
Same as before, from the correlation matrix above, it can be seen that age, hypertension, heart disease, and average glucose levels were the most significant factors in predicting stroke because they have the highest correlation to the dependent variable.
After dealing with missing values and unusual values, I split the data to build the model. I split the data into 70% of training data and 30% of testing data.
#set random set.seed
set.seed(589)
#for data in which the missing value has been removed, create training data set 70% for building model and testing set 30% for model evaluation
trainingIndex = createDataPartition(UpdateStrokeDataDroppedNA$stroke,p=0.7,list=FALSE)
testing_set = UpdateStrokeDataDroppedNA[-trainingIndex, ]
training_set = UpdateStrokeDataDroppedNA[trainingIndex, ]
cat("Testing Set Dimention data : ",dim(testing_set),"\n")
## Testing Set Dimention data : 1027 11
cat("Training Set Dimention data : ",dim(training_set))
## Training Set Dimention data : 2398 11
#for data in which the missing value has been replaced, create training data set 70% for building model and testing set 30% for model evaluation
trainingIndexReplacedNA = createDataPartition(UpdateStrokeDataReplaceNA$stroke,p=0.7,list=FALSE)
testing_set_ReplacedNA = UpdateStrokeDataReplaceNA[-trainingIndexReplacedNA, ]
training_set_ReplacedNA = UpdateStrokeDataReplaceNA[trainingIndexReplacedNA, ]
cat("Testing Set Dimention data : ",dim(testing_set_ReplacedNA),"\n")
## Testing Set Dimention data : 1532 11
cat("Training Set Dimention data : ",dim(training_set_ReplacedNA))
## Training Set Dimention data : 3577 11
The most important features in my dataset:
Age, hypertension, heart disease, and average glucose level are the most important features, because they have a high correlation to the stroke variable when compared to other features.
The features that I will use to predict whether someone has a stroke or not are the attributes of gender, age, hypertension, heart_disease, ever_married, work_type, residence_type, avg_glucose_level, bmi, and smoking_status.
#build function to get the accuracy of each model based on the threshold
getAccuracy <- function(predictionProbability, threshold, expected) {
predictedVal = ifelse(predictionProbability > threshold,'Yes',"No")
misclassificationError <- mean(predictedVal != expected)
accuracy = 1-misclassificationError
return (accuracy)
}
#build modellingFunction for modelling -> it contains a summary of the model itself, the auc value in the testing and training set, the accuracy in the training and testing set when the threshold is 0.5 (default threshold), the confusion matrix when the threshold is 0.5 (default threshold) and when the threshold is calculated using gmean, and the model accuracy when using gmean as the threshold
modellingFunction <- function(name,formula,predictors,training_set, testing_set,color){
model = glm(formula,family = binomial(link = "logit"), data = training_set)
print(name)
print(summary(model))
logisticPrediction <- predict(model,newdata = subset(testing_set, select = predictors), type ="response")
rocrEvaluation <- prediction(logisticPrediction, testing_set$stroke)
logisticPredictionTraining <- predict(model,newdata = subset(training_set, select = predictors), type ="response")
rocrEvaluationTraining <- prediction(logisticPredictionTraining, training_set$stroke)
aucValue <- performance(rocrEvaluation, measure = "auc")
aucValue <- aucValue@y.values[[1]]
aucValueTraining <- performance(rocrEvaluationTraining, measure = "auc")
aucValueTraining <- aucValueTraining@y.values[[1]]
accuracy = getAccuracy(logisticPrediction, 0.5, testing_set$stroke)
accuracyTraining = accuracy = getAccuracy(logisticPredictionTraining, 0.5, training_set$stroke)
roc_empirical <- rocit(score = logisticPrediction, class = testing_set$stroke)
predictedDefault = ifelse(logisticPrediction > 0.5,'Yes',"No")
cmDefault = confusionMatrix(data=as.factor(predictedDefault), reference=testing_set$stroke, positive = "Yes")
gmean = sqrt(roc_empirical$TPR * (1-roc_empirical$FPR))
optimalThreshold = roc_empirical$Cutoff[which.max(gmean)]
predictedGmean = ifelse(logisticPrediction > optimalThreshold,'Yes',"No")
cmGmean = confusionMatrix(data=as.factor(predictedGmean), reference=testing_set$stroke, positive = "Yes")
accuracyGmean = getAccuracy(logisticPrediction, optimalThreshold, testing_set$stroke)
cat(sprintf("CONFUSION MATRIX WITH THRESHOLD %.1f\n",0.5))
cat("-------------------------------------------\n")
print(cmDefault)
cat("===========================================\n")
cat(sprintf("CONFUSION MATRIX WITH THRESHOLD %f\n",optimalThreshold))
cat("-------------------------------------------\n")
print(cmGmean)
cat("===========================================\n")
cat(sprintf("%s AUC Testing: %f\n", name,aucValue))
cat(sprintf("%s AUC Training: %f\n", name,aucValueTraining))
cat(sprintf("%s Accuracy Testing with threshold %f: %f\n", name,optimalThreshold,accuracyGmean))
cat(sprintf("%s Accuracy Testing with threshold %.1f: %f\n", name,0.5,accuracy))
cat(sprintf("%s Accuracy Training with threshold %.1f: %f\n", name,0.5,accuracyTraining))
return (list(
name=name,
model=model,
roc=rocrEvaluation,
predVal=logisticPrediction,
auc=aucValue,
sensitivityDefault=cmDefault$byClass[2],
specificityDefault=cmDefault$byClass[1],
sensitivityGmean=cmGmean$byClass[2],
specificityGmean=cmGmean$byClass[1],
color=color
))
}
#build function to compare auc, specitifity and sensitivity of each model
summarizeModelMetric <- function(models) {
modelsSummary <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(modelsSummary) <-c("AUC", "Specitifity", "Sensitivity")
modelsSummaryRowNames = list()
idx = 0
for(model in models) {
modelsSummary[nrow(modelsSummary) + 1,] <- c(
AUC=model$auc,Sensitivity=model$sensitivityGmean,Specificity=model$specificityGmean
)
modelsSummaryRowNames = append(modelsSummaryRowNames,model$name)
}
rownames(modelsSummary)<-modelsSummaryRowNames
for(model in models) {
if (idx == 0) {
plot(performance(model$roc, measure = "tpr", x.measure = "fpr"),col=model$color)
} else {
plot(performance(model$roc, measure = "tpr", x.measure = "fpr"),col=model$color,add=TRUE)
}
idx = idx + 1
}
return (modelsSummary)
}
Because the data we have is not evenly distributed between the number of strokes and those who do not have a stroke (imbalanced data), determining whether or not a model is good cannot be calculated using accuracy alone, and other metrics are required to evaluate the model’s performance. So, when creating the model, I compared the confusion matrix of each model when the default threshold (0.5) was used with the confusion matrix of each model when gmean was used as the threshold.
I do this because the default threshold of 0.5 may not be optimal in anticipating probabilities, especially if the data is not balanced, therefore we need to optimize the threshold by replacing it with the gmean value. G-Mean is an imbalanced classification measure that, when optimized, seeks a balance between sensitivity and specificity. The formula for G-Mean is sqrt(Sensitivity * Specificity). We may use the index for the highest G-mean score to select which threshold value to employ after calculating the G-mean for each threshold.
The problem is that the default threshold may not represent the optimal interpretation of the predicted probabilities especially when the data is unbalanced.
Here I create my first model, i.e. I try to model the dependent variable with all predictor variables using data in which the missing value has been removed
model1 <- modellingFunction("model1 = stroke~.", stroke ~ ., c("age", "hypertension","avg_glucose_level","gender","heart_disease","work_type","Residence_type","bmi","smoking_status","ever_married"), training_set, testing_set,'red')
## [1] "model1 = stroke~."
##
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1793 -0.3361 -0.1900 -0.1018 3.1820
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.910e+01 5.876e+02 -0.033 0.97407
## genderMale -2.281e-02 2.007e-01 -0.114 0.90949
## age 7.512e-02 8.550e-03 8.787 < 2e-16 ***
## hypertensionYes 6.384e-01 2.178e-01 2.932 0.00337 **
## heart_diseaseYes 3.280e-01 2.669e-01 1.229 0.21904
## ever_marriedYes -7.073e-02 3.218e-01 -0.220 0.82603
## work_typeGovt_job 1.070e+01 5.876e+02 0.018 0.98548
## work_typeNever_worked -3.533e-01 1.510e+03 0.000 0.99981
## work_typePrivate 1.101e+01 5.876e+02 0.019 0.98506
## work_typeSelf-employed 1.065e+01 5.876e+02 0.018 0.98554
## Residence_typeUrban 8.740e-02 1.942e-01 0.450 0.65271
## avg_glucose_level 4.322e-03 1.650e-03 2.620 0.00880 **
## bmi 5.134e-03 1.553e-02 0.331 0.74099
## smoking_statusnever smoked -4.070e-02 2.245e-01 -0.181 0.85613
## smoking_statussmokes 2.192e-01 2.819e-01 0.778 0.43675
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 987.68 on 2397 degrees of freedom
## Residual deviance: 796.88 on 2383 degrees of freedom
## AIC: 826.88
##
## Number of Fisher Scoring iterations: 16
## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.
## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 973 54
## Yes 0 0
##
## Accuracy : 0.9474
## 95% CI : (0.9319, 0.9603)
## No Information Rate : 0.9474
## P-Value [Acc > NIR] : 0.5361
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 5.498e-13
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.94742
## Prevalence : 0.05258
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : Yes
##
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.041770
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 684 11
## Yes 289 43
##
## Accuracy : 0.7079
## 95% CI : (0.679, 0.7356)
## No Information Rate : 0.9474
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1455
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.79630
## Specificity : 0.70298
## Pos Pred Value : 0.12952
## Neg Pred Value : 0.98417
## Prevalence : 0.05258
## Detection Rate : 0.04187
## Detection Prevalence : 0.32327
## Balanced Accuracy : 0.74964
##
## 'Positive' Class : Yes
##
## ===========================================
## model1 = stroke~. AUC Testing: 0.824350
## model1 = stroke~. AUC Training: 0.837246
## model1 = stroke~. Accuracy Testing with threshold 0.041770: 0.707887
## model1 = stroke~. Accuracy Testing with threshold 0.5: 0.947456
## model1 = stroke~. Accuracy Training with threshold 0.5: 0.947456
By looking at the signif code, I create my second model, i.e. I try to model the dependent variable with age, avg_glucose_level, and hypertension as the predictors using data in which the missing value has been removed
model2 <- modellingFunction("model2 = stroke~age+avg_glucose_level+hypertension", stroke ~ age + avg_glucose_level + hypertension, c("age", "avg_glucose_level","hypertension"), training_set, testing_set,'blue')
## [1] "model2 = stroke~age+avg_glucose_level+hypertension"
##
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0745 -0.3380 -0.1927 -0.1019 3.2385
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.912785 0.544152 -14.542 < 2e-16 ***
## age 0.072489 0.007670 9.450 < 2e-16 ***
## avg_glucose_level 0.004655 0.001582 2.941 0.00327 **
## hypertensionYes 0.663959 0.214400 3.097 0.00196 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 987.68 on 2397 degrees of freedom
## Residual deviance: 803.27 on 2394 degrees of freedom
## AIC: 811.27
##
## Number of Fisher Scoring iterations: 7
## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.
## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 973 54
## Yes 0 0
##
## Accuracy : 0.9474
## 95% CI : (0.9319, 0.9603)
## No Information Rate : 0.9474
## P-Value [Acc > NIR] : 0.5361
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 5.498e-13
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.94742
## Prevalence : 0.05258
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : Yes
##
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.074082
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 781 17
## Yes 192 37
##
## Accuracy : 0.7965
## 95% CI : (0.7706, 0.8207)
## No Information Rate : 0.9474
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1928
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.68519
## Specificity : 0.80267
## Pos Pred Value : 0.16157
## Neg Pred Value : 0.97870
## Prevalence : 0.05258
## Detection Rate : 0.03603
## Detection Prevalence : 0.22298
## Balanced Accuracy : 0.74393
##
## 'Positive' Class : Yes
##
## ===========================================
## model2 = stroke~age+avg_glucose_level+hypertension AUC Testing: 0.819002
## model2 = stroke~age+avg_glucose_level+hypertension AUC Training: 0.832694
## model2 = stroke~age+avg_glucose_level+hypertension Accuracy Testing with threshold 0.074082: 0.796495
## model2 = stroke~age+avg_glucose_level+hypertension Accuracy Testing with threshold 0.5: 0.947456
## model2 = stroke~age+avg_glucose_level+hypertension Accuracy Training with threshold 0.5: 0.947456
Then I also create my third model, i.e. I try to model the dependent variable with only age and avg_glucose_level variables as the predictors using data in which the missing value has been removed
model3 <- modellingFunction("model3 = stroke~age+avg_glucose_level", stroke ~ age + avg_glucose_level , c("age", "avg_glucose_level"), training_set, testing_set,'green')
## [1] "model3 = stroke~age+avg_glucose_level"
##
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9594 -0.3427 -0.1958 -0.1033 3.2301
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.035647 0.539546 -14.893 < 2e-16 ***
## age 0.075443 0.007566 9.972 < 2e-16 ***
## avg_glucose_level 0.005387 0.001560 3.452 0.000556 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 987.68 on 2397 degrees of freedom
## Residual deviance: 812.33 on 2395 degrees of freedom
## AIC: 818.33
##
## Number of Fisher Scoring iterations: 7
## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.
## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 973 54
## Yes 0 0
##
## Accuracy : 0.9474
## 95% CI : (0.9319, 0.9603)
## No Information Rate : 0.9474
## P-Value [Acc > NIR] : 0.5361
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 5.498e-13
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.94742
## Prevalence : 0.05258
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : Yes
##
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.064606
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 759 16
## Yes 214 38
##
## Accuracy : 0.776
## 95% CI : (0.7493, 0.8012)
## No Information Rate : 0.9474
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1771
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.70370
## Specificity : 0.78006
## Pos Pred Value : 0.15079
## Neg Pred Value : 0.97935
## Prevalence : 0.05258
## Detection Rate : 0.03700
## Detection Prevalence : 0.24537
## Balanced Accuracy : 0.74188
##
## 'Positive' Class : Yes
##
## ===========================================
## model3 = stroke~age+avg_glucose_level AUC Testing: 0.816756
## model3 = stroke~age+avg_glucose_level AUC Training: 0.826186
## model3 = stroke~age+avg_glucose_level Accuracy Testing with threshold 0.064606: 0.776047
## model3 = stroke~age+avg_glucose_level Accuracy Testing with threshold 0.5: 0.947456
## model3 = stroke~age+avg_glucose_level Accuracy Training with threshold 0.5: 0.947456
Next I create my first model using data in which the missing value has been replaced, namely model4. So, below I try to model the dependent variable with all predictor variables using data in which the missing value has been replaced.
model4 <- modellingFunction("model4 = stroke~.", stroke ~ ., c("age", "hypertension","avg_glucose_level","gender","heart_disease","work_type","Residence_type","bmi","smoking_status","ever_married"), training_set_ReplacedNA, testing_set_ReplacedNA,'pink')
## [1] "model4 = stroke~."
##
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0283 -0.3301 -0.1717 -0.0941 3.4682
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.857308 0.804945 -7.277 3.42e-13 ***
## genderMale -0.117210 0.168938 -0.694 0.4878
## age 0.074076 0.006892 10.748 < 2e-16 ***
## hypertensionYes 0.175505 0.204452 0.858 0.3907
## heart_diseaseYes 0.127623 0.237853 0.537 0.5916
## ever_marriedYes 0.023551 0.292793 0.080 0.9359
## work_typeGovt_job -1.307031 0.876428 -1.491 0.1359
## work_typeNever_worked -10.718425 341.727822 -0.031 0.9750
## work_typePrivate -1.242327 0.859938 -1.445 0.1486
## work_typeSelf-employed -1.541529 0.885981 -1.740 0.0819 .
## Residence_typeUrban 0.039402 0.163491 0.241 0.8096
## avg_glucose_level 0.003030 0.001430 2.119 0.0341 *
## bmi -0.008599 0.013701 -0.628 0.5302
## smoking_statusnever smoked -0.256399 0.188840 -1.358 0.1745
## smoking_statussmokes -0.059624 0.265216 -0.225 0.8221
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1397.4 on 3576 degrees of freedom
## Residual deviance: 1130.7 on 3562 degrees of freedom
## AIC: 1160.7
##
## Number of Fisher Scoring iterations: 14
## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.
## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1458 74
## Yes 0 0
##
## Accuracy : 0.9517
## 95% CI : (0.9397, 0.9619)
## No Information Rate : 0.9517
## P-Value [Acc > NIR] : 0.5309
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.9517
## Prevalence : 0.0483
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Yes
##
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.060689
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1131 14
## Yes 327 60
##
## Accuracy : 0.7774
## 95% CI : (0.7557, 0.798)
## No Information Rate : 0.9517
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.195
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.81081
## Specificity : 0.77572
## Pos Pred Value : 0.15504
## Neg Pred Value : 0.98777
## Prevalence : 0.04830
## Detection Rate : 0.03916
## Detection Prevalence : 0.25261
## Balanced Accuracy : 0.79327
##
## 'Positive' Class : Yes
##
## ===========================================
## model4 = stroke~. AUC Testing: 0.861797
## model4 = stroke~. AUC Training: 0.836528
## model4 = stroke~. Accuracy Testing with threshold 0.060689: 0.777415
## model4 = stroke~. Accuracy Testing with threshold 0.5: 0.951076
## model4 = stroke~. Accuracy Training with threshold 0.5: 0.951076
Afterward, I create my fifth model using data in which the missing value has been replaced, namely model5. In model5, I try to model the dependent variable with age, avg_glucose_level, hypertension and heart_disease variables as the predictors using data in which the missing value has been replaced.
model5 <- modellingFunction("model5 = stroke~age+avg_glucose_level+hypertension+heart_disease", stroke ~ age + avg_glucose_level + hypertension + heart_disease, c("age", "avg_glucose_level","hypertension","heart_disease"), training_set_ReplacedNA, testing_set_ReplacedNA,'yellow')
## [1] "model5 = stroke~age+avg_glucose_level+hypertension+heart_disease"
##
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8922 -0.3315 -0.1810 -0.0857 3.7419
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.298194 0.415283 -17.574 <2e-16 ***
## age 0.069600 0.006021 11.559 <2e-16 ***
## avg_glucose_level 0.002931 0.001390 2.110 0.0349 *
## hypertensionYes 0.163990 0.202474 0.810 0.4180
## heart_diseaseYes 0.142137 0.234313 0.607 0.5441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1397.4 on 3576 degrees of freedom
## Residual deviance: 1137.6 on 3572 degrees of freedom
## AIC: 1147.6
##
## Number of Fisher Scoring iterations: 7
## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.
## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1458 74
## Yes 0 0
##
## Accuracy : 0.9517
## 95% CI : (0.9397, 0.9619)
## No Information Rate : 0.9517
## P-Value [Acc > NIR] : 0.5309
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.9517
## Prevalence : 0.0483
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Yes
##
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.068901
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1152 16
## Yes 306 58
##
## Accuracy : 0.7898
## 95% CI : (0.7685, 0.81)
## No Information Rate : 0.9517
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2007
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.78378
## Specificity : 0.79012
## Pos Pred Value : 0.15934
## Neg Pred Value : 0.98630
## Prevalence : 0.04830
## Detection Rate : 0.03786
## Detection Prevalence : 0.23760
## Balanced Accuracy : 0.78695
##
## 'Positive' Class : Yes
##
## ===========================================
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease AUC Testing: 0.863493
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease AUC Training: 0.833043
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease Accuracy Testing with threshold 0.068901: 0.789817
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease Accuracy Testing with threshold 0.5: 0.951076
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease Accuracy Training with threshold 0.5: 0.951076
By looking at the signif code, I create my sixth model using data in which the missing value has been replaced, namely model6. In model6, I try to model the dependent variable with age and avg_glucose_level variables as the predictors using data in which the missing value has been replaced.
model6 <- modellingFunction("model6 = stroke~age+avg_glucose_level", stroke ~ age + avg_glucose_level , c("age", "avg_glucose_level"), training_set_ReplacedNA, testing_set_ReplacedNA,'brown')
## [1] "model6 = stroke~age+avg_glucose_level"
##
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8810 -0.3338 -0.1813 -0.0852 3.7538
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.360138 0.410674 -17.922 <2e-16 ***
## age 0.070917 0.005871 12.080 <2e-16 ***
## avg_glucose_level 0.003155 0.001371 2.301 0.0214 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1397.4 on 3576 degrees of freedom
## Residual deviance: 1138.6 on 3574 degrees of freedom
## AIC: 1144.6
##
## Number of Fisher Scoring iterations: 7
## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.
## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1458 74
## Yes 0 0
##
## Accuracy : 0.9517
## 95% CI : (0.9397, 0.9619)
## No Information Rate : 0.9517
## P-Value [Acc > NIR] : 0.5309
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.9517
## Prevalence : 0.0483
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Yes
##
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.048003
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1043 10
## Yes 415 64
##
## Accuracy : 0.7226
## 95% CI : (0.6994, 0.7449)
## No Information Rate : 0.9517
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1613
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.86486
## Specificity : 0.71536
## Pos Pred Value : 0.13361
## Neg Pred Value : 0.99050
## Prevalence : 0.04830
## Detection Rate : 0.04178
## Detection Prevalence : 0.31266
## Balanced Accuracy : 0.79011
##
## 'Positive' Class : Yes
##
## ===========================================
## model6 = stroke~age+avg_glucose_level AUC Testing: 0.858618
## model6 = stroke~age+avg_glucose_level AUC Training: 0.832197
## model6 = stroke~age+avg_glucose_level Accuracy Testing with threshold 0.048003: 0.722585
## model6 = stroke~age+avg_glucose_level Accuracy Testing with threshold 0.5: 0.951076
## model6 = stroke~age+avg_glucose_level Accuracy Training with threshold 0.5: 0.951076
Lastly, by looking at the signif code from the model6, I create my last model using data in which the missing value has been replaced, namely model7. In model6, I try to model the dependent variable with age variable only as the predictor using data in which the missing value has been replaced.
model7 <- modellingFunction("model7 = stroke~age", stroke ~ age , c("age"), training_set_ReplacedNA, testing_set_ReplacedNA,'purple')
## [1] "model7 = stroke~age"
##
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7499 -0.3317 -0.1870 -0.0842 3.7418
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.09571 0.39130 -18.13 <2e-16 ***
## age 0.07281 0.00578 12.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1397.4 on 3576 degrees of freedom
## Residual deviance: 1143.7 on 3575 degrees of freedom
## AIC: 1147.7
##
## Number of Fisher Scoring iterations: 7
## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.
## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1458 74
## Yes 0 0
##
## Accuracy : 0.9517
## 95% CI : (0.9397, 0.9619)
## No Information Rate : 0.9517
## P-Value [Acc > NIR] : 0.5309
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.9517
## Prevalence : 0.0483
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Yes
##
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.049958
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1060 14
## Yes 398 60
##
## Accuracy : 0.7311
## 95% CI : (0.7081, 0.7531)
## No Information Rate : 0.9517
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1553
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.81081
## Specificity : 0.72702
## Pos Pred Value : 0.13100
## Neg Pred Value : 0.98696
## Prevalence : 0.04830
## Detection Rate : 0.03916
## Detection Prevalence : 0.29896
## Balanced Accuracy : 0.76892
##
## 'Positive' Class : Yes
##
## ===========================================
## model7 = stroke~age AUC Testing: 0.845962
## model7 = stroke~age AUC Training: 0.829505
## model7 = stroke~age Accuracy Testing with threshold 0.049958: 0.731070
## model7 = stroke~age Accuracy Testing with threshold 0.5: 0.951076
## model7 = stroke~age Accuracy Training with threshold 0.5: 0.951076
It is proven from each model that the 0.5 threshold cannot predict this data (imbalanced data) correctly, it can be seen from the confusion matrix of each model when using the 0.5 threshold that the result is poor, so the Gmean threshold must be applied.
After constructing the seven models described above, I use the AUC, specificity, and sensitivity of each model to determine which model is the best.
#Using my summary model function to compare the auc, specificity, and sensitivity values between each model
summarizeModelMetric(list(model1,model2,model3,model4,model5,model6,model7))
## AUC
## model1 = stroke~. 0.8243500
## model2 = stroke~age+avg_glucose_level+hypertension 0.8190019
## model3 = stroke~age+avg_glucose_level 0.8167561
## model4 = stroke~. 0.8617970
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease 0.8634931
## model6 = stroke~age+avg_glucose_level 0.8586179
## model7 = stroke~age 0.8459617
## Specitifity
## model1 = stroke~. 0.7029805
## model2 = stroke~age+avg_glucose_level+hypertension 0.8026721
## model3 = stroke~age+avg_glucose_level 0.7800617
## model4 = stroke~. 0.7757202
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease 0.7901235
## model6 = stroke~age+avg_glucose_level 0.7153635
## model7 = stroke~age 0.7270233
## Sensitivity
## model1 = stroke~. 0.7962963
## model2 = stroke~age+avg_glucose_level+hypertension 0.6851852
## model3 = stroke~age+avg_glucose_level 0.7037037
## model4 = stroke~. 0.8108108
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease 0.7837838
## model6 = stroke~age+avg_glucose_level 0.8648649
## model7 = stroke~age 0.8108108
Explanation:
AUC : AUC - ROC curve is a classification performance statistic with variable threshold values. AUC represents the degree or amount of separability. It measures how well the model can detect between classes. AUC close to one indicates a high level of separability in a good model. Similarly, the greater the AUC, the better the model distinguishes between people who have the condition and those who do not.
Sensitivity: the ability of a test to correctly identify patients with a disease.
Specificity: the ability of a test to correctly identify people without the disease.
True positive: the person has the disease and predicted positive.
True negative: the person does not have the disease and predicted negative.
False positive: the person does not have the disease and predicted positive.
False negative: the person has the disease and predicted negative
If we desire a low false negative, we must increase the sensitivity (sensitivity = true positive/(true positive + false negative)). If the specificity is low, it indicates that the false negative is high.
In order to minimize the possibility of predicting people who don’t have a stroke even though they actually have a stroke, the sensitivity must be large. Small specificity will predict more people who have a stroke even though they didn’t actually have a stroke. We, as humans, are better at predicting that we will have a stroke even though we didn’t actually have it than predicting that we will not have a stroke even though we actually had a stroke.
As a result, model 6 with age and glucose level predictors is the best model for predicting stroke or not, because it has the greatest sensitivity value among other models and the specificity is not too low. The AUC value is likewise high, close to one, indicating that model 6 is quite good at predicting whether or not someone will have a stroke. Model 6 has an AUC value of 0.86, a specificity value of 0.72 and a sensitivity of 0.86.
So, here is my final model summary and roc curve
FinalModel <- modellingFunction("Final Model = stroke~age+avg_glucose_level", stroke ~ age + avg_glucose_level , c("age", "avg_glucose_level"), training_set_ReplacedNA, testing_set_ReplacedNA,'brown')
## [1] "Final Model = stroke~age+avg_glucose_level"
##
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8810 -0.3338 -0.1813 -0.0852 3.7538
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.360138 0.410674 -17.922 <2e-16 ***
## age 0.070917 0.005871 12.080 <2e-16 ***
## avg_glucose_level 0.003155 0.001371 2.301 0.0214 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1397.4 on 3576 degrees of freedom
## Residual deviance: 1138.6 on 3574 degrees of freedom
## AIC: 1144.6
##
## Number of Fisher Scoring iterations: 7
## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.
## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1458 74
## Yes 0 0
##
## Accuracy : 0.9517
## 95% CI : (0.9397, 0.9619)
## No Information Rate : 0.9517
## P-Value [Acc > NIR] : 0.5309
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.9517
## Prevalence : 0.0483
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Yes
##
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.048003
## -------------------------------------------
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1043 10
## Yes 415 64
##
## Accuracy : 0.7226
## 95% CI : (0.6994, 0.7449)
## No Information Rate : 0.9517
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1613
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.86486
## Specificity : 0.71536
## Pos Pred Value : 0.13361
## Neg Pred Value : 0.99050
## Prevalence : 0.04830
## Detection Rate : 0.04178
## Detection Prevalence : 0.31266
## Balanced Accuracy : 0.79011
##
## 'Positive' Class : Yes
##
## ===========================================
## Final Model = stroke~age+avg_glucose_level AUC Testing: 0.858618
## Final Model = stroke~age+avg_glucose_level AUC Training: 0.832197
## Final Model = stroke~age+avg_glucose_level Accuracy Testing with threshold 0.048003: 0.722585
## Final Model = stroke~age+avg_glucose_level Accuracy Testing with threshold 0.5: 0.951076
## Final Model = stroke~age+avg_glucose_level Accuracy Training with threshold 0.5: 0.951076
plot(performance(FinalModel$roc, measure = "tpr", x.measure = "fpr"))
Explanation of Model:
My final model’s accuracy using the default threshold(0.5) is around 95%. However, because based on the confusion matrix, the prediction results with a default threshold of 0.5 are poor, I use the one with the gmean threshold to predict. Then, My final model’s accuracy using the gmean threshold is around 72%. In my final model, the AUC value in the testing set is around 0.86, whereas the AUC value in the training set is approximately 0.83. The AUC values in these two sets of differences are not far off, and the AUC value in this model is fairly high, near to 1, indicating that it is quite good at predicting. According to the plot, this model’s roccurve is also fairly good.
The equation of the line for my final model is ln[stroke/(1-stroke)] = -7.360138 + age(0.070917) + avg_glucose_level(0.003155)
So, based on this data and the model that I created and picked, model6, the probability of having a stroke is determined by a person’s age and glucose levels.The older a person becomes, the more likely he or she will have a stroke. Similarly, a person’s average glucose level. The higher a person’s average glucose level, the greater the probability of getting a stroke. Someone with high glucose levels is more likely to develop diabetes, which can cause narrowing of blood vessel walls and even total blockage, causing blood flow to the brain to halt and a stroke to occur.
#split the data into 50% of training and 50% of testing from the data which missing value has dropped
set.seed(1)
trainingIndexForDecisionTree = createDataPartition(UpdateStrokeDataDroppedNA$stroke,p=0.5,list=FALSE)
testing_set_for_DT = UpdateStrokeDataDroppedNA[-trainingIndexForDecisionTree, ]
training_set_for_DT = UpdateStrokeDataDroppedNA[trainingIndexForDecisionTree, ]
cat("Testing Set Dimention data : ",dim(testing_set_for_DT),"\n")
## Testing Set Dimention data : 1712 11
cat("Training Set Dimention data : ",dim(training_set_for_DT))
## Training Set Dimention data : 1713 11
#build the decision tree model using rpart function
DecisionTreeModel <- rpart(stroke~.,training_set_for_DT, method = "class")
DecisionTreeModel
## n= 1713
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1713 90 No (0.94746060 0.05253940)
## 2) age< 67.5 1381 32 No (0.97682839 0.02317161) *
## 3) age>=67.5 332 58 No (0.82530120 0.17469880)
## 6) avg_glucose_level< 219.625 291 43 No (0.85223368 0.14776632) *
## 7) avg_glucose_level>=219.625 41 15 No (0.63414634 0.36585366)
## 14) work_type=Self-employed 13 1 No (0.92307692 0.07692308) *
## 15) work_type=Govt_job,Private 28 14 No (0.50000000 0.50000000)
## 30) bmi< 33.5 21 7 No (0.66666667 0.33333333)
## 60) avg_glucose_level>=227.13 14 3 No (0.78571429 0.21428571) *
## 61) avg_glucose_level< 227.13 7 3 Yes (0.42857143 0.57142857) *
## 31) bmi>=33.5 7 0 Yes (0.00000000 1.00000000) *
To make the decision tree easier to understand, I visualized it below
#plot the decision tree model using rpart.plot function
rpart.plot(DecisionTreeModel)
From the plot of the decision tree model above, I can conclude:
A stroke does not occur in those under the age of 68.
People over the age of 68 with an average glucose level less than 220 do not suffer a stroke.
People over the age of 68 with an average glucose level above 220 and self-employed jobs are unlikely to suffer a stroke.
People over the age of 68 with an average glucose level more than 220, a non-self-employed employment type, a BMI less than 34, and an avg glucose level higher than 227 are less likely to have a stroke.
5.People who are above the age of 68, have an average glucose level of more than 200, are not self-employed, and have a BMI of more than 34 are at risk of stroke.
predictionDTree <- predict(DecisionTreeModel, testing_set_for_DT, type = "class")
confusionMatrix <- table(predictionDTree, testing_set_for_DT$stroke)
confusionMatrix
##
## predictionDTree No Yes
## No 1612 87
## Yes 10 3
By looking at the confusion matrix, this model is not recommended because it predicts that more people who should have suffered a stroke are predicted not to suffer from a stroke than predicting people who do not suffer from a stroke are predicted to have a stroke.
#accuraycy
sum(diag(confusionMatrix))/sum(confusionMatrix)
## [1] 0.9433411
#error rate (missclassification accuracy)
1-sum(diag(confusionMatrix))/sum(confusionMatrix)
## [1] 0.05665888
Based on the accuracy findings above, the decision tree model has an accuracy of 94%, but actually the decision tree model is less effective for classifying unbalanced data and requires generalization to overcome it.