Video Link (Copy the link to the browser) :
#Load libraries
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.1.3
library(rpart)
## Warning: package 'rpart' was built under R version 4.1.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.1.3
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.1.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.3
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(pROC)
## Warning: package 'pROC' was built under R version 4.1.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(ROSE)
## Loaded ROSE 0.0-4
#Import data
strokeData <- read.csv("E:/Binus University/Semester 2/Data Mining and Visualization/Final Exam Lec/StrokeData.csv", header = T, na.strings = c(""))
#Print data
head(strokeData)
## id gender age hypertension heart_disease ever_married work_type
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 2 Rural 202.21 N/A never smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24 never smoked 1
## 6 Urban 186.21 29 formerly smoked 1
Attribute Information
1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_jov", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12) stroke: 1 if the patient had a stroke or 0 if not
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient
Make Predictive Logistic Model and Decision Tree to predict `stroke`, which is the target variable.
dim(strokeData)
## [1] 5110 12
EXPLANATION
The result above indicated that there are 5110 observations (rows) and 12 variables (columns) of data.
str(strokeData)
## 'data.frame': 5110 obs. of 12 variables:
## $ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
## $ gender : chr "Male" "Female" "Male" "Female" ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : chr "36.6" "N/A" "32.5" "34.4" ...
## $ smoking_status : chr "formerly smoked" "never smoked" "never smoked" "smokes" ...
## $ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
EXPLANATION
There are three data types that were used in this data set, they are num, char, and int.
However, there are several data types that do not match the contents, so the data types must be changed, including:
- from char to factor: `gender`, `ever_married`, `work_type`, `residence_type`, `smoking_status`
- from char to num : `bmi`
- from int to logical : `hypertension`, `heart_disease`
strokeData$gender <- as.factor(strokeData$gender)
strokeData$hypertension <- as.logical(strokeData$hypertension)
strokeData$heart_disease <- as.logical(strokeData$heart_disease)
strokeData$work_type <- as.factor(strokeData$work_type)
strokeData$Residence_type <- as.factor(strokeData$Residence_type)
strokeData$bmi <- as.numeric(strokeData$bmi)
## Warning: NAs introduced by coercion
strokeData$smoking_status <- as.factor(strokeData$smoking_status)
EXPLANATION
After changing the data types, check whether the data types have changed by using the str function.
str(strokeData)
## 'data.frame': 5110 obs. of 12 variables:
## $ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 1 2 1 1 2 2 1 1 1 ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ heart_disease : logi TRUE FALSE TRUE FALSE FALSE FALSE ...
## $ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : Factor w/ 5 levels "children","Govt_job",..: 4 5 4 4 5 4 4 4 4 4 ...
## $ Residence_type : Factor w/ 2 levels "Rural","Urban": 2 1 1 2 1 2 1 2 1 2 ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : num 36.6 NA 32.5 34.4 24 29 27.4 22.8 NA 24.2 ...
## $ smoking_status : Factor w/ 4 levels "formerly smoked",..: 1 2 2 3 2 1 2 2 4 4 ...
## $ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
EXPLANATION
The data type has changed.
summary(strokeData)
## id gender age hypertension heart_disease
## Min. : 67 Female:2994 Min. : 0.08 Mode :logical Mode :logical
## 1st Qu.:17741 Male :2115 1st Qu.:25.00 FALSE:4612 FALSE:4834
## Median :36932 Other : 1 Median :45.00 TRUE :498 TRUE :276
## Mean :36518 Mean :43.23
## 3rd Qu.:54682 3rd Qu.:61.00
## Max. :72940 Max. :82.00
##
## ever_married work_type Residence_type avg_glucose_level
## Length:5110 children : 687 Rural:2514 Min. : 55.12
## Class :character Govt_job : 657 Urban:2596 1st Qu.: 77.25
## Mode :character Never_worked : 22 Median : 91.89
## Private :2925 Mean :106.15
## Self-employed: 819 3rd Qu.:114.09
## Max. :271.74
##
## bmi smoking_status stroke
## Min. :10.30 formerly smoked: 885 Min. :0.00000
## 1st Qu.:23.50 never smoked :1892 1st Qu.:0.00000
## Median :28.10 smokes : 789 Median :0.00000
## Mean :28.89 Unknown :1544 Mean :0.04873
## 3rd Qu.:33.10 3rd Qu.:0.00000
## Max. :97.60 Max. :1.00000
## NA's :201
describe(strokeData)
## strokeData
##
## 12 Variables 5110 Observations
## --------------------------------------------------------------------------------
## id
## n missing distinct Info Mean Gmd .05 .10
## 5110 0 5110 1 36518 24436 3590 6972
## .25 .50 .75 .90 .95
## 17741 36932 54682 65668 69218
##
## lowest : 67 77 84 91 99, highest: 72911 72914 72915 72918 72940
## --------------------------------------------------------------------------------
## gender
## n missing distinct
## 5110 0 3
##
## Value Female Male Other
## Frequency 2994 2115 1
## Proportion 0.586 0.414 0.000
## --------------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 5110 0 104 1 43.23 26.03 5 11
## .25 .50 .75 .90 .95
## 25 45 61 75 79
##
## lowest : 0.08 0.16 0.24 0.32 0.40, highest: 78.00 79.00 80.00 81.00 82.00
## --------------------------------------------------------------------------------
## hypertension
## n missing distinct
## 5110 0 2
##
## Value FALSE TRUE
## Frequency 4612 498
## Proportion 0.903 0.097
## --------------------------------------------------------------------------------
## heart_disease
## n missing distinct
## 5110 0 2
##
## Value FALSE TRUE
## Frequency 4834 276
## Proportion 0.946 0.054
## --------------------------------------------------------------------------------
## ever_married
## n missing distinct
## 5110 0 2
##
## Value No Yes
## Frequency 1757 3353
## Proportion 0.344 0.656
## --------------------------------------------------------------------------------
## work_type
## n missing distinct
## 5110 0 5
##
## lowest : children Govt_job Never_worked Private Self-employed
## highest: children Govt_job Never_worked Private Self-employed
##
## Value children Govt_job Never_worked Private
## Frequency 687 657 22 2925
## Proportion 0.134 0.129 0.004 0.572
##
## Value Self-employed
## Frequency 819
## Proportion 0.160
## --------------------------------------------------------------------------------
## Residence_type
## n missing distinct
## 5110 0 2
##
## Value Rural Urban
## Frequency 2514 2596
## Proportion 0.492 0.508
## --------------------------------------------------------------------------------
## avg_glucose_level
## n missing distinct Info Mean Gmd .05 .10
## 5110 0 3979 1 106.1 45.38 60.71 65.79
## .25 .50 .75 .90 .95
## 77.24 91.88 114.09 192.18 216.29
##
## lowest : 55.12 55.22 55.23 55.25 55.26, highest: 266.59 267.60 267.61 267.76 271.74
## --------------------------------------------------------------------------------
## bmi
## n missing distinct Info Mean Gmd .05 .10
## 4909 201 418 1 28.89 8.538 17.64 19.70
## .25 .50 .75 .90 .95
## 23.50 28.10 33.10 38.90 42.96
##
## lowest : 10.3 11.3 11.5 12.0 12.3, highest: 66.8 71.9 78.0 92.0 97.6
## --------------------------------------------------------------------------------
## smoking_status
## n missing distinct
## 5110 0 4
##
## Value formerly smoked never smoked smokes Unknown
## Frequency 885 1892 789 1544
## Proportion 0.173 0.370 0.154 0.302
## --------------------------------------------------------------------------------
## stroke
## n missing distinct Info Sum Mean Gmd
## 5110 0 2 0.139 249 0.04873 0.09273
##
## --------------------------------------------------------------------------------
EXPLANATION
The insights gained from the result of summary and described are:
1) The number of female gender (2994) is more than other genders.
2) The average of the `age` is 43.
3) The frequency of people who have hypertension (4612) is more than the frequency of people who do not have hypertension (698).
4) `bmi` contains NA or missing values.
5) In this dataset, there are more people without heart disease than those with heart disease.
6) The type of work "Private" has the highest number among others, which is 2925 people.
7) The distribution between rural and urban residence types is almost equal.
8) Average of `glucose level` is 106.
9) The average `bmi` is 28,9 (will change after imputation in data preparation).
10) Number of people who have never smoked and 'unknown' status dominates in this dataset.
sapply(strokeData, function(x) sum(is.na(x)))
## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 201 0 0
EXPLANATION
After changing the data types, it was found that the missing value in the BMI variable was 201 data points or equivalent to 3.9335% of the total data (only a few). So, we will do imputation (in data preparation)
all(duplicated(strokeData)==TRUE)
## [1] FALSE
EXPLANATION
The syntax above returned FALSE. This indicates that there are no duplicated data in this dataset.
sapply(strokeData, function(x) length(unique(x)))
## id gender age hypertension
## 5110 3 104 2
## heart_disease ever_married work_type Residence_type
## 2 2 5 2
## avg_glucose_level bmi smoking_status stroke
## 3979 419 4 2
levels(strokeData$gender)
## [1] "Female" "Male" "Other"
levels(strokeData$work_type)
## [1] "children" "Govt_job" "Never_worked" "Private"
## [5] "Self-employed"
levels(strokeData$smoking_status)
## [1] "formerly smoked" "never smoked" "smokes" "Unknown"
levels(strokeData$Residence_type)
## [1] "Rural" "Urban"
EXPLANATION
1) There are 3 types of gender : "Male", "Female", "Other"
2) There are 5 work types : "children", "Govt_job", "Never_worked", "Private", "Self-Employed"
3) There are 4 status types of smoking : "formerly smoked", "never smoked", "smokes", "Unknown"
4) There are 2 types of residence : "Rural", "Urban"
ThreeSigma <- function(x, t = 3){
mu <- mean(x, na.rm = TRUE)
sig <- sd(x, na.rm = TRUE)
if (sig == 0){
message("All non-missing x-values are identical")
}
up <- mu + t * sig
down <- mu - t * sig
out <- list(up = up, down = down)
return(out)
}
Hampel <- function(x, t = 3){
mu <- median(x, na.rm = TRUE)
sig <- mad(x, na.rm = TRUE)
if (sig == 0){
message("Hampel identifer implosion: MAD scale estimate is zero")
}
up <- mu + t * sig
down <- mu - t * sig
out <- list(up = up, down = down)
return(out)
}
BoxplotRule<- function(x, t = 1.5){
xL <- quantile(x, na.rm = TRUE, probs = 0.25, names = FALSE)
xU <- quantile(x, na.rm = TRUE, probs = 0.75, names = FALSE)
Q <- xU - xL
if (Q == 0){
message("Boxplot rule implosion: interquartile distance is zero")
}
up <- xU + t * Q
down <- xL - t * Q
out <- list(up = up, down = down)
return(out)
}
ExtractDetails <- function(x, down, up){
outClass <- rep("N", length(x))
indexLo <- which(x < down)
indexHi <- which(x > up)
outClass[indexLo] <- "L"
outClass[indexHi] <- "U"
index <- union(indexLo, indexHi)
values <- x[index]
outClass <- outClass[index]
nOut <- length(index)
maxNom <- max(x[which(x <= up)])
minNom <- min(x[which(x >= down)])
outList <- list(nOut = nOut, lowLim = down,
upLim = up, minNom = minNom,
maxNom = maxNom, index = index,
values = values,
outClass = outClass)
return(outList)
}
FindOutliers <- function(x, t3 = 3, tH = 3, tb = 1.5){
threeLims <- ThreeSigma(x, t = t3)
HampLims <- Hampel(x, t = tH)
boxLims <- BoxplotRule(x, t = tb)
n <- length(x)
nMiss <- length(which(is.na(x)))
threeList <- ExtractDetails(x, threeLims$down, threeLims$up)
HampList <- ExtractDetails(x, HampLims$down, HampLims$up)
boxList <- ExtractDetails(x, boxLims$down, boxLims$up)
sumFrame <- data.frame(method = "ThreeSigma", n = n,
nMiss = nMiss, nOut = threeList$nOut,
lowLim = threeList$lowLim,
upLim = threeList$upLim,
minNom = threeList$minNom,
maxNom = threeList$maxNom)
upFrame <- data.frame(method = "Hampel", n = n,
nMiss = nMiss, nOut = HampList$nOut,
lowLim = HampList$lowLim,
upLim = HampList$upLim,
minNom = HampList$minNom,
maxNom = HampList$maxNom)
sumFrame <- rbind.data.frame(sumFrame, upFrame)
upFrame <- data.frame(method = "BoxplotRule", n = n,
nMiss = nMiss, nOut = boxList$nOut,
lowLim = boxList$lowLim,
upLim = boxList$upLim,
minNom = boxList$minNom,
maxNom = boxList$maxNom)
sumFrame <- rbind.data.frame(sumFrame, upFrame)
threeFrame <- data.frame(index = threeList$index,
values = threeList$values,
type = threeList$outClass)
HampFrame <- data.frame(index = HampList$index,
values = HampList$values,
type = HampList$outClass)
boxFrame <- data.frame(index = boxList$index,
values = boxList$values,
type = boxList$outClass)
outList <- list(summary = sumFrame, threeSigma = threeFrame,
Hampel = HampFrame, boxplotRule = boxFrame)
return(outList)
}
outliers_age <- FindOutliers(strokeData$age)
outliers_age$summary
## method n nMiss nOut lowLim upLim minNom maxNom
## 1 ThreeSigma 5110 0 0 -24.61133 111.0646 0.08 82
## 2 Hampel 5110 0 0 -35.06040 125.0604 0.08 82
## 3 BoxplotRule 5110 0 0 -29.00000 115.0000 0.08 82
outliers_glucose <- FindOutliers(strokeData$avg_glucose_level)
outliers_glucose$summary
## method n nMiss nOut lowLim upLim minNom maxNom
## 1 ThreeSigma 5110 0 49 -29.70300 241.9984 55.12 240.86
## 2 Hampel 5110 0 621 13.69268 170.0773 55.12 170.05
## 3 BoxplotRule 5110 0 627 21.97750 169.3575 55.12 168.68
outliers_bmi <- FindOutliers(strokeData$bmi)
outliers_bmi$summary
## method n nMiss nOut lowLim upLim minNom maxNom
## 1 ThreeSigma 5110 201 58 5.331037 52.45544 10.3 52.3
## 2 Hampel 5110 201 90 7.195340 49.00466 10.3 48.9
## 3 BoxplotRule 5110 201 110 9.100000 47.50000 10.3 47.5
EXPLANATION
The find outliers function detects that there are no outliers in the `age` variable. However, it detects avg_glucose_level and bmi variables have outliers.
I chose to use the Hampel Identifier as the most reasonable outliers detector, because the coverage taken match the distribution of the data, as can be seen from the upper and lower limits which are not too large and not too narrow in scope.
The variable avg_glucose_level has 621 outliers and the variable bmi has 90 outliers.
I decided not to eliminate outliers because avg_glucose_level outliers were only 12% and bmi outliers were only 1.7% of the total data.
table(strokeData$stroke, strokeData$gender)
##
## Female Male Other
## 0 2853 2007 1
## 1 141 108 0
EXPLANATION
There are some levels that have very low counts, which is `Other gender`. It only has 1 values, so it is okay to remove it from the dataset. It won't affect the datasets.
# Remove 'other' gender
strokeData <- strokeData %>%
filter(gender != 'Other') %>%
droplevels()
table(strokeData$stroke, strokeData$gender)
##
## Female Male
## 0 2853 2007
## 1 141 108
EXPLANATION
After removing `Other gender`, the gender variable only has 2 types, which are Female and Male.
ggplot(strokeData, aes(x = gender)) +
geom_bar()
EXPLANATION
There are more female gender than male gender in this dataset.
ggplot(strokeData, aes(x = stroke, fill = gender)) +
geom_bar(position = "dodge") +
labs(title = "Barplot Gender and Stroke Amount", x = "Stroke", y = "Amount")
EXPLANATION
The number of Female gender who have had a stroke is higher than Male gender who have had a stroke.
ggplot(strokeData, aes(x = stroke, fill = gender)) +
geom_bar(position = "fill") +
labs(title = "Barplot Proportion Gender and Stroke Amount", x = "Stroke", y = "Amount")
EXPLANATION
However, if we look at this proportion barplot, it can be seen that the proportion of female who have had a stroke is slightly higher than that of male.
Furthermore, we can also see that the proportion of male and female who have had a stroke and have not had a stroke is almost equal, so we can make the assumption that gender doesn't really affect whether someone has a stroke.
ggplot(strokeData, aes(x = stroke, fill = hypertension)) +
geom_bar(position = "dodge") +
labs(title = "Barplot Hypertension and Stroke Amount", x = "Stroke", y = "Amount")
EXPLANATION
This barplot shows people who do not have hypertension are more likely to have strokes than people who have hypertension.
However, we have to check the proportion barplot because it has a huge frequency difference.
ggplot(strokeData, aes(x = stroke, fill = hypertension)) +
geom_bar(position = "fill") +
labs(title = "Barplot Proportion Hypertension and Stroke Amount", x = "Stroke", y = "Amount")
EXPLANATION
After looking at other perceptions, the proportion barchart tells that the proportion of people with hypertension who have a stroke is higher than the proportion of people with hypertension who don't have a stroke.So, people with hypertension are more likely to have strokes.
ggplot(strokeData, aes(x = stroke, fill = heart_disease)) +
geom_bar(position = "dodge") +
labs(title = "Barplot Heart Disease and Stroke Amount", x = "Stroke", y = "Amount")
EXPLANATION
In this dataset, the number of people who suffer from heart disease is much less than people who do not suffer from heart disease.
From this plot, we see that people who do not have heart disease are more likely to have strokes than people who have heart disease.
ggplot(strokeData, aes(x = stroke, fill = heart_disease)) +
geom_bar(position = "fill") +
labs(title = "Barplot Proportion Heart Disease and Stroke Amount", x = "Stroke", y = "Amount")
EXPLANATION
In the proportion barplot of heart disease, it can be seen that the proportion of people with heart disease who suffer from stroke is higher than people with heart disease that do not suffer from stroke. So, people who suffer from heart disease are more likely to have strokes.
ggplot(strokeData, aes(x = stroke, fill = ever_married)) +
geom_bar(position = "dodge") +
labs(title = "Barplot Ever Married and Stroke Amount", x = "Stroke", y = "Amount")
EXPLANATION
The number of people who have been married is more than people who have never married.
People who have ever married are more likely to have strokes than people who have never married.
ggplot(strokeData, aes(x = stroke, fill = ever_married)) +
geom_bar(position = "fill") +
labs(title = "Barplot Proportion Ever Married and Stroke Amount", x = "Stroke", y = "Amount")
EXPLANATION
When examined from the perception of the proportion barplot, it appears that people who have ever married are more likely to have strokes.
ggplot(strokeData, aes(x = stroke, fill = work_type)) +
geom_bar(position = "dodge") +
labs(title = "Barplot Work Types and Stroke Amount", x = "Stroke", y = "Amount")
EXPLANATION
Judging from the barchart above, the number of people who have a 'Private' type of work is the highest.
People who have a "Private" type of work also have more strokes than people with the self-employed and government job type.
However, because the number of people of each type is not evenly distributed. We need to look at the proportion barchart too.
ggplot(strokeData, aes(x = work_type, fill = as.logical(strokeData$stroke))) +
geom_bar(position = "fill") +
labs(title = "Barplot Proportion Work Types and Stroke Amount", x = "Work Types", y = "Amount")
## Warning: Use of `strokeData$stroke` is discouraged. Use `stroke` instead.
EXPLANATION
After looking at other perceptions, the proportion barchart illustrates that the proportion of stroke with the self-employed type of work is the highest compared to other types of work.
The proportion of people who have had a stroke with private and government jobs is almost equal.
Furthermore, almost all children and people who have never worked do not suffer a stroke.
ggplot(strokeData, aes(x = stroke, fill = Residence_type)) +
geom_bar(position = "dodge") +
labs(title = "Barplot Residence Types and Stroke Amount", x = "Stroke", y = "Amount")
EXPLANATION
The difference is not too big, but the number of people living in urban areas who suffer from stroke is slightly higher than in rural areas.
ggplot(strokeData, aes(x = stroke, fill = smoking_status)) +
geom_bar(position = "dodge") +
labs(title = "Barplot Smoking Status and Stroke Amount", x = "Stroke", y = "Amount")
EXPLANATION
Judging from the barplot above, the number of people who have never smoked and 'unknown' status dominates this dataset.
People who have never smoked are more likely to have a stroke than people who were previously smoked, active smokers, and people whose smoking status is unknown.
However, because the number of people of each type is not evenly distributed. We need to look at the proportion barchart to get a more accurate insight.
ggplot(strokeData, aes(x = smoking_status, fill = as.logical(strokeData$stroke))) +
geom_bar(position = "fill") +
labs(title = "Barplot Proportion Smoking Status and Stroke Amount", x = "Smoking Status", y = "Amount")
## Warning: Use of `strokeData$stroke` is discouraged. Use `stroke` instead.
EXPLANATION
After looking at other perceptions, the proportion barchart shows that the proportion of strokes of people who formerly smoked is the highest compared to other smoking status. So, people who formerly smoked are more likely to have a stroke.
Then followed by the proportion of strokes of people who are active smokers and never smoked.
Furthermore, the proportion of strokes of people with unknown smoking status is the least.
Summary Bar Chart Explanation
1) In general, there is no strong correlation between the stroke variable and other categorical variables.
2) There are more female gender than male gender in this dataset.
3) The proportion of male and female who have had a stroke and have not had a stroke is almost equal, so we can make the assumption that gender doesn't really affect whether someone has a stroke.
4) People with hypertension are more likely to have strokes.
5) People who suffer from heart disease are more likely to have strokes.
6) People who have ever married are more likely to have strokes
7) The proportion of stroke with the self-employed type of work is the highest
8) The number of people living in urban areas who suffer from stroke is slightly higher than in rural areas.
9) People who formerly smoked are more likely to have a stroke.
data.frame(table(strokeData$stroke))
## Var1 Freq
## 1 0 4860
## 2 1 249
EXPLANATION
The number of people who have not had a stroke (4860) is more than the number of people who have had a stroke (249).
ggplot(strokeData, aes(x = stroke)) +
geom_bar()
EXPLANATION
The number of people who have not had a stroke is more than the number of people who have had a stroke. This is included in imbalanced data, so it must be solved during data preparation.
#Select data for people who have had a stroke
sumStroke <- which(strokeData$stroke == 1)
sliceStroke <- slice(strokeData, sumStroke)
#Select data for people who have not had stroke
sumNoStroke <- which(strokeData$stroke == 0)
sliceNoStroke <- slice(strokeData, sumNoStroke)
layout(mat = matrix(c(1, 2, 1, 3),
nrow = 2,
ncol = 2),
heights = c(2,2),
widths = c(3,2))
par(mar = c(4,2,1,1))
hist(strokeData$age)
par(mar = c(2,2,4,4))
# Age histogram for people who have had a stroke
hist(sliceNoStroke$age)
par(mar = c(2,4,4,1))
# Age histogram for people who have had a stroke
hist(sliceStroke$age)
EXPLANATION
The histogram of age variable as a whole is almost evenly distributed, with peak frequency around the range of 40 - 60.
Judging from the histogram of people who have had a stroke, the distribution of people who have had a stroke starts from the age of 30 and over, and most of them are in their 70s.
layout(mat = matrix(c(1, 2, 1, 3),
nrow = 2,
ncol = 2),
heights = c(2,2),
widths = c(3,2))
par(mar = c(4,2,1,1))
hist(strokeData$avg_glucose_level)
par(mar = c(2,2,4,4))
# Age histogram for people who have had a stroke
hist(sliceNoStroke$avg_glucose_level)
par(mar = c(2,4,4,1))
# Age histogram for people who have had a stroke
hist(sliceStroke$avg_glucose_level)
EXPLANATION
Judging from the overall histogram, the distribution of avg_glucose_level variable is right skewed, with a range around the 60-100.
However, when viewed from the histogram of people who had a stroke, the distribution was not normal. Furthermore, most people who have had a stroke have a glucose level of around 60 - 100, followed by 200 who are more likely to get a stroke.
layout(mat = matrix(c(1, 2, 1, 3),
nrow = 2,
ncol = 2),
heights = c(2,2),
widths = c(3,2))
par(mar = c(4,2,1,1))
hist(strokeData$bmi)
par(mar = c(2,2,4,4))
# Age histogram for people who have had a stroke
hist(sliceNoStroke$bmi)
par(mar = c(2,4,4,1))
# Age histogram for people who have had a stroke
hist(sliceStroke$bmi)
EXPLANATION
Judging from the histogram as a whole, the distribution of the bmi variable is right skewed, with a range around the 20 - 35.
Furthermore, if you look at the histograms of people who have suffered a stroke, people who have a BMI in the range 25 - 35 are more likely to get a stroke.
SUMMARY EXPLANATION of NUMERICAL
In this dataset, the number of people who have not had a stroke is more than the number of people who have had a stroke. This is included in imbalanced data, so it must be solved during data preparation.
From the histogram of people who have suffered from stroke that there is a correlation between the variables of `stroke` with `age` and `avg_glucose_level`.
Therefore, it can be concluded that people aged 30 and over, especially those in their 70s, have a glucose level in the range of 60 - 100 or 200 and above, and have a BMI value in the range of 25 - 35 have a tendency to stroke.
strokeData <- strokeData[-c(1)]
sapply(strokeData, function(x) sum(is.na(x)))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 201 0 0
EXPLANATION
Variable `id` has been removed.
As mentioned above, the BMI variable has a missing value of 201 data points or equivalent to 3.9335% of the total data (only a few). So, we will do the imputation by replacing the missing value in BMI with the Mean value.
The missing value in the BMI variable will be replaced with its Mean value.
strokeData$bmi[is.na(strokeData$bmi)] <- mean(strokeData$bmi, na.rm = T)
sapply(strokeData, function(x) sum(is.na(x)))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 0 0 0
EXPLANATION
After replacing the missing value in BMI with the mean value, this data is free from missing values.
#training 80, validation 20
set.seed(1)
training_index = createDataPartition(strokeData$stroke, p = 0.8, list = FALSE)
# p = 0.8 --> 80% of the records in the dataset divided to training set and 20% of the records to testing set.
valSet = strokeData[-training_index,]
trainSet = strokeData[training_index,]
EXPLANATION
Divide the data into 80% training set and 20% validation set for modeling.
#Check if the data is shuffled (not sequential)
head(trainSet)
## gender age hypertension heart_disease ever_married work_type
## 1 Male 67 FALSE TRUE Yes Private
## 3 Male 80 FALSE TRUE Yes Private
## 4 Female 49 FALSE FALSE Yes Private
## 5 Female 79 TRUE FALSE Yes Self-employed
## 6 Male 81 FALSE FALSE Yes Private
## 7 Male 74 TRUE TRUE Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24.0 never smoked 1
## 6 Urban 186.21 29.0 formerly smoked 1
## 7 Rural 70.09 27.4 never smoked 1
trainSet %>% count(stroke)
## stroke n
## 1 0 3885
## 2 1 203
valSet %>% count(stroke)
## stroke n
## 1 0 975
## 2 1 46
EXPLANATION
The data has been shuffled (not sequential) and the proportion between the trainset and the validation set is equal.
However, there is a very big difference between the number of data who suffered a stroke and the number of data that did not suffer a stroke.
The difference is no stroke (3885) : stroke (203), about 19 : 1, which is included in the imbalanced data.
Therefore, we have to solve it using the ROSE method (Random Over Sampling Examples)
trainSet <- ovun.sample(stroke ~ ., data = trainSet,
method = "both",
p = 0.5,
N = 3088,
seed = 1)$data
trainSet %>% count(stroke)
## stroke n
## 1 0 1608
## 2 1 1480
EXPLANATION
After applying the ROSE method, the difference in the number of frequencies between people who have strokes and those who don't have strokes is not much. The ROSE method is only used for the training set, without disturbing the validation set.
After doing EDA and Data Preparation, some of the most important features in this dataset are `age` and `average glucose level` variable. This is because these two variables are very good in determining other diseases, we can find out whether the person has `hypertension`, `heart disease`, and `stroke`. We can make an assumption that the variables of `age` and `glucose level` are very important and have a strong correlation to predict `stroke`.
logisticModel1 <- glm(stroke ~ ., family = binomial(link = "logit"), data = trainSet)
summary(logisticModel1)
##
## Call:
## glm(formula = stroke ~ ., family = binomial(link = "logit"),
## data = trainSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5671 -0.6805 -0.1912 0.7476 2.5389
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.264e+00 3.719e-01 -11.465 < 2e-16 ***
## genderMale -2.360e-01 9.816e-02 -2.405 0.016189 *
## age 8.646e-02 3.966e-03 21.800 < 2e-16 ***
## hypertensionTRUE 4.284e-01 1.269e-01 3.375 0.000737 ***
## heart_diseaseTRUE 5.000e-01 1.659e-01 3.014 0.002576 **
## ever_marriedYes 1.223e-01 1.611e-01 0.759 0.447795
## work_typeGovt_job -2.355e+00 4.097e-01 -5.747 9.08e-09 ***
## work_typeNever_worked -1.249e+01 3.102e+02 -0.040 0.967885
## work_typePrivate -1.816e+00 3.901e-01 -4.655 3.24e-06 ***
## work_typeSelf-employed -2.395e+00 4.150e-01 -5.771 7.88e-09 ***
## Residence_typeUrban 4.073e-02 9.298e-02 0.438 0.661334
## avg_glucose_level 4.191e-03 9.024e-04 4.644 3.41e-06 ***
## bmi 2.108e-02 7.390e-03 2.852 0.004343 **
## smoking_statusnever smoked -3.947e-01 1.236e-01 -3.192 0.001412 **
## smoking_statussmokes -2.228e-02 1.498e-01 -0.149 0.881747
## smoking_statusUnknown 2.273e-02 1.400e-01 0.162 0.871089
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4275.6 on 3087 degrees of freedom
## Residual deviance: 2870.8 on 3072 degrees of freedom
## AIC: 2902.8
##
## Number of Fisher Scoring iterations: 13
EXPLANATION
By looking at the logistic model 1's signif. codes, the below attributes are not statistically significant to the fitting, I exclude those from the model:
Gender, heart_disease, Residence_type, bmi, smoking_status
Fit the data again with the below mentioned remaining attributes:
age, avg_glucose_level, work_type, hypertension
logisticModel2 <- glm(stroke ~ age + avg_glucose_level + work_type + hypertension, family = binomial(link = "logit"), data = trainSet)
summary(logisticModel2)
##
## Call:
## glm(formula = stroke ~ age + avg_glucose_level + work_type +
## hypertension, family = binomial(link = "logit"), data = trainSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4057 -0.6884 -0.2302 0.7554 2.6704
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.024217 0.299642 -13.430 < 2e-16 ***
## age 0.085334 0.003605 23.671 < 2e-16 ***
## avg_glucose_level 0.005324 0.000847 6.286 3.26e-10 ***
## work_typeGovt_job -2.030551 0.364910 -5.565 2.63e-08 ***
## work_typeNever_worked -12.522631 311.791164 -0.040 0.967963
## work_typePrivate -1.509938 0.342367 -4.410 1.03e-05 ***
## work_typeSelf-employed -2.150603 0.369698 -5.817 5.98e-09 ***
## hypertensionTRUE 0.420996 0.125334 3.359 0.000782 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4275.6 on 3087 degrees of freedom
## Residual deviance: 2908.8 on 3080 degrees of freedom
## AIC: 2924.8
##
## Number of Fisher Scoring iterations: 13
EXPLANATION
After removing the variables, look again at the logistic model 2's signif. codes, it can be seen that all of the attributes have a good correlation with the `stroke` variable because it has 3 stars
The results of logistic model 2, support our assumption when determining the important feature, which states that the variables `age` and `glucose level` have a strong correlation and play an important role in predicting `stroke`.
However, turns out that the work_type and hypertension variables also have a correlation in predicting `stroke`.
predictLog <- predict(logisticModel2, newdata = valSet, type = "response")
pred <- prediction(predictLog, valSet$stroke)
rocCurve <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(rocCurve)
EXPLANATION
The plot above is the ROC result from logistic model 2. The curve of the model is quite good, because the last increase of y is almost close to 1.0, which is around 0.8.
aucScore <- performance(pred, measure = "auc")
aucScore <- aucScore@y.values[[1]]
print(paste("AUC Testing Set: ", aucScore))
## [1] "AUC Testing Set: 0.835273132664434"
EXPLANATION
The AUC (area under the curve) for this logistic model is 0.835, this number is good because the higher the AUC (closer to 1.0), the better the model.
res <- ifelse(predictLog > 0.5, 1, 0)
misclassError <- mean(res != valSet$stroke)
print(paste("Accuracy Testing Set: ", 1-misclassError))
## [1] "Accuracy Testing Set: 0.780607247796278"
EXPLANATION
If the accuracy of the model is more than 70%, then the model is good. This model has a good accuracy, which is about 78%.
SUMMARY EXPLANATION
The AUC and model accuracy figures show that this model is quite good and the difference between the AUC and the accuracy model is not too far away, which is only about 0.05. So, this model is good.
ROC_T <- predict(logisticModel2, newdata = subset(trainSet), type = "response")
ROC_V <- predict(logisticModel2, newdata = subset(valSet), type = "response")
ROCobjT <- roc(trainSet$stroke, ROC_T)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ROCobjV <- roc(valSet$stroke, ROC_V)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
par(mfrow = c(1,2))
plot(ROCobjT, print.auc = TRUE,
auc.polygon = TRUE,
grid = c(0.1, 0.2),
main = "Plot ROC Training Set")
plot(ROCobjV, print.auc = TRUE,
auc.polygon = TRUE,
grid = c(0.1, 0.2),
main = "Plot ROC Validation Set")
EXPLANATION
This model is neither overfit nor underfit because the AUC training set and validation set values are almost equal with a very small difference, which is 0.02. So this model is included in the good-fit.
SUMMARY
So the model that will be used is logistic model 2 which has 4 predictors, namely `age`, `avg_glucose_level`, `hypertension`, and `work_type`. This model is included in a good-fit and is good enough to predict stroke with an accuracy of about 78%.
DesTModel <- rpart(as.factor(trainSet$stroke) ~ age + avg_glucose_level + work_type + hypertension, data = trainSet, method = "class", cp = 0.01)
DesTModel
## n= 3088
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3088 1480 0 (0.52072539 0.47927461)
## 2) age< 56.5 1350 221 0 (0.83629630 0.16370370)
## 4) age< 44.5 862 48 0 (0.94431555 0.05568445) *
## 5) age>=44.5 488 173 0 (0.64549180 0.35450820)
## 10) avg_glucose_level< 160.27 380 100 0 (0.73684211 0.26315789) *
## 11) avg_glucose_level>=160.27 108 35 1 (0.32407407 0.67592593) *
## 3) age>=56.5 1738 479 1 (0.27560414 0.72439586) *
rpart.plot(DesTModel, extra = 110)
DesTprediction <- predict(DesTModel, newdata = valSet, type = "class")
confusion_mat <- confusionMatrix(DesTprediction, as.factor(valSet$stroke))
confusion_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 687 9
## 1 288 37
##
## Accuracy : 0.7091
## 95% CI : (0.6802, 0.7368)
## No Information Rate : 0.9549
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1309
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7046
## Specificity : 0.8043
## Pos Pred Value : 0.9871
## Neg Pred Value : 0.1138
## Prevalence : 0.9549
## Detection Rate : 0.6729
## Detection Prevalence : 0.6817
## Balanced Accuracy : 0.7545
##
## 'Positive' Class : 0
##
EXPLANATION
Overall, the accuracy is quite good, which is around 70%.
1) True Positive : 687 (When the model correctly predicts positive class)
2) False Positive : 9 (When the model incorrectly predicts positive class)
3) False Negative : 288 (When the model incorrectly predicts negative class)
4) True Negative : 37 (When the model correctly predicts negative class)
From the output above, we can see that :
The True Positive Rate (Sensitivity) : 70,46 %
The True Negative Rate (Specificity) : 80,43 %
The False Positive Rate : 1 - Specificity = 19,57 %
The False Negative Rate : 1 - Sensitivity = 29,54 %
Therefore, we can conclude that this model is good because it can predict most of the data correctly, as evidenced by the high true positive rate and true negative rate.