Video Link (Copy the link to the browser) :

https://binusianorg-my.sharepoint.com/personal/natasha_winata_binus_ac_id/_layouts/15/guestaccess.aspx?docid=0667e978528644d94aa31a68ea2933eb9&authkey=AdYc4F3R01TBqizBlqQ7r7Q&e=Hw4p0d

Load Libraries and Data

#Load libraries
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.1.3
library(rpart)
## Warning: package 'rpart' was built under R version 4.1.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.1.3
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.1.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.3
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(pROC)
## Warning: package 'pROC' was built under R version 4.1.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(ROSE)
## Loaded ROSE 0.0-4
#Import data
strokeData <- read.csv("E:/Binus University/Semester 2/Data Mining and Visualization/Final Exam Lec/StrokeData.csv", header = T, na.strings = c(""))

#Print data
head(strokeData)
##      id gender age hypertension heart_disease ever_married     work_type
## 1  9046   Male  67            0             1          Yes       Private
## 2 51676 Female  61            0             0          Yes Self-employed
## 3 31112   Male  80            0             1          Yes       Private
## 4 60182 Female  49            0             0          Yes       Private
## 5  1665 Female  79            1             0          Yes Self-employed
## 6 56669   Male  81            0             0          Yes       Private
##   Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1          Urban            228.69 36.6 formerly smoked      1
## 2          Rural            202.21  N/A    never smoked      1
## 3          Rural            105.92 32.5    never smoked      1
## 4          Urban            171.23 34.4          smokes      1
## 5          Rural            174.12   24    never smoked      1
## 6          Urban            186.21   29 formerly smoked      1

Attribute Information

1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_jov", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12) stroke: 1 if the patient had a stroke or 0 if not
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

Goals

Make Predictive Logistic Model and Decision Tree to predict `stroke`, which is the target variable.

1. a. Exploratory Data Analysis

dim(strokeData)
## [1] 5110   12

EXPLANATION

The result above indicated that there are 5110 observations (rows) and 12 variables (columns) of data.
str(strokeData)
## 'data.frame':    5110 obs. of  12 variables:
##  $ id               : int  9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
##  $ gender           : chr  "Male" "Female" "Male" "Female" ...
##  $ age              : num  67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ heart_disease    : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ ever_married     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ work_type        : chr  "Private" "Self-employed" "Private" "Private" ...
##  $ Residence_type   : chr  "Urban" "Rural" "Rural" "Urban" ...
##  $ avg_glucose_level: num  229 202 106 171 174 ...
##  $ bmi              : chr  "36.6" "N/A" "32.5" "34.4" ...
##  $ smoking_status   : chr  "formerly smoked" "never smoked" "never smoked" "smokes" ...
##  $ stroke           : int  1 1 1 1 1 1 1 1 1 1 ...

EXPLANATION

There are three data types that were used in this data set, they are num, char, and int.

However, there are several data types that do not match the contents, so the data types must be changed, including:
- from char to factor: `gender`, `ever_married`, `work_type`, `residence_type`, `smoking_status`
- from char to num : `bmi`
- from int to logical : `hypertension`, `heart_disease`

Change data types

strokeData$gender <- as.factor(strokeData$gender)
strokeData$hypertension <- as.logical(strokeData$hypertension)
strokeData$heart_disease <- as.logical(strokeData$heart_disease)
strokeData$work_type <- as.factor(strokeData$work_type)
strokeData$Residence_type <- as.factor(strokeData$Residence_type)
strokeData$bmi <- as.numeric(strokeData$bmi)
## Warning: NAs introduced by coercion
strokeData$smoking_status <- as.factor(strokeData$smoking_status)

EXPLANATION

After changing the data types, check whether the data types have changed by using the str function.
str(strokeData)
## 'data.frame':    5110 obs. of  12 variables:
##  $ id               : int  9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
##  $ gender           : Factor w/ 3 levels "Female","Male",..: 2 1 2 1 1 2 2 1 1 1 ...
##  $ age              : num  67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ heart_disease    : logi  TRUE FALSE TRUE FALSE FALSE FALSE ...
##  $ ever_married     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ work_type        : Factor w/ 5 levels "children","Govt_job",..: 4 5 4 4 5 4 4 4 4 4 ...
##  $ Residence_type   : Factor w/ 2 levels "Rural","Urban": 2 1 1 2 1 2 1 2 1 2 ...
##  $ avg_glucose_level: num  229 202 106 171 174 ...
##  $ bmi              : num  36.6 NA 32.5 34.4 24 29 27.4 22.8 NA 24.2 ...
##  $ smoking_status   : Factor w/ 4 levels "formerly smoked",..: 1 2 2 3 2 1 2 2 4 4 ...
##  $ stroke           : int  1 1 1 1 1 1 1 1 1 1 ...

EXPLANATION

The data type has changed.
summary(strokeData)
##        id           gender          age        hypertension    heart_disease  
##  Min.   :   67   Female:2994   Min.   : 0.08   Mode :logical   Mode :logical  
##  1st Qu.:17741   Male  :2115   1st Qu.:25.00   FALSE:4612      FALSE:4834     
##  Median :36932   Other :   1   Median :45.00   TRUE :498       TRUE :276      
##  Mean   :36518                 Mean   :43.23                                  
##  3rd Qu.:54682                 3rd Qu.:61.00                                  
##  Max.   :72940                 Max.   :82.00                                  
##                                                                               
##  ever_married               work_type    Residence_type avg_glucose_level
##  Length:5110        children     : 687   Rural:2514     Min.   : 55.12   
##  Class :character   Govt_job     : 657   Urban:2596     1st Qu.: 77.25   
##  Mode  :character   Never_worked :  22                  Median : 91.89   
##                     Private      :2925                  Mean   :106.15   
##                     Self-employed: 819                  3rd Qu.:114.09   
##                                                         Max.   :271.74   
##                                                                          
##       bmi                smoking_status     stroke       
##  Min.   :10.30   formerly smoked: 885   Min.   :0.00000  
##  1st Qu.:23.50   never smoked   :1892   1st Qu.:0.00000  
##  Median :28.10   smokes         : 789   Median :0.00000  
##  Mean   :28.89   Unknown        :1544   Mean   :0.04873  
##  3rd Qu.:33.10                          3rd Qu.:0.00000  
##  Max.   :97.60                          Max.   :1.00000  
##  NA's   :201
describe(strokeData)
## strokeData 
## 
##  12  Variables      5110  Observations
## --------------------------------------------------------------------------------
## id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5110        0     5110        1    36518    24436     3590     6972 
##      .25      .50      .75      .90      .95 
##    17741    36932    54682    65668    69218 
## 
## lowest :    67    77    84    91    99, highest: 72911 72914 72915 72918 72940
## --------------------------------------------------------------------------------
## gender 
##        n  missing distinct 
##     5110        0        3 
##                                
## Value      Female   Male  Other
## Frequency    2994   2115      1
## Proportion  0.586  0.414  0.000
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5110        0      104        1    43.23    26.03        5       11 
##      .25      .50      .75      .90      .95 
##       25       45       61       75       79 
## 
## lowest :  0.08  0.16  0.24  0.32  0.40, highest: 78.00 79.00 80.00 81.00 82.00
## --------------------------------------------------------------------------------
## hypertension 
##        n  missing distinct 
##     5110        0        2 
##                       
## Value      FALSE  TRUE
## Frequency   4612   498
## Proportion 0.903 0.097
## --------------------------------------------------------------------------------
## heart_disease 
##        n  missing distinct 
##     5110        0        2 
##                       
## Value      FALSE  TRUE
## Frequency   4834   276
## Proportion 0.946 0.054
## --------------------------------------------------------------------------------
## ever_married 
##        n  missing distinct 
##     5110        0        2 
##                       
## Value         No   Yes
## Frequency   1757  3353
## Proportion 0.344 0.656
## --------------------------------------------------------------------------------
## work_type 
##        n  missing distinct 
##     5110        0        5 
## 
## lowest : children      Govt_job      Never_worked  Private       Self-employed
## highest: children      Govt_job      Never_worked  Private       Self-employed
##                                                                   
## Value           children      Govt_job  Never_worked       Private
## Frequency            687           657            22          2925
## Proportion         0.134         0.129         0.004         0.572
##                         
## Value      Self-employed
## Frequency            819
## Proportion         0.160
## --------------------------------------------------------------------------------
## Residence_type 
##        n  missing distinct 
##     5110        0        2 
##                       
## Value      Rural Urban
## Frequency   2514  2596
## Proportion 0.492 0.508
## --------------------------------------------------------------------------------
## avg_glucose_level 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5110        0     3979        1    106.1    45.38    60.71    65.79 
##      .25      .50      .75      .90      .95 
##    77.24    91.88   114.09   192.18   216.29 
## 
## lowest :  55.12  55.22  55.23  55.25  55.26, highest: 266.59 267.60 267.61 267.76 271.74
## --------------------------------------------------------------------------------
## bmi 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4909      201      418        1    28.89    8.538    17.64    19.70 
##      .25      .50      .75      .90      .95 
##    23.50    28.10    33.10    38.90    42.96 
## 
## lowest : 10.3 11.3 11.5 12.0 12.3, highest: 66.8 71.9 78.0 92.0 97.6
## --------------------------------------------------------------------------------
## smoking_status 
##        n  missing distinct 
##     5110        0        4 
##                                                                           
## Value      formerly smoked    never smoked          smokes         Unknown
## Frequency              885            1892             789            1544
## Proportion           0.173           0.370           0.154           0.302
## --------------------------------------------------------------------------------
## stroke 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     5110        0        2    0.139      249  0.04873  0.09273 
## 
## --------------------------------------------------------------------------------

EXPLANATION

The insights gained from the result of summary and described are:

1) The number of female gender (2994) is more than other genders. 

2) The average of the `age` is 43.

3) The frequency of people who have hypertension (4612) is more than the frequency of people who do not have hypertension (698).

4) `bmi` contains NA or missing values.

5) In this dataset, there are more people without heart disease than those with heart disease.

6) The type of work "Private" has the highest number among others, which is 2925 people.

7) The distribution between rural and urban residence types is almost equal.

8) Average of `glucose level` is 106.

9) The average `bmi` is 28,9 (will change after imputation in data preparation).

10) Number of people who have never smoked and 'unknown' status dominates in this dataset.
sapply(strokeData, function(x) sum(is.na(x)))
##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0               201                 0                 0

EXPLANATION

After changing the data types, it was found that the missing value in the BMI variable was 201 data points or equivalent to 3.9335% of the total data (only a few). So, we will do imputation (in data preparation)
all(duplicated(strokeData)==TRUE)
## [1] FALSE

EXPLANATION

The syntax above returned FALSE. This indicates that there are no duplicated data in this dataset. 
sapply(strokeData, function(x) length(unique(x)))
##                id            gender               age      hypertension 
##              5110                 3               104                 2 
##     heart_disease      ever_married         work_type    Residence_type 
##                 2                 2                 5                 2 
## avg_glucose_level               bmi    smoking_status            stroke 
##              3979               419                 4                 2
levels(strokeData$gender)
## [1] "Female" "Male"   "Other"
levels(strokeData$work_type)
## [1] "children"      "Govt_job"      "Never_worked"  "Private"      
## [5] "Self-employed"
levels(strokeData$smoking_status)
## [1] "formerly smoked" "never smoked"    "smokes"          "Unknown"
levels(strokeData$Residence_type)
## [1] "Rural" "Urban"

EXPLANATION

1) There are 3 types of gender : "Male", "Female", "Other"

2) There are 5 work types : "children", "Govt_job", "Never_worked", "Private", "Self-Employed"

3) There are 4 status types of smoking : "formerly smoked", "never smoked", "smokes", "Unknown"

4) There are 2 types of residence : "Rural", "Urban"

Check Outliers in Numerical Data

ThreeSigma <- function(x, t = 3){

 mu <- mean(x, na.rm = TRUE)
 sig <- sd(x, na.rm = TRUE)
 if (sig == 0){
  message("All non-missing x-values are identical")
 }
 up <- mu + t * sig
 down <- mu - t * sig
 out <- list(up = up, down = down)
 return(out)
 }

Hampel <- function(x, t = 3){

 mu <- median(x, na.rm = TRUE)
 sig <- mad(x, na.rm = TRUE)
 if (sig == 0){
  message("Hampel identifer implosion: MAD scale estimate is zero")
 }
 up <- mu + t * sig
 down <- mu - t * sig
 out <- list(up = up, down = down)
 return(out)
}
   
BoxplotRule<- function(x, t = 1.5){

 xL <- quantile(x, na.rm = TRUE, probs = 0.25, names = FALSE)
 xU <- quantile(x, na.rm = TRUE, probs = 0.75, names = FALSE)
 Q <- xU - xL
 if (Q == 0){
  message("Boxplot rule implosion: interquartile distance is zero")
 }
 up <- xU + t * Q
 down <- xL - t * Q
 out <- list(up = up, down = down)
 return(out)
}   

ExtractDetails <- function(x, down, up){

 outClass <- rep("N", length(x))
 indexLo <- which(x < down)
 indexHi <- which(x > up)
 outClass[indexLo] <- "L"
 outClass[indexHi] <- "U"
 index <- union(indexLo, indexHi)
 values <- x[index]
 outClass <- outClass[index]
 nOut <- length(index)
 maxNom <- max(x[which(x <= up)])
 minNom <- min(x[which(x >= down)])
 outList <- list(nOut = nOut, lowLim = down,
 upLim = up, minNom = minNom,
 maxNom = maxNom, index = index,
 values = values,
 outClass = outClass)
 return(outList)
 }
FindOutliers <- function(x, t3 = 3, tH = 3, tb = 1.5){
 threeLims <- ThreeSigma(x, t = t3)
 HampLims <- Hampel(x, t = tH)
 boxLims <- BoxplotRule(x, t = tb)

 n <- length(x)
 nMiss <- length(which(is.na(x)))

 threeList <- ExtractDetails(x, threeLims$down, threeLims$up)
 HampList <- ExtractDetails(x, HampLims$down, HampLims$up)
 boxList <- ExtractDetails(x, boxLims$down, boxLims$up)

 sumFrame <- data.frame(method = "ThreeSigma", n = n,
 nMiss = nMiss, nOut = threeList$nOut,
 lowLim = threeList$lowLim,
 upLim = threeList$upLim,
 minNom = threeList$minNom,
 maxNom = threeList$maxNom)
 upFrame <- data.frame(method = "Hampel", n = n,
 nMiss = nMiss, nOut = HampList$nOut,
 lowLim = HampList$lowLim,
 upLim = HampList$upLim,
 minNom = HampList$minNom,
 maxNom = HampList$maxNom)
 sumFrame <- rbind.data.frame(sumFrame, upFrame)
 upFrame <- data.frame(method = "BoxplotRule", n = n,
 nMiss = nMiss, nOut = boxList$nOut,
 lowLim = boxList$lowLim,
 upLim = boxList$upLim,
 minNom = boxList$minNom,
 maxNom = boxList$maxNom)
 sumFrame <- rbind.data.frame(sumFrame, upFrame)

 threeFrame <- data.frame(index = threeList$index,
 values = threeList$values,
 type = threeList$outClass)
 HampFrame <- data.frame(index = HampList$index,
 values = HampList$values,
 type = HampList$outClass)
 boxFrame <- data.frame(index = boxList$index,
 values = boxList$values,
 type = boxList$outClass)
 outList <- list(summary = sumFrame, threeSigma = threeFrame,
 Hampel = HampFrame, boxplotRule = boxFrame)
 return(outList)
}
outliers_age <- FindOutliers(strokeData$age)
outliers_age$summary
##        method    n nMiss nOut    lowLim    upLim minNom maxNom
## 1  ThreeSigma 5110     0    0 -24.61133 111.0646   0.08     82
## 2      Hampel 5110     0    0 -35.06040 125.0604   0.08     82
## 3 BoxplotRule 5110     0    0 -29.00000 115.0000   0.08     82
outliers_glucose <- FindOutliers(strokeData$avg_glucose_level)
outliers_glucose$summary
##        method    n nMiss nOut    lowLim    upLim minNom maxNom
## 1  ThreeSigma 5110     0   49 -29.70300 241.9984  55.12 240.86
## 2      Hampel 5110     0  621  13.69268 170.0773  55.12 170.05
## 3 BoxplotRule 5110     0  627  21.97750 169.3575  55.12 168.68
outliers_bmi <- FindOutliers(strokeData$bmi)
outliers_bmi$summary
##        method    n nMiss nOut   lowLim    upLim minNom maxNom
## 1  ThreeSigma 5110   201   58 5.331037 52.45544   10.3   52.3
## 2      Hampel 5110   201   90 7.195340 49.00466   10.3   48.9
## 3 BoxplotRule 5110   201  110 9.100000 47.50000   10.3   47.5

EXPLANATION

The find outliers function detects that there are no outliers in the `age` variable. However, it detects avg_glucose_level and bmi variables have outliers.

I chose to use the Hampel Identifier as the most reasonable outliers detector, because the coverage taken match the distribution of the data, as can be seen from the upper and lower limits which are not too large and not too narrow in scope.

The variable avg_glucose_level has 621 outliers and the variable bmi has 90 outliers.

I decided not to eliminate outliers because avg_glucose_level outliers were only 12% and bmi outliers were only 1.7% of the total data.

Exploring categorical data

table(strokeData$stroke, strokeData$gender)
##    
##     Female Male Other
##   0   2853 2007     1
##   1    141  108     0

EXPLANATION

There are some levels that have very low counts, which is `Other gender`. It only has 1 values, so it is okay to remove it from the dataset. It won't affect the datasets.
# Remove 'other' gender
strokeData <- strokeData %>%
  filter(gender != 'Other') %>%
  droplevels()

table(strokeData$stroke, strokeData$gender)
##    
##     Female Male
##   0   2853 2007
##   1    141  108

EXPLANATION

After removing `Other gender`, the gender variable only has 2 types, which are Female and Male.

Barcharts

ggplot(strokeData, aes(x = gender)) + 
  geom_bar()

EXPLANATION

There are more female gender than male gender in this dataset.
ggplot(strokeData, aes(x = stroke, fill = gender)) + 
  geom_bar(position = "dodge") +
  labs(title = "Barplot Gender and Stroke Amount", x = "Stroke", y = "Amount")

EXPLANATION

The number of Female gender who have had a stroke is higher than Male gender who have had a stroke.
ggplot(strokeData, aes(x = stroke, fill = gender)) + 
  geom_bar(position = "fill") +
  labs(title = "Barplot Proportion Gender and Stroke Amount", x = "Stroke", y = "Amount")

EXPLANATION

However, if we look at this proportion barplot, it can be seen that the proportion of female who have had a stroke is slightly higher than that of male. 

Furthermore, we can also see that the proportion of male and female who have had a stroke and have not had a stroke is almost equal, so we can make the assumption that gender doesn't really affect whether someone has a stroke.
ggplot(strokeData, aes(x = stroke, fill = hypertension)) + 
  geom_bar(position = "dodge") +
  labs(title = "Barplot Hypertension and Stroke Amount", x = "Stroke", y = "Amount")

EXPLANATION

This barplot shows people who do not have hypertension are more likely to have strokes than people who have hypertension.

However, we have to check the proportion barplot because it has a huge frequency difference.
ggplot(strokeData, aes(x = stroke, fill = hypertension)) + 
  geom_bar(position = "fill") +
  labs(title = "Barplot Proportion Hypertension and Stroke Amount", x = "Stroke", y = "Amount")

EXPLANATION

After looking at other perceptions, the proportion barchart tells that the proportion of people with hypertension who have a stroke is higher than the proportion of people with hypertension who don't have a stroke.So, people with hypertension are more likely to have strokes.
ggplot(strokeData, aes(x = stroke, fill = heart_disease)) + 
  geom_bar(position = "dodge") +
  labs(title = "Barplot Heart Disease and Stroke Amount", x = "Stroke", y = "Amount")

EXPLANATION

In this dataset, the number of people who suffer from heart disease is much less than people who do not suffer from heart disease.

From this plot, we see that people who do not have heart disease are more likely to have strokes than people who have heart disease.
ggplot(strokeData, aes(x = stroke, fill = heart_disease)) + 
  geom_bar(position = "fill") +
  labs(title = "Barplot Proportion Heart Disease and Stroke Amount", x = "Stroke", y = "Amount")

EXPLANATION

In the proportion barplot of heart disease, it can be seen that the proportion of people with heart disease who suffer from stroke is higher than people with heart disease that do not suffer from stroke. So, people who suffer from heart disease are more likely to have strokes.
ggplot(strokeData, aes(x = stroke, fill = ever_married)) + 
  geom_bar(position = "dodge") +
  labs(title = "Barplot Ever Married and Stroke Amount", x = "Stroke", y = "Amount")

EXPLANATION

The number of people who have been married is more than people who have never married.

People who have ever married are more likely to have strokes than people who have never married.
ggplot(strokeData, aes(x = stroke, fill = ever_married)) + 
  geom_bar(position = "fill") +
  labs(title = "Barplot Proportion Ever Married and Stroke Amount", x = "Stroke", y = "Amount")

EXPLANATION

When examined from the perception of the proportion barplot, it appears that people who have ever married are more likely to have strokes.
ggplot(strokeData, aes(x = stroke, fill = work_type)) + 
  geom_bar(position = "dodge") +
  labs(title = "Barplot Work Types and Stroke Amount", x = "Stroke", y = "Amount")

EXPLANATION

Judging from the barchart above, the number of people who have a 'Private' type of work is the highest.

People who have a "Private" type of work also have more strokes than people with the self-employed and government job type.

However, because the number of people of each type is not evenly distributed. We need to look at the proportion barchart too.
ggplot(strokeData, aes(x = work_type, fill =  as.logical(strokeData$stroke))) + 
  geom_bar(position = "fill") +
  labs(title = "Barplot Proportion Work Types and Stroke Amount", x = "Work Types", y = "Amount")
## Warning: Use of `strokeData$stroke` is discouraged. Use `stroke` instead.

EXPLANATION

After looking at other perceptions, the proportion barchart illustrates that the proportion of stroke with the self-employed type of work is the highest compared to other types of work. 

The proportion of people who have had a stroke with private and government jobs is almost equal.

Furthermore, almost all children and people who have never worked do not suffer a stroke.
ggplot(strokeData, aes(x = stroke, fill = Residence_type)) + 
  geom_bar(position = "dodge") +
  labs(title = "Barplot Residence Types and Stroke Amount", x = "Stroke", y = "Amount")

EXPLANATION

The difference is not too big, but the number of people living in urban areas who suffer from stroke is slightly higher than in rural areas.
ggplot(strokeData, aes(x = stroke, fill = smoking_status)) + 
  geom_bar(position = "dodge") +
  labs(title = "Barplot Smoking Status and Stroke Amount", x = "Stroke", y = "Amount")

EXPLANATION

Judging from the barplot above, the number of people who have never smoked and 'unknown' status dominates this dataset.

People who have never smoked are more likely to have a stroke than people who were previously smoked, active smokers, and people whose smoking status is unknown.

However, because the number of people of each type is not evenly distributed. We need to look at the proportion barchart to get a more accurate insight.
ggplot(strokeData, aes(x = smoking_status, fill = as.logical(strokeData$stroke))) + 
  geom_bar(position = "fill") +
  labs(title = "Barplot Proportion Smoking Status and Stroke Amount", x = "Smoking Status", y = "Amount")
## Warning: Use of `strokeData$stroke` is discouraged. Use `stroke` instead.

EXPLANATION

After looking at other perceptions, the proportion barchart shows that the proportion of strokes of people who formerly smoked is the highest compared to other smoking status. So, people who formerly smoked are more likely to have a stroke.

Then followed by the proportion of strokes of people who are active smokers and never smoked.

Furthermore, the proportion of strokes of people with unknown smoking status is the least.

Summary Bar Chart Explanation

1) In general, there is no strong correlation between the stroke variable and other categorical variables.

2) There are more female gender than male gender in this dataset.

3) The proportion of male and female who have had a stroke and have not had a stroke is almost equal, so we can make the assumption that gender doesn't really affect whether someone has a stroke.

4) People with hypertension are more likely to have strokes.

5) People who suffer from heart disease are more likely to have strokes.

6) People who have ever married are more likely to have strokes

7) The proportion of stroke with the self-employed type of work is the highest

8) The number of people living in urban areas who suffer from stroke is slightly higher than in rural areas.

9) People who formerly smoked are more likely to have a stroke.

Exploring numerical data

data.frame(table(strokeData$stroke))
##   Var1 Freq
## 1    0 4860
## 2    1  249

EXPLANATION

The number of people who have not had a stroke (4860) is more than the number of people who have had a stroke (249).
ggplot(strokeData, aes(x = stroke)) + 
  geom_bar()

EXPLANATION

The number of people who have not had a stroke is more than the number of people who have had a stroke. This is included in imbalanced data, so it must be solved during data preparation.
#Select data for people who have had a stroke
sumStroke <- which(strokeData$stroke == 1)
sliceStroke <- slice(strokeData, sumStroke)

#Select data for people who have not had stroke
sumNoStroke <- which(strokeData$stroke == 0)
sliceNoStroke <- slice(strokeData, sumNoStroke)
layout(mat = matrix(c(1, 2, 1, 3),
                    nrow = 2,
                    ncol  = 2),
       heights = c(2,2),
       widths = c(3,2))

par(mar = c(4,2,1,1))
hist(strokeData$age)

par(mar = c(2,2,4,4))
# Age histogram for people who have had a stroke
hist(sliceNoStroke$age)

par(mar = c(2,4,4,1))
# Age histogram for people who have had a stroke
hist(sliceStroke$age)

EXPLANATION

The histogram of age variable as a whole is almost evenly distributed, with peak frequency around the range of 40 - 60.

Judging from the histogram of people who have had a stroke, the distribution of people who have had a stroke starts from the age of 30 and over, and most of them are in their 70s.
layout(mat = matrix(c(1, 2, 1, 3),
                    nrow = 2,
                    ncol  = 2),
       heights = c(2,2),
       widths = c(3,2))

par(mar = c(4,2,1,1))
hist(strokeData$avg_glucose_level)

par(mar = c(2,2,4,4))
# Age histogram for people who have had a stroke
hist(sliceNoStroke$avg_glucose_level)

par(mar = c(2,4,4,1))
# Age histogram for people who have had a stroke
hist(sliceStroke$avg_glucose_level)

EXPLANATION

Judging from the overall histogram, the distribution of avg_glucose_level variable is right skewed, with a range around the 60-100.

However, when viewed from the histogram of people who had a stroke, the distribution was not normal. Furthermore, most people who have had a stroke have a glucose level of around 60 - 100, followed by 200 who are more likely to get a stroke.
layout(mat = matrix(c(1, 2, 1, 3),
                    nrow = 2,
                    ncol  = 2),
       heights = c(2,2),
       widths = c(3,2))

par(mar = c(4,2,1,1))
hist(strokeData$bmi)

par(mar = c(2,2,4,4))
# Age histogram for people who have had a stroke
hist(sliceNoStroke$bmi)

par(mar = c(2,4,4,1))
# Age histogram for people who have had a stroke
hist(sliceStroke$bmi)

EXPLANATION

Judging from the histogram as a whole, the distribution of the bmi variable is right skewed, with a range around the 20 - 35.

Furthermore, if you look at the histograms of people who have suffered a stroke, people who have a BMI in the range 25 - 35 are more likely to get a stroke.

SUMMARY EXPLANATION of NUMERICAL

In this dataset, the number of people who have not had a stroke is more than the number of people who have had a stroke. This is included in imbalanced data, so it must be solved during data preparation.

From the histogram of people who have suffered from stroke that there is a correlation between the variables of `stroke` with `age` and `avg_glucose_level`.

Therefore, it can be concluded that people aged 30 and over, especially those in their 70s, have a glucose level in the range of 60 - 100 or 200 and above, and have a BMI value in the range of 25 - 35 have a tendency to stroke.

1. b. Data Preparation

Remove unimportant variables

strokeData <- strokeData[-c(1)]
sapply(strokeData, function(x) sum(is.na(x)))
##            gender               age      hypertension     heart_disease 
##                 0                 0                 0                 0 
##      ever_married         work_type    Residence_type avg_glucose_level 
##                 0                 0                 0                 0 
##               bmi    smoking_status            stroke 
##               201                 0                 0

EXPLANATION

Variable `id` has been removed.

As mentioned above, the BMI variable has a missing value of 201 data points or equivalent to 3.9335% of the total data (only a few). So, we will do the imputation by replacing the missing value in BMI with the Mean value.

Handle missing value

Imputation

The missing value in the BMI variable will be replaced with its Mean value.
strokeData$bmi[is.na(strokeData$bmi)] <- mean(strokeData$bmi, na.rm = T)
sapply(strokeData, function(x) sum(is.na(x)))
##            gender               age      hypertension     heart_disease 
##                 0                 0                 0                 0 
##      ever_married         work_type    Residence_type avg_glucose_level 
##                 0                 0                 0                 0 
##               bmi    smoking_status            stroke 
##                 0                 0                 0

EXPLANATION

After replacing the missing value in BMI with the mean value, this data is free from missing values.

Split the data into train and validation sets

#training 80, validation 20
set.seed(1)
training_index = createDataPartition(strokeData$stroke, p = 0.8, list = FALSE)
# p = 0.8 --> 80% of the records in the dataset divided to training set and 20% of the records to testing set.
valSet = strokeData[-training_index,]
trainSet = strokeData[training_index,]

EXPLANATION

Divide the data into 80% training set and 20% validation set for modeling.

Check if the data is already shuffled

#Check if the data is shuffled (not sequential)

head(trainSet)
##   gender age hypertension heart_disease ever_married     work_type
## 1   Male  67        FALSE          TRUE          Yes       Private
## 3   Male  80        FALSE          TRUE          Yes       Private
## 4 Female  49        FALSE         FALSE          Yes       Private
## 5 Female  79         TRUE         FALSE          Yes Self-employed
## 6   Male  81        FALSE         FALSE          Yes       Private
## 7   Male  74         TRUE          TRUE          Yes       Private
##   Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1          Urban            228.69 36.6 formerly smoked      1
## 3          Rural            105.92 32.5    never smoked      1
## 4          Urban            171.23 34.4          smokes      1
## 5          Rural            174.12 24.0    never smoked      1
## 6          Urban            186.21 29.0 formerly smoked      1
## 7          Rural             70.09 27.4    never smoked      1
trainSet %>% count(stroke)
##   stroke    n
## 1      0 3885
## 2      1  203
valSet %>% count(stroke)
##   stroke   n
## 1      0 975
## 2      1  46

EXPLANATION

The data has been shuffled (not sequential) and the proportion between the trainset and the validation set is equal.

However, there is a very big difference between the number of data who suffered a stroke and the number of data that did not suffer a stroke. 

The difference is no stroke (3885) : stroke (203), about 19 : 1, which is included in the imbalanced data. 

Therefore, we have to solve it using the ROSE method (Random Over Sampling Examples)

Solve the imbalanced data

trainSet <- ovun.sample(stroke ~ ., data = trainSet, 
                        method = "both", 
                        p = 0.5,
                        N = 3088, 
                        seed = 1)$data
trainSet %>% count(stroke)
##   stroke    n
## 1      0 1608
## 2      1 1480

EXPLANATION

After applying the ROSE method, the difference in the number of frequencies between people who have strokes and those who don't have strokes is not much. The ROSE method is only used for the training set, without disturbing the validation set.

Important Feature

After doing EDA and Data Preparation, some of the most important features in this dataset are `age` and `average glucose level` variable. This is because these two variables are very good in determining other diseases, we can find out whether the person has `hypertension`, `heart disease`, and `stroke`. We can make an assumption that the variables of `age` and `glucose level` are very important and have a strong correlation to predict `stroke`.

1. c. MODELLING

logisticModel1 <- glm(stroke ~ ., family = binomial(link = "logit"), data = trainSet)
summary(logisticModel1)
## 
## Call:
## glm(formula = stroke ~ ., family = binomial(link = "logit"), 
##     data = trainSet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5671  -0.6805  -0.1912   0.7476   2.5389  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -4.264e+00  3.719e-01 -11.465  < 2e-16 ***
## genderMale                 -2.360e-01  9.816e-02  -2.405 0.016189 *  
## age                         8.646e-02  3.966e-03  21.800  < 2e-16 ***
## hypertensionTRUE            4.284e-01  1.269e-01   3.375 0.000737 ***
## heart_diseaseTRUE           5.000e-01  1.659e-01   3.014 0.002576 ** 
## ever_marriedYes             1.223e-01  1.611e-01   0.759 0.447795    
## work_typeGovt_job          -2.355e+00  4.097e-01  -5.747 9.08e-09 ***
## work_typeNever_worked      -1.249e+01  3.102e+02  -0.040 0.967885    
## work_typePrivate           -1.816e+00  3.901e-01  -4.655 3.24e-06 ***
## work_typeSelf-employed     -2.395e+00  4.150e-01  -5.771 7.88e-09 ***
## Residence_typeUrban         4.073e-02  9.298e-02   0.438 0.661334    
## avg_glucose_level           4.191e-03  9.024e-04   4.644 3.41e-06 ***
## bmi                         2.108e-02  7.390e-03   2.852 0.004343 ** 
## smoking_statusnever smoked -3.947e-01  1.236e-01  -3.192 0.001412 ** 
## smoking_statussmokes       -2.228e-02  1.498e-01  -0.149 0.881747    
## smoking_statusUnknown       2.273e-02  1.400e-01   0.162 0.871089    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4275.6  on 3087  degrees of freedom
## Residual deviance: 2870.8  on 3072  degrees of freedom
## AIC: 2902.8
## 
## Number of Fisher Scoring iterations: 13

EXPLANATION

By looking at the logistic model 1's signif. codes, the below attributes are not statistically significant to the fitting, I exclude those from the model:
  Gender, heart_disease, Residence_type, bmi, smoking_status

Fit the data again with the below mentioned remaining attributes:
 age, avg_glucose_level, work_type, hypertension
logisticModel2 <- glm(stroke ~ age + avg_glucose_level + work_type + hypertension, family = binomial(link = "logit"), data = trainSet)
summary(logisticModel2)
## 
## Call:
## glm(formula = stroke ~ age + avg_glucose_level + work_type + 
##     hypertension, family = binomial(link = "logit"), data = trainSet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4057  -0.6884  -0.2302   0.7554   2.6704  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -4.024217   0.299642 -13.430  < 2e-16 ***
## age                      0.085334   0.003605  23.671  < 2e-16 ***
## avg_glucose_level        0.005324   0.000847   6.286 3.26e-10 ***
## work_typeGovt_job       -2.030551   0.364910  -5.565 2.63e-08 ***
## work_typeNever_worked  -12.522631 311.791164  -0.040 0.967963    
## work_typePrivate        -1.509938   0.342367  -4.410 1.03e-05 ***
## work_typeSelf-employed  -2.150603   0.369698  -5.817 5.98e-09 ***
## hypertensionTRUE         0.420996   0.125334   3.359 0.000782 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4275.6  on 3087  degrees of freedom
## Residual deviance: 2908.8  on 3080  degrees of freedom
## AIC: 2924.8
## 
## Number of Fisher Scoring iterations: 13

EXPLANATION

After removing the variables, look again at the logistic model 2's signif. codes, it can be seen that all of the attributes have a good correlation with the `stroke` variable because it has 3 stars

The results of logistic model 2, support our assumption when determining the important feature, which states that the variables `age` and `glucose level` have a strong correlation and play an important role in predicting `stroke`. 

However, turns out that the work_type and hypertension variables also have a correlation in predicting `stroke`.

1. d. Asses the model

Evaluation

ROC plot of validation set

predictLog <- predict(logisticModel2, newdata = valSet, type = "response")
pred <- prediction(predictLog, valSet$stroke)
rocCurve <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(rocCurve)

EXPLANATION

The plot above is the ROC result from logistic model 2. The curve of the model is quite good, because the last increase of y is almost close to 1.0, which is around 0.8.

Check Accuracy of validation set

res <- ifelse(predictLog > 0.5, 1, 0)
misclassError <- mean(res != valSet$stroke)
print(paste("Accuracy Testing Set: ", 1-misclassError))
## [1] "Accuracy Testing Set:  0.780607247796278"

EXPLANATION

If the accuracy of the model is more than 70%, then the model is good. This model has a good accuracy, which is about 78%.

SUMMARY EXPLANATION

The AUC and model accuracy figures show that this model is quite good and the difference between the AUC and the accuracy model is not too far away, which is only about 0.05. So, this model is good.

Check whether the model overfitted or underfitted

Compare the auc of training set with testing set

ROC_T <- predict(logisticModel2, newdata = subset(trainSet), type = "response")
ROC_V <- predict(logisticModel2, newdata = subset(valSet), type = "response")

ROCobjT <- roc(trainSet$stroke, ROC_T)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ROCobjV <- roc(valSet$stroke, ROC_V)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
par(mfrow = c(1,2))
plot(ROCobjT, print.auc = TRUE, 
    auc.polygon = TRUE, 
    grid = c(0.1, 0.2), 
    main = "Plot ROC Training Set")
plot(ROCobjV, print.auc = TRUE,
    auc.polygon = TRUE, 
    grid = c(0.1, 0.2), 
    main = "Plot ROC Validation Set")

EXPLANATION

This model is neither overfit nor underfit because the AUC training set and validation set values are almost equal with a very small difference, which is 0.02. So this model is included in the good-fit.

SUMMARY

So the model that will be used is logistic model 2 which has 4 predictors, namely `age`, `avg_glucose_level`, `hypertension`, and `work_type`. This model is included in a good-fit and is good enough to predict stroke with an accuracy of about 78%.

2. DECISION TREE MODEL

DesTModel <- rpart(as.factor(trainSet$stroke) ~ age + avg_glucose_level + work_type + hypertension, data = trainSet, method = "class", cp = 0.01)
DesTModel
## n= 3088 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 3088 1480 0 (0.52072539 0.47927461)  
##    2) age< 56.5 1350  221 0 (0.83629630 0.16370370)  
##      4) age< 44.5 862   48 0 (0.94431555 0.05568445) *
##      5) age>=44.5 488  173 0 (0.64549180 0.35450820)  
##       10) avg_glucose_level< 160.27 380  100 0 (0.73684211 0.26315789) *
##       11) avg_glucose_level>=160.27 108   35 1 (0.32407407 0.67592593) *
##    3) age>=56.5 1738  479 1 (0.27560414 0.72439586) *
rpart.plot(DesTModel, extra = 110)

Confusion Matrix

DesTprediction <- predict(DesTModel, newdata = valSet, type = "class")

confusion_mat <- confusionMatrix(DesTprediction, as.factor(valSet$stroke))
confusion_mat
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 687   9
##          1 288  37
##                                           
##                Accuracy : 0.7091          
##                  95% CI : (0.6802, 0.7368)
##     No Information Rate : 0.9549          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1309          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7046          
##             Specificity : 0.8043          
##          Pos Pred Value : 0.9871          
##          Neg Pred Value : 0.1138          
##              Prevalence : 0.9549          
##          Detection Rate : 0.6729          
##    Detection Prevalence : 0.6817          
##       Balanced Accuracy : 0.7545          
##                                           
##        'Positive' Class : 0               
## 

EXPLANATION

Overall, the accuracy is quite good, which is around 70%.

1) True Positive : 687 (When the model correctly predicts positive class)
2) False Positive : 9 (When the model incorrectly predicts positive class)
3) False Negative  : 288 (When the model incorrectly predicts negative class)
4) True Negative : 37 (When the model correctly predicts negative class)

From the output above, we can see that :
The True Positive Rate (Sensitivity) : 70,46 %
The True Negative Rate (Specificity) : 80,43 %
The False Positive Rate : 1 - Specificity = 19,57 %
The False Negative Rate : 1 - Sensitivity = 29,54 %

Therefore, we can conclude that this model is good because it can predict most of the data correctly, as evidenced by the high true positive rate and true negative rate.