2540134874_FinalExam

Video Link: ‘https://binusianorg-my.sharepoint.com/personal/gabrielle_ariyanto_binus_ac_id/_layouts/15/guestaccess.aspx?docid=097f6b5da5ed845ff96e92cee02cc09f1&authkey=AZ2toRZuiAQP3KtqDGHWeHA&e=5d3LKx’

1. a. Perform Exploratory Data Analysis on the dataset and explain your findings.

#Load packages
library(ROCR)

## Warning: package 'ROCR' was built under R version 4.1.3

library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 4.1.3

library(ggplot2)
library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.1.3

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.1.3

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ROCit)

## Warning: package 'ROCit' was built under R version 4.1.3

library(caret)

## Warning: package 'caret' was built under R version 4.1.3

## 
## Attaching package: 'caret'

## The following object is masked from 'package:survival':
## 
##     cluster

#load data
strokeData <- read.csv("./StrokeData.csv")
head(strokeData)

##      id gender age hypertension heart_disease ever_married     work_type
## 1  9046   Male  67            0             1          Yes       Private
## 2 51676 Female  61            0             0          Yes Self-employed
## 3 31112   Male  80            0             1          Yes       Private
## 4 60182 Female  49            0             0          Yes       Private
## 5  1665 Female  79            1             0          Yes Self-employed
## 6 56669   Male  81            0             0          Yes       Private
##   Residence_type avg_glucose_level  bmi  smoking_status stroke
## 1          Urban            228.69 36.6 formerly smoked      1
## 2          Rural            202.21  N/A    never smoked      1
## 3          Rural            105.92 32.5    never smoked      1
## 4          Urban            171.23 34.4          smokes      1
## 5          Rural            174.12   24    never smoked      1
## 6          Urban            186.21   29 formerly smoked      1

See the dimension of the data

dim(strokeData)

## [1] 5110   12

The dataset has 5110 instances and 12 attributes.

str(strokeData)

## 'data.frame':    5110 obs. of  12 variables:
##  $ id               : int  9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
##  $ gender           : chr  "Male" "Female" "Male" "Female" ...
##  $ age              : num  67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ heart_disease    : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ ever_married     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ work_type        : chr  "Private" "Self-employed" "Private" "Private" ...
##  $ Residence_type   : chr  "Urban" "Rural" "Rural" "Urban" ...
##  $ avg_glucose_level: num  229 202 106 171 174 ...
##  $ bmi              : chr  "36.6" "N/A" "32.5" "34.4" ...
##  $ smoking_status   : chr  "formerly smoked" "never smoked" "never smoked" "smokes" ...
##  $ stroke           : int  1 1 1 1 1 1 1 1 1 1 ...

Each attribute is further described below:

id : unique identifier -> Data type: Integer
gender : “Male”, “Female” or “Other” -> Data type: Character
age : age of the patient -> Data type: Numeric
hypertension : 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension -> Data type: Integer
heart_disease : 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease -> Data type: Integer
ever_married : “No” or “Yes” -> Data type: Character
work_type : “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed” -> Data type: Character
Residence_type : “Rural” or “Urban” -> Data type: Character
avg_glucose_level: average glucose level in blood -> Data type: Numeric
bmi : body mass index -> Data type: character
smoking_status : “formerly smoked”, “never smoked”, “smokes” or “Unknown”* -> Data type: character
stroke : 1 if the patient had a stroke or 0 if not -> Data type: integer

*Note: “Unknown” in smoking_status means that the information is unavailable for this patient

summary(strokeData)

##        id           gender               age         hypertension    
##  Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000  
##  1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Mode  :character   Median :45.00   Median :0.00000  
##  Mean   :36518                      Mean   :43.23   Mean   :0.09746  
##  3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                      Max.   :82.00   Max.   :1.00000  
##  heart_disease     ever_married        work_type         Residence_type    
##  Min.   :0.00000   Length:5110        Length:5110        Length:5110       
##  1st Qu.:0.00000   Class :character   Class :character   Class :character  
##  Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.05401                                                           
##  3rd Qu.:0.00000                                                           
##  Max.   :1.00000                                                           
##  avg_glucose_level     bmi            smoking_status         stroke       
##  Min.   : 55.12    Length:5110        Length:5110        Min.   :0.00000  
##  1st Qu.: 77.25    Class :character   Class :character   1st Qu.:0.00000  
##  Median : 91.89    Mode  :character   Mode  :character   Median :0.00000  
##  Mean   :106.15                                          Mean   :0.04873  
##  3rd Qu.:114.09                                          3rd Qu.:0.00000  
##  Max.   :271.74                                          Max.   :1.00000

According to the output above:

1.The stroke mean is 0.04, implying that just 4% of persons have a stroke.

2.The hypertension mean is 0.097, implying that there are 9.7% of persons have a hypertension.

3.The heart_disease mean is 0.054, implying that there are 5.4% of persons have a heart_disease.

4.The oldest age in this data is 82, while the variable’s average age is 43.2.

5.The average glucose level in this dataset ranges from 55.12 to 271.74. In this data, the average of avg_glucose_level variable is 106.15.

6.Some character data type columns must be transformed to factors or integers.

Then I see the unique value for each attribute here.

sapply(strokeData, function(x)length(unique(x)))

##                id            gender               age      hypertension 
##              5110                 3               104                 2 
##     heart_disease      ever_married         work_type    Residence_type 
##                 2                 2                 5                 2 
## avg_glucose_level               bmi    smoking_status            stroke 
##              3979               419                 4                 2

We need to transform the data type of variables on the attributes of gender, hypertension,heart_disease, ever_married, work_type, Residence_type, smoking_status, and stroke to be a factor data type based on the output above since these attributes are categorical variables.

So I changed the data type of those attribute here, as well as bmi’s data type from character to numeric, because bmi is supposed to hold numbers rather than characters.

#change the stroke variable data type from integer to factor and assign a value of 1 to Yes and a value of 0 to No
strokeData$stroke <- as.factor(strokeData$stroke)
levels(strokeData$stroke)[levels(strokeData$stroke)== 1] <- "Yes"
levels(strokeData$stroke)[levels(strokeData$stroke)== 0] <- "No"

#change the gender variable data type from character to factor
strokeData$gender <- as.factor(strokeData$gender)

#change the hypertension variable data type from integer to factor and assign a value of 1 to Yes and a value of 0 to No
strokeData$hypertension <- as.factor(strokeData$hypertension)
levels(strokeData$hypertension)[levels(strokeData$hypertension)== 1] <- "Yes"
levels(strokeData$hypertension)[levels(strokeData$hypertension)== 0] <- "No"

#change the heart_disease variable data type from integer to factor and assign a value of 1 to Yes and a value of 0 to No
strokeData$heart_disease <- as.factor(strokeData$heart_disease)
levels(strokeData$heart_disease)[levels(strokeData$heart_disease)== 1] <- "Yes"
levels(strokeData$heart_disease)[levels(strokeData$heart_disease)== 0] <- "No"

#change the ever_married variable data type from character to factor
strokeData$ever_married <- as.factor(strokeData$ever_married)

#change the work_type variable data type from character to factor
strokeData$work_type <- as.factor(strokeData$work_type)

#change the Residence_type variable data type from character to factor
strokeData$Residence_type <- as.factor(strokeData$Residence_type)

#change the smoking_status variable data type from character to factor
strokeData$smoking_status <- as.factor(strokeData$smoking_status)

#change the bmi variable data type from character to numeric
strokeData$bmi <- as.numeric(strokeData$bmi)

## Warning: NAs introduced by coercion

Next I check to see whether the data type of the variable i aim to change has changed.

#check whether the variables data type has changed or not
sapply(strokeData, class)

##                id            gender               age      hypertension 
##         "integer"          "factor"         "numeric"          "factor" 
##     heart_disease      ever_married         work_type    Residence_type 
##          "factor"          "factor"          "factor"          "factor" 
## avg_glucose_level               bmi    smoking_status            stroke 
##         "numeric"         "numeric"          "factor"          "factor"

The data type of the variable we intend to modify has already changed, as we can see above.

Then I proceed to examine the description for each variable in the dataset.

describe(strokeData)

## strokeData 
## 
##  12  Variables      5110  Observations
## --------------------------------------------------------------------------------
## id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5110        0     5110        1    36518    24436     3590     6972 
##      .25      .50      .75      .90      .95 
##    17741    36932    54682    65668    69218 
## 
## lowest :    67    77    84    91    99, highest: 72911 72914 72915 72918 72940
## --------------------------------------------------------------------------------
## gender 
##        n  missing distinct 
##     5110        0        3 
##                                
## Value      Female   Male  Other
## Frequency    2994   2115      1
## Proportion  0.586  0.414  0.000
## --------------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5110        0      104        1    43.23    26.03        5       11 
##      .25      .50      .75      .90      .95 
##       25       45       61       75       79 
## 
## lowest :  0.08  0.16  0.24  0.32  0.40, highest: 78.00 79.00 80.00 81.00 82.00
## --------------------------------------------------------------------------------
## hypertension 
##        n  missing distinct 
##     5110        0        2 
##                       
## Value         No   Yes
## Frequency   4612   498
## Proportion 0.903 0.097
## --------------------------------------------------------------------------------
## heart_disease 
##        n  missing distinct 
##     5110        0        2 
##                       
## Value         No   Yes
## Frequency   4834   276
## Proportion 0.946 0.054
## --------------------------------------------------------------------------------
## ever_married 
##        n  missing distinct 
##     5110        0        2 
##                       
## Value         No   Yes
## Frequency   1757  3353
## Proportion 0.344 0.656
## --------------------------------------------------------------------------------
## work_type 
##        n  missing distinct 
##     5110        0        5 
## 
## lowest : children      Govt_job      Never_worked  Private       Self-employed
## highest: children      Govt_job      Never_worked  Private       Self-employed
##                                                                   
## Value           children      Govt_job  Never_worked       Private
## Frequency            687           657            22          2925
## Proportion         0.134         0.129         0.004         0.572
##                         
## Value      Self-employed
## Frequency            819
## Proportion         0.160
## --------------------------------------------------------------------------------
## Residence_type 
##        n  missing distinct 
##     5110        0        2 
##                       
## Value      Rural Urban
## Frequency   2514  2596
## Proportion 0.492 0.508
## --------------------------------------------------------------------------------
## avg_glucose_level 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5110        0     3979        1    106.1    45.38    60.71    65.79 
##      .25      .50      .75      .90      .95 
##    77.24    91.88   114.09   192.18   216.29 
## 
## lowest :  55.12  55.22  55.23  55.25  55.26, highest: 266.59 267.60 267.61 267.76 271.74
## --------------------------------------------------------------------------------
## bmi 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     4909      201      418        1    28.89    8.538    17.64    19.70 
##      .25      .50      .75      .90      .95 
##    23.50    28.10    33.10    38.90    42.96 
## 
## lowest : 10.3 11.3 11.5 12.0 12.3, highest: 66.8 71.9 78.0 92.0 97.6
## --------------------------------------------------------------------------------
## smoking_status 
##        n  missing distinct 
##     5110        0        4 
##                                                                           
## Value      formerly smoked    never smoked          smokes         Unknown
## Frequency              885            1892             789            1544
## Proportion           0.173           0.370           0.154           0.302
## --------------------------------------------------------------------------------
## stroke 
##        n  missing distinct 
##     5110        0        2 
##                       
## Value         No   Yes
## Frequency   4861   249
## Proportion 0.951 0.049
## --------------------------------------------------------------------------------

According to the output above:

The variable ‘gender’ has three levels: Female, Male, and Other. There are 2994 female genders, 2115 male genders, and 1 other gender in this data set.
The variable ‘hypertension’ has two levels: Yes and No. There are 4612 individuals in this dataset who do not have hypertension and 498 people who do.
The variable ‘heart_disease’ has two levels: Yes and No. There are 4834 individuals in this dataset who do not have heart disease and 276 people who do.
The variable ‘ever_married’ has two levels: Yes and No. This dataset contains 1757 people who have never married and 3353 persons who have.
The variable ‘work_type’ has five levels: children, Govt_job, Never_worked, Private, and Self-employed. There are 687 people in this dataset who are still children, 657 people who work in the government sector, 22 people who have never worked, 2925 people who work privately, and 819 people who work for themselves (self employed).
The variable ‘Residence_type’ has two levels: Rural and Urban. There are 2514 persons in this dataset who are residents of the rural type and 2596 people who are residents of the urban type.
The average glucose level in this dataset ranges from 55.12 to 271.74. In this data, the average of avg_glucose_level variable is 106.15.
The bmi attribute in this dataset ranges from 10.30 to 97.60. In this data, the average of bmi variable is 28.89.
The variable ‘smoking_status’ has four levels: formely smoked, never smoked, smokes, and Unknown. This dataset contains 885 people who formerly smoked, 1892 people who never smoked, 789 people who now smoke, and 1544 people whose smoking status is unknown.
The dataset contains 249 persons who have had a stroke and 4861 people who have not had a stroke.

Exploring Categorical Variables

plot(strokeData$stroke, col = "darkgoldenrod3",main = "Stroke Distribution")

According to the output above, there are more individuals who do not suffer from stroke than those who do.

ggplot(strokeData, aes(x = gender, fill = stroke)) + geom_bar(position = "dodge")+ scale_fill_manual(values=c("#FFCC66", "#CC0000")) + ggtitle("Stroke Distribution by Gender")

There are more female than male in the data set. More female suffer from stroke than male.

ggplot(strokeData, aes(x = Residence_type, fill = stroke)) + geom_bar(position = "dodge")+ scale_fill_manual(values=c("#FFCC99", "#FF9933")) + ggtitle("Stroke Distribution by Residence Type")

From the output above, there is no much difference in each percentage having a stroke or not based on the type of residence

ggplot(strokeData, aes(x = ever_married, fill = stroke)) + geom_bar(position = "dodge")+ scale_fill_manual(values=c("#CCCCFF", "#CC99FF")) + ggtitle("Stroke Distribution by Married Status")

In the data set, there are more ever-married persons than never-married people. People who have been married are more likely to have a stroke than those who have not.

ggplot(strokeData, aes(x = hypertension, fill = stroke)) + geom_bar(position = "dodge")+ scale_fill_manual(values=c("#66CCCC", "#336666")) + ggtitle("Stroke Distribution by Hypertension")

According to the plot above, patients with hypertension suffer fewer strokes than those who do not.

ggplot(strokeData, aes(x = heart_disease, fill = stroke)) + geom_bar(position = "dodge")+ scale_fill_manual(values=c("#FF9999", "#990033")) + ggtitle("Stroke Distribution by Heart Disease")

Patients with heart disease suffered fewer strokes than those who did not.

mosaicplot(strokeData$stroke~strokeData$work_type,las=1,off=10,xlab="Stroke",ylab="Work Type",main="Stroke VS Work_Type",color=colors()[145:150])

The majority of patients with stroke work in the government sector, private, or are self-employed.

mosaicplot(strokeData$stroke~strokeData$smoking_status,las=1,off=5,xlab="Stroke",ylab="Smoking Status",main="Stroke VS Smoking_status",color=colors()[20:30])

Patients who formely smoked or who have never smoked had higher stroke occurrences than active smokers. However, we must keep in mind that a significant chunk of the data, represented by the unknown category, lacks a clear record of the patient’s smoking status.

Exploring Numerical Variables

ggplot(strokeData, aes(x = age , fill = stroke)) + geom_histogram(bindwidth = 30) + scale_fill_manual(values=c("#9ebcda", "#8856a7"))  + ggtitle("Stroke Distribution by Age")

## Warning: Ignoring unknown parameters: bindwidth

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

According to the plot above, the older a person is, the more probable that person is to suffer a stroke. When a person is around 50 or older, the chance of suffering a stroke increases considerably.

ggplot(strokeData, aes(x = avg_glucose_level , fill = stroke)) + geom_histogram(bindwidth = 30) + scale_fill_manual(values=c("#feb24c", "#f03b20")) + ggtitle("Stroke Distribution by Average Glucose Level")

## Warning: Ignoring unknown parameters: bindwidth

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The plot above shows that there is no apparent relationship between avg glucose level and stroke risk. However, as we all know, high glucose levels of more than 180-200mg/dL can cause diabetes, which can lead to a stroke. A diabetic individual is 1.5 times more likely to have a stroke and this might be a very plausible explanation for the above plot.

ggplot(strokeData, aes(x = bmi , fill = stroke)) + geom_histogram(bindwidth = 30) + scale_fill_manual(values=c("#c994c7", "#dd1c77")) + ggtitle("Stroke Distribution by Body Mass Index")

## Warning: Ignoring unknown parameters: bindwidth

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 201 rows containing non-finite values (stat_bin).

According to the results above, people with a BMI of 20 to 40 are at risk of stroke. As a result, I can conclude that having a higher BMI does not raise the chance of having a stroke.

Checking Data Anomalies

AgeBoxplot <- boxplot(strokeData$age,main = "Age Boxplot")

paste(length(AgeBoxplot$out))

## [1] "0"

There are no outliers detected by boxplot in the age variable.

glucoseBoxplot <- boxplot(strokeData$avg_glucose_level, main = 'Avg Glucose Level Boxplot')

paste(length(glucoseBoxplot$out))

## [1] "627"

There are 627 data points detected as outliers by boxplot in the avg_glucose_level variable.

bmiBoxplot <- boxplot(strokeData$bmi, main = 'bmi Boxplot')

paste(length(bmiBoxplot$out))

## [1] "110"

There are 110 data points detected as outliers by boxplot in the bmi variable.

Hereby, I also check the normality of all variables

hist.data.frame(strokeData)

From the output above, only bmi variables is roughly bell-shaped

Next, I check to see whether there is any missing data in this dataset.

#check missing value
sapply(strokeData, function(x)sum(is.na(x)))

##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0               201                 0                 0

The bmi variable has 201 missing values, which was discovered above.

Explaination of my findings:

Utilizing eda, I may infer that my findings are:

This data is stroke data which contains 5110 instances and 12 attributes. Those attributes are id, gender, age,hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level,bmi, smoking_status, and ‘stroke’.
To predict what factors influence a person’s stroke, I will utilize the stroke variable as the dependent variable.
There are more female than male in the data set.
Stroke are becoming more common among female than male
A person’s type of residence has no bearing on whether or not they have a stroke.
People who have been married are more likely than those who have not to suffer a stroke.
People with hypertension suffer fewer strokes than those who do not.
People with heart disease suffer fewer strokes than those who do not.
The majority of patients with stroke work in the government sector, private, or are self-employed.
People who have never smoked or who formerly smoked have a greater stroke risk than active smokers. However, we must keep in mind that the majority of the data, which are represented by unknown categories, do not contain a clear record of the patients’ smoking status. As a result, before we begin modeling, we must process the categories in the smoking status variable.
A person’s risk of having a stroke increases with age.
There is no clear relation between average glucose levels and stroke risk. High glucose levels of more than 180-200mg/dL, on the other hand, can produce diabetes, which can lead to a stroke. Diabetics are 1.5 times more likely to suffer a stroke.
Stroke is more likely happen to people with a BMI of 20 to 40.
We need to handle the 201 missing values in the bmi variable before we begin modeling.

As we can see from Eda, there are attributes that we need to fix. The handling will be done in the data preparation section below.

1. b. Perform Data Preparation and give detail explanation. Based on the results of data preparation, what are the most important features on your dataset? Why?

The first thing I do is removing the ‘id’ column because it contains as many unique values as there are data points and its useless for predicting whether someone would have a stroke or not.

I handle missing values and unusual values in two ways at this data preparation stage: the first is to delete the missing and unusual value, and the second is to replace the missing and unusual value.I do this in these two ways because I don’t know what the best method to handle missing values at this point in order to generate a decent model later, because we haven’t constructed the model yet.

Missing and unusual values to be handled are ‘other’ levels in gender attributes, ‘unknown’ levels in smoking status attributes, and missing values in bmi attributes. I handled ‘other’ levels in gender attribute since there is only 1 row of Other in column gender. I handled the level of ‘unknown’ in smoking status because “unknown” in smoking status indicates that the information is not available to the patient and the number of unknowns themselves are quite large. And I handle the bmi attribute’s missing value such that it is not biased

FIRST WAY TO HANDLE MISSING AND UNUSUAL VALUE: In the first method here, I remove ‘other’ levels in gender attributes, ‘unknown’ levels in smoking status attributes, and any rows having missing values in bmi attributes.

#Put the original data into a new variable that will be altered.
UpdateStrokeDataDropNA = strokeData

#Drop 'id' columns as i said before that id columns is useless
UpdateStrokeDataDropNA <- subset(UpdateStrokeDataDropNA, select = c("gender","age","hypertension","heart_disease","ever_married","work_type","Residence_type","avg_glucose_level","bmi","smoking_status","stroke"))

#check missing value
sapply(UpdateStrokeDataDropNA, function(x)sum(is.na(x)))

##            gender               age      hypertension     heart_disease 
##                 0                 0                 0                 0 
##      ever_married         work_type    Residence_type avg_glucose_level 
##                 0                 0                 0                 0 
##               bmi    smoking_status            stroke 
##               201                 0                 0

#Drop 'Unknown' levels at smoking_status attribute
UpdateStrokeDataDropNA <- UpdateStrokeDataDropNA %>% filter(smoking_status!='Unknown')%>%droplevels()
#Drop 'Other' levels at gender attribute
UpdateStrokeDataDropNA <- UpdateStrokeDataDropNA %>% filter(gender!='Other')%>%droplevels()
#check the missing value again
sapply(UpdateStrokeDataDropNA, function(x)sum(is.na(x)))

##            gender               age      hypertension     heart_disease 
##                 0                 0                 0                 0 
##      ever_married         work_type    Residence_type avg_glucose_level 
##                 0                 0                 0                 0 
##               bmi    smoking_status            stroke 
##               140                 0                 0

#Drop all data rows containing missing value
UpdateStrokeDataDroppedNA = UpdateStrokeDataDropNA
UpdateStrokeDataDroppedNA = na.omit(UpdateStrokeDataDroppedNA)
#check whether missing value has successfully dropped or not
sapply(UpdateStrokeDataDroppedNA, function(x)sum(is.na(x)))

##            gender               age      hypertension     heart_disease 
##                 0                 0                 0                 0 
##      ever_married         work_type    Residence_type avg_glucose_level 
##                 0                 0                 0                 0 
##               bmi    smoking_status            stroke 
##                 0                 0                 0

After handling missing values and unusual values by the first method, which is dropping them out, to see which features are most important to use to predict a person’s stroke or not, I look at the correlation matrix here.

UpdateStrokeDataDroppedNACorrelation = UpdateStrokeDataDroppedNA
UpdateStrokeDataDroppedNACorrelation$gender = as.numeric(UpdateStrokeDataDroppedNACorrelation$gender)
UpdateStrokeDataDroppedNACorrelation$hypertension = as.numeric(UpdateStrokeDataDroppedNACorrelation$hypertension)
UpdateStrokeDataDroppedNACorrelation$heart_disease = as.numeric(UpdateStrokeDataDroppedNACorrelation$heart_disease)
UpdateStrokeDataDroppedNACorrelation$work_type = as.numeric(UpdateStrokeDataDroppedNACorrelation$work_type)
UpdateStrokeDataDroppedNACorrelation$ever_married = as.numeric(UpdateStrokeDataDroppedNACorrelation$ever_married)
UpdateStrokeDataDroppedNACorrelation$Residence_type = as.numeric(UpdateStrokeDataDroppedNACorrelation$Residence_type)
UpdateStrokeDataDroppedNACorrelation$smoking_status = as.numeric(UpdateStrokeDataDroppedNACorrelation$smoking_status)
UpdateStrokeDataDroppedNACorrelation$stroke = as.numeric(UpdateStrokeDataDroppedNACorrelation$stroke)
UpdateStrokeDataDroppedNACorrelation$work_type = as.numeric(UpdateStrokeDataDroppedNACorrelation$work_type)

x <- (UpdateStrokeDataDroppedNACorrelation)
UpdateStrokeDataDroppedNACorrelation.quant = x 
UpdateStrokeDataDroppedNACorrelation.cor = round(cor(UpdateStrokeDataDroppedNACorrelation.quant),2)
ggplot(data = reshape2::melt(UpdateStrokeDataDroppedNACorrelation.cor),aes(x=Var1, y=Var2, fill=value)) + geom_tile() +  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Pearson Corr") + geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) + theme(axis.text.x = element_text(angle = 30))

From the output above, age, hypertension, heart disease, and average glucose level are the most important features for predicting stroke because they have the highest correlation to the dependent variable.

SECOND WAY TO HANDLE MISSING AND UNUSUAL VALUE: Next, I used the second method to deal with missing values and unusual values, namely by replacing the missing values and unusual values. I removed the other levels on the gender attribute, replaced the unknown levels in the smoking status attribute with the most frequent category, which is ‘never smoked,’ and replaced the missing value in the bmi’s attribute with its mean.

#Put the original data into a new variable that will be altered.
UpdateStrokeDataReplaceNA = strokeData

#Drop 'id' columns as i said before that id columns is useless
UpdateStrokeDataReplaceNA <- subset(UpdateStrokeDataReplaceNA, select = c("gender","age","hypertension","heart_disease","ever_married","work_type","Residence_type","avg_glucose_level","bmi","smoking_status","stroke"))

#replace missing value in bmi attribute with its mean
UpdateStrokeDataReplaceNA$bmi[is.na(UpdateStrokeDataReplaceNA$bmi)] <- mean(UpdateStrokeDataReplaceNA$bmi, na.rm = T)

#replace the unknown with the most frequent category, which is ‘never smoked’
UpdateStrokeDataReplaceNA <- UpdateStrokeDataReplaceNA %>% mutate(smoking_status = replace(smoking_status, smoking_status == "Unknown", "never smoked"))

#buang levels 'Other' pada atribut gender
UpdateStrokeDataReplaceNA <- UpdateStrokeDataReplaceNA %>% filter(gender!='Other')%>%droplevels()

#check whether missing and unsual value has successfully replaced or not
sapply(UpdateStrokeDataReplaceNA, function(x)sum(is.na(x)))

##            gender               age      hypertension     heart_disease 
##                 0                 0                 0                 0 
##      ever_married         work_type    Residence_type avg_glucose_level 
##                 0                 0                 0                 0 
##               bmi    smoking_status            stroke 
##                 0                 0                 0

After handling missing values using the replace approach, I look at the correlation matrix to evaluate which features are most important to utilize to predict a person’s stroke or not.

UpdateStrokeDataReplaceNACorrelation = UpdateStrokeDataReplaceNA
UpdateStrokeDataReplaceNACorrelation$gender = as.numeric(UpdateStrokeDataReplaceNACorrelation$gender)
UpdateStrokeDataReplaceNACorrelation$hypertension = as.numeric(UpdateStrokeDataReplaceNACorrelation$hypertension)
UpdateStrokeDataReplaceNACorrelation$heart_disease = as.numeric(UpdateStrokeDataReplaceNACorrelation$heart_disease)
UpdateStrokeDataReplaceNACorrelation$work_type = as.numeric(UpdateStrokeDataReplaceNACorrelation$work_type)
UpdateStrokeDataReplaceNACorrelation$ever_married = as.numeric(UpdateStrokeDataReplaceNACorrelation$ever_married)
UpdateStrokeDataReplaceNACorrelation$Residence_type = as.numeric(UpdateStrokeDataReplaceNACorrelation$Residence_type)
UpdateStrokeDataReplaceNACorrelation$smoking_status = as.numeric(UpdateStrokeDataReplaceNACorrelation$smoking_status)
UpdateStrokeDataReplaceNACorrelation$stroke = as.numeric(UpdateStrokeDataReplaceNACorrelation$stroke)
UpdateStrokeDataReplaceNACorrelation$work_type = as.numeric(UpdateStrokeDataReplaceNACorrelation$work_type)

x <- (UpdateStrokeDataReplaceNACorrelation)
UpdateStrokeDataReplaceNACorrelation.quant = x
UpdateStrokeDataReplaceNACorrelation.cor = round(cor(UpdateStrokeDataReplaceNACorrelation.quant),2)
ggplot(data = reshape2::melt(UpdateStrokeDataReplaceNACorrelation.cor),aes(x=Var1, y=Var2, fill=value)) + geom_tile() +  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Pearson Corr") + geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) + theme(axis.text.x = element_text(angle = 30))

Same as before, from the correlation matrix above, it can be seen that age, hypertension, heart disease, and average glucose levels were the most significant factors in predicting stroke because they have the highest correlation to the dependent variable.

After dealing with missing values and unusual values, I split the data to build the model. I split the data into 70% of training data and 30% of testing data.

#set random set.seed
set.seed(589)

#for data in which the missing value has been removed, create training data set 70% for building model and testing set 30% for model evaluation 
trainingIndex = createDataPartition(UpdateStrokeDataDroppedNA$stroke,p=0.7,list=FALSE)
testing_set = UpdateStrokeDataDroppedNA[-trainingIndex, ]
training_set = UpdateStrokeDataDroppedNA[trainingIndex, ]
cat("Testing Set Dimention data  : ",dim(testing_set),"\n")

## Testing Set Dimention data  :  1027 11

cat("Training Set Dimention data : ",dim(training_set))

## Training Set Dimention data :  2398 11

#for data in which the missing value has been replaced, create training data set 70% for building model and testing set 30% for model evaluation 
trainingIndexReplacedNA = createDataPartition(UpdateStrokeDataReplaceNA$stroke,p=0.7,list=FALSE)

testing_set_ReplacedNA = UpdateStrokeDataReplaceNA[-trainingIndexReplacedNA, ]
training_set_ReplacedNA = UpdateStrokeDataReplaceNA[trainingIndexReplacedNA, ]
cat("Testing Set Dimention data  : ",dim(testing_set_ReplacedNA),"\n")

## Testing Set Dimention data  :  1532 11

cat("Training Set Dimention data : ",dim(training_set_ReplacedNA))

## Training Set Dimention data :  3577 11

The most important features in my dataset:

Age, hypertension, heart disease, and average glucose level are the most important features, because they have a high correlation to the stroke variable when compared to other features.

The features that I will use to predict whether someone has a stroke or not are the attributes of gender, age, hypertension, heart_disease, ever_married, work_type, residence_type, avg_glucose_level, bmi, and smoking_status.

1. c. Build the ‘best possible’ predictive model using Logistic Regression and explain your predictive model.

#build function to get the accuracy of each model based on the threshold
getAccuracy <- function(predictionProbability, threshold, expected) {
  predictedVal = ifelse(predictionProbability > threshold,'Yes',"No")
  misclassificationError <- mean(predictedVal != expected)
  accuracy =  1-misclassificationError
  
  return (accuracy)
}

#build modellingFunction for modelling -> it contains a summary of the model itself, the auc value in the testing and training set, the accuracy in the training and testing set when the threshold is 0.5 (default threshold), the confusion matrix when the threshold is 0.5 (default threshold) and when the threshold is calculated using gmean, and the model accuracy when using gmean as the threshold
modellingFunction <- function(name,formula,predictors,training_set, testing_set,color){
  model = glm(formula,family = binomial(link = "logit"), data = training_set)
  print(name)
  print(summary(model))
  logisticPrediction <- predict(model,newdata = subset(testing_set, select = predictors), type ="response")
  rocrEvaluation <- prediction(logisticPrediction, testing_set$stroke)
  logisticPredictionTraining <- predict(model,newdata = subset(training_set, select = predictors), type ="response")
  rocrEvaluationTraining <- prediction(logisticPredictionTraining, training_set$stroke)
  
  aucValue <- performance(rocrEvaluation, measure = "auc")
  aucValue <- aucValue@y.values[[1]]
  aucValueTraining <- performance(rocrEvaluationTraining, measure = "auc")
  aucValueTraining <- aucValueTraining@y.values[[1]]
  
  accuracy = getAccuracy(logisticPrediction, 0.5, testing_set$stroke)
  accuracyTraining =  accuracy = getAccuracy(logisticPredictionTraining, 0.5, training_set$stroke)
  
  
  roc_empirical <- rocit(score = logisticPrediction, class = testing_set$stroke) 
  
  predictedDefault = ifelse(logisticPrediction > 0.5,'Yes',"No")
  cmDefault = confusionMatrix(data=as.factor(predictedDefault), reference=testing_set$stroke, positive = "Yes")
  
  
  gmean = sqrt(roc_empirical$TPR * (1-roc_empirical$FPR))
  optimalThreshold = roc_empirical$Cutoff[which.max(gmean)]
  predictedGmean = ifelse(logisticPrediction > optimalThreshold,'Yes',"No")
  cmGmean = confusionMatrix(data=as.factor(predictedGmean), reference=testing_set$stroke, positive = "Yes")
  
  accuracyGmean =  getAccuracy(logisticPrediction, optimalThreshold, testing_set$stroke)
  
  cat(sprintf("CONFUSION MATRIX WITH THRESHOLD %.1f\n",0.5))
  cat("-------------------------------------------\n")
  print(cmDefault)
  cat("===========================================\n")
  cat(sprintf("CONFUSION MATRIX WITH THRESHOLD %f\n",optimalThreshold))
  cat("-------------------------------------------\n")
  print(cmGmean)
  cat("===========================================\n")
  cat(sprintf("%s AUC Testing: %f\n", name,aucValue))
  cat(sprintf("%s AUC Training: %f\n", name,aucValueTraining))
  cat(sprintf("%s Accuracy Testing with threshold %f: %f\n", name,optimalThreshold,accuracyGmean))
  cat(sprintf("%s Accuracy Testing with threshold %.1f: %f\n", name,0.5,accuracy))
  cat(sprintf("%s Accuracy Training with threshold %.1f: %f\n", name,0.5,accuracyTraining))
  
  
  return (list(
    name=name,
    model=model,
    roc=rocrEvaluation,
    predVal=logisticPrediction,
    auc=aucValue,
    sensitivityDefault=cmDefault$byClass[2],
    specificityDefault=cmDefault$byClass[1],
    sensitivityGmean=cmGmean$byClass[2],
    specificityGmean=cmGmean$byClass[1],
    color=color
  ))
}

#build function to compare auc, specitifity and sensitivity of each model
summarizeModelMetric <- function(models) {
  modelsSummary <- data.frame(matrix(ncol = 3, nrow = 0))
  colnames(modelsSummary) <-c("AUC", "Specitifity", "Sensitivity")
  
  modelsSummaryRowNames = list()
  idx = 0
  for(model in models) {
    modelsSummary[nrow(modelsSummary) + 1,] <- c(
      AUC=model$auc,Sensitivity=model$sensitivityGmean,Specificity=model$specificityGmean
    )
    modelsSummaryRowNames = append(modelsSummaryRowNames,model$name)
  }
  
  rownames(modelsSummary)<-modelsSummaryRowNames
  
  for(model in models) {
    if (idx == 0) {
      plot(performance(model$roc, measure = "tpr", x.measure = "fpr"),col=model$color)
    } else {
      plot(performance(model$roc, measure = "tpr", x.measure = "fpr"),col=model$color,add=TRUE)
    }
    idx = idx + 1
  }
  
  return (modelsSummary)
}

Because the data we have is not evenly distributed between the number of strokes and those who do not have a stroke (imbalanced data), determining whether or not a model is good cannot be calculated using accuracy alone, and other metrics are required to evaluate the model’s performance. So, when creating the model, I compared the confusion matrix of each model when the default threshold (0.5) was used with the confusion matrix of each model when gmean was used as the threshold.

I do this because the default threshold of 0.5 may not be optimal in anticipating probabilities, especially if the data is not balanced, therefore we need to optimize the threshold by replacing it with the gmean value. G-Mean is an imbalanced classification measure that, when optimized, seeks a balance between sensitivity and specificity. The formula for G-Mean is sqrt(Sensitivity * Specificity). We may use the index for the highest G-mean score to select which threshold value to employ after calculating the G-mean for each threshold.

Here I create my first model, i.e. I try to model the dependent variable with all predictor variables using data in which the missing value has been removed

model1 <- modellingFunction("model1 = stroke~.", stroke ~ ., c("age", "hypertension","avg_glucose_level","gender","heart_disease","work_type","Residence_type","bmi","smoking_status","ever_married"), training_set, testing_set,'red')

## [1] "model1 = stroke~."
## 
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1793  -0.3361  -0.1900  -0.1018   3.1820  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.910e+01  5.876e+02  -0.033  0.97407    
## genderMale                 -2.281e-02  2.007e-01  -0.114  0.90949    
## age                         7.512e-02  8.550e-03   8.787  < 2e-16 ***
## hypertensionYes             6.384e-01  2.178e-01   2.932  0.00337 ** 
## heart_diseaseYes            3.280e-01  2.669e-01   1.229  0.21904    
## ever_marriedYes            -7.073e-02  3.218e-01  -0.220  0.82603    
## work_typeGovt_job           1.070e+01  5.876e+02   0.018  0.98548    
## work_typeNever_worked      -3.533e-01  1.510e+03   0.000  0.99981    
## work_typePrivate            1.101e+01  5.876e+02   0.019  0.98506    
## work_typeSelf-employed      1.065e+01  5.876e+02   0.018  0.98554    
## Residence_typeUrban         8.740e-02  1.942e-01   0.450  0.65271    
## avg_glucose_level           4.322e-03  1.650e-03   2.620  0.00880 ** 
## bmi                         5.134e-03  1.553e-02   0.331  0.74099    
## smoking_statusnever smoked -4.070e-02  2.245e-01  -0.181  0.85613    
## smoking_statussmokes        2.192e-01  2.819e-01   0.778  0.43675    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 987.68  on 2397  degrees of freedom
## Residual deviance: 796.88  on 2383  degrees of freedom
## AIC: 826.88
## 
## Number of Fisher Scoring iterations: 16

## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.

## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  973  54
##        Yes   0   0
##                                           
##                Accuracy : 0.9474          
##                  95% CI : (0.9319, 0.9603)
##     No Information Rate : 0.9474          
##     P-Value [Acc > NIR] : 0.5361          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 5.498e-13       
##                                           
##             Sensitivity : 0.00000         
##             Specificity : 1.00000         
##          Pos Pred Value :     NaN         
##          Neg Pred Value : 0.94742         
##              Prevalence : 0.05258         
##          Detection Rate : 0.00000         
##    Detection Prevalence : 0.00000         
##       Balanced Accuracy : 0.50000         
##                                           
##        'Positive' Class : Yes             
##                                           
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.041770
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  684  11
##        Yes 289  43
##                                          
##                Accuracy : 0.7079         
##                  95% CI : (0.679, 0.7356)
##     No Information Rate : 0.9474         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.1455         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.79630        
##             Specificity : 0.70298        
##          Pos Pred Value : 0.12952        
##          Neg Pred Value : 0.98417        
##              Prevalence : 0.05258        
##          Detection Rate : 0.04187        
##    Detection Prevalence : 0.32327        
##       Balanced Accuracy : 0.74964        
##                                          
##        'Positive' Class : Yes            
##                                          
## ===========================================
## model1 = stroke~. AUC Testing: 0.824350
## model1 = stroke~. AUC Training: 0.837246
## model1 = stroke~. Accuracy Testing with threshold 0.041770: 0.707887
## model1 = stroke~. Accuracy Testing with threshold 0.5: 0.947456
## model1 = stroke~. Accuracy Training with threshold 0.5: 0.947456

By looking at the signif code, I create my second model, i.e. I try to model the dependent variable with age, avg_glucose_level, and hypertension as the predictors using data in which the missing value has been removed

model2 <- modellingFunction("model2 = stroke~age+avg_glucose_level+hypertension", stroke ~ age + avg_glucose_level + hypertension, c("age", "avg_glucose_level","hypertension"), training_set, testing_set,'blue')

## [1] "model2 = stroke~age+avg_glucose_level+hypertension"
## 
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0745  -0.3380  -0.1927  -0.1019   3.2385  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.912785   0.544152 -14.542  < 2e-16 ***
## age                0.072489   0.007670   9.450  < 2e-16 ***
## avg_glucose_level  0.004655   0.001582   2.941  0.00327 ** 
## hypertensionYes    0.663959   0.214400   3.097  0.00196 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 987.68  on 2397  degrees of freedom
## Residual deviance: 803.27  on 2394  degrees of freedom
## AIC: 811.27
## 
## Number of Fisher Scoring iterations: 7

## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.

## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  973  54
##        Yes   0   0
##                                           
##                Accuracy : 0.9474          
##                  95% CI : (0.9319, 0.9603)
##     No Information Rate : 0.9474          
##     P-Value [Acc > NIR] : 0.5361          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 5.498e-13       
##                                           
##             Sensitivity : 0.00000         
##             Specificity : 1.00000         
##          Pos Pred Value :     NaN         
##          Neg Pred Value : 0.94742         
##              Prevalence : 0.05258         
##          Detection Rate : 0.00000         
##    Detection Prevalence : 0.00000         
##       Balanced Accuracy : 0.50000         
##                                           
##        'Positive' Class : Yes             
##                                           
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.074082
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  781  17
##        Yes 192  37
##                                           
##                Accuracy : 0.7965          
##                  95% CI : (0.7706, 0.8207)
##     No Information Rate : 0.9474          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1928          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.68519         
##             Specificity : 0.80267         
##          Pos Pred Value : 0.16157         
##          Neg Pred Value : 0.97870         
##              Prevalence : 0.05258         
##          Detection Rate : 0.03603         
##    Detection Prevalence : 0.22298         
##       Balanced Accuracy : 0.74393         
##                                           
##        'Positive' Class : Yes             
##                                           
## ===========================================
## model2 = stroke~age+avg_glucose_level+hypertension AUC Testing: 0.819002
## model2 = stroke~age+avg_glucose_level+hypertension AUC Training: 0.832694
## model2 = stroke~age+avg_glucose_level+hypertension Accuracy Testing with threshold 0.074082: 0.796495
## model2 = stroke~age+avg_glucose_level+hypertension Accuracy Testing with threshold 0.5: 0.947456
## model2 = stroke~age+avg_glucose_level+hypertension Accuracy Training with threshold 0.5: 0.947456

Then I also create my third model, i.e. I try to model the dependent variable with only age and avg_glucose_level variables as the predictors using data in which the missing value has been removed

model3 <- modellingFunction("model3 = stroke~age+avg_glucose_level", stroke ~ age + avg_glucose_level , c("age", "avg_glucose_level"), training_set, testing_set,'green')

## [1] "model3 = stroke~age+avg_glucose_level"
## 
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9594  -0.3427  -0.1958  -0.1033   3.2301  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -8.035647   0.539546 -14.893  < 2e-16 ***
## age                0.075443   0.007566   9.972  < 2e-16 ***
## avg_glucose_level  0.005387   0.001560   3.452 0.000556 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 987.68  on 2397  degrees of freedom
## Residual deviance: 812.33  on 2395  degrees of freedom
## AIC: 818.33
## 
## Number of Fisher Scoring iterations: 7

## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.

## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  973  54
##        Yes   0   0
##                                           
##                Accuracy : 0.9474          
##                  95% CI : (0.9319, 0.9603)
##     No Information Rate : 0.9474          
##     P-Value [Acc > NIR] : 0.5361          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 5.498e-13       
##                                           
##             Sensitivity : 0.00000         
##             Specificity : 1.00000         
##          Pos Pred Value :     NaN         
##          Neg Pred Value : 0.94742         
##              Prevalence : 0.05258         
##          Detection Rate : 0.00000         
##    Detection Prevalence : 0.00000         
##       Balanced Accuracy : 0.50000         
##                                           
##        'Positive' Class : Yes             
##                                           
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.064606
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  759  16
##        Yes 214  38
##                                           
##                Accuracy : 0.776           
##                  95% CI : (0.7493, 0.8012)
##     No Information Rate : 0.9474          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1771          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.70370         
##             Specificity : 0.78006         
##          Pos Pred Value : 0.15079         
##          Neg Pred Value : 0.97935         
##              Prevalence : 0.05258         
##          Detection Rate : 0.03700         
##    Detection Prevalence : 0.24537         
##       Balanced Accuracy : 0.74188         
##                                           
##        'Positive' Class : Yes             
##                                           
## ===========================================
## model3 = stroke~age+avg_glucose_level AUC Testing: 0.816756
## model3 = stroke~age+avg_glucose_level AUC Training: 0.826186
## model3 = stroke~age+avg_glucose_level Accuracy Testing with threshold 0.064606: 0.776047
## model3 = stroke~age+avg_glucose_level Accuracy Testing with threshold 0.5: 0.947456
## model3 = stroke~age+avg_glucose_level Accuracy Training with threshold 0.5: 0.947456

Next I create my first model using data in which the missing value has been replaced, namely model4. So, below I try to model the dependent variable with all predictor variables using data in which the missing value has been replaced.

model4 <- modellingFunction("model4 = stroke~.", stroke ~ ., c("age", "hypertension","avg_glucose_level","gender","heart_disease","work_type","Residence_type","bmi","smoking_status","ever_married"), training_set_ReplacedNA, testing_set_ReplacedNA,'pink')

## [1] "model4 = stroke~."
## 
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0283  -0.3301  -0.1717  -0.0941   3.4682  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -5.857308   0.804945  -7.277 3.42e-13 ***
## genderMale                  -0.117210   0.168938  -0.694   0.4878    
## age                          0.074076   0.006892  10.748  < 2e-16 ***
## hypertensionYes              0.175505   0.204452   0.858   0.3907    
## heart_diseaseYes             0.127623   0.237853   0.537   0.5916    
## ever_marriedYes              0.023551   0.292793   0.080   0.9359    
## work_typeGovt_job           -1.307031   0.876428  -1.491   0.1359    
## work_typeNever_worked      -10.718425 341.727822  -0.031   0.9750    
## work_typePrivate            -1.242327   0.859938  -1.445   0.1486    
## work_typeSelf-employed      -1.541529   0.885981  -1.740   0.0819 .  
## Residence_typeUrban          0.039402   0.163491   0.241   0.8096    
## avg_glucose_level            0.003030   0.001430   2.119   0.0341 *  
## bmi                         -0.008599   0.013701  -0.628   0.5302    
## smoking_statusnever smoked  -0.256399   0.188840  -1.358   0.1745    
## smoking_statussmokes        -0.059624   0.265216  -0.225   0.8221    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1397.4  on 3576  degrees of freedom
## Residual deviance: 1130.7  on 3562  degrees of freedom
## AIC: 1160.7
## 
## Number of Fisher Scoring iterations: 14

## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.

## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1458   74
##        Yes    0    0
##                                           
##                Accuracy : 0.9517          
##                  95% CI : (0.9397, 0.9619)
##     No Information Rate : 0.9517          
##     P-Value [Acc > NIR] : 0.5309          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.9517          
##              Prevalence : 0.0483          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Yes             
##                                           
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.060689
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1131   14
##        Yes  327   60
##                                          
##                Accuracy : 0.7774         
##                  95% CI : (0.7557, 0.798)
##     No Information Rate : 0.9517         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.195          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.81081        
##             Specificity : 0.77572        
##          Pos Pred Value : 0.15504        
##          Neg Pred Value : 0.98777        
##              Prevalence : 0.04830        
##          Detection Rate : 0.03916        
##    Detection Prevalence : 0.25261        
##       Balanced Accuracy : 0.79327        
##                                          
##        'Positive' Class : Yes            
##                                          
## ===========================================
## model4 = stroke~. AUC Testing: 0.861797
## model4 = stroke~. AUC Training: 0.836528
## model4 = stroke~. Accuracy Testing with threshold 0.060689: 0.777415
## model4 = stroke~. Accuracy Testing with threshold 0.5: 0.951076
## model4 = stroke~. Accuracy Training with threshold 0.5: 0.951076

Afterward, I create my fifth model using data in which the missing value has been replaced, namely model5. In model5, I try to model the dependent variable with age, avg_glucose_level, hypertension and heart_disease variables as the predictors using data in which the missing value has been replaced.

model5 <- modellingFunction("model5 = stroke~age+avg_glucose_level+hypertension+heart_disease", stroke ~ age + avg_glucose_level + hypertension + heart_disease, c("age", "avg_glucose_level","hypertension","heart_disease"), training_set_ReplacedNA, testing_set_ReplacedNA,'yellow')

## [1] "model5 = stroke~age+avg_glucose_level+hypertension+heart_disease"
## 
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8922  -0.3315  -0.1810  -0.0857   3.7419  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.298194   0.415283 -17.574   <2e-16 ***
## age                0.069600   0.006021  11.559   <2e-16 ***
## avg_glucose_level  0.002931   0.001390   2.110   0.0349 *  
## hypertensionYes    0.163990   0.202474   0.810   0.4180    
## heart_diseaseYes   0.142137   0.234313   0.607   0.5441    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1397.4  on 3576  degrees of freedom
## Residual deviance: 1137.6  on 3572  degrees of freedom
## AIC: 1147.6
## 
## Number of Fisher Scoring iterations: 7

## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.

## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1458   74
##        Yes    0    0
##                                           
##                Accuracy : 0.9517          
##                  95% CI : (0.9397, 0.9619)
##     No Information Rate : 0.9517          
##     P-Value [Acc > NIR] : 0.5309          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.9517          
##              Prevalence : 0.0483          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Yes             
##                                           
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.068901
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1152   16
##        Yes  306   58
##                                         
##                Accuracy : 0.7898        
##                  95% CI : (0.7685, 0.81)
##     No Information Rate : 0.9517        
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.2007        
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.78378       
##             Specificity : 0.79012       
##          Pos Pred Value : 0.15934       
##          Neg Pred Value : 0.98630       
##              Prevalence : 0.04830       
##          Detection Rate : 0.03786       
##    Detection Prevalence : 0.23760       
##       Balanced Accuracy : 0.78695       
##                                         
##        'Positive' Class : Yes           
##                                         
## ===========================================
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease AUC Testing: 0.863493
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease AUC Training: 0.833043
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease Accuracy Testing with threshold 0.068901: 0.789817
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease Accuracy Testing with threshold 0.5: 0.951076
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease Accuracy Training with threshold 0.5: 0.951076

By looking at the signif code, I create my sixth model using data in which the missing value has been replaced, namely model6. In model6, I try to model the dependent variable with age and avg_glucose_level variables as the predictors using data in which the missing value has been replaced.

model6 <- modellingFunction("model6 = stroke~age+avg_glucose_level", stroke ~ age + avg_glucose_level , c("age", "avg_glucose_level"), training_set_ReplacedNA, testing_set_ReplacedNA,'brown')

## [1] "model6 = stroke~age+avg_glucose_level"
## 
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8810  -0.3338  -0.1813  -0.0852   3.7538  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.360138   0.410674 -17.922   <2e-16 ***
## age                0.070917   0.005871  12.080   <2e-16 ***
## avg_glucose_level  0.003155   0.001371   2.301   0.0214 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1397.4  on 3576  degrees of freedom
## Residual deviance: 1138.6  on 3574  degrees of freedom
## AIC: 1144.6
## 
## Number of Fisher Scoring iterations: 7

## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.

## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1458   74
##        Yes    0    0
##                                           
##                Accuracy : 0.9517          
##                  95% CI : (0.9397, 0.9619)
##     No Information Rate : 0.9517          
##     P-Value [Acc > NIR] : 0.5309          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.9517          
##              Prevalence : 0.0483          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Yes             
##                                           
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.048003
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1043   10
##        Yes  415   64
##                                           
##                Accuracy : 0.7226          
##                  95% CI : (0.6994, 0.7449)
##     No Information Rate : 0.9517          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1613          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.86486         
##             Specificity : 0.71536         
##          Pos Pred Value : 0.13361         
##          Neg Pred Value : 0.99050         
##              Prevalence : 0.04830         
##          Detection Rate : 0.04178         
##    Detection Prevalence : 0.31266         
##       Balanced Accuracy : 0.79011         
##                                           
##        'Positive' Class : Yes             
##                                           
## ===========================================
## model6 = stroke~age+avg_glucose_level AUC Testing: 0.858618
## model6 = stroke~age+avg_glucose_level AUC Training: 0.832197
## model6 = stroke~age+avg_glucose_level Accuracy Testing with threshold 0.048003: 0.722585
## model6 = stroke~age+avg_glucose_level Accuracy Testing with threshold 0.5: 0.951076
## model6 = stroke~age+avg_glucose_level Accuracy Training with threshold 0.5: 0.951076

Lastly, by looking at the signif code from the model6, I create my last model using data in which the missing value has been replaced, namely model7. In model6, I try to model the dependent variable with age variable only as the predictor using data in which the missing value has been replaced.

model7 <- modellingFunction("model7 = stroke~age", stroke ~ age  , c("age"), training_set_ReplacedNA, testing_set_ReplacedNA,'purple')

## [1] "model7 = stroke~age"
## 
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7499  -0.3317  -0.1870  -0.0842   3.7418  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.09571    0.39130  -18.13   <2e-16 ***
## age          0.07281    0.00578   12.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1397.4  on 3576  degrees of freedom
## Residual deviance: 1143.7  on 3575  degrees of freedom
## AIC: 1147.7
## 
## Number of Fisher Scoring iterations: 7

## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.

## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1458   74
##        Yes    0    0
##                                           
##                Accuracy : 0.9517          
##                  95% CI : (0.9397, 0.9619)
##     No Information Rate : 0.9517          
##     P-Value [Acc > NIR] : 0.5309          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.9517          
##              Prevalence : 0.0483          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Yes             
##                                           
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.049958
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1060   14
##        Yes  398   60
##                                           
##                Accuracy : 0.7311          
##                  95% CI : (0.7081, 0.7531)
##     No Information Rate : 0.9517          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1553          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.81081         
##             Specificity : 0.72702         
##          Pos Pred Value : 0.13100         
##          Neg Pred Value : 0.98696         
##              Prevalence : 0.04830         
##          Detection Rate : 0.03916         
##    Detection Prevalence : 0.29896         
##       Balanced Accuracy : 0.76892         
##                                           
##        'Positive' Class : Yes             
##                                           
## ===========================================
## model7 = stroke~age AUC Testing: 0.845962
## model7 = stroke~age AUC Training: 0.829505
## model7 = stroke~age Accuracy Testing with threshold 0.049958: 0.731070
## model7 = stroke~age Accuracy Testing with threshold 0.5: 0.951076
## model7 = stroke~age Accuracy Training with threshold 0.5: 0.951076

It is proven from each model that the 0.5 threshold cannot predict this data (imbalanced data) correctly, it can be seen from the confusion matrix of each model when using the 0.5 threshold that the result is poor, so the Gmean threshold must be applied.

After constructing the seven models described above, I use the AUC, specificity, and sensitivity of each model to determine which model is the best.

#Using my summary model function to compare the auc, specificity, and sensitivity values between each model
summarizeModelMetric(list(model1,model2,model3,model4,model5,model6,model7))

##                                                                        AUC
## model1 = stroke~.                                                0.8243500
## model2 = stroke~age+avg_glucose_level+hypertension               0.8190019
## model3 = stroke~age+avg_glucose_level                            0.8167561
## model4 = stroke~.                                                0.8617970
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease 0.8634931
## model6 = stroke~age+avg_glucose_level                            0.8586179
## model7 = stroke~age                                              0.8459617
##                                                                  Specitifity
## model1 = stroke~.                                                  0.7029805
## model2 = stroke~age+avg_glucose_level+hypertension                 0.8026721
## model3 = stroke~age+avg_glucose_level                              0.7800617
## model4 = stroke~.                                                  0.7757202
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease   0.7901235
## model6 = stroke~age+avg_glucose_level                              0.7153635
## model7 = stroke~age                                                0.7270233
##                                                                  Sensitivity
## model1 = stroke~.                                                  0.7962963
## model2 = stroke~age+avg_glucose_level+hypertension                 0.6851852
## model3 = stroke~age+avg_glucose_level                              0.7037037
## model4 = stroke~.                                                  0.8108108
## model5 = stroke~age+avg_glucose_level+hypertension+heart_disease   0.7837838
## model6 = stroke~age+avg_glucose_level                              0.8648649
## model7 = stroke~age                                                0.8108108

Explanation:

AUC : AUC - ROC curve is a classification performance statistic with variable threshold values. AUC represents the degree or amount of separability. It measures how well the model can detect between classes. AUC close to one indicates a high level of separability in a good model. Similarly, the greater the AUC, the better the model distinguishes between people who have the condition and those who do not.

Sensitivity: the ability of a test to correctly identify patients with a disease.

Specificity: the ability of a test to correctly identify people without the disease.

True positive: the person has the disease and predicted positive.

True negative: the person does not have the disease and predicted negative.

False positive: the person does not have the disease and predicted positive.

False negative: the person has the disease and predicted negative

If we desire a low false negative, we must increase the sensitivity (sensitivity = true positive/(true positive + false negative)). If the specificity is low, it indicates that the false negative is high.

In order to minimize the possibility of predicting people who don’t have a stroke even though they actually have a stroke, the sensitivity must be large. Small specificity will predict more people who have a stroke even though they didn’t actually have a stroke. We, as humans, are better at predicting that we will have a stroke even though we didn’t actually have it than predicting that we will not have a stroke even though we actually had a stroke.

As a result, model 6 with age and glucose level predictors is the best model for predicting stroke or not, because it has the greatest sensitivity value among other models and the specificity is not too low. The AUC value is likewise high, close to one, indicating that model 6 is quite good at predicting whether or not someone will have a stroke. Model 6 has an AUC value of 0.86, a specificity value of 0.72 and a sensitivity of 0.86.

1. d. Assess your model and explain your results.

So, here is my final model summary and roc curve

FinalModel <- modellingFunction("Final Model = stroke~age+avg_glucose_level", stroke ~ age + avg_glucose_level , c("age", "avg_glucose_level"), training_set_ReplacedNA, testing_set_ReplacedNA,'brown')

## [1] "Final Model = stroke~age+avg_glucose_level"
## 
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8810  -0.3338  -0.1813  -0.0852   3.7538  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.360138   0.410674 -17.922   <2e-16 ***
## age                0.070917   0.005871  12.080   <2e-16 ***
## avg_glucose_level  0.003155   0.001371   2.301   0.0214 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1397.4  on 3576  degrees of freedom
## Residual deviance: 1138.6  on 3574  degrees of freedom
## AIC: 1144.6
## 
## Number of Fisher Scoring iterations: 7

## Warning in confusionMatrix.default(data = as.factor(predictedDefault), reference
## = testing_set$stroke, : Levels are not in the same order for reference and data.
## Refactoring data to match.

## CONFUSION MATRIX WITH THRESHOLD 0.5
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1458   74
##        Yes    0    0
##                                           
##                Accuracy : 0.9517          
##                  95% CI : (0.9397, 0.9619)
##     No Information Rate : 0.9517          
##     P-Value [Acc > NIR] : 0.5309          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.9517          
##              Prevalence : 0.0483          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Yes             
##                                           
## ===========================================
## CONFUSION MATRIX WITH THRESHOLD 0.048003
## -------------------------------------------
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1043   10
##        Yes  415   64
##                                           
##                Accuracy : 0.7226          
##                  95% CI : (0.6994, 0.7449)
##     No Information Rate : 0.9517          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1613          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.86486         
##             Specificity : 0.71536         
##          Pos Pred Value : 0.13361         
##          Neg Pred Value : 0.99050         
##              Prevalence : 0.04830         
##          Detection Rate : 0.04178         
##    Detection Prevalence : 0.31266         
##       Balanced Accuracy : 0.79011         
##                                           
##        'Positive' Class : Yes             
##                                           
## ===========================================
## Final Model = stroke~age+avg_glucose_level AUC Testing: 0.858618
## Final Model = stroke~age+avg_glucose_level AUC Training: 0.832197
## Final Model = stroke~age+avg_glucose_level Accuracy Testing with threshold 0.048003: 0.722585
## Final Model = stroke~age+avg_glucose_level Accuracy Testing with threshold 0.5: 0.951076
## Final Model = stroke~age+avg_glucose_level Accuracy Training with threshold 0.5: 0.951076

plot(performance(FinalModel$roc, measure = "tpr", x.measure = "fpr"))

Explanation of Model:

My final model’s accuracy using the default threshold(0.5) is around 95%. However, because based on the confusion matrix, the prediction results with a default threshold of 0.5 are poor, I use the one with the gmean threshold to predict. Then, My final model’s accuracy using the gmean threshold is around 72%. In my final model, the AUC value in the testing set is around 0.86, whereas the AUC value in the training set is approximately 0.83. The AUC values in these two sets of differences are not far off, and the AUC value in this model is fairly high, near to 1, indicating that it is quite good at predicting. According to the plot, this model’s roccurve is also fairly good.

The equation of the line for my final model is ln[stroke/(1-stroke)] = -7.360138 + age(0.070917) + avg_glucose_level(0.003155)

So, based on this data and the model that I created and picked, model6, the probability of having a stroke is determined by a person’s age and glucose levels.The older a person becomes, the more likely he or she will have a stroke. Similarly, a person’s average glucose level. The higher a person’s average glucose level, the greater the probability of getting a stroke. Someone with high glucose levels is more likely to develop diabetes, which can cause narrowing of blood vessel walls and even total blockage, causing blood flow to the brain to halt and a stroke to occur.

2. If you build another predictive model using Decision Tree and assess the model successfully (correct and complete), you will get 10 extra points.

#split the data into 50% of training and 50% of testing from the data which missing value has dropped
set.seed(1)
trainingIndexForDecisionTree = createDataPartition(UpdateStrokeDataDroppedNA$stroke,p=0.5,list=FALSE)

testing_set_for_DT = UpdateStrokeDataDroppedNA[-trainingIndexForDecisionTree, ]
training_set_for_DT = UpdateStrokeDataDroppedNA[trainingIndexForDecisionTree, ]
cat("Testing Set Dimention data  : ",dim(testing_set_for_DT),"\n")

## Testing Set Dimention data  :  1712 11

cat("Training Set Dimention data : ",dim(training_set_for_DT))

## Training Set Dimention data :  1713 11

#build the decision tree model using rpart function
DecisionTreeModel <- rpart(stroke~.,training_set_for_DT, method = "class")
DecisionTreeModel

## n= 1713 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 1713 90 No (0.94746060 0.05253940)  
##    2) age< 67.5 1381 32 No (0.97682839 0.02317161) *
##    3) age>=67.5 332 58 No (0.82530120 0.17469880)  
##      6) avg_glucose_level< 219.625 291 43 No (0.85223368 0.14776632) *
##      7) avg_glucose_level>=219.625 41 15 No (0.63414634 0.36585366)  
##       14) work_type=Self-employed 13  1 No (0.92307692 0.07692308) *
##       15) work_type=Govt_job,Private 28 14 No (0.50000000 0.50000000)  
##         30) bmi< 33.5 21  7 No (0.66666667 0.33333333)  
##           60) avg_glucose_level>=227.13 14  3 No (0.78571429 0.21428571) *
##           61) avg_glucose_level< 227.13 7  3 Yes (0.42857143 0.57142857) *
##         31) bmi>=33.5 7  0 Yes (0.00000000 1.00000000) *

To make the decision tree easier to understand, I visualized it below

#plot the decision tree model using rpart.plot function
rpart.plot(DecisionTreeModel)

From the plot of the decision tree model above, I can conclude:

A stroke does not occur in those under the age of 68.
People over the age of 68 with an average glucose level less than 220 do not suffer a stroke.
People over the age of 68 with an average glucose level above 220 and self-employed jobs are unlikely to suffer a stroke.
People over the age of 68 with an average glucose level more than 220, a non-self-employed employment type, a BMI less than 34, and an avg glucose level higher than 227 are less likely to have a stroke.

5.People who are above the age of 68, have an average glucose level of more than 200, are not self-employed, and have a BMI of more than 34 are at risk of stroke.

People who were above the age of 68, had an average glucose level of more than 200, were not self-employed, had a BMI less than 34, and an average glucose level of less than 227 were at risk of stroke.

predictionDTree <- predict(DecisionTreeModel, testing_set_for_DT, type = "class")
confusionMatrix <- table(predictionDTree, testing_set_for_DT$stroke)
confusionMatrix

##                
## predictionDTree   No  Yes
##             No  1612   87
##             Yes   10    3

By looking at the confusion matrix, this model is not recommended because it predicts that more people who should have suffered a stroke are predicted not to suffer from a stroke than predicting people who do not suffer from a stroke are predicted to have a stroke.

#accuraycy
sum(diag(confusionMatrix))/sum(confusionMatrix)

## [1] 0.9433411

#error rate (missclassification accuracy)
1-sum(diag(confusionMatrix))/sum(confusionMatrix)

## [1] 0.05665888

Based on the accuracy findings above, the decision tree model has an accuracy of 94%, but actually the decision tree model is less effective for classifying unbalanced data and requires generalization to overcome it.