2540129262_FinalExam

Link video : https://binusianorg-my.sharepoint.com/personal/florencia003_binus_ac_id/_layouts/15/guestaccess.aspx?folderid=0d806c6530e2e423fbc8d38374f743c75&authkey=AVvJfqy89zfzB7sKb-si3ro&e=CTMMJ4

#library and dataset
library(ROCR)
library(rpart)
library(rpart.plot)
library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

StrokeData <- read.csv("StrokeData.csv", header = T , na.strings = ("N/A"))

###Exploratory Data Analysis

dim(StrokeData)

## [1] 5110   12

__explanation:__
From what we can see, this dataset has 5110 rows of data (instances) and 12 attributes

str(StrokeData)

## 'data.frame':    5110 obs. of  12 variables:
##  $ id               : int  9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
##  $ gender           : chr  "Male" "Female" "Male" "Female" ...
##  $ age              : num  67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ heart_disease    : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ ever_married     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ work_type        : chr  "Private" "Self-employed" "Private" "Private" ...
##  $ Residence_type   : chr  "Urban" "Rural" "Rural" "Urban" ...
##  $ avg_glucose_level: num  229 202 106 171 174 ...
##  $ bmi              : num  36.6 NA 32.5 34.4 24 29 27.4 22.8 NA 24.2 ...
##  $ smoking_status   : chr  "formerly smoked" "never smoked" "never smoked" "smokes" ...
##  $ stroke           : int  1 1 1 1 1 1 1 1 1 1 ...

__explanation:__
In this dataset a lot of attributes has two possibilites of value (binary variable), the attributes were hypertension, hear_disease, ever_married, and stroke

###Attribute Information 1) id: unique identifier 2) gender: “Male”, “Female” or “Other” 3) age: age of the patient 4) hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension 5) heart_disease: 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease 6) ever_married: “No” or “Yes” 7) work_type: “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed” 8) Residence_type: “Rural” or “Urban” 9) avg_glucose_level: average glucose level in blood 10) bmi: body mass index 11) smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown” 12) stroke: 1 if the patient had a stroke or 0 if not Note: “Unknown” in smoking_status means that the information is unavailable for this patient

BasicSummary <- function(df, dgts = 3){

m <- ncol(df)
varNames <- colnames(df)
varType <- vector("character",m)
topLevel <- vector("character",m)
topCount <- vector("numeric",m)
missCount <- vector("numeric",m)
levels <- vector("numeric", m)

for (i in 1:m){
x <- df[,i]
varType[i] <- class(x)
xtab <- table(x, useNA = "ifany")
levels[i] <- length(xtab)
nums <- as.numeric(xtab)
maxnum <- max(nums)
topCount[i] <- maxnum
maxIndex <- which.max(nums)
lvls <- names(xtab)
topLevel[i] <- lvls[maxIndex]
missIndex <- which((is.na(x)) | (x == "") | (x == " "))
missCount[i] <- length(missIndex)
}
n <- nrow(df)
topFrac <- round(topCount/n, digits = dgts)
missFrac <- round(missCount/n, digits = dgts)
## #
summaryFrame <- data.frame(variable = varNames, type = varType,
 levels = levels, topLevel = topLevel,
 topCount = topCount, topFrac = topFrac,
 missFreq = missCount, missFrac = missFrac)
 return(summaryFrame)
 }

BasicSummary(StrokeData)

##             variable      type levels     topLevel topCount topFrac missFreq
## 1                 id   integer   5110           67        1   0.000        0
## 2             gender character      3       Female     2994   0.586        0
## 3                age   numeric    104           78      102   0.020        0
## 4       hypertension   integer      2            0     4612   0.903        0
## 5      heart_disease   integer      2            0     4834   0.946        0
## 6       ever_married character      2          Yes     3353   0.656        0
## 7          work_type character      5      Private     2925   0.572        0
## 8     Residence_type character      2        Urban     2596   0.508        0
## 9  avg_glucose_level   numeric   3979        93.88        6   0.001        0
## 10               bmi   numeric    419         <NA>      201   0.039      201
## 11    smoking_status character      4 never smoked     1892   0.370        0
## 12            stroke   integer      2            0     4861   0.951        0
##    missFrac
## 1     0.000
## 2     0.000
## 3     0.000
## 4     0.000
## 5     0.000
## 6     0.000
## 7     0.000
## 8     0.000
## 9     0.000
## 10    0.039
## 11    0.000
## 12    0.000

StrokeData[, c(2, 6:8, 11)] <- lapply(StrokeData[,c(2, 6:8, 11)], as.factor)

__explanation:__
In order to help us explore the dataset easier, we need to change the attribute which was character as shown in the basic summary above as factor.

summary(StrokeData)

##        id           gender          age         hypertension    
##  Min.   :   67   Female:2994   Min.   : 0.08   Min.   :0.00000  
##  1st Qu.:17741   Male  :2115   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Other :   1   Median :45.00   Median :0.00000  
##  Mean   :36518                 Mean   :43.23   Mean   :0.09746  
##  3rd Qu.:54682                 3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                 Max.   :82.00   Max.   :1.00000  
##                                                                 
##  heart_disease     ever_married         work_type    Residence_type
##  Min.   :0.00000   No :1757     children     : 687   Rural:2514    
##  1st Qu.:0.00000   Yes:3353     Govt_job     : 657   Urban:2596    
##  Median :0.00000                Never_worked :  22                 
##  Mean   :0.05401                Private      :2925                 
##  3rd Qu.:0.00000                Self-employed: 819                 
##  Max.   :1.00000                                                   
##                                                                    
##  avg_glucose_level      bmi                smoking_status     stroke       
##  Min.   : 55.12    Min.   :10.30   formerly smoked: 885   Min.   :0.00000  
##  1st Qu.: 77.25    1st Qu.:23.50   never smoked   :1892   1st Qu.:0.00000  
##  Median : 91.89    Median :28.10   smokes         : 789   Median :0.00000  
##  Mean   :106.15    Mean   :28.89   Unknown        :1544   Mean   :0.04873  
##  3rd Qu.:114.09    3rd Qu.:33.10                          3rd Qu.:0.00000  
##  Max.   :271.74    Max.   :97.60                          Max.   :1.00000  
##                    NA's   :201

writeLines("")

writeLines("Detect Missing Values")

## Detect Missing Values

sapply(StrokeData, function(x) sum(is.na(x)))

##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0               201                 0                 0

__explanation:__
There are 201 missing values found in "bmi" attribute from this dataset. 

*For the "Unknown" in the smoking status, we can just let it be. If we found that the smoking status does affects what we will analyze, we can just delete the data that has the "unknown" smoking status.

sapply(StrokeData, function(x) length(unique(x)))

##                id            gender               age      hypertension 
##              5110                 3               104                 2 
##     heart_disease      ever_married         work_type    Residence_type 
##                 2                 2                 5                 2 
## avg_glucose_level               bmi    smoking_status            stroke 
##              3979               419                 4                 2

__explanation:__
After finding the unique value and determine what do we want to prove from the dataset (we want to prove what affect a person to have a stroke), we can remove the attribute that we think wont affect or prove whether someone will have a stroke.

These attributes has most unique value:
- id
- avg_glucose_level

*From the above output we can see that in the "age", "avg_glucose_level" and "bmi" attribute, there are lots of duplicate data. It means that some people can have same age or avg_glucose_level or bmi and this duplication of data does not need to be deleted because we will use it to find out whether these attributes can prove someone has had a stroke. 

*For the "id" attribute, we will discard it later in the data preparation because the id of a person will not affect the status that the person has stroke or not.

par(mfrow = c(2,2), pty = "m")

plot(StrokeData$gender,
     main = "Gender Attribute",
     col = "dark blue")

hist(StrokeData$age,
     main = "Age ttribute",
     col = "blue")

hist(StrokeData$hypertension,
    main = "Hypertension Attribute",
    col = "dark red")

hist(StrokeData$heart_disease,
     main = "Heart Disease Attribute",
     col = "dark green")

par(mfrow = c(2,2), pty = "m")

plot(StrokeData$ever_married,
     main = "Marriage status",
     col = "dark blue")

plot(StrokeData$work_type,
     main = "Work Type",
     col = "blue",
     las = 2,
     horiz = TRUE)

plot(StrokeData$Residence_type,
    main = "Residence Type",
    col = "dark red")

hist(StrokeData$avg_glucose_level,
     main = "Average Glucose Level",
     col = "dark green")

par(mfrow = c(2,2), pty = "m")

hist(StrokeData$bmi,
     main = "Body Mass Index",
     col = "dark blue")

plot(StrokeData$smoking_status,
     main = "Smoking Status",
     col = "blue",
     horiz = TRUE,
     las = 2)

hist(StrokeData$stroke,
    main = "Stroke Attribute",
    col = "dark red")

__explanation:__
Page 1 :
- (Upper Left) In this dataset, there are more female than male, and the least are "other".
- (Upper Right) Most people in the record aged from 40 to 60.
- (Lower Left) There are more people who doesn't have hypertension in this dataset.
- (Lower Right) Most people in this dataset also doesn't have heart disease.


Page 2:
- (Upper Left) There are more people who has married.
- (Upper Right) Most of the work type of people in this dataset are private.
- (Lower Left) The "rural" residence type and "urban" residence type seems to have slightly similar amount.
- (Lower Right) Most record of the Average Glucose Level are in range 50 above until 100.

Page 3:
- (Upper Left) Most record of the body mass index are in range 20 until around 30 above.
- (Upper Right) Most record shows that the people is either never smoked or has unknown smoking status
- (Lower Right) The difference in record number of people who has stroke and people who doesn't have stroke in this dataset was very far. It is hard for us to prove which affect people to have a stroke as the "stroke" record number in this dataset is very imbalance.

##Searching for outliers in the “bmi” attribute

ThreeSigma <- function(x, t = 3){

 mu <- mean(x, na.rm = TRUE)
 sig <- sd(x, na.rm = TRUE)
 if (sig == 0){
 message("All non-missing x-values are identical")
}
 up <- mu + t * sig
 down <- mu - t * sig
 out <- list(up = up, down = down)
 return(out)
 }

Hampel <- function(x, t = 3){

 mu <- median(x, na.rm = TRUE)
 sig <- mad(x, na.rm = TRUE)
 if (sig == 0){
 message("Hampel identifer implosion: MAD scale estimate is zero")
 }
 up <- mu + t * sig
 down <- mu - t * sig
 out <- list(up = up, down = down)
 return(out)
 }
   
BoxplotRule<- function(x, t = 1.5){

 xL <- quantile(x, na.rm = TRUE, probs = 0.25, names = FALSE)
 xU <- quantile(x, na.rm = TRUE, probs = 0.75, names = FALSE)
 Q <- xU - xL
 if (Q == 0){
 message("Boxplot rule implosion: interquartile distance is zero")
 }
 up <- xU + t * Q
 down <- xU - t * Q
 out <- list(up = up, down = down)
 return(out)
}   

ExtractDetails <- function(x, down, up){

 outClass <- rep("N", length(x))
 indexLo <- which(x < down)
 indexHi <- which(x > up)
 outClass[indexLo] <- "L"
 outClass[indexHi] <- "U"
 index <- union(indexLo, indexHi)
 values <- x[index]
 outClass <- outClass[index]
 nOut <- length(index)
 maxNom <- max(x[which(x <= up)])
 minNom <- min(x[which(x >= down)])
 outList <- list(nOut = nOut, lowLim = down,
 upLim = up, minNom = minNom,
 maxNom = maxNom, index = index,
 values = values,
 outClass = outClass)
 return(outList)
 }

FindOutliers <- function(x, t3 = 3, tH = 3, tb = 1.5){
 threeLims <- ThreeSigma(x, t = t3)
 HampLims <- Hampel(x, t = tH)
 boxLims <- BoxplotRule(x, t = tb)

 n <- length(x)
 nMiss <- length(which(is.na(x)))

 threeList <- ExtractDetails(x, threeLims$down, threeLims$up)
 HampList <- ExtractDetails(x, HampLims$down, HampLims$up)
 boxList <- ExtractDetails(x, boxLims$down, boxLims$up)

 sumFrame <- data.frame(method = "ThreeSigma", n = n,
 nMiss = nMiss, nOut = threeList$nOut,
 lowLim = threeList$lowLim,
 upLim = threeList$upLim,
 minNom = threeList$minNom,
 maxNom = threeList$maxNom)
 upFrame <- data.frame(method = "Hampel", n = n,
 nMiss = nMiss, nOut = HampList$nOut,
 lowLim = HampList$lowLim,
 upLim = HampList$upLim,
 minNom = HampList$minNom,
 maxNom = HampList$maxNom)
 sumFrame <- rbind.data.frame(sumFrame, upFrame)
 upFrame <- data.frame(method = "BoxplotRule", n = n,
 nMiss = nMiss, nOut = boxList$nOut,
 lowLim = boxList$lowLim,
 upLim = boxList$upLim,
 minNom = boxList$minNom,
 maxNom = boxList$maxNom)
 sumFrame <- rbind.data.frame(sumFrame, upFrame)

 threeFrame <- data.frame(index = threeList$index,
 values = threeList$values,
 type = threeList$outClass)
 HampFrame <- data.frame(index = HampList$index,
 values = HampList$values,
 type = HampList$outClass)
 boxFrame <- data.frame(index = boxList$index,
 values = boxList$values,
 type = boxList$outClass)
 outList <- list(summary = sumFrame, threeSigma = threeFrame,
 Hampel = HampFrame, boxplotRule = boxFrame)
 return(outList)
}

fullSummary <- FindOutliers(StrokeData$bmi)
fullSummary$summary

##        method    n nMiss nOut    lowLim    upLim minNom maxNom
## 1  ThreeSigma 5110   201   58  5.331037 52.45544   10.3   52.3
## 2      Hampel 5110   201   90  7.195340 49.00466   10.3   48.9
## 3 BoxplotRule 5110   201  478 18.700000 47.50000   18.7   47.5

__explanation:__
- The ThreeSigma rule detects 58 data points as outliers.
- The Hampel identifier detects 90 data points as outliers.
- The Boxplot Rule detects 478 data points as outliers, giving the most numbers of outliers compared from the other two outlier identifier.

Based on the numbers of data we have, which is 5110 data it's normal for the boxplot rule to detect such many outliers as it has the highest lower limit among three of them.

###Data Preparation

StrokeData_ <- StrokeData[-c(1)]
sapply(StrokeData_, function(x) sum(is.na(x)))

##            gender               age      hypertension     heart_disease 
##                 0                 0                 0                 0 
##      ever_married         work_type    Residence_type avg_glucose_level 
##                 0                 0                 0                 0 
##               bmi    smoking_status            stroke 
##               201                 0                 0

__explanation:__
We remove the attribute "id" as an id of a person won't prove if a person has a stroke or not. We keep all the other attributes as they were considered important and we can use them to prove whether a person has a stroke or not.

As we can see in the "bmi" attribute there are 201 missing values found, we need to remove these missing values to reduce bias.

StrokeData_$bmi[is.na(StrokeData_$bmi)]<-mean(StrokeData_$bmi, na.rm = TRUE)
sapply(StrokeData_, function(x) sum(is.na(x)))

##            gender               age      hypertension     heart_disease 
##                 0                 0                 0                 0 
##      ever_married         work_type    Residence_type avg_glucose_level 
##                 0                 0                 0                 0 
##               bmi    smoking_status            stroke 
##                 0                 0                 0

__explanation:__
Now that the missing values are removed, we can continue to split the data into training set and testing set.

dim(StrokeData_)

## [1] 5110   11

__explanation:__
From what we can see, this dataset now has 5110 rows (instances) and 11 attributes as we already remove the id column.

#Splitting the data into training set and testing set

library(caret)

## 
## Attaching package: 'caret'

## The following object is masked from 'package:survival':
## 
##     cluster

Train <- createDataPartition(y = StrokeData_$stroke, p = 0.7, list = FALSE)
train_set <- StrokeData_[Train,]
test_set <- StrokeData_[-Train,]

writeLines("Training set dimension: ")

## Training set dimension:

dim(train_set)

## [1] 3577   11

writeLines("Validation set dimension: ")

## Validation set dimension:

dim(test_set)

## [1] 1533   11

__explanation:__
Here we use 70% of list in this dataset for training and the rest for the testing set.

We can see that now the training set consist of 3577 instances with 11 attributes as for the testing set (the rest) consist of 1533 instances with also 11 attributes

#Modelling

Log_Model <- glm(stroke ~ ., family = binomial(link = "logit"), data = train_set)
summary(Log_Model)

## 
## Call:
## glm(formula = stroke ~ ., family = binomial(link = "logit"), 
##     data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1635  -0.3189  -0.1699  -0.0913   3.6849  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -7.077e+00  1.086e+00  -6.514 7.31e-11 ***
## genderMale                  1.022e-01  1.709e-01   0.598   0.5496    
## genderOther                -1.071e+01  1.455e+03  -0.007   0.9941    
## age                         7.013e-02  6.809e-03  10.299  < 2e-16 ***
## hypertension                4.539e-01  1.997e-01   2.273   0.0230 *  
## heart_disease               2.292e-01  2.343e-01   0.978   0.3279    
## ever_marriedYes            -3.365e-01  2.546e-01  -1.322   0.1863    
## work_typeGovt_job          -1.822e-01  1.126e+00  -0.162   0.8715    
## work_typeNever_worked      -9.892e+00  3.634e+02  -0.027   0.9783    
## work_typePrivate           -6.724e-02  1.108e+00  -0.061   0.9516    
## work_typeSelf-employed     -2.809e-01  1.129e+00  -0.249   0.8036    
## Residence_typeUrban         4.007e-02  1.676e-01   0.239   0.8111    
## avg_glucose_level           3.607e-03  1.462e-03   2.468   0.0136 *  
## bmi                        -2.352e-03  1.389e-02  -0.169   0.8656    
## smoking_statusnever smoked -7.960e-02  2.149e-01  -0.370   0.7111    
## smoking_statussmokes        2.349e-01  2.563e-01   0.916   0.3594    
## smoking_statusUnknown      -2.948e-02  2.624e-01  -0.112   0.9106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1361.6  on 3576  degrees of freedom
## Residual deviance: 1095.2  on 3560  degrees of freedom
## AIC: 1129.2
## 
## Number of Fisher Scoring iterations: 14

__explanation:__
From the result above (regression result) we can see that the attribute which has correlation with "stroke" attribute are :
- age
- hypertension
- heart disease
- avg_glucose_level

The age attribute has the strongest correlation with the "stroke".
while the hypertension was slightly correlated with "stroke" attribute

*(Note that these attributes can change at any time depends on the division of training set and testing set)

Log_Model1 <- glm(stroke ~ age + hypertension +heart_disease + avg_glucose_level, family = binomial(link = "logit"), data = train_set)
summary(Log_Model1)

## 
## Call:
## glm(formula = stroke ~ age + hypertension + heart_disease + avg_glucose_level, 
##     family = binomial(link = "logit"), data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0248  -0.3193  -0.1769  -0.0864   3.7244  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.275272   0.417333 -17.433   <2e-16 ***
## age                0.066254   0.006078  10.900   <2e-16 ***
## hypertension       0.452614   0.196471   2.304   0.0212 *  
## heart_disease      0.292307   0.229859   1.272   0.2035    
## avg_glucose_level  0.003597   0.001411   2.550   0.0108 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1361.6  on 3576  degrees of freedom
## Residual deviance: 1100.5  on 3572  degrees of freedom
## AIC: 1110.5
## 
## Number of Fisher Scoring iterations: 7

__explanation:__
So, we can conclude that:
- age of a person
- does a person has hypertension or not
- does a person has heart disease
- and the average glucose level of a person

can determine whether he/ she has a stroke.

prediction_log <- predict(Log_Model, newdata = subset(test_set, select = c(1,2,3,4,5,6,7,8,9,10,11)), type = "response")
pd <- prediction(prediction_log, test_set$stroke)
rocCurve <- performance (pd, measur = "tpr", x.measure = "fpr")
plot(rocCurve)

__explanation:__
The result of the predictive model is not the best but it is good enough if we consider the number of "stroke" data that is imbalance. So we can continue to check auc (area under the curve) and the accuracy.

#Compute AUC

auc<- performance(pd, measure = "auc")
auc <- auc@y.values[[1]]
auc

## [1] 0.8567189

__explanation:__
The auc of our model was 0.8100606

#check accuracy

result <- ifelse(prediction_log > 0.5, 1, 0)
missclassificationError <- mean(result != test_set$stroke)
print(paste("Accuracy: ", 1-missclassificationError))

## [1] "Accuracy:  0.94781474233529"

__explanation:__
We also got an accuracy of 95.11% (rounded up)

DTModel <- rpart(stroke ~ ., data = train_set, method = "class", control = rpart.control ("minsplit" = 1, minbucket = 1, cp=0.007))
DTModel

## n= 3577 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##    1) root 3577 169 0 (0.95275370 0.04724630)  
##      2) age< 67.5 2992  69 0 (0.97693850 0.02306150) *
##      3) age>=67.5 585 100 0 (0.82905983 0.17094017)  
##        6) avg_glucose_level< 252.355 578  96 0 (0.83391003 0.16608997)  
##         12) bmi>=34.15 97   7 0 (0.92783505 0.07216495) *
##         13) bmi< 34.15 481  89 0 (0.81496881 0.18503119)  
##           26) bmi< 34 479  87 0 (0.81837161 0.18162839)  
##             52) age< 73.5 157  18 0 (0.88535032 0.11464968) *
##             53) age>=73.5 322  69 0 (0.78571429 0.21428571)  
##              106) bmi< 28.84662 194  33 0 (0.82989691 0.17010309)  
##                212) hypertension< 0.5 164  21 0 (0.87195122 0.12804878) *
##                213) hypertension>=0.5 30  12 0 (0.60000000 0.40000000)  
##                  426) avg_glucose_level< 210.585 25   8 0 (0.68000000 0.32000000)  
##                    852) bmi>=27.05 7   0 0 (1.00000000 0.00000000) *
##                    853) bmi< 27.05 18   8 0 (0.55555556 0.44444444)  
##                     1706) avg_glucose_level< 65.51 3   0 0 (1.00000000 0.00000000) *
##                     1707) avg_glucose_level>=65.51 15   7 1 (0.46666667 0.53333333)  
##                       3414) avg_glucose_level>=75.59 12   5 0 (0.58333333 0.41666667)  
##                         6828) ever_married=Yes 6   1 0 (0.83333333 0.16666667) *
##                         6829) ever_married=No 6   2 1 (0.33333333 0.66666667) *
##                       3415) avg_glucose_level< 75.59 3   0 1 (0.00000000 1.00000000) *
##                  427) avg_glucose_level>=210.585 5   1 1 (0.20000000 0.80000000) *
##              107) bmi>=28.84662 128  36 0 (0.71875000 0.28125000)  
##                214) bmi>=29.05 93  17 0 (0.81720430 0.18279570)  
##                  428) smoking_status=smokes,Unknown 29   1 0 (0.96551724 0.03448276) *
##                  429) smoking_status=formerly smoked,never smoked 64  16 0 (0.75000000 0.25000000)  
##                    858) bmi< 31.35 28   3 0 (0.89285714 0.10714286) *
##                    859) bmi>=31.35 36  13 0 (0.63888889 0.36111111)  
##                     1718) Residence_type=Urban 18   3 0 (0.83333333 0.16666667) *
##                     1719) Residence_type=Rural 18   8 1 (0.44444444 0.55555556)  
##                       3438) work_type=Govt_job,Self-employed 12   5 0 (0.58333333 0.41666667)  
##                         6876) avg_glucose_level>=104.105 8   2 0 (0.75000000 0.25000000) *
##                         6877) avg_glucose_level< 104.105 4   1 1 (0.25000000 0.75000000) *
##                       3439) work_type=Private 6   1 1 (0.16666667 0.83333333) *
##                215) bmi< 29.05 35  16 1 (0.45714286 0.54285714)  
##                  430) smoking_status=never smoked,smokes 15   4 0 (0.73333333 0.26666667)  
##                    860) avg_glucose_level>=77.73 12   1 0 (0.91666667 0.08333333) *
##                    861) avg_glucose_level< 77.73 3   0 1 (0.00000000 1.00000000) *
##                  431) smoking_status=formerly smoked,Unknown 20   5 1 (0.25000000 0.75000000) *
##           27) bmi>=34 2   0 1 (0.00000000 1.00000000) *
##        7) avg_glucose_level>=252.355 7   3 1 (0.42857143 0.57142857)  
##         14) bmi< 30.8 4   1 0 (0.75000000 0.25000000) *
##         15) bmi>=30.8 3   0 1 (0.00000000 1.00000000) *

rpart.plot(DTModel)

__explanation:__

This decision tree models gives us the classification based on yes or no starting from the top (which is the root of the tree).

the number 0 or 1 at the node determine whether people has a stroke or not, 0 for people who doesn't have stroke, 1 for the people who has stroke.

The number 0.05 at the root shows us the proportion of people who doesn't have a stroke we can also confirm this number in the EDA which we compute before in the summary of StrokeData. 
It gives ud the Mean :0.04873 (which is proportion for people who doesn't have stroke)

After that the node asks wheter the age of the person is below 68 or not, if yes we see the left child node which shows us 83% of the people aged below 68 with the probability of having stroke 0.02.

And all the other nodes goes the same with the classfication and proportion shown in the tree.

predict_DT <- predict(DTModel, test_set, type = "class")
cm <- table(predict_DT, test_set$stroke)
cm

##           
## predict_DT    0    1
##          0 1438   74
##          1   15    6

__explanation:__
The model does predict 1453 people who doesn't have stroke correctly but also classified 73 as people who has stroke.

And it also predicted 6 people who has stroke wrongly (missclassified 6 people).

#overall accuray
sum(diag(cm)) / sum(cm)

## [1] 0.9419439

#misclassification accuracy
1 - sum(diag(cm)) / sum(cm)

## [1] 0.0580561

__explanation:__
The accuracy of the test turns out to be 0.9484671 or we can say it as 94.85%

and the missclassifiaction of 5.15%

2540129262_FinalExam

Florencia

2022-07-16