#library and dataset
library(ROCR)
library(rpart)
library(rpart.plot)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
StrokeData <- read.csv("StrokeData.csv", header = T , na.strings = ("N/A"))
###Exploratory Data Analysis
dim(StrokeData)
## [1] 5110 12
__explanation:__
From what we can see, this dataset has 5110 rows of data (instances) and 12 attributes
str(StrokeData)
## 'data.frame': 5110 obs. of 12 variables:
## $ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
## $ gender : chr "Male" "Female" "Male" "Female" ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : num 36.6 NA 32.5 34.4 24 29 27.4 22.8 NA 24.2 ...
## $ smoking_status : chr "formerly smoked" "never smoked" "never smoked" "smokes" ...
## $ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
__explanation:__
In this dataset a lot of attributes has two possibilites of value (binary variable), the attributes were hypertension, hear_disease, ever_married, and stroke
###Attribute Information 1) id: unique identifier 2) gender: “Male”, “Female” or “Other” 3) age: age of the patient 4) hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension 5) heart_disease: 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease 6) ever_married: “No” or “Yes” 7) work_type: “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed” 8) Residence_type: “Rural” or “Urban” 9) avg_glucose_level: average glucose level in blood 10) bmi: body mass index 11) smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown” 12) stroke: 1 if the patient had a stroke or 0 if not Note: “Unknown” in smoking_status means that the information is unavailable for this patient
BasicSummary <- function(df, dgts = 3){
m <- ncol(df)
varNames <- colnames(df)
varType <- vector("character",m)
topLevel <- vector("character",m)
topCount <- vector("numeric",m)
missCount <- vector("numeric",m)
levels <- vector("numeric", m)
for (i in 1:m){
x <- df[,i]
varType[i] <- class(x)
xtab <- table(x, useNA = "ifany")
levels[i] <- length(xtab)
nums <- as.numeric(xtab)
maxnum <- max(nums)
topCount[i] <- maxnum
maxIndex <- which.max(nums)
lvls <- names(xtab)
topLevel[i] <- lvls[maxIndex]
missIndex <- which((is.na(x)) | (x == "") | (x == " "))
missCount[i] <- length(missIndex)
}
n <- nrow(df)
topFrac <- round(topCount/n, digits = dgts)
missFrac <- round(missCount/n, digits = dgts)
## #
summaryFrame <- data.frame(variable = varNames, type = varType,
levels = levels, topLevel = topLevel,
topCount = topCount, topFrac = topFrac,
missFreq = missCount, missFrac = missFrac)
return(summaryFrame)
}
BasicSummary(StrokeData)
## variable type levels topLevel topCount topFrac missFreq
## 1 id integer 5110 67 1 0.000 0
## 2 gender character 3 Female 2994 0.586 0
## 3 age numeric 104 78 102 0.020 0
## 4 hypertension integer 2 0 4612 0.903 0
## 5 heart_disease integer 2 0 4834 0.946 0
## 6 ever_married character 2 Yes 3353 0.656 0
## 7 work_type character 5 Private 2925 0.572 0
## 8 Residence_type character 2 Urban 2596 0.508 0
## 9 avg_glucose_level numeric 3979 93.88 6 0.001 0
## 10 bmi numeric 419 <NA> 201 0.039 201
## 11 smoking_status character 4 never smoked 1892 0.370 0
## 12 stroke integer 2 0 4861 0.951 0
## missFrac
## 1 0.000
## 2 0.000
## 3 0.000
## 4 0.000
## 5 0.000
## 6 0.000
## 7 0.000
## 8 0.000
## 9 0.000
## 10 0.039
## 11 0.000
## 12 0.000
StrokeData[, c(2, 6:8, 11)] <- lapply(StrokeData[,c(2, 6:8, 11)], as.factor)
__explanation:__
In order to help us explore the dataset easier, we need to change the attribute which was character as shown in the basic summary above as factor.
summary(StrokeData)
## id gender age hypertension
## Min. : 67 Female:2994 Min. : 0.08 Min. :0.00000
## 1st Qu.:17741 Male :2115 1st Qu.:25.00 1st Qu.:0.00000
## Median :36932 Other : 1 Median :45.00 Median :0.00000
## Mean :36518 Mean :43.23 Mean :0.09746
## 3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
## Max. :72940 Max. :82.00 Max. :1.00000
##
## heart_disease ever_married work_type Residence_type
## Min. :0.00000 No :1757 children : 687 Rural:2514
## 1st Qu.:0.00000 Yes:3353 Govt_job : 657 Urban:2596
## Median :0.00000 Never_worked : 22
## Mean :0.05401 Private :2925
## 3rd Qu.:0.00000 Self-employed: 819
## Max. :1.00000
##
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.12 Min. :10.30 formerly smoked: 885 Min. :0.00000
## 1st Qu.: 77.25 1st Qu.:23.50 never smoked :1892 1st Qu.:0.00000
## Median : 91.89 Median :28.10 smokes : 789 Median :0.00000
## Mean :106.15 Mean :28.89 Unknown :1544 Mean :0.04873
## 3rd Qu.:114.09 3rd Qu.:33.10 3rd Qu.:0.00000
## Max. :271.74 Max. :97.60 Max. :1.00000
## NA's :201
writeLines("")
writeLines("Detect Missing Values")
## Detect Missing Values
sapply(StrokeData, function(x) sum(is.na(x)))
## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 201 0 0
__explanation:__
There are 201 missing values found in "bmi" attribute from this dataset.
*For the "Unknown" in the smoking status, we can just let it be. If we found that the smoking status does affects what we will analyze, we can just delete the data that has the "unknown" smoking status.
sapply(StrokeData, function(x) length(unique(x)))
## id gender age hypertension
## 5110 3 104 2
## heart_disease ever_married work_type Residence_type
## 2 2 5 2
## avg_glucose_level bmi smoking_status stroke
## 3979 419 4 2
__explanation:__
After finding the unique value and determine what do we want to prove from the dataset (we want to prove what affect a person to have a stroke), we can remove the attribute that we think wont affect or prove whether someone will have a stroke.
These attributes has most unique value:
- id
- avg_glucose_level
*From the above output we can see that in the "age", "avg_glucose_level" and "bmi" attribute, there are lots of duplicate data. It means that some people can have same age or avg_glucose_level or bmi and this duplication of data does not need to be deleted because we will use it to find out whether these attributes can prove someone has had a stroke.
*For the "id" attribute, we will discard it later in the data preparation because the id of a person will not affect the status that the person has stroke or not.
par(mfrow = c(2,2), pty = "m")
plot(StrokeData$gender,
main = "Gender Attribute",
col = "dark blue")
hist(StrokeData$age,
main = "Age ttribute",
col = "blue")
hist(StrokeData$hypertension,
main = "Hypertension Attribute",
col = "dark red")
hist(StrokeData$heart_disease,
main = "Heart Disease Attribute",
col = "dark green")
par(mfrow = c(2,2), pty = "m")
plot(StrokeData$ever_married,
main = "Marriage status",
col = "dark blue")
plot(StrokeData$work_type,
main = "Work Type",
col = "blue",
las = 2,
horiz = TRUE)
plot(StrokeData$Residence_type,
main = "Residence Type",
col = "dark red")
hist(StrokeData$avg_glucose_level,
main = "Average Glucose Level",
col = "dark green")
par(mfrow = c(2,2), pty = "m")
hist(StrokeData$bmi,
main = "Body Mass Index",
col = "dark blue")
plot(StrokeData$smoking_status,
main = "Smoking Status",
col = "blue",
horiz = TRUE,
las = 2)
hist(StrokeData$stroke,
main = "Stroke Attribute",
col = "dark red")
__explanation:__
Page 1 :
- (Upper Left) In this dataset, there are more female than male, and the least are "other".
- (Upper Right) Most people in the record aged from 40 to 60.
- (Lower Left) There are more people who doesn't have hypertension in this dataset.
- (Lower Right) Most people in this dataset also doesn't have heart disease.
Page 2:
- (Upper Left) There are more people who has married.
- (Upper Right) Most of the work type of people in this dataset are private.
- (Lower Left) The "rural" residence type and "urban" residence type seems to have slightly similar amount.
- (Lower Right) Most record of the Average Glucose Level are in range 50 above until 100.
Page 3:
- (Upper Left) Most record of the body mass index are in range 20 until around 30 above.
- (Upper Right) Most record shows that the people is either never smoked or has unknown smoking status
- (Lower Right) The difference in record number of people who has stroke and people who doesn't have stroke in this dataset was very far. It is hard for us to prove which affect people to have a stroke as the "stroke" record number in this dataset is very imbalance.
##Searching for outliers in the “bmi” attribute
ThreeSigma <- function(x, t = 3){
mu <- mean(x, na.rm = TRUE)
sig <- sd(x, na.rm = TRUE)
if (sig == 0){
message("All non-missing x-values are identical")
}
up <- mu + t * sig
down <- mu - t * sig
out <- list(up = up, down = down)
return(out)
}
Hampel <- function(x, t = 3){
mu <- median(x, na.rm = TRUE)
sig <- mad(x, na.rm = TRUE)
if (sig == 0){
message("Hampel identifer implosion: MAD scale estimate is zero")
}
up <- mu + t * sig
down <- mu - t * sig
out <- list(up = up, down = down)
return(out)
}
BoxplotRule<- function(x, t = 1.5){
xL <- quantile(x, na.rm = TRUE, probs = 0.25, names = FALSE)
xU <- quantile(x, na.rm = TRUE, probs = 0.75, names = FALSE)
Q <- xU - xL
if (Q == 0){
message("Boxplot rule implosion: interquartile distance is zero")
}
up <- xU + t * Q
down <- xU - t * Q
out <- list(up = up, down = down)
return(out)
}
ExtractDetails <- function(x, down, up){
outClass <- rep("N", length(x))
indexLo <- which(x < down)
indexHi <- which(x > up)
outClass[indexLo] <- "L"
outClass[indexHi] <- "U"
index <- union(indexLo, indexHi)
values <- x[index]
outClass <- outClass[index]
nOut <- length(index)
maxNom <- max(x[which(x <= up)])
minNom <- min(x[which(x >= down)])
outList <- list(nOut = nOut, lowLim = down,
upLim = up, minNom = minNom,
maxNom = maxNom, index = index,
values = values,
outClass = outClass)
return(outList)
}
FindOutliers <- function(x, t3 = 3, tH = 3, tb = 1.5){
threeLims <- ThreeSigma(x, t = t3)
HampLims <- Hampel(x, t = tH)
boxLims <- BoxplotRule(x, t = tb)
n <- length(x)
nMiss <- length(which(is.na(x)))
threeList <- ExtractDetails(x, threeLims$down, threeLims$up)
HampList <- ExtractDetails(x, HampLims$down, HampLims$up)
boxList <- ExtractDetails(x, boxLims$down, boxLims$up)
sumFrame <- data.frame(method = "ThreeSigma", n = n,
nMiss = nMiss, nOut = threeList$nOut,
lowLim = threeList$lowLim,
upLim = threeList$upLim,
minNom = threeList$minNom,
maxNom = threeList$maxNom)
upFrame <- data.frame(method = "Hampel", n = n,
nMiss = nMiss, nOut = HampList$nOut,
lowLim = HampList$lowLim,
upLim = HampList$upLim,
minNom = HampList$minNom,
maxNom = HampList$maxNom)
sumFrame <- rbind.data.frame(sumFrame, upFrame)
upFrame <- data.frame(method = "BoxplotRule", n = n,
nMiss = nMiss, nOut = boxList$nOut,
lowLim = boxList$lowLim,
upLim = boxList$upLim,
minNom = boxList$minNom,
maxNom = boxList$maxNom)
sumFrame <- rbind.data.frame(sumFrame, upFrame)
threeFrame <- data.frame(index = threeList$index,
values = threeList$values,
type = threeList$outClass)
HampFrame <- data.frame(index = HampList$index,
values = HampList$values,
type = HampList$outClass)
boxFrame <- data.frame(index = boxList$index,
values = boxList$values,
type = boxList$outClass)
outList <- list(summary = sumFrame, threeSigma = threeFrame,
Hampel = HampFrame, boxplotRule = boxFrame)
return(outList)
}
fullSummary <- FindOutliers(StrokeData$bmi)
fullSummary$summary
## method n nMiss nOut lowLim upLim minNom maxNom
## 1 ThreeSigma 5110 201 58 5.331037 52.45544 10.3 52.3
## 2 Hampel 5110 201 90 7.195340 49.00466 10.3 48.9
## 3 BoxplotRule 5110 201 478 18.700000 47.50000 18.7 47.5
__explanation:__
- The ThreeSigma rule detects 58 data points as outliers.
- The Hampel identifier detects 90 data points as outliers.
- The Boxplot Rule detects 478 data points as outliers, giving the most numbers of outliers compared from the other two outlier identifier.
Based on the numbers of data we have, which is 5110 data it's normal for the boxplot rule to detect such many outliers as it has the highest lower limit among three of them.
###Data Preparation
StrokeData_ <- StrokeData[-c(1)]
sapply(StrokeData_, function(x) sum(is.na(x)))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 201 0 0
__explanation:__
We remove the attribute "id" as an id of a person won't prove if a person has a stroke or not. We keep all the other attributes as they were considered important and we can use them to prove whether a person has a stroke or not.
As we can see in the "bmi" attribute there are 201 missing values found, we need to remove these missing values to reduce bias.
StrokeData_$bmi[is.na(StrokeData_$bmi)]<-mean(StrokeData_$bmi, na.rm = TRUE)
sapply(StrokeData_, function(x) sum(is.na(x)))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 0 0 0
__explanation:__
Now that the missing values are removed, we can continue to split the data into training set and testing set.
dim(StrokeData_)
## [1] 5110 11
__explanation:__
From what we can see, this dataset now has 5110 rows (instances) and 11 attributes as we already remove the id column.
#Splitting the data into training set and testing set
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
Train <- createDataPartition(y = StrokeData_$stroke, p = 0.7, list = FALSE)
train_set <- StrokeData_[Train,]
test_set <- StrokeData_[-Train,]
writeLines("Training set dimension: ")
## Training set dimension:
dim(train_set)
## [1] 3577 11
writeLines("Validation set dimension: ")
## Validation set dimension:
dim(test_set)
## [1] 1533 11
__explanation:__
Here we use 70% of list in this dataset for training and the rest for the testing set.
We can see that now the training set consist of 3577 instances with 11 attributes as for the testing set (the rest) consist of 1533 instances with also 11 attributes
#Modelling
Log_Model <- glm(stroke ~ ., family = binomial(link = "logit"), data = train_set)
summary(Log_Model)
##
## Call:
## glm(formula = stroke ~ ., family = binomial(link = "logit"),
## data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1635 -0.3189 -0.1699 -0.0913 3.6849
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.077e+00 1.086e+00 -6.514 7.31e-11 ***
## genderMale 1.022e-01 1.709e-01 0.598 0.5496
## genderOther -1.071e+01 1.455e+03 -0.007 0.9941
## age 7.013e-02 6.809e-03 10.299 < 2e-16 ***
## hypertension 4.539e-01 1.997e-01 2.273 0.0230 *
## heart_disease 2.292e-01 2.343e-01 0.978 0.3279
## ever_marriedYes -3.365e-01 2.546e-01 -1.322 0.1863
## work_typeGovt_job -1.822e-01 1.126e+00 -0.162 0.8715
## work_typeNever_worked -9.892e+00 3.634e+02 -0.027 0.9783
## work_typePrivate -6.724e-02 1.108e+00 -0.061 0.9516
## work_typeSelf-employed -2.809e-01 1.129e+00 -0.249 0.8036
## Residence_typeUrban 4.007e-02 1.676e-01 0.239 0.8111
## avg_glucose_level 3.607e-03 1.462e-03 2.468 0.0136 *
## bmi -2.352e-03 1.389e-02 -0.169 0.8656
## smoking_statusnever smoked -7.960e-02 2.149e-01 -0.370 0.7111
## smoking_statussmokes 2.349e-01 2.563e-01 0.916 0.3594
## smoking_statusUnknown -2.948e-02 2.624e-01 -0.112 0.9106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1361.6 on 3576 degrees of freedom
## Residual deviance: 1095.2 on 3560 degrees of freedom
## AIC: 1129.2
##
## Number of Fisher Scoring iterations: 14
__explanation:__
From the result above (regression result) we can see that the attribute which has correlation with "stroke" attribute are :
- age
- hypertension
- heart disease
- avg_glucose_level
The age attribute has the strongest correlation with the "stroke".
while the hypertension was slightly correlated with "stroke" attribute
*(Note that these attributes can change at any time depends on the division of training set and testing set)
Log_Model1 <- glm(stroke ~ age + hypertension +heart_disease + avg_glucose_level, family = binomial(link = "logit"), data = train_set)
summary(Log_Model1)
##
## Call:
## glm(formula = stroke ~ age + hypertension + heart_disease + avg_glucose_level,
## family = binomial(link = "logit"), data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0248 -0.3193 -0.1769 -0.0864 3.7244
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.275272 0.417333 -17.433 <2e-16 ***
## age 0.066254 0.006078 10.900 <2e-16 ***
## hypertension 0.452614 0.196471 2.304 0.0212 *
## heart_disease 0.292307 0.229859 1.272 0.2035
## avg_glucose_level 0.003597 0.001411 2.550 0.0108 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1361.6 on 3576 degrees of freedom
## Residual deviance: 1100.5 on 3572 degrees of freedom
## AIC: 1110.5
##
## Number of Fisher Scoring iterations: 7
__explanation:__
So, we can conclude that:
- age of a person
- does a person has hypertension or not
- does a person has heart disease
- and the average glucose level of a person
can determine whether he/ she has a stroke.
prediction_log <- predict(Log_Model, newdata = subset(test_set, select = c(1,2,3,4,5,6,7,8,9,10,11)), type = "response")
pd <- prediction(prediction_log, test_set$stroke)
rocCurve <- performance (pd, measur = "tpr", x.measure = "fpr")
plot(rocCurve)
__explanation:__
The result of the predictive model is not the best but it is good enough if we consider the number of "stroke" data that is imbalance. So we can continue to check auc (area under the curve) and the accuracy.
#Compute AUC
auc<- performance(pd, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8567189
__explanation:__
The auc of our model was 0.8100606
#check accuracy
result <- ifelse(prediction_log > 0.5, 1, 0)
missclassificationError <- mean(result != test_set$stroke)
print(paste("Accuracy: ", 1-missclassificationError))
## [1] "Accuracy: 0.94781474233529"
__explanation:__
We also got an accuracy of 95.11% (rounded up)
DTModel <- rpart(stroke ~ ., data = train_set, method = "class", control = rpart.control ("minsplit" = 1, minbucket = 1, cp=0.007))
DTModel
## n= 3577
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3577 169 0 (0.95275370 0.04724630)
## 2) age< 67.5 2992 69 0 (0.97693850 0.02306150) *
## 3) age>=67.5 585 100 0 (0.82905983 0.17094017)
## 6) avg_glucose_level< 252.355 578 96 0 (0.83391003 0.16608997)
## 12) bmi>=34.15 97 7 0 (0.92783505 0.07216495) *
## 13) bmi< 34.15 481 89 0 (0.81496881 0.18503119)
## 26) bmi< 34 479 87 0 (0.81837161 0.18162839)
## 52) age< 73.5 157 18 0 (0.88535032 0.11464968) *
## 53) age>=73.5 322 69 0 (0.78571429 0.21428571)
## 106) bmi< 28.84662 194 33 0 (0.82989691 0.17010309)
## 212) hypertension< 0.5 164 21 0 (0.87195122 0.12804878) *
## 213) hypertension>=0.5 30 12 0 (0.60000000 0.40000000)
## 426) avg_glucose_level< 210.585 25 8 0 (0.68000000 0.32000000)
## 852) bmi>=27.05 7 0 0 (1.00000000 0.00000000) *
## 853) bmi< 27.05 18 8 0 (0.55555556 0.44444444)
## 1706) avg_glucose_level< 65.51 3 0 0 (1.00000000 0.00000000) *
## 1707) avg_glucose_level>=65.51 15 7 1 (0.46666667 0.53333333)
## 3414) avg_glucose_level>=75.59 12 5 0 (0.58333333 0.41666667)
## 6828) ever_married=Yes 6 1 0 (0.83333333 0.16666667) *
## 6829) ever_married=No 6 2 1 (0.33333333 0.66666667) *
## 3415) avg_glucose_level< 75.59 3 0 1 (0.00000000 1.00000000) *
## 427) avg_glucose_level>=210.585 5 1 1 (0.20000000 0.80000000) *
## 107) bmi>=28.84662 128 36 0 (0.71875000 0.28125000)
## 214) bmi>=29.05 93 17 0 (0.81720430 0.18279570)
## 428) smoking_status=smokes,Unknown 29 1 0 (0.96551724 0.03448276) *
## 429) smoking_status=formerly smoked,never smoked 64 16 0 (0.75000000 0.25000000)
## 858) bmi< 31.35 28 3 0 (0.89285714 0.10714286) *
## 859) bmi>=31.35 36 13 0 (0.63888889 0.36111111)
## 1718) Residence_type=Urban 18 3 0 (0.83333333 0.16666667) *
## 1719) Residence_type=Rural 18 8 1 (0.44444444 0.55555556)
## 3438) work_type=Govt_job,Self-employed 12 5 0 (0.58333333 0.41666667)
## 6876) avg_glucose_level>=104.105 8 2 0 (0.75000000 0.25000000) *
## 6877) avg_glucose_level< 104.105 4 1 1 (0.25000000 0.75000000) *
## 3439) work_type=Private 6 1 1 (0.16666667 0.83333333) *
## 215) bmi< 29.05 35 16 1 (0.45714286 0.54285714)
## 430) smoking_status=never smoked,smokes 15 4 0 (0.73333333 0.26666667)
## 860) avg_glucose_level>=77.73 12 1 0 (0.91666667 0.08333333) *
## 861) avg_glucose_level< 77.73 3 0 1 (0.00000000 1.00000000) *
## 431) smoking_status=formerly smoked,Unknown 20 5 1 (0.25000000 0.75000000) *
## 27) bmi>=34 2 0 1 (0.00000000 1.00000000) *
## 7) avg_glucose_level>=252.355 7 3 1 (0.42857143 0.57142857)
## 14) bmi< 30.8 4 1 0 (0.75000000 0.25000000) *
## 15) bmi>=30.8 3 0 1 (0.00000000 1.00000000) *
rpart.plot(DTModel)
__explanation:__
This decision tree models gives us the classification based on yes or no starting from the top (which is the root of the tree).
the number 0 or 1 at the node determine whether people has a stroke or not, 0 for people who doesn't have stroke, 1 for the people who has stroke.
The number 0.05 at the root shows us the proportion of people who doesn't have a stroke we can also confirm this number in the EDA which we compute before in the summary of StrokeData.
It gives ud the Mean :0.04873 (which is proportion for people who doesn't have stroke)
After that the node asks wheter the age of the person is below 68 or not, if yes we see the left child node which shows us 83% of the people aged below 68 with the probability of having stroke 0.02.
And all the other nodes goes the same with the classfication and proportion shown in the tree.
predict_DT <- predict(DTModel, test_set, type = "class")
cm <- table(predict_DT, test_set$stroke)
cm
##
## predict_DT 0 1
## 0 1438 74
## 1 15 6
__explanation:__
The model does predict 1453 people who doesn't have stroke correctly but also classified 73 as people who has stroke.
And it also predicted 6 people who has stroke wrongly (missclassified 6 people).
#overall accuray
sum(diag(cm)) / sum(cm)
## [1] 0.9419439
#misclassification accuracy
1 - sum(diag(cm)) / sum(cm)
## [1] 0.0580561
__explanation:__
The accuracy of the test turns out to be 0.9484671 or we can say it as 94.85%
and the missclassifiaction of 5.15%