Data Preparation
Load the libraries that we will use in this project
library(dplyr) # data wrangling
library(magrittr) # useful piping operator
library(ggplot2) # plotting
library(tidyr) # tidying data
library(DataExplorer) # Exploratory Data Analysis
library(caret) # short for Classification And Regression Training
library(class) # for K-Nearest Neighbors Classification
set.seed(11) # for reproducibility of results
Set plotting theme
theme_set(ggthemes::theme_hc())
Load the dataset
df <- read.csv("heart.csv")
Take a look at the data
knitr::knit_print(skimr::skim(df))
Data summary

|                        |      |
|:-----------------------|:-----|
| Name                   | df   |
| Number of rows         | 303  |
| Number of columns      | 14   |
| Column type frequency: |      |
| numeric                | 14   |
| Group variables        | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate |   mean |    sd |  p0 |   p25 |   p50 |   p75 |  p100 | hist  |
|:--------------|----------:|--------------:|-------:|------:|----:|------:|------:|------:|------:|:------|
| age           |         0 |             1 |  54.37 |  9.08 |  29 |  47.5 |  55.0 |  61.0 |  77.0 | ▁▆▇▇▁ |
| sex           |         0 |             1 |   0.68 |  0.47 |   0 |   0.0 |   1.0 |   1.0 |   1.0 | ▃▁▁▁▇ |
| cp            |         0 |             1 |   0.97 |  1.03 |   0 |   0.0 |   1.0 |   2.0 |   3.0 | ▇▃▁▅▁ |
| trestbps      |         0 |             1 | 131.62 | 17.54 |  94 | 120.0 | 130.0 | 140.0 | 200.0 | ▃▇▅▁▁ |
| chol          |         0 |             1 | 246.26 | 51.83 | 126 | 211.0 | 240.0 | 274.5 | 564.0 | ▃▇▂▁▁ |
| fbs           |         0 |             1 |   0.15 |  0.36 |   0 |   0.0 |   0.0 |   0.0 |   1.0 | ▇▁▁▁▂ |
| restecg       |         0 |             1 |   0.53 |  0.53 |   0 |   0.0 |   1.0 |   1.0 |   2.0 | ▇▁▇▁▁ |
| thalach       |         0 |             1 | 149.65 | 22.91 |  71 | 133.5 | 153.0 | 166.0 | 202.0 | ▁▂▅▇▂ |
| exang         |         0 |             1 |   0.33 |  0.47 |   0 |   0.0 |   0.0 |   1.0 |   1.0 | ▇▁▁▁▃ |
| oldpeak       |         0 |             1 |   1.04 |  1.16 |   0 |   0.0 |   0.8 |   1.6 |   6.2 | ▇▂▁▁▁ |
| slope         |         0 |             1 |   1.40 |  0.62 |   0 |   1.0 |   1.0 |   2.0 |   2.0 | ▁▁▇▁▇ |
| ca            |         0 |             1 |   0.73 |  1.02 |   0 |   0.0 |   0.0 |   1.0 |   4.0 | ▇▃▂▁▁ |
| thal          |         0 |             1 |   2.31 |  0.61 |   0 |   2.0 |   2.0 |   3.0 |   3.0 | ▁▁▁▇▆ |
| target        |         0 |             1 |   0.54 |  0.50 |   0 |   0.0 |   1.0 |   1.0 |   1.0 | ▇▁▁▁▇ |
Dimensions: 303 × 14. We have a relatively small number of observations, which could limit the predictive power of our model; gathering more data could improve the results.
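As a quick cross-check of these numbers (a minimal sketch using the already-loaded DataExplorer package), introduce() reports rows, columns, and missing values in one call:
# One-line overview of dimensions and missingness (DataExplorer helper)
introduce(df)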
Completion rate: 100%. The data is complete, with no missing values in any variable, but we should correct the variable types, since several factors are stored as numeric, and assign descriptive labels to their levels.
df %<>% mutate(sex = if_else(sex == 1, "MALE", "FEMALE"),
cp = if_else(cp == 1, "ATYPICAL ANGINA",
if_else(cp == 2, "NON-ANGINAL PAIN", "ASYMPTOMATIC")),
fbs = if_else(fbs == 1, ">120", "<=120"),
restecg = if_else(restecg == 0, "NORMAL",
if_else(restecg == 1, "ABNORMALITY", "PROBABLE OR DEFINITE")),
exang = if_else(exang == 1, "YES" ,"NO"),
slope = as.factor(slope),
ca = as.factor(ca),
thal = as.factor(thal),
target = if_else(target == 1, "YES", "NO")) %>%
mutate_if(is.character, as.factor)
We gain a better understanding of the distribution of our numeric variables by using boxplots and histograms.
ggplot(gather(df[,sapply(df, is.numeric)], cols, value),
aes(x = value, fill = cols)) +
geom_boxplot() +
facet_wrap(cols~., scales = "free") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "none")

- age doesn't have outliers; the median is centered around 55 years old
- chol has outliers around 400 and one extreme outlier with a value over 500
- oldpeak has outliers with values over 4
- thalach has an outlier with a value under 90
- trestbps has the most outliers, clustered around 180

A quick numeric check of these outliers is sketched below.
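This sketch (added for illustration, assuming the common 1.5 × IQR whisker rule that geom_boxplot also uses) counts candidate outliers per numeric column:
# Count points outside the 1.5 * IQR whiskers for each numeric column
sapply(df[, sapply(df, is.numeric)], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})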
ggplot(gather(df[,sapply(df, is.numeric)], cols, value),
aes(x = value)) +
geom_histogram(aes(fill = cols)) +
geom_density(aes(y = stat(count))) +
facet_wrap(cols~., scales = "free") +
theme(legend.position = "none",
axis.title.y = element_text(angle = 90))

- age has two modes, around 45 and 55 years old, and is fairly normally distributed
- chol has a single mode around 250 and is slightly right skewed
- oldpeak has its mode at 0 and is heavily right skewed
- thalach has a mode around 160 and is left skewed
- trestbps has several modes around 135 and is slightly right skewed
Let's check the proportions of our target variable.
round(prop.table(table(df$target)),2)
##
## NO YES
## 0.46 0.54
Our target classes are fairly well balanced.
We split our dataframe by column type (categorical vs. numeric, discrete vs. continuous) to ease our Exploratory Data Analysis.
split <- split_columns(df)
ggplot(gather(split$discrete %>%
select(-c(cp,restecg,target)), cols, value),
aes(x = value)) +
geom_bar(aes(fill = value)) +
facet_wrap(cols~., scales = "free") +
theme(legend.position = "none",
axis.title.y = element_text(angle = 90)) +
ggtitle("Discrete Variables Proportions")

ggplot(gather(split$discrete %>%
select(cp), cols, value),
aes(x = value)) +
geom_bar(aes(fill = value)) +
theme(legend.position = "none",
axis.title.y = element_text(angle = 90)) +
ggtitle("Chest Pain")

ggplot(gather(split$discrete %>%
select(restecg), cols, value),
aes(x = value)) +
geom_bar(aes(fill = value)) +
theme(legend.position = "none",
axis.title.y = element_text(angle = 90)) +
ggtitle("Resting Electrocardiographic Results")

AppliedPredictiveModeling::transparentTheme(trans = .5)
featurePlot(x = split$continuous,
y = df$target,
plot = "density",
## Pass in options to xyplot() to
## make it prettier
scales = list(x = list(relation = "free"),
y = list(relation = "free")),
adjust = 1.5,
pch = "|",
layout = c(3, 2),
auto.key = list(columns = 2))

The difference between people with heart disease and those with a healthy heart is quite clear in thalach, oldpeak, and age; these features could be good predictors for our classification.
featurePlot(split$continuous,
df$target,
plot = "box",
scales = list(y = list(relation = "free"),
x = list(rot = 90)),
auto.key = list(columns = 2))

We can see from the boxplots that the majority of people with heart disease have these characteristics:
- higher thalach
- lower oldpeak
- lower age
- relatively lower trestbps
- roughly the same level of chol
AppliedPredictiveModeling::transparentTheme(trans = 0.4)
featurePlot(x = split$continuous,
y = df$target,
plot = "pairs",
## Add a key at the top
auto.key = list(columns = 2))

Our scatter plot matrix shows that the oldpeak variable could be a good feature for our model in relation to the other continuous variables. The YES and NO classes are hard to separate using the continuous variables alone, so we hypothesize that logistic regression on its own might not be the right classification method for our case.
Modeling
In this project, we will use two classification techniques: logistic regression and K-nearest neighbors.
- Logistic regression separates the observations into two classes by modeling the probability of the target as a function of the predictors and thresholding that probability.
- K-nearest neighbors classifies an observation by looking at its k nearest points and taking the majority class among them.
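As a toy illustration of the two decision rules (a sketch for intuition only, not part of the analysis itself): logistic regression pushes a linear combination of the predictors through the sigmoid to get a probability, which is then thresholded, while K-NN has no coefficients at all and simply votes among the k closest training points.
# Logistic regression: linear predictor -> probability via the sigmoid, then threshold at 0.5
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(c(-2, 0, 2))   # approx. 0.12, 0.50, 0.88
# K-NN: a new point is assigned the majority class of its k nearest training points (distance-based)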
Train - Test Split
inTraining <- createDataPartition(df$target, p = .75, list = FALSE)
train <- df[ inTraining,]
test <- df[-inTraining,]
train_x <- train %>% select(-target)
train_y <- train %>% select(target)
test_x <- test %>% select(-target)
test_y <- test %>% select(target)
Logistic Regression
Baseline model
logmod <- glm(target~.,train, family = binomial("logit"))
summary(logmod)
##
## Call:
## glm(formula = target ~ ., family = binomial("logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7819 -0.3902 0.1533 0.3760 2.8944
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.7242707 2399.5473365 -0.007 0.99444
## age 0.0227048 0.0292500 0.776 0.43761
## sexMALE -0.5575973 0.6372390 -0.875 0.38156
## cpATYPICAL ANGINA 0.6715472 0.8195231 0.819 0.41254
## cpNON-ANGINAL PAIN 0.7641526 0.5272676 1.449 0.14726
## trestbps -0.0205815 0.0127333 -1.616 0.10602
## chol -0.0006164 0.0045646 -0.135 0.89258
## fbs>120 0.8183040 0.6257587 1.308 0.19098
## restecgNORMAL -0.1463589 0.4626668 -0.316 0.75175
## restecgPROBABLE OR DEFINITE -14.7197368 1245.3277449 -0.012 0.99057
## thalach 0.0283954 0.0144956 1.959 0.05013 .
## exangYES -1.1390310 0.5087331 -2.239 0.02516 *
## oldpeak -0.4161185 0.2582631 -1.611 0.10713
## slope1 -1.3975074 0.9840938 -1.420 0.15558
## slope2 -0.2177829 1.1153621 -0.195 0.84519
## ca1 -1.8776663 0.5980489 -3.140 0.00169 **
## ca2 -3.5421866 0.8976941 -3.946 0.0000795 ***
## ca3 -1.9083253 1.0347502 -1.844 0.06515 .
## ca4 -0.9261473 2.2097925 -0.419 0.67514
## thal1 17.5303348 2399.5449392 0.007 0.99417
## thal2 17.5731307 2399.5448813 0.007 0.99416
## thal3 15.9678269 2399.5448490 0.007 0.99469
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 314.32 on 227 degrees of freedom
## Residual deviance: 139.32 on 206 degrees of freedom
## AIC: 183.32
##
## Number of Fisher Scoring iterations: 15
We include all variables in the baseline model, and the summary() output shows us, from the Pr(>|z|) column, which predictors matter most: sex, cp, thalach, exang, and ca. This is, in effect, a form of feature selection. We will look at the confusion matrix to see whether we should modify the model to improve it.
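To pull those terms out programmatically (a small base-R sketch using only the fitted object above):
# Coefficients of the baseline model with Pr(>|z|) below 0.05
coefs <- summary(logmod)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, , drop = FALSE]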
Our thal variable can perfectly classify part of the dataset, which is an indication of complete separation (note the huge coefficients and standard errors for the thal dummies above). Thus we must exclude the thal variable from our logistic regression.
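A quick way to see this (a sketch using only base R and the training split) is to cross-tabulate thal against the target; a level whose observations fall almost entirely into one class signals separation:
# Cross-tabulation of thal vs the target in the training data
table(train$thal, train$target)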
logmod_clean <- glm(target~.-thal,train, family = binomial("logit"))
summary(logmod_clean)
##
## Call:
## glm(formula = target ~ . - thal, family = binomial("logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5986 -0.4487 0.1586 0.4444 2.8224
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.690338 3.103619 0.545 0.58600
## age 0.016521 0.028024 0.590 0.55551
## sexMALE -1.361275 0.524416 -2.596 0.00944 **
## cpATYPICAL ANGINA 1.124713 0.798171 1.409 0.15880
## cpNON-ANGINAL PAIN 1.041716 0.497093 2.096 0.03612 *
## trestbps -0.019432 0.012640 -1.537 0.12423
## chol -0.003220 0.004422 -0.728 0.46654
## fbs>120 0.815081 0.590758 1.380 0.16767
## restecgNORMAL -0.015178 0.434601 -0.035 0.97214
## restecgPROBABLE OR DEFINITE -13.938153 1371.840702 -0.010 0.99189
## thalach 0.025514 0.012867 1.983 0.04738 *
## exangYES -1.337280 0.489743 -2.731 0.00632 **
## oldpeak -0.417201 0.232822 -1.792 0.07314 .
## slope1 -1.371795 0.874551 -1.569 0.11675
## slope2 -0.064491 0.994560 -0.065 0.94830
## ca1 -1.902762 0.585480 -3.250 0.00115 **
## ca2 -3.210743 0.857169 -3.746 0.00018 ***
## ca3 -2.019771 0.990750 -2.039 0.04149 *
## ca4 -0.838456 1.845854 -0.454 0.64966
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 314.32 on 227 degrees of freedom
## Residual deviance: 151.41 on 209 degrees of freedom
## AIC: 189.41
##
## Number of Fisher Scoring iterations: 15
Now, we test our baseline logistic regression model with the test data.
logmod_logit <- predict(logmod_clean, newdata = test, type = "response")
We convert the predicted probabilities into classes using the default threshold of 0.5: any value at or above 0.5 is classified as positive, and anything below as negative.
logmod_pred <- as.factor(ifelse(logmod_logit >= 0.5, "YES", "NO"))
confusionMatrix(logmod_pred, test$target, positive = "YES")
## Confusion Matrix and Statistics
##
## Reference
## Prediction NO YES
## NO 25 5
## YES 9 36
##
## Accuracy : 0.8133
## 95% CI : (0.7067, 0.894)
## No Information Rate : 0.5467
## P-Value [Acc > NIR] : 0.000001183
##
## Kappa : 0.6196
##
## Mcnemar's Test P-Value : 0.4227
##
## Sensitivity : 0.8780
## Specificity : 0.7353
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.8333
## Prevalence : 0.5467
## Detection Rate : 0.4800
## Detection Prevalence : 0.6000
## Balanced Accuracy : 0.8067
##
## 'Positive' Class : YES
##
Our model's sensitivity should be as high as possible. Why? Because our model predicts the presence of heart disease, a false negative could be very dangerous for the patient. Sensitivity measures the proportion of positives that are correctly identified (e.g., the percentage of sick people who are correctly identified as having the illness).
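To make the metric concrete, we can recompute it by hand from the confusion matrix above, where TP = 36 and FN = 5:
# Sensitivity (recall) = TP / (TP + FN), using the counts from the confusion matrix above
36 / (36 + 5)   # approx. 0.878, matching the caret output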
test %>%
mutate(yhat = logmod_pred) %>%
ggplot(aes(target, yhat, col = paste(target,yhat))) +
geom_jitter() +
geom_hline(aes(yintercept = 1.5)) +
geom_vline(aes(xintercept = 1.5)) +
labs(x = "Actual",
y = "Prediction",
title = "Confussion Matrix",
subtitle = "Logistic Regression - Baseline") +
theme(legend.position = "none")

Stepwise model
We can perform backward stepwise regression, which starts with all of the independent variables as predictors, simply by using the step() function on our logistic regression model; the direction parameter defaults to "backward" in this setup. In backward stepwise regression, we eliminate one predictor variable at a time until we reach the lowest Akaike Information Criterion (AIC).
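For reference (a minimal sketch built from the model we already fitted), the AIC that step() starts from is just the residual deviance plus twice the number of estimated coefficients:
# AIC = residual deviance + 2 * number of estimated parameters
logmod_clean$deviance + 2 * length(coef(logmod_clean))   # 151.41 + 2 * 19 = 189.41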
summary(step(logmod_clean))
## Start: AIC=189.41
## target ~ (age + sex + cp + trestbps + chol + fbs + restecg +
## thalach + exang + oldpeak + slope + ca + thal) - thal
##
## Df Deviance AIC
## - restecg 2 151.81 185.81
## - age 1 151.76 187.76
## - chol 1 151.93 187.93
## - fbs 1 153.35 189.35
## <none> 151.41 189.41
## - trestbps 1 153.84 189.84
## - cp 2 156.83 190.83
## - oldpeak 1 154.84 190.84
## - thalach 1 155.57 191.57
## - slope 2 160.59 194.59
## - sex 1 158.62 194.62
## - exang 1 159.03 195.03
## - ca 4 177.02 207.02
##
## Step: AIC=185.81
## target ~ age + sex + cp + trestbps + chol + fbs + thalach + exang +
## oldpeak + slope + ca
##
## Df Deviance AIC
## - age 1 152.18 184.18
## - chol 1 152.40 184.40
## <none> 151.81 185.81
## - fbs 1 153.85 185.85
## - trestbps 1 154.27 186.27
## - cp 2 157.37 187.37
## - oldpeak 1 155.46 187.46
## - thalach 1 156.01 188.01
## - sex 1 158.97 190.97
## - slope 2 161.00 191.00
## - exang 1 159.50 191.50
## - ca 4 177.86 203.86
##
## Step: AIC=184.18
## target ~ sex + cp + trestbps + chol + fbs + thalach + exang +
## oldpeak + slope + ca
##
## Df Deviance AIC
## - chol 1 152.69 182.69
## <none> 152.18 184.18
## - fbs 1 154.25 184.25
## - trestbps 1 154.28 184.28
## - cp 2 157.89 185.89
## - thalach 1 156.01 186.01
## - oldpeak 1 156.13 186.13
## - slope 2 161.22 189.22
## - sex 1 159.61 189.61
## - exang 1 159.86 189.86
## - ca 4 179.86 203.86
##
## Step: AIC=182.69
## target ~ sex + cp + trestbps + fbs + thalach + exang + oldpeak +
## slope + ca
##
## Df Deviance AIC
## - fbs 1 154.63 182.63
## <none> 152.69 182.69
## - trestbps 1 154.91 182.91
## - thalach 1 156.44 184.44
## - cp 2 158.53 184.53
## - oldpeak 1 156.86 184.86
## - sex 1 159.68 187.68
## - slope 2 161.92 187.92
## - exang 1 160.31 188.31
## - ca 4 180.53 202.53
##
## Step: AIC=182.63
## target ~ sex + cp + trestbps + thalach + exang + oldpeak + slope +
## ca
##
## Df Deviance AIC
## - trestbps 1 156.43 182.43
## <none> 154.63 182.63
## - thalach 1 158.42 184.42
## - cp 2 161.27 185.27
## - oldpeak 1 159.56 185.56
## - sex 1 160.75 186.75
## - exang 1 161.30 187.30
## - slope 2 163.87 187.87
## - ca 4 181.04 201.04
##
## Step: AIC=182.43
## target ~ sex + cp + thalach + exang + oldpeak + slope + ca
##
## Df Deviance AIC
## <none> 156.43 182.43
## - thalach 1 159.85 183.85
## - cp 2 163.53 185.53
## - sex 1 162.29 186.29
## - oldpeak 1 162.38 186.38
## - slope 2 165.63 187.63
## - exang 1 163.87 187.87
## - ca 4 181.71 199.71
##
## Call:
## glm(formula = target ~ sex + cp + thalach + exang + oldpeak +
## slope + ca, family = binomial("logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5680 -0.4690 0.1958 0.4605 2.6970
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.14396 2.04826 -0.070 0.94397
## sexMALE -1.11187 0.46958 -2.368 0.01789 *
## cpATYPICAL ANGINA 1.16496 0.76905 1.515 0.12982
## cpNON-ANGINAL PAIN 1.18832 0.48430 2.454 0.01414 *
## thalach 0.02140 0.01175 1.822 0.06842 .
## exangYES -1.28378 0.47639 -2.695 0.00704 **
## oldpeak -0.52784 0.22700 -2.325 0.02006 *
## slope1 -1.45761 0.83855 -1.738 0.08217 .
## slope2 -0.21372 0.94022 -0.227 0.82019
## ca1 -1.75476 0.54515 -3.219 0.00129 **
## ca2 -2.78264 0.77505 -3.590 0.00033 ***
## ca3 -1.75968 0.87034 -2.022 0.04319 *
## ca4 -0.83067 2.04143 -0.407 0.68408
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 314.32 on 227 degrees of freedom
## Residual deviance: 156.43 on 215 degrees of freedom
## AIC: 182.43
##
## Number of Fisher Scoring iterations: 6
logstep <- glm(formula = target ~ sex + cp + thalach + exang + oldpeak +
slope + ca, family = binomial("logit"), data = train)
summary(logstep)
##
## Call:
## glm(formula = target ~ sex + cp + thalach + exang + oldpeak +
## slope + ca, family = binomial("logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5680 -0.4690 0.1958 0.4605 2.6970
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.14396 2.04826 -0.070 0.94397
## sexMALE -1.11187 0.46958 -2.368 0.01789 *
## cpATYPICAL ANGINA 1.16496 0.76905 1.515 0.12982
## cpNON-ANGINAL PAIN 1.18832 0.48430 2.454 0.01414 *
## thalach 0.02140 0.01175 1.822 0.06842 .
## exangYES -1.28378 0.47639 -2.695 0.00704 **
## oldpeak -0.52784 0.22700 -2.325 0.02006 *
## slope1 -1.45761 0.83855 -1.738 0.08217 .
## slope2 -0.21372 0.94022 -0.227 0.82019
## ca1 -1.75476 0.54515 -3.219 0.00129 **
## ca2 -2.78264 0.77505 -3.590 0.00033 ***
## ca3 -1.75968 0.87034 -2.022 0.04319 *
## ca4 -0.83067 2.04143 -0.407 0.68408
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 314.32 on 227 degrees of freedom
## Residual deviance: 156.43 on 215 degrees of freedom
## AIC: 182.43
##
## Number of Fisher Scoring iterations: 6
logstep_logit <- predict(logstep, newdata = test, type = "response")
Once again, we convert the predicted probabilities into classes using the default threshold of 0.5: any value at or above 0.5 is classified as positive, and anything below as negative.
logstep_pred <- as.factor(ifelse(logstep_logit >= 0.5, "YES", "NO"))
confusionMatrix(logstep_pred, test$target, positive = "YES")
## Confusion Matrix and Statistics
##
## Reference
## Prediction NO YES
## NO 23 4
## YES 11 37
##
## Accuracy : 0.8
## 95% CI : (0.6917, 0.8835)
## No Information Rate : 0.5467
## P-Value [Acc > NIR] : 0.000004117
##
## Kappa : 0.5893
##
## Mcnemar's Test P-Value : 0.1213
##
## Sensitivity : 0.9024
## Specificity : 0.6765
## Pos Pred Value : 0.7708
## Neg Pred Value : 0.8519
## Prevalence : 0.5467
## Detection Rate : 0.4933
## Detection Prevalence : 0.6400
## Balanced Accuracy : 0.7895
##
## 'Positive' Class : YES
##
Can you see the difference between our baseline and stepwise logistic regression models? Just by trimming down the independent variables, we decreased the accuracy of our predictions by 0.0133, but our model's sensitivity increased by 0.0244.
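Collecting the headline numbers from the two confusion matrices above (values transcribed from the printed outputs, not recomputed) makes the trade-off easier to see:
# Headline metrics of the two logistic regression models, taken from the outputs above
data.frame(model       = c("baseline", "stepwise"),
           accuracy    = c(0.8133, 0.8000),
           sensitivity = c(0.8780, 0.9024),
           specificity = c(0.7353, 0.6765))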
logplot <- test %>%
mutate(yhat = logstep_pred) %>%
ggplot(aes(target, yhat))
logplot +
geom_jitter(aes(col = paste(target,yhat))) +
geom_hline(aes(yintercept = 1.5)) +
geom_vline(aes(xintercept = 1.5)) +
labs(x = "Actual",
y = "Prediction",
title = "Confussion Matrix",
subtitle = "Logistic Regression - Stepwise") +
theme(legend.position = "none")
And what's better? From our logistic regression model, we can exponentiate the coefficients and interpret them as odds ratios.
round(exp(coef(logstep)),2)
## (Intercept) sexMALE cpATYPICAL ANGINA cpNON-ANGINAL PAIN
## 0.87 0.33 3.21 3.28
## thalach exangYES oldpeak slope1
## 1.02 0.28 0.59 0.23
## slope2 ca1 ca2 ca3
## 0.81 0.17 0.06 0.17
## ca4
## 0.44
For example, compared with asymptomatic chest pain (the reference level), non-anginal pain and atypical angina multiply the odds of having heart disease by 3.28 and 3.21 respectively, so chest pain type could be one of the strongest indicators of the presence of heart disease.
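These values are simply the exponentiated coefficients of the two chest pain dummies, as a quick check confirms:
# Odds ratios for the chest pain dummies are the exponentiated coefficients
exp(coef(logstep)[c("cpNON-ANGINAL PAIN", "cpATYPICAL ANGINA")])   # approx. 3.28 and 3.21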
K-Nearest Neighbors
For the K-NN algorithm, we first dummify our categorical variables to turn them into numbers, and then normalize the dataset so that all columns are on the same scale, which prevents variables with larger ranges from dominating the distance calculation and biasing the outcome. Don't forget to exclude the target variable from the dummification, since we want to keep it as the label for our classification.
Dummify the categorical variable
dummified <- df %>%
select(-target) %>%
dummify()
normalized <- as.data.frame(lapply(dummified,scale)) %>%
mutate(target = df$target)
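As an optional sanity check (a sketch; nothing later depends on it), every scaled column should now have mean approximately 0 and standard deviation 1:
# The target factor sits in the last column, so exclude it before checking
round(colMeans(normalized[, -ncol(normalized)]), 2)[1:5]
round(apply(normalized[, -ncol(normalized)], 2, sd), 2)[1:5]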
Then we do the train-test split again, this time on the normalized data.
norminTraining <- createDataPartition(normalized$target, p = .75, list = FALSE)
norm_train <- normalized[ norminTraining,-30]
norm_test <- normalized[-norminTraining,-30]
norm_train_label <- normalized[ norminTraining,30]
norm_test_label <- normalized[-norminTraining,30]
crt_norm_train <- normalized[norminTraining,]
crt_norm_test <- normalized[-norminTraining,]
For the K-nearest neighbors classification, we will use the 'class' package, which provides the knn() function.
First, we calculate the 'k' for our K-NN. As a common rule of thumb, we take k to be the square root of the number of training observations (nrow).
round(sqrt(nrow(norm_train)))
## [1] 15
The rounded square root of the number of training observations is 15, so we will use k = 15 for our K-NN model.
knnmod <- knn(norm_train, norm_test, norm_train_label, 15)
confusionMatrix(knnmod, norm_test_label, positive = "YES")
## Confusion Matrix and Statistics
##
## Reference
## Prediction NO YES
## NO 27 4
## YES 7 37
##
## Accuracy : 0.8533
## 95% CI : (0.7527, 0.9244)
## No Information Rate : 0.5467
## P-Value [Acc > NIR] : 0.00000001664
##
## Kappa : 0.7018
##
## Mcnemar's Test P-Value : 0.5465
##
## Sensitivity : 0.9024
## Specificity : 0.7941
## Pos Pred Value : 0.8409
## Neg Pred Value : 0.8710
## Prevalence : 0.5467
## Detection Rate : 0.4933
## Detection Prevalence : 0.5867
## Balanced Accuracy : 0.8483
##
## 'Positive' Class : YES
##
Our K-NN model has higher accuracy than our logistic regression model, with the same level of sensitivity and higher specificity, which measures the proportion of negatives that are correctly identified (e.g., the percentage of healthy people correctly identified as not having the illness). We could still improve model performance in several ways, such as balancing the data on the target variable (remember that we originally have 46% NO and 54% YES), using k-fold cross-validation, or running a grid search over k.
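As one possible next step (a sketch only, not run here; it assumes caret's built-in "knn" method and the crt_norm_train frame created above), we could let caret choose k by repeated cross-validation instead of the square-root rule:
# Tune k with 10-fold cross-validation repeated 3 times over a grid of odd k values
knn_cv <- train(target ~ ., data = crt_norm_train,
                method = "knn",
                trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
                tuneGrid = data.frame(k = seq(3, 31, by = 2)))
knn_cv$bestTune   # k with the highest cross-validated accuracy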
knnplot <- crt_norm_test %>% # use the normalized test split that knnmod was predicted on
mutate(yhat = knnmod) %>%
ggplot(aes(target, yhat))
knnplot +
geom_jitter(aes(col = paste(target,yhat))) +
geom_hline(aes(yintercept = 1.5)) +
geom_vline(aes(xintercept = 1.5)) +
labs(x = "Actual",
y = "Prediction",
title = "Confusion Matrix",
subtitle = "K-Nearest Neighbors") +
theme(legend.position = "none")
