Heart disease is among the leading causes of death in the world. It is a multi-factor disease that develops over time. In this project, we will try to build a model that classifies whether a person has heart disease based on several predictors. We'll use two approaches: logistic regression and K-nearest neighbors. The dataset we are using is the UCI Heart Disease dataset hosted on Kaggle.
Before modeling, let's first explore the data. We'll need to load these libraries to assist in the project.
library(tidyverse) # Data wrangling
library(caret) # to detect variables with zero variance with nearZeroVar
library(class) # for KNN modeling
library(e1071) # For drawing confusionMatrix
library(kableExtra) # Pretty printing tables
library(car) # Testing multicollinearity with VIF
We'll then load our data. Missing values in this dataset are denoted by a question mark (?).
data <- read_csv('heart.csv')
data %>% head() %>%
kable() %>%
kable_styling()
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| 57 | 1 | 0 | 140 | 192 | 0 | 1 | 148 | 0 | 0.4 | 1 | 0 | 1 | 1 |
The data consists of 303 rows and 14 columns.
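As a quick sanity check on those numbers (run right after loading, before the duplicate row found below is removed):
dim(data) # 303 rows and 14 columns, as stated above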
Before diving in, let's see the number of missing values for each column.
colSums(is.na(data))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
There are no columns with missing values. Let's check if there are any duplicate observations.
data[data %>%
duplicated(), ] %>%
kable() %>%
kable_styling()
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 1 | 2 | 138 | 175 | 0 | 1 | 173 | 0 | 0 | 2 | 4 | 2 | 1 |
There is one duplicated row. Let's remove it.
data <- data %>% unique()
This reduces the number of rows by one.
Let's also fix the data types of the columns in our dataset before doing anything further. We'll transform sex, cp, restecg, exang, slope, ca, thal, and target into factors. Before doing that, note that thal is a variable with 3 groups: normal (1-3), fixed defect (4-6), and reversible defect (7). Let's convert thal into these 3 groups before converting the columns to factors.
data <- data %>%
mutate(
thal = case_when(
thal <= 3 ~ 'normal',
4 <= thal & thal <= 6 ~ 'fixed defect',
7 <= thal ~ 'reversible defect'
)
) %>%
mutate_at(vars(sex, cp, restecg, exang, slope, ca, thal, target, fbs), factor)
Before doing any modeling, let's check whether the target classes are balanced.
ratio <- prop.table(table(data$target))
Our dataset has a 0.46 : 0.54 negative : positive ratio, showing that our target class is balanced.
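Printing the proportions makes this explicit (the figures are the ones quoted above):
round(ratio, 2) # roughly 0.46 for class 0 and 0.54 for class 1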
First, let's see if our data has outliers.
outliers <- boxplot(data %>% select_if(is.numeric), main = 'Distribution of numeric data', xlab = 'Parameters', ylab = 'value')
outliers
## $stats
## [,1] [,2] [,3] [,4] [,5]
## [1,] 29.0 94 126.0 88.0 0.0
## [2,] 48.0 120 211.0 133.0 0.0
## [3,] 55.5 130 240.5 152.5 0.8
## [4,] 61.0 140 275.0 166.0 1.6
## [5,] 77.0 170 360.0 202.0 4.0
##
## $n
## [1] 302 302 302 302 302
##
## $conf
## [,1] [,2] [,3] [,4] [,5]
## [1,] 54.31806 128.1816 234.6812 149.4997 0.6545299
## [2,] 56.68194 131.8184 246.3188 155.5003 0.9454701
##
## $out
## [1] 172.0 178.0 180.0 180.0 200.0 174.0 192.0 178.0 180.0 417.0 564.0 394.0
## [13] 407.0 409.0 71.0 4.2 6.2 5.6 4.2 4.4
##
## $group
## [1] 2 2 2 2 2 2 2 2 2 3 3 3 3 3 4 5 5 5 5 5
##
## $names
## [1] "age" "trestbps" "chol" "thalach" "oldpeak"
It seems there are 20 outliers in our dataset. Let's remove them using the function below.
removeOutliers <- function(x) {
  # Work only on the numeric columns (dplyr is already loaded via tidyverse)
  num_cols <- x %>% select_if(is.numeric) %>% colnames()
  # Get boxplot statistics without drawing the plot
  plot <- boxplot(x[num_cols], plot = FALSE)
  # Rows 1 and 5 of $stats are the whisker ends: our min/max cut-offs
  tf.data <- data.frame(plot$stats[c(1, 5), ], row.names = c('min', 'max'))
  colnames(tf.data) <- num_cols
  # Keep only rows that fall within the whiskers for every numeric column
  for (name in num_cols) {
    if (tf.data['min', name] != tf.data['max', name]) {
      x <- x %>%
        filter(tf.data['min', name] <= .data[[name]], .data[[name]] <= tf.data['max', name])
    }
  }
  return(x)
}
data.no_outliers <- removeOutliers(data)
boxplot(data.no_outliers %>% select_if(is.numeric), main = 'Distribution of numeric data, without outliers', xlab = 'Parameters', ylab = 'value')
Let's also check if there is any variable with near-zero variance.
zero_var <- nearZeroVar(data.no_outliers, names = T)
zero_var
## [1] "thal"
It seems there is one: thal. This is likely a side effect of our earlier recoding, since in this Kaggle version of the data thal only takes values 0-3, so almost every row was mapped to 'normal'. A near-zero-variance variable contributes almost nothing to the model, so let's drop it.
data.no_outliers <- data.no_outliers %>%
select(-zero_var)
data.no_outliers %>%
head() %>%
kable() %>%
kable_styling()
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 1 |
| 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 1 |
| 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 1 |
| 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 1 |
| 57 | 1 | 0 | 140 | 192 | 0 | 1 | 148 | 0 | 0.4 | 1 | 0 | 1 |
Let's see if there is any perfect separator in our data. A perfect separator perfectly separates the target classes, so that we would not need any other variables to predict the result. A perfect separator would show up with a very large coefficient while making the other variables insignificant (p-value > 0.05).
summary(glm(target ~ ., data = data.no_outliers, family = 'binomial'))
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = data.no_outliers)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7026 -0.3649 0.1243 0.4514 2.9624
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.941757 2.785813 1.056 0.290978
## age 0.025709 0.024618 1.044 0.296341
## sex1 -2.271741 0.505226 -4.496 6.91e-06 ***
## cp1 1.133724 0.566393 2.002 0.045322 *
## cp2 1.978539 0.491833 4.023 5.75e-05 ***
## cp3 2.377886 0.685129 3.471 0.000519 ***
## trestbps -0.027709 0.011491 -2.411 0.015893 *
## chol -0.005619 0.004297 -1.308 0.190993
## fbs1 0.520819 0.553116 0.942 0.346392
## restecg1 0.307805 0.380319 0.809 0.418323
## restecg2 -0.546931 2.660130 -0.206 0.837101
## thalach 0.020563 0.011323 1.816 0.069364 .
## exang1 -0.963128 0.438853 -2.195 0.028189 *
## oldpeak -0.393860 0.223790 -1.760 0.078415 .
## slope1 -0.827285 0.838000 -0.987 0.323538
## slope2 0.693445 0.920303 0.753 0.451152
## ca1 -2.263875 0.496646 -4.558 5.16e-06 ***
## ca2 -3.287369 0.755865 -4.349 1.37e-05 ***
## ca3 -2.384297 0.886537 -2.689 0.007157 **
## ca4 0.497427 1.678673 0.296 0.766985
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 416.42 on 301 degrees of freedom
## Residual deviance: 193.29 on 282 degrees of freedom
## AIC: 233.29
##
## Number of Fisher Scoring iterations: 6
Looking at the coefficients section of the model summary, we can see that no single variable has an extremely large coefficient while drowning out the others. Therefore, we can conclude that there is no perfect separator.
Let's look at the range of our numeric columns and see if our data needs scaling.
data.no_outliers %>%
select(-c(cp, restecg, sex, exang, slope, ca, fbs)) %>%
pivot_longer(-target, names_to = 'params') %>%
group_by(params) %>%
summarize(value = max(value)) %>%
ggplot(aes(x = params, y = value)) +
geom_col(fill = 'blue') +
labs(title = 'Range of values in dataset', x = 'Numeric params', y = 'Values')
We can see that our numeric variables span quite different ranges. Let's scale them to give a more balanced effect across the coefficients.
data.scaled <- data.no_outliers %>% mutate_if(is.numeric, scale)
data.scaled %>%
select(-c(cp, restecg, sex, exang, slope, ca, fbs)) %>%
pivot_longer(-target, names_to = 'params') %>%
group_by(params) %>%
summarize(value = max(value)) %>%
ggplot(aes(x = params, y = value)) +
geom_col(fill = 'blue') +
labs(title = 'Range of values in scaled dataset', x = 'Numeric params', y = 'Values')
The numeric variables are now on a comparable scale.
Before building any models, let's split our dataset into training and test sets. This keeps part of the data aside to validate our models.
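One detail worth noting: the split below is random, so results will vary from run to run. Fixing a seed first makes it reproducible (the value 100 is an arbitrary choice, not part of the original analysis).
set.seed(100) # fix the RNG so the train/test split below is reproducible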
idx <- sample(nrow(data.scaled), nrow(data.scaled)* 0.80)
data.train <- data.scaled[idx, ]
data.test <- data.scaled[-idx, ]
Let's check the proportion of the classes in the training dataset.
prop.table(table(data.train$target))
##
## 0 1
## 0.4107884 0.5892116
The training dataset is reasonably balanced (roughly 41% : 59%).
The first classification model we'll use is logistic regression. We'll build a minimal model containing only the intercept, a maximal model that takes all variables into account, and then use stepwise regression to select the best subset of variables.
model.none <- glm(target ~ 1, data.scaled, family = 'binomial')
summary(model.none)
##
## Call:
## glm(formula = target ~ 1, family = "binomial", data = data.scaled)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.252 -1.252 1.105 1.105 1.105
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1726 0.1155 1.494 0.135
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 416.42 on 301 degrees of freedom
## Residual deviance: 416.42 on 301 degrees of freedom
## AIC: 418.42
##
## Number of Fisher Scoring iterations: 3
model.all <- glm(target ~ . , data.scaled, family = 'binomial')
summary(model.all)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = data.scaled)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7026 -0.3649 0.1243 0.4514 2.9624
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.9739 0.9326 2.117 0.034296 *
## age 0.2326 0.2227 1.044 0.296341
## sex1 -2.2717 0.5052 -4.496 6.91e-06 ***
## cp1 1.1337 0.5664 2.002 0.045322 *
## cp2 1.9785 0.4918 4.023 5.75e-05 ***
## cp3 2.3779 0.6851 3.471 0.000519 ***
## trestbps -0.4867 0.2018 -2.411 0.015893 *
## chol -0.2908 0.2224 -1.308 0.190993
## fbs1 0.5208 0.5531 0.942 0.346392
## restecg1 0.3078 0.3803 0.809 0.418323
## restecg2 -0.5469 2.6601 -0.206 0.837101
## thalach 0.4710 0.2593 1.816 0.069364 .
## exang1 -0.9631 0.4389 -2.195 0.028189 *
## oldpeak -0.4575 0.2599 -1.760 0.078415 .
## slope1 -0.8273 0.8380 -0.987 0.323538
## slope2 0.6934 0.9203 0.753 0.451152
## ca1 -2.2639 0.4966 -4.558 5.16e-06 ***
## ca2 -3.2874 0.7559 -4.349 1.37e-05 ***
## ca3 -2.3843 0.8865 -2.689 0.007157 **
## ca4 0.4974 1.6787 0.296 0.766985
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 416.42 on 301 degrees of freedom
## Residual deviance: 193.29 on 282 degrees of freedom
## AIC: 233.29
##
## Number of Fisher Scoring iterations: 6
The maximal model uses all 12 variables.
Features for the logistic regression are selected using stepwise regression. As the basis for feature selection we use model.all, the model that uses all the features. The direction of selection is backward, which means we start from the full model and drop features one at a time.
model.step <- step(model.all, scope = y ~ ., direction = 'backward', trace = 0)
variables <- attr(model.step$terms, 'term.labels')
summary(model.step)
##
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + exang +
## oldpeak + slope + ca, family = "binomial", data = data.scaled)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7416 -0.3887 0.1243 0.4781 2.9169
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.0673 0.9008 2.295 0.021728 *
## sex1 -2.0857 0.4695 -4.443 8.89e-06 ***
## cp1 1.1723 0.5569 2.105 0.035306 *
## cp2 2.1013 0.4864 4.320 1.56e-05 ***
## cp3 2.4370 0.6715 3.629 0.000284 ***
## trestbps -0.4421 0.1859 -2.378 0.017395 *
## thalach 0.3410 0.2359 1.445 0.148367
## exang1 -0.9180 0.4295 -2.137 0.032561 *
## oldpeak -0.5200 0.2539 -2.048 0.040525 *
## slope1 -0.9217 0.8165 -1.129 0.258946
## slope2 0.5545 0.8875 0.625 0.532100
## ca1 -2.1864 0.4788 -4.566 4.97e-06 ***
## ca2 -2.9451 0.6939 -4.245 2.19e-05 ***
## ca3 -2.1549 0.8426 -2.558 0.010538 *
## ca4 0.5807 1.5968 0.364 0.716115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 416.42 on 301 degrees of freedom
## Residual deviance: 197.61 on 287 degrees of freedom
## AIC: 227.61
##
## Number of Fisher Scoring iterations: 6
Out of the 12 features available, 8 were chosen: sex, cp, trestbps, thalach, exang, oldpeak, slope, and ca.
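The same list can also be pulled programmatically from the variables vector we created alongside the stepwise model:
variables # "sex" "cp" "trestbps" "thalach" "exang" "oldpeak" "slope" "ca"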
Let's now evaluate our models. The first model we'll check is model.all, analyzing its results with a confusion matrix. The regression model outputs a probability between 0 and 1, indicating how close the prediction is to one of the two classes. Interpreting that value is left to us: we can assign labels using a threshold, and a common choice is the halfway point, 0.5. We'll classify all values below the threshold as class 0 (no heart disease), and all values above it as class 1 (heart disease).
# Predicting using `model.all`
# (type = 'response' returns probabilities between 0 and 1 rather than log-odds)
result.all <- predict(model.all, data.test, type = 'response')
# Changing prediction values into classes
threshold <- 0.5
pred.all <- as.factor(ifelse(result.all > threshold, 1, 0))
# Making confusion matrix
cm.all <- confusionMatrix(pred.all, data.test$target, positive ='1')
cm.all
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 6
## 1 3 16
##
## Accuracy : 0.8525
## 95% CI : (0.7383, 0.9302)
## No Information Rate : 0.6393
## P-Value [Acc > NIR] : 0.0001985
##
## Kappa : 0.6703
##
## Mcnemar's Test P-Value : 0.5049851
##
## Sensitivity : 0.7273
## Specificity : 0.9231
## Pos Pred Value : 0.8421
## Neg Pred Value : 0.8571
## Prevalence : 0.3607
## Detection Rate : 0.2623
## Detection Prevalence : 0.3115
## Balanced Accuracy : 0.8252
##
## 'Positive' Class : 1
##
Our model was able to predict with 85.25% accuracy.
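The same figure can be read straight off the confusionMatrix object:
cm.all$overall['Accuracy'] # 0.8525 on this test split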
Let's compare this model's performance with model.step.
# Predicting using `model.step`
# (type = 'response' returns probabilities between 0 and 1 rather than log-odds)
result.step <- predict(model.step, data.test, type = 'response')
# Changing prediction values into classes
threshold <- 0.5
pred.step <- as.factor(ifelse(result.step > threshold, 1, 0))
# Making confusion matrix
cm.step <- confusionMatrix(pred.step, data.test$target, positive ='1')
cm.step
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 5
## 1 3 17
##
## Accuracy : 0.8689
## 95% CI : (0.7578, 0.9416)
## No Information Rate : 0.6393
## P-Value [Acc > NIR] : 5.692e-05
##
## Kappa : 0.7099
##
## Mcnemar's Test P-Value : 0.7237
##
## Sensitivity : 0.7727
## Specificity : 0.9231
## Pos Pred Value : 0.8500
## Neg Pred Value : 0.8780
## Prevalence : 0.3607
## Detection Rate : 0.2787
## Detection Prevalence : 0.3279
## Balanced Accuracy : 0.8479
##
## 'Positive' Class : 1
##
On this test split, model.step actually achieves slightly higher accuracy while requiring fewer variables (the exact numbers will vary with the random train/test split). Even if a reduced model were marginally less accurate, the decision in real life is more subtle than simply picking the best metric. Requiring fewer variables means fewer tests have to be carried out to collect the required data. Fewer tests mean lower cost, and lower cost means more people can be screened. If we are comfortable with the difference, a screening program using model.step might give better cost efficiency and greater impact overall.
Logistic regression models are built on three assumptions: no severe multicollinearity among the predictors, independence of observations, and linearity between the numeric predictors and the log-odds of the outcome. It's time to check whether our model satisfies them.
We can check multicollinearity using the Variance Inflation Factor (VIF). We'll start with model.all.
vif(model.all)
## GVIF Df GVIF^(1/(2*Df))
## age 1.470454 1 1.212623
## sex 1.469001 1 1.212023
## cp 1.688220 3 1.091201
## trestbps 1.259105 1 1.122098
## chol 1.242136 1 1.114512
## fbs 1.158191 1 1.076193
## restecg 1.123712 2 1.029589
## thalach 1.487834 1 1.219768
## exang 1.228857 1 1.108538
## oldpeak 1.614555 1 1.270651
## slope 1.922877 2 1.177573
## ca 1.925609 4 1.085353
From the output, we see that the values are all below 4, which indicates no multicollinearity among the predictors. Now let's run the same test for model.step.
vif(model.step)
## GVIF Df GVIF^(1/(2*Df))
## sex 1.302285 1 1.141177
## cp 1.644168 3 1.086403
## trestbps 1.104030 1 1.050728
## thalach 1.295745 1 1.138308
## exang 1.201578 1 1.096165
## oldpeak 1.575478 1 1.255180
## slope 1.819738 2 1.161454
## ca 1.583362 4 1.059126
We see that the values are lower in model.step than in model.all, even though both come from the same dataset. This means that, relative to the other variables in each model, the variables used in model.step are less correlated with one another than those used in model.all.
Independence of observations means that the data do not come from repeated measurements of the same individuals. Although we have no further information about this, for our case we can assume the data do not come from repeated measurements.
This assumption states that the numeric predictors are linearly related to the log-odds of the outcome. Note that it is only relevant for numeric predictors.
One way to test this is the Box-Tidwell test. The car package has a function for it, but it is known to be awkward to use with many variables or with variables containing zeros (see the references at the end). To run the test manually, we fit a logistic regression that includes each numeric predictor multiplied by its natural log (x * log(x)); if any of these terms is significant, that variable violates the linearity assumption.
# Transforming data to contain x * log(x) value
data.box_tidwell <- data.scaled %>%
select(-c(cp, restecg, sex, exang, slope, ca, fbs)) %>%
mutate(
age = age * log(age),
trestbps = trestbps * log(trestbps),
chol = chol * log(chol),
thalach = thalach * log(thalach),
oldpeak = oldpeak * log(oldpeak)
)
# Running the regression
summary(glm(target ~ ., data = data.box_tidwell, family = 'binomial'))
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = data.box_tidwell)
##
## Deviance Residuals:
## 7 76 97 162 167 203
## 8.641e-06 2.110e-08 2.110e-08 6.995e-06 -2.110e-08 -1.298e-06
## 207 220 222 237 246
## -9.427e-06 -2.110e-08 -5.455e-06 -2.110e-08 -4.936e-06
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -69.52 245451.94 0 1
## age 152.73 1411719.25 0 1
## trestbps -37.02 391406.05 0 1
## chol 10.03 110966.54 0 1
## thalach -352.59 1817127.29 0 1
## oldpeak -82.49 232514.50 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.4421e+01 on 10 degrees of freedom
## Residual deviance: 2.6828e-10 on 5 degrees of freedom
## (291 observations deleted due to missingness)
## AIC: 12
##
## Number of Fisher Scoring iterations: 25
In the output above none of the transformed terms is significant, so we find no evidence of non-linearity. Note, however, that taking logs of the scaled (partly negative) values leaves only a handful of complete observations, so this check is a weak one.
All our assumptions pass, so we can safely say that our logistic regression model is valid.
Further optimization of our model depends on the use case. To illustrate, suppose we are a doctor deciding on further treatment for these patients. We would not want to miss people who do have heart disease, i.e. call a patient negative when they are actually positive (a false negative). Therefore we would like a higher recall / sensitivity. This can be done by changing the threshold value; our previous predictions used a threshold of 0.5, and moving it shifts the balance between sensitivity and specificity.
Logically, if we want people with even minor indications of heart disease to be classified as having heart disease, we lower the threshold. Let's try reducing it to 0.15, using model.all to see the effect.
threshold <- 0.15
# Predicting using `model.all`
# (type = 'response' returns probabilities between 0 and 1 rather than log-odds)
result.mod <- predict(model.all, data.test, type = 'response')
# Changing prediction values into classes
pred.mod <- as.factor(ifelse(result.mod > threshold, 1, 0))
# Making confusion matrix
cm.mod <- confusionMatrix(pred.mod, data.test$target, positive ='1')
cm.mod
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 5
## 1 3 17
##
## Accuracy : 0.8689
## 95% CI : (0.7578, 0.9416)
## No Information Rate : 0.6393
## P-Value [Acc > NIR] : 5.692e-05
##
## Kappa : 0.7099
##
## Mcnemar's Test P-Value : 0.7237
##
## Sensitivity : 0.7727
## Specificity : 0.9231
## Pos Pred Value : 0.8500
## Neg Pred Value : 0.8780
## Prevalence : 0.3607
## Detection Rate : 0.2787
## Detection Prevalence : 0.3279
## Balanced Accuracy : 0.8479
##
## 'Positive' Class : 1
##
Even though we reduced the threshold to 0.15, sensitivity improved only a little. This suggests our model already separates the two classes quite well.
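To see the trade-off more systematically, we can sweep a few thresholds and compare sensitivity and specificity (a small sketch, assuming result.all holds the predicted probabilities produced above):
sapply(c(0.15, 0.35, 0.50, 0.65), function(t) {
  # Re-label the predictions at threshold t and recompute the confusion matrix
  pred <- factor(ifelse(result.all > t, 1, 0), levels = c(0, 1))
  cm <- confusionMatrix(pred, data.test$target, positive = '1')
  c(threshold = t, cm$byClass[c('Sensitivity', 'Specificity')])
})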
The logistic regression model is interpretable: we can see how each variable contributes to the probability of the outcome by looking at the coefficients of the trained model. However, coefficients fitted on scaled data are expressed in units of standard deviations, so for interpretation let's refit the model on the unscaled data.
interpret <- summary(glm(target ~ ., data = data.no_outliers, family = 'binomial'))
interpret
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = data.no_outliers)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7026 -0.3649 0.1243 0.4514 2.9624
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.941757 2.785813 1.056 0.290978
## age 0.025709 0.024618 1.044 0.296341
## sex1 -2.271741 0.505226 -4.496 6.91e-06 ***
## cp1 1.133724 0.566393 2.002 0.045322 *
## cp2 1.978539 0.491833 4.023 5.75e-05 ***
## cp3 2.377886 0.685129 3.471 0.000519 ***
## trestbps -0.027709 0.011491 -2.411 0.015893 *
## chol -0.005619 0.004297 -1.308 0.190993
## fbs1 0.520819 0.553116 0.942 0.346392
## restecg1 0.307805 0.380319 0.809 0.418323
## restecg2 -0.546931 2.660130 -0.206 0.837101
## thalach 0.020563 0.011323 1.816 0.069364 .
## exang1 -0.963128 0.438853 -2.195 0.028189 *
## oldpeak -0.393860 0.223790 -1.760 0.078415 .
## slope1 -0.827285 0.838000 -0.987 0.323538
## slope2 0.693445 0.920303 0.753 0.451152
## ca1 -2.263875 0.496646 -4.558 5.16e-06 ***
## ca2 -3.287369 0.755865 -4.349 1.37e-05 ***
## ca3 -2.384297 0.886537 -2.689 0.007157 **
## ca4 0.497427 1.678673 0.296 0.766985
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 416.42 on 301 degrees of freedom
## Residual deviance: 193.29 on 282 degrees of freedom
## AIC: 233.29
##
## Number of Fisher Scoring iterations: 6
The estimated coefficient for each term is a log-odds ratio, showing how the odds of the outcome change as the variable changes. The interpretation differs for numeric and categorical variables. For a numeric variable, the coefficient shows how the log-odds change for a one-unit increase in the predictor. For a categorical variable, it shows how the log-odds change when the observation falls into that category rather than the reference category.
To make this concrete, let's interpret one numeric variable and one categorical variable.
First, let's look at the numeric variable trestbps. Among the numeric predictors it has the highest significance. Its coefficient (log-odds) is -0.027709. To interpret this value, we turn it into an odds ratio by exponentiating it.
odds.trestbps <- exp(interpret$coefficients['trestbps', 1])
odds.trestbps
## [1] 0.9726713
This means that, according to our model, an increase of 1 in resting blood pressure (trestbps) multiplies the odds of being classified as having heart disease by about 0.97, i.e. roughly a 2.7% decrease in the odds, with all other conditions the same.
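Because odds ratios multiply, the effect of a larger change follows directly; for instance, a hypothetical 10-unit increase in trestbps scales the odds by:
exp(10 * interpret$coefficients['trestbps', 1]) # about 0.9727^10, i.e. roughly 0.76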
Now, let's look at the impact of sex on the probability of heart disease. The variable has three stars (***), meaning that according to our model it significantly affects the outcome. sex has two values: 0 for female and 1 for male. For categorical variables in a regression, R automatically creates dummy variables and drops the reference level (category 0). That is why we only see one term for sex, sex1, the dummy variable for class 1 (male). The coefficient (log-odds) of sex1 is -2.2717415.
The corresponding odds ratio for sex1 is
odds.sex1 <- exp(interpret$coefficients['sex1', 1])
odds.sex1
## [1] 0.1031324
This means that, according to our model, being male (sex class 1) multiplies the odds of being classified as having heart disease by about 0.10, i.e. roughly 90% lower odds than for a female patient, with all other conditions the same.
The K-nearest neighbors method places every training observation as a point in feature space and then places each test observation in the same space. For each test observation it polls its k nearest training neighbors (using Euclidean distance) and assigns the class by majority vote: if most of the k neighbors belong to class A, the test observation is classified as class A. Hence it is important to choose a sensible k; it should be neither too small nor too large, and for binary classification an odd number avoids ties.
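To make the idea concrete, here is a minimal sketch of how a single observation could be classified (an illustration only, not the implementation used by class::knn; knn_one is a made-up helper name):
# Illustrative sketch: classify one new observation by a majority vote of its k nearest neighbours
knn_one <- function(train_x, train_y, new_x, k = 5) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(as.matrix(train_x), 2, as.numeric(new_x))^2))
  # Take the k closest training points and let their labels vote
  votes <- train_y[order(dists)[1:k]]
  names(sort(table(votes), decreasing = TRUE))[1]
}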
Using K-nearest neighbors with categorical data is not generally recommended, because when placed on a continuous axis each category is represented by an integer: there are only a few possible positions, the points get "clumped" together, and the distances carry little resolving power. However, we'll see below whether this holds here.
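One common workaround, which we do not pursue here, is to one-hot encode the factors so every level becomes its own 0/1 column before computing distances; a quick sketch (dummies is just an illustrative name):
# Hypothetical alternative: expand the factors into 0/1 dummy columns for KNN
dummies <- model.matrix(target ~ . - 1, data = data.scaled)
dim(dummies) # one column per numeric variable plus columns for the factor levels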
Our training set has 241 rows. Usually, the value of k is taken as the square root of the number of data points.
k <- round(sqrt(nrow(data.train)))
k
## [1] 16
It is not recommended to use an even value of k, because the vote among neighbors can end in a tie and the result becomes inconclusive. Therefore, let's reduce k by 1.
k <- k - 1
Now, let's run the analysis and make predictions.
# Splitting target and predictors, excluding categorical variables
train_x <- data.train %>% select_if(is.numeric)
train_y <- data.train$target
test_x <- data.test %>% select_if(is.numeric)
# Using KNN
pred.knn <- knn(train = train_x, test = test_x, cl = train_y, k = k)
# Evaluating the data
cm.knn <- confusionMatrix(as.factor(pred.knn), data.test$target)
cm.knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 22 4
## 1 17 18
##
## Accuracy : 0.6557
## 95% CI : (0.5231, 0.7727)
## No Information Rate : 0.6393
## P-Value [Acc > NIR] : 0.451905
##
## Kappa : 0.3387
##
## Mcnemar's Test P-Value : 0.008829
##
## Sensitivity : 0.5641
## Specificity : 0.8182
## Pos Pred Value : 0.8462
## Neg Pred Value : 0.5143
## Prevalence : 0.6393
## Detection Rate : 0.3607
## Detection Prevalence : 0.4262
## Balanced Accuracy : 0.6911
##
## 'Positive' Class : 0
##
Our analysis using KNN has an accuracy of 0.6557377. That is only slightly better than the no-information rate of 0.6393, so the numeric-only model is not very convincing.
To improve on this, let's try including the categorical variables as well.
# Splitting target and predictors, including categorical variables
# (drop the target itself and convert factor levels to numbers so knn() can compute distances)
train_x.fct <- data.train %>% select(-target) %>% mutate_if(is.factor, ~ as.numeric(as.character(.)))
train_y <- data.train$target
test_x.fct <- data.test %>% select(-target) %>% mutate_if(is.factor, ~ as.numeric(as.character(.)))
# Using KNN
pred.knn <- knn(train = train_x.fct, test = test_x.fct, cl = train_y, k = k)
# Evaluating the data
cm.knn.fct <- confusionMatrix(as.factor(pred.knn), data.test$target)
cm.knn.fct
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 30 2
## 1 9 20
##
## Accuracy : 0.8197
## 95% CI : (0.7002, 0.9064)
## No Information Rate : 0.6393
## P-Value [Acc > NIR] : 0.00170
##
## Kappa : 0.6343
##
## Mcnemar's Test P-Value : 0.07044
##
## Sensitivity : 0.7692
## Specificity : 0.9091
## Pos Pred Value : 0.9375
## Neg Pred Value : 0.6897
## Prevalence : 0.6393
## Detection Rate : 0.4918
## Detection Prevalence : 0.5246
## Balanced Accuracy : 0.8392
##
## 'Positive' Class : 0
##
Our analysis with the categorical variables included has a much better accuracy of 0.8196721, comparable to model.step. By restricting the first KNN model to numeric variables we discarded information that helps in classification, which is why this model performs better.
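For a compact side-by-side view, the accuracies reported in the confusion matrices above can be collected into one table:
# Accuracy comparison across the models built above
data.frame(
  model    = c('model.all', 'model.step', 'knn (numeric only)', 'knn (with factors)'),
  accuracy = c(cm.all$overall['Accuracy'], cm.step$overall['Accuracy'],
               cm.knn$overall['Accuracy'], cm.knn.fct$overall['Accuracy'])
)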
In this project, we have successfully created logistic regression and k-NN models to classify whether a person has heart disease. We have also seen how the effect of each factor can be deduced by looking at the coefficients of the logistic regression model.
We found that model.step delivers comparable accuracy to model.all while requiring a third fewer predictors (8 of the 12). This model might be more suitable for real-life use than model.all, which needs every variable, and it illustrates the considerations involved in deploying a model in practice.
Aside from logistic regression, we also implemented a k-nearest neighbors model. Unlike logistic regression, this model cannot be interpreted. Its advantage is that it is simple and cheap to run, although it is more sensitive to noise. Before doing a k-NN analysis, the data should be scaled. k-NN is usually recommended for numeric variables, although we have seen that adding the categorical variables improved the predictions here, probably because leaving them out discards useful information.
StackOverflow. What R function can be used instead of box.tidewell() when a lot of predictors need to be transformed?. Accessed 18 Aug 2020 from https://stackoverflow.com/questions/42533582/what-r-function-can-be-used-instead-of-box-tidewell-when-a-lot-of-predictors-n
StackOverflow. How to use the Box-Tidwell function with a logistic regression in R. Accessed 18 Aug 2020 from https://stackoverflow.com/questions/56350546/how-to-use-the-box-tidwell-function-with-a-logistic-regression-in-r
Statistical Associates Publishers. Logistic Regression: 10 Worst Pitfalls and Mistakes. Accessed 18 Aug 2020 from http://www.statisticalassociates.com/logistic10.htm