Predicting Heart Disease

Introduction

Heart disease is among the top causes of death in the world. It is a multi-factor disesase that is built overtime. In this project, we will try to create a model that can classify whether a person using heart disease from several predictors.We'll use two approach: logistic regression and K-nearest neighbor. The dataset that we are using is the UCI Heart Disease dataset hosted on Kaggle.

Data Preprocessing

Before modeling, let's first explore the data. We'll need load these libraries to assist in our project.

library(tidyverse) # Data wranling
library(caret) # to detect variables with zero variance with nearZeroVar
library(class) # for KNN modeling
library(e1071) # For drawing confusionMatrix
library(kableExtra) # Pretty printing tables
library(car) # Testing multicollinearity with VIF

We'll then load our data. Data with missing value in this dataset is denoted by the question mark (?).

data <- read_csv('heart.csv')
data %>% head() %>% 
    kable() %>% 
    kable_styling()

age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	thal	target
63	1	3	145	233	1	0	150	0	2.3	0	1	1
37	1	2	130	250	0	1	187	0	3.5	0	2	1
41	0	1	130	204	0	0	172	0	1.4	2	2	1
56	1	1	120	236	0	1	178	0	0.8	2	2	1
57	0	0	120	354	0	1	163	1	0.6	2	2	1
57	1	0	140	192	0	1	148	0	0.4	1	1	1

The data consist of 303 rows and 14 columns.

age : age of the patient
sex: sex of the patient (1 = male, 0 = female)
cp: chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)
trestbps : resting blood pressure (mmHg)
chol : serum cholesterol (mg/dl)
fbs : fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
restecg : resting electrocardiographic results (0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy)
thalach : maximum heart rate achieved
exang : exercise induced angina (1 = yes; 0 = no)
oldpeak : ST depression induced by exercise relative to rest
slope : the slope of the peak exercise ST segment (0 = upsloping, 1 = flat, 2 = downsloping)
ca : number of major vessels (0-3) colored by flourosopy
thal : 1-3 = normal; 4-6 = fixed defect; 7 = reversible defect
target : does the patient have heart disease? (0 = <50% diameter narrowing, no heart disease, 1 = >50% diameter narrowing, heart disease)

Before diving in, let's see the number of missing values for each column.

colSums(is.na(data))

##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0

There are no columns with missing values. Let's check if there is any duplicate observation

data[data %>% 
    duplicated(), ] %>% 
    kable() %>% 
    kable_styling()

age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	ca	thal	target
38	1	2	138	175	0	1	173	0	0	2	4	2	1

There is one duplicated row. Let's remove this row

data <- data %>% unique()

This reduces the number of row by 1.

Let's also fix the datatypes of the rows on our dataset before doing anything further. We'll transform sex, cp, restecg, exang, slope, ca, thal and target into factors. However, before doing that, we see that thal is a variable with 3 groups : normal (1-3), fixed defect (4-6), and resersible defect (7). Let's convert thal into these 3 groups before converting the values.

data <- data %>%
    mutate(
        thal = case_when(
            thal <= 3 ~ 'normal',
            4 <= thal & thal <= 6 ~ 'fixed defect',
            7 <= thal ~ 'reversible defect'
        )
    ) %>% 
    mutate_at(vars(sex, cp, restecg, exang, slope, ca, thal, target, fbs), factor)

Data wrangling & exploration

Class balance

Before doing any

ratio <- prop.table(table(data$target))

Our dataset have 0.46 : 0.54 negative : positive ratio, showing that our target class is balanced.

Outliers

First, let's see if our data has outliers

outliers <- boxplot(data %>% select_if(is.numeric), main = 'Distribution of numeric data', xlab = 'Parameters', ylab = 'value')

outliers

## $stats
##      [,1] [,2]  [,3]  [,4] [,5]
## [1,] 29.0   94 126.0  88.0  0.0
## [2,] 48.0  120 211.0 133.0  0.0
## [3,] 55.5  130 240.5 152.5  0.8
## [4,] 61.0  140 275.0 166.0  1.6
## [5,] 77.0  170 360.0 202.0  4.0
## 
## $n
## [1] 302 302 302 302 302
## 
## $conf
##          [,1]     [,2]     [,3]     [,4]      [,5]
## [1,] 54.31806 128.1816 234.6812 149.4997 0.6545299
## [2,] 56.68194 131.8184 246.3188 155.5003 0.9454701
## 
## $out
##  [1] 172.0 178.0 180.0 180.0 200.0 174.0 192.0 178.0 180.0 417.0 564.0 394.0
## [13] 407.0 409.0  71.0   4.2   6.2   5.6   4.2   4.4
## 
## $group
##  [1] 2 2 2 2 2 2 2 2 2 3 3 3 3 3 4 5 5 5 5 5
## 
## $names
## [1] "age"      "trestbps" "chol"     "thalach"  "oldpeak"

It seems there are 20 outliers in our dataset. Let's remove the outliers using the function below

removeOutliers <- function(x) {
    
    # include library if not included
    library(dplyr)
    
    # get outliers from x
    plot <- boxplot(x, plot = F)
    
    # Get min max value of whiskers
    tf.data <- data.frame(plot$stats[c(1,5), ], row.names = c('min', 'max'))
    colnames(tf.data) <- colnames(x)
    
    # Filter data
    for(name in colnames(x)) {
        
        if(tf.data['min', name] != tf.data['max', name] & is.numeric(x[name])) {
            x <- x %>%
                filter( tf.data['min', name] <= get(name) & get(name) <= tf.data['max', name] )   
        }
    }
    
    return(x)
    
}

data.no_outliers <- removeOutliers(data)
boxplot(data.no_outliers %>% select_if(is.numeric), main = 'Distribution of numeric data, without outliers', xlab = 'Parameters', ylab = 'value')

Let's also check if there is any variable with near zero variance

zero_var <- nearZeroVar(data.no_outliers, names = T)
zero_var

## [1] "thal"

It seems there is one. Let's get rid of that variable as it would not contribute anything to the data

data.no_outliers <- data.no_outliers %>% 
    select(-zero_var)

data.no_outliers %>% 
    head() %>% 
    kable() %>% 
    kable_styling()

age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	target
63	1	3	145	233	1	0	150	0	2.3	0	1
37	1	2	130	250	0	1	187	0	3.5	0	1
41	0	1	130	204	0	0	172	0	1.4	2	1
56	1	1	120	236	0	1	178	0	0.8	2	1
57	0	0	120	354	0	1	163	1	0.6	2	1
57	1	0	140	192	0	1	148	0	0.4	1	1

Perfect separator

Let's see if there is any perfect separator in our data. A perfect separator perfectly separates the target classes, so that we do not need other variables to predict the result. A perfect separator would have the largest coefficient, and make the significance of the other variables insignificant (z-value > 0.05).

summary(glm(target ~ ., data = data.no_outliers, family = 'binomial'))

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = data.no_outliers)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7026  -0.3649   0.1243   0.4514   2.9624  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.941757   2.785813   1.056 0.290978    
## age          0.025709   0.024618   1.044 0.296341    
## sex1        -2.271741   0.505226  -4.496 6.91e-06 ***
## cp1          1.133724   0.566393   2.002 0.045322 *  
## cp2          1.978539   0.491833   4.023 5.75e-05 ***
## cp3          2.377886   0.685129   3.471 0.000519 ***
## trestbps    -0.027709   0.011491  -2.411 0.015893 *  
## chol        -0.005619   0.004297  -1.308 0.190993    
## fbs1         0.520819   0.553116   0.942 0.346392    
## restecg1     0.307805   0.380319   0.809 0.418323    
## restecg2    -0.546931   2.660130  -0.206 0.837101    
## thalach      0.020563   0.011323   1.816 0.069364 .  
## exang1      -0.963128   0.438853  -2.195 0.028189 *  
## oldpeak     -0.393860   0.223790  -1.760 0.078415 .  
## slope1      -0.827285   0.838000  -0.987 0.323538    
## slope2       0.693445   0.920303   0.753 0.451152    
## ca1         -2.263875   0.496646  -4.558 5.16e-06 ***
## ca2         -3.287369   0.755865  -4.349 1.37e-05 ***
## ca3         -2.384297   0.886537  -2.689 0.007157 ** 
## ca4          0.497427   1.678673   0.296 0.766985    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 416.42  on 301  degrees of freedom
## Residual deviance: 193.29  on 282  degrees of freedom
## AIC: 233.29
## 
## Number of Fisher Scoring iterations: 6

Looking at the coefficients sectino of the modl summary, we can see there is no one singe value with very large coefficients. Therefore, we can conclude that there is no perfect separator.

Scaling

Let's the range of our numeric columns, and see if our data needs scaling

data.no_outliers %>% 
    select(-c(cp, restecg, sex, exang, slope, ca, fbs)) %>% 
    pivot_longer(-target, names_to = 'params') %>% 
    group_by(params) %>% 
    summarize(value = max(value)) %>% 
    ggplot(aes(x = params, y = value)) +
    geom_col(fill = 'blue') + 
    labs(title = 'Range of values in dataset', x = 'Numeric params', y = 'Values')

We can see that our values are quite in a range. Let's normalize the values to give a more balance effect between the coefficients.

data.scaled <-data.no_outliers %>% mutate_if(is.numeric, scale)

data.scaled %>% 
    select(-c(cp, restecg, sex, exang, slope, ca, fbs)) %>% 
    pivot_longer(-target, names_to = 'params') %>% 
    group_by(params) %>% 
    summarize(value = max(value)) %>% 
    ggplot(aes(x = params, y = value)) +
    geom_col(fill = 'blue') + 
    labs(title = 'Range of values in scaled dataset', x = 'Numeric params', y = 'Values')

The data is now more in scale.

Data splitting

Before making any models, let's split our dataset into test and train dataset. This allows us to have a part of the data to validate our model.

idx <- sample(nrow(data.scaled), nrow(data.scaled)* 0.80)

data.train <- data.scaled[idx, ]
data.test <- data.scaled[-idx, ]

Let's check the proportion of the classes in the training dataset

prop.table(table(data.train$target))

## 
##         0         1 
## 0.4107884 0.5892116

The training dataset is balanced.

Logistic regression model

The first classification model that we'll use is the logistc regression model. We'll build model minimal, with only the target variable as the predictor, maximal model taking account all variables, and then using stepwise regression to select the best variables.

Minimal model

model.none <- glm(target ~ 1,  data.scaled, family = 'binomial')
summary(model.none)

## 
## Call:
## glm(formula = target ~ 1, family = "binomial", data = data.scaled)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.252  -1.252   1.105   1.105   1.105  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.1726     0.1155   1.494    0.135
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 416.42  on 301  degrees of freedom
## Residual deviance: 416.42  on 301  degrees of freedom
## AIC: 418.42
## 
## Number of Fisher Scoring iterations: 3

Maximal model

model.all <- glm(target ~ . ,  data.scaled, family = 'binomial')
summary(model.all)

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = data.scaled)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7026  -0.3649   0.1243   0.4514   2.9624  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.9739     0.9326   2.117 0.034296 *  
## age           0.2326     0.2227   1.044 0.296341    
## sex1         -2.2717     0.5052  -4.496 6.91e-06 ***
## cp1           1.1337     0.5664   2.002 0.045322 *  
## cp2           1.9785     0.4918   4.023 5.75e-05 ***
## cp3           2.3779     0.6851   3.471 0.000519 ***
## trestbps     -0.4867     0.2018  -2.411 0.015893 *  
## chol         -0.2908     0.2224  -1.308 0.190993    
## fbs1          0.5208     0.5531   0.942 0.346392    
## restecg1      0.3078     0.3803   0.809 0.418323    
## restecg2     -0.5469     2.6601  -0.206 0.837101    
## thalach       0.4710     0.2593   1.816 0.069364 .  
## exang1       -0.9631     0.4389  -2.195 0.028189 *  
## oldpeak      -0.4575     0.2599  -1.760 0.078415 .  
## slope1       -0.8273     0.8380  -0.987 0.323538    
## slope2        0.6934     0.9203   0.753 0.451152    
## ca1          -2.2639     0.4966  -4.558 5.16e-06 ***
## ca2          -3.2874     0.7559  -4.349 1.37e-05 ***
## ca3          -2.3843     0.8865  -2.689 0.007157 ** 
## ca4           0.4974     1.6787   0.296 0.766985    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 416.42  on 301  degrees of freedom
## Residual deviance: 193.29  on 282  degrees of freedom
## AIC: 233.29
## 
## Number of Fisher Scoring iterations: 6

The maximal model uses all 12 variables.

Stepwise regression

Features for the logistic regression is selected using the stepwise regression. As the basis for feature selection, we will use model.log.all, which is a model that uses all the features. The direction of features selection is backwards, which means features are selected from the available features.

model.step <- step(model.all, scope = y ~ ., direction = 'backward', trace = 0)
variables <- attr(model.step$terms, 'term.labels')
summary(model.step)

## 
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + exang + 
##     oldpeak + slope + ca, family = "binomial", data = data.scaled)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7416  -0.3887   0.1243   0.4781   2.9169  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.0673     0.9008   2.295 0.021728 *  
## sex1         -2.0857     0.4695  -4.443 8.89e-06 ***
## cp1           1.1723     0.5569   2.105 0.035306 *  
## cp2           2.1013     0.4864   4.320 1.56e-05 ***
## cp3           2.4370     0.6715   3.629 0.000284 ***
## trestbps     -0.4421     0.1859  -2.378 0.017395 *  
## thalach       0.3410     0.2359   1.445 0.148367    
## exang1       -0.9180     0.4295  -2.137 0.032561 *  
## oldpeak      -0.5200     0.2539  -2.048 0.040525 *  
## slope1       -0.9217     0.8165  -1.129 0.258946    
## slope2        0.5545     0.8875   0.625 0.532100    
## ca1          -2.1864     0.4788  -4.566 4.97e-06 ***
## ca2          -2.9451     0.6939  -4.245 2.19e-05 ***
## ca3          -2.1549     0.8426  -2.558 0.010538 *  
## ca4           0.5807     1.5968   0.364 0.716115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 416.42  on 301  degrees of freedom
## Residual deviance: 197.61  on 287  degrees of freedom
## AIC: 227.61
## 
## Number of Fisher Scoring iterations: 6

Out of the 12 features available, 8 features : sex, cp, trestbps, thalach, exang, oldpeak, slope, ca were chosen

Model evaluation

Let's now try to evaluate our models. The first model that we'll check is model.all and analyze the result using a confusion matrix. The regression model outputs a continuous range of values between 0 and 1, giving how close the prediction is to one of the two ends (0 or 1). Interpreting the result is left to us. We can divide and label using a threshold. A common threshold is halfway : 0.5. We'll classifiy All values below the threshold to class 0 (no heart disesase), and all values above the threshold to class 1 (heart disease).

# Prediciting using `model.all`
result.all <- predict(model.all, data.test)

# Changing prediction values into classes
threshold <- 0.5
pred.all <- as.factor(ifelse(result.all > threshold, 1, 0))

# Making confusion matrix
cm.all <- confusionMatrix(pred.all, data.test$target, positive ='1')
cm.all

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 36  6
##          1  3 16
##                                           
##                Accuracy : 0.8525          
##                  95% CI : (0.7383, 0.9302)
##     No Information Rate : 0.6393          
##     P-Value [Acc > NIR] : 0.0001985       
##                                           
##                   Kappa : 0.6703          
##                                           
##  Mcnemar's Test P-Value : 0.5049851       
##                                           
##             Sensitivity : 0.7273          
##             Specificity : 0.9231          
##          Pos Pred Value : 0.8421          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.3607          
##          Detection Rate : 0.2623          
##    Detection Prevalence : 0.3115          
##       Balanced Accuracy : 0.8252          
##                                           
##        'Positive' Class : 1               
##

Our model was able to predict with 85.25% accuracy.

Let's compare the performance of the model with model.step

# Prediciting using `model.step`
result.step <- predict(model.step, data.test)

# Changing prediction values into classes
threshold <- 0.5
pred.step <- as.factor(ifelse(result.step > threshold, 1, 0))

# Making confusion matrix
cm.step <- confusionMatrix(pred.step, data.test$target, positive ='1')
cm.step

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 36  5
##          1  3 17
##                                           
##                Accuracy : 0.8689          
##                  95% CI : (0.7578, 0.9416)
##     No Information Rate : 0.6393          
##     P-Value [Acc > NIR] : 5.692e-05       
##                                           
##                   Kappa : 0.7099          
##                                           
##  Mcnemar's Test P-Value : 0.7237          
##                                           
##             Sensitivity : 0.7727          
##             Specificity : 0.9231          
##          Pos Pred Value : 0.8500          
##          Neg Pred Value : 0.8780          
##              Prevalence : 0.3607          
##          Detection Rate : 0.2787          
##    Detection Prevalence : 0.3279          
##       Balanced Accuracy : 0.8479          
##                                           
##        'Positive' Class : 1               
##

model.step, produces a little bit lower accuracy, but requries less variables. Obviously, we would like to use model.all with its better metrics, but in real life, the decision is more subtle. Requiring less variables mean less tests have to be carried out to get the required data. Less tests mean lower cost, and lower cost mean more people can carry out the test. If we are comfortable with the difference, a screening program using model.step might give better cost efficiency and greater impact overall.

Checking assumptions

Regression models are built 3 assumptions :

Multicollinearity
Independence of Observations
Linearity of predictor and log odds

It's time to check if the model has validly

Multicollinearity

We can check the multicollinearity using the Variable Inflation Factor. We'll start with model.all

vif(model.all)

##              GVIF Df GVIF^(1/(2*Df))
## age      1.470454  1        1.212623
## sex      1.469001  1        1.212023
## cp       1.688220  3        1.091201
## trestbps 1.259105  1        1.122098
## chol     1.242136  1        1.114512
## fbs      1.158191  1        1.076193
## restecg  1.123712  2        1.029589
## thalach  1.487834  1        1.219768
## exang    1.228857  1        1.108538
## oldpeak  1.614555  1        1.270651
## slope    1.922877  2        1.177573
## ca       1.925609  4        1.085353

From the data, we see that values are all below 4. This shows that there is no indication of multicollinearity between the factors. Now let's try doing the test for model.step

vif(model.step)

##              GVIF Df GVIF^(1/(2*Df))
## sex      1.302285  1        1.141177
## cp       1.644168  3        1.086403
## trestbps 1.104030  1        1.050728
## thalach  1.295745  1        1.138308
## exang    1.201578  1        1.096165
## oldpeak  1.575478  1        1.255180
## slope    1.819738  2        1.161454
## ca       1.583362  4        1.059126

We see that the values are lower in model.step than model.all, although both comes from the same dataset. This means that in relation to all the variables used in each model. the variables used in model.step are more uncorrelated (are less correlated) compared to the variables used in model.all.

Independence of Observation

Independence of observation means that our model does not come from repeated measurements of the same individuals. Although there is no further information about this, for our case, we can assume that the data does not come from repeated measurements.

Linearity of predictor and log of odds

This assumption states that the predictors are linearly related to log of odds. Note that this assumption is only relevant for numeric predictors.

One test to do this is the Box-Tidwell test. There is a function in the car package to do this, but a search in the internet shows the difficulty in using the function, especially with a large number of variables or variables with zeros <> <> <>. To do the test manually, we run a logistic regression between the predictor with the variable multiplied by its log . If we have any significant values, then the variable is non-linear.

# Transforming data to contain x * log(x) value
data.box_tidwell <- data.scaled %>% 
    select(-c(cp, restecg, sex, exang, slope, ca, fbs)) %>% 
    mutate(
        age = age * log(age),
        trestbps = trestbps * log(trestbps),
        chol = chol * log(chol),
        thalach = thalach * log(thalach),
        oldpeak = oldpeak * log(oldpeak)
    )

# Running the regression
summary(glm(target ~ ., data = data.box_tidwell, family = 'binomial'))

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = data.box_tidwell)
## 
## Deviance Residuals: 
##          7          76          97         162         167         203  
##  8.641e-06   2.110e-08   2.110e-08   6.995e-06  -2.110e-08  -1.298e-06  
##        207         220         222         237         246  
## -9.427e-06  -2.110e-08  -5.455e-06  -2.110e-08  -4.936e-06  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -69.52  245451.94       0        1
## age             152.73 1411719.25       0        1
## trestbps        -37.02  391406.05       0        1
## chol             10.03  110966.54       0        1
## thalach        -352.59 1817127.29       0        1
## oldpeak         -82.49  232514.50       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.4421e+01  on 10  degrees of freedom
## Residual deviance: 2.6828e-10  on  5  degrees of freedom
##   (291 observations deleted due to missingness)
## AIC: 12
## 
## Number of Fisher Scoring iterations: 25

From our test above, we can see that there is no significant variables, so all our numeric variables are linear with its log of odds.

All our asssumptions pass, and we can savely say that our logistic regression model was used.

Model improvement

Further optimization of our model depends on the use case of the result. To illustrate, let's say that we are a doctor consulting on further treatment for these patients. We would not like to fail in detecting people without heart disease, saying that the patient are negative when they are actually positive (false negative). Therefore, we would actually like a higher recall / sensitivity rate. This can be done by changing our the threshold value. Our threshold value for the previous predictions are 0.5. Changing the threshold can.

Logically, if we want to include people with minor indication of heart disease to be classified as having heart disease, we reduce the threshold. Let's try reducing the threshold to 0.35. We'll use model.all to see the effects of lowering the threshold

threshold <- 0.15

# Prediciing using `model.all`
result.mod <- predict(model.all, data.test)

# Changing prediction values into classes
pred.mod <- as.factor(ifelse(result.mod > threshold, 1, 0))

# Making confusion matrix
cm.mod <- confusionMatrix(pred.mod, data.test$target, positive ='1')
cm.mod

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 36  5
##          1  3 17
##                                           
##                Accuracy : 0.8689          
##                  95% CI : (0.7578, 0.9416)
##     No Information Rate : 0.6393          
##     P-Value [Acc > NIR] : 5.692e-05       
##                                           
##                   Kappa : 0.7099          
##                                           
##  Mcnemar's Test P-Value : 0.7237          
##                                           
##             Sensitivity : 0.7727          
##             Specificity : 0.9231          
##          Pos Pred Value : 0.8500          
##          Neg Pred Value : 0.8780          
##              Prevalence : 0.3607          
##          Detection Rate : 0.2787          
##    Detection Prevalence : 0.3279          
##       Balanced Accuracy : 0.8479          
##                                           
##        'Positive' Class : 1               
##

Even though we have reduced our threshold to 0.15, we have only improved the sensitivity just a little bit. This means our model is quite good at classifying the results.

Model interpretation

The logistic regression model is an interpretable model. We can see how each variable contribute to the probability of the outcome. We do this by looking at the coefficients of teh trained model. However, let's see the variables from the unscaled data. Training using scaled data

interpret <- summary(glm(target ~ ., data = data.no_outliers, family = 'binomial'))
interpret

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = data.no_outliers)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7026  -0.3649   0.1243   0.4514   2.9624  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.941757   2.785813   1.056 0.290978    
## age          0.025709   0.024618   1.044 0.296341    
## sex1        -2.271741   0.505226  -4.496 6.91e-06 ***
## cp1          1.133724   0.566393   2.002 0.045322 *  
## cp2          1.978539   0.491833   4.023 5.75e-05 ***
## cp3          2.377886   0.685129   3.471 0.000519 ***
## trestbps    -0.027709   0.011491  -2.411 0.015893 *  
## chol        -0.005619   0.004297  -1.308 0.190993    
## fbs1         0.520819   0.553116   0.942 0.346392    
## restecg1     0.307805   0.380319   0.809 0.418323    
## restecg2    -0.546931   2.660130  -0.206 0.837101    
## thalach      0.020563   0.011323   1.816 0.069364 .  
## exang1      -0.963128   0.438853  -2.195 0.028189 *  
## oldpeak     -0.393860   0.223790  -1.760 0.078415 .  
## slope1      -0.827285   0.838000  -0.987 0.323538    
## slope2       0.693445   0.920303   0.753 0.451152    
## ca1         -2.263875   0.496646  -4.558 5.16e-06 ***
## ca2         -3.287369   0.755865  -4.349 1.37e-05 ***
## ca3         -2.384297   0.886537  -2.689 0.007157 ** 
## ca4          0.497427   1.678673   0.296 0.766985    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 416.42  on 301  degrees of freedom
## Residual deviance: 193.29  on 282  degrees of freedom
## AIC: 233.29
## 
## Number of Fisher Scoring iterations: 6

The estimated coefficient values for each term is a log of odds, showing how much the probability of an outcome changes as each variable change. The interpretation for numeric and categoric values is different. For numeric variables, the log of odd shows how much the probability change as the value of the predictor increase. For a categorical variable, the log of odd shows how much the probability changes if the observation falls into that category.

To make it easier, let's take interpret a numeric variable, and a categorical one.

First, let's look at the numeric variable trestbps. Among the other numeric predictors, this variable has the highest significance. Its log of odds value is -0.027709. To interpret this value, we have to turn it into odds by exponentiating it

odds.trestbps <- exp(interpret$coefficients['trestbps', 1])
odds.trestbps

## [1] 0.9726713

This means that according to our model, an increase of 1 value in the resting blood pressure (trestbps) causes the patient to be 97.27 more likely to be classified as having heart disease, with all other conditions the same.

Now, let's look at the impact of sex to probability of heart disease. The variable has three stars (***), meaning according to our model, it significantly affects the occurrence of heart disease. We have 2 values for sex : 0 for female and 1 for male. For categoric variables in regression, R automatically creates dummy variables and remove the base class (category 0). That is why we only have 1 value of sex, sex1, which is a variable for sex of class 1 (male). The variable sex1 has a log of odds value of -2.2717415.

The odds of the intercept is

odds.sex1 <- exp(interpret$coefficients['sex1', 1])
odds.sex1

## [1] 0.1031324

This means that according to our model, being male (sex class 1) causes the patient to be 10.31 more likely to be classified as having heart disease, with all other conditions the same.

K-nearest neighbor

The K-nearest neighbor method plots each data point in the data space, and then places the test data in the space. For each test data, it then polls k-number of the test data's nearest neighbors (using euclidean distance) to find out which class the data falls into. The class of the test data is decided by majority vote. If there are a larger number of data in class A, the test data is most likely class A, and will be classified as so. Hence, it is important to choose the right number of K to ensure a clean vote is obtained. This generally means that K should be of the right size, and an odd number.

Using K-nearest neighbor method with categorical data is not recommended because when placed along a continuous spectrum, each category in the categorical data are represented by an integer. Hence, there is only a few points where data could be placed and they are "clumped" together, not giving much resolving power. However, we'll see if this is true.

Our train data set has 241 values. Usually, the value of k is taken from the square root of the number of data points.

k <- round(sqrt(nrow(data.train)))
k

## [1] 16

It is not recommended to use even number of k value, because during voting, the number of votes can be the same and the result becomes inconclusive. Therefore, let's reduce k by 1.

k <- k - 1

Now, let's run the analysis and make predictions

# Splitting target and predictors, excluding categorical variables
train_x <- data.train %>% select_if(is.numeric)
train_y <- data.train$target
test_x <- data.test %>% select_if(is.numeric)

# Using KNN
pred.knn <- knn(train = train_x, test = test_x, cl = train_y, k = k)

# Evaluting the data
cm.knn <- confusionMatrix(as.factor(pred.knn), data.test$target)
cm.knn

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 22  4
##          1 17 18
##                                           
##                Accuracy : 0.6557          
##                  95% CI : (0.5231, 0.7727)
##     No Information Rate : 0.6393          
##     P-Value [Acc > NIR] : 0.451905        
##                                           
##                   Kappa : 0.3387          
##                                           
##  Mcnemar's Test P-Value : 0.008829        
##                                           
##             Sensitivity : 0.5641          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.8462          
##          Neg Pred Value : 0.5143          
##              Prevalence : 0.6393          
##          Detection Rate : 0.3607          
##    Detection Prevalence : 0.4262          
##       Balanced Accuracy : 0.6911          
##                                           
##        'Positive' Class : 0               
##

Our analysis using KNN has an accuracy of 0.6557377. This is pretty good.

Model improvement

To improve our model, let's try to include the categorical variables in our model.

# Splitting target and predictors, including categorical variables
train_x.fct <- data.train
train_y <- data.train$target
test_x.fct <- data.test

# Using KNN
pred.knn <- knn(train = train_x.fct, test = test_x.fct, cl = train_y, k = k)

# Evaluating the data
cm.knn.fct <- confusionMatrix(as.factor(pred.knn), data.test$target)
cm.knn.fct

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 30  2
##          1  9 20
##                                           
##                Accuracy : 0.8197          
##                  95% CI : (0.7002, 0.9064)
##     No Information Rate : 0.6393          
##     P-Value [Acc > NIR] : 0.00170         
##                                           
##                   Kappa : 0.6343          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.7692          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.9375          
##          Neg Pred Value : 0.6897          
##              Prevalence : 0.6393          
##          Detection Rate : 0.4918          
##    Detection Prevalence : 0.5246          
##       Balanced Accuracy : 0.8392          
##                                           
##        'Positive' Class : 0               
##

Our analysis using the data with factors included has a much better accuracy of 0.8196721. This is comparable to using model.step. By omitting the categorical variables in the model using only numeric variables, we might have omitted information which might helped in classification. That is why this model has better accuracy.

Conclusion

In this proejet, we have successfully created logistic regression and k-NN models to classify make predictions whether a person has heart disease or not. We can also see how the effect of each factor can be deducted by looking at the coefficients of the logistic regression model.

We have found out that a model that we built, model.step, tradesexplor off a little bit of accuracy with almost 25% reduction in requirements. This model might be more suitable to be used in real life scenario than model.all that uses all variables. This shows the considerations when deploying a model in real life.

Aside from logistic regression, we have also implemented k-neural network model. Unlike logistic regression, this model can not be interpreted. It's advantage is that it is less demanding to run, although more sensitive to noise. Before doing a k-NN analysis, the data should . It is recommended to use k-NN with numeric variables, although we have seen that addition of categoric variabels improve the prediction result. This maybe due to loss of information by omitting the categorical variables.

References

StackOverflow. What R function can be used instead of box.tidewell() when a lot of predictors need to be transformed?. Accessed 18 Aug 2020 from https://stackoverflow.com/questions/42533582/what-r-function-can-be-used-instead-of-box-tidewell-when-a-lot-of-predictors-n
StackOverflow. How to use the Box-Tidwell function with a logistic regression in R. Accessed 18 Aug 2020 from https://stackoverflow.com/questions/56350546/how-to-use-the-box-tidwell-function-with-a-logistic-regression-in-r
Statistical Associates Publishers. Logistic Regression: 10 Worst Pitfalls and Mistakes. Accessed 18 Aug 2020 from http://www.statisticalassociates.com/logistic10.htm

Predicting Heart Disease

Anugerah Erlaut

8/10/2020

Introduction

Data Preprocessing

Data wrangling & exploration

Class balance

Outliers

Perfect separator

Scaling

Data splitting

Logistic regression model

Minimal model

Maximal model

Stepwise regression

Model evaluation

Checking assumptions

Multicollinearity

Independence of Observation

Linearity of predictor and log of odds

Model improvement

Model interpretation

K-nearest neighbor

Model improvement

Conclusion

References

age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	thal	target
63	1	3	145	233	1	0	150	0	2.3	0	1	1
37	1	2	130	250	0	1	187	0	3.5	0	2	1
41	0	1	130	204	0	0	172	0	1.4	2	2	1
56	1	1	120	236	0	1	178	0	0.8	2	2	1
57	0	0	120	354	0	1	163	1	0.6	2	2	1
57	1	0	140	192	0	1	148	0	0.4	1	1	1

age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	target
63	1	3	145	233	1	0	150	0	2.3	0	1
37	1	2	130	250	0	1	187	0	3.5	0	1
41	0	1	130	204	0	0	172	0	1.4	2	1
56	1	1	120	236	0	1	178	0	0.8	2	1
57	0	0	120	354	0	1	163	1	0.6	2	1
57	1	0	140	192	0	1	148	0	0.4	1	1

age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	thal	target
63	1	3	145	233	1	0	150	0	2.3	0	1	1
37	1	2	130	250	0	1	187	0	3.5	0	2	1
41	0	1	130	204	0	0	172	0	1.4	2	2	1
56	1	1	120	236	0	1	178	0	0.8	2	2	1
57	0	0	120	354	0	1	163	1	0.6	2	2	1
57	1	0	140	192	0	1	148	0	0.4	1	1	1

age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	target
63	1	3	145	233	1	0	150	0	2.3	0	1
37	1	2	130	250	0	1	187	0	3.5	0	1
41	0	1	130	204	0	0	172	0	1.4	2	1
56	1	1	120	236	0	1	178	0	0.8	2	1
57	0	0	120	354	0	1	163	1	0.6	2	1
57	1	0	140	192	0	1	148	0	0.4	1	1

age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	thal	target
63	1	3	145	233	1	0	150	0	2.3	0	1	1
37	1	2	130	250	0	1	187	0	3.5	0	2	1
41	0	1	130	204	0	0	172	0	1.4	2	2	1
56	1	1	120	236	0	1	178	0	0.8	2	2	1
57	0	0	120	354	0	1	163	1	0.6	2	2	1
57	1	0	140	192	0	1	148	0	0.4	1	1	1

age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	target
63	1	3	145	233	1	0	150	0	2.3	0	1
37	1	2	130	250	0	1	187	0	3.5	0	1
41	0	1	130	204	0	0	172	0	1.4	2	1
56	1	1	120	236	0	1	178	0	0.8	2	1
57	0	0	120	354	0	1	163	1	0.6	2	1
57	1	0	140	192	0	1	148	0	0.4	1	1