Part 1.Introduction

The term “heart disease” refers to several types of heart conditions. The most common type of heart disease in the United States is coronary artery disease (CAD), which affects the blood flow to the heart. Decreased blood flow can cause a heart attack.

Sometimes heart disease may be “silent” and not diagnosed until a person experiences signs or symptoms of a heart attack, heart failure, or an arrhythmia. When these events happen, symptoms may include : Heart attack: Chest pain or discomfort, upper back or neck pain, indigestion, heartburn, nausea or vomiting, extreme fatigue, upper body discomfort, dizziness, and shortness of breath. Arrhythmia: Fluttering feelings in the chest (palpitations). Heart failure: Shortness of breath, fatigue, or swelling of the feet, ankles, legs, abdomen, or neck veins.

Part 2.Project Goal

This project will mainly focus on creating a Classification Machine Learning System using Heart Disease dataset to predict if a person have a Heart Disease or not. The problem is aimed to provide relevant information for classifying the patients according to their heart disease status, which would help the physicians make accurate prediction in advance and prescribe the necessary medication in time.

This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments refer to using a subset of 14 of them. The “target” field refers to the presence of heart disease in the patient. It is integer valued 0 = no disease and 1 = disease.

Attribute Information in this dataset are :

age sex chest pain type (4 values) resting blood pressure serum cholestoral in mg/dl fasting blood sugar > 120 mg/dl resting electrocardiographic results (values 0,1,2) maximum heart rate achieved exercise induced angina oldpeak = ST depression induced by exercise relative to rest the slope of the peak exercise ST segment number of major vessels (0-3) colored by flourosopy thal: 0 = normal; 1 = fixed defect; 2 = reversable defect The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

The table includes a column that shows the definition if we think of the proportions as probabilities.

Measure of Name 1 Name 2 Definition Probability representation
sensitivity TPR Recall \(\frac{\mbox{TP}}{\mbox{TP} + \mbox{FN}}\) \(\mbox{Pr}(\hat{Y}=1 \mid Y=1)\)
specificity TNR 1-FPR \(\frac{\mbox{TN}}{\mbox{TN}+\mbox{FP}}\) \(\mbox{Pr}(\hat{Y}=0 \mid Y=0)\)
specificity PPV Precision \(\frac{\mbox{TP}}{\mbox{TP}+\mbox{FP}}\) \(\mbox{Pr}(Y=1 \mid \hat{Y}=1)\)

Part 3.Load Requirement Packages

Load all packages dan all libraries into RStudio

Part 4.Data exploration and visualization

The dataset is a data table made of 14 variables (columns) and a total of 1025 observations (rows).

Each row represents a person with heart disease or not…

Number of NA into the dataset:

##                     age                     sex         chest_pain_type 
##                       0                       0                       0 
##  resting_blood_pressure             cholesterol     fasting_blood_sugar 
##                       0                       0                       0 
##                rest_ecg max_heart_rate_achieved exercise_induced_angina 
##                       0                       0                       0 
##           st_depression                st_slope       num_major_vessels 
##                       0                       0                       0 
##    thallium_stress_test                 disease 
##                       0                       0

There are no missing values (NA).

Below are summary and plot distribution for each variable:

##       age             sex         chest_pain_type  resting_blood_pressure
##  Min.   :29.00   Min.   :0.0000   Min.   :0.0000   Min.   : 94.0         
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:120.0         
##  Median :56.00   Median :1.0000   Median :1.0000   Median :130.0         
##  Mean   :54.43   Mean   :0.6956   Mean   :0.9424   Mean   :131.6         
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:140.0         
##  Max.   :77.00   Max.   :1.0000   Max.   :3.0000   Max.   :200.0         
##   cholesterol  fasting_blood_sugar    rest_ecg      max_heart_rate_achieved
##  Min.   :126   Min.   :0.0000      Min.   :0.0000   Min.   : 71.0          
##  1st Qu.:211   1st Qu.:0.0000      1st Qu.:0.0000   1st Qu.:132.0          
##  Median :240   Median :0.0000      Median :1.0000   Median :152.0          
##  Mean   :246   Mean   :0.1493      Mean   :0.5298   Mean   :149.1          
##  3rd Qu.:275   3rd Qu.:0.0000      3rd Qu.:1.0000   3rd Qu.:166.0          
##  Max.   :564   Max.   :1.0000      Max.   :2.0000   Max.   :202.0          
##  exercise_induced_angina st_depression      st_slope     num_major_vessels
##  Min.   :0.0000          Min.   :0.000   Min.   :0.000   Min.   :0.0000   
##  1st Qu.:0.0000          1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000   
##  Median :0.0000          Median :0.800   Median :1.000   Median :0.0000   
##  Mean   :0.3366          Mean   :1.072   Mean   :1.385   Mean   :0.7541   
##  3rd Qu.:1.0000          3rd Qu.:1.800   3rd Qu.:2.000   3rd Qu.:1.0000   
##  Max.   :1.0000          Max.   :6.200   Max.   :2.000   Max.   :4.0000   
##  thallium_stress_test    disease      
##  Min.   :0.000        Min.   :0.0000  
##  1st Qu.:2.000        1st Qu.:0.0000  
##  Median :2.000        Median :1.0000  
##  Mean   :2.324        Mean   :0.5132  
##  3rd Qu.:3.000        3rd Qu.:1.0000  
##  Max.   :3.000        Max.   :1.0000

Above, we can see several variables with normal distribution, others are binary categorical (0, 1), and the rest with several categorical levels.

Let´s see the proportion and distribution of “disease”(heart disease) variable:

Note: “0” = No disease , “1” = Have disease

## 
##    0    1 
## 0.49 0.51

This means NO heart disease present in 49% of patients and the remaining 51% HAVE heart disease.

Distribution of Heart Disease per Age(quantity):

Above, we see that people with 54 years old have more heart disease.

Above, we can see another type of distribution on this dataset of Heart Disease per Age(percentual, cyan = heart disease, red = no heart disease).

We have to makes sure that the predictor variables are independent from each others, so we check correlation between predictor, if we have values near to 1, we have to discard that variable.

As we can see, in our data, the variables are all independent between them, so we won’t discard any predictor (< 0.43). According to this we are good to to procces and modeling dataset.

Part 5.Pre Data Processing

Principal Component Analysis(PCA)

We can get variable importance without using a predictive model using information theory, ordered from highest to lowest:

Note:

**en**: entropy measured in bits  

**mi**: mutual information  

**ig**: information gain  

**gr**: gain ratio (is the most important metric here, ranged from 0 to 1,
        with higher being better)   

According to the plot, “thallium_stress_test” is the most important variable in this dataset.

Now, we transform numeric “disease” variable as a factor, because confusionMatrix() function needs work with factors of the same length,

# converts disease as factor "0" or "1" to use with confusionMatrix()

Hearts <- mutate_at(Hearts, vars(disease), as.factor) 

str(Hearts$disease)
##  Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...

Then, “disease” column is removed, then we scale and center the data. Compute PCA,

# now calculate PCA removing "disease" with correlation "1"

set.seed(1)
pca <- prcomp(Hearts %>% select(-disease), scale = TRUE, center = TRUE)

str(pca)
## List of 5
##  $ sdev    : num [1:13] 1.67 1.25 1.1 1.08 1 ...
##  $ rotation: num [1:13, 1:13] 0.3096 0.0781 -0.2856 0.1788 0.128 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:13] "age" "sex" "chest_pain_type" "resting_blood_pressure" ...
##   .. ..$ : chr [1:13] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:13] 54.434 0.696 0.942 131.612 246 ...
##   ..- attr(*, "names")= chr [1:13] "age" "sex" "chest_pain_type" "resting_blood_pressure" ...
##  $ scale   : Named num [1:13] 9.07 0.46 1.03 17.52 51.59 ...
##   ..- attr(*, "names")= chr [1:13] "age" "sex" "chest_pain_type" "resting_blood_pressure" ...
##  $ x       : num [1:1025, 1:13] -0.522 2.589 3.041 -0.492 2.186 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:13] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5    PC6     PC7
## Standard deviation     1.6668 1.2475 1.09590 1.08131 0.99959 0.9861 0.93618
## Proportion of Variance 0.2137 0.1197 0.09238 0.08994 0.07686 0.0748 0.06742
## Cumulative Proportion  0.2137 0.3334 0.42580 0.51574 0.59260 0.6674 0.73482
##                            PC8     PC9   PC10    PC11   PC12    PC13
## Standard deviation     0.87635 0.85202 0.7940 0.72267 0.6570 0.60754
## Proportion of Variance 0.05908 0.05584 0.0485 0.04017 0.0332 0.02839
## Cumulative Proportion  0.79389 0.84973 0.8982 0.93840 0.9716 1.00000
# Compute the proportion of variance explained(PVE)

pve_Hearts <- pca$sdev^2 / sum(pca$sdev^2)
cum_pve <- cumsum(pve_Hearts)     # Cummulative percent explained
pve_table <- tibble(comp = seq(1:ncol(Hearts %>% select(-disease))),
                    pve_Hearts, cum_pve)

# plot components vs PVE to check if there are correlations

ggplot(pve_table, aes(x = comp, y = cum_pve)) + 
  geom_point() + 
  geom_abline(intercept = 0.95, color = "red",
              slope = 0)   # line at 95% of cum_pve

As we see, the above plot shows that 95% of the variance is explained with almost all principal components(13 PCs).

create edx and validation datasets:

#  Spliting Hearts dataset in edx and validation sets

set.seed(1) # if you are using R 3.5 or Microsoft R Open 3.5.3
# set.seed(1, sample.kind="Rounding") if using R 3.5.3 or later

# Validation set will be 20% of Hearts data because it is a little dataset

test_index <- createDataPartition(y = Hearts$disease,
                                  times = 1, p = 0.2, list = FALSE) 
edx <- Hearts[-test_index,]
validation <- Hearts[test_index,] # we will use it only to do final test

We will split edx data into train_set and test_set.

#  Spliting edx dataset in train_set and test_set

set.seed(1) # if you are using R 3.5 or Microsoft R open 3.5.3
# set.seed(1, sample.kind="Rounding") # if using R 3.5.3 or later

test_index <- createDataPartition(y = edx$disease,
                                  times = 1, p = 0.2, 
                                  list = FALSE)  # test_set 20%
train_set <- edx[-test_index,]
test_set <- edx[test_index,]

Part 6.Model Building

We´ll construct a function to apply several models and choose the one with the smallest difference between its values with different datasets, reducing the variation of prediction.

To start, we choose several models of classification:

glm: Generalized Linear Model

lda: Linear Discriminant Analysis

naive_bayes: Naive Bayes

svmLinear: Support Vector Machines with Linear Kernel

gamLoess: Generalized Additive Model using LOESS

qda: Quadratic Discriminant Analysis

knn: k-Nearest Neighbors

kknn: Other k-Nearest Neighbors

gam: Generalized Additive Model using Splines

rf: Random Forest

ranger: Other Random Forest

wsrf: Weighted Subspace Random Forest

mlp: Multi-Layer Perceptron Neural Network

# model list, it can be any model you want to test,
# just be sure to load the library that contains it

models <- c("glm", "lda", "naive_bayes", "svmLinear",
            "gamLoess", "qda", "knn", "kknn",
            "gam", "rf", "ranger", "wsrf", "mlp")

# trainControl function for control tuning parameters models

control <- trainControl(method = "cv",   # cross validation
                        number = 10,     # 10 k-folds or number 
                                  # of resampling iterations
                        repeats = 5)

Here are the results of differents models on 2 differents datasets: first train/test_set, and edx/validation later:

# initializing variables

data_train <- train_set       # first value for data parameter
data_test <-  test_set        # first we´ll use train and test dataset 
true_value <- test_set$disease # true outcome from test_set 

# loop to use train and test set first and edx and validation later

for(i in 1:2) {       
  fits <- lapply(models, function(model){ 
#    print(model)  # it´s used to debug code
    set.seed(1)
    train(disease ~ .,
          method = model,
          preProcess=c("center", "scale"),   # to normalize the data
          data = data_train,
          trControl = control)
  })

  names(fits) <- models

  # to be sure that the actual value of the output
  # has not influence on the prediction

  vali2 <- data_test %>% select(-disease) 

  pred <- sapply(fits, function(object) # predicting outcome
      predict(object, newdata = vali2))

  # avg predicted values if equals to true values
  
  if (i == 1) acc <- colMeans(pred == true_value)
  
  data_train <- edx               # last value for data parameter
  data_test <-  validation        # last we´ll use edx and validation
  true_value <- validation$disease  # true outcome from validation set

}

acc    # all accuracy values with first dataset. Train and Test set
##         glm         lda naive_bayes   svmLinear    gamLoess         qda 
##   0.8292683   0.8475610   0.8475610   0.8536585   0.8536585   0.8109756 
##         knn        kknn         gam          rf      ranger        wsrf 
##   0.8414634   1.0000000   0.8597561   0.9939024   0.9939024   0.9939024 
##         mlp 
##   0.9207317
acc2 <- colMeans(pred == true_value) # avg predicted values

acc2   # all accuracy values with last dataset. Edx and Validation set
##         glm         lda naive_bayes   svmLinear    gamLoess         qda 
##   0.8640777   0.8398058   0.8543689   0.8543689   0.8932039   0.8543689 
##         knn        kknn         gam          rf      ranger        wsrf 
##   0.8543689   0.9708738   0.9029126   0.9854369   0.9854369   0.9854369 
##         mlp 
##   0.9563107

Then we make the difference between the most and the least optimistic models. Here we choose the model with the smallest value(variation between datasets). It´s Ranger Model, other Random Forest model:

results <- acc2 - acc # accuracy diff by model

# show results of difference on the same model for different dataset 

results
##           glm           lda   naive_bayes     svmLinear      gamLoess 
##  0.0348093772 -0.0077551504  0.0068079564  0.0007103955  0.0395453469 
##           qda           knn          kknn           gam            rf 
##  0.0433933223  0.0129055174 -0.0291262136  0.0431565238 -0.0084655458 
##        ranger          wsrf           mlp 
## -0.0084655458 -0.0084655458  0.0355789723

#Compute balanced accuracy, sensitivity, specificity, prevalence with help of confusionMatrix() function with edx/validation datasets,

# Compute balance accuracy, sensitivity, specificity,prevalence with confusionMatrix

cm_validation_hearts<- confusionMatrix(as.factor(pred[,11]),
                                       validation$disease, positive = "1")
cm_validation_hearts
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 100   3
##          1   0 103
##                                         
##                Accuracy : 0.9854        
##                  95% CI : (0.958, 0.997)
##     No Information Rate : 0.5146        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.9709        
##                                         
##  Mcnemar's Test P-Value : 0.2482        
##                                         
##             Sensitivity : 0.9717        
##             Specificity : 1.0000        
##          Pos Pred Value : 1.0000        
##          Neg Pred Value : 0.9709        
##              Prevalence : 0.5146        
##          Detection Rate : 0.5000        
##    Detection Prevalence : 0.5000        
##       Balanced Accuracy : 0.9858        
##                                         
##        'Positive' Class : 1             
## 

Here we see that **Prevalence = 0.51 (The proportion of individuals with disease) is a good value (balanced).

Sensitivity = 0.97 (true positive rate or recall), Detection Rate = 0.500 and Balanced Accuracy = 0.98 is very good.

Part 7.Random Forest Model

# Final Ranger model algorithm computed with principal
# edx and validation dataset

# to avoid error in confusionMatrix, we convert num 0,1 to No,Yes

levels(edx$disease) <- c("No", "Yes")        
levels(validation$disease) <- c("No", "Yes")

# to be sure that the actual value of the output
# has not influence on the prediction

vali2 <- validation %>% select(-disease)

# trainControl function for control iteration model
# we test differents parameters and choose that ones that improve accuracy

control <- trainControl(method = "cv",   # cross validation
                        number = 30)   # optimum k-folds or number 30
                                       # of resampling iterations

# training Ranger Model

set.seed(1)
ranger_model <- train(disease ~., data = edx,
                     method = "ranger",  # ranger model
                     preProcess=c("center", "scale"),   # to normalize the data
                     trControl = control)

# predicting outcome

prediction_ranger <- predict(ranger_model, newdata = vali2)

# Check results with confusionMatrix() function and validation set

cm_ranger_model <- confusionMatrix(prediction_ranger,
                                   validation$disease, positive = "Yes")
cm_ranger_model
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  100   3
##        Yes   0 103
##                                         
##                Accuracy : 0.9854        
##                  95% CI : (0.958, 0.997)
##     No Information Rate : 0.5146        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.9709        
##                                         
##  Mcnemar's Test P-Value : 0.2482        
##                                         
##             Sensitivity : 0.9717        
##             Specificity : 1.0000        
##          Pos Pred Value : 1.0000        
##          Neg Pred Value : 0.9709        
##              Prevalence : 0.5146        
##          Detection Rate : 0.5000        
##    Detection Prevalence : 0.5000        
##       Balanced Accuracy : 0.9858        
##                                         
##        'Positive' Class : Yes           
## 

Now we got **better Balanced Accuracy = 0.98 , better Detection Rate = 0.5 and better Sensitivity = 0.97, Specificity = 1.

Part 8.Result

These are the results that we were able to achieve with our models, trained with edx and tested with validation datasets:

Model Accuracy
glm 0.8640777
lda 0.8398058
naive_bayes 0.8543689
svmLinear 0.8543689
gamLoess 0.8932039
qda 0.8543689
knn 0.8543689
kknn 0.9708738
gam 0.9029126
rf 0.9854369
ranger 0.9854369
wsrf 0.9854369
mlp 0.9563107

Here is the table of differences in the results of the same models tested with 2 different datasets:

Accuracy difference by model in %
glm 3.4809377
lda -0.7755150
naive_bayes 0.6807956
svmLinear 0.0710395
gamLoess 3.9545347
qda 4.3393322
knn 1.2905517
kknn -2.9126214
gam 4.3156524
rf -0.8465546
ranger -0.8465546
wsrf -0.8465546
mlp 3.5578972

In the barchart above, difference between 2 datasets on the same model, smaller is better. This means that has a better precision outcome stability when there are differents dataset as input.

And here is the Accuracy result from our FINAL Ranger Model:

## [1] "Ranger Model Accuracy: 98.54 %"

Part 9.Summary

In this project by using Heart Disease dataset from the UCI Repository(heart.csv) we succeded build model and prediction with accuracy and sensitivity of 98.54 %.