The term “heart disease” refers to several types of heart conditions. The most common type of heart disease in the United States is coronary artery disease (CAD), which affects the blood flow to the heart. Decreased blood flow can cause a heart attack.
Sometimes heart disease may be “silent” and not diagnosed until a person experiences signs or symptoms of a heart attack, heart failure, or an arrhythmia. When these events happen, symptoms may include: Heart attack: chest pain or discomfort, upper back or neck pain, indigestion, heartburn, nausea or vomiting, extreme fatigue, upper body discomfort, dizziness, and shortness of breath. Arrhythmia: fluttering feelings in the chest (palpitations). Heart failure: shortness of breath, fatigue, or swelling of the feet, ankles, legs, abdomen, or neck veins.
This project focuses on building a classification machine learning system on the Heart Disease dataset to predict whether or not a person has heart disease. The goal is to provide relevant information for classifying patients according to their heart disease status, which would help physicians make accurate predictions in advance and prescribe the necessary medication in time.
This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments refer to using a subset of 14 of them. The “target” field refers to the presence of heart disease in the patient. It is integer valued: 0 = no disease, 1 = disease.
The attributes in this dataset are:

- age
- sex
- chest pain type (4 values)
- resting blood pressure
- serum cholesterol in mg/dl
- fasting blood sugar > 120 mg/dl
- resting electrocardiographic results (values 0, 1, 2)
- maximum heart rate achieved
- exercise induced angina
- oldpeak = ST depression induced by exercise relative to rest
- the slope of the peak exercise ST segment
- number of major vessels (0-3) colored by fluoroscopy
- thal: 0 = normal; 1 = fixed defect; 2 = reversible defect

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.
The table below defines the metrics used to evaluate the classifiers and includes a column that expresses each definition as a conditional probability.
| Measure of | Name 1 | Name 2 | Definition | Probability representation |
|---|---|---|---|---|
| sensitivity | TPR | Recall | \(\frac{\mbox{TP}}{\mbox{TP} + \mbox{FN}}\) | \(\mbox{Pr}(\hat{Y}=1 \mid Y=1)\) |
| specificity | TNR | 1-FPR | \(\frac{\mbox{TN}}{\mbox{TN}+\mbox{FP}}\) | \(\mbox{Pr}(\hat{Y}=0 \mid Y=0)\) |
| precision | PPV | Precision | \(\frac{\mbox{TP}}{\mbox{TP}+\mbox{FP}}\) | \(\mbox{Pr}(Y=1 \mid \hat{Y}=1)\) |
Load all packages and all libraries into RStudio.
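A minimal sketch of the setup, assuming the packages used later in this report (the author's exact package list and the renaming of the raw heart.csv column headers are not shown, so treat this as illustrative):

# install any missing package first, e.g. install.packages("caret")
library(tidyverse) # data wrangling (dplyr) and plots (ggplot2)
library(caret) # train(), trainControl(), confusionMatrix()
library(ranger) # fast Random Forest backend used via caret
# read the dataset; "heart.csv" is the file named in the conclusion
Hearts <- read.csv("heart.csv")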
The dataset is a data table made of 14 variables (columns) and a total of 1025 observations (rows).
Each row represents a patient, with or without heart disease.
Number of NA values in the dataset:
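A count like the one below can be produced with base R (a sketch):

colSums(is.na(Hearts)) # missing values per column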
## age sex chest_pain_type
## 0 0 0
## resting_blood_pressure cholesterol fasting_blood_sugar
## 0 0 0
## rest_ecg max_heart_rate_achieved exercise_induced_angina
## 0 0 0
## st_depression st_slope num_major_vessels
## 0 0 0
## thallium_stress_test disease
## 0 0
There are no missing values (NA).
Below are the summary statistics and distribution plots for each variable:
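The statistics below come from base R's summary():

summary(Hearts) # five-number summary plus mean for every column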
## age sex chest_pain_type resting_blood_pressure
## Min. :29.00 Min. :0.0000 Min. :0.0000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:120.0
## Median :56.00 Median :1.0000 Median :1.0000 Median :130.0
## Mean :54.43 Mean :0.6956 Mean :0.9424 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.0000 Max. :200.0
## cholesterol fasting_blood_sugar rest_ecg max_heart_rate_achieved
## Min. :126 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:132.0
## Median :240 Median :0.0000 Median :1.0000 Median :152.0
## Mean :246 Mean :0.1493 Mean :0.5298 Mean :149.1
## 3rd Qu.:275 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564 Max. :1.0000 Max. :2.0000 Max. :202.0
## exercise_induced_angina st_depression st_slope num_major_vessels
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.800 Median :1.000 Median :0.0000
## Mean :0.3366 Mean :1.072 Mean :1.385 Mean :0.7541
## 3rd Qu.:1.0000 3rd Qu.:1.800 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.200 Max. :2.000 Max. :4.0000
## thallium_stress_test disease
## Min. :0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000
## Mean :2.324 Mean :0.5132
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000
Above, we can see that several variables are roughly normally distributed, others are binary (0/1), and the rest are categorical with several levels.
Let's look at the proportion and distribution of the “disease” (heart disease) variable:
Note: “0” = no disease, “1” = disease
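A sketch of how these proportions can be computed:

round(prop.table(table(Hearts$disease)), 2) # class proportions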
##
## 0 1
## 0.49 0.51
This means that 49% of the patients have NO heart disease and the remaining 51% HAVE heart disease.
Distribution of heart disease per age (counts):
Above, we see that heart disease is most frequent among patients around 54 years old.
Above is another view of the distribution of heart disease per age (as percentages; cyan = heart disease, red = no heart disease).
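A sketch of how such a plot can be drawn with ggplot2 (the exact aesthetics used in the report may differ):

# stacked histogram of disease status per age
ggplot(Hearts, aes(x = age, fill = factor(disease))) +
  geom_histogram(binwidth = 1, position = "stack") +
  labs(fill = "disease")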
We have to make sure that the predictor variables are independent of each other, so we check the correlations between predictors; if any pair has a correlation near 1, we should discard one of the two variables.
As we can see, in our data all pairwise correlations are low (< 0.43), so the predictors are sufficiently independent and we won't discard any of them. According to this, we are good to proceed to processing and modeling the dataset.
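A sketch of the correlation check; cor() works here because every column is still numeric at this point:

# pairwise correlations between predictors (outcome excluded)
cor_matrix <- cor(Hearts %>% select(-disease))
# largest absolute correlation between two distinct predictors
max(abs(cor_matrix[upper.tri(cor_matrix)])) # about 0.43 in this data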
Principal Component Analysis (PCA)
Using information theory, we can rank the variables by importance without fitting a predictive model, ordered from highest to lowest:
Note:

- **en**: entropy, measured in bits
- **mi**: mutual information
- **ig**: information gain
- **gr**: gain ratio (the most important metric here, ranging from 0 to 1, with higher being better)
According to the plot, “thallium_stress_test” is the most important variable in this dataset.
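A ranking like this can be produced with the funModeling package (an assumption, since the report does not show the call):

library(funModeling)
# entropy (en), mutual information (mi), information gain (ig),
# and gain ratio (gr) of each variable with respect to the outcome
importance <- var_rank_info(Hearts, "disease")
# bar chart of gain ratio, highest first
ggplot(importance, aes(x = reorder(var, gr), y = gr)) +
  geom_col() + coord_flip() +
  labs(x = "variable", y = "gain ratio (gr)")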
Now we convert the numeric “disease” variable into a factor, because the confusionMatrix() function requires factors (with matching levels) for the prediction and reference vectors:
# converts disease as factor "0" or "1" to use with confusionMatrix()
Hearts <- mutate_at(Hearts, vars(disease), as.factor)
str(Hearts$disease)
## Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
Then the “disease” column is removed, and we scale and center the data while computing the PCA:
# compute PCA on the predictors only, removing the outcome "disease"
set.seed(1)
pca <- prcomp(Hearts %>% select(-disease), scale = TRUE, center = TRUE)
str(pca)
## List of 5
## $ sdev : num [1:13] 1.67 1.25 1.1 1.08 1 ...
## $ rotation: num [1:13, 1:13] 0.3096 0.0781 -0.2856 0.1788 0.128 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:13] "age" "sex" "chest_pain_type" "resting_blood_pressure" ...
## .. ..$ : chr [1:13] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:13] 54.434 0.696 0.942 131.612 246 ...
## ..- attr(*, "names")= chr [1:13] "age" "sex" "chest_pain_type" "resting_blood_pressure" ...
## $ scale : Named num [1:13] 9.07 0.46 1.03 17.52 51.59 ...
## ..- attr(*, "names")= chr [1:13] "age" "sex" "chest_pain_type" "resting_blood_pressure" ...
## $ x : num [1:1025, 1:13] -0.522 2.589 3.041 -0.492 2.186 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:13] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.6668 1.2475 1.09590 1.08131 0.99959 0.9861 0.93618
## Proportion of Variance 0.2137 0.1197 0.09238 0.08994 0.07686 0.0748 0.06742
## Cumulative Proportion 0.2137 0.3334 0.42580 0.51574 0.59260 0.6674 0.73482
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.87635 0.85202 0.7940 0.72267 0.6570 0.60754
## Proportion of Variance 0.05908 0.05584 0.0485 0.04017 0.0332 0.02839
## Cumulative Proportion 0.79389 0.84973 0.8982 0.93840 0.9716 1.00000
# Compute the proportion of variance explained(PVE)
pve_Hearts <- pca$sdev^2 / sum(pca$sdev^2)
cum_pve <- cumsum(pve_Hearts) # Cummulative percent explained
pve_table <- tibble(comp = seq(1:ncol(Hearts %>% select(-disease))),
pve_Hearts, cum_pve)
# plot the number of components vs the cumulative PVE
ggplot(pve_table, aes(x = comp, y = cum_pve)) +
  geom_point() +
  geom_hline(yintercept = 0.95, color = "red") # line at 95% of cum_pve
As the plot above shows, reaching 95% of the variance requires almost all of the principal components (12 of the 13 PCs), so PCA offers little dimensionality reduction here.
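The exact number of components can also be read off programmatically:

which(cum_pve >= 0.95)[1] # first PC reaching 95% of the variance: PC12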
Create the edx and validation datasets:
# Spliting Hearts dataset in edx and validation sets
set.seed(1) # if using R 3.5 or earlier
# set.seed(1, sample.kind = "Rounding") # if using R 3.6 or later
# The validation set will be 20% of the Hearts data because the dataset is small
test_index <- createDataPartition(y = Hearts$disease,
times = 1, p = 0.2, list = FALSE)
edx <- Hearts[-test_index,]
validation <- Hearts[test_index,] # we will use it only to do final test
We will split the edx data into train_set and test_set.
# Spliting edx dataset in train_set and test_set
set.seed(1) # if using R 3.5 or earlier
# set.seed(1, sample.kind = "Rounding") # if using R 3.6 or later
test_index <- createDataPartition(y = edx$disease,
times = 1, p = 0.2,
list = FALSE) # test_set 20%
train_set <- edx[-test_index,]
test_set <- edx[test_index,]
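As a sanity check (a sketch), createDataPartition() samples within each class, so the ~0.49/0.51 class balance should be preserved in every split:

round(prop.table(table(train_set$disease)), 2) # ~0.49 / 0.51
round(prop.table(table(test_set$disease)), 2) # ~0.49 / 0.51
round(prop.table(table(validation$disease)), 2) # ~0.49 / 0.51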
We'll construct a loop to apply several models and choose the one with the smallest difference in accuracy between the two dataset pairs, i.e., the model whose predictions vary least across datasets.
To start, we choose several models of classification:
glm: Generalized Linear Model
lda: Linear Discriminant Analysis
naive_bayes: Naive Bayes
svmLinear: Support Vector Machines with Linear Kernel
gamLoess: Generalized Additive Model using LOESS
qda: Quadratic Discriminant Analysis
knn: k-Nearest Neighbors
kknn: Weighted k-Nearest Neighbors
gam: Generalized Additive Model using Splines
rf: Random Forest
ranger: Random Forest (a fast alternative implementation)
wsrf: Weighted Subspace Random Forest
mlp: Multi-Layer Perceptron Neural Network
# model list, it can be any model you want to test,
# just be sure to load the library that contains it
models <- c("glm", "lda", "naive_bayes", "svmLinear",
"gamLoess", "qda", "knn", "kknn",
"gam", "rf", "ranger", "wsrf", "mlp")
# trainControl function for the model-comparison loop
control <- trainControl(method = "cv", # cross-validation
                        number = 10) # 10 folds
# note: a repeats argument only applies when method = "repeatedcv"
Here are the results of the different models on the two dataset pairs: first train_set/test_set, then edx/validation:
# initializing variables
data_train <- train_set # first value for the data parameter
data_test <- test_set # first pass uses the train and test sets
true_value <- test_set$disease # true outcomes from test_set
# loop to use train and test set first and edx and validation later
for(i in 1:2) {
fits <- lapply(models, function(model){
# print(model) # used to debug the code
set.seed(1)
train(disease ~ .,
method = model,
preProcess=c("center", "scale"), # to normalize the data
data = data_train,
trControl = control)
})
names(fits) <- models
# to be sure that the actual value of the outcome
# has no influence on the prediction
vali2 <- data_test %>% select(-disease)
pred <- sapply(fits, function(object) # predicting outcome
predict(object, newdata = vali2))
# proportion of predictions equal to the true values
if (i == 1) acc <- colMeans(pred == true_value)
data_train <- edx # second pass: train on edx
data_test <- validation # and test on validation
true_value <- validation$disease # true outcomes from the validation set
}
acc # accuracies from the first pass (train_set / test_set)
## glm lda naive_bayes svmLinear gamLoess qda
## 0.8292683 0.8475610 0.8475610 0.8536585 0.8536585 0.8109756
## knn kknn gam rf ranger wsrf
## 0.8414634 1.0000000 0.8597561 0.9939024 0.9939024 0.9939024
## mlp
## 0.9207317
acc2 <- colMeans(pred == true_value) # accuracies on the validation set
acc2 # accuracies from the second pass (edx / validation)
## glm lda naive_bayes svmLinear gamLoess qda
## 0.8640777 0.8398058 0.8543689 0.8543689 0.8932039 0.8543689
## knn kknn gam rf ranger wsrf
## 0.8543689 0.9708738 0.9029126 0.9854369 0.9854369 0.9854369
## mlp
## 0.9563107
Then we compute, for each model, the difference between the accuracies obtained on the two dataset pairs, and choose a model that combines high accuracy with a small difference (i.e., stable predictions across datasets). That model is ranger, the alternative Random Forest implementation:
results <- acc2 - acc # accuracy diff by model
# show, for each model, the accuracy difference between the two dataset pairs
results
## glm lda naive_bayes svmLinear gamLoess
## 0.0348093772 -0.0077551504 0.0068079564 0.0007103955 0.0395453469
## qda knn kknn gam rf
## 0.0433933223 0.0129055174 -0.0291262136 0.0431565238 -0.0084655458
## ranger wsrf mlp
## -0.0084655458 -0.0084655458 0.0355789723
Next we compute balanced accuracy, sensitivity, specificity, and prevalence with the help of the confusionMatrix() function on the edx/validation predictions:
# compute balanced accuracy, sensitivity, specificity, prevalence
# pred[, 11] holds the ranger predictions (the 11th model in the list)
cm_validation_hearts <- confusionMatrix(as.factor(pred[, 11]),
                                        validation$disease, positive = "1")
cm_validation_hearts
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 100 3
## 1 0 103
##
## Accuracy : 0.9854
## 95% CI : (0.958, 0.997)
## No Information Rate : 0.5146
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9709
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 0.9717
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9709
## Prevalence : 0.5146
## Detection Rate : 0.5000
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9858
##
## 'Positive' Class : 1
##
Here we see that Prevalence = 0.51 (the proportion of individuals with disease), which indicates a balanced dataset. Sensitivity = 0.97 (the true positive rate, or recall), Detection Rate = 0.50, and Balanced Accuracy = 0.986 are all very good.
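These figures can be verified by hand from the confusion matrix above (TP = 103, FN = 3, TN = 100, FP = 0), or extracted from the caret object:

103 / (103 + 3) # sensitivity = TP / (TP + FN) = 0.9717
100 / (100 + 0) # specificity = TN / (TN + FP) = 1.0000
(0.9717 + 1) / 2 # balanced accuracy = 0.9858
# the same values live inside the confusionMatrix object
cm_validation_hearts$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]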
# Final ranger model, trained on the edx set and
# tested on the validation set
# to avoid error in confusionMatrix, we convert num 0,1 to No,Yes
levels(edx$disease) <- c("No", "Yes")
levels(validation$disease) <- c("No", "Yes")
# to be sure that the actual value of the outcome
# has no influence on the prediction
vali2 <- validation %>% select(-disease)
# trainControl for the final model's resampling
# we tested different parameter values and kept those that improved accuracy
control <- trainControl(method = "cv", # cross-validation
                        number = 30) # 30 folds gave the best accuracy
# training Ranger Model
set.seed(1)
ranger_model <- train(disease ~., data = edx,
method = "ranger", # ranger model
preProcess=c("center", "scale"), # to normalize the data
trControl = control)
# predicting outcome
prediction_ranger <- predict(ranger_model, newdata = vali2)
# Check results with confusionMatrix() function and validation set
cm_ranger_model <- confusionMatrix(prediction_ranger,
validation$disease, positive = "Yes")
cm_ranger_model
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 100 3
## Yes 0 103
##
## Accuracy : 0.9854
## 95% CI : (0.958, 0.997)
## No Information Rate : 0.5146
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9709
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 0.9717
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9709
## Prevalence : 0.5146
## Detection Rate : 0.5000
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9858
##
## 'Positive' Class : Yes
##
We again obtain Balanced Accuracy = 0.986, Detection Rate = 0.50, Sensitivity = 0.97, and Specificity = 1.00: excellent results.
These are the accuracies we achieved with our models when trained on edx and tested on the validation dataset:
| Model | Accuracy |
|---|---|
| glm | 0.8640777 |
| lda | 0.8398058 |
| naive_bayes | 0.8543689 |
| svmLinear | 0.8543689 |
| gamLoess | 0.8932039 |
| qda | 0.8543689 |
| knn | 0.8543689 |
| kknn | 0.9708738 |
| gam | 0.9029126 |
| rf | 0.9854369 |
| ranger | 0.9854369 |
| wsrf | 0.9854369 |
| mlp | 0.9563107 |
Here is the table of differences in accuracy for the same models tested on the two different dataset pairs:
| Model | Accuracy difference (%) |
|---|---|
| glm | 3.4809377 |
| lda | -0.7755150 |
| naive_bayes | 0.6807956 |
| svmLinear | 0.0710395 |
| gamLoess | 3.9545347 |
| qda | 4.3393322 |
| knn | 1.2905517 |
| kknn | -2.9126214 |
| gam | 4.3156524 |
| rf | -0.8465546 |
| ranger | -0.8465546 |
| wsrf | -0.8465546 |
| mlp | 3.5578972 |
In the bar chart above, which shows the accuracy difference between the two dataset pairs for each model, smaller is better: a small difference means the model's accuracy stays stable when it is given different datasets as input.
And here is the Accuracy result from our FINAL Ranger Model:
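A string like the one below can be produced from the confusionMatrix object (a sketch):

paste("Ranger Model Accuracy:",
      round(cm_ranger_model$overall["Accuracy"] * 100, 2), "%")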
## [1] "Ranger Model Accuracy: 98.54 %"
In this project, using the Heart Disease dataset from the UCI Repository (heart.csv), we succeeded in building a classification model whose predictions reach 98.54% accuracy, with 97.17% sensitivity and 100% specificity.