The objective of this report is to analyse the “Heart Disease UCI” dataset from the Kaggle website. The dataset contains 303 observations of 14 variables. The dependent variable, target, indicates whether or not the patient has heart disease (0 = disease, 1 = no disease); the remaining 13 variables are the independent variables. The guide to the variable descriptions was taken from a Kaggle forum discussion.
First of all, the required libraries and the dataset are imported into R. The basic structure of the dataset is then checked, and the data are also checked for missing values.
# Set working directory
setwd("C:/Users/olive/Desktop/Data Science Report")
# Load packages
library(naniar)
library(dplyr)
library(ggplot2)
library(plotly)
library(caTools)
library(caret)
library(rpart)
# Read dataset
df <- read.csv('heart.csv')
# Check dataset structure
str(df)
## 'data.frame': 303 obs. of 14 variables:
## $ ï..age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
# Check for missing values
NA_df <- as.data.frame(miss_var_summary(df))
print(NA_df)
## variable n_miss pct_miss
## 1 ï..age 0 0
## 2 sex 0 0
## 3 cp 0 0
## 4 trestbps 0 0
## 5 chol 0 0
## 6 fbs 0 0
## 7 restecg 0 0
## 8 thalach 0 0
## 9 exang 0 0
## 10 oldpeak 0 0
## 11 slope 0 0
## 12 ca 0 0
## 13 thal 0 0
## 14 target 0 0
It can be seen that the dataset has no missing values and that all of its variables are currently stored as numeric types.
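One small artefact is visible in the output above: the first column prints as ï..age because the CSV file was saved with a UTF-8 byte-order mark. The column renaming in the next step removes the prefix anyway, but the file could also be read cleanly from the start; a minimal sketch:
# Re-read the file stripping the byte-order mark so the first column is named 'age'
df <- read.csv('heart.csv', fileEncoding = 'UTF-8-BOM')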
Continuing the data analysis, it is useful to rename the dataset columns and recode the values of the categorical variables, so that it is easier to identify what each variable represents.
It is also important to set the correct class for each variable.
# Change the column names
names(df) <-
c(
'age',
'sex',
'chest_pain_type',
'resting_blood_pressure',
'cholesterol',
'fasting_blood_sugar',
'rest_ecg',
'max_heart_rate_achieved',
'exercise_induced_angina',
'st_depression',
'st_slope',
'num_major_vessels',
'thalassemia',
'target'
)
# Change the values of the categorical variables
df$target[which(df$target == 0)] <- 'disease'
df$target[which(df$target == 1)] <- 'no disease'
df$sex[which(df$sex == 0)] <- 'female'
df$sex[which(df$sex == 1)] <- 'male'
df$chest_pain_type[which(df$chest_pain_type == 0)] <-'asymptomatic'
df$chest_pain_type[which(df$chest_pain_type == 1)] <- 'atypical angina'
df$chest_pain_type[which(df$chest_pain_type == 2)] <- 'non-anginal pain'
df$chest_pain_type[which(df$chest_pain_type == 3)] <- 'typical angina'
df$fasting_blood_sugar[which(df$fasting_blood_sugar==0)] <- 'lower than 120mg/ml'
df$fasting_blood_sugar[which(df$fasting_blood_sugar==1)] <- 'greater than 120mg/ml'
df$rest_ecg[which(df$rest_ecg==0)] <- 'left ventricular hypertrophy'
df$rest_ecg[which(df$rest_ecg==1)] <- 'normal'
df$rest_ecg[which(df$rest_ecg==2)] <- 'ST-T wave abnormality'
df$exercise_induced_angina[which(df$exercise_induced_angina == 0)] <- 'no'
df$exercise_induced_angina[which(df$exercise_induced_angina == 1)] <- 'yes'
df$st_slope[which(df$st_slope==0)] <- 'downsloping'
df$st_slope[which(df$st_slope==1)] <- 'flat'
df$st_slope[which(df$st_slope==2)] <- 'upsloping'
df$thalassemia[which(df$thalassemia==0)] <- NA
df$thalassemia[which(df$thalassemia==1)] <- 'fixed efect'
df$thalassemia[which(df$thalassemia==2)] <- 'normal'
df$thalassemia[which(df$thalassemia==3)] <- 'reversable defect'
# Change the variable classes
df <- df %>% mutate_if(is.character, as.factor)
# Check dataset structure
str(df)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 1 2 2 2 ...
## $ chest_pain_type : Factor w/ 4 levels "asymptomatic",..: 4 3 2 2 1 1 2 2 3 3 ...
## $ resting_blood_pressure : int 145 130 130 120 120 140 140 120 172 150 ...
## $ cholesterol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fasting_blood_sugar : Factor w/ 2 levels "greater than 120mg/ml",..: 1 2 2 2 2 2 2 2 1 2 ...
## $ rest_ecg : Factor w/ 3 levels "left ventricular hypertrophy",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ max_heart_rate_achieved: int 150 187 172 178 163 148 153 173 162 174 ...
## $ exercise_induced_angina: Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ st_depression : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ st_slope : Factor w/ 3 levels "downsloping",..: 1 1 3 3 3 2 2 3 3 3 ...
## $ num_major_vessels : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thalassemia : Factor w/ 3 levels "fixed efect",..: 1 2 2 2 2 1 2 3 3 2 ...
## $ target : Factor w/ 2 levels "disease","no disease": 2 2 2 2 2 2 2 2 2 2 ...
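As a side note, the same recoding could be written more compactly with factor(), passing the original codes as levels and the descriptive names as labels; a sketch for two of the variables, assuming the raw 0/1 codes from the original file:
# Equivalent, more compact recode applied to the raw 0/1 codes
df$sex    <- factor(df$sex,    levels = c(0, 1), labels = c('female', 'male'))
df$target <- factor(df$target, levels = c(0, 1), labels = c('disease', 'no disease'))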
In this part of the report, some visualizations are created to understand the relationships between the variables and to draw insights from the dataset. The first graph shows the difference in the age distribution between patients with and without heart disease.
#Plot Figure 1
Plot1 <- ggplot(df,aes(x=target, y=age , color=target)) +
geom_boxplot()+
theme_minimal() +
theme(legend.position = "none") +
scale_color_manual(values=c('red','blue'))
ggplotly(Plot1)
Observing Figure 1, it can be seen that patients without heart disease tend to be younger than patients with heart disease. This result is expected since a person's health usually deteriorates with age.
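A quick numeric check of this pattern can be made with dplyr (already loaded above), for example by comparing the median age of the two groups:
# Median age by target group (numeric counterpart of Figure 1)
df %>%
  group_by(target) %>%
  summarise(median_age = median(age))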
The second graph shows the number of patients with and without heart disease according to patient sex.
#Plot Figure 2
Plot2 <- ggplot(df, aes(sex,fill = target)) +
geom_bar(position = 'dodge') +
theme_minimal() +
scale_fill_manual(values=c('red','blue'))
ggplotly(Plot2)
As can be seen in Figure 2, male patients appear more likely to have heart disease than female patients in this dataset.
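Because the two sexes have different group sizes, this comparison is easier to read as within-sex proportions; a one-line check:
# Proportion of disease / no disease within each sex
prop.table(table(df$sex, df$target), margin = 1)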
The third graph illustrates the relationship between patient age and the maximum heart rate achieved.
#Plot Figure 3
Plot3 <- ggplot(df, aes(x = age, y = max_heart_rate_achieved, color = target)) +
geom_point() +
theme_minimal() +
scale_color_manual(values = c('red', 'blue'))
ggplotly(Plot3)
Figure 3 shows that younger patients are able to achieve higher maximum heart rates, while older patients have more difficulty reaching a high heart rate. Beyond that, it can also be noticed that patients with a lower maximum heart rate are more likely to have heart disease.
The last graph shows the count of the target variable according to the chest pain type variable.
#Plot Figure 4
Plot4 <- ggplot(df, aes(chest_pain_type,fill=target))+
geom_bar(position = 'dodge') +
theme_minimal() +
scale_fill_manual(values=c('red','blue'))
ggplotly(Plot4)
Observing Figure 4, it can be noticed that most of the patients who experienced chest pain do not have heart disease, while most of the asymptomatic patients do have heart disease. This suggests that it is important to check for heart disease regularly, since it can be a silent disease.
The main objective of a machine learning predictive model is to learn the patterns in a sample of the data, called the training set, in order to make accurate predictions.
Supervised learning is a subarea of machine learning that uses labelled data to make predictions; labelled data are observations that have already been assigned a category or a particular value. Supervised learning problems can be divided into two classes: classification, where the value to be predicted is categorical, and regression, where the value to be predicted is numeric.
In this project, the dependent variable target is a categorical variable with only two categories, “disease” and “no disease”. Therefore, the project can be described as a binary classification problem.
Furthermore, the predictions of each classification model will be evaluated with a confusion matrix. A confusion matrix is a table with the counts of correct and incorrect predictions, and it is composed of four values: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).
In this project, taking “disease” as the positive class, an example of a FP is a predicted value of “disease” when the real result is “no disease”, while a TP is a predicted value of “disease” when the real result is, in fact, “disease”. The same logic applies to TN and FN.
The performance of a predictive model is based on the total number of predictions and on the TP, TN, FP and FN counts. These counts are summarized by three statistics: sensitivity, specificity and accuracy. Sensitivity measures the performance on the positive class (disease), specificity measures the performance on the negative class (no disease), and accuracy measures the overall performance across both classes.
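Written in terms of the four counts, the three statistics are simple ratios. The small helper below makes the definitions explicit; it is illustrative only, since caret's confusionMatrix() reports these values automatically:
# Compute sensitivity, specificity and accuracy from confusion-matrix counts
classification_metrics <- function(TP, TN, FP, FN) {
  c(
    sensitivity = TP / (TP + FN),                  # proportion of actual 'disease' cases detected
    specificity = TN / (TN + FP),                  # proportion of actual 'no disease' cases detected
    accuracy    = (TP + TN) / (TP + TN + FP + FN)  # overall proportion of correct predictions
  )
}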
Before fitting any predictive model to the dataset, some preprocessing needs to be done. The dataset is divided into a training set (80 %) and a test set (20 %): the training set is used to train the predictive models and the test set is used to evaluate their outcomes.
# Split the data into training set and test set
set.seed(7)
split <- sample.split(df$target, SplitRatio = 0.8)
training_set <- subset(df, split==TRUE)
test_set <- subset(df, split==FALSE)
The logistic model (or logit model) is used to model the probability that a certain category occurs or not, such as pass/fail, win/lose, alive/dead or healthy/sick. Logistic regression estimates this class probability by applying the logistic (sigmoid) function to a linear combination of the predictor values.
The logistic model is widely used by data scientists and statisticians; it is fundamental, powerful and easy to implement, and it is a standard first technique for classification problems.
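Formally, the model estimates the probability of the positive outcome as a logistic transformation of a linear combination of the predictors. Note that for a factor response, R's glm() treats the second factor level as the "success", so here the modelled probability is that of "no disease":

$$P(\text{no disease} \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}$$

Predicted probabilities above 0.5 are therefore mapped to "no disease" and the rest to "disease", which is exactly the thresholding applied after the predict() call further below.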
# Apply logistic regression model
logisticModel <- glm(target ~.,
family= 'binomial',
data=training_set)
# Check logistic regression model summary
summary(logisticModel)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4839 -0.2856 0.1046 0.4505 2.3119
##
## Coefficients:
##                                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                             3.950e+00  3.319e+00   1.190 0.233893    
## age                                     7.913e-05  2.706e-02   0.003 0.997667    
## sexmale                                -2.235e+00  6.882e-01  -3.248 0.001162 ** 
## chest_pain_typeatypical angina          9.307e-01  6.502e-01   1.432 0.152277    
## chest_pain_typenon-anginal pain         2.090e+00  5.994e-01   3.487 0.000489 ***
## chest_pain_typetypical angina           1.841e+00  7.309e-01   2.519 0.011782 *  
## resting_blood_pressure                 -2.259e-02  1.188e-02  -1.902 0.057157 .  
## cholesterol                            -5.350e-03  4.662e-03  -1.148 0.251164    
## fasting_blood_sugarlower than 120mg/ml -2.656e-01  6.926e-01  -0.383 0.701390    
## rest_ecgnormal                          5.135e-01  4.364e-01   1.177 0.239312    
## rest_ecgST-T wave abnormality          -7.088e-01  2.480e+00  -0.286 0.775005    
## max_heart_rate_achieved                 2.234e-02  1.281e-02   1.743 0.081285 .  
## exercise_induced_anginayes             -1.034e+00  5.067e-01  -2.041 0.041239 *  
## st_depression                          -3.580e-01  2.515e-01  -1.423 0.154671    
## st_slopeflat                           -6.819e-01  1.007e+00  -0.677 0.498164    
## st_slopeupsloping                       3.861e-01  1.074e+00   0.360 0.719216    
## num_major_vessels                      -9.467e-01  2.457e-01  -3.853 0.000117 ***
## thalassemianormal                      -1.107e-01  8.560e-01  -0.129 0.897123    
## thalassemiareversable defect           -1.469e+00  8.322e-01  -1.766 0.077466 .  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 333.48 on 241 degrees of freedom
## Residual deviance: 149.81 on 223 degrees of freedom
## AIC: 187.81
##
## Number of Fisher Scoring iterations: 6
# Predicting the test set results
predLogistic <- predict(logisticModel,
type = 'response',
newdata = test_set[-14])
# Check the logistic regression model results
predLogistic <- ifelse(predLogistic > 0.5, 'no disease', 'disease')
resultsLogistic<- confusionMatrix(data=factor(predLogistic),reference=test_set[,14])
print(resultsLogistic)
## Confusion Matrix and Statistics
##
## Reference
## Prediction disease no disease
## disease 22 4
## no disease 5 28
##
## Accuracy : 0.8475
## 95% CI : (0.7301, 0.9278)
## No Information Rate : 0.5424
## P-Value [Acc > NIR] : 7.195e-07
##
## Kappa : 0.6918
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8148
## Specificity : 0.8750
## Pos Pred Value : 0.8462
## Neg Pred Value : 0.8485
## Prevalence : 0.4576
## Detection Rate : 0.3729
## Detection Prevalence : 0.4407
## Balanced Accuracy : 0.8449
##
## 'Positive' Class : disease
##
Ultimately, the logistic regression model produced 22 TP, 28 TN, 4 FP and 5 FN. The accuracy of the model was 84.75 %, the sensitivity 81.48 % and the specificity 87.50 %. These results show that the model had a good predictive performance; beyond that, it was slightly better at identifying patients without disease (higher specificity) than patients with disease.
The classification tree algorithm, also known as a decision tree, classifies instances by sorting them based on feature values. Each node in a decision tree represents a feature of the instance to be classified, and each branch represents a value that the node can take. Instances are classified starting at the root node and sorted according to their feature values.
The statistical calculations behind the decision tree (rpart selects classification splits using the Gini impurity by default) are somewhat more involved than those used for the logistic regression model.
# Apply decision tree model
treeModel <- rpart(target ~ .,
data = training_set)
# Predicting the test set results
predTree <-
predict(treeModel, newdata = test_set[-14], type = 'class')
# Check the decision tree model results
resultsTree <- confusionMatrix(data=factor(predTree), reference=test_set[,14])
print(resultsTree)
## Confusion Matrix and Statistics
##
## Reference
## Prediction disease no disease
## disease 16 5
## no disease 12 28
##
## Accuracy : 0.7213
## 95% CI : (0.5917, 0.8285)
## No Information Rate : 0.541
## P-Value [Acc > NIR] : 0.003014
##
## Kappa : 0.428
##
## Mcnemar's Test P-Value : 0.145610
##
## Sensitivity : 0.5714
## Specificity : 0.8485
## Pos Pred Value : 0.7619
## Neg Pred Value : 0.7000
## Prevalence : 0.4590
## Detection Rate : 0.2623
## Detection Prevalence : 0.3443
## Balanced Accuracy : 0.7100
##
## 'Positive' Class : disease
##
For the decision tree model, the results obtained were 16 TP, 28 TN, 5 FP and 12 FN. The accuracy of the model was 72.13 %, the sensitivity 57.14 % and the specificity 84.85 %. These results show a weaker predictive performance; in particular, the model identified patients without disease far more reliably than patients with disease.
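To relate these numbers back to the node-and-branch structure described earlier, the fitted tree can also be drawn; a minimal sketch using the base plotting methods that ship with rpart:
# Draw the fitted classification tree with the base rpart plotting functions
plot(treeModel, uniform = TRUE, margin = 0.1)
text(treeModel, use.n = TRUE, cex = 0.7)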
To conclude, the model that best fit the dataset was the logistic regression model, which achieved better predictive results than the decision tree model. This is likely because the dataset is relatively simple and small, so a simpler model fits it better.
As the final part of the project, a table is created to compare the results of the two models. This makes it easy to see the better results obtained by the logistic regression model compared to the decision tree model.
# Get results from models
resultsLogistic <- resultsLogistic$byClass
resultsTree <- resultsTree$byClass
# Create comparison table
tb <- data.frame('Logistic Regression' = resultsLogistic, 'Decision Tree' = resultsTree)
# Print comparison table
print(tb)
## Logistic.Regression Decision.Tree
## Sensitivity 0.8148148 0.5714286
## Specificity 0.8750000 0.8484848
## Pos Pred Value 0.8461538 0.7619048
## Neg Pred Value 0.8484848 0.7000000
## Precision 0.8461538 0.7619048
## Recall 0.8148148 0.5714286
## F1 0.8301887 0.6530612
## Prevalence 0.4576271 0.4590164
## Detection Rate 0.3728814 0.2622951
## Detection Prevalence 0.4406780 0.3442623
## Balanced Accuracy 0.8449074 0.7099567