Introduction

The objective of this report is to analyse a dataset from the website Kaggle called “Heart Disease UCI”. The dataset contains 14 variables and 303 observations. The dependent variable, target, indicates whether or not the patient has heart disease (0 = disease, 1 = no disease). The other 13 variables are the independent variables; they appear in the str() output in the Data cleaning section.

Data cleaning

First, the libraries and the dataset need to be imported into R. The dataset is then checked for missing values.

# Set working directory
setwd("C:/Users/olive/Desktop/Data Science Report")

# Load packages
library(naniar)
library(dplyr)
library(ggplot2)
library(plotly)

# Read dataset
df <- read.csv('heart.csv')

# Check dataset basic info
str(df)
## 'data.frame':    303 obs. of  14 variables:
##  $ ï..age  : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...
# Check for missing values
NA_df <- as.data.frame(miss_var_summary(df))
print(NA_df)
##    variable n_miss pct_miss
## 1    ï..age      0        0
## 2       sex      0        0
## 3        cp      0        0
## 4  trestbps      0        0
## 5      chol      0        0
## 6       fbs      0        0
## 7   restecg      0        0
## 8   thalach      0        0
## 9     exang      0        0
## 10  oldpeak      0        0
## 11    slope      0        0
## 12       ca      0        0
## 13     thal      0        0
## 14   target      0        0

The output shows that the dataset has no missing values. Additionally, all the variables in the dataset are numeric.

Before continuing with the analysis, it is useful to rename the dataset columns and the values of the categorical variables, so that it is easier to identify what each variable represents.

Beyond that, it is also important to set the correct type for each variable.
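The first column shows up as `ï..age` in the output above, which is the usual symptom of a UTF-8 byte order mark (BOM) at the start of the CSV file. The sketch below is only an illustration: it writes a tiny hypothetical file with a BOM (standing in for heart.csv) and reads it back with `fileEncoding = "UTF-8-BOM"`, which R accepts for reading and which should drop the mark. It also shows a base R way to count missing values per column, as an alternative to `miss_var_summary()`.

```r
# Hypothetical demo file: a tiny CSV that starts with a UTF-8 byte order mark,
# standing in for heart.csv (the values are for illustration only)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)          # the three BOM bytes
writeBin(charToRaw("age,sex\n63,1\n37,1\n"), con)   # CSV content
close(con)

# Declaring the encoding lets R drop the BOM while reading
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)

# Base R count of missing values per column
colSums(is.na(demo))
```

In this report the garbled name does no harm, because every column is renamed in the next step.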

# change the column names
names(df) <-
  c(
    'age',
    'sex',
    'chest_pain_type',
    'resting_blood_pressure',
    'cholesterol',
    'fasting_blood_sugar',
    'rest_ecg',
    'max_heart_rate_achieved',
    'exercise_induced_angina',
    'st_depression',
    'st_slope',
    'num_major_vessels',
    'thalassemia',
    'target'
  )

#  Change the values of the categorical variables
df$target[which(df$target == 0)] <- 'disease'
df$target[which(df$target == 1)] <- 'no disease'

df$sex[which(df$sex == 0)] <- 'female'
df$sex[which(df$sex == 1)] <- 'male'

df$chest_pain_type[which(df$chest_pain_type == 0)] <-'asymptomatic'
df$chest_pain_type[which(df$chest_pain_type == 1)] <- 'atypical angina'
df$chest_pain_type[which(df$chest_pain_type == 2)] <- 'non-anginal pain'
df$chest_pain_type[which(df$chest_pain_type == 3)] <- 'typical angina'

df$fasting_blood_sugar[which(df$fasting_blood_sugar==0)] <- 'lower than 120mg/dl'
df$fasting_blood_sugar[which(df$fasting_blood_sugar==1)] <- 'greater than 120mg/dl'

df$rest_ecg[which(df$rest_ecg==0)] <- 'left ventricular hypertrophy'
df$rest_ecg[which(df$rest_ecg==1)] <- 'normal'
df$rest_ecg[which(df$rest_ecg==2)] <- 'ST-T wave abnormality'

df$exercise_induced_angina[which(df$exercise_induced_angina == 0)] <- 'no'
df$exercise_induced_angina[which(df$exercise_induced_angina == 1)] <- 'yes'

df$st_slope[which(df$st_slope==0)] <- 'downsloping'
df$st_slope[which(df$st_slope==1)] <- 'flat'
df$st_slope[which(df$st_slope==2)] <- 'upsloping'

df$thalassemia[which(df$thalassemia==0)] <- NA
df$thalassemia[which(df$thalassemia==1)] <- 'fixed defect'
df$thalassemia[which(df$thalassemia==2)] <- 'normal'
df$thalassemia[which(df$thalassemia==3)] <- 'reversible defect'

# Change variables type
df <- df %>% mutate_if(is.character, as.factor)

str(df)
## 'data.frame':    303 obs. of  14 variables:
##  $ age                    : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex                    : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 1 2 2 2 ...
##  $ chest_pain_type        : Factor w/ 4 levels "asymptomatic",..: 4 3 2 2 1 1 2 2 3 3 ...
##  $ resting_blood_pressure : int  145 130 130 120 120 140 140 120 172 150 ...
##  $ cholesterol            : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fasting_blood_sugar    : Factor w/ 2 levels "greater than 120mg/dl",..: 1 2 2 2 2 2 2 2 1 2 ...
##  $ rest_ecg               : Factor w/ 3 levels "left ventricular hypertrophy",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ max_heart_rate_achieved: int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exercise_induced_angina: Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
##  $ st_depression          : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ st_slope               : Factor w/ 3 levels "downsloping",..: 1 1 3 3 3 2 2 3 3 3 ...
##  $ num_major_vessels      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thalassemia            : Factor w/ 3 levels "fixed defect",..: 1 2 2 2 2 1 2 3 3 2 ...
##  $ target                 : Factor w/ 2 levels "disease","no disease": 2 2 2 2 2 2 2 2 2 2 ...
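The recoding above can also be written more compactly with `factor()`, which maps codes to labels and sets the type in a single step. The following is a minimal sketch on a small made-up data frame (the `demo` object and its values are assumptions for illustration, not rows from heart.csv). Any code that is not listed in `levels` becomes `NA`, which reproduces the treatment of code 0 in the thalassemia column.

```r
# Made-up codes for illustration only (not rows from heart.csv)
demo <- data.frame(sex = c(1, 1, 0, 1), thal = c(1, 2, 0, 3))

# Map each integer code to its label and convert to factor in one call
demo$sex <- factor(demo$sex, levels = c(0, 1), labels = c("female", "male"))

# Code 0 is not listed in levels, so it becomes NA automatically
demo$thal <- factor(demo$thal, levels = 1:3,
                    labels = c("fixed defect", "normal", "reversible defect"))

str(demo)

# Recoding a value to NA introduces missing values, so it is worth re-checking
colSums(is.na(demo))
```

One difference to keep in mind: with `factor()` the level order follows `levels`, whereas converting character columns with `as.factor` orders the levels alphabetically.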

Data analysis

In this part of the report, some visualizations are used to explore the relationships between variables and to gain insights from the dataset. The first graph shows the difference in age between patients with and without heart disease.

Figure 1 - Boxplot of target vs age

# Plot the graph
Plot1 <- ggplot(df,aes(x=target, y=age , color=target)) +
  geom_boxplot()+
  theme_minimal() +
  theme(legend.position = "none") +
  scale_color_manual(values=c('red','blue'))

ggplotly(Plot1)
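A boxplot is easier to interpret alongside the numbers it draws. As a rough sketch, the median age per group can be computed with base R's `aggregate()`. The `demo` data frame below uses made-up ages purely for illustration; in the report, the same call would be applied to `df`.

```r
# Made-up ages for illustration only (not taken from heart.csv)
demo <- data.frame(
  target = c("disease", "disease", "disease",
             "no disease", "no disease", "no disease"),
  age    = c(60, 58, 65, 45, 52, 50)
)

# Median age within each level of target (the centre line of each box)
age_summary <- aggregate(age ~ target, data = demo, FUN = median)
print(age_summary)
```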

The second graph illustrates the relationship between the patient's age and the maximum heart rate achieved.

Figure 2 - Age vs Max heart rate achieved

# Map target to point colour (the default point shape ignores fill)
Plot2 <- ggplot(df, aes(x=age,y=max_heart_rate_achieved,color=target))+
  geom_point() +
  theme_minimal() +
  scale_color_manual(values=c('red','blue'))

ggplotly(Plot2)

Figure 2 shows that younger patients without heart disease tend to achieve a high maximum heart rate, whereas older patients find it harder to reach a high heart rate. This suggests that younger patients with a low maximum heart rate are more likely to have heart disease.
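The visual impression from the scatter plot can be backed up with a correlation coefficient, overall and within each group. The sketch below uses a small made-up data frame (`demo` and its values are assumptions for illustration only); in the report, the same calls would be applied to `df`.

```r
# Made-up values for illustration only (not taken from heart.csv)
demo <- data.frame(
  target = rep(c("disease", "no disease"), each = 4),
  age = c(45, 55, 60, 68, 35, 42, 50, 61),
  max_heart_rate_achieved = c(150, 140, 132, 120, 188, 178, 170, 155)
)

# Pearson correlation between age and maximum heart rate, all patients
overall <- cor(demo$age, demo$max_heart_rate_achieved)

# The same correlation computed separately for each level of target
by_group <- sapply(split(demo, demo$target),
                   function(g) cor(g$age, g$max_heart_rate_achieved))

print(overall)
print(by_group)
```

A negative coefficient would be consistent with the pattern described above, although correlation alone does not show that age causes the lower heart rate.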

Figure 3 - Chest pain type count

Plot3 <- ggplot(df, aes(chest_pain_type,fill=target))+
  geom_bar(position = 'dodge') +
  theme_minimal() +
  scale_fill_manual(values=c('red','blue'))

ggplotly(Plot3)

Figure 3 shows that most of the patients who experience chest pain do not have heart disease. This means that it is important to check for heart disease regularly, since it can be a silent disease.
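The bar chart shows counts, so groups of different sizes are hard to compare directly. One way to quantify the observation is to compute the share of each target level within each chest pain type with `table()` and `prop.table()`. The sketch below uses a small made-up data frame (`demo` and its values are assumptions for illustration only); in the report, the same calls would be applied to `df`.

```r
# Made-up values for illustration only (not taken from heart.csv)
demo <- data.frame(
  chest_pain_type = c("asymptomatic", "asymptomatic", "asymptomatic",
                      "typical angina", "typical angina", "atypical angina"),
  target = c("disease", "disease", "no disease",
             "no disease", "no disease", "no disease")
)

# Cross-tabulate chest pain type against target
counts <- table(demo$chest_pain_type, demo$target)

# margin = 1 converts each row to proportions that sum to 1
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```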