Introduction

The objective of this project is to build a model that predicts whether a person is diagnosed with heart disease, based on patterns extracted from the 14 attributes (13 descriptive features and one target feature) of the Cleveland data set from the UCI Machine Learning Repository. The project consists of two phases. Phase I, covered in this report, focuses on data preprocessing and exploration. Model building, validation and prediction are presented in Phase II. The rest of this report is organised as follows. Section 2 describes the data set and its attributes. Section 3 covers data pre-processing. In Section 4, we explore each attribute and the relationships between them. The last section ends with a summary.

Data Set

The UCI Machine Learning Repository provides four heart disease databases, namely Cleveland, Hungary, Switzerland and the VA Long Beach, each with 76 attributes. Of these four, the Cleveland data set has been the benchmark for many researchers, as it is more robust and easier to use than the other three. Of the 76 attributes, 14 were selected, because many of the others repeat similar information and some are not related to the target attribute. The data set contains the details of 303 patients. The data was initially donated by David W. Aha (aha ‘@’ ics.uci.edu).

Target Feature

The response feature is the presence of heart disease (num). It takes 5 levels based on angiographic disease status: 0 = healthy, 1 = diagnosed with stage 1, 2 = diagnosed with stage 2, 3 = diagnosed with stage 3, and 4 = diagnosed with stage 4. For this project, we only consider whether or not a person is diagnosed with heart disease. Therefore, the levels of the target feature are reduced to 2 by combining stages 1-4 into a single level, and the result is stored in a new column named “target” with the two levels “Yes” and “No” (0 = No, 1-4 = Yes).
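As a minimal sketch of this recoding in base R, levels 1-4 can be collapsed into a single “Yes” level as shown here. The vector `num_example` is illustrative and simply lists the five possible levels; it is not taken from the data set, and the report's own chunk may differ.

```r
# Illustrative values covering all five levels of num (not actual records)
num_example <- c(0, 1, 2, 3, 4)

# 0 -> "No" (healthy); 1-4 -> "Yes" (any stage of heart disease)
target_example <- factor(ifelse(num_example == 0, "No", "Yes"),
                         levels = c("No", "Yes"))

# Cross-tabulate to confirm the mapping
print(table(num = num_example, target = target_example))
```

Fixing the level order with `levels = c("No", "Yes")` keeps “No” as the first level regardless of which value appears first in the data.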

Descriptive Features

Only 13 descriptive features named below are used.

  • age: Age of the patient (Continuous)
  • sex: Sex of the patient (Categorical with 2 levels: Male, Female)
  • cp: Chest pain type (Categorical with 4 levels: Type 1 = typical angina, Type 2 = atypical angina, Type 3 = non-anginal pain, Type 4 = asymptomatic)
  • trestbps: Resting blood pressure in mm Hg on admission to the hospital (Continuous)
  • chol: Serum cholesterol in mg/dl (Continuous)
  • fbs: Fasting blood sugar > 120 mg/dl (Categorical with 2 levels: True, False)
  • restecg: Resting electrocardiographic results (Categorical with 3 levels: N (Normal), L1 (Level 1), L2 (Level 2))
  • thalach: Maximum heart rate achieved (Continuous)
  • exang: Exercise-induced angina (Categorical with 2 levels: Yes, No)
  • oldpeak: ST depression induced by exercise relative to rest (Continuous)
  • slope: The slope of the peak exercise ST segment (Categorical with 3 levels: Up, Flat, Down)
  • ca: Number of major vessels (0-3) colored by fluoroscopy (Categorical with 4 levels: 0, 1, 2, 3)
  • thal: The heart status as retrieved from the Thallium test (Categorical with 3 levels: N (normal), FD (fixed defect), RD (reversible defect))

Data Pre-processing

Preliminaries (Optional)

In this project, we used the following R packages.

library(knitr)
library(readr)
library(dplyr)
library(ggplot2)
library(mlr)
library(cowplot)

We read the data into R, checked its internal structure and produced a column-wise summary to make sure that the data is free of anomalies.

## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ cp      : int  1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ restecg : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ thalach : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : int  3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : Factor w/ 5 levels "?","0","1","2",..: 2 5 4 2 2 2 4 2 3 2 ...
##  $ thal    : Factor w/ 4 levels "?","3","6","7": 3 2 4 2 2 2 2 2 4 4 ...
##  $ num     : int  0 2 1 0 0 0 3 0 2 1 ...

Feature Summary before Data Preprocessing

name      type     na         mean        disp  median       mad  min    max  nlevs
age       integer   0   54.4389439   9.0386624    56.0   8.89560   29   77.0      0
sex       integer   0    0.6798680   0.4672988     1.0   0.00000    0    1.0      0
cp        integer   0    3.1584158   0.9601256     3.0   1.48260    1    4.0      0
trestbps  integer   0  131.6897690  17.5997477   130.0  14.82600   94  200.0      0
chol      integer   0  246.6930693  51.7769175   241.0  47.44320  126  564.0      0
fbs       integer   0    0.1485149   0.3561979     0.0   0.00000    0    1.0      0
restecg   integer   0    0.9900990   0.9949713     1.0   1.48260    0    2.0      0
thalach   integer   0  149.6072607  22.8750033   153.0  22.23900   71  202.0      0
exang     integer   0    0.3267327   0.4697945     0.0   0.00000    0    1.0      0
oldpeak   numeric   0    1.0396040   1.1610750     0.8   1.18608    0    6.2      0
slope     integer   0    1.6006601   0.6162261     2.0   1.48260    1    3.0      0
ca        factor    0           NA   0.4191419      NA        NA    4  176.0      5
thal      factor    0           NA   0.4521452      NA        NA    2  166.0      4
num       integer   0    0.9372937   1.2285357     0.0   0.00000    0    4.0      0
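
The read-and-inspect step above could look roughly like the following sketch. It assumes a comma-separated file without a header row and supplies the column names by hand; the report does not show its own import call, so these are assumptions. The first five records are copied from the `str()` output above, while the last record is made up purely to show how a “?” marker turns a numeric column into a factor.

```r
# Five real records (from the str() output above) plus one MADE-UP record
# (last line) whose ca and thal are "?" to illustrate the marker.
raw_text <- "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
41,0,2,130,204,0,2,172,0,1.4,1,0,3,0
50,1,4,130,240,0,0,150,0,1.0,2,?,?,0"

col_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
               "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# With default settings "?" is kept as text, so ca and thal become factors
demo <- read.csv(text = raw_text, header = FALSE, col.names = col_names,
                 stringsAsFactors = TRUE)
str(demo)

# Alternative (not what the report did): declare "?" as missing on import
demo_na <- read.csv(text = raw_text, header = FALSE, col.names = col_names,
                    na.strings = "?")
print(colSums(is.na(demo_na)))
```

Declaring `na.strings = "?"` at import time would avoid the spurious “?” level, at the cost of leaving ca and thal as integers that still need converting to factors.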

Data Cleaning and Transformation

With str and summarizeColumns (see Table 1), we noticed the following anomalies:

  • Some categorical columns are stored as integers rather than factors
  • The ca and thal features contain the unusual character “?”, probably marking missing values
  • As a result, the ca and thal features report the wrong cardinality
  • The target feature num is stored as an integer

First, we converted sex, cp, fbs, restecg, exang and slope into factors, and for ca and thal we redefined the levels. From the target feature num, we generated a new feature, target, with two levels based on the original values of num. The R chunk for this work is included below.
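
A minimal sketch of such a conversion is given here, using the first ten records from the `str()` output above. The code-to-label mapping is inferred by comparing the `str()` output before and after conversion, and the object names `raw` and `recoded` are hypothetical; this is not necessarily the chunk used in the report.

```r
# First ten records of the raw columns, copied from the str() output above
raw <- data.frame(
  sex     = c(1, 1, 1, 1, 0, 1, 0, 0, 1, 1),
  cp      = c(1, 4, 4, 3, 2, 2, 4, 4, 4, 4),
  fbs     = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1),
  restecg = c(2, 2, 2, 0, 2, 0, 2, 0, 2, 2),
  exang   = c(0, 1, 1, 0, 0, 0, 0, 1, 0, 1),
  slope   = c(3, 2, 2, 3, 1, 1, 3, 1, 2, 3),
  ca      = c("0", "3", "2", "0", "0", "0", "2", "0", "1", "0"),
  thal    = c("6", "3", "7", "3", "3", "3", "3", "3", "7", "7")
)

recoded <- raw
recoded$sex     <- factor(raw$sex,     levels = c(1, 0), labels = c("Male", "Female"))
recoded$cp      <- factor(raw$cp,      levels = 1:4,     labels = paste("Type", 1:4))
recoded$fbs     <- factor(raw$fbs,     levels = c(1, 0), labels = c("True", "False"))
recoded$restecg <- factor(raw$restecg, levels = 0:2,     labels = c("N", "L1", "L2"))
recoded$exang   <- factor(raw$exang,   levels = c(0, 1), labels = c("No", "Yes"))
recoded$slope   <- factor(raw$slope,   levels = 1:3,     labels = c("Up", "Flat", "Down"))
# Listing only the valid levels turns any other value (such as "?") into NA
recoded$ca      <- factor(raw$ca,      levels = c("0", "1", "2", "3"))
recoded$thal    <- factor(raw$thal,    levels = c("3", "6", "7"),
                          labels = c("N", "FD", "RD"))
str(recoded)
```

Because `factor()` maps every value outside `levels` to NA, redefining the levels of ca and thal removes the “?” level and exposes the missing values in one step.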

The internal structure and column summary of the data after these type conversions are listed in the R chunk below.

str(Heart_data)
## 'data.frame':    303 obs. of  15 variables:
##  $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 1 2 2 1 1 ...
##  $ cp      : Factor w/ 4 levels "Type 1","Type 2",..: 1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : Factor w/ 2 levels "True","False": 1 2 2 2 2 2 2 2 2 1 ...
##  $ restecg : Factor w/ 3 levels "N","L1","L2": 3 3 3 1 3 1 3 1 3 3 ...
##  $ thalach : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 1 2 1 2 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : Factor w/ 3 levels "Up","Flat","Down": 3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
##  $ thal    : Factor w/ 3 levels "N","FD","RD": 2 1 3 1 1 1 1 1 3 3 ...
##  $ num     : int  0 2 1 0 0 0 3 0 2 1 ...
##  $ target  : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 2 1 2 2 ...
summarizeColumns(Heart_data)
##        name    type na        mean       disp median      mad min   max
## 1       age integer  0  54.4389439  9.0386624   56.0  8.89560  29  77.0
## 2       sex  factor  0          NA  0.3201320     NA       NA  97 206.0
## 3        cp  factor  0          NA  0.5247525     NA       NA  23 144.0
## 4  trestbps integer  0 131.6897690 17.5997477  130.0 14.82600  94 200.0
## 5      chol integer  0 246.6930693 51.7769175  241.0 47.44320 126 564.0
## 6       fbs  factor  0          NA  0.1485149     NA       NA  45 258.0
## 7   restecg  factor  0          NA  0.5016502     NA       NA   4 151.0
## 8   thalach integer  0 149.6072607 22.8750033  153.0 22.23900  71 202.0
## 9     exang  factor  0          NA  0.3267327     NA       NA  99 204.0
## 10  oldpeak numeric  0   1.0396040  1.1610750    0.8  1.18608   0   6.2
## 11    slope  factor  0          NA  0.5313531     NA       NA  21 142.0
## 12       ca  factor  4          NA         NA     NA       NA  20 176.0
## 13     thal  factor  2          NA         NA     NA       NA  18 166.0
## 14      num integer  0   0.9372937  1.2285357    0.0  0.00000   0   4.0
## 15   target  factor  0          NA  0.4587459     NA       NA 139 164.0
##    nlevs
## 1      0
## 2      2
## 3      4
## 4      0
## 5      0
## 6      2
## 7      3
## 8      0
## 9      2
## 10     0
## 11     3
## 12     4
## 13     3
## 14     0
## 15     2

All factor features now have the correct cardinality. We noticed that ca has 4 missing values (rows 167, 193, 288 and 303) and thal has 2 missing values (rows 88 and 267). For factor features, the min and max columns of the summary table give the sizes of the smallest and largest levels rather than feature values. As there are only a few missing values, we used the target feature to estimate the missing values of ca and thal. First, we checked which values num takes in the rows where ca or thal is missing. Then we checked the frequency of the ca and thal levels for each of those num values.
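
The two checks described above can be sketched in base R as follows. The data frame `toy` is made up for illustration and does not contain Cleveland records; in the report the same calls would be applied to Heart_data.

```r
# MADE-UP toy records (not the Cleveland data), with one missing ca value
toy <- data.frame(
  num = c(0, 0, 0, 1, 1, 2, 0),
  ca  = factor(c("0", "0", "1", "0", "1", "2", NA),
               levels = c("0", "1", "2", "3"))
)

missing_rows <- which(is.na(toy$ca))   # row numbers where ca is missing
missing_num  <- toy$num[missing_rows]  # num values of those rows

# Frequency of each ca level within each num level (NA rows are ignored)
freq <- table(num = toy$num, ca = toy$ca)
print(missing_rows)
print(missing_num)
print(freq)
```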

The num values for the 4 rows with missing ca are 0, 1, 0 and 0. We then drew a bar chart of ca for each level of num, as shown in the R chunk below. When num = 0 or 1, the most frequent value of ca is 0. Therefore, we assigned 0 to all missing values of ca.

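# Bar chart of ca, one panel per num level; num is wrapped in factor() so that the fill scale is discrete
p1 <- ggplot(Heart_data, aes(x = ca, fill = factor(num))) +
  geom_bar() + facet_grid(num ~ .) + labs(fill = "num") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p1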

We applied the same method to the missing values of thal. The num values for the rows with missing thal are 0 and 2. We used the R chunk below to generate a bar chart of thal by num. When num = 0, the most frequent value of thal is N, and when num = 2, it is RD. Therefore, we assigned N and RD to the missing values of thal accordingly.

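# Bar chart of thal, one panel per num level; num is wrapped in factor() so that the fill scale is discrete
p2 <- ggplot(Heart_data, aes(x = thal, fill = factor(num))) +
  geom_bar() + facet_grid(num ~ .) + labs(fill = "num") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p2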

The chunk below imputes the missing values of ca and thal.

# ca: the four rows with missing values get the most frequent level, "0"
Heart_data[c(167, 193, 288, 303), 'ca'] <- "0"
# thal: row 88 (num = 0) gets "N", row 267 (num = 2) gets "RD"
Heart_data[88, 'thal'] <- "N"
Heart_data[267, 'thal'] <- "RD"
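
The manual assignments follow one rule: a missing value receives the most frequent level among the rows that share its num value. As a sketch only, that rule could be automated with a helper such as the one below. The function `impute_group_mode` and the toy vectors are hypothetical and not part of the report.

```r
# Hypothetical helper: fill each NA in factor x with the most frequent
# observed level of x among rows that share the same group value.
impute_group_mode <- function(x, group) {
  filled <- x
  for (i in which(is.na(x))) {
    # table() drops NA, so only observed values of the group are counted
    counts <- table(x[group == group[i]])
    # which.max() takes the first level on ties (or if the group has no data)
    filled[i] <- names(counts)[which.max(counts)]
  }
  filled
}

# MADE-UP toy data (not the Cleveland records)
toy_num  <- c(0, 0, 0, 2, 2, 2, 0, 2)
toy_thal <- factor(c("N", "N", "RD", "RD", "RD", "FD", NA, NA),
                   levels = c("N", "FD", "RD"))

toy_filled <- impute_group_mode(toy_thal, toy_num)
print(toy_filled)
```

Counting on the original vector `x` rather than on `filled` means that earlier imputations never influence later ones.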

Table 5 presents the summary statistics after data-preprocessing.

summarizeColumns(Heart_data) %>% knitr::kable( caption =  'Feature Summary after Data Preprocessing')

Feature Summary after Data Preprocessing

name      type     na         mean        disp  median       mad  min    max  nlevs
age       integer   0   54.4389439   9.0386624    56.0   8.89560   29   77.0      0
sex       factor    0           NA   0.3201320      NA        NA   97  206.0      2
cp        factor    0           NA   0.5247525      NA        NA   23  144.0      4
trestbps  integer   0  131.6897690  17.5997477   130.0  14.82600   94  200.0      0
chol      integer   0  246.6930693  51.7769175   241.0  47.44320  126  564.0      0
fbs       factor    0           NA   0.1485149      NA        NA   45  258.0      2
restecg   factor    0           NA   0.5016502      NA        NA    4  151.0      3
thalach   integer   0  149.6072607  22.8750033   153.0  22.23900   71  202.0      0
exang     factor    0           NA   0.3267327      NA        NA   99  204.0      2
oldpeak   numeric   0    1.0396040   1.1610750     0.8   1.18608    0    6.2      0
slope     factor    0           NA   0.5313531      NA        NA   21  142.0      3
ca        factor    0           NA   0.4059406      NA        NA   20  180.0      4
thal      factor    0           NA   0.4488449      NA        NA   18  167.0      3
num       integer   0    0.9372937   1.2285357     0.0   0.00000    0    4.0      0
target    factor    0           NA   0.4587459      NA        NA  139  164.0      2

Data Exploration

We first explored each feature individually, both overall and split by the classes of the target feature. Then we proceeded to multivariate visualisation.

Univariate Visualisation

Numerical Features

Age

The data set includes patients aged from 29 to 77 years. The histogram of age is slightly skewed to the left, so the average age is slightly lower than the median age. Around 50% of the patients were between 45 and 65 years old. There is no visible difference in age between patients with and without heart disease; that is, age may not be a major factor in diagnosing heart disease.

Resting Blood Pressure (TRESTBPS)

The histogram of resting blood pressure is right skewed, showing that a few patients had extremely high blood pressure. Comparing the histograms for the two levels of the target feature, patients with heart disease showed higher blood pressure than patients without heart disease.

Serum Cholesterol (CHOL)

The distribution of patients’ serum cholesterol level is highly right skewed, showing that a few patients had extremely high cholesterol levels. When we compare this distribution separately for patients with and without heart disease, the distribution for healthy patients is leptokurtic. That is, more healthy people than patients with heart disease had a cholesterol level of around 200-220 mg/dl.

Maximum Heart Rate Achieved (THALACH)

The histogram of the maximum heart rate achieved is left skewed, as a few patients showed a comparatively low heart rate. The separate histograms for the two levels of the target feature show that healthy people reached a somewhat higher maximum heart rate (around 160) than patients with heart disease (around 150). Further, the histogram for healthy people is leptokurtic (many people lie around the peak).

ST Depression Induced by Exercise Relative to Rest (OLDPEAK)

The histogram is highly skewed to the right and resembles an exponential decay. The histograms by level of the target feature show that many healthy people reported an ST depression around zero, whereas patients with heart disease reported ST depression values of around 0-4.
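
The skewness and kurtosis statements in this subsection are read off the histograms, but they can also be checked numerically. The sketch below uses moment-based definitions; the helper names `skewness` and `excess_kurtosis` are hypothetical (base R has no built-in equivalents), and the ten values per feature come from the `str()` output shown earlier, so the results do not describe the full data set.

```r
# Moment-based sample skewness:
# > 0 means a longer right tail, < 0 a longer left tail
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: > 0 indicates a leptokurtic shape
# (sharper peak and heavier tails than a normal distribution)
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# First ten chol and thalach values from the str() output shown earlier
chol    <- c(233, 286, 229, 250, 204, 236, 268, 354, 254, 203)
thalach <- c(150, 108, 129, 187, 172, 178, 160, 163, 147, 155)

shape <- sapply(list(chol = chol, thalach = thalach), function(x) {
  c(skewness = skewness(x), excess_kurtosis = excess_kurtosis(x))
})
print(round(shape, 2))
```

To compare the two target groups, the same helpers could be applied per group, for example with `tapply(Heart_data$chol, Heart_data$target, skewness)`.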

Categorical Features

Sex

Around 67% (2/3) of all patients were male. The comparison by level of the target feature shows that men are more likely to have heart disease than women.
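
A level-wise comparison like this one can be expressed as a table of row proportions. The sketch below uses only the first ten records from the `str()` output shown earlier, so its numbers illustrate the technique and do not reproduce the result for all 303 patients.

```r
# First ten records of sex and target, copied from the str() output shown earlier
sex <- factor(c("Male", "Male", "Male", "Male", "Female",
                "Male", "Female", "Female", "Male", "Male"),
              levels = c("Male", "Female"))
target <- factor(c("No", "Yes", "Yes", "No", "No",
                   "No", "Yes", "No", "Yes", "Yes"),
                 levels = c("No", "Yes"))

counts <- table(sex = sex, target = target)
# Row-wise proportions: share of "No" and "Yes" within each sex
shares <- prop.table(counts, margin = 1)
print(counts)
print(round(shares, 2))
```

Using `margin = 1` normalises within each sex, which is what a statement such as “men are more likely to have heart disease than women” compares. The same call applies to any other categorical feature.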

Chest Pain Type (CP)

The most common chest pain type among all patients is Type 4 (asymptomatic). Among healthy people, however, the most common type is Type 3 (non-anginal pain).

Fasting Blood Sugar > 120 mg/dl (FBS)

Most patients had a fasting blood sugar level below 120 mg/dl. The bar charts by level of the target feature show the same pattern for both groups, suggesting that fasting blood sugar may not be a deciding factor for heart disease.

Resting Electrocardiographic Results (RESTECG)

N (Normal) and L2 (Level 2) ECG results were about equally common among all patients. However, healthy people mostly showed normal ECG results, while patients with heart disease mostly showed a Level 2 result. A Level 1 result is uncommon in both groups.

Exercise Induced Angina (EXANG)

Around 67% of all patients did not have exercise-induced angina. However, exercise-induced angina is more prominent among patients with heart disease than among patients without. This suggests that exercise-induced angina may be a deciding factor for heart disease.

The Slope of the Peak Exercise ST Segment (SLOPE)

Up-sloping and flat were the two major types of peak exercise ST segment. The majority of the healthy group had an up-sloping segment, while the majority of patients with heart disease had a flat one.

Number of Major Vessels (0-3) Colored by Fluoroscopy (CA)

Here, the ca value is missing for 4 patients. Although we estimated values for these records, we could not merge them into the data set at this stage and kept that for the next phase. Many patients reported a ca value of zero, and the distribution resembles an exponential decay. The same pattern holds for healthy people. For sick people, however, the values 0 and 1 are about equally frequent.

The Heart Status as Retrieved from Thallium Test (THAL)

Here, the thal value is missing for 2 patients. We estimated values for these records, but could not merge them into the data set at this stage and left that for Phase II. In the bar chart for all patients, the most common heart status levels are Normal and Reversible Defect. However, the bar charts for the two levels of the target feature differ: healthy people mostly showed a normal heart status, while sick people mostly showed a reversible defect.

Multivariate Visualisation

Scatter Plot Matrix

Age, Resting Blood Pressure (TRESTBPS), Serum Cholesterol (CHOL), Maximum Heart Rate Achieved (THALACH) and ST Depression Induced by Exercise Relative to Rest (OLDPEAK)

No pair of features shows a strong linear correlation. Blood pressure, cholesterol level and ST depression each show a weak positive linear correlation with age, while maximum heart rate achieved shows a weak negative correlation with age. Maximum heart rate achieved also shows a weak negative linear relationship with ST depression induced by exercise relative to rest.
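
The pairwise relationships read from the scatter plot matrix can be quantified with a correlation matrix. The sketch below uses only the first ten records from the `str()` output shown earlier, so the coefficients are illustrative and will differ from those of the full data set.

```r
# First ten records of the five numerical features, from the str() output shown earlier
num_feats <- data.frame(
  age      = c(63, 67, 67, 37, 41, 56, 62, 57, 63, 53),
  trestbps = c(145, 160, 120, 130, 130, 120, 140, 120, 130, 140),
  chol     = c(233, 286, 229, 250, 204, 236, 268, 354, 254, 203),
  thalach  = c(150, 108, 129, 187, 172, 178, 160, 163, 147, 155),
  oldpeak  = c(2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1)
)

# Pearson correlation (the default method of cor) for every pair of features
cor_mat <- cor(num_feats)
print(round(cor_mat, 2))

# pairs(num_feats) would draw the matching base-R scatter plot matrix
```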

Sex, Resting ECG Results, and Exercise Induced Angina

According to the three-dimensional bar chart depicted below, there are no men in the data set who have a resting ECG result of Level 1 and exercise-induced angina. Most women without exercise-induced angina did not have heart disease, at almost all levels of the resting ECG result.
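
The counts behind such a chart can be tabulated with a multi-way contingency table. The sketch below uses the first ten records from the `str()` output shown earlier; the data frame name `first_ten` is hypothetical, and ten rows are far too few to reproduce the findings above.

```r
# First ten records, copied from the str() output shown earlier
first_ten <- data.frame(
  sex     = factor(c("Male", "Male", "Male", "Male", "Female",
                     "Male", "Female", "Female", "Male", "Male"),
                   levels = c("Male", "Female")),
  restecg = factor(c("L2", "L2", "L2", "N", "L2", "N", "L2", "N", "L2", "L2"),
                   levels = c("N", "L1", "L2")),
  exang   = factor(c("No", "Yes", "Yes", "No", "No", "No", "No", "Yes", "No", "Yes"),
                   levels = c("No", "Yes")),
  target  = factor(c("No", "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes"),
                   levels = c("No", "Yes"))
)

# Four-way table of counts; ftable() flattens it into a readable layout
counts4 <- xtabs(~ sex + restecg + exang + target, data = first_ten)
print(ftable(counts4))
```

Empty cells, such as any combination involving L1 in these ten rows, appear as zeros because the factor levels are declared explicitly.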

Summary

Some numerical features with few distinct values (2-4) were converted to factors to make the analysis more convenient. Two categorical features, ca and thal, had a few missing values, which also made their cardinality wrong. We estimated values for the missing cells and corrected the cardinality. The target feature was num, with 5 levels. We regenerated this feature with two levels and named it target. From the data exploration, we found that sex, serum cholesterol level, maximum heart rate achieved, chest pain type, resting ECG result, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment and the heart status as retrieved from the Thallium test may be potentially useful features in predicting the diagnosis of a patient’s heart disease.