Introduction

The objective of this project is to build a model that predicts whether a person is diagnosed with heart disease, based on patterns extracted from 14 attributes (13 descriptive features and one target feature) of the Cleveland data set from the UCI Machine Learning Repository. The project consists of two phases. Phase I, covered in this report, focuses on data preprocessing and exploration. Model building, validation and prediction are presented in Phase II. The rest of this report is organised as follows. Section 2 describes the data set and its attributes. Section 3 covers data preprocessing. In Section 4, we explore each attribute and the relationships between attributes. The last section ends with a summary.

Data Set

The UCI Machine Learning Repository provides four databases linked to the above source, namely Cleveland, Hungary, Switzerland and the VA Long Beach, with 76 attributes. Of these four, the Cleveland data set has been the benchmark for many researchers, as it is more robust and easier to use than the other three. Of the 76 attributes, 14 were selected, because many of the others repeat similar information and some are not related to the target attribute. The data set contains information on 303 patients. The data was initially donated by David W. Aha (aha ‘@’ ics.uci.edu).

Target Feature

The response feature is the presence of heart disease (num). It takes 5 levels based on angiographic disease status: 0 = healthy, 1 = diagnosed with stage 1, 2 = diagnosed with stage 2, 3 = diagnosed with stage 3, 4 = diagnosed with stage 4. For this project we only consider whether or not a person is diagnosed with heart disease. Therefore, the levels of the target feature are reduced to 2 by combining stages 1-4 into a single level, and the result is stored in a new column named “target” with the two levels “Yes” and “No” (0 = No, 1-4 = Yes).
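
The recoding described above can be sketched in base R as follows. This is a minimal illustration, not the project's own chunk: the ten num values are the first ten shown in the str() output later in this report, and the variable names are chosen here for illustration only.

```r
# First ten values of num, copied from the str() output in this report
num <- c(0, 2, 1, 0, 0, 0, 3, 0, 2, 1)

# 0 becomes "No"; any of the stages 1-4 becomes "Yes"
target <- factor(ifelse(num == 0, "No", "Yes"), levels = c("No", "Yes"))

# Cross-tabulate the original and recoded values to check the mapping
print(table(num, target))
```

Fixing the level order with `levels = c("No", "Yes")` keeps "No" as the first level regardless of which value happens to appear first in the data.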

Descriptive Features

Only the 13 descriptive features listed below are used.

age: Age of the patient (Continuous)
sex: Sex of the patient (Categorical with 2 levels- Male, Female)
cp: Chest pain type (Categorical with 4 levels- Type 1: typical angina, Type 2: atypical angina, Type 3: non-anginal pain, Type 4: asymptomatic)
trestbps: Resting blood pressure in mm Hg on admission to the hospital (Continuous)
chol: Serum cholesterol in mg/dl (Continuous)
fbs: Fasting blood sugar > 120 mg/dl (Categorical with 2 levels- True, False)
restecg: Resting electrocardiographic results (Categorical with 3 levels- N (Normal), L1 (Level 1), L2 (Level 2))
thalach: Maximum heart rate achieved (Continuous)
exang: Exercise induced angina (Categorical with 2 levels-Yes, No)
oldpeak: ST depression induced by exercise relative to rest (Continuous)
slope: The slope of the peak exercise ST segment (Categorical with 3 levels- Up, Flat, Down)
ca: Number of major vessels (0-3) colored by fluoroscopy (Categorical with 4 levels- 0, 1, 2, 3)
thal: The heart status as retrieved from Thallium test (Categorical with 3 levels- N (normal), FD (fixed defect), RD (reversible defect))

Data Pre-processing

Preliminaries (Optional)

In this project, we used the following R packages.

library(knitr)
library(readr)
library(dplyr)
library(ggplot2)
library(mlr)
library(cowplot)

We read the data into R, checked its internal structure, and produced a column-wise summary to make sure that the data is free of anomalies.

## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ cp      : int  1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ restecg : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ thalach : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : int  3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : Factor w/ 5 levels "?","0","1","2",..: 2 5 4 2 2 2 4 2 3 2 ...
##  $ thal    : Factor w/ 4 levels "?","3","6","7": 3 2 4 2 2 2 2 2 4 4 ...
##  $ num     : int  0 2 1 0 0 0 3 0 2 1 ...

Feature Summary before Data Preprocessing
name	type	mean	disp	median	mad	min	max	nlevs
age	integer	54.4389439	9.0386624	56.0	8.89560	29	77.0	0
sex	integer	0.6798680	0.4672988	1.0	0.00000	0	1.0	0
cp	integer	3.1584158	0.9601256	3.0	1.48260	1	4.0	0
trestbps	integer	131.6897690	17.5997477	130.0	14.82600	94	200.0	0
chol	integer	246.6930693	51.7769175	241.0	47.44320	126	564.0	0
fbs	integer	0.1485149	0.3561979	0.0	0.00000	0	1.0	0
restecg	integer	0.9900990	0.9949713	1.0	1.48260	0	2.0	0
thalach	integer	149.6072607	22.8750033	153.0	22.23900	71	202.0	0
exang	integer	0.3267327	0.4697945	0.0	0.00000	0	1.0	0
oldpeak	numeric	1.0396040	1.1610750	0.8	1.18608	0	6.2	0
slope	integer	1.6006601	0.6162261	2.0	1.48260	1	3.0	0
ca	factor	NA	0.4191419	NA	NA	4	176.0	5
thal	factor	NA	0.4521452	NA	NA	2	166.0	4
num	integer	0.9372937	1.2285357	0.0	0.00000	0	4.0	0

Data Cleaning and Transformation

Using str and summarizeColumns (see Table 1), we noticed the following anomalies:

Some columns are not stored in the correct data type
The ca and thal features contain the unusual character “?”, probably standing for missing values
The ca and thal features therefore report the wrong cardinality
The target feature num is stored as an integer
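
The effect of the “?” marker on cardinality can be reproduced with a short base R sketch. The vector raw_ca below is hypothetical and only mimics how a column such as ca might look when read as text; it is not taken from the data set.

```r
# Hypothetical raw values for a column such as ca, where "?" marks a missing entry
raw_ca <- c("0", "3", "2", "?", "0", "1")

# Read as-is, "?" is treated as a level of its own
as_read <- factor(raw_ca)
print(nlevels(as_read))
print(levels(as_read))

# Recode "?" as NA and rebuild the factor with the intended levels only
cleaned <- factor(replace(raw_ca, raw_ca == "?", NA),
                  levels = c("0", "1", "2", "3"))
print(nlevels(cleaned))
print(sum(is.na(cleaned)))
```

An alternative, under the same assumption about the raw file, would be to declare “?” as a missing-value string at read time, for example through the `na.strings` argument of `read.csv`.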

First, we converted sex, cp, fbs, restecg, exang and slope into factors, and redefined the levels of ca and thal. For the target feature num, we generated a new feature, target, with two levels based on the original values of num. The R chunk related to this work is included below.
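
A minimal sketch of how such a conversion could be written in base R is given here. It is not the original chunk: the data frame name heart_head is hypothetical, and its values are the first ten rows of a few columns as shown in the str() output before cleaning. The level labels follow the feature descriptions in Section 2; the mapping of the thal codes 3, 6 and 7 to N, FD and RD is inferred by comparing the str() outputs before and after conversion.

```r
# First ten rows of selected columns, copied from the str() output before cleaning
heart_head <- data.frame(
  sex     = c(1, 1, 1, 1, 0, 1, 0, 0, 1, 1),
  fbs     = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1),
  restecg = c(2, 2, 2, 0, 2, 0, 2, 0, 2, 2),
  slope   = c(3, 2, 2, 3, 1, 1, 3, 1, 2, 3),
  thal    = c("6", "3", "7", "3", "3", "3", "3", "3", "7", "7"),
  stringsAsFactors = FALSE
)

# levels = the codes found in the raw data, labels = the readable names
heart_head$sex     <- factor(heart_head$sex,     levels = c(1, 0), labels = c("Male", "Female"))
heart_head$fbs     <- factor(heart_head$fbs,     levels = c(1, 0), labels = c("True", "False"))
heart_head$restecg <- factor(heart_head$restecg, levels = 0:2,     labels = c("N", "L1", "L2"))
heart_head$slope   <- factor(heart_head$slope,   levels = 1:3,     labels = c("Up", "Flat", "Down"))
heart_head$thal    <- factor(heart_head$thal,    levels = c("3", "6", "7"),
                             labels = c("N", "FD", "RD"))

str(heart_head)
```

Listing the codes explicitly in `levels` means any unexpected code (such as “?”) becomes NA instead of silently forming an extra level.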

The internal structure and column summary of the data, after changing the data types as required for the analysis, are listed in the R chunk below.

str(Heart_data)

## 'data.frame':    303 obs. of  15 variables:
##  $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 1 2 2 1 1 ...
##  $ cp      : Factor w/ 4 levels "Type 1","Type 2",..: 1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : Factor w/ 2 levels "True","False": 1 2 2 2 2 2 2 2 2 1 ...
##  $ restecg : Factor w/ 3 levels "N","L1","L2": 3 3 3 1 3 1 3 1 3 3 ...
##  $ thalach : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 1 2 1 2 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : Factor w/ 3 levels "Up","Flat","Down": 3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
##  $ thal    : Factor w/ 3 levels "N","FD","RD": 2 1 3 1 1 1 1 1 3 3 ...
##  $ num     : int  0 2 1 0 0 0 3 0 2 1 ...
##  $ target  : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 2 1 2 2 ...

summarizeColumns(Heart_data)

##        name    type na        mean       disp median      mad min   max
## 1       age integer  0  54.4389439  9.0386624   56.0  8.89560  29  77.0
## 2       sex  factor  0          NA  0.3201320     NA       NA  97 206.0
## 3        cp  factor  0          NA  0.5247525     NA       NA  23 144.0
## 4  trestbps integer  0 131.6897690 17.5997477  130.0 14.82600  94 200.0
## 5      chol integer  0 246.6930693 51.7769175  241.0 47.44320 126 564.0
## 6       fbs  factor  0          NA  0.1485149     NA       NA  45 258.0
## 7   restecg  factor  0          NA  0.5016502     NA       NA   4 151.0
## 8   thalach integer  0 149.6072607 22.8750033  153.0 22.23900  71 202.0
## 9     exang  factor  0          NA  0.3267327     NA       NA  99 204.0
## 10  oldpeak numeric  0   1.0396040  1.1610750    0.8  1.18608   0   6.2
## 11    slope  factor  0          NA  0.5313531     NA       NA  21 142.0
## 12       ca  factor  4          NA         NA     NA       NA  20 176.0
## 13     thal  factor  2          NA         NA     NA       NA  18 166.0
## 14      num integer  0   0.9372937  1.2285357    0.0  0.00000   0   4.0
## 15   target  factor  0          NA  0.4587459     NA       NA 139 164.0
##    nlevs
## 1      0
## 2      2
## 3      4
## 4      0
## 5      0
## 6      2
## 7      3
## 8      0
## 9      2
## 10     0
## 11     3
## 12     4
## 13     3
## 14     0
## 15     2

Now all the factor features have the correct cardinality. We noticed that ca has 4 missing values (rows 167, 193, 288 and 303) and thal has 2 missing values (rows 88 and 267). For factor features, the min and max columns of the summary table are not value ranges; they appear to be the sizes of the smallest and largest levels (for example, 97 and 206 for sex, which add up to the 303 patients). As there were only a few missing values, we used the target feature to estimate the missing values of ca and thal. First, we checked which values num takes in the rows where ca or thal is missing. Then we checked the frequency of each ca and thal level for those num values.
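
The two checks described above (locating the rows with missing values and looking up their num values) could be done along the following lines. The data frame toy is a made-up miniature used only to show the calls; its values are not from the Cleveland data.

```r
# Hypothetical miniature of the cleaned data: ca has one missing value
toy <- data.frame(
  ca  = factor(c("0", NA, "2", "0", "1"), levels = c("0", "1", "2", "3")),
  num = c(0, 1, 3, 0, 2)
)

# Row numbers where ca is missing
missing_rows <- which(is.na(toy$ca))
print(missing_rows)

# Value of num in those rows
print(toy$num[missing_rows])

# Frequency of each ca level for each value of num (NA rows are left out)
print(table(toy$ca, toy$num, dnn = c("ca", "num")))
```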

The num values for the 4 rows with missing ca are 0, 1, 0 and 0. We then drew a bar chart of ca for each level of num, as shown in the R chunk below. When num = 0 or 1, the most frequent value of ca is 0. Therefore, we assigned 0 to all the missing values of ca.

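# Bar chart of ca, one panel per level of num (num is treated as discrete for the fill)
p1 <- ggplot(Heart_data, aes(x = ca, fill = factor(num))) +
  geom_bar() + facet_grid(num ~ .) +
  labs(fill = "num") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p1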

We applied the same method to the missing values of thal. The num values for the rows with missing thal are 0 and 2. We used the R chunk below to generate a bar chart of thal by num. When num = 0, the most frequent value of thal is N, and when num = 2, the most frequent value is RD. Therefore, we assigned N and RD to the missing values of thal accordingly.

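# Bar chart of thal, one panel per level of num (num is treated as discrete for the fill)
p2 <- ggplot(Heart_data, aes(x = thal, fill = factor(num))) +
  geom_bar() + facet_grid(num ~ .) +
  labs(fill = "num") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p2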

The chunk below imputes the missing values of ca and thal.

# ca: all four rows with a missing value receive the level "0"
Heart_data[c(167, 193, 288, 303), 'ca'] <- "0"
# thal: row 88 (num = 0) receives "N", row 267 (num = 2) receives "RD"
Heart_data[88, 'thal'] <- "N"
Heart_data[267, 'thal'] <- "RD"
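
The manual assignments above amount to filling each missing value with the most frequent level among rows that share the same num. A generic version of that idea is sketched below on made-up data; the helper mode_level and the data frame toy are hypothetical names introduced for this illustration and are not part of the original analysis.

```r
# Hypothetical toy data; the values are made up for illustration only
toy <- data.frame(
  thal = factor(c("N", "N", "RD", NA, "RD", "RD", "FD", NA),
                levels = c("N", "FD", "RD")),
  num  = c(0, 0, 0, 0, 2, 2, 2, 2)
)

# Most frequent observed level of a factor (table() ignores NA by default)
mode_level <- function(x) names(which.max(table(x)))

# Fill each missing thal with the modal thal among rows with the same num
for (i in which(is.na(toy$thal))) {
  same_num <- toy$thal[toy$num == toy$num[i]]
  toy$thal[i] <- mode_level(same_num)
}

print(toy$thal)
```

Because the imputed value depends on num, this approach passes a little target information into the descriptive features; with only six affected rows out of 303 the effect is likely small, but it is worth keeping in mind for the modelling phase.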

Table 5 presents the summary statistics after data-preprocessing.

summarizeColumns(Heart_data) %>% knitr::kable( caption =  'Feature Summary after Data Preprocessing')

Feature Summary after Data Preprocessing
name	type	mean	disp	median	mad	min	max	nlevs
age	integer	54.4389439	9.0386624	56.0	8.89560	29	77.0	0
sex	factor	NA	0.3201320	NA	NA	97	206.0	2
cp	factor	NA	0.5247525	NA	NA	23	144.0	4
trestbps	integer	131.6897690	17.5997477	130.0	14.82600	94	200.0	0
chol	integer	246.6930693	51.7769175	241.0	47.44320	126	564.0	0
fbs	factor	NA	0.1485149	NA	NA	45	258.0	2
restecg	factor	NA	0.5016502	NA	NA	4	151.0	3
thalach	integer	149.6072607	22.8750033	153.0	22.23900	71	202.0	0
exang	factor	NA	0.3267327	NA	NA	99	204.0	2
oldpeak	numeric	1.0396040	1.1610750	0.8	1.18608	0	6.2	0
slope	factor	NA	0.5313531	NA	NA	21	142.0	3
ca	factor	NA	0.4059406	NA	NA	20	180.0	4
thal	factor	NA	0.4488449	NA	NA	18	167.0	3
num	integer	0.9372937	1.2285357	0.0	0.00000	0	4.0	0
target	factor	NA	0.4587459	NA	NA	139	164.0	2
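
For factor columns, the figures in the summary table are consistent with min and max being the sizes of the smallest and largest levels, and disp being the proportion of observations outside the most frequent level. The sketch below checks this reading against the sex and target rows. It is an interpretation of the table and should be confirmed against the mlr documentation for summarizeColumns.

```r
# Level sizes for sex and target, taken from the min and max columns of the table
sex_counts    <- c(smallest = 97,  largest = 206)
target_counts <- c(smallest = 139, largest = 164)

# Both features have two levels, so the two sizes should add up to all patients
n_sex    <- sum(sex_counts)
n_target <- sum(target_counts)

# Proportion of observations that are not in the most frequent level
disp_sex    <- 1 - max(sex_counts) / n_sex
disp_target <- 1 - max(target_counts) / n_target

print(c(n_sex = n_sex, n_target = n_target))
print(round(c(sex = disp_sex, target = disp_target), 7))
```

The computed proportions can be compared with the disp entries 0.3201320 (sex) and 0.4587459 (target) in the table.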

Data Exploration

We explored each feature individually and split by the classes of the target feature. Then we proceeded to multivariate visualisation.

Univariate Visualisation

Numerical Features

Age

Patients aged 29 to 77 years were included in this data set. The histogram of age is slightly skewed to the left, consistent with the mean age (54.4) being a little lower than the median age (56). Around 50% of patients were between 45 and 65 years old. There is no visible difference in age between patients with and without heart disease. That is, age may not be a major factor in diagnosing heart disease.
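
The link between skewness and the ordering of mean and median can be checked numerically. The sketch below uses only the first ten ages from the str() output, which are too few to represent all 303 patients, and a moment-based skewness helper written here for illustration (it is not a function from the packages loaded in this report).

```r
# First ten ages from the str() output; not representative of the full data set
age <- c(63, 67, 67, 37, 41, 56, 62, 57, 63, 53)

# Moment-based sample skewness: negative values indicate a longer left tail
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

print(c(mean = mean(age), median = median(age), skewness = skewness(age)))
```

The same helper could be applied to the other numerical features to back up the skewness statements made from the histograms.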

Resting Blood Pressure (TRESTBPS)

The histogram of resting blood pressure is right skewed, showing that a few patients had extremely high blood pressure. Comparing the histograms for the two levels of the target feature, patients with heart disease showed higher blood pressure than patients without heart disease.

Serum Cholesterol (CHOL)

The distribution of patients’ serum cholesterol level is highly right skewed, showing that a few patients had extremely high cholesterol levels. When we compare this distribution separately for patients with and without heart disease, the healthy patients’ distribution is leptokurtic. That is, more healthy people than patients with heart disease had a cholesterol level of around 200-220 mg/dl.

Maximum Heart Rate Achieved (THALACH)

The histogram of maximum heart rate achieved is left skewed, as a few patients showed a comparatively low heart rate. The separate histograms for the two levels of the target feature show that healthy people had a somewhat higher maximum heart rate (around 160) than patients with heart disease (around 150). Further, the histogram for healthy people is leptokurtic (many people are concentrated around the peak).

ST Depression Induced by Exercise Relative to Rest (OLDPEAK)

The histogram is highly skewed to the right, with a shape resembling an exponential decay. The level-wise histograms for the target feature show that many healthy people reported ST depression around zero; in contrast, patients with heart disease reported ST depression of around 0-4.

Categorical Features

Sex

Of all patients, around 68% (roughly two-thirds) were male. The level-wise comparison for the target feature shows that men are more likely to have heart disease than women.
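
A level-wise comparison of this kind can also be expressed as a table of row proportions. The sketch below uses the first ten rows of sex and target from the str() output after conversion, so its figures are not representative of the full data set and only illustrate the calculation.

```r
# First ten rows of sex and target, decoded from the str() output after conversion
sex <- factor(c("Male", "Male", "Male", "Male", "Female",
                "Male", "Female", "Female", "Male", "Male"),
              levels = c("Male", "Female"))
target <- factor(c("No", "Yes", "Yes", "No", "No",
                   "No", "Yes", "No", "Yes", "Yes"),
                 levels = c("No", "Yes"))

# Counts per combination, then the share of each target level within each sex
counts <- table(sex, target)
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

Using `margin = 1` normalises within each row, so the "Yes" column gives the share of each sex with heart disease. The same pattern applies to the other categorical features.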

Chest Pain Type (CP)

Most patients had chest pain of Type 4 (asymptomatic). However, most of the healthy people had Type 3 pain (non-anginal pain).

Fasting Blood Sugar > 120 mg/dl (FBS)

Most of the patients had a fasting blood sugar level below 120 mg/dl. The level-wise bar charts for the target feature show the same pattern for both groups of patients, which suggests that fasting blood sugar level may not be a deciding factor for having heart disease.

Resting Electrocardiographic Results (RESTECG)

N (Normal) and L2 (Level 2) ECG results were about equally common among all patients. However, healthy people mostly showed normal ECG results, while patients with heart disease mostly showed a Level 2 result. A Level 1 result is uncommon in both patient groups.

Exercise Induced Angina (EXANG)

Of all patients, around 67% did not have exercise induced angina. However, exercise induced angina is more common among patients with heart disease than among patients without it. This suggests that exercise induced angina may be a deciding factor for having heart disease.

The Slope of the Peak Exercise ST Segment (SLOPE)

Up-sloping and flat were the two major types of exercise ST segment for most patients. The majority of people in the healthy group had an up-sloping ST segment, while the majority of patients with heart disease had a flat ST segment.

Number of Major Vessels (0-3) Colored by Fluoroscopy (CA)

Here, there is no ca value for 4 patients. Even though we estimated values for these missing entries, we could not merge them into the data set for the moment and left this for the next phase. Many patients have a ca value of zero, and the distribution resembles an exponential decay. The same pattern holds for healthy people. However, for sick people the values 0 and 1 are about equally common.

The Heart Status as Retrieved from Thallium Test (thal)

There is no thal value for 2 patients here. We estimated values for these missing records, but we could not merge them into the data set at this stage and left this for Phase II. In the bar chart for all patients, the most common heart status levels are Normal and Reversible Defect. However, the two bar charts drawn for the levels of the target feature differ: healthy people mostly showed a normal heart status, while sick people mostly showed a reversible defect.

Multivariate Visualisation

Scatter Plot Matrix

Age, Resting Blood Pressure (TRESTBPS), Serum Cholesterol (CHOL), Maximum Heart Rate Achieved (THALACH) and ST Depression Induced by Exercise Relative to Rest (OLDPEAK)

No pair of features shows a strong linear correlation. Blood pressure, cholesterol level and ST depression show a weak positive linear correlation with age, while maximum heart rate achieved shows a weak negative correlation with age. Maximum heart rate achieved also shows a weak negative linear relationship with ST depression induced by exercise relative to rest.
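
The visual impression from a scatter plot matrix can be complemented with a correlation matrix. The sketch below computes Pearson correlations on the first ten rows of the five numerical features from the str() output; with so few rows the coefficients will not match those of the full data set, and the data frame name num_head is hypothetical.

```r
# First ten rows of the numerical features, copied from the str() output
num_head <- data.frame(
  age      = c(63, 67, 67, 37, 41, 56, 62, 57, 63, 53),
  trestbps = c(145, 160, 120, 130, 130, 120, 140, 120, 130, 140),
  chol     = c(233, 286, 229, 250, 204, 236, 268, 354, 254, 203),
  thalach  = c(150, 108, 129, 187, 172, 178, 160, 163, 147, 155),
  oldpeak  = c(2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1)
)

# Pearson correlation between every pair of columns
r <- cor(num_head)
print(round(r, 2))
```

On the full data frame, the same call on the numerical columns would give the coefficients behind the statements above, and base R's `pairs()` draws a scatter plot matrix from the same columns.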

Sex, Resting ECG Results, and Exercise Induced Angina

As per the three-dimensional bar chart depicted below, there are no men in the data set with a resting ECG result of Level 1 and exercise induced angina. Most females without exercise induced angina did not have heart disease, across almost all levels of the resting ECG result.
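
The counts behind such a multi-way bar chart can be tabulated directly. The sketch below builds a four-way contingency table from the first ten rows of the str() output after conversion; the data frame name heart_head is hypothetical, and ten rows are far too few to reproduce the findings stated above.

```r
# First ten rows, decoded from the str() output after conversion
heart_head <- data.frame(
  sex     = factor(c("Male", "Male", "Male", "Male", "Female",
                     "Male", "Female", "Female", "Male", "Male"),
                   levels = c("Male", "Female")),
  restecg = factor(c("L2", "L2", "L2", "N", "L2", "N", "L2", "N", "L2", "L2"),
                   levels = c("N", "L1", "L2")),
  exang   = factor(c("No", "Yes", "Yes", "No", "No", "No", "No", "Yes", "No", "Yes"),
                   levels = c("No", "Yes")),
  target  = factor(c("No", "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes"),
                   levels = c("No", "Yes"))
)

# Count patients for every combination of the four factors
counts <- xtabs(~ sex + restecg + exang + target, data = heart_head)

# Flatten to a readable two-dimensional layout
print(ftable(counts, row.vars = c("sex", "restecg")))
```

Empty combinations appear as zeros in the flattened table, which makes gaps such as a missing sex and ECG level pairing easy to spot.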

Summary

Some numerical features with few values (2-4) were converted to factors to make the analysis more convenient. Two categorical features, ca and thal, had a few missing values, which also made their cardinality wrong. We estimated values for the missing cells and corrected the cardinality. The target feature was num, with 5 levels. We regenerated this feature with two levels and named it target. From the data exploration, we found that sex, serum cholesterol level, maximum heart rate achieved, chest pain type, resting ECG result, exercise induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment and the heart status as retrieved from the Thallium test may be potentially useful features in predicting the diagnosis of a patient’s heart disease.

Predicting a Person Diagnose a Heart Disease based on Cleveland Database

MATH 2319 Machine Learning Applied Project Phase I

Praneetha Meegahalanda Durage (s3685754)

08 April 2018