Data Source and Description

This section describes the source of the data and gives a high-level explanation of the features (taken from the website where the data was shared).

The data used for this project is an open-source dataset obtained from the UCI Machine Learning Repository. The data consists of 303 rows and 14 columns. The last column in the dataset is the target feature, which shows the presence of heart disease.

The details of all the features are listed below:

Age: Age of subject

Sex: Gender of subject: 0 = female, 1 = male

Chest-pain type: Type of chest pain experienced by the individual: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic

Resting Blood Pressure: Resting blood pressure in mm Hg

Serum Cholesterol: Serum cholesterol in mg/dl

Fasting Blood Sugar: Fasting blood sugar level relative to 120 mg/dl: 0 = fasting blood sugar <= 120 mg/dl, 1 = fasting blood sugar > 120 mg/dl

Resting ECG: Resting electrocardiographic results: 0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy

Max Heart Rate Achieved: Max heart rate of subject

Exercise Induced Angina: 0 = no, 1 = yes

ST Depression Induced by Exercise Relative to Rest: ST depression of subject

Peak Exercise ST Segment: 1 = upsloping, 2 = flat, 3 = downsloping

Number of Major Vessels (0-3) Visible on Fluoroscopy: Number of vessels visible under fluoroscopy

Thal: Form of thalassemia: 3 = normal, 6 = fixed defect, 7 = reversible defect

Diagnosis of Heart Disease: Indicates whether the subject has heart disease: 0 = absence, 1 to 4 = heart disease present
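
Because most of these features are stored as numeric codes, it can help to turn them into labelled factors before plotting or tabulating. The following is a minimal sketch on a toy data frame (the three rows are illustrative, not taken from the dataset); it uses only base R, and the label strings are assumptions based on the coding listed above.

```r
# Toy rows using the coding scheme listed above (values are illustrative only).
toy <- data.frame(
  Sex             = c(1, 0, 1),
  Chest_Pain_Type = c(4, 2, 3),
  Thalassemia     = c(3, 7, 6)
)

# Map each numeric code to its documented label.
toy$Sex <- factor(toy$Sex, levels = c(0, 1), labels = c("Female", "Male"))
toy$Chest_Pain_Type <- factor(
  toy$Chest_Pain_Type,
  levels = 1:4,
  labels = c("Typical angina", "Atypical angina", "Non-anginal pain", "Asymptomatic")
)
toy$Thalassemia <- factor(
  toy$Thalassemia,
  levels = c(3, 6, 7),
  labels = c("Normal", "Fixed defect", "Reversible defect")
)

print(toy)
```

Giving `levels` explicitly keeps the mapping stable even when a category happens to be absent from a subset of the data.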

Load the relevant packages and libraries

library(dplyr)
library(Rcpp)
library(lattice)
library(ggplot2)
library(proto)
library(RSQLite)
library(gsubfn)
library(caret)
library(sqldf)
library(Amelia)
library(BinMat)
library(tidyr)
library(tidyverse)
library(MASS)
library(Hmisc)
library(Formula)
library(klaR)
library(e1071)
library(survival)
library(ROCR)
library(mlbench)
library(readr)
library(skimr)
library(DataExplorer)
library(funModeling)

Load the dataset

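# Note: the outputs below report 302 rows, not the 303 described above. This
# suggests the file has no header row, so header = T consumed the first record
# as column names; header = FALSE would keep it.
heart<-read.csv("processed_cleveland_ori_data.csv", header = T)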
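
The outputs later in this section report 302 rows although the source describes 303, and two columns are read as character because they contain "?". One possible explanation is that the file has no header row and uses "?" for missing values. The sketch below assumes exactly that and reads three illustrative records from an inline string instead of the real file; `raw_text` and `demo` are hypothetical names.

```r
# Three illustrative records in the same comma-separated layout, with no header row.
# The "?" in the last record stands for a missing value.
raw_text <- "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
38,1,3,138,175,0,0,173,0,0.0,1,?,3,0"

demo <- read.csv(
  text = raw_text,
  header = FALSE,      # keep the first record as data
  na.strings = "?"     # read "?" as NA so the column stays numeric
)

dim(demo)              # rows and columns
sapply(demo, class)    # column classes
colSums(is.na(demo))   # missing values per column
```

With `na.strings = "?"`, the missing entries arrive as `NA` directly, so no later conversion from character to numeric is needed.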

Assign Column Names

#Prepare column names
names <- c("Age",
           "Sex",
           "Chest_Pain_Type",
           "Resting_Blood_Pressure",
           "Serum_Cholesterol",
           "Fasting_Blood_Sugar",
           "Resting_ECG",
           "Max_Heart_Rate_Achieved",
           "Exercise_Induced_Angina",
           "ST_Depression_Exercise",
           "Peak_Exercise_ST_Segment",
           "Num_Major_Vessels_Flouro",
           "Thalassemia",
           "Diagnosis_Heart_Disease")

#Apply column names to the dataframe
colnames(heart) <- names

Data Examination Using Head, Tail, Structure, Summary, Dimension, Glimpse

# HEAD / TAIL
# It allows us to see the first and last 6 rows by default. 

head(heart)
##   Age Sex Chest_Pain_Type Resting_Blood_Pressure Serum_Cholesterol
## 1  67   1               4                    160               286
## 2  67   1               4                    120               229
## 3  37   1               3                    130               250
## 4  41   0               2                    130               204
## 5  56   1               2                    120               236
## 6  62   0               4                    140               268
##   Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## 1                   0           2                     108
## 2                   0           2                     129
## 3                   0           0                     187
## 4                   0           2                     172
## 5                   0           0                     178
## 6                   0           2                     160
##   Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## 1                       1                    1.5                        2
## 2                       1                    2.6                        2
## 3                       0                    3.5                        3
## 4                       0                    1.4                        1
## 5                       0                    0.8                        1
## 6                       0                    3.6                        3
##   Num_Major_Vessels_Flouro Thalassemia Diagnosis_Heart_Disease
## 1                        3           3                       2
## 2                        2           7                       1
## 3                        0           3                       0
## 4                        0           3                       0
## 5                        0           3                       0
## 6                        2           3                       3
tail(heart)
##     Age Sex Chest_Pain_Type Resting_Blood_Pressure Serum_Cholesterol
## 297  57   0               4                    140               241
## 298  45   1               1                    110               264
## 299  68   1               4                    144               193
## 300  57   1               4                    130               131
## 301  57   0               2                    130               236
## 302  38   1               3                    138               175
##     Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## 297                   0           0                     123
## 298                   0           0                     132
## 299                   1           0                     141
## 300                   0           0                     115
## 301                   0           2                     174
## 302                   0           0                     173
##     Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## 297                       1                    0.2                        2
## 298                       0                    1.2                        2
## 299                       0                    3.4                        2
## 300                       1                    1.2                        2
## 301                       0                    0.0                        2
## 302                       0                    0.0                        1
##     Num_Major_Vessels_Flouro Thalassemia Diagnosis_Heart_Disease
## 297                        0           7                       1
## 298                        0           7                       1
## 299                        2           7                       2
## 300                        1           7                       3
## 301                        1           3                       1
## 302                        ?           3                       0
# Structure 
str(heart)
## 'data.frame':    302 obs. of  14 variables:
##  $ Age                     : int  67 67 37 41 56 62 57 63 53 57 ...
##  $ Sex                     : int  1 1 1 0 1 0 0 1 1 1 ...
##  $ Chest_Pain_Type         : int  4 4 3 2 2 4 4 4 4 4 ...
##  $ Resting_Blood_Pressure  : int  160 120 130 130 120 140 120 130 140 140 ...
##  $ Serum_Cholesterol       : int  286 229 250 204 236 268 354 254 203 192 ...
##  $ Fasting_Blood_Sugar     : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Resting_ECG             : int  2 2 0 2 0 2 0 2 2 0 ...
##  $ Max_Heart_Rate_Achieved : int  108 129 187 172 178 160 163 147 155 148 ...
##  $ Exercise_Induced_Angina : int  1 1 0 0 0 0 1 0 1 0 ...
##  $ ST_Depression_Exercise  : num  1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 0.4 ...
##  $ Peak_Exercise_ST_Segment: int  2 2 3 1 1 3 1 2 3 2 ...
##  $ Num_Major_Vessels_Flouro: chr  "3" "2" "0" "0" ...
##  $ Thalassemia             : chr  "3" "7" "3" "3" ...
##  $ Diagnosis_Heart_Disease : int  2 1 0 0 0 3 0 2 1 0 ...
# Summary
summary(heart)
##       Age             Sex         Chest_Pain_Type Resting_Blood_Pressure
##  Min.   :29.00   Min.   :0.0000   Min.   :1.000   Min.   : 94.0         
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:120.0         
##  Median :55.50   Median :1.0000   Median :3.000   Median :130.0         
##  Mean   :54.41   Mean   :0.6788   Mean   :3.166   Mean   :131.6         
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:140.0         
##  Max.   :77.00   Max.   :1.0000   Max.   :4.000   Max.   :200.0         
##  Serum_Cholesterol Fasting_Blood_Sugar  Resting_ECG     Max_Heart_Rate_Achieved
##  Min.   :126.0     Min.   :0.0000      Min.   :0.0000   Min.   : 71.0          
##  1st Qu.:211.0     1st Qu.:0.0000      1st Qu.:0.0000   1st Qu.:133.2          
##  Median :241.5     Median :0.0000      Median :0.5000   Median :153.0          
##  Mean   :246.7     Mean   :0.1457      Mean   :0.9868   Mean   :149.6          
##  3rd Qu.:275.0     3rd Qu.:0.0000      3rd Qu.:2.0000   3rd Qu.:166.0          
##  Max.   :564.0     Max.   :1.0000      Max.   :2.0000   Max.   :202.0          
##  Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
##  Min.   :0.0000          Min.   :0.000          Min.   :1.000           
##  1st Qu.:0.0000          1st Qu.:0.000          1st Qu.:1.000           
##  Median :0.0000          Median :0.800          Median :2.000           
##  Mean   :0.3278          Mean   :1.035          Mean   :1.596           
##  3rd Qu.:1.0000          3rd Qu.:1.600          3rd Qu.:2.000           
##  Max.   :1.0000          Max.   :6.200          Max.   :3.000           
##  Num_Major_Vessels_Flouro Thalassemia        Diagnosis_Heart_Disease
##  Length:302               Length:302         Min.   :0.0000         
##  Class :character         Class :character   1st Qu.:0.0000         
##  Mode  :character         Mode  :character   Median :0.0000         
##                                              Mean   :0.9404         
##                                              3rd Qu.:2.0000         
##                                              Max.   :4.0000
# DIMENSION
# Displays the dimensions of the table. The output takes the form of row, column.
dim(heart)
## [1] 302  14
# GLIMPSE
# Displays a vertical preview of the dataset: one row per column, showing its
# data type and the first few values.
glimpse(heart)
## Rows: 302
## Columns: 14
## $ Age                      <int> 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 5~
## $ Sex                      <int> 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, ~
## $ Chest_Pain_Type          <int> 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3, ~
## $ Resting_Blood_Pressure   <int> 160, 120, 130, 130, 120, 140, 120, 130, 140, ~
## $ Serum_Cholesterol        <int> 286, 229, 250, 204, 236, 268, 354, 254, 203, ~
## $ Fasting_Blood_Sugar      <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, ~
## $ Resting_ECG              <int> 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, ~
## $ Max_Heart_Rate_Achieved  <int> 108, 129, 187, 172, 178, 160, 163, 147, 155, ~
## $ Exercise_Induced_Angina  <int> 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, ~
## $ ST_Depression_Exercise   <dbl> 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, ~
## $ Peak_Exercise_ST_Segment <int> 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1, ~
## $ Num_Major_Vessels_Flouro <chr> "3", "2", "0", "0", "0", "2", "0", "1", "0", ~
## $ Thalassemia              <chr> "3", "7", "3", "3", "3", "3", "3", "7", "7", ~
## $ Diagnosis_Heart_Disease  <int> 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, ~

Statistical Info and Data Exploration

# Skim
# This function is a good addition to the summary function. 
# It displays most of the numerical attributes from summary, but it also 
# displays missing values, more quantile information and an inline histogram for each variable
skim(heart)
Data summary
Name                    heart
Number of rows          302
Number of columns       14
Column type frequency:
  character             2
  numeric               12
Group variables         None

Variable type: character

skim_variable             n_missing  complete_rate  min  max  empty  n_unique  whitespace
Num_Major_Vessels_Flouro          0              1    1    1      0         5           0
Thalassemia                       0              1    1    1      0         4           0

Variable type: numeric

skim_variable             n_missing  complete_rate    mean     sd   p0     p25    p50    p75   p100  hist
Age                               0              1   54.41   9.04   29   48.00   55.5   61.0   77.0  ▁▆▇▇▁
Sex                               0              1    0.68   0.47    0    0.00    1.0    1.0    1.0  ▃▁▁▁▇
Chest_Pain_Type                   0              1    3.17   0.95    1    3.00    3.0    4.0    4.0  ▁▃▁▅▇
Resting_Blood_Pressure            0              1  131.65  17.61   94  120.00  130.0  140.0  200.0  ▃▇▅▁▁
Serum_Cholesterol                 0              1  246.74  51.86  126  211.00  241.5  275.0  564.0  ▃▇▂▁▁
Fasting_Blood_Sugar               0              1    0.15   0.35    0    0.00    0.0    0.0    1.0  ▇▁▁▁▂
Resting_ECG                       0              1    0.99   0.99    0    0.00    0.5    2.0    2.0  ▇▁▁▁▇
Max_Heart_Rate_Achieved           0              1  149.61  22.91   71  133.25  153.0  166.0  202.0  ▁▂▅▇▂
Exercise_Induced_Angina           0              1    0.33   0.47    0    0.00    0.0    1.0    1.0  ▇▁▁▁▃
ST_Depression_Exercise            0              1    1.04   1.16    0    0.00    0.8    1.6    6.2  ▇▂▁▁▁
Peak_Exercise_ST_Segment          0              1    1.60   0.61    1    1.00    2.0    2.0    3.0  ▇▁▇▁▁
Diagnosis_Heart_Disease           0              1    0.94   1.23    0    0.00    0.0    2.0    4.0  ▇▃▂▂▁
# Analyzing categorical variables
# freq function runs for all factor or character variables automatically:
freq(heart)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
##   Num_Major_Vessels_Flouro frequency percentage cumulative_perc
## 1                        0       175      57.95           57.95
## 2                        1        65      21.52           79.47
## 3                        2        38      12.58           92.05
## 4                        3        20       6.62           98.67
## 5                        ?         4       1.32          100.00
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

##   Thalassemia frequency percentage cumulative_perc
## 1           3       166      54.97           54.97
## 2           7       117      38.74           93.71
## 3           6        17       5.63           99.34
## 4           ?         2       0.66          100.00
## [1] "Variables processed: Num_Major_Vessels_Flouro, Thalassemia"
# Analyzing numerical variables
# Quantitatively
# profiling_num runs for all numerical/integer variables automatically:
profiling_num(heart)
##                    variable        mean    std_dev variation_coef   p_01   p_05
## 1                       Age  54.4105960  9.0401625      0.1661471  35.00  40.00
## 2                       Sex   0.6788079  0.4677094      0.6890158   0.00   0.00
## 3           Chest_Pain_Type   3.1655629  0.9536115      0.3012455   1.00   1.00
## 4    Resting_Blood_Pressure 131.6456954 17.6122022      0.1337849 100.00 108.00
## 5         Serum_Cholesterol 246.7384106 51.8568287      0.2101693 149.00 175.05
## 6       Fasting_Blood_Sugar   0.1456954  0.3533861      2.4255137   0.00   0.00
## 7               Resting_ECG   0.9867550  0.9949157      1.0082703   0.00   0.00
## 8   Max_Heart_Rate_Achieved 149.6059603 22.9129589      0.1531554  95.01 108.05
## 9   Exercise_Induced_Angina   0.3278146  0.4701960      1.4343352   0.00   0.00
## 10   ST_Depression_Exercise   1.0354305  1.1607234      1.1210056   0.00   0.00
## 11 Peak_Exercise_ST_Segment   1.5960265  0.6119389      0.3834140   1.00   1.00
## 12  Diagnosis_Heart_Disease   0.9403974  1.2293844      1.3073031   0.00   0.00
##      p_25  p_50  p_75   p_95   p_99    skewness kurtosis   iqr        range_98
## 1   48.00  55.5  61.0  68.00  71.00 -0.20201573 2.466776 13.00        [35, 71]
## 2    0.00   1.0   1.0   1.00   1.00 -0.76588040 1.586573  1.00          [0, 1]
## 3    3.00   3.0   4.0   4.00   4.00 -0.84164245 2.603891  1.00          [1, 4]
## 4  120.00 130.0 140.0 160.00 180.00  0.70946162 3.853786 20.00      [100, 180]
## 5  211.00 241.5 275.0 326.95 406.87  1.12583520 7.373255 64.00   [149, 406.87]
## 6    0.00   0.0   0.0   1.00   1.00  2.00852657 5.034179  0.00          [0, 1]
## 7    0.00   0.5   2.0   2.00   2.00  0.02649061 1.014129  2.00          [0, 2]
## 8  133.25 153.0 166.0 181.95 191.98 -0.53373140 2.917824 32.75 [95.01, 191.98]
## 9    0.00   0.0   1.0   1.00   1.00  0.73361419 1.538190  1.00          [0, 1]
## 10   0.00   0.8   1.6   3.40   4.20  1.27532638 4.564464  1.60        [0, 4.2]
## 11   1.00   2.0   2.0   3.00   3.00  0.50118171 2.361785  1.00          [1, 3]
## 12   0.00   0.0   2.0   3.00   4.00  1.04845485 2.833627  2.00          [0, 4]
##          range_80
## 1        [42, 66]
## 2          [0, 1]
## 3          [2, 4]
## 4      [110, 152]
## 5  [188.4, 308.9]
## 6          [0, 1]
## 7          [0, 2]
## 8    [116, 176.8]
## 9          [0, 1]
## 10       [0, 2.8]
## 11         [1, 2]
## 12         [0, 3]
# Graphically
# plot_num, like profiling_num, runs automatically for all numerical/integer variables:
plot_num(heart)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

# Describe from Hmisc Package
# Analyzing numerical and categorical at the same time
describe(heart)
## heart 
## 
##  14  Variables      302  Observations
## --------------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      302        0       41    0.999    54.41     10.3     40.0     42.0 
##      .25      .50      .75      .90      .95 
##     48.0     55.5     61.0     66.0     68.0 
## 
## lowest : 29 34 35 37 38, highest: 70 71 74 76 77
## --------------------------------------------------------------------------------
## Sex 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      302        0        2    0.654      205   0.6788   0.4375 
## 
## --------------------------------------------------------------------------------
## Chest_Pain_Type 
##        n  missing distinct     Info     Mean      Gmd 
##      302        0        4    0.864    3.166        1 
##                                   
## Value          1     2     3     4
## Frequency     22    50    86   144
## Proportion 0.073 0.166 0.285 0.477
## --------------------------------------------------------------------------------
## Resting_Blood_Pressure 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      302        0       50    0.995    131.6    19.41      108      110 
##      .25      .50      .75      .90      .95 
##      120      130      140      152      160 
## 
## lowest :  94 100 101 102 104, highest: 174 178 180 192 200
## --------------------------------------------------------------------------------
## Serum_Cholesterol 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      302        0      152        1    246.7    56.02    175.1    188.4 
##      .25      .50      .75      .90      .95 
##    211.0    241.5    275.0    308.9    326.9 
## 
## lowest : 126 131 141 149 157, highest: 394 407 409 417 564
## --------------------------------------------------------------------------------
## Fasting_Blood_Sugar 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      302        0        2    0.373       44   0.1457   0.2498 
## 
## --------------------------------------------------------------------------------
## Resting_ECG 
##        n  missing distinct     Info     Mean      Gmd 
##      302        0        3     0.76   0.9868    1.003 
##                             
## Value          0     1     2
## Frequency    151     4   147
## Proportion 0.500 0.013 0.487
## --------------------------------------------------------------------------------
## Max_Heart_Rate_Achieved 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      302        0       91        1    149.6    25.78    108.0    116.0 
##      .25      .50      .75      .90      .95 
##    133.2    153.0    166.0    176.8    181.9 
## 
## lowest :  71  88  90  95  96, highest: 190 192 194 195 202
## --------------------------------------------------------------------------------
## Exercise_Induced_Angina 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      302        0        2    0.661       99   0.3278   0.4422 
## 
## --------------------------------------------------------------------------------
## ST_Depression_Exercise 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      302        0       40    0.964    1.035    1.223      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.8      1.6      2.8      3.4 
## 
## lowest : 0.0 0.1 0.2 0.3 0.4, highest: 4.0 4.2 4.4 5.6 6.2
## --------------------------------------------------------------------------------
## Peak_Exercise_ST_Segment 
##        n  missing distinct     Info     Mean      Gmd 
##      302        0        3    0.796    1.596    0.624 
##                             
## Value          1     2     3
## Frequency    142   140    20
## Proportion 0.470 0.464 0.066
## --------------------------------------------------------------------------------
## Num_Major_Vessels_Flouro 
##        n  missing distinct 
##      302        0        5 
## 
## lowest : ? 0 1 2 3, highest: ? 0 1 2 3
##                                         
## Value          ?     0     1     2     3
## Frequency      4   175    65    38    20
## Proportion 0.013 0.579 0.215 0.126 0.066
## --------------------------------------------------------------------------------
## Thalassemia 
##        n  missing distinct 
##      302        0        4 
##                                   
## Value          ?     3     6     7
## Frequency      2   166    17   117
## Proportion 0.007 0.550 0.056 0.387
## --------------------------------------------------------------------------------
## Diagnosis_Heart_Disease 
##        n  missing distinct     Info     Mean      Gmd 
##      302        0        5    0.833   0.9404    1.252 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency    163    55    36    35    13
## Proportion 0.540 0.182 0.119 0.116 0.043
## --------------------------------------------------------------------------------

Data Pre-processing

# The section below checks for missing values and performs missing value imputation (using the median)

heart$Num_Major_Vessels_Flouro[which(heart$Num_Major_Vessels_Flouro== "?")] <- NA
heart$Thalassemia[which(heart$Thalassemia== "?")] <- NA
colSums(is.na(heart))
##                      Age                      Sex          Chest_Pain_Type 
##                        0                        0                        0 
##   Resting_Blood_Pressure        Serum_Cholesterol      Fasting_Blood_Sugar 
##                        0                        0                        0 
##              Resting_ECG  Max_Heart_Rate_Achieved  Exercise_Induced_Angina 
##                        0                        0                        0 
##   ST_Depression_Exercise Peak_Exercise_ST_Segment Num_Major_Vessels_Flouro 
##                        0                        0                        4 
##              Thalassemia  Diagnosis_Heart_Disease 
##                        2                        0
# Change the data type

heart$Num_Major_Vessels_Flouro <- as.numeric(heart$Num_Major_Vessels_Flouro)

# Obtain the median value

median.result_heart <- median(heart$Num_Major_Vessels_Flouro, na.rm = TRUE)
median.result_heart
## [1] 0
# Missing Value Imputation
# Replace NA in the numeric column with its median (0, as computed above)
heart$Num_Major_Vessels_Flouro[is.na(heart$Num_Major_Vessels_Flouro)] <- median.result_heart
# Thalassemia is categorical, so replace NA with its most frequent value ("3")
heart$Thalassemia[is.na(heart$Thalassemia)] <- "3"
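
The imputation step can also be written as two small helpers, which avoids hard-coding the replacement value. This is a sketch only: `impute_median` and `impute_mode` are hypothetical names, and the input vectors are illustrative rather than taken from the dataset.

```r
# Replace NAs in a numeric vector with the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Replace NAs in a categorical vector with the most frequent observed value.
impute_mode <- function(x) {
  counts <- table(x)                               # table() skips NA by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

vessels <- c(0, 0, 1, NA, 3, 0, NA)        # illustrative values
thal    <- c("3", "7", NA, "3", "6", "3")  # illustrative values

impute_median(vessels)
impute_mode(thal)
```

Computing the fill value from the data means the code stays correct if the median or the most frequent category changes after further cleaning.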


# Recode the existing target feature "diagnosis" to a new feature called "target"

heart <- sqldf("select *,case when Diagnosis_Heart_Disease = 0 then 0 else 1 end target from heart")

head(heart)
##   Age Sex Chest_Pain_Type Resting_Blood_Pressure Serum_Cholesterol
## 1  67   1               4                    160               286
## 2  67   1               4                    120               229
## 3  37   1               3                    130               250
## 4  41   0               2                    130               204
## 5  56   1               2                    120               236
## 6  62   0               4                    140               268
##   Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## 1                   0           2                     108
## 2                   0           2                     129
## 3                   0           0                     187
## 4                   0           2                     172
## 5                   0           0                     178
## 6                   0           2                     160
##   Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## 1                       1                    1.5                        2
## 2                       1                    2.6                        2
## 3                       0                    3.5                        3
## 4                       0                    1.4                        1
## 5                       0                    0.8                        1
## 6                       0                    3.6                        3
##   Num_Major_Vessels_Flouro Thalassemia Diagnosis_Heart_Disease target
## 1                        3           3                       2      1
## 2                        2           7                       1      1
## 3                        0           3                       0      0
## 4                        0           3                       0      0
## 5                        0           3                       0      0
## 6                        2           3                       3      1
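
The same recoding can be done without SQL. The sketch below uses base R `ifelse()` on the first six `Diagnosis_Heart_Disease` values shown above; it is an alternative illustration, not the method used in this analysis.

```r
# First six Diagnosis_Heart_Disease values from the head() output above.
diagnosis <- c(2, 1, 0, 0, 0, 3)

# 0 stays 0 (no disease); any value from 1 to 4 becomes 1 (disease present).
target <- ifelse(diagnosis == 0, 0L, 1L)
target

# Cross-tabulate the original and recoded values to check the mapping.
table(diagnosis, target)
```

The cross-tabulation is a quick sanity check: every non-zero diagnosis should fall in the `target = 1` column.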

Analyse relationship between features in the dataset

# The relationship between Gender and heart disease

heart$Sex<-as.factor(heart$Sex)
levels(heart$Sex)<-c("Female","Male")
str(heart)
## 'data.frame':    302 obs. of  15 variables:
##  $ Age                     : int  67 67 37 41 56 62 57 63 53 57 ...
##  $ Sex                     : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1 1 2 2 2 ...
##  $ Chest_Pain_Type         : int  4 4 3 2 2 4 4 4 4 4 ...
##  $ Resting_Blood_Pressure  : int  160 120 130 130 120 140 120 130 140 140 ...
##  $ Serum_Cholesterol       : int  286 229 250 204 236 268 354 254 203 192 ...
##  $ Fasting_Blood_Sugar     : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Resting_ECG             : int  2 2 0 2 0 2 0 2 2 0 ...
##  $ Max_Heart_Rate_Achieved : int  108 129 187 172 178 160 163 147 155 148 ...
##  $ Exercise_Induced_Angina : int  1 1 0 0 0 0 1 0 1 0 ...
##  $ ST_Depression_Exercise  : num  1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 0.4 ...
##  $ Peak_Exercise_ST_Segment: int  2 2 3 1 1 3 1 2 3 2 ...
##  $ Num_Major_Vessels_Flouro: num  3 2 0 0 0 2 0 1 0 0 ...
##  $ Thalassemia             : chr  "3" "7" "3" "3" ...
##  $ Diagnosis_Heart_Disease : int  2 1 0 0 0 3 0 2 1 0 ...
##  $ target                  : int  1 1 0 0 0 1 0 1 1 0 ...

Exploratory Data Analysis

From the histogram, people aged between 50 and 70 appear to have the highest risk of heart disease compared with other age groups.
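
A histogram shows how many cases fall in each age range, which is not the same as the share of subjects in that range who have the disease. A minimal sketch of computing both per age band follows; the ages, outcomes and band boundaries are illustrative assumptions, not values from the dataset.

```r
# Illustrative ages and binary outcomes; not taken from the real dataset.
age    <- c(34, 41, 45, 52, 55, 58, 61, 63, 67, 72)
target <- c( 0,  0,  1,  0,  1,  1,  1,  0,  1,  0)

# Bin ages into bands; right = FALSE makes each band [lower, upper).
age_band <- cut(age, breaks = c(30, 50, 70, 80), right = FALSE)

cases_per_band <- tapply(target, age_band, sum)    # what a count histogram shows
rate_per_band  <- tapply(target, age_band, mean)   # share of subjects with disease

cases_per_band
rate_per_band
```

Comparing the two vectors shows whether a band has many cases simply because it contains many subjects.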

# Count the subjects in each chest-pain category (codes 1 to 4)
chest_pain_counts <- table(heart$Chest_Pain_Type)

pie(chest_pain_counts)

Create Logistic Regression Model

# First, create a Train-test split with 75% data included in the training set.

set.seed(123) 
# The seed (123) makes the random split reproducible.
index<-sample(nrow(heart),0.75*nrow(heart))
train<-heart[index,]
test<-heart[-index,]
dim(train)
## [1] 226  15
dim(test)
## [1] 76 15
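# Note: the response here is Exercise_Induced_Angina, not the recoded target.
# Modelling heart disease would instead need target as the response, with
# Diagnosis_Heart_Disease excluded from the predictors.
modelblr<-glm(Exercise_Induced_Angina~.,data = train,family = "binomial")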

train$pred<-fitted(modelblr)
# fitted() returns predicted scores only for the data the model was fitted on.
head(train)
##     Age    Sex Chest_Pain_Type Resting_Blood_Pressure Serum_Cholesterol
## 179  53   Male               3                    130               246
## 14   52   Male               3                    172               199
## 195  67   Male               4                    100               299
## 118  63   Male               4                    130               330
## 229  66   Male               4                    112               212
## 244  60 Female               3                    120               178
##     Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## 179                   1           2                     173
## 14                    1           0                     162
## 195                   0           2                     125
## 118                   1           2                     132
## 229                   0           2                     132
## 244                   1           0                      96
##     Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## 179                       0                    0.0                        1
## 14                        0                    0.5                        1
## 195                       1                    0.9                        2
## 118                       1                    1.8                        1
## 229                       1                    0.1                        1
## 244                       0                    0.0                        1
##     Num_Major_Vessels_Flouro Thalassemia Diagnosis_Heart_Disease target
## 179                        3           3                       0      0
## 14                         0           7                       0      0
## 195                        2           3                       3      1
## 118                        3           7                       3      1
## 229                        1           3                       2      1
## 244                        0           3                       0      0
##           pred
## 179 0.03221014
## 14  0.16108116
## 195 0.65497840
## 118 0.56884660
## 229 0.54523593
## 244 0.29515345
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
pred<-prediction(train$pred,train$Exercise_Induced_Angina)
perf<-performance(pred,"tpr","fpr")
plot(perf,colorize = T,print.cutoffs.at = seq(0.1,by = 0.1),
     lwd=4, axes = F,ylab="TPR",xlab="FPR")
axis(side = 1 ,col=7)
axis(side = 2,col=7 )
grid()
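
The ROC curve is often summarised by the area under it (AUC). As a sketch of what that number means, the helper below computes AUC in base R through the rank-sum identity: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. `auc_rank` is a hypothetical helper, and the scores and labels are illustrative.

```r
# AUC via the rank-sum (Mann-Whitney) identity.
auc_rank <- function(scores, labels) {
  r     <- rank(scores)            # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(0.10, 0.40, 0.35, 0.80)   # illustrative predicted probabilities
labels <- c(0, 0, 1, 1)               # illustrative true classes

auc_rank(scores, labels)              # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```

With ROCR itself, `performance(pred, "auc")` reports the same quantity for a `prediction` object.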

train$pred1<-ifelse(train$pred<0.6,"No","Yes")
library(caret)
## Loading required package: lattice
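
The steps above score only the training rows. A held-out evaluation would apply the fitted model to the test rows with `predict()`. The sketch below shows the pattern on simulated stand-in data (one predictor `x` and a binary outcome `y`, both hypothetical) so that it runs on its own; it reuses the 0.6 cutoff from above.

```r
set.seed(1)

# Simulated stand-in data: one predictor and a binary outcome (not the heart data).
n   <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(1.5 * sim$x))

# 75/25 train-test split, as in the analysis above.
idx       <- sample(n, 0.75 * n)
sim_train <- sim[idx, ]
sim_test  <- sim[-idx, ]

fit <- glm(y ~ x, data = sim_train, family = "binomial")

# predict() with newdata scores unseen rows; type = "response" returns probabilities.
test_prob  <- predict(fit, newdata = sim_test, type = "response")
test_class <- ifelse(test_prob < 0.6, 0, 1)   # same 0.6 cutoff as above

conf_mat <- table(Predicted = test_class, Actual = sim_test$y)
conf_mat
sum(test_class == sim_test$y) / nrow(sim_test)   # test-set accuracy
```

Unlike `fitted()`, `predict()` accepts `newdata`, which is what makes evaluation on the test split possible.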

Create Naive Bayes Model

indxTrain <- createDataPartition(y = heart$target,p = 0.75,list = FALSE)
training <- heart[indxTrain,]
testing <- heart[-indxTrain,]

#Check the class proportions of the target before and after the split
prop.table(table(heart$target)) * 100
## 
##        0        1 
## 53.97351 46.02649
prop.table(table(training$target)) * 100
## 
##        0        1 
## 51.54185 48.45815
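#create object x, which holds the predictor variables, and y, which holds the response variable
#Note: column 10 is ST_Depression_Exercise, so x still includes target and
#Diagnosis_Heart_Disease; the perfect accuracy reported below likely reflects this leakage.
x = training[,-10]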
y = training$target

y <- as.factor(y)
defaultW <- getOption("warn") 
options(warn = -1) 
model = train(x,y,'nb',trControl=trainControl(method='cv',number=10))
model
## Naive Bayes 
## 
## 227 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 204, 205, 204, 204, 205, 204, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy  Kappa
##   FALSE      NaN       NaN  
##    TRUE        1         1  
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
##  = 1.
#Model Evaluation
#Predict testing set
Predict <- predict(model,newdata = testing ) 
#Get the confusion matrix to see accuracy value and other parameter values
#Confusion Matrix and Statistics
confusionMatrix(Predict, as.factor(testing$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 46  0
##          1  0 29
##                                     
##                Accuracy : 1         
##                  95% CI : (0.952, 1)
##     No Information Rate : 0.6133    
##     P-Value [Acc > NIR] : < 2.2e-16 
##                                     
##                   Kappa : 1         
##                                     
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.0000    
##             Specificity : 1.0000    
##          Pos Pred Value : 1.0000    
##          Neg Pred Value : 1.0000    
##              Prevalence : 0.6133    
##          Detection Rate : 0.6133    
##    Detection Prevalence : 0.6133    
##       Balanced Accuracy : 1.0000    
##                                     
##        'Positive' Class : 0         
## 
options(warn = defaultW)
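
The statistics in the confusion matrix output follow directly from the four cell counts. The following short sketch recomputes them from the counts reported above (46, 0, 0, 29), with class 0 as the positive class as in the output. The variable names are illustrative.

```r
# Rows are predictions, columns are reference labels, as in the caret output
cm <- matrix(c(46, 0,
                0, 29), nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

tp <- cm["0", "0"]   # predicted 0, truly 0 (class 0 is 'positive')
fn <- cm["1", "0"]   # predicted 1, truly 0
fp <- cm["0", "1"]   # predicted 0, truly 1
tn <- cm["1", "1"]   # predicted 1, truly 1

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
prevalence  <- (tp + fn) / sum(cm)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, prevalence = prevalence), 4))
```

The prevalence of 46/75 matches the 0.6133 shown in the output, which is also the no-information rate.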

Create Random Forest Model (Min)

# Import library
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
# Set the seed so the random split is reproducible
set.seed(100)

# Change column 'target' to factor
heart$target <- as.factor(heart$target)

# Check the latest class of 'target'
str(heart)
## 'data.frame':    302 obs. of  15 variables:
##  $ Age                     : int  67 67 37 41 56 62 57 63 53 57 ...
##  $ Sex                     : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1 1 2 2 2 ...
##  $ Chest_Pain_Type         : int  4 4 3 2 2 4 4 4 4 4 ...
##  $ Resting_Blood_Pressure  : int  160 120 130 130 120 140 120 130 140 140 ...
##  $ Serum_Cholesterol       : int  286 229 250 204 236 268 354 254 203 192 ...
##  $ Fasting_Blood_Sugar     : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Resting_ECG             : int  2 2 0 2 0 2 0 2 2 0 ...
##  $ Max_Heart_Rate_Achieved : int  108 129 187 172 178 160 163 147 155 148 ...
##  $ Exercise_Induced_Angina : int  1 1 0 0 0 0 1 0 1 0 ...
##  $ ST_Depression_Exercise  : num  1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 0.4 ...
##  $ Peak_Exercise_ST_Segment: int  2 2 3 1 1 3 1 2 3 2 ...
##  $ Num_Major_Vessels_Flouro: num  3 2 0 0 0 2 0 1 0 0 ...
##  $ Thalassemia             : chr  "3" "7" "3" "3" ...
##  $ Diagnosis_Heart_Disease : int  2 1 0 0 0 3 0 2 1 0 ...
##  $ target                  : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 1 ...
# Split dataset into training and testing set with probability 75% & 25% 
rf_sample <- sample(2, nrow(heart), replace = TRUE, prob = c(0.75, 0.25))
rf_train <- heart[rf_sample==1,]
rf_test <- heart[rf_sample==2,]
str(rf_train)
## 'data.frame':    228 obs. of  15 variables:
##  $ Age                     : int  67 67 37 41 56 62 63 53 57 56 ...
##  $ Sex                     : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1 2 2 2 1 ...
##  $ Chest_Pain_Type         : int  4 4 3 2 2 4 4 4 4 2 ...
##  $ Resting_Blood_Pressure  : int  160 120 130 130 120 140 130 140 140 140 ...
##  $ Serum_Cholesterol       : int  286 229 250 204 236 268 254 203 192 294 ...
##  $ Fasting_Blood_Sugar     : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Resting_ECG             : int  2 2 0 2 0 2 2 2 0 2 ...
##  $ Max_Heart_Rate_Achieved : int  108 129 187 172 178 160 147 155 148 153 ...
##  $ Exercise_Induced_Angina : int  1 1 0 0 0 0 0 1 0 0 ...
##  $ ST_Depression_Exercise  : num  1.5 2.6 3.5 1.4 0.8 3.6 1.4 3.1 0.4 1.3 ...
##  $ Peak_Exercise_ST_Segment: int  2 2 3 1 1 3 2 3 2 2 ...
##  $ Num_Major_Vessels_Flouro: num  3 2 0 0 0 2 1 0 0 0 ...
##  $ Thalassemia             : chr  "3" "7" "3" "3" ...
##  $ Diagnosis_Heart_Disease : int  2 1 0 0 0 3 2 1 0 0 ...
##  $ target                  : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 2 1 1 ...
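
`sample(2, n, replace = TRUE, prob = ...)` assigns each row independently, so the training share is only approximately 75% (228 of 302 rows here). The following sketch shows an alternative that fixes the training size exactly. The variable names are illustrative and not part of the original analysis.

```r
set.seed(100)
n <- 302                                   # number of rows, as in str(heart)

# Draw exactly floor(0.75 * n) distinct row numbers for training
train_rows <- sample(seq_len(n), size = floor(0.75 * n))
test_rows  <- setdiff(seq_len(n), train_rows)

print(c(train = length(train_rows), test = length(test_rows)))
```

The two index vectors can then be used as `heart[train_rows, ]` and `heart[test_rows, ]`.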
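# Fit the random forest. The formula target ~ . uses every other column, including
# Diagnosis_Heart_Disease, which matches target (0 vs. > 0) in the rows shown above,
# so the 0% error below likely reflects leakage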
rf <- randomForest(target~., data=rf_train, proximity=TRUE) 
print(rf)
## 
## Call:
##  randomForest(formula = target ~ ., data = rf_train, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0%
## Confusion matrix:
##     0   1 class.error
## 0 121   0           0
## 1   0 107           0
# Check accuracy on the training and testing sets
p1 <- predict(rf, rf_train)
confusionMatrix(p1, rf_train$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 121   0
##          1   0 107
##                                     
##                Accuracy : 1         
##                  95% CI : (0.984, 1)
##     No Information Rate : 0.5307    
##     P-Value [Acc > NIR] : < 2.2e-16 
##                                     
##                   Kappa : 1         
##                                     
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.0000    
##             Specificity : 1.0000    
##          Pos Pred Value : 1.0000    
##          Neg Pred Value : 1.0000    
##              Prevalence : 0.5307    
##          Detection Rate : 0.5307    
##    Detection Prevalence : 0.5307    
##       Balanced Accuracy : 1.0000    
##                                     
##        'Positive' Class : 0         
## 
p2 <- predict(rf, rf_test)
confusionMatrix(p2, as.factor(rf_test$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 42  0
##          1  0 32
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9514, 1)
##     No Information Rate : 0.5676     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5676     
##          Detection Rate : 0.5676     
##    Detection Prevalence : 0.5676     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 
plot(rf)

Data Visualization

library(ggplot2)
ggplot(heart, aes(x = Age, fill = factor(Diagnosis_Heart_Disease))) +
  geom_histogram(binwidth = 1, color = "black") +
  labs(x = "Age", y = "Frequency", fill = "Diagnosis", title = "Heart disease and age")

Create an HTML Report

heart1 <- heart
# Convert some coded variables to factors for report generation

heart1$Num_Major_Vessels_Flouro = factor(heart1$Num_Major_Vessels_Flouro, levels = c("0","1","2","3"), labels = c(0,1,2,3))

heart1$Thalassemia = factor(heart1$Thalassemia, levels = c("3","6","7"), labels = c(3,6,7))
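
When `levels` is given explicitly, any value not listed becomes `NA`, which is worth checking before profiling the data. The following is a small sketch on made-up values; the `"?"` entry is a hypothetical placeholder for a missing code.

```r
# Made-up raw codes, including one value outside the expected levels
thal_raw <- c("3", "7", "6", "?", "3")

thal <- factor(thal_raw, levels = c("3", "6", "7"))
print(table(thal, useNA = "ifany"))

# Count the values that did not match any level
n_missing <- sum(is.na(thal))
print(n_missing)  # 1
```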

# Reporting
# create_report() in DataExplorer builds a full data profile of the data frame.
# It produces an HTML file with basic statistics, structure, missing data,
# distribution visualizations, a correlation matrix and a principal component analysis.

DataExplorer::create_report(heart1)
## Output created: report.html

```