This section describes the source of the data and gives a high-level explanation of the features (taken from the website where the data was shared).
The data used for this project is a publicly available dataset obtained from the UCI Machine Learning Repository. The data consists of 303 rows and 14 columns. The last column in the dataset is the target feature, which indicates the presence of heart disease.
The details of all the features are listed below:
Age: Age of subject
Sex: Gender of subject: 0 = female; 1 = male
Chest-pain type: Type of chest pain experienced by the individual: 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic
Resting Blood Pressure: Resting blood pressure in mm Hg
Serum Cholesterol: Serum cholesterol in mg/dl
Fasting Blood Sugar: Fasting blood sugar level relative to 120 mg/dl: 0 = fasting blood sugar <= 120 mg/dl; 1 = fasting blood sugar > 120 mg/dl
Resting ECG: Resting electrocardiographic results: 0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy
Max Heart Rate Achieved: Max heart rate of subject
Exercise Induced Angina: 0 = no; 1 = yes
ST Depression Induced by Exercise Relative to Rest: ST depression of subject
Peak Exercise ST Segment: 1 = upsloping; 2 = flat; 3 = downsloping
Number of Major Vessels (0-3) Visible on Fluoroscopy: Number of vessels visible under fluoroscopy
Thal: Form of thalassemia: 3 = normal; 6 = fixed defect; 7 = reversible defect
Diagnosis of Heart Disease: Indicates whether the subject is suffering from heart disease or not: 0 = absence; 1-4 = heart disease present
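The coded values in the list above can be turned into readable labels with `factor()`. The following is a minimal sketch for the chest-pain codes only; the vector `chest_pain_codes` is a hypothetical sample, not part of the original analysis.

```r
# Hypothetical sample of coded chest-pain values (1 to 4)
chest_pain_codes <- c(4, 4, 3, 2, 1)

# Map each code to the label given in the feature list
chest_pain <- factor(
  chest_pain_codes,
  levels = 1:4,
  labels = c("typical angina", "atypical angina",
             "non-anginal pain", "asymptomatic")
)

# Count how often each label occurs
table(chest_pain)
```

Keeping the codes as a factor, rather than as integers, also stops modelling functions from treating the categories as an ordered numeric scale.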
### Load the relevant packages and libraries
library(dplyr)
library(Rcpp)
library(lattice)
library(ggplot2)
library(proto)
library(RSQLite)
library(gsubfn)
library(caret)
library(sqldf)
library(Amelia)
library(BinMat)
library(tidyr)
library(tidyverse)
library(MASS)
library(Hmisc)
library(Formula)
library(klaR)
library(e1071)
library(survival)
library(ROCR)
library(mlbench)
library(readr)
library(skimr)
library(DataExplorer)
library(funModeling)
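# Note: header = TRUE treats the first line of the file as column names.
# If the file has no header row, its first record is consumed as the header,
# which would explain why the outputs below report 302 rows rather than 303.
heart <- read.csv("processed_cleveland_ori_data.csv", header = TRUE)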
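The effect of the `header` argument can be checked on a small inline sample. This sketch assumes the file is comma-separated with no header row; the three records are copied from the `head()` and `tail()` output in this report, and `raw`, `with_header` and `no_header` are illustrative names.

```r
# Three records in a headerless, comma-separated layout (assumed file format)
raw <- "67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
38,1,3,138,175,0,0,173,0,0.0,1,?,3,0"

# header = TRUE uses the first record as column names, so one row is lost
with_header <- read.csv(text = raw, header = TRUE)

# header = FALSE keeps every record; na.strings turns "?" into NA on import
no_header <- read.csv(text = raw, header = FALSE, na.strings = "?")

nrow(with_header)
nrow(no_header)
colSums(is.na(no_header))
```

Reading with `na.strings = "?"` would also keep the affected columns numeric instead of character, which removes the need for a separate recoding step.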
#Prepare column names
names <- c("Age",
"Sex",
"Chest_Pain_Type",
"Resting_Blood_Pressure",
"Serum_Cholesterol",
"Fasting_Blood_Sugar",
"Resting_ECG",
"Max_Heart_Rate_Achieved",
"Exercise_Induced_Angina",
"ST_Depression_Exercise",
"Peak_Exercise_ST_Segment",
"Num_Major_Vessels_Flouro",
"Thalassemia",
"Diagnosis_Heart_Disease")
#Apply column names to the dataframe
colnames(heart) <- names
# HEAD / TAIL
# It allows us to see the first and last 6 rows by default.
head(heart)
## Age Sex Chest_Pain_Type Resting_Blood_Pressure Serum_Cholesterol
## 1 67 1 4 160 286
## 2 67 1 4 120 229
## 3 37 1 3 130 250
## 4 41 0 2 130 204
## 5 56 1 2 120 236
## 6 62 0 4 140 268
## Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## 1 0 2 108
## 2 0 2 129
## 3 0 0 187
## 4 0 2 172
## 5 0 0 178
## 6 0 2 160
## Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## 1 1 1.5 2
## 2 1 2.6 2
## 3 0 3.5 3
## 4 0 1.4 1
## 5 0 0.8 1
## 6 0 3.6 3
## Num_Major_Vessels_Flouro Thalassemia Diagnosis_Heart_Disease
## 1 3 3 2
## 2 2 7 1
## 3 0 3 0
## 4 0 3 0
## 5 0 3 0
## 6 2 3 3
tail(heart)
## Age Sex Chest_Pain_Type Resting_Blood_Pressure Serum_Cholesterol
## 297 57 0 4 140 241
## 298 45 1 1 110 264
## 299 68 1 4 144 193
## 300 57 1 4 130 131
## 301 57 0 2 130 236
## 302 38 1 3 138 175
## Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## 297 0 0 123
## 298 0 0 132
## 299 1 0 141
## 300 0 0 115
## 301 0 2 174
## 302 0 0 173
## Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## 297 1 0.2 2
## 298 0 1.2 2
## 299 0 3.4 2
## 300 1 1.2 2
## 301 0 0.0 2
## 302 0 0.0 1
## Num_Major_Vessels_Flouro Thalassemia Diagnosis_Heart_Disease
## 297 0 7 1
## 298 0 7 1
## 299 2 7 2
## 300 1 7 3
## 301 1 3 1
## 302 ? 3 0
# Structure
str(heart)
## 'data.frame': 302 obs. of 14 variables:
## $ Age : int 67 67 37 41 56 62 57 63 53 57 ...
## $ Sex : int 1 1 1 0 1 0 0 1 1 1 ...
## $ Chest_Pain_Type : int 4 4 3 2 2 4 4 4 4 4 ...
## $ Resting_Blood_Pressure : int 160 120 130 130 120 140 120 130 140 140 ...
## $ Serum_Cholesterol : int 286 229 250 204 236 268 354 254 203 192 ...
## $ Fasting_Blood_Sugar : int 0 0 0 0 0 0 0 0 1 0 ...
## $ Resting_ECG : int 2 2 0 2 0 2 0 2 2 0 ...
## $ Max_Heart_Rate_Achieved : int 108 129 187 172 178 160 163 147 155 148 ...
## $ Exercise_Induced_Angina : int 1 1 0 0 0 0 1 0 1 0 ...
## $ ST_Depression_Exercise : num 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 0.4 ...
## $ Peak_Exercise_ST_Segment: int 2 2 3 1 1 3 1 2 3 2 ...
## $ Num_Major_Vessels_Flouro: chr "3" "2" "0" "0" ...
## $ Thalassemia : chr "3" "7" "3" "3" ...
## $ Diagnosis_Heart_Disease : int 2 1 0 0 0 3 0 2 1 0 ...
# Summary
summary(heart)
## Age Sex Chest_Pain_Type Resting_Blood_Pressure
## Min. :29.00 Min. :0.0000 Min. :1.000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:120.0
## Median :55.50 Median :1.0000 Median :3.000 Median :130.0
## Mean :54.41 Mean :0.6788 Mean :3.166 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :4.000 Max. :200.0
## Serum_Cholesterol Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.2
## Median :241.5 Median :0.0000 Median :0.5000 Median :153.0
## Mean :246.7 Mean :0.1457 Mean :0.9868 Mean :149.6
## 3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:2.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## Min. :0.0000 Min. :0.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000
## Median :0.0000 Median :0.800 Median :2.000
## Mean :0.3278 Mean :1.035 Mean :1.596
## 3rd Qu.:1.0000 3rd Qu.:1.600 3rd Qu.:2.000
## Max. :1.0000 Max. :6.200 Max. :3.000
## Num_Major_Vessels_Flouro Thalassemia Diagnosis_Heart_Disease
## Length:302 Length:302 Min. :0.0000
## Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Median :0.0000
## Mean :0.9404
## 3rd Qu.:2.0000
## Max. :4.0000
# DIMENSION
# Displays the dimensions of the table. The output takes the form of row, column.
dim(heart)
## [1] 302 14
### GLIMPSE
# Displays the type and a preview of all columns as a row so that it's very easy to take in.
# This will display a vertical preview of the dataset.
# It allows us to easily preview the data type and sample data.
glimpse(heart)
## Rows: 302
## Columns: 14
## $ Age <int> 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 5~
## $ Sex <int> 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, ~
## $ Chest_Pain_Type <int> 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3, ~
## $ Resting_Blood_Pressure <int> 160, 120, 130, 130, 120, 140, 120, 130, 140, ~
## $ Serum_Cholesterol <int> 286, 229, 250, 204, 236, 268, 354, 254, 203, ~
## $ Fasting_Blood_Sugar <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, ~
## $ Resting_ECG <int> 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, ~
## $ Max_Heart_Rate_Achieved <int> 108, 129, 187, 172, 178, 160, 163, 147, 155, ~
## $ Exercise_Induced_Angina <int> 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, ~
## $ ST_Depression_Exercise <dbl> 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, ~
## $ Peak_Exercise_ST_Segment <int> 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1, ~
## $ Num_Major_Vessels_Flouro <chr> "3", "2", "0", "0", "0", "2", "0", "1", "0", ~
## $ Thalassemia <chr> "3", "7", "3", "3", "3", "3", "3", "7", "7", ~
## $ Diagnosis_Heart_Disease <int> 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, ~
# Skim
# This function is a good addition to the summary function.
# It displays most of the numerical attributes from summary, but it also
# displays missing values, more quantile information and an inline histogram for each variable
skim(heart)
| | |
|---|---|
| Name | heart |
| Number of rows | 302 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Num_Major_Vessels_Flouro | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
| Thalassemia | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 54.41 | 9.04 | 29 | 48.00 | 55.5 | 61.0 | 77.0 | ▁▆▇▇▁ |
| Sex | 0 | 1 | 0.68 | 0.47 | 0 | 0.00 | 1.0 | 1.0 | 1.0 | ▃▁▁▁▇ |
| Chest_Pain_Type | 0 | 1 | 3.17 | 0.95 | 1 | 3.00 | 3.0 | 4.0 | 4.0 | ▁▃▁▅▇ |
| Resting_Blood_Pressure | 0 | 1 | 131.65 | 17.61 | 94 | 120.00 | 130.0 | 140.0 | 200.0 | ▃▇▅▁▁ |
| Serum_Cholesterol | 0 | 1 | 246.74 | 51.86 | 126 | 211.00 | 241.5 | 275.0 | 564.0 | ▃▇▂▁▁ |
| Fasting_Blood_Sugar | 0 | 1 | 0.15 | 0.35 | 0 | 0.00 | 0.0 | 0.0 | 1.0 | ▇▁▁▁▂ |
| Resting_ECG | 0 | 1 | 0.99 | 0.99 | 0 | 0.00 | 0.5 | 2.0 | 2.0 | ▇▁▁▁▇ |
| Max_Heart_Rate_Achieved | 0 | 1 | 149.61 | 22.91 | 71 | 133.25 | 153.0 | 166.0 | 202.0 | ▁▂▅▇▂ |
| Exercise_Induced_Angina | 0 | 1 | 0.33 | 0.47 | 0 | 0.00 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▃ |
| ST_Depression_Exercise | 0 | 1 | 1.04 | 1.16 | 0 | 0.00 | 0.8 | 1.6 | 6.2 | ▇▂▁▁▁ |
| Peak_Exercise_ST_Segment | 0 | 1 | 1.60 | 0.61 | 1 | 1.00 | 2.0 | 2.0 | 3.0 | ▇▁▇▁▁ |
| Diagnosis_Heart_Disease | 0 | 1 | 0.94 | 1.23 | 0 | 0.00 | 0.0 | 2.0 | 4.0 | ▇▃▂▂▁ |
# Analyzing categorical variables
# freq function runs for all factor or character variables automatically:
freq(heart)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Num_Major_Vessels_Flouro frequency percentage cumulative_perc
## 1 0 175 57.95 57.95
## 2 1 65 21.52 79.47
## 3 2 38 12.58 92.05
## 4 3 20 6.62 98.67
## 5 ? 4 1.32 100.00
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Thalassemia frequency percentage cumulative_perc
## 1 3 166 54.97 54.97
## 2 7 117 38.74 93.71
## 3 6 17 5.63 99.34
## 4 ? 2 0.66 100.00
## [1] "Variables processed: Num_Major_Vessels_Flouro, Thalassemia"
# Analyzing numerical variables
# Quantitatively
# profiling_num runs for all numerical/integer variables automatically:
profiling_num(heart)
## variable mean std_dev variation_coef p_01 p_05
## 1 Age 54.4105960 9.0401625 0.1661471 35.00 40.00
## 2 Sex 0.6788079 0.4677094 0.6890158 0.00 0.00
## 3 Chest_Pain_Type 3.1655629 0.9536115 0.3012455 1.00 1.00
## 4 Resting_Blood_Pressure 131.6456954 17.6122022 0.1337849 100.00 108.00
## 5 Serum_Cholesterol 246.7384106 51.8568287 0.2101693 149.00 175.05
## 6 Fasting_Blood_Sugar 0.1456954 0.3533861 2.4255137 0.00 0.00
## 7 Resting_ECG 0.9867550 0.9949157 1.0082703 0.00 0.00
## 8 Max_Heart_Rate_Achieved 149.6059603 22.9129589 0.1531554 95.01 108.05
## 9 Exercise_Induced_Angina 0.3278146 0.4701960 1.4343352 0.00 0.00
## 10 ST_Depression_Exercise 1.0354305 1.1607234 1.1210056 0.00 0.00
## 11 Peak_Exercise_ST_Segment 1.5960265 0.6119389 0.3834140 1.00 1.00
## 12 Diagnosis_Heart_Disease 0.9403974 1.2293844 1.3073031 0.00 0.00
## p_25 p_50 p_75 p_95 p_99 skewness kurtosis iqr range_98
## 1 48.00 55.5 61.0 68.00 71.00 -0.20201573 2.466776 13.00 [35, 71]
## 2 0.00 1.0 1.0 1.00 1.00 -0.76588040 1.586573 1.00 [0, 1]
## 3 3.00 3.0 4.0 4.00 4.00 -0.84164245 2.603891 1.00 [1, 4]
## 4 120.00 130.0 140.0 160.00 180.00 0.70946162 3.853786 20.00 [100, 180]
## 5 211.00 241.5 275.0 326.95 406.87 1.12583520 7.373255 64.00 [149, 406.87]
## 6 0.00 0.0 0.0 1.00 1.00 2.00852657 5.034179 0.00 [0, 1]
## 7 0.00 0.5 2.0 2.00 2.00 0.02649061 1.014129 2.00 [0, 2]
## 8 133.25 153.0 166.0 181.95 191.98 -0.53373140 2.917824 32.75 [95.01, 191.98]
## 9 0.00 0.0 1.0 1.00 1.00 0.73361419 1.538190 1.00 [0, 1]
## 10 0.00 0.8 1.6 3.40 4.20 1.27532638 4.564464 1.60 [0, 4.2]
## 11 1.00 2.0 2.0 3.00 3.00 0.50118171 2.361785 1.00 [1, 3]
## 12 0.00 0.0 2.0 3.00 4.00 1.04845485 2.833627 2.00 [0, 4]
## range_80
## 1 [42, 66]
## 2 [0, 1]
## 3 [2, 4]
## 4 [110, 152]
## 5 [188.4, 308.9]
## 6 [0, 1]
## 7 [0, 2]
## 8 [116, 176.8]
## 9 [0, 1]
## 10 [0, 2.8]
## 11 [1, 2]
## 12 [0, 3]
# Graphically
# Plot_num and profiling_num. Both run automatically for all numerical/integer variables:
plot_num(heart)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
# Describe from Hmisc Package
# Analyzing numerical and categorical at the same time
describe(heart)
## heart
##
## 14 Variables 302 Observations
## --------------------------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd .05 .10
## 302 0 41 0.999 54.41 10.3 40.0 42.0
## .25 .50 .75 .90 .95
## 48.0 55.5 61.0 66.0 68.0
##
## lowest : 29 34 35 37 38, highest: 70 71 74 76 77
## --------------------------------------------------------------------------------
## Sex
## n missing distinct Info Sum Mean Gmd
## 302 0 2 0.654 205 0.6788 0.4375
##
## --------------------------------------------------------------------------------
## Chest_Pain_Type
## n missing distinct Info Mean Gmd
## 302 0 4 0.864 3.166 1
##
## Value 1 2 3 4
## Frequency 22 50 86 144
## Proportion 0.073 0.166 0.285 0.477
## --------------------------------------------------------------------------------
## Resting_Blood_Pressure
## n missing distinct Info Mean Gmd .05 .10
## 302 0 50 0.995 131.6 19.41 108 110
## .25 .50 .75 .90 .95
## 120 130 140 152 160
##
## lowest : 94 100 101 102 104, highest: 174 178 180 192 200
## --------------------------------------------------------------------------------
## Serum_Cholesterol
## n missing distinct Info Mean Gmd .05 .10
## 302 0 152 1 246.7 56.02 175.1 188.4
## .25 .50 .75 .90 .95
## 211.0 241.5 275.0 308.9 326.9
##
## lowest : 126 131 141 149 157, highest: 394 407 409 417 564
## --------------------------------------------------------------------------------
## Fasting_Blood_Sugar
## n missing distinct Info Sum Mean Gmd
## 302 0 2 0.373 44 0.1457 0.2498
##
## --------------------------------------------------------------------------------
## Resting_ECG
## n missing distinct Info Mean Gmd
## 302 0 3 0.76 0.9868 1.003
##
## Value 0 1 2
## Frequency 151 4 147
## Proportion 0.500 0.013 0.487
## --------------------------------------------------------------------------------
## Max_Heart_Rate_Achieved
## n missing distinct Info Mean Gmd .05 .10
## 302 0 91 1 149.6 25.78 108.0 116.0
## .25 .50 .75 .90 .95
## 133.2 153.0 166.0 176.8 181.9
##
## lowest : 71 88 90 95 96, highest: 190 192 194 195 202
## --------------------------------------------------------------------------------
## Exercise_Induced_Angina
## n missing distinct Info Sum Mean Gmd
## 302 0 2 0.661 99 0.3278 0.4422
##
## --------------------------------------------------------------------------------
## ST_Depression_Exercise
## n missing distinct Info Mean Gmd .05 .10
## 302 0 40 0.964 1.035 1.223 0.0 0.0
## .25 .50 .75 .90 .95
## 0.0 0.8 1.6 2.8 3.4
##
## lowest : 0.0 0.1 0.2 0.3 0.4, highest: 4.0 4.2 4.4 5.6 6.2
## --------------------------------------------------------------------------------
## Peak_Exercise_ST_Segment
## n missing distinct Info Mean Gmd
## 302 0 3 0.796 1.596 0.624
##
## Value 1 2 3
## Frequency 142 140 20
## Proportion 0.470 0.464 0.066
## --------------------------------------------------------------------------------
## Num_Major_Vessels_Flouro
## n missing distinct
## 302 0 5
##
## lowest : ? 0 1 2 3, highest: ? 0 1 2 3
##
## Value ? 0 1 2 3
## Frequency 4 175 65 38 20
## Proportion 0.013 0.579 0.215 0.126 0.066
## --------------------------------------------------------------------------------
## Thalassemia
## n missing distinct
## 302 0 4
##
## Value ? 3 6 7
## Frequency 2 166 17 117
## Proportion 0.007 0.550 0.056 0.387
## --------------------------------------------------------------------------------
## Diagnosis_Heart_Disease
## n missing distinct Info Mean Gmd
## 302 0 5 0.833 0.9404 1.252
##
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##
## Value 0 1 2 3 4
## Frequency 163 55 36 35 13
## Proportion 0.540 0.182 0.119 0.116 0.043
## --------------------------------------------------------------------------------
# The section below checks for missing values and performs missing value imputation (using the median)
heart$Num_Major_Vessels_Flouro[which(heart$Num_Major_Vessels_Flouro== "?")] <- NA
heart$Thalassemia[which(heart$Thalassemia== "?")] <- NA
colSums(is.na(heart))
## Age Sex Chest_Pain_Type
## 0 0 0
## Resting_Blood_Pressure Serum_Cholesterol Fasting_Blood_Sugar
## 0 0 0
## Resting_ECG Max_Heart_Rate_Achieved Exercise_Induced_Angina
## 0 0 0
## ST_Depression_Exercise Peak_Exercise_ST_Segment Num_Major_Vessels_Flouro
## 0 0 4
## Thalassemia Diagnosis_Heart_Disease
## 2 0
# Change the data type
heart$Num_Major_Vessels_Flouro <- as.numeric(heart$Num_Major_Vessels_Flouro)
# Obtain the median value
median.result_heart <- median(heart$Num_Major_Vessels_Flouro, na.rm = TRUE)
median.result_heart
## [1] 0
# Missing value imputation
# Replace NA in the numeric column with its median (0, as computed above)
heart$Num_Major_Vessels_Flouro[is.na(heart$Num_Major_Vessels_Flouro)] <- median.result_heart
# Thalassemia is a categorical (character) column, so replace NA with its most frequent value, "3"
heart$Thalassemia[is.na(heart$Thalassemia)] <- "3"
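The two imputation rules used here (median for a numeric column, most frequent value for a categorical one) can be wrapped in small helpers. This is a sketch only; `impute_median` and `impute_mode` are hypothetical names, and the sample vectors are made up for illustration.

```r
# Replace NA in a numeric vector with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Replace NA in a character vector with its most frequent observed value
impute_mode <- function(x) {
  counts <- table(x)  # table() drops NA by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Made-up sample vectors
vessels <- c(0, 0, 0, 1, NA, 3)
thal <- c("3", "7", NA, "3", "6")

impute_median(vessels)
impute_mode(thal)
```

Computing the replacement value inside the helper avoids hard-coding a number that can drift out of sync with the data.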
# Recode the existing multi-level feature Diagnosis_Heart_Disease into a new binary feature called "target" (0 = absence, 1 = presence)
heart <- sqldf("select *,case when Diagnosis_Heart_Disease = 0 then 0 else 1 end target from heart")
head(heart)
## Age Sex Chest_Pain_Type Resting_Blood_Pressure Serum_Cholesterol
## 1 67 1 4 160 286
## 2 67 1 4 120 229
## 3 37 1 3 130 250
## 4 41 0 2 130 204
## 5 56 1 2 120 236
## 6 62 0 4 140 268
## Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## 1 0 2 108
## 2 0 2 129
## 3 0 0 187
## 4 0 2 172
## 5 0 0 178
## 6 0 2 160
## Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## 1 1 1.5 2
## 2 1 2.6 2
## 3 0 3.5 3
## 4 0 1.4 1
## 5 0 0.8 1
## 6 0 3.6 3
## Num_Major_Vessels_Flouro Thalassemia Diagnosis_Heart_Disease target
## 1 3 3 2 1
## 2 2 7 1 1
## 3 0 3 0 0
## 4 0 3 0 0
## 5 0 3 0 0
## 6 2 3 3 1
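The same recoding can be done without SQL. A minimal base R sketch, using the six `Diagnosis_Heart_Disease` values shown in the `head(heart)` output above (`diagnosis` is an illustrative name):

```r
# Diagnosis values of the first six rows, as printed above
diagnosis <- c(2, 1, 0, 0, 0, 3)

# Any value above 0 counts as heart disease present
target <- as.integer(diagnosis > 0)
target
```

On the full data frame the equivalent line would be `heart$target <- as.integer(heart$Diagnosis_Heart_Disease > 0)`.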
### Analyse relationships between features in the dataset
# The relationship between Gender and heart disease
heart$Sex<-as.factor(heart$Sex)
levels(heart$Sex)<-c("Female","Male")
str(heart)
## 'data.frame': 302 obs. of 15 variables:
## $ Age : int 67 67 37 41 56 62 57 63 53 57 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1 1 2 2 2 ...
## $ Chest_Pain_Type : int 4 4 3 2 2 4 4 4 4 4 ...
## $ Resting_Blood_Pressure : int 160 120 130 130 120 140 120 130 140 140 ...
## $ Serum_Cholesterol : int 286 229 250 204 236 268 354 254 203 192 ...
## $ Fasting_Blood_Sugar : int 0 0 0 0 0 0 0 0 1 0 ...
## $ Resting_ECG : int 2 2 0 2 0 2 0 2 2 0 ...
## $ Max_Heart_Rate_Achieved : int 108 129 187 172 178 160 163 147 155 148 ...
## $ Exercise_Induced_Angina : int 1 1 0 0 0 0 1 0 1 0 ...
## $ ST_Depression_Exercise : num 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 0.4 ...
## $ Peak_Exercise_ST_Segment: int 2 2 3 1 1 3 1 2 3 2 ...
## $ Num_Major_Vessels_Flouro: num 3 2 0 0 0 2 0 1 0 0 ...
## $ Thalassemia : chr "3" "7" "3" "3" ...
## $ Diagnosis_Heart_Disease : int 2 1 0 0 0 3 0 2 1 0 ...
## $ target : int 1 1 0 0 0 1 0 1 1 0 ...
## Exploratory Data Analysis
From the histogram, we can see that subjects aged between 50 and 70 account for more heart disease cases than any other age group.
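A histogram shows counts, so a crowded age band can contain many cases simply because it contains many subjects. One way to separate the two is to compute the proportion of cases per age band. The sketch below uses made-up ages and outcomes purely for illustration; the band limits are also an assumption.

```r
# Made-up ages and outcomes (1 = heart disease present), for illustration only
age <- c(35, 42, 45, 52, 55, 58, 61, 64, 67, 72)
target <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)

# Group ages into bands; right = FALSE gives intervals such as [50,60)
age_band <- cut(age, breaks = c(20, 50, 60, 70, 80), right = FALSE)

# Number of cases and proportion of cases in each band
cases <- tapply(target, age_band, sum)
rate <- tapply(target, age_band, mean)
data.frame(cases, rate)
```

In this toy sample the oldest band has the fewest cases but the highest proportion, which is why counts and rates should be read separately.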
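# Pie chart of chest-pain types (the counts get their own name so that table() is not masked)
chest_pain_counts <- table(as.numeric(heart$Chest_Pain_Type))
pie(chest_pain_counts)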
# First, create a train-test split with 75% of the data in the training set.
# The seed makes the random split reproducible.
set.seed(123)
index<-sample(nrow(heart),0.75*nrow(heart))
train<-heart[index,]
test<-heart[-index,]
dim(train)
## [1] 226 15
dim(test)
## [1] 76 15
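# Note: this logistic regression uses Exercise_Induced_Angina (not target) as the response
# and every other column, including Diagnosis_Heart_Disease and target, as a predictor.
modelblr <- glm(Exercise_Induced_Angina ~ ., data = train, family = "binomial")
train$pred <- fitted(modelblr)
# fitted() returns predicted scores only for the data on which the model was fitted.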
head(train)
## Age Sex Chest_Pain_Type Resting_Blood_Pressure Serum_Cholesterol
## 179 53 Male 3 130 246
## 14 52 Male 3 172 199
## 195 67 Male 4 100 299
## 118 63 Male 4 130 330
## 229 66 Male 4 112 212
## 244 60 Female 3 120 178
## Fasting_Blood_Sugar Resting_ECG Max_Heart_Rate_Achieved
## 179 1 2 173
## 14 1 0 162
## 195 0 2 125
## 118 1 2 132
## 229 0 2 132
## 244 1 0 96
## Exercise_Induced_Angina ST_Depression_Exercise Peak_Exercise_ST_Segment
## 179 0 0.0 1
## 14 0 0.5 1
## 195 1 0.9 2
## 118 1 1.8 1
## 229 1 0.1 1
## 244 0 0.0 1
## Num_Major_Vessels_Flouro Thalassemia Diagnosis_Heart_Disease target
## 179 3 3 0 0
## 14 0 7 0 0
## 195 2 3 3 1
## 118 3 7 3 1
## 229 1 3 2 1
## 244 0 3 0 0
## pred
## 179 0.03221014
## 14 0.16108116
## 195 0.65497840
## 118 0.56884660
## 229 0.54523593
## 244 0.29515345
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred<-prediction(train$pred,train$Exercise_Induced_Angina)
perf<-performance(pred,"tpr","fpr")
plot(perf, colorize = TRUE, print.cutoffs.at = seq(0.1, 1, by = 0.1),
     lwd = 4, axes = FALSE, ylab = "TPR", xlab = "FPR")
axis(side = 1 ,col=7)
axis(side = 2,col=7 )
grid()
train$pred1<-ifelse(train$pred<0.6,"No","Yes")
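Once the scores are cut at a threshold, the resulting labels can be compared with the observed values in a confusion table. This sketch uses the six fitted scores and `Exercise_Induced_Angina` values from the `head(train)` output above; the names `prob`, `actual` and `conf` are illustrative.

```r
# Fitted scores and observed outcomes of the six rows printed above
prob <- c(0.03221014, 0.16108116, 0.65497840, 0.56884660, 0.54523593, 0.29515345)
actual <- c(0, 0, 1, 1, 1, 0)

# Apply the same 0.6 cut-off as in the report
predicted <- factor(ifelse(prob < 0.6, "No", "Yes"), levels = c("No", "Yes"))
observed <- factor(ifelse(actual == 1, "Yes", "No"), levels = c("No", "Yes"))

# Cross-tabulate predictions against observations
conf <- table(Predicted = predicted, Actual = observed)
conf

# Share of rows on the diagonal, i.e. classified correctly
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```

Fixing the factor levels up front keeps the table square even when one class is never predicted.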
library(caret)
## Loading required package: lattice
indxTrain <- createDataPartition(y = heart$target,p = 0.75,list = FALSE)
training <- heart[indxTrain,]
testing <- heart[-indxTrain,]
# Compare the class proportions of the full data and the training set
prop.table(table(heart$target)) * 100
##
## 0 1
## 53.97351 46.02649
prop.table(table(training$target)) * 100
##
## 0 1
## 51.54185 48.45815
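# Create object x, which holds the predictor variables, and y, which holds the response variable
# Note: column 10 is ST_Depression_Exercise, so x still contains Diagnosis_Heart_Disease and target.
# The model can therefore read the answer directly from its inputs, which explains the perfect accuracy reported below.
x = training[,-10]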
y = training$target
y <- as.factor(y)
defaultW <- getOption("warn")
options(warn = -1)
model = train(x,y,'nb',trControl=trainControl(method='cv',number=10))
model
## Naive Bayes
##
## 227 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 204, 205, 204, 204, 205, 204, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE NaN NaN
## TRUE 1 1
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
## = 1.
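An accuracy of exactly 1 usually deserves a check of what went into the predictors. One way to build a predictor set that excludes the outcome columns is to drop them by name rather than by position. The sketch below uses four rows taken from the `head(heart)` output and only a subset of the columns; `toy` and `outcome_cols` are hypothetical names.

```r
# Four rows from head(heart), restricted to a few columns for brevity
toy <- data.frame(
  Age = c(67, 37, 41, 62),
  ST_Depression_Exercise = c(1.5, 3.5, 1.4, 3.6),
  Diagnosis_Heart_Disease = c(2, 0, 0, 3),
  target = c(1, 0, 0, 1)
)

# Columns that encode the outcome and must not be used as predictors
outcome_cols <- c("Diagnosis_Heart_Disease", "target")

# Select predictors by name, so a change in column order cannot break the selection
x <- toy[, setdiff(colnames(toy), outcome_cols), drop = FALSE]
y <- factor(toy$target)

colnames(x)
```

Objects built this way could be passed to `train(x, y, ...)` in the same manner as above; the resulting accuracy would be expected to differ from the figures reported here.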
#Model Evaluation
#Predict testing set
Predict <- predict(model,newdata = testing )
#Get the confusion matrix to see accuracy value and other parameter values
#Confusion Matrix and Statistics
confusionMatrix(Predict, as.factor(testing$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 46 0
## 1 0 29
##
## Accuracy : 1
## 95% CI : (0.952, 1)
## No Information Rate : 0.6133
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6133
## Detection Rate : 0.6133
## Detection Prevalence : 0.6133
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
options(warn = defaultW)
# Import library
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
# To control the sampling permutation
set.seed(100)
# Change column 'target' to factor
heart$target <- as.factor(heart$target)
# Check the latest class of 'target'
str(heart)
## 'data.frame': 302 obs. of 15 variables:
## $ Age : int 67 67 37 41 56 62 57 63 53 57 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1 1 2 2 2 ...
## $ Chest_Pain_Type : int 4 4 3 2 2 4 4 4 4 4 ...
## $ Resting_Blood_Pressure : int 160 120 130 130 120 140 120 130 140 140 ...
## $ Serum_Cholesterol : int 286 229 250 204 236 268 354 254 203 192 ...
## $ Fasting_Blood_Sugar : int 0 0 0 0 0 0 0 0 1 0 ...
## $ Resting_ECG : int 2 2 0 2 0 2 0 2 2 0 ...
## $ Max_Heart_Rate_Achieved : int 108 129 187 172 178 160 163 147 155 148 ...
## $ Exercise_Induced_Angina : int 1 1 0 0 0 0 1 0 1 0 ...
## $ ST_Depression_Exercise : num 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 0.4 ...
## $ Peak_Exercise_ST_Segment: int 2 2 3 1 1 3 1 2 3 2 ...
## $ Num_Major_Vessels_Flouro: num 3 2 0 0 0 2 0 1 0 0 ...
## $ Thalassemia : chr "3" "7" "3" "3" ...
## $ Diagnosis_Heart_Disease : int 2 1 0 0 0 3 0 2 1 0 ...
## $ target : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 1 2 2 1 ...
# Split dataset into training and testing set with probability 75% & 25%
rf_sample <- sample(2, nrow(heart), replace = TRUE, prob = c(0.75, 0.25))
rf_train <- heart[rf_sample==1,]
rf_test <- heart[rf_sample==2,]
str(rf_train)
## 'data.frame': 228 obs. of 15 variables:
## $ Age : int 67 67 37 41 56 62 63 53 57 56 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 1 2 2 2 1 ...
## $ Chest_Pain_Type : int 4 4 3 2 2 4 4 4 4 2 ...
## $ Resting_Blood_Pressure : int 160 120 130 130 120 140 130 140 140 140 ...
## $ Serum_Cholesterol : int 286 229 250 204 236 268 254 203 192 294 ...
## $ Fasting_Blood_Sugar : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Resting_ECG : int 2 2 0 2 0 2 2 2 0 2 ...
## $ Max_Heart_Rate_Achieved : int 108 129 187 172 178 160 147 155 148 153 ...
## $ Exercise_Induced_Angina : int 1 1 0 0 0 0 0 1 0 0 ...
## $ ST_Depression_Exercise : num 1.5 2.6 3.5 1.4 0.8 3.6 1.4 3.1 0.4 1.3 ...
## $ Peak_Exercise_ST_Segment: int 2 2 3 1 1 3 2 3 2 2 ...
## $ Num_Major_Vessels_Flouro: num 3 2 0 0 0 2 1 0 0 0 ...
## $ Thalassemia : chr "3" "7" "3" "3" ...
## $ Diagnosis_Heart_Disease : int 2 1 0 0 0 3 2 1 0 0 ...
## $ target : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 2 1 1 ...
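# Running the Random Forest model
# Note: target ~ . also uses Diagnosis_Heart_Disease, from which target was derived,
# so the 0% error rates below reflect that leakage rather than real predictive skill.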
rf <- randomForest(target~., data=rf_train, proximity=TRUE)
print(rf)
##
## Call:
## randomForest(formula = target ~ ., data = rf_train, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## 0 1 class.error
## 0 121 0 0
## 1 0 107 0
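With the formula interface, the predictors can be listed explicitly instead of using `.`. A small sketch with `reformulate()` from base R follows; the column list is shortened for illustration, and `all_cols`, `predictors` and `rf_formula` are hypothetical names.

```r
# A shortened list of column names from the data frame
all_cols <- c("Age", "Sex", "Chest_Pain_Type", "Diagnosis_Heart_Disease", "target")

# Keep everything except the response and the column it was derived from
predictors <- setdiff(all_cols, c("Diagnosis_Heart_Disease", "target"))

# Build the formula target ~ Age + Sex + Chest_Pain_Type
rf_formula <- reformulate(predictors, response = "target")
rf_formula
```

Such a formula could then be supplied in place of `target ~ .`, for example as `randomForest(rf_formula, data = rf_train)`.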
#Checking the accuracy of training and testing set
p1 <- predict(rf, rf_train)
confusionMatrix(p1, rf_train$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 121 0
## 1 0 107
##
## Accuracy : 1
## 95% CI : (0.984, 1)
## No Information Rate : 0.5307
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5307
## Detection Rate : 0.5307
## Detection Prevalence : 0.5307
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
p2 <- predict(rf, rf_test)
confusionMatrix(p2, as.factor(rf_test$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 42 0
## 1 0 32
##
## Accuracy : 1
## 95% CI : (0.9514, 1)
## No Information Rate : 0.5676
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5676
## Detection Rate : 0.5676
## Detection Prevalence : 0.5676
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
plot(rf)
# Histogram of age, filled by diagnosis level (converted to a factor so that each level gets its own colour)
ggplot(heart, aes(x = Age, fill = factor(Diagnosis_Heart_Disease))) + geom_histogram(binwidth = 1, color = "black") + labs(x = "Age", y = "Frequency", fill = "Diagnosis", title = "Heart disease and age")
heart1 <- heart
# Convert some categorical variables to factors for report generation
heart1$Num_Major_Vessels_Flouro = factor(heart1$Num_Major_Vessels_Flouro, levels = c("0","1","2","3"), labels = c(0,1,2,3))
heart1$Thalassemia = factor(heart1$Thalassemia, levels = c("3","6","7"), labels = c(3,6,7))
# Reporting
# Create_report in DataExplorer
# Pull a full data profile of your data frame.
# It will produce an html file with the basic statistics, structure, missing data,
# distribution visualizations, correlation matrix and principal component analysis for your data frame
DataExplorer::create_report(heart1)
## Output created: report.html
```