GOAL

FINAL PROJECT OF MACHLINE LEARNING

622

The main objective is to predict whether certian conidtions for a given person’s likelyhood of having heart disease or not,using machine learning algorithy. Siliar to Framingham Risk score. The risk factors which are causing potential hart diseases include age,cholesterol level,type of chest pain etc.

The algorithm which we are using in this problem are:

Decision Tree

Random Forest

We also fine tune the models using parameter tuning, allowable in the respective model algorithm. The model performance matrix are calculated and compared.

To avoid “over-optimistic” issue, we provided a 2nd data split (of the original training dataset) of training and testing, and trained the models a second time, and compared the 2nd rounds of machine learning (on a second data split), with that of the 1st machine learning algorithm.

The data has 303 observations, with 14 variables. Our first sample split will produce the first set of training and testing samples. The split proportion is 80% versus 20%. That results in 242 observations in the training data set, and 61 observations in the testing data set.

Because machine learning oftentimes result in overly optimistic outcome, or overfitting problem. To validate the potential overly optimistic result, we take the training data set, which is 242 observations, and did a second split of training and testing. The second training and testing data set result in 193 observations in the training data, and 49 observations in the testing data.

We examined the variables of importantance in the overall sample, before the machine learning algorithm. And after machine learning algorithm, especially in random forest, we extrapolate the variables of importance by their ranks, and compare them with the pre training samples.

Heart Disease Data Set

This dataset contains 14 variables, where data was collected from 303 patients who were admitted to a hospital. The “goal/target” field refers to the presence of heart disease in the patient (1=yes; 0=no). The variables’ information is as follows:

age: The person’s age in years

sex: The person’s sex (1 = male, 0 = female)

cp -> chest_pain_type: The chest pain experienced (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)

trestbps -> resting_blood_pressure: The person’s resting blood pressure (mmHg on admission to the hospital)

chol -> cholesterol: The person’s cholesterol measurement in mg/dl

fbs -> fasting_blood_sugar: The person’s fasting blood sugar (> 120 mg/dl, 1 = yes; 0 = no)

restecg -> rest_ecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)

thalach -> max_heart_rate_achieved: The person’s maximum heart rate achieved

exang -> exercise_induced_angina: Exercise induced angina (1 = yes; 0 = no)

oldpeak -> st_depression: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot)

slope -> st_slope: the slope of the peak exercise ST segment (0 = upsloping, 1 = flat, 2 = downsloping)

ca -> num_major_vessels: The number of major vessels (0-4)

thal -> thallium_stress_test: Cold spots on the scan, where no thallium shows up, indicate areas of the heart that are not getting an adequate supply of blood (0 = non, 1 = normal, 2 = fixed defect, 3 = reversable defect)

target -> disease: Heart disease (1 = yes, 0 = no)

Data is Sourced from: https://www.kaggle.com/ronitf/heart-disease-uci

Libraries

library(caret) # short for Classification And Regression Training, major reference
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(class) # for K-Nearest Neighbors Classification

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::combine()       masks randomForest::combine()
## x dplyr::filter()        masks stats::filter()
## x dplyr::lag()           masks stats::lag()
## x purrr::lift()          masks caret::lift()
## x randomForest::margin() masks ggplot2::margin()
library(dslabs)
library(caret)
library(kableExtra) 
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
# library(tidyr)
library(stringr)
library(forcats)
library(gam)
## Loading required package: splines
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loaded gam 1.16.1
library(splines)
library(foreach)
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
## 
## Attaching package: 'mgcv'
## The following objects are masked from 'package:gam':
## 
##     gam, gam.control, gam.fit, s
library(nlme)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(funModeling)
## Loading required package: Hmisc
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## funModeling v.1.9.4 :)
## Examples and tutorials at livebook.datascienceheroes.com
##  / Now in Spanish: librovivodecienciadedatos.ai
library(magrittr) 
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(DataExplorer) 

library(skimr)
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use. 
##  Since version 4.0.0 the GUI is in its own package VIMGUI.
## 
##           Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:funModeling':
## 
##     range01
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-1
library(e1071) 
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
## 
##     impute
library(rpart)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following object is masked from 'package:colorspace':
## 
##     coords
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(class)
library(VIM)
library(gbm)
## Loaded gbm 2.1.8
library(ROCR)

DATA

Hearts_UCI<- read.csv ('heart.csv', fileEncoding = 'UTF-8-BOM', header=TRUE)
# v5col_Jan2021b <- read.csv (file ='C:\\Users\\hhan01\\Documents\\1-Output 2020 Excel from SQL MyDoc\\VLJan2021\\VLtopivot460N208.csv',  fileEncoding="UTF-8-BOM", header=TRUE) ## so that the header is correct, no strange text in first header

colnames (Hearts_UCI)
##  [1] "age"      "sex"      "cp"       "trestbps" "chol"     "fbs"     
##  [7] "restecg"  "thalach"  "exang"    "oldpeak"  "slope"    "ca"      
## [13] "thal"     "target"

Missing Variable Checking

Hearts <- Hearts_UCI

dplyr::glimpse(Hearts)
## Rows: 303
## Columns: 14
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
## $ sex      <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1~
## $ cp       <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0~
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs      <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ restecg  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1~
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exang    <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slope    <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1~
## $ ca       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thal     <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ target   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
skimr::skim(Hearts)
Data summary
Name Hearts
Number of rows 303
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 54.37 9.08 29 47.5 55.0 61.0 77.0 <U+2581><U+2586><U+2587><U+2587><U+2581>
sex 0 1 0.68 0.47 0 0.0 1.0 1.0 1.0 <U+2583><U+2581><U+2581><U+2581><U+2587>
cp 0 1 0.97 1.03 0 0.0 1.0 2.0 3.0 <U+2587><U+2583><U+2581><U+2585><U+2581>
trestbps 0 1 131.62 17.54 94 120.0 130.0 140.0 200.0 <U+2583><U+2587><U+2585><U+2581><U+2581>
chol 0 1 246.26 51.83 126 211.0 240.0 274.5 564.0 <U+2583><U+2587><U+2582><U+2581><U+2581>
fbs 0 1 0.15 0.36 0 0.0 0.0 0.0 1.0 <U+2587><U+2581><U+2581><U+2581><U+2582>
restecg 0 1 0.53 0.53 0 0.0 1.0 1.0 2.0 <U+2587><U+2581><U+2587><U+2581><U+2581>
thalach 0 1 149.65 22.91 71 133.5 153.0 166.0 202.0 <U+2581><U+2582><U+2585><U+2587><U+2582>
exang 0 1 0.33 0.47 0 0.0 0.0 1.0 1.0 <U+2587><U+2581><U+2581><U+2581><U+2583>
oldpeak 0 1 1.04 1.16 0 0.0 0.8 1.6 6.2 <U+2587><U+2582><U+2581><U+2581><U+2581>
slope 0 1 1.40 0.62 0 1.0 1.0 2.0 2.0 <U+2581><U+2581><U+2587><U+2581><U+2587>
ca 0 1 0.73 1.02 0 0.0 0.0 1.0 4.0 <U+2587><U+2583><U+2582><U+2581><U+2581>
thal 0 1 2.31 0.61 0 2.0 2.0 3.0 3.0 <U+2581><U+2581><U+2581><U+2587><U+2586>
target 0 1 0.54 0.50 0 0.0 1.0 1.0 1.0 <U+2587><U+2581><U+2581><U+2581><U+2587>
  anyNA(Hearts)
## [1] FALSE
  colSums(is.na(Hearts))
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0
md.pattern(Hearts)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##     age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 303   1   1  1        1    1   1       1       1     1       1     1  1    1
##       0   0  0        0    0   0       0       0     0       0     0  0    0
##     target  
## 303      1 0
##          0 0

Outcome VR- Target

Heart Disease Yes/No

round(prop.table(table(Hearts$target)),2)
## 
##    0    1 
## 0.46 0.54

Data Pre-Processing

Dummy Code Categorical Variable

# Hearts %<>% mutate_if(is.character, as.factor)
# We have some categorical values that need to be defined as factors
Hearts$sex=factor(Hearts$sex)
levels(Hearts$sex)=c("Female", "Male")
Hearts$cp=factor(Hearts$cp)
levels(Hearts$cp)=c("Typical Angina", "Atypical Angina", "Non-anginal Pain", "Asymptomatic")
Hearts$fbs=factor(Hearts$fbs)
levels(Hearts$fbs)=c("Below120","Above120")
Hearts$restecg=factor(Hearts$restecg)
levels(Hearts$restecg)=c("Normal","Abnormal","Hypertrophy")
Hearts$exang=factor(Hearts$exang)
levels(Hearts$exang)=c("No","Yes")
Hearts$slope=factor(Hearts$slope)
levels(Hearts$slope)=c("Upslopping","Flat","Downslopping")
Hearts$ca=factor(Hearts$ca)
Hearts$target=factor(Hearts$target)
levels(Hearts$target)=c("No Heart Disease","Heart Disease")

head(Hearts,3)
##   age    sex               cp trestbps chol      fbs  restecg thalach exang
## 1  63   Male     Asymptomatic      145  233 Above120   Normal     150    No
## 2  37   Male Non-anginal Pain      130  250 Below120 Abnormal     187    No
## 3  41 Female  Atypical Angina      130  204 Below120   Normal     172    No
##   oldpeak        slope ca thal        target
## 1     2.3   Upslopping  0    1 Heart Disease
## 2     3.5   Upslopping  0    2 Heart Disease
## 3     1.4 Downslopping  0    2 Heart Disease
glimpse(Hearts)
## Rows: 303
## Columns: 14
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
## $ sex      <fct> Male, Male, Female, Male, Female, Male, Female, Male, Male, M~
## $ cp       <fct> Asymptomatic, Non-anginal Pain, Atypical Angina, Atypical Ang~
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs      <fct> Above120, Below120, Below120, Below120, Below120, Below120, B~
## $ restecg  <fct> Normal, Abnormal, Normal, Abnormal, Abnormal, Abnormal, Norma~
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exang    <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No, Yes, No,~
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slope    <fct> Upslopping, Upslopping, Downslopping, Downslopping, Downslopp~
## $ ca       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thal     <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ target   <fct> Heart Disease, Heart Disease, Heart Disease, Heart Disease, H~
Hearts<- Hearts

Principal Component Analysis (PCA)

We can get variable importance without using a predictive model using information theory, ordered from highest to lowest:
Note:
en: entropy measured in bits mi: mutual information
ig: information gain
gr: gain ratio (is the most important metric here, ranged from 0 to 1, with higher being better)

Variable Importance from PCA

##           var    en    mi           ig           gr
## en12     thal 2.077 0.213 0.2132507020 0.1646099129
## en8     exang 1.764 0.142 0.1422104940 0.1560089774
## en2        cp 2.529 0.205 0.2045988556 0.1176236333
## en11       ca 2.474 0.186 0.1859573080 0.1116193825
## en10    slope 2.171 0.117 0.1168337959 0.0902894816
## en4      chol 7.478 0.560 0.5597365269 0.0794675650
## en1       sex 1.836 0.059 0.0591383086 0.0656433918
## en9   oldpeak 4.869 0.253 0.2533884114 0.0613803042
## en7   thalach 6.828 0.335 0.3350981472 0.0543227461
## en3  trestbps 5.551 0.140 0.1404280402 0.0298947640
## en        age 5.937 0.135 0.1353052791 0.0266477255
## en6   restecg 2.058 0.024 0.0240748363 0.0221288889
## en5       fbs 1.600 0.001 0.0005658824 0.0009336281

# pca <- prcomp(as.data.frame(split$continuous), scale = TRUE)
# # Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
# pca$rotation
# 
# prop_varianza <- pca$sdev^2 / sum(pca$sdev^2)
# prop_varianza
# 
# ggplot(data = data.frame(prop_varianza, pc = 1:6),
#        aes(x = pc, y = prop_varianza)) +
#   geom_col(width = 0.3) +
#   scale_y_continuous(limits = c(0,1)) +
#   theme_bw() +
#   labs(x = "Componentent of principal",
#        y = "Proportion of variable explained")

According to the plot, “thallium_stress_test” is the most important variable in this dataset.

# We split out dataframe based on several attributes: categorical vs numeric and discrete vs continuous to ease our Exploratory Data Analysis

split <- split_columns(Hearts)

## print out the list of split, 5 elements
# split [[1]] [[1]] [[2]]
# split [[2]][[1]] [[2]]

split[[3]]
## [1] 8
split[[4]]
## [1] 6
split[[5]]
## [1] 0
# print(data_list$Matrix)
print (split$num_discrete)
## [1] 8
print (split$num_continuous)
## [1] 6
print(split$num_all_missing)
## [1] 0

DATA VISUALIZATON

NUMERICAL VR VISUALIZATION

Set plotting theme

theme_set(ggthemes::theme_hc())
# We gain better understanding of our numeric data distribution by using boxplot and histogram

ggplot(gather(Hearts[,sapply(Hearts, is.numeric)], cols, value), 
       aes(x = value, fill = cols)) + 
  geom_boxplot() + 
  facet_wrap(cols~., scales = "free") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position =  "none") 

ggplot(gather(Hearts[,sapply(Hearts, is.numeric)], cols, value), 
       aes(x = value)) + 
  geom_histogram(aes(fill = cols)) + 
  geom_density(aes(y = stat(count))) +
  facet_wrap(cols~., scales = "free") +
  theme(legend.position =  "none",
        axis.title.y = element_text(angle = 90))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##
AppliedPredictiveModeling::transparentTheme(trans = .5)

featurePlot(x = split$continuous, 
            y = Hearts$target,
            plot = "density", 
                scales = list(x = list(relation = "free"), 
                          y = list(relation = "free")), 
            adjust = 1.5, 
            pch = "|", 
            layout = c(3, 2),
            auto.key = list(columns = 2))

AppliedPredictiveModeling::transparentTheme(trans = 0.4)

featurePlot(x = split$continuous, 
            y = Hearts$target, 
            plot = "pairs",
                        auto.key = list(columns = 2)) 

VISUALIZATION

DISCRETE / CATEGORICAL VR

#Gender 
table(Hearts$sex,Hearts$target)
##         
##          No Heart Disease Heart Disease
##   Female               24            72
##   Male                114            93
table(Hearts$sex,Hearts$target)
##         
##          No Heart Disease Heart Disease
##   Female               24            72
##   Male                114            93
par(mfrow=c(1,1))
col.target <- c("yellow","pink")
plot(table(Hearts$sex,Hearts$target),xlab="Gender",ylab="Diagnostics",col=col.target, main=" ")

summary(Hearts)
##       age            sex                     cp         trestbps    
##  Min.   :29.00   Female: 96   Typical Angina  :143   Min.   : 94.0  
##  1st Qu.:47.50   Male  :207   Atypical Angina : 50   1st Qu.:120.0  
##  Median :55.00                Non-anginal Pain: 87   Median :130.0  
##  Mean   :54.37                Asymptomatic    : 23   Mean   :131.6  
##  3rd Qu.:61.00                                       3rd Qu.:140.0  
##  Max.   :77.00                                       Max.   :200.0  
##       chol             fbs             restecg       thalach      exang    
##  Min.   :126.0   Below120:258   Normal     :147   Min.   : 71.0   No :204  
##  1st Qu.:211.0   Above120: 45   Abnormal   :152   1st Qu.:133.5   Yes: 99  
##  Median :240.0                  Hypertrophy:  4   Median :153.0            
##  Mean   :246.3                                    Mean   :149.6            
##  3rd Qu.:274.5                                    3rd Qu.:166.0            
##  Max.   :564.0                                    Max.   :202.0            
##     oldpeak              slope     ca           thal      
##  Min.   :0.00   Upslopping  : 21   0:175   Min.   :0.000  
##  1st Qu.:0.00   Flat        :140   1: 65   1st Qu.:2.000  
##  Median :0.80   Downslopping:142   2: 38   Median :2.000  
##  Mean   :1.04                      3: 20   Mean   :2.314  
##  3rd Qu.:1.60                      4:  5   3rd Qu.:3.000  
##  Max.   :6.20                              Max.   :3.000  
##               target   
##  No Heart Disease:138  
##  Heart Disease   :165  
##                        
##                        
##                        
## 
ggplot(Hearts, mapping = aes(x = target, fill = sex)) +
  geom_bar(position = "fill") +
  labs(
       title = "Proportion of Individuals With and Without Heart Disease",
       subtitle = "Heart Disease UCI",
       
        caption = "heart+Disease") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

# Chest Pain and Disease target
ggplot(Hearts, aes(x = target, fill = cp)) +
  geom_bar(position = "stack") +
  labs(
       title = "Chest Pain by Heart Disease Status",
       caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ggplot(Hearts, aes(x = target, fill = restecg)) +
  geom_bar(position = "stack") +
  labs(
       title = "Rest ECG",
       caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ggplot(Hearts, aes(x = target, fill = slope)) +
  geom_bar(position = "stack") +
  labs(
             title = "Slope of Peak Excercise Depression",
           caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

ggplot(Hearts, aes(x = target, fill = ca)) +
  geom_bar(position = "stack") +
  labs(     title = "Calcified Major Vessels",
       caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

VARIALBE CORRELATION PLOT

AMONG NUMBERICAL VRs

# Correlations 
library(ggcorrplot)

ggcorr(Hearts, label = T)
## Warning in ggcorr(Hearts, label = T): data in column(s) 'sex', 'cp', 'fbs',
## 'restecg', 'exang', 'slope', 'ca', 'target' are not numeric and were ignored

DATA PREPROCESSING

TARGET VR (HEART DISEASE)

Now, we transform numeric target variable as a factor, because confusionMatrix() function needs work with factors of the same length,

# converts disease2 as factor "0" or "1" to use with confusionMatrix()

Hearts <- mutate_at(Hearts, vars(target), as.factor) 
str(Hearts$target)
##  Factor w/ 2 levels "No Heart Disease",..: 2 2 2 2 2 2 2 2 2 2 ...
unique (Hearts$target)
## [1] Heart Disease    No Heart Disease
## Levels: No Heart Disease Heart Disease
## to Prevent the model from using positive case as "no heart disease"
# positive class is now correct: heart disease
Hearts$target <- relevel(Hearts$target, ref= 'Heart Disease')

Traininng / Testing Data Split

First Split

And now, we create edx and validation datasets:

test_index <- createDataPartition(y = Hearts$target,
                                  times = 1, p = 0.2, list = FALSE) 

edx <- Hearts[-test_index,]  
validation <- Hearts[test_index,] # we will use it only to do final test
Second split (did not use )
# # We will split edx data into train_set and test_set. Validation data will be used only to test the final value of the best model, this is to avoid overfitting (too optimistic result):
# #  Spliting edx dataset in train_set and test_set
# #  We will use it to train ours models

test_index <- createDataPartition(y = edx$target,
                                  times = 1, p = 0.2,
                                  list = FALSE)  # test_set 20%
train_set2 <- edx[-test_index,]
test_set2 <- edx[test_index,]
true_value_target2 <- test_set2$target 
# trainControl function for control tuning parameters models
control <- trainControl(method = "cv",   # cross validation
                        number = 10,     # 10 k-folds or number 
                        )

MODELING

KNN (K Nearest Means) MODEL

10 fold Cross Validation -KNN

fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                             repeats = 10,
                           classProbs = TRUE,
                           summaryFunction = multiClassSummary)

knncv <- train(target~., 
               data=edx,
               method = "knn", 
               trControl=control,
              metric ='Accuracy' )
print(knncv)
## k-Nearest Neighbors 
## 
## 242 samples
##  13 predictor
##   2 classes: 'Heart Disease', 'No Heart Disease' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 217, 218, 218, 218, 218, 218, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.6488333  0.2856976
##   7  0.6450000  0.2769199
##   9  0.6615000  0.3087592
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

KNN PREDICTION

10 fold cross validation

prediction_knncv <- predict (knncv,newdata=validation)
confusionMatrix(prediction_knncv, validation$target)
## Confusion Matrix and Statistics
## 
##                   Reference
## Prediction         Heart Disease No Heart Disease
##   Heart Disease               21               11
##   No Heart Disease            12               17
##                                           
##                Accuracy : 0.623           
##                  95% CI : (0.4896, 0.7439)
##     No Information Rate : 0.541           
##     P-Value [Acc > NIR] : 0.1234          
##                                           
##                   Kappa : 0.2428          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.6364          
##             Specificity : 0.6071          
##          Pos Pred Value : 0.6563          
##          Neg Pred Value : 0.5862          
##              Prevalence : 0.5410          
##          Detection Rate : 0.3443          
##    Detection Prevalence : 0.5246          
##       Balanced Accuracy : 0.6218          
##                                           
##        'Positive' Class : Heart Disease   
## 
##  positive class is now correct: heart disease

Random Forest Model

1st Sample SPlit

library (randomForest)

  EconCost <- function(data, lev = NULL, model = NULL)
{
  y.pred = data$pred
  y.true = data$obs
  CM = confusionMatrix(y.pred, y.true)$table
  out = sum(as.vector(CM)*cost.p)/sum(CM)
  names(out) <- c("EconCost")
  out
  }
# 
ctrl <- trainControl(method = "cv", number = 5,#set cross-validation control to 5-fold
                     classProbs = TRUE, 
                     summaryFunction = EconCost,
                     verboseIter=T)
#100 trees
#set cutoff 
RF_train <- randomForest(target ~., 
                         data=edx,
                         ntree=100,mtry=5,cutoff=c(0.6,0.4),importance=TRUE, do.trace=T)
## ntree      OOB      1      2
##     1:  23.16% 12.00% 35.56%
##     2:  26.71% 24.36% 29.41%
##     3:  26.49% 24.74% 28.41%
##     4:  27.59% 27.36% 27.84%
##     5:  29.30% 29.82% 28.71%
##     6:  28.32% 28.69% 27.88%
##     7:  25.65% 26.19% 25.00%
##     8:  26.41% 28.35% 24.04%
##     9:  24.68% 25.00% 24.30%
##    10:  23.85% 25.95% 21.30%
##    11:  25.42% 26.72% 23.85%
##    12:  24.58% 26.72% 22.02%
##    13:  23.75% 26.72% 20.18%
##    14:  25.31% 28.79% 21.10%
##    15:  24.79% 27.27% 21.82%
##    16:  26.03% 28.03% 23.64%
##    17:  24.79% 26.52% 22.73%
##    18:  23.55% 25.76% 20.91%
##    19:  24.79% 28.03% 20.91%
##    20:  24.38% 27.27% 20.91%
##    21:  24.79% 27.27% 21.82%
##    22:  23.97% 25.76% 21.82%
##    23:  23.55% 25.76% 20.91%
##    24:  23.14% 25.00% 20.91%
##    25:  23.97% 26.52% 20.91%
##    26:  24.79% 27.27% 21.82%
##    27:  23.55% 26.52% 20.00%
##    28:  22.31% 25.76% 18.18%
##    29:  21.49% 24.24% 18.18%
##    30:  22.73% 25.00% 20.00%
##    31:  21.07% 23.48% 18.18%
##    32:  21.90% 22.73% 20.91%
##    33:  21.49% 21.97% 20.91%
##    34:  21.07% 22.73% 19.09%
##    35:  19.83% 21.21% 18.18%
##    36:  21.49% 22.73% 20.00%
##    37:  21.90% 23.48% 20.00%
##    38:  23.14% 26.52% 19.09%
##    39:  21.90% 24.24% 19.09%
##    40:  21.90% 24.24% 19.09%
##    41:  21.90% 24.24% 19.09%
##    42:  21.90% 23.48% 20.00%
##    43:  21.49% 22.73% 20.00%
##    44:  21.90% 23.48% 20.00%
##    45:  21.49% 23.48% 19.09%
##    46:  21.07% 23.48% 18.18%
##    47:  22.73% 25.00% 20.00%
##    48:  21.90% 23.48% 20.00%
##    49:  21.90% 23.48% 20.00%
##    50:  21.90% 23.48% 20.00%
##    51:  22.73% 24.24% 20.91%
##    52:  23.14% 25.00% 20.91%
##    53:  22.31% 23.48% 20.91%
##    54:  21.90% 22.73% 20.91%
##    55:  22.31% 23.48% 20.91%
##    56:  22.31% 23.48% 20.91%
##    57:  21.90% 23.48% 20.00%
##    58:  23.14% 25.00% 20.91%
##    59:  22.31% 23.48% 20.91%
##    60:  21.90% 23.48% 20.00%
##    61:  23.55% 25.76% 20.91%
##    62:  23.14% 25.00% 20.91%
##    63:  23.14% 25.00% 20.91%
##    64:  22.73% 24.24% 20.91%
##    65:  22.73% 24.24% 20.91%
##    66:  22.73% 24.24% 20.91%
##    67:  21.90% 22.73% 20.91%
##    68:  21.90% 22.73% 20.91%
##    69:  21.90% 22.73% 20.91%
##    70:  21.49% 22.73% 20.00%
##    71:  21.90% 22.73% 20.91%
##    72:  21.90% 22.73% 20.91%
##    73:  21.49% 21.97% 20.91%
##    74:  21.49% 21.97% 20.91%
##    75:  21.90% 22.73% 20.91%
##    76:  21.90% 22.73% 20.91%
##    77:  22.31% 23.48% 20.91%
##    78:  21.49% 22.73% 20.00%
##    79:  22.31% 24.24% 20.00%
##    80:  21.90% 23.48% 20.00%
##    81:  22.31% 24.24% 20.00%
##    82:  21.90% 23.48% 20.00%
##    83:  21.90% 23.48% 20.00%
##    84:  21.90% 23.48% 20.00%
##    85:  21.90% 24.24% 19.09%
##    86:  22.73% 25.00% 20.00%
##    87:  22.73% 25.00% 20.00%
##    88:  22.31% 24.24% 20.00%
##    89:  21.49% 22.73% 20.00%
##    90:  21.49% 23.48% 19.09%
##    91:  21.49% 22.73% 20.00%
##    92:  21.07% 21.97% 20.00%
##    93:  20.66% 21.97% 19.09%
##    94:  21.49% 23.48% 19.09%
##    95:  21.49% 23.48% 19.09%
##    96:  21.90% 24.24% 19.09%
##    97:  21.90% 24.24% 19.09%
##    98:  22.31% 24.24% 20.00%
##    99:  21.49% 22.73% 20.00%
##   100:  21.07% 22.73% 19.09%
# plot(RF_train)

PREDICTION OF HEART DISEASE

BY RANDOM FOREST

1st Sample Split

RF_pred <- predict(RF_train, newdata=validation)
confusionMatrix (RF_pred, validation$target)
## Confusion Matrix and Statistics
## 
##                   Reference
## Prediction         Heart Disease No Heart Disease
##   Heart Disease               23                1
##   No Heart Disease            10               27
##                                           
##                Accuracy : 0.8197          
##                  95% CI : (0.7002, 0.9064)
##     No Information Rate : 0.541           
##     P-Value [Acc > NIR] : 4.82e-06        
##                                           
##                   Kappa : 0.6455          
##                                           
##  Mcnemar's Test P-Value : 0.01586         
##                                           
##             Sensitivity : 0.6970          
##             Specificity : 0.9643          
##          Pos Pred Value : 0.9583          
##          Neg Pred Value : 0.7297          
##              Prevalence : 0.5410          
##          Detection Rate : 0.3770          
##    Detection Prevalence : 0.3934          
##       Balanced Accuracy : 0.8306          
##                                           
##        'Positive' Class : Heart Disease   
## 

ROC CURVE

KNN MODEL

Because ROC curve only allows all numerical value, nothing categorical. So I am referring back to the original Hearts_UCi datset, where they are all numberical variable except target (will be transformed into the numerical too). Also, K=9 were chosen based on previosus datarun.

head(Hearts_UCI,2)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2  37   1  2      130  250   0       1     187     0     3.5     0  0    2
##   target
## 1      1
## 2      1
Hearts_UCI$target <- as.numeric (Hearts_UCI$target)
test_index2 <- createDataPartition(y = Hearts_UCI$target,
                                  times = 1, p = 0.2, list = FALSE) 

edx_rocknn <- Hearts_UCI[-test_index,]  
validation_rocknn <- Hearts_UCI[test_index,] # we will use it only to do final test
#ROCKNN
library(ROCR) 

model2roclmm<-  knn(train=edx_rocknn[,1:13], test=validation_rocknn[,1:13],cl=edx_rocknn$target, k=5,prob=TRUE)
prob <- attr(model2roclmm, "prob")
prob <- ifelse(model2roclmm == "0", 1-prob, prob)

pred_knn <- prediction(prob, validation_rocknn$target)
pred_knn <- performance(pred_knn, "tpr", "fpr")
plot(pred_knn, avg= "threshold", lwd=1, main="kNN ROC Curve at K=5")

PARAMETER FINE TUNINING

RANDOME FOREST

1st Sample Split

RF_train_tuning <- train(target ~.,
                  method = "rf",
                  data=edx,
                        # data=train_set,
                  preProcess = c("center", "scale"),
                  ntree = 100,
                  cutoff=c(0.6,0.4),
                  tuneGrid = expand.grid(mtry=c(4,5,6)),
                                maximize = F,
                  trControl = control)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: restecgHypertrophy

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: restecgHypertrophy

## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: restecgHypertrophy
summary(RF_train_tuning)
##                 Length Class      Mode     
## call              6    -none-     call     
## type              1    -none-     character
## predicted       242    factor     numeric  
## err.rate        300    -none-     numeric  
## confusion         6    -none-     numeric  
## votes           484    matrix     numeric  
## oob.times       242    -none-     numeric  
## classes           2    -none-     character
## importance       20    -none-     numeric  
## importanceSD      0    -none-     NULL     
## localImportance   0    -none-     NULL     
## proximity         0    -none-     NULL     
## ntree             1    -none-     numeric  
## mtry              1    -none-     numeric  
## forest           14    -none-     list     
## y               242    factor     numeric  
## test              0    -none-     NULL     
## inbag             0    -none-     NULL     
## xNames           20    -none-     character
## problemType       1    -none-     character
## tuneValue         1    data.frame list     
## obsLevels         2    -none-     character
## param             2    -none-     list
# typeof(RF_train_tuning)

RISK FACTORS VARIABLE IMPORTANCE

RANDOM FOREST MODEL

1st Sample Split

# Variable importance
RF_imp <- varImp (RF_train, scale=F)
typeof(RF_imp)
## [1] "list"
print(RF_imp)
##          Heart Disease No Heart Disease
## age       0.0080127111     0.0080127111
## sex       0.0175670465     0.0175670465
## cp        0.0337980632     0.0337980632
## trestbps  0.0127241721     0.0127241721
## chol     -0.0053457571    -0.0053457571
## fbs      -0.0004637901    -0.0004637901
## restecg   0.0018058235     0.0018058235
## thalach   0.0243581828     0.0243581828
## exang     0.0085087665     0.0085087665
## oldpeak   0.0240907243     0.0240907243
## slope     0.0042460161     0.0042460161
## ca        0.0361097703     0.0361097703
## thal      0.0266879867     0.0266879867
# plot (RF_imp)
varImpPlot(RF_train, type=2)

# 

SECOND ROUND OF TRAINING / TESTING

RANDOME FOREST MODEL

#5 trees for 2nd time 
RF_train_2nd <- randomForest(target ~., 
                         data=train_set2,
                         ntree=5,mtry=5,cutoff=c(0.6,0.4),importance=TRUE, do.trace=T)
## ntree      OOB      1      2
##     1:  29.73% 24.39% 36.36%
##     2:  26.27% 28.57% 23.64%
##     3:  24.31% 29.33% 18.84%
##     4:  26.38% 26.74% 25.97%
##     5:  28.57% 31.25% 25.32%

PREDICTION OF HEART DISEASE

BY RANDOM FOREST

Second SAMPLE ( Training Test Split)

RF_pred_set2 <- predict(RF_train_2nd, newdata=test_set2)

confusionMatrix (RF_pred_set2, test_set2$target)
## Confusion Matrix and Statistics
## 
##                   Reference
## Prediction         Heart Disease No Heart Disease
##   Heart Disease               21                3
##   No Heart Disease             6               19
##                                           
##                Accuracy : 0.8163          
##                  95% CI : (0.6798, 0.9124)
##     No Information Rate : 0.551           
##     P-Value [Acc > NIR] : 9.098e-05       
##                                           
##                   Kappa : 0.6334          
##                                           
##  Mcnemar's Test P-Value : 0.505           
##                                           
##             Sensitivity : 0.7778          
##             Specificity : 0.8636          
##          Pos Pred Value : 0.8750          
##          Neg Pred Value : 0.7600          
##              Prevalence : 0.5510          
##          Detection Rate : 0.4286          
##    Detection Prevalence : 0.4898          
##       Balanced Accuracy : 0.8207          
##                                           
##        'Positive' Class : Heart Disease   
## 

SUMMARY OF PROJECT

Machine learning algorithms (classification techniques in this project, KNN and random forest) can accurately predict a person’s having heart disease or not. The all the models had an accuracy ranging from 76% to 84%. Random forest gave a slightly better accuracy of around 83.5%

REFERENCE :

Confusion Matrix and others (https://en.wikipedia.org/wiki/Confusion_matrix)

ggcorrplot: Visualization of a correlation matrix using ggplot2 https://cran.r-project.org/web/packages/ggcorrplot/readme/README.html#:~:text=The%20ggcorrplot%20package%20can%20be,matrix%20of%20correlation%20p%2Dvalues.

featureplot Visualization CARET

caret package

Arranging plots with par(mfrow) and layout() https://bookdown.org/ndphillips/YaRrr/arranging-plots-with-parmfrow-and-layout.html