The main objective is to predict whether a given person is likely to have heart disease, using machine learning algorithms, similar in spirit to the Framingham Risk Score. The risk factors associated with potential heart disease include age, cholesterol level, type of chest pain, etc.
The algorithms we use for this problem are:
k-Nearest Neighbors (KNN)
Random Forest
We also fine-tune the models using the parameter tuning available in each algorithm. Model performance metrics are calculated and compared.
To avoid an “over-optimistic” result, we perform a second training/testing split of the original training dataset, train the models a second time, and compare this second round of machine learning with the first.
The data has 303 observations, with 14 variables. Our first sample split will produce the first set of training and testing samples. The split proportion is 80% versus 20%. That results in 242 observations in the training data set, and 61 observations in the testing data set.
Because machine learning often produces overly optimistic results (overfitting), we validate by taking the 242-observation training set and splitting it a second time into training and testing sets. This second split yields 193 observations in the training data and 49 in the testing data.
We examine variable importance in the overall sample before running the machine learning algorithms. After training, especially for random forest, we extract the variables of importance by rank and compare them with the pre-training ranking.
This dataset contains 14 variables, with data collected from 303 patients who were admitted to a hospital. The “goal/target” field refers to the presence of heart disease in the patient (1 = yes; 0 = no). The variables are as follows:
age: The person’s age in years
sex: The person’s sex (1 = male, 0 = female)
cp -> chest_pain_type: The chest pain experienced (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)
trestbps -> resting_blood_pressure: The person’s resting blood pressure (mmHg on admission to the hospital)
chol -> cholesterol: The person’s cholesterol measurement in mg/dl
fbs -> fasting_blood_sugar: The person’s fasting blood sugar (> 120 mg/dl, 1 = yes; 0 = no)
restecg -> rest_ecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
thalach -> max_heart_rate_achieved: The person’s maximum heart rate achieved
exang -> exercise_induced_angina: Exercise induced angina (1 = yes; 0 = no)
oldpeak -> st_depression: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot)
slope -> st_slope: the slope of the peak exercise ST segment (0 = upsloping, 1 = flat, 2 = downsloping)
ca -> num_major_vessels: The number of major vessels (0-4)
thal -> thallium_stress_test: Cold spots on the scan, where no thallium shows up, indicate areas of the heart that are not getting an adequate supply of blood (0 = none, 1 = normal, 2 = fixed defect, 3 = reversible defect)
target -> disease: Heart disease (1 = yes, 0 = no)
Data is Sourced from: https://www.kaggle.com/ronitf/heart-disease-uci
library(caret) # short for Classification And Regression Training, major reference
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(class) # for K-Nearest Neighbors Classification
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.1 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::combine() masks randomForest::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## x randomForest::margin() masks ggplot2::margin()
library(dslabs)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
# library(tidyr)
library(stringr)
library(forcats)
library(gam)
## Loading required package: splines
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loaded gam 1.16.1
library(splines)
library(foreach)
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
##
## Attaching package: 'mgcv'
## The following objects are masked from 'package:gam':
##
## gam, gam.control, gam.fit, s
library(nlme)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(funModeling)
## Loading required package: Hmisc
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## funModeling v.1.9.4 :)
## Examples and tutorials at livebook.datascienceheroes.com
## / Now in Spanish: librovivodecienciadedatos.ai
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(DataExplorer)
library(skimr)
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Since version 4.0.0 the GUI is in its own package VIMGUI.
##
## Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:funModeling':
##
## range01
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-1
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
library(rpart)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:colorspace':
##
## coords
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(gbm)
## Loaded gbm 2.1.8
library(ROCR)
# fileEncoding = 'UTF-8-BOM' keeps the first header name clean (strips the byte-order mark)
Hearts_UCI <- read.csv('heart.csv', fileEncoding = 'UTF-8-BOM', header = TRUE)
colnames (Hearts_UCI)
## [1] "age" "sex" "cp" "trestbps" "chol" "fbs"
## [7] "restecg" "thalach" "exang" "oldpeak" "slope" "ca"
## [13] "thal" "target"
Hearts <- Hearts_UCI
dplyr::glimpse(Hearts)
## Rows: 303
## Columns: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
## $ sex <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1~
## $ cp <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0~
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0~
## $ restecg <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1~
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exang <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0~
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slope <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1~
## $ ca <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ target <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
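The variable list above documents intended renames (cp -> chest_pain_type, etc.) that the code below does not apply. If desired, the mapping could be applied with dplyr::rename, as in this sketch (Hearts_renamed is a new, hypothetical object; the analysis keeps the original short names):
# Sketch only: apply the documented renaming; not used downstream
Hearts_renamed <- dplyr::rename(Hearts,
                                chest_pain_type = cp,
                                resting_blood_pressure = trestbps,
                                cholesterol = chol,
                                fasting_blood_sugar = fbs,
                                rest_ecg = restecg,
                                max_heart_rate_achieved = thalach,
                                exercise_induced_angina = exang,
                                st_depression = oldpeak,
                                st_slope = slope,
                                num_major_vessels = ca,
                                thallium_stress_test = thal,
                                disease = target)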
skimr::skim(Hearts)
Name | Hearts
Number of rows | 303
Number of columns | 14
Column type frequency: numeric | 14
Group variables | None

Variable type: numeric

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist
---|---|---|---|---|---|---|---|---|---|---
age | 0 | 1 | 54.37 | 9.08 | 29 | 47.5 | 55.0 | 61.0 | 77.0 | ▁▆▇▇▁
sex | 0 | 1 | 0.68 | 0.47 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | ▃▁▁▁▇
cp | 0 | 1 | 0.97 | 1.03 | 0 | 0.0 | 1.0 | 2.0 | 3.0 | ▇▃▁▅▁
trestbps | 0 | 1 | 131.62 | 17.54 | 94 | 120.0 | 130.0 | 140.0 | 200.0 | ▃▇▅▁▁
chol | 0 | 1 | 246.26 | 51.83 | 126 | 211.0 | 240.0 | 274.5 | 564.0 | ▃▇▂▁▁
fbs | 0 | 1 | 0.15 | 0.36 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | ▇▁▁▁▂
restecg | 0 | 1 | 0.53 | 0.53 | 0 | 0.0 | 1.0 | 1.0 | 2.0 | ▇▁▇▁▁
thalach | 0 | 1 | 149.65 | 22.91 | 71 | 133.5 | 153.0 | 166.0 | 202.0 | ▁▂▅▇▂
exang | 0 | 1 | 0.33 | 0.47 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▃
oldpeak | 0 | 1 | 1.04 | 1.16 | 0 | 0.0 | 0.8 | 1.6 | 6.2 | ▇▂▁▁▁
slope | 0 | 1 | 1.40 | 0.62 | 0 | 1.0 | 1.0 | 2.0 | 2.0 | ▁▁▇▁▇
ca | 0 | 1 | 0.73 | 1.02 | 0 | 0.0 | 0.0 | 1.0 | 4.0 | ▇▃▂▁▁
thal | 0 | 1 | 2.31 | 0.61 | 0 | 2.0 | 2.0 | 3.0 | 3.0 | ▁▁▁▇▆
target | 0 | 1 | 0.54 | 0.50 | 0 | 0.0 | 1.0 | 1.0 | 1.0 | ▇▁▁▁▇
anyNA(Hearts)
## [1] FALSE
colSums(is.na(Hearts))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
md.pattern(Hearts)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 303 1 1 1 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0 0 0 0 0
## target
## 303 1 0
## 0 0
round(prop.table(table(Hearts$target)),2)
##
## 0 1
## 0.46 0.54
# Hearts %<>% mutate_if(is.character, as.factor)
# We have some categorical values that need to be defined as factors
Hearts$sex=factor(Hearts$sex)
levels(Hearts$sex)=c("Female", "Male")
Hearts$cp=factor(Hearts$cp)
levels(Hearts$cp)=c("Typical Angina", "Atypical Angina", "Non-anginal Pain", "Asymptomatic")
Hearts$fbs=factor(Hearts$fbs)
levels(Hearts$fbs)=c("Below120","Above120")
Hearts$restecg=factor(Hearts$restecg)
levels(Hearts$restecg)=c("Normal","Abnormal","Hypertrophy")
Hearts$exang=factor(Hearts$exang)
levels(Hearts$exang)=c("No","Yes")
Hearts$slope=factor(Hearts$slope)
levels(Hearts$slope)=c("Upsloping","Flat","Downsloping")
Hearts$ca=factor(Hearts$ca)
Hearts$target=factor(Hearts$target)
levels(Hearts$target)=c("No Heart Disease","Heart Disease")
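For reference, the same conversions can be written more compactly with dplyr; a sketch (commented out so the base-R block above remains the single source of truth):
# Hearts <- Hearts %>%
#   mutate(sex    = factor(sex,    labels = c("Female", "Male")),
#          fbs    = factor(fbs,    labels = c("Below120", "Above120")),
#          exang  = factor(exang,  labels = c("No", "Yes")),
#          target = factor(target, labels = c("No Heart Disease", "Heart Disease")))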
head(Hearts,3)
## age sex cp trestbps chol fbs restecg thalach exang
## 1 63 Male Asymptomatic 145 233 Above120 Normal 150 No
## 2 37 Male Non-anginal Pain 130 250 Below120 Abnormal 187 No
## 3 41 Female Atypical Angina 130 204 Below120 Normal 172 No
## oldpeak slope ca thal target
## 1 2.3 Upsloping 0 1 Heart Disease
## 2 3.5 Upsloping 0 2 Heart Disease
## 3 1.4 Downsloping 0 2 Heart Disease
glimpse(Hearts)
## Rows: 303
## Columns: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
## $ sex <fct> Male, Male, Female, Male, Female, Male, Female, Male, Male, M~
## $ cp <fct> Asymptomatic, Non-anginal Pain, Atypical Angina, Atypical Ang~
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs <fct> Above120, Below120, Below120, Below120, Below120, Below120, B~
## $ restecg <fct> Normal, Abnormal, Normal, Abnormal, Abnormal, Abnormal, Norma~
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exang <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No, Yes, No,~
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slope <fct> Upsloping, Upsloping, Downsloping, Downsloping, Downslopin~
## $ ca <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ target <fct> Heart Disease, Heart Disease, Heart Disease, Heart Disease, H~
We can get variable importance without using a predictive model, via information theory, ordered from highest to lowest. Notation:
en: entropy, measured in bits
mi: mutual information
ig: information gain
gr: gain ratio (the key metric here; it ranges from 0 to 1, with higher being better)
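The ranking below was presumably produced with funModeling's var_rank_info(); a minimal sketch consistent with the output:
# Information-theoretic variable ranking against the target (funModeling)
variable_importance <- var_rank_info(Hearts, "target")
variable_importance[order(-variable_importance$gr), ]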
## var en mi ig gr
## en12 thal 2.077 0.213 0.2132507020 0.1646099129
## en8 exang 1.764 0.142 0.1422104940 0.1560089774
## en2 cp 2.529 0.205 0.2045988556 0.1176236333
## en11 ca 2.474 0.186 0.1859573080 0.1116193825
## en10 slope 2.171 0.117 0.1168337959 0.0902894816
## en4 chol 7.478 0.560 0.5597365269 0.0794675650
## en1 sex 1.836 0.059 0.0591383086 0.0656433918
## en9 oldpeak 4.869 0.253 0.2533884114 0.0613803042
## en7 thalach 6.828 0.335 0.3350981472 0.0543227461
## en3 trestbps 5.551 0.140 0.1404280402 0.0298947640
## en age 5.937 0.135 0.1353052791 0.0266477255
## en6 restecg 2.058 0.024 0.0240748363 0.0221288889
## en5 fbs 1.600 0.001 0.0005658824 0.0009336281
# Optional PCA on the numeric columns (fixed: the original call errored because
# `split` was not defined yet and the input must be all-numeric; selecting the
# numeric columns explicitly avoids both problems):
# pca <- prcomp(Hearts[, sapply(Hearts, is.numeric)], scale. = TRUE)
# pca$rotation
#
# prop_varianza <- pca$sdev^2 / sum(pca$sdev^2)
# prop_varianza
#
# ggplot(data = data.frame(prop_varianza, pc = seq_along(prop_varianza)),
#        aes(x = pc, y = prop_varianza)) +
#   geom_col(width = 0.3) +
#   scale_y_continuous(limits = c(0, 1)) +
#   theme_bw() +
#   labs(x = "Principal component",
#        y = "Proportion of variance explained")
According to this ranking, thal (the thallium stress test result) has the highest gain ratio and is the most important variable in this dataset.
# We split our dataframe based on attribute type (categorical vs. numeric, discrete vs. continuous) to ease our Exploratory Data Analysis
split <- split_columns(Hearts)
# split_columns() returns a list of five elements: the discrete and continuous
# data frames, plus counts of discrete, continuous, and all-missing columns
split[[3]]
## [1] 8
split[[4]]
## [1] 6
split[[5]]
## [1] 0
# print(data_list$Matrix)
print (split$num_discrete)
## [1] 8
print (split$num_continuous)
## [1] 6
print(split$num_all_missing)
## [1] 0
theme_set(ggthemes::theme_hc())
# We gain better understanding of our numeric data distribution by using boxplot and histogram
ggplot(gather(Hearts[,sapply(Hearts, is.numeric)], cols, value),
aes(x = value, fill = cols)) +
geom_boxplot() +
facet_wrap(cols~., scales = "free") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "none")
ggplot(gather(Hearts[,sapply(Hearts, is.numeric)], cols, value),
aes(x = value)) +
geom_histogram(aes(fill = cols)) +
geom_density(aes(y = stat(count))) +
facet_wrap(cols~., scales = "free") +
theme(legend.position = "none",
axis.title.y = element_text(angle = 90))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
##
AppliedPredictiveModeling::transparentTheme(trans = .5)
featurePlot(x = split$continuous,
y = Hearts$target,
plot = "density",
scales = list(x = list(relation = "free"),
y = list(relation = "free")),
adjust = 1.5,
pch = "|",
layout = c(3, 2),
auto.key = list(columns = 2))
AppliedPredictiveModeling::transparentTheme(trans = 0.4)
featurePlot(x = split$continuous,
y = Hearts$target,
plot = "pairs",
auto.key = list(columns = 2))
#Gender
table(Hearts$sex,Hearts$target)
##
## No Heart Disease Heart Disease
## Female 24 72
## Male 114 93
par(mfrow=c(1,1))
col.target <- c("yellow","pink")
plot(table(Hearts$sex,Hearts$target),xlab="Gender",ylab="Diagnostics",col=col.target, main=" ")
summary(Hearts)
## age sex cp trestbps
## Min. :29.00 Female: 96 Typical Angina :143 Min. : 94.0
## 1st Qu.:47.50 Male :207 Atypical Angina : 50 1st Qu.:120.0
## Median :55.00 Non-anginal Pain: 87 Median :130.0
## Mean :54.37 Asymptomatic : 23 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## chol fbs restecg thalach exang
## Min. :126.0 Below120:258 Normal :147 Min. : 71.0 No :204
## 1st Qu.:211.0 Above120: 45 Abnormal :152 1st Qu.:133.5 Yes: 99
## Median :240.0 Hypertrophy: 4 Median :153.0
## Mean :246.3 Mean :149.6
## 3rd Qu.:274.5 3rd Qu.:166.0
## Max. :564.0 Max. :202.0
## oldpeak slope ca thal
## Min. :0.00 Upsloping : 21 0:175 Min. :0.000
## 1st Qu.:0.00 Flat :140 1: 65 1st Qu.:2.000
## Median :0.80 Downsloping:142 2: 38 Median :2.000
## Mean :1.04 3: 20 Mean :2.314
## 3rd Qu.:1.60 4: 5 3rd Qu.:3.000
## Max. :6.20 Max. :3.000
## target
## No Heart Disease:138
## Heart Disease :165
##
##
##
##
ggplot(Hearts, mapping = aes(x = target, fill = sex)) +
geom_bar(position = "fill") +
labs(
title = "Proportion of Individuals With and Without Heart Disease",
subtitle = "Heart Disease UCI",
caption = "heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
# Chest Pain and Disease target
ggplot(Hearts, aes(x = target, fill = cp)) +
geom_bar(position = "stack") +
labs(
title = "Chest Pain by Heart Disease Status",
caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
ggplot(Hearts, aes(x = target, fill = restecg)) +
geom_bar(position = "stack") +
labs(
title = "Rest ECG",
caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
ggplot(Hearts, aes(x = target, fill = slope)) +
geom_bar(position = "stack") +
labs(
title = "Slope of Peak Excercise Depression",
caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
ggplot(Hearts, aes(x = target, fill = ca)) +
geom_bar(position = "stack") +
labs( title = "Calcified Major Vessels",
caption = "Source: https://archive.ics.uci.edu/ml/datasets/heart+Disease") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
# Correlations
library(ggcorrplot)
ggcorr(Hearts, label = T)
## Warning in ggcorr(Hearts, label = T): data in column(s) 'sex', 'cp', 'fbs',
## 'restecg', 'exang', 'slope', 'ca', 'target' are not numeric and were ignored
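As the warning says, factor columns are ignored by ggcorr(). If they should enter the correlation plot, one option (a sketch, not run here) is to expand them to 0/1 dummy columns first:
# ggcorr(data.frame(model.matrix(~ . - 1, data = Hearts)), label = TRUE)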
The target variable was already converted to a factor above; we re-assert it here because confusionMatrix() requires the predictions and the reference to be factors with the same levels:
# converts disease2 as factor "0" or "1" to use with confusionMatrix()
Hearts <- mutate_at(Hearts, vars(target), as.factor)
str(Hearts$target)
## Factor w/ 2 levels "No Heart Disease",..: 2 2 2 2 2 2 2 2 2 2 ...
unique (Hearts$target)
## [1] Heart Disease No Heart Disease
## Levels: No Heart Disease Heart Disease
# Relevel so that the positive class is "Heart Disease", preventing the model
# metrics from treating "No Heart Disease" as the positive case
Hearts$target <- relevel(Hearts$target, ref= 'Heart Disease')
And now, we create edx and validation datasets:
test_index <- createDataPartition(y = Hearts$target,
times = 1, p = 0.2, list = FALSE)
edx <- Hearts[-test_index,]
validation <- Hearts[test_index,] # we will use it only to do final test
# We split edx into train_set2 and test_set2; the validation data will be used
# only to test the final value of the best model, to avoid overfitting
# (overly optimistic results)
test_index <- createDataPartition(y = edx$target,
times = 1, p = 0.2,
list = FALSE) # test_set 20%
train_set2 <- edx[-test_index,]
test_set2 <- edx[test_index,]
true_value_target2 <- test_set2$target
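As a quick sanity check (not part of the original run), the nested split sizes can be confirmed directly:
# nrow(edx); nrow(validation)         # expect 242 and 61
# nrow(train_set2); nrow(test_set2)   # expect 193 and 49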
# trainControl function for control tuning parameters models
control <- trainControl(method = "cv", # cross-validation
                        number = 10)   # 10 folds
fitControl <- trainControl(## 10-fold CV, repeated 10 times
    method = "repeatedcv",
    number = 10,
    repeats = 10,
    classProbs = TRUE,
    summaryFunction = multiClassSummary)
# Note: fitControl is defined for reference only; the models below use `control`
knncv <- train(target~.,
data=edx,
method = "knn",
trControl=control,
metric ='Accuracy' )
print(knncv)
## k-Nearest Neighbors
##
## 242 samples
## 13 predictor
## 2 classes: 'Heart Disease', 'No Heart Disease'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 217, 218, 218, 218, 218, 218, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6488333 0.2856976
## 7 0.6450000 0.2769199
## 9 0.6615000 0.3087592
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
prediction_knncv <- predict (knncv,newdata=validation)
confusionMatrix(prediction_knncv, validation$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Heart Disease No Heart Disease
## Heart Disease 21 11
## No Heart Disease 12 17
##
## Accuracy : 0.623
## 95% CI : (0.4896, 0.7439)
## No Information Rate : 0.541
## P-Value [Acc > NIR] : 0.1234
##
## Kappa : 0.2428
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.6364
## Specificity : 0.6071
## Pos Pred Value : 0.6563
## Neg Pred Value : 0.5862
## Prevalence : 0.5410
## Detection Rate : 0.3443
## Detection Prevalence : 0.5246
## Balanced Accuracy : 0.6218
##
## 'Positive' Class : Heart Disease
##
## positive class is now correct: heart disease
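For reference, the headline numbers above follow directly from the 2x2 table; a quick check:
cm <- matrix(c(21, 12, 11, 17), nrow = 2)  # rows = prediction, cols = reference
sum(diag(cm)) / sum(cm)   # accuracy    = (21 + 17) / 61 = 0.623
cm[1, 1] / sum(cm[, 1])   # sensitivity = 21 / 33 = 0.636
cm[2, 2] / sum(cm[, 2])   # specificity = 17 / 28 = 0.607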
knncv_autogrid <- train(target~.,
data=edx,
method = "knn",
trControl = control,
metric = 'Accuracy',
tuneLength = 30)
knncv_autogrid$results %>%
arrange(desc(Accuracy)) %>%
round(3)
## k Accuracy Kappa AccuracySD KappaSD
## 1 13 0.706 0.399 0.123 0.256
## 2 15 0.706 0.400 0.114 0.237
## 3 17 0.694 0.372 0.100 0.209
## 4 21 0.686 0.356 0.099 0.208
## 5 23 0.682 0.345 0.122 0.257
## 6 19 0.674 0.330 0.098 0.207
## 7 25 0.665 0.313 0.100 0.212
## 8 9 0.665 0.315 0.095 0.195
## 9 11 0.665 0.313 0.103 0.210
## 10 5 0.664 0.314 0.095 0.196
## 11 45 0.657 0.286 0.102 0.216
## 12 27 0.657 0.296 0.103 0.218
## 13 39 0.657 0.288 0.082 0.174
## 14 7 0.656 0.296 0.112 0.233
## 15 41 0.653 0.277 0.099 0.214
## 16 31 0.653 0.285 0.083 0.179
## 17 29 0.653 0.285 0.086 0.183
## 18 43 0.649 0.269 0.092 0.195
## 19 47 0.645 0.258 0.103 0.222
## 20 33 0.640 0.261 0.068 0.144
## 21 55 0.636 0.240 0.104 0.219
## 22 57 0.636 0.239 0.096 0.203
## 23 37 0.636 0.243 0.092 0.198
## 24 59 0.632 0.231 0.085 0.180
## 25 35 0.628 0.230 0.082 0.175
## 26 49 0.624 0.215 0.101 0.214
## 27 61 0.620 0.205 0.090 0.194
## 28 51 0.620 0.205 0.096 0.203
## 29 63 0.620 0.203 0.091 0.196
## 30 53 0.615 0.196 0.094 0.201
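The selected neighborhood size can also be read directly off the caret object rather than from the sorted table:
best_k <- knncv_autogrid$bestTune   # k = 13 on this run, per the table above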
colors <- c("Accuracy" = "red" )
knncv_autogrid$results %>%
ggplot(aes(k)) +
geom_line(aes(y = Accuracy, col = "Accuracy")) +
geom_point(aes(y = Accuracy)) +
labs(title = "K-NN with 10 repeated 10 fold Cross Validation, 1st Sample Split",
subtitle = "Accuracy ",
color = "Legend") +
scale_color_manual(values = colors) +
theme(legend.position = "top",
axis.title.y = element_blank())
### 2nd Training/Testing Data Split
knncv2 <- train(target ~ .,
                data = train_set2,
                method = "knn",
                trControl = control,
                metric = 'Accuracy')
print(knncv2)
knncv_autogrid2 <- train(target ~ .,
                         data = train_set2,
                         method = "knn",
                         trControl = control,
                         metric = 'Accuracy',
                         tuneLength = 30)
knncv_autogrid2$results %>%
arrange(desc(Accuracy)) %>%
round(3)
## k Accuracy Kappa AccuracySD KappaSD
## 1 17 0.673 0.324 0.125 0.252
## 2 19 0.673 0.329 0.139 0.280
## 3 33 0.673 0.323 0.127 0.255
## 4 37 0.663 0.303 0.148 0.293
## 5 35 0.663 0.302 0.139 0.277
## 6 47 0.663 0.298 0.140 0.282
## 7 15 0.663 0.305 0.133 0.263
## 8 31 0.658 0.292 0.136 0.273
## 9 21 0.658 0.294 0.129 0.262
## 10 9 0.658 0.295 0.160 0.316
## 11 27 0.658 0.295 0.139 0.278
## 12 5 0.657 0.301 0.125 0.244
## 13 63 0.653 0.275 0.084 0.169
## 14 11 0.653 0.289 0.140 0.279
## 15 13 0.652 0.288 0.126 0.250
## 16 45 0.648 0.266 0.123 0.248
## 17 53 0.648 0.265 0.110 0.220
## 18 61 0.648 0.261 0.077 0.156
## 19 29 0.647 0.271 0.141 0.281
## 20 39 0.647 0.266 0.125 0.247
## 21 23 0.642 0.260 0.133 0.268
## 22 55 0.638 0.245 0.097 0.191
## 23 51 0.637 0.241 0.116 0.234
## 24 59 0.637 0.240 0.090 0.178
## 25 25 0.637 0.252 0.147 0.294
## 26 41 0.637 0.246 0.139 0.274
## 27 57 0.632 0.232 0.082 0.161
## 28 49 0.632 0.231 0.128 0.256
## 29 7 0.631 0.241 0.139 0.270
## 30 43 0.622 0.213 0.143 0.283
colors <- c("Accuracy" = "blue" )
knncv_autogrid2$results %>%
ggplot(aes(k)) +
geom_line(aes(y = Accuracy, col = "Accuracy")) +
geom_point(aes(y = Accuracy)) +
labs(title = "2nd Sample Split, K-NN with 10 repeated 10 fold Cross Validation",
subtitle = "Accuracy ",
color = "Legend") +
scale_color_manual(values = colors) +
theme(legend.position = "top",
axis.title.y = element_blank())
# EconCost: a custom caret summary function returning an average "economic cost"
# per case. Note: cost.p was not defined in the original; the vector below is a
# placeholder assumption (costs for TP, FN, FP, TN -- the column-major order in
# which as.vector() unrolls the 2x2 confusion matrix, with "Heart Disease" as
# the positive class).
cost.p <- c(0, 4, 1, 0)  # placeholder: missing a disease case (FN) costs most
EconCost <- function(data, lev = NULL, model = NULL)
{
  y.pred = data$pred
  y.true = data$obs
  CM = confusionMatrix(y.pred, y.true)$table
  out = sum(as.vector(CM) * cost.p) / sum(CM)
  names(out) <- c("EconCost")
  out
}
# 5-fold cross-validation control using the economic-cost summary
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = EconCost,
                     verboseIter = TRUE)
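Note that ctrl and EconCost are defined but not used by the randomForest() call below; they would plug into caret::train as in this sketch (rf_econ is hypothetical, not run here; factor levels with spaces would need renaming, e.g. via make.names(), before classProbs = TRUE will work):
# rf_econ <- train(target ~ ., data = edx, method = "rf",
#                  metric = "EconCost", maximize = FALSE, trControl = ctrl)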
# Random forest: 100 trees, mtry = 5, voting cutoff c(0.6, 0.4)
RF_train <- randomForest(target ~ .,
                         data = edx,
                         ntree = 100, mtry = 5, cutoff = c(0.6, 0.4),
                         importance = TRUE, do.trace = TRUE)
## ntree OOB 1 2
## 1: 23.16% 12.00% 35.56%
## 2: 26.71% 24.36% 29.41%
## 3: 26.49% 24.74% 28.41%
## 4: 27.59% 27.36% 27.84%
## 5: 29.30% 29.82% 28.71%
## 6: 28.32% 28.69% 27.88%
## 7: 25.65% 26.19% 25.00%
## 8: 26.41% 28.35% 24.04%
## 9: 24.68% 25.00% 24.30%
## 10: 23.85% 25.95% 21.30%
## 11: 25.42% 26.72% 23.85%
## 12: 24.58% 26.72% 22.02%
## 13: 23.75% 26.72% 20.18%
## 14: 25.31% 28.79% 21.10%
## 15: 24.79% 27.27% 21.82%
## 16: 26.03% 28.03% 23.64%
## 17: 24.79% 26.52% 22.73%
## 18: 23.55% 25.76% 20.91%
## 19: 24.79% 28.03% 20.91%
## 20: 24.38% 27.27% 20.91%
## 21: 24.79% 27.27% 21.82%
## 22: 23.97% 25.76% 21.82%
## 23: 23.55% 25.76% 20.91%
## 24: 23.14% 25.00% 20.91%
## 25: 23.97% 26.52% 20.91%
## 26: 24.79% 27.27% 21.82%
## 27: 23.55% 26.52% 20.00%
## 28: 22.31% 25.76% 18.18%
## 29: 21.49% 24.24% 18.18%
## 30: 22.73% 25.00% 20.00%
## 31: 21.07% 23.48% 18.18%
## 32: 21.90% 22.73% 20.91%
## 33: 21.49% 21.97% 20.91%
## 34: 21.07% 22.73% 19.09%
## 35: 19.83% 21.21% 18.18%
## 36: 21.49% 22.73% 20.00%
## 37: 21.90% 23.48% 20.00%
## 38: 23.14% 26.52% 19.09%
## 39: 21.90% 24.24% 19.09%
## 40: 21.90% 24.24% 19.09%
## 41: 21.90% 24.24% 19.09%
## 42: 21.90% 23.48% 20.00%
## 43: 21.49% 22.73% 20.00%
## 44: 21.90% 23.48% 20.00%
## 45: 21.49% 23.48% 19.09%
## 46: 21.07% 23.48% 18.18%
## 47: 22.73% 25.00% 20.00%
## 48: 21.90% 23.48% 20.00%
## 49: 21.90% 23.48% 20.00%
## 50: 21.90% 23.48% 20.00%
## 51: 22.73% 24.24% 20.91%
## 52: 23.14% 25.00% 20.91%
## 53: 22.31% 23.48% 20.91%
## 54: 21.90% 22.73% 20.91%
## 55: 22.31% 23.48% 20.91%
## 56: 22.31% 23.48% 20.91%
## 57: 21.90% 23.48% 20.00%
## 58: 23.14% 25.00% 20.91%
## 59: 22.31% 23.48% 20.91%
## 60: 21.90% 23.48% 20.00%
## 61: 23.55% 25.76% 20.91%
## 62: 23.14% 25.00% 20.91%
## 63: 23.14% 25.00% 20.91%
## 64: 22.73% 24.24% 20.91%
## 65: 22.73% 24.24% 20.91%
## 66: 22.73% 24.24% 20.91%
## 67: 21.90% 22.73% 20.91%
## 68: 21.90% 22.73% 20.91%
## 69: 21.90% 22.73% 20.91%
## 70: 21.49% 22.73% 20.00%
## 71: 21.90% 22.73% 20.91%
## 72: 21.90% 22.73% 20.91%
## 73: 21.49% 21.97% 20.91%
## 74: 21.49% 21.97% 20.91%
## 75: 21.90% 22.73% 20.91%
## 76: 21.90% 22.73% 20.91%
## 77: 22.31% 23.48% 20.91%
## 78: 21.49% 22.73% 20.00%
## 79: 22.31% 24.24% 20.00%
## 80: 21.90% 23.48% 20.00%
## 81: 22.31% 24.24% 20.00%
## 82: 21.90% 23.48% 20.00%
## 83: 21.90% 23.48% 20.00%
## 84: 21.90% 23.48% 20.00%
## 85: 21.90% 24.24% 19.09%
## 86: 22.73% 25.00% 20.00%
## 87: 22.73% 25.00% 20.00%
## 88: 22.31% 24.24% 20.00%
## 89: 21.49% 22.73% 20.00%
## 90: 21.49% 23.48% 19.09%
## 91: 21.49% 22.73% 20.00%
## 92: 21.07% 21.97% 20.00%
## 93: 20.66% 21.97% 19.09%
## 94: 21.49% 23.48% 19.09%
## 95: 21.49% 23.48% 19.09%
## 96: 21.90% 24.24% 19.09%
## 97: 21.90% 24.24% 19.09%
## 98: 22.31% 24.24% 20.00%
## 99: 21.49% 22.73% 20.00%
## 100: 21.07% 22.73% 19.09%
# plot(RF_train)
RF_pred <- predict(RF_train, newdata=validation)
confusionMatrix (RF_pred, validation$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Heart Disease No Heart Disease
## Heart Disease 23 1
## No Heart Disease 10 27
##
## Accuracy : 0.8197
## 95% CI : (0.7002, 0.9064)
## No Information Rate : 0.541
## P-Value [Acc > NIR] : 4.82e-06
##
## Kappa : 0.6455
##
## Mcnemar's Test P-Value : 0.01586
##
## Sensitivity : 0.6970
## Specificity : 0.9643
## Pos Pred Value : 0.9583
## Neg Pred Value : 0.7297
## Prevalence : 0.5410
## Detection Rate : 0.3770
## Detection Prevalence : 0.3934
## Balanced Accuracy : 0.8306
##
## 'Positive' Class : Heart Disease
##
Because the ROC curve requires all-numeric inputs (nothing categorical), we go back to the original Hearts_UCI dataset, where every variable is numeric except target, which we convert to numeric as well. Note that although the earlier cross-validation selected k = 9, the ROC curve below is computed at k = 5.
head(Hearts_UCI,2)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 3 145 233 1 0 150 0 2.3 0 0 1
## 2 37 1 2 130 250 0 1 187 0 3.5 0 0 2
## target
## 1 1
## 2 1
Hearts_UCI$target <- as.numeric (Hearts_UCI$target)
test_index2 <- createDataPartition(y = Hearts_UCI$target,
                                   times = 1, p = 0.2, list = FALSE)
edx_rocknn <- Hearts_UCI[-test_index2,]
validation_rocknn <- Hearts_UCI[test_index2,] # used only for the final test
# ROC curve for k-NN (ROCR was loaded above)
model2roclmm<- knn(train=edx_rocknn[,1:13], test=validation_rocknn[,1:13],cl=edx_rocknn$target, k=5,prob=TRUE)
prob <- attr(model2roclmm, "prob")
prob <- ifelse(model2roclmm == "0", 1-prob, prob)
pred_knn <- prediction(prob, validation_rocknn$target)
pred_knn <- performance(pred_knn, "tpr", "fpr")
plot(pred_knn, avg= "threshold", lwd=1, main="kNN ROC Curve at K=5")
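The area under this curve can be computed from the same ROCR objects; a small addition, not part of the original run:
auc_knn <- performance(prediction(prob, validation_rocknn$target), "auc")
# unlist(auc_knn@y.values)  # single AUC value for the k = 5 k-NN model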
RF_train_tuning <- train(target ~.,
method = "rf",
data=edx,
# data=train_set,
preProcess = c("center", "scale"),
ntree = 100,
cutoff=c(0.6,0.4),
tuneGrid = expand.grid(mtry=c(4,5,6)),
maximize = F,
trControl = control)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: restecgHypertrophy
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: restecgHypertrophy
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: restecgHypertrophy
summary(RF_train_tuning)
## Length Class Mode
## call 6 -none- call
## type 1 -none- character
## predicted 242 factor numeric
## err.rate 300 -none- numeric
## confusion 6 -none- numeric
## votes 484 matrix numeric
## oob.times 242 -none- numeric
## classes 2 -none- character
## importance 20 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 242 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 20 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 2 -none- list
# typeof(RF_train_tuning)
# Variable importance
RF_imp <- varImp (RF_train, scale=F)
typeof(RF_imp)
## [1] "list"
print(RF_imp)
## Heart Disease No Heart Disease
## age 0.0080127111 0.0080127111
## sex 0.0175670465 0.0175670465
## cp 0.0337980632 0.0337980632
## trestbps 0.0127241721 0.0127241721
## chol -0.0053457571 -0.0053457571
## fbs -0.0004637901 -0.0004637901
## restecg 0.0018058235 0.0018058235
## thalach 0.0243581828 0.0243581828
## exang 0.0085087665 0.0085087665
## oldpeak 0.0240907243 0.0240907243
## slope 0.0042460161 0.0042460161
## ca 0.0361097703 0.0361097703
## thal 0.0266879867 0.0266879867
# plot (RF_imp)
varImpPlot(RF_train, type=2)
#
# Random forest on the 2nd split (note: only 5 trees this time)
RF_train_2nd <- randomForest(target ~ .,
                             data = train_set2,
                             ntree = 5, mtry = 5, cutoff = c(0.6, 0.4),
                             importance = TRUE, do.trace = TRUE)
## ntree OOB 1 2
## 1: 29.73% 24.39% 36.36%
## 2: 26.27% 28.57% 23.64%
## 3: 24.31% 29.33% 18.84%
## 4: 26.38% 26.74% 25.97%
## 5: 28.57% 31.25% 25.32%
RF_pred_set2 <- predict(RF_train_2nd, newdata=test_set2)
confusionMatrix (RF_pred_set2, test_set2$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Heart Disease No Heart Disease
## Heart Disease 21 3
## No Heart Disease 6 19
##
## Accuracy : 0.8163
## 95% CI : (0.6798, 0.9124)
## No Information Rate : 0.551
## P-Value [Acc > NIR] : 9.098e-05
##
## Kappa : 0.6334
##
## Mcnemar's Test P-Value : 0.505
##
## Sensitivity : 0.7778
## Specificity : 0.8636
## Pos Pred Value : 0.8750
## Neg Pred Value : 0.7600
## Prevalence : 0.5510
## Detection Rate : 0.4286
## Detection Prevalence : 0.4898
## Balanced Accuracy : 0.8207
##
## 'Positive' Class : Heart Disease
##
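Pulling together the hold-out accuracies reported above gives a compact comparison (values transcribed from the confusion matrices):
results <- data.frame(model    = c("k-NN (k = 9, 1st split)",
                                   "Random forest (1st split)",
                                   "Random forest (2nd split)"),
                      accuracy = c(0.623, 0.820, 0.816))
# results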
Machine learning algorithms (the classification techniques in this project: KNN and random forest) can predict with reasonable accuracy whether a person has heart disease. On the hold-out sets, accuracy ranged from roughly 62% (k-NN) to 82% (random forest); random forest performed best, at about 82% on both the first and second splits.
References:
Confusion matrix and related metrics: https://en.wikipedia.org/wiki/Confusion_matrix
ggcorrplot: visualization of a correlation matrix using ggplot2: https://cran.r-project.org/web/packages/ggcorrplot/readme/README.html
featurePlot visualization, caret package documentation
Arranging plots with par(mfrow) and layout(): https://bookdown.org/ndphillips/YaRrr/arranging-plots-with-parmfrow-and-layout.html