Introduction
This data set is Cleveland heart disease. Heart disease describes a range of conditions that affect the heart from UC Irvine Repository.
Heart diseases include:
- Blood vessel disease, such as coronary artery disease
- Irregular heartbeats (arrhythmias)
- Heart problems you’re born with (congenital heart defects)
- Disease of the heart muscle
- Heart valve disease
Many forms of heart disease can be prevented or treated with healthy
lifestyle choices. Heart disease symptoms depend on the type of heart
disease such as dizziness, fatigue, and feeling short of breath.
source:Mayo
Clinic
Changing your lifestyle can prevent and reverse certain types of
heart disease. Below are five common risk factors associated to heart
disease:
-High Blood Pressure
-High Low- density Lipoprotein (LDL) Cholesterol
-Diabetes
-Smoking and Secondhand Smoke Exposure
-Physical Inactivity & Unhealthy Diet
source:southcoasthealth
Data Preparation
Prerequisites
Importing Libraries
Importing Dataset
We will using read.table() function for importing the
dataset because the data is on .data file type.
cl <- read.table("processed.cleveland.data", fileEncoding = "UTF-8", sep = ",")
rmarkdown::paged_table(cl)Data Columns:
V1: age in years
V2: sex (1 = male; 0 = female)
V3: cp = chest pain type
– Value 1: typical angina
– Value 2: atypical angina
– Value 3: non-anginal pain
– Value 4: asymptomatic
V4: trestbps = resting blood pressure (in mm Hg on
admission to the hospital)
V5: chol = serum cholestoral in mg/dl
V6: fbs = fasting blood sugar > 120 mg/dl (1 = true; 0 =
false)
V7: restecg = resting electrocardiographic results
– Value 0: normal
– Value 1: having ST-T wave abnormality (T wave inversions and/or ST
elevation or depression of > 0.05 mV)
– Value 2: showing probable or definite left ventricular hypertrophy by
Estes’ criteria
V8: thalach = maximum heart rate achieved
V9: exang = exercise induced angina (1 = yes; 0 = no) V10:
oldpeak = ST depression induced by exercise relative to
rest V11: slope = the slope of the peak exercise ST segment
– Value 1: upsloping – Value 2: flat – Value 3: downsloping V12:
ca = number of major vessels (0-3) colored by
flourosopy
V13: thal = (3) = normal; (6) = fixed defect; (7) =
reversable defect V14: class = diagnosis of heart disease
(angiographic disease status)
Diagonosis of heart disease (4 classes). It can be 2 classes by setting
0 for 0 values and 1 for non-0 values.
Change the column name so it more intuitive according to the column data above.
names(cl) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "class")Change the class column become 2 classes
cl$class <- sapply(X= as.character(cl$class),
FUN = switch,
"0" = "0",
"1" = "1",
"2" = "1",
"3" = "1",
"4" = "1")#> age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
#> 1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0
#> 2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0
#> 3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0
#> 4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0
#> 5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0
#> 6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0
#> class
#> 1 0
#> 2 1
#> 3 1
#> 4 0
#> 5 0
#> 6 0
Data Processing
Missing Values
#> age sex cp trestbps chol fbs restecg thalach
#> 0 0 0 0 0 0 0 0
#> exang oldpeak slope ca thal class
#> 0 0 0 0 0 0
Data Types
#> Rows: 303
#> Columns: 14
#> $ age <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 5…
#> $ sex <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1…
#> $ cp <dbl> 1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3, 2, 4, 3, 2, 1…
#> $ trestbps <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 1…
#> $ chol <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 2…
#> $ fbs <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0…
#> $ restecg <dbl> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2…
#> $ thalach <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 1…
#> $ exang <dbl> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1…
#> $ oldpeak <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0…
#> $ slope <dbl> 3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1, 3, 1, 1, 1, 2…
#> $ ca <chr> "0.0", "3.0", "2.0", "0.0", "0.0", "0.0", "2.0", "0.0", "1.0"…
#> $ thal <chr> "6.0", "3.0", "7.0", "3.0", "3.0", "3.0", "3.0", "3.0", "7.0"…
#> $ class <chr> "0", "1", "1", "0", "0", "0", "1", "0", "1", "1", "0", "0", "…
Convert sex,
fbs,cp,restecg,exang,slope,ca,
thal, and class into factor data type.
cl_clean <- cl %>%
mutate_at(.vars = c("sex", "fbs", "cp", "restecg", "exang", "slope","ca","thal", "class"),
.funs = as.factor) %>%
glimpse()#> Rows: 303
#> Columns: 14
#> $ age <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 5…
#> $ sex <fct> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1…
#> $ cp <fct> 1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3, 2, 4, 3, 2, 1…
#> $ trestbps <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 1…
#> $ chol <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 2…
#> $ fbs <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0…
#> $ restecg <fct> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2…
#> $ thalach <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 1…
#> $ exang <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1…
#> $ oldpeak <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0…
#> $ slope <fct> 3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1, 3, 1, 1, 1, 2…
#> $ ca <fct> 0.0, 3.0, 2.0, 0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1…
#> $ thal <fct> 6.0, 3.0, 7.0, 3.0, 3.0, 3.0, 3.0, 3.0, 7.0, 7.0, 6.0, 3.0, 6…
#> $ class <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0…
Exploratory Data Analysis
#> # A tibble: 9 × 5
#> col_name cnt common common_pcnt levels
#> <chr> <int> <chr> <dbl> <named list>
#> 1 ca 5 0.0 58.1 <tibble [5 × 3]>
#> 2 class 2 0 54.1 <tibble [2 × 3]>
#> 3 cp 4 4 47.5 <tibble [4 × 3]>
#> 4 exang 2 0 67.3 <tibble [2 × 3]>
#> 5 fbs 2 0 85.1 <tibble [2 × 3]>
#> 6 restecg 3 0 49.8 <tibble [3 × 3]>
#> 7 sex 2 1 68.0 <tibble [2 × 3]>
#> 8 slope 3 1 46.9 <tibble [3 × 3]>
#> 9 thal 4 3.0 54.8 <tibble [4 × 3]>
#> # A tibble: 5 × 10
#> col_name min q1 median mean q3 max sd pcnt_na hist
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <named list>
#> 1 age 29 48 56 54.4 61 77 9.04 0 <tibble [25 × 2]>
#> 2 trestbps 94 120 130 132. 140 200 17.6 0 <tibble [22 × 2]>
#> 3 chol 126 211 241 247. 275 564 51.8 0 <tibble [23 × 2]>
#> 4 thalach 71 134. 153 150. 166 202 22.9 0 <tibble [27 × 2]>
#> 5 oldpeak 0 0 0.8 1.04 1.6 6.2 1.16 0 <tibble [13 × 2]>
💡 Insight:
- 67.98% of patients are male.
- The youngest patient is 29 years old, the oldest patient is 77 years
old, and the average age of the patient is 54 years old.
Cross Validation
We will divide data into train data and test data using the
rsample library. The proportion of train data is 80%.
RNGkind(sample.kind = "Rounding")
set.seed(150)
splitter <- initial_split(data = cl_clean, prop = 0.8)
cl_train <- training(splitter)
cl_test <- testing(splitter)
cl_test#> age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
#> 1 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0
#> 2 44 1 2 120 263 0 0 173 0 0.0 1 0.0 7.0
#> 3 58 1 2 120 284 0 2 160 0 1.8 2 0.0 3.0
#> 4 59 1 4 135 234 0 0 161 0 0.5 2 0.0 7.0
#> 5 42 1 4 140 226 0 0 178 0 0.0 1 0.0 3.0
#> 6 61 1 3 150 243 1 0 137 1 1.0 2 0.0 3.0
#> 7 65 0 4 150 225 0 2 114 0 1.0 2 3.0 7.0
#> 8 41 0 2 105 198 0 0 168 0 0.0 1 1.0 3.0
#> 9 44 1 4 112 290 0 2 153 0 0.0 1 1.0 3.0
#> 10 50 1 3 140 233 0 0 163 0 0.6 2 1.0 7.0
#> 11 51 1 1 125 213 0 2 125 1 1.4 1 1.0 3.0
#> 12 46 0 3 142 177 0 2 160 1 1.4 3 0.0 3.0
#> 13 58 1 4 128 216 0 2 131 1 2.2 2 3.0 7.0
#> 14 58 1 4 150 270 0 2 111 1 0.8 1 0.0 7.0
#> 15 45 1 4 104 208 0 2 148 1 3.0 2 0.0 3.0
#> 16 53 0 4 130 264 0 2 143 0 0.4 2 0.0 3.0
#> 17 39 1 3 140 321 0 2 182 0 0.0 1 0.0 3.0
#> 18 68 1 3 180 274 1 2 150 1 1.6 2 0.0 7.0
#> 19 52 1 2 120 325 0 0 172 0 0.2 1 0.0 3.0
#> 20 51 0 3 130 256 0 2 149 0 0.5 1 0.0 3.0
#> 21 62 0 4 160 164 0 2 145 0 6.2 3 3.0 7.0
#> 22 52 1 4 128 255 0 0 161 1 0.0 1 1.0 7.0
#> 23 49 1 3 120 188 0 0 139 0 2.0 2 3.0 7.0
#> 24 39 1 4 118 219 0 0 140 0 1.2 2 0.0 7.0
#> 25 48 1 4 130 256 1 2 150 1 0.0 1 2.0 7.0
#> 26 65 1 1 138 282 1 2 174 0 1.4 2 1.0 3.0
#> 27 45 0 2 130 234 0 2 175 0 0.6 2 0.0 3.0
#> 28 56 0 4 200 288 1 2 133 1 4.0 3 2.0 7.0
#> 29 29 1 2 130 204 0 2 202 0 0.0 1 0.0 3.0
#> 30 51 1 4 140 261 0 2 186 1 0.0 1 0.0 3.0
#> 31 59 1 1 170 288 0 2 159 0 0.2 2 0.0 7.0
#> 32 45 1 2 128 308 0 2 170 0 0.0 1 0.0 3.0
#> 33 42 0 4 102 265 0 2 122 0 0.6 2 0.0 3.0
#> 34 67 0 3 115 564 0 2 160 0 1.6 2 0.0 7.0
#> 35 55 1 4 160 289 0 2 145 1 0.8 2 1.0 7.0
#> 36 70 1 4 130 322 0 2 109 0 2.4 2 3.0 3.0
#> 37 58 1 4 125 300 0 2 171 0 0.0 1 2.0 7.0
#> 38 51 1 4 140 298 0 0 122 1 4.2 2 3.0 7.0
#> 39 43 1 4 132 247 1 2 143 1 0.1 2 ? 7.0
#> 40 68 0 3 120 211 0 2 115 0 1.5 2 0.0 3.0
#> 41 67 1 4 100 299 0 2 125 1 0.9 2 2.0 3.0
#> 42 45 0 4 138 236 0 2 152 1 0.2 2 0.0 3.0
#> 43 50 0 2 120 244 0 0 162 0 1.1 1 0.0 3.0
#> 44 58 1 4 128 259 0 2 130 1 3.0 2 2.0 7.0
#> 45 55 1 2 130 262 0 0 155 0 0.0 1 0.0 3.0
#> 46 56 1 1 120 193 0 2 162 0 1.9 2 0.0 7.0
#> 47 64 0 4 130 303 0 0 122 0 2.0 2 2.0 3.0
#> 48 41 0 3 112 268 0 2 172 1 0.0 1 0.0 3.0
#> 49 53 1 4 123 282 0 0 95 1 2.0 2 2.0 7.0
#> 50 66 1 4 112 212 0 2 132 1 0.1 1 1.0 3.0
#> 51 60 0 3 120 178 1 0 96 0 0.0 1 0.0 3.0
#> 52 58 1 4 100 234 0 0 156 0 0.1 1 1.0 7.0
#> 53 52 1 4 125 212 0 0 168 0 1.0 1 2.0 7.0
#> 54 62 1 2 128 208 1 2 140 0 0.0 1 0.0 3.0
#> 55 57 1 4 110 201 0 0 126 1 1.5 2 0.0 6.0
#> 56 60 0 1 150 240 0 0 171 0 0.9 1 0.0 3.0
#> 57 61 1 4 138 166 0 2 125 1 3.6 2 1.0 3.0
#> 58 71 0 4 112 149 0 0 125 0 1.6 2 0.0 3.0
#> 59 57 1 2 154 232 0 2 164 0 0.0 1 1.0 3.0
#> 60 57 1 4 110 335 0 0 143 1 3.0 2 1.0 7.0
#> 61 56 1 2 130 221 0 2 163 0 0.0 1 0.0 7.0
#> class
#> 1 0
#> 2 0
#> 3 1
#> 4 0
#> 5 0
#> 6 0
#> 7 1
#> 8 0
#> 9 1
#> 10 1
#> 11 0
#> 12 0
#> 13 1
#> 14 1
#> 15 0
#> 16 0
#> 17 0
#> 18 1
#> 19 0
#> 20 0
#> 21 1
#> 22 1
#> 23 1
#> 24 1
#> 25 1
#> 26 1
#> 27 0
#> 28 1
#> 29 0
#> 30 0
#> 31 1
#> 32 0
#> 33 0
#> 34 0
#> 35 1
#> 36 1
#> 37 1
#> 38 1
#> 39 1
#> 40 0
#> 41 1
#> 42 0
#> 43 0
#> 44 1
#> 45 0
#> 46 0
#> 47 0
#> 48 0
#> 49 1
#> 50 1
#> 51 0
#> 52 1
#> 53 1
#> 54 0
#> 55 0
#> 56 0
#> 57 1
#> 58 0
#> 59 1
#> 60 1
#> 61 0
Imbalance Class
Then we will check the imbalance class of train and test data. The data
should have a good balance because we need to keep data still and not be
biased.
#>
#> 0 1
#> 0.5454545 0.4545455
Our data has a good balance, it’s not be biased. Then, we can continue making the model.
Modeling
Logistic Regression
Build the model
1. All variable as predictor model
#>
#> Call:
#> glm(formula = class ~ ., family = "binomial", data = cl_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -16.727003 1160.349633 -0.014 0.98850
#> age -0.033775 0.028932 -1.167 0.24305
#> sex1 1.350498 0.620206 2.177 0.02944 *
#> cp2 1.548061 0.941269 1.645 0.10004
#> cp3 0.624294 0.824267 0.757 0.44881
#> cp4 2.661976 0.828497 3.213 0.00131 **
#> trestbps 0.019276 0.012910 1.493 0.13541
#> chol 0.003562 0.005190 0.686 0.49246
#> fbs1 -0.682929 0.676709 -1.009 0.31288
#> restecg1 0.491888 2.659196 0.185 0.85325
#> restecg2 0.484779 0.443916 1.092 0.27481
#> thalach -0.024950 0.013593 -1.836 0.06643 .
#> exang1 0.906034 0.513217 1.765 0.07750 .
#> oldpeak 0.505875 0.268267 1.886 0.05933 .
#> slope2 1.318425 0.532120 2.478 0.01322 *
#> slope3 0.355442 1.027862 0.346 0.72949
#> ca0.0 15.081081 1160.343611 0.013 0.98963
#> ca1.0 16.787448 1160.343702 0.014 0.98846
#> ca2.0 18.547202 1160.343934 0.016 0.98725
#> ca3.0 16.850481 1160.343966 0.015 0.98841
#> thal3.0 -1.734833 2.234844 -0.776 0.43759
#> thal6.0 -1.773673 2.349699 -0.755 0.45034
#> thal7.0 -0.493088 2.242270 -0.220 0.82594
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 333.48 on 241 degrees of freedom
#> Residual deviance: 147.52 on 219 degrees of freedom
#> AIC: 193.52
#>
#> Number of Fisher Scoring iterations: 15
2. Stepwise model
#> Start: AIC=193.52
#> class ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach +
#> exang + oldpeak + slope + ca + thal
#>
#> Df Deviance AIC
#> - restecg 2 148.73 190.73
#> - chol 1 147.99 191.99
#> - fbs 1 148.57 192.57
#> - age 1 148.90 192.90
#> <none> 147.52 193.52
#> - trestbps 1 149.80 193.80
#> - exang 1 150.61 194.61
#> - thal 3 154.84 194.84
#> - thalach 1 151.21 195.21
#> - oldpeak 1 151.24 195.24
#> - slope 2 154.37 196.37
#> - sex 1 152.58 196.58
#> - cp 3 167.80 207.80
#> - ca 4 172.54 210.54
#>
#> Step: AIC=190.73
#> class ~ age + sex + cp + trestbps + chol + fbs + thalach + exang +
#> oldpeak + slope + ca + thal
#>
#> Df Deviance AIC
#> - chol 1 149.52 189.52
#> - fbs 1 149.80 189.80
#> - age 1 149.87 189.87
#> <none> 148.73 190.73
#> - trestbps 1 151.26 191.26
#> - exang 1 151.53 191.53
#> - thal 3 155.88 191.88
#> - thalach 1 152.24 192.24
#> - oldpeak 1 152.31 192.31
#> - sex 1 154.25 194.25
#> - slope 2 156.31 194.31
#> - cp 3 168.87 204.87
#> - ca 4 174.37 208.37
#>
#> Step: AIC=189.52
#> class ~ age + sex + cp + trestbps + fbs + thalach + exang + oldpeak +
#> slope + ca + thal
#>
#> Df Deviance AIC
#> - age 1 150.46 188.46
#> - fbs 1 150.48 188.48
#> <none> 149.52 189.52
#> - trestbps 1 152.23 190.23
#> - exang 1 152.50 190.50
#> - thalach 1 152.71 190.71
#> - thal 3 156.90 190.90
#> - oldpeak 1 153.73 191.73
#> - sex 1 154.42 192.42
#> - slope 2 157.19 193.19
#> - cp 3 170.63 204.63
#> - ca 4 175.16 207.16
#>
#> Step: AIC=188.46
#> class ~ sex + cp + trestbps + fbs + thalach + exang + oldpeak +
#> slope + ca + thal
#>
#> Df Deviance AIC
#> - fbs 1 151.31 187.31
#> <none> 150.46 188.46
#> - trestbps 1 152.47 188.47
#> - thalach 1 152.81 188.81
#> - exang 1 153.37 189.37
#> - thal 3 157.87 189.87
#> - oldpeak 1 155.18 191.18
#> - slope 2 157.85 191.85
#> - sex 1 155.90 191.90
#> - cp 3 171.75 203.75
#> - ca 4 175.97 205.97
#>
#> Step: AIC=187.31
#> class ~ sex + cp + trestbps + thalach + exang + oldpeak + slope +
#> ca + thal
#>
#> Df Deviance AIC
#> - trestbps 1 153.16 187.16
#> <none> 151.31 187.31
#> - thalach 1 153.72 187.72
#> - exang 1 154.15 188.15
#> - thal 3 158.79 188.79
#> - oldpeak 1 156.47 190.47
#> - slope 2 158.57 190.57
#> - sex 1 156.60 190.60
#> - ca 4 175.97 203.97
#> - cp 3 173.99 203.99
#>
#> Step: AIC=187.16
#> class ~ sex + cp + thalach + exang + oldpeak + slope + ca + thal
#>
#> Df Deviance AIC
#> - thalach 1 155.13 187.13
#> <none> 153.16 187.16
#> - exang 1 155.93 187.93
#> - thal 3 161.03 189.03
#> - sex 1 157.39 189.39
#> - slope 2 160.14 190.14
#> - oldpeak 1 159.57 191.57
#> - cp 3 175.12 203.12
#> - ca 4 177.13 203.13
#>
#> Step: AIC=187.13
#> class ~ sex + cp + exang + oldpeak + slope + ca + thal
#>
#> Df Deviance AIC
#> <none> 155.13 187.13
#> - exang 1 158.54 188.54
#> - sex 1 158.63 188.63
#> - thal 3 163.66 189.66
#> - oldpeak 1 162.24 192.24
#> - slope 2 165.43 193.43
#> - ca 4 181.04 205.04
#> - cp 3 180.06 206.06
#>
#> Call: glm(formula = class ~ sex + cp + exang + oldpeak + slope + ca +
#> thal, family = "binomial", data = cl_train)
#>
#> Coefficients:
#> (Intercept) sex1 cp2 cp3 cp4 exang1
#> -18.7917 1.0037 1.4235 0.3478 2.6281 0.9066
#> oldpeak slope2 slope3 ca0.0 ca1.0 ca2.0
#> 0.6464 1.4996 0.3402 15.2757 16.9758 18.2221
#> ca3.0 thal3.0 thal6.0 thal7.0
#> 17.0174 -1.8197 -1.6453 -0.4763
#>
#> Degrees of Freedom: 241 Total (i.e. Null); 226 Residual
#> Null Deviance: 333.5
#> Residual Deviance: 155.1 AIC: 187.1
#>
#> Call:
#> glm(formula = class ~ sex + cp + exang + oldpeak + slope + ca +
#> thal, family = "binomial", data = cl_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -18.7917 1125.7802 -0.017 0.98668
#> sex1 1.0037 0.5457 1.839 0.06588 .
#> cp2 1.4235 0.8938 1.593 0.11124
#> cp3 0.3478 0.7888 0.441 0.65931
#> cp4 2.6281 0.7863 3.343 0.00083 ***
#> exang1 0.9066 0.4888 1.855 0.06364 .
#> oldpeak 0.6464 0.2540 2.545 0.01091 *
#> slope2 1.4996 0.5019 2.988 0.00281 **
#> slope3 0.3402 0.9532 0.357 0.72115
#> ca0.0 15.2757 1125.7739 0.014 0.98917
#> ca1.0 16.9758 1125.7740 0.015 0.98797
#> ca2.0 18.2221 1125.7742 0.016 0.98709
#> ca3.0 17.0174 1125.7743 0.015 0.98794
#> thal3.0 -1.8197 3.6157 -0.503 0.61476
#> thal6.0 -1.6453 3.6983 -0.445 0.65641
#> thal7.0 -0.4763 3.6224 -0.131 0.89539
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 333.48 on 241 degrees of freedom
#> Residual deviance: 155.13 on 226 degrees of freedom
#> AIC: 187.13
#>
#> Number of Fisher Scoring iterations: 15
#> [1] 147.5159
#> [1] 155.1281
#> [1] 193.5159
#> [1] 187.1281
Based on deviance value, model_all is better than
model_backward, but based on AIC value,
model_backward is better than model_all. We
can conclude that both are good models because the difference in values
between the two models is not significant. Then we will continue using
model_backward.
Logistic Regression Assumptions
1. Linearity of Predictor & Log of Odds
For numeric variables, a 1-value increase will increase the log of odds.
Then we will check on numeric variables (oldpeak)
#> oldpeak
#> 1.908686
An increase of 1 oldpeak value will increase the possibility of
someone having heart disease by 1.9 times. This is by understanding the
oldpeak value, namely ST depression induced by exercise relative to
rest. Based on sources regarding how to read an EKG from Siloamhospital
and Alomedika,
it can be seen that the higher the ST depression value, the higher the
possibility of a patient having a heart disorder. So this
model_backward fulfills the assumption of linearity of
predictor & log of odds.
2. Independence of Observations
All of measurement on this data set is independent. There is no
redundant observation as there is no duplicate data on this data
set.
3. Multicollinearity
We will use vif() function from carpackage.
The threshold of VIF value is below than 10.
#> GVIF Df GVIF^(1/(2*Df))
#> sex 1.471733 1 1.213150
#> cp 1.731642 3 1.095830
#> exang 1.150309 1 1.072525
#> oldpeak 1.568623 1 1.252447
#> slope 1.733007 2 1.147361
#> ca 1.534643 4 1.054996
#> thal 1.435561 3 1.062112
For glm() models use GVIF^(1/(2*df)) to see
the multicollinearity values. All of model_backward
variables can fulfill the threshold, so there is no multicollinearity in
this model.
Prediction using Logistic Regression Model
p_heartcl <- predict(object = model_all,
newdata = head(cl_test),
type = "response")
ifelse(p_heartcl > 0.5, yes = 1, no = 0)#> 1 2 3 4 5 6
#> 0 0 0 1 0 0
cl_test$pred_risk <- predict(object = model_backward,
newdata = cl_test,
type = "response")
head(cl_test, 10)#> age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
#> 1 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0
#> 2 44 1 2 120 263 0 0 173 0 0.0 1 0.0 7.0
#> 3 58 1 2 120 284 0 2 160 0 1.8 2 0.0 3.0
#> 4 59 1 4 135 234 0 0 161 0 0.5 2 0.0 7.0
#> 5 42 1 4 140 226 0 0 178 0 0.0 1 0.0 3.0
#> 6 61 1 3 150 243 1 0 137 1 1.0 2 0.0 3.0
#> 7 65 0 4 150 225 0 2 114 0 1.0 2 3.0 7.0
#> 8 41 0 2 105 198 0 0 168 0 0.0 1 1.0 3.0
#> 9 44 1 4 112 290 0 2 153 0 0.0 1 1.0 3.0
#> 10 50 1 3 140 233 0 0 163 0 0.6 2 1.0 7.0
#> class pred_risk
#> 1 0 0.04709934
#> 2 0 0.17290664
#> 3 1 0.43895054
#> 4 0 0.81186956
#> 5 0 0.15394398
#> 6 0 0.28256832
#> 7 1 0.92577208
#> 8 0 0.09866771
#> 9 1 0.49902984
#> 10 1 0.72043776
Classify the cl_test data based on
pred_risk and store it in a new column named
pred_label using ifelse.
#> age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
#> 1 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0
#> 2 44 1 2 120 263 0 0 173 0 0.0 1 0.0 7.0
#> 3 58 1 2 120 284 0 2 160 0 1.8 2 0.0 3.0
#> 4 59 1 4 135 234 0 0 161 0 0.5 2 0.0 7.0
#> 5 42 1 4 140 226 0 0 178 0 0.0 1 0.0 3.0
#> 6 61 1 3 150 243 1 0 137 1 1.0 2 0.0 3.0
#> 7 65 0 4 150 225 0 2 114 0 1.0 2 3.0 7.0
#> 8 41 0 2 105 198 0 0 168 0 0.0 1 1.0 3.0
#> 9 44 1 4 112 290 0 2 153 0 0.0 1 1.0 3.0
#> 10 50 1 3 140 233 0 0 163 0 0.6 2 1.0 7.0
#> class pred_risk pred_label
#> 1 0 0.04709934 0
#> 2 0 0.17290664 0
#> 3 1 0.43895054 0
#> 4 0 0.81186956 1
#> 5 0 0.15394398 0
#> 6 0 0.28256832 0
#> 7 1 0.92577208 1
#> 8 0 0.09866771 0
#> 9 1 0.49902984 0
#> 10 1 0.72043776 1
We will see the result of prediction with make aggregate table.
#> class pred_risk pred_label
#> 1 0 0.04709934 0
#> 2 0 0.17290664 0
#> 3 1 0.43895054 0
#> 4 0 0.81186956 1
#> 5 0 0.15394398 0
#> 6 0 0.28256832 0
#> 7 1 0.92577208 1
#> 8 0 0.09866771 0
#> 9 1 0.49902984 0
#> 10 1 0.72043776 1
K-Nearest Neighbour (KNN)
Pre-processing data
Because of KNN is not good for categorical data, then we will make new data frame which only include numeric variable and 3 level and more categorical variable. This categorical variable will be convert into numeric.
#> age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
#> 1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0
#> 2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0
#> 3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0
#> 4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0
#> 5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0
#> 6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0
#> class
#> 1 0
#> 2 1
#> 3 1
#> 4 0
#> 5 0
#> 6 0
cl_knn <- cl_clean %>%
select(-c("sex", "fbs", "exang", )) %>%
mutate_at(.vars = c("cp","restecg", "slope", "ca", "thal"),
.funs = as.numeric)
cl_knn#> age cp trestbps chol restecg thalach oldpeak slope ca thal class
#> 1 63 1 145 233 3 150 2.3 3 2 3 0
#> 2 67 4 160 286 3 108 1.5 2 5 2 1
#> 3 67 4 120 229 3 129 2.6 2 4 4 1
#> 4 37 3 130 250 1 187 3.5 3 2 2 0
#> 5 41 2 130 204 3 172 1.4 1 2 2 0
#> 6 56 2 120 236 1 178 0.8 1 2 2 0
#> 7 62 4 140 268 3 160 3.6 3 4 2 1
#> 8 57 4 120 354 1 163 0.6 1 2 2 0
#> 9 63 4 130 254 3 147 1.4 2 3 4 1
#> 10 53 4 140 203 3 155 3.1 3 2 4 1
#> 11 57 4 140 192 1 148 0.4 2 2 3 0
#> 12 56 2 140 294 3 153 1.3 2 2 2 0
#> 13 56 3 130 256 3 142 0.6 2 3 3 1
#> 14 44 2 120 263 1 173 0.0 1 2 4 0
#> 15 52 3 172 199 1 162 0.5 1 2 4 0
#> 16 57 3 150 168 1 174 1.6 1 2 2 0
#> 17 48 2 110 229 1 168 1.0 3 2 4 1
#> 18 54 4 140 239 1 160 1.2 1 2 2 0
#> 19 48 3 130 275 1 139 0.2 1 2 2 0
#> 20 49 2 130 266 1 171 0.6 1 2 2 0
#> 21 64 1 110 211 3 144 1.8 2 2 2 0
#> 22 58 1 150 283 3 162 1.0 1 2 2 0
#> 23 58 2 120 284 3 160 1.8 2 2 2 1
#> 24 58 3 132 224 3 173 3.2 1 4 4 1
#> 25 60 4 130 206 3 132 2.4 2 4 4 1
#> 26 50 3 120 219 1 158 1.6 2 2 2 0
#> 27 58 3 120 340 1 172 0.0 1 2 2 0
#> 28 66 1 150 226 1 114 2.6 3 2 2 0
#> 29 43 4 150 247 1 171 1.5 1 2 2 0
#> 30 40 4 110 167 3 114 2.0 2 2 4 1
#> 31 69 1 140 239 1 151 1.8 1 4 2 0
#> 32 60 4 117 230 1 160 1.4 1 4 4 1
#> 33 64 3 140 335 1 158 0.0 1 2 2 1
#> 34 59 4 135 234 1 161 0.5 2 2 4 0
#> 35 44 3 130 233 1 179 0.4 1 2 2 0
#> 36 42 4 140 226 1 178 0.0 1 2 2 0
#> 37 43 4 120 177 3 120 2.5 2 2 4 1
#> 38 57 4 150 276 3 112 0.6 2 3 3 1
#> 39 55 4 132 353 1 132 1.2 2 3 4 1
#> 40 61 3 150 243 1 137 1.0 2 2 2 0
#> 41 65 4 150 225 3 114 1.0 2 5 4 1
#> 42 40 1 140 199 1 178 1.4 1 2 4 0
#> 43 71 2 160 302 1 162 0.4 1 4 2 0
#> 44 59 3 150 212 1 157 1.6 1 2 2 0
#> 45 61 4 130 330 3 169 0.0 1 2 2 1
#> 46 58 3 112 230 3 165 2.5 2 3 4 1
#> 47 51 3 110 175 1 123 0.6 1 2 2 0
#> 48 50 4 150 243 3 128 2.6 2 2 4 1
#> 49 65 3 140 417 3 157 0.8 1 3 2 0
#> 50 53 3 130 197 3 152 1.2 3 2 2 0
#> 51 41 2 105 198 1 168 0.0 1 3 2 0
#> 52 65 4 120 177 1 140 0.4 1 2 4 0
#> 53 44 4 112 290 3 153 0.0 1 3 2 1
#> 54 44 2 130 219 3 188 0.0 1 2 2 0
#> 55 60 4 130 253 1 144 1.4 1 3 4 1
#> 56 54 4 124 266 3 109 2.2 2 3 4 1
#> 57 50 3 140 233 1 163 0.6 2 3 4 1
#> 58 41 4 110 172 3 158 0.0 1 2 4 1
#> 59 54 3 125 273 3 152 0.5 3 3 2 0
#> 60 51 1 125 213 3 125 1.4 1 3 2 0
#> 61 51 4 130 305 1 142 1.2 2 2 4 1
#> 62 46 3 142 177 3 160 1.4 3 2 2 0
#> 63 58 4 128 216 3 131 2.2 2 5 4 1
#> 64 54 3 135 304 1 170 0.0 1 2 2 0
#> 65 54 4 120 188 1 113 1.4 2 3 4 1
#> 66 60 4 145 282 3 142 2.8 2 4 4 1
#> 67 60 3 140 185 3 155 3.0 2 2 2 1
#> 68 54 3 150 232 3 165 1.6 1 2 4 0
#> 69 59 4 170 326 3 140 3.4 3 2 4 1
#> 70 46 3 150 231 1 147 3.6 2 2 2 1
#> 71 65 3 155 269 1 148 0.8 1 2 2 0
#> 72 67 4 125 254 1 163 0.2 2 4 4 1
#> 73 62 4 120 267 1 99 1.8 2 4 4 1
#> 74 65 4 110 248 3 158 0.6 1 4 3 1
#> 75 44 4 110 197 3 177 0.0 1 3 2 1
#> 76 65 3 160 360 3 151 0.8 1 2 2 0
#> 77 60 4 125 258 3 141 2.8 2 3 4 1
#> 78 51 3 140 308 3 142 1.5 1 3 2 0
#> 79 48 2 130 245 3 180 0.2 2 2 2 0
#> 80 58 4 150 270 3 111 0.8 1 2 4 1
#> 81 45 4 104 208 3 148 3.0 2 2 2 0
#> 82 53 4 130 264 3 143 0.4 2 2 2 0
#> 83 39 3 140 321 3 182 0.0 1 2 2 0
#> 84 68 3 180 274 3 150 1.6 2 2 4 1
#> 85 52 2 120 325 1 172 0.2 1 2 2 0
#> 86 44 3 140 235 3 180 0.0 1 2 2 0
#> 87 47 3 138 257 3 156 0.0 1 2 2 0
#> 88 53 3 128 216 3 115 0.0 1 2 1 0
#> 89 53 4 138 234 3 160 0.0 1 2 2 0
#> 90 51 3 130 256 3 149 0.5 1 2 2 0
#> 91 66 4 120 302 3 151 0.4 2 2 2 0
#> 92 62 4 160 164 3 145 6.2 3 5 4 1
#> 93 62 3 130 231 1 146 1.8 2 5 4 0
#> 94 44 3 108 141 1 175 0.6 2 2 2 0
#> 95 63 3 135 252 3 172 0.0 1 2 2 0
#> 96 52 4 128 255 1 161 0.0 1 3 4 1
#> 97 59 4 110 239 3 142 1.2 2 3 4 1
#> 98 60 4 150 258 3 157 2.6 2 4 4 1
#> 99 52 2 134 201 1 158 0.8 1 3 2 0
#> 100 48 4 122 222 3 186 0.0 1 2 2 0
#> 101 45 4 115 260 3 185 0.0 1 2 2 0
#> 102 34 1 118 182 3 174 0.0 1 2 2 0
#> 103 57 4 128 303 3 159 0.0 1 3 2 0
#> 104 71 3 110 265 3 130 0.0 1 3 2 0
#> 105 49 3 120 188 1 139 2.0 2 5 4 1
#> 106 54 2 108 309 1 156 0.0 1 2 4 0
#> 107 59 4 140 177 1 162 0.0 1 3 4 1
#> 108 57 3 128 229 3 150 0.4 2 3 4 1
#> 109 61 4 120 260 1 140 3.6 2 3 4 1
#> 110 39 4 118 219 1 140 1.2 2 2 4 1
#> 111 61 4 145 307 3 146 1.0 2 2 4 1
#> 112 56 4 125 249 3 144 1.2 2 3 2 1
#> 113 52 1 118 186 3 190 0.0 2 2 3 0
#> 114 43 4 132 341 3 136 3.0 2 2 4 1
#> 115 62 3 130 263 1 97 1.2 2 3 4 1
#> 116 41 2 135 203 1 132 0.0 2 2 3 0
#> 117 58 3 140 211 3 165 0.0 1 2 2 0
#> 118 35 4 138 183 1 182 1.4 1 2 2 0
#> 119 63 4 130 330 3 132 1.8 1 5 4 1
#> 120 65 4 135 254 3 127 2.8 2 3 4 1
#> 121 48 4 130 256 3 150 0.0 1 4 4 1
#> 122 63 4 150 407 3 154 4.0 2 5 4 1
#> 123 51 3 100 222 1 143 1.2 2 2 2 0
#> 124 55 4 140 217 1 111 5.6 3 2 4 1
#> 125 65 1 138 282 3 174 1.4 2 3 2 1
#> 126 45 2 130 234 3 175 0.6 2 2 2 0
#> 127 56 4 200 288 3 133 4.0 3 4 4 1
#> 128 54 4 110 239 1 126 2.8 2 3 4 1
#> 129 44 2 120 220 1 170 0.0 1 2 2 0
#> 130 62 4 124 209 1 163 0.0 1 2 2 0
#> 131 54 3 120 258 3 147 0.4 2 2 4 0
#> 132 51 3 94 227 1 154 0.0 1 3 4 0
#> 133 29 2 130 204 3 202 0.0 1 2 2 0
#> 134 51 4 140 261 3 186 0.0 1 2 2 0
#> 135 43 3 122 213 1 165 0.2 2 2 2 0
#> 136 55 2 135 250 3 161 1.4 2 2 2 0
#> 137 70 4 145 174 1 125 2.6 3 2 4 1
#> 138 62 2 120 281 3 103 1.4 2 3 4 1
#> 139 35 4 120 198 1 130 1.6 2 2 4 1
#> 140 51 3 125 245 3 166 2.4 2 2 2 0
#> 141 59 2 140 221 1 164 0.0 1 2 2 0
#> 142 59 1 170 288 3 159 0.2 2 2 4 1
#> 143 52 2 128 205 1 184 0.0 1 2 2 0
#> 144 64 3 125 309 1 131 1.8 2 2 4 1
#> 145 58 3 105 240 3 154 0.6 2 2 4 0
#> 146 47 3 108 243 1 152 0.0 1 2 2 1
#> 147 57 4 165 289 3 124 1.0 2 5 4 1
#> 148 41 3 112 250 1 179 0.0 1 2 2 0
#> 149 45 2 128 308 3 170 0.0 1 2 2 0
#> 150 60 3 102 318 1 160 0.0 1 3 2 0
#> 151 52 1 152 298 1 178 1.2 2 2 4 0
#> 152 42 4 102 265 3 122 0.6 2 2 2 0
#> 153 67 3 115 564 3 160 1.6 2 2 4 0
#> 154 55 4 160 289 3 145 0.8 2 3 4 1
#> 155 64 4 120 246 3 96 2.2 3 3 2 1
#> 156 70 4 130 322 3 109 2.4 2 5 2 1
#> 157 51 4 140 299 1 173 1.6 1 2 4 1
#> 158 58 4 125 300 3 171 0.0 1 4 4 1
#> 159 60 4 140 293 3 170 1.2 2 4 4 1
#> 160 68 3 118 277 1 151 1.0 1 3 4 0
#> 161 46 2 101 197 1 156 0.0 1 2 4 0
#> 162 77 4 125 304 3 162 0.0 1 5 2 1
#> 163 54 3 110 214 1 158 1.6 2 2 2 0
#> 164 58 4 100 248 3 122 1.0 2 2 2 0
#> 165 48 3 124 255 1 175 0.0 1 4 2 0
#> 166 57 4 132 207 1 168 0.0 1 2 4 0
#> 167 52 3 138 223 1 169 0.0 1 1 2 0
#> 168 54 2 132 288 3 159 0.0 1 3 2 0
#> 169 35 4 126 282 3 156 0.0 1 2 4 1
#> 170 45 2 112 160 1 138 0.0 2 2 2 0
#> 171 70 3 160 269 1 112 2.9 2 3 4 1
#> 172 53 4 142 226 3 111 0.0 1 2 4 0
#> 173 59 4 174 249 1 143 0.0 2 2 2 1
#> 174 62 4 140 394 3 157 1.2 2 2 2 0
#> 175 64 4 145 212 3 132 2.0 2 4 3 1
#> 176 57 4 152 274 1 88 1.2 2 3 4 1
#> 177 52 4 108 233 1 147 0.1 1 5 4 0
#> 178 56 4 132 184 3 105 2.1 2 3 3 1
#> 179 43 3 130 315 1 162 1.9 1 3 2 0
#> 180 53 3 130 246 3 173 0.0 1 5 2 0
#> 181 48 4 124 274 3 166 0.5 2 2 4 1
#> 182 56 4 134 409 3 150 1.9 2 4 4 1
#> 183 42 1 148 244 3 178 0.8 1 4 2 0
#> 184 59 1 178 270 3 145 4.2 3 2 4 0
#> 185 60 4 158 305 3 161 0.0 1 2 2 1
#> 186 63 2 140 195 1 179 0.0 1 4 2 0
#> 187 42 3 120 240 1 194 0.8 3 2 4 0
#> 188 66 2 160 246 1 120 0.0 2 5 3 1
#> 189 54 2 192 283 3 195 0.0 1 3 4 1
#> 190 69 3 140 254 3 146 2.0 2 5 4 1
#> 191 50 3 129 196 1 163 0.0 1 2 2 0
#> 192 51 4 140 298 1 122 4.2 2 5 4 1
#> 193 43 4 132 247 3 143 0.1 2 1 4 1
#> 194 62 4 138 294 1 106 1.9 2 5 2 1
#> 195 68 3 120 211 3 115 1.5 2 2 2 0
#> 196 67 4 100 299 3 125 0.9 2 4 2 1
#> 197 69 1 160 234 3 131 0.1 2 3 2 0
#> 198 45 4 138 236 3 152 0.2 2 2 2 0
#> 199 50 2 120 244 1 162 1.1 1 2 2 0
#> 200 59 1 160 273 3 125 0.0 1 2 2 1
#> 201 50 4 110 254 3 159 0.0 1 2 2 0
#> 202 64 4 180 325 1 154 0.0 1 2 2 0
#> 203 57 3 150 126 1 173 0.2 1 3 4 0
#> 204 64 3 140 313 1 133 0.2 1 2 4 0
#> 205 43 4 110 211 1 161 0.0 1 2 4 0
#> 206 45 4 142 309 3 147 0.0 2 5 4 1
#> 207 58 4 128 259 3 130 3.0 2 4 4 1
#> 208 50 4 144 200 3 126 0.9 2 2 4 1
#> 209 55 2 130 262 1 155 0.0 1 2 2 0
#> 210 62 4 150 244 1 154 1.4 2 2 2 1
#> 211 37 3 120 215 1 170 0.0 1 2 2 0
#> 212 38 1 120 231 1 182 3.8 2 2 4 1
#> 213 41 3 130 214 3 168 2.0 2 2 2 0
#> 214 66 4 178 228 1 165 1.0 2 4 4 1
#> 215 52 4 112 230 1 160 0.0 1 3 2 1
#> 216 56 1 120 193 3 162 1.9 2 2 4 0
#> 217 46 2 105 204 1 172 0.0 1 2 2 0
#> 218 46 4 138 243 3 152 0.0 2 2 2 0
#> 219 64 4 130 303 1 122 2.0 2 4 2 0
#> 220 59 4 138 271 3 182 0.0 1 2 2 0
#> 221 41 3 112 268 3 172 0.0 1 2 2 0
#> 222 54 3 108 267 3 167 0.0 1 2 2 0
#> 223 39 3 94 199 1 179 0.0 1 2 2 0
#> 224 53 4 123 282 1 95 2.0 2 4 4 1
#> 225 63 4 108 269 1 169 1.8 2 4 2 1
#> 226 34 2 118 210 1 192 0.7 1 2 2 0
#> 227 47 4 112 204 1 143 0.1 1 2 2 0
#> 228 67 3 152 277 1 172 0.0 1 3 2 0
#> 229 54 4 110 206 3 108 0.0 2 3 2 1
#> 230 66 4 112 212 3 132 0.1 1 3 2 1
#> 231 52 3 136 196 3 169 0.1 2 2 2 0
#> 232 55 4 180 327 2 117 3.4 2 2 2 1
#> 233 49 3 118 149 3 126 0.8 1 5 2 1
#> 234 74 2 120 269 3 121 0.2 1 3 2 0
#> 235 54 3 160 201 1 163 0.0 1 3 2 0
#> 236 54 4 122 286 3 116 3.2 2 4 2 1
#> 237 56 4 130 283 3 103 1.6 3 2 4 1
#> 238 46 4 120 249 3 144 0.8 1 2 4 1
#> 239 49 2 134 271 1 162 0.0 2 2 2 0
#> 240 42 2 120 295 1 162 0.0 1 2 2 0
#> 241 41 2 110 235 1 153 0.0 1 2 2 0
#> 242 41 2 126 306 1 163 0.0 1 2 2 0
#> 243 49 4 130 269 1 163 0.0 1 2 2 0
#> 244 61 1 134 234 1 145 2.6 2 4 2 1
#> 245 60 3 120 178 1 96 0.0 1 2 2 0
#> 246 67 4 120 237 1 71 1.0 2 2 2 1
#> 247 58 4 100 234 1 156 0.1 1 3 4 1
#> 248 47 4 110 275 3 118 1.0 2 3 2 1
#> 249 52 4 125 212 1 168 1.0 1 4 4 1
#> 250 62 2 128 208 3 140 0.0 1 2 2 0
#> 251 57 4 110 201 1 126 1.5 2 2 3 0
#> 252 58 4 146 218 1 105 2.0 2 3 4 1
#> 253 64 4 128 263 1 105 0.2 2 3 4 0
#> 254 51 3 120 295 3 157 0.6 1 2 2 0
#> 255 43 4 115 303 1 181 1.2 2 2 2 0
#> 256 42 3 120 209 1 173 0.0 2 2 2 0
#> 257 67 4 106 223 1 142 0.3 1 4 2 0
#> 258 76 3 140 197 2 116 1.1 2 2 2 0
#> 259 70 2 156 245 3 143 0.0 1 2 2 0
#> 260 57 2 124 261 1 141 0.3 1 2 4 1
#> 261 44 3 118 242 1 149 0.3 2 3 2 0
#> 262 58 2 136 319 3 152 0.0 1 4 2 1
#> 263 60 1 150 240 1 171 0.9 1 2 2 0
#> 264 44 3 120 226 1 169 0.0 1 2 2 0
#> 265 61 4 138 166 3 125 3.6 2 3 2 1
#> 266 42 4 136 315 1 125 1.8 2 2 3 1
#> 267 52 4 128 204 1 156 1.0 2 2 1 1
#> 268 59 3 126 218 1 134 2.2 2 3 3 1
#> 269 40 4 152 223 1 181 0.0 1 2 4 1
#> 270 42 3 130 180 1 150 0.0 1 2 2 0
#> 271 61 4 140 207 3 138 1.9 1 3 4 1
#> 272 66 4 160 228 3 138 2.3 1 2 3 0
#> 273 46 4 140 311 1 120 1.8 2 4 4 1
#> 274 71 4 112 149 1 125 1.6 2 2 2 0
#> 275 59 1 134 204 1 162 0.8 1 4 2 1
#> 276 64 1 170 227 3 155 0.6 2 2 4 0
#> 277 66 3 146 278 3 152 0.0 2 3 2 0
#> 278 39 3 138 220 1 152 0.0 2 2 2 0
#> 279 57 2 154 232 3 164 0.0 1 3 2 1
#> 280 58 4 130 197 1 131 0.6 2 2 2 0
#> 281 57 4 110 335 1 143 3.0 2 3 4 1
#> 282 47 3 130 253 1 179 0.0 1 2 2 0
#> 283 55 4 128 205 2 130 2.0 2 3 4 1
#> 284 35 2 122 192 1 174 0.0 1 2 2 0
#> 285 61 4 148 203 1 161 0.0 1 3 4 1
#> 286 58 4 114 318 2 140 4.4 3 5 3 1
#> 287 58 4 170 225 3 146 2.8 2 4 3 1
#> 288 58 2 125 220 1 144 0.4 2 1 4 0
#> 289 56 2 130 221 3 163 0.0 1 2 4 0
#> 290 56 2 120 240 1 169 0.0 3 2 2 0
#> 291 67 3 152 212 3 150 0.8 2 2 4 1
#> 292 55 2 132 342 1 166 1.2 1 2 2 0
#> 293 44 4 120 169 1 144 2.8 3 2 3 1
#> 294 63 4 140 187 3 144 4.0 1 4 4 1
#> 295 63 4 124 197 1 136 0.0 2 2 2 1
#> 296 41 2 120 157 1 182 0.0 1 2 2 0
#> 297 59 4 164 176 3 90 1.0 2 4 3 1
#> 298 57 4 140 241 1 123 0.2 2 2 4 1
#> 299 45 1 110 264 1 132 1.2 2 2 4 1
#> 300 68 4 144 193 1 141 3.4 2 4 4 1
#> 301 57 4 130 131 1 115 1.2 2 3 4 1
#> 302 57 2 130 236 3 174 0.0 2 3 2 1
#> 303 38 3 138 175 1 173 0.0 1 1 2 0
Cross Validation
RNGkind(sample.kind = "Rounding")
set.seed(150)
splitter <- initial_split(data = cl_knn, prop = 0.8)
train_knn <- training(splitter)
test_knn <- testing(splitter)Then we will check the balance of train_knn data.
#>
#> 0 1
#> 0.5454545 0.4545455
Our data has a good balance, it’s not be biased. Then, we can continue making the prediction using KNN method.
Prediction using KNN method
Split the train data and test data become predictor and target
data.
predictor as x data
target as y data
cl_train_x <- train_knn %>% select(-class)
cl_test_x <- test_knn %>% select(-class)
cl_train_y <- train_knn$class
cl_test_y <- test_knn$classThen we need to scaling the data for ensure that all variables have
the same scale. Scaling process use scale() function.
cl_train_xs <- scale(cl_train_x)
cl_test_xs <- scale(cl_test_x,
center = attr(cl_train_xs, "scaled:center"),
scale = attr(cl_train_xs, "scaled:scale"))Find the optimum k value
#> [1] 15.55635
Because our class is 2 class (even), then k value must be an odd. We
will use k = 15
KNN prediction will use knn() function from
class library.
#> [1] 0 0 0 1 0 0
#> Levels: 0 1
Save the prediction results into the test data. Because KNN makes predictions directly on the test data
#> age cp trestbps chol restecg thalach oldpeak slope ca thal class pred_label
#> 1 41 2 130 204 3 172 1.4 1 2 2 0 0
#> 2 44 2 120 263 1 173 0.0 1 2 4 0 0
#> 3 58 2 120 284 3 160 1.8 2 2 2 1 0
#> 4 59 4 135 234 1 161 0.5 2 2 4 0 1
#> 5 42 4 140 226 1 178 0.0 1 2 2 0 0
#> 6 61 3 150 243 1 137 1.0 2 2 2 0 0
#> 7 65 4 150 225 3 114 1.0 2 5 4 1 1
#> 8 41 2 105 198 1 168 0.0 1 3 2 0 0
#> 9 44 4 112 290 3 153 0.0 1 3 2 1 0
#> 10 50 3 140 233 1 163 0.6 2 3 4 1 1
#> 11 51 1 125 213 3 125 1.4 1 3 2 0 0
#> 12 46 3 142 177 3 160 1.4 3 2 2 0 0
#> 13 58 4 128 216 3 131 2.2 2 5 4 1 1
#> 14 58 4 150 270 3 111 0.8 1 2 4 1 1
#> 15 45 4 104 208 3 148 3.0 2 2 2 0 1
#> 16 53 4 130 264 3 143 0.4 2 2 2 0 0
#> 17 39 3 140 321 3 182 0.0 1 2 2 0 0
#> 18 68 3 180 274 3 150 1.6 2 2 4 1 1
#> 19 52 2 120 325 1 172 0.2 1 2 2 0 0
#> 20 51 3 130 256 3 149 0.5 1 2 2 0 0
#> 21 62 4 160 164 3 145 6.2 3 5 4 1 1
#> 22 52 4 128 255 1 161 0.0 1 3 4 1 1
#> 23 49 3 120 188 1 139 2.0 2 5 4 1 1
#> 24 39 4 118 219 1 140 1.2 2 2 4 1 1
#> 25 48 4 130 256 3 150 0.0 1 4 4 1 1
#> 26 65 1 138 282 3 174 1.4 2 3 2 1 0
#> 27 45 2 130 234 3 175 0.6 2 2 2 0 0
#> 28 56 4 200 288 3 133 4.0 3 4 4 1 1
#> 29 29 2 130 204 3 202 0.0 1 2 2 0 0
#> 30 51 4 140 261 3 186 0.0 1 2 2 0 0
#> 31 59 1 170 288 3 159 0.2 2 2 4 1 0
#> 32 45 2 128 308 3 170 0.0 1 2 2 0 0
#> 33 42 4 102 265 3 122 0.6 2 2 2 0 0
#> 34 67 3 115 564 3 160 1.6 2 2 4 0 1
#> 35 55 4 160 289 3 145 0.8 2 3 4 1 1
#> 36 70 4 130 322 3 109 2.4 2 5 2 1 1
#> 37 58 4 125 300 3 171 0.0 1 4 4 1 1
#> 38 51 4 140 298 1 122 4.2 2 5 4 1 1
#> 39 43 4 132 247 3 143 0.1 2 1 4 1 1
#> 40 68 3 120 211 3 115 1.5 2 2 2 0 0
#> 41 67 4 100 299 3 125 0.9 2 4 2 1 1
#> 42 45 4 138 236 3 152 0.2 2 2 2 0 0
#> 43 50 2 120 244 1 162 1.1 1 2 2 0 0
#> 44 58 4 128 259 3 130 3.0 2 4 4 1 1
#> 45 55 2 130 262 1 155 0.0 1 2 2 0 0
#> 46 56 1 120 193 3 162 1.9 2 2 4 0 0
#> 47 64 4 130 303 1 122 2.0 2 4 2 0 1
#> 48 41 3 112 268 3 172 0.0 1 2 2 0 0
#> 49 53 4 123 282 1 95 2.0 2 4 4 1 1
#> 50 66 4 112 212 3 132 0.1 1 3 2 1 0
#> 51 60 3 120 178 1 96 0.0 1 2 2 0 0
#> 52 58 4 100 234 1 156 0.1 1 3 4 1 0
#> 53 52 4 125 212 1 168 1.0 1 4 4 1 0
#> 54 62 2 128 208 3 140 0.0 1 2 2 0 0
#> 55 57 4 110 201 1 126 1.5 2 2 3 0 0
#> 56 60 1 150 240 1 171 0.9 1 2 2 0 0
#> 57 61 4 138 166 3 125 3.6 2 3 2 1 1
#> 58 71 4 112 149 1 125 1.6 2 2 2 0 0
#> 59 57 2 154 232 3 164 0.0 1 3 2 1 0
#> 60 57 4 110 335 1 143 3.0 2 3 4 1 1
#> 61 56 2 130 221 3 163 0.0 1 2 4 0 0
Evaluation Logistic Regression Model and KNN Model
We will evaluate the model using confusion matrix from
caret library
## The Logistic Regression Model
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 28 6
#> 1 4 23
#>
#> Accuracy : 0.8361
#> 95% CI : (0.7191, 0.9185)
#> No Information Rate : 0.5246
#> P-Value [Acc > NIR] : 0.0000003442
#>
#> Kappa : 0.6703
#>
#> Mcnemar's Test P-Value : 0.7518
#>
#> Sensitivity : 0.7931
#> Specificity : 0.8750
#> Pos Pred Value : 0.8519
#> Neg Pred Value : 0.8235
#> Prevalence : 0.4754
#> Detection Rate : 0.3770
#> Detection Prevalence : 0.4426
#> Balanced Accuracy : 0.8341
#>
#> 'Positive' Class : 1
#>
The outcome demonstrates that, on the test dataset, our logistic model has an accuracy of 83.61%, which indicates that 83.61% of our data is correctly classified. The corresponding values for specificity and sensitivity are 79.31% and 87.50%, respectively. While a tiny percentage of adverse outcomes are accurately categorized, the most favorable outcomes are. According to the precision/positive anticipated value of 85.19%, 85.19% of our optimistic prediction is correct.
KNN
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 28 8
#> 1 4 21
#>
#> Accuracy : 0.8033
#> 95% CI : (0.6816, 0.894)
#> No Information Rate : 0.5246
#> P-Value [Acc > NIR] : 0.000005828
#>
#> Kappa : 0.603
#>
#> Mcnemar's Test P-Value : 0.3865
#>
#> Sensitivity : 0.7241
#> Specificity : 0.8750
#> Pos Pred Value : 0.8400
#> Neg Pred Value : 0.7778
#> Prevalence : 0.4754
#> Detection Rate : 0.3443
#> Detection Prevalence : 0.4098
#> Balanced Accuracy : 0.7996
#>
#> 'Positive' Class : 1
#>
The outcome demonstrates that, on the test dataset, our K-NN with K = 15 has an accuracy of 83.33%, which indicates that 83.33% of our data is correctly classified. The corresponding values for specificity and sensitivity are 87.50% and 72.41%, respectively. While a tiny percentage of adverse outcomes are accurately categorized, the most favorable outcomes are. According to the precision/positive anticipated value of 84%, 84% of our optimistic prediction is correct.
Conclusion
Because of the patient’s psychology, we must minimize the false negative value. Based on those two methods, logistic regression and K-NN, the model’s ability to correctly predict actual data of people who are not healthy is better using logistic regression because its precision value is 85.19% greater than using the KNN method.
References
-https://archive.ics.uci.edu/dataset/45/heart+disease
-https://search.r-project.org/CRAN/refmans/kmed/html/heart.html
-https://www.mayoclinic.org/diseases-conditions/heart-disease/symptoms-causes/syc-20353118#:~:text=Heart%20disease%20describes%20a%20range,born%20with%20(congenital%20heart%20defects)
-https://www.southcoasthealth.com/posts/view/402-what-causes-heart-disease
-https://www.siloamhospitals.com/informasi-siloam/artikel/cara-membaca-ekg
-https://www.alomedika.com/diagnosis-banding-elevasi-segmen-st-pada-elektrokardiografi