Heart Disease
Cardiovascular disease (CVD), including heart disease, remains the leading cause of death globally, accounting for approximately 31% of all deaths worldwide in 2019. Early detection and intervention are crucial for reducing the burden of disease and improving patient outcomes. Machine learning (ML) can help clinicians identify patients at risk of heart disease through the analysis of clinical and laboratory data. Several studies have developed ML algorithms for predicting the risk of heart disease and achieved high levels of accuracy. ML models have the potential to improve patient outcomes by enabling earlier detection and intervention, but further research is needed to validate these models and ensure their effectiveness in real-world clinical settings.
In this project, we continue Heart Disease - Classification (Machine Learning vol 1.0) by adding two machine learning models: Decision Tree and Random Forest.
The heart disease data set was created in 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It has 76 attributes, one of which is the predicted attribute, but all published experiments use only 14 of them. The target field indicates the presence of heart disease in the patient: an integer value of 0 for no disease and 1 for disease.
Attribute Information:
- age : age of the patient
- sex : 1 = male; 0 = female
- cp (chest pain type) : 4 values; 0 = typical angina; 1 = atypical angina; 2 = non-anginal pain; 3 = asymptomatic
- trestbps : resting blood pressure (in mm Hg on admission to the hospital)
- chol : serum cholesterol in mg/dl
- fbs : fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg : resting electrocardiographic results (values 0, 1, 2)
- thalach : maximum heart rate achieved
- exang : exercise induced angina (1 = yes; 0 = no)
- oldpeak : ST depression induced by exercise relative to rest
- slope : the slope of the peak exercise ST segment (0-2)
- ca : number of major vessels (0-4) colored by fluoroscopy
- thal : 0 = normal; 1 = fixed defect; 2 = reversible defect; 3 = reversible defect after thallium stress testing
- target : 1 = disease; 0 = no disease
library(dplyr) # function of data wrangling 1
library(tidyr) # function of data wrangling 2
library(class) # package for `knn()`
library(caret) # confusion matrix
library(car) # multicollinearity
library(partykit) #function decision tree
library(randomForest) #function random forest
library(ROCR) #function ROC AUC
library(GGally) # ggcorr
Read the dataset heart.csv:
df = read.csv("LBB DEAN/heart.csv")
glimpse(df)#> Rows: 1,025
#> Columns: 14
#> $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex <int> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg <int> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope <int> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca <int> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal <int> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
df has 14 columns and 1,025 records of patient heart disease check-ups. Several columns need to be converted to factors: sex, cp, fbs, restecg, exang, slope, ca, thal, and the target.
df <- df %>%
mutate_at(vars(sex,cp,fbs,restecg,exang,slope,ca,thal,target), as.factor)
glimpse(df)#> Rows: 1,025
#> Columns: 14
#> $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
Check for missing values in the data set:
colSums(is.na(df))#> age sex cp trestbps chol fbs restecg thalach
#> 0 0 0 0 0 0 0 0
#> exang oldpeak slope ca thal target
#> 0 0 0 0 0 0
There are no missing values in the data set.
Do EDA to understand the distribution and characteristics of the data set, and make sure the target classes are balanced so that the machine learning models work well.
summary(df)#> age sex cp trestbps chol fbs restecg
#> Min. :29.00 0:312 0:497 Min. : 94.0 Min. :126 0:872 0:497
#> 1st Qu.:48.00 1:713 1:167 1st Qu.:120.0 1st Qu.:211 1:153 1:513
#> Median :56.00 2:284 Median :130.0 Median :240 2: 15
#> Mean :54.43 3: 77 Mean :131.6 Mean :246
#> 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:275
#> Max. :77.00 Max. :200.0 Max. :564
#> thalach exang oldpeak slope ca thal target
#> Min. : 71.0 0:680 Min. :0.000 0: 74 0:578 0: 7 0:499
#> 1st Qu.:132.0 1:345 1st Qu.:0.000 1:482 1:226 1: 64 1:526
#> Median :152.0 Median :0.800 2:469 2:134 2:544
#> Mean :149.1 Mean :1.072 3: 69 3:410
#> 3rd Qu.:166.0 3rd Qu.:1.800 4: 18
#> Max. :202.0 Max. :6.200
insight :
round(prop.table(table(df$target)),1)#>
#> 0 1
#> 0.5 0.5
insight : the proportion of the target classes is already balanced at roughly 50:50.
ggcorr(df, label=TRUE)
insight =
The process of dividing the dataset into data used to train and to test the model is known as cross validation. We will randomly select a sample of the heart disease data to be included in df_train, with the remainder included in df_test.
RNGkind(sample.kind = "Rounding")
set.seed(417)
# index sampling
index <- sample(x = nrow(df), size = nrow(df)*0.8)
# sample(x = nrow(df), size = nrow(df)*0.2)
# splitting
df_train <- df[index,]
df_test <- df[-index,]
We are going to build several models and test their performance.
model_none <- glm(formula = target~1,
data = df_train,
family = "binomial")
summary(model_none)#>
#> Call:
#> glm(formula = target ~ 1, family = "binomial", data = df_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.209 -1.209 1.146 1.146 1.146
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.07320 0.06989 1.047 0.295
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1135.7 on 819 degrees of freedom
#> Residual deviance: 1135.7 on 819 degrees of freedom
#> AIC: 1137.7
#>
#> Number of Fisher Scoring iterations: 3
exp(0.07320)#> [1] 1.075946
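As a quick sanity check, the intercept of this null model can be converted to the implied baseline odds and probability; a minimal sketch in base R:
# convert the intercept (log-odds) of model_none into odds and probability
b0 <- 0.07320
exp(b0)                 # baseline odds of target = 1, about 1.08
exp(b0) / (1 + exp(b0)) # equivalent baseline probability, about 0.52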
📈 Model interpretation : the intercept-only model contains no predictors, so exp(0.0732) ≈ 1.08 is simply the baseline odds of heart disease in the training data (a probability of about 0.52, i.e. the classes are roughly balanced).
Next, we fit a model using only the categorical predictors: sex, cp, fbs, restecg, exang, slope, ca, thal.
model_categoric <- glm(formula = target~ sex+cp+fbs+restecg+exang+slope+ca+thal ,
data = df_train,
family = "binomial")
summary(model_categoric)#>
#> Call:
#> glm(formula = target ~ sex + cp + fbs + restecg + exang + slope +
#> ca + thal, family = "binomial", data = df_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.92220 -0.34695 0.07266 0.42072 3.14937
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.1045 1.8662 -0.592 0.553971
#> sex1 -1.6801 0.3211 -5.232 0.00000016775552554 ***
#> cp1 1.1432 0.3352 3.410 0.000649 ***
#> cp2 2.1655 0.3246 6.672 0.00000000002528101 ***
#> cp3 2.1430 0.4207 5.094 0.00000035110360891 ***
#> fbs1 0.3031 0.3401 0.891 0.372822
#> restecg1 0.4937 0.2353 2.098 0.035871 *
#> restecg2 -1.7079 1.1767 -1.451 0.146669
#> exang1 -0.8924 0.2742 -3.255 0.001134 **
#> slope1 -0.2613 0.4386 -0.596 0.551356
#> slope2 1.7783 0.4471 3.977 0.00006964804733135 ***
#> ca1 -2.2969 0.3111 -7.383 0.00000000000015444 ***
#> ca2 -3.6182 0.4659 -7.766 0.00000000000000809 ***
#> ca3 -2.0375 0.5409 -3.767 0.000165 ***
#> ca4 2.2090 1.0406 2.123 0.033771 *
#> thal1 2.2613 1.8522 1.221 0.222128
#> thal2 2.6027 1.8037 1.443 0.149016
#> thal3 0.7894 1.8030 0.438 0.661519
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1135.66 on 819 degrees of freedom
#> Residual deviance: 494.32 on 802 degrees of freedom
#> AIC: 530.32
#>
#> Number of Fisher Scoring iterations: 6
exp(model_categoric$coefficients)#> (Intercept) sex1 cp1 cp2 cp3 fbs1
#> 0.33138730 0.18634683 3.13683135 8.71918722 8.52480997 1.35410577
#> restecg1 restecg2 exang1 slope1 slope2 ca1
#> 1.63828543 0.18125122 0.40966600 0.77004331 5.91973985 0.10056836
#> ca2 ca3 ca4 thal1 thal2 thal3
#> 0.02683137 0.13035619 9.10643407 9.59600631 13.50008312 2.20203009
Variables that lower the odds (odds ratio below 1): sex1, restecg2, exang1, slope1, ca1, ca2, ca3.
📈 Model interpretation (holding the other variables fixed):
- Patients with cp1 have about 3.13 times the odds of heart disease compared to the base level (cp0).
- Patients with cp2 have about 8.71 times the odds of heart disease compared to the base level (cp0).
- Patients with cp3 have about 8.52 times the odds of heart disease compared to the base level (cp0).
- Patients with fbs1 have about 1.35 times the odds of heart disease compared to the base level (fbs0).
- Patients with restecg1 have about 1.63 times the odds of heart disease compared to the base level (restecg0).
- Patients with slope2 have about 5.91 times the odds of heart disease compared to the base level (slope0).
- Patients with ca4 have about 9.1 times the odds of heart disease compared to the base level (ca0).
- Patients with thal1 have about 9.59 times the odds of heart disease compared to the base level (thal0).
- Patients with thal2 have about 13.5 times the odds of heart disease compared to the base level (thal0).
- Patients with thal3 have about 2.2 times the odds of heart disease compared to the base level (thal0).
Next, we fit a model using only the numeric predictors: age, trestbps, chol, thalach, oldpeak.
model_numeric <- glm(formula = target~ age+trestbps+chol+thalach+oldpeak ,
data = df_train,
family = "binomial")
summary(model_numeric)#>
#> Call:
#> glm(formula = target ~ age + trestbps + chol + thalach + oldpeak,
#> family = "binomial", data = df_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.1848 -0.7812 0.4288 0.8541 2.6114
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.025752 1.098845 -1.844 0.0653 .
#> age 0.003310 0.010839 0.305 0.7601
#> trestbps -0.011531 0.005264 -2.190 0.0285 *
#> chol -0.004096 0.001749 -2.342 0.0192 *
#> thalach 0.035287 0.004557 7.744 0.00000000000000964 ***
#> oldpeak -0.834227 0.092076 -9.060 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1135.66 on 819 degrees of freedom
#> Residual deviance: 851.21 on 814 degrees of freedom
#> AIC: 863.21
#>
#> Number of Fisher Scoring iterations: 4
exp(model_numeric$coefficients)#> (Intercept) age trestbps chol thalach oldpeak
#> 0.1318946 1.0033152 0.9885348 0.9959125 1.0359168 0.4342098
Variables that lower the odds (odds ratio below 1): trestbps, chol, oldpeak.
📈 Model interpretation (holding the other variables fixed):
- For each additional year of age, the odds of heart disease increase by a factor of about 1.00 (essentially unchanged).
- For each additional unit of thalach (maximum heart rate achieved), the odds of heart disease increase by a factor of about 1.04.
Next, we fit a model using all predictors:
model_all <- glm(formula = target~. ,
data = df_train,
family = "binomial")
summary(model_all)#>
#> Call:
#> glm(formula = target ~ ., family = "binomial", data = df_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.0412 -0.2616 0.0570 0.4065 3.3227
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.265828 2.309760 -0.115 0.908375
#> age 0.026965 0.016411 1.643 0.100368
#> sex1 -2.078567 0.375782 -5.531 0.000000031783653 ***
#> cp1 0.829104 0.353346 2.346 0.018954 *
#> cp2 2.168814 0.341372 6.353 0.000000000210833 ***
#> cp3 2.466048 0.466150 5.290 0.000000122150326 ***
#> trestbps -0.028018 0.007818 -3.584 0.000339 ***
#> chol -0.006596 0.002753 -2.396 0.016579 *
#> fbs1 0.412043 0.372173 1.107 0.268238
#> restecg1 0.374898 0.255211 1.469 0.141840
#> restecg2 -0.562131 1.926327 -0.292 0.770428
#> thalach 0.025281 0.007816 3.235 0.001218 **
#> exang1 -0.729219 0.292905 -2.490 0.012789 *
#> oldpeak -0.511915 0.155308 -3.296 0.000980 ***
#> slope1 -0.323303 0.534445 -0.605 0.545224
#> slope2 1.168961 0.563860 2.073 0.038159 *
#> ca1 -2.542979 0.342386 -7.427 0.000000000000111 ***
#> ca2 -3.680525 0.521668 -7.055 0.000000000001722 ***
#> ca3 -1.675346 0.600219 -2.791 0.005251 **
#> ca4 1.891545 1.047329 1.806 0.070908 .
#> thal1 2.940066 1.568853 1.874 0.060927 .
#> thal2 2.764906 1.513894 1.826 0.067797 .
#> thal3 1.091833 1.515119 0.721 0.471140
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1135.66 on 819 degrees of freedom
#> Residual deviance: 452.07 on 797 degrees of freedom
#> AIC: 498.07
#>
#> Number of Fisher Scoring iterations: 6
exp(model_all$coefficients)#> (Intercept) age sex1 cp1 cp2 cp3
#> 0.76657125 1.02733167 0.12510939 2.29126495 8.74789990 11.77581908
#> trestbps chol fbs1 restecg1 restecg2 thalach
#> 0.97237088 0.99342604 1.50989958 1.45484288 0.56999318 1.02560367
#> exang1 oldpeak slope1 slope2 ca1 ca2
#> 0.48228567 0.59934695 0.72375441 3.21864567 0.07863177 0.02520973
#> ca3 ca4 thal1 thal2 thal3
#> 0.18724333 6.62960374 18.91708684 15.87755124 2.97973045
Variables that lower the odds (odds ratio below 1): sex1, restecg2, exang1, slope1, ca1, ca2, ca3, trestbps, chol, oldpeak.
📈 Model interpretation (holding the other variables fixed):
- For each additional year of age, the odds of heart disease increase by a factor of about 1.03.
- For each additional unit of thalach, the odds of heart disease increase by a factor of about 1.03.
- Patients with cp1 have about 2.29 times the odds of heart disease compared to the base level (cp0).
- Patients with cp2 have about 8.75 times the odds of heart disease compared to the base level (cp0).
- Patients with cp3 have about 11.78 times the odds of heart disease compared to the base level (cp0).
- Patients with restecg1 have about 1.45 times the odds of heart disease compared to the base level (restecg0).
- Patients with slope2 have about 3.22 times the odds of heart disease compared to the base level (slope0).
- Patients with ca4 have about 6.63 times the odds of heart disease compared to the base level (ca0).
- Patients with thal1 have about 18.92 times the odds of heart disease compared to the base level (thal0).
- Patients with thal2 have about 15.88 times the odds of heart disease compared to the base level (thal0).
- Patients with thal3 have about 2.98 times the odds of heart disease compared to the base level (thal0).
We keep the model with all predictors because of its low AIC and then perform feature selection using backward stepwise regression.
model_step <- step(object = model_all,
direction = "backward", trace = F)
summary(model_step)#>
#> Call:
#> glm(formula = target ~ age + sex + cp + trestbps + chol + thalach +
#> exang + oldpeak + slope + ca + thal, family = "binomial",
#> data = df_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.0096 -0.2815 0.0540 0.4044 3.3720
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.298116 2.469446 0.121 0.903911
#> age 0.024359 0.016267 1.497 0.134274
#> sex1 -2.080038 0.373448 -5.570 0.0000000255013944 ***
#> cp1 0.853616 0.351534 2.428 0.015172 *
#> cp2 2.237598 0.337174 6.636 0.0000000000321570 ***
#> cp3 2.518670 0.462415 5.447 0.0000000512930186 ***
#> trestbps -0.027613 0.007554 -3.655 0.000257 ***
#> chol -0.007103 0.002650 -2.680 0.007367 **
#> thalach 0.025386 0.007880 3.222 0.001274 **
#> exang1 -0.700975 0.291217 -2.407 0.016082 *
#> oldpeak -0.522742 0.150681 -3.469 0.000522 ***
#> slope1 -0.367936 0.521164 -0.706 0.480195
#> slope2 1.150041 0.547990 2.099 0.035848 *
#> ca1 -2.522397 0.333224 -7.570 0.0000000000000374 ***
#> ca2 -3.586970 0.508731 -7.051 0.0000000000017787 ***
#> ca3 -1.537753 0.581697 -2.644 0.008204 **
#> ca4 2.042608 1.037289 1.969 0.048932 *
#> thal1 2.880288 1.851205 1.556 0.119732
#> thal2 2.616895 1.800002 1.454 0.145994
#> thal3 0.973965 1.799240 0.541 0.588287
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1135.66 on 819 degrees of freedom
#> Residual deviance: 455.76 on 800 degrees of freedom
#> AIC: 495.76
#>
#> Number of Fisher Scoring iterations: 6
Compare model_step with model_all:
# Deviance
model_all$deviance#> [1] 452.0742
model_step$deviance#> [1] 455.7572
# AIC
model_all$aic#> [1] 498.0742
model_step$aic#> [1] 495.7572
We use model_step because it has the lower AIC, and model_step shows no signs of perfect separation.
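As an optional extra check (the car package was loaded earlier for multicollinearity), one could inspect variance inflation factors for the chosen model; a minimal sketch:
library(car)
# GVIF values far above 10 would suggest problematic multicollinearity among predictors
vif(model_step)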
k-NN or K-nearest neighbor classifies new data by comparing the characteristics of the new data (test data) with existing data (train data).
Proximity of characteristics is measured with the Euclidean distance until the k closest data points (neighbors) are obtained. The class held by the majority of these neighbors becomes the class of the new data (majority voting).
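For the numeric predictors used here, the Euclidean distance between a new observation \(x\) and a training observation \(x'\) over \(p\) predictors is:
\[d(x, x') = \sqrt{\sum_{j=1}^{p} (x_j - x'_j)^2}\]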
Characteristics of k-NN :
knitr::include_graphics("assets/KNN.png")
# predictor variables in `train`
train_x <- df_train %>% select_if(is.numeric)
# predictor variables in `test`
test_x <- df_test %>% select_if(is.numeric)
# target variable in `train`
train_y <- df_train[,"target"]
# target variable in `test`
test_y <- df_test[,"target"]
Remember that distance measurements in k-NN depend heavily on the scale of the predictor variables used as model input. A predictor whose range of values is very different from the other predictors can cause problems in the classification model. Therefore, let's normalize the data so that every predictor variable is on a standard scale.
To normalize the train_x data, please use the
scale() function. Meanwhile, to normalize the test
data, please use the same function but use the center and
scale attributes obtained from the train_x
data.
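In formula form, each predictor value is standardized with the mean and standard deviation of the training data, and these same training statistics are reused for the test data:
\[z = \frac{x - \bar{x}_{train}}{s_{train}}\]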
# your code here
# scale train_x data
train_xs <- scale(train_x)
# scale test_x data
test_xs <- scale(x = test_x,
center = attr(train_xs,"scaled:center"),
scale = attr(train_xs,"scaled:scale"))# find optimum k
round(sqrt(nrow(test_xs)))#> [1] 14
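A common heuristic is to take the square root of the number of observations and, for a two-class target, round it to an odd number so that majority votes cannot tie; a small sketch of that adjustment (this is why k = 15 is used in knn() later):
# sqrt of the number of test rows is about 14; bump an even k to the next odd number
k_opt <- round(sqrt(nrow(test_xs)))
if (k_opt %% 2 == 0) k_opt <- k_opt + 1
k_opt # 15, the value later passed to knn()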
Decision Tree is a fairly simple tree-based model with robust and powerful performance for prediction. It produces a decision tree visualization that can be interpreted easily.
An additional characteristic: the Decision Tree is not limited to classification cases but can also be used for regression. In this project our focus is on the classification case, because the idea is the same.
Let's understand the structure of the Decision Tree and the terms that are often used, using an example of deciding whether or not we will go out this weekend:
- Root Node: the first branch in determining the target value, commonly referred to as the main predictor.
- Interior Node: the next branch, which uses another predictor when the root node is not enough to determine the target.
- Terminal/Leaf Node: the final decision, i.e. the predicted target value.
dtree_model <- ctree(formula = target ~.,
data = df_train,
control = ctree_control(mincriterion=0.95,
minsplit=50,
minbucket=20))
plot(dtree_model, type = "s")
Random Forest is a type of ensemble method that consists of many Decision Trees. Each Decision Tree has its own characteristics and is not related to the others. Random Forest makes use of the Bagging (Bootstrap and Aggregation) concept in its creation. The process is roughly as follows:
- Bootstrap: draw a random sample with replacement from the training data for each tree.
- Build a Decision Tree on each bootstrap sample, considering only a random subset of predictors at each split.
- Aggregation: combine the predictions of all the trees, using majority voting for classification.
The mtry parameter controls how many predictor candidates are randomly sampled at each split (automatic feature selection).
Pros of Random Forests: automatic feature selection via mtry, reduced overfitting because many trees are aggregated, and a built-in out-of-bag (OOB) error estimate.
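For illustration only, here is a minimal sketch of fitting a random forest directly with the randomForest package; the ntree and mtry values are illustrative guesses, not tuned values, and the model actually used below is trained through caret:
library(randomForest)
set.seed(417)
# ntree: number of bootstrapped trees; mtry: number of predictors sampled at each split
rf_sketch <- randomForest(formula = target ~ ., data = df_train,
                          ntree = 500, mtry = 4)
rf_sketch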
One of the disadvantages of random forest is its computationally intensive and time-consuming nature. This can be mitigated by feature selection to reduce the number of predictors used in the model. If a large number of columns are found, we can remove columns with near-zero variance (less informative) using the nearZeroVar() function from the caret package.
By applying feature selection techniques like nearZeroVar(), we can identify and remove predictors with very low variance, which often indicate little or no variation in the data. These low-variance predictors may not contribute much to the model’s predictive power and can be safely discarded, reducing the computational burden of the random forest algorithm.
By reducing the number of predictors, the random forest algorithm can become more efficient and faster in terms of computation time, without significantly sacrificing its predictive performance. It allows us to focus on the most informative predictors and improve the efficiency of the model training process.
# Data Pre-processing
# Please Run The Code Down Below
dim(df_train)#> [1] 820 14
# feature selection using nearzerovar
zero_var <- nearZeroVar(df_train)
df_rf <- df_train %>%
select(-c(zero_var))
dim(df_rf)#> [1] 820 14
insight : nearZeroVar() finds no near-zero-variance predictors here, so no columns are removed (the dimensions of df_rf are the same as df_train).
Usually we do cross validation by dividing the data only into training and testing sets. K-Fold Cross Validation instead divides the data into \(k\) equal parts, where each part takes a turn as the testing data. In this case we use 10-fold cross validation with repeats = 20. Because repeated cross validation of a random forest is slow, the training code below is commented out and the fitted model is loaded from df_forest.RDS instead.
#set.seed(417)
#ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 20)
# make a model of random forest
#df_rf_model <- train(target~., df_train, method = "rf", trControl = ctrl)
#df_rf_model
# save the model
#saveRDS(df_rf_model, "df_forest.RDS")
# read model
df_forest = readRDS("df_forest.RDS")
df_forest#> Random Forest
#>
#> 820 samples
#> 13 predictor
#> 2 classes: '0', '1'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 20 times)
#> Summary of sample sizes: 738, 738, 737, 739, 737, 737, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.9591782 0.9181581
#> 12 0.9823901 0.9647358
#> 22 0.9840365 0.9680364
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 22.
Insight : the highest cross-validated accuracy (about 98.4%) is obtained with mtry = 22, so that value is used for the final model.
df_forest$finalModel#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 22
#>
#> OOB estimate of error rate: 0.73%
#> Confusion matrix:
#> 0 1 class.error
#> 0 392 3 0.007594937
#> 1 3 422 0.007058824
Insight : the final random forest consists of 500 trees with 22 variables tried at each split, and its out-of-bag (OOB) error estimate is only 0.73%.
# run in console, then see panel `Plot`
library(animation)
ani.options(interval = 1, nmax = 15)
cv.ani(main = "Demonstration of the k-fold Cross Validation", bty = "l")
The goal of logistic regression is to use a linear model of the log-odds to predict a probability, which can then be used for classification.
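The linear predictor on the log-odds scale is mapped to a probability by the logistic (inverse logit) function, which is what predict() with type = "response" returns:
\[p = \frac{1}{1 + e^{-(b_0 + b_1 x)}}\]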
df_test$pred_risk <- predict(object = model_step,
newdata = df_test,
type = "response")
df_test_prob <- predict(object = model_step,
newdata = df_test,
type = "response")
df_test$pred_Label <- ifelse(test = df_test$pred_risk > 0.5, yes = 1, no = 0)
df_train$pred_risk <- predict(object = model_step,
newdata = df_train,
type = "response")
df_train$pred_Label <- ifelse(test = df_train$pred_risk > 0.5, yes = 1, no = 0)
Insight : predict() with type = "response" returns probabilities, which are then converted to class labels using a 0.5 threshold. The underlying model predicts the log-odds of heart disease:
\[log(\frac{p}{1-p}) = b_0 + b_1 x\]
df_predKNN <- knn(train = train_xs,
test = test_xs,
cl = train_y,
k = 15)
head(df_predKNN)#> [1] 0 1 0 0 0 1
#> Levels: 0 1
Insight :
# class predictions on the test data
pred_diab_test <- predict(object = dtree_model ,
newdata = df_test ,
type = "response")
df_rf_model <- readRDS("df_forest.RDS")
varImp(df_rf_model)#> rf variable importance
#>
#> only 20 most important variables shown (out of 22)
#>
#> Overall
#> thal2 100.0000
#> oldpeak 46.2266
#> chol 34.0979
#> age 32.8648
#> thalach 31.1735
#> trestbps 25.3897
#> ca1 11.6868
#> cp3 10.1069
#> cp2 8.8004
#> sex1 8.2533
#> exang1 5.1028
#> slope2 4.4544
#> restecg1 2.1813
#> slope1 2.1627
#> ca2 2.0740
#> cp1 1.5632
#> fbs1 1.4097
#> ca3 1.3435
#> thal3 1.0615
#> thal1 0.9118
#RAW
pred_df_rf <- predict(object = df_rf_model,
newdata = df_test,
type ="raw")
#Prob
pred_df_rf_prob <- predict(object = df_rf_model,
newdata = df_test,
type ="prob")💣 Business Case :
Every model makes some errors; we want a model with the smallest possible prediction error.
df_test %>%
select(target,pred_Label)
Even after making predictions with the model, some predictions are still wrong. In classification, we evaluate the model using the confusion matrix:
table(predicted = df_test$pred_Label, actual = df_test$target)#> actual
#> predicted 0 1
#> 0 82 10
#> 1 22 91
Class determination:
- Positive class: Indicate Heart Disease (target = 1)
- Negative class: Not Indicate Heart Disease (target = 0)
The confusion matrix cross-tabulates the predicted class against the actual class, so every prediction falls into one of four cells: true positive (TP), true negative (TN), false positive (FP), or false negative (FN).
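From these four cells, the metrics reported by confusionMatrix() are computed as:
\[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \qquad Sensitivity\ (Recall) = \frac{TP}{TP + FN}\]
\[Specificity = \frac{TN}{TN + FP}, \qquad Pos.\ Pred.\ Value\ (Precision) = \frac{TP}{TP + FP}\]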
# confusion matrix
library(caret)
confusionMatrix(data = as.factor(df_test$pred_Label),
reference = as.factor(df_test$target),
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 82 10
#> 1 22 91
#>
#> Accuracy : 0.8439
#> 95% CI : (0.7868, 0.8907)
#> No Information Rate : 0.5073
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.6883
#>
#> Mcnemar's Test P-Value : 0.05183
#>
#> Sensitivity : 0.9010
#> Specificity : 0.7885
#> Pos Pred Value : 0.8053
#> Neg Pred Value : 0.8913
#> Prevalence : 0.4927
#> Detection Rate : 0.4439
#> Detection Prevalence : 0.5512
#> Balanced Accuracy : 0.8447
#>
#> 'Positive' Class : 1
#>
The logistic regression model is 84.39% accurate on the test data, but some predictions are still wrong. For comparison, the confusion matrix on the training data:
# confusion matrix
library(caret)
confusionMatrix(data = as.factor(df_train$pred_Label),
reference = as.factor(df_train$target),
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 339 26
#> 1 56 399
#>
#> Accuracy : 0.9
#> 95% CI : (0.8774, 0.9197)
#> No Information Rate : 0.5183
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.7992
#>
#> Mcnemar's Test P-Value : 0.001362
#>
#> Sensitivity : 0.9388
#> Specificity : 0.8582
#> Pos Pred Value : 0.8769
#> Neg Pred Value : 0.9288
#> Prevalence : 0.5183
#> Detection Rate : 0.4866
#> Detection Prevalence : 0.5549
#> Balanced Accuracy : 0.8985
#>
#> 'Positive' Class : 1
#>
Accuracy alone has limitations in demonstrating how well the model classifies the two classes. To overcome this limitation, ROC and AUC are available as evaluation tools in addition to the confusion matrix.
ROC is a curve that describes the relationship between the True Positive Rate (Sensitivity or Recall) and the False Positive Rate (1 - Specificity) at each threshold. A good model should ideally have a high True Positive Rate and a low False Positive Rate. Note: Specificity is the True Negative Rate.
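In terms of the confusion matrix cells, the two rates plotted on the ROC curve are:
\[TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN} = 1 - Specificity\]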
AUC shows the area under the ROC curve. The closer to 1, the better
the model performance in separating positive and negative classes. To
get the AUC value, use the parameter measure = "auc" in the
performance() function and then take the value
y.values.
# ROC
log_roc <- data.frame(prediction=round(df_test_prob,4),
trueclass= as.numeric(df_test$target == 1))
log_roc <- prediction(log_roc$prediction, log_roc$trueclass)
# ROC curve
plot(performance(log_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)#Check AUC
log_value <- performance(prediction.obj = log_roc,
measure = "auc")
log_value@y.values#> [[1]]
#> [1] 0.9119383
Insight : the logistic regression model reaches an AUC of about 0.91 on the test data, so it separates the positive and negative classes well.
confusionMatrix(data = df_predKNN,
reference = as.factor(test_y),
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 71 25
#> 1 33 76
#>
#> Accuracy : 0.7171
#> 95% CI : (0.6501, 0.7776)
#> No Information Rate : 0.5073
#> P-Value [Acc > NIR] : 0.0000000007776
#>
#> Kappa : 0.4347
#>
#> Mcnemar's Test P-Value : 0.358
#>
#> Sensitivity : 0.7525
#> Specificity : 0.6827
#> Pos Pred Value : 0.6972
#> Neg Pred Value : 0.7396
#> Prevalence : 0.4927
#> Detection Rate : 0.3707
#> Detection Prevalence : 0.5317
#> Balanced Accuracy : 0.7176
#>
#> 'Positive' Class : 1
#>
The k-NN model is only 71.71% accurate on the test data, with many wrong predictions. Recall the characteristics of k-NN: it classifies by distance, so it works best with numeric predictors; here only the numeric columns were used, while most of our predictors are categorical.
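If we wanted k-NN to use the categorical predictors as well, one option (a sketch only, not part of the original pipeline) is to dummy-encode the factors so they become numeric columns before scaling:
# dummy-encode the factor predictors so k-NN can measure distances on them too
x_all <- model.matrix(target ~ . - 1, data = df) # numeric predictor matrix
train_d <- scale(x_all[index, ])
test_d <- scale(x_all[-index, ],
                center = attr(train_d, "scaled:center"),
                scale = attr(train_d, "scaled:scale"))
pred_knn_dummy <- knn(train = train_d, test = test_d, cl = train_y, k = 15)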
df = read.csv("LBB DEAN/heart.csv")
df <- df %>%
mutate_at(vars(sex,cp,fbs,restecg,exang,slope,ca,thal), as.factor)
glimpse(df)#> Rows: 1,025
#> Columns: 14
#> $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
RNGkind(sample.kind = "Rounding")
set.seed(417)
# index sampling
index <- sample(x = nrow(df), size = nrow(df)*0.8)
# sample(x = nrow(df), size = nrow(df)*0.2)
# splitting
df_train <- df[index,]
df_test <- df[-index,]
# predictor variables in `train`
train_x <- df_train %>% select_if(is.numeric)
# predictor variables in `test`
test_x <- df_test %>% select_if(is.numeric)
# target variable in `train`
train_y <- df_train[,"target"]
# target variable in `test`
test_y <- df_test[,"target"]
# your code here
# scale train_x data
train_xs <- scale(train_x)
# scale test_x data
test_xs <- scale(x = test_x,
center = attr(train_xs,"scaled:center"),
scale = attr(train_xs,"scaled:scale"))# find optimum k
round(sqrt(nrow(test_xs)))#> [1] 14
df_predKNN <- knn(train = train_xs,
test = test_xs,
cl = train_y,
k = 15)
head(df_predKNN)#> [1] 0 1 0 0 0 1
#> Levels: 0 1
knn_cm <- confusionMatrix(data = df_predKNN,
reference = as.factor(test_y),
positive = "1")
knn_cm#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 104 1
#> 1 0 100
#>
#> Accuracy : 0.9951
#> 95% CI : (0.9731, 0.9999)
#> No Information Rate : 0.5073
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.9902
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9901
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9905
#> Prevalence : 0.4927
#> Detection Rate : 0.4878
#> Detection Prevalence : 0.4878
#> Balanced Accuracy : 0.9950
#>
#> 'Positive' Class : 1
#>
Insight :
# class predictions on the test data
pred_diab_test <- predict(object = dtree_model ,
newdata = df_test ,
type = "response")
pred_diab_prob <- predict(object = dtree_model ,
newdata = df_test ,
type = "prob")
# Confusion Matrix: test data
confusionMatrix(data = as.factor(pred_diab_test),
reference = as.factor(df_test$target),
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 73 9
#> 1 31 92
#>
#> Accuracy : 0.8049
#> 95% CI : (0.7439, 0.8568)
#> No Information Rate : 0.5073
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.6109
#>
#> Mcnemar's Test P-Value : 0.0008989
#>
#> Sensitivity : 0.9109
#> Specificity : 0.7019
#> Pos Pred Value : 0.7480
#> Neg Pred Value : 0.8902
#> Prevalence : 0.4927
#> Detection Rate : 0.4488
#> Detection Prevalence : 0.6000
#> Balanced Accuracy : 0.8064
#>
#> 'Positive' Class : 1
#>
# class predictions on the training data
pred_diab_train <- predict(object = dtree_model ,
newdata = df_train ,
type = "response")
# Confusion Matrix: training data
confusionMatrix(data = as.factor(pred_diab_train),
reference = as.factor(df_train$target),
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 330 44
#> 1 65 381
#>
#> Accuracy : 0.8671
#> 95% CI : (0.8419, 0.8896)
#> No Information Rate : 0.5183
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.7333
#>
#> Mcnemar's Test P-Value : 0.05541
#>
#> Sensitivity : 0.8965
#> Specificity : 0.8354
#> Pos Pred Value : 0.8543
#> Neg Pred Value : 0.8824
#> Prevalence : 0.5183
#> Detection Rate : 0.4646
#> Detection Prevalence : 0.5439
#> Balanced Accuracy : 0.8660
#>
#> 'Positive' Class : 1
#>
# ROC
dtree_roc <- data.frame(prediction=round(pred_diab_prob[,2],4),
trueclass= as.numeric(df_test$target == 1))
dtree_roc <- prediction(dtree_roc$prediction, dtree_roc$trueclass)
# ROC curve
plot(performance(dtree_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)#Check AUC
dt_value <- performance(prediction.obj = dtree_roc,
measure = "auc")
dt_value@y.values#> [[1]]
#> [1] 0.9107959
Insight : the decision tree reaches an AUC of about 0.91 on the test data, comparable to the logistic regression model.
confusionMatrix(data = as.factor(pred_df_rf),
reference = as.factor(df_test$target),
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 104 0
#> 1 0 101
#>
#> Accuracy : 1
#> 95% CI : (0.9822, 1)
#> No Information Rate : 0.5073
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 1
#>
#> Mcnemar's Test P-Value : NA
#>
#> Sensitivity : 1.0000
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 1.0000
#> Prevalence : 0.4927
#> Detection Rate : 0.4927
#> Detection Prevalence : 0.4927
#> Balanced Accuracy : 1.0000
#>
#> 'Positive' Class : 1
#>
#RAW
pred_df_rf_train <- predict(object = df_rf_model,
newdata = df_train,
type ="raw")
confusionMatrix(data = as.factor(pred_df_rf_train),
reference = as.factor(df_train$target),
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 395 0
#> 1 0 425
#>
#> Accuracy : 1
#> 95% CI : (0.9955, 1)
#> No Information Rate : 0.5183
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 1
#>
#> Mcnemar's Test P-Value : NA
#>
#> Sensitivity : 1.0000
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 1.0000
#> Prevalence : 0.5183
#> Detection Rate : 0.5183
#> Detection Prevalence : 0.5183
#> Balanced Accuracy : 1.0000
#>
#> 'Positive' Class : 1
#>
Insight : the random forest also classifies the training data perfectly (accuracy = 1).
# ROC
rf_roc <- data.frame(prediction=round(pred_df_rf_prob[,2],4),
trueclass= as.numeric(df_test$target == 1))
rf_roc <- prediction(rf_roc$prediction, rf_roc$trueclass)
# ROC curve
plot(performance(rf_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)#Check AUC
rf_value <- performance(prediction.obj = rf_roc,
measure = "auc")
rf_value@y.values#> [[1]]
#> [1] 1
Insight : the random forest achieves a perfect AUC of 1 on the test data.
cbind(Random_Forest_AUC = rf_value@y.values, Decision_Tree_AUC = dt_value@y.values, Logistic_Reg_AUC = log_value@y.values, KNN = knn_cm$overall["Accuracy"])#> Random_Forest_AUC Decision_Tree_AUC Logistic_Reg_AUC KNN
#> Accuracy 1 0.9107959 0.9119383 0.995122
Based on the comparison above (note that the KNN column reports accuracy rather than AUC), the Random Forest gives the best performance on the test data (df_test).