Heart Disease
Cardiovascular disease (CVD), including heart disease, remains the leading cause of death globally, accounting for approximately 31% of all deaths worldwide in 2019. Early detection and intervention are crucial for reducing the burden of disease and improving patient outcomes. Machine learning (ML) can help clinicians identify patients at risk of heart disease by analysing clinical and laboratory data, and several studies have developed ML algorithms that predict heart-disease risk with high accuracy. Such models could improve patient outcomes by enabling earlier detection and intervention, but further research is needed to validate them and confirm their effectiveness in real-world clinical settings.
The heart-disease dataset was created in 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It has 76 attributes, including the predicted attribute, but all published experiments use only 14 of them. The “target” field indicates the presence of heart disease in the patient: it is an integer with value 0 for no disease and 1 for disease.
Attribute Information:
age : age of the patient
sex : 1 = male; 0 = female
cp (chest pain type) : 0 = typical angina; 1 = atypical angina; 2 = non-anginal pain; 3 = asymptomatic
trestbps : resting blood pressure (in mm Hg on admission to the hospital)
chol : serum cholesterol in mg/dl
fbs : fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg : resting electrocardiographic results (values 0, 1, 2)
thalach : maximum heart rate achieved
exang : exercise-induced angina (1 = yes; 0 = no)
oldpeak : ST depression induced by exercise relative to rest
slope : the slope of the peak exercise ST segment (0 - 2)
ca : number of major vessels (0-4) colored by fluoroscopy
thal : 0 = normal; 1 = fixed defect; 2 = reversible defect; 3 = reversible defect after thallium stress testing
target : 1 = disease; 0 = no disease
library(dplyr)
library(class)  # package for `knn()`
library(caret)  # confusion matrix
library(car)    # multicollinearity
library(GGally) # ggcorr
Read the dataset heart.csv:
df = read.csv("LBB DEAN/heart.csv")
glimpse(df)#> Rows: 1,025
#> Columns: 14
#> $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex <int> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg <int> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope <int> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca <int> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal <int> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
df has 14 columns and 1,025 records of patient heart-disease check-ups. Several columns need to be converted to factors: sex, cp, fbs, restecg, exang, slope, ca, and thal (together with the target).
df <- df %>%
mutate_at(vars(sex,cp,fbs,restecg,exang,slope,ca,thal,target), as.factor)
glimpse(df)#> Rows: 1,025
#> Columns: 14
#> $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
Check for missing values in the dataset:
colSums(is.na(df))#> age sex cp trestbps chol fbs restecg thalach
#> 0 0 0 0 0 0 0 0
#> exang oldpeak slope ca thal target
#> 0 0 0 0 0 0
There are no missing values in the dataset.
Next, perform EDA to understand the distribution and characteristics of the dataset, and make sure the target classes are balanced so that the machine-learning models work well.
summary(df)#> age sex cp trestbps chol fbs restecg
#> Min. :29.00 0:312 0:497 Min. : 94.0 Min. :126 0:872 0:497
#> 1st Qu.:48.00 1:713 1:167 1st Qu.:120.0 1st Qu.:211 1:153 1:513
#> Median :56.00 2:284 Median :130.0 Median :240 2: 15
#> Mean :54.43 3: 77 Mean :131.6 Mean :246
#> 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:275
#> Max. :77.00 Max. :200.0 Max. :564
#> thalach exang oldpeak slope ca thal target
#> Min. : 71.0 0:680 Min. :0.000 0: 74 0:578 0: 7 0:499
#> 1st Qu.:132.0 1:345 1st Qu.:0.000 1:482 1:226 1: 64 1:526
#> Median :152.0 Median :0.800 2:469 2:134 2:544
#> Mean :149.1 Mean :1.072 3: 69 3:410
#> 3rd Qu.:166.0 3rd Qu.:1.800 4: 18
#> Max. :202.0 Max. :6.200
Insight: the summary shows the ranges of the numeric variables and the counts per factor level. Next, check the class proportion of the target:
round(prop.table(table(df$target)),1)#>
#> 0 1
#> 0.5 0.5
Insight: the target proportion is already balanced at roughly 50:50.
ggcorr(df, label=TRUE)
Insight: the correlation plot shows the pairwise correlations among the numeric predictors, which helps flag potential multicollinearity.
Next, split the dataset into training and test sets (a hold-out split, sometimes loosely called cross-validation). We randomly sample heart-disease records into df_train and keep the remainder in df_test.
RNGkind(sample.kind = "Rounding")
set.seed(417)
# index sampling
index <- sample(x = nrow(df), size = nrow(df)*0.8)
# sample(x = nrow(df), size = nrow(df)*0.2)
# splitting
df_train <- df[index,]
df_test <- df[-index,]
We are going to build several models and test their performance.
model_none <- glm(formula = target~1,
data = df_train,
family = "binomial")
summary(model_none)#>
#> Call:
#> glm(formula = target ~ 1, family = "binomial", data = df_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.209 -1.209 1.146 1.146 1.146
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.07320 0.06989 1.047 0.295
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1135.7 on 819 degrees of freedom
#> Residual deviance: 1135.7 on 819 degrees of freedom
#> AIC: 1137.7
#>
#> Number of Fisher Scoring iterations: 3
exp(0.07320)#> [1] 1.075946
📈 Model interpretation: the intercept-only model gives exp(0.0732) ≈ 1.08, i.e. the baseline odds of heart disease in the training data are close to 1, consistent with the nearly balanced classes.
Next, fit a model using only the categorical predictors: sex, cp, fbs, restecg, exang, slope, ca, thal.
model_categoric <- glm(formula = target~ sex+cp+fbs+restecg+exang+slope+ca+thal ,
data = df_train,
family = "binomial")
summary(model_categoric)#>
#> Call:
#> glm(formula = target ~ sex + cp + fbs + restecg + exang + slope +
#> ca + thal, family = "binomial", data = df_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.92220 -0.34695 0.07266 0.42072 3.14937
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.1045 1.8662 -0.592 0.553971
#> sex1 -1.6801 0.3211 -5.232 0.00000016775552554 ***
#> cp1 1.1432 0.3352 3.410 0.000649 ***
#> cp2 2.1655 0.3246 6.672 0.00000000002528101 ***
#> cp3 2.1430 0.4207 5.094 0.00000035110360891 ***
#> fbs1 0.3031 0.3401 0.891 0.372822
#> restecg1 0.4937 0.2353 2.098 0.035871 *
#> restecg2 -1.7079 1.1767 -1.451 0.146669
#> exang1 -0.8924 0.2742 -3.255 0.001134 **
#> slope1 -0.2613 0.4386 -0.596 0.551356
#> slope2 1.7783 0.4471 3.977 0.00006964804733135 ***
#> ca1 -2.2969 0.3111 -7.383 0.00000000000015444 ***
#> ca2 -3.6182 0.4659 -7.766 0.00000000000000809 ***
#> ca3 -2.0375 0.5409 -3.767 0.000165 ***
#> ca4 2.2090 1.0406 2.123 0.033771 *
#> thal1 2.2613 1.8522 1.221 0.222128
#> thal2 2.6027 1.8037 1.443 0.149016
#> thal3 0.7894 1.8030 0.438 0.661519
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1135.66 on 819 degrees of freedom
#> Residual deviance: 494.32 on 802 degrees of freedom
#> AIC: 530.32
#>
#> Number of Fisher Scoring iterations: 6
exp(model_categoric$coefficients)#> (Intercept) sex1 cp1 cp2 cp3 fbs1
#> 0.33138730 0.18634683 3.13683135 8.71918722 8.52480997 1.35410577
#> restecg1 restecg2 exang1 slope1 slope2 ca1
#> 1.63828543 0.18125122 0.40966600 0.77004331 5.91973985 0.10056836
#> ca2 ca3 ca4 thal1 thal2 thal3
#> 0.02683137 0.13035619 9.10643407 9.59600631 13.50008312 2.20203009
Variables that lower the odds (odds ratio < 1): sex1, restecg2, exang1, slope1, ca1, ca2, ca3.
📈 Model interpretation (odds ratios, with the other variables held fixed):
cp1 patients have about 3.14 times the odds of heart disease compared with the base level (cp0).
cp2 patients have about 8.72 times the odds of cp0.
cp3 patients have about 8.52 times the odds of cp0.
fbs1 patients have about 1.35 times the odds of fbs0.
restecg1 patients have about 1.64 times the odds of restecg0.
slope2 patients have about 5.92 times the odds of slope0.
ca4 patients have about 9.11 times the odds of ca0.
thal1 patients have about 9.60 times the odds of thal0.
thal2 patients have about 13.50 times the odds of thal0.
thal3 patients have about 2.20 times the odds of thal0.
Next, fit a model using only the numeric predictors: age, trestbps, chol, thalach, oldpeak.
model_numeric <- glm(formula = target~ age+trestbps+chol+thalach+oldpeak ,
data = df_train,
family = "binomial")
summary(model_numeric)#>
#> Call:
#> glm(formula = target ~ age + trestbps + chol + thalach + oldpeak,
#> family = "binomial", data = df_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.1848 -0.7812 0.4288 0.8541 2.6114
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -2.025752 1.098845 -1.844 0.0653 .
#> age 0.003310 0.010839 0.305 0.7601
#> trestbps -0.011531 0.005264 -2.190 0.0285 *
#> chol -0.004096 0.001749 -2.342 0.0192 *
#> thalach 0.035287 0.004557 7.744 0.00000000000000964 ***
#> oldpeak -0.834227 0.092076 -9.060 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1135.66 on 819 degrees of freedom
#> Residual deviance: 851.21 on 814 degrees of freedom
#> AIC: 863.21
#>
#> Number of Fisher Scoring iterations: 4
exp(model_numeric$coefficients)#> (Intercept) age trestbps chol thalach oldpeak
#> 0.1318946 1.0033152 0.9885348 0.9959125 1.0359168 0.4342098
Variables that lower the odds (odds ratio < 1): trestbps, chol, oldpeak.
📈 Model interpretation (with the other variables held fixed): each additional year of age multiplies the odds of heart disease by about 1.00, and each additional unit of thalach (maximum heart rate) multiplies the odds by about 1.04.
Next, fit a model using all predictors:
model_all <- glm(formula = target ~ . ,
data = df_train,
family = "binomial")
summary(model_all)#>
#> Call:
#> glm(formula = target ~ ., family = "binomial", data = df_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.0412 -0.2616 0.0570 0.4065 3.3227
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.265828 2.309760 -0.115 0.908375
#> age 0.026965 0.016411 1.643 0.100368
#> sex1 -2.078567 0.375782 -5.531 0.000000031783653 ***
#> cp1 0.829104 0.353346 2.346 0.018954 *
#> cp2 2.168814 0.341372 6.353 0.000000000210833 ***
#> cp3 2.466048 0.466150 5.290 0.000000122150326 ***
#> trestbps -0.028018 0.007818 -3.584 0.000339 ***
#> chol -0.006596 0.002753 -2.396 0.016579 *
#> fbs1 0.412043 0.372173 1.107 0.268238
#> restecg1 0.374898 0.255211 1.469 0.141840
#> restecg2 -0.562131 1.926327 -0.292 0.770428
#> thalach 0.025281 0.007816 3.235 0.001218 **
#> exang1 -0.729219 0.292905 -2.490 0.012789 *
#> oldpeak -0.511915 0.155308 -3.296 0.000980 ***
#> slope1 -0.323303 0.534445 -0.605 0.545224
#> slope2 1.168961 0.563860 2.073 0.038159 *
#> ca1 -2.542979 0.342386 -7.427 0.000000000000111 ***
#> ca2 -3.680525 0.521668 -7.055 0.000000000001722 ***
#> ca3 -1.675346 0.600219 -2.791 0.005251 **
#> ca4 1.891545 1.047329 1.806 0.070908 .
#> thal1 2.940066 1.568853 1.874 0.060927 .
#> thal2 2.764906 1.513894 1.826 0.067797 .
#> thal3 1.091833 1.515119 0.721 0.471140
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1135.66 on 819 degrees of freedom
#> Residual deviance: 452.07 on 797 degrees of freedom
#> AIC: 498.07
#>
#> Number of Fisher Scoring iterations: 6
exp(model_all$coefficients)#> (Intercept) age sex1 cp1 cp2 cp3
#> 0.76657125 1.02733167 0.12510939 2.29126495 8.74789990 11.77581908
#> trestbps chol fbs1 restecg1 restecg2 thalach
#> 0.97237088 0.99342604 1.50989958 1.45484288 0.56999318 1.02560367
#> exang1 oldpeak slope1 slope2 ca1 ca2
#> 0.48228567 0.59934695 0.72375441 3.21864567 0.07863177 0.02520973
#> ca3 ca4 thal1 thal2 thal3
#> 0.18724333 6.62960374 18.91708684 15.87755124 2.97973045
Variables that lower the odds (odds ratio < 1): sex1, restecg2, exang1, slope1, ca1, ca2, ca3, trestbps, chol, oldpeak.
📈 Model interpretation (odds ratios, with the other variables held fixed):
Each additional year of age multiplies the odds of heart disease by about 1.03.
Each additional unit of thalach (maximum heart rate) multiplies the odds by about 1.03.
cp1 patients have about 2.29 times the odds of heart disease compared with the base level (cp0).
cp2 patients have about 8.75 times the odds of cp0.
cp3 patients have about 11.78 times the odds of cp0.
restecg1 patients have about 1.45 times the odds of restecg0.
slope2 patients have about 3.22 times the odds of slope0.
ca4 patients have about 6.63 times the odds of ca0.
thal1 patients have about 18.92 times the odds of thal0.
thal2 patients have about 15.88 times the odds of thal0.
thal3 patients have about 2.98 times the odds of thal0.
We start from the model with all predictors because of its low AIC, then perform feature selection using backward stepwise regression.
model_step <- step(object = model_all,
direction = "backward", trace = F)
summary(model_step)#>
#> Call:
#> glm(formula = target ~ age + sex + cp + trestbps + chol + thalach +
#> exang + oldpeak + slope + ca + thal, family = "binomial",
#> data = df_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.0096 -0.2815 0.0540 0.4044 3.3720
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.298116 2.469446 0.121 0.903911
#> age 0.024359 0.016267 1.497 0.134274
#> sex1 -2.080038 0.373448 -5.570 0.0000000255013944 ***
#> cp1 0.853616 0.351534 2.428 0.015172 *
#> cp2 2.237598 0.337174 6.636 0.0000000000321570 ***
#> cp3 2.518670 0.462415 5.447 0.0000000512930186 ***
#> trestbps -0.027613 0.007554 -3.655 0.000257 ***
#> chol -0.007103 0.002650 -2.680 0.007367 **
#> thalach 0.025386 0.007880 3.222 0.001274 **
#> exang1 -0.700975 0.291217 -2.407 0.016082 *
#> oldpeak -0.522742 0.150681 -3.469 0.000522 ***
#> slope1 -0.367936 0.521164 -0.706 0.480195
#> slope2 1.150041 0.547990 2.099 0.035848 *
#> ca1 -2.522397 0.333224 -7.570 0.0000000000000374 ***
#> ca2 -3.586970 0.508731 -7.051 0.0000000000017787 ***
#> ca3 -1.537753 0.581697 -2.644 0.008204 **
#> ca4 2.042608 1.037289 1.969 0.048932 *
#> thal1 2.880288 1.851205 1.556 0.119732
#> thal2 2.616895 1.800002 1.454 0.145994
#> thal3 0.973965 1.799240 0.541 0.588287
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1135.66 on 819 degrees of freedom
#> Residual deviance: 455.76 on 800 degrees of freedom
#> AIC: 495.76
#>
#> Number of Fisher Scoring iterations: 6
Compare the stepwise model with model_all:
# Deviance
model_all$deviance#> [1] 452.0742
model_step$deviance#> [1] 455.7572
# AIC
model_all$aic#> [1] 498.0742
model_step$aic#> [1] 495.7572
We use model_step because it has the lower AIC and shows no sign of perfect separation.
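As an informal check (a sketch added here, not part of the original output), perfect separation would typically show up as fitted probabilities pinned at exactly 0 or 1 together with huge coefficient standard errors, and multicollinearity can be screened with car::vif():
# informal diagnostics for model_step (illustrative; results not shown)
range(fitted(model_step))                               # probabilities stuck at 0/1 hint at separation
sum(fitted(model_step) < 1e-6 | fitted(model_step) > 1 - 1e-6)
car::vif(model_step)                                    # generalised VIFs for multicollinearity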
k-NN (k-nearest neighbours) classifies new data by comparing the characteristics of the new data (test data) with existing data (train data).
Proximity is measured with Euclidean distance until the k closest data points (neighbours) are found; the majority class among those neighbours becomes the class of the new observation (majority voting).
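As a minimal illustration (hypothetical values standing in for two scaled patients, not rows from the data), the distance that knn() relies on is simply:
# Euclidean distance between two hypothetical scaled observations
x1 <- c(age = 0.2, trestbps = -0.5, chol = 1.1, thalach = 0.3, oldpeak = -0.7)
x2 <- c(age = -0.1, trestbps = 0.4, chol = 0.2, thalach = -0.6, oldpeak = 0.9)
sqrt(sum((x1 - x2)^2))  # the distance used to rank neighbours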
Characteristics of k-NN :
knitr::include_graphics("assets/KNN.png")
# predictor variables in `train`
train_x <- df_train %>% select_if(is.numeric)
# predictor variables in `test`
test_x <- df_test %>% select_if(is.numeric)
# target variable in `train`
train_y <- df_train[,"target"]
# target variable in `test`
test_y <- df_test[,"target"]
Remember that distance measurements in k-NN depend heavily on the scale of the predictor variables fed into the model. A predictor whose range of values differs greatly from the others can distort the classification. Therefore, normalise the data so that every predictor variable is on a standard scale.
To normalise the train_x data, use the scale() function. To normalise the test data, use the same function, but with the center and scale attributes obtained from the scaled train_x data, so the test set is transformed with the training statistics.
# scale train_x data
train_xs <- scale(train_x)
# scale test_x data with the training center and scale
test_xs <- scale(x = test_x,
center = attr(train_xs,"scaled:center"),
scale = attr(train_xs,"scaled:scale"))
# find a starting k (the usual rule of thumb is the square root of the number
# of training observations; here the square root of the test rows is used)
round(sqrt(nrow(test_xs)))#> [1] 14
Logistic regression models the log-odds of the target as a linear function of the predictors; the resulting probability can then be used for classification.
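As a small illustration (a sketch using the fitted model_step; the names log_odds and prob_manual are introduced here), the linear predictor can be converted to a probability with the logistic function, which is exactly what type = "response" does:
# linear predictor (log-odds) for the test set
log_odds <- predict(model_step, newdata = df_test, type = "link")
# logistic transformation: identical to predict(..., type = "response")
prob_manual <- 1 / (1 + exp(-log_odds))
head(prob_manual)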
df_test$pred_risk <- predict(object = model_step,
newdata = df_test,
type = "response")
df_test$pred_Label <- ifelse(test = df_test$pred_risk > 0.5, yes = 1, no = 0)
df_train$pred_risk <- predict(object = model_step,
newdata = df_train,
type = "response")
df_train$pred_Label <- ifelse(test = df_train$pred_risk > 0.5, yes = 1, no = 0)
Insight :
\[\log\left(\frac{p}{1-p}\right) = b_0 + b_1 x\]
df_predKNN <- knn(train = train_xs,
test = test_xs,
cl = train_y,
k = 15)  # an odd k (near the rule-of-thumb value) avoids tied majority votes
head(df_predKNN)#> [1] 0 1 0 0 0 1
#> Levels: 0 1
Insight: knn() returns the predicted class labels (0/1) directly rather than probabilities.
💣 Business Case :
Every model makes some errors; we want the model with the smallest possible prediction error.
df_test %>%
select(target,pred_Label)
After making predictions with the model, some predictions are still wrong. In classification, we evaluate the model with the confusion matrix:
table(predicted = df_test$pred_Label, actual = df_test$target)#> actual
#> predicted 0 1
#> 0 82 10
#> 1 22 91
Class determination for this case:
Positive class: Indicates Heart Disease (target = 1)
Negative class: Does Not Indicate Heart Disease (target = 0)
Contents of the confusion matrix: TRUE cells (on the diagonal) are correct predictions; FALSE cells (off the diagonal) are misclassifications.
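As a sketch of how the headline metrics reported below are derived from the 2x2 table above (counts copied from that table, with 1 as the positive class):
# manual metrics from the confusion matrix above (TP = 91, TN = 82, FP = 22, FN = 10)
TP <- 91; TN <- 82; FP <- 22; FN <- 10
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # ~0.8439
sensitivity <- TP / (TP + FN)                   # recall of the positive class, ~0.9010
specificity <- TN / (TN + FP)                   # ~0.7885
precision   <- TP / (TP + FP)                   # positive predictive value, ~0.8053
c(accuracy, sensitivity, specificity, precision)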
# confusion matrix
library(caret)
confusionMatrix(data = as.factor(df_test$pred_Label),
reference = as.factor(df_test$target),
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 82 10
#> 1 22 91
#>
#> Accuracy : 0.8439
#> 95% CI : (0.7868, 0.8907)
#> No Information Rate : 0.5073
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.6883
#>
#> Mcnemar's Test P-Value : 0.05183
#>
#> Sensitivity : 0.9010
#> Specificity : 0.7885
#> Pos Pred Value : 0.8053
#> Neg Pred Value : 0.8913
#> Prevalence : 0.4927
#> Detection Rate : 0.4439
#> Detection Prevalence : 0.5512
#> Balanced Accuracy : 0.8447
#>
#> 'Positive' Class : 1
#>
The logistic regression model is 84.39% accurate, but some predictions are still wrong.
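As an optional follow-up (a sketch, not part of the original output), the misclassified test rows can be inspected directly:
# look at the test rows the logistic model got wrong (illustrative only)
df_test %>%
filter(pred_Label != as.numeric(as.character(target))) %>%
select(age, sex, cp, pred_risk, pred_Label, target) %>%
head()
The same confusion-matrix evaluation is applied to the k-NN predictions below.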
confusionMatrix(data = df_predKNN,
reference = as.factor(test_y),
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 71 25
#> 1 33 76
#>
#> Accuracy : 0.7171
#> 95% CI : (0.6501, 0.7776)
#> No Information Rate : 0.5073
#> P-Value [Acc > NIR] : 0.0000000007776
#>
#> Kappa : 0.4347
#>
#> Mcnemar's Test P-Value : 0.358
#>
#> Sensitivity : 0.7525
#> Specificity : 0.6827
#> Pos Pred Value : 0.6972
#> Neg Pred Value : 0.7396
#> Prevalence : 0.4927
#> Detection Rate : 0.3707
#> Detection Prevalence : 0.5317
#> Balanced Accuracy : 0.7176
#>
#> 'Positive' Class : 1
#>
The k-NN model is only 71.71% accurate, and some predictions are still wrong. Recalling the characteristics of k-NN: most of our predictors are categorical and were excluded here, and k-NN works best with numeric predictors because it classifies by distance.
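One common workaround (a sketch only, not applied in this analysis; the object names below are hypothetical) is to one-hot encode the factor predictors so they can be combined with the scaled numeric ones:
# turn the factor predictors into 0/1 dummy columns (illustrative)
train_factors <- model.matrix(~ sex + cp + fbs + restecg + exang + slope + ca + thal,
data = df_train)[, -1]  # drop the intercept column
test_factors <- model.matrix(~ sex + cp + fbs + restecg + exang + slope + ca + thal,
data = df_test)[, -1]
# combine with the scaled numeric predictors and reuse knn()
train_knn <- cbind(train_xs, train_factors)
test_knn <- cbind(test_xs, test_factors)
# knn(train = train_knn, test = test_knn, cl = train_y, k = 15)
The chunk below re-reads the data and repeats the k-NN workflow, this time leaving target as an integer.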
df = read.csv("LBB DEAN/heart.csv")
df <- df %>%
mutate_at(vars(sex,cp,fbs,restecg,exang,slope,ca,thal), as.factor)
glimpse(df)#> Rows: 1,025
#> Columns: 14
#> $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
RNGkind(sample.kind = "Rounding")
set.seed(417)
# index sampling
index <- sample(x = nrow(df), size = nrow(df)*0.8)
# sample(x = nrow(df), size = nrow(df)*0.2)
# splitting
df_train <- df[index,]
df_test <- df[-index,]# variabel prediktor pada `train`
train_x <- df_train %>% select_if(is.numeric)
# variabel prediktor pada `test`
test_x <- df_test %>% select_if(is.numeric)
# variabel target pada `train`
train_y <- df_train[,"target"]
# variabel target pada `test`
test_y <- df_test[,"target"]# your code here
# scale train_x data
train_xs <- scale(train_x)
# scale test_x data
test_xs <- scale(x = test_x,
center = attr(train_xs,"scaled:center"),
scale = attr(train_xs,"scaled:scale"))# find optimum k
round(sqrt(nrow(test_xs)))#> [1] 14
df_predKNN <- knn(train = train_xs,
test = test_xs,
cl = train_y,
k = 15)
head(df_predKNN)#> [1] 0 1 0 0 0 1
#> Levels: 0 1
confusionMatrix(data = df_predKNN,
reference = as.factor(test_y),
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 104 1
#> 1 0 100
#>
#> Accuracy : 0.9951
#> 95% CI : (0.9731, 0.9999)
#> No Information Rate : 0.5073
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.9902
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.9901
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9905
#> Prevalence : 0.4927
#> Detection Rate : 0.4878
#> Detection Prevalence : 0.4878
#> Balanced Accuracy : 0.9950
#>
#> 'Positive' Class : 1
#>
Insight: in this rerun, target was not converted to a factor, so select_if(is.numeric) keeps it in train_x and test_x and the target leaks into the k-NN predictors. The near-perfect 99.51% accuracy therefore reflects data leakage rather than genuine predictive performance.
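A quick way to confirm this (a sketch):
# with target left as an integer, it is picked up by select_if(is.numeric)
colnames(train_x)  # "target" should appear here, confirming the leakage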
In the next topic we will try Decision Tree and Random Forest classifiers for heart disease and compare them with these models to find out which best suits the characteristics of the data.