Diabetes is a chronic disease characterized by high blood sugar (glucose) levels. Glucose is the main source of energy for the cells of the human body.
Glucose that accumulates in the blood because the body's cells do not absorb it properly can cause a range of organ disorders. If diabetes is not well controlled, complications that endanger the patient's life can arise.
It is therefore worth investigating the relationship between the predictor variables and the target variable (positive for diabetes or not) using a machine learning classification technique (logistic regression).
Variable Description (Columns):
1. Pregnancies: Number of times pregnant
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. BloodPressure: Diastolic blood pressure (mm Hg)
4. SkinThickness: Triceps skin fold thickness (mm)
5. Insulin: 2-Hour serum insulin (mu U/ml)
6. BMI: Body mass index (weight in kg/(height in m)^2)
7. DiabetesPedigreeFunction: Diabetes pedigree function
8. Age: Age (years)
9. Outcome: Class variable (0 or 1)
Missing Attribute Values: Yes
Class Distribution: (class value 1 is interpreted as “tested positive for diabetes”)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
diabetes <- diabetes %>%
  mutate(Outcome = as.factor(ifelse(Outcome == 0, "Negative", "Positive")))
str(diabetes)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : Factor w/ 2 levels "Negative","Positive": 2 1 2 1 2 1 2 1 2 2 ...
##
## Negative Positive
## 0.6510417 0.3489583
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
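No NA values are present, so we are ready to go! Note, however, that the summary below shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI. Zeros in these columns are not physiologically plausible and most likely stand for missing measurements.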
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Negative:500
## Positive:268
##
##
##
##
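The summary above shows a minimum of 0 in several measurement columns, so an NA check alone may not reveal every missing value. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the real dataset). Which columns should treat 0 as missing is an assumption based on the summary output.

```r
# Hypothetical toy data reusing some column names from the diabetes data
toy <- data.frame(
  Glucose       = c(148, 85, 0, 89),
  BloodPressure = c(72, 0, 64, 66),
  BMI           = c(33.6, 26.6, 0, 28.1),
  Pregnancies   = c(6, 1, 0, 1)   # a zero here is a valid value
)

# Explicit NA count per column (all zero for this toy data)
na_count <- colSums(is.na(toy))

# Columns where a zero is assumed to encode a missing measurement
zero_as_missing <- c("Glucose", "BloodPressure", "BMI")
zero_count <- colSums(toy[zero_as_missing] == 0)

# Recode those zeros as NA so that later steps can handle them explicitly
toy_clean <- toy
toy_clean[zero_as_missing] <- lapply(toy_clean[zero_as_missing],
                                     function(x) replace(x, x == 0, NA))

print(na_count)
print(zero_count)
print(colSums(is.na(toy_clean)))
```

How the recoded values are then handled (dropping rows, imputing a median, and so on) is a separate modelling decision that this sketch does not make.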
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
##
## Negative Positive
## 0.6563518 0.3436482
##
## Negative Positive
## 0.5714286 0.4285714
##
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = diab_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5563 -0.7555 -0.3576 0.7645 3.0091
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.4346000 0.5333596 -15.814 < 0.0000000000000002 ***
## Pregnancies 0.1158305 0.0242604 4.774 0.000001802 ***
## Glucose 0.0401253 0.0027567 14.555 < 0.0000000000000002 ***
## BloodPressure -0.0187336 0.0041620 -4.501 0.000006759 ***
## SkinThickness -0.0001620 0.0049000 -0.033 0.97362
## Insulin -0.0026606 0.0006292 -4.228 0.000023530 ***
## BMI 0.0908041 0.0106248 8.546 < 0.0000000000000002 ***
## DiabetesPedigreeFunction 1.1384901 0.2218097 5.133 0.000000286 ***
## Age 0.0196196 0.0073187 2.681 0.00735 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2017.3 on 1476 degrees of freedom
## Residual deviance: 1448.0 on 1468 degrees of freedom
## AIC: 1466
##
## Number of Fisher Scoring iterations: 5
Feature selection using stepwise regression
## Start: AIC=1465.99
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
## Insulin + BMI + DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## - SkinThickness 1 1448.0 1464.0
## <none> 1448.0 1466.0
## - Age 1 1455.2 1471.2
## - Insulin 1 1465.8 1481.8
## - BloodPressure 1 1470.1 1486.1
## - Pregnancies 1 1471.3 1487.3
## - DiabetesPedigreeFunction 1 1475.2 1491.2
## - BMI 1 1534.9 1550.9
## - Glucose 1 1732.5 1748.5
##
## Step: AIC=1463.99
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI +
## DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## <none> 1448.0 1464.0
## - Age 1 1455.4 1469.4
## - Insulin 1 1469.8 1483.8
## - BloodPressure 1 1471.3 1485.3
## - Pregnancies 1 1471.3 1485.3
## - DiabetesPedigreeFunction 1 1475.5 1489.5
## - BMI 1 1543.3 1557.3
## - Glucose 1 1738.1 1752.1
##
## Call: glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## Insulin + BMI + DiabetesPedigreeFunction + Age, family = "binomial",
## data = diab_train)
##
## Coefficients:
## (Intercept) Pregnancies Glucose
## -8.434530 0.115811 0.040139
## BloodPressure Insulin BMI
## -0.018760 -0.002669 0.090708
## DiabetesPedigreeFunction Age
## 1.137656 0.019649
##
## Degrees of Freedom: 1476 Total (i.e. Null); 1469 Residual
## Null Deviance: 2017
## Residual Deviance: 1448 AIC: 1464
## Start: AIC=1465.99
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
## Insulin + BMI + DiabetesPedigreeFunction + Age
##
## Call: glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## SkinThickness + Insulin + BMI + DiabetesPedigreeFunction +
## Age, family = "binomial", data = diab_train)
##
## Coefficients:
## (Intercept) Pregnancies Glucose
## -8.434600 0.115831 0.040125
## BloodPressure SkinThickness Insulin
## -0.018734 -0.000162 -0.002661
## BMI DiabetesPedigreeFunction Age
## 0.090804 1.138490 0.019620
##
## Degrees of Freedom: 1476 Total (i.e. Null); 1468 Residual
## Null Deviance: 2017
## Residual Deviance: 1448 AIC: 1466
## Start: AIC=1465.99
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
## Insulin + BMI + DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## - SkinThickness 1 1448.0 1464.0
## <none> 1448.0 1466.0
## - Age 1 1455.2 1471.2
## - Insulin 1 1465.8 1481.8
## - BloodPressure 1 1470.1 1486.1
## - Pregnancies 1 1471.3 1487.3
## - DiabetesPedigreeFunction 1 1475.2 1491.2
## - BMI 1 1534.9 1550.9
## - Glucose 1 1732.5 1748.5
##
## Step: AIC=1463.99
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI +
## DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## <none> 1448.0 1464.0
## + SkinThickness 1 1448.0 1466.0
## - Age 1 1455.4 1469.4
## - Insulin 1 1469.8 1483.8
## - BloodPressure 1 1471.3 1485.3
## - Pregnancies 1 1471.3 1485.3
## - DiabetesPedigreeFunction 1 1475.5 1489.5
## - BMI 1 1543.3 1557.3
## - Glucose 1 1738.1 1752.1
##
## Call: glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## Insulin + BMI + DiabetesPedigreeFunction + Age, family = "binomial",
## data = diab_train)
##
## Coefficients:
## (Intercept) Pregnancies Glucose
## -8.434530 0.115811 0.040139
## BloodPressure Insulin BMI
## -0.018760 -0.002669 0.090708
## DiabetesPedigreeFunction Age
## 1.137656 0.019649
##
## Degrees of Freedom: 1476 Total (i.e. Null); 1469 Residual
## Null Deviance: 2017
## Residual Deviance: 1448 AIC: 1464
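The AIC values in the stepwise output can be checked by hand. For a logistic regression fitted to ungrouped binary outcomes, the residual deviance equals -2 times the log-likelihood, so AIC = residual deviance + 2 × (number of estimated coefficients). The sketch below uses the rounded deviance printed above, so it matches only to the printed precision.

```r
# Residual deviance as printed in the output above (rounded)
residual_deviance <- 1448.0

# Number of estimated coefficients, including the intercept
k_full    <- 9  # all 8 predictors
k_reduced <- 8  # SkinThickness removed

aic_full    <- residual_deviance + 2 * k_full
aic_reduced <- residual_deviance + 2 * k_reduced

print(c(full = aic_full, reduced = aic_reduced))
```

Dropping SkinThickness leaves the deviance essentially unchanged while removing one parameter, which is why the reduced model has the lower AIC and is the one selected.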
model_diabetes <- glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
                        Insulin + BMI + DiabetesPedigreeFunction + Age,
                      family = "binomial", data = diab_train)
summary(model_diabetes)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## Insulin + BMI + DiabetesPedigreeFunction + Age, family = "binomial",
## data = diab_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5548 -0.7549 -0.3580 0.7644 3.0091
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.4345304 0.5333530 -15.814 < 0.0000000000000002 ***
## Pregnancies 0.1158110 0.0242537 4.775 0.000001797 ***
## Glucose 0.0401392 0.0027249 14.731 < 0.0000000000000002 ***
## BloodPressure -0.0187602 0.0040834 -4.594 0.000004342 ***
## Insulin -0.0026694 0.0005701 -4.683 0.000002834 ***
## BMI 0.0907076 0.0102157 8.879 < 0.0000000000000002 ***
## DiabetesPedigreeFunction 1.1376557 0.2203627 5.163 0.000000243 ***
## Age 0.0196491 0.0072646 2.705 0.00684 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2017.3 on 1476 degrees of freedom
## Residual deviance: 1448.0 on 1469 degrees of freedom
## AIC: 1464
##
## Number of Fisher Scoring iterations: 5
## (Intercept) Pregnancies Glucose
## 0.0002172351 1.1227836163 1.0409556988
## BloodPressure Insulin BMI
## 0.9814147157 0.9973341720 1.0949488180
## DiabetesPedigreeFunction Age
## 3.1194467610 1.0198434068
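The values above appear to be the exponentiated model coefficients, that is, odds ratios. The call that produced them is not shown, so the sketch below is an assumption that simply reproduces them from the coefficient estimates printed in the model summary.

```r
# Coefficient estimates copied from the model summary above
coefs <- c(Pregnancies = 0.1158110, Glucose = 0.0401392,
           BloodPressure = -0.0187602, Insulin = -0.0026694,
           BMI = 0.0907076, DiabetesPedigreeFunction = 1.1376557,
           Age = 0.0196491)

# exp(coefficient) is the multiplicative change in the odds of a
# "Positive" outcome for a one-unit increase in the predictor
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))
```

An odds ratio above 1 (about 1.12 for Pregnancies) means the odds of a positive outcome rise with the predictor when the others are held fixed. A value below 1 (BloodPressure, Insulin) means the odds fall. With the fitted model at hand, `exp(coef(model_diabetes))` should give the same numbers.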
## 2 3 9 12 13 17
## -2.72379362 2.14354970 0.92956546 2.80459825 2.01965634 -0.49753453
## 18 23 25 31 34 41
## -1.13417569 3.48388024 1.03708195 0.17826858 -3.20365452 1.65369706
## 43 46 50 56 57 61
## -1.82546128 4.00689032 -2.59067103 -3.54551873 2.30925267 -4.07273471
## 65 74 76 84 86 92
## -0.19216475 -1.22886900 -6.38717713 -2.66656654 -1.24598223 -0.93009160
## 102 103 107 117 118 122
## -0.37991574 -2.46648365 -3.95622186 -0.18545380 -1.33300224 -0.61529079
## 128 129 130 135 139 146
## -1.15188080 -1.28653280 -1.20377535 -2.59516270 -0.55770896 -4.68397082
## 150 151 158 170 172 181
## -2.89829003 -0.47462301 -1.66905115 -1.81920978 0.37624370 -2.91961342
## 182 188 201 208 217 218
## -0.66177031 -0.12031093 -1.18073182 1.30923945 -0.66446508 -0.44040620
## 232 236 237 242 244 247
## 0.51581862 2.55245633 2.47814501 -1.92463707 0.23511052 0.27408042
## 253 255 260 264 267 270
## -3.26988635 -0.78226182 2.79670652 0.51621148 1.95003030 0.97509095
## 278 282 283 285 288 292
## -2.28025289 0.49136453 -0.38597585 -1.60317288 -0.11747297 -0.82796567
## 294 305 309 310 317 325
## 0.24337313 -0.62461371 -0.21260064 -0.47935689 -3.12174313 -1.29506050
## 337 346 352 353 361 362
## 1.25253248 0.46240336 -0.34182546 -2.87628680 1.72579276 1.34978653
## 369 373 382 391 392 395
## -3.35051088 -2.15967870 -2.98068280 -1.83314196 2.44453146 1.41435970
## 405 406 409 413 414 417
## 1.63074219 0.31452858 3.48172410 0.52022772 -1.04948334 -1.55564460
## 423 425 430 432 440 450
## -1.08955372 1.71487206 -2.23956389 -1.99883024 -0.32142292 -1.57255519
## 453 455 456 457 465 467
## -2.07421446 -1.01544085 3.08368345 0.50873487 -0.49116862 -3.27587230
## 475 476 481 482 487 489
## -1.35969128 -0.61289615 0.36532310 -1.16144892 -0.46958768 -2.14147368
## 491 499 501 503 507 510
## -1.74173818 2.04592346 -2.33001011 -3.84506986 1.21764228 -0.16409371
## 512 525 529 534 537 542
## -1.92611444 -0.64805777 -1.69131676 -0.20482000 -2.09540352 -0.71316565
## 548 549 553 558 564 576
## -0.69723466 0.89253764 -0.71517335 -0.58755570 -1.40141636 -0.50587789
## 596 600 602 613 615 623
## 1.19028039 -1.90750747 -0.97019852 1.75809964 1.49663455 4.08955481
## 628 632 634 638 642 646
## -0.80078779 -2.17211300 -2.15015885 -1.97514400 -0.21911679 -0.14803260
## 649 657 664 667 668 677
## 0.06352999 -2.89113745 1.52859293 0.90135219 -0.69334446 0.80872644
## 679 697 698 713 716 735
## -0.30459485 0.60067883 -1.47294900 1.72393593 2.58053589 -1.60332452
## 737 738 746 755
## -1.82677510 -1.83922498 -0.70675189 1.53722412
## 2 3 9 12 13 17
## 0.061583863 0.895064480 0.716987118 0.942923799 0.882845469 0.378120238
## 18 23 25 31 34 41
## 0.243391316 0.970225617 0.738286575 0.544449492 0.039028428 0.839390093
## 43 46 50 56 57 61
## 0.138779852 0.982135088 0.069741236 0.028044466 0.909640448 0.016745561
## 65 74 76 84 86 92
## 0.452106105 0.226379438 0.001680173 0.064975252 0.223396408 0.282906132
## 102 103 107 117 118 122
## 0.406147220 0.078241457 0.018775990 0.453768976 0.208663194 0.350853245
## 128 129 130 135 139 146
## 0.240145716 0.216440249 0.230804284 0.069450390 0.364077728 0.009157605
## 150 151 158 170 172 181
## 0.052238158 0.383522624 0.158550727 0.139528720 0.592966812 0.051192475
## 182 188 201 208 217 218
## 0.340342049 0.469958496 0.234920638 0.787385861 0.339737307 0.391644183
## 232 236 237 242 244 247
## 0.626169501 0.927738360 0.922595431 0.127345364 0.558508365 0.568094368
## 253 255 260 264 267 270
## 0.036618837 0.313832618 0.942497591 0.626261458 0.875449946 0.726133068
## 278 282 283 285 288 292
## 0.092771667 0.620427827 0.404686408 0.167538628 0.470665484 0.304075390
## 294 305 309 310 317 325
## 0.560544735 0.348732863 0.447049133 0.382403999 0.042219229 0.214997500
## 337 346 352 353 361 362
## 0.777737937 0.613584165 0.415366119 0.053338318 0.848873476 0.794094726
## 369 373 382 391 392 395
## 0.033878439 0.103430242 0.048306229 0.137864401 0.920160626 0.804452673
## 405 406 409 413 414 417
## 0.836271287 0.577990250 0.970163267 0.627201013 0.259324326 0.174272507
## 423 425 430 432 440 450
## 0.251702325 0.847467145 0.096253472 0.119325794 0.420329011 0.171852435
## 453 455 456 457 465 467
## 0.111628417 0.265916419 0.956214663 0.624509852 0.379618310 0.036408250
## 475 476 481 482 487 489
## 0.204290481 0.351398829 0.590328394 0.238404108 0.384713840 0.105130668
## 491 499 501 503 507 510
## 0.149092288 0.885535058 0.088667847 0.020937168 0.771648369 0.459068379
## 512 525 529 534 537 542
## 0.127181276 0.343427348 0.155602752 0.448973261 0.109544378 0.328899725
## 548 549 553 558 564 576
## 0.332425625 0.709413575 0.328456728 0.357195887 0.197591453 0.376160341
## 596 600 602 613 615 623
## 0.766791208 0.129261134 0.274840936 0.852971493 0.817071994 0.983529145
## 628 632 634 638 642 646
## 0.309857028 0.102282853 0.104316380 0.121837443 0.445438927 0.463059285
## 649 657 664 667 668 677
## 0.515877157 0.052593413 0.821800350 0.711227299 0.333289496 0.691838049
## 679 697 698 713 716 735
## 0.424434621 0.645811596 0.186494794 0.848635115 0.929598348 0.167517480
## 737 738 746 755
## 0.138622899 0.137142979 0.330316950 0.823060831
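The two blocks of output above appear to be the test-set predictions on the log-odds scale and on the probability scale. The calls that produced them are not shown, so this is an inference. The two scales are linked by the logistic function p = 1 / (1 + exp(-x)), sketched here on the first three values.

```r
# Log-odds for test rows 2, 3 and 9, copied from the output above
log_odds <- c(`2` = -2.72379362, `3` = 2.14354970, `9` = 0.92956546)

# Logistic (inverse logit) transform, written out explicitly
inv_logit <- function(x) 1 / (1 + exp(-x))
prob_manual <- inv_logit(log_odds)

# Base R provides the same transform as plogis()
prob_builtin <- plogis(log_odds)

print(round(prob_manual, 4))
```

A log-odds value of 0 corresponds to a probability of 0.5, so negative log-odds map to probabilities below 0.5 and positive log-odds to probabilities above it.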
diab_data_test$pred.diab <- predict(model_diabetes, newdata = diab_data_test, type = "response")
head(diab_data_test)
diab_data_test$label <- as.factor(ifelse(diab_data_test$pred.diab > 0.5, "Positive", "Negative"))
head(diab_data_test)
## actual
## predict Negative Positive
## Negative 83 19
## Positive 14 38
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 83 19
## Positive 14 38
##
## Accuracy : 0.7857
## 95% CI : (0.7124, 0.8477)
## No Information Rate : 0.6299
## P-Value [Acc > NIR] : 0.00002319
##
## Kappa : 0.532
##
## Mcnemar's Test P-Value : 0.4862
##
## Sensitivity : 0.6667
## Specificity : 0.8557
## Pos Pred Value : 0.7308
## Neg Pred Value : 0.8137
## Prevalence : 0.3701
## Detection Rate : 0.2468
## Detection Prevalence : 0.3377
## Balanced Accuracy : 0.7612
##
## 'Positive' Class : Positive
##
Model performance is fairly good, with accuracy reaching 78.57%. In this case the main metric of interest is sensitivity, which is 66.67%, because we want to reduce the number of false negatives. A false negative occurs when a patient is predicted to be negative for diabetes but actually has the disease.
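The reported metrics can be recomputed directly from the four cells of the confusion matrix above. This is a minimal sketch in base R; the variable names are illustrative. Note that sensitivity depends on the false negatives (19 here), not on the false positives.

```r
# Counts from the confusion matrix above (threshold 0.5),
# with "Positive" as the positive class
tp <- 38  # predicted Positive, actually Positive
fn <- 19  # predicted Negative, actually Positive
fp <- 14  # predicted Positive, actually Negative
tn <- 83  # predicted Negative, actually Negative

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # share of actual positives that are detected
specificity <- tn / (tn + fp)  # share of actual negatives that are detected
precision   <- tp / (tp + fp)  # "Pos Pred Value" in the caret output

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```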
diab_data_test$label <- as.factor(ifelse(diab_data_test$pred.diab > 0.45, "Positive", "Negative"))
confusionMatrix(diab_data_test$label, reference = diab_data_test$Outcome, positive = "Positive")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 81 15
## Positive 16 42
##
## Accuracy : 0.7987
## 95% CI : (0.7266, 0.8589)
## No Information Rate : 0.6299
## P-Value [Acc > NIR] : 0.000004461
##
## Kappa : 0.5698
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7368
## Specificity : 0.8351
## Pos Pred Value : 0.7241
## Neg Pred Value : 0.8438
## Prevalence : 0.3701
## Detection Rate : 0.2727
## Detection Prevalence : 0.3766
## Balanced Accuracy : 0.7859
##
## 'Positive' Class : Positive
##
Lowering the threshold from 0.5 to 0.45 changed the metrics as follows:
1. Accuracy: 78.57% -> 79.87%
2. Sensitivity: 66.67% -> 73.68%
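The effect of the cutoff can be explored more systematically by evaluating several thresholds at once. The sketch below uses made-up probabilities and labels (`prob` and `actual` are hypothetical, not taken from the test set) and a helper `metrics_at()` introduced only for illustration.

```r
# Hypothetical predicted probabilities and true labels (not from the real data)
prob   <- c(0.92, 0.81, 0.48, 0.46, 0.30, 0.70, 0.47, 0.20, 0.10, 0.05)
actual <- c("Positive", "Positive", "Positive", "Positive", "Positive",
            "Negative", "Negative", "Negative", "Negative", "Negative")

# Sensitivity and specificity for the "Positive" class at a given cutoff
metrics_at <- function(threshold) {
  pred <- ifelse(prob > threshold, "Positive", "Negative")
  c(threshold   = threshold,
    sensitivity = sum(pred == "Positive" & actual == "Positive") / sum(actual == "Positive"),
    specificity = sum(pred == "Negative" & actual == "Negative") / sum(actual == "Negative"))
}

# Compare a few cutoffs side by side
results <- t(sapply(c(0.50, 0.45, 0.40), metrics_at))
print(results)
```

Lowering the cutoff can only keep sensitivity the same or raise it, and it usually costs some specificity, so the choice depends on how costly a missed diagnosis is relative to a false alarm.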
## Rows: 768
## Columns: 9
## $ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1,…
## $ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 12…
## $ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 7…
## $ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0,…
## $ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0,…
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.…
## $ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, …
## $ Outcome <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,…
diab_knn <- diab_knn %>%
  mutate(Outcome = as.factor(ifelse(Outcome == 0, "Negative", "Positive")))
glimpse(diab_knn)
## Rows: 768
## Columns: 9
## $ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1,…
## $ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 12…
## $ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 7…
## $ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0,…
## $ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0,…
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.…
## $ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, …
## $ Outcome <fct> Positive, Negative, Positive, Negative, Posi…
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
We use z-score scaling to put the predictors on a common scale. Only the numeric columns are rescaled. First, we scale the training data.
# Scale the training data
train_x <- data_train %>%
  select(-Outcome) %>% # drop the target variable
  scale() # scale all predictors
# Store the target variable
train_y <- data_train$Outcome
To scale the test data, we MUST use information from the training data, namely the mean and standard deviation of each variable. This prevents data leakage: if the test data were scaled with its own statistics, information from the test set would leak into the preprocessing. The test data is used only for evaluation and must otherwise go through the same procedure as the training data.
The mean and standard deviation of each variable in the training data can be inspected with str(). The attr entries at the bottom of the output have the following meaning:
attr(*, "scaled:center"): the mean of each variable
attr(*, "scaled:scale"): the standard deviation of each variable
## num [1:614, 1:8] -1.1502 0.0412 -0.2566 0.637 -0.8524 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:614] "221" "605" "314" "676" ...
## ..$ : chr [1:8] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness" ...
## - attr(*, "scaled:center")= Named num [1:8] 3.86 120.09 68.97 21.01 79.07 ...
## ..- attr(*, "names")= chr [1:8] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness" ...
## - attr(*, "scaled:scale")= Named num [1:8] 3.36 31.57 19.03 16.13 116.69 ...
## ..- attr(*, "names")= chr [1:8] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness" ...
## Pregnancies Glucose BloodPressure
## 3.8615635 120.0895765 68.9657980
## SkinThickness Insulin BMI
## 21.0130293 79.0651466 32.2135179
## DiabetesPedigreeFunction Age
## 0.4734414 33.1156352
To apply the training data's means and standard deviations when scaling the test data, we pass them as arguments to scale().
# Scale the test data
test_x <- data_test %>%
  select(-Outcome) %>% # drop the target variable
  scale(center = attr(train_x, "scaled:center"), # training means
        scale = attr(train_x, "scaled:scale") # training standard deviations
  ) # scale using information from the training data
# Store the target variable
test_y <- data_test$Outcome
Choosing the value of K as the square root of the number of rows in the training data
## [1] 25
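The value 25 above is consistent with rounding the square root of the 614 training rows reported by str(). The exact expression used is not shown, so the following is only a sketch. The name `k_choose` is taken from the later code, and the odd-number adjustment is an added suggestion rather than something the original is known to do.

```r
# Number of training rows, as reported in the str() output above
n_train <- 614

# Rule of thumb: K is roughly the square root of the training set size
k_choose <- round(sqrt(n_train))

# With two classes, an odd K avoids tied votes (optional adjustment)
if (k_choose %% 2 == 0) k_choose <- k_choose + 1

print(k_choose)
```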
With K-NN, we can choose either to store the information needed to compute distances in a K-NN model object or to produce predictions directly.
Option 1: Storing the Distance Information
model_knn <- knn3(x = train_x, # training predictors
                  y = train_y, # training target variable
                  k = k_choose # number of neighbors (K)
)
# Prediction
pred_knn <- predict(model_knn, newdata = test_x, type = "class")
Option 2: Predicting Directly
library(caret)
pred_knn <- knn3Train(train = train_x, # training predictors
                      cl = train_y, # training target
                      test = test_x, # test predictors
                      k = k_choose # number of neighbors (K)
) %>%
  as.factor()
head(pred_knn)
## [1] Negative Positive Positive Positive Positive Negative
## Levels: Negative Positive
We measure the model's performance with a confusion matrix.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 90 23
## Positive 7 34
##
## Accuracy : 0.8052
## 95% CI : (0.7337, 0.8645)
## No Information Rate : 0.6299
## P-Value [Acc > NIR] : 0.000001847
##
## Kappa : 0.5565
##
## Mcnemar's Test P-Value : 0.00617
##
## Sensitivity : 0.9278
## Specificity : 0.5965
## Pos Pred Value : 0.7965
## Neg Pred Value : 0.8293
## Prevalence : 0.6299
## Detection Rate : 0.5844
## Detection Prevalence : 0.7338
## Balanced Accuracy : 0.7622
##
## 'Positive' Class : Negative
##
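With the K-NN model, accuracy is 80.52%, slightly higher than the 79.87% obtained with logistic regression at the 0.45 threshold. The sensitivity of 92.78% reported above is not directly comparable with the logistic regression value of 73.68%, however, because this confusion matrix treats "Negative" as the positive class. For the "Positive" (diabetic) class, the K-NN model correctly identifies 34 of 57 cases, a sensitivity of 59.65%, which is lower than that of logistic regression.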
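The K-NN confusion matrix above lists "Negative" as the positive class, so its Sensitivity line refers to the Negative class. The per-class figures can be recomputed from the printed counts, as in this minimal base R sketch (`recall_for()` is a helper introduced here for illustration).

```r
# K-NN confusion matrix from the output above (rows = prediction, columns = reference)
cm <- matrix(c(90, 7, 23, 34), nrow = 2,
             dimnames = list(Prediction = c("Negative", "Positive"),
                             Reference  = c("Negative", "Positive")))

# Sensitivity (recall) of a class = correct predictions of that class
# divided by all actual members of that class
recall_for <- function(cm, class) cm[class, class] / sum(cm[, class])

recall_negative <- recall_for(cm, "Negative")  # what caret reported as Sensitivity
recall_positive <- recall_for(cm, "Positive")  # sensitivity for the diabetic class

print(round(c(Negative = recall_negative, Positive = recall_positive), 4))
```

Passing `positive = "Positive"` to `confusionMatrix()`, as was done for the logistic regression model, would make caret report the second figure as Sensitivity directly.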
Characteristics of K-NN
Advantages:
Disadvantages:
How to improve KNN result: