1 Diabetes Detection

Diabetes is a chronic disease characterized by high levels of sugar (glucose) in the blood. Glucose is the main source of energy for the body's cells.

Glucose that accumulates in the blood because it is not absorbed properly by the body's cells can cause various organ disorders. If diabetes is not controlled well, complications can arise that endanger the patient's life.

It is therefore interesting to study how the predictor variables relate to the target variable (diabetes positive or not) using a machine learning classification technique, Logistic Regression.

2 Libraries and Setup

library(tidyverse) # data wrangling and visualization
library(gtools)
library(caret)     # confusionMatrix(), knn3(), knn3Train()
library(rsample)
library(DMwR)      # SMOTE() for handling class imbalance
theme_set(theme_minimal() +
            theme(legend.position = "top"))

options(scipen = 999) # avoid scientific notation in printed output

3 Read Data

diabetes <- read.csv("data_input/diabetes.csv")
head(diabetes)

Variable (Column) Descriptions:
1. Pregnancies: Number of times pregnant
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. BloodPressure: Diastolic blood pressure (mm Hg)
4. SkinThickness: Triceps skin fold thickness (mm)
5. Insulin: 2-Hour serum insulin (mu U/ml)
6. BMI: Body mass index (weight in kg/(height in m)^2)
7. DiabetesPedigreeFunction: Diabetes pedigree function (a score of diabetes likelihood based on family history)
8. Age: Age (years)
9. Outcome: Class variable (0 or 1)

Missing Attribute Values: Yes
Class Distribution: (class value 1 is interpreted as “tested positive for diabetes”)

4 Data Wrangling

str(diabetes)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...
diabetes <- diabetes %>% 
  mutate(Outcome = ifelse(Outcome == 0, "Negative", "Positive") %>% as.factor())
str(diabetes)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : Factor w/ 2 levels "Negative","Positive": 2 1 2 1 2 1 2 1 2 2 ...

5 Exploratory Data Analysis (EDA)

5.1 Check Target Class Proportions

prop.table(table(diabetes$Outcome))
## 
##  Negative  Positive 
## 0.6510417 0.3489583

5.2 Check Missing Values

colSums(is.na(diabetes))
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

There are no explicit missing values (NA), so we can proceed. Note, however, that several columns use 0 as a placeholder where a measurement is physiologically impossible (e.g. Glucose, BloodPressure, BMI), which is what the dataset description above means by "Missing Attribute Values: Yes".
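
As a quick check (a sketch, not part of the original analysis), we can count those zero placeholders per column:

colSums(diabetes[, c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")] == 0)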

summary(diabetes)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##      Outcome   
##  Negative:500  
##  Positive:268  
##                
##                
##                
## 

5.3 Cross Validation (Train-Test Split)

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
row_data <- nrow(diabetes)
# randomly sample 80% of the rows for the training set
index <- sample(row_data, row_data*0.8)

diab_data_train <- diabetes[ index, ]
diab_data_test <- diabetes[ -index, ]

5.4 Check Class Proportions in the Training Data

prop.table(table(diab_data_train$Outcome))
## 
##  Negative  Positive 
## 0.6563518 0.3436482

5.5 Handling Class Imbalance with SMOTE

The training data is imbalanced (about 66% Negative vs 34% Positive). SMOTE (Synthetic Minority Over-sampling Technique) balances the classes by generating synthetic minority-class observations from existing ones and down-sampling the majority class.

diab_train <- SMOTE(Outcome ~ ., data = diab_data_train)
prop.table(table(diab_train$Outcome))
## 
##  Negative  Positive 
## 0.5714286 0.4285714
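
The call above relies on the DMwR::SMOTE() defaults. Written out explicitly they look as follows (a sketch; re-running it would generate a new random set of synthetic cases):

SMOTE(Outcome ~ ., data = diab_data_train,
      perc.over = 200,   # generate 200% additional synthetic "Positive" cases
      perc.under = 200,  # sample 200% of that count from the "Negative" class
      k = 5)             # nearest neighbours used to synthesise new cases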

5.6 Modelling

model_diabetes <- glm(Outcome ~ ., data = diab_train, family = "binomial")
summary(model_diabetes)
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = diab_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5563  -0.7555  -0.3576   0.7645   3.0091  
## 
## Coefficients:
##                            Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)              -8.4346000  0.5333596 -15.814 < 0.0000000000000002 ***
## Pregnancies               0.1158305  0.0242604   4.774          0.000001802 ***
## Glucose                   0.0401253  0.0027567  14.555 < 0.0000000000000002 ***
## BloodPressure            -0.0187336  0.0041620  -4.501          0.000006759 ***
## SkinThickness            -0.0001620  0.0049000  -0.033              0.97362    
## Insulin                  -0.0026606  0.0006292  -4.228          0.000023530 ***
## BMI                       0.0908041  0.0106248   8.546 < 0.0000000000000002 ***
## DiabetesPedigreeFunction  1.1384901  0.2218097   5.133          0.000000286 ***
## Age                       0.0196196  0.0073187   2.681              0.00735 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2017.3  on 1476  degrees of freedom
## Residual deviance: 1448.0  on 1468  degrees of freedom
## AIC: 1466
## 
## Number of Fisher Scoring iterations: 5

5.6.1 Feature Selection

Feature selection with stepwise regression (AIC-based), trying the backward, forward, and both directions.

step(model_diabetes,direction = "backward", trace = T)
## Start:  AIC=1465.99
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## - SkinThickness             1   1448.0 1464.0
## <none>                          1448.0 1466.0
## - Age                       1   1455.2 1471.2
## - Insulin                   1   1465.8 1481.8
## - BloodPressure             1   1470.1 1486.1
## - Pregnancies               1   1471.3 1487.3
## - DiabetesPedigreeFunction  1   1475.2 1491.2
## - BMI                       1   1534.9 1550.9
## - Glucose                   1   1732.5 1748.5
## 
## Step:  AIC=1463.99
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI + 
##     DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## <none>                          1448.0 1464.0
## - Age                       1   1455.4 1469.4
## - Insulin                   1   1469.8 1483.8
## - BloodPressure             1   1471.3 1485.3
## - Pregnancies               1   1471.3 1485.3
## - DiabetesPedigreeFunction  1   1475.5 1489.5
## - BMI                       1   1543.3 1557.3
## - Glucose                   1   1738.1 1752.1
## 
## Call:  glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age, family = "binomial", 
##     data = diab_train)
## 
## Coefficients:
##              (Intercept)               Pregnancies                   Glucose  
##                -8.434530                  0.115811                  0.040139  
##            BloodPressure                   Insulin                       BMI  
##                -0.018760                 -0.002669                  0.090708  
## DiabetesPedigreeFunction                       Age  
##                 1.137656                  0.019649  
## 
## Degrees of Freedom: 1476 Total (i.e. Null);  1469 Residual
## Null Deviance:       2017 
## Residual Deviance: 1448  AIC: 1464
step(model_diabetes,direction = "forward", trace = T)
## Start:  AIC=1465.99
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age
## 
## Call:  glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     SkinThickness + Insulin + BMI + DiabetesPedigreeFunction + 
##     Age, family = "binomial", data = diab_train)
## 
## Coefficients:
##              (Intercept)               Pregnancies                   Glucose  
##                -8.434600                  0.115831                  0.040125  
##            BloodPressure             SkinThickness                   Insulin  
##                -0.018734                 -0.000162                 -0.002661  
##                      BMI  DiabetesPedigreeFunction                       Age  
##                 0.090804                  1.138490                  0.019620  
## 
## Degrees of Freedom: 1476 Total (i.e. Null);  1468 Residual
## Null Deviance:       2017 
## Residual Deviance: 1448  AIC: 1466
step(model_diabetes, direction = "both", trace = T)
## Start:  AIC=1465.99
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## - SkinThickness             1   1448.0 1464.0
## <none>                          1448.0 1466.0
## - Age                       1   1455.2 1471.2
## - Insulin                   1   1465.8 1481.8
## - BloodPressure             1   1470.1 1486.1
## - Pregnancies               1   1471.3 1487.3
## - DiabetesPedigreeFunction  1   1475.2 1491.2
## - BMI                       1   1534.9 1550.9
## - Glucose                   1   1732.5 1748.5
## 
## Step:  AIC=1463.99
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI + 
##     DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## <none>                          1448.0 1464.0
## + SkinThickness             1   1448.0 1466.0
## - Age                       1   1455.4 1469.4
## - Insulin                   1   1469.8 1483.8
## - BloodPressure             1   1471.3 1485.3
## - Pregnancies               1   1471.3 1485.3
## - DiabetesPedigreeFunction  1   1475.5 1489.5
## - BMI                       1   1543.3 1557.3
## - Glucose                   1   1738.1 1752.1
## 
## Call:  glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age, family = "binomial", 
##     data = diab_train)
## 
## Coefficients:
##              (Intercept)               Pregnancies                   Glucose  
##                -8.434530                  0.115811                  0.040139  
##            BloodPressure                   Insulin                       BMI  
##                -0.018760                 -0.002669                  0.090708  
## DiabetesPedigreeFunction                       Age  
##                 1.137656                  0.019649  
## 
## Degrees of Freedom: 1476 Total (i.e. Null);  1469 Residual
## Null Deviance:       2017 
## Residual Deviance: 1448  AIC: 1464
# Refit using the formula selected by the backward/both stepwise search (SkinThickness dropped)
model_diabetes <- glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
    Insulin + BMI + DiabetesPedigreeFunction + Age, family = "binomial", 
    data = diab_train)
summary(model_diabetes)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age, family = "binomial", 
##     data = diab_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5548  -0.7549  -0.3580   0.7644   3.0091  
## 
## Coefficients:
##                            Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)              -8.4345304  0.5333530 -15.814 < 0.0000000000000002 ***
## Pregnancies               0.1158110  0.0242537   4.775          0.000001797 ***
## Glucose                   0.0401392  0.0027249  14.731 < 0.0000000000000002 ***
## BloodPressure            -0.0187602  0.0040834  -4.594          0.000004342 ***
## Insulin                  -0.0026694  0.0005701  -4.683          0.000002834 ***
## BMI                       0.0907076  0.0102157   8.879 < 0.0000000000000002 ***
## DiabetesPedigreeFunction  1.1376557  0.2203627   5.163          0.000000243 ***
## Age                       0.0196491  0.0072646   2.705              0.00684 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2017.3  on 1476  degrees of freedom
## Residual deviance: 1448.0  on 1469  degrees of freedom
## AIC: 1464
## 
## Number of Fisher Scoring iterations: 5
exp(coef(model_diabetes)) # odds ratios: multiplicative change in the odds per one-unit increase in each predictor
##              (Intercept)              Pregnancies                  Glucose 
##             0.0002172351             1.1227836163             1.0409556988 
##            BloodPressure                  Insulin                      BMI 
##             0.9814147157             0.9973341720             1.0949488180 
## DiabetesPedigreeFunction                      Age 
##             3.1194467610             1.0198434068
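
Each value above is an odds ratio: for example, every additional unit of Glucose multiplies the odds of a positive diabetes outcome by about 1.04, holding the other predictors constant. As a small worked sketch (the 10-point increase is hypothetical):

exp(coef(model_diabetes)["Glucose"] * 10) # odds multiplier for a 10-point Glucose increase, roughly 1.49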

6 Prediction

We predict on the test set twice: type = "link" returns the linear predictor (log-odds), while type = "response" returns the predicted probability of the "Positive" class.

predict(model_diabetes, newdata = diab_data_test, type = "link")
##           2           3           9          12          13          17 
## -2.72379362  2.14354970  0.92956546  2.80459825  2.01965634 -0.49753453 
##          18          23          25          31          34          41 
## -1.13417569  3.48388024  1.03708195  0.17826858 -3.20365452  1.65369706 
##          43          46          50          56          57          61 
## -1.82546128  4.00689032 -2.59067103 -3.54551873  2.30925267 -4.07273471 
##          65          74          76          84          86          92 
## -0.19216475 -1.22886900 -6.38717713 -2.66656654 -1.24598223 -0.93009160 
##         102         103         107         117         118         122 
## -0.37991574 -2.46648365 -3.95622186 -0.18545380 -1.33300224 -0.61529079 
##         128         129         130         135         139         146 
## -1.15188080 -1.28653280 -1.20377535 -2.59516270 -0.55770896 -4.68397082 
##         150         151         158         170         172         181 
## -2.89829003 -0.47462301 -1.66905115 -1.81920978  0.37624370 -2.91961342 
##         182         188         201         208         217         218 
## -0.66177031 -0.12031093 -1.18073182  1.30923945 -0.66446508 -0.44040620 
##         232         236         237         242         244         247 
##  0.51581862  2.55245633  2.47814501 -1.92463707  0.23511052  0.27408042 
##         253         255         260         264         267         270 
## -3.26988635 -0.78226182  2.79670652  0.51621148  1.95003030  0.97509095 
##         278         282         283         285         288         292 
## -2.28025289  0.49136453 -0.38597585 -1.60317288 -0.11747297 -0.82796567 
##         294         305         309         310         317         325 
##  0.24337313 -0.62461371 -0.21260064 -0.47935689 -3.12174313 -1.29506050 
##         337         346         352         353         361         362 
##  1.25253248  0.46240336 -0.34182546 -2.87628680  1.72579276  1.34978653 
##         369         373         382         391         392         395 
## -3.35051088 -2.15967870 -2.98068280 -1.83314196  2.44453146  1.41435970 
##         405         406         409         413         414         417 
##  1.63074219  0.31452858  3.48172410  0.52022772 -1.04948334 -1.55564460 
##         423         425         430         432         440         450 
## -1.08955372  1.71487206 -2.23956389 -1.99883024 -0.32142292 -1.57255519 
##         453         455         456         457         465         467 
## -2.07421446 -1.01544085  3.08368345  0.50873487 -0.49116862 -3.27587230 
##         475         476         481         482         487         489 
## -1.35969128 -0.61289615  0.36532310 -1.16144892 -0.46958768 -2.14147368 
##         491         499         501         503         507         510 
## -1.74173818  2.04592346 -2.33001011 -3.84506986  1.21764228 -0.16409371 
##         512         525         529         534         537         542 
## -1.92611444 -0.64805777 -1.69131676 -0.20482000 -2.09540352 -0.71316565 
##         548         549         553         558         564         576 
## -0.69723466  0.89253764 -0.71517335 -0.58755570 -1.40141636 -0.50587789 
##         596         600         602         613         615         623 
##  1.19028039 -1.90750747 -0.97019852  1.75809964  1.49663455  4.08955481 
##         628         632         634         638         642         646 
## -0.80078779 -2.17211300 -2.15015885 -1.97514400 -0.21911679 -0.14803260 
##         649         657         664         667         668         677 
##  0.06352999 -2.89113745  1.52859293  0.90135219 -0.69334446  0.80872644 
##         679         697         698         713         716         735 
## -0.30459485  0.60067883 -1.47294900  1.72393593  2.58053589 -1.60332452 
##         737         738         746         755 
## -1.82677510 -1.83922498 -0.70675189  1.53722412
predict(model_diabetes, newdata = diab_data_test, type = "response")
##           2           3           9          12          13          17 
## 0.061583863 0.895064480 0.716987118 0.942923799 0.882845469 0.378120238 
##          18          23          25          31          34          41 
## 0.243391316 0.970225617 0.738286575 0.544449492 0.039028428 0.839390093 
##          43          46          50          56          57          61 
## 0.138779852 0.982135088 0.069741236 0.028044466 0.909640448 0.016745561 
##          65          74          76          84          86          92 
## 0.452106105 0.226379438 0.001680173 0.064975252 0.223396408 0.282906132 
##         102         103         107         117         118         122 
## 0.406147220 0.078241457 0.018775990 0.453768976 0.208663194 0.350853245 
##         128         129         130         135         139         146 
## 0.240145716 0.216440249 0.230804284 0.069450390 0.364077728 0.009157605 
##         150         151         158         170         172         181 
## 0.052238158 0.383522624 0.158550727 0.139528720 0.592966812 0.051192475 
##         182         188         201         208         217         218 
## 0.340342049 0.469958496 0.234920638 0.787385861 0.339737307 0.391644183 
##         232         236         237         242         244         247 
## 0.626169501 0.927738360 0.922595431 0.127345364 0.558508365 0.568094368 
##         253         255         260         264         267         270 
## 0.036618837 0.313832618 0.942497591 0.626261458 0.875449946 0.726133068 
##         278         282         283         285         288         292 
## 0.092771667 0.620427827 0.404686408 0.167538628 0.470665484 0.304075390 
##         294         305         309         310         317         325 
## 0.560544735 0.348732863 0.447049133 0.382403999 0.042219229 0.214997500 
##         337         346         352         353         361         362 
## 0.777737937 0.613584165 0.415366119 0.053338318 0.848873476 0.794094726 
##         369         373         382         391         392         395 
## 0.033878439 0.103430242 0.048306229 0.137864401 0.920160626 0.804452673 
##         405         406         409         413         414         417 
## 0.836271287 0.577990250 0.970163267 0.627201013 0.259324326 0.174272507 
##         423         425         430         432         440         450 
## 0.251702325 0.847467145 0.096253472 0.119325794 0.420329011 0.171852435 
##         453         455         456         457         465         467 
## 0.111628417 0.265916419 0.956214663 0.624509852 0.379618310 0.036408250 
##         475         476         481         482         487         489 
## 0.204290481 0.351398829 0.590328394 0.238404108 0.384713840 0.105130668 
##         491         499         501         503         507         510 
## 0.149092288 0.885535058 0.088667847 0.020937168 0.771648369 0.459068379 
##         512         525         529         534         537         542 
## 0.127181276 0.343427348 0.155602752 0.448973261 0.109544378 0.328899725 
##         548         549         553         558         564         576 
## 0.332425625 0.709413575 0.328456728 0.357195887 0.197591453 0.376160341 
##         596         600         602         613         615         623 
## 0.766791208 0.129261134 0.274840936 0.852971493 0.817071994 0.983529145 
##         628         632         634         638         642         646 
## 0.309857028 0.102282853 0.104316380 0.121837443 0.445438927 0.463059285 
##         649         657         664         667         668         677 
## 0.515877157 0.052593413 0.821800350 0.711227299 0.333289496 0.691838049 
##         679         697         698         713         716         735 
## 0.424434621 0.645811596 0.186494794 0.848635115 0.929598348 0.167517480 
##         737         738         746         755 
## 0.138622899 0.137142979 0.330316950 0.823060831
diab_data_test$pred.diab <- predict(model_diabetes, newdata = diab_data_test, type = "response")
head(diab_data_test)
diab_data_test$label <- as.factor(ifelse(diab_data_test$pred.diab > 0.5,"Positive", "Negative"))
head(diab_data_test)

6.1 Check the Prediction Results

table("predict" = diab_data_test$label, "actual" = diab_data_test$Outcome)
##           actual
## predict    Negative Positive
##   Negative       83       19
##   Positive       14       38
confusionMatrix(diab_data_test$label, reference = diab_data_test$Outcome, positive = "Positive")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Negative Positive
##   Negative       83       19
##   Positive       14       38
##                                           
##                Accuracy : 0.7857          
##                  95% CI : (0.7124, 0.8477)
##     No Information Rate : 0.6299          
##     P-Value [Acc > NIR] : 0.00002319      
##                                           
##                   Kappa : 0.532           
##                                           
##  Mcnemar's Test P-Value : 0.4862          
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.8557          
##          Pos Pred Value : 0.7308          
##          Neg Pred Value : 0.8137          
##              Prevalence : 0.3701          
##          Detection Rate : 0.2468          
##    Detection Prevalence : 0.3377          
##       Balanced Accuracy : 0.7612          
##                                           
##        'Positive' Class : Positive        
## 

7 Model Evaluation

The model performs reasonably well, with an Accuracy of 78.57%. The main metric of interest here is Sensitivity (Recall), which is 66.67%: in a screening setting we want to minimise false negatives, i.e. patients who actually have diabetes but are predicted as negative, because missing a diabetic patient is more costly than a false alarm.
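
For reference, the headline metrics can be recomputed by hand from the confusion matrix above (a sketch):

TP <- 38; FN <- 19; FP <- 14; TN <- 83
c(Accuracy    = (TP + TN) / (TP + TN + FP + FN),  # 0.7857
  Sensitivity = TP / (TP + FN),                   # recall for "Positive": 0.6667
  Specificity = TN / (TN + FP),                   # 0.8557
  Precision   = TP / (TP + FP))                   # Pos Pred Value: 0.7308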

7.1 Improving Model Performance

Lowering the classification threshold from 0.5 to 0.45 makes the model flag more patients as positive, trading a little specificity for higher sensitivity.

diab_data_test$label <-  as.factor(ifelse(diab_data_test$pred.diab > 0.45, "Positive", "Negative"))
confusionMatrix(diab_data_test$label, reference = diab_data_test$Outcome, positive = "Positive")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Negative Positive
##   Negative       81       15
##   Positive       16       42
##                                           
##                Accuracy : 0.7987          
##                  95% CI : (0.7266, 0.8589)
##     No Information Rate : 0.6299          
##     P-Value [Acc > NIR] : 0.000004461     
##                                           
##                   Kappa : 0.5698          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7368          
##             Specificity : 0.8351          
##          Pos Pred Value : 0.7241          
##          Neg Pred Value : 0.8438          
##              Prevalence : 0.3701          
##          Detection Rate : 0.2727          
##    Detection Prevalence : 0.3766          
##       Balanced Accuracy : 0.7859          
##                                           
##        'Positive' Class : Positive        
## 

After lowering the threshold, the metrics changed as follows:

  • Accuracy: 78.57% -> 79.87%
  • Sensitivity: 66.67% -> 73.68%
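
The 0.45 cutoff was chosen by hand; a small sweep over candidate thresholds (a sketch, not in the original analysis) shows the accuracy/sensitivity trade-off more systematically:

thresholds <- c(0.35, 0.40, 0.45, 0.50, 0.55)
acc_sens <- sapply(thresholds, function(cutoff) {
  lab <- factor(ifelse(diab_data_test$pred.diab > cutoff, "Positive", "Negative"),
                levels = c("Negative", "Positive"))
  cm  <- confusionMatrix(lab, diab_data_test$Outcome, positive = "Positive")
  c(Accuracy    = unname(cm$overall["Accuracy"]),
    Sensitivity = unname(cm$byClass["Sensitivity"]))
})
colnames(acc_sens) <- thresholds
acc_sens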

8 K-Nearest Neighbors (K-NN) Model

diab_knn <- read.csv("data_input/diabetes.csv")
glimpse(diab_knn)
## Rows: 768
## Columns: 9
## $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1,…
## $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 12…
## $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 7…
## $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0,…
## $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0,…
## $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.…
## $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, …
## $ Outcome                  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,…
diab_knn <- diab_knn %>% 
  mutate(Outcome = ifelse(Outcome == 0, "Negative", "Positive") %>% as.factor())
glimpse(diab_knn)
## Rows: 768
## Columns: 9
## $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1,…
## $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 12…
## $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 7…
## $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0,…
## $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0,…
## $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.…
## $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, …
## $ Outcome                  <fct> Positive, Negative, Positive, Negative, Posi…

9 Cross Validation (Train-Test Split)

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
row_data <- nrow(diab_knn)
index <- sample(row_data, row_data*0.8)

data_train <- diab_knn[index, ]
data_test <- diab_knn[-index, ]

10 Scaling Data

We use Z-score scaling (standardization) to bring the predictors onto a common scale; only the numeric predictors are scaled. First, we scale the training data.

# Scale the training data
train_x <- data_train %>% 
  select(-Outcome) %>% # drop the target variable
  scale() # scale all numeric predictors


# Store the target variable

train_y <- data_train$Outcome

To scale the test data, we MUST use information from the training data, namely the mean and standard deviation of each variable. This prevents data leakage: if the test set were scaled with its own statistics, information from the test data would flow into the preprocessing step. The test data is used only for evaluation, and otherwise has to go through exactly the same procedure as the training data.

The mean and standard deviation of each variable in the training data can be inspected with str(). At the bottom of the output there are attr entries:

  • attr(*, "scaled:center"): the mean of each variable
  • attr(*, "scaled:scale"): the standard deviation of each variable
str(train_x)
##  num [1:614, 1:8] -1.1502 0.0412 -0.2566 0.637 -0.8524 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:614] "221" "605" "314" "676" ...
##   ..$ : chr [1:8] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness" ...
##  - attr(*, "scaled:center")= Named num [1:8] 3.86 120.09 68.97 21.01 79.07 ...
##   ..- attr(*, "names")= chr [1:8] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness" ...
##  - attr(*, "scaled:scale")= Named num [1:8] 3.36 31.57 19.03 16.13 116.69 ...
##   ..- attr(*, "names")= chr [1:8] "Pregnancies" "Glucose" "BloodPressure" "SkinThickness" ...
# means of the training predictors
attr(train_x, "scaled:center")
##              Pregnancies                  Glucose            BloodPressure 
##                3.8615635              120.0895765               68.9657980 
##            SkinThickness                  Insulin                      BMI 
##               21.0130293               79.0651466               32.2135179 
## DiabetesPedigreeFunction                      Age 
##                0.4734414               33.1156352

To feed the training means and standard deviations into the scaling of the test data, we pass them explicitly to scale():

# Scale the test data
test_x <- data_test %>%  
  select(-Outcome) %>% # drop the target variable
  scale(center = attr(train_x, "scaled:center"), # training means
        scale = attr(train_x, "scaled:scale")    # training standard deviations
        ) # scale using information from the training data

# Store the target variable
test_y <- data_test$Outcome
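
As a sanity check (a sketch, not in the original analysis), manually z-scoring one test column with the training statistics should reproduce the scale() output:

manual_glucose <- (data_test$Glucose - attr(train_x, "scaled:center")["Glucose"]) /
  attr(train_x, "scaled:scale")["Glucose"]
all.equal(unname(manual_glucose), unname(test_x[, "Glucose"]))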

11 Model Fitting and Evaluation

Choose K as the (rounded) square root of the number of training observations.

k_choose <- sqrt(nrow(train_x)) %>% round()
k_choose
## [1] 25
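
A common refinement (not applied in the source, and a no-op here since 25 is already odd) is to force k to be odd so that a two-class vote cannot tie:

if (k_choose %% 2 == 0) k_choose <- k_choose + 1
k_choose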

With K-NN we can either store the training information in a model object and generate predictions from it later, or produce predictions directly.

Option 1: Store the training information in a model object

model_knn <- knn3(x = train_x,  # training predictors
                  y = train_y,  # training target variable
                  k = k_choose  # number of neighbors
                  )

# Predict on the test set
pred_knn <- predict(model_knn, newdata = test_x, type = "class")

11.1 Saving the model as an RDS file:

saveRDS(model_knn, "modelknn.rds")
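
The saved model can later be reloaded and reused for prediction (a sketch; model_knn_loaded is a hypothetical name):

model_knn_loaded <- readRDS("modelknn.rds")
head(predict(model_knn_loaded, newdata = test_x, type = "class"))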

Option 2: Produce predictions directly

library(caret)
pred_knn <- knn3Train(train = train_x, # training predictors
                      cl = train_y,    # training target
                      test = test_x,   # test predictors
                      k = k_choose     # number of neighbors
                      ) %>% 
  as.factor()
head(pred_knn)
## [1] Negative Positive Positive Positive Positive Negative
## Levels: Negative Positive

Measure model performance with a confusion matrix. Note that confusionMatrix() treats the first factor level ("Negative") as the positive class by default, so the Sensitivity reported below refers to the Negative class.

confusionMatrix(pred_knn, test_y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Negative Positive
##   Negative       90       23
##   Positive        7       34
##                                           
##                Accuracy : 0.8052          
##                  95% CI : (0.7337, 0.8645)
##     No Information Rate : 0.6299          
##     P-Value [Acc > NIR] : 0.000001847     
##                                           
##                   Kappa : 0.5565          
##                                           
##  Mcnemar's Test P-Value : 0.00617         
##                                           
##             Sensitivity : 0.9278          
##             Specificity : 0.5965          
##          Pos Pred Value : 0.7965          
##          Neg Pred Value : 0.8293          
##              Prevalence : 0.6299          
##          Detection Rate : 0.5844          
##    Detection Prevalence : 0.7338          
##       Balanced Accuracy : 0.7622          
##                                           
##        'Positive' Class : Negative        
## 

12 Conclusion: Classification Machine Learning (Logistic Regression & K-NN)

With the K-NN model we obtain a slightly higher Accuracy of 80.52%, versus 79.87% for logistic regression (after threshold tuning). The comparison of Sensitivity, however, needs care: the 92.78% reported for K-NN above is the sensitivity for the Negative class, which confusionMatrix() used as the positive class by default. For the Positive class, the class we actually care about, K-NN detects only 34 of 57 positive cases (59.65%), which is lower than the 73.68% sensitivity of the tuned logistic regression model.
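
For a like-for-like comparison (a sketch; output not shown here), the K-NN result can be re-evaluated with "Positive" as the class of interest:

confusionMatrix(pred_knn, test_y, positive = "Positive")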

Characteristics of K-NN

Advantages:

  • Very simple, since it essentially builds no model (lazy learner)
  • Does not have to satisfy any statistical assumptions
  • Reduces the effect of outliers (because the predictors are scaled)
  • Can be applied to multi-class classification (target with more than 2 levels)

Disadvantages:

  • Because no model is built, the results cannot be interpreted
  • In practice it only works well and gives good predictions on data with numeric predictors
  • The more observations there are, the slower and more expensive the computation becomes

How to improve the K-NN result:

  • Try different values of k (see the sketch below)
  • Add more observations
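
As a small illustration of the first point (a sketch under the assumptions of the sections above, not part of the original analysis), test-set accuracy can be compared across a few candidate k values:

k_candidates <- c(5, 15, 25, 35)
sapply(k_candidates, function(k) {
  pred <- knn3Train(train = train_x, test = test_x, cl = train_y, k = k)
  mean(as.character(pred) == as.character(test_y)) # test-set accuracy for this k
}) %>% 
  setNames(paste0("k=", k_candidates))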