Perbandingan SVM dan XGboost dalam Studi Kasus Pima Indians Diabetes

Nazuwa Aulia

2024-06-06

Library

library(ISLR)
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(xgboost)
library(UBL)

## Loading required package: MBA

## Loading required package: gstat

## Loading required package: automap

## Loading required package: sp

## Loading required package: randomForest

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(ggplot2)
library(e1071)

Data

Data yang digunakan pada penelitian ini adalah data kondisi kesehatan 768 wanita suku Indian Pima usia 21 tahun ke atas. Data ini merupakan data sekunder yang diperoleh dari laman Kaggle. Rincian peubah respon dan peubah penjelas yang digunakan tertera tabel dibawah ini :

Peubah	Keterangan	Satuan	Sumber Pustaka
Y	Diabetes melitus	0 atau 1
X1	Jumlah kehamilan		Yanti dan Surtiningsih (2016)
X2	Kadar glukosa	mg/dL	Perkeni (2021)
X3	Tekanan darah	mm Hg	Ekasari et al. (2021)
X4	Ketebalan trisep	mm	Collier et al. (1989)
X5	Insulin	μU/mL	Wilcox (2005)
X6	IMT	kg/m2	Adnan et al. (2013)
X7	Diabetes Pedigree Function		Joshi et al. (2021)
X8	Usia	Tahun	Yan et al. (2023)

dt.tpm <- read.csv("diabetes.csv")
dt.tpm$outcome <- as.factor(dt.tpm$outcome)

Eksplorasi Data dan Pre-processing

head(dt.tpm)

##   pregnancies glucose bloodpressure skinthickness insulin  bmi
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   diabetespedigreefunction age outcome
## 1                    0.627  50    true
## 2                    0.351  31   false
## 3                    0.672  32    true
## 4                    0.167  21   false
## 5                    2.288  33    true
## 6                    0.201  30   false

Terdapat peubah yang bernilai 0 padahal tidak mungkin peubah tersebut nilainya 0

skimr::skim(dt.tpm)

Data summary
Name	dt.tpm
Number of rows	768
Number of columns	9
_______________________
Column type frequency:
factor	1
numeric	8
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
outcome	0	1	FALSE	2	fal: 500, tru: 268

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
pregnancies	1	3.85	3.37	0.00	1.00	3.00	6.00	17.00	▇▃▂▁▁
glucose	1	120.89	31.97	0.00	99.00	117.00	140.25	199.00	▁▁▇▆▂
bloodpressure	1	69.11	19.36	0.00	62.00	72.00	80.00	122.00	▁▁▇▇▁
skinthickness	1	20.54	15.95	0.00	0.00	23.00	32.00	99.00	▇▇▂▁▁
insulin	1	79.80	115.24	0.00	0.00	30.50	127.25	846.00	▇▁▁▁▁
bmi	1	31.99	7.88	0.00	27.30	32.00	36.60	67.10	▁▃▇▂▁
diabetespedigreefunction	1	0.47	0.33	0.08	0.24	0.37	0.63	2.42	▇▃▁▁▁
age	1	33.24	11.76	21.00	24.00	29.00	41.00	81.00	▇▃▁▁▁

pada data tidak terlihat adanya missing value namun pada eksplorasi awal terdapat beberapa peubah bernilai 0 padahal seharusnya tidak mungkin peubah tersebut bernilai 0 sehingga hal ini mengindikasikan adanya missing value sehingga nilai 0 akan diganti dengan NaN

cols_to_replace <- c('glucose', 'bloodpressure', 'skinthickness', 'insulin', 'bmi')
dt.tpm[cols_to_replace] <- lapply(dt.tpm[cols_to_replace], function(x) ifelse(x == 0, NA, x))

head(dt.tpm)

##   pregnancies glucose bloodpressure skinthickness insulin  bmi
## 1           6     148            72            35      NA 33.6
## 2           1      85            66            29      NA 26.6
## 3           8     183            64            NA      NA 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74            NA      NA 25.6
##   diabetespedigreefunction age outcome
## 1                    0.627  50    true
## 2                    0.351  31   false
## 3                    0.672  32    true
## 4                    0.167  21   false
## 5                    2.288  33    true
## 6                    0.201  30   false

Pada data yang hilang akan dilakukan imputasi menggunakan nilai median dari peubah tersebut

dt.tpm[cols_to_replace] <- lapply(dt.tpm[cols_to_replace], function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  return(x)
})

head(dt.tpm)

##   pregnancies glucose bloodpressure skinthickness insulin  bmi
## 1           6     148            72            35     125 33.6
## 2           1      85            66            29     125 26.6
## 3           8     183            64            29     125 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74            29     125 25.6
##   diabetespedigreefunction age outcome
## 1                    0.627  50    true
## 2                    0.351  31   false
## 3                    0.672  32    true
## 4                    0.167  21   false
## 5                    2.288  33    true
## 6                    0.201  30   false

Satuan dari setiap peubah berbeda - beda, oleh karena itu Selanjutnya akan dilakukan proses standarisasi data

# Memilih kolom-kolom numerik yang akan distandarisasi
numeric_columns <- sapply(dt.tpm, is.numeric)
selected_data <- dt.tpm[, numeric_columns]

# Standarisasi data
scaled_data <- scale(selected_data)

# Gantikan kolom numerik pada data awal dengan data yang sudah distandarisasi
dt.tpm[, numeric_columns] <- scaled_data

head(dt.tpm)

##   pregnancies    glucose bloodpressure skinthickness    insulin        bmi
## 1   0.6395305  0.8654807    -0.0319691    0.67020577 -0.1814230  0.1665109
## 2  -0.8443348 -1.2042810    -0.5279745   -0.01229328 -0.1814230 -0.8516448
## 3   1.2330766  2.0153484    -0.6933097   -0.01229328 -0.1814230 -1.3316324
## 4  -0.8443348 -1.0728676    -0.5279745   -0.69479233 -0.5402897 -0.6334686
## 5  -1.1411079  0.5040938    -2.6773314    0.67020577  0.3163598  1.5482935
## 6   0.3427574 -0.1858268     0.1333660   -0.01229328 -0.1814230 -0.9970956
##   diabetespedigreefunction         age outcome
## 1                0.4681869  1.42506672    true
## 2               -0.3648230 -0.19054773   false
## 3                0.6040037 -0.10551539    true
## 4               -0.9201630 -1.04087112   false
## 5                5.4813370 -0.02048305    true
## 6               -0.8175458 -0.27558007   false

table_diabet <- table(dt.tpm$outcome)
df_diabet <- as.data.frame(table_diabet)

bar.ploty <- ggplot(df_diabet, aes(x = Var1, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = Freq), vjust = -0.5, size = 3) +  
  theme_minimal() +
  labs(title = "Bar Plot of Outcome Diabetes", x = "Outcome Diabetes", y = "Total") +
  theme(axis.text.x = element_text(angle = 360, hjust = 1))

bar.ploty

Terlihat bahwa class data tidak seimbang sehingga nanti akan dilakukan penanganan data tidak seimbang pada data latih (train)

Splitting Data

set.seed(123)
train_idx <- createDataPartition(dt.tpm$outcome, p = 0.8, list=FALSE)
train.data <- dt.tpm[train_idx,]
test.data <- dt.tpm[-train_idx,]

Penanganan Imbalance Data

Imbalance data akan ditangani dengan Synthetic Minority Over-sampling Technique (SMOTE). SMOTE adalah metode untuk mengatasi masalah data tidak seimbang dalam machine learning. Ketika satu kelas memiliki jauh lebih sedikit contoh dibandingkan kelas lain, model cenderung mengabaikan kelas minoritas. SMOTE meningkatkan jumlah contoh kelas minoritas dengan membuat data sintetis. Teknik ini bekerja dengan mengidentifikasi contoh dari kelas minoritas, memilih tetangga terdekat, dan membuat contoh baru dengan menginterpolasi antara contoh asli dan tetangganya. Dengan demikian, SMOTE membantu menciptakan dataset yang lebih seimbang dan meningkatkan kemampuan model untuk mengenali kelas minoritas.

smote_data <- SmoteClassif(form = outcome ~ ., 
                           dat = train.data, 
                           C.perc = "balance",  
                           dist = "HVDM")

# Melihat distribusi kelas setelah SMOTE
table(smote_data$outcome)

## 
## false  true 
##   308   307

SVM

Support vector machine (SVM) merupakan salah satu teknik pembelajaran mesin yang digunakan untuk analisis prediksi, baik dalam bentuk klasifikasi maupun regresi. SVM pada dasarnya berprinsip linear classifier, yaitu teknik klasifikasi untuk data yang dapat dipisahkan secara linear (Santosa 2007). Namun, kini SVM telah dikembangkan sehingga juga mampu menyelesaikan permasalahan klasifikasi untuk data yang tidak linear dengan menerapkan konsep kernel pada ruang kerja berdimensi tinggi. SVM dilakukan dengan mencari hyperplane (bidang pembatas) yang memaksimalkan jarak antar kelas (Octaviani et al. 2014). Penggunaan kernel bertujuan untuk mentransformasikan data ke ruang berdimensi tinggi,dengan menjadikan data non linear terpisah secara linear.

SVM Dengan Kernel Linear

SVM dengan kernel linear adalah versi sederhana dari SVM yang menggunakan fungsi linear untuk memisahkan kelas-kelas dalam data. Kernel linear berarti bahwa keputusan yang diambil oleh SVM adalah berdasarkan garis lurus (untuk data dua dimensi), atau hyperplane (untuk data dengan lebih banyak dimensi), yang memisahkan kelas-kelas dalam ruang fitur.

# 10 fold cross validation
ctr <- trainControl(method='repeatedcv',
                    number=10,
                    repeats=3)

# Recall as C increases, the margin tends to get wider
grid <- data.frame(C=seq(0.01,10,0.5))

svm.linear <- train(outcome ~., smote_data,
                 method='svmLinear',
                 preProc=c('center','scale'),
                 trControl=ctr,
                 tuneGrid=grid)
svm.linear

## Support Vector Machines with Linear Kernel 
## 
## 615 samples
##   8 predictor
##   2 classes: 'false', 'true' 
## 
## Pre-processing: centered (8), scaled (8) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 554, 554, 553, 554, 554, 554, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.01  0.7366878  0.4733831
##   0.51  0.7420642  0.4841768
##   1.01  0.7414910  0.4830169
##   1.51  0.7409354  0.4819058
##   2.01  0.7403978  0.4808305
##   2.51  0.7398602  0.4797552
##   3.01  0.7398602  0.4797552
##   3.51  0.7387761  0.4775944
##   4.01  0.7387761  0.4775944
##   4.51  0.7393225  0.4786800
##   5.01  0.7387761  0.4775944
##   5.51  0.7393225  0.4786800
##   6.01  0.7387761  0.4775944
##   6.51  0.7393225  0.4786800
##   7.01  0.7393225  0.4786800
##   7.51  0.7393225  0.4786800
##   8.01  0.7387761  0.4775944
##   8.51  0.7387761  0.4775944
##   9.01  0.7387761  0.4775944
##   9.51  0.7393225  0.4786800
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.51.

svm.linear$bestTune

##      C
## 2 0.51

ggplot(svm.linear)

# Training error rate
confusionMatrix(predict(svm.linear, smote_data), smote_data$outcome)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction false true
##      false   242   83
##      true     66  224
##                                           
##                Accuracy : 0.7577          
##                  95% CI : (0.7219, 0.7911)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5154          
##                                           
##  Mcnemar's Test P-Value : 0.1899          
##                                           
##             Sensitivity : 0.7857          
##             Specificity : 0.7296          
##          Pos Pred Value : 0.7446          
##          Neg Pred Value : 0.7724          
##              Prevalence : 0.5008          
##          Detection Rate : 0.3935          
##    Detection Prevalence : 0.5285          
##       Balanced Accuracy : 0.7577          
##                                           
##        'Positive' Class : false           
##

# Testing error rate
yhat_1 <- predict(svm.linear, test.data)
svm.lin.acc <- confusionMatrix(yhat_1, test.data$outcome)
svm.lin.acc

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction false true
##      false    83   13
##      true     17   40
##                                           
##                Accuracy : 0.8039          
##                  95% CI : (0.7321, 0.8636)
##     No Information Rate : 0.6536          
##     P-Value [Acc > NIR] : 3.3e-05         
##                                           
##                   Kappa : 0.5745          
##                                           
##  Mcnemar's Test P-Value : 0.5839          
##                                           
##             Sensitivity : 0.8300          
##             Specificity : 0.7547          
##          Pos Pred Value : 0.8646          
##          Neg Pred Value : 0.7018          
##              Prevalence : 0.6536          
##          Detection Rate : 0.5425          
##    Detection Prevalence : 0.6275          
##       Balanced Accuracy : 0.7924          
##                                           
##        'Positive' Class : false           
##

Dengan menggunakan metode SVM dengan kernel linear diperoleh nilai akurasi sebesar 0.8039

SVM dengan Kernel Polinomial

SVM dengan kernel polinomial adalah variasi dari algoritma SVM yang menggunakan fungsi kernel polinomial untuk memetakan data ke ruang fitur yang lebih tinggi. Kernel polinomial memungkinkan SVM untuk menangani masalah klasifikasi yang tidak dapat dipisahkan secara linear dengan menambahkan fleksibilitas dalam pemisahan data. Kernel polinomial adalah fungsi kernel yang mendefinisikan kesamaan antara dua contoh data dalam ruang fitur yang lebih tinggi yang dibentuk oleh kombinasi polinomial dari fitur asli.

set.seed(123)
svm.poly <- train(outcome ~., smote_data,
                 method='svmPoly',
                 trControl=ctr,
                 tuneLength=4)
                 
svm.poly

## Support Vector Machines with Polynomial Kernel 
## 
## 615 samples
##   8 predictor
##   2 classes: 'false', 'true' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 553, 553, 553, 553, 554, 555, ... 
## Resampling results across tuning parameters:
## 
##   degree  scale  C     Accuracy   Kappa     
##   1       0.001  0.25  0.5423497  0.08582243
##   1       0.001  0.50  0.6446674  0.28968987
##   1       0.001  1.00  0.7042088  0.40824621
##   1       0.001  2.00  0.7128362  0.42555704
##   1       0.010  0.25  0.7171729  0.43424853
##   1       0.010  0.50  0.7263840  0.45264080
##   1       0.010  1.00  0.7306936  0.46130520
##   1       0.010  2.00  0.7382916  0.47652223
##   1       0.100  0.25  0.7366875  0.47331843
##   1       0.100  0.50  0.7371981  0.47435583
##   1       0.100  1.00  0.7388207  0.47759423
##   1       0.100  2.00  0.7393760  0.47869015
##   1       1.000  0.25  0.7398869  0.47967908
##   1       1.000  0.50  0.7393581  0.47865067
##   1       1.000  1.00  0.7382828  0.47650014
##   1       1.000  2.00  0.7393581  0.47865067
##   2       0.001  0.25  0.6446674  0.28968987
##   2       0.001  0.50  0.7042088  0.40824621
##   2       0.001  1.00  0.7128362  0.42555704
##   2       0.001  2.00  0.7246997  0.44927168
##   2       0.010  0.25  0.7258373  0.45155641
##   2       0.010  0.50  0.7334085  0.46672758
##   2       0.010  1.00  0.7377540  0.47545395
##   2       0.010  2.00  0.7393760  0.47868901
##   2       0.100  0.25  0.7366893  0.47339018
##   2       0.100  0.50  0.7438119  0.48769928
##   2       0.100  1.00  0.7508367  0.50171064
##   2       0.100  2.00  0.7486859  0.49740713
##   2       1.000  0.25  0.7557104  0.51136329
##   2       1.000  0.50  0.7567945  0.51354774
##   2       1.000  1.00  0.7584697  0.51689498
##   2       1.000  2.00  0.7589806  0.51792760
##   3       0.001  0.25  0.6982866  0.39652283
##   3       0.001  0.50  0.7145208  0.42889673
##   3       0.001  1.00  0.7225404  0.44495691
##   3       0.001  2.00  0.7301475  0.46016759
##   3       0.010  0.25  0.7333730  0.46661672
##   3       0.010  0.50  0.7382919  0.47650343
##   3       0.010  1.00  0.7464363  0.49280961
##   3       0.010  2.00  0.7388295  0.47761164
##   3       0.100  0.25  0.7600026  0.51995992
##   3       0.100  0.50  0.7708167  0.54155834
##   3       0.100  1.00  0.7620715  0.52410020
##   3       0.100  2.00  0.7713006  0.54253830
##   3       1.000  0.25  0.7382381  0.47635586
##   3       1.000  0.50  0.7284185  0.45669217
##   3       1.000  1.00  0.7186706  0.43724221
##   3       1.000  2.00  0.7126582  0.42520953
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 2.

svm.poly$bestTune

##    degree scale C
## 44      3   0.1 2

plot(svm.poly)

# Training error rate
confusionMatrix(predict(svm.poly, smote_data), smote_data$outcome)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction false true
##      false   260   39
##      true     48  268
##                                           
##                Accuracy : 0.8585          
##                  95% CI : (0.8285, 0.8851)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7171          
##                                           
##  Mcnemar's Test P-Value : 0.3911          
##                                           
##             Sensitivity : 0.8442          
##             Specificity : 0.8730          
##          Pos Pred Value : 0.8696          
##          Neg Pred Value : 0.8481          
##              Prevalence : 0.5008          
##          Detection Rate : 0.4228          
##    Detection Prevalence : 0.4862          
##       Balanced Accuracy : 0.8586          
##                                           
##        'Positive' Class : false           
##

# Testing error rate
yhat_2 <- predict(svm.poly, test.data)
svm.poli.acc <- confusionMatrix(yhat_2, test.data$outcome)
svm.poli.acc

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction false true
##      false    77   16
##      true     23   37
##                                          
##                Accuracy : 0.7451         
##                  95% CI : (0.6684, 0.812)
##     No Information Rate : 0.6536         
##     P-Value [Acc > NIR] : 0.009661       
##                                          
##                   Kappa : 0.454          
##                                          
##  Mcnemar's Test P-Value : 0.336668       
##                                          
##             Sensitivity : 0.7700         
##             Specificity : 0.6981         
##          Pos Pred Value : 0.8280         
##          Neg Pred Value : 0.6167         
##              Prevalence : 0.6536         
##          Detection Rate : 0.5033         
##    Detection Prevalence : 0.6078         
##       Balanced Accuracy : 0.7341         
##                                          
##        'Positive' Class : false          
##

Dengan menggunakan metode SVM dengan kernel polinomial diperoleh nilai akurasi sebesar 0.7451

SVM dengan Kernel Radial

SVM dengan kernel radial basis function (RBF), juga dikenal sebagai kernel Gaussian, adalah varian dari SVM yang menggunakan fungsi kernel RBF untuk memetakan data ke ruang fitur yang lebih tinggi. Kernel RBF sangat populer karena kemampuannya untuk menangani dataset yang tidak dapat dipisahkan secara linear dengan lebih baik. Kernel RBF adalah fungsi yang mengukur kesamaan antara dua contoh data dalam ruang fitur yang lebih tinggi, menggunakan jarak Euclidean antara dua vektor.

set.seed(123)

svm.radial <- train(outcome ~., smote_data,
                 method='svmRadial',
                 trControl=ctr,
                 tuneLength=10)
svm.radial

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 615 samples
##   8 predictor
##   2 classes: 'false', 'true' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 553, 553, 553, 553, 554, 555, ... 
## Resampling results across tuning parameters:
## 
##   C       Accuracy   Kappa    
##     0.25  0.7587667  0.5177016
##     0.50  0.7707809  0.5416386
##     1.00  0.7653605  0.5307793
##     2.00  0.7794985  0.5589329
##     4.00  0.7761840  0.5522352
##     8.00  0.7745805  0.5490668
##    16.00  0.7626194  0.5251263
##    32.00  0.7528888  0.5056945
##    64.00  0.7419660  0.4838423
##   128.00  0.7500664  0.5000264
## 
## Tuning parameter 'sigma' was held constant at a value of 0.1282285
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1282285 and C = 2.

svm.radial$bestTune

##       sigma C
## 4 0.1282285 2

plot(svm.radial)

# Training error rate
confusionMatrix(predict(svm.radial, smote_data), smote_data$outcome)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction false true
##      false   265   40
##      true     43  267
##                                           
##                Accuracy : 0.865           
##                  95% CI : (0.8355, 0.8911)
##     No Information Rate : 0.5008          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7301          
##                                           
##  Mcnemar's Test P-Value : 0.8262          
##                                           
##             Sensitivity : 0.8604          
##             Specificity : 0.8697          
##          Pos Pred Value : 0.8689          
##          Neg Pred Value : 0.8613          
##              Prevalence : 0.5008          
##          Detection Rate : 0.4309          
##    Detection Prevalence : 0.4959          
##       Balanced Accuracy : 0.8650          
##                                           
##        'Positive' Class : false           
##

# Testing error rate
yhat_3 <- predict(svm.radial, test.data)
svm.rad.acc <- confusionMatrix(yhat_3, test.data$outcome)
svm.rad.acc

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction false true
##      false    80   13
##      true     20   40
##                                           
##                Accuracy : 0.7843          
##                  95% CI : (0.7106, 0.8466)
##     No Information Rate : 0.6536          
##     P-Value [Acc > NIR] : 0.0003018       
##                                           
##                   Kappa : 0.538           
##                                           
##  Mcnemar's Test P-Value : 0.2962699       
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 0.7547          
##          Pos Pred Value : 0.8602          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.6536          
##          Detection Rate : 0.5229          
##    Detection Prevalence : 0.6078          
##       Balanced Accuracy : 0.7774          
##                                           
##        'Positive' Class : false           
##

Dengan menggunakan metode SVM dengan kernel radial diperoleh nilai akurasi sebesar 0.7843

XGBoost

XGBoost adalah algoritma yang berbasis pada metode boosting, di mana beberapa model prediksi (biasanya pohon keputusan) digabungkan untuk membentuk model yang lebih kuat. Setiap model baru mencoba untuk memperbaiki kesalahan dari model sebelumnya dengan menambahkan bobot lebih pada kesalahan prediksi.

set.seed(123)
# Definisikan kontrol untuk 10-fold cross-validation yang diulang 3 kali
ctr <- trainControl(method = 'repeatedcv',
                    number = 10,
                    repeats = 3,
                    verboseIter = TRUE)

# Definisikan grid parameter untuk tuning
grid <- expand.grid(
  nrounds = 500,        # Jumlah boosting rounds
  max_depth = 10,       # Kedalaman maksimum setiap pohon
  eta = 0.1,            # Learning rate
  gamma = 5,            # Minimum loss reduction
  colsample_bytree = 0.7,  # Proporsi fitur yang diambil secara acak
  min_child_weight = 1,    # Minimum sum of instance weight (hessian) needed in a child
  subsample = 0.7          # Proporsi data yang diambil secara acak untuk setiap pohon
)

set.seed(123)
# Melatih model menggunakan XGBoost
xgb_model <- train(outcome ~ ., data = smote_data,
                   method = 'xgbTree',
                   trControl = ctr,
                   tuneGrid = grid,
                   preProcess = c('center', 'scale'))

## + Fold01.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold01.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold02.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold02.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold03.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold03.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold04.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold04.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold05.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold05.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold06.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold06.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold07.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold07.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold08.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold08.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold09.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold09.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold10.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold10.Rep1: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold01.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold01.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold02.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold02.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold03.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold03.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold04.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold04.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold05.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold05.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold06.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold06.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold07.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold07.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold08.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold08.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold09.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold09.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold10.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold10.Rep2: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold01.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold01.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold02.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold02.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold03.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold03.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold04.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold04.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold05.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold05.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold06.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold06.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold07.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold07.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold08.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold08.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold09.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold09.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## + Fold10.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## - Fold10.Rep3: nrounds=500, max_depth=10, eta=0.1, gamma=5, colsample_bytree=0.7, min_child_weight=1, subsample=0.7 
## Aggregating results
## Fitting final model on full training set

xgb_model

## eXtreme Gradient Boosting 
## 
## 615 samples
##   8 predictor
##   2 classes: 'false', 'true' 
## 
## Pre-processing: centered (8), scaled (8) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 553, 553, 553, 553, 554, 555, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7588016  0.5176012
## 
## Tuning parameter 'nrounds' was held constant at a value of 500
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 1
## 
## Tuning parameter 'subsample' was held constant at a value of 0.7

yhat_4 <- predict(xgb_model, newdata = test.data)

xgb.acc <- confusionMatrix(yhat_4, test.data$outcome)
xgb.acc

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction false true
##      false    85    8
##      true     15   45
##                                          
##                Accuracy : 0.8497         
##                  95% CI : (0.783, 0.9023)
##     No Information Rate : 0.6536         
##     P-Value [Acc > NIR] : 4.462e-08      
##                                          
##                   Kappa : 0.678          
##                                          
##  Mcnemar's Test P-Value : 0.2109         
##                                          
##             Sensitivity : 0.8500         
##             Specificity : 0.8491         
##          Pos Pred Value : 0.9140         
##          Neg Pred Value : 0.7500         
##              Prevalence : 0.6536         
##          Detection Rate : 0.5556         
##    Detection Prevalence : 0.6078         
##       Balanced Accuracy : 0.8495         
##                                          
##        'Positive' Class : false          
##

Dengan menggunakan metode XGBoost diperoleh nilai akurasi sebesar 0.8497

Evaluasi Model

Pada evaluasi model, metrik evaluasi yang digunakan adalah accuracy, presisi, recall dan F1 Score.

Akurasi (Accuracy), mengukur sejauh mana model secara keseluruhan dapat melakukan prediksi dengan benar.
Presisi (Precision), mengukur sejauh mana prediksi positif yang dibuat oleh model adalah benar.
Recall (juga dikenal sebagai Sensitivitas), mengukur sejauh mana model berhasil mendeteksi semua contoh positif yang sebenarnya.
F1-Score adalah ukuran gabungan dari presisi dan recall yang mencoba memberikan gambaran keseluruhan kinerja model.

# Fungsi untuk menghitung metrik evaluasi
calculate_metrics <- function(model, test.data) {
  # Prediksi
  predictions <- predict(model, test.data)
  prob_predictions <- predict(model, test.data, type = "prob")
  
  # Confusion Matrix
  conf_matrix <- confusionMatrix(predictions, test.data$outcome)
  
  # Accuracy
  accuracy <- conf_matrix$overall['Accuracy']
  
  # Precision, Recall, F1 Score
  precision <- posPredValue(predictions, test.data$outcome)
  recall <- sensitivity(predictions, test.data$outcome)
  f1 <- F_meas(predictions, test.data$outcome)
  
  return(list(accuracy = accuracy, precision = precision, recall = recall, f1 = f1, conf_matrix = conf_matrix))
}

# Menghitung metrik evaluasi untuk semua model
svm.lin.acc <- calculate_metrics(svm.linear, test.data)

## Warning in method$prob(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class probability calculations failed; returning NAs

svm.poly.acc <- calculate_metrics(svm.poly, test.data)

## Warning in method$prob(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class probability calculations failed; returning NAs

svm.rad.acc <- calculate_metrics(svm.radial, test.data)

## Warning in method$prob(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class probability calculations failed; returning NAs

xgb.acc <- calculate_metrics(xgb_model, test.data)

eval_all <- matrix(
  c(
    svm.lin.acc$accuracy, svm.lin.acc$precision, svm.lin.acc$recall, svm.lin.acc$f1,
    svm.poly.acc$accuracy, svm.poly.acc$precision, svm.poly.acc$recall, svm.poly.acc$f1,
    svm.rad.acc$accuracy, svm.rad.acc$precision, svm.rad.acc$recall, svm.rad.acc$f1,
    xgb.acc$accuracy, xgb.acc$precision, xgb.acc$recall, xgb.acc$f1
  ), 
  nrow = 4, byrow = TRUE
)
colnames(eval_all) <- c("Accuracy", "Precision", "Recall", "F1 Score")
row.names(eval_all) <- c("SVM Linear", "SVM Polynomial", "SVM Radial", "XGBoost")
eval_all <- as.data.frame(eval_all)

print(eval_all)

##                 Accuracy Precision Recall  F1 Score
## SVM Linear     0.8039216 0.8645833   0.83 0.8469388
## SVM Polynomial 0.7450980 0.8279570   0.77 0.7979275
## SVM Radial     0.7843137 0.8602151   0.80 0.8290155
## XGBoost        0.8496732 0.9139785   0.85 0.8808290

Berdasarkan output diatas dapat dilihat bahwa model XGBoost merupakan model terbaik dibandingkan dengan SVM dengan accuracy sebesar 84.96%, presisi sebesar 91.39%, recall sebesar 85% dan F1 Score sebesar 88%.

Importance Variable

plot(varImp(xgb_model), 
     main = "XGBoost Variable Importance" )

Berdasarkan plot variable importance diatas dapat dilihat bahwa 3 peubah yang paling berpengaruh terhadap diabetes adalah kadar glukosa, indeks massa tubuh dan usia.