Penyakit Lardiovaskular (Cardiovascular diseases) adalah penyebab kematian nomor 1 secara global, mengambil sekitar 17,9 juta jiwa setiap tahun, yang menyumbang 31% dari semua kematian di seluruh dunia. Gagal jantung adalah kejadian umum yang disebabkan oleh Penyakit Lardiovaskular dan kumpulan data ini berisi 12 fitur yang dapat digunakan untuk memprediksi kematian akibat gagal jantung.
Sebagian besar penyakit kardiovaskular dapat dicegah dengan mengatasi faktor risiko perilaku seperti penggunaan tembakau, diet tidak sehat dan obesitas, kurangnya aktivitas fisik, dan penggunaan alkohol yang berbahaya menggunakan strategi di seluruh populasi.Orang dengan penyakit kardiovaskular atau yang berada pada risiko kardiovaskular tinggi (karena adanya satu atau lebih faktor risiko seperti hipertensi, diabetes, hiperlipidemia, atau penyakit yang sudah ada).
Dataset dari Davide Chicco, Giuseppe Jurman:
Age: usia pasien (dalam tahun)
Anemia: Penurunan sel darah merah atau hemoglobin
High blood pressure: Jika pasien menderita hipertensi
Creatinine phosphokinase: Tingkat enzim CPK dalam darah (mcg/L)
Diabetes: Jika pasien menderita diabetes
Ejection fraction: Persentase darah yang meninggalkan jantung pada setiap kontraksi
Sex: Wanita atau pria
Platelets: Trombosit dalam darah (kiloplatelet/mL)
Serum creatinine: Tingkat kreatinin dalam darah (mg/dL)
Serum sodium: Tingkat natrium dalam darah (mEq/L)
Smoking: Jika pasien merokok
Time:: Periode tindak lanjut (dalam hari)
(target) death event: Jika pasien meninggal selama masa tindak lanjut
Install packages (jalankan sekali) install.packages(c( “tidyverse”, # data manipulation & visualization (ggplot2, dplyr, dll) “data.table”, # alternatif manipulasi data “e1071”, # SVM “randomForest”, # Random Forest “ranger”, # Extra Trees / Random Forest cepat “caret”, # model training & evaluation (setara sklearn) “smotefamily”, # SMOTE (imbalance data) “ROSE”, # alternatif SMOTE “pROC”, # ROC curve & AUC “plotly”, # visualisasi interaktif “bestNormalize”, # PowerTransformer “scales”, “gridExtra”, “ggplot2”, update = FALSE ))
# Data Manipulation
library(tidyverse) # dplyr + ggplot2 (setara pandas + matplotlib)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(data.table) # setara numpy untuk operasi array/matrix cepat
## Warning: package 'data.table' was built under R version 4.4.3
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
# Data Visualization
library(ggplot2) # setara matplotlib
# Data Preprocessing
library(smotefamily) # SMOTE
## Warning: package 'smotefamily' was built under R version 4.4.3
# Classifiers
library(e1071) # SVM
## Warning: package 'e1071' was built under R version 4.4.3
library(rpart.plot) # plot_tree
## Warning: package 'rpart.plot' was built under R version 4.4.3
## Loading required package: rpart
## Warning: package 'rpart' was built under R version 4.4.3
library(randomForest) # Random Forest & Extra Trees
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
# Model Evaluation & Selection
library(caret) # MinMaxScaler
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
# PowerTransformer# \createDataPartition()
# \trainControl(method = "cv")
# \train() dengan tuneGrid
# train() dengan tuneLength
# confusionMatrix()
library(pROC) # roc(), auc()
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Load data
df <- read.csv("C:\\Users\\Elsa Ika R\\OneDrive\\KULIAH\\SEMESTER 6\\Komstat Lanjut\\heart_failure_clinical_records_dataset.csv")
# Setara df.head()
head(df)
# df.shape
dim(df)
## [1] 299 13
# df.info()
str(df)
## 'data.frame': 299 obs. of 13 variables:
## $ age : num 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : int 0 0 0 1 1 1 1 1 0 1 ...
## $ creatinine_phosphokinase: int 582 7861 146 111 160 47 246 315 157 123 ...
## $ diabetes : int 0 0 0 0 1 0 0 1 0 0 ...
## $ ejection_fraction : int 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : int 1 0 0 0 0 1 0 0 0 1 ...
## $ platelets : num 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : int 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : int 1 1 1 1 0 1 1 1 0 1 ...
## $ smoking : int 0 0 1 0 0 1 0 1 0 1 ...
## $ time : int 4 6 7 7 8 8 10 10 10 10 ...
## $ DEATH_EVENT : int 1 1 1 1 1 1 1 1 1 1 ...
# Perbaiki tipe data usia → integer
df$age <- as.integer(df$age)
# Kolom numerik
numeric <- c('age', 'creatinine_phosphokinase',
'ejection_fraction', 'platelets',
'serum_creatinine', 'time')
# Kolom kategori
categorical <- c('anaemia', 'diabetes', 'high_blood_pressure',
'sex', 'smoking')
colSums(is.na(df))
## age anaemia creatinine_phosphokinase
## 0 0 0
## diabetes ejection_fraction high_blood_pressure
## 0 0 0
## platelets serum_creatinine serum_sodium
## 0 0 0
## sex smoking time
## 0 0 0
## DEATH_EVENT
## 0
library(ggplot2)
death_color <- c("0" = "navy", "1" = "#DC143C")
target_count <- table(df$DEATH_EVENT)
df$DEATH_EVENT_label <- factor(df$DEATH_EVENT, labels = c("No", "True"))
ggplot(df, aes(x = DEATH_EVENT_label, fill = factor(DEATH_EVENT))) +
geom_bar() +
geom_text(stat = "count",
aes(label = paste0(round(after_stat(count)/nrow(df)*100, 2), "%\n(",
after_stat(count), ")")),
position = position_stack(vjust = 0.5),
color = "white", size = 5) +
scale_fill_manual(values = death_color) +
scale_y_continuous(breaks = seq(0, 225, 25)) +
theme_classic() +
labs(x = "DEATH_EVENT", fill = "DEATH_EVENT")
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
library(tidyr)
# Countplot fitur Kategori per DEATH_EVENT
cat_plots <- list()
for (name in categorical) {
p <- ggplot(df, aes(x = factor(.data[[name]]), fill = factor(DEATH_EVENT))) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("0" = "navy", "1" = "red3"),
labels = c("No", "True"), name = "DEATH_EVENT") +
labs(x = name, title = name) +
theme_bw()
cat_plots[[name]] <- p
}
grid.arrange(grobs = cat_plots, ncol = 2)
# Histogram fitur Numerik
colors_hist <- scales::hue_pal()(length(numeric))
hist_plots <- list()
for (i in seq_along(numeric)) {
name <- numeric[i]
p <- ggplot(df, aes(x = .data[[name]])) +
geom_histogram(fill = colors_hist[i], bins = 30, color = "white") +
labs(title = name) +
theme_bw()
hist_plots[[name]] <- p
}
grid.arrange(grobs = hist_plots, ncol = 3,
top = grid::textGrob("Histogram fitur Numerik",
gp = grid::gpar(fontsize = 15)))
X <- df[, !names(df) %in% c("DEATH_EVENT", "DEATH_EVENT_label")]
dim(X)
## [1] 299 12
colnames(X)
## [1] "age" "anaemia"
## [3] "creatinine_phosphokinase" "diabetes"
## [5] "ejection_fraction" "high_blood_pressure"
## [7] "platelets" "serum_creatinine"
## [9] "serum_sodium" "sex"
## [11] "smoking" "time"
y <- df$DEATH_EVENT # kolom target
length(y)
## [1] 299
SMOTE adalah teknik oversampling di mana sampel sintetis dihasilkan untuk kelas minoritas.
library(smotefamily)
# Setara SMOTE(random_state=2021, k_neighbors=5)
set.seed(2021)
smote_result <- SMOTE(X, y, K = 5)
# Ambil data hasil resample
X_smote <- smote_result$data[, -ncol(smote_result$data)] # fitur
dim(X_smote)
## [1] 395 12
y_smote <- smote_result$data[, ncol(smote_result$data)] # target (kolom "class")
length(y_smote)
## [1] 395
Selama EDA untuk fitur numerik, histogram dari beberapa fitur menunjukkan kemiringan. Beberapa fitur seperti creatinine_phosphokinase dan serum_creatinine sangat miring. Fitur miring seperti ini dapat dibuat lebih mirip Gaussian menggunakan transformasi daya atau transformasi log.
library(ggplot2)
# QQ Plot sebelum transformasi
p1 <- ggplot(df, aes(sample = creatinine_phosphokinase)) +
stat_qq() +
stat_qq_line(color = "red") +
labs(title = "sebelum transformasi",
x = "Theoretical Quantiles",
y = "Sample Quantiles") +
theme_bw()
# QQ Plot setelah transformasi log10
p2 <- ggplot(df, aes(sample = log10(creatinine_phosphokinase))) +
stat_qq() +
stat_qq_line(color = "red") +
labs(title = "setelah transformasi",
x = "Theoretical Quantiles",
y = "Sample Quantiles") +
theme_bw()
grid.arrange(p1, p2, ncol = 2,
top = grid::textGrob(
"QQ Plot untuk creatinine_phosphokinase",
gp = grid::gpar(fontface = "bold")
)
)
p <- -1
# QQ Plot sebelum transformasi
p1 <- ggplot(df, aes(sample = serum_creatinine)) +
stat_qq() +
stat_qq_line(color = "red") +
labs(title = "sebelum transformasi",
x = "Theoretical Quantiles",
y = "Sample Quantiles") +
theme_bw()
# QQ Plot sesudah transformasi (pangkat -1)
p2 <- ggplot(df, aes(sample = serum_creatinine^p)) +
stat_qq() +
stat_qq_line(color = "red") +
labs(title = "sesudah transformasi",
x = "Theoretical Quantiles",
y = "Sample Quantiles") +
theme_bw()
# Gabungkan kedua plot
grid.arrange(p1, p2, ncol = 2,
top = grid::textGrob(
"QQ Plot untuk serum_creatinine",
gp = grid::gpar(fontface = "bold")
)
)
Power Transformer dari paket sklearn-learn menyediakan dua metode untuk
membuat distribusi seperti gaussian: 1. Boxcox 2. Yeo-johnson Kedua
metode ini mencari nilai p yang tepat (seperti pada contoh di atas) agar
distribusinya menjadi normal. Yeo-johnson adalah versi upgrade dari
Boxcox karena berurusan dengan data dengan nilai negatif.
library(caret)
# Setara PowerTransformer(method='yeo-johnson')
pt <- preProcess(X_smote, method = "YeoJohnson")
# Setara pt.fit_transform(X_smote)
X_pt <- predict(pt, X_smote)
# Tampilkan hasil
X_pt
mm <- preProcess(X_pt, method = "range")
X_scaled <- predict(mm, X_pt)
colnames(X_scaled)
## [1] "age" "anaemia"
## [3] "creatinine_phosphokinase" "diabetes"
## [5] "ejection_fraction" "high_blood_pressure"
## [7] "platelets" "serum_creatinine"
## [9] "serum_sodium" "sex"
## [11] "smoking" "time"
library(tidyr)
library(ggplot2)
# Konversi ke dataframe dengan nama kolom dari X
X_scaled_df <- as.data.frame(X_scaled)
colnames(X_scaled_df) <- colnames(X)
# Histogram semua fitur
X_scaled_df |>
pivot_longer(cols = everything(), names_to = "fitur", values_to = "nilai") |>
ggplot(aes(x = nilai)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
facet_wrap(~ fitur, scales = "free") +
theme_bw()
library(randomForest)
set.seed(2021)
# Konversi y_smote ke factor (wajib untuk klasifikasi)
y_smote_factor <- as.factor(y_smote)
# Setara RandomForestClassifier(n_estimators=100, max_depth=10, class_weight='balanced')
rf <- randomForest(x = X_scaled,
y = y_smote_factor,
ntree = 100, # n_estimators
maxnodes = 2^10, # pendekatan max_depth=10
classwt = c("0" = 1, "1" = 1), # class_weight='balanced'
importance = TRUE)
## Warning in randomForest.default(x = X_scaled, y = y_smote_factor, ntree = 100,
## : maxnodes exceeds its max value.
rf
##
## Call:
## randomForest(x = X_scaled, y = y_smote_factor, ntree = 100, classwt = c(`0` = 1, `1` = 1), maxnodes = 2^10, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 12.15%
## Confusion matrix:
## 0 1 class.error
## 0 178 25 0.1231527
## 1 23 169 0.1197917
library(ggplot2)
# Ambil feature importance dari model rf
feature_imp <- data.frame(
fitur = colnames(X),
importance = round(rf$importance[, "MeanDecreaseGini"] /
sum(rf$importance[, "MeanDecreaseGini"]) * 100, 2)
)
# Sort descending
feature_imp <- feature_imp[order(feature_imp$importance, decreasing = FALSE), ]
feature_imp$fitur <- factor(feature_imp$fitur, levels = feature_imp$fitur)
# Plot barh (horizontal)
ggplot(feature_imp, aes(x = importance, y = fitur)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(x = "percentage", y = NULL) +
theme_bw()
# Ambil 3 fitur teratas (index sudah diurutkan descending sebelumnya)
imp_features <- tail(feature_imp$fitur, 3)
imp_features
## [1] ejection_fraction serum_creatinine time
## 12 Levels: smoking sex anaemia diabetes high_blood_pressure ... time
# Pilih kolom berdasarkan imp_features
X_scaled_df <- as.data.frame(X_scaled)
colnames(X_scaled_df) <- colnames(X)
X_selected <- X_scaled_df[, as.character(imp_features)]
X_selected
Oleh karena itu, tujuannya sekarang adalah untuk memisahkan kedua kelas seperti yang ditunjukkan pada gambar di bawah ini
Catatan: Semua pengklasifikasi telah disetel untuk memaksimalkan skor f1 alih-alih akurasi. Skor F1 adalah rata-rata harmonik dari recall dan presisi. Skor ini akan mendukung pengklasifikasi dengan presisi dan daya ingat yang sama. Saya bisa saja mencapai daya ingat atau presisi yang tinggi tetapi sayangnya, kita tidak dapat memilikinya dua arah karena meningkatkan presisi mengurangi daya ingat, dan sebaliknya.
model_data <- X_selected
model_data$target <- y_smote
model_data
library(caret)
set.seed(2021)
# Pisahkan fitur dan target
X_model <- model_data[, !names(model_data) %in% "target"]
y_model <- as.factor(model_data$target)
# Stratified split 80/20 (setara stratify=model_data['target'])
train_idx <- createDataPartition(y_model, p = 0.80, list = FALSE)
X_train <- X_model[train_idx, ]
X_test <- X_model[-train_idx, ]
y_train <- y_model[train_idx]
y_test <- y_model[-train_idx]
# Setara print shape
dim(X_train)
## [1] 317 3
length(y_train)
## [1] 317
dim(X_test)
## [1] 78 3
length(y_test)
## [1] 78
set.seed(2021)
# Grid parameter C
tune_grid <- expand.grid(C = c(0.001, 0.01, 0.1, 1, 10, 100, 1000))
# Setara GridSearchCV dengan cv=10 dan scoring='f1'
train_ctrl <- trainControl(method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary)
# Konversi y_train ke label valid (tidak boleh 0/1)
levels(y_train) <- c("No", "Yes")
levels(y_test) <- c("No", "Yes")
# Setara SVC(kernel='linear', class_weight='balanced')
lin_svm_cv <- train(x = X_train,
y = y_train,
method = "svmLinear",
trControl = train_ctrl,
tuneGrid = tune_grid,
metric = "F1",
class.weights = c("No" = 1, "Yes" = 1))
## Warning in train.default(x = X_train, y = y_train, method = "svmLinear", : The
## metric "F1" was not in the result set. ROC will be used instead.
# Setara lin_svm_cv.best_params_
lin_svm_cv$bestTune
# Setara cross_val_predict + classification_report pada data train
train_ctrl_pred <- trainControl(method = "cv",
number = 10,
savePredictions = "final",
classProbs = TRUE)
set.seed(2021)
lin_svm_cv_pred <- train(x = X_train,
y = y_train,
method = "svmLinear",
trControl = train_ctrl_pred,
tuneGrid = data.frame(C = lin_svm_cv$bestTune$C),
metric = "Accuracy")
# Ambil prediksi cross-validation
lin_svm_train_pred <- lin_svm_cv_pred$pred$pred
# Setara classification_report
confusionMatrix(lin_svm_train_pred,
lin_svm_cv_pred$pred$obs,
positive = "Yes",
dnn = c("Predicted", "Actual"))
## Confusion Matrix and Statistics
##
## Actual
## Predicted No Yes
## No 107 6
## Yes 56 148
##
## Accuracy : 0.8044
## 95% CI : (0.7564, 0.8466)
## No Information Rate : 0.5142
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.612
##
## Mcnemar's Test P-Value : 4.877e-10
##
## Sensitivity : 0.9610
## Specificity : 0.6564
## Pos Pred Value : 0.7255
## Neg Pred Value : 0.9469
## Prevalence : 0.4858
## Detection Rate : 0.4669
## Detection Prevalence : 0.6435
## Balanced Accuracy : 0.8087
##
## 'Positive' Class : Yes
##
set.seed(2021)
# Grid parameter C
tune_grid_rbf <- expand.grid(C = c(0.001, 0.01, 0.1, 1, 10, 100, 1000),
sigma = 0.1) # sigma = 1/gamma (RBF kernel)
# Setara GridSearchCV dengan cv=10 dan scoring='f1'
train_ctrl_rbf <- trainControl(method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary)
# Setara SVC(kernel='rbf', class_weight='balanced')
rbf_svm_cv <- train(x = X_train,
y = y_train,
method = "svmRadial",
trControl = train_ctrl_rbf,
tuneGrid = tune_grid_rbf,
metric = "ROC",
class.weights = c("No" = 1, "Yes" = 1))
# Setara rbf_svm_cv.best_params_
rbf_svm_cv$bestTune
library(caret)
library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.4.3
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
set.seed(2021)
# Setara cross_val_predict pada data train
train_ctrl_pred_rbf <- trainControl(method = "cv",
number = 10,
savePredictions = "final",
classProbs = TRUE)
rbf_svm_cv_pred <- train(x = X_train,
y = y_train,
method = "svmRadial",
trControl = train_ctrl_pred_rbf,
tuneGrid = data.frame(C = rbf_svm_cv$bestTune$C,
sigma = rbf_svm_cv$bestTune$sigma),
metric = "Accuracy")
# Ambil prediksi cross-validation
rbf_svm_train_pred <- rbf_svm_cv_pred$pred$pred
actual <- rbf_svm_cv_pred$pred$obs
# Setara classification_report
confusionMatrix(rbf_svm_train_pred, actual,
positive = "Yes",
dnn = c("Predicted", "Actual"))
## Confusion Matrix and Statistics
##
## Actual
## Predicted No Yes
## No 113 6
## Yes 50 148
##
## Accuracy : 0.8233
## 95% CI : (0.7768, 0.8637)
## No Information Rate : 0.5142
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6492
##
## Mcnemar's Test P-Value : 9.132e-09
##
## Sensitivity : 0.9610
## Specificity : 0.6933
## Pos Pred Value : 0.7475
## Neg Pred Value : 0.9496
## Prevalence : 0.4858
## Detection Rate : 0.4669
## Detection Prevalence : 0.6246
## Balanced Accuracy : 0.8271
##
## 'Positive' Class : Yes
##
# Detail per kelas
cat("Precision:", round(Precision(actual, rbf_svm_train_pred, positive = "No"), 4), "\n")
## Precision: 0.9496
cat("Recall :", round(Recall(actual, rbf_svm_train_pred, positive = "No"), 4), "\n")
## Recall : 0.6933
cat("F1 :", round(F1_Score(actual, rbf_svm_train_pred, positive = "No"), 4), "\n\n")
## F1 : 0.8014
cat("Precision:", round(Precision(actual, rbf_svm_train_pred, positive = "Yes"), 4), "\n")
## Precision: 0.7475
cat("Recall :", round(Recall(actual, rbf_svm_train_pred, positive = "Yes"), 4), "\n")
## Recall : 0.961
cat("F1 :", round(F1_Score(actual, rbf_svm_train_pred, positive = "Yes"), 4), "\n")
## F1 : 0.8409
set.seed(2021)
tune_grid_rf <- expand.grid(
mtry = floor(sqrt(ncol(X_train))),
splitrule = c("gini", "extratrees"),
min.node.size = 1
)
train_ctrl_rf <- trainControl(method = "cv",
number = 10,
savePredictions = "final",
classProbs = TRUE,
summaryFunction = twoClassSummary)
rf_cv <- train(x = X_train,
y = y_train,
method = "ranger",
trControl = train_ctrl_rf,
tuneGrid = tune_grid_rf,
metric = "ROC",
num.trees = 100,
max.depth = 10, # pakai nilai tunggal
importance = "impurity",
seed = 2021)
rf_cv$bestTune
# Cross-val predict
rf_train_pred <- rf_cv$pred$pred
actual <- rf_cv$pred$obs
# Confusion Matrix
cm_rf <- confusionMatrix(rf_train_pred, actual,
positive = "Yes",
dnn = c("Predicted", "Actual"))
print(cm_rf)
## Confusion Matrix and Statistics
##
## Actual
## Predicted No Yes
## No 142 29
## Yes 21 125
##
## Accuracy : 0.8423
## 95% CI : (0.7974, 0.8806)
## No Information Rate : 0.5142
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6838
##
## Mcnemar's Test P-Value : 0.3222
##
## Sensitivity : 0.8117
## Specificity : 0.8712
## Pos Pred Value : 0.8562
## Neg Pred Value : 0.8304
## Prevalence : 0.4858
## Detection Rate : 0.3943
## Detection Prevalence : 0.4606
## Balanced Accuracy : 0.8414
##
## 'Positive' Class : Yes
##
# Detail per kelas
cat("Precision:", round(cm_rf$byClass["Neg Pred Value"], 4), "\n")
## Precision: 0.8304
cat("Recall :", round(cm_rf$byClass["Specificity"], 4), "\n\n")
## Recall : 0.8712
cat("Precision:", round(cm_rf$byClass["Pos Pred Value"], 4), "\n")
## Precision: 0.8562
cat("Recall :", round(cm_rf$byClass["Sensitivity"], 4), "\n")
## Recall : 0.8117
cat("F1 :", round(cm_rf$byClass["F1"], 4), "\n\n")
## F1 : 0.8333
cat("Accuracy :", round(cm_rf$overall["Accuracy"], 4), "\n")
## Accuracy : 0.8423
library(caret)
library(MLmetrics)
set.seed(2021)
# Setara cross_val_predict pada data train
train_ctrl_pred_rf <- trainControl(method = "cv",
number = 10,
savePredictions = "final",
classProbs = TRUE,
verboseIter = TRUE) # setara verbose=1
rf_cv_pred <- train(x = X_train,
y = y_train,
method = "ranger",
trControl = train_ctrl_pred_rf,
tuneGrid = rf_cv$bestTune,
metric = "Accuracy",
num.trees = 100,
importance = "impurity")
## + Fold01: mtry=1, splitrule=gini, min.node.size=1
## - Fold01: mtry=1, splitrule=gini, min.node.size=1
## + Fold02: mtry=1, splitrule=gini, min.node.size=1
## - Fold02: mtry=1, splitrule=gini, min.node.size=1
## + Fold03: mtry=1, splitrule=gini, min.node.size=1
## - Fold03: mtry=1, splitrule=gini, min.node.size=1
## + Fold04: mtry=1, splitrule=gini, min.node.size=1
## - Fold04: mtry=1, splitrule=gini, min.node.size=1
## + Fold05: mtry=1, splitrule=gini, min.node.size=1
## - Fold05: mtry=1, splitrule=gini, min.node.size=1
## + Fold06: mtry=1, splitrule=gini, min.node.size=1
## - Fold06: mtry=1, splitrule=gini, min.node.size=1
## + Fold07: mtry=1, splitrule=gini, min.node.size=1
## - Fold07: mtry=1, splitrule=gini, min.node.size=1
## + Fold08: mtry=1, splitrule=gini, min.node.size=1
## - Fold08: mtry=1, splitrule=gini, min.node.size=1
## + Fold09: mtry=1, splitrule=gini, min.node.size=1
## - Fold09: mtry=1, splitrule=gini, min.node.size=1
## + Fold10: mtry=1, splitrule=gini, min.node.size=1
## - Fold10: mtry=1, splitrule=gini, min.node.size=1
## Aggregating results
## Fitting final model on full training set
# Ambil prediksi cross-validation
rf_train_pred <- rf_cv_pred$pred$pred
actual <- rf_cv_pred$pred$obs
library(MLmetrics)
# Setara classification_report
confusionMatrix(rf_train_pred, actual,
positive = "Yes",
dnn = c("Predicted", "Actual"))
## Confusion Matrix and Statistics
##
## Actual
## Predicted No Yes
## No 142 27
## Yes 21 127
##
## Accuracy : 0.8486
## 95% CI : (0.8043, 0.8862)
## No Information Rate : 0.5142
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6966
##
## Mcnemar's Test P-Value : 0.4705
##
## Sensitivity : 0.8247
## Specificity : 0.8712
## Pos Pred Value : 0.8581
## Neg Pred Value : 0.8402
## Prevalence : 0.4858
## Detection Rate : 0.4006
## Detection Prevalence : 0.4669
## Balanced Accuracy : 0.8479
##
## 'Positive' Class : Yes
##
# Detail per kelas
cat("Precision:", round(Precision(actual, rf_train_pred, positive = "No"), 4), "\n")
## Precision: 0.8402
cat("Recall :", round(Recall(actual, rf_train_pred, positive = "No"), 4), "\n")
## Recall : 0.8712
cat("F1 :", round(F1_Score(actual, rf_train_pred, positive = "No"), 4), "\n\n")
## F1 : 0.8554
cat("Precision:", round(Precision(actual, rf_train_pred, positive = "Yes"), 4), "\n")
## Precision: 0.8581
cat("Recall :", round(Recall(actual, rf_train_pred, positive = "Yes"), 4), "\n")
## Recall : 0.8247
cat("F1 :", round(F1_Score(actual, rf_train_pred, positive = "Yes"), 4), "\n\n")
## F1 : 0.8411
# Accuracy & Macro avg
cat("=== Overall ===\n")
## === Overall ===
cat("Accuracy :", round(Accuracy(actual, rf_train_pred), 4), "\n")
## Accuracy : 0.8486
cat("Macro F1 :", round((F1_Score(actual, rf_train_pred, positive = "No") +
F1_Score(actual, rf_train_pred, positive = "Yes")) / 2, 4), "\n")
## Macro F1 : 0.8482
Random Forest telah mengungguli semua pengklasifikasi lainnya dalam akurasi, presisi, daya ingat, skor f1, dan skor AUC.
library(MLmetrics)
library(pROC)
library(ggplot2)
library(tidyr)
# Daftar nama model
models <- c("Linear SVM", "Non-Linear SVM", "Random Forest")
# Daftar prediksi tiap model
predictions <- list(lin_svm_train_pred, rbf_svm_train_pred, rf_train_pred)
# Inisialisasi vektor metrik
accuracy <- c()
recall <- c()
precision <- c()
f1 <- c()
auc <- c()
# Hitung metrik tiap model
for (model_pred in predictions) {
accuracy <- c(accuracy, Accuracy(actual, model_pred))
precision <- c(precision, Precision(actual, model_pred, positive = "Yes"))
recall <- c(recall, Recall(actual, model_pred, positive = "Yes"))
f1 <- c(f1, F1_Score(actual, model_pred, positive = "Yes"))
auc <- c(auc, as.numeric(roc(actual, as.numeric(model_pred == "Yes"),
quiet = TRUE)$auc))
}
# Gabungkan ke dataframe
results_df <- data.frame(
model = models,
accuracy = round(accuracy, 4),
precision = round(precision, 4),
recall = round(recall, 4),
f1 = round(f1, 4),
auc = round(auc, 4)
)
results_df
library(ggplot2)
# Setara sns.color_palette("Dark2")
model_colors <- RColorBrewer::brewer.pal(length(models), "Dark2")
ggplot(results_df, aes(x = model, y = accuracy, fill = model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(accuracy * 100, 3), "%")),
vjust = -0.5, hjust = 0.5, size = 3.5) +
scale_fill_manual(values = model_colors) +
labs(x = "Models",
y = "Accuracy percentage (%)",
title = "Model comparison on training data using Accuracy") +
theme_classic() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 15, hjust = 1))
ggplot(results_df, aes(x = model, y = recall, fill = model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(recall * 100, 3), "%")),
vjust = -0.5, hjust = 0.5, size = 3.5) +
scale_fill_manual(values = model_colors) +
labs(x = "Models",
y = "Recall percentage (%)",
title = "Model comparison on training data using Recall") +
theme_classic() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 15, hjust = 1))
ggplot(results_df, aes(x = model, y = precision, fill = model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(precision * 100, 3), "%")),
vjust = -0.5, hjust = 0.5, size = 3.5) +
scale_fill_manual(values = model_colors) +
labs(x = "Models",
y = "Precision percentage (%)",
title = "Model comparison on training data using Precision") +
theme_classic() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 15, hjust = 1))
ggplot(results_df, aes(x = model, y = f1, fill = model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(f1 * 100, 3), "%")),
vjust = -0.5, hjust = 0.5, size = 3.5) +
scale_fill_manual(values = model_colors) +
labs(x = "Models",
y = "F1 percentage (%)",
title = "Model comparison on training data using F1 score") +
theme_classic() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 15, hjust = 1))
ggplot(results_df, aes(x = model, y = auc, fill = model)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(auc, 3)),
vjust = -0.5, hjust = 0.5, size = 3.5) +
scale_fill_manual(values = model_colors) +
labs(x = "Models",
y = "AUC",
title = "Model comparison on training data using AUC") +
theme_classic() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 15, hjust = 1))
library(caret)
library(MLmetrics)
# Daftar model
best_models <- list(lin_svm_cv, rbf_svm_cv, rf_cv)
# Evaluasi tiap model pada data test
for (i in seq_along(models)) {
model <- best_models[[i]]
name <- models[i]
best_pred <- predict(model, newdata = X_test)
cat(toupper(name), "\n")
# Confusion Matrix + metrik
cm <- confusionMatrix(best_pred, y_test,
positive = "Yes",
dnn = c("Predicted", "Actual"))
print(cm)
# Detail per kelas
cat("Precision:", round(Precision(y_test, best_pred, positive = "No"), 4), "\n")
cat("Recall :", round(Recall(y_test, best_pred, positive = "No"), 4), "\n")
cat("F1 :", round(F1_Score(y_test, best_pred, positive = "No"), 4), "\n\n")
cat("Precision:", round(Precision(y_test, best_pred, positive = "Yes"), 4), "\n")
cat("Recall :", round(Recall(y_test, best_pred, positive = "Yes"), 4), "\n")
cat("F1 :", round(F1_Score(y_test, best_pred, positive = "Yes"), 4), "\n\n")
cat("Accuracy :", round(Accuracy(y_test, best_pred), 4), "\n")
}
## LINEAR SVM
## Confusion Matrix and Statistics
##
## Actual
## Predicted No Yes
## No 17 0
## Yes 23 38
##
## Accuracy : 0.7051
## 95% CI : (0.5911, 0.803)
## No Information Rate : 0.5128
## P-Value [Acc > NIR] : 0.0004323
##
## Kappa : 0.4187
##
## Mcnemar's Test P-Value : 4.49e-06
##
## Sensitivity : 1.0000
## Specificity : 0.4250
## Pos Pred Value : 0.6230
## Neg Pred Value : 1.0000
## Prevalence : 0.4872
## Detection Rate : 0.4872
## Detection Prevalence : 0.7821
## Balanced Accuracy : 0.7125
##
## 'Positive' Class : Yes
##
## Precision: 1
## Recall : 0.425
## F1 : 0.5965
##
## Precision: 0.623
## Recall : 1
## F1 : 0.7677
##
## Accuracy : 0.7051
## NON-LINEAR SVM
## Confusion Matrix and Statistics
##
## Actual
## Predicted No Yes
## No 20 0
## Yes 20 38
##
## Accuracy : 0.7436
## 95% CI : (0.6321, 0.8358)
## No Information Rate : 0.5128
## P-Value [Acc > NIR] : 2.584e-05
##
## Kappa : 0.4935
##
## Mcnemar's Test P-Value : 2.152e-05
##
## Sensitivity : 1.0000
## Specificity : 0.5000
## Pos Pred Value : 0.6552
## Neg Pred Value : 1.0000
## Prevalence : 0.4872
## Detection Rate : 0.4872
## Detection Prevalence : 0.7436
## Balanced Accuracy : 0.7500
##
## 'Positive' Class : Yes
##
## Precision: 1
## Recall : 0.5
## F1 : 0.6667
##
## Precision: 0.6552
## Recall : 1
## F1 : 0.7917
##
## Accuracy : 0.7436
## RANDOM FOREST
## Confusion Matrix and Statistics
##
## Actual
## Predicted No Yes
## No 35 3
## Yes 5 35
##
## Accuracy : 0.8974
## 95% CI : (0.8079, 0.9547)
## No Information Rate : 0.5128
## P-Value [Acc > NIR] : 4.197e-13
##
## Kappa : 0.795
##
## Mcnemar's Test P-Value : 0.7237
##
## Sensitivity : 0.9211
## Specificity : 0.8750
## Pos Pred Value : 0.8750
## Neg Pred Value : 0.9211
## Prevalence : 0.4872
## Detection Rate : 0.4487
## Detection Prevalence : 0.5128
## Balanced Accuracy : 0.8980
##
## 'Positive' Class : Yes
##
## Precision: 0.9211
## Recall : 0.875
## F1 : 0.8974
##
## Precision: 0.875
## Recall : 0.9211
## F1 : 0.8974
##
## Accuracy : 0.8974
Perbandingan model pada data uji menggunakan Kurva ROC kurva ROC, memplot tingkat positif sejati (nama lain untuk penarikan) terhadap tingkat positif palsu (FPR). Sekali lagi ada trade-off: semakin tinggi recall (TPR), semakin banyak false positive (FPR) yang dihasilkan classifier. Garis putus-putus mewakili kurva ROC dari pengklasifikasi acak murni; pengklasifikasi yang baik tetap sejauh mungkin dari garis itu Salah satu cara untuk membandingkan pengklasifikasi adalah dengan mengukur area di bawah kurva (AUC). Pengklasifikasi sempurna akan memiliki ROC AUC sama dengan 1, sedangkan pengklasifikasi acak murni akan memiliki ROC AUC sama dengan 0,5. Hutan acak memiliki AUC tertinggi.
library(ggplot2)
library(pROC)
best_models <- list(lin_svm_cv, rbf_svm_cv, rf_cv)
# Kumpulkan data ROC semua model
roc_data <- data.frame()
for (i in seq_along(models)) {
model <- best_models[[i]]
name <- models[i]
best_pred <- predict(model, newdata = X_test)
roc_obj <- roc(y_test, as.numeric(best_pred == "Yes"), quiet = TRUE)
roc_data <- rbind(roc_data, data.frame(
fpr = 1 - roc_obj$specificities,
tpr = roc_obj$sensitivities,
model = name
))
}
# Plot ROC curve semua model
ggplot(roc_data, aes(x = fpr, y = tpr, color = model)) +
geom_line(linewidth = 1) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") +
labs(x = "False Positive Rate",
y = "True Positive Rate (Recall)",
color = "Model") +
theme_classic()
Model yang terbaik untuk melakukan klasifikasi untuk dataset ini adalah Random Forest.