The following packages are used in this analysis. readxl reads the Excel file, dplyr and tidyr handle data manipulation, janitor cleans the variable names, and ggplot2 draws the comparison chart. caret is used to split the data into training and testing sets, to build the dummy variables, and to compute the confusion matrix. e1071 is used to build the Naive Bayes model and predict the productivity (yield) category of the paddy. smotefamily provides SMOTE oversampling. rpart builds the decision tree model that classifies paddy productivity from the predictor variables, and rpart.plot draws the resulting tree so that the classification rules are easy to interpret.
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:ggplot2':
##
## element
library(smotefamily)
library(rpart)
library(rpart.plot)
library(tidyr)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggplot2)
The dataset used is the “Paddy Dataset”, sourced from …. It contains 45 variables and 2,789 observations.
data<-read_excel(
"C:/KULIAH/Semester 6/Data Mining & BI/Stlh UTS - Data Mining/paddydataset.xlsx"
)
nrow(data)
## [1] 2789
ncol(data)
## [1] 45
head(data)
## # A tibble: 6 × 45
## Hectares Agriblock Variety Soil_Types Seedrate_in_kg LP_Mainfield_in_tonnes
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 6 Cuddalore CO_43 alluvial 150 75
## 2 6 Kurinjipadi ponmani clay 150 75
## 3 6 Panruti delux … alluvial 150 75
## 4 6 Kallakurichi CO_43 clay 150 75
## 5 6 Sankarapuram ponmani alluvial 150 75
## 6 6 Chinnasalem delux … alluvial 150 75
## # ℹ 39 more variables: Nursery <chr>, Nursery_area <dbl>,
## # LP_nurseryarea_in_tonnes <dbl>, DAP_20days <dbl>,
## # Weed28D_thiobencarb <dbl>, Urea_40Days <dbl>, Potassh_50Days <dbl>,
## # Micronutrients_70Days <dbl>, Pest_60Day_in_ml <dbl>, `30DRain_in_mm` <dbl>,
## # `30DAI_in_mm` <dbl>, `30_50DRain_in_mm` <dbl>, `30_50DAI_in_mm` <dbl>,
## # `51_70DRain_in_mm` <dbl>, `51_70AI_in_mm` <dbl>, `71_105DRain_in_mm` <dbl>,
## # `71_105DAI_in_mm` <dbl>, MinTemp_D1_D30 <dbl>, MaxTemp_D1_D30 <dbl>, …
library(janitor)
data <- clean_names(data)
cat("\nNama Variabel Setelah Dibersihkan:\n")
##
## Nama Variabel Setelah Dibersihkan:
print(names(data))
## [1] "hectares" "agriblock"
## [3] "variety" "soil_types"
## [5] "seedrate_in_kg" "lp_mainfield_in_tonnes"
## [7] "nursery" "nursery_area"
## [9] "lp_nurseryarea_in_tonnes" "dap_20days"
## [11] "weed28d_thiobencarb" "urea_40days"
## [13] "potassh_50days" "micronutrients_70days"
## [15] "pest_60day_in_ml" "x30d_rain_in_mm"
## [17] "x30dai_in_mm" "x30_50d_rain_in_mm"
## [19] "x30_50dai_in_mm" "x51_70d_rain_in_mm"
## [21] "x51_70ai_in_mm" "x71_105d_rain_in_mm"
## [23] "x71_105dai_in_mm" "min_temp_d1_d30"
## [25] "max_temp_d1_d30" "min_temp_d31_d60"
## [27] "max_temp_d31_d60" "min_temp_d61_d90"
## [29] "max_temp_d61_d90" "min_temp_d91_d120"
## [31] "max_temp_d91_d120" "inst_wind_speed_d1_d30"
## [33] "inst_wind_speed_d31_d60" "inst_wind_speed_d61_d90"
## [35] "inst_wind_speed_d91_d120" "wind_direction_d1_d30"
## [37] "wind_direction_d31_d60" "wind_direction_d61_d90"
## [39] "wind_direction_d91_d120" "relative_humidity_d1_d30"
## [41] "relative_humidity_d31_d60" "relative_humidity_d61_d90"
## [43] "relative_humidity_d91_d120" "trash_in_bundles"
## [45] "paddy_yield_in_kg"
clean_names() standardizes the variable names: it converts them to lowercase, replaces spaces and other separators with underscores (_), and prefixes names that start with a digit with x (for example, 30DRain_in_mm becomes x30d_rain_in_mm). This makes the variables easier to reference during the analysis and avoids syntax errors.
summary(data$paddy_yield_in_kg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5410 16389 24636 22518 31035 38814
q1 <- quantile(data$paddy_yield_in_kg, 0.25, na.rm = TRUE)
q3 <- quantile(data$paddy_yield_in_kg, 0.75, na.rm = TRUE)
all_levels <- c("Rendah", "Sedang", "Tinggi")
data$Y <- case_when(
data$paddy_yield_in_kg <= q1 ~ "Rendah",
data$paddy_yield_in_kg <= q3 ~ "Sedang",
TRUE ~ "Tinggi"
)
data$Y <- factor(data$Y, levels = all_levels)
cat("\nDistribusi Kelas Target:\n")
##
## Distribusi Kelas Target:
print(table(data$Y))
##
## Rendah Sedang Tinggi
## 701 1413 675
The yield variable, originally numeric, is converted into three classes using quartiles. Observations with a yield at or below the first quartile are labeled Rendah (low), those above the first quartile and up to the third quartile are labeled Sedang (medium), and those above the third quartile are labeled Tinggi (high). This gives 701, 1,413, and 675 observations respectively, roughly a 25/50/25 split. The discretization is needed so that classification methods can be applied.
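The same binning can also be written with base R's cut(). The sketch below is illustrative only: it uses the rounded quartiles printed by summary() above (16389 and 31035) and takes the five summary values as stand-in observations instead of the real data.

```r
# Quartiles as printed by summary() above (rounded, for illustration only)
q1 <- 16389
q3 <- 31035
all_levels <- c("Rendah", "Sedang", "Tinggi")

# The five summary values stand in for real observations
yield <- c(5410, 16389, 24636, 31035, 38814)

# right = TRUE gives the intervals (-Inf, q1], (q1, q3], (q3, Inf),
# which matches the <= comparisons in the case_when() call
y_class <- cut(
  yield,
  breaks = c(-Inf, q1, q3, Inf),
  labels = all_levels,
  right  = TRUE
)

print(data.frame(yield, y_class))
```

cut() returns a factor whose levels are already in the order Rendah, Sedang, Tinggi, so no separate factor() call is needed in this variant.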
data_model <- data %>%
select(
Y,
variety,
soil_types,
nursery,
seedrate_in_kg,
nursery_area,
lp_mainfield_in_tonnes,
urea_40days,
potassh_50days,
micronutrients_70days,
pest_60day_in_ml,
x30d_rain_in_mm,
x30_50d_rain_in_mm,
x51_70d_rain_in_mm,
x71_105d_rain_in_mm,
min_temp_d1_d30,
max_temp_d1_d30,
min_temp_d31_d60,
max_temp_d31_d60,
min_temp_d61_d90,
max_temp_d61_d90,
min_temp_d91_d120,
max_temp_d91_d120,
relative_humidity_d1_d30,
relative_humidity_d31_d60,
relative_humidity_d61_d90,
relative_humidity_d91_d120,
inst_wind_speed_d1_d30,
inst_wind_speed_d31_d60,
inst_wind_speed_d61_d90,
inst_wind_speed_d91_d120
)
Feature selection is done manually by keeping the variables considered relevant to paddy yield. The selected variables cover land and nursery characteristics (variety, soil type, nursery type and area), seed rate, fertilizer and pesticide use, rainfall, temperature, relative humidity, and wind speed across the growth phases.
data_model <- data_model %>%
mutate(
variety = factor(variety),
soil_types = factor(soil_types),
nursery = factor(nursery),
Y = factor(Y, levels = all_levels)
)
cat("\nStruktur Data Model:\n")
##
## Struktur Data Model:
str(data_model)
## tibble [2,789 × 31] (S3: tbl_df/tbl/data.frame)
## $ Y : Factor w/ 3 levels "Rendah","Sedang",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ variety : Factor w/ 3 levels "CO_43","delux ponni",..: 1 3 2 1 3 2 1 2 3 1 ...
## $ soil_types : Factor w/ 2 levels "alluvial","clay": 1 2 1 2 1 1 2 1 2 1 ...
## $ nursery : Factor w/ 2 levels "dry","wet": 1 2 1 2 1 2 2 1 1 1 ...
## $ seedrate_in_kg : num [1:2789] 150 150 150 150 150 150 150 150 150 150 ...
## $ nursery_area : num [1:2789] 120 120 120 120 120 120 120 120 120 120 ...
## $ lp_mainfield_in_tonnes : num [1:2789] 75 75 75 75 75 75 75 75 75 75 ...
## $ urea_40days : num [1:2789] 163 163 163 163 163 ...
## $ potassh_50days : num [1:2789] 62.3 62.3 62.3 62.3 62.3 ...
## $ micronutrients_70days : num [1:2789] 90 90 90 90 90 90 90 90 90 90 ...
## $ pest_60day_in_ml : num [1:2789] 3600 3600 3600 3600 3600 3600 3600 3600 3600 3600 ...
## $ x30d_rain_in_mm : num [1:2789] 19.6 19.6 18.5 18.5 18.1 18.1 19.6 19.6 18.5 18.5 ...
## $ x30_50d_rain_in_mm : num [1:2789] 187 187 185 185 186 ...
## $ x51_70d_rain_in_mm : num [1:2789] 167 167 165 165 166 ...
## $ x71_105d_rain_in_mm : num [1:2789] 61 61 60 60 60.2 60.2 61 61 60 60 ...
## $ min_temp_d1_d30 : num [1:2789] 18.5 19.5 20 19 20.5 18 18.5 19.5 20 19 ...
## $ max_temp_d1_d30 : num [1:2789] 34 34 35 33 32 31 34 34 35 33 ...
## $ min_temp_d31_d60 : num [1:2789] 16 18.5 18 17 17.5 15.5 16 18.5 18 17 ...
## $ max_temp_d31_d60 : num [1:2789] 30 35 30 32 28 34 30 35 30 32 ...
## $ min_temp_d61_d90 : num [1:2789] 15.5 17 17.5 16.5 18 15 15.5 17 17.5 16.5 ...
## $ max_temp_d61_d90 : num [1:2789] 31 32.5 33.5 31.5 34 33 31 32.5 33.5 31.5 ...
## $ min_temp_d91_d120 : num [1:2789] 16 16 18 15.5 16.5 15 16 16 18 15.5 ...
## $ max_temp_d91_d120 : num [1:2789] 33 30.5 33 32.5 35 31.5 33 30.5 33 32.5 ...
## $ relative_humidity_d1_d30 : num [1:2789] 72 64.6 85 88.5 72.7 78.6 72 64.6 85 88.5 ...
## $ relative_humidity_d31_d60 : num [1:2789] 78 85 96 95 91 80 78 85 96 95 ...
## $ relative_humidity_d61_d90 : num [1:2789] 88 84 84 81 83 92 88 84 84 81 ...
## $ relative_humidity_d91_d120: num [1:2789] 85 87 79 84 81 88 85 87 79 84 ...
## $ inst_wind_speed_d1_d30 : num [1:2789] 4 10 4 8 10 6 4 10 4 8 ...
## $ inst_wind_speed_d31_d60 : num [1:2789] 10 4 12 6 12 6 10 4 12 6 ...
## $ inst_wind_speed_d61_d90 : num [1:2789] 8 10 4 8 10 8 8 10 4 8 ...
## $ inst_wind_speed_d91_d120 : num [1:2789] 10 6 12 6 12 10 10 6 12 6 ...
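The categorical variables are converted to factors so that the classification algorithms treat them as categorical data. Y is given the explicit level order Rendah, Sedang, Tinggi so that the classes appear in a consistent order in every table and confusion matrix. This step matters because both Decision Tree and Naive Bayes need the data in a suitable format for model training.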
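A small illustration of why the explicit levels argument is useful. The prediction vector here is made up for the example; only all_levels comes from the script.

```r
all_levels <- c("Rendah", "Sedang", "Tinggi")

# Made-up predictions in which the class "Sedang" never occurs
pred <- factor(c("Tinggi", "Rendah", "Tinggi"), levels = all_levels)

# The absent class is still a level, so it shows up with a count of zero
pred_counts <- table(pred)
print(pred_counts)

# Integer codes follow the declared order, not alphabetical order
print(as.integer(pred))
```

Keeping all three levels, even when one is absent, is what lets a confusion matrix keep its full 3 × 3 shape.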
library(tidyr)
data_model <- data_model %>%
drop_na()
cat("\nJumlah Data Setelah Drop NA:", nrow(data_model), "\n")
##
## Jumlah Data Setelah Drop NA: 2789
drop_na() removes any rows with missing values so that only complete observations enter the models. The row count stays at 2,789 after this step, so the selected variables contain no missing values and every observation can be used for modeling.
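To see where missing values occur before dropping rows, the counts can be inspected per column. The sketch below uses a made-up three-row data frame, not the paddy data, and base R only.

```r
# Made-up data with one missing value in each column
toy <- data.frame(
  seedrate_in_kg = c(150, NA, 100),
  soil_types     = c("clay", "alluvial", NA)
)

# Number of missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with no missing value at all
n_complete <- sum(complete.cases(toy))
print(n_complete)

# Base-R way of keeping only the complete rows
toy_complete <- toy[complete.cases(toy), , drop = FALSE]
print(toy_complete)
```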
set.seed(123)
train_idx <- createDataPartition(
data_model$Y,
p = 0.8,
list = FALSE
)
train_data <- data_model[train_idx, ]
test_data <- data_model[-train_idx, ]
cat("\nDistribusi Kelas Data Training:\n")
##
## Distribusi Kelas Data Training:
print(table(train_data$Y))
##
## Rendah Sedang Tinggi
## 561 1131 540
cat("\nDistribusi Kelas Data Testing:\n")
##
## Distribusi Kelas Data Testing:
print(table(test_data$Y))
##
## Rendah Sedang Tinggi
## 140 282 135
The data are split into 80% training data (2,232 rows) and 20% testing data (557 rows). createDataPartition() samples within each level of Y, so the class proportions are nearly identical in both sets. The training data are used to build the models, while the testing data are meant to evaluate model performance on observations the models have not seen.
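The class counts printed above can be used to check that the split is stratified. This is a small arithmetic sketch on the reported counts only. It does not re-run the split.

```r
# Class counts as reported in the tables above
full  <- c(Rendah = 701, Sedang = 1413, Tinggi = 675)
train <- c(Rendah = 561, Sedang = 1131, Tinggi = 540)
test  <- c(Rendah = 140, Sedang = 282,  Tinggi = 135)

# Share of each class that went to training (should be close to 0.8)
train_share <- train / full
print(round(train_share, 4))

# Class mix inside each set (rows should be nearly identical)
class_mix <- rbind(
  train = train / sum(train),
  test  = test / sum(test)
)
print(round(class_mix, 4))
```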
dummy_obj <- dummyVars(
Y ~ .,
data = train_data,
fullRank = TRUE
)
train_x <- as.data.frame(
predict(dummy_obj, newdata = train_data)
)
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'Y' is not a factor
test_x <- as.data.frame(
predict(dummy_obj, newdata = test_data)
)
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'Y' is not a factor
names(train_x) <- make.names(names(train_x))
names(test_x) <- make.names(names(test_x))
train2 <- data.frame(
train_x,
Y = train_data$Y
)
test2 <- data.frame(
test_x,
Y = test_data$Y
)
train2$Y <- factor(train2$Y, levels = all_levels)
test2$Y <- factor(test2$Y, levels = all_levels)
cat("\nStruktur Data Training Setelah Dummy Variable:\n")
##
## Struktur Data Training Setelah Dummy Variable:
str(train2)
## 'data.frame': 2232 obs. of 32 variables:
## $ variety.delux.ponni : num 0 0 1 0 0 1 0 1 0 0 ...
## $ variety.ponmani : num 0 1 0 0 1 0 0 0 1 0 ...
## $ soil_types.clay : num 0 1 0 1 0 0 1 0 1 0 ...
## $ nursery.wet : num 0 1 0 1 0 1 1 0 0 0 ...
## $ seedrate_in_kg : num 150 150 150 150 150 150 150 150 150 150 ...
## $ nursery_area : num 120 120 120 120 120 120 120 120 120 120 ...
## $ lp_mainfield_in_tonnes : num 75 75 75 75 75 75 75 75 75 75 ...
## $ urea_40days : num 163 163 163 163 163 ...
## $ potassh_50days : num 62.3 62.3 62.3 62.3 62.3 ...
## $ micronutrients_70days : num 90 90 90 90 90 90 90 90 90 90 ...
## $ pest_60day_in_ml : num 3600 3600 3600 3600 3600 3600 3600 3600 3600 3600 ...
## $ x30d_rain_in_mm : num 19.6 19.6 18.5 18.5 18.1 18.1 19.6 19.6 18.5 18.5 ...
## $ x30_50d_rain_in_mm : num 187 187 185 185 186 ...
## $ x51_70d_rain_in_mm : num 167 167 165 165 166 ...
## $ x71_105d_rain_in_mm : num 61 61 60 60 60.2 60.2 61 61 60 60 ...
## $ min_temp_d1_d30 : num 18.5 19.5 20 19 20.5 18 18.5 19.5 20 19 ...
## $ max_temp_d1_d30 : num 34 34 35 33 32 31 34 34 35 33 ...
## $ min_temp_d31_d60 : num 16 18.5 18 17 17.5 15.5 16 18.5 18 17 ...
## $ max_temp_d31_d60 : num 30 35 30 32 28 34 30 35 30 32 ...
## $ min_temp_d61_d90 : num 15.5 17 17.5 16.5 18 15 15.5 17 17.5 16.5 ...
## $ max_temp_d61_d90 : num 31 32.5 33.5 31.5 34 33 31 32.5 33.5 31.5 ...
## $ min_temp_d91_d120 : num 16 16 18 15.5 16.5 15 16 16 18 15.5 ...
## $ max_temp_d91_d120 : num 33 30.5 33 32.5 35 31.5 33 30.5 33 32.5 ...
## $ relative_humidity_d1_d30 : num 72 64.6 85 88.5 72.7 78.6 72 64.6 85 88.5 ...
## $ relative_humidity_d31_d60 : num 78 85 96 95 91 80 78 85 96 95 ...
## $ relative_humidity_d61_d90 : num 88 84 84 81 83 92 88 84 84 81 ...
## $ relative_humidity_d91_d120: num 85 87 79 84 81 88 85 87 79 84 ...
## $ inst_wind_speed_d1_d30 : num 4 10 4 8 10 6 4 10 4 8 ...
## $ inst_wind_speed_d31_d60 : num 10 4 12 6 12 6 10 4 12 6 ...
## $ inst_wind_speed_d61_d90 : num 8 10 4 8 10 8 8 10 4 8 ...
## $ inst_wind_speed_d91_d120 : num 10 6 12 6 12 10 10 6 12 6 ...
## $ Y : Factor w/ 3 levels "Rendah","Sedang",..: 3 3 3 3 3 3 3 3 3 3 ...
The categorical variables are converted to numeric dummy variables so that every predictor is numeric, which SMOTE requires because it interpolates between feature vectors. fullRank = TRUE drops one reference level per factor, which avoids perfect collinearity among the dummy columns.
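The effect of full-rank encoding can be illustrated with base R's model.matrix(), used here as a stand-in for dummyVars(). The data frame is made up, and the column names produced by model.matrix() differ from those of dummyVars() (for example varietyponmani instead of variety.ponmani).

```r
# Made-up data with a 3-level factor, a 2-level factor, and a numeric column
toy <- data.frame(
  variety        = factor(c("CO_43", "delux ponni", "ponmani", "CO_43")),
  soil_types     = factor(c("alluvial", "clay", "alluvial", "clay")),
  seedrate_in_kg = c(150, 150, 100, 50)
)

# Treatment contrasts drop the first level of each factor (the reference);
# the intercept column is removed because only the predictors are needed
X <- model.matrix(~ ., data = toy)[, -1, drop = FALSE]
print(X)
```

A factor with k levels contributes k - 1 columns, so the 3-level variety yields two columns and the 2-level soil_types yields one. A row that is all zeros in those columns belongs to the reference levels.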
library(smotefamily)
set.seed(123)
smote_result <- SMOTE(
X = train2 %>% select(-Y),
target = train2$Y,
K = 5,
dup_size = 2
)
train2_smote <- smote_result$data
names(train2_smote)[ncol(train2_smote)] <- "Y"
train2_smote$Y <- factor(
train2_smote$Y,
levels = all_levels
)
cat("\nDistribusi Kelas Sebelum SMOTE:\n")
##
## Distribusi Kelas Sebelum SMOTE:
print(table(train2$Y))
##
## Rendah Sedang Tinggi
## 561 1131 540
cat("\nDistribusi Kelas Setelah SMOTE:\n")
##
## Distribusi Kelas Setelah SMOTE:
print(table(train2_smote$Y))
##
## Rendah Sedang Tinggi
## 561 1131 1620
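SMOTE is applied only to the training data and adds synthetic observations for the minority class. Here SMOTE() treats the smallest class, Tinggi, as the minority, and dup_size = 2 adds two synthetic points per original observation, so Tinggi grows from 540 to 1,620 observations while Rendah (561) and Sedang (1,131) are unchanged. The training set is therefore not balanced after this step: Tinggi becomes the largest class, which may bias the models toward predicting Tinggi.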
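The class counts after SMOTE follow from simple arithmetic. The sketch below assumes, as the printed tables suggest, that only the smallest class is oversampled and that dup_size is the number of synthetic points added per original minority observation. It also shows how many synthetic rows each class would need to match the largest class, as a point of comparison.

```r
# Training class counts before SMOTE, as reported above
before   <- c(Rendah = 561, Sedang = 1131, Tinggi = 540)
dup_size <- 2

# Assumption: only the smallest class is treated as the minority
minority <- names(which.min(before))

# Original rows plus dup_size synthetic rows per original row
after <- before
after[minority] <- before[minority] * (1 + dup_size)
print(after)

# Synthetic rows each class would need to reach the largest class size
needed_for_balance <- max(before) - before
print(needed_for_balance)
```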
get_metrics <- function(cm, model_name) {
precision <- mean(
cm$byClass[, "Precision"],
na.rm = TRUE
)
recall <- mean(
cm$byClass[, "Recall"],
na.rm = TRUE
)
f1 <- mean(
cm$byClass[, "F1"],
na.rm = TRUE
)
data.frame(
Metode = model_name,
Accuracy = round(as.numeric(cm$overall["Accuracy"]), 4),
Kappa = round(as.numeric(cm$overall["Kappa"]), 4),
Precision = round(precision, 4),
Recall = round(recall, 4),
F1_Score = round(f1, 4)
)
}
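Parameter optimization uses a random search with 100 iterations over the Laplace smoothing value (laplace), whether to use kernel density estimation (usekernel), and the kernel bandwidth adjustment (adjust). Two caveats apply to the results. First, e1071::naiveBayes() accepts laplace but has no usekernel or adjust argument (those belong to other implementations, such as klaR::NaiveBayes() or the naivebayes package), so both are silently ignored. In addition, laplace only affects categorical predictors, and all predictors here are numeric after dummy encoding. Every iteration therefore fits the same model, which is why all of the top ten rows report the same accuracy. Second, each candidate is scored on the testing data, so the reported accuracy is also the selection criterion.
The row reported as best (usekernel = FALSE) is simply the first of the tied rows and is not evidence that the model works better without kernel density estimation. The accuracy obtained is 85.82%.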
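One way to keep the testing data out of parameter selection is to hold out a validation subset of the training data and score the candidates on it. The sketch below is a base-R illustration under assumptions: stratified_split() is a hypothetical helper that is not part of the original script, and the labels are generated from the post-SMOTE class counts reported above. In practice the split would ideally be made before SMOTE so that synthetic points do not end up in the validation subset.

```r
# Hypothetical helper (not in the original script): returns row indices of a
# stratified subset containing a fraction p of each class
stratified_split <- function(y, p = 0.8) {
  by_class <- split(seq_along(y), y)
  picked <- lapply(by_class, function(i) {
    # index through sample.int() to avoid sample()'s behaviour on length-1 input
    i[sample.int(length(i), size = floor(p * length(i)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)

# Class labels generated from the post-SMOTE counts reported above
y <- factor(
  rep(c("Rendah", "Sedang", "Tinggi"), times = c(561, 1131, 1620)),
  levels = c("Rendah", "Sedang", "Tinggi")
)

fit_idx <- stratified_split(y, p = 0.8)
val_idx <- setdiff(seq_along(y), fit_idx)

# Candidates would be fitted on rows fit_idx and scored on rows val_idx;
# test2 would then be used once, for the final model only
print(table(y[fit_idx]))
print(table(y[val_idx]))
```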
cat("\n=================================================\n")
##
## =================================================
cat("METODE 1: NAIVE BAYES + SMOTE + TUNING\n")
## METODE 1: NAIVE BAYES + SMOTE + TUNING
cat("=================================================\n")
## =================================================
set.seed(123)
n_iter <- 100
laplace_values <- runif(n_iter, min = 0, max = 5)
usekernel_values <- sample(c(FALSE, TRUE), n_iter, replace = TRUE)
adjust_values <- runif(n_iter, min = 0.1, max = 3)
hasil_tuning_nb <- data.frame(
laplace = numeric(),
usekernel = logical(),
adjust = numeric(),
Accuracy = numeric()
)
for (i in seq_len(n_iter)) {
adj_used <- ifelse(
usekernel_values[i],
adjust_values[i],
NA
)
model_temp_nb <- naiveBayes(
Y ~ .,
data = train2_smote,
laplace = laplace_values[i],
usekernel = usekernel_values[i],
adjust = ifelse(
usekernel_values[i],
adjust_values[i],
1
)
)
pred_temp_nb <- predict(
model_temp_nb,
newdata = test2
)
pred_temp_nb <- factor(
pred_temp_nb,
levels = all_levels
)
acc_temp_nb <- mean(pred_temp_nb == test2$Y)
hasil_tuning_nb <- rbind(
hasil_tuning_nb,
data.frame(
laplace = laplace_values[i],
usekernel = usekernel_values[i],
adjust = adj_used,
Accuracy = acc_temp_nb
)
)
}
top10_nb <- hasil_tuning_nb[
order(-hasil_tuning_nb$Accuracy),
][1:10, ]
cat("\nTop 10 Kombinasi Parameter Naive Bayes:\n")
##
## Top 10 Kombinasi Parameter Naive Bayes:
print(top10_nb)
## laplace usekernel adjust Accuracy
## 1 1.4378876 FALSE NA 0.8581688
## 2 3.9415257 TRUE 2.8908409 0.8581688
## 3 2.0448846 TRUE 1.8439606 0.8581688
## 4 4.4150870 FALSE NA 0.8581688
## 5 4.7023364 TRUE 1.2674627 0.8581688
## 6 0.2277825 TRUE 2.6527150 0.8581688
## 7 2.6405274 FALSE NA 0.8581688
## 8 4.4620952 FALSE NA 0.8581688
## 9 2.7571751 FALSE NA 0.8581688
## 10 2.2830737 TRUE 0.5992981 0.8581688
best_nb <- top10_nb[1, ]
cat("\nParameter Terbaik Naive Bayes:\n")
##
## Parameter Terbaik Naive Bayes:
print(best_nb)
## laplace usekernel adjust Accuracy
## 1 1.437888 FALSE NA 0.8581688
# ============================================================
# MODEL FINAL NAIVE BAYES
# ============================================================
model_nb_final <- naiveBayes(
Y ~ .,
data = train2_smote,
laplace = best_nb$laplace,
usekernel = best_nb$usekernel,
adjust = ifelse(
is.na(best_nb$adjust),
1,
best_nb$adjust
)
)
pred_nb_final <- predict(
model_nb_final,
newdata = test2
)
pred_nb_final <- factor(
pred_nb_final,
levels = all_levels
)
cm_nb_final <- confusionMatrix(
pred_nb_final,
test2$Y
)
cat("\nConfusion Matrix - Naive Bayes:\n")
##
## Confusion Matrix - Naive Bayes:
print(cm_nb_final)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Rendah Sedang Tinggi
## Rendah 129 0 0
## Sedang 11 214 0
## Tinggi 0 68 135
##
## Overall Statistics
##
## Accuracy : 0.8582
## 95% CI : (0.8264, 0.8861)
## No Information Rate : 0.5063
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7814
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Rendah Class: Sedang Class: Tinggi
## Sensitivity 0.9214 0.7589 1.0000
## Specificity 1.0000 0.9600 0.8389
## Pos Pred Value 1.0000 0.9511 0.6650
## Neg Pred Value 0.9743 0.7952 1.0000
## Prevalence 0.2513 0.5063 0.2424
## Detection Rate 0.2316 0.3842 0.2424
## Detection Prevalence 0.2316 0.4039 0.3645
## Balanced Accuracy 0.9607 0.8594 0.9194
evaluasi_nb <- get_metrics(
cm_nb_final,
"Naive Bayes"
)
cat("\nEvaluasi Model Naive Bayes:\n")
##
## Evaluasi Model Naive Bayes:
print(evaluasi_nb)
## Metode Accuracy Kappa Precision Recall F1_Score
## 1 Naive Bayes 0.8582 0.7814 0.872 0.8934 0.8674
The final Naive Bayes model is built with the parameters selected during tuning and is then used to predict the yield class of the testing data.
cat("\n=================================================\n")
##
## =================================================
cat("METODE 2: DECISION TREE + SMOTE + TUNING\n")
## METODE 2: DECISION TREE + SMOTE + TUNING
cat("=================================================\n")
## =================================================
set.seed(123)
n_iter <- 100
dt_grid <- data.frame(
cp = runif(n_iter, min = 0.0001, max = 0.05),
maxdepth = sample(3:15, n_iter, replace = TRUE),
minsplit = sample(2:30, n_iter, replace = TRUE)
)
hasil_tuning_dt <- data.frame(
cp = numeric(),
maxdepth = numeric(),
minsplit = numeric(),
Accuracy = numeric()
)
for (i in seq_len(n_iter)) {
model_temp_dt <- rpart(
Y ~ .,
data = train2_smote,
method = "class",
control = rpart.control(
cp = dt_grid$cp[i],
maxdepth = dt_grid$maxdepth[i],
minsplit = dt_grid$minsplit[i]
)
)
pred_temp_dt <- predict(
model_temp_dt,
newdata = test2,
type = "class"
)
pred_temp_dt <- factor(
pred_temp_dt,
levels = all_levels
)
acc_temp_dt <- mean(pred_temp_dt == test2$Y)
hasil_tuning_dt <- rbind(
hasil_tuning_dt,
data.frame(
cp = dt_grid$cp[i],
maxdepth = dt_grid$maxdepth[i],
minsplit = dt_grid$minsplit[i],
Accuracy = acc_temp_dt
)
)
}
top10_dt <- hasil_tuning_dt[
order(-hasil_tuning_dt$Accuracy),
][1:10, ]
cat("\nTop 10 Kombinasi Parameter Decision Tree:\n")
##
## Top 10 Kombinasi Parameter Decision Tree:
print(top10_dt)
## cp maxdepth minsplit Accuracy
## 1 0.014450118 11 26 0.8581688
## 2 0.039436426 6 9 0.8581688
## 3 0.020507948 8 13 0.8581688
## 4 0.044162568 11 27 0.8581688
## 5 0.047029317 11 5 0.8581688
## 6 0.002373269 9 14 0.8581688
## 7 0.026452464 5 30 0.8581688
## 8 0.044631710 10 15 0.8581688
## 9 0.027616607 14 22 0.8581688
## 10 0.022885075 11 17 0.8581688
best_dt <- top10_dt[1, ]
cat("\nParameter Terbaik Decision Tree:\n")
##
## Parameter Terbaik Decision Tree:
print(best_dt)
## cp maxdepth minsplit Accuracy
## 1 0.01445012 11 26 0.8581688
# ============================================================
# MODEL FINAL DECISION TREE
# ============================================================
model_dt_final <- rpart(
Y ~ .,
data = train2_smote,
method = "class",
control = rpart.control(
cp = best_dt$cp,
maxdepth = best_dt$maxdepth,
minsplit = best_dt$minsplit
)
)
pred_dt_final <- predict(
model_dt_final,
newdata = test2,
type = "class"
)
pred_dt_final <- factor(
pred_dt_final,
levels = all_levels
)
cm_dt_final <- confusionMatrix(
pred_dt_final,
test2$Y
)
cat("\nConfusion Matrix - Decision Tree:\n")
##
## Confusion Matrix - Decision Tree:
print(cm_dt_final)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Rendah Sedang Tinggi
## Rendah 129 0 0
## Sedang 11 214 0
## Tinggi 0 68 135
##
## Overall Statistics
##
## Accuracy : 0.8582
## 95% CI : (0.8264, 0.8861)
## No Information Rate : 0.5063
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7814
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Rendah Class: Sedang Class: Tinggi
## Sensitivity 0.9214 0.7589 1.0000
## Specificity 1.0000 0.9600 0.8389
## Pos Pred Value 1.0000 0.9511 0.6650
## Neg Pred Value 0.9743 0.7952 1.0000
## Prevalence 0.2513 0.5063 0.2424
## Detection Rate 0.2316 0.3842 0.2424
## Detection Prevalence 0.2316 0.4039 0.3645
## Balanced Accuracy 0.9607 0.8594 0.9194
evaluasi_dt <- get_metrics(
cm_dt_final,
"Decision Tree"
)
cat("\nEvaluasi Model Decision Tree:\n")
##
## Evaluasi Model Decision Tree:
print(evaluasi_dt)
## Metode Accuracy Kappa Precision Recall F1_Score
## 1 Decision Tree 0.8582 0.7814 0.872 0.8934 0.8674
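The random search runs 100 iterations over combinations of the complexity parameter (cp), the maximum tree depth (maxdepth), and the minimum number of observations required to split a node (minsplit).
The ten best combinations all reach the same accuracy of 85.82% on the testing data, so the row reported as best is only the first of the tied rows and the choice among them is arbitrary. The candidates are scored on the testing data, so this accuracy is also the selection criterion.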
rpart.plot(
model_dt_final,
type = 4,
extra = 104,
fallen.leaves = TRUE,
main = "Decision Tree untuk Klasifikasi Hasil Panen Padi"
)
# ============================================================
# 15. RINGKASAN EVALUASI DUA METODE
# ============================================================
evaluasi_dua_metode <- rbind(
evaluasi_nb,
evaluasi_dt
)
cat("\n=================================================\n")
##
## =================================================
cat("RINGKASAN EVALUASI HASIL KLASIFIKASI DUA METODE\n")
## RINGKASAN EVALUASI HASIL KLASIFIKASI DUA METODE
cat("=================================================\n")
## =================================================
print(evaluasi_dua_metode)
## Metode Accuracy Kappa Precision Recall F1_Score
## 1 Naive Bayes 0.8582 0.7814 0.872 0.8934 0.8674
## 2 Decision Tree 0.8582 0.7814 0.872 0.8934 0.8674
# ============================================================
# 16. VISUALISASI EVALUASI DUA METODE
# ============================================================
evaluasi_long <- evaluasi_dua_metode %>%
pivot_longer(
cols = c(
Accuracy,
Precision,
Recall,
F1_Score
),
names_to = "Metric",
values_to = "Value"
)
ggplot(
evaluasi_long,
aes(
x = Metode,
y = Value,
fill = Metric
)
) +
geom_bar(
stat = "identity",
position = "dodge"
) +
coord_flip() +
theme_minimal() +
labs(
title = "Evaluasi Hasil Klasifikasi Naive Bayes dan Decision Tree",
x = "Metode Klasifikasi",
y = "Nilai Evaluasi"
) +
ylim(0, 1)
The final Decision Tree model is built with the parameters selected during tuning. The resulting tree splits only on seedrate_in_kg.
According to the plot, there are two split points, which give three rules:
seedrate_in_kg < 63 → class Rendah
63 ≤ seedrate_in_kg < 113 → class Sedang
seedrate_in_kg ≥ 113 → class Tinggi
This indicates that seed rate is the dominant variable for separating the yield classes. Because paddy_yield_in_kg and seedrate_in_kg are both totals, they may largely reflect the cultivated area (hectares), so this result should not be read as a causal effect of the seed rate.
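The three rules can be written as a small function. This is a sketch that hard-codes the split points 63 and 113 quoted in the text (the plot may show rounded values). classify_by_seedrate() is a hypothetical name, not part of the original script.

```r
# Hypothetical helper: applies the split rules quoted above
classify_by_seedrate <- function(seedrate_in_kg) {
  # right = FALSE gives the intervals [-Inf, 63), [63, 113), [113, Inf)
  cut(
    seedrate_in_kg,
    breaks = c(-Inf, 63, 113, Inf),
    labels = c("Rendah", "Sedang", "Tinggi"),
    right  = FALSE
  )
}

rule_pred <- classify_by_seedrate(c(50, 63, 100, 113, 150))
print(rule_pred)
```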
Evaluation uses the confusion matrix and five metrics: Accuracy, Kappa, Precision, Recall, and F1-Score. Accuracy and Kappa are computed over all testing observations, while Precision, Recall, and F1-Score are macro averages of the per-class values, so each class carries equal weight.
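The macro-averaged values can be reproduced by hand from the confusion matrix printed above. The sketch below re-enters that matrix manually and uses base R only. It is a cross-check of the reported numbers, not a replacement for get_metrics().

```r
classes <- c("Rendah", "Sedang", "Tinggi")

# Confusion matrix as printed above: rows are predictions, columns are references
cm <- matrix(
  c(129,   0,   0,
     11, 214,   0,
      0,  68, 135),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Per-class metrics
precision <- diag(cm) / rowSums(cm)  # correct / all predicted as that class
recall    <- diag(cm) / colSums(cm)  # correct / all truly in that class
f1        <- 2 * precision * recall / (precision + recall)

# Macro average: unweighted mean over the three classes
macro <- c(
  Precision = mean(precision),
  Recall    = mean(recall),
  F1_Score  = mean(f1)
)
print(round(macro, 4))
```

The rounded results agree with the Precision, Recall, and F1_Score columns reported for both models (0.872, 0.8934, 0.8674).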
cat("\n===================================\n")
##
## ===================================
cat("OUTPUT AKHIR KLASIFIKASI\n")
## OUTPUT AKHIR KLASIFIKASI
cat("===================================\n")
## ===================================
cat("\nMetode 1: Naive Bayes\n")
##
## Metode 1: Naive Bayes
cat("Accuracy :", evaluasi_nb$Accuracy, "\n")
## Accuracy : 0.8582
cat("Kappa :", evaluasi_nb$Kappa, "\n")
## Kappa : 0.7814
cat("Precision :", evaluasi_nb$Precision, "\n")
## Precision : 0.872
cat("Recall :", evaluasi_nb$Recall, "\n")
## Recall : 0.8934
cat("F1-Score :", evaluasi_nb$F1_Score, "\n")
## F1-Score : 0.8674
cat("\nMetode 2: Decision Tree\n")
##
## Metode 2: Decision Tree
cat("Accuracy :", evaluasi_dt$Accuracy, "\n")
## Accuracy : 0.8582
cat("Kappa :", evaluasi_dt$Kappa, "\n")
## Kappa : 0.7814
cat("Precision :", evaluasi_dt$Precision, "\n")
## Precision : 0.872
cat("Recall :", evaluasi_dt$Recall, "\n")
## Recall : 0.8934
cat("F1-Score :", evaluasi_dt$F1_Score, "\n")
## F1-Score : 0.8674
cat("\n===================================\n")
##
## ===================================
An accuracy of 85.82% means that about 86 of every 100 testing observations are classified correctly. A Kappa of 0.7814 indicates substantial agreement beyond chance. The macro-averaged precision (0.8720), recall (0.8934), and F1-score (0.8674) are high overall, but the per-class results are uneven: 68 of the 282 Sedang observations are predicted as Tinggi, which lowers the recall of Sedang to 0.7589 and the precision of Tinggi to 0.6650.
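Accuracy and Cohen's Kappa can likewise be recomputed from the printed confusion matrix. The sketch re-enters the matrix by hand. Kappa compares the observed agreement with the agreement expected by chance from the row and column totals.

```r
# Confusion matrix as printed above: rows are predictions, columns are references
cm <- matrix(
  c(129,   0,   0,
     11, 214,   0,
      0,  68, 135),
  nrow = 3, byrow = TRUE
)

n <- sum(cm)

# Observed agreement (this is the accuracy)
p_observed <- sum(diag(cm)) / n

# Agreement expected by chance, from the marginal totals
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2

kappa <- (p_observed - p_expected) / (1 - p_expected)
print(round(c(Accuracy = p_observed, Kappa = kappa), 4))
```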
Because both methods give identical performance on the testing data, Decision Tree is recommended: its decision rules are easy to read from the plot, which makes it the more interpretable of the two.