Klasifikasi Peringkat Akreditasi Sekolah menggunakan K-Nearest Neighbors
Analisis ini bertujuan untuk memprediksi peringkat akreditasi sekolah di Sumatera Selatan dan Sumatera Barat menggunakan k-Nearest Neighbors (k-NN) , serta menilai akurasi skor literasi dan numerasi 2023–2024 sebagai prediktor akreditasi.
1 Install Package
Warning: package 'ggplot2' was built under R version 4.5.1
Warning: package 'dplyr' was built under R version 4.5.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
✔ broom 1.0.8 ✔ rsample 1.3.0
✔ dials 1.4.0 ✔ tune 1.3.0
✔ infer 1.0.8 ✔ workflows 1.2.0
✔ modeldata 1.4.0 ✔ workflowsets 1.1.1
✔ parsnip 1.3.2 ✔ yardstick 1.3.2
✔ recipes 1.3.1
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
Warning: package 'DataExplorer' was built under R version 4.5.1
Warning: package 'VGAM' was built under R version 4.5.1
Loading required package: stats4
Loading required package: splines
Attaching package: 'VGAM'
The following object is masked from 'package:workflows':
update_formula
Warning: package 'themis' was built under R version 4.5.1
corrplot 0.95 loaded
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
Warning: package 'kknn' was built under R version 4.5.1
2 Impor Data
Code
Rows: 372
Columns: 7
$ Nomor <dbl> 32, 38, 40, 47, 55, 86, 100, 110, 112, 133, 154, …
$ provinsi <chr> "Prov. Sumatera Barat", "Prov. Sumatera Barat", "…
$ Peringkat_Akreditasi <chr> "B", "A", "B", "A", "A", "B", "C", "A", "A", "B",…
$ Lit_2023 <dbl> 51.11, 91.11, 12.90, 77.78, 100.00, 43.75, 50.00,…
$ Num_2023 <dbl> 44.44, 93.33, 19.35, 66.67, 96.97, 50.00, 56.25, …
$ Lit_2024 <dbl> 61.364, 93.333, 31.579, 60.000, 100.000, 50.000, …
$ Num_2024 <dbl> 60.000, 86.667, 10.526, 55.556, 96.552, 50.000, 1…
3 Pembagian Data
4 Eksplorasi Data
4.1 Visualisasi Gambaran umum dataset
Data terdiri dari 20% variabel diskrit dan 80% variabel kontinu, dengan tidak terdapat missing value
Lit_2023 Num_2023 Lit_2024 Num_2024
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
1st Qu.: 40.63 1st Qu.: 44.44 1st Qu.: 48.89 1st Qu.: 50.00
Median : 64.44 Median : 60.00 Median : 71.43 Median : 68.18
Mean : 61.70 Mean : 59.62 Mean : 66.13 Mean : 66.02
3rd Qu.: 84.44 3rd Qu.: 77.78 3rd Qu.: 86.67 3rd Qu.: 84.61
Max. :100.00 Max. :100.00 Max. :100.00 Max. :100.00
4.2 Visualisasi Target Variabel
Code
Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(count)` instead.
Sekolah dengan akreditasi C memiliki proporsi paling sedikit (22,6%), jauh lebih rendah dibandingkan dengan akreditasi A dan B
Code
# Agregasi & label persentase
df_pie <- train %>%
count(Peringkat_Akreditasi, name = "n") %>%
mutate(prop = n / sum(n),
pct = percent(prop, accuracy = 0.1))
# Pie chart (persentase saja)
ggplot(df_pie, aes(x = "", y = prop, fill = Peringkat_Akreditasi)) +
geom_col(width = 1, color = "white") +
coord_polar(theta = "y") +
geom_text(aes(label = pct), position = position_stack(vjust = 0.5), size = 4) +
theme_void() +
labs(title = "Distribusi Peringkat Akreditasi", fill = "Peringkat")Berdasarkan visualisasi di atas, semakin mempertegas bahwa Target (Y) memiliki distribusi kelas yang tidak seimbang (imbalance class). Oleh karena itu, perlu untuk mempertimbangkan beberapa metode penanganan ketidakseimbangan data sebelum melakukan pemodelan.
4.3 Distribusi fitur numerik
Code
train %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
facet_wrap(~ variable, scales = "free") +
geom_density(fill = "lightgreen", alpha = 0.5, na.rm = TRUE) +
labs(
title = "Distribusi Fitur Numerik",
x = "Nilai",
y = "Kepadatan"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
strip.text.x = element_text(size = 6)
)4.4 Boxplot per fitur numerik
Code
train %>%
select(where(is.numeric)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = "", y = value)) +
geom_boxplot(fill = "lightgreen", alpha = 0.5, outlier.alpha = 0.4) +
facet_wrap(~ variable, scales = "free_y") +
labs(title = "Boxplot Fitur Numerik", x = NULL, y = "Nilai") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
strip.text.x = element_text(size = 6))4.5 Boxplot Skor Literasi & Numerasi per kategori
Code
Sekolah berakreditasi A memiliki median literasi dan numerasi tertinggi, sedangkan akreditasi C menunjukkan variasi nilai yang lebih lebar. Selain itu, Outlier paling banyak muncul pada sekolah berakreditasi A
4.6 Korelasi Variabel prediktor
Lit_2023 Num_2023 Lit_2024 Num_2024
Lit_2023 1.0000000 0.8491384 0.7486465 0.7087073
Num_2023 0.8491384 1.0000000 0.7599339 0.6766126
Lit_2024 0.7486465 0.7599339 1.0000000 0.8275638
Num_2024 0.7087073 0.6766126 0.8275638 1.0000000
Code
4.7 Distribusi Fitur VS Target
Code
num_cols <- train %>% select(where(is.numeric)) %>% names()
train %>%
select(all_of(num_cols), Peringkat_Akreditasi) %>%
pivot_longer(cols = all_of(num_cols),
names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value, fill = Peringkat_Akreditasi)) +
facet_wrap(~ variable, scales = "free") +
geom_density(alpha = 0.5) +
scale_fill_brewer(palette = "Set1") +
labs(
title = "Distribusi Fitur Berdasarkan Peringkat Akreditasi",
x = "Nilai",
y = "Kepadatan",
fill = "Peringkat"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "top")Secara umum keempat fitur (Lit_2023, Lit_2024, Num_2023, Num_2024) menunjukkan gradasi yang rapi—kelas A terkonsentrasi pada nilai tinggi, B pada rentang menengah, dan C pada nilai rendah—sehingga fitur-fitur ini informatif untuk memisahkan peringkat; pemisahan terlihat lebih jelas pada tahun 2024 dibanding 2023, khususnya literasi, meski masih ada tumpang tindih di rentang sekitar 45–60 (B–C) dan 70–80 (A–B) yang menjadi sumber salah-klasifikasi pada k-NN.
4.8 PairPlot
Code
Literasi dan numerasi memiliki korelasi positif yang kuat, terutama pada sekolah berakreditasi A, sedangkan akreditasi C menunjukkan korelasi lebih lemah.
5 Preprosessing
5.1 Normalisasi Data
Code
# Normalisasi training
x_train_scaled_mat <- scale(predictor_train)
center_vec <- attr(x_train_scaled_mat, "scaled:center")
scale_vec <- attr(x_train_scaled_mat, "scaled:scale")
# Bentuk dataset TRAIN untuk k-NN
df_knn_train <- cbind(as.data.frame(x_train_scaled_mat),
Peringkat_Akreditasi = label_train)
# simpan parameter skala untuk dipakai di TEST
scaler <- list(center = center_vec, scale = scale_vec, cols = colnames(predictor_train))5.2 Imbalance Handling
6 Pemodelan
6.1 K-fold Cross Validation
Code
Code
# Proporsi Kelas pada Tiap Assessment Fold
df_knn_train %>%
mutate(Row = row_number()) %>%
left_join(tidy(folds), by = "Row") %>%
count(Peringkat_Akreditasi, Fold, Data) %>%
group_by(Fold, Data) %>%
mutate(pct = n / sum(n)) %>%
filter(Data == "Assessment") %>%
ggplot(aes(x = Fold, y = pct, fill = Peringkat_Akreditasi)) +
geom_col(position = "fill") +
labs(title = "Proporsi Kelas di Tiap Assessment Fold",
x = "Fold", y = "Proporsi", fill = "Kelas") +
theme_classic()Code
# Tabel komposisi kelas per fold (Analysis & Assessment)
komposisi_folds <-
df_knn_train %>%
mutate(Row = row_number()) %>%
select(Row, Peringkat_Akreditasi) %>%
left_join(tidy(folds), by = "Row") %>%
count(Fold, Data, Peringkat_Akreditasi, name = "n") %>%
group_by(Fold, Data) %>%
mutate(percent = 100 * n / sum(n)) %>%
arrange(Fold, Data, Peringkat_Akreditasi) %>%
ungroup()
komposisi_folds6.2 Mendefinisikan Model KNN - Hyperparameter Tuning
Code
# Nilai k dg akurasi terbaik
# Berdasarkan akurasi
best_neighbors_acc <- select_best(knn_tune_cv, metric = "accuracy")
# Berdasarkan balance
best_neighbors_ba <- select_best(knn_tune_cv, metric = "bal_accuracy")
# Nilai k dengan aturan 1-SE
best_neighbors_1se <- select_by_one_std_err(
knn_tune_cv,
metric = "accuracy",
desc(neighbors)
)Code
# ringkas per k (supaya 1 titik per k)
acc_best_by_k <- neighbors_result %>%
dplyr::filter(.metric == "accuracy", .estimator == "multiclass") %>%
dplyr::group_by(neighbors) %>%
dplyr::slice_max(mean, with_ties = FALSE) %>%
dplyr::ungroup()
# plot: 3 garis, 3 warna, 3 breaks (cocokkan label persis!)
ggplot(acc_best_by_k, aes(x = neighbors, y = mean)) +
geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err), width = 0) +
geom_point() +
geom_vline(aes(xintercept = best_neighbors_acc$neighbors,
color = "Highest accuracy"),
linetype = "dashed", linewidth = 0.8) +
geom_vline(aes(xintercept = best_neighbors_ba$neighbors,
color = "Highest balanced accuracy"),
linetype = "dashed", linewidth = 0.8) +
geom_vline(aes(xintercept = best_neighbors_1se$neighbors,
color = "1-SE-Rule"),
linetype = "dashed", linewidth = 0.8) +
ylab("Accuracy (CV)") +
scale_x_continuous(n.breaks = 12) +
scale_color_manual(
values = c(
"Highest accuracy" = "#03A9F4",
"Highest balanced accuracy" = "#9C27B0",
"1-SE-Rule" = "#f44e03"
),
breaks = c("Highest accuracy", "Highest balanced accuracy", "1-SE-Rule"),
name = "Selection"
) +
theme_bw() +
theme(legend.position = "top")Code
# pilih konfigurasi terbaik dari hasil tuning
best_params <- select_best(knn_tune_cv, metric = "accuracy")
# finalisasi WORKFLOW (recipe + model) dgn semua parameter terbaik
final_wf <- finalize_workflow(wf_smote, best_params)
# fit ke seluruh TRAIN yang sudah ternormalisasi
knn_fit_tr <- fit(final_wf, data = df_knn_train)
knn_fit_tr══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_zv()
• step_smote()
── Model ───────────────────────────────────────────────────────────────────────
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(27L, data, 5), distance = ~1.03389830508475, kernel = ~"rectangular")
Type of response variable: nominal
Minimal misclassification: 0.4666667
Best kernel: rectangular
Best k: 27
Code
# Kumpulkan seluruh metrik dari hasil tuning
metrics_all <- collect_metrics(knn_tune_cv)
# URUTKAN BERDASARKAN AKURASI TERTINGGI (semua kombinasi)
acc_all <- metrics_all %>%
filter(.metric == "accuracy", .estimator == "multiclass") %>%
arrange(desc(mean))
# Tampilkan TOP-10 kombinasi akurasi tertinggi (lengkap dgn konfigurasi)
top10_acc <- acc_all %>%
transmute(
k = neighbors,
p = dist_power,
weight = weight_func,
accuracy = round(mean, 4),
se = round(std_err, 4),
.config
) %>%
slice_head(n = 10)
top10_accCode
# METRIK LAIN untuk KONFIGURASI TERBAIK MENURUT AKURASI
best_acc_row <- acc_all %>% slice_head(n = 1) # baris akurasi tertinggi
best_cfg_metrics <- metrics_all %>%
semi_join(best_acc_row %>% select(neighbors, dist_power, weight_func),
by = c("neighbors","dist_power","weight_func")) %>%
select(.metric, mean, std_err) %>%
arrange(desc(.metric)) %>%
mutate(across(c(mean, std_err), ~round(.x, 4)))
best_cfg_metricsCode
# 1) Ambil semua hasil akurasi dari CV (tiap kombinasi hyperparameter)
acc_tbl <- collect_metrics(knn_tune_cv) %>%
filter(.metric == "accuracy", .estimator == "multiclass") %>%
transmute(
k = neighbors,
p = dist_power,
weight = weight_func, # jika kamu men-tune weight_func
accuracy = round(mean, 4),
se = round(std_err, 4),
lower = pmax(0, round(mean - 1.96*std_err, 4)), # 95% CI (bawah)
upper = pmin(1, round(mean + 1.96*std_err, 4)), # 95% CI (atas)
.config
) %>%
arrange(desc(accuracy), se)
# 2) Konfigurasi dengan akurasi TERTINGGI (skema terbaik)
best_acc_row <- acc_tbl %>% slice(1)
best_acc_row6.3 Evaluasi Model Data Testing
Code
predictor_test <- test %>% dplyr::select(-Peringkat_Akreditasi)
x_test_scaled <- as.data.frame(
scale(predictor_test, center = scaler$center, scale = scaler$scale)
)
df_knn_test <- dplyr::bind_cols(
x_test_scaled,
tibble::tibble(
Peringkat_Akreditasi = factor(
test$Peringkat_Akreditasi,
levels = levels(df_knn_train$Peringkat_Akreditasi)
)
)
)Code
# Prediksi & metrik di TEST
pred_class <- predict(knn_fit_tr, new_data = df_knn_test)
pred_prob <- predict(knn_fit_tr, new_data = df_knn_test, type = "prob")
test_res <- dplyr::bind_cols(df_knn_test, pred_class, pred_prob)
# metrik ringkas
metrics_test <- yardstick::metric_set(
accuracy, bal_accuracy, sens, spec, precision, recall, f_meas, kap
)
metrics_test(
test_res, truth = Peringkat_Akreditasi, estimate = .pred_class
)Code
Truth
Prediction A B C
A 24 8 1
B 2 7 3
C 3 14 13
Code
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.