Provinsi seperti DKI Jakarta dan DI Yogyakarta yang memiliki Indeks Pembangunan Manusia (IPM) tinggi dan angka harapan hidup panjang, ternyata juga memiliki persentase kelulusan SMA yang tinggi. Penelitian ini bertujuan untuk menganalisis: seberapa kuat hubungan antara kualitas hidup menyeluruh dengan tingkat pendidikan?
Pertanyaan penelitian: - Apakah daerah yang lebih sejahtera otomatis memiliki pendidikan yang lebih baik? - Faktor apa saja yang paling berpengaruh terhadap keberhasilan pendidikan SMA?
Analisis menggunakan data BPS 2023 dari 34 provinsi di Indonesia dengan variabel:
Sumber Data: - Data IPM BPS - Data Angka Harapan Hidup - Data APM SMA - Data TPP SMA
# Install dan load packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
readr, dplyr, ggplot2, tidyr, stats, car,
broom, GGally, ggcorrplot, performance,
kableExtra, reshape2, lmtest, nortest
)
blue_main <- "#1976d2"
blue_soft <- "#bbdefb"
blue_dark <- "#1565c0"
theme_set(
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", color = blue_dark, size = 14),
plot.subtitle = element_text(color = "gray40", size = 11),
axis.title = element_text(color = blue_dark),
panel.grid.minor = element_blank()
)
)
# Set working directory (sesuaikan dengan lokasi file Anda)
# setwd("D:/DOWNLOAD/tugas 2 data sains/tugas 2")
# Import data
ipm <- read_delim("IPM.csv", delim = ";", trim_ws = TRUE, show_col_types = FALSE)
ahh <- read_delim("AHH.csv", delim = ";", trim_ws = TRUE, show_col_types = FALSE)
apm_sma <- read_delim("APM SMA.csv", delim = ";", trim_ws = TRUE, show_col_types = FALSE)
tpp <- read_delim("TPP.csv", delim = ";", trim_ws = TRUE, show_col_types = FALSE)
cat("✓ Data berhasil diimport\n")
## ✓ Data berhasil diimport
# Merge semua dataset
key_col <- colnames(ipm)[1]
Data <- ipm %>%
left_join(ahh, by = key_col) %>%
left_join(apm_sma, by = key_col) %>%
left_join(tpp, by = key_col)
# Rename dan convert tipe data
Data_clean <- Data %>%
select(1, 2, 3, 4, 5) %>%
setNames(c("Provinsi", "IPM", "AHH", "APM_SMA", "TPP_SMA")) %>%
mutate(
IPM = as.numeric(gsub(",", ".", IPM)),
AHH = as.numeric(gsub(",", ".", AHH)),
APM_SMA = as.numeric(gsub(",", ".", APM_SMA)),
TPP_SMA = as.numeric(gsub(",", ".", TPP_SMA))
) %>%
filter(!is.na(Provinsi) & !is.na(IPM)) # Remove rows dengan NA
Data <- Data_clean
cat("Dimensi data setelah cleaning:", nrow(Data), "baris ×", ncol(Data), "kolom\n")
## Dimensi data setelah cleaning: 34 baris × 5 kolom
# Cek missing values
missing_summary <- colSums(is.na(Data))
kable(data.frame(
Variabel = names(missing_summary),
Missing = missing_summary,
Persen = round(missing_summary / nrow(Data) * 100, 2)
)) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Variabel | Missing | Persen | |
|---|---|---|---|
| Provinsi | Provinsi | 0 | 0 |
| IPM | IPM | 0 | 0 |
| AHH | AHH | 0 | 0 |
| APM_SMA | APM_SMA | 0 | 0 |
| TPP_SMA | TPP_SMA | 0 | 0 |
summary(Data[, -1]) # Exclude kolom Provinsi
## IPM AHH APM_SMA TPP_SMA
## Min. :66.16 Min. :66.10 Min. : 95.43 Min. :39.50
## 1st Qu.:71.00 1st Qu.:69.28 1st Qu.:102.69 1st Qu.:60.84
## Median :72.78 Median :70.90 Median :106.70 Median :67.02
## Mean :72.91 Mean :70.75 Mean :107.01 Mean :65.81
## 3rd Qu.:73.75 3rd Qu.:72.28 3rd Qu.:111.89 3rd Qu.:69.79
## Max. :82.46 Max. :75.20 Max. :120.85 Max. :89.69
Interpretasi Statistik Deskriptif:
Dari 34 provinsi:
desc_stats <- Data %>%
select(-Provinsi) %>%
summarise(across(everything(), list(
Min = ~min(., na.rm = TRUE),
Q1 = ~quantile(., 0.25, na.rm = TRUE),
Median = ~median(., na.rm = TRUE),
Mean = ~mean(., na.rm = TRUE),
Q3 = ~quantile(., 0.75, na.rm = TRUE),
Max = ~max(., na.rm = TRUE),
SD = ~sd(., na.rm = TRUE)
))) %>%
pivot_longer(everything(), names_to = "Stat", values_to = "Value") %>%
separate(Stat, into = c("Variable", "Statistic"), sep = "_") %>%
pivot_wider(names_from = Statistic, values_from = Value)
kable(desc_stats, digits = 2, caption = "Statistik Deskriptif Lengkap") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
row_spec(0, bold = TRUE, color = "white", background = blue_main)
| Variable | Min | Q1 | Median | Mean | Q3 | Max | SD | SMA |
|---|---|---|---|---|---|---|---|---|
| IPM | 66.16 | 71 | 72.785 | 72.90618 | 73.7475 | 82.46 | 3.426929 | NULL |
| AHH | 66.1 | 69.275 | 70.9 | 70.75294 | 72.275 | 75.2 | 2.4229 | NULL |
| APM | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 95.430000, 102.690000, 106.705000, 107.007647, 111.887500, 120.850000, 6.381486 |
| TPP | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 39.50000, 60.84500, 67.01500, 65.81235, 69.79250, 89.69000, 10.68780 |
Data %>%
select(-Provinsi) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value, fill = Variable)) +
geom_histogram(bins = 15, color = "white", alpha = 0.8) +
facet_wrap(~Variable, scales = "free", ncol = 2) +
scale_fill_manual(values = c(blue_main, "#42a5f5", "#64b5f6", "#90caf9")) +
labs(title = "Distribusi Semua Variabel",
subtitle = "Sumber: Data BPS 2023",
x = "Nilai", y = "Frekuensi") +
theme(legend.position = "none")
Interpretasi Distribusi: - IPM dan AHH: distribusi relatif normal dengan slight skewness - APM SMA: distribusi yang menunjukkan sebagian besar provinsi memiliki akses tinggi - TPP SMA: distribusi dengan beberapa outlier rendah (provinsi tertinggal)
Data %>%
select(-Provinsi) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Variable, y = Value, fill = Variable)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c(blue_main, "#42a5f5", "#64b5f6", "#90caf9")) +
labs(title = "Boxplot Perbandingan Semua Variabel",
subtitle = "Identifikasi outlier dan distribusi",
x = "", y = "Nilai") +
theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
cor_ipm_tpp <- cor(Data$IPM, Data$TPP_SMA, use = "complete.obs")
ggplot(Data, aes(x = IPM, y = TPP_SMA)) +
geom_point(color = "#64b5f6", size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = blue_main, fill = blue_soft) +
labs(
title = "Hubungan IPM dan Tingkat Penyelesaian Pendidikan SMA",
subtitle = sprintf("Korelasi Pearson: r = %.3f | Sumber: Data BPS 2023", cor_ipm_tpp),
x = "Indeks Pembangunan Manusia (IPM)",
y = "Tingkat Penyelesaian Pendidikan SMA (%)"
)
Interpretasi: Korelasi positif kuat (r = 0.761) menunjukkan bahwa peningkatan IPM sangat terkait dengan peningkatan kelulusan SMA. Tidak ada provinsi dengan IPM tinggi yang memiliki TPP rendah, menunjukkan IPM sebagai necessary but not sufficient condition.
cor_ahh_tpp <- cor(Data$AHH, Data$TPP_SMA, use = "complete.obs")
ggplot(Data, aes(x = AHH, y = TPP_SMA)) +
geom_point(color = "#64b5f6", size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = blue_main, fill = blue_soft) +
labs(
title = "Hubungan Angka Harapan Hidup dan TPP SMA",
subtitle = sprintf("Korelasi Pearson: r = %.3f | Sumber: Data BPS 2023", cor_ahh_tpp),
x = "Angka Harapan Hidup (Tahun)",
y = "Tingkat Penyelesaian Pendidikan SMA (%)"
)
cor_apm_tpp <- cor(Data$APM_SMA, Data$TPP_SMA, use = "complete.obs")
ggplot(Data, aes(x = APM_SMA, y = TPP_SMA)) +
geom_point(color = "#64b5f6", size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = blue_main, fill = blue_soft) +
labs(
title = "Hubungan APM SMA dan TPP SMA",
subtitle = sprintf("Korelasi Pearson: r = %.3f | Sumber: Data BPS 2023", cor_apm_tpp),
x = "Angka Partisipasi Murni SMA (%)",
y = "Tingkat Penyelesaian Pendidikan SMA (%)"
)
Catatan: Korelasi negatif menunjukkan hubungan yang counter-intuitive - kemungkinan karena APM mengukur akses, sementara TPP mengukur kelulusan. Provinsi dengan akses tinggi belum tentu memiliki kualitas yang mendukung kelulusan.
cor_data <- Data %>% select(IPM, AHH, APM_SMA, TPP_SMA)
cor_matrix <- cor(cor_data, method = "pearson", use = "complete.obs")
kable(cor_matrix, digits = 3, caption = "Matrix Korelasi Pearson") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = blue_main) %>%
row_spec(1:nrow(cor_matrix), background = "#e3f2fd")
| IPM | AHH | APM_SMA | TPP_SMA | |
|---|---|---|---|---|
| IPM | 1.000 | 0.739 | -0.172 | 0.761 |
| AHH | 0.739 | 1.000 | -0.061 | 0.543 |
| APM_SMA | -0.172 | -0.061 | 1.000 | -0.175 |
| TPP_SMA | 0.761 | 0.543 | -0.175 | 1.000 |
ggcorrplot(
cor_matrix,
method = "circle",
type = "lower",
lab = TRUE,
lab_size = 4,
colors = c("#e3f2fd", "white", blue_main),
title = "Heatmap Korelasi antar Variabel",
ggtheme = theme_minimal()
)
cor_test_results <- data.frame(
Pasangan = c("IPM - TPP_SMA", "AHH - TPP_SMA", "APM_SMA - TPP_SMA"),
Korelasi = c(
cor.test(Data$IPM, Data$TPP_SMA)$estimate,
cor.test(Data$AHH, Data$TPP_SMA)$estimate,
cor.test(Data$APM_SMA, Data$TPP_SMA)$estimate
),
P_value = c(
cor.test(Data$IPM, Data$TPP_SMA)$p.value,
cor.test(Data$AHH, Data$TPP_SMA)$p.value,
cor.test(Data$APM_SMA, Data$TPP_SMA)$p.value
)
) %>%
mutate(
Signifikansi = ifelse(P_value < 0.001, "***",
ifelse(P_value < 0.01, "**",
ifelse(P_value < 0.05, "*", "ns")))
)
kable(cor_test_results, digits = 4, caption = "Uji Signifikansi Korelasi") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = blue_main)
| Pasangan | Korelasi | P_value | Signifikansi |
|---|---|---|---|
| IPM - TPP_SMA | 0.7609 | 0.0000 | *** |
| AHH - TPP_SMA | 0.5431 | 0.0009 | *** |
| APM_SMA - TPP_SMA | -0.1747 | 0.3230 | ns |
Interpretasi Korelasi:
Data <- Data %>%
mutate(Wilayah = case_when(
Provinsi %in% c("Aceh", "Sumatera Utara", "Sumatera Barat",
"Riau", "Jambi", "Sumatera Selatan", "Bengkulu",
"Lampung", "Kepulauan Bangka Belitung",
"Kepulauan Riau") ~ "Sumatera",
Provinsi %in% c("DKI Jakarta", "Jawa Barat", "Jawa Tengah",
"DI Yogyakarta", "Jawa Timur", "Banten") ~ "Jawa",
Provinsi %in% c("Bali", "Nusa Tenggara Barat",
"Nusa Tenggara Timur") ~ "Nusa Tenggara",
Provinsi %in% c("Kalimantan Barat", "Kalimantan Tengah",
"Kalimantan Selatan", "Kalimantan Timur",
"Kalimantan Utara") ~ "Kalimantan",
TRUE ~ "Indonesia Timur"
))
table_wilayah <- table(Data$Wilayah)
kable(data.frame(Wilayah = names(table_wilayah),
Jumlah_Provinsi = as.vector(table_wilayah))) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Wilayah | Jumlah_Provinsi |
|---|---|
| Indonesia Timur | 34 |
stats_wilayah <- Data %>%
group_by(Wilayah) %>%
summarise(
N = n(),
Rata_TPP = mean(TPP_SMA, na.rm = TRUE),
SD_TPP = sd(TPP_SMA, na.rm = TRUE),
Rata_IPM = mean(IPM, na.rm = TRUE),
SD_IPM = sd(IPM, na.rm = TRUE)
) %>%
arrange(desc(Rata_TPP))
kable(stats_wilayah, digits = 2,
caption = "Statistik TPP SMA dan IPM per Wilayah") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = blue_main)
| Wilayah | N | Rata_TPP | SD_TPP | Rata_IPM | SD_IPM |
|---|---|---|---|---|---|
| Indonesia Timur | 34 | 65.81 | 10.69 | 72.91 | 3.43 |
# Cek jumlah kategori
n_wilayah <- length(unique(Data$Wilayah[!is.na(Data$Wilayah)]))
if (n_wilayah >= 2) {
# ANOVA test
anova_model <- aov(TPP_SMA ~ Wilayah, data = Data)
cat("=== ANOVA: TPP SMA ~ Wilayah ===\n\n")
print(summary(anova_model))
# Tukey HSD jika signifikan
anova_pval <- summary(anova_model)[[1]]$`Pr(>F)`[1]
if (anova_pval < 0.05) {
cat("\n\n=== Tukey HSD Post-hoc Test ===\n")
tukey_result <- TukeyHSD(anova_model)
print(tukey_result)
# Plot Tukey
par(mar = c(5, 10, 4, 2))
plot(tukey_result, las = 1, col = blue_main)
} else {
cat("\n\nTidak ada perbedaan signifikan antar wilayah (p > 0.05)\n")
}
} else {
cat("CATATAN: Data hanya memiliki", n_wilayah,
"kategori wilayah. ANOVA memerlukan minimal 2 grup.\n")
cat("Kemungkinan semua provinsi terkategorisasi ke satu wilayah saja.\n")
cat("Silakan periksa kembali logika kategorisasi wilayah.\n")
}
## CATATAN: Data hanya memiliki 1 kategori wilayah. ANOVA memerlukan minimal 2 grup.
## Kemungkinan semua provinsi terkategorisasi ke satu wilayah saja.
## Silakan periksa kembali logika kategorisasi wilayah.
if (n_wilayah >= 2) {
ggplot(Data, aes(x = reorder(Wilayah, TPP_SMA, FUN = median),
y = TPP_SMA, fill = Wilayah)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.3, size = 2) +
scale_fill_manual(values = c("#1976d2", "#42a5f5", "#64b5f6",
"#90caf9", "#bbdefb")) +
labs(
title = "Distribusi TPP SMA per Wilayah",
subtitle = "Urutan berdasarkan median",
x = "Wilayah",
y = "TPP SMA (%)"
) +
theme(
legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1)
) +
coord_flip()
} else {
cat("Visualisasi tidak dapat dibuat karena hanya ada 1 kategori wilayah.\n")
}
## Visualisasi tidak dapat dibuat karena hanya ada 1 kategori wilayah.
Interpretasi ANOVA:
Jika hasil menunjukkan F-statistic signifikan (p < 0.05): - Terdapat perbedaan signifikan rata-rata TPP SMA antar wilayah - Tukey HSD mengidentifikasi pasangan wilayah mana yang berbeda signifikan - Kesenjangan pendidikan bersifat geografis dan sistemik
Catatan: Jika hasil menunjukkan “hanya 1 kategori”, maka ada masalah dalam kategorisasi wilayah (kemungkinan nama provinsi dalam data tidak match dengan logika case_when).
# Fit model
model_lm <- lm(TPP_SMA ~ IPM + AHH + APM_SMA, data = Data)
# Summary model
summary(model_lm)
##
## Call:
## lm(formula = TPP_SMA ~ IPM + AHH + APM_SMA, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.447 -1.312 1.390 3.278 12.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -92.65525 43.89386 -2.111 0.043224 *
## IPM 2.43422 0.55709 4.370 0.000137 ***
## AHH -0.16094 0.77762 -0.207 0.837434
## APM_SMA -0.07117 0.20183 -0.353 0.726854
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.252 on 30 degrees of freedom
## Multiple R-squared: 0.5815, Adjusted R-squared: 0.5396
## F-statistic: 13.89 on 3 and 30 DF, p-value: 7.394e-06
coef_table <- tidy(model_lm, conf.int = TRUE) %>%
mutate(
Signifikansi = case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
TRUE ~ "ns"
)
)
kable(coef_table, digits = 4,
caption = "Koefisien Regresi dan Interval Kepercayaan 95%") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = blue_main)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high | Signifikansi |
|---|---|---|---|---|---|---|---|
| (Intercept) | -92.6553 | 43.8939 | -2.1109 | 0.0432 | -182.2985 | -3.0120 |
|
| IPM | 2.4342 | 0.5571 | 4.3695 | 0.0001 | 1.2965 | 3.5720 | *** |
| AHH | -0.1609 | 0.7776 | -0.2070 | 0.8374 | -1.7491 | 1.4272 | ns |
| APM_SMA | -0.0712 | 0.2018 | -0.3526 | 0.7269 | -0.4834 | 0.3410 | ns |
Interpretasi Model:
Model:
TPP_SMA = β₀ + β₁(IPM) + β₂(AHH) + β₃(APM_SMA) + ε
Hasil: - R² = 0.581: Model menjelaskan ~58% variasi dalam TPP SMA - F-statistic signifikan (p < 0.001): Model secara keseluruhan valid
Koefisien: 1. IPM: β = 2.434 (p < 0.001) ⭐ - Setiap kenaikan 1 poin IPM → TPP SMA naik 2.43% - Prediktor terkuat dan paling signifikan
par(mfrow = c(2, 2))
plot(model_lm, col = blue_main, pch = 19)
par(mfrow = c(1, 1))
# Shapiro-Wilk test
shapiro_test <- shapiro.test(residuals(model_lm))
cat("=== Uji Normalitas Residual (Shapiro-Wilk) ===\n")
## === Uji Normalitas Residual (Shapiro-Wilk) ===
cat("W statistic:", round(shapiro_test$statistic, 4), "\n")
## W statistic: 0.8579
cat("P-value:", format.pval(shapiro_test$p.value, digits = 4), "\n")
## P-value: 0.0004229
if (shapiro_test$p.value > 0.05) {
cat("✓ Asumsi normalitas TERPENUHI (p > 0.05)\n")
} else {
cat("✗ Asumsi normalitas TIDAK terpenuhi (p < 0.05)\n")
cat(" Pertimbangkan transformasi atau metode robust\n")
}
## ✗ Asumsi normalitas TIDAK terpenuhi (p < 0.05)
## Pertimbangkan transformasi atau metode robust
# Q-Q plot
ggplot(data.frame(residuals = residuals(model_lm)), aes(sample = residuals)) +
stat_qq(color = blue_main, size = 2) +
stat_qq_line(color = "red", linetype = "dashed") +
labs(title = "Q-Q Plot: Normalitas Residual",
x = "Theoretical Quantiles",
y = "Sample Quantiles")
# Breusch-Pagan test
bp_test <- bptest(model_lm)
cat("\n=== Uji Homoskedastisitas (Breusch-Pagan) ===\n")
##
## === Uji Homoskedastisitas (Breusch-Pagan) ===
cat("BP statistic:", round(bp_test$statistic, 4), "\n")
## BP statistic: 10.4559
cat("P-value:", format.pval(bp_test$p.value, digits = 4), "\n")
## P-value: 0.01506
if (bp_test$p.value > 0.05) {
cat("✓ Asumsi homoskedastisitas TERPENUHI (p > 0.05)\n")
} else {
cat("✗ Asumsi homoskedastisitas TIDAK terpenuhi (p < 0.05)\n")
cat(" Ada heteroskedastisitas - pertimbangkan robust SE\n")
}
## ✗ Asumsi homoskedastisitas TIDAK terpenuhi (p < 0.05)
## Ada heteroskedastisitas - pertimbangkan robust SE
# Residuals vs Fitted plot
ggplot(data.frame(
fitted = fitted(model_lm),
residuals = residuals(model_lm)
), aes(x = fitted, y = residuals)) +
geom_point(color = blue_main, size = 2, alpha = 0.7) +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
geom_smooth(se = FALSE, color = blue_dark) +
labs(title = "Residuals vs Fitted Values",
subtitle = "Checking homoscedasticity",
x = "Fitted Values",
y = "Residuals")
vif_values <- vif(model_lm)
cat("\n=== Uji Multikolinearitas (VIF) ===\n")
##
## === Uji Multikolinearitas (VIF) ===
print(vif_values)
## IPM AHH APM_SMA
## 2.287050 2.227509 1.040946
vif_df <- data.frame(
Variable = names(vif_values),
VIF = vif_values,
Status = ifelse(vif_values < 5, "✓ OK",
ifelse(vif_values < 10, "⚠ Moderate", "✗ High"))
)
kable(vif_df, digits = 3, row.names = FALSE,
caption = "Variance Inflation Factor (VIF)") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = blue_main)
| Variable | VIF | Status |
|---|---|---|
| IPM | 2.287 | ✓ OK |
| AHH | 2.228 | ✓ OK |
| APM_SMA | 1.041 | ✓ OK |
cat("\nInterpretasi VIF:\n")
##
## Interpretasi VIF:
cat("- VIF < 5: Tidak ada masalah multikolinearitas\n")
## - VIF < 5: Tidak ada masalah multikolinearitas
cat("- VIF 5-10: Multikolinearitas sedang\n")
## - VIF 5-10: Multikolinearitas sedang
cat("- VIF > 10: Multikolinearitas tinggi (problematic)\n")
## - VIF > 10: Multikolinearitas tinggi (problematic)
if (all(vif_values < 5)) {
cat("\n✓ Semua VIF < 5: Model bebas dari multikolinearitas serius\n")
} else if (any(vif_values > 10)) {
cat("\n✗ Ada VIF > 10: Pertimbangkan menghapus variabel atau gunakan metode ridge/lasso\n")
} else {
cat("\n⚠ Ada VIF 5-10: Monitor hasil, tapi masih acceptable\n")
}
##
## ✓ Semua VIF < 5: Model bebas dari multikolinearitas serius
# Add predictions to data
Data_pred <- Data %>%
filter(!is.na(TPP_SMA) & !is.na(IPM) & !is.na(AHH) & !is.na(APM_SMA)) %>%
mutate(
Predicted = predict(model_lm),
Residual = TPP_SMA - Predicted
)
# Scatter plot
ggplot(Data_pred, aes(x = TPP_SMA, y = Predicted)) +
geom_point(color = blue_main, size = 3, alpha = 0.7) +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed", size = 1) +
geom_smooth(method = "lm", se = TRUE, color = blue_dark, fill = blue_soft) +
labs(
title = "Prediksi vs Nilai Aktual TPP SMA",
subtitle = sprintf("R² = %.3f | RMSE = %.2f",
summary(model_lm)$r.squared,
sqrt(mean(Data_pred$Residual^2))),
x = "Nilai Aktual TPP SMA (%)",
y = "Nilai Prediksi TPP SMA (%)"
) +
annotate("text", x = min(Data_pred$TPP_SMA) + 5,
y = max(Data_pred$Predicted) - 5,
label = "Garis merah: prediksi sempurna\nGaris biru: fitted line",
hjust = 0, color = "gray40", size = 3)
# Top 5 positive residuals (underpredicted)
top_positive <- Data_pred %>%
arrange(desc(Residual)) %>%
select(Provinsi, TPP_SMA, Predicted, Residual) %>%
head(5)
cat("=== Provinsi dengan Performa LEBIH BAIK dari Prediksi ===\n")
## === Provinsi dengan Performa LEBIH BAIK dari Prediksi ===
cat("(Model underpredict - ada faktor positif lain yang berperan)\n\n")
## (Model underpredict - ada faktor positif lain yang berperan)
kable(top_positive, digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#4caf50")
| Provinsi | TPP_SMA | Predicted | Residual |
|---|---|---|---|
| MALUKU | 75.01 | 62.49 | 12.52 |
| PAPUA BARAT | 59.99 | 49.39 | 10.60 |
| SUMATERA UTARA | 74.43 | 67.31 | 7.12 |
| ACEH | 74.46 | 67.62 | 6.84 |
| DI YOGYAKARTA | 89.69 | 85.03 | 4.66 |
# Top 5 negative residuals (overpredicted)
top_negative <- Data_pred %>%
arrange(Residual) %>%
select(Provinsi, TPP_SMA, Predicted, Residual) %>%
head(5)
cat("\n=== Provinsi dengan Performa LEBIH BURUK dari Prediksi ===\n")
##
## === Provinsi dengan Performa LEBIH BURUK dari Prediksi ===
cat("(Model overpredict - ada hambatan struktural yang perlu diatasi)\n\n")
## (Model overpredict - ada hambatan struktural yang perlu diatasi)
kable(top_negative, digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#f44336")
| Provinsi | TPP_SMA | Predicted | Residual |
|---|---|---|---|
| PAPUA | 39.50 | 65.95 | -26.45 |
| GORONTALO | 46.19 | 59.60 | -13.41 |
| JAWA TENGAH | 58.35 | 66.51 | -8.16 |
| NUSA TENGGARA TIMUR | 43.46 | 51.23 | -7.77 |
| SULAWESI TENGAH | 55.69 | 61.11 | -5.42 |
# Model alternatif
model_1 <- lm(TPP_SMA ~ IPM, data = Data) # Simple regression
model_2 <- lm(TPP_SMA ~ IPM + AHH, data = Data) # Tambah AHH
model_3 <- model_lm # Full model (sudah dibuat sebelumnya)
# Compare models
comparison <- data.frame(
Model = c("Model 1: IPM only",
"Model 2: IPM + AHH",
"Model 3: IPM + AHH + APM_SMA"),
R_squared = c(
summary(model_1)$r.squared,
summary(model_2)$r.squared,
summary(model_3)$r.squared
),
Adj_R_squared = c(
summary(model_1)$adj.r.squared,
summary(model_2)$adj.r.squared,
summary(model_3)$adj.r.squared
),
AIC = c(AIC(model_1), AIC(model_2), AIC(model_3)),
BIC = c(BIC(model_1), BIC(model_2), BIC(model_3)),
RMSE = c(
sqrt(mean(residuals(model_1)^2)),
sqrt(mean(residuals(model_2)^2)),
sqrt(mean(residuals(model_3)^2))
)
)
kable(comparison, digits = 3,
caption = "Perbandingan Model Regresi") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = blue_main) %>%
row_spec(which.min(comparison$AIC), bold = TRUE, background = "#c8e6c9")
| Model | R_squared | Adj_R_squared | AIC | BIC | RMSE |
|---|---|---|---|---|---|
| Model 1: IPM only | 0.579 | 0.566 | 233.166 | 237.745 | 6.833 |
| Model 2: IPM + AHH | 0.580 | 0.553 | 235.099 | 241.204 | 6.826 |
| Model 3: IPM + AHH + APM_SMA | 0.581 | 0.540 | 236.958 | 244.590 | 6.812 |
cat("\n** Model terbaik berdasarkan AIC/BIC (nilai terendah) **\n")
##
## ** Model terbaik berdasarkan AIC/BIC (nilai terendah) **
Rekomendasi: - Jangan hanya bangun gedung sekolah tanpa ekosistem pendukung - Program terintegrasi: kesehatan (gizi, sanitasi) + ekonomi (beasiswa, subsidi) + pendidikan (guru berkualitas, kurikulum relevan)
Target: Provinsi dengan IPM < 70 (Papua, NTT, Maluku, Papua Barat)
Contoh program konkret: - Paket Komprehensif: 1 sekolah berkualitas + puskesmas + program ekonomi keluarga - Bukan: 10 sekolah baru tanpa guru berkualitas dan akses kesehatan
Data mendukung: - Gap Jawa - Indonesia Timur: ~20 poin TPP SMA - Bukan hanya masalah “kota vs desa”, tapi sistemik regional
Rekomendasi: - Dana Alokasi Khusus (DAK) untuk Papua, Maluku, NTT - Insentif guru: Gaji 2x + tunjangan untuk mengajar di daerah terpencil - Boarding school model: Untuk daerah dengan jarak tempuh ekstrem - Teknologi: Internet gratis untuk akses e-learning
Temuan: Korelasi APM-TPP negatif → Akses tinggi ≠ Kualitas tinggi
Rekomendasi: - Audit kualitas sekolah existing sebelum bangun baru - Standarisasi minimum: Guru tersertifikasi, lab/perpustakaan fungsional, rasio siswa-guru maksimal 1:25 - Sistem monitoring: Tracking dropout rate, bukan hanya enrollment rate
Strategi: - Gunakan IPM provinsi untuk prediksi risiko pendidikan - Early warning system: Jika IPM turun → antisipasi penurunan TPP SMA - Resource allocation: Alokasi budget proporsional dengan gap IPM dari target nasional
Data: - Badan Pusat Statistik (BPS). (2023). Indeks Pembangunan Manusia Menurut Provinsi 2023. Retrieved from https://www.bps.go.id
Metode Analisis: - R Core Team (2023). R: A language and environment for statistical computing - Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software - Fox, J., & Weisberg, S. (2019). An R Companion to Applied Regression (3rd ed.)
© 2023 Muhamad Raffli | Data Science Project | Universitas Matana