What are we doing here?
The first step in any data analysis is understanding the data at hand. This dataset contains weather observations from five major Indonesian cities (Jakarta, Bandung, Surabaya, Makassar, and Medan) covering temperature, humidity, rainfall, and wind speed. Before analyzing it, we need to know what it contains, whether anything is missing, and whether the formats are correct.
# ============================================================
# 6.9.1.1: IMPORT DATA FROM GITHUB
# ============================================================
url <- "https://raw.githubusercontent.com/cahayasemidang-max/data-transformation/refs/heads/main/6%20Data-Transformation%20%E2%80%93%20Data%20Science%20Programming.csv"
data <- read_csv(url, show_col_types = FALSE) |>
rename(
ID_Observasi = Observation_ID,
Tanggal = Date,
Lokasi = Location,
Musim = Season,
Suhu = Temperature,
Kelembaban = Humidity,
`Curah hujan` = Rainfall,
`Kecepatan Angin` = Wind_Speed
)
cat("✅ Data berhasil dimuat!\n")
## ✅ Data berhasil dimuat!
cat(sprintf("Total observasi : %d baris x %d kolom\n", nrow(data), ncol(data)))
## Total observasi : 500 baris x 9 kolom
cat(sprintf("Kota : %s\n", paste(unique(data$Lokasi), collapse = ", ")))
## Kota : Jakarta, Bandung, Makassar, Surabaya, Medan
cat(sprintf("Rentang tanggal : %s s/d %s\n",
min(data$Tanggal, na.rm = TRUE),
max(data$Tanggal, na.rm = TRUE)))
## Rentang tanggal : 2020-01-02 s/d 2024-12-29
# Inspect the data structure
glimpse(data)
## Rows: 500
## Columns: 9
## $ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ ID_Observasi <chr> "JpeDLWuzIJdl", "EF8r6hXBCqfr", "cov0TYDwyQoF", "aRB…
## $ Tanggal <date> 2021-07-14, 2020-11-16, 2023-03-22, 2023-01-02, 202…
## $ Lokasi <chr> "Jakarta", "Bandung", "Makassar", "Surabaya", "Medan…
## $ Musim <chr> "Dry Season", "Rainy Season", "Transitional Season",…
## $ Suhu <dbl> 30.7, 26.2, 27.5, 26.8, 24.9, 31.1, 23.7, 26.3, 26.0…
## $ Kelembaban <dbl> 89.5, 70.1, 80.7, 90.4, 85.4, 68.2, 99.7, 60.0, 68.3…
## $ `Curah hujan` <dbl> 7.9, 4.6, 11.0, 8.7, 7.5, 3.2, 3.8, 21.1, 5.2, 3.4, …
## $ `Kecepatan Angin` <dbl> 9.5, 5.6, 13.1, 4.4, 5.2, 4.5, 11.0, 1.1, 7.0, 7.2, …
datatable(
head(data, 50),
options = list(
pageLength = 10,
scrollX = TRUE,
dom = "Bfrtip",
buttons = c("copy","csv","excel"),
columnDefs = list(list(className = "dt-center", targets = "_all"))
),
rownames = FALSE,
caption = "Preview 50 Baris Pertama Dataset Cuaca Indonesia",
class = "display nowrap"
) |>
formatRound(columns = c("Suhu","Kelembaban","Curah hujan","Kecepatan Angin"), digits = 2) |>
formatStyle(
columns = "Suhu",
background = styleColorBar(range(data$Suhu, na.rm = TRUE), "#4a7c59"),
backgroundSize = "100% 90%",
backgroundRepeat = "no-repeat",
backgroundPosition = "center"
)
data |>
select(Suhu, Kelembaban, `Curah hujan`, `Kecepatan Angin`) |>
pivot_longer(everything(), names_to = "Variabel", values_to = "Nilai") |>
group_by(Variabel) |>
summarise(
N = n(),
Min = min(Nilai, na.rm = TRUE),
Q1 = quantile(Nilai, 0.25, na.rm = TRUE),
Median = median(Nilai, na.rm = TRUE),
Mean = mean(Nilai, na.rm = TRUE),
Q3 = quantile(Nilai, 0.75, na.rm = TRUE),
Max = max(Nilai, na.rm = TRUE),
SD = sd(Nilai, na.rm = TRUE),
Skewness = round(skewness(Nilai, na.rm = TRUE), 3)
) |>
kable(
caption = "Ringkasan Statistik Deskriptif Variabel Numerik",
digits = 2, align = "c"
) |>
kable_styling(
bootstrap_options = c("striped","hover","condensed","responsive"),
full_width = TRUE, font_size = 13
) |>
row_spec(0, bold = TRUE, background = "#4a7c59", color = "white") |>
column_spec(1, bold = TRUE, color = "#4a7c59")
| Variabel | N | Min | Q1 | Median | Mean | Q3 | Max | SD | Skewness |
|---|---|---|---|---|---|---|---|---|---|
| Curah hujan | 500 | 0.3 | 4.7 | 8.25 | 9.81 | 13.4 | 48.7 | 6.75 | 1.40 |
| Kecepatan Angin | 500 | 1.0 | 4.9 | 8.50 | 8.18 | 11.5 | 14.9 | 3.98 | -0.04 |
| Kelembaban | 500 | 60.0 | 69.1 | 79.15 | 79.46 | 89.6 | 100.0 | 11.69 | 0.06 |
| Suhu | 500 | 16.9 | 25.0 | 27.00 | 26.92 | 29.1 | 37.5 | 3.06 | -0.10 |
Why does data cleaning matter?
Real-world data is often imperfect: missing values, duplicates, inconsistent units, or wrongly formatted dates. This stage ensures the data is clean and ready for analysis.
missing_summary <- data |>
summarise(across(everything(), ~sum(is.na(.)))) |>
pivot_longer(everything(), names_to = "Kolom", values_to = "Jumlah_Missing") |>
mutate(
Persen_Missing = round(Jumlah_Missing / nrow(data) * 100, 2),
Status = ifelse(Jumlah_Missing == 0, "✅ Bersih", "⚠️ Ada Missing")
)
missing_summary |>
kable(caption = "Status Missing Values per Kolom", align = "lrrc") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
row_spec(0, bold = TRUE, background = "#4a7c59", color = "white") |>
row_spec(which(missing_summary$Jumlah_Missing > 0),
background = "#fef9ec", color = "#8b5e3c")
| Kolom | Jumlah_Missing | Persen_Missing | Status |
|---|---|---|---|
| ...1 | 0 | 0 | ✅ Bersih |
| ID_Observasi | 0 | 0 | ✅ Bersih |
| Tanggal | 0 | 0 | ✅ Bersih |
| Lokasi | 0 | 0 | ✅ Bersih |
| Musim | 0 | 0 | ✅ Bersih |
| Suhu | 0 | 0 | ✅ Bersih |
| Kelembaban | 0 | 0 | ✅ Bersih |
| Curah hujan | 0 | 0 | ✅ Bersih |
| Kecepatan Angin | 0 | 0 | ✅ Bersih |
# --- Check for duplicates ---
n_duplikat <- sum(duplicated(data))
cat(sprintf("Baris duplikat ditemukan: %d\n", n_duplikat))
## Baris duplikat ditemukan: 0
if (n_duplikat == 0) cat("✅ Tidak ada duplikat — data bersih!\n") else {
data <- data |> distinct()
cat(sprintf("✅ %d duplikat telah dihapus.\n", n_duplikat))
}
## ✅ Tidak ada duplikat — data bersih!
# --- Temperature conversion: Celsius → Fahrenheit & Kelvin ---
data <- data |>
mutate(
Suhu_Fahrenheit = round((Suhu * 9/5) + 32, 2),
Suhu_Kelvin = round(Suhu + 273.15, 2)
)
data |>
select(Lokasi, Suhu, Suhu_Fahrenheit, Suhu_Kelvin) |>
head(8) |>
kable(
caption = "Konversi Suhu: Celsius → Fahrenheit & Kelvin",
align = "c", digits = 2
) |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
row_spec(0, bold = TRUE, background = "#c17f3a", color = "white")
| Lokasi | Suhu | Suhu_Fahrenheit | Suhu_Kelvin |
|---|---|---|---|
| Jakarta | 30.7 | 87.3 | 304 |
| Bandung | 26.2 | 79.2 | 299 |
| Makassar | 27.5 | 81.5 | 301 |
| Surabaya | 26.8 | 80.2 | 300 |
| Medan | 24.9 | 76.8 | 298 |
| Jakarta | 31.1 | 88.0 | 304 |
| Medan | 23.7 | 74.7 | 297 |
| Bandung | 26.3 | 79.3 | 299 |
# --- Decompose the date column ---
data <- data |>
mutate(
Tanggal = as.Date(Tanggal),
Tahun = year(Tanggal),
Bulan = month(Tanggal),
Hari = day(Tanggal),
Hari_Dalam_Tahun = yday(Tanggal),
Hari_Dalam_Minggu = wday(Tanggal, label = TRUE, abbr = FALSE),
Nama_Bulan = month(Tanggal, label = TRUE, abbr = FALSE)
# No Jam (hour) column is created: daily data carries no time-of-day information
)
data |>
select(Tanggal, Tahun, Bulan, Nama_Bulan, Hari, Hari_Dalam_Tahun, Hari_Dalam_Minggu) |>
head(10) |>
kable(
caption = "Hasil Parsing Kolom Tanggal Menjadi Komponen Temporal",
align = "c"
) |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) |>
row_spec(0, bold = TRUE, background = "#4a7c59", color = "white")
| Tanggal | Tahun | Bulan | Nama_Bulan | Hari | Hari_Dalam_Tahun | Hari_Dalam_Minggu |
|---|---|---|---|---|---|---|
| 2021-07-14 | 2021 | 7 | Juli | 14 | 195 | Rabu |
| 2020-11-16 | 2020 | 11 | November | 16 | 321 | Senin |
| 2023-03-22 | 2023 | 3 | Maret | 22 | 81 | Rabu |
| 2023-01-02 | 2023 | 1 | Januari | 2 | 2 | Senin |
| 2023-06-05 | 2023 | 6 | Juni | 5 | 156 | Senin |
| 2023-03-15 | 2023 | 3 | Maret | 15 | 74 | Rabu |
| 2021-09-25 | 2021 | 9 | September | 25 | 268 | Sabtu |
| 2020-02-18 | 2020 | 2 | Februari | 18 | 49 | Selasa |
| 2023-02-25 | 2023 | 2 | Februari | 25 | 56 | Sabtu |
| 2023-08-19 | 2023 | 8 | Agustus | 19 | 231 | Sabtu |
What is feature engineering?
Feature engineering is the craft of turning raw data into new features that are more meaningful to a machine learning model. Like a chef turning raw ingredients into a finished dish, we create features that are more informative than the raw weather data.
# ============================================================
# 6.9.1.3: FEATURE ENGINEERING
# ============================================================
rata_harian <- data |>
group_by(Lokasi, Tanggal) |>
summarise(
Suhu_Harian = mean(Suhu, na.rm = TRUE),
Kelembaban_Harian = mean(Kelembaban, na.rm = TRUE),
Hujan_Harian = sum(`Curah hujan`, na.rm = TRUE),
Angin_Harian = mean(`Kecepatan Angin`, na.rm = TRUE),
.groups = "drop"
)
rata_bulanan <- data |>
group_by(Lokasi, Tahun, Bulan) |>
summarise(
Suhu_Bulanan = mean(Suhu, na.rm = TRUE),
Kelembaban_Bulanan = mean(Kelembaban, na.rm = TRUE),
Hujan_Bulanan = sum(`Curah hujan`, na.rm = TRUE),
Angin_Bulanan = mean(`Kecepatan Angin`, na.rm = TRUE),
.groups = "drop"
)
head(rata_harian, 5) |>
kable(digits = 2,
caption = "Rata-Rata Harian 5 Sampel Pertama",
align = "c") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) |>
row_spec(0, bold = TRUE, background = "#4a7c59", color = "white")
| Lokasi | Tanggal | Suhu_Harian | Kelembaban_Harian | Hujan_Harian | Angin_Harian |
|---|---|---|---|---|---|
| Bandung | 2020-01-02 | 24.4 | 96.4 | 0.6 | 12.8 |
| Bandung | 2020-02-15 | 24.3 | 96.4 | 4.9 | 9.9 |
| Bandung | 2020-02-18 | 26.3 | 60.0 | 21.1 | 1.1 |
| Bandung | 2020-03-16 | 25.5 | 64.6 | 24.6 | 4.6 |
| Bandung | 2020-03-24 | 27.3 | 76.1 | 3.0 | 6.8 |
head(rata_bulanan, 5) |>
kable(digits = 2,
caption = "Rata-Rata Bulanan 5 Sampel Pertama",
align = "c") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) |>
row_spec(0, bold = TRUE, background = "#8b5e3c", color = "white")
| Lokasi | Tahun | Bulan | Suhu_Bulanan | Kelembaban_Bulanan | Hujan_Bulanan | Angin_Bulanan |
|---|---|---|---|---|---|---|
| Bandung | 2020 | 1 | 24.4 | 96.4 | 0.6 | 12.8 |
| Bandung | 2020 | 2 | 25.3 | 78.2 | 26.0 | 5.5 |
| Bandung | 2020 | 3 | 26.4 | 70.3 | 27.6 | 5.7 |
| Bandung | 2020 | 4 | 26.3 | 77.2 | 7.4 | 10.6 |
| Bandung | 2020 | 7 | 25.8 | 66.7 | 4.7 | 9.9 |
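# Note: lag() shifts by rows within each city, not by calendar days. The
# observations are irregularly spaced, so "Kemarin" and "3Hari_Lalu" refer to
# the previous observation and the third-previous observation, not to exact dates.
data <- data |>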
arrange(Lokasi, Tanggal) |>
group_by(Lokasi) |>
mutate(
Suhu_Kemarin = lag(Suhu, 1),
Suhu_3Hari_Lalu = lag(Suhu, 3),
Suhu_7Hari_Lalu = lag(Suhu, 7),
Delta_Suhu = Suhu - Suhu_Kemarin,
Kelembaban_Kemarin = lag(Kelembaban, 1),
Delta_Kelembaban = Kelembaban - Kelembaban_Kemarin,
Hujan_Kemarin = lag(`Curah hujan`, 1),
Hujan_3Hari_Lalu = lag(`Curah hujan`, 3)
) |>
ungroup()
data |>
select(Lokasi, Tanggal, Suhu, Suhu_Kemarin, Suhu_3Hari_Lalu, Suhu_7Hari_Lalu, Delta_Suhu) |>
filter(!is.na(Suhu_7Hari_Lalu)) |>
head(10) |>
kable(digits = 2, align = "c",
caption = "Fitur Temporal: Lag 1, 3, 7 Hari dan Perubahan Harian") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) |>
row_spec(0, bold = TRUE, background = "#4a7c59", color = "white") |>
column_spec(7,
color = ifelse(
(data |> filter(!is.na(Suhu_7Hari_Lalu)) |> head(10) |> pull(Delta_Suhu)) > 0,
"#4a7c59", "#c17f3a"
)
)
| Lokasi | Tanggal | Suhu | Suhu_Kemarin | Suhu_3Hari_Lalu | Suhu_7Hari_Lalu | Delta_Suhu |
|---|---|---|---|---|---|---|
| Bandung | 2020-08-12 | 27.7 | 25.8 | 27.3 | 24.4 | 1.9 |
| Bandung | 2020-10-14 | 21.2 | 27.7 | 26.3 | 24.3 | -6.5 |
| Bandung | 2020-10-25 | 28.6 | 21.2 | 25.8 | 26.3 | 7.4 |
| Bandung | 2020-11-16 | 26.2 | 28.6 | 27.7 | 25.5 | -2.4 |
| Bandung | 2020-11-27 | 27.8 | 26.2 | 21.2 | 27.3 | 1.6 |
| Bandung | 2020-12-21 | 28.5 | 27.8 | 28.6 | 26.3 | 0.7 |
| Bandung | 2021-01-03 | 30.6 | 28.5 | 26.2 | 25.8 | 2.1 |
| Bandung | 2021-03-17 | 28.7 | 30.6 | 27.8 | 27.7 | -1.9 |
| Bandung | 2021-04-25 | 29.9 | 28.7 | 28.5 | 21.2 | 1.2 |
| Bandung | 2021-05-03 | 25.6 | 29.9 | 30.6 | 28.6 | -4.3 |
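The table above shows that the lag columns follow row order: the "previous day" value for Bandung on 2020-08-12 (25.8) was recorded on 2020-07-15. A minimal sketch of a calendar-aware alternative, using base R only. The data frame `obs`, its values, and the column `Suhu_Obs_Sebelumnya` are hypothetical and exist only for illustration.

```r
# Toy data (hypothetical values): irregularly spaced dates for one city
obs <- data.frame(
  Lokasi  = "Bandung",
  Tanggal = as.Date(c("2020-01-02", "2020-01-03", "2020-02-15", "2020-02-16")),
  Suhu    = c(24.4, 25.0, 24.3, 26.1)
)

# Row-based lag: the previous observation, whatever its date
obs$Suhu_Obs_Sebelumnya <- c(NA, head(obs$Suhu, -1))

# Calendar-based lag: the value recorded exactly one day earlier, else NA
key_now  <- paste(obs$Lokasi, obs$Tanggal)
key_prev <- paste(obs$Lokasi, obs$Tanggal - 1)
obs$Suhu_Kemarin <- obs$Suhu[match(key_prev, key_now)]

print(obs)
```

With sparse observations the calendar-based column will be mostly `NA`, which is itself useful information: it shows how rarely two consecutive days are both present.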
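# Note: k counts observations, not calendar days, so MA3/MA7/MA14 average the
# last 3/7/14 observations recorded for each city.
data <- data |>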
arrange(Lokasi, Tanggal) |>
group_by(Lokasi) |>
mutate(
Suhu_MA3 = round(rollmeanr(Suhu, k = 3, fill = NA), 2),
Suhu_MA7 = round(rollmeanr(Suhu, k = 7, fill = NA), 2),
Suhu_MA14 = round(rollmeanr(Suhu, k = 14, fill = NA), 2),
Hujan_MA7 = round(rollmeanr(`Curah hujan`, k = 7, fill = NA), 2)
) |>
ungroup()
data |>
filter(Lokasi == "Jakarta", !is.na(Suhu_MA14)) |>
select(Tanggal, Suhu, Suhu_MA3, Suhu_MA7, Suhu_MA14) |>
head(10) |>
kable(digits = 2, align = "c",
caption = "Moving Average Suhu Jakarta: MA-3, MA-7, MA-14 Hari") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) |>
row_spec(0, bold = TRUE, background = "#d4a017", color = "white")
| Tanggal | Suhu | Suhu_MA3 | Suhu_MA7 | Suhu_MA14 |
|---|---|---|---|---|
| 2020-09-24 | 31.2 | 27.1 | 27.2 | 27.5 |
| 2020-09-25 | 29.7 | 29.0 | 27.3 | 27.8 |
| 2020-09-29 | 20.5 | 27.1 | 26.9 | 27.5 |
| 2020-10-30 | 24.3 | 24.8 | 26.9 | 27.8 |
| 2020-11-09 | 25.3 | 23.4 | 25.9 | 27.4 |
| 2020-11-20 | 28.9 | 26.2 | 26.6 | 27.3 |
| 2021-01-10 | 34.9 | 29.7 | 27.8 | 27.5 |
| 2021-01-11 | 29.5 | 31.1 | 27.6 | 27.4 |
| 2021-01-21 | 26.6 | 30.3 | 27.1 | 27.2 |
| 2021-02-05 | 24.7 | 26.9 | 27.7 | 27.3 |
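The dates in the table above are weeks apart, so a window of `k = 7` rows spans far more than seven days. A minimal sketch, assuming a hypothetical toy data frame `obs`, of a trailing mean defined over a calendar window instead of a row count (base R only; `window_days` and `Suhu_MA7_Kalender` are illustrative names):

```r
# Toy data (hypothetical values) with uneven gaps between observations
obs <- data.frame(
  Tanggal = as.Date(c("2021-01-01", "2021-01-03", "2021-01-06", "2021-01-20")),
  Suhu    = c(30, 28, 26, 32)
)

window_days <- 7

# For each row, average every observation in the trailing 7-day window
# (current date included, dates 7 or more days back excluded)
obs$Suhu_MA7_Kalender <- sapply(seq_len(nrow(obs)), function(i) {
  in_window <- obs$Tanggal >  obs$Tanggal[i] - window_days &
               obs$Tanggal <= obs$Tanggal[i]
  mean(obs$Suhu[in_window])
})

print(obs)
```

The last row averages only itself because no other observation falls within the preceding week.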
data <- data |>
mutate(
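# Heat Index: Rothfusz regression (takes Fahrenheit input; result converted back to Celsius)
# The regression was fitted for hot, humid conditions (roughly 80 °F and above),
# so values for cooler rows should be treated as approximate.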
HI_F = -42.379 +
2.04901523 * Suhu_Fahrenheit +
10.14333127 * Kelembaban -
0.22475541 * Suhu_Fahrenheit * Kelembaban -
0.00683783 * Suhu_Fahrenheit^2 -
0.05481717 * Kelembaban^2 +
0.00122874 * Suhu_Fahrenheit^2 * Kelembaban +
0.00085282 * Suhu_Fahrenheit * Kelembaban^2 -
0.00000199 * Suhu_Fahrenheit^2 * Kelembaban^2,
Indeks_Panas = round((HI_F - 32) * 5/9, 2),
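# Humidex (Canadian formula). The dataset has no dew point column, so the term
# (Kelembaban/100) * (Suhu - 14.4) stands in for it; treat the result as a rough approximation.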
Humidex = round(
Suhu + 0.5555 * (6.11 * exp(5417.7530 *
(1/273.16 - 1/(273.16 + (Kelembaban/100) * (Suhu - 14.4)))) - 10), 2
),
Kategori_Panas = case_when(
Indeks_Panas < 27 ~ "Nyaman",
Indeks_Panas < 32 ~ "Hati-hati",
Indeks_Panas < 41 ~ "Sangat Hati-hati",
Indeks_Panas < 54 ~ "Berbahaya",
TRUE ~ "Ekstrem"
)
) |>
select(-HI_F)
data |>
select(Lokasi, Tanggal, Suhu, Kelembaban, Indeks_Panas, Humidex, Kategori_Panas) |>
head(12) |>
kable(digits = 2, align = "c",
caption = "Heat Index dan Humidex berdasarkan Suhu & Kelembaban") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) |>
row_spec(0, bold = TRUE, background = "#c17f3a", color = "white")
| Lokasi | Tanggal | Suhu | Kelembaban | Indeks_Panas | Humidex | Kategori_Panas |
|---|---|---|---|---|---|---|
| Bandung | 2020-01-02 | 24.4 | 96.4 | 23.9 | 25.5 | Nyaman |
| Bandung | 2020-02-15 | 24.3 | 96.4 | 23.7 | 25.4 | Nyaman |
| Bandung | 2020-02-18 | 26.3 | 60.0 | 27.2 | 26.4 | Hati-hati |
| Bandung | 2020-03-16 | 25.5 | 64.6 | 26.5 | 25.6 | Nyaman |
| Bandung | 2020-03-24 | 27.3 | 76.1 | 30.0 | 28.5 | Hati-hati |
| Bandung | 2020-04-03 | 26.3 | 77.2 | 28.1 | 27.2 | Hati-hati |
| Bandung | 2020-07-15 | 25.8 | 66.7 | 26.9 | 26.1 | Nyaman |
| Bandung | 2020-08-12 | 27.7 | 84.2 | 31.9 | 29.6 | Hati-hati |
| Bandung | 2020-10-14 | 21.2 | 65.3 | 23.9 | 20.3 | Nyaman |
| Bandung | 2020-10-25 | 28.6 | 78.3 | 33.3 | 30.4 | Sangat Hati-hati |
| Bandung | 2020-11-16 | 26.2 | 70.1 | 27.6 | 26.7 | Hati-hati |
| Bandung | 2020-11-27 | 27.8 | 87.8 | 32.8 | 29.9 | Sangat Hati-hati |
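The Humidex formula is defined in terms of dew point, which the code above replaces with a shortcut expression. A minimal sketch of an alternative that first estimates dew point with the Magnus approximation and then applies the same vapour-pressure expression. The function names `dew_point_c` and `humidex` are hypothetical, and the constants `a = 17.62`, `b = 243.12` are one commonly used Magnus parameter set, assumed here rather than taken from the source.

```r
# Dew point (°C) from temperature (°C) and relative humidity (%)
# using the Magnus approximation (a, b: assumed parameter set)
dew_point_c <- function(temp_c, rh_pct, a = 17.62, b = 243.12) {
  gamma <- log(rh_pct / 100) + a * temp_c / (b + temp_c)
  b * gamma / (a - gamma)
}

# Humidex from temperature and relative humidity via the dew point
humidex <- function(temp_c, rh_pct) {
  td_k <- dew_point_c(temp_c, rh_pct) + 273.15          # dew point in kelvin
  e <- 6.11 * exp(5417.7530 * (1 / 273.16 - 1 / td_k))  # vapour pressure, hPa
  temp_c + 0.5555 * (e - 10)
}

# At 100% humidity the dew point equals the air temperature
dew_point_c(25, 100)

# Humidex grows with humidity at a fixed temperature
humidex(30, c(40, 70, 100))
```

Values from this sketch will differ from the `Humidex` column above because the dew-point estimate differs; comparing the two on a few rows is a quick way to see how much the shortcut matters.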
# Cyclical transform: day 1 and day 365 are seasonally ADJACENT
# but numerically far apart; the sin/cos pair resolves this
data <- data |>
mutate(
sin_tahun = round(sin(2 * pi * Hari_Dalam_Tahun / 365), 4),
cos_tahun = round(cos(2 * pi * Hari_Dalam_Tahun / 365), 4)
# sin_hari removed: redundant with sin_tahun (nearly identical)
)
data |>
select(Tanggal, Hari_Dalam_Tahun, sin_tahun, cos_tahun) |>
distinct(Hari_Dalam_Tahun, .keep_all = TRUE) |>
arrange(Hari_Dalam_Tahun) |>
head(12) |>
kable(digits = 4, align = "c",
caption = "Fitur Sinus & Kosinus — Menangkap Siklus Musiman Tahunan") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
row_spec(0, bold = TRUE, background = "#4a7c59", color = "white")
| Tanggal | Hari_Dalam_Tahun | sin_tahun | cos_tahun |
|---|---|---|---|
| 2022-01-01 | 1 | 0.0172 | 1.000 |
| 2020-01-02 | 2 | 0.0344 | 0.999 |
| 2021-01-03 | 3 | 0.0516 | 0.999 |
| 2024-01-04 | 4 | 0.0688 | 0.998 |
| 2021-01-08 | 8 | 0.1373 | 0.991 |
| 2024-01-09 | 9 | 0.1543 | 0.988 |
| 2022-01-10 | 10 | 0.1713 | 0.985 |
| 2021-01-11 | 11 | 0.1882 | 0.982 |
| 2023-01-13 | 13 | 0.2219 | 0.975 |
| 2023-01-14 | 14 | 0.2387 | 0.971 |
| 2022-01-15 | 15 | 0.2554 | 0.967 |
| 2022-01-16 | 16 | 0.2720 | 0.962 |
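To see why the sin/cos pair helps, the sketch below measures distances between three days of the year in the encoded space. The helper `encode_day` is a hypothetical name, and a fixed period of 365 is assumed, as in the code above (leap years are ignored).

```r
# Map a day of the year to a point on the unit circle
encode_day <- function(day_of_year, period = 365) {
  angle <- 2 * pi * day_of_year / period
  cbind(sin = sin(angle), cos = cos(angle))
}

pts <- encode_day(c(1, 183, 365))
rownames(pts) <- c("day_1", "day_183", "day_365")

# Euclidean distances between the encoded points
d <- as.matrix(dist(pts))
print(round(d, 4))
```

Day 1 and day 365 differ by 364 as raw numbers but end up almost on top of each other on the circle, while day 1 and day 183 sit nearly opposite each other.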
Why categorize the data?
Machine learning models can sometimes work better when continuous variables are converted into meaningful categories. Rainfall of 0.5 mm and 2 mm may both count as a "clear day" in practice; categorization helps the model capture more general patterns.
# ============================================================
# 6.9.1.4: CATEGORIZATION AND GROUPING
# ============================================================
data <- data |>
mutate(
Bin_Hujan = cut(
`Curah hujan`,
breaks = c(-Inf, 1, 5, 15, 30, Inf),
labels = c("Sangat Ringan (<1mm)", "Ringan (1-5mm)",
"Sedang (5-15mm)", "Lebat (15-30mm)", "Sangat Lebat (>30mm)"),
right = FALSE
),
Bin_Suhu = cut(
Suhu,
breaks = c(-Inf, 20, 25, 30, 35, Inf),
labels = c("Dingin (<20°C)", "Sejuk (20-25°C)",
"Hangat (25-30°C)", "Panas (30-35°C)", "Sangat Panas (>35°C)"),
right = FALSE
),
Kategori_Cuaca = case_when(
`Curah hujan` < 1 & Kelembaban < 70 ~ "Cerah",
`Curah hujan` < 5 & Kelembaban >= 70 & Kelembaban < 85 ~ "Berawan",
`Curah hujan` >= 5 | Kelembaban >= 85 ~ "Hujan",
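# Catch-all: as written, only rows with rainfall of 1 to <5 mm and humidity
# below 70 reach this branch, so the "Badai" label does not indicate a measured storm.
TRUE ~ "Badai"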
)
)
data |>
count(Bin_Hujan, sort = FALSE) |>
mutate(Persen = round(n / sum(n) * 100, 1)) |>
kable(col.names = c("Kategori Curah Hujan","Frekuensi","Persen (%)"),
caption = "Distribusi Kategori Curah Hujan", align = "lcc") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
row_spec(0, bold = TRUE, background = "#4a7c59", color = "white")
| Kategori Curah Hujan | Frekuensi | Persen (%) |
|---|---|---|
| Sangat Ringan (<1mm) | 9 | 1.8 |
| Ringan (1-5mm) | 126 | 25.2 |
| Sedang (5-15mm) | 267 | 53.4 |
| Lebat (15-30mm) | 93 | 18.6 |
| Sangat Lebat (>30mm) | 5 | 1.0 |
data |>
count(Bin_Suhu, sort = FALSE) |>
mutate(Persen = round(n / sum(n) * 100, 1)) |>
kable(col.names = c("Kategori Suhu","Frekuensi","Persen (%)"),
caption = "Distribusi Kategori Suhu", align = "lcc") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
row_spec(0, bold = TRUE, background = "#8b5e3c", color = "white")
| Kategori Suhu | Frekuensi | Persen (%) |
|---|---|---|
| Dingin (<20°C) | 5 | 1.0 |
| Sejuk (20-25°C) | 120 | 24.0 |
| Hangat (25-30°C) | 299 | 59.8 |
| Panas (30-35°C) | 73 | 14.6 |
| Sangat Panas (>35°C) | 3 | 0.6 |
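The binning above relies on `cut()` with `right = FALSE`, which makes every interval closed on the left and open on the right. A minimal sketch with hypothetical values and shortened labels shows where boundary values land:

```r
# Values chosen to sit on or next to the bin edges (hypothetical)
x <- c(0.9, 1, 4.99, 5, 30)

bins <- cut(
  x,
  breaks = c(-Inf, 1, 5, 15, 30, Inf),
  labels = c("<1", "1-5", "5-15", "15-30", ">=30"),
  right  = FALSE   # intervals are [lower, upper)
)

data.frame(x, bins)
```

A value of exactly 1 falls in the second bin and exactly 30 falls in the last one, so a label such as "(>30mm)" in practice also covers 30 itself.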
data |>
# Set the factor levels before counting so the rows follow the intended category order
mutate(Kategori_Cuaca = factor(Kategori_Cuaca,
levels = c("Cerah","Berawan","Hujan","Badai"))) |>
count(Lokasi, Kategori_Cuaca) |>
kable(col.names = c("Kota","Kategori Cuaca","Frekuensi"),
caption = "Distribusi Kategori Cuaca per Kota", align = "llc") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
row_spec(0, bold = TRUE, background = "#d4a017", color = "white")
| Kota | Kategori Cuaca | Frekuensi |
|---|---|---|
| Bandung | Berawan | 12 |
| Bandung | Hujan | 79 |
| Bandung | Badai | 10 |
| Jakarta | Cerah | 2 |
| Jakarta | Berawan | 9 |
| Jakarta | Hujan | 77 |
| Jakarta | Badai | 10 |
| Makassar | Cerah | 1 |
| Makassar | Berawan | 9 |
| Makassar | Hujan | 79 |
| Makassar | Badai | 2 |
| Medan | Cerah | 1 |
| Medan | Berawan | 11 |
| Medan | Hujan | 88 |
| Medan | Badai | 7 |
| Surabaya | Cerah | 1 |
| Surabaya | Berawan | 5 |
| Surabaya | Hujan | 88 |
| Surabaya | Badai | 9 |
What is an outlier?
An outlier is a value that differs sharply from the bulk of the data, whether because of instrument error, genuinely extreme weather, or input error. We identify outliers with two methods: Z-score and IQR.
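The two rules can disagree. A minimal sketch on a hypothetical seven-value vector applies both: the Z-score rule flags values more than 3 standard deviations from the mean, and the IQR rule flags values more than 1.5 × IQR outside the quartiles.

```r
# Hypothetical sample with one obviously extreme value
x <- c(8, 9, 10, 10, 11, 12, 40)

# Z-score rule: |z| > 3
z      <- (x - mean(x)) / sd(x)
flag_z <- abs(z) > 3

# IQR rule: outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q        <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr      <- q[2] - q[1]
flag_iqr <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr

data.frame(x, z = round(z, 2), flag_z, flag_iqr)
```

Here the value 40 has a z-score of only about 2.25, because the extreme value itself inflates the mean and the standard deviation, so the Z-score rule misses it while the IQR rule catches it. Quartiles are far less affected by a single extreme value, which is one reason the IQR rule tends to flag more points on skewed data.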
# ============================================================
# 6.9.1.5: OUTLIER DETECTION
# ============================================================
data <- data |>
mutate(
z_Suhu = round((Suhu - mean(Suhu, na.rm = TRUE)) / sd(Suhu, na.rm = TRUE), 3),
z_Kelembaban = round((Kelembaban - mean(Kelembaban, na.rm = TRUE)) / sd(Kelembaban, na.rm = TRUE), 3),
z_Hujan = round((`Curah hujan` - mean(`Curah hujan`, na.rm = TRUE)) / sd(`Curah hujan`, na.rm = TRUE), 3),
z_Angin = round((`Kecepatan Angin` - mean(`Kecepatan Angin`, na.rm = TRUE)) / sd(`Kecepatan Angin`, na.rm = TRUE), 3),
Outlier_Suhu_Z = abs(z_Suhu) > 3,
Outlier_Hujan_Z = abs(z_Hujan) > 3,
Outlier_Kelembaban_Z = abs(z_Kelembaban) > 3,
Outlier_Angin_Z = abs(z_Angin) > 3
)
cat("HASIL DETEKSI OUTLIER — METODE Z-SCORE (|z| > 3)\n")
## HASIL DETEKSI OUTLIER — METODE Z-SCORE (|z| > 3)
cat(sprintf("Suhu : %d outlier\n", sum(data$Outlier_Suhu_Z)))
## Suhu : 3 outlier
cat(sprintf("Kelembaban : %d outlier\n", sum(data$Outlier_Kelembaban_Z)))
## Kelembaban : 0 outlier
cat(sprintf("Curah Hujan : %d outlier\n", sum(data$Outlier_Hujan_Z)))
## Curah Hujan : 5 outlier
cat(sprintf("Kecepatan Angin : %d outlier\n", sum(data$Outlier_Angin_Z)))
## Kecepatan Angin : 0 outlier
n_outlier_z <- sum(data$Outlier_Suhu_Z | data$Outlier_Hujan_Z |
data$Outlier_Kelembaban_Z | data$Outlier_Angin_Z)
if (n_outlier_z > 0) {
data |>
filter(Outlier_Suhu_Z | Outlier_Hujan_Z | Outlier_Kelembaban_Z | Outlier_Angin_Z) |>
select(Lokasi, Tanggal, Suhu, z_Suhu, `Curah hujan`, z_Hujan) |>
kable(digits = 3, align = "c", caption = "Observasi Outlier (Z-Score)") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) |>
row_spec(0, bold = TRUE, background = "#c17f3a", color = "white")
} else {
cat("✅ Tidak ada outlier Z-Score terdeteksi pada threshold |z| > 3.\n")
}
| Lokasi | Tanggal | Suhu | z_Suhu | Curah hujan | z_Hujan |
|---|---|---|---|---|---|
| Jakarta | 2022-08-18 | 36.6 | 3.159 | 4.5 | -0.787 |
| Makassar | 2020-08-05 | 27.3 | 0.125 | 38.8 | 4.295 |
| Makassar | 2023-04-12 | 26.1 | -0.267 | 32.8 | 3.406 |
| Makassar | 2023-04-12 | 29.6 | 0.875 | 42.5 | 4.843 |
| Medan | 2023-06-25 | 29.9 | 0.973 | 32.3 | 3.332 |
| Surabaya | 2023-07-02 | 25.0 | -0.626 | 48.7 | 5.762 |
| Surabaya | 2023-11-14 | 16.9 | -3.269 | 2.7 | -1.053 |
| Surabaya | 2023-12-19 | 37.5 | 3.453 | 5.5 | -0.638 |
hitung_iqr_bounds <- function(x) {
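# names = FALSE drops the "25%"/"75%" labels so they do not leak into data frame row names
q1 <- quantile(x, 0.25, na.rm = TRUE, names = FALSE)
q3 <- quantile(x, 0.75, na.rm = TRUE, names = FALSE)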
iqr <- q3 - q1
list(Q1 = q1, Q3 = q3, IQR = iqr,
Batas_Bawah = q1 - 1.5 * iqr,
Batas_Atas = q3 + 1.5 * iqr)
}
var_list <- list(
Suhu = data$Suhu,
Kelembaban = data$Kelembaban,
`Curah hujan` = data$`Curah hujan`,
`Kec. Angin` = data$`Kecepatan Angin`
)
iqr_summary <- lapply(names(var_list), function(nm) {
b <- hitung_iqr_bounds(var_list[[nm]])
data.frame(
Variabel = nm,
Q1 = round(b$Q1, 2),
Q3 = round(b$Q3, 2),
IQR = round(b$IQR, 2),
Batas_Bawah = round(b$Batas_Bawah, 2),
Batas_Atas = round(b$Batas_Atas, 2),
Jumlah_Outlier = sum(var_list[[nm]] < b$Batas_Bawah |
var_list[[nm]] > b$Batas_Atas, na.rm = TRUE)
)
}) |> bind_rows()
iqr_summary |>
kable(caption = "Batas Outlier & Jumlah Pencilan (Metode IQR)",
align = "c") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) |>
row_spec(0, bold = TRUE, background = "#8b5e3c", color = "white") |>
column_spec(7, bold = TRUE,
color = ifelse(iqr_summary$Jumlah_Outlier > 0, "#c17f3a", "#4a7c59"))
| Variabel | Q1 | Q3 | IQR | Batas_Bawah | Batas_Atas | Jumlah_Outlier |
|---|---|---|---|---|---|---|
| Suhu | 25.0 | 29.1 | 4.12 | 18.79 | 35.3 | 6 |
| Kelembaban | 69.1 | 89.6 | 20.50 | 38.32 | 120.3 | 0 |
| Curah hujan | 4.7 | 13.4 | 8.73 | -8.39 | 26.5 | 10 |
| Kec. Angin | 4.9 | 11.5 | 6.60 | -5.00 | 21.4 | 0 |
batas_hujan <- hitung_iqr_bounds(data$`Curah hujan`)
data <- data |>
mutate(
Outlier_Hujan_IQR = `Curah hujan` < batas_hujan$Batas_Bawah |
`Curah hujan` > batas_hujan$Batas_Atas,
Hujan_Clipped = pmin(pmax(`Curah hujan`, batas_hujan$Batas_Bawah), batas_hujan$Batas_Atas),
Hujan_Imputed = ifelse(Outlier_Hujan_IQR,
median(`Curah hujan`, na.rm = TRUE),
`Curah hujan`)
)
data |>
filter(Outlier_Hujan_IQR) |>
select(Lokasi, Tanggal, `Curah hujan`, Hujan_Clipped, Hujan_Imputed) |>
head(15) |>
kable(digits = 2, align = "c",
caption = "Nilai Asli vs Clipping vs Imputation (Outlier Curah Hujan 10 observasi)") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) |>
row_spec(0, bold = TRUE, background = "#c17f3a", color = "white")
| Lokasi | Tanggal | Curah hujan | Hujan_Clipped | Hujan_Imputed |
|---|---|---|---|---|
| Makassar | 2020-04-18 | 26.8 | 26.5 | 8.25 |
| Makassar | 2020-08-05 | 38.8 | 26.5 | 8.25 |
| Makassar | 2023-04-12 | 32.8 | 26.5 | 8.25 |
| Makassar | 2023-04-12 | 42.5 | 26.5 | 8.25 |
| Medan | 2020-03-13 | 28.6 | 26.5 | 8.25 |
| Medan | 2022-04-08 | 28.4 | 26.5 | 8.25 |
| Medan | 2023-06-25 | 32.3 | 26.5 | 8.25 |
| Surabaya | 2021-08-29 | 29.9 | 26.5 | 8.25 |
| Surabaya | 2022-05-17 | 28.1 | 26.5 | 8.25 |
| Surabaya | 2023-07-02 | 48.7 | 26.5 | 8.25 |
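The table above contrasts two treatments: clipping pulls an outlier back to the nearest IQR bound, while imputation replaces it with the median. A minimal sketch on a hypothetical five-value vector makes the difference concrete (base R only):

```r
# Hypothetical values with one extreme observation
x <- c(2, 4, 6, 8, 50)

q     <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Clipping (winsorizing): cap values at the bounds, keeping their rank
clipped <- pmin(pmax(x, lower), upper)

# Median imputation: replace flagged values with the median of all values
imputed <- ifelse(x < lower | x > upper, median(x), x)

data.frame(x, clipped, imputed)
```

Clipping keeps the observation as the largest value in the sample, whereas imputation moves it to the center and erases the fact that the day was unusually wet. Which behavior is preferable depends on whether the extremes are errors or real events.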
Why is normalization needed?
Many machine learning models are sensitive to feature scale. Temperature (17-38°C) and humidity (60-100%) are on different scales; without normalization, features with larger scales dominate the model. Normalization makes all features speak the "same language".
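A minimal sketch of how scale affects a distance-based comparison, using three hypothetical observations. The matrix `X` and its values are made up for illustration; `scale()` is base R's column-wise z-score standardization.

```r
# Three hypothetical observations: row 2 differs from row 1 mainly in humidity,
# row 3 differs from row 1 mainly in temperature
X <- cbind(
  Suhu       = c(25, 26, 35),
  Kelembaban = c(60, 100, 62)
)

# Euclidean distances on the raw scale and after z-score standardization
d_raw    <- as.matrix(dist(X))
d_scaled <- as.matrix(dist(scale(X)))

round(d_raw, 2)
round(d_scaled, 2)
```

On the raw scale, row 1 looks roughly four times closer to row 3 than to row 2, simply because humidity spans a wider numeric range. After standardization the two distances are nearly equal, so a 10°C jump and a 40-point humidity jump are treated as comparably large.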
# ============================================================
# 6.9.1.7: DATA NORMALIZATION AND SCALING
# ============================================================
data_normalized <- data |>
mutate(
Suhu_ZScore = round((Suhu - mean(Suhu, na.rm = TRUE)) / sd(Suhu, na.rm = TRUE), 4),
Kelembaban_ZScore = round((Kelembaban - mean(Kelembaban, na.rm = TRUE)) / sd(Kelembaban, na.rm = TRUE), 4),
Hujan_ZScore = round((`Curah hujan` - mean(`Curah hujan`, na.rm = TRUE)) / sd(`Curah hujan`, na.rm = TRUE), 4),
Angin_ZScore = round((`Kecepatan Angin` - mean(`Kecepatan Angin`, na.rm = TRUE)) / sd(`Kecepatan Angin`, na.rm = TRUE), 4)
)
data_normalized |>
summarise(
`Mean Suhu Z` = round(mean(Suhu_ZScore), 8),
`SD Suhu Z` = round(sd(Suhu_ZScore), 4),
`Mean Hujan Z` = round(mean(Hujan_ZScore), 8),
`SD Hujan Z` = round(sd(Hujan_ZScore), 4)
) |>
kable(digits = 6, align = "c", caption = "Verifikasi: Mean ≈ 0 dan SD ≈ 1") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
row_spec(0, bold = TRUE, background = "#4a7c59", color = "white")
| Mean Suhu Z | SD Suhu Z | Mean Hujan Z | SD Hujan Z |
|---|---|---|---|
| -0.000002 | 1 | -0.000002 | 1 |
data_normalized |>
select(Lokasi, Suhu, Suhu_ZScore, Kelembaban, Kelembaban_ZScore,
`Curah hujan`, Hujan_ZScore) |>
head(10) |>
kable(digits = 4, align = "c",
caption = "Z-Score: Nilai Asli vs Ternormalisasi") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) |>
row_spec(0, bold = TRUE, background = "#4a7c59", color = "white")
| Lokasi | Suhu | Suhu_ZScore | Kelembaban | Kelembaban_ZScore | Curah hujan | Hujan_ZScore |
|---|---|---|---|---|---|---|
| Bandung | 24.4 | -0.822 | 96.4 | 1.4482 | 0.6 | -1.364 |
| Bandung | 24.3 | -0.854 | 96.4 | 1.4482 | 4.9 | -0.727 |
| Bandung | 26.3 | -0.202 | 60.0 | -1.6643 | 21.1 | 1.673 |
| Bandung | 25.5 | -0.463 | 64.6 | -1.2710 | 24.6 | 2.191 |
| Bandung | 27.3 | 0.125 | 76.1 | -0.2876 | 3.0 | -1.009 |
| Bandung | 26.3 | -0.202 | 77.2 | -0.1936 | 7.4 | -0.357 |
| Bandung | 25.8 | -0.365 | 66.7 | -1.0914 | 4.7 | -0.757 |
| Bandung | 27.7 | 0.255 | 84.2 | 0.4050 | 11.6 | 0.265 |
| Bandung | 21.2 | -1.866 | 65.3 | -1.2111 | 3.1 | -0.994 |
| Bandung | 28.6 | 0.549 | 78.3 | -0.0995 | 11.2 | 0.206 |
min_max_norm <- function(x) {
round((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)), 4)
}
data_normalized <- data_normalized |>
mutate(
Suhu_MinMax = min_max_norm(Suhu),
Kelembaban_MinMax = min_max_norm(Kelembaban),
Hujan_MinMax = min_max_norm(`Curah hujan`),
Angin_MinMax = min_max_norm(`Kecepatan Angin`)
)
data_normalized |>
select(Lokasi, Suhu, Suhu_MinMax, Kelembaban, Kelembaban_MinMax,
`Curah hujan`, Hujan_MinMax) |>
head(10) |>
kable(digits = 4, align = "c",
caption = "Min-Max Normalisasi: Semua Nilai dalam Rentang [0, 1]") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) |>
row_spec(0, bold = TRUE, background = "#8b5e3c", color = "white")
| Lokasi | Suhu | Suhu_MinMax | Kelembaban | Kelembaban_MinMax | Curah hujan | Hujan_MinMax |
|---|---|---|---|---|---|---|
| Bandung | 24.4 | 0.364 | 96.4 | 0.910 | 0.6 | 0.0062 |
| Bandung | 24.3 | 0.359 | 96.4 | 0.910 | 4.9 | 0.0950 |
| Bandung | 26.3 | 0.456 | 60.0 | 0.000 | 21.1 | 0.4298 |
| Bandung | 25.5 | 0.417 | 64.6 | 0.115 | 24.6 | 0.5021 |
| Bandung | 27.3 | 0.505 | 76.1 | 0.403 | 3.0 | 0.0558 |
| Bandung | 26.3 | 0.456 | 77.2 | 0.430 | 7.4 | 0.1467 |
| Bandung | 25.8 | 0.432 | 66.7 | 0.168 | 4.7 | 0.0909 |
| Bandung | 27.7 | 0.524 | 84.2 | 0.605 | 11.6 | 0.2335 |
| Bandung | 21.2 | 0.209 | 65.3 | 0.133 | 3.1 | 0.0579 |
| Bandung | 28.6 | 0.568 | 78.3 | 0.458 | 11.2 | 0.2252 |
# CHART 1: Violin + box plot of temperature distribution per city
# More informative than a plain boxplot: the full shape of the distribution is visible
kota_list <- names(kota_colors)
p_violin <- plot_ly()
for (kota in kota_list) {
d <- filter(data, Lokasi == kota)
p_violin <- p_violin |>
add_trace(
data = d,
y = ~Suhu,
x = ~Lokasi,
type = "violin",
name = kota,
box = list(visible = TRUE),
meanline = list(visible = TRUE, color = earthy$charcoal),
fillcolor = paste0(kota_colors[kota], "44"),
line = list(color = kota_colors[kota], width = 2),
marker = list(color = kota_colors[kota], size = 3, opacity = 0.5),
points = "outliers",
showlegend = FALSE
)
}
p_violin |>
layout(
title = list(
text = "Distribusi Suhu per Kota — Violin & Box Plot",
font = list(size = 16, color = earthy$charcoal, family = "Georgia, serif"),
x = 0.5
),
xaxis = list(title = "Kota", color = earthy$muted, gridcolor = earthy$sand,
tickfont = list(color = earthy$charcoal, size = 12)),
yaxis = list(title = "Suhu (°C)", color = earthy$muted, gridcolor = earthy$sand,
ticksuffix = "°C"),
plot_bgcolor = earthy$cream,
paper_bgcolor = "#ffffff",
font = list(color = earthy$charcoal, family = "Georgia, serif")
)
Insight:
A violin plot shows the full shape of the distribution: the width of the violin at a given temperature indicates how often that temperature occurs. Makassar has a wider distribution at high temperatures, while Bandung is narrower and shifted toward lower temperatures because of its elevation (~768 m above sea level).
# CHART 2: Line + ±SD ribbon of the monthly temperature trend per city
# The shaded area shows variability, which a plain line would hide
tren_bulanan <- data |>
mutate(Period = floor_date(Tanggal, "month")) |>
group_by(Lokasi, Period) |>
summarise(
Suhu_Mean = mean(Suhu, na.rm = TRUE),
Suhu_SD = sd(Suhu, na.rm = TRUE),
.groups = "drop"
) |>
mutate(
CI_upper = Suhu_Mean + Suhu_SD,
CI_lower = Suhu_Mean - Suhu_SD
)
p_tren <- plot_ly()
for (kota in kota_list) {
d <- filter(tren_bulanan, Lokasi == kota)
col <- kota_colors[kota]
p_tren <- p_tren |>
add_ribbons(
data = d, x = ~Period,
ymin = ~CI_lower, ymax = ~CI_upper,
name = paste(kota, "±SD"),
fillcolor = paste0(col, "20"),
line = list(color = "transparent"),
showlegend = FALSE, hoverinfo = "skip"
) |>
add_lines(
data = d, x = ~Period, y = ~Suhu_Mean,
name = kota,
line = list(color = col, width = 2.5, shape = "spline"),
marker = list(color = col, size = 5)
)
}
p_tren |>
layout(
title = list(
text = "Tren Suhu Bulanan per Kota (±1 SD)",
font = list(size = 16, color = earthy$charcoal, family = "Georgia, serif"),
x = 0.5
),
xaxis = list(title = "Periode", color = earthy$muted, gridcolor = earthy$sand),
yaxis = list(title = "Rata-rata Suhu (°C)", color = earthy$muted,
gridcolor = earthy$sand, ticksuffix = "°C"),
plot_bgcolor = earthy$cream,
paper_bgcolor = "#ffffff",
font = list(color = earthy$charcoal, family = "Georgia, serif"),
legend = list(font = list(color = earthy$charcoal), bgcolor = earthy$cream,
bordercolor = earthy$sand, borderwidth = 1)
)
Insight:
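The shaded band shows temperature variability (±1 SD): the wider the band, the less stable the temperature in that period. Coastal cities tend to show more variability than Bandung, which is surrounded by mountains. Many city-months contain only one or two observations (500 rows spread over 5 cities and 60 months), so the SD is often undefined or based on very few points, and the band width should be read with caution.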
## Rainfall Distribution
# CHART 4: Overlaid histograms of rainfall distribution per city
# Shows the pronounced right skew, which matters for preprocessing
p_hujan <- plot_ly()
for (kota in kota_list) {
d <- filter(data, Lokasi == kota)
col <- kota_colors[kota]
p_hujan <- p_hujan |>
add_histogram(
data = d,
x = ~`Curah hujan`,
name = kota,
nbinsx = 35,
opacity = 0.65,
marker = list(color = col, line = list(color = paste0(col, "99"), width = 0.5))
)
}
p_hujan |>
layout(
barmode = "overlay",
title = list(
text = "Distribusi Curah Hujan per Kota — Right-Skewed",
font = list(size = 16, color = earthy$charcoal, family = "Georgia, serif"),
x = 0.5
),
xaxis = list(title = "Curah Hujan (mm)", color = earthy$muted,
gridcolor = earthy$sand, ticksuffix = " mm"),
yaxis = list(title = "Frekuensi", color = earthy$muted, gridcolor = earthy$sand),
annotations = list(list(
x = 0.97, y = 0.95, xref = "paper", yref = "paper",
text = "<b>⚠️ Right-Skewed</b><br>Mayoritas: hujan ringan<br>Ekor panjang = kejadian ekstrem",
showarrow = FALSE,
font = list(size = 11, color = earthy$brown),
bgcolor = "#f5f0e8cc",
bordercolor = earthy$amber,
borderwidth = 1, borderpad = 8, align = "right"
)),
plot_bgcolor = earthy$cream,
paper_bgcolor = "#ffffff",
font = list(color = earthy$charcoal, family = "Georgia, serif"),
legend = list(font = list(color = earthy$charcoal), bgcolor = earthy$cream,
bordercolor = earthy$sand, borderwidth = 1)
)
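Modeling Implication:
A right-skewed distribution means the rainfall data are not normally distributed. Before feeding them into a linear model, apply a log transform (log1p if zero-rain days are present) or a Box-Cox transform (which requires strictly positive values). The right tail represents extreme events (heavy rain that can lead to flooding), which are precisely the cases that matter most to predict.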
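The effect of a log transform on skewness can be checked numerically. The sketch below is illustrative only: it uses simulated exponential data as a stand-in for the `Curah hujan` column, and `skewness()` is a hypothetical helper written here, not a function from the original analysis.

```r
# Moment-based sample skewness (hypothetical helper, not part of the original code)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)
# Simulated right-skewed rainfall in mm (assumption: stand-in for `Curah hujan`)
rain <- rexp(500, rate = 1 / 8)

skew_raw <- skewness(rain)
skew_log <- skewness(log1p(rain))  # log1p(x) = log(1 + x), defined for 0 mm days

cat(sprintf("Skewness raw: %.2f | after log1p: %.2f\n", skew_raw, skew_log))
```

On the real data, the same two calls applied to `data$`Curah hujan`` would show whether log1p is enough or whether a stronger transform is worth trying.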
## Heat Index per City
# CHART 5: Stacked bar: share of Heat Index categories per city
# Direct domain insight: heat-stress risk per city
heat_summary <- data |>
mutate(Kategori_Panas = factor(Kategori_Panas,
levels = c("Nyaman","Hati-hati","Sangat Hati-hati","Berbahaya","Ekstrem"))) |>
count(Lokasi, Kategori_Panas) |>
group_by(Lokasi) |>
mutate(Persen = round(n / sum(n) * 100, 1)) |>
ungroup()
kategori_colors <- c(
"Nyaman" = "#4a7c59",
"Hati-hati" = "#7a9e7e",
"Sangat Hati-hati" = "#d4a017",
"Berbahaya" = "#c17f3a",
"Ekstrem" = "#8b5e3c"
)
p_heat <- plot_ly()
for (kat in levels(heat_summary$Kategori_Panas)) {
d <- filter(heat_summary, Kategori_Panas == kat)
p_heat <- p_heat |>
add_bars(
data = d, x = ~Lokasi, y = ~Persen,
name = kat,
marker = list(color = kategori_colors[kat],
line = list(color = "#ffffff", width = 1)),
text = ~paste0(Persen, "%"),
textposition = "inside",
insidetextanchor = "middle",
textfont = list(color = "#ffffff", size = 11, family = "Georgia, serif")
)
}
p_heat |>
layout(
barmode = "stack",
title = list(
text = "Proporsi Kategori Heat Index per Kota (%)",
font = list(size = 16, color = earthy$charcoal, family = "Georgia, serif"),
x = 0.5
),
xaxis = list(title = "Kota", color = earthy$muted, gridcolor = earthy$sand,
tickfont = list(color = earthy$charcoal)),
yaxis = list(title = "Proporsi (%)", color = earthy$muted,
gridcolor = earthy$sand, ticksuffix = "%", range = c(0, 105)),
legend = list(
title = list(text = "<b>Kategori</b>"),
font = list(color = earthy$charcoal, size = 11),
bgcolor = earthy$cream,
bordercolor = earthy$sand,
borderwidth = 1,
traceorder = "normal"
),
plot_bgcolor = earthy$cream,
paper_bgcolor = "#ffffff",
font = list(color = earthy$charcoal, family = "Georgia, serif")
)
Policy Insight:
Coastal cities such as Makassar and Medan show a higher share of observations in the Sangat Hati-hati (Extreme Caution) to Berbahaya (Danger) categories. The combination of high temperature and humidity above 80% creates a real heat-stress risk for outdoor workers and vulnerable populations.
Cross-city analysis reveals temperature patterns that differ markedly with geography. Makassar consistently records the highest temperatures, which the analysis attributes to its coastal position and direct solar exposure. Bandung, by contrast, is the coolest thanks to its elevation of ~768 m above sea level: temperature drops by roughly 0.6°C per 100 m of elevation, or about 4.6°C in total at that height. The temperature gap between cities can exceed 3°C, a difference large enough to matter for cooling-energy demand and public-health planning.
Rainfall distributions in all cities are heavily right-skewed, with skewness above 1.5: most days see light rain below 5 mm, while a small share of days see very heavy rain. The outliers detected with the IQR rule are not mere statistical noise; they likely represent real extreme-weather events. For flood-prediction models, apply a log transform or use an algorithm that copes well with skewed inputs, such as Gradient Boosting.
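As a minimal sketch of the IQR rule mentioned above, the helper below flags values outside Tukey's fences instead of deleting them, so that extreme events stay available to the model. The function name `flag_iqr_outliers()` and the toy rainfall vector are assumptions made for illustration.

```r
# Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule); hypothetical helper
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x < q[[1]] - k * iqr | x > q[[2]] + k * iqr
}

# Toy rainfall values in mm (illustrative, not taken from the dataset)
rain <- c(0, 1.2, 0.5, 3.1, 2.0, 4.4, 0.8, 1.5, 2.7, 95.0)

is_outlier <- flag_iqr_outliers(rain)
rain[is_outlier]  # the single very heavy day is flagged, not removed
```

Keeping the flag as its own column, rather than filtering rows, lets a downstream model learn from the extreme days.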
The Heat Index, computed with the Rothfusz formula, shows that perceived temperature is far higher than actual temperature when humidity is high. In total, 35.2% of observations fall into the “Sangat Hati-hati” (Extreme Caution) category or above, conditions that can cause heat stroke in vulnerable populations. The effect is most severe in Makassar and Medan, where relative humidity exceeds 80%, and it has direct implications for outdoor working hours and infrastructure design.
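For reference, the Rothfusz regression can be sketched as below. This is the standard published regression (temperature in °F, relative humidity in %), wrapped here with a Celsius conversion; the function name `heat_index_rothfusz()` is an assumption, and whether the original analysis applied the usual low- and high-humidity adjustments is not stated, so none are included.

```r
# Rothfusz heat index regression; input in °C and % RH, output in °C.
# Assumption: no humidity adjustments; intended for hot, humid conditions
# (the regression is not meant for heat index values below roughly 80°F).
heat_index_rothfusz <- function(temp_c, rh) {
  t_f <- temp_c * 9 / 5 + 32
  hi_f <- -42.379 + 2.04901523 * t_f + 10.14333127 * rh -
    0.22475541 * t_f * rh - 0.00683783 * t_f^2 - 0.05481717 * rh^2 +
    0.00122874 * t_f^2 * rh + 0.00085282 * t_f * rh^2 -
    0.00000199 * t_f^2 * rh^2
  (hi_f - 32) * 5 / 9
}

# Example: 32°C at 80% relative humidity feels considerably hotter than 32°C
hi <- heat_index_rothfusz(temp_c = 32, rh = 80)
cat(sprintf("Heat index: %.1f°C\n", hi))
```

The function is vectorised, so it can be applied directly to the `Suhu` and `Kelembaban` columns.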
The Pearson correlation matrix reveals three key patterns. First, Suhu (temperature) correlates very strongly with Indeks Panas (Heat Index, r > 0.95) and Humidex (r > 0.90). Second, Kelembaban (humidity) has a moderate negative correlation with Suhu, a common pattern in tropical dry seasons. Third, Curah Hujan (rainfall) is relatively independent of Suhu, which suggests that rainfall is driven more by atmospheric dynamics (monsoon, ITCZ) than by local thermal conditions. Implication: avoid including Suhu, Indeks Panas, and Humidex together in a linear regression model, because such highly collinear predictors make the coefficient estimates unstable.
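One simple way to act on this is to list every predictor pair whose absolute Pearson correlation exceeds a threshold before fitting a linear model. The sketch below runs on simulated columns; the column names, the 0.9 threshold, and the simulated relationship between `Suhu` and `Indeks_Panas` are all assumptions for illustration.

```r
set.seed(1)
n <- 200
suhu <- rnorm(n, mean = 28, sd = 2)

# Simulated features (assumption: stand-ins for the engineered columns)
feat <- data.frame(
  Suhu = suhu,
  Indeks_Panas = suhu * 1.3 + rnorm(n, 0, 0.3),  # near-duplicate of Suhu
  Curah_hujan = rexp(n, rate = 1 / 8)             # generated independently of Suhu
)

r <- cor(feat, method = "pearson")

# Keep each pair once (upper triangle) and report those above the threshold
high <- which(abs(r) > 0.9 & upper.tri(r), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(r)[high[, "row"]],
  var2 = colnames(r)[high[, "col"]],
  r = r[high]
)
print(high_pairs)
```

For each reported pair, keeping only one of the two variables is usually enough for a linear model; tree-based models are less sensitive to this.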
A sin/cos transformation based on day of year fixes a mathematical weakness of the ordinal month representation (1–12), under which a model treats January and December as “far apart” even though both fall in a comparable rainy season. Cyclical encoding lets the model recognise seasonal proximity across the year boundary. This feature is strongly recommended for time-series models of Indonesia's tropical climate.
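A minimal sketch of the cyclical encoding, assuming a period of 365.25 days and the illustrative column names `doy_sin` and `doy_cos` (the original feature names are not shown in this section):

```r
dates <- as.Date(c("2023-12-31", "2024-01-01", "2024-07-01"))

# Day of year (1..366) mapped to an angle on the unit circle
doy <- as.integer(format(dates, "%j"))
angle <- 2 * pi * doy / 365.25

enc <- data.frame(
  Tanggal = dates,
  doy = doy,
  doy_sin = sin(angle),
  doy_cos = cos(angle)
)

# Euclidean distance between two rows in (sin, cos) space
dist_cyclic <- function(i, j) {
  sqrt((enc$doy_sin[i] - enc$doy_sin[j])^2 + (enc$doy_cos[i] - enc$doy_cos[j])^2)
}

dist_cyclic(1, 2)  # 31 Dec vs 1 Jan: close together
dist_cyclic(1, 3)  # 31 Dec vs 1 Jul: nearly opposite sides of the circle
```

In raw day-of-year terms, 31 December and 1 January are 364 units apart, while in the encoded space they are neighbours, which is the property the paragraph above describes.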
Recommended feature hierarchy for a weather-prediction model: Tier 1 (required): temperature and rainfall lags of 1, 3, and 7 days, which capture the strongest temporal autocorrelation. Tier 2 (strongly recommended): 7-day moving average, sin/cos temporal features, Heat Index. Tier 3 (enrichment): rainfall bins, Humidex. Use XGBoost or Random Forest with time-series cross-validation (not a random split) to prevent data leakage.
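The Tier 1 and Tier 2 lag features and a time-ordered validation scheme could be sketched as follows. Everything here is an assumption for illustration: the data are a simulated daily series for a single city, and `lag_vec()` and `make_folds()` are hypothetical helpers. The real dataset has 500 irregularly dated observations across five cities, so lags would need to be computed per city on a regular daily grid before "lag 7" actually means seven days.

```r
set.seed(7)
n <- 60
# Simulated daily series for one city (assumption: regular daily spacing)
wx <- data.frame(
  Tanggal = seq(as.Date("2024-01-01"), by = "day", length.out = n),
  Suhu = 27 + cumsum(rnorm(n, 0, 0.3))
)

# Shift a vector back by k rows, padding the start with NA
lag_vec <- function(x, k) c(rep(NA, k), head(x, -k))

wx$Suhu_lag1 <- lag_vec(wx$Suhu, 1)
wx$Suhu_lag3 <- lag_vec(wx$Suhu, 3)
wx$Suhu_lag7 <- lag_vec(wx$Suhu, 7)
# 7-day moving average of past values only (today excluded) to avoid leakage
wx$Suhu_ma7 <- as.numeric(stats::filter(wx$Suhu_lag1, rep(1 / 7, 7), sides = 1))

# Expanding-window folds: each test block lies strictly after its training rows
make_folds <- function(n, initial, horizon) {
  starts <- seq(initial, n - horizon, by = horizon)
  lapply(starts, function(s) list(train = 1:s, test = (s + 1):(s + horizon)))
}

folds <- make_folds(n, initial = 30, horizon = 10)
for (f in folds) {
  cat(sprintf("train 1..%d | test %d..%d\n", max(f$train), min(f$test), max(f$test)))
}
```

Each fold's row indices can then be passed to whichever learner is chosen (XGBoost or Random Forest); the key point is only that no test row precedes a training row.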