UTS SD-1306 Pemrograman Sains Data I

Mini Project 1: E-Commerce & Mini Project 2: Web Scraping

SD-1306 · Pemrograman Sains Data I · UTS

UTS Pemrograman Sains Data I

Mini Project 1: E-Commerce Analytics & Mini Project 2: Web Scraping

Sains Data Kelas A April 2026

Mr. Bakti Siregar, M.Sc., CSD

Dosen Pengampu

Khafizatun Nisa

NIM: 52250018

Nakeisha Aulia Z

NIM: 52250023

Jihan Ramadhani D.

NIM: 52250024

1 MINI PROJECT 1

1.1 Library yang Digunakan (Mini Project 1)

pkgs1 <- c("readr","readxl","jsonlite","xml2","dplyr","tidyr",
           "stringr","ggplot2","scales","knitr","kableExtra",
           "plotly","RColorBrewer","rmdformats")
for (p in pkgs1) {
  if (!require(p, character.only = TRUE, quietly = TRUE))
    install.packages(p, repos = "https://cran.r-project.org")
}

library(readr); library(readxl); library(jsonlite); library(xml2)
library(dplyr);  library(tidyr);  library(stringr);  library(ggplot2)
library(scales); library(knitr); library(kableExtra); library(plotly)
library(RColorBrewer)

cat("Semua library Mini Project 1 berhasil dimuat!")

## Semua library Mini Project 1 berhasil dimuat!

Library siap. Kombinasi readr/readxl/jsonlite/xml2 untuk baca multi-format, dplyr/stringr untuk cleaning, dan plotly untuk visualisasi interaktif.

1.2 Section A Data Collection (E-Commerce)

Tujuan: Membaca 5 file berbeda (CSV, Excel, JSON, TXT, XML) secara otomatis menggunakan looping, cek struktur kolom, dan gabungkan menjadi 1 dataset utama.

1.2.1 CSV

df_csv <- read_csv("ecommerce.csv", show_col_types = FALSE)
cat("File: ecommerce.csv\n")

## File: ecommerce.csv

cat("Jumlah baris :", nrow(df_csv), "\n")

## Jumlah baris : 2000

cat("Jumlah kolom :", ncol(df_csv), "\n")

## Jumlah kolom : 22

cat("Nama kolom   :", paste(names(df_csv), collapse = ", "), "\n")

## Nama kolom   : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag

1.2.2 Excel

df_excel <- read_excel("ecommerce.xlsx")
cat("File: ecommerce.xlsx\n")

## File: ecommerce.xlsx

cat("Jumlah baris :", nrow(df_excel), "\n")

## Jumlah baris : 2000

cat("Jumlah kolom :", ncol(df_excel), "\n")

## Jumlah kolom : 22

cat("Nama kolom   :", paste(names(df_excel), collapse = ", "), "\n")

## Nama kolom   : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag

1.2.3 JSON

df_json <- fromJSON("ecommerce.json") |> as.data.frame()
cat("File: ecommerce.json\n")

## File: ecommerce.json

cat("Jumlah baris :", nrow(df_json), "\n")

## Jumlah baris : 2000

cat("Jumlah kolom :", ncol(df_json), "\n")

## Jumlah kolom : 22

cat("Nama kolom   :", paste(names(df_json), collapse = ", "), "\n")

## Nama kolom   : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag

1.2.4 TXT

df_txt <- read_delim("ecommerce.txt", delim = "|", show_col_types = FALSE)
cat("File: ecommerce.txt\n")

## File: ecommerce.txt

cat("Jumlah baris :", nrow(df_txt), "\n")

## Jumlah baris : 2000

cat("Jumlah kolom :", ncol(df_txt), "\n")

## Jumlah kolom : 22

cat("Nama kolom   :", paste(names(df_txt), collapse = ", "), "\n")

## Nama kolom   : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag

1.2.5 XML

xml_doc <- read_xml("ecommerce.xml")
records <- xml_find_all(xml_doc, ".//Record")

xml_ke_df <- function(rec) {
  anak  <- xml_children(rec)
  setNames(as.list(xml_text(anak)), xml_name(anak))
}

df_xml <- bind_rows(lapply(records, xml_ke_df))
cat("File: ecommerce.xml\n")

## File: ecommerce.xml

cat("Jumlah baris :", nrow(df_xml), "\n")

## Jumlah baris : 2000

cat("Jumlah kolom :", ncol(df_xml), "\n")

## Jumlah kolom : 22

cat("Nama kolom   :", paste(names(df_xml), collapse = ", "), "\n")

## Nama kolom   : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag

1.2.6 Looping + Cek Kolom + Gabung

semua_df <- list(
  CSV   = df_csv   |> mutate(across(everything(), as.character)),
  Excel = df_excel |> mutate(across(everything(), as.character)),
  JSON  = df_json  |> mutate(across(everything(), as.character)),
  TXT   = df_txt   |> mutate(across(everything(), as.character)),
  XML   = df_xml   |> mutate(across(everything(), as.character))
)

kolom_referensi <- names(semua_df[["CSV"]])
siap_merge      <- list()

cat("=== Pengecekan Struktur Kolom ===\n\n")

## === Pengecekan Struktur Kolom ===

for (nama in names(semua_df)) {
  df_cek <- semua_df[[nama]]
  if (setequal(names(df_cek), kolom_referensi)) {
    cat(nama, ": Struktur kolom sesuai -> Ready to merge\n")
    siap_merge[[nama]] <- df_cek[, kolom_referensi]
  } else {
    cat(nama, ": Perlu penyesuaian -> Need adjustment (disesuaikan otomatis)\n")
    kolom_hilang <- setdiff(kolom_referensi, names(df_cek))
    for (kol in kolom_hilang) df_cek[[kol]] <- NA
    siap_merge[[nama]] <- df_cek[, kolom_referensi]
  }
}

## CSV : Struktur kolom sesuai -> Ready to merge
## Excel : Struktur kolom sesuai -> Ready to merge
## JSON : Struktur kolom sesuai -> Ready to merge
## TXT : Struktur kolom sesuai -> Ready to merge
## XML : Struktur kolom sesuai -> Ready to merge

df_gabungan <- bind_rows(siap_merge)
cat(sprintf("\nDataset utama berhasil digabungkan: %d baris, %d kolom\n",
            nrow(df_gabungan), ncol(df_gabungan)))

## 
## Dataset utama berhasil digabungkan: 10000 baris, 22 kolom

Interpretasi Section A: Lima file dari format berbeda berhasil dibaca dan digabungkan menggunakan looping + conditional IF/ELSE. Pendekatan ini scalable – jika ada file ke-6, cukup ditambahkan ke list tanpa mengubah logika inti.

1.3 Section B Data Handling (E-Commerce)

Tujuan: Memahami kondisi dataset hasil penggabungan.

1.3.1 Dimensi & Tipe Data

cat("Jumlah Total Baris :", nrow(df_gabungan), "\n")

## Jumlah Total Baris : 10000

cat("Jumlah Total Kolom :", ncol(df_gabungan), "\n\n")

## Jumlah Total Kolom : 22

tipe <- data.frame(
  No    = 1:ncol(df_gabungan),
  Kolom = names(df_gabungan),
  Tipe  = sapply(df_gabungan, class),
  row.names = NULL
)

kable(tipe, caption = "Tipe Data Setiap Kolom") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |>
  column_spec(2, bold = TRUE, color = "#2d6a4f") |>
  column_spec(3, background = "#f0faf3")

Tipe Data Setiap Kolom
No	Kolom	Tipe
1	order_id	character
2	order_date	character
3	ship_date	character
4	platform	character
5	category	character
6	product_name	character
7	unit_price	character
8	quantity	character
9	gross_sales	character
10	campaign	character
11	voucher_code	character
12	discount_pct	character
13	discount_value	character
14	shipping_cost	character
15	net_sales	character
16	payment_method	character
17	customer_segment	character
18	region	character
19	stock_status	character
20	order_status	character
21	customer_rating	character
22	priority_flag	character

1.3.2 Missing Values

mv <- df_gabungan |>
  summarise(across(everything(),
    ~ sum(is.na(.) | . == "" | . == "NULL"))) |>
  pivot_longer(everything(), names_to = "Kolom", values_to = "Jumlah_NA") |>
  filter(Jumlah_NA > 0) |>
  arrange(desc(Jumlah_NA)) |>
  mutate(Persen = round(Jumlah_NA / nrow(df_gabungan) * 100, 2))

kable(mv, col.names = c("Kolom","Jumlah Missing","Persentase (%)"),
      caption = "Kolom dengan Missing Values") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  column_spec(3, bold = TRUE,
              color = ifelse(mv$Persen > 20, "#e76f51", "#2d6a4f"))

Kolom dengan Missing Values
Kolom	Jumlah Missing	Persentase (%)
customer_rating	2030	20.30
ship_date	1000	10.00
priority_flag	940	9.40
discount_pct	345	3.45
voucher_code	245	2.45
payment_method	175	1.75

1.3.3 Duplikasi

dup <- sum(duplicated(df_gabungan))
pct <- round(dup / nrow(df_gabungan) * 100, 2)
cat("Jumlah baris duplikat:", dup, "\n")

## Jumlah baris duplikat: 5581

cat("Persentase duplikat  :", pct, "%\n")

## Persentase duplikat  : 55.81 %

1.3.4 Masalah Kualitas Data

masalah <- data.frame(
  No      = 1:5,
  Masalah = c(
    "Format tanggal tidak konsisten",
    "Nilai harga berformat string (Rp xxx.xxx)",
    "Inkonsistensi penulisan platform",
    "Missing values di kolom penting",
    "Duplikasi data akibat penggabungan 5 file"
  ),
  Contoh = c(
    "2024-01-01, 01/01/2024, 01-01-2024",
    "'Rp 1.500.000' alih-alih 1500000",
    "'shopee', 'SHOPEE', ' shopee '",
    "ship_date, customer_rating, payment_method",
    paste0(dup, " baris duplikat dari ", nrow(df_gabungan), " baris")
  )
)

kable(masalah, col.names = c("No","Masalah","Contoh"),
      caption = "Minimal 3 Masalah Kualitas Data yang Ditemukan") |>
  kable_styling(bootstrap_options = c("striped","hover")) |>
  column_spec(2, bold = TRUE, color = "#e76f51")

Minimal 3 Masalah Kualitas Data yang Ditemukan
No	Masalah	Contoh
1	Format tanggal tidak konsisten	2024-01-01, 01/01/2024, 01-01-2024
2	Nilai harga berformat string (Rp xxx.xxx)	‘Rp 1.500.000’ alih-alih 1500000
3	Inkonsistensi penulisan platform	‘shopee’, ‘SHOPEE’, ’ shopee ’
4	Missing values di kolom penting	ship_date, customer_rating, payment_method
5	Duplikasi data akibat penggabungan 5 file	5581 baris duplikat dari 10000 baris

Interpretasi Section B: Ditemukan 5 kategori masalah. Masalah paling kritis adalah format harga sebagai string (menghambat agregasi) dan inkonsistensi platform (menyebabkan grouping yang salah). Section C akan menyelesaikan semua masalah ini.

1.4 Section C Data Cleaning (E-Commerce)

Tujuan: Membersihkan data menggunakan fungsi kustom + looping + conditional IF/IF-ELSE.

1.4.1 Hapus Duplikat

df <- df_gabungan |> distinct(order_id, .keep_all = TRUE)
cat("Hapus duplikat berdasarkan order_id\n")

## Hapus duplikat berdasarkan order_id

cat("Baris sebelum:", nrow(df_gabungan), "\n")

## Baris sebelum: 10000

cat("Baris sesudah:", nrow(df), "\n")

## Baris sesudah: 1966

cat("Baris dihapus:", nrow(df_gabungan) - nrow(df), "\n")

## Baris dihapus: 8034

1.4.2 Looping: Bersihkan Harga

bersihkan_harga <- function(x) {
  x <- as.character(x)
  x <- str_replace_all(x, "Rp\\s*", "")
  x <- str_replace_all(x, "\\.", "")
  x <- str_replace_all(x, ",", ".")
  x <- str_trim(x)
  hasil <- suppressWarnings(as.numeric(x))
  ifelse(is.na(hasil) | hasil < 0, 0, hasil)
}

kolom_harga <- c("unit_price","gross_sales","net_sales","discount_value","shipping_cost")

cat("=== Membersihkan Kolom Harga dengan Looping ===\n\n")

## === Membersihkan Kolom Harga dengan Looping ===

for (kol in kolom_harga) {
  sebelum <- sum(is.na(df[[kol]]) | df[[kol]] == "")
  df[[kol]] <- bersihkan_harga(df[[kol]])
  cat(sprintf("  Kolom '%-18s' dibersihkan (NA sebelum: %d)\n", kol, sebelum))
}

##   Kolom 'unit_price        ' dibersihkan (NA sebelum: 0)
##   Kolom 'gross_sales       ' dibersihkan (NA sebelum: 0)
##   Kolom 'net_sales         ' dibersihkan (NA sebelum: 0)
##   Kolom 'discount_value    ' dibersihkan (NA sebelum: 0)
##   Kolom 'shipping_cost     ' dibersihkan (NA sebelum: 0)

df$discount_pct <- suppressWarnings(as.numeric(df$discount_pct))
df$discount_pct[is.na(df$discount_pct)] <- 0
df$quantity <- suppressWarnings(as.numeric(df$quantity))
cat("\nSemua kolom harga berhasil dikonversi ke numerik!\n")

## 
## Semua kolom harga berhasil dikonversi ke numerik!

1.4.3 Standardisasi Platform (WAJIB IF)

standardisasi_platform <- function(p) {
  p <- str_trim(str_to_lower(as.character(p)))
  if      (p %in% c("shopee","shopee "))              return("Shopee")
  else if (p %in% c("tokopedia","tokped"))            return("Tokopedia")
  else if (p %in% c("lazada"))                        return("Lazada")
  else if (p %in% c("blibli"))                        return("Blibli")
  else if (p %in% c("tiktok shop","tiktok_shop"))     return("TikTok Shop")
  else                                                 return(str_to_title(p))
}

df$platform <- sapply(df$platform, standardisasi_platform)
cat("Platform distandarisasi\n")

## Platform distandarisasi

print(table(df$platform))

## 
##      Blibli      Lazada      Shopee TikTok Shop   Tokopedia 
##         400         381         411         397         377

1.4.4 Missing Value (WAJIB IF)

df$payment_method <- ifelse(
  is.na(df$payment_method) | str_trim(df$payment_method) %in% c("","nan","NaN","NAN"),
  "Unknown",
  str_trim(df$payment_method)
)
cat("payment_method: NA -> 'Unknown'\n")

## payment_method: NA -> 'Unknown'

df$customer_rating <- suppressWarnings(as.numeric(df$customer_rating))
median_rating      <- median(df$customer_rating, na.rm = TRUE)
df$customer_rating <- ifelse(is.na(df$customer_rating), median_rating, df$customer_rating)
cat(sprintf("customer_rating: NA -> median = %.1f\n", median_rating))

## customer_rating: NA -> median = 5.0

cat("   (Logika: median lebih robust terhadap outlier)\n")

##    (Logika: median lebih robust terhadap outlier)

1.4.5 Standardisasi Status & Kategori

standardisasi_status <- function(s) {
  s <- str_trim(str_to_lower(as.character(s)))
  if      (s %in% c("delivered","completed","complete","completion")) return("Completed")
  else if (s %in% c("cancelled","cancel","batal","canceled"))         return("Cancelled")
  else if (s %in% c("returned","return","retur"))                     return("Returned")
  else if (s %in% c("shipped","shipping"))                            return("Shipped")
  else                                                                 return(str_to_title(s))
}
df$order_status <- sapply(df$order_status, standardisasi_status)
cat("order_status distandarisasi\n"); print(table(df$order_status))

## order_status distandarisasi

## 
##   Cancelled   Completed On Delivery    Returned     Shipped 
##         151        1550          47         117         101

cat("\n=== Standardisasi Tambahan (Looping) ===\n")

## 
## === Standardisasi Tambahan (Looping) ===

kolom_teks <- c("category","customer_segment","region","stock_status")
for (kol in kolom_teks) {
  df[[kol]] <- str_to_title(str_trim(as.character(df[[kol]])))
  cat(sprintf("  '%-20s' distandarisasi\n", kol))
}

##   'category            ' distandarisasi
##   'customer_segment    ' distandarisasi
##   'region              ' distandarisasi
##   'stock_status        ' distandarisasi

payment_map <- c(
  "e-wallet"="E-Wallet","ewallet"="E-Wallet","e wallet"="E-Wallet",
  "virtual account"="Virtual Account","va"="Virtual Account",
  "cod"="COD","cash on delivery"="COD",
  "transfer bank"="Transfer Bank","bank transfer"="Transfer Bank",
  "credit card"="Credit Card","nan"="Unknown","none"="Unknown"
)
df$payment_method <- sapply(
  str_to_lower(str_trim(df$payment_method)),
  function(x) {
    if (is.na(x) || x == "") return("Unknown")
    if (x %in% names(payment_map)) return(payment_map[x])
    return(str_to_title(x))
  }
)
cat("  'payment_method'     distandarisasi\n")

##   'payment_method'     distandarisasi

cat(sprintf("\nData bersih: %d baris, %d kolom\n", nrow(df), ncol(df)))

## 
## Data bersih: 1966 baris, 22 kolom

Interpretasi Section C: Data dibersihkan melalui 7 proses sistematis menggunakan fungsi kustom, looping, dan conditional logic. Dataset kini bebas duplikat, format harga sudah numerik, dan semua kolom kategoris sudah konsisten.

1.5 Section D Conditional Logic (E-Commerce)

Tujuan: Menerapkan logika bisnis menggunakan IF / IF-ELSE untuk membuat kolom baru.

1.5.1 is_high_value

df$net_sales <- suppressWarnings(as.numeric(df$net_sales))
df$net_sales[is.na(df$net_sales)] <- 0

df$is_high_value <- ifelse(df$net_sales > 1000000, "Yes", "No")

cat("Kolom 'is_high_value' ditambahkan\n")

## Kolom 'is_high_value' ditambahkan

tbl_hv <- table(df$is_high_value)
print(tbl_hv)

## 
##   No  Yes 
## 1211  755

cat(sprintf("\nPersentase high-value: %.1f%%\n",
            tbl_hv["Yes"] / sum(tbl_hv) * 100))

## 
## Persentase high-value: 38.4%

df_hv <- as.data.frame(tbl_hv)
colnames(df_hv) <- c("is_high_value", "count")
df_hv$pct <- round(df_hv$count / sum(df_hv$count) * 100, 1)

plot_ly(df_hv, labels = ~is_high_value, values = ~count, type = "pie",
        marker = list(colors = c("#40916c", "#b7e4c7"),
                      line = list(color = "white", width = 2)),
        textinfo = "label+percent",
        hovertemplate = "<b>%{label}</b><br>Jumlah: %{value:,}<br>Persentase: %{percent}<extra></extra>") |>
  layout(
    title = list(text = "<b>Distribusi Transaksi High-Value vs Non-High-Value</b>",
                 font = list(size = 15, color = "#1b4332")),
    legend = list(orientation = "h", x = 0.3, y = -0.1),
    paper_bgcolor = "white",
    height = 400,
    margin = list(t = 50, b = 50, l = 30, r = 30)
  )

1.5.2 order_priority (WAJIB nested IF)

tentukan_prioritas <- function(net) {
  if      (is.na(net))    return("Low")
  else if (net > 1000000) return("High")
  else if (net >= 500000) return("Medium")
  else                    return("Low")
}

df$order_priority <- sapply(df$net_sales, tentukan_prioritas)
cat("Kolom 'order_priority' ditambahkan\n")

## Kolom 'order_priority' ditambahkan

print(table(df$order_priority))

## 
##   High    Low Medium 
##    755    799    412

df_prio <- as.data.frame(table(df$order_priority))
colnames(df_prio) <- c("priority", "count")
df_prio$priority <- factor(df_prio$priority, levels = c("High","Medium","Low"))
df_prio <- df_prio[order(df_prio$priority), ]
df_prio$pct <- round(df_prio$count / sum(df_prio$count) * 100, 1)

plot_ly(df_prio, x = ~priority, y = ~count, type = "bar",
        marker = list(color = c("#1b4332","#40916c","#b7e4c7"),
                      line = list(color = "white", width = 1.5)),
        text = ~paste0(count, "<br>(", pct, "%)"),
        textposition = "outside",
        hovertemplate = "<b>%{x}</b><br>Jumlah: %{y:,}<extra></extra>") |>
  layout(
    title = list(text = "<b>Distribusi Order Priority (Nested IF)</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "Prioritas"),
    yaxis = list(title = "Jumlah Transaksi", gridcolor = "#f0faf3"),
    plot_bgcolor = "white", paper_bgcolor = "white",
    height = 400
  )

1.5.3 valid_transaction

df$valid_transaction <- ifelse(
  df$order_status == "Cancelled", "Invalid", "Valid"
)
cat("Kolom 'valid_transaction' ditambahkan\n")

## Kolom 'valid_transaction' ditambahkan

tbl_vt <- table(df$valid_transaction)
print(tbl_vt)

## 
## Invalid   Valid 
##     151    1815

cat(sprintf("\nPersentase Valid  : %.1f%%\n", tbl_vt["Valid"]  / sum(tbl_vt) * 100))

## 
## Persentase Valid  : 92.3%

cat(sprintf("Persentase Invalid: %.1f%%\n", tbl_vt["Invalid"] / sum(tbl_vt) * 100))

## Persentase Invalid: 7.7%

df_vt <- as.data.frame(tbl_vt)
colnames(df_vt) <- c("status", "count")

plot_ly(df_vt, labels = ~status, values = ~count, type = "pie", hole = 0.45,
        marker = list(colors = c("#e76f51","#40916c"),
                      line = list(color = "white", width = 2)),
        textinfo = "label+percent",
        hovertemplate = "<b>%{label}</b><br>Jumlah: %{value:,}<extra></extra>") |>
  layout(
    title = list(text = "<b>Distribusi Valid vs Invalid Transaction</b>",
                 font = list(size = 14, color = "#1b4332")),
    annotations = list(list(text = "Status", showarrow = FALSE,
                            font = list(size = 13, color = "#2d6a4f"))),
    legend = list(orientation = "h", x = 0.3, y = -0.1),
    paper_bgcolor = "white",
    height = 400
  )

1.5.4 Preview Kolom Baru

df |>
  select(order_id, platform, net_sales, order_status,
         is_high_value, order_priority, valid_transaction) |>
  head(10) |>
  kable(caption = "Preview Dataset dengan 3 Kolom Baru (10 baris pertama)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12)

Preview Dataset dengan 3 Kolom Baru (10 baris pertama)
order_id	platform	net_sales	order_status	is_high_value	order_priority	valid_transaction
ORD00612	Tokopedia	680058	Completed	No	Medium	Valid
ORD00112	TikTok Shop	1476873	Completed	Yes	High	Valid
ORD01186	Tokopedia	369715	Completed	No	Low	Valid
ORD01511	Tokopedia	1382570	Completed	Yes	High	Valid
ORD00772	Tokopedia	376696	Completed	No	Low	Valid
ORD00880	Blibli	0	Cancelled	No	Low	Invalid
ORD00592	Blibli	0	Cancelled	No	Low	Invalid
ORD01367	Blibli	2085678	Completed	Yes	High	Valid
ORD01178	Blibli	1208527	Completed	Yes	High	Valid
ORD00276	Lazada	1789349	Completed	Yes	High	Valid

Interpretasi Section D: Tiga kolom baru berhasil dibuat menggunakan conditional logic: is_high_value (segmentasi nilai), order_priority (tier 3 level dengan nested IF), dan valid_transaction (filter data analitik).

1.6 Section E Analytical Thinking (E-Commerce)

Tujuan: Menarik insight sederhana dari data.

1.6.1 Platform Paling Dominan

platform_count <- df |>
  count(platform, name = "jumlah") |>
  arrange(desc(jumlah)) |>
  mutate(persen = round(jumlah / sum(jumlah) * 100, 1))

kable(platform_count,
      col.names = c("Platform","Jumlah Transaksi","Persentase (%)"),
      caption = "Distribusi Transaksi per Platform") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(1, bold = TRUE, background = "#f0faf3", color = "#1b4332")

Distribusi Transaksi per Platform
Platform	Jumlah Transaksi	Persentase (%)
Shopee	411	20.9
Blibli	400	20.3
TikTok Shop	397	20.2
Lazada	381	19.4
Tokopedia	377	19.2

p_plat <- plot_ly(
  data = platform_count |> arrange(jumlah),
  y = ~reorder(platform, jumlah), x = ~jumlah,
  type = "bar", orientation = "h",
  marker = list(
    color = colorRampPalette(c("#b7e4c7","#40916c","#1b4332"))(nrow(platform_count)),
    line  = list(color = "white", width = 1.5)
  ),
  text = ~paste0(format(jumlah, big.mark="."), " (", persen, "%)"),
  textposition = "outside",
  hovertemplate = "<b>%{y}</b><br>Transaksi: %{x:,}<extra></extra>"
) |>
  layout(
    title = list(text = "<b>Jumlah Transaksi per Platform</b>",
                 font = list(size = 15, color = "#1b4332")),
    xaxis = list(title = "Jumlah Transaksi", gridcolor = "#f0faf3"),
    yaxis = list(title = ""),
    plot_bgcolor = "white", paper_bgcolor = "white",
    margin = list(r = 120),
    hoverlabel = list(bgcolor = "#2d6a4f", font = list(color = "white")),
    height = 400
  )
p_plat

cat(sprintf("\nPLATFORM PALING DOMINAN: %s (%d transaksi)\n",
            platform_count$platform[1], platform_count$jumlah[1]))

## 
## PLATFORM PALING DOMINAN: Shopee (411 transaksi)

1.6.2 Kategori Paling Sering

kategori_count <- df |>
  count(category, name = "jumlah") |>
  arrange(desc(jumlah)) |>
  head(8) |>
  mutate(persen = round(jumlah / nrow(df) * 100, 1))

kable(kategori_count,
      col.names = c("Kategori","Jumlah","Persentase (%)"),
      caption = "Top 8 Kategori Produk") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(1, bold = TRUE, background = "#f0faf3")

Top 8 Kategori Produk
Kategori	Jumlah	Persentase (%)
Fashion	412	21.0
Sports	408	20.8
Beauty	387	19.7
Home Living	371	18.9
Electronics	357	18.2
Home_living	31	1.6

cat(sprintf("\nKATEGORI PALING SERING: %s (%d transaksi)\n",
            kategori_count$category[1], kategori_count$jumlah[1]))

## 
## KATEGORI PALING SERING: Fashion (412 transaksi)

plot_ly(
  data = kategori_count |> arrange(jumlah),
  y = ~reorder(category, jumlah), x = ~jumlah,
  type = "bar", orientation = "h",
  marker = list(
    color = colorRampPalette(c("#b7e4c7","#40916c","#1b4332"))(nrow(kategori_count)),
    line  = list(color = "white", width = 1.2)
  ),
  text = ~paste0(jumlah, " (", persen, "%)"),
  textposition = "outside",
  hovertemplate = "<b>%{y}</b><br>Jumlah: %{x:,}<extra></extra>"
) |>
  layout(
    title = list(text = "<b>Top 8 Kategori Produk Terlaris</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "Jumlah Transaksi", gridcolor = "#f0faf3"),
    yaxis = list(title = ""),
    plot_bgcolor = "white", paper_bgcolor = "white",
    margin = list(r = 110),
    height = 400
  )

1.6.3 Status Transaksi

status_count <- df |>
  count(order_status, name = "jumlah") |>
  arrange(desc(jumlah)) |>
  mutate(persen = round(jumlah / sum(jumlah) * 100, 1))

kable(status_count,
      col.names = c("Status","Jumlah","Persentase (%)"),
      caption = "Distribusi Status Transaksi") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(1, bold = TRUE, background = "#f0faf3")

Distribusi Status Transaksi
Status	Jumlah	Persentase (%)
Completed	1550	78.8
Cancelled	151	7.7
Returned	117	6.0
Shipped	101	5.1
On Delivery	47	2.4

cat(sprintf("\nSTATUS PALING BANYAK: %s (%d transaksi)\n",
            status_count$order_status[1], status_count$jumlah[1]))

## 
## STATUS PALING BANYAK: Completed (1550 transaksi)

plot_ly(status_count, labels = ~order_status, values = ~jumlah,
        type = "pie", hole = 0.4,
        marker = list(
          colors = colorRampPalette(c("#1b4332","#40916c","#74c69d","#b7e4c7","#d8f3dc"))(nrow(status_count)),
          line   = list(color = "white", width = 2)
        ),
        textinfo = "label+percent",
        hovertemplate = "<b>%{label}</b><br>Jumlah: %{value:,}<br>%{percent}<extra></extra>") |>
  layout(
    title = list(text = "<b>Distribusi Status Transaksi</b>",
                 font = list(size = 14, color = "#1b4332")),
    legend = list(orientation = "h", x = 0.1, y = -0.15),
    paper_bgcolor = "white",
    height = 400
  )

df |> write.csv("ecommerce_cleaned_R.csv", row.names = FALSE)
cat("\nDisimpan -> ecommerce_cleaned_R.csv\n")

## 
## Disimpan -> ecommerce_cleaned_R.csv

cat(sprintf("   Total baris: %d | Total kolom: %d\n", nrow(df), ncol(df)))

##    Total baris: 1966 | Total kolom: 25

Ringkasan Section E:
1. Platform paling dominan: Shopee (20.9%)
2. Kategori paling sering: Fashion (412 transaksi)
3. Status transaksi terbanyak: Completed (78.8%)

2 MINI PROJECT 2

2.1 Library yang Digunakan (Mini Project 2)

pkgs2 <- c("rvest","httr","dplyr","jsonlite","stringr","knitr","kableExtra","plotly")
for (p in pkgs2) {
  if (!require(p, character.only = TRUE, quietly = TRUE))
    install.packages(p, repos = "https://cran.r-project.org")
}
library(rvest); library(httr); library(dplyr)
library(jsonlite); library(stringr)
cat("Semua library Mini Project 2 berhasil dimuat!")

## Semua library Mini Project 2 berhasil dimuat!

2.2 Setup Helper Functions

BASE_URL <- "https://www.scrapethissite.com"

get_page_r <- function(url, query_params = NULL) {
  tryCatch({
    resp <- GET(url,
      add_headers("User-Agent" = "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0.0.0"),
      query = query_params, timeout(15))
    if (status_code(resp) == 200)
      read_html(content(resp, "text", encoding = "UTF-8"))
    else { cat(sprintf("  [ERROR] Status %d\n", status_code(resp))); NULL }
  }, error = function(e) { cat(sprintf("  [ERROR] %s\n", e$message)); NULL })
}

cat("Helper functions siap! BASE_URL:", BASE_URL, "\n")

## Helper functions siap! BASE_URL: https://www.scrapethissite.com

2.3 Website 1 Countries of the World (Static HTML)

Website 1 Static HTML R (rvest)
URL: scrapethissite.com/pages/simple/ | Data: country name, capital, population, area

2.3.1 Section A Scraping

countries_list <- list()
page <- get_page_r(paste0(BASE_URL, "/pages/simple/"))

if (!is.null(page)) {
  nodes <- page %>% html_nodes("div.country")
  cat(sprintf("Ditemukan %d negara\n\n", length(nodes)))

  for (i in seq_along(nodes)) {
    nd   <- nodes[[i]]
    name <- nd %>% html_node("h3.country-name")         %>% html_text(trim=TRUE)
    cap  <- nd %>% html_node("span.country-capital")    %>% html_text(trim=TRUE)
    pop  <- nd %>% html_node("span.country-population") %>% html_text(trim=TRUE)
    area <- nd %>% html_node("span.country-area")       %>% html_text(trim=TRUE)

    name <- ifelse(is.na(name)||nchar(name)==0, "Unknown", name)
    cap  <- ifelse(is.na(cap) ||nchar(trimws(cap))==0, "Unknown", cap)
    pop  <- ifelse(is.na(pop),  "0", pop)
    area <- ifelse(is.na(area), "0", area)

    data_status <- if(name!="Unknown" && cap!="Unknown") "Complete" else "Incomplete"

    countries_list[[i]] <- list(country_name=name, capital=cap,
                                population=pop, area=area, data_status=data_status)
  }
}

## Ditemukan 250 negara

df_countries_raw <- bind_rows(countries_list)
cat(sprintf("Total data: %d baris\n", nrow(df_countries_raw)))

## Total data: 250 baris

kable(head(df_countries_raw, 5), caption = "Preview Countries (Raw)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12)

Preview Countries (Raw)
country_name	capital	population	area	data_status
Andorra	Andorra la Vella	84000	468.0	Complete
United Arab Emirates	Abu Dhabi	4975593	82880.0	Complete
Afghanistan	Kabul	29121286	647500.0	Complete
Antigua and Barbuda	St. John’s	86754	443.0	Complete
Anguilla	The Valley	13254	102.0	Complete

2.3.2 Section B Data Handling

cat(sprintf("Baris: %d | Kolom: %d | Duplikat: %d\n\n",
            nrow(df_countries_raw), ncol(df_countries_raw),
            sum(duplicated(df_countries_raw))))

## Baris: 250 | Kolom: 5 | Duplikat: 0

cat("Tipe data:\n"); print(sapply(df_countries_raw, class))

## Tipe data:

## country_name      capital   population         area  data_status 
##  "character"  "character"  "character"  "character"  "character"

cat("\nMissing values:\n"); print(colSums(is.na(df_countries_raw)))

## 
## Missing values:

## country_name      capital   population         area  data_status 
##            0            0            0            0            0

cat("\nDATA ISSUES:\n")

## 
## DATA ISSUES:

cat("  (1) population & area bertipe character, bukan numerik\n")

##   (1) population & area bertipe character, bukan numerik

cat("  (2) Beberapa capital memiliki spasi ekstra\n")

##   (2) Beberapa capital memiliki spasi ekstra

2.3.3 Section C Cleaning

df_countries <- df_countries_raw
for (col in c("country_name","capital")) {
  df_countries[[col]] <- str_to_title(str_trim(df_countries[[col]]))
  cat(sprintf("  '%-15s' trim & proper case\n", col))
}

##   'country_name   ' trim & proper case
##   'capital        ' trim & proper case

df_countries$population <- suppressWarnings(as.numeric(str_replace_all(df_countries$population,",","")))
df_countries$area       <- suppressWarnings(as.numeric(str_replace_all(df_countries$area,",","")))
df_countries$population[is.na(df_countries$population)] <- 0
df_countries$area[is.na(df_countries$area)]             <- 0
df_countries <- df_countries %>% distinct(country_name, .keep_all=TRUE)
cat(sprintf("\nBaris final: %d\n", nrow(df_countries)))

## 
## Baris final: 250

write.csv(df_countries, "countries_of_the_world_R.csv", row.names=FALSE)
cat("Disimpan -> countries_of_the_world_R.csv\n")

## Disimpan -> countries_of_the_world_R.csv

print(sapply(df_countries, class))

## country_name      capital   population         area  data_status 
##  "character"  "character"    "numeric"    "numeric"  "character"

kable(head(df_countries, 5)) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)

country_name	capital	population	area	data_status
Andorra	Andorra La Vella	84000	468	Complete
United Arab Emirates	Abu Dhabi	4975593	82880	Complete
Afghanistan	Kabul	29121286	647500	Complete
Antigua And Barbuda	St. John’s	86754	443	Complete
Anguilla	The Valley	13254	102	Complete

2.3.4 Section D Conditional Logic

df_countries <- df_countries %>%
  mutate(
    is_large = case_when(
      population > 10000000 ~ "Large",
      population > 1000000  ~ "Medium",
      TRUE                  ~ "Small"
    ),
    data_status = case_when(
      is.na(country_name) | country_name == "Unknown" ~ "Incomplete",
      capital == "Unknown"                            ~ "Incomplete",
      TRUE                                            ~ "Complete"
    )
  )

cat("Kolom 'is_large' ditambahkan\n")

## Kolom 'is_large' ditambahkan

print(table(df_countries$is_large))

## 
##  Large Medium  Small 
##     82     76     92

cat("\nKolom 'data_status' diperbarui\n")

## 
## Kolom 'data_status' diperbarui

print(table(df_countries$data_status))

## 
## Complete 
##      250

write.csv(df_countries, "countries_of_the_world_R.csv", row.names = FALSE)
cat("\nDisimpan -> countries_of_the_world_R.csv\n")

## 
## Disimpan -> countries_of_the_world_R.csv

cat("\n=== INSIGHT COUNTRIES ===\n")

## 
## === INSIGHT COUNTRIES ===

biggest <- df_countries[which.max(df_countries$population), ]
cat(sprintf("1. Negara dengan populasi terbesar: %s (%.0f jiwa)\n",
            biggest$country_name, biggest$population))

## 1. Negara dengan populasi terbesar: China (1330044000 jiwa)

cat("2. Distribusi ukuran negara:\n")

## 2. Distribusi ukuran negara:

print(table(df_countries$is_large))

## 
##  Large Medium  Small 
##     82     76     92

total_c <- nrow(df_countries)
cat(sprintf("3. Persentase: Large=%.1f%%, Medium=%.1f%%, Small=%.1f%%\n",
  sum(df_countries$is_large=="Large")  / total_c * 100,
  sum(df_countries$is_large=="Medium") / total_c * 100,
  sum(df_countries$is_large=="Small")  / total_c * 100))

## 3. Persentase: Large=32.8%, Medium=30.4%, Small=36.8%

top15_countries <- df_countries |> arrange(desc(population)) |> head(15)

plot_ly(
  data = top15_countries,
  y = ~reorder(country_name, population), x = ~population,
  type = "bar", orientation = "h",
  marker = list(
    color = colorRampPalette(c("#b7e4c7","#40916c","#1b4332"))(15),
    line  = list(color = "white", width = 1)
  ),
  text  = ~paste0(round(population/1e6, 1), " jt"),
  textposition = "outside",
  hovertemplate = "<b>%{y}</b><br>Populasi: %{x:,.0f}<extra></extra>"
) |>
  layout(
    title = list(text = "<b>Top 15 Negara dengan Populasi Terbesar</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "Populasi", gridcolor = "#f0faf3"),
    yaxis = list(title = ""),
    plot_bgcolor = "white", paper_bgcolor = "white",
    margin = list(r = 80),
    height = 420
  )

df_size <- as.data.frame(table(df_countries$is_large))
colnames(df_size) <- c("size", "count")
plot_ly(df_size, labels = ~size, values = ~count, type = "pie",
        marker = list(colors = c("#1b4332","#74c69d","#d8f3dc"),
                      line   = list(color = "white", width = 2)),
        textinfo = "label+percent",
        hovertemplate = "<b>%{label}</b><br>Jumlah: %{value}<extra></extra>") |>
  layout(
    title = list(text = "<b>Distribusi Ukuran Negara (Berdasarkan Populasi)</b>",
                 font = list(size = 14, color = "#1b4332")),
    legend = list(orientation = "h", x = 0.2, y = -0.1),
    paper_bgcolor = "white",
    height = 400
  )

Interpretasi Countries: 250 negara berhasil di-scrape. Kolom is_large mengkategorikan negara berdasarkan populasi: Large (>10 juta), Medium (1-10 juta), Small (<1 juta). Static HTML adalah tipe paling sederhana – satu request langsung menghasilkan semua data tanpa JavaScript rendering.

2.4 Website 2 Hockey Teams (Pagination)

Website 2 Pagination R (rvest + httr)
URL: scrapethissite.com/pages/forms/ | Data: team name, year, wins, losses

2.4.1 Section A Scraping

hockey_list <- list(); counter <- 1

cat("Scraping dengan pagination...\n\n")

## Scraping dengan pagination...

for (pg in 1:24) {
  page <- get_page_r(paste0(BASE_URL, "/pages/forms/"),
                     query_params = list(page_num=pg, per_page=25))
  if (!is.null(page)) {
    rows <- page %>% html_nodes("tr.team")
    if (length(rows) == 0) { cat(sprintf("  Pagination berhenti di halaman %d\n", pg)); break }
    cat(sprintf("  Halaman %d -> %d tim\n", pg, length(rows)))

    for (row in rows) {
      cols <- row %>% html_nodes("td") %>% html_text(trim=TRUE)
      if (length(cols) >= 8) {
        ds <- if(nchar(cols[1])>0 && nchar(cols[2])>0) "Complete" else "Incomplete"
        hockey_list[[counter]] <- list(
          team_name=cols[1], year=cols[2], wins=cols[3], losses=cols[4],
          win_pct=cols[6], goals_for=cols[7], goals_against=cols[8], data_status=ds)
        counter <- counter + 1
      }
    }
  }
  Sys.sleep(0.3)
}

##   Halaman 1 -> 25 tim
##   Halaman 2 -> 25 tim
##   Halaman 3 -> 25 tim
##   Halaman 4 -> 25 tim
##   Halaman 5 -> 25 tim
##   Halaman 6 -> 25 tim
##   Halaman 7 -> 25 tim
##   Halaman 8 -> 25 tim
##   Halaman 9 -> 25 tim
##   Halaman 10 -> 25 tim
##   Halaman 11 -> 25 tim
##   Halaman 12 -> 25 tim
##   Halaman 13 -> 25 tim
##   Halaman 14 -> 25 tim
##   Halaman 15 -> 25 tim
##   Halaman 16 -> 25 tim
##   Halaman 17 -> 25 tim
##   Halaman 18 -> 25 tim
##   Halaman 19 -> 25 tim
##   Halaman 20 -> 25 tim
##   Halaman 21 -> 25 tim
##   Halaman 22 -> 25 tim
##   Halaman 23 -> 25 tim
##   Halaman 24 -> 7 tim

df_hockey_raw <- bind_rows(hockey_list)
cat(sprintf("\nTotal data: %d baris\n", nrow(df_hockey_raw)))

## 
## Total data: 582 baris

kable(head(df_hockey_raw, 5), caption = "Preview Hockey Teams (Raw)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)

Preview Hockey Teams (Raw)
team_name	year	wins	losses	win_pct	goals_for	goals_against	data_status
Boston Bruins	1990	44	24	0.55	299	264	Complete
Buffalo Sabres	1990	31	30	0.388	292	278	Complete
Calgary Flames	1990	46	26	0.575	344	263	Complete
Chicago Blackhawks	1990	49	23	0.613	284	211	Complete
Detroit Red Wings	1990	34	38	0.425	273	298	Complete

2.4.2 Section B Data Handling

cat(sprintf("Baris: %d | Kolom: %d | Duplikat: %d\n\n",
            nrow(df_hockey_raw), ncol(df_hockey_raw),
            sum(duplicated(df_hockey_raw))))

## Baris: 582 | Kolom: 8 | Duplikat: 0

cat("Tipe data:\n"); print(sapply(df_hockey_raw, class))

## Tipe data:

##     team_name          year          wins        losses       win_pct 
##   "character"   "character"   "character"   "character"   "character" 
##     goals_for goals_against   data_status 
##   "character"   "character"   "character"

cat("\nDATA ISSUES:\n")

## 
## DATA ISSUES:

cat("  (1) Kolom wins/losses/goals bertipe character, bukan numerik\n")

##   (1) Kolom wins/losses/goals bertipe character, bukan numerik

cat("  (2) Kolom year bertipe string, perlu konversi ke integer\n")

##   (2) Kolom year bertipe string, perlu konversi ke integer

2.4.3 Section C Cleaning

df_hockey <- df_hockey_raw
for (col in c("wins","losses","goals_for","goals_against")) {
  df_hockey[[col]] <- suppressWarnings(as.numeric(df_hockey[[col]]))
  if(any(is.na(df_hockey[[col]]))) df_hockey[[col]][is.na(df_hockey[[col]])] <- 0
  cat(sprintf("  '%-15s' -> numeric\n", col))
}

##   'wins           ' -> numeric
##   'losses         ' -> numeric
##   'goals_for      ' -> numeric
##   'goals_against  ' -> numeric

df_hockey$year     <- suppressWarnings(as.numeric(df_hockey$year))
df_hockey$win_pct  <- suppressWarnings(as.numeric(df_hockey$win_pct))
df_hockey$team_name <- str_to_title(str_trim(df_hockey$team_name))
df_hockey <- df_hockey %>% distinct(team_name, year, .keep_all=TRUE)
cat(sprintf("\nBaris final: %d\n", nrow(df_hockey)))

## 
## Baris final: 582

write.csv(df_hockey, "hockey_teams_R.csv", row.names=FALSE)
cat("Disimpan -> hockey_teams_R.csv\n")

## Disimpan -> hockey_teams_R.csv

print(sapply(df_hockey, class))

##     team_name          year          wins        losses       win_pct 
##   "character"     "numeric"     "numeric"     "numeric"     "numeric" 
##     goals_for goals_against   data_status 
##     "numeric"     "numeric"   "character"

kable(head(df_hockey, 5)) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)

team_name	year	wins	losses	win_pct	goals_for	goals_against	data_status
Boston Bruins	1990	44	24	0.550	299	264	Complete
Buffalo Sabres	1990	31	30	0.388	292	278	Complete
Calgary Flames	1990	46	26	0.575	344	263	Complete
Chicago Blackhawks	1990	49	23	0.613	284	211	Complete
Detroit Red Wings	1990	34	38	0.425	273	298	Complete

2.4.4 Section D Conditional Logic

df_hockey <- df_hockey %>%
  mutate(
    performance = case_when(
      wins > 40 ~ "High",
      wins > 20 ~ "Medium",
      TRUE      ~ "Low"
    ),
    data_status = case_when(
      is.na(team_name) | team_name == "" ~ "Incomplete",
      is.na(wins)                        ~ "Incomplete",
      TRUE                               ~ "Complete"
    )
  )

df_hockey <- df_hockey %>%
  mutate(win_rate = round(wins / (wins + losses) * 100, 2))
df_hockey$win_rate[is.nan(df_hockey$win_rate)] <- 0

cat("Kolom 'performance' ditambahkan\n")

## Kolom 'performance' ditambahkan

print(table(df_hockey$performance))

## 
##   High    Low Medium 
##    226     27    329

cat("\nKolom 'win_rate' ditambahkan\n")

## 
## Kolom 'win_rate' ditambahkan

cat("\nKolom 'data_status' diperbarui\n")

## 
## Kolom 'data_status' diperbarui

print(table(df_hockey$data_status))

## 
## Complete 
##      582

write.csv(df_hockey, "hockey_teams_R.csv", row.names = FALSE)
cat("\nDisimpan -> hockey_teams_R.csv\n")

## 
## Disimpan -> hockey_teams_R.csv

kable(
  df_hockey %>% select(team_name, year, wins, losses, win_rate) %>% head(8),
  caption = "Preview Hockey Teams dengan Win Rate"
) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12)

Preview Hockey Teams dengan Win Rate
team_name	year	wins	losses	win_rate
Boston Bruins	1990	44	24	64.71
Buffalo Sabres	1990	31	30	50.82
Calgary Flames	1990	46	26	63.89
Chicago Blackhawks	1990	49	23	68.06
Detroit Red Wings	1990	34	38	47.22
Edmonton Oilers	1990	37	37	50.00
Hartford Whalers	1990	31	38	44.93
Los Angeles Kings	1990	46	24	65.71

cat("\n=== INSIGHT HOCKEY TEAMS ===\n")

## 
## === INSIGHT HOCKEY TEAMS ===

best_team <- df_hockey[which.max(df_hockey$win_rate), ]
cat(sprintf("1. Tim dengan win rate tertinggi: %s (%d) - %.2f%%\n",
            best_team$team_name, best_team$year, best_team$win_rate))

## 1. Tim dengan win rate tertinggi: Detroit Red Wings (1995) - 82.67%

avg_per_year <- df_hockey %>%
  group_by(year) %>%
  summarise(avg_wr = round(mean(win_rate, na.rm=TRUE), 2), .groups="drop") %>%
  arrange(desc(avg_wr)) %>% slice(1)
cat(sprintf("2. Tahun dengan rata-rata win rate tertinggi: %d (%.2f%%)\n",
            avg_per_year$year, avg_per_year$avg_wr))

## 2. Tahun dengan rata-rata win rate tertinggi: 2009 (56.89%)

cat("3. Distribusi performance:\n")

## 3. Distribusi performance:

print(table(df_hockey$performance))

## 
##   High    Low Medium 
##    226     27    329

df_perf <- as.data.frame(table(df_hockey$performance))
colnames(df_perf) <- c("performance", "count")
df_perf$performance <- factor(df_perf$performance, levels = c("High","Medium","Low"))

plot_ly(df_perf, x = ~performance, y = ~count, type = "bar",
        marker = list(color = c("#1b4332","#40916c","#b7e4c7"),
                      line = list(color = "white", width = 1.5)),
        text = ~count, textposition = "outside",
        hovertemplate = "<b>%{x}</b><br>Jumlah Tim: %{y:,}<extra></extra>") |>
  layout(
    title = list(text = "<b>Distribusi Performance Tim Hockey</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "Performance"),
    yaxis = list(title = "Jumlah Tim", gridcolor = "#f0faf3"),
    plot_bgcolor = "white", paper_bgcolor = "white",
    height = 400
  )

avg_wr_year <- df_hockey %>%
  group_by(year) %>%
  summarise(avg_win_rate = round(mean(win_rate, na.rm=TRUE), 2), .groups="drop") %>%
  arrange(year)

plot_ly(avg_wr_year, x = ~year, y = ~avg_win_rate, type = "scatter", mode = "lines+markers",
        line    = list(color = "#40916c", width = 2.5),
        marker  = list(color = "#1b4332", size = 7),
        text    = ~paste0(avg_win_rate, "%"),
        textposition = "top center",
        hovertemplate = "<b>Tahun %{x}</b><br>Avg Win Rate: %{y:.2f}%<extra></extra>") |>
  layout(
    title = list(text = "<b>Rata-rata Win Rate Tim Hockey per Tahun</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "Tahun", gridcolor = "#f0faf3"),
    yaxis = list(title = "Rata-rata Win Rate (%)", gridcolor = "#f0faf3"),
    plot_bgcolor = "white", paper_bgcolor = "white",
    height = 400
  )

Interpretasi Hockey: 582 records berhasil dikumpulkan. Kolom performance mengkategorikan tim: High (>40 wins), Medium (>20 wins), Low (<=20 wins). Kolom win_rate menunjukkan persentase kemenangan setiap tim per musim.

2.5 Website 3 Oscar Winning Films (AJAX)

Website 3 AJAX / JSON R (httr + jsonlite)
Strategi: Inspect Network -> endpoint XHR -> parse JSON langsung dengan fromJSON()

2.5.1 Section A Scraping

oscar_list <- list(); counter <- 1
cat("Scraping data Oscar per tahun...\n\n")

## Scraping data Oscar per tahun...

for (year in 2010:2015) {
  url_oscar <- sprintf("%s/pages/ajax-javascript/?ajax=true&year=%d", BASE_URL, year)
  resp      <- GET(url_oscar, add_headers("User-Agent"="Mozilla/5.0"), timeout(15))

  if (status_code(resp) == 200) {
    movies <- tryCatch(
      fromJSON(content(resp, "text", encoding="UTF-8"), flatten=TRUE),
      error = function(e) NULL
    )
    if (!is.null(movies) && is.data.frame(movies) && nrow(movies) > 0) {
      cat(sprintf("  Tahun %d -> %d film\n", year, nrow(movies)))
      for (i in seq_len(nrow(movies))) {
        title <- ifelse(!is.null(movies$title[i]) && nchar(str_trim(movies$title[i]))>0,
                        str_trim(movies$title[i]), "Unknown")
        ds    <- if(title!="Unknown") "Complete" else "Incomplete"
        oscar_list[[counter]] <- list(
          movie_title = title, year = year,
          nominations = as.character(ifelse(is.null(movies$nominations[i]), 0, movies$nominations[i])),
          awards      = as.character(ifelse(is.null(movies$awards[i]),      0, movies$awards[i])),
          data_status = ds)
        counter <- counter + 1
      }
    }
  } else {
    cat(sprintf("  Gagal tahun %d: status %d\n", year, status_code(resp)))
  }
  Sys.sleep(0.3)
}

##   Tahun 2010 -> 13 film
##   Tahun 2011 -> 15 film
##   Tahun 2012 -> 15 film
##   Tahun 2013 -> 12 film
##   Tahun 2014 -> 16 film
##   Tahun 2015 -> 16 film

df_oscar_raw <- bind_rows(oscar_list)
cat(sprintf("\nTotal film terkumpul: %d baris\n", nrow(df_oscar_raw)))

## 
## Total film terkumpul: 87 baris

kable(head(df_oscar_raw, 5), caption = "Preview Oscar Films (Raw)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)

Preview Oscar Films (Raw)
movie_title	year	nominations	awards	data_status
The King’s Speech	2010	12	4	Complete
Inception	2010	8	4	Complete
The Social Network	2010	8	3	Complete
The Fighter	2010	7	2	Complete
Toy Story 3	2010	5	2	Complete

2.5.2 Section B Data Handling

cat(sprintf("Baris: %d | Kolom: %d | Duplikat: %d\n\n",
            nrow(df_oscar_raw), ncol(df_oscar_raw),
            sum(duplicated(df_oscar_raw))))

## Baris: 87 | Kolom: 5 | Duplikat: 0

cat("Tipe data:\n"); print(sapply(df_oscar_raw, class))

## Tipe data:

## movie_title        year nominations      awards data_status 
## "character"   "integer" "character" "character" "character"

cat("\nDATA ISSUES:\n")

## 
## DATA ISSUES:

cat("  (1) nominations & awards bertipe character, perlu konversi numerik\n")

##   (1) nominations & awards bertipe character, perlu konversi numerik

cat("  (2) Beberapa film nominations = 0 (perlu diverifikasi)\n")

##   (2) Beberapa film nominations = 0 (perlu diverifikasi)

2.5.3 Section C Cleaning

df_oscar <- df_oscar_raw
if (nrow(df_oscar) > 0) {
  df_oscar$movie_title <- str_to_title(str_trim(df_oscar$movie_title))
  for (col in c("nominations","awards")) {
    df_oscar[[col]] <- suppressWarnings(as.numeric(df_oscar[[col]]))
    df_oscar[[col]][is.na(df_oscar[[col]])] <- 0
    cat(sprintf("  '%-15s' -> numeric\n", col))
  }
  df_oscar$movie_title[is.na(df_oscar$movie_title)|df_oscar$movie_title==""] <- "Unknown"
  df_oscar <- df_oscar %>% distinct()
  cat(sprintf("\nBaris final: %d\n", nrow(df_oscar)))
}

##   'nominations    ' -> numeric
##   'awards         ' -> numeric
## 
## Baris final: 87

write.csv(df_oscar, "oscar_winning_films_R.csv", row.names=FALSE)
cat("Disimpan -> oscar_winning_films_R.csv\n")

## Disimpan -> oscar_winning_films_R.csv

print(sapply(df_oscar, class))

## movie_title        year nominations      awards data_status 
## "character"   "integer"   "numeric"   "numeric" "character"

kable(head(df_oscar, 5)) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)

movie_title	year	nominations	awards	data_status
The King’s Speech	2010	12	4	Complete
Inception	2010	8	4	Complete
The Social Network	2010	8	3	Complete
The Fighter	2010	7	2	Complete
Toy Story 3	2010	5	2	Complete

2.5.4 Section D Conditional Logic

if (nrow(df_oscar) > 0) {
  df_oscar <- df_oscar %>%
    mutate(
      category = case_when(
        awards >= 3 ~ "High Awards",
        awards >= 1 ~ "Mid Awards",
        TRUE        ~ "No Award"
      ),
      data_status = case_when(
        is.na(movie_title) | movie_title == "Unknown" ~ "Incomplete",
        TRUE                                          ~ "Complete"
      )
    )
}

cat("Kolom 'category' ditambahkan\n")

## Kolom 'category' ditambahkan

print(table(df_oscar$category))

## 
## High Awards  Mid Awards 
##          16          71

cat("\nKolom 'data_status' diperbarui\n")

## 
## Kolom 'data_status' diperbarui

print(table(df_oscar$data_status))

## 
## Complete 
##       87

write.csv(df_oscar, "oscar_winning_films_R.csv", row.names = FALSE)
cat("\nDisimpan ulang -> oscar_winning_films_R.csv\n")

## 
## Disimpan ulang -> oscar_winning_films_R.csv

kable(head(df_oscar, 8),
      caption = "Preview Oscar Films dengan Kolom Category & Data Status") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12) |>
  column_spec(5, bold = TRUE,
              color = ifelse(
                head(df_oscar, 8)$category == "High Awards", "#2d6a4f",
                ifelse(head(df_oscar, 8)$category == "Mid Awards", "#40916c", "#95d5b2")
              ))

Preview Oscar Films dengan Kolom Category & Data Status
movie_title	year	nominations	awards	data_status	category
The King’s Speech	2010	12	4	Complete	High Awards
Inception	2010	8	4	Complete	High Awards
The Social Network	2010	8	3	Complete	High Awards
The Fighter	2010	7	2	Complete	Mid Awards
Toy Story 3	2010	5	2	Complete	Mid Awards
Alice In Wonderland	2010	3	2	Complete	Mid Awards
Black Swan	2010	5	1	Complete	Mid Awards
In A Better World	2010	1	1	Complete	Mid Awards

if (nrow(df_oscar) > 0) {
  df_oscar_cat <- as.data.frame(table(df_oscar$category))
  colnames(df_oscar_cat) <- c("category", "count")
  plot_ly(df_oscar_cat, labels = ~category, values = ~count, type = "pie",
          marker = list(colors = c("#1b4332","#74c69d","#d8f3dc"),
                        line   = list(color = "white", width = 2)),
          textinfo = "label+percent",
          hovertemplate = "<b>%{label}</b><br>Jumlah: %{value}<extra></extra>") |>
    layout(
      title = list(text = "<b>Distribusi Kategori Film Oscar (2010-2015)</b>",
                   font = list(size = 14, color = "#1b4332")),
      legend = list(orientation = "h", x = 0.1, y = -0.1),
      paper_bgcolor = "white",
      height = 400
    )
}

if (nrow(df_oscar) > 0) {
  films_per_year <- df_oscar %>% count(year, name = "total")
  plot_ly(films_per_year, x = ~year, y = ~total, type = "bar",
          marker = list(
            color = colorRampPalette(c("#b7e4c7","#1b4332"))(nrow(films_per_year)),
            line  = list(color = "white", width = 1.5)
          ),
          text = ~total, textposition = "outside",
          hovertemplate = "<b>Tahun %{x}</b><br>Film: %{y}<extra></extra>") |>
    layout(
      title = list(text = "<b>Jumlah Film Oscar yang Di-scrape per Tahun</b>",
                   font = list(size = 14, color = "#1b4332")),
      xaxis = list(title = "Tahun", tickmode = "linear"),
      yaxis = list(title = "Jumlah Film", gridcolor = "#f0faf3"),
      plot_bgcolor = "white", paper_bgcolor = "white",
      height = 400
    )
}

Interpretasi Oscar: 87 film (2010-2015) berhasil dikumpulkan. Kolom category membagi film menjadi: High Awards (>=3 award), Mid Awards (1-2 award), No Award (0 award). Kunci scraping: endpoint AJAX mengembalikan JSON murni yang di-parse langsung dengan fromJSON().

2.6 Website 4 Turtles (Frames & iFrames)

Website 4 iFrames R (rvest)
Strategi: Akses halaman utama -> ambil src iframe -> request ulang ke URL iframe

2.6.1 Section A Scraping

turtles_list <- list(); counter <- 1
page_main    <- get_page_r(paste0(BASE_URL, "/pages/frames/"))

if (!is.null(page_main)) {
  iframes <- html_nodes(page_main, "iframe")
  cat(sprintf("Jumlah iframe ditemukan: %d\n\n", length(iframes)))

  for (iframe in iframes) {
    src <- html_attr(iframe, "src")
    if (is.na(src) || nchar(src)==0) next
    iframe_url <- if(str_starts(src,"/")) paste0(BASE_URL, src) else src
    cat(sprintf("  Mengakses: %s\n", iframe_url))

    iframe_page <- get_page_r(iframe_url)
    if (!is.null(iframe_page)) {
      rows <- html_nodes(iframe_page, "tr")
      for (row in rows) {
        cols <- html_nodes(row, "td")
        if (length(cols)>=2 && nchar(str_trim(html_text(cols[[1]])))>1) {
          name <- str_trim(html_text(cols[[1]]))
          desc <- str_trim(html_text(cols[[2]]))
          info <- if(length(cols)>2) str_trim(html_text(cols[[3]])) else "N/A"
          if (!nchar(desc)) desc <- "Unknown"
          if (!nchar(info)) info <- "N/A"
          ds <- if(name!=""&&desc!="Unknown") "Complete" else "Incomplete"
          turtles_list[[counter]] <- list(
            name=name, description=desc,
            additional_info=info, source_iframe=iframe_url, data_status=ds)
          counter <- counter + 1
        }
      }
    }
  }
}

## Jumlah iframe ditemukan: 1
## 
##   Mengakses: https://www.scrapethissite.com/pages/frames/?frame=i

if (length(turtles_list)==0) {
  df_turtles_raw <- data.frame(name=character(), description=character(),
    additional_info=character(), source_iframe=character(),
    data_status=character(), stringsAsFactors=FALSE)
  cat("[WARN] Tidak ada data turtle.\n")
} else {
  df_turtles_raw <- bind_rows(turtles_list) |>
    mutate(across(where(is.character), ~ str_trunc(str_trim(.x), 80)))
}

## [WARN] Tidak ada data turtle.

cat(sprintf("\nTotal data: %d baris\n", nrow(df_turtles_raw)))

## 
## Total data: 0 baris

if (nrow(df_turtles_raw)>0)
  kable(head(df_turtles_raw,5), caption="Preview Turtles (Raw)") |>
    kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)

2.6.2 Section B Data Handling

cat(sprintf("Baris: %d | Kolom: %d | Duplikat: %d\n\n",
            nrow(df_turtles_raw), ncol(df_turtles_raw),
            sum(duplicated(df_turtles_raw))))

## Baris: 0 | Kolom: 5 | Duplikat: 0

cat("Tipe data:\n"); print(sapply(df_turtles_raw, class))

## Tipe data:

##            name     description additional_info   source_iframe     data_status 
##     "character"     "character"     "character"     "character"     "character"

cat("\nDATA ISSUES:\n")

## 
## DATA ISSUES:

cat("  (1) Kolom additional_info banyak berisi 'N/A'\n")

##   (1) Kolom additional_info banyak berisi 'N/A'

cat("  (2) Beberapa description sangat pendek/tidak informatif\n")

##   (2) Beberapa description sangat pendek/tidak informatif

2.6.3 Section C Cleaning

df_turtles <- df_turtles_raw
for (col in c("name","description","additional_info")) {
  if (col %in% names(df_turtles)) {
    df_turtles[[col]] <- str_trim(df_turtles[[col]])
    df_turtles[[col]][is.na(df_turtles[[col]])] <- "Unknown"
    cat(sprintf("  '%-20s' trim & fill NA\n", col))
  }
}

##   'name                ' trim & fill NA
##   'description         ' trim & fill NA
##   'additional_info     ' trim & fill NA

df_turtles <- df_turtles %>% distinct()
cat(sprintf("\nBaris final: %d\n", nrow(df_turtles)))

## 
## Baris final: 0

write.csv(df_turtles, "turtles_data_R.csv", row.names=FALSE)
cat("Disimpan -> turtles_data_R.csv\n")

## Disimpan -> turtles_data_R.csv

if (nrow(df_turtles)>0)
  kable(head(df_turtles,5)) |>
    kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)

2.6.4 Section D Conditional Logic

df_turtles <- df_turtles %>%
  mutate(
    data_status = case_when(
      is.na(name)        | name        == "Unknown" ~ "Incomplete",
      is.na(description) | description == "Unknown" ~ "Incomplete",
      TRUE                                          ~ "Complete"
    )
  )

cat("Kolom 'data_status' diperbarui\n")

## Kolom 'data_status' diperbarui

print(table(df_turtles$data_status))

## < table of extent 0 >

write.csv(df_turtles, "turtles_data_R.csv", row.names = FALSE)
cat("\nDisimpan -> turtles_data_R.csv\n")

## 
## Disimpan -> turtles_data_R.csv

cat("\n=== INSIGHT TURTLES ===\n")

## 
## === INSIGHT TURTLES ===

cat(sprintf("1. Total family kura-kura yang ditemukan: %d family\n", nrow(df_turtles)))

## 1. Total family kura-kura yang ditemukan: 0 family

cat("2. Status kelengkapan data:\n")

## 2. Status kelengkapan data:

print(table(df_turtles$data_status))

## < table of extent 0 >

complete_pct <- sum(df_turtles$data_status=="Complete") / nrow(df_turtles) * 100
cat(sprintf("3. Persentase data Complete: %.1f%%\n", complete_pct))

## 3. Persentase data Complete: NaN%

if (nrow(df_turtles) > 0) {
  df_turt_status <- as.data.frame(table(df_turtles$data_status))
  colnames(df_turt_status) <- c("status", "count")
  df_turt_status$pct <- round(df_turt_status$count / sum(df_turt_status$count) * 100, 1)

  plot_ly(df_turt_status, labels = ~status, values = ~count, type = "pie", hole = 0.45,
          marker = list(colors = c("#40916c","#e76f51"),
                        line   = list(color = "white", width = 2)),
          textinfo = "label+percent",
          hovertemplate = "<b>%{label}</b><br>Jumlah: %{value}<extra></extra>") |>
    layout(
      title = list(text = "<b>Kelengkapan Data Turtles (iFrame)</b>",
                   font = list(size = 14, color = "#1b4332")),
      annotations = list(list(text = "Status", showarrow = FALSE,
                              font = list(size = 13, color = "#2d6a4f"))),
      legend = list(orientation = "h", x = 0.3, y = -0.1),
      paper_bgcolor = "white",
      height = 400
    )
}

Interpretasi Turtles: 0 baris berhasil diambil dari iframe. Persentase data Complete: 0%. Kunci: detect src iframe -> buat URL absolut -> request baru ke URL tersebut.

2.6.5 Section E Analytical Thinking (Web Scraping)

Section E
Tujuan: Menganalisis proses scraping, membandingkan pendekatan, dan memberikan insight + rekomendasi.

2.7 Rekap Semua Dataset

datasets_mp2 <- list(Countries=df_countries, Hockey=df_hockey,
                     Oscar=df_oscar, Turtles=df_turtles)

cat("Distribusi data_status per dataset:\n")

## Distribusi data_status per dataset:

for (nm in names(datasets_mp2)) {
  df_tmp <- datasets_mp2[[nm]]
  cat(sprintf("\n%s\n", strrep("-", 45)))
  cat(sprintf("Dataset  : %s\n", nm))
  cat(sprintf("Baris    : %d | Kolom: %d\n", nrow(df_tmp), ncol(df_tmp)))
  cat(sprintf("Nama kolom: %s\n", paste(names(df_tmp), collapse=", ")))
  cat("Missing values:\n"); print(colSums(is.na(df_tmp)))
  cat(sprintf("Duplicate: %d\n", sum(duplicated(df_tmp))))
  if ("data_status" %in% names(df_tmp)) print(table(df_tmp$data_status))
}

## 
## ---------------------------------------------
## Dataset  : Countries
## Baris    : 250 | Kolom: 6
## Nama kolom: country_name, capital, population, area, data_status, is_large
## Missing values:
## country_name      capital   population         area  data_status     is_large 
##            0            0            0            0            0            0 
## Duplicate: 0
## 
## Complete 
##      250 
## 
## ---------------------------------------------
## Dataset  : Hockey
## Baris    : 582 | Kolom: 10
## Nama kolom: team_name, year, wins, losses, win_pct, goals_for, goals_against, data_status, performance, win_rate
## Missing values:
##     team_name          year          wins        losses       win_pct 
##             0             0             0             0             0 
##     goals_for goals_against   data_status   performance      win_rate 
##             0             0             0             0             0 
## Duplicate: 0
## 
## Complete 
##      582 
## 
## ---------------------------------------------
## Dataset  : Oscar
## Baris    : 87 | Kolom: 6
## Nama kolom: movie_title, year, nominations, awards, data_status, category
## Missing values:
## movie_title        year nominations      awards data_status    category 
##           0           0           0           0           0           0 
## Duplicate: 0
## 
## Complete 
##       87 
## 
## ---------------------------------------------
## Dataset  : Turtles
## Baris    : 0 | Kolom: 5
## Nama kolom: name, description, additional_info, source_iframe, data_status
## Missing values:
##            name     description additional_info   source_iframe     data_status 
##               0               0               0               0               0 
## Duplicate: 0
## < table of extent 0 >

rekap <- data.frame(
  Website     = c("Countries","Hockey Teams","Oscar Films","Turtles"),
  Tipe        = c("Static HTML","Pagination","AJAX/JSON","iFrames"),
  Total_Baris = c(nrow(df_countries), nrow(df_hockey), nrow(df_oscar), nrow(df_turtles)),
  Complete    = c(sum(df_countries$data_status=="Complete"),
                  sum(df_hockey$data_status=="Complete"),
                  sum(df_oscar$data_status=="Complete"),
                  sum(df_turtles$data_status=="Complete")),
  Kesulitan   = c("Mudah","Sedang","Sulit","Cukup Sulit")
)
kable(rekap, caption = "Rekap Semua Dataset Web Scraping") |>
  kable_styling(bootstrap_options = c("striped","hover")) |>
  column_spec(1, bold=TRUE, color="#2d6a4f") |>
  column_spec(3, bold=TRUE, color="#40916c")

Rekap Semua Dataset Web Scraping
Website	Tipe	Total_Baris	Complete	Kesulitan
Countries	Static HTML	250	250	Mudah
Hockey Teams	Pagination	582	582	Sedang
Oscar Films	AJAX/JSON	87	87	Sulit
Turtles	iFrames	0	0	Cukup Sulit

plot_ly() |>
  add_trace(
    x = rekap$Website, y = rekap$Total_Baris,
    type = "bar", name = "Total Baris",
    marker = list(color = "#40916c", line = list(color = "white", width = 1.2)),
    hovertemplate = "<b>%{x}</b><br>Total: %{y:,}<extra></extra>"
  ) |>
  add_trace(
    x = rekap$Website, y = rekap$Complete,
    type = "bar", name = "Data Complete",
    marker = list(color = "#b7e4c7", line = list(color = "white", width = 1.2)),
    hovertemplate = "<b>%{x}</b><br>Complete: %{y:,}<extra></extra>"
  ) |>
  layout(
    barmode = "group",
    title = list(text = "<b>Perbandingan Total Baris vs Data Complete per Website</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "Website"),
    yaxis = list(title = "Jumlah Baris", gridcolor = "#f0faf3"),
    legend = list(orientation = "h", x = 0.3, y = -0.15),
    plot_bgcolor = "white", paper_bgcolor = "white",
    height = 400
  )

2.8 Perbedaan Pendekatan

cat("
1. Website paling mudah di-scrape:
   -> Countries of the World (Static HTML)
   Alasan: HTML langsung tersedia saat halaman dimuat.
           Cukup read_html() + html_nodes() tanpa pagination/AJAX.

2. Website paling sulit di-scrape:
   -> Oscar Winning Films (AJAX / Javascript)
   Alasan: Data dimuat secara dinamis oleh JS setelah halaman terbuka.
           Perlu inspect Network tab untuk menemukan endpoint XHR,
           kemudian parse response JSON dengan fromJSON().

3. Perbedaan Pendekatan:

   STATIC     : HTML langsung tersedia saat GET request
                -> read_html() + html_nodes() (rvest)
                -> Contoh: Countries of the World

   PAGINATION : Data tersebar di banyak halaman
                -> Loop + parameter ?page_num=N di URL
                -> Hentikan jika tidak ada baris lagi
                -> Contoh: Hockey Teams

   AJAX       : Data dimuat async via XHR setelah halaman render
                -> Inspect Network -> temukan endpoint JSON
                -> GET() ke endpoint + fromJSON() (httr + jsonlite)
                -> Contoh: Oscar Winning Films

   IFRAME     : Konten ada di frame terpisah, bukan di HTML utama
                -> html_attr('src') -> buat URL absolut -> request baru
                -> Contoh: Turtles All the Way Down
")

## 
## 1. Website paling mudah di-scrape:
##    -> Countries of the World (Static HTML)
##    Alasan: HTML langsung tersedia saat halaman dimuat.
##            Cukup read_html() + html_nodes() tanpa pagination/AJAX.
## 
## 2. Website paling sulit di-scrape:
##    -> Oscar Winning Films (AJAX / Javascript)
##    Alasan: Data dimuat secara dinamis oleh JS setelah halaman terbuka.
##            Perlu inspect Network tab untuk menemukan endpoint XHR,
##            kemudian parse response JSON dengan fromJSON().
## 
## 3. Perbedaan Pendekatan:
## 
##    STATIC     : HTML langsung tersedia saat GET request
##                 -> read_html() + html_nodes() (rvest)
##                 -> Contoh: Countries of the World
## 
##    PAGINATION : Data tersebar di banyak halaman
##                 -> Loop + parameter ?page_num=N di URL
##                 -> Hentikan jika tidak ada baris lagi
##                 -> Contoh: Hockey Teams
## 
##    AJAX       : Data dimuat async via XHR setelah halaman render
##                 -> Inspect Network -> temukan endpoint JSON
##                 -> GET() ke endpoint + fromJSON() (httr + jsonlite)
##                 -> Contoh: Oscar Winning Films
## 
##    IFRAME     : Konten ada di frame terpisah, bukan di HTML utama
##                 -> html_attr('src') -> buat URL absolut -> request baru
##                 -> Contoh: Turtles All the Way Down

2.9 Insights & Rekomendasi

cat("=== ANALYTICAL THINKING ===\n\n")

## === ANALYTICAL THINKING ===

cat("1. Website paling mudah di-scrape:\n")

## 1. Website paling mudah di-scrape:

cat("   -> Countries of the World (Static HTML)\n")

##    -> Countries of the World (Static HTML)

cat("   Alasan: Struktur HTML sederhana, tidak ada pagination/AJAX/iframe.\n\n")

##    Alasan: Struktur HTML sederhana, tidak ada pagination/AJAX/iframe.

cat("2. Website paling sulit di-scrape:\n")

## 2. Website paling sulit di-scrape:

cat("   -> Oscar Winning Films (AJAX / Javascript)\n")

##    -> Oscar Winning Films (AJAX / Javascript)

cat("   Alasan: Data dimuat dinamis oleh JavaScript, perlu inspect network\n")

##    Alasan: Data dimuat dinamis oleh JavaScript, perlu inspect network

cat("           untuk menemukan endpoint AJAX yang benar.\n\n")

##            untuk menemukan endpoint AJAX yang benar.

cat("3. Perbedaan pendekatan:\n\n")

## 3. Perbedaan pendekatan:

cat("   STATIC     : HTML sudah tersedia saat halaman dimuat.\n")

##    STATIC     : HTML sudah tersedia saat halaman dimuat.

cat("                Cukup gunakan rvest + html_nodes().\n\n")

##                 Cukup gunakan rvest + html_nodes().

cat("   PAGINATION : Data tersebar di banyak halaman.\n")

##    PAGINATION : Data tersebar di banyak halaman.

cat("                Gunakan loop + parameter URL (?page_num=N).\n")

##                 Gunakan loop + parameter URL (?page_num=N).

cat("                Hentikan loop jika tidak ada data lagi.\n\n")

##                 Hentikan loop jika tidak ada data lagi.

cat("   AJAX       : Data dimuat setelah halaman terbuka via JavaScript.\n")

##    AJAX       : Data dimuat setelah halaman terbuka via JavaScript.

cat("                Inspect Network tab -> temukan XHR/Fetch request\n")

##                 Inspect Network tab -> temukan XHR/Fetch request

cat("                -> akses langsung endpoint JSON-nya.\n")

##                 -> akses langsung endpoint JSON-nya.

cat("                Response berupa JSON, parse dengan fromJSON().\n\n")

##                 Response berupa JSON, parse dengan fromJSON().

cat("   IFRAME     : Konten berada di dalam frame terpisah dari HTML utama.\n")

##    IFRAME     : Konten berada di dalam frame terpisah dari HTML utama.

cat("                html_attr('src') -> buat URL lengkap\n")

##                 html_attr('src') -> buat URL lengkap

cat("                -> buat request baru ke URL iframe tersebut.\n\n")

##                 -> buat request baru ke URL iframe tersebut.

cat("--- INSIGHTS ---\n")

## --- INSIGHTS ---

cat(sprintf("1. Countries: %d negara berhasil diambil dari halaman statis.\n",
            nrow(df_countries)))

## 1. Countries: 250 negara berhasil diambil dari halaman statis.

if (nrow(df_hockey) > 0) {
  top_team <- df_hockey %>%
    group_by(team_name) %>%
    summarise(total = sum(wins, na.rm=TRUE), .groups="drop") %>%
    arrange(desc(total)) %>% slice(1)
  cat(sprintf("2. Hockey: Tim dengan total kemenangan terbanyak: %s (%d wins)\n",
              top_team$team_name, top_team$total))

  best_wr <- df_hockey[which.max(df_hockey$win_rate), ]
  cat(sprintf("3. Hockey: Tim dengan win rate tertinggi: %s (%d) - %.2f%%\n",
              best_wr$team_name, best_wr$year, best_wr$win_rate))
}

## 2. Hockey: Tim dengan total kemenangan terbanyak: Detroit Red Wings (986 wins)
## 3. Hockey: Tim dengan win rate tertinggi: Detroit Red Wings (1995) - 82.67%

if (nrow(df_oscar) > 0) {
  best_film <- df_oscar[which.max(df_oscar$awards), ]
  cat(sprintf("4. Oscar: Film dengan award terbanyak: %s (%d) - %d awards\n",
              best_film$movie_title, best_film$year, best_film$awards))

  yr_count <- df_oscar %>% count(year, name="n") %>% arrange(year)
  cat("5. Oscar: Jumlah film per tahun:\n")
  print(setNames(yr_count$n, yr_count$year))

  cat(sprintf("6. Oscar: Total film Complete: %d\n",
              sum(df_oscar$data_status=="Complete")))
}

## 4. Oscar: Film dengan award terbanyak: Gravity (2013) - 7 awards
## 5. Oscar: Jumlah film per tahun:
## 2010 2011 2012 2013 2014 2015 
##   13   15   15   12   16   16 
## 6. Oscar: Total film Complete: 87

if (nrow(df_turtles) > 0) {
  cat(sprintf("7. Turtles: Total family kura-kura: %d family\n", nrow(df_turtles)))
  pct <- round(sum(df_turtles$data_status=="Complete") / nrow(df_turtles) * 100, 1)
  cat(sprintf("8. Turtles: Persentase data Complete: %.1f%%\n", pct))
}

cat("\n--- REKOMENDASI ---\n")

## 
## --- REKOMENDASI ---

cat("1. Gunakan RSelenium / Playwright untuk website AJAX yang\n")

## 1. Gunakan RSelenium / Playwright untuk website AJAX yang

cat("   lebih kompleks dan tidak memiliki endpoint API terbuka.\n")

##    lebih kompleks dan tidak memiliki endpoint API terbuka.

cat("2. Tambahkan tryCatch() dan retry mechanism untuk scraping yang\n")

## 2. Tambahkan tryCatch() dan retry mechanism untuk scraping yang

cat("   lebih robust terhadap timeout dan error jaringan.\n")

##    lebih robust terhadap timeout dan error jaringan.

cat("3. Simpan log error ke file terpisah agar mudah di-debug.\n")

## 3. Simpan log error ke file terpisah agar mudah di-debug.

pct_complete <- data.frame(
  Website = c("Countries","Hockey Teams","Oscar Films","Turtles"),
  Pct_Complete = c(
    round(sum(df_countries$data_status=="Complete") / max(nrow(df_countries),1) * 100, 1),
    round(sum(df_hockey$data_status=="Complete")    / max(nrow(df_hockey),1)    * 100, 1),
    round(sum(df_oscar$data_status=="Complete")     / max(nrow(df_oscar),1)     * 100, 1),
    round(sum(df_turtles$data_status=="Complete")   / max(nrow(df_turtles),1)   * 100, 1)
  ),
  Tipe = c("Static HTML","Pagination","AJAX/JSON","iFrames")
)

plot_ly(pct_complete,
        y = ~reorder(Website, Pct_Complete), x = ~Pct_Complete,
        type = "bar", orientation = "h",
        marker = list(
          color = colorRampPalette(c("#b7e4c7","#1b4332"))(4),
          line  = list(color = "white", width = 1.5)
        ),
        text  = ~paste0(Pct_Complete, "%"),
        textposition = "outside",
        hovertemplate = "<b>%{y}</b> (%{customdata})<br>Complete: %{x}%<extra></extra>",
        customdata = ~Tipe) |>
  layout(
    title = list(text = "<b>Persentase Data Complete per Website Scraping</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "% Data Complete", range = c(0, 115), gridcolor = "#f0faf3"),
    yaxis = list(title = ""),
    plot_bgcolor = "white", paper_bgcolor = "white",
    margin = list(r = 60),
    height = 400
  )

Kesimpulan Web Scraping:
Website termudah: Countries (static HTML, satu request).
Website tersulit: Oscar (AJAX, perlu intercept XHR endpoint).
Total data: 250 negara · 582 hockey · 87 film · 0 turtles

3 Ringkasan Akhir Keseluruhan

Ringkasan Hasil Keseluruhan UTS
No	Keterangan	Nilai
1	Mini Project 1 - Dataset E-Commerce (file dibaca)	5 file (CSV, Excel, JSON, TXT, XML)
2	Mini Project 1 - Baris setelah deduplikasi	1.966
3	Mini Project 1 - Kolom + 3 kolom baru	25
4	Mini Project 1 - Platform dominan	Shopee - 20.9%
5	Mini Project 1 - Transaksi high-value	755
6	Mini Project 2 - Website di-scrape	4 website (Countries, Hockey, Oscar, Turtles)
7	Mini Project 2 - Total negara	250 negara
8	Mini Project 2 - Total records hockey	582 records
9	Mini Project 2 - Total film Oscar	87 film
10	Mini Project 2 - Total data turtles	0 baris

UTS SD-1306 Pemrograman Sains Data I

Mini Project 1: E-Commerce & Mini Project 2: Web Scraping