UTS SD-1306 Pemrograman Sains Data I

Mini Project 1: E-Commerce & Mini Project 2: Web Scraping

SD-1306 · Pemrograman Sains Data I · UTS

UTS Pemrograman Sains Data I
Mini Project 1: E-Commerce Analytics & Mini Project 2: Web Scraping
Sains Data Kelas A April 2026
DS
Mr. Bakti Siregar, M.Sc., CSD
Dosen Pengampu
KN
Khafizatun Nisa
NIM: 52250018
NA
Nakeisha Aulia Z
NIM: 52250023
JR
Jihan Ramadhani D.
NIM: 52250024

Mini Project 1 Case Study E-Commerce

Analisis dataset transaksi e-commerce dari 5 format file: CSV, Excel, JSON, TXT, XML

1 MINI PROJECT 1

1.1 Library yang Digunakan (Mini Project 1)

pkgs1 <- c("readr","readxl","jsonlite","xml2","dplyr","tidyr",
           "stringr","ggplot2","scales","knitr","kableExtra",
           "plotly","RColorBrewer","rmdformats")
for (p in pkgs1) {
  if (!require(p, character.only = TRUE, quietly = TRUE))
    install.packages(p, repos = "https://cran.r-project.org")
}

library(readr); library(readxl); library(jsonlite); library(xml2)
library(dplyr);  library(tidyr);  library(stringr);  library(ggplot2)
library(scales); library(knitr); library(kableExtra); library(plotly)
library(RColorBrewer)

cat("Semua library Mini Project 1 berhasil dimuat!")
## Semua library Mini Project 1 berhasil dimuat!

Library siap. Kombinasi readr/readxl/jsonlite/xml2 untuk baca multi-format, dplyr/stringr untuk cleaning, dan plotly untuk visualisasi interaktif.


1.2 Section A Data Collection (E-Commerce)

1.2.1 CSV

df_csv <- read_csv("ecommerce.csv", show_col_types = FALSE)
cat("File: ecommerce.csv\n")
## File: ecommerce.csv
cat("Jumlah baris :", nrow(df_csv), "\n")
## Jumlah baris : 2000
cat("Jumlah kolom :", ncol(df_csv), "\n")
## Jumlah kolom : 22
cat("Nama kolom   :", paste(names(df_csv), collapse = ", "), "\n")
## Nama kolom   : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag

1.2.2 Excel

df_excel <- read_excel("ecommerce.xlsx")
cat("File: ecommerce.xlsx\n")
## File: ecommerce.xlsx
cat("Jumlah baris :", nrow(df_excel), "\n")
## Jumlah baris : 2000
cat("Jumlah kolom :", ncol(df_excel), "\n")
## Jumlah kolom : 22
cat("Nama kolom   :", paste(names(df_excel), collapse = ", "), "\n")
## Nama kolom   : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag

1.2.3 JSON

df_json <- fromJSON("ecommerce.json") |> as.data.frame()
cat("File: ecommerce.json\n")
## File: ecommerce.json
cat("Jumlah baris :", nrow(df_json), "\n")
## Jumlah baris : 2000
cat("Jumlah kolom :", ncol(df_json), "\n")
## Jumlah kolom : 22
cat("Nama kolom   :", paste(names(df_json), collapse = ", "), "\n")
## Nama kolom   : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag

1.2.4 TXT

df_txt <- read_delim("ecommerce.txt", delim = "|", show_col_types = FALSE)
cat("File: ecommerce.txt\n")
## File: ecommerce.txt
cat("Jumlah baris :", nrow(df_txt), "\n")
## Jumlah baris : 2000
cat("Jumlah kolom :", ncol(df_txt), "\n")
## Jumlah kolom : 22
cat("Nama kolom   :", paste(names(df_txt), collapse = ", "), "\n")
## Nama kolom   : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag

1.2.5 XML

xml_doc <- read_xml("ecommerce.xml")
records <- xml_find_all(xml_doc, ".//Record")

xml_ke_df <- function(rec) {
  anak  <- xml_children(rec)
  setNames(as.list(xml_text(anak)), xml_name(anak))
}

df_xml <- bind_rows(lapply(records, xml_ke_df))
cat("File: ecommerce.xml\n")
## File: ecommerce.xml
cat("Jumlah baris :", nrow(df_xml), "\n")
## Jumlah baris : 2000
cat("Jumlah kolom :", ncol(df_xml), "\n")
## Jumlah kolom : 22
cat("Nama kolom   :", paste(names(df_xml), collapse = ", "), "\n")
## Nama kolom   : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag

1.2.6 Looping + Cek Kolom + Gabung

semua_df <- list(
  CSV   = df_csv   |> mutate(across(everything(), as.character)),
  Excel = df_excel |> mutate(across(everything(), as.character)),
  JSON  = df_json  |> mutate(across(everything(), as.character)),
  TXT   = df_txt   |> mutate(across(everything(), as.character)),
  XML   = df_xml   |> mutate(across(everything(), as.character))
)

kolom_referensi <- names(semua_df[["CSV"]])
siap_merge      <- list()

cat("=== Pengecekan Struktur Kolom ===\n\n")
## === Pengecekan Struktur Kolom ===
for (nama in names(semua_df)) {
  df_cek <- semua_df[[nama]]
  if (setequal(names(df_cek), kolom_referensi)) {
    cat(nama, ": Struktur kolom sesuai -> Ready to merge\n")
    siap_merge[[nama]] <- df_cek[, kolom_referensi]
  } else {
    cat(nama, ": Perlu penyesuaian -> Need adjustment (disesuaikan otomatis)\n")
    kolom_hilang <- setdiff(kolom_referensi, names(df_cek))
    for (kol in kolom_hilang) df_cek[[kol]] <- NA
    siap_merge[[nama]] <- df_cek[, kolom_referensi]
  }
}
## CSV : Struktur kolom sesuai -> Ready to merge
## Excel : Struktur kolom sesuai -> Ready to merge
## JSON : Struktur kolom sesuai -> Ready to merge
## TXT : Struktur kolom sesuai -> Ready to merge
## XML : Struktur kolom sesuai -> Ready to merge
df_gabungan <- bind_rows(siap_merge)
cat(sprintf("\nDataset utama berhasil digabungkan: %d baris, %d kolom\n",
            nrow(df_gabungan), ncol(df_gabungan)))
## 
## Dataset utama berhasil digabungkan: 10000 baris, 22 kolom

Interpretasi Section A: Lima file dari format berbeda berhasil dibaca dan digabungkan menggunakan looping + conditional IF/ELSE. Pendekatan ini scalable – jika ada file ke-6, cukup ditambahkan ke list tanpa mengubah logika inti.


1.3 Section B Data Handling (E-Commerce)

1.3.1 Dimensi & Tipe Data

cat("Jumlah Total Baris :", nrow(df_gabungan), "\n")
## Jumlah Total Baris : 10000
cat("Jumlah Total Kolom :", ncol(df_gabungan), "\n\n")
## Jumlah Total Kolom : 22
tipe <- data.frame(
  No    = 1:ncol(df_gabungan),
  Kolom = names(df_gabungan),
  Tipe  = sapply(df_gabungan, class),
  row.names = NULL
)

kable(tipe, caption = "Tipe Data Setiap Kolom") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |>
  column_spec(2, bold = TRUE, color = "#2d6a4f") |>
  column_spec(3, background = "#f0faf3")
Tipe Data Setiap Kolom
No Kolom Tipe
1 order_id character
2 order_date character
3 ship_date character
4 platform character
5 category character
6 product_name character
7 unit_price character
8 quantity character
9 gross_sales character
10 campaign character
11 voucher_code character
12 discount_pct character
13 discount_value character
14 shipping_cost character
15 net_sales character
16 payment_method character
17 customer_segment character
18 region character
19 stock_status character
20 order_status character
21 customer_rating character
22 priority_flag character

1.3.2 Missing Values

mv <- df_gabungan |>
  summarise(across(everything(),
    ~ sum(is.na(.) | . == "" | . == "NULL"))) |>
  pivot_longer(everything(), names_to = "Kolom", values_to = "Jumlah_NA") |>
  filter(Jumlah_NA > 0) |>
  arrange(desc(Jumlah_NA)) |>
  mutate(Persen = round(Jumlah_NA / nrow(df_gabungan) * 100, 2))

kable(mv, col.names = c("Kolom","Jumlah Missing","Persentase (%)"),
      caption = "Kolom dengan Missing Values") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  column_spec(3, bold = TRUE,
              color = ifelse(mv$Persen > 20, "#e76f51", "#2d6a4f"))
Kolom dengan Missing Values
Kolom Jumlah Missing Persentase (%)
customer_rating 2030 20.30
ship_date 1000 10.00
priority_flag 940 9.40
discount_pct 345 3.45
voucher_code 245 2.45
payment_method 175 1.75

1.3.3 Duplikasi

dup <- sum(duplicated(df_gabungan))
pct <- round(dup / nrow(df_gabungan) * 100, 2)
cat("Jumlah baris duplikat:", dup, "\n")
## Jumlah baris duplikat: 5581
cat("Persentase duplikat  :", pct, "%\n")
## Persentase duplikat  : 55.81 %

1.3.4 Masalah Kualitas Data

masalah <- data.frame(
  No      = 1:5,
  Masalah = c(
    "Format tanggal tidak konsisten",
    "Nilai harga berformat string (Rp xxx.xxx)",
    "Inkonsistensi penulisan platform",
    "Missing values di kolom penting",
    "Duplikasi data akibat penggabungan 5 file"
  ),
  Contoh = c(
    "2024-01-01, 01/01/2024, 01-01-2024",
    "'Rp 1.500.000' alih-alih 1500000",
    "'shopee', 'SHOPEE', ' shopee '",
    "ship_date, customer_rating, payment_method",
    paste0(dup, " baris duplikat dari ", nrow(df_gabungan), " baris")
  )
)

kable(masalah, col.names = c("No","Masalah","Contoh"),
      caption = "Minimal 3 Masalah Kualitas Data yang Ditemukan") |>
  kable_styling(bootstrap_options = c("striped","hover")) |>
  column_spec(2, bold = TRUE, color = "#e76f51")
Minimal 3 Masalah Kualitas Data yang Ditemukan
No Masalah Contoh
1 Format tanggal tidak konsisten 2024-01-01, 01/01/2024, 01-01-2024
2 Nilai harga berformat string (Rp xxx.xxx) ‘Rp 1.500.000’ alih-alih 1500000
3 Inkonsistensi penulisan platform ‘shopee’, ‘SHOPEE’, ’ shopee ’
4 Missing values di kolom penting ship_date, customer_rating, payment_method
5 Duplikasi data akibat penggabungan 5 file 5581 baris duplikat dari 10000 baris

Interpretasi Section B: Ditemukan 5 kategori masalah. Masalah paling kritis adalah format harga sebagai string (menghambat agregasi) dan inkonsistensi platform (menyebabkan grouping yang salah). Section C akan menyelesaikan semua masalah ini.


1.4 Section C Data Cleaning (E-Commerce)

1.4.1 Hapus Duplikat

df <- df_gabungan |> distinct(order_id, .keep_all = TRUE)
cat("Hapus duplikat berdasarkan order_id\n")
## Hapus duplikat berdasarkan order_id
cat("Baris sebelum:", nrow(df_gabungan), "\n")
## Baris sebelum: 10000
cat("Baris sesudah:", nrow(df), "\n")
## Baris sesudah: 1966
cat("Baris dihapus:", nrow(df_gabungan) - nrow(df), "\n")
## Baris dihapus: 8034

1.4.2 Looping: Bersihkan Harga

bersihkan_harga <- function(x) {
  x <- as.character(x)
  x <- str_replace_all(x, "Rp\\s*", "")
  x <- str_replace_all(x, "\\.", "")
  x <- str_replace_all(x, ",", ".")
  x <- str_trim(x)
  hasil <- suppressWarnings(as.numeric(x))
  ifelse(is.na(hasil) | hasil < 0, 0, hasil)
}

kolom_harga <- c("unit_price","gross_sales","net_sales","discount_value","shipping_cost")

cat("=== Membersihkan Kolom Harga dengan Looping ===\n\n")
## === Membersihkan Kolom Harga dengan Looping ===
for (kol in kolom_harga) {
  sebelum <- sum(is.na(df[[kol]]) | df[[kol]] == "")
  df[[kol]] <- bersihkan_harga(df[[kol]])
  cat(sprintf("  Kolom '%-18s' dibersihkan (NA sebelum: %d)\n", kol, sebelum))
}
##   Kolom 'unit_price        ' dibersihkan (NA sebelum: 0)
##   Kolom 'gross_sales       ' dibersihkan (NA sebelum: 0)
##   Kolom 'net_sales         ' dibersihkan (NA sebelum: 0)
##   Kolom 'discount_value    ' dibersihkan (NA sebelum: 0)
##   Kolom 'shipping_cost     ' dibersihkan (NA sebelum: 0)
df$discount_pct <- suppressWarnings(as.numeric(df$discount_pct))
df$discount_pct[is.na(df$discount_pct)] <- 0
df$quantity <- suppressWarnings(as.numeric(df$quantity))
cat("\nSemua kolom harga berhasil dikonversi ke numerik!\n")
## 
## Semua kolom harga berhasil dikonversi ke numerik!

1.4.3 Standardisasi Platform (WAJIB IF)

standardisasi_platform <- function(p) {
  p <- str_trim(str_to_lower(as.character(p)))
  if      (p %in% c("shopee","shopee "))              return("Shopee")
  else if (p %in% c("tokopedia","tokped"))            return("Tokopedia")
  else if (p %in% c("lazada"))                        return("Lazada")
  else if (p %in% c("blibli"))                        return("Blibli")
  else if (p %in% c("tiktok shop","tiktok_shop"))     return("TikTok Shop")
  else                                                 return(str_to_title(p))
}

df$platform <- sapply(df$platform, standardisasi_platform)
cat("Platform distandarisasi\n")
## Platform distandarisasi
print(table(df$platform))
## 
##      Blibli      Lazada      Shopee TikTok Shop   Tokopedia 
##         400         381         411         397         377

1.4.4 Missing Value (WAJIB IF)

df$payment_method <- ifelse(
  is.na(df$payment_method) | str_trim(df$payment_method) %in% c("","nan","NaN","NAN"),
  "Unknown",
  str_trim(df$payment_method)
)
cat("payment_method: NA -> 'Unknown'\n")
## payment_method: NA -> 'Unknown'
df$customer_rating <- suppressWarnings(as.numeric(df$customer_rating))
median_rating      <- median(df$customer_rating, na.rm = TRUE)
df$customer_rating <- ifelse(is.na(df$customer_rating), median_rating, df$customer_rating)
cat(sprintf("customer_rating: NA -> median = %.1f\n", median_rating))
## customer_rating: NA -> median = 5.0
cat("   (Logika: median lebih robust terhadap outlier)\n")
##    (Logika: median lebih robust terhadap outlier)

1.4.5 Standardisasi Status & Kategori

standardisasi_status <- function(s) {
  s <- str_trim(str_to_lower(as.character(s)))
  if      (s %in% c("delivered","completed","complete","completion")) return("Completed")
  else if (s %in% c("cancelled","cancel","batal","canceled"))         return("Cancelled")
  else if (s %in% c("returned","return","retur"))                     return("Returned")
  else if (s %in% c("shipped","shipping"))                            return("Shipped")
  else                                                                 return(str_to_title(s))
}
df$order_status <- sapply(df$order_status, standardisasi_status)
cat("order_status distandarisasi\n"); print(table(df$order_status))
## order_status distandarisasi
## 
##   Cancelled   Completed On Delivery    Returned     Shipped 
##         151        1550          47         117         101
cat("\n=== Standardisasi Tambahan (Looping) ===\n")
## 
## === Standardisasi Tambahan (Looping) ===
kolom_teks <- c("category","customer_segment","region","stock_status")
for (kol in kolom_teks) {
  df[[kol]] <- str_to_title(str_trim(as.character(df[[kol]])))
  cat(sprintf("  '%-20s' distandarisasi\n", kol))
}
##   'category            ' distandarisasi
##   'customer_segment    ' distandarisasi
##   'region              ' distandarisasi
##   'stock_status        ' distandarisasi
payment_map <- c(
  "e-wallet"="E-Wallet","ewallet"="E-Wallet","e wallet"="E-Wallet",
  "virtual account"="Virtual Account","va"="Virtual Account",
  "cod"="COD","cash on delivery"="COD",
  "transfer bank"="Transfer Bank","bank transfer"="Transfer Bank",
  "credit card"="Credit Card","nan"="Unknown","none"="Unknown"
)
df$payment_method <- sapply(
  str_to_lower(str_trim(df$payment_method)),
  function(x) {
    if (is.na(x) || x == "") return("Unknown")
    if (x %in% names(payment_map)) return(payment_map[x])
    return(str_to_title(x))
  }
)
cat("  'payment_method'     distandarisasi\n")
##   'payment_method'     distandarisasi
cat(sprintf("\nData bersih: %d baris, %d kolom\n", nrow(df), ncol(df)))
## 
## Data bersih: 1966 baris, 22 kolom

Interpretasi Section C: Data dibersihkan melalui 7 proses sistematis menggunakan fungsi kustom, looping, dan conditional logic. Dataset kini bebas duplikat, format harga sudah numerik, dan semua kolom kategoris sudah konsisten.


1.5 Section D Conditional Logic (E-Commerce)

1.5.1 is_high_value

df$net_sales <- suppressWarnings(as.numeric(df$net_sales))
df$net_sales[is.na(df$net_sales)] <- 0

df$is_high_value <- ifelse(df$net_sales > 1000000, "Yes", "No")

cat("Kolom 'is_high_value' ditambahkan\n")
## Kolom 'is_high_value' ditambahkan
tbl_hv <- table(df$is_high_value)
print(tbl_hv)
## 
##   No  Yes 
## 1211  755
cat(sprintf("\nPersentase high-value: %.1f%%\n",
            tbl_hv["Yes"] / sum(tbl_hv) * 100))
## 
## Persentase high-value: 38.4%
df_hv <- as.data.frame(tbl_hv)
colnames(df_hv) <- c("is_high_value", "count")
df_hv$pct <- round(df_hv$count / sum(df_hv$count) * 100, 1)

plot_ly(df_hv, labels = ~is_high_value, values = ~count, type = "pie",
        marker = list(colors = c("#40916c", "#b7e4c7"),
                      line = list(color = "white", width = 2)),
        textinfo = "label+percent",
        hovertemplate = "<b>%{label}</b><br>Jumlah: %{value:,}<br>Persentase: %{percent}<extra></extra>") |>
  layout(
    title = list(text = "<b>Distribusi Transaksi High-Value vs Non-High-Value</b>",
                 font = list(size = 15, color = "#1b4332")),
    legend = list(orientation = "h", x = 0.3, y = -0.1),
    paper_bgcolor = "white",
    height = 400,
    margin = list(t = 50, b = 50, l = 30, r = 30)
  )

1.5.2 order_priority (WAJIB nested IF)

tentukan_prioritas <- function(net) {
  if      (is.na(net))    return("Low")
  else if (net > 1000000) return("High")
  else if (net >= 500000) return("Medium")
  else                    return("Low")
}

df$order_priority <- sapply(df$net_sales, tentukan_prioritas)
cat("Kolom 'order_priority' ditambahkan\n")
## Kolom 'order_priority' ditambahkan
print(table(df$order_priority))
## 
##   High    Low Medium 
##    755    799    412
df_prio <- as.data.frame(table(df$order_priority))
colnames(df_prio) <- c("priority", "count")
df_prio$priority <- factor(df_prio$priority, levels = c("High","Medium","Low"))
df_prio <- df_prio[order(df_prio$priority), ]
df_prio$pct <- round(df_prio$count / sum(df_prio$count) * 100, 1)

plot_ly(df_prio, x = ~priority, y = ~count, type = "bar",
        marker = list(color = c("#1b4332","#40916c","#b7e4c7"),
                      line = list(color = "white", width = 1.5)),
        text = ~paste0(count, "<br>(", pct, "%)"),
        textposition = "outside",
        hovertemplate = "<b>%{x}</b><br>Jumlah: %{y:,}<extra></extra>") |>
  layout(
    title = list(text = "<b>Distribusi Order Priority (Nested IF)</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "Prioritas"),
    yaxis = list(title = "Jumlah Transaksi", gridcolor = "#f0faf3"),
    plot_bgcolor = "white", paper_bgcolor = "white",
    height = 400
  )

1.5.3 valid_transaction

df$valid_transaction <- ifelse(
  df$order_status == "Cancelled", "Invalid", "Valid"
)
cat("Kolom 'valid_transaction' ditambahkan\n")
## Kolom 'valid_transaction' ditambahkan
tbl_vt <- table(df$valid_transaction)
print(tbl_vt)
## 
## Invalid   Valid 
##     151    1815
cat(sprintf("\nPersentase Valid  : %.1f%%\n", tbl_vt["Valid"]  / sum(tbl_vt) * 100))
## 
## Persentase Valid  : 92.3%
cat(sprintf("Persentase Invalid: %.1f%%\n", tbl_vt["Invalid"] / sum(tbl_vt) * 100))
## Persentase Invalid: 7.7%
df_vt <- as.data.frame(tbl_vt)
colnames(df_vt) <- c("status", "count")

plot_ly(df_vt, labels = ~status, values = ~count, type = "pie", hole = 0.45,
        marker = list(colors = c("#e76f51","#40916c"),
                      line = list(color = "white", width = 2)),
        textinfo = "label+percent",
        hovertemplate = "<b>%{label}</b><br>Jumlah: %{value:,}<extra></extra>") |>
  layout(
    title = list(text = "<b>Distribusi Valid vs Invalid Transaction</b>",
                 font = list(size = 14, color = "#1b4332")),
    annotations = list(list(text = "Status", showarrow = FALSE,
                            font = list(size = 13, color = "#2d6a4f"))),
    legend = list(orientation = "h", x = 0.3, y = -0.1),
    paper_bgcolor = "white",
    height = 400
  )

1.5.4 Preview Kolom Baru

df |>
  select(order_id, platform, net_sales, order_status,
         is_high_value, order_priority, valid_transaction) |>
  head(10) |>
  kable(caption = "Preview Dataset dengan 3 Kolom Baru (10 baris pertama)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12)
Preview Dataset dengan 3 Kolom Baru (10 baris pertama)
order_id platform net_sales order_status is_high_value order_priority valid_transaction
ORD00612 Tokopedia 680058 Completed No Medium Valid
ORD00112 TikTok Shop 1476873 Completed Yes High Valid
ORD01186 Tokopedia 369715 Completed No Low Valid
ORD01511 Tokopedia 1382570 Completed Yes High Valid
ORD00772 Tokopedia 376696 Completed No Low Valid
ORD00880 Blibli 0 Cancelled No Low Invalid
ORD00592 Blibli 0 Cancelled No Low Invalid
ORD01367 Blibli 2085678 Completed Yes High Valid
ORD01178 Blibli 1208527 Completed Yes High Valid
ORD00276 Lazada 1789349 Completed Yes High Valid

Interpretasi Section D: Tiga kolom baru berhasil dibuat menggunakan conditional logic: is_high_value (segmentasi nilai), order_priority (tier 3 level dengan nested IF), dan valid_transaction (filter data analitik).


1.6 Section E Analytical Thinking (E-Commerce)

1.6.1 Platform Paling Dominan

platform_count <- df |>
  count(platform, name = "jumlah") |>
  arrange(desc(jumlah)) |>
  mutate(persen = round(jumlah / sum(jumlah) * 100, 1))

kable(platform_count,
      col.names = c("Platform","Jumlah Transaksi","Persentase (%)"),
      caption = "Distribusi Transaksi per Platform") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(1, bold = TRUE, background = "#f0faf3", color = "#1b4332")
Distribusi Transaksi per Platform
Platform Jumlah Transaksi Persentase (%)
Shopee 411 20.9
Blibli 400 20.3
TikTok Shop 397 20.2
Lazada 381 19.4
Tokopedia 377 19.2
p_plat <- plot_ly(
  data = platform_count |> arrange(jumlah),
  y = ~reorder(platform, jumlah), x = ~jumlah,
  type = "bar", orientation = "h",
  marker = list(
    color = colorRampPalette(c("#b7e4c7","#40916c","#1b4332"))(nrow(platform_count)),
    line  = list(color = "white", width = 1.5)
  ),
  text = ~paste0(format(jumlah, big.mark="."), " (", persen, "%)"),
  textposition = "outside",
  hovertemplate = "<b>%{y}</b><br>Transaksi: %{x:,}<extra></extra>"
) |>
  layout(
    title = list(text = "<b>Jumlah Transaksi per Platform</b>",
                 font = list(size = 15, color = "#1b4332")),
    xaxis = list(title = "Jumlah Transaksi", gridcolor = "#f0faf3"),
    yaxis = list(title = ""),
    plot_bgcolor = "white", paper_bgcolor = "white",
    margin = list(r = 120),
    hoverlabel = list(bgcolor = "#2d6a4f", font = list(color = "white")),
    height = 400
  )
p_plat
cat(sprintf("\nPLATFORM PALING DOMINAN: %s (%d transaksi)\n",
            platform_count$platform[1], platform_count$jumlah[1]))
## 
## PLATFORM PALING DOMINAN: Shopee (411 transaksi)

1.6.2 Kategori Paling Sering

kategori_count <- df |>
  count(category, name = "jumlah") |>
  arrange(desc(jumlah)) |>
  head(8) |>
  mutate(persen = round(jumlah / nrow(df) * 100, 1))

kable(kategori_count,
      col.names = c("Kategori","Jumlah","Persentase (%)"),
      caption = "Top 8 Kategori Produk") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(1, bold = TRUE, background = "#f0faf3")
Top 8 Kategori Produk
Kategori Jumlah Persentase (%)
Fashion 412 21.0
Sports 408 20.8
Beauty 387 19.7
Home Living 371 18.9
Electronics 357 18.2
Home_living 31 1.6
cat(sprintf("\nKATEGORI PALING SERING: %s (%d transaksi)\n",
            kategori_count$category[1], kategori_count$jumlah[1]))
## 
## KATEGORI PALING SERING: Fashion (412 transaksi)
plot_ly(
  data = kategori_count |> arrange(jumlah),
  y = ~reorder(category, jumlah), x = ~jumlah,
  type = "bar", orientation = "h",
  marker = list(
    color = colorRampPalette(c("#b7e4c7","#40916c","#1b4332"))(nrow(kategori_count)),
    line  = list(color = "white", width = 1.2)
  ),
  text = ~paste0(jumlah, " (", persen, "%)"),
  textposition = "outside",
  hovertemplate = "<b>%{y}</b><br>Jumlah: %{x:,}<extra></extra>"
) |>
  layout(
    title = list(text = "<b>Top 8 Kategori Produk Terlaris</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "Jumlah Transaksi", gridcolor = "#f0faf3"),
    yaxis = list(title = ""),
    plot_bgcolor = "white", paper_bgcolor = "white",
    margin = list(r = 110),
    height = 400
  )

1.6.3 Status Transaksi

status_count <- df |>
  count(order_status, name = "jumlah") |>
  arrange(desc(jumlah)) |>
  mutate(persen = round(jumlah / sum(jumlah) * 100, 1))

kable(status_count,
      col.names = c("Status","Jumlah","Persentase (%)"),
      caption = "Distribusi Status Transaksi") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(1, bold = TRUE, background = "#f0faf3")
Distribusi Status Transaksi
Status Jumlah Persentase (%)
Completed 1550 78.8
Cancelled 151 7.7
Returned 117 6.0
Shipped 101 5.1
On Delivery 47 2.4
cat(sprintf("\nSTATUS PALING BANYAK: %s (%d transaksi)\n",
            status_count$order_status[1], status_count$jumlah[1]))
## 
## STATUS PALING BANYAK: Completed (1550 transaksi)
plot_ly(status_count, labels = ~order_status, values = ~jumlah,
        type = "pie", hole = 0.4,
        marker = list(
          colors = colorRampPalette(c("#1b4332","#40916c","#74c69d","#b7e4c7","#d8f3dc"))(nrow(status_count)),
          line   = list(color = "white", width = 2)
        ),
        textinfo = "label+percent",
        hovertemplate = "<b>%{label}</b><br>Jumlah: %{value:,}<br>%{percent}<extra></extra>") |>
  layout(
    title = list(text = "<b>Distribusi Status Transaksi</b>",
                 font = list(size = 14, color = "#1b4332")),
    legend = list(orientation = "h", x = 0.1, y = -0.15),
    paper_bgcolor = "white",
    height = 400
  )
df |> write.csv("ecommerce_cleaned_R.csv", row.names = FALSE)
cat("\nDisimpan -> ecommerce_cleaned_R.csv\n")
## 
## Disimpan -> ecommerce_cleaned_R.csv
cat(sprintf("   Total baris: %d | Total kolom: %d\n", nrow(df), ncol(df)))
##    Total baris: 1966 | Total kolom: 25

Ringkasan Section E:
1. Platform paling dominan: Shopee (20.9%)
2. Kategori paling sering: Fashion (412 transaksi)
3. Status transaksi terbanyak: Completed (78.8%)


Mini Project 2 Web Scraping & Data Programming Process

Pengambilan data dari 4 website berbeda: Static HTML, Pagination, AJAX, dan iFrames menggunakan R

2 MINI PROJECT 2

2.1 Library yang Digunakan (Mini Project 2)

pkgs2 <- c("rvest","httr","dplyr","jsonlite","stringr","knitr","kableExtra","plotly")
for (p in pkgs2) {
  if (!require(p, character.only = TRUE, quietly = TRUE))
    install.packages(p, repos = "https://cran.r-project.org")
}
library(rvest); library(httr); library(dplyr)
library(jsonlite); library(stringr)
cat("Semua library Mini Project 2 berhasil dimuat!")
## Semua library Mini Project 2 berhasil dimuat!

2.2 Setup Helper Functions

BASE_URL <- "https://www.scrapethissite.com"

get_page_r <- function(url, query_params = NULL) {
  tryCatch({
    resp <- GET(url,
      add_headers("User-Agent" = "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0.0.0"),
      query = query_params, timeout(15))
    if (status_code(resp) == 200)
      read_html(content(resp, "text", encoding = "UTF-8"))
    else { cat(sprintf("  [ERROR] Status %d\n", status_code(resp))); NULL }
  }, error = function(e) { cat(sprintf("  [ERROR] %s\n", e$message)); NULL })
}

cat("Helper functions siap! BASE_URL:", BASE_URL, "\n")
## Helper functions siap! BASE_URL: https://www.scrapethissite.com

2.3 Website 1 Countries of the World (Static HTML)

2.3.1 Section A Scraping

countries_list <- list()
page <- get_page_r(paste0(BASE_URL, "/pages/simple/"))

if (!is.null(page)) {
  nodes <- page %>% html_nodes("div.country")
  cat(sprintf("Ditemukan %d negara\n\n", length(nodes)))

  for (i in seq_along(nodes)) {
    nd   <- nodes[[i]]
    name <- nd %>% html_node("h3.country-name")         %>% html_text(trim=TRUE)
    cap  <- nd %>% html_node("span.country-capital")    %>% html_text(trim=TRUE)
    pop  <- nd %>% html_node("span.country-population") %>% html_text(trim=TRUE)
    area <- nd %>% html_node("span.country-area")       %>% html_text(trim=TRUE)

    name <- ifelse(is.na(name)||nchar(name)==0, "Unknown", name)
    cap  <- ifelse(is.na(cap) ||nchar(trimws(cap))==0, "Unknown", cap)
    pop  <- ifelse(is.na(pop),  "0", pop)
    area <- ifelse(is.na(area), "0", area)

    data_status <- if(name!="Unknown" && cap!="Unknown") "Complete" else "Incomplete"

    countries_list[[i]] <- list(country_name=name, capital=cap,
                                population=pop, area=area, data_status=data_status)
  }
}
## Ditemukan 250 negara
df_countries_raw <- bind_rows(countries_list)
cat(sprintf("Total data: %d baris\n", nrow(df_countries_raw)))
## Total data: 250 baris
kable(head(df_countries_raw, 5), caption = "Preview Countries (Raw)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12)
Preview Countries (Raw)
country_name capital population area data_status
Andorra Andorra la Vella 84000 468.0 Complete
United Arab Emirates Abu Dhabi 4975593 82880.0 Complete
Afghanistan Kabul 29121286 647500.0 Complete
Antigua and Barbuda St. John’s 86754 443.0 Complete
Anguilla The Valley 13254 102.0 Complete

2.3.2 Section B Data Handling

cat(sprintf("Baris: %d | Kolom: %d | Duplikat: %d\n\n",
            nrow(df_countries_raw), ncol(df_countries_raw),
            sum(duplicated(df_countries_raw))))
## Baris: 250 | Kolom: 5 | Duplikat: 0
cat("Tipe data:\n"); print(sapply(df_countries_raw, class))
## Tipe data:
## country_name      capital   population         area  data_status 
##  "character"  "character"  "character"  "character"  "character"
cat("\nMissing values:\n"); print(colSums(is.na(df_countries_raw)))
## 
## Missing values:
## country_name      capital   population         area  data_status 
##            0            0            0            0            0
cat("\nDATA ISSUES:\n")
## 
## DATA ISSUES:
cat("  (1) population & area bertipe character, bukan numerik\n")
##   (1) population & area bertipe character, bukan numerik
cat("  (2) Beberapa capital memiliki spasi ekstra\n")
##   (2) Beberapa capital memiliki spasi ekstra

2.3.3 Section C Cleaning

df_countries <- df_countries_raw
for (col in c("country_name","capital")) {
  df_countries[[col]] <- str_to_title(str_trim(df_countries[[col]]))
  cat(sprintf("  '%-15s' trim & proper case\n", col))
}
##   'country_name   ' trim & proper case
##   'capital        ' trim & proper case
df_countries$population <- suppressWarnings(as.numeric(str_replace_all(df_countries$population,",","")))
df_countries$area       <- suppressWarnings(as.numeric(str_replace_all(df_countries$area,",","")))
df_countries$population[is.na(df_countries$population)] <- 0
df_countries$area[is.na(df_countries$area)]             <- 0
df_countries <- df_countries %>% distinct(country_name, .keep_all=TRUE)
cat(sprintf("\nBaris final: %d\n", nrow(df_countries)))
## 
## Baris final: 250
write.csv(df_countries, "countries_of_the_world_R.csv", row.names=FALSE)
cat("Disimpan -> countries_of_the_world_R.csv\n")
## Disimpan -> countries_of_the_world_R.csv
print(sapply(df_countries, class))
## country_name      capital   population         area  data_status 
##  "character"  "character"    "numeric"    "numeric"  "character"
kable(head(df_countries, 5)) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)
country_name capital population area data_status
Andorra Andorra La Vella 84000 468 Complete
United Arab Emirates Abu Dhabi 4975593 82880 Complete
Afghanistan Kabul 29121286 647500 Complete
Antigua And Barbuda St. John’s 86754 443 Complete
Anguilla The Valley 13254 102 Complete

2.3.4 Section D Conditional Logic

df_countries <- df_countries %>%
  mutate(
    is_large = case_when(
      population > 10000000 ~ "Large",
      population > 1000000  ~ "Medium",
      TRUE                  ~ "Small"
    ),
    data_status = case_when(
      is.na(country_name) | country_name == "Unknown" ~ "Incomplete",
      capital == "Unknown"                            ~ "Incomplete",
      TRUE                                            ~ "Complete"
    )
  )

cat("Kolom 'is_large' ditambahkan\n")
## Kolom 'is_large' ditambahkan
print(table(df_countries$is_large))
## 
##  Large Medium  Small 
##     82     76     92
cat("\nKolom 'data_status' diperbarui\n")
## 
## Kolom 'data_status' diperbarui
print(table(df_countries$data_status))
## 
## Complete 
##      250
write.csv(df_countries, "countries_of_the_world_R.csv", row.names = FALSE)
cat("\nDisimpan -> countries_of_the_world_R.csv\n")
## 
## Disimpan -> countries_of_the_world_R.csv
cat("\n=== INSIGHT COUNTRIES ===\n")
## 
## === INSIGHT COUNTRIES ===
biggest <- df_countries[which.max(df_countries$population), ]
cat(sprintf("1. Negara dengan populasi terbesar: %s (%.0f jiwa)\n",
            biggest$country_name, biggest$population))
## 1. Negara dengan populasi terbesar: China (1330044000 jiwa)
cat("2. Distribusi ukuran negara:\n")
## 2. Distribusi ukuran negara:
print(table(df_countries$is_large))
## 
##  Large Medium  Small 
##     82     76     92
total_c <- nrow(df_countries)
cat(sprintf("3. Persentase: Large=%.1f%%, Medium=%.1f%%, Small=%.1f%%\n",
  sum(df_countries$is_large=="Large")  / total_c * 100,
  sum(df_countries$is_large=="Medium") / total_c * 100,
  sum(df_countries$is_large=="Small")  / total_c * 100))
## 3. Persentase: Large=32.8%, Medium=30.4%, Small=36.8%
top15_countries <- df_countries |> arrange(desc(population)) |> head(15)

plot_ly(
  data = top15_countries,
  y = ~reorder(country_name, population), x = ~population,
  type = "bar", orientation = "h",
  marker = list(
    color = colorRampPalette(c("#b7e4c7","#40916c","#1b4332"))(15),
    line  = list(color = "white", width = 1)
  ),
  text  = ~paste0(round(population/1e6, 1), " jt"),
  textposition = "outside",
  hovertemplate = "<b>%{y}</b><br>Populasi: %{x:,.0f}<extra></extra>"
) |>
  layout(
    title = list(text = "<b>Top 15 Negara dengan Populasi Terbesar</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "Populasi", gridcolor = "#f0faf3"),
    yaxis = list(title = ""),
    plot_bgcolor = "white", paper_bgcolor = "white",
    margin = list(r = 80),
    height = 420
  )
df_size <- as.data.frame(table(df_countries$is_large))
colnames(df_size) <- c("size", "count")
plot_ly(df_size, labels = ~size, values = ~count, type = "pie",
        marker = list(colors = c("#1b4332","#74c69d","#d8f3dc"),
                      line   = list(color = "white", width = 2)),
        textinfo = "label+percent",
        hovertemplate = "<b>%{label}</b><br>Jumlah: %{value}<extra></extra>") |>
  layout(
    title = list(text = "<b>Distribusi Ukuran Negara (Berdasarkan Populasi)</b>",
                 font = list(size = 14, color = "#1b4332")),
    legend = list(orientation = "h", x = 0.2, y = -0.1),
    paper_bgcolor = "white",
    height = 400
  )

Interpretasi Countries: 250 negara berhasil di-scrape. Kolom is_large mengkategorikan negara berdasarkan populasi: Large (>10 juta), Medium (1-10 juta), Small (<1 juta). Static HTML adalah tipe paling sederhana – satu request langsung menghasilkan semua data tanpa JavaScript rendering.


2.4 Website 2 Hockey Teams (Pagination)

2.4.1 Section A Scraping

hockey_list <- list(); counter <- 1

cat("Scraping dengan pagination...\n\n")
## Scraping dengan pagination...
for (pg in 1:24) {
  page <- get_page_r(paste0(BASE_URL, "/pages/forms/"),
                     query_params = list(page_num=pg, per_page=25))
  if (!is.null(page)) {
    rows <- page %>% html_nodes("tr.team")
    if (length(rows) == 0) { cat(sprintf("  Pagination berhenti di halaman %d\n", pg)); break }
    cat(sprintf("  Halaman %d -> %d tim\n", pg, length(rows)))

    for (row in rows) {
      cols <- row %>% html_nodes("td") %>% html_text(trim=TRUE)
      if (length(cols) >= 8) {
        ds <- if(nchar(cols[1])>0 && nchar(cols[2])>0) "Complete" else "Incomplete"
        hockey_list[[counter]] <- list(
          team_name=cols[1], year=cols[2], wins=cols[3], losses=cols[4],
          win_pct=cols[6], goals_for=cols[7], goals_against=cols[8], data_status=ds)
        counter <- counter + 1
      }
    }
  }
  Sys.sleep(0.3)
}
##   Halaman 1 -> 25 tim
##   Halaman 2 -> 25 tim
##   Halaman 3 -> 25 tim
##   Halaman 4 -> 25 tim
##   Halaman 5 -> 25 tim
##   Halaman 6 -> 25 tim
##   Halaman 7 -> 25 tim
##   Halaman 8 -> 25 tim
##   Halaman 9 -> 25 tim
##   Halaman 10 -> 25 tim
##   Halaman 11 -> 25 tim
##   Halaman 12 -> 25 tim
##   Halaman 13 -> 25 tim
##   Halaman 14 -> 25 tim
##   Halaman 15 -> 25 tim
##   Halaman 16 -> 25 tim
##   Halaman 17 -> 25 tim
##   Halaman 18 -> 25 tim
##   Halaman 19 -> 25 tim
##   Halaman 20 -> 25 tim
##   Halaman 21 -> 25 tim
##   Halaman 22 -> 25 tim
##   Halaman 23 -> 25 tim
##   Halaman 24 -> 7 tim
df_hockey_raw <- bind_rows(hockey_list)
cat(sprintf("\nTotal data: %d baris\n", nrow(df_hockey_raw)))
## 
## Total data: 582 baris
kable(head(df_hockey_raw, 5), caption = "Preview Hockey Teams (Raw)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)
Preview Hockey Teams (Raw)
team_name year wins losses win_pct goals_for goals_against data_status
Boston Bruins 1990 44 24 0.55 299 264 Complete
Buffalo Sabres 1990 31 30 0.388 292 278 Complete
Calgary Flames 1990 46 26 0.575 344 263 Complete
Chicago Blackhawks 1990 49 23 0.613 284 211 Complete
Detroit Red Wings 1990 34 38 0.425 273 298 Complete

2.4.2 Section B Data Handling

cat(sprintf("Baris: %d | Kolom: %d | Duplikat: %d\n\n",
            nrow(df_hockey_raw), ncol(df_hockey_raw),
            sum(duplicated(df_hockey_raw))))
## Baris: 582 | Kolom: 8 | Duplikat: 0
cat("Tipe data:\n"); print(sapply(df_hockey_raw, class))
## Tipe data:
##     team_name          year          wins        losses       win_pct 
##   "character"   "character"   "character"   "character"   "character" 
##     goals_for goals_against   data_status 
##   "character"   "character"   "character"
cat("\nDATA ISSUES:\n")
## 
## DATA ISSUES:
cat("  (1) Kolom wins/losses/goals bertipe character, bukan numerik\n")
##   (1) Kolom wins/losses/goals bertipe character, bukan numerik
cat("  (2) Kolom year bertipe string, perlu konversi ke integer\n")
##   (2) Kolom year bertipe string, perlu konversi ke integer

2.4.3 Section C Cleaning

df_hockey <- df_hockey_raw
for (col in c("wins","losses","goals_for","goals_against")) {
  df_hockey[[col]] <- suppressWarnings(as.numeric(df_hockey[[col]]))
  if(any(is.na(df_hockey[[col]]))) df_hockey[[col]][is.na(df_hockey[[col]])] <- 0
  cat(sprintf("  '%-15s' -> numeric\n", col))
}
##   'wins           ' -> numeric
##   'losses         ' -> numeric
##   'goals_for      ' -> numeric
##   'goals_against  ' -> numeric
df_hockey$year     <- suppressWarnings(as.numeric(df_hockey$year))
df_hockey$win_pct  <- suppressWarnings(as.numeric(df_hockey$win_pct))
df_hockey$team_name <- str_to_title(str_trim(df_hockey$team_name))
df_hockey <- df_hockey %>% distinct(team_name, year, .keep_all=TRUE)
cat(sprintf("\nBaris final: %d\n", nrow(df_hockey)))
## 
## Baris final: 582
write.csv(df_hockey, "hockey_teams_R.csv", row.names=FALSE)
cat("Disimpan -> hockey_teams_R.csv\n")
## Disimpan -> hockey_teams_R.csv
print(sapply(df_hockey, class))
##     team_name          year          wins        losses       win_pct 
##   "character"     "numeric"     "numeric"     "numeric"     "numeric" 
##     goals_for goals_against   data_status 
##     "numeric"     "numeric"   "character"
kable(head(df_hockey, 5)) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)
team_name year wins losses win_pct goals_for goals_against data_status
Boston Bruins 1990 44 24 0.550 299 264 Complete
Buffalo Sabres 1990 31 30 0.388 292 278 Complete
Calgary Flames 1990 46 26 0.575 344 263 Complete
Chicago Blackhawks 1990 49 23 0.613 284 211 Complete
Detroit Red Wings 1990 34 38 0.425 273 298 Complete

2.4.4 Section D Conditional Logic

df_hockey <- df_hockey %>%
  mutate(
    performance = case_when(
      wins > 40 ~ "High",
      wins > 20 ~ "Medium",
      TRUE      ~ "Low"
    ),
    data_status = case_when(
      is.na(team_name) | team_name == "" ~ "Incomplete",
      is.na(wins)                        ~ "Incomplete",
      TRUE                               ~ "Complete"
    )
  )

df_hockey <- df_hockey %>%
  mutate(win_rate = round(wins / (wins + losses) * 100, 2))
df_hockey$win_rate[is.nan(df_hockey$win_rate)] <- 0

cat("Kolom 'performance' ditambahkan\n")
## Kolom 'performance' ditambahkan
print(table(df_hockey$performance))
## 
##   High    Low Medium 
##    226     27    329
cat("\nKolom 'win_rate' ditambahkan\n")
## 
## Kolom 'win_rate' ditambahkan
cat("\nKolom 'data_status' diperbarui\n")
## 
## Kolom 'data_status' diperbarui
print(table(df_hockey$data_status))
## 
## Complete 
##      582
write.csv(df_hockey, "hockey_teams_R.csv", row.names = FALSE)
cat("\nDisimpan -> hockey_teams_R.csv\n")
## 
## Disimpan -> hockey_teams_R.csv
kable(
  df_hockey %>% select(team_name, year, wins, losses, win_rate) %>% head(8),
  caption = "Preview Hockey Teams dengan Win Rate"
) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12)
Preview Hockey Teams dengan Win Rate
team_name year wins losses win_rate
Boston Bruins 1990 44 24 64.71
Buffalo Sabres 1990 31 30 50.82
Calgary Flames 1990 46 26 63.89
Chicago Blackhawks 1990 49 23 68.06
Detroit Red Wings 1990 34 38 47.22
Edmonton Oilers 1990 37 37 50.00
Hartford Whalers 1990 31 38 44.93
Los Angeles Kings 1990 46 24 65.71
cat("\n=== INSIGHT HOCKEY TEAMS ===\n")
## 
## === INSIGHT HOCKEY TEAMS ===
best_team <- df_hockey[which.max(df_hockey$win_rate), ]
cat(sprintf("1. Tim dengan win rate tertinggi: %s (%d) - %.2f%%\n",
            best_team$team_name, best_team$year, best_team$win_rate))
## 1. Tim dengan win rate tertinggi: Detroit Red Wings (1995) - 82.67%
avg_per_year <- df_hockey %>%
  group_by(year) %>%
  summarise(avg_wr = round(mean(win_rate, na.rm=TRUE), 2), .groups="drop") %>%
  arrange(desc(avg_wr)) %>% slice(1)
cat(sprintf("2. Tahun dengan rata-rata win rate tertinggi: %d (%.2f%%)\n",
            avg_per_year$year, avg_per_year$avg_wr))
## 2. Tahun dengan rata-rata win rate tertinggi: 2009 (56.89%)
cat("3. Distribusi performance:\n")
## 3. Distribusi performance:
print(table(df_hockey$performance))
## 
##   High    Low Medium 
##    226     27    329
df_perf <- as.data.frame(table(df_hockey$performance))
colnames(df_perf) <- c("performance", "count")
df_perf$performance <- factor(df_perf$performance, levels = c("High","Medium","Low"))

plot_ly(df_perf, x = ~performance, y = ~count, type = "bar",
        marker = list(color = c("#1b4332","#40916c","#b7e4c7"),
                      line = list(color = "white", width = 1.5)),
        text = ~count, textposition = "outside",
        hovertemplate = "<b>%{x}</b><br>Jumlah Tim: %{y:,}<extra></extra>") |>
  layout(
    title = list(text = "<b>Distribusi Performance Tim Hockey</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "Performance"),
    yaxis = list(title = "Jumlah Tim", gridcolor = "#f0faf3"),
    plot_bgcolor = "white", paper_bgcolor = "white",
    height = 400
  )
avg_wr_year <- df_hockey %>%
  group_by(year) %>%
  summarise(avg_win_rate = round(mean(win_rate, na.rm=TRUE), 2), .groups="drop") %>%
  arrange(year)

plot_ly(avg_wr_year, x = ~year, y = ~avg_win_rate, type = "scatter", mode = "lines+markers",
        line    = list(color = "#40916c", width = 2.5),
        marker  = list(color = "#1b4332", size = 7),
        text    = ~paste0(avg_win_rate, "%"),
        textposition = "top center",
        hovertemplate = "<b>Tahun %{x}</b><br>Avg Win Rate: %{y:.2f}%<extra></extra>") |>
  layout(
    title = list(text = "<b>Rata-rata Win Rate Tim Hockey per Tahun</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "Tahun", gridcolor = "#f0faf3"),
    yaxis = list(title = "Rata-rata Win Rate (%)", gridcolor = "#f0faf3"),
    plot_bgcolor = "white", paper_bgcolor = "white",
    height = 400
  )

Interpretasi Hockey: 582 records berhasil dikumpulkan. Kolom performance mengkategorikan tim: High (>40 wins), Medium (>20 wins), Low (<=20 wins). Kolom win_rate menunjukkan persentase kemenangan setiap tim per musim.


2.5 Website 3 Oscar Winning Films (AJAX)

2.5.1 Section A Scraping

oscar_list <- list(); counter <- 1
cat("Scraping data Oscar per tahun...\n\n")
## Scraping data Oscar per tahun...
for (year in 2010:2015) {
  url_oscar <- sprintf("%s/pages/ajax-javascript/?ajax=true&year=%d", BASE_URL, year)
  resp      <- GET(url_oscar, add_headers("User-Agent"="Mozilla/5.0"), timeout(15))

  if (status_code(resp) == 200) {
    movies <- tryCatch(
      fromJSON(content(resp, "text", encoding="UTF-8"), flatten=TRUE),
      error = function(e) NULL
    )
    if (!is.null(movies) && is.data.frame(movies) && nrow(movies) > 0) {
      cat(sprintf("  Tahun %d -> %d film\n", year, nrow(movies)))
      for (i in seq_len(nrow(movies))) {
        title <- ifelse(!is.null(movies$title[i]) && nchar(str_trim(movies$title[i]))>0,
                        str_trim(movies$title[i]), "Unknown")
        ds    <- if(title!="Unknown") "Complete" else "Incomplete"
        oscar_list[[counter]] <- list(
          movie_title = title, year = year,
          nominations = as.character(ifelse(is.null(movies$nominations[i]), 0, movies$nominations[i])),
          awards      = as.character(ifelse(is.null(movies$awards[i]),      0, movies$awards[i])),
          data_status = ds)
        counter <- counter + 1
      }
    }
  } else {
    cat(sprintf("  Gagal tahun %d: status %d\n", year, status_code(resp)))
  }
  Sys.sleep(0.3)
}
##   Tahun 2010 -> 13 film
##   Tahun 2011 -> 15 film
##   Tahun 2012 -> 15 film
##   Tahun 2013 -> 12 film
##   Tahun 2014 -> 16 film
##   Tahun 2015 -> 16 film
df_oscar_raw <- bind_rows(oscar_list)
cat(sprintf("\nTotal film terkumpul: %d baris\n", nrow(df_oscar_raw)))
## 
## Total film terkumpul: 87 baris
kable(head(df_oscar_raw, 5), caption = "Preview Oscar Films (Raw)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)
Preview Oscar Films (Raw)
movie_title year nominations awards data_status
The King’s Speech 2010 12 4 Complete
Inception 2010 8 4 Complete
The Social Network 2010 8 3 Complete
The Fighter 2010 7 2 Complete
Toy Story 3 2010 5 2 Complete

2.5.2 Section B Data Handling

cat(sprintf("Baris: %d | Kolom: %d | Duplikat: %d\n\n",
            nrow(df_oscar_raw), ncol(df_oscar_raw),
            sum(duplicated(df_oscar_raw))))
## Baris: 87 | Kolom: 5 | Duplikat: 0
cat("Tipe data:\n"); print(sapply(df_oscar_raw, class))
## Tipe data:
## movie_title        year nominations      awards data_status 
## "character"   "integer" "character" "character" "character"
cat("\nDATA ISSUES:\n")
## 
## DATA ISSUES:
cat("  (1) nominations & awards bertipe character, perlu konversi numerik\n")
##   (1) nominations & awards bertipe character, perlu konversi numerik
cat("  (2) Beberapa film nominations = 0 (perlu diverifikasi)\n")
##   (2) Beberapa film nominations = 0 (perlu diverifikasi)

2.5.3 Section C Cleaning

df_oscar <- df_oscar_raw
if (nrow(df_oscar) > 0) {
  df_oscar$movie_title <- str_to_title(str_trim(df_oscar$movie_title))
  for (col in c("nominations","awards")) {
    df_oscar[[col]] <- suppressWarnings(as.numeric(df_oscar[[col]]))
    df_oscar[[col]][is.na(df_oscar[[col]])] <- 0
    cat(sprintf("  '%-15s' -> numeric\n", col))
  }
  df_oscar$movie_title[is.na(df_oscar$movie_title)|df_oscar$movie_title==""] <- "Unknown"
  df_oscar <- df_oscar %>% distinct()
  cat(sprintf("\nBaris final: %d\n", nrow(df_oscar)))
}
##   'nominations    ' -> numeric
##   'awards         ' -> numeric
## 
## Baris final: 87
write.csv(df_oscar, "oscar_winning_films_R.csv", row.names=FALSE)
cat("Disimpan -> oscar_winning_films_R.csv\n")
## Disimpan -> oscar_winning_films_R.csv
print(sapply(df_oscar, class))
## movie_title        year nominations      awards data_status 
## "character"   "integer"   "numeric"   "numeric" "character"
kable(head(df_oscar, 5)) |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)
movie_title year nominations awards data_status
The King’s Speech 2010 12 4 Complete
Inception 2010 8 4 Complete
The Social Network 2010 8 3 Complete
The Fighter 2010 7 2 Complete
Toy Story 3 2010 5 2 Complete

2.5.4 Section D Conditional Logic

if (nrow(df_oscar) > 0) {
  df_oscar <- df_oscar %>%
    mutate(
      category = case_when(
        awards >= 3 ~ "High Awards",
        awards >= 1 ~ "Mid Awards",
        TRUE        ~ "No Award"
      ),
      data_status = case_when(
        is.na(movie_title) | movie_title == "Unknown" ~ "Incomplete",
        TRUE                                          ~ "Complete"
      )
    )
}

cat("Kolom 'category' ditambahkan\n")
## Kolom 'category' ditambahkan
print(table(df_oscar$category))
## 
## High Awards  Mid Awards 
##          16          71
cat("\nKolom 'data_status' diperbarui\n")
## 
## Kolom 'data_status' diperbarui
print(table(df_oscar$data_status))
## 
## Complete 
##       87
write.csv(df_oscar, "oscar_winning_films_R.csv", row.names = FALSE)
cat("\nDisimpan ulang -> oscar_winning_films_R.csv\n")
## 
## Disimpan ulang -> oscar_winning_films_R.csv
kable(head(df_oscar, 8),
      caption = "Preview Oscar Films dengan Kolom Category & Data Status") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12) |>
  column_spec(5, bold = TRUE,
              color = ifelse(
                head(df_oscar, 8)$category == "High Awards", "#2d6a4f",
                ifelse(head(df_oscar, 8)$category == "Mid Awards", "#40916c", "#95d5b2")
              ))
Preview Oscar Films dengan Kolom Category & Data Status
movie_title year nominations awards data_status category
The King’s Speech 2010 12 4 Complete High Awards
Inception 2010 8 4 Complete High Awards
The Social Network 2010 8 3 Complete High Awards
The Fighter 2010 7 2 Complete Mid Awards
Toy Story 3 2010 5 2 Complete Mid Awards
Alice In Wonderland 2010 3 2 Complete Mid Awards
Black Swan 2010 5 1 Complete Mid Awards
In A Better World 2010 1 1 Complete Mid Awards
if (nrow(df_oscar) > 0) {
  df_oscar_cat <- as.data.frame(table(df_oscar$category))
  colnames(df_oscar_cat) <- c("category", "count")
  plot_ly(df_oscar_cat, labels = ~category, values = ~count, type = "pie",
          marker = list(colors = c("#1b4332","#74c69d","#d8f3dc"),
                        line   = list(color = "white", width = 2)),
          textinfo = "label+percent",
          hovertemplate = "<b>%{label}</b><br>Jumlah: %{value}<extra></extra>") |>
    layout(
      title = list(text = "<b>Distribusi Kategori Film Oscar (2010-2015)</b>",
                   font = list(size = 14, color = "#1b4332")),
      legend = list(orientation = "h", x = 0.1, y = -0.1),
      paper_bgcolor = "white",
      height = 400
    )
}
if (nrow(df_oscar) > 0) {
  films_per_year <- df_oscar %>% count(year, name = "total")
  plot_ly(films_per_year, x = ~year, y = ~total, type = "bar",
          marker = list(
            color = colorRampPalette(c("#b7e4c7","#1b4332"))(nrow(films_per_year)),
            line  = list(color = "white", width = 1.5)
          ),
          text = ~total, textposition = "outside",
          hovertemplate = "<b>Tahun %{x}</b><br>Film: %{y}<extra></extra>") |>
    layout(
      title = list(text = "<b>Jumlah Film Oscar yang Di-scrape per Tahun</b>",
                   font = list(size = 14, color = "#1b4332")),
      xaxis = list(title = "Tahun", tickmode = "linear"),
      yaxis = list(title = "Jumlah Film", gridcolor = "#f0faf3"),
      plot_bgcolor = "white", paper_bgcolor = "white",
      height = 400
    )
}

Interpretasi Oscar: 87 film (2010-2015) berhasil dikumpulkan. Kolom category membagi film menjadi: High Awards (>=3 award), Mid Awards (1-2 award), No Award (0 award). Kunci scraping: endpoint AJAX mengembalikan JSON murni yang di-parse langsung dengan fromJSON().


2.6 Website 4 Turtles (Frames & iFrames)

2.6.1 Section A Scraping

turtles_list <- list(); counter <- 1
page_main    <- get_page_r(paste0(BASE_URL, "/pages/frames/"))

if (!is.null(page_main)) {
  iframes <- html_nodes(page_main, "iframe")
  cat(sprintf("Jumlah iframe ditemukan: %d\n\n", length(iframes)))

  for (iframe in iframes) {
    src <- html_attr(iframe, "src")
    if (is.na(src) || nchar(src)==0) next
    iframe_url <- if(str_starts(src,"/")) paste0(BASE_URL, src) else src
    cat(sprintf("  Mengakses: %s\n", iframe_url))

    iframe_page <- get_page_r(iframe_url)
    if (!is.null(iframe_page)) {
      rows <- html_nodes(iframe_page, "tr")
      for (row in rows) {
        cols <- html_nodes(row, "td")
        if (length(cols)>=2 && nchar(str_trim(html_text(cols[[1]])))>1) {
          name <- str_trim(html_text(cols[[1]]))
          desc <- str_trim(html_text(cols[[2]]))
          info <- if(length(cols)>2) str_trim(html_text(cols[[3]])) else "N/A"
          if (!nchar(desc)) desc <- "Unknown"
          if (!nchar(info)) info <- "N/A"
          ds <- if(name!=""&&desc!="Unknown") "Complete" else "Incomplete"
          turtles_list[[counter]] <- list(
            name=name, description=desc,
            additional_info=info, source_iframe=iframe_url, data_status=ds)
          counter <- counter + 1
        }
      }
    }
  }
}
## Jumlah iframe ditemukan: 1
## 
##   Mengakses: https://www.scrapethissite.com/pages/frames/?frame=i
if (length(turtles_list)==0) {
  df_turtles_raw <- data.frame(name=character(), description=character(),
    additional_info=character(), source_iframe=character(),
    data_status=character(), stringsAsFactors=FALSE)
  cat("[WARN] Tidak ada data turtle.\n")
} else {
  df_turtles_raw <- bind_rows(turtles_list) |>
    mutate(across(where(is.character), ~ str_trunc(str_trim(.x), 80)))
}
## [WARN] Tidak ada data turtle.
cat(sprintf("\nTotal data: %d baris\n", nrow(df_turtles_raw)))
## 
## Total data: 0 baris
if (nrow(df_turtles_raw)>0)
  kable(head(df_turtles_raw,5), caption="Preview Turtles (Raw)") |>
    kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)

2.6.2 Section B Data Handling

cat(sprintf("Baris: %d | Kolom: %d | Duplikat: %d\n\n",
            nrow(df_turtles_raw), ncol(df_turtles_raw),
            sum(duplicated(df_turtles_raw))))
## Baris: 0 | Kolom: 5 | Duplikat: 0
cat("Tipe data:\n"); print(sapply(df_turtles_raw, class))
## Tipe data:
##            name     description additional_info   source_iframe     data_status 
##     "character"     "character"     "character"     "character"     "character"
cat("\nDATA ISSUES:\n")
## 
## DATA ISSUES:
cat("  (1) Kolom additional_info banyak berisi 'N/A'\n")
##   (1) Kolom additional_info banyak berisi 'N/A'
cat("  (2) Beberapa description sangat pendek/tidak informatif\n")
##   (2) Beberapa description sangat pendek/tidak informatif

2.6.3 Section C Cleaning

df_turtles <- df_turtles_raw
for (col in c("name","description","additional_info")) {
  if (col %in% names(df_turtles)) {
    df_turtles[[col]] <- str_trim(df_turtles[[col]])
    df_turtles[[col]][is.na(df_turtles[[col]])] <- "Unknown"
    cat(sprintf("  '%-20s' trim & fill NA\n", col))
  }
}
##   'name                ' trim & fill NA
##   'description         ' trim & fill NA
##   'additional_info     ' trim & fill NA
df_turtles <- df_turtles %>% distinct()
cat(sprintf("\nBaris final: %d\n", nrow(df_turtles)))
## 
## Baris final: 0
write.csv(df_turtles, "turtles_data_R.csv", row.names=FALSE)
cat("Disimpan -> turtles_data_R.csv\n")
## Disimpan -> turtles_data_R.csv
if (nrow(df_turtles)>0)
  kable(head(df_turtles,5)) |>
    kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)

2.6.4 Section D Conditional Logic

df_turtles <- df_turtles %>%
  mutate(
    data_status = case_when(
      is.na(name)        | name        == "Unknown" ~ "Incomplete",
      is.na(description) | description == "Unknown" ~ "Incomplete",
      TRUE                                          ~ "Complete"
    )
  )

cat("Kolom 'data_status' diperbarui\n")
## Kolom 'data_status' diperbarui
print(table(df_turtles$data_status))
## < table of extent 0 >
write.csv(df_turtles, "turtles_data_R.csv", row.names = FALSE)
cat("\nDisimpan -> turtles_data_R.csv\n")
## 
## Disimpan -> turtles_data_R.csv
cat("\n=== INSIGHT TURTLES ===\n")
## 
## === INSIGHT TURTLES ===
cat(sprintf("1. Total family kura-kura yang ditemukan: %d family\n", nrow(df_turtles)))
## 1. Total family kura-kura yang ditemukan: 0 family
cat("2. Status kelengkapan data:\n")
## 2. Status kelengkapan data:
print(table(df_turtles$data_status))
## < table of extent 0 >
complete_pct <- sum(df_turtles$data_status=="Complete") / nrow(df_turtles) * 100
cat(sprintf("3. Persentase data Complete: %.1f%%\n", complete_pct))
## 3. Persentase data Complete: NaN%
if (nrow(df_turtles) > 0) {
  df_turt_status <- as.data.frame(table(df_turtles$data_status))
  colnames(df_turt_status) <- c("status", "count")
  df_turt_status$pct <- round(df_turt_status$count / sum(df_turt_status$count) * 100, 1)

  plot_ly(df_turt_status, labels = ~status, values = ~count, type = "pie", hole = 0.45,
          marker = list(colors = c("#40916c","#e76f51"),
                        line   = list(color = "white", width = 2)),
          textinfo = "label+percent",
          hovertemplate = "<b>%{label}</b><br>Jumlah: %{value}<extra></extra>") |>
    layout(
      title = list(text = "<b>Kelengkapan Data Turtles (iFrame)</b>",
                   font = list(size = 14, color = "#1b4332")),
      annotations = list(list(text = "Status", showarrow = FALSE,
                              font = list(size = 13, color = "#2d6a4f"))),
      legend = list(orientation = "h", x = 0.3, y = -0.1),
      paper_bgcolor = "white",
      height = 400
    )
}

Interpretasi Turtles: 0 baris berhasil diambil dari iframe. Persentase data Complete: 0%. Kunci: detect src iframe -> buat URL absolut -> request baru ke URL tersebut.


2.6.5 Section E Analytical Thinking (Web Scraping)

2.7 Rekap Semua Dataset

datasets_mp2 <- list(Countries=df_countries, Hockey=df_hockey,
                     Oscar=df_oscar, Turtles=df_turtles)

cat("Distribusi data_status per dataset:\n")
## Distribusi data_status per dataset:
for (nm in names(datasets_mp2)) {
  df_tmp <- datasets_mp2[[nm]]
  cat(sprintf("\n%s\n", strrep("-", 45)))
  cat(sprintf("Dataset  : %s\n", nm))
  cat(sprintf("Baris    : %d | Kolom: %d\n", nrow(df_tmp), ncol(df_tmp)))
  cat(sprintf("Nama kolom: %s\n", paste(names(df_tmp), collapse=", ")))
  cat("Missing values:\n"); print(colSums(is.na(df_tmp)))
  cat(sprintf("Duplicate: %d\n", sum(duplicated(df_tmp))))
  if ("data_status" %in% names(df_tmp)) print(table(df_tmp$data_status))
}
## 
## ---------------------------------------------
## Dataset  : Countries
## Baris    : 250 | Kolom: 6
## Nama kolom: country_name, capital, population, area, data_status, is_large
## Missing values:
## country_name      capital   population         area  data_status     is_large 
##            0            0            0            0            0            0 
## Duplicate: 0
## 
## Complete 
##      250 
## 
## ---------------------------------------------
## Dataset  : Hockey
## Baris    : 582 | Kolom: 10
## Nama kolom: team_name, year, wins, losses, win_pct, goals_for, goals_against, data_status, performance, win_rate
## Missing values:
##     team_name          year          wins        losses       win_pct 
##             0             0             0             0             0 
##     goals_for goals_against   data_status   performance      win_rate 
##             0             0             0             0             0 
## Duplicate: 0
## 
## Complete 
##      582 
## 
## ---------------------------------------------
## Dataset  : Oscar
## Baris    : 87 | Kolom: 6
## Nama kolom: movie_title, year, nominations, awards, data_status, category
## Missing values:
## movie_title        year nominations      awards data_status    category 
##           0           0           0           0           0           0 
## Duplicate: 0
## 
## Complete 
##       87 
## 
## ---------------------------------------------
## Dataset  : Turtles
## Baris    : 0 | Kolom: 5
## Nama kolom: name, description, additional_info, source_iframe, data_status
## Missing values:
##            name     description additional_info   source_iframe     data_status 
##               0               0               0               0               0 
## Duplicate: 0
## < table of extent 0 >
rekap <- data.frame(
  Website     = c("Countries","Hockey Teams","Oscar Films","Turtles"),
  Tipe        = c("Static HTML","Pagination","AJAX/JSON","iFrames"),
  Total_Baris = c(nrow(df_countries), nrow(df_hockey), nrow(df_oscar), nrow(df_turtles)),
  Complete    = c(sum(df_countries$data_status=="Complete"),
                  sum(df_hockey$data_status=="Complete"),
                  sum(df_oscar$data_status=="Complete"),
                  sum(df_turtles$data_status=="Complete")),
  Kesulitan   = c("Mudah","Sedang","Sulit","Cukup Sulit")
)
kable(rekap, caption = "Rekap Semua Dataset Web Scraping") |>
  kable_styling(bootstrap_options = c("striped","hover")) |>
  column_spec(1, bold=TRUE, color="#2d6a4f") |>
  column_spec(3, bold=TRUE, color="#40916c")
Rekap Semua Dataset Web Scraping
Website Tipe Total_Baris Complete Kesulitan
Countries Static HTML 250 250 Mudah
Hockey Teams Pagination 582 582 Sedang
Oscar Films AJAX/JSON 87 87 Sulit
Turtles iFrames 0 0 Cukup Sulit
plot_ly() |>
  add_trace(
    x = rekap$Website, y = rekap$Total_Baris,
    type = "bar", name = "Total Baris",
    marker = list(color = "#40916c", line = list(color = "white", width = 1.2)),
    hovertemplate = "<b>%{x}</b><br>Total: %{y:,}<extra></extra>"
  ) |>
  add_trace(
    x = rekap$Website, y = rekap$Complete,
    type = "bar", name = "Data Complete",
    marker = list(color = "#b7e4c7", line = list(color = "white", width = 1.2)),
    hovertemplate = "<b>%{x}</b><br>Complete: %{y:,}<extra></extra>"
  ) |>
  layout(
    barmode = "group",
    title = list(text = "<b>Perbandingan Total Baris vs Data Complete per Website</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "Website"),
    yaxis = list(title = "Jumlah Baris", gridcolor = "#f0faf3"),
    legend = list(orientation = "h", x = 0.3, y = -0.15),
    plot_bgcolor = "white", paper_bgcolor = "white",
    height = 400
  )

2.8 Perbedaan Pendekatan

cat("
1. Website paling mudah di-scrape:
   -> Countries of the World (Static HTML)
   Alasan: HTML langsung tersedia saat halaman dimuat.
           Cukup read_html() + html_nodes() tanpa pagination/AJAX.

2. Website paling sulit di-scrape:
   -> Oscar Winning Films (AJAX / Javascript)
   Alasan: Data dimuat secara dinamis oleh JS setelah halaman terbuka.
           Perlu inspect Network tab untuk menemukan endpoint XHR,
           kemudian parse response JSON dengan fromJSON().

3. Perbedaan Pendekatan:

   STATIC     : HTML langsung tersedia saat GET request
                -> read_html() + html_nodes() (rvest)
                -> Contoh: Countries of the World

   PAGINATION : Data tersebar di banyak halaman
                -> Loop + parameter ?page_num=N di URL
                -> Hentikan jika tidak ada baris lagi
                -> Contoh: Hockey Teams

   AJAX       : Data dimuat async via XHR setelah halaman render
                -> Inspect Network -> temukan endpoint JSON
                -> GET() ke endpoint + fromJSON() (httr + jsonlite)
                -> Contoh: Oscar Winning Films

   IFRAME     : Konten ada di frame terpisah, bukan di HTML utama
                -> html_attr('src') -> buat URL absolut -> request baru
                -> Contoh: Turtles All the Way Down
")
## 
## 1. Website paling mudah di-scrape:
##    -> Countries of the World (Static HTML)
##    Alasan: HTML langsung tersedia saat halaman dimuat.
##            Cukup read_html() + html_nodes() tanpa pagination/AJAX.
## 
## 2. Website paling sulit di-scrape:
##    -> Oscar Winning Films (AJAX / Javascript)
##    Alasan: Data dimuat secara dinamis oleh JS setelah halaman terbuka.
##            Perlu inspect Network tab untuk menemukan endpoint XHR,
##            kemudian parse response JSON dengan fromJSON().
## 
## 3. Perbedaan Pendekatan:
## 
##    STATIC     : HTML langsung tersedia saat GET request
##                 -> read_html() + html_nodes() (rvest)
##                 -> Contoh: Countries of the World
## 
##    PAGINATION : Data tersebar di banyak halaman
##                 -> Loop + parameter ?page_num=N di URL
##                 -> Hentikan jika tidak ada baris lagi
##                 -> Contoh: Hockey Teams
## 
##    AJAX       : Data dimuat async via XHR setelah halaman render
##                 -> Inspect Network -> temukan endpoint JSON
##                 -> GET() ke endpoint + fromJSON() (httr + jsonlite)
##                 -> Contoh: Oscar Winning Films
## 
##    IFRAME     : Konten ada di frame terpisah, bukan di HTML utama
##                 -> html_attr('src') -> buat URL absolut -> request baru
##                 -> Contoh: Turtles All the Way Down

2.9 Insights & Rekomendasi

cat("=== ANALYTICAL THINKING ===\n\n")
## === ANALYTICAL THINKING ===
cat("1. Website paling mudah di-scrape:\n")
## 1. Website paling mudah di-scrape:
cat("   -> Countries of the World (Static HTML)\n")
##    -> Countries of the World (Static HTML)
cat("   Alasan: Struktur HTML sederhana, tidak ada pagination/AJAX/iframe.\n\n")
##    Alasan: Struktur HTML sederhana, tidak ada pagination/AJAX/iframe.
cat("2. Website paling sulit di-scrape:\n")
## 2. Website paling sulit di-scrape:
cat("   -> Oscar Winning Films (AJAX / Javascript)\n")
##    -> Oscar Winning Films (AJAX / Javascript)
cat("   Alasan: Data dimuat dinamis oleh JavaScript, perlu inspect network\n")
##    Alasan: Data dimuat dinamis oleh JavaScript, perlu inspect network
cat("           untuk menemukan endpoint AJAX yang benar.\n\n")
##            untuk menemukan endpoint AJAX yang benar.
cat("3. Perbedaan pendekatan:\n\n")
## 3. Perbedaan pendekatan:
cat("   STATIC     : HTML sudah tersedia saat halaman dimuat.\n")
##    STATIC     : HTML sudah tersedia saat halaman dimuat.
cat("                Cukup gunakan rvest + html_nodes().\n\n")
##                 Cukup gunakan rvest + html_nodes().
cat("   PAGINATION : Data tersebar di banyak halaman.\n")
##    PAGINATION : Data tersebar di banyak halaman.
cat("                Gunakan loop + parameter URL (?page_num=N).\n")
##                 Gunakan loop + parameter URL (?page_num=N).
cat("                Hentikan loop jika tidak ada data lagi.\n\n")
##                 Hentikan loop jika tidak ada data lagi.
cat("   AJAX       : Data dimuat setelah halaman terbuka via JavaScript.\n")
##    AJAX       : Data dimuat setelah halaman terbuka via JavaScript.
cat("                Inspect Network tab -> temukan XHR/Fetch request\n")
##                 Inspect Network tab -> temukan XHR/Fetch request
cat("                -> akses langsung endpoint JSON-nya.\n")
##                 -> akses langsung endpoint JSON-nya.
cat("                Response berupa JSON, parse dengan fromJSON().\n\n")
##                 Response berupa JSON, parse dengan fromJSON().
cat("   IFRAME     : Konten berada di dalam frame terpisah dari HTML utama.\n")
##    IFRAME     : Konten berada di dalam frame terpisah dari HTML utama.
cat("                html_attr('src') -> buat URL lengkap\n")
##                 html_attr('src') -> buat URL lengkap
cat("                -> buat request baru ke URL iframe tersebut.\n\n")
##                 -> buat request baru ke URL iframe tersebut.
cat("--- INSIGHTS ---\n")
## --- INSIGHTS ---
cat(sprintf("1. Countries: %d negara berhasil diambil dari halaman statis.\n",
            nrow(df_countries)))
## 1. Countries: 250 negara berhasil diambil dari halaman statis.
if (nrow(df_hockey) > 0) {
  top_team <- df_hockey %>%
    group_by(team_name) %>%
    summarise(total = sum(wins, na.rm=TRUE), .groups="drop") %>%
    arrange(desc(total)) %>% slice(1)
  cat(sprintf("2. Hockey: Tim dengan total kemenangan terbanyak: %s (%d wins)\n",
              top_team$team_name, top_team$total))

  best_wr <- df_hockey[which.max(df_hockey$win_rate), ]
  cat(sprintf("3. Hockey: Tim dengan win rate tertinggi: %s (%d) - %.2f%%\n",
              best_wr$team_name, best_wr$year, best_wr$win_rate))
}
## 2. Hockey: Tim dengan total kemenangan terbanyak: Detroit Red Wings (986 wins)
## 3. Hockey: Tim dengan win rate tertinggi: Detroit Red Wings (1995) - 82.67%
if (nrow(df_oscar) > 0) {
  best_film <- df_oscar[which.max(df_oscar$awards), ]
  cat(sprintf("4. Oscar: Film dengan award terbanyak: %s (%d) - %d awards\n",
              best_film$movie_title, best_film$year, best_film$awards))

  yr_count <- df_oscar %>% count(year, name="n") %>% arrange(year)
  cat("5. Oscar: Jumlah film per tahun:\n")
  print(setNames(yr_count$n, yr_count$year))

  cat(sprintf("6. Oscar: Total film Complete: %d\n",
              sum(df_oscar$data_status=="Complete")))
}
## 4. Oscar: Film dengan award terbanyak: Gravity (2013) - 7 awards
## 5. Oscar: Jumlah film per tahun:
## 2010 2011 2012 2013 2014 2015 
##   13   15   15   12   16   16 
## 6. Oscar: Total film Complete: 87
if (nrow(df_turtles) > 0) {
  cat(sprintf("7. Turtles: Total family kura-kura: %d family\n", nrow(df_turtles)))
  pct <- round(sum(df_turtles$data_status=="Complete") / nrow(df_turtles) * 100, 1)
  cat(sprintf("8. Turtles: Persentase data Complete: %.1f%%\n", pct))
}

cat("\n--- REKOMENDASI ---\n")
## 
## --- REKOMENDASI ---
cat("1. Gunakan RSelenium / Playwright untuk website AJAX yang\n")
## 1. Gunakan RSelenium / Playwright untuk website AJAX yang
cat("   lebih kompleks dan tidak memiliki endpoint API terbuka.\n")
##    lebih kompleks dan tidak memiliki endpoint API terbuka.
cat("2. Tambahkan tryCatch() dan retry mechanism untuk scraping yang\n")
## 2. Tambahkan tryCatch() dan retry mechanism untuk scraping yang
cat("   lebih robust terhadap timeout dan error jaringan.\n")
##    lebih robust terhadap timeout dan error jaringan.
cat("3. Simpan log error ke file terpisah agar mudah di-debug.\n")
## 3. Simpan log error ke file terpisah agar mudah di-debug.
pct_complete <- data.frame(
  Website = c("Countries","Hockey Teams","Oscar Films","Turtles"),
  Pct_Complete = c(
    round(sum(df_countries$data_status=="Complete") / max(nrow(df_countries),1) * 100, 1),
    round(sum(df_hockey$data_status=="Complete")    / max(nrow(df_hockey),1)    * 100, 1),
    round(sum(df_oscar$data_status=="Complete")     / max(nrow(df_oscar),1)     * 100, 1),
    round(sum(df_turtles$data_status=="Complete")   / max(nrow(df_turtles),1)   * 100, 1)
  ),
  Tipe = c("Static HTML","Pagination","AJAX/JSON","iFrames")
)

plot_ly(pct_complete,
        y = ~reorder(Website, Pct_Complete), x = ~Pct_Complete,
        type = "bar", orientation = "h",
        marker = list(
          color = colorRampPalette(c("#b7e4c7","#1b4332"))(4),
          line  = list(color = "white", width = 1.5)
        ),
        text  = ~paste0(Pct_Complete, "%"),
        textposition = "outside",
        hovertemplate = "<b>%{y}</b> (%{customdata})<br>Complete: %{x}%<extra></extra>",
        customdata = ~Tipe) |>
  layout(
    title = list(text = "<b>Persentase Data Complete per Website Scraping</b>",
                 font = list(size = 14, color = "#1b4332")),
    xaxis = list(title = "% Data Complete", range = c(0, 115), gridcolor = "#f0faf3"),
    yaxis = list(title = ""),
    plot_bgcolor = "white", paper_bgcolor = "white",
    margin = list(r = 60),
    height = 400
  )

Kesimpulan Web Scraping:
Website termudah: Countries (static HTML, satu request).
Website tersulit: Oscar (AJAX, perlu intercept XHR endpoint).
Total data: 250 negara · 582 hockey · 87 film · 0 turtles


3 Ringkasan Akhir Keseluruhan

Ringkasan Hasil Keseluruhan UTS
No Keterangan Nilai
1 Mini Project 1 - Dataset E-Commerce (file dibaca) 5 file (CSV, Excel, JSON, TXT, XML)
2 Mini Project 1 - Baris setelah deduplikasi 1.966
3 Mini Project 1 - Kolom + 3 kolom baru 25
4 Mini Project 1 - Platform dominan Shopee - 20.9%
5 Mini Project 1 - Transaksi high-value 755
6 Mini Project 2 - Website di-scrape 4 website (Countries, Hockey, Oscar, Turtles)
7 Mini Project 2 - Total negara 250 negara
8 Mini Project 2 - Total records hockey 582 records
9 Mini Project 2 - Total film Oscar 87 film
10 Mini Project 2 - Total data turtles 0 baris