UTS SD-1306 Pemrograman Sains Data I
Mini Project 1: E-Commerce & Mini Project 2: Web Scraping
1 MINI PROJECT 1
1.1 Library yang Digunakan (Mini Project 1)
pkgs1 <- c("readr","readxl","jsonlite","xml2","dplyr","tidyr",
"stringr","ggplot2","scales","knitr","kableExtra",
"plotly","RColorBrewer","rmdformats")
for (p in pkgs1) {
if (!require(p, character.only = TRUE, quietly = TRUE))
install.packages(p, repos = "https://cran.r-project.org")
}
library(readr); library(readxl); library(jsonlite); library(xml2)
library(dplyr); library(tidyr); library(stringr); library(ggplot2)
library(scales); library(knitr); library(kableExtra); library(plotly)
library(RColorBrewer)
cat("Semua library Mini Project 1 berhasil dimuat!")## Semua library Mini Project 1 berhasil dimuat!
Library siap. Kombinasi
readr/readxl/jsonlite/xml2 untuk baca multi-format,
dplyr/stringr untuk cleaning, dan plotly untuk
visualisasi interaktif.
1.2 Section A Data Collection (E-Commerce)
1.2.1 CSV
## File: ecommerce.csv
## Jumlah baris : 2000
## Jumlah kolom : 22
## Nama kolom : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag
1.2.2 Excel
## File: ecommerce.xlsx
## Jumlah baris : 2000
## Jumlah kolom : 22
## Nama kolom : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag
1.2.3 JSON
## File: ecommerce.json
## Jumlah baris : 2000
## Jumlah kolom : 22
## Nama kolom : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag
1.2.4 TXT
df_txt <- read_delim("ecommerce.txt", delim = "|", show_col_types = FALSE)
cat("File: ecommerce.txt\n")## File: ecommerce.txt
## Jumlah baris : 2000
## Jumlah kolom : 22
## Nama kolom : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag
1.2.5 XML
xml_doc <- read_xml("ecommerce.xml")
records <- xml_find_all(xml_doc, ".//Record")
xml_ke_df <- function(rec) {
anak <- xml_children(rec)
setNames(as.list(xml_text(anak)), xml_name(anak))
}
df_xml <- bind_rows(lapply(records, xml_ke_df))
cat("File: ecommerce.xml\n")## File: ecommerce.xml
## Jumlah baris : 2000
## Jumlah kolom : 22
## Nama kolom : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag
1.2.6 Looping + Cek Kolom + Gabung
semua_df <- list(
CSV = df_csv |> mutate(across(everything(), as.character)),
Excel = df_excel |> mutate(across(everything(), as.character)),
JSON = df_json |> mutate(across(everything(), as.character)),
TXT = df_txt |> mutate(across(everything(), as.character)),
XML = df_xml |> mutate(across(everything(), as.character))
)
kolom_referensi <- names(semua_df[["CSV"]])
siap_merge <- list()
cat("=== Pengecekan Struktur Kolom ===\n\n")## === Pengecekan Struktur Kolom ===
for (nama in names(semua_df)) {
df_cek <- semua_df[[nama]]
if (setequal(names(df_cek), kolom_referensi)) {
cat(nama, ": Struktur kolom sesuai -> Ready to merge\n")
siap_merge[[nama]] <- df_cek[, kolom_referensi]
} else {
cat(nama, ": Perlu penyesuaian -> Need adjustment (disesuaikan otomatis)\n")
kolom_hilang <- setdiff(kolom_referensi, names(df_cek))
for (kol in kolom_hilang) df_cek[[kol]] <- NA
siap_merge[[nama]] <- df_cek[, kolom_referensi]
}
}## CSV : Struktur kolom sesuai -> Ready to merge
## Excel : Struktur kolom sesuai -> Ready to merge
## JSON : Struktur kolom sesuai -> Ready to merge
## TXT : Struktur kolom sesuai -> Ready to merge
## XML : Struktur kolom sesuai -> Ready to merge
df_gabungan <- bind_rows(siap_merge)
cat(sprintf("\nDataset utama berhasil digabungkan: %d baris, %d kolom\n",
nrow(df_gabungan), ncol(df_gabungan)))##
## Dataset utama berhasil digabungkan: 10000 baris, 22 kolom
Interpretasi Section A: Lima file dari format berbeda berhasil dibaca dan digabungkan menggunakan looping + conditional IF/ELSE. Pendekatan ini scalable – jika ada file ke-6, cukup ditambahkan ke list tanpa mengubah logika inti.
1.3 Section B Data Handling (E-Commerce)
1.3.1 Dimensi & Tipe Data
## Jumlah Total Baris : 10000
## Jumlah Total Kolom : 22
tipe <- data.frame(
No = 1:ncol(df_gabungan),
Kolom = names(df_gabungan),
Tipe = sapply(df_gabungan, class),
row.names = NULL
)
kable(tipe, caption = "Tipe Data Setiap Kolom") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |>
column_spec(2, bold = TRUE, color = "#2d6a4f") |>
column_spec(3, background = "#f0faf3")| No | Kolom | Tipe |
|---|---|---|
| 1 | order_id | character |
| 2 | order_date | character |
| 3 | ship_date | character |
| 4 | platform | character |
| 5 | category | character |
| 6 | product_name | character |
| 7 | unit_price | character |
| 8 | quantity | character |
| 9 | gross_sales | character |
| 10 | campaign | character |
| 11 | voucher_code | character |
| 12 | discount_pct | character |
| 13 | discount_value | character |
| 14 | shipping_cost | character |
| 15 | net_sales | character |
| 16 | payment_method | character |
| 17 | customer_segment | character |
| 18 | region | character |
| 19 | stock_status | character |
| 20 | order_status | character |
| 21 | customer_rating | character |
| 22 | priority_flag | character |
1.3.2 Missing Values
mv <- df_gabungan |>
summarise(across(everything(),
~ sum(is.na(.) | . == "" | . == "NULL"))) |>
pivot_longer(everything(), names_to = "Kolom", values_to = "Jumlah_NA") |>
filter(Jumlah_NA > 0) |>
arrange(desc(Jumlah_NA)) |>
mutate(Persen = round(Jumlah_NA / nrow(df_gabungan) * 100, 2))
kable(mv, col.names = c("Kolom","Jumlah Missing","Persentase (%)"),
caption = "Kolom dengan Missing Values") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
column_spec(3, bold = TRUE,
color = ifelse(mv$Persen > 20, "#e76f51", "#2d6a4f"))| Kolom | Jumlah Missing | Persentase (%) |
|---|---|---|
| customer_rating | 2030 | 20.30 |
| ship_date | 1000 | 10.00 |
| priority_flag | 940 | 9.40 |
| discount_pct | 345 | 3.45 |
| voucher_code | 245 | 2.45 |
| payment_method | 175 | 1.75 |
1.3.3 Duplikasi
dup <- sum(duplicated(df_gabungan))
pct <- round(dup / nrow(df_gabungan) * 100, 2)
cat("Jumlah baris duplikat:", dup, "\n")## Jumlah baris duplikat: 5581
## Persentase duplikat : 55.81 %
1.3.4 Masalah Kualitas Data
masalah <- data.frame(
No = 1:5,
Masalah = c(
"Format tanggal tidak konsisten",
"Nilai harga berformat string (Rp xxx.xxx)",
"Inkonsistensi penulisan platform",
"Missing values di kolom penting",
"Duplikasi data akibat penggabungan 5 file"
),
Contoh = c(
"2024-01-01, 01/01/2024, 01-01-2024",
"'Rp 1.500.000' alih-alih 1500000",
"'shopee', 'SHOPEE', ' shopee '",
"ship_date, customer_rating, payment_method",
paste0(dup, " baris duplikat dari ", nrow(df_gabungan), " baris")
)
)
kable(masalah, col.names = c("No","Masalah","Contoh"),
caption = "Minimal 3 Masalah Kualitas Data yang Ditemukan") |>
kable_styling(bootstrap_options = c("striped","hover")) |>
column_spec(2, bold = TRUE, color = "#e76f51")| No | Masalah | Contoh |
|---|---|---|
| 1 | Format tanggal tidak konsisten | 2024-01-01, 01/01/2024, 01-01-2024 |
| 2 | Nilai harga berformat string (Rp xxx.xxx) | ‘Rp 1.500.000’ alih-alih 1500000 |
| 3 | Inkonsistensi penulisan platform | ‘shopee’, ‘SHOPEE’, ’ shopee ’ |
| 4 | Missing values di kolom penting | ship_date, customer_rating, payment_method |
| 5 | Duplikasi data akibat penggabungan 5 file | 5581 baris duplikat dari 10000 baris |
Interpretasi Section B: Ditemukan 5 kategori masalah. Masalah paling kritis adalah format harga sebagai string (menghambat agregasi) dan inkonsistensi platform (menyebabkan grouping yang salah). Section C akan menyelesaikan semua masalah ini.
1.4 Section C Data Cleaning (E-Commerce)
1.4.1 Hapus Duplikat
df <- df_gabungan |> distinct(order_id, .keep_all = TRUE)
cat("Hapus duplikat berdasarkan order_id\n")## Hapus duplikat berdasarkan order_id
## Baris sebelum: 10000
## Baris sesudah: 1966
## Baris dihapus: 8034
1.4.2 Looping: Bersihkan Harga
bersihkan_harga <- function(x) {
x <- as.character(x)
x <- str_replace_all(x, "Rp\\s*", "")
x <- str_replace_all(x, "\\.", "")
x <- str_replace_all(x, ",", ".")
x <- str_trim(x)
hasil <- suppressWarnings(as.numeric(x))
ifelse(is.na(hasil) | hasil < 0, 0, hasil)
}
kolom_harga <- c("unit_price","gross_sales","net_sales","discount_value","shipping_cost")
cat("=== Membersihkan Kolom Harga dengan Looping ===\n\n")## === Membersihkan Kolom Harga dengan Looping ===
for (kol in kolom_harga) {
sebelum <- sum(is.na(df[[kol]]) | df[[kol]] == "")
df[[kol]] <- bersihkan_harga(df[[kol]])
cat(sprintf(" Kolom '%-18s' dibersihkan (NA sebelum: %d)\n", kol, sebelum))
}## Kolom 'unit_price ' dibersihkan (NA sebelum: 0)
## Kolom 'gross_sales ' dibersihkan (NA sebelum: 0)
## Kolom 'net_sales ' dibersihkan (NA sebelum: 0)
## Kolom 'discount_value ' dibersihkan (NA sebelum: 0)
## Kolom 'shipping_cost ' dibersihkan (NA sebelum: 0)
df$discount_pct <- suppressWarnings(as.numeric(df$discount_pct))
df$discount_pct[is.na(df$discount_pct)] <- 0
df$quantity <- suppressWarnings(as.numeric(df$quantity))
cat("\nSemua kolom harga berhasil dikonversi ke numerik!\n")##
## Semua kolom harga berhasil dikonversi ke numerik!
1.4.3 Standardisasi Platform (WAJIB IF)
standardisasi_platform <- function(p) {
p <- str_trim(str_to_lower(as.character(p)))
if (p %in% c("shopee","shopee ")) return("Shopee")
else if (p %in% c("tokopedia","tokped")) return("Tokopedia")
else if (p %in% c("lazada")) return("Lazada")
else if (p %in% c("blibli")) return("Blibli")
else if (p %in% c("tiktok shop","tiktok_shop")) return("TikTok Shop")
else return(str_to_title(p))
}
df$platform <- sapply(df$platform, standardisasi_platform)
cat("Platform distandarisasi\n")## Platform distandarisasi
##
## Blibli Lazada Shopee TikTok Shop Tokopedia
## 400 381 411 397 377
1.4.4 Missing Value (WAJIB IF)
df$payment_method <- ifelse(
is.na(df$payment_method) | str_trim(df$payment_method) %in% c("","nan","NaN","NAN"),
"Unknown",
str_trim(df$payment_method)
)
cat("payment_method: NA -> 'Unknown'\n")## payment_method: NA -> 'Unknown'
df$customer_rating <- suppressWarnings(as.numeric(df$customer_rating))
median_rating <- median(df$customer_rating, na.rm = TRUE)
df$customer_rating <- ifelse(is.na(df$customer_rating), median_rating, df$customer_rating)
cat(sprintf("customer_rating: NA -> median = %.1f\n", median_rating))## customer_rating: NA -> median = 5.0
## (Logika: median lebih robust terhadap outlier)
1.4.5 Standardisasi Status & Kategori
standardisasi_status <- function(s) {
s <- str_trim(str_to_lower(as.character(s)))
if (s %in% c("delivered","completed","complete","completion")) return("Completed")
else if (s %in% c("cancelled","cancel","batal","canceled")) return("Cancelled")
else if (s %in% c("returned","return","retur")) return("Returned")
else if (s %in% c("shipped","shipping")) return("Shipped")
else return(str_to_title(s))
}
df$order_status <- sapply(df$order_status, standardisasi_status)
cat("order_status distandarisasi\n"); print(table(df$order_status))## order_status distandarisasi
##
## Cancelled Completed On Delivery Returned Shipped
## 151 1550 47 117 101
##
## === Standardisasi Tambahan (Looping) ===
kolom_teks <- c("category","customer_segment","region","stock_status")
for (kol in kolom_teks) {
df[[kol]] <- str_to_title(str_trim(as.character(df[[kol]])))
cat(sprintf(" '%-20s' distandarisasi\n", kol))
}## 'category ' distandarisasi
## 'customer_segment ' distandarisasi
## 'region ' distandarisasi
## 'stock_status ' distandarisasi
payment_map <- c(
"e-wallet"="E-Wallet","ewallet"="E-Wallet","e wallet"="E-Wallet",
"virtual account"="Virtual Account","va"="Virtual Account",
"cod"="COD","cash on delivery"="COD",
"transfer bank"="Transfer Bank","bank transfer"="Transfer Bank",
"credit card"="Credit Card","nan"="Unknown","none"="Unknown"
)
df$payment_method <- sapply(
str_to_lower(str_trim(df$payment_method)),
function(x) {
if (is.na(x) || x == "") return("Unknown")
if (x %in% names(payment_map)) return(payment_map[x])
return(str_to_title(x))
}
)
cat(" 'payment_method' distandarisasi\n")## 'payment_method' distandarisasi
##
## Data bersih: 1966 baris, 22 kolom
Interpretasi Section C: Data dibersihkan melalui 7 proses sistematis menggunakan fungsi kustom, looping, dan conditional logic. Dataset kini bebas duplikat, format harga sudah numerik, dan semua kolom kategoris sudah konsisten.
1.5 Section D Conditional Logic (E-Commerce)
1.5.1 is_high_value
df$net_sales <- suppressWarnings(as.numeric(df$net_sales))
df$net_sales[is.na(df$net_sales)] <- 0
df$is_high_value <- ifelse(df$net_sales > 1000000, "Yes", "No")
cat("Kolom 'is_high_value' ditambahkan\n")## Kolom 'is_high_value' ditambahkan
##
## No Yes
## 1211 755
##
## Persentase high-value: 38.4%
df_hv <- as.data.frame(tbl_hv)
colnames(df_hv) <- c("is_high_value", "count")
df_hv$pct <- round(df_hv$count / sum(df_hv$count) * 100, 1)
plot_ly(df_hv, labels = ~is_high_value, values = ~count, type = "pie",
marker = list(colors = c("#40916c", "#b7e4c7"),
line = list(color = "white", width = 2)),
textinfo = "label+percent",
hovertemplate = "<b>%{label}</b><br>Jumlah: %{value:,}<br>Persentase: %{percent}<extra></extra>") |>
layout(
title = list(text = "<b>Distribusi Transaksi High-Value vs Non-High-Value</b>",
font = list(size = 15, color = "#1b4332")),
legend = list(orientation = "h", x = 0.3, y = -0.1),
paper_bgcolor = "white",
height = 400,
margin = list(t = 50, b = 50, l = 30, r = 30)
)1.5.2 order_priority (WAJIB nested IF)
tentukan_prioritas <- function(net) {
if (is.na(net)) return("Low")
else if (net > 1000000) return("High")
else if (net >= 500000) return("Medium")
else return("Low")
}
df$order_priority <- sapply(df$net_sales, tentukan_prioritas)
cat("Kolom 'order_priority' ditambahkan\n")## Kolom 'order_priority' ditambahkan
##
## High Low Medium
## 755 799 412
df_prio <- as.data.frame(table(df$order_priority))
colnames(df_prio) <- c("priority", "count")
df_prio$priority <- factor(df_prio$priority, levels = c("High","Medium","Low"))
df_prio <- df_prio[order(df_prio$priority), ]
df_prio$pct <- round(df_prio$count / sum(df_prio$count) * 100, 1)
plot_ly(df_prio, x = ~priority, y = ~count, type = "bar",
marker = list(color = c("#1b4332","#40916c","#b7e4c7"),
line = list(color = "white", width = 1.5)),
text = ~paste0(count, "<br>(", pct, "%)"),
textposition = "outside",
hovertemplate = "<b>%{x}</b><br>Jumlah: %{y:,}<extra></extra>") |>
layout(
title = list(text = "<b>Distribusi Order Priority (Nested IF)</b>",
font = list(size = 14, color = "#1b4332")),
xaxis = list(title = "Prioritas"),
yaxis = list(title = "Jumlah Transaksi", gridcolor = "#f0faf3"),
plot_bgcolor = "white", paper_bgcolor = "white",
height = 400
)1.5.3 valid_transaction
df$valid_transaction <- ifelse(
df$order_status == "Cancelled", "Invalid", "Valid"
)
cat("Kolom 'valid_transaction' ditambahkan\n")## Kolom 'valid_transaction' ditambahkan
##
## Invalid Valid
## 151 1815
##
## Persentase Valid : 92.3%
## Persentase Invalid: 7.7%
df_vt <- as.data.frame(tbl_vt)
colnames(df_vt) <- c("status", "count")
plot_ly(df_vt, labels = ~status, values = ~count, type = "pie", hole = 0.45,
marker = list(colors = c("#e76f51","#40916c"),
line = list(color = "white", width = 2)),
textinfo = "label+percent",
hovertemplate = "<b>%{label}</b><br>Jumlah: %{value:,}<extra></extra>") |>
layout(
title = list(text = "<b>Distribusi Valid vs Invalid Transaction</b>",
font = list(size = 14, color = "#1b4332")),
annotations = list(list(text = "Status", showarrow = FALSE,
font = list(size = 13, color = "#2d6a4f"))),
legend = list(orientation = "h", x = 0.3, y = -0.1),
paper_bgcolor = "white",
height = 400
)1.5.4 Preview Kolom Baru
df |>
select(order_id, platform, net_sales, order_status,
is_high_value, order_priority, valid_transaction) |>
head(10) |>
kable(caption = "Preview Dataset dengan 3 Kolom Baru (10 baris pertama)") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12)| order_id | platform | net_sales | order_status | is_high_value | order_priority | valid_transaction |
|---|---|---|---|---|---|---|
| ORD00612 | Tokopedia | 680058 | Completed | No | Medium | Valid |
| ORD00112 | TikTok Shop | 1476873 | Completed | Yes | High | Valid |
| ORD01186 | Tokopedia | 369715 | Completed | No | Low | Valid |
| ORD01511 | Tokopedia | 1382570 | Completed | Yes | High | Valid |
| ORD00772 | Tokopedia | 376696 | Completed | No | Low | Valid |
| ORD00880 | Blibli | 0 | Cancelled | No | Low | Invalid |
| ORD00592 | Blibli | 0 | Cancelled | No | Low | Invalid |
| ORD01367 | Blibli | 2085678 | Completed | Yes | High | Valid |
| ORD01178 | Blibli | 1208527 | Completed | Yes | High | Valid |
| ORD00276 | Lazada | 1789349 | Completed | Yes | High | Valid |
Interpretasi Section D: Tiga kolom baru berhasil dibuat menggunakan conditional logic: is_high_value (segmentasi nilai), order_priority (tier 3 level dengan nested IF), dan valid_transaction (filter data analitik).
1.6 Section E Analytical Thinking (E-Commerce)
1.6.1 Platform Paling Dominan
platform_count <- df |>
count(platform, name = "jumlah") |>
arrange(desc(jumlah)) |>
mutate(persen = round(jumlah / sum(jumlah) * 100, 1))
kable(platform_count,
col.names = c("Platform","Jumlah Transaksi","Persentase (%)"),
caption = "Distribusi Transaksi per Platform") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
row_spec(1, bold = TRUE, background = "#f0faf3", color = "#1b4332")| Platform | Jumlah Transaksi | Persentase (%) |
|---|---|---|
| Shopee | 411 | 20.9 |
| Blibli | 400 | 20.3 |
| TikTok Shop | 397 | 20.2 |
| Lazada | 381 | 19.4 |
| Tokopedia | 377 | 19.2 |
p_plat <- plot_ly(
data = platform_count |> arrange(jumlah),
y = ~reorder(platform, jumlah), x = ~jumlah,
type = "bar", orientation = "h",
marker = list(
color = colorRampPalette(c("#b7e4c7","#40916c","#1b4332"))(nrow(platform_count)),
line = list(color = "white", width = 1.5)
),
text = ~paste0(format(jumlah, big.mark="."), " (", persen, "%)"),
textposition = "outside",
hovertemplate = "<b>%{y}</b><br>Transaksi: %{x:,}<extra></extra>"
) |>
layout(
title = list(text = "<b>Jumlah Transaksi per Platform</b>",
font = list(size = 15, color = "#1b4332")),
xaxis = list(title = "Jumlah Transaksi", gridcolor = "#f0faf3"),
yaxis = list(title = ""),
plot_bgcolor = "white", paper_bgcolor = "white",
margin = list(r = 120),
hoverlabel = list(bgcolor = "#2d6a4f", font = list(color = "white")),
height = 400
)
p_platcat(sprintf("\nPLATFORM PALING DOMINAN: %s (%d transaksi)\n",
platform_count$platform[1], platform_count$jumlah[1]))##
## PLATFORM PALING DOMINAN: Shopee (411 transaksi)
1.6.2 Kategori Paling Sering
kategori_count <- df |>
count(category, name = "jumlah") |>
arrange(desc(jumlah)) |>
head(8) |>
mutate(persen = round(jumlah / nrow(df) * 100, 1))
kable(kategori_count,
col.names = c("Kategori","Jumlah","Persentase (%)"),
caption = "Top 8 Kategori Produk") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
row_spec(1, bold = TRUE, background = "#f0faf3")| Kategori | Jumlah | Persentase (%) |
|---|---|---|
| Fashion | 412 | 21.0 |
| Sports | 408 | 20.8 |
| Beauty | 387 | 19.7 |
| Home Living | 371 | 18.9 |
| Electronics | 357 | 18.2 |
| Home_living | 31 | 1.6 |
cat(sprintf("\nKATEGORI PALING SERING: %s (%d transaksi)\n",
kategori_count$category[1], kategori_count$jumlah[1]))##
## KATEGORI PALING SERING: Fashion (412 transaksi)
plot_ly(
data = kategori_count |> arrange(jumlah),
y = ~reorder(category, jumlah), x = ~jumlah,
type = "bar", orientation = "h",
marker = list(
color = colorRampPalette(c("#b7e4c7","#40916c","#1b4332"))(nrow(kategori_count)),
line = list(color = "white", width = 1.2)
),
text = ~paste0(jumlah, " (", persen, "%)"),
textposition = "outside",
hovertemplate = "<b>%{y}</b><br>Jumlah: %{x:,}<extra></extra>"
) |>
layout(
title = list(text = "<b>Top 8 Kategori Produk Terlaris</b>",
font = list(size = 14, color = "#1b4332")),
xaxis = list(title = "Jumlah Transaksi", gridcolor = "#f0faf3"),
yaxis = list(title = ""),
plot_bgcolor = "white", paper_bgcolor = "white",
margin = list(r = 110),
height = 400
)1.6.3 Status Transaksi
status_count <- df |>
count(order_status, name = "jumlah") |>
arrange(desc(jumlah)) |>
mutate(persen = round(jumlah / sum(jumlah) * 100, 1))
kable(status_count,
col.names = c("Status","Jumlah","Persentase (%)"),
caption = "Distribusi Status Transaksi") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
row_spec(1, bold = TRUE, background = "#f0faf3")| Status | Jumlah | Persentase (%) |
|---|---|---|
| Completed | 1550 | 78.8 |
| Cancelled | 151 | 7.7 |
| Returned | 117 | 6.0 |
| Shipped | 101 | 5.1 |
| On Delivery | 47 | 2.4 |
cat(sprintf("\nSTATUS PALING BANYAK: %s (%d transaksi)\n",
status_count$order_status[1], status_count$jumlah[1]))##
## STATUS PALING BANYAK: Completed (1550 transaksi)
plot_ly(status_count, labels = ~order_status, values = ~jumlah,
type = "pie", hole = 0.4,
marker = list(
colors = colorRampPalette(c("#1b4332","#40916c","#74c69d","#b7e4c7","#d8f3dc"))(nrow(status_count)),
line = list(color = "white", width = 2)
),
textinfo = "label+percent",
hovertemplate = "<b>%{label}</b><br>Jumlah: %{value:,}<br>%{percent}<extra></extra>") |>
layout(
title = list(text = "<b>Distribusi Status Transaksi</b>",
font = list(size = 14, color = "#1b4332")),
legend = list(orientation = "h", x = 0.1, y = -0.15),
paper_bgcolor = "white",
height = 400
)df |> write.csv("ecommerce_cleaned_R.csv", row.names = FALSE)
cat("\nDisimpan -> ecommerce_cleaned_R.csv\n")##
## Disimpan -> ecommerce_cleaned_R.csv
## Total baris: 1966 | Total kolom: 25
Ringkasan Section E:
1. Platform paling dominan:
Shopee (20.9%)
2. Kategori paling sering:
Fashion (412 transaksi)
3. Status transaksi
terbanyak: Completed (78.8%)
2 MINI PROJECT 2
2.1 Library yang Digunakan (Mini Project 2)
pkgs2 <- c("rvest","httr","dplyr","jsonlite","stringr","knitr","kableExtra","plotly")
for (p in pkgs2) {
if (!require(p, character.only = TRUE, quietly = TRUE))
install.packages(p, repos = "https://cran.r-project.org")
}
library(rvest); library(httr); library(dplyr)
library(jsonlite); library(stringr)
cat("Semua library Mini Project 2 berhasil dimuat!")## Semua library Mini Project 2 berhasil dimuat!
2.2 Setup Helper Functions
BASE_URL <- "https://www.scrapethissite.com"
get_page_r <- function(url, query_params = NULL) {
tryCatch({
resp <- GET(url,
add_headers("User-Agent" = "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0.0.0"),
query = query_params, timeout(15))
if (status_code(resp) == 200)
read_html(content(resp, "text", encoding = "UTF-8"))
else { cat(sprintf(" [ERROR] Status %d\n", status_code(resp))); NULL }
}, error = function(e) { cat(sprintf(" [ERROR] %s\n", e$message)); NULL })
}
cat("Helper functions siap! BASE_URL:", BASE_URL, "\n")## Helper functions siap! BASE_URL: https://www.scrapethissite.com
2.3 Website 1 Countries of the World (Static HTML)
2.3.1 Section A Scraping
countries_list <- list()
page <- get_page_r(paste0(BASE_URL, "/pages/simple/"))
if (!is.null(page)) {
nodes <- page %>% html_nodes("div.country")
cat(sprintf("Ditemukan %d negara\n\n", length(nodes)))
for (i in seq_along(nodes)) {
nd <- nodes[[i]]
name <- nd %>% html_node("h3.country-name") %>% html_text(trim=TRUE)
cap <- nd %>% html_node("span.country-capital") %>% html_text(trim=TRUE)
pop <- nd %>% html_node("span.country-population") %>% html_text(trim=TRUE)
area <- nd %>% html_node("span.country-area") %>% html_text(trim=TRUE)
name <- ifelse(is.na(name)||nchar(name)==0, "Unknown", name)
cap <- ifelse(is.na(cap) ||nchar(trimws(cap))==0, "Unknown", cap)
pop <- ifelse(is.na(pop), "0", pop)
area <- ifelse(is.na(area), "0", area)
data_status <- if(name!="Unknown" && cap!="Unknown") "Complete" else "Incomplete"
countries_list[[i]] <- list(country_name=name, capital=cap,
population=pop, area=area, data_status=data_status)
}
}## Ditemukan 250 negara
df_countries_raw <- bind_rows(countries_list)
cat(sprintf("Total data: %d baris\n", nrow(df_countries_raw)))## Total data: 250 baris
kable(head(df_countries_raw, 5), caption = "Preview Countries (Raw)") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12)| country_name | capital | population | area | data_status |
|---|---|---|---|---|
| Andorra | Andorra la Vella | 84000 | 468.0 | Complete |
| United Arab Emirates | Abu Dhabi | 4975593 | 82880.0 | Complete |
| Afghanistan | Kabul | 29121286 | 647500.0 | Complete |
| Antigua and Barbuda | St. John’s | 86754 | 443.0 | Complete |
| Anguilla | The Valley | 13254 | 102.0 | Complete |
2.3.2 Section B Data Handling
cat(sprintf("Baris: %d | Kolom: %d | Duplikat: %d\n\n",
nrow(df_countries_raw), ncol(df_countries_raw),
sum(duplicated(df_countries_raw))))## Baris: 250 | Kolom: 5 | Duplikat: 0
## Tipe data:
## country_name capital population area data_status
## "character" "character" "character" "character" "character"
##
## Missing values:
## country_name capital population area data_status
## 0 0 0 0 0
##
## DATA ISSUES:
## (1) population & area bertipe character, bukan numerik
## (2) Beberapa capital memiliki spasi ekstra
2.3.3 Section C Cleaning
df_countries <- df_countries_raw
for (col in c("country_name","capital")) {
df_countries[[col]] <- str_to_title(str_trim(df_countries[[col]]))
cat(sprintf(" '%-15s' trim & proper case\n", col))
}## 'country_name ' trim & proper case
## 'capital ' trim & proper case
df_countries$population <- suppressWarnings(as.numeric(str_replace_all(df_countries$population,",","")))
df_countries$area <- suppressWarnings(as.numeric(str_replace_all(df_countries$area,",","")))
df_countries$population[is.na(df_countries$population)] <- 0
df_countries$area[is.na(df_countries$area)] <- 0
df_countries <- df_countries %>% distinct(country_name, .keep_all=TRUE)
cat(sprintf("\nBaris final: %d\n", nrow(df_countries)))##
## Baris final: 250
write.csv(df_countries, "countries_of_the_world_R.csv", row.names=FALSE)
cat("Disimpan -> countries_of_the_world_R.csv\n")## Disimpan -> countries_of_the_world_R.csv
## country_name capital population area data_status
## "character" "character" "numeric" "numeric" "character"
kable(head(df_countries, 5)) |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)| country_name | capital | population | area | data_status |
|---|---|---|---|---|
| Andorra | Andorra La Vella | 84000 | 468 | Complete |
| United Arab Emirates | Abu Dhabi | 4975593 | 82880 | Complete |
| Afghanistan | Kabul | 29121286 | 647500 | Complete |
| Antigua And Barbuda | St. John’s | 86754 | 443 | Complete |
| Anguilla | The Valley | 13254 | 102 | Complete |
2.3.4 Section D Conditional Logic
df_countries <- df_countries %>%
mutate(
is_large = case_when(
population > 10000000 ~ "Large",
population > 1000000 ~ "Medium",
TRUE ~ "Small"
),
data_status = case_when(
is.na(country_name) | country_name == "Unknown" ~ "Incomplete",
capital == "Unknown" ~ "Incomplete",
TRUE ~ "Complete"
)
)
cat("Kolom 'is_large' ditambahkan\n")## Kolom 'is_large' ditambahkan
##
## Large Medium Small
## 82 76 92
##
## Kolom 'data_status' diperbarui
##
## Complete
## 250
write.csv(df_countries, "countries_of_the_world_R.csv", row.names = FALSE)
cat("\nDisimpan -> countries_of_the_world_R.csv\n")##
## Disimpan -> countries_of_the_world_R.csv
##
## === INSIGHT COUNTRIES ===
biggest <- df_countries[which.max(df_countries$population), ]
cat(sprintf("1. Negara dengan populasi terbesar: %s (%.0f jiwa)\n",
biggest$country_name, biggest$population))## 1. Negara dengan populasi terbesar: China (1330044000 jiwa)
## 2. Distribusi ukuran negara:
##
## Large Medium Small
## 82 76 92
total_c <- nrow(df_countries)
cat(sprintf("3. Persentase: Large=%.1f%%, Medium=%.1f%%, Small=%.1f%%\n",
sum(df_countries$is_large=="Large") / total_c * 100,
sum(df_countries$is_large=="Medium") / total_c * 100,
sum(df_countries$is_large=="Small") / total_c * 100))## 3. Persentase: Large=32.8%, Medium=30.4%, Small=36.8%
top15_countries <- df_countries |> arrange(desc(population)) |> head(15)
plot_ly(
data = top15_countries,
y = ~reorder(country_name, population), x = ~population,
type = "bar", orientation = "h",
marker = list(
color = colorRampPalette(c("#b7e4c7","#40916c","#1b4332"))(15),
line = list(color = "white", width = 1)
),
text = ~paste0(round(population/1e6, 1), " jt"),
textposition = "outside",
hovertemplate = "<b>%{y}</b><br>Populasi: %{x:,.0f}<extra></extra>"
) |>
layout(
title = list(text = "<b>Top 15 Negara dengan Populasi Terbesar</b>",
font = list(size = 14, color = "#1b4332")),
xaxis = list(title = "Populasi", gridcolor = "#f0faf3"),
yaxis = list(title = ""),
plot_bgcolor = "white", paper_bgcolor = "white",
margin = list(r = 80),
height = 420
)df_size <- as.data.frame(table(df_countries$is_large))
colnames(df_size) <- c("size", "count")
plot_ly(df_size, labels = ~size, values = ~count, type = "pie",
marker = list(colors = c("#1b4332","#74c69d","#d8f3dc"),
line = list(color = "white", width = 2)),
textinfo = "label+percent",
hovertemplate = "<b>%{label}</b><br>Jumlah: %{value}<extra></extra>") |>
layout(
title = list(text = "<b>Distribusi Ukuran Negara (Berdasarkan Populasi)</b>",
font = list(size = 14, color = "#1b4332")),
legend = list(orientation = "h", x = 0.2, y = -0.1),
paper_bgcolor = "white",
height = 400
)Interpretasi Countries: 250 negara berhasil di-scrape. Kolom is_large mengkategorikan negara berdasarkan populasi: Large (>10 juta), Medium (1-10 juta), Small (<1 juta). Static HTML adalah tipe paling sederhana – satu request langsung menghasilkan semua data tanpa JavaScript rendering.
2.4 Website 2 Hockey Teams (Pagination)
2.4.1 Section A Scraping
## Scraping dengan pagination...
for (pg in 1:24) {
page <- get_page_r(paste0(BASE_URL, "/pages/forms/"),
query_params = list(page_num=pg, per_page=25))
if (!is.null(page)) {
rows <- page %>% html_nodes("tr.team")
if (length(rows) == 0) { cat(sprintf(" Pagination berhenti di halaman %d\n", pg)); break }
cat(sprintf(" Halaman %d -> %d tim\n", pg, length(rows)))
for (row in rows) {
cols <- row %>% html_nodes("td") %>% html_text(trim=TRUE)
if (length(cols) >= 8) {
ds <- if(nchar(cols[1])>0 && nchar(cols[2])>0) "Complete" else "Incomplete"
hockey_list[[counter]] <- list(
team_name=cols[1], year=cols[2], wins=cols[3], losses=cols[4],
win_pct=cols[6], goals_for=cols[7], goals_against=cols[8], data_status=ds)
counter <- counter + 1
}
}
}
Sys.sleep(0.3)
}## Halaman 1 -> 25 tim
## Halaman 2 -> 25 tim
## Halaman 3 -> 25 tim
## Halaman 4 -> 25 tim
## Halaman 5 -> 25 tim
## Halaman 6 -> 25 tim
## Halaman 7 -> 25 tim
## Halaman 8 -> 25 tim
## Halaman 9 -> 25 tim
## Halaman 10 -> 25 tim
## Halaman 11 -> 25 tim
## Halaman 12 -> 25 tim
## Halaman 13 -> 25 tim
## Halaman 14 -> 25 tim
## Halaman 15 -> 25 tim
## Halaman 16 -> 25 tim
## Halaman 17 -> 25 tim
## Halaman 18 -> 25 tim
## Halaman 19 -> 25 tim
## Halaman 20 -> 25 tim
## Halaman 21 -> 25 tim
## Halaman 22 -> 25 tim
## Halaman 23 -> 25 tim
## Halaman 24 -> 7 tim
df_hockey_raw <- bind_rows(hockey_list)
cat(sprintf("\nTotal data: %d baris\n", nrow(df_hockey_raw)))##
## Total data: 582 baris
kable(head(df_hockey_raw, 5), caption = "Preview Hockey Teams (Raw)") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)| team_name | year | wins | losses | win_pct | goals_for | goals_against | data_status |
|---|---|---|---|---|---|---|---|
| Boston Bruins | 1990 | 44 | 24 | 0.55 | 299 | 264 | Complete |
| Buffalo Sabres | 1990 | 31 | 30 | 0.388 | 292 | 278 | Complete |
| Calgary Flames | 1990 | 46 | 26 | 0.575 | 344 | 263 | Complete |
| Chicago Blackhawks | 1990 | 49 | 23 | 0.613 | 284 | 211 | Complete |
| Detroit Red Wings | 1990 | 34 | 38 | 0.425 | 273 | 298 | Complete |
2.4.2 Section B Data Handling
cat(sprintf("Baris: %d | Kolom: %d | Duplikat: %d\n\n",
nrow(df_hockey_raw), ncol(df_hockey_raw),
sum(duplicated(df_hockey_raw))))## Baris: 582 | Kolom: 8 | Duplikat: 0
## Tipe data:
## team_name year wins losses win_pct
## "character" "character" "character" "character" "character"
## goals_for goals_against data_status
## "character" "character" "character"
##
## DATA ISSUES:
## (1) Kolom wins/losses/goals bertipe character, bukan numerik
## (2) Kolom year bertipe string, perlu konversi ke integer
2.4.3 Section C Cleaning
df_hockey <- df_hockey_raw
for (col in c("wins","losses","goals_for","goals_against")) {
df_hockey[[col]] <- suppressWarnings(as.numeric(df_hockey[[col]]))
if(any(is.na(df_hockey[[col]]))) df_hockey[[col]][is.na(df_hockey[[col]])] <- 0
cat(sprintf(" '%-15s' -> numeric\n", col))
}## 'wins ' -> numeric
## 'losses ' -> numeric
## 'goals_for ' -> numeric
## 'goals_against ' -> numeric
df_hockey$year <- suppressWarnings(as.numeric(df_hockey$year))
df_hockey$win_pct <- suppressWarnings(as.numeric(df_hockey$win_pct))
df_hockey$team_name <- str_to_title(str_trim(df_hockey$team_name))
df_hockey <- df_hockey %>% distinct(team_name, year, .keep_all=TRUE)
cat(sprintf("\nBaris final: %d\n", nrow(df_hockey)))##
## Baris final: 582
## Disimpan -> hockey_teams_R.csv
## team_name year wins losses win_pct
## "character" "numeric" "numeric" "numeric" "numeric"
## goals_for goals_against data_status
## "numeric" "numeric" "character"
kable(head(df_hockey, 5)) |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)| team_name | year | wins | losses | win_pct | goals_for | goals_against | data_status |
|---|---|---|---|---|---|---|---|
| Boston Bruins | 1990 | 44 | 24 | 0.550 | 299 | 264 | Complete |
| Buffalo Sabres | 1990 | 31 | 30 | 0.388 | 292 | 278 | Complete |
| Calgary Flames | 1990 | 46 | 26 | 0.575 | 344 | 263 | Complete |
| Chicago Blackhawks | 1990 | 49 | 23 | 0.613 | 284 | 211 | Complete |
| Detroit Red Wings | 1990 | 34 | 38 | 0.425 | 273 | 298 | Complete |
2.4.4 Section D Conditional Logic
df_hockey <- df_hockey %>%
mutate(
performance = case_when(
wins > 40 ~ "High",
wins > 20 ~ "Medium",
TRUE ~ "Low"
),
data_status = case_when(
is.na(team_name) | team_name == "" ~ "Incomplete",
is.na(wins) ~ "Incomplete",
TRUE ~ "Complete"
)
)
df_hockey <- df_hockey %>%
mutate(win_rate = round(wins / (wins + losses) * 100, 2))
df_hockey$win_rate[is.nan(df_hockey$win_rate)] <- 0
cat("Kolom 'performance' ditambahkan\n")## Kolom 'performance' ditambahkan
##
## High Low Medium
## 226 27 329
##
## Kolom 'win_rate' ditambahkan
##
## Kolom 'data_status' diperbarui
##
## Complete
## 582
write.csv(df_hockey, "hockey_teams_R.csv", row.names = FALSE)
cat("\nDisimpan -> hockey_teams_R.csv\n")##
## Disimpan -> hockey_teams_R.csv
kable(
df_hockey %>% select(team_name, year, wins, losses, win_rate) %>% head(8),
caption = "Preview Hockey Teams dengan Win Rate"
) |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12)| team_name | year | wins | losses | win_rate |
|---|---|---|---|---|
| Boston Bruins | 1990 | 44 | 24 | 64.71 |
| Buffalo Sabres | 1990 | 31 | 30 | 50.82 |
| Calgary Flames | 1990 | 46 | 26 | 63.89 |
| Chicago Blackhawks | 1990 | 49 | 23 | 68.06 |
| Detroit Red Wings | 1990 | 34 | 38 | 47.22 |
| Edmonton Oilers | 1990 | 37 | 37 | 50.00 |
| Hartford Whalers | 1990 | 31 | 38 | 44.93 |
| Los Angeles Kings | 1990 | 46 | 24 | 65.71 |
##
## === INSIGHT HOCKEY TEAMS ===
best_team <- df_hockey[which.max(df_hockey$win_rate), ]
cat(sprintf("1. Tim dengan win rate tertinggi: %s (%d) - %.2f%%\n",
best_team$team_name, best_team$year, best_team$win_rate))## 1. Tim dengan win rate tertinggi: Detroit Red Wings (1995) - 82.67%
avg_per_year <- df_hockey %>%
group_by(year) %>%
summarise(avg_wr = round(mean(win_rate, na.rm=TRUE), 2), .groups="drop") %>%
arrange(desc(avg_wr)) %>% slice(1)
cat(sprintf("2. Tahun dengan rata-rata win rate tertinggi: %d (%.2f%%)\n",
avg_per_year$year, avg_per_year$avg_wr))## 2. Tahun dengan rata-rata win rate tertinggi: 2009 (56.89%)
## 3. Distribusi performance:
##
## High Low Medium
## 226 27 329
df_perf <- as.data.frame(table(df_hockey$performance))
colnames(df_perf) <- c("performance", "count")
df_perf$performance <- factor(df_perf$performance, levels = c("High","Medium","Low"))
plot_ly(df_perf, x = ~performance, y = ~count, type = "bar",
marker = list(color = c("#1b4332","#40916c","#b7e4c7"),
line = list(color = "white", width = 1.5)),
text = ~count, textposition = "outside",
hovertemplate = "<b>%{x}</b><br>Jumlah Tim: %{y:,}<extra></extra>") |>
layout(
title = list(text = "<b>Distribusi Performance Tim Hockey</b>",
font = list(size = 14, color = "#1b4332")),
xaxis = list(title = "Performance"),
yaxis = list(title = "Jumlah Tim", gridcolor = "#f0faf3"),
plot_bgcolor = "white", paper_bgcolor = "white",
height = 400
)avg_wr_year <- df_hockey %>%
group_by(year) %>%
summarise(avg_win_rate = round(mean(win_rate, na.rm=TRUE), 2), .groups="drop") %>%
arrange(year)
plot_ly(avg_wr_year, x = ~year, y = ~avg_win_rate, type = "scatter", mode = "lines+markers",
line = list(color = "#40916c", width = 2.5),
marker = list(color = "#1b4332", size = 7),
text = ~paste0(avg_win_rate, "%"),
textposition = "top center",
hovertemplate = "<b>Tahun %{x}</b><br>Avg Win Rate: %{y:.2f}%<extra></extra>") |>
layout(
title = list(text = "<b>Rata-rata Win Rate Tim Hockey per Tahun</b>",
font = list(size = 14, color = "#1b4332")),
xaxis = list(title = "Tahun", gridcolor = "#f0faf3"),
yaxis = list(title = "Rata-rata Win Rate (%)", gridcolor = "#f0faf3"),
plot_bgcolor = "white", paper_bgcolor = "white",
height = 400
)Interpretasi Hockey: 582 records berhasil dikumpulkan. Kolom performance mengkategorikan tim: High (>40 wins), Medium (>20 wins), Low (<=20 wins). Kolom win_rate menunjukkan persentase kemenangan setiap tim per musim.
2.5 Website 3 Oscar Winning Films (AJAX)
2.5.1 Section A Scraping
## Scraping data Oscar per tahun...
for (year in 2010:2015) {
url_oscar <- sprintf("%s/pages/ajax-javascript/?ajax=true&year=%d", BASE_URL, year)
resp <- GET(url_oscar, add_headers("User-Agent"="Mozilla/5.0"), timeout(15))
if (status_code(resp) == 200) {
movies <- tryCatch(
fromJSON(content(resp, "text", encoding="UTF-8"), flatten=TRUE),
error = function(e) NULL
)
if (!is.null(movies) && is.data.frame(movies) && nrow(movies) > 0) {
cat(sprintf(" Tahun %d -> %d film\n", year, nrow(movies)))
for (i in seq_len(nrow(movies))) {
title <- ifelse(!is.null(movies$title[i]) && nchar(str_trim(movies$title[i]))>0,
str_trim(movies$title[i]), "Unknown")
ds <- if(title!="Unknown") "Complete" else "Incomplete"
oscar_list[[counter]] <- list(
movie_title = title, year = year,
nominations = as.character(ifelse(is.null(movies$nominations[i]), 0, movies$nominations[i])),
awards = as.character(ifelse(is.null(movies$awards[i]), 0, movies$awards[i])),
data_status = ds)
counter <- counter + 1
}
}
} else {
cat(sprintf(" Gagal tahun %d: status %d\n", year, status_code(resp)))
}
Sys.sleep(0.3)
}## Tahun 2010 -> 13 film
## Tahun 2011 -> 15 film
## Tahun 2012 -> 15 film
## Tahun 2013 -> 12 film
## Tahun 2014 -> 16 film
## Tahun 2015 -> 16 film
df_oscar_raw <- bind_rows(oscar_list)
cat(sprintf("\nTotal film terkumpul: %d baris\n", nrow(df_oscar_raw)))##
## Total film terkumpul: 87 baris
kable(head(df_oscar_raw, 5), caption = "Preview Oscar Films (Raw)") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)| movie_title | year | nominations | awards | data_status |
|---|---|---|---|---|
| The King’s Speech | 2010 | 12 | 4 | Complete |
| Inception | 2010 | 8 | 4 | Complete |
| The Social Network | 2010 | 8 | 3 | Complete |
| The Fighter | 2010 | 7 | 2 | Complete |
| Toy Story 3 | 2010 | 5 | 2 | Complete |
2.5.2 Section B Data Handling
cat(sprintf("Baris: %d | Kolom: %d | Duplikat: %d\n\n",
nrow(df_oscar_raw), ncol(df_oscar_raw),
sum(duplicated(df_oscar_raw))))## Baris: 87 | Kolom: 5 | Duplikat: 0
## Tipe data:
## movie_title year nominations awards data_status
## "character" "integer" "character" "character" "character"
##
## DATA ISSUES:
## (1) nominations & awards bertipe character, perlu konversi numerik
## (2) Beberapa film nominations = 0 (perlu diverifikasi)
2.5.3 Section C Cleaning
df_oscar <- df_oscar_raw
if (nrow(df_oscar) > 0) {
df_oscar$movie_title <- str_to_title(str_trim(df_oscar$movie_title))
for (col in c("nominations","awards")) {
df_oscar[[col]] <- suppressWarnings(as.numeric(df_oscar[[col]]))
df_oscar[[col]][is.na(df_oscar[[col]])] <- 0
cat(sprintf(" '%-15s' -> numeric\n", col))
}
df_oscar$movie_title[is.na(df_oscar$movie_title)|df_oscar$movie_title==""] <- "Unknown"
df_oscar <- df_oscar %>% distinct()
cat(sprintf("\nBaris final: %d\n", nrow(df_oscar)))
}## 'nominations ' -> numeric
## 'awards ' -> numeric
##
## Baris final: 87
write.csv(df_oscar, "oscar_winning_films_R.csv", row.names=FALSE)
cat("Disimpan -> oscar_winning_films_R.csv\n")## Disimpan -> oscar_winning_films_R.csv
## movie_title year nominations awards data_status
## "character" "integer" "numeric" "numeric" "character"
kable(head(df_oscar, 5)) |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size=12)| movie_title | year | nominations | awards | data_status |
|---|---|---|---|---|
| The King’s Speech | 2010 | 12 | 4 | Complete |
| Inception | 2010 | 8 | 4 | Complete |
| The Social Network | 2010 | 8 | 3 | Complete |
| The Fighter | 2010 | 7 | 2 | Complete |
| Toy Story 3 | 2010 | 5 | 2 | Complete |
2.5.4 Section D Conditional Logic
if (nrow(df_oscar) > 0) {
df_oscar <- df_oscar %>%
mutate(
category = case_when(
awards >= 3 ~ "High Awards",
awards >= 1 ~ "Mid Awards",
TRUE ~ "No Award"
),
data_status = case_when(
is.na(movie_title) | movie_title == "Unknown" ~ "Incomplete",
TRUE ~ "Complete"
)
)
}
cat("Kolom 'category' ditambahkan\n")## Kolom 'category' ditambahkan
##
## High Awards Mid Awards
## 16 71
##
## Kolom 'data_status' diperbarui
##
## Complete
## 87
write.csv(df_oscar, "oscar_winning_films_R.csv", row.names = FALSE)
cat("\nDisimpan ulang -> oscar_winning_films_R.csv\n")##
## Disimpan ulang -> oscar_winning_films_R.csv
kable(head(df_oscar, 8),
caption = "Preview Oscar Films dengan Kolom Category & Data Status") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"), font_size = 12) |>
column_spec(5, bold = TRUE,
color = ifelse(
head(df_oscar, 8)$category == "High Awards", "#2d6a4f",
ifelse(head(df_oscar, 8)$category == "Mid Awards", "#40916c", "#95d5b2")
))| movie_title | year | nominations | awards | data_status | category |
|---|---|---|---|---|---|
| The King’s Speech | 2010 | 12 | 4 | Complete | High Awards |
| Inception | 2010 | 8 | 4 | Complete | High Awards |
| The Social Network | 2010 | 8 | 3 | Complete | High Awards |
| The Fighter | 2010 | 7 | 2 | Complete | Mid Awards |
| Toy Story 3 | 2010 | 5 | 2 | Complete | Mid Awards |
| Alice In Wonderland | 2010 | 3 | 2 | Complete | Mid Awards |
| Black Swan | 2010 | 5 | 1 | Complete | Mid Awards |
| In A Better World | 2010 | 1 | 1 | Complete | Mid Awards |
if (nrow(df_oscar) > 0) {
df_oscar_cat <- as.data.frame(table(df_oscar$category))
colnames(df_oscar_cat) <- c("category", "count")
plot_ly(df_oscar_cat, labels = ~category, values = ~count, type = "pie",
marker = list(colors = c("#1b4332","#74c69d","#d8f3dc"),
line = list(color = "white", width = 2)),
textinfo = "label+percent",
hovertemplate = "<b>%{label}</b><br>Jumlah: %{value}<extra></extra>") |>
layout(
title = list(text = "<b>Distribusi Kategori Film Oscar (2010-2015)</b>",
font = list(size = 14, color = "#1b4332")),
legend = list(orientation = "h", x = 0.1, y = -0.1),
paper_bgcolor = "white",
height = 400
)
}if (nrow(df_oscar) > 0) {
films_per_year <- df_oscar %>% count(year, name = "total")
plot_ly(films_per_year, x = ~year, y = ~total, type = "bar",
marker = list(
color = colorRampPalette(c("#b7e4c7","#1b4332"))(nrow(films_per_year)),
line = list(color = "white", width = 1.5)
),
text = ~total, textposition = "outside",
hovertemplate = "<b>Tahun %{x}</b><br>Film: %{y}<extra></extra>") |>
layout(
title = list(text = "<b>Jumlah Film Oscar yang Di-scrape per Tahun</b>",
font = list(size = 14, color = "#1b4332")),
xaxis = list(title = "Tahun", tickmode = "linear"),
yaxis = list(title = "Jumlah Film", gridcolor = "#f0faf3"),
plot_bgcolor = "white", paper_bgcolor = "white",
height = 400
)
}Interpretasi Oscar: 87 film
(2010-2015) berhasil dikumpulkan. Kolom category
membagi film menjadi: High Awards (>=3 award), Mid Awards (1-2
award), No Award (0 award). Kunci scraping: endpoint AJAX mengembalikan
JSON murni yang di-parse langsung dengan
fromJSON().
2.6 Website 4 Turtles (Frames & iFrames)
2.6.1 Section A Scraping
turtles_list <- list(); counter <- 1
page_main <- get_page_r(paste0(BASE_URL, "/pages/frames/"))
if (!is.null(page_main)) {
iframes <- html_nodes(page_main, "iframe")
cat(sprintf("Jumlah iframe ditemukan: %d\n\n", length(iframes)))
for (iframe in iframes) {
src <- html_attr(iframe, "src")
if (is.na(src) || nchar(src)==0) next
iframe_url <- if(str_starts(src,"/")) paste0(BASE_URL, src) else src
cat(sprintf(" Mengakses: %s\n", iframe_url))
iframe_page <- get_page_r(iframe_url)
if (!is.null(iframe_page)) {
rows <- html_nodes(iframe_page, "tr")
for (row in rows) {
cols <- html_nodes(row, "td")
if (length(cols)>=2 && nchar(str_trim(html_text(cols[[1]])))>1) {
name <- str_trim(html_text(cols[[1]]))
desc <- str_trim(html_text(cols[[2]]))
info <- if(length(cols)>2) str_trim(html_text(cols[[3]])) else "N/A"
if (!nchar(desc)) desc <- "Unknown"
if (!nchar(info)) info <- "N/A"
ds <- if(name!=""&&desc!="Unknown") "Complete" else "Incomplete"
turtles_list[[counter]] <- list(
name=name, description=desc,
additional_info=info, source_iframe=iframe_url, data_status=ds)
counter <- counter + 1
}
}
}
}
}## Jumlah iframe ditemukan: 1
##
## Mengakses: https://www.scrapethissite.com/pages/frames/?frame=i
if (length(turtles_list)==0) {
df_turtles_raw <- data.frame(name=character(), description=character(),
additional_info=character(), source_iframe=character(),
data_status=character(), stringsAsFactors=FALSE)
cat("[WARN] Tidak ada data turtle.\n")
} else {
df_turtles_raw <- bind_rows(turtles_list) |>
mutate(across(where(is.character), ~ str_trunc(str_trim(.x), 80)))
}## [WARN] Tidak ada data turtle.
##
## Total data: 0 baris
2.6.2 Section B Data Handling
cat(sprintf("Baris: %d | Kolom: %d | Duplikat: %d\n\n",
nrow(df_turtles_raw), ncol(df_turtles_raw),
sum(duplicated(df_turtles_raw))))## Baris: 0 | Kolom: 5 | Duplikat: 0
## Tipe data:
## name description additional_info source_iframe data_status
## "character" "character" "character" "character" "character"
##
## DATA ISSUES:
## (1) Kolom additional_info banyak berisi 'N/A'
## (2) Beberapa description sangat pendek/tidak informatif
2.6.3 Section C Cleaning
df_turtles <- df_turtles_raw
for (col in c("name","description","additional_info")) {
if (col %in% names(df_turtles)) {
df_turtles[[col]] <- str_trim(df_turtles[[col]])
df_turtles[[col]][is.na(df_turtles[[col]])] <- "Unknown"
cat(sprintf(" '%-20s' trim & fill NA\n", col))
}
}## 'name ' trim & fill NA
## 'description ' trim & fill NA
## 'additional_info ' trim & fill NA
##
## Baris final: 0
write.csv(df_turtles, "turtles_data_R.csv", row.names=FALSE)
cat("Disimpan -> turtles_data_R.csv\n")## Disimpan -> turtles_data_R.csv
2.6.4 Section D Conditional Logic
df_turtles <- df_turtles %>%
mutate(
data_status = case_when(
is.na(name) | name == "Unknown" ~ "Incomplete",
is.na(description) | description == "Unknown" ~ "Incomplete",
TRUE ~ "Complete"
)
)
cat("Kolom 'data_status' diperbarui\n")## Kolom 'data_status' diperbarui
## < table of extent 0 >
write.csv(df_turtles, "turtles_data_R.csv", row.names = FALSE)
cat("\nDisimpan -> turtles_data_R.csv\n")##
## Disimpan -> turtles_data_R.csv
##
## === INSIGHT TURTLES ===
## 1. Total family kura-kura yang ditemukan: 0 family
## 2. Status kelengkapan data:
## < table of extent 0 >
complete_pct <- sum(df_turtles$data_status=="Complete") / nrow(df_turtles) * 100
cat(sprintf("3. Persentase data Complete: %.1f%%\n", complete_pct))## 3. Persentase data Complete: NaN%
if (nrow(df_turtles) > 0) {
df_turt_status <- as.data.frame(table(df_turtles$data_status))
colnames(df_turt_status) <- c("status", "count")
df_turt_status$pct <- round(df_turt_status$count / sum(df_turt_status$count) * 100, 1)
plot_ly(df_turt_status, labels = ~status, values = ~count, type = "pie", hole = 0.45,
marker = list(colors = c("#40916c","#e76f51"),
line = list(color = "white", width = 2)),
textinfo = "label+percent",
hovertemplate = "<b>%{label}</b><br>Jumlah: %{value}<extra></extra>") |>
layout(
title = list(text = "<b>Kelengkapan Data Turtles (iFrame)</b>",
font = list(size = 14, color = "#1b4332")),
annotations = list(list(text = "Status", showarrow = FALSE,
font = list(size = 13, color = "#2d6a4f"))),
legend = list(orientation = "h", x = 0.3, y = -0.1),
paper_bgcolor = "white",
height = 400
)
}Interpretasi Turtles: 0 baris berhasil diambil dari iframe. Persentase data Complete: 0%. Kunci: detect src iframe -> buat URL absolut -> request baru ke URL tersebut.
2.6.5 Section E Analytical Thinking (Web Scraping)
2.7 Rekap Semua Dataset
datasets_mp2 <- list(Countries=df_countries, Hockey=df_hockey,
Oscar=df_oscar, Turtles=df_turtles)
cat("Distribusi data_status per dataset:\n")## Distribusi data_status per dataset:
for (nm in names(datasets_mp2)) {
df_tmp <- datasets_mp2[[nm]]
cat(sprintf("\n%s\n", strrep("-", 45)))
cat(sprintf("Dataset : %s\n", nm))
cat(sprintf("Baris : %d | Kolom: %d\n", nrow(df_tmp), ncol(df_tmp)))
cat(sprintf("Nama kolom: %s\n", paste(names(df_tmp), collapse=", ")))
cat("Missing values:\n"); print(colSums(is.na(df_tmp)))
cat(sprintf("Duplicate: %d\n", sum(duplicated(df_tmp))))
if ("data_status" %in% names(df_tmp)) print(table(df_tmp$data_status))
}##
## ---------------------------------------------
## Dataset : Countries
## Baris : 250 | Kolom: 6
## Nama kolom: country_name, capital, population, area, data_status, is_large
## Missing values:
## country_name capital population area data_status is_large
## 0 0 0 0 0 0
## Duplicate: 0
##
## Complete
## 250
##
## ---------------------------------------------
## Dataset : Hockey
## Baris : 582 | Kolom: 10
## Nama kolom: team_name, year, wins, losses, win_pct, goals_for, goals_against, data_status, performance, win_rate
## Missing values:
## team_name year wins losses win_pct
## 0 0 0 0 0
## goals_for goals_against data_status performance win_rate
## 0 0 0 0 0
## Duplicate: 0
##
## Complete
## 582
##
## ---------------------------------------------
## Dataset : Oscar
## Baris : 87 | Kolom: 6
## Nama kolom: movie_title, year, nominations, awards, data_status, category
## Missing values:
## movie_title year nominations awards data_status category
## 0 0 0 0 0 0
## Duplicate: 0
##
## Complete
## 87
##
## ---------------------------------------------
## Dataset : Turtles
## Baris : 0 | Kolom: 5
## Nama kolom: name, description, additional_info, source_iframe, data_status
## Missing values:
## name description additional_info source_iframe data_status
## 0 0 0 0 0
## Duplicate: 0
## < table of extent 0 >
rekap <- data.frame(
Website = c("Countries","Hockey Teams","Oscar Films","Turtles"),
Tipe = c("Static HTML","Pagination","AJAX/JSON","iFrames"),
Total_Baris = c(nrow(df_countries), nrow(df_hockey), nrow(df_oscar), nrow(df_turtles)),
Complete = c(sum(df_countries$data_status=="Complete"),
sum(df_hockey$data_status=="Complete"),
sum(df_oscar$data_status=="Complete"),
sum(df_turtles$data_status=="Complete")),
Kesulitan = c("Mudah","Sedang","Sulit","Cukup Sulit")
)
kable(rekap, caption = "Rekap Semua Dataset Web Scraping") |>
kable_styling(bootstrap_options = c("striped","hover")) |>
column_spec(1, bold=TRUE, color="#2d6a4f") |>
column_spec(3, bold=TRUE, color="#40916c")| Website | Tipe | Total_Baris | Complete | Kesulitan |
|---|---|---|---|---|
| Countries | Static HTML | 250 | 250 | Mudah |
| Hockey Teams | Pagination | 582 | 582 | Sedang |
| Oscar Films | AJAX/JSON | 87 | 87 | Sulit |
| Turtles | iFrames | 0 | 0 | Cukup Sulit |
plot_ly() |>
add_trace(
x = rekap$Website, y = rekap$Total_Baris,
type = "bar", name = "Total Baris",
marker = list(color = "#40916c", line = list(color = "white", width = 1.2)),
hovertemplate = "<b>%{x}</b><br>Total: %{y:,}<extra></extra>"
) |>
add_trace(
x = rekap$Website, y = rekap$Complete,
type = "bar", name = "Data Complete",
marker = list(color = "#b7e4c7", line = list(color = "white", width = 1.2)),
hovertemplate = "<b>%{x}</b><br>Complete: %{y:,}<extra></extra>"
) |>
layout(
barmode = "group",
title = list(text = "<b>Perbandingan Total Baris vs Data Complete per Website</b>",
font = list(size = 14, color = "#1b4332")),
xaxis = list(title = "Website"),
yaxis = list(title = "Jumlah Baris", gridcolor = "#f0faf3"),
legend = list(orientation = "h", x = 0.3, y = -0.15),
plot_bgcolor = "white", paper_bgcolor = "white",
height = 400
)2.8 Perbedaan Pendekatan
cat("
1. Website paling mudah di-scrape:
-> Countries of the World (Static HTML)
Alasan: HTML langsung tersedia saat halaman dimuat.
Cukup read_html() + html_nodes() tanpa pagination/AJAX.
2. Website paling sulit di-scrape:
-> Oscar Winning Films (AJAX / Javascript)
Alasan: Data dimuat secara dinamis oleh JS setelah halaman terbuka.
Perlu inspect Network tab untuk menemukan endpoint XHR,
kemudian parse response JSON dengan fromJSON().
3. Perbedaan Pendekatan:
STATIC : HTML langsung tersedia saat GET request
-> read_html() + html_nodes() (rvest)
-> Contoh: Countries of the World
PAGINATION : Data tersebar di banyak halaman
-> Loop + parameter ?page_num=N di URL
-> Hentikan jika tidak ada baris lagi
-> Contoh: Hockey Teams
AJAX : Data dimuat async via XHR setelah halaman render
-> Inspect Network -> temukan endpoint JSON
-> GET() ke endpoint + fromJSON() (httr + jsonlite)
-> Contoh: Oscar Winning Films
IFRAME : Konten ada di frame terpisah, bukan di HTML utama
-> html_attr('src') -> buat URL absolut -> request baru
-> Contoh: Turtles All the Way Down
")##
## 1. Website paling mudah di-scrape:
## -> Countries of the World (Static HTML)
## Alasan: HTML langsung tersedia saat halaman dimuat.
## Cukup read_html() + html_nodes() tanpa pagination/AJAX.
##
## 2. Website paling sulit di-scrape:
## -> Oscar Winning Films (AJAX / Javascript)
## Alasan: Data dimuat secara dinamis oleh JS setelah halaman terbuka.
## Perlu inspect Network tab untuk menemukan endpoint XHR,
## kemudian parse response JSON dengan fromJSON().
##
## 3. Perbedaan Pendekatan:
##
## STATIC : HTML langsung tersedia saat GET request
## -> read_html() + html_nodes() (rvest)
## -> Contoh: Countries of the World
##
## PAGINATION : Data tersebar di banyak halaman
## -> Loop + parameter ?page_num=N di URL
## -> Hentikan jika tidak ada baris lagi
## -> Contoh: Hockey Teams
##
## AJAX : Data dimuat async via XHR setelah halaman render
## -> Inspect Network -> temukan endpoint JSON
## -> GET() ke endpoint + fromJSON() (httr + jsonlite)
## -> Contoh: Oscar Winning Films
##
## IFRAME : Konten ada di frame terpisah, bukan di HTML utama
## -> html_attr('src') -> buat URL absolut -> request baru
## -> Contoh: Turtles All the Way Down
2.9 Insights & Rekomendasi
## === ANALYTICAL THINKING ===
## 1. Website paling mudah di-scrape:
## -> Countries of the World (Static HTML)
## Alasan: Struktur HTML sederhana, tidak ada pagination/AJAX/iframe.
## 2. Website paling sulit di-scrape:
## -> Oscar Winning Films (AJAX / Javascript)
## Alasan: Data dimuat dinamis oleh JavaScript, perlu inspect network
## untuk menemukan endpoint AJAX yang benar.
## 3. Perbedaan pendekatan:
## STATIC : HTML sudah tersedia saat halaman dimuat.
## Cukup gunakan rvest + html_nodes().
## PAGINATION : Data tersebar di banyak halaman.
## Gunakan loop + parameter URL (?page_num=N).
## Hentikan loop jika tidak ada data lagi.
## AJAX : Data dimuat setelah halaman terbuka via JavaScript.
## Inspect Network tab -> temukan XHR/Fetch request
## -> akses langsung endpoint JSON-nya.
## Response berupa JSON, parse dengan fromJSON().
## IFRAME : Konten berada di dalam frame terpisah dari HTML utama.
## html_attr('src') -> buat URL lengkap
## -> buat request baru ke URL iframe tersebut.
## --- INSIGHTS ---
## 1. Countries: 250 negara berhasil diambil dari halaman statis.
if (nrow(df_hockey) > 0) {
top_team <- df_hockey %>%
group_by(team_name) %>%
summarise(total = sum(wins, na.rm=TRUE), .groups="drop") %>%
arrange(desc(total)) %>% slice(1)
cat(sprintf("2. Hockey: Tim dengan total kemenangan terbanyak: %s (%d wins)\n",
top_team$team_name, top_team$total))
best_wr <- df_hockey[which.max(df_hockey$win_rate), ]
cat(sprintf("3. Hockey: Tim dengan win rate tertinggi: %s (%d) - %.2f%%\n",
best_wr$team_name, best_wr$year, best_wr$win_rate))
}## 2. Hockey: Tim dengan total kemenangan terbanyak: Detroit Red Wings (986 wins)
## 3. Hockey: Tim dengan win rate tertinggi: Detroit Red Wings (1995) - 82.67%
if (nrow(df_oscar) > 0) {
best_film <- df_oscar[which.max(df_oscar$awards), ]
cat(sprintf("4. Oscar: Film dengan award terbanyak: %s (%d) - %d awards\n",
best_film$movie_title, best_film$year, best_film$awards))
yr_count <- df_oscar %>% count(year, name="n") %>% arrange(year)
cat("5. Oscar: Jumlah film per tahun:\n")
print(setNames(yr_count$n, yr_count$year))
cat(sprintf("6. Oscar: Total film Complete: %d\n",
sum(df_oscar$data_status=="Complete")))
}## 4. Oscar: Film dengan award terbanyak: Gravity (2013) - 7 awards
## 5. Oscar: Jumlah film per tahun:
## 2010 2011 2012 2013 2014 2015
## 13 15 15 12 16 16
## 6. Oscar: Total film Complete: 87
if (nrow(df_turtles) > 0) {
cat(sprintf("7. Turtles: Total family kura-kura: %d family\n", nrow(df_turtles)))
pct <- round(sum(df_turtles$data_status=="Complete") / nrow(df_turtles) * 100, 1)
cat(sprintf("8. Turtles: Persentase data Complete: %.1f%%\n", pct))
}
cat("\n--- REKOMENDASI ---\n")##
## --- REKOMENDASI ---
## 1. Gunakan RSelenium / Playwright untuk website AJAX yang
## lebih kompleks dan tidak memiliki endpoint API terbuka.
## 2. Tambahkan tryCatch() dan retry mechanism untuk scraping yang
## lebih robust terhadap timeout dan error jaringan.
## 3. Simpan log error ke file terpisah agar mudah di-debug.
pct_complete <- data.frame(
Website = c("Countries","Hockey Teams","Oscar Films","Turtles"),
Pct_Complete = c(
round(sum(df_countries$data_status=="Complete") / max(nrow(df_countries),1) * 100, 1),
round(sum(df_hockey$data_status=="Complete") / max(nrow(df_hockey),1) * 100, 1),
round(sum(df_oscar$data_status=="Complete") / max(nrow(df_oscar),1) * 100, 1),
round(sum(df_turtles$data_status=="Complete") / max(nrow(df_turtles),1) * 100, 1)
),
Tipe = c("Static HTML","Pagination","AJAX/JSON","iFrames")
)
plot_ly(pct_complete,
y = ~reorder(Website, Pct_Complete), x = ~Pct_Complete,
type = "bar", orientation = "h",
marker = list(
color = colorRampPalette(c("#b7e4c7","#1b4332"))(4),
line = list(color = "white", width = 1.5)
),
text = ~paste0(Pct_Complete, "%"),
textposition = "outside",
hovertemplate = "<b>%{y}</b> (%{customdata})<br>Complete: %{x}%<extra></extra>",
customdata = ~Tipe) |>
layout(
title = list(text = "<b>Persentase Data Complete per Website Scraping</b>",
font = list(size = 14, color = "#1b4332")),
xaxis = list(title = "% Data Complete", range = c(0, 115), gridcolor = "#f0faf3"),
yaxis = list(title = ""),
plot_bgcolor = "white", paper_bgcolor = "white",
margin = list(r = 60),
height = 400
)Kesimpulan Web Scraping:
Website termudah:
Countries (static HTML, satu request).
Website
tersulit: Oscar (AJAX, perlu intercept XHR
endpoint).
Total data: 250 negara · 582 hockey · 87 film · 0
turtles
3 Ringkasan Akhir Keseluruhan
| No | Keterangan | Nilai |
|---|---|---|
| 1 | Mini Project 1 - Dataset E-Commerce (file dibaca) | 5 file (CSV, Excel, JSON, TXT, XML) |
| 2 | Mini Project 1 - Baris setelah deduplikasi | 1.966 |
| 3 | Mini Project 1 - Kolom + 3 kolom baru | 25 |
| 4 | Mini Project 1 - Platform dominan | Shopee - 20.9% |
| 5 | Mini Project 1 - Transaksi high-value | 755 |
| 6 | Mini Project 2 - Website di-scrape | 4 website (Countries, Hockey, Oscar, Turtles) |
| 7 | Mini Project 2 - Total negara | 250 negara |
| 8 | Mini Project 2 - Total records hockey | 582 records |
| 9 | Mini Project 2 - Total film Oscar | 87 film |
| 10 | Mini Project 2 - Total data turtles | 0 baris |