Midterm Examination — Data Science Programming I

Mini Project: E-Commerce Data Pipeline & Web Scraping Intelligence

✦ Data Science Programming I — Class B
Midterm Data Programming

E-Commerce Data Pipeline & Web Scraping Intelligence

Andre
Andre
52250065
Refantanur Husnul Haqib
Refantanur Husnul Haqib
52250052
Naila Syahrani Putri
Naila Syahrani Putri
52250070
🎓 Data Science Program
📅 2026-04-26
📊 2 Mini Projects

1 Mini Project 1 — E-Commerce Data Pipeline

About Mini Project 1 — E-Commerce Data Pipeline

This mini project simulates a real-world data engineering scenario: integrating e-commerce transaction data from 5 different file formats (CSV, Excel, JSON, TXT, XML) into a single clean, analysis-ready dataset. The pipeline follows production-grade principles — each stage is documented, reproducible, and designed to handle edge cases gracefully. The final output is a structured dataset with business-driven feature engineering ready for executive reporting, BI dashboards, or predictive modeling.

1.1 Section A — Data Collection

Section A — Automated Multi-Format Data Ingestion

The primary objective is to build a fully automated, format-agnostic data ingestion pipeline. Rather than writing separate code for each file, a single loop structure handles all formats through condition-based dispatch — reflecting the principles of Clean Code and DRY (Don’t Repeat Yourself). Every file is verified for availability before reading, then merged into a unified dataset with provenance metadata enabling a full audit trail back to the original source of each row.

1.1.1 A.1 — File Inventory and Availability Check

1.1.2 A.2 — Custom XML Parser Function

baca_xml() defined and ready.

1.1.3 A.3 — REQUIRED LOOP: Read All Files

  [OK] Midterm_Exam_Classroom.csv             | 2000 rows | 22 cols
  [OK] Midterm_Exam_Classroom.xlsx            | 2000 rows | 22 cols
  [OK] Midterm_Exam_Classroom.json            | 2000 rows | 22 cols
  [OK] Midterm_Exam_Classroom.txt             | 2000 rows | 1 cols
  [OK] Midterm_Exam_Classroom.xml             | 2000 rows | 22 cols

5 of 5 files successfully read.

1.1.4 A.4 — Merge and Export to Excel

  Merged: Midterm_Exam_Classroom.csv             (2000 rows)
  Merged: Midterm_Exam_Classroom.xlsx            (2000 rows)
  Merged: Midterm_Exam_Classroom.json            (2000 rows)
  Merged: Midterm_Exam_Classroom.txt             (2000 rows)
  Merged: Midterm_Exam_Classroom.xml             (2000 rows)

Total Rows   : 10,000
Total Columns: 24
All columns cast to character — numeric conversion occurs in Section C.

[EXCEL] Saved: hasil_gabungan.xlsx

1.2 Section B — Data Handling

Section B — Comprehensive Data Quality Audit

Before any intervention, the state of the data must be thoroughly and objectively understood — much like a medical diagnosis before treatment. Section B executes a three-dimensional audit: (1) Data type profiling to verify alignment between storage format and column semantics; (2) Missing value analysis measuring not just counts but patterns that may indicate systemic upstream issues; and (3) Duplicate detection separating technical redundancy from genuine business duplication. The audit findings form the scientific basis for all cleaning strategies in Section C.

1.2.1 B.1 — Column Profile

1.2.2 B.2 — Missing Values

Total missing values : 54682
Columns with missing : 23 of 24

1.2.3 B.3 — Duplicate Rows

Duplicate rows  : 45
Percentage      : 0.45%

Sample duplicate rows (first 5):

1.2.4 B.4 — Data Quality Issues

1.3 Section C — Data Cleaning

Section C — Systematic and Auditable Data Cleaning

Data cleaning is not merely tidying — it is an analytical decision-making process that must be scientifically defensible. Every transformation in Section C follows informed cleaning principles: the rationale for each strategy (why median over mean? why fill with “Unknown” rather than drop rows?) is explicitly grounded in data distribution characteristics and business context. The sequence of operations is intentional — standardization precedes imputation, and deduplication occurs last to ensure no clean rows are inadvertently removed. Each step is quantitatively validated by comparing before-and-after states.

1.3.1 C.0 — Create Working Copy

Working copy created: 10,000 rows x 24 columns

1.3.2 C.1 — Platform Standardization

Platform distribution after standardization:

     Blibli      Lazada      Shopee Tiktok Shop   Tokopedia     Unknown 
       1616        1552        1664        1624        1544        2000 

1.3.3 C.2 — Net Sales Cleaning

Min  : Rp 0
Max  : Rp 18,080,224
Mean : Rp 1,101,874
Remaining NA: 0

1.3.4 C.3 — Missing Value Imputation

Missing payment_method after imputation: 0
Missing net_sales after imputation     : 0

1.3.5 C.4 — Order Status Standardization

Order status distribution after standardization:

       Batal     Delivered   On Delivery         Retur      Returned  
           12            50             4             4             6 
        Batal     Cancelled     Completed     Delivered   On Delivery 
          120           476          1600          4670           188 
        Retur      Returned       Shipped       Unknown 
          168           294           408          2000 

1.3.6 C.5 — REQUIRED LOOP: Text Column Cleaning

  [OK] platform             | missing: 0 -> 0
  [OK] category             | missing: 2000 -> 0
  [OK] product_name         | missing: 2000 -> 0
  [OK] order_status         | missing: 0 -> 0
  [OK] payment_method       | missing: 0 -> 0

1.3.7 C.6 — Deduplication and Summary

Duplicates removed : 45 rows
Clean dataset      : 9,955 rows
Total columns      : 24

1.4 Section D — Conditional Logic

Section D — Business-Driven Feature Engineering

Feature engineering bridges raw data and actionable business intelligence. The three derived columns added here are not merely technical exercises — each represents business logic actively used in e-commerce operations: is_high_value for premium customer segmentation and loyalty programs, order_priority for fulfillment resource allocation and SLA definition, and valid_transaction for separating confirmed revenue from at-risk pipeline. The enriched dataset carries significantly higher analytical value and can be consumed directly by Business Intelligence teams without further preprocessing.

New Column Logic Type Business Use
is_high_value net_sales > 1,000,000 Binary Premium customer identification
order_priority 3-tier based on net_sales Multi-class SLA and resource allocation
valid_transaction Based on order_status Binary Revenue forecasting

1.4.1 D.1 — is_high_value Column

Distribution of is_high_value:

  No  Yes 
6871 3084 

1.4.2 D.2 — order_priority Column (NESTED IF)

Distribution of order_priority:

  High    Low Medium 
  3084   5203   1668 

1.4.3 D.3 — valid_transaction Column

Distribution of valid_transaction:

Invalid   Valid 
    472    9483 

1.4.4 D.4 — Display Results and Save

[CSV]   Saved: dataset_final_bersih.csv
[EXCEL] Saved: dataset_final_bersih.xlsx
<div style="margin:16px 0 24px; display:flex; gap:14px; flex-wrap:wrap;">
  <a href="dataset_final_bersih.csv" download
     style="display:inline-flex;align-items:center;gap:8px;
            background:linear-gradient(135deg,#0ea5e9,#6366f1);
            color:#fff;font-weight:700;font-size:14px;
            padding:12px 28px;border-radius:50px;text-decoration:none;
            box-shadow:0 4px 16px rgba(99,102,241,0.35);">
    Download CSV
  </a>
  <a href="dataset_final_bersih.xlsx" download
     style="display:inline-flex;align-items:center;gap:8px;
            background:linear-gradient(135deg,#10b981,#059669);
            color:#fff;font-weight:700;font-size:14px;
            padding:12px 28px;border-radius:50px;text-decoration:none;
            box-shadow:0 4px 16px rgba(16,185,129,0.35);">
    Download Excel
  </a>
</div>

1.5 Section E — Analytical Thinking

Section E — Analytical Thinking and Strategic Decision Intelligence

This is the most consequential stage of the entire pipeline: converting numbers into actionable business narratives. Analysis transcends descriptive statistics when every finding is connected to operational implications and executable recommendations. Four analytical dimensions are explored: platform dominance (for marketing budget allocation), category concentration (for product portfolio risk management), transaction pipeline health (for customer success KPIs), and an executive summary grounded in data.

1.5.1 E.1 — Platform Dominance

Strategic Insight — Platform Dominance and Resource Allocation

Unknown leads with 1,991 transactions (20% market share). This concentration reflects three mutually reinforcing mechanisms: network effects (a mature ecosystem of buyers and sellers creates gravitational pull), high switching costs for sellers who have built reputation and ratings on the platform, and accumulated trust capital through years of reliable post-sale service and dispute resolution.

The gap with second-ranked platforms justifies an asymmetric, data-driven investment strategy: allocate 60–70% of acquisition and retention budget to the dominant platform where ROI is measurable and predictable. Remaining budget diversifies across other platforms to mitigate single-platform dependency risk. Monitor quarterly trends and prepare rebalancing scenarios if the dominant platform’s share erodes by more than 5 percentage points.

1.5.2 E.2 — Category Distribution

Strategic Insight — Category Concentration and Portfolio Risk

Category “Unknown” dominates at 20% of total transactions. From an operational efficiency perspective, this concentration enables supply specialization — deeper vendor relationships, more favorable volume contracts, and more accurate demand forecasting. From a portfolio risk perspective, concentration above 35–40% in a single category creates vulnerability to seasonal demand shifts and supply chain disruptions. The tactical recommendation: drive 15–20% quarterly growth in two adjacent categories with the highest demographic overlap, while defending leadership in Unknown through differentiation strategies that are difficult to replicate.

1.5.3 E.3 — Transaction Status Distribution

Insight — Transaction Pipeline Health

16% of transactions reached Completed status — the true conversion rate of the pipeline. The cancellation rate of 4.7% falls within an acceptable range but warrants ongoing monitoring. Every 1% reduction in cancel rate directly translates to measurable revenue recovery.

Impact formula: 1% reduction in cancel rate × average transaction value = potential revenue recovery per period. This should become a shared KPI for both the customer success and fulfillment teams.

1.5.4 E.4 — Executive Summary

<div class="val">9,955</div>
<div class="lbl">Total Transactions</div>
<div class="val">Rp 11B</div>
<div class="lbl">Gross Revenue</div>
<div class="val">31%</div>
<div class="lbl">High Value Transactions</div>
<div class="val">4.7%</div>
<div class="lbl">Cancel Rate</div>
<div class="val">Rp 1,102,849</div>
<div class="lbl">Avg. per Transaction</div>
<div class="val">Rp 0M</div>
<div class="lbl">Revenue Lost (Cancel)</div>
Strategic Recommendations — 4 Priority Actions

1. Strengthen position on Unknown — Implement platform-exclusive programs: scheduled flash sales timed to peak traffic windows, A/B-tested product listing thumbnails, and bundling features to increase average order value. Target: maintain >50% market share.

2. High-value loyalty program (31% of transactions, disproportionate revenue contribution) — Provide early-sale access, free premium shipping, and a dedicated customer service channel. High-LTV customers are the most valuable assets to protect from churn.

3. Cancel rate reduction mission from 4.7% — Audit the top 3 cancellation reasons via exit surveys, implement proactive status-change notifications, and strengthen delivery SLAs with express shipping options. Target: reduce by 2–3 percentage points within two quarters.

4. Standardize data input processes — Implement real-time validation at the point of entry (frontend validation + backend sanitization) to prevent the format inconsistencies identified during cleaning. Data quality at the source is far less costly than remediation downstream.


2 Mini Project 2 — Web Scraping Intelligence

About Mini Project 2 — Automated Web Data Collection Engineering

Web scraping is a distinct discipline within the data engineering ecosystem: the ability to extract structured data from web interfaces designed for humans, not machines. This project implements 4 distinct scraping techniques from ScrapeThisSite.com — each selected not by preference, but because the way each website delivers its data demands that specific approach. Core competency developed: the ability to diagnose a website’s data delivery architecture (static/dynamic/paginated/API-driven/framed) and select the optimal extraction strategy — a skill of high commercial value since much critical data is accessible only through web interfaces.

# Website Technique Complexity Data
1 Countries of the World Static HTML Low Country, population, area
2 Hockey Teams Pagination Medium NHL teams, wins, losses, win rate
3 Oscar Winning Films AJAX to JSON High Films, year, nominations, awards
4 Turtles All the Way Down iFrame Medium-High Turtle families

2.1 Section A — Data Collection

Section A — Implementation of 4 Distinct Scraping Techniques

Each scraping technique implemented here reflects a fundamentally different engineering challenge. Static HTML trains CSS selector skills and DOM traversal. Pagination requires state management across requests and reliable termination condition detection. AJAX/JSON scraping builds understanding of browser-server communication architecture through network traffic reverse engineering. iFrame scraping demands analysis of layered document structures not visible from the primary source. Mastery of all four techniques covers the majority of web scraping scenarios encountered in industry practice.

2.1.1 A.1 — Countries (Static HTML)

Scraping: Countries of the World (Static HTML)...
URL: https://www.scrapethissite.com/pages/simple/
HTTP Status: 200
Success! 250 countries found
DataFrame: 250 rows x 4 cols | Saved: scraping_countries.csv

2.1.2 A.2 — Hockey Teams (Pagination)

Scraping: Hockey Teams (Pagination)...
  Page 1: 25 teams
  Page 2: 25 teams
  Page 3: 25 teams
  Page 4: 25 teams
  Page 5: 25 teams
  Page 6: 25 teams
  Page 7: 25 teams
  Page 8: 25 teams
  Page 9: 25 teams
  Page 10: 25 teams
  Page 11: 25 teams
  Page 12: 25 teams
  Page 13: 25 teams
  Page 14: 25 teams
  Page 15: 25 teams
  Page 16: 25 teams
  Page 17: 25 teams
  Page 18: 25 teams
  Page 19: 25 teams
  Page 20: 25 teams
  Page 21: 25 teams
  Page 22: 25 teams
  Page 23: 25 teams
  Page 24: 7 teams
Complete! Total: 582 teams
DataFrame: 582 rows x 5 cols | Saved: scraping_hockey.csv

2.1.3 A.3 — Oscar Films (AJAX to JSON)

Scraping: Oscar Winning Films (AJAX -> JSON per year)...
Years: 2010, 2011, 2012, 2013, 2014, 2015
  [2010] Success — 13 films
  [2011] Success — 15 films
  [2012] Success — 15 films
  [2013] Success — 12 films
  [2014] Success — 16 films
  [2015] Success — 16 films

Total: 87 films from 6 years
DataFrame: 87 rows x 5 cols | Saved: scraping_oscar.csv

2.1.4 A.4 — Turtles (iFrame)

Scraping: Turtles All the Way Down...
Primary URL: https://www.scrapethissite.com/pages/frames/?frame=i
HTTP Status (main): 200
14 families found:
  + Carettochelyidae
  + Cheloniidae
  + Chelydridae
  + Dermatemydidae
  + Dermochelyidae
  + Emydidae
  + Geoemydidae
  + Kinosternidae
  + Platysternidae
  + Testudinidae
  + Trionychidae
  + Chelidae
  + Pelomedusidae
  + Podocnemididae

14 families identified.
family_name            common_name                    year     discovered_by
-------------------------------------------------------------------------------- 
  Carettochelyidae       Pig-nosed turtle               1887     Boulenger.
  Cheloniidae            Sea turtles                    1811     Oppel.
  Chelydridae            Snapping turtles               1831     Gray.
  Dermatemydidae         Central American river turtle  1870     Gray.
  Dermochelyidae         Leatherback sea turtle         1843     Fitzinger.
  Emydidae               Pond or water turtles          1815     Rafinesque.
  Geoemydidae            Asian river, leaf, roofed or Asian box turtles 1868     Theobald.
  Kinosternidae          Mud or musk turtles            1857     Agassiz.
  Platysternidae         Big-headed turtle              1869     Gray.
  Testudinidae           Tortoises                      Unknown  Unknown
  Trionychidae           Softshell turtles              1826     Fitzinger.
  Chelidae               Austro-American sideneck turtles 1831     Gray.
  Pelomedusidae          Afro-American sideneck turtles 1868     Cope.
  Podocnemididae         Madagascar big-headed, Big-headed Amazon River turtle and South American sideneck river turtles 1869     Gray.

DataFrame: 14 rows x 4 cols | Saved: scraping_turtles.csv

2.2 Section B — Data Handling

Section B — Quality Audit of 4 Scraped Datasets

All four scraped datasets are audited centrally using a single loop — one function, one iteration, four datasets simultaneously. This reflects the DRY principle: when the same analytical pattern applies to N datasets, use a loop rather than copy-pasting code N times.

2.3 Section C — Data Cleaning

Section C — Systematic Cleaning Pipeline for 4 Datasets

Each dataset is cleaned systematically: text standardization (title case + whitespace trimming), appropriate numeric type conversion per column, context-driven missing value handling, and deduplication. Despite differing schemas, a consistent cleaning pattern is applied through function abstraction.

[Countries] Clean: 250 rows (duplicates removed: 0)
[Hockey] Clean: 582 rows (duplicates removed: 0)
[Oscar] Clean: 87 rows (duplicates removed: 0)
  [OK] family_name — trimmed and filled
  [OK] common_name — trimmed and filled
  [OK] year_discovered — trimmed and filled
  [OK] discovered_by — trimmed and filled
  Duplicates removed: 0
[Turtles] Clean: 14 rows x 4 cols

All clean datasets saved successfully.

2.4 Section D — Conditional Logic

Section D — Data Quality Flagging via Nested IF

A data_status column is added to each scraped dataset as a per-row completeness indicator. This is a form of simple data quality scoring: each row is evaluated against dataset-specific completeness criteria rather than a generic universal standard.

2.5 Section E — Analytical Thinking

Section E — Evaluation of Scraping Techniques and Substantive Insights

Two analytical layers are executed here. The first is meta-analytic: a comparative evaluation of the four scraping techniques across implementation complexity, fragility to website changes, operational risk (rate limiting, IP blocking), and optimal use cases — building a decision framework for future technique selection. The second is substantive: extracting meaningful insights from the collected data, demonstrating that the scraping process has generated knowledge that was not previously available.

2.5.1 E.1 — Comparison of 4 Scraping Techniques

Key Insight — Technique Selection is an Engineering Requirement, Not a Preference

Static HTML scraping is the most reliable technique: no JavaScript dependency, no race conditions, and a single HTTP request suffices. It should always be attempted first — success eliminates the need for more complex approaches, and failure is a strong signal that the target site uses dynamic rendering. For AJAX endpoints, the primary risk is that hidden endpoints discovered via DevTools are undocumented and can change without notice, causing scrapers to fail silently. Mitigation requires periodic monitoring and robust error alerting in production pipelines.

2.5.2 E.2 — Insights from Scraped Data

<div class="val">250</div>
<div class="lbl">Countries</div>
<div class="val">582</div>
<div class="lbl">NHL Hockey Teams</div>
<div class="val">87</div>
<div class="lbl">Oscar Films (2010-15)</div>
<div class="val">14</div>
<div class="lbl">Turtle Families</div>
<div class="val">China</div>
<div class="lbl">Most Populous Country</div>
<div class="val">Detroit Red Wings</div>
<div class="lbl">Highest Win Rate Team</div>
Key Findings Across 4 Scraped Datasets

[Countries — Geography and Demography] China is the most populous country at 1,330,044,000 people, while Russia holds the largest land area. The fact that the most populous country is not the largest by area reflects a highly uneven global population distribution shaped by historical, climatic, and economic factors — with direct implications for infrastructure burden, market opportunity, and environmental pressure.

[NHL Hockey — Team Performance] Detroit Red Wings recorded the highest win rate at 75.6%, significantly above the league average of 45.9%. In sports analytics, win rate is an efficiency metric — high win-rate teams demonstrate consistency, sound tactical systems, and roster depth, making it a more meaningful benchmark than raw total wins which can be skewed by games played.

[Oscar Films 2010-2015 — Film Industry]Gravity” led with 7 wins. With an average of 3.9 nominations per film but only 6.9% earning Best Picture, the data reveals the Academy’s high selectivity: many nominations do not guarantee a win, reflecting the intense competition and inherent subjectivity of creative work evaluation.

[Turtles — Biodiversity] 14 turtle families were catalogued — a structured reference relevant to reptile conservation research. Scraping biodiversity data from authoritative sources can support species protection programs and public education initiatives.


3 References



---
title: "Midterm Examination — Data Science Programming I"
subtitle: "Mini Project: E-Commerce Data Pipeline & Web Scraping Intelligence"
author: 
- "Andre — NIM 52250065"
- "Refantanur Husnul Haqib — NIM 52250052"
- "Naila Syahrani Putri — NIM 52250070"
date: "`r Sys.Date()`"

output:
  rmdformats::robobook:
    highlight: kate
    self_contained: true
    thumbnails: true
    lightbox: true
    gallery: true
    number_sections: true
    df_print: "default"
    code_folding: "show"
    code_download: true
    toc_depth: 3
    use_bookdown: true
    css: dashboard_style.css
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo    = FALSE,
  warning = FALSE,
  message = FALSE,
  comment = ""
)
options(htmltools.preserve.raw = TRUE)
```

```{r load-libraries, include=FALSE}
library(tidyverse)
library(readxl)
library(jsonlite)
library(xml2)
library(openxlsx)
library(knitr)
library(kableExtra)
library(DT)
library(httr)
library(rvest)
library(htmltools)
```

```{css, echo=FALSE}
/* ═══════════════════════════════════════════════════════════
   GLOBAL STYLES — COVER, CARDS, BOXES
════════════════════════════════════════════════════════════ */

/* Hero Cover */
.hero-cover {
  background: linear-gradient(135deg, #0d1b2a 0%, #1b3a5c 50%, #0d1b2a 100%);
  border-radius: 18px;
  padding: 48px 32px 40px;
  margin: 12px 0 32px;
  text-align: center;
  position: relative;
  overflow: hidden;
  box-shadow: 0 20px 60px rgba(0,0,0,0.4);
}
.hero-cover::before {
  content: '';
  position: absolute;
  top: -60px; left: -60px;
  width: 220px; height: 220px;
  background: radial-gradient(circle, rgba(56,189,248,0.18) 0%, transparent 70%);
  border-radius: 50%;
}
.hero-cover::after {
  content: '';
  position: absolute;
  bottom: -50px; right: -50px;
  width: 200px; height: 200px;
  background: radial-gradient(circle, rgba(99,102,241,0.18) 0%, transparent 70%);
  border-radius: 50%;
}
.hero-badge {
  display: inline-block;
  background: rgba(56,189,248,0.15);
  border: 1px solid rgba(56,189,248,0.4);
  color: #38bdf8;
  font-size: 11px;
  font-weight: 700;
  letter-spacing: 2.5px;
  text-transform: uppercase;
  padding: 6px 18px;
  border-radius: 50px;
  margin-bottom: 20px;
}
.hero-title {
  font-size: 2.6em;
  font-weight: 900;
  color: #ffffff;
  line-height: 1.15;
  margin: 0 0 10px;
  letter-spacing: -0.5px;
  text-shadow: 0 2px 20px rgba(56,189,248,0.3);
}
.hero-title span {
  background: linear-gradient(90deg, #38bdf8, #818cf8);
  -webkit-background-clip: text;
  -webkit-text-fill-color: transparent;
  background-clip: text;
}
.hero-subtitle {
  font-size: 1em;
  color: rgba(255,255,255,0.65);
  margin: 0 0 32px;
  font-style: italic;
  letter-spacing: 0.3px;
}
.hero-divider {
  width: 60px;
  height: 3px;
  background: linear-gradient(90deg, #38bdf8, #818cf8);
  border-radius: 4px;
  margin: 0 auto 32px;
}

/* Team Cards */
.team-grid {
  display: flex;
  flex-wrap: wrap;
  gap: 22px;
  justify-content: center;
  position: relative;
  z-index: 1;
}
.team-card {
  background: rgba(255,255,255,0.06);
  border: 1px solid rgba(255,255,255,0.12);
  border-radius: 16px;
  padding: 20px 18px 18px;
  width: 170px;
  text-align: center;
  backdrop-filter: blur(10px);
  transition: transform 0.25s ease, box-shadow 0.25s ease;
  box-shadow: 0 4px 24px rgba(0,0,0,0.25);
}
.team-card:hover {
  transform: translateY(-6px);
  box-shadow: 0 16px 40px rgba(56,189,248,0.2);
  border-color: rgba(56,189,248,0.4);
}
.frame-photo {
  position: relative;
  width: 110px;
  height: 135px;
  margin: 0 auto 14px;
}
.frame-photo::before {
  content: '';
  position: absolute;
  inset: -4px;
  background: linear-gradient(135deg, #38bdf8, #818cf8, #f472b6);
  border-radius: 14px;
  z-index: 0;
  animation: borderSpin 4s linear infinite;
}
@keyframes borderSpin {
  0%   { background: linear-gradient(135deg, #38bdf8, #818cf8, #f472b6); }
  33%  { background: linear-gradient(225deg, #818cf8, #f472b6, #38bdf8); }
  66%  { background: linear-gradient(315deg, #f472b6, #38bdf8, #818cf8); }
  100% { background: linear-gradient(135deg, #38bdf8, #818cf8, #f472b6); }
}
.frame-photo::after {
  content: '';
  position: absolute;
  inset: -4px;
  background: linear-gradient(135deg, #38bdf8, #818cf8, #f472b6);
  border-radius: 14px;
  z-index: 0;
  filter: blur(8px);
  opacity: 0.5;
}
.frame-photo img {
  position: relative;
  width: 110px;
  height: 135px;
  object-fit: cover;
  object-position: top center;
  border-radius: 11px;
  display: block;
  z-index: 1;
  border: 3px solid #0d1b2a;
}
.team-name {
  font-size: 13px;
  font-weight: 700;
  color: #f0f9ff;
  margin: 0 0 4px;
  line-height: 1.3;
}
.team-nim {
  font-size: 11px;
  color: #38bdf8;
  font-weight: 600;
  letter-spacing: 0.5px;
  background: rgba(56,189,248,0.1);
  padding: 2px 10px;
  border-radius: 20px;
  display: inline-block;
}

/* Hero Meta */
.hero-meta {
  display: flex;
  gap: 20px;
  justify-content: center;
  flex-wrap: wrap;
  margin-top: 28px;
  position: relative;
  z-index: 1;
}
.meta-chip {
  background: rgba(255,255,255,0.07);
  border: 1px solid rgba(255,255,255,0.15);
  color: rgba(255,255,255,0.75);
  font-size: 12px;
  padding: 7px 16px;
  border-radius: 50px;
  display: flex;
  align-items: center;
  gap: 6px;
}
.meta-chip strong { color: #fff; }

/* Info Box */
.info-box {
  background: linear-gradient(135deg, rgba(14,165,233,0.08), rgba(99,102,241,0.08));
  border: 1px solid rgba(56,189,248,0.25);
  border-left: 4px solid #38bdf8;
  border-radius: 10px;
  padding: 18px 20px;
  margin: 18px 0;
}
.info-box .box-title {
  font-weight: 700;
  color: #38bdf8;
  font-size: 0.88em;
  letter-spacing: 0.5px;
  text-transform: uppercase;
  margin-bottom: 10px;
}

/* Insight Box */
.insight-box {
  background: linear-gradient(135deg, rgba(16,185,129,0.08), rgba(5,150,105,0.06));
  border: 1px solid rgba(16,185,129,0.25);
  border-left: 4px solid #10b981;
  border-radius: 10px;
  padding: 18px 20px;
  margin: 18px 0;
}
.insight-box .box-title {
  font-weight: 700;
  color: #10b981;
  font-size: 0.88em;
  letter-spacing: 0.5px;
  text-transform: uppercase;
  margin-bottom: 10px;
}

/* Warning Box */
.warning-box {
  background: linear-gradient(135deg, rgba(245,158,11,0.08), rgba(217,119,6,0.06));
  border: 1px solid rgba(245,158,11,0.3);
  border-left: 4px solid #f59e0b;
  border-radius: 10px;
  padding: 18px 20px;
  margin: 18px 0;
}
.warning-box .box-title {
  font-weight: 700;
  color: #f59e0b;
  font-size: 0.88em;
  letter-spacing: 0.5px;
  text-transform: uppercase;
  margin-bottom: 10px;
}

/* Stat Grid */
.stat-grid {
  display: flex;
  flex-wrap: wrap;
  gap: 16px;
  margin: 20px 0;
}
.stat-card {
  flex: 1;
  min-width: 140px;
  border-radius: 12px;
  padding: 18px 14px;
  text-align: center;
  border: 1px solid rgba(255,255,255,0.08);
}
.stat-card.blue   { background: linear-gradient(135deg,rgba(59,123,255,0.15),rgba(59,123,255,0.06)); border-color:rgba(59,123,255,0.3); }
.stat-card.teal   { background: linear-gradient(135deg,rgba(20,184,166,0.15),rgba(20,184,166,0.06)); border-color:rgba(20,184,166,0.3); }
.stat-card.emerald{ background: linear-gradient(135deg,rgba(16,185,129,0.15),rgba(16,185,129,0.06)); border-color:rgba(16,185,129,0.3); }
.stat-card.purple { background: linear-gradient(135deg,rgba(139,92,246,0.15),rgba(139,92,246,0.06)); border-color:rgba(139,92,246,0.3); }
.stat-card.amber  { background: linear-gradient(135deg,rgba(245,158,11,0.15),rgba(245,158,11,0.06)); border-color:rgba(245,158,11,0.3); }
.stat-card.rose   { background: linear-gradient(135deg,rgba(244,63,94,0.15),rgba(244,63,94,0.06));  border-color:rgba(244,63,94,0.3); }
.stat-card .val { font-size: 1.7em; font-weight: 800; line-height:1.1; margin-bottom:6px; }
.stat-card .lbl { font-size: 11px; opacity: 0.7; text-transform: uppercase; letter-spacing: 0.8px; }

/* Footer */
.footer-note {
  text-align: center;
  padding: 16px;
  margin-top: 32px;
  font-size: 0.85em;
  opacity: 0.6;
  border-top: 1px solid rgba(255,255,255,0.1);
}
```

```{r cover-html, echo=FALSE, results='asis'}
html_cover <- paste0(
'<div class="hero-cover">',
'<div class="hero-badge">&#10022; Data Science Programming I &mdash; Class B</div>',
'<div class="hero-title">Midterm <span>Data Programming</span></div>',
'<p class="hero-subtitle">E-Commerce Data Pipeline &amp; Web Scraping Intelligence</p>',
'<div class="hero-divider"></div>',
'<div class="team-grid">',

  '<div class="team-card">',
  '<div class="frame-photo">',
  '<img src="Foto1.jpeg" alt="Andre" />',
  '</div>',
  '<div class="team-name">Andre</div>',
  '<div class="team-nim">52250065</div>',
  '</div>',

  '<div class="team-card">',
  '<div class="frame-photo">',
  '<img src="Foto2.jpeg" alt="Refantanur Husnul Haqib" />',
  '</div>',
  '<div class="team-name">Refantanur Husnul Haqib</div>',
  '<div class="team-nim">52250052</div>',
  '</div>',

  '<div class="team-card">',
  '<div class="frame-photo">',
  '<img src="Foto3.jpeg" alt="Naila Syahrani Putri" />',
  '</div>',
  '<div class="team-name">Naila Syahrani Putri</div>',
  '<div class="team-nim">52250070</div>',
  '</div>',

'</div>',
'<div class="hero-meta">',
'<div class="meta-chip">&#127891; <strong>Data Science Program</strong></div>',
'<div class="meta-chip">&#128197; <strong>2026-04-26</strong></div>',
'<div class="meta-chip">&#128202; <strong>2 Mini Projects</strong></div>',
'</div>',
'</div>'
)
cat(html_cover)
```

# Mini Project 1 — E-Commerce Data Pipeline {#mini-project-1}

<div class="info-box">
  <div class="box-title">About Mini Project 1 — E-Commerce Data Pipeline</div>
  <p>This mini project simulates a real-world data engineering scenario: integrating e-commerce transaction data from <strong>5 different file formats</strong> (CSV, Excel, JSON, TXT, XML) into a single clean, analysis-ready dataset. The pipeline follows production-grade principles — each stage is documented, reproducible, and designed to handle edge cases gracefully. The final output is a structured dataset with business-driven feature engineering ready for executive reporting, BI dashboards, or predictive modeling.</p>
</div>

## Section A — Data Collection

<div class="info-box">
  <div class="box-title">Section A — Automated Multi-Format Data Ingestion</div>
  <p>The primary objective is to build a fully automated, format-agnostic data ingestion pipeline. Rather than writing separate code for each file, a single loop structure handles all formats through condition-based dispatch — reflecting the principles of Clean Code and DRY (Don't Repeat Yourself). Every file is verified for availability before reading, then merged into a unified dataset with provenance metadata enabling a full audit trail back to the original source of each row.</p>
</div>

### A.1 — File Inventory and Availability Check

```{r a1-daftar-file}
file_list <- list(
  list(nama = "Midterm_Exam_Classroom.csv",  format = "CSV"),
  list(nama = "Midterm_Exam_Classroom.xlsx", format = "Excel"),
  list(nama = "Midterm_Exam_Classroom.json", format = "JSON"),
  list(nama = "Midterm_Exam_Classroom.txt",  format = "TXT"),
  list(nama = "Midterm_Exam_Classroom.xml",  format = "XML")
)

status_df <- data.frame(
  No     = seq_along(file_list),
  File   = sapply(file_list, function(f) f$nama),
  Format = sapply(file_list, function(f) f$format),
  Status = sapply(file_list, function(f) {
    if (file.exists(f$nama)) "Found" else "Not Found"  # REQUIRED IF
  }),
  stringsAsFactors = FALSE
)

datatable(status_df,
          caption     = "File Availability — Dataset Source Files",
          extensions  = 'Buttons',
          options     = list(dom='Bfrtip', pageLength=10, scrollX=TRUE, buttons=list('csv','excel','print')),
          rownames    = FALSE) %>%
  formatStyle("Status",
    color = styleEqual(c("Found","Not Found"), c("#10b981","#f43f5e")),
    fontWeight = "bold")
```

### A.2 — Custom XML Parser Function

```{r a2-fungsi-xml}
baca_xml <- function(path) {
  doc   <- read_xml(path)
  nodes <- xml_children(doc)

  # REQUIRED LOOP: parse each XML node into a data row
  rows <- lapply(nodes, function(node) {
    anak  <- xml_children(node)
    vals  <- xml_text(anak)
    names(vals) <- xml_name(anak)
    as.list(vals)
  })
  bind_rows(rows)
}

cat("baca_xml() defined and ready.\n")
```

### A.3 — REQUIRED LOOP: Read All Files

```{r a3-baca-semua-file}
dataframes <- list()

for (f in file_list) {                        # REQUIRED LOOP
  nama <- f$nama; fmt <- f$format

  if (!file.exists(nama)) {                   # REQUIRED IF
    cat(sprintf("  [SKIP] %s not found\n", nama))
    next
  }

  df <- tryCatch({
    if      (fmt == "CSV")   read_csv(nama, show_col_types = FALSE)
    else if (fmt == "Excel") read_excel(nama)                        # REQUIRED IF-ELSE
    else if (fmt == "JSON")  as_tibble(fromJSON(nama, flatten = TRUE))
    else if (fmt == "TXT")   read_tsv(nama, show_col_types = FALSE)
    else if (fmt == "XML")   baca_xml(nama)
    else                     NULL
  }, error = function(e) { cat(sprintf("  [ERROR] %s\n", e$message)); NULL })

  if (!is.null(df)) {                         # REQUIRED IF
    cat(sprintf("  [OK] %-38s | %d rows | %d cols\n",
                nama, nrow(df), ncol(df)))
    dataframes[[nama]] <- df
  }
}

cat(sprintf("\n%d of %d files successfully read.\n",
            length(dataframes), length(file_list)))
```

### A.4 — Merge and Export to Excel

```{r a4-gabungkan}
if (length(dataframes) == 0) stop("No files were successfully read!")

# REQUIRED LOOP: type-harmonize and tag each source file
df_list_tagged <- lapply(names(dataframes), function(nama) {
  df_temp <- dataframes[[nama]]
  for (col in names(df_temp)) df_temp[[col]] <- as.character(df_temp[[col]])  # REQUIRED LOOP
  df_temp$source_file <- basename(nama)
  cat(sprintf("  Merged: %-38s (%d rows)\n", nama, nrow(df_temp)))
  df_temp
})

df_gabungan <- dplyr::bind_rows(df_list_tagged)

cat(sprintf("\nTotal Rows   : %s\n", format(nrow(df_gabungan), big.mark = ",")))
cat(sprintf("Total Columns: %d\n",  ncol(df_gabungan)))
cat("All columns cast to character — numeric conversion occurs in Section C.\n")

# Save to Excel
wb <- createWorkbook()
addWorksheet(wb, "Merged Data", tabColour = "#1246c4")
writeData(wb, "Merged Data", df_gabungan)
hs <- createStyle(fontColour="#FFFFFF", fgFill="#0d1529",
                  textDecoration="bold", halign="center")
addStyle(wb, "Merged Data", hs, rows=1, cols=1:ncol(df_gabungan), gridExpand=TRUE)
setColWidths(wb, "Merged Data", cols=1:ncol(df_gabungan), widths="auto")
saveWorkbook(wb, "hasil_gabungan.xlsx", overwrite=TRUE)
cat("\n[EXCEL] Saved: hasil_gabungan.xlsx\n")

datatable(head(df_gabungan, 20),
          caption  = "Preview — First 20 Rows of Merged Dataset",
          extensions='Buttons', options=list(dom='Bfrtip', pageLength=5, scrollX=TRUE, buttons=list('csv','excel','print')),
          rownames = FALSE)
```

## Section B — Data Handling

<div class="info-box">
  <div class="box-title">Section B — Comprehensive Data Quality Audit</div>
  <p>Before any intervention, the state of the data must be thoroughly and objectively understood — much like a medical diagnosis before treatment. Section B executes a three-dimensional audit: <strong>(1) Data type profiling</strong> to verify alignment between storage format and column semantics; <strong>(2) Missing value analysis</strong> measuring not just counts but patterns that may indicate systemic upstream issues; and <strong>(3) Duplicate detection</strong> separating technical redundancy from genuine business duplication. The audit findings form the scientific basis for all cleaning strategies in Section C.</p>
</div>

### B.1 — Column Profile

```{r b1-info}
info_df <- data.frame(
  Column         = names(df_gabungan),
  Type           = sapply(df_gabungan, function(x) class(x)[1]),
  Missing        = sapply(df_gabungan, function(x) sum(is.na(x))),
  Pct_Missing    = sapply(df_gabungan, function(x) sprintf("%.1f%%", mean(is.na(x))*100)),
  Sample_Values  = sapply(df_gabungan, function(x) paste(head(na.omit(unique(x)), 3), collapse=" | ")),
  row.names      = NULL,
  stringsAsFactors = FALSE
)

datatable(info_df,
          caption  = "Column Profile — Merged Dataset",
          extensions='Buttons', options=list(dom='Bfrtip', pageLength=10, scrollX=TRUE, buttons=list('csv','excel','print')),
          rownames = FALSE) %>%
  formatStyle("Missing",
    backgroundColor = styleInterval(c(0, 100), c("#d1fae5","#fef3c7","#ffe4e6")))
```

### B.2 — Missing Values

```{r b2-missing}
missing_count <- colSums(is.na(df_gabungan))
total_missing <- sum(missing_count)

cat(sprintf("Total missing values : %d\n", total_missing))
cat(sprintf("Columns with missing : %d of %d\n",
            sum(missing_count > 0), ncol(df_gabungan)))

# Build the missing_df regardless — used below
missing_df <- if (total_missing > 0) {
  data.frame(
    Column     = names(missing_count[missing_count > 0]),
    Count      = missing_count[missing_count > 0],
    Percentage = paste0(round(missing_count[missing_count > 0]/nrow(df_gabungan)*100, 2), "%"),
    row.names  = NULL
  )
} else {
  data.frame(Column = character(), Count = integer(), Percentage = character())
}
```

```{r b2-missing-table, results='asis'}
if (nrow(missing_df) == 0) {                                       # REQUIRED IF
  cat("<p>No missing values detected in this dataset.</p>")
} else {
  print(
    datatable(missing_df,
              caption     = "Columns with Missing Values",
              extensions  = 'Buttons',
              options     = list(dom='Bfrtip', pageLength=15, scrollX=TRUE,
                                 buttons=list('csv','excel','print')),
              rownames    = FALSE)
  )
}
```

### B.3 — Duplicate Rows

```{r b3-duplikat}
n_duplikat <- sum(duplicated(df_gabungan))

cat(sprintf("Duplicate rows  : %s\n", format(n_duplikat, big.mark=",")))
cat(sprintf("Percentage      : %.2f%%\n", n_duplikat/nrow(df_gabungan)*100))

dup_preview <- if (n_duplikat > 0) {                             # REQUIRED IF
  cat("\nSample duplicate rows (first 5):\n")
  df_gabungan[duplicated(df_gabungan), ][1:min(5,n_duplikat), 1:5]
} else {
  cat("No duplicates found.\n")
  NULL
}
```

```{r b3-dup-table, results='asis'}
if (!is.null(dup_preview)) {
  print(
    datatable(dup_preview,
              caption    = "Sample Duplicate Rows",
              extensions = 'Buttons',
              options    = list(dom='Bfrtip', scrollX=TRUE, buttons=list('csv','excel','print')),
              rownames   = FALSE)
  )
}
```

### B.4 — Data Quality Issues

```{r b4-masalah}
masalah <- data.frame(
  No      = 1:4,
  Column  = c("platform","net_sales","payment_method","order_status"),
  Issue   = c("Inconsistent platform name casing and formatting across sources",
               "Mixed formats: currency symbols, thousand separators, negative signs",
               "NA values present — non-informative for analysis",
               "Capitalization inconsistency: 'Cancelled' vs 'cancelled' vs 'CANCELLED'"),
  Solution= c("Standardize with tolower() + case_when() to canonical form",
               "Strip non-numeric characters with gsub(), convert to numeric",
               "Fill NA with 'Unknown' to preserve rows without distortion",
               "Normalize with str_to_title() + case_when() to unified form"),
  stringsAsFactors = FALSE
)

datatable(masalah,
          caption  = "Data Quality Issues & Remediation Plan",
          extensions='Buttons', options=list(dom='Bfrtip', pageLength=10, scrollX=TRUE, buttons=list('csv','excel','print')),
          rownames = FALSE) %>%
  formatStyle("Issue",    color="#f43f5e") %>%
  formatStyle("Solution", color="#10b981")
```

## Section C — Data Cleaning

<div class="info-box">
  <div class="box-title">Section C — Systematic and Auditable Data Cleaning</div>
  <p>Data cleaning is not merely tidying — it is an analytical decision-making process that must be scientifically defensible. Every transformation in Section C follows informed cleaning principles: the rationale for each strategy (why median over mean? why fill with "Unknown" rather than drop rows?) is explicitly grounded in data distribution characteristics and business context. The sequence of operations is intentional — standardization precedes imputation, and deduplication occurs last to ensure no clean rows are inadvertently removed. Each step is quantitatively validated by comparing before-and-after states.</p>
</div>

### C.0 — Create Working Copy

```{r c0-salinan}
df_clean <- df_gabungan
cat(sprintf("Working copy created: %s rows x %d columns\n",
            format(nrow(df_clean), big.mark=","), ncol(df_clean)))
```

### C.1 — Platform Standardization

```{r c1-platform}
standardisasi_platform <- function(nilai) {
  if (is.na(nilai)) return("Unknown")                    # REQUIRED IF
  v <- tolower(trimws(nilai))
  case_when(                                             # REQUIRED IF-ELSE chain
    str_detect(v, "shopee")    ~ "Shopee",
    str_detect(v, "tokopedia") ~ "Tokopedia",
    str_detect(v, "lazada")    ~ "Lazada",
    str_detect(v, "bukalapak") ~ "Bukalapak",
    str_detect(v, "blibli")    ~ "Blibli",
    TRUE                       ~ str_to_title(v)
  )
}

df_clean$platform <- sapply(df_clean$platform, standardisasi_platform)

cat("Platform distribution after standardization:\n")
print(table(df_clean$platform))
```

### C.2 — Net Sales Cleaning

```{r c2-netsales}
bersihkan_sales <- function(nilai) {
  if (is.na(nilai)) return(0)                            # REQUIRED IF: handle NA
  if (is.numeric(nilai)) return(max(0, nilai))           # REQUIRED IF: already numeric
  bersih <- gsub("[Rp\\s\\.,]", "", as.character(nilai)) # strip currency symbols
  hasil  <- suppressWarnings(as.numeric(bersih))
  if (is.na(hasil) || hasil < 0) return(0)               # REQUIRED IF: invalid value
  return(hasil)
}

df_clean$net_sales <- sapply(df_clean$net_sales, bersihkan_sales)
df_clean$net_sales <- as.numeric(df_clean$net_sales)

cat(sprintf("Min  : Rp %s\n", format(min(df_clean$net_sales, na.rm=TRUE), big.mark=",")))
cat(sprintf("Max  : Rp %s\n", format(max(df_clean$net_sales, na.rm=TRUE), big.mark=",")))
cat(sprintf("Mean : Rp %s\n", format(round(mean(df_clean$net_sales, na.rm=TRUE)), big.mark=",")))
cat(sprintf("Remaining NA: %d\n", sum(is.na(df_clean$net_sales))))
```

### C.3 — Missing Value Imputation

```{r c3-missing}
df_clean <- df_clean %>%
  mutate(
    payment_method = ifelse(is.na(payment_method), "Unknown", payment_method),  # REQUIRED IF
    net_sales      = ifelse(is.na(net_sales),
                            median(net_sales, na.rm=TRUE), net_sales)            # REQUIRED IF
  )

cat(sprintf("Missing payment_method after imputation: %d\n", sum(is.na(df_clean$payment_method))))
cat(sprintf("Missing net_sales after imputation     : %d\n", sum(is.na(df_clean$net_sales))))
```

### C.4 — Order Status Standardization

```{r c4-status}
df_clean <- df_clean %>%
  mutate(order_status = case_when(
    str_detect(tolower(order_status), "complet")  ~ "Completed",   # REQUIRED IF-ELSE
    str_detect(tolower(order_status), "cancel")   ~ "Cancelled",
    str_detect(tolower(order_status), "pending")  ~ "Pending",
    str_detect(tolower(order_status), "ship")     ~ "Shipped",
    is.na(order_status)                           ~ "Unknown",
    TRUE                                          ~ str_to_title(order_status)
  ))

cat("Order status distribution after standardization:\n")
print(table(df_clean$order_status))
```

### C.5 — REQUIRED LOOP: Text Column Cleaning

```{r c5-looping-teks}
kolom_teks <- c("platform", "category", "product_name", "order_status", "payment_method")

for (kolom in kolom_teks) {                          # REQUIRED LOOP
  if (kolom %in% names(df_clean) &&                 # REQUIRED IF: column exists?
      is.character(df_clean[[kolom]])) {
    sebelum <- sum(is.na(df_clean[[kolom]]))
    df_clean[[kolom]] <- trimws(df_clean[[kolom]])
    df_clean[[kolom]][is.na(df_clean[[kolom]]) |
                      df_clean[[kolom]] == ""] <- "Unknown"
    sesudah <- sum(is.na(df_clean[[kolom]]))
    cat(sprintf("  [OK] %-20s | missing: %d -> %d\n", kolom, sebelum, sesudah))
  }
}
```

### C.6 — Deduplication and Summary

```{r c6-hapus-duplikat}
n_sblm <- nrow(df_clean)
df_clean <- df_clean %>% distinct()

cat(sprintf("Duplicates removed : %s rows\n",   format(n_sblm - nrow(df_clean), big.mark=",")))
cat(sprintf("Clean dataset      : %s rows\n",   format(nrow(df_clean), big.mark=",")))
cat(sprintf("Total columns      : %d\n\n", ncol(df_clean)))

datatable(head(df_clean, 50),
          caption  = "Clean Dataset — First 50 Rows",
          extensions='Buttons', options=list(dom='Bfrtip', pageLength=10, scrollX=TRUE, buttons=list('csv','excel','print')),
          rownames = FALSE)
```

## Section D — Conditional Logic

<div class="info-box">
  <div class="box-title">Section D — Business-Driven Feature Engineering</div>
  <p>Feature engineering bridges raw data and actionable business intelligence. The three derived columns added here are not merely technical exercises — each represents business logic actively used in e-commerce operations: <code>is_high_value</code> for premium customer segmentation and loyalty programs, <code>order_priority</code> for fulfillment resource allocation and SLA definition, and <code>valid_transaction</code> for separating confirmed revenue from at-risk pipeline. The enriched dataset carries significantly higher analytical value and can be consumed directly by Business Intelligence teams without further preprocessing.</p>
</div>

| New Column | Logic | Type | Business Use |
|:---|:---|:---:|:---|
| `is_high_value` | `net_sales > 1,000,000` | Binary | Premium customer identification |
| `order_priority` | 3-tier based on net_sales | Multi-class | SLA and resource allocation |
| `valid_transaction` | Based on order_status | Binary | Revenue forecasting |

### D.1 — is_high_value Column

```{r d1-high-value}
cek_high_value <- function(sales) {
  ifelse(sales > 1000000, "Yes", "No")          # REQUIRED IF-ELSE
}

df_clean$is_high_value <- cek_high_value(df_clean$net_sales)

cat("Distribution of is_high_value:\n")
print(table(df_clean$is_high_value))
```

### D.2 — order_priority Column (NESTED IF)

```{r d2-priority}
tentukan_prioritas <- function(sales) {
  case_when(
    sales > 1000000 ~ "High",        # REQUIRED NESTED IF level 1
    sales >= 500000 ~ "Medium",      # REQUIRED NESTED IF level 2
    TRUE            ~ "Low"          # REQUIRED NESTED IF level 3 (else)
  )
}

df_clean$order_priority <- tentukan_prioritas(df_clean$net_sales)

cat("Distribution of order_priority:\n")
print(table(df_clean$order_priority))
```

### D.3 — valid_transaction Column

```{r d3-valid}
df_clean$valid_transaction <- ifelse(
  df_clean$order_status == "Cancelled", "Invalid", "Valid"   # REQUIRED IF-ELSE
)

cat("Distribution of valid_transaction:\n")
print(table(df_clean$valid_transaction))
```

### D.4 — Display Results and Save

```{r d4-simpan}
kolom_tampil <- intersect(
  c("order_id","platform","net_sales","order_status",
    "is_high_value","order_priority","valid_transaction"),
  names(df_clean)
)

write_csv(df_clean, "dataset_final_bersih.csv")

wb2 <- createWorkbook()
addWorksheet(wb2, "Clean Dataset MP1", tabColour="#1246c4")
writeData(wb2, "Clean Dataset MP1", df_clean)
hs2 <- createStyle(fontColour="#FFFFFF", fgFill="#0d1529",
                   textDecoration="bold", halign="center")
addStyle(wb2, "Clean Dataset MP1", hs2,
         rows=1, cols=1:ncol(df_clean), gridExpand=TRUE)
setColWidths(wb2, "Clean Dataset MP1", cols=1:ncol(df_clean), widths="auto")
saveWorkbook(wb2, "dataset_final_bersih.xlsx", overwrite=TRUE)
cat("[CSV]   Saved: dataset_final_bersih.csv\n")
cat("[EXCEL] Saved: dataset_final_bersih.xlsx\n\n")

cat('<div style="margin:16px 0 24px; display:flex; gap:14px; flex-wrap:wrap;">
  <a href="dataset_final_bersih.csv" download
     style="display:inline-flex;align-items:center;gap:8px;
            background:linear-gradient(135deg,#0ea5e9,#6366f1);
            color:#fff;font-weight:700;font-size:14px;
            padding:12px 28px;border-radius:50px;text-decoration:none;
            box-shadow:0 4px 16px rgba(99,102,241,0.35);">
    Download CSV
  </a>
  <a href="dataset_final_bersih.xlsx" download
     style="display:inline-flex;align-items:center;gap:8px;
            background:linear-gradient(135deg,#10b981,#059669);
            color:#fff;font-weight:700;font-size:14px;
            padding:12px 28px;border-radius:50px;text-decoration:none;
            box-shadow:0 4px 16px rgba(16,185,129,0.35);">
    Download Excel
  </a>
</div>')

datatable(df_clean[, kolom_tampil],
          caption  = "Final Dataset — Feature-Engineered with 3 New Columns",
          extensions='Buttons', options=list(dom='Bfrtip', pageLength=10, scrollX=TRUE, buttons=list('csv','excel','print')),
          rownames = FALSE) %>%
  formatStyle("is_high_value",
    backgroundColor = styleEqual(c("Yes","No"), c("#d1fae5","#ffe4e6")),
    fontWeight = "bold") %>%
  formatStyle("valid_transaction",
    backgroundColor = styleEqual(c("Valid","Invalid"), c("#d1fae5","#ffe4e6")),
    fontWeight = "bold") %>%
  formatStyle("order_priority",
    backgroundColor = styleEqual(c("High","Medium","Low"),
                                  c("#d1fae5","#fef3c7","#ffe4e6")))
```

## Section E — Analytical Thinking

<div class="info-box">
  <div class="box-title">Section E — Analytical Thinking and Strategic Decision Intelligence</div>
  <p>This is the most consequential stage of the entire pipeline: converting numbers into actionable business narratives. Analysis transcends descriptive statistics when every finding is connected to operational implications and executable recommendations. Four analytical dimensions are explored: platform dominance (for marketing budget allocation), category concentration (for product portfolio risk management), transaction pipeline health (for customer success KPIs), and an executive summary grounded in data.</p>
</div>

### E.1 — Platform Dominance

```{r e1-platform}
platform_tbl     <- sort(table(df_clean$platform), decreasing=TRUE)
platform_dominan <- names(platform_tbl)[1]
pct_dominan      <- as.numeric(platform_tbl[1]) / nrow(df_clean) * 100

platform_df <- data.frame(
  Platform     = names(platform_tbl),
  Transactions = as.numeric(platform_tbl),
  Percentage   = paste0(round(as.numeric(platform_tbl)/nrow(df_clean)*100, 1), "%"),
  stringsAsFactors = FALSE
)

datatable(platform_df,
          caption  = "Transaction Distribution by E-Commerce Platform",
          extensions='Buttons', options=list(dom='Bfrtip', scrollX=TRUE, buttons=list('csv','excel','print')), rownames=FALSE) %>%
  formatStyle("Transactions",
    background = styleColorBar(platform_df$Transactions, "#3b7bff"),
    backgroundSize = "100% 70%",
    backgroundRepeat = "no-repeat",
    backgroundPosition = "center")
```

<div class="insight-box">
  <div class="box-title">Strategic Insight — Platform Dominance and Resource Allocation</div>
  <p><strong>`r platform_dominan`</strong> leads with <strong>`r format(as.numeric(platform_tbl[1]), big.mark=",")`</strong> transactions (<strong>`r round(pct_dominan,1)`%</strong> market share). This concentration reflects three mutually reinforcing mechanisms: <em>network effects</em> (a mature ecosystem of buyers and sellers creates gravitational pull), high <em>switching costs</em> for sellers who have built reputation and ratings on the platform, and accumulated <em>trust capital</em> through years of reliable post-sale service and dispute resolution.</p>
  <p>The gap with second-ranked platforms justifies an <strong>asymmetric, data-driven investment strategy</strong>: allocate 60–70% of acquisition and retention budget to the dominant platform where ROI is measurable and predictable. Remaining budget diversifies across other platforms to mitigate single-platform dependency risk. Monitor quarterly trends and prepare rebalancing scenarios if the dominant platform's share erodes by more than 5 percentage points.</p>
</div>

### E.2 — Category Distribution

```{r e2-category}
kolom_cat <- NA
for (k in c("category","kategori","product_category")) {  # REQUIRED LOOP
  if (k %in% names(df_clean)) { kolom_cat <- k; break }  # REQUIRED IF
}

if (!is.na(kolom_cat)) {                                   # REQUIRED IF
  cat_tbl <- sort(table(df_clean[[kolom_cat]]), decreasing=TRUE)
  cat_dominan <- names(cat_tbl)[1]
  pct_cat <- as.numeric(cat_tbl[1]) / nrow(df_clean) * 100

  cat_df <- data.frame(
    Category     = names(cat_tbl),
    Transactions = as.numeric(cat_tbl),
    Percentage   = paste0(round(as.numeric(cat_tbl)/nrow(df_clean)*100,1),"%"),
    stringsAsFactors=FALSE
  )
  datatable(cat_df,
            caption  = "Transaction Distribution by Product Category",
            extensions='Buttons', options=list(dom='Bfrtip', pageLength=15, scrollX=TRUE, buttons=list('csv','excel','print')), rownames=FALSE) %>%
    formatStyle("Transactions",
      background=styleColorBar(cat_df$Transactions,"#34d399"),
      backgroundSize="100% 70%", backgroundRepeat="no-repeat",
      backgroundPosition="center")
} else {
  cat("Category column not found in dataset.\n")
  cat_dominan <- "N/A"; pct_cat <- 0
}
```

<div class="insight-box">
  <div class="box-title">Strategic Insight — Category Concentration and Portfolio Risk</div>
  <p>Category <strong>"`r cat_dominan`"</strong> dominates at <strong>`r round(pct_cat,1)`%</strong> of total transactions. From an operational efficiency perspective, this concentration enables supply specialization — deeper vendor relationships, more favorable volume contracts, and more accurate demand forecasting. From a portfolio risk perspective, concentration above 35–40% in a single category creates vulnerability to seasonal demand shifts and supply chain disruptions. The tactical recommendation: drive 15–20% quarterly growth in two adjacent categories with the highest demographic overlap, while defending leadership in `r cat_dominan` through differentiation strategies that are difficult to replicate.</p>
</div>

### E.3 — Transaction Status Distribution

```{r e3-status}
status_tbl <- sort(table(df_clean$order_status), decreasing=TRUE)
cancelled  <- as.numeric(status_tbl["Cancelled"])
cancelled  <- ifelse(is.na(cancelled), 0, cancelled)
completed  <- as.numeric(status_tbl["Completed"])
completed  <- ifelse(is.na(completed), 0, completed)
pct_cancel <- cancelled / nrow(df_clean) * 100
pct_comp   <- completed / nrow(df_clean) * 100

status_df2 <- data.frame(
  Status     = names(status_tbl),
  Count      = as.numeric(status_tbl),
  Percentage = paste0(round(as.numeric(status_tbl)/nrow(df_clean)*100,1),"%"),
  stringsAsFactors = FALSE
)

datatable(status_df2,
          caption  = "Order Status Distribution — Transaction Pipeline Health",
          extensions='Buttons', options=list(dom='Bfrtip', scrollX=TRUE, buttons=list('csv','excel','print')), rownames=FALSE) %>%
  formatStyle("Status",
    backgroundColor = styleEqual(
      c("Completed","Cancelled","Pending","Shipped"),
      c("#d1fae5","#ffe4e6","#fef3c7","#dbeafe")),
    fontWeight="bold")
```

<div class="`r ifelse(pct_cancel>10,'warning-box','insight-box')`">
  <div class="box-title">`r ifelse(pct_cancel>10,'Alert — Cancel Rate Requires Immediate Attention','Insight — Transaction Pipeline Health')`</div>
  <p><strong>`r round(pct_comp,1)`%</strong> of transactions reached Completed status — the true conversion rate of the pipeline. The cancellation rate of <strong>`r round(pct_cancel,1)`%</strong> `r ifelse(pct_cancel>10,'is ELEVATED, exceeding the industry threshold of 8–10%. Immediate investigation is required: audit the top 3 cancellation drivers (stockout, non-competitive pricing, excessive delivery estimates) and implement proactive intervention systems before buyers cancel.','falls within an acceptable range but warrants ongoing monitoring. Every 1% reduction in cancel rate directly translates to measurable revenue recovery.')`</p>
  <p>Impact formula: <em>1% reduction in cancel rate &times; average transaction value = potential revenue recovery per period</em>. This should become a shared KPI for both the customer success and fulfillment teams.</p>
</div>

### E.4 — Executive Summary

```{r e4-ringkasan}
total_sales  <- sum(df_clean$net_sales, na.rm=TRUE)
avg_sales    <- mean(df_clean$net_sales, na.rm=TRUE)
high_val_n   <- sum(df_clean$is_high_value=="Yes", na.rm=TRUE)
high_val_pct <- high_val_n / nrow(df_clean) * 100
rev_valid    <- sum(df_clean$net_sales[df_clean$valid_transaction=="Valid"], na.rm=TRUE)
rev_lost     <- sum(df_clean$net_sales[df_clean$valid_transaction=="Invalid"], na.rm=TRUE)
```

<div class="stat-grid">
  <div class="stat-card blue">
    <div class="val">`r format(nrow(df_clean), big.mark=",")`</div>
    <div class="lbl">Total Transactions</div>
  </div>
  <div class="stat-card teal">
    <div class="val">Rp `r format(round(total_sales/1e9,1), big.mark=",")`B</div>
    <div class="lbl">Gross Revenue</div>
  </div>
  <div class="stat-card emerald">
    <div class="val">`r round(high_val_pct,1)`%</div>
    <div class="lbl">High Value Transactions</div>
  </div>
  <div class="stat-card rose">
    <div class="val">`r round(pct_cancel,1)`%</div>
    <div class="lbl">Cancel Rate</div>
  </div>
  <div class="stat-card purple">
    <div class="val">Rp `r format(round(avg_sales), big.mark=",")`</div>
    <div class="lbl">Avg. per Transaction</div>
  </div>
  <div class="stat-card amber">
    <div class="val">Rp `r format(round(rev_lost/1e6,1), big.mark=",")`M</div>
    <div class="lbl">Revenue Lost (Cancel)</div>
  </div>
</div>

<div class="insight-box">
  <div class="box-title">Strategic Recommendations — 4 Priority Actions</div>
  <p>
    <strong>1. Strengthen position on `r platform_dominan`</strong> — Implement platform-exclusive programs: scheduled flash sales timed to peak traffic windows, A/B-tested product listing thumbnails, and bundling features to increase average order value. Target: maintain &gt;50% market share.<br><br>
    <strong>2. High-value loyalty program</strong> (`r round(high_val_pct,1)`% of transactions, disproportionate revenue contribution) — Provide early-sale access, free premium shipping, and a dedicated customer service channel. High-LTV customers are the most valuable assets to protect from churn.<br><br>
    <strong>3. Cancel rate reduction mission from `r round(pct_cancel,1)`%</strong> — Audit the top 3 cancellation reasons via exit surveys, implement proactive status-change notifications, and strengthen delivery SLAs with express shipping options. Target: reduce by 2–3 percentage points within two quarters.<br><br>
    <strong>4. Standardize data input processes</strong> — Implement real-time validation at the point of entry (frontend validation + backend sanitization) to prevent the format inconsistencies identified during cleaning. Data quality at the source is far less costly than remediation downstream.
  </p>
</div>

---

# Mini Project 2 — Web Scraping Intelligence {#mini-project-2}

<div class="info-box">
  <div class="box-title">About Mini Project 2 — Automated Web Data Collection Engineering</div>
  <p>Web scraping is a distinct discipline within the data engineering ecosystem: the ability to extract structured data from web interfaces designed for humans, not machines. This project implements <strong>4 distinct scraping techniques</strong> from ScrapeThisSite.com — each selected not by preference, but because the way each website delivers its data demands that specific approach. Core competency developed: the ability to diagnose a website's data delivery architecture (static/dynamic/paginated/API-driven/framed) and select the optimal extraction strategy — a skill of high commercial value since much critical data is accessible only through web interfaces.</p>
</div>

| # | Website | Technique | Complexity | Data |
|:---:|:---|:---|:---:|:---|
| 1 | Countries of the World | Static HTML | Low | Country, population, area |
| 2 | Hockey Teams | Pagination | Medium | NHL teams, wins, losses, win rate |
| 3 | Oscar Winning Films | AJAX to JSON | High | Films, year, nominations, awards |
| 4 | Turtles All the Way Down | iFrame | Medium-High | Turtle families |

```{r mp2-headers, include=FALSE}
HEADERS <- add_headers(
  `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
)
```

## Section A — Data Collection

<div class="info-box">
  <div class="box-title">Section A — Implementation of 4 Distinct Scraping Techniques</div>
  <p>Each scraping technique implemented here reflects a fundamentally different engineering challenge. <strong>Static HTML</strong> trains CSS selector skills and DOM traversal. <strong>Pagination</strong> requires state management across requests and reliable termination condition detection. <strong>AJAX/JSON scraping</strong> builds understanding of browser-server communication architecture through network traffic reverse engineering. <strong>iFrame scraping</strong> demands analysis of layered document structures not visible from the primary source. Mastery of all four techniques covers the majority of web scraping scenarios encountered in industry practice.</p>
</div>

### A.1 — Countries (Static HTML)

```{r a1-countries}
url_countries  <- "https://www.scrapethissite.com/pages/simple/"
data_countries <- list()

cat("Scraping: Countries of the World (Static HTML)...\n")
cat(sprintf("URL: %s\n", url_countries))

tryCatch({
  resp   <- GET(url_countries, HEADERS, timeout(15))
  status <- status_code(resp)
  cat(sprintf("HTTP Status: %d\n", status))

  if (status != 200) stop(sprintf("HTTP %d", status))   # REQUIRED IF

  page <- read_html(content(resp, "text"))
  divs <- html_elements(page, "div.country")

  if (length(divs) == 0) stop("No div.country elements found")

  # REQUIRED LOOP: iterate each country element
  for (div in divs) {
    nm  <- html_element(div, "h3.country-name")
    cap <- html_element(div, "span.country-capital")
    pop <- html_element(div, "span.country-population")
    ar  <- html_element(div, "span.country-area")
    data_countries <- append(data_countries, list(list(
      country_name = trimws(html_text2(nm)),
      capital      = trimws(html_text2(cap)),
      population   = trimws(html_text2(pop)),
      area_km2     = trimws(html_text2(ar))
    )))
  }
  cat(sprintf("Success! %d countries found\n", length(data_countries)))

}, error = function(e) {
  cat(sprintf("[Fallback] %s\n", e$message))
  data_countries <<- list(
    list(country_name="China",         capital="Beijing",     population="1339724852", area_km2="9596960"),
    list(country_name="India",         capital="New Delhi",   population="1173108018", area_km2="3287590"),
    list(country_name="United States", capital="Washington",  population="310232863",  area_km2="9629091"),
    list(country_name="Indonesia",     capital="Jakarta",     population="242968342",  area_km2="1919440"),
    list(country_name="Brazil",        capital="Brasilia",    population="201103330",  area_km2="8511965"),
    list(country_name="Pakistan",      capital="Islamabad",   population="184404791",  area_km2="803940"),
    list(country_name="Bangladesh",    capital="Dhaka",       population="156118464",  area_km2="144000"),
    list(country_name="Nigeria",       capital="Abuja",       population="154000000",  area_km2="923768"),
    list(country_name="Russia",        capital="Moscow",      population="140702000",  area_km2="17100000"),
    list(country_name="Japan",         capital="Tokyo",       population="127288000",  area_km2="377835"),
    list(country_name="Mexico",        capital="Mexico City", population="112468855",  area_km2="1972550"),
    list(country_name="Philippines",   capital="Manila",      population="99900177",   area_km2="300000"),
    list(country_name="Vietnam",       capital="Hanoi",       population="89571130",   area_km2="329560"),
    list(country_name="Germany",       capital="Berlin",      population="81802257",   area_km2="357021"),
    list(country_name="Australia",     capital="Canberra",    population="21515754",   area_km2="7686850")
  )
})

df_countries <- bind_rows(data_countries)
write_csv(df_countries, "scraping_countries.csv")
cat(sprintf("DataFrame: %d rows x %d cols | Saved: scraping_countries.csv\n",
            nrow(df_countries), ncol(df_countries)))

datatable(df_countries,
          caption  = paste0("Data Collected: ", nrow(df_countries), " Countries of the World"),
          extensions='Buttons', options=list(dom='Bfrtip', pageLength=10, scrollX=TRUE, buttons=list('csv','excel','print')),
          rownames = FALSE)
```

### A.2 — Hockey Teams (Pagination)

```{r a2-hockey}
url_hockey_base <- "https://www.scrapethissite.com/pages/forms/"
data_hockey     <- list()
MAX_PAGE        <- 24

cat("Scraping: Hockey Teams (Pagination)...\n")

tryCatch({
  for (page_num in 1:MAX_PAGE) {                         # REQUIRED LOOP per page
    resp <- GET(url_hockey_base, HEADERS,
                query=list(q="", page_num=page_num), timeout(15))

    if (status_code(resp) != 200) break                   # REQUIRED IF

    page <- read_html(content(resp, "text"))
    rows <- html_elements(page, "tr.team")

    if (length(rows) == 0) {                              # REQUIRED IF: termination
      cat(sprintf("  Page %d: no data. Stopping.\n", page_num))
      break
    }

    for (row in rows) {                                   # REQUIRED LOOP per row
      cols <- html_elements(row, "td")
      if (length(cols) >= 4) {
        data_hockey <- append(data_hockey, list(list(
          team_name = trimws(html_text2(cols[[1]])),
          year      = trimws(html_text2(cols[[2]])),
          wins      = trimws(html_text2(cols[[3]])),
          losses    = trimws(html_text2(cols[[4]])),
          win_pct   = ifelse(length(cols)>5, trimws(html_text2(cols[[6]])), "0")
        )))
      }
    }
    cat(sprintf("  Page %d: %d teams\n", page_num, length(rows)))
    Sys.sleep(0.3)
  }
  cat(sprintf("Complete! Total: %d teams\n", length(data_hockey)))
}, error = function(e) {
  cat(sprintf("[Fallback] %s\n", e$message))
  data_hockey <<- list(
    list(team_name="Boston Bruins",        year="1990",wins="44",losses="24",win_pct="0.55"),
    list(team_name="Buffalo Sabres",       year="1990",wins="45",losses="27",win_pct="0.56"),
    list(team_name="Calgary Flames",       year="1990",wins="46",losses="26",win_pct="0.58"),
    list(team_name="Chicago Blackhawks",   year="1990",wins="49",losses="23",win_pct="0.61"),
    list(team_name="Detroit Red Wings",    year="1990",wins="34",losses="38",win_pct="0.43"),
    list(team_name="Edmonton Oilers",      year="1991",wins="36",losses="34",win_pct="0.45"),
    list(team_name="Los Angeles Kings",    year="1991",wins="46",losses="24",win_pct="0.58"),
    list(team_name="Montreal Canadiens",   year="1992",wins="41",losses="28",win_pct="0.51"),
    list(team_name="Toronto Maple Leafs",  year="1992",wins="30",losses="43",win_pct="0.38"),
    list(team_name="Vancouver Canucks",    year="1992",wins="42",losses="26",win_pct="0.53"),
    list(team_name="Pittsburgh Penguins",  year="1991",wins="41",losses="33",win_pct="0.51"),
    list(team_name="New York Rangers",     year="1992",wins="50",losses="25",win_pct="0.62"),
    list(team_name="Philadelphia Flyers",  year="1992",wins="32",losses="37",win_pct="0.40"),
    list(team_name="St. Louis Blues",      year="1993",wins="37",losses="36",win_pct="0.46"),
    list(team_name="Minnesota North Stars",year="1991",wins="27",losses="39",win_pct="0.34")
  )
})

df_hockey <- bind_rows(data_hockey)
write_csv(df_hockey, "scraping_hockey.csv")
cat(sprintf("DataFrame: %d rows x %d cols | Saved: scraping_hockey.csv\n",
            nrow(df_hockey), ncol(df_hockey)))

datatable(df_hockey,
          caption  = paste0("Data Collected: ", nrow(df_hockey), " NHL Hockey Teams"),
          extensions='Buttons', options=list(dom='Bfrtip', pageLength=10, scrollX=TRUE, buttons=list('csv','excel','print')),
          rownames = FALSE)
```

### A.3 — Oscar Films (AJAX to JSON)

```{r a3-oscar}
url_oscar  <- "https://www.scrapethissite.com/pages/ajax-javascript/"
data_oscar <- list()
tahun_list <- c(2010, 2011, 2012, 2013, 2014, 2015)

cat("Scraping: Oscar Winning Films (AJAX -> JSON per year)...\n")
cat(sprintf("Years: %s\n\n", paste(tahun_list, collapse=", ")))

for (tahun in tahun_list) {                               # REQUIRED LOOP per year
  tryCatch({
    resp <- GET(url_oscar, HEADERS,
                query=list(ajax="true", year=tahun), timeout(15))

    if (status_code(resp) != 200) {                       # REQUIRED IF
      cat(sprintf("  [%d] HTTP %d — skipped\n", tahun, status_code(resp)))
      next
    }

    films <- fromJSON(content(resp, "text", encoding="UTF-8"))

    if (is.null(films) || nrow(films) == 0) {             # REQUIRED IF
      cat(sprintf("  [%d] No data — skipped\n", tahun))
      next
    }

    for (i in seq_len(nrow(films))) {                     # REQUIRED LOOP per film
      data_oscar <- append(data_oscar, list(list(
        movie_title  = trimws(as.character(films$title[i])),
        year         = films$year[i],
        nominations  = as.integer(films$nominations[i]),
        awards       = as.integer(films$awards[i]),
        best_picture = ifelse(isTRUE(films$best_picture[i]), "Yes", "No")
      )))
    }
    cat(sprintf("  [%d] Success — %d films\n", tahun, nrow(films)))
    Sys.sleep(0.3)
  }, error = function(e) cat(sprintf("  [%d] Error: %s\n", tahun, e$message)))
}

cat(sprintf("\nTotal: %d films from %d years\n", length(data_oscar), length(tahun_list)))

if (length(data_oscar) == 0) {
  cat("[Fallback] All years failed. Using sample Oscar data.\n")
  data_oscar <- list(
    list(movie_title="The Artist",          year=2011L, nominations=10L, awards=5L, best_picture="Yes"),
    list(movie_title="The King's Speech",   year=2010L, nominations=12L, awards=4L, best_picture="Yes"),
    list(movie_title="Argo",                year=2012L, nominations=7L,  awards=3L, best_picture="Yes"),
    list(movie_title="12 Years A Slave",    year=2013L, nominations=9L,  awards=3L, best_picture="Yes"),
    list(movie_title="Birdman",             year=2014L, nominations=9L,  awards=4L, best_picture="Yes"),
    list(movie_title="Spotlight",           year=2015L, nominations=6L,  awards=2L, best_picture="Yes"),
    list(movie_title="Hugo",                year=2011L, nominations=11L, awards=5L, best_picture="No"),
    list(movie_title="Inception",           year=2010L, nominations=8L,  awards=4L, best_picture="No"),
    list(movie_title="Life Of Pi",          year=2012L, nominations=11L, awards=4L, best_picture="No"),
    list(movie_title="Gravity",             year=2013L, nominations=10L, awards=7L, best_picture="No"),
    list(movie_title="The Grand Budapest Hotel",year=2014L,nominations=9L,awards=4L,best_picture="No"),
    list(movie_title="Mad Max: Fury Road",  year=2015L, nominations=10L, awards=6L, best_picture="No")
  )
}

df_oscar <- bind_rows(data_oscar)
write_csv(df_oscar, "scraping_oscar.csv")
cat(sprintf("DataFrame: %d rows x %d cols | Saved: scraping_oscar.csv\n",
            nrow(df_oscar), ncol(df_oscar)))

datatable(df_oscar,
          caption  = paste0("Data Collected: ", nrow(df_oscar), " Oscar Films (2010-2015)"),
          extensions='Buttons', options=list(dom='Bfrtip', pageLength=10, scrollX=TRUE, buttons=list('csv','excel','print')),
          rownames = FALSE) %>%
  formatStyle("best_picture",
    backgroundColor = styleEqual(c("Yes","No"), c("#d1fae5","#ffe4e6")),
    fontWeight = "bold")
```

### A.4 — Turtles (iFrame)

```{r a4-turtles}
BASE_URL     <- "https://www.scrapethissite.com"
url_turtles  <- "https://www.scrapethissite.com/pages/frames/?frame=i"
data_turtles <- list()
families     <- list()

cat("Scraping: Turtles All the Way Down...\n")
cat(sprintf("Primary URL: %s\n\n", url_turtles))

# STEP 1: Retrieve family list from main frame
tryCatch({
  resp   <- GET(url_turtles, HEADERS, timeout(15))
  status <- status_code(resp)
  cat(sprintf("HTTP Status (main): %d\n", status))

  if (status != 200) stop(sprintf("HTTP error %d", status))   # REQUIRED IF

  page    <- read_html(content(resp, "text"))
  h3_list <- html_elements(page, "h3")

  if (length(h3_list) == 0) stop("No h3 tags found")  # REQUIRED IF

  cat(sprintf("%d families found:\n", length(h3_list)))

  for (h3 in h3_list) {                                          # REQUIRED LOOP
    nama <- trimws(html_text2(h3))
    if (nchar(nama) == 0) next                                   # REQUIRED IF: skip empty

    a_tag <- tryCatch(html_element(h3, xpath="following-sibling::a[1]"),
                      error=function(e) NULL)
    href  <- tryCatch(html_attr(a_tag, "href"), error=function(e) NA)
    link  <- if (!is.na(href) && startsWith(href, "/")) {
      paste0(BASE_URL, href)
    } else if (!is.na(href)) {
      href
    } else {
      NA_character_
    }

    families <- append(families, list(list(family_name=nama, detail_url=link)))
    cat(sprintf("  + %s\n", nama))
  }
  cat(sprintf("\n%d families identified.\n\n", length(families)))

}, error = function(e) {
  cat(sprintf("Error STEP 1: %s\n", e$message))
  families <<- list(
    list(family_name="Cheloniidae",     detail_url=paste0(BASE_URL,"/pages/frames/cheloniidae/")),
    list(family_name="Dermochelyidae",  detail_url=paste0(BASE_URL,"/pages/frames/dermochelyidae/")),
    list(family_name="Carettochelyidae",detail_url=paste0(BASE_URL,"/pages/frames/carettochelyidae/")),
    list(family_name="Trionychidae",    detail_url=paste0(BASE_URL,"/pages/frames/trionychidae/")),
    list(family_name="Kinosternidae",   detail_url=paste0(BASE_URL,"/pages/frames/kinosternidae/")),
    list(family_name="Chelydridae",     detail_url=paste0(BASE_URL,"/pages/frames/chelydridae/")),
    list(family_name="Emydidae",        detail_url=paste0(BASE_URL,"/pages/frames/emydidae/")),
    list(family_name="Geoemydidae",     detail_url=paste0(BASE_URL,"/pages/frames/geoemydidae/")),
    list(family_name="Testudinidae",    detail_url=paste0(BASE_URL,"/pages/frames/testudinidae/")),
    list(family_name="Platysternidae",  detail_url=paste0(BASE_URL,"/pages/frames/platysternidae/"))
  )
  cat(sprintf("[Fallback] %d families used as sample data.\n\n", length(families)))
})

# STEP 2: Loop through each detail page, parse description text
cat(sprintf("%-22s %-30s %-8s %s\n", "family_name", "common_name", "year", "discovered_by"))
cat(strrep("-", 80), "\n")

for (fam in families) {                                          # REQUIRED LOOP
  row <- list(
    family_name     = fam$family_name,
    common_name     = "Unknown",
    year_discovered = "Unknown",
    discovered_by   = "Unknown"
  )

  if (!is.na(fam$detail_url)) {
    tryCatch({
      r2   <- GET(fam$detail_url, HEADERS, timeout(10))
      if (status_code(r2) == 200) {                              # REQUIRED IF
        pg2  <- read_html(content(r2,"text"))
        para <- html_text2(html_element(pg2, "p"))
        if (!is.na(para) && nchar(para) > 0) {
          m_cn <- regmatches(para, regexpr('"([^"]+)"', para))
          if (length(m_cn)>0) row$common_name <- gsub('"','',m_cn)
          m_yr <- regmatches(para, regexpr("\\b(1[89]\\d{2}|20[01]\\d)\\b", para))
          if (length(m_yr)>0) row$year_discovered <- m_yr
          m_by <- regmatches(para, regexpr(
            paste0("(?<=in\\s",row$year_discovered,"\\sby\\s)[A-Z][a-zA-Z.\\s]+"),
            para, perl=TRUE))
          if (length(m_by)>0) row$discovered_by <- trimws(m_by)
        }
      }
      Sys.sleep(0.2)
    }, error=function(e) NULL)
  }

  data_turtles <- append(data_turtles, list(row))
  cat(sprintf("  %-22s %-30s %-8s %s\n",
              row$family_name, row$common_name, row$year_discovered, row$discovered_by))
}

df_turtles <- bind_rows(data_turtles)
write_csv(df_turtles, "scraping_turtles.csv")
cat(sprintf("\nDataFrame: %d rows x %d cols | Saved: scraping_turtles.csv\n",
            nrow(df_turtles), ncol(df_turtles)))

datatable(df_turtles,
          caption  = paste0("Data Collected: ", nrow(df_turtles), " Turtle Families"),
          extensions='Buttons', options=list(dom='Bfrtip', pageLength=20, scrollX=TRUE, buttons=list('csv','excel','print')),
          rownames = FALSE)
```

## Section B — Data Handling

<div class="info-box">
  <div class="box-title">Section B — Quality Audit of 4 Scraped Datasets</div>
  <p>All four scraped datasets are audited centrally using a single loop — one function, one iteration, four datasets simultaneously. This reflects the DRY principle: when the same analytical pattern applies to N datasets, use a loop rather than copy-pasting code N times.</p>
</div>

```{r b-handling-mp2}
datasets_mp2 <- list(
  list(df=df_countries, nama="Countries"),
  list(df=df_hockey,    nama="Hockey"),
  list(df=df_oscar,     nama="Oscar"),
  list(df=df_turtles,   nama="Turtles")
)

# REQUIRED LOOP: analyze all datasets simultaneously
ringkasan_mp2 <- bind_rows(lapply(datasets_mp2, function(d) {
  data.frame(
    Dataset    = d$nama,
    Rows       = nrow(d$df),
    Columns    = ncol(d$df),
    Missing    = sum(is.na(d$df)),
    Duplicates = sum(duplicated(d$df)),
    stringsAsFactors = FALSE
  )
}))

datatable(ringkasan_mp2,
          caption  = "Dataset Summary — Pre-Cleaning Quality Audit",
          extensions='Buttons', options=list(dom='Bfrtip', scrollX=TRUE, buttons=list('csv','excel','print')), rownames=FALSE) %>%
  formatStyle("Missing",
    backgroundColor = styleInterval(c(0,10), c("#d1fae5","#fef3c7","#ffe4e6"))) %>%
  formatStyle("Duplicates",
    backgroundColor = styleInterval(c(0,5), c("#d1fae5","#fef3c7","#ffe4e6")))
```

## Section C — Data Cleaning

<div class="info-box">
  <div class="box-title">Section C — Systematic Cleaning Pipeline for 4 Datasets</div>
  <p>Each dataset is cleaned systematically: text standardization (title case + whitespace trimming), appropriate numeric type conversion per column, context-driven missing value handling, and deduplication. Despite differing schemas, a consistent cleaning pattern is applied through function abstraction.</p>
</div>

```{r c1-countries-clean}
# Countries
df_c1 <- df_countries

for (k in c("country_name","capital")) {    # REQUIRED LOOP
  if (k %in% names(df_c1)) {               # REQUIRED IF
    df_c1[[k]] <- trimws(str_to_title(replace_na(df_c1[[k]], "Unknown")))
  }
}

df_c1$population <- suppressWarnings(as.integer(as.numeric(df_c1$population)))
df_c1$area_km2   <- suppressWarnings(as.numeric(df_c1$area_km2))
df_c1$population <- replace_na(df_c1$population, 0L)
df_c1$area_km2   <- replace_na(df_c1$area_km2,   0.0)

n_before <- nrow(df_c1); df_c1 <- distinct(df_c1)
cat(sprintf("[Countries] Clean: %d rows (duplicates removed: %d)\n",
            nrow(df_c1), n_before-nrow(df_c1)))
```

```{r c2-hockey-clean}
# Hockey
df_c2 <- df_hockey

for (k in c("team_name")) {                 # REQUIRED LOOP
  if (k %in% names(df_c2)) {               # REQUIRED IF
    df_c2[[k]] <- trimws(str_to_title(replace_na(df_c2[[k]], "Unknown")))
  }
}

df_c2$year    <- suppressWarnings(as.integer(as.numeric(df_c2$year)))
df_c2$wins    <- suppressWarnings(as.integer(as.numeric(df_c2$wins)))
df_c2$losses  <- suppressWarnings(as.integer(as.numeric(df_c2$losses)))
df_c2$win_pct <- suppressWarnings(as.numeric(df_c2$win_pct))
df_c2$win_pct <- replace_na(df_c2$win_pct, 0.0)
df_c2$win_pct <- ifelse(df_c2$win_pct > 1, df_c2$win_pct/100, df_c2$win_pct)

n_before <- nrow(df_c2); df_c2 <- distinct(df_c2)
cat(sprintf("[Hockey] Clean: %d rows (duplicates removed: %d)\n",
            nrow(df_c2), n_before-nrow(df_c2)))
```

```{r c3-oscar-clean}
# Oscar
df_c3 <- df_oscar

for (k in c("movie_title")) {               # REQUIRED LOOP
  if (k %in% names(df_c3)) {               # REQUIRED IF
    df_c3[[k]] <- trimws(str_to_title(replace_na(df_c3[[k]], "Unknown")))
  }
}

df_c3$nominations <- suppressWarnings(as.integer(as.numeric(df_c3$nominations)))
df_c3$awards      <- suppressWarnings(as.integer(as.numeric(df_c3$awards)))
df_c3$year        <- suppressWarnings(as.integer(as.numeric(df_c3$year)))
df_c3$best_picture <- ifelse(
  tolower(as.character(df_c3$best_picture)) %in% c("yes","true","1"), "Yes", "No"
)

n_before <- nrow(df_c3); df_c3 <- distinct(df_c3)
cat(sprintf("[Oscar] Clean: %d rows (duplicates removed: %d)\n",
            nrow(df_c3), n_before-nrow(df_c3)))
```

```{r c4-turtles-clean}
# Turtles
df_c4 <- df_turtles

for (k in c("family_name","common_name","year_discovered","discovered_by")) {  # REQUIRED LOOP
  if (k %in% names(df_c4) && is.character(df_c4[[k]])) {                       # REQUIRED IF
    df_c4[[k]] <- trimws(replace_na(df_c4[[k]], "Unknown"))
    df_c4[[k]] <- ifelse(df_c4[[k]] == "", "Unknown", df_c4[[k]])
    cat(sprintf("  [OK] %s — trimmed and filled\n", k))
  }
}

n_before <- nrow(df_c4); df_c4 <- distinct(df_c4)
cat(sprintf("  Duplicates removed: %d\n", n_before - nrow(df_c4)))
cat(sprintf("[Turtles] Clean: %d rows x %d cols\n", nrow(df_c4), ncol(df_c4)))

write_csv(df_c1, "scraping_countries_clean.csv")
write_csv(df_c2, "scraping_hockey_clean.csv")
write_csv(df_c3, "scraping_oscar_clean.csv")
write_csv(df_c4, "scraping_turtles_clean.csv")
cat("\nAll clean datasets saved successfully.\n")
```

## Section D — Conditional Logic

<div class="info-box">
  <div class="box-title">Section D — Data Quality Flagging via Nested IF</div>
  <p>A <code>data_status</code> column is added to each scraped dataset as a per-row completeness indicator. This is a form of simple data quality scoring: each row is evaluated against dataset-specific completeness criteria rather than a generic universal standard.</p>
</div>

```{r d-conditional-mp2}
tambah_status <- function(df, kolom_penting) {
  df %>% rowwise() %>%
    mutate(data_status = {
      ok <- TRUE
      for (k in kolom_penting) {              # REQUIRED LOOP
        if (k %in% names(df)) {              # REQUIRED IF
          v <- as.character(get(k))
          if (is.na(v) || trimws(v) %in% c("","Unknown","nan","None","0")) {
            ok <- FALSE; break               # REQUIRED NESTED IF
          }
        }
      }
      if (ok) "Complete" else "Incomplete"   # REQUIRED IF-ELSE
    }) %>% ungroup()
}

config <- list(
  list(df=df_c1, kol=c("country_name","capital","population"), nama="Countries"),
  list(df=df_c2, kol=c("team_name","year","wins"),             nama="Hockey"),
  list(df=df_c3, kol=c("movie_title","year","awards"),         nama="Oscar"),
  list(df=df_c4, kol=c("family_name","common_name","year_discovered"),  nama="Turtles")
)

hasil_status <- bind_rows(lapply(config, function(cfg) {  # REQUIRED LOOP
  df_h <- tambah_status(cfg$df, cfg$kol)
  comp <- sum(df_h$data_status == "Complete")
  data.frame(Dataset=cfg$nama, Complete=comp,
             Incomplete=nrow(df_h)-comp,
             Pct_Complete=paste0(round(comp/nrow(df_h)*100,1),"%"),
             stringsAsFactors=FALSE)
}))

df_c1 <- tambah_status(df_c1, config[[1]]$kol)
df_c2 <- tambah_status(df_c2, config[[2]]$kol)
df_c3 <- tambah_status(df_c3, config[[3]]$kol)
df_c4 <- tambah_status(df_c4, config[[4]]$kol)

datatable(hasil_status,
          caption  = "Data Quality Score — data_status Distribution per Dataset",
          extensions='Buttons', options=list(dom='Bfrtip', scrollX=TRUE, buttons=list('csv','excel','print')), rownames=FALSE) %>%
  formatStyle("Complete",
    background=styleColorBar(hasil_status$Complete, "#34d399"),
    backgroundSize="100% 70%", backgroundRepeat="no-repeat",
    backgroundPosition="center") %>%
  formatStyle("Pct_Complete",
    color = "#10b981",
    fontWeight = "bold")
```

## Section E — Analytical Thinking

<div class="info-box">
  <div class="box-title">Section E — Evaluation of Scraping Techniques and Substantive Insights</div>
  <p>Two analytical layers are executed here. The first is meta-analytic: a comparative evaluation of the four scraping techniques across implementation complexity, fragility to website changes, operational risk (rate limiting, IP blocking), and optimal use cases — building a decision framework for future technique selection. The second is substantive: extracting meaningful insights from the collected data, demonstrating that the scraping process has generated knowledge that was not previously available.</p>
</div>

### E.1 — Comparison of 4 Scraping Techniques

```{r e1-perbandingan}
perbandingan <- data.frame(
  No          = 1:4,
  Type        = c("Static HTML","Pagination","AJAX to JSON","iFrame"),
  Website     = c("Countries","Hockey","Oscar","Turtles"),
  Complexity  = c("Low","Medium","High","Medium-High"),
  Method      = c("GET -> read_html -> html_elements()",
                   "Loop ?page_num=N, break when empty",
                   "GET ?ajax=true&year=YYYY -> fromJSON()",
                   "Access ?frame=i directly -> html_elements('h3')"),
  Advantages  = c("Single request, fast, reliable, no JS dependency",
                   "Automatically scrapes thousands of rows across dozens of pages",
                   "Structured JSON is more efficient and easier to parse",
                   "Bypasses parent frame navigation complexity"),
  Limitations = c("Fails entirely if data is rendered by JavaScript",
                   "Vulnerable to rate-limiting; page count must be known or detected",
                   "Hidden, undocumented endpoint that can change without notice",
                   "Requires prior frame structure analysis via DevTools"),
  stringsAsFactors = FALSE
)

datatable(perbandingan,
          caption  = "Comparison Matrix — 4 Web Scraping Approaches",
          extensions='Buttons', options=list(dom='Bfrtip', pageLength=10, scrollX=TRUE, buttons=list('csv','excel','print')),
          rownames = FALSE) %>%
  formatStyle("Complexity",
    color = styleEqual(
      c("Low","Medium","Medium-High","High"),
      c("#10b981","#f59e0b","#f97316","#f43f5e")),
    fontWeight = "bold")
```

<div class="insight-box">
  <div class="box-title">Key Insight — Technique Selection is an Engineering Requirement, Not a Preference</div>
  <p>Static HTML scraping is the most reliable technique: no JavaScript dependency, no race conditions, and a single HTTP request suffices. It should always be attempted first — success eliminates the need for more complex approaches, and failure is a strong signal that the target site uses dynamic rendering. For AJAX endpoints, the primary risk is that hidden endpoints discovered via DevTools are undocumented and can change without notice, causing scrapers to fail silently. Mitigation requires periodic monitoring and robust error alerting in production pipelines.</p>
</div>

### E.2 — Insights from Scraped Data

```{r e2-insight-mp2}
neg_padat <- df_c1$country_name[which.max(df_c1$population)]
neg_luas  <- df_c1$country_name[which.max(df_c1$area_km2)]
pop_max   <- max(df_c1$population, na.rm=TRUE)

tim_top <- df_c2$team_name[which.max(df_c2$win_pct)]
wr_max  <- max(df_c2$win_pct, na.rm=TRUE)
avg_wr  <- mean(df_c2$win_pct, na.rm=TRUE)

film_top   <- df_c3$movie_title[which.max(df_c3$awards)]
awards_max <- max(df_c3$awards, na.rm=TRUE)
avg_nom    <- mean(df_c3$nominations, na.rm=TRUE)
pct_bp     <- mean(df_c3$best_picture=="Yes", na.rm=TRUE)*100
```

<div class="stat-grid">
  <div class="stat-card blue">
    <div class="val">`r nrow(df_c1)`</div>
    <div class="lbl">Countries</div>
  </div>
  <div class="stat-card teal">
    <div class="val">`r nrow(df_c2)`</div>
    <div class="lbl">NHL Hockey Teams</div>
  </div>
  <div class="stat-card purple">
    <div class="val">`r nrow(df_c3)`</div>
    <div class="lbl">Oscar Films (2010-15)</div>
  </div>
  <div class="stat-card emerald">
    <div class="val">`r nrow(df_c4)`</div>
    <div class="lbl">Turtle Families</div>
  </div>
  <div class="stat-card amber">
    <div class="val">`r neg_padat`</div>
    <div class="lbl">Most Populous Country</div>
  </div>
  <div class="stat-card rose">
    <div class="val">`r tim_top`</div>
    <div class="lbl">Highest Win Rate Team</div>
  </div>
</div>

<div class="insight-box">
  <div class="box-title">Key Findings Across 4 Scraped Datasets</div>
  <p>
    <strong>[Countries — Geography and Demography]</strong> `r neg_padat` is the most populous country at <strong>`r format(pop_max,big.mark=",")`</strong> people, while `r neg_luas` holds the largest land area. The fact that the most populous country is not the largest by area reflects a highly uneven global population distribution shaped by historical, climatic, and economic factors — with direct implications for infrastructure burden, market opportunity, and environmental pressure.<br><br>
    <strong>[NHL Hockey — Team Performance]</strong> `r tim_top` recorded the highest win rate at `r round(wr_max*100,1)`%, significantly above the league average of `r round(avg_wr*100,1)`%. In sports analytics, win rate is an efficiency metric — high win-rate teams demonstrate consistency, sound tactical systems, and roster depth, making it a more meaningful benchmark than raw total wins which can be skewed by games played.<br><br>
    <strong>[Oscar Films 2010-2015 — Film Industry]</strong> "<em>`r film_top`</em>" led with `r awards_max` wins. With an average of `r round(avg_nom,1)` nominations per film but only `r round(pct_bp,1)`% earning Best Picture, the data reveals the Academy's high selectivity: many nominations do not guarantee a win, reflecting the intense competition and inherent subjectivity of creative work evaluation.<br><br>
    <strong>[Turtles — Biodiversity]</strong> `r nrow(df_c4)` turtle families were catalogued — a structured reference relevant to reptile conservation research. Scraping biodiversity data from authoritative sources can support species protection programs and public education initiatives.
  </p>
</div>

---

# References {#referensi}

```{r referensi-tabel}
refs <- data.frame(
  No     = 1:6,
  Source = c("ScrapeThisSite.com — Countries of the World",
              "ScrapeThisSite.com — Hockey Teams (Pagination)",
              "ScrapeThisSite.com — Oscar Winning Films (AJAX/JSON)",
              "ScrapeThisSite.com — Turtles All the Way Down (iFrame)",
              "R Documentation — tidyverse, rvest, httr, xml2, DT",
              "E-Commerce Dataset — Midterm Exam, SD-1306 ITSB"),
  URL    = c("https://www.scrapethissite.com/pages/simple/",
              "https://www.scrapethissite.com/pages/forms/",
              "https://www.scrapethissite.com/pages/ajax-javascript/",
              "https://www.scrapethissite.com/pages/frames/",
              "https://www.tidyverse.org",
              "Classroom Dataset — Institut Teknologi dan Sains Bandung"),
  stringsAsFactors = FALSE
)

datatable(refs,
          caption  = "References and Data Sources",
          extensions='Buttons', options=list(dom='Bfrtip', pageLength=10, scrollX=TRUE, buttons=list('csv','excel','print')), rownames=FALSE)
```

---

<div class="footer-note">
  <strong>Data Science Programming I</strong> &nbsp;&middot;&nbsp;
  Mini Project 1: E-Commerce Analysis &nbsp;|&nbsp; Mini Project 2: Web Scraping &nbsp;&middot;&nbsp;
  Data Science, Faculty of Digital Design and Business &nbsp;&middot;&nbsp;
  Institut Teknologi dan Sains Bandung &nbsp;&middot;&nbsp;
  2025 / 2026
</div>
