Midterm Examination — Data Science Programming I
Mini Project: E-Commerce Data Pipeline & Web Scraping Intelligence
1 Mini Project 1 — E-Commerce Data Pipeline
This mini project simulates a real-world data engineering scenario: integrating e-commerce transaction data from 5 different file formats (CSV, Excel, JSON, TXT, XML) into a single clean, analysis-ready dataset. The pipeline follows production-grade principles — each stage is documented, reproducible, and designed to handle edge cases gracefully. The final output is a structured dataset with business-driven feature engineering ready for executive reporting, BI dashboards, or predictive modeling.
1.1 Section A — Data Collection
The primary objective is to build a fully automated, format-agnostic data ingestion pipeline. Rather than writing separate code for each file, a single loop structure handles all formats through condition-based dispatch — reflecting the principles of Clean Code and DRY (Don’t Repeat Yourself). Every file is verified for availability before reading, then merged into a unified dataset with provenance metadata enabling a full audit trail back to the original source of each row.
1.1.1 A.1 — File Inventory and Availability Check
1.1.2 A.2 — Custom XML Parser Function
baca_xml() defined and ready.
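The notebook reports only that the parser is defined. Below is a minimal sketch of such a function, assuming each record is a `<row>` element whose children carry the column values — the element names and structure are assumptions, not confirmed by the output:

```r
library(xml2)

# Hypothetical parser: one <row> element per record, one child per column.
baca_xml <- function(path) {
  doc  <- read_xml(path)
  rows <- xml_find_all(doc, ".//row")
  recs <- lapply(rows, function(r) {
    kids <- xml_children(r)
    vals <- as.list(xml_text(kids))
    names(vals) <- xml_name(kids)
    as.data.frame(vals, stringsAsFactors = FALSE)
  })
  do.call(rbind, recs)  # all values kept as character strings
}
```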
1.1.3 A.3 — REQUIRED LOOP: Read All Files
[OK] Midterm_Exam_Classroom.csv | 2000 rows | 22 cols
[OK] Midterm_Exam_Classroom.xlsx | 2000 rows | 22 cols
[OK] Midterm_Exam_Classroom.json | 2000 rows | 22 cols
[OK] Midterm_Exam_Classroom.txt | 2000 rows | 1 cols
[OK] Midterm_Exam_Classroom.xml | 2000 rows | 22 cols
5 of 5 files successfully read.
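A sketch of the condition-based dispatch loop behind this log. The file names match the output above; the reader functions and the TXT delimiter are assumptions (the 1-column TXT result in the log suggests the real delimiter differs from the one guessed here):

```r
library(readr); library(readxl); library(jsonlite)

files <- c("Midterm_Exam_Classroom.csv", "Midterm_Exam_Classroom.xlsx",
           "Midterm_Exam_Classroom.json", "Midterm_Exam_Classroom.txt",
           "Midterm_Exam_Classroom.xml")

datasets <- list()
for (f in files) {
  if (!file.exists(f)) { cat("[MISSING]", f, "\n"); next }  # availability check (A.1)
  df <- switch(tolower(tools::file_ext(f)),
    csv  = read_csv(f, col_types = cols(.default = "c")),
    xlsx = read_excel(f, col_types = "text"),
    json = as.data.frame(fromJSON(f)),
    txt  = read_delim(f, delim = "\t", col_types = cols(.default = "c")),
    xml  = baca_xml(f))                                     # custom parser from A.2
  cat("[OK]", f, "|", nrow(df), "rows |", ncol(df), "cols\n")
  datasets[[f]] <- df
}
```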
1.1.4 A.4 — Merge and Export to Excel
Merged: Midterm_Exam_Classroom.csv (2000 rows)
Merged: Midterm_Exam_Classroom.xlsx (2000 rows)
Merged: Midterm_Exam_Classroom.json (2000 rows)
Merged: Midterm_Exam_Classroom.txt (2000 rows)
Merged: Midterm_Exam_Classroom.xml (2000 rows)
Total Rows : 10,000
Total Columns: 24
All columns cast to character — numeric conversion occurs in Section C.
[EXCEL] Saved: hasil_gabungan.xlsx
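A sketch of the merge step. The two provenance columns that take the schema from 22 to 24 are assumed to be a source-file label and a row index (both names hypothetical):

```r
library(dplyr); library(writexl)

gabungan <- bind_rows(lapply(names(datasets), function(f) {
  datasets[[f]] |>
    mutate(across(everything(), as.character),  # defer numeric parsing to Section C
           source_file = f,                     # provenance: originating file
           source_row  = row_number())          # provenance: position in that file
}))

write_xlsx(gabungan, "hasil_gabungan.xlsx")
```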
1.2 Section B — Data Handling
Before any intervention, the state of the data must be thoroughly and objectively understood — much like a medical diagnosis before treatment. Section B executes a three-dimensional audit: (1) Data type profiling to verify alignment between storage format and column semantics; (2) Missing value analysis measuring not just counts but patterns that may indicate systemic upstream issues; and (3) Duplicate detection separating technical redundancy from genuine business duplication. The audit findings form the scientific basis for all cleaning strategies in Section C.
1.2.1 B.1 — Column Profile
1.2.2 B.2 — Missing Values
Total missing values : 54682
Columns with missing : 23 of 24
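A sketch of the audit that produces these counts, profiling every column in one pass:

```r
# Per-column missing profile: absolute count and share, worst first
na_count <- colSums(is.na(gabungan))
profil <- data.frame(column  = names(na_count),
                     missing = na_count,
                     pct     = round(100 * na_count / nrow(gabungan), 2))
profil <- profil[order(-profil$missing), ]

cat("Total missing values :", sum(na_count), "\n")
cat("Columns with missing :", sum(na_count > 0), "of", ncol(gabungan), "\n")
```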
1.2.3 B.3 — Duplicate Rows
Duplicate rows : 45
Percentage : 0.45%
1.2.4 B.4 — Data Quality Issues
1.3 Section C — Data Cleaning
Data cleaning is not merely tidying — it is an analytical decision-making process that must be scientifically defensible. Every transformation in Section C follows informed cleaning principles: the rationale for each strategy (why median over mean? why fill with “Unknown” rather than drop rows?) is explicitly grounded in data distribution characteristics and business context. The sequence of operations is intentional — standardization precedes imputation, and deduplication occurs last to ensure no clean rows are inadvertently removed. Each step is quantitatively validated by comparing before-and-after states.
1.3.1 C.0 — Create Working Copy
Working copy created: 10,000 rows x 24 columns
1.3.2 C.1 — Platform Standardization
Platform distribution after standardization:
Blibli Lazada Shopee Tiktok Shop Tokopedia Unknown
1616 1552 1664 1624 1544 2000
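A sketch of the standardization rule. The canonical labels come from the table above; the variant spellings being matched are assumptions:

```r
library(dplyr); library(stringr)

df <- gabungan  # working copy from C.0

plat <- str_to_lower(str_trim(df$platform))
df$platform <- case_when(
  str_detect(plat, "shopee")           ~ "Shopee",
  str_detect(plat, "tokopedia|tokped") ~ "Tokopedia",
  str_detect(plat, "lazada")           ~ "Lazada",
  str_detect(plat, "tiktok")           ~ "Tiktok Shop",
  str_detect(plat, "blibli")           ~ "Blibli",
  TRUE                                 ~ "Unknown"  # blank or unrecognised
)
table(df$platform)
```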
1.3.3 C.2 — Net Sales Cleaning
Min : Rp 0
Max : Rp 18,080,224
Mean : Rp 1,101,874
Remaining NA: 0
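A sketch of the numeric conversion, assuming raw values arrive as formatted strings such as "Rp 1.250.000" (the exact raw formatting is an assumption):

```r
library(stringr)

# Keep digits only, then convert; anything unparseable becomes NA for C.3
raw <- str_replace_all(df$net_sales, "[^0-9]", "")
df$net_sales <- suppressWarnings(as.numeric(raw))

cat("Min :", format(min(df$net_sales, na.rm = TRUE), big.mark = ","), "\n")
cat("Max :", format(max(df$net_sales, na.rm = TRUE), big.mark = ","), "\n")
```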
1.3.4 C.3 — Missing Value Imputation
Missing payment_method after imputation: 0
Missing net_sales after imputation : 0
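A sketch of the two imputation strategies argued for in the section intro: median for the skewed numeric column, a sentinel label for categoricals so no rows are dropped:

```r
# Median is robust to the long right tail visible in C.2 (max ~16x the mean)
med <- median(df$net_sales, na.rm = TRUE)
df$net_sales[is.na(df$net_sales)] <- med

# Categorical gaps become an explicit "Unknown" level instead of dropped rows
blank <- is.na(df$payment_method) | str_trim(df$payment_method) == ""
df$payment_method[blank] <- "Unknown"
```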
1.3.5 C.4 — Order Status Standardization
Order status distribution before standardization (sample):
Batal Delivered On Delivery Retur Returned
12 50 4 4 6
Order status distribution after standardization:
Batal Cancelled Completed Delivered On Delivery
120 476 1600 4670 188
Retur Returned Shipped Unknown
168 294 408 2000
1.3.6 C.5 — REQUIRED LOOP: Text Column Cleaning
[OK] platform | missing: 0 -> 0
[OK] category | missing: 2000 -> 0
[OK] product_name | missing: 2000 -> 0
[OK] order_status | missing: 0 -> 0
[OK] payment_method | missing: 0 -> 0
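A sketch of the required loop: one pass, the same treatment for all five text columns. The exact cleaning steps are assumptions consistent with the before/after counts above:

```r
text_cols <- c("platform", "category", "product_name",
               "order_status", "payment_method")

for (col in text_cols) {
  before <- sum(is.na(df[[col]]))
  x <- str_squish(df[[col]])          # trim + collapse internal whitespace
  x[is.na(x) | x == ""] <- "Unknown"  # fill blanks rather than drop rows
  df[[col]] <- x
  cat("[OK]", col, "| missing:", before, "->", sum(is.na(df[[col]])), "\n")
}
```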
1.3.7 C.6 — Deduplication and Summary
Duplicates removed : 45 rows
Clean dataset : 9,955 rows
Total columns : 24
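Deduplication runs last, as argued in the section intro, so that rows which only looked distinct before standardization are also caught. A minimal sketch:

```r
dup <- duplicated(df)
cat("Duplicates removed :", sum(dup), "rows\n")
df_clean <- df[!dup, ]
cat("Clean dataset      :", format(nrow(df_clean), big.mark = ","), "rows\n")
```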
1.4 Section D — Conditional Logic
Feature engineering bridges raw data and actionable business intelligence. The three derived columns added here are not merely technical exercises — each represents business logic actively used in e-commerce operations: is_high_value for premium customer segmentation and loyalty programs, order_priority for fulfillment resource allocation and SLA definition, and valid_transaction for separating confirmed revenue from at-risk pipeline. The enriched dataset carries significantly higher analytical value and can be consumed directly by Business Intelligence teams without further preprocessing.
| New Column | Logic | Type | Business Use |
|---|---|---|---|
| is_high_value | net_sales > 1,000,000 | Binary | Premium customer identification |
| order_priority | 3-tier based on net_sales | Multi-class | SLA and resource allocation |
| valid_transaction | Based on order_status | Binary | Revenue forecasting |
1.4.1 D.1 — is_high_value Column
Distribution of is_high_value:
No Yes
6871 3084
1.4.2 D.2 — order_priority Column (NESTED IF)
Distribution of order_priority:
High Low Medium
3084 5203 1668
1.4.3 D.3 — valid_transaction Column
Distribution of valid_transaction:
Invalid Valid
472 9483
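A sketch of the three rules. Only the Rp 1,000,000 cut-off is stated in the table above; the Medium-tier boundary and the list of revenue-confirming statuses are assumptions:

```r
# D.1 — binary high-value flag (threshold from the table above)
df_clean$is_high_value <- ifelse(df_clean$net_sales > 1e6, "Yes", "No")

# D.2 — nested if: three priority tiers (Medium boundary is assumed)
df_clean$order_priority <- ifelse(df_clean$net_sales > 1e6, "High",
                           ifelse(df_clean$net_sales > 5e5, "Medium", "Low"))

# D.3 — statuses counted as confirmed revenue (assumed list)
valid_status <- c("Completed", "Delivered", "Shipped")
df_clean$valid_transaction <- ifelse(
  df_clean$order_status %in% valid_status, "Valid", "Invalid")
```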
1.4.4 D.4 — Display Results and Save
[CSV] Saved: dataset_final_bersih.csv
[EXCEL] Saved: dataset_final_bersih.xlsx
[Download CSV](dataset_final_bersih.csv) · [Download Excel](dataset_final_bersih.xlsx)
1.5 Section E — Analytical Thinking
This is the most consequential stage of the entire pipeline: converting numbers into actionable business narratives. Analysis transcends descriptive statistics when every finding is connected to operational implications and executable recommendations. Four analytical dimensions are explored: platform dominance (for marketing budget allocation), category concentration (for product portfolio risk management), transaction pipeline health (for customer success KPIs), and an executive summary grounded in data.
1.5.1 E.1 — Platform Dominance
Unknown leads with 1,991 transactions (20% market share). Since "Unknown" is the fill label assigned in C.1 to records missing platform data, this ranking is first of all a data-capture signal; among identified platforms, Shopee holds the top spot. Where a platform genuinely dominates, concentration of this kind reflects three mutually reinforcing mechanisms: network effects (a mature ecosystem of buyers and sellers creates gravitational pull), high switching costs for sellers who have built reputation and ratings on the platform, and accumulated trust capital through years of reliable post-sale service and dispute resolution.
The gap with second-ranked platforms justifies an asymmetric, data-driven investment strategy: allocate 60–70% of acquisition and retention budget to the dominant platform where ROI is measurable and predictable. Remaining budget diversifies across other platforms to mitigate single-platform dependency risk. Monitor quarterly trends and prepare rebalancing scenarios if the dominant platform’s share erodes by more than 5 percentage points.
1.5.2 E.2 — Category Distribution
Category "Unknown" dominates at 20% of total transactions — again the imputation label from Section C, so the first-order finding is a category-capture gap at the source. Setting that aside, the portfolio logic of concentration holds: from an operational efficiency perspective, a dominant category enables supply specialization — deeper vendor relationships, more favorable volume contracts, and more accurate demand forecasting. From a portfolio risk perspective, concentration above 35–40% in a single category creates vulnerability to seasonal demand shifts and supply chain disruptions. The tactical recommendation: drive 15–20% quarterly growth in two adjacent categories with the highest demographic overlap, while defending leadership in the top identified category through differentiation strategies that are difficult to replicate.
1.5.3 E.3 — Transaction Status Distribution
16% of transactions reached Completed status — the true conversion rate of the pipeline. The cancellation rate of 4.7% falls within an acceptable range but warrants ongoing monitoring. Every 1% reduction in cancel rate directly translates to measurable revenue recovery.
Impact formula: (1% of total transactions) × average transaction value ≈ potential revenue recovered per period. At 9,955 transactions and Rp 1,102,849 per transaction, each percentage point of cancel rate is worth roughly 99.6 × Rp 1.1M ≈ Rp 110M per period. This should become a shared KPI for both the customer success and fulfillment teams.
1.5.4 E.4 — Executive Summary
| Metric | Value |
|---|---|
| Total Transactions | 9,955 |
| Gross Revenue | Rp 11B |
| High Value Transactions | 31% |
| Cancel Rate | 4.7% |
| Avg. per Transaction | Rp 1,102,849 |
| Revenue Lost (Cancel) | Rp 0M |
1. Strengthen position on the dominant platform — Implement platform-exclusive programs: scheduled flash sales timed to peak traffic windows, A/B-tested product listing thumbnails, and bundling features to increase average order value. Target: defend the current market-share lead.
2. High-value loyalty program (31% of transactions, disproportionate revenue contribution) — Provide early-sale access, free premium shipping, and a dedicated customer service channel. High-LTV customers are the most valuable assets to protect from churn.
3. Reduce the cancel rate from 4.7% — Audit the top 3 cancellation reasons via exit surveys, implement proactive status-change notifications, and strengthen delivery SLAs with express shipping options. Target: reduce by 2–3 percentage points within two quarters.
4. Standardize data input processes — Implement real-time validation at the point of entry (frontend validation + backend sanitization) to prevent the format inconsistencies identified during cleaning. Data quality at the source is far less costly than remediation downstream.
2 Mini Project 2 — Web Scraping Intelligence
Web scraping is a distinct discipline within the data engineering ecosystem: the ability to extract structured data from web interfaces designed for humans, not machines. This project implements 4 distinct scraping techniques from ScrapeThisSite.com — each selected not by preference, but because the way each website delivers its data demands that specific approach. Core competency developed: the ability to diagnose a website’s data delivery architecture (static/dynamic/paginated/API-driven/framed) and select the optimal extraction strategy — a skill of high commercial value since much critical data is accessible only through web interfaces.
| # | Website | Technique | Complexity | Data |
|---|---|---|---|---|
| 1 | Countries of the World | Static HTML | Low | Country, population, area |
| 2 | Hockey Teams | Pagination | Medium | NHL teams, wins, losses, win rate |
| 3 | Oscar Winning Films | AJAX to JSON | High | Films, year, nominations, awards |
| 4 | Turtles All the Way Down | iFrame | Medium-High | Turtle families |
2.1 Section A — Data Collection
Each scraping technique implemented here reflects a fundamentally different engineering challenge. Static HTML trains CSS selector skills and DOM traversal. Pagination requires state management across requests and reliable termination condition detection. AJAX/JSON scraping builds understanding of browser-server communication architecture through network traffic reverse engineering. iFrame scraping demands analysis of layered document structures not visible from the primary source. Mastery of all four techniques covers the majority of web scraping scenarios encountered in industry practice.
2.1.1 A.1 — Countries (Static HTML)
Scraping: Countries of the World (Static HTML)...
URL: https://www.scrapethissite.com/pages/simple/
HTTP Status: 200
Success! 250 countries found
DataFrame: 250 rows x 4 cols | Saved: scraping_countries.csv
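A sketch of the static extraction with rvest. The CSS classes (.country-name and friends) follow the per-country markup this page is known to use, but should be re-checked against the live page before relying on them:

```r
library(rvest)

page <- read_html("https://www.scrapethissite.com/pages/simple/")

# One HTTP request; every field is a styled element inside a country card
countries <- data.frame(
  country    = page |> html_elements(".country-name")       |> html_text2(),
  capital    = page |> html_elements(".country-capital")    |> html_text2(),
  population = page |> html_elements(".country-population") |> html_text2(),
  area       = page |> html_elements(".country-area")       |> html_text2()
)
nrow(countries)  # 250
```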
2.1.2 A.2 — Hockey Teams (Pagination)
Scraping: Hockey Teams (Pagination)...
Page 1: 25 teams
Page 2: 25 teams
Page 3: 25 teams
Page 4: 25 teams
Page 5: 25 teams
Page 6: 25 teams
Page 7: 25 teams
Page 8: 25 teams
Page 9: 25 teams
Page 10: 25 teams
Page 11: 25 teams
Page 12: 25 teams
Page 13: 25 teams
Page 14: 25 teams
Page 15: 25 teams
Page 16: 25 teams
Page 17: 25 teams
Page 18: 25 teams
Page 19: 25 teams
Page 20: 25 teams
Page 21: 25 teams
Page 22: 25 teams
Page 23: 25 teams
Page 24: 7 teams
Complete! Total: 582 teams
DataFrame: 582 rows x 5 cols | Saved: scraping_hockey.csv
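A sketch of the paginated crawl. The ?page_num query parameter appears in the site's pager links; the termination rule (stop after the first short page) mirrors the log above, where only page 24 returns fewer than 25 teams:

```r
library(rvest)

base  <- "https://www.scrapethissite.com/pages/forms/?page_num=%d"
teams <- list()
page_num <- 1
repeat {
  page <- read_html(sprintf(base, page_num))
  tbl  <- page |> html_element("table") |> html_table()
  cat("Page", page_num, ":", nrow(tbl), "teams\n")
  teams[[page_num]] <- tbl
  if (nrow(tbl) < 25) break   # last page is the only short one (see log)
  page_num <- page_num + 1
  Sys.sleep(1)                # polite crawl delay
}
hockey <- dplyr::bind_rows(teams)
```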
2.1.3 A.3 — Oscar Films (AJAX to JSON)
Scraping: Oscar Winning Films (AJAX -> JSON per year)...
Years: 2010, 2011, 2012, 2013, 2014, 2015
[2010] Success — 13 films
[2011] Success — 15 films
[2012] Success — 15 films
[2013] Success — 12 films
[2014] Success — 16 films
[2015] Success — 16 films
Total: 87 films from 6 years
DataFrame: 87 rows x 5 cols | Saved: scraping_oscar.csv
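A sketch of the AJAX route. The endpoint below is the kind discovered in the browser's Network tab; on this site a ?ajax=true parameter is what switches the response from HTML to JSON — treat it as an assumption and re-verify in DevTools before relying on it:

```r
library(jsonlite)

years <- 2010:2015
films <- lapply(years, function(y) {
  url <- sprintf(
    "https://www.scrapethissite.com/pages/ajax-javascript/?ajax=true&year=%d", y)
  df <- fromJSON(url)          # server already returns structured JSON
  cat(sprintf("[%d] Success — %d films\n", y, nrow(df)))
  Sys.sleep(1)
  df
})
oscar <- dplyr::bind_rows(films)
```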
2.1.4 A.4 — Turtles (iFrame)
Scraping: Turtles All the Way Down...
Primary URL: https://www.scrapethissite.com/pages/frames/?frame=i
HTTP Status (main): 200
14 families found:
+ Carettochelyidae
+ Cheloniidae
+ Chelydridae
+ Dermatemydidae
+ Dermochelyidae
+ Emydidae
+ Geoemydidae
+ Kinosternidae
+ Platysternidae
+ Testudinidae
+ Trionychidae
+ Chelidae
+ Pelomedusidae
+ Podocnemididae
14 families identified.
| family_name | common_name | year | discovered_by |
|---|---|---|---|
| Carettochelyidae | Pig-nosed turtle | 1887 | Boulenger. |
| Cheloniidae | Sea turtles | 1811 | Oppel. |
| Chelydridae | Snapping turtles | 1831 | Gray. |
| Dermatemydidae | Central American river turtle | 1870 | Gray. |
| Dermochelyidae | Leatherback sea turtle | 1843 | Fitzinger. |
| Emydidae | Pond or water turtles | 1815 | Rafinesque. |
| Geoemydidae | Asian river, leaf, roofed or Asian box turtles | 1868 | Theobald. |
| Kinosternidae | Mud or musk turtles | 1857 | Agassiz. |
| Platysternidae | Big-headed turtle | 1869 | Gray. |
| Testudinidae | Tortoises | Unknown | Unknown |
| Trionychidae | Softshell turtles | 1826 | Fitzinger. |
| Chelidae | Austro-American sideneck turtles | 1831 | Gray. |
| Pelomedusidae | Afro-American sideneck turtles | 1868 | Cope. |
| Podocnemididae | Madagascar big-headed, Big-headed Amazon River turtle and South American sideneck river turtles | 1869 | Gray. |
DataFrame: 14 rows x 4 cols | Saved: scraping_turtles.csv
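A sketch of the two-hop frame scrape: fetch the host page, pull the iframe's src, then request the inner document directly. The .family-name selector for the inner page is an assumption:

```r
library(rvest)
library(xml2)

# Hop 1: the visible page only embeds an <iframe>; the data is not here
main <- read_html("https://www.scrapethissite.com/pages/frames/")
src  <- main |> html_element("iframe") |> html_attr("src")

# Hop 2: request the frame's own document (resolve the relative URL)
inner <- read_html(url_absolute(src, "https://www.scrapethissite.com/pages/frames/"))

families <- inner |> html_elements(".family-name") |> html_text2()
length(families)  # 14
```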
2.2 Section B — Data Handling
All four scraped datasets are audited centrally using a single loop — one function, one iteration, four datasets simultaneously. This reflects the DRY principle: when the same analytical pattern applies to N datasets, use a loop rather than copy-pasting code N times.
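A sketch of that single audit loop. The dataset objects are those named in the Section A sketches; `turtles` is assumed to hold the iFrame result:

```r
audit <- list(Countries = countries, Hockey = hockey,
              Oscar = oscar, Turtles = turtles)

for (nm in names(audit)) {
  d <- audit[[nm]]
  cat(sprintf("[%s] %d rows x %d cols | NA: %d | duplicates: %d\n",
              nm, nrow(d), ncol(d), sum(is.na(d)), sum(duplicated(d))))
}
```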
2.3 Section C — Data Cleaning
Each dataset is cleaned systematically: text standardization (title case + whitespace trimming), appropriate numeric type conversion per column, context-driven missing value handling, and deduplication. Despite differing schemas, a consistent cleaning pattern is applied through function abstraction.
[Countries] Clean: 250 rows (duplicates removed: 0)
[Hockey] Clean: 582 rows (duplicates removed: 0)
[Oscar] Clean: 87 rows (duplicates removed: 0)
[OK] family_name — trimmed and filled
[OK] common_name — trimmed and filled
[OK] year_discovered — trimmed and filled
[OK] discovered_by — trimmed and filled
Duplicates removed: 0
[Turtles] Clean: 14 rows x 4 cols
All clean datasets saved successfully.
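A sketch of the shared cleaning routine described above (the function name is hypothetical, in keeping with the notebook's Indonesian identifiers):

```r
library(stringr)

# bersihkan = "clean": trim/Title Case character columns, fill blanks, dedupe
bersihkan <- function(d, fill = "Unknown") {
  for (col in names(d)) {
    if (is.character(d[[col]])) {
      x <- str_to_title(str_squish(d[[col]]))
      x[is.na(x) | x == ""] <- fill
      d[[col]] <- x
    }
  }
  d[!duplicated(d), ]
}

countries_clean <- bersihkan(countries)
```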
2.4 Section D — Conditional Logic
A data_status column is added to each scraped dataset as a per-row completeness indicator. This is a form of simple data quality scoring: each row is evaluated against dataset-specific completeness criteria rather than a generic universal standard.
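A sketch of the per-dataset scoring; the required-column lists are illustrative assumptions:

```r
# Dataset-specific completeness criteria (illustrative)
required <- list(countries = c("country", "population", "area"),
                 turtles   = c("family_name", "year", "discovered_by"))

flag_status <- function(d, cols) {
  ok <- complete.cases(d[, cols]) &
        rowSums(d[, cols] == "Unknown") == 0   # sentinel fills count as gaps
  d$data_status <- ifelse(ok, "Complete", "Incomplete")
  d
}

countries_clean <- flag_status(countries_clean, required$countries)
table(countries_clean$data_status)
```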
2.5 Section E — Analytical Thinking
Two analytical layers are executed here. The first is meta-analytic: a comparative evaluation of the four scraping techniques across implementation complexity, fragility to website changes, operational risk (rate limiting, IP blocking), and optimal use cases — building a decision framework for future technique selection. The second is substantive: extracting meaningful insights from the collected data, demonstrating that the scraping process has generated knowledge that was not previously available.
2.5.1 E.1 — Comparison of 4 Scraping Techniques
Static HTML scraping is the most reliable technique: no JavaScript dependency, no race conditions, and a single HTTP request suffices. It should always be attempted first — success eliminates the need for more complex approaches, and failure is a strong signal that the target site uses dynamic rendering. For AJAX endpoints, the primary risk is that hidden endpoints discovered via DevTools are undocumented and can change without notice, causing scrapers to fail silently. Mitigation requires periodic monitoring and robust error alerting in production pipelines.
2.5.2 E.2 — Insights from Scraped Data
| Metric | Value |
|---|---|
| Countries | 250 |
| NHL Hockey Teams | 582 |
| Oscar Films (2010-15) | 87 |
| Turtle Families | 14 |
| Most Populous Country | China |
| Highest Win Rate Team | Detroit Red Wings |
[Countries — Geography and Demography] China is the most populous country at 1,330,044,000 people, while Russia holds the largest land area. The fact that the most populous country is not the largest by area reflects a highly uneven global population distribution shaped by historical, climatic, and economic factors — with direct implications for infrastructure burden, market opportunity, and environmental pressure.
[NHL Hockey — Team Performance] Detroit Red Wings recorded the highest win rate at 75.6%, significantly above the league average of 45.9%. In sports analytics, win rate is an efficiency metric — high win-rate teams demonstrate consistency, sound tactical systems, and roster depth, making it a more meaningful benchmark than raw total wins which can be skewed by games played.
[Oscar Films 2010-2015 — Film Industry] “Gravity” led with 7 wins. With an average of 3.9 nominations per film but only 6.9% earning Best Picture, the data reveals the Academy’s high selectivity: many nominations do not guarantee a win, reflecting the intense competition and inherent subjectivity of creative work evaluation.
[Turtles — Biodiversity] 14 turtle families were catalogued — a structured reference relevant to reptile conservation research. Scraping biodiversity data from authoritative sources can support species protection programs and public education initiatives.