PEMROGRAMAN SAINS DATA
Midterm Exam
Veronica Maria Lucia Ferreira Xavier
Januaria Teresinha
Chricyesia Winnerlady Frexisovara Uvas
1 Mini Project: CASE STUDY E-Commerce
Di era ekonomi digital saat ini, industri E-Commerce menghasilkan volume data yang sangat besar setiap harinya. Data tersebut berasal dari berbagai platform seperti Shopee, Tokopedia, Lazada, dan TikTok Shop. Tantangan utama yang dihadapi oleh seorang Data Analyst bukanlah kelangkaan data, melainkan bagaimana cara mengumpulkan data yang tersebar dalam berbagai format (CSV, JSON, Excel, XML, TXT) dan membersihkannya agar siap dianalisis.
Dataset yang digunakan dalam studi kasus ini mencerminkan kondisi data di dunia nyata yang sering kali “kotor”—memiliki format tanggal yang tidak konsisten, adanya nilai yang hilang (missing values), penulisan teks yang tidak seragam, hingga duplikasi data. Oleh karena itu, kemampuan untuk melakukan Data Collecting, Data Handling, dan Data Cleaning menggunakan pemrograman (Python/R) menjadi kompetensi yang sangat krusial.
Studi kasus ini menggunakan dataset transaksi E-Commerce yang dikumpulkan dari 5 sumber file berbeda dengan karakteristik sebagai berikut:
ecommerce.csv: Data transaksi dalam format standar CSV.
ecommerce.xlsx: Data mentah yang merepresentasikan operasional harian.
ecommerce.json: Data dengan struktur nested yang biasanya berasal dari API.
ecommerce.txt: Data teks dengan pemisah khusus (pipe-separated).
ecommerce.xml: Data berbasis tag yang sering digunakan untuk pertukaran data antar sistem.
Tujuan utama dari pengerjaan mini project ini adalah:
Data Collection: Mengotomatisasi proses pembacaan dan penggabungan berbagai format file menggunakan teknik looping dan conditional logic.
Data Cleaning: Melakukan standardisasi data, menangani missing values, serta memperbaiki tipe data yang tidak sesuai (misal: mengubah teks harga menjadi numerik).
Business Logic: Menerapkan logika bisnis melalui conditional logic untuk mengidentifikasi transaksi bernilai tinggi dan prioritas pesanan.
Insight Discovery: Menarik kesimpulan strategis mengenai platform, kategori produk, dan status transaksi yang paling dominan dalam bisnis.
1.1 SECTION A – DATA COLLECTION USING PROGRAMMING
=== MEMULAI PROSES INGESTI DATA ===
------------------------------------------------------------
File : ecommerce.csv
Jml Baris : 2000
Total Kolom : 22
Nama Kolom : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag
Status : Ready to merge (Base Reference)
------------------------------------------------------------
File : ecommerce.json
Jml Baris : 2000
Total Kolom : 22
Nama Kolom : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag
Status : Ready to merge
------------------------------------------------------------
File : ecommerce.txt
Jml Baris : 2000
Total Kolom : 22
Nama Kolom : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag
Status : Ready to merge
------------------------------------------------------------
File : ecommerce.xlsx
Jml Baris : 2000
Total Kolom : 22
Nama Kolom : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag
Status : Ready to merge
------------------------------------------------------------
File : ecommerce.xml
Jml Baris : 2000
Total Kolom : 22
Nama Kolom : order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag
Status : Ready to merge
============================================================
HASIL PENGGABUNGAN DATA
TOTAL BARIS TERGABUNG : 10000
TOTAL KOLOM : 23
============================================================
Table: Preview Dataset Mentah
| order_id | order_date | ship_date | platform | category | product_name | unit_price | quantity | gross_sales | campaign | voucher_code | discount_pct | discount_value | shipping_cost | net_sales | payment_method | customer_segment | region | stock_status | order_status | customer_rating | priority_flag | source_file |
|:--------:|:----------:|:----------:|:-----------:|:-----------:|:--------------:|:----------:|:--------:|:-----------:|:-------------:|:------------:|:------------:|:--------------:|:-------------:|:---------:|:---------------:|:----------------:|:----------:|:------------:|:------------:|:---------------:|:-------------:|:-------------:|
| ORD00612 | 2024-04-19 | 2024/04/24 | Tokopedia | home living | Table Lamp | 188905 | 4 | 755620 | Flash Sale | DISC10 | 10 | 75562 | 12000 | 680058 | E-Wallet | VIP | Bekasi | Preorder | completed | 5 | Y | ecommerce.csv |
| ORD00112 | 2024/01/24 | 29/01/2024 | TikTok Shop | Electronics | Power Bank | 1476873 | 1 | 1476873 | Normal Day | NONE | 0 | 0 | 18000 | 1476873 | cod | Returning | Makassar | In Stock | completed | 5 | N | ecommerce.csv |
| ORD01186 | 2024-06-12 | 06-19-2024 | Tokopedia | Fashion | Women Dress | 231072 | 2 | 462144 | Mega Campaign | DISC20 | 20 | 92429 | 15000 | 369715 | Virtual Account | Returning | Surabaya | In Stock | delivered | 3 | Yes | ecommerce.csv |
| ORD01511 | 2024/08/07 | 08-14-2024 | Tokopedia | home living | Vacuum Cleaner | 512063 | 3 | 1536189 | Flash Sale | DISC10 | 10 | 153619 | 0 | 1382570 | Transfer Bank | New | Yogyakarta | In Stock | DELIVERED | 4 | No | ecommerce.csv |
| ORD00772 | 2024-12-08 | | Tokopedia | Beauty | Body Lotion | 221586 | 2 | 443172 | Payday Sale | DISC15 | 15 | Rp 66.476 | 0 | 376696 | COD | Returning | Surabaya | In Stock | DELIVERED | 5 | normal | ecommerce.csv |
| ORD00880 | 2024/04/06 | | blibli | Beauty | Lip Tint | 297973 | 8 | 2383784 | Normal Day | NONE | 0 | 0 | 12000 | 0 | credit card | VIP | Makassar | Low Stock | Cancelled | NA | Yes | ecommerce.csv |
| ORD00592 | 03-15-2024 | | Blibli | Fashion | Sneakers | 102844 | 8 | 822752 | Mega Campaign | DISC20 | 20 | 164550 | 25000 | 0 | E-Wallet | Returning | Bekasi | In Stock | CANCEL | NA | Yes | ecommerce.csv |
| ORD01367 | 07-24-2024 | 2024-07-25 | Blibli | Sports | Dumbbell 5kg | 297954 | 10 | 2979540 | Clearance | CLEAR30 | 30 | 893862 | 20000 | 2085678 | COD | New | Yogyakarta | In Stock | delivered | 5 | N | ecommerce.csv |
| ORD01178 | 11-14-2024 | 2024-11-16 | Blibli | Sports | Dumbbell 5kg | 710898 | 2 | 1421796 | Payday Sale | DISC15 | 15 | 213269 | 18000 | 1208527 | Virtual Account | VIP | Bandung | In Stock | delivered | 5 | PRIORITY | ecommerce.csv |
| ORD00276 | 2024/06/15 | 17/06/2024 | Lazada | Sports | Cycling Gloves | 526279 | 4 | 2105116 | Payday Sale | DISC15 | 15 | 315767 | 20000 | 1789349 | E-Wallet | Returning | Bandung | In Stock | delivered | 3 | normal | ecommerce.csv |
1.2 SECTION B – DATA HANDLING
=== ANALISIS KONDISI DATASET ===
1. Dimensi Dataset:
- Total Baris : 10000
- Total Kolom : 23
2. Tipe Data Setiap Kolom:
Kolom Tipe
order_id character
order_date character
ship_date character
platform character
category character
product_name character
unit_price character
quantity character
gross_sales character
campaign character
voucher_code character
discount_pct character
discount_value character
shipping_cost character
net_sales character
payment_method character
customer_segment character
region character
stock_status character
order_status character
customer_rating character
priority_flag character
source_file character
3. Identifikasi Missing Values per Kolom:
ship_date voucher_code discount_pct payment_method customer_rating
1000 245 345 175 2030
priority_flag
940
4. Identifikasi Baris Duplikat:
- Jumlah baris yang duplikat: 45 baris
5. Ringkasan Masalah Kualitas Data:
a. Tipe Data Tidak Sesuai: Kolom angka (unit_price, quantity, net_sales) masih bertipe character.
b. Missing Values: Terdapat nilai kosong pada kolom-kolom tertentu (seperti customer_rating).
c. Baris Duplikat: Ditemukan 45 baris yang identik.
d. Inkonsistensi Format: Format tanggal dan penulisan teks (kapitalisasi) belum seragam.
1.3 SECTION C – DATA CLEANING
=== MEMULAI PROSES PEMBERSIHAN DATA (VERSI FINAL) ===
Proses pembersihan selesai dengan sukses.
Table: Dataset Bersih dan Terstruktur (Section C)
| order_id | order_date | ship_date | platform | category | product_name | unit_price | quantity | gross_sales | campaign | voucher_code | discount_pct | discount_value | shipping_cost | net_sales | payment_method | customer_segment | region | stock_status | order_status | customer_rating | priority_flag | source_file |
|:--------:|:----------:|:----------:|:-----------:|:-----------:|:--------------:|:----------:|:--------:|:-----------:|:-------------:|:------------:|:------------:|:--------------:|:-------------:|:---------:|:---------------:|:----------------:|:----------:|:------------:|:------------:|:---------------:|:-------------:|:-------------:|
| ORD00612 | 2024-04-19 | 2024/04/24 | Tokopedia | Home Living | Table Lamp | 188905 | 4 | 755620 | Flash Sale | DISC10 | 10 | 75562.000 | 12000 | 680058 | E-Wallet | VIP | Bekasi | Preorder | Completed | 5 | Y | ecommerce.csv |
| ORD00112 | 2024/01/24 | 29/01/2024 | Tiktok Shop | Electronics | Power Bank | 1476873 | 1 | 1476873 | Normal Day | NONE | 0 | 0.000 | 18000 | 1476873 | COD | Returning | Makassar | In Stock | Completed | 5 | N | ecommerce.csv |
| ORD01186 | 2024-06-12 | 06-19-2024 | Tokopedia | Fashion | Women Dress | 231072 | 2 | 462144 | Mega Campaign | DISC20 | 20 | 92429.000 | 15000 | 369715 | Virtual Account | Returning | Surabaya | In Stock | Completed | 3 | Yes | ecommerce.csv |
| ORD01511 | 2024/08/07 | 08-14-2024 | Tokopedia | Home Living | Vacuum Cleaner | 512063 | 3 | 1536189 | Flash Sale | DISC10 | 10 | 153619.000 | 0 | 1382570 | Transfer Bank | New | Yogyakarta | In Stock | Completed | 4 | No | ecommerce.csv |
| ORD00772 | 2024-12-08 | | Tokopedia | Beauty | Body Lotion | 221586 | 2 | 443172 | Payday Sale | DISC15 | 15 | 66.476 | 0 | 376696 | COD | Returning | Surabaya | In Stock | Completed | 5 | normal | ecommerce.csv |
| ORD00880 | 2024/04/06 | | Blibli | Beauty | Lip Tint | 297973 | 8 | 2383784 | Normal Day | NONE | 0 | 0.000 | 12000 | 0 | Credit Card | VIP | Makassar | Low Stock | Cancelled | 3 | Yes | ecommerce.csv |
| ORD00592 | 03-15-2024 | | Blibli | Fashion | Sneakers | 102844 | 8 | 822752 | Mega Campaign | DISC20 | 20 | 164550.000 | 25000 | 0 | E-Wallet | Returning | Bekasi | In Stock | Cancelled | 3 | Yes | ecommerce.csv |
| ORD01367 | 07-24-2024 | 2024-07-25 | Blibli | Sports | Dumbbell 5kg | 297954 | 10 | 2979540 | Clearance | CLEAR30 | 30 | 893862.000 | 20000 | 2085678 | COD | New | Yogyakarta | In Stock | Completed | 5 | N | ecommerce.csv |
| ORD01178 | 11-14-2024 | 2024-11-16 | Blibli | Sports | Dumbbell 5kg | 710898 | 2 | 1421796 | Payday Sale | DISC15 | 15 | 213269.000 | 18000 | 1208527 | Virtual Account | VIP | Bandung | In Stock | Completed | 5 | PRIORITY | ecommerce.csv |
| ORD00276 | 2024/06/15 | 17/06/2024 | Lazada | Sports | Cycling Gloves | 526279 | 4 | 2105116 | Payday Sale | DISC15 | 15 | 315767.000 | 20000 | 1789349 | E-Wallet | Returning | Bandung | In Stock | Completed | 3 | normal | ecommerce.csv |
1.4 SECTION D – CONDITIONAL LOGIC
=== MENERAPKAN LOGIKA BISNIS (SECTION D) ===
Berhasil menambahkan kolom: is_high_value, order_priority, valid_transaction.
Table: Hasil Implementasi Conditional Logic (Section D)
| order_id | net_sales | order_status | is_high_value | order_priority | valid_transaction |
|:--------:|:---------:|:------------:|:-------------:|:--------------:|:-----------------:|
| ORD00612 | 680058 | Completed | No | Medium | Valid |
| ORD00112 | 1476873 | Completed | Yes | High | Valid |
| ORD01186 | 369715 | Completed | No | Low | Valid |
| ORD01511 | 1382570 | Completed | Yes | High | Valid |
| ORD00772 | 376696 | Completed | No | Low | Valid |
| ORD00880 | 0 | Cancelled | No | Low | Invalid |
| ORD00592 | 0 | Cancelled | No | Low | Invalid |
| ORD01367 | 2085678 | Completed | Yes | High | Valid |
| ORD01178 | 1208527 | Completed | Yes | High | Valid |
| ORD00276 | 1789349 | Completed | Yes | High | Valid |
1.5 SECTION E – ANALYTICAL THINKING
| Dominasi Platform | |
| platform | Total |
|---|---|
| Shopee | 2080 |
| Tiktok Shop | 2030 |
| Blibli | 2020 |
| Lazada | 1940 |
| Tokopedia | 1930 |
| Frekuensi Kategori | |
| category | Frekuensi |
|---|---|
| Fashion | 2100 |
| Sports | 2070 |
| Beauty | 1940 |
| Home Living | 1910 |
| Electronics | 1820 |
| Home_living | 160 |
| Status Transaksi | |
| order_status | Jumlah |
|---|---|
| Completed | 7900 |
| Cancelled | 595 |
| Shipped | 510 |
| Returned | 375 |
| On Delivery | 240 |
2 Web Scraping & Data Programming Process
2.1 SECTION A – DATA COLLECTION USING PROGRAMMING
2.1.1 1. Countries of the World (Static Page)
=== SCRAPING: COUNTRIES OF THE WORLD ===
Jumlah data diambil: 250
| country_name | capital | population |
|---|---|---|
| China | Beijing | 1330044000 |
| India | New Delhi | 1173108018 |
| United States | Washington | 310232863 |
| Indonesia | Jakarta | 242968342 |
| Brazil | Brasília | 201103330 |
| Pakistan | Islamabad | 184404791 |
| Bangladesh | Dhaka | 156118464 |
| Nigeria | Abuja | 154000000 |
| Russia | Moscow | 140702000 |
| Japan | Tokyo | 127288000 |
2.1.2 Hockey Teams (Pagination & Form)
| team_name | year | wins |
|---|---|---|
| Detroit Red Wings | 1995 | 62 |
| Detroit Red Wings | 2005 | 58 |
| Pittsburgh Penguins | 1992 | 56 |
| Detroit Red Wings | 2007 | 54 |
| Washington Capitals | 2009 | 54 |
| Vancouver Canucks | 2010 | 54 |
| Dallas Stars | 2005 | 53 |
| Buffalo Sabres | 2006 | 53 |
| Boston Bruins | 2008 | 53 |
| San Jose Sharks | 2008 | 53 |
2.2 SECTION B – DATA HANDLING
=== ANALISIS STRUKTUR DATA HASIL SCRAPING ===
==================================================
DATASET: Countries of the World
==================================================
1. Struktur Dasar:
- Jumlah Baris : 250
- Jumlah Kolom : 3
- Nama Kolom : country_name, capital, population
2. Tipe Data Setiap Kolom:
Kolom Tipe
country_name character
capital character
population numeric
3. Pemeriksaan Kualitas:
- Missing Values per Kolom:
country_name capital population
0 0 0
- Jumlah Baris Duplikat: 0 baris
4. Isu Kualitas Data yang Ditemukan (Countries):
a. Tipe Data: Kolom 'population' dan 'area' mungkin masih terbaca sebagai character/teks karena adanya koma/titik.
b. Missing Values: Potensi adanya sel kosong pada data geografi yang tidak tercatat lengkap.
==================================================
DATASET: Hockey Teams (Pagination & Form)
==================================================
1. Struktur Dasar:
- Jumlah Baris : 582
- Jumlah Kolom : 3
- Nama Kolom : team_name, year, wins
2. Tipe Data Setiap Kolom:
Kolom Tipe
team_name character
year numeric
wins numeric
3. Pemeriksaan Kualitas:
- Missing Values per Kolom:
team_name year wins
0 0 0
- Jumlah Baris Duplikat: 0 baris
4. Isu Kualitas Data yang Ditemukan (Hockey):
a. Duplikasi: Potensi baris ganda jika proses pagination mengambil data yang sama secara tidak sengaja.
b. Konsistensi: Kolom 'wins' dan 'points' perlu dikonversi ke numerik agar bisa dilakukan perhitungan statistik.
2.3 SECTION C – DATA CLEANING
SECTION C: DATA CLEANING
[1] Standardisasi teks...
[2] Konversi ke numeric...
[3] Menangani missing values...
[4] Validasi data...
[5] Menghapus duplicate...
Proses pembersihan selesai.
Dataset hasil cleaning:
Countries:
| country_name | capital | population |
|:-------------:|:----------:|:------------:|
| China | Beijing | 1.330044e+09 |
| India | New Delhi | 1.173108e+09 |
| United States | Washington | 3.102329e+08 |
| Indonesia | Jakarta | 2.429683e+08 |
| Brazil | Brasília | 2.011033e+08 |
| Pakistan | Islamabad | 1.844048e+08 |
| Bangladesh | Dhaka | 1.561185e+08 |
| Nigeria | Abuja | 1.540800e+00 |
| Russia | Moscow | 1.407020e+08 |
| Japan | Tokyo | 1.272880e+08 |
Hockey Teams:
| team_name | year | wins |
|:-------------------:|:----:|:----:|
| Detroit Red Wings | 1995 | 62 |
| Detroit Red Wings | 2005 | 58 |
| Pittsburgh Penguins | 1992 | 56 |
| Detroit Red Wings | 2007 | 54 |
| Washington Capitals | 2009 | 54 |
| Vancouver Canucks | 2010 | 54 |
| Dallas Stars | 2005 | 53 |
| Buffalo Sabres | 2006 | 53 |
| Boston Bruins | 2008 | 53 |
| San Jose Sharks | 2008 | 53 |
2.4 SECTION D – CONDITIONAL LOGIC
SECTION D: CONDITIONAL LOGIC
[1] Handling missing elements...
[2] Determining data status...
Kolom 'data_status' berhasil ditambahkan.
Preview Countries:
| country_name | capital | population | data_status |
|:-------------:|:----------:|:------------:|:-----------:|
| China | Beijing | 1.330044e+09 | Incomplete |
| India | New Delhi | 1.173108e+09 | Incomplete |
| United States | Washington | 3.102329e+08 | Incomplete |
| Indonesia | Jakarta | 2.429683e+08 | Incomplete |
| Brazil | Brasília | 2.011033e+08 | Incomplete |
| Pakistan | Islamabad | 1.844048e+08 | Incomplete |
| Bangladesh | Dhaka | 1.561185e+08 | Incomplete |
| Nigeria | Abuja | 1.540800e+00 | Incomplete |
| Russia | Moscow | 1.407020e+08 | Incomplete |
| Japan | Tokyo | 1.272880e+08 | Incomplete |
Preview Hockey Teams:
| team_name | year | wins | data_status |
|:-------------------:|:----:|:----:|:-----------:|
| Detroit Red Wings | 1995 | 62 | Incomplete |
| Detroit Red Wings | 2005 | 58 | Incomplete |
| Pittsburgh Penguins | 1992 | 56 | Incomplete |
| Detroit Red Wings | 2007 | 54 | Incomplete |
| Washington Capitals | 2009 | 54 | Incomplete |
| Vancouver Canucks | 2010 | 54 | Incomplete |
| Dallas Stars | 2005 | 53 | Incomplete |
| Buffalo Sabres | 2006 | 53 | Incomplete |
| Boston Bruins | 2008 | 53 | Incomplete |
| San Jose Sharks | 2008 | 53 | Incomplete |
2.5 SECTION E – ANALYTICAL THINKING
1. Website mana paling mudah di-scrape?
Berdasarkan dataset yang digunakan, yaitu Countries of the World dan Hockey Teams, terdapat perbedaan tingkat kemudahan dalam proses scraping.
Website Paling Mudah di-Scrape: Countries of the World
Alasan : Website ini memiliki struktur Statis. Seluruh data tersedia dalam satu dokumen HTML tunggal dengan penamaan class yang konsisten, sehingga pengambilan data dapat dilakukan secara efisien menggunakan fungsi dasar dari library rvest tanpa memerlukan logika perulangan yang kompleks.
2. Website mana paling sulit?
Website Paling Sulit di-Scrape: Hockey Teams
Alasan : Website ini mengimplementasikan mekanisme Pagination (pembagian data per halaman). Tantangan utama meliputi perlunya pengelolaan URL yang dinamis, penanganan delay antar permintaan (request), dan risiko inkonsistensi data antar halaman yang memerlukan pembersihan lebih intensif pada tahap berikutnya.
3. Perbedaan pendekatan:
Beberapa metode umum dalam penyajian data di web adalah sebagai berikut:
Static: Data disajikan langsung di dalam struktur HTML saat halaman pertama kali dimuat, sehingga dapat diakses tanpa interaksi tambahan.
Pagination: Data dibagi ke dalam beberapa halaman yang terpisah, sehingga pengguna perlu berpindah halaman untuk melihat keseluruhan data.
AJAX: Data dimuat secara asinkron menggunakan permintaan tambahan ke server tanpa perlu memuat ulang seluruh halaman.
Iframe: Data ditampilkan melalui elemen bingkai yang mengambil konten dari sumber atau halaman lain.
- Tabel berikut menjelaskan perbedaan teknis dari berbagai metode penyajian data di web:
| Pendekatan | Mekanisme Penyajian Data | Tantangan Scraping |
|---|---|---|
| Static | Data dimuat langsung dalam struktur HTML saat halaman pertama kali diakses. | Relatif rendah; cukup menggunakan pemilih CSS atau XPath standar. |
| Pagination | Data dibagi ke dalam beberapa sub-halaman untuk optimasi beban server. | Membutuhkan perulangan (looping) dan manajemen URL yang sistematis. |
| AJAX | Data dimuat secara asinkron tanpa memperbarui keseluruhan halaman. | Memerlukan inspeksi pada Network Tab untuk menemukan sumber API asli. |
| Iframe | Data ditampilkan di dalam bingkai (frame) yang berasal dari URL eksternal. | Harus mengekstraksi URL sumber asli di dalam tag
<iframe> terlebih dahulu. |