?????? ????
Light

Members

Dosen

Dosen Pengampu

Bakti Siregar, M.Sc.,CDS.

Student Major in Data Science

Data Science Programming

Institut Teknologi Sains Bandung

Fifi Muthia Pitaloka

NIM: 52250038

Clara Maisie Wanghili

NIM: 52250039

Nazwa Nur Ramadhani

NIM: 52250045

Section A

SECTION A – DATA COLLECTION USING PROGRAMMING

Data Collection & Integration Overview
Nama_File Jumlah_Baris Jumlah_Kolom Nama_Kolom Tipe
ecommerce.csv 2000 22 order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag Detail File
ecommerce.json 2000 22 order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag Detail File
ecommerce.txt 2000 22 order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag Detail File
ecommerce.xlsx 2000 22 order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag Detail File
ecommerce.xml 2000 22 order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag Detail File
TOTAL 10000 22 Ready to merge Summary

The program successfully imported data from five formats (.csv, .xlsx, .json, .txt, .xml), checked rows/columns/column names, adjusted mismatched structures, and merged compatible datasets into merged_df, saved as merged_data.csv.

This demonstrates effective integration of multi-format data for further analysis.

Section B

SECTION B – DATA HANDLING

Informasi Umum Dataset
Total_Baris Total_Kolom
10000 22
Tipe Data Setiap Kolom
Nama_Kolom Tipe_Data
order_id order_id character
order_date order_date character
ship_date ship_date character
platform platform character
category category character
product_name product_name character
unit_price unit_price character
quantity quantity character
gross_sales gross_sales character
campaign campaign character
voucher_code voucher_code character
discount_pct discount_pct character
discount_value discount_value character
shipping_cost shipping_cost character
net_sales net_sales character
payment_method payment_method character
customer_segment customer_segment character
region region character
stock_status stock_status character
order_status order_status character
customer_rating customer_rating character
priority_flag priority_flag character
Missing Values per Kolom
Nama_Kolom Jumlah_Missing
order_id order_id 0
order_date order_date 0
ship_date ship_date 651
platform platform 0
category category 0
product_name product_name 0
unit_price unit_price 0
quantity quantity 0
gross_sales gross_sales 0
campaign campaign 0
voucher_code voucher_code 196
discount_pct discount_pct 276
discount_value discount_value 0
shipping_cost shipping_cost 0
net_sales net_sales 0
payment_method payment_method 140
customer_segment customer_segment 0
region region 0
stock_status stock_status 0
order_status order_status 0
customer_rating customer_rating 1614
priority_flag priority_flag 752
Duplicate Rows
Duplicate_Rows
5581
Identifikasi Masalah Kualitas Data
No Masalah
1 Terdapat nilai missing (NA) pada beberapa kolom
2 Format data tidak konsisten (huruf besar/kecil, spasi)
3 Duplikasi data ditemukan pada beberapa baris

Examination of merged_df reveals final dimensions post-merge (row and column counts from various files), data types per column (numeric, character, or factor) for verification, missing value distribution per column to identify gaps needing cleaning, and duplicate row counts for repetition detection, all essential cleaning steps to ensure a tidy, valid, and optimal dataset for further analysis.

Section C

SECTION C – DATA CLEANING

## [1] "Total cleaned data: 10000"
## [1] "Saved to clean_data.csv & clean_data.xlsx"

The cleaning process transformed merged_df into a tidier, consistent df_clean: standardized platform column (e.g., “tokped” → “Tokopedia”, uniform casing), cleaned unit_price (removed currency symbols/characters, converted to numeric, negatives → 0), filled missing payment_method with “Unknown”, replaced missing customer_rating with mean rating, and standardized order_status (e.g., “delivered” → “Completed”) resulting in valid, consistent data ready for further analysis.

Section D

SECTION D – CONDITIONAL LOGIC

## [1] "Total data after transformation: 10000"

Feature engineering added new columns to df_clean for easier transaction analysis: is_high_value flags sales over Rp1,000,000 as “Yes” (others “No”); order_priority categorizes net_sales into High, Medium, or Low for prioritizing high-value orders; and valid_transaction distinguishes valid transactions from “Cancelled” (Invalid). These enhance business analysis, transaction segmentation, and effective decision-making.

Section E

SECTION E – ANALYTICAL THINKING

1. Which platform is the most dominant? Based on data analysis, Shopee is the most dominant with the highest transaction volume compared to Tokopedia and other platforms. This dominance indicates Shopee as the most widely used e-commerce platform, supported by aggressive promotions, user-friendly app, and broad market reach, thereby strengthening its competitive position.

2. Which category appears most frequently? The most frequent product category in the dataset is Electronics, indicating the highest demand and transaction volume compared to others. This reflects consumer preference for electronics due to daily needs, rapid technological advancements, and wide product variety.

3. Which transaction status is the most common? The most common transaction status is Completed, indicating most transactions succeed without cancellation. This reflects high success rates, user trust in the platform, and effective payment, delivery, and service systems.

Overall, the data shows dominance by one main platform, a specific product category, and mostly Completed transactions, indicating an active and stable e-commerce market.

3 Oscar Winning Films (AJAX / Javascript)

## [1] "Total Oscar data: 520"
## [1] "Saved to oscar_data.csv & oscar_data.xlsx"

In scraping Oscar Winning Films data, we used AJAX scraping tailored to the oscars.yipitdata.com website structure, where data isn’t rendered directly in HTML but sent via background JSON requests, so we employed httr and jsonlite libraries for direct HTTP GET requests to the endpoint without needing a browser or Selenium, using for loops to iterate through all 87 ceremony years stored as nested lists in JSON and if conditions to skip empty or NULL data per year iteration, despite challenges like the per-year nested JSON structure (not flat dataframes) requiring extra loops and as.data.frame() conversion each time, plus missing Award/Category columns since the site covers only Best Picture, so we manually filled the category column, yielding 520 rows of Best Picture nominations from 1927 to 2014 across 87 ceremony years with 87 winning films and 433 nominated but non-winning films, averaging 6 films competing yearly and the tightest races in 1934 and 1935 (12 nominations each), stored in columns movie_title, year, category, and winner (TRUE/FALSE), showing that AJAX-based sites demand a different approach from static HTML by pulling data straight from JSON endpoints while properly handling nested structures and incomplete columns.

4 Turtles (Frames / iFrames)

## [1] "Total turtles scraped: 14"
## [1] "Saved to turtles_data.csv"

In the scraping process, various methods like static HTML, pagination, AJAX, and iframe were used tailored to each website’s structure, employing for loops for repeated data extraction and if conditions to handle missing or unavailable data, despite challenges such as data embedded in iframes or not directly in HTML requiring detail page access, resulting in diverse datasets with varying structures that provided insights into data volume and patterns, including turtle family data from scrapethissite.com/pages/frames/ successfully extracted using rvest iframe scraping in two stages, first accessing the main page to get the iframe src attribute since data wasn’t directly available, then requesting the iframe URL and looping through each turtle’s detail page to yield 14 complete families with scientific names and descriptions covering common names, first discovery years, and discoverers from the oldest Testudinidae (Tortoises) in 1788 by Batsch to the newest Carettochelyidae (Pig-nosed turtle) in 1887 by Boulenger with contributions from Gray and Fitzinger on multiple families, stored in columns Name, Description, and Additional_Info (empty or “No additional info” as unavailable on the site), demonstrating that iframe structures demand extra detail page access steps while ensuring only relevant data is captured effectively.