Dosen Pengampu
Bakti Siregar, M.Sc.,CDS.
Student Major in Data Science
Data Science Programming
Institut Teknologi Sains Bandung
Fifi Muthia Pitaloka
NIM: 52250038
Clara Maisie Wanghili
NIM: 52250039
Nazwa Nur Ramadhani
NIM: 52250045
Dosen Pengampu
Bakti Siregar, M.Sc.,CDS.
Institut Teknologi Sains Bandung
NIM: 52250038
NIM: 52250039
NIM: 52250045
| Nama_File | Jumlah_Baris | Jumlah_Kolom | Nama_Kolom | Tipe |
|---|---|---|---|---|
| ecommerce.csv | 2000 | 22 | order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag | Detail File |
| ecommerce.json | 2000 | 22 | order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag | Detail File |
| ecommerce.txt | 2000 | 22 | order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag | Detail File |
| ecommerce.xlsx | 2000 | 22 | order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag | Detail File |
| ecommerce.xml | 2000 | 22 | order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag | Detail File |
| TOTAL | 10000 | 22 | Ready to merge | Summary |
The program successfully imported data from five formats (.csv, .xlsx, .json, .txt, .xml), checked rows/columns/column names, adjusted mismatched structures, and merged compatible datasets into merged_df, saved as merged_data.csv.
This demonstrates effective integration of multi-format data for further analysis.
| Total_Baris | Total_Kolom |
|---|---|
| 10000 | 22 |
| Nama_Kolom | Tipe_Data | |
|---|---|---|
| order_id | order_id | character |
| order_date | order_date | character |
| ship_date | ship_date | character |
| platform | platform | character |
| category | category | character |
| product_name | product_name | character |
| unit_price | unit_price | character |
| quantity | quantity | character |
| gross_sales | gross_sales | character |
| campaign | campaign | character |
| voucher_code | voucher_code | character |
| discount_pct | discount_pct | character |
| discount_value | discount_value | character |
| shipping_cost | shipping_cost | character |
| net_sales | net_sales | character |
| payment_method | payment_method | character |
| customer_segment | customer_segment | character |
| region | region | character |
| stock_status | stock_status | character |
| order_status | order_status | character |
| customer_rating | customer_rating | character |
| priority_flag | priority_flag | character |
| Nama_Kolom | Jumlah_Missing | |
|---|---|---|
| order_id | order_id | 0 |
| order_date | order_date | 0 |
| ship_date | ship_date | 651 |
| platform | platform | 0 |
| category | category | 0 |
| product_name | product_name | 0 |
| unit_price | unit_price | 0 |
| quantity | quantity | 0 |
| gross_sales | gross_sales | 0 |
| campaign | campaign | 0 |
| voucher_code | voucher_code | 196 |
| discount_pct | discount_pct | 276 |
| discount_value | discount_value | 0 |
| shipping_cost | shipping_cost | 0 |
| net_sales | net_sales | 0 |
| payment_method | payment_method | 140 |
| customer_segment | customer_segment | 0 |
| region | region | 0 |
| stock_status | stock_status | 0 |
| order_status | order_status | 0 |
| customer_rating | customer_rating | 1614 |
| priority_flag | priority_flag | 752 |
| Duplicate_Rows |
|---|
| 5581 |
| No | Masalah |
|---|---|
| 1 | Terdapat nilai missing (NA) pada beberapa kolom |
| 2 | Format data tidak konsisten (huruf besar/kecil, spasi) |
| 3 | Duplikasi data ditemukan pada beberapa baris |
Examination of merged_df reveals final dimensions post-merge (row and column counts from various files), data types per column (numeric, character, or factor) for verification, missing value distribution per column to identify gaps needing cleaning, and duplicate row counts for repetition detection, all essential cleaning steps to ensure a tidy, valid, and optimal dataset for further analysis.
## [1] "Total cleaned data: 10000"
## [1] "Saved to clean_data.csv & clean_data.xlsx"
The cleaning process transformed merged_df into a tidier, consistent df_clean: standardized platform column (e.g., “tokped” → “Tokopedia”, uniform casing), cleaned unit_price (removed currency symbols/characters, converted to numeric, negatives → 0), filled missing payment_method with “Unknown”, replaced missing customer_rating with mean rating, and standardized order_status (e.g., “delivered” → “Completed”) resulting in valid, consistent data ready for further analysis.
## [1] "Total data after transformation: 10000"
Feature engineering added new columns to df_clean for easier transaction analysis: is_high_value flags sales over Rp1,000,000 as “Yes” (others “No”); order_priority categorizes net_sales into High, Medium, or Low for prioritizing high-value orders; and valid_transaction distinguishes valid transactions from “Cancelled” (Invalid). These enhance business analysis, transaction segmentation, and effective decision-making.
1. Which platform is the most dominant? Based on data analysis, Shopee is the most dominant with the highest transaction volume compared to Tokopedia and other platforms. This dominance indicates Shopee as the most widely used e-commerce platform, supported by aggressive promotions, user-friendly app, and broad market reach, thereby strengthening its competitive position.
2. Which category appears most frequently? The most frequent product category in the dataset is Electronics, indicating the highest demand and transaction volume compared to others. This reflects consumer preference for electronics due to daily needs, rapid technological advancements, and wide product variety.
3. Which transaction status is the most common? The most common transaction status is Completed, indicating most transactions succeed without cancellation. This reflects high success rates, user trust in the platform, and effective payment, delivery, and service systems.
Overall, the data shows dominance by one main platform, a specific product category, and mostly Completed transactions, indicating an active and stable e-commerce market.
## [1] "Total Oscar data: 520"
## [1] "Saved to oscar_data.csv & oscar_data.xlsx"
In scraping Oscar Winning Films data, we used AJAX scraping tailored to the oscars.yipitdata.com website structure, where data isn’t rendered directly in HTML but sent via background JSON requests, so we employed httr and jsonlite libraries for direct HTTP GET requests to the endpoint without needing a browser or Selenium, using for loops to iterate through all 87 ceremony years stored as nested lists in JSON and if conditions to skip empty or NULL data per year iteration, despite challenges like the per-year nested JSON structure (not flat dataframes) requiring extra loops and as.data.frame() conversion each time, plus missing Award/Category columns since the site covers only Best Picture, so we manually filled the category column, yielding 520 rows of Best Picture nominations from 1927 to 2014 across 87 ceremony years with 87 winning films and 433 nominated but non-winning films, averaging 6 films competing yearly and the tightest races in 1934 and 1935 (12 nominations each), stored in columns movie_title, year, category, and winner (TRUE/FALSE), showing that AJAX-based sites demand a different approach from static HTML by pulling data straight from JSON endpoints while properly handling nested structures and incomplete columns.
## [1] "Total turtles scraped: 14"
## [1] "Saved to turtles_data.csv"
In the scraping process, various methods like static HTML, pagination, AJAX, and iframe were used tailored to each website’s structure, employing for loops for repeated data extraction and if conditions to handle missing or unavailable data, despite challenges such as data embedded in iframes or not directly in HTML requiring detail page access, resulting in diverse datasets with varying structures that provided insights into data volume and patterns, including turtle family data from scrapethissite.com/pages/frames/ successfully extracted using rvest iframe scraping in two stages, first accessing the main page to get the iframe src attribute since data wasn’t directly available, then requesting the iframe URL and looping through each turtle’s detail page to yield 14 complete families with scientific names and descriptions covering common names, first discovery years, and discoverers from the oldest Testudinidae (Tortoises) in 1788 by Batsch to the newest Carettochelyidae (Pig-nosed turtle) in 1887 by Boulenger with contributions from Gray and Fitzinger on multiple families, stored in columns Name, Description, and Additional_Info (empty or “No additional info” as unavailable on the site), demonstrating that iframe structures demand extra detail page access steps while ensuring only relevant data is captured effectively.