Midterm Exam ~ Season 2
1 Mini Project: CASE STUDY E-Commerce
1.1 Introduction
This mini project aims to process and analyze an e-commerce dataset stored in several file formats using Python. The process includes collecting data from multiple sources, checking data structure consistency, merging datasets, handling data quality issues, cleaning the data, applying business logic, and drawing simple insights from the results. This task illustrates the basic data-processing workflow, from data collection through to analytical thinking.
1.2 SECTION A – DATA COLLECTION USING PROGRAMMING
Objective: To collect and combine data from multiple sources using Python programming.
## Data successfully extracted to folder: extracted_data
## CSV file successfully saved: combined_clean_data.csv
| File_Name | Number_of_Rows | Number_of_Columns | Column_Names |
|---|---|---|---|
| ecommerce.csv | 2000 | 23 | order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag, source_file |
| ecommerce.json | 2000 | 23 | order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag, source_file |
| ecommerce.txt | 2000 | 23 | order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag, source_file |
| ecommerce.xlsx | 2000 | 23 | order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag, source_file |
| ecommerce.xml | 2000 | 23 | order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag, source_file |
| Status | Total_Rows | Total_Columns |
|---|---|---|
| Ready to merge | 10000 | 23 |
1.2.1 Interpretation
This program extracts a ZIP file, reads multiple data files in
different formats (CSV, JSON, TXT, Excel, and XML), and converts them
into DataFrames using appropriate methods (including a special function
for XML). Each dataset is given an additional source_file
column to indicate its origin, and basic information such as number of
rows, columns, and column names is displayed. It then checks the
consistency of column structures using an IF-ELSE condition; if all
files have the same structure, they are merged into one dataset. The
final result is a combined dataset with 10,000 rows and 23 columns,
which is saved as combined_clean_data.csv.
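The collection step can be sketched as follows. This is a minimal pandas illustration of the workflow described above, not the exam's original code; the archive name, the tab delimiter assumed for the TXT file, and the use of pandas' `read_xml` for the XML file are assumptions.

```python
import zipfile
import pandas as pd

# Extract the archive (archive name is an assumed example)
with zipfile.ZipFile("ecommerce_data.zip") as zf:
    zf.extractall("extracted_data")

# One reader per file format; the TXT delimiter and read_xml usage are assumptions
readers = {
    "ecommerce.csv": pd.read_csv,
    "ecommerce.json": pd.read_json,
    "ecommerce.txt": lambda p: pd.read_csv(p, sep="\t"),
    "ecommerce.xlsx": pd.read_excel,
    "ecommerce.xml": pd.read_xml,
}

frames = []
for name, reader in readers.items():
    df = reader(f"extracted_data/{name}")
    df["source_file"] = name  # record where each row came from
    print(name, df.shape[0], "rows,", df.shape[1], "columns")
    frames.append(df)

# Merge only if every file exposes the same column structure
if all(list(f.columns) == list(frames[0].columns) for f in frames):
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv("combined_clean_data.csv", index=False)
    print("Ready to merge:", combined.shape)
else:
    print("Column structures differ; files were not merged.")
```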
1.3 SECTION B – DATA HANDLING
Objective: To understand the condition of the dataset after the merging process.
| Total_Rows | Total_Columns |
|---|---|
| 10000 | 23 |
| Column_Name | Data_Type |
|---|---|
| order_id | character |
| order_date | character |
| ship_date | character |
| platform | character |
| category | character |
| product_name | character |
| unit_price | character |
| quantity | character |
| gross_sales | character |
| campaign | character |
| voucher_code | character |
| discount_pct | character |
| discount_value | character |
| shipping_cost | character |
| net_sales | character |
| payment_method | character |
| customer_segment | character |
| region | character |
| stock_status | character |
| order_status | character |
| customer_rating | character |
| priority_flag | character |
| source_file | character |
| Column_Name | Missing_Values |
|---|---|
| order_id | 0 |
| order_date | 0 |
| ship_date | 651 |
| platform | 0 |
| category | 0 |
| product_name | 0 |
| unit_price | 0 |
| quantity | 0 |
| gross_sales | 0 |
| campaign | 0 |
| voucher_code | 196 |
| discount_pct | 276 |
| discount_value | 0 |
| shipping_cost | 0 |
| net_sales | 0 |
| payment_method | 140 |
| customer_segment | 0 |
| region | 0 |
| stock_status | 0 |
| order_status | 0 |
| customer_rating | 1614 |
| priority_flag | 752 |
| source_file | 0 |
| Description | Total |
|---|---|
| Number of Duplicate Rows | 45 |
| Issues |
|---|
| - There are missing values in some columns |
| - There are duplicate rows due to file merging |
| - The date columns (order_date, ship_date) are still in character format (not yet converted to dates) |
| - Data comes from multiple sources (potential inconsistency) |
1.3.1 Interpretation
This code analyzes the merged dataset by first displaying basic information such as the number of rows, columns, and data types, where all columns are still stored as text (character/object). It then performs a data quality check, identifying missing values in several columns (such as ship_date, voucher_code, and customer_rating) and detecting 45 duplicate rows. Finally, it highlights the key data issues: missing values, duplicates caused by merging multiple files, date columns not yet converted to a proper date type, and potential inconsistencies because the data originates from different sources.
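The checks summarized in this section amount to a few pandas calls; a minimal sketch, reading everything back as text to mirror the all-character state reported above:

```python
import pandas as pd

combined = pd.read_csv("combined_clean_data.csv", dtype=str)

print(combined.shape)                      # (10000, 23)
print(combined.dtypes)                     # every column still object (text)
print(combined.isna().sum())               # missing values per column
print("Duplicate rows:", combined.duplicated().sum())
```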
1.4 SECTION C – DATA CLEANING
Objective: To clean the data using programming logic.
| Stage | Values |
|---|---|
| Before | Tokopedia, TikTok Shop, blibli, Blibli, Lazada, Shopee, lazada, TIKTOK SHOP, TOKOPEDIA, tokopedia, Tiktok Shop, tiktok shop, SHOPEE, LAZADA, shopee, BLIBLI, tiktok shop , tokopedia , blibli , lazada , shopee |
| After | Tokopedia, Tiktok Shop, Blibli, Lazada, Shopee |
| Description |
|---|
| Price values cleaned (Rp, dots, and commas removed; converted to numeric; negative values set to 0) |
| Column | Missing_Before | Handling |
|---|---|---|
| payment_method | 175 | Filled with ‘Unknown’ |
| customer_rating | 2030 | Filled with mean = 4.25 |
| Stage | Values |
|---|---|
| Before | completed, delivered, DELIVERED, Cancelled, CANCEL, Delivered, On Delivery, Returned, RETUR, cancelled, Shipped, Batal, shipped, returned, batal, COMPLETED, CANCELLED, BATAL, on delivery, SHIPPED, ON DELIVERY, retur, completed , cancelled , delivered , batal , returned , on delivery , retur , shipped |
| After | Completed, Cancelled, Cancel, On Delivery, Returned, Retur, Shipped, Batal |
| Cleaned_Columns |
|---|
| platform, payment_method, order_status |
| Total_Rows | Total_Columns | Missing_After |
|---|---|---|
| 10000 | 23 | 1875 |
## Sample data after cleaning:
1.4.1 Interpretation
The dataset shows that Shopee dominates transactions, with Fashion being the most frequent category. Most orders are marked as Completed, with an average shipping time of around 2–3 days. Tokopedia is more commonly used for Electronics, though its overall volume is smaller compared to Shopee.
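A compact pandas sketch of the cleaning rules summarized in this section; the set of price columns and the exact regular expression are assumptions, but the logic (trim and title-case the text columns, strip "Rp"/dots/commas, coerce to numeric, clip negatives to 0, fill payment_method with "Unknown" and customer_rating with its mean) follows the tables above.

```python
import pandas as pd

df = pd.read_csv("combined_clean_data.csv", dtype=str)

# Standardize categorical text: trim whitespace and apply title case
for col in ["platform", "payment_method", "order_status"]:
    df[col] = df[col].str.strip().str.title()

# Clean price-like columns (assumed set): remove "Rp", dots, commas and spaces,
# convert to numeric, and set negative values to 0
for col in ["unit_price", "gross_sales", "net_sales"]:
    df[col] = (df[col].str.replace(r"[Rp.,\s]", "", regex=True)
                      .pipe(pd.to_numeric, errors="coerce")
                      .clip(lower=0))

# Missing value handling described above
df["payment_method"] = df["payment_method"].fillna("Unknown")
df["customer_rating"] = pd.to_numeric(df["customer_rating"], errors="coerce")
df["customer_rating"] = df["customer_rating"].fillna(df["customer_rating"].mean())

df.to_csv("cleaned_data.csv", index=False)  # output file name is illustrative
```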
1.5 SECTION D – CONDITIONAL LOGIC
Objective: To apply business logic using if / if-else statements.
| is_high_value | Count |
|---|---|
| No | 6130 |
| Yes | 3870 |
| order_priority | Count |
|---|---|
| High | 3870 |
| Low | 4035 |
| Medium | 2095 |
| valid_transaction | Count |
|---|---|
| Invalid | 415 |
| Valid | 9585 |
1.5.1 Interpretation
The dataset highlights transaction classifications based on sales value and status. High-value orders (above 1,000,000) are flagged as Yes and given High priority, while medium and low sales are categorized accordingly. Most transactions are marked Valid, except those Cancelled, which are labeled Invalid. This provides a clear view of order importance and reliability within the dataset.
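These classifications can be reproduced with straightforward if/else logic in pandas. In the sketch below, the 1,000,000 threshold for high-value orders and the Cancelled-means-Invalid rule follow the interpretation above; the 500,000 boundary between Medium and Low priority is a hypothetical example, since the report does not state it.

```python
import numpy as np
import pandas as pd

# df is the cleaned dataset from Section C; net_sales is numeric
df["is_high_value"] = np.where(df["net_sales"] > 1_000_000, "Yes", "No")

def order_priority(net_sales):
    # The 500,000 boundary is a hypothetical example
    if net_sales > 1_000_000:
        return "High"
    elif net_sales > 500_000:
        return "Medium"
    else:
        return "Low"

df["order_priority"] = df["net_sales"].apply(order_priority)

# Cancelled orders are treated as invalid transactions
df["valid_transaction"] = np.where(df["order_status"] == "Cancelled", "Invalid", "Valid")

print(df["is_high_value"].value_counts())
print(df["order_priority"].value_counts())
print(df["valid_transaction"].value_counts())
```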
1.6 SECTION E – ANALYTICAL THINKING
Objective: To derive simple insights from the data.
| Platform | Count |
|---|---|
| Blibli | 2020 |
| Lazada | 1940 |
| Shopee | 2080 |
| Tiktok Shop | 2030 |
| Tokopedia | 1930 |
| Description | Platform | Count |
|---|---|---|
| Most Frequent Platform | Shopee | 2080 |
| Category | Count |
|---|---|
| electronics | 3 |
| Beauty | 66 |
| BEAUTY | 3 |
| beauty | 33 |
| electronics | 21 |
| Electronics | 63 |
| ELECTRONICS | 3 |
| fashion | 24 |
| home living | 24 |
| home_living | 6 |
| Sports | 84 |
| SPORTS | 3 |
| sports | 27 |
| beauty | 157 |
| Beauty | 1644 |
| BEAUTY | 37 |
| electronics | 96 |
| Electronics | 1577 |
| ELECTRONICS | 57 |
| fashion | 91 |
| Fashion | 1830 |
| FASHION | 155 |
| home living | 126 |
| Home Living | 1710 |
| HOME LIVING | 50 |
| home_living | 4 |
| Home_Living | 140 |
| HOME_LIVING | 10 |
| sports | 113 |
| Sports | 1791 |
| SPORTS | 52 |
| Description | Category | Count |
|---|---|---|
| Most Frequent Category | Fashion | 1830 |
| Status | Count |
|---|---|
| Batal | 165 |
| Cancel | 180 |
| Cancelled | 415 |
| Completed | 7900 |
| On Delivery | 240 |
| Retur | 215 |
| Returned | 375 |
| Shipped | 510 |
| Description | Status | Count |
|---|---|---|
| Most Common Status | Completed | 7900 |
1.6.1 Interpretation
Shopee is the most dominant platform with 2,080 transactions, closely followed by TikTok Shop, Blibli, Lazada, and Tokopedia. Fashion is the most frequent category (1,830 entries), although there are many duplicates with inconsistent formatting (e.g., FASHION, fashion, Fashion), showing the need for standardization. Most transactions are marked as Completed (7,900), while smaller portions are Shipped, Cancelled, or Returned.
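The frequency tables above reduce to simple value counts, for example:

```python
# df is the dataset analyzed in Section E
for col in ["platform", "category", "order_status"]:
    counts = df[col].value_counts()
    print(counts)
    print(f"Most frequent {col}: {counts.idxmax()} ({counts.max()})")
```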
2 Web Scraping & Data Programming Process
2.1 Introduction
This section covers web scraping using R on two different websites: Oscar Winning Films (AJAX/JavaScript) and Turtles All the Way Down (Frames/iFrames). The first website contains dynamic content loaded via AJAX, while the second uses iframe structures that require specific access to embedded frames.
This process aims to understand scraping techniques for complex
website structures using R libraries such as rvest,
handling both static and dynamic data.
2.2 Oscar Winning Films (AJAX / Javascript)
2.2.1 SECTION A – DATA COLLECTION USING PROGRAMMING
Objective: To collect data from various types of websites using programming.
## TOTAL RECORDS: 87
## TOTAL COLUMNS: 5
##
## Data saved to oscar_films.csv
2.2.1.1 Interpretation
- The dataset contains Oscar-nominated and winning films from
2010–2015.
- Each year has a relatively consistent number of films.
- The “Best Picture Winner” category highlights
exactly one film per year.
- Winning films generally have higher award counts compared to
nominated films.
- The awards bar visualization helps quickly identify top-performing films.
- Conclusion: Best Picture winners tend to achieve stronger results in total awards, although they do not always have the highest number of nominations.
2.2.2 SECTION B – DATA HANDLING
Objective: To understand the structure of the data obtained from web scraping.
## TOTAL ROWS: 87
## TOTAL COLUMNS: 5
## TOTAL MISSING VALUES: 81
## TOTAL DUPLICATES: 0
2.2.2.1 Interpretation
- The data structure shows that the dataset is well organized, with
appropriate columns and data types for analysis.
- The missing values check found 81 missing entries across the 87
records, so several fields will need to be filled or removed during
cleaning.
- No duplicate data was found, meaning there is no redundancy in the
dataset.
- Several potential data issues were identified, such as:
  - whitespace in titles (can affect filtering),
  - logical inconsistencies (e.g., a winner with zero awards),
  - unrealistic values (awards greater than nominations).
- Conclusion: Overall, the dataset is structurally sound and ready for analysis, but the missing values and anomalies above should be reviewed to ensure more accurate results.
2.2.3 SECTION C – DATA CLEANING
Objective: Clean the dataset using programming logic
##
## Clean data saved to oscar_films_clean.csv | Total rows: 6
2.2.3.1 Interpretation
- The text standardization process improved data consistency by
removing extra spaces and unifying formats (Title and Category are now
consistent).
- Missing value handling made the dataset more complete by filling numeric values and removing rows with invalid text; overall, cleaning reduced the dataset from 87 rows to 6.
- Either no duplicates were found or all duplicates were removed,
ensuring no redundant data.
- Data type conversion ensures numeric and year columns are properly
formatted, and invalid values have been corrected.
- Conclusion: After cleaning, the dataset is more structured, consistent, and ready for further analysis with reduced risk of errors.
2.2.4 SECTION D – CONDITIONAL LOGIC
Objective: Apply conditional logic in scraping & cleaning
##
## Final dataset saved to oscar_films_final.csv | Total rows: 6
2.2.4.1 Interpretation
- Assigning default values ensures that no data is missing, allowing
all rows to remain usable for analysis.
- This process improves data consistency, especially in key columns
such as Title, Category, and Year.
- The data is then classified into three statuses:
- Complete → all fields are valid
- Incomplete → some fields are missing or not
valid
- Invalid → logical errors exist (e.g., awards >
nominations)
- The status distribution provides a quick overview of overall data
quality.
- Conclusion: Conditional logic enhances data quality and clarity by ensuring all records have values and are categorized based on their validity level.
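As an illustration of this logic (the scraping and cleaning were done in R; the sketch below is a hedged Python equivalent, and the column names Title, Category, Awards, and Nominations are assumed from the text):

```python
import pandas as pd

films = pd.read_csv("oscar_films_clean.csv")

# Assign default values so no field is left empty (column names are assumptions)
films["Title"] = films["Title"].fillna("Unknown")
films["Category"] = films["Category"].fillna("Unknown")
films[["Awards", "Nominations"]] = films[["Awards", "Nominations"]].fillna(0)

def record_status(row):
    # Invalid: logical error; Incomplete: default values used; otherwise Complete
    if row["Awards"] > row["Nominations"]:
        return "Invalid"
    elif row["Title"] == "Unknown" or row["Category"] == "Unknown":
        return "Incomplete"
    else:
        return "Complete"

films["status"] = films.apply(record_status, axis=1)
print(films["status"].value_counts())
films.to_csv("oscar_films_final.csv", index=False)
```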
2.3 Turtles (Frames / iFrames)
2.3.1 SECTION A – DATA COLLECTION USING PROGRAMMING
Objective: To collect data from various types of websites using programming.
## TOTAL RECORDS : 14
## TOTAL COLUMNS : 3
##
## Saved to turtles_scraping.csv
2.3.1.1 Interpretation
- The data was successfully extracted from a frames/iFrames-based
webpage, indicating correct handling of a more complex page
structure.
- The dataset includes turtle family names, descriptions, and
additional information, providing a well-rounded overview for each
category.
- The total records indicate that each turtle family was scraped
successfully without major errors.
- The description column provides the main overview,
while additional_info contains supporting details such
as characteristics or facts.
- Conclusion: The scraping process successfully collected structured and complete data, making it ready for further analysis or exploration of turtle families.
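The report's scraping code is written in R with rvest; the following Python sketch illustrates the same frame-handling idea under stated assumptions: the page URL and CSS selectors are placeholders, not the site's real structure.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

PAGE_URL = "https://example.com/turtles"  # placeholder, not the real page

# 1. Load the outer page and resolve the iframe's source URL
outer = BeautifulSoup(requests.get(PAGE_URL, timeout=30).text, "html.parser")
iframe_src = urljoin(PAGE_URL, outer.find("iframe")["src"])

# 2. Load the embedded page and parse each turtle-family entry
inner = BeautifulSoup(requests.get(iframe_src, timeout=30).text, "html.parser")
records = []
for block in inner.select("div.turtle-family"):  # placeholder selector
    records.append({
        "name": block.select_one("h2").get_text(strip=True),
        "description": block.select_one("p").get_text(strip=True),
        "additional_info": block.select_one(".info").get_text(strip=True),
    })

pd.DataFrame(records).to_csv("turtles_scraping.csv", index=False)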
2.3.2 SECTION B – DATA HANDLING
Objective: To understand the structure of the data obtained from web scraping.
## Total Rows : 14
## Total Columns : 3
## Total Missing Values: 0
## Total Duplicates: 0
2.3.2.1 Interpretation
- The data structure shows a simple and clear set of columns (name,
description, additional_info), making it easy to analyze.
- The missing values check indicates that the dataset is mostly
complete, reflecting good data quality.
- No duplicate data was found, meaning each turtle family entry is
unique.
- Some potential data issues were identified:
- very long text in additional_info → may reduce
readability,
- very short descriptions → may lack meaningful information,
- any missing values → could affect analysis accuracy.
- Conclusion: The turtles dataset is generally clean and ready for analysis, but minor improvements in text length and completeness would enhance its overall quality.
2.3.3 SECTION C – DATA CLEANING
Objective: Clean the dataset using programming logic
##
## Cleaned data saved to turtles_clean.csv | Total rows: 14
2.3.3.1 Interpretation
- The text standardization process improved consistency by trimming
whitespace and unifying formats (name in proper case, other fields in
lowercase).
- Missing value handling was applied using logic:
- name column filled with “unknown”
- rows with missing values in other columns were removed
→ resulting in cleaner and more relevant data.
- Either no duplicates were found or all duplicates were removed,
ensuring no redundant entries.
- Data type validation shows all columns remain as character type,
which is appropriate for textual data.
- The cleaning summary indicates that the number of rows may decrease
due to removal of incomplete data.
- Conclusion: The cleaning process successfully improves data quality by making it consistent, removing invalid entries, and preparing the dataset for further analysis.
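A minimal pandas version of these cleaning rules (the report used R; the steps mirror the bullets above):

```python
import pandas as pd

turtles = pd.read_csv("turtles_scraping.csv")

# Trim whitespace; proper case for names, lowercase for the other text fields
turtles["name"] = turtles["name"].str.strip().str.title()
for col in ["description", "additional_info"]:
    turtles[col] = turtles[col].str.strip().str.lower()

# Fill missing names with "unknown"; drop rows missing the other columns
turtles["name"] = turtles["name"].fillna("unknown")
turtles = turtles.dropna(subset=["description", "additional_info"]).drop_duplicates()

turtles.to_csv("turtles_clean.csv", index=False)
```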
2.3.4 SECTION D – CONDITIONAL LOGIC
Objective: Apply conditional logic in scraping & cleaning
2.3.4.1 Interpretation
- Assigning default values ensures that no important fields
(name, description, additional_info) remain
empty.
- This allows all records to remain usable even if the original data
was incomplete.
- The dataset is classified into two statuses:
- Complete → all information is present and
valid
- Incomplete → some fields still rely on default
values
- The status distribution provides a quick overview of how much data
is fully complete versus incomplete.
Conclusion: Conditional logic improves data quality by filling missing values and categorizing records based on completeness, making the dataset more reliable and easier to analyze.
2.3.5 SECTION E – ANALYTICAL THINKING
Objective: Analyze the scraping process
2.3.5.1 1. Easiest Website to Scrape
The Oscar Winning Films (AJAX) website is the easiest to scrape because the data is available through a JSON API, allowing direct extraction without complex HTML parsing.
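In practice this means requesting the JSON endpoint directly instead of parsing rendered HTML. The report does this in R; as a hedged Python illustration, where the endpoint URL is a placeholder found via the browser's Network tab:

```python
import pandas as pd
import requests

API_URL = "https://example.com/oscars/films.json"  # placeholder endpoint

response = requests.get(API_URL, timeout=30)
response.raise_for_status()
films = pd.DataFrame(response.json())  # assumes the endpoint returns a list of records

films.to_csv("oscar_films.csv", index=False)
print("TOTAL RECORDS:", len(films), "| TOTAL COLUMNS:", films.shape[1])
```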
2.3.5.2 2. Most Difficult Website to Scrape
The Turtles (Frames / iFrames) website is the
hardest because:
- Data is inside iframe structures
- Requires accessing nested pages
- Not directly available in the main HTML document
2.3.5.3 3. Differences in Scraping Approaches
- Static Website
- Data is directly available in HTML
- Easy scraping using
rvest
- Simplest and most stable method
- Pagination
- Data is split across multiple pages
- Requires looping through pages
- Risk of missing data if pagination is not fully handled
- AJAX
- Data is loaded from an API (JSON)
- Not always visible in initial HTML
- More efficient when API endpoint is used directly
- Iframe
- Data is embedded in another page
- Requires accessing internal frame URLs
- Most complex structure
2.3.5.4 4. Insights
- Website structure strongly affects scraping difficulty.
- API-based (AJAX) scraping is more efficient than HTML parsing.
- More layers (iframe/pagination) increase scraping complexity.
2.3.5.5 5. Recommendations
- Use browser DevTools Network tab to find hidden API endpoints.
- Always inspect iframe source URLs before scraping embedded content.
3 Conclusion
In conclusion, this mini project demonstrates that the process of collecting, processing, and analyzing data from multiple sources requires different approaches depending on the data characteristics, such as static pages, pagination, AJAX, and iframe structures. The use of both Python and R in web scraping and data processing provides flexibility in handling complex and diverse data structures.
Furthermore, the implementation of conditional logic (if / if-else) and looping plays an important role in automating data processing tasks, including cleaning, standardization, and dataset merging. The main challenges encountered include inconsistent data structures across websites, missing or incomplete data, and dynamic content that requires advanced scraping techniques.
Overall, this project provides a comprehensive understanding of the data science workflow, from data collection to generating insights that can support decision-making.