Midterm Exam

Midterm Exam ~ Season 2

NAISYA HAFIZH MUFIDAH
NIM: 52250040
NURUL IFFAH
NIM: 52250037
ULIN NIKMAH
NIM: 52250042
Study Program: Sains Data (Data Science)
University: Institut Teknologi Sains Bandung
Lecturer: Bakti Siregar, M.Sc., CSD

1 Mini Project: E-Commerce Case Study

1.1 Introduction

This mini project aims to process and analyze an e-commerce dataset from various file formats using Python. The process includes data collection from multiple sources, checking data structure consistency, merging datasets, handling data quality issues, cleaning data, applying business logic, and drawing simple insights from the analysis results. Through this task, the basic workflow of data processing from data collection to analytical thinking can be understood.


1.2 SECTION A – DATA COLLECTION USING PROGRAMMING

Objective: To collect and combine data from multiple sources using Python programming.

## Data successfully extracted to folder: extracted_data
## CSV file successfully saved: combined_clean_data.csv
File Summary Information
File_Name Number_of_Rows Number_of_Columns
ecommerce.csv 2000 23
ecommerce.json 2000 23
ecommerce.txt 2000 23
ecommerce.xlsx 2000 23
ecommerce.xml 2000 23
All five files share the same 23 columns: order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag, source_file
Status and Total Data
Status Total_Rows Total_Columns
Ready to merge 10000 23

1.2.1 Interpretation

This program extracts a ZIP file, reads multiple data files in different formats (CSV, JSON, TXT, Excel, and XML), and converts them into DataFrames using appropriate methods (including a special function for XML). Each dataset is given an additional source_file column to indicate its origin, and basic information such as number of rows, columns, and column names is displayed. It then checks the consistency of column structures using an IF-ELSE condition; if all files have the same structure, they are merged into one dataset. The final result is a combined dataset with 10,000 rows and 23 columns, which is saved as combined_clean_data.csv.
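A minimal Python sketch of this workflow is shown below. The ZIP archive name, the tab delimiter for the TXT file, and the use of pd.read_xml (standing in for the report's custom XML function) are assumptions made only for illustration.

```python
import zipfile
import pandas as pd

# Extract the archive (the ZIP file name is an assumed placeholder).
with zipfile.ZipFile("ecommerce_data.zip") as z:
    z.extractall("extracted_data")

# One reader per format; the TXT delimiter and the XML reader are assumptions.
readers = {
    "extracted_data/ecommerce.csv":  lambda p: pd.read_csv(p),
    "extracted_data/ecommerce.json": lambda p: pd.read_json(p),
    "extracted_data/ecommerce.txt":  lambda p: pd.read_csv(p, sep="\t"),
    "extracted_data/ecommerce.xlsx": lambda p: pd.read_excel(p),
    "extracted_data/ecommerce.xml":  lambda p: pd.read_xml(p),
}

frames = []
for path, read in readers.items():
    df = read(path)
    df["source_file"] = path.split("/")[-1]          # record where each row came from
    print(path, df.shape[0], "rows,", df.shape[1], "columns")
    frames.append(df)

# Merge only if every file has the same column structure (the if-else check).
first_cols = list(frames[0].columns)
if all(list(df.columns) == first_cols for df in frames):
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv("combined_clean_data.csv", index=False)
    print("Ready to merge:", combined.shape[0], "rows,", combined.shape[1], "columns")
else:
    print("Column structures differ; files were not merged.")
```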


1.3 SECTION B – DATA HANDLING

Objective: To understand the condition of the dataset after the merging process.

Dataset Information
Total_Rows Total_Columns
10000 23
Data Types per Column
Column_Name Data_Type
order_id character
order_date character
ship_date character
platform character
category character
product_name character
unit_price character
quantity character
gross_sales character
campaign character
voucher_code character
discount_pct character
discount_value character
shipping_cost character
net_sales character
payment_method character
customer_segment character
region character
stock_status character
order_status character
customer_rating character
priority_flag character
source_file character
Missing Values per Column
Column_Name Missing_Values
order_id 0
order_date 0
ship_date 651
platform 0
category 0
product_name 0
unit_price 0
quantity 0
gross_sales 0
campaign 0
voucher_code 196
discount_pct 276
discount_value 0
shipping_cost 0
net_sales 0
payment_method 140
customer_segment 0
region 0
stock_status 0
order_status 0
customer_rating 1614
priority_flag 752
source_file 0
Duplicate Rows
Description Total
Number of Duplicate Rows 45
Data Issues Identification
Issues
- There are missing values in some columns
- There are duplicate rows due to file merging
- The date columns (order_date, ship_date) are still in character format (not yet converted to date)
- Data comes from multiple sources (potential inconsistency)

1.3.1 Interpretation

This code analyzes the merged dataset by first displaying basic information such as the number of rows, columns, and data types, where all columns are still in object (character) format. It then performs a data quality check, identifying missing values in several columns (such as ship_date, voucher_code, and customer_rating) and detecting 45 duplicate rows. Finally, it highlights the key data issues: missing values, duplicates caused by merging multiple files, date columns not yet converted to datetime format, and potential inconsistencies because the data originates from different sources.
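These checks correspond to a few pandas calls, sketched here on the merged file from Section A:

```python
import pandas as pd

df = pd.read_csv("combined_clean_data.csv")

print("Total rows:", df.shape[0], "| Total columns:", df.shape[1])
print(df.dtypes)                    # in the report, every column is still character/object at this stage
print(df.isna().sum())              # missing values per column (e.g. ship_date, voucher_code, customer_rating)
print("Duplicate rows:", df.duplicated().sum())
```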

1.4 SECTION C – DATA CLEANING

Objective: To clean the data using programming logic.

Platform Standardization
Stage Values
Before Tokopedia, TikTok Shop, blibli, Blibli, Lazada, Shopee, lazada, TIKTOK SHOP, TOKOPEDIA, tokopedia, Tiktok Shop, tiktok shop, SHOPEE, LAZADA, shopee, BLIBLI, tiktok shop , tokopedia , blibli , lazada , shopee
After Tokopedia, Tiktok Shop, Blibli, Lazada, Shopee
Price Cleaning
Description
Price values cleaned (Rp, dots, and commas removed; converted to numeric; negative values set to 0)
Missing Value Handling
Column Missing_Before Handling
payment_method 175 Filled with ‘Unknown’
customer_rating 2030 Filled with mean = 4.25
Order Status Standardization
Stage Values
Before completed, delivered, DELIVERED, Cancelled, CANCEL, Delivered, On Delivery, Returned, RETUR, cancelled, Shipped, Batal, shipped, returned, batal, COMPLETED, CANCELLED, BATAL, on delivery, SHIPPED, ON DELIVERY, retur, completed , cancelled , delivered , batal , returned , on delivery , retur , shipped
After Completed, Cancelled, Cancel, On Delivery, Returned, Retur, Shipped, Batal
Loop Cleaning Columns
Cleaned_Columns
platform, payment_method, order_status
Final Cleaning Summary
Total_Rows Total_Columns Missing_After
10000 23 1875
## Sample data after cleaning:

1.4.1 Interpretation

After cleaning, the dataset shows that Shopee dominates transactions, with Fashion being the most frequent category. Most orders are marked as Completed, with an average shipping time of around 2–3 days. Tokopedia is more commonly used for Electronics, though its overall volume is smaller than Shopee's.
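The cleaning steps summarized above can be sketched in pandas as follows; the fill values and the stripped characters follow the tables in this section, but the exact implementation in the report may differ.

```python
import pandas as pd

df = pd.read_csv("combined_clean_data.csv")

# Standardize platform names to title case (Tokopedia, Tiktok Shop, Blibli, Lazada, Shopee).
df["platform"] = df["platform"].str.strip().str.title()

# Clean price-like columns: drop "Rp", dots, commas and spaces, convert to numeric, set negatives to 0.
for col in ["unit_price", "gross_sales", "discount_value", "shipping_cost", "net_sales"]:
    df[col] = (
        df[col].astype(str)
               .str.replace(r"[Rp.,\s]", "", regex=True)
               .pipe(pd.to_numeric, errors="coerce")
               .clip(lower=0)
    )

# Handle missing values as described above.
df["payment_method"] = df["payment_method"].fillna("Unknown")
df["customer_rating"] = pd.to_numeric(df["customer_rating"], errors="coerce")
df["customer_rating"] = df["customer_rating"].fillna(df["customer_rating"].mean())

# Loop-based text standardization for the listed columns (platform, payment_method, order_status).
for col in ["platform", "payment_method", "order_status"]:
    df[col] = df[col].astype(str).str.strip().str.title()

print(df.shape, "| Missing after cleaning:", int(df.isna().sum().sum()))
print(df.head())
```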

1.5 SECTION D – CONDITIONAL LOGIC

Objective: To apply business logic using if / if-else statements.

Distribution of High Value Transactions
is_high_value Count
No 6130
Yes 3870
Distribution of Order Priority
order_priority Count
High 3870
Low 4035
Medium 2095
Distribution of Valid Transactions
valid_transaction Count
Invalid 415
Valid 9585

1.5.1 Interpretation

The dataset highlights transaction classifications based on sales value and status. High-value orders (above 1,000,000) are flagged as Yes and given High priority, while medium and low sales are categorized accordingly. Most transactions are marked Valid; only Cancelled orders are labeled Invalid. This provides a clear view of order importance and reliability within the dataset.
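The rules described above reduce to simple if/else logic. In the sketch below, the 1,000,000 threshold and the Cancelled rule come from this section, while the 500,000 cut-off between Medium and Low is a hypothetical placeholder, since the report does not state it.

```python
import numpy as np
import pandas as pd

# Stand-in for the cleaned DataFrame from Section C.
df = pd.read_csv("combined_clean_data.csv")
df["order_status"] = df["order_status"].astype(str).str.strip().str.title()
df["net_sales"] = (
    df["net_sales"].astype(str)
                   .str.replace(r"[Rp.,\s]", "", regex=True)
                   .pipe(pd.to_numeric, errors="coerce")
                   .fillna(0)
)

# High-value flag: net sales above 1,000,000.
df["is_high_value"] = np.where(df["net_sales"] > 1_000_000, "Yes", "No")

# Order priority via if/elif/else; the 500,000 boundary is a hypothetical example value.
def priority(net_sales):
    if net_sales > 1_000_000:
        return "High"
    elif net_sales > 500_000:
        return "Medium"
    else:
        return "Low"

df["order_priority"] = df["net_sales"].apply(priority)

# Cancelled orders are Invalid, everything else Valid.
df["valid_transaction"] = np.where(df["order_status"] == "Cancelled", "Invalid", "Valid")

print(df["is_high_value"].value_counts())
print(df["order_priority"].value_counts())
print(df["valid_transaction"].value_counts())
```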

1.6 SECTION E – ANALYTICAL THINKING

Objective: To derive simple insights from the data.

Platform Frequency
Platform Count
Blibli 2020
Lazada 1940
Shopee 2080
Tiktok Shop 2030
Tokopedia 1930
Top Platform
Description Platform Count
Most Frequent Platform Shopee 2080
Category Frequency
Category Count
electronics 3
Beauty 66
BEAUTY 3
beauty 33
electronics 21
Electronics 63
ELECTRONICS 3
fashion 24
home living 24
home_living 6
Sports 84
SPORTS 3
sports 27
beauty 157
Beauty 1644
BEAUTY 37
electronics 96
Electronics 1577
ELECTRONICS 57
fashion 91
Fashion 1830
FASHION 155
home living 126
Home Living 1710
HOME LIVING 50
home_living 4
Home_Living 140
HOME_LIVING 10
sports 113
Sports 1791
SPORTS 52
Top Category
Description Category Count
Most Frequent Category Fashion 1830
Order Status Frequency
Status Count
Batal 165
Cancel 180
Cancelled 415
Completed 7900
On Delivery 240
Retur 215
Returned 375
Shipped 510
Top Status
Description Status Count
Most Common Status Completed 7900

1.6.1 Interpretation

Shopee is the most dominant platform with 2,080 transactions, closely followed by TikTok Shop, Blibli, Lazada, and Tokopedia. Fashion is the most frequent category (1,830 entries), although there are many duplicates with inconsistent formatting (e.g., FASHION, fashion, Fashion), showing the need for standardization. Most transactions are marked as Completed (7,900), while smaller portions are Shipped, Cancelled, or Returned.
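The frequency tables above can be reproduced with value counts. The sketch below also shows one possible way to collapse the inconsistent category spellings (trim, replace underscores, title case); note that the counts reported above were taken from the raw, unstandardized column.

```python
import pandas as pd

df = pd.read_csv("combined_clean_data.csv")

# Platform and order-status frequencies, and the most frequent value of each.
platform_counts = df["platform"].str.strip().str.title().value_counts()
status_counts = df["order_status"].str.strip().str.title().value_counts()
print("Most frequent platform:", platform_counts.idxmax(), platform_counts.max())
print("Most common status:", status_counts.idxmax(), status_counts.max())

# Collapse category variants such as FASHION / fashion / home_living before counting.
category = (
    df["category"].astype(str)
                  .str.strip()
                  .str.replace("_", " ")
                  .str.title()
)
category_counts = category.value_counts()
print(category_counts)
print("Most frequent category:", category_counts.idxmax(), category_counts.max())
```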

2 Web Scraping & Data Programming Process

2.1 Introduction

This section covers web scraping using R on two different websites: Oscar Winning Films (AJAX/JavaScript) and Turtles All the Way Down (Frames/iFrames). The first website contains dynamic content loaded via AJAX, while the second uses iframe structures that require specific access to embedded frames.

This process aims to understand scraping techniques for complex website structures using R libraries such as rvest, handling both static and dynamic data.


2.2 Oscar Winning Films (AJAX / Javascript)

2.2.1 SECTION A – DATA COLLECTION USING PROGRAMMING

Objective: To collect data from various types of websites using programming.

## TOTAL RECORDS: 87
## TOTAL COLUMNS: 5
## 
## Data saved to oscar_films.csv

2.2.1.1 Interpretation

  • The dataset contains Oscar-nominated and winning films from 2010–2015.
  • Each year has a relatively consistent number of films.
  • The “Best Picture Winner” category highlights exactly one film per year.
  • Winning films generally have higher award counts compared to nominated films.
  • The awards bar visualization helps quickly identify top-performing films.
  • Conclusion: Best Picture winners tend to achieve stronger results in total awards, although they do not always have the highest number of nominations.

2.2.2 SECTION B – DATA HANDLING

Objective: To understand the structure of the data obtained from web scraping.

## TOTAL ROWS: 87
## TOTAL COLUMNS: 5
## TOTAL MISSING VALUES: 81
## TOTAL DUPLICATES: 0

2.2.2.1 Interpretation

  • The data structure shows that the dataset is well-organized, with appropriate columns and data types for analysis.
  • The missing values check found 81 missing entries, so some fields will need to be filled or dropped during cleaning.
  • No duplicate data was found, meaning there is no redundancy in the dataset.
  • Several potential data issues were identified, such as:
    • whitespace in titles (can affect filtering),
    • logical inconsistencies (e.g., winner with zero awards),
    • unrealistic values (awards greater than nominations).
  • Conclusion: Overall, the dataset is clean and ready for analysis, but some anomalies should be reviewed to ensure more accurate results.

2.2.3 SECTION C – DATA CLEANING

Objective: Clean the dataset using programming logic

## 
## Clean data saved to oscar_films_clean.csv | Total rows: 6

2.2.3.1 Interpretation

  • The text standardization process improved data consistency by removing extra spaces and unifying formats (Title and Category are now consistent).
  • Missing value handling made the dataset more complete by filling numeric values and removing invalid text rows.
  • No duplicates were found (or any that existed have been handled), ensuring there is no redundant data.
  • Data type conversion ensures numeric and year columns are properly formatted, and invalid values have been corrected.
  • Conclusion: After cleaning, the dataset is more structured, consistent, and ready for further analysis with reduced risk of errors.

2.2.4 SECTION D – CONDITIONAL LOGIC

Objective: Apply conditional logic in scraping & cleaning

## 
## Final dataset saved to oscar_films_final.csv | Total rows: 6

2.2.4.1 Interpretation

  • Assigning default values ensures that no data is missing, allowing all rows to remain usable for analysis.
  • This process improves data consistency, especially in key columns such as Title, Category, and Year.
  • The data is then classified into three statuses (a short sketch follows this list):
    • Complete → all fields are valid
    • Incomplete → some fields are missing or not valid
    • Invalid → logical errors exist (e.g., awards > nominations)
  • The status distribution provides a quick overview of overall data quality.
  • Conclusion: Conditional logic enhances data quality and clarity by ensuring all records have values and are categorized based on their validity level.
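The report implements this logic in R. Purely as an illustration of the same rules in Python, the sketch below assumes column names Title, Category, Year, Awards, and Nominations in the cleaned file.

```python
import pandas as pd

films = pd.read_csv("oscar_films_clean.csv")   # cleaned data from Section C

# Assign default values so no key field is left empty.
films["Title"] = films["Title"].fillna("Unknown Title")
films["Category"] = films["Category"].fillna("Unknown Category")
films["Year"] = films["Year"].fillna(0).astype(int)

def classify(row):
    # Invalid: a logical error such as more awards than nominations.
    if row["Awards"] > row["Nominations"]:
        return "Invalid"
    # Incomplete: a field still relying on a default value.
    if row["Title"] == "Unknown Title" or row["Category"] == "Unknown Category" or row["Year"] == 0:
        return "Incomplete"
    return "Complete"

films["status"] = films.apply(classify, axis=1)
print(films["status"].value_counts())
films.to_csv("oscar_films_final.csv", index=False)
```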

2.3 Turtles (Frames / iFrames)

2.3.1 SECTION A – DATA COLLECTION USING PROGRAMMING

Objective: To collect data from various types of websites using programming.

## TOTAL RECORDS : 14
## TOTAL COLUMNS : 3
## 
## Saved to turtles_scraping.csv

2.3.1.1 Interpretation

  • The data was successfully extracted from a frames/iFrames-based webpage, indicating correct handling of a more complex page structure.
  • The dataset includes turtle family names, descriptions, and additional information, providing a well-rounded overview for each category.
  • The total records indicate that each turtle family was scraped successfully without major errors.
  • The description column provides the main overview, while additional_info contains supporting details such as characteristics or facts.
  • Conclusion: The scraping process successfully collected structured and complete data, making it ready for further analysis or exploration of turtle families.

2.3.2 SECTION B – DATA HANDLING

Objective: To understand the structure of the data obtained from web scraping.

## Total Rows    : 14
## Total Columns : 3
## Total Missing Values: 0
## Total Duplicates: 0

2.3.2.1 Interpretation

  • The data structure shows a simple and clear set of columns (name, description, additional_info), making it easy to analyze.
  • The missing values check indicates that the dataset is mostly complete, reflecting good data quality.
  • No duplicate data was found, meaning each turtle family entry is unique.
  • Some potential data issues were identified:
    • very long text in additional_info → may reduce readability,
    • very short descriptions → may lack meaningful information,
    • any missing values → could affect analysis accuracy.
  • Conclusion: The turtles dataset is generally clean and ready for analysis, but minor improvements in text length and completeness would enhance its overall quality.

2.3.3 SECTION C – DATA CLEANING

Objective: Clean the dataset using programming logic

## 
## Cleaned data saved to turtles_clean.csv | Total rows: 14

2.3.3.1 Interpretation

  • The text standardization process improved consistency by trimming whitespace and unifying formats (name in proper case, other fields in lowercase).
  • Missing value handling was applied using logic:
    • name column filled with “unknown”
    • rows with missing values in other columns were removed
      → resulting in cleaner and more relevant data.
  • No duplicates were found (or any that existed have been handled), ensuring there are no redundant entries.
  • Data type validation shows all columns remain as character type, which is appropriate for textual data.
  • The cleaning summary indicates that the number of rows may decrease due to removal of incomplete data.
  • Conclusion: The cleaning process successfully improves data quality by making it consistent, removing invalid entries, and preparing the dataset for further analysis.

2.3.4 SECTION D – CONDITIONAL LOGIC

Objective: Apply conditional logic in scraping & cleaning

2.3.4.1 Interpretation

  • Assigning default values ensures that no important fields (name, description, additional_info) remain empty.
  • This allows all records to remain usable even if the original data was incomplete.
  • The dataset is classified into two statuses:
    • Complete → all information is present and valid
    • Incomplete → some fields still rely on default values
  • The status distribution provides a quick overview of how much data is fully complete versus incomplete.
  • Conclusion: Conditional logic improves data quality by filling missing values and categorizing records based on completeness, making the dataset more reliable and easier to analyze.

2.3.5 SECTION E – ANALYTICAL THINKING

Objective: Analyze the scraping process

2.3.5.1 1. Easiest Website to Scrape

The Oscar Winning Films (AJAX) website is the easiest to scrape because the data is available through a JSON API, allowing direct extraction without complex HTML parsing.
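The report performs this scraping in R. As a rough Python illustration of the same idea, the endpoint URL below is a hypothetical placeholder for the JSON API that would be discovered in the browser's DevTools Network tab.

```python
import requests
import pandas as pd

# Hypothetical AJAX endpoint; the real one is found in the Network tab of the browser DevTools.
BASE_URL = "https://example.com/oscars/ajax"

records = []
for year in range(2010, 2016):                 # the report covers films from 2010-2015
    resp = requests.get(BASE_URL, params={"year": year}, timeout=30)
    resp.raise_for_status()
    for film in resp.json():                   # the endpoint is assumed to return a JSON list of film objects
        film["year"] = year
        records.append(film)

films = pd.DataFrame(records)
films.to_csv("oscar_films.csv", index=False)
print("TOTAL RECORDS:", len(films))
```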

2.3.5.2 2. Most Difficult Website to Scrape

The Turtles (Frames / iFrames) website is the hardest because (see the sketch after this list):
- Data is inside iframe structures
- Requires accessing nested pages
- Not directly available in the main HTML document
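The report again uses R with rvest; a comparable Python sketch (the page URL and CSS selectors are hypothetical) first reads the outer page, extracts the iframe's src attribute, and only then parses the embedded document.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical outer page whose real content sits inside an <iframe>.
PAGE_URL = "https://example.com/turtles/frames"

outer = BeautifulSoup(requests.get(PAGE_URL, timeout=30).text, "html.parser")
iframe_src = outer.find("iframe")["src"]              # the data is not present in the outer HTML itself
inner_url = urljoin(PAGE_URL, iframe_src)

inner = BeautifulSoup(requests.get(inner_url, timeout=30).text, "html.parser")

rows = []
for section in inner.select("div.turtle-family"):     # hypothetical container selector
    rows.append({
        "name": section.select_one("h3").get_text(strip=True),
        "description": section.select_one("p").get_text(strip=True),
        # additional_info would be extracted the same way from its own (hypothetical) element
    })
print(len(rows), "turtle families scraped")
```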

2.3.5.3 3. Differences in Scraping Approaches

  • Static Website
    • Data is directly available in HTML
    • Easy scraping using rvest
    • Simplest and most stable method
  • Pagination
    • Data is split across multiple pages
    • Requires looping through pages
    • Risk of missing data if pagination is not fully handled
  • AJAX
    • Data is loaded from an API (JSON)
    • Not always visible in initial HTML
    • More efficient when API endpoint is used directly
  • Iframe
    • Data is embedded in another page
    • Requires accessing internal frame URLs
    • Most complex structure

2.3.5.4 4. Insights

  • Website structure strongly affects scraping difficulty.
  • API-based (AJAX) scraping is more efficient than HTML parsing.
  • More layers (iframe/pagination) increase scraping complexity.

2.3.5.5 5. Recommendations

  • Use browser DevTools Network tab to find hidden API endpoints.
  • Always inspect iframe source URLs before scraping embedded content.

3 Conclusion

In conclusion, this mini project demonstrates that the process of collecting, processing, and analyzing data from multiple sources requires different approaches depending on the data characteristics, such as static pages, pagination, AJAX, and iframe structures. The use of both Python and R in web scraping and data processing provides flexibility in handling complex and diverse data structures.

Furthermore, the implementation of conditional logic (if / if-else) and looping plays an important role in automating data processing tasks, including cleaning, standardization, and dataset merging. The main challenges encountered include inconsistent data structures across websites, missing or incomplete data, and dynamic content that requires advanced scraping techniques.

Overall, this project provides a comprehensive understanding of the data science workflow, from data collection to generating insights that can support decision-making.

