Midterm Exam

Midterm Exam ~ Season 2

NAISYA HAFIZH MUFIDAH
NIM: 52250040
NURUL IFFAH
NIM: 52250037
ULIN NIKMAH
NIM: 52250042
Study Program: Sains Data (Data Science)
University: Institut Teknologi Sains Bandung
Lecturer: Bakti Siregar, M.Sc., CSD

1 Mini Project: E-Commerce Case Study

1.1 Introduction

This mini project aims to process and analyze an e-commerce dataset from various file formats using Python. The process includes data collection from multiple sources, checking data structure consistency, merging datasets, handling data quality issues, cleaning data, applying business logic, and drawing simple insights from the analysis results. Through this task, the basic workflow of data processing from data collection to analytical thinking can be understood.


1.2 SECTION A – DATA COLLECTION USING PROGRAMMING

Objective: To collect and combine data from multiple sources using Python programming.

## Data successfully extracted to folder: extracted_data
## CSV file successfully saved: combined_clean_data.csv
File Summary Information
File_Name Number_of_Rows Number_of_Columns
ecommerce.csv 2000 23
ecommerce.json 2000 23
ecommerce.txt 2000 23
ecommerce.xlsx 2000 23
ecommerce.xml 2000 23
All five files share the same 23 columns: order_id, order_date, ship_date, platform, category, product_name, unit_price, quantity, gross_sales, campaign, voucher_code, discount_pct, discount_value, shipping_cost, net_sales, payment_method, customer_segment, region, stock_status, order_status, customer_rating, priority_flag, source_file
Status and Total Data
Status Total_Rows Total_Columns
Ready to merge 10000 23

1.2.1 Interpretation

This program extracts a ZIP file, reads multiple data files in different formats (CSV, JSON, TXT, Excel, and XML), and converts them into DataFrames using appropriate methods (including a special function for XML). Each dataset is given an additional source_file column to indicate its origin, and basic information such as number of rows, columns, and column names is displayed. It then checks the consistency of column structures using an IF-ELSE condition; if all files have the same structure, they are merged into one dataset. The final result is a combined dataset with 10,000 rows and 23 columns, which is saved as combined_clean_data.csv.
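A minimal Python sketch of this workflow is shown below. The ZIP archive name, the tab delimiter for the TXT file, and the use of pd.read_xml (standing in for the report's custom XML function) are assumptions made only for illustration.

```python
import zipfile
import pandas as pd

# Extract the archive (the ZIP file name is an assumed placeholder).
with zipfile.ZipFile("ecommerce_data.zip") as z:
    z.extractall("extracted_data")

# One reader per format; the TXT delimiter and the XML reader are assumptions.
readers = {
    "extracted_data/ecommerce.csv":  lambda p: pd.read_csv(p),
    "extracted_data/ecommerce.json": lambda p: pd.read_json(p),
    "extracted_data/ecommerce.txt":  lambda p: pd.read_csv(p, sep="\t"),
    "extracted_data/ecommerce.xlsx": lambda p: pd.read_excel(p),
    "extracted_data/ecommerce.xml":  lambda p: pd.read_xml(p),
}

frames = []
for path, read in readers.items():
    df = read(path)
    df["source_file"] = path.split("/")[-1]          # record where each row came from
    print(path, df.shape[0], "rows,", df.shape[1], "columns")
    frames.append(df)

# Merge only if every file has the same column structure (the if-else check).
first_cols = list(frames[0].columns)
if all(list(df.columns) == first_cols for df in frames):
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv("combined_clean_data.csv", index=False)
    print("Ready to merge:", combined.shape[0], "rows,", combined.shape[1], "columns")
else:
    print("Column structures differ; files were not merged.")
```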


1.3 SECTION B – DATA HANDLING

Objective: To understand the condition of the dataset after the merging process.

Dataset Information
Total_Rows Total_Columns
10000 23
Data Types per Column
Column_Name Data_Type
order_id character
order_date character
ship_date character
platform character
category character
product_name character
unit_price character
quantity character
gross_sales character
campaign character
voucher_code character
discount_pct character
discount_value character
shipping_cost character
net_sales character
payment_method character
customer_segment character
region character
stock_status character
order_status character
customer_rating character
priority_flag character
source_file character
Missing Values per Column
Column_Name Missing_Values
order_id 0
order_date 0
ship_date 651
platform 0
category 0
product_name 0
unit_price 0
quantity 0
gross_sales 0
campaign 0
voucher_code 196
discount_pct 276
discount_value 0
shipping_cost 0
net_sales 0
payment_method 140
customer_segment 0
region 0
stock_status 0
order_status 0
customer_rating 1614
priority_flag 752
source_file 0
Duplicate Rows
Description Total
Number of Duplicate Rows 45
Data Issues Identification
Issues
- There are missing values in some columns
- There are duplicate rows due to file merging
- The date columns (order_date, ship_date) are still in character format (not yet converted to date)
- Data comes from multiple sources (potential inconsistency)

1.3.1 Interpretation

This code analyzes the merged dataset by first displaying basic information such as the number of rows, columns, and data types, where all columns are still in object (character) format. It then performs a data quality check, identifying missing values in several columns (such as ship_date, voucher_code, and customer_rating) and detecting 45 duplicate rows. Finally, it highlights the key data issues: missing values, duplicates caused by merging multiple files, date columns not yet converted to datetime format, and potential inconsistencies because the data originates from different sources.
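These checks correspond to a few pandas calls, sketched here on the merged file from Section A:

```python
import pandas as pd

df = pd.read_csv("combined_clean_data.csv")

print("Total rows:", df.shape[0], "| Total columns:", df.shape[1])
print(df.dtypes)                    # in the report, every column is still character/object at this stage
print(df.isna().sum())              # missing values per column (e.g. ship_date, voucher_code, customer_rating)
print("Duplicate rows:", df.duplicated().sum())
```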

1.4 SECTION C – DATA CLEANING

Objective: To clean the data using programming logic.

Platform Standardization
Stage Values
Before Tokopedia, TikTok Shop, blibli, Blibli, Lazada, Shopee, lazada, TIKTOK SHOP, TOKOPEDIA, tokopedia, Tiktok Shop, tiktok shop, SHOPEE, LAZADA, shopee, BLIBLI, tiktok shop , tokopedia , blibli , lazada , shopee
After Tokopedia, Tiktok Shop, Blibli, Lazada, Shopee
Price Cleaning
Description
Price values cleaned (Rp, dots, and commas removed; converted to numeric; negative values set to 0)
Missing Value Handling
Column Missing_Before Handling
payment_method 175 Filled with ‘Unknown’
customer_rating 2030 Filled with mean = 4.25
Order Status Standardization
Stage Values
Before completed, delivered, DELIVERED, Cancelled, CANCEL, Delivered, On Delivery, Returned, RETUR, cancelled, Shipped, Batal, shipped, returned, batal, COMPLETED, CANCELLED, BATAL, on delivery, SHIPPED, ON DELIVERY, retur, completed , cancelled , delivered , batal , returned , on delivery , retur , shipped
After Completed, Cancelled, Cancel, On Delivery, Returned, Retur, Shipped, Batal
Loop Cleaning Columns
Cleaned_Columns
platform, payment_method, order_status
Final Cleaning Summary
Total_Rows Total_Columns Missing_After
10000 23 1875
## Sample data after cleaning:

1.4.1 Interpretation

After cleaning, the dataset shows that Shopee dominates transactions, with Fashion being the most frequent category. Most orders are marked as Completed, with an average shipping time of around 2–3 days. Tokopedia is more commonly used for Electronics, though its overall volume is smaller than Shopee's.
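The cleaning steps summarized above can be sketched in pandas as follows; the fill values and the stripped characters follow the tables in this section, but the exact implementation in the report may differ.

```python
import pandas as pd

df = pd.read_csv("combined_clean_data.csv")

# Standardize platform names to title case (Tokopedia, Tiktok Shop, Blibli, Lazada, Shopee).
df["platform"] = df["platform"].str.strip().str.title()

# Clean price-like columns: drop "Rp", dots, commas and spaces, convert to numeric, set negatives to 0.
for col in ["unit_price", "gross_sales", "discount_value", "shipping_cost", "net_sales"]:
    df[col] = (
        df[col].astype(str)
               .str.replace(r"[Rp.,\s]", "", regex=True)
               .pipe(pd.to_numeric, errors="coerce")
               .clip(lower=0)
    )

# Handle missing values as described above.
df["payment_method"] = df["payment_method"].fillna("Unknown")
df["customer_rating"] = pd.to_numeric(df["customer_rating"], errors="coerce")
df["customer_rating"] = df["customer_rating"].fillna(df["customer_rating"].mean())

# Loop-based text standardization for the listed columns (platform, payment_method, order_status).
for col in ["platform", "payment_method", "order_status"]:
    df[col] = df[col].astype(str).str.strip().str.title()

print(df.shape, "| Missing after cleaning:", int(df.isna().sum().sum()))
print(df.head())
```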

1.5 SECTION D – CONDITIONAL LOGIC

Objective: To apply business logic using if / if-else statements.

Distribution of High Value Transactions
is_high_value Count
No 6130
Yes 3870
Distribution of Order Priority
order_priority Count
High 3870
Low 4035
Medium 2095
Distribution of Valid Transactions
valid_transaction Count
Invalid 415
Valid 9585

1.5.1 Interpretation

The dataset highlights transaction classifications based on sales value and status. High-value orders (above 1,000,000) are flagged as Yes and given High priority, while medium and low sales are categorized accordingly. Most transactions are marked Valid; only Cancelled orders are labeled Invalid. This provides a clear view of order importance and reliability within the dataset.
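The rules described above reduce to simple if/else logic. In the sketch below, the 1,000,000 threshold and the Cancelled rule come from this section, while the 500,000 cut-off between Medium and Low is a hypothetical placeholder, since the report does not state it.

```python
import numpy as np
import pandas as pd

# Stand-in for the cleaned DataFrame from Section C.
df = pd.read_csv("combined_clean_data.csv")
df["order_status"] = df["order_status"].astype(str).str.strip().str.title()
df["net_sales"] = (
    df["net_sales"].astype(str)
                   .str.replace(r"[Rp.,\s]", "", regex=True)
                   .pipe(pd.to_numeric, errors="coerce")
                   .fillna(0)
)

# High-value flag: net sales above 1,000,000.
df["is_high_value"] = np.where(df["net_sales"] > 1_000_000, "Yes", "No")

# Order priority via if/elif/else; the 500,000 boundary is a hypothetical example value.
def priority(net_sales):
    if net_sales > 1_000_000:
        return "High"
    elif net_sales > 500_000:
        return "Medium"
    else:
        return "Low"

df["order_priority"] = df["net_sales"].apply(priority)

# Cancelled orders are Invalid, everything else Valid.
df["valid_transaction"] = np.where(df["order_status"] == "Cancelled", "Invalid", "Valid")

print(df["is_high_value"].value_counts())
print(df["order_priority"].value_counts())
print(df["valid_transaction"].value_counts())
```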

1.6 SECTION E – ANALYTICAL THINKING

Objective: To derive simple insights from the data.

Platform Frequency
Platform Count
Blibli 2020
Lazada 1940
Shopee 2080
Tiktok Shop 2030
Tokopedia 1930
Top Platform
Description Platform Count
Most Frequent Platform Shopee 2080
Category Frequency
Category Count
electronics 3
Beauty 66
BEAUTY 3
beauty 33
electronics 21
Electronics 63
ELECTRONICS 3
fashion 24
home living 24
home_living 6
Sports 84
SPORTS 3
sports 27
beauty 157
Beauty 1644
BEAUTY 37
electronics 96
Electronics 1577
ELECTRONICS 57
fashion 91
Fashion 1830
FASHION 155
home living 126
Home Living 1710
HOME LIVING 50
home_living 4
Home_Living 140
HOME_LIVING 10
sports 113
Sports 1791
SPORTS 52
Top Category
Description Category Count
Most Frequent Category Fashion 1830
Order Status Frequency
Status Count
Batal 165
Cancel 180
Cancelled 415
Completed 7900
On Delivery 240
Retur 215
Returned 375
Shipped 510
Top Status
Description Status Count
Most Common Status Completed 7900

1.6.1 Interpretation

Shopee is the most dominant platform with 2,080 transactions, closely followed by TikTok Shop, Blibli, Lazada, and Tokopedia. Fashion is the most frequent category (1,830 entries), although there are many duplicates with inconsistent formatting (e.g., FASHION, fashion, Fashion), showing the need for standardization. Most transactions are marked as Completed (7,900), while smaller portions are Shipped, Cancelled, or Returned.
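The frequency tables above can be reproduced with value counts. The sketch below also shows one possible way to collapse the inconsistent category spellings (trim, replace underscores, title case); note that the counts reported above were taken from the raw, unstandardized column.

```python
import pandas as pd

df = pd.read_csv("combined_clean_data.csv")

# Platform and order-status frequencies, and the most frequent value of each.
platform_counts = df["platform"].str.strip().str.title().value_counts()
status_counts = df["order_status"].str.strip().str.title().value_counts()
print("Most frequent platform:", platform_counts.idxmax(), platform_counts.max())
print("Most common status:", status_counts.idxmax(), status_counts.max())

# Collapse category variants such as FASHION / fashion / home_living before counting.
category = (
    df["category"].astype(str)
                  .str.strip()
                  .str.replace("_", " ")
                  .str.title()
)
category_counts = category.value_counts()
print(category_counts)
print("Most frequent category:", category_counts.idxmax(), category_counts.max())
```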

2 Web Scraping & Data Programming Process

2.1 Introduction

This section covers web scraping using R on two different websites: Oscar Winning Films (AJAX/JavaScript) and Turtles All the Way Down (Frames/iFrames). The first website contains dynamic content loaded via AJAX, while the second uses iframe structures that require specific access to embedded frames.

This process aims to understand scraping techniques for complex website structures using R libraries such as rvest, handling both static and dynamic data.


2.2 Oscar Winning Films (AJAX / Javascript)

2.2.1 SECTION A – DATA COLLECTION USING PROGRAMMING

Objective: To collect data from various types of websites using programming.

## TOTAL RECORDS: 87
## TOTAL COLUMNS: 5
## 
## Data saved to oscar_films.csv

2.2.1.1 Interpretation

  • The dataset contains Oscar-nominated and winning films from 2010–2015.
  • Each year has a relatively consistent number of films.
  • The “Best Picture Winner” category highlights exactly one film per year.
  • Winning films generally have higher award counts compared to nominated films.
  • The awards bar visualization helps quickly identify top-performing films.
  • Conclusion: Best Picture winners tend to achieve stronger results in total awards, although they do not always have the highest number of nominations.

2.2.2 SECTION B – DATA HANDLING

Objective: To understand the structure of the data obtained from web scraping.

## TOTAL ROWS: 87
## TOTAL COLUMNS: 5
## TOTAL MISSING VALUES: 81
## TOTAL DUPLICATES: 0

2.2.2.1 Interpretation

  • The data structure shows that the dataset is well-organized, with appropriate columns and data types for analysis.
  • The missing values check found 81 missing entries, so some fields will need to be filled or dropped during cleaning.
  • No duplicate data was found, meaning there is no redundancy in the dataset.
  • Several potential data issues were identified, such as:
    • whitespace in titles (can affect filtering),
    • logical inconsistencies (e.g., winner with zero awards),
    • unrealistic values (awards greater than nominations).
  • Conclusion: Overall, the dataset is clean and ready for analysis, but some anomalies should be reviewed to ensure more accurate results.

2.2.3 SECTION C – DATA CLEANING

Objective: Clean the dataset using programming logic

## 
## Clean data saved to oscar_films_clean.csv | Total rows: 6

2.2.3.1 Interpretation

  • The text standardization process improved data consistency by removing extra spaces and unifying formats (Title and Category are now consistent).
  • Missing value handling made the dataset more complete by filling numeric values and removing invalid text rows.
  • No duplicates were found (or any that existed have been handled), ensuring there is no redundant data.
  • Data type conversion ensures numeric and year columns are properly formatted, and invalid values have been corrected.
  • Conclusion: After cleaning, the dataset is more structured, consistent, and ready for further analysis with reduced risk of errors.

2.2.4 SECTION D – CONDITIONAL LOGIC

Objective: Apply conditional logic in scraping & cleaning

## 
## Final dataset saved to oscar_films_final.csv | Total rows: 6

2.2.4.1 Interpretation

  • Assigning default values ensures that no data is missing, allowing all rows to remain usable for analysis.
  • This process improves data consistency, especially in key columns such as Title, Category, and Year.
  • The data is then classified into three statuses (a short sketch follows this list):
    • Complete → all fields are valid
    • Incomplete → some fields are missing or not valid
    • Invalid → logical errors exist (e.g., awards > nominations)
  • The status distribution provides a quick overview of overall data quality.
  • Conclusion: Conditional logic enhances data quality and clarity by ensuring all records have values and are categorized based on their validity level.
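The report implements this logic in R. Purely as an illustration of the same rules in Python, the sketch below assumes column names Title, Category, Year, Awards, and Nominations in the cleaned file.

```python
import pandas as pd

films = pd.read_csv("oscar_films_clean.csv")   # cleaned data from Section C

# Assign default values so no key field is left empty.
films["Title"] = films["Title"].fillna("Unknown Title")
films["Category"] = films["Category"].fillna("Unknown Category")
films["Year"] = films["Year"].fillna(0).astype(int)

def classify(row):
    # Invalid: a logical error such as more awards than nominations.
    if row["Awards"] > row["Nominations"]:
        return "Invalid"
    # Incomplete: a field still relying on a default value.
    if row["Title"] == "Unknown Title" or row["Category"] == "Unknown Category" or row["Year"] == 0:
        return "Incomplete"
    return "Complete"

films["status"] = films.apply(classify, axis=1)
print(films["status"].value_counts())
films.to_csv("oscar_films_final.csv", index=False)
```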

2.3 Turtles (Frames / iFrames)

2.3.1 SECTION A – DATA COLLECTION USING PROGRAMMING

Objective: To collect data from various types of websites using programming.

## TOTAL RECORDS : 14
## TOTAL COLUMNS : 3
## 
## Saved to turtles_scraping.csv

2.3.1.1 Interpretation

  • The data was successfully extracted from a frames/iFrames-based webpage, indicating correct handling of a more complex page structure.
  • The dataset includes turtle family names, descriptions, and additional information, providing a well-rounded overview for each category.
  • The total records indicate that each turtle family was scraped successfully without major errors.
  • The description column provides the main overview, while additional_info contains supporting details such as characteristics or facts.
  • Conclusion: The scraping process successfully collected structured and complete data, making it ready for further analysis or exploration of turtle families.

2.3.2 SECTION B – DATA HANDLING

Objective: To understand the structure of the data obtained from web scraping.

## Total Rows    : 14
## Total Columns : 3
## Total Missing Values: 0
## Total Duplicates: 0

2.3.2.1 Interpretation

  • The data structure shows a simple and clear set of columns (name, description, additional_info), making it easy to analyze.
  • The missing values check indicates that the dataset is mostly complete, reflecting good data quality.
  • No duplicate data was found, meaning each turtle family entry is unique.
  • Some potential data issues were identified:
    • very long text in additional_info → may reduce readability,
    • very short descriptions → may lack meaningful information,
    • any missing values → could affect analysis accuracy.
  • Conclusion: The turtles dataset is generally clean and ready for analysis, but minor improvements in text length and completeness would enhance its overall quality.

2.3.3 SECTION C – DATA CLEANING

Objective: Clean the dataset using programming logic

## 
## Cleaned data saved to turtles_clean.csv | Total rows: 14

2.3.3.1 Interpretation

  • The text standardization process improved consistency by trimming whitespace and unifying formats (name in proper case, other fields in lowercase).
  • Missing value handling was applied using logic:
    • name column filled with “unknown”
    • rows with missing values in other columns were removed
      → resulting in cleaner and more relevant data.
  • No duplicates were found (or any that existed have been handled), ensuring there are no redundant entries.
  • Data type validation shows all columns remain as character type, which is appropriate for textual data.
  • The cleaning summary indicates that the number of rows may decrease due to removal of incomplete data.
  • Conclusion: The cleaning process successfully improves data quality by making it consistent, removing invalid entries, and preparing the dataset for further analysis.

2.3.4 SECTION D – CONDITIONAL LOGIC

Objective: Apply conditional logic in scraping & cleaning

2.3.4.1 Interpretation

  • Assigning default values ensures that no important fields (name, description, additional_info) remain empty.
  • This allows all records to remain usable even if the original data was incomplete.
  • The dataset is classified into two statuses:
    • Complete → all information is present and valid
    • Incomplete → some fields still rely on default values
  • The status distribution provides a quick overview of how much data is fully complete versus incomplete.
  • Conclusion: Conditional logic improves data quality by filling missing values and categorizing records based on completeness, making the dataset more reliable and easier to analyze.

2.3.5 SECTION E – ANALYTICAL THINKING

Objective: Analyze the scraping process

2.3.5.1 1. Easiest Website to Scrape

The Oscar Winning Films (AJAX) website is the easiest to scrape because the data is available through a JSON API, allowing direct extraction without complex HTML parsing.
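The report performs this scraping in R. As a rough Python illustration of the same idea, the endpoint URL below is a hypothetical placeholder for the JSON API that would be discovered in the browser's DevTools Network tab.

```python
import requests
import pandas as pd

# Hypothetical AJAX endpoint; the real one is found in the Network tab of the browser DevTools.
BASE_URL = "https://example.com/oscars/ajax"

records = []
for year in range(2010, 2016):                 # the report covers films from 2010-2015
    resp = requests.get(BASE_URL, params={"year": year}, timeout=30)
    resp.raise_for_status()
    for film in resp.json():                   # the endpoint is assumed to return a JSON list of film objects
        film["year"] = year
        records.append(film)

films = pd.DataFrame(records)
films.to_csv("oscar_films.csv", index=False)
print("TOTAL RECORDS:", len(films))
```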

2.3.5.2 2. Most Difficult Website to Scrape

The Turtles (Frames / iFrames) website is the hardest because (see the sketch after this list):
- Data is inside iframe structures
- Requires accessing nested pages
- Not directly available in the main HTML document
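The report again uses R with rvest; a comparable Python sketch (the page URL and CSS selectors are hypothetical) first reads the outer page, extracts the iframe's src attribute, and only then parses the embedded document.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical outer page whose real content sits inside an <iframe>.
PAGE_URL = "https://example.com/turtles/frames"

outer = BeautifulSoup(requests.get(PAGE_URL, timeout=30).text, "html.parser")
iframe_src = outer.find("iframe")["src"]              # the data is not present in the outer HTML itself
inner_url = urljoin(PAGE_URL, iframe_src)

inner = BeautifulSoup(requests.get(inner_url, timeout=30).text, "html.parser")

rows = []
for section in inner.select("div.turtle-family"):     # hypothetical container selector
    rows.append({
        "name": section.select_one("h3").get_text(strip=True),
        "description": section.select_one("p").get_text(strip=True),
        # additional_info would be extracted the same way from its own (hypothetical) element
    })
print(len(rows), "turtle families scraped")
```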

2.3.5.3 3. Differences in Scraping Approaches

  • Static Website
    • Data is directly available in HTML
    • Easy scraping using rvest
    • Simplest and most stable method
  • Pagination
    • Data is split across multiple pages
    • Requires looping through pages
    • Risk of missing data if pagination is not fully handled
  • AJAX
    • Data is loaded from an API (JSON)
    • Not always visible in initial HTML
    • More efficient when API endpoint is used directly
  • Iframe
    • Data is embedded in another page
    • Requires accessing internal frame URLs
    • Most complex structure

2.3.5.4 4. Insights

  • Website structure strongly affects scraping difficulty.
  • API-based (AJAX) scraping is more efficient than HTML parsing.
  • More layers (iframe/pagination) increase scraping complexity.

2.3.5.5 5. Recommendations

  • Use browser DevTools Network tab to find hidden API endpoints.
  • Always inspect iframe source URLs before scraping embedded content.

3 Conclusion

In conclusion, this mini project demonstrates that the process of collecting, processing, and analyzing data from multiple sources requires different approaches depending on the data characteristics, such as static pages, pagination, AJAX, and iframe structures. The use of both Python and R in web scraping and data processing provides flexibility in handling complex and diverse data structures.

Furthermore, the implementation of conditional logic (if / if-else) and looping plays an important role in automating data processing tasks, including cleaning, standardization, and dataset merging. The main challenges encountered include inconsistent data structures across websites, missing or incomplete data, and dynamic content that requires advanced scraping techniques.

Overall, this project provides a comprehensive understanding of the data science workflow, from data collection to generating insights that can support decision-making.

