Introduction
This report presents an Exploratory Data Analysis (EDA) of the sales
data provided in sales_data_test from the sales_data_test.RData file, as
part of our Data Analytics group project. The dataset contains sales
records from January to July 2020 for Company B, a fast-moving consumer
goods (FMCG) manufacturer based in the Philippines. Each record is
uniquely identified by the OrderLineID column. The goal of this EDA is
to uncover key insights, identify potential data issues, and prepare a
cleaned dataset for use in Exercise 2.
1. Data Overview
This section summarizes each column of the dataset: its name, data
type, number of unique values, and number of missing values. PONumber
is excluded as requested.
Data Overview of sales_data_test Dataset
| Column | Rows | Type | Unique values | Unique (%) | Missing values | Missing (%) |
|---|---|---|---|---|---|---|
| OrderLineID | 53652 | character | 53650 | 99.996272 | 0 | 0.000000 |
| CustomerID | 53652 | character | 1274 | 2.374562 | 5291 | 9.861701 |
| CustomerName | 53652 | character | 1271 | 2.368970 | 0 | 0.000000 |
| FacilityID | 53652 | character | 5 | 0.009319 | 0 | 0.000000 |
| ProductID | 53652 | character | 20 | 0.037277 | 0 | 0.000000 |
| ProductName | 53652 | character | 23 | 0.042869 | 3576 | 6.665176 |
| DeliveryDate | 53652 | Date | 156 | 0.290763 | 0 | 0.000000 |
| OrderQuantityInCarton | 53652 | numeric | 57 | 0.106240 | 0 | 0.000000 |
| ConfirmQuantityInCarton | 53652 | numeric | 60 | 0.111832 | 0 | 0.000000 |
| CancelQuantityInCarton | 53652 | numeric | 28 | 0.052188 | 0 | 0.000000 |
| OrderValue | 53652 | numeric | 1164 | 2.169537 | 0 | 0.000000 |
| ConfirmValue | 53652 | numeric | 1183 | 2.204950 | 2599 | 4.844181 |
| CancelValue | 53652 | numeric | 155 | 0.288899 | 0 | 0.000000 |
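A per-column overview of this kind can be produced with a few lines of base R. The sketch below is illustrative only: the toy data frame stands in for the real object (which would come from `load("sales_data_test.RData")`), `summarise_columns` is a hypothetical helper name, and counting unique values over non-missing entries is an assumption about how the table above was built.

```r
# Toy stand-in for sales_data_test; all values are invented for illustration.
sales <- data.frame(
  OrderLineID = c("OL1", "OL2", "OL3", "OL3"),
  CustomerID  = c("CUST1", NA, "CUST2", "CUST2"),
  PONumber    = c("PO1", "PO2", "PO3", "PO4"),
  OrderValue  = c(100, 250, 250, 80),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns one summary row per column of df.
summarise_columns <- function(df, exclude = character()) {
  df <- df[, setdiff(names(df), exclude), drop = FALSE]
  n  <- nrow(df)
  # Assumption: unique values are counted over non-missing entries only.
  n_unique  <- vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1))
  n_missing <- vapply(df, function(x) sum(is.na(x)), integer(1))
  data.frame(
    Column     = names(df),
    Rows       = n,
    Type       = vapply(df, function(x) class(x)[1], character(1)),
    Unique     = n_unique,
    UniquePct  = 100 * n_unique / n,
    Missing    = n_missing,
    MissingPct = 100 * n_missing / n,
    row.names  = NULL,
    stringsAsFactors = FALSE
  )
}

overview <- summarise_columns(sales, exclude = "PONumber")
print(overview)
```

Both percentages are taken relative to the total row count, which matches the figures in the table (for example, 5291 / 53652 gives 9.86%).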
Insights
- The dataset contains 53,652 observations across 13 columns
(excluding PONumber).
- OrderLineID has 53,650 unique values (99.996%) across 53,652 rows, so
two rows carry an ID that already appears elsewhere. These duplicates
must be resolved before OrderLineID can serve as the primary key.
- CustomerID has 5,291 missing values (9.86%), ProductName has 3,576
(6.67%), and ConfirmValue has 2,599 (4.84%). These are the main data
quality issues.
- DeliveryDate has 156 unique values (0.29% of rows), fewer than the
213 calendar days in January to July 2020.
- The remaining columns have no missing values, suggesting good data
completeness for most variables.
2. Univariate Analysis
This section explores the distribution of each variable through
summary statistics and visualizations.
2.2 Visualizations and Insights
1. OrderLineID
OrderLineID is the intended primary key. Let’s visualize the number of
unique values against the row count using Highcharter to check that role.
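Alongside the chart, the key check can be done numerically. The following is a minimal sketch on invented IDs, not the project's actual code: a column qualifies as a primary key only if it has no missing values and as many unique values as rows.

```r
# Invented IDs for illustration; in the report this would be
# sales_data_test$OrderLineID.
ids <- c("OL1", "OL2", "OL3", "OL3", "OL4")

n_rows   <- length(ids)
n_unique <- length(unique(ids))

# A primary key must be complete and unique.
is_key <- !anyNA(ids) && n_unique == n_rows

# duplicated() flags only the second and later occurrences, so collect the
# repeated values first and then locate every row that carries one of them.
dup_ids  <- unique(ids[duplicated(ids)])
dup_rows <- which(ids %in% dup_ids)

cat("rows:", n_rows, "unique:", n_unique, "is key:", is_key, "\n")
```

Listing the affected rows (`dup_rows`) makes it possible to inspect whether the repeats are exact duplicate records or distinct order lines that share an ID.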
2. CustomerID: Distribution
Because CustomerID has many unique values, only the top 10 customers by
order frequency are visualized.

Insight: Top customers like CUST8512
drive most orders, posing a risk if they reduce activity.
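A top-10 frequency table with each customer's share of orders could be computed as in the sketch below. The customer IDs are invented, and computing the share over non-missing IDs only is an assumption.

```r
# Invented customer IDs for illustration; NA marks a missing CustomerID.
customer_id <- c("CUST1", "CUST2", "CUST1", NA, "CUST3", "CUST1", "CUST2")

# table() drops NA by default, so missing IDs are not counted.
freq <- as.data.frame(table(CustomerID = customer_id),
                      stringsAsFactors = FALSE)

# Sort by descending frequency, breaking ties alphabetically.
freq <- freq[order(-freq$Freq, freq$CustomerID), ]
names(freq)[names(freq) == "Freq"] <- "Orders"

# Share of all order lines that have a known CustomerID.
freq$SharePct <- 100 * freq$Orders / sum(freq$Orders)

top10 <- head(freq, 10)
rownames(top10) <- NULL
print(top10)
```

The `SharePct` column gives a direct way to quantify how concentrated orders are among the top customers. With Highcharter, a data frame like `top10` can typically be plotted through `hchart()` with an `hcaes()` mapping.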
3. CustomerName: Distribution
Because CustomerName has missing values and many unique values, only
the top 10 values are visualized.

Insight: Gaisano and
Robinsons are frequent, but missing values suggest data
collection issues.
4. FacilityID: Distribution

Insight: SOIFCO1 dominates, indicating
it is the primary facility.
5. ProductID: Distribution
ProductID has a moderate number of unique values, so only the top 10
are visualized.

Insight: PROD1159 and
PROD1024 are key products.
6. ProductName: Distribution
Because ProductName has missing values and a moderate number of unique
values, only the top 10 values are visualized.

Insight: Milk Powder 1kg and
Cheese Spread 500g are frequent, with some missing values.
7. DeliveryDate: Orders Over Time

Insight: Orders fluctuate, with peaks in mid-March
and early June 2020, possibly due to seasonal demand.
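The daily series behind such a chart can be built by counting order lines per date and filling in days without orders. This is a sketch on invented dates. Treating each row as one order and filling gaps with zero are assumptions.

```r
# Invented delivery dates for illustration.
delivery_date <- as.Date(c("2020-03-14", "2020-03-14", "2020-03-15",
                           "2020-06-02", "2020-06-02", "2020-06-02"))

# Every calendar day between the first and last delivery.
all_days <- seq(min(delivery_date), max(delivery_date), by = "day")

# Using all_days as factor levels makes days without orders count as 0.
orders <- as.integer(table(factor(as.character(delivery_date),
                                  levels = as.character(all_days))))

daily <- data.frame(DeliveryDate = all_days, Orders = orders)

# The busiest day in the series.
peak <- daily[which.max(daily$Orders), ]
print(peak)
```

Keeping zero-order days in the series prevents a line chart from connecting distant dates as if they were adjacent, which would otherwise distort the apparent peaks.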
8. OrderQuantityInCarton: Distribution


Insight: Right-skewed distribution with outliers
above 50 cartons.
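Skewness and outliers can also be checked numerically. The sketch below uses invented quantities and Tukey's 1.5 × IQR rule, which is a common convention swapped in here for illustration. It is not necessarily how the threshold of 50 in the insight was chosen.

```r
# Invented order quantities (cartons) for illustration.
qty <- c(1, 2, 2, 3, 3, 3, 4, 5, 6, 60)

# First and third quartiles (R's default quantile type 7).
q   <- quantile(qty, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as high outliers.
upper_fence <- q[2] + 1.5 * iqr
outliers    <- qty[qty > upper_fence]

# A mean well above the median is a simple indicator of right skew.
right_skewed <- mean(qty) > median(qty)

cat("upper fence:", upper_fence, "outliers:", outliers, "\n")
```

A data-driven fence like this can be compared against a fixed cutoff to see whether both flag the same records.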
9. ConfirmQuantityInCarton
10. CancelQuantityInCarton by CustomerName (top 10)
13. CancelValue by CustomerName (top 10)
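The cancellation breakdowns by CustomerName could be computed along the following lines. This is a sketch with invented customers and values. Ranking by cancelled cartons is an assumption, and the same pattern applies when ranking by CancelValue.

```r
# Invented records for illustration.
sales <- data.frame(
  CustomerName           = c("Customer A", "Customer B", "Customer A",
                             "Customer C", "Customer B", "Customer A"),
  CancelQuantityInCarton = c(2, 0, 3, 1, 4, 0),
  CancelValue            = c(200, 0, 300, 50, 400, 0),
  stringsAsFactors = FALSE
)

# Total cancelled cartons and cancelled value per customer.
# Note: the formula interface drops rows with missing values by default.
by_customer <- aggregate(
  cbind(CancelQuantityInCarton, CancelValue) ~ CustomerName,
  data = sales, FUN = sum
)

# Rank customers by cancelled cartons and keep the top 10.
by_customer <- by_customer[order(-by_customer$CancelQuantityInCarton), ]
top10_cancel <- head(by_customer, 10)
rownames(top10_cancel) <- NULL
print(top10_cancel)
```

Aggregating both measures in one pass keeps the two top-10 views consistent, since they are derived from the same grouped table.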