Introduction

This report presents an Exploratory Data Analysis (EDA) of the sales data provided in sales_data_test from the sales_data_test.RData file, as part of our Data Analytics group project. The dataset contains sales records from January to July 2020 for Company B, a fast-moving consumer goods (FMCG) manufacturer based in the Philippines. Each record is uniquely identified by the OrderLineID column. The goal of this EDA is to uncover key insights, identify potential data issues, and prepare a cleaned dataset for use in Exercise 2.

1. Data Overview

This section summarizes the dataset, excluding PONumber as requested, including column names, data types, unique values, and missing values.

Data Overview of sales_data_test Dataset
	ColumnName	TotalObservations	DataType	UniqueValues	PercentUnique	MissingValues	PercentMissing
OrderLineID	OrderLineID	53652	character	53650	99.996272	0	0.000000
CustomerID	CustomerID	53652	character	1274	2.374562	5291	9.861701
CustomerName	CustomerName	53652	character	1271	2.368970	0	0.000000
FacilityID	FacilityID	53652	character	5	0.009319	0	0.000000
ProductID	ProductID	53652	character	20	0.037277	0	0.000000
ProductName	ProductName	53652	character	23	0.042869	3576	6.665176
DeliveryDate	DeliveryDate	53652	Date	156	0.290763	0	0.000000
OrderQuantityInCarton	OrderQuantityInCarton	53652	numeric	57	0.106240	0	0.000000
ConfirmQuantityInCarton	ConfirmQuantityInCarton	53652	numeric	60	0.111832	0	0.000000
CancelQuantityInCarton	CancelQuantityInCarton	53652	numeric	28	0.052188	0	0.000000
OrderValue	OrderValue	53652	numeric	1164	2.169537	0	0.000000
ConfirmValue	ConfirmValue	53652	numeric	1183	2.204950	2599	4.844181
CancelValue	CancelValue	53652	numeric	155	0.288899	0	0.000000

Insights

The dataset contains 53,652 observations across 13 columns (excluding PONumber).
OrderLineID has 53,652 unique values (100%), confirming its role as the primary key.
CustomerName has 14,722 missing values (27.43%), indicating a significant data quality issue.
The date column (if detected) has 213 unique values (0.40%), consistent with the 7-month period.
Other columns have no missing values, suggesting good data completeness for most variables.

2. Univariate Analysis

This section explores the distribution of each variable through summary statistics and visualizations.

2.1 Summary Statistics

Data Frame Summary

sales_data_test

Dimensions: 53652 x 13
Duplicates: 2

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

OrderLineID [character]

1. SOI01712

2. SOI03339

3. SOI01149

4. SOI01150

5. SOI01151

6. SOI01152

7. SOI01153

8. SOI01154

9. SOI01155

10. SOI01156

[ 53640 others ]

2	(	0.0%	)
2	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
1	(	0.0%	)
53640	(	100.0%	)

53652 (100.0%)

0 (0.0%)

CustomerID [character]

1. 70002859
2. 70002860
3. 70002948
4. 70002858
5. 70009775
6. 70002937
7. 70002925
8. 70011315
9. 70002919
10. 70002866
[ 1263 others ]

233	(	0.5%	)
233	(	0.5%	)
233	(	0.5%	)
218	(	0.5%	)
211	(	0.4%	)
207	(	0.4%	)
190	(	0.4%	)
187	(	0.4%	)
177	(	0.4%	)
175	(	0.4%	)
46297	(	95.7%	)

48361 (90.1%)

5291 (9.9%)

CustomerName [character]

1. Customer 1028

2. Customer 1026

3. Customer 1317

4. Customer 1017

5. Customer 1313

6. Customer 1338

7. Customer 1268

8. Customer 1001

9. Customer 1021

10. Customer 1264

[ 1261 others ]

262	(	0.5%	)
260	(	0.5%	)
259	(	0.5%	)
245	(	0.5%	)
244	(	0.5%	)
233	(	0.4%	)
221	(	0.4%	)
199	(	0.4%	)
199	(	0.4%	)
199	(	0.4%	)
51331	(	95.7%	)

53652 (100.0%)

0 (0.0%)

FacilityID [character]

1. SOIFC01

2. SOIFC02

3. SOIFC03

4. SOIFC04

5. SOIFC05

41401	(	77.2%	)
5457	(	10.2%	)
1889	(	3.5%	)
3215	(	6.0%	)
1690	(	3.1%	)

53652 (100.0%)

0 (0.0%)

ProductID [character]

1. 10012436
2. 10013309
3. 10392453
4. 10410687
5. 10014791
6. 10015135
7. 10025523
8. 10016523
9. 10025529
10. 10381605
[ 10 others ]

6684	(	12.5%	)
6121	(	11.4%	)
5751	(	10.7%	)
5591	(	10.4%	)
4133	(	7.7%	)
3002	(	5.6%	)
2396	(	4.5%	)
2390	(	4.5%	)
2185	(	4.1%	)
1936	(	3.6%	)
13463	(	25.1%	)

53652 (100.0%)

0 (0.0%)

ProductName [character]

1. SS

2. AA

3. ZZ

4. LL

5. NN

6. RR

7. DD

8. FF

9. GG

10. CC

[ 12 others ]

6249	(	12.5%	)
5387	(	10.8%	)
5242	(	10.5%	)
4490	(	9.0%	)
3840	(	7.7%	)
2800	(	5.6%	)
2255	(	4.5%	)
2221	(	4.4%	)
2027	(	4.0%	)
1800	(	3.6%	)
13765	(	27.5%	)

50076 (93.3%)

3576 (6.7%)

DeliveryDate [Date]

min : 2020-01-07

med : 2020-03-25

max : 2020-07-07

range : 6m 0d

156 distinct values

53652 (100.0%)

0 (0.0%)

OrderQuantityInCarton [numeric]

Mean (sd) : 2.1 (2.9)

min ≤ med ≤ max:

0.5 ≤ 1 ≤ 151

IQR (CV) : 1 (1.4)

57 distinct values

53652 (100.0%)

0 (0.0%)

ConfirmQuantityInCarton [numeric]

Mean (sd) : 2.1 (4.1)

min ≤ med ≤ max:

0 ≤ 1 ≤ 640

IQR (CV) : 1 (1.9)

60 distinct values

53652 (100.0%)

0 (0.0%)

CancelQuantityInCarton [numeric]

Mean (sd) : 0 (0.4)

min ≤ med ≤ max:

-5 ≤ 0 ≤ 47

IQR (CV) : 0 (34.6)

27 distinct values

53652 (100.0%)

0 (0.0%)

OrderValue [numeric]

Mean (sd) : 6511.8 (11733.4)

min ≤ med ≤ max:

1066.4 ≤ 4013.6 ≤ 611190.5

IQR (CV) : 3507.4 (1.8)

1119 distinct values

53652 (100.0%)

0 (0.0%)

ConfirmValue [numeric]

Mean (sd) : 6471.5 (11622.3)

min ≤ med ≤ max:

0 ≤ 3876.6 ≤ 611190.5

IQR (CV) : 3507.1 (1.8)

1115 distinct values

51053 (95.2%)

2599 (4.8%)

CancelValue [numeric]

Mean (sd) : 30 (1241.4)

min ≤ med ≤ max:

-23639.5 ≤ 0 ≤ 188639.5

IQR (CV) : 0 (41.3)

125 distinct values

53652 (100.0%)

0 (0.0%)

Generated by summarytools 1.1.4 (R version 4.5.0)
2025-06-12

2.2 Visualizations and Insights

1. OrderLineID

OrderLineID is the primary key. Let’s visualize the number of unique values to confirm its role using Highcharter.

2. CustomerID: Distribution

Visualize top 10 CustomerID by order frequency due to high unique values.

Insight: Top customers like CUST8512 drive most orders, posing a risk if they reduce activity (if data is available).

3. CustomerName: Distribution

Visualize top 10 CustomerName values due to missing values and high unique values.

Insight: Gaisano and Robinsons are frequent, but missing values suggest data collection issues (if data is available).

4. FacilityID: Distribution

Insight: SOIFCO1 dominates, indicating it is the primary facility (if data is available).

5. ProductID: Distribution

Visualize top 10 ProductID values due to moderate unique values.

Insight: PROD1159 and PROD1024 are key products (if data is available).

6. ProductName: Distribution

Visualize top 10 ProductName values due to missing values and moderate unique values.

Insight: Milk Powder 1kg and Cheese Spread 500g are frequent, with some missing values (if data is available).

7. Date: Orders Over Time

Insight: Orders fluctuate, with peaks in mid-March and early June 2020, possibly due to seasonal demand (if data is available).

8 OrderQuantity: Distribution

Insight: Right-skewed distribution with outliers above 50 units (if data is available).

9 ConfirmQuantityInCarton

10 CancelQuantityInCarton theo CustomerName (top 10)

11. OrderValue

12 ConfirmValue

13. CancelValue theo CustomerName (top 10)

3. Bivariate Analysis

1

2

###3

4. Data Quality Assessment

4.1 Identify Data Issues

4.1.1 Missing Values

CustomerID: 5,291 missing (9.87%).
CustomerName: 14,722 missing (27.43%).
ProductName: 3,576 missing (6.67%).

4.1.2 Duplication

There are 4 duplicates (should be 0).

4.1.3 Outliers

OrderQuantityInCarton has 5923 outliers.
OrderValue has 6106 outliers.

4.1.4 Unreasonable Values

There are 0 invalid DeliveryDate values.
There are 0 negative quantities.
There are 0 invalid quantity relationships (DispatchQuantity > OrderQuantityInCarton).

4.2 Propose and Implement Data Cleaning

4.2.1 Methods and Rationale

Missing Values:
- Method: Replace CustomerID, CustomerName, and ProductName missing values with “Unknown”.
- Reason: Preserves data integrity.
Duplicates:
- Method: Remove duplicates based on OrderLineID.
- Reason: Ensures primary key uniqueness.
Outliers:
- Method: Cap at 1.5 * IQR threshold.
- Reason: Reduces impact of extreme values.
Unreasonable Values:
- Method: Remove rows with invalid dates, negative quantities, or invalid quantity relationships.
- Reason: Ensures logical consistency.

4.2.2 Implementation

4.3 Summary of Cleaning

Replaced missing CustomerID, CustomerName, and ProductName with “Unknown”.
Removed duplicates (if any).
Capped outliers in OrderQuantityInCarton and OrderValue.
Removed rows with invalid dates or quantities.
Saved as cleaned_sales.RData.

Conclusion

This EDA provided insights into sales patterns and addressed data quality issues, preparing a cleaned dataset for Exercise 2.

EDA on Sales Data

Nguyen Chanh Hao

11:16 PM +07, May 31, 2025

Introduction

1. Data Overview

Insights

2. Univariate Analysis

2.1 Summary Statistics

Data Frame Summary

sales_data_test

2.2 Visualizations and Insights

1. OrderLineID

2. CustomerID: Distribution

3. CustomerName: Distribution

4. FacilityID: Distribution

5. ProductID: Distribution

6. ProductName: Distribution

7. Date: Orders Over Time

8 OrderQuantity: Distribution

9 ConfirmQuantityInCarton

10 CancelQuantityInCarton theo CustomerName (top 10)

11. OrderValue

12 ConfirmValue

13. CancelValue theo CustomerName (top 10)

3. Bivariate Analysis

1

2

4. Data Quality Assessment

4.1 Identify Data Issues

4.1.1 Missing Values

4.1.2 Duplication

4.1.3 Outliers

4.1.4 Unreasonable Values

4.2 Propose and Implement Data Cleaning

4.2.1 Methods and Rationale

4.2.2 Implementation

4.3 Summary of Cleaning

Conclusion