Introduction

This report presents an Exploratory Data Analysis (EDA) of the sales data provided in sales_data_test from the sales_data_test.RData file, as part of our Data Analytics group project. The dataset contains sales records from January to July 2020 for Company B, a fast-moving consumer goods (FMCG) manufacturer based in the Philippines. Each record is uniquely identified by the OrderLineID column. The goal of this EDA is to uncover key insights, identify potential data issues, and prepare a cleaned dataset for use in Exercise 2.

1. Data Overview

This section summarizes the dataset, excluding PONumber as requested, including column names, data types, unique values, and missing values.

Data Overview of sales_data_test Dataset
ColumnName TotalObservations DataType UniqueValues PercentUnique MissingValues PercentMissing
OrderLineID OrderLineID 53652 character 53650 99.996272 0 0.000000
CustomerID CustomerID 53652 character 1274 2.374562 5291 9.861701
CustomerName CustomerName 53652 character 1271 2.368970 0 0.000000
FacilityID FacilityID 53652 character 5 0.009319 0 0.000000
ProductID ProductID 53652 character 20 0.037277 0 0.000000
ProductName ProductName 53652 character 23 0.042869 3576 6.665176
DeliveryDate DeliveryDate 53652 Date 156 0.290763 0 0.000000
OrderQuantityInCarton OrderQuantityInCarton 53652 numeric 57 0.106240 0 0.000000
ConfirmQuantityInCarton ConfirmQuantityInCarton 53652 numeric 60 0.111832 0 0.000000
CancelQuantityInCarton CancelQuantityInCarton 53652 numeric 28 0.052188 0 0.000000
OrderValue OrderValue 53652 numeric 1164 2.169537 0 0.000000
ConfirmValue ConfirmValue 53652 numeric 1183 2.204950 2599 4.844181
CancelValue CancelValue 53652 numeric 155 0.288899 0 0.000000

Insights

  • The dataset contains 53,652 observations across 13 columns (excluding PONumber).
  • OrderLineID has 53,652 unique values (100%), confirming its role as the primary key.
  • CustomerName has 14,722 missing values (27.43%), indicating a significant data quality issue.
  • The date column (if detected) has 213 unique values (0.40%), consistent with the 7-month period.
  • Other columns have no missing values, suggesting good data completeness for most variables.

2. Univariate Analysis

This section explores the distribution of each variable through summary statistics and visualizations.

2.1 Summary Statistics

Data Frame Summary

sales_data_test

Dimensions: 53652 x 13
Duplicates: 2
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 OrderLineID [character]
1. SOI01712
2. SOI03339
3. SOI01149
4. SOI01150
5. SOI01151
6. SOI01152
7. SOI01153
8. SOI01154
9. SOI01155
10. SOI01156
[ 53640 others ]
2(0.0%)
2(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
1(0.0%)
53640(100.0%)
53652 (100.0%) 0 (0.0%)
2 CustomerID [character]
1. 70002859
2. 70002860
3. 70002948
4. 70002858
5. 70009775
6. 70002937
7. 70002925
8. 70011315
9. 70002919
10. 70002866
[ 1263 others ]
233(0.5%)
233(0.5%)
233(0.5%)
218(0.5%)
211(0.4%)
207(0.4%)
190(0.4%)
187(0.4%)
177(0.4%)
175(0.4%)
46297(95.7%)
48361 (90.1%) 5291 (9.9%)
3 CustomerName [character]
1. Customer 1028
2. Customer 1026
3. Customer 1317
4. Customer 1017
5. Customer 1313
6. Customer 1338
7. Customer 1268
8. Customer 1001
9. Customer 1021
10. Customer 1264
[ 1261 others ]
262(0.5%)
260(0.5%)
259(0.5%)
245(0.5%)
244(0.5%)
233(0.4%)
221(0.4%)
199(0.4%)
199(0.4%)
199(0.4%)
51331(95.7%)
53652 (100.0%) 0 (0.0%)
4 FacilityID [character]
1. SOIFC01
2. SOIFC02
3. SOIFC03
4. SOIFC04
5. SOIFC05
41401(77.2%)
5457(10.2%)
1889(3.5%)
3215(6.0%)
1690(3.1%)
53652 (100.0%) 0 (0.0%)
5 ProductID [character]
1. 10012436
2. 10013309
3. 10392453
4. 10410687
5. 10014791
6. 10015135
7. 10025523
8. 10016523
9. 10025529
10. 10381605
[ 10 others ]
6684(12.5%)
6121(11.4%)
5751(10.7%)
5591(10.4%)
4133(7.7%)
3002(5.6%)
2396(4.5%)
2390(4.5%)
2185(4.1%)
1936(3.6%)
13463(25.1%)
53652 (100.0%) 0 (0.0%)
6 ProductName [character]
1. SS
2. AA
3. ZZ
4. LL
5. NN
6. RR
7. DD
8. FF
9. GG
10. CC
[ 12 others ]
6249(12.5%)
5387(10.8%)
5242(10.5%)
4490(9.0%)
3840(7.7%)
2800(5.6%)
2255(4.5%)
2221(4.4%)
2027(4.0%)
1800(3.6%)
13765(27.5%)
50076 (93.3%) 3576 (6.7%)
7 DeliveryDate [Date]
min : 2020-01-07
med : 2020-03-25
max : 2020-07-07
range : 6m 0d
156 distinct values 53652 (100.0%) 0 (0.0%)
8 OrderQuantityInCarton [numeric]
Mean (sd) : 2.1 (2.9)
min ≤ med ≤ max:
0.5 ≤ 1 ≤ 151
IQR (CV) : 1 (1.4)
57 distinct values 53652 (100.0%) 0 (0.0%)
9 ConfirmQuantityInCarton [numeric]
Mean (sd) : 2.1 (4.1)
min ≤ med ≤ max:
0 ≤ 1 ≤ 640
IQR (CV) : 1 (1.9)
60 distinct values 53652 (100.0%) 0 (0.0%)
10 CancelQuantityInCarton [numeric]
Mean (sd) : 0 (0.4)
min ≤ med ≤ max:
-5 ≤ 0 ≤ 47
IQR (CV) : 0 (34.6)
27 distinct values 53652 (100.0%) 0 (0.0%)
11 OrderValue [numeric]
Mean (sd) : 6511.8 (11733.4)
min ≤ med ≤ max:
1066.4 ≤ 4013.6 ≤ 611190.5
IQR (CV) : 3507.4 (1.8)
1119 distinct values 53652 (100.0%) 0 (0.0%)
12 ConfirmValue [numeric]
Mean (sd) : 6471.5 (11622.3)
min ≤ med ≤ max:
0 ≤ 3876.6 ≤ 611190.5
IQR (CV) : 3507.1 (1.8)
1115 distinct values 51053 (95.2%) 2599 (4.8%)
13 CancelValue [numeric]
Mean (sd) : 30 (1241.4)
min ≤ med ≤ max:
-23639.5 ≤ 0 ≤ 188639.5
IQR (CV) : 0 (41.3)
125 distinct values 53652 (100.0%) 0 (0.0%)

Generated by summarytools 1.1.4 (R version 4.5.0)
2025-06-12

2.2 Visualizations and Insights

1. OrderLineID

OrderLineID is the primary key. Let’s visualize the number of unique values to confirm its role using Highcharter.

2. CustomerID: Distribution

Visualize top 10 CustomerID by order frequency due to high unique values.

Insight: Top customers like CUST8512 drive most orders, posing a risk if they reduce activity (if data is available).

3. CustomerName: Distribution

Visualize top 10 CustomerName values due to missing values and high unique values.

Insight: Gaisano and Robinsons are frequent, but missing values suggest data collection issues (if data is available).

4. FacilityID: Distribution

Insight: SOIFCO1 dominates, indicating it is the primary facility (if data is available).

5. ProductID: Distribution

Visualize top 10 ProductID values due to moderate unique values.

Insight: PROD1159 and PROD1024 are key products (if data is available).

6. ProductName: Distribution

Visualize top 10 ProductName values due to missing values and moderate unique values.

Insight: Milk Powder 1kg and Cheese Spread 500g are frequent, with some missing values (if data is available).

7. Date: Orders Over Time

Insight: Orders fluctuate, with peaks in mid-March and early June 2020, possibly due to seasonal demand (if data is available).

8 OrderQuantity: Distribution

Insight: Right-skewed distribution with outliers above 50 units (if data is available).

9 ConfirmQuantityInCarton

10 CancelQuantityInCarton theo CustomerName (top 10)

11. OrderValue

12 ConfirmValue

13. CancelValue theo CustomerName (top 10)

3. Bivariate Analysis

1

2

###3

4. Data Quality Assessment

4.1 Identify Data Issues

4.1.1 Missing Values

  • CustomerID: 5,291 missing (9.87%).
  • CustomerName: 14,722 missing (27.43%).
  • ProductName: 3,576 missing (6.67%).

4.1.2 Duplication

There are 4 duplicates (should be 0).

4.1.3 Outliers

  • OrderQuantityInCarton has 5923 outliers.
  • OrderValue has 6106 outliers.

4.1.4 Unreasonable Values

  • There are 0 invalid DeliveryDate values.
  • There are 0 negative quantities.
  • There are 0 invalid quantity relationships (DispatchQuantity > OrderQuantityInCarton).

4.2 Propose and Implement Data Cleaning

4.2.1 Methods and Rationale

  • Missing Values:
    • Method: Replace CustomerID, CustomerName, and ProductName missing values with “Unknown”.
    • Reason: Preserves data integrity.
  • Duplicates:
    • Method: Remove duplicates based on OrderLineID.
    • Reason: Ensures primary key uniqueness.
  • Outliers:
    • Method: Cap at 1.5 * IQR threshold.
    • Reason: Reduces impact of extreme values.
  • Unreasonable Values:
    • Method: Remove rows with invalid dates, negative quantities, or invalid quantity relationships.
    • Reason: Ensures logical consistency.

4.2.2 Implementation

4.3 Summary of Cleaning

  • Replaced missing CustomerID, CustomerName, and ProductName with “Unknown”.
  • Removed duplicates (if any).
  • Capped outliers in OrderQuantityInCarton and OrderValue.
  • Removed rows with invalid dates or quantities.
  • Saved as cleaned_sales.RData.

Conclusion

This EDA provided insights into sales patterns and addressed data quality issues, preparing a cleaned dataset for Exercise 2.