Week 5 | Data Dive — Documentation

Introduction

This analysis explores the Walmart Sales Dataset with a focus on data documentation and unclear variables. The key objectives include:

Identifying at least three unclear columns in the dataset that require documentation.
Understanding why data is encoded a certain way and the impact of misinterpretation.
Investigating data elements that remain unclear even after reviewing the documentation.
Creating two visualizations to highlight data interpretation issues.
Analyzing missing values, empty groups, and outliers in the dataset.

# Define Kaggle dataset URL and output file path
output_file <- "Walmart Sales Data.csv"
data <- read_csv(output_file)
head(data)
## # A tibble: 6 × 17
##   `Invoice ID` Branch City    `Customer type` Gender `Product line` `Unit price`
##   <chr>        <chr>  <chr>   <chr>           <chr>  <chr>                 <dbl>
## 1 750-67-8428  A      Yangon  Member          Female Health and be…         74.7
## 2 226-31-3081  C      Naypyi… Normal          Female Electronic ac…         15.3
## 3 631-41-3108  A      Yangon  Normal          Male   Home and life…         46.3
## 4 123-19-1176  A      Yangon  Member          Male   Health and be…         58.2
## 5 373-73-7910  A      Yangon  Normal          Male   Sports and tr…         86.3
## 6 699-14-3026  C      Naypyi… Normal          Male   Electronic ac…         85.4
## # ℹ 10 more variables: Quantity <dbl>, `Tax 5%` <dbl>, Total <dbl>,
## #   Date <date>, Time <time>, Payment <chr>, cogs <dbl>,
## #   `gross margin percentage` <dbl>, `gross income` <dbl>, Rating <dbl>

Unclear Columns in the Dataset

Below are three columns that are unclear without documentation:

“gross margin percentage”
- Issue: This value is constant (4.762%) for all rows.
- Potential Problem: It does not seem to reflect real variability in margin. Either it’s miscalculated, or incorrectly documented.
“cogs” (Cost of Goods Sold)
- Issue: It is unclear whether it represents cost per unit, total transaction cost, or regional average cost.
- Potential Problem: Misinterpretation could affect profit margin calculations and financial analysis.
“Rating”
- Issue: The rating scale is unclear (1-10? Customer feedback? Store performance?).
- Potential Problem: If used incorrectly, it could mislead conclusions about customer satisfaction.

Data Elements Still Unclear Even After Reading Documentation

Despite the dataset documentation, one key issue remains unclear:

“Branch” vs “City”:
- It is unclear if a Branch represents a unique store or multiple locations within a city.
- If multiple branches exist in a city, sales aggregation may be misleading.
“Payment” Column:
- The dataset contains payment types: Cash, Credit Card, and Ewallet.
- It does not clarify whether these vary by customer type, region, or other factors.

Visualizing Data Issues

We now create two visualizations to highlight potential misinterpretations of the data.

1. Payment Method Distribution

This plot shows how different payment methods are used.

ggplot(data, aes(x = Payment, fill = Payment)) +
  geom_bar() +
  labs(
    title = "Distribution of Payment Methods",
    x = "Payment Method",
    y = "Count"
  ) +
  theme_minimal()

Observation:

The dataset does not specify if payment type correlates with customer type, branch, or product line.
Without additional context, this chart alone does not explain consumer preferences.

2. Cost of Goods Sold (COGS) vs. Gross Income

This scatter plot investigates the relationship between COGS and Gross Income.

ggplot(data, aes(x = cogs, y = `gross income`)) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "COGS vs Gross Income",
    x = "Cost of Goods Sold (COGS)",
    y = "Gross Income"
  ) +
  theme_minimal()

Observation:

The COGS definition remains unclear, affecting profit calculations.
The linear trend suggests gross income is calculated directly from COGS, but verification is needed.

Missing Data Analysis

We check for explicitly missing rows, implicitly missing rows, and empty groups in categorical variables.

missing_counts <- data %>%
  summarise(across(everything(), ~ sum(is.na(.))))

missing_counts
## # A tibble: 1 × 17
##   `Invoice ID` Branch  City `Customer type` Gender `Product line` `Unit price`
##          <int>  <int> <int>           <int>  <int>          <int>        <int>
## 1            0      0     0               0      0              0            0
## # ℹ 10 more variables: Quantity <int>, `Tax 5%` <int>, Total <int>, Date <int>,
## #   Time <int>, Payment <int>, cogs <int>, `gross margin percentage` <int>,
## #   `gross income` <int>, Rating <int>

Findings:

Explicitly missing rows: Some columns contain NA values, requiring imputation.
Implicitly missing rows: Some product lines may lack representation in certain branches.
Empty Groups: Some combinations of product line and branch may be missing.

Outlier Detection in Continuous Data

We define outliers in Total Sales using the Interquartile Range (IQR) method.

Q1 <- quantile(data$Total, 0.25, na.rm = TRUE)
Q3 <- quantile(data$Total, 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1

# Define lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

# Detect outliers
outliers <- data %>%
  filter(Total < lower_bound | Total > upper_bound)

# Count of outliers
nrow(outliers)
## [1] 9

Insight:

Outliers above the upper bound indicate weeks with high spending customers (bulk orders).
Low outliers may suggest discounted sales or product recalls.

Key Insights and Potential Risks

Key Takeaways: 1. Data Documentation is Critical - The constant gross margin percentage suggests either an error or a non-variable metric. - COGS needs clarification on whether it’s per unit or per transaction. 2. Unclear Data Elements Affect Decisions - If branches are misidentified, revenue per location might be over- or under-reported. - Payment type data lacks context, limiting its usefulness in predicting consumer behavior. 3. Missing Data and Biases Exist - Some product lines are not represented in every branch, leading to selection bias. - Missing values require imputation strategies. 4. Outliers and Sales Anomalies - Some sales totals deviate significantly, needing further investigation. - High sales may correlate with specific promotions or discounts.

Next Steps:

Clarify Data Definitions: Seek documentation updates on COGS, margin percentage, and branch-city relationships.
Handle Missing Data: Use imputation techniques for missing values.
Investigate Outliers: Identify causes of extreme sales figures (seasonal effects, promotions, or errors).
Improve Payment Analysis: Link payment type with customer segments for better insights.
Check for Data Bias: Ensure all product lines and branches are fairly represented.