This analysis explores the Walmart Sales Dataset with a focus on data documentation and unclear variables. The key objectives include:
# Define Kaggle dataset URL and output file path
output_file <- "Walmart Sales Data.csv"
data <- read_csv(output_file)
head(data)
## # A tibble: 6 × 17
## `Invoice ID` Branch City `Customer type` Gender `Product line` `Unit price`
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 750-67-8428 A Yangon Member Female Health and be… 74.7
## 2 226-31-3081 C Naypyi… Normal Female Electronic ac… 15.3
## 3 631-41-3108 A Yangon Normal Male Home and life… 46.3
## 4 123-19-1176 A Yangon Member Male Health and be… 58.2
## 5 373-73-7910 A Yangon Normal Male Sports and tr… 86.3
## 6 699-14-3026 C Naypyi… Normal Male Electronic ac… 85.4
## # ℹ 10 more variables: Quantity <dbl>, `Tax 5%` <dbl>, Total <dbl>,
## # Date <date>, Time <time>, Payment <chr>, cogs <dbl>,
## # `gross margin percentage` <dbl>, `gross income` <dbl>, Rating <dbl>
Below are three columns that are unclear without documentation:
Despite the dataset documentation, one key issue remains unclear:
We now create two visualizations to highlight potential misinterpretations of the data.
This plot shows how different payment methods are used.
ggplot(data, aes(x = Payment, fill = Payment)) +
geom_bar() +
labs(
title = "Distribution of Payment Methods",
x = "Payment Method",
y = "Count"
) +
theme_minimal()
Observation:
This scatter plot investigates the relationship between COGS and Gross Income.
ggplot(data, aes(x = cogs, y = `gross income`)) +
geom_point(alpha = 0.5, color = "blue") +
geom_smooth(method = "lm", color = "red") +
labs(
title = "COGS vs Gross Income",
x = "Cost of Goods Sold (COGS)",
y = "Gross Income"
) +
theme_minimal()
Observation:
We check for explicitly missing rows, implicitly missing rows, and empty groups in categorical variables.
missing_counts <- data %>%
summarise(across(everything(), ~ sum(is.na(.))))
missing_counts
## # A tibble: 1 × 17
## `Invoice ID` Branch City `Customer type` Gender `Product line` `Unit price`
## <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0
## # ℹ 10 more variables: Quantity <int>, `Tax 5%` <int>, Total <int>, Date <int>,
## # Time <int>, Payment <int>, cogs <int>, `gross margin percentage` <int>,
## # `gross income` <int>, Rating <int>
Findings:
We define outliers in Total Sales using the Interquartile Range (IQR) method.
Q1 <- quantile(data$Total, 0.25, na.rm = TRUE)
Q3 <- quantile(data$Total, 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1
# Define lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
# Detect outliers
outliers <- data %>%
filter(Total < lower_bound | Total > upper_bound)
# Count of outliers
nrow(outliers)
## [1] 9
Insight:
Key Takeaways: 1. Data Documentation is Critical - The constant gross margin percentage suggests either an error or a non-variable metric. - COGS needs clarification on whether it’s per unit or per transaction. 2. Unclear Data Elements Affect Decisions - If branches are misidentified, revenue per location might be over- or under-reported. - Payment type data lacks context, limiting its usefulness in predicting consumer behavior. 3. Missing Data and Biases Exist - Some product lines are not represented in every branch, leading to selection bias. - Missing values require imputation strategies. 4. Outliers and Sales Anomalies - Some sales totals deviate significantly, needing further investigation. - High sales may correlate with specific promotions or discounts.