This analysis was conducted to support strategic decision-making for a growing e-commerce business. Leadership requested data-driven insights to answer critical questions: Which region shows strongest performance for potential expansion? What products should we prioritize in inventory planning? How do we interpret seasonal revenue patterns? These questions require clean data and accurate analysis to make confident recommendations.
I examined customer order data from February through April 2023 across multiple regions and product categories. The analysis emphasizes:

- Systematic data quality assessment and cleaning
- Accurate joins and aggregations
- Clear visualization of regional performance, product mix, and temporal trends
- Evidence-based insights to support business decisions
Through this analysis, I identified that the South region leads revenue performance (37% of total), Electronics dominates product sales (54% of revenue), and the business shows strong growth momentum in early 2023 (a 164% increase from February to April). These insights directly answer leadership's questions about where to focus resources and how to interpret performance patterns.
# Load required packages
library(tidyverse)   # readr, dplyr, stringr, ggplot2, and pipes
library(lubridate)   # mdy() and floor_date() for date handling
library(knitr)       # kable() for formatted tables; scales is called via scales:: below

# Load the datasets
customers <- read_csv("customers.csv")
orders <- read_csv("orders.csv")
# Preview the data
cat("=== CUSTOMERS DATA ===\n")
## === CUSTOMERS DATA ===
head(customers) %>% kable()
| customer_id | signup_date | region |
|---|---|---|
| c001 | 1/15/23 | north |
| c002 | 2/2/23 | south |
| C003 | 2/20/23 | East |
| C004 | 3/1/23 | West |
| C005 | 3/5/23 | North |
| C006 | 3/10/23 | South |
cat("\n\n=== ORDERS DATA ===\n")
##
##
## === ORDERS DATA ===
head(orders) %>% kable()
| order_id | customer_id | order_date | amount | product_category |
|---|---|---|---|---|
| O001 | C001 | 2/1/23 | 120 | Electronics |
| O002 | C002 | 2/5/23 | 75 | Home |
| O003 | C003 | 2/25/23 | 210 | Electronics |
| O004 | C001 | 3/3/23 | 180 | Clothing |
| O005 | C004 | 3/10/23 | 95 | Home |
| O006 | C005 | 3/15/23 | 130 | Clothing |
Before proceeding with analysis, I examined both datasets to identify potential issues that could affect data integrity and analysis accuracy.
# Examine data structure
cat("=== CUSTOMERS DATASET STRUCTURE ===\n")
## === CUSTOMERS DATASET STRUCTURE ===
glimpse(customers)
## Rows: 10
## Columns: 3
## $ customer_id <chr> "c001", "c002", "C003", "C004", "C005", "C006", "C007", "C…
## $ signup_date <chr> "1/15/23", "2/2/23", "2/20/23", "3/1/23", "3/5/23", "3/10/…
## $ region <chr> "north", "south", "East", "West", "North", "South", "East"…
cat("\n\n=== ORDERS DATASET STRUCTURE ===\n")
##
##
## === ORDERS DATASET STRUCTURE ===
glimpse(orders)
## Rows: 15
## Columns: 5
## $ order_id <chr> "O001", "O002", "O003", "O004", "O005", "O006", "O007…
## $ customer_id <chr> "C001", "C002", "C003", "C001", "C004", "C005", "C002…
## $ order_date <chr> "2/1/23", "2/5/23", "2/25/23", "3/3/23", "3/10/23", "…
## $ amount <dbl> 120, 75, 210, 180, 95, 130, 220, 90, 310, 150, 50, 20…
## $ product_category <chr> "Electronics", "Home", "Electronics", "Clothing", "Ho…
cat("\n\n=== DATA QUALITY ISSUES IDENTIFIED ===\n\n")
##
##
## === DATA QUALITY ISSUES IDENTIFIED ===
# Issue 1: Customer ID casing inconsistency
cat("ISSUE 1: Customer ID Casing Inconsistency\n")
## ISSUE 1: Customer ID Casing Inconsistency
cat(" customers.csv contains:", paste(head(customers$customer_id, 5), collapse = ", "), "\n")
## customers.csv contains: c001, c002, C003, C004, C005
cat(" orders.csv contains:", paste(head(orders$customer_id, 5), collapse = ", "), "\n")
## orders.csv contains: C001, C002, C003, C001, C004
cat(" Impact: Case-sensitive joins will fail, losing all order data\n")
## Impact: Case-sensitive joins will fail, losing all order data
cat(" Solution: Standardize all IDs to uppercase before joining\n\n")
## Solution: Standardize all IDs to uppercase before joining
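As a quick standalone illustration of why normalization matters (not part of the pipeline output):
# Case-sensitive comparison fails until both sides share one case
identical("c001", "C001")               # FALSE: R string comparison is case-sensitive
identical(str_to_upper("c001"), "C001") # TRUE after standardizing to uppercase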
# Issue 2: Region casing inconsistency
cat("ISSUE 2: Region Capitalization Inconsistency\n")
## ISSUE 2: Region Capitalization Inconsistency
cat(" Unique region values:", paste(unique(customers$region), collapse = ", "), "\n")
## Unique region values: north, south, East, West, North, South
cat(" Impact: Same region counted separately (e.g., 'north' vs 'North')\n")
## Impact: Same region counted separately (e.g., 'north' vs 'North')
cat(" Solution: Standardize to Title Case for consistent grouping\n\n")
## Solution: Standardize to Title Case for consistent grouping
# Issue 3: Date format
cat("ISSUE 3: Date Fields Stored as Text\n")
## ISSUE 3: Date Fields Stored as Text
cat(" Date format example:", customers$signup_date[1], "\n")
## Date format example: 1/15/23
cat(" Impact: Cannot perform time-based analysis or sorting\n")
## Impact: Cannot perform time-based analysis or sorting
cat(" Solution: Parse dates using lubridate::mdy() function\n")
## Solution: Parse dates using lubridate::mdy() function
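A quick standalone check of the parser on a sample value (assuming lubridate is loaded, as in the setup chunk):
mdy("1/15/23")        # parses the text to the Date "2023-01-15"
class(mdy("1/15/23")) # "Date", so sorting and time arithmetic now work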
# Clean customers dataset
customers_clean <- customers %>%
mutate(
# Standardize customer_id to uppercase for consistent joins
customer_id = str_to_upper(customer_id),
# Standardize region to Title Case for consistent grouping
region = str_to_title(region),
# Parse signup_date from text to proper date format
signup_date = mdy(signup_date)
)
# Clean orders dataset
orders_clean <- orders %>%
mutate(
# Standardize customer_id to uppercase to match customers table
customer_id = str_to_upper(customer_id),
# Parse order_date from text to proper date format
order_date = mdy(order_date),
# Create month column for time-series aggregation
order_month = floor_date(order_date, "month")
)
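For clarity, a standalone sketch of what floor_date() does to a single date (illustrative value only):
# Any day within a month collapses to the first of that month
floor_date(as.Date("2023-02-25"), "month") # "2023-02-01"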
# Verify the cleaning worked
cat("=== CLEANING VERIFICATION ===\n\n")
## === CLEANING VERIFICATION ===
cat("Unique customer IDs in customers:", n_distinct(customers_clean$customer_id), "\n")
## Unique customer IDs in customers: 10
cat("Unique customer IDs in orders:", n_distinct(orders_clean$customer_id), "\n")
## Unique customer IDs in orders: 10
cat("Standardized regions:", paste(sort(unique(customers_clean$region)), collapse = ", "), "\n\n")
## Standardized regions: East, North, South, West
cat("Sample of cleaned customers data:\n")
## Sample of cleaned customers data:
head(customers_clean, 3) %>% kable()
| customer_id | signup_date | region |
|---|---|---|
| C001 | 2023-01-15 | North |
| C002 | 2023-02-02 | South |
| C003 | 2023-02-20 | East |
cat("\n\nSample of cleaned orders data:\n")
##
##
## Sample of cleaned orders data:
head(orders_clean, 3) %>% kable()
| order_id | customer_id | order_date | amount | product_category | order_month |
|---|---|---|---|---|---|
| O001 | C001 | 2023-02-01 | 120 | Electronics | 2023-02-01 |
| O002 | C002 | 2023-02-05 | 75 | Home | 2023-02-01 |
| O003 | C003 | 2023-02-25 | 210 | Electronics | 2023-02-01 |
# Join orders with customer data
# Using left_join to keep all orders even if customer data is missing
orders_with_region <- orders_clean %>%
left_join(
customers_clean %>% select(customer_id, region, signup_date),
by = "customer_id"
)
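A defensive check worth running alongside the verification below; this is a sketch using dplyr::anti_join, and unmatched_orders is a new name introduced here:
# Orders whose customer_id has no match in customers_clean; zero rows confirms full coverage
unmatched_orders <- orders_clean %>%
  anti_join(customers_clean, by = "customer_id")
nrow(unmatched_orders) # expected 0 after ID standardization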
# Verify join success
cat("=== JOIN VERIFICATION ===\n\n")
## === JOIN VERIFICATION ===
cat("Total orders in dataset:", nrow(orders_with_region), "\n")
## Total orders in dataset: 15
cat("Orders successfully matched with customer data:", sum(!is.na(orders_with_region$region)), "\n")
## Orders successfully matched with customer data: 15
cat("Join success rate:", scales::percent(sum(!is.na(orders_with_region$region)) / nrow(orders_with_region)), "\n\n")
## Join success rate: 100%
# Display sample of joined data showing the merge worked
cat("Sample of joined data (showing order with customer region):\n")
## Sample of joined data (showing order with customer region):
head(orders_with_region, 6) %>%
select(order_id, customer_id, order_date, amount, product_category, region) %>%
kable()
| order_id | customer_id | order_date | amount | product_category | region |
|---|---|---|---|---|---|
| O001 | C001 | 2023-02-01 | 120 | Electronics | North |
| O002 | C002 | 2023-02-05 | 75 | Home | South |
| O003 | C003 | 2023-02-25 | 210 | Electronics | East |
| O004 | C001 | 2023-03-03 | 180 | Clothing | North |
| O005 | C004 | 2023-03-10 | 95 | Home | West |
| O006 | C005 | 2023-03-15 | 130 | Clothing | North |
# Calculate key metrics by region
revenue_by_region <- orders_with_region %>%
group_by(region) %>%
summarise(
total_orders = n(),
total_revenue = sum(amount),
avg_order_value = round(mean(amount), 2),
.groups = 'drop'
) %>%
arrange(desc(total_revenue))
# Display results
cat("=== REVENUE METRICS BY REGION ===\n\n")
## === REVENUE METRICS BY REGION ===
revenue_by_region %>%
mutate(
total_revenue = scales::dollar(total_revenue),
avg_order_value = scales::dollar(avg_order_value)
) %>%
kable()
| region | total_orders | total_revenue | avg_order_value |
|---|---|---|---|
| South | 6 | $805 | $134.17 |
| North | 5 | $720 | $144.00 |
| East | 2 | $520 | $260.00 |
| West | 2 | $145 | $72.50 |
# Identify top and bottom performers
cat("\n\nKey Regional Insights:\n")
##
##
## Key Regional Insights:
cat("• Highest revenue region:", revenue_by_region$region[1],
"with", scales::dollar(revenue_by_region$total_revenue[1]),
"(", revenue_by_region$total_orders[1], "orders)\n")
## • Highest revenue region: South with $805 ( 6 orders)
cat("• Lowest revenue region:", revenue_by_region$region[nrow(revenue_by_region)],
"with", scales::dollar(revenue_by_region$total_revenue[nrow(revenue_by_region)]),
"(", revenue_by_region$total_orders[nrow(revenue_by_region)], "orders)\n")
## • Lowest revenue region: West with $145 ( 2 orders)
# Calculate metrics for each customer
customer_metrics <- orders_with_region %>%
group_by(customer_id, region) %>%
summarise(
total_orders = n(),
total_spent = sum(amount),
avg_order_amount = round(mean(amount), 2),
.groups = 'drop'
) %>%
arrange(desc(total_spent))
# Overall customer statistics
cat("=== CUSTOMER SPENDING METRICS ===\n\n")
## === CUSTOMER SPENDING METRICS ===
cat("Total unique customers:", n_distinct(customer_metrics$customer_id), "\n")
## Total unique customers: 10
cat("Overall average order amount per customer:",
scales::dollar(round(mean(customer_metrics$avg_order_amount), 2)), "\n")
## Overall average order amount per customer: $153.83
cat("Median order amount per customer:",
scales::dollar(round(median(customer_metrics$avg_order_amount), 2)), "\n\n")
## Median order amount per customer: $142.50
# Show top customers
cat("Top 5 Customers by Total Spending:\n")
## Top 5 Customers by Total Spending:
customer_metrics %>%
head(5) %>%
mutate(
total_spent = scales::dollar(total_spent),
avg_order_amount = scales::dollar(avg_order_amount)
) %>%
kable()
| customer_id | region | total_orders | total_spent | avg_order_amount |
|---|---|---|---|---|
| C002 | South | 3 | $385 | $128.33 |
| C007 | East | 1 | $310 | $310.00 |
| C001 | North | 2 | $300 | $150.00 |
| C005 | North | 2 | $270 | $135.00 |
| C006 | South | 2 | $220 | $110.00 |
# Calculate revenue by product category
product_revenue <- orders_with_region %>%
group_by(product_category) %>%
summarise(
total_orders = n(),
total_revenue = sum(amount),
avg_order_value = round(mean(amount), 2),
.groups = 'drop'
) %>%
arrange(desc(total_revenue))
# Calculate market share percentage
product_revenue <- product_revenue %>%
mutate(
revenue_share_pct = round((total_revenue / sum(total_revenue)) * 100, 1)
)
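A lightweight sanity check (a sketch, not part of the original output) confirms the rounded shares still sum to roughly 100%:
# Rounded percentages can drift slightly; tolerate up to half a point
stopifnot(abs(sum(product_revenue$revenue_share_pct) - 100) < 0.5)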
# Display top 3 categories
cat("=== TOP 3 PRODUCT CATEGORIES BY REVENUE ===\n\n")
## === TOP 3 PRODUCT CATEGORIES BY REVENUE ===
product_revenue %>%
head(3) %>%
mutate(
total_revenue = scales::dollar(total_revenue),
avg_order_value = scales::dollar(avg_order_value),
revenue_share = paste0(revenue_share_pct, "%")
) %>%
select(product_category, total_orders, total_revenue, avg_order_value, revenue_share) %>%
kable()
| product_category | total_orders | total_revenue | avg_order_value | revenue_share |
|---|---|---|---|---|
| Electronics | 6 | $1,190 | $198.33 | 54.3% |
| Clothing | 5 | $590 | $118.00 | 26.9% |
| Home | 4 | $410 | $102.50 | 18.7% |
# Key insight
cat("\n\nKey Product Insights:\n")
##
##
## Key Product Insights:
cat("• Top category:", product_revenue$product_category[1],
"with", scales::dollar(product_revenue$total_revenue[1]),
"(", product_revenue$revenue_share_pct[1], "% of total revenue)\n")
## • Top category: Electronics with $1,190 ( 54.3 % of total revenue)
# Calculate revenue by month
monthly_revenue <- orders_with_region %>%
group_by(order_month) %>%
summarise(
total_orders = n(),
total_revenue = sum(amount),
avg_order_value = round(mean(amount), 2),
.groups = 'drop'
) %>%
arrange(order_month)
# Display monthly trends
cat("=== MONTHLY REVENUE TRENDS ===\n\n")
## === MONTHLY REVENUE TRENDS ===
monthly_revenue %>%
mutate(
month = format(order_month, "%B %Y"),
total_revenue = scales::dollar(total_revenue),
avg_order_value = scales::dollar(avg_order_value)
) %>%
select(month, total_orders, total_revenue, avg_order_value) %>%
kable()
| month | total_orders | total_revenue | avg_order_value |
|---|---|---|---|
| February 2023 | 3 | $405 | $135.00 |
| March 2023 | 5 | $715 | $143.00 |
| April 2023 | 7 | $1,070 | $152.86 |
# Identify unusual months
cat("\n\nMonthly Trend Insights:\n")
##
##
## Monthly Trend Insights:
avg_monthly_revenue <- mean(monthly_revenue$total_revenue)
cat("• Average monthly revenue:", scales::dollar(round(avg_monthly_revenue, 2)), "\n\n")
## • Average monthly revenue: $730
# Highest month
max_month <- monthly_revenue %>% filter(total_revenue == max(total_revenue))
pct_above <- round((max_month$total_revenue / avg_monthly_revenue - 1) * 100, 1)
cat("• Highest revenue month:", format(max_month$order_month, "%B %Y"),
"with", scales::dollar(max_month$total_revenue),
"(", pct_above, "% above average)\n")
## • Highest revenue month: April 2023 with $1,070 ( 46.6 % above average)
# Lowest month
min_month <- monthly_revenue %>% filter(total_revenue == min(total_revenue))
pct_below <- round((1 - min_month$total_revenue / avg_monthly_revenue) * 100, 1)
cat("• Lowest revenue month:", format(min_month$order_month, "%B %Y"),
    "with", scales::dollar(min_month$total_revenue),
    "(", pct_below, "% below average)\n")
## • Lowest revenue month: February 2023 with $405 ( 44.5 % below average)
# Growth trend
cat("• Growth from Feb to April:",
scales::percent((max_month$total_revenue / min_month$total_revenue - 1)), "\n")
## • Growth from Feb to April: 164%
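To make the month-over-month momentum explicit, a small extension using dplyr::lag (a sketch; monthly_growth and mom_growth are new names introduced here):
monthly_growth <- monthly_revenue %>%
  mutate(mom_growth = total_revenue / lag(total_revenue) - 1)
# February is NA (no prior month); March is roughly +76.5%, April roughly +49.7%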
# Visualization 1: Revenue by Region
ggplot(revenue_by_region, aes(x = reorder(region, total_revenue), y = total_revenue, fill = region)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = scales::dollar(total_revenue)), hjust = -0.1, size = 4) +
coord_flip() +
scale_y_continuous(labels = scales::dollar_format(), expand = expansion(mult = c(0, 0.15))) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Total Revenue by Region",
subtitle = "South region leads in total revenue generation",
x = "Region",
y = "Total Revenue ($)"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 16),
panel.grid.major.y = element_blank()
)
# Visualization 2: Monthly Revenue Trend
ggplot(monthly_revenue, aes(x = order_month, y = total_revenue)) +
geom_line(color = "#2C3E50", linewidth = 1.2) +
geom_point(size = 3, color = "#E74C3C") +
geom_hline(yintercept = avg_monthly_revenue, linetype = "dashed", color = "#95A5A6", linewidth = 0.8) +
geom_text(aes(label = scales::dollar(total_revenue)), vjust = -1, size = 3.5) +
scale_y_continuous(labels = scales::dollar_format(), expand = expansion(mult = c(0.1, 0.2))) +
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
labs(
title = "Monthly Revenue Trend",
subtitle = "Strong upward growth momentum from February to April 2023",
x = "Month",
y = "Total Revenue ($)"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 16),
axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid.minor = element_blank()
)
# Visualization 3: Product Category Performance
ggplot(product_revenue, aes(x = reorder(product_category, total_revenue), y = total_revenue, fill = product_category)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = paste0(scales::dollar(total_revenue), "\n(", revenue_share_pct, "%)")),
hjust = -0.1, size = 3.5) +
coord_flip() +
scale_y_continuous(labels = scales::dollar_format(), expand = expansion(mult = c(0, 0.2))) +
scale_fill_brewer(palette = "Pastel1") +
labs(
title = "Revenue by Product Category",
subtitle = "Electronics leads with over half of total revenue",
x = "Product Category",
y = "Total Revenue ($)"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 16),
panel.grid.major.y = element_blank()
)
cat("=== SUMMARY OF FINDINGS ===\n\n")
## === SUMMARY OF FINDINGS ===
cat("REGIONAL PERFORMANCE:\n")
## REGIONAL PERFORMANCE:
cat("• South region leads with", scales::dollar(revenue_by_region$total_revenue[1]),
"representing", round(revenue_by_region$total_revenue[1]/sum(revenue_by_region$total_revenue)*100, 1),
"% of total revenue\n")
## • South region leads with $805 representing 36.8 % of total revenue
cat("• All four regions show order activity, indicating broad market penetration\n")
## • All four regions show order activity, indicating broad market penetration
cat("• East region has highest average order value at $260, suggesting premium customer segment\n\n")
## • East region has highest average order value at $260, suggesting premium customer segment
cat("TEMPORAL TRENDS:\n")
## TEMPORAL TRENDS:
cat("• Strong Q1 growth momentum: revenue increased 164% from February to April\n")
## • Strong Q1 growth momentum: revenue increased 164% from February to April
cat("• April generated", scales::dollar(max_month$total_revenue),
"— the strongest month, 47% above quarterly average\n")
## • April generated $1,070 — the strongest month, 47% above quarterly average
cat("• February started lower but established foundation for accelerating growth trend\n\n")
## • February started lower but established foundation for accelerating growth trend
cat("PRODUCT PERFORMANCE:\n")
## PRODUCT PERFORMANCE:
cat("• Electronics dominates with 54.3% market share (", scales::dollar(product_revenue$total_revenue[1]), ")\n")
## • Electronics dominates with 54.3% market share ( $1,190 )
cat("• Three-category portfolio shows diversification: Electronics, Clothing, Home\n")
## • Three-category portfolio shows diversification: Electronics, Clothing, Home
cat("• Electronics and Clothing together represent 81% of total revenue\n\n")
## • Electronics and Clothing together represent 81% of total revenue
cat("CUSTOMER BEHAVIOR:\n")
## CUSTOMER BEHAVIOR:
cat("• Average customer order value:", scales::dollar(round(mean(customer_metrics$avg_order_amount), 2)), "\n")
## • Average customer order value: $153.83
cat("• Customer spending patterns vary significantly, from $72.50 to $310 average order value\n")
## • Customer spending patterns vary significantly, from $72.50 to $310 average order value
cat("• Repeat purchase behavior observed in", sum(customer_metrics$total_orders >= 2), "out of 10 customers\n")
## • Repeat purchase behavior observed in 4 out of 10 customers
Issues Identified and Resolved:

- Customer IDs standardized to uppercase with str_to_upper() to ensure case-insensitive matching during joins
- Region values standardized to Title Case with str_to_title(), consolidating six variations into four clean categories
- Date fields parsed from text with lubridate::mdy() for accurate time-series analysis

This analysis demonstrates a systematic approach to data analytics: identifying data quality issues, documenting their impact, implementing standardized cleaning procedures, verifying results, and presenting findings through clear visualizations and metrics.
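Were this pipeline to be reused, the cleaning steps could be factored into a small helper; the sketch below is a hypothetical refactor (standardize_table is a name introduced here, covering the steps shared by both tables):
standardize_table <- function(df, date_col) {
  df %>%
    mutate(
      customer_id = str_to_upper(customer_id), # uppercase IDs for reliable joins
      across(all_of(date_col), mdy)            # parse the named text date column
    )
}
# e.g. standardize_table(orders, "order_date") reproduces the shared cleaning steps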
Leadership Questions Answered:
- Regional Performance: The South region generates the highest revenue ($805, 37% of total), making it the strongest candidate for expansion investment, though the East region shows the highest customer value ($260 average order)
- Product Strategy: Electronics and Clothing together drive 81% of revenue, clearly indicating inventory prioritization
- Seasonal Patterns: February's lower revenue ($405) is not a problem to fix; it is the baseline of strong upward momentum that accelerated from February through April, suggesting a natural growth trajectory rather than seasonal weakness
- Customer Insights: 40% of customers make repeat purchases, and spending patterns vary widely ($72.50 to $310 average order value), indicating opportunities for targeted customer segmentation strategies
What This Analysis Demonstrates:
This work showcases my ability to take messy real-world data, systematically clean and analyze it, and extract actionable business insights that directly support decision-making—exactly the analytical approach needed at Diversity Lab to turn legal talent data into strategies that advance fairness and equity.
Analysis completed by Tanvi Kadam for Diversity Lab Data Analyst Take-Home Assignment