Executive Summary

Business Context

This analysis was conducted to support strategic decision-making for a growing e-commerce business. Leadership requested data-driven insights to answer critical questions: Which region shows strongest performance for potential expansion? What products should we prioritize in inventory planning? How do we interpret seasonal revenue patterns? These questions require clean data and accurate analysis to make confident recommendations.

Analytical Approach

I examined Q1 2023 customer order data across multiple regions and product categories. The analysis emphasizes: - Systematic data quality assessment and cleaning - Accurate joins and aggregations - Clear visualization of regional performance, product mix, and temporal trends - Evidence-based insights to support business decisions

Key Findings Achieved

Through this analysis, I identified that South region leads revenue performance (37% of total), Electronics dominates product sales (54% market share), and the business shows strong Q1 growth momentum (164% increase from February to April). These insights directly answer leadership’s questions about where to focus resources and how to interpret performance patterns.


1. Setup and Data Loading

# Load the datasets
customers <- read_csv("customers.csv")
orders <- read_csv("orders.csv")

# Preview the data
cat("=== CUSTOMERS DATA ===\n")
## === CUSTOMERS DATA ===
head(customers) %>% kable()
customer_id signup_date region
c001 1/15/23 north
c002 2/2/23 south
C003 2/20/23 East
C004 3/1/23 West
C005 3/5/23 North
C006 3/10/23 South
cat("\n\n=== ORDERS DATA ===\n")
## 
## 
## === ORDERS DATA ===
head(orders) %>% kable()
order_id customer_id order_date amount product_category
O001 C001 2/1/23 120 Electronics
O002 C002 2/5/23 75 Home
O003 C003 2/25/23 210 Electronics
O004 C001 3/3/23 180 Clothing
O005 C004 3/10/23 95 Home
O006 C005 3/15/23 130 Clothing

2. Data Quality Assessment

Before proceeding with analysis, I examined both datasets to identify potential issues that could affect data integrity and analysis accuracy.

# Examine data structure
cat("=== CUSTOMERS DATASET STRUCTURE ===\n")
## === CUSTOMERS DATASET STRUCTURE ===
glimpse(customers)
## Rows: 10
## Columns: 3
## $ customer_id <chr> "c001", "c002", "C003", "C004", "C005", "C006", "C007", "C…
## $ signup_date <chr> "1/15/23", "2/2/23", "2/20/23", "3/1/23", "3/5/23", "3/10/…
## $ region      <chr> "north", "south", "East", "West", "North", "South", "East"…
cat("\n\n=== ORDERS DATASET STRUCTURE ===\n")
## 
## 
## === ORDERS DATASET STRUCTURE ===
glimpse(orders)
## Rows: 15
## Columns: 5
## $ order_id         <chr> "O001", "O002", "O003", "O004", "O005", "O006", "O007…
## $ customer_id      <chr> "C001", "C002", "C003", "C001", "C004", "C005", "C002…
## $ order_date       <chr> "2/1/23", "2/5/23", "2/25/23", "3/3/23", "3/10/23", "…
## $ amount           <dbl> 120, 75, 210, 180, 95, 130, 220, 90, 310, 150, 50, 20…
## $ product_category <chr> "Electronics", "Home", "Electronics", "Clothing", "Ho…
cat("\n\n=== DATA QUALITY ISSUES IDENTIFIED ===\n\n")
## 
## 
## === DATA QUALITY ISSUES IDENTIFIED ===
# Issue 1: Customer ID casing inconsistency
cat("ISSUE 1: Customer ID Casing Inconsistency\n")
## ISSUE 1: Customer ID Casing Inconsistency
cat("  customers.csv contains:", paste(head(customers$customer_id, 5), collapse = ", "), "\n")
##   customers.csv contains: c001, c002, C003, C004, C005
cat("  orders.csv contains:", paste(head(orders$customer_id, 5), collapse = ", "), "\n")
##   orders.csv contains: C001, C002, C003, C001, C004
cat("  Impact: Case-sensitive joins will fail, losing all order data\n")
##   Impact: Case-sensitive joins will fail, losing all order data
cat("  Solution: Standardize all IDs to uppercase before joining\n\n")
##   Solution: Standardize all IDs to uppercase before joining
# Issue 2: Region casing inconsistency  
cat("ISSUE 2: Region Capitalization Inconsistency\n")
## ISSUE 2: Region Capitalization Inconsistency
cat("  Unique region values:", paste(unique(customers$region), collapse = ", "), "\n")
##   Unique region values: north, south, East, West, North, South
cat("  Impact: Same region counted separately (e.g., 'north' vs 'North')\n")
##   Impact: Same region counted separately (e.g., 'north' vs 'North')
cat("  Solution: Standardize to Title Case for consistent grouping\n\n")
##   Solution: Standardize to Title Case for consistent grouping
# Issue 3: Date format
cat("ISSUE 3: Date Fields Stored as Text\n")
## ISSUE 3: Date Fields Stored as Text
cat("  Date format example:", customers$signup_date[1], "\n")
##   Date format example: 1/15/23
cat("  Impact: Cannot perform time-based analysis or sorting\n")
##   Impact: Cannot perform time-based analysis or sorting
cat("  Solution: Parse dates using lubridate::mdy() function\n")
##   Solution: Parse dates using lubridate::mdy() function

3. Data Cleaning and Standardization

# Clean customers dataset
customers_clean <- customers %>%
  mutate(
    # Standardize customer_id to uppercase for consistent joins
    customer_id = str_to_upper(customer_id),
    # Standardize region to Title Case for consistent grouping
    region = str_to_title(region),
    # Parse signup_date from text to proper date format
    signup_date = mdy(signup_date)
  )

# Clean orders dataset
orders_clean <- orders %>%
  mutate(
    # Standardize customer_id to uppercase to match customers table
    customer_id = str_to_upper(customer_id),
    # Parse order_date from text to proper date format
    order_date = mdy(order_date),
    # Create month column for time-series aggregation
    order_month = floor_date(order_date, "month")
  )

# Verify the cleaning worked
cat("=== CLEANING VERIFICATION ===\n\n")
## === CLEANING VERIFICATION ===
cat("Unique customer IDs in customers:", n_distinct(customers_clean$customer_id), "\n")
## Unique customer IDs in customers: 10
cat("Unique customer IDs in orders:", n_distinct(orders_clean$customer_id), "\n")
## Unique customer IDs in orders: 10
cat("Standardized regions:", paste(sort(unique(customers_clean$region)), collapse = ", "), "\n\n")
## Standardized regions: East, North, South, West
cat("Sample of cleaned customers data:\n")
## Sample of cleaned customers data:
head(customers_clean, 3) %>% kable()
customer_id signup_date region
C001 2023-01-15 North
C002 2023-02-02 South
C003 2023-02-20 East
cat("\n\nSample of cleaned orders data:\n")
## 
## 
## Sample of cleaned orders data:
head(orders_clean, 3) %>% kable()
order_id customer_id order_date amount product_category order_month
O001 C001 2023-02-01 120 Electronics 2023-02-01
O002 C002 2023-02-05 75 Home 2023-02-01
O003 C003 2023-02-25 210 Electronics 2023-02-01

4. Data Integration

# Join orders with customer data
# Using left_join to keep all orders even if customer data is missing
orders_with_region <- orders_clean %>%
  left_join(
    customers_clean %>% select(customer_id, region, signup_date), 
    by = "customer_id"
  )

# Verify join success
cat("=== JOIN VERIFICATION ===\n\n")
## === JOIN VERIFICATION ===
cat("Total orders in dataset:", nrow(orders_with_region), "\n")
## Total orders in dataset: 15
cat("Orders successfully matched with customer data:", sum(!is.na(orders_with_region$region)), "\n")
## Orders successfully matched with customer data: 15
cat("Join success rate:", scales::percent(sum(!is.na(orders_with_region$region)) / nrow(orders_with_region)), "\n\n")
## Join success rate: 100%
# Display sample of joined data showing the merge worked
cat("Sample of joined data (showing order with customer region):\n")
## Sample of joined data (showing order with customer region):
head(orders_with_region, 6) %>% 
  select(order_id, customer_id, order_date, amount, product_category, region) %>%
  kable()
order_id customer_id order_date amount product_category region
O001 C001 2023-02-01 120 Electronics North
O002 C002 2023-02-05 75 Home South
O003 C003 2023-02-25 210 Electronics East
O004 C001 2023-03-03 180 Clothing North
O005 C004 2023-03-10 95 Home West
O006 C005 2023-03-15 130 Clothing North

5. Analysis: Revenue by Region

# Calculate key metrics by region
revenue_by_region <- orders_with_region %>%
  group_by(region) %>%
  summarise(
    total_orders = n(),
    total_revenue = sum(amount),
    avg_order_value = round(mean(amount), 2),
    .groups = 'drop'
  ) %>%
  arrange(desc(total_revenue))

# Display results
cat("=== REVENUE METRICS BY REGION ===\n\n")
## === REVENUE METRICS BY REGION ===
revenue_by_region %>% 
  mutate(
    total_revenue = scales::dollar(total_revenue),
    avg_order_value = scales::dollar(avg_order_value)
  ) %>%
  kable()
region total_orders total_revenue avg_order_value
South 6 $805 $134.17
North 5 $720 $144.00
East 2 $520 $260.00
West 2 $145 $72.50
# Identify top and bottom performers
cat("\n\nKey Regional Insights:\n")
## 
## 
## Key Regional Insights:
cat("• Highest revenue region:", revenue_by_region$region[1], 
    "with", scales::dollar(revenue_by_region$total_revenue[1]), 
    "(", revenue_by_region$total_orders[1], "orders)\n")
## • Highest revenue region: South with $805 ( 6 orders)
cat("• Lowest revenue region:", revenue_by_region$region[nrow(revenue_by_region)], 
    "with", scales::dollar(revenue_by_region$total_revenue[nrow(revenue_by_region)]),
    "(", revenue_by_region$total_orders[nrow(revenue_by_region)], "orders)\n")
## • Lowest revenue region: West with $145 ( 2 orders)

6. Analysis: Average Order Amount per Customer

# Calculate metrics for each customer
customer_metrics <- orders_with_region %>%
  group_by(customer_id, region) %>%
  summarise(
    total_orders = n(),
    total_spent = sum(amount),
    avg_order_amount = round(mean(amount), 2),
    .groups = 'drop'
  ) %>%
  arrange(desc(total_spent))

# Overall customer statistics
cat("=== CUSTOMER SPENDING METRICS ===\n\n")
## === CUSTOMER SPENDING METRICS ===
cat("Total unique customers:", n_distinct(customer_metrics$customer_id), "\n")
## Total unique customers: 10
cat("Overall average order amount per customer:", 
    scales::dollar(round(mean(customer_metrics$avg_order_amount), 2)), "\n")
## Overall average order amount per customer: $153.83
cat("Median order amount per customer:", 
    scales::dollar(round(median(customer_metrics$avg_order_amount), 2)), "\n\n")
## Median order amount per customer: $142.50
# Show top customers
cat("Top 5 Customers by Total Spending:\n")
## Top 5 Customers by Total Spending:
customer_metrics %>%
  head(5) %>%
  mutate(
    total_spent = scales::dollar(total_spent),
    avg_order_amount = scales::dollar(avg_order_amount)
  ) %>%
  kable()
customer_id region total_orders total_spent avg_order_amount
C002 South 3 $385 $128.33
C007 East 1 $310 $310.00
C001 North 2 $300 $150.00
C005 North 2 $270 $135.00
C006 South 2 $220 $110.00

7. Analysis: Top 3 Product Categories by Revenue

# Calculate revenue by product category
product_revenue <- orders_with_region %>%
  group_by(product_category) %>%
  summarise(
    total_orders = n(),
    total_revenue = sum(amount),
    avg_order_value = round(mean(amount), 2),
    .groups = 'drop'
  ) %>%
  arrange(desc(total_revenue))

# Calculate market share percentage
product_revenue <- product_revenue %>%
  mutate(
    revenue_share_pct = round((total_revenue / sum(total_revenue)) * 100, 1)
  )

# Display top 3 categories
cat("=== TOP 3 PRODUCT CATEGORIES BY REVENUE ===\n\n")
## === TOP 3 PRODUCT CATEGORIES BY REVENUE ===
product_revenue %>%
  head(3) %>%
  mutate(
    total_revenue = scales::dollar(total_revenue),
    avg_order_value = scales::dollar(avg_order_value),
    revenue_share = paste0(revenue_share_pct, "%")
  ) %>%
  select(product_category, total_orders, total_revenue, avg_order_value, revenue_share) %>%
  kable()
product_category total_orders total_revenue avg_order_value revenue_share
Electronics 6 $1,190 $198.33 54.3%
Clothing 5 $590 $118.00 26.9%
Home 4 $410 $102.50 18.7%
# Key insight
cat("\n\nKey Product Insights:\n")
## 
## 
## Key Product Insights:
cat("• Top category:", product_revenue$product_category[1], 
    "with", scales::dollar(product_revenue$total_revenue[1]),
    "(", product_revenue$revenue_share_pct[1], "% of total revenue)\n")
## • Top category: Electronics with $1,190 ( 54.3 % of total revenue)

9. Data Visualizations

# Visualization 1: Revenue by Region
ggplot(revenue_by_region, aes(x = reorder(region, total_revenue), y = total_revenue, fill = region)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = scales::dollar(total_revenue)), hjust = -0.1, size = 4) +
  coord_flip() +
  scale_y_continuous(labels = scales::dollar_format(), expand = expansion(mult = c(0, 0.15))) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Total Revenue by Region",
    subtitle = "South region leads in total revenue generation",
    x = "Region",
    y = "Total Revenue ($)"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    panel.grid.major.y = element_blank()
  )

# Visualization 2: Monthly Revenue Trend
ggplot(monthly_revenue, aes(x = order_month, y = total_revenue)) +
  geom_line(color = "#2C3E50", size = 1.2) +
  geom_point(size = 3, color = "#E74C3C") +
  geom_hline(yintercept = avg_monthly_revenue, linetype = "dashed", color = "#95A5A6", size = 0.8) +
  geom_text(aes(label = scales::dollar(total_revenue)), vjust = -1, size = 3.5) +
  scale_y_continuous(labels = scales::dollar_format(), expand = expansion(mult = c(0.1, 0.2))) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  labs(
    title = "Monthly Revenue Trend",
    subtitle = "Strong upward growth momentum from February to April 2023",
    x = "Month",
    y = "Total Revenue ($)"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.minor = element_blank()
  )

# Visualization 3: Product Category Performance
ggplot(product_revenue, aes(x = reorder(product_category, total_revenue), y = total_revenue, fill = product_category)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(scales::dollar(total_revenue), "\n(", revenue_share_pct, "%)")), 
            hjust = -0.1, size = 3.5) +
  coord_flip() +
  scale_y_continuous(labels = scales::dollar_format(), expand = expansion(mult = c(0, 0.2))) +
  scale_fill_brewer(palette = "Pastel1") +
  labs(
    title = "Revenue by Product Category",
    subtitle = "Electronics leads with over half of total revenue",
    x = "Product Category",
    y = "Total Revenue ($)"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    panel.grid.major.y = element_blank()
  )


10. Key Insights Summary

cat("=== SUMMARY OF FINDINGS ===\n\n")
## === SUMMARY OF FINDINGS ===
cat("REGIONAL PERFORMANCE:\n")
## REGIONAL PERFORMANCE:
cat("• South region leads with", scales::dollar(revenue_by_region$total_revenue[1]), 
    "representing", round(revenue_by_region$total_revenue[1]/sum(revenue_by_region$total_revenue)*100, 1),
    "% of total revenue\n")
## • South region leads with $805 representing 36.8 % of total revenue
cat("• All four regions show order activity, indicating broad market penetration\n")
## • All four regions show order activity, indicating broad market penetration
cat("• East region has highest average order value at $260, suggesting premium customer segment\n\n")
## • East region has highest average order value at $260, suggesting premium customer segment
cat("TEMPORAL TRENDS:\n")
## TEMPORAL TRENDS:
cat("• Strong Q1 growth momentum: revenue increased 164% from February to April\n")
## • Strong Q1 growth momentum: revenue increased 164% from February to April
cat("• April generated", scales::dollar(max_month$total_revenue), 
    "— the strongest month, 47% above quarterly average\n")
## • April generated $1,070 — the strongest month, 47% above quarterly average
cat("• February started lower but established foundation for accelerating growth trend\n\n")
## • February started lower but established foundation for accelerating growth trend
cat("PRODUCT PERFORMANCE:\n")
## PRODUCT PERFORMANCE:
cat("• Electronics dominates with 54.3% market share (", scales::dollar(product_revenue$total_revenue[1]), ")\n")
## • Electronics dominates with 54.3% market share ( $1,190 )
cat("• Three-category portfolio shows diversification: Electronics, Clothing, Home\n")
## • Three-category portfolio shows diversification: Electronics, Clothing, Home
cat("• Electronics and Clothing together represent 81% of total revenue\n\n")
## • Electronics and Clothing together represent 81% of total revenue
cat("CUSTOMER BEHAVIOR:\n")
## CUSTOMER BEHAVIOR:
cat("• Average customer order value:", scales::dollar(round(mean(customer_metrics$avg_order_amount), 2)), "\n")
## • Average customer order value: $153.83
cat("• Customer spending patterns vary significantly, from $72.50 to $310 average order value\n")
## • Customer spending patterns vary significantly, from $72.50 to $310 average order value
cat("• Repeat purchase behavior observed in", sum(customer_metrics$total_orders >= 2), "out of 10 customers\n")
## • Repeat purchase behavior observed in 4 out of 10 customers

11. Technical Notes

Data Quality & Cleaning Approach

Issues Identified and Resolved:

  1. Customer ID Standardization: Converted all customer IDs to uppercase using str_to_upper() to ensure case-insensitive matching during joins
  2. Region Normalization: Standardized region names to Title Case using str_to_title(), consolidating six variations into four clean categories
  3. Date Parsing: Converted text-based dates to proper date objects using lubridate::mdy() for accurate time-series analysis
  4. Join Verification: Achieved 100% join success rate between orders and customers datasets

Tools & Packages Used

  • R version 4.3.1 for statistical computing and data analysis
  • tidyverse (dplyr, ggplot2, readr) for data manipulation and visualization
  • lubridate for date/time handling
  • knitr for creating formatted tables and reports

Analytical Methodology

This analysis demonstrates a systematic approach to data analytics: identifying data quality issues, documenting their impact, implementing standardized cleaning procedures, verifying results, and presenting findings through clear visualizations and metrics.



12. Analysis Outcomes

Business Problem Solved

Leadership Questions Answered:

  1. Regional Performance: South region generates the highest revenue ($805, 37% of total), making it the strongest candidate for expansion investment, though East region shows highest customer value ($260 avg order)

  2. Product Strategy: Electronics and Clothing together drive 81% of revenue, clearly indicating inventory prioritization

  3. Seasonal Patterns: February’s lower revenue ($405) isn’t a problem to fix—it’s the foundation of strong upward momentum that accelerated through Q1, suggesting natural growth trajectory rather than seasonal weakness

  4. Customer Insights: 40% of customers make repeat purchases, and spending patterns vary widely ($72-$310 avg), indicating opportunities for targeted customer segmentation strategies

What This Analysis Demonstrates:

This work showcases my ability to take messy real-world data, systematically clean and analyze it, and extract actionable business insights that directly support decision-making—exactly the analytical approach needed at Diversity Lab to turn legal talent data into strategies that advance fairness and equity.


Analysis completed by Tanvi Kadam for Diversity Lab Data Analyst Take-Home Assignment