Introduction

Olist is a Brazilian e-commerce startup founded in 2015 that connects small and medium-sized businesses to major online marketplaces, such as Amazon. By acting as an intermediary platform, Olist enables smaller retailers to access large digital marketplaces and expand their customer reach.

The dataset analyzed in this report contains information on approximately 100,000 orders placed between 2016 and 2018. The data is organized into multiple relational files, each offering a different analytical dimension of the business: • Customer dataset: Contains information about customers and their geographic locations. • Order items dataset: Includes details about the products within each order, such as price and freight value. • Payment dataset: Provides information on payment type and payment amount for each order. • Orders dataset: Serves as the core dataset, containing order-level information such as customer ID, order status, purchase timestamp, and delivery dates. • Products dataset: Includes product characteristics such as category and physical dimensions. • Product category translation dataset: Allows product categories to be translated from Portuguese to English.

While every order includes a purchase timestamp, not all orders contain a delivery timestamp. Specifically, 2,965 orders had a delivery date of NA because they were either canceled or still in process.

As a for-profit organization, Olist’s primary objectives include maximizing sales performance while maintaining operational efficiency. The following visualizations analyze both revenue generation and delivery effectiveness to assess overall business performance.

library(flexdashboard)
library(data.table)
library(lubridate)
library(sqldf)
library(ggplot2)
library(geobr)
library(dplyr)
library(sf)
library(stringr) 
library(scales)
library(ggrepel)
library(plotly)

# Set my working directory to where my data lives
setwd("U:/R Data Visualizations/archive (1)")

my_cache = "geobr_cache"
dir.create(my_cache, showWarnings = FALSE)
options(geobr.cache_dir = my_cache)

# Contains information about the customer and location with a customer_unique_id and customer_id to link to orders
customers = "olist_customers_dataset.csv"
customers_df = fread(customers)

# Contains details around product such as its category, weight, and size
products = "olist_products_dataset.csv"
products_df = fread(products)

# Contains details related to item value, its order, its seller and the product_id
order_items = "olist_order_items_dataset.csv"
items_df = fread(order_items)

# Contains details related to the payment of orders
order_payments = "olist_order_payments_dataset.csv"
payments_df = fread(order_payments)

# This is the main dataset for orders, showing orders, the customer, order status, and dates related to the order
orders = "olist_orders_dataset.csv"
orders_df = fread(orders)

# Contains translations for the product category
product_translation = "product_category_name_translation.csv"
translation_df = read.csv(product_translation)

Top 10 Customers by Order Quantity

The bar charts display the top 10 customers ranked by (1) total number of orders and (2) total money spent. Notably, there is no overlap between the customers appearing in these two charts.

This suggests the presence of distinct customer segments: • Customers who place frequent, low-value orders. • Customers who make infrequent but high-value purchases.

This distinction presents a strategic opportunity for targeted marketing segmentation. High-frequency, low-spend customers may be more price-sensitive and convenience-driven. Promotional campaigns, loyalty rewards, and bundle discount offers may effectively increase retention and lifetime value within this segment.

Conversely, high-spend customers likely prioritize product performance, return on investment, and quality. Personalized outreach, exclusive offers, or premium service engagement could strengthen loyalty among these customers.

Recognizing these behavioral differences allows Olist to allocate marketing resources more effectively and increase customer lifetime value through differentiated strategies.

# Merge tables, orders data with customer data
order_cust_df = sqldf("SELECT * FROM orders_df INNER JOIN customers_df USING (customer_id)")

# Merge Joined(order, customer) data with payment data
order_cust_payment_df = sqldf("SELECT * FROM order_cust_df INNER JOIN payments_df USING (order_id)")

# Create dataframe of customer order quantity for charts
customer_order_quantity_df = sqldf("SELECT customer_id, COUNT(*) AS total_orders FROM order_cust_payment_df GROUP BY customer_id ORDER BY total_orders DESC")


# Map customer IDs to a more manageable unique string
unique_ids = unique(order_cust_payment_df$customer_id)
ids = openxlsx::int2col(1:length(unique_ids))

id_map = data.frame(customer_id = unique_ids, mapped_id = ids)

cust_orders_mapped = customer_order_quantity_df %>%
  left_join(id_map, by = "customer_id")


ggplot(cust_orders_mapped[1:10,], aes(x = reorder(mapped_id, -total_orders), y = total_orders))+
  geom_bar(colour = "black", fill = "red", stat = "identity") +
  labs(title = "Top 10 Customers by Order Quantity", x = "Customer ID", y = "Total Orders") +
  geom_text(aes(label = total_orders), color = "black", vjust = -0.5) +
  theme(plot.title = element_text(hjust = 0.5))

Top 10 Customers by Total Spent

# Create dataframe of customer order payments for charts
customer_spent_df = sqldf("SELECT customer_id, SUM(payment_value) AS total_spent FROM order_cust_payment_df GROUP BY customer_id ORDER BY total_spent DESC")

# Map unique customer IDs to a more manageable ID that was created
cust_spent_mapped = customer_spent_df %>%
  left_join(id_map, by = "customer_id")

#Build a chart of top 10 customers by total spent
ggplot(cust_spent_mapped[1:10,], aes(x = reorder(mapped_id, -total_spent), y = total_spent)) +  
  geom_bar(colour = "black", fill = "red", stat = "identity") +
  labs(title = "Top 10 Customers by Total Spent", x = "Customer ID", y = "Total Spent") +
  scale_y_continuous(labels = comma) + 
  geom_text(aes(label = scales::dollar(total_spent)), color = "black", vjust = -0.5) + 
  theme(plot.title = element_text(hjust = 0.5))

Order Quantity by Order Date

The line charts illustrate monthly trends in total orders and total sales revenue. As expected, both charts follow similar trajectories.

In 2016, order volume and sales revenue were relatively low. Given that Olist was founded in 2015, this likely reflects the early growth stage of the company as it established marketplace partnerships and built brand recognition.

Throughout 2017, both metrics show steady and consistent growth, culminating in a peak in November 2017. During this month, Olist processed 7,872 orders, generating $1,197,318 in revenue. This spike likely corresponds with seasonal purchasing patterns, such as holiday shopping and promotional sales events.

Following November 2017, orders and revenue fluctuate but remain relatively stable until September 2018, when a sharp decline occurs. This drop aligns with the end of the dataset rather than indicating a true operational downturn.

Given the consistent upward trend from 2016 through mid-2018, Olist demonstrates strong revenue momentum. Strategically, the company could consider expanding into complementary revenue streams to further capitalize on its established growth trajectory.

# Create a data frame of revenues and orders over the years and months
orders_over_time = order_cust_payment_df %>%
  select(order_purchase_timestamp, payment_value) %>%
  mutate(date = as.Date(paste("01", month(ymd_hms(order_purchase_timestamp)), year(ymd_hms(order_purchase_timestamp))), format = '%d %m %Y')) %>%
  group_by(date) %>%
  summarise(total_revenue = sum(payment_value), total_orders = length(date), .groups = "keep") %>%
  data.frame()

# Build data frame of the high and low order counts from the orders over time data frame
hi_lo_orders = orders_over_time %>%
  filter(total_orders == min(total_orders) | total_orders == max(total_orders)) %>%
  data.frame()


# Create a line chart of orders over the years and months
ggplot(orders_over_time, aes(x = date, y = total_orders)) +
  geom_line(color = "red", linewidth = 1) +
  geom_point(shape = 21, size = 3, color = "black", fill = "white") + 
  labs(x = "Date of Order", y = "Total Orders", title = "Total Orders by Order Date") + 
  scale_y_continuous(labels = comma) + 
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  theme_gray() +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1)) +
  geom_point(data = hi_lo_orders, aes(x = date, y = total_orders), shape = 21, size = 4, color = "red", fill = "red") +
  geom_label_repel(aes(label = ifelse(total_orders == max(total_orders) | total_orders == min(total_orders), scales::comma(total_orders), "")),
                   box.padding = 1,
                   point.padding = 1,
                   size = 4,
                   color = "Grey50",
                   segment.color = "black")

Total Sales by Order Date

# Build data frame of the high and low revenues from the orders over time data frame
hi_lo_revenue = orders_over_time %>%
  filter(total_revenue == min(total_revenue) | total_revenue == max(total_revenue)) %>%
  data.frame()


# Create a line chart of revenues over the years and months
ggplot(orders_over_time, aes(x = date, y = total_revenue)) +
  geom_line(color = "red", linewidth = 1) +
  geom_point(shape = 21, size = 3, color = "black", fill = "white") + 
  labs(x = "Date of Order", y = "Total Sales", title = "Sales by Order Date") + 
  scale_y_continuous(labels = comma) + 
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  theme_gray() +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1)) +
  geom_point(data = hi_lo_revenue, aes(x = date, y = total_revenue), shape = 21, size = 4, color = "red", fill = "red") +
  geom_label_repel(aes(label = ifelse(total_revenue == max(total_revenue) | total_revenue == min(total_revenue), scales::dollar(total_revenue), "")),
                   box.padding = 1,
                   point.padding = 1,
                   size = 4,
                   color = "Grey50",
                   segment.color = "black")

Delivery Statuses over The Years

The nested pie chart presents order delivery status by year, categorizing orders as delivered early, on time, late, canceled, unavailable, or in process.

The majority of orders across all years were delivered earlier than the estimated delivery date. This performance likely contributes positively to customer satisfaction and brand loyalty.

Aside from 2016—which experienced a higher proportion of canceled or in-process orders, possibly due to early-stage logistical challenges—the second most common outcome in subsequent years was late delivery.

Although approximately 90% of deliveries are completed on time or early, the presence of late deliveries presents an opportunity for operational refinement. Olist should analyze supply chain bottlenecks, carrier performance, warehouse efficiency, and regional distribution challenges to reduce late deliveries and further improve customer experience.

# Change data formats to be used for calculations
orders_df$order_approved_at = as.Date(orders_df$order_approved_at)
orders_df$order_delivered_customer_date = as.Date(format(orders_df$order_delivered_customer_date, "%Y-%m-%d"))
orders_df$order_estimated_delivery_date = as.Date(orders_df$order_estimated_delivery_date)

# Create columns to show how many days order arrived before or after the estimated date (negative means early delivery), and to show how long it took for items to be delivered from when it was approved
orders_df$delivery_date_difference = as.numeric(orders_df$order_delivered_customer_date - orders_df$order_estimated_delivery_date, units = "days")
orders_df$approved_to_delivered = as.numeric(orders_df$order_delivered_customer_date - orders_df$order_approved_at, units = "days")

# Create data frame that includes a delivery status
# Early = delivered before estimated, late = delivered after estimated data, on time = delivered on estimated date
order_statuses = orders_df %>%
  mutate(early_late_ontime = ifelse(delivery_date_difference < 0, "Early",
                                    ifelse(delivery_date_difference == 0, "On Time",
                                           ifelse(delivery_date_difference > 0, "Late", NA)))) %>%
  mutate(early_late_ontime = ifelse(is.na(early_late_ontime), str_to_title(order_status), early_late_ontime)) %>%
  mutate(early_late_ontime = ifelse(early_late_ontime == "Delivered", "Delivered to Logistic Partner",
                                    ifelse(early_late_ontime %in% c("Approved", "Created", "Processing", "Shipped", "Invoiced"), "In Process", early_late_ontime))) %>%
  mutate(year = year(order_purchase_timestamp)) %>%
  group_by(early_late_ontime, year) %>%
  summarise(total_orders = length(early_late_ontime), .groups = "keep") %>%
  data.frame()

# Build a nested pie chart showing the delivery statuses over the years
plot_ly(hole = 0.7) %>%
  layout(title ="Delivery Status (2016 - 2018)") %>%
  add_trace(data = order_statuses[order_statuses$year == 2018, ],
            labels = ~early_late_ontime,
            values = ~order_statuses[order_statuses$year == 2018, "total_orders"],
            type = "pie",
            textposition = "inside",
            hovertemplate = "Year: 2018<br>Order Status: %{label}<br>Percent: %{percent}<br>Order Count: %{value}<extra></extra>") %>%
  add_trace(data = order_statuses[order_statuses$year == 2017,],
            labels = ~early_late_ontime,
            values = ~order_statuses[order_statuses$year == 2017, "total_orders"],
            type = "pie",
            textposition = "inside",
            hovertemplate = "Year: 2017<br>Order Status: %{label}<br>Percent: %{percent}<br>Order Count: %{value}<extra></extra>",
            domain = list(
              x = c(0.16, 0.84),
              y = c(0.16, 0.84))) %>%
  add_trace(data = order_statuses[order_statuses$year == 2016,],
            labels = ~early_late_ontime,
            values = ~order_statuses[order_statuses$year == 2016, "total_orders"],
            type = "pie",
            textposition = "inside",
            hovertemplate = "Year: 2016<br>Order Status: %{label}<br>Percent: %{percent}<br>Order Count: %{value}<extra></extra>",
            domain = list(
              x = c(0.27, 0.73),
              y = c(0.27, 0.73)))

Orders and Revenue by Product Category

This visualization examines the top 15 product categories by total orders and sales across each year, grouping remaining categories into an “Other” category.

As seen in previous charts, 2016 accounts for a relatively small proportion of total activity. Growth in 2017 and 2018 is visible across most major categories.

Several categories show continued growth into 2018, while categories such as “Toys” and “Cool Stuff” experienced stronger performance in 2017 than in 2018. This may indicate: • Shifting consumer preferences • Increased competition • Reduced promotional emphasis

Further demographic analysis could help determine whether specific customer segments drive purchases in these categories. Targeted promotional campaigns may help reinvigorate underperforming segments.

# Link multiple data frames together to get product categories associated with orders
product_details_start = sqldf("WITH prod_details AS (WITH prod_orders AS (WITH ord_items AS (SELECT * FROM orders_df JOIN items_df USING (order_id))
                  SELECT * FROM ord_items JOIN products_df USING (product_id))
                  SELECT * FROM prod_orders JOIN translation_df USING (product_category_name))
                  SELECT * FROM prod_details JOIN payments_df USING (order_id)")


# Create a data frame that calculates the revenue and order count for product categories by year
product_details = product_details_start %>%
  select(order_purchase_timestamp, payment_value, product_category_name_english) %>%
  mutate(year = year(order_purchase_timestamp), product_category_name_english = str_to_title(gsub("_", " ", product_category_name_english))) %>%
  group_by(product_category_name_english, year) %>%
  summarise(total_revenue = sum(payment_value), total_orders = length(product_category_name_english), .groups = "keep") %>%
  data.frame()

# Get the top 15 product categories by revenue
top_products = product_details %>%
  group_by(product_category_name_english) %>%
  summarise(total_revenue = sum(total_revenue))

top_15_products = top_products[order(top_products$total_revenue, decreasing = TRUE), ][1:15,]

# Any product that is not in top 15 becomes "Other"
product_details_other = product_details %>%
  mutate(product_category_name_english = ifelse(product_category_name_english %in% top_15_products$product_category_name_english, product_category_name_english, "Other")) %>%
  group_by(product_category_name_english, year) %>%
  summarise(total_revenue = sum(total_revenue), total_orders = sum(total_orders), .groups = "keep") %>%
  data.frame()

product_details_other$year = as.character(product_details_other$year)

# Separate the revenue data for product categories to use for a line chart
product_cat_revenue = product_details_other %>%
  select(product_category_name_english, total_revenue) %>%
  group_by(product_category_name_english) %>%
  summarise(total_revenue = sum(total_revenue)) %>%
  data.frame()

# Build sequence for second axis for line chart
ylab = seq(0, max(product_cat_revenue$total_revenue)/1000, 500)
my_labs = paste0("$", ylab, "K")


# Build stacked bar chart of total orders by year for each product category. Line chart shows total revenue for each product category
ggplot(product_details_other, aes(x = reorder(product_category_name_english, total_orders, sum), y = total_orders, fill = year)) +
  geom_bar(stat = "identity", position = position_stack(reverse = TRUE)) +
  coord_flip() + 
  theme_light() +
  labs(title = "Orders and Revenue by Product Category", y = "Total Orders", x = "Product Category", fill = "Year") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Reds") +
  geom_line(inherit.aes = FALSE, data = product_cat_revenue, aes(x = product_category_name_english, y = total_revenue/50, colour = "Total Revenue", group = 1), 
            linewidth = 1) +
  scale_color_manual(NULL, values = "black") + 
  scale_y_continuous(labels = comma, sec.axis = sec_axis(~. * 50, name = "Total Revenue", labels = my_labs, breaks = ylab * 1000)) +
  geom_point(inherit.aes = FALSE, data = product_cat_revenue, aes(x = product_category_name_english, y = total_revenue/50, group = 1), 
             size = 3, shape = 21, fill = "white")

Sales by Brazilian States

The heatmap illustrates total sales by Brazilian state, with darker shading indicating higher revenue concentration. São Paulo accounts for the largest share of Olist’s total sales, generating over $5 million in revenue. Neighboring southeastern states also show moderate sales activity, while northern and northwestern states demonstrate relatively low revenue contributions.

This geographic concentration suggests that Olist’s market penetration is strongest in Brazil’s economically developed southeastern region. The limited presence in other regions raises important strategic questions: • Is marketplace integration less developed in northern Brazil? • Are logistics networks less efficient in these regions? • Is marketing penetration lower outside the southeast?

If Olist’s infrastructure and partnerships are underdeveloped in these areas, expansion could represent a significant growth opportunity. Geographic diversification may reduce revenue concentration risk while unlocking untapped demand in emerging regional markets.

state_payments = sqldf("SELECT customer_state AS abbrev_state, SUM(payment_value) AS total_revenue FROM order_cust_payment_df GROUP BY customer_state") 

# Read in Brazil states and their geographical data to plot on a map
# brazil_states = read_state(year = 2020)

# JOIN the Brazilian state geographical data with the profit state data for plotting
# brazil_states = brazil_states %>%
#  left_join(state_payments, by = c("abbrev_state" = "abbrev_state"))

# Build a heatmap of sales within Brazilian states, red indicates higher profits
# ggplot(data = brazil_states) +
#  geom_sf(aes(fill = total_revenue)) +
#  scale_fill_gradient(low = "beige", high = "red", labels = scales::dollar) + 
#  geom_sf_text(aes(label = abbrev_state), size = 3, color = "black", fontface = "bold", check_overlap = TRUE) +
#  theme_minimal() +
#  labs(title = "Sales by Brazilian States", fill = "Total Sales") +
#  scale_y_continuous(labels = comma) + 
#  theme_void()

Conclusion

The analysis of Olist’s transactional data reveals a company experiencing strong growth momentum from 2016 through 2018. Sales and order volume increased steadily, customer delivery performance is largely effective, and geographic concentration highlights both strength and opportunity.

Key strategic insights include: 1. Clear customer segmentation between high-frequency and high-value buyers. 2. Strong operational performance, with approximately 90% of deliveries completed on time or early. 3. Revenue concentration in southeastern Brazil, particularly São Paulo, indicating opportunity for regional expansion. 4. Product category variation suggesting potential for targeted promotional strategies.