1 Background

1.1 Brief

Sales analysis is a method to examine how well some products are selling, especially compared to its goal or target. It provides important insights about the top-selling and under-performing products, market trend, most profitable customers, and many others. Here, I am going to conduct a sales analysis on the E-Commerce Data obtained from Kaggle. It consists of sales details from Indian e-commerce website.

1.2 Libraries and Setup

These following packages are required in this notebook. Use install.packages() to install any packages that are not already downloaded and load them using library() function. I provided a brief explanation about their function.

dplyr: data manipulation
lubridate: handling datetime data
glue: interpolate data into strings
ggplot2: visualizing data
scales: scaling in visualization
ggrepel: labeling in visualization
plotly: interactive visualization

library(dplyr)
library(lubridate)
library(glue)
library(ggplot2)
library(scales)
library(ggrepel)
library(plotly)

2 Data Preparation

2.1 Data Inspection

The data is split into 2 csv files:

orders: contains purchase information (order ID, order date, and customer details)
details: contains the price, quantity, profit, product category and subcategory in every order ID

for naming conventions, I will change the name of each column.

orders <- read.csv("data_input/orders.csv")
names(orders) <- c("order.id", "order.date", "customer.name", "state", "city")

# replace empty values with NA
orders[orders == ""] <- NA

head(orders)

Some information about the features in orders:

order.id: ID of the purchase order
order.date: Date of purchase
customer.name: Name of the customer
state: State of the customer
city: City of the customer

details <- read.csv("data_input/details.csv")
names(details) <- c("order.id", "amount", "profit", "quantity", "category", "sub.category")
head(details)

Some information about the features in details:

order.id: ID of the purchase order
amount: Price of the order
profit : Profit made by the purchase
quantity: Quantity of purchase
category: Category of product
sub.category: Sub-category the product belongs

We can see that some columns have incorrect data types, we will fix it later on, now let’s check if there is any missing value in the dataset.

2.2 Data Wrangling

2.2.1 Missing Values

anyNA(details)

## [1] FALSE

anyNA(orders)

## [1] TRUE

there exist some missing values in orders data, let’s see how many missing values there are in the data.

colSums(is.na(orders))

##      order.id    order.date customer.name         state          city 
##            60            60            60            60            60

na_rows <- rowSums(is.na(orders))
orders[which(na_rows > 0), ]

The orders data should be connected with details by their order.id. There are 60 empty rows in orders data, which means it has 500 list of orders. Let’s check how many unique order.id there are in details.

length(unique(details$order.id))

## [1] 500

Since details contains 500 unique order.id as well, we can just drop the missing values in the orders data. They are only empty rows, our data is complete.

orders <- na.omit(orders)
orders

2.2.2 Incorrect Data Types

Each dataframe has their own incorrect data types. Here are the following variables that need to be converted:

order.date: date
state: categorical
city: categorical
category: categorical
sub.category: categorical

So, converting data types for the order, details and target data.

# List of Orders
orders <- orders %>% 
  mutate(order.date = dmy(order.date),
         state = as.factor(state),
         city = as.factor(city))

# Order Details
details[, c("category", "sub.category")] <- lapply(details[, c("category", "sub.category")],
                                                   FUN = as.factor)

We are done with the data wrangling. Our data is ready to be analysed.

3 Sales Analysis

3.1 Sales Trend

The first thing I would like to analyse is the sales trend. We will review the historical revenue to detect whether there are patterns in our data or not. In business, trend analysis allows to predict the future market. It may also provides some important information. Let’s see the monthly profitability in our data. Firstly, we will create a new dataframe profit from details to see the total profit for each order ID, and then merge it with orders to get the order date.

profit <- details %>% 
  group_by(order.id) %>% 
  summarise(total.amount = sum(amount),
            total.profit = sum(profit),
            total.quantity = sum(quantity)) %>% 
  merge(orders)

head(profit)

We want to see the monthly profit. So we will need to reconstruct the dataframe. Let’s first create a new dataframe monthly_sales which contains monthly sales data.

monthly_sales <- profit %>% 
  mutate(order.month = month(order.date, label = T),
         order.year = year(order.date)) %>% 
  group_by(order.year, order.month) %>% 
  summarise(total.amount = sum(total.amount),
            total.profit = sum(total.profit),
            total.quantity = sum(total.quantity)) %>% 
  mutate(order.month.year = factor(paste(order.month, order.year))) %>% 
  ungroup() %>% 
  select(-c(order.year, order.month))

monthly_sales

Visualizing the monthly profit from April 2018 to March 2019.

# Lock the order of order.month.year
monthly_sales$order.month.year <- factor(monthly_sales$order.month.year,
                                         levels = monthly_sales$order.month.year)

# Visualization
monthly_profit_plot <- monthly_sales %>% 
  ggplot(mapping = aes(x = order.month.year,
                       y = total.profit,
                       text = glue("Profit in {order.month.year}: {format(total.profit, big.mark = ',')}"))) +
  geom_col(aes(fill = total.profit > 0), color = "#002433") +
  scale_fill_manual(values = c("#cd5c5c", "#4a708b")) +
  labs(title = "Monthly Sales Profit",
       y = "Total Profit") +
  scale_x_discrete(labels = wrap_format(5)) +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        legend.position = "none",
        axis.title.x = element_blank())

ggplotly(monthly_profit_plot, tooltip = "text")

We can see that from April 2018 until September 2018, the e-commerce sales experienced losses. But, since October 2018, the e-commerce sales made quite a big profit, especially in November 2018. Just by looking at the graph, we can tell that the profit made can easily covers all the losses it previously had. Also, let’s visualize the monthly quantity sales and check whether it correlates with the profit.

monthly_sales_plot <- monthly_sales %>% 
  ggplot(mapping = aes(x = order.month.year,
                       y = total.quantity,
                       text = glue("Product sold in {gsub('\n', ' ', order.month.year)}: {format(total.quantity, big.mark = ',')}"))) +
  geom_line(group = 1, color = "#4a708b", size = 1) +
  geom_point() +
  labs(title = "Monthly Sales",
       y = "Quantity Sold") +
  scale_x_discrete(labels = wrap_format(5)) +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        legend.position = "none",
        axis.title.x = element_blank())

ggplotly(monthly_sales_plot, tooltip = "text")

From both the profit and sales plot, we can gain an important information. Since October 2018, the e-commerce industry made quite a big profit, and the number of quantity sold increases significantly (although it fluctuates) until it reached the peak on March 2019. From business perspective, it may give an insight that there is a change in consumer behavior, with a great shift towards online shopping. Additionally, it seems that the number of product sold does not guarantee the profit earned. For instances, the highest loss occurs in June 2018, although the number of lowest products sold is in July 2018. Let’s check the relationship between quantity sold and profit earned.

3.2 Quantity and Profit Relationship

Let’s visualize a scatterplot to see the relationship between quantity and profit.

details %>% 
  ggplot(mapping = aes(x = quantity,
                       y = profit)) +
  geom_jitter() +
  geom_smooth(color = "red") +
  labs(title = "Quantity Sold and Profit Relationship",
       x = "Quantity Sold",
       y = "Profit") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
      axis.title.y = element_text(size = 12, face = "bold"),
      axis.title.x = element_text(size = 12, face = "bold"))

From the scatterplot, it seems that the quantity sold does not correlate with the profit. This means that making a lot of sales does not guarantee that the seller will generate more profit. Another factor that may influence profit is the product category, so we will analyse it next.

3.3 Top Trending Product

After knowing that there might be a great shift towards online shopping, I would like to know what is the top trending product, i.e. the most profitable and the most sold product to sell on e-commerce. This may help gain an important recommendation about what product do people should supply in their online marketplace. There is a reason why I decided to see the most profitable and most sold product instead of just one of those. Selling the most profitable product will make the seller’s profit to be increased. However, if the seller also sell the most sold product, he will be more likely to get popularity. As we know, e-commerce platform tend to give sellers who are able to sell a lot of products more place for listings and advertisement. This is because they are more trustworthy to the buyers since they will have more products sold and reviews.

3.3.1 Most Profitable Product

First, let’s visualize the profit earned by each products’ category. We will create a new dataframe product_trend which will have the total profit and quantity sold for each category and sub category.

product_trend <- details %>% 
  group_by(category, sub.category) %>% 
  summarise(total.amount = sum(amount),
            total.profit = sum(profit),
            total.quantity = sum(quantity))

head(product_trend)

Next, selecting only the category and visualizing the most profitable product.

profit_cat_plot <- product_trend %>% 
  select(-sub.category) %>% 
  group_by(category) %>% 
  summarise_all(sum) %>% 
  ggplot(mapping = aes(x = category,
                       y = total.profit,
                       text = glue("Category: {category}
                                   Profit: {format(total.profit, big.mark = ',')}"))) +
  geom_col(aes(fill = total.profit), 
           color = "#002433") +
  labs(title = "Most Profitable Product Category",
       x = "Category",
       y = "Total Profit") +
  scale_y_continuous(labels = comma) +
  scale_fill_gradient(high = "#4a708b", low = "#a1b6c4") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        axis.title.x = element_text(size = 12, face = "bold"),
        legend.position = "none")

ggplotly(profit_cat_plot, tooltip = "text") %>% 
  layout(hoverlabel = list(bgcolor = "black"))

Both clothing and electronic product are profitable. Let’s see the most profitable sub-category from those 2 categories.

profit_subcat_plot <- product_trend %>% 
  filter(category %in% c("Clothing", "Electronics")) %>% 
  ggplot(mapping = aes(x = reorder(sub.category, total.profit),
                       y = total.profit,
                       text = glue("Sub-Category: {sub.category}
                                   Profit: {format(total.profit, big.mark = ',')}"))) +
  geom_col(aes(fill = category), color = "#002433") +
  scale_fill_manual(values = c("#cd5c5c", "#4a708b")) +
  labs(title = "Most Profitable Clothing and Electronic Product",
       x = "Sub Category",
       y = "Total Profit",
       fill = "Category") +
  scale_x_discrete(labels = label_wrap(width = 10)) +
  scale_y_continuous(labels = comma) +
  coord_flip() +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        axis.title.x = element_text(size = 12, face = "bold"),
        legend.title = element_blank())

ggplotly(profit_subcat_plot, tooltip = "text") %>% 
  layout(legend = list(y = 0.5))

Based on the visualization, it turns out that all types of clothing made profit. However, for electronic products, there is a sub category that made loss, which is electronic games. Based on this visualization, we can infer that seller should avoid selling electronic games, and focus more on printers and accessories (for electronic products), plus trousers and stole (for clothing products). Do take a note that for clothing products, it is more recommended to also supply other sub categories because they might be complementary products, which means every single sub category of clothing product gives value to the other.

3.3.2 Most Sold Product

We have conducted sales analysis based on its profitability. Next, we will see the most sold product.

product_trend %>% 
  select(-sub.category) %>% 
  group_by(category) %>% 
  summarise_all(sum) %>% 
  mutate(csum = rev(cumsum(rev(total.quantity))), 
         pos = total.quantity/2 + lead(csum, 1),
         pos = if_else(is.na(pos), total.quantity/2, pos)) %>% 

  ggplot(mapping = aes(x = "", 
                       y = total.quantity, 
                       fill = category,
                       text = glue("{total.quantity}"))) +
  geom_col(color = "#002433") +
  coord_polar(theta = "y") +
  scale_fill_brewer("blues") +
  geom_label_repel(aes(y = pos, 
                       label = percent(total.quantity / sum(total.quantity))),
                   size = 4.5, 
                   nudge_x = 1, 
                   show.legend = FALSE) +
  labs(title = "Most Sold Product Category") +
  guides(fill = guide_legend(title = "Category")) +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16))

Clothing products are sold the most in the e-commerce. We can infer that sellers that sell clothing products are more likely to interact with customers. Let’s see what clothing sub categories are the most sold.

sold_subcat_plot <- product_trend %>% 
  filter(category == "Clothing") %>% 
  ggplot(mapping = aes(x = reorder(sub.category, total.quantity),
                       y = total.quantity,
                       text = glue("Sub-Category: {sub.category}
                                   Quantity sold: {format(total.quantity, big.mark = ',')}"))) +
  geom_col(aes(fill = total.quantity), color = "#002433") +
  labs(title = "Most Sold Clothing Product",
       x = "Sub Category\n",
       y = "Quantity Sold",
       fill = "Category") +
  scale_y_continuous(labels = comma) +
  scale_fill_gradient(high = "#c93c3c", low = "white") +
  coord_flip() +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        axis.title.x = element_text(size = 12, face = "bold"),
        legend.position = "none")

ggplotly(sold_subcat_plot, tooltip = "text") %>% 
  layout(hoverlabel = list(bgcolor = "black"))

Top 3 most sold clothing products are saree, handkerchief, and stole. Considering also the profitability, clothing store should not focus on just some types of clothing products. As I mentioned before, clothing products are complementary to each other. There’s a high chance that customers buy clothing products in pair, e.g. people who buy shirt may also buy trousers, etc.

4 Customer Demographic Analysis

Customer demographic analysis is a way to identify new opportunities to increase business and expand market reach. It is useful to determine the potential market for business. First, let’s see which city and state contribute the most to the quantity of products sold. We will create a new dataframe demo_df beforehand.

demo_df <- profit %>% 
  select(-c(order.id, order.date, customer.name)) %>% 
  group_by(state, city) %>% 
  summarise_all(sum) %>% 
  ungroup()

head(demo_df)

4.1 Most Profitable Area

Knowing which area is the most profitable may help those who want to expand their businesses, allowing them to be reached and well-known faster. Firstly, let’s plot the most profitable state and city.

profit_state <- demo_df %>% 
  select(-city) %>% 
  group_by(state) %>%
  summarise_all(sum) %>%
  slice_max(order_by = total.profit, n = 10) %>% 
  ggplot(mapping = aes(x = reorder(state, total.profit), 
                       y = total.profit,
                       text = glue("State: {state}
                                   Profit: {format(total.profit, big.mark = ',')}"))) +
  geom_col(aes(fill = total.profit), 
           color = "#002433") +
  labs(title = "Top 10 Most Profitable state",
       x = "State\n",
       y = "Profit",
       fill = "Category") +
  scale_y_continuous(labels = comma) +
  scale_x_discrete(labels = label_wrap(5)) +
  scale_fill_gradient(high = "#004d85", low = "#b3d2e8") +
  coord_flip() +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        axis.title.x = element_text(size = 12, face = "bold"),
        legend.position = "none")

ggplotly(profit_state, tooltip = "text") %>% 
  layout(hoverlabel = list(bgcolor = "black"))

profit_city <- demo_df %>% 
  select(-state) %>% 
  group_by(city) %>%
  summarise_all(sum) %>%
  slice_max(order_by = total.profit, n = 10) %>% 
  ggplot(mapping = aes(x = reorder(city, total.profit), 
                       y = total.profit,
                       text = glue("City: {city}
                                   Profit: {format(total.profit, big.mark = ',')}"))) +
  geom_col(aes(fill = total.profit), 
           color = "#002433") +
  labs(title = "Top 10 Most Profitable City",
       x = "City\n",
       y = "Profit",
       fill = "Category") +
  scale_y_continuous(labels = comma) +
  scale_x_discrete(labels = label_wrap(5)) +
  scale_fill_gradient(high = "#c93c3c", low = "white") +
  coord_flip() +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        axis.title.x = element_text(size = 12, face = "bold"),
        legend.position = "none")

ggplotly(profit_city, tooltip = "text") %>% 
  layout(hoverlabel = list(bgcolor = "black"))

The most profitable states are Maharashtra, followed by Madhya Pradesh and Uttar Pradesh. However, the profit gap between the latter 2 states are quite wide. In case of cities, the most profitables are Pune, followed by Indore and Delhi. The profit gap between each cities, however is far more tight than states. The top 3 most profitable cities are also located in the top 2 most profitable states. This may gives a target market recommendation for businesses who wants to do market expansion.

5 Conclusion

From the sales analysis, it can be concluded that since October 2018, there might be a change in consumer behavior, with a great shift towards online shopping. This can lead to a massive opportunity for both established businesses and people who want to start their own businesses, as they can emerge into the online marketplace. Clothing and electronic products generate the highest profit. For clothing products, it is recommended to supply trousers and stole for profitability, plus saree and handkerchief for customer engagement. Other types of clothings are complementary, and will also generate some profit. In case of electronic products, printers and accessories generate the most profit, followed by phone. An interesting thing is that electronic games generate loss. I presume that it is because there are already other bigger platforms for electronic games that is known world-wide, but further research about this topic might be conducted.

In case of demographic analysis, the most profitable states are Maharashtra, Madhya Pradesh, and Uttar Pradesh. While the most profitable cities are Pune, Indore, and Delhi, which are all located in Maharashtra and Madhya Pradesh. These places are highly recommended for market expansion. An interesting topic for further research is why these areas generate more profits than the others. It may be the case that these areas are more developed, have better logistics, or more reachable. Knowing these things may help business to provide better services for customers and fulfill their demand, which will lead to a better customer engagement and more profit.

Visualizing E-Commerce Sales Analysis

Stane Aurelius Ronotana

23 August 2021