1 Background

1.1 Brief

Sales analysis is a method to examine how well some products are selling, especially compared to its goal or target. It provides important insights about the top-selling and under-performing products, market trend, most profitable customers, and many others. Here, I am going to conduct a sales analysis on the E-Commerce Data obtained from Kaggle. It consists of sales details from Indian e-commerce website.

1.2 Libraries and Setup

These following packages are required in this notebook. Use install.packages() to install any packages that are not already downloaded and load them using library() function. I provided a brief explanation about their function.

  • dplyr: data manipulation
  • lubridate: handling datetime data
  • glue: interpolate data into strings
  • ggplot2: visualizing data
  • scales: scaling in visualization
  • ggrepel: labeling in visualization
  • plotly: interactive visualization
library(dplyr)
library(lubridate)
library(glue)
library(ggplot2)
library(scales)
library(ggrepel)
library(plotly)

2 Data Preparation

2.1 Data Inspection

The data is split into 2 csv files:

  • orders: contains purchase information (order ID, order date, and customer details)
  • details: contains the price, quantity, profit, product category and subcategory in every order ID

for naming conventions, I will change the name of each column.

orders <- read.csv("data_input/orders.csv")
names(orders) <- c("order.id", "order.date", "customer.name", "state", "city")

# replace empty values with NA
orders[orders == ""] <- NA

head(orders)

Some information about the features in orders:

  • order.id: ID of the purchase order
  • order.date: Date of purchase
  • customer.name: Name of the customer
  • state: State of the customer
  • city: City of the customer
details <- read.csv("data_input/details.csv")
names(details) <- c("order.id", "amount", "profit", "quantity", "category", "sub.category")
head(details)

Some information about the features in details:

  • order.id: ID of the purchase order
  • amount: Price of the order
  • profit : Profit made by the purchase
  • quantity: Quantity of purchase
  • category: Category of product
  • sub.category: Sub-category the product belongs

We can see that some columns have incorrect data types, we will fix it later on, now let’s check if there is any missing value in the dataset.

2.2 Data Wrangling

2.2.1 Missing Values

anyNA(details)
## [1] FALSE
anyNA(orders)
## [1] TRUE

there exist some missing values in orders data, let’s see how many missing values there are in the data.

colSums(is.na(orders))
##      order.id    order.date customer.name         state          city 
##            60            60            60            60            60
na_rows <- rowSums(is.na(orders))
orders[which(na_rows > 0), ]

The orders data should be connected with details by their order.id. There are 60 empty rows in orders data, which means it has 500 list of orders. Let’s check how many unique order.id there are in details.

length(unique(details$order.id))
## [1] 500

Since details contains 500 unique order.id as well, we can just drop the missing values in the orders data. They are only empty rows, our data is complete.

orders <- na.omit(orders)
orders

2.2.2 Incorrect Data Types

Each dataframe has their own incorrect data types. Here are the following variables that need to be converted:

  • order.date: date
  • state: categorical
  • city: categorical
  • category: categorical
  • sub.category: categorical

So, converting data types for the order, details and target data.

# List of Orders
orders <- orders %>% 
  mutate(order.date = dmy(order.date),
         state = as.factor(state),
         city = as.factor(city))

# Order Details
details[, c("category", "sub.category")] <- lapply(details[, c("category", "sub.category")],
                                                   FUN = as.factor)

We are done with the data wrangling. Our data is ready to be analysed.

3 Sales Analysis

3.1 Sales Trend

The first thing I would like to analyse is the sales trend. We will review the historical revenue to detect whether there are patterns in our data or not. In business, trend analysis allows to predict the future market. It may also provides some important information. Let’s see the monthly profitability in our data. Firstly, we will create a new dataframe profit from details to see the total profit for each order ID, and then merge it with orders to get the order date.

profit <- details %>% 
  group_by(order.id) %>% 
  summarise(total.amount = sum(amount),
            total.profit = sum(profit),
            total.quantity = sum(quantity)) %>% 
  merge(orders)

head(profit)

We want to see the monthly profit. So we will need to reconstruct the dataframe. Let’s first create a new dataframe monthly_sales which contains monthly sales data.

monthly_sales <- profit %>% 
  mutate(order.month = month(order.date, label = T),
         order.year = year(order.date)) %>% 
  group_by(order.year, order.month) %>% 
  summarise(total.amount = sum(total.amount),
            total.profit = sum(total.profit),
            total.quantity = sum(total.quantity)) %>% 
  mutate(order.month.year = factor(paste(order.month, order.year))) %>% 
  ungroup() %>% 
  select(-c(order.year, order.month))

monthly_sales

Visualizing the monthly profit from April 2018 to March 2019.

# Lock the order of order.month.year
monthly_sales$order.month.year <- factor(monthly_sales$order.month.year,
                                         levels = monthly_sales$order.month.year)

# Visualization
monthly_profit_plot <- monthly_sales %>% 
  ggplot(mapping = aes(x = order.month.year,
                       y = total.profit,
                       text = glue("Profit in {order.month.year}: {format(total.profit, big.mark = ',')}"))) +
  geom_col(aes(fill = total.profit > 0), color = "#002433") +
  scale_fill_manual(values = c("#cd5c5c", "#4a708b")) +
  labs(title = "Monthly Sales Profit",
       y = "Total Profit") +
  scale_x_discrete(labels = wrap_format(5)) +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        legend.position = "none",
        axis.title.x = element_blank())

ggplotly(monthly_profit_plot, tooltip = "text")

We can see that from April 2018 until September 2018, the e-commerce sales experienced losses. But, since October 2018, the e-commerce sales made quite a big profit, especially in November 2018. Just by looking at the graph, we can tell that the profit made can easily covers all the losses it previously had. Also, let’s visualize the monthly quantity sales and check whether it correlates with the profit.

monthly_sales_plot <- monthly_sales %>% 
  ggplot(mapping = aes(x = order.month.year,
                       y = total.quantity,
                       text = glue("Product sold in {gsub('\n', ' ', order.month.year)}: {format(total.quantity, big.mark = ',')}"))) +
  geom_line(group = 1, color = "#4a708b", size = 1) +
  geom_point() +
  labs(title = "Monthly Sales",
       y = "Quantity Sold") +
  scale_x_discrete(labels = wrap_format(5)) +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        legend.position = "none",
        axis.title.x = element_blank())

ggplotly(monthly_sales_plot, tooltip = "text")

From both the profit and sales plot, we can gain an important information. Since October 2018, the e-commerce industry made quite a big profit, and the number of quantity sold increases significantly (although it fluctuates) until it reached the peak on March 2019. From business perspective, it may give an insight that there is a change in consumer behavior, with a great shift towards online shopping. Additionally, it seems that the number of product sold does not guarantee the profit earned. For instances, the highest loss occurs in June 2018, although the number of lowest products sold is in July 2018. Let’s check the relationship between quantity sold and profit earned.

3.2 Quantity and Profit Relationship

Let’s visualize a scatterplot to see the relationship between quantity and profit.

details %>% 
  ggplot(mapping = aes(x = quantity,
                       y = profit)) +
  geom_jitter() +
  geom_smooth(color = "red") +
  labs(title = "Quantity Sold and Profit Relationship",
       x = "Quantity Sold",
       y = "Profit") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
      axis.title.y = element_text(size = 12, face = "bold"),
      axis.title.x = element_text(size = 12, face = "bold"))

From the scatterplot, it seems that the quantity sold does not correlate with the profit. This means that making a lot of sales does not guarantee that the seller will generate more profit. Another factor that may influence profit is the product category, so we will analyse it next.

4 Customer Demographic Analysis

Customer demographic analysis is a way to identify new opportunities to increase business and expand market reach. It is useful to determine the potential market for business. First, let’s see which city and state contribute the most to the quantity of products sold. We will create a new dataframe demo_df beforehand.

demo_df <- profit %>% 
  select(-c(order.id, order.date, customer.name)) %>% 
  group_by(state, city) %>% 
  summarise_all(sum) %>% 
  ungroup()

head(demo_df)

4.1 Most Profitable Area

Knowing which area is the most profitable may help those who want to expand their businesses, allowing them to be reached and well-known faster. Firstly, let’s plot the most profitable state and city.

profit_state <- demo_df %>% 
  select(-city) %>% 
  group_by(state) %>%
  summarise_all(sum) %>%
  slice_max(order_by = total.profit, n = 10) %>% 
  ggplot(mapping = aes(x = reorder(state, total.profit), 
                       y = total.profit,
                       text = glue("State: {state}
                                   Profit: {format(total.profit, big.mark = ',')}"))) +
  geom_col(aes(fill = total.profit), 
           color = "#002433") +
  labs(title = "Top 10 Most Profitable state",
       x = "State\n",
       y = "Profit",
       fill = "Category") +
  scale_y_continuous(labels = comma) +
  scale_x_discrete(labels = label_wrap(5)) +
  scale_fill_gradient(high = "#004d85", low = "#b3d2e8") +
  coord_flip() +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        axis.title.x = element_text(size = 12, face = "bold"),
        legend.position = "none")

ggplotly(profit_state, tooltip = "text") %>% 
  layout(hoverlabel = list(bgcolor = "black"))
profit_city <- demo_df %>% 
  select(-state) %>% 
  group_by(city) %>%
  summarise_all(sum) %>%
  slice_max(order_by = total.profit, n = 10) %>% 
  ggplot(mapping = aes(x = reorder(city, total.profit), 
                       y = total.profit,
                       text = glue("City: {city}
                                   Profit: {format(total.profit, big.mark = ',')}"))) +
  geom_col(aes(fill = total.profit), 
           color = "#002433") +
  labs(title = "Top 10 Most Profitable City",
       x = "City\n",
       y = "Profit",
       fill = "Category") +
  scale_y_continuous(labels = comma) +
  scale_x_discrete(labels = label_wrap(5)) +
  scale_fill_gradient(high = "#c93c3c", low = "white") +
  coord_flip() +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.y = element_text(size = 12, face = "bold"),
        axis.title.x = element_text(size = 12, face = "bold"),
        legend.position = "none")

ggplotly(profit_city, tooltip = "text") %>% 
  layout(hoverlabel = list(bgcolor = "black"))

The most profitable states are Maharashtra, followed by Madhya Pradesh and Uttar Pradesh. However, the profit gap between the latter 2 states are quite wide. In case of cities, the most profitables are Pune, followed by Indore and Delhi. The profit gap between each cities, however is far more tight than states. The top 3 most profitable cities are also located in the top 2 most profitable states. This may gives a target market recommendation for businesses who wants to do market expansion.

5 Conclusion

From the sales analysis, it can be concluded that since October 2018, there might be a change in consumer behavior, with a great shift towards online shopping. This can lead to a massive opportunity for both established businesses and people who want to start their own businesses, as they can emerge into the online marketplace. Clothing and electronic products generate the highest profit. For clothing products, it is recommended to supply trousers and stole for profitability, plus saree and handkerchief for customer engagement. Other types of clothings are complementary, and will also generate some profit. In case of electronic products, printers and accessories generate the most profit, followed by phone. An interesting thing is that electronic games generate loss. I presume that it is because there are already other bigger platforms for electronic games that is known world-wide, but further research about this topic might be conducted.

In case of demographic analysis, the most profitable states are Maharashtra, Madhya Pradesh, and Uttar Pradesh. While the most profitable cities are Pune, Indore, and Delhi, which are all located in Maharashtra and Madhya Pradesh. These places are highly recommended for market expansion. An interesting topic for further research is why these areas generate more profits than the others. It may be the case that these areas are more developed, have better logistics, or more reachable. Knowing these things may help business to provide better services for customers and fulfill their demand, which will lead to a better customer engagement and more profit.