Introduction

About Me

I’m Briandamar Kencana a data enthusiast. I graduated with a bachelor’s degree in statistics. I have some work experience in the field of data, such as an internship at momobil.id, independent research and advisory Indonesia, staff data analyst at Alif Aza Asia and reserach fellow at Bank Indonesia. In addition, I have participated in several data training courses at Algoritma Data Science School, Coursera, Udemy and Dicoding.

Data Information

This data is contains of sales record included the profit or loss from a retail company (i named it PT. xxx) and i got this data from Algoritma Data Science School with CSV file format. PT.xxx is a global retail company who sells Office Supplies, Furniture and Technology stuff. The buyers are coming from different segment such as Corporate, Home Office until individual consumer.

Retail Dataset with CSV File Format

Method and Goals

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns,to spot anomalies,to test hypothesis and to check assumptions with the help of summary statistics and graphical representations.

Source: https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15

It is a good practice to understand the data first and try to gather as many insights from it. EDA becomes very important before doing feature engineering and modeling because in this stage we must understand the data first.

Goals : Show how to advance the company by implementing policies based on recommendations from the data analysis of retail dataset

R Programming with Retail

Data Inspection

Let’s start by opening retail.csv file, I will use the built in read.csv (…) function call, which reads the data in as a data frame, and assign the data frame to a variable retail. Make sure our data placed in the same folder our R project data. If you want to find out the working directory, use the getwd () function.

retail <- read.csv("data_input/retail.csv")
getwd()

#> [1] "D:/File Training/Algoritma Data Science/Pertemuan 1-3/Project Data Retail"

DT::datatable(head(retail,30),
              fillContainer=TRUE,
              rownames = FALSE,
              options = list(
                pageLength = 10,
                lengthMenu = c(10, 20, 30)),
                caption = htmltools::tags$caption(
                  style = 'caption-side:bottom; text-align: right;',
                  'Data Source:', htmltools::em('Algoritma Data Science School')))

I will use names() function to gets or sets the names of an object. There are thirteen object with the following names:

names(retail)

#>  [1] "Row.ID"       "Order.ID"     "Order.Date"   "Ship.Date"    "Ship.Mode"   
#>  [6] "Customer.ID"  "Segment"      "Product.ID"   "Category"     "Sub.Category"
#> [11] "Product.Name" "Sales"        "Quantity"     "Discount"     "Profit"

A little Objects information :

Order.ID : Id of order.
Order.Date : Date of Order.
Ship.Date : Date of shipping.
Ship.Mode : type of shipment.
Customer.ID : Id of Customer.
Segment : Customers segment.
Product.ID : Id of Product.
Category : have 3 levels “Furniture”,“Office Supplies”,“Technology”
sub.category: more specific categories
Product.Name: Name of product that was sold.
Sales : How much earning from each sale.
Quantity : Quantity of item sold.
Discount : How much Discount was given for each sale.
Profit : How much can a company earn from each sale.

Next, I will check the missing values on retail dataset, missing values are represented by the symbol NA (not available) in R.

colSums(is.na(retail))

#>       Row.ID     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID 
#>            0            0            0            0            0            0 
#>      Segment   Product.ID     Category Sub.Category Product.Name        Sales 
#>            0            0            0            0            0            0 
#>     Quantity     Discount       Profit 
#>            0            0            0

anyNA(retail)

#> [1] FALSE

Great, The retail dataset no missing value. Now I will use str () to observe the meta data from the data frame, unfortunately there are some data types that are not suitable. Next, I will explain how data transformation works in retail dataset.

str(retail)

#> 'data.frame':    9994 obs. of  15 variables:
#>  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
#>  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
#>  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
#>  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
#>  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#>  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
#>  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Data Transformation

After conducting an inspection of retail data, it is known that there are data types that are not suitable. Then, the data type will be changed to be suitable for analysis purposes.

retail$Order.ID <- as.character(retail$Order.ID)
retail$Customer.ID <- as.character(retail$Customer.ID)
retail$Product.ID <- as.character(retail$Product.ID)
retail$Product.Name <- as.character(retail$Product.Name)
retail$Order.Date <-  as.Date(retail$Order.Date, "%m/%d/%y")
retail$Ship.Date <-  as.Date(retail$Ship.Date, "%m/%d/%y")
retail <- retail[, -1]
str(retail)

#> 'data.frame':    9994 obs. of  14 variables:
#>  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#>  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
#>  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
#>  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
#>  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#>  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
#>  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#>  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#>  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
#>  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

After data transformation, the data type is suitable.Next, I will show you retail data analysis.

Data Analysis

Let’s start from the summary () function. It produces several descriptive statistical values for each variable, if the data class is a data frame. Numeric variables will produce a minimum value (Min.), first quantile (1st Qu.), median (Median), average (Mean), third quantile (3rd Qu.), maximum (Max.) and the number of missing values (NA’s ) If there is. For categorical variables (character or factor), it will produce a lot of data (Length), class (Class) and mode (Mode).

summary(retail)

#>    Order.ID           Order.Date           Ship.Date         
#>  Length:9994        Min.   :2014-01-03   Min.   :2014-01-07  
#>  Class :character   1st Qu.:2015-05-23   1st Qu.:2015-05-27  
#>  Mode  :character   Median :2016-06-26   Median :2016-06-29  
#>                     Mean   :2016-04-30   Mean   :2016-05-03  
#>                     3rd Qu.:2017-05-14   3rd Qu.:2017-05-18  
#>                     Max.   :2017-12-30   Max.   :2018-01-05  
#>   Ship.Mode         Customer.ID          Segment           Product.ID       
#>  Length:9994        Length:9994        Length:9994        Length:9994       
#>  Class :character   Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
#>                                                                             
#>                                                                             
#>                                                                             
#>    Category         Sub.Category       Product.Name           Sales          
#>  Length:9994        Length:9994        Length:9994        Min.   :    0.444  
#>  Class :character   Class :character   Class :character   1st Qu.:   17.280  
#>  Mode  :character   Mode  :character   Mode  :character   Median :   54.490  
#>                                                           Mean   :  229.858  
#>                                                           3rd Qu.:  209.940  
#>                                                           Max.   :22638.480  
#>     Quantity        Discount          Profit         
#>  Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
#>  1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
#>  Median : 3.00   Median :0.2000   Median :    8.666  
#>  Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
#>  3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
#>  Max.   :14.00   Max.   :0.8000   Max.   : 8399.976

sum(retail$Quantity)

#> [1] 37873

sum(retail$Sales)

#> [1] 2297201

sum(retail$Profit)

#> [1] 286397

Based on the results above :

First order occured in Jan 2014
Item sold in 2014-2017 is 37873 items with total Sales 2297201 and profit 286397
The average sales at 229.858, with max sales value at 22638.480 and minimum sales value at 0.444
Max quantity order was 14 within 4 years but average at 3.79
max amount of discount was 80%, but mean number of discount around 15%
PT. Xxx ever had loss at 6599 and gained 8399 as their profit with average profit at 28.657

Next I will show a deeper analysis, I will divide into 3 parts:

Sales Data Analyst
Product Data Analyst
Customer Data Analyst

Sales Data Analyst

I will analyze sales movements to find patterns and get the following results:

library(lubridate)
library(ggplot2)
library (zoo)
library(formattable)
library(dplyr)

retail.graph<-retail
retail.graph$Order.Date <- as.yearmon(retail.graph$Order.Date,"%Y/%m")
retail.agregate<-aggregate(formula = Sales ~ Order.Date, data = retail.graph, FUN = sum)
retail.agregate$Sales<- comma(retail.agregate$Sales, digits = 2)

ggplot(retail.agregate, aes(x = Order.Date, Sales)) +
  geom_line(color = '#c48c08') +
  geom_area(fill = '#c48c08', alpha = .1) +
  labs(x = 'Month'
       , y = 'Sales'
       , title = "Trend of Sales Retail Dataset") +
  theme(text = element_text(family = 'Gill Sans', color = "#444444")
        ,panel.background = element_rect(fill = '#260101')
        ,panel.grid.minor = element_line(color = '#420101')
        ,panel.grid.major = element_line(color = '#470303')
        ,plot.title = element_text(size = 18, colour= '#260101')
        ,axis.title = element_text(size = 10, color = '#555555')
        ,axis.title.y = element_text(vjust = 1)
        ,axis.title.x = element_text(hjust = 0)
        )

Sales will always decline at the beginning of the year, starting from the decline in January to February. This means that at the beginning of the new year the customer will think to stop shopping
Sales will start to rise in March and return to decline in April
On the month of May to August sales will fluctuate
on month sales Septemper be rising sharply and in October will decline sharply
The end of the year is the best sale, this happens from November to December.

By knowing that information, companies can implement appropriate policies to increase sales. For example, the company does not need to go any further to increase sales in September, November, and December. Because the psychology of customers will shop in these months, so the company is quite sure of stock availability or selling product that slow moving with discount.In the case of fahion retail, the sales strategy at the end of the year uses discounts rate of 50-70% for slow moving items.

Then, If you want to know sales movements by year can be seen below :

retail.agregate$Order.Year <-year(retail.agregate$Order.Date)
retail.Year<-aggregate(formula = Sales ~ Order.Year, data = retail.agregate, FUN = sum)

ggplot(retail.Year, aes(x = Order.Year, Sales)) +
  geom_line(color = '#c48c08') +
  geom_area(fill = '#c48c08', alpha = .1) +
  labs(x = 'Year'
       , y = 'Sales'
       , title = "Sales by Year") +
  theme(text = element_text(family = 'Gill Sans', color = "#444444")
        ,panel.background = element_rect(fill = '#260101')
        ,panel.grid.minor = element_line(color = '#420101')
        ,panel.grid.major = element_line(color = '#470303')
        ,plot.title = element_text(size = 18,color= '#260101')
        ,axis.title = element_text(size = 10, color = '#555555')
        ,axis.title.y = element_text(vjust = 1)
        ,axis.title.x = element_text(hjust = 0)
        )

Sales by year have increased, but in 2015 there was a decline. Growth rate of increase and decrease can be seen in the following table :

library(dplyr)
Growth_rate = retail.Year %>%
  
  arrange(retail.Year$Order.Year) %>%
  mutate(Diff_year = retail.Year$Order.Year - lag(retail.Year$Order.Year),
         Diff_growth = retail.Year$Sales - lag(retail.Year$Sales), 
         Rate_percent = (Diff_growth / Diff_year)/retail.Year$Sales * 100)
Growth_rate[c(1,2,5)]

Then I want to see, is the sale happening is more because of a discount?

retail$Order.Month <- month(retail$Order.Date, label=T)
crosstab <- xtabs(formula = Quantity ~ Order.Month + Discount, data = retail)
crosstab

#>            Discount
#> Order.Month    0  0.1 0.15  0.2  0.3 0.32  0.4 0.45  0.5  0.6  0.7  0.8
#>         Jan  773   12    5  446   20    9   51    0   15   18   77   49
#>         Feb  587    4    2  339   25    1   17    0   12   15   28   37
#>         Mar 1250   28   12  853   85   22   79    0   25   24   80  106
#>         Apr 1099   17   27  961   42    4   51    6   11   18  129   82
#>         May 1344   13    7 1009   78    8   30    4   14   42  136  106
#>         Jun 1210   29   26 1048   56    7   44    0   25   52   98   85
#>         Jul 1299   26   13  965   81    0   58    0   17   40  110   96
#>         Aug 1297   14   15 1122   46    5   51   10   15   29   94   86
#>         Sep 2574   68   21 1743  104   13   88    4   32   37  236  142
#>         Oct 1483   37   18 1122   63    3   67    5   20   48  140   98
#>         Nov 2722   65   27 2038  139   18  172   10   23   90  299  172
#>         Dec 2629   60   25 2014  110   15   78    6   32   88  233  129

heatmap(crosstab, Rowv = NA, Colv = NA, cexCol = 0.8, scale = "row")

Awesome, many products are sold without discounts. if I look at using a heatmap, I will find that the dark color is on without a discount and a 20% discount. it can be mean, customers are familiar and comfortable shopping at the company. so, companies do not need to hold large discounts to attract the attention of customers.I think, if the company wants to hold a discount, it’s better between April-August because sales are still up and down. Now, I will see the profit from the retail dataset. The results were surprising because there were more than one loss period.

retail.graph2<-retail
retail.graph2$Order.Date <- as.yearmon(retail.graph2$Order.Date,"%Y/%m")
retail.agregate2<-aggregate(formula = Profit ~ Order.Date, data = retail.graph2, FUN = sum)
retail.agregate2$Profit<- comma(retail.agregate2$Profit, digits = 2)

ggplot(retail.agregate2, aes(x = Order.Date, Profit)) +
  geom_line(color = '#c48c08') +
  geom_area(fill = '#c48c08', alpha = .1) +
  labs(x = 'Month'
       , y = 'Profit'
       , title = "Profit of Company") +
  theme(text = element_text(family = 'Gill Sans', color = "#444444")
         ,panel.background = element_rect(fill = '#260101')
        ,panel.grid.minor = element_line(color = '#420101')
        ,panel.grid.major = element_line(color = '#470303')
        ,plot.title = element_text(size = 18, color= '#260101')
        ,axis.title = element_text(size = 10, color = '#555555')
        ,axis.title.y = element_text(vjust = 1)
        ,axis.title.x = element_text(hjust = 0)
        )

Then I will see sales and profits that occur in July 2014 and January 2015, is it due to products that sell more discounts?

retail20142015<-retail.graph2[ retail.graph2$Order.Date == "Jul 2014" | retail.graph2$Order.Date == "Jan 2015", ]

convert <- function(x){
  
  if (x < 0) { 
    x <- "Negative"
  }
  else if (x > 0 ) {
    x <- "positive"
  }
  else {
    x <- "Zero"
  }
  return(x)
}

retail20142015$status_profit <- sapply(X = retail20142015$Profit, FUN = convert)

crosstab_status_profit<- xtabs(formula = Quantity ~ status_profit+ Discount, data =retail20142015)

crosstab_status_profit

#>              Discount
#> status_profit   0 0.1 0.15 0.2 0.3 0.32 0.4 0.5 0.6 0.7 0.8
#>      Negative   0   0    0  51  20    9  33   5  15  30  30
#>      positive 331   1    1 249   0    0   4   0   0   0   0
#>      Zero       2   0    0   5   0    0   0   0   0   0   0

There are still more products sold at normal prices and at 20% discount. Then I will see based on nominal value, because it is possible that products sold using discount have a small quantity but the impact on the nominal value is large.

crosstab_status_profit2<- xtabs(formula = round(Profit) ~  Discount+status_profit, data =retail20142015)

crosstab_status_profit2

#>         status_profit
#> Discount Negative positive  Zero
#>     0           0     3131     0
#>     0.1         0       11     0
#>     0.15        0        7     0
#>     0.2      -418     1816     0
#>     0.3      -377        0     0
#>     0.32     -438        0     0
#>     0.4     -2467        3     0
#>     0.5      -244        0     0
#>     0.6      -336        0     0
#>     0.7      -990        0     0
#>     0.8     -3815        0     0

Great, I’ve found the cause of profit can be negative. If I look at using crosstab above, I will find a 40% discount and a 80% discount has a huge loss. This happens because some of the products that are discounted are not appropriate, so it’s important to use price optimization. It helps regulate and control pricing decisions across every category and product type. But I will not further analysis related to price optimization at this time.

Product Data Analyst

In this section, I will analyze products that five fast-moving and slow-moving product based on the number of items sold. Then, I will analyze the top five product sales based on nominal sales.

product_moving<-aggregate(formula = Quantity ~ Product.Name, data = retail, FUN = sum)
product_moving <- product_moving[order(product_moving$Quantity, decreasing = T), ]
DT::datatable(head(product_moving,5),
              filter = 'top',
              rownames = FALSE,
              options = list(
                pageLength = 5))

Top 5 of fast moving product base on item sold can be seen above.The first order is staples with 215 item sold, the next sequence is staple envelopes with 170 item sold. These products that need to be considered stock availability not to reduce product sales

DT::datatable(tail(product_moving,5),
              filter = 'top',
              rownames = FALSE,
              options = list(
                pageLength = 5))

These products were only sold 1 in 2014-2017. Try to use discounts for these products to increase sales, so the product does not become dead stock.

top5<-aggregate(formula = Sales ~ Product.Name, data = retail, FUN = sum)
top5 <- top5[order(top5$Sales, decreasing = T), ]
DT::datatable(head(top5,5),
              filter = 'top',
              rownames = FALSE,
              options = list(
                pageLength = 5))

Top 5 sales base on product sales can be seen above, the first order is Canon imageCLASS 2200 Advanced Copier.

if you want to find out which type of category has the highest sales based on quantity and sales, you can see below:

xtabs(Quantity~ Category, retail)

#> Category
#>       Furniture Office Supplies      Technology 
#>            8028           22906            6939

xtabs(Sales~ Category, retail)

#> Category
#>       Furniture Office Supplies      Technology 
#>        741999.8        719047.0        836154.0

The advanced level of retail data analysis is market basket analysis. Market basket analysis is an unsupervised learning technique that can be useful for analyzing transactional data. It can be a powerful technique in analyzing the purchasing patterns of consumers. but this time, I didn’t analyze it until the basketball analysis market. maybe I’ll analyze using market basket analysis next time, but if you want to know about market basket analysis you can go to Market Basket Analysis.

Customer Data Analyst

In this section, i will find out the most buyer segments in each category, the type of shipping that the customer likes and then will use a cohort to find out how often customers first shop and will continue to shop actively. The Most buyer segments are consumer, consumer become the most segments in each category

Crosstab2<-xtabs(Sales~Segment+Category,retail)

plot(Crosstab2)

xtabs(Quantity~Ship.Mode,retail)

#> Ship.Mode
#>    First Class       Same Day   Second Class Standard Class 
#>           5693           1960           7423          22797

Here, I tried a subset of transactions that occurred in 2017 and Get the initial InvoiceDate for each customer,this is the customer’s join date.

Here I calculate the number of days a customer has been a customer with the company. Based on this information, I calculate the amount of months said customer has been active. Active as in making purchases.

Reminder:

- All cohorts are of the year 2017
- Cohort membership is based on the month of customer acquisition

Each of the twelve rows represent a cohort, and its columns, 1 through 11, represent the months post acquisition. Column 0 shows the count of customer acquisition per month. The intersections of row and column feature the count of active cohort members.

The retention rate mixpanel below features the same data from the previously shown mixpanel. The difference here is the intersections between rows and columns now feature the retention rates for each cohort group. Column 0 gives us the count of acquired customers for each month.

We can see that the August cohort features significantly higher retention rates than the other cohorts. By looking upwards, diagonally, from each acquisition month, we’re able to keep track of any specific month other than January. With this, we can see that the month of November features a higher rate of retention than other months for most of the cohorts.

The horizontal panel below the retention rate mixpanel features the average retention rate for each monthly age of all cohorts. The average count of acquired customers per month is 65.75. Generally, the average retention rate increasing and decreasing following sales graphic in Sales Data Analyst section such sales in March rise and fall in april month.

Summary

In 4 years, since bigining of 2014-2017 PT. xxx have sold total 37873 items with total Sales 2297201 and profit 286397. This company selling 3 main category : Furniture, Office supplies and Technology. From this category, Technology has the highest sales. Sales will always decline at the beginning of the year and The end of the year is the best sale, this happens from November to December.

Many products are sold without discounts, staples into the first order based on items sold and Advanced Copier Canon imageCLASS 2200 to become the first by sales. There are also many products that don’t sell like Xerox 20, so we need a sales strategy for these product.

Buyer who buying the product was coming from different segment, such as : Consumer, Corporate and Home Office. ‘Consumer’ was giving the higest ammount of sales. Shipping option consist of 4 types : First Class, Same day, Second class, and Standart class. Based on data, from all segment, most of them are choosing Standart Class as their shipping option.The average customer retention rate is around 13.89% -25.13% per month

Recommendation

Apply strategies to increase sales in low sales months, the decline could be due to low visitor arrivals.
Maintain and increase the stock of fast moving product at the month that have high sales
Implementing price optimization to helps regulate and control pricing decisions across every category and product type.
Maintain customer and company relationships so that customers are always active in buying

A work by Briandamar Kencana

damarbrian@gmail.com

Exploratory Data Analysis (EDA) on Retail Data

Briandamar Kencana

June 19, 2020