I’m Briandamar Kencana a data enthusiast. I graduated with a bachelor’s degree in statistics. I have some work experience in the field of data, such as an internship at momobil.id, independent research and advisory Indonesia, staff data analyst at Alif Aza Asia and reserach fellow at Bank Indonesia. In addition, I have participated in several data training courses at Algoritma Data Science School, Coursera, Udemy and Dicoding.
This data is contains of sales record included the profit or loss from a retail company (i named it PT. xxx) and i got this data from Algoritma Data Science School with CSV file format. PT.xxx is a global retail company who sells Office Supplies, Furniture and Technology stuff. The buyers are coming from different segment such as Corporate, Home Office until individual consumer.
Retail Dataset with CSV File Format
Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns,to spot anomalies,to test hypothesis and to check assumptions with the help of summary statistics and graphical representations.
It is a good practice to understand the data first and try to gather as many insights from it. EDA becomes very important before doing feature engineering and modeling because in this stage we must understand the data first.
Goals : Show how to advance the company by implementing policies based on recommendations from the data analysis of retail dataset
Let’s start by opening retail.csv file, I will use the built in read.csv (…) function call, which reads the data in as a data frame, and assign the data frame to a variable retail. Make sure our data placed in the same folder our R project data. If you want to find out the working directory, use the getwd () function.
retail <- read.csv("data_input/retail.csv")
getwd()
#> [1] "D:/File Training/Algoritma Data Science/Pertemuan 1-3/Project Data Retail"
DT::datatable(head(retail,30),
fillContainer=TRUE,
rownames = FALSE,
options = list(
pageLength = 10,
lengthMenu = c(10, 20, 30)),
caption = htmltools::tags$caption(
style = 'caption-side:bottom; text-align: right;',
'Data Source:', htmltools::em('Algoritma Data Science School')))
I will use names() function to gets or sets the names of an object. There are thirteen object with the following names:
names(retail)
#> [1] "Row.ID" "Order.ID" "Order.Date" "Ship.Date" "Ship.Mode"
#> [6] "Customer.ID" "Segment" "Product.ID" "Category" "Sub.Category"
#> [11] "Product.Name" "Sales" "Quantity" "Discount" "Profit"
A little Objects information :
Next, I will check the missing values on retail dataset, missing values are represented by the symbol NA (not available) in R.
colSums(is.na(retail))
#> Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
#> 0 0 0 0 0 0
#> Segment Product.ID Category Sub.Category Product.Name Sales
#> 0 0 0 0 0 0
#> Quantity Discount Profit
#> 0 0 0
anyNA(retail)
#> [1] FALSE
Great, The retail dataset no missing value. Now I will use str () to observe the meta data from the data frame, unfortunately there are some data types that are not suitable. Next, I will explain how data transformation works in retail dataset.
str(retail)
#> 'data.frame': 9994 obs. of 15 variables:
#> $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
#> $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
#> $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#> $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
#> $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
After conducting an inspection of retail data, it is known that there are data types that are not suitable. Then, the data type will be changed to be suitable for analysis purposes.
retail$Order.ID <- as.character(retail$Order.ID)
retail$Customer.ID <- as.character(retail$Customer.ID)
retail$Product.ID <- as.character(retail$Product.ID)
retail$Product.Name <- as.character(retail$Product.Name)
retail$Order.Date <- as.Date(retail$Order.Date, "%m/%d/%y")
retail$Ship.Date <- as.Date(retail$Ship.Date, "%m/%d/%y")
retail <- retail[, -1]
str(retail)
#> 'data.frame': 9994 obs. of 14 variables:
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
#> $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
#> $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#> $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
#> $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
After data transformation, the data type is suitable.Next, I will show you retail data analysis.
Let’s start from the summary () function. It produces several descriptive statistical values for each variable, if the data class is a data frame. Numeric variables will produce a minimum value (Min.), first quantile (1st Qu.), median (Median), average (Mean), third quantile (3rd Qu.), maximum (Max.) and the number of missing values (NA’s ) If there is. For categorical variables (character or factor), it will produce a lot of data (Length), class (Class) and mode (Mode).
summary(retail)
#> Order.ID Order.Date Ship.Date
#> Length:9994 Min. :2014-01-03 Min. :2014-01-07
#> Class :character 1st Qu.:2015-05-23 1st Qu.:2015-05-27
#> Mode :character Median :2016-06-26 Median :2016-06-29
#> Mean :2016-04-30 Mean :2016-05-03
#> 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
#> Max. :2017-12-30 Max. :2018-01-05
#> Ship.Mode Customer.ID Segment Product.ID
#> Length:9994 Length:9994 Length:9994 Length:9994
#> Class :character Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character Mode :character
#>
#>
#>
#> Category Sub.Category Product.Name Sales
#> Length:9994 Length:9994 Length:9994 Min. : 0.444
#> Class :character Class :character Class :character 1st Qu.: 17.280
#> Mode :character Mode :character Mode :character Median : 54.490
#> Mean : 229.858
#> 3rd Qu.: 209.940
#> Max. :22638.480
#> Quantity Discount Profit
#> Min. : 1.00 Min. :0.0000 Min. :-6599.978
#> 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
#> Median : 3.00 Median :0.2000 Median : 8.666
#> Mean : 3.79 Mean :0.1562 Mean : 28.657
#> 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
#> Max. :14.00 Max. :0.8000 Max. : 8399.976
sum(retail$Quantity)
#> [1] 37873
sum(retail$Sales)
#> [1] 2297201
sum(retail$Profit)
#> [1] 286397
Based on the results above :
Next I will show a deeper analysis, I will divide into 3 parts:
I will analyze sales movements to find patterns and get the following results:
library(lubridate)
library(ggplot2)
library (zoo)
library(formattable)
library(dplyr)
retail.graph<-retail
retail.graph$Order.Date <- as.yearmon(retail.graph$Order.Date,"%Y/%m")
retail.agregate<-aggregate(formula = Sales ~ Order.Date, data = retail.graph, FUN = sum)
retail.agregate$Sales<- comma(retail.agregate$Sales, digits = 2)
ggplot(retail.agregate, aes(x = Order.Date, Sales)) +
geom_line(color = '#c48c08') +
geom_area(fill = '#c48c08', alpha = .1) +
labs(x = 'Month'
, y = 'Sales'
, title = "Trend of Sales Retail Dataset") +
theme(text = element_text(family = 'Gill Sans', color = "#444444")
,panel.background = element_rect(fill = '#260101')
,panel.grid.minor = element_line(color = '#420101')
,panel.grid.major = element_line(color = '#470303')
,plot.title = element_text(size = 18, colour= '#260101')
,axis.title = element_text(size = 10, color = '#555555')
,axis.title.y = element_text(vjust = 1)
,axis.title.x = element_text(hjust = 0)
)
By knowing that information, companies can implement appropriate policies to increase sales. For example, the company does not need to go any further to increase sales in September, November, and December. Because the psychology of customers will shop in these months, so the company is quite sure of stock availability or selling product that slow moving with discount.In the case of fahion retail, the sales strategy at the end of the year uses discounts rate of 50-70% for slow moving items.
Then, If you want to know sales movements by year can be seen below :
retail.agregate$Order.Year <-year(retail.agregate$Order.Date)
retail.Year<-aggregate(formula = Sales ~ Order.Year, data = retail.agregate, FUN = sum)
ggplot(retail.Year, aes(x = Order.Year, Sales)) +
geom_line(color = '#c48c08') +
geom_area(fill = '#c48c08', alpha = .1) +
labs(x = 'Year'
, y = 'Sales'
, title = "Sales by Year") +
theme(text = element_text(family = 'Gill Sans', color = "#444444")
,panel.background = element_rect(fill = '#260101')
,panel.grid.minor = element_line(color = '#420101')
,panel.grid.major = element_line(color = '#470303')
,plot.title = element_text(size = 18,color= '#260101')
,axis.title = element_text(size = 10, color = '#555555')
,axis.title.y = element_text(vjust = 1)
,axis.title.x = element_text(hjust = 0)
)
Sales by year have increased, but in 2015 there was a decline. Growth rate of increase and decrease can be seen in the following table :
library(dplyr)
Growth_rate = retail.Year %>%
arrange(retail.Year$Order.Year) %>%
mutate(Diff_year = retail.Year$Order.Year - lag(retail.Year$Order.Year),
Diff_growth = retail.Year$Sales - lag(retail.Year$Sales),
Rate_percent = (Diff_growth / Diff_year)/retail.Year$Sales * 100)
Growth_rate[c(1,2,5)]
Then I want to see, is the sale happening is more because of a discount?
retail$Order.Month <- month(retail$Order.Date, label=T)
crosstab <- xtabs(formula = Quantity ~ Order.Month + Discount, data = retail)
crosstab
#> Discount
#> Order.Month 0 0.1 0.15 0.2 0.3 0.32 0.4 0.45 0.5 0.6 0.7 0.8
#> Jan 773 12 5 446 20 9 51 0 15 18 77 49
#> Feb 587 4 2 339 25 1 17 0 12 15 28 37
#> Mar 1250 28 12 853 85 22 79 0 25 24 80 106
#> Apr 1099 17 27 961 42 4 51 6 11 18 129 82
#> May 1344 13 7 1009 78 8 30 4 14 42 136 106
#> Jun 1210 29 26 1048 56 7 44 0 25 52 98 85
#> Jul 1299 26 13 965 81 0 58 0 17 40 110 96
#> Aug 1297 14 15 1122 46 5 51 10 15 29 94 86
#> Sep 2574 68 21 1743 104 13 88 4 32 37 236 142
#> Oct 1483 37 18 1122 63 3 67 5 20 48 140 98
#> Nov 2722 65 27 2038 139 18 172 10 23 90 299 172
#> Dec 2629 60 25 2014 110 15 78 6 32 88 233 129
heatmap(crosstab, Rowv = NA, Colv = NA, cexCol = 0.8, scale = "row")
Awesome, many products are sold without discounts. if I look at using a heatmap, I will find that the dark color is on without a discount and a 20% discount. it can be mean, customers are familiar and comfortable shopping at the company. so, companies do not need to hold large discounts to attract the attention of customers.I think, if the company wants to hold a discount, it’s better between April-August because sales are still up and down. Now, I will see the profit from the retail dataset. The results were surprising because there were more than one loss period.
retail.graph2<-retail
retail.graph2$Order.Date <- as.yearmon(retail.graph2$Order.Date,"%Y/%m")
retail.agregate2<-aggregate(formula = Profit ~ Order.Date, data = retail.graph2, FUN = sum)
retail.agregate2$Profit<- comma(retail.agregate2$Profit, digits = 2)
ggplot(retail.agregate2, aes(x = Order.Date, Profit)) +
geom_line(color = '#c48c08') +
geom_area(fill = '#c48c08', alpha = .1) +
labs(x = 'Month'
, y = 'Profit'
, title = "Profit of Company") +
theme(text = element_text(family = 'Gill Sans', color = "#444444")
,panel.background = element_rect(fill = '#260101')
,panel.grid.minor = element_line(color = '#420101')
,panel.grid.major = element_line(color = '#470303')
,plot.title = element_text(size = 18, color= '#260101')
,axis.title = element_text(size = 10, color = '#555555')
,axis.title.y = element_text(vjust = 1)
,axis.title.x = element_text(hjust = 0)
)
Then I will see sales and profits that occur in July 2014 and January 2015, is it due to products that sell more discounts?
retail20142015<-retail.graph2[ retail.graph2$Order.Date == "Jul 2014" | retail.graph2$Order.Date == "Jan 2015", ]
convert <- function(x){
if (x < 0) {
x <- "Negative"
}
else if (x > 0 ) {
x <- "positive"
}
else {
x <- "Zero"
}
return(x)
}
retail20142015$status_profit <- sapply(X = retail20142015$Profit, FUN = convert)
crosstab_status_profit<- xtabs(formula = Quantity ~ status_profit+ Discount, data =retail20142015)
crosstab_status_profit
#> Discount
#> status_profit 0 0.1 0.15 0.2 0.3 0.32 0.4 0.5 0.6 0.7 0.8
#> Negative 0 0 0 51 20 9 33 5 15 30 30
#> positive 331 1 1 249 0 0 4 0 0 0 0
#> Zero 2 0 0 5 0 0 0 0 0 0 0
There are still more products sold at normal prices and at 20% discount. Then I will see based on nominal value, because it is possible that products sold using discount have a small quantity but the impact on the nominal value is large.
crosstab_status_profit2<- xtabs(formula = round(Profit) ~ Discount+status_profit, data =retail20142015)
crosstab_status_profit2
#> status_profit
#> Discount Negative positive Zero
#> 0 0 3131 0
#> 0.1 0 11 0
#> 0.15 0 7 0
#> 0.2 -418 1816 0
#> 0.3 -377 0 0
#> 0.32 -438 0 0
#> 0.4 -2467 3 0
#> 0.5 -244 0 0
#> 0.6 -336 0 0
#> 0.7 -990 0 0
#> 0.8 -3815 0 0
Great, I’ve found the cause of profit can be negative. If I look at using crosstab above, I will find a 40% discount and a 80% discount has a huge loss. This happens because some of the products that are discounted are not appropriate, so it’s important to use price optimization. It helps regulate and control pricing decisions across every category and product type. But I will not further analysis related to price optimization at this time.
In this section, I will analyze products that five fast-moving and slow-moving product based on the number of items sold. Then, I will analyze the top five product sales based on nominal sales.
product_moving<-aggregate(formula = Quantity ~ Product.Name, data = retail, FUN = sum)
product_moving <- product_moving[order(product_moving$Quantity, decreasing = T), ]
DT::datatable(head(product_moving,5),
filter = 'top',
rownames = FALSE,
options = list(
pageLength = 5))
Top 5 of fast moving product base on item sold can be seen above.The first order is staples with 215 item sold, the next sequence is staple envelopes with 170 item sold. These products that need to be considered stock availability not to reduce product sales
DT::datatable(tail(product_moving,5),
filter = 'top',
rownames = FALSE,
options = list(
pageLength = 5))
These products were only sold 1 in 2014-2017. Try to use discounts for these products to increase sales, so the product does not become dead stock.
top5<-aggregate(formula = Sales ~ Product.Name, data = retail, FUN = sum)
top5 <- top5[order(top5$Sales, decreasing = T), ]
DT::datatable(head(top5,5),
filter = 'top',
rownames = FALSE,
options = list(
pageLength = 5))
Top 5 sales base on product sales can be seen above, the first order is Canon imageCLASS 2200 Advanced Copier.
if you want to find out which type of category has the highest sales based on quantity and sales, you can see below:
xtabs(Quantity~ Category, retail)
#> Category
#> Furniture Office Supplies Technology
#> 8028 22906 6939
xtabs(Sales~ Category, retail)
#> Category
#> Furniture Office Supplies Technology
#> 741999.8 719047.0 836154.0
The advanced level of retail data analysis is market basket analysis. Market basket analysis is an unsupervised learning technique that can be useful for analyzing transactional data. It can be a powerful technique in analyzing the purchasing patterns of consumers. but this time, I didn’t analyze it until the basketball analysis market. maybe I’ll analyze using market basket analysis next time, but if you want to know about market basket analysis you can go to Market Basket Analysis.
In this section, i will find out the most buyer segments in each category, the type of shipping that the customer likes and then will use a cohort to find out how often customers first shop and will continue to shop actively. The Most buyer segments are consumer, consumer become the most segments in each category
Crosstab2<-xtabs(Sales~Segment+Category,retail)
plot(Crosstab2)
xtabs(Quantity~Ship.Mode,retail)
#> Ship.Mode
#> First Class Same Day Second Class Standard Class
#> 5693 1960 7423 22797
Here, I tried a subset of transactions that occurred in 2017 and Get the initial InvoiceDate for each customer,this is the customer’s join date.
Here I calculate the number of days a customer has been a customer with the company. Based on this information, I calculate the amount of months said customer has been active. Active as in making purchases.
Reminder:
- All cohorts are of the year 2017
- Cohort membership is based on the month of customer acquisition
Each of the twelve rows represent a cohort, and its columns, 1 through 11, represent the months post acquisition. Column 0 shows the count of customer acquisition per month. The intersections of row and column feature the count of active cohort members.
The retention rate mixpanel below features the same data from the previously shown mixpanel. The difference here is the intersections between rows and columns now feature the retention rates for each cohort group. Column 0 gives us the count of acquired customers for each month.
We can see that the August cohort features significantly higher retention rates than the other cohorts. By looking upwards, diagonally, from each acquisition month, we’re able to keep track of any specific month other than January. With this, we can see that the month of November features a higher rate of retention than other months for most of the cohorts.
The horizontal panel below the retention rate mixpanel features the average retention rate for each monthly age of all cohorts. The average count of acquired customers per month is 65.75. Generally, the average retention rate increasing and decreasing following sales graphic in Sales Data Analyst section such sales in March rise and fall in april month.
In 4 years, since bigining of 2014-2017 PT. xxx have sold total 37873 items with total Sales 2297201 and profit 286397. This company selling 3 main category : Furniture, Office supplies and Technology. From this category, Technology has the highest sales. Sales will always decline at the beginning of the year and The end of the year is the best sale, this happens from November to December.
Many products are sold without discounts, staples into the first order based on items sold and Advanced Copier Canon imageCLASS 2200 to become the first by sales. There are also many products that don’t sell like Xerox 20, so we need a sales strategy for these product.
Buyer who buying the product was coming from different segment, such as : Consumer, Corporate and Home Office. ‘Consumer’ was giving the higest ammount of sales. Shipping option consist of 4 types : First Class, Same day, Second class, and Standart class. Based on data, from all segment, most of them are choosing Standart Class as their shipping option.The average customer retention rate is around 13.89% -25.13% per month
RecommendationA work by Briandamar Kencana
damarbrian@gmail.com