1 Background

1.1 Objective

This project is created as an extra task for Algoritma’s Student for portofolio. In this task, I’ll use retail.csv dataset.

1.2 Description

retail.csv will show a dataset of a retail company, which run business in Furniture, Office Suplies and Technology Category and has different Segment of customer such as Consumer, Coporate and Home Office.

2 Libraries

In this case, here are the packages that I use:

library(dplyr)
library(lubridate)
library(ggplot2)
library(zoo)

3 Data Inspection

3.1 Read the Data and Create a Data Frame

First, we have to read the csv file first.

retail <- read.csv("data_input/retail.csv")
retail

The data contains of 9,994 rows and 15 columns.

There are also another way to check the dimension just in case we don’t want to print out all dataframe.

dim(retail)
## [1] 9994   15

See? It shows the same result.

3.2 Checking Columns in Data Frame

Let’s see what columns did they have in the file.

names(retail)
##  [1] "Row.ID"       "Order.ID"     "Order.Date"   "Ship.Date"    "Ship.Mode"   
##  [6] "Customer.ID"  "Segment"      "Product.ID"   "Category"     "Sub.Category"
## [11] "Product.Name" "Sales"        "Quantity"     "Discount"     "Profit"

And it shows 15 column names which same when we called out for the number of columns we have.

3.3 Transforming Data Types

Before we process the data, we need to know whether the data we have is already in appropriate datatypes.

str(retail)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : Factor w/ 5009 levels "CA-2014-100006",..: 2501 2501 2297 4373 4373 202 202 202 202 202 ...
##  $ Order.Date  : Factor w/ 1237 levels "1/1/17","1/10/14",..: 305 305 836 94 94 922 922 922 922 922 ...
##  $ Ship.Date   : Factor w/ 1334 levels "1/1/15","1/1/16",..: 220 220 907 129 129 897 897 897 897 897 ...
##  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
##  $ Customer.ID : Factor w/ 793 levels "AA-10315","AA-10375",..: 144 144 240 706 706 89 89 89 89 89 ...
##  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Product.ID  : Factor w/ 1862 levels "FUR-BO-10000112",..: 13 56 947 320 1317 186 563 1762 795 438 ...
##  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
##  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
##  $ Product.Name: Factor w/ 1850 levels "\"While you Were Out\" Message Book, One Form per Page",..: 387 833 1440 367 574 570 1137 1099 535 295 ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

From the result above, we find that some variables have inappropriate datatypes.

Now, let’s just do the coertion to put the data in the right datatypes, so we could analyze it easily.

retail[  , c("Order.ID","Customer.ID","Product.ID","Product.Name")] <- lapply(retail[  ,  c("Order.ID","Customer.ID","Product.ID","Product.Name")], as.character)
retail$Order.Date <-  mdy(retail$Order.Date)
retail$Ship.Date <-  mdy(retail$Ship.Date)

str(retail)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
##  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
##  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
##  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

In this case, I changed the Order.ID type from Factor to Character (chr), because it indicates a unique code in every transactions. While Product.ID and Product.Name are also unique codes which depending on each other. See the example below. I just subsetted it into top 5 data.

head(retail[,c(8,11)],5)

For Order.Date and Ship.Date, the original data has a different type of date format DD/MM/YY while the original format in R is YY/MM/DD. With Date type, R will read the data with their Date format, so we could easily process the data in Date base.

Now let’s re-print our data with the appropriate data type, just to make sure. We don’t have to print out all of the data, in this case, I’ll just show the 5 bottom data.

tail(retail,5)

Now, we can see that the date type is already changed into R date format.

3.4 Missing Value Checking

Now, we’d better check whether the data is having missing values or not.

colSums(is.na(retail))
##       Row.ID     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID 
##            0            0            0            0            0            0 
##      Segment   Product.ID     Category Sub.Category Product.Name        Sales 
##            0            0            0            0            0            0 
##     Quantity     Discount       Profit 
##            0            0            0

As we can see, There’s no any missing value in every columnn.

Now let’s check if there were any duplications in this data.

retail[ duplicated(retail) == F, ]

The result shows the same number of rows which indicate that there was no duplication.

4 Data Analysis

Before we analyze the data, we can summarize the data we have.

summary(retail)
##      Row.ID       Order.ID           Order.Date           Ship.Date         
##  Min.   :   1   Length:9994        Min.   :2014-01-03   Min.   :2014-01-07  
##  1st Qu.:2499   Class :character   1st Qu.:2015-05-23   1st Qu.:2015-05-27  
##  Median :4998   Mode  :character   Median :2016-06-26   Median :2016-06-29  
##  Mean   :4998                      Mean   :2016-04-30   Mean   :2016-05-03  
##  3rd Qu.:7496                      3rd Qu.:2017-05-14   3rd Qu.:2017-05-18  
##  Max.   :9994                      Max.   :2017-12-30   Max.   :2018-01-05  
##                                                                             
##           Ship.Mode    Customer.ID               Segment      Product.ID       
##  First Class   :1538   Length:9994        Consumer   :5191   Length:9994       
##  Same Day      : 543   Class :character   Corporate  :3020   Class :character  
##  Second Class  :1945   Mode  :character   Home Office:1783   Mode  :character  
##  Standard Class:5968                                                           
##                                                                                
##                                                                                
##                                                                                
##             Category         Sub.Category  Product.Name      
##  Furniture      :2121   Binders    :1523   Length:9994       
##  Office Supplies:6026   Paper      :1370   Class :character  
##  Technology     :1847   Furnishings: 957   Mode  :character  
##                         Phones     : 889                     
##                         Storage    : 846                     
##                         Art        : 796                     
##                         (Other)    :3613                     
##      Sales              Quantity        Discount          Profit         
##  Min.   :    0.444   Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
##  1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
##  Median :   54.490   Median : 3.00   Median :0.2000   Median :    8.666  
##  Mean   :  229.858   Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
##  3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
##  Max.   :22638.480   Max.   :14.00   Max.   :0.8000   Max.   : 8399.976  
## 

Summaries: The total data in the dataframe are 9,994 data. The store first transaction was on January 1, 2014 and the data contains of 4 years transactions from 2014 - 2017. Standar Class was the most favorite shipment method choose by customers. The biggest segment for the customer was Consumer. The most bought category was Office Supplies which related to Binders as its Sub Category. As we can see, The minimum Sales was 0.44 and the maximum was 22,638.48. The store applied some Discounts for their goods. And for the store profits, the maximum value was in 8,399.98 and the minimun was 8,399.97

Now, let’s go to the data analysis

We’d like to know how many customers did th store have in 4 years.

length(unique(retail$Customer.ID))
## [1] 793

Turned out, from the total of 9,994 transactions, the store had 793 loyal customers within 4 years. This means that most of the customers did a repeat order from the store.

Let’s find out which buyer type did the most transactions.

round(prop.table(table(retail$Segment))*100,2)
## 
##    Consumer   Corporate Home Office 
##       51.94       30.22       17.84
barplot(xtabs(formula = Profit ~ Segment, data = retail))

From the total of 793 customers, as we can see from the cross tab and the bar chart, more than 50% of the Segment came from Consumer.

Now, let’s check if the proportion and barchart result matched the criteria in contributing biggest number of Sales and Profit.

aggregate.data.frame(list(Quantity = retail$Quantity, Sales = retail$Sales, Profit = retail$Profit), 
                     by = list(Segment = retail$Segment), 
                     FUN = sum)

From the result, Consumer Segment contributed the biggest Sales and Profit amount.

Now, let’s check the most popular Ship Mode chosed by each Segment. First, we can show the cross tab for Ship Mode and Segment from the Total Sales number, which include the information of shipment mode chose by the buyer.

xtabs(formula = Sales ~ Ship.Mode + Segment, data = retail)
##                 Segment
## Ship.Mode         Consumer Corporate Home Office
##   First Class    159168.96 105858.47    86400.99
##   Same Day        60596.36  45121.32    22645.44
##   Second Class   231498.95 146126.04    81568.58
##   Standard Class 710137.07 409040.54   239038.14

If you think that the table is too complicated to see, it is because crosstab doesn’t give dataframe result. Here is a simpler way to show just the maximum number using dplyr package. Let’s find if we get the same result.

retail %>% 
  group_by(Ship.Mode, Segment) %>% 
  summarise(Sales = sum(Sales)) %>% 
  ungroup() %>% 
  filter(Sales == max(Sales))

And we get the same result, which shows that Standard Shipment was the most favorite method, just like what the summary(retail) shows, and the most Segment that chose this shipment method was Consumer. This also can relate that the biggest contribution for Sales was from Consumer Segment.

Now. let’s check the graph from the heatmap.

heatmap(xtabs(formula = Sales ~ Ship.Mode + Segment, data = retail), Rowv = NA, Colv = NA, cexCol = 0.8, scale = "column")

The graphs defines the same result like cross tab function we used above, showing that Standard Class method was the most chosen by all Segment, while the Second Class method was the least chosen.

Now, let’s compare the Ship.Mode with Category based on Profit.

shipment <- retail %>% 
  mutate(Ship.Mode = as.factor(Ship.Mode),
         Category = as.factor(Category)) %>% 
  group_by(Ship.Mode, Category) %>% 
  summarise(Profit = sum(Profit))
shipment

From the data we have, the result shows that Standard Class was the most popular Shipment Method chose by Segment even compared with Category. Now, here is the graph to show the relation.

ggplot(shipment, aes(x = Profit, y = Ship.Mode, fill = Category)) +
  geom_col(position = "dodge") + labs(x = "Profit", y ="Ship Mode")

See? Standard Class chart shows almost the maximum amount for the Profit.

Let’s check how many Products was the store sell.

length(unique(retail$Product.Name))
## [1] 1850

There were 1,850 items sold in the store from 3 Category

Now let’s check the most Category that have the biggest average both in Sales adn Profit.

aggregate.data.frame(list(Avg.Sales = retail$Sales, Avg.Profit = retail$Profit), 
                     by = list(Category = retail$Category), 
                     FUN = mean)

This shows us that Technology gave the biggest Sales and Profit number in 4 years, but according to the summary we have, the most bought Category wasn’t Technology. Now, let’s find out what was the most Category bought in 4 years.

aggregate(Quantity ~ Category, data = retail, FUN = sum)

And the most items bought was Office Supplies.

We can conclude that the price for Technology may varies for big amount because they have Machine Sub Category while for Office Supplies, they may not have a high price because a lot of people needs stationaries everyday, that’s why Office Supplies was the most sold item in the store.

Now, let’s see the relation of Category and Segment based on Quantity of the items.

xtabs(formula = Quantity ~ Category + Segment, data = retail)
##                  Segment
## Category          Consumer Corporate Home Office
##   Furniture           4166      2495        1367
##   Office Supplies    11758      7018        4130
##   Technology          3597      2095        1247

The biggest Category bought was Office Supplies and bought the most by Consumer Category.

We also can see it from the graph below.

plot(xtabs(Quantity~Segment+Category,retail))

As we can see, the graph shows the same result with the cross tab, which shows the biggest area of the rectangle in the Office Supplies and Consumer area.

This means, the biggest market for the Office Supplies was Consumer, could be conclude that Consumer use a lot of Office Supplies in their daily activities, could be for work or school activities.

Now, let’s find the Average Price for every Product based on Product.Name, and find the maximum and the minimum number.

Avg.Price <- aggregate.data.frame(list(Sales = retail$Sales, Quantity = retail$Quantity), 
                     by = list(Category = retail$Category, Product.Name = retail$Product.Name), 
                     FUN = sum)
Avg.Price$Avg.Price <- round(Avg.Price$Sales/Avg.Price$Quantity,2)
Avg.Price[Avg.Price$Avg.Price == max(Avg.Price$Avg.Price)|Avg.Price$Avg.Price == min(Avg.Price$Avg.Price)|Avg.Price$Sales == max(Avg.Price$Sales)|Avg.Price$Sales == min(Avg.Price$Sales), ]

The result shows that the lowest Average Price and Sales was from Office Supplies while the highest Average Price and Sales was from Technology. This result supports the previous data that shows the Selling Price for some Technology items were really high while the Selling Price for Ofice Supplies was relatively affordable. The scatter plot below can be used as a support to read the data distribution

ggplot(data = Avg.Price, aes (x = Sales, y = Avg.Price, col = Category )) + geom_point()

We can conclude that Office Supplies items wasn’t very expensive because it was regularly bought, while Technology which has Machine Sub.Category, the customer was rarely bought them. Besides, they were more pricey and once the customer have one machine, they won’t repurchase it.

We can check what kind of Sub.Category the store has.

unique(retail$Sub.Category)
##  [1] Bookcases   Chairs      Labels      Tables      Storage     Furnishings
##  [7] Art         Phones      Binders     Appliances  Paper       Accessories
## [13] Envelopes   Fasteners   Supplies    Machines    Copiers    
## 17 Levels: Accessories Appliances Art Binders Bookcases Chairs ... Tables

And it shows that the store had 17 Sub.Category items.

Now, let’s check which Sub.Category was the top 5 popular from Office Supplies Category.

qty <- (retail[retail$Category == "Office Supplies",])
qty <- aggregate(Quantity ~ Category + Sub.Category, data = retail, FUN = sum)
head(qty[order(qty$Quantity, decreasing = T),],5)

Binder was the most popular item bought. Here’s the graph to see the ranks for the Selling items.

ggplot(data = qty, aes(x = Quantity, y = reorder(Sub.Category, Quantity), fill = Category)) + geom_col() +
  labs(x = "Quantity", y = "Sub Category")

We are about to check if there were any items that don’t have profit at all with their original price and how many were the products had sold.

disc0 <- (retail[retail$Discount == 0 & retail$Profit == 0,])
profit0 <- aggregate(formula = Sales ~ Category + Product.Name, data = disc0, FUN = length)
profit0

There were 11 items that produced 0 profit. The total Sales for each product weren’t much, around 1-4 items sold.

The store has 2 options, whether to take the items out from their list or mark up the price for those items in the future.

I’m about to show the Product that gave the maximum and minumum Profit by their Category.

profit <- retail %>% 
  mutate(Category = as.factor(Category),
         Sub.Category = as.factor(Sub.Category),
         Product.Name = as.factor(Product.Name)) %>% 
  group_by(Category, Sub.Category, Product.Name) %>% 
  summarise(Sales = sum(Sales),
            Profit = sum(Profit))
profit[profit$Profit == max(profit$Profit)|profit$Profit == min(profit$Profit), ]

The maximum and minimum Profit was from Technology Category but came from different Sub.Category

Now, let’s visualise it with boxplot to see the outlier for the big gap

ggplot(profit, aes(y = Profit, x = Category)) + geom_boxplot()

From the dataframe and boxplot above, we get outliers data for maximum and minimum Profit. The higest profit was 25199.93 and the lowest profit was -8879.97 and it came from the same Category, which was Technology however, the items have different Sub.Category. The product that had the maximum was Copier and the minimun was Machine.

We can find out the outlier from variance and standard deviaion number.

aggregate(formula = Profit~Category, data = profit, FUN = var)
aggregate(formula = Profit ~ Category, data = profit, FUN = sd)

From the variance and standard deviation amount, we can conclude that Technology has a huge range of amount, so, there’s no doubt for any outliers in the data.

Now let’s see the Sales and Profit for Category side

aggregate.data.frame(list(Sales = profit$Sales, Profit = profit$Profit), 
                     by = list(Category = profit$Category), 
                     FUN = sum)

Technology had the highest Sales and Profit in 4 years.

We can compare the Sales and Profit based on Category to see the distribution of the data. First, let’s subset the minimum and maximum number for both Sales and Profit

profit[profit$Profit == max(profit$Profit)|profit$Profit == min(profit$Profit)|profit$Sales == max(profit$Sales)|profit$Sales == min(profit$Sales), ]

Turns out, the maximum Sales was the maximum Profit, while the minimum number for both Sales and Profit had their own number. To simplify the data, we can extract it into a scatter plot below.

ggplot(data = profit, aes (x = Profit, y = Sales, col = Category )) + geom_point() + geom_smooth()

I organize the total of 1,850 Products with the Total Quantity, Total Sales, Total Profit and Total Discount during the business for 4 years.

retail %>% 
  mutate(Category = as.factor(Category),
         Product.Name = as.factor(Product.Name)) %>% 
  group_by(Category, Product.Name) %>% 
  summarise(Total_Qty = sum(Quantity),
            Total_Sales = sum(Sales),
            Total_Profit = sum(Profit),
            Total_Disc = sum(Discount))

Let’s categorize the Date into Period in Month and Year, and see relation between Sales, Profit, Quantity, and Discount within 4 years.

period <- retail %>% 
  mutate(Order.Date = floor_date(Order.Date, unit = "month"),
         Period = as.yearmon(Order.Date)) %>% 
  group_by(Period) %>% 
  summarise(Total.Profit = sum(Profit),
            Total.Sales = sum(Sales),
            Total.Qty = sum(Quantity),
            Total.Disc = sum(Discount))
period

We already categorized the them into 48 Period of time started from January 2014 to December 2017

Now let’s find out the highest Period for Total.Profit, Total.Sales, Total.Qty, and Total.Disc in 48 months.

period[period$Total.Profit == max(period$Total.Profit),]
period[period$Total.Sales == max(period$Total.Sales),]
period[period$Total.Qty == max(period$Total.Qty),]
period[period$Total.Disc == max(period$Total.Disc),]

A lot of work to do? Don’t worry, let’s just summarize it into a simple result.

period[period$Total.Sales == max(period$Total.Sales) | period$Total.Profit == max(period$Total.Profit) | period$Total.Qty == max(period$Total.Qty) | period$Total.Disc == max(period$Total.Disc),]

Both of the method shows the same result, the most peaked Period for Total.Profit was on Dec 2016 while for Total.Sales, Total.Qty, and Total.Disc was on November 2017

ggplot(data = period, mapping = aes(x = Period)) +
  geom_line(aes(y = Total.Profit), col = "red") +
  geom_line(aes(y = Total.Sales), col = "blue") +
  geom_line(aes(y = Total.Qty), col = "green") +
  geom_line(aes(y = Total.Disc), col ="yellow") +
  labs(x = "Period", y = "Total")

Both of the method shows the same result, the most peaked Period for Total.Profit was on Dec 2016 while for Total.Sales, Total.Qty, and Total.Disc was on November 2017, just as the same as the dataframe result we have. Within the year of 2017, we can see that the Total.Sales decreased dramatically while the Total.Profit for those Period decreased just a bit, lower than the previous Year. From the table we have, we can conclude that the increasing in Sales could be affected with major Discount given by the store, could be done to sell out all stocks the store have or maybe because the products was slow-moving-items. Or the store was targeted on a highest Sales with stagnant Profit.

That’s all for my P4DS and PS LBB, hope you enjoy to read it.

See you :)