This project was created as an extra task for Algoritma students' portfolios. In this task, I'll use the retail.csv dataset.
retail.csv contains data from a retail company that runs its business in the Furniture, Office Supplies, and Technology Category and serves different customer Segment levels: Consumer, Corporate, and Home Office.
In this case, here are the packages that I use:
library(dplyr)     # data wrangling verbs (group_by, summarise, etc.)
library(lubridate) # date parsing (mdy()) and rounding (floor_date())
library(ggplot2)   # plotting
library(zoo)       # as.yearmon() for monthly periods
First, we have to read the CSV file.
retail <- read.csv("data_input/retail.csv")
retail
The data contains 9,994 rows and 15 columns.
There is also another way to check the dimensions, in case we don't want to print out the whole dataframe.
dim(retail)
## [1] 9994 15
See? It shows the same result.
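If we only need one of the two dimensions, nrow() and ncol() return them separately.
nrow(retail) # number of rows
ncol(retail) # number of columns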
Let's see what columns the file has.
names(retail)
## [1] "Row.ID" "Order.ID" "Order.Date" "Ship.Date" "Ship.Mode"
## [6] "Customer.ID" "Segment" "Product.ID" "Category" "Sub.Category"
## [11] "Product.Name" "Sales" "Quantity" "Discount" "Profit"
And it shows 15 column names, which matches the number of columns we found above.
Before we process the data, we need to check whether it is already stored in the appropriate data types.
str(retail)
## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : Factor w/ 5009 levels "CA-2014-100006",..: 2501 2501 2297 4373 4373 202 202 202 202 202 ...
## $ Order.Date : Factor w/ 1237 levels "1/1/17","1/10/14",..: 305 305 836 94 94 922 922 922 922 922 ...
## $ Ship.Date : Factor w/ 1334 levels "1/1/15","1/1/16",..: 220 220 907 129 129 897 897 897 897 897 ...
## $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
## $ Customer.ID : Factor w/ 793 levels "AA-10315","AA-10375",..: 144 144 240 706 706 89 89 89 89 89 ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Product.ID : Factor w/ 1862 levels "FUR-BO-10000112",..: 13 56 947 320 1317 186 563 1762 795 438 ...
## $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
## $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
## $ Product.Name: Factor w/ 1850 levels "\"While you Were Out\" Message Book, One Form per Page",..: 387 833 1440 367 574 570 1137 1099 535 295 ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
From the result above, we find that some variables have inappropriate data types.
Now, let's coerce them to the right data types so we can analyze the data more easily.
retail[ , c("Order.ID","Customer.ID","Product.ID","Product.Name")] <- lapply(retail[ , c("Order.ID","Customer.ID","Product.ID","Product.Name")], as.character)
retail$Order.Date <- mdy(retail$Order.Date)
retail$Ship.Date <- mdy(retail$Ship.Date)
str(retail)
## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
## $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
## $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
## $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
In this case, I changed Order.ID from Factor to Character (chr) because it is a unique code for each transaction, while Product.ID and Product.Name are also unique codes that depend on each other. See the example below; I subsetted it to the top 5 rows.
head(retail[,c(8,11)],5)
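As a quick sanity check on that relationship, we can also compare the number of distinct values in each column. From the str() output above there are 1,862 Product.ID levels but only 1,850 Product.Name levels, so a handful of product names are shared by more than one ID.
length(unique(retail$Product.ID))   # distinct product codes
length(unique(retail$Product.Name)) # distinct product names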
For Order.Date and Ship.Date, the original data uses the MM/DD/YY date format, while R's standard Date format is YYYY-MM-DD. Converting them with mdy() stores them as proper Date values, so we can easily process the data on a date basis.
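As a small illustration, using the first order date in the data, lubridate's mdy() turns a month/day/year string into a Date object.
mdy("11/8/16") # parsed as the Date 2016-11-08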
Now let's re-print our data with the appropriate data types, just to make sure. We don't have to print everything; in this case, I'll just show the bottom 5 rows.
tail(retail,5)
Now we can see that the dates have been converted into R's Date format.
Next, we'd better check whether the data has any missing values.
colSums(is.na(retail))
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 0 0 0 0 0 0
## Segment Product.ID Category Sub.Category Product.Name Sales
## 0 0 0 0 0 0
## Quantity Discount Profit
## 0 0 0
As we can see, there are no missing values in any column.
Now let's check whether there are any duplicated rows in this data.
retail[!duplicated(retail), ] # keep only the non-duplicated rows
The result shows the same number of rows, which indicates that there were no duplicates.
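Printing the whole deduplicated dataframe produces a lot of output; a more compact check (just a quick sketch) counts the duplicated rows directly, which should return 0 if nothing is duplicated.
sum(duplicated(retail)) # 0 means there are no fully duplicated rows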
Before we analyze the data, let's summarize it.
summary(retail)
## Row.ID Order.ID Order.Date Ship.Date
## Min. : 1 Length:9994 Min. :2014-01-03 Min. :2014-01-07
## 1st Qu.:2499 Class :character 1st Qu.:2015-05-23 1st Qu.:2015-05-27
## Median :4998 Mode :character Median :2016-06-26 Median :2016-06-29
## Mean :4998 Mean :2016-04-30 Mean :2016-05-03
## 3rd Qu.:7496 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
## Max. :9994 Max. :2017-12-30 Max. :2018-01-05
##
## Ship.Mode Customer.ID Segment Product.ID
## First Class :1538 Length:9994 Consumer :5191 Length:9994
## Same Day : 543 Class :character Corporate :3020 Class :character
## Second Class :1945 Mode :character Home Office:1783 Mode :character
## Standard Class:5968
##
##
##
## Category Sub.Category Product.Name
## Furniture :2121 Binders :1523 Length:9994
## Office Supplies:6026 Paper :1370 Class :character
## Technology :1847 Furnishings: 957 Mode :character
## Phones : 889
## Storage : 846
## Art : 796
## (Other) :3613
## Sales Quantity Discount Profit
## Min. : 0.444 Min. : 1.00 Min. :0.0000 Min. :-6599.978
## 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
## Median : 54.490 Median : 3.00 Median :0.2000 Median : 8.666
## Mean : 229.858 Mean : 3.79 Mean :0.1562 Mean : 28.657
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
## Max. :22638.480 Max. :14.00 Max. :0.8000 Max. : 8399.976
##
Summary: the dataframe contains 9,994 records. The store's first transaction was on January 3, 2014, and the data covers 4 years of transactions, 2014 to 2017. Standard Class was the shipment method chosen most often by customers. The biggest customer Segment was Consumer. The most purchased Category was Office Supplies, with Binders as its largest Sub Category. The minimum Sales value was 0.44 and the maximum was 22,638.48. The store applied discounts to some of its goods. As for Profit, the maximum was 8,399.98 and the minimum was -6,599.98.
Now, let's move on to the data analysis.
We'd like to know how many customers the store had over the 4 years.
length(unique(retail$Customer.ID))
## [1] 793
It turns out that, across the 9,994 transaction records, the store had 793 distinct customers within 4 years. This means that most customers placed repeat orders with the store.
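To get a feel for how often customers came back, here is a small dplyr sketch (the exact output isn't shown here) that counts the distinct orders per customer and lists the most frequent buyers.
retail %>%
  group_by(Customer.ID) %>%
  summarise(Orders = n_distinct(Order.ID)) %>% # distinct orders per customer
  arrange(desc(Orders)) %>%
  head(5)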
Let's find out which buyer type made the most transactions.
round(prop.table(table(retail$Segment))*100,2)
##
## Consumer Corporate Home Office
## 51.94 30.22 17.84
barplot(xtabs(formula = Profit ~ Segment, data = retail))
As we can see from the proportion table, more than 50% of the transactions came from the Consumer Segment, and the bar chart of total Profit by Segment points in the same direction.
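Note that the barplot above is weighted by Profit; if we want a chart that matches the proportion table directly, a count-based version is a simple alternative.
barplot(table(retail$Segment)) # number of transactions per Segment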
Now, let's check whether the proportion and bar chart results also hold when it comes to contributing the biggest Sales and Profit.
aggregate.data.frame(list(Quantity = retail$Quantity, Sales = retail$Sales, Profit = retail$Profit),
by = list(Segment = retail$Segment),
FUN = sum)
From the result, the Consumer Segment contributed the biggest Sales and Profit amounts.
Now, let's check the most popular Ship Mode chosen by each Segment. First, we can show a cross tab of Ship Mode and Segment based on total Sales, which includes the information on the shipment mode chosen by the buyers.
xtabs(formula = Sales ~ Ship.Mode + Segment, data = retail)
## Segment
## Ship.Mode Consumer Corporate Home Office
## First Class 159168.96 105858.47 86400.99
## Same Day 60596.36 45121.32 22645.44
## Second Class 231498.95 146126.04 81568.58
## Standard Class 710137.07 409040.54 239038.14
If the table looks hard to read, that's because xtabs() doesn't return a dataframe. Here is a simpler way to show just the maximum value using the dplyr package. Let's see if we get the same result.
retail %>%
group_by(Ship.Mode, Segment) %>%
summarise(Sales = sum(Sales)) %>%
ungroup() %>%
filter(Sales == max(Sales))
And we get the same result: Standard Class was the most popular method, just as summary(retail) showed, and the Segment that chose this shipment method the most was Consumer. This also matches the fact that the biggest Sales contribution came from the Consumer Segment.
Now, let's check the heatmap.
heatmap(xtabs(formula = Sales ~ Ship.Mode + Segment, data = retail), Rowv = NA, Colv = NA, cexCol = 0.8, scale = "column")
The graph shows the same result as the cross tab above: Standard Class was the most chosen method across every Segment, while Same Day contributed the smallest Sales.
Now, let's compare Ship.Mode with Category based on Profit.
shipment <- retail %>%
mutate(Ship.Mode = as.factor(Ship.Mode),
Category = as.factor(Category)) %>%
group_by(Ship.Mode, Category) %>%
summarise(Profit = sum(Profit))
shipment
From the data we have, the result shows that Standard Class was also the most profitable shipment method within every Category. Now, here is a graph to show the relation.
ggplot(shipment, aes(x = Profit, y = Ship.Mode, fill = Category)) +
geom_col(position = "dodge") + labs(x = "Profit", y ="Ship Mode")
See? The Standard Class bars show the largest Profit for every Category.
Let's check how many products the store sold.
length(unique(retail$Product.Name))
## [1] 1850
There were 1,850 distinct items sold in the store across the 3 Category levels.
Now let's check which Category had the biggest average in both Sales and Profit.
aggregate.data.frame(list(Avg.Sales = retail$Sales, Avg.Profit = retail$Profit),
by = list(Category = retail$Category),
FUN = mean)
This shows us that Technology gave the biggest average Sales and Profit over the 4 years, but according to our summary, the most purchased Category wasn't Technology. Now, let's find out which Category was bought the most over the 4 years.
aggregate(Quantity ~ Category, data = retail, FUN = sum)
And the most purchased Category was Office Supplies.
We can conclude that Technology prices vary widely, partly because of the Machines Sub Category, while Office Supplies items tend not to be expensive because a lot of people need stationery every day; that's why Office Supplies was the most sold Category in the store.
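To back up that intuition, here is a rough sketch that treats total Sales divided by total Quantity as an approximate unit price and ranks the Sub.Category levels by it; we would expect Machines and Copiers to sit near the expensive end.
# Approximate unit price (total Sales / total Quantity) per Sub.Category
unit_price <- retail %>%
  group_by(Category, Sub.Category) %>%
  summarise(Unit.Price = sum(Sales) / sum(Quantity)) %>%
  arrange(desc(Unit.Price))
head(unit_price, 5)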
Now, let’s see the relation of Category
and Segment
based on Quantity
of the items.
xtabs(formula = Quantity ~ Category + Segment, data = retail)
## Segment
## Category Consumer Corporate Home Office
## Furniture 4166 2495 1367
## Office Supplies 11758 7018 4130
## Technology 3597 2095 1247
The most purchased Category was Office Supplies, and it was bought the most by the Consumer Segment.
We also can see it from the graph below.
plot(xtabs(Quantity~Segment+Category,retail))
As we can see, the mosaic plot shows the same result as the cross tab: the biggest rectangle sits at the intersection of Office Supplies and Consumer.
This means the biggest market for Office Supplies was the Consumer Segment; we could conclude that Consumers use a lot of Office Supplies in their daily activities, whether for work or for school.
Now, let's find the Average Price for every product based on Product.Name, and find the maximum and minimum values.
Avg.Price <- aggregate.data.frame(list(Sales = retail$Sales, Quantity = retail$Quantity),
by = list(Category = retail$Category, Product.Name = retail$Product.Name),
FUN = sum)
Avg.Price$Avg.Price <- round(Avg.Price$Sales/Avg.Price$Quantity,2)
Avg.Price[Avg.Price$Avg.Price == max(Avg.Price$Avg.Price)|Avg.Price$Avg.Price == min(Avg.Price$Avg.Price)|Avg.Price$Sales == max(Avg.Price$Sales)|Avg.Price$Sales == min(Avg.Price$Sales), ]
The result shows that the lowest Average Price and Sales came from Office Supplies, while the highest Average Price and Sales came from Technology. This supports the earlier finding that the selling prices of some Technology items are very high, while the selling prices of Office Supplies items are relatively affordable. The scatter plot below can help us read the data distribution.
ggplot(data = Avg.Price, aes (x = Sales, y = Avg.Price, col = Category )) + geom_point()
We can conclude that Office Supplies items weren't very expensive and were bought regularly, while Technology items, particularly the Machines Sub.Category, were bought rarely. Besides being pricier, once a customer owns a machine, they are unlikely to repurchase it.
We can check what kind of Sub.Category the store has.
unique(retail$Sub.Category)
## [1] Bookcases Chairs Labels Tables Storage Furnishings
## [7] Art Phones Binders Appliances Paper Accessories
## [13] Envelopes Fasteners Supplies Machines Copiers
## 17 Levels: Accessories Appliances Art Binders Bookcases Chairs ... Tables
And it shows that the store had 17 Sub.Category levels.
Now, let's check which Sub.Category levels were the top 5 most purchased, and which Category each belongs to.
qty <- aggregate(Quantity ~ Category + Sub.Category, data = retail, FUN = sum)
head(qty[order(qty$Quantity, decreasing = T), ], 5)
Binders was the most popular Sub.Category. Here's a graph showing the ranking of the selling Sub.Category levels.
ggplot(data = qty, aes(x = Quantity, y = reorder(Sub.Category, Quantity), fill = Category)) + geom_col() +
labs(x = "Quantity", y = "Sub Category")
Next, let's check whether any items produced no profit at all even with no discount applied, and how many times each of those products was sold (the Sales column in the result below is just a transaction count, since we aggregate with FUN = length).
disc0 <- (retail[retail$Discount == 0 & retail$Profit == 0,])
profit0 <- aggregate(formula = Sales ~ Category + Product.Name, data = disc0, FUN = length)
profit0
There were 11 items that produced zero profit, each appearing in only about 1-4 transactions.
The store has 2 options: take these items off its list, or mark up their prices in the future.
Next, I'll show the products that gave the maximum and minimum Profit by Category.
profit <- retail %>%
mutate(Category = as.factor(Category),
Sub.Category = as.factor(Sub.Category),
Product.Name = as.factor(Product.Name)) %>%
group_by(Category, Sub.Category, Product.Name) %>%
summarise(Sales = sum(Sales),
Profit = sum(Profit))
profit[profit$Profit == max(profit$Profit)|profit$Profit == min(profit$Profit), ]
Both the maximum and minimum Profit came from the Technology Category, but from different Sub.Category levels.
Now, let's visualise it with a boxplot to see the outliers behind this big gap.
ggplot(profit, aes(y = Profit, x = Category)) + geom_boxplot()
From the dataframe and boxplot above, we can see outliers at both the maximum and minimum Profit. The highest profit was 25,199.93 and the lowest was -8,879.97, and both came from the same Category, Technology; however, the items belong to different Sub.Category levels. The product with the maximum profit was a Copier and the one with the minimum was a Machine.
We can also look for outliers using the variance and standard deviation values.
aggregate(formula = Profit~Category, data = profit, FUN = var)
aggregate(formula = Profit ~ Category, data = profit, FUN = sd)
From the variance and standard deviation values, we can conclude that Technology has a very wide spread of Profit, so it is no surprise that it contains outliers.
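Another way to flag those outliers is the usual 1.5 * IQR rule that the boxplot itself uses; here is a quick sketch applied to the Technology rows of our per-product profit table.
# Products whose total Profit falls outside 1.5 * IQR of the Technology distribution
tech <- profit$Profit[profit$Category == "Technology"]
bounds <- quantile(tech, c(0.25, 0.75)) + c(-1.5, 1.5) * IQR(tech)
profit[profit$Category == "Technology" &
         (profit$Profit < bounds[1] | profit$Profit > bounds[2]), ]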
Now let's see the total Sales and Profit by Category.
aggregate.data.frame(list(Sales = profit$Sales, Profit = profit$Profit),
by = list(Category = profit$Category),
FUN = sum)
Technology had the highest Sales and Profit over the 4 years.
We can compare Sales and Profit by Category to see the distribution of the data. First, let's subset the minimum and maximum values of both Sales and Profit.
profit[profit$Profit == max(profit$Profit)|profit$Profit == min(profit$Profit)|profit$Sales == max(profit$Sales)|profit$Sales == min(profit$Sales), ]
It turns out that the product with the maximum Sales is also the one with the maximum Profit, while the minimum Sales and the minimum Profit belong to different products. To make this easier to read, we can draw the scatter plot below.
ggplot(data = profit, aes (x = Profit, y = Sales, col = Category )) + geom_point() + geom_smooth()
Next, I organize all 1,850 products with their Total Quantity, Total Sales, Total Profit, and Total Discount over the 4 years of business.
retail %>%
mutate(Category = as.factor(Category),
Product.Name = as.factor(Product.Name)) %>%
group_by(Category, Product.Name) %>%
summarise(Total_Qty = sum(Quantity),
Total_Sales = sum(Sales),
Total_Profit = sum(Profit),
Total_Disc = sum(Discount))
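To surface the best sellers from that table, the same pipeline can simply be sorted by Total_Sales; this sketch keeps only the top 5 rows.
retail %>%
  group_by(Category, Product.Name) %>%
  summarise(Total_Qty = sum(Quantity),
            Total_Sales = sum(Sales),
            Total_Profit = sum(Profit)) %>%
  ungroup() %>%
  arrange(desc(Total_Sales)) %>% # biggest sellers first
  head(5)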
Let's group the dates into a monthly Period (month and year) and look at the relation between Sales, Profit, Quantity, and Discount over the 4 years.
period <- retail %>%
mutate(Order.Date = floor_date(Order.Date, unit = "month"),
Period = as.yearmon(Order.Date)) %>%
group_by(Period) %>%
summarise(Total.Profit = sum(Profit),
Total.Sales = sum(Sales),
Total.Qty = sum(Quantity),
Total.Disc = sum(Discount))
period
We have now grouped the data into 48 Period values, running from January 2014 to December 2017.
Now let's find the peak Period for Total.Profit, Total.Sales, Total.Qty, and Total.Disc within those 48 months.
period[period$Total.Profit == max(period$Total.Profit),]
period[period$Total.Sales == max(period$Total.Sales),]
period[period$Total.Qty == max(period$Total.Qty),]
period[period$Total.Disc == max(period$Total.Disc),]
A lot of work to do? Don’t worry, let’s just summarize it into a simple result.
period[period$Total.Sales == max(period$Total.Sales) | period$Total.Profit == max(period$Total.Profit) | period$Total.Qty == max(period$Total.Qty) | period$Total.Disc == max(period$Total.Disc),]
Both methods show the same result: the peak Period for Total.Profit was December 2016, while for Total.Sales, Total.Qty, and Total.Disc it was November 2017.
ggplot(data = period, mapping = aes(x = Period)) +
geom_line(aes(y = Total.Profit), col = "red") +
geom_line(aes(y = Total.Sales), col = "blue") +
geom_line(aes(y = Total.Qty), col = "green") +
geom_line(aes(y = Total.Disc), col ="yellow") +
labs(x = "Period", y = "Total")
The line chart confirms the same pattern as the dataframe result: the peak Period for Total.Profit was December 2016, while for Total.Sales, Total.Qty, and Total.Disc it was November 2017. Within 2017, we can see that Total.Sales dropped sharply at times while Total.Profit for those Period values decreased only slightly, staying a bit below the previous year. From the table, we can conclude that the increases in Sales could have been driven by the large Discount the store gave, possibly to sell out its stock or because some products were slow-moving items, or because the store was aiming for the highest possible Sales while keeping Profit roughly stagnant.
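One caveat with the chart above is that Total.Sales is orders of magnitude larger than Total.Disc, so the smaller lines get squashed against the x-axis. A faceted version (a sketch, assuming the tidyr package is available for pivot_longer()) gives each metric its own panel and its own y-scale.
library(tidyr) # assumed available; only used here for pivot_longer()
period %>%
  pivot_longer(cols = -Period, names_to = "Metric", values_to = "Total") %>%
  ggplot(aes(x = Period, y = Total)) +
  geom_line() +
  facet_wrap(~Metric, scales = "free_y") +
  labs(x = "Period", y = "Total")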
That's all for my P4DS and PS LBB; I hope you enjoy reading it.
See you :)