Hi My Name is Ade Anggi Naluriawan Santoso. Nice to meet you! This is my second RPubs Document as part of Learn By Building Assignment at Algoritma Data Science School.
The data stored a record of a transactions of a retail store
specifying several variables. Please take a look at the following
glossary for your referrence:
- Order.ID: Id of order.
- Order.Date: Date of order.
- Ship.Date: Date of shipment.
- Ship.Mode: Type of shipment.
- Customer.ID: Id of customer.
- Segment : Customer’s segment.
- Product.ID: Id of product.
- Category : Product category with 3 levels (“Furniture”,
“Office Supplies”, “Technology”)
- Sub.Category: more specific product category
- Product.Name: Name of product that was sold.
- Sales: How much earning from each sale.
- Quantity: Quantity of item sold.
- Discount: How much discount was given for each
sale.
- Profit: How much can a company earn from each sale.
Before we start develop the analysis, let’s start with environment setup.
# clear-up the environment
rm(list = ls())
# chunk options
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>"
)
options(scipen=9999)After we done with environment setup, let’s call several important library that will be used during analysis such as ‘lubridate’, ‘ggplot2’, ‘dplyr’.
library(lubridate)
library(ggplot2)
library(dplyr)Before we generate visualization of our dataset, we need to
understand what’s inside the dataset. Let’s start with reading
retail.csv in our folder using read.csv()
function, save it into an object named retail_df, and
display first 6 rows of the data.
retail_df <- read.csv("retail.csv")
head(retail_df)we can also check the last 6 rows of the dataset using ‘tail’ function.
tail(retail_df)Based on the output above, we now that our data has dimension of 9,994 rows and 14 columns. We can also quickly check the dimension using ‘dim’ function.
dim(retail_df)#> [1] 9994 14
Next we will check the data structure and the data types. We need to make sure our columns have the correct data types so we can optimize the visualization and also the insights.
str(retail_df)#> 'data.frame': 9994 obs. of 14 variables:
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : chr "11/8/2016" "11/8/2016" "6/12/2016" "10/11/2015" ...
#> $ Ship.Date : chr "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
#> $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#> $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
#> $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Retail dataset has 10 ‘chr’, 3 ‘num’, and 1 ‘int’ data types, but we know that some columns have wrong data types. That’s why we need to do adjustment on ‘Order.Date’, ‘Ship.Date’, ‘Ship.Mode’, ‘Segment’, ‘Category’, ‘Sub.Category’ to get the correct data types.
retail_df$Order.Date<- mdy(retail_df$Order.Date)
retail_df$Ship.Date <- mdy(retail_df$Ship.Date)
retail_df$Ship.Mode <- as.factor(retail_df$Ship.Mode)
retail_df$Segment <- as.factor(retail_df$Segment)
retail_df$Category <- as.factor(retail_df$Category)
retail_df$Sub.Category <- as.factor(retail_df$Sub.Category)
str(retail_df)#> 'data.frame': 9994 obs. of 14 variables:
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
#> $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
#> $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#> $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#> $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Good, now our dataset has the correct data types for each column.
Even though the dataset already has the correct data types, we still need to check for any missing data and conduct necessary data treatment for those missing data.
anyNA(retail_df)#> [1] FALSE
colSums(is.na(retail_df))#> Order.ID Order.Date Ship.Date Ship.Mode Customer.ID Segment
#> 0 0 0 0 0 0
#> Product.ID Category Sub.Category Product.Name Sales Quantity
#> 0 0 0 0 0 0
#> Discount Profit
#> 0 0
Based on the output above, the dataset doesn’t have any missing values. That means we don’t need to do any missing value data treatment on the dataset.
We will extract ‘Month’,‘Quarter’,and ‘year’ columns based on ‘Order.Date’ to help us get insight on which timeframe the store generate most sales or profit.
retail_df$Order.Month <- month(retail_df$Order.Date,label=TRUE, abbr = FALSE)
retail_df$Order.Quarter <- quarter(retail_df$Order.Date, type='year.quarter', fiscal_start = 1)
retail_df$Order.Year <- year(retail_df$Order.Date)
retail_df$Order.Quarter <- as.factor(retail_df$Order.Quarter)
retail_df$Order.Year <- as.factor(retail_df$Order.Year)
str(retail_df)#> 'data.frame': 9994 obs. of 17 variables:
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
#> $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
#> $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#> $ Sub.Category : Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#> $ Product.Name : chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
#> $ Order.Month : Ord.factor w/ 12 levels "January"<"February"<..: 11 11 6 10 10 6 6 6 6 6 ...
#> $ Order.Quarter: Factor w/ 16 levels "2014.1","2014.2",..: 12 12 10 8 8 2 2 2 2 2 ...
#> $ Order.Year : Factor w/ 4 levels "2014","2015",..: 3 3 3 2 2 1 1 1 1 1 ...
We also need to add sales/quantity and profit/quantity columns in order to make objective comparison for each transaction at the retail store.
retail_df$SperQ <- retail_df$Sales / retail_df$Quantity
retail_df$PperQ <- retail_df$Profit / retail_df$Quantity
str(retail_df)#> 'data.frame': 9994 obs. of 19 variables:
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
#> $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
#> $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#> $ Sub.Category : Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#> $ Product.Name : chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
#> $ Order.Month : Ord.factor w/ 12 levels "January"<"February"<..: 11 11 6 10 10 6 6 6 6 6 ...
#> $ Order.Quarter: Factor w/ 16 levels "2014.1","2014.2",..: 12 12 10 8 8 2 2 2 2 2 ...
#> $ Order.Year : Factor w/ 4 levels "2014","2015",..: 3 3 3 2 2 1 1 1 1 1 ...
#> $ SperQ : num 130.98 243.98 7.31 191.52 11.18 ...
#> $ PperQ : num 20.96 73.19 3.44 -76.61 1.26 ...
The last data transformation that we need to do is add ‘Discount_Cat’ columns to categorize ‘Discount’ columns into: ‘0%’,‘0-20%’,‘>=20%’
pw <- function(x){
if(x == 0){
x <- "0%"
}else if(x > 0 & x < 0.2){
x <- "0-20%"
}else{
x <- ">=20%"
}
}
retail_df$Discount_Cat <- sapply(retail_df$Discount, pw)
retail_df$Discount_Cat <- factor(retail_df$Discount_Cat,levels=c('0%','0-20%','>=20%'))
str(retail_df)#> 'data.frame': 9994 obs. of 20 variables:
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
#> $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
#> $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#> $ Sub.Category : Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#> $ Product.Name : chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
#> $ Order.Month : Ord.factor w/ 12 levels "January"<"February"<..: 11 11 6 10 10 6 6 6 6 6 ...
#> $ Order.Quarter: Factor w/ 16 levels "2014.1","2014.2",..: 12 12 10 8 8 2 2 2 2 2 ...
#> $ Order.Year : Factor w/ 4 levels "2014","2015",..: 3 3 3 2 2 1 1 1 1 1 ...
#> $ SperQ : num 130.98 243.98 7.31 191.52 11.18 ...
#> $ PperQ : num 20.96 73.19 3.44 -76.61 1.26 ...
#> $ Discount_Cat : Factor w/ 3 levels "0%","0-20%",">=20%": 1 1 1 3 3 1 1 3 3 1 ...
Now the dataset has grow into 20 columns instead of initial 14 columns. we can conduct quick check of column names inside the dataset.
names(retail_df)#> [1] "Order.ID" "Order.Date" "Ship.Date" "Ship.Mode"
#> [5] "Customer.ID" "Segment" "Product.ID" "Category"
#> [9] "Sub.Category" "Product.Name" "Sales" "Quantity"
#> [13] "Discount" "Profit" "Order.Month" "Order.Quarter"
#> [17] "Order.Year" "SperQ" "PperQ" "Discount_Cat"
dim(retail_df)#> [1] 9994 20
Cool!! The dataset is looking incredible, let’s check the statistical summary of our dataset.
summary(retail_df)#> Order.ID Order.Date Ship.Date
#> Length:9994 Min. :2014-01-03 Min. :2014-01-07
#> Class :character 1st Qu.:2015-05-23 1st Qu.:2015-05-27
#> Mode :character Median :2016-06-26 Median :2016-06-29
#> Mean :2016-04-30 Mean :2016-05-03
#> 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
#> Max. :2017-12-30 Max. :2018-01-05
#>
#> Ship.Mode Customer.ID Segment Product.ID
#> First Class :1538 Length:9994 Consumer :5191 Length:9994
#> Same Day : 543 Class :character Corporate :3020 Class :character
#> Second Class :1945 Mode :character Home Office:1783 Mode :character
#> Standard Class:5968
#>
#>
#>
#> Category Sub.Category Product.Name
#> Furniture :2121 Binders :1523 Length:9994
#> Office Supplies:6026 Paper :1370 Class :character
#> Technology :1847 Furnishings: 957 Mode :character
#> Phones : 889
#> Storage : 846
#> Art : 796
#> (Other) :3613
#> Sales Quantity Discount Profit
#> Min. : 0.444 Min. : 1.00 Min. :0.0000 Min. :-6599.978
#> 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
#> Median : 54.490 Median : 3.00 Median :0.2000 Median : 8.666
#> Mean : 229.858 Mean : 3.79 Mean :0.1562 Mean : 28.657
#> 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
#> Max. :22638.480 Max. :14.00 Max. :0.8000 Max. : 8399.976
#>
#> Order.Month Order.Quarter Order.Year SperQ
#> November :1471 2017.4 :1219 2014:1993 Min. : 0.336
#> December :1408 2016.4 : 918 2015:2102 1st Qu.: 5.470
#> September:1383 2017.3 : 903 2016:2587 Median : 16.270
#> October : 819 2015.4 : 806 2017:3312 Mean : 60.920
#> May : 735 2014.4 : 755 3rd Qu.: 63.940
#> June : 717 2016.3 : 740 Max. :3773.080
#> (Other) :3461 (Other):4653
#> PperQ Discount_Cat
#> Min. :-1319.9956 0% :4798
#> 1st Qu.: 0.7228 0-20%: 146
#> Median : 2.7670 >=20%:5050
#> Mean : 7.7994
#> 3rd Qu.: 8.7032
#> Max. : 1679.9952
#>
Based on the summary above, we may collect some insights from the
dataset:
1. First order date was occured in Jan 2014, In 4 year, the company can
booked 9994 transactions.
2. The most favorable Ship mode by customers is Standard Class.
3. There are 3 Categories of Product: Furniture, Office Supplies,
Technology. Office Supplies has the highest transaction.
4. There are 3 Buyers Segment at the store: Consumer, Corporate, Home
Office. Consumer Segment has the highest transaction.
5. The Retail Store tends to not give discount in their successful
transaction (~50%).
6. In 4 years of operation, the store makes significant improvement on
their number of transaction.
7. The highest store transaction happened at Q4 2017.
8. Sales range is around 0.444 - 22,638.480 and profit range is around
-6,599.978 - 8,399.976.
9. Sales per Quantity range is around 0.336 - 3,773.080 and Profit per
Quantity range is around -1,319.9956 - 3,773.080.
10. Buyers in average bought 3.79 product in their every shopping
session.
ggplot(retail_df, aes(Segment,Profit))+
geom_boxplot(aes(fill = Segment))+
scale_y_log10()+
labs(title = 'Profit by Segment', x='Segment', y='Profit',
fill = 'Segment', subtitle = 'red line indicate average Profit')+
theme(plot.title = element_text(hjust = 0.5))+
geom_hline(yintercept = mean(retail_df$Profit), color = 'red',
linetype = 5)ggplot(retail_df, aes(Segment,Sales))+
geom_boxplot(aes(fill = Segment))+
scale_y_log10()+
labs(title = 'Sales by Segment', x='Segment', y='Sales',
fill = 'Segment', subtitle = 'red line indicate average sales')+
theme(plot.title = element_text(hjust = 0.5))+
geom_hline(yintercept = mean(retail_df$Sales), color = 'red',
linetype = 5)ggplot(retail_df, aes(Category,Profit))+
geom_boxplot(aes(fill = Category))+
scale_y_log10()+
labs(title = 'Profit by Category', x='Category', y='Profit',
fill = 'Category', subtitle = 'red line indicate average profit')+
theme(plot.title = element_text(hjust = 0.5))+
geom_hline(yintercept = mean(retail_df$Profit), color = 'red',
linetype = 5)ggplot(retail_df, aes(Category,Sales))+
geom_boxplot(aes(fill = Category))+
scale_y_log10()+
labs(title = 'Sales by Category', x='Category', y='Sales',
fill = 'Category', subtitle = 'red line indicate average sales')+
theme(plot.title = element_text(hjust = 0.5))+
geom_hline(yintercept = mean(retail_df$Sales), color = 'red',
linetype = 5)ggplot(retail_df,aes(Category,SperQ)) +
geom_jitter(aes(col= retail_df$Discount_Cat)) +
geom_boxplot(alpha=0.5) +
scale_y_log10()+
labs(title = "Sales per Quantity by Category", x= "Category", y= "Sales per Quantity", col = "Discount Category") +
theme(plot.title = element_text(hjust = 0.5))
Insights:
1. Each Segment has similar mean profit value and have outlier in upper
and lower whisker.
2. Each Segment has similar mean sales value and have outlier in upper
whisker.
3. Office Supplies Category has the lowest mean profit and sales value
compare to other categories even though we know that Office Supply has
the largest transaction at the store.
4. Based on Sales per Quantity by Category plot, looks like Office
Supplies suffers low Sales per Quantity due to higher discount applied
on the transaction.
#create new group by table for Category in each quarter
retail_dfQ_grpcat <- retail_df %>% group_by(Order.Quarter,Category) %>% summarise(Total_Quantity = sum(Quantity),
Total_Sales = sum(Sales),
Total_Profit = sum(Profit),
.groups = 'drop')
#Generate Bar Chart to indicate Total Quantity Sold per Quarter by Category
ggplot(retail_dfQ_grpcat, aes(Order.Quarter,Total_Quantity,colour=Category))+
geom_col(aes(fill = Category))+
labs(title = 'Historical Quarterly Product Sold by Category', x='Order Date',
y='Total Quantity', fill = 'Category')+
theme(plot.title = element_text(vjust=0.5, hjust = 0.5),
axis.text.x = element_text(angle=90, hjust=0.5, vjust=0.5))xtabs(Quantity~ Category+Order.Quarter, retail_df)#> Order.Quarter
#> Category 2014.1 2014.2 2014.3 2014.4 2015.1 2015.2 2015.3 2015.4
#> Furniture 224 300 429 670 244 345 479 707
#> Office Supplies 642 948 1333 1646 573 956 1348 1838
#> Technology 162 275 397 555 173 303 414 599
#> Order.Quarter
#> Category 2016.1 2016.2 2016.3 2016.4 2017.1 2017.2 2017.3 2017.4
#> Furniture 332 455 558 848 280 535 612 1010
#> Office Supplies 656 1419 1750 2121 1174 1578 2118 2806
#> Technology 255 366 459 618 391 438 654 880
#create new group by table for Sub Category
retail_dfQ_grpsubcat <- retail_df %>% group_by(Sub.Category) %>% summarise(Total_Quantity = sum(Quantity),
.groups = 'drop')
retail_dfQ_grpsubcat <- retail_dfQ_grpsubcat[order(retail_dfQ_grpsubcat$Total_Quantity, decreasing = TRUE),]
#Generate bar chart to capture sub category ranking based on total quantity
ggplot(data = retail_dfQ_grpsubcat, mapping = aes(x= Total_Quantity, y= reorder(Sub.Category,Total_Quantity))) +
geom_col(aes(fill = Total_Quantity)) + # untuk membuat barplot
labs(
title = "Sub Category Rank Based on Total Quantity",
x = "Total Quantity",
y = "Sub Category"
) +
scale_fill_gradient(low = "#8ecae6", high = "#023047") + # untuk mewarnai plot dengan teknik gradasi
geom_text(mapping = aes(label = Total_Quantity), # menambahkan informasi label masing-masing kategori
col = "black", # memberikan warna pada text
nudge_x = 200) + # menggeser text berdasarkan sumbu x
theme_classic() +
theme(plot.title = element_text(vjust=0.5, hjust = 0.5),
legend.position = "none") # untuk menghilangkan legend.
Insights:
1. Based on Historical Quarterly Product Sold by Category, we can see
that the store always has positive total quantity of product sold growth
over the time since 2014.
2. Lowest total product sold always happened at Q1 while the highest
product sold happened at Q4. Office Supplies always be the number 1
ranking in term of total product sold at the retail store.
3. Based on Sub Category Rank Based on Total Quantity, Binders has the
highest Total Quantity and Copiers has the lowest Total Quantity at the
retail store.
Since we know that the largest total quantity happened at Q4 2017, we want to dig deeper on daily total sales profile for each category.
#create new group by table for Category
retail_df_grpcat <- retail_df %>% group_by(Order.Date,Category) %>% summarise(Total_Quantity = sum(Quantity),
Total_Sales = sum(Sales),
Total_Profit = sum(Profit),
.groups = 'drop')
#Extract date information and change to correct data type
retail_df_grpcat$Order.Month <- month(retail_df_grpcat$Order.Date,label=TRUE, abbr = FALSE)
retail_df_grpcat$Order.Quarter <- quarter(retail_df_grpcat$Order.Date, type='year.quarter', fiscal_start = 1)
retail_df_grpcat$Order.Year <- year(retail_df_grpcat$Order.Date)
retail_df_grpcat$Order.Quarter <- as.factor(retail_df_grpcat$Order.Quarter)
retail_df_grpcat$Order.Year <- as.factor(retail_df_grpcat$Order.Year)
#Generate Line Plot Historical Total Sales per Category
ggplot(retail_df_grpcat[(retail_df_grpcat$Order.Quarter=='2017.4') ,],
aes(Order.Date,Total_Sales,colour=Category))+
geom_line(aes(fill = Category))+
geom_point(aes(fill = Category))+
labs(title = 'Historical Total Sales by Category', x='Order Date',
y='Total Sales', fill = 'Category')+
theme(plot.title = element_text(hjust = 0.5))
Insights:
1. Based on daily chart above, Furniture sales is more stable than
Office Supplies and Technology.
2. The Highest Total Sales Happened at October 2017 in Technology
Category.
ggplot(retail_df,aes(Category,SperQ))+
scale_y_log10()+
geom_violin(aes(fill = Segment)) +
labs (title = "Sales per Quantity Violin Plot in Buyer Segment", x = "Category", y = "Sales per Quantity", fill ="Segment")+
theme(plot.title = element_text(hjust = 0.5))
Insights:
1. Office Supplies Category has the lowest Sales per Quantity value in
the most populated area compare to Furniture and Technology.
2. Furniture and Technology highest population range is around 10-1000,
Office SUpplies around 1-10
In 4 years (2014-2017), the company have generate 9994 transactions
and also shows significant growth on their number of unit sold at the
store. The store sells 3 main category of products: Furniture, Office
Supplies, and Technology. There are also 3 Buyers Segment at the store:
Consumer, Corporate, and Home Office. Office Supplies Category and
Consumer Segment have the most highest transaction. Meanwhile, The
customers like to use Standard Class ship mode to deliver their
order.
The store tends to not give discount in their successful transaction
(~50%). I think this is the ideal way to do their business because when
they gives a lot of discount, it will increase their risk to generate.
Based on Box Plot Analysis, high amount of discount have tendency to
generate much lower sales and can lead to loss of profit. My suggestion
is for the store to conduct exercise on maximum discount that can be
applied for each product category to minimize their loss.
A work by Ade Anggi N S