Hi My Name is Ade Anggi Naluriawan Santoso. Nice to meet you! This is my first RPubs Document as part of Learn By Building Assignment at Algoritma Data Science School.
The data stored a record of a transactions of a retail store
specifying several variables. Please take a look at the following
glossary for your referrence:
- Order.ID: Id of order.
- Order.Date: Date of order.
- Ship.Date: Date of shipment.
- Ship.Mode: Type of shipment.
- Customer.ID: Id of customer.
- Segment : Customer’s segment.
- Product.ID: Id of product.
- Category : Product category with 3 levels (“Furniture”,
“Office Supplies”, “Technology”)
- Sub.Category: more specific product category
- Product.Name: Name of product that was sold.
- Sales: How much earning from each sale.
- Quantity: Quantity of item sold.
- Discount: How much discount was given for each
sale.
- Profit: How much can a company earn from each sale.
First of all, we need to read retail.csv in our
folder using read.csv() function, save it into an object
named retail_df, and display first 6 rows of the data.
retail_df <- read.csv("retail.csv")
head(retail_df)Display last 6 rows of retail dataset using ‘tail’ function
tail(retail_df)Retail dataset has 14 columns and 9994 rows. We can quickly check the dimension of our data by using ‘dim’ function.
dim(retail_df)#> [1] 9994 14
Even though we already know column names inside retail dataset, we still can validate by using ‘names’ function.
names(retail_df)#> [1] "Order.ID" "Order.Date" "Ship.Date" "Ship.Mode" "Customer.ID"
#> [6] "Segment" "Product.ID" "Category" "Sub.Category" "Product.Name"
#> [11] "Sales" "Quantity" "Discount" "Profit"
The output above and column names list in the Case Description section are the same (consists of 14 columns).
After we load the data, we need to check data type for each column inside the dataset to get general picture of our data.
str(retail_df)#> 'data.frame': 9994 obs. of 14 variables:
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : chr "11/8/2016" "11/8/2016" "6/12/2016" "10/11/2015" ...
#> $ Ship.Date : chr "11/11/2016" "11/11/2016" "6/16/2016" "10/18/2015" ...
#> $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
#> $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
#> $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Retail dataset has 10 ‘chr’, 3 ‘num’, and 1 ‘int’ data types, but we know that some columns have wrong data types. That’s why we need to do adjustment on ‘Order.Date’, ‘Ship.Date’, ‘Ship.Mode’, ‘Segment’, ‘Category’, ‘Sub.Category’ to get the correct data types.
library(lubridate)
retail_df$Order.Date<- mdy(retail_df$Order.Date)
retail_df$Ship.Date <- mdy(retail_df$Ship.Date)
retail_df$Ship.Mode <- as.factor(retail_df$Ship.Mode)
retail_df$Segment <- as.factor(retail_df$Segment)
retail_df$Category <- as.factor(retail_df$Category)
retail_df$Sub.Category <- as.factor(retail_df$Sub.Category)
str(retail_df)#> 'data.frame': 9994 obs. of 14 variables:
#> $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
#> $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
#> $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
#> $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#> $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
#> $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#> $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
#> $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#> $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#> $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Looks like all of column names in retail dataset already in
correct data types.
Next, we will check whether our dataset has missing values or not.
colSums(is.na(retail_df))#> Order.ID Order.Date Ship.Date Ship.Mode Customer.ID Segment
#> 0 0 0 0 0 0
#> Product.ID Category Sub.Category Product.Name Sales Quantity
#> 0 0 0 0 0 0
#> Discount Profit
#> 0 0
anyNA(retail_df)#> [1] FALSE
Looks like the dataset doesn’t have any missing value since all
number below column names is zero and the output of
‘anyNA’ function is FALSE. That means our dataset is
ready to be processed and analyzed.
We start by check summary of our data by using ‘summary’ function.
summary(retail_df)#> Order.ID Order.Date Ship.Date
#> Length:9994 Min. :2014-01-03 Min. :2014-01-07
#> Class :character 1st Qu.:2015-05-23 1st Qu.:2015-05-27
#> Mode :character Median :2016-06-26 Median :2016-06-29
#> Mean :2016-04-30 Mean :2016-05-03
#> 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
#> Max. :2017-12-30 Max. :2018-01-05
#>
#> Ship.Mode Customer.ID Segment Product.ID
#> First Class :1538 Length:9994 Consumer :5191 Length:9994
#> Same Day : 543 Class :character Corporate :3020 Class :character
#> Second Class :1945 Mode :character Home Office:1783 Mode :character
#> Standard Class:5968
#>
#>
#>
#> Category Sub.Category Product.Name
#> Furniture :2121 Binders :1523 Length:9994
#> Office Supplies:6026 Paper :1370 Class :character
#> Technology :1847 Furnishings: 957 Mode :character
#> Phones : 889
#> Storage : 846
#> Art : 796
#> (Other) :3613
#> Sales Quantity Discount Profit
#> Min. : 0.444 Min. : 1.00 Min. :0.0000 Min. :-6599.978
#> 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
#> Median : 54.490 Median : 3.00 Median :0.2000 Median : 8.666
#> Mean : 229.858 Mean : 3.79 Mean :0.1562 Mean : 28.657
#> 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
#> Max. :22638.480 Max. :14.00 Max. :0.8000 Max. : 8399.976
#>
Insight from Data Summary:
1. First order date was occured in Jan 2014, In 4 year, the company can
booked 9994 transactions.
2. The most favorable Ship mode by customers is Standard Class.
3. More than 50% customers coming from Consumer Segment while Office
supplies category dominates the market.
4. Mean value of sales is 229.858 and there is huge gap from 3rd
Quartile sales value with Max sales value (~22k difference). That means
there is possibility of outlier value in Sales column.
5. The maximum discount offered by the seller is 80%.
6. The maximum quantity ordered by the buyer is 14 units.
7. There is possibility of outlier in Profit columns since there is huge
gap between mean profit value compare to minimum and maximum
value.
Since we have hypothesis of outlier appearance from Sales and Profit,
we will conduct investigation where the outlier coming from. We start by
analyzing relationship between segment and profit.
Statistic for Consumer Segment:
summary(retail_df[retail_df$Segment=='Consumer',][,c('Segment','Profit')])#> Segment Profit
#> Consumer :5191 Min. :-6599.978
#> Corporate : 0 1st Qu.: 1.584
#> Home Office: 0 Median : 8.375
#> Mean : 25.837
#> 3rd Qu.: 28.607
#> Max. : 6719.981
boxplot(retail_df[retail_df$Segment=='Consumer',][,c('Profit')])hist(retail_df[retail_df$Segment=='Consumer',][,c('Profit')],col = 'green')
Statistic for Corporate Segment:
summary(retail_df[retail_df$Segment=='Corporate',][,c('Segment','Profit')])#> Segment Profit
#> Consumer : 0 Min. :-3839.990
#> Corporate :3020 1st Qu.: 1.928
#> Home Office: 0 Median : 8.694
#> Mean : 30.457
#> 3rd Qu.: 29.843
#> Max. : 8399.976
boxplot(retail_df[retail_df$Segment=='Corporate',][,c('Profit')])hist(retail_df[retail_df$Segment=='Corporate',][,c('Profit')],col = 'green')Statistic for Home Office Segment:
summary(retail_df[retail_df$Segment=='Home Office',][,c('Segment','Profit')])#> Segment Profit
#> Consumer : 0 Min. :-3399.980
#> Corporate : 0 1st Qu.: 1.984
#> Home Office:1783 Median : 9.102
#> Mean : 33.819
#> 3rd Qu.: 31.560
#> Max. : 3919.989
boxplot(retail_df[retail_df$Segment=='Home Office',][,c('Profit')])hist(retail_df[retail_df$Segment=='Home Office',][,c('Profit')],col = 'green')
Data Population from positive profit row for each Segment:
aggregate(Profit~Segment,retail_df[(retail_df$Profit>=0),],length)Data Population from negative profit (loss) row for each Segment:
aggregate(Profit~Segment,retail_df[(retail_df$Profit<0),],length)Percentage of negative profit from sales of Segment:
aggregate(Profit~Segment,retail_df[(retail_df$Profit<0),],length)$Profit/(aggregate(Profit~Segment,retail_df[(retail_df$Profit<0),],length)$Profit+aggregate(Profit~Segment,retail_df[(retail_df$Profit>=0),],length)$Profit)#> [1] 0.193219 0.184106 0.174986
Standard Deviation (SD) profit for each segment category:
aggregate(Profit~Segment,retail_df,sd)Variance profit for each segment category:
aggregate(Profit~Segment,retail_df,var)Insights:
1. All of segment have similar profit distribution profile with many
outlier data in both upper and lower section.
2. Most of profit data population located in between -1000 & 1000
for every segment of the company.
3. Consumer, Corporate, and Home Office Segment also have similar
percentage of loss.
4. The data has large data spread based on variance value of profit for
each segment category and the spread still acceptable because SD value
for each segment category ~200 which means still considered as low.
Statistic for Consumer Segment:
summary(retail_df[retail_df$Segment=='Consumer',][,c('Segment','Sales')])#> Segment Sales
#> Consumer :5191 Min. : 0.444
#> Corporate : 0 1st Qu.: 17.264
#> Home Office: 0 Median : 53.720
#> Mean : 223.734
#> 3rd Qu.: 206.106
#> Max. :13999.960
boxplot(retail_df[retail_df$Segment=='Consumer',][,c('Sales')])hist(retail_df[retail_df$Segment=='Consumer',][,c('Sales')],col = 'green')Statistic for Corporate Segment:
summary(retail_df[retail_df$Segment=='Corporate',][,c('Segment','Sales')])#> Segment Sales
#> Consumer : 0 Min. : 0.556
#> Corporate :3020 1st Qu.: 17.469
#> Home Office: 0 Median : 56.540
#> Mean : 233.823
#> 3rd Qu.: 212.895
#> Max. :17499.950
boxplot(retail_df[retail_df$Segment=='Corporate',][,c('Sales')])hist(retail_df[retail_df$Segment=='Corporate',][,c('Sales')],col = 'green')Statistic for Home Office Segment:
summary(retail_df[retail_df$Segment=='Home Office',][,c('Segment','Sales')])#> Segment Sales
#> Consumer : 0 Min. : 0.99
#> Corporate : 0 1st Qu.: 17.05
#> Home Office:1783 Median : 52.44
#> Mean : 240.97
#> 3rd Qu.: 210.91
#> Max. :22638.48
boxplot(retail_df[retail_df$Segment=='Home Office',][,c('Sales')])hist(retail_df[retail_df$Segment=='Home Office',][,c('Sales')],col = 'green')
Standard Deviation (SD) profit for each segment category:
aggregate(Sales~Segment,retail_df,sd)Variance profit for each segment category:
aggregate(Sales~Segment,retail_df,var)Insights:
1. Sales data distribution for each segment is skewed to the right with
most of data located in 0-1000 class.
2. The biggest sales happened in Home Office segment (22,638.48).
3. Home Office segment has larger value of standard deviation and
variance. The company need to look more detail of reason on this kind of
fluctuation.
Statistic for Furniture Category:
summary(retail_df[retail_df$Category=='Furniture',][,c('Category','Profit')])#> Category Profit
#> Furniture :2121 Min. :-1862.312
#> Office Supplies: 0 1st Qu.: -12.849
#> Technology : 0 Median : 7.775
#> Mean : 8.699
#> 3rd Qu.: 33.727
#> Max. : 1013.127
boxplot(retail_df[retail_df$Category=='Furniture',][,c('Profit')])hist(retail_df[retail_df$Category=='Furniture',][,c('Profit')],col = 'green')
Statistic for Office Supplies Category:
summary(retail_df[retail_df$Category=='Office Supplies',][,c('Category','Profit')])#> Category Profit
#> Furniture : 0 Min. :-3701.893
#> Office Supplies:6026 1st Qu.: 2.102
#> Technology : 0 Median : 6.882
#> Mean : 20.327
#> 3rd Qu.: 19.416
#> Max. : 4946.370
boxplot(retail_df[retail_df$Category=='Office Supplies',][,c('Profit')])hist(retail_df[retail_df$Category=='Office Supplies',][,c('Profit')],col = 'green')Statistic for Technology Category:
summary(retail_df[retail_df$Category=='Technology',][,c('Category','Profit')])#> Category Profit
#> Furniture : 0 Min. :-6599.978
#> Office Supplies: 0 1st Qu.: 5.204
#> Technology :1847 Median : 25.018
#> Mean : 78.752
#> 3rd Qu.: 74.895
#> Max. : 8399.976
boxplot(retail_df[retail_df$Category=='Technology',][,c('Profit')])hist(retail_df[retail_df$Category=='Technology',][,c('Profit')],col = 'green')
Data Population from positive profit row for each category:
aggregate(Profit~Category,retail_df[(retail_df$Profit>=0),],length)Data Population from negative profit (loss) row for each category:
aggregate(Profit~Category,retail_df[(retail_df$Profit<0),],length)Percentage of negative profit from sales of category:
aggregate(Profit~Category,retail_df[(retail_df$Profit<0),],length)$Profit/(aggregate(Profit~Category,retail_df[(retail_df$Profit<0),],length)$Profit+aggregate(Profit~Category,retail_df[(retail_df$Profit>=0),],length)$Profit)#> [1] 0.3366337 0.1470295 0.1467244
Standard Deviation (SD) profit for each Category:
aggregate(Profit~Category,retail_df,sd)Variance profit for each Category:
aggregate(Profit~Category,retail_df,var)Insights:
1. All of category have similar profit distribution profile with many
outlier data in both upper and lower section.
2. Most of profit data population located in between -500 & 500 for
every category of product.
3. Furniture Category has the largest percentage of loss
(~33.66%).
4. The largest profit transaction was coming from Technology
Category(8,399.976) while the largest losss transaction also coming from
Technology Category (-6.599.978) 5. Technology Category has the largest
value of standard deviation and variance. The company need to look more
detail of reason on this kind of fluctuation.
Statistic for Furniture Category:
summary(retail_df[retail_df$Category=='Furniture',][,c('Category','Sales')])#> Category Sales
#> Furniture :2121 Min. : 1.892
#> Office Supplies: 0 1st Qu.: 47.040
#> Technology : 0 Median : 182.220
#> Mean : 349.835
#> 3rd Qu.: 435.168
#> Max. :4416.174
boxplot(retail_df[retail_df$Category=='Furniture',][,c('Sales')])hist(retail_df[retail_df$Category=='Furniture',][,c('Sales')],col = 'green')Statistic for Office Supplies Category:
summary(retail_df[retail_df$Category=='Office Supplies',][,c('Category','Sales')])#> Category Sales
#> Furniture : 0 Min. : 0.444
#> Office Supplies:6026 1st Qu.: 11.760
#> Technology : 0 Median : 27.418
#> Mean : 119.324
#> 3rd Qu.: 79.920
#> Max. :9892.740
boxplot(retail_df[retail_df$Category=='Office Supplies',][,c('Sales')])hist(retail_df[retail_df$Category=='Office Supplies',][,c('Sales')],col = 'green')Statistic for Technology Category:
summary(retail_df[retail_df$Category=='Technology',][,c('Category','Sales')])#> Category Sales
#> Furniture : 0 Min. : 0.99
#> Office Supplies: 0 1st Qu.: 68.02
#> Technology :1847 Median : 166.16
#> Mean : 452.71
#> 3rd Qu.: 448.53
#> Max. :22638.48
boxplot(retail_df[retail_df$Category=='Technology',][,c('Sales')])hist(retail_df[retail_df$Category=='Technology',][,c('Sales')],col = 'green')Standard Deviation (SD) profit for each category:
aggregate(Sales~Category,retail_df,sd)Variance profit for each category:
aggregate(Sales~Category,retail_df,var)Insights:
1. Sales data distribution for each Category is skewed to the right with
most of data located in 0-1000 class.
2. The biggest sales happened in Technology Category (22,638.48).
3. Technology Category has larger value of standard deviation and
variance of sales. The company need to look more detail of reason on
this kind of fluctuation.
plot(retail_df$Sales, retail_df$Profit,
xlab = "Sales",
ylab = "Profit",
col = retail_df$Category,
pch = 18) # mengganti bentuk titik scatterplot
abline(lm(retail_df$Profit ~ retail_df$Sales), col = "red")
legend("topright", legend = levels(retail_df$Category),
fill = 1:3) # berapa banyak warna yang ditampilkan
title("Correlation Between Sales and Profit")
Based on the plot above, sales and profit has positive correlation
value. Furniture Category has maximum range of sales ~5000 and has
tendency to create less profit compare to other category.
library(ggplot2)ggplot(data = aggregate(Quantity~Category,retail_df,sum), mapping = aes(x= Category, y= Quantity )) +
geom_col() +# untuk membuat barplot
labs(
title="Total Unit Sold For Each Category")+
theme(plot.title = element_text(hjust=0.5))
Total Unit Sold comes from Office Supplies Category.
ggplot(data = aggregate(Quantity~Segment,retail_df,sum), mapping = aes(x= Segment, y= Quantity )) +
geom_col() + # untuk membuat barplot
labs(
title="Total Unit Sold For Each Segment")+
theme(plot.title = element_text(hjust=0.5))
Meanwhile when we look at the Most unit sold from segment, the award
goes to consumer segment.
retail_df[retail_df$Sales == 0.444,]Answer: Sales 0.444 comes from segment ‘consumer’, office
supplies, and got loss 1.11.
sales_office_supplies <- retail_df[retail_df$Category=='Office Supplies',]
round(prop.table(table(sales_office_supplies$Segment))*100,2)#>
#> Consumer Corporate Home Office
#> 51.89 30.20 17.91
Answer: Consumer Segment = 51.89%
discount<-retail_df[retail_df$Discount==0.8,]
(sort(table(discount$Order.Date),decreasing = T)[1])#> 2014-07-26
#> 4
Answer: buyers get 80% discount 4 times and happened on 26
July 2014
discount2<-discount[discount$Order.Date=="2014-07-26",]
(table(discount2$Segment))#>
#> Consumer Corporate Home Office
#> 4 0 0
Answer: Buyers were coming from Consumer
Segment.
xtabs(Quantity~Segment+Category,retail_df)#> Category
#> Segment Furniture Office Supplies Technology
#> Consumer 4166 11758 3597
#> Corporate 2495 7018 2095
#> Home Office 1367 4130 1247
plot(xtabs(Quantity~Segment+Category,retail_df))
Answer: Based on the output, consumer segment with office supply
category make the highest quantity order (11,758 units)
aggregate(Quantity~Segment+Category,retail_df,median)Answer: Median of quantity for every segment and category =
3
sum(retail_df$Quantity)#> [1] 37873
Answer: 37,873 units
sum(retail_df$Profit)#> [1] 286397
sum(retail_df$Sales)#> [1] 2297201
Answer: 286,397
zero_disc<- retail_df[retail_df$Discount==0,]
knitr::kable(table(zero_disc$Discount))| Var1 | Freq |
|---|---|
| 0 | 4798 |
more_disc <- retail_df[retail_df$Discount>0 & retail_df$Discount<=0.8,]
knitr::kable(table(more_disc$Discount))| Var1 | Freq |
|---|---|
| 0.1 | 94 |
| 0.15 | 52 |
| 0.2 | 3657 |
| 0.3 | 227 |
| 0.32 | 27 |
| 0.4 | 206 |
| 0.45 | 11 |
| 0.5 | 66 |
| 0.6 | 138 |
| 0.7 | 418 |
| 0.8 | 300 |
Answer: According to the output above, almost half amount of
transaction was done without any discount given by the
company
xtabs(Profit~ Category, retail_df)#> Category
#> Furniture Office Supplies Technology
#> 18451.27 122490.80 145454.95
Answer: Technology Category
retail_df[retail_df$Profit==-6599.978,]Answer: The transaction happened on 25 Nov 2016 with Order.ID
CA-2016-108196. The company sold 5 units product of Technology Category
with 70% discount
xtabs(Profit~ Segment, retail_df)#> Segment
#> Consumer Corporate Home Office
#> 134119.21 91979.13 60298.68
Answer: Consumer Segment
xtabs(Profit~Segment+Ship.Mode,retail_df)#> Ship.Mode
#> Segment First Class Same Day Second Class Standard Class
#> Consumer 21374.044 9874.205 24946.911 77924.049
#> Corporate 14464.472 1818.142 18225.713 57470.807
#> Home Office 13131.324 4199.412 14274.011 28693.932
plot(xtabs(Profit~Segment+Ship.Mode,retail_df))
Answer: Consumer Segment with Standard Class Ship
Mode.
xtabs(Quantity~Category,retail_df)#> Category
#> Furniture Office Supplies Technology
#> 8028 22906 6939
Answer: Office Supply Category
A work by Ade Anggi N S