Data yang digunakan adalah Retail.csv
Simpan file csv di dalam folder yang sama dengan file markdown
retail <- read.csv('retail.csv')
head(retail)
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 1 1 CA-2016-152156 11/8/16 11/11/16 Second Class CG-12520
## 2 2 CA-2016-152156 11/8/16 11/11/16 Second Class CG-12520
## 3 3 CA-2016-138688 6/12/16 6/16/16 Second Class DV-13045
## 4 4 US-2015-108966 10/11/15 10/18/15 Standard Class SO-20335
## 5 5 US-2015-108966 10/11/15 10/18/15 Standard Class SO-20335
## 6 6 CA-2014-115812 6/9/14 6/14/14 Standard Class BH-11710
## Segment Product.ID Category Sub.Category
## 1 Consumer FUR-BO-10001798 Furniture Bookcases
## 2 Consumer FUR-CH-10000454 Furniture Chairs
## 3 Corporate OFF-LA-10000240 Office Supplies Labels
## 4 Consumer FUR-TA-10000577 Furniture Tables
## 5 Consumer OFF-ST-10000760 Office Supplies Storage
## 6 Consumer FUR-FU-10001487 Furniture Furnishings
## Product.Name Sales
## 1 Bush Somerset Collection Bookcase 261.9600
## 2 Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3 Self-Adhesive Address Labels for Typewriters by Universal 14.6200
## 4 Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5 Eldon Fold 'N Roll Cart System 22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.8600
## Quantity Discount Profit
## 1 2 0.00 41.9136
## 2 3 0.00 219.5820
## 3 2 0.00 6.8714
## 4 5 0.45 -383.0310
## 5 2 0.20 2.5164
## 6 7 0.00 14.1694
str(retail)
## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
## $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Dataset retail memiliki data sebanyak 9994 baris dengan 15 kolom. Di mana masing-masing kolom berisi nilai Row.ID, Order.ID, Order.Date, Ship.Date, Ship.Mode, Customer.ID, Segment, Product.ID, Category, Sub.Category, Product.Name, Sales, Quantity, Discount, dan Profit.
str(retail)
## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
## $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Ada beberapa data yang belum sesuai, maka perlu dilakukan perubahan tipe data. Untuk mengubah tipe date, bisa gunakan library Lubridate - Row.ID –> char - Order.Date, Ship.Date –> date - Segment, Category, Sub.Category –> factor
retail$Row.ID <- as.character(retail$Row.ID)
# install package lubridate
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
retail$Order.Date <- mdy(retail$Order.Date)
retail$Ship.Date <- mdy(retail$Ship.Date)
retail$Segment <- as.factor(retail$Segment)
retail$Category <- as.factor(retail$Category)
retail$Sub.Category <- as.factor(retail$Sub.Category)
str(retail)
## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : chr "1" "2" "3" "4" ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : Date, format: "2016-11-08" "2016-11-08" ...
## $ Ship.Date : Date, format: "2016-11-11" "2016-11-11" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
## $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
# cek missing value
sum(is.na(retail))
## [1] 0
### Business Case Kita ingin mengetahui informasi dari dataset retail kategori apa yang paling menguntungkan, siapa yang membeli paling banyak, serta telusuri juga kategori barang apa yang menyebabkan kerugian pada perusahaan.
summary(retail)
## Row.ID Order.ID Order.Date
## Length:9994 Length:9994 Min. :2014-01-03
## Class :character Class :character 1st Qu.:2015-05-23
## Mode :character Mode :character Median :2016-06-26
## Mean :2016-04-30
## 3rd Qu.:2017-05-14
## Max. :2017-12-30
##
## Ship.Date Ship.Mode Customer.ID Segment
## Min. :2014-01-07 Length:9994 Length:9994 Consumer :5191
## 1st Qu.:2015-05-27 Class :character Class :character Corporate :3020
## Median :2016-06-29 Mode :character Mode :character Home Office:1783
## Mean :2016-05-03
## 3rd Qu.:2017-05-18
## Max. :2018-01-05
##
## Product.ID Category Sub.Category
## Length:9994 Furniture :2121 Binders :1523
## Class :character Office Supplies:6026 Paper :1370
## Mode :character Technology :1847 Furnishings: 957
## Phones : 889
## Storage : 846
## Art : 796
## (Other) :3613
## Product.Name Sales Quantity Discount
## Length:9994 Min. : 0.444 Min. : 1.00 Min. :0.0000
## Class :character 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000
## Mode :character Median : 54.490 Median : 3.00 Median :0.2000
## Mean : 229.858 Mean : 3.79 Mean :0.1562
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000
## Max. :22638.480 Max. :14.00 Max. :0.8000
##
## Profit
## Min. :-6599.978
## 1st Qu.: 1.729
## Median : 8.666
## Mean : 28.657
## 3rd Qu.: 29.364
## Max. : 8399.976
##
Dari fungsi summary, diperoleh informasi :
sales_agg <- aggregate(formula=Sales~Segment+Category, data=retail, FUN=mean)
sales_agg <- sales_agg[order(sales_agg$Sales, decreasing = T),]
sales_agg
## Segment Category Sales
## 9 Home Office Technology 535.9767
## 8 Corporate Technology 444.8558
## 7 Consumer Technology 427.3395
## 2 Corporate Furniture 354.5198
## 1 Consumer Furniture 351.3471
## 3 Home Office Furniture 336.8251
## 5 Corporate Office Supplies 126.7453
## 4 Consumer Office Supplies 116.3902
## 6 Home Office Office Supplies 115.3090
Dapat disimpulkan bahwa Home Office pada kategori Technology memiliki rata-rata Sales tertinggi, sedangkan kategori Office Supplies memiliki rata-rata Sales terendah.
# buat histogram agar lebih mudah membaca data
# install library ggplot2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
ggplot(sales_agg, aes(x = Category, y = Sales)) + geom_col(aes(fill = Segment), position = "dodge")
Saat kita menggunakan fungsi summary, didapatkan informasi bahwa consumer paling banyak melakukan transaksi. Kita akan mencari tahu barang dari kategori apa yang banyak dibeli oleh consumer.
consumer <- retail[retail$Segment == 'Consumer',]
consumer <- aggregate(formula=Profit~Product.Name+Category, data=retail, FUN=sum)
consumer <- head(consumer[order(consumer$Profit, decreasing = T),],5)
consumer
## Product.Name
## 1489 Canon imageCLASS 2200 Advanced Copier
## 776 Fellowes PB500 Electric Punch Plastic Comb Binding Machine with Manual Bind
## 1559 Hewlett Packard LaserJet 3310 Copier
## 1493 Canon PC1060 Personal Laser Copier
## 1560 HP Designjet T520 Inkjet Large Format Printer - 24" Color
## Category Profit
## 1489 Technology 25199.928
## 776 Office Supplies 7753.039
## 1559 Technology 6983.884
## 1493 Technology 4570.935
## 1560 Technology 4094.977
ggplot(consumer, aes(x = Profit, y = Product.Name)) + geom_col()
consumer_qty <- retail[retail$Segment == 'Consumer',]
consumer_qty <- aggregate(formula=Quantity~Product.Name+Category, data=retail, FUN=sum)
consumer_qty <- head(consumer_qty[order(consumer_qty$Quantity, decreasing = T),],5)
consumer_qty
## Product.Name Category Quantity
## 1150 Staples Office Supplies 215
## 1146 Staple envelope Office Supplies 170
## 723 Easy-staple paper Office Supplies 150
## 1151 Staples in misc. colors Office Supplies 86
## 258 KI Adjustable-Height Table Furniture 74
Dari data di atas, dapat kita simpulkan bahwa penjualan Canon imageCLASS 2200 Advanced Copier memiliki profit tertinggi sedangkan barang yang paling banyak dibeli oleh consumer adalah staples.
profit <- retail[retail$Category==retail$Category & retail$Profit <0,]
profit <- profit[,c(7,10,15)]
profit <- profit[order(profit$Profit, decreasing = F),]
ggplot(profit, aes(y = Sub.Category, x = Profit)) + geom_col(aes(fill = Segment))
Dari grafik di atas, terdapat informasi bahwa binders menyebabkan kerugian yang paling besar pada segmen consumer.