Data yang digunakan adalah Retail.csv

Data Preparation

1. Import Data

Simpan file csv di dalam folder yang sama dengan file markdown

retail <- read.csv('retail.csv')

2. Cek Data Retail

head(retail)
##   Row.ID       Order.ID Order.Date Ship.Date      Ship.Mode Customer.ID
## 1      1 CA-2016-152156    11/8/16  11/11/16   Second Class    CG-12520
## 2      2 CA-2016-152156    11/8/16  11/11/16   Second Class    CG-12520
## 3      3 CA-2016-138688    6/12/16   6/16/16   Second Class    DV-13045
## 4      4 US-2015-108966   10/11/15  10/18/15 Standard Class    SO-20335
## 5      5 US-2015-108966   10/11/15  10/18/15 Standard Class    SO-20335
## 6      6 CA-2014-115812     6/9/14   6/14/14 Standard Class    BH-11710
##     Segment      Product.ID        Category Sub.Category
## 1  Consumer FUR-BO-10001798       Furniture    Bookcases
## 2  Consumer FUR-CH-10000454       Furniture       Chairs
## 3 Corporate OFF-LA-10000240 Office Supplies       Labels
## 4  Consumer FUR-TA-10000577       Furniture       Tables
## 5  Consumer OFF-ST-10000760 Office Supplies      Storage
## 6  Consumer FUR-FU-10001487       Furniture  Furnishings
##                                                       Product.Name    Sales
## 1                                Bush Somerset Collection Bookcase 261.9600
## 2      Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3        Self-Adhesive Address Labels for Typewriters by Universal  14.6200
## 4                    Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5                                   Eldon Fold 'N Roll Cart System  22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood  48.8600
##   Quantity Discount    Profit
## 1        2     0.00   41.9136
## 2        3     0.00  219.5820
## 3        2     0.00    6.8714
## 4        5     0.45 -383.0310
## 5        2     0.20    2.5164
## 6        7     0.00   14.1694
str(retail)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
##  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Dataset retail memiliki data sebanyak 9994 baris dengan 15 kolom. Di mana masing-masing kolom berisi nilai Row.ID, Order.ID, Order.Date, Ship.Date, Ship.Mode, Customer.ID, Segment, Product.ID, Category, Sub.Category, Product.Name, Sales, Quantity, Discount, dan Profit.

3. Bersihkan Data

str(retail)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
##  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Ada beberapa data yang belum sesuai, maka perlu dilakukan perubahan tipe data. Untuk mengubah tipe date, bisa gunakan library Lubridate - Row.ID –> char - Order.Date, Ship.Date –> date - Segment, Category, Sub.Category –> factor

retail$Row.ID <- as.character(retail$Row.ID)

# install package lubridate
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
retail$Order.Date <- mdy(retail$Order.Date)
retail$Ship.Date <- mdy(retail$Ship.Date)

retail$Segment <- as.factor(retail$Segment)
retail$Category <- as.factor(retail$Category)
retail$Sub.Category <- as.factor(retail$Sub.Category)

str(retail)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : chr  "1" "2" "3" "4" ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
##  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
##  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...
# cek missing value
sum(is.na(retail))
## [1] 0

Data Exploratory

### Business Case Kita ingin mengetahui informasi dari dataset retail kategori apa yang paling menguntungkan, siapa yang membeli paling banyak, serta telusuri juga kategori barang apa yang menyebabkan kerugian pada perusahaan.

summary(retail)
##     Row.ID            Order.ID           Order.Date        
##  Length:9994        Length:9994        Min.   :2014-01-03  
##  Class :character   Class :character   1st Qu.:2015-05-23  
##  Mode  :character   Mode  :character   Median :2016-06-26  
##                                        Mean   :2016-04-30  
##                                        3rd Qu.:2017-05-14  
##                                        Max.   :2017-12-30  
##                                                            
##    Ship.Date           Ship.Mode         Customer.ID               Segment    
##  Min.   :2014-01-07   Length:9994        Length:9994        Consumer   :5191  
##  1st Qu.:2015-05-27   Class :character   Class :character   Corporate  :3020  
##  Median :2016-06-29   Mode  :character   Mode  :character   Home Office:1783  
##  Mean   :2016-05-03                                                           
##  3rd Qu.:2017-05-18                                                           
##  Max.   :2018-01-05                                                           
##                                                                               
##   Product.ID                   Category         Sub.Category 
##  Length:9994        Furniture      :2121   Binders    :1523  
##  Class :character   Office Supplies:6026   Paper      :1370  
##  Mode  :character   Technology     :1847   Furnishings: 957  
##                                            Phones     : 889  
##                                            Storage    : 846  
##                                            Art        : 796  
##                                            (Other)    :3613  
##  Product.Name           Sales              Quantity        Discount     
##  Length:9994        Min.   :    0.444   Min.   : 1.00   Min.   :0.0000  
##  Class :character   1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000  
##  Mode  :character   Median :   54.490   Median : 3.00   Median :0.2000  
##                     Mean   :  229.858   Mean   : 3.79   Mean   :0.1562  
##                     3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000  
##                     Max.   :22638.480   Max.   :14.00   Max.   :0.8000  
##                                                                         
##      Profit         
##  Min.   :-6599.978  
##  1st Qu.:    1.729  
##  Median :    8.666  
##  Mean   :   28.657  
##  3rd Qu.:   29.364  
##  Max.   : 8399.976  
## 

Dari fungsi summary, diperoleh informasi :

sales_agg <- aggregate(formula=Sales~Segment+Category, data=retail, FUN=mean)
sales_agg <- sales_agg[order(sales_agg$Sales, decreasing = T),]
sales_agg
##       Segment        Category    Sales
## 9 Home Office      Technology 535.9767
## 8   Corporate      Technology 444.8558
## 7    Consumer      Technology 427.3395
## 2   Corporate       Furniture 354.5198
## 1    Consumer       Furniture 351.3471
## 3 Home Office       Furniture 336.8251
## 5   Corporate Office Supplies 126.7453
## 4    Consumer Office Supplies 116.3902
## 6 Home Office Office Supplies 115.3090

Dapat disimpulkan bahwa Home Office pada kategori Technology memiliki rata-rata Sales tertinggi, sedangkan kategori Office Supplies memiliki rata-rata Sales terendah.

# buat histogram agar lebih mudah membaca data
# install library ggplot2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
ggplot(sales_agg, aes(x = Category, y = Sales)) + geom_col(aes(fill = Segment), position = "dodge")

Saat kita menggunakan fungsi summary, didapatkan informasi bahwa consumer paling banyak melakukan transaksi. Kita akan mencari tahu barang dari kategori apa yang banyak dibeli oleh consumer.

consumer <- retail[retail$Segment == 'Consumer',]
consumer <- aggregate(formula=Profit~Product.Name+Category, data=retail, FUN=sum)
consumer <- head(consumer[order(consumer$Profit, decreasing = T),],5)
consumer
##                                                                     Product.Name
## 1489                                       Canon imageCLASS 2200 Advanced Copier
## 776  Fellowes PB500 Electric Punch Plastic Comb Binding Machine with Manual Bind
## 1559                                        Hewlett Packard LaserJet 3310 Copier
## 1493                                          Canon PC1060 Personal Laser Copier
## 1560                   HP Designjet T520 Inkjet Large Format Printer - 24" Color
##             Category    Profit
## 1489      Technology 25199.928
## 776  Office Supplies  7753.039
## 1559      Technology  6983.884
## 1493      Technology  4570.935
## 1560      Technology  4094.977
ggplot(consumer, aes(x = Profit, y = Product.Name)) + geom_col()

consumer_qty <- retail[retail$Segment == 'Consumer',]
consumer_qty <- aggregate(formula=Quantity~Product.Name+Category, data=retail, FUN=sum)
consumer_qty <- head(consumer_qty[order(consumer_qty$Quantity, decreasing = T),],5)
consumer_qty
##                    Product.Name        Category Quantity
## 1150                    Staples Office Supplies      215
## 1146            Staple envelope Office Supplies      170
## 723           Easy-staple paper Office Supplies      150
## 1151    Staples in misc. colors Office Supplies       86
## 258  KI Adjustable-Height Table       Furniture       74

Dari data di atas, dapat kita simpulkan bahwa penjualan Canon imageCLASS 2200 Advanced Copier memiliki profit tertinggi sedangkan barang yang paling banyak dibeli oleh consumer adalah staples.

profit <- retail[retail$Category==retail$Category & retail$Profit <0,]
profit <- profit[,c(7,10,15)]
profit <- profit[order(profit$Profit, decreasing = F),]
ggplot(profit, aes(y = Sub.Category, x = Profit)) + geom_col(aes(fill = Segment)) 

Dari grafik di atas, terdapat informasi bahwa binders menyebabkan kerugian yang paling besar pada segmen consumer.