hello everyone, on this occasion I will perform data subsetting using retail.csv
first we will import all the libraries that are used
library(tidyverse)
and then read our retail.csv
retail <- read.csv("retail.csv")
head(retail)
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 1 1 CA-2016-152156 11/8/16 11/11/16 Second Class CG-12520
## 2 2 CA-2016-152156 11/8/16 11/11/16 Second Class CG-12520
## 3 3 CA-2016-138688 6/12/16 6/16/16 Second Class DV-13045
## 4 4 US-2015-108966 10/11/15 10/18/15 Standard Class SO-20335
## 5 5 US-2015-108966 10/11/15 10/18/15 Standard Class SO-20335
## 6 6 CA-2014-115812 6/9/14 6/14/14 Standard Class BH-11710
## Segment Product.ID Category Sub.Category
## 1 Consumer FUR-BO-10001798 Furniture Bookcases
## 2 Consumer FUR-CH-10000454 Furniture Chairs
## 3 Corporate OFF-LA-10000240 Office Supplies Labels
## 4 Consumer FUR-TA-10000577 Furniture Tables
## 5 Consumer OFF-ST-10000760 Office Supplies Storage
## 6 Consumer FUR-FU-10001487 Furniture Furnishings
## Product.Name Sales
## 1 Bush Somerset Collection Bookcase 261.9600
## 2 Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3 Self-Adhesive Address Labels for Typewriters by Universal 14.6200
## 4 Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5 Eldon Fold 'N Roll Cart System 22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.8600
## Quantity Discount Profit
## 1 2 0.00 41.9136
## 2 3 0.00 219.5820
## 3 2 0.00 6.8714
## 4 5 0.45 -383.0310
## 5 2 0.20 2.5164
## 6 7 0.00 14.1694
tail(retail)
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 9989 9989 CA-2017-163629 11/17/17 11/21/17 Standard Class RA-19885
## 9990 9990 CA-2014-110422 1/21/14 1/23/14 Second Class TB-21400
## 9991 9991 CA-2017-121258 2/26/17 3/3/17 Standard Class DB-13060
## 9992 9992 CA-2017-121258 2/26/17 3/3/17 Standard Class DB-13060
## 9993 9993 CA-2017-121258 2/26/17 3/3/17 Standard Class DB-13060
## 9994 9994 CA-2017-119914 5/4/17 5/9/17 Second Class CC-12220
## Segment Product.ID Category Sub.Category
## 9989 Corporate TEC-PH-10004006 Technology Phones
## 9990 Consumer FUR-FU-10001889 Furniture Furnishings
## 9991 Consumer FUR-FU-10000747 Furniture Furnishings
## 9992 Consumer TEC-PH-10003645 Technology Phones
## 9993 Consumer OFF-PA-10004041 Office Supplies Paper
## 9994 Consumer OFF-AP-10002684 Office Supplies Appliances
## Product.Name
## 9989 Panasonic KX - TS880B Telephone
## 9990 Ultra Door Pull Handle
## 9991 Tenex B1-RE Series Chair Mats for Low Pile Carpets
## 9992 Aastra 57i VoIP phone
## 9993 It's Hot Message Books with Stickers, 2 3/4" x 5"
## 9994 Acco 7-Outlet Masterpiece Power Center, Wihtout Fax/Phone Line Protection
## Sales Quantity Discount Profit
## 9989 206.100 5 0.0 55.6470
## 9990 25.248 3 0.2 4.1028
## 9991 91.960 2 0.0 15.6332
## 9992 258.576 2 0.2 19.3932
## 9993 29.600 4 0.0 13.3200
## 9994 243.160 2 0.0 72.9480
we must know all type of our data
str(retail)
## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : Factor w/ 5009 levels "CA-2014-100006",..: 2501 2501 2297 4373 4373 202 202 202 202 202 ...
## $ Order.Date : Factor w/ 1237 levels "1/1/17","1/10/14",..: 305 305 836 94 94 922 922 922 922 922 ...
## $ Ship.Date : Factor w/ 1334 levels "1/1/15","1/1/16",..: 220 220 907 129 129 897 897 897 897 897 ...
## $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
## $ Customer.ID : Factor w/ 793 levels "AA-10315","AA-10375",..: 144 144 240 706 706 89 89 89 89 89 ...
## $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Product.ID : Factor w/ 1862 levels "FUR-BO-10000112",..: 13 56 947 320 1317 186 563 1762 795 438 ...
## $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
## $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
## $ Product.Name: Factor w/ 1850 levels "\"While you Were Out\" Message Book, One Form per Page",..: 387 833 1440 367 574 570 1137 1099 535 295 ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
There are many columns do not match with data type should be, so we will fix it
retail$Order.ID = as.character(retail$Order.ID)
retail$Order.Date = as.Date(retail$Order.Date, "%m/%d/%y")
retail$Ship.Date = as.Date(retail$Ship.Date, "%m/%d/%y")
retail$Customer.ID = as.character(retail$Customer.ID)
retail$Product.ID = as.character(retail$Product.ID)
retail$Product.Name = as.character(retail$Product.Name)
next step is check missing values
colSums(is.na(retail))
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 0 0 0 0 0 0
## Segment Product.ID Category Sub.Category Product.Name Sales
## 0 0 0 0 0 0
## Quantity Discount Profit
## 0 0 0
table(retail$Ship.Mode,retail$Segment) %>%
prop.table() %>%
round(2)
##
## Consumer Corporate Home Office
## First Class 0.08 0.05 0.03
## Same Day 0.03 0.01 0.01
## Second Class 0.10 0.06 0.03
## Standard Class 0.31 0.18 0.11
Insight : Consumer in Standard Class have a high proportion
sony = retail[grepl(pattern = "sony", retail$Product.Name, ignore.case = T),]
cisco = retail[grepl(pattern = "cisco", retail$Product.Name, ignore.case = T),]
xerox = retail[grepl(pattern = "xerox", retail$Product.Name, ignore.case = T),]
data = rbind(sony,cisco,xerox)
nrow(data)
## [1] 958
prop.table(table(data$Sub.Category)) %>% round(2)
##
## Accessories Appliances Art Binders Bookcases Chairs
## 0.04 0.00 0.00 0.00 0.00 0.00
## Copiers Envelopes Fasteners Furnishings Labels Machines
## 0.00 0.00 0.00 0.00 0.00 0.01
## Paper Phones Storage Supplies Tables
## 0.90 0.05 0.00 0.00 0.00
set.seed(1043)
data %>%
group_by(Sub.Category) %>%
slice_sample(Product.Name, n = 3) %>%
select(Sub.Category, Product.Name)
## # A tibble: 12 x 2
## # Groups: Sub.Category [4]
## Sub.Category Product.Name
## <fct> <chr>
## 1 Accessories Sony 64GB Class 10 Micro SDHC R40 Memory Card
## 2 Accessories Sony Micro Vault Click 16 GB USB 2.0 Flash Drive
## 3 Accessories Sony Micro Vault Click 4 GB USB 2.0 Flash Drive
## 4 Machines Xerox WorkCentre 6505DN Laser Multifunction Printer
## 5 Machines Cisco SPA525G2 5-Line IP Phone
## 6 Machines Cisco TelePresence System EX90 Videoconferencing Unit
## 7 Paper Xerox 1949
## 8 Paper Xerox 1905
## 9 Paper Xerox 200
## 10 Phones Cisco Unified IP Phone 7945G VoIP phone
## 11 Phones Cisco SPA301
## 12 Phones Cisco Small Business SPA 502G VoIP phone
Insight : Xerox is a product of Paper that sells quite a lot
retail %>%
filter(Category == "Office Supplies" &
Sub.Category %in% c("Binders","Paper")) %>%
group_by(Sub.Category) %>%
summarise(Profit = sum(Profit))
## # A tibble: 2 x 2
## Sub.Category Profit
## <fct> <dbl>
## 1 Binders 30222.
## 2 Paper 34054.
Insight : Binders & Paper are Product with the most Quantity, these product give a decent Profit
retail %>%
filter(Category == "Furniture") %>%
group_by(Segment, Sub.Category) %>%
summarise(Quantity = sum(Quantity)) %>%
arrange(-Quantity, .by_group= T)
## # A tibble: 12 x 3
## # Groups: Segment [3]
## Segment Sub.Category Quantity
## <fct> <fct> <int>
## 1 Consumer Furnishings 1834
## 2 Consumer Chairs 1234
## 3 Consumer Tables 602
## 4 Consumer Bookcases 496
## 5 Corporate Furnishings 1086
## 6 Corporate Chairs 719
## 7 Corporate Tables 419
## 8 Corporate Bookcases 271
## 9 Home Office Furnishings 643
## 10 Home Office Chairs 403
## 11 Home Office Tables 220
## 12 Home Office Bookcases 101
Insight : Each Sub.Category have a same pattern, which Furnishings always be highest Quantity, and Bookcase always be lowest Quantity
retail %>%
filter(Category == "Technology") %>%
count(Sub.Category, name = "Jumlah")
## Sub.Category Jumlah
## 1 Accessories 775
## 2 Copiers 68
## 3 Machines 115
## 4 Phones 889
Insight : Phones have highest selling in “Technology” Category
The retail dataset has a lot of information on the sale of various kinds of products, among these products that have the highest sales quantity are paper with the product names “Xerox” and Binders. As well as from all customer segments, for shipping more goods using standard classes, this is because the shipping price is more affordable