Progamming for Data Science using Retail dataset

Explanation

hello everyone, on this occasion I will perform data subsetting using retail.csv

first we will import all the libraries that are used

library(tidyverse)

and then read our retail.csv

retail <- read.csv("retail.csv")

Data Inspection

head(retail)

##   Row.ID       Order.ID Order.Date Ship.Date      Ship.Mode Customer.ID
## 1      1 CA-2016-152156    11/8/16  11/11/16   Second Class    CG-12520
## 2      2 CA-2016-152156    11/8/16  11/11/16   Second Class    CG-12520
## 3      3 CA-2016-138688    6/12/16   6/16/16   Second Class    DV-13045
## 4      4 US-2015-108966   10/11/15  10/18/15 Standard Class    SO-20335
## 5      5 US-2015-108966   10/11/15  10/18/15 Standard Class    SO-20335
## 6      6 CA-2014-115812     6/9/14   6/14/14 Standard Class    BH-11710
##     Segment      Product.ID        Category Sub.Category
## 1  Consumer FUR-BO-10001798       Furniture    Bookcases
## 2  Consumer FUR-CH-10000454       Furniture       Chairs
## 3 Corporate OFF-LA-10000240 Office Supplies       Labels
## 4  Consumer FUR-TA-10000577       Furniture       Tables
## 5  Consumer OFF-ST-10000760 Office Supplies      Storage
## 6  Consumer FUR-FU-10001487       Furniture  Furnishings
##                                                       Product.Name    Sales
## 1                                Bush Somerset Collection Bookcase 261.9600
## 2      Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3        Self-Adhesive Address Labels for Typewriters by Universal  14.6200
## 4                    Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5                                   Eldon Fold 'N Roll Cart System  22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood  48.8600
##   Quantity Discount    Profit
## 1        2     0.00   41.9136
## 2        3     0.00  219.5820
## 3        2     0.00    6.8714
## 4        5     0.45 -383.0310
## 5        2     0.20    2.5164
## 6        7     0.00   14.1694

tail(retail)

##      Row.ID       Order.ID Order.Date Ship.Date      Ship.Mode Customer.ID
## 9989   9989 CA-2017-163629   11/17/17  11/21/17 Standard Class    RA-19885
## 9990   9990 CA-2014-110422    1/21/14   1/23/14   Second Class    TB-21400
## 9991   9991 CA-2017-121258    2/26/17    3/3/17 Standard Class    DB-13060
## 9992   9992 CA-2017-121258    2/26/17    3/3/17 Standard Class    DB-13060
## 9993   9993 CA-2017-121258    2/26/17    3/3/17 Standard Class    DB-13060
## 9994   9994 CA-2017-119914     5/4/17    5/9/17   Second Class    CC-12220
##        Segment      Product.ID        Category Sub.Category
## 9989 Corporate TEC-PH-10004006      Technology       Phones
## 9990  Consumer FUR-FU-10001889       Furniture  Furnishings
## 9991  Consumer FUR-FU-10000747       Furniture  Furnishings
## 9992  Consumer TEC-PH-10003645      Technology       Phones
## 9993  Consumer OFF-PA-10004041 Office Supplies        Paper
## 9994  Consumer OFF-AP-10002684 Office Supplies   Appliances
##                                                                   Product.Name
## 9989                                           Panasonic KX - TS880B Telephone
## 9990                                                    Ultra Door Pull Handle
## 9991                        Tenex B1-RE Series Chair Mats for Low Pile Carpets
## 9992                                                     Aastra 57i VoIP phone
## 9993                         It's Hot Message Books with Stickers, 2 3/4" x 5"
## 9994 Acco 7-Outlet Masterpiece Power Center, Wihtout Fax/Phone Line Protection
##        Sales Quantity Discount  Profit
## 9989 206.100        5      0.0 55.6470
## 9990  25.248        3      0.2  4.1028
## 9991  91.960        2      0.0 15.6332
## 9992 258.576        2      0.2 19.3932
## 9993  29.600        4      0.0 13.3200
## 9994 243.160        2      0.0 72.9480

Data Cleansing & Coertions

we must know all type of our data

str(retail)

## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : Factor w/ 5009 levels "CA-2014-100006",..: 2501 2501 2297 4373 4373 202 202 202 202 202 ...
##  $ Order.Date  : Factor w/ 1237 levels "1/1/17","1/10/14",..: 305 305 836 94 94 922 922 922 922 922 ...
##  $ Ship.Date   : Factor w/ 1334 levels "1/1/15","1/1/16",..: 220 220 907 129 129 897 897 897 897 897 ...
##  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
##  $ Customer.ID : Factor w/ 793 levels "AA-10315","AA-10375",..: 144 144 240 706 706 89 89 89 89 89 ...
##  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Product.ID  : Factor w/ 1862 levels "FUR-BO-10000112",..: 13 56 947 320 1317 186 563 1762 795 438 ...
##  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
##  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
##  $ Product.Name: Factor w/ 1850 levels "\"While you Were Out\" Message Book, One Form per Page",..: 387 833 1440 367 574 570 1137 1099 535 295 ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

There are many columns do not match with data type should be, so we will fix it

retail$Order.ID = as.character(retail$Order.ID)
retail$Order.Date = as.Date(retail$Order.Date, "%m/%d/%y")
retail$Ship.Date = as.Date(retail$Ship.Date, "%m/%d/%y")
retail$Customer.ID = as.character(retail$Customer.ID)
retail$Product.ID = as.character(retail$Product.ID)
retail$Product.Name = as.character(retail$Product.Name)

next step is check missing values

colSums(is.na(retail))

##       Row.ID     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID 
##            0            0            0            0            0            0 
##      Segment   Product.ID     Category Sub.Category Product.Name        Sales 
##            0            0            0            0            0            0 
##     Quantity     Discount       Profit 
##            0            0            0

Data Exploration

We want to know proportion each Segment by Ship.Mode

table(retail$Ship.Mode,retail$Segment) %>% 
  prop.table() %>% 
  round(2)

##                 
##                  Consumer Corporate Home Office
##   First Class        0.08      0.05        0.03
##   Same Day           0.03      0.01        0.01
##   Second Class       0.10      0.06        0.03
##   Standard Class     0.31      0.18        0.11

Insight : Consumer in Standard Class have a high proportion

We want to know total row the name of the product which contains the following name (Sony, Cisco, Xerox)

sony = retail[grepl(pattern = "sony", retail$Product.Name, ignore.case = T),]
cisco = retail[grepl(pattern = "cisco", retail$Product.Name, ignore.case = T),]
xerox = retail[grepl(pattern = "xerox", retail$Product.Name, ignore.case = T),]

data = rbind(sony,cisco,xerox)
nrow(data)

## [1] 958

prop.table(table(data$Sub.Category)) %>% round(2)

## 
## Accessories  Appliances         Art     Binders   Bookcases      Chairs 
##        0.04        0.00        0.00        0.00        0.00        0.00 
##     Copiers   Envelopes   Fasteners Furnishings      Labels    Machines 
##        0.00        0.00        0.00        0.00        0.00        0.01 
##       Paper      Phones     Storage    Supplies      Tables 
##        0.90        0.05        0.00        0.00        0.00

set.seed(1043)
data %>%
    group_by(Sub.Category) %>%
    slice_sample(Product.Name, n = 3) %>% 
    select(Sub.Category, Product.Name)

## # A tibble: 12 x 2
## # Groups:   Sub.Category [4]
##    Sub.Category Product.Name                                         
##    <fct>        <chr>                                                
##  1 Accessories  Sony 64GB Class 10 Micro SDHC R40 Memory Card        
##  2 Accessories  Sony Micro Vault Click 16 GB USB 2.0 Flash Drive     
##  3 Accessories  Sony Micro Vault Click 4 GB USB 2.0 Flash Drive      
##  4 Machines     Xerox WorkCentre 6505DN Laser Multifunction Printer  
##  5 Machines     Cisco SPA525G2 5-Line IP Phone                       
##  6 Machines     Cisco TelePresence System EX90 Videoconferencing Unit
##  7 Paper        Xerox 1949                                           
##  8 Paper        Xerox 1905                                           
##  9 Paper        Xerox 200                                            
## 10 Phones       Cisco Unified IP Phone 7945G VoIP phone              
## 11 Phones       Cisco SPA301                                         
## 12 Phones       Cisco Small Business SPA 502G VoIP phone

Insight : Xerox is a product of Paper that sells quite a lot

We want to know the Profit from selling Paper & Binders from retail dataset

retail %>%
    filter(Category == "Office Supplies" &
               Sub.Category %in% c("Binders","Paper")) %>%
    group_by(Sub.Category) %>%
    summarise(Profit = sum(Profit))

## # A tibble: 2 x 2
##   Sub.Category Profit
##   <fct>         <dbl>
## 1 Binders      30222.
## 2 Paper        34054.

Insight : Binders & Paper are Product with the most Quantity, these product give a decent Profit

We want to know total Quantity of Furniture sales by each Category

retail %>%
    filter(Category == "Furniture") %>%
    group_by(Segment, Sub.Category) %>%
    summarise(Quantity = sum(Quantity)) %>%
    arrange(-Quantity, .by_group= T)

## # A tibble: 12 x 3
## # Groups:   Segment [3]
##    Segment     Sub.Category Quantity
##    <fct>       <fct>           <int>
##  1 Consumer    Furnishings      1834
##  2 Consumer    Chairs           1234
##  3 Consumer    Tables            602
##  4 Consumer    Bookcases         496
##  5 Corporate   Furnishings      1086
##  6 Corporate   Chairs            719
##  7 Corporate   Tables            419
##  8 Corporate   Bookcases         271
##  9 Home Office Furnishings       643
## 10 Home Office Chairs            403
## 11 Home Office Tables            220
## 12 Home Office Bookcases         101

Insight : Each Sub.Category have a same pattern, which Furnishings always be highest Quantity, and Bookcase always be lowest Quantity

Last, we want to know top selling Sub.Category in “Technology” on Category Column

retail %>%
    filter(Category == "Technology") %>% 
    count(Sub.Category, name = "Jumlah")

##   Sub.Category Jumlah
## 1  Accessories    775
## 2      Copiers     68
## 3     Machines    115
## 4       Phones    889

Insight : Phones have highest selling in “Technology” Category

Conclusion

The retail dataset has a lot of information on the sale of various kinds of products, among these products that have the highest sales quantity are paper with the product names “Xerox” and Binders. As well as from all customer segments, for shipping more goods using standard classes, this is because the shipping price is more affordable