# clear-up the environment
rm(list = ls())

# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

options(scipen = 123)

1 Introduction

this is Learn by building project with team Algoritma. i will use simple base R to process data. i hope you enjoy it.

2 Data input

we will read the data to see any usefull information

retail <- read.csv("retail.csv", stringsAsFactors = T) # with this we change all character type data into categorical

head(retail)
str(retail)
#> 'data.frame':    9994 obs. of  15 variables:
#>  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Order.ID    : Factor w/ 5009 levels "CA-2014-100006",..: 2501 2501 2297 4373 4373 202 202 202 202 202 ...
#>  $ Order.Date  : Factor w/ 1237 levels "1/1/17","1/10/14",..: 305 305 836 94 94 922 922 922 922 922 ...
#>  $ Ship.Date   : Factor w/ 1334 levels "1/1/15","1/1/16",..: 220 220 907 129 129 897 897 897 897 897 ...
#>  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#>  $ Customer.ID : Factor w/ 793 levels "AA-10315","AA-10375",..: 144 144 240 706 706 89 89 89 89 89 ...
#>  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#>  $ Product.ID  : Factor w/ 1862 levels "FUR-BO-10000112",..: 13 56 947 320 1317 186 563 1762 795 438 ...
#>  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#>  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#>  $ Product.Name: Factor w/ 1850 levels "\"While you Were Out\" Message Book, One Form per Page",..: 387 833 1440 367 574 570 1137 1099 535 295 ...
#>  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
#>  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
#>  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#>  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

in here. we ca still see data type that still incorrect. - Order.Date and Ship.Date must become date type

retail$Order.Date <- as.Date(retail$Order.Date, format = "%m/%d/%y")

retail$Ship.Date <- as.Date(retail$Ship.Date, format = "%m/%d/%y")
dim(retail)
#> [1] 9994   15

in here, we gain insight that data have 15 columns and 9994 row

anyNA(retail)
#> [1] FALSE

we found no missing value too. so data can be use already with any manipulation.

summary(retail)
#>      Row.ID               Order.ID      Order.Date           Ship.Date         
#>  Min.   :   1   CA-2017-100111:  14   Min.   :2014-01-03   Min.   :2014-01-07  
#>  1st Qu.:2499   CA-2017-157987:  12   1st Qu.:2015-05-23   1st Qu.:2015-05-27  
#>  Median :4998   CA-2016-165330:  11   Median :2016-06-26   Median :2016-06-29  
#>  Mean   :4998   US-2016-108504:  11   Mean   :2016-04-30   Mean   :2016-05-03  
#>  3rd Qu.:7496   CA-2015-131338:  10   3rd Qu.:2017-05-14   3rd Qu.:2017-05-18  
#>  Max.   :9994   CA-2016-105732:  10   Max.   :2017-12-30   Max.   :2018-01-05  
#>                 (Other)       :9926                                            
#>           Ship.Mode      Customer.ID          Segment    
#>  First Class   :1538   WB-21850:  37   Consumer   :5191  
#>  Same Day      : 543   JL-15835:  34   Corporate  :3020  
#>  Second Class  :1945   MA-17560:  34   Home Office:1783  
#>  Standard Class:5968   PP-18955:  34                     
#>                        CK-12205:  32                     
#>                        EH-13765:  32                     
#>                        (Other) :9791                     
#>            Product.ID              Category         Sub.Category 
#>  OFF-PA-10001970:  19   Furniture      :2121   Binders    :1523  
#>  TEC-AC-10003832:  18   Office Supplies:6026   Paper      :1370  
#>  FUR-FU-10004270:  16   Technology     :1847   Furnishings: 957  
#>  FUR-CH-10001146:  15                          Phones     : 889  
#>  FUR-CH-10002647:  15                          Storage    : 846  
#>  TEC-AC-10002049:  15                          Art        : 796  
#>  (Other)        :9896                          (Other)    :3613  
#>                      Product.Name      Sales              Quantity    
#>  Staple envelope           :  48   Min.   :    0.444   Min.   : 1.00  
#>  Easy-staple paper         :  46   1st Qu.:   17.280   1st Qu.: 2.00  
#>  Staples                   :  46   Median :   54.490   Median : 3.00  
#>  Avery Non-Stick Binders   :  20   Mean   :  229.858   Mean   : 3.79  
#>  Staples in misc. colors   :  19   3rd Qu.:  209.940   3rd Qu.: 5.00  
#>  KI Adjustable-Height Table:  18   Max.   :22638.480   Max.   :14.00  
#>  (Other)                   :9797                                      
#>     Discount          Profit         
#>  Min.   :0.0000   Min.   :-6599.978  
#>  1st Qu.:0.0000   1st Qu.:    1.729  
#>  Median :0.2000   Median :    8.666  
#>  Mean   :0.1562   Mean   :   28.657  
#>  3rd Qu.:0.2000   3rd Qu.:   29.364  
#>  Max.   :0.8000   Max.   : 8399.976  
#> 

in here we gain information: 1. earliest Order date is 2014-01-03 and latest is 2017-12-30 2. there 4 ship mode : “First class”, “same day”, “second class”, “standard class” 3. there 3 segment in retail data : “Consumer”, “corporate”, “home office” 4. there are 3 product category : “Furniture”, “office supplies”, “technology” 5. there are min = 0.444 sales and max = 22638.480 sales 6. discount given to customer min = 0 and max = 0.8 percent 7. maximum profit gain = 8399.976 and min profit = -6599.978

boxplot(Profit ~ Category, data = retail)

boxplot(Profit ~ Category, data = retail, outline = F)

in this plot 1 all data with outline and 2nd lot without outline. we can see there is outliers and huge amount too.

aggregate(Profit~Category, retail, sd)
aggregate(Profit~Category, retail, mean)
sd(retail$Profit)
#> [1] 234.2601
mean(retail$Profit)
#> [1] 28.6569

interpretation:

  • mean 28.67
  • most of the data, 67% of the data is between 28.67 - 234 and 28.67 + 234
  • most of the data, 96% of the data is between 28.67 - 2 x 234 and 28.67 + 2 x 234
  • most of the data, 99% of the data is between 28.67 - 3 x 234 and 28.67 + 3 x 234
cor(retail$Sales, retail$Profit)
#> [1] 0.4790643

in here. we found weak correlation between sales and profit. it mean, there are another factor that can boost profit

3 Data Insight

we want to search information that can be use to help us in expanding our business

table(retail$Ship.Mode, retail$Segment)
#>                 
#>                  Consumer Corporate Home Office
#>   First Class         769       485         284
#>   Same Day            317       114         112
#>   Second Class       1020       609         316
#>   Standard Class     3085      1812        1071

in here, we can see which category using ship mode.

retail[retail$Sales == 0.444,]
retail[retail$Sales == 22638.480,]

this both set data show what item that have minimum and maximum sales.

aggregate(Profit~Category, retail, sum)

this data show Total profit from each category

aggregate(Profit~Category, retail, max)

this data show maximum profit in each category

aggregate(Profit~Category, retail, min)

this data show minimum profit in each category

aggregate(Profit ~ Sub.Category, retail, sum)

in here, we can see gain profit from each Sub.category.

4 Conclusion

this is example of what we can do with only base R. we can extract so many more information. depends on requirement. we can gain suitable insight that can fullfil our need.

my LBB end here. i hope it can help the other to gain information.

thank you