# clear-up the environment
rm(list = ls())
# chunk options
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>"
)
options(scipen = 123)this is Learn by building project with team Algoritma. i will use simple base R to process data. i hope you enjoy it.
we will read the data to see any usefull information
retail <- read.csv("retail.csv", stringsAsFactors = T) # with this we change all character type data into categorical
head(retail)str(retail)#> 'data.frame': 9994 obs. of 15 variables:
#> $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ Order.ID : Factor w/ 5009 levels "CA-2014-100006",..: 2501 2501 2297 4373 4373 202 202 202 202 202 ...
#> $ Order.Date : Factor w/ 1237 levels "1/1/17","1/10/14",..: 305 305 836 94 94 922 922 922 922 922 ...
#> $ Ship.Date : Factor w/ 1334 levels "1/1/15","1/1/16",..: 220 220 907 129 129 897 897 897 897 897 ...
#> $ Ship.Mode : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
#> $ Customer.ID : Factor w/ 793 levels "AA-10315","AA-10375",..: 144 144 240 706 706 89 89 89 89 89 ...
#> $ Segment : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
#> $ Product.ID : Factor w/ 1862 levels "FUR-BO-10000112",..: 13 56 947 320 1317 186 563 1762 795 438 ...
#> $ Category : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
#> $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
#> $ Product.Name: Factor w/ 1850 levels "\"While you Were Out\" Message Book, One Form per Page",..: 387 833 1440 367 574 570 1137 1099 535 295 ...
#> $ Sales : num 262 731.9 14.6 957.6 22.4 ...
#> $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
#> $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
#> $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
in here. we ca still see data type that still incorrect. - Order.Date and Ship.Date must become date type
retail$Order.Date <- as.Date(retail$Order.Date, format = "%m/%d/%y")
retail$Ship.Date <- as.Date(retail$Ship.Date, format = "%m/%d/%y")dim(retail)#> [1] 9994 15
in here, we gain insight that data have 15 columns and 9994 row
anyNA(retail)#> [1] FALSE
we found no missing value too. so data can be use already with any manipulation.
summary(retail)#> Row.ID Order.ID Order.Date Ship.Date
#> Min. : 1 CA-2017-100111: 14 Min. :2014-01-03 Min. :2014-01-07
#> 1st Qu.:2499 CA-2017-157987: 12 1st Qu.:2015-05-23 1st Qu.:2015-05-27
#> Median :4998 CA-2016-165330: 11 Median :2016-06-26 Median :2016-06-29
#> Mean :4998 US-2016-108504: 11 Mean :2016-04-30 Mean :2016-05-03
#> 3rd Qu.:7496 CA-2015-131338: 10 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
#> Max. :9994 CA-2016-105732: 10 Max. :2017-12-30 Max. :2018-01-05
#> (Other) :9926
#> Ship.Mode Customer.ID Segment
#> First Class :1538 WB-21850: 37 Consumer :5191
#> Same Day : 543 JL-15835: 34 Corporate :3020
#> Second Class :1945 MA-17560: 34 Home Office:1783
#> Standard Class:5968 PP-18955: 34
#> CK-12205: 32
#> EH-13765: 32
#> (Other) :9791
#> Product.ID Category Sub.Category
#> OFF-PA-10001970: 19 Furniture :2121 Binders :1523
#> TEC-AC-10003832: 18 Office Supplies:6026 Paper :1370
#> FUR-FU-10004270: 16 Technology :1847 Furnishings: 957
#> FUR-CH-10001146: 15 Phones : 889
#> FUR-CH-10002647: 15 Storage : 846
#> TEC-AC-10002049: 15 Art : 796
#> (Other) :9896 (Other) :3613
#> Product.Name Sales Quantity
#> Staple envelope : 48 Min. : 0.444 Min. : 1.00
#> Easy-staple paper : 46 1st Qu.: 17.280 1st Qu.: 2.00
#> Staples : 46 Median : 54.490 Median : 3.00
#> Avery Non-Stick Binders : 20 Mean : 229.858 Mean : 3.79
#> Staples in misc. colors : 19 3rd Qu.: 209.940 3rd Qu.: 5.00
#> KI Adjustable-Height Table: 18 Max. :22638.480 Max. :14.00
#> (Other) :9797
#> Discount Profit
#> Min. :0.0000 Min. :-6599.978
#> 1st Qu.:0.0000 1st Qu.: 1.729
#> Median :0.2000 Median : 8.666
#> Mean :0.1562 Mean : 28.657
#> 3rd Qu.:0.2000 3rd Qu.: 29.364
#> Max. :0.8000 Max. : 8399.976
#>
in here we gain information: 1. earliest Order date is 2014-01-03 and latest is 2017-12-30 2. there 4 ship mode : “First class”, “same day”, “second class”, “standard class” 3. there 3 segment in retail data : “Consumer”, “corporate”, “home office” 4. there are 3 product category : “Furniture”, “office supplies”, “technology” 5. there are min = 0.444 sales and max = 22638.480 sales 6. discount given to customer min = 0 and max = 0.8 percent 7. maximum profit gain = 8399.976 and min profit = -6599.978
boxplot(Profit ~ Category, data = retail)boxplot(Profit ~ Category, data = retail, outline = F) in this plot 1 all data with outline and 2nd lot without outline. we can see there is outliers and huge amount too.
aggregate(Profit~Category, retail, sd)aggregate(Profit~Category, retail, mean)sd(retail$Profit)#> [1] 234.2601
mean(retail$Profit)#> [1] 28.6569
interpretation:
cor(retail$Sales, retail$Profit)#> [1] 0.4790643
in here. we found weak correlation between sales and profit. it mean, there are another factor that can boost profit
we want to search information that can be use to help us in expanding our business
table(retail$Ship.Mode, retail$Segment)#>
#> Consumer Corporate Home Office
#> First Class 769 485 284
#> Same Day 317 114 112
#> Second Class 1020 609 316
#> Standard Class 3085 1812 1071
in here, we can see which category using ship mode.
retail[retail$Sales == 0.444,]retail[retail$Sales == 22638.480,]this both set data show what item that have minimum and maximum sales.
aggregate(Profit~Category, retail, sum)this data show Total profit from each category
aggregate(Profit~Category, retail, max)this data show maximum profit in each category
aggregate(Profit~Category, retail, min)this data show minimum profit in each category
aggregate(Profit ~ Sub.Category, retail, sum)in here, we can see gain profit from each Sub.category.
this is example of what we can do with only base R. we can extract so many more information. depends on requirement. we can gain suitable insight that can fullfil our need.
my LBB end here. i hope it can help the other to gain information.
thank you