LBB PROGRAMMING FOR DATA SCIENCE

1. Explanation

1.1 Brief

Hi !! Welcome to my Rmd :) in this LBB i will use previous data which is retail.csv

1.2 Data’s POV

This data is contains of sales record included the profit or loss from a retail company (i named it PT. Elips). PT.Elips is a global retail company who sells Office Supplies, Furniture and Technology stuff. The buyers are coming from different segment such as Corporate, Home Office until individual consumer. This company also could provide the best option for your delivery which suitable with your needs.

Note : Detail explanatory will be given at the end of content

retail <- read.csv("retail.csv")

My Input Data is done! Now let’s check our data!

2.1 Data Inspection

head(retail)
##         Order.ID Order.Date Ship.Date      Ship.Mode Customer.ID   Segment
## 1 CA-2016-152156    11/8/16  11/11/16   Second Class    CG-12520  Consumer
## 2 CA-2016-152156    11/8/16  11/11/16   Second Class    CG-12520  Consumer
## 3 CA-2016-138688    6/12/16   6/16/16   Second Class    DV-13045 Corporate
## 4 US-2015-108966   10/11/15  10/18/15 Standard Class    SO-20335  Consumer
## 5 US-2015-108966   10/11/15  10/18/15 Standard Class    SO-20335  Consumer
## 6 CA-2014-115812     6/9/14   6/14/14 Standard Class    BH-11710  Consumer
##        Product.ID        Category Sub.Category
## 1 FUR-BO-10001798       Furniture    Bookcases
## 2 FUR-CH-10000454       Furniture       Chairs
## 3 OFF-LA-10000240 Office Supplies       Labels
## 4 FUR-TA-10000577       Furniture       Tables
## 5 OFF-ST-10000760 Office Supplies      Storage
## 6 FUR-FU-10001487       Furniture  Furnishings
##                                                       Product.Name    Sales
## 1                                Bush Somerset Collection Bookcase 261.9600
## 2      Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3        Self-Adhesive Address Labels for Typewriters by Universal  14.6200
## 4                    Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5                                   Eldon Fold 'N Roll Cart System  22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood  48.8600
##   Quantity Discount    Profit
## 1        2     0.00   41.9136
## 2        3     0.00  219.5820
## 3        2     0.00    6.8714
## 4        5     0.45 -383.0310
## 5        2     0.20    2.5164
## 6        7     0.00   14.1694
tail(retail)
##            Order.ID Order.Date Ship.Date      Ship.Mode Customer.ID   Segment
## 9989 CA-2017-163629   11/17/17  11/21/17 Standard Class    RA-19885 Corporate
## 9990 CA-2014-110422    1/21/14   1/23/14   Second Class    TB-21400  Consumer
## 9991 CA-2017-121258    2/26/17    3/3/17 Standard Class    DB-13060  Consumer
## 9992 CA-2017-121258    2/26/17    3/3/17 Standard Class    DB-13060  Consumer
## 9993 CA-2017-121258    2/26/17    3/3/17 Standard Class    DB-13060  Consumer
## 9994 CA-2017-119914     5/4/17    5/9/17   Second Class    CC-12220  Consumer
##           Product.ID        Category Sub.Category
## 9989 TEC-PH-10004006      Technology       Phones
## 9990 FUR-FU-10001889       Furniture  Furnishings
## 9991 FUR-FU-10000747       Furniture  Furnishings
## 9992 TEC-PH-10003645      Technology       Phones
## 9993 OFF-PA-10004041 Office Supplies        Paper
## 9994 OFF-AP-10002684 Office Supplies   Appliances
##                                                                   Product.Name
## 9989                                           Panasonic KX - TS880B Telephone
## 9990                                                    Ultra Door Pull Handle
## 9991                        Tenex B1-RE Series Chair Mats for Low Pile Carpets
## 9992                                                     Aastra 57i VoIP phone
## 9993                         It's Hot Message Books with Stickers, 2 3/4" x 5"
## 9994 Acco 7-Outlet Masterpiece Power Center, Wihtout Fax/Phone Line Protection
##        Sales Quantity Discount  Profit
## 9989 206.100        5      0.0 55.6470
## 9990  25.248        3      0.2  4.1028
## 9991  91.960        2      0.0 15.6332
## 9992 258.576        2      0.2 19.3932
## 9993  29.600        4      0.0 13.3200
## 9994 243.160        2      0.0 72.9480
dim(retail)
## [1] 9994   14
names(retail)
##  [1] "Order.ID"     "Order.Date"   "Ship.Date"    "Ship.Mode"    "Customer.ID" 
##  [6] "Segment"      "Product.ID"   "Category"     "Sub.Category" "Product.Name"
## [11] "Sales"        "Quantity"     "Discount"     "Profit"

From our inspection we can conclude : * retail data contain 9994 of rows and 14 of coloumns * Each of column name : “Order.ID”, “Order.Date”, “Ship.Date”, “Ship.Mode”, “Customer.ID”, “Segment”,“Product.ID”, “Category”, “Sub.Category”, “Product.Name”, “Sales”, “Quantity”, “Discount”, “Profit”

2.2 Data Cleansing & Coertions

Check data type for each column

str(retail)
## 'data.frame':    9994 obs. of  14 variables:
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
##  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

From this result, we find some of data type not in the corect type. we need to convert it into corect type (data coertion)

retail$Order.ID<- as.character (retail$Order.ID)
retail$Order.Date<- as.Date(retail$Order.Date, "%m/%d/%y")
retail$Ship.Date <- as.Date(retail$Ship.Date, "%m/%d/%y")
retail$Customer.ID<-as.character(retail$Customer.ID)
retail$Product.ID<- as.character(retail$Product.ID)
retail$Product.Name<-as.character(retail$Product.Name)
retail$Segment <- as.factor(retail$Segment)
retail$Ship.Mode <- as.factor(retail$Ship.Mode)
retail$Category <- as.factor(retail$Category)
retail$Sub.Category <- as.factor(retail$Sub.Category)

str(retail)
## 'data.frame':    9994 obs. of  14 variables:
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
##  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
##  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
##  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Each of column has already changed into desired data type

Cek for missing value

colSums(is.na(retail))
##     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID      Segment 
##            0            0            0            0            0            0 
##   Product.ID     Category Sub.Category Product.Name        Sales     Quantity 
##            0            0            0            0            0            0 
##     Discount       Profit 
##            0            0

alright!! there are no missing values!!

Do subset do delete some column (1&2 because we dont need the informations). then save it into ‘Elips’ variable

retail_clean <- retail[,c(2:14)]
head(retail_clean)
##   Order.Date  Ship.Date      Ship.Mode Customer.ID   Segment      Product.ID
## 1 2016-11-08 2016-11-11   Second Class    CG-12520  Consumer FUR-BO-10001798
## 2 2016-11-08 2016-11-11   Second Class    CG-12520  Consumer FUR-CH-10000454
## 3 2016-06-12 2016-06-16   Second Class    DV-13045 Corporate OFF-LA-10000240
## 4 2015-10-11 2015-10-18 Standard Class    SO-20335  Consumer FUR-TA-10000577
## 5 2015-10-11 2015-10-18 Standard Class    SO-20335  Consumer OFF-ST-10000760
## 6 2014-06-09 2014-06-14 Standard Class    BH-11710  Consumer FUR-FU-10001487
##          Category Sub.Category
## 1       Furniture    Bookcases
## 2       Furniture       Chairs
## 3 Office Supplies       Labels
## 4       Furniture       Tables
## 5 Office Supplies      Storage
## 6       Furniture  Furnishings
##                                                       Product.Name    Sales
## 1                                Bush Somerset Collection Bookcase 261.9600
## 2      Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3        Self-Adhesive Address Labels for Typewriters by Universal  14.6200
## 4                    Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5                                   Eldon Fold 'N Roll Cart System  22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood  48.8600
##   Quantity Discount    Profit
## 1        2     0.00   41.9136
## 2        3     0.00  219.5820
## 3        2     0.00    6.8714
## 4        5     0.45 -383.0310
## 5        2     0.20    2.5164
## 6        7     0.00   14.1694

Now, retail dataset is ready to be processed and analyzed!

3. Data Explanation

So here is the brief explanation of our data

summary(retail_clean)
##    Order.Date           Ship.Date                   Ship.Mode   
##  Min.   :2014-01-03   Min.   :2014-01-07   First Class   :1538  
##  1st Qu.:2015-05-23   1st Qu.:2015-05-27   Same Day      : 543  
##  Median :2016-06-26   Median :2016-06-29   Second Class  :1945  
##  Mean   :2016-04-30   Mean   :2016-05-03   Standard Class:5968  
##  3rd Qu.:2017-05-14   3rd Qu.:2017-05-18                        
##  Max.   :2017-12-30   Max.   :2018-01-05                        
##                                                                 
##  Customer.ID               Segment      Product.ID       
##  Length:9994        Consumer   :5191   Length:9994       
##  Class :character   Corporate  :3020   Class :character  
##  Mode  :character   Home Office:1783   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##             Category         Sub.Category  Product.Name      
##  Furniture      :2121   Binders    :1523   Length:9994       
##  Office Supplies:6026   Paper      :1370   Class :character  
##  Technology     :1847   Furnishings: 957   Mode  :character  
##                         Phones     : 889                     
##                         Storage    : 846                     
##                         Art        : 796                     
##                         (Other)    :3613                     
##      Sales              Quantity        Discount          Profit         
##  Min.   :    0.444   Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
##  1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
##  Median :   54.490   Median : 3.00   Median :0.2000   Median :    8.666  
##  Mean   :  229.858   Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
##  3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
##  Max.   :22638.480   Max.   :14.00   Max.   :0.8000   Max.   : 8399.976  
## 

Summary :

  1. First order occured in Jan 2014
  2. Standart Class was the most popular one for shipping option and Same day delivery was the lowest one
  3. retail.csv has the most buyers coming from consumer segment and home office segment was the lowest
  4. Office Supply category was the best seller compare to others
  5. In Sub category, Other stuff was the most favourite one then followed by Binders and paper. Art is the lowest
  6. Retail csv. gained the average sales at 229.858; with max sales value at 22638.480 and minumum sales value at 0.444
  7. Max quantity order was 14 within 4 years but average at 3.79
  8. retail csv. max amount of discount was 80%, but mean number of discount around 15%
  9. retail csv. ever had loss at 6599.978 and gained 8399.976 as their profit with average profit at 28.657

Check the Outlier within profit

aggregate(Profit~Segment,retail_clean,mean)
##       Segment   Profit
## 1    Consumer 25.83687
## 2   Corporate 30.45667
## 3 Home Office 33.81866
aggregate(Profit~Segment,retail_clean,var)
##       Segment   Profit
## 1    Consumer 58851.86
## 2   Corporate 53817.63
## 3 Home Office 45105.89
aggregate(Profit~Segment,retail_clean,sd)
##       Segment   Profit
## 1    Consumer 242.5940
## 2   Corporate 231.9863
## 3 Home Office 212.3815
boxplot(retail_clean$Profit)

From result above, we find posibilities for the outliers, but from our calculation, Sd value is around 200 and its still can be tolerated, so the process may continue.

4. Data Manipulation & Transformation

  1. Which category gives the lowest sales number? how much profit?
retail_clean[retail_clean$Sales == 0.444,]
##      Order.Date  Ship.Date      Ship.Mode Customer.ID  Segment      Product.ID
## 4102 2017-06-19 2017-06-23 Standard Class    ZC-21910 Consumer OFF-AP-10002906
##             Category Sub.Category
## 4102 Office Supplies   Appliances
##                                                                    Product.Name
## 4102 Hoover Replacement Belt for Commercial Guardsman Heavy-Duty Upright Vacuum
##      Sales Quantity Discount Profit
## 4102 0.444        1      0.8  -1.11

Answer : Sales 0.444 comes from segment ‘consumer’, office supplies, and got loss 1.11

  1. Which segment buying the most office suppies?
offsup <- retail_clean[retail_clean$Category=="Office Supplies",]
round(prop.table(table(offsup$Segment))*100,2)
## 
##    Consumer   Corporate Home Office 
##       51.89       30.20       17.91

Answer : consumer segment –> 51.89%

  1. How many times buyers get the highest discount 80% and when it happened?
disc <- retail_clean[retail_clean$Discount == 0.8,]
(sort(table(disc$Order.Date),decreasing = T)[1])
## 2014-07-26 
##          4

Answer : 80% discount was on July 2014, happened 4 times

  1. Related to number 3, who was the buyers?
disc_buyer <- disc[disc$Order.Date=="2014-07-26",]
(table(disc_buyer$Segment))
## 
##    Consumer   Corporate Home Office 
##           4           0           0

Answer : all buyer was coming from consumer segment

  1. How much total of quantity order from each segment and category, and which is the highest?
xtabs(Quantity~Segment+Category,retail_clean)
##              Category
## Segment       Furniture Office Supplies Technology
##   Consumer         4166           11758       3597
##   Corporate        2495            7018       2095
##   Home Office      1367            4130       1247
plot(xtabs(Quantity~Segment+Category,retail_clean))

heatmap(xtabs(Quantity~Segment+Category, retail_clean), cexRow = 0.8, cexCol = 0.8, scale = "column", Colv = NA, Rowv = NA)

Answer : Based on result above: segmen consumer with category office supply make the highest quantity order with sum 11758

  1. Median of quantity for every segment and category?
aggregate(Quantity~Segment+Category,retail_clean,median)
##       Segment        Category Quantity
## 1    Consumer       Furniture        3
## 2   Corporate       Furniture        3
## 3 Home Office       Furniture        3
## 4    Consumer Office Supplies        3
## 5   Corporate Office Supplies        3
## 6 Home Office Office Supplies        3
## 7    Consumer      Technology        3
## 8   Corporate      Technology        3
## 9 Home Office      Technology        3

Answer : Median of quantity = 3

  1. How much Total quantity order since PT. Elips start selling the product?
sum(retail_clean$Quantity)
## [1] 37873

Answer : 37873 items

  1. How much profit in 4 years?
sum(retail_clean$Profit)
## [1] 286397

Answer : 286397

  1. Discount Percentage (0%) (0-20%) (>20%)
DisP<- retail_clean[retail_clean$Discount== 0,]
knitr::kable(table(DisP$Discount))
Var1 Freq
0 4798
DisP_new <- retail_clean[retail_clean$Discount>0 & retail_clean$Discount<=0.8,]
knitr::kable(table(DisP_new$Discount))
Var1 Freq
0.1 94
0.15 52
0.2 3657
0.3 227
0.32 27
0.4 206
0.45 11
0.5 66
0.6 138
0.7 418
0.8 300

Answer : based on this data, half ammount of data was seeling without any discount given

  1. Category which generate the highest profit?
xtabs(Profit~ Category, retail_clean)
## Category
##       Furniture Office Supplies      Technology 
##        18451.27       122490.80       145454.95

Answer : Technology

  1. Highest Loss in which transaction?
retail_clean[retail_clean$Profit==-6599.978,]
##      Order.Date  Ship.Date      Ship.Mode Customer.ID  Segment      Product.ID
## 7773 2016-11-25 2016-12-02 Standard Class    CS-12505 Consumer TEC-MA-10000418
##        Category Sub.Category                              Product.Name    Sales
## 7773 Technology     Machines Cubify CubeX 3D Printer Double Head Print 4499.985
##      Quantity Discount    Profit
## 7773        5      0.7 -6599.978

Answer : happened in 2016, was selling 5 items of machines, discount 70%

  1. Highest Profit in Segment?
xtabs(Profit~ Segment, retail_clean)
## Segment
##    Consumer   Corporate Home Office 
##   134119.21    91979.13    60298.68

Answer : Consumer was given the higest profit compare to other segment

  1. Sales number based on category ?
xtabs(Sales~ Category, retail_clean)
## Category
##       Furniture Office Supplies      Technology 
##        741999.8        719047.0        836154.0

Answer : Technology have the highest selling price compare to other

  1. Profit based on Ship Mode + segment?
xtabs(Profit~Segment+Ship.Mode,retail_clean)
##              Ship.Mode
## Segment       First Class  Same Day Second Class Standard Class
##   Consumer      21374.044  9874.205    24946.911      77924.049
##   Corporate     14464.472  1818.142    18225.713      57470.807
##   Home Office   13131.324  4199.412    14274.011      28693.932
plot(xtabs(Profit~Segment+Ship.Mode,retail_clean))

Answer : Most of buyer was choosing Standar Class for shipping options. Among them consumer was the higest one

  1. The highest quantity seller based on Category
xtabs(Quantity~Category,retail_clean)
## Category
##       Furniture Office Supplies      Technology 
##            8028           22906            6939

Answer : Office Supply

5 Explanatory Text & Business Recomendation

In 4 years, since bigining of 2014-2017 retail. csv have sold total 37873 items with total profit 286397. This company selling 3 main category : Furniture, Office supplies and Technology. From this category, ‘Technology’ generate the highest profit with value 145454.95 compare to others but ironically, Highest loss with ammount 6599.987 happened in 2016 was coming from ‘Technology’ category as well. Apart from that, Technology categoru has the highest selling price.

Buyer who buying the product was coming from different segment, such as : Consumer, Corporate and Home Office. ‘Consumer’ was giving the higest ammount of profit compare to other with 134119.21. Shipping option consist of 4 types : First Class, Same day, Second class, and Standart class. Based on data, from all segment, most of them are choosing Standart Class as their shipping option. Consumer segment is the most highest segment who choosing Standart class mode.

Recomendations :

From our calculation above, showing that Technology was selling with highest sales value but contradictory max loss was coming from Technology section as well. This is caused because discound given was too hight (80%) without think about the provit.Consider it, No discount given without making some profit!! Calculate twice before deciding how many percentage of discount would be given Need to increase market target at Corporate segment and home office segment, we may give them more discount with taking back the profit to company Make sure all category was safely supplied on the wareshouse specially for ffice supply items.Based on our calculation, buyers mostly buyed office supply category with 22906, it almost 2-3 times higer from other category