1. DATA INTRODUCTION

On this occasion we will conduct simple exploratory data analysis of a sales data in a retail company.

The buyers are coming from different segment such as Corporate, Home Office until individual consumer. This company also could provide the best option for your delivery which suitable with your needs.

2. DATA PREPARATION

Input the data and store it in a variable named Retail.

Retail <- read.csv("data_input/retail.csv")

And now we can do data inspection and cleansing.

2.1. Data Inspection

Check if the saved data is correct.

head(Retail)

Inspect the data.

str(Retail)

## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
##  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

dim(Retail)

## [1] 9994   15

names(Retail)

##  [1] "Row.ID"       "Order.ID"     "Order.Date"   "Ship.Date"    "Ship.Mode"   
##  [6] "Customer.ID"  "Segment"      "Product.ID"   "Category"     "Sub.Category"
## [11] "Product.Name" "Sales"        "Quantity"     "Discount"     "Profit"

From our inspection we can conclude :
* Retail data contain 9994 of rows and 15 of coloumns.
* Each of column name :
01. “Row.ID”,
02. “Order.ID”,
03. “Order.Date”,
04. “Ship.Date”,
05. “Ship.Mode”,
06. “Customer.ID”,
07. “Segment”,
08. “Product.ID”,
09. “Category”,
10. “Sub.Category”,
11. “Product.Name”,
12. “Sales”,
13. “Quantity”,
14. “Discount”, 15. “Profit”.

2.2. Data Clenasing & Coertions

From the str() result, we find some of data type not in the corect type. we need to convert it into corect type (data coertion).

Retail$Row.ID <- as.character(Retail$Row.ID)
Retail$Order.Date<- as.Date(Retail$Order.Date, "%m/%d/%y")
Retail$Ship.Date <- as.Date(Retail$Ship.Date, "%m/%d/%y")
Retail$Ship.Mode <- as.factor(Retail$Ship.Mode)
Retail$Customer.ID<-as.character(Retail$Customer.ID)
Retail$Segment <- as.factor(Retail$Segment)
Retail$Product.ID<- as.character(Retail$Product.ID)
Retail$Category <- as.factor(Retail$Category    )
Retail$Sub.Category<-as.factor(Retail$Sub.Category)
Retail$Product.Name<-as.factor(Retail$Product.Name)
str(Retail)

## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : chr  "1" "2" "3" "4" ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
##  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
##  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
##  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
##  $ Product.Name: Factor w/ 1850 levels "\"While you Were Out\" Message Book, One Form per Page",..: 387 833 1440 367 574 570 1137 1099 535 295 ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Each of column already changed into desired data type

Now, we have to check for the missing value in the data.

colSums(is.na(Retail))

##       Row.ID     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID 
##            0            0            0            0            0            0 
##      Segment   Product.ID     Category Sub.Category Product.Name        Sales 
##            0            0            0            0            0            0 
##     Quantity     Discount       Profit 
##            0            0            0

anyNA(Retail)

## [1] FALSE

From the result above, now we know that there are no missing value in the Retail data.

We will do subsetting to delete some column (1&2 because we dont need the informations). then save it into Retail_new variable

Retail_new <- Retail[,c(3:15)]
head(Retail_new)

3. DATA EXPLORATION & VISUALIZATION

We will see a brief information from Retail_new data.

summary(Retail_new)

##    Order.Date           Ship.Date                   Ship.Mode   
##  Min.   :2014-01-03   Min.   :2014-01-07   First Class   :1538  
##  1st Qu.:2015-05-23   1st Qu.:2015-05-27   Same Day      : 543  
##  Median :2016-06-26   Median :2016-06-29   Second Class  :1945  
##  Mean   :2016-04-30   Mean   :2016-05-03   Standard Class:5968  
##  3rd Qu.:2017-05-14   3rd Qu.:2017-05-18                        
##  Max.   :2017-12-30   Max.   :2018-01-05                        
##                                                                 
##  Customer.ID               Segment      Product.ID       
##  Length:9994        Consumer   :5191   Length:9994       
##  Class :character   Corporate  :3020   Class :character  
##  Mode  :character   Home Office:1783   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##             Category         Sub.Category                      Product.Name 
##  Furniture      :2121   Binders    :1523   Staple envelope           :  48  
##  Office Supplies:6026   Paper      :1370   Easy-staple paper         :  46  
##  Technology     :1847   Furnishings: 957   Staples                   :  46  
##                         Phones     : 889   Avery Non-Stick Binders   :  20  
##                         Storage    : 846   Staples in misc. colors   :  19  
##                         Art        : 796   KI Adjustable-Height Table:  18  
##                         (Other)    :3613   (Other)                   :9797  
##      Sales              Quantity        Discount          Profit         
##  Min.   :    0.444   Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
##  1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
##  Median :   54.490   Median : 3.00   Median :0.2000   Median :    8.666  
##  Mean   :  229.858   Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
##  3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
##  Max.   :22638.480   Max.   :14.00   Max.   :0.8000   Max.   : 8399.976  
##

Summary :
1. First order occured in Jan 2014.
2. Standart Class was the most popular one for shipping option and Same day delivery was the lowest one.
3. The most buyers coming from consumer segment and home office segment was the lowest.
4. Office Supply category was the best seller compare to others.
5. In Sub category, other stuff was the most favourite one then followed by Binders and paper. Art is the lowest.
6. The average sales at 229.858; with max sales value at 22638.480 and minumum sales value at 0.444
7. Max quantity order was 14 within 4 years but average at 3.79.
8. Max amount of discount was 80%, but mean number of discount around 15%.
9. The ever had loss at 6599 and gained 8399 as their highest profit with average profit at 28.657.

Talking about “sales” then surely we will talking about “profit”.
So next we will do some aggregation using the Profit variable as the main reference with several other variables that are very related.
These variables include Category, Sub.category, & Product.Name.
Therefore we will do subsetting so that the data that appears is only data from these variables.

Retail_Profit <- Retail_new[ ,c("Category", "Sub.Category", "Product.Name", "Profit")]

And now we can carry out the above-mentioned aggression process along with visualizing the data.

3.1. Profit ~ Category

Retail_Profit_Category <- aggregate(Profit~Category,Retail_Profit,mean)

ggplot(data = Retail_Profit_Category, mapping = aes(x=Profit, y=Category )) +
  geom_col()

3.2. Profit ~ Sub.Category

Retail_Profit_Sub.Category <- aggregate(Profit~Sub.Category,Retail_Profit,mean)

ggplot(data = Retail_Profit_Sub.Category, mapping = aes(x=  Profit, y= Sub.Category )) +
  geom_col()

3.3. Profit ~ Product.Name

Retail_Profit_Product.Name <- aggregate(Profit~Product.Name,Retail_Profit,mean)
head(Retail_Profit_Product.Name[order(Retail_Profit_Product.Name$Profit, decreasing=T), ],5)

tail(Retail_Profit_Product.Name[order(Retail_Profit_Product.Name$Profit, decreasing=T), ],5)

4. DATA ANALYSIS

Based on the exploration of the data above, we can perform the following analysis:
1. The “technology” category is the category that generates the highest profit. And the furniture category is the one with the lowest profit among the three categories.
Suggestion : Prioritize adding product promotions to the furniture category but still maintaining the ongoing promotions for the other two categories.
2. In the sub-category for the type of “copier” product, it gets profit far above other sub-categories. Meanwhile, other sub-categories such as tables, supplies, & bookcases experienced losses.
Suggestion : Prioritize adding more product promotions to the other sub-category other than “copier” but still maintaining the ongoing promotions for the other two categories. And provide massive discounts for sales of sub-categories of products experiencing losses.
3. “Canon imageCLASS 2200 Advanced Copier” is a product that produces the highest profit among other products. However, there are some products as well that cause loss.
Suggestion : provide massive discounts for sales of products experiencing losses to minimize losses.

RETAIL - Data Analysis

Tubagus Fathul Arifin

2022-07-28