1 Explanation

In this project, I will use data from retail.csv to do exploratory data analysis. This data contains sales records from a retail company which I named PT. Vards. PT.Vards is a global retail company selling Office Supplies, Furniture and Technology stuff. The buyers come from various segments such as Corporate, Home Office, and individual consumers. This company can also provide the best options for customer delivery according to customer needs.

1.1 Input Data

First, we need to read data

vards <- read.csv("data_input/retail.csv")

Next, we want to see top 10 data owned by PT.Vards

head(vards,10)

Then, we want to dimension of data

dim(vards)
## [1] 9994   15

this data contains 9994 rows and 15 columns

2 Data Cleansing

The first step in conducting data analysis is to ensure that the data to be used is clean. The initial stage we do is check data type for each column in the data

str(vards)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : chr  "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
##  $ Ship.Date   : chr  "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
##  $ Ship.Mode   : chr  "Second Class" "Second Class" "Second Class" "Standard Class" ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : chr  "Consumer" "Consumer" "Corporate" "Consumer" ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : chr  "Furniture" "Furniture" "Office Supplies" "Furniture" ...
##  $ Sub.Category: chr  "Bookcases" "Chairs" "Labels" "Tables" ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Based on data type for each column above, there is an incorrect data type. Therefore, we have to change the data type

2.1 Explicit Coercion

vards$Order.Date <- as.Date(vards$Order.Date, format = "%m/%d/%y")
vards$Ship.Date <- as.Date(vards$Ship.Date, format = "%m/%d/%y")
vards$Ship.Mode <- as.factor(vards$Ship.Mode)
vards$Segment <- as.factor(vards$Segment)
vards$Category <- as.factor(vards$Category)
vards$Sub.Category <- as.factor(vards$Sub.Category)

and we recheck the data type again

str(vards)
## 'data.frame':    9994 obs. of  15 variables:
##  $ Row.ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Order.ID    : chr  "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
##  $ Order.Date  : Date, format: "2016-11-08" "2016-11-08" ...
##  $ Ship.Date   : Date, format: "2016-11-11" "2016-11-11" ...
##  $ Ship.Mode   : Factor w/ 4 levels "First Class",..: 3 3 3 4 4 4 4 4 4 4 ...
##  $ Customer.ID : chr  "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
##  $ Segment     : Factor w/ 3 levels "Consumer","Corporate",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Product.ID  : chr  "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
##  $ Category    : Factor w/ 3 levels "Furniture","Office Supplies",..: 1 1 2 1 2 1 2 3 2 2 ...
##  $ Sub.Category: Factor w/ 17 levels "Accessories",..: 5 6 11 17 15 10 3 14 4 2 ...
##  $ Product.Name: chr  "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
##  $ Sales       : num  262 731.9 14.6 957.6 22.4 ...
##  $ Quantity    : int  2 3 2 5 2 7 4 6 3 5 ...
##  $ Discount    : num  0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
##  $ Profit      : num  41.91 219.58 6.87 -383.03 2.52 ...

Now the data type is in accordance with each column

2.2 Check Missing Values

anyNA(vards)
## [1] FALSE
colSums(is.na(vards))
##       Row.ID     Order.ID   Order.Date    Ship.Date    Ship.Mode  Customer.ID 
##            0            0            0            0            0            0 
##      Segment   Product.ID     Category Sub.Category Product.Name        Sales 
##            0            0            0            0            0            0 
##     Quantity     Discount       Profit 
##            0            0            0

Great! We don’t have missing values.

2.3 Finishing Data Cleansing

Next step, we want to retrieve the data needed to perform data analysis, i.e. data from columns 3-15, because we don’t need Row.ID and Order.ID columns

vards <- vards[,3:15]
head(vards)

finally, we finish doing data cleansing

3 Data Explanation

summary(vards)
##    Order.Date           Ship.Date                   Ship.Mode   
##  Min.   :2014-01-03   Min.   :2014-01-07   First Class   :1538  
##  1st Qu.:2015-05-23   1st Qu.:2015-05-27   Same Day      : 543  
##  Median :2016-06-26   Median :2016-06-29   Second Class  :1945  
##  Mean   :2016-04-30   Mean   :2016-05-03   Standard Class:5968  
##  3rd Qu.:2017-05-14   3rd Qu.:2017-05-18                        
##  Max.   :2017-12-30   Max.   :2018-01-05                        
##                                                                 
##  Customer.ID               Segment      Product.ID       
##  Length:9994        Consumer   :5191   Length:9994       
##  Class :character   Corporate  :3020   Class :character  
##  Mode  :character   Home Office:1783   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##             Category         Sub.Category  Product.Name      
##  Furniture      :2121   Binders    :1523   Length:9994       
##  Office Supplies:6026   Paper      :1370   Class :character  
##  Technology     :1847   Furnishings: 957   Mode  :character  
##                         Phones     : 889                     
##                         Storage    : 846                     
##                         Art        : 796                     
##                         (Other)    :3613                     
##      Sales              Quantity        Discount          Profit         
##  Min.   :    0.444   Min.   : 1.00   Min.   :0.0000   Min.   :-6599.978  
##  1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000   1st Qu.:    1.729  
##  Median :   54.490   Median : 3.00   Median :0.2000   Median :    8.666  
##  Mean   :  229.858   Mean   : 3.79   Mean   :0.1562   Mean   :   28.657  
##  3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000   3rd Qu.:   29.364  
##  Max.   :22638.480   Max.   :14.00   Max.   :0.8000   Max.   : 8399.976  
## 
  1. First order occured in January 2014
  2. Standard class is the most frequently used in ship modestandard class is the most frequently used in ship mode
  3. The segment that buys the most is the consumer segment
  4. The office supplies category is the category most frequently purchased by customers
  5. Sub category other is the most purchased, then followed by binders and paper
  6. PT Vards got the lowest sales value of 0.444 and the highest of 22638.480 and has an average of 229.858
  7. Purchase with the highest quantity, which is 14 quantity
  8. The average discount given by PT Vards is 15.62% and has the highest discount of 80%
  9. PT Vards earned an average profit of 28.657 with the lowest profit being a loss of -6599.978, the highest being a profit of 8399.976

4 Data Manipulation & Transformation

  1. Which category has the highest sales?
xtabs(Sales~Category,vards)
## Category
##       Furniture Office Supplies      Technology 
##        741999.8        719047.0        836154.0

** Answer : Technology category has the highest sales **

  1. How many transactions are sent using Standard Class and First Class?
nrow(vards[vards$Ship.Mode %in% c("Standard Class", "First Class"),])
## [1] 7506

** Answer : There are 7506 transactions whose delivery uses Standard Class and First Class **

  1. Which category has the highest profit?
xtabs(Profit~Category,vards)
## Category
##       Furniture Office Supplies      Technology 
##        18451.27       122490.80       145454.95

** Answer : Technology category has the highest profit **

  1. How many transactions from Technology category resulted in a loss(profit less than zero)?
nrow(vards[vards$Category == "Technology" & vards$Profit < 0,])
## [1] 271

** Answer : There are 271 transactions from Technology category that resulted in losses(profit less than zero) **

  1. What type of shipping does the Corporate segment use the most?
table(vards$Ship.Mode, vards$Segment)
##                 
##                  Consumer Corporate Home Office
##   First Class         769       485         284
##   Same Day            317       114         112
##   Second Class       1020       609         316
##   Standard Class     3085      1812        1071

** Answer : Type of delivery that is most widely used by the corporate segment is standard class **

  1. What segments and categories have the highest average profit?
profit <- aggregate(Profit~Segment+Category,vards,mean)
profit[order(profit$Profit,decreasing=T),]

** Answer : Home Office segment and Technology category have the highest average profit at PT.Vards **

  1. How much quantity sold in PT.Vards?
sum(vards$Quantity)
## [1] 37873

** Answer : There are 37873 sold at PT. Vards **

  1. How much total profit PT.Vards get?
sum(vards$Profit)
## [1] 286397

** Answer : PT. Vards get a total profit of 286397 **

  1. How much total sales PT.Vards get?
sum(vards$Sales)
## [1] 2297201

** Answer : PT. Vards get a total sales of 2297201 **

  1. Which category has the most quantity sold?
xtabs(Quantity~Category,vards)
## Category
##       Furniture Office Supplies      Technology 
##            8028           22906            6939

** Answer : Office supplies category has the most quantity sold **

5 Conclusion and Recommendation

5.1 Conclusion

During 2014 - 2017, PT.Vards had total sales of 2297201 and total profit of 286397 and sold 37873 items. The most sold category is the office supplies category. The type of delivery that is mostly used by customers is standard class. The technology category has the highest number of sales and profits compared to other categories.

The home office segment and technology category had the highest average profit, which was 89.152458. There are 271 transactions from the technology category that have losses. Then 7506 transactions use standard class and first class delivery types.

5.2 Recommendation

From the conclusion above, the technology category has the highest total sales and total profit compared to other categories. PT. Vards must find out what causes the technology category to have the highest total sales and total profit compared to other categories and find ways to increase the total profit and total sales of other categories.

The same thing happens with the type of delivery. The standard class delivery type is the most widely used type of delivery. PT. Vards must find out what causes standard class shipping types to be more in demand than other types of shipping. If because the price is cheaper, then PT. Vards may perform price reductions on other types of shipping to increase the number of customers using other types of shipping.