Background

The dataset given by an international e-commerce company who wants to discover key insights from their customer database. Three questions that they want to known is

What was the caused late delivery time?
What was the caused the product rating?

The dataset used for model building contained 10999 observations of 12 variables.
The data contains the following information:

Warehouse block : The Company have big Warehouse which is divided in to block such as A,B,C,D,E.
Mode of shipment :The Company Ships the products in multiple way such as Ship, Flight and Road.
Customer care calls : The number of calls made from enquiry for enquiry of the shipment.
Customer rating : The company has rated from every customer. 1 is the lowest (Worst), 5 is the highest (Best).
Cost of the product: Cost of the Product in US Dollars.
Prior purchases : The Number of Prior Purchase.
Product importance : The company has categorized the product in the various parameter such as low, medium, high.
Gender : Male and Female.
Discount offered : Discount offered on that specific product.
Weight in gms : It is the weight in grams.
Reached on time : It is the target variable, where 1 Indicates that the product has NOT reached on time and 0 indicates it has reached on time.

Import and Cleaning Data

df <- read.csv('shipping.csv')
head(df, n = 10)

Now look the structure of dataframe

str(df)

## 'data.frame':    10999 obs. of  13 variables:
##  $ X                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ï..ID              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Warehouse_block    : chr  "D" "E" "A" "B" ...
##  $ Mode_of_Shipment   : chr  "Flight" "Flight" "Flight" "Flight" ...
##  $ Customer_care_calls: int  4 4 2 3 2 3 3 4 3 3 ...
##  $ Customer_rating    : int  2 5 2 3 2 1 4 1 4 2 ...
##  $ Cost_of_the_Product: int  177 216 183 176 184 162 250 233 150 164 ...
##  $ Prior_purchases    : int  3 2 4 4 3 3 3 2 3 3 ...
##  $ Product_importance : chr  "low" "low" "low" "medium" ...
##  $ Gender             : chr  "F" "M" "M" "M" ...
##  $ Discount_offered   : int  44 59 48 10 46 12 3 48 11 29 ...
##  $ Weight_in_gms      : int  1233 3088 3374 1177 2484 1417 2371 2804 1861 1187 ...
##  $ Reached.on.Time_Y.N: int  1 1 1 1 1 1 1 1 1 1 ...

Column “ï..ID” is just index number of rows, since we don’t need this column, we can delete this column

df <- df[, -1]

We know, some columns are not in the right structure. Data within “Warehouse_block”, “Mode_of_Shipment”, “Product_importance”, and “Reached.on.Time_Y.N” column are categorical data. So, we should change the structure of these columns into factor.

col.factor <- c("Warehouse_block", "Mode_of_Shipment", "Product_importance", "Gender", "Reached.on.Time_Y.N")
df[col.factor] <- lapply(df[col.factor], factor)

Now, lets check are data frame have any empty value (NA)

colSums(is.na(df))

##               ï..ID     Warehouse_block    Mode_of_Shipment Customer_care_calls 
##                   0                   0                   0                   0 
##     Customer_rating Cost_of_the_Product     Prior_purchases  Product_importance 
##                   0                   0                   0                   0 
##              Gender    Discount_offered       Weight_in_gms Reached.on.Time_Y.N 
##                   0                   0                   0                   0

Great, there are not empty value!

To get brief insight from the data, lets look the summary of data

summary(df)

##      ï..ID       Warehouse_block Mode_of_Shipment Customer_care_calls
##  Min.   :    1   A:1833          Flight:1777      Min.   :2.000      
##  1st Qu.: 2750   B:1833          Road  :1760      1st Qu.:3.000      
##  Median : 5500   C:1833          Ship  :7462      Median :4.000      
##  Mean   : 5500   D:1834                           Mean   :4.054      
##  3rd Qu.: 8250   E:3666                           3rd Qu.:5.000      
##  Max.   :10999                                    Max.   :7.000      
##  Customer_rating Cost_of_the_Product Prior_purchases  Product_importance
##  Min.   :1.000   Min.   : 96.0       Min.   : 2.000   high  : 948       
##  1st Qu.:2.000   1st Qu.:169.0       1st Qu.: 3.000   low   :5297       
##  Median :3.000   Median :214.0       Median : 3.000   medium:4754       
##  Mean   :2.991   Mean   :210.2       Mean   : 3.568                     
##  3rd Qu.:4.000   3rd Qu.:251.0       3rd Qu.: 4.000                     
##  Max.   :5.000   Max.   :310.0       Max.   :10.000                     
##  Gender   Discount_offered Weight_in_gms  Reached.on.Time_Y.N
##  F:5545   Min.   : 1.00    Min.   :1001   0:4436             
##  M:5454   1st Qu.: 4.00    1st Qu.:1840   1:6563             
##           Median : 7.00    Median :4149                      
##           Mean   :13.37    Mean   :3634                      
##           3rd Qu.:10.00    3rd Qu.:5050                      
##           Max.   :65.00    Max.   :7846

Some insight that we get from summary are:

Most of goods are shipped from warehouse block F
Most of goods are shipped by ship
Average customers make 4 calls
The average of ratings are 2.991
Most of goods are low and high priority
Female customers more slightly than male customers
Most of discounts are under 10%
The average of good weights are 3.36 kg
Most of the delivery is late

Identifying Main Causes of Late Shipment

round(prop.table(table(df$Reached.on.Time_Y.N, df$Mode_of_Shipment), margin = 2)*100, 2)

##    
##     Flight  Road  Ship
##   0  39.84 41.19 40.24
##   1  60.16 58.81 59.76

round(prop.table(table(df$Reached.on.Time_Y.N, df$Warehouse_block), margin = 2)*100, 2)

##    
##         A     B     C     D     E
##   0 41.35 39.77 40.32 40.24 40.15
##   1 58.65 60.23 59.68 59.76 59.85

round(prop.table(table(df$Reached.on.Time_Y.N, df$Prior_purchases), margin = 2)*100, 2)

##    
##         2     3     4     5     6     7     8    10
##   0 37.48 35.93 45.66 50.12 44.03 32.35 35.16 42.70
##   1 62.52 64.07 54.34 49.88 55.97 67.65 64.84 57.30

round(prop.table(table(df$Reached.on.Time_Y.N, df$Product_importance), margin = 2)*100, 2)

##    
##      high   low medium
##   0 35.02 40.72  40.95
##   1 64.98 59.28  59.05

As we can see, that late of shipment for each transport mode, warehouse origin, prior to purchase, and product importance are almost same, which is around 55% - 65%. Therefore, we can conclude that variable of late delivery is not single variable. Now, we will create a dataframe from combination of the above variable

# Filter of late and on time shipment
on.time <- df[df$Reached.on.Time_Y.N == 0, ]
late.time <- df[df$Reached.on.Time_Y.N == 1, ]

# Aggregating on time shipment
agg.on.time <- cbind((aggregate.data.frame(x = list(On.Time = on.time$Reached.on.Time_Y.N),
                                           by = list(Reached = on.time$Reached.on.Time_Y.N,
                                                     Moda = on.time$Mode_of_Shipment,
                                                     Prior = on.time$Product_importance,
                                                     Warehouse = on.time$Warehouse_block),
                                           FUN = length)),
                     (aggregate.data.frame(x = list(Avg.Cost = on.time$Cost_of_the_Product),
                                           by = list(Reached = on.time$Reached.on.Time_Y.N,
                                                     Moda = on.time$Mode_of_Shipment,
                                                     Prior = on.time$Product_importance,
                                                     Warehouse = on.time$Warehouse_block),
                                           FUN = mean)),
                     (aggregate.data.frame(x = list(Avg.Disc = on.time$Discount_offered),
                                           by = list(Reached = on.time$Reached.on.Time_Y.N,
                                                     Moda = on.time$Mode_of_Shipment,
                                                     Prior = on.time$Product_importance,
                                                     Warehouse = on.time$Warehouse_block),
                                           FUN = mean)),
                     (aggregate.data.frame(x = list(Avg.Call = on.time$Customer_care_calls),
                                           by = list(Reached = on.time$Reached.on.Time_Y.N,
                                                     Moda = on.time$Mode_of_Shipment,
                                                     Prior = on.time$Product_importance,
                                                     Warehouse = on.time$Warehouse_block),
                                           FUN = mean)),
                     (aggregate.data.frame(x = list(Avg.Rating = on.time$Customer_rating),
                                           by = list(Reached = on.time$Reached.on.Time_Y.N,
                                                     Moda = on.time$Mode_of_Shipment,
                                                     Prior = on.time$Product_importance,
                                                     Warehouse = on.time$Warehouse_block),
                                           FUN = mean)))

# Aggregating late time shipment
agg.late.time <- cbind((aggregate.data.frame(x = list(Late.Time = late.time$Reached.on.Time_Y.N),
                                           by = list(Reached = late.time$Reached.on.Time_Y.N,
                                                     Moda = late.time$Mode_of_Shipment,
                                                     Prior = late.time$Product_importance,
                                                     Warehouse = late.time$Warehouse_block),
                                           FUN = length)),
                     (aggregate.data.frame(x = list(Avg.Cost = late.time$Cost_of_the_Product),
                                           by = list(Reached = late.time$Reached.on.Time_Y.N,
                                                     Moda = late.time$Mode_of_Shipment,
                                                     Prior = late.time$Product_importance,
                                                     Warehouse = late.time$Warehouse_block),
                                           FUN = mean)),
                     (aggregate.data.frame(x = list(Avg.Disc = late.time$Discount_offered),
                                           by = list(Reached = late.time$Reached.on.Time_Y.N,
                                                     Moda = late.time$Mode_of_Shipment,
                                                     Prior = late.time$Product_importance,
                                                     Warehouse = late.time$Warehouse_block),
                                           FUN = mean)),
                     (aggregate.data.frame(x = list(Avg.Call = late.time$Customer_care_calls),
                                           by = list(Reached = late.time$Reached.on.Time_Y.N,
                                                     Moda = late.time$Mode_of_Shipment,
                                                     Prior = late.time$Product_importance,
                                                     Warehouse = late.time$Warehouse_block),
                                           FUN = mean)),
                     (aggregate.data.frame(x = list(Avg.Rating = late.time$Customer_rating),
                                           by = list(Reached = late.time$Reached.on.Time_Y.N,
                                                     Moda = late.time$Mode_of_Shipment,
                                                     Prior = late.time$Product_importance,
                                                     Warehouse = late.time$Warehouse_block),
                                           FUN = mean)))

# Merge late and on time shipment
reach.time <- cbind(agg.on.time[ , c(2:5)], Late.Time = agg.late.time[ , 5])
reach.time$Perc.OT <- round(reach.time$On.Time / rowSums(reach.time[ , c("On.Time", "Late.Time")]) * 100, 2)
reach.time$Perc.LT <- round(reach.time$Late / rowSums(reach.time[ , c("On.Time", "Late.Time")]) * 100, 2)
reach.time <- cbind(reach.time, 
                    Avg.Cost.OT = round(agg.on.time[ , 10], 2),
                    Avg.Cost.LT = round(agg.late.time[ , 10], 2),
                    Avg.Disc.OT = agg.on.time[ , 15],
                    Avg.Disc.LT = agg.late.time[ , 15],
                    Avg.Call.OT = agg.on.time[ , 20],
                    Avg.Call.LT = agg.late.time[ , 20])
head(reach.time)

Now, lets identify the most combination that have highest late of delivery time.

most.late <- reach.time[reach.time$Perc.LT > mean(reach.time$Perc.LT), ]
most.late[order(most.late$Prior, decreasing = F), ]

summary(most.late)

##      Moda      Prior   Warehouse    On.Time         Late.Time    
##  Flight:6   high  :9   A:4       Min.   :  5.00   Min.   : 19.0  
##  Road  :6   low   :5   B:5       1st Qu.: 31.00   1st Qu.: 67.0  
##  Ship  :7   medium:5   C:4       Median : 47.00   Median : 80.0  
##                        D:3       Mean   : 64.11   Mean   :109.7  
##                        E:3       3rd Qu.: 61.50   3rd Qu.:114.5  
##                                  Max.   :242.00   Max.   :378.0  
##     Perc.OT         Perc.LT       Avg.Cost.OT     Avg.Cost.LT   
##  Min.   :18.52   Min.   :60.97   Min.   :183.9   Min.   :182.3  
##  1st Qu.:33.33   1st Qu.:62.16   1st Qu.:208.2   1st Qu.:198.0  
##  Median :36.84   Median :63.16   Median :212.1   Median :205.9  
##  Mean   :34.73   Mean   :65.27   Mean   :212.0   Mean   :203.1  
##  3rd Qu.:37.84   3rd Qu.:66.67   3rd Qu.:216.7   3rd Qu.:208.4  
##  Max.   :39.03   Max.   :81.48   Max.   :237.0   Max.   :213.6  
##   Avg.Disc.OT     Avg.Disc.LT     Avg.Call.OT     Avg.Call.LT   
##  Min.   :4.000   Min.   :15.35   Min.   :3.200   Min.   :3.636  
##  1st Qu.:5.102   1st Qu.:17.74   1st Qu.:3.913   1st Qu.:3.854  
##  Median :5.400   Median :18.38   Median :4.091   Median :3.950  
##  Mean   :5.378   Mean   :18.49   Mean   :3.995   Mean   :3.929  
##  3rd Qu.:5.629   3rd Qu.:19.06   3rd Qu.:4.226   3rd Qu.:4.006  
##  Max.   :6.400   Max.   :22.44   Max.   :4.438   Max.   :4.284

Insight from the brief summary are:

Above is 19 combination of transport mode, product importance, and warehouse origin that are most late delivery.
The top combination that produces most late delivery is goods with high importance shipped by ship. High importance goods is recommended ship by flight.
Goods that are delivered late time have cost lower that goods that delivered on time. Also, good which have discounts more than 10% are delivered late time.
Goods that are delivered on time have more calls from customers, this means that employees will work if triggered by customers who asked their goods. Recommendations for this problem is set new Standard Operational Procedure (SOP), so their performance will still well although their are not triggered by customer calls.
Recommendation mode of transport from each warehouse origin and product importance is like below:

Identifying Rating

Rating will be connected to gender, reach time, and number of discounts.

aggregate(formula = df$Customer_rating ~ df$Gender, FUN = mean)

Buyer gender differences not make much difference to the rating that given.

aggregate(formula = df$Customer_rating ~ df$Reached.on.Time_Y.N, FUN = mean)

Goods that delivered late have a slightly better rating that goods that delivered on time. This indicated that reach time is not indicator from buyer to give determine the rating.

disc.category <- 
  function(x) {
    if (x <= 10) {
      x <- "disc 0% - 10%"
    } else if (x > 10 & x <= 20) {
      x <- "disc 11% - 20%"
    } else if (x > 20 & x <= 30) {
      x <- "disc 21% - 30%"
    } else if (x > 30 & x <= 40) {
      x <- "disc 31 - 40%" 
    } else if (x > 41 & x <= 50) {
      x <- "disc 41% - 50%"
    } else if (x > 51 & x <= 60) {
      x <- "disc 51% - 60%"
    } else {
      x <- "disc more than 60%"
    }
  }

df$disc.category <- sapply(df$Discount_offered, FUN = disc.category)
aggregate(formula = df$Customer_rating ~ df$disc.category, FUN = mean)

Amount of discount don’t have linear relationship with rating. Also, this indicated that discount isn’t indicator from buyer to give determine the rating.

Based on above explanatory, we can concluded that rating of goods is not determine by gender of buyer, delivery of goods, and amount of discount. The rating is possibility determine by quality of goods.

Conclusion

Most of goods that delivered late time is high importance product shipped by ship
Rating is determine by quality of product

Explanatory Data Analysis - Ecommerce Shipping Data

Rudi Haryanto

August 15, 2021

Background

Import and Cleaning Data

Identifying Main Causes of Late Shipment

Identifying Rating

Conclusion