The dataset given by an international e-commerce company who wants to discover key insights from their customer database. Three questions that they want to known is
The dataset used for model building contained 10999 observations of 12 variables.
The data contains the following information:
Warehouse block : The Company have big Warehouse which is divided in to block such as A,B,C,D,E.
Mode of shipment :The Company Ships the products in multiple way such as Ship, Flight and Road.
Customer care calls : The number of calls made from enquiry for enquiry of the shipment.
Customer rating : The company has rated from every customer. 1 is the lowest (Worst), 5 is the highest (Best).
Cost of the product: Cost of the Product in US Dollars.
Prior purchases : The Number of Prior Purchase.
Product importance : The company has categorized the product in the various parameter such as low, medium, high.
Gender : Male and Female.
Discount offered : Discount offered on that specific product.
Weight in gms : It is the weight in grams.
Reached on time : It is the target variable, where 1 Indicates that the product has NOT reached on time and 0 indicates it has reached on time.
df <- read.csv('shipping.csv')
head(df, n = 10)Now look the structure of dataframe
str(df)## 'data.frame': 10999 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ï..ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Warehouse_block : chr "D" "E" "A" "B" ...
## $ Mode_of_Shipment : chr "Flight" "Flight" "Flight" "Flight" ...
## $ Customer_care_calls: int 4 4 2 3 2 3 3 4 3 3 ...
## $ Customer_rating : int 2 5 2 3 2 1 4 1 4 2 ...
## $ Cost_of_the_Product: int 177 216 183 176 184 162 250 233 150 164 ...
## $ Prior_purchases : int 3 2 4 4 3 3 3 2 3 3 ...
## $ Product_importance : chr "low" "low" "low" "medium" ...
## $ Gender : chr "F" "M" "M" "M" ...
## $ Discount_offered : int 44 59 48 10 46 12 3 48 11 29 ...
## $ Weight_in_gms : int 1233 3088 3374 1177 2484 1417 2371 2804 1861 1187 ...
## $ Reached.on.Time_Y.N: int 1 1 1 1 1 1 1 1 1 1 ...
Column “ï..ID” is just index number of rows, since we don’t need this column, we can delete this column
df <- df[, -1]We know, some columns are not in the right structure. Data within “Warehouse_block”, “Mode_of_Shipment”, “Product_importance”, and “Reached.on.Time_Y.N” column are categorical data. So, we should change the structure of these columns into factor.
col.factor <- c("Warehouse_block", "Mode_of_Shipment", "Product_importance", "Gender", "Reached.on.Time_Y.N")
df[col.factor] <- lapply(df[col.factor], factor)Now, lets check are data frame have any empty value (NA)
colSums(is.na(df))## ï..ID Warehouse_block Mode_of_Shipment Customer_care_calls
## 0 0 0 0
## Customer_rating Cost_of_the_Product Prior_purchases Product_importance
## 0 0 0 0
## Gender Discount_offered Weight_in_gms Reached.on.Time_Y.N
## 0 0 0 0
Great, there are not empty value!
To get brief insight from the data, lets look the summary of data
summary(df)## ï..ID Warehouse_block Mode_of_Shipment Customer_care_calls
## Min. : 1 A:1833 Flight:1777 Min. :2.000
## 1st Qu.: 2750 B:1833 Road :1760 1st Qu.:3.000
## Median : 5500 C:1833 Ship :7462 Median :4.000
## Mean : 5500 D:1834 Mean :4.054
## 3rd Qu.: 8250 E:3666 3rd Qu.:5.000
## Max. :10999 Max. :7.000
## Customer_rating Cost_of_the_Product Prior_purchases Product_importance
## Min. :1.000 Min. : 96.0 Min. : 2.000 high : 948
## 1st Qu.:2.000 1st Qu.:169.0 1st Qu.: 3.000 low :5297
## Median :3.000 Median :214.0 Median : 3.000 medium:4754
## Mean :2.991 Mean :210.2 Mean : 3.568
## 3rd Qu.:4.000 3rd Qu.:251.0 3rd Qu.: 4.000
## Max. :5.000 Max. :310.0 Max. :10.000
## Gender Discount_offered Weight_in_gms Reached.on.Time_Y.N
## F:5545 Min. : 1.00 Min. :1001 0:4436
## M:5454 1st Qu.: 4.00 1st Qu.:1840 1:6563
## Median : 7.00 Median :4149
## Mean :13.37 Mean :3634
## 3rd Qu.:10.00 3rd Qu.:5050
## Max. :65.00 Max. :7846
Some insight that we get from summary are:
round(prop.table(table(df$Reached.on.Time_Y.N, df$Mode_of_Shipment), margin = 2)*100, 2)##
## Flight Road Ship
## 0 39.84 41.19 40.24
## 1 60.16 58.81 59.76
round(prop.table(table(df$Reached.on.Time_Y.N, df$Warehouse_block), margin = 2)*100, 2)##
## A B C D E
## 0 41.35 39.77 40.32 40.24 40.15
## 1 58.65 60.23 59.68 59.76 59.85
round(prop.table(table(df$Reached.on.Time_Y.N, df$Prior_purchases), margin = 2)*100, 2)##
## 2 3 4 5 6 7 8 10
## 0 37.48 35.93 45.66 50.12 44.03 32.35 35.16 42.70
## 1 62.52 64.07 54.34 49.88 55.97 67.65 64.84 57.30
round(prop.table(table(df$Reached.on.Time_Y.N, df$Product_importance), margin = 2)*100, 2)##
## high low medium
## 0 35.02 40.72 40.95
## 1 64.98 59.28 59.05
As we can see, that late of shipment for each transport mode, warehouse origin, prior to purchase, and product importance are almost same, which is around 55% - 65%. Therefore, we can conclude that variable of late delivery is not single variable. Now, we will create a dataframe from combination of the above variable
# Filter of late and on time shipment
on.time <- df[df$Reached.on.Time_Y.N == 0, ]
late.time <- df[df$Reached.on.Time_Y.N == 1, ]
# Aggregating on time shipment
agg.on.time <- cbind((aggregate.data.frame(x = list(On.Time = on.time$Reached.on.Time_Y.N),
by = list(Reached = on.time$Reached.on.Time_Y.N,
Moda = on.time$Mode_of_Shipment,
Prior = on.time$Product_importance,
Warehouse = on.time$Warehouse_block),
FUN = length)),
(aggregate.data.frame(x = list(Avg.Cost = on.time$Cost_of_the_Product),
by = list(Reached = on.time$Reached.on.Time_Y.N,
Moda = on.time$Mode_of_Shipment,
Prior = on.time$Product_importance,
Warehouse = on.time$Warehouse_block),
FUN = mean)),
(aggregate.data.frame(x = list(Avg.Disc = on.time$Discount_offered),
by = list(Reached = on.time$Reached.on.Time_Y.N,
Moda = on.time$Mode_of_Shipment,
Prior = on.time$Product_importance,
Warehouse = on.time$Warehouse_block),
FUN = mean)),
(aggregate.data.frame(x = list(Avg.Call = on.time$Customer_care_calls),
by = list(Reached = on.time$Reached.on.Time_Y.N,
Moda = on.time$Mode_of_Shipment,
Prior = on.time$Product_importance,
Warehouse = on.time$Warehouse_block),
FUN = mean)),
(aggregate.data.frame(x = list(Avg.Rating = on.time$Customer_rating),
by = list(Reached = on.time$Reached.on.Time_Y.N,
Moda = on.time$Mode_of_Shipment,
Prior = on.time$Product_importance,
Warehouse = on.time$Warehouse_block),
FUN = mean)))
# Aggregating late time shipment
agg.late.time <- cbind((aggregate.data.frame(x = list(Late.Time = late.time$Reached.on.Time_Y.N),
by = list(Reached = late.time$Reached.on.Time_Y.N,
Moda = late.time$Mode_of_Shipment,
Prior = late.time$Product_importance,
Warehouse = late.time$Warehouse_block),
FUN = length)),
(aggregate.data.frame(x = list(Avg.Cost = late.time$Cost_of_the_Product),
by = list(Reached = late.time$Reached.on.Time_Y.N,
Moda = late.time$Mode_of_Shipment,
Prior = late.time$Product_importance,
Warehouse = late.time$Warehouse_block),
FUN = mean)),
(aggregate.data.frame(x = list(Avg.Disc = late.time$Discount_offered),
by = list(Reached = late.time$Reached.on.Time_Y.N,
Moda = late.time$Mode_of_Shipment,
Prior = late.time$Product_importance,
Warehouse = late.time$Warehouse_block),
FUN = mean)),
(aggregate.data.frame(x = list(Avg.Call = late.time$Customer_care_calls),
by = list(Reached = late.time$Reached.on.Time_Y.N,
Moda = late.time$Mode_of_Shipment,
Prior = late.time$Product_importance,
Warehouse = late.time$Warehouse_block),
FUN = mean)),
(aggregate.data.frame(x = list(Avg.Rating = late.time$Customer_rating),
by = list(Reached = late.time$Reached.on.Time_Y.N,
Moda = late.time$Mode_of_Shipment,
Prior = late.time$Product_importance,
Warehouse = late.time$Warehouse_block),
FUN = mean)))
# Merge late and on time shipment
reach.time <- cbind(agg.on.time[ , c(2:5)], Late.Time = agg.late.time[ , 5])
reach.time$Perc.OT <- round(reach.time$On.Time / rowSums(reach.time[ , c("On.Time", "Late.Time")]) * 100, 2)
reach.time$Perc.LT <- round(reach.time$Late / rowSums(reach.time[ , c("On.Time", "Late.Time")]) * 100, 2)
reach.time <- cbind(reach.time,
Avg.Cost.OT = round(agg.on.time[ , 10], 2),
Avg.Cost.LT = round(agg.late.time[ , 10], 2),
Avg.Disc.OT = agg.on.time[ , 15],
Avg.Disc.LT = agg.late.time[ , 15],
Avg.Call.OT = agg.on.time[ , 20],
Avg.Call.LT = agg.late.time[ , 20])
head(reach.time)Now, lets identify the most combination that have highest late of delivery time.
most.late <- reach.time[reach.time$Perc.LT > mean(reach.time$Perc.LT), ]
most.late[order(most.late$Prior, decreasing = F), ]summary(most.late)## Moda Prior Warehouse On.Time Late.Time
## Flight:6 high :9 A:4 Min. : 5.00 Min. : 19.0
## Road :6 low :5 B:5 1st Qu.: 31.00 1st Qu.: 67.0
## Ship :7 medium:5 C:4 Median : 47.00 Median : 80.0
## D:3 Mean : 64.11 Mean :109.7
## E:3 3rd Qu.: 61.50 3rd Qu.:114.5
## Max. :242.00 Max. :378.0
## Perc.OT Perc.LT Avg.Cost.OT Avg.Cost.LT
## Min. :18.52 Min. :60.97 Min. :183.9 Min. :182.3
## 1st Qu.:33.33 1st Qu.:62.16 1st Qu.:208.2 1st Qu.:198.0
## Median :36.84 Median :63.16 Median :212.1 Median :205.9
## Mean :34.73 Mean :65.27 Mean :212.0 Mean :203.1
## 3rd Qu.:37.84 3rd Qu.:66.67 3rd Qu.:216.7 3rd Qu.:208.4
## Max. :39.03 Max. :81.48 Max. :237.0 Max. :213.6
## Avg.Disc.OT Avg.Disc.LT Avg.Call.OT Avg.Call.LT
## Min. :4.000 Min. :15.35 Min. :3.200 Min. :3.636
## 1st Qu.:5.102 1st Qu.:17.74 1st Qu.:3.913 1st Qu.:3.854
## Median :5.400 Median :18.38 Median :4.091 Median :3.950
## Mean :5.378 Mean :18.49 Mean :3.995 Mean :3.929
## 3rd Qu.:5.629 3rd Qu.:19.06 3rd Qu.:4.226 3rd Qu.:4.006
## Max. :6.400 Max. :22.44 Max. :4.438 Max. :4.284
Insight from the brief summary are:
Above is 19 combination of transport mode, product importance, and warehouse origin that are most late delivery.
The top combination that produces most late delivery is goods with high importance shipped by ship. High importance goods is recommended ship by flight.
Goods that are delivered late time have cost lower that goods that delivered on time. Also, good which have discounts more than 10% are delivered late time.
Goods that are delivered on time have more calls from customers, this means that employees will work if triggered by customers who asked their goods. Recommendations for this problem is set new Standard Operational Procedure (SOP), so their performance will still well although their are not triggered by customer calls.
Recommendation mode of transport from each warehouse origin and product importance is like below:
Rating will be connected to gender, reach time, and number of discounts.
aggregate(formula = df$Customer_rating ~ df$Gender, FUN = mean)Buyer gender differences not make much difference to the rating that given.
aggregate(formula = df$Customer_rating ~ df$Reached.on.Time_Y.N, FUN = mean)Goods that delivered late have a slightly better rating that goods that delivered on time. This indicated that reach time is not indicator from buyer to give determine the rating.
disc.category <-
function(x) {
if (x <= 10) {
x <- "disc 0% - 10%"
} else if (x > 10 & x <= 20) {
x <- "disc 11% - 20%"
} else if (x > 20 & x <= 30) {
x <- "disc 21% - 30%"
} else if (x > 30 & x <= 40) {
x <- "disc 31 - 40%"
} else if (x > 41 & x <= 50) {
x <- "disc 41% - 50%"
} else if (x > 51 & x <= 60) {
x <- "disc 51% - 60%"
} else {
x <- "disc more than 60%"
}
}
df$disc.category <- sapply(df$Discount_offered, FUN = disc.category)
aggregate(formula = df$Customer_rating ~ df$disc.category, FUN = mean)Amount of discount don’t have linear relationship with rating. Also, this indicated that discount isn’t indicator from buyer to give determine the rating.
Based on above explanatory, we can concluded that rating of goods is not determine by gender of buyer, delivery of goods, and amount of discount. The rating is possibility determine by quality of goods.