Description

This report describes Product Shipment Delivered on time or not? prediction using classification. The dataset used in this report is E-Commerce Shipping Data in Kaggle. The dataset can be downloaded here. Link to dataset:. https://www.kaggle.com/prachi13/customer-analytics

Report outline:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation

1. Data Extraction

Extract data in csv format into dataframe in R.

Ecommerce_df<-read.csv("C:/Ecommerce/ALuLproject/data/Ecommerce_Shipping_Data.csv")

See the data dimension. The dataset has 10999 rows and 12 columns.

dim(Ecommerce_df)
## [1] 10999    12

2. Exploratory Data Analysis

To find out the column names and types, we used str() function.

str(Ecommerce_df)
## 'data.frame':    10999 obs. of  12 variables:
##  $ ï..ID              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Warehouse_block    : chr  "D" "F" "A" "B" ...
##  $ Mode_of_Shipment   : chr  "Flight" "Flight" "Flight" "Flight" ...
##  $ Customer_care_calls: int  4 4 2 3 2 3 3 4 3 3 ...
##  $ Customer_rating    : int  2 5 2 3 2 1 4 1 4 2 ...
##  $ Cost_of_the_Product: int  177 216 183 176 184 162 250 233 150 164 ...
##  $ Prior_purchases    : int  3 2 4 4 3 3 3 2 3 3 ...
##  $ Product_importance : chr  "low" "low" "low" "medium" ...
##  $ Gender             : chr  "F" "M" "M" "M" ...
##  $ Discount_offered   : int  44 59 48 10 46 12 3 48 11 29 ...
##  $ Weight_in_gms      : int  1233 3088 3374 1177 2484 1417 2371 2804 1861 1187 ...
##  $ Reached.on.Time_Y.N: int  1 1 1 1 1 1 1 1 1 1 ...

From the result above, we know the following:
1. The first column is id. It is unique and unnecessary for prediction. So, it should be removed.
2. The twelfth column is Reached on time Currently the type is int and it should be converted to factor and change variable name to be shipping_status

# remove unnecessary columns
Ecommerce_df$ï..ID <- NULL
# change to factor for target variable
colnames(Ecommerce_df)[11] <- "shipping_status"  
Ecommerce_df$shipping_status <- factor(Ecommerce_df$shipping_status,
                                           levels = c(0,1),
                                           labels = c("Ontime", "Late"))

2.1. Univariate Data Analysis

Analysis of a single variable.
Number of on time and late in shipping status column.

library(ggplot2)
ggplot(data=Ecommerce_df, aes(x=shipping_status)) + geom_bar(color="white",fill = "#49B3E8")+ 
  geom_text(aes(y = ..count.. -50,label = paste0(round(prop.table(..count..),4) * 100, '%')), 
            stat = 'count', position = position_stack(vjust=0.5), size = 5,color= "white") +
  labs(title="Shipping Status Of Observation", x="Shipping Status") 

From the result above, we know that the number of shipping Late is more than ontime.

2.2. Bivariate Data Analysis

Analysis of two variables.
Distribution of Warehouse block variable based on shipping status.

Ecommerce_df$Warehouse_block <- as.factor(Ecommerce_df$Warehouse_block)
ggplot(Ecommerce_df, aes(x =  Warehouse_block, fill = shipping_status)) +
  geom_bar()

From the result above, we know the following:
Based on Warehouse_block , we know that the count of Late is more than On Time.

Distribution of Mode of Shipment variable based on shipping status.

Ecommerce_df$Mode_of_Shipment <- as.factor(Ecommerce_df$Mode_of_Shipment)
ggplot(Ecommerce_df, aes(x =  Mode_of_Shipment, fill = shipping_status)) +
  geom_bar()

From the result above, we know the following:
Based on Mode of Shipment , we know that the count of Late is more than On Time.

Distribution of Customer care calls variable based on shipping status.

ggplot(Ecommerce_df, aes(x = shipping_status, y = Customer_care_calls)) +
  geom_point(color = "darkcyan",
             size = 2, position = "jitter") +
  geom_smooth(method = "lm",
              formula = y ~ poly(x, 3),
              color = "darksalmon") +
  labs(title = "shipping status based on Customer care calls",
       x = "shipping status",
  
            y = "Customer care calls")

From the result above, we know the following:
Based on Customer care calls , we know that the count of Late is more than On Time.

Distribution of Customer rating variable based on shipping status.

ggplot(Ecommerce_df, aes(x = shipping_status, y = Customer_rating)) +
  geom_point(color = "darkcyan",
             size = 2, position = "jitter") +
  geom_smooth(method = "lm",
              formula = y ~ poly(x, 3),
              color = "darksalmon") +
  labs(title = "shipping status based on Customer rating",
       x = "shipping status",
       y = "Customer rating")

From the result above, we know the following:
Based on Customer rating , we know that the count of Late is more than On Time.

Distribution of Cost of the Product variable based on shipping status.

ggplot(Ecommerce_df, aes(x = shipping_status, y = Cost_of_the_Product)) +
  geom_point(color = "darkcyan",
             size = 2, position = "jitter") +
  geom_smooth(method = "lm",
              formula = y ~ poly(x, 3),
              color = "darksalmon") +
  labs(title = "shipping status based on Cost of the Product",
       x = "shipping status",
       y = "Cost of the Product")

From the result above, we know the following:
Based on Cost of the Product , we know that the count of Late is more than On Time.

Distribution of Prior purchases variable based on shipping status.

ggplot(Ecommerce_df, aes(x = shipping_status, y = Prior_purchases)) +
  geom_point(color = "darkcyan",
             size = 2, position = "jitter") +
  geom_smooth(method = "lm",
              formula = y ~ poly(x, 3),
              color = "darksalmon") +
  labs(title = "shipping status based on Prior purchases",
       x = "shipping status",
       y = "Prior purchases")

From the result above, we know the following:
Based on Prior purchases , we know that the count of Late is more than On Time.

Distribution of Product importance variable based on shipping status.

Ecommerce_df$Product_importance <- as.factor(Ecommerce_df$Product_importance)
ggplot(Ecommerce_df, aes(x =  Product_importance, fill = shipping_status)) +
  geom_bar()

From the result above, we know the following:
Based on Product importance , we know that the count of Late is more than On Time.

Distribution of Gender variable based on shipping status.

Ecommerce_df$Gender <- as.factor(Ecommerce_df$Gender)
ggplot(Ecommerce_df, aes(x =  Gender, fill = shipping_status)) +
  geom_bar()

From the result above, we know the following:
Based on Gender , we know that the count of Late is more than On Time.

Distribution of Discount offered variable based on shipping status.

ggplot(Ecommerce_df, aes(x = shipping_status, y = Discount_offered)) +
  geom_point(color = "darkcyan",
             size = 2, position = "jitter") +
  geom_smooth(method = "lm",
              formula = y ~ poly(x, 3),
              color = "darksalmon") +
  labs(title = "shipping status based on Discount offered",
       x = "shipping status",
       y = "Discount offered")

From the result above, we know the following:
Based on Discount offered , we know that the count of Late is more than On Time.

Distribution of Weight in gms variable based on shipping status.

ggplot(Ecommerce_df, aes(x = shipping_status, y =  Weight_in_gms)) +
  geom_point(color = "darkcyan",
             size = 2, position = "jitter") +
  geom_smooth(method = "lm",
              formula = y ~ poly(x, 3),
              color = "darksalmon") +
  labs(title = "shipping status based on Weight in gms",
       x = "shipping status",
       y = "Weight in gms")

From the result above, we know the following:
Based on Weight in gms , we know that the count of Late is more than On Time.

2.3. Multivariate Data Analysis (three or more variables)

Scatterplot of Discount offered and Weight in gms. shipping status is represented by color.

ggplot(Ecommerce_df, aes(x=Discount_offered, y=Weight_in_gms, color=shipping_status)) + geom_point()

From the result above, we know the following:
1. Based on Weight in gms , the number of on time is under 2000 and above 4000 and below 6000
2. Based on Discount offered , the number of late is is above 20 and below 60

correlation coefficient.
Compute and visualize correlation coefficient of each measurement.

library(corrgram)
corrgram(Ecommerce_df[,c("Warehouse_block", "Mode_of_Shipment", "Customer_care_calls",
                         "Customer_rating", "Cost_of_the_Product", "Prior_purchases",
                         "Product_importance", "Gender", "Discount_offered", "Weight_in_gms")],
         main="Correlation Coefficient variables")

From the result above, we know the following:
variable customer care calls has a strong corelation with variable cost of the product and variable discount offered has a strong corelation with variable weight in grams.

3. Data Preparation

3.1 Feature Selection

# remove unnecessary columns
Ecommerce_df$ï..ID <- NULL

3.2. Training and Test Division

m = nrow(Ecommerce_df)
set.seed(2021)

train_idx <- sample(m, 0.6 * m)
train_df <- Ecommerce_df[ train_idx, ]
test_df <- Ecommerce_df[ -train_idx, ]

4. Modelling

4.1. Logistic Regression

fit.log <- glm(formula = shipping_status ~ ., 
               data = train_df, 
               family = binomial())
summary(fit.log)
## 
## Call:
## glm(formula = shipping_status ~ ., family = binomial(), data = train_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6825  -1.0748   0.1228   1.0848   1.9054  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               1.615e+00  2.758e-01   5.855 4.77e-09 ***
## Warehouse_blockB         -8.330e-02  9.775e-02  -0.852  0.39410    
## Warehouse_blockC          1.408e-02  9.892e-02   0.142  0.88680    
## Warehouse_blockD          3.761e-02  9.768e-02   0.385  0.70024    
## Warehouse_blockF         -1.126e-02  8.540e-02  -0.132  0.89508    
## Mode_of_ShipmentRoad      5.023e-02  9.861e-02   0.509  0.61048    
## Mode_of_ShipmentShip     -1.840e-03  7.777e-02  -0.024  0.98113    
## Customer_care_calls      -8.994e-02  2.791e-02  -3.223  0.00127 ** 
## Customer_rating          -2.688e-03  1.996e-02  -0.135  0.89286    
## Cost_of_the_Product      -2.002e-03  6.491e-04  -3.084  0.00204 ** 
## Prior_purchases          -9.323e-02  1.931e-02  -4.827 1.39e-06 ***
## Product_importancelow    -3.353e-01  1.080e-01  -3.104  0.00191 ** 
## Product_importancemedium -3.338e-01  1.085e-01  -3.076  0.00209 ** 
## GenderM                   1.839e-02  5.644e-02   0.326  0.74452    
## Discount_offered          1.157e-01  5.795e-03  19.962  < 2e-16 ***
## Weight_in_gms            -2.161e-04  2.056e-05 -10.512  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8907.2  on 6598  degrees of freedom
## Residual deviance: 7187.8  on 6583  degrees of freedom
## AIC: 7219.8
## 
## Number of Fisher Scoring iterations: 6

4.2 Classification Tree

library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
fit.ctree <- ctree(formula = shipping_status ~ .,
                   data = train_df,
                   control = ctree_control(maxdepth = 4))
plot(fit.ctree)

4.3. Random Forest

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(2021)
fit.forest <- randomForest(formula = shipping_status ~., 
                           data = train_df, 
                           na.action = na.roughfix)
fit.forest
## 
## Call:
##  randomForest(formula = shipping_status ~ ., data = train_df,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 34.54%
## Confusion matrix:
##        Ontime Late class.error
## Ontime   1889  782   0.2927742
## Late     1497 2431   0.3811100

4.4 Support Vector Machine

library(e1071)
fit.svm <- svm(formula = shipping_status ~ .,
               data = train_df)
fit.svm
## 
## Call:
## svm(formula = shipping_status ~ ., data = train_df)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  4651

5. Evaluate The Model

actual <- test_df$shipping_status

5.1. Logistic Regression

prob <- predict(fit.log, test_df, type = "response")
pred.log <- factor(prob > 0.5, 
                   levels = c(FALSE, TRUE),
                   labels = c("on_time", "late"))
cm.log <- table(actual, pred.log, 
                dnn = c("Actual", "Predicted"))
cm.log
##         Predicted
## Actual   on_time late
##   Ontime    1029  736
##   Late       872 1763

5.2. Classification Tree

pred.ctree <- predict(fit.ctree, test_df)
cm.ctree <- table(actual, pred.ctree, 
                  dnn = c("Actual", "Predicted"))
cm.ctree
##         Predicted
## Actual   Ontime Late
##   Ontime   1700   65
##   Late     1328 1307

5.3. Random Forest

pred.forest <- predict(fit.forest, test_df)
cm.forest <- table(actual, pred.forest,
                   dnn = c("Actual", "Predicted"))
cm.forest
##         Predicted
## Actual   Ontime Late
##   Ontime   1316  449
##   Late     1001 1634

5.4. Support Vector Machine

pred.svm <- predict(fit.svm, na.omit(test_df))
cm.svm <- table(na.omit(test_df)$shipping_status, pred.svm, 
                dnn = c("Actual", "Predicted"))

cm.svm
##         Predicted
## Actual   Ontime Late
##   Ontime   1585  180
##   Late     1301 1334
performance <- function(cm, method){
  TN <- cm[1,1]
  TP <- cm[2,2]
  FN <- cm[2,1]
  FP <- cm[1,2]
  accuracy <- (TN+TP) / (TN+TP+FN+FP)
  precision <- (TP) / (TP+FP)
  recall <- (TP) / (FN+TP)
  f1score <- (2*precision*recall) / (precision + recall)
  
  result <- paste("***", method, "***",
                  "\naccuracy = ", round(accuracy,3),
                  "\nprecision = ", round(precision,3),
                  "\nrecall = ", round(recall,3),
                  "\nf1score = ", round(f1score,3))
  
  cat(result)
}
performance(cm.log, "Logistic Regression")
## *** Logistic Regression *** 
## accuracy =  0.635 
## precision =  0.705 
## recall =  0.669 
## f1score =  0.687
performance(cm.ctree, "Classification Tree")
## *** Classification Tree *** 
## accuracy =  0.683 
## precision =  0.953 
## recall =  0.496 
## f1score =  0.652
performance(cm.forest, "Random Forest")
## *** Random Forest *** 
## accuracy =  0.67 
## precision =  0.784 
## recall =  0.62 
## f1score =  0.693
performance(cm.svm, "Support Vector Machine")
## *** Support Vector Machine *** 
## accuracy =  0.663 
## precision =  0.881 
## recall =  0.506 
## f1score =  0.643

Improve The Performance. Focus on Classification Tree method Study parameter ctree_control()

fit.ctree <- ctree(formula = shipping_status ~ .,
                   data = train_df,
                   control = ctree_control(maxdepth = 0,
                                           mincriterion = 0.90,
                                           minsplit = 25))
plot(fit.ctree)

pred.ctree <- predict(fit.ctree, test_df)
cm.ctree <- table(actual, pred.ctree, 
                  dnn = c("Actual", "Predicted"))
cm.ctree
##         Predicted
## Actual   Ontime Late
##   Ontime   1706   59
##   Late     1337 1298
performance(cm.ctree, "Classification Tree")
## *** Classification Tree *** 
## accuracy =  0.683 
## precision =  0.957 
## recall =  0.493 
## f1score =  0.65

6. Reccomendation

From the result above, we know the following:
1. Classification Tree is the best among all the tested algortihms
2. The Performance can be improved on Classification Tree method Study parameter ctree_control()
3. the most important variables are discount offered and weight in gms