This report describes predicting whether a product shipment is delivered on time, framed as a classification task. The dataset used is the E-Commerce Shipping Data set on Kaggle and can be downloaded from: https://www.kaggle.com/prachi13/customer-analytics
Report outline:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
1. Data Extraction
Extract the data from the CSV file into an R data frame.
Ecommerce_df<-read.csv("C:/Ecommerce/ALuLproject/data/Ecommerce_Shipping_Data.csv")
Check the data dimensions: the dataset has 10,999 rows and 12 columns.
dim(Ecommerce_df)
## [1] 10999 12
To inspect the column names and types, we use the str() function.
str(Ecommerce_df)
## 'data.frame': 10999 obs. of 12 variables:
## $ ï..ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Warehouse_block : chr "D" "F" "A" "B" ...
## $ Mode_of_Shipment : chr "Flight" "Flight" "Flight" "Flight" ...
## $ Customer_care_calls: int 4 4 2 3 2 3 3 4 3 3 ...
## $ Customer_rating : int 2 5 2 3 2 1 4 1 4 2 ...
## $ Cost_of_the_Product: int 177 216 183 176 184 162 250 233 150 164 ...
## $ Prior_purchases : int 3 2 4 4 3 3 3 2 3 3 ...
## $ Product_importance : chr "low" "low" "low" "medium" ...
## $ Gender : chr "F" "M" "M" "M" ...
## $ Discount_offered : int 44 59 48 10 46 12 3 48 11 29 ...
## $ Weight_in_gms : int 1233 3088 3374 1177 2484 1417 2371 2804 1861 1187 ...
## $ Reached.on.Time_Y.N: int 1 1 1 1 1 1 1 1 1 1 ...
From the result above, we know the following:
1. The first column is an ID (shown as ï..ID because of a file-encoding artifact; see the note after this list). It is unique per row and carries no predictive information, so it should be removed.
2. The twelfth column, Reached.on.Time_Y.N, is the target variable. It is currently stored as int; it should be converted to a factor and renamed shipping_status.
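The ï.. prefix on the first column name is typically a UTF-8 byte-order-mark artifact when read.csv() is used on Windows. As an optional alternative to deleting the mangled name, the file can be re-read with the encoding declared; this is a minimal sketch, assuming the CSV really starts with a BOM.
# optional: declare the BOM when reading so the first column is named "ID" rather than "ï..ID"
# (assumes the file is UTF-8 with a byte-order mark)
Ecommerce_df <- read.csv("C:/Ecommerce/ALuLproject/data/Ecommerce_Shipping_Data.csv",
                         fileEncoding = "UTF-8-BOM")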
# remove unnecessary columns
Ecommerce_df$ï..ID <- NULL
# change to factor for target variable
colnames(Ecommerce_df)[11] <- "shipping_status"
# in the source data, 1 means the shipment did NOT reach on time, so it is labelled "Late"
Ecommerce_df$shipping_status <- factor(Ecommerce_df$shipping_status,
                                       levels = c(0, 1),
                                       labels = c("Ontime", "Late"))
2. Exploratory Data Analysis
Univariate analysis: the counts of on-time and late shipments in the shipping_status column.
library(ggplot2)
ggplot(data = Ecommerce_df, aes(x = shipping_status)) +
  geom_bar(color = "white", fill = "#49B3E8") +
  geom_text(aes(y = ..count.. - 50,
                label = paste0(round(prop.table(..count..), 4) * 100, '%')),
            stat = 'count', position = position_stack(vjust = 0.5),
            size = 5, color = "white") +
  labs(title = "Shipping Status Of Observation", x = "Shipping Status")
From the result above, we see that more shipments are late than on time.
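The percentages in the chart can be checked numerically with base R (a minimal sketch, no extra packages assumed):
# exact counts and proportions of the two shipping_status classes
table(Ecommerce_df$shipping_status)
round(prop.table(table(Ecommerce_df$shipping_status)), 3)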
Bivariate analysis.
Distribution of the Warehouse_block variable by shipping status.
Ecommerce_df$Warehouse_block <- as.factor(Ecommerce_df$Warehouse_block)
ggplot(Ecommerce_df, aes(x = Warehouse_block, fill = shipping_status)) +
geom_bar()
From the result above, late shipments outnumber on-time shipments in each warehouse block.
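Stacked counts can hide differences in the late rate between blocks; a short sketch of the within-block proportions (base R only):
# share of late vs. on-time shipments within each warehouse block
round(prop.table(table(Ecommerce_df$Warehouse_block,
                       Ecommerce_df$shipping_status), margin = 1), 3)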
Distribution of the Mode_of_Shipment variable by shipping status.
Ecommerce_df$Mode_of_Shipment <- as.factor(Ecommerce_df$Mode_of_Shipment)
ggplot(Ecommerce_df, aes(x = Mode_of_Shipment, fill = shipping_status)) +
geom_bar()
From the result above, late shipments outnumber on-time shipments for each mode of shipment.
Distribution of the Customer_care_calls variable by shipping status.
ggplot(Ecommerce_df, aes(x = shipping_status, y = Customer_care_calls)) +
  # jittered points per class; a regression smooth is omitted because
  # shipping_status is a two-level factor, so a cubic lm fit cannot be estimated
  geom_point(color = "darkcyan", size = 2, position = "jitter") +
  labs(title = "Shipping status by customer care calls",
       x = "Shipping status",
       y = "Customer care calls")
From the plot above, late shipments outnumber on-time shipments.
Distribution of the Customer_rating variable by shipping status.
ggplot(Ecommerce_df, aes(x = shipping_status, y = Customer_rating)) +
  geom_point(color = "darkcyan", size = 2, position = "jitter") +
  labs(title = "Shipping status by customer rating",
       x = "Shipping status",
       y = "Customer rating")
From the plot above, late shipments outnumber on-time shipments.
Distribution of the Cost_of_the_Product variable by shipping status.
ggplot(Ecommerce_df, aes(x = shipping_status, y = Cost_of_the_Product)) +
  geom_point(color = "darkcyan", size = 2, position = "jitter") +
  labs(title = "Shipping status by cost of the product",
       x = "Shipping status",
       y = "Cost of the Product")
From the plot above, late shipments outnumber on-time shipments.
Distribution of the Prior_purchases variable by shipping status.
ggplot(Ecommerce_df, aes(x = shipping_status, y = Prior_purchases)) +
  geom_point(color = "darkcyan", size = 2, position = "jitter") +
  labs(title = "Shipping status by prior purchases",
       x = "Shipping status",
       y = "Prior purchases")
From the plot above, late shipments outnumber on-time shipments.
Distribution of the Product_importance variable by shipping status.
Ecommerce_df$Product_importance <- as.factor(Ecommerce_df$Product_importance)
ggplot(Ecommerce_df, aes(x = Product_importance, fill = shipping_status)) +
geom_bar()
From the result above, late shipments outnumber on-time shipments at every level of product importance.
Distribution of the Gender variable by shipping status.
Ecommerce_df$Gender <- as.factor(Ecommerce_df$Gender)
ggplot(Ecommerce_df, aes(x = Gender, fill = shipping_status)) +
geom_bar()
From the result above, late shipments outnumber on-time shipments for both genders.
Distribution of the Discount_offered variable by shipping status.
ggplot(Ecommerce_df, aes(x = shipping_status, y = Discount_offered)) +
  geom_point(color = "darkcyan", size = 2, position = "jitter") +
  labs(title = "Shipping status by discount offered",
       x = "Shipping status",
       y = "Discount offered")
From the plot above, late shipments outnumber on-time shipments, and the on-time points cluster at lower discount values.
Distribution of the Weight_in_gms variable by shipping status.
ggplot(Ecommerce_df, aes(x = shipping_status, y = Weight_in_gms)) +
  geom_point(color = "darkcyan", size = 2, position = "jitter") +
  labs(title = "Shipping status by weight in grams",
       x = "Shipping status",
       y = "Weight in gms")
From the plot above, late shipments outnumber on-time shipments.
Scatterplot of Discount_offered versus Weight_in_gms, with shipping status represented by color.
ggplot(Ecommerce_df, aes(x=Discount_offered, y=Weight_in_gms, color=shipping_status)) + geom_point()
From the result above, we know the following:
1. On-time shipments mostly occur where Weight_in_gms is below 2000 or between 4000 and 6000.
2. Late shipments are concentrated where Discount_offered is above 20 and below 60 (quantified in the sketch below).
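To put rough numbers on these two observations, the late rate can be computed per band of Discount_offered and Weight_in_gms (a sketch; the breakpoints are illustrative, not taken from the analysis above):
# late rate per discount band (illustrative breakpoints)
discount_band <- cut(Ecommerce_df$Discount_offered,
                     breaks = c(0, 10, 20, 40, 65), include.lowest = TRUE)
round(tapply(Ecommerce_df$shipping_status == "Late", discount_band, mean), 3)
# late rate per weight band (illustrative breakpoints)
weight_band <- cut(Ecommerce_df$Weight_in_gms,
                   breaks = c(0, 2000, 4000, 6000, 8000), include.lowest = TRUE)
round(tapply(Ecommerce_df$shipping_status == "Late", weight_band, mean), 3)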
Correlation coefficients.
Compute and visualize the correlation coefficients between the measurements.
library(corrgram)
corrgram(Ecommerce_df[, c("Warehouse_block", "Mode_of_Shipment", "Customer_care_calls",
                          "Customer_rating", "Cost_of_the_Product", "Prior_purchases",
                          "Product_importance", "Gender", "Discount_offered", "Weight_in_gms")],
         main = "Correlation Coefficient variables")
From the result above, we know the following:
The strongest correlations among the predictors are between Customer_care_calls and Cost_of_the_Product, and between Discount_offered and Weight_in_gms.
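The corrgram only shades the numeric columns, so the same relationships can be quantified directly with cor() (a minimal sketch on the integer-valued predictors):
# Pearson correlation matrix of the numeric predictors
num_cols <- c("Customer_care_calls", "Customer_rating", "Cost_of_the_Product",
              "Prior_purchases", "Discount_offered", "Weight_in_gms")
round(cor(Ecommerce_df[, num_cols]), 2)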
3. Data Preparation
3.1 Feature Selection and Data Splitting
# the ID column was already removed above; all remaining variables are kept as predictors
# split the data into 60% training and 40% test sets
m <- nrow(Ecommerce_df)
set.seed(2021)
train_idx <- sample(m, 0.6 * m)
train_df <- Ecommerce_df[ train_idx, ]
test_df <- Ecommerce_df[ -train_idx, ]
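A quick sanity check that the 60/40 split keeps a similar class balance in both partitions (base R only):
# sizes and class balance of the two partitions
dim(train_df); dim(test_df)
round(prop.table(table(train_df$shipping_status)), 3)
round(prop.table(table(test_df$shipping_status)), 3)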
4. Modeling
4.1. Logistic Regression
fit.log <- glm(formula = shipping_status ~ .,
data = train_df,
family = binomial())
summary(fit.log)
##
## Call:
## glm(formula = shipping_status ~ ., family = binomial(), data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6825 -1.0748 0.1228 1.0848 1.9054
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.615e+00 2.758e-01 5.855 4.77e-09 ***
## Warehouse_blockB -8.330e-02 9.775e-02 -0.852 0.39410
## Warehouse_blockC 1.408e-02 9.892e-02 0.142 0.88680
## Warehouse_blockD 3.761e-02 9.768e-02 0.385 0.70024
## Warehouse_blockF -1.126e-02 8.540e-02 -0.132 0.89508
## Mode_of_ShipmentRoad 5.023e-02 9.861e-02 0.509 0.61048
## Mode_of_ShipmentShip -1.840e-03 7.777e-02 -0.024 0.98113
## Customer_care_calls -8.994e-02 2.791e-02 -3.223 0.00127 **
## Customer_rating -2.688e-03 1.996e-02 -0.135 0.89286
## Cost_of_the_Product -2.002e-03 6.491e-04 -3.084 0.00204 **
## Prior_purchases -9.323e-02 1.931e-02 -4.827 1.39e-06 ***
## Product_importancelow -3.353e-01 1.080e-01 -3.104 0.00191 **
## Product_importancemedium -3.338e-01 1.085e-01 -3.076 0.00209 **
## GenderM 1.839e-02 5.644e-02 0.326 0.74452
## Discount_offered 1.157e-01 5.795e-03 19.962 < 2e-16 ***
## Weight_in_gms -2.161e-04 2.056e-05 -10.512 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8907.2 on 6598 degrees of freedom
## Residual deviance: 7187.8 on 6583 degrees of freedom
## AIC: 7219.8
##
## Number of Fisher Scoring iterations: 6
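The coefficients above are on the log-odds scale for the Late class, so exponentiating them gives odds ratios that are easier to interpret (a minimal sketch using base R; confint.default() gives Wald intervals rather than profiled ones):
# odds ratios (and Wald 95% intervals) for being Late
round(exp(coef(fit.log)), 3)
round(exp(confint.default(fit.log)), 3)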
4.2 Classification Tree
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
fit.ctree <- ctree(formula = shipping_status ~ .,
data = train_df,
control = ctree_control(maxdepth = 4))
plot(fit.ctree)
4.3. Random Forest
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(2021)
fit.forest <- randomForest(formula = shipping_status ~.,
data = train_df,
na.action = na.roughfix)
fit.forest
##
## Call:
## randomForest(formula = shipping_status ~ ., data = train_df, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 34.54%
## Confusion matrix:
## Ontime Late class.error
## Ontime 1889 782 0.2927742
## Late 1497 2431 0.3811100
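randomForest also records variable importance (mean decrease in Gini), which is worth inspecting before drawing conclusions about the strongest predictors; a short sketch:
# importance of each predictor (mean decrease in Gini impurity)
importance(fit.forest)
varImpPlot(fit.forest, main = "Variable importance (random forest)")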
4.4 Support Vector Machine
library(e1071)
fit.svm <- svm(formula = shipping_status ~ .,
data = train_df)
fit.svm
##
## Call:
## svm(formula = shipping_status ~ ., data = train_df)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 4651
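The SVM above uses the default cost and gamma; e1071 provides tune() for a grid search over these hyperparameters. This is a hedged sketch with a deliberately small, illustrative grid, since tuning on roughly 6,600 training rows is slow:
# illustrative grid search over cost and gamma (can take several minutes)
set.seed(2021)
svm_tune <- tune(svm, shipping_status ~ ., data = train_df,
                 ranges = list(cost = c(0.5, 1, 2), gamma = c(0.05, 0.1, 0.2)))
summary(svm_tune)
fit.svm.tuned <- svm_tune$best.model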
5. Evaluation
actual <- test_df$shipping_status
5.1. Logistic Regression
prob <- predict(fit.log, test_df, type = "response")
# the predicted probability refers to the "Late" level; values above 0.5 are labelled Late
pred.log <- factor(prob > 0.5,
                   levels = c(FALSE, TRUE),
                   labels = c("Ontime", "Late"))
cm.log <- table(actual, pred.log,
                dnn = c("Actual", "Predicted"))
cm.log
## Predicted
## Actual Ontime Late
## Ontime 1029 736
## Late 872 1763
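Because the logistic model outputs probabilities, a threshold-free summary such as the ROC curve and AUC complements the fixed 0.5-threshold confusion matrix (a sketch assuming the pROC package is installed):
# ROC / AUC for the logistic model (assumes the pROC package is available)
library(pROC)
roc.log <- roc(response = actual, predictor = prob,
               levels = c("Ontime", "Late"))
auc(roc.log)
plot(roc.log)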
5.2. Classification Tree
pred.ctree <- predict(fit.ctree, test_df)
cm.ctree <- table(actual, pred.ctree,
dnn = c("Actual", "Predicted"))
cm.ctree
## Predicted
## Actual Ontime Late
## Ontime 1700 65
## Late 1328 1307
5.3. Random Forest
pred.forest <- predict(fit.forest, test_df)
cm.forest <- table(actual, pred.forest,
dnn = c("Actual", "Predicted"))
cm.forest
## Predicted
## Actual Ontime Late
## Ontime 1316 449
## Late 1001 1634
5.4. Support Vector Machine
pred.svm <- predict(fit.svm, na.omit(test_df))
cm.svm <- table(na.omit(test_df)$shipping_status, pred.svm,
dnn = c("Actual", "Predicted"))
cm.svm
## Predicted
## Actual Ontime Late
## Ontime 1585 180
## Late 1301 1334
performance <- function(cm, method){
  # cm has actual classes in rows and predicted classes in columns;
  # the second level ("Late") is treated as the positive class
  TN <- cm[1,1]
  TP <- cm[2,2]
  FN <- cm[2,1]
  FP <- cm[1,2]
  accuracy <- (TN+TP) / (TN+TP+FN+FP)
  precision <- (TP) / (TP+FP)
  recall <- (TP) / (FN+TP)
  f1score <- (2*precision*recall) / (precision + recall)
  result <- paste("***", method, "***",
                  "\naccuracy = ", round(accuracy,3),
                  "\nprecision = ", round(precision,3),
                  "\nrecall = ", round(recall,3),
                  "\nf1score = ", round(f1score,3))
  cat(result)
}
performance(cm.log, "Logistic Regression")
## *** Logistic Regression ***
## accuracy = 0.635
## precision = 0.705
## recall = 0.669
## f1score = 0.687
performance(cm.ctree, "Classification Tree")
## *** Classification Tree ***
## accuracy = 0.683
## precision = 0.953
## recall = 0.496
## f1score = 0.652
performance(cm.forest, "Random Forest")
## *** Random Forest ***
## accuracy = 0.67
## precision = 0.784
## recall = 0.62
## f1score = 0.693
performance(cm.svm, "Support Vector Machine")
## *** Support Vector Machine ***
## accuracy = 0.663
## precision = 0.881
## recall = 0.506
## f1score = 0.643
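For a side-by-side comparison, the same confusion matrices can be summarised into one table with a small variant of performance() that returns a numeric vector instead of printing (a sketch; perf_row is an illustrative helper, not part of the original code):
# collect accuracy/precision/recall/F1 for all models in one table
perf_row <- function(cm) {
  TN <- cm[1,1]; TP <- cm[2,2]; FN <- cm[2,1]; FP <- cm[1,2]
  precision <- TP / (TP + FP); recall <- TP / (TP + FN)
  c(accuracy = (TN + TP) / sum(cm),
    precision = precision,
    recall = recall,
    f1score = 2 * precision * recall / (precision + recall))
}
round(rbind(Logistic = perf_row(cm.log),
            Tree = perf_row(cm.ctree),
            Forest = perf_row(cm.forest),
            SVM = perf_row(cm.svm)), 3)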
Improving the performance: focus on the classification tree method and study the ctree_control() parameters.
fit.ctree <- ctree(formula = shipping_status ~ .,
                   data = train_df,
                   control = ctree_control(maxdepth = 0,      # 0 = no depth limit
                                           mincriterion = 0.90,
                                           minsplit = 25))
plot(fit.ctree)
pred.ctree <- predict(fit.ctree, test_df)
cm.ctree <- table(actual, pred.ctree,
dnn = c("Actual", "Predicted"))
cm.ctree
## Predicted
## Actual Ontime Late
## Ontime 1706 59
## Late 1337 1298
performance(cm.ctree, "Classification Tree")
## *** Classification Tree ***
## accuracy = 0.683
## precision = 0.957
## recall = 0.493
## f1score = 0.65
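A fuller way to study ctree_control() is a small grid over its parameters; the sketch below scores each setting on the test set for brevity, although a separate validation split or cross-validation would be the cleaner choice (the grid values are illustrative):
# try a few mincriterion / minsplit settings and report test accuracy (illustrative grid)
for (crit in c(0.90, 0.95, 0.99)) {
  for (msplit in c(25, 50, 100)) {
    fit <- ctree(shipping_status ~ ., data = train_df,
                 control = ctree_control(mincriterion = crit, minsplit = msplit))
    acc <- mean(predict(fit, test_df) == actual)
    cat("mincriterion =", crit, "minsplit =", msplit,
        "test accuracy =", round(acc, 3), "\n")
  }
}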
6. Recommendation
From the results above, we conclude the following:
1. The classification tree is the best of the tested algorithms, with the highest accuracy (0.683).
2. Its performance may be improved further by studying the ctree_control() parameters, although the single alternative setting tried above gave essentially the same accuracy.
3. The most important variables are Discount_offered and Weight_in_gms.