This analysis aims to understand key factors influencing COD (Cash on Delivery) success using delivery task data. Insights from this study can support operational decisions to optimise success rates and resource allocation. The analysis involved several steps:
Exploratory Data Analysis (EDA): To understand data distributions, task patterns, and COD success rates.
Feature Engineering: Creating variables such as hour_created, origin_dest, and taskDuration to capture relevant delivery characteristics.
Machine Learning Modelling: Applying Random Forest with hyperparameter tuning to predict COD success probability.
Evaluation: Using metrics such as AUC, confusion matrix, and SHAP analysis to interpret model performance and feature contributions.
Insights Extraction: Translating model outputs into actionable business recommendations.
In this analysis, several libraries were loaded to support the entire
process, such as tidyverse and
dplyr for data manipulation,
lubridate for date-time processing,
listviewer for exploring JSON data
structures, and caret for data
partitioning and modeling preparation. Meanwhile,
h2o served as the main library for machine learning
modeling, providing comprehensive functions ranging from
environment initialization, data conversion to H2O Frame, training
various algorithms, hyperparameter tuning with grid search, to efficient
model performance evaluation within a single framework.
lapply(c("tidyverse","dplyr","lubridate","listviewer","caret","h2o"),library,character.only=T)[[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
This dataset contains comprehensive information about delivery tasks, including task ID, creation and completion time, assigned worker, completion coordinates, and task type (flow). It also includes COD details such as payment amount and receipt status, as well as additional data (UserVar) covering delivery status, origin and destination branches, and package weight. This information is crucial for analyzing delivery patterns and predicting COD success.
At the preprocessing stage, the taskCreatedTime and taskCompletedTime columns were parsed as date-times in the Asia/Jakarta time zone, and their dates and times of day were extracted into separate columns. Several new variables were then created: taskDuration (task duration in minutes), hour_created (the task creation hour), origin_dest (the origin and destination branches combined), weekday (the day of task creation), and is_weekend (an indicator of whether the task was created on a weekend). These steps prepare relevant features for subsequent modeling and analysis.
#Raw Data
data<-jsonlite::fromJSON("C:/Users/ACERIndonesiaTest/OneDrive/Documents/data-task-sample-main/data-sample.json")
#Mutate taskCreated/Completed Date & Hour from both task time columns
Data<-data %>%
mutate(
taskCreatedTime_parsed = ymd_hms(taskCreatedTime, tz = "Asia/Jakarta"),
taskCompletedTime_parsed = ymd_hms(taskCompletedTime, tz = "Asia/Jakarta"),
taskCreatedDate = as.Date(taskCreatedTime_parsed),
taskCreatedHour = format(taskCreatedTime_parsed, "%H:%M:%S"),
taskCompletedDate = as.Date(taskCompletedTime_parsed),
taskCompletedHour = format(taskCompletedTime_parsed, "%H:%M:%S")
)
# Add New Columns (taskDuration, hour_created, origin_dest, weekday, is_weekend)
# Request minutes explicitly: plain subtraction lets R choose the difftime units automatically
Data$taskDuration<-difftime(Data$taskCompletedTime_parsed,Data$taskCreatedTime_parsed,units="mins")
# Note: the two ymd_hms() calls below use the default tz (UTC), unlike the *_parsed columns above
Data$hour_created <- lubridate::hour(ymd_hms(Data$taskCreatedTime))
Data$origin_dest <- paste(Data$UserVar$branch_origin, Data$UserVar$branch_dest, sep = "_")
Data$weekday <- wday(ymd_hms(Data$taskCreatedTime), label = TRUE)
Data$is_weekend <- ifelse(Data$weekday %in% c("Sat","Sun"), 1, 0)
listviewer::jsonedit(Data)
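The derived time features can be illustrated on a couple of made-up timestamps. This is a minimal base R sketch, not the report's own code: the timestamps are hypothetical, and base functions (`difftime`, `as.POSIXlt`) stand in for the lubridate helpers used above. It shows why asking for minutes explicitly is safer than relying on subtraction, and how a locale-independent weekend flag can be built.

```r
# Hypothetical timestamps, used only to illustrate the derived columns
created   <- as.POSIXct(c("2022-11-05 10:15:00", "2022-11-07 23:30:00"),
                        tz = "Asia/Jakarta")
completed <- as.POSIXct(c("2022-11-05 11:00:00", "2022-11-08 01:00:00"),
                        tz = "Asia/Jakarta")

# Plain subtraction picks the units automatically from the smallest gap,
# so the unit of the result depends on the data
auto_units <- units(completed - created)

# Asking for minutes explicitly keeps the unit fixed whatever the gaps are
duration_min <- as.numeric(difftime(completed, created, units = "mins"))

# Hour and weekday are read in the time zone carried by the vector
lt <- as.POSIXlt(created)
hour_created <- lt$hour

# POSIXlt weekdays run from 0 (Sunday) to 6 (Saturday), independent of locale
is_weekend <- as.integer(lt$wday %in% c(0, 6))

print(data.frame(created, duration_min, hour_created, is_weekend))
```

Using the numeric weekday avoids depending on the abbreviated day labels, which change with the session locale.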
The cleaned dataset contains key variables for modeling, including taskAssignedTo (courier, converted to factor), UserVar.weight (package weight as numeric), hour.created (task creation hour), origin.dest (combination of origin and destination branches as factor), is.weekend (weekend indicator as factor), and cod (COD receipt status converted to factor). This preprocessing ensures that all variables have appropriate data types for analysis and subsequent machine learning modeling.
Dat<-data.frame(taskAssignedTo=Data$taskAssignedTo%>%as.factor(),
UserVar.weight=Data$UserVar$weight%>%as.numeric(),
hour.created=Data$hour_created,
origin.dest=Data$origin_dest%>%as.factor(),
is.weekend=Data$is_weekend%>%as.factor(),
cod=Data$cod$received%>%as.factor());Dat
Data were split into 75% for training and 25% for testing, with a set seed to ensure reproducibility, preparing datasets for model training and evaluation.
set.seed(123)
i<-createDataPartition(Dat$cod,p=0.75,list=FALSE)
dtr<-Dat[i,]
dts<-Dat[-i,]
# Convert both splits to H2O frames
trh<-as.h2o(dtr);trh%>%as.data.frame()
tsh<-as.h2o(dts);tsh%>%as.data.frame()
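A stratified split keeps the class mix of the outcome roughly equal in both partitions, which is the property the 75/25 split relies on. The sketch below is a base R analogue of that idea, assuming a hypothetical toy data frame with a 70/30 class balance; it is not caret's implementation and does not use the delivery data.

```r
set.seed(123)

# Hypothetical toy data: 140 FALSE and 60 TRUE outcomes
toy <- data.frame(
  id  = 1:200,
  cod = factor(rep(c("FALSE", "TRUE"), times = c(140, 60)))
)

# Sample 75% of the row indices within each class separately
rows_by_class <- split(seq_len(nrow(toy)), toy$cod)
train_idx <- unlist(
  lapply(rows_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
  }),
  use.names = FALSE
)

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Class proportions should match in both partitions
print(prop.table(table(train$cod)))
print(prop.table(table(test$cod)))
```

Comparing the two proportion tables is a quick check that neither partition over-represents one COD outcome.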
Hyperparameter tuning for Random Forest was performed using grid
search to identify the optimal parameter combination that produced the
best performance based on Area Under Curve (AUC).
Parameters tested included ntrees (number
of trees) [200, 500, 1000], max_depth
(maximum tree depth) [10, 20, 30, 40], min_rows
(minimum rows per leaf) [1, 5, 10, 20], and
sample_rate (proportion of data used per
tree) [0.7 to 1.0]. Variations of mtries
(number of variables considered at each split) ranged from default (-1)
to fractions based on the total number of predictors, and
col_sample_rate_per_tree (proportion of
features used per tree) was tested at [0.6, 0.8, 1.0]. This approach
ensured comprehensive model exploration to capture the most effective
hyperparameter combinations for COD prediction.
n_pred <- length(setdiff(names(trh), "cod")) # number of predictors
hyper_params <- list(
ntrees = c(200, 500, 1000),
max_depth = c(10, 20, 30, 40),
min_rows = c(1, 5, 10, 20),
sample_rate = c(0.7, 0.8, 0.9, 1.0),
mtries = c(-1, sqrt(n_pred), n_pred/3, n_pred/2),
col_sample_rate_per_tree = c(0.6, 0.8, 1.0)
);listviewer::jsonedit(hyper_params)
Hyperparameter tuning used the RandomDiscrete search strategy
with max_models = 100, meaning that up to 100 models were
sampled at random from the 2,304 possible parameter combinations
(3 × 4 × 4 × 4 × 4 × 3) instead of fitting every one, as an
exhaustive (Cartesian) search would. This strategy enabled broad
exploration of the parameter space with shorter computation time,
while seed = 1234 ensured reproducible
results.
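The size of the full search space can be checked by expanding the grid. This is a small base R sketch that assumes five predictors (the columns of Dat other than cod), so `n_pred` is hard-coded here instead of being read from the H2O frame.

```r
# Assumption: five predictors, matching the columns of Dat other than cod
n_pred <- 5

hyper_params <- list(
  ntrees = c(200, 500, 1000),
  max_depth = c(10, 20, 30, 40),
  min_rows = c(1, 5, 10, 20),
  sample_rate = c(0.7, 0.8, 0.9, 1.0),
  mtries = c(-1, sqrt(n_pred), n_pred / 3, n_pred / 2),
  col_sample_rate_per_tree = c(0.6, 0.8, 1.0)
)

# Every combination a Cartesian search would have to fit
full_grid <- expand.grid(hyper_params)
n_combinations <- nrow(full_grid)   # 2304

# Share of the space covered by a random search capped at 100 models
share_sampled <- 100 / n_combinations

print(n_combinations)
print(round(share_sampled, 3))
```

A cap of 100 models therefore covers a little over 4% of the combinations, which is where the saving in computation time comes from.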
search_criteria <- list(
strategy = "RandomDiscrete",
max_models = 100,
seed = 1234
);listviewer::jsonedit(search_criteria)
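Hyperparameter tuning was performed using grid search with the predefined search criteria and parameters. Models were fitted on the training frame (trh), with the test frame (tsh) supplied as the validation frame, to obtain the best model based on AUC.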
rf_grid <- h2o.grid(
algorithm = "randomForest",
grid_id = "rf11gd",
x = setdiff(names(trh), "cod"),
y = "cod",
training_frame = trh,
validation_frame = tsh,
hyper_params = hyper_params,
search_criteria = search_criteria
)
Grid search results showed that the best Random Forest model was
obtained with col_sample_rate_per_tree = 0.6,
max_depth = 40, min_rows = 1, mtries = -1,
ntrees = 500, and sample_rate = 1,
achieving an AUC of 0.9959. This model was selected
from a total of 100 hyperparameter combinations tested,
whose AUC values ranged from 0.98 to
0.9959, so even the weakest configuration performed very
well. Because performance was stable across combinations, the
trade-off between performance and model complexity can be
considered if a simpler model is preferred.
rf_grid<-read_rds("C:/Users/ACERIndonesiaTest/OneDrive/Documents/DTA Grid Search Hyperpar RF1.rds")
sorted_grid <- h2o.getGrid("rf11gd", sort_by = "AUC", decreasing = T)
bm<-sorted_grid@summary_table%>%as.data.frame()%>%column_to_rownames("model_ids");bm
The best Random Forest model consisted of 500 trees
with a maximum depth of 40, using
min_rows = 1, which allows a leaf to hold a
single observation. It used
col_sample_rate_per_tree = 0.6, meaning
each tree was grown on a random 60% of the features to enhance generalisation.
sample_rate = 1 meant every training row was
available to each tree, and mtries = -1
kept H2O's default number of features tried at each split (the square
root of the number of predictors for classification). The model
predicts COD (Cash on Delivery) receipt from
taskAssignedTo (courier),
UserVar.weight (package weight), hour.created
(task creation hour), origin.dest (route), and is.weekend
(weekend indicator). Deep trees with small leaves let the model
capture complex patterns in the COD data.
best_rf <- h2o.getModel(sorted_grid@model_ids[[1]]);best_rf@parameters%>%listviewer::jsonedit()
best_rf@allparameters%>%listviewer::jsonedit()
Results showed origin.dest (route) was the most influential factor for COD success (52%), indicating origin-destination combinations significantly affect package acceptance. taskAssignedTo (25%) highlighted significant performance differences among couriers, suggesting opportunities for improvement through evaluation and training. hour.created (15%) also had an impact, showing certain delivery times had higher acceptance rates. Meanwhile, UserVar.weight (7%) had a small but relevant influence on COD preferences based on package weight, and is.weekend (2%) had almost no impact, indicating stable COD behaviour between weekdays and weekends.
Insight:
Prioritise routes with high success rates or evaluate strategies for routes with low success rates.
Schedule COD deliveries at optimal times for operational efficiency.
Improve courier competency and service standards through data-driven coaching.
h2o.varimp(best_rf)%>%ggplot(aes(x=variable%>%reorder(percentage),y=percentage*100,fill=percentage*100))+
geom_bar(stat="identity",width=0.7)+scale_fill_gradient(low="steelblue",high="navy")+
theme_minimal()+theme(panel.grid = element_blank(),axis.title = element_blank(),
axis.text = element_text(size=11),legend.title = element_blank(),
plot.title = element_text(hjust=0.5,face="bold"))+
geom_label(aes(label=scales::percent(round(percentage,2))),
size=4,col="white")+coord_flip()+labs(title="Variable Importance Plot of Best RF Model")
Evaluation on the test data shows that the model achieved a very high AUC of 0.9959, indicating an almost perfect ability to distinguish between classes. The confusion matrix shows high accuracy with only 13 errors out of 588 data points (2.2% error rate), where specificity reached 99.8% (414/415) and recall for the TRUE class was 93% (161/173). Additionally, the highest F1 score was 0.96 at a threshold of 0.64, and maximum accuracy was 97.8%. These results demonstrate that the model has excellent predictive performance and stability in classifying COD status on the test dataset.
perf_best <- h2o.performance(best_rf, newdata = tsh);perf_best@metrics$AUC[[1]]
## [1] 0.9958772
perf_best@metrics$cm$table
perf_best@metrics$max_criteria_and_metric_scores
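The headline metrics can be reproduced by hand from the confusion matrix. The sketch below uses the cell counts implied by the ratios quoted in the text (414/415 for specificity, 161/173 for recall); the counts are inferred from those figures and are not read from `perf_best` directly.

```r
# Cell counts implied by the reported ratios (414/415 and 161/173)
tn <- 414  # FALSE predicted as FALSE
fp <- 1    # FALSE predicted as TRUE
fn <- 12   # TRUE predicted as FALSE
tp <- 161  # TRUE predicted as TRUE

n <- tn + fp + fn + tp

accuracy    <- (tp + tn) / n
error_rate  <- (fp + fn) / n
specificity <- tn / (tn + fp)
recall      <- tp / (tp + fn)
precision   <- tp / (tp + fp)
f1          <- 2 * precision * recall / (precision + recall)

print(round(c(n = n, accuracy = accuracy, error_rate = error_rate,
              specificity = specificity, recall = recall,
              precision = precision, f1 = f1), 4))
```

These values agree with the reported 588 observations, 13 errors, 97.8% accuracy, and an F1 score of about 0.96.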
SHAP explains the magnitude and direction of each factor’s influence on model predictions, where positive values increase COD acceptance probability and negative values decrease it. In the plot, blue indicates data points with low factor values, while red indicates high factor values, making it easier to interpret the impact of low or high values on predictions.
Results showed origin.dest (route combinations) had the largest influence, meaning certain routes significantly increased or decreased COD success likelihood. taskAssignedTo (assigned courier) also had a large impact with both positive and negative contributions, indicating performance differences among couriers affecting COD outcomes. For hour.created (delivery hour), deliveries in early morning hours (blue) tended to decrease COD acceptance, while midday to evening hours (red) increased it. UserVar.weight (package weight) had a small impact, with heavier packages slightly reducing COD acceptance, while is.weekend (delivery day) had almost no influence, indicating stable COD acceptance between weekdays and weekends. In summary, route, courier, and delivery time are key factors that can be optimised to improve COD success rates.
hsh<-h2o.shap_summary_plot(best_rf, tsh)
hsh+labs(title = "SHAP Summary Plot – Best RF Model", y = "SHAP Contribution", x = "Feature") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
axis.text = element_text(color = "black",size=9),
legend.title = element_text(face = "bold"),
panel.grid = element_blank())
The analysis revealed that route combinations (origin.dest), courier assignments (taskAssignedTo), and delivery hours (hour.created) are the most influential factors determining COD success. Specifically, certain routes significantly increased or decreased acceptance rates, suggesting that operational strategies should prioritise or adjust these routes. Courier performance varied considerably, highlighting opportunities for targeted coaching to standardise service quality and improve success rates across the team. Delivery time also played a key role, with morning deliveries generally having lower acceptance, while midday to evening deliveries showed higher COD success, indicating potential for scheduling optimisation.
Package weight was found to have a small negative effect, with heavier packages slightly reducing acceptance probability, which could inform decisions on payment method options for specific weight categories. Additionally, day of the week (is.weekend) did not significantly impact COD success, suggesting operational scheduling can remain flexible across weekdays and weekends without affecting outcomes.
Overall, the Random Forest model demonstrated outstanding predictive performance (AUC 0.9959) with high accuracy and stability, confirming that the selected features effectively capture patterns associated with COD success. These insights provide clear directions for operational improvements, courier training, and strategic decision-making to optimise delivery efficiency and customer experience.
Based on these findings, several practical recommendations are proposed to optimise operational efficiency and improve COD success rates.
Prioritise routes with high success rates and re-evaluate strategies for routes with consistently low acceptance.
Schedule deliveries during midday to evening hours to maximise COD success rates.
Implement coaching programs for couriers to address performance gaps and standardise service quality.
Consider payment method strategies for heavier packages to improve acceptance.
Integrate model predictions into operational planning tools to enable data-driven decision-making at scale.