Download the list of shipments that have yet to be picked and display the Row/Bay location of each shipment.
unpicked_shipments <- fetch("76425d87-f474-44a4-b0af-1bed708006eb")
## Parsed with column specification:
## cols(
## shipments_id = col_double(),
## orders_id = col_double(),
## products_id = col_double(),
## products_name = col_character(),
## products_quantity = col_double(),
## products_weight = col_double(),
## products_location = col_character(),
## Row = col_double(),
## Bay = col_double(),
## Shelf = col_character(),
## Bin = col_character(),
## date_purchased = col_datetime(format = ""),
## orders_status = col_double(),
## orders_status_name = col_character(),
## hold = col_double(),
## x = col_double(),
## y = col_double()
## )
ggplot(unpicked_shipments, aes(x = x, y = y)) + geom_point()
Group multi-SKU shipments into a single average SKU location per shipment.
avg_loc <- unpicked_shipments %>%
group_by(shipments_id) %>%
summarise(x = mean(x),
y = mean(y),
qty = n()) %>%
ungroup()
## Matrix of average (x, y) locations, one row per shipment
mat = cbind(avg_loc$x, avg_loc$y)
mat %>% as.data.frame %>% ggplot(aes(V1, V2)) + geom_point(alpha=.5) + theme_bw()
library(ggplot2)
library(dplyr)
library(tidyr)
library(cluster)
## Winsorize a vector: cap values above u
## e.g. winsor(c(3, 18, 250), 60) returns 3 18 60
winsor <- function(x, u){
  if(any(x>u)) x[x>u] = u
  x
}
I would like to cluster points into groups of similar size. For example, I would like to group 1000 points into clusters of around 20 points each. The two aspects that are important here are: the clusters should all have roughly the target size, and the points within a cluster should be close to one another (small within-cluster distances).
In addition to the typical hierarchical clustering approach, I will test the following iterative approaches: an iterative dichotomy that keeps splitting oversized clusters in two, a nearest-neighbors approach that grows one cluster at a time around a seed point, a K-means variant with capped cluster sizes, and an iterative "hclust-bottom" approach that repeatedly cuts a cluster from the bottom of a hierarchical clustering tree.
In the following \(s\) is the target cluster size.
Starting with one cluster containing all the points, the largest cluster is split in two if it is larger than \(1.5 \cdot s\). The process stops when all clusters are smaller than \(1.5 \cdot s\).
The points are split in two using hierarchical clustering. I will try different linkage criteria. My guess is that the Ward criterion will be good at this because it tends to produce balanced dendrograms.
hclustit <- function(mat, clsize = 10, method='ward.D', split.th=1.5){
  ## start with a single cluster containing all the points
  lab = rep('l', nrow(mat))
  lab.size = table(lab)
  ## split the largest cluster while any cluster exceeds clsize * split.th
  while(any(lab.size>clsize*split.th)){
    lab.ii = which(lab == names(lab.size)[which.max(lab.size)])
    mmat = mat[lab.ii,]
    hc.o = hclust(dist(mmat, method = 'manhattan'), method=method)
    ## cut the dendrogram of this cluster into two sub-clusters
    lab[lab.ii] = paste0(lab[lab.ii], '-', cutree(hc.o, 2))
    lab.size = table(lab)
  }
  lab
}
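A quick sanity check of the size guarantee, on hypothetical simulated points (toy data, not the shipments):
## Hypothetical check: with the default split.th = 1.5 and clsize = 10,
## no cluster returned by hclustit should exceed 15 points
set.seed(1)
toy = matrix(rnorm(200), ncol = 2)
toy.lab = hclustit(toy, clsize = 10, method = 'ward.D')
max(table(toy.lab)) <= 15
## expected TRUE by construction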
While there are more than \(s\) unassigned points, a point is selected among the unassigned points and grouped with its \(s-1\) nearest unassigned neighbors to form a new cluster.
If the total number of points is not a multiple of \(s\), the remaining points could be either assigned to their own clusters or to an existing cluster. Actually, we completely control the cluster sizes here so we could fix the size of some clusters to \(s+1\) beforehand to avoid leftovers and ensure balanced sizes.
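The nearest-neighbors and hclust-bottom functions below precompute near-equal target sizes with a small cut/rle trick; a minimal sketch with made-up numbers:
## Hypothetical example: target sizes for 47 points and a target size of 10
## (the same trick is used inside nnit and hcbottom below)
rle(as.numeric(cut(1:47, ceiling(47/10))))$lengths
## 10 9 9 9 10: sizes differ by at most one point and there is no left-over cluster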
In the first step, a point is selected. I'll start by choosing a point randomly (out of the unassigned points). Alternatively, I could try picking points that have close neighbors, or the opposite, points that are far from the others. I'll use the mean distance between a point and the other unassigned points to pick the seed: either the point closest to the others (mind) or the point farthest from the others (maxd).
nnit <- function(mat, clsize = 10, method=c('random','maxd', 'mind')){
  ## precompute near-equal target cluster sizes (they differ by at most one point)
  clsize.rle = rle(as.numeric(cut(1:nrow(mat), ceiling(nrow(mat)/clsize))))
  clsize = clsize.rle$lengths
  lab = rep(NA, nrow(mat))
  dmat = as.matrix(dist(mat))
  cpt = 1
  while(sum(is.na(lab)) > 0){
    lab.ii = which(is.na(lab))
    dmat.m = dmat[lab.ii, lab.ii, drop=FALSE]
    ## pick a seed among the unassigned points
    if(method[1]=='random'){
      ii = sample.int(nrow(dmat.m),1)
    } else if(method[1]=='maxd'){
      ii = which.max(rowSums(dmat.m))
    } else if(method[1]=='mind'){
      ii = which.min(rowSums(dmat.m))
    } else {
      stop('unknown method')
    }
    ## the seed and its nearest unassigned neighbors form the next cluster
    lab.m = rep(NA, length(lab.ii))
    lab.m[head(order(dmat.m[ii,]), clsize[cpt])] = cpt
    lab[lab.ii] = lab.m
    cpt = cpt + 1
  }
  if(any(is.na(lab))){
    lab[which(is.na(lab))] = cpt
  }
  lab
}
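Again a quick hypothetical check on simulated points, to verify the fixed-size behavior:
## Hypothetical check: with 100 simulated points and clsize = 10,
## every cluster built by nnit should contain exactly 10 points
set.seed(1)
toy = matrix(rnorm(200), ncol = 2)
table(nnit(toy, clsize = 10, method = 'maxd'))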
As explained in a few pages online (e.g. the ELKI tutorial on same-size k-means, which the elki method below follows), one approach consists of using K-means to derive centers and then assigning the same number of points to each center/cluster.
In the proposed approach the points are ordered by their distance to the closest center minus their distance to the farthest center. Each point is assigned, in this order, to its closest cluster that still has room; if the closest cluster is full, the second closest is chosen, and so on.
I'll also try ordering the points by the distance to the closest center, by the distance to the farthest center, or using a random order.
kmvar <- function(mat, clsize=10, method=c('random','maxd', 'mind', 'elki')){
  k = ceiling(nrow(mat)/clsize)
  km.o = kmeans(mat, k)
  labs = rep(NA, nrow(mat))
  ## Euclidean distance of each point to each center
  centd = lapply(1:k, function(kk){
    euc = t(mat)-km.o$centers[kk,]
    sqrt(apply(euc, 2, function(x) sum(x^2)))
  })
  centd = matrix(unlist(centd), ncol=k)
  clsizes = rep(0, k)
  ## order in which the points are assigned
  if(method[1]=='random'){
    ptord = sample.int(nrow(mat))
  } else if(method[1]=='elki'){
    ptord = order(apply(centd, 1, min) - apply(centd, 1, max))
  } else if(method[1]=='maxd'){
    ptord = order(-apply(centd, 1, max))
  } else if(method[1]=='mind'){
    ptord = order(apply(centd, 1, min))
  } else {
    stop('unknown method')
  }
  for(ii in ptord){
    ## assign the point to the closest center that still has room
    bestcl = which.min(centd[ii,])
    labs[ii] = bestcl
    clsizes[bestcl] = clsizes[bestcl] + 1
    if(clsizes[bestcl] >= clsize){
      ## cluster is full: remove it from consideration
      centd[,bestcl] = NA
    }
  }
  return(labs)
}
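A quick hypothetical check of the capacity cap, on simulated points:
## Hypothetical check: with 100 simulated points and clsize = 10, kmvar builds
## ceiling(100/10) = 10 clusters and none of them can exceed 10 points
set.seed(1)
toy = matrix(rnorm(200), ncol = 2)
table(kmvar(toy, clsize = 10, method = 'elki'))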
While there are more than \(s\) unassigned points, the unassigned points are clustered hierarchically, the tree is cut as close to the leaves as possible while still producing a sub-cluster with at least \(s\) points, and \(s\) points from that sub-cluster form a new cluster.
Instead of working at the level of individual points, the idea is to find the best cluster at each step. The hierarchical clustering integrates information across all the (still unassigned) points, which might be more robust than ad hoc rules (e.g. the nearest-neighbors approach).
hcbottom <- function(mat, clsize = 10, method='ward.D'){
  dmat = as.matrix(dist(mat, method = "manhattan"))
  ## precompute near-equal target cluster sizes
  clsize.rle = rle(as.numeric(cut(1:nrow(mat), ceiling(nrow(mat)/clsize))))
  clsizes = clsize.rle$lengths
  cpt = 1
  lab = rep(NA, nrow(mat))
  for(clss in clsizes[-1]){
    lab.ii = which(is.na(lab))
    hc.o = hclust(as.dist(dmat[lab.ii, lab.ii]), method=method)
    clt = 0
    ct = length(lab.ii)-clss
    ## merge sub-clusters (coarser cut) until one holds at least clss points
    while(max(clt)<clss){
      cls = cutree(hc.o, ct)
      clt = table(cls)
      ct = ct - 1
    }
    ## assign clss points from the largest sub-cluster to a new cluster
    cl.sel = which(cls == as.numeric(names(clt)[which.max(clt)]))
    lab[lab.ii[head(cl.sel, clss)]] = cpt
    cpt = cpt + 1
  }
  ## the remaining points form the last cluster
  lab[which(is.na(lab))] = cpt
  lab
}
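And the same kind of hypothetical check for the bottom-up cutting approach:
## Hypothetical check: all clusters, including the left-over last one here,
## should contain exactly 10 of the 100 simulated points
set.seed(1)
toy = matrix(rnorm(200), ncol = 2)
table(hcbottom(toy, clsize = 10, method = 'ward.D'))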
Let’s aim for clusters of around \(s=20\) points.
N = nrow(mat)  ## number of shipments (points) to cluster
clsize = 20
## Cluster statistics
clstats <- function(mat, cls){
  dmat = as.matrix(dist(mat, method = 'manhattan'))
  sk <- silhouette(as.numeric(factor(cls)), dmatrix=dmat)
  mean.sk = mean(sk[,3])
  ll = lapply(unique(cls), function(cl){
    cl.ii = which(cls==cl)
    ## for singleton clusters the pairwise distances are empty, giving -Inf/NaN
    ## (hence the "non-finite values" warnings in the plots below)
    d = as.dist(dmat[cl.ii, cl.ii])
    tibble(lab=cl, max.dist=max(d), mean.dist=mean(d), size=length(cl.ii), mean.cl.sil=mean(sk[cl.ii, 3]), mean.sil=mean.sk)
  })
  do.call(rbind, ll)
}
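A quick look at what these per-cluster statistics contain, for one labelling (hypothetical usage shown here with hclust-bottom labels; any of the functions above would do):
## Hypothetical usage: one row per cluster with its size, mean/max pairwise distance
## and silhouette summaries
head(clstats(mat, hcbottom(mat, clsize, 'ward.D')))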
## Hierarchical clustering based methods
hc.l = lapply(c('average', 'complete', 'ward.D', 'single'), function(meth){
## Standard hierarchical clustering, cut into about N/clsize clusters
lab.ct = hclust(dist(mat, method = 'manhattan'), method=meth) %>% cutree(ceiling(N/clsize))
## Iterative dichotomy
lab.it = hclustit(mat, clsize, meth, split.th=1.5)
## Iterative "bottom-leaves" clustering (hclust-bottom)
lab.hcb = hcbottom(mat, clsize, meth)
rbind(clstats(mat, lab.it) %>% mutate(method=meth, strategy='dichotomy'),
clstats(mat, lab.hcb) %>% mutate(method=meth, strategy='hclust-bottom'),
clstats(mat, lab.ct) %>% mutate(method=meth, strategy='hclust-cutree'))
})
## K-means variation
kmv.l = lapply(c('random', 'maxd', 'mind', 'elki'), function(meth){
lab = kmvar(mat, clsize, method=meth)
clstats(mat, lab) %>% mutate(method=meth, strategy='Kmeans-var')
})
## Nearest neighbors
nn.l = lapply(c('random', 'maxd', 'mind'), function(meth){
lab = nnit(mat, clsize, method=meth)
clstats(mat, lab) %>% mutate(method=meth, strategy='nearest neighbors')
})
cls.df = do.call(rbind, c(hc.l, kmv.l, nn.l))
ggplot(cls.df, aes(x=method, y=winsor(size, 3*clsize))) + geom_boxplot() + theme_bw() +
geom_hline(yintercept=clsize, linetype=2) +
facet_grid(.~strategy, space='free', scales='free') +
ylab(paste0('cluster size (winsorized at ', 3*clsize, ')')) +
theme(axis.text.x=element_text(hjust=1, vjust=1, angle=70)) +
scale_y_continuous(breaks=seq(0,200,10))
We compute the average pairwise distance per cluster and the maximum pairwise distance per cluster.
## Average distance between points in a cluster
ggplot(cls.df, aes(x=method, y=mean.dist)) + geom_boxplot() + theme_bw() +
theme(axis.text.x=element_text(hjust=1, vjust=1, angle=70)) +
facet_grid(.~strategy, space='free', scales='free') + ylab('average pairwise distance')
## Warning: Removed 67 rows containing non-finite values (stat_boxplot).
## Maximum distance between two points in a cluster
ggplot(cls.df, aes(x=method, y=max.dist)) + geom_boxplot() + theme_bw() +
theme(axis.text.x=element_text(hjust=1, vjust=1, angle=70)) +
facet_grid(.~strategy, space='free', scales='free') + ylab('maximum pairwise distance')
## Warning: Removed 67 rows containing non-finite values (stat_boxplot).
Several approaches perform well. Among the methods with cluster size control, hclust-bottom and nearest neighbors look quite good considering that they have the additional constraint of a fixed cluster size. However, we notice a few outlier points with high values. These usually correspond to the last cluster of the iterative process, a kind of left-over cluster gathering many distant points. That is why the nearest-neighbors approach with the maxd rule, which deliberately starts with the outlier points, performs better (narrower distribution and much milder outliers).
A summary bar chart with the median values only:
## Median after filtering for NA and infinite values
medianFilt <- function(x){
median(x[which(!is.infinite(x) & !is.na(x))])
}
## Summary
cls.df %>% select(-lab, -size, -mean.sil, -mean.cl.sil) %>% group_by(method, strategy) %>%
summarize_all(medianFilt) %>% gather(variable, value, -method, -strategy) %>%
mutate(variable=factor(variable, levels=c('mean.dist', 'max.dist'),
labels=c('average pairwise distance',
'maximum pairwise distance'))) %>%
ggplot(aes(x=paste(strategy, method), y=value, fill=strategy)) +
geom_bar(stat='identity') + facet_wrap(~variable, scales='free') + theme_bw() +
theme(axis.text.x=element_text(hjust=1, vjust=1, angle=70)) +
xlab('') + ylab('median per-cluster value')
Another way of estimating cluster quality is the silhouette score. The higher the silhouette score of a point, the better it fits its own cluster rather than a neighboring one. We can look at the mean silhouette score per cluster or overall.
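For reference, the silhouette of a point \(i\) is \(s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\), where \(a(i)\) is the average distance from \(i\) to the other points of its own cluster and \(b(i)\) is the smallest average distance from \(i\) to the points of any other cluster: values close to 1 indicate a well-placed point, values below 0 a point that would fit better in a neighboring cluster.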
## Mean silhouette per cluster
ggplot(cls.df, aes(x=method, y=mean.cl.sil)) +
geom_hline(yintercept=0, linetype=2) +
geom_boxplot() + theme_bw() +
facet_grid(.~strategy, space='free', scales='free') +
theme(axis.text.x=element_text(hjust=1, vjust=1, angle=70)) +
ylab('mean silhouette score per cluster')
## Mean silhouette score
cls.df %>% select(strategy, method, mean.sil) %>% unique %>%
ggplot(aes(x=method, y=mean.sil)) + geom_bar(stat='identity') + theme_bw() +
theme(axis.text.x=element_text(hjust=1, vjust=1, angle=70)) +
facet_grid(.~strategy, space='free', scales='free') + ylab('mean silhouette score')
It looks like nearest neighbors with the maxd rule and hclust-bottom give the best results among the fixed-cluster-size options.
hclst <- hcbottom(mat, clsize, 'ward.D')
avg_loc['hclust'] = as.integer(hclst)
nn <- nnit(mat, clsize, 'maxd')
avg_loc['nn'] = as.integer(nn)
shipments_clustered <- merge(unpicked_shipments, avg_loc %>% select(shipments_id, hclust, nn)) %>%
mutate(hclust = as.factor(hclust),
nn = as.factor(nn))
ggplot(shipments_clustered, aes(x = x, y = y, color = hclust)) + geom_jitter() + scale_color_brewer(palette = 'Paired') + ggtitle("hclust clusters")
ggplot(shipments_clustered, aes(x = x, y = y, color = nn)) + geom_jitter() + scale_color_brewer(palette = 'Paired') + ggtitle("nn clusters")