Download the list of shipments that have yet to be picked and display the Row/Bay location of each shipment.
unpicked_shipments <- fetch("76425d87-f474-44a4-b0af-1bed708006eb")
## Parsed with column specification:
## cols(
## shipments_id = col_double(),
## orders_id = col_double(),
## products_id = col_double(),
## products_name = col_character(),
## products_quantity = col_double(),
## products_weight = col_double(),
## products_location = col_character(),
## Row = col_double(),
## Bay = col_double(),
## Shelf = col_character(),
## Bin = col_character(),
## date_purchased = col_datetime(format = ""),
## orders_status = col_double(),
## orders_status_name = col_character(),
## hold = col_double(),
## x = col_double(),
## y = col_double()
## )
ggplot(unpicked_shipments, aes(x = x, y = y)) + geom_point()
Group multi-SKU shipments into a single average SKU location per shipment.
avg_loc <- unpicked_shipments %>%
group_by(shipments_id) %>%
summarise(x = mean(x),
y = mean(y),
qty = n()) %>%
ungroup()
## Matrix of average (x, y) locations, one row per shipment
mat = cbind(avg_loc$x, avg_loc$y)
mat %>% as.data.frame %>% ggplot(aes(V1, V2)) + geom_point(alpha=.5) + theme_bw()
library(ggplot2)
library(dplyr)
library(tidyr)
library(cluster)
## Winsorize a vector: cap values above u
## e.g. winsor(c(3, 18, 250), 60) returns 3 18 60
winsor <- function(x, u){
  if(any(x>u)) x[x>u] = u
  x
}
I would like to cluster points into groups of similar size. For example, I would like to group 1000 points into clusters of around 20 points each. The two aspects that are important here are: the clusters should all have roughly the target size, and the points within a cluster should be close to one another (small within-cluster distances).
In addition to the typical hierarchical clustering approach, I will test the following iterative approaches: an iterative dichotomy that keeps splitting oversized clusters in two, a nearest-neighbors approach that grows one cluster at a time around a seed point, a K-means variant with capped cluster sizes, and an iterative "hclust-bottom" approach that repeatedly cuts a cluster from the bottom of a hierarchical clustering tree.
In the following \(s\) is the target cluster size.
Starting with one cluster containing all the points, the largest cluster is split in two if it is larger than \(1.5 \cdot s\). The process stops when all clusters are smaller than \(1.5 \cdot s\).
The points are split in two using hierarchical clustering. I will try different linkage criteria. My guess is that the Ward criterion will be good at this because it tends to produce balanced dendrograms.
hclustit <- function(mat, clsize = 10, method='ward.D', split.th=1.5){
  ## start with a single cluster containing all the points
  lab = rep('l', nrow(mat))
  lab.size = table(lab)
  ## split the largest cluster while any cluster exceeds clsize * split.th
  while(any(lab.size>clsize*split.th)){
    lab.ii = which(lab == names(lab.size)[which.max(lab.size)])
    mmat = mat[lab.ii,]
    hc.o = hclust(dist(mmat, method = 'manhattan'), method=method)
    ## cut the dendrogram of this cluster into two sub-clusters
    lab[lab.ii] = paste0(lab[lab.ii], '-', cutree(hc.o, 2))
    lab.size = table(lab)
  }
  lab
}
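A quick sanity check of the size guarantee, on hypothetical simulated points (toy data, not the shipments):
## Hypothetical check: with the default split.th = 1.5 and clsize = 10,
## no cluster returned by hclustit should exceed 15 points
set.seed(1)
toy = matrix(rnorm(200), ncol = 2)
toy.lab = hclustit(toy, clsize = 10, method = 'ward.D')
max(table(toy.lab)) <= 15
## expected TRUE by construction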
While there are more than \(s\) unassigned points, a point is selected among the unassigned points and grouped with its \(s-1\) nearest unassigned neighbors to form a new cluster.
If the total number of points is not a multiple of \(s\), the remaining points could be either assigned to their own clusters or to an existing cluster. Actually, we completely control the cluster sizes here so we could fix the size of some clusters to \(s+1\) beforehand to avoid leftovers and ensure balanced sizes.
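The nearest-neighbors and hclust-bottom functions below precompute near-equal target sizes with a small cut/rle trick; a minimal sketch with made-up numbers:
## Hypothetical example: target sizes for 47 points and a target size of 10
## (the same trick is used inside nnit and hcbottom below)
rle(as.numeric(cut(1:47, ceiling(47/10))))$lengths
## 10 9 9 9 10: sizes differ by at most one point and there is no left-over cluster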
In the first step, a point is selected. I'll start by choosing a point randomly (out of the unassigned points). Alternatively, I could try picking points that have close neighbors, or the opposite, points that are far from the others. I'll use the mean distance between a point and the other unassigned points to pick the seed: either the point closest to the others (mind) or the point farthest from the others (maxd).
nnit <- function(mat, clsize = 10, method=c('random','maxd', 'mind')){
  ## precompute near-equal target cluster sizes (they differ by at most one point)
  clsize.rle = rle(as.numeric(cut(1:nrow(mat), ceiling(nrow(mat)/clsize))))
  clsize = clsize.rle$lengths
  lab = rep(NA, nrow(mat))
  dmat = as.matrix(dist(mat))
  cpt = 1
  while(sum(is.na(lab)) > 0){
    lab.ii = which(is.na(lab))
    dmat.m = dmat[lab.ii, lab.ii, drop=FALSE]
    ## pick a seed among the unassigned points
    if(method[1]=='random'){
      ii = sample.int(nrow(dmat.m),1)
    } else if(method[1]=='maxd'){
      ii = which.max(rowSums(dmat.m))
    } else if(method[1]=='mind'){
      ii = which.min(rowSums(dmat.m))
    } else {
      stop('unknown method')
    }
    ## the seed and its nearest unassigned neighbors form the next cluster
    lab.m = rep(NA, length(lab.ii))
    lab.m[head(order(dmat.m[ii,]), clsize[cpt])] = cpt
    lab[lab.ii] = lab.m
    cpt = cpt + 1
  }
  if(any(is.na(lab))){
    lab[which(is.na(lab))] = cpt
  }
  lab
}
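Again a quick hypothetical check on simulated points, to verify the fixed-size behavior:
## Hypothetical check: with 100 simulated points and clsize = 10,
## every cluster built by nnit should contain exactly 10 points
set.seed(1)
toy = matrix(rnorm(200), ncol = 2)
table(nnit(toy, clsize = 10, method = 'maxd'))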
As explained in a few pages online (e.g. the ELKI tutorial on same-size k-means, which the elki method below follows), one approach consists of using K-means to derive centers and then assigning the same number of points to each center/cluster.
In the proposed approach the points are ordered by their distance to the closest center minus their distance to the farthest center. Each point is assigned, in this order, to its closest cluster that still has room; if the closest cluster is full, the second closest is chosen, and so on.
I'll also try ordering the points by the distance to the closest center, by the distance to the farthest center, or using a random order.
kmvar <- function(mat, clsize=10, method=c('random','maxd', 'mind', 'elki')){
  k = ceiling(nrow(mat)/clsize)
  km.o = kmeans(mat, k)
  labs = rep(NA, nrow(mat))
  ## Euclidean distance of each point to each center
  centd = lapply(1:k, function(kk){
    euc = t(mat)-km.o$centers[kk,]
    sqrt(apply(euc, 2, function(x) sum(x^2)))
  })
  centd = matrix(unlist(centd), ncol=k)
  clsizes = rep(0, k)
  ## order in which the points are assigned
  if(method[1]=='random'){
    ptord = sample.int(nrow(mat))
  } else if(method[1]=='elki'){
    ptord = order(apply(centd, 1, min) - apply(centd, 1, max))
  } else if(method[1]=='maxd'){
    ptord = order(-apply(centd, 1, max))
  } else if(method[1]=='mind'){
    ptord = order(apply(centd, 1, min))
  } else {
    stop('unknown method')
  }
  for(ii in ptord){
    ## assign the point to the closest center that still has room
    bestcl = which.min(centd[ii,])
    labs[ii] = bestcl
    clsizes[bestcl] = clsizes[bestcl] + 1
    if(clsizes[bestcl] >= clsize){
      ## cluster is full: remove it from consideration
      centd[,bestcl] = NA
    }
  }
  return(labs)
}
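A quick hypothetical check of the capacity cap, on simulated points:
## Hypothetical check: with 100 simulated points and clsize = 10, kmvar builds
## ceiling(100/10) = 10 clusters and none of them can exceed 10 points
set.seed(1)
toy = matrix(rnorm(200), ncol = 2)
table(kmvar(toy, clsize = 10, method = 'elki'))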
While there are more than \(s\) unassigned points, the unassigned points are clustered hierarchically, the tree is cut as close to the leaves as possible while still producing a sub-cluster with at least \(s\) points, and \(s\) points from that sub-cluster form a new cluster.
Instead of working at the level of individual points, the idea is to find the best cluster at each step. The hierarchical clustering integrates information across all the (still unassigned) points, which might be more robust than ad hoc rules (e.g. the nearest-neighbors approach).
hcbottom <- function(mat, clsize = 10, method='ward.D'){
  dmat = as.matrix(dist(mat, method = "manhattan"))
  ## precompute near-equal target cluster sizes
  clsize.rle = rle(as.numeric(cut(1:nrow(mat), ceiling(nrow(mat)/clsize))))
  clsizes = clsize.rle$lengths
  cpt = 1
  lab = rep(NA, nrow(mat))
  for(clss in clsizes[-1]){
    lab.ii = which(is.na(lab))
    hc.o = hclust(as.dist(dmat[lab.ii, lab.ii]), method=method)
    clt = 0
    ct = length(lab.ii)-clss
    ## merge sub-clusters (coarser cut) until one holds at least clss points
    while(max(clt)<clss){
      cls = cutree(hc.o, ct)
      clt = table(cls)
      ct = ct - 1
    }
    ## assign clss points from the largest sub-cluster to a new cluster
    cl.sel = which(cls == as.numeric(names(clt)[which.max(clt)]))
    lab[lab.ii[head(cl.sel, clss)]] = cpt
    cpt = cpt + 1
  }
  ## the remaining points form the last cluster
  lab[which(is.na(lab))] = cpt
  lab
}
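And the same kind of hypothetical check for the bottom-up cutting approach:
## Hypothetical check: all clusters, including the left-over last one here,
## should contain exactly 10 of the 100 simulated points
set.seed(1)
toy = matrix(rnorm(200), ncol = 2)
table(hcbottom(toy, clsize = 10, method = 'ward.D'))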
Let’s aim for clusters of around \(s=20\) points.
N = nrow(mat)  ## number of shipments (points) to cluster
clsize = 20
## Cluster statistics
clstats <- function(mat, cls){
  dmat = as.matrix(dist(mat, method = 'manhattan'))
  sk <- silhouette(as.numeric(factor(cls)), dmatrix=dmat)
  mean.sk = mean(sk[,3])
  ll = lapply(unique(cls), function(cl){
    cl.ii = which(cls==cl)
    ## for singleton clusters the pairwise distances are empty, giving -Inf/NaN
    ## (hence the "non-finite values" warnings in the plots below)
    d = as.dist(dmat[cl.ii, cl.ii])
    tibble(lab=cl, max.dist=max(d), mean.dist=mean(d), size=length(cl.ii), mean.cl.sil=mean(sk[cl.ii, 3]), mean.sil=mean.sk)
  })
  do.call(rbind, ll)
}
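A quick look at what these per-cluster statistics contain, for one labelling (hypothetical usage shown here with hclust-bottom labels; any of the functions above would do):
## Hypothetical usage: one row per cluster with its size, mean/max pairwise distance
## and silhouette summaries
head(clstats(mat, hcbottom(mat, clsize, 'ward.D')))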
## Hierarchical clustering based methods
hc.l = lapply(c('average', 'complete', 'ward.D', 'single'), function(meth){
## Standard hierarchical clustering, cut into about N/clsize clusters
lab.ct = hclust(dist(mat, method = 'manhattan'), method=meth) %>% cutree(ceiling(N/clsize))
## Iterative dichotomy
lab.it = hclustit(mat, clsize, meth, split.th=1.5)
## Iterative "bottom-leaves" clustering (hclust-bottom)
lab.hcb = hcbottom(mat, clsize, meth)
rbind(clstats(mat, lab.it) %>% mutate(method=meth, strategy='dichotomy'),
clstats(mat, lab.hcb) %>% mutate(method=meth, strategy='hclust-bottom'),
clstats(mat, lab.ct) %>% mutate(method=meth, strategy='hclust-cutree'))
})
## K-means variation
kmv.l = lapply(c('random', 'maxd', 'mind', 'elki'), function(meth){
lab = kmvar(mat, clsize, method=meth)
clstats(mat, lab) %>% mutate(method=meth, strategy='Kmeans-var')
})
## Nearest neighbors
nn.l = lapply(c('random', 'maxd', 'mind'), function(meth){
lab = nnit(mat, clsize, method=meth)
clstats(mat, lab) %>% mutate(method=meth, strategy='nearest neighbors')
})
cls.df = do.call(rbind, c(hc.l, kmv.l, nn.l))
ggplot(cls.df, aes(x=method, y=winsor(size, 3*clsize))) + geom_boxplot() + theme_bw() +
geom_hline(yintercept=clsize, linetype=2) +
facet_grid(.~strategy, space='free', scales='free') +
ylab(paste0('cluster size (winsorized at ', 3*clsize, ')')) +
theme(axis.text.x=element_text(hjust=1, vjust=1, angle=70)) +
scale_y_continuous(breaks=seq(0,200,10))
We compute the average pairwise distance per cluster and the maximum pairwise distance per cluster.
## Average distance between points in a cluster
ggplot(cls.df, aes(x=method, y=mean.dist)) + geom_boxplot() + theme_bw() +
theme(axis.text.x=element_text(hjust=1, vjust=1, angle=70)) +
facet_grid(.~strategy, space='free', scales='free') + ylab('average pairwise distance')
## Warning: Removed 67 rows containing non-finite values (stat_boxplot).
## Maximum distance between two points in a cluster
ggplot(cls.df, aes(x=method, y=max.dist)) + geom_boxplot() + theme_bw() +
theme(axis.text.x=element_text(hjust=1, vjust=1, angle=70)) +
facet_grid(.~strategy, space='free', scales='free') + ylab('maximum pairwise distance')
## Warning: Removed 67 rows containing non-finite values (stat_boxplot).
Several approaches perform well. Among the methods with cluster size control, hclust-bottom and nearest neighbors look quite good considering that they have the additional constraint of a fixed cluster size. However, we notice a few outlier points with high values. These usually correspond to the last cluster of the iterative process, a kind of left-over cluster gathering many distant points. That is why the nearest-neighbors approach with the maxd rule, which deliberately starts with the outlier points, performs better (narrower distribution and much milder outliers).
A summary bar chart with the median values only:
## Median after filtering for NA and infinite values
medianFilt <- function(x){
median(x[which(!is.infinite(x) & !is.na(x))])
}
## Summary
cls.df %>% select(-lab, -size, -mean.sil, -mean.cl.sil) %>% group_by(method, strategy) %>%
summarize_all(medianFilt) %>% gather(variable, value, -method, -strategy) %>%
mutate(variable=factor(variable, levels=c('mean.dist', 'max.dist'),
labels=c('average pairwise distance',
'maximum pairwise distance'))) %>%
ggplot(aes(x=paste(strategy, method), y=value, fill=strategy)) +
geom_bar(stat='identity') + facet_wrap(~variable, scales='free') + theme_bw() +
theme(axis.text.x=element_text(hjust=1, vjust=1, angle=70)) +
xlab('') + ylab('median per-cluster value')
Another way of estimating cluster quality is the silhouette score. The higher the silhouette score of a point, the better it fits its own cluster rather than a neighboring one. We can look at the mean silhouette score per cluster or overall.
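For reference, the silhouette of a point \(i\) is \(s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\), where \(a(i)\) is the average distance from \(i\) to the other points of its own cluster and \(b(i)\) is the smallest average distance from \(i\) to the points of any other cluster: values close to 1 indicate a well-placed point, values below 0 a point that would fit better in a neighboring cluster.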
## Mean silhouette per cluster
ggplot(cls.df, aes(x=method, y=mean.cl.sil)) +
geom_hline(yintercept=0, linetype=2) +
geom_boxplot() + theme_bw() +
facet_grid(.~strategy, space='free', scales='free') +
theme(axis.text.x=element_text(hjust=1, vjust=1, angle=70)) +
ylab('mean silhouette score per cluster')
## Mean silhouette score
cls.df %>% select(strategy, method, mean.sil) %>% unique %>%
ggplot(aes(x=method, y=mean.sil)) + geom_bar(stat='identity') + theme_bw() +
theme(axis.text.x=element_text(hjust=1, vjust=1, angle=70)) +
facet_grid(.~strategy, space='free', scales='free') + ylab('mean silhouette score')
It looks like nearest neighbors with the maxd rule and hclust-bottom give the best results among the fixed-cluster-size options.
hclst <- hcbottom(mat, clsize, 'ward.D')
avg_loc['hclust'] = as.integer(hclst)
nn <- nnit(mat, clsize, 'maxd')
avg_loc['nn'] = as.integer(nn)
shipments_clustered <- merge(unpicked_shipments, avg_loc %>% select(shipments_id, hclust, nn)) %>%
mutate(hclust = as.factor(hclust),
nn = as.factor(nn))
ggplot(shipments_clustered, aes(x = x, y = y, color = hclust)) + geom_jitter() + scale_color_brewer(palette = 'Paired') + ggtitle("hclust clusters")
ggplot(shipments_clustered, aes(x = x, y = y, color = nn)) + geom_jitter() + scale_color_brewer(palette = 'Paired') + ggtitle("nn clusters")