Using Cluster Analysis for Random Assignment

One problem researchers face when wanting to randomize clusters (e.g. schools, clinics) to two or more interventions is the fact that randomization does not always produce equivalent groups. Techniques such as matched and blocking randomization which can match clusters on characteristics prior to randomization.

In this example, I provide a framework for how to use cluster analysis that allows for the inclusion of several different kinds of variables (e.g. continuous, ordinal, binary).

First, we need to library the packages that we will use in this example.

library(randomizeR)

## Loading required package: ggplot2

## Loading required package: plotrix

library(cluster)
library(factoextra)

## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ

For this example, let us assume we a have smoking cessation program that we want to test in a large community mental health and behavior setting called Star Providers. Star Providers is located in several states with many clinics inside those states.

Let us assume we have two treatment and one control conditions. One approach we could take is to randomly assign the clinics, instead of individuals to the different interventions.

However, as we stated before it could be the case that the randomization does not work as we would hope and certain clinics with similar variables (e.g. race, gender) could get lumped together decreasing our study’s validity.

Therefore, we could match them on several clinic level characteristics. Let us assume that we have 12 clinics and we have aggregate data (i.e. percentages) for gender and race and we also have a binary indicator that indicates whether a clinic has a no smoking policy.

dat = data.frame( race = c(.5, .4 ,.3, .2, .1, .15, .34, .78, .9, .2, .2, .3), gender = c(.45, .32, .21, .12, .9, .8, .7,.5, .45, .23, .2, .3), no_smoking = c(rep(1, 6), rep(0,6)))
dat

##    race gender no_smoking
## 1  0.50   0.45          1
## 2  0.40   0.32          1
## 3  0.30   0.21          1
## 4  0.20   0.12          1
## 5  0.10   0.90          1
## 6  0.15   0.80          1
## 7  0.34   0.70          0
## 8  0.78   0.50          0
## 9  0.90   0.45          0
## 10 0.20   0.23          0
## 11 0.20   0.20          0
## 12 0.30   0.30          0

With a stratification design, using more than a few variables, particularly when we have a limited number of, clinics in this case, deepens the problem of dimensionality: https://en.wikipedia.org/wiki/Curse_of_dimensionality

Therefore, one approach is to create a distance matrix that can be used to cluster clinics that are similar. However, many approaches such as Euclidean measure require the data be standardized before distance is calculated to put all variables on the same scale. Therefore, researchers are limited in the types of variables that they can include since only continuous variables can be standardized.

However, the Gower coefficient can be used to calculate the distance between nominal, binary, and continuous variables: https://www.rdocumentation.org/packages/cluster/versions/2.0.7-1/topics/daisy

gower_dis = daisy(dat, metric = "gower")

## Warning in daisy(dat, metric = "gower"): binary variable(s) 3 treated as
## interval scaled

gower_dis

## Dissimilarities :
##             1          2          3          4          5          6
## 2  0.09722222                                                       
## 3  0.18589744 0.08867521                                            
## 4  0.26602564 0.16880342 0.08012821                                 
## 5  0.35897436 0.37286325 0.37820513 0.37500000                      
## 6  0.29540598 0.30929487 0.31463675 0.31143162 0.06356838           
## 7  0.50683761 0.52072650 0.55940171 0.63952991 0.51880342 0.45523504
## 8  0.47136752 0.56858974 0.65726496 0.73739316 0.78760684 0.72403846
## 9  0.50000000 0.59722222 0.68589744 0.76602564 0.85897436 0.79540598
## 10 0.55235043 0.45512821 0.38354701 0.38034188 0.66132479 0.59775641
## 11 0.56517094 0.46794872 0.37927350 0.36752137 0.67414530 0.61057692
## 12 0.48076923 0.38354701 0.37179487 0.45192308 0.67307692 0.60950855
##             7          8          9         10         11
## 2                                                        
## 3                                                        
## 4                                                        
## 5                                                        
## 6                                                        
## 7                                                        
## 8  0.26880342                                            
## 9  0.34017094 0.07136752                                 
## 10 0.25918803 0.35705128 0.38568376                      
## 11 0.27200855 0.36987179 0.39850427 0.01282051           
## 12 0.18760684 0.28547009 0.31410256 0.07158120 0.08440171
## 
## Metric :  mixed ;  Types = I, I, I 
## Number of objects : 12

With a distance matrix, we can then use standard cluster analysis techniques to identify which clinics are the most similar. I refer the readers to another one of my blog posts for more information about cluster analysis: http://rpubs.com/mhanauer/461566

anges_dat = agnes(x = gower_dis, diss = TRUE, method = "ward")
hcTree = cutree(anges_dat, k=3)
dat$hcTree = hcTree
cluster_map = fviz_cluster(list(data =dat, cluster = hcTree, repel = TRUE))
cluster_map

Once you know which clinics are in which clusters you can then use the crPar function in the randomizeR package, to randomize the clinics to the three interventions (treatment 1 = A, treatment 2 = B, and control = C).

For example, cluster one has four clinics in it and we want to randomly assign them to the three intervention options.

Now we have a sequence of randomly assigned interventions within cluster one. We could just assign these interventions in numerical order so…

Clinic 1: A (treatment 1) Clinic 2: B (treatment 2) Clinic 3: C (control) Clinic 4: B (treatment 2)

set.seed(12345)
cluster1_crPar = crPar(N = 4, K = 3)
cluster1_crPar

## 
## Object of class "crPar"
## 
## design = CR 
## N = 4 
## K = 3 
## groups = A B C

ran_assign_clus1 = genSeq(cluster1_crPar)
ran_assign_clus1

## 
## Object of class "rCrSeq"
## 
## design = CR 
## seed = 1548129329 
## N = 4 
## K = 3 
## groups = A B C 
## 
## The sequence M: 
## 
## 1 A B C B

Limitations: With a small number of clusters as is the case here, it could be that a cluster does not contain enough, clinics in this case, to randomly assign at least one clinic to each of the interventions; however, this is a problem that is present with blocking random assignment as well.