Problem setting

In this task from “Algorithms: Design and Analysis, Part 2” we need to perform clustering on set of points \(S\). It has 200000 nodes, each is represented by sequence of 24 bits.

The format of the file where data is stored is as follows:
[# of nodes] [# of bits for each node’s label]
[first bit of node 1] … [last bit of node 1] [first bit of node 2] … [last bit of node 2]

For example, the third line of the file “0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1” denotes the 24 bits associated with node #2.

The distance between two nodes \(u\) and \(v\) in this problem is defined as the Hamming distance — the number of differing bits — between the two nodes’ labels. For example, the Hamming distance between the 24-bit label of node #2 above and the label “0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1” is 3 (since they differ in the 3rd, 7th, and 21st bits).

The question is: what is the largest value of \(k\) such that there is a \(k\)-clustering with spacing at least 3? That is, how many clusters are needed to ensure that no pair of nodes with all but 2 bits in common get split into different clusters?

Solution

Since number of points is quite big, we cannot write out the graph with all the distances between all the points explicitly. It will be \(200000*199999/2=19999900000\) edges, which is too much. We will use another approach instead.

What we need to do in this task is put two points in one cluster iff the distance between them is 1 or 2. Let’s call such points neighbors. How can we find all of the neighbors for a given point \(u\)? Let’s think of every point as 24-bit representation of some integer number. Then neighbors of \(u\) are such points that differ from \(u\) in no more than 2 bytes. It means that \(x\) is a neighbor of \(u\) iff \(x XOR u\) is a number that has only 1 or 2 ones in the binary representation. We will call such numbers simple. Total number of simple numbers is \({24 \choose 2}+{24 \choose 1} = 300\).

We will use the following property of \(XOR\): if \(a\) \(XOR\) \(b = c\) than \(a\) \(XOR\) \(c = b\). It means that if we \(XOR\) \(u\) by all possible simple numbers we will get all theoretically possible neighbors of \(u\). If we leave only those numbers that lie in \(S\) we will have the neighbors of \(u\) in \(S\).

To store the data we will use array points of \(2^{24}\) integers. Initially we fill it with zeroes. While reading next line from the file we will compute corresponding integer curindex, and write into points[curindex] next integer to denote initial division into clusters where each point is its own cluster. We will also save numbers corresponding to every point in \(S\) indo array index for future purposes.

How do we compute the whole cluster for \(u\)? We compute neighbors, assign them to the \(u\)-th cluster, then compute neighbors for every neighbor of \(u\), leave only points that are not yet assigned to \(u\)-th cluster, assign them to the \(u\)-th cluster, then compute neighbors for every neighbor of neighbor of \(u\), leave only new points, assign them to \(u\)-th cluster and so on, until on some step we won’t get any new points for \(u\)-th cluster.

To perform complete clustering we will go through all of our points and if the next point does not yet belong to some previous full cluster we will find full cluster for that point. We will save leaders of the clusters into name array and in the end the length of this array (minus 1 for the initial zero) will give us the answer.

#download file from internet
import urllib

testfile = urllib.URLopener()
testfile.retrieve("http://spark-public.s3.amazonaws.com/algo2/datasets/clustering_big.txt", "clustering_big.txt")

#computing simple numbers
hamm=[0]*300
k=0
for i in range(24):
    for j in range(i,24):
        mask=1 << i
        mask2 = 1 << j
        hamm[k]=mask|mask2
        k=k+1

#function that computes neighbors for the given point
def neighbors(x):
    result = []
    for i in hamm:
        if points[x^i] !=0:
            result.append(x^i)
    return result

#reading the data into arrays points and index
points = [0]*pow(2,24)
index=[]
from timeit import default_timer as timer
start=timer()

with open("clustering_big.txt") as f:
    next(f)
    cluster=1
    for line in f:
        curindex=int(line.replace(' ',''), base = 2)
        index.append(curindex)
        points[curindex]=cluster
        cluster=cluster+1

readend=timer()
print("reading data:",readend-start)

#performing clustering
names=[0]
loopstart=timer()
for i in index:
    if points[i] in names:
        continue
    nclust=[i]
    while len(nclust)!=0:
        nnclust=[]
        for dot in nclust:
            for ind in neighbors(dot):
                if points[ind]!=points[i]:
                    nnclust.append(ind)
                    points[ind]=points[i]
        nclust=nnclust
    names.append(points[i])
end=timer()
print("loop time", end-loopstart)
print("total time", end-start)
print(len(names)-1)
## ('reading data:', 0.4645240306854248)
## ('loop time', 10.185559034347534)
## ('total time', 10.650142908096313)
## 6118