One-To-Many Join Size: A Random Graph Approach

August 22, 2017

Introduction

Join estimation is traditionally a cornerstone of query optimization.
In general case, precise join estimation is hard. A one-to-many subset of join graphs allows more precise estimation without loss of generality.
From probabilistic perspective, join graph is a random graph conditional on information available to us.
The task is to infer the statistical parameters of join graph distribution and use it to estimate the size of join.

The dominant approach to find join size is histogram approach.
It suffers from high complexity and large storage costs. Several approaches partially solve this problem (Ioannidis and Poosala (1995), Poosala, Haas, Ioannidis, and Shekita (1996))
Another option is to sample and estimate join on smaller subset of tuples.
Statistically, sample must be representative to have properties of true population. To satisfy this, the sample must be large, making sampling meaningless.
There are several approaches that make sampling more efficient (Lipton, Naughton, and Schneider (1990), Haas and Swami (1995)).
We use random graph approach that was first described by Erdos and Renyi (1960). Bipartite graph case was studied in Newman, Strogatz, and Watts (2001).

Consider an arbitrary set of relations \(R_1,\, R_2, \dots,\, R_n\) that have one-to-many relationship model.
Let us denote \(K_i\) as the size of the \(R_i\), and \(d^r_i,\, d^l_i\) as the amount of dangling tuples from joint attribute on the right ("one") and the left ("many") side of join relationship.
Let us form a join graph where node represents an item in the relation \(R_i\). The edge indicates the nodes share joint attribute.
Given the join graph, we estimate the size of a join as a number of nodes in \(R_n\) that that are connected to a node in \(R_1\).

Consider an arbitrary bipartite graph \(G(K_1,\, K_2)\), where \(K_1\) and \(K_2\) are nodes of separate components.
Suppose each pair of nodes \(x_i\in K_1,\, x_j\in K_2\) has an edge with probability \(p_k\).
If \(\forall \, \, p_i=p_j=p\), then the graph's edges distributed binomially (or Poisson if \(K_1+K_2\) is large).
To find probability, we need to find function \(G:\,\{(K_i,d_i^r,d_i^l)\}\rightarrow \{p_1,\,p_2,\dots,\, p_m\}\).

In join graph, consider the edges are generated according to probability.
\[ p_k=\begin{cases} \frac{K_i-d^r_1}{K_i}\text{ , for k=1,}\\ p\text{ , for k>1} \end{cases} \]
To find \(p\), we find expected value of edges from two sides of the join.
\[ E(I(\{R_i,R_j\}))=(K_i-d_i^r)\sum\limits_{i=1}^{K_2}p(1-\frac{K_i-1}{2}p)=\\ =p(1-\frac{K_i-1}{2}p)K_2 \]

Solving the equation, we can obtain estimate of probability.
\(\hat p=\frac{(K_i-1)^{-1}+\sqrt{(K_i-1)^{-2}-\frac{4(K_j-d_j^l)}{K_j(K_i-d_i^r)}}}{2}\).
We can consider all adjacent pairs of the relations to find their generating probability.
Applying procedure iteratively over adjacent relations we can get set of \(N-1\) probabilities.

Join estimation is equivalent of finding number of connected nodes from the last relation to the first in the random graph.
The reverse probability that a node would be connected to the preceding relation is \(\sum\limits_{i=0}^{S_{N-1}-1} p_{N-1}(1-p_{N-1})^i\).
Thus, the final join size becomes
\[ |R_1\Join R_2 \Join \dots \Join R_N|=K_n\prod\limits_{j=1}^{N-1}\sum\limits_{i=0}^{S_j-1} p_j(1-p_j)^i \]