August 22, 2017
Introduction
- Join estimation is traditionally a cornerstone of query optimization.
- In general case, precise join estimation is hard. A one-to-many subset of join graphs allows more precise estimation without loss of generality.
- From probabilistic perspective, join graph is a random graph conditional on information available to us.
- The task is to infer the statistical parameters of join graph distribution and use it to estimate the size of join.
Background
- The dominant approach to find join size is histogram approach.
- It suffers from high complexity and large storage costs. Several approaches partially solve this problem (Ioannidis and Poosala (1995), Poosala, Haas, Ioannidis, and Shekita (1996))
- Another option is to sample and estimate join on smaller subset of tuples.
- Statistically, sample must be representative to have properties of true population. To satisfy this, the sample must be large, making sampling meaningless.
- There are several approaches that make sampling more efficient (Lipton, Naughton, and Schneider (1990), Haas and Swami (1995)).
- We use random graph approach that was first described by Erdos and Renyi (1960). Bipartite graph case was studied in Newman, Strogatz, and Watts (2001).
Join Graph
- Consider an arbitrary set of relations \(R_1,\, R_2, \dots,\, R_n\) that have one-to-many relationship model.
- Let us denote \(K_i\) as the size of the \(R_i\), and \(d^r_i,\, d^l_i\) as the amount of dangling tuples from joint attribute on the right ("one") and the left ("many") side of join relationship.
- Let us form a join graph where node represents an item in the relation \(R_i\). The edge indicates the nodes share joint attribute.
- Given the join graph, we estimate the size of a join as a number of nodes in \(R_n\) that that are connected to a node in \(R_1\).
Join graph
Random Bipartite Graph
- Consider an arbitrary bipartite graph \(G(K_1,\, K_2)\), where \(K_1\) and \(K_2\) are nodes of separate components.
- Suppose each pair of nodes \(x_i\in K_1,\, x_j\in K_2\) has an edge with probability \(p_k\).
- If \(\forall \, \, p_i=p_j=p\), then the graph's edges distributed binomially (or Poisson if \(K_1+K_2\) is large).
- To find probability, we need to find function \(G:\,\{(K_i,d_i^r,d_i^l)\}\rightarrow \{p_1,\,p_2,\dots,\, p_m\}\).
Case of Join Graph
- In join graph, consider the edges are generated according to probability.
- \[
p_k=\begin{cases}
\frac{K_i-d^r_1}{K_i}\text{ , for k=1,}\\
p\text{ , for k>1}
\end{cases}
\]
- To find \(p\), we find expected value of edges from two sides of the join.
- \[
E(I(\{R_i,R_j\}))=(K_i-d_i^r)\sum\limits_{i=1}^{K_2}p(1-\frac{K_i-1}{2}p)=\\ =p(1-\frac{K_i-1}{2}p)K_2
\]
Case of Join Graph cont.
- Solving the equation, we can obtain estimate of probability.
- \(\hat p=\frac{(K_i-1)^{-1}+\sqrt{(K_i-1)^{-2}-\frac{4(K_j-d_j^l)}{K_j(K_i-d_i^r)}}}{2}\).
- We can consider all adjacent pairs of the relations to find their generating probability.
- Applying procedure iteratively over adjacent relations we can get set of \(N-1\) probabilities.
Join Estimation
- Join estimation is equivalent of finding number of connected nodes from the last relation to the first in the random graph.
- The reverse probability that a node would be connected to the preceding relation is \(\sum\limits_{i=0}^{S_{N-1}-1} p_{N-1}(1-p_{N-1})^i\).
- Thus, the final join size becomes
- \[
|R_1\Join R_2 \Join \dots \Join R_N|=K_n\prod\limits_{j=1}^{N-1}\sum\limits_{i=0}^{S_j-1} p_j(1-p_j)^i
\]
Noise sensitivity
Scatterplot comparison
Thank you