August 22, 2017

Introduction

  • Join estimation is traditionally a cornerstone of query optimization.
  • In general case, precise join estimation is hard. A one-to-many subset of join graphs allows more precise estimation without loss of generality.
  • From probabilistic perspective, join graph is a random graph conditional on information available to us.
  • The task is to infer the statistical parameters of join graph distribution and use it to estimate the size of join.

Background

  • The dominant approach to find join size is histogram approach.
  • It suffers from high complexity and large storage costs. Several approaches partially solve this problem (Ioannidis and Poosala (1995), Poosala, Haas, Ioannidis, and Shekita (1996))
  • Another option is to sample and estimate join on smaller subset of tuples.
  • Statistically, sample must be representative to have properties of true population. To satisfy this, the sample must be large, making sampling meaningless.
  • There are several approaches that make sampling more efficient (Lipton, Naughton, and Schneider (1990), Haas and Swami (1995)).
  • We use random graph approach that was first described by Erdos and Renyi (1960). Bipartite graph case was studied in Newman, Strogatz, and Watts (2001).

Join Graph

  • Consider an arbitrary set of relations \(R_1,\, R_2, \dots,\, R_n\) that have one-to-many relationship model.
  • Let us denote \(K_i\) as the size of the \(R_i\), and \(d^r_i,\, d^l_i\) as the amount of dangling tuples from joint attribute on the right ("one") and the left ("many") side of join relationship.
  • Let us form a join graph where node represents an item in the relation \(R_i\). The edge indicates the nodes share joint attribute.
  • Given the join graph, we estimate the size of a join as a number of nodes in \(R_n\) that that are connected to a node in \(R_1\).

Join graph

Random Bipartite Graph

  • Consider an arbitrary bipartite graph \(G(K_1,\, K_2)\), where \(K_1\) and \(K_2\) are nodes of separate components.
  • Suppose each pair of nodes \(x_i\in K_1,\, x_j\in K_2\) has an edge with probability \(p_k\).
  • If \(\forall \, \, p_i=p_j=p\), then the graph's edges distributed binomially (or Poisson if \(K_1+K_2\) is large).
  • To find probability, we need to find function \(G:\,\{(K_i,d_i^r,d_i^l)\}\rightarrow \{p_1,\,p_2,\dots,\, p_m\}\).

Case of Join Graph

  • In join graph, consider the edges are generated according to probability.
  • \[ p_k=\begin{cases} \frac{K_i-d^r_1}{K_i}\text{ , for k=1,}\\ p\text{ , for k>1} \end{cases} \]
  • To find \(p\), we find expected value of edges from two sides of the join.
  • \[ E(I(\{R_i,R_j\}))=(K_i-d_i^r)\sum\limits_{i=1}^{K_2}p(1-\frac{K_i-1}{2}p)=\\ =p(1-\frac{K_i-1}{2}p)K_2 \]

Case of Join Graph cont.

  • Solving the equation, we can obtain estimate of probability.
  • \(\hat p=\frac{(K_i-1)^{-1}+\sqrt{(K_i-1)^{-2}-\frac{4(K_j-d_j^l)}{K_j(K_i-d_i^r)}}}{2}\).
  • We can consider all adjacent pairs of the relations to find their generating probability.
  • Applying procedure iteratively over adjacent relations we can get set of \(N-1\) probabilities.

Join Estimation

  • Join estimation is equivalent of finding number of connected nodes from the last relation to the first in the random graph.
  • The reverse probability that a node would be connected to the preceding relation is \(\sum\limits_{i=0}^{S_{N-1}-1} p_{N-1}(1-p_{N-1})^i\).
  • Thus, the final join size becomes
  • \[ |R_1\Join R_2 \Join \dots \Join R_N|=K_n\prod\limits_{j=1}^{N-1}\sum\limits_{i=0}^{S_j-1} p_j(1-p_j)^i \]

Noise sensitivity

Scatterplot comparison

Thank you