Likelihood Parameter and Structure Learning

Jake

06/12/2022

Maximum Likelihood Parameter Learning

  • With complete data we count frequencies
    • This is the maximum likelihood estimate
    • This is also the estimate that minimises the KL divergence between the empirical distribution \(P_D\) and the learned distribution

\[\theta^{ml}_{x|u} = P_D(x|u) = \frac{D\#(x,u)}{D\#(u)}\]
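  • As a minimal illustration of this counting estimate, a Python sketch for a single binary variable \(X\) with one binary parent \(U\) (the dataset and variable names are invented for the example):

    from collections import Counter

    # Example dataset D as (u, x) pairs; values invented for illustration
    cases = [(1, 1), (1, 0), (0, 1), (1, 1), (0, 0), (1, 1)]

    count_xu = Counter(cases)                  # D#(x, u): joint counts
    count_u = Counter(u for u, _ in cases)     # D#(u): parent counts

    # theta[x, u] = D#(x, u) / D#(u), the ML estimate of P(x | u)
    theta = {(x, u): count_xu[(u, x)] / count_u[u]
             for u in (0, 1) for x in (0, 1)}

    print(theta[(1, 1)])   # ML estimate of P(X = 1 | U = 1)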

  • The variance of this estimator will decrease as the dataset increases in size
    • \(\theta^{ml}_{x|u}\) is asymptotically normal with mean \(P(x|u)\) and variance \(\frac{P(x|u)(1-P(x|u))}{NP(u)}\)
  • The likelihood of the parameter estimates \(\theta\) given a dataset \(D\) of size \(N\) is defined as:
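\[L(\theta|D) = \prod_{i=1}^{N} P_{\theta}(d_i)\]

    • (A standard form, assuming \(D\) consists of cases \(d_1, \dots, d_N\); the loglikelihood is then \(LL(\theta|D) = \sum_{i=1}^{N} \log_2 P_{\theta}(d_i)\))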

  • The loglikelihood of the ML estimates decomposes into a sum of conditional entropies, one for each family of the structure
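    • A standard way to write this decomposition, for a structure with families \(XU\) and with \(H_D(X|U)\) the conditional entropy of \(X\) given its parents \(U\) under the empirical distribution \(P_D\):

\[LL(\theta^{ml}|D) = -N \sum_{XU} H_D(X|U)\]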

Incomplete Data

  • For incomplete data, the likelihood is computed using inference: each case contributes the probability of the values that are actually observed in it
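    • One way to write this (assuming cases \(d_1, \dots, d_N\), each now a partial instantiation of the variables):

\[L(\theta|D) = \prod_{i=1}^{N} P_{\theta}(d_i)\]

    • \(P_{\theta}(d_i)\) is now a marginal probability obtained by summing out the unobserved variables, which is why inference is needed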

  • The expectation maximisation (EM) algorithm is used for incomplete data
    • Completes the dataset by creating an expected empirical distribution
      • Probability of each completion \(P_{\theta^0}(c_i|d_i)\)
    • Estimates parameters through ML
      • Guaranteed to have likelihood no less than that of the initial parameters

  • The expected empirical distribution of dataset \(D\) under the current parameter estimates \(\theta^k\) is defined as:
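\[P_{D,\theta^k}(\alpha) = \frac{1}{N} \sum_{i=1}^{N} P_{\theta^k}(\alpha|d_i)\]

    • (A standard form: the average, over the cases \(d_i\), of the posterior probability of an event \(\alpha\) given the observed values of that case; in the first iteration the current parameters are the initial estimates \(\theta^0\))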

  • Use the expected empirical distribution to estimate parameters, just as ML estimation uses probabilities and counts from the empirical distribution in the complete-data case
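    • A common way to write this update, reusing the counting formula above but with the expected empirical distribution in place of \(P_D\):

\[\theta^{k+1}_{x|u} = P_{D,\theta^k}(x|u)\]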

  • Completion probabilities change at each iteration because the parameters are updated

  • EM estimates can be computed without constructing the expected empirical distribution
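    • Concretely (a standard form of the update), the new estimates can be obtained directly from per-case posterior marginals:

\[\theta^{k+1}_{x|u} = \frac{\sum_{i=1}^{N} P_{\theta^k}(x, u|d_i)}{\sum_{i=1}^{N} P_{\theta^k}(u|d_i)}\]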

  • EM Algorithm:
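    • In outline (the standard scheme):
      • Start with initial parameter estimates \(\theta^0\)
      • E-step: use the current parameters \(\theta^k\) and inference to compute the expected empirical distribution (or the expected counts directly)
      • M-step: set \(\theta^{k+1}\) to the ML estimates under that expected empirical distribution
      • Repeat until the parameters (or the loglikelihood) stop changing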

  • Notes
    • EM may converge to different estimates depending on the initial estimate
    • At each iteration of EM, inference must be performed on the Bayesian network
      • The required quantities correspond to posterior marginals over the network families
      • A jointree can be used to compute the family marginals efficiently (a small worked sketch follows this list)
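  • A minimal Python sketch of EM on a tiny two-node network \(X \rightarrow Y\), both binary, with \(X\) missing in some cases (the data, structure and parameter names are invented for the example; posteriors are computed directly by Bayes rule rather than with a jointree):

    # EM for X -> Y, both binary; X is missing (None) in some cases
    data = [(1, 1), (None, 1), (0, 0), (1, 1), (None, 0), (0, 0)]
    N = len(data)

    theta_x = 0.5                    # current estimate of P(X = 1)
    theta_y = {0: 0.5, 1: 0.5}       # theta_y[x] = P(Y = 1 | X = x)

    for _ in range(100):
        # E-step: expected counts, using posterior marginals for the missing X
        exp_x1 = 0.0                                             # expected count of X = 1
        exp_xy = {(x, y): 0.0 for x in (0, 1) for y in (0, 1)}   # expected counts of (X, Y)
        for x, y in data:
            if x is None:
                # posterior P(X = 1 | y) under the current parameters
                p1 = theta_x * (theta_y[1] if y == 1 else 1 - theta_y[1])
                p0 = (1 - theta_x) * (theta_y[0] if y == 1 else 1 - theta_y[0])
                post1 = p1 / (p1 + p0)
                exp_x1 += post1
                exp_xy[(1, y)] += post1
                exp_xy[(0, y)] += 1 - post1
            else:
                exp_x1 += x
                exp_xy[(x, y)] += 1
        # M-step: ML estimates computed from the expected counts
        theta_x = exp_x1 / N
        theta_y = {x: exp_xy[(x, 1)] / (exp_xy[(x, 0)] + exp_xy[(x, 1)])
                   for x in (0, 1)}

    print(theta_x, theta_y)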

Missing Data Mechanism

  • Can introduce an indicator variable for each variable that may have missing values, recording whether its value is observed

  • Different types of missing data
    • Missing completely at random (MCAR) - missingness does not depend on any variable
    • Missing at random (MAR) - missingness depends only on variables whose values are observed
    • Missing not at random (MNAR) - missingness depends on the missing values themselves

  • MCAR and MAR have the same maximum likelihood results
    • If MAR assumption holds, the missing data mechanism can be ignored
  • MAR holds if the following condition is satisfied:
    • \(I\) and \(M\) are d-separated by \(O\) in the structure \(G_I\) (the network structure extended with the missingness indicators \(I\))

Maximum Likelihood Structure Learning

  • We are interested in a maximum likelihood structure
    • However, we also penalise complexity so that more complex structures are not automatically preferred
  • The loglikelihood of a structure \(G\) is computed using its ML parameter estimates

Learning Tree Structure

  • We consider learning a tree structure
    • Each node has at most 1 parent
    • Score connections based on mutual information in the empirical distribution
  • Can calculate the tree score (tScore) by summing the mutual information between every variable and its single parent
    • Trees that maximise this tScore also maximise the likelihood
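    • In symbols (writing \(MI_D\) for mutual information in the empirical distribution and \(U\) for the single parent of \(X\), empty for the root):

\[tScore(G|D) = \sum_{X} MI_D(X, U)\]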

  • Create a fully connected undirected graph over all variables, with the cost of each edge being the mutual information between the two variables it connects (this procedure is sketched below)
    • Compute the maximum spanning tree
      • Select a node as the root and direct edges away from the root
  • Guaranteed to have the maximal likelihood among all tree structures
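  • A small Python sketch of this procedure (the helper names, synthetic dataset and use of a Prim-style maximum spanning tree are choices made for the example, not prescribed by the notes):

    import numpy as np
    from itertools import combinations

    def mutual_information(data, i, j):
        """Empirical mutual information between columns i and j of a discrete dataset."""
        xs, ys = data[:, i], data[:, j]
        mi = 0.0
        for x in np.unique(xs):
            for y in np.unique(ys):
                pxy = np.mean((xs == x) & (ys == y))
                px, py = np.mean(xs == x), np.mean(ys == y)
                if pxy > 0:
                    mi += pxy * np.log2(pxy / (px * py))
        return mi

    def learn_tree(data, root=0):
        n = data.shape[1]
        # Edge weights: mutual information between every pair of variables
        weight = {(i, j): mutual_information(data, i, j)
                  for i, j in combinations(range(n), 2)}
        # Maximum spanning tree grown from the root (Prim-style), directing
        # every added edge away from the part of the tree already built
        in_tree, edges = {root}, []
        while len(in_tree) < n:
            parent, child = max(
                ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                key=lambda e: weight[tuple(sorted(e))])
            edges.append((parent, child))
            in_tree.add(child)
        return edges   # directed (parent, child) pairs

    # Tiny synthetic example: B is a noisy copy of A, C is independent
    rng = np.random.default_rng(0)
    a = rng.integers(0, 2, 200)
    b = np.where(rng.random(200) < 0.9, a, 1 - a)
    c = rng.integers(0, 2, 200)
    print(learn_tree(np.column_stack([a, b, c])))   # expect the edge A -> B; C attaches weakly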

Learning DAG Structure

  • Making the DAG structure more complex can never decrease the likelihood, and typically increases it
    • Overfitting the data, leading to poor generalisation performance
  • To overcome overfitting, add a regularisation term
    • Consider the DAG Dimension:
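\[||G|| = \sum_{XU} (|X| - 1) \cdot |U|\]

      • (A standard definition of the dimension: the number of independent parameters, where the sum is over the families \(XU\) of \(G\), \(|X|\) is the number of values of variable \(X\) and \(|U|\) the number of instantiations of its parents \(U\))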

  • Therefore we consider maximising the score:
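\[Score(G|D) = LL(G|D) - \psi(N) \cdot ||G||\]

    • (Using the notation above: \(LL(G|D)\) is the loglikelihood of \(G\) under its ML parameter estimates, \(N\) the dataset size and \(||G||\) the DAG dimension)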

  • Can have different choices for \(\psi(N)\):
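    • Two commonly cited choices: \(\psi(N) = 1\), giving a score in the spirit of AIC, and \(\psi(N) = \frac{\log_2 N}{2}\), giving the MDL (equivalently BIC) score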

Searching for Network Structure

  • There is a very large number of structures to consider (growing super-exponentially in the number of variables)
    • Greedy algorithms are often used
  • The efficiency of search algorithms is improved by decomposing the scoring function
    • Ability to decompose the score into an aggregate of local scores
      • One for each network family
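  • With a decomposable score, the total score can be written as a sum of local family scores:

\[Score(G|D) = \sum_{XU} Score(X, U|D)\]

  • (The usual consequence: a local search move such as adding, removing or reversing a single edge changes only a few families, so only those local scores need to be recomputed)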