Maximum Likelihood Parameter Learning
- With complete data we count frequencies
- This is the maximum likelihood estimate
- This also minimises the KL divergence between the empirical distribution and the learned distribution
\[\theta^{ml}_{x|u} = P_D(x|u) = \frac{D\#(x,u)}{D\#(u)}\]
- The variance of this estimator will decrease as the dataset increases in size
- \(\theta^{ml}_{x|u}\) is asymptotically normal with mean \(P(x|u)\) and variance \(\frac{P(x|u)(1-P(x|u))}{NP(u)}\)
- Likelihood of the estimates is defined as \(L(\theta|D)=\prod_{i=1}^{N}P_{\theta}(d_i)\)
- The log-likelihood of the ML estimates decomposes as a sum of empirical conditional entropies over the families of the structure: \(LL(\theta^{ml}|D)=-N\sum_{XU}ENT_D(X|U)\) (a minimal sketch follows)
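A minimal Python sketch of these counts-based estimates and the resulting log-likelihood, assuming records stored as dicts and a structure given as a list of (child, parents) families; the function names and data layout are illustrative rather than taken from the notes.

```python
from collections import Counter
from math import log

def ml_cpt(records, child, parents):
    """Return theta[x | u] = D#(x, u) / D#(u) for every (x, u) seen in the data."""
    joint = Counter()   # D#(x, u)
    margin = Counter()  # D#(u)
    for r in records:
        u = tuple(r[p] for p in parents)
        joint[(r[child], u)] += 1
        margin[u] += 1
    return {(x, u): n / margin[u] for (x, u), n in joint.items()}

def log_likelihood(records, families):
    """Log-likelihood of the ML estimates: sum over records and families of
    log theta[x | u]; equals -N times the sum of empirical conditional entropies."""
    cpts = {child: ml_cpt(records, child, parents) for child, parents in families}
    total = 0.0
    for r in records:
        for child, parents in families:
            u = tuple(r[p] for p in parents)
            total += log(cpts[child][(r[child], u)])
    return total

# Example: structure A -> B with four complete records.
data = [{"A": 1, "B": 1}, {"A": 1, "B": 0}, {"A": 0, "B": 0}, {"A": 0, "B": 0}]
print(ml_cpt(data, "B", ["A"]))                          # e.g. theta[B=1 | A=1] = 0.5
print(log_likelihood(data, [("A", []), ("B", ["A"])]))
```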
Incomplete Data
- For incomplete data, evaluating the likelihood requires inference: each record contributes the probability of its observed values only
- The expectation maximisation (EM) algorithm is used for incomplete data
- Completes the dataset by creating an expected empirical distribution
- Probability of each completion \(P_{\theta^0}(c_i|d_i)\)
- Estimates parameters through ML
- Guaranteed to achieve likelihood no less than that of the initial parameters
- The expected empirical distribution of dataset \(D\) under parameters \(\theta\) is defined as \(P_D(\alpha)=\frac{1}{N}\sum_{i=1}^{N}P_{\theta}(\alpha|d_i)\)
- Use the expected empirical distribution to estimate parameters from its probabilities and counts
- Completion probabilities change at each iteration because the parameters change
- EM estimates can be computed without explicitly constructing the expected empirical distribution
- EM Algorithm: start with initial estimates \(\theta^0\); complete the dataset using the completion probabilities \(P_{\theta^k}(c_i|d_i)\); re-estimate parameters by ML from the expected counts; repeat until convergence (a minimal sketch follows the notes below)
- Notes
- EM may converge to different estimates depending on the initial estimate
- At each iteration of EM you need to perform inference on the Bayesian network
- The required quantities correspond to posterior marginals over network families
- Can use a jointree to efficiently compute family marginals
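A minimal EM sketch under strong simplifying assumptions: a two-variable network \(X \to Y\) with binary \(X\) occasionally missing, so the E-step posterior can be computed directly rather than with a jointree. The record format and all names are illustrative.

```python
# theta_x[x] = P(X=x); theta_yx[(y, x)] = P(Y=y | X=x).
def em_step(records, theta_x, theta_yx):
    # E-step: weight every completion c_i of record d_i by P_theta(c_i | d_i).
    cx = {0: 0.0, 1: 0.0}                                   # expected counts D#(x)
    cyx = {(y, x): 0.0 for y in (0, 1) for x in (0, 1)}     # expected counts D#(y, x)
    for x, y in records:
        if x is not None:                                   # complete record: weight 1
            cx[x] += 1.0
            cyx[(y, x)] += 1.0
        else:                                               # X missing: fractional counts
            weights = {v: theta_x[v] * theta_yx[(y, v)] for v in (0, 1)}
            z = sum(weights.values())
            for v in (0, 1):
                w = weights[v] / z                          # P(X=v | Y=y)
                cx[v] += w
                cyx[(y, v)] += w
    # M-step: maximum likelihood estimates from the expected counts.
    n = sum(cx.values())
    new_x = {v: cx[v] / n for v in (0, 1)}
    new_yx = {(y, v): cyx[(y, v)] / cx[v] for y in (0, 1) for v in (0, 1)}
    return new_x, new_yx

# One record has X missing; iterating never decreases the likelihood.
records = [(1, 1), (0, 0), (None, 1), (0, 1)]
theta_x = {0: 0.5, 1: 0.5}
theta_yx = {(y, x): 0.5 for y in (0, 1) for x in (0, 1)}
for _ in range(20):
    theta_x, theta_yx = em_step(records, theta_x, theta_yx)
print(theta_x, theta_yx)
```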
Missing Data Mechanism
- Can introduce an indicator variable for each variable that may go missing, recording whether its value is observed
- Different types of missing data
- Missing completely at random (MCAR): missingness does not depend on any of the data values
- Missing at random (MAR): missingness may depend only on values that are observed
- Missing not at random (MNAR): missingness depends on the missing values themselves
- Under both MCAR and MAR, maximum likelihood estimation gives the same results whether or not the missing data mechanism is modelled
- If MAR assumption holds, the missing data mechanism can be ignored
- MAR holds if the following condition holds:
- \(I\) and \(M\) are d-separated by \(O\) in structure \(G_I\), where \(I\) are the missingness indicators, \(M\) the missing variables and \(O\) the observed variables
Maximum Likelihood Structure Learning
- We are interested in a maximum likelihood structure
- However, we also penalise complexity so that more complex models are not automatically preferred
- The log-likelihood of a structure \(G\) is defined via its ML parameter estimates: \(LL(G|D)=LL(\theta^{ml},G|D)=-N\sum_{XU}ENT_D(X|U)\), where the sum is over the families of \(G\)
Learning Tree Structure
- We consider learning a tree structure
- Each node has at most 1 parent
- Score connections based on mutual information in the empirical distribution
- Can calculate the tree score by summing the mutual information between every variable and its single parent in the empirical distribution
- Trees that maximise this tree score also maximise the likelihood
- Create a fully connected undirected graph over all variables, with the cost of each edge being the mutual information between the two variables it connects (a sketch of the resulting procedure follows this list)
- Compute the maximum spanning tree
- Select a node as the root and direct edges away from the root
- Guaranteed to have the maximal likelihood among all tree structures
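A compact illustrative sketch of this procedure (mutual information weights plus a maximum spanning tree, here via Prim's algorithm), assuming complete records of discrete values stored as dicts; the names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(records, a, b):
    """Empirical mutual information between variables a and b."""
    n = len(records)
    pa, pb, pab = Counter(), Counter(), Counter()
    for r in records:
        pa[r[a]] += 1; pb[r[b]] += 1; pab[(r[a], r[b])] += 1
    return sum((c / n) * log((c * n) / (pa[x] * pb[y])) for (x, y), c in pab.items())

def chow_liu_tree(records, variables, root):
    """Maximum spanning tree of the MI-weighted complete graph, directed away from root."""
    weight = {frozenset(e): mutual_information(records, *e)
              for e in combinations(variables, 2)}
    in_tree, edges = {root}, []
    while len(in_tree) < len(variables):
        u, v = max(((u, v) for u in in_tree for v in variables if v not in in_tree),
                   key=lambda e: weight[frozenset(e)])
        edges.append((u, v))          # u is already in the tree, so the edge points away from root
        in_tree.add(v)
    return edges

data = [{"A": 0, "B": 0, "C": 0}, {"A": 1, "B": 1, "C": 0},
        {"A": 1, "B": 1, "C": 1}, {"A": 0, "B": 0, "C": 1}]
print(chow_liu_tree(data, ["A", "B", "C"], root="A"))   # e.g. [('A', 'B'), ('A', 'C')]
```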
Learning DAG Structure
- Adding edges to the DAG structure never decreases, and typically increases, the maximal likelihood
- Overfitting the data, leading to poor generalisation performance
- To overcome overfitting, add a regularization term
- Consider the DAG dimension \(||G||\), the number of independent parameters: \(||G||=\sum_{X}(|X|-1)\,|\mathbf{U}_X|\), where \(|X|\) is the number of values of \(X\) and \(|\mathbf{U}_X|\) the number of instantiations of its parents
- Therefore we consider maximising the score: \(score(G|D)=LL(G|D)-\psi(N)\,||G||\)
- Can have different choices for \(\psi(N)\): \(\psi(N)=1\) gives the AIC score, \(\psi(N)=\frac{\log_2 N}{2}\) gives the MDL/BIC score (a sketch of the resulting family score follows)
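A hedged sketch of one family's penalised score under the MDL choice of \(\psi(N)\). As a simplification, the number of parent instantiations is approximated by those actually seen in the data; names are illustrative.

```python
from collections import Counter
from math import log2

def family_score(records, child, parents):
    """score(X, U; D) = -N * ENT_D(X | U) - psi(N) * dim(X, U) with psi(N) = log2(N)/2."""
    n = len(records)
    joint, margin = Counter(), Counter()
    child_vals, parent_vals = set(), set()
    for r in records:
        u = tuple(r[p] for p in parents)
        joint[(r[child], u)] += 1
        margin[u] += 1
        child_vals.add(r[child]); parent_vals.add(u)
    # -N * ENT_D(X | U) equals this family's contribution to the log2-likelihood.
    loglik = sum(c * log2(c / margin[u]) for (x, u), c in joint.items())
    # dim(X, U) = (|X| - 1) * (# parent instantiations), approximated from the data.
    dim = (len(child_vals) - 1) * len(parent_vals)
    return loglik - (log2(n) / 2) * dim

# Decomposability: the total score of a DAG is the sum of its family scores.
data = [{"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 0, "B": 1}]
print(family_score(data, "B", ["A"]) + family_score(data, "A", []))
```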
Searching for Network Structure
- Very large number of structures to consider (super-exponential in the number of variables)
- Greedy algorithms are often used
- Efficiency of search algorithms is improved by decomposing the scoring function
- Ability to decompose the score into an aggregate of local scores
- One for each network family
Local Search
- Start with an initial structure and modify it locally to increase its score
- Add edge
- Remove edge
- Reverse edge
- Can find a locally optimal solution
- Can improve this behaviour with random restarts: repeat the local search multiple times, each time starting from a different network
- Adding or removing an edge only impacts 1 family score, while reversing an edge impacts 2 family scores
- Have to update only these scores
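A sketch of how the decomposable score can be maintained incrementally during local search; `family_score(X, parents, D)` is a hypothetical local scoring function (for instance, one like the family score sketched earlier, with its arguments reordered), and acyclicity checks are omitted.

```python
def add_edge(parents, scores, X, Y, D, family_score):
    """Adding X -> Y changes only Y's family score."""
    parents[Y].add(X)
    scores[Y] = family_score(Y, parents[Y], D)

def remove_edge(parents, scores, X, Y, D, family_score):
    """Removing X -> Y changes only Y's family score."""
    parents[Y].discard(X)
    scores[Y] = family_score(Y, parents[Y], D)

def reverse_edge(parents, scores, X, Y, D, family_score):
    """Reversing X -> Y to Y -> X changes the family scores of both X and Y."""
    parents[Y].discard(X)
    parents[X].add(Y)
    scores[Y] = family_score(Y, parents[Y], D)
    scores[X] = family_score(X, parents[X], D)
```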
Greedy Search
- Can reduce the search space by assuming a total ordering on the variables
- Search only among network structures consistent with the chosen order
- Search process tries to find, for each variable \(X_i\), a set of parents \(U_i\subseteq\{X_1,\ldots,X_{i-1}\}\)
- Decomposes the problem into \(n\) independent problems
- Search reduces to considering each variable \(X_i\) independently and finding a set of parents \(U_i\subseteq\{X_1,\ldots,X_{i-1}\}\) that maximises \(score(X_i,U_i;D)\)
- A common greedy search algorithm starts with an empty set of parents and successively adds variables to the set one at a time until no addition increases the score (see the sketch below)
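A sketch of this greedy parent selection for a fixed ordering, reusing the hypothetical `family_score(X, parents, D)` local scoring function assumed above.

```python
def greedy_parents(order, D, family_score):
    """For each variable, greedily add the single predecessor that most improves
    its local score, stopping when no addition helps."""
    parent_sets = {}
    for i, x in enumerate(order):
        candidates = set(order[:i])              # only predecessors in the ordering
        parents = set()
        best = family_score(x, parents, D)
        improved = True
        while improved and candidates:
            improved = False
            gains = {c: family_score(x, parents | {c}, D) for c in candidates}
            c, s = max(gains.items(), key=lambda kv: kv[1])
            if s > best:                          # keep the best single addition
                parents.add(c); candidates.remove(c)
                best, improved = s, True
        parent_sets[x] = parents
    return parent_sets
```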
Optimal Search
- Searches through all combinations of parents (all subsets of each variable's predecessors in the ordering), guaranteeing an optimal structure for the given ordering.