Maximum Likelihood Parameter Learning
- With complete data we count frequencies
- This is the maximum likelihood estimate
- This also minimises the KL divergence between the empirical distribution and the learned distribution
\[\theta^{ml}_{x|u} = P_D(x|u) = \frac{D\#(x,u)}{D\#(u)}\]
- The variance of this estimator will decrease as the dataset increases in size
- \(\theta^{ml}_{x|u}\) is asymptotically normal with mean \(P(x|u)\) and variance \(\frac{P(x|u)(1-P(x|u))}{NP(u)}\)
- Likelihood of the estimates is defined as \(L(\theta|D)=\prod_{i=1}^{N}P_{\theta}(d_i)\)
- The log-likelihood of the ML estimates decomposes as a sum of empirical conditional entropies over the families of the structure: \(LL(\theta^{ml}|D)=-N\sum_{XU}ENT_D(X|U)\) (a minimal sketch follows)
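A minimal Python sketch of these counts-based estimates and the resulting log-likelihood, assuming records stored as dicts and a structure given as a list of (child, parents) families; the function names and data layout are illustrative rather than taken from the notes.

```python
from collections import Counter
from math import log

def ml_cpt(records, child, parents):
    """Return theta[x | u] = D#(x, u) / D#(u) for every (x, u) seen in the data."""
    joint = Counter()   # D#(x, u)
    margin = Counter()  # D#(u)
    for r in records:
        u = tuple(r[p] for p in parents)
        joint[(r[child], u)] += 1
        margin[u] += 1
    return {(x, u): n / margin[u] for (x, u), n in joint.items()}

def log_likelihood(records, families):
    """Log-likelihood of the ML estimates: sum over records and families of
    log theta[x | u]; equals -N times the sum of empirical conditional entropies."""
    cpts = {child: ml_cpt(records, child, parents) for child, parents in families}
    total = 0.0
    for r in records:
        for child, parents in families:
            u = tuple(r[p] for p in parents)
            total += log(cpts[child][(r[child], u)])
    return total

# Example: structure A -> B with four complete records.
data = [{"A": 1, "B": 1}, {"A": 1, "B": 0}, {"A": 0, "B": 0}, {"A": 0, "B": 0}]
print(ml_cpt(data, "B", ["A"]))                          # e.g. theta[B=1 | A=1] = 0.5
print(log_likelihood(data, [("A", []), ("B", ["A"])]))
```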
Incomplete Data
- For incomplete data, evaluating the likelihood requires inference: each record contributes the probability of its observed values only
- The expectation maximisation (EM) algorithm is used for incomplete data
- Completes the dataset by creating an expected empirical distribution
- Probability of each completion \(P_{\theta^0}(c_i|d_i)\)
- Estimates parameters through ML
- Guaranteed to achieve likelihood no less than that of the initial parameters
- The expected empirical distribution of dataset \(D\) under parameters \(\theta\) is defined as \(P_D(\alpha)=\frac{1}{N}\sum_{i=1}^{N}P_{\theta}(\alpha|d_i)\)
- Use the expected empirical distribution to estimate parameters from its probabilities and counts
- Completion probabilities change at each iteration because the parameters change
- EM estimates can be computed without explicitly constructing the expected empirical distribution
- EM Algorithm: start with initial estimates \(\theta^0\); complete the dataset using the completion probabilities \(P_{\theta^k}(c_i|d_i)\); re-estimate parameters by ML from the expected counts; repeat until convergence (a minimal sketch follows the notes below)
- Notes
- EM may converge to different estimates depending on the initial estimate
- At each iteration of EM you need to perform inference on the Bayesian network
- The required quantities correspond to posterior marginals over network families
- Can use a jointree to efficiently compute family marginals
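A minimal EM sketch under strong simplifying assumptions: a two-variable network \(X \to Y\) with binary \(X\) occasionally missing, so the E-step posterior can be computed directly rather than with a jointree. The record format and all names are illustrative.

```python
# theta_x[x] = P(X=x); theta_yx[(y, x)] = P(Y=y | X=x).
def em_step(records, theta_x, theta_yx):
    # E-step: weight every completion c_i of record d_i by P_theta(c_i | d_i).
    cx = {0: 0.0, 1: 0.0}                                   # expected counts D#(x)
    cyx = {(y, x): 0.0 for y in (0, 1) for x in (0, 1)}     # expected counts D#(y, x)
    for x, y in records:
        if x is not None:                                   # complete record: weight 1
            cx[x] += 1.0
            cyx[(y, x)] += 1.0
        else:                                               # X missing: fractional counts
            weights = {v: theta_x[v] * theta_yx[(y, v)] for v in (0, 1)}
            z = sum(weights.values())
            for v in (0, 1):
                w = weights[v] / z                          # P(X=v | Y=y)
                cx[v] += w
                cyx[(y, v)] += w
    # M-step: maximum likelihood estimates from the expected counts.
    n = sum(cx.values())
    new_x = {v: cx[v] / n for v in (0, 1)}
    new_yx = {(y, v): cyx[(y, v)] / cx[v] for y in (0, 1) for v in (0, 1)}
    return new_x, new_yx

# One record has X missing; iterating never decreases the likelihood.
records = [(1, 1), (0, 0), (None, 1), (0, 1)]
theta_x = {0: 0.5, 1: 0.5}
theta_yx = {(y, x): 0.5 for y in (0, 1) for x in (0, 1)}
for _ in range(20):
    theta_x, theta_yx = em_step(records, theta_x, theta_yx)
print(theta_x, theta_yx)
```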
Missing Data Mechanism
- Can introduce an indicator variable for each variable that may go missing, recording whether its value is observed
- Different types of missing data
- Missing completely at random (MCAR): missingness does not depend on any of the data values
- Missing at random (MAR): missingness may depend only on values that are observed
- Missing not at random (MNAR): missingness depends on the missing values themselves
- Under both MCAR and MAR, maximum likelihood estimation gives the same results whether or not the missing data mechanism is modelled
- If MAR assumption holds, the missing data mechanism can be ignored
- MAR holds if the following condition holds:
- \(I\) and \(M\) are d-separated by \(O\) in structure \(G_I\), where \(I\) are the missingness indicators, \(M\) the missing variables and \(O\) the observed variables
Maximum Likelihood Structure Learning
- We are interested in a maximum likelihood structure
- However, we also penalise complexity so that more complex models are not automatically preferred
- The log-likelihood of a structure \(G\) is defined via its ML parameter estimates: \(LL(G|D)=LL(\theta^{ml},G|D)=-N\sum_{XU}ENT_D(X|U)\), where the sum is over the families of \(G\)
Learning Tree Structure
- We consider learning a tree structure
- Each node has at most 1 parent
- Score connections based on mutual information in the empirical distribution
- Can calculate the tree score by summing the mutual information between every variable and its single parent in the empirical distribution
- Trees that maximise this tree score also maximise the likelihood
- Create a fully connected undirected graph over all variables, with the cost of each edge being the mutual information between the two variables it connects (a sketch of the resulting procedure follows this list)
- Compute the maximum spanning tree
- Select a node as the root and direct edges away from the root
- Guaranteed to have the maximal likelihood among all tree structures
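A compact illustrative sketch of this procedure (mutual information weights plus a maximum spanning tree, here via Prim's algorithm), assuming complete records of discrete values stored as dicts; the names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(records, a, b):
    """Empirical mutual information between variables a and b."""
    n = len(records)
    pa, pb, pab = Counter(), Counter(), Counter()
    for r in records:
        pa[r[a]] += 1; pb[r[b]] += 1; pab[(r[a], r[b])] += 1
    return sum((c / n) * log((c * n) / (pa[x] * pb[y])) for (x, y), c in pab.items())

def chow_liu_tree(records, variables, root):
    """Maximum spanning tree of the MI-weighted complete graph, directed away from root."""
    weight = {frozenset(e): mutual_information(records, *e)
              for e in combinations(variables, 2)}
    in_tree, edges = {root}, []
    while len(in_tree) < len(variables):
        u, v = max(((u, v) for u in in_tree for v in variables if v not in in_tree),
                   key=lambda e: weight[frozenset(e)])
        edges.append((u, v))          # u is already in the tree, so the edge points away from root
        in_tree.add(v)
    return edges

data = [{"A": 0, "B": 0, "C": 0}, {"A": 1, "B": 1, "C": 0},
        {"A": 1, "B": 1, "C": 1}, {"A": 0, "B": 0, "C": 1}]
print(chow_liu_tree(data, ["A", "B", "C"], root="A"))   # e.g. [('A', 'B'), ('A', 'C')]
```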
Learning DAG Structure
- Adding edges to the DAG structure never decreases, and typically increases, the maximal likelihood
- Overfitting the data, leading to poor generalisation performance
- To overcome overfitting, add a regularization term
- Consider the DAG dimension \(||G||\), the number of independent parameters: \(||G||=\sum_{X}(|X|-1)\,|\mathbf{U}_X|\), where \(|X|\) is the number of values of \(X\) and \(|\mathbf{U}_X|\) the number of instantiations of its parents
- Therefore we consider maximising the score: \(score(G|D)=LL(G|D)-\psi(N)\,||G||\)
- Can have different choices for \(\psi(N)\): \(\psi(N)=1\) gives the AIC score, \(\psi(N)=\frac{\log_2 N}{2}\) gives the MDL/BIC score (a sketch of the resulting family score follows)
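A hedged sketch of one family's penalised score under the MDL choice of \(\psi(N)\). As a simplification, the number of parent instantiations is approximated by those actually seen in the data; names are illustrative.

```python
from collections import Counter
from math import log2

def family_score(records, child, parents):
    """score(X, U; D) = -N * ENT_D(X | U) - psi(N) * dim(X, U) with psi(N) = log2(N)/2."""
    n = len(records)
    joint, margin = Counter(), Counter()
    child_vals, parent_vals = set(), set()
    for r in records:
        u = tuple(r[p] for p in parents)
        joint[(r[child], u)] += 1
        margin[u] += 1
        child_vals.add(r[child]); parent_vals.add(u)
    # -N * ENT_D(X | U) equals this family's contribution to the log2-likelihood.
    loglik = sum(c * log2(c / margin[u]) for (x, u), c in joint.items())
    # dim(X, U) = (|X| - 1) * (# parent instantiations), approximated from the data.
    dim = (len(child_vals) - 1) * len(parent_vals)
    return loglik - (log2(n) / 2) * dim

# Decomposability: the total score of a DAG is the sum of its family scores.
data = [{"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 0, "B": 1}]
print(family_score(data, "B", ["A"]) + family_score(data, "A", []))
```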
Searching for Network Structure
- Very large number of structures to consider (super-exponential in the number of variables)
- Greedy algorithms are often used
- Efficiency of search algorithms is improved by decomposing the scoring function
- Ability to decompose the score into an aggregate of local scores
- One for each network family
Local Search
- Start with an initial structure and modify it locally to increase its score
- Add edge
- Remove edge
- Reverse edge
- Can find a locally optimal solution
- Can improve this behaviour with random restarts: repeat the local search multiple times, each time starting from a different network
- Adding or removing an edge only impacts 1 family score, while reversing an edge impacts 2 family scores
- Have to update only these scores
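A sketch of how the decomposable score can be maintained incrementally during local search; `family_score(X, parents, D)` is a hypothetical local scoring function (for instance, one like the family score sketched earlier, with its arguments reordered), and acyclicity checks are omitted.

```python
def add_edge(parents, scores, X, Y, D, family_score):
    """Adding X -> Y changes only Y's family score."""
    parents[Y].add(X)
    scores[Y] = family_score(Y, parents[Y], D)

def remove_edge(parents, scores, X, Y, D, family_score):
    """Removing X -> Y changes only Y's family score."""
    parents[Y].discard(X)
    scores[Y] = family_score(Y, parents[Y], D)

def reverse_edge(parents, scores, X, Y, D, family_score):
    """Reversing X -> Y to Y -> X changes the family scores of both X and Y."""
    parents[Y].discard(X)
    parents[X].add(Y)
    scores[Y] = family_score(Y, parents[Y], D)
    scores[X] = family_score(X, parents[X], D)
```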
Greedy Search
- Can reduce the search space by assuming a total ordering on the variables
- Search only among network structures consistent with the chosen order
- Search process tries to find, for each variable \(X_i\), a set of parents \(U_i\subseteq\{X_1,\ldots,X_{i-1}\}\)
- Decomposes the problem into \(n\) independent problems
- Search reduces to considering each variable \(X_i\) independently and finding a set of parents \(U_i\subseteq\{X_1,\ldots,X_{i-1}\}\) that maximises \(score(X_i,U_i;D)\)
- A common greedy search algorithm starts with an empty set of parents and successively adds variables to the set one at a time until no addition increases the score (see the sketch below)
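A sketch of this greedy parent selection for a fixed ordering, reusing the hypothetical `family_score(X, parents, D)` local scoring function assumed above.

```python
def greedy_parents(order, D, family_score):
    """For each variable, greedily add the single predecessor that most improves
    its local score, stopping when no addition helps."""
    parent_sets = {}
    for i, x in enumerate(order):
        candidates = set(order[:i])              # only predecessors in the ordering
        parents = set()
        best = family_score(x, parents, D)
        improved = True
        while improved and candidates:
            improved = False
            gains = {c: family_score(x, parents | {c}, D) for c in candidates}
            c, s = max(gains.items(), key=lambda kv: kv[1])
            if s > best:                          # keep the best single addition
                parents.add(c); candidates.remove(c)
                best, improved = s, True
        parent_sets[x] = parents
    return parent_sets
```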
Optimal Search
- Searches through all combinations of parents (all subsets of each variable's predecessors in the ordering), guaranteeing an optimal structure for the given ordering.