Classification_BayesNet

Jake

05/10/2022

Classification with Bayesian Networks

  • We can divide a network variable set into class (query) variables, \(C\), and attributes (evidence variables), \(\mathbf{e}\).
    • Attributes are typically all variables except the class variables.
  • Given a set of evidence variables, we want to know the most likely class instantiation
    • MPE query where \(\mathbf{Q}=\{C\}\)
    • If there are missing values, we instead need a MAP query that excludes the missing variables from the evidence.

Variable Contribution

  • Some variables may not contribute to the classification query due to independence assumptions
    • Can use the concept of the Markov blanket.

  • Only the CPTs that include the class variable are relevant to the prediction
    • Therefore only a subset of the CPTs is used in classification
    • Prune nodes and edges through the VE_PR method
      • Discard disconnected nodes

Prediction

  • For class prediction, we need to compute:

\[ \arg\max_c P(C|\mathbf{e}) = \arg\max_c\frac{P(C,\mathbf{e})}{P(\mathbf{e})} = \arg\max_c P(C,\mathbf{e}) \]

  • Therefore, we only need to calculate \(P(C,\mathbf{e})\) to get the class prediction, since \(P(\mathbf{e})\) does not depend on \(C\) (see the sketch below).
    • Normalising \(P(C,\mathbf{e})\) over \(C\) recovers the posterior probabilities \(P(C|\mathbf{e})\)
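
A minimal sketch of this in Python, using made-up unnormalised values \(P(C,\mathbf{e})\) for a hypothetical two-class problem:

```python
# Hypothetical unnormalised joint values P(C = c, e) for one fixed evidence case e.
joint = {"spam": 0.0012, "ham": 0.0048}   # made-up numbers for illustration

# The predicted class only needs the unnormalised values.
prediction = max(joint, key=joint.get)

# Normalising over C recovers the posterior P(C | e).
total = sum(joint.values())
posterior = {c: p / total for c, p in joint.items()}

print(prediction)   # ham
print(posterior)    # approximately {'spam': 0.2, 'ham': 0.8}
```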

Parameter Estimation

  • Can get network parameters in 2 ways
    • Elicitation - Asking a human expert
    • Empirically - estimate based on data
  • Empirical probabilities of an instantiation can be calculated easily
    • Estimated based on frequency of occurrence

\[ P_D(A=T,B=T) = \frac{\overbrace{c(A=T,B=T)}^{\text{count of occurrences}}}{\underbrace{N}_{\text{number of observations}}}\]

  • For conditional probabilities, we apply the definition of conditional probability to the empirical distribution

\[ P_D(A=T|B=T) = \frac{P_D(A=T,B=T)}{P_D(B=T)}\]
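
A minimal sketch of both estimates in Python, computed from a small hypothetical dataset of \((A,B)\) observations:

```python
from collections import Counter

# Hypothetical dataset D: each record is one observation of (A, B).
data = [(True, True), (True, False), (False, True),
        (True, True), (False, False), (True, True)]
N = len(data)
counts = Counter(data)

# Empirical joint probability P_D(A=T, B=T) = c(A=T, B=T) / N
p_joint = counts[(True, True)] / N

# Empirical conditional probability P_D(A=T | B=T) = P_D(A=T, B=T) / P_D(B=T)
p_b = sum(1 for a, b in data if b) / N
p_cond = p_joint / p_b

print(p_joint)  # 3/6 = 0.5
print(p_cond)   # 3/4 ≈ 0.75
```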

Additive Smoothing

  • We should avoid assigning zero probability to any event unless we are completely sure it cannot occur
    • An event may occur only within the test set
    • An event may occur only given one class instantiation
  • Can fix this problem through additive smoothing:
    • \(\alpha\) is the pseudo-count parameter. Often set to 1 to add one ‘pseudo’ sample to the observations.
    • \(c(X=x)\) is the count of occurrences of \(X=x\)
    • \(N\) is the number of observations
    • \(|X|\) is the number of possible values for \(X\)

\[ P_L(X=x) = \frac{c(X=x)+\alpha}{N+\alpha|X|}\]

  • Can extend this to conditional probabilities:
    • \(N_{A=a}\) is the number of observations where \(A=a\)
    • \(|B|\) is the number of possible values for \(B\)

\[ P_L(B=b|A=a) = \frac{c(B=b,A=a)+\alpha}{N_{A=a}+\alpha|B|}\]
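
A minimal sketch of the smoothed conditional estimate, with hypothetical counts in which \(B=b\) was never observed together with \(A=a\):

```python
def smoothed(count_ba, n_a, cardinality, alpha=1.0):
    """Additive (Laplace) smoothing: (c(B=b, A=a) + alpha) / (N_{A=a} + alpha * |B|)."""
    return (count_ba + alpha) / (n_a + alpha * cardinality)

# Hypothetical counts: B=b never observed together with A=a.
c_ba = 0      # c(B=b, A=a)
n_a = 10      # N_{A=a}: observations where A=a
card_b = 3    # |B|: number of possible values of B

print(smoothed(c_ba, n_a, card_b))        # 1/13 ≈ 0.077, no longer zero
print(smoothed(c_ba, n_a, card_b, 0.0))   # 0.0 without smoothing
```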

Naive Bayes

  • Naive Bayes networks have a fixed structure
    • The only learning task is to estimate the parameters
    • The number of parameters is linear in \(n\): \(|C| + n|C||A|\)

  • Each attribute has the class variable as its only parent
    • As a result, the NBC assumes all attributes are conditionally independent given the class; however, this rarely holds in practice

\[ A_1\perp A_2,...,A_n|C\]

  • Typically over-confident in its probability predictions
    • However, this is not much of an issue if we only care about the predicted class.
  • To get the joint probability distribution, we can use the chain rule for Bayesian networks:

\[ P(C,A_1,\ldots,A_n) = P(C)\,P(A_1|C)\cdots P(A_n|C) = P(C)\prod_iP(A_i|C)\]
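
A minimal sketch of this factorisation for a hypothetical two-attribute network with made-up CPT entries:

```python
# Hypothetical CPTs for a Naive Bayes network with class C and attributes A1, A2.
prior = {"c0": 0.6, "c1": 0.4}                      # P(C)
cpt_a1 = {("a1", "c0"): 0.7, ("a1", "c1"): 0.2}     # P(A1 = a1 | C)
cpt_a2 = {("a2", "c0"): 0.1, ("a2", "c1"): 0.5}     # P(A2 = a2 | C)

# Chain rule for the network: P(C, A1, A2) = P(C) * P(A1|C) * P(A2|C)
def joint(c, a1="a1", a2="a2"):
    return prior[c] * cpt_a1[(a1, c)] * cpt_a2[(a2, c)]

scores = {c: joint(c) for c in prior}
print(scores)                       # approximately {'c0': 0.042, 'c1': 0.04}
print(max(scores, key=scores.get))  # 'c0'
```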

Log Probabilities

  • The product of many small probabilities can lead to numerical precision (underflow) errors
    • Can use log-transformed probabilities, since the logarithm is monotonically increasing and therefore preserves the most likely class
    • Example for Naive Bayes network:

\[ \log(P(C,A_1,...,A_n)) = \log(P(C))+\sum_i\log(P(A_i|C))\]
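
The same idea in log space, as a minimal sketch with hypothetical CPT entries similar to the previous block:

```python
import math

# Hypothetical CPTs: P(C) and P(A_i = a_i | C) for the observed attribute values.
prior = {"c0": 0.6, "c1": 0.4}
cpts = [{"c0": 0.7, "c1": 0.2},      # P(A1 = a1 | C)
        {"c0": 0.1, "c1": 0.5}]      # P(A2 = a2 | C)

# log P(C, A1, ..., An) = log P(C) + sum_i log P(A_i | C)
log_scores = {c: math.log(prior[c]) + sum(math.log(cpt[c]) for cpt in cpts)
              for c in prior}

print(max(log_scores, key=log_scores.get))  # 'c0', same prediction as the product form
```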

Tree-Augmented Naive Bayes Classifier (TAN)

  • Extensions such as TAN have been proposed to relax the strong conditional independence assumption of Naive Bayes.
    • Allows learning a more elaborate dependency structure between variables through the use of conditional mutual information.
  • Consider the Conditional Mutual Information:

\[ MI_D(A_i;A_j|C) = \sum_{a_i,a_j,c}P_D(a_i,a_j,c)\log_2\left(\frac{P_D(a_i,a_j|c)}{P_D(a_i|c)P_D(a_j|c)}\right) \]
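
A minimal sketch of estimating this quantity from counts, using a small hypothetical list of \((a_i, a_j, c)\) observations:

```python
import math
from collections import Counter

def conditional_mi(records):
    """Estimate MI_D(Ai; Aj | C) from a list of (ai, aj, c) observations."""
    n = len(records)
    c_ijc = Counter(records)                            # c(ai, aj, c)
    c_ic = Counter((ai, c) for ai, _, c in records)     # c(ai, c)
    c_jc = Counter((aj, c) for _, aj, c in records)     # c(aj, c)
    c_c = Counter(c for _, _, c in records)             # c(c)

    mi = 0.0
    for (ai, aj, c), n_ijc in c_ijc.items():
        p_ijc = n_ijc / n                               # P_D(ai, aj, c)
        # P_D(ai, aj | c) / (P_D(ai | c) * P_D(aj | c)) expressed with counts
        ratio = (n_ijc * c_c[c]) / (c_ic[(ai, c)] * c_jc[(aj, c)])
        mi += p_ijc * math.log2(ratio)
    return mi

# Hypothetical observations of (A1, A2, C).
data = [(0, 0, "x"), (0, 0, "x"), (1, 1, "x"), (1, 1, "x"),
        (0, 1, "y"), (1, 0, "y"), (0, 0, "y"), (1, 1, "y")]
print(conditional_mi(data))   # 0.5 bits
```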

Method

  • We create a fully connected undirected graph over the attribute (evidence) variables, with edge weights given by the conditional mutual information scores estimated from data.

  • The maximum spanning tree is then calculated (see the sketch after this list).

  • A variable is chosen as the root variable and the edges are directed outwards from the root. The class node \(C\) is then connected with directed edges to each attribute node.

  • Learn parameters for the resulting graph.
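
A minimal sketch of the structure-learning steps above, assuming a hypothetical `cmi(i, j)` function (e.g. the one sketched earlier) and attributes indexed \(0,\ldots,n-1\); the final parameter-learning step is omitted:

```python
def tan_structure(n_attrs, cmi, root=0):
    """Sketch: return directed (parent, child) edges of a TAN structure.

    cmi(i, j) is assumed to return MI_D(Ai; Aj | C) for attributes i and j.
    """
    # Maximum spanning tree over the complete CMI-weighted graph (Prim's algorithm).
    # Growing the tree from the chosen root means every edge we add already
    # points away from the root, i.e. from an attribute's parent to the attribute.
    in_tree, edges = {root}, []
    while len(in_tree) < n_attrs:
        parent, child = max(
            ((i, j) for i in in_tree for j in range(n_attrs) if j not in in_tree),
            key=lambda e: cmi(*e))
        edges.append((parent, child))
        in_tree.add(child)

    # The class node C becomes a parent of every attribute.
    edges += [("C", a) for a in range(n_attrs)]
    return edges

# Example with a hypothetical CMI table for 3 attributes.
weights = {(0, 1): 0.30, (0, 2): 0.05, (1, 2): 0.20}
cmi = lambda i, j: weights.get((min(i, j), max(i, j)), 0.0)
print(tan_structure(3, cmi))   # [(0, 1), (1, 2), ('C', 0), ('C', 1), ('C', 2)]
```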

Full algorithm