Classification_BayesNet

Jake

05/10/2022

Classification with Bayesian Networks

  • We can divide a network variable set into class (query) variables, \(C\), and attributes (evidence variables), \(\mathbf{e}\).
    • Attributes are typically all variables except the class variables.
  • Given a set of evidence variables, we want to know the most likely class instantiation
    • MPE query where \(\mathbf{Q}=\{C\}\)
    • If there are missing values, we instead need a MAP query that excludes the missing variables from the evidence.

Variable Contribution

  • Some variables may not contribute to the classification query due to independence assumptions
    • Can use the concept of the Markov blanket.

  • Only the CPTs that include the class variable are relevant to the prediction
    • Therefore only a subset of the CPTs is used in classification
    • Prune nodes and edges through the VE_PR method
      • Discard disconnected nodes

Prediction

  • For class prediction, we need to compute:

\[ \arg\max_c P(C|\mathbf{e}) = \arg\max_c\frac{P(C,\mathbf{e})}{P(\mathbf{e})} = \arg\max_c P(C,\mathbf{e}) \]

  • Therefore, we only need to calculate \(P(C,\mathbf{e})\) to get the class prediction, since \(P(\mathbf{e})\) does not depend on \(C\) (see the sketch below).
    • Normalising \(P(C,\mathbf{e})\) over \(C\) recovers the posterior probabilities \(P(C|\mathbf{e})\)
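
A minimal sketch of this in Python, using made-up unnormalised values \(P(C,\mathbf{e})\) for a hypothetical two-class problem:

```python
# Hypothetical unnormalised joint values P(C = c, e) for one fixed evidence case e.
joint = {"spam": 0.0012, "ham": 0.0048}   # made-up numbers for illustration

# The predicted class only needs the unnormalised values.
prediction = max(joint, key=joint.get)

# Normalising over C recovers the posterior P(C | e).
total = sum(joint.values())
posterior = {c: p / total for c, p in joint.items()}

print(prediction)   # ham
print(posterior)    # approximately {'spam': 0.2, 'ham': 0.8}
```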

Parameter Estimation

  • Can get network parameters in 2 ways
    • Elicitation - Asking a human expert
    • Empirically - estimate based on data
  • Empirical probabilities of an instantiation can be calculated easily
    • Estimated based on frequency of occurrence

\[ P_D(A=T,B=T) = \frac{\overbrace{c(A=T,B=T)}^{\text{count of occurrences}}}{\underbrace{N}_{\text{number of observations}}}\]

  • For conditional probabilities, we apply the definition of conditional probability to the empirical distribution

\[ P_D(A=T|B=T) = \frac{P_D(A=T,B=T)}{P_D(B=T)}\]
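
A minimal sketch of both estimates in Python, computed from a small hypothetical dataset of \((A,B)\) observations:

```python
from collections import Counter

# Hypothetical dataset D: each record is one observation of (A, B).
data = [(True, True), (True, False), (False, True),
        (True, True), (False, False), (True, True)]
N = len(data)
counts = Counter(data)

# Empirical joint probability P_D(A=T, B=T) = c(A=T, B=T) / N
p_joint = counts[(True, True)] / N

# Empirical conditional probability P_D(A=T | B=T) = P_D(A=T, B=T) / P_D(B=T)
p_b = sum(1 for a, b in data if b) / N
p_cond = p_joint / p_b

print(p_joint)  # 3/6 = 0.5
print(p_cond)   # 3/4 ≈ 0.75
```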

Additive Smoothing

  • We should avoid assigning zero probability to any event unless we are completely sure it cannot occur
    • An event may occur only within the test set
    • An event may occur only given one class instantiation
  • Can fix this problem through additive smoothing:
    • \(\alpha\) is the pseudo-count parameter. Often set to 1 to add one ‘pseudo’ sample to the observations.
    • \(c(X=x)\) is the count of occurrences of \(X=x\)
    • \(N\) is the number of observations
    • \(|X|\) is the number of possible values for \(X\)

\[ P_L(X=x) = \frac{c(X=x)+\alpha}{N+\alpha|X|}\]

  • Can extend this to conditional probabilities:
    • \(N_{A=a}\) is the number of observations where \(A=a\)
    • \(|B|\) is the number of possible values for \(B\)

\[ P_L(B=b|A=a) = \frac{c(B=b,A=a)+\alpha}{N_{A=a}+\alpha|B|}\]
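
A minimal sketch of the smoothed conditional estimate, with hypothetical counts in which \(B=b\) was never observed together with \(A=a\):

```python
def smoothed(count_ba, n_a, cardinality, alpha=1.0):
    """Additive (Laplace) smoothing: (c(B=b, A=a) + alpha) / (N_{A=a} + alpha * |B|)."""
    return (count_ba + alpha) / (n_a + alpha * cardinality)

# Hypothetical counts: B=b never observed together with A=a.
c_ba = 0      # c(B=b, A=a)
n_a = 10      # N_{A=a}: observations where A=a
card_b = 3    # |B|: number of possible values of B

print(smoothed(c_ba, n_a, card_b))        # 1/13 ≈ 0.077, no longer zero
print(smoothed(c_ba, n_a, card_b, 0.0))   # 0.0 without smoothing
```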

Naive Bayes

  • Naive Bayes networks have a fixed structure
    • The only learning task is to estimate the parameters
    • The number of parameters is linear in \(n\): \(|C| + n|C||A|\)

  • Each attribute has the class variable as its only parent
    • As a result, the NBC assumes all attributes are conditionally independent given the class; however, this rarely holds in practice

\[ A_1\perp A_2,...,A_n|C\]

  • Typically over-confident in its probability predictions
    • However, this is not much of an issue if we only care about the predicted class.
  • To get the joint probability distribution, we can use the chain rule for Bayesian networks:

\[ P(C,A_1,\ldots,A_n) = P(C)\,P(A_1|C)\cdots P(A_n|C) = P(C)\prod_iP(A_i|C)\]
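
A minimal sketch of this factorisation for a hypothetical two-attribute network with made-up CPT entries:

```python
# Hypothetical CPTs for a Naive Bayes network with class C and attributes A1, A2.
prior = {"c0": 0.6, "c1": 0.4}                      # P(C)
cpt_a1 = {("a1", "c0"): 0.7, ("a1", "c1"): 0.2}     # P(A1 = a1 | C)
cpt_a2 = {("a2", "c0"): 0.1, ("a2", "c1"): 0.5}     # P(A2 = a2 | C)

# Chain rule for the network: P(C, A1, A2) = P(C) * P(A1|C) * P(A2|C)
def joint(c, a1="a1", a2="a2"):
    return prior[c] * cpt_a1[(a1, c)] * cpt_a2[(a2, c)]

scores = {c: joint(c) for c in prior}
print(scores)                       # approximately {'c0': 0.042, 'c1': 0.04}
print(max(scores, key=scores.get))  # 'c0'
```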

Log Probabilities

  • The product of many small probabilities can lead to numerical precision (underflow) errors
    • Can use log-transformed probabilities, since the logarithm is monotonically increasing and therefore preserves the most likely class
    • Example for Naive Bayes network:

\[ \log(P(C,A_1,...,A_n)) = \log(P(C))+\sum_i\log(P(A_i|C))\]
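
The same idea in log space, as a minimal sketch with hypothetical CPT entries similar to the previous block:

```python
import math

# Hypothetical CPTs: P(C) and P(A_i = a_i | C) for the observed attribute values.
prior = {"c0": 0.6, "c1": 0.4}
cpts = [{"c0": 0.7, "c1": 0.2},      # P(A1 = a1 | C)
        {"c0": 0.1, "c1": 0.5}]      # P(A2 = a2 | C)

# log P(C, A1, ..., An) = log P(C) + sum_i log P(A_i | C)
log_scores = {c: math.log(prior[c]) + sum(math.log(cpt[c]) for cpt in cpts)
              for c in prior}

print(max(log_scores, key=log_scores.get))  # 'c0', same prediction as the product form
```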

Tree-Augmented Naive Bayes Classifier (TAN)

  • Extensions such as TAN have been proposed to relax the strong conditional independence assumption of Naive Bayes.
    • Allows learning a more elaborate dependency structure between variables through the use of conditional mutual information.
  • Consider the Conditional Mutual Information:

\[ MI_D(A_i;A_j|C) = \sum_{a_i,a_j,c}P_D(a_i,a_j,c)\log_2\left(\frac{P_D(a_i,a_j|c)}{P_D(a_i|c)P_D(a_j|c)}\right) \]
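
A minimal sketch of estimating this quantity from counts, using a small hypothetical list of \((a_i, a_j, c)\) observations:

```python
import math
from collections import Counter

def conditional_mi(records):
    """Estimate MI_D(Ai; Aj | C) from a list of (ai, aj, c) observations."""
    n = len(records)
    c_ijc = Counter(records)                            # c(ai, aj, c)
    c_ic = Counter((ai, c) for ai, _, c in records)     # c(ai, c)
    c_jc = Counter((aj, c) for _, aj, c in records)     # c(aj, c)
    c_c = Counter(c for _, _, c in records)             # c(c)

    mi = 0.0
    for (ai, aj, c), n_ijc in c_ijc.items():
        p_ijc = n_ijc / n                               # P_D(ai, aj, c)
        # P_D(ai, aj | c) / (P_D(ai | c) * P_D(aj | c)) expressed with counts
        ratio = (n_ijc * c_c[c]) / (c_ic[(ai, c)] * c_jc[(aj, c)])
        mi += p_ijc * math.log2(ratio)
    return mi

# Hypothetical observations of (A1, A2, C).
data = [(0, 0, "x"), (0, 0, "x"), (1, 1, "x"), (1, 1, "x"),
        (0, 1, "y"), (1, 0, "y"), (0, 0, "y"), (1, 1, "y")]
print(conditional_mi(data))   # 0.5 bits
```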

Method

  • We create a fully connected undirected graph over the attribute (evidence) variables, with edge weights given by the conditional mutual information scores estimated from data.

  • The maximum spanning tree is then calculated (see the sketch after this list).

  • A variable is chosen as the root variable and the edges are directed outwards from the root. The class node \(C\) is then connected with directed edges to each attribute node.

  • Learn parameters for the resulting graph.
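
A minimal sketch of the structure-learning steps above, assuming a hypothetical `cmi(i, j)` function (e.g. the one sketched earlier) and attributes indexed \(0,\ldots,n-1\); the final parameter-learning step is omitted:

```python
def tan_structure(n_attrs, cmi, root=0):
    """Sketch: return directed (parent, child) edges of a TAN structure.

    cmi(i, j) is assumed to return MI_D(Ai; Aj | C) for attributes i and j.
    """
    # Maximum spanning tree over the complete CMI-weighted graph (Prim's algorithm).
    # Growing the tree from the chosen root means every edge we add already
    # points away from the root, i.e. from an attribute's parent to the attribute.
    in_tree, edges = {root}, []
    while len(in_tree) < n_attrs:
        parent, child = max(
            ((i, j) for i in in_tree for j in range(n_attrs) if j not in in_tree),
            key=lambda e: cmi(*e))
        edges.append((parent, child))
        in_tree.add(child)

    # The class node C becomes a parent of every attribute.
    edges += [("C", a) for a in range(n_attrs)]
    return edges

# Example with a hypothetical CMI table for 3 attributes.
weights = {(0, 1): 0.30, (0, 2): 0.05, (1, 2): 0.20}
cmi = lambda i, j: weights.get((min(i, j), max(i, j)), 0.0)
print(tan_structure(3, cmi))   # [(0, 1), (1, 2), ('C', 0), ('C', 1), ('C', 2)]
```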

Full algorithm