Bayesian learning. Idea: under certain assumptions, any learning algorithm that minimizes the squared error between the output hypothesis' predictions and the training data will output a maximum likelihood hypothesis. What's good about the Bayesian approach is that it's more flexible: instead of eliminating a hypothesis outright, we simply give it a lower probability (penalizing it for every inconsistent example). It also allows us to make probabilistic predictions. Issues:
- it requires initial knowledge of many probabilities
- it has a significant computational cost
- P(h) - prior probability of hypothesis h (i.e. before we have observed the training data)
- P(D) - probability that the data D will be observed
- P(D|h) - probability of observing D given that h holds
- P(h|D) - posterior probability of h (i.e. after we have seen the data)

Bayes' theorem:
\[P(h|D) = \frac{P(D|h)P(h)}{P(D)}\] MAP hypothesis - the maximum a posteriori hypothesis, i.e. the most probable hypothesis \(h \in H\) given the observed data D: \[ h_{MAP} = \underset{h \in H}{argmax}\, P(h|D) = \underset{h \in H}{argmax}\, \frac{P(D|h)P(h)}{P(D)} = \underset{h \in H}{argmax}\, P(D|h)P(h)\] The last step drops \(P(D)\) because it does not depend on h.
If we assume no prior knowledge about our hypotheses, we treat them all as equally likely and assign them equal P(h). The P(h) term then becomes irrelevant and we can maximize just P(D|h). The resulting hypothesis is the ML - maximum likelihood - hypothesis.
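In the same notation: \[h_{ML} = \underset{h \in H}{argmax}\, P(D|h)\]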
The idea is very simple: enumerate all hypotheses, then for each hypothesis \(h \in H\) calculate the posterior probability and choose the hypothesis with the highest one. Assumptions:
- the training data D is noise-free
- the target concept is contained in H
- we have no a priori reason to prefer any hypothesis, so the prior P(h) is uniform over H
Then, by Bayes' theorem:
\[P(h|D) = 0\] if h is inconsistent with D, and \[P(h|D) = \frac{1}{|VS_{H,D}|}\] if h is consistent with D.
\(VS_{H,D}\) is the version space, i.e. the subset of hypotheses consistent with D, and \(|VS_{H,D}|\) is its size. Note that in this case every consistent hypothesis is in fact a MAP hypothesis. A learner that outputs a hypothesis consistent with the training data is called a consistent learner.
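As a quick illustration (a toy sketch, not from the notes above - the data and the threshold hypothesis space are made up), the brute-force learner can be written out directly: enumerate the hypotheses, score each one against the data, and apply Bayes' theorem with a uniform prior.

# Toy brute-force Bayes concept learner (illustrative sketch only)
# Hypothesis space: threshold classifiers "x >= t" for t = 1..5
x = c(1, 2, 3, 4, 5)
y = c(0, 0, 1, 1, 1)                                            # noise-free labels
H = lapply(1:5, function(t) function(v) as.integer(v >= t))
likelihood = sapply(H, function(h) as.numeric(all(h(x) == y)))  # P(D|h): 1 if h is consistent, else 0
prior = rep(1/length(H), length(H))                             # uniform P(h)
posterior = likelihood*prior/sum(likelihood*prior)              # Bayes theorem
posterior                                                       # 1/|VS| for consistent hypotheses, 0 otherwise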
The nice thing about the naive Bayes learner is that its performance is comparable to that of neural networks and decision tree learning. The Bayesian approach tells us that we should return the most probable target value \(v_{MAP}\) as the result: \[v_{MAP} = \underset{v_j \in V}{argmax}\, P(v_j | a_1, a_2, ..., a_n)\] where \(a_1,...,a_n\) are the attribute values of the new example. Applying Bayes' theorem we obtain: \[v_{MAP} = \underset{v_j \in V}{argmax}\, \frac{P(a_1, a_2, ..., a_n | v_j) P(v_j)}{P(a_1,a_2,...,a_n)} = \underset{v_j \in V}{argmax}\, P(a_1, a_2, ..., a_n | v_j) P(v_j)\] Estimating \(P(a_1, a_2, ..., a_n | v_j)\) directly is next to impossible (we would need to see every combination of attribute values many times). That is why in naive Bayes we assume that the attribute values are conditionally independent of each other given the target value. This lets us convert the hairy joint probability into a product of per-attribute probabilities: \[P(a_1, a_2, ..., a_n | v_j) = \prod_i{P(a_i|v_j)}\] Thus the naive Bayes classifier looks like this: \[v_{NB} = \underset{v_j \in V}{argmax}\, P(v_j) \prod_i{P(a_i|v_j)}\]
Learning in naive Bayes means estimating \(P(v_j)\) and \(P(a_i|v_j)\) from their frequencies in the training data.
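For example, the estimation itself boils down to relative frequencies (a rough sketch using a crude cut() binning instead of the chiM discretization used later):

data(iris)
p_vj = prop.table(table(iris$Species))                     # P(v_j): class priors
bins = cut(iris$Sepal.Length, breaks = 3)                  # crude discretization of one attribute
p_avj = prop.table(table(bins, iris$Species), margin = 2)  # P(a_i|v_j): per-class value probabilities
p_vj
p_avj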
Standard step - loading datasets.
data(iris)
# wine: move the class (first column) to the end and name it "target"
wine <- read.csv("~/Studies/Year3/DataMining/wine.data", header=FALSE)
wine = cbind(wine[,-1],wine[,1])
colnames(wine)[ncol(wine)] = "target"
# ionosphere: the class is already the last column
ion = read.table("~/Studies/Year3/Machine Learning/ionosphere.data",sep=",")
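A quick optional check of the loaded frames:

sapply(list(iris = iris, wine = wine, ion = ion), ncol)  # number of columns in each dataset
table(wine$target)                                       # class distribution of the wine data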
First we need to discretize our data so that we can apply the naive Bayes algorithm. This could be done manually, but there are nice packages that do it much more efficiently. However, we need not only to discretize the training data, but also a function that discretizes new examples according to the thresholds derived from the training data.
library(discretization)

# query - vector to be discretized
# thresholds - list of cut points for each attribute
discretize <- function(query, thresholds) {
  result = query
  for (i in 1:length(thresholds)) {
    # findInterval returns 0 below the first cut point, hence the +1
    result[i] = findInterval(query[i], thresholds[[i]]) + 1
  }
  result
}
irisD = chiM(iris)
th = irisD$cutp
t(apply(iris[1:4,1:4],1,discretize, th))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 3 1 1
## 2 1 2 1 1
## 3 1 2 1 1
## 4 1 2 1 1
The idea is to build a data structure that holds all the necessary probabilities. For each class we calculate its overall (prior) probability. Then, within each class, we calculate the probability of each value of each attribute.
get_prob_tables <- function(dset, target_column_index) {
  m = nrow(dset)
  n = ncol(dset)
  # split by target function value
  subsets = split(dset, dset[,target_column_index])
  results = list()
  for (i in 1:length(subsets)) {
    loc_m = nrow(subsets[[i]])
    # prior probability P(v_j) of this class
    pr = loc_m/m
    vpr = list()
    for (j in 1:(n-1)) {
      # frequency table of attribute j within this class -> P(a_i|v_j)
      df = data.frame(table(subsets[[i]][,j]))
      df$nc = df[,2]/loc_m
      vpr[[j]] = df
    }
    res = list(name = names(subsets[i]), p = pr, vpr = vpr)
    results[[i]] = res
  }
  names(results) = as.character(names(subsets))
  results
}
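For example (just to inspect the structure), on the discretized iris data from above:

pt = get_prob_tables(irisD$Disc.data, 5)
pt$setosa$p          # prior P(v_j) for the "setosa" class
pt$setosa$vpr[[1]]   # value / frequency / probability table for the first attribute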
# query - query to classify
# prtables - probability tables
# thresholds - thresholds for discretization
# m - number of virtual examples used to smooth the probability estimates
classify <- function(query, prtables, thresholds, m=1, discr = TRUE) {
  if (discr) {
    # discretize the query using the thresholds derived from the training data
    query = discretize(query, thresholds)
  }
  # now classify: for each class compute P(v_j) * prod_i P(a_i|v_j)
  answer = "None"
  max_prob = 0
  for (cl in prtables) {
    pvj = cl$p
    atrprobs = cl$vpr
    for (i in 1:length(atrprobs)) {
      freq_table = atrprobs[[i]]
      svec = freq_table[,1] == query[i]
      if (any(svec)) {
        freq = freq_table[svec,2]
      }
      else {
        # value never seen for this class in the training data
        freq = 0
      }
      # smoothed estimate of P(a_i|v_j) using m virtual examples
      p = (freq+1)/(sum(freq_table[,2])+m)
      pvj = pvj*p
    }
    if (pvj > max_prob) {
      max_prob = pvj
      answer = cl$name
    }
  }
  answer
}
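With the tables and thresholds built above, a single raw example can be classified like this (illustrative call):

classify(unlist(iris[1, 1:4]), pt, th)  # expected to return "setosa"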
library(caTools)
run_nb_experiment <- function(dset, train_ratio, discr=TRUE) {
  # split the dataset into train/test, stratified by the target column
  target_column_index = ncol(dset)
  target_column = dset[,target_column_index]
  split = sample.split(target_column, SplitRatio = train_ratio)
  train = subset(dset, split==T)
  test = subset(dset, split==F)
  testX = test[,-target_column_index]
  testy = test[,target_column_index]
  if (discr) {
    # discretize the training data; the cut points are reused for the test queries
    trainD = chiM(train)
    train = trainD$Disc.data
    th = trainD$cutp
  }
  else {
    th = NULL
  }
  # build probability tables
  pr_tables = get_prob_tables(train, target_column_index)
  predictions = apply(testX, 1, classify, pr_tables, th, 1, discr)
  # accuracy on the test set
  sum(predictions == testy)/length(testy)
}
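Before sweeping the split ratios, a single sanity run (the split is random, so the exact accuracy will vary):

set.seed(42)                  # illustrative seed, just for reproducibility
run_nb_experiment(iris, 0.7)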
# assumes that the last column is the target column
get_nb_accuracy_tables <- function(dset, ratios) {
  results = c()
  for (ratio in ratios) {
    results = rbind(results, c(ratio, run_nb_experiment(dset, ratio)))
  }
  results
}
ratios = seq(0.5,0.9,0.1)
iris_acc = get_nb_accuracy_tables(iris, ratios)
| train ratio | accuracy |
|------------:|---------:|
|         0.5 |   0.9600 |
|         0.6 |   0.8833 |
|         0.7 |   0.9111 |
|         0.8 |   0.9667 |
|         0.9 |   0.9333 |
wine_acc = get_nb_accuracy_tables(wine, ratios)
kable(wine_acc,digits=4,caption="Wine")
| train ratio | accuracy |
|------------:|---------:|
|         0.5 |   1.0000 |
|         0.6 |   0.9437 |
|         0.7 |   1.0000 |
|         0.8 |   0.9722 |
|         0.9 |   1.0000 |
plot_results(wine_acc, "Accuracy", "Wine")
ion_acc = get_nb_accuracy_tables(ion, ratios)
kable(ion_acc,digits=4,caption="Ionosphere")
| train ratio | accuracy |
|------------:|---------:|
|         0.5 |   0.8523 |
|         0.6 |   0.9286 |
|         0.7 |   0.8762 |
|         0.8 |   0.8857 |
|         0.9 |   0.8333 |
plot_results(ion_acc, "Accuracy", "Ionosphere")