Bayesian learning. Idea: under certain assumptions, any learning algorithm that minimizes the squared error between the output hypothesis' predictions and the training data will output a maximum likelihood hypothesis. What's good about the Bayesian approach is that it's more flexible: instead of eliminating a hypothesis outright, we simply give it a lower probability (penalizing it for every inconsistent example). It also allows us to make probabilistic predictions. Issues:
- it requires initial knowledge of many probabilities
- it has a significant computational cost
- P(h) - prior probability of hypothesis h (i.e. before we have observed the training data)
- P(D) - probability that the data D will be observed
- P(D|h) - probability of observing D given that h holds
- P(h|D) - posterior probability of h (i.e. after we have seen the data)

Bayes' theorem:
\[P(h|D) = \frac{P(D|h)P(h)}{P(D)}\] MAP hypothesis - the maximum a posteriori hypothesis, i.e. the most probable hypothesis \(h \in H\) given the observed data D: \[ h_{MAP} = \underset{h \in H}{argmax}\, P(h|D) = \underset{h \in H}{argmax}\, \frac{P(D|h)P(h)}{P(D)} = \underset{h \in H}{argmax}\, P(D|h)P(h)\] The last step drops \(P(D)\) because it does not depend on h.
If we assume no prior knowledge about our hypotheses, we treat them all as equally likely and assign them equal P(h). The P(h) term then becomes irrelevant and we can maximize just P(D|h). The resulting hypothesis is the ML - maximum likelihood - hypothesis.
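In the same notation: \[h_{ML} = \underset{h \in H}{argmax}\, P(D|h)\]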
The idea is very simple: enumerate all hypotheses, then for each hypothesis \(h \in H\) calculate the posterior probability and choose the hypothesis with the highest one. Assumptions:
- the training data D is noise-free
- the target concept is contained in H
- we have no a priori reason to prefer any hypothesis, so the prior P(h) is uniform over H
Then, by Bayes' theorem:
\[P(h|D) = 0\] if h is inconsistent with D, and \[P(h|D) = \frac{1}{|VS_{H,D}|}\] if h is consistent with D.
\(VS_{H,D}\) is the version space, i.e. the subset of hypotheses consistent with D, and \(|VS_{H,D}|\) is its size. Note that in this case every consistent hypothesis is in fact a MAP hypothesis. A learner that outputs a hypothesis consistent with the training data is called a consistent learner.
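As a quick illustration (a toy sketch, not from the notes above - the data and the threshold hypothesis space are made up), the brute-force learner can be written out directly: enumerate the hypotheses, score each one against the data, and apply Bayes' theorem with a uniform prior.

# Toy brute-force Bayes concept learner (illustrative sketch only)
# Hypothesis space: threshold classifiers "x >= t" for t = 1..5
x = c(1, 2, 3, 4, 5)
y = c(0, 0, 1, 1, 1)                                            # noise-free labels
H = lapply(1:5, function(t) function(v) as.integer(v >= t))
likelihood = sapply(H, function(h) as.numeric(all(h(x) == y)))  # P(D|h): 1 if h is consistent, else 0
prior = rep(1/length(H), length(H))                             # uniform P(h)
posterior = likelihood*prior/sum(likelihood*prior)              # Bayes theorem
posterior                                                       # 1/|VS| for consistent hypotheses, 0 otherwise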
The nice thing about the naive Bayes learner is that its performance is comparable to that of neural networks and decision tree learning. The Bayesian approach tells us that we should return the most probable target value \(v_{MAP}\) as the result: \[v_{MAP} = \underset{v_j \in V}{argmax}\, P(v_j | a_1, a_2, ..., a_n)\] where \(a_1,...,a_n\) are the attribute values of the new example. Applying Bayes' theorem we obtain: \[v_{MAP} = \underset{v_j \in V}{argmax}\, \frac{P(a_1, a_2, ..., a_n | v_j) P(v_j)}{P(a_1,a_2,...,a_n)} = \underset{v_j \in V}{argmax}\, P(a_1, a_2, ..., a_n | v_j) P(v_j)\] Estimating \(P(a_1, a_2, ..., a_n | v_j)\) directly is next to impossible (we would need to see every combination of attribute values many times). That is why in naive Bayes we assume that the attribute values are conditionally independent of each other given the target value. This lets us convert the hairy joint probability into a product of per-attribute probabilities: \[P(a_1, a_2, ..., a_n | v_j) = \prod_i{P(a_i|v_j)}\] Thus the naive Bayes classifier looks like this: \[v_{NB} = \underset{v_j \in V}{argmax}\, P(v_j) \prod_i{P(a_i|v_j)}\]
Learning in naive Bayes means estimating \(P(v_j)\) and \(P(a_i|v_j)\) from their frequencies in the training data.
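For example, the estimation itself boils down to relative frequencies (a rough sketch using a crude cut() binning instead of the chiM discretization used later):

data(iris)
p_vj = prop.table(table(iris$Species))                     # P(v_j): class priors
bins = cut(iris$Sepal.Length, breaks = 3)                  # crude discretization of one attribute
p_avj = prop.table(table(bins, iris$Species), margin = 2)  # P(a_i|v_j): per-class value probabilities
p_vj
p_avj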
Standard step - loading datasets.
data(iris)
# wine: move the class (first column) to the end and name it "target"
wine <- read.csv("~/Studies/Year3/DataMining/wine.data", header=FALSE)
wine = cbind(wine[,-1],wine[,1])
colnames(wine)[ncol(wine)] = "target"
# ionosphere: the class is already the last column
ion = read.table("~/Studies/Year3/Machine Learning/ionosphere.data",sep=",")
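A quick optional check of the loaded frames:

sapply(list(iris = iris, wine = wine, ion = ion), ncol)  # number of columns in each dataset
table(wine$target)                                       # class distribution of the wine data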
First we need to discretize our data so that we can apply the naive Bayes algorithm. This could be done manually, but there are nice packages that do it much more efficiently. However, we need not only to discretize the training data, but also a function that discretizes new examples according to the thresholds derived from the training data.
library(discretization)

# query - vector to be discretized
# thresholds - list of cut points for each attribute
discretize <- function(query, thresholds) {
  result = query
  for (i in 1:length(thresholds)) {
    # findInterval returns 0 below the first cut point, hence the +1
    result[i] = findInterval(query[i], thresholds[[i]]) + 1
  }
  result
}
irisD = chiM(iris)
th = irisD$cutp
t(apply(iris[1:4,1:4],1,discretize, th))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 3 1 1
## 2 1 2 1 1
## 3 1 2 1 1
## 4 1 2 1 1
The idea is to build a data structure that holds all the necessary probabilities. For each class we calculate its overall (prior) probability. Then, within each class, we calculate the probability of each value of each attribute.
get_prob_tables <- function(dset, target_column_index) {
  m = nrow(dset)
  n = ncol(dset)
  # split by target function value
  subsets = split(dset, dset[,target_column_index])
  results = list()
  for (i in 1:length(subsets)) {
    loc_m = nrow(subsets[[i]])
    # prior probability P(v_j) of this class
    pr = loc_m/m
    vpr = list()
    for (j in 1:(n-1)) {
      # frequency table of attribute j within this class -> P(a_i|v_j)
      df = data.frame(table(subsets[[i]][,j]))
      df$nc = df[,2]/loc_m
      vpr[[j]] = df
    }
    res = list(name = names(subsets[i]), p = pr, vpr = vpr)
    results[[i]] = res
  }
  names(results) = as.character(names(subsets))
  results
}
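For example (just to inspect the structure), on the discretized iris data from above:

pt = get_prob_tables(irisD$Disc.data, 5)
pt$setosa$p          # prior P(v_j) for the "setosa" class
pt$setosa$vpr[[1]]   # value / frequency / probability table for the first attribute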
# query - query to classify
# prtables - probability tables
# thresholds - thresholds for discretization
# m - number of virtual examples used to smooth the probability estimates
classify <- function(query, prtables, thresholds, m=1, discr = TRUE) {
  if (discr) {
    # discretize the query using the thresholds derived from the training data
    query = discretize(query, thresholds)
  }
  # now classify: for each class compute P(v_j) * prod_i P(a_i|v_j)
  answer = "None"
  max_prob = 0
  for (cl in prtables) {
    pvj = cl$p
    atrprobs = cl$vpr
    for (i in 1:length(atrprobs)) {
      freq_table = atrprobs[[i]]
      svec = freq_table[,1] == query[i]
      if (any(svec)) {
        freq = freq_table[svec,2]
      }
      else {
        # value never seen for this class in the training data
        freq = 0
      }
      # smoothed estimate of P(a_i|v_j) using m virtual examples
      p = (freq+1)/(sum(freq_table[,2])+m)
      pvj = pvj*p
    }
    if (pvj > max_prob) {
      max_prob = pvj
      answer = cl$name
    }
  }
  answer
}
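With the tables and thresholds built above, a single raw example can be classified like this (illustrative call):

classify(unlist(iris[1, 1:4]), pt, th)  # expected to return "setosa"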
library(caTools)
run_nb_experiment <- function(dset, train_ratio, discr=TRUE) {
  # split the dataset into train/test, stratified by the target column
  target_column_index = ncol(dset)
  target_column = dset[,target_column_index]
  split = sample.split(target_column, SplitRatio = train_ratio)
  train = subset(dset, split==T)
  test = subset(dset, split==F)
  testX = test[,-target_column_index]
  testy = test[,target_column_index]
  if (discr) {
    # discretize the training data; the cut points are reused for the test queries
    trainD = chiM(train)
    train = trainD$Disc.data
    th = trainD$cutp
  }
  else {
    th = NULL
  }
  # build probability tables
  pr_tables = get_prob_tables(train, target_column_index)
  predictions = apply(testX, 1, classify, pr_tables, th, 1, discr)
  # accuracy on the test set
  sum(predictions == testy)/length(testy)
}
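Before sweeping the split ratios, a single sanity run (the split is random, so the exact accuracy will vary):

set.seed(42)                  # illustrative seed, just for reproducibility
run_nb_experiment(iris, 0.7)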
# assumes that the last column is the target column
get_nb_accuracy_tables <- function(dset, ratios) {
  results = c()
  for (ratio in ratios) {
    results = rbind(results, c(ratio, run_nb_experiment(dset, ratio)))
  }
  results
}
ratios = seq(0.5,0.9,0.1)
iris_acc = get_nb_accuracy_tables(iris, ratios)
| train ratio | accuracy |
|------------:|---------:|
|         0.5 |   0.9600 |
|         0.6 |   0.8833 |
|         0.7 |   0.9111 |
|         0.8 |   0.9667 |
|         0.9 |   0.9333 |
wine_acc = get_nb_accuracy_tables(wine, ratios)
kable(wine_acc,digits=4,caption="Wine")
| train ratio | accuracy |
|------------:|---------:|
|         0.5 |   1.0000 |
|         0.6 |   0.9437 |
|         0.7 |   1.0000 |
|         0.8 |   0.9722 |
|         0.9 |   1.0000 |
plot_results(wine_acc, "Accuracy", "Wine")
ion_acc = get_nb_accuracy_tables(ion, ratios)
kable(ion_acc,digits=4,caption="Ionosphere")
| train ratio | accuracy |
|------------:|---------:|
|         0.5 |   0.8523 |
|         0.6 |   0.9286 |
|         0.7 |   0.8762 |
|         0.8 |   0.8857 |
|         0.9 |   0.8333 |
plot_results(ion_acc, "Accuracy", "Ionosphere")