Load dataset.
zoo <- read.csv("~/Studies/Year3/Machine Learning/zoo/zoo.data",
header=FALSE,
stringsAsFactors=FALSE)
cnames = c("name", "hair", "feathers", "eggs","milk",
"airborne","aquatic","predator","toothed",
"backbone","breathes", "venomous", "fins",
"legs", "tail", "domestic","catsize", "type")
colnames(zoo) = cnames
Formal task setting:
Given:
Instances X: Possible animals described by the attributes.
1. animal name: String (100 possible values)
2. hair Boolean
3. feathers Boolean
4. eggs Boolean
5. milk Boolean
6. airborne Boolean
7. aquatic Boolean
8. predator Boolean
9. toothed Boolean
10. backbone Boolean
11. breathes Boolean
12. venomous Boolean
13. fins Boolean
14. legs Numeric (6 possible values: {0,2,4,5,6,8})
15. tail Boolean
16. domestic Boolean
17. catsize Boolean
Hypotheses H: Each hypothesis is a conjunction of constraints on the attributes described above. The constraints may be “?” or “N” (no value is acceptable), or a specific value.
Target concept c: AnimalType : X -> {0,1} (1 if belongs to the class, 0 otherwise) Training examples D: Positive and negative examples of the target function
Determine:
A hypothesis h in H such that h(x) = c(x) for all x in X
Basic idea of concept learning - search through a large space of hypotheses.
Number of possible instances:
Number of possible hypotheses:
The first step is to define the procedure for the Find-S for just one class. So that we can later apply it to each class. The whole algorithm can be reduced to a simpler one: instead of iterating and replacing one value at a time we can just look at the whole column vector of the attirbute and then define the resulting hypothesis as follows:
# @param dset - data.frame or matrix, ONLY positive examples
library(plyr)
findS <- function(dset){
apply(
dset[,1:ncol(dset)-1],
2,
function(x) {
if (length(unique(x)) > 1) "?" else unique(x)
})
}
Let’s perform classical test with the algorithm with the known result to see that it’s correct. First pick our training examples:
e1 = t(c("Sunny", "Warm", "Normal","Strong","Warm","Same",1))
e2 = t(c("Sunny", "Warm", "High","Strong","Warm","Same",1))
e3 = t(c("Rainy", "Cold", "High","Strong","Warm","Change",0))
e4 = t(c("Sunny", "Warm", "High","Strong","Cool","Change",1))
Since we are concerned only about the positve ones, take the subset of those that have “Yes” as the value of the target function:
weather_dset_positive_only = as.data.frame(
rbind(e1,e2,e4),
stringsAsFactors = FALSE)
weather_dset = as.data.frame(
rbind(e1,e2,e3, e4),
stringsAsFactors = FALSE)
We expect to get {Sunny, Warm, ?, Strong, ?, ?}. Let’s check that:
weather_concept = findS(weather_dset_positive_only)
weather_concept
## V1 V2 V3 V4 V5 V6
## "Sunny" "Warm" "?" "Strong" "?" "?"
Now we want to apply our findS to each class subset of the zoo dataset, i.e. we want to process 7 subsets, where each subset contains only the positive examples of it’s class. The niceset way to do it is to use ddply function from the plyr package that allows to apply function ‘classwise’:
h_set = ddply(zoo, .(type), findS)
Now output the resulting hypotheses with nice labels.
h_set$type <- factor(
h_set$type,
levels = c(1,2,3,4,5,6,7),
labels = c(
"mammal","bird", "reptilia",
"fish","amphibias","insect",
"other" ))
h_set
## type name hair feathers eggs milk airborne aquatic predator toothed
## 1 mammal ? ? 0 ? 1 ? ? ? ?
## 2 bird ? 0 1 1 0 ? ? ? 0
## 3 reptilia ? 0 0 ? 0 0 ? ? ?
## 4 fish ? 0 0 1 0 0 1 ? 1
## 5 amphibias ? 0 0 1 0 0 1 ? 1
## 6 insect ? ? 0 1 0 ? 0 ? 0
## 7 other ? 0 0 ? 0 0 ? ? 0
## backbone breathes venomous fins legs tail domestic catsize
## 1 1 1 0 ? ? ? ? ?
## 2 1 1 0 0 2 1 ? ?
## 3 1 ? ? 0 ? 1 0 ?
## 4 1 0 ? 1 0 1 ? ?
## 5 1 1 ? 0 4 ? 0 0
## 6 0 1 ? 0 6 0 ? 0
## 7 0 ? ? 0 ? ? 0 ?
In my implementation I use the idea that the earlier you fail the better. So, if we have the most specific boundary and we find any negative example that is inconsistent with this specific boundary, we conclude that the concept can’t be learned.
As a part of the experiment I tried the classical way (without separating into positive and negative) - the algorithms gives the same results but takes much longer to learns or to fail. E.g. for class 3 it has to generate more than 105 hypotheses to finally exhaust the borders and fail. E.g. in the current implementation everything that can be learned is learned in less than 10 seconds while doing all the redundant steps can take literally minutes.
So I start with FindS to define the most specific hypothesis. It is equivalent to the situation when we have only the positive examples at the beginning and since the algorithms doesn’t specifiy it cannot happen we assume that it can (according to probability theory). So the findS results is basically the S boundary because:
Another consequence is that our S boundary will always contain just one hypothesis (so we don’t have to prune it or anything). So we can treat it s a vector. This is just the implementation bonus. The other tricks I used also speed up the learning significantly.
Since we’re doing multiclass problem, we will have to do it step by step, i.e. for each class perform one-vs-all classification. preprocess function basically labels the examples. In fact, this labelling is unnecessary provided we use the suggested approad of finding the S boundary first.
# Prepare the dataset: convert to the format needed for the algorithm
# @param dset - matrix/data.frame, where the last column is the class id
# @param class_id - integer, the id of the class
# @return - a list of two matrices: positive and negative examples
preprocess <- function(dset,class_id) {
dset = as.matrix(dset)
svec = dset[,ncol(dset)]==class_id
dset[svec,ncol(dset)] = 1
dset[!svec,ncol(dset)] = 0
positive = dset[svec,]
negative = dset[!svec,]
list(positive=positive,negative=negative)
}
To check if the example is consistent with the hypothesis I wrote a general function that checks if the hypothesis accepts the positive example and rejects the negative one. This little function is really important since it allows us to fail early and notice that there is something wrong with the instace space. This function is used to check if S is still valid.
# Check if the example is consistent with the hypothesis
# @param hypothesis - vector
# @param example - vector
consistent_he <- function(hypothesis, example) {
true_label = example[length(example)]
d = example[1:length(example)-1]
satisfies = all(hypothesis == d | hypothesis == "?")
TP = satisfies && true_label == 1
TN = !satisfies && true_label == 0
return(TP || TN)
}
This function finds the minimal specializations for the given negative example and the hypotheses that let him in. Although it’s possible in theory to generate all possible combinations consistent with the example etc, in fact we can significantly reduce computational time if we add wisely when generating new hypotheses. Here are my steps to get specializations:
# Finds the minimal specializations
#@param S - matrix, specific boundary, assumes S has exactly 1 row
#@param Gn - matrix, hypotheses that don't reject the negative example,
# must contain at least 1 row
#@param d - row vector, example inconsistent with hypotheses from Gn
get_specializations <- function(S,Gn,d) {
# trim label to allow two versions of d to be passed
if (ncol(Gn)!=length(d)) {
d = d[1:ncol(Gn)]
}
s = S[1,] #convert to vector
n = nrow(Gn)
result = Gn[F,]
for (i in seq(1:n)) {
g = Gn[i,] # get a vector
# find positions to work with:
indices = which((g=="?" | g==d) & (s != d & s != "?"))
for (i in indices) {
nh = g
nh[i] = s[i]
result = rbind(result,nh)
}
}
return(result)
}
This is a simple function for strict inequality of the hypothesis. I’m using the strick inequality so that when pruning and comparing the hypothesis to itself not to get conflicts.
# Checks if hypothesis1 is more general or equal to hypothesis2
# @param hypothesis1 - vector
# @param hypothesis2 - vector
# @return - boolean, true if hypothesis1 is more general than hypothesis2
more_general <- function(hypothesis1,hypothesis2) {
all(
hypothesis1[hypothesis1 != hypothesis2] == "?") &&
any(hypothesis1 != hypothesis2)
}
I particularly love my implementation of this function since it really well-optimized and has no loops.
pruneG <- function(G) {
selmatrix = t(apply(G, 1, function(h) {apply(G,1,more_general,h)}))
selvec = !apply(selmatrix,1,any)
subset(G,selvec)
}
This function combines everything together and allows us to finally apply the alrorithms to our dataset.
# Returns G and S boundaries or failure message
candidate_elimination <- function(dset, class_id) {
# assumes that the last column is the label column
dset = preprocess(dset,class_id)
# find specific boundary
S = as.matrix(t(findS(dset$positive)))
# Now work with negative examples only
G = matrix(data=rep("?",ncol(S)),nrow=1, byrow=TRUE)
colnames(G) = colnames(S)
n_negative = nrow(dset$negative)
for (i in seq(1:n_negative)) {
# check if the concept is teachable:
# if the negative example consistent
# with the S boundary
d = dset$negative[i,]
if (consistent_he(S,d)) {
consistent_hypotheses = apply(G,1,consistent_he,d)
# select inconsistent ones
Gn = subset(G,!consistent_hypotheses)
# remove those that are inconsistent
G = subset(G, consistent_hypotheses)
if (nrow(Gn) > 0) {
Gn = get_specializations(S,Gn,d)
}
G = rbind(G,Gn)
G = pruneG(G)
}
# give up and go home
else {
print(d)
message("Failed: concept can't be learned")
return('FAIL')
}
}
return(list(G=(unique(G)),S=S))
}
To proceed further (to reconstructing the version space) it’s would be nice to at least look at the results we’ve got. Let’s take the easiest example - mammal (class 1).
result = candidate_elimination(zoo,1)
result
## $G
## name hair feathers eggs milk airborne aquatic predator toothed backbone
## nh "?" "?" "?" "?" "1" "?" "?" "?" "?" "?"
## breathes venomous fins legs tail domestic catsize
## nh "?" "?" "?" "?" "?" "?" "?"
##
## $S
## name hair feathers eggs milk airborne aquatic predator toothed
## [1,] "?" "?" "0" "?" "1" "?" "?" "?" "?"
## backbone breathes venomous fins legs tail domestic catsize
## [1,] "1" "1" "0" "?" "?" "?" "?" "?"
It’s quite easy to interpret this one. The S boundary represents the summary of all positive examples we’ve seen, so the non-“?” positions show the features that are true for all the class representatives in the sample, all “?” show the positions that are not specific for that class. I.e. a class 1 is a mamal and we can’t say it’s mammal if:
Any other features can be anything.
The G boundary is the summary of the negative examples we’ve seen that helps us to say that something doesn’t belong to the class. So we can say that none of the members of other classses give milk. So this hypothesis is consistent with the negative examples.
build_version_space <- function(S,G) {
s = S[1,] # vector
n = nrow(G)
# for each hypothesis in the G boundary
V = S[F,]
for (i in seq(1:n)) {
g = G[i,]
pos = which(g!=s & s!="?")
m = length(pos)
for (j in seq(1:(m-1))) {
combs = combn(pos,j)
for (k in seq(1:ncol(combs))) {
nh = s
selvec = combs[,k]
nh[selvec] = "?"
V = rbind(V,nh)
}
}
}
return(unique(V))
}
Now we can get the final answer, by first getting the boundaries and then reconstructing the version space for class 1.
class_id = 1
res = candidate_elimination(zoo,class_id)
V = build_version_space(res$S,res$G)
results = list(G= res$G, S = res$S, V = V)
V
## name hair feathers eggs milk airborne aquatic predator toothed backbone
## nh "?" "?" "?" "?" "1" "?" "?" "?" "?" "1"
## nh "?" "?" "0" "?" "1" "?" "?" "?" "?" "?"
## nh "?" "?" "0" "?" "1" "?" "?" "?" "?" "1"
## nh "?" "?" "0" "?" "1" "?" "?" "?" "?" "1"
## nh "?" "?" "?" "?" "1" "?" "?" "?" "?" "?"
## nh "?" "?" "?" "?" "1" "?" "?" "?" "?" "1"
## nh "?" "?" "?" "?" "1" "?" "?" "?" "?" "1"
## nh "?" "?" "0" "?" "1" "?" "?" "?" "?" "?"
## nh "?" "?" "0" "?" "1" "?" "?" "?" "?" "?"
## nh "?" "?" "0" "?" "1" "?" "?" "?" "?" "1"
## nh "?" "?" "?" "?" "1" "?" "?" "?" "?" "?"
## nh "?" "?" "?" "?" "1" "?" "?" "?" "?" "?"
## nh "?" "?" "?" "?" "1" "?" "?" "?" "?" "1"
## nh "?" "?" "0" "?" "1" "?" "?" "?" "?" "?"
## breathes venomous fins legs tail domestic catsize
## nh "1" "0" "?" "?" "?" "?" "?"
## nh "1" "0" "?" "?" "?" "?" "?"
## nh "?" "0" "?" "?" "?" "?" "?"
## nh "1" "?" "?" "?" "?" "?" "?"
## nh "1" "0" "?" "?" "?" "?" "?"
## nh "?" "0" "?" "?" "?" "?" "?"
## nh "1" "?" "?" "?" "?" "?" "?"
## nh "?" "0" "?" "?" "?" "?" "?"
## nh "1" "?" "?" "?" "?" "?" "?"
## nh "?" "?" "?" "?" "?" "?" "?"
## nh "?" "0" "?" "?" "?" "?" "?"
## nh "1" "?" "?" "?" "?" "?" "?"
## nh "?" "?" "?" "?" "?" "?" "?"
## nh "?" "?" "?" "?" "?" "?" "?"
belongs_to_class <- function(VS, example) {
G = VS$G
s = VS$S[1,]
V = VS$V
# if the example is consistent with the S boundary
# classify as class member
if (consistent_he(s,example)) {
return(list(member = TRUE, prob = 1))
}
# if the example is not consistent with any
# hypothesis in G boundary classify as non-member
if (!any(apply(G,1,consistent_he,example))) {
return(list(member = FALSE, prob = 1))
}
# if neighter S, no G boundary can explicitly classify the example
# then perform voting on the versio space
prob = sum(apply(V,1,consistent_he,example))/nrow(V)
if (prob > 0.3) {
return(list(member=TRUE,prob = prob))
}
else {
return(list(member=FALSE,prob = 1-prob))
}
}
Now check that our classifier works at least with our dataset and classifies correctly our classes.
# get boundaries
res = candidate_elimination(zoo,1)
# get version space
V = build_version_space(res$S,res$G)
# combine
VS = list(G= res$G, S = res$S, V = V)
answer_matrix = c()
for (i in seq(1:nrow(zoo))) {
d = zoo[i,]
ans = belongs_to_class(VS,d)
answer_matrix = rbind(
answer_matrix,
c(ans$member, zoo[i,ncol(zoo)]==1, ans$prob))
}
colnames(answer_matrix) = c("h","c","prob")
head(answer_matrix)
## h c prob
## [1,] 1 1 1
## [2,] 1 1 1
## [3,] 0 0 1
## [4,] 1 1 1
## [5,] 1 1 1
## [6,] 1 1 1
Candidate elimination algorithm works well if the data is consistent, i.e. there are no contradictory examples. It is very important to find a the right place to detect the failure. For example, for the zoo dataset, concept for class 3 can’t be learned. Here’s what happens:
candidate_elimination(zoo,3)
## name hair feathers eggs milk airborne aquatic predator
## "newt" "0" "0" "1" "0" "0" "1" "1"
## toothed backbone breathes venomous fins legs tail domestic
## "1" "1" "1" "0" "0" "4" "1" "0"
## catsize type
## "0" "0"
## Failed: concept can't be learned
## [1] "FAIL"
That means that our example is classified as positive by our specific boundary S, however it is a negative example. That means that hthe concept can’t be learned and we give up. The same happens for class 7.
candidate_elimination(zoo,7)
## name hair feathers eggs milk airborne aquatic predator
## "flea" "0" "0" "1" "0" "0" "0" "0"
## toothed backbone breathes venomous fins legs tail domestic
## "0" "0" "1" "0" "0" "6" "0" "0"
## catsize type
## "0" "0"
## Failed: concept can't be learned
## [1] "FAIL"
Another observation: the size of the version space typically depends on the distance between the boundaries which in turn depends on the number of the positive examples to learn from. For exampe, for class 5 has very few positive examples. As a result the S boundary is very specific with just few question marks, while the G-boundary is huge:
res = candidate_elimination(zoo,5)
V = build_version_space(res$S,res$G)
nrow(V)
## [1] 2515
While class 1 contains decent number of positive examples, so the borders are much tighter:
res = candidate_elimination(zoo,1)
V = build_version_space(res$S,res$G)
nrow(V)
## [1] 14