Problem 0a: Preparing the dataset

Load dataset.

zoo <- read.csv("~/Studies/Year3/Machine Learning/zoo/zoo.data", 
                header=FALSE,
                stringsAsFactors=FALSE)
cnames = c("name", "hair", "feathers", "eggs","milk",
           "airborne","aquatic","predator","toothed",
           "backbone","breathes", "venomous", "fins", 
           "legs", "tail", "domestic","catsize", "type")
colnames(zoo) = cnames

Problem 0b: Formal task

Formal task setting:
Given:
Instances X: Possible animals described by the attributes.
1. animal name: String (100 possible values)
2. hair Boolean
3. feathers Boolean
4. eggs Boolean
5. milk Boolean
6. airborne Boolean
7. aquatic Boolean
8. predator Boolean
9. toothed Boolean
10. backbone Boolean
11. breathes Boolean
12. venomous Boolean
13. fins Boolean
14. legs Numeric (6 possible values: {0,2,4,5,6,8})
15. tail Boolean
16. domestic Boolean
17. catsize Boolean

Hypotheses H: Each hypothesis is a conjunction of constraints on the attributes described above. The constraints may be “?” or “N” (no value is acceptable), or a specific value.
Target concept c: AnimalType : X -> {0,1} (1 if belongs to the class, 0 otherwise) Training examples D: Positive and negative examples of the target function
Determine:
A hypothesis h in H such that h(x) = c(x) for all x in X

Basic idea of concept learning - search through a large space of hypotheses.

Number of possible instances:

excluding the name variable: (2^15) * 6 = 196608
including the name vairable: (2^15) * 6 * 100 = 19660800

Number of possible hypotheses:

excluding the name variable: (3^15) * 7 + 1 = 100442350
including the name variable: (3^15) * 7 * 101 + 1 = 10144677250

Problem 1: Find-S Algorithm

The first step is to define the procedure for the Find-S for just one class. So that we can later apply it to each class. The whole algorithm can be reduced to a simpler one: instead of iterating and replacing one value at a time we can just look at the whole column vector of the attirbute and then define the resulting hypothesis as follows:

set value in position “i” of the hypothesis to value v if the column i of the dataset has only this value v
set value in position “i” to “?” otherwise

# @param dset - data.frame or matrix, ONLY positive examples
library(plyr)
findS <- function(dset){
        apply(
                dset[,1:ncol(dset)-1], 
                2, 
                function(x) { 
                        if (length(unique(x)) > 1) "?" else unique(x)
                        })
}

Let’s perform classical test with the algorithm with the known result to see that it’s correct. First pick our training examples:

e1 = t(c("Sunny", "Warm", "Normal","Strong","Warm","Same",1))
e2 = t(c("Sunny", "Warm", "High","Strong","Warm","Same",1))
e3 = t(c("Rainy", "Cold", "High","Strong","Warm","Change",0))
e4 = t(c("Sunny", "Warm", "High","Strong","Cool","Change",1))

Since we are concerned only about the positve ones, take the subset of those that have “Yes” as the value of the target function:

weather_dset_positive_only = as.data.frame(
        rbind(e1,e2,e4),
        stringsAsFactors = FALSE)
weather_dset = as.data.frame(
        rbind(e1,e2,e3, e4), 
        stringsAsFactors = FALSE)

We expect to get {Sunny, Warm, ?, Strong, ?, ?}. Let’s check that:

weather_concept = findS(weather_dset_positive_only)
weather_concept

##       V1       V2       V3       V4       V5       V6 
##  "Sunny"   "Warm"      "?" "Strong"      "?"      "?"

Now we want to apply our findS to each class subset of the zoo dataset, i.e. we want to process 7 subsets, where each subset contains only the positive examples of it’s class. The niceset way to do it is to use ddply function from the plyr package that allows to apply function ‘classwise’:

h_set = ddply(zoo, .(type), findS)

Now output the resulting hypotheses with nice labels.

h_set$type <- factor(
        h_set$type, 
        levels = c(1,2,3,4,5,6,7), 
        labels = c(
                "mammal","bird", "reptilia",
                "fish","amphibias","insect", 
                "other" ))
h_set

##        type name hair feathers eggs milk airborne aquatic predator toothed
## 1    mammal    ?    ?        0    ?    1        ?       ?        ?       ?
## 2      bird    ?    0        1    1    0        ?       ?        ?       0
## 3  reptilia    ?    0        0    ?    0        0       ?        ?       ?
## 4      fish    ?    0        0    1    0        0       1        ?       1
## 5 amphibias    ?    0        0    1    0        0       1        ?       1
## 6    insect    ?    ?        0    1    0        ?       0        ?       0
## 7     other    ?    0        0    ?    0        0       ?        ?       0
##   backbone breathes venomous fins legs tail domestic catsize
## 1        1        1        0    ?    ?    ?        ?       ?
## 2        1        1        0    0    2    1        ?       ?
## 3        1        ?        ?    0    ?    1        0       ?
## 4        1        0        ?    1    0    1        ?       ?
## 5        1        1        ?    0    4    ?        0       0
## 6        0        1        ?    0    6    0        ?       0
## 7        0        ?        ?    0    ?    ?        0       ?

Problem 2: Candidate Elimination

In my implementation I use the idea that the earlier you fail the better. So, if we have the most specific boundary and we find any negative example that is inconsistent with this specific boundary, we conclude that the concept can’t be learned.

As a part of the experiment I tried the classical way (without separating into positive and negative) - the algorithms gives the same results but takes much longer to learns or to fail. E.g. for class 3 it has to generate more than 10⁵ hypotheses to finally exhaust the borders and fail. E.g. in the current implementation everything that can be learned is learned in less than 10 seconds while doing all the redundant steps can take literally minutes.

So I start with FindS to define the most specific hypothesis. It is equivalent to the situation when we have only the positive examples at the beginning and since the algorithms doesn’t specifiy it cannot happen we assume that it can (according to probability theory). So the findS results is basically the S boundary because:

it is consistent with all positive examples we’ve ‘seen’ (so basically all positive examples in the dataset)
minimal generalizations according to findS are added just to satisfy the new examples we see
the only member in G (which is all “?” at the beginning) in this case is by definition more general than anything we have in S. The border case is when our S collapses to all “?”, but this means that the concept is not teachable.

Another consequence is that our S boundary will always contain just one hypothesis (so we don’t have to prune it or anything). So we can treat it s a vector. This is just the implementation bonus. The other tricks I used also speed up the learning significantly.

Step 0: Simple preprocessing

Since we’re doing multiclass problem, we will have to do it step by step, i.e. for each class perform one-vs-all classification. preprocess function basically labels the examples. In fact, this labelling is unnecessary provided we use the suggested approad of finding the S boundary first.

# Prepare the dataset: convert to the format needed for the algorithm
# @param dset - matrix/data.frame, where the last column is the class id
# @param class_id - integer, the id of the class
# @return - a list of two matrices: positive and negative examples
preprocess <- function(dset,class_id) {
        dset = as.matrix(dset)
        svec = dset[,ncol(dset)]==class_id
        dset[svec,ncol(dset)] = 1
        dset[!svec,ncol(dset)] = 0
        positive = dset[svec,]
        negative = dset[!svec,]
        list(positive=positive,negative=negative)
}

Checking the consistency

To check if the example is consistent with the hypothesis I wrote a general function that checks if the hypothesis accepts the positive example and rejects the negative one. This little function is really important since it allows us to fail early and notice that there is something wrong with the instace space. This function is used to check if S is still valid.

# Check if the example is consistent with the hypothesis
# @param hypothesis - vector
# @param example - vector
consistent_he <- function(hypothesis, example) {
        true_label = example[length(example)]
        d = example[1:length(example)-1]
        satisfies = all(hypothesis == d | hypothesis == "?")
        TP = satisfies && true_label == 1
        TN = !satisfies && true_label == 0
        return(TP || TN)
}

Minimal specializations

This function finds the minimal specializations for the given negative example and the hypotheses that let him in. Although it’s possible in theory to generate all possible combinations consistent with the example etc, in fact we can significantly reduce computational time if we add wisely when generating new hypotheses. Here are my steps to get specializations:

find positions for which the folloing is true
- they cause the hypothesis from the general boundary lets the negative example in
- the s-hypothesis has some value in them and it’s not equal to the value in the negative example
stick the values from the s-hypothesis to these slots in the g-hypothesis one a time (creating a new hypothesis each time)

# Finds the minimal specializations
#@param S - matrix, specific boundary, assumes S has exactly 1 row
#@param Gn - matrix, hypotheses that don't reject the negative example, 
#            must contain at least 1 row
#@param d - row vector, example inconsistent with hypotheses from Gn
get_specializations <- function(S,Gn,d) {
        # trim label to allow two versions of d to be passed
        if (ncol(Gn)!=length(d)) {
                d = d[1:ncol(Gn)]
        }
        s = S[1,] #convert to vector
        n = nrow(Gn)
        result = Gn[F,]
        for (i in seq(1:n)) {
                g = Gn[i,] # get a vector
                # find positions to work with:
                indices = which((g=="?" | g==d) & (s != d & s != "?"))
                for (i in indices) {
                        nh = g
                        nh[i] = s[i]
                        result = rbind(result,nh)
                        }
        }
        return(result)
}

More general

This is a simple function for strict inequality of the hypothesis. I’m using the strick inequality so that when pruning and comparing the hypothesis to itself not to get conflicts.

# Checks if hypothesis1 is more general or equal to hypothesis2
# @param hypothesis1 - vector
# @param hypothesis2 - vector
# @return - boolean, true if hypothesis1 is more general than hypothesis2

more_general <- function(hypothesis1,hypothesis2) {
        all(
                hypothesis1[hypothesis1 != hypothesis2] == "?") && 
                any(hypothesis1 != hypothesis2)
}

Pruning the G boundary

I particularly love my implementation of this function since it really well-optimized and has no loops.

pruneG <- function(G) {
        selmatrix = t(apply(G, 1, function(h) {apply(G,1,more_general,h)}))
        selvec = !apply(selmatrix,1,any)
        subset(G,selvec)
}

Final implementation

This function combines everything together and allows us to finally apply the alrorithms to our dataset.

# Returns G and S boundaries or failure message
candidate_elimination <- function(dset, class_id) {
        # assumes that the last column is the label column
        dset = preprocess(dset,class_id)
        # find specific boundary
        S = as.matrix(t(findS(dset$positive)))
        # Now work with negative examples only
        G = matrix(data=rep("?",ncol(S)),nrow=1, byrow=TRUE)
        colnames(G) = colnames(S)
        n_negative = nrow(dset$negative)
        for (i in seq(1:n_negative)) {
                # check if the concept is teachable:
                # if the negative example consistent 
                # with the S boundary
                d = dset$negative[i,]
                if (consistent_he(S,d)) {
                        consistent_hypotheses = apply(G,1,consistent_he,d)
                        # select inconsistent ones
                        Gn = subset(G,!consistent_hypotheses)
                        # remove those that are inconsistent
                        G = subset(G, consistent_hypotheses)
                        if (nrow(Gn) > 0) {
                                Gn = get_specializations(S,Gn,d)
                        }
                        G = rbind(G,Gn)
                        G = pruneG(G)
                }
                # give up and go home
                else {
                        print(d)
                        message("Failed: concept can't be learned")
                        return('FAIL')
                }
        }
        return(list(G=(unique(G)),S=S))
}

Get the boundaries

To proceed further (to reconstructing the version space) it’s would be nice to at least look at the results we’ve got. Let’s take the easiest example - mammal (class 1).

result = candidate_elimination(zoo,1)
result

## $G
##    name hair feathers eggs milk airborne aquatic predator toothed backbone
## nh "?"  "?"  "?"      "?"  "1"  "?"      "?"     "?"      "?"     "?"     
##    breathes venomous fins legs tail domestic catsize
## nh "?"      "?"      "?"  "?"  "?"  "?"      "?"    
## 
## $S
##      name hair feathers eggs milk airborne aquatic predator toothed
## [1,] "?"  "?"  "0"      "?"  "1"  "?"      "?"     "?"      "?"    
##      backbone breathes venomous fins legs tail domestic catsize
## [1,] "1"      "1"      "0"      "?"  "?"  "?"  "?"      "?"

It’s quite easy to interpret this one. The S boundary represents the summary of all positive examples we’ve seen, so the non-“?” positions show the features that are true for all the class representatives in the sample, all “?” show the positions that are not specific for that class. I.e. a class 1 is a mamal and we can’t say it’s mammal if:

it doesn’t have feathers
gives milk
is backbone
breathes
not venomous

Any other features can be anything.

The G boundary is the summary of the negative examples we’ve seen that helps us to say that something doesn’t belong to the class. So we can say that none of the members of other classses give milk. So this hypothesis is consistent with the negative examples.

Reconstructing the version space

build_version_space <- function(S,G) {
        s = S[1,] # vector
        n = nrow(G)
        # for each hypothesis in the G boundary
        V = S[F,]
        for (i in seq(1:n)) {
                g = G[i,]
                pos = which(g!=s & s!="?")
                m = length(pos)
                for (j in seq(1:(m-1))) {
                        combs = combn(pos,j)
                        for (k in seq(1:ncol(combs))) {
                                nh = s
                                selvec = combs[,k]
                                nh[selvec] = "?"
                                V = rbind(V,nh)
                        }
                }
        }
        return(unique(V))
}

Putting everything together

Now we can get the final answer, by first getting the boundaries and then reconstructing the version space for class 1.

class_id = 1
res = candidate_elimination(zoo,class_id)
V = build_version_space(res$S,res$G)
results = list(G= res$G, S = res$S, V = V)
V

##    name hair feathers eggs milk airborne aquatic predator toothed backbone
## nh "?"  "?"  "?"      "?"  "1"  "?"      "?"     "?"      "?"     "1"     
## nh "?"  "?"  "0"      "?"  "1"  "?"      "?"     "?"      "?"     "?"     
## nh "?"  "?"  "0"      "?"  "1"  "?"      "?"     "?"      "?"     "1"     
## nh "?"  "?"  "0"      "?"  "1"  "?"      "?"     "?"      "?"     "1"     
## nh "?"  "?"  "?"      "?"  "1"  "?"      "?"     "?"      "?"     "?"     
## nh "?"  "?"  "?"      "?"  "1"  "?"      "?"     "?"      "?"     "1"     
## nh "?"  "?"  "?"      "?"  "1"  "?"      "?"     "?"      "?"     "1"     
## nh "?"  "?"  "0"      "?"  "1"  "?"      "?"     "?"      "?"     "?"     
## nh "?"  "?"  "0"      "?"  "1"  "?"      "?"     "?"      "?"     "?"     
## nh "?"  "?"  "0"      "?"  "1"  "?"      "?"     "?"      "?"     "1"     
## nh "?"  "?"  "?"      "?"  "1"  "?"      "?"     "?"      "?"     "?"     
## nh "?"  "?"  "?"      "?"  "1"  "?"      "?"     "?"      "?"     "?"     
## nh "?"  "?"  "?"      "?"  "1"  "?"      "?"     "?"      "?"     "1"     
## nh "?"  "?"  "0"      "?"  "1"  "?"      "?"     "?"      "?"     "?"     
##    breathes venomous fins legs tail domestic catsize
## nh "1"      "0"      "?"  "?"  "?"  "?"      "?"    
## nh "1"      "0"      "?"  "?"  "?"  "?"      "?"    
## nh "?"      "0"      "?"  "?"  "?"  "?"      "?"    
## nh "1"      "?"      "?"  "?"  "?"  "?"      "?"    
## nh "1"      "0"      "?"  "?"  "?"  "?"      "?"    
## nh "?"      "0"      "?"  "?"  "?"  "?"      "?"    
## nh "1"      "?"      "?"  "?"  "?"  "?"      "?"    
## nh "?"      "0"      "?"  "?"  "?"  "?"      "?"    
## nh "1"      "?"      "?"  "?"  "?"  "?"      "?"    
## nh "?"      "?"      "?"  "?"  "?"  "?"      "?"    
## nh "?"      "0"      "?"  "?"  "?"  "?"      "?"    
## nh "1"      "?"      "?"  "?"  "?"  "?"      "?"    
## nh "?"      "?"      "?"  "?"  "?"  "?"      "?"    
## nh "?"      "?"      "?"  "?"  "?"  "?"      "?"

Classification test

belongs_to_class <- function(VS, example) {
        G = VS$G
        s = VS$S[1,]
        V = VS$V
        # if the example is consistent with the S boundary
        # classify as class member
        if (consistent_he(s,example)) {
                return(list(member = TRUE, prob = 1))
        }
        # if the example is not consistent with any 
        # hypothesis in G boundary classify as non-member
        if (!any(apply(G,1,consistent_he,example))) {
                return(list(member = FALSE, prob = 1))
        }
        # if neighter S, no G boundary can explicitly classify the example
        # then perform voting on the versio space
        prob = sum(apply(V,1,consistent_he,example))/nrow(V)
        if (prob > 0.3) {
                return(list(member=TRUE,prob = prob))       
        }
        else {
                return(list(member=FALSE,prob = 1-prob))       
        }
}

Test classifier

Now check that our classifier works at least with our dataset and classifies correctly our classes.

# get boundaries
res = candidate_elimination(zoo,1)
# get version space 
V = build_version_space(res$S,res$G)
# combine
VS = list(G= res$G, S = res$S, V = V)
answer_matrix = c()
for (i in seq(1:nrow(zoo))) {
        d = zoo[i,]
        ans = belongs_to_class(VS,d)
        answer_matrix = rbind(
                answer_matrix,
                c(ans$member, zoo[i,ncol(zoo)]==1, ans$prob))
}
colnames(answer_matrix) = c("h","c","prob")
head(answer_matrix)

##      h c prob
## [1,] 1 1    1
## [2,] 1 1    1
## [3,] 0 0    1
## [4,] 1 1    1
## [5,] 1 1    1
## [6,] 1 1    1

Summary/Results

Candidate elimination algorithm works well if the data is consistent, i.e. there are no contradictory examples. It is very important to find a the right place to detect the failure. For example, for the zoo dataset, concept for class 3 can’t be learned. Here’s what happens:

candidate_elimination(zoo,3)

##     name     hair feathers     eggs     milk airborne  aquatic predator 
##   "newt"      "0"      "0"      "1"      "0"      "0"      "1"      "1" 
##  toothed backbone breathes venomous     fins     legs     tail domestic 
##      "1"      "1"      "1"      "0"      "0"      "4"      "1"      "0" 
##  catsize     type 
##      "0"      "0"

## Failed: concept can't be learned

## [1] "FAIL"

That means that our example is classified as positive by our specific boundary S, however it is a negative example. That means that hthe concept can’t be learned and we give up. The same happens for class 7.

candidate_elimination(zoo,7)

##     name     hair feathers     eggs     milk airborne  aquatic predator 
##   "flea"      "0"      "0"      "1"      "0"      "0"      "0"      "0" 
##  toothed backbone breathes venomous     fins     legs     tail domestic 
##      "0"      "0"      "1"      "0"      "0"      "6"      "0"      "0" 
##  catsize     type 
##      "0"      "0"

## Failed: concept can't be learned

## [1] "FAIL"

Another observation: the size of the version space typically depends on the distance between the boundaries which in turn depends on the number of the positive examples to learn from. For exampe, for class 5 has very few positive examples. As a result the S boundary is very specific with just few question marks, while the G-boundary is huge:

res = candidate_elimination(zoo,5)
V = build_version_space(res$S,res$G)
nrow(V)

## [1] 2515

While class 1 contains decent number of positive examples, so the borders are much tighter:

res = candidate_elimination(zoo,1)
V = build_version_space(res$S,res$G)
nrow(V)

## [1] 14

Find-S and Candidate Elimination

Anastasia Makarevich

18 October 2016