Task 1: (Programming)

Implement a function that takes a categorical vector (in character or factor form) and calculates the entropy for that vector.

entropy <- function(input.vector){
  # use the table function to count occurrences
  t <- table(input.vector)
  # normalize the table (now a probability)
  p <- t / sum(t)
  # calculate entropy
  return(sum(-p * log(p, 2)))  
}

Let’s take an example from the book to confirm this is working:

# lens24 data from Principles of Data Mining
cls <- c(3, 2, 3, 1, 3, 2, 3, 1, 3, 2, 3, 1, 3, 2, 3, 3, 3, 3, 3, 1, 3, 2, 3, 3)
age <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3)
spc <- c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2)
ast <- c(1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2)
trs <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2)

lens24 <- data.frame(age, spc, ast, trs, cls)

initial.entropy <- entropy(cls)

initial.entropy
## [1] 1.326088

We see that the value we we calculated of 1.3260875 matches the expected value of 1.3261

Task 2: (Programming)

Implement a function that calculates the information gain of one categorical vector when partitioning according to another categorical vector.

info.gain <- function(partition.by, categories){
  # Breakdown categories by partition.by
  partition.table <- table(partition.by, categories)
  
  # Count occurrences per partition
  row.sums <- as.vector(rowSums(partition.table))
  
  # Normalize rows
  prob <- partition.table / row.sums
  
  # Calculate entropy per row
  ent <- -prob * log(prob, 2)
  # we will define 0 * log(0) as 0
  ent[is.na(ent)] <- 0
  
  # Calculate information gain as 
  # original entropy of categories - weighted average of entropy partitioned data
  info.gain <- entropy(categories) - sum(ent * (row.sums / sum(row.sums)))
  
  return(info.gain)
}

Again, let’s check this against a known calculation:

df <- data.frame('age'    = info.gain(age, cls),
                 'specRx' = info.gain(spc, cls),
                 'astig'  = info.gain(ast, cls),
                 'tears'  = info.gain(trs, cls))
df
##         age     specRx     astig     tears
## 1 0.0393965 0.03951084 0.3770052 0.5487949

From the book we expect the above values to be:

Which match our calculations.

Task 3:

(Programming) Implement a function that takes as its input a data frame of categorical variables, one of which is identified as the target variable, and outputs the following in list format:

entropy.partition <- function(df, target){
  # extract column names
  cols <- names(df)
  # filter out target
  candidates <- cols[!(cols == target)]
  
  # calculate entropy gain when partitioning target by each candidate
  gain <- NULL
  for(col in candidates){
    gain <- c(gain, info.gain(df[[col]], df[[target]]))
  }
  names(gain) <- candidates
  
  # select the candidate that returns the max information gain
  # in case of a tie, select the first candidate in the list
  max.candidate <- names(which(gain == max(gain))[1])
  
  return(list('max' = max.candidate,
              'info.gain' = gain))
}

We can check the functionality of the above code with the same input we used in task 2:

entropy.partition(lens24, 'cls')
## $max
## [1] "trs"
## 
## $info.gain
##        age        spc        ast        trs 
## 0.03939650 0.03951084 0.37700523 0.54879494

The function returns a list with the first item containing the column name providing the highest information gain and the second item containing the information gain on the target column when partitioning according to the remaining columns.

Task 4:

Decision Tree Functions: Training

# ------------------------------------------------------------------------------------
# Function predict.outcome
# ------------------------------------------------------------------------------------
# helper function to determine best prediction at leaf of decision tree
# based on the filtered target column

# Input: outcome.vector - vector of classifications
#        threshold      - confidence threshold (expressed as probability) that
#                         needs to be exceeded to use this as a bases
#                         for a prediction
#        majority       - prediction based on simple top level majority
#                         to be used if threshold is not exceeded
#
# Output: prediction

predict.outcome <- function(outcome.vector, threshold, majority){
  # Predict class based on outcome.vector
  
  # Determine possible outcomes and percentage occurrence
  outcomes <- unique(outcome.vector)
  outcome.counts <- rep(0, length(outcomes))
  names(outcome.counts) <- as.character(outcomes)
  for(item in names(outcome.counts)){
    outcome.counts[item] <- sum(outcome.vector == item)
  }
  outcome.counts <- outcome.counts / sum(outcome.counts)
  
  # if the prediction, based on max outcome.counts, is greater than the threshold
  # return prediction based on max outcome counts
  # otherwise, go with the simple majority
  if(max(outcome.counts) >= threshold){
    max.index <- which(outcome.counts == max(outcome.counts))
    return(list('predict' = names(outcome.counts)[max.index],
                'margin'  = max(outcome.counts)))
  } else{
    return(majority)
  }
}

# ------------------------------------------------------------------------------------
# Function generate.dt
# ------------------------------------------------------------------------------------
# create a set of decision rules to classify observations based on given dataset

# Input: df         - data frame containing training data
#        target     - column in data frame with correct classification of observation
#        threshold; 
#        majority   - see function predict.outcome
#        g          - igraph object. Used as data structure
#        parent.id  - parent of current node
#        split.by   - value of parent node data at current node was split on
#
# Output: g         - igraph object with decision rules

generate.dt <- function(df, target,                 # data frame and classification col
                        threshold, majority,        # dealing with clashes
                        g, parent.id, split.by,     # graph constuction
                        depth){    
  
  # calculate current entropy of target
  current.entropy <- entropy(df[[target]])
  
  # if entropy is 0 or we're down to the classification column
  # we are at a leaf. Otherwise we need to branch
  if(current.entropy == 0 | ncol(df) == 1){
    outcome <- predict.outcome(df[[target]], threshold, majority)
    # name of node = classification(s)
    node.name <- paste(outcome$predict, round(outcome$margin, 2), sep="::")
    node.type <- 'leaf'
  } else{
    # determine the next column to split by
    node.name <- entropy.partition(df, target)[[1]]
    node.type <- 'branch'
  }
  
  # add node
  g <- g + vertex(node.name, type=node.type, set.size=nrow(df), depth=depth)
  node.id <- length(V(g))
  # add edge from parent to node (unless parent.id empty)
  if(parent.id != ''){
    # edge name is the value we are splitting by
    g <- g + edge(c(parent.id, node.id), name=split.by)
  }
  
  # if we're at a branch
  if(node.type == 'branch'){
    # get branch values
    unique.values <- unique(df[[node.name]])
    # filter out 
    col.filter <- !(names(df) %in% node.name)
    split.df <- df[ , col.filter, drop=FALSE]
    for(value in unique.values){
      new.df <- split.df[df[node.name] == value, , drop=FALSE]
      g <- generate.dt(new.df, target, 
                       threshold, majority,
                       g, node.id, value,
                       depth=depth + 1)
    }
  }
  E(g)$label <- E(g)$name
  return(g)
}

# small control function to create a decision tree
# most of the work is done in generate.dt
train.dt <- function(df, target, threshold){
  # initialize graph
  g <- graph.empty()
  # calculate a simple majority classification rule
  majority <- predict.outcome(df[[target]], 0, list())
  
  # add these as graph attributes
  g$target <- target
  g$threshold <- threshold
  g$majority <- majority
  depth <- 1
  
  g <- generate.dt(df, target, 
                   threshold, majority,
                   g, '', '',
                   depth)
  
  V(g)$label <- paste0(V(g)$name, " (", V(g)$set.size, ")")
  return(g)
}

Decision Tree Functions: Predicting

# ------------------------------------------------------------------------------------
# Function follow.dt
# ------------------------------------------------------------------------------------
# Input: obs          - single observation
#        g            - igraph object with decision rules (created with generate.dt)
#        current.node - place holder of current node. Starts at root (1)
# Output: string with prediction

follow.dt <- function(obs, g, current.node){
  # if the curent node is a 'leaf' make a prediction
  # just need to spit out extra support information
  if(V(g)[current.node]$type == 'leaf'){
    return(strsplit(V(g)[current.node]$name, split="::")[[1]][1])
  }
  
  # otherwise, move on to the next node
  
  # get column to split on from name of current node
  split.on <- obs[V(g)[current.node]$name]
  # determine which edge to follow
  for(edge in E(g)[from(current.node)]){
    if(E(g)[edge]$name == split.on){
      # we found the edge we match, follow to next node
      next.node <- V(g)[to(E(g)[edge])]
      return(follow.dt(obs, g, next.node))
    }
  }
}

# ------------------------------------------------------------------------------------
# Function predict.dt
# ------------------------------------------------------------------------------------
# control function that uses follow.dt to make predictions from a dataset
# Input: df - data frame. Should contain all arguments used in training dataset
#        g  - igraph object created with generate.dt

# Output: vector of predictions

predict.dt <- function(df, g){
  # initialize output vector
  outcome.vector <- c()
  
  # start at the root node
  root.node <- 1
  
  # for each row (observation) in the dataset, use g to make a prediction
  for(n in 1:nrow(df)){
    obs <- df[n, ]
    outcome.vector <- c(outcome.vector, follow.dt(obs, g, root.node))
  }
  return(outcome.vector)
}

Decision Tree Functions: Pruning

# ------------------------------------------------------------------------------------
# Function path.constructor
# ------------------------------------------------------------------------------------
# return vertex and edge names from root to given node
#
# Input: to.node - this is the target vertex

# Output: node edge pairs from root up to, but not including target node
#         in the form:
#         list([node.name, outbound edge name], [], ...)

path.constructor <- function(g, target.node){
  # get the edges from the root to the target node
  p <- get.shortest.paths(g, from=1, to=target.node, output='epath')
  
  # initialize return value
  path.list <- list()
  
  # for each edge in path, add to path.list c(origin node name, edge name)
  for(e in p$epath[[1]]){
    # end points of edge: [1] towared root, [2] away from root
    end.points <- get.edge(g, e)
    path.list <- c(path.list, list(c( V(g)[end.points[1]]$name, E(g)[e]$name )))
  }
  return(path.list)
}
# ------------------------------------------------------------------------------------
# Function node.filter
# ------------------------------------------------------------------------------------
# function: given node return df filter
#           use df filter to return observations that would split to given node
node.filter <- function(g, target.node){
  df.filter <- paste(lapply(path.constructor(g, target.node),
                          function(i) {paste0(i[1],
                                              " == ",
                                              "'", i[2], "'")}),
                   collapse=" & ")
  return(df.filter)
}

# ------------------------------------------------------------------------------------
# Function fast.predict
# ------------------------------------------------------------------------------------
# can use the technique above for what I think will be faster predictons
fast.predict <- function(df, g){
  df$prediction.column <- rep(NA, nrow(df))
  for(leaf in V(g)[V(g)$type == 'leaf']){
    predict <- strsplit(V(g)[leaf]$name, split="::")[[1]][1]
    f <- node.filter(g, leaf)
    
    attach(df)
    df[eval(parse(text = f)), 'prediction.column'] <- predict
    detach(df)
  }
  return(df$prediction.column)
}

# ------------------------------------------------------------------------------------
# Function prune.tree
# ------------------------------------------------------------------------------------
# prune a decision tree by comparing backed-up error rates of leaves to
#      static error rate of parent
#
# Input: graph and pruning data frame

# Output: pruned tree

prune.tree <- function(g, df){  
  # calculate prune.static.error and
  #           prune.weight for all nodes in graph
  for(ver in V(g)){
    f <- node.filter(g, ver)
    if(f == ""){
      node.df <- df
    } else{
      node.df <- with(df, df[eval(parse(text = f)), ])
    }
    
    V(g)[ver]$prune.weight <- nrow(node.df)
    
    node.prediction <- predict.outcome(node.df[[g$target]], 
                                       g$threshold, g$majority)$predict
    st.err <- sum(node.prediction != node.df[[g$target]]) / V(g)[ver]$prune.weight
    
    V(g)[ver]$prune.static.error <- st.err
    V(g)[ver]$prune.predict <- node.prediction
  }
  
  # get all pruning candidates
  leaf.parents <- V(g)[ nei( V(g)[which(V(g)$type == 'leaf')] ) ]
  
  prune.targets <- c()
  for(node in leaf.parents){
    ds.neighbors <- V(g)[ nei(node, mode='out')]
    if(all(V(g)[ nei(node, mode='out')]$type == 'leaf')){
      prune.targets <- c(prune.targets, node)
    }
  }
  
  # for each pruning candidate
  for(n in prune.targets){
    # calculate backed-up error
    bkd.up.err <- 0
    for(c in neighbors(g, n, mode='out')){
      bkd.up.err <- bkd.up.err + (V(g)[c]$prune.weight * V(g)[c]$prune.static.error)
    }
    bkd.up.err <- bkd.up.err / V(g)[n]$prune.weight
    
    # if static error < backed-up error prune
    if(V(g)[n]$prune.static.error < bkd.up.err){
      # prune
      # delete leaves
      delete.vertices(g, neighbors(g, n, mode='out'))
      
      # Create new leaf
      V(g)[n]$type <- 'leaf'
      V(g)[n]$name <- paste(V(g)[n]$prune.predict, 
                            round(V(g)[n]$prune.static.error, 2), sep="::")
    }
  }
  return(g)
}

# control structure for prune.tree

prune.dt <- function(g, df){
  delta <- 1
  while(delta > 0){
    starting.num <- vcount(g)
    g <- prune.tree(g, df)
    delta <- starting.num - vcount(g)
  }
  return(g)
}

Plotting Functions

pretty.plot <- function(g){
  layout <- layout.reingold.tilford(g)
  # rotate
  layout <- layout[ , c(2, 1)]
  # flip
  layout[ ,1] <- max(layout[ ,1]) - layout[ ,1, drop=FALSE]
  # stretch
  layout[ ,1] <- layout[ ,1] * layout[ ,1]
  
  plot(g, layout=layout, asp=0,
       vertex.shape='rectangle', vertex.size=20, vertex.size2=10,
       vertex.color='white', vertex.frame.color='white',
       edge.label.color='red', edge.arrow.size=0.3)
}

Testing

Let’s try it out on an example from the book:

# Add a row that tests how the code deals with a clash
lens24 <- rbind(lens24, c(3, 2, 2, 2, 2))

target <- 'cls'
threshold <- 0.9
g <- train.dt(lens24, target, threshold)

pretty.plot(g)

Now let’s see what we have with the jury data:

jury.df <- read.csv(file = "jury-training-data.csv")

# remove incomplete cases (there is one line of NAs)
jury.df <- jury.df[complete.cases(jury.df), ]

threshold <- 0.6
target <- 'tendency'

g <- train.dt(jury.df, target, threshold)

pretty.plot(g)

Can we use this for prediction?

jury.public <- read.csv(file = "jury-learning-data-public.csv")
jury.public <- jury.public[complete.cases(jury.public), ]

obs <- jury.public[2, ]
print(obs)
##      agegroup   employment gender marital   tendency
## 2 Older Adult Not Employed Female Married Not Guilty
current.node <- 1

print(follow.dt(obs, g, current.node))
## [1] "Not Guilty"

Yes!

Now let’s try the whole public data set

predictions <- predict.dt(jury.public, g)

# how did  we do?
print(sum(jury.public$tendency == predictions) / nrow(jury.public))
## [1] 0.6470588

Optimization

Let’s use the training set to see if there is an optimum threshold value

# This was run with values up to 1.0
threshold.values <- seq(from=0.51, to=0.70, by=0.01)
target <- 'tendency'

accuracy <- c()

for(threshold in threshold.values){
  g <- train.dt(jury.df, target, threshold)
  pr <- fast.predict(jury.public, g)
  accuracy <- c(accuracy, sum(jury.public$tendency == pr) / nrow(jury.public))
}

plot(threshold.values, accuracy, type="l")

It seems that the optimal threshold value is 0.53 and below. This makes sense because 0.53 is the global majority. By setting our threshold value below this we take whatever accuracy gains are available from splitting up the data set. What we probably risk is overfitting. None the less, we will choose a value of 0.52 and recalibrate our tree.

threshold <- 0.52
g <- train.dt(jury.df, target, threshold)

pr <- fast.predict(jury.public, g)

print(sum(jury.public$tendency == pr) / nrow(jury.public))
## [1] 0.6806723
pretty.plot(g)

Prune Tree

Now we can attempt to prune the tree using the post pruning technique of comparing the backed-up error rate with the static error rate.

pg <- prune.dt(g, jury.public)
pretty.plot(pg)

In this case we did not find any nodes that improved the accuracy of the predictions of the jury.public data set.

Analysis

(Analysis) Using your custom functions above, build by hand a decision tree on the jury data contained in the file jury-training-data.csv. Document the final set of rules you come up with for the data set. You should end up with a series of if/then statements, one for each final branch of your tree. Be sure to indicate the support and the probability at the end of each branch. You need not worry about a Laplace correction at this point.

The rules are documented in the decision tree below:

Categories are shown in blue, with the number of training observations fitting the description shown in parenthesis. The values to split on are shown in red. The classification is shown at the leaves. It includes a confidence estimate.

We can also document the rules as follows (each statement corresponds to a leaf in the diagram above):

rules <- c()
for(leaf in V(g)[V(g)$type == 'leaf']){
  f <- node.filter(g, leaf)
  rule <- paste("if", f, "then class =", strsplit(V(g)[leaf]$name, split="::")[[1]][1])
  rules <- c(rules, rule)
}

print(xtable(data.frame(rules=rules)), size="\\tiny")
% latex table generated in R 3.1.2 by xtable 1.7-4 package % Mon Mar 30 06:46:01 2015

Once you have built your decision tree, use it to classify all of the observations in the public and private testing data sets for the jury problem. Comment on how well your tree performs on the public data set.

For the public data set we have:

pr <- fast.predict(jury.public, g)

public.score <- sum(jury.public$tendency == pr) / nrow(jury.public)

print(public.score)
## [1] 0.6806723

An accuracy of 0.6806723 seems pretty reasonable for this data set.

If you “prune your tree” can you get better results on the testing data? Explain.

As discussed above, I implemented a pruning strategy but did not see any improvement.

# classify private set and save results in a csv file

jury.private <- read.csv(file = "jury-learning-data-public.csv")
jury.private <- jury.public[complete.cases(jury.public), ]

pr <- fast.predict(jury.private, g)

jury.private$predictions <- pr

write.csv(jury.private, "apalumbo-jury-private-predictions.csv")