Implement a function that takes a categorical vector (in character or factor form) and calculates the entropy for that vector.
entropy <- function(input.vector){
# use the table function to count occurrences
t <- table(input.vector)
# normalize the table (now a probability)
p <- t / sum(t)
# calculate entropy
return(sum(-p * log(p, 2)))
}
Let’s take an example from the book to confirm this is working:
# lens24 data from Principles of Data Mining
cls <- c(3, 2, 3, 1, 3, 2, 3, 1, 3, 2, 3, 1, 3, 2, 3, 3, 3, 3, 3, 1, 3, 2, 3, 3)
age <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3)
spc <- c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2)
ast <- c(1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2)
trs <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2)
lens24 <- data.frame(age, spc, ast, trs, cls)
initial.entropy <- entropy(cls)
initial.entropy
## [1] 1.326088
We see that the value we we calculated of 1.3260875 matches the expected value of 1.3261
Implement a function that calculates the information gain of one categorical vector when partitioning according to another categorical vector.
info.gain <- function(partition.by, categories){
# Breakdown categories by partition.by
partition.table <- table(partition.by, categories)
# Count occurrences per partition
row.sums <- as.vector(rowSums(partition.table))
# Normalize rows
prob <- partition.table / row.sums
# Calculate entropy per row
ent <- -prob * log(prob, 2)
# we will define 0 * log(0) as 0
ent[is.na(ent)] <- 0
# Calculate information gain as
# original entropy of categories - weighted average of entropy partitioned data
info.gain <- entropy(categories) - sum(ent * (row.sums / sum(row.sums)))
return(info.gain)
}
Again, let’s check this against a known calculation:
df <- data.frame('age' = info.gain(age, cls),
'specRx' = info.gain(spc, cls),
'astig' = info.gain(ast, cls),
'tears' = info.gain(trs, cls))
df
## age specRx astig tears
## 1 0.0393965 0.03951084 0.3770052 0.5487949
From the book we expect the above values to be:
Which match our calculations.
(Programming) Implement a function that takes as its input a data frame of categorical variables, one of which is identified as the target variable, and outputs the following in list format:
entropy.partition <- function(df, target){
# extract column names
cols <- names(df)
# filter out target
candidates <- cols[!(cols == target)]
# calculate entropy gain when partitioning target by each candidate
gain <- NULL
for(col in candidates){
gain <- c(gain, info.gain(df[[col]], df[[target]]))
}
names(gain) <- candidates
# select the candidate that returns the max information gain
# in case of a tie, select the first candidate in the list
max.candidate <- names(which(gain == max(gain))[1])
return(list('max' = max.candidate,
'info.gain' = gain))
}
We can check the functionality of the above code with the same input we used in task 2:
entropy.partition(lens24, 'cls')
## $max
## [1] "trs"
##
## $info.gain
## age spc ast trs
## 0.03939650 0.03951084 0.37700523 0.54879494
The function returns a list with the first item containing the column name providing the highest information gain and the second item containing the information gain on the target column when partitioning according to the remaining columns.
# ------------------------------------------------------------------------------------
# Function predict.outcome
# ------------------------------------------------------------------------------------
# helper function to determine best prediction at leaf of decision tree
# based on the filtered target column
# Input: outcome.vector - vector of classifications
# threshold - confidence threshold (expressed as probability) that
# needs to be exceeded to use this as a bases
# for a prediction
# majority - prediction based on simple top level majority
# to be used if threshold is not exceeded
#
# Output: prediction
predict.outcome <- function(outcome.vector, threshold, majority){
# Predict class based on outcome.vector
# Determine possible outcomes and percentage occurrence
outcomes <- unique(outcome.vector)
outcome.counts <- rep(0, length(outcomes))
names(outcome.counts) <- as.character(outcomes)
for(item in names(outcome.counts)){
outcome.counts[item] <- sum(outcome.vector == item)
}
outcome.counts <- outcome.counts / sum(outcome.counts)
# if the prediction, based on max outcome.counts, is greater than the threshold
# return prediction based on max outcome counts
# otherwise, go with the simple majority
if(max(outcome.counts) >= threshold){
max.index <- which(outcome.counts == max(outcome.counts))
return(list('predict' = names(outcome.counts)[max.index],
'margin' = max(outcome.counts)))
} else{
return(majority)
}
}
# ------------------------------------------------------------------------------------
# Function generate.dt
# ------------------------------------------------------------------------------------
# create a set of decision rules to classify observations based on given dataset
# Input: df - data frame containing training data
# target - column in data frame with correct classification of observation
# threshold;
# majority - see function predict.outcome
# g - igraph object. Used as data structure
# parent.id - parent of current node
# split.by - value of parent node data at current node was split on
#
# Output: g - igraph object with decision rules
generate.dt <- function(df, target, # data frame and classification col
threshold, majority, # dealing with clashes
g, parent.id, split.by, # graph constuction
depth){
# calculate current entropy of target
current.entropy <- entropy(df[[target]])
# if entropy is 0 or we're down to the classification column
# we are at a leaf. Otherwise we need to branch
if(current.entropy == 0 | ncol(df) == 1){
outcome <- predict.outcome(df[[target]], threshold, majority)
# name of node = classification(s)
node.name <- paste(outcome$predict, round(outcome$margin, 2), sep="::")
node.type <- 'leaf'
} else{
# determine the next column to split by
node.name <- entropy.partition(df, target)[[1]]
node.type <- 'branch'
}
# add node
g <- g + vertex(node.name, type=node.type, set.size=nrow(df), depth=depth)
node.id <- length(V(g))
# add edge from parent to node (unless parent.id empty)
if(parent.id != ''){
# edge name is the value we are splitting by
g <- g + edge(c(parent.id, node.id), name=split.by)
}
# if we're at a branch
if(node.type == 'branch'){
# get branch values
unique.values <- unique(df[[node.name]])
# filter out
col.filter <- !(names(df) %in% node.name)
split.df <- df[ , col.filter, drop=FALSE]
for(value in unique.values){
new.df <- split.df[df[node.name] == value, , drop=FALSE]
g <- generate.dt(new.df, target,
threshold, majority,
g, node.id, value,
depth=depth + 1)
}
}
E(g)$label <- E(g)$name
return(g)
}
# small control function to create a decision tree
# most of the work is done in generate.dt
train.dt <- function(df, target, threshold){
# initialize graph
g <- graph.empty()
# calculate a simple majority classification rule
majority <- predict.outcome(df[[target]], 0, list())
# add these as graph attributes
g$target <- target
g$threshold <- threshold
g$majority <- majority
depth <- 1
g <- generate.dt(df, target,
threshold, majority,
g, '', '',
depth)
V(g)$label <- paste0(V(g)$name, " (", V(g)$set.size, ")")
return(g)
}
# ------------------------------------------------------------------------------------
# Function follow.dt
# ------------------------------------------------------------------------------------
# Input: obs - single observation
# g - igraph object with decision rules (created with generate.dt)
# current.node - place holder of current node. Starts at root (1)
# Output: string with prediction
follow.dt <- function(obs, g, current.node){
# if the curent node is a 'leaf' make a prediction
# just need to spit out extra support information
if(V(g)[current.node]$type == 'leaf'){
return(strsplit(V(g)[current.node]$name, split="::")[[1]][1])
}
# otherwise, move on to the next node
# get column to split on from name of current node
split.on <- obs[V(g)[current.node]$name]
# determine which edge to follow
for(edge in E(g)[from(current.node)]){
if(E(g)[edge]$name == split.on){
# we found the edge we match, follow to next node
next.node <- V(g)[to(E(g)[edge])]
return(follow.dt(obs, g, next.node))
}
}
}
# ------------------------------------------------------------------------------------
# Function predict.dt
# ------------------------------------------------------------------------------------
# control function that uses follow.dt to make predictions from a dataset
# Input: df - data frame. Should contain all arguments used in training dataset
# g - igraph object created with generate.dt
# Output: vector of predictions
predict.dt <- function(df, g){
# initialize output vector
outcome.vector <- c()
# start at the root node
root.node <- 1
# for each row (observation) in the dataset, use g to make a prediction
for(n in 1:nrow(df)){
obs <- df[n, ]
outcome.vector <- c(outcome.vector, follow.dt(obs, g, root.node))
}
return(outcome.vector)
}
# ------------------------------------------------------------------------------------
# Function path.constructor
# ------------------------------------------------------------------------------------
# return vertex and edge names from root to given node
#
# Input: to.node - this is the target vertex
# Output: node edge pairs from root up to, but not including target node
# in the form:
# list([node.name, outbound edge name], [], ...)
path.constructor <- function(g, target.node){
# get the edges from the root to the target node
p <- get.shortest.paths(g, from=1, to=target.node, output='epath')
# initialize return value
path.list <- list()
# for each edge in path, add to path.list c(origin node name, edge name)
for(e in p$epath[[1]]){
# end points of edge: [1] towared root, [2] away from root
end.points <- get.edge(g, e)
path.list <- c(path.list, list(c( V(g)[end.points[1]]$name, E(g)[e]$name )))
}
return(path.list)
}
# ------------------------------------------------------------------------------------
# Function node.filter
# ------------------------------------------------------------------------------------
# function: given node return df filter
# use df filter to return observations that would split to given node
node.filter <- function(g, target.node){
df.filter <- paste(lapply(path.constructor(g, target.node),
function(i) {paste0(i[1],
" == ",
"'", i[2], "'")}),
collapse=" & ")
return(df.filter)
}
# ------------------------------------------------------------------------------------
# Function fast.predict
# ------------------------------------------------------------------------------------
# can use the technique above for what I think will be faster predictons
fast.predict <- function(df, g){
df$prediction.column <- rep(NA, nrow(df))
for(leaf in V(g)[V(g)$type == 'leaf']){
predict <- strsplit(V(g)[leaf]$name, split="::")[[1]][1]
f <- node.filter(g, leaf)
attach(df)
df[eval(parse(text = f)), 'prediction.column'] <- predict
detach(df)
}
return(df$prediction.column)
}
# ------------------------------------------------------------------------------------
# Function prune.tree
# ------------------------------------------------------------------------------------
# prune a decision tree by comparing backed-up error rates of leaves to
# static error rate of parent
#
# Input: graph and pruning data frame
# Output: pruned tree
prune.tree <- function(g, df){
# calculate prune.static.error and
# prune.weight for all nodes in graph
for(ver in V(g)){
f <- node.filter(g, ver)
if(f == ""){
node.df <- df
} else{
node.df <- with(df, df[eval(parse(text = f)), ])
}
V(g)[ver]$prune.weight <- nrow(node.df)
node.prediction <- predict.outcome(node.df[[g$target]],
g$threshold, g$majority)$predict
st.err <- sum(node.prediction != node.df[[g$target]]) / V(g)[ver]$prune.weight
V(g)[ver]$prune.static.error <- st.err
V(g)[ver]$prune.predict <- node.prediction
}
# get all pruning candidates
leaf.parents <- V(g)[ nei( V(g)[which(V(g)$type == 'leaf')] ) ]
prune.targets <- c()
for(node in leaf.parents){
ds.neighbors <- V(g)[ nei(node, mode='out')]
if(all(V(g)[ nei(node, mode='out')]$type == 'leaf')){
prune.targets <- c(prune.targets, node)
}
}
# for each pruning candidate
for(n in prune.targets){
# calculate backed-up error
bkd.up.err <- 0
for(c in neighbors(g, n, mode='out')){
bkd.up.err <- bkd.up.err + (V(g)[c]$prune.weight * V(g)[c]$prune.static.error)
}
bkd.up.err <- bkd.up.err / V(g)[n]$prune.weight
# if static error < backed-up error prune
if(V(g)[n]$prune.static.error < bkd.up.err){
# prune
# delete leaves
delete.vertices(g, neighbors(g, n, mode='out'))
# Create new leaf
V(g)[n]$type <- 'leaf'
V(g)[n]$name <- paste(V(g)[n]$prune.predict,
round(V(g)[n]$prune.static.error, 2), sep="::")
}
}
return(g)
}
# control structure for prune.tree
prune.dt <- function(g, df){
delta <- 1
while(delta > 0){
starting.num <- vcount(g)
g <- prune.tree(g, df)
delta <- starting.num - vcount(g)
}
return(g)
}
pretty.plot <- function(g){
layout <- layout.reingold.tilford(g)
# rotate
layout <- layout[ , c(2, 1)]
# flip
layout[ ,1] <- max(layout[ ,1]) - layout[ ,1, drop=FALSE]
# stretch
layout[ ,1] <- layout[ ,1] * layout[ ,1]
plot(g, layout=layout, asp=0,
vertex.shape='rectangle', vertex.size=20, vertex.size2=10,
vertex.color='white', vertex.frame.color='white',
edge.label.color='red', edge.arrow.size=0.3)
}
Let’s try it out on an example from the book:
# Add a row that tests how the code deals with a clash
lens24 <- rbind(lens24, c(3, 2, 2, 2, 2))
target <- 'cls'
threshold <- 0.9
g <- train.dt(lens24, target, threshold)
pretty.plot(g)
Now let’s see what we have with the jury data:
jury.df <- read.csv(file = "jury-training-data.csv")
# remove incomplete cases (there is one line of NAs)
jury.df <- jury.df[complete.cases(jury.df), ]
threshold <- 0.6
target <- 'tendency'
g <- train.dt(jury.df, target, threshold)
pretty.plot(g)
Can we use this for prediction?
jury.public <- read.csv(file = "jury-learning-data-public.csv")
jury.public <- jury.public[complete.cases(jury.public), ]
obs <- jury.public[2, ]
print(obs)
## agegroup employment gender marital tendency
## 2 Older Adult Not Employed Female Married Not Guilty
current.node <- 1
print(follow.dt(obs, g, current.node))
## [1] "Not Guilty"
Yes!
Now let’s try the whole public data set
predictions <- predict.dt(jury.public, g)
# how did we do?
print(sum(jury.public$tendency == predictions) / nrow(jury.public))
## [1] 0.6470588
Let’s use the training set to see if there is an optimum threshold value
# This was run with values up to 1.0
threshold.values <- seq(from=0.51, to=0.70, by=0.01)
target <- 'tendency'
accuracy <- c()
for(threshold in threshold.values){
g <- train.dt(jury.df, target, threshold)
pr <- fast.predict(jury.public, g)
accuracy <- c(accuracy, sum(jury.public$tendency == pr) / nrow(jury.public))
}
plot(threshold.values, accuracy, type="l")
It seems that the optimal threshold value is 0.53 and below. This makes sense because 0.53 is the global majority. By setting our threshold value below this we take whatever accuracy gains are available from splitting up the data set. What we probably risk is overfitting. None the less, we will choose a value of 0.52 and recalibrate our tree.
threshold <- 0.52
g <- train.dt(jury.df, target, threshold)
pr <- fast.predict(jury.public, g)
print(sum(jury.public$tendency == pr) / nrow(jury.public))
## [1] 0.6806723
pretty.plot(g)
Now we can attempt to prune the tree using the post pruning technique of comparing the backed-up error rate with the static error rate.
pg <- prune.dt(g, jury.public)
pretty.plot(pg)
In this case we did not find any nodes that improved the accuracy of the predictions of the jury.public data set.
(Analysis) Using your custom functions above, build by hand a decision tree on the jury data contained in the file jury-training-data.csv. Document the final set of rules you come up with for the data set. You should end up with a series of if/then statements, one for each final branch of your tree. Be sure to indicate the support and the probability at the end of each branch. You need not worry about a Laplace correction at this point.
The rules are documented in the decision tree below:
Categories are shown in blue, with the number of training observations fitting the description shown in parenthesis. The values to split on are shown in red. The classification is shown at the leaves. It includes a confidence estimate.
We can also document the rules as follows (each statement corresponds to a leaf in the diagram above):
rules <- c()
for(leaf in V(g)[V(g)$type == 'leaf']){
f <- node.filter(g, leaf)
rule <- paste("if", f, "then class =", strsplit(V(g)[leaf]$name, split="::")[[1]][1])
rules <- c(rules, rule)
}
print(xtable(data.frame(rules=rules)), size="\\tiny")
% latex table generated in R 3.1.2 by xtable 1.7-4 package % Mon Mar 30 06:46:01 2015
Once you have built your decision tree, use it to classify all of the observations in the public and private testing data sets for the jury problem. Comment on how well your tree performs on the public data set.
For the public data set we have:
pr <- fast.predict(jury.public, g)
public.score <- sum(jury.public$tendency == pr) / nrow(jury.public)
print(public.score)
## [1] 0.6806723
An accuracy of 0.6806723 seems pretty reasonable for this data set.
If you “prune your tree” can you get better results on the testing data? Explain.
As discussed above, I implemented a pruning strategy but did not see any improvement.
# classify private set and save results in a csv file
jury.private <- read.csv(file = "jury-learning-data-public.csv")
jury.private <- jury.public[complete.cases(jury.public), ]
pr <- fast.predict(jury.private, g)
jury.private$predictions <- pr
write.csv(jury.private, "apalumbo-jury-private-predictions.csv")