STA 279 Lab 7
Complete all Questions.
The Goal
Today, we are focused on classification trees. We are going to deepen our understanding of how trees work, and we are going to fit and visualize tree models in R.
Formatting your lab
When you open your Markdown file, your first chunk likely looks like:
Change it to:
The Data Set
We are going to stick with our Federalist data, as that will allow us to compare our results from multinomial regression in Lab 6 to our results with a classification tree.
Let’s start off by loading our libraries. Most of these will look familiar, but a few are new. The new ones are the libraries that we need for classification trees.
REMEMBER: If you get an error that says your computer does not have or cannot find a package, go to the top of your screen (the right of your camera for my Apple folks) and choose Tools. From there, choose Install Packages, type in the name of the packages you need, and install!
# Load the libraries
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)
library(tidytext)
library(nnet)
library(tm)
# New - needed for trees
library(rpart)
library(rattle)
library(rpart.plot)
Once you have loaded the libraries, you can use the following code to load the data.
# Load the data
Federalist <- read.csv("https://www.dropbox.com/scl/fi/nche5o1oi5lag1kvoles5/LIWC_Federalist.csv?rlkey=0yvpqash13gcur6hkh7m7uwbl&st=3xntfjzh&dl=1")
# Convert to a data frame
Federalist <- data.frame(Federalist)[,-c(1,2,4)]
Federalist <- Federalist[-c(18:20),]
# Make sure author is treated as categorical
Federalist[, "author"] <- as.factor(Federalist[, "author"])Just as we did in Lab 6, our goal is to use the 118 features from LIWC features to (1) predict which author wrote each paper and (2) to identify and discuss key traits in the text that identify the writing of each author.
Today, we are going to do this using classification trees.
Considering Shape
When we fit a multinomial regression model, we are making the assumption that for numeric features, it makes sense to say things like “as word count increases by 1, the log relative risk of being written by Madison instead of Hamilton increases by .25.” This implies a linear change in the log relative risks as \(X\) increases. In other words, we assume the shape of the relationship between \(X\) and the log relative risk of Madison versus Hamilton and Jay versus Hamilton are both lines.
This assumption of shape is something that we need for basically all regression models. They assume a particular shape of the relationship between the feature and the left hand side of the model: \(Y\) for LSLR/MR models, the log odds for logistic, and the log relative risk for multinomial regression.
Let’s see if the linear shape assumption would be reasonable for this model. To teach R to check shape for multinomial regression, copy and paste the below code into a chunk and press play. DO NOT CHANGE ANYTHING IN THIS CHUNK!
multinomial_shape<- function(data, xvar , yvar, level, xname ){
# baseline
Y <- as.factor(data[,yvar])
baseline <- levels(Y)[1]
# Find the range for X
range <- c(min(data[,xvar]), max(data[,xvar]))
# Put 10% of the data in each bin
bins <- range[1]
for(i in 1:10){
binstop <- quantile(data[,xvar],.10*i)
bins <- c(bins, binstop)
}
bins <- c(bins, range[2])
# Count the number of bins
nbins <- length(bins) - 2
# Storage space for relative risk
RR.each <- rep(0,nbins)
# Storage space for log relative risk
log.RR.each <- rep(0,nbins)
xaxis <- rep(NA, nbins)
# Step 2: Obtain the Relative Risk of level vs. baseline
for(i in 1:nbins){
# Find which rows fall in this bin
scores.in <- which(data[,xvar]< bins[i+1] & data[,xvar] >= bins[i])
# Use the average X value in the bin for the x axis
xaxis[i] <- mean(data[scores.in, xvar])
# Count how many of these rows have the level
# you have in the numerator
numerator <- length(which(data[scores.in,yvar]==level))
# Count how many of these rows have the level
# you have in the denominator
denominator <- length(which(data[scores.in,yvar]==baseline))
#Compute the relative risk
RR.each[i] <- ifelse(numerator>0 & denominator>0,numerator/denominator,0)
# Compute the log relative risk
log.RR.each[i] <- log(RR.each[i])
}
RRInfo <- data.frame(xaxis, log.RR.each)
# Check to see if any bins had no values for our level
infinite = which(log.RR.each==-Inf)
# For these, we want to group these values with the bin before or after to try to not get infinity
# So, if bin 3 is the issue, we try combining with another bin
for(i in infinite){
# checkup is TRUE when the bin below is -Inf, missing, or does not exist
checkup <- !(i > 1 && is.finite(log.RR.each[i-1]))
# checkdown is TRUE when the bin above is -Inf (NA if it does not exist)
checkdown <- log.RR.each[i+1]==-Inf
if(is.na(checkdown) == TRUE & checkup==FALSE){
#There is no usable bin above, so we combine with the (finite) bin below
scores.in <- which(data[,xvar]< bins[i+1] & data[,xvar] >= bins[i-1])
xaxis[i-1] <- mean(data[scores.in, xvar])
xaxis[i] <- NA
# Count how many of these rows have the level
# you have in the numerator
numerator <- length(which(data[scores.in,yvar]==level))
# Count how many of these rows have the level
# you have in the denominator
denominator <- length(which(data[scores.in,yvar]==baseline))
#Compute the relative risk
RR.each[i-1] <- ifelse(numerator>0 & denominator>0,numerator/denominator,0)
# Compute the log relative risk
log.RR.each[i-1] <- log(RR.each[i-1])
RR.each[i] <- NA
log.RR.each[i] <- NA
}
if(is.na(checkdown) == FALSE){
if(checkdown==FALSE){
#This means we can combine with the one below
# Combine bin i with the bin below it
scores.in <- which(data[,xvar]< bins[i+1] & data[,xvar] >= bins[i-1])
xaxis[i-1] <- mean(data[scores.in, xvar])
xaxis[i] <- NA
numerator <- length(which(data[scores.in,yvar]==level))
# Count how many of these rows have the level
# you have in the denominator
denominator <- length(which(data[scores.in,yvar]==baseline))
#Compute the relative risk
RR.each[i-1] <- ifelse(numerator>0 & denominator>0,numerator/denominator,0)
# Compute the log relative risk
log.RR.each[i-1] <- log(RR.each[i-1])
RR.each[i] <- NA
log.RR.each[i] <- NA
}
if(checkdown==TRUE & checkup == FALSE){
#This means we can combine with the one above
scores.in <- which(data[,xvar]< bins[i+2] & data[,xvar] >= bins[i])
xaxis[i+1] <- mean(data[scores.in, xvar])
xaxis[i] <- NA
numerator <- length(which(data[scores.in,yvar]==level))
# Count how many of these rows have the level
# you have in the denominator
denominator <- length(which(data[scores.in,yvar]==baseline))
#Compute the relative risk
RR.each[i+1] <- ifelse(numerator>0 & denominator>0,numerator/denominator,0)
# Compute the log relative risk
log.RR.each[i+1] <- log(RR.each[i+1])
RR.each[i] <- NA
log.RR.each[i] <- NA
}
if(checkdown==TRUE & checkup == TRUE){
#In this case, we can't combine, and we leave this blank
RR.each[i] <- NA
log.RR.each[i] <- NA
}
}
}
RRInfo <- data.frame(xaxis, log.RR.each)
RRInfo <- na.omit(RRInfo)
ggplot(RRInfo, aes(xaxis, log.RR.each)) + geom_point()+ labs(x= xname, y = paste("Log RR:", level, "versus", baseline)) + stat_smooth(formula = y~x, method = "lm", se = FALSE)
}
To use the function, we call multinomial_shape() with five arguments, where
- data = replace with the name of our data set.
- xvar = replace with the name of the column holding our X variable, in quotes.
- yvar = replace with the name of the column holding our Y variable, in quotes.
- level = replace with the level you want to test against the baseline, in quotes.
- xname = replace with what you would like to label the x axis, in quotes.
For example, to use the function to check shape for WPS
for Madison vs. Hamilton, we would use:
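# xname is just the x axis label, so any label you like is fine there
multinomial_shape(data = Federalist, xvar = "WPS", yvar = "author",
                  level = "Madison", xname = "Words per Sentence")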
Question 1
Use the function to check shape for prep (the percent
prepositions) for Madison vs. Hamilton. Hint: This should require only
two small changes to the code above.
Based on the plot, does it look reasonable to use a line to model the relationship between the percent prepositions and the log relative risk of Madison versus Hamilton?
If we decide a linear shape is not reasonable, we can use a transformation of \(X\), like a log transformation or a polynomial. However, we have 118 features. Checking shape and choosing a transformation that might work for all of them is going to be very difficult.
Instead of doing all of that, we might consider changing to a model that does not make a shape assumption…a model like classification trees.
Classification Tree: One Split
Trees do not rely on an assumption of shape to build a model because each step in the tree is a question. These questions are things like “If the word count is less than 400, is a paper more likely to have been written by Madison, Hamilton, or Jay?” This means that trees make no assumptions about the shape of the relationship between X and Y. Instead, trees assume that we can use the features to group the data into authors using a series of splitting rules.
To see this, let’s build a tree with only one split.
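# Fit a classification tree with only one split
tree_onesplit <- rpart(author ~ ., data = Federalist,
                       method = "class", maxdepth = 1)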
This looks a lot like the code we have been using all semester, with:
- author: our \(Y\) variable.
- .: means we use all other variables as features.
- method = "class": tells R to fit a classification tree.
- maxdepth = 1: tells R that right now we only want one splitting rule (so 2 leaves).
To visualize the tree, we can use:
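# One way to draw the tree: fancyRpartPlot() from the rattle library;
# rpart.plot(tree_onesplit) from the rpart.plot library also works
fancyRpartPlot(tree_onesplit, sub = "")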
Here, prep is the percent of the words in a paper that
are prepositions, meaning direction words (to, with, above, in, after,
before, etc.).
Question 2
If a paper has 22% of the words being prepositions, who do we predict wrote the paper?
Question 3
Why are no papers being predicted as Jay?
Question 4
What percent of papers have less than 18% of their words being prepositions?
Question 5
What percent of papers with at least 18% of their words being prepositions were written by (a) Madison, (b) Hamilton, and (c) Jay?
Right now, this tree is very clear. Only one feature is being used, and if we needed to chat with clients, we could easily explain how to read the tree, both for predictions and for explaining the relationships highlighted in the tree.
However…there were 118 different features to choose from. Why did we choose prepositions?? And why 18%??
Gini Index
When we grow a tree, our priority is stability in the leaves. A leaf is stable if there is one very dominant level of a categorical variable in the leaf.
Question 6
Our current tree has 2 leaves. Which leaf is more stable in our current tree: Leaf 1 (left) or Leaf 2 (right)?
We humans can assess stability by eye, but that doesn’t help the computer. Instead, we need some numeric way to assess stability or instability. We can then use this to determine which splitting rules to use to make the most stable leaf.
We measure instability using the Gini Index, which is a weighted average of the Gini Impurity Score in each leaf. We define the Gini Impurity Score as
\[G(Leaf~\ell ) = 1 - \left(\hat{p}_{\ell(H)}^2 +\hat{p}_{\ell(J)}^2 + \hat{p}_{\ell(M)}^2 \right)\] where
- \(\hat{p}_{\ell(H)}\) = proportion of papers in leaf \(\ell\) that were written by Hamilton.
- \(\hat{p}_{\ell(J)}\) = proportion of papers in leaf \(\ell\) that were written by Jay.
- \(\hat{p}_{\ell(M)}\) = proportion of papers in leaf \(\ell\) that were written by Madison.
If \(Y\) had a different number of levels, we would have a \(\hat{p}_{\ell(j)}^2\) term for each level \(j\) of \(Y\).
The Gini Impurity Score measures instability, meaning it measures how unstable a leaf is. This means that more stable leaves have smaller values of the Gini Impurity Score. For example, a perfectly pure leaf (all papers written by one author) has \(G = 1 - 1^2 = 0\), while a leaf split evenly among the three authors has \(G = 1 - 3(1/3)^2 = 2/3\).
Question 7
What is the Gini Impurity Score in Leaf 1?
Question 8
Should the Gini Impurity Score in Leaf 2 be higher, lower, or about the same as in Leaf 1? Explain without doing any calculations.
Once we have the Gini Impurity Score for each leaf, we compute the Gini Index as a weighted average of these impurity scores. The weights are the percent of the data in each leaf. For these data, this means
\[Gini = (\text{% Papers in Leaf 1}) \times G(Leaf~1 ) + (\text{% Papers in Leaf 2}) \times G(Leaf~2) \]
Question 9
What is the Gini Index for tree_onesplit? Show your work
(meaning you cannot use a function to do it for you, I need to see the
steps!)
If you would like to check your answer to Question 9, I have built you a function that can compute the Gini Index for a tree with one split. It relies on this function (which I did not write!). Copy and paste it into a chunk and press play. We will do nothing else with this function.
rpart_splits <- function(fit, digits = getOption("digits")) {
splits <- fit$splits
if (!is.null(splits)) {
ff <- fit$frame
is.leaf <- ff$var == "<leaf>"
n <- nrow(splits)
nn <- ff$ncompete + ff$nsurrogate + !is.leaf
ix <- cumsum(c(1L, nn))
ix_prim <- unlist(mapply(ix, ix + c(ff$ncompete, 0), FUN = seq, SIMPLIFY = F))
type <- rep.int("surrogate", n)
type[ix_prim[ix_prim <= n]] <- "primary"
type[ix[ix <= n]] <- "main"
left <- character(nrow(splits))
side <- splits[, 2L]
for (i in seq_along(left)) {
left[i] <- if (side[i] == -1L)
paste("<", format(signif(splits[i, 4L], digits)))
else if (side[i] == 1L)
paste(">=", format(signif(splits[i, 4L], digits)))
else {
catside <- fit$csplit[splits[i, 4L], 1:side[i]]
paste(c("L", "-", "R")[catside], collapse = "", sep = "")
}
}
cbind(data.frame(var = rownames(splits),
type = type,
node = rep(as.integer(row.names(ff)), times = nn),
ix = rep(seq_len(nrow(ff)), nn),
left = left),
as.data.frame(splits, row.names = F))
}
}
Next, copy and paste the function below into a chunk and press play. This is the function that actually computes the Gini Index.
gini <- function(tree,data){
variable <-rpart_splits(tree)[1,"var"]
value <-rpart_splits(tree)[1,"index"]
#print(length(rpart_splits(tree)[1,"left"]))
if(length(rpart_splits(tree)[1,"left"]) >0 ){
sign <- unlist(strsplit(rpart_splits(tree)[1,"left"], " "))[1]
if(sign == ">=" ){
Leaf <- ifelse( data[,variable] >= value, "Leaf 1", "Leaf 2")
}
if(sign == ">" ){
Leaf <- ifelse( data[,variable] > value, "Leaf 1", "Leaf 2")
}
if(sign == "<=" ){
Leaf <- ifelse( data[,variable] <= value, "Leaf 1", "Leaf 2")
}
if(sign == "<" ){
Leaf <- ifelse( data[,variable] < value, "Leaf 1", "Leaf 2")
}
weights <- data.frame(table(Leaf)/length(Leaf))
Leaf1 <- subset(data, Leaf== "Leaf 1")
Leaf2 <- subset(data, Leaf== "Leaf 2")
props1 <- data.frame(table(Leaf1$author)/nrow(Leaf1))
props2 <- data.frame(table(Leaf2$author)/nrow(Leaf2))
Impurity1 <- 1 -(sum(props1$Freq^2))
Impurity2 <- 1 -(sum(props2$Freq^2))
out <- weights$Freq[1]*Impurity1 + weights$Freq[2]*Impurity2
}
if(length(rpart_splits(tree)[1,"left"]) ==0){
out <- NULL
}
out
}
To use the function on our tree_onesplit, we use:
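gini(tree_onesplit, Federalist)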
NOTE: Your answer from the function and your answer to Question 9 might be slightly different due to rounding that takes place in the visualization. Don’t worry about that!!
Question 10
When we grow a classification tree, we choose each splitting rule to
minimize the Gini Index. Our tree uses prep, the percent of
prepositions. This means that there should be no other feature that we
could split on and get a lower Gini Index. Let’s check.
- Choose any other feature in the data set (just pick one). State the feature you chose.
- Grow a tree with one split using your chosen feature. Use the gini function to compute the Gini Index, and state the Gini Index for that tree.
- Is this Gini Index larger, smaller, or the same as the one we got from tree_onesplit with prep?
So, this is our tree with only one split. However, we restricted ourselves to one split for now only so we could dig into how trees work a little more deeply. In reality, we rarely stop after just one split!
Classification Tree 2: More Splits
When we grew our tree with one split, we used:
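tree_onesplit <- rpart(author ~ ., data = Federalist,
                       method = "class", maxdepth = 1)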
The maxdepth = 1 argument is the part that tells R to
stop after one split. If we remove it, the tree grows more!
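# With no maxdepth, the tree grows until a default stopping rule kicks in
tree_full <- rpart(author ~ ., data = Federalist, method = "class")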
Question 11
Make a plot to show tree_full.
Question 12
Which splitting rule gives us the largest reduction in the Gini Index in one split? Hint: Answering this should require no calculation!!
You will notice that this tree is not very large, even though we have a lot of features to choose from! Why did it stop growing???
Trees stop growing when they hit what is called a stopping rule. These are a series of rules that we choose to help keep trees from getting too large. Very large trees (meaning trees with many leaves) are hard to interpret and generally do not predict well on test data…which is usually the whole point of prediction.
I don’t see any stopping rules in our code, though. Where did we specify these rules?
It turns out that R has done this for us.
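# The stopping rules R used are stored with the fitted tree
tree_full$control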
## $minsplit
## [1] 20
##
## $minbucket
## [1] 7
##
## $cp
## [1] 0.01
##
## $maxcompete
## [1] 4
##
## $maxsurrogate
## [1] 5
##
## $usesurrogate
## [1] 2
##
## $surrogatestyle
## [1] 0
##
## $maxdepth
## [1] 30
##
## $xval
## [1] 10
See all of this output?? All of these are default stopping rules in R. What do they all mean?
Here are the important stopping rules:
- minsplit = the minimum number of rows that must exist in a leaf in order for a split to be attempted.
- minbucket = the minimum number of rows allowed in any leaf.
- cp = the percent improvement in the Gini Index we require in order to accept a split. Right now, this is .01 = 1%. This means that if we cannot improve the Gini Index by at least 1%, we do not split a leaf (meaning we stop growing).
- maxdepth = the maximum depth of any leaf, with the root counted as depth 0. Basically, this controls how many splitting rules we go through to get from the root to a leaf.
Having default stopping rules is great, but we do need to think about whether or not they make sense for our data. Suppose I had 3 million rows of data. Do we really want to allow leaves as small as 7 rows??? The tree would be huge!!
If we want to change one of these stopping rules when growing a tree,
we can adapt our tree growing code. For example, if we want to change
the maxdepth stopping rule to 5, we use:
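# tree_depth5 is just one possible name for this new tree
tree_depth5 <- rpart(author ~ ., data = Federalist,
                     method = "class", maxdepth = 5)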
Question 13
We only have 82 rows in our data set, and right now we are only allowed to keep splitting a leaf if it contains at least 20 rows (about 24% of the data). This seems like a lot.
Suppose we only want to require 10 rows in order to split. Adapt
tree_full to create a tree called tree_10 with
this change. Show a plot of this tree as the answer to this
question.
For the rest of the lab, use tree_10 for all
questions.
Our new tree shows a feature we haven’t talked about yet:
Analytic. According to the LIWC documentation:
“Analytic Thinking captures the degree to which people use words that suggest formal, logical, and hierarchical thinking patterns. People low in Analytical Thinking tend to write and think using language that is more intuitive and personal. Language scoring high in Analytic Thinking tends to be rewarded in academic settings and is correlated with things like grades and reasoning skills. Language scoring low in Analytic Thinking tends to be viewed as less cold and rigid, and more friendly and personable.” (Citation: https://www.liwc.app/help/liwc#Analytical-Thinking)
Question 14
If a Federalist paper has 17% prepositions (prep), 40 words per sentence (WPS), an analytic score of 90, and .2% of the words referring to money (money), which author will the tree predict wrote the paper?
Question 15
How stable is the prediction from Question 14? Explain.
Question 16
A historian asks you to use the tree to describe the writing traits that distinguish John Jay from the other two authors. Answer their question!
As we can see, classification trees have a few nice properties! Even though we started with 118 features, only 4 of them are actually used in our model. This makes it fairly easy to visualize and interpret the relationships in the data.
Predictions
Now that we have built and interpreted the tree, let’s use it for making predictions. We can (1) make predictions on the training data to see how well our model fits the data, and (2) make predictions on test data. Let’s start with the training data.
To make predictions using a tree, we use the code:
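# Predict the author of each paper in the training data
yhat_train <- predict(tree_10, newdata = Federalist, type = "class")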
To make our confusion matrix, we use the same code as we did for multinomial:
holder <- table("Prediction"=yhat_train,"Actual" = Federalist[,"author"])
# Label the rows
row.names(holder) <- c("Predicted: Hamilton", "Predicted: Jay", "Predicted: Madison")
# Label the columns
colnames(holder) <- c("True: Hamilton", "True: Jay", "True: Madison")
# Format the confusion matrix
knitr::kable(holder)
Question 17
Using the confusion matrix, explain to our historian client how well the model is doing at prediction on the training data using only 4 of the 118 features. You do NOT need to compute the accuracy or F1-score or anything like that - just discuss the confusion matrix.
In addition to making predictions on the training data, our client wants us to use our model to make predictions for the 15 Federalist papers with unknown authors. To read in these data, use the code below.
test <- read.csv("https://www.dropbox.com/scl/fi/45x2atyykounkdc811qmv/LIWC_Test.csv?rlkey=rcuxfme51l032cj1s16x3to4x&st=lvnp4z8i&dl=1")
test <- data.frame(test)
To make predictions on the test data, we use the code:
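# Predict the author of each of the 15 papers with unknown authors
yhat_test <- predict(tree_10, newdata = test, type = "class")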
Question 18
Use your tree to make predictions on the 15 papers in the test set.
Show your predictions neatly formatted using
knitr::kable.
Recall that our multinomial regression model in Lab 6 predicted Madison for every test paper except Paper 3, which it predicted to be Hamilton. To compare the predictions on the test data from the multinomial model (using 6 features) and the tree model (using 4 features), we can make a table:
yhat_multinomial <- c("Madison","Madison", "Hamilton", rep("Madison",12))
yhat <- data.frame(1:15, yhat_test)
yhat$multi <- yhat_multinomial
knitr::kable( yhat, col.names= c("Paper", "Tree","Multinomial"))
Question 19
Is there any difference in the predictions? Be specific.
As it is not known who wrote these papers, it is not possible to know which model is right. Typically, we might use a metric like AIC to guess which model would have more accurate predictions, but we cannot use AIC with trees. This means that assessing predictive ability on these papers is not really possible without labeled test data.
For your Data Analysis 2…I will give you test data to avoid this issue. We also learn how to create test data in STA 363 (shameless plug for a course I am teaching in Spring).
References
Data
The data come from https://github.com/nicholasjhorton/FederalistPapers, the GitHub repository of Dr. Nicholas J Horton. Citation: Horton, Nicholas J. Federalist Papers, Retrieved July 20, 2024 from https://github.com/nicholasjhorton/FederalistPapers.