STA 279 Lab 4

Complete all Questions.

The Goal

In our last class, we talked about TF-IDF, which is a tool we can use if we want to figure out what words characterize certain types of text. Today, we are going to see an application of how we can use this in practice.

The Data Set

The Federalist papers are a famous series of articles written in the 1700s in the early days of the United States. They debated and proposed ideas for the new government. Three famous authors of these papers were Alexander Hamilton, James Madison, and John Jay.

Most of the Federalist papers are known to have been written by Hamilton, Madison, or Jay, but 12 of the papers were submitted anonymously. Hamilton later claimed that he had written these 12 papers. Since then, many scholars have used text analysis to work out which of the three authors wrote the anonymous papers.

Our goal for today is to see if we can use the text analysis skills we have learned so far to determine who wrote the mystery papers.

To read in the data on the \(n = 85\) Federalist papers with known authors, use the following code:

# Load the data
Federalist <- read.csv("https://www.dropbox.com/scl/fi/5hzqwsvlnym5u1mhmn0jy/Federalist.csv?rlkey=55x0p9fl02zls9ixxek64vlqy&st=isjnztof&dl=1")

# Convert to a data frame
Federalist <- data.frame(Federalist)

# Make sure author is treated as categorical
Federalist[, "author"] <- as.factor(Federalist[, "author"])

The columns are:

paper: the number given to the paper; think of this like an identifier for the article.
text: the text of the entire paper.
author: either Hamilton or Madison or Jay

Once you have loaded the data, load the packages you will need for this lab:

library(tidytext)
library(tidyr)
library(dplyr)
library(ggplot2)
library(tm)
library(naivebayes)

# NEW 
library(stringr)
library(forcats)

Modeling: Naive Bayes

So far, the only model we have learned for text data is Naive Bayes, though that will change very soon! This means we will be using Naive Bayes to predict the author for the 12 mystery papers after using the 85 papers with known authors to train our model.

To train a Naive Bayes model, we need features. We have been exploring techniques like bag of words models that use words for features, so let’s use an approach like that today. We learned that a key is to choose words that will help us determine which author wrote each paper, and we can use the presence of these words in each paper as features to predict author.

It may seem natural to use frequency as a way to determine which words we should use as features. In other words, we could count the number of times each word occurs in a text (in this case a paper) and choose the words that are most frequent for each author as our features. However, it turns out that this is not the best approach. Let’s see why.

Suppose we want to find the top 10 words in Madison’s articles after removing stop words. I’ve given you a skeleton of the code you need below:

top10_Madison <- Federalist |>
  filter(author == ...) |>
  unnest_tokens(word,... ) |>
  anti_join(stop_words, by = join_by(word)) |>
  count( ...   ) |>
  slice_max(n , n = ... )

Question 1

Complete the code above by filling in the … and annotate the code.

Note: You will notice I used anti_join(stop_words, by = join_by(word)) rather than anti_join(stop_words). Both versions do exactly the same thing. The only difference is that some of you have noticed you get an annoying message along the lines of Joining with by = join_by(word) when you run anti_join(stop_words). The adaptation I have in the code above just removes that message.
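Here is a small sketch you can run to convince yourself that the two versions of anti_join keep the same rows. The data frame toy_words is made up for illustration and is not part of the lab data.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: one word per row
toy_words <- tibble(word = c("the", "constitution", "of", "militia"))

# Without `by`, dplyr picks the shared column and reports which one it used
kept_default <- toy_words |>
  anti_join(stop_words)

# Naming the column explicitly returns the same rows without the message
kept_explicit <- toy_words |>
  anti_join(stop_words, by = join_by(word))

# TRUE if both approaches kept exactly the same rows
identical(kept_default, kept_explicit)
```

Being explicit about the join column is also a nice habit because it documents which column the stop words are matched on.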

Question 2

Create and show a plot to show the top 10 words in Madison’s texts. Make sure your plot is well formatted and well labelled.

Hint: Here is a skeleton code to get you started:

ggplot( ... , aes(n, fct_reorder( ... , ... ) )) +
  geom_...() +
  labs( x = ... , y = ..., title = ...)

This shows us potential words to use as features for Madison, but we also need words for Hamilton and Jay. We could repeat the process 3 times, but it turns out we actually don’t have to. As we saw in class, we can use grouping in R to find the top 10 words for each author in a single block of code, without having to repeat ourselves! Skeleton code for this is included below.

top10_all <- Federalist |>
  unnest_tokens(word,... ) |>
  anti_join(stop_words, by = join_by(word)) |>
  count( ...   ) |>
  group_by( ... ) |>
  slice_max(n , n = ... )

Question 3

Complete the code above. As the answer to this question, state the most frequent word in (a) Hamilton’s papers and (b) Jay’s papers.

We can also create a plot to compare the authors using the following code.

ggplot( top10_all , aes(n, reorder_within( word , n, author),fill =author)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~author,ncol = 2, scales = "free_y") +
  scale_y_reordered() + 
  labs( x = ... , y = ..., title = ...)

You will note two new things in this plotting code.

reorder_within: We use fct_reorder when we want to make sure that the words appear in order on our bar graph. However, because we now have three different authors, we actually need to make sure that the words appear in order for all authors. reorder_within allows us to include author in our ordering so that happens!
facet_wrap(): This is what allows us to make one plot for each group. In this case, facet_wrap(~author) allows us to make one plot for each author.
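To see these two pieces working together on something small, here is a sketch using made-up counts. The data frame toy_counts, its groups "A" and "B", and the axis labels are all hypothetical and only meant to show the mechanics.

```r
library(ggplot2)
library(tidytext)

# Hypothetical toy counts: the same word can appear in more than one group
toy_counts <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  word  = c("power", "union", "states", "union", "treaty", "power"),
  n     = c(9, 5, 2, 3, 8, 6)
)

toy_plot <- ggplot(toy_counts,
                   aes(n, reorder_within(word, n, group), fill = group)) +
  geom_col(show.legend = FALSE) +
  # One panel per group, each with its own y-axis
  facet_wrap(~group, scales = "free_y") +
  # Cleans up the axis labels that reorder_within() created
  scale_y_reordered() +
  labs(x = "Count", y = "Word", title = "Toy example: words ordered within each group")

toy_plot
```

Try deleting the scale_y_reordered() line and re-running. You should see that reorder_within tacks the group name onto each word so it can order them separately in each panel, and scale_y_reordered is what tidies those labels back up.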

Question 4

Finish the code above to create the plot to compare the top 10 words in Madison’s, Hamilton’s, and Jay’s texts. Show the plot.

Question 5

Are there any words that show up in the top 10 list for more than one author? In other words, are there words that show up in more than one of the plots in Question 4?

The goal is for us to find words we can use as features in a model to predict author. For this purpose, it is helpful if we can find words that help us separate the writing of the three authors, meaning finding words that tend to be used commonly by one author but not the others. A word that is used often by multiple authors is not useful for distinguishing among these authors for prediction.

All of this means that frequency alone is not enough. Instead, we need something that will help us find words that are commonly used by an author, but not commonly used by other authors. This is exactly the set up that motivates TF-IDF.

TF-IDF

The TF-IDF is a number that is high if a word is commonly used by an author but not commonly used by other authors. In other words, words with high TF-IDF scores fulfill the properties that we want in features for our Naive Bayes model to predict author.

As we learned in class, the TF-IDF score has two components: the TF score and IDF score.

Question 6

What does the TF score measure? In other words, briefly explain what having a high TF score tells us about a word.

Recall that the TF of word \(i\) in text group \(j\) is defined as:

\[TF_{i,j} = \frac{\text{Number of times word i appears in group j}}{\text{Total Number of words in group j}}\]
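As a quick sketch of the arithmetic, here is the TF formula applied to made-up numbers. The counts below are invented for illustration and do not come from the Federalist data.

```r
# Made-up counts for illustration only
times_word_appears   <- 12    # times the word appears in the group
total_words_in_group <- 4800  # total number of words in the group

# TF = (times the word appears in the group) / (total words in the group)
toy_tf <- times_word_appears / total_words_in_group
toy_tf
```

With these invented numbers the TF is 0.0025, meaning the word makes up 0.25% of all the words in that group.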

Question 7

How many groups of text are we working with today?

Note: This should not be a large number.

Question 8

The word “will” occurs 703 times in Hamilton’s papers, and there are 114321 words in total in Hamilton’s papers. Based on this, state and interpret the TF of the word “will” in Hamilton’s papers.

Once we have the TF score, the next step is to compute the IDF score.

Question 9

What does the IDF score measure? In other words, briefly explain what having a high IDF score tells us about a word.

Question 10

For these data, there are only three possible values for the IDF. State what those numeric values are and in what situations we would use each. In other words, in what situation would we get each of the 3 different values of the IDF?

Question 11

The word “will” occurs 247 times in Madison’s papers and 105 times in Jay’s papers. Based on this, what is the IDF of “will”?

The TF-IDF is computed by multiplying the TF and the IDF together.

Question 12

What is the TF-IDF for the word “will” in Hamilton’s papers? Based on this, is “will” going to be a useful word in identifying Hamilton’s papers from Madison’s or Jay’s? Explain.

Computing TF-IDF in R

Now that we know the TF-IDF will be useful for finding words we can use as features in our model, and we have reviewed how the TF-IDF is computed, the next step is to compute the TF-IDF for every word across the 85 Federalist papers. We then choose the words with the highest TF-IDF for each author as our features.

Given that there are over 14000 unique words in the papers, this would take a while. Luckily, R has one nice function that we can use to get the TF-IDF score for all words at once.

tfidf_all <- Federalist |>

  unnest_tokens(word, text)|>
  
  count(word, author) |>
  
  # NEW!! Create the TF IDF Score
  bind_tf_idf(word, author, n)

Question 13

Annotate the code above!

Hint: I know the last line of code is new. To see what it does, I recommend running the code with and without this final line so you can see what it does! The structure of the bind_tf_idf code requires (1) each word, (2) the groups, and (3) the number of times each word appears in each group (n).

When you are done running the code above, you have a data set called tfidf_all with 14505 rows and 6 columns. Each row is one word paired with one author who used it, and for each row we are given (1) the frequency of that word for that author (n), (2) the TF score, (3) the IDF score, and (4) the TF-IDF score of that word.
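If you want to check what bind_tf_idf is doing, here is a sketch on a tiny made-up data set with only two groups. The names toy_counts, doc, and by_hand are hypothetical. The by-hand version assumes the IDF is the natural log of (number of groups) divided by (number of groups that use the word), which is how I understand tidytext to compute it; check this against the definition from class.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy counts: two groups ("A" and "B"), three distinct words
toy_counts <- tibble(
  word = c("union", "states", "union", "treaty"),
  doc  = c("A", "A", "B", "B"),
  n    = c(3, 1, 2, 2)
)

# Let tidytext add the tf, idf, and tf_idf columns
with_tfidf <- toy_counts |>
  bind_tf_idf(word, doc, n)

# The same calculation by hand, under the assumption stated above
n_docs <- n_distinct(toy_counts$doc)

by_hand <- toy_counts |>
  group_by(doc) |>
  mutate(tf = n / sum(n)) |>                   # share of the group's words
  group_by(word) |>
  mutate(idf = log(n_docs / n_distinct(doc))) |>  # rarer across groups = larger
  ungroup() |>
  mutate(tf_idf = tf * idf)

# TRUE if the two sets of TF-IDF scores agree
all.equal(with_tfidf$tf_idf, by_hand$tf_idf)
```

Notice that "union" is used by both toy groups, so its TF-IDF is 0 in both, even though it is the most frequent word in group A.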

Question 14

Open up tfidf_all. What do you notice is unusual about the first few words that are listed?

This data set, like many in text, requires some cleaning before we proceed. To handle the issue in Question 14, we add a line of code we have seen before:

tfidf_all <- Federalist |>

  unnest_tokens(word, text)|>
  
  filter(!grepl('[0-9]', word)) |>
  
  count(word, author) |>

  bind_tf_idf(word, author, n)

Question 15

Open up tfidf_all again. What do you notice is unusual now about the first few words that are listed?

This is a new one for us - weird punctuation. We have to deal with this a lot in text data, but luckily it can be handled with one additional line of code:

tfidf_all <- Federalist |>

  mutate(text = str_remove_all(text, "_")) |>

  unnest_tokens(word, text)|>
  
  filter(!grepl('[0-9]', word)) |>
  
  count(word, author) |>
  
  bind_tf_idf(word, author, n)

Question 16

Open up tfidf_all again. Do the first few words listed look okay now, meaning they are words without numbers or symbols?

Most of the time when we have cleaning issues in text, we discover them just as we did today, meaning during the course of an analysis. Text can be very long, and it is difficult to anticipate all the cleaning that might need to be done. This means our job is to look at the data as we go and keep our eyes out for anything that might look odd so we can handle it. There is no short cut for this - we just have to be vigilant.
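To see what the two cleaning lines do on their own, here is a small sketch with made-up inputs. The vector toy_tokens and the sentence toy_sentence are invented for illustration.

```r
library(stringr)

# Hypothetical tokens, some containing digits
toy_tokens <- c("1787", "union", "2d", "states")

# grepl('[0-9]', ...) is TRUE for any token containing a digit,
# so negating it with ! keeps only the digit-free tokens
clean_tokens <- toy_tokens[!grepl("[0-9]", toy_tokens)]
clean_tokens

# str_remove_all() deletes every underscore but leaves the words alone
toy_sentence <- "the _people_ of the states"
clean_sentence <- str_remove_all(toy_sentence, "_")
clean_sentence
```

Running small checks like this on a handful of strings is a quick way to confirm a cleaning step does what you expect before you apply it to all of the papers.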

The Top Few Words

At this point, we have the TF-IDF score for every unique word in the Federalist papers. However, we know from our discussion of BOW models that we do not want to use every single word as a feature. Instead, we want to choose words with high TF-IDF scores.

If we only want to find the top 4 words in terms of TF-IDF for each author, we use code very similar to what we used to find the top 10 words in terms of frequency:

tfidf_top4 <- Federalist |>

  mutate(text = str_remove_all(text, "_")) |>

  unnest_tokens(word, text)|>
  
  filter(!grepl('[0-9]', word)) |>
  
  count(word, author) |>
  
  bind_tf_idf(word, document = author, n) |>
  
  group_by(author) |>
  
  slice_max( tf_idf, n = 4)

The only change we have made to this code is in the slice_max part. Usually, we have slice_max( n , n = 4). This is because the column n in the data set holds the counts, meaning the number of times each word occurs in a group. If we want the top 4 words in terms of count, we want the top 4 (n=4) in the column n. However, now we want the top 4 words in terms of TF-IDF. If we want the top 4 words in terms of TF-IDF, we want the top 4 (n=4) in the column tf_idf.
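Here is a sketch of that difference on a made-up data set, along with one thing to watch for: ties. The data frame toy_scores and its values are hypothetical. By default slice_max keeps tied rows, so a group can come back with more rows than you asked for; with_ties = FALSE is shown as one possible way to force an exact number, not as something the lab requires.

```r
library(dplyr)

# Hypothetical toy scores for two groups
toy_scores <- tibble(
  group  = c("A", "A", "A", "B", "B", "B"),
  word   = c("w1", "w2", "w3", "w4", "w5", "w6"),
  n      = c(10, 7, 3, 4, 9, 8),
  tf_idf = c(0.01, 0.05, 0.03, 0.04, 0.02, 0.02)
)

# Top 2 per group by count: order by the column n
top_by_count <- toy_scores |>
  group_by(group) |>
  slice_max(n, n = 2)

# Top 2 per group by TF-IDF: order by the column tf_idf
# Group B has a tie for second place, so it returns 3 rows
top_by_tfidf <- toy_scores |>
  group_by(group) |>
  slice_max(tf_idf, n = 2)

# Same, but breaking ties so each group returns exactly 2 rows
top_by_tfidf_exact <- toy_scores |>
  group_by(group) |>
  slice_max(tf_idf, n = 2, with_ties = FALSE)

nrow(top_by_tfidf)
nrow(top_by_tfidf_exact)
```

It is worth checking the number of rows in your own result, since any later code that assumes exactly 4 words per author depends on it.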

Question 17

Look at the top 4 words in terms of TF-IDF for Madison. There should be one word in that list that is surprising. What is it?

Another cleaning issue!! The authors’ names show up in the text of the papers with known authors, but the mystery papers were written anonymously, so these names are not useful features.

tfidf_top4 <- Federalist |>

  # Remove the word Hamilton (all in caps) from the text
  mutate(text = removeWords(text,"HAMILTON"))|>
  
  # Remove the word Madison (all in caps) from the text
  mutate(text = removeWords(text,"MADISON"))|>
  
  # Remove the word Jay (all in caps) from the text 
  mutate(text = removeWords(text,"JAY"))|>
  
  mutate(text = str_remove_all(text, "_")) |>
  
  unnest_tokens(word, text)|>
  
  filter(!grepl('[0-9]', word)) |>
  
  count(word, author, sort = TRUE) |>
  
  bind_tf_idf(word, document = author, n) |>
  
  group_by(author) |>
  
  slice_max(tf_idf , n = 4 )

Wow this code is getting long!! As a note, you can store pieces of this code as you go, and then you don’t have to run the whole long thing every time. However, because we keep tweaking the code as we find new cleaning issues, for me it’s sometimes easier to just work with the whole code so I know where to make adjustments as needed.
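If you are curious how removeWords behaves, here is a small sketch on a made-up sentence. The string toy_text is invented for illustration. The matching is case sensitive, which is why this step has to happen before unnest_tokens converts everything to lower case.

```r
library(tm)

# Hypothetical sentence containing the name in two different cases
toy_text <- "HAMILTON argues that Hamilton was right"

# Only the all-caps version is removed
cleaned_one <- removeWords(toy_text, "HAMILTON")
cleaned_one

# removeWords() also accepts several words at once, which is one way
# to shorten three separate mutate() lines into one
cleaned_all <- removeWords("MADISON and JAY and HAMILTON",
                           c("HAMILTON", "MADISON", "JAY"))
cleaned_all
```

The single-call version is just an alternative you could try. The three separate lines in the lab code do the same job.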

Question 18

After all our cleaning steps, create a plot showing the top 4 words for Hamilton, Madison, and Jay in terms of the TF-IDF using tfidf_top4 above.

Hint: You can use the code you already used to create a plot in Questions 3 and 4 for this! The only difference is that we no longer want words with the highest count (n); we want words with the highest tf_idf. Let me know if you get stuck!

At this point, we have words that we can use as features in our model! Let’s give it a try.

The Anonymous Papers

To load the 12 anonymous papers, you can use the code below:

test <- read.csv("https://www.dropbox.com/scl/fi/yrfzwwn7olyaeq3mhb0bd/test.csv?rlkey=c6bs0ytyoomkq34aw7rx3bc62&st=tzvp12cp&dl=1")

Recall that Hamilton claimed to have written all of these papers. To determine whether or not this seems true, we are going to create features for our Naive Bayes model using the words in tfidf_top4. The words that we want to use are all stored in the first column of tfidf_top4. We need to create indicators for these features for both the training (Federalist) and test (test) data sets.

This is the code you need. You are not responsible for this code on exams!!!

# Create the features in the training data 
for( i in 1:12){
  Federalist[,i+3] <- grepl(paste("\\b",tfidf_top4$word[i] , "\\b", sep =""), tolower(Federalist$text))
  colnames(Federalist)[i+3] <- tfidf_top4$word[i]
}

# Create the features in the test data 

for( i in 1:12){
  test[,i+2] <- grepl(paste("\\b",tfidf_top4$word[i] , "\\b", sep=""), tolower(test$text))
  colnames(test)[i+2] <- tfidf_top4$word[i]
}

When you run the code above, Federalist should have 85 rows and 15 columns, and test should have 12 rows and 14 columns.
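You are not responsible for the loops, but if you want to see what the grepl piece is doing, here is a sketch on made-up text. The vector toy_docs and the word "war" are invented for illustration. The "\\b" on each side marks a word boundary, so the pattern only matches the whole word.

```r
# Hypothetical toy documents
toy_docs <- c("The war began", "A warning was issued", "WAR and peace")

# Build the pattern the same way the loop does: \\b + word + \\b
toy_pattern <- paste("\\b", "war", "\\b", sep = "")

# TRUE when the whole word appears in the lower-cased text
has_war <- grepl(toy_pattern, tolower(toy_docs))
has_war
```

The second document contains "warning" but not the whole word "war", so it comes back FALSE, and the third matches only because tolower converts "WAR" first.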

Question 19

Train a Naive Bayes model using the Federalist data and the new features we have created. You don’t have to show me anything for this part. Note: When you train Naive Bayes, make sure the only columns you include are actually features!! The model won’t work if you forget to remove the columns you do not need.
To see how well our model fits the data, make predictions on the same Federalist data you use for training. As the answer to this question, use the code below to output a confusion matrix comparing the true authors in the Federalist data to the predictions from Naive Bayes.

holder <- table( yhat, Federalist$author)
rownames(holder) <- c("Predicted Hamilton", "Predicted Jay", "Predicted Madison")
colnames(holder) <- c("True Hamilton", "True Jay", "True Madison")

knitr::kable(holder)
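As a reminder of the general workflow, here is a sketch of training Naive Bayes on TRUE/FALSE features and tabulating predictions against the truth, using a tiny made-up data set. The data frame toy_train, the feature names has_upon and has_whilst, and the choice of laplace = 1 are all assumptions for illustration and are not the lab's features or settings.

```r
library(naivebayes)

# Hypothetical training data: a label plus two TRUE/FALSE word indicators
toy_train <- data.frame(
  label      = factor(c("A", "A", "A", "A", "B", "B", "B")),
  has_upon   = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  has_whilst = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# Train using only the feature columns as predictors
# (laplace = 1 is an assumed smoothing choice to avoid zero probabilities)
toy_model <- naive_bayes(label ~ has_upon + has_whilst,
                         data = toy_train, laplace = 1)

# Predict on the same rows, passing only the feature columns
toy_yhat <- predict(toy_model,
                    newdata = toy_train[, c("has_upon", "has_whilst")],
                    type = "class")

# Rows are predictions, columns are the true labels
toy_confusion <- table(Predicted = toy_yhat, True = toy_train$label)
toy_confusion
```

Counts on the diagonal of the table are rows the model got right, and anything off the diagonal is a mistake, which is the same way to read the confusion matrix for the Federalist data.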

Question 20

The point of looking at predictions on training data is similar to computing an \(R^2\) for a linear regression model. It can help us see whether the model is doing a good job reflecting what is going on in the data. In other words, it can help us see whether the model fits the data well.

Use the confusion matrix from Question 19 to explain to your historian client how reliable the predictions for author seem to be using Naive Bayes. Hint: This means things like talking about where the model is predicting well or not predicting well, etc.

At this point, we are ready to do what we set out to do in the beginning - make predictions to see who we think wrote the anonymous papers in test.

Question 21

Using the Naive Bayes model you have already trained, make predictions on the test data. Print out these 12 predictions as the answer to this question.

Question 22

Recall that Hamilton claimed to have written all 12 test papers. Based on your model, our historian client wants us to say whether or not we think this claim is true. State and justify your conclusion for your client.

This is what we can do with the features we have so far, but there are a few things to think about. We could have used more than 4 words as features - could this have changed our predictions? Are there other features we could create with text, like length of papers or writing style features? What about emotional context or tone?? All of these are things we can now consider as we have practiced our modeling foundations, and creating new features and seeing what we can do with them is our next step!

References

Data

The data come from https://github.com/nicholasjhorton/FederalistPapers, the GitHub repository of Dr. Nicholas J Horton. Citation: Horton, Nicholas J. Federalist Papers, Retrieved July 20, 2024 from https://github.com/nicholasjhorton/FederalistPapers.

Code

The code was adapted from Chapter 3 of “Text Mining with R: A Tidy Approach”, written by Julia Silge and David Robinson. The book was last built on 2024-06-20.

Activity

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2026 February 2.