STA 279 Lab 3

Complete all Questions.

The Goal

In our last class, we talked about TF-IDF, a tool we can use to figure out which words characterize certain types of text. Today, we are going to see how we can apply this tool in practice.

The Data Set

The Federalist papers are a famous series of articles written in the 1700s in the early days of the United States. They debated and proposed ideas for the new government. Three famous authors of these papers were Alexander Hamilton, James Madison, and John Jay. Most of the papers were known to have been written by Hamilton, Madison, or Jay, but for several papers, the author was not known. Hamilton declared at one point that he had written these anonymous articles. Since then, many scholars have worked to determine which of the three authors actually wrote them.

Today, we are provided with the text from the Federalist papers known to have been written by Hamilton, Madison, or Jay. Our goal is to use the text analysis skills we have learned so far to determine what words might differ among the three authors’ papers so we can take a guess at who wrote the anonymous articles.

To read in the data on the \(n = 85\) Federalist papers with known authors, use the following code:

# Load the data
Federalist <- read.csv("https://www.dropbox.com/scl/fi/5hzqwsvlnym5u1mhmn0jy/Federalist.csv?rlkey=55x0p9fl02zls9ixxek64vlqy&st=isjnztof&dl=1")

# Convert to a data frame
Federalist <- data.frame(Federalist)

# Make sure author is treated as categorical
Federalist[, "author"] <- as.factor(Federalist[, "author"])

The columns are:

  • paper: the number given to the paper; think of this like an identifier for the article.
  • text: the text of the entire paper.
  • author: the author of the paper (Hamilton, Madison, or Jay)

Once you have loaded the data, load the packages you will need for this lab:

library(tidytext)
library(tidyr)
library(dplyr)
library(ggplot2)
library(tm)
# New!! 
library(forcats)

The Top 10 Words

We know that one way we can determine what words are commonly used in text is to just count how many times each word appears in an author’s papers. Let’s start there.

Suppose we want to find the top 10 words in Madison’s articles after removing stop words. I’ve given you a skeleton of the code you need below:

top10_Madison <- Federalist |>
  filter(author == ...) |>
  unnest_tokens(word,... ) |>
  anti_join(stop_words, by = join_by(word)) |>
  count( ...   ) |>
  slice_max(n , n = ... )

Question 1

Complete the code above by filling in the … and annotate the code.

Note: You will notice I used anti_join(stop_words, by = join_by(word)) rather than anti_join(stop_words). The two versions do exactly the same thing. The only difference is that some of you have noticed you get an annoying message, Joining with `by = join_by(word)`, when you run anti_join(stop_words). The adaptation in the code above just removes that message.

Question 2

Create and show a plot of the top 10 words in Madison’s texts. Make sure your plot is well formatted and well labelled.

Hint: Here is a skeleton code to get you started:

ggplot( ... , aes(n, fct_reorder( ... , ... ) )) +
  geom_...() +
  labs( x = ... , y = ..., title = ...)

Now that we have the top 10 words for Madison, let’s consider Hamilton and Jay. We could make a plot for Hamilton and Jay separately…but we don’t actually have to. Instead of filtering to include only one author, we can group by author so we can find the top 10 words for each author!

A skeleton code for this is included below.

top10_all <- Federalist |>
  unnest_tokens(word,... ) |>
  anti_join(stop_words, by = join_by(word)) |>
  count( ...   ) |>
  group_by( ... ) |>
  slice_max(n , n = ... )

Question 3

Complete the code above. As the answer to this question, state the most frequent word in (a) Hamilton’s papers and (b) Jay’s papers.

We can then create a plot to compare the authors using the following code.

ggplot(top10_all, aes(n, reorder_within(word, n, author), fill = author)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~author, ncol = 2, scales = "free_y") +
  scale_y_reordered() + 
  labs(x = ..., y = ..., title = ...)

You will note two new things in this plotting code.

  • reorder_within: We use fct_reorder when we want to make sure that the words appear in order on our bar graph. However, because we now have three different authors, we need the words to appear in order within each author’s panel. reorder_within allows us to include author in the ordering so that happens! (See the toy sketch after this list.)

  • facet_wrap(): This is what allows us to make one plot for each group. In this case, facet_wrap(~author) allows us to make one plot for each author.
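
To see what reorder_within() is doing on its own, here is a minimal toy sketch (reorder_within and scale_y_reordered come from the tidytext package, which you loaded above). The data frame toy and its values are made up purely for illustration:

# A tiny made-up data set: word counts within two groups
toy <- data.frame(
  word = c("alpha", "beta", "gamma", "alpha", "delta"),
  n    = c(3, 5, 1, 4, 2),
  grp  = c("A", "A", "A", "B", "B")
)

# reorder_within() sorts the words by n separately within each group,
# and scale_y_reordered() cleans up the y-axis labels afterwards
ggplot(toy, aes(n, reorder_within(word, n, grp))) +
  geom_col() +
  facet_wrap(~grp, scales = "free_y") +
  scale_y_reordered()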

Question 4

Finish the code above to create the plot to compare the top 10 words in Madison’s, Hamilton’s, and Jay’s texts. Show the plot.

Question 5

Are there any words that show up in the top 10 list for more than one author? In other words, are there words that show up in more than one of the plots in Question 4?

One of the things we notice in this graph is that several of the top 10 words are the same for all authors or for two of the three.

The goal is for us to determine which words distinguish the writing of the three authors, meaning we want to find words that tend to be used commonly by one author but not the others. We are trying to figure out who wrote the anonymous papers, so we need to find words that might help us figure that out.

This means looking at the top 10 words may not be the most useful tool. Instead, we need something else.

TF-IDF

Because our goal is to find words that define the writing of each author, we are going to look at words with the highest TF-IDF score. As we learned in class, the TF-IDF score has two components: the TF score and IDF score. To get the TF-IDF score, we multiply the TF and IDF scores together!
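
In symbols, for word \(i\) in document \(d\):

\[\text{TF-IDF}_{i,d} = TF_{i,d} \times IDF_{i}\]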

Question 6

What does the TF score measure? In other words, briefly explain what having a high TF score tells us about a word.

Question 7

What does the IDF score measure? In other words, briefly explain what having a high IDF score tells us about a word.

Let’s start with the TF score. Recall that the TF of word \(i\) in document \(d\) is defined as:

\[TF_{i,d} = \frac{\text{Number of times word i appears in document d}}{\text{Total Number of words in document d}}\]
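
For example, with made-up numbers: if a word appears 5 times in a document that is 1,000 words long, then

\[TF = \frac{5}{1000} = 0.005\]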

Question 8

How many documents are we working with today?

Note: This should not be a large number.

Question 9

The word “will” occurs 703 times in Hamilton’s papers. Based on this, what is the TF of the word “will” in Hamilton’s papers?

Hint: You will need to use code to determine the total number of words in Hamilton’s papers!
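
If you are stuck, here is one possible sketch. It is shown for Madison so it does not give away the answer; adapt the filter for the author you need.

# Count the total number of words in one author's papers
# (we keep stop words here, matching the TF-IDF code later in the lab)
Federalist |>
  filter(author == "Madison") |>
  unnest_tokens(word, text) |>
  nrow()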

Once we have the TF score, the next step is to compute the IDF score.
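
Recall that the IDF of word \(i\) is defined as:

\[IDF_{i} = \ln\left(\frac{\text{Total number of documents}}{\text{Number of documents containing word } i}\right)\]

(Here \(\ln\) is the natural log; this matches what R will compute for us later in the lab.)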

Question 10

For these data, there are only three possible values for the IDF. State what those numeric values are and in what situations we would use each.

Question 11

The word “will” occurs 247 times in Madison’s papers and 105 times in Jay’s papers. Based on this, what is the IDF of “will”?

With both the TF score and the IDF score ready to be used, we can combine them to create the TF-IDF score!

Question 12

What is the TF-IDF for the word “will” in Hamilton’s papers? Based on this, is “will” going to be a useful word in distinguishing Hamilton’s papers from Madison’s or Jay’s? Explain.

Computing TF-IDF in R

Now that we know how to compute the TF-IDF in general, it would be helpful if we could do this for every word in the papers all at once. Luckily, R has a nice function that we can use to get the TF-IDF.

tfidf_all <- Federalist |>
  unnest_tokens(word, text) |>
  count(word, author) |>
  bind_tf_idf(word, document = author, n)

Question 13

Annotate the code above!

Hint: I know the last line of code is new. To see what it does, I recommend running the code with and without this final line so you can see what it does! The structure of the bind_tf_idf code requires (1) each word, (2) the documents, and (3) the number of times each word appears in each document (n).
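
As a quick check after running the code, the result should contain the original columns plus the three new ones appended by bind_tf_idf:

# bind_tf_idf() keeps word, author, and n, and appends tf, idf, and tf_idf
names(tfidf_all)
# [1] "word"   "author" "n"      "tf"     "idf"    "tf_idf"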

Great! We now have the TF-IDF for every word in the papers. If we only want the top 5 words in terms of TF-IDF for each author, we use very similar code to what we used for counting.

tfidf_top5 <- Federalist |>
  unnest_tokens(word, text) |>
  count(word, author) |>
  bind_tf_idf(word, document = author, n) |>
  group_by(author) |>
  slice_max(tf_idf, n = 5)

The only change we have made to this code is in the slice_max part. Usually, we have slice_max(n, n = 5). This is because the column n in the data set holds the counts, meaning the number of times each word occurs in a document. If we want the top 5 words in terms of count, we want the top 5 (n = 5) in the column n.

However, now we want the top 5 words in terms of TF-IDF, so we want the top 5 (n = 5) in the column tf_idf instead.

Question 14

Look at the top words in terms of TF-IDF for Madison. There should be one word in the list that is surprising. What is it?

It turns out that the words Hamilton, Madison, and Jay actually appear in the text of the articles, because the authors’ names are listed!!! Since these names are not present in the anonymous articles, we want to remove them when finding the TF-IDF.

tfidf_all <- Federalist |>
  # Remove the word Hamilton (all in caps) from the text
  mutate(text = removeWords(text, "HAMILTON")) |>
  # Remove the word Madison (all in caps) from the text
  mutate(text = removeWords(text, "MADISON")) |>
  # Remove the word Jay (all in caps) from the text
  mutate(text = removeWords(text, "JAY")) |>
  unnest_tokens(word, text) |>
  count(word, author, sort = TRUE) |>
  bind_tf_idf(word, document = author, n)

Question 15

Create a plot showing the top 10 words for Hamilton, Madison, and Jay in terms of the TF-IDF, after removing the authors’ names.

Hint: You can use the code you already used to create a plot in Questions 3 and 4 for this! The only difference is that we no longer want the words with the highest count (n); we want the words with the highest tf_idf. Let me know if you get stuck!

Now we can see the words that distinguish the three authors’ works!! We could analyze them and describe the differences…but that’s not our goal. Our goal is to try to determine which of our three authors likely wrote the Federalist papers that do not have a known author.

The Anonymous Papers

To load the 15 anonymous papers, you can use the code below:

test <- read.csv("https://www.dropbox.com/scl/fi/yrfzwwn7olyaeq3mhb0bd/test.csv?rlkey=c6bs0ytyoomkq34aw7rx3bc62&st=tzvp12cp&dl=1")
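
As before, you can do a quick sanity check. Assuming each row of test is one paper, you should see 15 rows:

# Quick check: there should be 15 anonymous papers
nrow(test)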

Recall that Hamilton claimed to have written all of these papers. To determine whether or not this seems true, we are going to look at all the words we identified as having the highest TF-IDF for the three authors in the graph in Question 15. We are then going to count how many times a high TF-IDF word for each author shows up in the papers in test.

You can do this using the code below…which is a lot, I know. For now, let’s use it rather than trying to really break it down. If you have any questions, feel free to ask, but you are not responsible for this code.

tfidf_top10 <- tfidf_all |>
  group_by(author) |>
  slice_max(tf_idf, n = 10)

test |>
  # Tokenize 
  unnest_tokens(word, text) |>

  # Look at only words that were top TF-IDF
  # for the 3 authors
  filter(word %in% tfidf_top10$word) |>

  # Count how many times each of these words appears
  count(word, sort = TRUE) |>

  # Find which author each word is associated with 
  right_join(tfidf_top10[, 1:2], by = "word") |>

  # And count!
  group_by(author) |>
  summarize("Total High TFIDF Words in the test papers" = sum(n, na.rm = TRUE))

The output of this code is 3 counts. Each count tells us, in total, how many times the high TF-IDF words for each author show up in the test papers. For example, Hamilton followed by a 56 would mean that 56 words in the test papers are high TF-IDF words for Hamilton.

Question 16

Based on the output in the code above, does it seem likely that Hamilton wrote all 15 of the test papers? Explain.

If we want, we can adapt the code just a little to count the number of high TF-IDF words from Hamilton, Madison, and Jay in each paper. Again, you don’t need to understand this code; you can just use it!

by_paper <- test |>
  unnest_tokens(word, text) |>
  filter(word %in% tfidf_top10$word) |>
  count(word, paper) |>
  right_join(tfidf_top10[, 1:2], by = "word") |>
  group_by(paper, author) |>
  summarize(Total = sum(n, na.rm = TRUE))

Question 17

  1. Using the code below, create and show a bar graph using by_paper with author on the x axis, Total on the y axis, and a different panel for each paper (Hint: facet_wrap!).

ggplot(na.omit(by_paper), aes(author, Total)) + geom_col() + facet_wrap(~paper)

  2. Based on your plot, which author do you think wrote each paper? Justify your choices.

Next Steps

Okay, so now we have seen how to use TF-IDF to determine words that distinguish documents. This can be very helpful for things like determining key differences between different texts. We are now going to transition from the counts of words to the actual meanings of the words. What if instead of looking at counts, we considered what the words actually mean? That is where we are headed next week!

References

Data

The data come from https://github.com/nicholasjhorton/FederalistPapers, the GitHub repository of Dr. Nicholas J Horton. Citation: Horton, Nicholas J. Federalist Papers, Retrieved July 20, 2024 from https://github.com/nicholasjhorton/FederalistPapers.

Code

The code was adapted from Chapter 3 of “Text Mining with R: A Tidy Approach”, written by Julia Silge and David Robinson. The book was last built on 2024-06-20.

Activity

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 September 10.