STA 279 Lab 3
Complete all Questions.
The Goal
In our last class, we talked about TF-IDF, a tool we can use to figure out what words characterize certain types of text. Today, we are going to see how this tool can be used in practice.
The Data Set
The Federalist papers are a famous series of articles written in the 1700s, in the early days of the United States. They debated and proposed ideas for the new government. Three famous authors of these papers were Alexander Hamilton, James Madison, and John Jay. Most of the papers were known to have been written by Hamilton, Madison, or Jay, but for several papers the author was not known. Hamilton declared at one point that he had written those articles. Since then, many scholars have worked to determine which of the three authors wrote the anonymous papers.
Today, we are provided with the text from the Federalist papers known to have been written by Hamilton, Madison, or Jay. Our goal is to use the text analysis skills we have learned so far to determine what words might differ among the three authors’ papers so we can take a guess at who wrote the anonymous articles.
To read in the data on the \(n = 85\) Federalist papers with known authors, use the following code:
# Load the data
Federalist <- read.csv("https://www.dropbox.com/scl/fi/5hzqwsvlnym5u1mhmn0jy/Federalist.csv?rlkey=55x0p9fl02zls9ixxek64vlqy&st=isjnztof&dl=1")
# Convert to a data frame
Federalist <- data.frame(Federalist)
# Make sure author is treated as categorical
Federalist[, "author"] <- as.factor(Federalist[, "author"])
The columns are:
paper: the number given to the paper; think of this like an identifier for the article.
text: the text of the entire paper.
author: either Hamilton or Madison or Jay.
Once you have loaded the data, load the packages you will need for this lab:
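Based on the functions used later in this lab, the packages below should be what you need (a suggested list; install any you are missing with install.packages() first):

library(tidyverse)   # filter(), count(), mutate(), ggplot(), and friends
library(tidytext)    # unnest_tokens(), stop_words, bind_tf_idf(), reorder_within()
library(tm)          # removeWords(), used later to strip the authors' names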
The Top 10 Words
We know that one way we can determine what words are commonly used in text is to just count how many times each word appears in an author’s papers. Let’s start there.
Suppose we want to find the top 10 words in Madison’s articles after removing stop words. I’ve given you a skeleton of the code you need below:
top10_Madison <- Federalist |>
  filter(author == ...) |>
  unnest_tokens(word, ...) |>
  anti_join(stop_words, by = join_by(word)) |>
  count(...) |>
  slice_max(n, n = ...)
Question 1
Complete the code above by filling in the … and annotate the code.
Note: You will notice I used anti_join(stop_words, by = join_by(word)) rather than anti_join(stop_words). The two versions do exactly the same thing. The only difference is that some of you have noticed you get annoying warnings about Join by: by = join_by(word) when you run anti_join(stop_words). The adaptation in the code above just removes that warning.
Question 2
Create and show a plot of the top 10 words in Madison’s texts. Make sure your plot is well formatted and well labelled.
Hint: Here is a skeleton code to get you started:
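# One possible skeleton; fill in the ...
ggplot(top10_Madison, aes(n, fct_reorder(word, ...))) +
  geom_col() +
  labs(x = ..., y = ..., title = ...)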
Now that we have the top 10 words for Madison, let’s consider Hamilton and Jay. We could make a plot for Hamilton and Jay separately…but we don’t actually have to. Instead of filtering to include only one author, we can group by author so we can find the top 10 words for each author!
A skeleton code for this is included below.
top10_all <- Federalist |>
  unnest_tokens(word, ...) |>
  anti_join(stop_words, by = join_by(word)) |>
  count(...) |>
  group_by(...) |>
  slice_max(n, n = ...)
Question 3
Complete the code above. As the answer to this question, state the most frequent word in (a) Hamilton’s papers and (b) Jay’s papers.
We can then create a plot to compare the authors using the following code.
ggplot(top10_all, aes(n, reorder_within(word, n, author), fill = author)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~author, ncol = 2, scales = "free_y") +
  scale_y_reordered() +
  labs(x = ..., y = ..., title = ...)
You will note two new things in this plotting code.
reorder_within(): We use fct_reorder when we want to make sure that the words appear in order on our bar graph. However, because we now have three different authors, we actually need to make sure that the words appear in order for all authors. reorder_within allows us to include author in our ordering so that happens!

facet_wrap(): This is what allows us to make one plot for each group. In this case, facet_wrap(~author) allows us to make one plot for each author.
Question 4
Finish the code above to create the plot to compare the top 10 words in Madison’s, Hamilton’s, and Jay’s texts. Show the plot.
Question 5
Are there any words that show up in the top 10 list for more than one author? In other words, are there words that show up in more than one of the plots in Question 4?
One of the things we notice in this graph is that several of the top 10 words are the same for all authors or for two of the three.
The goal is for us to determine which words distinguish the writing of the three authors, meaning finding words that tend to be used commonly by one author but not the others. We are trying to figure out who wrote the anonymous papers, so we need to find words that might help us do that.
This means looking at the top 10 words may not be the most useful tool. Instead, we need something else.
TF-IDF
Because our goal is to find words that define the writing of each author, we are going to look at words with the highest TF-IDF score. As we learned in class, the TF-IDF score has two components: the TF score and IDF score. To get the TF-IDF score, we multiply the TF and IDF scores together!
Question 6
What does the TF score measure? In other words, briefly explain what having a high TF score tells us about a word.
Question 7
What does the IDF score measure? In other words, briefly explain what having a high IDF score tells us about a word.
Let’s start with the TF score. Recall that the TF of word \(i\) in document \(d\) is defined as:
\[TF_{i,d} = \frac{\text{Number of times word } i \text{ appears in document } d}{\text{Total number of words in document } d}\]
Question 8
How many documents are we working with today?
Note: This should not be a large number.
Question 9
The word “will” occurs 703 times in Hamilton’s papers. Based on this, what is the TF of the word “will” in Hamilton’s papers?
Hint: You will need to use code to determine the total number of words in Hamilton’s papers!
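If you are not sure where to start, here is one possible skeleton (fill in the ...):

# Total number of words in one author's papers
# (note: no stop word removal here, since the counts above include all words)
Federalist |>
  filter(author == ...) |>
  unnest_tokens(word, text) |>
  nrow()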
Once we have the TF score, the next step is to compute the IDF score.
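As a reminder, the IDF of word \(i\) is computed across documents (this is the natural-log version, which is also what R uses):

\[IDF_{i} = \ln\left(\frac{\text{Total number of documents}}{\text{Number of documents containing word } i}\right)\]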
Question 10
For these data, there are only three possible values for the IDF. State what those numeric values are and in what situations we would use each.
Question 11
The word “will” occurs 247 times in Madison’s papers and 105 times in Jay’s papers. Based on this, what is the IDF of “will”?
With both the TF score and the IDF score ready to be used, we can combine them to create the TF-IDF score!
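In symbols:

\[TFIDF_{i,d} = TF_{i,d} \times IDF_{i}\]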
Question 12
What is the TF-IDF for the word “will” in Hamilton’s papers? Based on this, is “will” going to be a useful word in distinguishing Hamilton’s papers from Madison’s or Jay’s? Explain.
Computing TF-IDF in R
Now that we know how to compute the TF-IDF in general, it would be helpful if we could do this for every word in the papers all at once. Luckily, R has one nice function that we can use to get the TF-IDF.
tfidf_all <- Federalist |>
  unnest_tokens(word, text) |>
  count(word, author) |>
  bind_tf_idf(word, document = author, n)
Question 13
Annotate the code above!
Hint: I know the last line of code is new. To see what it does, I recommend running the code with and without this final line! The structure of the bind_tf_idf code requires (1) each word, (2) the documents, and (3) the number of times each word appears in each document (n).
Great! We now have the TF-IDF for every word in the papers. If we only want the top 5 words in terms of TF-IDF for each author, we use very similar code to what we used for counting.
tfidf_top5 <- Federalist |>
  unnest_tokens(word, text) |>
  count(word, author) |>
  bind_tf_idf(word, document = author, n) |>
  group_by(author) |>
  slice_max(tf_idf, n = 5)
The only change we have made to this code is in the slice_max part. Usually, we have slice_max(n, n = 5). This is because the column n in the data set holds the counts, meaning the number of times each word occurs in a document. If we want the top 5 words in terms of count, we want the top 5 (n = 5) in the column n.

However, now we want the top 5 words in terms of TF-IDF. If we want the top 5 words in terms of TF-IDF, we want the top 5 (n = 5) in the column tf_idf.
Question 14
Look at the top words in terms of TF-IDF for Madison. There should be one word in the list that is surprising. What is it?
It turns out that the words Hamilton, Madison, and Jay actually appear in the text of the articles, because the authors’ names are listed!!! Since these names are not present in the anonymous articles, we want to remove them when finding the TF-IDF.
tfidf_all <- Federalist |>
  # Remove the word Hamilton (all in caps) from the text
  mutate(text = removeWords(text, "HAMILTON")) |>
  # Remove the word Madison (all in caps) from the text
  mutate(text = removeWords(text, "MADISON")) |>
  # Remove the word Jay (all in caps) from the text
  mutate(text = removeWords(text, "JAY")) |>
  unnest_tokens(word, text) |>
  count(word, author, sort = TRUE) |>
  bind_tf_idf(word, document = author, n)
Question 15
Create a plot showing the top 10 words for Hamilton, Madison, and Jay in terms of the TF-IDF, after removing the authors’ names.
Hint: You can use the code you already used to create a plot in Questions 3 and 4 for this! The only difference is that we no longer want words with the highest count (n); we want words with the highest tf_idf. Let me know if you get stuck!
Now we can see the words that distinguish the three authors’ works!! We could analyze them and describe the differences…but that’s not our goal. Our goal is to try to determine which of our three authors likely wrote the Federalist papers that do not have a known author.
The Anonymous Papers
To load the 15 anonymous papers, you can use the code below:
test <- read.csv("https://www.dropbox.com/scl/fi/yrfzwwn7olyaeq3mhb0bd/test.csv?rlkey=c6bs0ytyoomkq34aw7rx3bc62&st=tzvp12cp&dl=1")
Recall that Hamilton claimed to have written all of these papers. To determine whether or not this seems true, we are going to look at all the words we identified as having the highest TF-IDF for the three authors in the graph in Question 15. We are then going to count how many times a high TF-IDF word for each author shows up in the papers in test.
You can do this using the code below…which is a lot, I know. For now, let’s use it rather than trying to really break it down. If you have any questions, feel free to ask, but you are not responsible for this code.
tfidf_top10 <- tfidf_all |>
  group_by(author) |>
  slice_max(tf_idf, n = 10)

test |>
  # Tokenize
  unnest_tokens(word, text) |>
  # Look at only words that were top TF-IDF
  # for the 3 authors
  filter(word %in% tfidf_top10$word) |>
  # Count how many times each of these words appears
  count(word, sort = TRUE) |>
  # Find which author each word is associated with
  right_join(tfidf_top10[, 1:2], by = "word") |>
  # And count!
  group_by(author) |>
  summarize("Total High TFIDF Words in the test papers" = sum(n, na.rm = TRUE))
The output of this code is 3 counts. Each count tells us, in total, how many times the high TF-IDF words for each author show up in the test papers. For example, Hamilton followed by a 56 means that 56 of the words in the test papers are high TF-IDF words for Hamilton.
Question 16
Based on the output of the code above, does it seem likely that Hamilton wrote all 15 of the test papers? Explain.
If we want, we can adapt the code just a little to count the number of high TF-IDF words from Hamilton, Madison, and Jay in each paper. Again, you don’t need to understand this code; you can just use it!
by_paper <- test |>
  unnest_tokens(word, text) |>
  filter(word %in% tfidf_top10$word) |>
  count(word, paper) |>
  right_join(tfidf_top10[, 1:2], by = "word") |>
  group_by(paper, author) |>
  summarize(Total = sum(n, na.rm = TRUE))
Question 17
- Using the code below, create and show a bar graph using by_paper with author on the x axis, Total on the y axis, and a different plot for each paper (Hint: facet_wrap!).
- Based on your plot, which author do you think wrote each paper? Justify your choices.
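Here is one possible skeleton for the plot (fill in the ...):

ggplot(by_paper, aes(x = ..., y = ..., fill = author)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ ...) +
  labs(x = ..., y = ..., title = ...)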
Next Steps
Okay, so now we have seen how to use TF-IDF to determine words that distinguish documents. This can be very helpful for things like determining key differences between different texts. We are now going to transition from the counts of words to the actual meanings of the words. What if instead of looking at counts, we considered what the words actually mean? That is where we are headed next week!
References
Data
The data come from https://github.com/nicholasjhorton/FederalistPapers, the GitHub repository of Dr. Nicholas J Horton. Citation: Horton, Nicholas J. Federalist Papers, Retrieved July 20, 2024 from https://github.com/nicholasjhorton/FederalistPapers.
Code
The code was adapted from Chapter 3 of “Text Mining with R: A Tidy Approach”, written by Julia Silge and David Robinson. The book was last built on 2024-06-20.
Activity
This work, created by Nicole Dalzell, is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 September 10.