STA 279 Lab 3
Complete all Questions.
The Goal
In class, we have been discussing that words can be useful features when attempting to classify text data. We have seen how to use the frequency of a word to choose the top 10 words to use as features. Today, we are going to use TF-IDF.
The Data Set
We will work with the same data on \(n=2000\) headlines from our last lab. (Yes, I know we’ve been working with this a lot! This is because it helps to see how different techniques applied to the same data set can yield different results. You will have a different data set to work with for Lab 4!)
headlines <- read.csv("https://www.dropbox.com/scl/fi/r9p76t3v8aluz2jfypy6u/headlines.csv?rlkey=pi5rpu21xkwjw8qm7bofkrrej&st=jhc4e0ad&dl=1")
The columns are:
title: the title of the article.
clickbait: a human-generated indicator for whether or not an article is clickbait.
ids: a numeric variable assigned to each title; think of this like an article identifier.
Once you have loaded the data, load the packages you will need for this lab:
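A set of packages that covers the functions used in this lab is sketched below. The exact list is an assumption inferred from the function names that appear later (removePunctuation, unnest_tokens, bind_tf_idf, count, mutate, ggplot), so adjust it if your course setup differs.

```r
# Assumed package list, inferred from the functions used in this lab
library(tm)        # removePunctuation()
library(tidytext)  # unnest_tokens(), bind_tf_idf()
library(dplyr)     # count(), mutate(), group_by(), filter(), slice_max()
library(ggplot2)   # ggplot(), geom_col(), facet_wrap()
```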
TF-IDF
We have already determined which words tend to occur most frequently in clickbait and non-clickbait titles using raw counts. However, that relied on (1) removing stop words and numbers and (2) focusing on words with the highest frequency.
Today, we are going to use TF-IDF as a different tool for deciding which words might distinguish these two article types. This means we are curious to know in general what sorts of words are associated with clickbait titles rather than true news titles and vice versa. This is different from just finding the words that occur the most often; we want to find terms that define these two different title types.
TF-IDF combines two scores: the TF score and the IDF score.
TF Score
Question 1
What does the TF score measure? In other words, briefly explain in words what the TF score tells us about a word.
Recall that the TF of word \(i\) in document \(d\) is defined as:
\[TF_{i,d} = \frac{\text{Number of times word i appears in document d}}{\text{Total Number of words in document d}}\]
For our purpose today, we have 2 documents. The first document is the collection of all clickbait titles and the second document is the collection of all non-clickbait titles. This means that to find the TF of a word in a clickbait title we need to (1) find the total number of words in the collection of all clickbait titles and (2) count the number of times each word appears in the collection of all clickbait titles.
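As a small illustration of the definition, here is a minimal base R sketch on a made-up six-word "document". The vector toy_words and its contents are hypothetical and are not taken from the headlines data.

```r
# Hypothetical toy "document": a handful of already-tokenized words
toy_words <- c("you", "will", "love", "you", "and", "you")

# Numerator: number of times each word appears in the document
word_counts <- table(toy_words)

# Denominator: total number of words in the document
total_words <- length(toy_words)

# TF for every word at once
toy_tf <- word_counts / total_words

# TF for "you": 3 occurrences out of 6 words
toy_tf[["you"]]
```

Because every word in the document is counted exactly once in the denominator, the TF values within a single document always add up to 1.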
Question 2
After removing punctuation, but without removing stop words or numbers, tokenize all the clickbait titles. Call your tokenized data set tidy_clickbait. Based on what you have just created, what is the denominator for the TF for \(d\) = clickbait?
Question 3
Create an object called clickbait_words that counts the number of times each word in tidy_clickbait appears in the clickbait titles. For your answer to this question, state the word that appears most often in clickbait titles and state how many times this word occurs.
With the two pieces we need, we can now compute the TF.
Question 4
Compute the TF for the word that appears most often in clickbait titles. Show your work.
Great, we can compute the TF! However, we’d really like to be able to compare the TF of all the words at once.
Right now, clickbait_words counts the number of times each word appears, so it holds the numerator we need for the TF. This is stored in a column called n. We are going to use mutate to add on the column holding the total number of words, which is the denominator we need for the TF.
Mutate allows us to add a new column to a data set. In this case, we want to add a column called total to clickbait_words that adds up all the word counts in the column n so we get the total number of words.
clickbait_words <- tidy_clickbait |>
  # Count the number of times each word appears
  count(word, sort = TRUE) |>
  # Add a column for the total number of words
  mutate(total = sum(n))
The code above adds a 3rd column to clickbait_words which holds the total number of words in the clickbait titles. How does it do this?
Well, the second line of code, count(word, sort = TRUE), takes the tokenized clickbait titles and produces one row per word, with a column called n which counts the number of times each word appears in clickbait titles. The third line of code, mutate(total = sum(n)), creates a new column, called total, by summing up the n column. This means it counts the total number of words in the clickbait titles!
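To see what these two steps do on something small, here is a minimal sketch using five made-up tokens. The toy_tokens data and the name toy_word_counts are hypothetical and are not part of the headlines data.

```r
library(dplyr)

# Hypothetical tokenized data: one row per word, like the output of unnest_tokens()
toy_tokens <- tibble(word = c("you", "will", "you", "love", "you"))

toy_word_counts <- toy_tokens |>
  # One row per distinct word, with its count in column n
  count(word, sort = TRUE) |>
  # Same total repeated on every row: the sum of all counts
  mutate(total = sum(n))

toy_word_counts
```

Every row of total holds the same value (5 here), which is what makes it convenient to divide n by total row by row later on.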
Question 5
Add a 4th column to clickbait_words called TF which computes the TF for each word in the clickbait titles. Show your code.
Question 6
What is the TF for the word “that”?
Question 7
Create a bar graph showing the top 15 words in terms of TF. Show the plot, and make sure you label your axes! Are the top 15 words in terms of TF primarily meaningful words, primarily stop words, or a mix?
Hint: Look at the end of Lab 2 for help with this!
IDF Score
Now that we have the TF, the next step is to compute the IDF.
Question 8
What does the IDF measure? In other words, briefly explain in words what the IDF tells us about a word.
Question 9
For these data, there are only two possible values for the IDF. State what those numerical values are and in what situations we would use each.
With both the TF and the IDF ready to be used, we can combine them to create the TF-IDF score!
TF-IDF
Computing the TF-IDF in R requires a few steps. We’ve seen that computing the TF by hand is fairly straightforward, but computing the IDF is more challenging. It requires that for each word, we look across all documents and count how many documents contain the word. That’s not super straightforward.
Luckily, once we have the two pieces we need to compute the TF, R has a function that we can use to get the TF-IDF!
# Tokenize Headlines
tidy_headlines <- headlines |>
  mutate(title = removePunctuation(title)) |>
  unnest_tokens(word, title)

headlines_tfidf <- tidy_headlines |>
  # Count the number of times each word appears
  count(clickbait, word, sort = TRUE) |>
  # Add on the TF-IDF
  bind_tf_idf(word, document = clickbait, n)
The structure of the bind_tf_idf code requires (1) each word, (2) the documents, which for us is clickbait and not clickbait, and (3) the number of times each word appears in each document (n).
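To make the three inputs concrete, here is a minimal sketch on a made-up table with three tiny documents. The names toy_doc_counts and doc, and all of the counts, are hypothetical. bind_tf_idf adds three columns (tf, idf, and tf_idf), and it computes the IDF with the natural log.

```r
library(dplyr)
library(tidytext)

# Hypothetical counts: one row per (document, word) pair
toy_doc_counts <- tibble(
  doc  = c("a", "a", "b", "b", "c"),
  word = c("cat", "the", "dog", "the", "the"),
  n    = c(1, 3, 2, 2, 4)
)

toy_tfidf <- toy_doc_counts |>
  # (1) the word column, (2) the document column, (3) the count column
  bind_tf_idf(word, doc, n)

toy_tfidf
```

In this toy table, "the" appears in all three documents, so its IDF (and therefore its TF-IDF) is 0, while "cat" and "dog" each appear in only one document and get a positive score.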
I will note that I have broken the code above into two pieces for readability. However, you are welcome to combine them if you are comfortable doing so! The combined version is shown below for folks who are curious.
headlines_tfidf <- headlines |>
  # Tokenize Headlines
  mutate(title = removePunctuation(title)) |>
  unnest_tokens(word, title) |>
  # Count the number of times each word appears
  count(clickbait, word, sort = TRUE) |>
  # Add on the TF-IDF
  bind_tf_idf(word, document = clickbait, n)
Question 10
Create a plot showing the top 15 words for clickbait titles and the top 15 words for the non-clickbait titles in terms of the TF-IDF.
The skeleton of the code is provided below. Your task is to fill in each piece marked REPLACE.
# Get the top 15 words in terms of TF-IDF
# in each document
top15_tfidf <- headlines_tfidf |>
  # Divide into the two documents
  group_by( REPLACE ) |>
  # Choose the top 15 words in terms of
  # TF-IDF in each document
  slice_max( REPLACE , n = REPLACE ) |>
  ungroup() |>
  # Reorder the words so the highest
  # TF-IDF is on top
  mutate(word = reorder(word, REPLACE ))

# Create the plot
ggplot(top15_tfidf, aes(REPLACE, word, fill = REPLACE)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ REPLACE, ncol = 2, scales = "free") +
  labs(x = "TF-IDF", y = NULL) +
  theme_light()
In this plot, you will notice that there are two numbers, both years. These are 2008 and 2015. Depending on the application, we could choose to remove these values or keep them in!
Question 11
Based on the bar graphs, describe to an interested party what seems to distinguish the types of words in clickbait titles from the types of words in non-clickbait titles.
Pronouns
One big thing we should notice from our plots is that the clickbait titles contain pronouns like “you” and “we”, while the non-clickbait titles generally do not. This is something y’all noticed in our first class. So, maybe we can use the number of pronouns as a feature!
Let’s start with only a few pronouns. There are many, and the nice thing about the technique we are going to try is that we can easily add or subtract from this list.
# A list of *some* (not all) pronouns, written in lowercase
# because unnest_tokens() converts every word to lowercase
pronouns <- c("i", "me", "you", "he", "him", "she", "her",
              "it", "we", "us", "they", "them", "one", "your", "my", "yours", "theirs")
Once we have specified a list of words we are interested in looking at (pronouns for now), we can count the number of times those specific words occur in clickbait and non-clickbait.
pronoun_counts <- tidy_headlines |>
  # Look at only words that are pronouns
  filter(word %in% pronouns) |>
  # And count how many times each one occurs in
  # clickbait and in non-clickbait titles
  count(word, clickbait, sort = TRUE)
Question 12
Look at the table created by the code above. Which pronoun occurs the most in our non-clickbait titles?
Question 13
Create a bar graph comparing the counts of pronouns in clickbait and non-clickbait titles.
Question 14
Are there any pronouns that are more common in non-clickbait titles than clickbait titles?
As a note…it turns out that this is not actually because this pronoun occurs more often. This pronoun is actually an acronym for a country!!
This is something that is tricky with text analysis. If an acronym actually spells out a word, it can be very difficult to distinguish between that acronym and the word, especially as tidytext converts all text to lowercase by default.
In this case, it is likely that these are not actually pronouns - it is more likely that these are acronyms, so we could remove this pronoun from our list.
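The lowercasing behavior can be seen on a tiny example. This is a minimal sketch with two made-up titles; the toy_titles data is hypothetical. unnest_tokens has a to_lower argument (TRUE by default) that controls this step.

```r
library(dplyr)
library(tidytext)

# Hypothetical titles: one uses the acronym "US", one uses the pronoun "us"
toy_titles <- tibble(title = c("US Economy Grows", "They Told Us Everything"))

# Default behavior: every token is converted to lowercase
lowered <- toy_titles |>
  unnest_tokens(word, title)

# Keeping the original capitalization instead
kept_case <- toy_titles |>
  unnest_tokens(word, title, to_lower = FALSE)

# With the default, the acronym and the pronoun become the same token
sum(lowered$word == "us")
```

Keeping capitalization is not a complete fix on its own, since titles often capitalize every word, but it shows why the acronym and the pronoun are indistinguishable after the default tokenization.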
Question 15
Re-create your bar graph from Question 13, but excluding “us” as a pronoun.
The number of pronouns is an example of a feature we can create which is specific to text data. There are many other such features, such as the number of question words or the number of contractions. We will start exploring these very soon!
Next Steps
Okay, so now we have seen how to use TF-IDF to determine words that distinguish documents. This can be very helpful for things like determining key differences between different texts.
We are now going to transition from the counts of words to the actual meanings of the words. What if instead of looking at counts, we considered what the words actually mean? That is where we are headed next week!
References
Data
The data set used in this lab is the sample_headlines data set downloaded from https://github.com/nicholasjhorton/textclassificationexamples/tree/master.
Citation: Horton, Nicholas J. Text Classification Examples. Retrieved July 20, 2024 from https://github.com/nicholasjhorton/textclassificationexamples/tree/master.
Code
The code was adapted from “Text Mining with R: A Tidy Approach”, written by Julia Silge and David Robinson. The book was last built on 2024-06-20.
Activity
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2024 September 17.