STA 279 Lab 3

Complete all Questions.

The Goal

In class, we have been discussing how words can be useful features when attempting to classify text data. We have seen how to use the frequency of a word to choose the top 10 words to use as features. Today, we are going to use TF-IDF.

The Data Set

We will work with the same data on \(n=2000\) headlines from our last lab. (Yes, I know we’ve been working with this a lot! This is because it helps to see how different techniques applied to the same data set can yield different results. You will have a different data set to work with for Lab 4!)

headlines <- read.csv("https://www.dropbox.com/scl/fi/r9p76t3v8aluz2jfypy6u/headlines.csv?rlkey=pi5rpu21xkwjw8qm7bofkrrej&st=jhc4e0ad&dl=1")

The columns are:

title: the title of the article
clickbait: a human-generated indicator for whether or not an article is clickbait.
ids: a numeric variable assigned to each title; think of this like an article identifier.

Once you have loaded the data, load the packages you will need for this lab:

library(tidytext)
library(tidyr)
library(dplyr)
library(ggplot2)
library(tm)
# New!! 
library(forcats)

TF-IDF

We have already determined which words tend to occur most frequently in clickbait and non-clickbait titles using raw counts. However, that relied on (1) removing stop words and numbers and (2) focusing on words with the highest frequency.

Today, we are going to use TF-IDF as a different tool for deciding which words might distinguish these two article types. We want to know, in general, what sorts of words are associated with clickbait titles rather than true news titles, and vice versa. This is different from just finding the words that occur most often; we want to find terms that define these two different title types.

The TF-IDF combines two scores: the TF score and the IDF score.

TF Score

Question 1

What does the TF score measure? In other words, briefly explain in words what the TF score tells us about a word.

Recall that the TF of word \(i\) in document \(d\) is defined as:

\[TF_{i,d} = \frac{\text{Number of times word i appears in document d}}{\text{Total Number of words in document d}}\]

For our purpose today, we have 2 documents. The first document is the collection of all clickbait titles and the second document is the collection of all non-clickbait titles. This means that to find the TF of a word in a clickbait title we need to (1) find the total number of words in the collection of all clickbait titles and (2) count the number of times each word appears in the collection of all clickbait titles.
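As a small made-up illustration (this is not from the headlines data), suppose a document consisted of only the 8 words "you will not believe what you did today". The word "you" appears twice, so under the definition above its TF would be:

\[TF_{\text{you},d} = \frac{2}{8} = 0.25\]

The same arithmetic applies to our clickbait document; the counts are just much larger.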

Question 2

After removing punctuation, but without removing stop words or numbers, tokenize all the clickbait titles. Call your tokenized data set tidy_clickbait. Based on what you have just created, what is the denominator for the TF for d = clickbait?

Question 3

Create an object called clickbait_words that counts the number of times each word in tidy_clickbait appears in the clickbait titles. For your answer to this question, state the word that appears most often in clickbait titles and state how many times this word occurs.

With the two pieces we need, we can now compute the TF.

Question 4

Compute the TF for the word that appears most often in clickbait titles. Show your work.

Great, we can compute the TF! However, we’d really like to be able to compare the TF of all the words at once.

Right now, clickbait_words counts the number of times each word appears, so it holds the numerator we need for the TF. This is stored in a column called n. We are going to use mutate to add on the column holding the total number of words, which is the denominator we need for the TF.

Mutate allows us to add a new column to a data set. In this case, we want to add a column called total to clickbait_words that adds up all the word counts in the column n so we get the total number of words.

clickbait_words <- tidy_clickbait |>
  # Count the number of times each word appears 
  count(word, sort = TRUE)|>
  # Add a column for the total number of words
  mutate(total = sum(n))

The code above adds a 3rd column to clickbait_words which counts the total number of words in the clickbait titles. How does it do this?

Well, the second line of code count(word, sort = TRUE) takes the tokenized clickbait titles and adds a column called n which counts the number of times each word appears in clickbait titles. The third line of code mutate(total = sum(n)) creates a new column, called total, by summing up the n column. This means it counts the total number of words in the clickbait titles!
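To see what these two steps do on something small enough to check by hand, here is a minimal sketch using a made-up set of six tokens. The object names toy_tokens and toy_counts are hypothetical and are not part of the lab data.

```r
library(dplyr)

# Hypothetical toy tokens, one word per row (NOT from the headlines data)
toy_tokens <- data.frame(word = c("you", "will", "love", "this", "you", "you"))

toy_counts <- toy_tokens |>
  # One row per distinct word, with its count in column n
  count(word, sort = TRUE) |>
  # Every row gets the same total: the sum of all the counts
  mutate(total = sum(n))

toy_counts
```

Every row of toy_counts should carry the same value in total, because sum(n) adds up the whole n column rather than working row by row.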

Question 5

Add a 4th column to clickbait_words called TF which computes the TF for each word in the clickbait titles. Show your code.

Question 6

What is the TF for the word “that”?

Question 7

Create a bar graph showing the top 15 words in terms of TF. Show the plot, and make sure you label your axes! Are the top 15 words in terms of TF primarily meaningful words, primarily stop words, or a mix?

Hint: Look at the end of Lab 2 for help with this!

IDF Score

Now that we have the TF, the next step is to compute the IDF.

Question 8

What does the IDF measure? In other words, briefly explain in words what the IDF tells us about a word.

Question 9

For these data, there are only two possible values for the IDF. State what those numerical values are and in what situations we would use each.

With both the TF and the IDF ready to be used, we can combine them to create the TF-IDF score!

TF-IDF

Computing the TF-IDF in R requires a few steps. We’ve seen that computing the TF by hand is fairly straightforward, but computing the IDF is more challenging. It requires that, for each word, we look across all documents and count how many of those documents contain the word. That’s not super straightforward.

Luckily, once we have the two pieces we need to compute the TF, R has a function that we can use to get the TF-IDF!

# Tokenize Headlines 
tidy_headlines <- headlines |>
  mutate(title = removePunctuation(title)) |>
  unnest_tokens(word, title)

headlines_tfidf <- tidy_headlines |>
  # Count the number of times each word appears 
  count(clickbait, word, sort = TRUE) |>
  # Add on the TF-IDF
  bind_tf_idf(word, document = clickbait, n)

The structure of the bind_tf_idf code requires (1) each word, (2) the documents, which for us is clickbait and not clickbait, and (3) the number of times each word appears in each document (n).
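To get a feel for what bind_tf_idf adds, here is a minimal sketch on a made-up table of word counts for three tiny documents. The data and the names toy_counts, toy_tfidf, and doc are hypothetical. The sketch assumes the behavior described in the tidytext documentation, where the IDF is the natural log of the number of documents divided by the number of documents containing the word.

```r
library(tidytext)

# Hypothetical word counts for three tiny documents (a, b, c);
# one row per word-document pair, NOT from the headlines data
toy_counts <- data.frame(
  doc  = c("a", "a", "b", "b", "c", "c"),
  word = c("the", "cat", "the", "dog", "the", "bird"),
  n    = c(2, 1, 1, 2, 3, 1)
)

# Adds three columns: tf, idf, and tf_idf
toy_tfidf <- toy_counts |>
  bind_tf_idf(word, doc, n)

toy_tfidf
```

In this toy table, "the" appears in all three documents, so its tf_idf should be 0 no matter how often it occurs, while a word that appears in only one document keeps a positive score.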

I will note that I have broken the code above into two pieces for clarity of reading the code. However, you are welcome to combine them if you are comfortable doing so! The combined version is shown below for folks who are curious.

headlines_tfidf <- headlines |>
  # Tokenize Headlines 
  mutate(title = removePunctuation(title)) |>
  unnest_tokens(word, title) |> 
  # Count the number of times each word appears 
  count(clickbait, word, sort = TRUE) |>
  # Add on the TF-IDF
  bind_tf_idf(word, document = clickbait, n)

Question 10

Create a plot showing the top 15 words for clickbait titles and the top 15 words for the non-clickbait titles in terms of the TF-IDF.

The skeleton of the code is provided below. Your task is to fill in the pieces with REPLACE listed.

# Get the top 15 words in terms of TFIDF
# in each document 
top15_tfidf <- headlines_tfidf |>
  # Divide into the two documents 
  group_by(  REPLACE  ) |>
  # Choose the top 15 words in terms of 
  # TF-IDF in each document 
  slice_max(  REPLACE    , n = REPLACE   ) |>
  ungroup() |>
  # Reorder the word counts so the highest
  # TF-IDF is on top 
  mutate(word = reorder(word, REPLACE     ))

# Create the plot 
ggplot(top15_tfidf, aes(REPLACE, word, fill = REPLACE)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ REPLACE, ncol = 2, scales = "free") +
  labs(x = "TF-IDF", y = NULL) +
  theme_light()

In this plot, you will notice that there are two numbers, both years: 2008 and 2015. Depending on the application, we could choose to remove these values or keep them in!
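If we did want to drop number tokens, one possible approach is to filter out any token made up entirely of digits. This is only a sketch on made-up tokens; the names toy_words and toy_no_numbers are hypothetical, and the pattern would not catch tokens that mix digits and letters.

```r
library(dplyr)

# Hypothetical tokens, NOT from the headlines data
toy_words <- data.frame(word = c("2008", "election", "2015", "you", "17"))

toy_no_numbers <- toy_words |>
  # Keep only tokens that are not made up entirely of digits
  filter(!grepl("^[0-9]+$", word))

toy_no_numbers
```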

Question 11

Based on the bar graphs, describe to an interested party what seems to distinguish the types of words in clickbait titles from the types of words in non-clickbait titles.

Pronouns

One big thing we should notice from our plots is that the clickbait titles contain pronouns like “you” and “we”, while the non-clickbait titles generally do not. This is something y’all noticed in our first class. So, maybe we can use the number of pronouns as a feature!

Let’s start with only a few pronouns. There are many, and the nice thing about the technique we are going to try is that we can easily add or subtract from this list.

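# A list of *some* (not all) pronouns
# Note: "i" is lowercase because unnest_tokens converts words to lowercase
# by default, so an uppercase "I" would never match a token
pronouns <- c("i", "me", "you", "he", "him", "she", "her",
    "it", "we", "us", "they", "them", "one", "your", "my", "yours", "theirs")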

Once we have specified a list of words we are interested in looking at (pronouns for now), we can count the number of times those specific words occur in clickbait and non-clickbait.

pronoun_counts <- tidy_headlines |>
    # Look at only words that are pronouns
    filter(word %in% pronouns) |>
    # And count how many times each one occurs in
    # clickbait and non-clickbait titles
    count(clickbait, word, sort = TRUE)

Question 12

Look at the table created by the code above. Which pronoun occurs the most in our non-clickbait titles?

Question 13

Create a bar graph comparing the counts of pronouns in clickbait and non-clickbait titles.

Question 14

Are there any pronouns that are more common in non-clickbait titles than clickbait titles?

As a note…it turns out that this is not actually because this pronoun occurs more often. This pronoun is actually an acronym for a country!!

This is something that is tricky with text analysis. If an acronym actually spells out a word, it can be very difficult to distinguish between that acronym and the word, especially as tidytext removes capitalization by default when it tokenizes.

In this case, it is likely that these are not actually pronouns - it is more likely that these are acronyms, so we could remove this pronoun from our list.
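To see the capitalization issue on a small scale, here is a minimal sketch using a made-up title with a different acronym. The title and the names toy_title, lowered, and kept_case are hypothetical. It relies on the to_lower argument of unnest_tokens, which is TRUE by default.

```r
library(tidytext)

# Hypothetical title, NOT from the headlines data
toy_title <- data.frame(title = "WHO warns who is at risk")

# Default behavior: every token is converted to lowercase
lowered <- toy_title |>
  unnest_tokens(word, title)

# Keeping the original capitalization instead
kept_case <- toy_title |>
  unnest_tokens(word, title, to_lower = FALSE)

lowered$word
kept_case$word
```

With the default, the acronym and the ordinary word become the same token. Keeping the case separates them, but it also separates every capitalized word from its lowercase form, so it is a trade-off rather than a fix.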

Question 15

Re-create your bar graph from Question 13, but excluding “us” as a pronoun.

The number of pronouns is an example of a feature we can create that is specific to text data. There are many other such features, such as the number of question words or the number of contractions. We will start exploring these very soon!

Next Steps

Okay, so now we have seen how to use TF-IDF to determine words that distinguish documents. This can be very helpful for things like determining key differences between different texts.

We are now going to transition from the counts of words to the actual meanings of the words. What if instead of looking at counts, we considered what the words actually mean? That is where we are headed next week!

References

Data

The data set used in this lab is the sample_headlines data set downloaded from https://github.com/nicholasjhorton/textclassificationexamples/tree/master. Citation: Horton, Nicholas J. Text Classification Examples. Retrieved July 20, 2024 from https://github.com/nicholasjhorton/textclassificationexamples/tree/master.

Code

The code was adapted from “Text Mining with R: A Tidy Approach”, written by Julia Silge and David Robinson. The book was last built on 2024-06-20.

Activity

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2024 September 17.