STA 279 Lab 2

Complete all Questions.

Goals

In our last class, we saw how we can analyze text data from lyrics. Today, we are going to practice that code and process on our headlines dataset from Lab 1.

Data

We will continue working with the same data on \(n=2000\) article titles from our last lab. To load the data, copy and paste the code below into a chunk and press play.

headlines <- read.csv("https://www.dropbox.com/scl/fi/r9p76t3v8aluz2jfypy6u/headlines.csv?rlkey=pi5rpu21xkwjw8qm7bofkrrej&st=jhc4e0ad&dl=1")

Recall that the columns are:

title: the title of the article
clickbait: a human generated indicator for whether or not an article is clickbait; FALSE means the article is not clickbait while TRUE means the article is clickbait.
ids: a numeric variable assigned to each article; think of this like an article identifier.

Once you have loaded the data, load the packages you will need for this lab:

library(tidytext)
library(tidyr)
library(dplyr)
library(ggplot2)
# New!! 
library(tm)
library(forcats)

As a reminder, any time R tells you a package cannot be found, it means the package needs to be installed. You can do this from the Tools drop down menu at the top of your R screen. Let me know if you have any questions!

Clickbait Only: Data Cleaning

We are first going to do some EDA on the clickbait articles. In other words, we want to explore the titles that belong to clickbait articles to see if we can describe traits of these titles. This means that our first step today is to create a dataset that only contains clickbait titles, as headlines has both clickbait and non-clickbait articles.

To create a dataset with only clickbait articles, we want to filter our dataset to contain only the rows in headlines that belong to clickbait titles. We will use the command filter to help us do this.

clickbait <- headlines |>
  filter(clickbait == "TRUE")

The filter command keeps only the rows in the dataset that meet the condition inside of the parentheses. In this case, we keep only the clickbait titles, which means only the ones with TRUE in the clickbait column.
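To see this row-keeping behavior on a small scale, here is a minimal sketch. The data frame toy_titles and its contents are made up for illustration and are not part of the lab data.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_titles <- data.frame(
  title = c("Ten Cats You Need to See",
            "Senate Passes Budget Bill",
            "You Won't Believe This Trick"),
  clickbait = c(TRUE, FALSE, TRUE)
)

# Keep only the rows that meet the condition inside the parentheses
toy_clickbait <- toy_titles |>
  filter(clickbait == "TRUE")

nrow(toy_titles)     # 3 rows before filtering
nrow(toy_clickbait)  # 2 rows after filtering
ncol(toy_clickbait)  # still 2 columns: filter changes rows, not columns
```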

Question 1

Create a dataset called notclickbait with only the non-clickbait titles. As an answer to this question, state how many rows are in the dataset you create.

Question 2

We have used the command select in this course already. What is the difference between select and filter?

Hint: If you get stuck, look at our last lab and find one of the places we used select. Compare that usage to the use of filter we just used above.

We now have a dataset called clickbait which contains only clickbait titles. The first thing we typically do with this dataset is to tokenize the article titles.

Question 3

Tokenize the titles in the clickbait dataset, keeping hyphenated words together. Recall from the last lab that we store the tokenized titles under the name tidy_clickbait.

As an answer to this question, state how many rows are in tidy_clickbait.

Clickbait Only: Most Frequent Words

Now that we have tokenized the titles, let’s see if we can determine which words occur most frequently in the clickbait titles. Here is where the fact that tokenizing data in R also results in all words being converted to lowercase is a good thing. Without this, R would count “Today” and “today” as different words. Because we convert all words to lower case, this issue is avoided!
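As a small illustration of the lowercase point, the sketch below tokenizes two made-up titles (toy_titles is hypothetical, not the lab data) and counts the words. It assumes the same token = "regex" setting used in this lab.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy titles, invented for illustration only
toy_titles <- data.frame(title = c("Today Is the Day", "What to do today"))

toy_counts <- toy_titles |>
  # Break each title into one word per row (words are lowercased by default)
  unnest_tokens(word, title, token = "regex") |>
  # Count how many times each word appears
  count(word, sort = TRUE)

toy_counts
```

Because both "Today" and "today" become "today" during tokenizing, they are counted together as one word that appears twice.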

In order to find out which words appear most frequently, we need to count the number of times each word appears. We can count the number of times each word appears in the clickbait titles using this code:

clickbait_count <- headlines |>
  filter(clickbait == "TRUE") |> 
  unnest_tokens(word, title, token = "regex") |>
  count(word, sort = TRUE)

Question 4

Annotate the code above. In other words, add in comments using # that explain briefly what each line of code does.

This code will create a data frame with 2 columns. The first column (word) tells us one unique word that occurs in clickbait titles. The second column (n) is the number of times that word appears in our clickbait titles.

Question 5

How many unique words are there in total in the clickbait titles?

Hint: You do not need to write any code to figure this out!

Question 6

Which word appears the most often in the clickbait titles, and how often does this word appear?

Stop Words

You will likely notice something about the words that have the highest counts in the clickbait titles. We see words like “the” and “to”. These are all very common words in the English language, but we were hoping to find some words that could distinguish clickbait titles. For this goal, words like “the” are not helpful to us.

We call words that are needed for grammatical reasons, but that do not add to the content, stop words. You will find that we use a lot of stop words in English, including “the”, “it”, “a”, “as”, “an”, and so on.

If you are interested in what other words are considered stop words, you can use the following code to explore:

# Load a list of stop words 
data("stop_words")

# Look at the first 5 words
stop_words[1:5,"word"]

Question 7

How many words are there in the stop_words dataset?

Let’s take a look at the 13th title in the dataset:

headlines[13, "title"]

## [1] "Judge Guilty in Kickbacks Is Accused of Fixing Suit"

Question 8

Which words do you think could be removed from the title without losing the key content? In other words, which words do you think are stop words?

Since stop words do not actually add to the content, we often consider removing stop words before we conduct any analysis on text data. As we have already seen, if we do not remove stop words they often dominate the list of most frequent words, making it difficult to see which words might be useful features to differentiate clickbait and non-clickbait titles.

To (1) tokenize the 13th title and (2) remove the stop words, we use the following:

# Start with the 13th title
headlines[13,] |> 

  # Break the title into words 
  unnest_tokens(word, title, token = "regex") |> # and then

  # Remove all stop words
  anti_join(stop_words, by = "word")

Question 9

Why do you think we remove the stop words as the last step in this code? In other words, why does it come after tokenizing rather than before?

Question 10

When we counted the most frequent words in clickbait titles, we used:

clickbait_count <- headlines |>
  filter(clickbait == "TRUE") |> 
  unnest_tokens(word, title, token = "regex") |>
  count(word, sort = TRUE)

Adapt the code above to remove the stop words. As the answer to this question, state how many unique non-stop words are in the clickbait titles.

Considering Pronouns

In our first class, we talked about this data set and considered some key features that might help us distinguish clickbait titles from non-clickbait titles. One of the things we noticed was that pronouns like “you” were potentially helpful to look at.

However…when we remove stop words the way we have, we also remove pronouns. For our lab today, that’s probably something we do not want.

To tell R to remove all stop words EXCEPT the word “you”, we can remove “you” from the list of stop words using the following code:

stop_words <- subset(stop_words, word != "you")

We can then use anti_join(stop_words) like usual to remove the stop words, while keeping “you”.
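To see why this works, here is a minimal sketch on made-up data. Both toy_stop_words and toy_tokens are hypothetical stand-ins, used so the example is small enough to check by eye; the real stop_words table is much larger.

```r
library(dplyr)

# Hypothetical mini stop-word list, standing in for the real stop_words table
toy_stop_words <- data.frame(word = c("the", "you", "to"))

# Hypothetical tokens, as they might look after tokenizing one title
toy_tokens <- data.frame(word = c("you", "need", "to", "see", "the", "list"))

# Take "you" off the stop list, so it is no longer treated as a stop word
toy_stop_words <- subset(toy_stop_words, word != "you")

# anti_join keeps the rows of toy_tokens that have NO match in toy_stop_words
kept <- toy_tokens |>
  anti_join(toy_stop_words, by = "word")

kept$word
```

In this toy case, "to" and "the" are dropped, while "you" stays because it is no longer on the list.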

Question 11

Adapt the line of code above to remove “we” from the list of stop words. Show your code.

Question 12

When we counted the most frequent words in clickbait titles, we used:

clickbait_count <- headlines |>
  filter(clickbait == "TRUE") |> 
  unnest_tokens(word, title, token = "regex") |>
  count(word, sort = TRUE)

Adapt the code above to remove the stop words EXCEPT “you” and “we”. As the answer to this question, state how many unique non-stop words are in the clickbait titles.

Visualizing the Most Frequent Words

Before we got off on our tangent about stop words, we were interested in looking at the most frequent words in clickbait articles (excluding stop words!). We are going to visualize the most frequent words in clickbait articles using (1) bar plots and (2) word clouds.

Bar Plots

A bar plot (or bar graph) is a familiar graph, and it turns out it is very useful when we want to visualize the most frequent words in a piece of text. Bar plots allow us to look at a categorical variable (in this case a word) and plot how many times that variable (word) appears in a data set.

To create plots in R, we will use the ggplot2 package. This allows us to make very pretty, very professional plots that we can customize.

The basic structure of a bar plot is

ggplot( dataset, aes( x = count, y = what we are counting) ) + 
    geom_col( fill = "color you want")

Question 13

Adapt the code above to plot the number of times each non-stop word appears in the clickbait titles, making the bars a color (like “blue”, “purple”, “green”, etc.)

NOTE: The plot will NOT be pretty! That’s okay, we’ll make it better!!

Question 14

The plot we just created has several issues. Tell me at least two of them!

Okay, so that’s a plot…but it’s not useful. There are a few issues with it, and we can adapt it to make it better.

One problem is that there are just too many words to be shown effectively on a plot. We don’t really want to see every word - our goal is to see words that characterize clickbait, which means we want to see words that occur frequently. One way to do this is to visualize the top 20, 15, 10, etc., most frequent words, rather than all the words.

To find the 15 words that appear most often in clickbait titles, we slice our data to keep only the top 15 most frequent words. We use the slice_max command to do this.

  # Choose only the top 15 most frequent words
  slice_max(n, n=15)

Here, it gets a little confusing because n is the name of the column that holds the count. However, the function also uses n = 15 to communicate how many words we want to choose, which in this case is 15.

As a side note, basically whenever you use the count function, the result will be in a column called n. Because we use n for a lot of things in statistics, you can always rename the column if you’d like!
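If the two uses of n feel confusing, one option is to give the count column a different name. The sketch below does this on made-up data (toy_tokens and the column name freq are hypothetical choices for illustration). It also shows one behavior worth knowing: by default, slice_max keeps ties, so you can get back more rows than you asked for.

```r
library(dplyr)

# Hypothetical toy tokens, invented for illustration only
toy_tokens <- data.frame(
  word = c("cat", "dog", "cat", "bird", "dog", "cat", "fish")
)

# Count each word, storing the counts in a column called freq instead of n
toy_counts <- toy_tokens |>
  count(word, sort = TRUE, name = "freq")

# Keep the 2 most frequent words; now n = 2 can only mean "how many rows"
top_two <- toy_counts |>
  slice_max(freq, n = 2)

# Ask for 3 words: "bird" and "fish" are tied for third place, so both are kept
top_three <- toy_counts |>
  slice_max(freq, n = 3)

top_two
nrow(top_three)
```

If you want exactly the number of rows you asked for, slice_max also accepts with_ties = FALSE.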

Question 15

Go back to your code from Question 12 and add slice_max(n, n=15) to the end of it. Then, re-run your code from Question 13 and show your new bar plot.
What issues still seem to be occurring in the bar plot?

To fix the final issue, we need to adapt the aes part of the code. Where we have y = word, we need y = fct_reorder(word, n). This code tells R to please order the bars so that the words with the highest frequency (biggest n) are on top.
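Here is a minimal sketch of what fct_reorder does, using a made-up table of counts (toy_counts is hypothetical, not the lab data). It is meant as an illustration of the idea, not as the answer to any question.

```r
library(ggplot2)
library(forcats)

# Hypothetical toy counts, invented for illustration only
toy_counts <- data.frame(
  word = c("apple", "banana", "cherry"),
  n = c(2, 7, 4)
)

# fct_reorder sorts the words by n, from smallest to largest
ordered_levels <- levels(fct_reorder(toy_counts$word, toy_counts$n))
ordered_levels

# ggplot2 draws the first level at the bottom of the y-axis,
# so the word with the biggest n ends up on top
toy_plot <- ggplot(toy_counts, aes(x = n, y = fct_reorder(word, n))) +
  geom_col(fill = "purple") +
  labs(x = "Count", y = "Word")

toy_plot
```

Without fct_reorder, the bars would be placed in alphabetical order by word, which has nothing to do with the counts.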

Question 16

Adapt your code from Question 15 so the bars are ordered from largest count to smallest count. Show your graph!

If you create the bar plot using the code above, you will notice something a little odd. Some of our top “words” in the clickbait article titles are numbers! Does this happen with the non-clickbait titles??

Question 17

Create a bar graph for the top 15 words in the non-clickbait titles. Color the bars anything other than white, gray, black, or blue, and make sure to check your labels!!

From our analysis so far, we can see that having numbers in the title might be a good feature to use to identify whether or not a title is clickbait. Great!!

However, what if we would actually like to look at the top words (not numbers) in the titles? If we want to remove the numbers, we can add a line to our process of tokenizing:

# Start with the headlines dataset 
tidy_clickbaitonly <- headlines |>
  # Choose only the clickbait titles
  filter(clickbait == "TRUE") |>
  # Tokenize the titles 
  unnest_tokens(word, title, token = "regex") |>
  # Remove the stop words
  anti_join(stop_words) |>
  # Remove the numbers 
  filter(!grepl('[0-9]', word))

Adding all the comments with # as I have done in the code above is called annotating your code. This leaves notes for yourself and for others that explain clearly what each line of the code is doing. When we write professional code, it is generally annotated.
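To get a feel for what the filter(!grepl('[0-9]', word)) step removes, here is a small sketch on made-up tokens (toy_words is hypothetical). Notice that the pattern matches any token that contains a digit anywhere, not only tokens that are entirely numbers.

```r
# Hypothetical tokens, invented for illustration only
toy_words <- c("21", "things", "90s", "top-10", "you")

# TRUE for every token that contains at least one digit
has_digit <- grepl("[0-9]", toy_words)
has_digit

# The ! flips TRUE and FALSE, so we keep only the tokens with no digits
toy_words[!has_digit]
```

In this toy case, "90s" and "top-10" are removed along with "21", which may or may not be what you want in a given analysis.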

Question 18

Create a new bar plot of the top 10 words in the clickbait titles, after the numbers have been removed. Show the plot.

Question 19

Based on the bar graphs after removing numbers, comment on the difference between the words that seem to appear in clickbait titles versus non-clickbait titles.

Word Clouds

Another way we can visualize the most frequent words is by using a word cloud. A word cloud is a type of plot specific to text data that will show the most popular words in a dataset. The larger the size of a word in the plot, the more frequently that word appears in the dataset. This allows us to quickly compare the top few words.

Let’s start off by creating a word cloud with only the top 10 words in the clickbait dataset.

# Load the library 
library(wordcloud)
set.seed(279)

# Make the plot 
tidy_clickbaitonly |>
  count(word, sort = TRUE) |>
  with(wordcloud(word, n, max.words = 10))

The max.words part of the code specifies how many words will appear in the word cloud. If you want the top 15 words, for example, you use max.words = 15.

Question 20

Create a word cloud for the top 20 words in the non-clickbait titles (after removing stop words).

As a side note, you can format your word clouds with different color palettes. For example:

clickbait_count |>
  with(wordcloud(word, n, max.words = 10, random.order=FALSE, colors=brewer.pal(8, "Dark2")))

References

Data

The dataset used in this lab is the sample_headlines dataset downloaded from https://github.com/nicholasjhorton/textclassificationexamples/tree/master. Citation: Horton, Nicholas J. Text Classification Examples, Retrieved July 20, 2024 from https://github.com/nicholasjhorton/textclassificationexamples/tree/master.

Code

The code used in this lab is adapted from https://www.tidytextmining.com/ for STA 279. This adaptation is not endorsed by the original authors. Citation: Silge, Julia and Robinson, David. Text Mining with R: A Tidy Approach, Last built on 2024-08-13. Source: https://www.tidytextmining.com/